Living Research Project · Updated May 2026

Human-AI K-12 Evidence Project

A systematic, AI-assisted evidence synthesis of causal research across ten core domains of K-12 education policy — continuously updated as new research emerges. 63 papers cited, 122 in the full bibliography — all rigorously verified against primary source PDFs.

122 Papers Reviewed
63 Directly Cited
10 Research Clusters
1 Replication Study

An Ongoing, Living Research Project

The empirical literature on K-12 education policy is vast, politically salient, and methodologically heterogeneous. This project applies a systematic, multi-agent AI-assisted workflow to synthesize the causal evidence across ten core research clusters — prioritizing randomized controlled trials, natural experiments, and quasi-experimental designs over observational correlations. Every quantitative claim has been verified against primary source PDFs using direct API-based text extraction.

This is not a finished product. The first literature review (10 clusters, 122 papers) and first replication study (Jackson, Johnson & Persico 2016) represent the opening phase of a longer-term project. New clusters, updated syntheses, additional replication studies, and practitioner summaries will be added as the work progresses. Suggestions for papers or corrections to existing claims are welcome via the contact page.

10 Research Clusters

Click any cluster to read the full evidence summary

Cluster 1
Teacher Quality
d = 0.10–0.20 per SD of teacher quality

Teacher quality is the single most important school-based determinant of student achievement.

Cluster 2
Early Childhood
7–12% annual ROI (Perry Preschool)

Intensive early childhood programs show high long-run returns, but modern scaled-up programs face a persistent fadeout problem.

Cluster 3
Class Size
d ≈ 0.22 (STAR, early grades)

Class size reduction produces positive effects in early grades, but is expensive and vulnerable to general equilibrium effects at scale.

Cluster 4
School Funding
+7.25% wages per 10% spending increase (JJP 2016)

Targeted school funding increases improve long-run adult outcomes, especially for low-income students. The debate has shifted from whether money matters to how it is spent.

Cluster 5
School Choice
d ≈ 0.40/year (Boston charters, math)

Urban 'No Excuses' charter schools produce large, replicable gains. Non-urban charters and large-scale voucher programs often show null or negative effects.

Cluster 6
Reading Instruction
d ≈ 0.41–0.43 (systematic phonics vs. whole-language)

Systematic, explicit phonics instruction is the scientifically validated foundation of early reading. The 'Reading Wars' have a clear empirical winner.

Cluster 7
High-Dosage Tutoring
d = 0.37 (pooled average, Nickow et al. 2020)

High-dosage tutoring is one of the most effective and reliably replicable interventions in K-12 education.

Cluster 8
SEL & Non-Cognitive
d = 0.27 (universal SEL, Durlak et al. 2011)

Universal SEL programs reliably improve achievement and behavior. Targeted psychological interventions (grit, growth mindset) show weak effects at scale.

Cluster 9
Out-of-School Factors
30–40% larger income-achievement gap for children born in 2001 vs. 1975 (Reardon 2011)

Family background and neighborhood poverty are primary drivers of educational inequality. High-quality schools can significantly mitigate, but not eliminate, these effects.

Cluster 10
International Systems
N/A (system-level comparative research)

High-performing education systems share selective teacher preparation, equitable funding, and centralized curricula — but translating these features to the US context is deeply challenging.

Evidence at a Glance

Effect sizes across the major K-12 interventions reviewed in this synthesis. Hover any bar for the intervention unit and primary source.

What is Cohen's d? Cohen's d measures the standardized difference between two groups: how many standard deviations apart their average outcomes are. Conventional benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large. In education research, effects above 0.2 are considered meaningful.
[Chart: Cohen's d effect sizes (0.00–0.55) for Systematic Phonics, Urban Charter Schools, High-Dosage Tutoring, Universal SEL Programs, Class Size Reduction, Teacher Quality (1 SD), Summer Reading Programs, Growth Mindset, and Grit Interventions, with the small and medium benchmarks marked.]

Effect sizes are illustrative summaries from the meta-analytic and quasi-experimental literature. Confidence intervals and context dependence matter; do not treat these as precise point estimates.
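For readers who want the statistic behind these bars made concrete, here is a minimal Python sketch that computes Cohen's d from two groups of scores using the pooled standard deviation. The function name and the data are hypothetical illustrations, not drawn from any study in this synthesis.

```python
import math

def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sample variances (Bessel-corrected)
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical example: treatment-group scores vs. control-group scores
treatment = [78, 82, 85, 90, 74, 88]
control = [70, 75, 80, 72, 68, 77]
print(f"d = {cohens_d(treatment, control):.2f}")
```

A d of 0.37, the pooled tutoring estimate above, means the average treated student scores 0.37 standard deviations above the average control student.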

Cross-Cutting Themes

Patterns that emerge consistently across all ten research clusters

The Persistence of Selection Bias

In almost every domain — from teacher value-added to charter schools to early childhood education — observational estimates are routinely found to be biased upward when subjected to rigorous quasi-experimental or experimental stress tests.

The Fadeout Phenomenon

Interventions that produce large short-term gains frequently see those gains fade over time. However, fadeout in test scores does not always preclude long-term benefits in adult outcomes, suggesting that non-cognitive skills may act as a crucial, unmeasured mediator.

Implementation Trumps Intervention

The efficacy of many interventions is highly dependent on implementation fidelity and context. Interventions that scale well — like high-dosage tutoring — typically have highly structured, standardized delivery mechanisms.

Money Matters, But How It Is Spent Matters More

The debate over school funding has shifted from whether resources matter to how they are deployed. Targeted funding for high-need students and evidence-based programs yields the highest returns.

Methodology: Human-AI Collaboration

This synthesis was produced through a structured six-stage workflow combining human judgment with AI assistance. All final results, interpretations, and editorial decisions were reviewed and approved by the human author.

01
Systematic Literature Screening · Elicit

Elicit screened the literature to identify candidate papers across 10 research clusters, prioritizing RCTs, natural experiments, and quasi-experimental designs.

02
Literature Retrieval & Drafting · Manus AI

Manus AI served as the primary research assistant — executing retrieval, initial drafting, and cross-cluster synthesis under continuous human direction.

03
PDF-Based Quantitative Verification · Claude

Every effect size, sample size, and p-value was verified against primary source PDFs using Claude (claude-opus-4) with direct text extraction via PyMuPDF.
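For readers curious about the extraction step, below is a minimal sketch of how text can be pulled from a source PDF with PyMuPDF. The file name is a hypothetical placeholder, and the prompts and pipeline used to pass the extracted text to Claude are not reproduced here.

```python
import fitz  # PyMuPDF

def extract_pdf_text(path: str) -> str:
    """Return the plain text of every page in a PDF, joined with newlines."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

# Hypothetical usage: extract the text an effect-size claim is checked against.
text = extract_pdf_text("primary_source.pdf")  # placeholder file name
print(text[:500])  # inspect the opening of the extracted text
```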

04
Independent Fact-Checking · Perplexity

Perplexity AI independently fact-checked quantitative claims and methodological descriptions, flagging discrepancies for human review.

05
Citation Network Analysis · Semantic Scholar

Semantic Scholar API provided programmatic citation network analysis to identify seminal papers and track citation lineages across clusters.
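As an illustration of what such a query can look like, the sketch below fetches citing papers for a single work from the Semantic Scholar Graph API. The paper identifier is a placeholder, and the field list is one plausible choice rather than the project's actual configuration.

```python
import requests

API = "https://api.semanticscholar.org/graph/v1"

def fetch_citing_papers(paper_id: str, limit: int = 100) -> list:
    """Return papers citing the given work, via the Graph API citations endpoint."""
    resp = requests.get(
        f"{API}/paper/{paper_id}/citations",
        params={"fields": "title,year,citationCount", "limit": limit},
    )
    resp.raise_for_status()
    return [item["citingPaper"] for item in resp.json().get("data", [])]

# Placeholder identifier: the API accepts DOIs in the form "DOI:<doi>", among others.
for paper in fetch_citing_papers("DOI:<paper-doi>")[:5]:
    print(paper.get("year"), paper.get("title"))
```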

06
Final Review & Approval · Human Author

All results, interpretations, and editorial decisions were reviewed and approved by the human author. AI tools were used as assistants and auditors, not autonomous decision-makers.

Stay Updated

Get notified when new research clusters, replication studies, or major revisions are published. No spam — research updates only.

Start Exploring the Evidence

Browse the 10 research clusters, read the JJP replication note, or download the full literature review PDF.