Human-AI K-12
Evidence Project
A systematic, AI-assisted evidence synthesis of causal research across ten core domains of K-12 education policy, continuously updated as new research emerges. 63 papers are cited in the synthesis and 122 appear in the full bibliography, all rigorously verified against primary source PDFs.
An Ongoing, Living Research Project
The empirical literature on K-12 education policy is vast, politically salient, and methodologically heterogeneous. This project applies a systematic, multi-agent AI-assisted workflow to synthesize the causal evidence across ten core research clusters — prioritizing randomized controlled trials, natural experiments, and quasi-experimental designs over observational correlations. Every quantitative claim has been verified against primary source PDFs using direct API-based text extraction.
This is not a finished product. The first literature review (10 clusters, 122 papers) and first replication study (Jackson, Johnson & Persico 2016) represent the opening phase of a longer-term project. New clusters, updated syntheses, additional replication studies, and practitioner summaries will be added as the work progresses. Suggestions for papers or corrections to existing claims are welcome via the contact page.
10 Research Clusters
Click any cluster to read the full evidence summary
Teacher quality is the single most important school-based determinant of student achievement.
Intensive early childhood programs show high long-run returns, but modern scaled-up programs face a persistent fadeout problem.
Class size reduction produces positive effects in early grades, but is expensive and vulnerable to general equilibrium effects at scale.
Targeted school funding increases improve long-run adult outcomes, especially for low-income students. The debate has shifted from whether money matters to how it is spent.
Urban 'No Excuses' charter schools produce large, replicable gains. Non-urban charters and large-scale voucher programs often show null or negative effects.
Systematic, explicit phonics instruction is the scientifically validated foundation of early reading. The 'Reading Wars' have a clear empirical winner.
High-dosage tutoring is one of the most effective and reliably replicable interventions in K-12 education, with an average effect size of d = 0.37.
Universal SEL programs reliably improve achievement and behavior. Targeted psychological interventions (grit, growth mindset) show weak effects at scale.
Family background and neighborhood poverty are primary drivers of educational inequality. High-quality schools can significantly mitigate, but not eliminate, these effects.
High-performing education systems share selective teacher preparation, equitable funding, and centralized curricula — but translating these features to the US context is deeply challenging.
Evidence at a Glance
Effect sizes across the major K-12 interventions reviewed in this synthesis. Hover any bar for the intervention unit and primary source.
Effect sizes are illustrative summaries from meta-analytic and quasi-experimental literature. Confidence intervals and context-dependence matter; do not treat these as precise point estimates.
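For readers unfamiliar with the d scale used in the chart, a standardized effect size (Cohen's d) is the treatment-control difference in mean scores divided by the pooled standard deviation. A minimal illustration in Python; the trial numbers below are hypothetical, chosen to reproduce the d = 0.37 figure cited for tutoring:

```python
import math

def cohens_d(mean_treat, mean_ctrl, sd_treat, sd_ctrl, n_treat, n_ctrl):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_treat - 1) * sd_treat**2 + (n_ctrl - 1) * sd_ctrl**2) \
                 / (n_treat + n_ctrl - 2)
    return (mean_treat - mean_ctrl) / math.sqrt(pooled_var)

# Hypothetical tutoring trial: 200 students per arm, test scores
# normalized so each arm has SD = 1, treated mean 0.37 SD higher.
print(round(cohens_d(0.37, 0.0, 1.0, 1.0, 200, 200), 2))  # 0.37
```

A d of 0.37 means the average tutored student outscores roughly 64% of untreated students, which is why the synthesis treats it as a large effect for a scalable intervention.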
Cross-Cutting Themes
Patterns that emerge consistently across all ten research clusters
The Persistence of Selection Bias
In almost every domain — from teacher value-added to charter schools to early childhood education — observational estimates are routinely found to be biased upward when subjected to rigorous quasi-experimental or experimental stress tests.
The Fadeout Phenomenon
Interventions that produce large short-term gains frequently see those gains fade over time. However, fadeout in test scores does not always preclude long-term benefits in adult outcomes, suggesting that non-cognitive skills may act as a crucial, unmeasured mediator.
Implementation Trumps Intervention
The efficacy of many interventions is highly dependent on implementation fidelity and context. Interventions that scale well — like high-dosage tutoring — typically have highly structured, standardized delivery mechanisms.
Money Matters, But How It Is Spent Matters More
The debate over school funding has shifted from whether resources matter to how they are deployed. Targeted funding for high-need students and evidence-based programs yields the highest returns.
Methodology: Human-AI Collaboration
This synthesis was produced through a structured six-stage workflow combining human judgment with AI assistance. All final results, interpretations, and editorial decisions were reviewed and approved by the human author.
Elicit screened the literature to identify candidate papers across 10 research clusters, prioritizing RCTs, natural experiments, and quasi-experimental designs.
Manus AI served as the primary research assistant — executing retrieval, initial drafting, and cross-cluster synthesis under continuous human direction.
Every effect size, sample size, and p-value was verified against primary source PDFs using Claude (claude-opus-4) with direct text extraction via PyMuPDF.
Perplexity AI independently fact-checked quantitative claims and methodological descriptions, flagging discrepancies for human review.
Semantic Scholar API provided programmatic citation network analysis to identify seminal papers and track citation lineages across clusters.
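A citation-network query of the kind described above starts from the public Semantic Scholar Graph API. A minimal sketch of building such a request URL; the helper name is hypothetical, and the DOI shown (Jackson, Johnson & Persico 2016) is for illustration:

```python
from urllib.parse import quote

S2_BASE = "https://api.semanticscholar.org/graph/v1/paper"

def citation_query_url(paper_id: str,
                       fields: tuple = ("title", "citationCount",
                                        "references.title")) -> str:
    """Build a Semantic Scholar Graph API URL requesting citation fields."""
    # Keep ':' and '/' intact: the API accepts ids like DOI:10.1093/qje/qjv036.
    return f"{S2_BASE}/{quote(paper_id, safe=':/')}?fields={','.join(fields)}"

print(citation_query_url("DOI:10.1093/qje/qjv036"))
```

Fetching the URL (e.g. with `requests.get`) returns JSON whose `references` list can be crawled recursively to trace citation lineages across clusters.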
AI tools were used as assistants and auditors, not autonomous decision-makers.
Stay Updated
Get notified when new research clusters, replication studies, or major revisions are published. No spam — research updates only.
Start Exploring the Evidence
Browse the 10 research clusters, read the JJP replication note, or download the full literature review PDF.