Validation study

Evidence.Guide extracts full focal-test payloads with human-level fidelity on real SPPS papers.

We ran the extraction API on 297 Social Psychological and Personality Science papers (2010-2012) containing 827 expert-coded focal hypothesis-test pairs. Every test was processed end-to-end, and more than nine in ten outputs are ready for meta-analysis with zero edits. The full breakdown, workflow guidance, and study provenance follow.

92.4%

Functional accuracy

Usable extractions with no edits

100%

Coverage

827/827 focal tests across 297 papers

65.2%

Exact p-value matches

Identical to expert annotations

0.4%

Critical errors

3 wrong-test selections

What the API extracts

Each focal hypothesis becomes a structured row containing the same fields a human meta-analyst would type into a spreadsheet.

  • Hypothesis text
  • Test type (t, F, regression, chi^2, mixed models, etc.)
  • Test statistic, df, and structure
  • P-value with operators preserved
  • Sample size (N)
  • Degrees of freedom
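
For concreteness, here is roughly the shape of one extracted row, sketched as a Python dictionary. The field names and values are illustrative placeholders, not the exact production schema.

    # Roughly the shape of one extracted focal-test row.
    # Field names and values are illustrative placeholders, not the production schema.
    example_row = {
        "hypothesis_text": "Condition X increases outcome Y relative to control.",
        "test_type": "F",                              # t, F, regression, chi^2, ...
        "test_statistic": 6.54,                        # reported statistic value
        "df": "(1, 126)",                              # degrees of freedom as reported
        "p_value": {"operator": "=", "value": 0.012},  # comparison operator preserved
        "sample_size": 128,                            # N
    }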

Pipeline for this study

PDF ingestion

GROBID converts the paper into structured XML plus metadata.

Model extraction

A GPT-4-class model pulls focal hypotheses and statistics.

Schema enforcement

Strict Pydantic schemas coerce types and enforce legal ranges (see the sketch after these steps).

Discrepancy analysis

Outputs are compared against the expert tables, with confidence scoring.
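
A minimal sketch of what the schema-enforcement step can look like, assuming Pydantic v2; the field names, types, and bounds here are illustrative rather than the exact production models.

    from typing import Optional

    from pydantic import BaseModel, Field

    class FocalTest(BaseModel):
        """One focal hypothesis-test pair; fields and bounds are illustrative."""
        hypothesis_text: str
        test_type: str                          # "t", "F", "regression", "chi2", ...
        test_statistic: Optional[float] = None
        df: Optional[str] = None                # e.g. "(1, 126)" as reported
        p_operator: str = "="                   # "=", "<", ">" preserved from the paper
        p_value: Optional[float] = Field(default=None, ge=0.0, le=1.0)
        sample_size: Optional[int] = Field(default=None, ge=0)

    # Pydantic coerces model output ("128" -> 128) and rejects out-of-range
    # values, so malformed rows fail loudly instead of entering the dataset.
    row = FocalTest(hypothesis_text="...", test_type="F",
                    test_statistic="6.54", p_value="0.012", sample_size="128")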

Two concrete examples

Perfect alignment

Rock & Janoff-Bulman (2010) - SPPS 1(1), 26-33

10.1177/1948550609347386

  • Interaction between political orientation and motivation
  • API reproduced the quoted hypothesis verbatim
  • p = .002 and N = 223 match the expert annotation exactly

This is the common path: same hypothesis text, same statistic framing, ready for downstream use.

Same result, different label

Kuschel, Förster, & Denzler (2010) - SPPS 1(1), 4-11

10.1177/1948550609345023

  • Experiment 2 conceptual replication hypothesis recovered correctly
  • Experts described the focal statistic as an "attenuated interaction"
  • API emitted an F-test on that interaction with p < .0001

Meaning is aligned even if humans use prose labels while the API normalizes to formal statistics.

Results in detail

Coverage hit 827/827 focal tests without a single parser failure. The remaining quality signal comes from p-value fidelity, confidence scores, and manual review outcomes.

P-value breakdown

Category                | Share of tests | What this means
Exact match             | 65.2%          | Identical values
Format differences      | 18.5%          | Same info, different notation (e.g., p < .05)
Rounding differences    | 8.7%           | Only rounding varies
Missing p-value         | 3.7%           | Numbers absent or paper ambiguous
Multiple-test confusion | 2.4%           | Several similar tests; API picks uncertainly
Other issues            | 1.1%           | Misc. edge cases
Wrong test              | 0.4% (3)       | Truly incorrect picks; flagged for audit
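
The bucketing above can be thought of as a small comparison routine. The sketch below mirrors the category names in the table, but the tolerance and the input shape are assumptions, not the actual discrepancy-analysis code.

    import math

    def classify_p_value(extracted, expert, rounding_tol=0.005):
        """Bucket an extracted vs. expert p-value pair; thresholds are illustrative."""
        if extracted is None or expert is None:
            return "missing_p_value"
        if extracted["operator"] != expert["operator"]:
            return "format_difference"          # e.g. p = .03 vs. p < .05
        if math.isclose(extracted["value"], expert["value"], rel_tol=0.0, abs_tol=1e-12):
            return "exact_match"
        if abs(extracted["value"] - expert["value"]) <= rounding_tol:
            return "rounding_difference"
        return "other_issue"

    classify_p_value({"operator": "=", "value": 0.002},
                     {"operator": "=", "value": 0.002})   # -> "exact_match"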

Confidence scores track quality

  • Exact matches average approx. 0.93 confidence.
  • Format or rounding differences sit around 0.89-0.90.
  • Minor issues cluster near 0.87, and the three critical errors averaged 0.85.

Even modest gaps are useful: low-confidence rows and multi-test passages become the review queue.
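
One way to turn those scores into a review queue is a simple triage rule; the 0.88 threshold and flag names below are assumptions for illustration, not shipped defaults.

    def needs_human_review(row, confidence_threshold=0.88):
        """Route a row to the hybrid review queue; threshold and flags are illustrative."""
        return (
            row["confidence"] < confidence_threshold
            or row.get("multiple_similar_tests", False)
            or row.get("p_value") is None
        )

    rows = [
        {"confidence": 0.93, "p_value": 0.012},
        {"confidence": 0.85, "p_value": None, "multiple_similar_tests": True},
    ]
    review_queue = [r for r in rows if needs_human_review(r)]   # only the second row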

Ground truth caveats

Expert P-curve tables are the benchmark, but they occasionally disagree or contain slips. When the API better reflects the PDF, it is still counted as wrong, so the measured 92.4% functional accuracy is a conservative lower bound.

In messy papers, humans and models both reason through ambiguous passages, which is why reviewable audit trails matter more than chasing an illusory 100%.

Operational guidance

  • Hybrid queue

    Process every paper automatically, then flag low-confidence rows, multi-test passages, and unusual statistics for human review.

  • Sampling & audits

    Spot-check 5-10% of extractions for exploratory projects; tighten thresholds and review rates for policy or clinical decisions.

  • Domain-specific knobs

    Calibrate per journal or field before scaling; SPPS 2010-2012 is a conservative baseline, not a universal guarantee.

  • Post-processing

    Normalize notation, enforce valid ranges (0 <= p <= 1, integer dfs, non-negative N), and surface any residual anomalies.
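
As a rough illustration of that post-processing step, the sketch below normalizes reported p-value strings into operator/value pairs and range-checks them; the regular expression and return shape are assumptions, not Evidence.Guide internals.

    import re

    _P_PATTERN = re.compile(r"\bp\s*(<=|>=|=|<|>)\s*(\d*\.?\d+)", re.IGNORECASE)

    def normalize_p(text):
        """Parse a reported p-value string; pattern and return shape are illustrative."""
        match = _P_PATTERN.search(text)
        if match is None:
            return None                        # surfaced as a missing value
        operator, number = match.groups()
        value = float(number)
        if not 0.0 <= value <= 1.0:
            return None                        # flagged as an anomaly, not kept
        return {"operator": operator, "value": value}

    normalize_p("p < .001")     # -> {"operator": "<", "value": 0.001}
    normalize_p("p = 0.002")    # -> {"operator": "=", "value": 0.002}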

Limitations & next steps

  • Single journal window

    All 297 papers are SPPS articles from 2010-2012; styles elsewhere require their own checks.

  • Focal tests only

    P-curve disclosure tables track the primary hypothesis-test pairs, not every robustness check reported in a paper.

  • Reporting quality matters

    Messy tables or ambiguous prose still create missing values or low-confidence flags for both humans and the model.

  • Ground truth isn't perfect

    Whenever the API quietly fixes a human slip, the metric still counts it as wrong, so 92.4% is a lower bound.

We are running companion validations in broader psychology, medicine, and public policy. Each domain receives its own benchmark, prompt tuning, and discrepancy taxonomy before the numbers appear here.

Study details

Dataset
P-curve disclosure tables (unpublished project)
Journal
Social Psychological and Personality Science
Time frame
2010-2012
Papers
297
Focal tests
827 hypothesis-test pairs
Pipeline
GROBID -> GPT-4-class model -> Pydantic schemas -> discrepancy analysis & confidence calibration

Want the prompts, evaluation notebooks, or taxonomy definitions? Reach out to the team and we are happy to walk through the validation in more detail.