We don’t claim 99% accuracy.
We report the metrics the field accepts.

Assessment tools are evaluated on reliability and validity, not accuracy. Reliability is whether the same input produces the same output across raters and across time. Validity is whether the score measures what the rubric claims to measure. UniRubric is benchmarked on the same public datasets the academic literature uses, with the same metrics, computed in the open against gold-standard human ratings.

This page describes what we measure, the benchmarks we use, and how you can reproduce any number we publish. When pilot results come in, they go below — with the full methodology, sample size, model versions, and raw data references.

For the companion page on how UniRubric defends against every named class of AI fabrication — temporal, citation, rubric drift, prompt injection, and more — read the anti-hallucination architecture.

  1. Humans don't agree with each other

    Two trained markers on the same essay typically reach inter-rater reliability of κ = 0.40–0.70. There is no single correct human grade to be accurate against. Comparing AI to humans is meaningful only when you specify which humans, how trained, and at what level of inter-rater agreement they already had.

  2. Accuracy is a folk concept, not a psychometric one

    In assessment science, the accepted properties are reliability (whether the same input produces the same output) and validity (whether the score measures what the rubric claims). Accuracy doesn't translate. A 99%-accurate-to-rater-X tool can be 60%-accurate-to-rater-Y and 80%-accurate-to-the-class-average. The number tells you nothing without naming the gold standard.

  3. The published literature already disagrees with itself

    The studies we cite on the research page report ICC = 0.43 on critical-thinking essays (Taylor & Francis, 2025), κ = 0.18 between GPT-4 and master’s-level markers (Lundgren, 2024), proportional grading bias (Wetzler et al., 2024), and reliability matching teachers on Chinese EFL essays (Nature, 2024). The same model produces wildly different reliability depending on prompt, rubric, and population. A single “UniRubric is N% accurate” number, even if we could justify it, would be wrong somewhere within a week.

Inter-rater reliability (IRR)

Quadratic Weighted Kappa, Cohen's κ, ICC(2,1)

How much UniRubric agrees with human raters on the same submission. The choice of metric depends on whether the score is ordinal (QWK), categorical (κ), or continuous (ICC). For ASAP and most rubric work the metric is QWK: off-by-one disagreements count for much less than off-by-three, which matches the way assessment actually works.
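To make the off-by-one versus off-by-three point concrete, here is a minimal sketch of Quadratic Weighted Kappa in plain Python. The function name, argument names, and toy scores are illustrative assumptions, not UniRubric's evaluation harness.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """QWK between two equal-length lists of integer scores (illustrative sketch)."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_h = Counter(human)                # marginal counts for the human rater
    hist_m = Counter(model)                # marginal counts for the model
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Penalty grows with the square of the distance between scores.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * hist_h[i] * hist_m[j] / n
    if weighted_expected == 0:
        raise ValueError("QWK is undefined when both raters use one identical score")
    return 1.0 - weighted_observed / weighted_expected


# Toy data: one essay is off by one point in the first case, by three in the second.
human_scores = [0, 1, 2, 3]
off_by_one = [1, 1, 2, 3]
off_by_three = [3, 1, 2, 3]
print(round(quadratic_weighted_kappa(human_scores, off_by_one, 0, 3), 3))    # 0.875
print(round(quadratic_weighted_kappa(human_scores, off_by_three, 0, 3), 3))  # 0.1
```

A single off-by-three disagreement costs far more than a single off-by-one, which is the behaviour that makes QWK suit ordinal rubric scores.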

Intra-rater reliability (test-retest)

ICC(2,1) approximation

If you grade the same essay 10 times, do you get the same score 10 times? Human markers often don't: a marker at hour 6 grades differently from the same marker at hour 1. We measure UniRubric's test-retest reliability by running the same submission across multiple sessions and days, against the same locked prompt, and reporting the ICC. This is one place AI tools should comfortably beat humans.
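As a rough illustration of the test-retest calculation, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single measurement) from a table with one row per essay and one column per repeated run. The function name and the toy scores are assumptions made for this example.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table of scores: rows are essays, columns are repeated runs."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two essays")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every essay needs the same number (>= 2) of runs")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-essay mean square
    ms_cols = ss_cols / (k - 1)                # between-run mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when the scores have no variance")
    return (ms_rows - ms_error) / denominator


# Hypothetical scores for four essays, each graded in three separate runs.
repeated_runs = [
    [72, 72, 71],
    [55, 56, 55],
    [88, 88, 88],
    [64, 63, 64],
]
test_retest_icc = icc_2_1(repeated_runs)
```

Values close to 1 mean the repeated runs rank and score the essays almost identically; systematic drift between runs lowers the figure because ICC(2,1) measures absolute agreement, not just consistency.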

Bias and fairness

Bland-Altman, subgroup ICC, mean signed error

Does UniRubric systematically grade differently across L1/L2 status, gender, prior-attainment band, or institution? Where pilot data permits, we publish subgroup analyses. Where it doesn't, we say so plainly. Subgroup ICC and Bland-Altman plots are the standard tools.
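A minimal sketch of the bias statistics named above: mean signed error, Bland-Altman limits of agreement, and a per-subgroup breakdown. The function names, the group labels, and the scores are hypothetical and exist only to show the arithmetic.

```python
from statistics import mean, stdev


def bland_altman(model, human):
    """Return (bias, lower limit, upper limit) for model-minus-human differences."""
    if len(model) != len(human) or len(model) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    diffs = [m - h for m, h in zip(model, human)]
    bias = mean(diffs)    # mean signed error: positive means the model grades higher
    spread = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


def mean_signed_error_by_group(model, human, groups):
    """Mean signed error computed separately for each subgroup label."""
    diffs_by_group = {}
    for m, h, g in zip(model, human, groups):
        diffs_by_group.setdefault(g, []).append(m - h)
    return {g: mean(d) for g, d in diffs_by_group.items()}


# Hypothetical scores and subgroup labels.
model_scores = [3, 4, 5, 2]
human_scores = [2, 4, 4, 2]
labels = ["L1", "L2", "L1", "L2"]

bias, lower, upper = bland_altman(model_scores, human_scores)
per_group = mean_signed_error_by_group(model_scores, human_scores, labels)
```

A bias that differs noticeably between subgroups is the signal worth investigating; with small subgroups the estimates are noisy, which is why sample sizes need to be reported alongside them.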

Construct validity

Convergent / discriminant correlation

Does our “critical thinking” score correlate with other accepted critical-thinking measures (convergent), and NOT correlate with things it shouldn’t — like word count alone (discriminant)? This is study-specific and lives in the Stage 3 work of a research partnership. See how we collaborate for the design of those studies.
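The convergent and discriminant checks both reduce to correlations, sketched here with a hand-written Pearson coefficient. The variable names and numbers are invented for illustration; a real study would use the partner's own measures and a proper sample size.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


# Hypothetical data for six submissions.
critical_thinking_score = [62, 71, 55, 80, 68, 74]
external_ct_measure = [60, 75, 50, 82, 66, 70]         # an accepted measure of the same construct
word_count = [1500, 900, 1400, 1000, 1200, 1100]       # something the score should not track

convergent_r = pearson_r(critical_thinking_score, external_ct_measure)
discriminant_r = pearson_r(critical_thinking_score, word_count)
```

Evidence for construct validity would be a high convergent correlation alongside a discriminant correlation near zero; a score that tracks word count closely (in either direction) would be a warning sign.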

ASAP — Automated Student Assessment Prize

Validation in progress

Hewlett Foundation 2012

8 prompts, ~17,000 essays, scored by 2 human raters with adjudication. Grades 7–10 secondary education, English-language argumentative and source-based writing. QWK is the agreed reporting metric per the original Kaggle competition. We disclose our subsample size, prompt selection, and model versions for every number we publish from this dataset.

ASAP-SAS — Short Answer Scoring

Planned

Hewlett Foundation 2013

Companion dataset to ASAP, focused on short-answer constructed-response items rather than essays. Useful for rubric-style criteria where evidence is short and discrete. We'll benchmark after ASAP.

TOEFL11 — second-language essays

Pending license

ETS via LDC

Second-language essays from 11 native-language groups. The closest publicly cited proxy for L1/L2 bilingual research relevant to Gulf-region partners. Requires LDC license; we will publish if and when we secure access.

Customer calibration anchor sets

Per pilot

Per-study, never published without consent

During Stage 1 of a research collaboration, your team and UniRubric both grade 10–20 anchor papers. The resulting κ between us is the most relevant reliability number for your study — more relevant than ASAP, because it's measured on your rubric, your cohort, your language pair. We publish these numbers ONLY with explicit written consent from the partnering institution.

A note on dataset fit: ASAP is secondary-school English-language essays. UniRubric is built for higher-education and TAFE settings, often in languages other than English, on lecturer-set rubrics that the dataset does not contain. ASAP performance is a starting credibility floor, not the headline number a university should plan its adoption around. Stage 1 calibration on your own cohort is.

If this is the level you want from your AI marker

See the system in action against your own rubric.

Bring one assignment to a free institutional sandbox, or pilot UniRubric on a single subject for a term.

Status — May 2026

ASAP validation in progress. Numbers forthcoming.

Our first benchmark batch runs UniRubric through the ASAP public set using the same production pipeline every customer uses. We publish QWK per prompt, overall QWK, mean signed bias and standard deviation, test-retest ICC across 3 repeated runs per essay, and total compute cost. When the validation batch completes, results land here with the methodology block — sample size, prompt selection rule, model versions, locked prompt hash, raw data references — so anyone can reproduce.

No vendor-favourable preview. No cherry-picked prompts. Whatever the run produces, that’s what gets published.

When a research collaboration enters Stage 1, your team and UniRubric independently grade a shared anchor set of 10–20 sample submissions from your discipline and language pair. We compute κ between our scores and yours, tune rubric descriptors to converge, and lock the prompt. The resulting Calibration Report — a 2–3 page bound document with the methodology, descriptor tuning log, prompt hash, anchor-set κ, and model/version provenance — is yours to attach to your study’s methods section or ethics submission.

We do not publish these reports without explicit written consent from the institution. Anonymised aggregate statistics across multiple consenting pilots may appear here once we have enough partnerships to constitute a meaningful aggregate. Until then, they will not.
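For readers who want to see how an anchor-set κ is computed, here is a minimal sketch of unweighted Cohen's κ between two sets of grades. The function name and the grade labels are assumptions; the Calibration Report may use a weighted variant where the rubric scale is ordinal.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: share of papers where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Expected agreement if the two raters assigned labels independently.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()) / n ** 2
    if p_expected == 1:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical anchor set of ten papers graded by the institution and by the tool.
institution_grades = ["HD", "D", "C", "C", "P", "D", "HD", "C", "P", "D"]
tool_grades = ["HD", "D", "C", "D", "P", "D", "HD", "C", "C", "D"]
anchor_kappa = cohens_kappa(institution_grades, tool_grades)
```

With only 10 to 20 anchor papers the estimate has a wide confidence interval, so it is best read as a calibration check rather than a population-level claim.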

Same pipeline as production
Benchmark runs go through the same six-step grading pipeline that paying customers use. Same model routing (Haiku for normalise and tone; Sonnet for comprehension, feedback, and synthesis; Opus for evaluation), same recompute step, same database schema. We do not run a separate research-only pipeline that could be tuned to perform better on benchmarks than on real submissions.
Score normalisation
ASAP prompts use different score ranges (some 0–3, one 0–60). For QWK we map each prompt's score range to UniRubric's underlying 0–100 internal scale, then bin to integer scores matching the original prompt's resolution before computing QWK. Mapping rules and bin boundaries are documented per prompt in the published methodology block.
Prompt lock and hash
The system prompt used during a benchmark run is hashed (SHA-256) and recorded with every eval_run row. If we tune the prompt later, the prior batch's hash still verifies the earlier numbers. No prompt swap mid-batch.
Test-retest sample
We re-run a randomly selected subset of essays (typically 50 of the larger N) three times across separate model invocations, days apart where possible. ICC is computed across the three repeats per essay. This is independent of the IRR-vs-human comparison.
Cost transparency
Every published number includes total Anthropic API cost for the batch. Researchers (and competitors) can decide whether the per-run economics are sustainable. We don't hide this.
Raw data references
For each published batch, we publish the batch_tag, dataset slug, sample size, prompt selection rule, model version strings, prompt hash, and total cost. Researchers can run the same parameters in their own UniRubric workspace and compare.
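Two of the steps above, score normalisation and the prompt lock, can be sketched in a few lines. This is an illustration under stated assumptions: it uses a simple linear map and equal-width rounding, whereas the actual mapping rules and bin boundaries are documented per prompt; all function names here are hypothetical.

```python
import hashlib
import math


def to_internal_scale(score, low, high):
    """Linearly map a prompt-native score in [low, high] onto 0-100 (assumed mapping)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return 100.0 * (score - low) / (high - low)


def to_prompt_scale(internal, low, high):
    """Bin a 0-100 internal score back to the prompt's integer resolution."""
    raw = low + internal / 100.0 * (high - low)
    nearest = math.floor(raw + 0.5)       # round half up (assumed bin boundaries)
    return min(high, max(low, nearest))   # clamp to the prompt's valid range


def prompt_hash(system_prompt):
    """SHA-256 hex digest of the exact prompt text, encoded as UTF-8."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def prompt_matches(system_prompt, recorded_hash):
    """True if the prompt text reproduces the hash recorded with a batch."""
    return prompt_hash(system_prompt) == recorded_hash


# A score of 7 on a 2-12 prompt sits at the midpoint of the internal scale.
internal = to_internal_scale(7, 2, 12)        # 50.0
restored = to_prompt_scale(internal, 2, 12)   # 7

# Hypothetical prompt text; any single-character change yields a different digest.
recorded = prompt_hash("Grade the essay against the rubric.")
unchanged = prompt_matches("Grade the essay against the rubric.", recorded)
```

Because the digest depends on every byte of the prompt, a recorded hash lets a later reader confirm that a given prompt text is the one a batch was run against, even after the production prompt has moved on.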

If we’ve published a number and you can’t reproduce it, that’s a bug. Tell us.

We treat reliability claims like any other engineering claim: if it doesn’t reproduce, it’s wrong, and we’d rather you tell us than not. Researchers who spot methodological issues or want access to our evaluation harness for replication should get in touch via the research collaboration form.