A research partner for research programs studying AI impact in education.

UniRubric runs a dedicated program for universities, TAFE institutes, and government entities researching the impact of AI in higher education. We provide the instrument; you provide the study design. The collaboration follows a three-stage protocol you can lift straight into a methods section, with bilingual rubrics, full per-criterion data export, locked-prompt reproducibility, and the right to cite UniRubric in any resulting publication.

The published literature on AI versus human feedback is mixed and active. We treat that as the reason for our research program, not a reason to avoid it. See the state of the field below.


01
ICC = 0.43 between AI and humans on critical-thinking essays.
Mixed-methods comparative analysis suggested AI-based grading can support but not replace human judgement on critical-thinking tasks. Educators provided clearer, deeper interpretations of student responses.
Taylor & Francis · Critical Thinking Assessment in Higher Education · 2025
02
Cohen's κ = 0.18 for GPT-4 vs teachers on master's essays.
GPT-4 aligned with human graders on mean scores but with low inter-rater reliability and a risk-averse middle-range scoring pattern. Prompt engineering did not significantly improve alignment.
Lundgren · University of Gothenburg (arXiv 2406.16510) · 2024
03
AI tends to grade leniently at the low end and harshly at the high end: the bias is proportional.
Wetzler et al. found generative AI grading shows a consistent proportional bias, recommending against standalone AI scoring for nuanced writing requiring creativity and depth of thought.
Ohio State University ASC · Wetzler et al. · 2024
04
Rubric integration substantially improves AI–human alignment.
162 Turkish EFL essays scored by ChatGPT, DeepSeek, and two human raters. With explicit rubric criteria, ChatGPT showed stronger sensitivity and tighter agreement. Opinion essays aligned best; argumentative writing aligned least.
ResearchGate · ChatGPT for Writing Evaluation · 2024
05
GPT-4 matched teacher reliability on Chinese EFL essays.
Generalizability theory analysis of 30 CET-4 essays found GPT-4's scoring reliability matched or exceeded that of the four college English teachers. ChatGPT's qualitative feedback was rated more relevant than teacher feedback across language, content, and organization.
Nature · Humanities & Social Sciences Communications · 2024
06
AI feedback led to higher revision scores than no feedback.
Chinese university EFL learners (n=35) who revised essays using ChatGPT feedback scored significantly higher than the no-feedback group. Students rated teacher feedback as more useful and easier to apply.
ScienceDirect · Computers and Composition · 2024

These are starting points, not a settled consensus. Selection bias, prompt sensitivity, model version drift, and the small samples typical of pilot studies are all live methodological concerns. A research program needs its own design, its own cohort, and its own analysis.

Bilingual rubrics — across first-language (L1) and second-language (L2) writing
The grading model works natively in 100+ languages, including Arabic. Per-criterion descriptors, evidence quotes, and student-facing feedback all preserve the language of submission. Right-to-left rendering is supported at the UI layer.
Per-criterion data export
Every grading run exports as CSV or JSON: scores per criterion, the descriptor band each was matched to, the rubric quote, the evidence quote pulled from the student's text, lecturer-edited final score, and an approval audit timestamp. Researchers analyse the raw data, not screenshots.
Citation rights
If your study cites UniRubric — by name, version, and date — that is welcome and expected. We ask only that you include the model and version string we record on every grading run for reproducibility.
Lecturer-in-the-loop on every grade
No grade is released to a student until a lecturer reviews and approves. This is hard-coded, not a setting. The approval gate is exactly the affordance your research design will want to study.
A program tailored to your study
Sample size, rubric design, language pair, marker training, IRB protocols — we work with the design you submitted to your grant body, not against a generic pilot template.

01
Reliability
Whether the same input produces the same output across raters and across time. Reported via quadratic weighted kappa (QWK) for ordinal scores, Cohen's κ for categorical, ICC for continuous.
In the field literature
Published reliability of generative AI grading varies enormously by methodology. Lundgren (2024) reports κ = 0.18 between GPT-4 and master's-level markers. A paper in Humanities & Social Sciences Communications the same year reports GPT-4 reliability matching teachers on Chinese EFL essays. The Taylor & Francis 2025 paper reports ICC = 0.43 on critical-thinking essays. The numbers are not the model; they are the protocol.
What UniRubric provides
Calibration on your anchor set before the study window opens (Stage 1). Test-retest measurement on the same essay across multiple invocations (lecturer-in-the-loop preserves both grade sets so you can compute κ yourself). Server-side recompute of overall_score from criterion_results, not the LLM's claimed total — auditable, reproducible.
- Lundgren · Gothenburg 2024
- Our methodology ↗
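The agreement statistics named above can be computed directly from paired human and AI scores in an export. What follows is a minimal sketch of quadratic weighted kappa, assuming integer band scores on a shared ordinal scale; the function name and arguments are illustrative and are not part of any UniRubric export or API.

```python
from collections import Counter


def quadratic_weighted_kappa(human, ai, min_score, max_score):
    """QWK between two equal-length lists of integer ordinal scores."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty, equal-length score lists")
    k = max_score - min_score + 1  # number of bands on the scale
    if k < 2:
        raise ValueError("the scale needs at least two bands")
    n = len(human)
    observed = Counter(zip(human, ai))  # joint counts per (human, ai) pair
    human_hist = Counter(human)         # marginal counts per rater
    ai_hist = Counter(ai)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: larger disagreements cost more.
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            expected = human_hist[i] * ai_hist[j] / n  # chance agreement
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        # Both raters used one identical band; kappa is formally
        # undefined here, so this sketch reports full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for one criterion on a 1-4 band scale.
human_scores = [1, 2, 3, 4, 3, 2]
ai_scores = [1, 2, 3, 3, 3, 2]
qwk = quadratic_weighted_kappa(human_scores, ai_scores, 1, 4)
```

Which statistic is appropriate (QWK, unweighted κ, or ICC) depends on how the rubric scale is treated in your study design; this sketch covers only the ordinal case.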
02
Validity
Whether the score measures what the rubric claims to measure. Construct validity is measured via convergent correlation (does the AI score correlate with other accepted measures of the same construct?) and discriminant correlation (does it NOT correlate with things it shouldn't, like word count alone?).
In the field literature
Wetzler et al. (Ohio State 2024) document proportional bias: AI grades leniently at the low end and harshly at the high end. The bias is systematic, not random. The ResearchGate Turkish EFL study (2024) reports that rubric integration substantially improves AI–human alignment; descriptor specificity matters more than model choice.
What UniRubric provides
Rubric descriptors are normalised and persisted to rubric_criteria with explicit performance-band quotes (Step 1 of the pipeline). Per-criterion evidence quotes are pulled from the student's text and validated before persistence. Construct-validity studies live in Stage 3 of a research partnership where your team designs and we provide the data.
- Wetzler · OSU 2024
- Our pipeline architecture ↗
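The discriminant check described above (does a criterion score track word count alone?) reduces to a correlation between two columns. This is a minimal sketch under stated assumptions: whitespace-delimited word counting and a plain Pearson coefficient, with all names chosen for illustration rather than taken from UniRubric.

```python
import math


def word_count(text):
    """Count whitespace-delimited tokens (a rough proxy for essay length)."""
    return len(text.split())


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: essay texts and the AI's 'critical thinking' scores.
essays = ["short answer here", "a somewhat longer answer with more words",
          "an answer of middling length today"]
critical_thinking = [3, 2, 4]
r = pearson_r([word_count(e) for e in essays], critical_thinking)
```

A coefficient close to 1 would suggest the score is tracking length rather than the construct; what threshold counts as a concern is a decision for your study design.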
03
Reproducibility
Whether an independent researcher can rerun your method with the same data and arrive at the same numbers. Without reproducibility, a claim about an AI tool is not a research finding — it is a one-time observation.
In the field literature
Model version drift is a known threat in the AI scoring literature. A grading run from May can produce different scores than the same essay graded in October if the model version changes silently. Almost no commercial AI grading tool today records the exact model and prompt used per grade.
What UniRubric provides
Every grading_runs row records the model_used JSONB (per-step model and version string), cost_usd, started_at, completed_at, and full step_outputs JSONB containing every intermediate LLM response. The system prompt for a calibrated study is SHA-256 hashed and locked. If we update the prompt, your locked hash still verifies your historical grades.
- View grading_runs schema ↗
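The locked-prompt idea above can be illustrated with a standard SHA-256 digest. This is a sketch, not UniRubric's implementation: it assumes the prompt is hashed as exact UTF-8 bytes with no normalisation, and the function names are hypothetical.

```python
import hashlib
import hmac


def prompt_fingerprint(prompt_text):
    """Return the SHA-256 hex digest of the prompt's exact UTF-8 bytes."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def verify_locked_prompt(prompt_text, locked_hash):
    """True if the prompt still matches the hash recorded at lock time."""
    return hmac.compare_digest(prompt_fingerprint(prompt_text),
                               locked_hash.lower())


# Hypothetical study: lock the prompt at calibration, verify it later.
locked = prompt_fingerprint("Grade strictly against the rubric.")
still_valid = verify_locked_prompt("Grade strictly against the rubric.", locked)
was_edited = not verify_locked_prompt("Grade loosely against the rubric.", locked)
```

Because any single-character change produces a different digest, a researcher holding only the recorded hash can later confirm that the prompt text supplied for replication is the one used in the study.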
04
Transparency
Whether the tool's reasoning is auditable by the researcher, not a black box. Per AERA standards: 'evidence supporting score interpretation should be made publicly available'.
In the field literature
Generic chat-style AI grading tools typically output a score and a paragraph of feedback with no underlying structure. Researchers cannot audit how the score was reached, which rubric criterion was matched to which descriptor band, or which sentence in the student's text was used as evidence.
What UniRubric provides
Every grade exports per-criterion: score, descriptor band, exact rubric quote that anchored the grade, exact evidence quote from the student's text, lecturer-edited final score, and approval timestamp. CSV or JSON export. No grade exists without all six fields persisted to criterion_results.
- Per-criterion export schema ↗
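A researcher receiving an export may want to confirm that each row carries all six fields listed above and that the evidence quote really appears in the submission. The sketch below assumes a JSON-style record per criterion; the field names are hypothetical stand-ins, since the actual export schema is not given here.

```python
# Hypothetical field names for the six per-criterion fields.
REQUIRED_FIELDS = (
    "score",
    "descriptor_band",
    "rubric_quote",
    "evidence_quote",
    "lecturer_final_score",
    "approved_at",
)


def missing_fields(record):
    """List required fields that are absent, None, or an empty string."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def evidence_is_verbatim(record, submission_text):
    """True if a non-empty evidence quote occurs verbatim in the submission."""
    quote = record.get("evidence_quote")
    return bool(quote) and quote in submission_text


# Hypothetical record and submission text.
submission = "The policy failed because incentives were misaligned."
record = {
    "score": 3,
    "descriptor_band": "Proficient",
    "rubric_quote": "Identifies a plausible causal mechanism",
    "evidence_quote": "incentives were misaligned",
    "lecturer_final_score": 3,
    "approved_at": "2025-03-01T09:30:00Z",
}
problems = missing_fields(record)
quote_ok = evidence_is_verbatim(record, submission)
```

A score of 0 is treated as present, so a legitimately low grade is not flagged as a missing field.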
05
Fairness
Whether the tool grades equivalently across subgroups: L1 / L2 status, gender, prior-attainment band, institution. Subgroup analysis (subgroup ICC, Bland-Altman plots) is the standard tool. Increasingly required by ethics boards and grant funders.
In the field literature
The published L1/L2 fairness analysis on AI essay grading is thin. Most studies are English-only and grade-7-to-10 secondary. The ResearchGate Turkish-EFL study is one of the few to look at second-language writing explicitly. This is a gap UniRubric exists to help close.
What UniRubric provides
Bilingual rubrics in 100+ languages, including Arabic with right-to-left rendering. Subgroup analysis is part of Stage 3 of a research partnership: per-criterion deltas by L1/L2, language of submission, marker experience, prior-attainment band. Your study design determines which subgroups; we provide the data and the comparison view.
- ResearchGate · Turkish EFL 2024
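The Bland-Altman analysis mentioned above summarises paired scores as a mean difference (bias) with 95% limits of agreement, and can be repeated per subgroup. This is a minimal sketch assuming numeric scores and the conventional 1.96 standard-deviation limits; the subgroup labels and function names are illustrative only.

```python
import statistics
from collections import defaultdict


def bland_altman(human, ai):
    """Return (bias, lower_limit, upper_limit) for AI minus human scores."""
    if len(human) != len(ai) or len(human) < 2:
        raise ValueError("need two equal-length lists of at least 2 scores")
    diffs = [a - h for h, a in zip(human, ai)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def bland_altman_by_subgroup(rows):
    """rows: iterable of (subgroup, human_score, ai_score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for subgroup, human_score, ai_score in rows:
        grouped[subgroup][0].append(human_score)
        grouped[subgroup][1].append(ai_score)
    return {name: bland_altman(h, a) for name, (h, a) in grouped.items()}


# Hypothetical cohort with an L1 / L2 flag per submission.
rows = [("L1", 10, 11), ("L1", 12, 13), ("L1", 14, 15),
        ("L2", 10, 9), ("L2", 12, 12), ("L2", 14, 12)]
by_group = bland_altman_by_subgroup(rows)
```

Comparing the bias values across subgroups gives a first look at whether the tool scores L2 writers differently from L1 writers; whether a gap is meaningful needs the sample sizes and tests set out in your own design.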

We do not claim N% accuracy. The published literature already disagrees with itself on what that would mean. We claim that UniRubric is engineered to satisfy each of the five properties above, and that any specific reliability number for your study should come from your own calibration on your own cohort. See reliability & validity for the full methodology, benchmarks, and forthcoming public-dataset results.

| Property | Generic AI grading tool | UniRubric |
| --- | --- | --- |
| Reliability | Score and paragraph of feedback. No mechanism to measure inter-rater agreement with a human marker. Test-retest unmeasured. | Calibration κ measured on Stage 1 anchor set. Test-retest captured in production via lecturer-in-the-loop. Server-computed score from criterion_results, not LLM-reported. |
| Validity | Single overall score with no decomposition. Cannot tell whether the model rewarded the rubric criteria you specified or proxies (word count, vocabulary range, formal register). | Per-criterion scores, descriptor bands, and rubric quotes persisted to criterion_results. Discriminant analysis is straightforward: your study can show whether a 'critical thinking' score correlates with word count. |
| Reproducibility | Model version may change silently between grading sessions. System prompt is undisclosed. Re-running the same essay in 6 months may produce a different score. | Model version recorded per grading run (step_1 Haiku, step_3 Opus, etc.). System prompt SHA-256 hashed and locked for calibrated studies. Re-running 6 months later either reproduces or fails loudly. |
| Transparency | Score appears with no audit trail. Researcher cannot trace which sentence in the student's text produced which scoring decision. | Every grade exports: per-criterion score, descriptor band, rubric quote, evidence quote from student text, lecturer-edited final score, approval timestamp. Available as CSV or JSON. |
| Fairness | English-language default. Right-to-left rendering often broken. No subgroup analysis. No way to test whether the tool grades L2 writers differently than L1 writers in the same cohort. | 100+ languages including Arabic with native RTL. Subgroup analysis built into Stage 3 dashboards. Per-criterion deltas by L1/L2, language of submission, marker experience. |

The contrast is not about which model is smarter. It is about whether the surrounding software treats a grading event as something to be audited, reproduced, and defended in a methods section — or as a one-time opaque output.

01
Before the study window opens
Calibration & blinding
Your team and UniRubric both grade a shared anchor set of 10-20 sample submissions in the language pair and discipline of your study. Discrepancies are reviewed; rubric descriptors are tuned to the institution's local register; prompt configuration is locked. The locked prompt is hashed and recorded so the study can be reproduced.
For your research team
Gives your methods section a defensible calibration protocol. Reviewers will ask how the AI was tuned to your context — this is the answer.
On our side
We learn the rubric and the cohort's voice. Tunings stay scoped to your study; nothing crosses cohorts.
02
During the study window
Parallel marking — independent and blinded
Lecturers mark normally and release grades to students through their LMS, as they would without us. UniRubric grades the same submissions independently and silently — lecturers do not see the AI grade until after they've submitted their own. Both grade sets are recorded with timestamps, model and version strings, and the rubric quotes each criterion's grade was anchored to.
For your research team
This is the data collection. Export everything as CSV or JSON at any cadence — daily, weekly, end-of-term. Your statistical analysis happens in whatever tool you already use; UniRubric is the instrument, not the analyst.
On our side
Lecturer-in-the-loop is preserved. The student experience does not change. Your study runs alongside teaching rather than disrupting it.
03
Optional, Year 2 of a typical grant
Embedded comparison & moderation tools
Once the study has run a term or two, UniRubric ships an in-product comparison view scoped to your cohort: per-criterion deltas, ICC and Cohen's kappa rolled up per marker and per language, drift detection over the term, and Bland-Altman plots for any criterion pair you select. Subgroup analysis by L1 / L2, language of submission, marker experience.
For your research team
Faster cycles between analysis and the next iteration of your study design. Useful when your grant's deliverable includes practical recommendations for adoption, not only the statistical comparison itself.
On our side
This is the harder build — usually a Year 2 collaboration funded as part of the grant or as a research-services arrangement.

The protocol is the same whether you’re comparing human and AI feedback in a single language, across an L1/L2 pair, or across two institutions. Specifics — sample size, IRB protocol, citation language, export schema — get written into the program proposal we send back after your first email.

Stage 01: Calibration
Fixed fee per study
Senior engineering time, Research Advisor review, the locked-prompt anchor runs, and the bound Calibration Report you append to your methods section. Quoted as a one-off at the start of the collaboration.
Stage 02: Parallel marking
Our standard B2B per-student pricing
Whatever volume your study cohort uses, billed at the same per-student-capped rates as any institutional deployment. See contracting for the labelled tiers. No research surcharge, and your study cohort can roll directly into a paid deployment after the grant window closes if institutional adoption is part of your deliverables.
Stage 03: Embedded comparison & moderation
Quoted per study
This is bespoke engineering: cohort-scoped dashboards, ICC and Cohen’s κ rolled up to your specification, drift detection, Bland-Altman plots, subgroup analysis. Quoted as a research-services arrangement, typically Year 2 of the grant. We welcome being a named subcontractor on the grant application itself.

We treat research collaborations as a program, not a transaction. If your funding body has constraints on vendor relationships or requires a sliding scale for consortium budgets, say so in the first email; we will work with the constraint, not against it.

Dr Roheena Tahir

Research Advisor, UniRubric

PhD. Lecturer at Australian public and private universities, teaching in both English and Arabic. Master of Educational Leadership. Peer reviewer at BMC and a published author. Dr Tahir advises UniRubric’s research collaboration program — supporting university and government research teams in shaping their study design, data plan, and publication path.

ORCID: 0000-0002-7735-6414 ↗

