Stuff a rubric and an essay into a prompt and three things break at scale: consistency (the same essay scores differently on Thursday), cost (a million long-context frontier-model calls is a budget line nobody approved), and defensibility (“the model said 7/10” survives no appeals process). All three are fixed by the same architectural move — grounding every grade in retrieved, versioned, human-scored reference material.
The pipeline: batch-embed every submission, retrieve the nearest rubric anchors and scored exemplars by vector search, have the LLM grade against that retrieved context with structured output at temperature zero, gate low-confidence results to human reviewers, and feed every override back into the exemplar store. That feedback loop is the flywheel: it makes the system measurably better exactly where it was weakest.
94% criterion-level agreement with expert graders is a realistic mature-state target — comparable to human inter-rater agreement on many rubrics — and it must be measured weekly against blind samples, per criterion and per cohort. For the broader landscape of what is production-ready in LLM data work, see our companion piece on LLM-augmented pipelines.
01 · Why Naive LLM Grading Fails
The demo is seductive: paste rubric, paste essay, receive a plausible grade with plausible reasoning in four seconds. The production failure modes arrive in week two:
| Failure | Mechanism | Who notices |
|---|---|---|
| Inconsistency | Sampling variance, context drift, silent model updates — same essay, different Tuesday, different grade | The two students who compared notes |
| Cost blowout | 1M submissions × full-rubric contexts × frontier model | Finance, one quarter late |
| Indefensibility | No evidence chain — a score with vibes attached | The appeals committee, then the regulator |
| Drift toward verbosity | Ungrounded models reward length and fluency over rubric criteria | Nobody, which is the problem |
Every fix below is a special case of one principle: the model never grades from its own opinion; it grades against retrieved, versioned, human-scored evidence.
02 · The Pipeline, Stage by Stage
Each stage is independently scalable and independently auditable — which matters, because Section 07's governance story is built from these seams. The event spine doubles as the audit log: every submission's journey (embedding version, retrieved exemplar IDs, model version, votes, gate decision, reviewer) is reconstructible by ID.
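One way to picture the event spine is as one immutable record per grading decision. The sketch below is illustrative only: `GradeEvent`, its field names, and the sample IDs are assumptions made for the example, not a schema taken from this pipeline.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json


@dataclass(frozen=True)
class GradeEvent:
    """One audit-log entry per grading decision (hypothetical field names)."""
    submission_id: str
    embedding_version: str
    exemplar_ids: tuple[str, ...]      # retrieved exemplars, in rank order
    model_version: str
    votes: tuple[int, ...]             # overall score from each self-consistency run
    gate_decision: str                 # "auto" or "review"
    reviewer_id: Optional[str] = None  # set only when a human reviewed the grade


def to_log_line(event: GradeEvent) -> str:
    """Serialize with sorted keys so identical events give identical log lines."""
    return json.dumps(asdict(event), sort_keys=True)


event = GradeEvent(
    submission_id="sub-001",
    embedding_version="emb-v3",
    exemplar_ids=("ex-17", "ex-42"),
    model_version="small-v1",
    votes=(7, 7, 6),
    gate_decision="review",
    reviewer_id="rev-9",
)
line = to_log_line(event)
replayed = json.loads(line)  # what an auditor would reconstruct by submission ID
```

Keeping the record frozen and append-only is a design choice that fits the audit-log role: a correction becomes a new event rather than a mutation of an old one.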
03 · Vector Search Is the Consistency Mechanism
Retrieval is doing the real work in this architecture. Because every grade is anchored to the same rubric criteria and the nearest human-scored exemplars, two similar essays are judged against the same evidence, which is precisely what consistency means operationally:
retrieval: pgvector, assignment-partitioned (SQL)

```sql
-- exemplars and submissions share one embedding space, per assignment
SELECT e.exemplar_id,
       e.human_score,
       e.criterion_scores,
       e.excerpt,
       1 - (e.embedding <=> :submission_embedding) AS similarity
FROM grading.exemplars e
WHERE e.assignment_id = :assignment_id         -- cross-assignment neighbours are noise
  AND e.embedding_version = :current_version   -- mixed versions = meaningless distance
ORDER BY e.embedding <=> :submission_embedding
LIMIT 6;
```
Three rules with sharp edges: partition the index by assignment — a brilliant history essay is noise when grading chemistry; store rubric criteria in the same vector space so the prompt assembles criteria-plus-exemplars coherently; and re-embed the entire store on embedding-model upgrades, because similarity between mixed-version vectors is quietly meaningless — the failure is silent and the grades just get worse. Pin versions everywhere; treat an embedding upgrade like a schema migration.
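Because the mixed-version failure is silent, it helps to make it loud with an explicit check before retrieval is served. A minimal sketch, assuming the store can be read as `(exemplar_id, embedding_version)` pairs; `check_embedding_versions` and the version strings are hypothetical names for illustration.

```python
from collections import Counter


def check_embedding_versions(rows, current_version):
    """Return per-version counts and the IDs of exemplars outside the current space.

    `rows` is an iterable of (exemplar_id, embedding_version) pairs, for
    example the result of a SELECT over the exemplar table.
    """
    rows = list(rows)
    counts = Counter(version for _, version in rows)
    stale = [ex_id for ex_id, version in rows if version != current_version]
    return counts, stale


rows = [("ex-1", "emb-v3"), ("ex-2", "emb-v3"), ("ex-3", "emb-v2")]
counts, stale = check_embedding_versions(rows, current_version="emb-v3")
migration_complete = not stale  # False here: ex-3 is still in the old space
```

Run as a deployment gate, a check like this turns a partial re-embedding into a blocked release rather than a slow calibration regression.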
04 · The Grading Call: Structured, Grounded, Deterministic
grading request: the shape that survives audits (Python, simplified)

```python
result = llm.generate(
    model=route(submission),          # small model first; escalation in section 06
    temperature=0,
    response_format=GradeSchema,      # structured output: no regex archaeology
    prompt=render(
        rubric=rubric_v(assignment.rubric_version),  # versioned, cited
        criteria=retrieved.criteria,
        exemplars=[e.excerpt_with_scores for e in retrieved.exemplars],  # excerpts!
        submission=submission.text,
    ),
)
# GradeSchema: per-criterion {score, evidence_quote, exemplar_refs[]}
#   + overall, + abstain: bool. The model may decline; abstention routes to humans.

votes = [llm.generate(...) for _ in range(3)]  # self-consistency
confidence = agreement(votes) * retrieval_confidence(retrieved)
```
The choices that matter: temperature 0 and structured output remove two gratuitous variance sources; per-criterion evidence quotes force the model to ground each score in the submission's own text (and give reviewers something fast to verify); the abstain field converts the model's uncertainty from a wrong grade into a routing decision; and self-consistency votes turn residual variance into a measurable confidence signal instead of hidden noise. Grading prompts are versioned artifacts — a prompt change is a deployment, with a calibration run attached.
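To make the schema and the vote-agreement signal concrete, here is one possible shape using only standard-library dataclasses. Everything in it is an assumption for illustration: the field layout follows the comment in the snippet above, but this `agreement` (share of votes matching the most common overall score) and the `quote_is_grounded` helper are sketches, not the production definitions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CriterionGrade:
    criterion: str
    score: int
    evidence_quote: str                       # verbatim span from the submission
    exemplar_refs: list[str] = field(default_factory=list)


@dataclass
class GradeSchema:
    criteria: list[CriterionGrade]
    overall: int
    abstain: bool = False


def agreement(votes: list[GradeSchema]) -> float:
    """Share of all votes that match the most common overall score.

    Abstaining votes match nothing, so any abstention lowers agreement.
    """
    scores = [v.overall for v in votes if not v.abstain]
    if not scores:
        return 0.0
    top_count = Counter(scores).most_common(1)[0][1]
    return top_count / len(votes)


def quote_is_grounded(grade: CriterionGrade, submission_text: str) -> bool:
    """Cheap check that a non-empty evidence quote appears in the submission."""
    quote = grade.evidence_quote.strip()
    return bool(quote) and quote in submission_text


text = "The treaty failed because enforcement was left to the signatories."
thesis = CriterionGrade("thesis", 4, "enforcement was left to the signatories", ["ex-17"])
votes = [GradeSchema([thesis], overall=7),
         GradeSchema([thesis], overall=7),
         GradeSchema([thesis], overall=6)]
vote_agreement = agreement(votes)            # 2 of 3 votes agree
grounded = quote_is_grounded(thesis, text)   # True: the quote is a verbatim span
```

A verbatim-substring check is deliberately strict; a paraphrased "quote" fails it, which is the point when reviewers need to verify evidence quickly.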
05 · Human-in-the-Loop Is the Product, Not the Fallback
Confidence gating decides which grades a human sees: low vote agreement, large distance to nearest exemplars (a genuinely novel answer), scores near grade boundaries, or model abstention. Two design rules keep the loop honest:
- Reviewers see the evidence, not just the score. The review UI shows per-criterion quotes and the retrieved exemplars beside the submission. Review time drops from minutes to seconds, and reviewer agreement becomes measurable against the same evidence the model used.
- Every override is an exemplar. Corrections re-enter the store with the reviewer's criterion scores — so the system improves precisely where it was weakest. This is the flywheel: review rates start at 20–30% and fall as the exemplar store matures, and the falling curve is itself a health metric worth graphing on the wall.
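The four gating triggers listed above can be sketched as a single function that returns the reasons a grade needs review. The thresholds, grade boundaries, and names (`GateInputs`, `gate`) are hypothetical placeholders; real values would come out of calibration runs.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    vote_agreement: float     # 0..1, share of votes agreeing on the score
    nearest_distance: float   # cosine distance to the closest exemplar
    score: float              # proposed overall score
    abstained: bool


def gate(inputs, *, min_agreement=1.0, max_distance=0.35,
         boundaries=(4.0, 6.0, 8.0), boundary_margin=0.5):
    """Return the reasons a grade needs human review (empty list = auto-release)."""
    reasons = []
    if inputs.abstained:
        reasons.append("model_abstained")
    if inputs.vote_agreement < min_agreement:
        reasons.append("low_vote_agreement")
    if inputs.nearest_distance > max_distance:
        reasons.append("novel_answer")
    if any(abs(inputs.score - b) <= boundary_margin for b in boundaries):
        reasons.append("near_grade_boundary")
    return reasons


clean = GateInputs(vote_agreement=1.0, nearest_distance=0.12, score=7.0, abstained=False)
risky = GateInputs(vote_agreement=2 / 3, nearest_distance=0.50, score=6.2, abstained=False)
clean_reasons = gate(clean)   # []
risky_reasons = gate(risky)   # ['low_vote_agreement', 'novel_answer', 'near_grade_boundary']
```

Returning reasons rather than a bare boolean gives the review UI something to show and lets the review-rate curve be broken down by trigger.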
06 · Cost-per-Inference Engineering
The difference between an affordable and an unaffordable system is rarely the model choice alone — it is the pipeline around it:
| Lever | Mechanism | Typical effect |
|---|---|---|
| Batch embeddings | Batch API vs per-call | 10–100× on the embedding line |
| Small-model-first routing | Capable small model grades all; gate escalates the uncertain ~20–30% to a frontier model | 60–80% off the scoring line |
| Prompt economy | Exemplar excerpts with scores, not full essays; cached rubric tokens | 2–5× context reduction |
| Structured output | No re-parse failures, no retry loops | Kills the long tail of waste |
| Abstention | Uncertain → human directly, not three escalation attempts | Caps worst-case per-item cost |
Track cost per graded submission as a first-class SLO beside agreement rate — blended across auto and reviewed items, it lands in single-digit cents in reference scenarios, an order of magnitude below naive frontier-for-everything designs. The two SLOs trade against each other through the gate threshold, which makes the threshold a business dial, not an engineering constant: tightening it buys agreement with review hours, and the dashboard should show exactly that exchange rate.
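The exchange rate between the two SLOs can be worked through with simple expected-cost arithmetic. The sketch below assumes that human review time is counted in the blended figure, and every unit cost in it is a made-up number in cents chosen only to illustrate the calculation.

```python
def blended_cost_per_submission(embed_cost, small_cost, frontier_cost, review_cost,
                                escalation_rate, review_rate):
    """Expected cost of one graded submission, in the unit of the inputs.

    Every submission is embedded and graded by the small model; a share is
    escalated to the frontier model, and a share is reviewed by a human.
    """
    return (embed_cost
            + small_cost
            + escalation_rate * frontier_cost
            + review_rate * review_cost)


# Hypothetical unit costs in cents; none of these figures are measured values.
unit_costs = dict(embed_cost=0.01, small_cost=0.2, frontier_cost=3.0, review_cost=40.0)

base = blended_cost_per_submission(**unit_costs, escalation_rate=0.25, review_rate=0.10)
tighter = blended_cost_per_submission(**unit_costs, escalation_rate=0.25, review_rate=0.15)
cost_of_tightening = tighter - base  # extra cents per submission for 5 more points of review
```

With these illustrative inputs the base case comes to about 4.96 cents and the tighter gate adds about 2 cents per submission; pairing that delta with the agreement gained at the tighter threshold gives the exchange rate the dashboard should display.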
07 · Measurement and Governance: What Makes It Deployable
Institutions do not adopt AI grading because it is cheap; they adopt it when it is more auditable than the manual process it replaces. The regime that clears that bar:
- Weekly blind calibration. A stratified sample graded by experts who never see model scores; agreement tracked per rubric criterion — aggregate agreement hides criterion-level drift, and criterion drift is how trust dies quietly.
- Fairness audits on the same cadence. Agreement sliced by cohort, language background, and submission length. Embedding spaces encode biases that aggregate metrics launder; the per-slice table is where you find them before a journalist does.
- The appeals path reveals the chain. Rubric version, retrieved exemplars, evidence quotes, votes, gate decision, reviewer identity if reviewed. An appeal is a replay, not a debate.
- Drift alarms on the inputs. Submission-length distributions, retrieval-distance distributions, abstention rates — when the incoming population shifts (new cohort, new prompt-injection fashion), the input monitors fire before the agreement metric sags.
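The per-criterion, per-cohort table behind the calibration and fairness bullets reduces to a grouped agreement rate. A minimal sketch, assuming exact-match agreement (a rubric might instead accept adjacent scores) and hypothetical record fields `criterion`, `cohort`, `model_score`, and `expert_score`.

```python
from collections import defaultdict


def agreement_by_slice(records, keys=("criterion", "cohort")):
    """Exact-match agreement between model and expert scores, per slice.

    Returns {slice_tuple: (agreement_rate, sample_size)} so that small
    slices are visible as small, not hidden behind a percentage.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        slice_key = tuple(r[k] for k in keys)
        totals[slice_key] += 1
        hits[slice_key] += int(r["model_score"] == r["expert_score"])
    return {k: (hits[k] / totals[k], totals[k]) for k in totals}


# Illustrative blind-sample records, not real calibration data.
sample = [
    {"criterion": "synthesis", "cohort": "A", "model_score": 3, "expert_score": 3},
    {"criterion": "synthesis", "cohort": "B", "model_score": 3, "expert_score": 2},
    {"criterion": "synthesis", "cohort": "B", "model_score": 2, "expert_score": 2},
    {"criterion": "thesis", "cohort": "A", "model_score": 4, "expert_score": 4},
]
table = agreement_by_slice(sample)
overall = agreement_by_slice(sample, keys=())[()]  # aggregate view, for contrast
```

In this toy sample the aggregate agreement is 0.75 while the synthesis criterion for cohort B sits at 0.5, which is the kind of gap an aggregate number hides.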
Figure: summary panels (Mature-State Target; Submission (Reference); Exemplar Store Matures; Replayable Evidence Chain).
08 · Lessons Learned: The Hard Truths
- The exemplar store is the product; the model is a component. Teams obsess over model selection and starve exemplar curation. A mediocre model with a rich, well-maintained exemplar store beats a frontier model grading from vibes — repeatably.
- Mixed embedding versions fail silently and badly. Our worst calibration regression traced to a partial re-embedding after a model upgrade — half the store in each space, similarity meaningless, grades plausible. Pin, migrate atomically, verify.
- Aggregate agreement is a vanity metric. 94% overall coexisted with 71% on the "synthesis" criterion for non-native speakers. Per-criterion, per-cohort tables are the real dashboard; everything else is investor relations.
- Reviewers drift too. Human graders disagree with each other at known rates; without periodic reviewer-vs-reviewer calibration, the "ground truth" feeding your flywheel wanders. Calibrate the calibrators.
- Length bias is the default failure. Ungrounded LLMs reward word count. Evidence-quote grounding suppressed it; the weekly fairness slice by submission length is what proves it stays suppressed.
- The abstain field saved more trust than any accuracy gain. A system that says "this needs a human" at the right moments earns institutional confidence faster than one that is right slightly more often but never doubts itself.
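For the "calibrate the calibrators" lesson, reviewer-vs-reviewer agreement needs a number that discounts agreement by chance. Cohen's kappa is one common choice; the article does not say which statistic was used, so treat this as an illustrative sketch with made-up scores.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    # Probability both reviewers pick the same label if they scored independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_a = [3, 3, 4, 4]
reviewer_b = [3, 3, 4, 3]
kappa = cohens_kappa(reviewer_a, reviewer_b)  # observed 0.75, expected 0.5, kappa 0.5
```

Tracked per criterion over time, a falling kappa between reviewers signals that the ground truth feeding the exemplar store is drifting, independent of anything the model is doing.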
09 · Key Takeaways for Practitioners
- Ground every grade in rubric anchors plus human-scored exemplars retrieved per submission. The model judges against evidence, never opinion.
- Record the embedding model, rubric, prompt, and LLM version per grade. An upgrade is a migration with a calibration run.
- Gate on vote agreement × retrieval distance × boundary proximity. The threshold is a business dial; show its exchange rate.
- Every human correction improves tomorrow's retrieval exactly where the system was weakest. Graph the review-rate curve.
- Control cost with batch embeddings, small-model-first routing, prompt economy, and abstention. Single-digit cents blended, tracked weekly.
- Govern with blind weekly calibration, fairness slices, drift alarms, and replayable appeals. More auditable than manual: that is the bar.
For the wider map of what is production-ready in LLM data engineering — and what is still demo-ware — see LLM-Augmented Data Pipelines. The streaming and platform foundations come from the same production practice as our LXP streaming engagement; sector context on the EdTech industry page.