Tags: LLM Pipelines · Vector Search · EdTech · RAG · Human-in-the-Loop · MLOps

AI Grading at Scale: Vector Search + LLM Pipelines for 1M+ Submissions

Grading a million open-ended submissions by hand does not scale; grading them with raw LLM calls does not survive cost, consistency, or an appeals process. The production answer is a pipeline, not a prompt: retrieval-grounded scoring, confidence-gated human review, and a measurement regime that makes 94% expert agreement a reachable, checkable number.

Domain: AI/LLM · EdTech · Assessment
Scale Target: 1M+ submissions/term
Consistency Target: 94% expert agreement
Blended Cost Target: Single-digit ¢/submission
Stack: Embeddings · pgvector · LLM
Published: June 2026

Executive Summary

Stuff a rubric and an essay into a prompt and three things break at scale: consistency (the same essay scores differently on Thursday), cost (a million long-context frontier-model calls is a budget line nobody approved), and defensibility (“the model said 7/10” survives no appeals process). All three are fixed by the same architectural move — grounding every grade in retrieved, versioned, human-scored reference material.

The pipeline: batch-embed every submission, retrieve the nearest rubric anchors and scored exemplars by vector search, have the LLM grade against that retrieved context with structured output at temperature zero, gate low-confidence results to human reviewers, and feed every override back into the exemplar store. The flywheel makes the system measurably better exactly where it was weakest.

94% criterion-level agreement with expert graders is a realistic mature-state target — comparable to human inter-rater agreement on many rubrics — and it must be measured weekly against blind samples, per criterion and per cohort. For the broader landscape of what is production-ready in LLM data work, see our companion piece on LLM-augmented pipelines.

01 · Why Naive LLM Grading Fails

The demo is seductive: paste rubric, paste essay, receive a plausible grade with plausible reasoning in four seconds. The production failure modes arrive in week two:

| Failure | Mechanism | Who notices |
|---|---|---|
| Inconsistency | Sampling variance, context drift, silent model updates: same essay, different Tuesday, different grade | The two students who compared notes |
| Cost blowout | 1M submissions × full-rubric contexts × frontier model | Finance, one quarter late |
| Indefensibility | No evidence chain: a score with vibes attached | The appeals committee, then the regulator |
| Drift toward verbosity | Ungrounded models reward length and fluency over rubric criteria | Nobody, which is the problem |

Every fix below is a special case of one principle: the model never grades from its own opinion; it grades against retrieved, versioned, human-scored evidence.

02 · The Pipeline, Stage by Stage

  • ingest: Submissions → Kafka. Schema-enforced, deduped on submission ID, PII-minimised at the gate.
  • embed: Batch embedding. Batch API (10–100× cheaper than per-call); embedding model version pinned and recorded per row.
  • retrieve: Vector search. Nearest rubric anchors + human-scored exemplars, filtered by assignment, then ranked. pgvector or BigQuery vector search.
  • grade: LLM scoring. Small model first; structured JSON (per-criterion score + cited evidence); temperature 0; self-consistency vote.
  • gate & learn: Confidence routing. Low confidence → human review; every override becomes a new scored exemplar. The flywheel.

Each stage is independently scalable and independently auditable — which matters, because Section 07's governance story is built from these seams. The event spine doubles as the audit log: every submission's journey (embedding version, retrieved exemplar IDs, model version, votes, gate decision, reviewer) is reconstructible by ID.
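One way to picture the reconstructible journey described here is a single immutable record per grade, serialised onto the event spine. The following is a minimal sketch under stated assumptions: the class name `GradeAuditRecord` and every field name are hypothetical, chosen only to mirror the items listed in the text, and do not describe an actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GradeAuditRecord:
    """Everything needed to replay one grade (hypothetical field names)."""
    submission_id: str
    embedding_version: str
    retrieved_exemplar_ids: tuple[str, ...]
    model_version: str
    votes: tuple[int, ...]             # overall score from each vote
    gate_decision: str                 # e.g. "auto" or "human_review"
    reviewer_id: Optional[str] = None  # set only when a human reviewed

    def to_event(self) -> str:
        """Serialise as a JSON string suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = GradeAuditRecord(
    submission_id="sub-001",
    embedding_version="emb-v2",
    retrieved_exemplar_ids=("ex-17", "ex-42"),
    model_version="small-2026-05",
    votes=(7, 7, 8),
    gate_decision="human_review",
    reviewer_id="rev-9",
)
event = record.to_event()
```

Freezing the dataclass is a deliberate choice in this sketch: an audit record that can be mutated after the fact is not much of an audit record.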

03 · Vector Search Is the Consistency Mechanism

Retrieval is doing the real work in this architecture. Because every grade is anchored to the same rubric criteria and the nearest human-scored exemplars, two similar essays are judged against the same evidence, which is precisely what consistency means operationally:

retrieval — pgvector, assignment-partitioned (SQL)
-- exemplars and submissions share one embedding space, per assignment
SELECT e.exemplar_id,
       e.human_score,
       e.criterion_scores,
       e.excerpt,
       1 - (e.embedding <=> :submission_embedding) AS similarity
FROM grading.exemplars e
WHERE e.assignment_id = :assignment_id          -- cross-assignment neighbours are noise
  AND e.embedding_version = :current_version    -- mixed versions = meaningless distance
ORDER BY e.embedding <=> :submission_embedding
LIMIT 6;

Three rules with sharp edges: partition the index by assignment — a brilliant history essay is noise when grading chemistry; store rubric criteria in the same vector space so the prompt assembles criteria-plus-exemplars coherently; and re-embed the entire store on embedding-model upgrades, because similarity between mixed-version vectors is quietly meaningless — the failure is silent and the grades just get worse. Pin versions everywhere; treat an embedding upgrade like a schema migration.
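Because the mixed-version failure is silent, it helps to check for it explicitly before any cutover. A minimal sketch, assuming the exemplar store can be scanned as `(assignment_id, embedding_version)` pairs; the function name `find_mixed_versions` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def find_mixed_versions(rows):
    """Return {assignment_id: sorted versions} for every assignment whose
    exemplars span more than one embedding version."""
    versions = defaultdict(set)
    for assignment_id, embedding_version in rows:
        versions[assignment_id].add(embedding_version)
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


# Toy rows standing in for a scan of the exemplar store.
rows = [
    ("chem-101", "emb-v2"),
    ("chem-101", "emb-v2"),
    ("hist-204", "emb-v1"),   # left behind by a partial re-embedding
    ("hist-204", "emb-v2"),
]
mixed = find_mixed_versions(rows)
```

An empty result is the condition to require before declaring a migration complete; anything else names the assignments whose similarity scores cannot be trusted.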

04 · The Grading Call: Structured, Grounded, Deterministic

grading request — the shape that survives audits (Python, simplified)
result = llm.generate(
    model=route(submission),          # small model first; escalation in section 06
    temperature=0,
    response_format=GradeSchema,      # structured output: no regex archaeology
    prompt=render(
        rubric=rubric_v(assignment.rubric_version),   # versioned, cited
        criteria=retrieved.criteria,
        exemplars=[e.excerpt_with_scores for e in retrieved.exemplars],  # excerpts!
        submission=submission.text,
    ))

# GradeSchema: per-criterion {score, evidence_quote, exemplar_refs[]}
#   + overall, + abstain:bool; the model may decline, and abstention routes to humans

votes = [llm.generate(...) for _ in range(3)]   # self-consistency
confidence = agreement(votes) * retrieval_confidence(retrieved)

The choices that matter: temperature 0 and structured output remove two gratuitous variance sources; per-criterion evidence quotes force the model to ground each score in the submission's own text (and give reviewers something fast to verify); the abstain field converts the model's uncertainty from a wrong grade into a routing decision; and self-consistency votes turn residual variance into a measurable confidence signal instead of hidden noise. Grading prompts are versioned artifacts — a prompt change is a deployment, with a calibration run attached.
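The simplified request leaves `GradeSchema` and `agreement` undefined. The sketch below shows one plausible shape for both, under stated assumptions: the field layout follows the comment in the request, while the `CriterionGrade` class, the `vote` helper, and this particular definition of agreement (mean share of votes backing the modal score per criterion) are hypothetical illustrations, not the article's implementation. It also assumes every vote covers the same criteria.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CriterionGrade:
    score: int
    evidence_quote: str                  # verbatim quote from the submission
    exemplar_refs: list[str] = field(default_factory=list)


@dataclass
class GradeSchema:
    criteria: dict[str, CriterionGrade]  # keyed by rubric criterion name
    overall: int
    abstain: bool = False                # True routes the item to a human


def agreement(votes: list[GradeSchema]) -> float:
    """Mean, over criteria, of the share of votes that back the modal score.

    Returns 0.0 when there are no votes or any vote abstained, so such
    items always fall below a confidence threshold.
    """
    if not votes or any(v.abstain for v in votes):
        return 0.0
    shares = []
    for name in votes[0].criteria:
        scores = [v.criteria[name].score for v in votes]
        modal_count = Counter(scores).most_common(1)[0][1]
        shares.append(modal_count / len(scores))
    return sum(shares) / len(shares) if shares else 0.0


def vote(thesis: int, evidence: int) -> GradeSchema:
    """Build a toy vote; the quotes are placeholders."""
    return GradeSchema(
        criteria={
            "thesis": CriterionGrade(thesis, "placeholder quote"),
            "evidence": CriterionGrade(evidence, "placeholder quote"),
        },
        overall=thesis + evidence,
    )


votes = [vote(3, 2), vote(3, 2), vote(3, 3)]
confidence = agreement(votes)   # thesis: 3 of 3 agree; evidence: 2 of 3 agree
```

Scoring agreement per criterion rather than on the overall grade keeps a split on one criterion visible even when the totals happen to match.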

05 · Human-in-the-Loop Is the Product, Not the Fallback

Confidence gating decides which grades a human sees: low vote agreement, large distance to nearest exemplars (a genuinely novel answer), scores near grade boundaries, or model abstention. Two design rules keep the loop honest:

  • Reviewers see the evidence, not just the score. The review UI shows per-criterion quotes and the retrieved exemplars beside the submission. Review time drops from minutes to seconds, and reviewer agreement becomes measurable against the same evidence the model used.
  • Every override is an exemplar. Corrections re-enter the store with the reviewer's criterion scores — so the system improves precisely where it was weakest. This is the flywheel: review rates start at 20–30% and fall as the exemplar store matures, and the falling curve is itself a health metric worth graphing on the wall.

💡 Seed the exemplar store before launch: have experts score a stratified sample (including deliberately excellent, mediocre, and odd submissions) per assignment. Cold-start retrieval against three exemplars produces confident nonsense; the minimum viable store is ~30–50 scored exemplars per assignment.
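The four gating triggers named in this section (low vote agreement, large distance to the nearest exemplars, scores near grade boundaries, model abstention) can be sketched as a function that returns the reasons a grade needs review. This is an illustration under stated assumptions: `GateInputs`, `review_reasons`, and all threshold defaults are hypothetical and would need tuning against calibration data.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    vote_agreement: float     # 0..1 agreement across self-consistency votes
    nearest_distance: float   # cosine distance to the closest exemplar
    overall_score: float
    abstained: bool


def review_reasons(x, boundaries, *, min_agreement=0.8,
                   max_distance=0.35, boundary_margin=0.5):
    """Return the list of reasons to route a grade to a human.

    An empty list means the grade can be released automatically.
    All thresholds are placeholder values, not recommendations.
    """
    reasons = []
    if x.abstained:
        reasons.append("abstained")
    if x.vote_agreement < min_agreement:
        reasons.append("low_vote_agreement")
    if x.nearest_distance > max_distance:
        reasons.append("novel_answer")
    if any(abs(x.overall_score - b) <= boundary_margin for b in boundaries):
        reasons.append("near_grade_boundary")
    return reasons


boundaries = [5.0, 8.5]   # hypothetical pass and distinction cut-offs
clear = review_reasons(GateInputs(1.0, 0.12, 7.0, False), boundaries)
borderline = review_reasons(GateInputs(0.67, 0.41, 8.2, False), boundaries)
```

Returning reasons rather than a bare boolean gives the review UI something to display and makes the gate's behaviour measurable per trigger.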

06 · Cost-per-Inference Engineering

The difference between an affordable and an unaffordable system is rarely the model choice alone — it is the pipeline around it:

| Lever | Mechanism | Typical effect |
|---|---|---|
| Batch embeddings | Batch API vs per-call | 10–100× on the embedding line |
| Small-model-first routing | Capable small model grades all; gate escalates the uncertain ~20–30% to a frontier model | 60–80% off the scoring line |
| Prompt economy | Exemplar excerpts with scores, not full essays; cached rubric tokens | 2–5× context reduction |
| Structured output | No re-parse failures, no retry loops | Kills the long tail of waste |
| Abstention | Uncertain → human directly, not three escalation attempts | Caps worst-case per-item cost |

Track cost per graded submission as a first-class SLO beside agreement rate — blended across auto and reviewed items, it lands in single-digit cents in reference scenarios, an order of magnitude below naive frontier-for-everything designs. The two SLOs trade against each other through the gate threshold, which makes the threshold a business dial, not an engineering constant: tightening it buys agreement with review hours, and the dashboard should show exactly that exchange rate.
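The exchange rate between the gate threshold and cost is simple arithmetic once the unit prices are known. A minimal sketch, with loud caveats: the function `blended_cost_cents` is hypothetical, the three prices are illustrative placeholders rather than measured or quoted figures, and reviewer time is deliberately left out of the sum.

```python
def blended_cost_cents(embed, small_call, frontier_call, escalation_rate, votes=3):
    """Expected inference cost per submission, in cents.

    Every submission is embedded once and graded `votes` times by the small
    model; a fraction `escalation_rate` also receives one frontier-model
    call. Human reviewer time is excluded from this sketch.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return embed + votes * small_call + escalation_rate * frontier_call


# Illustrative placeholder prices in cents, NOT measured or quoted figures.
EMBED, SMALL_CALL, FRONTIER_CALL = 0.01, 0.4, 12.0

# Sweep the escalation rate to see how the gate threshold moves the cost.
for rate in (0.10, 0.20, 0.30):
    cost = blended_cost_cents(EMBED, SMALL_CALL, FRONTIER_CALL, rate)
    print(f"escalation {rate:.0%}: {cost:.2f} cents per submission")
```

With real prices substituted, the same sweep is the cost half of the dashboard the paragraph describes; the agreement half comes from the calibration set.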

07 · Measurement and Governance: What Makes It Deployable

Institutions do not adopt AI grading because it is cheap; they adopt it when it is more auditable than the manual process it replaces. The regime that clears that bar:

  • Weekly blind calibration. A stratified sample graded by experts who never see model scores; agreement tracked per rubric criterion — aggregate agreement hides criterion-level drift, and criterion drift is how trust dies quietly.
  • Fairness audits on the same cadence. Agreement sliced by cohort, language background, and submission length. Embedding spaces encode biases that aggregate metrics launder; the per-slice table is where you find them before a journalist does.
  • The appeals path reveals the chain. Rubric version, retrieved exemplars, evidence quotes, votes, gate decision, reviewer identity if reviewed. An appeal is a replay, not a debate.
  • Drift alarms on the inputs. Submission-length distributions, retrieval-distance distributions, abstention rates — when the incoming population shifts (new cohort, new prompt-injection fashion), the input monitors fire before the agreement metric sags.

🔩 Write the model-update runbook before the first model update: re-embed the store, recalibrate against the blind set, compare per-criterion agreement, then cut over. Teams that skip this discover that "the same pipeline on a newer model" is a different grader with the same name.
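The per-criterion, per-cohort agreement tracking in this section reduces to a grouped comparison of model scores against blind expert scores. A minimal sketch, assuming exact-match agreement (a rubric might instead accept scores within one point); the function names, cohort labels, and toy samples are all hypothetical.

```python
from collections import defaultdict


def agreement_table(samples):
    """samples: iterable of (criterion, cohort, model_score, expert_score).

    Returns {(criterion, cohort): exact-match agreement rate}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for criterion, cohort, model_score, expert_score in samples:
        key = (criterion, cohort)
        totals[key] += 1
        hits[key] += int(model_score == expert_score)
    return {key: hits[key] / totals[key] for key in totals}


def below_target(table, target=0.94):
    """Slices whose agreement falls under the target, sorted for display."""
    return sorted(key for key, rate in table.items() if rate < target)


# Toy blind-calibration sample; real slices need far more rows per cell.
samples = [
    ("thesis", "native", 3, 3),
    ("thesis", "native", 2, 2),
    ("synthesis", "native", 3, 3),
    ("synthesis", "non_native", 2, 3),
    ("synthesis", "non_native", 3, 3),
]
table = agreement_table(samples)
flagged = below_target(table)
```

Running the same table before and after a model update gives the per-criterion comparison the runbook calls for.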

  • 94%: Expert Agreement, Mature-State Target
  • ~5¢: Blended Cost per Submission (Reference)
  • 20→8%: Review Rate Curve as Exemplar Store Matures
  • 100%: Grades With a Replayable Evidence Chain

08 · Lessons Learned: The Hard Truths

  • The exemplar store is the product; the model is a component. Teams obsess over model selection and starve exemplar curation. A mediocre model with a rich, well-maintained exemplar store beats a frontier model grading from vibes — repeatably.
  • Mixed embedding versions fail silently and badly. Our worst calibration regression traced to a partial re-embedding after a model upgrade — half the store in each space, similarity meaningless, grades plausible. Pin, migrate atomically, verify.
  • Aggregate agreement is a vanity metric. 94% overall coexisted with 71% on the "synthesis" criterion for non-native speakers. Per-criterion, per-cohort tables are the real dashboard; everything else is investor relations.
  • Reviewers drift too. Human graders disagree with each other at known rates; without periodic reviewer-vs-reviewer calibration, the "ground truth" feeding your flywheel wanders. Calibrate the calibrators.
  • Length bias is the default failure. Ungrounded LLMs reward word count. Evidence-quote grounding suppressed it; the weekly fairness slice by submission length is what proves it stays suppressed.
  • The abstain field saved more trust than any accuracy gain. A system that says "this needs a human" at the right moments earns institutional confidence faster than one that is right slightly more often but never doubts itself.
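The length-bias lesson suggests a concrete weekly check: correlate submission length with the signed grading error on the blind calibration set. A minimal sketch, assuming signed error is defined as model score minus expert score; the `pearson` helper is written out by hand to keep the example dependency-free, and the sample data is invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length samples.

    Returns 0.0 when either sample has no variance.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length samples with n >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Invented calibration sample: longer essays are over-scored by the model.
word_counts = [200, 400, 600, 800]
signed_errors = [-1, 0, 0, 1]   # model score minus expert score
r = pearson(word_counts, signed_errors)
```

A correlation near zero is consistent with length bias staying suppressed; a persistently positive value means the model is paying for word count again.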

09 · Key Takeaways for Practitioners

  • 🧭 Ground every grade. Rubric anchors + human-scored exemplars retrieved per submission. The model judges against evidence, never opinion.
  • 📌 Pin all the versions. Embedding model, rubric, prompt, LLM: all recorded per grade. An upgrade is a migration with a calibration run.
  • 🚪 Gate on confidence. Vote agreement × retrieval distance × boundary proximity. The threshold is a business dial, so show its exchange rate.
  • 🔁 Overrides are exemplars. Every human correction improves tomorrow's retrieval exactly where the system was weakest. Graph the review-rate curve.
  • 💰 Cost is an SLO. Batch embeddings, small-model-first, prompt economy, abstention. Single-digit cents blended, tracked weekly.
  • 🔍 Audit per criterion, per cohort. Blind weekly calibration, fairness slices, drift alarms, replayable appeals. More auditable than manual: that is the bar.

For the wider map of what is production-ready in LLM data engineering, and what is still demo-ware, see LLM-Augmented Data Pipelines. The streaming and platform foundations come from the same production practice as our LXP streaming engagement; sector context is on the EdTech industry page.

FAQ · Frequently Asked Questions

How accurate is LLM grading compared to human graders?
With retrieval grounding and a maturing exemplar store, ~94% criterion-level agreement with expert graders is a realistic target, comparable to inter-rater agreement between two trained humans on many rubrics. The number must be measured weekly against blind human-graded samples, per criterion and per cohort.

Why does the pipeline need vector search at all?
Retrieval is the consistency mechanism: anchoring every grade to the same rubric criteria and the nearest human-scored exemplars means similar work is judged against identical evidence. Without it, scores drift across time, model versions, and phrasing, which is the failure mode that kills institutional trust.

What does grading cost per submission?
Engineered properly (batch embeddings, small-model-first with gated escalation, prompt economy, cached rubric context), blended cost lands in single-digit cents per submission in reference scenarios: an order of magnitude below naive frontier-model designs. Track cost-per-graded-submission as an SLO beside agreement rate.

How do you handle appeals and academic integrity?
Every grade carries its evidence chain: rubric criteria cited, exemplars retrieved, criterion-level reasoning, and model/embedding versions. Appeals review that chain; low-confidence and boundary cases were already human-reviewed. The system should be more auditable than manual grading; that is the adoption bar.