AI Grading at Scale: Vector Search + LLM Pipelines for a Million Student Submissions

Executive Summary

Stuff a rubric and an essay into a prompt and three things break at scale: consistency (the same essay scores differently on Thursday), cost (a million long-context frontier-model calls is a budget line nobody approved), and defensibility (“the model said 7/10” survives no appeals process). All three are fixed by the same architectural move — grounding every grade in retrieved, versioned, human-scored reference material.

The pipeline: batch-embed every submission, retrieve the nearest rubric anchors and scored exemplars by vector search, have the LLM grade against that retrieved context with structured output at temperature zero, gate low-confidence results to human reviewers, and feed every override back into the exemplar store. The flywheel makes the system measurably better exactly where it was weakest.

94% criterion-level agreement with expert graders is a realistic mature-state target — comparable to human inter-rater agreement on many rubrics — and it must be measured weekly against blind samples, per criterion and per cohort. For the broader landscape of what is production-ready in LLM data work, see our companion piece on LLM-augmented pipelines.

01 · Why Naive LLM Grading Fails

The demo is seductive: paste rubric, paste essay, receive a plausible grade with plausible reasoning in four seconds. The production failure modes arrive in week two:

Failure	Mechanism	Who notices
Inconsistency	Sampling variance, context drift, silent model updates — same essay, different Tuesday, different grade	The two students who compared notes
Cost blowout	1M submissions × full-rubric contexts × frontier model	Finance, one quarter late
Indefensibility	No evidence chain — a score with vibes attached	The appeals committee, then the regulator
Drift toward verbosity	Ungrounded models reward length and fluency over rubric criteria	Nobody, which is the problem

Every fix below is a special case of one principle: the model never grades from its own opinion; it grades against retrieved, versioned, human-scored evidence.

02 · The Pipeline, Stage by Stage

ingest: Submissions → Kafka. Schema-enforced, deduped on submission ID, PII-minimised at the gate.
embed: Batch embedding. Batch API (10–100× cheaper than per-call); embedding model version pinned and recorded per row.
retrieve: Vector search. Nearest rubric anchors + human-scored exemplars, filtered by assignment, then ranked. pgvector or BigQuery vector search.
grade: LLM scoring. Small model first; structured JSON (per-criterion score + cited evidence); temperature 0; self-consistency vote.
gate & learn: Confidence routing. Low confidence → human review; every override becomes a new scored exemplar. The flywheel.

Each stage is independently scalable and independently auditable — which matters, because Section 07's governance story is built from these seams. The event spine doubles as the audit log: every submission's journey (embedding version, retrieved exemplar IDs, model version, votes, gate decision, reviewer) is reconstructible by ID.
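
The reconstructible journey can be pictured as one immutable record per grade, emitted onto the event spine. This is a minimal sketch under stated assumptions: the class name `GradeAuditRecord`, its field names, and the `to_event` helper are all hypothetical and only mirror the items listed in the paragraph above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class GradeAuditRecord:
    """One replayable record per graded submission (field names are illustrative)."""
    submission_id: str
    embedding_version: str
    retrieved_exemplar_ids: tuple
    model_version: str
    votes: tuple                        # overall score from each self-consistency run
    gate_decision: str                  # "auto" or "human_review"
    reviewer_id: Optional[str] = None   # set only when a human reviewed the grade


def to_event(record: GradeAuditRecord) -> str:
    """Serialise the record as a JSON event; the submission ID is the lookup key."""
    return json.dumps(asdict(record), sort_keys=True)


record = GradeAuditRecord(
    submission_id="sub-001",
    embedding_version="emb-v1",
    retrieved_exemplar_ids=("ex-17", "ex-42"),
    model_version="small-v1",
    votes=(7, 7, 8),
    gate_decision="human_review",
    reviewer_id="rev-9",
)
print(to_event(record))
```

Freezing the dataclass is a deliberate choice in this sketch: an audit record that can be mutated after the fact is not much of an audit record.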

03 · Vector Search Is the Consistency Mechanism

Retrieval is doing the real work in this architecture. By anchoring every grade to the same rubric criteria and the nearest human-scored exemplars, two similar essays are judged against the same evidence — which is precisely what consistency means operationally:

retrieval — pgvector, assignment-partitioned (SQL)
-- exemplars and submissions share one embedding space, per assignment
SELECT e.exemplar_id, e.human_score, e.criterion_scores, e.excerpt,
       1 - (e.embedding <=> :submission_embedding) AS similarity
FROM   grading.exemplars e
WHERE  e.assignment_id = :assignment_id          -- cross-assignment neighbours are noise
  AND  e.embedding_version = :current_version    -- mixed versions = meaningless distance
ORDER  BY e.embedding <=> :submission_embedding
LIMIT  6;

Three rules with sharp edges: partition the index by assignment — a brilliant history essay is noise when grading chemistry; store rubric criteria in the same vector space so the prompt assembles criteria-plus-exemplars coherently; and re-embed the entire store on embedding-model upgrades, because similarity between mixed-version vectors is quietly meaningless — the failure is silent and the grades just get worse. Pin versions everywhere; treat an embedding upgrade like a schema migration.
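
The same filter-then-rank logic can be mirrored in plain Python for unit tests that run without a database. This is a sketch, not the production path: the `Exemplar` type, the in-memory `store`, and both helper functions are assumptions made for illustration. It uses cosine similarity, which matches the `1 - (a <=> b)` expression in the query above because pgvector's `<=>` operator is cosine distance.

```python
import math
from dataclasses import dataclass


@dataclass
class Exemplar:
    exemplar_id: str
    assignment_id: str
    embedding_version: str
    embedding: list
    human_score: float


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_exemplars(store, query, assignment_id, version, k=6):
    """Filter by assignment and embedding version first, then rank by similarity."""
    candidates = [
        e for e in store
        if e.assignment_id == assignment_id and e.embedding_version == version
    ]
    candidates.sort(key=lambda e: cosine_similarity(e.embedding, query), reverse=True)
    return candidates[:k]


def embedding_versions(store):
    """Distinct versions present; more than one means a migration is incomplete."""
    return sorted({e.embedding_version for e in store})


store = [
    Exemplar("ex-1", "hist-101", "v2", [1.0, 0.0], 8.0),
    Exemplar("ex-2", "hist-101", "v2", [0.6, 0.8], 5.0),
    Exemplar("ex-3", "hist-101", "v1", [1.0, 0.0], 9.0),  # stale version: excluded
    Exemplar("ex-4", "chem-201", "v2", [1.0, 0.0], 7.0),  # other assignment: excluded
]
top = nearest_exemplars(store, [1.0, 0.1], "hist-101", "v2")
print([e.exemplar_id for e in top], embedding_versions(store))
```

A check like `embedding_versions` is cheap enough to run before every calibration job, which is one way to turn the silent mixed-version failure into a loud one.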

04 · The Grading Call: Structured, Grounded, Deterministic

grading request — the shape that survives audits (Python, simplified)
request = dict(
    model=route(submission),            # small model first; escalation in section 06
    temperature=0,
    response_format=GradeSchema,        # structured output: no regex archaeology
    prompt=render(
        rubric=rubric_v(assignment.rubric_version),     # versioned, cited
        criteria=retrieved.criteria,
        exemplars=[e.excerpt_with_scores for e in retrieved.exemplars],  # excerpts, not full essays
        submission=submission.text,
    ),
)
result = llm.generate(**request)
# GradeSchema: per-criterion {score, evidence_quote, exemplar_refs[]}
# + overall, + abstain: bool; the model may decline, and abstention routes to humans
votes = [llm.generate(**request) for _ in range(3)]     # self-consistency: same request, three runs
confidence = agreement(votes) * retrieval_confidence(retrieved)

The choices that matter: temperature 0 and structured output remove two gratuitous variance sources; per-criterion evidence quotes force the model to ground each score in the submission's own text (and give reviewers something fast to verify); the abstain field converts the model's uncertainty from a wrong grade into a routing decision; and self-consistency votes turn residual variance into a measurable confidence signal instead of hidden noise. Grading prompts are versioned artifacts — a prompt change is a deployment, with a calibration run attached.
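
One way to make the schema and the vote-agreement signal concrete is sketched below. Treat it as an assumption-laden illustration: the source names the fields of `GradeSchema` but does not define `agreement`, so the definition here (mean share of votes matching the modal score per criterion, with any abstention forcing zero) is one plausible choice, and the class names `CriterionGrade` and `Grade` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CriterionGrade:
    score: int
    evidence_quote: str
    exemplar_refs: list = field(default_factory=list)


@dataclass
class Grade:
    criteria: dict          # criterion name -> CriterionGrade
    overall: int
    abstain: bool = False


def agreement(votes):
    """Mean, over criteria, of the share of votes that match the modal score.

    Assumes every vote covers the same criteria. Any abstaining vote returns
    0.0 so that the submission routes to a human.
    """
    if not votes or any(v.abstain for v in votes):
        return 0.0
    shares = []
    for name in votes[0].criteria:
        scores = [v.criteria[name].score for v in votes]
        modal_count = Counter(scores).most_common(1)[0][1]
        shares.append(modal_count / len(scores))
    return sum(shares) / len(shares) if shares else 0.0


def make_vote(thesis, evidence, abstain=False):
    """Build a two-criterion vote for the example below."""
    return Grade(
        criteria={
            "thesis": CriterionGrade(thesis, "quoted sentence from the submission"),
            "evidence": CriterionGrade(evidence, "another quoted sentence"),
        },
        overall=thesis + evidence,
        abstain=abstain,
    )


votes = [make_vote(3, 2), make_vote(3, 2), make_vote(3, 3)]
print(round(agreement(votes), 3))
```

Computing agreement per criterion rather than on the overall score keeps the signal aligned with how calibration is measured later in the article.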

05 · Human-in-the-Loop Is the Product, Not the Fallback

Confidence gating decides which grades a human sees: low vote agreement, large distance to nearest exemplars (a genuinely novel answer), scores near grade boundaries, or model abstention. Two design rules keep the loop honest:

Reviewers see the evidence, not just the score. The review UI shows per-criterion quotes and the retrieved exemplars beside the submission. Review time drops from minutes to seconds, and reviewer agreement becomes measurable against the same evidence the model used.
Every override is an exemplar. Corrections re-enter the store with the reviewer's criterion scores — so the system improves precisely where it was weakest. This is the flywheel: review rates start at 20–30% and fall as the exemplar store matures, and the falling curve is itself a health metric worth graphing on the wall.
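
The four gating signals can be combined into a single routing function that returns its reasons, so reviewers see why an item reached them. This is a sketch only: `GateInputs`, `needs_human_review`, and every threshold value are hypothetical placeholders, not tuned numbers from a real deployment.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    vote_agreement: float      # 0..1, from the self-consistency votes
    nearest_distance: float    # cosine distance to the closest exemplar
    overall_score: float
    abstained: bool


def needs_human_review(x, boundaries, min_agreement=0.8,
                       max_distance=0.35, boundary_margin=0.5):
    """Return the reasons a grade should go to a reviewer (empty list = auto-release)."""
    reasons = []
    if x.abstained:
        reasons.append("model_abstained")
    if x.vote_agreement < min_agreement:
        reasons.append("low_vote_agreement")
    if x.nearest_distance > max_distance:
        reasons.append("novel_answer")
    if any(abs(x.overall_score - b) <= boundary_margin for b in boundaries):
        reasons.append("near_grade_boundary")
    return reasons


grade_boundaries = [5.0, 8.5]   # illustrative pass and distinction cut-offs
confident = GateInputs(1.0, 0.10, 7.0, False)
shaky = GateInputs(0.67, 0.50, 8.4, False)
print(needs_human_review(confident, grade_boundaries))
print(needs_human_review(shaky, grade_boundaries))
```

Returning reasons instead of a bare boolean also gives the review-rate dashboard a breakdown by cause for free.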

💡Seed the exemplar store before launch: have experts score a stratified sample (including deliberately excellent, mediocre, and odd submissions) per assignment. Cold-start retrieval against three exemplars produces confident nonsense — the minimum viable store is ~30–50 scored exemplars per assignment.
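
Drawing the seed sample can be as simple as the sketch below. The assumptions are loud ones: the function name is hypothetical, and the stratum key (`band` here) is assumed to come from a quick provisional triage by an expert, since true quality is unknown before scoring.

```python
import random
from collections import defaultdict


def stratified_seed_sample(submissions, stratum_of, per_stratum, seed=0):
    """Pick up to `per_stratum` submissions from each stratum for expert scoring."""
    rng = random.Random(seed)   # fixed seed keeps the sample reproducible
    buckets = defaultdict(list)
    for s in submissions:
        buckets[stratum_of(s)].append(s)
    sample = []
    for stratum in sorted(buckets):
        items = buckets[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


bands = ["strong"] * 20 + ["mid"] * 20 + ["odd"] * 3
subs = [{"id": i, "band": band} for i, band in enumerate(bands)]
picked = stratified_seed_sample(subs, lambda s: s["band"], per_stratum=15)
print(len(picked))
```

Small strata are taken whole rather than skipped, which matters for the "odd" submissions the store most needs to see.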

06 · Cost-per-Inference Engineering

The difference between an affordable and an unaffordable system is rarely the model choice alone — it is the pipeline around it:

Lever	Mechanism	Typical effect
Batch embeddings	Batch API vs per-call	10–100× on the embedding line
Small-model-first routing	Capable small model grades all; gate escalates the uncertain ~20–30% to a frontier model	60–80% off the scoring line
Prompt economy	Exemplar excerpts with scores, not full essays; cached rubric tokens	2–5× context reduction
Structured output	No re-parse failures, no retry loops	Kills the long tail of waste
Abstention	Uncertain → human directly, not three escalation attempts	Caps worst-case per-item cost

Track cost per graded submission as a first-class SLO beside agreement rate — blended across auto and reviewed items, it lands in single-digit cents in reference scenarios, an order of magnitude below naive frontier-for-everything designs. The two SLOs trade against each other through the gate threshold, which makes the threshold a business dial, not an engineering constant: tightening it buys agreement with review hours, and the dashboard should show exactly that exchange rate.
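
The blended figure is an expected-value calculation, sketched here so the exchange rate is explicit. Every price and rate in the example is a made-up placeholder chosen for illustration, not a measurement or a vendor quote, and the function name is hypothetical.

```python
def blended_cost_per_submission(small_cost, frontier_cost, review_cost,
                                escalation_rate, review_rate, embedding_cost=0.0):
    """Expected cost per graded submission, in the same currency unit as the inputs.

    Every submission is embedded and graded by the small model; a fraction
    escalates to the frontier model, and a fraction goes to human review.
    """
    return (embedding_cost
            + small_cost
            + escalation_rate * frontier_cost
            + review_rate * review_cost)


# Hypothetical inputs, in dollars per item.
cost = blended_cost_per_submission(
    small_cost=0.002,
    frontier_cost=0.03,
    review_cost=0.40,
    escalation_rate=0.25,
    review_rate=0.08,
    embedding_cost=0.0001,
)
# Same pipeline with a tighter gate that sends 20% of items to reviewers.
tighter = blended_cost_per_submission(0.002, 0.03, 0.40, 0.25, 0.20, 0.0001)
print(round(cost, 4), round(tighter, 4))
```

With inputs like these the human-review term dominates, which is why the gate threshold moves cost far more than model pricing does.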

07 · Measurement and Governance: What Makes It Deployable

Institutions do not adopt AI grading because it is cheap; they adopt it when it is more auditable than the manual process it replaces. The regime that clears that bar:

Weekly blind calibration. A stratified sample graded by experts who never see model scores; agreement tracked per rubric criterion — aggregate agreement hides criterion-level drift, and criterion drift is how trust dies quietly.
Fairness audits on the same cadence. Agreement sliced by cohort, language background, and submission length. Embedding spaces encode biases that aggregate metrics launder; the per-slice table is where you find them before a journalist does.
The appeals path reveals the chain. Rubric version, retrieved exemplars, evidence quotes, votes, gate decision, reviewer identity if reviewed. An appeal is a replay, not a debate.
Drift alarms on the inputs. Submission-length distributions, retrieval-distance distributions, abstention rates — when the incoming population shifts (new cohort, new prompt-injection fashion), the input monitors fire before the agreement metric sags.
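
The per-criterion, per-cohort table behind the calibration and fairness items can be computed from the blind sample in a few lines. This is a sketch with assumed row keys (`criterion`, `cohort`, `model_score`, `expert_score`); exact-match agreement is the default, and the `tolerance` parameter is a hypothetical knob for rubrics where adjacent scores count as agreement.

```python
from collections import defaultdict


def agreement_by_slice(rows, tolerance=0):
    """Share of rows where the model is within `tolerance` of the expert score,
    grouped by (criterion, cohort)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        key = (r["criterion"], r["cohort"])
        totals[key] += 1
        if abs(r["model_score"] - r["expert_score"]) <= tolerance:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


rows = [
    {"criterion": "synthesis", "cohort": "cohort_a", "model_score": 3, "expert_score": 3},
    {"criterion": "synthesis", "cohort": "cohort_a", "model_score": 2, "expert_score": 3},
    {"criterion": "synthesis", "cohort": "cohort_b", "model_score": 4, "expert_score": 4},
    {"criterion": "thesis", "cohort": "cohort_a", "model_score": 4, "expert_score": 4},
]
table = agreement_by_slice(rows)
for key in sorted(table):
    print(key, table[key])
```

In this toy data the aggregate agreement is 75%, while one slice sits at 50%: the pattern the fairness audit exists to surface.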

🔩Write the model-update runbook before the first model update: re-embed store, recalibrate against the blind set, compare per-criterion agreement, then cut over. Teams that skip this discover that "the same pipeline on a newer model" is a different grader with the same name.
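
The "compare per-criterion agreement, then cut over" step can be expressed as a blocking check. This is a minimal sketch: the function name and the `max_drop` threshold of two points are assumptions, and the agreement figures in the example are invented.

```python
def cutover_blockers(current, candidate, max_drop=0.02):
    """Criteria whose blind-set agreement drops by more than `max_drop`,
    or that the candidate run failed to report at all."""
    blockers = {}
    for criterion, old in current.items():
        new = candidate.get(criterion)
        if new is None or old - new > max_drop:
            blockers[criterion] = (old, new)
    return blockers


current = {"thesis": 0.95, "synthesis": 0.90}
candidate = {"thesis": 0.96, "synthesis": 0.84}
print(cutover_blockers(current, candidate))
```

An empty result is the only condition under which the cut-over proceeds; a better average does not excuse a regression on one criterion.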

94%: Expert Agreement, Mature-State Target
~5¢: Blended Cost per Submission (Reference)
20→8%: Review Rate Curve as Exemplar Store Matures
100%: Grades With a Replayable Evidence Chain

08 · Lessons Learned: The Hard Truths

The exemplar store is the product; the model is a component. Teams obsess over model selection and starve exemplar curation. A mediocre model with a rich, well-maintained exemplar store beats a frontier model grading from vibes — repeatably.
Mixed embedding versions fail silently and badly. Our worst calibration regression traced to a partial re-embedding after a model upgrade — half the store in each space, similarity meaningless, grades plausible. Pin, migrate atomically, verify.
Aggregate agreement is a vanity metric. 94% overall coexisted with 71% on the "synthesis" criterion for non-native speakers. Per-criterion, per-cohort tables are the real dashboard; everything else is investor relations.
Reviewers drift too. Human graders disagree with each other at known rates; without periodic reviewer-vs-reviewer calibration, the "ground truth" feeding your flywheel wanders. Calibrate the calibrators.
Length bias is the default failure. Ungrounded LLMs reward word count. Evidence-quote grounding suppressed it; the weekly fairness slice by submission length is what proves it stays suppressed.
The abstain field saved more trust than any accuracy gain. A system that says "this needs a human" at the right moments earns institutional confidence faster than one that is right slightly more often but never doubts itself.
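
The length-bias lesson suggests a simple weekly check: correlate word count with the gap between model and expert scores on the blind sample. The sketch below is one way to do it, not the method used in the source; the row keys and function names are hypothetical, and a plain Pearson correlation is assumed to be an adequate first signal.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def length_bias(rows):
    """Correlation between word count and (model score - expert score).

    A value near zero suggests length is not driving disagreement; a clearly
    positive value suggests the model over-rewards longer submissions.
    """
    lengths = [r["word_count"] for r in rows]
    residuals = [r["model_score"] - r["expert_score"] for r in rows]
    return pearson(lengths, residuals)


unbiased = [
    {"word_count": w, "model_score": s, "expert_score": s}
    for w, s in [(200, 5), (400, 6), (600, 7), (800, 8)]
]
biased = [
    {"word_count": 200, "model_score": 5, "expert_score": 5},
    {"word_count": 400, "model_score": 6, "expert_score": 6},
    {"word_count": 600, "model_score": 8, "expert_score": 7},
    {"word_count": 800, "model_score": 9, "expert_score": 7},
]
print(length_bias(unbiased), round(length_bias(biased), 3))
```

Using the residual rather than the raw model score matters: longer essays may legitimately score higher, and only the part the experts disagree with counts as bias.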

09 · Key Takeaways for Practitioners

🧭 Ground every grade. Rubric anchors + human-scored exemplars retrieved per submission. The model judges against evidence, never opinion.
📌 Pin all the versions. Embedding model, rubric, prompt, LLM: all recorded per grade. An upgrade is a migration with a calibration run.
🚪 Gate on confidence. Vote agreement, retrieval distance, boundary proximity, and abstention all feed the gate. The threshold is a business dial; show its exchange rate.
🔁 Overrides are exemplars. Every human correction improves tomorrow's retrieval exactly where the system was weakest. Graph the review-rate curve.
💰 Cost is an SLO. Batch embeddings, small-model-first, prompt economy, abstention. Single-digit cents blended, tracked weekly.
🔍 Audit per criterion, per cohort. Blind weekly calibration, fairness slices, drift alarms, replayable appeals. More auditable than manual: that is the bar.

For the wider map of what is production-ready in LLM data engineering, and what is still demo-ware, see LLM-Augmented Data Pipelines. The streaming and platform foundations come from the same production practice as our LXP streaming engagement; sector context is on the EdTech industry page.

FAQ · Frequently Asked Questions

How accurate is LLM grading compared to human graders?

With retrieval grounding and a maturing exemplar store, ~94% criterion-level agreement with expert graders is a realistic target — comparable to inter-rater agreement between two trained humans on many rubrics. The number must be measured weekly against blind human-graded samples, per criterion and per cohort.

Why does the pipeline need vector search at all?

Retrieval is the consistency mechanism: anchoring every grade to the same rubric criteria and the nearest human-scored exemplars means similar work is judged against identical evidence. Without it, scores drift across time, model versions, and phrasing — the failure mode that kills institutional trust.

What does grading cost per submission?

Engineered properly — batch embeddings, small-model-first with gated escalation, prompt economy, cached rubric context — blended cost lands in single-digit cents per submission in reference scenarios: an order of magnitude below naive frontier-model designs. Track cost-per-graded-submission as an SLO beside agreement rate.

How do you handle appeals and academic integrity?

Every grade carries its evidence chain: rubric criteria cited, exemplars retrieved, criterion-level reasoning, and model/embedding versions. Appeals review that chain; low-confidence and boundary cases were already human-reviewed. The system should be more auditable than manual grading — that's the adoption bar.
