The agent demo is a while-loop around an LLM with tool calls; the agent platform is what remains after that loop meets production: tools that fail mid-sequence, memory that grows without governance, reasoning chains nobody can debug, actions that needed a human's sign-off, and a token bill arriving with no attribution. Every one of those is a data engineering problem, and almost none of the agent literature addresses them.
This is the production guide: an event-sourced execution spine where every agent step is a recorded event; tool contracts with schemas, idempotency, and budgets; vector-backed memory with retention tiers and provenance; trace observability that makes a 40-step reasoning chain debuggable in minutes; an autonomy ladder with human approval gates at consequence boundaries; and per-execution cost tracking that makes agent strategies A/B-testable like any other product surface.
This article speaks from practice: Vipra builds and operates VipraGo, an AI Workflow Operating System, and the platform patterns here — event spines, feature/memory stores, human-review calibration, cost SLOs — are the same documented disciplines running across our streaming (1B+ events/hour) and LLM pipeline work. Scale figures for the worked example are labelled reference values.
01 · Why Agent Demos Die in Production
The gap between demo and platform is a list of unanswered questions, and every team that ships agents meets all of them in the first quarter:
| Production question | Demo answer | Platform answer |
|---|---|---|
| The tool failed at step 7 of 12 — now what? | The loop retries or crashes | Event-sourced state; resume, compensate, or escalate per tool contract |
| Why did the agent do that? | Scroll the logs and guess | Full trace: every prompt, retrieval, tool I/O, and decision, replayable |
| Who approved the refund it issued? | Nobody — it was autonomous | Approval gate at the consequence boundary, with evidence pack |
| What did this run cost? | The monthly invoice, eventually | Per-step metering rolled to per-execution, per-strategy, per-tenant |
| Is the new prompt actually better? | Vibes | Strategy A/B on live traffic with pre-registered metrics |
| What does the agent remember about this user? | Everything, forever, ungoverned | Tiered memory with provenance, TTLs, and a subject-access query |
Notice what every platform answer has in common: it is a data engineering artifact — an event log, a trace, a contract, a meter, an experiment, a governed store. The model is maybe 20% of an agent platform. This article is about the other 80%.
02 · The Architecture: An Operating System, Not a Loop
The load-bearing decision is the first row: event sourcing is not optional. Agents are long-running, fallible, multi-step processes — exactly the workload event sourcing was invented for. Resume-after-crash, audit, replay, debugging, cost attribution, and experiments all fall out of the same log. This is VipraGo's architectural spine, and it is the same discipline as our exactly-once streaming work: the log is the truth; everything else is a view.
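To make "the log is the truth; everything else is a view" concrete, here is a minimal sketch of execution state derived as a fold over recorded events. The event names (`tool.succeeded`, `tool.failed`), the field names, and the illustrative costs are assumptions made for this example, not VipraGo's actual schema.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int            # position in the log
    type: str           # hypothetical event names, e.g. "tool.succeeded"
    step: int           # which step of the plan this event belongs to
    cost_usd: float = 0.0


@dataclass(frozen=True)
class ExecState:
    last_completed_step: int = 0
    failed_step: Optional[int] = None
    total_cost_usd: float = 0.0


def apply(state: ExecState, ev: Event) -> ExecState:
    """Pure transition: old state plus one event gives the new state."""
    state = replace(state, total_cost_usd=state.total_cost_usd + ev.cost_usd)
    if ev.type == "tool.succeeded":
        return replace(state, last_completed_step=ev.step, failed_step=None)
    if ev.type == "tool.failed":
        return replace(state, failed_step=ev.step)
    return state  # other event types only contribute their cost


def fold(events) -> ExecState:
    """Current state is a left fold over the log in sequence order."""
    return reduce(apply, sorted(events, key=lambda e: e.seq), ExecState())


# Illustrative log: steps 1 to 6 succeed, step 7 fails.
log = [Event(seq=i, type="tool.succeeded", step=i, cost_usd=0.25) for i in range(1, 7)]
log.append(Event(seq=7, type="tool.failed", step=7, cost_usd=0.25))

state = fold(log)
resume_from = state.failed_step or state.last_completed_step + 1
```

Because `apply` is pure, the same log always rebuilds the same state. Resume-after-crash, audit, and cost attribution all read from that one fold, with no mutable record to keep in sync.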
03 · The Execution Data Flow: Every Step Is an Event
Two properties to defend in review: outputs are referenced, not embedded — tool results land in object storage with an output_ref in the event, keeping the spine lean at scale; and the trace ID is the universal join key — cost, latency, gate decisions, experiment assignment, and user feedback all key on it, which is what makes Section 08's economics queryable at all.
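Both properties can be sketched in a few lines. In this example a plain dictionary stands in for object storage, the reference is content-addressed with SHA-256, and `query_orders` is a hypothetical tool name. All three are assumptions chosen for illustration.

```python
import hashlib
import json

# Stand-in for object storage (assumption: a real platform would use a bucket).
object_store = {}


def put_output(payload: dict) -> str:
    """Store a tool result and return a content-addressed reference to it."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    ref = "sha256:" + hashlib.sha256(blob).hexdigest()
    object_store[ref] = blob
    return ref


def step_event(trace_id: str, step_n: int, tool: str, payload: dict, cost_usd: float) -> dict:
    """Build a spine event that points at the output instead of embedding it."""
    return {
        "trace_id": trace_id,              # the universal join key
        "step_n": step_n,
        "event_type": "tool.succeeded",    # hypothetical event name
        "tool": tool,
        "output_ref": put_output(payload),
        "cost_usd": cost_usd,
    }


def load_output(event: dict) -> dict:
    """Dereference output_ref when a debugger or replay needs the full result."""
    return json.loads(object_store[event["output_ref"]])


big_result = {"rows": [{"order_id": str(i)} for i in range(1000)]}
event = step_event("7f3a", 3, "query_orders", big_result, 0.002)
```

The event stays small however large the tool result is, and anything keyed on `trace_id` can join to it without touching the payload.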
04 · Tool-Use Orchestration: The Contract Layer
An agent's tools are an integration surface that happens to be called by a model — and integration surfaces need contracts, not descriptions:
tool contract: what the orchestrator enforces (registry entry)

```yaml
tool: issue_customer_refund
schema:
  input:  {order_id: string, amount: money, reason: enum[...]}
  output: {refund_id: string, status: enum[issued, queued, rejected]}
side_effect: irreversible              # read_only | reversible | irreversible
consequence_tier: 3                    # ties into the autonomy ladder (§07)
idempotency: required                  # key = (order_id, amount, exec_id)
budgets: {rate: 20/min per tenant, cost_est: $0.002/call}
failure_policy:
  retryable: [timeout, rate_limited]   # with backoff, same idempotency key
  compensation: none                   # irreversible ⇒ gate before, not undo after
  on_exhaust: escalate_to_gate
```
The orchestrator enforces what the model cannot be trusted to remember. First, schema validation in both directions: malformed model output never reaches a production API, and malformed tool output never reaches the next prompt unvalidated. Second, idempotency keys on every side-effecting call, because agents retry and networks lie; the duplicate-refund incident is the agent platform's rite of passage, and idempotency is how you skip it. Third, failure policy as data: retryable-with-backoff vs compensate vs escalate is declared per tool, so step-7-of-12 failures resolve by policy, not by panic. This is the data-contract discipline from our contracts playbook pointed at a new producer: the model.
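A minimal sketch of the idempotency and failure-policy rules, assuming the refund contract's `retryable` and `on_exhaust` fields. The `max_attempts` and `base_delay_s` values, the `ToolError` class, and the `tool_fn(args, key)` calling convention are hypothetical additions for this example. The amount is passed as a string so the key does not depend on float formatting.

```python
import hashlib
import time

# Policy fields mirror the refund contract; max_attempts and base_delay_s are assumed.
REFUND_POLICY = {
    "retryable": {"timeout", "rate_limited"},
    "max_attempts": 3,
    "base_delay_s": 0.5,
    "on_exhaust": "escalate_to_gate",
}


class ToolError(Exception):
    """Raised by a tool adapter; kind is a short machine-readable reason."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


def idempotency_key(order_id: str, amount: str, exec_id: str) -> str:
    """Derive a stable key from (order_id, amount, exec_id)."""
    raw = "|".join([order_id, amount, exec_id])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def call_with_policy(policy: dict, tool_fn, args: dict, key: str, sleep=time.sleep) -> dict:
    """Run one side-effecting call; every retry reuses the same idempotency key."""
    for attempt in range(policy["max_attempts"]):
        try:
            return {"status": "ok", "result": tool_fn(args, key)}
        except ToolError as err:
            if err.kind not in policy["retryable"]:
                # Non-retryable failure: hand off instead of retrying.
                return {"status": policy["on_exhaust"], "error": err.kind}
            if attempt + 1 < policy["max_attempts"]:
                sleep(policy["base_delay_s"] * 2 ** attempt)  # exponential backoff
    return {"status": policy["on_exhaust"], "error": "retries_exhausted"}
```

The key is computed once, outside the retry loop, and that is the point. If the first attempt issued the refund but the response timed out, the retry presents the same key and the downstream system can return the existing refund instead of creating a second one.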
05 · Agent Memory: Governed, Not Accumulated
The default agent memory design — embed everything, retrieve top-k, forever — is a compliance incident with a vector index. The governed design separates three stores with different physics:
| Tier | Store | Write rule | Retention |
|---|---|---|---|
| Working | Execution context (spine events) | Automatic, execution-scoped | Dies with the execution; archived in trace |
| Episodic | Vector DB (pgvector/managed) | Distilled summaries only, provenance-tagged, PII-screened at write | Tiered TTLs; per-tenant; subject-access queryable |
| Semantic | Curated knowledge (lakehouse + index) | Human-reviewed promotion from episodic patterns | Versioned like documentation |
The write rule is where platforms diverge from demos: raw conversation never enters long-term memory. A distillation step extracts durable facts ("tenant prefers CSV exports", "order flow X requires VAT field") with provenance pointing back to the source trace — so every memory is auditable, correctable, and deletable. The disciplines are familiar from this series: embedding versions pinned and migrated atomically (LLM grading), retention as a store property with subject-access as a tested deliverable (learner telemetry). Memory is a feature store for agents; govern it like one.
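The write rule and the retention rule can be sketched as a record type plus three small queries. The TTL values, the volatility labels, and the field names below are assumptions for illustration. Vector similarity search is deliberately left out so the governance logic stays visible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed TTL tiers, keyed by how quickly a class of fact goes stale.
TTL_BY_VOLATILITY = {
    "volatile": timedelta(days=7),     # e.g. carrier policies
    "stable": timedelta(days=180),     # e.g. export format preferences
}


@dataclass(frozen=True)
class MemoryRecord:
    fact: str
    tenant_id: str
    subject_id: str          # who the fact is about, for subject-access queries
    source_trace_id: str     # provenance: the execution that produced the fact
    volatility: str
    written_at: datetime

    def is_fresh(self, now: datetime) -> bool:
        return now - self.written_at <= TTL_BY_VOLATILITY[self.volatility]


def retrieve(store, tenant_id: str, now: datetime):
    """Return only this tenant's memories that are still inside their TTL."""
    return [m for m in store if m.tenant_id == tenant_id and m.is_fresh(now)]


def subject_access(store, subject_id: str):
    """Everything remembered about one subject, with provenance attached."""
    return [(m.fact, m.source_trace_id) for m in store if m.subject_id == subject_id]


def forget_subject(store, subject_id: str):
    """Deletion request: drop every record about the subject."""
    return [m for m in store if m.subject_id != subject_id]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
store = [
    MemoryRecord("tenant prefers CSV exports", "t1", "u1", "trace-a1", "stable",
                 now - timedelta(days=30)),
    MemoryRecord("carrier X delivers in 2 days", "t1", "u1", "trace-b2", "volatile",
                 now - timedelta(days=30)),
]
fresh = retrieve(store, "t1", now)  # the 30-day-old volatile fact is filtered out
```

Every record carries a `source_trace_id`, so a wrong memory can be traced to the execution that wrote it, corrected, or deleted on request.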
06 · Observability: Tracing Multi-Step Reasoning
A 40-step agent execution that produced a wrong answer is undebuggable from application logs. The trace makes it a five-minute investigation:
the debugging session: trace-first, not log-archaeology

```sql
-- "Execution 7f3a gave a customer the wrong delivery date. Why?"
SELECT step_n, event_type, summary
FROM traces.steps
WHERE trace_id = '7f3a';

-- step 12: memory.retrieved → chunk #c91 (score 0.71): a summary written
--          by an execution that predates the carrier policy change
-- step 13: plan reasoned from stale memory; tools were never consulted
-- root cause: episodic memory staleness, not model failure.
-- fix: TTL tier for carrier-policy facts + freshness check in retrieval.

-- regression test: replay exec 7f3a against the fixed memory layer:
CALL agent.replay('7f3a', memory_version => 'v2');  -- correct date produced
```
The platform SLOs that follow from trace data, all per strategy version: outcome rate (did the execution achieve its goal — measured against typed success criteria, not vibes), gate-escalation rate, step-count and token distributions (a drifting P95 step count is an early warning that prompts and reality have diverged), tool-failure rates per contract, and retrieval-relevance sampling. Replay is the killer feature — the same property our streaming and lakehouse platforms treat as foundational: any historical execution re-runnable against new prompts, new memory, new models, with diffs. It converts "I think the new prompt is better" into a backtest.
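As one example of an SLO check built on trace data, the sketch below flags a drifting P95 step count for a strategy version. The nearest-rank percentile method and the 25% tolerance are assumptions picked for illustration. A production platform would more likely compute this in the warehouse.

```python
def percentile(values, pct: int):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)   # ceiling division without floats
    return ordered[rank - 1]


def p95_steps_drifted(baseline_steps, current_steps, tolerance: float = 0.25) -> bool:
    """True when the current P95 step count exceeds baseline P95 by more than tolerance."""
    base = percentile(baseline_steps, 95)
    return percentile(current_steps, 95) > base * (1 + tolerance)
```

A single 31-step outlier in twenty executions does not move the P95, but two do. That is roughly the sensitivity wanted from an early-warning signal rather than a page-on-every-outlier alarm.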
07 · Approval Gates and the Autonomy Ladder
Autonomy is not a property of the agent; it is a property of each action's consequence. The ladder, enforced by the orchestrator reading tool contracts:
| Tier | Action class | Policy | Example |
|---|---|---|---|
| 0 | Read-only | Fully autonomous | Query a warehouse, fetch a doc |
| 1 | Reversible writes | Autonomous + audit trail | Draft a report, stage a file |
| 2 | Visible actions | Autonomous within budget, sampled review | Send an internal notification |
| 3 | Irreversible / external | Approval gate, always | Issue refund, send customer email, modify prod data |
The gate itself is engineered like the review queues in our anti-cheat and grading platforms: an evidence pack assembles automatically (goal, the reasoning steps that led here, the exact action payload, relevant memory with provenance), decisions take seconds not minutes, approver agreement is calibrated, and every verdict feeds back as training signal. Two operational rules earned in practice: gates must be fast or they get bypassed culturally (queue SLAs are platform SLOs), and tier demotion is automatic — an agent strategy whose gate-rejection rate spikes loses autonomy tiers until a human re-certifies it. Trust is earned per strategy version, measured, and revocable.
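The ladder and the demotion rule reduce to two small functions. The policy labels, the 20% rejection threshold, and the minimum review count below are assumptions for this sketch. The fixed points taken from the ladder are that tier 3 is always gated and that demotion is automatic while re-promotion is not.

```python
TIER_POLICY = {
    0: "autonomous",
    1: "autonomous_with_audit",
    2: "autonomous_sampled_review",
    3: "approval_gate",
}
MAX_AUTONOMOUS_TIER = 2  # tier 3 is gated for every strategy, always


def route_action(consequence_tier: int, strategy_tier: int) -> str:
    """Decide how one action runs, given the tool's tier and the strategy's earned tier."""
    allowed = min(strategy_tier, MAX_AUTONOMOUS_TIER)
    if consequence_tier > allowed:
        return "approval_gate"
    return TIER_POLICY[consequence_tier]


def next_strategy_tier(strategy_tier: int, rejected: int, reviewed: int,
                       max_rejection_rate: float = 0.2, min_reviews: int = 20) -> int:
    """Demote one tier when the gate-rejection rate exceeds the threshold."""
    if reviewed >= min_reviews and rejected / reviewed > max_rejection_rate:
        return max(0, strategy_tier - 1)
    return strategy_tier  # promotion needs human re-certification, so it is not automatic
```

For example, `route_action(3, 2)` returns `"approval_gate"` no matter how trusted the strategy is, and `next_strategy_tier(2, 10, 40)` returns `1` because a 25% rejection rate exceeds the assumed threshold.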
08 · Cost Engineering and A/B Testing Agent Strategies
Agent economics die in the dark: a strategy that solves the task in 9 steps at $0.04 and one that solves it in 31 steps at $0.31 look identical in the demo. The meter makes them comparable:
per-execution economics: the query that runs the platform

```sql
SELECT strategy_version,
       COUNT(*) AS executions,
       AVG(outcome_success::int) AS success_rate,
       APPROX_PERCENTILE(total_cost, 0.5) AS p50_cost,
       APPROX_PERCENTILE(total_cost, 0.95) AS p95_cost,
       APPROX_PERCENTILE(steps_n, 0.95) AS p95_steps,
       AVG(gate_escalations) AS gates_per_exec,
       SUM(total_cost) / NULLIF(SUM(outcome_success::int), 0) AS cost_per_success
FROM gold.executions
WHERE started_at >= CURRENT_DATE - 14
GROUP BY 1
ORDER BY cost_per_success;  -- the number that decides the A/B
```
Cost-per-success — not cost-per-call, not tokens-per-day — is the platform's economic unit, and it is only computable because every step event carried its metering and every execution carried its typed outcome. Strategy A/B testing then works exactly like the experiment discipline in our recommendation and clickstream pieces: strategies (prompt version × model × tool policy × memory config) assigned per execution, pre-registered metrics, live readout on the spine, permanent holdback to keep claims honest — plus one agent-specific rule: cap experiments by consequence tier. New strategies earn tier-3 autonomy through gated production performance, never through offline evals alone.
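Per-execution assignment with a permanent holdback and a consequence-tier cap can be sketched with a stable hash. The 5% holdback, the tier cap of 2, and the experiment and arm names are assumptions for illustration. `cost_per_success` mirrors the SQL expression above, including its behaviour when there are no successes.

```python
import hashlib


def bucket(exec_id: str, experiment: str) -> float:
    """Map an execution to a stable number in [0, 1) for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{exec_id}".encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and the result stays below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2 ** 53


def assign(exec_id: str, experiment: str, arms: list, baseline: str,
           holdback: float = 0.05, exec_max_tier: int = 0, tier_cap: int = 2) -> str:
    """Pick a strategy for one execution; the same inputs always give the same arm."""
    if exec_max_tier > tier_cap:
        return baseline                      # capped by consequence tier
    u = bucket(exec_id, experiment)
    if u < holdback:
        return baseline                      # permanent holdback
    idx = int((u - holdback) / (1 - holdback) * len(arms))
    return arms[min(idx, len(arms) - 1)]


def cost_per_success(executions: list):
    """executions: list of (total_cost_usd, succeeded) pairs for one strategy."""
    total = sum(cost for cost, _ in executions)
    wins = sum(1 for _, ok in executions if ok)
    return total / wins if wins else None    # same role as NULLIF(..., 0) in the SQL
```

Hashing on the execution ID rather than drawing a random number means a replayed execution lands in the same arm, which keeps experiment readouts consistent with the trace.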
09 · Lessons Learned & Takeaways
- Event-source from day one or rebuild in month four. Every team that started with mutable execution state hit the wall at resume/audit/replay and rebuilt. The log-first design costs a week early and saves a quarter later.
- The duplicate side-effect is the rite of passage — skip it. Agents retry; networks lie; the double refund happens to everyone who skipped idempotency keys. Contract enforcement is cheaper than the apology.
- Stale memory causes more wrong answers than weak models. Our trace analyses keep finding the same root cause: confident reasoning over outdated episodic memory. Freshness checks in retrieval and TTL tiers by fact volatility fixed more failures than any model upgrade.
- Gates that are slow get bypassed; gates that are fast get trusted. The approval queue is a product with an SLA. When decisions took seconds with good evidence packs, teams added gates voluntarily; when they took hours, they architected around them.
- Demos optimise capability; platforms optimise cost-per-success. The 31-step strategy that wows the room loses to the 9-step one in production every time the meter is on. Turn the meter on.
- Autonomy is earned per strategy version. Tier promotion through measured production performance, automatic demotion on rejection-rate spikes. Trust is a number with a trend, not a launch decision.
The platform in six takeaways, one per layer:
- Architecture (§02, §03): Event-sourced execution spine; state is a fold; resume, replay, audit, and cost fall out for free.
- Tool contracts (§04): Schemas both ways, idempotency, side-effect class, budgets, failure policy as data. The model is just another producer.
- Memory (§05): Distilled writes with provenance, tiered TTLs, pinned embeddings, subject-access as a deliverable. Never raw, never forever.
- Observability (§06): 100% step coverage; the trace ID joins cost, gates, and experiments; replay turns prompt opinions into backtests.
- Approval gates (§07): Four-tier autonomy ladder read from tool contracts; fast evidence-packed approvals; automatic tier demotion.
- Cost and experiments (§08): Per-step metering → per-execution economics → strategy A/Bs with holdbacks, capped by consequence tier.
This is the practice behind VipraGo, our AI Workflow Operating System — and it composes every discipline in this series: the event-spine and exactly-once patterns, the LLM pipeline and review calibration, the feature/memory store parity, and the contract culture. For the sober map of what is production-ready in LLM data work, start with LLM-Augmented Data Pipelines.