The Agentic Data Platform: Engineering Pipelines for Autonomous AI Agents in Production

Executive Summary

The agent demo is a while-loop around an LLM with tool calls; the agent platform is what remains after that loop meets production: tools that fail mid-sequence, memory that grows without governance, reasoning chains nobody can debug, actions that needed a human's sign-off, and a token bill arriving with no attribution. Every one of those is a data engineering problem, and almost none of the agent literature addresses them.

This is the production guide: an event-sourced execution spine where every agent step is a recorded event; tool contracts with schemas, idempotency, and budgets; vector-backed memory with retention tiers and provenance; trace observability that makes a 40-step reasoning chain debuggable in minutes; an autonomy ladder with human approval gates at consequence boundaries; and per-execution cost tracking that makes agent strategies A/B-testable like any other product surface.

This article speaks from practice: Vipra builds and operates VipraGo, an AI Workflow Operating System, and the platform patterns here — event spines, feature/memory stores, human-review calibration, cost SLOs — are the same documented disciplines running across our streaming (1B+ events/hour) and LLM pipeline work. Scale figures for the worked example are labelled reference values.

01 · Why Agent Demos Die in Production

The gap between demo and platform is a list of unanswered questions, and every team that ships agents meets all of them in the first quarter:

Production question	Demo answer	Platform answer
The tool failed at step 7 of 12 — now what?	The loop retries or crashes	Event-sourced state; resume, compensate, or escalate per tool contract
Why did the agent do that?	Scroll the logs and guess	Full trace: every prompt, retrieval, tool I/O, and decision, replayable
Who approved the refund it issued?	Nobody — it was autonomous	Approval gate at the consequence boundary, with evidence pack
What did this run cost?	The monthly invoice, eventually	Per-step metering rolled to per-execution, per-strategy, per-tenant
Is the new prompt actually better?	Vibes	Strategy A/B on live traffic with pre-registered metrics
What does the agent remember about this user?	Everything, forever, ungoverned	Tiered memory with provenance, TTLs, and a subject-access query

Notice what every platform answer has in common: it is a data engineering artifact — an event log, a trace, a contract, a meter, an experiment, a governed store. The model is maybe 20% of an agent platform. This article is about the other 80%.

02 · The Architecture: An Operating System, Not a Loop

spine → Event-sourced execution log (Kafka). Every step (plan, tool call, result, retrieval, decision, gate) is an immutable event. Execution state is a fold over events, never a mutable row.

tools → Tool registry & contract layer. Schemas, idempotency keys, side-effect class, cost estimates, rate budgets, consequence tier; orchestration reads the contract, not the vibe.

memory → Governed memory. Working (execution-scoped), episodic (vector DB, provenance-tagged), semantic (curated knowledge), each with retention, access, and write rules.

control → Autonomy ladder + gates. Consequence-tiered actions; approval queues with evidence packs at the boundaries; kill-switches per agent, per tool, per tenant.

observe → Trace store + cost meter + experiments. OTel-style traces to the lakehouse; per-step token/tool metering; strategy assignment and readout riding the same spine.

The load-bearing decision is the first row: event sourcing is not optional. Agents are long-running, fallible, multi-step processes — exactly the workload event sourcing was invented for. Resume-after-crash, audit, replay, debugging, cost attribution, and experiments all fall out of the same log. This is VipraGo's architectural spine, and it is the same discipline as our exactly-once streaming work: the log is the truth; everything else is a view.
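To make "state is a fold over events" concrete, here is a minimal Python sketch. The event type names follow the spine diagram in Section 03; the ExecState fields and the apply and fold helpers are illustrative assumptions, not VipraGo's actual schema.

```python
from dataclasses import dataclass, replace
from functools import reduce


@dataclass(frozen=True)
class ExecState:
    """Hypothetical execution view derived purely from spine events."""
    status: str = "not_started"
    steps_completed: int = 0
    cost: float = 0.0
    pending_gate: bool = False


def apply(state: ExecState, event: dict) -> ExecState:
    """Pure transition: returns a new state and never mutates the old one."""
    kind = event["type"]
    if kind == "exec.started":
        return replace(state, status="running")
    if kind == "tool.result":
        return replace(state, cost=state.cost + event.get("actual_cost", 0.0))
    if kind == "gate.requested":
        return replace(state, pending_gate=True)
    if kind == "gate.decided":
        return replace(state, pending_gate=False)
    if kind == "step.completed":
        return replace(state, steps_completed=state.steps_completed + 1)
    if kind == "exec.finished":
        return replace(state, status=event["outcome"])
    return state  # unknown event types are ignored


def fold(events) -> ExecState:
    """Current state is the left fold of apply over the event log."""
    return reduce(apply, events, ExecState())


events = [
    {"type": "exec.started"},
    {"type": "tool.result", "actual_cost": 0.002},
    {"type": "step.completed"},
    {"type": "gate.requested"},
]
state = fold(events)
```

Because apply is pure, resuming after a crash amounts to folding the persisted prefix and continuing with new events, and replay is the same function run over a historical log.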

03 · The Execution Data Flow: Every Step Is an Event

trigger (user request · schedule · upstream event)
           │
           ▼
┌─ EXECUTION SPINE (Kafka, event-sourced) ─────────────────────────────────┐
│ exec.started {goal, strategy_version, budget, tenant}                    │
│  ├─► plan.proposed {steps[], model, prompt_version}                      │
│  ├─► memory.retrieved {query, chunks[], scores, provenance[]}            │
│  ├─► tool.called {tool, args, idempotency_key, est_cost}                 │
│  │    └─► tool.result {status, output_ref, actual_cost, latency}         │
│  ├─► gate.requested {action, consequence_tier, evidence_pack}            │
│  │    └─► gate.decided {approver, verdict, latency}  ◄── human           │
│  ├─► step.completed … (repeat per step)                                  │
│  └─► exec.finished {outcome, steps_n, tokens, cost, trace_id}            │
└──────────┬────────────────────────────────────────────────────┬──────────┘
           ▼                                                    ▼
  TRACE STORE (lakehouse)                           LIVE CONSUMERS
  replay · debug · audit                            cost meter · SLO alarms
  training data · experiments                       approval queues · kill-switch

Two properties to defend in review: outputs are referenced, not embedded — tool results land in object storage with an output_ref in the event, keeping the spine lean at scale; and the trace ID is the universal join key — cost, latency, gate decisions, experiment assignment, and user feedback all key on it, which is what makes Section 08's economics queryable at all.
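A small sketch of the referenced-not-embedded rule, assuming an in-memory dict stands in for object storage and a list stands in for the Kafka topic. The key layout and the record_tool_result and load_output helpers are hypothetical names chosen for illustration.

```python
import hashlib
import json

object_store: dict[str, bytes] = {}  # stand-in for an object store (assumption)
spine: list[dict] = []               # stand-in for the event-sourced topic


def record_tool_result(trace_id: str, step_n: int, output: dict, actual_cost: float) -> dict:
    """Persist the full output out of band and emit a lean event that points to it."""
    payload = json.dumps(output, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    output_ref = f"traces/{trace_id}/step-{step_n}/{digest}.json"  # hypothetical layout
    object_store[output_ref] = payload
    event = {
        "type": "tool.result",
        "trace_id": trace_id,      # the universal join key
        "step_n": step_n,
        "status": "ok",
        "output_ref": output_ref,  # reference, not the payload itself
        "actual_cost": actual_cost,
    }
    spine.append(event)
    return event


def load_output(event: dict) -> dict:
    """Dereference output_ref when a replay or debug session needs the payload."""
    return json.loads(object_store[event["output_ref"]])


big_output = {"rows": list(range(10_000))}
event = record_tool_result("7f3a", 7, big_output, actual_cost=0.002)
```

The event stays a few hundred bytes regardless of how large the tool output is, and anything downstream (cost, gates, experiments) can join on trace_id without touching the payload.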

04 · Tool-Use Orchestration: The Contract Layer

An agent's tools are an integration surface that happens to be called by a model — and integration surfaces need contracts, not descriptions:

tool contract — what the orchestrator enforces (registry entry)
tool: issue_customer_refund
schema:
  input:  {order_id: string, amount: money, reason: "enum[...]"}
  output: {refund_id: string, status: "enum[issued, queued, rejected]"}
side_effect: irreversible            # read_only | reversible | irreversible
consequence_tier: 3                  # ties into the autonomy ladder (§07)
idempotency: required                # key = (order_id, amount, exec_id)
budgets: {rate: 20/min per tenant, cost_est: $0.002/call}
failure_policy:
  retryable: [timeout, rate_limited] # with backoff, same idempotency key
  compensation: none                 # irreversible ⇒ gate before, not undo after
  on_exhaust: escalate_to_gate

The orchestrator enforces what the model cannot be trusted to remember. First, schema validation in both directions: malformed model output never reaches a production API, and malformed tool output never reaches the next prompt unvalidated. Second, idempotency keys on every side-effecting call, because agents retry and networks lie; the duplicate-refund incident is the agent platform's rite of passage, and idempotency is how you skip it. Third, failure policy as data: retryable-with-backoff vs compensate vs escalate is declared per tool, so step-7-of-12 failures resolve by policy, not by panic. This is the data-contract discipline from our contracts playbook pointed at a new producer: the model.
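The enforcement path can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the CONTRACT dict is a hand-written stand-in for the registry entry, ToolError and call_tool are hypothetical names, max_attempts is an invented parameter, and the in-memory result cache would be a durable store in practice. The downstream API must also honour the idempotency key for the guarantee to hold end to end.

```python
import time


class ToolError(Exception):
    """Failure raised by a tool, tagged with a machine-readable reason."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


CONTRACT = {  # illustrative subset of a registry entry
    "tool": "issue_customer_refund",
    "required_args": {"order_id": str, "amount": float, "reason": str},
    "retryable": {"timeout", "rate_limited"},
    "max_attempts": 3,  # assumption: not part of the contract shown in the article
    "on_exhaust": "escalate_to_gate",
}

_results_by_key: dict[tuple, dict] = {}  # stand-in for a durable dedup store


def call_tool(contract, tool_fn, args, exec_id, sleep=time.sleep):
    # 1. Validate model-produced arguments before anything reaches a real API.
    for name, typ in contract["required_args"].items():
        if not isinstance(args.get(name), typ):
            return {"status": "rejected", "error": f"bad or missing arg: {name}"}
    # 2. One idempotency key per logical call, reused across every retry.
    key = (args["order_id"], args["amount"], exec_id)
    if key in _results_by_key:
        return _results_by_key[key]
    # 3. Failure policy is read from the contract, not decided ad hoc.
    for attempt in range(contract["max_attempts"]):
        try:
            result = tool_fn(args, idempotency_key=key)
            _results_by_key[key] = result
            return result
        except ToolError as err:
            if err.reason not in contract["retryable"]:
                return {"status": "failed", "error": err.reason}
            sleep(0.1 * 2 ** attempt)  # exponential backoff between retries
    return {"status": contract["on_exhaust"]}
```

A second call with the same order, amount, and execution returns the stored result without invoking the tool again, which is the property that prevents the duplicate refund.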

05 · Agent Memory: Governed, Not Accumulated

The default agent memory design — embed everything, retrieve top-k, forever — is a compliance incident with a vector index. The governed design separates three stores with different physics:

Tier	Store	Write rule	Retention
Working	Execution context (spine events)	Automatic, execution-scoped	Dies with the execution; archived in trace
Episodic	Vector DB (pgvector/managed)	Distilled summaries only, provenance-tagged, PII-screened at write	Tiered TTLs; per-tenant; subject-access queryable
Semantic	Curated knowledge (lakehouse + index)	Human-reviewed promotion from episodic patterns	Versioned like documentation

The write rule is where platforms diverge from demos: raw conversation never enters long-term memory. A distillation step extracts durable facts ("tenant prefers CSV exports", "order flow X requires VAT field") with provenance pointing back to the source trace — so every memory is auditable, correctable, and deletable. The disciplines are familiar from this series: embedding versions pinned and migrated atomically (LLM grading), retention as a store property with subject-access as a tested deliverable (learner telemetry). Memory is a feature store for agents; govern it like one.
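The write and retention rules can be illustrated with a small in-memory store. This sketch deliberately leaves out embeddings and similarity search; the MemoryRecord fields, the two volatility tiers, and their TTL values are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

TTL_BY_VOLATILITY = {  # hypothetical tiers; tune per fact class
    "volatile": timedelta(days=7),
    "stable": timedelta(days=365),
}


@dataclass(frozen=True)
class MemoryRecord:
    tenant: str
    subject_id: str
    fact: str              # distilled fact, never raw conversation
    volatility: str
    source_trace_id: str   # provenance back to the originating execution
    written_at: datetime

    def is_fresh(self, now: datetime) -> bool:
        return now - self.written_at <= TTL_BY_VOLATILITY[self.volatility]


class EpisodicStore:
    def __init__(self):
        self._records: list[MemoryRecord] = []

    def write(self, record: MemoryRecord) -> None:
        if not record.source_trace_id:
            raise ValueError("memory without provenance is rejected at write")
        self._records.append(record)

    def retrieve(self, tenant: str, now: datetime) -> list[MemoryRecord]:
        """Per-tenant isolation plus a freshness check at read time."""
        return [r for r in self._records if r.tenant == tenant and r.is_fresh(now)]

    def subject_access(self, subject_id: str) -> list[MemoryRecord]:
        return [r for r in self._records if r.subject_id == subject_id]

    def forget_subject(self, subject_id: str) -> int:
        before = len(self._records)
        self._records = [r for r in self._records if r.subject_id != subject_id]
        return before - len(self._records)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
store = EpisodicStore()
store.write(MemoryRecord("acme", "user-1", "tenant prefers CSV exports",
                         "stable", "7f3a", now - timedelta(days=30)))
store.write(MemoryRecord("acme", "user-1", "carrier delivers in 2 days",
                         "volatile", "6e2b", now - timedelta(days=30)))
fresh = store.retrieve("acme", now)  # only the stable fact is still within its TTL
```

Treating subject_access and forget_subject as first-class methods is what turns deletion into a testable deliverable instead of a manual cleanup.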

⚠️ Memory poisoning is a real attack surface: a user who can talk an agent into "remembering" a falsehood has injected persistent state into every future execution. Distillation review thresholds, provenance, and per-tenant isolation are security controls, not tidiness.

06 · Observability: Tracing Multi-Step Reasoning

A 40-step agent execution that produced a wrong answer is undebuggable from application logs. The trace makes it a five-minute investigation:

the debugging session — trace-first, not log-archaeology
-- "Execution 7f3a gave a customer the wrong delivery date. Why?"
SELECT step_n, event_type, summary FROM traces.steps WHERE trace_id = '7f3a' ORDER BY step_n;
-- step 12: memory.retrieved → chunk #c91 (score 0.71): a summary written
--          by an execution that predates the carrier policy change
-- step 13: plan reasoned from stale memory; tools were never consulted

-- root cause: episodic memory staleness, not model failure.
-- fix: TTL tier for carrier-policy facts + freshness check in retrieval.
-- regression test: replay exec 7f3a against the fixed memory layer:
CALL agent.replay('7f3a', memory_version => 'v2');   -- correct date produced

The platform SLOs that follow from trace data, all per strategy version: outcome rate (did the execution achieve its goal — measured against typed success criteria, not vibes), gate-escalation rate, step-count and token distributions (a drifting P95 step count is an early warning that prompts and reality have diverged), tool-failure rates per contract, and retrieval-relevance sampling. Replay is the killer feature — the same property our streaming and lakehouse platforms treat as foundational: any historical execution re-runnable against new prompts, new memory, new models, with diffs. It converts "I think the new prompt is better" into a backtest.
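As one example of an SLO derived from trace data, the step-count early warning can be sketched as a comparison of P95 values between a baseline window and the current window. The nearest-rank percentile and the 1.25 tolerance are assumptions chosen for illustration, not values from the platform.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without float error
    return ordered[rank - 1]


def step_count_drift(baseline_steps, current_steps, tolerance=1.25):
    """Flag a strategy version whose P95 step count has grown past the tolerance."""
    base = p95(baseline_steps)
    cur = p95(current_steps)
    return {
        "baseline_p95": base,
        "current_p95": cur,
        "drifting": cur > base * tolerance,
    }


baseline = list(range(1, 101))          # step counts from a reference window
current = [2 * n for n in baseline]     # the same executions taking twice as long
report = step_count_drift(baseline, current)
```

The same shape of check applies to token distributions; in both cases the comparison is meaningful only when it is keyed by strategy version.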

07 · Approval Gates and the Autonomy Ladder

Autonomy is not a property of the agent; it is a property of each action's consequence. The ladder, enforced by the orchestrator reading tool contracts:

Tier	Action class	Policy	Example
0	Read-only	Fully autonomous	Query a warehouse, fetch a doc
1	Reversible writes	Autonomous + audit trail	Draft a report, stage a file
2	Visible actions	Autonomous within budget, sampled review	Send an internal notification
3	Irreversible / external	Approval gate, always	Issue refund, send customer email, modify prod data

The gate itself is engineered like the review queues in our anti-cheat and grading platforms: an evidence pack assembles automatically (goal, the reasoning steps that led here, the exact action payload, relevant memory with provenance), decisions take seconds not minutes, approver agreement is calibrated, and every verdict feeds back as training signal. Two operational rules earned in practice: gates must be fast or they get bypassed culturally (queue SLAs are platform SLOs), and tier demotion is automatic — an agent strategy whose gate-rejection rate spikes loses autonomy tiers until a human re-certifies it. Trust is earned per strategy version, measured, and revocable.
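The ladder and the automatic demotion rule reduce to two small functions. In this sketch the policy names, the 20% rejection threshold, and the minimum sample of 20 verdicts are hypothetical; only the tier semantics come from the table above.

```python
POLICY_BY_TIER = {
    0: "autonomous",
    1: "autonomous_with_audit",
    2: "autonomous_sampled_review",
    3: "approval_gate",
}


def decide(consequence_tier: int, strategy_max_tier: int) -> str:
    """Return how the orchestrator should handle an action of a given tier."""
    if consequence_tier >= 3:
        return "approval_gate"  # irreversible or external: always gated
    if consequence_tier > strategy_max_tier:
        return "approval_gate"  # the strategy has not earned this tier
    return POLICY_BY_TIER[consequence_tier]


def demote_if_needed(strategy_max_tier, verdicts, threshold=0.2, min_sample=20):
    """Drop one autonomy tier when the recent gate-rejection rate spikes."""
    if len(verdicts) < min_sample:
        return strategy_max_tier  # too few decisions to judge
    rejection_rate = verdicts.count("rejected") / len(verdicts)
    if rejection_rate > threshold:
        return max(0, strategy_max_tier - 1)
    return strategy_max_tier


recent = ["rejected"] * 6 + ["approved"] * 14
new_max_tier = demote_if_needed(2, recent)  # 30% rejections, so the strategy drops a tier
```

Keeping the consequence tier on the tool contract and the earned tier on the strategy version means neither the model nor the prompt can talk its way past a gate.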

08 · Cost Engineering and A/B Testing Agent Strategies

Agent economics die in the dark: a strategy that solves the task in 9 steps at $0.04 and one that solves it in 31 steps at $0.31 look identical in the demo. The meter makes them comparable:

per-execution economics — the query that runs the platform
SELECT strategy_version,
       COUNT(*)                              AS executions,
       AVG(outcome_success::int)             AS success_rate,
       APPROX_PERCENTILE(total_cost, 0.5)    AS p50_cost,
       APPROX_PERCENTILE(total_cost, 0.95)   AS p95_cost,
       APPROX_PERCENTILE(steps_n, 0.95)      AS p95_steps,
       AVG(gate_escalations)                 AS gates_per_exec,
       SUM(total_cost) / NULLIF(SUM(outcome_success::int),0) AS cost_per_success
FROM gold.executions
WHERE started_at >= CURRENT_DATE - 14
GROUP BY 1 ORDER BY cost_per_success;   -- the number that decides the A/B

Cost-per-success — not cost-per-call, not tokens-per-day — is the platform's economic unit, and it is only computable because every step event carried its metering and every execution carried its typed outcome. Strategy A/B testing then works exactly like the experiment discipline in our recommendation and clickstream pieces: strategies (prompt version × model × tool policy × memory config) assigned per execution, pre-registered metrics, live readout on the spine, permanent holdback to keep claims honest — plus one agent-specific rule: cap experiments by consequence tier. New strategies earn tier-3 autonomy through gated production performance, never through offline evals alone.
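Per-execution assignment with a permanent holdback, and the cost-per-success readout, can be sketched as follows. The hash-bucket scheme, the 10% holdback share, and the function names are assumptions for illustration; the readout mirrors the NULLIF guard in the query above by returning None when a strategy has no successes.

```python
import hashlib


def assign_strategy(exec_id, candidates, holdback, holdback_pct=10):
    """Deterministically assign an execution to the holdback or a candidate strategy."""
    digest = hashlib.sha256(exec_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < holdback_pct:
        return holdback  # permanent holdback keeps later claims honest
    index = int.from_bytes(digest[8:16], "big") % len(candidates)
    return candidates[index]


def cost_per_success(executions):
    """Total cost divided by successful executions, per strategy version."""
    totals = {}
    for e in executions:
        cost, wins = totals.get(e["strategy"], (0.0, 0))
        totals[e["strategy"]] = (cost + e["total_cost"], wins + int(e["success"]))
    return {s: (cost / wins if wins else None) for s, (cost, wins) in totals.items()}


runs = [
    {"strategy": "nine_step", "total_cost": 0.04, "success": True},
    {"strategy": "nine_step", "total_cost": 0.04, "success": True},
    {"strategy": "thirty_one_step", "total_cost": 0.31, "success": True},
    {"strategy": "thirty_one_step", "total_cost": 0.31, "success": False},
]
readout = cost_per_success(runs)
```

Hashing the execution ID makes assignment reproducible during replay, and failed executions still count toward the numerator, which is why a strategy that fails often looks expensive even when each call is cheap.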

100% · Agent Steps Traced: The Non-Negotiable

4 · Autonomy Tiers: Gates at Consequence

$/✓ · Cost-per-Success: The Economic Unit

1B+/hr · Vipra Event-Spine Production Headroom

09 · Lessons Learned & Takeaways

Event-source from day one or rebuild in month four. Every team that started with mutable execution state hit the wall at resume/audit/replay and rebuilt. The log-first design costs a week early and saves a quarter later.
The duplicate side-effect is the rite of passage — skip it. Agents retry; networks lie; the double refund happens to everyone who skipped idempotency keys. Contract enforcement is cheaper than the apology.
Stale memory causes more wrong answers than weak models. Our trace analyses keep finding the same root cause: confident reasoning over outdated episodic memory. Freshness checks in retrieval and TTL tiers by fact volatility fixed more failures than any model upgrade.
Gates that are slow get bypassed; gates that are fast get trusted. The approval queue is a product with an SLA. When decisions took seconds with good evidence packs, teams added gates voluntarily; when they took hours, they architected around them.
Demos optimise capability; platforms optimise cost-per-success. The 31-step strategy that wows the room loses to the 9-step one in production every time the meter is on. Turn the meter on.
Autonomy is earned per strategy version. Tier promotion through measured production performance, automatic demotion on rejection-rate spikes. Trust is a number with a trend, not a launch decision.

📜 The log is the agent. Event-sourced execution spine; state is a fold; resume, replay, audit, and cost fall out for free.

🔧 Tools have contracts. Schemas both ways, idempotency, side-effect class, budgets, failure policy as data. The model is just another producer.

🧠 Memory is governed. Distilled writes with provenance, tiered TTLs, pinned embeddings, subject-access as a deliverable. Never raw, never forever.

🔍 Trace everything, replay anything. 100% step coverage; the trace ID joins cost, gates, and experiments; replay turns prompt opinions into backtests.

🚦 Gates at consequence boundaries. Four-tier autonomy ladder read from tool contracts; fast evidence-packed approvals; automatic tier demotion.

💸 Cost-per-success decides. Per-step metering → per-execution economics → strategy A/Bs with holdbacks, capped by consequence tier.

This is the practice behind VipraGo, our AI Workflow Operating System — and it composes every discipline in this series: the event-spine and exactly-once patterns, the LLM pipeline and review calibration, the feature/memory store parity, and the contract culture. For the sober map of what is production-ready in LLM data work, start with LLM-Augmented Data Pipelines.

FAQ · Frequently Asked Questions

What makes an agent platform different from an agent framework?

Frameworks give you the loop — planning, tool calls, retries. Platforms answer the production questions: event-sourced state that survives failure, traces that make reasoning debuggable, contracts that prevent duplicate side-effects, governed memory, approval gates at consequence boundaries, and per-execution cost attribution. The model is ~20% of the system; the platform is the rest.

How should agent memory be architected?

Three governed tiers: working memory scoped to the execution; episodic memory in a vector DB holding distilled, provenance-tagged, PII-screened summaries (never raw conversation) with TTLs by fact volatility; and semantic memory as human-curated, versioned knowledge. Embedding versions are pinned and migrated atomically, and subject-access/deletion are tested queries.

When does an agent action need human approval?

When it crosses a consequence boundary: the four-tier ladder runs from read-only (fully autonomous) through reversible and visible actions to irreversible/external actions (tier 3 — always gated). The tier lives on the tool contract, gates assemble evidence packs automatically, and strategies earn or lose autonomy tiers based on measured gate-rejection rates.

How do you A/B test agent strategies safely?

Strategies (prompt × model × tool policy × memory config) are assigned per execution on the event spine, with pre-registered metrics and a permanent holdback — and experiments are capped by consequence tier: new strategies earn high-autonomy deployment through gated production performance, never offline evals alone. The deciding metric is cost-per-success, computable because every step is metered.
