Change Data Capture (CDC) reads the database's write-ahead log and streams row-level changes; full load re-scans and overwrites on a schedule. The demos make CDC look free and full load look primitive. Production teaches the opposite lesson: CDC carries four standing operational costs nobody itemizes, and full load over partitioned scans is embarrassingly cheap, reliable, and schema-change-proof for a large class of tables.
This is a decision framework, not a tooling debate: per table, score freshness need, size, churn pattern, intermediate-state requirements, and your team's streaming operations capacity. The honest output for most estates is a mix — CDC where minutes matter and re-scans are impossible, full load everywhere it quietly wins.
Vipra runs both at scale: documented sub-3-minute CDC platforms where freshness pays for the operations, and partitioned full-load fleets where it doesn't. The operational hard parts of the CDC half are the companion piece — this article decides whether you need them at all.
01 · The Question Nobody Asks Before the Quickstart
The Debezium quickstart is so good it skips the architecture review: an afternoon later, changes are streaming, and a strategic decision has been made by default. The unasked question — does this table's downstream need to know about changes within minutes, or just by morning? — determines whether you have purchased real-time freshness or an unnecessary distributed system with a pager attached.
Be precise about what each strategy is. CDC is a standing subscription to a database's change log: always on, stateful, operationally alive. Full load is a stateless batch query: it runs, overwrites, exits, and holds no state between runs that could be corrupted. That statelessness is worth more than the demo makes it look.
02 · The Two Architectures, Side by Side
03 · What CDC Actually Costs You
Four standing costs, none on the quickstart page:
| Standing cost | What it means monthly | Who pays it |
|---|---|---|
| Connector fleet operations | Monitoring, restarts, upgrades, config-as-deployment discipline | Platform on-call, forever |
| Offset & snapshot state | State that must survive every failure mode — crash, redeploy, slot loss | The engineer running the four restart drills |
| Schema-evolution handling | Every DDL a source team ships without telling you | Whoever is paged when the topic goes quiet |
| Concentrated expertise | The streaming knowledge that walks out the door with one resignation | The org, discovered at the worst time |
None of these is an argument against CDC — our production platforms pay all four happily where sub-3-minute freshness creates product value. They are an argument for itemizing: the table-by-table framework in Section 06 charges these costs against each table's actual freshness requirement, and roughly half the CDC we audit in the wild fails that test.
04 · The Postgres Failure Modes That Bite at Scale
Failure 3 is the quiet budget killer because nothing is broken: the pipeline hums, the bill compounds. A churn audit — events emitted per distinct row per day — names the tables where CDC is paying streaming prices for batch outcomes. Status-flag and queue tables routinely emit 30–100 events per row per day; their downstream reads end-of-day state only. That is a full-load table wearing a CDC costume.
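The churn audit described above can be sketched in a few lines. This is a minimal illustration, assuming change events are already available as `(table, primary_key, day)` tuples; the function names, the sample data, and the default threshold of 30 (the low end of the range quoted above) are all hypothetical choices, not part of any tool.

```python
from collections import defaultdict


def churn_per_row_per_day(events):
    """events: iterable of (table, primary_key, day) tuples, one per change event.

    Returns {table: average events per distinct row per day}.
    """
    event_counts = defaultdict(int)    # (table, day) -> number of events
    distinct_rows = defaultdict(set)   # (table, day) -> primary keys seen that day
    for table, pk, day in events:
        event_counts[(table, day)] += 1
        distinct_rows[(table, day)].add(pk)

    daily_ratios = defaultdict(list)   # table -> one ratio per day
    for (table, day), count in event_counts.items():
        daily_ratios[table].append(count / len(distinct_rows[(table, day)]))

    return {t: sum(r) / len(r) for t, r in daily_ratios.items()}


def full_load_candidates(churn, threshold=30.0):
    """Tables whose churn meets the (assumed) threshold, sorted by name."""
    return sorted(t for t, ratio in churn.items() if ratio >= threshold)


# Made-up sample: two status rows rewritten 40 times each, ten customers touched once.
sample_events = (
    [("order_status", 1, "2024-01-01")] * 40
    + [("order_status", 2, "2024-01-01")] * 40
    + [("customers", pk, "2024-01-01") for pk in range(10)]
)
sample_churn = churn_per_row_per_day(sample_events)
```

Tables returned by `full_load_candidates` are only candidates: the audit says where to look, and the freshness question still decides whether the table moves to a schedule.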
05 · When Full Load Is Genuinely Cheaper
The partitioned full load, boring on purpose (Spark):

```python
# Parallel read, partitioned by PK ranges: minutes for tens of millions of rows.
df = (
    spark.read.format("jdbc")
    .option("url", PG_URL)
    .option("dbtable", "public.order_status")
    .option("partitionColumn", "id")
    .option("numPartitions", 16)
    .option("lowerBound", bounds.min)
    .option("upperBound", bounds.max)
    .load()
)

(
    df.write.format("delta")
    .mode("overwrite")                        # idempotent by construction:
    .option("replaceWhere", f"ds = '{ds}'")   # re-run = same result, no state
    .save(path)
)

# Schema change upstream? The next run just... loads it.
# Failure recovery? Re-run. Operational training? This paragraph.
```
Full load wins decisively when: tables are under ~5M rows (the scan finishes in minutes); freshness of hours satisfies every consumer (most finance, HR, and reference data); schemas churn (the re-run absorbs DDL that would stall a connector for a day); churn-to-size ratios are pathological (Section 04's failure 3); or sources are third-party databases where log access is unavailable anyway. The hybrid worth knowing: high-water-mark incremental loads (WHERE updated_at > last_run) buy most of full load's simplicity at a fraction of the scan, but only when the source has a trustworthy updated_at, and that is a real "if."
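To make the high-water-mark hybrid concrete, here is a small sketch using in-memory SQLite as a stand-in for both source and target. The table layout, the `incremental_load` helper, and the sample rows are assumptions for illustration only; the final lines show the failure mode where a bulk update leaves `updated_at` untouched and the changed row is never picked up.

```python
import sqlite3


def make_db():
    """In-memory stand-in for a source or target database (illustrative only)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE order_status (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    return db


def incremental_load(source, target, last_run):
    """Copy rows changed since last_run; return the new high-water mark."""
    rows = source.execute(
        "SELECT id, status, updated_at FROM order_status WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    # Replace by primary key so re-running the same window is harmless.
    target.executemany(
        "INSERT OR REPLACE INTO order_status (id, status, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # Advance the mark only to the newest timestamp actually seen.
    return max((r[2] for r in rows), default=last_run)


source, target = make_db(), make_db()
source.executemany(
    "INSERT INTO order_status VALUES (?, ?, ?)",
    [(1, "shipped", "2024-01-01T10:00:00"), (2, "pending", "2024-01-02T09:00:00")],
)
mark = incremental_load(source, target, "2024-01-01T12:00:00")  # picks up row 2 only

# A bulk update that does not touch updated_at is invisible to the next run.
source.execute("UPDATE order_status SET status = 'cancelled' WHERE id = 1")
mark_after = incremental_load(source, target, mark)  # copies nothing
```

ISO-8601 timestamps stored as text compare correctly as strings, which is why the sketch can use a plain `>` on `updated_at`.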
06 · The Decision Framework, Table by Table
| Question | Points toward CDC | Points toward full load |
|---|---|---|
| Downstream freshness need? | Minutes — and a consumer can name why | Hours/daily satisfies everyone who actually answers |
| Table size vs scan cost? | Re-scan measured in hours or source pain | Scan finishes inside a coffee |
| Intermediate states needed? | Audit, event sourcing, change analytics | End-of-period state is the product |
| Churn per row per day? | Low — most rows change rarely | 30×+ — CDC pays streaming prices for batch outcomes |
| Schema stability? | Stable, contract-governed | Source team ships DDL on vibes |
| Team streaming ops capacity? | Connector fleet has a real owner | The "Kafka person" is one resignation deep |
Score honestly and most estates land on a mix — which is the correct answer. The freshness question deserves rigor: ask the consumer what decision changes with minutes-fresh data. "The dashboard would be fresher" is not a decision; "we re-price inventory intra-day" is. Vague freshness requirements are how half the world's unnecessary CDC got deployed.
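One way to make the scoring repeatable is to encode the six questions as a small data structure. The sketch below is an assumption-heavy illustration: equal weights, the `min_votes=4` cutoff, the churn limit of 30, and the rule that CDC needs at least one of minutes-level freshness or intermediate states are all hypothetical choices, not a calibrated model.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    name: str
    needs_minutes_freshness: bool    # a consumer can name the decision that changes
    rescan_is_painful: bool          # re-scan takes hours or hurts the source
    needs_intermediate_states: bool  # audit, event sourcing, change analytics
    events_per_row_per_day: float    # from the churn audit
    schema_is_stable: bool           # stable, contract-governed
    streaming_ops_owned: bool        # connector fleet has a real owner


def cdc_votes(t, churn_limit=30.0):
    """Count how many of the six questions point toward CDC (equal weights assumed)."""
    return sum([
        t.needs_minutes_freshness,
        t.rescan_is_painful,
        t.needs_intermediate_states,
        t.events_per_row_per_day < churn_limit,
        t.schema_is_stable,
        t.streaming_ops_owned,
    ])


def recommend(t, min_votes=4):
    # Assumed gate: without a freshness or intermediate-state need,
    # CDC has nothing to pay for, whatever the other answers are.
    if not (t.needs_minutes_freshness or t.needs_intermediate_states):
        return "full load"
    return "cdc" if cdc_votes(t) >= min_votes else "full load"


status_flags = TableProfile("order_status", False, False, False, 80.0, False, True)
fraud_events = TableProfile("fraud_events", True, True, True, 2.0, True, True)
```

With these made-up profiles, `recommend(status_flags)` routes the table to a schedule and `recommend(fraud_events)` keeps it on CDC. The value is less in the numbers than in forcing every table to answer all six questions.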
07 · Production Evidence: Both Strategies, Same Estate
Vipra's documented work runs both, deliberately. CDC where it earns its keep: the real-time LXP platform (Kafka + CDC + BigQuery, sub-3-minute end-to-end, replacing nightly batch for millions of learners — the freshness was the product) and the fraud-detection class of architectures, where intermediate states are the signal. Full load where it quietly wins: the same estates' reference tables, finance marts, and third-party sources run partitioned overwrites on schedules, operated by the broader team without streaming expertise. The legacy modernization engagement (10h → sub-2h nightly) is largely disciplined full-load engineering — proof that batch, done well, is a performance story too. The operational hard parts of the CDC half are documented in the companion article.
08 · Lessons Learned: The Hard Truths
- The quickstart makes the decision before the architect does. CDC estates mostly grow by default, not by framework. Run the table-by-table audit once a year; demote the tables that fail it.
- "Real-time" requirements evaporate under one question. Ask what decision changes with minutes-fresh data. In our audits, most named consumers couldn't name one, and their tables moved to schedules without complaint.
- WAL retention is the failure that takes down the source. Every other CDC failure hurts the pipeline; an unmonitored slot hurts the production database. It is the first alert to configure, before the first topic.
- Churn audits embarrass every estate the first time. The status-flag table emitting 80 events/row/day for an end-of-day consumer is in your estate too. Events-per-row-per-day is a one-query audit; run it.
- Full load's testability is an underpriced asset. Stateless, idempotent, re-runnable jobs are testable by juniors and debuggable at 9am. The operational maturity CDC demands costs real money; not needing it saves real money too.
- Hybrids age badly without ownership. High-water-mark loads silently miss rows when updated_at lies (bulk updates, restored backups). Trust the column or don't — and verify with a weekly full-load reconciliation either way.
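The weekly reconciliation mentioned in the last lesson can be as simple as comparing row fingerprints between a full source scan and the target. The following is a minimal sketch, assuming both sides fit in memory as `{primary_key: row_tuple}` dictionaries; the helper names and the choice of SHA-256 over `repr` of the values are illustrative assumptions, and a real estate would more likely compare per-partition aggregates.

```python
import hashlib


def row_fingerprint(row):
    """Stable hash of a row's values, so rows can be compared without shipping them whole."""
    joined = "\x1f".join(repr(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows):
    """Both arguments: {primary_key: tuple_of_values}. Returns the keys needing repair."""
    src = {pk: row_fingerprint(row) for pk, row in source_rows.items()}
    tgt = {pk: row_fingerprint(row) for pk, row in target_rows.items()}
    return {
        "missing": sorted(src.keys() - tgt.keys()),  # in source, never loaded
        "extra": sorted(tgt.keys() - src.keys()),    # deleted upstream, still in target
        "stale": sorted(pk for pk in src.keys() & tgt.keys() if src[pk] != tgt[pk]),
    }
```

A non-empty `stale` list is the signal that `updated_at` is not telling the truth for that table.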
09 · Key Takeaways for Practitioners
"What decision changes with minutes-fresh data?" Vague answers route the table to a schedule.
Connector ops, state recovery, schema handling, concentrated expertise — charged per table, not per platform.
Events per distinct row per day, one query. Status flags and queues at 30×+ are full-load tables in costume.
max_slot_wal_keep_size + WAL alerts before the first topic. The unmonitored slot takes down the source, not the pipeline.
Partitioned overwrite: idempotent, schema-proof, junior-operable. A performance story when done well: 10h → sub-2h documented.
CDC where minutes and intermediate states pay; schedules everywhere else. Re-audit annually; demote without sentiment.
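For the WAL alert named in the takeaways, a hedged sketch of the check: the SQL uses the standard `pg_replication_slots` view with `pg_wal_lsn_diff` and `pg_current_wal_lsn`, while the `slot_alerts` helper, the 5 GiB warning threshold, and the slot names in the sample are assumptions for illustration. Running the query requires a Postgres connection, so the sketch evaluates rows that have already been fetched.

```python
# Query to run against the source database; one row per replication slot.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def slot_alerts(rows, warn_bytes=5 * 1024**3):
    """rows: (slot_name, active, retained_bytes) tuples as returned by SLOT_LAG_SQL.

    warn_bytes is an assumed threshold; size it against the disk you can afford to lose.
    """
    alerts = []
    for slot_name, active, retained_bytes in rows:
        if retained_bytes is None:
            continue  # slot has not reserved any WAL yet
        if not active:
            # An inactive slot keeps retaining WAL with nobody consuming it.
            alerts.append(f"{slot_name}: inactive, retaining {retained_bytes} bytes")
        elif retained_bytes >= warn_bytes:
            alerts.append(f"{slot_name}: lagging, retaining {retained_bytes} bytes")
    return alerts


# Made-up rows standing in for a query result.
sample_rows = [
    ("debezium_orders", True, 10 * 1024**3),
    ("debezium_old", False, 1024),
    ("debezium_ref", True, 1024),
    ("debezium_new", True, None),
]
sample_alerts = slot_alerts(sample_rows)
```

The alert complements `max_slot_wal_keep_size` rather than replacing it: the setting caps the damage to the source, and the alert tells you before the cap invalidates the slot.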
Decided on CDC for a table? The operational survival guide is Debezium + Kafka + Flink: The Hard Parts. The documented platforms: sub-3-minute LXP streaming and the full-load modernization that cut nights 80%.