CDC vs Full Load: When Each Strategy Actually Hurts You

Q: When should I use CDC instead of full table loads?

Use CDC when consumers need minutes-fresh data, the table is too large to re-scan economically, or you need every intermediate row state for audit or event-driven consumers. If none of those hold, a scheduled full or incremental batch load is usually cheaper to build and far cheaper to operate.

Q: What is the biggest production risk of Postgres CDC?

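Replication-slot WAL retention: if the CDC connector stops, Postgres retains write-ahead log for the slot indefinitely and can fill the primary's disk, taking down the source database. Set max_slot_wal_keep_size (available from Postgres 13; its default of -1 means no limit), alert on retained WAL volume, and treat a stopped connector as an incident. The cap is a trade: once a slot exceeds it, Postgres invalidates the slot and the connector needs a fresh snapshot, which is still better than losing the primary.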

Q: Is CDC more expensive than batch loading?

Per-byte moved, CDC is efficient; per-month operated, it carries standing costs — connector infrastructure, offset management, schema-registry governance, and on-call expertise. For small or slowly-consumed tables, those standing costs exceed what daily full loads would ever cost.

Q: Can I mix CDC and full loads in one platform?

Yes — that is the pattern mature platforms converge on: CDC for the handful of large, hot, freshness-critical tables; scheduled batch for the long tail. One orchestrator, two ingestion patterns, each where it is cheapest.

Executive Summary

Change Data Capture reads the database's write-ahead log and streams row-level changes; full load re-scans and overwrites on a schedule. The demos make CDC look free and full load look primitive. Production teaches the opposite nuance: CDC carries four standing operational costs nobody itemizes, and full load over partitioned scans is embarrassingly cheap, reliable, and schema-change-proof for a large class of tables.

This is a decision framework, not a tooling debate: per table, score freshness need, size, churn pattern, intermediate-state requirements, and your team's streaming operations capacity. The honest output for most estates is a mix — CDC where minutes matter and re-scans are impossible, full load everywhere it quietly wins.

Vipra runs both at scale: documented sub-3-minute CDC platforms where freshness pays for the operations, and partitioned full-load fleets where it doesn't. The operational hard parts of the CDC half are the companion piece — this article decides whether you need them at all.

01 · The Question Nobody Asks Before the Quickstart

The Debezium quickstart is so good it skips the architecture review: an afternoon later, changes are streaming, and a strategic decision has been made by default. The unasked question — does this table's downstream need to know about changes within minutes, or just by morning? — determines whether you have purchased real-time freshness or an unnecessary distributed system with a pager attached.

Be precise about what each strategy is. CDC is a standing subscription to a database's change log: always on, stateful, operationally alive. Full load is a stateless batch query: it runs, overwrites, exits, and holds no state between runs that could be corrupted. That second property is worth more than the demo makes it look.

02 · The Two Architectures, Side by Side

CDC path
WAL → replication slot → Debezium → Kafka → stream processor → merge to sink. Five stateful components; latency in seconds-to-minutes; every intermediate row state preserved.

full-load path
Scheduled parallel SELECT → partitioned overwrite in the lake/warehouse. Zero standing state; latency = schedule; end-state only.

ops surface
CDC: connector fleet, offsets, schema registry, snapshot recovery, slot monitoring. Full load: a cron entry and a row-count check.

failure modes
CDC: silent stalls, WAL retention, duplicate replays, snapshot gaps. Full load: a late run that is visible, retryable, boring.

unique powers
CDC: minutes-fresh data, every intermediate state (audit/event sourcing), no repeated source load. Full load: schema churn handled by re-run; trivially testable; junior-operable.
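The "duplicate replays" failure mode is why the merge step at the end of the CDC path has to be idempotent. A minimal sketch of one way to do that, assuming each event carries a monotonically increasing log position: the `ChangeEvent` shape below is a simplified, hypothetical stand-in and not Debezium's actual envelope, and `merge_events` and `live_rows` are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """Simplified, hypothetical change event (not Debezium's real envelope)."""
    pk: int
    lsn: int                # log position; higher means later
    op: str                 # "c" create, "u" update, "d" delete
    row: Optional[dict]     # new row image, None for deletes


# The sink maps pk -> (last applied lsn, row image or None as a delete tombstone).
Sink = Dict[int, Tuple[int, Optional[dict]]]


def merge_events(sink: Sink, events: Iterable[ChangeEvent]) -> None:
    """Apply events to the sink in place, skipping anything already applied."""
    for ev in events:
        applied = sink.get(ev.pk)
        if applied is not None and ev.lsn <= applied[0]:
            continue  # duplicate or stale replay
        # Deletes keep a tombstone so a replayed create cannot resurrect the row.
        sink[ev.pk] = (ev.lsn, None if ev.op == "d" else ev.row)


def live_rows(sink: Sink) -> Dict[int, dict]:
    """Rows currently visible in the sink, with tombstones filtered out."""
    return {pk: row for pk, (_, row) in sink.items() if row is not None}
```

Replaying the same batch twice leaves the sink unchanged, which is the property a restart after a lost offset depends on. A full load gets the same guarantee for free because each run overwrites rather than merges.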

03 · What CDC Actually Costs You

Four standing costs, none on the quickstart page:

Standing cost	What it means monthly	Who pays it
Connector fleet operations	Monitoring, restarts, upgrades, config-as-deployment discipline	Platform on-call, forever
Offset & snapshot state	State that must survive every failure mode — crash, redeploy, slot loss	The engineer running the four restart drills
Schema-evolution handling	Every DDL a source team ships without telling you	Whoever is paged when the topic goes quiet
Concentrated expertise	The streaming knowledge that walks out the door with one resignation	The org, discovered at the worst time

None of these is an argument against CDC — our production platforms pay all four happily where sub-3-minute freshness creates product value. They are an argument for itemizing: the table-by-table framework in Section 06 charges these costs against each table's actual freshness requirement, and roughly half the CDC we audit in the wild fails that test.

04 · The Postgres Failure Modes That Bite at Scale

FAILURE 1: replication-slot WAL retention
connector stops (crash · redeploy · registry stall)
→ Postgres retains WAL for the slot indefinitely (default: unbounded)
→ disk fills at WAL-write rate, not data-growth rate
→ primary OLTP database down (we have watched this happen)
defense: max_slot_wal_keep_size · WAL-volume alerts · stopped connector = page

FAILURE 2: initial snapshot on a hot table
default snapshot = full table read, lock-sensitive
MySQL: global read lock stalls writers · Postgres: long txn blocks vacuum → bloat
defense: incremental (watermark) snapshots · schedule like the migration it is

FAILURE 3: high-churn table floods the topic
row updated 50×/day → 50 events/row/day
Kafka throughput + storage + merge compute
…to reconstruct the same end-of-day state one full-load scan would deliver
worst offenders: status flags · queue tables · session tables
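For Failure 1, the WAL-volume alert can be sketched as a query plus a small decision function. The query uses the `pg_replication_slots` view and `pg_wal_lsn_diff`, and is meant to run on the primary. The function name `slots_to_page`, the alert wording, and the byte threshold are assumptions for illustration, not a recommended value.

```python
from typing import Iterable, List, NamedTuple, Optional

# Run on the primary with any Postgres client; returns one row per slot.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


class SlotStatus(NamedTuple):
    slot_name: str
    active: bool
    retained_bytes: Optional[int]  # NULL when the slot has no restart_lsn yet


def slots_to_page(rows: Iterable[tuple], max_retained_bytes: int) -> List[str]:
    """Return one alert message per slot that should page someone."""
    alerts = []
    for row in rows:
        slot = SlotStatus(*row)
        retained = slot.retained_bytes or 0
        if not slot.active:
            # A stopped connector is an incident regardless of current volume.
            alerts.append(f"{slot.slot_name}: inactive, retaining {retained} bytes of WAL")
        elif retained > max_retained_bytes:
            alerts.append(f"{slot.slot_name}: active but retaining {retained} bytes of WAL")
    return alerts
```

Paging on an inactive slot even when little WAL is retained follows the "stopped connector = page" rule: the volume will grow at WAL-write rate until someone intervenes.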

Failure 3 is the quiet budget killer because nothing is broken: the pipeline hums, the bill compounds. A churn audit — events emitted per distinct row per day — names the tables where CDC is paying streaming prices for batch outcomes. Status-flag and queue tables routinely emit 30–100 events per row per day; their downstream reads end-of-day state only. That is a full-load table wearing a CDC costume.
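The churn audit reduces to a small aggregation. The sketch below assumes you can export change events as `(table, primary_key, day)` tuples, for example from topic metadata or an audit log; the function names are hypothetical and the default threshold of 30 comes from the range quoted above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def events_per_row_per_day(events: Iterable[Tuple[str, int, str]]) -> Dict[str, float]:
    """Map each table to its mean events per distinct row per day."""
    event_counts = defaultdict(int)    # (table, day) -> number of events
    distinct_rows = defaultdict(set)   # (table, day) -> primary keys seen
    for table, pk, day in events:
        event_counts[(table, day)] += 1
        distinct_rows[(table, day)].add(pk)

    daily_ratios = defaultdict(list)   # table -> one ratio per day
    for (table, day), count in event_counts.items():
        daily_ratios[table].append(count / len(distinct_rows[(table, day)]))
    return {table: sum(r) / len(r) for table, r in daily_ratios.items()}


def full_load_candidates(churn: Dict[str, float], threshold: float = 30.0) -> List[str]:
    """Tables whose churn suggests CDC is paying streaming prices for batch outcomes."""
    return sorted(table for table, ratio in churn.items() if ratio >= threshold)
```

A high ratio alone is not the verdict: it only matters when the downstream reads end-of-day state, so pair the output with the freshness question before demoting a table.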

05 · When Full Load Is Genuinely Cheaper

the partitioned full load — boring on purpose (Spark)
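# Parallel read, partitioned by PK ranges: minutes for tens of millions of rows.
# Assumed to exist already: spark (SparkSession), PG_URL (JDBC URL), ds (load-date
# string), path (Delta table location), bounds (min/max of id, fetched beforehand).
from pyspark.sql import functions as F

df = (spark.read.format("jdbc")
      .option("url", PG_URL)
      .option("dbtable", "public.order_status")
      .option("partitionColumn", "id")     # must be numeric, date, or timestamp
      .option("numPartitions", 16)
      .option("lowerBound", bounds.min).option("upperBound", bounds.max)
      .load()
      .withColumn("ds", F.lit(ds)))        # stamp the partition that replaceWhere targets

(df.write.format("delta")
   .mode("overwrite")                      # idempotent by construction:
   .option("replaceWhere", f"ds = '{ds}'") # re-run = same result, no state
   .option("mergeSchema", "true")          # lets added source columns through
   .save(path))
# schema change upstream? the next run just... loads it.
# failure recovery? re-run. operational training? this paragraph.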

Full load wins decisively when: tables are under ~5M rows (the scan finishes in minutes); freshness of hours satisfies every consumer (most finance, HR, and reference data); schemas churn (the re-run absorbs DDL that would stall a connector for a day); churn-to-size ratios are pathological (Section 04's failure 3); or sources are third-party databases where log access is unavailable anyway. The hybrid worth knowing: high-water-mark incremental loads (WHERE updated_at > last_run) buy most of full load's simplicity at a fraction of the scan — when the source has a trustworthy updated_at, which is a real "when."
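The high-water-mark hybrid can be sketched end to end with the standard library. This is a minimal illustration, assuming an `order_status` table with columns `(id, status, updated_at)` and using `sqlite3` connections as stand-ins for the real source and target; `incremental_load` is a hypothetical name.

```python
import sqlite3


def incremental_load(source: sqlite3.Connection,
                     target: sqlite3.Connection,
                     last_run: str) -> str:
    """Copy rows changed since last_run and return the new high-water mark."""
    rows = source.execute(
        "SELECT id, status, updated_at FROM order_status "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    ).fetchall()
    # Upsert by primary key so a re-run with the same watermark is harmless.
    target.executemany(
        "INSERT OR REPLACE INTO order_status (id, status, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # Advance the mark only to what was actually read, never to "now".
    return rows[-1][2] if rows else last_run
```

Two limits are visible in the sketch itself: deleted source rows never match the filter, and a row written with an `updated_at` at or below the stored mark is skipped. Both are reasons the approach depends on a trustworthy `updated_at`.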

06 · The Decision Framework, Table by Table

Question	Points toward CDC	Points toward full load
Downstream freshness need?	Minutes — and a consumer can name why	Hours/daily satisfies everyone who actually answers
Table size vs scan cost?	Re-scan measured in hours or source pain	Scan finishes inside a coffee
Intermediate states needed?	Audit, event sourcing, change analytics	End-of-period state is the product
Churn per row per day?	Low — most rows change rarely	30×+ — CDC pays streaming prices for batch outcomes
Schema stability?	Stable, contract-governed	Source team ships DDL on vibes
Team streaming ops capacity?	Connector fleet has a real owner	The "Kafka person" is one resignation deep

Score honestly and most estates land on a mix — which is the correct answer. The freshness question deserves rigor: ask the consumer what decision changes with minutes-fresh data. "The dashboard would be fresher" is not a decision; "we re-price inventory intra-day" is. Vague freshness requirements are how half the world's unnecessary CDC got deployed.
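One way to make the table-by-table scoring repeatable is to encode the six questions as fields and count the answers. The sketch below is an assumption-heavy illustration: the field names, the simple vote count, and the `min_votes` cutoff of 4 are all hypothetical choices, and only the 30× churn threshold comes from the table above.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    name: str
    needs_minutes_fresh: bool        # a consumer can name the decision that changes
    rescan_is_painful: bool          # re-scan takes hours or hurts the source
    needs_intermediate_states: bool  # audit, event sourcing, change analytics
    events_per_row_per_day: float    # from a churn audit
    schema_is_stable: bool           # contract-governed DDL
    has_streaming_owner: bool        # the connector fleet has a real owner


def cdc_votes(t: TableProfile, churn_threshold: float = 30.0) -> int:
    """Count how many of the six questions point toward CDC."""
    return sum([
        t.needs_minutes_fresh,
        t.rescan_is_painful,
        t.needs_intermediate_states,
        t.events_per_row_per_day < churn_threshold,
        t.schema_is_stable,
        t.has_streaming_owner,
    ])


def recommend(t: TableProfile, min_votes: int = 4) -> str:
    """Return "cdc" or "full_load" under the assumed majority rule."""
    has_driver = (t.needs_minutes_fresh or t.rescan_is_painful
                  or t.needs_intermediate_states)
    if not has_driver:
        return "full_load"  # nothing justifies the standing costs
    return "cdc" if cdc_votes(t) >= min_votes else "full_load"
```

The early return mirrors the rule that CDC needs at least one real driver (freshness, scan cost, or intermediate states); the remaining questions only decide whether the team can afford it. Treat the output as a prompt for the conversation with the consumer, not a replacement for it.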

07 · Production Evidence: Both Strategies, Same Estate

Vipra's documented work runs both, deliberately. CDC where it earns its keep: the real-time LXP platform (Kafka + CDC + BigQuery, sub-3-minute end-to-end, replacing nightly batch for millions of learners — the freshness was the product) and the fraud-detection class of architectures, where intermediate states are the signal. Full load where it quietly wins: the same estates' reference tables, finance marts, and third-party sources run partitioned overwrites on schedules, operated by the broader team without streaming expertise. The legacy modernization engagement (10h → sub-2h nightly) is largely disciplined full-load engineering — proof that batch, done well, is a performance story too. The operational hard parts of the CDC half are documented in the companion article.

<3min · CDC Where It Pays (Vipra Production)
10h→2h · Full Load Done Well (Legacy Modernization)
~50% · Wild CDC That Fails the Framework (Audit Est.)
Standing Costs · Itemize Before Deploying

08 · Lessons Learned: The Hard Truths

The quickstart makes the decision before the architect does. CDC estates mostly grow by default, not by framework. Run the table-by-table audit once a year; demote the tables that fail it.
"Real-time" requirements evaporate under one question. Ask what decision changes with minutes-fresh data. In our audits, most consumers we asked couldn't name one, and their tables moved to schedules without complaint.
WAL retention is the failure that takes down the source. Every other CDC failure hurts the pipeline; an unmonitored slot hurts the production database. It is the first alert to configure, before the first topic.
Churn audits embarrass every estate the first time. The status-flag table emitting 80 events/row/day for an end-of-day consumer is in your estate too. Events-per-row-per-day is a one-query audit; run it.
Full load's testability is an underpriced asset. Stateless, idempotent, re-runnable jobs are testable by juniors and debuggable at 9am. The operational maturity CDC demands costs real money; not needing it is a real saving.
Hybrids age badly without ownership. High-water-mark loads silently miss rows when updated_at lies (bulk updates, restored backups). Trust the column or don't — and verify with a weekly full-load reconciliation either way.
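The weekly reconciliation in the last lesson can be as simple as comparing per-row fingerprints between a full source extract and the target. A minimal sketch, assuming both sides fit in memory and that each row is a tuple whose first element is the primary key; `row_fingerprint` and `reconcile` are hypothetical names.

```python
import hashlib
from typing import Dict, Iterable, List


def row_fingerprint(row: tuple) -> str:
    """Stable hash of a row's values, used to detect silent drift."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def reconcile(source_rows: Iterable[tuple],
              target_rows: Iterable[tuple]) -> Dict[str, List]:
    """Compare a full source extract with the target, keyed on the first column."""
    src = {row[0]: row_fingerprint(row) for row in source_rows}
    tgt = {row[0]: row_fingerprint(row) for row in target_rows}
    return {
        # In the source but never loaded: updated_at lied or the mark skipped it.
        "missing": sorted(src.keys() - tgt.keys()),
        # In the target only: usually a source delete the incremental load cannot see.
        "extra": sorted(tgt.keys() - src.keys()),
        # Present on both sides with different values: a missed update.
        "stale": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }
```

Any non-empty bucket is a signal to distrust the column and fall back to a full load for that table until the cause is understood.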

09 · Key Takeaways for Practitioners

Freshness needs a named decision. "What decision changes with minutes-fresh data?" Vague answers route the table to a schedule.
Itemize the standing costs. Connector ops, state recovery, schema handling, concentrated expertise: charged per table, not per platform.
Audit churn-per-row. Events per distinct row per day, one query. Status flags and queues at 30×+ are full-load tables in costume.
Guard the slot first. max_slot_wal_keep_size + WAL alerts before the first topic. The unmonitored slot takes down the source, not the pipeline.
Respect boring full loads. Partitioned overwrite: idempotent, schema-proof, junior-operable. A performance story when done well (10h → 2h documented).
The right answer is a mix. CDC where minutes and intermediate states pay; schedules everywhere else. Re-audit annually; demote without sentiment.

Decided on CDC for a table? The operational survival guide is Debezium + Kafka + Flink: The Hard Parts. The documented platforms: sub-3-minute LXP streaming and the full-load modernization that cut nights 80%.

FAQ · Frequently Asked Questions

When should I use CDC instead of full table loads?