Debezium reads the database log, Kafka transports, Flink transforms — the quickstart works in an afternoon and the architecture is genuinely right. What the quickstart omits is that you now operate a distributed system whose failure modes live at the seams: connector↔offset, snapshot↔stream, producer↔registry, event-time↔processing-time, and Flink↔your-actual-sink.
This article walks each seam with the failure mode, the detection signal, and the defense — including the configuration blocks and the restart drills. None of it is theoretical: these are the operational scars and runbooks from Vipra's production CDC platforms.
Production base: documented Vipra streaming engagements running sub-3-minute end-to-end latency (Kafka CDC platform serving millions of learners) and 1B+ events/hour (Flink telemetry). The companion piece on the decision itself — when CDC is even worth operating — is CDC vs Full Load.
01 · The Demo Lies by Omission
Nothing in the quickstart is false. Debezium really does turn a Postgres WAL into Kafka topics in twenty minutes; Flink really does join and aggregate them with event-time correctness. The omission is operational: the demo never crashes mid-offset-flush, never receives a surprise DDL from a source team, never pauses a connector over a long weekend while WAL accumulates toward the primary's disk limit. Production is made of exactly those moments.
The honest mental model: CDC is a distributed system you operate forever, with five seams where state lives in two places at once. Every hard part below is one of those seams. Teams that map them in week one run quiet pipelines; teams that discover them in incident reviews run the same pipelines, eventually, with scars.
02 · The Architecture — Annotated with Its Failure Points
03 · Connector Restarts: At-Least-Once Means Duplicates
Debezium flushes offsets on an interval. Crash between event-publish and offset-flush and the restart replays those events — correct at-least-once behaviour that every downstream must absorb. The defense is mechanical:
idempotent sink — the dedup key is (PK + log position)-- every Debezium event carries its log coordinates; use them MERGE INTO dwh.customers t USING staging.cdc_customers s ON t.customer_id = s.customer_id WHEN MATCHED AND s.source_lsn > t._last_lsn THEN UPDATE SET ... -- newer only WHEN NOT MATCHED THEN INSERT ...; -- replayed events carry an LSN ≤ _last_lsn → no-op, by construction. -- append-only sinks instead dedupe on (pk, lsn) before merge.
Teams discover this during their first incident review, as doubled revenue in a dashboard. The second discovery, usually a month later: connector config changes are deployments — several innocuous-looking edits reset offsets or trigger re-snapshots depending on snapshot.mode. Config goes through the same review-and-staging path as code, with the restart drills of Section 04 rehearsed before they happen unrehearsed.
04 · The Snapshot/Streaming Boundary
"Initial snapshot, then streaming" sounds atomic. It isn't — it's a state machine with four restart scenarios, and a wrong snapshot.mode on restart either re-runs a full snapshot into a live topic (hours of duplicate firehose) or skips to streaming and silently misses everything since the slot was created.
Production settings that earn their keep: incremental (signal-based) snapshots for any table that would hold locks or run hours (the watermark-chunked algorithm is resumable and writer-friendly); snapshot.mode=when_needed chosen deliberately and documented; and drill #4 treated as the disaster it is — a recreated slot means a gap, and the runbook's answer is a scoped backfill (one bounded full load into the same idempotent sink), not hope.
05 · Schema Registry: Where Pipelines Stall Quietly
The failure script, verbatim from production: source team ships a NOT NULL column with no default → Debezium emits the new schema → registry rejects it under BACKWARD compatibility → connector parks in a retry loop → the topic goes quiet. No downstream errors. Just silence, discovered by a consumer's SLA breach hours later.
| Defense | Mechanism | Why it works |
|---|---|---|
| Alert on absence | No events on a hot topic for N minutes = page | Silence is this failure's only symptom; make silence loud |
| Compatibility as a negotiated contract | FULL_TRANSITIVE if you can win the fight; BACKWARD as documented floor | The mode is a producer-consumer agreement, not a platform default |
| DDL through the contract process | Source schema changes reviewed against consumer contracts in CI | You learn about the column in a PR, not in registry logs — the contracts playbook, applied upstream |
| Dead-letter with replay | Unparseable events to a DLQ with offset bookmarks | One poison message must not dam the river |
06 · Late and Out-of-Order Events in Flink
Event-time windows need watermarks; watermarks are a bet about lateness. CDC makes the bet treacherous: a paused connector, a network partition, or a source-side batch update can deliver minutes of "old" events in one burst. Bet too tight and those events drop from closed windows; too loose and every aggregate waits.
the watermark + lateness contract (Flink, production defaults)WatermarkStrategy .<ChangeEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30)) .withTimestampAssigner((e, ts) -> e.sourceTimestampMs()) // DB commit time .withIdleness(Duration.ofSeconds(60)); // ⚠ quiet table ≠ stalled clock stream.keyBy(ChangeEvent::table) .window(TumblingEventTimeWindows.of(Time.minutes(1))) .allowedLateness(Time.minutes(5)) // connector-pause absorber .sideOutputLateData(lateTag) // beyond that: explicit fate .aggregate(new Agg(), new Emit()); // late side output → correction path (upsert the affected window in the sink) // — aggregates heal instead of silently lying
Three rules: timestamps come from the database commit time Debezium provides, never arrival time (or every connector pause rewrites history); withIdleness is mandatory, because a naturally quiet table will otherwise freeze the watermark for every joined stream; and late data beyond allowedLateness gets an explicit fate — a side-output correction path that upserts affected windows, the same pattern detailed in our fraud architecture. "Exactly-once" deserves its asterisk here too: Flink guarantees state and transactional output; your warehouse sink is idempotent (Section 03) or your guarantee is marketing.
07 · Production Evidence: The Sub-3-Minute Platforms
Every defense above is running in documented Vipra production: the real-time LXP platform — Confluent Kafka + CDC + BigQuery replacing nightly batch for a global EdTech client, holding sub-3-minute end-to-end latency for millions of learners — and the Flink telemetry platform sustaining 1B+ events/hour with sub-second detection. The watermark discipline, absence alerting, and idempotent-sink patterns in this article are those systems' runbooks, generalised. For the decision before the operations — whether CDC is worth running at all for a given table — the honest framework is in CDC vs Full Load.
Production CDC
Vipra Production
Rehearsed, Not Improvised
Lives in Two Places
08 · Lessons Learned: The Hard Truths
- A stopped connector is a page, not a ticket. Postgres retains WAL for its slot indefinitely; we have watched a paused connector fill a primary's disk and take production OLTP down with it.
max_slot_wal_keep_sizeset, WAL-volume alerts on, runbook explicit. - Doubled dashboard revenue is the at-least-once tax arriving. Idempotency on (PK + LSN) is not an optimization — it is the contract every CDC consumer signs whether they know it or not.
- Silence is the worst failure mode, so instrument for it. Stalled-topic incidents produce zero errors anywhere. Absence alerts on hot topics caught every registry stall since; nothing else would have.
- Config edits caused more incidents than code deploys. snapshot.mode, table include-lists, and connector names all carry restart semantics. Config is code; stage it, drill it.
- The first snapshot is a migration — schedule it like one. Hot multi-hundred-GB tables under default snapshots stall writers (MySQL global read locks are brutal; Postgres long transactions bloat). Incremental snapshots, off-peak, announced.
- Watermark idleness bit us through a healthy table. A naturally quiet stream froze the watermark for every stream joined to it; windows simply stopped firing. withIdleness is mandatory configuration, not an option flag.
09 · Key Takeaways for Practitioners
At-least-once is the real delivery contract. Sinks idempotent on (PK + log position), everywhere, from day one.
Clean stop, crash, config change, slot loss — rehearsed in staging with verification steps, before production improvises them.
Registry stalls are silent. No events on a hot topic for N minutes is a page — it is the only symptom you get.
DB timestamps for watermarks, withIdleness always, side-output corrections that heal windows.
WAL retention is unbounded by default. Cap it, alert on it, and treat a stopped connector as the database risk it is.
Half the CDC tables we audit would be cheaper as partitioned full loads. Run the decision framework first.
Companions: CDC vs Full Load for the upstream decision, fraud detection at 50K TPS for the Flink state discipline at depth, and the LXP streaming engineering project for the documented production result.