Real-Time CDC Pipelines with Debezium + Kafka + Flink: The Hard Parts Nobody Tells You

Q: Is Debezium reliable enough for production?

Yes, with the understanding that it is at-least-once: a restart replays any events that were published after the last offset flush. Production readiness means idempotent downstream consumers (keyed upserts), topic-silence alerting, WAL retention alarms on the source, and rehearsed restart procedures, not just a running connector.

Q: Why did my CDC topic silently stop receiving events?

Most often a schema-registry compatibility rejection: a source DDL change produced a schema the registry refuses under the configured compatibility mode, parking the connector in a retry loop with no downstream error. Alert on event absence per topic, and route source DDL through a data-contract process.

Q: Does Flink give exactly-once for CDC pipelines?

Flink checkpointing provides exactly-once state consistency, and Kafka-to-Kafka pipelines using Kafka transactions extend that guarantee to the output topic. Writing to external systems (search indexes, APIs, most warehouses) drops you to at-least-once, so design idempotent sinks with deterministic keys and run a daily source-to-sink reconciliation regardless.

Q: How should Flink handle late CDC events?

Bounded out-of-orderness watermarks plus allowed-lateness, with a side-output capturing what's still later — and an alert on the side-output volume. For CDC mirroring specifically, prefer keyed upsert state over windowed appends: updates to old rows are legitimately late by design.

Executive Summary

Debezium reads the database log, Kafka transports, Flink transforms — the quickstart works in an afternoon and the architecture is genuinely right. What the quickstart omits is that you now operate a distributed system whose failure modes live at the seams: connector↔offset, snapshot↔stream, producer↔registry, event-time↔processing-time, and Flink↔your-actual-sink.

This article walks each seam with the failure mode, the detection signal, and the defense — including the configuration blocks and the restart drills. None of it is theoretical: these are the operational scars and runbooks from Vipra's production CDC platforms.

Production base: documented Vipra streaming engagements running sub-3-minute end-to-end latency (Kafka CDC platform serving millions of learners) and 1B+ events/hour (Flink telemetry). The companion piece on the decision itself — when CDC is even worth operating — is CDC vs Full Load.

01 · The Demo Lies by Omission

Nothing in the quickstart is false. Debezium really does turn a Postgres WAL into Kafka topics in twenty minutes; Flink really does join and aggregate them with event-time correctness. The omission is operational: the demo never crashes mid-offset-flush, never receives a surprise DDL from a source team, never pauses a connector over a long weekend while WAL accumulates toward the primary's disk limit. Production is made of exactly those moments.

The honest mental model: CDC is a distributed system you operate forever, with five seams where state lives in two places at once. Every hard part below is one of those seams. Teams that map them in week one run quiet pipelines; teams that discover them in incident reviews run the same pipelines, eventually, with scars.

02 · The Architecture — Annotated with Its Failure Points

source → Postgres/MySQL WAL + replication slot. ⚠ Seam 1: a stopped consumer retains WAL indefinitely; unmonitored, it fills the primary's disk.

capture → Debezium on Kafka Connect. ⚠ Seam 2: offsets flush periodically, not per-event, so a crash between publish and flush replays events. ⚠ Seam 3: snapshot/streaming transitions.

transport → Kafka + Schema Registry. ⚠ Seam 4: incompatible DDL parks the connector in a retry loop: the topic goes silent, not loud.

process → Flink event-time jobs. ⚠ Seam 5: watermarks are a bet about lateness; idle partitions stall them; late events need an explicit fate.

sink → Warehouse / lakehouse / serving. ⚠ The exactly-once boundary: Flink's guarantee ends here, so the sink must be idempotent on key + log position.

03 · Connector Restarts: At-Least-Once Means Duplicates

Debezium flushes offsets on an interval. Crash between event-publish and offset-flush and the restart replays those events — correct at-least-once behaviour that every downstream must absorb. The defense is mechanical:

idempotent sink — the dedup key is (PK + log position)
-- every Debezium event carries its log coordinates; use them
MERGE INTO dwh.customers t
USING staging.cdc_customers s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.source_lsn > t._last_lsn THEN UPDATE SET ...   -- newer only
WHEN NOT MATCHED THEN INSERT ...;
-- replayed events carry an LSN ≤ _last_lsn → no-op, by construction.
-- append-only sinks instead dedupe on (pk, lsn) before merge.
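The same LSN guard can be sketched outside SQL to show why a replay is harmless. This is a minimal in-memory illustration under stated assumptions, not the production sink: the `apply_event` helper, the dict standing in for the target table, and the event fields (`customer_id`, `source_lsn`, `name`) are hypothetical names chosen to mirror the MERGE above.

```python
from typing import Any, Dict

def apply_event(table: Dict[int, Dict[str, Any]], event: Dict[str, Any]) -> bool:
    """Upsert one change event; return True if the table changed.

    Hypothetical event shape: {"customer_id": int, "source_lsn": int, "name": str}.
    """
    key = event["customer_id"]
    current = table.get(key)
    # Replayed or stale events carry an LSN <= the stored one: skip them.
    if current is not None and event["source_lsn"] <= current["_last_lsn"]:
        return False
    table[key] = {"name": event["name"], "_last_lsn": event["source_lsn"]}
    return True


table: Dict[int, Dict[str, Any]] = {}
events = [
    {"customer_id": 1, "source_lsn": 100, "name": "Ada"},
    {"customer_id": 1, "source_lsn": 120, "name": "Ada L."},
]
first_pass = [apply_event(table, e) for e in events]
# Simulate a connector restart that replays both events.
replay_pass = [apply_event(table, e) for e in events]
```

The second pass changes nothing because every replayed LSN is at or below the stored `_last_lsn`. Deletes are left out of this sketch; a real sink would need a tombstone or soft-delete path carrying the same LSN guard.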

Teams discover this during their first incident review, as doubled revenue in a dashboard. The second discovery, usually a month later: connector config changes are deployments — several innocuous-looking edits reset offsets or trigger re-snapshots depending on snapshot.mode. Config goes through the same review-and-staging path as code, with the restart drills of Section 04 rehearsed before they happen unrehearsed.
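One way to make "config changes are deployments" enforceable is a pre-deploy check that flags edits to restart-sensitive keys before they reach production. The sketch below is an assumption-laden illustration: `risky_changes` is a hypothetical helper, and the watch list contains only the settings this article names, not an exhaustive list for any Debezium version.

```python
from typing import Dict, List

# Assumed watch list; extend it for your connector type and version.
RESTART_SENSITIVE_KEYS = ("name", "snapshot.mode", "table.include.list")

def risky_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return the restart-sensitive keys whose values differ between configs."""
    return [k for k in RESTART_SENSITIVE_KEYS if old.get(k) != new.get(k)]


current = {
    "name": "customers-cdc",
    "snapshot.mode": "initial",
    "table.include.list": "public.customers",
    "max.batch.size": "2048",
}
proposed = {**current, "snapshot.mode": "when_needed", "max.batch.size": "4096"}
flagged = risky_changes(current, proposed)
```

Here only `snapshot.mode` is flagged; the batch-size tweak passes through. A non-empty result would route the change to the staging drill rather than straight to production.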

04 · The Snapshot/Streaming Boundary

"Initial snapshot, then streaming" sounds atomic. It isn't — it's a state machine with four restart scenarios, and a wrong snapshot.mode on restart either re-runs a full snapshot into a live topic (hours of duplicate firehose) or skips to streaming and silently misses everything since the slot was created.

                      ┌──────────────┐  signal table row   ┌──────────────────┐
connector start ───►  │   SNAPSHOT   │ ──────────────────► │   INCREMENTAL    │
                      │ (full scan)  │  (big tables:       │   SNAPSHOT       │
                      └──────┬───────┘   don't full-scan)  │ chunked, resum-  │
                             │ complete                    │ able, low-lock   │
                             ▼                             └────────┬─────────┘
                      ┌──────────────┐                              │
                      │  STREAMING   │ ◄────────────────────────────┘
                      │  (from WAL)  │
                      └──────┬───────┘

the four restart drills to rehearse in staging:
1  clean stop/start       → resumes at offset              verify: zero gap
2  crash mid-stream       → replays since last flush       verify: dedup absorbs
3  config change          → may reset offsets/resnapshot   verify BEFORE prod
4  slot lost / recreated  → WAL gap = missed changes       requires backfill run

Production settings that earn their keep: incremental (signal-based) snapshots for any table that would hold locks or run hours (the watermark-chunked algorithm is resumable and writer-friendly); snapshot.mode=when_needed chosen deliberately and documented; and drill #4 treated as the disaster it is — a recreated slot means a gap, and the runbook's answer is a scoped backfill (one bounded full load into the same idempotent sink), not hope.
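For reference, a signal-based incremental snapshot is requested by inserting a row into the connector's signal table. The sketch below builds that row. It assumes the connector's `signal.data.collection` property points at a table with `id`, `type`, and `data` columns; the table name `debezium_signal`, the helper name, and the example table are hypothetical.

```python
import json
import uuid
from typing import List, Tuple

def incremental_snapshot_signal(tables: List[str]) -> Tuple[str, str, str]:
    """Build the (id, type, data) row for an ad hoc incremental snapshot signal."""
    if not tables:
        raise ValueError("at least one table is required")
    data = json.dumps({"data-collections": tables, "type": "incremental"})
    # The id only needs to be unique per signal; a UUID is one convenient choice.
    return (str(uuid.uuid4()), "execute-snapshot", data)


row = incremental_snapshot_signal(["public.orders"])
# Parameterised INSERT for the hypothetical signal table; run it with your DB driver.
sql = "INSERT INTO debezium_signal (id, type, data) VALUES (%s, %s, %s)"
```

Because the request is just a row in the source database, it can go through the same review path as any other change, which fits the scoped-backfill runbook for drill #4.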

05 · Schema Registry: Where Pipelines Stall Quietly

The failure script, verbatim from production: source team ships a NOT NULL column with no default → Debezium emits the new schema → registry rejects it under BACKWARD compatibility → connector parks in a retry loop → the topic goes quiet. No downstream errors. Just silence, discovered by a consumer's SLA breach hours later.
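The rejection in that script follows from one rule: under BACKWARD compatibility the new schema must be able to read data written with the old one, so any added field needs a default. The check below is a deliberately simplified sketch of that single rule, not the registry's actual algorithm (it ignores type changes, unions, and aliases); the field maps and the `backward_violations` helper are hypothetical.

```python
from typing import Any, Dict, List

def backward_violations(old_fields: Dict[str, Dict[str, Any]],
                        new_fields: Dict[str, Dict[str, Any]]) -> List[str]:
    """Names of fields added in new_fields that lack a default value."""
    return sorted(
        name for name, spec in new_fields.items()
        if name not in old_fields and "default" not in spec
    )


old = {"id": {"type": "long"}, "email": {"type": "string"}}
# The NOT NULL column with no default from the failure script.
new_bad = {**old, "tier": {"type": "string"}}
# The same column shipped with a default.
new_ok = {**old, "tier": {"type": "string", "default": "free"}}
```

Running a check like this in the source team's CI surfaces the problem in the pull request, before the registry ever sees the schema.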

Defense	Mechanism	Why it works
Alert on absence	No events on a hot topic for N minutes = page	Silence is this failure's only symptom; make silence loud
Compatibility as a negotiated contract	FULL_TRANSITIVE if you can win the fight; BACKWARD as documented floor	The mode is a producer-consumer agreement, not a platform default
DDL through the contract process	Source schema changes reviewed against consumer contracts in CI	You learn about the column in a PR, not in registry logs — the contracts playbook, applied upstream
Dead-letter with replay	Unparseable events to a DLQ with offset bookmarks	One poison message must not dam the river
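The first row of the table, alerting on absence, reduces to comparing each hot topic's last-event time against a silence budget. A minimal sketch, assuming some metrics source already supplies the per-topic timestamps; the topic names, thresholds, and `silent_topics` helper are illustrative.

```python
from typing import Dict, List

def silent_topics(last_event_ts: Dict[str, float],
                  max_silence_s: Dict[str, float],
                  now: float) -> List[str]:
    """Topics whose most recent event is older than their allowed silence."""
    return sorted(
        topic for topic, limit in max_silence_s.items()
        # A topic with no recorded event at all counts as silent.
        if now - last_event_ts.get(topic, float("-inf")) > limit
    )


now = 10_000.0
last_seen = {"cdc.public.orders": 9_990.0, "cdc.public.customers": 9_000.0}
thresholds = {
    "cdc.public.orders": 300.0,
    "cdc.public.customers": 300.0,
    "cdc.public.payments": 300.0,
}
to_page = silent_topics(last_seen, thresholds, now)
```

The thresholds are per topic on purpose: a five-minute gap is an incident on an orders table and normal on a reference table, so only topics listed in the threshold map are checked.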

06 · Late and Out-of-Order Events in Flink

Event-time windows need watermarks; watermarks are a bet about lateness. CDC makes the bet treacherous: a paused connector, a network partition, or a source-side batch update can deliver minutes of "old" events in one burst. Bet too tight and those events drop from closed windows; too loose and every aggregate waits.

the watermark + lateness contract (Flink, production defaults)
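import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

WatermarkStrategy<ChangeEvent> strategy = WatermarkStrategy
    .<ChangeEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
    .withTimestampAssigner((e, ts) -> e.sourceTimestampMs())   // DB commit time
    .withIdleness(Duration.ofSeconds(60));   // ⚠ quiet table ≠ stalled clock

stream.assignTimestampsAndWatermarks(strategy)   // attach the strategy to the stream
    .keyBy(ChangeEvent::table)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.minutes(5))        // connector-pause absorber
    .sideOutputLateData(lateTag)             // beyond that: explicit fate
    .aggregate(new Agg(), new Emit());
// late side output → correction path (upsert the affected window in the sink),
// so aggregates heal instead of silently lying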
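To see what the three settings above do to an individual event, the following sketch replays a small sequence through a simplified model of the watermark and window rules. It assumes the watermark advances after every event (Flink emits it periodically) and trails the highest timestamp seen by the bound plus one millisecond; `classify` and the sample timestamps are hypothetical.

```python
from typing import List, Optional, Tuple

def classify(events_ms: List[int], bound_ms: int, window_ms: int,
             allowed_lateness_ms: int) -> List[Tuple[int, str]]:
    """Label each event with the fate a tumbling event-time window gives it."""
    fates: List[Tuple[int, str]] = []
    max_ts: Optional[int] = None
    for ts in events_ms:
        # Watermark as it stands before this event is processed.
        watermark = float("-inf") if max_ts is None else max_ts - bound_ms - 1
        window_max_ts = (ts // window_ms + 1) * window_ms - 1
        if window_max_ts + allowed_lateness_ms <= watermark:
            fate = "side_output"      # window state already discarded
        elif window_max_ts <= watermark:
            fate = "late_firing"      # window fired, still within allowed lateness
        else:
            fate = "on_time"
        fates.append((ts, fate))
        max_ts = ts if max_ts is None else max(max_ts, ts)
    return fates


# 30 s bound, 1 min windows, 5 min allowed lateness, as in the block above.
fates = classify([10_000, 125_000, 50_000, 500_000, 20_000],
                 bound_ms=30_000, window_ms=60_000, allowed_lateness_ms=300_000)
```

The event at 50 s arrives after its window fired but inside the lateness allowance, so the window re-fires with a corrected result. The event at 20 s arrives after the clock has moved past 8 minutes and can only be recovered through the side output.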

Three rules: timestamps come from the database commit time Debezium provides, never arrival time (or every connector pause rewrites history); withIdleness is mandatory, because a naturally quiet table will otherwise freeze the watermark for every joined stream; and late data beyond allowedLateness gets an explicit fate — a side-output correction path that upserts affected windows, the same pattern detailed in our fraud architecture. "Exactly-once" deserves its asterisk here too: Flink guarantees state and transactional output; your warehouse sink is idempotent (Section 03) or your guarantee is marketing.
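The idleness rule is easier to reason about with the arithmetic in view: an operator with several inputs advances its event-time clock to the minimum watermark across them, and an input marked idle is left out of that minimum. A simplified sketch of that rule, with hypothetical stream names and a hypothetical `combined_watermark` helper:

```python
from typing import Dict, Optional

def combined_watermark(watermarks: Dict[str, int],
                       idle: Dict[str, bool]) -> Optional[int]:
    """Minimum watermark over inputs not marked idle; None if all are idle."""
    active = [wm for name, wm in watermarks.items() if not idle.get(name, False)]
    return min(active) if active else None


wms = {"orders": 1_700_000_120_000, "country_codes": 1_700_000_000_000}
# Without idleness detection the quiet table pins the clock two minutes back.
stalled = combined_watermark(wms, {"orders": False, "country_codes": False})
# Once the quiet input is marked idle, the busy stream drives the clock.
advancing = combined_watermark(wms, {"orders": False, "country_codes": True})
```

This is why a healthy but quiet reference table can stop every window downstream of a join from firing: nothing is broken, the minimum simply never moves.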

07 · Production Evidence: The Sub-3-Minute Platforms

Every defense above is running in documented Vipra production: the real-time LXP platform — Confluent Kafka + CDC + BigQuery replacing nightly batch for a global EdTech client, holding sub-3-minute end-to-end latency for millions of learners — and the Flink telemetry platform sustaining 1B+ events/hour with sub-second detection. The watermark discipline, absence alerting, and idempotent-sink patterns in this article are those systems' runbooks, generalised. For the decision before the operations — whether CDC is worth running at all for a given table — the honest framework is in CDC vs Full Load.

<3min

End-to-End, Vipra Production CDC

1B+/hr

Flink Telemetry, Vipra Production

Restart Drills: Rehearsed, Not Improvised

Seams Where State Lives in Two Places

08 · Lessons Learned: The Hard Truths

A stopped connector is a page, not a ticket. Postgres retains WAL for its slot indefinitely; we have watched a paused connector fill a primary's disk and take production OLTP down with it. max_slot_wal_keep_size set, WAL-volume alerts on, runbook explicit.
Doubled dashboard revenue is the at-least-once tax arriving. Idempotency on (PK + LSN) is not an optimization — it is the contract every CDC consumer signs whether they know it or not.
Silence is the worst failure mode, so instrument for it. Stalled-topic incidents produce zero errors anywhere. Absence alerts on hot topics have caught every registry stall since; nothing else would have.
Config edits caused more incidents than code deploys. snapshot.mode, table include-lists, and connector names all carry restart semantics. Config is code; stage it, drill it.
The first snapshot is a migration, so schedule it like one. Hot multi-hundred-GB tables under default snapshots stall writers (MySQL global read locks are brutal; long-running Postgres snapshot transactions hold back vacuum and cause bloat). Incremental snapshots, off-peak, announced.
Watermark idleness bit us through a healthy table. A naturally quiet stream froze the watermark for every stream joined to it; windows simply stopped firing. withIdleness is mandatory configuration, not an option flag.
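For the first lesson in the list above, the WAL-volume alert comes down to LSN arithmetic: how many bytes sit between the server's current WAL position and each slot's restart position. The sketch assumes those values have already been read from Postgres (for example `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`); the slot names, the budget, and both helpers are illustrative.

```python
from typing import Dict, List

def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def slots_over_budget(current_lsn: str, restart_lsns: Dict[str, str],
                      budget_bytes: int) -> List[str]:
    """Slots retaining more WAL than budget_bytes behind the current position."""
    head = lsn_to_int(current_lsn)
    return sorted(
        slot for slot, restart in restart_lsns.items()
        if head - lsn_to_int(restart) > budget_bytes
    )


slots = {"debezium_orders": "2/10000000", "debezium_audit": "0/10000000"}
# Page on any slot holding more than 8 GiB of WAL.
paging = slots_over_budget("2/20000000", slots, budget_bytes=8 * 1024**3)
```

Setting the alert budget below `max_slot_wal_keep_size` means the page fires before Postgres starts invalidating the slot, which is the point at which drill #4 becomes unavoidable.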

09 · Key Takeaways for Practitioners

🔁

Design for replays

At-least-once is the real delivery contract. Sinks idempotent on (PK + log position), everywhere, from day one.

📸

Drill the four restarts

Clean stop, crash, config change, slot loss — rehearsed in staging with verification steps, before production improvises them.

🔇

Alert on absence

Registry stalls are silent. No events on a hot topic for N minutes is a page — it is the only symptom you get.

⏱️

Commit time, idleness on, late path explicit

DB timestamps for watermarks, withIdleness always, side-output corrections that heal windows.

💾

Respect the slot

WAL retention is unbounded by default. Cap it, alert on it, and treat a stopped connector as the database risk it is.

⚖️

Decide before you operate

Half the CDC tables we audit would be cheaper as partitioned full loads. Run the decision framework first.

Companions: CDC vs Full Load for the upstream decision, fraud detection at 50K TPS for the Flink state discipline at depth, and the LXP streaming engineering project for the documented production result.
