Real-Time CDC Pipelines with Debezium + Kafka + Flink: The Hard Parts Nobody Tells You

TL;DR — Direct Answer

The demo stack (Debezium → Kafka → Flink) takes a day to stand up and a year to operate well. The production hard parts: connector restarts that replay or skip depending on offset-flush timing, snapshot-vs-streaming transitions that can drop a window of changes when mishandled, schema-registry compatibility conflicts that silently stall topics when a source team ships DDL, late/out-of-order events that quietly corrupt aggregates without watermark discipline, and the truth that "exactly-once" ends at the sink you actually write to. Everything below is from pipelines we run at sub-3-minute end-to-end latency in production.

Connector restarts: at-least-once means duplicates, plan for them

Debezium flushes offsets periodically, not per-event. Crash between event-publish and offset-flush → on restart you get those events again. This is correct at-least-once behavior, and every downstream consumer must be idempotent: key your sinks on primary key + LSN/GTID position, upsert don't append. Teams discover this during their first incident review, in the form of doubled revenue in a dashboard. Also: treat connector config changes as deployments — several config edits silently reset offsets or trigger re-snapshots depending on snapshot.mode.

The snapshot/streaming boundary

Initial snapshot then streaming sounds atomic; it isn't. Misconfigured snapshot.mode on restart can re-run a full snapshot into a live topic (hours of duplicate firehose) or skip straight to streaming and miss everything since the slot was created. Use incremental snapshots (signal-based) for big tables, pin snapshot.mode=when_needed deliberately, and rehearse the restart scenarios in staging — all four of them: clean stop, crash, config change, and slot loss.

Schema registry: where pipelines go to stall quietly

Source team adds a NOT NULL column with no default → Debezium emits a new schema → registry rejects it under BACKWARD compatibility → connector parks in a retry loop → topic goes quiet. No errors downstream — just silence and an SLA breach discovered by a consumer. Defenses: alert on absence (no events on a hot topic for N minutes is a page), agree compatibility mode with producing teams (FULL_TRANSITIVE if you can win that fight), and route DDL through the data-contract process rather than discovering it in the registry logs.

Late and out-of-order events in Flink

Event-time windows need watermarks; watermarks are a bet about lateness. Bet too aggressive and late events (connector pause, network partition, source-side batch update) get dropped from closed windows — aggregates are silently wrong. Bet too conservative and latency balloons. Production pattern: bounded watermarks + allowed-lateness + side-output for the truly late, with the side-output count alerting — late events should be measured, not ignored. For CDC specifically, remember updates to old rows are legitimately "late" by design; keyed upsert state, not windowed appends, is usually the right Flink model.

"Exactly-once" — the asterisks

Flink's checkpointing gives exactly-once state; Kafka transactions extend it to Kafka sinks. The moment you write to anything else — ElasticSearch, a REST API, most warehouses — you are back to at-least-once plus idempotent writes. Design the sink contract first: deterministic keys, upsert semantics, and a reconciliation job that compares source counts to sink counts daily. That reconciliation job has caught more real incidents for us than any streaming metric.

The operations checklist we actually run

Alert on topic silence per table, not just connector status (status lies during retry loops).
WAL/binlog retention alarms on the source DB (a paused connector can fill the primary's disk).
Restart rehearsals quarterly: all four stop/crash/config/slot-loss scenarios, in staging, timed.
Schema-change drill: DDL lands → who is paged, what is the registry decision tree, who unblocks.
Daily source↔sink reconciliation with tolerance bands and trend alerting.
Consumer-lag SLOs per topic tied to the business freshness promise (ours: 3 minutes end-to-end — see the LXP streaming case study).

Frequently Asked Questions

Is Debezium reliable enough for production?

Yes — with the understanding that it is at-least-once: restarts replay events that published after the last offset flush. Production readiness means idempotent downstream consumers (keyed upserts), topic-silence alerting, WAL retention alarms on the source, and rehearsed restart procedures — not just a running connector.

Why did my CDC topic silently stop receiving events?

Most often a schema-registry compatibility rejection: a source DDL change produced a schema the registry refuses under the configured compatibility mode, parking the connector in a retry loop with no downstream error. Alert on event absence per topic, and route source DDL through a data-contract process.

Does Flink give exactly-once for CDC pipelines?

Flink checkpointing provides exactly-once state, and Kafka-to-Kafka with transactions extends it. Writing to external systems (search indexes, APIs, most warehouses) drops you to at-least-once — so design idempotent sinks with deterministic keys and run a daily source-to-sink reconciliation regardless.

How should Flink handle late CDC events?

Bounded out-of-orderness watermarks plus allowed-lateness, with a side-output capturing what's still later — and an alert on the side-output volume. For CDC mirroring specifically, prefer keyed upsert state over windowed appends: updates to old rows are legitimately late by design.

Real-Time CDC with Debezium + Kafka + Flink: The Hard Parts