TL;DR — Direct Answer
The demo stack (Debezium → Kafka → Flink) takes a day to stand up and a year to operate well. The production hard parts: connector restarts that replay or skip depending on offset-flush timing, snapshot-vs-streaming transitions that can drop a window of changes when mishandled, schema-registry compatibility conflicts that silently stall topics when a source team ships DDL, late/out-of-order events that quietly corrupt aggregates without watermark discipline, and the truth that "exactly-once" ends at the sink you actually write to. Everything below is from pipelines we run at sub-3-minute end-to-end latency in production.
Connector restarts: at-least-once means duplicates, plan for them
Debezium flushes offsets periodically, not per event. If the connector crashes between publishing an event and flushing its offset, those events are published again on restart. This is correct at-least-once behavior, so every downstream consumer must be idempotent: key your sinks on primary key plus LSN/GTID position, and upsert rather than append. Teams discover this during their first incident review, in the form of doubled revenue in a dashboard. Also treat connector config changes as deployments: several config edits silently reset offsets or trigger re-snapshots depending on snapshot.mode.
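To make "primary key plus log position, upsert rather than append" concrete, here is a minimal in-memory sketch. The names ChangeEvent and UpsertSink are hypothetical, the lsn field stands in for whatever monotonically increasing source position your connector exposes, and a real sink would persist the last applied position alongside the row rather than hold it in a dict.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    pk: int
    lsn: int              # source log position (assumed monotonically increasing)
    op: str               # "c", "u" or "d", Debezium-style operation codes
    row: Optional[dict]   # None for deletes


class UpsertSink:
    """Toy sink: keeps the latest row per (table, pk) and ignores stale replays."""

    def __init__(self) -> None:
        self.rows: Dict[tuple, dict] = {}
        self.applied_lsn: Dict[tuple, int] = {}

    def apply(self, ev: ChangeEvent) -> bool:
        """Apply the event; return False if it was a duplicate or stale replay."""
        key = (ev.table, ev.pk)
        # A replayed event carries a position we have already applied: skip it.
        if self.applied_lsn.get(key, -1) >= ev.lsn:
            return False
        if ev.op == "d":
            self.rows.pop(key, None)
        else:
            self.rows[key] = dict(ev.row or {})
        # Remember the position even for deletes, so a replayed older
        # update cannot resurrect the row.
        self.applied_lsn[key] = ev.lsn
        return True


if __name__ == "__main__":
    sink = UpsertSink()
    batch = [
        ChangeEvent("orders", 1, 100, "c", {"amount": 50}),
        ChangeEvent("orders", 1, 110, "u", {"amount": 75}),
    ]
    for ev in batch + batch:  # second pass simulates a restart replay
        sink.apply(ev)
    print(sink.rows)
```

Because the replayed events are skipped rather than appended, the order is counted once no matter how many times the connector restarts.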
The snapshot/streaming boundary
Initial snapshot followed by streaming sounds atomic; it isn't. A misconfigured snapshot.mode on restart can re-run a full snapshot into a live topic (hours of duplicate firehose), or skip straight to streaming from a fresh slot and miss every change made before that slot was created. Use incremental snapshots (signal-based) for big tables, pin snapshot.mode=when_needed deliberately, and rehearse the restart scenarios in staging, all four of them: clean stop, crash, config change, and slot loss.
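A signal-based incremental snapshot is triggered by inserting a row into the connector's signaling table. The sketch below only builds that insert. It assumes a signaling table with id, type and data columns as described in the Debezium signaling documentation, psycopg-style %s placeholders, and a hypothetical table name (public.debezium_signal) that would have to match the connector's signal.data.collection setting. Check the payload shape against the docs for your Debezium version before relying on it.

```python
import json
import uuid
from typing import List, Tuple


def build_incremental_snapshot_signal(
    signal_table: str, collections: List[str]
) -> Tuple[str, tuple]:
    """Return (sql, params) for an ad hoc incremental snapshot signal row.

    signal_table is interpolated into the SQL, so it must come from
    trusted configuration, never from user input.
    """
    payload = json.dumps({"data-collections": collections, "type": "incremental"})
    sql = f"INSERT INTO {signal_table} (id, type, data) VALUES (%s, %s, %s)"
    return sql, (str(uuid.uuid4()), "execute-snapshot", payload)


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    sql, params = build_incremental_snapshot_signal(
        "public.debezium_signal", ["public.orders"]
    )
    print(sql)
    print(params)
```

Keeping the signal in a reviewed script rather than an ad hoc psql session makes the snapshot trigger part of the same deployment discipline as connector config changes.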
Schema registry: where pipelines go to stall quietly
Source team adds a NOT NULL column with no default → Debezium emits a new schema → the registry rejects it under BACKWARD compatibility → the connector parks in a retry loop → the topic goes quiet. There are no errors downstream, just silence and an SLA breach discovered by a consumer. Defenses: alert on absence (no events on a hot topic for N minutes is a page), agree on a compatibility mode with producing teams (FULL_TRANSITIVE if you can win that fight), and route DDL through the data-contract process rather than discovering it in the registry logs.
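Alerting on absence reduces to comparing "time since the newest event" against a per-topic quiet budget. A minimal sketch follows, assuming you already collect the latest event timestamp per topic from your metrics system. The function name, topic names and thresholds are all illustrative.

```python
from typing import Dict, List


def silent_topics(
    last_event_ts: Dict[str, float],
    now: float,
    max_quiet_seconds: Dict[str, float],
) -> List[str]:
    """Return topics whose newest event is older than their allowed quiet period."""
    return sorted(
        topic
        for topic, limit in max_quiet_seconds.items()
        # A monitored topic we have never seen an event on counts as silent too.
        if now - last_event_ts.get(topic, float("-inf")) > limit
    )


if __name__ == "__main__":
    now = 1_000.0
    last_seen = {"db.public.orders": 990.0, "db.public.payments": 400.0}
    budgets = {
        "db.public.orders": 300.0,
        "db.public.payments": 300.0,
        "db.public.refunds": 300.0,  # never produced an event
    }
    print(silent_topics(last_seen, now, budgets))
```

The budget is per topic on purpose: a hot orders table and a rarely updated lookup table need very different values of N.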
Late and out-of-order events in Flink
Event-time windows need watermarks, and a watermark is a bet about lateness. Bet too aggressively and late events (connector pause, network partition, source-side batch update) get dropped from closed windows, so aggregates are silently wrong. Bet too conservatively and latency balloons. Production pattern: bounded-out-of-orderness watermarks plus allowed lateness, with a side output for the truly late and an alert on the side-output count. Late events should be measured, not ignored. For CDC specifically, remember that updates to old rows are legitimately "late" by design; keyed upsert state, not windowed appends, is usually the right Flink model.
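The interaction between watermark, allowed lateness and side output is easier to see in a toy simulation than in prose. The sketch below is plain Python, not the Flink API, and it simplifies Flink's semantics: the watermark is recomputed on every event as the maximum timestamp seen minus the out-of-orderness bound, and an event is "too late" once its window end plus allowed lateness is at or behind the watermark. All names and numbers are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def windowed_counts(
    event_times: Iterable[int],
    window_size: int,
    max_out_of_orderness: int,
    allowed_lateness: int,
) -> Tuple[Dict[int, int], List[int]]:
    """Count events per tumbling window.

    Returns (counts keyed by window start, timestamps routed to the side output).
    """
    counts: Dict[int, int] = defaultdict(int)
    side_output: List[int] = []
    max_seen = None
    for ts in event_times:  # iteration order is arrival order, not event-time order
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - max_out_of_orderness
        window_start = ts - ts % window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            # Window state has been discarded: record the event instead of losing it.
            side_output.append(ts)
        else:
            counts[window_start] += 1
    return dict(counts), side_output


if __name__ == "__main__":
    arrivals = [1, 12, 3, 30, 4, 25]
    counts, too_late = windowed_counts(
        arrivals, window_size=10, max_out_of_orderness=5, allowed_lateness=5
    )
    print(counts)    # per-window counts
    print(too_late)  # events that should drive the side-output alert
```

In this run the event at time 3 still lands in its window because the watermark has not passed it, while the event at time 4 arrives after the watermark has jumped to 25 and goes to the side output. The length of that side-output list is the number to alert on.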
"Exactly-once" — the asterisks
Flink's checkpointing gives exactly-once state; Kafka transactions extend it to Kafka sinks. The moment you write to anything else (Elasticsearch, a REST API, most warehouses) you are back to at-least-once plus idempotent writes. Design the sink contract first: deterministic keys, upsert semantics, and a reconciliation job that compares source counts to sink counts daily. That reconciliation job has caught more real incidents for us than any streaming metric.
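The core of a count-based reconciliation job is small. The sketch below assumes both sets of counts were taken at the same logical cutoff (for example, rows with an update timestamp before midnight UTC), which is the hard part in practice. The function name, table names and default tolerance are illustrative, not a description of any particular production job.

```python
from typing import Dict, List


def reconcile(
    source_counts: Dict[str, int],
    sink_counts: Dict[str, int],
    tolerance: float = 0.001,
) -> List[str]:
    """Return tables whose sink row count drifts from the source by more than tolerance."""
    drifted = []
    for table, src in source_counts.items():
        snk = sink_counts.get(table, 0)  # a table missing from the sink counts as empty
        # Relative drift; max(src, 1) avoids dividing by zero on empty tables.
        if abs(src - snk) / max(src, 1) > tolerance:
            drifted.append(table)
    return sorted(drifted)


if __name__ == "__main__":
    source = {"orders": 100_000, "users": 5_000, "refunds": 0}
    sink = {"orders": 100_050, "users": 4_900}
    print(reconcile(source, sink))
```

Counts catch missing and duplicated rows but not wrong values; a per-table checksum over a sampled key range is a common next step once counts are stable.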
The operations checklist we actually run
- Alert on topic silence per table, not just connector status (status lies during retry loops).
- WAL/binlog retention alarms on the source DB (a paused connector can fill the primary's disk).
- Restart rehearsals quarterly: all four stop/crash/config/slot-loss scenarios, in staging, timed.
- Schema-change drill: DDL lands → who is paged, what is the registry decision tree, who unblocks.
- Daily source↔sink reconciliation with tolerance bands and trend alerting.
- Consumer-lag SLOs per topic tied to the business freshness promise (ours: 3 minutes end-to-end — see the LXP streaming case study).