Real-Time Fraud Detection at 50K TPS: A Kafka + Flink + Iceberg Reference Architecture

Executive Summary

Fraud is adversarial and fast: card-testing runs, account takeovers, and mule-network cashouts complete in minutes. A pipeline that lands transactions in a warehouse and scores them on a schedule concedes that entire window — moving the same rules from a 3-hour batch cycle to an 800-millisecond streaming path changes not just when you catch fraud, but what is catchable at all.

This article is the reference architecture we deploy in production streaming engagements: Kafka for durable ordered transport, Flink for stateful detection with exactly-once semantics, Iceberg as the unified batch+streaming lakehouse. The hard parts get the depth — event time, watermarks, state TTL, late data, and the chaos drill that separates exactly-once from optimism.

The 50K TPS figures describe the reference design target. The engineering underneath is documented Vipra production work: 1B+ events/hour with sub-second anomaly detection (network telemetry, Flink + ClickHouse) and sub-3-minute end-to-end Kafka platforms (LXP streaming).

01 · Why Batch Loses to Fraud by Design

The nightly fraud batch has a precise economic meaning: every fraud pattern completing faster than your batch interval is free for the attacker. Card-testing runs burn through stolen BINs in 20 minutes. Account-takeover cashouts finish in under an hour. The batch job that runs at 02:00 produces a beautifully accurate report of money that is already gone.

Latency changes the detection category, not just the detection time. Velocity rules — five transactions in four countries within ten minutes, three failed CVVs then a success, dormant account suddenly testing small amounts — only exist in a streaming world, because by the time batch sees them, they are history rather than signal. The reference target is decision inside the authorization window: gateway event to allow/block/step-up in under a second, with 800ms as the engineered P99.

🔩Keep the rules engine and the model separate from day one: rules are explainable and ship in minutes via control stream; models catch what rules can't and retrain weekly. Estates that fuse them ship neither quickly.

02 · The Reference Architecture, End to End

ingest

→

Kafka. Transactions keyed by account, 30-day retention, schema-registry enforced Avro. Ordered per-key replayable log — the system of record for motion.

enrich

→

Flink stage 1. Joins device fingerprint, merchant risk, account profile from compacted topics — local state lookups, no database calls on the hot path.

detect

→

Flink stage 2. Keyed velocity windows + broadcast rules + ONNX model scoring. Exactly-once state; checkpoints every 15s to object storage.

decide

→

Alert topic → decision API. Transactional producer; idempotent consumer on alert ID. Block / step-up / allow inside the auth window.

persist

→

Iceberg. Same ACID tables serve streaming writes, analyst SQL, model training, and replay. No lambda split, one truth.

Throughput math for the 50K TPS target: 64 partitions on the transaction topic gives ~780 TPS/partition — comfortable for Flink subtasks running enrichment plus scoring at single-digit-millisecond per event. Headroom rule: provision for 3× observed peak, because fraud traffic spikes are correlated with exactly the moments you cannot afford lag (Black Friday, breach aftermath).

03 · Event Time vs Processing Time: The Non-Negotiable

A transaction's fraud signal depends on when it happened, not when your pipeline saw it. Build every window on event time with watermarks, and choose the lag deliberately:

Flink — watermark strategy for card transactions
WatermarkStrategy
    .<Transaction>forBoundedOutOfOrderness(Duration.ofSeconds(45))
    .withTimestampAssigner((txn, ts) -> txn.getAuthTimestampMillis())
    .withIdleness(Duration.ofSeconds(30));   // quiet partitions can't stall the watermark

Too tight a watermark and late transactions miss their window; too loose and every decision waits. For card traffic, 30–60 seconds of bounded disorder is a sound default — but the real design choice is what happens outside the bound, which is Section 06. The withIdleness clause matters more than it looks: one quiet partition (a regional gateway overnight) will otherwise freeze watermarks for the whole job, and your windows simply stop firing. This is the single most common Flink production incident we get called into.

⚠️Never mix processing-time and event-time windows in the same detection logic "temporarily." The hybrid is unreproducible: replaying yesterday's traffic produces different alerts than yesterday produced. Reproducibility is what lets you test rules against history — protect it absolutely.

04 · Exactly-Once: What It Promises and What It Doesn't

Flink checkpoints plus Kafka transactional producers give you exactly-once state and output: a transaction influences a window once, and an alert is emitted once, across any number of failures and restarts. The configuration that delivers it:

Flink — exactly-once checkpointing
execution.checkpointing.interval: 15s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: s3://fraud-checkpoints/prod
execution.checkpointing.timeout: 120s
# Kafka sink
sink.delivery-guarantee: exactly-once
sink.transactional-id-prefix: fraud-alerts-prod

What it does not promise: anything beyond the transactional boundary. The decision API consuming the alert topic must be idempotent on alert ID, because Kafka read-committed consumers can still re-read after their own failures. And exactly-once costs latency — transactional commits gate on checkpoint completion, so your checkpoint interval is a floor on alert-topic visibility. At a 15s interval with 800ms decisions, the alert is computed in 800ms and committed within the checkpoint cycle; the decision API reads uncommitted for speed and reconciles on commit. That nuance — fast path uncommitted, audit path committed — is the trick that keeps both the latency SLO and the correctness story.

At 50K TPS the binding constraint is state backend throughput, not CPU: RocksDB with incremental checkpoints keeps checkpoint duration under 10% of the interval. When it creeps above that, you are heading for a backpressure spiral — scale state before scaling compute.

05 · State Design at 50K TPS

Velocity rules require per-key state across four key spaces — card, account, device, merchant. The design rules that keep multi-billion-key state tractable:

Rule	Implementation	Why it matters at scale
Bound every state with TTL	StateTtlConfig = window length, enforced by backend	Unbounded state is a slow-motion outage; 24h velocity = 24h TTL, by construction
Aggregate, don't accumulate	Counts, sums, HyperLogLog sketches — not raw event lists	O(1) state per key; a card's 24h profile is ~200 bytes, not 200 events
Rules as broadcast state	Control topic → broadcast stream → connected operator	Risk team ships a rule in minutes, no redeploy, full audit trail of rule versions
Key-space isolation	Separate operators per key space, fan-in to decision	A hot merchant key can't starve card-level detection

Flink — velocity state, aggregating not accumulating
// per-card 24h profile: O(1) size regardless of transaction count
ValueStateDescriptor<CardProfile> desc =
    new ValueStateDescriptor<>("card-profile", CardProfile.class);
StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(24))
    .setUpdateType(UpdateType.OnCreateAndWrite)
    .cleanupInRocksdbCompactFilter(10_000)
    .build();
desc.enableTimeToLive(ttl);
// profile = {count_1h, count_24h, sum_24h, countries_hll, last_seen_ms, decline_streak}

The broadcast-rules pattern deserves emphasis because it changes the organization, not just the job: fraud analysts ship YAML rule changes through a reviewed control topic, the operator validates and applies them on the fly, and every rule version is itself an event in Kafka — which means every alert can cite the exact rule version that fired it. That citation is what your disputes team needs and your regulator eventually asks for.

06 · Late-Arriving Transactions: A Product Decision in Code

Settlement files, retried gateways, store-and-forward terminals, and network partitions guarantee that some transactions arrive minutes late — past any reasonable watermark. What happens next is a product decision per rule, and it belongs in code, not in incident retrospectives:

Flink — late events routed to an explicit correction path
OutputTag<Transaction> lateTag = new OutputTag<>("late-txns"){};

SingleOutputStreamOperator<Alert> alerts = txns
    .keyBy(Transaction::getCardId)
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .allowedLateness(Time.minutes(2))      // in-window corrections, auto-retracted
    .sideOutputLateData(lateTag)           // beyond that: explicit policy
    .aggregate(new VelocityAggregator(), new AlertEmitter());

DataStream<Transaction> late = alerts.getSideOutput(lateTag);
// policy per destination:
late.sinkTo(riskScoreUpdater);     // silently adjust customer risk profile
late.sinkTo(icebergCorrections);   // analytics truth gets the event regardless
// deliberately NOT wired to the decision API — auth already happened

Three distinct fates, made explicit: events within allowedLateness retract and re-emit window results automatically (the decision API sees a correction it can act on — e.g., release a step-up hold); events beyond it update the customer's risk score and the lakehouse, but never retro-decide an authorization that already settled. Writing this as topology rather than policy-document means the behaviour is tested, versioned, and survives team turnover.

07 · Iceberg: One Table Layer for Stream and Batch

The historical answer to "streaming truth vs batch truth" was the lambda architecture — two pipelines, two stores, eternal reconciliation. Iceberg ends it: Flink commits ACID snapshots into the same tables that serve analyst SQL, model training, and compliance queries.

Flink SQL — streaming upsert into the fraud lakehouse
CREATE TABLE iceberg_catalog.fraud.transactions (
  txn_id         STRING,
  card_id        STRING,
  auth_ts        TIMESTAMP(3),
  amount         DECIMAL(12,2),
  merchant_id    STRING,
  decision       STRING,
  rule_version   STRING,
  PRIMARY KEY (txn_id) NOT ENFORCED
) PARTITIONED BY (days(auth_ts))
WITH ('format-version'='2', 'write.upsert.enabled'='true');

INSERT INTO iceberg_catalog.fraud.transactions
SELECT txn_id, card_id, auth_ts, amount, merchant_id, decision, rule_version
FROM decided_transactions;

What this buys, concretely: replay — when rules or models change, reprocess Kafka history and compare against Iceberg snapshots of what the old logic decided; time travel — "show me the state of knowledge when this disputed transaction was approved" is a FOR TIMESTAMP AS OF query, which is an audit superpower; one schema — the feature pipeline trains on exactly the columns the detector wrote, eliminating the training/serving skew that quietly rots fraud models. Compact small files hourly (streaming commits produce many) and expire snapshots on a compliance-aligned schedule — operational chores, but scheduled ones, not architectural debt.

800ms

Decision P99
(From 3-Hour Batch)

50KTPS

Reference Design
Throughput

1B+/hr

Vipra Production
Streaming Scale

Duplicate Alerts in
Chaos Drill — Required

08 · Lessons Learned: The Hard Truths

Idle partitions stall watermarks, and it will page you at 3am. A quiet regional gateway freezes the global watermark and windows stop firing — silently. withIdleness is not optional configuration; it is the difference between detection and a very expensive event archive.
Exactly-once ends at your consumer. Teams celebrate transactional sinks and then deploy a decision API that double-processes on restart. Idempotency on alert ID is part of the architecture, not a downstream team's problem.
State lists kill you at scale; sketches don't. The naive "keep last N transactions per card" design works in the demo and dies at millions of keys. Aggregating state (counters + HLL) was the difference between 4GB and 400GB of checkpoint.
Rules belong to the risk team, so give them a deployment path. Broadcast state turned rule shipping from a two-week engineering ticket into a reviewed config change — and rule iteration speed is fraud-loss reduction, directly.
Replay is the most-used feature you'll build. Every rule change, every model retrain, every "would we have caught this?" question is a replay. Budget for it: Kafka retention and Iceberg snapshots are the price of answerable questions.
The chaos drill is the acceptance test. Kill a TaskManager at peak synthetic load; require zero lost transactions and zero duplicate alerts. If you haven't demonstrated it, you have optimism, not exactly-once.

09 · Key Takeaways for Practitioners

⏱️

Event time, always

Watermarks at 30–60s for card traffic, idleness handling on, and never mix in processing time "temporarily."

🔁

Exactly-once is a boundary

Flink + transactional Kafka inside; idempotent consumers outside. Both, or neither matters.

📦

Aggregate state, TTL everything

Profiles in O(1) per key — counts, sums, sketches. TTL equals window length by construction.

📜

Rules as broadcast config

Risk ships rules in minutes through a reviewed control topic; every alert cites its rule version.

🧊

Iceberg ends the lambda split

One ACID table layer for stream writes, analyst SQL, training, replay, and time-travel audits.

💥

Drill before declaring

TaskManager kill at peak: zero loss, zero duplicates, checkpoint under 10% of interval. Then it's production.

For the operational underbelly — backpressure diagnosis, schema evolution, DLQ design — see CDC Pipeline Hard Parts; for the streaming-economics decision itself, CDC vs Full Load. The production systems behind this article's claims are documented in the network telemetry and LXP streaming case studies.

FAQ · Frequently Asked Questions

What latency is achievable for streaming fraud detection?

Sub-second end-to-end is a realistic production target — gateway event through Kafka and Flink scoring to a decision, inside the card authorization window. Vipra's documented streaming platforms run sub-second anomaly detection at 1B+ events/hour.

Why Iceberg instead of writing to a warehouse?

Iceberg ends the batch/streaming split: Flink streams ACID commits into the same tables that serve analysts and model training, with snapshot isolation and time travel for replay and audits. A warehouse sink creates a second copy whose truth drifts from the stream.

Is exactly-once processing actually achievable?

Yes, for state and output: Flink checkpointing plus Kafka transactions guarantee each event affects state once and each alert is emitted once across failures. Downstream consumers must still be idempotent — exactly-once semantics end at the transactional boundary.

How do you handle transactions that arrive late?

Event-time windows with watermarks (30–60s for card traffic), allowed lateness routed via side outputs to a correction path, and an explicit per-rule policy: retract the decision, adjust the risk score, or correct analytics only. Late data handling is a product decision implemented in code.

Real-Time Fraud Detection at 50K TPS:Why We Ditch Batch for Kafka + Flink + Iceberg

01 · Why Batch Loses to Fraud by Design

02 · The Reference Architecture, End to End

03 · Event Time vs Processing Time: The Non-Negotiable

04 · Exactly-Once: What It Promises and What It Doesn't

05 · State Design at 50K TPS

06 · Late-Arriving Transactions: A Product Decision in Code

07 · Iceberg: One Table Layer for Stream and Batch

08 · Lessons Learned: The Hard Truths

09 · Key Takeaways for Practitioners

FAQ · Frequently Asked Questions

Real-Time Fraud Detection at 50K TPS:
Why We Ditch Batch for Kafka + Flink + Iceberg