Fraud is adversarial and fast: card-testing runs, account takeovers, and mule-network cashouts complete in minutes. A pipeline that lands transactions in a warehouse and scores them on a schedule concedes that entire window — moving the same rules from a 3-hour batch cycle to an 800-millisecond streaming path changes not just when you catch fraud, but what is catchable at all.
This article is the reference architecture we deploy in production streaming engagements: Kafka for durable ordered transport, Flink for stateful detection with exactly-once semantics, Iceberg as the unified batch+streaming lakehouse. The hard parts get the depth — event time, watermarks, state TTL, late data, and the chaos drill that separates exactly-once from optimism.
The 50K TPS figures describe the reference design target. The engineering underneath is documented Vipra production work: 1B+ events/hour with sub-second anomaly detection (network telemetry, Flink + ClickHouse) and sub-3-minute end-to-end Kafka platforms (LXP streaming).
01 · Why Batch Loses to Fraud by Design
The nightly fraud batch has a precise economic meaning: every fraud pattern completing faster than your batch interval is free for the attacker. Card-testing runs burn through stolen BINs in 20 minutes. Account-takeover cashouts finish in under an hour. The batch job that runs at 02:00 produces a beautifully accurate report of money that is already gone.
Latency changes the detection category, not just the detection time. Velocity rules — five transactions in four countries within ten minutes, three failed CVVs then a success, dormant account suddenly testing small amounts — only exist in a streaming world, because by the time batch sees them, they are history rather than signal. The reference target is decision inside the authorization window: gateway event to allow/block/step-up in under a second, with 800ms as the engineered P99.
02 · The Reference Architecture, End to End
Throughput math for the 50K TPS target: 64 partitions on the transaction topic gives ~780 TPS/partition — comfortable for Flink subtasks running enrichment plus scoring at single-digit-millisecond per event. Headroom rule: provision for 3× observed peak, because fraud traffic spikes are correlated with exactly the moments you cannot afford lag (Black Friday, breach aftermath).
03 · Event Time vs Processing Time: The Non-Negotiable
A transaction's fraud signal depends on when it happened, not when your pipeline saw it. Build every window on event time with watermarks, and choose the lag deliberately:
Flink — watermark strategy for card transactionsWatermarkStrategy .<Transaction>forBoundedOutOfOrderness(Duration.ofSeconds(45)) .withTimestampAssigner((txn, ts) -> txn.getAuthTimestampMillis()) .withIdleness(Duration.ofSeconds(30)); // quiet partitions can't stall the watermark
Too tight a watermark and late transactions miss their window; too loose and every decision waits. For card traffic, 30–60 seconds of bounded disorder is a sound default — but the real design choice is what happens outside the bound, which is Section 06. The withIdleness clause matters more than it looks: one quiet partition (a regional gateway overnight) will otherwise freeze watermarks for the whole job, and your windows simply stop firing. This is the single most common Flink production incident we get called into.
04 · Exactly-Once: What It Promises and What It Doesn't
Flink checkpoints plus Kafka transactional producers give you exactly-once state and output: a transaction influences a window once, and an alert is emitted once, across any number of failures and restarts. The configuration that delivers it:
Flink — exactly-once checkpointingexecution.checkpointing.interval: 15s execution.checkpointing.mode: EXACTLY_ONCE state.backend: rocksdb state.backend.incremental: true state.checkpoints.dir: s3://fraud-checkpoints/prod execution.checkpointing.timeout: 120s # Kafka sink sink.delivery-guarantee: exactly-once sink.transactional-id-prefix: fraud-alerts-prod
What it does not promise: anything beyond the transactional boundary. The decision API consuming the alert topic must be idempotent on alert ID, because Kafka read-committed consumers can still re-read after their own failures. And exactly-once costs latency — transactional commits gate on checkpoint completion, so your checkpoint interval is a floor on alert-topic visibility. At a 15s interval with 800ms decisions, the alert is computed in 800ms and committed within the checkpoint cycle; the decision API reads uncommitted for speed and reconciles on commit. That nuance — fast path uncommitted, audit path committed — is the trick that keeps both the latency SLO and the correctness story.
At 50K TPS the binding constraint is state backend throughput, not CPU: RocksDB with incremental checkpoints keeps checkpoint duration under 10% of the interval. When it creeps above that, you are heading for a backpressure spiral — scale state before scaling compute.
05 · State Design at 50K TPS
Velocity rules require per-key state across four key spaces — card, account, device, merchant. The design rules that keep multi-billion-key state tractable:
| Rule | Implementation | Why it matters at scale |
|---|---|---|
| Bound every state with TTL | StateTtlConfig = window length, enforced by backend | Unbounded state is a slow-motion outage; 24h velocity = 24h TTL, by construction |
| Aggregate, don't accumulate | Counts, sums, HyperLogLog sketches — not raw event lists | O(1) state per key; a card's 24h profile is ~200 bytes, not 200 events |
| Rules as broadcast state | Control topic → broadcast stream → connected operator | Risk team ships a rule in minutes, no redeploy, full audit trail of rule versions |
| Key-space isolation | Separate operators per key space, fan-in to decision | A hot merchant key can't starve card-level detection |
Flink — velocity state, aggregating not accumulating// per-card 24h profile: O(1) size regardless of transaction count ValueStateDescriptor<CardProfile> desc = new ValueStateDescriptor<>("card-profile", CardProfile.class); StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(24)) .setUpdateType(UpdateType.OnCreateAndWrite) .cleanupInRocksdbCompactFilter(10_000) .build(); desc.enableTimeToLive(ttl); // profile = {count_1h, count_24h, sum_24h, countries_hll, last_seen_ms, decline_streak}
The broadcast-rules pattern deserves emphasis because it changes the organization, not just the job: fraud analysts ship YAML rule changes through a reviewed control topic, the operator validates and applies them on the fly, and every rule version is itself an event in Kafka — which means every alert can cite the exact rule version that fired it. That citation is what your disputes team needs and your regulator eventually asks for.
06 · Late-Arriving Transactions: A Product Decision in Code
Settlement files, retried gateways, store-and-forward terminals, and network partitions guarantee that some transactions arrive minutes late — past any reasonable watermark. What happens next is a product decision per rule, and it belongs in code, not in incident retrospectives:
Flink — late events routed to an explicit correction pathOutputTag<Transaction> lateTag = new OutputTag<>("late-txns"){}; SingleOutputStreamOperator<Alert> alerts = txns .keyBy(Transaction::getCardId) .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1))) .allowedLateness(Time.minutes(2)) // in-window corrections, auto-retracted .sideOutputLateData(lateTag) // beyond that: explicit policy .aggregate(new VelocityAggregator(), new AlertEmitter()); DataStream<Transaction> late = alerts.getSideOutput(lateTag); // policy per destination: late.sinkTo(riskScoreUpdater); // silently adjust customer risk profile late.sinkTo(icebergCorrections); // analytics truth gets the event regardless // deliberately NOT wired to the decision API — auth already happened
Three distinct fates, made explicit: events within allowedLateness retract and re-emit window results automatically (the decision API sees a correction it can act on — e.g., release a step-up hold); events beyond it update the customer's risk score and the lakehouse, but never retro-decide an authorization that already settled. Writing this as topology rather than policy-document means the behaviour is tested, versioned, and survives team turnover.
07 · Iceberg: One Table Layer for Stream and Batch
The historical answer to "streaming truth vs batch truth" was the lambda architecture — two pipelines, two stores, eternal reconciliation. Iceberg ends it: Flink commits ACID snapshots into the same tables that serve analyst SQL, model training, and compliance queries.
Flink SQL — streaming upsert into the fraud lakehouseCREATE TABLE iceberg_catalog.fraud.transactions ( txn_id STRING, card_id STRING, auth_ts TIMESTAMP(3), amount DECIMAL(12,2), merchant_id STRING, decision STRING, rule_version STRING, PRIMARY KEY (txn_id) NOT ENFORCED ) PARTITIONED BY (days(auth_ts)) WITH ('format-version'='2', 'write.upsert.enabled'='true'); INSERT INTO iceberg_catalog.fraud.transactions SELECT txn_id, card_id, auth_ts, amount, merchant_id, decision, rule_version FROM decided_transactions;
What this buys, concretely: replay — when rules or models change, reprocess Kafka history and compare against Iceberg snapshots of what the old logic decided; time travel — "show me the state of knowledge when this disputed transaction was approved" is a FOR TIMESTAMP AS OF query, which is an audit superpower; one schema — the feature pipeline trains on exactly the columns the detector wrote, eliminating the training/serving skew that quietly rots fraud models. Compact small files hourly (streaming commits produce many) and expire snapshots on a compliance-aligned schedule — operational chores, but scheduled ones, not architectural debt.
(From 3-Hour Batch)
Throughput
Streaming Scale
Chaos Drill — Required
08 · Lessons Learned: The Hard Truths
- Idle partitions stall watermarks, and it will page you at 3am. A quiet regional gateway freezes the global watermark and windows stop firing — silently.
withIdlenessis not optional configuration; it is the difference between detection and a very expensive event archive. - Exactly-once ends at your consumer. Teams celebrate transactional sinks and then deploy a decision API that double-processes on restart. Idempotency on alert ID is part of the architecture, not a downstream team's problem.
- State lists kill you at scale; sketches don't. The naive "keep last N transactions per card" design works in the demo and dies at millions of keys. Aggregating state (counters + HLL) was the difference between 4GB and 400GB of checkpoint.
- Rules belong to the risk team, so give them a deployment path. Broadcast state turned rule shipping from a two-week engineering ticket into a reviewed config change — and rule iteration speed is fraud-loss reduction, directly.
- Replay is the most-used feature you'll build. Every rule change, every model retrain, every "would we have caught this?" question is a replay. Budget for it: Kafka retention and Iceberg snapshots are the price of answerable questions.
- The chaos drill is the acceptance test. Kill a TaskManager at peak synthetic load; require zero lost transactions and zero duplicate alerts. If you haven't demonstrated it, you have optimism, not exactly-once.
09 · Key Takeaways for Practitioners
Watermarks at 30–60s for card traffic, idleness handling on, and never mix in processing time "temporarily."
Flink + transactional Kafka inside; idempotent consumers outside. Both, or neither matters.
Profiles in O(1) per key — counts, sums, sketches. TTL equals window length by construction.
Risk ships rules in minutes through a reviewed control topic; every alert cites its rule version.
One ACID table layer for stream writes, analyst SQL, training, replay, and time-travel audits.
TaskManager kill at peak: zero loss, zero duplicates, checkpoint under 10% of interval. Then it's production.
For the operational underbelly — backpressure diagnosis, schema evolution, DLQ design — see CDC Pipeline Hard Parts; for the streaming-economics decision itself, CDC vs Full Load. The production systems behind this article's claims are documented in the network telemetry and LXP streaming case studies.