Cheating is an economics problem before it is a detection problem: cheaters churn the legitimate players who fund the game, cheat developers iterate like the adversaries they are, and a single viral false-positive ban costs more goodwill than a hundred missed cheaters. Whatever you build must therefore optimise two numbers in tension — detection latency and false-positive rate — and be honest that the second one rules.
The architecture: session telemetry streaming through Kafka at 10M+ daily sessions, Flink CEP and statistical scoring against per-player baselines from a feature store, graduated response (shadow-flag → restrict → review → ban) with confidence-tiered automation, and human review queues wired in as a design element rather than an apology.
Session scale, the <0.1% false-positive target, and the $12M revenue-protection figure are labelled reference values. The streaming engineering underneath is documented Vipra production work: 1B+ events/hour with sub-second anomaly detection, and the same exactly-once Flink discipline as our fraud-detection architecture — anti-cheat is fraud detection where the currency is fairness.
01 · The Economics of Cheating — and of False Positives
Run the numbers before the architecture. A competitive title with 10M daily sessions losing 3–5% of its paying base annually to cheat-driven churn is leaking revenue at exactly the scale of the reference scenario's $12M/year. Against that: a false ban of a streamer with an audience is a PR incident with a refund tail, and a false-positive rate above noise level poisons the appeal queue and the community's trust simultaneously.
This asymmetry dictates the design: detection can be aggressive; consequence must be conservative. The pipeline detects in seconds, but the automated path to a ban is gated by confidence tiers, corroboration windows, and human review for everything ambiguous — the same graduated-response shape as our fraud architecture, with the block/step-up/allow ladder relabelled shadow-flag/restrict/ban.
02 · The Architecture, End to End
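Throughput math: 10M daily sessions × ~2K events/session is ~20B events/day, which averages ~230K events/sec (about 0.83B events/hour), with evening peaks 3–4× that. The average sits inside the envelope our production telemetry platform sustains (1B+ events/hour with sub-second detection); evening peaks of roughly 2.5–3.3B events/hour should be sized against that platform's measured headroom rather than assumed. The gaming twist is not volume but adversarial drift, which Sections 04–06 address.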
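The throughput arithmetic is easy to re-run when the reference values change. The sketch below is illustrative only: the class and method names are ours for this example, and the inputs are the reference values quoted above, not measurements.

```java
// Back-of-envelope throughput check. Inputs are reference values, not measurements.
final class ThroughputEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Average events per second for a daily session count and an events-per-session figure. */
    static double averageEventsPerSecond(long dailySessions, long eventsPerSession) {
        return (double) (dailySessions * eventsPerSession) / SECONDS_PER_DAY;
    }

    /** Converts a per-second rate into events per hour. */
    static double eventsPerHour(double eventsPerSecond) {
        return eventsPerSecond * 3_600.0;
    }

    /** Applies a peak multiplier (for example 3 or 4 for evening peaks) to an average rate. */
    static double peakEventsPerSecond(double averagePerSecond, double peakMultiplier) {
        return averagePerSecond * peakMultiplier;
    }
}
```

Calling `ThroughputEstimate.averageEventsPerSecond(10_000_000L, 2_000L)` gives the average rate; passing the result through `eventsPerHour` puts it in the same unit as the 1B+ events/hour platform figure, so the two can be compared directly.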
03 · The Data Flow: Session Telemetry to Verdict
The corroboration rule is the false-positive firewall: no single detector can ban. An aim-statistics anomaly alone shadow-flags; aim anomaly plus input-cadence signature plus percentile jump escalates. Independent evidence multiplies confidence in a way one detector's higher threshold never can — the same principle as the self-consistency votes in our LLM grading pipeline, applied to adversaries instead of essays.
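A minimal sketch of the corroboration rule, under stated assumptions: the enum values mirror the three detectors named above, the `floor` parameter is a hypothetical knob for the minimum number of distinct detectors, and the joint-rate helper assumes the detectors err independently, which real detectors only approximate.

```java
import java.util.Set;

final class Corroboration {
    /** Detector families named in the text; the enum itself is illustrative. */
    enum Detector { AIM_STATISTICS, INPUT_CADENCE, PERCENTILE_JUMP }

    enum Outcome { NONE, SHADOW_FLAG, ESCALATE }

    /**
     * A single detector only ever shadow-flags. Escalation needs at least
     * `floor` distinct detectors, and the floor itself may not drop below two.
     */
    static Outcome evaluate(Set<Detector> fired, int floor) {
        if (floor < 2) {
            throw new IllegalArgumentException("corroboration floor must be at least 2");
        }
        if (fired.isEmpty()) {
            return Outcome.NONE;
        }
        return fired.size() >= floor ? Outcome.ESCALATE : Outcome.SHADOW_FLAG;
    }

    /** Joint false-positive rate, ASSUMING the detectors' errors are independent. */
    static double jointFalsePositiveRate(double... perDetectorRates) {
        double joint = 1.0;
        for (double rate : perDetectorRates) {
            joint *= rate;
        }
        return joint;
    }
}
```

Under the independence assumption, two detectors that each misfire on 1% of clean sessions misfire together on about 0.01% of them. Raising one detector's threshold trades recall for precision; it does not buy that multiplication.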
04 · Behavioral Pattern Matching in Flink
Cheat detection layers four detector families, cheapest first:
Flink CEP: the impossible-input rule family (Java, simplified)

```java
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// Family 1: physics violations. Cheap, certain, CEP-shaped.
// InputEvent and HUMAN_SACCADE_LIMIT are assumed to be defined elsewhere in the job.
// where() takes a condition object, so lambdas are wrapped in SimpleCondition.of (Flink 1.16+).
Pattern<InputEvent, ?> flickPattern = Pattern.<InputEvent>begin("pre")
    .where(SimpleCondition.of(e -> e.angularVelocity() < HUMAN_SACCADE_LIMIT))
    .next("flick")
    .where(SimpleCondition.of(e ->
        e.angularVelocity() > HUMAN_SACCADE_LIMIT * 3
            && e.endsOnTarget() && e.fired()))   // snap-to-head
    .timesOrMore(4)                              // repetition, not luck
    .within(Time.minutes(2));

// Family 2: self-baseline. Keyed state, the player vs themselves:
// headshot %, reaction-time distribution, accuracy-by-range curves as
// aggregating state (counts + digests, TTL 30d). A 45th-percentile player
// posting 99th-percentile numbers overnight is a signal no rule misses.

// Families 3 & 4: population percentile shifts + ONNX sequence models
// on input cadence. Humans are noisy, scripts are clean; the cleanliness
// itself is the tell. Models score in-stream; training lives in the lakehouse.
```
State discipline is the fraud playbook verbatim: aggregating state (digests, not event lists), TTL bounded to the baseline window, rules deployed as broadcast config so the anti-cheat team ships a new signature in minutes when a cheat build drops — and every verdict cites the rule/model version that fired it, because appeals will ask.
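To make "aggregating state, not event lists" concrete, here is a minimal plain-Java sketch of a per-player baseline that keeps three numbers regardless of how many events arrive. It uses Welford's online mean and variance as the simplest possible aggregate; the digests mentioned above would sit in the same slot. The class name and the z-score framing are assumptions for illustration, and the Flink keyed-state and TTL wiring is deliberately left out.

```java
/** Constant-size running baseline for one metric of one player (illustrative). */
final class SelfBaseline {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Folds one observation into the aggregate using Welford's online algorithm. */
    void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation; 0 until at least two observations exist. */
    double stdDev() {
        return count < 2 ? 0.0 : Math.sqrt(m2 / (count - 1));
    }

    /** How many standard deviations `value` sits from this player's own history. */
    double zScore(double value) {
        double sd = stdDev();
        return sd == 0.0 ? 0.0 : (value - mean) / sd;
    }
}
```

A player whose session headshot rate has hovered around 0.20 and who suddenly posts 0.60 produces a very large z-score against their own history, even if 0.60 is unremarkable for the top of the population. No event list is retained, so state size per player stays flat as sessions accumulate.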
05 · The Player Feature Store: Every Account Has a History
Context converts detections into decisions. The feature store (Redis hot path, lakehouse-backed) holds per-player: skill trajectory (30/90-day envelopes per metric), account economics (age, purchases, prior flags and their outcomes), device and network fingerprints, and social graph signals (party patterns, report sources weighted by reporter credibility).
Two properties matter more than the feature list. Point-in-time correctness — models train on what was knowable at detection time, the same discipline as every feature store we build (digital twin, AVM) — and outcome feedback: every review verdict and appeal result writes back as a label, so the models and the reviewer-credibility weights improve from ground truth, not assumption. Smurf detection (good player, new account) lives here too — skill envelope wildly ahead of account age is its own signal family, handled with restriction rather than ban because smurfing is a fairness problem, not a software crime.
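Point-in-time correctness reduces to one lookup rule: when building a training row for a detection at time t, read the latest feature value that was known at or before t, never a later one. A minimal in-memory sketch follows; the class and method names are hypothetical and a real feature store would back this with the lakehouse rather than a `TreeMap`.

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.TreeMap;

/** Versioned history of one feature for one player (illustrative, in-memory only). */
final class FeatureHistory {
    // Values keyed by the epoch-millisecond timestamp at which they became known.
    private final TreeMap<Long, Double> versions = new TreeMap<>();

    void record(long knownAtMillis, double value) {
        versions.put(knownAtMillis, value);
    }

    /** Latest value known at or before `asOfMillis`; empty if nothing was known yet. */
    OptionalDouble asOf(long asOfMillis) {
        Map.Entry<Long, Double> entry = versions.floorEntry(asOfMillis);
        return entry == null ? OptionalDouble.empty() : OptionalDouble.of(entry.getValue());
    }
}
```

Training on `asOf(detectionTime)` rather than on the current value keeps later information, such as a review outcome or a post-ban skill collapse, from leaking into the features the model sees.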
06 · The Ban Pipeline: Automation with Human Gates
The response ladder is where engineering meets community trust:
| Tier | Trigger | Action | Reversibility |
|---|---|---|---|
| Shadow-flag | Single detector, any confidence | Observation only; corroboration window opens | Silent, total |
| Restrict | Moderate corroborated confidence | Cheater-pool matchmaking; economy limits | Invisible, instant |
| Human review | High confidence, OR any high-value account | Queue with evidence pack: replay clips, detector outputs, baseline charts | Reviewer decides |
| Auto-ban | ≥2 independent detectors + replay evidence, highest tier only | Ban with evidence attached; appeal path included in the notice | Appeal = replay review |
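The ladder in the table can be read as a single decision function. The sketch below is one possible reading, not the production policy: the three confidence thresholds are made-up placeholders, and it interprets "any high-value account" as "a high-value account that would otherwise be actioned goes to a human instead".

```java
final class ResponseLadder {
    enum Tier { NONE, SHADOW_FLAG, RESTRICT, HUMAN_REVIEW, AUTO_BAN }

    // Hypothetical thresholds for illustration; real values come from calibration.
    static final double MODERATE = 0.60;
    static final double HIGH = 0.85;
    static final double HIGHEST = 0.98;

    static Tier decide(int independentDetectors, double confidence,
                       boolean replayEvidence, boolean highValueAccount) {
        if (independentDetectors <= 0) {
            return Tier.NONE;
        }
        // A single detector, at any confidence, only opens the corroboration window.
        if (independentDetectors < 2 || confidence < MODERATE) {
            return Tier.SHADOW_FLAG;
        }
        // High-value accounts never take an automated consequence.
        if (highValueAccount) {
            return Tier.HUMAN_REVIEW;
        }
        // Auto-ban needs corroboration, replay evidence, and the top confidence tier.
        if (confidence >= HIGHEST && replayEvidence) {
            return Tier.AUTO_BAN;
        }
        if (confidence >= HIGH) {
            return Tier.HUMAN_REVIEW;
        }
        return Tier.RESTRICT;
    }
}
```

The ordering of the checks carries the policy: the single-detector guard comes first so that no confidence score can bypass corroboration, and the high-value guard precedes the auto-ban branch so that the most visible accounts always reach a reviewer.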
The review queue is engineered like a product: evidence packs assemble automatically (the reviewer watches a 20-second clip beside the detector chart, decides in under a minute), queue SLAs are monitored, and reviewer agreement is calibrated the way we calibrate graders in the LLM grading architecture. The <0.1% false-positive target is measured, not asserted: every appeal outcome and every overturned review feeds the weekly metric, sliced by detector family — the slice is where you find the rule that's quietly misfiring on one region's network jitter.
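A minimal sketch of the sliced weekly metric, assuming each actioned verdict is eventually labelled as upheld or overturned and is attributed to one detector family. The class name and string keys are illustrative. Overturned verdicts are a proxy: players who never appeal are not counted, so the measured rate is a lower bound on the true one.

```java
import java.util.HashMap;
import java.util.Map;

/** Overturn rate per detector family for one reporting window (illustrative). */
final class FalsePositiveMetric {
    // family -> {actioned verdicts, overturned verdicts}
    private final Map<String, int[]> counts = new HashMap<>();

    /** Records one actioned verdict and whether review or appeal later overturned it. */
    void record(String detectorFamily, boolean overturned) {
        int[] c = counts.computeIfAbsent(detectorFamily, k -> new int[2]);
        c[0]++;
        if (overturned) {
            c[1]++;
        }
    }

    /** Overturned share of actioned verdicts; 0 if the family has no verdicts yet. */
    double rate(String detectorFamily) {
        int[] c = counts.get(detectorFamily);
        return (c == null || c[0] == 0) ? 0.0 : (double) c[1] / c[0];
    }

    /** True when a family's measured rate exceeds the target (0.001 for the <0.1% reference). */
    boolean breaches(String detectorFamily, double target) {
        return rate(detectorFamily) > target;
    }
}
```

Keeping the counts per family rather than in one global pair is what makes a single misfiring rule visible: a healthy overall rate can hide one family that is well over target. Adding region as a second key follows the same pattern.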
07 · Business Implementation: Fairness as Revenue Protection
The reference scenario quantifies what every live-ops team knows qualitatively: cheating churns payers. A 10M-session title attributing 3–5% of paying-player churn to cheat encounters recovers, with mid-session detection and restriction, the bulk of that loss — the $12M/year revenue protection figure is that recovery, and it dwarfs the platform's run cost by an order of magnitude. The instrumentation to prove it: cohort retention curves split by cheat-encounter rate, before and after deployment — the same evidence discipline as everything else in the platform.
Implementation arc that works: weeks 1–6, telemetry spine and shadow-flagging only (the platform observes, baselines build, nobody is actioned); weeks 7–12, restriction tier live (reversible consequences while false-positive measurement matures); quarter 2, review queue and auto-ban for the corroborated top tier. Shipping bans before baselines is the classic failure — the first viral false positive costs more trust than the cheaters did.
08 · Lessons Learned: The Hard Truths
- Server-authoritative telemetry or nothing. Client-reported stats are the cheater's first forgery. Every signal the detectors trust must originate from the server's own observation of inputs and outcomes.
- The corroboration rule is non-negotiable under pressure. After every viral cheating clip, someone proposes letting the hot new detector ban alone. That is how false-positive incidents happen; the two-detector floor held every time we defended it.
- Restriction beats banning for almost everything. Cheater-pool matchmaking removes the victim experience instantly, silently, reversibly — and watching restricted cheaters play each other is the best training data the models ever got.
- War-game weekly or drift quietly. Cheat vendors ship updates like the software companies they are. The lab harness that replays fresh cheat builds against the detector suite is the platform's immune system, and detection rate against new builds is a tracked KPI.
- Network jitter mimics cheating. Our worst false-positive cluster traced to one region's packet loss producing input patterns a cadence model read as scripted. Per-region baselines and a network-quality feature fixed it; the slice-by-everything dashboard found it.
- Appeals are training data, not overhead. Every overturned verdict is a labelled false positive — the most valuable label the system receives. The appeal workflow writes back to the feature store by design.
09 · Key Takeaways for Practitioners
- Detect aggressively, apply consequences conservatively. The graduated ladder exists because bans are revenue events both ways.
- Independent corroboration multiplies confidence; one detector's higher threshold never can. Two-detector floor, always.
- Self-baselines catch what population stats miss; aggregating state with TTL keeps it tractable at 10M sessions.
- Engineer the review queue: auto-assembled evidence packs, sub-minute decisions, calibrated reviewers, SLA'd queues. Humans are a tier, not a fallback.
- Replay fresh cheat builds weekly; track detection rate against new builds as a KPI. Drift is the default; the harness fights it.
- Prove the value with cohort retention split by cheat-encounter rate, before and after deployment. Fairness is measurable; measure it.
The streaming foundations are documented production work: 1B+ events/hour telemetry and the exactly-once Flink fraud architecture this design adapts. The feature-store and human-review disciplines recur across our LLM grading and digital twin pieces.