How Vipra Software built an Apache Flink + ClickHouse real-time Network Operations Centre platform that ingests 1B+ hourly telemetry events and delivers sub-second anomaly detection to a tier-1 telecom operator.
A tier-1 telecom operator managing a national 5G network was generating over 1 billion telemetry events per hour from network equipment — routers, base stations, optical transmission nodes, and core network elements. This data contained the signals needed to detect network degradation, predict equipment failures, and identify capacity bottlenecks before they impacted subscribers. The problem: none of it was being processed in real time.
Network events were written to flat files, compressed, and batch-loaded into a legacy network management system every 30 minutes. By the time the NOC team saw a degradation signal, subscriber impact had already begun. Mean time to detect network incidents was 22 minutes; mean time to resolve was 4.5 hours. For a network carrying national voice and data traffic, these figures translated directly into SLA penalties, customer churn, and reputational damage during high-profile incidents.
The data engineering challenge at 1B events per hour is fundamentally different from typical enterprise analytics workloads. Conventional row-oriented transactional databases cannot ingest at this throughput. Anomaly detection requires stateful stream processing, because the algorithms must maintain rolling windows of metric baselines across tens of thousands of network elements simultaneously. And the analytical store must answer queries over billions of rows in under a second, which rules out conventional batch-oriented data warehouses.
Vipra Software selected Apache Flink as the stream processing engine for its mature stateful processing capabilities and exactly-once semantics, and ClickHouse as the analytical store for its exceptional query performance on time-series data at scale — consistently delivering sub-second query responses on billions of rows that would take minutes in conventional warehouses.
The architecture handles 1B+ events/hour through a layered design: Kafka absorbs ingestion bursts (network equipment generates bursty traffic patterns during maintenance windows and incidents), Flink provides stateful processing with exactly-once semantics, and ClickHouse serves as the time-series analytical store. Each layer scales independently — Kafka partitions scale horizontally, Flink parallelism scales by adding task slots, and ClickHouse scales by adding shards.
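To make the ingestion-layer sizing concrete, 1B events per hour works out to roughly 278,000 events per second sustained. A back-of-the-envelope sketch of partition sizing follows; the per-partition throughput and burst-headroom figures are illustrative assumptions, not the operator's measured numbers:

```python
import math

# Sustained ingest rate implied by 1B+ events per hour.
EVENTS_PER_HOUR = 1_000_000_000
events_per_sec = EVENTS_PER_HOUR / 3600  # ~278,000 events/sec

# Illustrative assumptions (not measured figures from this deployment):
PER_PARTITION_EPS = 25_000  # sustained events/sec one Kafka partition absorbs
BURST_FACTOR = 2.0          # headroom for maintenance-window/incident bursts

# Partitions needed to absorb peak bursts without consumer lag.
partitions = math.ceil(events_per_sec * BURST_FACTOR / PER_PARTITION_EPS)
```

Under these assumptions the topic needs 23 partitions; Flink parallelism is then typically matched to the partition count so each task slot consumes one partition, which is what lets the two layers scale independently but in step.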
The anomaly detection algorithm runs as a Flink stateful function maintaining a rolling exponentially weighted moving average (EWMA) and standard deviation for each metric on each network element. When a current metric value deviates more than N standard deviations from the EWMA baseline, an anomaly event is emitted with severity classification (warning, critical, emergency) based on deviation magnitude and metric importance weighting. This approach adapts to normal metric evolution over time — reducing false positives as the baseline follows seasonal and time-of-day patterns.
Mean time to detect network incidents dropped from 22 minutes to under 45 seconds — a 97% reduction — in the first month of production operation. The NOC team shifted from reactive incident response to proactive anomaly investigation, often resolving developing issues before the subscriber-impact threshold was reached.
Mean time to resolve improved by 70%, driven by the network topology correlation view that eliminated the manual process of mapping an anomaly alert to its physical location in the network. Engineers could see the affected equipment, its upstream and downstream dependencies, and historical incident patterns for that element in a single dashboard view.
The platform's predictive capability — identifying equipment with degrading metric trends before failure — prevented 3 major network outages in the first quarter of operation that historical analysis suggested would have caused significant subscriber impact. The SLA penalty savings from these prevented outages alone represented a return exceeding the full platform investment.