How Vipra Software built an Apache Flink + ClickHouse real-time Network Operations Centre platform that ingests 1B+ hourly telemetry events and delivers sub-second anomaly detection to a tier-1 telecom operator.
A tier-1 telecom operator managing a national 5G network was generating over 1 billion telemetry events per hour from network equipment — routers, base stations, optical transmission nodes, and core network elements. This data contained the signals needed to detect network degradation, predict equipment failures, and identify capacity bottlenecks before they impacted subscribers. The problem: none of it was being processed in real time.
Network events were written to flat files, compressed, and batch-loaded into a legacy network management system every 30 minutes. By the time the NOC team saw a degradation signal, subscriber impact had already begun. Mean time to detect network incidents was 22 minutes; mean time to resolve was 4.5 hours. For a network carrying national voice and data traffic, these figures translated directly into SLA penalties, customer churn, and reputational damage during high-profile incidents.
The data engineering challenge at 1B events per hour is fundamentally different from typical enterprise analytics workloads. Conventional row-oriented transactional databases cannot ingest at this throughput. Anomaly detection requires stateful stream processing, because the algorithms must maintain rolling windows of metric baselines across tens of thousands of network elements simultaneously. And the analytical store must answer queries over billions of rows in under a second, which rules out conventional batch-oriented data warehouses.
Vipra Software selected Apache Flink as the stream processing engine for its mature stateful processing capabilities and exactly-once semantics, and ClickHouse as the analytical store for its exceptional query performance on time-series data at scale — consistently delivering sub-second query responses on billions of rows that would take minutes in conventional warehouses.
The architecture handles 1B+ events/hour through a layered design: Kafka absorbs ingestion bursts (network equipment generates bursty traffic patterns during maintenance windows and incidents), Flink provides stateful processing with exactly-once semantics, and ClickHouse serves as the time-series analytical store. Each layer scales independently — Kafka partitions scale horizontally, Flink parallelism scales by adding task slots, and ClickHouse scales by adding shards.
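To make the ingestion-layer sizing concrete, 1B events per hour works out to roughly 278,000 events per second sustained. A back-of-the-envelope sketch of partition sizing follows; the per-partition throughput and burst-headroom figures are illustrative assumptions, not the operator's measured numbers:

```python
import math

# Sustained ingest rate implied by 1B+ events per hour.
EVENTS_PER_HOUR = 1_000_000_000
events_per_sec = EVENTS_PER_HOUR / 3600  # ~278,000 events/sec

# Illustrative assumptions (not measured figures from this deployment):
PER_PARTITION_EPS = 25_000  # sustained events/sec one Kafka partition absorbs
BURST_FACTOR = 2.0          # headroom for maintenance-window/incident bursts

# Partitions needed to absorb peak bursts without consumer lag.
partitions = math.ceil(events_per_sec * BURST_FACTOR / PER_PARTITION_EPS)
```

Under these assumptions the topic needs 23 partitions; Flink parallelism is then typically matched to the partition count so each task slot consumes one partition, which is what lets the two layers scale independently but in step.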
The anomaly detection algorithm runs as a Flink stateful function maintaining a rolling exponentially weighted moving average (EWMA) and standard deviation for each metric on each network element. When a current metric value deviates more than N standard deviations from the EWMA baseline, an anomaly event is emitted with severity classification (warning, critical, emergency) based on deviation magnitude and metric importance weighting. This approach adapts to normal metric evolution over time — reducing false positives as the baseline follows seasonal and time-of-day patterns.
Mean time to detect network incidents dropped from 22 minutes to under 45 seconds — a 97% reduction — in the first month of production operation. The NOC team shifted from reactive incident response to proactive anomaly investigation, often resolving developing issues before the subscriber-impact threshold was reached.
Mean time to resolve improved by 70%, driven by the network topology correlation view that eliminated the manual process of mapping an anomaly alert to its physical location in the network. Engineers could see the affected equipment, its upstream and downstream dependencies, and historical incident patterns for that element in a single dashboard view.
The platform's predictive capability — identifying equipment with degrading metric trends before failure — prevented 3 major network outages in the first quarter of operation that historical analysis suggested would have caused significant subscriber impact. The SLA penalty savings from these prevented outages alone represented a return exceeding the full platform investment.