
The Attention Economy:
Real-Time Learner Telemetry with ClickHouse + Kafka

Learning platforms measure the scarcest resource on the internet — attention — and most measure it a day late. The architecture that fixes it: Kafka event spine, ClickHouse analytics engine, materialized views that keep dashboards never-stale, and funnels under 100ms. Built on Vipra’s proven sub-3-minute production pattern.

Domain: EdTech / Learning Platforms
Design Volume: 50M+ events/day
Funnel Query Latency: < 100ms
Vipra Proven: <3min E2E · 1B+ events/hr
Stack: Kafka · ClickHouse · Superset
Published: June 2026

Executive Summary

A learner who struggles on Tuesday night and gets help Thursday morning got help one session too late — disengagement compounds per session, and nightly-batch analytics can only autopsy it. Moving telemetry to streaming changes the product category: live cohort dashboards, struggle detection while the session is open, and recommendations that reflect the last ten minutes.

The architecture: a versioned xAPI-informed event taxonomy on Kafka, ClickHouse as the analytical engine (high-cardinality time aggregation is its home turf), insert-time materialized views serving sub-100ms funnels, humble-statistics anomaly detection routed to humans, and FERPA/GDPR privacy designed in rather than bolted on.

The sub-3-minute end-to-end target is not aspirational — it is Vipra's documented production result (Kafka LXP platform, millions of learners), and our ClickHouse telemetry platform sustains 1B+ events/hour. This architecture is those two production systems, composed.

01 · Why Learning Analytics Lags a Day Behind Learning

The standard LMS analytics stack is a nightly export into a warehouse and a morning dashboard refresh. Structurally, that architecture can answer what happened and never what is happening — and in learning, the difference is the product. Disengagement compounds per session: the learner who hits a confusing segment tonight and gets no response is measurably less likely to start the next session at all. Intervention has a half-life measured in hours.

Streaming telemetry creates capabilities batch cannot approximate: instructors watching live cohort dashboards during a synchronous session; struggle signals flagged while re-engagement still works; content teams seeing a confusing video segment spike replays the same day they shipped it; recommendations that know what the learner did ten minutes ago. Our LXP engagement made exactly this transition — nightly batch to sub-3-minute latency for a platform serving millions of learners — and the product changed around it.

02 · The Architecture, End to End

emit · Clients → Kafka. Web, mobile, and SCORM players emit versioned events keyed by learner ID; the schema registry enforces the taxonomy at the gate.
land · Kafka → ClickHouse. A Kafka table engine plus a materialized-view pipeline lands events into MergeTree, partitioned by month and ordered by (course, cohort, learner, time).
aggregate · Insert-time rollups. Per-minute engagement, funnel counters, and cohort summaries are updated incrementally as rows arrive, with every view owned in the repo.
serve · Dashboards + APIs. Superset and embedded dashboards read rollups in milliseconds; recommendation features are exported to the serving layer.
watch · Anomaly + privacy layer. Baseline-deviation detection feeds human review queues; TTLs, pseudonymisation, and consent are enforced in-engine.

Volume math: 50M events/day averages ~580/sec and peaks around 10–15K/sec at synchronous-class boundaries, which is comfortable single-digit-node ClickHouse territory. Our telemetry production system runs this engine at 1B+ events/hour (roughly 278K/sec), about 20× this design's peak and several hundred times its average. That headroom is the point: capacity planning becomes boring, which is what capacity planning should be.
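The arithmetic behind those figures is short enough to check by hand. The sketch below only restates the numbers quoted above (50M events/day, 10–15K/sec peaks, 1B events/hour); the variable names are ours, and the peak range is taken from the text rather than derived.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60

# Figures quoted in the article.
design_events_per_day = 50_000_000
proven_events_per_hour = 1_000_000_000
peak_low, peak_high = 10_000, 15_000  # events/sec at class boundaries

design_avg_per_sec = design_events_per_day / SECONDS_PER_DAY
proven_per_sec = proven_events_per_hour / SECONDS_PER_HOUR

# Headroom of the proven rate over the design's average and peak rates.
headroom_vs_avg = proven_per_sec / design_avg_per_sec
headroom_vs_peak = (proven_per_sec / peak_high, proven_per_sec / peak_low)

print(f"design average: {design_avg_per_sec:,.0f} events/sec")
print(f"proven rate:    {proven_per_sec:,.0f} events/sec")
print(f"headroom vs average: {headroom_vs_avg:.0f}x")
print(f"headroom vs peak:    {headroom_vs_peak[0]:.1f}x to {headroom_vs_peak[1]:.1f}x")
```

The peak-to-average ratio (roughly 17–26×) is the number to size ingestion buffers against; the daily average mostly matters for storage.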

03 · The Event Spine: Taxonomy Before Throughput

The engineering risk in learning telemetry is semantic, not volumetric. Define the taxonomy first, version it, and enforce it at the gate:

event taxonomy — xAPI-informed, pragmatic (registry schema, excerpt)
{
  "event_version": "2.3",
  "event_type": "enum[content.view, content.start, content.complete, assess.attempt, assess.submit, assess.score, interact.pause, interact.replay, interact.speed_change, session.start, session.heartbeat, session.end]",
  "learner_id": "pseudonymous-uuid",
  "course_id": "string",
  "content_id": "string",
  "position_sec": "nullable-int",
  "cohort_id": "string",
  "institution_id": "string",
  "client_ts": "timestamp-ms",
  "server_ts": "timestamp-ms"
}

Three taxonomy decisions that pay forever: interaction events are the honest signals — pause, replay, and speed-change cluster around confusion in ways completion events never reveal; key by learner ID for ordered per-learner streams (session reconstruction depends on it); and carry both client and server timestamps — clock-skewed mobile clients will otherwise write your funnels backwards. Schema registry enforcement means a client release cannot silently rename a field a funnel depends on; we learned that one in production so you don't have to.
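To make the three decisions concrete, here is a minimal Python sketch of a gate check, a partition key, and funnel ordering. It is an illustration, not our production code: the function names are hypothetical, the rule that every field except position_sec is required is an assumption, and a real deployment would lean on the schema registry rather than hand-written checks.

```python
# Event types copied from the taxonomy excerpt.
EVENT_TYPES = frozenset({
    "content.view", "content.start", "content.complete",
    "assess.attempt", "assess.submit", "assess.score",
    "interact.pause", "interact.replay", "interact.speed_change",
    "session.start", "session.heartbeat", "session.end",
})

# Assumption: everything except the nullable position_sec is required.
REQUIRED_FIELDS = (
    "event_version", "event_type", "learner_id", "course_id", "content_id",
    "cohort_id", "institution_id", "client_ts", "server_ts",
)


def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event passes the gate."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if "event_type" in event and event["event_type"] not in EVENT_TYPES:
        problems.append(f"unknown event_type: {event['event_type']}")
    return problems


def partition_key(event: dict) -> bytes:
    """Message key: the pseudonymous learner ID, so one learner's events stay ordered."""
    return event["learner_id"].encode("utf-8")


def funnel_order(events: list[dict]) -> list[dict]:
    """Order by server time; client time only breaks ties."""
    return sorted(events, key=lambda e: (e["server_ts"], e["client_ts"]))


base = {
    "event_version": "2.3",
    "learner_id": "6f1c2b1e-0d1a-4c57-9a57-2f4c1d3e5a77",  # made-up pseudonymous UUID
    "course_id": "cs101", "content_id": "video-12",
    "cohort_id": "2026-spring", "institution_id": "inst-1",
}
# The start event comes from a device whose clock runs five minutes fast.
start = {**base, "event_type": "content.start",
         "client_ts": 1_700_000_300_000, "server_ts": 1_700_000_000_000}
complete = {**base, "event_type": "content.complete",
            "client_ts": 1_700_000_120_000, "server_ts": 1_700_000_120_000}

# A client release that renames a field fails at the gate instead of in a funnel.
renamed = dict(start)
renamed["learnerId"] = renamed.pop("learner_id")

print(validate_event(start))
print(validate_event(renamed))
print([e["event_type"] for e in funnel_order([complete, start])])
```

Sorting the same two events by client_ts alone would put content.complete before content.start, which is the backwards funnel described above.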

04 · ClickHouse: Built for Exactly This Query Shape

Engagement analytics is high-cardinality aggregation over time — millions of learners × thousands of content items, sliced live by cohort, course, and institution. That shape is ClickHouse's home turf:

ClickHouse — the base table (MergeTree, tuned for the access pattern)
CREATE TABLE events.learning_events
(
    event_date     Date DEFAULT toDate(server_ts),
    server_ts      DateTime64(3),
    event_type     LowCardinality(String),
    learner_id     UUID,
    course_id      LowCardinality(String),
    content_id     String,
    cohort_id      LowCardinality(String),
    institution_id LowCardinality(String),
    position_sec   Nullable(UInt32),
    props          Map(String, String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (course_id, cohort_id, learner_id, server_ts)
TTL event_date + INTERVAL 26 MONTH DELETE
SETTINGS index_granularity = 8192;

The ORDER BY is the design: course-and-cohort-scoped queries — which is everything an instructor dashboard asks — prune to a sliver of the table. LowCardinality on the dimension columns and Map for long-tail properties keep storage at a fraction of raw JSON; compression ratios of 10–20× on learning events are routine. And note the TTL clause — retention is an engine property here, which Section 07 turns into a compliance feature.
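Why the sort key prunes so well is easier to see in a toy model. The Python sketch below is not how ClickHouse is implemented; it is a simplified stand-in that keeps the first key of every 8,192-row granule and uses binary search to find which granules a (course, cohort) filter can touch. The data, names, and sizes are synthetic assumptions.

```python
import bisect
from itertools import product

GRANULE = 8192  # same value as index_granularity in the DDL


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every granule, like a sparse primary index."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def granules_for_prefix(index, prefix):
    """Return the half-open granule range [lo, hi) that can hold keys starting with prefix."""
    n = len(prefix)
    marks = [key[:n] for key in index]
    # The granule that starts just before the prefix may still contain its first rows.
    lo = max(bisect.bisect_left(marks, prefix) - 1, 0)
    # Granules whose first key is already past the prefix cannot match.
    hi = bisect.bisect_right(marks, prefix)
    return lo, hi


# Synthetic data: 20 courses x 10 cohorts x 200 learners x 5 events = 200,000 rows,
# sorted the same way as ORDER BY (course_id, cohort_id, learner_id, server_ts).
keys = sorted(
    (f"course-{c:02d}", f"cohort-{h}", f"learner-{l:03d}", ts)
    for c, h, l, ts in product(range(20), range(10), range(200), range(5))
)
index = build_sparse_index(keys)
lo, hi = granules_for_prefix(index, ("course-07", "cohort-3"))
print(f"dashboard filter reads {hi - lo} of {len(index)} granules")
```

A filter on learner_id alone would get no such help in this model, because learner is the third key column; that is the trade the ORDER BY makes in favour of instructor dashboards.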

05 · Materialized Views: Dashboards That Are Never Stale

ClickHouse materialized views aggregate at insert time: as Kafka rows land, the rollups update incrementally. Dashboards read tiny pre-aggregated tables — milliseconds, regardless of concurrency:

funnel rollup — one view per dashboard query shape
-- rollup state is written at insert time, one row group per (day, course, cohort)
CREATE MATERIALIZED VIEW agg.funnel_by_cohort
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (course_id, cohort_id, event_date)
AS SELECT
    event_date,
    course_id,
    cohort_id,
    uniqState(learner_id) AS learners,
    uniqStateIf(learner_id, event_type = 'content.start') AS started,
    uniqStateIf(learner_id, event_type = 'content.complete') AS completed,
    uniqStateIf(learner_id, event_type = 'assess.submit') AS assessed
FROM events.learning_events
GROUP BY event_date, course_id, cohort_id;

-- dashboard query: enrolled → started → completed → assessed, by cohort
-- ("enrolled" here counts learners with at least one event in the window)
SELECT
    cohort_id,
    uniqMerge(learners) AS enrolled,
    uniqMerge(started) AS started,
    uniqMerge(completed) AS completed,
    uniqMerge(assessed) AS assessed
FROM agg.funnel_by_cohort
WHERE course_id = 'cs101' AND event_date >= today() - 27
GROUP BY cohort_id;
-- answers in <100ms over billions of base rows

The working discipline: one materialized view per dashboard query shape, named for it, owned in the repo — views accreted ad hoc become an unmaintainable thicket within two quarters. Sub-100ms funnels are the difference between a dashboard people watch during class and one they refresh and abandon.
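The insert-time idea can be sketched in a few lines of Python. This is a conceptual model only: it keeps exact sets of learner IDs where ClickHouse's uniqState keeps a compact approximate sketch, and the class and method names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Funnel steps and the event types that feed them, as in the rollup definition.
FUNNEL_STEPS = {
    "content.start": "started",
    "content.complete": "completed",
    "assess.submit": "assessed",
}
COLUMNS = ("learners", "started", "completed", "assessed")


class FunnelRollup:
    """Toy stand-in for an aggregating rollup keyed by (day, course, cohort)."""

    def __init__(self):
        self._state = defaultdict(lambda: defaultdict(set))

    def insert(self, event: dict) -> None:
        """Update the partial state as each event lands (the insert-time step)."""
        key = (event["event_date"], event["course_id"], event["cohort_id"])
        self._state[key]["learners"].add(event["learner_id"])
        step = FUNNEL_STEPS.get(event["event_type"])
        if step is not None:
            self._state[key][step].add(event["learner_id"])

    def funnel(self, course_id: str, since: date) -> dict:
        """Merge daily states per cohort at query time (the merge step)."""
        merged = defaultdict(lambda: defaultdict(set))
        for (event_date, course, cohort), steps in self._state.items():
            if course == course_id and event_date >= since:
                for step, learners in steps.items():
                    merged[cohort][step] |= learners
        return {
            cohort: {col: len(steps[col]) for col in COLUMNS}
            for cohort, steps in merged.items()
        }


rollup = FunnelRollup()
for day, learner, event_type in [
    (date(2026, 6, 1), "a", "content.start"),
    (date(2026, 6, 1), "b", "content.start"),
    (date(2026, 6, 2), "a", "content.start"),      # same learner, second day
    (date(2026, 6, 2), "a", "content.complete"),
    (date(2026, 6, 2), "c", "session.start"),      # active, but never started content
]:
    rollup.insert({"event_date": day, "course_id": "cs101", "cohort_id": "k1",
                   "learner_id": learner, "event_type": event_type})

print(rollup.funnel("cs101", since=date(2026, 6, 1)))
```

Learner "a" appears on two days but is counted once after the merge, which is why the rollup stores mergeable state rather than daily counts that would double-count when summed.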

⚠️ Materialized views fire on insert; they never backfill themselves. Every view ships with its backfill INSERT…SELECT in the same migration, or your historical dashboards quietly start at the view's birthday.

06 · Behavioral Anomaly Detection: The Duty-of-Care Layer

Streaming telemetry enables detection that batch cannot, and in education the detections carry weight:

Signal | Detection | Routed to
--- | --- | ---
Disengagement risk | Session cadence collapse vs learner's own baseline | Instructor/advisor queue, while re-engagement still works
Integrity signal | Assessment patterns consistent with answer-sharing (timing clusters, sequence similarity) | Human review, never automated consequence
Content defect | Replay-rate spike at one timestamp across many learners | Content authors, same day

Keep the statistics humble — per-cohort baselines and deviation bands, recalibrated weekly — and the governance serious: these are signals about people, often young people. The platform's rule, written down: signals inform humans; humans decide consequences. An anomaly system that auto-flags a struggling teenager to a disciplinary process is a product failure regardless of its precision.
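What "humble" can look like for the disengagement signal: a mean-and-deviation band over the learner's own recent history, producing a review item for a person and nothing else. The Python sketch below is one plausible shape under stated assumptions; the two-standard-deviation band, the four-week minimum history, the minimum spread, and the queue name are all illustrative choices, not documented platform values.

```python
from statistics import mean, stdev
from typing import Optional


def cadence_signal(
    weekly_sessions: list[int],
    current_week: int,
    band: float = 2.0,        # assumption: width of the deviation band
    min_history: int = 4,     # assumption: weeks needed before judging anyone
    min_spread: float = 0.5,  # assumption: avoids a zero-width band on flat history
) -> Optional[dict]:
    """Return a human-review item if this week falls below the learner's own band."""
    if len(weekly_sessions) < min_history:
        return None  # not enough history to call anything a deviation
    baseline = mean(weekly_sessions)
    spread = max(stdev(weekly_sessions), min_spread)
    floor = baseline - band * spread
    if current_week >= floor:
        return None
    # The output is a queue item for a person. It carries no consequence.
    return {
        "signal": "disengagement_risk",
        "route_to": "advisor_review_queue",  # hypothetical queue name
        "baseline_sessions": round(baseline, 2),
        "observed_sessions": current_week,
        "band_floor": round(floor, 2),
    }


history = [5, 6, 5, 4, 6, 5]
print(cadence_signal(history, current_week=1))
print(cadence_signal(history, current_week=4))
```

The function can only return a review item or nothing. Keeping consequences out of the return type is the code-level version of "signals inform humans; humans decide consequences."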

07 · Privacy Is a Feature of the Architecture

Learner telemetry is personal data, frequently minors' data: FERPA/GDPR-class constraints are design inputs, not compliance paperwork. The enforcement points, in the engine:

  • Pseudonymous IDs on the spine — the event stream never carries real identity; a separately-governed identity vault maps UUIDs to people, with its own access regime and audit log.
  • Cohort-level defaults — dashboards aggregate by default; individual drill-down is role-gated and logged. Most users never need (and never get) the individual view.
  • Retention as a table property — the TTL clause in Section 04 is the retention policy; expiry doesn't depend on someone remembering to run a script.
  • Protected attributes excluded by contract — recommendation features derive from behaviour; demographic fields are contractually absent from the feature pipeline, validated in CI.
  • The subject-access query is a deliverable — "what do you hold about this learner?" is a tested, documented query, not a three-week scramble. Build it before the first request, because the first request comes with a deadline.
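A minimal Python sketch of two of those enforcement points, the identity vault and the subject-access query, is below. Everything in it is illustrative: the class, the role names, and the in-memory storage are assumptions, and a real vault would be a separately secured service with durable audit storage.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical roles allowed to map a pseudonym back to a person.
ALLOWED_ROLES = frozenset({"privacy_officer", "advisor"})


class IdentityVault:
    """Maps real identities to pseudonymous UUIDs; every re-identification is logged."""

    def __init__(self):
        self._by_identity = {}
        self._by_pseudonym = {}
        self.audit_log = []

    def pseudonym_for(self, real_identity: str) -> str:
        """Return a stable random UUID; only this value ever reaches the event stream."""
        if real_identity not in self._by_identity:
            pseudonym = str(uuid.uuid4())
            self._by_identity[real_identity] = pseudonym
            self._by_pseudonym[pseudonym] = real_identity
        return self._by_identity[real_identity]

    def reidentify(self, pseudonym: str, requester: str, role: str, reason: str) -> str:
        """Role-gated reverse lookup. The attempt is logged whether or not it succeeds."""
        allowed = role in ALLOWED_ROLES
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "requester": requester, "role": role, "reason": reason,
            "pseudonym": pseudonym, "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not re-identify learners")
        return self._by_pseudonym[pseudonym]


def subject_access(vault: IdentityVault, events: list[dict], real_identity: str) -> list[dict]:
    """Answer 'what do you hold about this learner?' from the pseudonymous stream."""
    pseudonym = vault.pseudonym_for(real_identity)
    return [e for e in events if e["learner_id"] == pseudonym]


vault = IdentityVault()
alice = vault.pseudonym_for("alice@example.edu")
bob = vault.pseudonym_for("bob@example.edu")
events = [
    {"learner_id": alice, "event_type": "content.start"},
    {"learner_id": bob, "event_type": "content.start"},
    {"learner_id": alice, "event_type": "assess.submit"},
]
print(len(subject_access(vault, events, "alice@example.edu")))
```

Writing subject_access as an ordinary function with its own test is what turns the request from a scramble into a routine run.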

<3min · End-to-End, Vipra Documented Production
<100ms · Funnel Queries Over Billions of Base Rows
50M+/day · Design Volume, Well Below Proven Capacity
1B+/hr · Vipra ClickHouse Production Headroom

08 · Lessons Learned: The Hard Truths

  • Taxonomy churn costs more than throughput ever will. Our hardest weeks traced to a client release renaming an event field, not to load. Schema registry enforcement at the gate is the cheapest insurance in this architecture.
  • Client clocks lie constantly. Mobile devices arrive minutes skewed; funnels built on client timestamps run backwards. Carry both timestamps, order by server time, use client time only for within-session sequencing.
  • Heartbeats are load-bearing. Session duration computed from start/end events alone overcounts abandoned tabs enormously. A 30-second heartbeat made every engagement metric honest — and added 60% of our event volume. Worth it; plan for it.
  • One view per query shape, enforced. The "flexible general-purpose rollup" we built first served no dashboard well and cost more than the five specific views that replaced it.
  • The backfill is part of the view. Two separate incidents of "dashboard starts in March" taught us: the migration that creates a materialized view includes its historical backfill, no exceptions.
  • Privacy earns the product its license to exist. Institutions buy live analytics; their counsel approves pseudonymisation, TTLs, and the subject-access query. The privacy architecture closed deals the dashboards started.
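The heartbeat lesson above is easy to demonstrate. The Python sketch below compares duration from first to last event against duration that only counts gaps a heartbeat can explain. It assumes the client emits heartbeats only while the learner is active and that a gap of more than two heartbeat intervals means the learner was away; both the function names and that two-interval grace factor are our illustration, not a documented rule.

```python
HEARTBEAT_MS = 30_000  # the 30-second heartbeat from the lesson above


def naive_duration_sec(events: list[dict]) -> float:
    """Duration from first to last event, as start/end-only accounting would compute it."""
    stamps = [e["server_ts"] for e in events]
    return (max(stamps) - min(stamps)) / 1000


def heartbeat_duration_sec(events: list[dict], heartbeat_ms: int = HEARTBEAT_MS,
                           grace: int = 2) -> float:
    """Sum only the gaps short enough to be covered by a heartbeat."""
    stamps = sorted(e["server_ts"] for e in events)
    max_gap = heartbeat_ms * grace  # assumption: longer gaps mean the learner left
    active_ms = sum(b - a for a, b in zip(stamps, stamps[1:]) if b - a <= max_gap)
    return active_ms / 1000


# Five minutes of real activity, then a tab left open and closed two hours later.
session = [{"event_type": "session.start", "server_ts": 0}]
session += [{"event_type": "session.heartbeat", "server_ts": t}
            for t in range(HEARTBEAT_MS, 300_001, HEARTBEAT_MS)]
session.append({"event_type": "session.end", "server_ts": 7_200_000})

print(naive_duration_sec(session))      # 7200.0 seconds
print(heartbeat_duration_sec(session))  # 300.0 seconds
```

Ten heartbeats for five minutes of activity also shows where the extra event volume comes from.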

09 · Key Takeaways for Practitioners

🗂️
Taxonomy first

Versioned, registry-enforced, interaction events included. Semantics are the real risk, not volume.

🏛️
ClickHouse for the shape

MergeTree ordered by (course, cohort, learner, time); LowCardinality dims; TTL retention in-engine.

Rollups at insert time

One materialized view per dashboard shape, with its backfill, in the repo. Sub-100ms or it isn't live.

❤️
Signals inform, humans decide

Baseline-deviation detection routed to people — disengagement, integrity, content defects. Never auto-consequence.

🔒
Privacy in the engine

Pseudonymous spine, identity vault, cohort defaults, TTLs, and a tested subject-access query.

📡
Heartbeats make metrics honest

Duration without heartbeats is fiction. Budget the volume; it's most of your truth.

The two production systems this architecture composes: the real-time Kafka LXP platform (nightly batch → sub-3-minute, millions of learners) and the ClickHouse telemetry platform (1B+ events/hour).

FAQ · Frequently Asked Questions

Why ClickHouse instead of a cloud warehouse for learning analytics?
Query shape and economics: high-cardinality time-series aggregation with sub-second interactive response at high concurrency is ClickHouse's core strength, at a fraction of per-query warehouse cost. Vipra runs ClickHouse in production at 1B+ events/hour. Warehouses remain right for complex ad-hoc joins — many estates run both.
Is sub-3-minute end-to-end latency proven or aspirational?
Proven — it's Vipra's documented production result: a Kafka streaming platform that replaced nightly batch for a learning platform serving millions of learners, with end-to-end latency under 3 minutes.
How do materialized views keep dashboards fast?
They aggregate at insert time: every arriving event incrementally updates per-minute rollups and funnel counters, so dashboards read tiny pre-aggregated tables in milliseconds instead of scanning raw events. One view per dashboard query shape, version-controlled, is the working discipline.
How is learner privacy handled in real-time telemetry?
Pseudonymous learner IDs on the event stream with a separately-governed identity vault, cohort-level defaults with role-gated individual drill-down, TTL-enforced retention, and anomaly signals routed to human judgment rather than automated consequences. FERPA/GDPR constraints are architecture inputs, not compliance paperwork.