A learner who struggles on Tuesday night and gets help Thursday morning got help one session too late — disengagement compounds per session, and nightly-batch analytics can only autopsy it. Moving telemetry to streaming changes the product category: live cohort dashboards, struggle detection while the session is open, and recommendations that reflect the last ten minutes.
The architecture: a versioned xAPI-informed event taxonomy on Kafka, ClickHouse as the analytical engine (high-cardinality time aggregation is its home turf), insert-time materialized views serving sub-100ms funnels, humble-statistics anomaly detection routed to humans, and FERPA/GDPR privacy designed in rather than bolted on.
The sub-3-minute end-to-end target is not aspirational — it is Vipra's documented production result (Kafka LXP platform, millions of learners), and our ClickHouse telemetry platform sustains 1B+ events/hour. This architecture is those two production systems, composed.
01 · Why Learning Analytics Lags a Day Behind Learning
The standard LMS analytics stack is a nightly export into a warehouse and a morning dashboard refresh. Structurally, that architecture can answer what happened and never what is happening — and in learning, the difference is the product. Disengagement compounds per session: the learner who hits a confusing segment tonight and gets no response is measurably less likely to start the next session at all. Intervention has a half-life measured in hours.
Streaming telemetry creates capabilities batch cannot approximate: instructors watching live cohort dashboards during a synchronous session; struggle signals flagged while re-engagement still works; content teams seeing a confusing video segment spike replays the same day they shipped it; recommendations that know what the learner did ten minutes ago. Our LXP engagement made exactly this transition — nightly batch to sub-3-minute latency for a platform serving millions of learners — and the product changed around it.
02 · The Architecture, End to End
Volume math: 50M events/day averages ~580/sec and peaks around 10–15K/sec at synchronous-class boundaries, which is comfortable single-digit-node ClickHouse territory. Our telemetry production system runs this engine at 1B+ events/hour (roughly 280K events/sec sustained), more than an order of magnitude above this design's peak. That headroom is the point: capacity planning becomes boring, which is what capacity planning should be.
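The arithmetic behind those figures is short enough to check directly. The sketch below uses only the numbers quoted in the text; taking the upper end of the peak range (15K/sec) as the comparison point is an assumption made here for a conservative headroom estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 50_000_000        # design volume quoted in the text
peak_per_sec = 15_000              # upper end of the quoted peak range (assumed comparison point)
proven_per_hour = 1_000_000_000    # production telemetry platform figure

avg_per_sec = events_per_day / SECONDS_PER_DAY
proven_per_sec = proven_per_hour / 3600
headroom_vs_peak = proven_per_sec / peak_per_sec

print(f"average: {avg_per_sec:.0f} events/sec")          # average: 579 events/sec
print(f"proven:  {proven_per_sec:.0f} events/sec")       # proven:  277778 events/sec
print(f"headroom over peak: {headroom_vs_peak:.1f}x")    # headroom over peak: 18.5x
```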
03 · The Event Spine: Taxonomy Before Throughput
The engineering risk in learning telemetry is semantic, not volumetric. Define the taxonomy first, version it, and enforce it at the gate:
Event taxonomy: xAPI-informed, pragmatic (registry schema, excerpt)

    {
      "event_version": "2.3",
      "event_type": "enum[content.view, content.start, content.complete, assess.attempt, assess.submit, assess.score, interact.pause, interact.replay, interact.speed_change, session.start, session.heartbeat, session.end]",
      "learner_id": "pseudonymous-uuid",
      "course_id": "string",
      "content_id": "string",
      "position_sec": "nullable-int",
      "cohort_id": "string",
      "institution_id": "string",
      "client_ts": "timestamp-ms",
      "server_ts": "timestamp-ms"
    }
Three taxonomy decisions that pay forever: interaction events are the honest signals — pause, replay, and speed-change cluster around confusion in ways completion events never reveal; key by learner ID for ordered per-learner streams (session reconstruction depends on it); and carry both client and server timestamps — clock-skewed mobile clients will otherwise write your funnels backwards. Schema registry enforcement means a client release cannot silently rename a field a funnel depends on; we learned that one in production so you don't have to.
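To make "enforce it at the gate" concrete, here is a minimal sketch of the kind of check a registry-backed gate performs, written as plain Python rather than against any particular schema-registry client. The field and event-type names come from the taxonomy excerpt; `validate_event` and `partition_key` are hypothetical helper names, and treating `position_sec` as the only optional field is an assumption based on its `nullable-int` type.

```python
from typing import Any

EVENT_TYPES = {
    "content.view", "content.start", "content.complete",
    "assess.attempt", "assess.submit", "assess.score",
    "interact.pause", "interact.replay", "interact.speed_change",
    "session.start", "session.heartbeat", "session.end",
}
REQUIRED_FIELDS = {
    "event_version", "event_type", "learner_id", "course_id", "content_id",
    "cohort_id", "institution_id", "client_ts", "server_ts",
}
OPTIONAL_FIELDS = {"position_sec"}  # assumption: nullable means optional


def validate_event(event: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the event passes the gate."""
    problems = []
    present = set(event)
    missing = REQUIRED_FIELDS - present
    unknown = present - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if unknown:
        problems.append(f"unknown fields: {sorted(unknown)}")
    if event.get("event_type") not in EVENT_TYPES:
        problems.append(f"unknown event_type: {event.get('event_type')!r}")
    return problems


def partition_key(event: dict[str, Any]) -> bytes:
    """Key by learner so one learner's events stay ordered within a partition."""
    return event["learner_id"].encode("utf-8")


good = {
    "event_version": "2.3", "event_type": "interact.replay",
    "learner_id": "3f2b6c1e-0000-4000-8000-000000000001",
    "course_id": "cs101", "content_id": "video-12", "position_sec": 412,
    "cohort_id": "2024a", "institution_id": "inst-9",
    "client_ts": 1714000000123, "server_ts": 1714000000456,
}
# A client release that silently renames a field is rejected, not ingested.
renamed = {k: v for k, v in good.items() if k != "cohort_id"} | {"cohortId": "2024a"}

print(validate_event(good))     # []
print(validate_event(renamed))
```

Rejecting the renamed event at ingestion is what keeps a funnel from quietly reading nulls after a client release.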
04 · ClickHouse: Built for Exactly This Query Shape
Engagement analytics is high-cardinality aggregation over time — millions of learners × thousands of content items, sliced live by cohort, course, and institution. That shape is ClickHouse's home turf:
ClickHouse: the base table (MergeTree, tuned for the access pattern)

    CREATE TABLE events.learning_events
    (
        event_date     Date DEFAULT toDate(server_ts),
        server_ts      DateTime64(3),
        event_type     LowCardinality(String),
        learner_id     UUID,
        course_id      LowCardinality(String),
        content_id     String,
        cohort_id      LowCardinality(String),
        institution_id LowCardinality(String),
        position_sec   Nullable(UInt32),
        props          Map(String, String)
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (course_id, cohort_id, learner_id, server_ts)
    TTL event_date + INTERVAL 26 MONTH DELETE
    SETTINGS index_granularity = 8192;
The ORDER BY is the design: course-and-cohort-scoped queries — which is everything an instructor dashboard asks — prune to a sliver of the table. LowCardinality on the dimension columns and Map for long-tail properties keep storage at a fraction of raw JSON; compression ratios of 10–20× on learning events are routine. And note the TTL clause — retention is an engine property here, which Section 07 turns into a compliance feature.
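Why a sort-key prefix prunes so well can be shown with a toy model. The sketch below is a simplified illustration of a sparse index over sorted data, not ClickHouse's actual implementation: rows are sorted by the ORDER BY prefix, one index mark is kept per granule, and a binary search over the marks picks the few granules worth reading. The granule size, sample data, and the name `granules_to_read` are all illustrative assumptions.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # toy value; the base table uses index_granularity = 8192

# Rows sorted by the ORDER BY prefix: (course_id, cohort_id, learner_id).
rows = sorted(
    (course, cohort, learner)
    for course in ("bio200", "cs101", "ma150")
    for cohort in ("2024a", "2024b")
    for learner in range(6)
)

# Sparse index: the (course, cohort) prefix of the first row in each granule.
marks = [rows[i][:2] for i in range(0, len(rows), GRANULE)]


def granules_to_read(course, cohort):
    """Granules that could hold rows for one (course, cohort) prefix.

    The granule just before the first matching mark is included because
    matching rows may begin partway through it.
    """
    key = (course, cohort)
    first = max(bisect_left(marks, key) - 1, 0)
    last = bisect_right(marks, key)
    return range(first, last)


print(len(marks))                                  # 9
print(list(granules_to_read("cs101", "2024b")))    # [4, 5]
```

Here a cohort-scoped lookup touches 2 of 9 granules; the same mechanism at production granule sizes is what reduces an instructor query to a sliver of the table.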
05 · Materialized Views: Dashboards That Are Never Stale
ClickHouse materialized views aggregate at insert time: as Kafka rows land, the rollups update incrementally. Dashboards read tiny pre-aggregated tables — milliseconds, regardless of concurrency:
Funnel rollup: one view per dashboard query shape

    CREATE MATERIALIZED VIEW agg.funnel_by_cohort
    ENGINE = AggregatingMergeTree
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (course_id, cohort_id, event_date)
    AS SELECT
        event_date,
        course_id,
        cohort_id,
        uniqState(learner_id) AS learners,
        uniqStateIf(learner_id, event_type = 'content.start') AS started,
        uniqStateIf(learner_id, event_type = 'content.complete') AS completed,
        uniqStateIf(learner_id, event_type = 'assess.submit') AS assessed
    FROM events.learning_events
    GROUP BY event_date, course_id, cohort_id;

    -- dashboard query: enrolled → started → completed → assessed, by cohort
    -- ("enrolled" here counts learners with at least one event in the window)
    SELECT
        cohort_id,
        uniqMerge(learners) AS enrolled,
        uniqMerge(started) AS started,
        uniqMerge(completed) AS completed,
        uniqMerge(assessed) AS assessed
    FROM agg.funnel_by_cohort
    WHERE course_id = 'cs101' AND event_date >= today() - 27
    GROUP BY cohort_id;
    -- answers in <100ms over billions of base rows
The working discipline: one materialized view per dashboard query shape, named for it, owned in the repo — views accreted ad hoc become an unmaintainable thicket within two quarters. Sub-100ms funnels are the difference between a dashboard people watch during class and one they refresh and abandon.
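The insert-time pattern can be modelled in a few lines. This is a conceptual sketch, not ClickHouse code: each inserted batch is folded into a small per-(date, course, cohort) state, and the dashboard read merges those states instead of scanning raw events. Python sets stand in for the `uniqState` aggregate states (sets are exact, whereas `uniq` is an approximate counter), and `FunnelRollup` is a hypothetical name.

```python
from collections import defaultdict

STAGE_BY_EVENT = {
    "content.start": "started",
    "content.complete": "completed",
    "assess.submit": "assessed",
}


def new_state():
    """Empty aggregate state: one set of learner IDs per funnel stage."""
    return {"learners": set(), "started": set(), "completed": set(), "assessed": set()}


class FunnelRollup:
    def __init__(self):
        self.states = defaultdict(new_state)

    def on_insert(self, rows):
        """Fold a batch of inserted rows into the per-(date, course, cohort) states."""
        for row in rows:
            key = (row["event_date"], row["course_id"], row["cohort_id"])
            state = self.states[key]
            state["learners"].add(row["learner_id"])
            stage = STAGE_BY_EVENT.get(row["event_type"])
            if stage:
                state[stage].add(row["learner_id"])

    def funnel(self, course_id, cohort_id):
        """Merge the daily states, then count distinct learners per stage."""
        merged = new_state()
        for (_, course, cohort), state in self.states.items():
            if course == course_id and cohort == cohort_id:
                for stage, learners in state.items():
                    merged[stage] |= learners
        return {stage: len(learners) for stage, learners in merged.items()}


def row(date, learner, event_type):
    return {"event_date": date, "course_id": "cs101", "cohort_id": "2024a",
            "learner_id": learner, "event_type": event_type}


rollup = FunnelRollup()
rollup.on_insert([row("2024-03-01", "u1", "content.start"),
                  row("2024-03-01", "u2", "content.view")])
rollup.on_insert([row("2024-03-02", "u1", "content.start"),
                  row("2024-03-02", "u1", "content.complete")])
print(rollup.funnel("cs101", "2024a"))
# {'learners': 2, 'started': 1, 'completed': 1, 'assessed': 0}
```

Learner `u1` starts content on two different days but is counted once, which is why the rollup stores mergeable states rather than daily counts that would be summed.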
06 · Behavioral Anomaly Detection: The Duty-of-Care Layer
Streaming telemetry enables detection that batch cannot, and in education the detections carry weight:
| Signal | Detection | Routed to |
|---|---|---|
| Disengagement risk | Session cadence collapse vs learner's own baseline | Instructor/advisor queue — while re-engagement still works |
| Integrity signal | Assessment patterns consistent with answer-sharing (timing clusters, sequence similarity) | Human review, never automated consequence |
| Content defect | Replay-rate spike at one timestamp across many learners | Content authors, same day |
Keep the statistics humble — per-cohort baselines and deviation bands, recalibrated weekly — and the governance serious: these are signals about people, often young people. The platform's rule, written down: signals inform humans; humans decide consequences. An anomaly system that auto-flags a struggling teenager to a disciplinary process is a product failure regardless of its precision.
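A baseline-and-band check of this kind needs very little machinery. The sketch below is one possible shape, under stated assumptions: the baseline is a short history of weekly session counts, the band is the mean plus or minus `k` population standard deviations, and `k = 2.0`, the function names, and the `advisor_queue` label are all illustrative. The function only produces a signal for a human queue; it attaches no consequence.

```python
from statistics import mean, pstdev


def deviation_band(baseline, k=2.0):
    """Return (low, high): the baseline mean plus or minus k standard deviations."""
    centre = mean(baseline)
    spread = pstdev(baseline)
    return centre - k * spread, centre + k * spread


def cadence_signal(learner_id, weekly_sessions, current, k=2.0):
    """Return a signal for human review if cadence falls below the band, else None."""
    low, _ = deviation_band(weekly_sessions, k)
    if current < low:
        return {
            "learner_id": learner_id,
            "signal": "disengagement_risk",
            "route_to": "advisor_queue",  # a person decides what happens next
            "observed": current,
            "band_low": round(low, 2),
        }
    return None


history = [5, 4, 5, 6, 5]                     # sessions per week, recent baseline
print(cadence_signal("learner-a", history, current=4))   # None
print(cadence_signal("learner-b", history, current=1))
```

Recalibrating weekly, as the text recommends, amounts to recomputing `history` on a rolling window before each evaluation.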
07 · Privacy Is a Feature of the Architecture
Learner telemetry is personal data, frequently minors' data: FERPA/GDPR-class constraints are design inputs, not compliance paperwork. The enforcement points, in the engine:
- Pseudonymous IDs on the spine — the event stream never carries real identity; a separately-governed identity vault maps UUIDs to people, with its own access regime and audit log.
- Cohort-level defaults — dashboards aggregate by default; individual drill-down is role-gated and logged. Most users never need (and never get) the individual view.
- Retention as a table property — the TTL clause in Section 04 is the retention policy; expiry doesn't depend on someone remembering to run a script.
- Protected attributes excluded by contract — recommendation features derive from behaviour; demographic fields are contractually absent from the feature pipeline, validated in CI.
- The subject-access query is a deliverable — "what do you hold about this learner?" is a tested, documented query, not a three-week scramble. Build it before the first request, because the first request comes with a deadline.
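Two of the enforcement points above, the identity vault and the subject-access query, can be sketched together. This is an in-memory illustration only, assuming random UUIDs as pseudonyms; `IdentityVault`, `reidentify`, and `subject_access` are hypothetical names, and a real vault would sit behind its own storage and access controls.

```python
import uuid
from datetime import datetime, timezone


class IdentityVault:
    """Toy stand-in for the separately governed identity vault."""

    def __init__(self):
        self._pseudonym_by_person = {}
        self._person_by_pseudonym = {}
        self.audit_log = []

    def pseudonym_for(self, person):
        """Return the stable pseudonymous UUID for a person, minting one if needed."""
        if person not in self._pseudonym_by_person:
            pseudonym = str(uuid.uuid4())
            self._pseudonym_by_person[person] = pseudonym
            self._person_by_pseudonym[pseudonym] = person
        return self._pseudonym_by_person[person]

    def reidentify(self, pseudonym, requester, reason):
        """Resolve a pseudonym to a person and record who asked and why."""
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "requester": requester,
            "reason": reason,
            "pseudonym": pseudonym,
        })
        return self._person_by_pseudonym[pseudonym]


def subject_access(vault, person, events):
    """Everything held about one learner: the events stored under their pseudonym."""
    pseudonym = vault.pseudonym_for(person)
    return [event for event in events if event["learner_id"] == pseudonym]


vault = IdentityVault()
pid = vault.pseudonym_for("ada@example.edu")
events = [
    {"learner_id": pid, "event_type": "content.view"},
    {"learner_id": str(uuid.uuid4()), "event_type": "content.view"},
]
print(len(subject_access(vault, "ada@example.edu", events)))   # 1
```

The event stream only ever sees `pid`; mapping it back to a person goes through `reidentify`, which leaves an audit entry every time.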
08 · Lessons Learned: The Hard Truths
- Taxonomy churn costs more than throughput ever will. Our hardest weeks traced to a client release renaming an event field, not to load. Schema registry enforcement at the gate is the cheapest insurance in this architecture.
- Client clocks lie constantly. Mobile devices arrive minutes skewed; funnels built on client timestamps run backwards. Carry both timestamps, order by server time, use client time only for within-session sequencing.
- Heartbeats are load-bearing. Session duration computed from start/end events alone overcounts abandoned tabs enormously. A 30-second heartbeat made every engagement metric honest — and added 60% of our event volume. Worth it; plan for it.
- One view per query shape, enforced. The "flexible general-purpose rollup" we built first served no dashboard well and cost more than the five specific views that replaced it.
- The backfill is part of the view. Two separate incidents of "dashboard starts in March" taught us: the migration that creates a materialized view includes its historical backfill, no exceptions.
- Privacy earns the product its license to exist. Institutions buy live analytics; their counsel approves pseudonymisation, TTLs, and the subject-access query. The privacy architecture closed deals the dashboards started.
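The heartbeat lesson is easiest to see on a single abandoned-tab session. The sketch below compares the two ways of computing duration; it assumes heartbeats stop when the learner stops being active, and it treats any gap longer than two heartbeat intervals as idle time. That gap threshold and the function names are illustrative assumptions, not the production rule.

```python
HEARTBEAT_SEC = 30


def start_end_duration(events):
    """Duration from session.start to session.end, ignoring everything between."""
    times = {event_type: ts for ts, event_type in events}
    return times["session.end"] - times["session.start"]


def heartbeat_duration(events, max_gap=2 * HEARTBEAT_SEC):
    """Sum only the gaps short enough to be bridged by a live heartbeat."""
    stamps = sorted(ts for ts, _ in events)
    return sum(b - a for a, b in zip(stamps, stamps[1:]) if b - a <= max_gap)


# Two minutes of activity, then the tab sits open for almost an hour.
session = [
    (0, "session.start"),
    (30, "session.heartbeat"),
    (60, "session.heartbeat"),
    (90, "session.heartbeat"),
    (120, "session.heartbeat"),
    (3600, "session.end"),
]

print(start_end_duration(session))   # 3600
print(heartbeat_duration(session))   # 120
```

The start/end figure reports an hour of engagement; the heartbeat figure reports the two minutes that actually happened.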
09 · Key Takeaways for Practitioners
- Event taxonomy: versioned, registry-enforced, interaction events included. Semantics are the real risk, not volume.
- Base table: MergeTree ordered by (course, cohort, learner, time); LowCardinality dimensions; TTL retention in-engine.
- Dashboards: one materialized view per dashboard shape, with its backfill, in the repo. Sub-100ms or it isn't live.
- Anomaly detection: baseline-deviation signals for disengagement, integrity, and content defects, routed to people. Never auto-consequence.
- Privacy: pseudonymous spine, identity vault, cohort defaults, TTLs, and a tested subject-access query.
- Heartbeats: duration without heartbeats is fiction. Budget the volume; it's most of your truth.
This architecture composes two production systems: the real-time Kafka LXP platform (nightly batch → sub-3-minute latency, millions of learners) and the ClickHouse telemetry platform (1B+ events/hour).