The Attention Economy: Real-Time Learner Engagement Telemetry with ClickHouse + Kafka

Executive Summary

A learner who struggles on Tuesday night and gets help Thursday morning got help one session too late — disengagement compounds per session, and nightly-batch analytics can only autopsy it. Moving telemetry to streaming changes the product category: live cohort dashboards, struggle detection while the session is open, and recommendations that reflect the last ten minutes.

The architecture: a versioned xAPI-informed event taxonomy on Kafka, ClickHouse as the analytical engine (high-cardinality time aggregation is its home turf), insert-time materialized views serving sub-100ms funnels, humble-statistics anomaly detection routed to humans, and FERPA/GDPR privacy designed in rather than bolted on.

The sub-3-minute end-to-end target is not aspirational — it is Vipra's documented production result (Kafka LXP platform, millions of learners), and our ClickHouse telemetry platform sustains 1B+ events/hour. This architecture is those two production systems, composed.

01 · Why Learning Analytics Lags a Day Behind Learning

The standard LMS analytics stack is a nightly export into a warehouse and a morning dashboard refresh. Structurally, that architecture can answer what happened and never what is happening — and in learning, the difference is the product. Disengagement compounds per session: the learner who hits a confusing segment tonight and gets no response is measurably less likely to start the next session at all. Intervention has a half-life measured in hours.

Streaming telemetry creates capabilities batch cannot approximate: instructors watching live cohort dashboards during a synchronous session; struggle signals flagged while re-engagement still works; content teams seeing a confusing video segment spike replays the same day they shipped it; recommendations that know what the learner did ten minutes ago. Our LXP engagement made exactly this transition — nightly batch to sub-3-minute latency for a platform serving millions of learners — and the product changed around it.

02 · The Architecture, End to End

emit: Clients → Kafka. Web/mobile/SCORM players emit versioned events, keyed by learner ID; schema registry enforces the taxonomy at the gate.

land: Kafka → ClickHouse. Kafka table engine + materialized-view pipeline lands events into MergeTree, partitioned by month, ordered by (course, cohort, learner, time).

aggregate: Insert-time rollups. Per-minute engagement, funnel counters, and cohort summaries, updated incrementally as rows arrive and owned in the repo.

serve: Dashboards + APIs. Superset/embedded dashboards read rollups in milliseconds; recommendation features are exported to the serving layer.

watch: Anomaly + privacy layer. Baseline-deviation detection → human review queues; TTLs, pseudonymisation, and consent enforced in-engine.

Volume math: 50M events/day averages ~580/sec and peaks around 10–15K/sec at synchronous-class boundaries, which is comfortable single-digit-node ClickHouse territory. Our telemetry production system runs this engine at 1B+ events/hour (roughly 278K/sec sustained), well over an order of magnitude above this design's peak. That headroom is the point: capacity planning becomes boring, which is what capacity planning should be.
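The arithmetic behind those figures can be checked in a few lines. This is a sketch using only the numbers quoted in the text; the variable names are illustrative, and the headroom is computed against the design's peak rate rather than its average.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60

design_events_per_day = 50_000_000       # design volume quoted in the text
design_peak_per_sec = (10_000, 15_000)   # peak range quoted in the text
proven_events_per_hour = 1_000_000_000   # production figure quoted in the text

avg_per_sec = design_events_per_day / SECONDS_PER_DAY
proven_per_sec = proven_events_per_hour / SECONDS_PER_HOUR

# Headroom relative to the design's peak, not its average.
headroom_low = proven_per_sec / design_peak_per_sec[1]
headroom_high = proven_per_sec / design_peak_per_sec[0]

print(f"design average:   {avg_per_sec:,.0f} events/sec")
print(f"proven sustained: {proven_per_sec:,.0f} events/sec")
print(f"headroom over design peak: {headroom_low:.1f}x to {headroom_high:.1f}x")
```

Comparing against the peak is the conservative choice: the class-boundary burst, not the daily average, is what the cluster has to absorb.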

03 · The Event Spine: Taxonomy Before Throughput

The engineering risk in learning telemetry is semantic, not volumetric. Define the taxonomy first, version it, and enforce it at the gate:

event taxonomy — xAPI-informed, pragmatic (registry schema, excerpt)
{
  "event_version": "2.3",
  "event_type": "enum[content.view, content.start, content.complete,
                      assess.attempt, assess.submit, assess.score,
                      interact.pause, interact.replay, interact.speed_change,
                      session.start, session.heartbeat, session.end]",
  "learner_id":   "pseudonymous-uuid",
  "course_id":    "string", "content_id": "string",
  "position_sec": "nullable-int",
  "cohort_id":    "string", "institution_id": "string",
  "client_ts":    "timestamp-ms", "server_ts": "timestamp-ms"
}

Three taxonomy decisions that pay forever: interaction events are the honest signals — pause, replay, and speed-change cluster around confusion in ways completion events never reveal; key by learner ID for ordered per-learner streams (session reconstruction depends on it); and carry both client and server timestamps — clock-skewed mobile clients will otherwise write your funnels backwards. Schema registry enforcement means a client release cannot silently rename a field a funnel depends on; we learned that one in production so you don't have to.
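To make "enforced at the gate" concrete, here is a minimal sketch of the kind of check a registry performs, written in plain Python rather than against any real schema-registry client. The function name `validate_event` and the sample values are hypothetical; the field list and event types are copied from the excerpt above, so a real schema would likely carry more fields.

```python
from typing import Any

# Event types copied from the taxonomy excerpt above.
EVENT_TYPES = {
    "content.view", "content.start", "content.complete",
    "assess.attempt", "assess.submit", "assess.score",
    "interact.pause", "interact.replay", "interact.speed_change",
    "session.start", "session.heartbeat", "session.end",
}
REQUIRED_FIELDS = {
    "event_version", "event_type", "learner_id", "course_id", "content_id",
    "cohort_id", "institution_id", "client_ts", "server_ts",
}
OPTIONAL_FIELDS = {"position_sec"}


def validate_event(event: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the event passes."""
    present = set(event)
    errors = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - present)]
    unknown = present - REQUIRED_FIELDS - OPTIONAL_FIELDS
    errors += [f"unknown field: {name}" for name in sorted(unknown)]
    if "event_type" in event and event["event_type"] not in EVENT_TYPES:
        errors.append(f"unknown event_type: {event['event_type']}")
    position = event.get("position_sec")
    if position is not None and (not isinstance(position, int) or position < 0):
        errors.append("position_sec must be a non-negative integer or null")
    return errors


# Hypothetical sample event; all values are made up for illustration.
good = {
    "event_version": "2.3", "event_type": "interact.replay",
    "learner_id": "3f2b6c1e-0000-4000-8000-000000000001",
    "course_id": "cs101", "content_id": "video-17", "position_sec": 412,
    "cohort_id": "2024-spring", "institution_id": "inst-9",
    "client_ts": 1700000000000, "server_ts": 1700000000450,
}
# A client release that silently renames a field is rejected, not ingested.
renamed = dict(good)
renamed["learnerId"] = renamed.pop("learner_id")

print(validate_event(good))
print(validate_event(renamed))
```

Rejecting unknown fields as well as missing ones is the design choice that catches a rename: the old name goes missing and the new name shows up as unrecognised in the same report.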

04 · ClickHouse: Built for Exactly This Query Shape

Engagement analytics is high-cardinality aggregation over time — millions of learners × thousands of content items, sliced live by cohort, course, and institution. That shape is ClickHouse's home turf:

ClickHouse — the base table (MergeTree, tuned for the access pattern)
CREATE TABLE events.learning_events (
    event_date    Date     DEFAULT toDate(server_ts),
    server_ts     DateTime64(3),
    event_type    LowCardinality(String),
    learner_id    UUID,
    course_id     LowCardinality(String),
    content_id    String,
    cohort_id     LowCardinality(String),
    institution_id LowCardinality(String),
    position_sec  Nullable(UInt32),
    props         Map(String, String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (course_id, cohort_id, learner_id, server_ts)
TTL event_date + INTERVAL 26 MONTH DELETE
SETTINGS index_granularity = 8192;

The ORDER BY is the design: course-and-cohort-scoped queries — which is everything an instructor dashboard asks — prune to a sliver of the table. LowCardinality on the dimension columns and Map for long-tail properties keep storage at a fraction of raw JSON; compression ratios of 10–20× on learning events are routine. And note the TTL clause — retention is an engine property here, which Section 07 turns into a compliance feature.
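Why the sort key matters can be shown with a toy model. The sketch below is not ClickHouse's implementation; it only counts how many fixed-size row blocks ("granules") contain rows for one course and cohort under two different sort orders. The data, the granule size of 8, and the function names are all illustrative assumptions.

```python
GRANULE = 8   # toy granule size; the table above uses index_granularity = 8192
N = 256       # toy row count

# Rows arrive interleaved across courses and cohorts, one per tick t.
rows = [
    (f"course{t % 4}", f"cohort{(t // 4) % 4}", f"learner{(t // 16) % 4}", t)
    for t in range(N)
]


def granules_with_matches(sorted_rows, predicate):
    """Count granules that hold at least one matching row."""
    return len({i // GRANULE for i, row in enumerate(sorted_rows) if predicate(row)})


def is_target(row):
    # The instructor-dashboard filter: one course, one cohort.
    return row[0] == "course1" and row[1] == "cohort2"


by_time = sorted(rows, key=lambda r: r[3])   # time-first ordering
by_course_cohort = sorted(rows)              # (course, cohort, learner, time)

total_granules = N // GRANULE
scattered = granules_with_matches(by_time, is_target)
clustered = granules_with_matches(by_course_cohort, is_target)
print(f"{total_granules} granules total")
print(f"time-first order:     {scattered} granules hold matching rows")
print(f"course/cohort order:  {clustered} granules hold matching rows")
```

The same 16 matching rows are spread over half the table in time order and packed into two adjacent granules in course/cohort order, which is the pruning effect the paragraph above describes.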

05 · Materialized Views: Dashboards That Are Never Stale

ClickHouse materialized views aggregate at insert time: as Kafka rows land, the rollups update incrementally. Dashboards read tiny pre-aggregated tables and answer in milliseconds, even at high concurrency:

funnel rollup — one view per dashboard query shape
CREATE MATERIALIZED VIEW agg.funnel_by_cohort
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (course_id, cohort_id, event_date)
AS SELECT
    event_date, course_id, cohort_id,
    uniqState(learner_id)                                        AS learners,
    uniqStateIf(learner_id, event_type = 'content.start')        AS started,
    uniqStateIf(learner_id, event_type = 'content.complete')     AS completed,
    uniqStateIf(learner_id, event_type = 'assess.submit')        AS assessed
FROM events.learning_events
GROUP BY event_date, course_id, cohort_id;

-- dashboard query: enrolled → started → completed → assessed, by cohort
SELECT cohort_id,
       uniqMerge(learners)  AS enrolled,
       uniqMerge(started)   AS started,
       uniqMerge(completed) AS completed,
       uniqMerge(assessed)  AS assessed
FROM agg.funnel_by_cohort
WHERE course_id = 'cs101' AND event_date >= today() - 27
GROUP BY cohort_id;   -- answers in <100ms over billions of base rows

The working discipline: one materialized view per dashboard query shape, named for it, owned in the repo — views accreted ad hoc become an unmaintainable thicket within two quarters. Sub-100ms funnels are the difference between a dashboard people watch during class and one they refresh and abandon.

⚠️ Materialized views fire on insert; they never backfill themselves. Every view ships with its backfill INSERT…SELECT in the same migration, or your historical dashboards quietly start at the view's birthday.
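The insert-time behaviour and the backfill caveat can both be modelled in a short sketch. This is an illustration of the idea, not of ClickHouse internals: exact Python sets stand in for the approximate `uniq` aggregate states, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict

STEP_FOR_EVENT = {
    "content.start": "started",
    "content.complete": "completed",
    "assess.submit": "assessed",
}


def new_state():
    # Exact sets stand in for ClickHouse's uniq aggregate states.
    return {"learners": set(), "started": set(), "completed": set(), "assessed": set()}


def apply_insert(rollup, batch):
    """Update the rollup from one inserted batch, as a view does on insert."""
    for event_date, course_id, cohort_id, learner_id, event_type in batch:
        state = rollup[(event_date, course_id, cohort_id)]
        state["learners"].add(learner_id)
        step = STEP_FOR_EVENT.get(event_type)
        if step:
            state[step].add(learner_id)


def funnel(rollup, course_id):
    """Merge per-day states per cohort, then count (the uniqMerge step)."""
    merged = defaultdict(new_state)
    for (_, course, cohort), state in rollup.items():
        if course == course_id:
            for name, members in state.items():
                merged[cohort][name] |= members
    return {c: {name: len(m) for name, m in s.items()} for c, s in merged.items()}


# Made-up rows: `history` landed before the view existed, `live` after.
history = [("2024-03-01", "cs101", "c1", "u1", "content.start"),
           ("2024-03-01", "cs101", "c1", "u2", "content.start")]
live = [("2024-03-02", "cs101", "c1", "u1", "content.complete"),
        ("2024-03-02", "cs101", "c1", "u3", "content.start")]

without_backfill = defaultdict(new_state)
apply_insert(without_backfill, live)

with_backfill = defaultdict(new_state)
apply_insert(with_backfill, history)   # plays the role of the backfill INSERT
apply_insert(with_backfill, live)

print(funnel(without_backfill, "cs101"))
print(funnel(with_backfill, "cs101"))
```

Because states merge by union, a learner who appears on two days is still counted once, and the rollup without the backfill step simply never learns about rows that predate it.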

06 · Behavioral Anomaly Detection: The Duty-of-Care Layer

Streaming telemetry enables detection that batch cannot, and in education the detections carry weight:

Signal	Detection	Routed to
Disengagement risk	Session cadence collapse vs learner's own baseline	Instructor/advisor queue — while re-engagement still works
Integrity signal	Assessment patterns consistent with answer-sharing (timing clusters, sequence similarity)	Human review, never automated consequence
Content defect	Replay-rate spike at one timestamp across many learners	Content authors, same day

Keep the statistics humble — per-cohort baselines and deviation bands, recalibrated weekly — and the governance serious: these are signals about people, often young people. The platform's rule, written down: signals inform humans; humans decide consequences. An anomaly system that auto-flags a struggling teenager to a disciplinary process is a product failure regardless of its precision.
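A humble baseline-and-band check for the disengagement signal might look like the sketch below. The threshold `k`, the minimum history length, the queue name, and the sample numbers are assumptions for illustration, not the platform's actual parameters. The function only produces a signal for a human queue; it decides nothing.

```python
from statistics import mean, pstdev


def cadence_signal(weekly_sessions, current, k=2.0, min_weeks=4):
    """Return a review signal if `current` falls below the learner's own band."""
    if len(weekly_sessions) < min_weeks:
        return None  # too little history to call anything a deviation
    baseline = mean(weekly_sessions)
    spread = pstdev(weekly_sessions)
    lower_band = baseline - k * spread
    if current < lower_band:
        return {
            "signal": "disengagement_risk",
            "baseline": baseline,
            "lower_band": lower_band,
            "observed": current,
            "route_to": "advisor_review_queue",  # a human queue, never an action
        }
    return None


# Made-up history: sessions per week for one learner.
past_weeks = [5, 6, 5, 4, 5, 6]
print(cadence_signal(past_weeks, current=1))   # well below the band
print(cadence_signal(past_weeks, current=4))   # inside the band
print(cadence_signal([5, 6], current=0))       # not enough history
```

Returning nothing when history is short is deliberate: a new learner with two weeks of data has no baseline to deviate from, and flagging them would be noise about a person.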

07 · Privacy Is a Feature of the Architecture

Learner telemetry is personal data, frequently minors' data: FERPA/GDPR-class constraints are design inputs, not compliance paperwork. The enforcement points, in the engine:

Pseudonymous IDs on the spine — the event stream never carries real identity; a separately-governed identity vault maps UUIDs to people, with its own access regime and audit log.
Cohort-level defaults — dashboards aggregate by default; individual drill-down is role-gated and logged. Most users never need (and never get) the individual view.
Retention as a table property — the TTL clause in Section 04 is the retention policy; expiry doesn't depend on someone remembering to run a script.
Protected attributes excluded by contract — recommendation features derive from behaviour; demographic fields are contractually absent from the feature pipeline, validated in CI.
The subject-access query is a deliverable — "what do you hold about this learner?" is a tested, documented query, not a three-week scramble. Build it before the first request, because the first request comes with a deadline.
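The first two enforcement points above can be sketched as a small class. This is a hypothetical in-memory stand-in, assuming random UUIDs as pseudonyms and a made-up set of permitted roles; a real vault would be a separately governed service with durable storage.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical roles allowed to resolve a pseudonym back to a person.
RESOLVER_ROLES = {"advisor", "privacy_officer"}


class IdentityVault:
    """In-memory stand-in for the separately governed identity vault."""

    def __init__(self):
        self._pseudonym_by_identity = {}
        self._identity_by_pseudonym = {}
        self.audit_log = []

    def pseudonym_for(self, real_identity: str) -> str:
        """Return the stable pseudonymous UUID used on the event stream."""
        if real_identity not in self._pseudonym_by_identity:
            pseudonym = str(uuid.uuid4())  # random, not derived from the identity
            self._pseudonym_by_identity[real_identity] = pseudonym
            self._identity_by_pseudonym[pseudonym] = real_identity
        return self._pseudonym_by_identity[real_identity]

    def resolve(self, pseudonym: str, actor: str, role: str) -> str:
        """Role-gated reverse lookup; every attempt is logged, allowed or not."""
        allowed = role in RESOLVER_ROLES
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor, "role": role,
            "pseudonym": pseudonym, "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"role {role!r} may not resolve identities")
        return self._identity_by_pseudonym[pseudonym]


vault = IdentityVault()
learner_id = vault.pseudonym_for("student@example.edu")  # made-up identity
print(learner_id == vault.pseudonym_for("student@example.edu"))
print(vault.resolve(learner_id, actor="advisor-7", role="advisor"))
```

Logging denied attempts as well as granted ones is the point of the audit trail: the log answers "who tried to look" and not only "who succeeded".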

<3min: End-to-End, Vipra Documented Production

<100ms: Funnel Queries Over Billions of Base Rows

50M+/day: Design Volume, An Order Below Proven

1B+/hr: Vipra ClickHouse Production Headroom

08 · Lessons Learned: The Hard Truths

Taxonomy churn costs more than throughput ever will. Our hardest weeks traced to a client release renaming an event field, not to load. Schema registry enforcement at the gate is the cheapest insurance in this architecture.
Client clocks lie constantly. Mobile devices arrive minutes skewed; funnels built on client timestamps run backwards. Carry both timestamps, order by server time, use client time only for within-session sequencing.
Heartbeats are load-bearing. Session duration computed from start/end events alone overcounts abandoned tabs enormously. A 30-second heartbeat made every engagement metric honest — and added 60% of our event volume. Worth it; plan for it.
One view per query shape, enforced. The "flexible general-purpose rollup" we built first served no dashboard well and cost more than the five specific views that replaced it.
The backfill is part of the view. Two separate incidents of "dashboard starts in March" taught us: the migration that creates a materialized view includes its historical backfill, no exceptions.
Privacy earns the product its license to exist. Institutions buy live analytics; their counsel approves pseudonymisation, TTLs, and the subject-access query. The privacy architecture closed deals the dashboards started.
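The heartbeat lesson above is easy to see with numbers. The sketch below compares a start/end duration with one that only counts time covered by heartbeats. The 30-second interval comes from the text; the gap tolerance of two intervals, the function names, and the sample session are assumptions.

```python
HEARTBEAT_SEC = 30  # interval quoted in the text


def duration_from_start_end(events):
    """Naive duration: last timestamp minus first, whatever happened between."""
    times = [ts for ts, _ in events]
    return max(times) - min(times)


def duration_from_heartbeats(events, max_gap=2 * HEARTBEAT_SEC):
    """Sum only the gaps short enough to be covered by a heartbeat.

    max_gap tolerates one dropped heartbeat; that tolerance is an assumption.
    """
    times = sorted(ts for ts, _ in events)
    return sum(b - a for a, b in zip(times, times[1:]) if b - a <= max_gap)


# Made-up session, server timestamps in seconds: the learner watches for
# 90 seconds, walks away, and the tab is closed an hour after it opened.
session = [
    (0, "session.start"),
    (30, "session.heartbeat"),
    (60, "session.heartbeat"),
    (90, "session.heartbeat"),
    (3600, "session.end"),
]

print(duration_from_start_end(session))    # counts the abandoned hour
print(duration_from_heartbeats(session))   # counts only attended time
```

The sketch assumes heartbeats stop when the learner is no longer active; if a client keeps emitting them from a background tab, the heartbeat-based figure inherits the same overcount.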

09 · Key Takeaways for Practitioners

🗂️ Taxonomy first: Versioned, registry-enforced, interaction events included. Semantics are the real risk, not volume.

🏛️ ClickHouse for the shape: MergeTree ordered by (course, cohort, learner, time); LowCardinality dims; TTL retention in-engine.

⚡ Rollups at insert time: One materialized view per dashboard shape, with its backfill, in the repo. Sub-100ms or it isn't live.

❤️ Signals inform, humans decide: Baseline-deviation detection routed to people for disengagement, integrity, and content defects. Never auto-consequence.

🔒 Privacy in the engine: Pseudonymous spine, identity vault, cohort defaults, TTLs, and a tested subject-access query.

📡 Heartbeats make metrics honest: Duration without heartbeats is fiction. Budget the volume; it's most of your truth.

This architecture composes two production systems: the real-time Kafka LXP platform (nightly batch → sub-3-minute, millions of learners) and the ClickHouse telemetry platform (1B+ events/hour). Sector context is on the EdTech industry page.

FAQ · Frequently Asked Questions

Why ClickHouse instead of a cloud warehouse for learning analytics?

Query shape and economics: high-cardinality time-series aggregation with sub-second interactive response at high concurrency is ClickHouse's core strength, at a fraction of per-query warehouse cost. Vipra runs ClickHouse in production at 1B+ events/hour. Warehouses remain right for complex ad-hoc joins — many estates run both.

Is sub-3-minute end-to-end latency proven or aspirational?

Proven — it's Vipra's documented production result: a Kafka streaming platform that replaced nightly batch for a learning platform serving millions of learners, with end-to-end latency under 3 minutes.

How do materialized views keep dashboards fast?

They aggregate at insert time: every arriving event incrementally updates per-minute rollups and funnel counters, so dashboards read tiny pre-aggregated tables in milliseconds instead of scanning raw events. One view per dashboard query shape, version-controlled, is the working discipline.

How is learner privacy handled in real-time telemetry?

Pseudonymous event IDs with a separately-governed identity vault, cohort-level defaults with role-gated individual drill-down, TTL-enforced retention, and anomaly signals routed to human judgment rather than automated consequences. FERPA/GDPR constraints are architecture inputs, not compliance paperwork.
