dbt's four built-in tests (not_null, unique, accepted_values, relationships), plus dbt-utils and dbt-expectations, cover schema shape and declared rules. They run when you run them, on assertions you thought to write, with thresholds you hard-coded, and all three of those clauses are the failure surface.
Closing the gap needs a second layer that learns what normal looks like: volume baselines, distribution monitoring, freshness against history, cross-table consistency. Build it (custom tests over run metadata) or buy it (Elementary, Soda, Monte Carlo) — but run it as a separate layer with a separate question: not 'does the data match my assertions?' but 'does the data match its own history?'
This is the testing philosophy behind Vipra's production quality work — the same layered discipline that powers our phantom-inventory reconciliation playbook and the Fortune 500 governance engagement that cut reconciliation effort 40%. Every incident class below is one we have met in production.
01 · What dbt Tests Actually Assert
Be precise about the tool before blaming it. dbt tests are assertions about structure, evaluated at run time: this column is never null, this key is unique, this value sits in this list, this foreign key resolves. With model contracts (enforced: true) they also pin column names and types at build time. Within that scope they are excellent — cheap, versioned with the code, reviewable in the PR that changes the logic.
The scope is the problem. Three structural blind spots: tests run when you run them (a source that dies after your 06:00 build fails nothing until tomorrow); they test what you thought to assert (nobody writes the assertion for the failure they haven't had yet); and thresholds are hard-coded (accepted_values can't notice that a permitted value's frequency just went insane). A green suite says: the data violates none of the rules we wrote down. It says nothing about whether the data is right.
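To make the third blind spot concrete, here is a minimal sketch of a check that accepted_values cannot express: every currency below is on the permitted list, yet the mix has shifted sharply. The inline sample rows, the column names, and the 20-point tolerance are illustrative assumptions, and the syntax is Postgres-flavoured.

```sql
-- Every value here passes accepted_values; only the share comparison notices the shift.
WITH daily_counts AS (   -- stand-in for a per-day count of orders by currency (hypothetical)
    SELECT * FROM (VALUES
        (DATE '2024-03-08', 'USD', 700), (DATE '2024-03-08', 'EUR', 300),
        (DATE '2024-03-09', 'USD', 710), (DATE '2024-03-09', 'EUR', 290),
        (DATE '2024-03-10', 'USD', 690), (DATE '2024-03-10', 'EUR', 310),
        (DATE '2024-03-11', 'USD',  50), (DATE '2024-03-11', 'EUR', 950)
    ) AS d(order_date, currency, n)
),
shares AS (
    -- Share of each currency within its own day
    SELECT order_date, currency,
           n::numeric / SUM(n) OVER (PARTITION BY order_date) AS share
    FROM daily_counts
),
baseline AS (
    -- Usual share, learned from the days before the one under test
    SELECT currency, AVG(share) AS usual_share
    FROM shares
    WHERE order_date < DATE '2024-03-11'
    GROUP BY currency
)
SELECT s.currency,
       ROUND(s.share, 2)       AS today_share,
       ROUND(b.usual_share, 2) AS usual_share
FROM shares s
JOIN baseline b USING (currency)
WHERE s.order_date = DATE '2024-03-11'
  AND ABS(s.share - b.usual_share) > 0.20;   -- assumed tolerance: 20 percentage points
```

With these sample rows both currencies come back, because USD fell from roughly 70% of orders to 5% while EUR rose to 95%. The threshold is still a fixed number here; the point is only that the comparison is against history rather than against a list.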
02 · The Architecture: Two Layers, Different Questions
The two layers are not redundant; they are orthogonal. Layer 1 (dbt's structural tests, run in CI) catches the developer who renamed a column; Layer 2 (anomaly detection against the data's own history, run post-load) catches the upstream team who changed a filter, the currency bug, the dead source. Estates that run only Layer 1 have green dashboards and quarterly surprises.
03 · Five Incidents Your Green Suite Won't Catch
| Incident | What happens | Why tests pass | What catches it |
|---|---|---|---|
| Volume collapse | Upstream filter change drops daily rows 2M → 400K | Every surviving row is valid | Row count vs 28-day seasonal baseline |
| Distribution drift | Currency bug multiplies order totals ×100 | not_null ✓ unique ✓ range (0, ∞) ✓ | Mean/percentile checks on money columns |
| Staleness with green runs | Source stopped; pipeline re-transforms yesterday faithfully | The stale data is perfectly shaped | Freshness vs historical landing times |
| Cross-table inconsistency | Orders say $4.2M; finance mart says $3.9M | Both pass their local tests | Reconciliation models that must return zero rows |
| Null-rate creep | Critical column drifts 2% → 30% null | Column is nullable by design, so not_null is absent or gated by a fixed threshold that has not tripped | Null-rate baselines per column |
Each row is a real incident pattern from production estates we've audited. The common shape: well-formed data that is wrong — the category structural testing is constitutionally blind to. The cross-table row is the one that costs the most and gets written last; we built an entire playbook around it for inventory (Inventory Ghosts), where the disagreements are dollar-denominated.
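A reconciliation model of the "must return zero rows" kind can be sketched as follows. It is written in the shape of a dbt singular test, where any returned row is a break. The inline VALUES stand in for the orders fact and the finance mart; the column names and the one-cent tolerance are assumptions for illustration, and the figures reuse the $4.2M versus $3.9M example from the table.

```sql
-- Any row returned is a disagreement between the two marts.
WITH orders AS (          -- stand-in for the orders fact (hypothetical columns)
    SELECT * FROM (VALUES
        (DATE '2024-03-01', 4200000.00),
        (DATE '2024-03-02', 3100000.00)
    ) AS o(order_date, order_total)
),
finance AS (              -- stand-in for the finance mart (hypothetical columns)
    SELECT * FROM (VALUES
        (DATE '2024-03-01', 3900000.00),
        (DATE '2024-03-02', 3100000.00)
    ) AS f(revenue_date, booked_revenue)
),
orders_daily AS (
    SELECT order_date AS day, SUM(order_total) AS orders_total
    FROM orders
    GROUP BY 1
),
finance_daily AS (
    SELECT revenue_date AS day, SUM(booked_revenue) AS finance_total
    FROM finance
    GROUP BY 1
)
SELECT COALESCE(o.day, f.day) AS day,
       o.orders_total,
       f.finance_total,
       COALESCE(o.orders_total, 0) - COALESCE(f.finance_total, 0) AS break_amount
FROM orders_daily o
FULL OUTER JOIN finance_daily f ON o.day = f.day   -- a day missing on either side is also a break
WHERE ABS(COALESCE(o.orders_total, 0) - COALESCE(f.finance_total, 0)) > 0.01;
```

With the sample rows only 2024-03-01 is returned, with a break of 300,000. The full outer join is a deliberate choice: a day that exists in one mart and not the other surfaces as a break instead of silently dropping out.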
04 · The Detection Data Flow
Two design notes. First, metadata capture is an on-run-end hook writing to an append-only table: ten lines of macro, and it is the asset everything else learns from. Second, every anomaly carries its dbt lineage link, because root-cause time is the metric that decides whether teams fix causes or learn to ignore alerts.
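The capture hook might look something like the sketch below. It is not the article's actual macro: the macro name capture_run_metadata is hypothetical, the status column is an assumed addition to metadata.run_results, and the row count leans on the adapter's rows_affected value, which varies by adapter and materialization and is sometimes absent. Where it is unreliable, a separate COUNT(*) per model is the safer source.

```sql
{% macro capture_run_metadata(results) %}
  {# Hypothetical macro: renders one INSERT that dbt executes as the on-run-end hook #}
  {% if execute %}
    {% set rows = [] %}
    {% for res in results if res.node.resource_type == 'model' %}
      {# rows_affected is adapter-dependent; fall back to NULL when it is missing #}
      {% set rc = res.adapter_response.get('rows_affected') %}
      {% do rows.append(
           "('" ~ res.node.name ~ "', '" ~ res.status ~ "', "
           ~ (rc if rc is not none else 'null') ~ ", current_timestamp)"
         ) %}
    {% endfor %}
    {% if rows %}
      insert into metadata.run_results (model_name, status, row_count, landed_at)
      values {{ rows | join(', ') }}
    {% endif %}
  {% endif %}
{% endmacro %}
```

It would be wired up in dbt_project.yml as an on-run-end entry that calls the macro with the `results` variable. The table is only ever inserted into, never updated, which is what keeps the history trustworthy as a baseline.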
05 · Layer 1 Done Right: Structure in CI
The structural floor: contracts + severity discipline (YAML)

```yaml
models:
  - name: fct_orders
    config:
      contract: {enforced: true}          # types + columns pinned at build
    columns:
      - name: order_id
        data_type: string
        constraints: [{type: not_null}, {type: primary_key}]
        data_tests: [unique]
      - name: order_total
        data_type: numeric(18,2)
        data_tests:
          - dbt_utils.accepted_range:
              min_value: 0
              config: {severity: error}   # structural floor: hard fail
      - name: currency
        data_type: string                 # an enforced contract needs a type on every column
        data_tests:
          - accepted_values:
              values: ['USD','EUR','GBP','INR']
              config: {severity: warn}    # new currency = question, not outage

sources:
  - name: payments
    freshness:                            # the most under-configured feature in dbt
      warn_after: {count: 2, period: hour}
      error_after: {count: 12, period: hour}
    loaded_at_field: _loaded_at
```
The disciplines that make Layer 1 honest:
- Severity is a decision, not a default. A test that pages at 2am must represent something worth waking for, and warn-tier tests need a weekly review or they become wallpaper.
- Source freshness is configured per source, with someone actually watching it. It is the most under-used feature in dbt and the cheapest staleness defense that exists.
- store_failures is on for anything money-adjacent, because the failed rows are the investigation, and re-finding them after the fact is archaeology.
06 · Layer 2: Anomaly Detection Against Learned Baselines
The build-vs-buy menu, honestly priced:
| Option | What you get | Right when |
|---|---|---|
| Hand-rolled (metadata + SQL) | Volume/null/freshness bands from your own run history; full control | Strong SQL team, narrow critical surface, zero budget |
| Elementary (dbt-native OSS) | Anomaly tests as dbt packages, lineage-aware reports, Slack routing | dbt-centric estates wanting the fastest credible start |
| Soda | Declarative checks + anomaly detection, contract-friendly syntax | Mixed-stack estates; checks owned beyond the dbt team |
| Monte Carlo | Full observability: auto-baselines, lineage incident routing, field-level monitors | Large estates where coverage breadth beats per-check control |
Hand-rolled volume baseline: the 80% solution in 25 lines (SQL)

```sql
WITH history AS (
    SELECT model_name, row_count, landed_at,
           EXTRACT(DOW FROM landed_at) AS dow
    FROM metadata.run_results
    WHERE landed_at >= CURRENT_DATE - 28
      AND landed_at <  CURRENT_DATE   -- keep today's run out of its own baseline
),
bands AS (
    -- One band per model per day of week
    SELECT model_name, dow,
           AVG(row_count)    AS mu,
           STDDEV(row_count) AS sigma
    FROM history
    GROUP BY 1, 2
)
SELECT t.model_name, t.row_count, b.mu, b.sigma,
       (t.row_count - b.mu) / NULLIF(b.sigma, 0) AS z   -- NULL when history never varied
FROM metadata.todays_runs t   -- assumed to expose the same dow column
JOIN bands b USING (model_name, dow)
WHERE ABS((t.row_count - b.mu) / NULLIF(b.sigma, 0)) > 3  -- page-worthy
```
Day-of-week awareness matters more than statistical sophistication — Monday volumes are not Saturday volumes, and a baseline that ignores seasonality cries wolf weekly, which kills the program faster than missed incidents do. Humble statistics, correctly seasonal, beat clever models that nobody trusts.
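The same banding idea carries over to null rates, the null-rate creep row in the incident table. The sketch below assumes the metadata store also records a null count per column per run; that table shape, the sample rows, and the 5-point absolute floor are all illustrative assumptions. It skips day-of-week grouping for brevity, which a real estate may want to add back.

```sql
WITH column_stats AS (    -- stand-in for per-column run metadata (hypothetical shape)
    SELECT * FROM (VALUES
        ('fct_orders', 'customer_id', DATE '2024-03-07', 1000,  20),
        ('fct_orders', 'customer_id', DATE '2024-03-08', 1000,  21),
        ('fct_orders', 'customer_id', DATE '2024-03-09', 1000,  19),
        ('fct_orders', 'customer_id', DATE '2024-03-10', 1000,  20),
        ('fct_orders', 'customer_id', DATE '2024-03-11', 1000, 300)
    ) AS s(model_name, column_name, run_date, row_count, null_count)
),
rates AS (
    SELECT model_name, column_name, run_date,
           null_count::numeric / NULLIF(row_count, 0) AS null_rate
    FROM column_stats
),
baseline AS (
    -- Learned from history only; the day under test is excluded
    SELECT model_name, column_name,
           AVG(null_rate)         AS mu,
           STDDEV_SAMP(null_rate) AS sigma
    FROM rates
    WHERE run_date < DATE '2024-03-11'
    GROUP BY 1, 2
)
SELECT r.model_name, r.column_name, r.null_rate, b.mu, b.sigma
FROM rates r
JOIN baseline b USING (model_name, column_name)
WHERE r.run_date = DATE '2024-03-11'
  -- Alert above mean + 3 sigma, with an absolute floor so a near-constant history
  -- does not turn every tiny wobble into a page (floor value is an assumption).
  AND r.null_rate > b.mu + GREATEST(3 * COALESCE(b.sigma, 0), 0.05);
```

With the sample rows the 2024-03-11 run is returned: its 30% null rate sits far above a history that hovers around 2%. No not_null test is involved, so a nullable column is monitored without pretending it must be fully populated.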
07 · Production Evidence: Where This Discipline Paid
This layered architecture is not hypothetical — it is the quality spine across Vipra's documented engagements. The Fortune 500 governance program applied exactly this make-disagreement-visible mechanism and cut reconciliation effort 40%. The 560-model banking platform runs Layer 1 in CI with synthetic data plus a separate pipeline-health monitoring layer — built after learning precisely this article's lesson: dbt tests tell you whether the data that arrived passes your rules, not whether the data you expected to arrive, arrived. And the cross-table reconciliation pattern became its own playbook in retail inventory, where conservation tests that "should return zero rows" are valued in dollars per break.
08 · Lessons Learned: The Hard Truths
- Green dashboards breed the worst incidents. The estates with the most confident test suites had the longest detection lags on the five incident classes — confidence without coverage is anesthesia.
- Seasonality first, statistics second. Every naive baseline we've replaced was killed by Monday-vs-Saturday false alarms within a month. DOW-aware bands are the difference between a program and a muted channel.
- The metadata store pays for everything. Ten lines of on-run-end macro created the asset every later capability — baselines, ledgers, SLA reports — was built on. Start capturing before you need it.
- Cross-table tests are written after the incident, always. Nobody budgets for reconciliation models until finance and orders disagree publicly. Write the three most money-critical ones this sprint; they are the highest-ROI tests in the stack.
- Warn-severity is where assertions go to die. Unreviewed warnings train everyone to scroll past yellow. Weekly triage or delete them — wallpaper is worse than nothing.
- Detection lag is the program's real KPI. Not test count, not coverage percent: hours between data going wrong and a human knowing. The incident ledger that tracks it is what converts quality work from faith to evidence.
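Detection lag is straightforward to compute once an incident ledger exists. The sketch below assumes a ledger with one row per incident, recording when the data went wrong, when a human knew, and which layer caught it. The table shape and the sample incidents are invented for illustration.

```sql
WITH incident_ledger AS (   -- hypothetical ledger; rows are made-up examples
    SELECT * FROM (VALUES
        (1, 'volume_collapse',   'layer_2',      TIMESTAMP '2024-03-01 06:00', TIMESTAMP '2024-03-01 07:30'),
        (2, 'renamed_column',    'layer_1',      TIMESTAMP '2024-03-03 09:00', TIMESTAMP '2024-03-03 09:05'),
        (3, 'cross_table_break', 'human_report', TIMESTAMP '2024-03-05 06:00', TIMESTAMP '2024-03-08 10:00')
    ) AS l(incident_id, incident_class, caught_by, went_wrong_at, detected_at)
),
lags AS (
    -- Hours between the data going wrong and a human knowing
    SELECT caught_by,
           EXTRACT(EPOCH FROM (detected_at - went_wrong_at)) / 3600.0 AS lag_hours
    FROM incident_ledger
)
SELECT caught_by,
       COUNT(*)                           AS incidents,
       ROUND(AVG(lag_hours)::numeric, 1)  AS avg_lag_hours,
       ROUND(MAX(lag_hours)::numeric, 1)  AS worst_lag_hours
FROM lags
GROUP BY caught_by
ORDER BY avg_lag_hours DESC;
```

Grouping by the layer that caught each incident is the useful cut: incidents first reported by a person rather than a check are the ones showing where coverage is missing.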
09 · Key Takeaways for Practitioners
- dbt asserts structure in CI; anomaly detection compares data to its own history post-load. Orthogonal, both required.
- Capture run metadata from day one: row counts, null rates, landing times. The store is the asset.
- DOW-aware bands beat clever models. False alarms kill programs faster than missed incidents.
- Reconcile across tables with identity models that must return zero rows, valued in dollars. Write the money-critical three first.
- Page-worthy means wake-worthy; warn needs weekly triage; freshness is configured per source and watched.
- The incident ledger (what was caught, how fast, by which layer) is the program's report card.
Companions: Inventory Ghosts applies this to dollar-denominated reconciliation; Data Contracts That Stick covers the producer-side culture; the governance engineering project documents the 40% result.