Why Your dbt Tests Are Giving You False Confidence

dbt tests validate structure — nulls, uniqueness, accepted values, referential integrity. Necessary, and not sufficient: a pipeline can pass every test while loading half the usual rows, with order values drifted 100×, from a source that stopped updating yesterday. Green checkmarks measure what you asserted, not what changed.

Discipline: Data Quality / Observability
Failure Surface: 5 incident classes
Fix: 2-layer architecture
Vipra Proven: 40% reconciliation cut
Stack: dbt · GE/Elementary · SQL
Published: June 2026

Executive Summary

The four built-ins (not_null, unique, accepted_values, relationships) plus dbt-utils and dbt-expectations cover schema shape and declared rules. They run when you run them, on assertions you thought to write, with thresholds you hard-coded — and all three of those clauses are the failure surface.

Closing the gap needs a second layer that learns what normal looks like: volume baselines, distribution monitoring, freshness against history, cross-table consistency. Build it (custom tests over run metadata) or buy it (Elementary, Soda, Monte Carlo) — but run it as a separate layer with a separate question: not 'does the data match my assertions?' but 'does the data match its own history?'

This is the testing philosophy behind Vipra's production quality work — the same layered discipline that powers our phantom-inventory reconciliation playbook and the Fortune 500 governance engagement that cut reconciliation effort 40%. Every incident class below is one we have met in production.

01 · What dbt Tests Actually Assert

Be precise about the tool before blaming it. dbt tests are assertions about structure, evaluated at run time: this column is never null, this key is unique, this value sits in this list, this foreign key resolves. Model contracts (contract: {enforced: true}) add a build-time check that pins column names and data types. Within that scope both are excellent: cheap, versioned with the code, reviewable in the PR that changes the logic.

The scope is the problem. Three structural blind spots: tests run when you run them (a source that dies after your 06:00 build fails nothing until tomorrow); they test what you thought to assert (nobody writes the assertion for the failure they haven't had yet); and thresholds are hard-coded (accepted_values can't notice that a permitted value's frequency just went insane). A green suite says: the data violates none of the rules we wrote down. It says nothing about whether the data is right.
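
To make the blind spot concrete, here is a minimal Python sketch (not dbt itself) that mimics the four kinds of structural checks discussed above and runs them against two hypothetical batches. The column names, the currency list, and the data are illustrative assumptions. The second batch has half the rows and totals 100× larger, and the checks still report nothing.

```python
from collections import Counter

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "INR"}  # illustrative list


def structural_failures(rows):
    """Return the names of failed checks, mimicking not_null, unique,
    accepted_values and a min_value=0 range check."""
    failures = []
    if any(r["order_id"] is None or r["order_total"] is None for r in rows):
        failures.append("not_null")
    id_counts = Counter(r["order_id"] for r in rows)
    if any(n > 1 for n in id_counts.values()):
        failures.append("unique")
    if any(r["currency"] not in ALLOWED_CURRENCIES for r in rows):
        failures.append("accepted_values")
    if any(r["order_total"] is not None and r["order_total"] < 0 for r in rows):
        failures.append("accepted_range")
    return failures


# Hypothetical "normal" day: 1,000 orders of 50.00 each.
yesterday = [
    {"order_id": f"o{i}", "order_total": 50.0, "currency": "USD"}
    for i in range(1000)
]
# Hypothetical bad day: half the rows, every total multiplied by 100.
today = [
    {"order_id": f"o{i}", "order_total": 5000.0, "currency": "USD"}
    for i in range(500)
]

print("yesterday:", structural_failures(yesterday))
print("today:    ", structural_failures(today))
```

Both calls return an empty list: every rule that was written down holds, and nothing in the suite compares today with yesterday.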

02 · The Architecture: Two Layers, Different Questions

  • Layer 1. Structural assertions: dbt, in CI. Keys, contracts, relationships, accepted values. Question: did a code change break the declared shape? Blocks merges.
  • Layer 2. Anomaly detection: post-load, against baselines. Volumes, distributions, freshness, cross-table identities. Question: did the data stop matching its own history?
  • Metadata. Run metadata store. Row counts, null rates, distinct counts, landing times per model per run: the memory that baselines learn from.
  • Routing. Tiered alerting. Layer-1 failures block the PR; Layer-2 anomalies page or ticket by money-at-risk, routed to the owning team with lineage attached.
  • Ledger. Incident ledger. Every caught anomaly with detection lag and root cause: the report card that proves the layer earns its keep.

The two layers are not redundant; they are orthogonal. Layer 1 catches the developer who renamed a column; Layer 2 catches the upstream team who changed a filter, the currency bug, the dead source. Estates that run only Layer 1 have green dashboards and quarterly surprises.
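
The tiered routing described above can be sketched in a few lines of Python. This is an illustration only: the `Finding` fields, the dollar thresholds, and the destination names are assumptions, not part of any tool's API, and a real estate would tune the thresholds per domain.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    model: str
    layer: int            # 1 = structural test failure, 2 = anomaly
    money_at_risk: float  # estimated dollars exposed (hypothetical field)
    owner: str
    lineage_url: str


# Hypothetical thresholds; real values are a business decision.
PAGE_THRESHOLD = 100_000
TICKET_THRESHOLD = 1_000


def route(finding: Finding) -> str:
    """Layer-1 failures block the merge; Layer-2 anomalies are tiered by money."""
    if finding.layer == 1:
        return "block_merge"
    if finding.money_at_risk >= PAGE_THRESHOLD:
        return "page"
    if finding.money_at_risk >= TICKET_THRESHOLD:
        return "ticket"
    return "digest"


drift = Finding("fct_orders", 2, 250_000.0, "orders-team", "https://example.invalid/lineage/fct_orders")
print(route(drift), "->", drift.owner)
```

Keeping the routing rule this small is deliberate: a rule that fits on one screen is one the owning teams can read and argue with.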

03 · Five Incidents Your Green Suite Won't Catch

| Incident | What happens | Why tests pass | What catches it |
|---|---|---|---|
| Volume collapse | Upstream filter change drops daily rows 2M → 400K | Every surviving row is valid | Row count vs 28-day seasonal baseline |
| Distribution drift | Currency bug multiplies order totals ×100 | not_null ✓ unique ✓ range [0, ∞) ✓ | Mean/percentile checks on money columns |
| Staleness with green runs | Source stopped; pipeline re-transforms yesterday faithfully | The stale data is perfectly shaped | Freshness vs historical landing times |
| Cross-table inconsistency | Orders say $4.2M; finance mart says $3.9M | Both pass their local tests | Reconciliation models that must return zero rows |
| Null-rate creep | Critical column drifts 2% → 30% null | not_null is all-or-nothing, so a column that is legitimately 2% null carries no null test at all | Null-rate baselines per column |

Each row is a real incident pattern from production estates we've audited. The common shape: well-formed data that is wrong — the category structural testing is constitutionally blind to. The cross-table row is the one that costs the most and gets written last; we built an entire playbook around it for inventory (Inventory Ghosts), where the disagreements are dollar-denominated.
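
The cross-table pattern is simple enough to sketch. The example below uses Python's built-in sqlite3 module so it runs anywhere; the table names, column names, figures, and the one-dollar tolerance are all hypothetical, chosen to mirror the $4.2M vs $3.9M disagreement in the table. In a dbt project the same query would live as a singular test that fails when it returns rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fct_orders (order_date TEXT, order_total REAL);
    CREATE TABLE finance_daily_revenue (revenue_date TEXT, revenue REAL);
    INSERT INTO fct_orders VALUES
        ('2026-06-01', 2500000), ('2026-06-01', 1700000), ('2026-06-02', 900000);
    INSERT INTO finance_daily_revenue VALUES
        ('2026-06-01', 3900000), ('2026-06-02', 900000);
""")

# A reconciliation query "passes" when it returns zero rows.
RECONCILIATION_SQL = """
    WITH orders AS (
        SELECT order_date AS d, SUM(order_total) AS orders_total
        FROM fct_orders
        GROUP BY order_date
    )
    SELECT o.d, o.orders_total, f.revenue, o.orders_total - f.revenue AS diff
    FROM orders o
    JOIN finance_daily_revenue f ON f.revenue_date = o.d
    WHERE ABS(o.orders_total - f.revenue) > :tolerance
"""

breaks = conn.execute(RECONCILIATION_SQL, {"tolerance": 1.0}).fetchall()
for day, orders_total, revenue, diff in breaks:
    print(f"{day}: orders={orders_total:,.0f} finance={revenue:,.0f} break={diff:,.0f}")
```

One limitation of this sketch: the inner join silently skips dates that exist on only one side. A production version would also need to flag days missing from either table.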

04 · The Detection Data Flow

dbt run (06:00)                              every model materialization
       │                                                  │
       ▼                                                  ▼
┌─ LAYER 1: CI + run-time ─────────┐   ┌─ METADATA CAPTURE (on-run-end) ─────┐
│ contracts · keys · relationships │   │ row_count, null_rate per column,    │
│ accepted_values                  │   │ distinct_count, landed_at, run_id   │
│ fail → block merge / halt DAG    │   │ → metadata.run_results (append)     │
└──────────────────────────────────┘   └──────────────────┬──────────────────┘
                                                          ▼
                                       ┌─ LAYER 2: post-load checks ─────────┐
                                       │ volume: count vs trailing-28d       │
                                       │         seasonal band (dow-aware)   │
                                       │ drift:  mean/p50/p95 on money cols  │
                                       │ nulls:  rate vs column baseline     │
                                       │ fresh:  landed_at vs history        │
                                       │ x-table: reconciliation models = 0  │
                                       └──────────────────┬──────────────────┘
                                                          ▼
         anomaly {model, metric, expected band, observed, lineage link, owner}
                                                          │
                                           page (money) · ticket · digest
                                                          ▼
                                      INCIDENT LEDGER: detection lag is the KPI

Two design notes: metadata capture is an on-run-end hook writing to an append-only table — ten lines of macro, and it is the asset everything else learns from; and every anomaly carries its dbt lineage link, because root-cause time is the metric that decides whether teams fix causes or learn to ignore alerts.
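
As a rough picture of what the capture step records, here is a Python/sqlite3 sketch. It is not the dbt macro itself: in a dbt project this logic would sit in an on-run-end hook, and the function name, the `metadata_run_results` table (standing in for metadata.run_results), and the sample model are assumptions for illustration.

```python
import sqlite3


def capture_run_metadata(conn, model, columns, run_id, landed_at):
    """Append one row per column to metadata_run_results for a freshly built model.

    `model` and `columns` are interpolated as identifiers, so they must come
    from trusted project config, never from user input.
    """
    row_count = conn.execute(f"SELECT COUNT(*) FROM {model}").fetchone()[0]
    for col in columns:
        nulls, distinct = conn.execute(
            f"SELECT SUM({col} IS NULL), COUNT(DISTINCT {col}) FROM {model}"
        ).fetchone()
        null_rate = (nulls or 0) / row_count if row_count else None
        conn.execute(
            "INSERT INTO metadata_run_results VALUES (?, ?, ?, ?, ?, ?, ?)",
            (run_id, model, col, row_count, null_rate, distinct, landed_at),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metadata_run_results (
        run_id TEXT, model_name TEXT, column_name TEXT,
        row_count INTEGER, null_rate REAL, distinct_count INTEGER, landed_at TEXT
    );
    CREATE TABLE fct_orders (order_id TEXT, customer_email TEXT);
    INSERT INTO fct_orders VALUES
        ('o1', 'a@example.com'), ('o2', NULL),
        ('o3', 'a@example.com'), ('o4', 'b@example.com');
""")

capture_run_metadata(
    conn, "fct_orders", ["order_id", "customer_email"],
    run_id="run-001", landed_at="2026-06-01T06:05:00",
)
for row in conn.execute("SELECT * FROM metadata_run_results ORDER BY column_name"):
    print(row)
```

The table is only ever appended to. Each run adds rows and nothing is updated in place, which is what lets later baselines replay history.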

05 · Layer 1 Done Right: Structure in CI

the structural floor: contracts + severity discipline (YAML)

    models:
      - name: fct_orders
        config:
          contract: {enforced: true}          # types + columns pinned at build
        columns:
          - name: order_id
            data_type: string
            constraints: [{type: not_null}, {type: primary_key}]
            data_tests: [unique]
          - name: order_total
            data_type: numeric(18,2)
            data_tests:
              - dbt_utils.accepted_range:
                  min_value: 0
                  config: {severity: error}   # structural floor: hard fail
          - name: currency
            data_type: string                 # an enforced contract needs data_type on every column
            data_tests:
              - accepted_values:
                  values: ['USD','EUR','GBP','INR']
                  config: {severity: warn}    # new currency = question, not outage

    sources:
      - name: payments
        freshness:                            # the most under-configured feature in dbt
          warn_after: {count: 2, period: hour}
          error_after: {count: 12, period: hour}
        loaded_at_field: _loaded_at

The disciplines that make Layer 1 honest: severity is a decision, not a default — a test that pages at 2am must represent something worth waking for, and warn-tier tests need a weekly review or they become wallpaper; source freshness configured per source with someone actually watching it (the most under-used feature in dbt, and the cheapest staleness defense that exists); and store_failures on for anything money-adjacent, because the failed rows are the investigation, and re-finding them after the fact is archaeology.
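
The freshness thresholds in the config above reduce to a comparison between the newest `_loaded_at` value and the time of the check. The Python sketch below illustrates that logic only. It is not dbt's implementation, the function name is hypothetical, and behavior exactly on a threshold boundary is an assumption of this sketch.

```python
from datetime import datetime, timedelta

# Mirrors warn_after: 2 hours and error_after: 12 hours from the config above.
WARN_AFTER = timedelta(hours=2)
ERROR_AFTER = timedelta(hours=12)


def freshness_status(max_loaded_at: datetime, checked_at: datetime) -> str:
    """Classify a source by the age of its newest row at check time."""
    age = checked_at - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


checked_at = datetime(2026, 6, 1, 18, 0)
for label, loaded in [
    ("loaded 1h ago", datetime(2026, 6, 1, 17, 0)),
    ("loaded 5h ago", datetime(2026, 6, 1, 13, 0)),
    ("loaded 36h ago", datetime(2026, 5, 31, 6, 0)),
]:
    print(label, "->", freshness_status(loaded, checked_at))
```

The result depends entirely on when the check runs. A source that stops shortly after a morning check still reads as fresh until someone checks again, which is why the check needs its own schedule and an owner.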

06 · Layer 2: Anomaly Detection Against Learned Baselines

The build-vs-buy menu, honestly priced:

| Option | What you get | Right when |
|---|---|---|
| Hand-rolled (metadata + SQL) | Volume/null/freshness bands from your own run history; full control | Strong SQL team, narrow critical surface, zero budget |
| Elementary (dbt-native OSS) | Anomaly tests as dbt packages, lineage-aware reports, Slack routing | dbt-centric estates wanting the fastest credible start |
| Soda | Declarative checks + anomaly detection, contract-friendly syntax | Mixed-stack estates; checks owned beyond the dbt team |
| Monte Carlo | Full observability: auto-baselines, lineage incident routing, field-level monitors | Large estates where coverage breadth beats per-check control |
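
hand-rolled volume baseline: the 80% solution in 25 lines (SQL)

    WITH history AS (
        -- trailing 28 days, excluding today so the run under test cannot shift its own baseline
        SELECT model_name, row_count, EXTRACT(DOW FROM landed_at) AS dow
        FROM metadata.run_results
        WHERE landed_at >= CURRENT_DATE - 28
          AND landed_at <  CURRENT_DATE
    ),
    bands AS (
        -- one band per model per day of week (about four samples each over 28 days)
        SELECT model_name, dow, AVG(row_count) AS mu, STDDEV(row_count) AS sigma
        FROM history
        GROUP BY 1, 2
    ),
    today AS (
        -- derive dow here so the join key exists on both sides
        -- (assumes metadata.todays_runs also carries landed_at)
        SELECT model_name, row_count, EXTRACT(DOW FROM landed_at) AS dow
        FROM metadata.todays_runs
    )
    SELECT t.model_name, t.row_count, b.mu, b.sigma,
           (t.row_count - b.mu) / NULLIF(b.sigma, 0) AS z
    FROM today t
    JOIN bands b USING (model_name, dow)
    -- a zero sigma yields a NULL z and is filtered out; handle constant-volume models separately
    WHERE ABS((t.row_count - b.mu) / NULLIF(b.sigma, 0)) > 3  -- page-worthy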

Day-of-week awareness matters more than statistical sophistication — Monday volumes are not Saturday volumes, and a baseline that ignores seasonality cries wolf weekly, which kills the program faster than missed incidents do. Humble statistics, correctly seasonal, beat clever models that nobody trusts.
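
A small Python sketch shows why the day-of-week split matters. The 28 days of history below are invented: weekdays land around 2M rows and weekends around 400K. The sketch compares a pooled baseline against a per-weekday baseline for two hypothetical observations, an ordinary Saturday and a Tuesday that collapsed to 400K rows.

```python
from statistics import mean, stdev


def z_score(value, sample):
    """Standard score of value against sample; None when the sample has no spread."""
    sigma = stdev(sample)
    return None if sigma == 0 else (value - mean(sample)) / sigma


# Invented history: (day_of_week, row_count), 0 = Monday ... 6 = Sunday.
history = []
for week in range(4):
    for dow in range(7):
        base = 400_000 if dow >= 5 else 2_000_000
        history.append((dow, base + week * 10_000))

pooled = [count for _, count in history]
tuesdays = [count for dow, count in history if dow == 1]
saturdays = [count for dow, count in history if dow == 5]

tuesday_collapse = 400_000   # weekday volume after an upstream filter change
normal_saturday = 410_000    # an ordinary weekend load

z_collapse_pooled = z_score(tuesday_collapse, pooled)
z_collapse_dow = z_score(tuesday_collapse, tuesdays)
z_saturday_dow = z_score(normal_saturday, saturdays)

print(f"Tuesday collapse, pooled band: z = {z_collapse_pooled:.2f}")
print(f"Tuesday collapse, Tuesday band: z = {z_collapse_dow:.2f}")
print(f"Normal Saturday, Saturday band: z = {z_saturday_dow:.2f}")
```

With this data the pooled band is so wide that the Tuesday collapse stays inside three sigma, while the Tuesday-only band flags it and the Saturday-only band leaves the normal Saturday alone. Mixing weekdays and weekends inflates sigma, so a pooled baseline ends up either blind or noisy depending on where the threshold is set. Four samples per weekday is thin, which is why 28 days is best treated as a minimum.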

07 · Production Evidence: Where This Discipline Paid

This layered architecture is not hypothetical — it is the quality spine across Vipra's documented engagements. The Fortune 500 governance program applied exactly this make-disagreement-visible mechanism and cut reconciliation effort 40%. The 560-model banking platform runs Layer 1 in CI with synthetic data plus a separate pipeline-health monitoring layer — built after learning precisely this article's lesson: dbt tests tell you whether the data that arrived passes your rules, not whether the data you expected to arrive, arrived. And the cross-table reconciliation pattern became its own playbook in retail inventory, where conservation tests that "should return zero rows" are valued in dollars per break.

  • 40%: Reconciliation Cut (Vipra Documented)
  • 560+: Models Under This Discipline (Banking)
  • 5: Incident Classes Structural Tests Miss
  • 28d: Seasonal Baseline (DOW-Aware Minimum)

08 · Lessons Learned: The Hard Truths

  • Green dashboards breed the worst incidents. The estates with the most confident test suites had the longest detection lags on the five incident classes — confidence without coverage is anesthesia.
  • Seasonality first, statistics second. Every naive baseline we've replaced was killed by Monday-vs-Saturday false alarms within a month. DOW-aware bands are the difference between a program and a muted channel.
  • The metadata store pays for everything. Ten lines of on-run-end macro created the asset every later capability — baselines, ledgers, SLA reports — was built on. Start capturing before you need it.
  • Cross-table tests are written after the incident, always. Nobody budgets for reconciliation models until finance and orders disagree publicly. Write the three most money-critical ones this sprint; they are the highest-ROI tests in the stack.
  • Warn-severity is where assertions go to die. Unreviewed warnings train everyone to scroll past yellow. Weekly triage or delete them — wallpaper is worse than nothing.
  • Detection lag is the program's real KPI. Not test count, not coverage percent: hours between data going wrong and a human knowing. The incident ledger that tracks it is what converts quality work from faith to evidence.
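
Detection lag only becomes a KPI once it is computed the same way every time. The Python sketch below shows one possible shape for the ledger. The field names, the detector labels, and the three sample incidents are hypothetical, and median lag per detector is just one reasonable summary.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Incident:
    model: str
    went_wrong_at: datetime   # best estimate, filled in during root-cause analysis
    detected_at: datetime
    detected_by: str          # e.g. "layer1", "layer2", "human"
    root_cause: str

    @property
    def lag_hours(self) -> float:
        return (self.detected_at - self.went_wrong_at).total_seconds() / 3600


def median_lag_by_detector(ledger):
    """Median hours between data going wrong and detection, per detector."""
    lags = {}
    for incident in ledger:
        lags.setdefault(incident.detected_by, []).append(incident.lag_hours)
    return {detector: median(values) for detector, values in lags.items()}


ledger = [
    Incident("fct_orders", datetime(2026, 6, 1, 6, 0), datetime(2026, 6, 1, 8, 0),
             "layer2", "upstream filter change"),
    Incident("fct_payments", datetime(2026, 6, 3, 6, 0), datetime(2026, 6, 3, 10, 0),
             "layer2", "currency conversion bug"),
    Incident("finance_mart", datetime(2026, 6, 5, 6, 0), datetime(2026, 6, 8, 6, 0),
             "human", "orders and finance totals disagreed"),
]

print(median_lag_by_detector(ledger))
```

Splitting the figure by detector is the useful part: it shows how many incidents still reach a human before any layer sees them, and how long that takes.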

09 · Key Takeaways for Practitioners

  • Two layers, two questions. dbt asserts structure in CI; anomaly detection compares data to its own history post-load. Orthogonal, both required.
  • Baselines need memory. Capture run metadata from day one: row counts, null rates, landing times. The store is the asset.
  • Seasonal or silent. DOW-aware bands beat clever models. False alarms kill programs faster than missed incidents.
  • Reconcile across tables. Identity models that must return zero rows, valued in dollars. Write the money-critical three first.
  • Severity is a decision. Page-worthy means wake-worthy; warn needs weekly triage; freshness configured per source, watched.
  • Track detection lag. The incident ledger (what was caught, how fast, by which layer) is the program's report card.

Companions: Inventory Ghosts applies this to dollar-denominated reconciliation; Data Contracts That Stick covers the producer-side culture; the governance engineering project documents the 40% result.

FAQ · Frequently Asked Questions

Are dbt tests enough for data quality?
No. They validate structure (nulls, uniqueness, accepted values, relationships) but cannot catch volume anomalies, distribution drift, staleness with successful runs, or cross-table inconsistency. They are the necessary CI layer; production reliability needs baseline-learning observability on top.

What data incidents do dbt tests miss?
The expensive ones: row counts collapsing while all surviving rows are valid, numeric distributions drifting (currency bugs, unit changes), sources that stop updating while pipelines keep succeeding, null rates creeping upward on columns where the all-or-nothing not_null test cannot be applied, and totals disagreeing across tables that each pass locally.

Do I need Monte Carlo or Soda, or can I build observability myself?
Small platforms can start with Elementary (dbt-native, open source) or hand-rolled metadata baselines. Buy a platform when you need lineage-aware alert routing, many domains, or anomaly models you don't want to maintain. The decision is operational capacity, not technical possibility.

How many dbt tests should a project have?
Fewer than most have, better owned: contracts and key tests on every exposed model, accepted values where business-critical, plus a handful of cross-table reconciliations. A thousand unowned tests feeding a muted Slack channel provide negative value: alert fatigue is how real incidents get missed.