Why Your dbt Tests Are Giving You False Confidence

Executive Summary

The four built-ins (not_null, unique, accepted_values, relationships) plus dbt-utils and dbt-expectations cover schema shape and declared rules. They run when you run them, on assertions you thought to write, with thresholds you hard-coded — and all three of those clauses are the failure surface.

Closing the gap needs a second layer that learns what normal looks like: volume baselines, distribution monitoring, freshness against history, cross-table consistency. Build it (custom tests over run metadata) or buy it (Elementary, Soda, Monte Carlo) — but run it as a separate layer with a separate question: not 'does the data match my assertions?' but 'does the data match its own history?'

This is the testing philosophy behind Vipra's production quality work — the same layered discipline that powers our phantom-inventory reconciliation playbook and the Fortune 500 governance engagement that cut reconciliation effort 40%. Every incident class below is one we have met in production.

01 · What dbt Tests Actually Assert

Be precise about the tool before blaming it. dbt tests are assertions about structure, evaluated at run time: this column is never null, this key is unique, this value sits in this list, this foreign key resolves. With model contracts (enforced: true) they also pin column names and types at build time. Within that scope they are excellent — cheap, versioned with the code, reviewable in the PR that changes the logic.

The scope is the problem. Three structural blind spots: tests run when you run them (a source that dies after your 06:00 build fails nothing until tomorrow); they test what you thought to assert (nobody writes the assertion for the failure they haven't had yet); and thresholds are hard-coded (accepted_values can't notice that a permitted value's frequency just went insane). A green suite says: the data violates none of the rules we wrote down. It says nothing about whether the data is right.
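The third blind spot is easy to see in miniature. The sketch below is illustrative only: the currency list comes from the article's own YAML, while the function names, the sample data, and the 0.15 tolerance are assumptions made up for the example. It shows a day where every value is permitted, so an accepted_values-style check passes, while the share of each value has moved far away from the previous day.

```python
from collections import Counter

ALLOWED = {"USD", "EUR", "GBP", "INR"}  # the list an accepted_values test would declare


def accepted_values_passes(values):
    """Structural check: every value must be in the allowed set."""
    return all(v in ALLOWED for v in values)


def share_shifts(baseline, today, tolerance=0.15):
    """Return {value: (baseline_share, today_share)} for shares that moved more than tolerance.

    tolerance is an illustrative absolute threshold, not a recommended setting.
    """
    base, now = Counter(baseline), Counter(today)
    shifts = {}
    for value in ALLOWED:
        b = base[value] / len(baseline)
        t = now[value] / len(today)
        if abs(t - b) > tolerance:
            shifts[value] = (round(b, 2), round(t, 2))
    return shifts


# Hypothetical data: yesterday is mostly USD; today a bad mapping labels most rows INR.
yesterday = ["USD"] * 70 + ["EUR"] * 15 + ["GBP"] * 10 + ["INR"] * 5
today = ["USD"] * 10 + ["EUR"] * 15 + ["GBP"] * 10 + ["INR"] * 65

print(accepted_values_passes(today))   # the structural check has nothing to object to
print(share_shifts(yesterday, today))  # the history-based check flags USD and INR
```

A real baseline would compare against more than one prior day; the point here is only that the two checks answer different questions about the same rows.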

02 · The Architecture: Two Layers, Different Questions

layer 1 → Structural assertions (dbt, in CI). Keys, contracts, relationships, accepted values. Question: did a code change break the declared shape? Blocks merges.

layer 2 → Anomaly detection (post-load, against baselines). Volumes, distributions, freshness, cross-table identities. Question: did the data stop matching its own history?

metadata → Run metadata store. Row counts, null rates, distinct counts, landing times per model per run: the memory that baselines learn from.

routing → Tiered alerting. Layer-1 failures block the PR; Layer-2 anomalies page or ticket by money-at-risk, routed to the owning team with lineage attached.

ledger → Incident ledger. Every caught anomaly with detection lag and root cause: the report card that proves the layer earns its keep.

The two layers are not redundant; they are orthogonal. Layer 1 catches the developer who renamed a column; Layer 2 catches the upstream team who changed a filter, the currency bug, the dead source. Estates that run only Layer 1 have green dashboards and quarterly surprises.
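The tiered alerting rule can be sketched as a small routing function. Everything named here is hypothetical: the Finding fields, the two dollar thresholds, and the channel names are assumptions for illustration, and how money-at-risk gets estimated is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    model: str
    layer: int            # 1 = structural assertion, 2 = anomaly against a baseline
    money_at_risk: float  # estimated exposure in dollars (estimation is out of scope here)
    owner: str
    lineage_url: str


PAGE_THRESHOLD = 100_000   # hypothetical cut-off for waking someone up
TICKET_THRESHOLD = 1_000   # hypothetical cut-off for opening a ticket


def route(finding: Finding) -> str:
    """Layer-1 failures block the change; Layer-2 anomalies are tiered by money-at-risk."""
    if finding.layer == 1:
        return "block_pr"
    if finding.money_at_risk >= PAGE_THRESHOLD:
        return "page"
    if finding.money_at_risk >= TICKET_THRESHOLD:
        return "ticket"
    return "digest"


anomaly = Finding("fct_orders", 2, 250_000.0, "orders-team", "https://example.invalid/lineage/fct_orders")
print(route(anomaly), "->", anomaly.owner)
```

Keeping the owner and lineage link on the finding itself means whichever channel receives it already has what it needs to start root-causing.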

03 · Five Incidents Your Green Suite Won't Catch

Incident	What happens	Why tests pass	What catches it
Volume collapse	Upstream filter change drops daily rows 2M → 400K	Every surviving row is valid	Row count vs 28-day seasonal baseline
Distribution drift	Currency bug multiplies order totals ×100	not_null ✓ unique ✓ range (0, ∞) ✓	Mean/percentile checks on money columns
Staleness with green runs	Source stopped; pipeline re-transforms yesterday faithfully	The stale data is perfectly shaped	Freshness vs historical landing times
Cross-table inconsistency	Orders say $4.2M; finance mart says $3.9M	Both pass their local tests	Reconciliation models that must return zero rows
Null-rate creep	Critical column drifts 2% → 30% null	not_null is all-or-nothing: it fails on any null, so a column that is legitimately 2% null carries no null test at all	Null-rate baselines per column

Each row is a real incident pattern from production estates we've audited. The common shape: well-formed data that is wrong — the category structural testing is constitutionally blind to. The cross-table row is the one that costs the most and gets written last; we built an entire playbook around it for inventory (Inventory Ghosts), where the disagreements are dollar-denominated.
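A reconciliation model that must return zero rows can be sketched with an in-memory SQLite database. The table finance_mart, the column names, the cent amounts, and the tolerance are all assumptions invented for the example; only the pattern (each returned row is a break) comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fct_orders (order_date TEXT, order_total_cents INTEGER);
    CREATE TABLE finance_mart (revenue_date TEXT, revenue_cents INTEGER);
    INSERT INTO fct_orders VALUES
        ('2024-05-01', 120000), ('2024-05-01', 80000), ('2024-05-02', 50000);
    INSERT INTO finance_mart VALUES
        ('2024-05-01', 200000), ('2024-05-02', 45000);
""")

# Each returned row is a day on which the two tables disagree by more than the tolerance.
RECONCILIATION_SQL = """
    WITH orders AS (
        SELECT order_date AS d, SUM(order_total_cents) AS orders_cents
        FROM fct_orders
        GROUP BY order_date
    )
    SELECT o.d, o.orders_cents, f.revenue_cents,
           o.orders_cents - f.revenue_cents AS break_cents
    FROM orders o
    JOIN finance_mart f ON f.revenue_date = o.d
    WHERE ABS(o.orders_cents - f.revenue_cents) > :tolerance_cents
"""

breaks = conn.execute(RECONCILIATION_SQL, {"tolerance_cents": 100}).fetchall()
print(breaks)  # an empty list would mean the tables reconcile
```

In this toy data both tables would pass their own local checks, yet 2024-05-02 comes back as a break. The inner join is a simplification: a day missing from one side entirely would not appear, so a production version would likely need an outer join.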

04 · The Detection Data Flow

dbt run (06:00)                             every model materialization
      │                                                  │
      ▼                                                  ▼
┌─ LAYER 1: CI + run-time ─────────┐  ┌─ METADATA CAPTURE (on-run-end) ─────┐
│ contracts · keys · relationships │  │ row_count, null_rate per column,    │
│ accepted_values                  │  │ distinct_count, landed_at, run_id   │
│ fail → block merge / halt DAG    │  │ → metadata.run_results (append)     │
└──────────────────────────────────┘  └──────────────────┬──────────────────┘
                                                         ▼
                                      ┌─ LAYER 2: post-load checks ─────────┐
                                      │ volume: count vs trailing-28d       │
                                      │         seasonal band (dow-aware)   │
                                      │ drift: mean/p50/p95 on money cols   │
                                      │ nulls: rate vs column baseline      │
                                      │ fresh: landed_at vs history         │
                                      │ x-table: reconciliation models = 0  │
                                      └──────────────────┬──────────────────┘
                                                         ▼
                       anomaly {model, metric, expected band, observed, lineage link, owner}
                                                         │
                                          page (money) · ticket · digest
                                                         ▼
                                     INCIDENT LEDGER: detection lag is the KPI

Two design notes: metadata capture is an on-run-end hook writing to an append-only table — ten lines of macro, and it is the asset everything else learns from; and every anomaly carries its dbt lineage link, because root-cause time is the metric that decides whether teams fix causes or learn to ignore alerts.
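The shape of that append-only table can be sketched outside dbt. In a dbt project this would be an on-run-end macro; the Python below is only a stand-in to show what gets recorded per run. The table layout, the function names, and the sample rows are assumptions, with column names borrowed from the flow diagram.

```python
import sqlite3
from datetime import datetime, timezone


def init_store(conn):
    """Create the append-only metadata table if it does not exist yet."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS run_results (
            run_id TEXT, model_name TEXT, column_name TEXT,
            row_count INTEGER, null_rate REAL, distinct_count INTEGER,
            landed_at TEXT
        )
    """)


def capture(conn, run_id, model_name, rows, landed_at=None):
    """Append one record per column; rows is a list of dicts for the materialized model."""
    landed_at = landed_at or datetime.now(timezone.utc).isoformat()
    insert = "INSERT INTO run_results VALUES (?, ?, ?, ?, ?, ?, ?)"
    if not rows:
        # An empty model is itself a signal, so record it rather than skipping it.
        conn.execute(insert, (run_id, model_name, None, 0, None, None, landed_at))
    else:
        for col in rows[0].keys():
            values = [r[col] for r in rows]
            nulls = sum(v is None for v in values)
            distinct = len({v for v in values if v is not None})
            conn.execute(
                insert,
                (run_id, model_name, col, len(rows), nulls / len(rows), distinct, landed_at),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_store(conn)
capture(conn, "run-001", "fct_orders", [
    {"order_id": "a1", "currency": "USD"},
    {"order_id": "a2", "currency": None},
])
print(conn.execute("SELECT * FROM run_results").fetchall())
```

Nothing is ever updated or deleted, which is what lets later baselines treat the table as history.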

05 · Layer 1 Done Right: Structure in CI

the structural floor — contracts + severity discipline (YAML)
models:
  - name: fct_orders
    config:
      contract: {enforced: true}          # types + columns pinned at build
    columns:
      - name: order_id
        data_type: string
        constraints: [{type: not_null}, {type: primary_key}]
        data_tests: [unique]
      - name: order_total
        data_type: numeric(18,2)
        data_tests:
          - dbt_utils.accepted_range:
              min_value: 0
              config: {severity: error}    # structural floor: hard fail
      - name: currency
        data_type: string                  # an enforced contract needs a data_type on every column
        data_tests:
          - accepted_values:
              values: ['USD','EUR','GBP','INR']
              config: {severity: warn}     # new currency = question, not outage
sources:
  - name: payments
    freshness:                             # the most under-configured feature in dbt
      warn_after:  {count: 2,  period: hour}
      error_after: {count: 12, period: hour}
    loaded_at_field: _loaded_at

The disciplines that make Layer 1 honest: severity is a decision, not a default — a test that pages at 2am must represent something worth waking for, and warn-tier tests need a weekly review or they become wallpaper; source freshness configured per source with someone actually watching it (the most under-used feature in dbt, and the cheapest staleness defense that exists); and store_failures on for anything money-adjacent, because the failed rows are the investigation, and re-finding them after the fact is archaeology.
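The freshness thresholds in the YAML above reduce to a comparison between the newest load timestamp and the clock. This is a sketch of the idea, not dbt's implementation: the function name and the exact boundary handling are assumptions, while the 2-hour and 12-hour values mirror the config.

```python
from datetime import datetime, timedelta, timezone

WARN_AFTER = timedelta(hours=2)    # mirrors warn_after in the YAML
ERROR_AFTER = timedelta(hours=12)  # mirrors error_after in the YAML


def freshness_status(max_loaded_at: datetime, now: datetime) -> str:
    """Classify a source by the age of its newest row."""
    age = now - max_loaded_at
    if age > ERROR_AFTER:
        return "error"
    if age > WARN_AFTER:
        return "warn"
    return "pass"


# Hypothetical timeline: the source last loaded at 05:30, then went quiet.
last_load = datetime(2024, 5, 1, 5, 30, tzinfo=timezone.utc)
for hour in (6, 9, 18):
    check_time = datetime(2024, 5, 1, hour, 0, tzinfo=timezone.utc)
    print(check_time.time(), freshness_status(last_load, check_time))
```

The 06:00 check passes, which is why the check has to keep running on a schedule of its own rather than only alongside the morning build.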

06 · Layer 2: Anomaly Detection Against Learned Baselines

The build-vs-buy menu, honestly priced:

Option	What you get	Right when
Hand-rolled (metadata + SQL)	Volume/null/freshness bands from your own run history; full control	Strong SQL team, narrow critical surface, zero budget
Elementary (dbt-native OSS)	Anomaly tests as dbt packages, lineage-aware reports, Slack routing	dbt-centric estates wanting the fastest credible start
Soda	Declarative checks + anomaly detection, contract-friendly syntax	Mixed-stack estates; checks owned beyond the dbt team
Monte Carlo	Full observability: auto-baselines, lineage incident routing, field-level monitors	Large estates where coverage breadth beats per-check control

hand-rolled volume baseline — the 80% solution in 25 lines (SQL)
WITH history AS (
  SELECT model_name, row_count, landed_at,
         EXTRACT(DOW FROM landed_at) AS dow
  FROM metadata.run_results
  WHERE landed_at >= CURRENT_DATE - 28
    AND landed_at <  CURRENT_DATE        -- keep today's run out of its own baseline
), bands AS (
  SELECT model_name, dow,
         AVG(row_count)    AS mu,
         STDDEV(row_count) AS sigma
  FROM history GROUP BY 1, 2
)
SELECT t.model_name, t.row_count, b.mu, b.sigma,
       (t.row_count - b.mu) / NULLIF(b.sigma,0) AS z   -- NULL when sigma is 0
FROM metadata.todays_runs t              -- must expose model_name, row_count, dow
JOIN bands b USING (model_name, dow)
WHERE ABS((t.row_count - b.mu) / NULLIF(b.sigma,0)) > 3   -- page-worthy

Day-of-week awareness matters more than statistical sophistication — Monday volumes are not Saturday volumes, and a baseline that ignores seasonality cries wolf weekly, which kills the program faster than missed incidents do. Humble statistics, correctly seasonal, beat clever models that nobody trusts.
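The effect of seasonality on a z-score band can be checked with synthetic numbers. The history below is invented: about 2M rows on weekdays, about 600K on weekends, with a small deterministic wobble. The helper names and the dates are assumptions too. The sketch compares a pooled 28-day baseline with a same-weekday baseline for a normal Saturday and for a weekday collapse to 400K.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def build_history(end, days=28):
    """Hypothetical row counts for the days before `end` (weekday ~2M, weekend ~600K)."""
    history = []
    for i in range(1, days + 1):
        d = end - timedelta(days=i)
        base = 600_000 if d.weekday() >= 5 else 2_000_000
        history.append((d, base + (i % 4 - 1.5) * 20_000))
    return history


def z_score(observed, samples):
    """Standard score of observed against samples; None when the samples do not vary."""
    sigma = stdev(samples)
    if sigma == 0:
        return None
    return (observed - mean(samples)) / sigma


def pooled(history):
    return [rows for _, rows in history]


def same_weekday(history, day):
    return [rows for d, rows in history if d.weekday() == day.weekday()]


saturday, monday = date(2024, 6, 8), date(2024, 6, 10)
hist_sat, hist_mon = build_history(saturday), build_history(monday)

normal_sat_pooled = z_score(610_000, pooled(hist_sat))
normal_sat_dow = z_score(610_000, same_weekday(hist_sat, saturday))
collapse_pooled = z_score(400_000, pooled(hist_mon))
collapse_dow = z_score(400_000, same_weekday(hist_mon, monday))

print(normal_sat_pooled, normal_sat_dow)
print(collapse_pooled, collapse_dow)
```

With this data the pooled band is so wide that the weekday collapse stays inside 3 sigma, and a normal Saturday scores almost as extreme as the collapse does. Any pooled threshold tight enough to catch the collapse would therefore also fire on ordinary weekends. The same-weekday band separates the two cases cleanly, at the cost of having only four samples per weekday in a 28-day window.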

07 · Production Evidence: Where This Discipline Paid

This layered architecture is not hypothetical — it is the quality spine across Vipra's documented engagements. The Fortune 500 governance program applied exactly this make-disagreement-visible mechanism and cut reconciliation effort 40%. The 560-model banking platform runs Layer 1 in CI with synthetic data plus a separate pipeline-health monitoring layer — built after learning precisely this article's lesson: dbt tests tell you whether the data that arrived passes your rules, not whether the data you expected to arrive, arrived. And the cross-table reconciliation pattern became its own playbook in retail inventory, where conservation tests that "should return zero rows" are valued in dollars per break.

40% · Reconciliation Cut (Vipra Documented)

560+ · Models Under This Discipline (Banking)

5 · Incident Classes Structural Tests Miss

28d · Seasonal Baseline (DOW-Aware Minimum)

08 · Lessons Learned: The Hard Truths

Green dashboards breed the worst incidents. The estates with the most confident test suites had the longest detection lags on the five incident classes — confidence without coverage is anesthesia.
Seasonality first, statistics second. Every naive baseline we've replaced was killed by Monday-vs-Saturday false alarms within a month. DOW-aware bands are the difference between a program and a muted channel.
The metadata store pays for everything. Ten lines of on-run-end macro created the asset every later capability — baselines, ledgers, SLA reports — was built on. Start capturing before you need it.
Cross-table tests are written after the incident, always. Nobody budgets for reconciliation models until finance and orders disagree publicly. Write the three most money-critical ones this sprint; they are the highest-ROI tests in the stack.
Warn-severity is where assertions go to die. Unreviewed warnings train everyone to scroll past yellow. Weekly triage or delete them — wallpaper is worse than nothing.
Detection lag is the program's real KPI. Not test count, not coverage percent: hours between data going wrong and a human knowing. The incident ledger that tracks it is what converts quality work from faith to evidence.
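Detection lag falls straight out of an incident ledger once each entry records when the data went wrong and when a human knew. The ledger entries and field names below are hypothetical, made up to show the calculation; they are not figures from any engagement.

```python
from datetime import datetime
from statistics import median

# Hypothetical ledger entries: one per caught incident.
ledger = [
    {"incident": "renamed_column", "layer": 1,
     "went_wrong_at": datetime(2024, 5, 1, 9, 0), "detected_at": datetime(2024, 5, 1, 9, 30)},
    {"incident": "volume_collapse", "layer": 2,
     "went_wrong_at": datetime(2024, 5, 2, 6, 0), "detected_at": datetime(2024, 5, 2, 7, 30)},
    {"incident": "stale_source", "layer": 2,
     "went_wrong_at": datetime(2024, 5, 3, 6, 0), "detected_at": datetime(2024, 5, 3, 10, 0)},
    {"incident": "currency_x100", "layer": 2,
     "went_wrong_at": datetime(2024, 5, 4, 6, 0), "detected_at": datetime(2024, 5, 5, 8, 0)},
]


def detection_lag_hours(entry):
    """Hours between the data going wrong and a human knowing."""
    return (entry["detected_at"] - entry["went_wrong_at"]).total_seconds() / 3600


def median_lag_by_layer(entries):
    """Median detection lag in hours, grouped by the layer that caught the incident."""
    by_layer = {}
    for entry in entries:
        by_layer.setdefault(entry["layer"], []).append(detection_lag_hours(entry))
    return {layer: median(lags) for layer, lags in by_layer.items()}


print(median_lag_by_layer(ledger))
```

The median is used here so that one slow catch does not hide the typical case; tracking the worst lag alongside it would show the tail.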

09 · Key Takeaways for Practitioners

🏗️

Two layers, two questions

dbt asserts structure in CI; anomaly detection compares data to its own history post-load. Orthogonal, both required.

📊

Baselines need memory

Capture run metadata from day one — row counts, null rates, landing times. The store is the asset.

📅

Seasonal or silent

DOW-aware bands beat clever models. False alarms kill programs faster than missed incidents.

⚖️

Reconcile across tables

Identity models that must return zero rows, valued in dollars. Write the money-critical three first.

🔔

Severity is a decision

Page-worthy means wake-worthy; warn needs weekly triage; freshness configured per source, watched.

📒

Track detection lag

The incident ledger — what was caught, how fast, by which layer — is the program's report card.

Companions: Inventory Ghosts applies this to dollar-denominated reconciliation; Data Contracts That Stick covers the producer-side culture; the governance engineering project documents the 40% result.

FAQ · Frequently Asked Questions

Are dbt tests enough for data quality?

No — they validate structure (nulls, uniqueness, accepted values, relationships) but cannot catch volume anomalies, distribution drift, staleness with successful runs, or cross-table inconsistency. They are the necessary CI layer; production reliability needs baseline-learning observability on top.

What data incidents do dbt tests miss?

The expensive ones: row counts collapsing while all surviving rows are valid, numeric distributions drifting (currency bugs, unit changes), sources that stop updating while pipelines keep succeeding, null rates creeping up on columns that an all-or-nothing not_null test cannot cover, and totals disagreeing across tables that each pass locally.

Do I need Monte Carlo or Soda, or can I build observability myself?

Small platforms can start with Elementary (dbt-native, open source) or hand-rolled metadata baselines. Buy a platform when you need lineage-aware alert routing, many domains, or anomaly models you don't want to maintain. The decision is operational capacity, not technical possibility.

How many dbt tests should a project have?

Fewer than most have, better owned: contracts and key tests on every exposed model, accepted values where business-critical, plus a handful of cross-table reconciliations. A thousand unowned tests feeding a muted Slack channel provide negative value — alert fatigue is how real incidents get missed.
