Why Your dbt Tests Are Giving You False Confidence

TL;DR — Direct Answer

dbt tests validate structure: nulls, uniqueness, accepted values, referential integrity. They are necessary and not sufficient — a pipeline can pass every dbt test while loading half the usual rows, with order values drifted 100x, from a source that stopped updating yesterday. Closing the gap needs a second layer that learns what normal looks like — volume baselines, distribution monitoring, freshness against history, cross-table consistency — whether you build it (custom tests + metadata) or buy it (Monte Carlo, Soda, Elementary). Green checkmarks measure what you asserted, not what changed.

What dbt tests actually assert

The four built-ins (not_null, unique, accepted_values, relationships) plus packages like dbt-utils and dbt-expectations cover schema shape and declared rules. They run when you run them, on assertions you thought to write, with thresholds you hard-coded. Each of those three conditions is a failure surface.

Five incidents your green test suite won't catch

Volume collapse: an upstream filter change drops daily rows from 2M to 400K. Every surviving row is valid. Every test passes. Revenue is down 80% in the dashboard before anyone asks why.
Distribution drift: a currency bug multiplies order totals by 100. Not null ✓ unique ✓ accepted range (0, ∞) ✓. The mean just moved two orders of magnitude.
Staleness with successful runs: the source stopped updating; your pipeline faithfully re-transforms yesterday's data. dbt's source freshness helps — if you configured it per source and someone watches it.
Cross-table inconsistency: orders say $4.2M, the finance mart says $3.9M, both pass their local tests. Consistency is a relationship between tables; almost nobody writes those tests.
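Null-rate creep: a column drifts from 2% to 30% null. not_null is all-or-nothing: it fails on a single null, so a column that is legitimately 2% null can't carry the test at all, and nothing flags the move to 30%. Rates need baselines, not booleans.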
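To make the first two incidents concrete, here is a minimal Python sketch rather than real dbt code. The function structural_checks is a hypothetical stand-in for not_null, unique, and a positive-range test, and both batches are invented data. The point it illustrates: the structural results are identical for the healthy and the broken day, while volume and mean have moved.

```python
from statistics import mean


def structural_checks(rows):
    """Hypothetical stand-ins for not_null, unique, and a positive-range test."""
    ids = [r["order_id"] for r in rows]
    return {
        "not_null_order_id": all(i is not None for i in ids),
        "unique_order_id": len(ids) == len(set(ids)),
        "amount_positive": all(
            r["amount"] is not None and r["amount"] > 0 for r in rows
        ),
    }


# Invented "normal" day: 1,000 orders with amounts between 20 and 29.
normal_day = [{"order_id": i, "amount": 20.0 + (i % 10)} for i in range(1000)]

# Invented bad day: an upstream filter drops half the rows and a
# currency bug multiplies every surviving amount by 100.
bad_day = [
    {"order_id": r["order_id"], "amount": r["amount"] * 100}
    for r in normal_day[:500]
]

normal_result = structural_checks(normal_day)
bad_result = structural_checks(bad_day)

# What the structural checks never look at.
volume_ratio = len(bad_day) / len(normal_day)
mean_ratio = mean(r["amount"] for r in bad_day) / mean(
    r["amount"] for r in normal_day
)

print("all structural checks pass on bad day:", all(bad_result.values()))
print(f"volume ratio: {volume_ratio:.2f}, mean ratio: {mean_ratio:.1f}")
```

Every check returns True for both batches, yet the bad day has half the rows and a mean roughly 100 times higher. Nothing in the structural layer compares today against yesterday.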

The layered setup that closes the gap

Layer 1 — Keep dbt tests for what they're good at

Structure in CI: keys, contracts (enforced: true), relationships, accepted values. Block merges on these. This layer is about code changes breaking shape.

Layer 2 — Anomaly detection against learned baselines

Compare each run against history: row counts vs a trailing 28-day seasonal baseline; null-rate and distinct-count drift per critical column; numeric distribution checks (mean/percentiles) on money columns; freshness vs historical landing times. Options by budget: Elementary (dbt-native, OSS, good start), Soda (declarative checks + anomaly detection), Monte Carlo (full observability with lineage-aware incident routing). Or hand-rolled: store run metadata and compare it in a nightly job. That works at small scale; at large scale you are maintaining a product of your own.
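A hand-rolled version of the baseline comparison might look like the sketch below. It is one possible approach, not how Elementary, Soda, or Monte Carlo work internally. The helper names (same_weekday_baseline, is_anomalous), the 30% tolerance, and the history values are all assumptions chosen for illustration. "Seasonal" is interpreted here as comparing against the same weekday over the trailing 28 days.

```python
from datetime import date, timedelta
from statistics import median


def same_weekday_baseline(history, today, window_days=28):
    """Median of a metric on the same weekday within the trailing window.

    history maps date -> metric value (row count, null rate, ...).
    Returns None when no comparable day exists yet.
    """
    samples = [
        history[today - timedelta(days=offset)]
        for offset in range(7, window_days + 1, 7)
        if today - timedelta(days=offset) in history
    ]
    return median(samples) if samples else None


def is_anomalous(current, baseline, tolerance=0.30):
    """Flag a relative deviation from baseline larger than tolerance."""
    if baseline is None:
        return False  # no history yet, nothing to compare against
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance


# Invented run metadata: 2M rows on weekdays, 1.2M on weekends.
today = date(2024, 6, 3)
row_counts = {}
for offset in range(1, 29):
    day = today - timedelta(days=offset)
    row_counts[day] = 1_200_000 if day.weekday() >= 5 else 2_000_000

baseline = same_weekday_baseline(row_counts, today)
print("400K rows anomalous:", is_anomalous(400_000, baseline))
print("1.9M rows anomalous:", is_anomalous(1_900_000, baseline))

# The same helpers work for rates: a steady 2% null rate, then 30%.
null_rates = {today - timedelta(days=o): 0.02 for o in range(1, 29)}
null_baseline = same_weekday_baseline(null_rates, today)
print("30% null rate anomalous:", is_anomalous(0.30, null_baseline, tolerance=0.5))
```

A median over four same-weekday samples is crude but resistant to a single bad day in the history. A fixed relative tolerance is the simplest choice; a real deployment would likely tune it per table or swap in a proper statistical model.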

Layer 3 — Cross-table reconciliation

Write the five business invariants that must always hold (orders total = finance mart total ± tolerance; users in events ⊆ users in dim). Schedule them like contracts, route violations to owners. Five reconciliations catch what five hundred column tests miss.
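The two example invariants above can be sketched in a few lines of Python. The function names, the 1% tolerance, the owner labels, and the figures are hypothetical; in practice the totals and id sets would come from warehouse queries.

```python
def totals_reconcile(left_total, right_total, rel_tolerance=0.01):
    """True when two totals agree within a relative tolerance of the larger one."""
    scale = max(abs(left_total), abs(right_total))
    if scale == 0:
        return True
    return abs(left_total - right_total) / scale <= rel_tolerance


def missing_from_dim(event_user_ids, dim_user_ids):
    """User ids that appear in events but not in the dimension table."""
    return set(event_user_ids) - set(dim_user_ids)


# Invented figures and owners, for illustration only.
checks = [
    (
        "orders total = finance mart total",
        "finance-data",
        totals_reconcile(4_200_000, 3_900_000),
    ),
    (
        "users in events are in dim_users",
        "product-analytics",
        not missing_from_dim([1, 2, 3, 7], [1, 2, 3, 4]),
    ),
]

# Each invariant carries an owner so a violation has somewhere to go.
violations = [(name, owner) for name, owner, ok in checks if not ok]
for name, owner in violations:
    print(f"VIOLATION: {name} -> notify {owner}")
```

With these inputs both invariants fail: the totals differ by about 7%, well outside the 1% tolerance, and user 7 appears in events without a row in the dimension. Keeping the owner next to the check is the design choice that matters here, since it is what turns a failed assertion into a routed alert.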

The cultural failure mode

Teams add tests until the suite is slow, then mute the channel where failures land. Coverage theater. Fewer, owned, routed assertions (see our data-contract system) beat exhaustive unowned ones — an alert nobody acts on is technical debt with a notification sound. This layering is how our governance practice cut a Fortune 500 client's manual reconciliation by 40%: not more tests — the right three layers, owned.

Frequently Asked Questions

Are dbt tests enough for data quality?

No — they validate structure (nulls, uniqueness, accepted values, relationships) but cannot catch volume anomalies, distribution drift, staleness with successful runs, or cross-table inconsistency. They are the necessary CI layer; production reliability needs baseline-learning observability on top.

What data incidents do dbt tests miss?

The expensive ones: row counts collapsing while all surviving rows are valid, numeric distributions drifting (currency bugs, unit changes), sources that stop updating while pipelines keep succeeding, null rates creeping upward in columns that can't carry a binary not_null test, and totals disagreeing across tables that each pass locally.

Do I need Monte Carlo or Soda, or can I build observability myself?

Small platforms can start with Elementary (dbt-native, open source) or hand-rolled metadata baselines. Buy a platform when you need lineage-aware alert routing, many domains, or anomaly models you don't want to maintain. The decision is operational capacity, not technical possibility.

How many dbt tests should a project have?

Fewer than most have, better owned: contracts and key tests on every exposed model, accepted values where business-critical, plus a handful of cross-table reconciliations. A thousand unowned tests feeding a muted Slack channel provide negative value — alert fatigue is how real incidents get missed.