TL;DR — Direct Answer
dbt tests validate structure: nulls, uniqueness, accepted values, referential integrity. They are necessary and not sufficient — a pipeline can pass every dbt test while loading half the usual rows, with order values drifted 100x, from a source that stopped updating yesterday. Closing the gap needs a second layer that learns what normal looks like — volume baselines, distribution monitoring, freshness against history, cross-table consistency — whether you build it (custom tests + metadata) or buy it (Monte Carlo, Soda, Elementary). Green checkmarks measure what you asserted, not what changed.
What dbt tests actually assert
The four built-ins (not_null, unique, accepted_values, relationships) plus packages like dbt-utils and dbt-expectations cover schema shape and declared rules. They run when you run them, on assertions you thought to write, with thresholds you hard-coded. Each of those three conditions is part of the failure surface.
Five incidents your green test suite won't catch
- Volume collapse: an upstream filter change drops daily rows from 2M to 400K. Every surviving row is valid. Every test passes. Revenue is down 80% in the dashboard before anyone asks why.
- Distribution drift: a currency bug multiplies order totals by 100. Not null ✓ unique ✓ accepted range (0, ∞) ✓. The mean just moved two orders of magnitude.
- Staleness with successful runs: the source stopped updating; your pipeline faithfully re-transforms yesterday's data. dbt's source freshness helps, if you configured it per source and someone watches it.
- Cross-table inconsistency: orders say $4.2M, the finance mart says $3.9M, both pass their local tests. Consistency is a relationship between tables; almost nobody writes those tests.
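- Null-rate creep: a column drifts from 2% to 30% null. not_null is all-or-nothing: it fails on any null at all, so a column that is legitimately 2% null cannot carry the test, and nothing notices the climb to 30%. Rates need baselines, not booleans.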
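To make the first two incidents concrete, here is a minimal Python sketch, not dbt itself. It mimics not_null, unique, and a positive-range rule on a made-up orders batch. The row shapes, the `structural_checks` helper, and the numbers are all hypothetical; the only point is that every structural check passes while volume halves and the mean moves by roughly two orders of magnitude.

```python
from statistics import mean


def structural_checks(rows):
    """Mimic not_null, unique, and a positive-range rule on order rows."""
    ids = [r["order_id"] for r in rows]
    totals = [r["total"] for r in rows]
    return {
        "not_null": all(v is not None for v in ids + totals),
        "unique": len(ids) == len(set(ids)),
        "positive_total": all(t is not None and t > 0 for t in totals),
    }


# Hypothetical "normal" day: 10 orders of roughly $50 each.
yesterday = [{"order_id": i, "total": 50.0 + i} for i in range(10)]
# Hypothetical bad day: half the rows, totals multiplied by 100.
today = [{"order_id": i, "total": (50.0 + i) * 100} for i in range(5)]

checks = structural_checks(today)  # every value is True
volume_ratio = len(today) / len(yesterday)  # share of yesterday's volume
mean_ratio = mean(r["total"] for r in today) / mean(r["total"] for r in yesterday)

print(checks, volume_ratio, round(mean_ratio, 1))
```

The structural checks only look at today's rows in isolation. The two ratios need yesterday's data, which is exactly the history a test suite of booleans does not keep.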
The layered setup that closes the gap
Layer 1 — Keep dbt tests for what they're good at
Structure in CI: keys, model contracts (contract: enforced: true), relationships, accepted values. Block merges on these. This layer catches code changes that break the shape of the data.
Layer 2 — Anomaly detection against learned baselines
Row counts vs trailing 28-day seasonal baseline; null-rate and distinct-count drift per critical column; numeric distribution checks (mean/percentiles) on money columns; freshness vs historical landing times. Options by budget: Elementary (dbt-native, OSS, good start), Soda (declarative checks + anomaly detection), Monte Carlo (full observability with lineage-aware incident routing). Or hand-rolled: store run metadata, compare in a nightly job — fine at small scale, a product at large scale.
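For the hand-rolled route, the comparison can be sketched in a few lines of Python. This is one possible reading of "trailing 28-day seasonal baseline": the median of the same weekday over the previous four weeks. The function names, the 0.3 volume threshold, and the 0.05 null-rate threshold are assumptions for illustration, not values taken from any of the tools above.

```python
from datetime import date, timedelta
from statistics import median


def same_weekday_baseline(history, day, window_days=28):
    """Median row count for the same weekday inside the trailing window."""
    points = []
    for offset in range(7, window_days + 1, 7):
        prior = day - timedelta(days=offset)
        if prior in history:
            points.append(history[prior])
    if not points:
        raise ValueError("no same-weekday history inside the window")
    return median(points)


def volume_deviation(history, day, observed):
    """Relative deviation of today's row count from the baseline."""
    baseline = same_weekday_baseline(history, day)
    return (observed - baseline) / baseline


def null_rate_drifted(baseline_rate, observed_rate, max_increase=0.05):
    """True when the null rate rose by more than max_increase (absolute)."""
    return observed_rate - baseline_rate > max_increase


# Hypothetical history: 2M rows on weekdays, 1.2M on weekends, for 28 days.
today = date(2024, 3, 4)  # a Monday
history = {}
for offset in range(1, 29):
    d = today - timedelta(days=offset)
    history[d] = 2_000_000 if d.weekday() < 5 else 1_200_000

deviation = volume_deviation(history, today, observed=400_000)
volume_alert = abs(deviation) > 0.3  # hypothetical threshold
null_alert = null_rate_drifted(0.02, 0.30)
```

Comparing against the same weekday keeps a quiet weekend from looking like a collapse. The median is used so that one earlier bad day does not drag the baseline down with it.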
Layer 3 — Cross-table reconciliation
Write the five business invariants that must always hold (orders total = finance mart total ± tolerance; users in events ⊆ users in dim). Schedule them like contracts, route violations to owners. Five reconciliations catch what five hundred column tests miss.
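The two example invariants can be sketched as plain Python functions. In practice the inputs would come from queries against the warehouse; here they are passed in directly. The helper names and the 1% relative tolerance are hypothetical choices.

```python
import math


def totals_reconcile(orders_total, finance_total, rel_tol=0.01):
    """True when the two totals agree within a relative tolerance."""
    return math.isclose(orders_total, finance_total, rel_tol=rel_tol)


def missing_from_dim(event_user_ids, dim_user_ids):
    """User ids seen in events that have no row in the user dimension."""
    return set(event_user_ids) - set(dim_user_ids)


# The mismatch from the incident list: $4.2M in orders vs $3.9M in finance.
orders_vs_finance_ok = totals_reconcile(4_200_000, 3_900_000)
# A small gap inside the tolerance still reconciles.
small_gap_ok = totals_reconcile(4_200_000, 4_195_000)
# Hypothetical ids: user 9 appears in events but not in the dimension.
orphans = missing_from_dim([1, 2, 3, 9], [1, 2, 3, 4])
```

Returning the set of orphaned ids, rather than a bare pass or fail, gives the owner of the violation something to act on immediately.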
The cultural failure mode
Teams add tests until the suite is slow, then mute the channel where failures land. Coverage theater. Fewer, owned, routed assertions (see our data-contract system) beat exhaustive unowned ones — an alert nobody acts on is technical debt with a notification sound. This layering is how our governance practice cut a Fortune 500 client's manual reconciliation by 40%: not more tests — the right three layers, owned.