
dbt at Scale:
Managing 500+ Models Without Losing Your Mind

A banking data platform case study — S3 → GCS → BigQuery + dbt in production. How a 12-person team cut pipeline runtime from 6.5 hours to 87 minutes, managed 560+ models across a domain-driven architecture, and built production-grade dbt governance for a regulated bank.

Industry: Retail Banking
Team Size: 12 Engineers
Data Volume: 620 TB (BigQuery)
dbt Models: 560+
Runtime Reduction: 6.5h → 87 min
Published: April 2024
Headline results: pipeline runtime 87 min (down from 6.5h) · 560+ dbt models in production · 620 TB of BigQuery data under management · 63% compute cost reduction.
Executive Summary

In 2022, our data engineering team at a large retail banking organisation inherited what can only be described as a functioning disaster: 340 dbt models across two repositories, zero naming conventions, a monolithic DAG taking 6.5 hours to complete, and 11 engineers constantly stepping on each other's work. Production incidents linked to bad transformations occurred roughly twice a month.

Eighteen months later: 560+ models across a clean domain-driven structure, end-to-end pipeline runtime reduced to 87 minutes, zero naming inconsistencies, and a CI/CD gate catching data quality failures before they reached risk or compliance teams.

This article is not about dbt features. It is about the specific architectural, operational, and governance decisions that allowed a 12-person team to confidently run dbt at scale across 500+ TB of banking data — including customer PII, AML transaction feeds, internet banking event streams, and regulatory datasets.

Everything here is drawn from real production experience. The numbers are real. The failures are real. The solutions can be implemented starting this week.

01 · Architecture Overview: Data Flow at Scale

Source Systems

Our data originated from four distinct source categories. Core banking systems running on a legacy Oracle platform emitted nightly full extracts as compressed CSVs, delivered to AWS S3 via a managed file transfer service. Event streams — internet banking clickstreams, payment initiation events, and fraud signals — were ingested via Kafka topics published by application teams at 4.2 million events per day. Regulatory and compliance feeds — KYC verification results, AML screening outputs, and sanctions list match data — were pushed by third-party vendors in SFTP batches. Internal operational systems — HR, branch performance, product pricing — arrived as scheduled API exports.

Ingestion Layer: S3 → GCS

We used a hybrid ingestion approach. Core banking extracts landed in AWS S3, where a lightweight Python-based Lambda validated file integrity and triggered a cross-cloud transfer job (using Google Cloud Storage Transfer Service) to move validated files into GCS landing buckets. Kafka streams were consumed by Dataflow pipelines that landed raw JSON payloads into GCS in 15-minute micro-batch windows.

⚠️ This dual-cloud reality was not a choice — it was the product of a cloud migration still in progress when the data platform was being built. Learning to work with it rather than waiting for it to resolve was one of our first hard lessons.

Transformation Layer: GCS → BigQuery → dbt

From GCS, data was loaded into BigQuery raw datasets using scheduled BigQuery Data Transfer jobs for bulk files and streaming inserts for near-real-time event data. dbt sat entirely within BigQuery, structured across five clearly defined data zones:

raw_*
Exact copies of source data. No transformations. Schema enforced. Not touched by dbt.
staging_*
Cleaned, typed, renamed. One-to-one with source tables. Views.
intermediate_*
Business logic joins. Not exposed to consumers. Ephemeral.
mart_*
Aggregated, consumption-ready. Owned by domain teams. Incremental tables.
reporting_*
Finance/regulatory-specific views. Tightly access-controlled.

dbt models only existed from staging_* onwards. Raw datasets were entirely managed by ingestion pipelines, not dbt. This boundary was sacred — and enforcing it saved us from numerous incidents where someone tried to "just quickly fix" something in the raw layer.

Data Volumes

At peak, our BigQuery environment managed 620 TB across all datasets (raw + transformed), approximately 1.1 billion transaction rows in the core transactions fact table (partitioned by transaction date, 36 months hot retention), 14 million customer records with full history, and 230 GB of daily incremental data across all sources.

02 · The Breaking Point: When dbt Doesn't Scale by Default

By the time our team recognised a scaling problem, the symptoms were impossible to ignore.

DAG Explosion

With 340 models in a single project, the dbt DAG had become visually and operationally unmanageable. Engineers would open the lineage graph and immediately close it. Nobody had a mental model of how data flowed from source to mart anymore.

The deeper problem was phantom dependencies — models referencing upstream sources through multiple unnecessary intermediate steps, creating chains of 8–10 hops where 3 would have sufficed. One mart model had 14 upstream dependencies, 6 of which were only there because someone had copied and extended an older model without pruning the lineage.
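One way to make phantom dependencies visible is to measure ref-chain depth directly from dbt's compiled artifact. The sketch below is illustrative (the function name and the hop threshold are ours), but the `target/manifest.json` structure it reads — a `nodes` map whose entries list their parents under `depends_on.nodes` — is standard dbt:

```python
import json
from functools import lru_cache

def longest_ref_chains(manifest_path, threshold=5):
    """Report models whose longest upstream ref chain exceeds `threshold` hops.

    Reads dbt's target/manifest.json; model-to-model edges live in
    node["depends_on"]["nodes"].
    """
    with open(manifest_path) as f:
        nodes = json.load(f)["nodes"]

    # Keep only model nodes and their model parents.
    parents = {
        name: [p for p in node.get("depends_on", {}).get("nodes", []) if p in nodes]
        for name, node in nodes.items()
        if node.get("resource_type") == "model"
    }

    @lru_cache(maxsize=None)
    def depth(name):
        ps = parents.get(name, [])
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    # Deepest chains first — these are the candidates for pruning.
    return sorted(
        ((name, depth(name)) for name in parents if depth(name) > threshold),
        key=lambda item: -item[1],
    )
```

Running something like this weekly would have surfaced the 8–10-hop chains long before anyone opened the lineage graph.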

Runtime: 6 Hours and Climbing

Our full refresh run was taking 6 hours 20 minutes. For a regulated bank, any incident discovered after 6pm could not be resolved and validated before the next business day. The root cause was not BigQuery performance — it was dbt execution order and model design. We had 80+ models running as full-table scans on tables with hundreds of millions of rows. No partitioning. No clustering. Sequential execution of models that could have run in parallel.

Developer Conflicts

With 11 engineers on a single dbt project in a monorepo, merge conflicts were a weekly event. Two engineers working on adjacent domains would modify the same schema.yml file, and reconciling those conflicts often introduced subtle errors not caught until production. We had no enforced ownership model. Any engineer could modify any model. This was philosophically nice ("shared ownership!") and operationally catastrophic.

Inconsistent Naming and Modeling Patterns

A brief audit revealed: 4 different conventions for staging model names (stg_, staging_, s_, and unprefixed), 3 different approaches to handling null values in the same fact table, 2 models named customer_metrics in different folders producing different numbers, and no consistent approach to grain documentation.

🔴 When two models with the same name produce different numbers, your analytics team stops trusting either of them. That trust, once lost, takes months to rebuild.

Debugging and Lineage Tracking

When a regulatory report produced incorrect AML risk scores, tracing the issue through our dbt lineage took 4.5 hours. The problem was a type casting error in a staging model introduced 6 weeks earlier, buried under 9 downstream dependencies. We had no automated data quality checks between layers. By the time the error surfaced in the reporting layer, it had propagated through dozens of models.

03 · Model Organisation Strategies That Actually Work

Layered Architecture with Hard Zone Boundaries

The first structural decision was to enforce hard boundaries between data zones — not as a convention, but as a technical constraint baked into dbt_project.yml:

dbt_project.yml
```yaml
models:
  banking_platform:
    staging:
      +materialized: view
      +tags: ["staging"]
    intermediate:
      +materialized: ephemeral
      +tags: ["intermediate"]
    marts:
      +materialized: table
      +tags: ["mart"]
      payments:
        +materialized: incremental
      accounts:
        +materialized: incremental
    reporting:
      +materialized: view
      +tags: ["reporting", "regulated"]
      +meta:
        access_tier: "restricted"
```

Staging models were views — cheap to run, easy to debug, always reflecting current raw data. Intermediate models were ephemeral — they exist in the DAG but not as physical objects in BigQuery, reducing storage costs significantly. Marts were incremental tables. No mart model was permitted to reference another mart model. This single rule eliminated an entire class of dependency hell.
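Because the mart-refs-mart ban was a technical constraint rather than a convention, a CI check can verify it mechanically. A minimal sketch, assuming marts are identified by the `mart` tag applied in dbt_project.yml (the function name is ours, not a real script from the project):

```python
import json

def find_mart_to_mart_refs(manifest_path):
    """Return (child, parent) pairs where one mart model refs another.

    Marts are identified by the "mart" tag set in dbt_project.yml;
    dependencies are read from dbt's target/manifest.json.
    """
    with open(manifest_path) as f:
        nodes = json.load(f)["nodes"]

    marts = {
        name for name, node in nodes.items()
        if node.get("resource_type") == "model" and "mart" in node.get("tags", [])
    }
    return [
        (child, parent)
        for child in marts
        for parent in nodes[child].get("depends_on", {}).get("nodes", [])
        if parent in marts  # a mart depending on a mart is a violation
    ]
```

An empty return value means the rule holds; anything else fails the pipeline.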

Domain-Driven Structure

We reorganised from a flat technical structure into a domain-driven folder hierarchy, with each domain having a designated owning team. Pull requests touching /compliance/ required approval from both an engineer and a compliance data steward — enforced via GitHub CODEOWNERS.

Domain-Driven Folder Structure
```
models/
├── staging/
│   ├── core_banking/      # stg_accounts, stg_transactions
│   ├── internet_banking/  # stg_ib_sessions, stg_ib_events
│   └── compliance/        # stg_kyc_verifications, stg_aml_alerts
├── intermediate/
│   ├── payments/          # int_payment_enriched
│   └── customers/         # int_customer_360
└── marts/
    ├── payments/          # fct_daily_payments, dim_payment_channel
    ├── accounts/          # fct_daily_balances, dim_account
    └── compliance/        # fct_aml_risk_scores, fct_regulatory_positions
```

Naming Conventions — Enforced, Not Suggested

We adopted a strict naming convention and wrote a custom dbt macro that would fail the run if any model violated it:

| Layer | Prefix | Example |
|---|---|---|
| Staging | stg_ | stg_transactions |
| Intermediate | int_ | int_customer_metrics |
| Fact tables | fct_ | fct_daily_balances |
| Dimension tables | dim_ | dim_account |
| Reporting | rpt_ | rpt_aml_monthly_summary |

A model named outside this convention would fail in CI before reaching review. The first week this was enforced, three PRs were rejected. After that, nobody forgot.
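The check itself needs nothing more than a walk over the models directory. A minimal sketch of such a validator (the prefix map mirrors the convention table above; the actual script we ran in CI is not reproduced here):

```python
from pathlib import Path

# Allowed prefixes per top-level models/ subfolder, per the naming convention.
PREFIX_RULES = {
    "staging": ("stg_",),
    "intermediate": ("int_",),
    "marts": ("fct_", "dim_"),
    "reporting": ("rpt_",),
}

def invalid_models(models_dir):
    """Return paths of model files whose name violates their layer's prefix rule."""
    base = Path(models_dir)
    bad = []
    for sql_file in base.rglob("*.sql"):
        layer = sql_file.relative_to(base).parts[0]
        prefixes = PREFIX_RULES.get(layer)
        if prefixes and not sql_file.stem.startswith(prefixes):
            bad.append(str(sql_file))
    return bad
```

A non-empty result exits the CI job with a failure, which is what rejected those first three PRs.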

04 · DAG Performance: From 6 Hours to 87 Minutes

Incremental Models for Everything Over 10M Rows

The single highest-impact change: converting 23 full-refresh fact models to incremental. Our transactions fact table alone accounted for 2 hours of the original 6-hour runtime.

fct_daily_balances.sql — Incremental Model Config
```sql
{{ config(
    materialized='incremental',
    unique_key='balance_snapshot_id',
    incremental_strategy='merge',
    partition_by={
        "field": "snapshot_date",
        "data_type": "date",
        "granularity": "day"
    },
    cluster_by=['account_id', 'product_type'],
    on_schema_change='sync_all_columns'
) }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['account_id', 'snapshot_date']) }} AS balance_snapshot_id,
    account_id,
    snapshot_date,
    closing_balance,
    available_balance,
    product_type,
    currency_code
FROM {{ ref('int_account_daily_position') }}

{% if is_incremental() %}
WHERE snapshot_date >= (SELECT MAX(snapshot_date) FROM {{ this }}) - INTERVAL 3 DAY
{% endif %}
```

The 3-day lookback window handled late-arriving corrections from the core banking system without requiring a full refresh.
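The mechanics are worth spelling out: with `incremental_strategy='merge'`, any reprocessed row whose `unique_key` already exists is updated in place, so re-scanning the last 3 days is idempotent — a late correction simply overwrites the earlier snapshot. A small Python simulation of that upsert behaviour (illustrative only; BigQuery executes this as a MERGE statement):

```python
from datetime import date, timedelta

def merge_incremental(target, new_rows, key="balance_snapshot_id"):
    """Simulate dbt's incremental 'merge' strategy: rows with a matching
    unique_key are updated, unmatched rows are inserted."""
    merged = {row[key]: row for row in target}
    for row in new_rows:
        merged[row[key]] = row  # update-or-insert on the unique key
    return list(merged.values())

def lookback_window(max_loaded: date, days: int = 3):
    """Reprocess everything on or after MAX(snapshot_date) minus the lookback,
    so late-arriving corrections inside the window are picked up."""
    return max_loaded - timedelta(days=days)
```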

Partitioning and Clustering Strategy

We applied a consistent BigQuery optimisation strategy: partition by date on any model with a time dimension; cluster by the most common filter columns (account_id, customer_id, product_type); avoid clustering on high-cardinality string columns. Before optimisation, a query scanning fct_daily_payments for a single month processed 340 GB. After partition pruning and clustering: 4.2 GB.

Parallelism via Thread Tuning

The default dbt threads setting of 4 was insufficient. BigQuery handles concurrency well. We tuned to 16 threads per environment — and with a properly pruned DAG, models previously executing sequentially ran in large parallel batches, reducing runtime by approximately 40 minutes alone.

Before vs. After

| Metric | Before | After |
|---|---|---|
| Full pipeline runtime | 6h 20min | 87 min |
| Models on full refresh | 340 (all) | 47 (dimensions + lookups only) |
| BigQuery slot utilisation peak | 94% | 61% |
| Monthly BigQuery compute cost | $8,400 | $3,100 |
| Data quality test coverage | 12% of models | 89% of models |

05 · Refactoring at Scale with dbt-osmosis

By 300 models, our schema.yml files had succumbed to entropy. Some models had no documentation at all; others carried stale column descriptions copied from tables that had long since been refactored away. dbt-osmosis let us automate the bulk of a remediation effort that would otherwise have taken weeks.

What dbt-osmosis Did For Us

Schema propagation was the killer feature. When a staging model added a new column — say, risk_band added to stg_accounts — dbt-osmosis automatically propagated the column description down to every intermediate and mart model that referenced it through ref(). Previously this required manually editing 6–8 schema.yml files per new column.

dbt-osmosis — Schema Propagation
```shell
# Sync column descriptions from upstream to all downstream models
dbt-osmosis yaml refactor --project-dir . --profiles-dir .

# Organise schema files to match the new folder structure
dbt-osmosis yaml organize --project-dir . --profiles-dir .
```

Legacy model migration was our most painful project. We had 80 models built before the naming convention existed, sitting in a flat /models/legacy/ folder. dbt-osmosis handled the schema.yml restructuring for the migration, saving approximately 3 weeks of manual work.

⚠️ One real caveat: dbt-osmosis is powerful but not magical. It will propagate whatever description exists upstream — including wrong ones. Before running a mass propagation, audit your staging model descriptions first. Propagating bad documentation at scale is worse than having no documentation at all.
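That pre-propagation audit can itself be automated. A sketch that lists staging columns with blank descriptions by reading `target/manifest.json` (the `staging` tag matches our project config; the function name is ours):

```python
import json

def undocumented_staging_columns(manifest_path):
    """List (model, column) pairs in staging models whose description is blank,
    so gaps are fixed upstream before dbt-osmosis propagates them downstream."""
    with open(manifest_path) as f:
        nodes = json.load(f)["nodes"]
    gaps = []
    for name, node in nodes.items():
        if node.get("resource_type") != "model" or "staging" not in node.get("tags", []):
            continue
        for col, meta in node.get("columns", {}).items():
            if not meta.get("description", "").strip():
                gaps.append((name, col))
    return gaps
```

Run this first, fix what it reports, then propagate.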

06 · Enforcing Standards Across a Team of 12

CI/CD Pipeline Structure

Every dbt model change went through a four-stage CI pipeline before merging to main:

.github/workflows/dbt_ci.yml
```yaml
name: dbt CI
on: pull_request

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - name: Lint and naming check
        run: |
          dbt debug
          python scripts/validate_model_names.py  # Custom prefix enforcement
      - name: Compile and parse
        run: dbt compile --select state:modified+
      - name: Test on modified models
        run: |
          dbt run --select state:modified+ --target ci
          dbt test --select state:modified+ --target ci
      - name: Documentation coverage check
        run: python scripts/check_doc_coverage.py --min-coverage 80
```

The state:modified+ selector was critical — it ran tests not only on the changed model but on all downstream models. A breaking change to stg_transactions would trigger tests on every mart and report depending on it.

Automated Testing Strategy — Three Tiers

Tier 1 — Always run (every model, no exceptions): not_null and unique on all primary keys; accepted_values on all categorical columns with known enumerations.

Tier 2 — Run on all mart and reporting models: row count assertions against previous run (±15% threshold triggers alert, not failure); referential integrity between fact and dimension tables; custom freshness checks on time-partitioned tables.

Tier 3 — Run on compliance and regulatory models only: balance reconciliation checks (total in dbt mart must match source control totals ±0.01%); record completeness checks. Tier 3 failures blocked deployment completely. No human override.
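The Tier 2 and Tier 3 thresholds are simple arithmetic, which is exactly why they are cheap to automate. A sketch of both checks (function names are ours; the ±15% and ±0.01% thresholds are the ones quoted above):

```python
def row_count_drift(previous: int, current: int, threshold: float = 0.15):
    """Tier 2: return (drifted, pct_change). A change beyond ±15% raises an
    alert rather than failing the run outright."""
    if previous == 0:
        return current != 0, float("inf") if current else 0.0
    pct = (current - previous) / previous
    return abs(pct) > threshold, pct

def balances_reconcile(mart_total: float, source_total: float,
                       tol_pct: float = 0.0001):
    """Tier 3: the mart total must match the source control total within
    ±0.01%. A False here blocks deployment — no human override."""
    return abs(mart_total - source_total) <= abs(source_total) * tol_pct
```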

07 · Handling PII and Compliance in a Banking Environment

In banking, you cannot treat PII as an afterthought. It has to be a first-class design constraint.

Column-Level Security in BigQuery

We used BigQuery's column-level security features, managed through dbt model configuration and Terraform-provisioned policy tags:

stg_customers.sql — PII Masking Macro
```sql
SELECT
    customer_id,                                           -- not PII
    {{ mask_pii('first_name') }}           AS first_name,  -- masked in non-prod
    {{ mask_pii('last_name') }}            AS last_name,
    {{ mask_pii('date_of_birth') }}        AS date_of_birth,
    {{ mask_pii('national_id_number') }}   AS national_id_number,
    {{ hash_identifier('email_address') }} AS email_hash,  -- hashed, original removed
    account_open_date,
    customer_segment
FROM {{ source('core_banking', 'raw_customers') }}
```

```sql
-- macros/mask_pii.sql
{% macro mask_pii(column_name) %}
    {% if target.name == 'prod' %}
        {{ column_name }}
    {% else %}
        SHA256(CAST({{ column_name }} AS STRING))
    {% endif %}
{% endmacro %}
```
🔒 Production data was never replicated to developer environments. CI ran against a synthetic dataset generated from production schemas but with completely fabricated values. This was a firm rule with no exceptions.
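One property of the non-prod branch deserves emphasis: because the mask is a plain SHA256 over the string value, it is deterministic — the same identifier masks to the same digest in every table, so joins on masked keys still line up in developer environments. A minimal Python illustration of that property (the `sha256_mask` helper name is ours, mirroring what the macro emits):

```python
import hashlib

def sha256_mask(value):
    """Deterministic mask mirroring the macro's non-prod branch:
    SHA256 over the stringified value. Same input, same digest,
    so masked keys remain joinable across tables."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
```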

Role-Based Access Control

| Tier | Datasets Accessible | Teams |
|---|---|---|
| Standard | mart_*, reporting_public_* | Analytics, Product |
| Elevated | + staging_*, intermediate_* | Data Engineering |
| Restricted | + reporting_compliance_*, reporting_regulatory_* | Compliance, Risk, Audit |

Restricted tier access required a formal access request, manager approval, and quarterly review — implemented as BigQuery IAM bindings managed in Terraform, not ad-hoc grants.

Audit Trails

Every model in the reporting_regulatory_* dataset included an automatically appended audit metadata block via a post-hook that wrote run metadata (model name, run timestamp, row count, dbt invocation ID) to a dedicated audit.model_run_log table. This gave our internal audit team the ability to reconstruct exactly which data transformation produced any specific regulatory report on any specific date.
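Our implementation wrote these rows from a post-hook during the run, but the same audit record can be assembled after the fact from dbt's `target/run_results.json` artifact, which carries the invocation id, run timestamp, and per-model rows affected. A hedged sketch (the function name and the `rpt_` filter are ours; field names follow the run_results artifact):

```python
import json

def audit_rows(run_results_path, model_prefix="rpt_"):
    """Build audit-log rows (model, run timestamp, rows affected, invocation id)
    from dbt's target/run_results.json for regulated reporting models."""
    with open(run_results_path) as f:
        artifact = json.load(f)
    meta = artifact.get("metadata", {})
    rows = []
    for result in artifact.get("results", []):
        model_name = result.get("unique_id", "").split(".")[-1]
        if not model_name.startswith(model_prefix):
            continue  # only regulated reporting models are audited here
        rows.append({
            "model_name": result["unique_id"],
            "run_timestamp": meta.get("generated_at"),
            "row_count": result.get("adapter_response", {}).get("rows_affected"),
            "invocation_id": meta.get("invocation_id"),
        })
    return rows
```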

08 · Lessons Learned: The Hard Truths

  • We waited too long to enforce naming conventions. At 100 models, a messy naming convention is an annoyance. At 300 models, it is a productivity tax on every engineer every day. Establish naming rules in week one. Automate enforcement immediately.
  • Ephemeral models are underused. Most teams default to views or tables for intermediate logic. Ephemeral models eliminate intermediate physical tables entirely, reducing storage costs and simplifying the lineage graph. We converted 60% of our intermediate models to ephemeral, and the DAG became dramatically cleaner overnight.
  • The monorepo was wrong for us. For the first 200 models, a single dbt project felt like unity. After 300, it felt like a bottleneck. Splitting into three domain projects — transactional, compliance, and customer — with a shared macros package reduced merge conflicts by approximately 80%.
  • Synthetic data for CI is not optional in banking. Running CI tests against real customer data — even in a "dev" dataset — is a data governance risk. Building a synthetic data generator was a 3-week investment that paid back in audit confidence within the first quarter.
  • dbt alone is not an observability solution. dbt tests tell you if the data that arrived passes your rules. They do not tell you if the data you expected to arrive, arrived. We added a separate pipeline health monitoring layer (Python + BigQuery metadata tables) to close this gap.
  • Invest in dbt macros early. We wrote over 40 custom macros — PII masking, audit logging, standard test generators, naming validators. The upfront investment in macro quality paid compound returns as the project grew.
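The observability gap named above — dbt validates the data that arrived, not whether expected data arrived — is worth a concrete sketch. Something in the spirit of our pipeline health layer (names, cadence, and structure are ours for illustration):

```python
from datetime import datetime, timedelta

def missing_or_stale(expected_sources, last_arrival, now, max_age_hours=26):
    """Flag sources that never arrived, or whose latest landed file/partition
    is older than the expected daily cadence (plus slack)."""
    problems = []
    for source in expected_sources:
        seen = last_arrival.get(source)
        if seen is None:
            problems.append((source, "never arrived"))
        elif now - seen > timedelta(hours=max_age_hours):
            problems.append((source, f"stale since {seen.isoformat()}"))
    return problems
```

In practice `last_arrival` came from BigQuery metadata tables populated by the ingestion pipelines; any non-empty result paged the on-call engineer before dbt even ran.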

09 · Key Takeaways for Practitioners

📐
Enforce naming at 50 models

50 models feels early. It is exactly the right time. Every month you delay is compounding technical debt.

🚧
Make zone boundaries technical constraints

If a mart model can reference another mart model, eventually it will. Block it in config, not convention.

⚡
Convert large tables to incremental immediately

Full refreshes on 100M+ row tables in a daily pipeline are an architectural smell, not a badge of honour.

🗃️
Partition and cluster everything in BigQuery

The performance and cost benefits are too significant to defer. 340 GB → 4.2 GB on the same query.

🔬
Use state:modified+ in CI

Testing only changed models and their downstream dependencies makes CI fast enough to actually be used.

🔒
Synthetic data for non-prod is non-negotiable

In regulated industries, build it once, maintain it forever. Audit confidence is worth the 3-week investment.

🏷️
Assign domain ownership with CODEOWNERS

Shared ownership without defined accountability is no ownership at all. GitHub CODEOWNERS enforces this automatically.

🧬
Use dbt-osmosis proactively

Run it regularly, not as a rescue tool. Keep schema.yml files aligned before entropy sets in.

— · Closing Thoughts

Managing dbt at scale in a banking environment is fundamentally a people and process problem that technology can only partially solve. The architecture described here — layered zones, domain-driven structure, incremental models, CI/CD gates, PII macros, tiered testing — is a framework that worked for our context. Your context will differ.

What will not differ: the cost of addressing these problems at 300 models versus 100 models is roughly 5x. Every month you delay establishing structure is technical debt accumulating at compound interest.

The most valuable thing our team did was not a tool or a dbt feature. It was spending two days with all 12 engineers aligning on principles — what models should be named, who owned what, what a test failure meant — before writing a single line of SQL.

Make those decisions early. Your future self — debugging a pipeline at 11pm before a regulatory submission — will be grateful.

Principal Data Architect
Vipra Software Private Limited · Data Platform Practice
Extensive experience building large-scale data platforms in regulated financial services environments. Work spans real-time ingestion, cloud-native transformation with dbt and BigQuery, and platform governance at enterprise scale across AWS, GCP, and Azure.