Most Snowflake bills are 60–75% reducible without performance loss, and almost none of it requires migration. The waste concentrates in four places: warehouses that never suspend, warehouses sized for the worst query of the quarter, queries that scan instead of prune, and a bill that no one owns at line-item level.
This playbook is the audit order we run in production FinOps engagements — the same cost-engineering discipline behind Vipra's documented 62% TCO reduction ($125K/year saved) on a Redshift-to-BigQuery modernization. The mechanics transfer directly to Snowflake; the reference scenario used throughout (a $2.1M/year mid-size fintech estate) is labelled as such, with the modelled end-state at $550–700K.
Every step below produces a number before and after. If a step doesn't produce a number, it isn't engineering — it's vibes.
01 · Anatomy of a $2.1M Bill
Snowflake's pricing model is honest but unforgiving: you pay for warehouse-seconds whether those seconds do useful work or not. There is no idle alarm, no default budget, and the people writing queries never see the meter. The reference estate we use in workshops — a mid-size fintech at $2.1M/year — decomposes the way most unaudited estates do:
The proportions vary by estate; the categories almost never do. Two implications drive the whole playbook: first, the biggest wins are operational, not architectural — no replatforming required; second, the order matters, because idle and sizing fixes change the baseline against which query fixes are measured.
02 · The Audit Architecture: Metering Data as a Data Product
You cannot fix what you cannot decompose. Before touching a single warehouse, build the cost mart — Snowflake exposes everything needed in ACCOUNT_USAGE:
cost_mart — daily credit attribution (dbt model)WITH metering AS ( SELECT warehouse_name, DATE_TRUNC('day', start_time) AS usage_date, SUM(credits_used) AS credits FROM snowflake.account_usage.warehouse_metering_history WHERE start_time >= DATEADD('day', -90, CURRENT_DATE) GROUP BY 1, 2 ), tags AS ( -- query_tag convention: team:domain:job SELECT warehouse_name, SPLIT_PART(query_tag, ':', 1) AS team, DATE_TRUNC('day', start_time) AS usage_date, SUM(total_elapsed_time) AS active_ms FROM snowflake.account_usage.query_history WHERE start_time >= DATEADD('day', -90, CURRENT_DATE) GROUP BY 1, 2, 3 ) SELECT m.usage_date, m.warehouse_name, t.team, m.credits * RATIO_TO_REPORT(t.active_ms) OVER (PARTITION BY m.warehouse_name, m.usage_date) AS attributed_credits FROM metering m JOIN tags t USING (warehouse_name, usage_date)
Materialise this daily with dbt, join your credit price, and publish a Looker/Sigma dashboard before the first optimization ships. Two reasons: you need the before/after evidence per step, and — more powerfully — teams start fixing their own waste the week they can see it. We have watched the dashboard alone shave 8–12% before any policy landed.
WAREHOUSE_METERING_HISTORY against QUERY_HISTORY per hour — hours with credits but zero queries are pure burn. In the reference estate this single query explained $640K/year.03 · Step 1 — Kill Idle: Auto-Suspend Done Right
The default 10-minute auto-suspend is a subsidy you pay Snowflake. The reference settings we deploy:
warehouse DDL — suspend policy by workload classALTER WAREHOUSE BI_WH SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE; ALTER WAREHOUSE ELT_WH SET AUTO_SUSPEND = 60; -- orchestrator batches work ALTER WAREHOUSE ADHOC_WH SET AUTO_SUSPEND = 120; -- humans think between queries ALTER WAREHOUSE MONTH_END_WH SET AUTO_SUSPEND = 60 COMMENT = 'Resumed by Airflow on month-end calendar only';
The standard objection is cache loss: a suspended warehouse drops its local SSD cache, so the first queries after resume read from remote storage. Measure it instead of arguing about it — compare P95 dashboard latency across a two-week A/B of suspend settings. Our rule: if P95 degrades less than 15%, take the money. Result caching (24h, service-layer) is unaffected by suspension, which is why dashboards with repeating queries barely notice.
SELECT 1 jobs that defeat suspension to keep caches warm. They convert a measurable latency trade into an unmeasured 24/7 bill. Delete them; if a dashboard truly needs warm cache, schedule a targeted warm-up query 5 minutes before business hours.Reference impact: −$540K/year (idle burn from 40% → ~8%). This step is configuration, not code; it ships in a day and pays from the first hour.
04 · Step 2 — Rightsize: Warehouses by Workload Shape
Warehouse sizes double cost per tier; a query that fits in S runs on XL at 8× the spend for — frequently — indistinguishable wall-clock time, because small queries don't parallelise across the extra clusters. Split by workload shape, not by team:
| Workload shape | Right setup | Common mistake | Cost effect |
|---|---|---|---|
| Dashboard / BI bursts | S–M, multi-cluster 1–4, 60s suspend | L single-cluster "for safety" | −50–70% |
| Scheduled ELT / dbt | Sized to job profile, orchestrator-started | Always-on shared warehouse | −40–60% |
| Month-end heavy batch | XL on calendar schedule | XL all month | −90% of that line |
| Ad-hoc analyst SQL | XS–S + resource monitor caps | Whatever was open | −30–50% |
The diagnostic: for each warehouse, pull spill statistics from QUERY_HISTORY (bytes_spilled_to_local_storage, bytes_spilled_to_remote_storage). Zero spill at P95 means the warehouse is oversized — step down a tier and re-measure. Remote spill at P50 means undersized — step up, because spill is slower and more expensive than the bigger tier. Multi-cluster handles concurrency; size handles complexity. Most estates have them backwards.
spill diagnostic — is this warehouse the wrong size?SELECT warehouse_name, warehouse_size, APPROX_PERCENTILE(bytes_spilled_to_remote_storage, 0.5) AS p50_remote_spill, APPROX_PERCENTILE(bytes_spilled_to_local_storage, 0.95) AS p95_local_spill, COUNT(*) AS queries FROM snowflake.account_usage.query_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) AND warehouse_size IS NOT NULL GROUP BY 1, 2 ORDER BY p50_remote_spill DESC;
Reference impact: −$390K/year. Combined with Step 1, the estate is now at ~$1.17M — and every subsequent measurement happens against a stable, right-sized baseline.
05 · Step 3 — Make Queries Prune, Not Scan
Compute amplification usually traces to table design rather than query authorship. Rank by total credits, not single-run time — the query that runs 400 times a day at 40 seconds outweighs the hourly monster:
top offenders — by aggregate cost, not anecdoteSELECT query_parameterized_hash, ANY_VALUE(query_text) AS sample_text, COUNT(*) AS runs_30d, SUM(total_elapsed_time) / 1000 / 3600 AS total_hours, SUM(partitions_scanned) AS parts_scanned, SUM(partitions_total) AS parts_total, ROUND(SUM(partitions_scanned) / NULLIF(SUM(partitions_total),0) * 100, 1) AS scan_pct FROM snowflake.account_usage.query_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ORDER BY total_hours DESC LIMIT 50;
A scan_pct near 100 on a large table is the tell. The fixes, in payoff order: clustering keys on the predicate columns the business actually filters by (almost always event date + a tenant/account column); merge discipline — micro-batched MERGEs every two minutes churn micro-partitions and destroy natural clustering, so batch them to 15–30 minutes or use Snowpipe Streaming where freshness genuinely matters; and column hygiene — SELECT * in dbt staging models materialises columns nobody reads, and on wide tables that's real money in a columnar store.
AUTOMATIC_CLUSTERING_HISTORY after enabling — clustering on a high-churn table can cost more in background credits than it saves in scans. Cluster tables that are read-heavy, write-calm.Reference impact: −$280K/year, and the BI tier got faster — pruned queries are quicker queries, which buys political capital for the rest of the program.
06 · Step 4 — Query Acceleration & Search Optimization: Trade, Don't Default
Both features convert money into latency, and both are billed as serverless credits that bypass your warehouse discipline. That trade is sometimes excellent and sometimes a silent leak:
| Feature | Buy it when | Skip it when | Audit signal |
|---|---|---|---|
| Query Acceleration Service | Unpredictable scan-heavy bursts on a small warehouse (exec dashboards at quarter-end) | Workload is steady — a scheduled size bump is cheaper | QUERY_ACCELERATION_ELIGIBLE vs credits billed |
| Search Optimization Service | Needle-in-haystack point lookups on multi-TB tables (support tooling, fraud investigation) | A clustering key would serve the same predicates | SEARCH_OPTIMIZATION_HISTORY maintenance credits vs query savings |
The governance rule that makes this stick: features are enabled per-table via PR, and the PR description must contain the before/after credit delta. If nobody wrote the number down, it wasn't engineering. Quarterly, the cost mart flags any feature whose maintenance credits exceed its measured savings — in the reference estate, two of seven search-optimized tables failed that test and were reverted. Reference impact: −$95K/year.
07 · Step 5 — Attribution: Every Credit Gets an Owner
Steps 1–4 cut the bill; attribution keeps it cut. Three mechanisms, all cheap:
query tagging — enforced via dbt + session policy# dbt_project.yml — every dbt query carries its lineage query-comment: comment: "team:{{ env_var('DBT_TEAM') }}:job:{{ invocation_id }}" append: true -- Session-level for services; ALTER USER for humans ALTER USER looker_svc SET QUERY_TAG = 'team:bi:service:looker';
- Resource monitors with teeth: per-team monthly quotas, notify at 75%, suspend non-prod at 100%. Suspension sounds draconian until you watch a forgotten dev loop burn a weekend's budget — it happens in every estate, usually twice.
- Per-domain showback: the cost mart dashboard from Step 0, sliced by team tag, reviewed in each domain's ops meeting. Engineers fix what they can see; finance escalations become rare because nothing is a surprise.
- A budget line in CI: dbt CI runs annotate PRs with the projected credit delta of changed models (slim CI +
EXPLAINpartition estimates). Cost regression becomes a review comment, not a quarterly archaeology project.
This is the same governance mechanism from our Fortune 500 governance engagement (40% reconciliation reduction): make the invisible visible, assign an owner, automate the consequence. Reference impact: −$130K/year in prevented regression, compounding.
$2.1M → $567K/yr
TCO Reduction
Savings Delivered
For All Changes
08 · Lessons Learned: The Hard Truths
- The dashboard does more than the policy. Every estate we've audited started saving money the week showback shipped, before a single warehouse changed. Visibility is the intervention; policy is the backstop.
- Cache-loss fear is the biggest single source of waste. Teams keep warehouses hot all night to protect a 9am dashboard. Measure the actual resume penalty — it is almost always smaller than feared, and a scheduled warm-up beats a 14-hour idle burn every time.
- Rightsizing is not one-and-done. Workloads drift; the warehouse that was right in Q1 is wrong by Q3. The spill diagnostic runs monthly, automatically, and opens tickets — or the estate regresses to the mean within two quarters.
- Serverless features are where discipline goes to die. QAS, SOS, materialized view maintenance, and automatic clustering all bill outside warehouse budgets. If your attribution doesn't include serverless lines, your attribution is fiction.
- Never lead with query rewrites. Tuning queries on wrongly-sized, never-suspending warehouses measures nothing. Fix the substrate first; half the "slow queries" disappear on their own.
- Finance and engineering need the same dashboard. Separate views breed separate truths and quarterly blame exchanges. One mart, one dashboard, two audiences — disagreements become data questions instead of political ones.
09 · Key Takeaways for Practitioners
ACCOUNT_USAGE → dbt → dashboard, before any optimization. Evidence per step or it didn't happen.
Take the cache-loss trade if P95 degrades under 15%. Delete keep-alive jobs on sight.
Zero spill at P95 → size down. Remote spill at P50 → size up. Multi-cluster for concurrency, size for complexity.
Rank queries by total credits, fix the tables they hit, verify clustering credits don't eat the savings.
QAS/SOS enabled per-table, by PR, with the before/after delta in the description. Audit quarterly.
Tags + resource monitors + showback. Costs stay down when teams see their own line of the bill.
The order is the method: idle, size, pruning, features, attribution — each step stabilises the baseline for the next. Run it once and you get the 60–75%; institutionalise Steps 0 and 5 and you keep it. The companion Snowflake audit checklist is the self-service version; the BigQuery migration playbook documents the engagement whose numbers anchor this one.