Run this before your renewal negotiation, not after. The checklist takes one engineer roughly two days against ACCOUNT_USAGE, produces a findings document with dollar figures attached, and typically locates 20–40% recoverable spend on estates that have never had a FinOps pass.
Four sections, ordered by typical payoff: warehouse sizing and suspension (usually the biggest line), clustering antipatterns (the most invisible), serverless meters (the most forgotten), and per-query waste (the most political). Each item names the exact ACCOUNT_USAGE view and the signal to look for.
This checklist is the self-service version of Vipra's FinOps engagement — the discipline that delivered our documented 62% TCO reduction ($125K/year). The full remediation playbook, with the step-by-step fixes and the attribution machinery that keeps costs down, is the companion piece: The $2M Query.
01 · How to Run This Audit
Three rules make the difference between a findings document that changes the bill and one that decorates a wiki. Attach dollars to every finding — credits × your rate, monthly run-rate; "WH_ANALYTICS idles 61% of its credits ≈ $4,100/month" gets a meeting that "auto-suspend seems high" never will. Work from ACCOUNT_USAGE, not dashboards — the views lag up to 45 minutes but hold 365 days of history, and the audit needs the history. Two days, one engineer, read-only — the audit changes nothing; it produces the prioritised fix list. Separation of audit from remediation is what keeps it honest.
02 · The Audit Architecture: Where the Money Hides
03 · Section 1 — Warehouse Sizing and Suspension
Usually the biggest line, always the fastest fix. The four checks:
idle ratio — the single highest-value query in this articleWITH metered AS ( SELECT warehouse_name, SUM(credits_used) AS credits FROM snowflake.account_usage.warehouse_metering_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ), active AS ( SELECT warehouse_name, SUM(total_elapsed_time) / 1000 / 3600 AS active_hours FROM snowflake.account_usage.query_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ) SELECT m.warehouse_name, m.credits, ROUND(a.active_hours, 1) AS active_hours, ROUND(m.credits - a.active_hours, 1) AS idle_credits, -- ≈ pure waste ROUND((m.credits - a.active_hours) / m.credits * 100) AS idle_pct FROM metered m LEFT JOIN active a USING (warehouse_name) ORDER BY idle_credits DESC;
| Check | Signal | Fix |
|---|---|---|
| Auto-suspend over 60s | Idle ratio above ~15% on bursty warehouses | 60s for BI; immediate-after-job for ETL; measure the P95 cache trade |
| One-size-fits-all sizing | Sub-2-second queries profiled on L/XL warehouses | Split by workload shape: XS/S for BI, sized-up only for transform windows |
| Multi-cluster maxed | MAX_CLUSTER_COUNT sized for Black Friday, running all year | Economy scaling policy + realistic maximums per season |
| No resource monitors | Non-prod warehouses without quotas | Monitors with notify-75% / suspend-100% on every non-prod warehouse |
The sizing diagnostic that settles arguments: spill statistics from QUERY_HISTORY. Zero remote spill at P95 → the warehouse is oversized, step down and re-measure; remote spill at P50 → undersized, step up (spill is slower and dearer than the bigger tier). Multi-cluster handles concurrency; size handles complexity — most estates have them inverted, and the spill numbers end the debate.
04 · Section 2 — Clustering Key Antipatterns
Clustering spends background credits to maintain order. The audit asks one question per clustered table: do the reclustering credits buy more than the scans they save?
clustering ROI — credits spent vs pruning earnedSELECT table_name, SUM(credits_used) AS recluster_credits_30d FROM snowflake.account_usage.automatic_clustering_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ORDER BY 2 DESC; -- join against QUERY_HISTORY partition-scan ratios on the same tables: -- high recluster credits + scan_pct still ~100 → paying to organise -- a bookshelf nobody reads in order. Drop the key.
The three antipatterns, in descending frequency: clustering small tables (under ~1TB, natural micro-partition pruning usually suffices — reclustering a churning small table burns credits to optimise nothing); clustering on high-cardinality columns (UUIDs scatter every insert across partitions, so reclustering runs forever — cluster on what queries actually filter by, usually date plus one low-cardinality dimension); and clustering full-scan workloads (if the nightly job aggregates everything anyway, pruning buys nothing — verify scan ratios before paying for order).
05 · Section 3 — The Quiet Serverless Meters
Every serverless feature is valuable when deliberate and expensive when forgotten — and all of them bill outside warehouse budgets, which is why warehouse-focused audits miss them:
| Meter | Forgotten-mode failure | Audit query |
|---|---|---|
| Search Optimization | Enabled for a 2024 investigation tool, still metering on every write | SEARCH_OPTIMIZATION_HISTORY vs point-lookup query volume |
| Materialized views | High-churn base table → maintenance exceeds what the MV saves | MATERIALIZED_VIEW_REFRESH_HISTORY vs MV query hits |
| Query Acceleration | Masking a sizing problem on one warehouse, permanently | QUERY_ACCELERATION_ELIGIBLE vs credits billed |
| Tasks & Snowpipe | Error-retry loops metering around the clock | TASK_HISTORY failures; PIPE_USAGE_HISTORY spikes |
Reconcile monthly in METERING_DAILY_HISTORY by service type — the one view where every serverless line appears side by side. The governance fix that makes this stick (from the full playbook): features are enabled per-table by PR, with the before/after credit delta recorded in the description. If nobody wrote the number down, it wasn't engineering.
06 · Section 4 — Per-Query Waste and Role Hygiene
Fixes ranked by effort-to-payoff: clustering keys on the predicate columns the top offenders filter by (one key routinely beats a quarter of query tuning); merge-cadence batching where micro-batched MERGEs churn partitions; column pruning in staging models; and the political one — ad-hoc traffic routed to small capped warehouses by role, which is governance wearing a cost costume. Query tags per team turn next month's audit from forensic reconstruction into a GROUP BY.
07 · Production Evidence: The Audit in Practice
This checklist is the opening move of Vipra's FinOps engagements — the documented result being a 62% TCO reduction worth $125K/year on our cloud modernization engagement (Redshift→BigQuery, where the same audit discipline applied to a different engine). The pattern across estates we've audited: Section 1 findings fund the remediation of everything else within the first month; Section 3 contains the most embarrassing line items (a forgotten Search Optimization on a deprecated table is the genre's classic); and the findings document changes renewal-negotiation leverage materially — walking in with a quantified 30% self-recovery plan changes what the sales team offers. The full remediation sequence, with the warehouse-splitting playbook and the attribution machinery that keeps the bill down, is in The $2M Query.
Typical First Audit
Documented Engagement
Read-Only, Full Audit
Its Dollar Figure
08 · Lessons Learned: The Hard Truths
- Findings without dollar figures don't get meetings. The identical technical finding, priced, moved from backlog to done in a week. Pricing the waste is most of the audit's value.
- The idle query embarrasses everyone the first time. No estate we've audited believed its idle ratio before seeing it. Run it first; it buys credibility for everything after.
- Clustering waste hides because nobody owns it. Warehouse spend has a team's name on it; AUTOMATIC_CLUSTERING_HISTORY belongs to nobody. Assign the storage-layout budget an owner or audit it forever.
- Serverless features outlive their reasons. Every estate carries at least one meter whose justification left the company two reorgs ago. The monthly METERING_DAILY_HISTORY reconciliation is the only durable defense.
- Renewal timing is leverage. The audit run a quarter before renewal — with the self-recovery plan quantified — changed negotiation outcomes in ways the same audit run after signing never could.
- The audit decays in two quarters without attribution. Tags, monitors, and showback (the playbook's Section 5) are what keep the recovered 30% recovered. An audit without follow-through is a very detailed photograph of a leak.
09 · Key Takeaways for Practitioners
Credits × rate × monthly run-rate, attached to each item. Dollar figures get meetings; observations get wikis.
The metered-vs-active gap is pure waste and the fastest fix. Run that query before any other.
Recluster spend vs pruning gained, per table. Small tables, UUID keys, and full-scan workloads: drop the key.
METERING_DAILY_HISTORY monthly, by service type. Every serverless feature needs a living justification.
Aggregate cost, not anecdote. Fix the tables the top 50 hit; route ad-hoc to capped warehouses.
A quantified self-recovery plan is negotiation leverage. After signing, it's just hygiene.
The full remediation playbook — warehouse splitting, pruning fixes, feature governance, and the attribution that keeps costs down — is The $2M Query. The documented engagement behind the numbers: Cloud FinOps & BigQuery Modernization.