Vipra Software Articles Snowflake Audit Checklist
Snowflake FinOps ACCOUNT_USAGE Audit Clustering Serverless Meters

The Hidden Cost of Your Snowflake Warehouse:
A Principal Engineer's Audit Checklist

Most Snowflake bills hide 20–40% recoverable waste in four places: warehouses idling on suspend timers, clustering keys on tables that don't need them, serverless features quietly metering, and full scans your role hierarchy lets anyone run. The self-service checklist we run in FinOps engagements — every item with its ACCOUNT_USAGE query.

Discipline
FinOps / Self-Service Audit
Recoverable Waste
20–40% typical
Audit Sections
4 + governance
Vipra Proven
62% TCO · $125K/yr
Source of Truth
ACCOUNT_USAGE
Published
June 2026
Executive Summary

Run this before your renewal negotiation, not after. The checklist takes one engineer roughly two days against ACCOUNT_USAGE, produces a findings document with dollar figures attached, and typically locates 20–40% recoverable spend on estates that have never had a FinOps pass.

Four sections, ordered by typical payoff: warehouse sizing and suspension (usually the biggest line), clustering antipatterns (the most invisible), serverless meters (the most forgotten), and per-query waste (the most political). Each item names the exact ACCOUNT_USAGE view and the signal to look for.

This checklist is the self-service version of Vipra's FinOps engagement — the discipline that delivered our documented 62% TCO reduction ($125K/year). The full remediation playbook, with the step-by-step fixes and the attribution machinery that keeps costs down, is the companion piece: The $2M Query.

01 · How to Run This Audit

Three rules make the difference between a findings document that changes the bill and one that decorates a wiki. Attach dollars to every finding — credits × your rate, monthly run-rate; "WH_ANALYTICS idles 61% of its credits ≈ $4,100/month" gets a meeting that "auto-suspend seems high" never will. Work from ACCOUNT_USAGE, not dashboards — the views lag up to 45 minutes but hold 365 days of history, and the audit needs the history. Two days, one engineer, read-only — the audit changes nothing; it produces the prioritised fix list. Separation of audit from remediation is what keeps it honest.

02 · The Audit Architecture: Where the Money Hides

§1 idle+size
Warehouse layer. Suspend timers, size-to-workload mismatch, multi-cluster maximums, missing resource monitors. Typically 40–60% of findings value.
§2 clustering
Storage layout layer. Reclustering credits on tables that don't earn them — small tables, UUID keys, full-scan workloads. Invisible on every dashboard.
§3 serverless
Feature layer. Search Optimization, materialized views, QAS, tasks, Snowpipe — each metering outside warehouse budgets, each enabled once and forgotten.
§4 queries
Workload layer. Top-50 by aggregate credits, scan ratios, spill, and the role hygiene that lets ad-hoc SQL run anywhere.
output
Findings doc. Each item: evidence query, dollar figure, fix, risk. Sorted by payback. This is the renewal-negotiation artifact.

03 · Section 1 — Warehouse Sizing and Suspension

Usually the biggest line, always the fastest fix. The four checks:

idle ratio — the single highest-value query in this article
WITH metered AS ( SELECT warehouse_name, SUM(credits_used) AS credits FROM snowflake.account_usage.warehouse_metering_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ), active AS ( SELECT warehouse_name, SUM(total_elapsed_time) / 1000 / 3600 AS active_hours FROM snowflake.account_usage.query_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ) SELECT m.warehouse_name, m.credits, ROUND(a.active_hours, 1) AS active_hours, ROUND(m.credits - a.active_hours, 1) AS idle_credits, -- ≈ pure waste ROUND((m.credits - a.active_hours) / m.credits * 100) AS idle_pct FROM metered m LEFT JOIN active a USING (warehouse_name) ORDER BY idle_credits DESC;
CheckSignalFix
Auto-suspend over 60sIdle ratio above ~15% on bursty warehouses60s for BI; immediate-after-job for ETL; measure the P95 cache trade
One-size-fits-all sizingSub-2-second queries profiled on L/XL warehousesSplit by workload shape: XS/S for BI, sized-up only for transform windows
Multi-cluster maxedMAX_CLUSTER_COUNT sized for Black Friday, running all yearEconomy scaling policy + realistic maximums per season
No resource monitorsNon-prod warehouses without quotasMonitors with notify-75% / suspend-100% on every non-prod warehouse

The sizing diagnostic that settles arguments: spill statistics from QUERY_HISTORY. Zero remote spill at P95 → the warehouse is oversized, step down and re-measure; remote spill at P50 → undersized, step up (spill is slower and dearer than the bigger tier). Multi-cluster handles concurrency; size handles complexity — most estates have them inverted, and the spill numbers end the debate.

04 · Section 2 — Clustering Key Antipatterns

Clustering spends background credits to maintain order. The audit asks one question per clustered table: do the reclustering credits buy more than the scans they save?

clustering ROI — credits spent vs pruning earned
SELECT table_name, SUM(credits_used) AS recluster_credits_30d FROM snowflake.account_usage.automatic_clustering_history WHERE start_time >= DATEADD('day', -30, CURRENT_DATE) GROUP BY 1 ORDER BY 2 DESC; -- join against QUERY_HISTORY partition-scan ratios on the same tables: -- high recluster credits + scan_pct still ~100 → paying to organise -- a bookshelf nobody reads in order. Drop the key.

The three antipatterns, in descending frequency: clustering small tables (under ~1TB, natural micro-partition pruning usually suffices — reclustering a churning small table burns credits to optimise nothing); clustering on high-cardinality columns (UUIDs scatter every insert across partitions, so reclustering runs forever — cluster on what queries actually filter by, usually date plus one low-cardinality dimension); and clustering full-scan workloads (if the nightly job aggregates everything anyway, pruning buys nothing — verify scan ratios before paying for order).

05 · Section 3 — The Quiet Serverless Meters

Every serverless feature is valuable when deliberate and expensive when forgotten — and all of them bill outside warehouse budgets, which is why warehouse-focused audits miss them:

MeterForgotten-mode failureAudit query
Search OptimizationEnabled for a 2024 investigation tool, still metering on every writeSEARCH_OPTIMIZATION_HISTORY vs point-lookup query volume
Materialized viewsHigh-churn base table → maintenance exceeds what the MV savesMATERIALIZED_VIEW_REFRESH_HISTORY vs MV query hits
Query AccelerationMasking a sizing problem on one warehouse, permanentlyQUERY_ACCELERATION_ELIGIBLE vs credits billed
Tasks & SnowpipeError-retry loops metering around the clockTASK_HISTORY failures; PIPE_USAGE_HISTORY spikes

Reconcile monthly in METERING_DAILY_HISTORY by service type — the one view where every serverless line appears side by side. The governance fix that makes this stick (from the full playbook): features are enabled per-table by PR, with the before/after credit delta recorded in the description. If nobody wrote the number down, it wasn't engineering.

06 · Section 4 — Per-Query Waste and Role Hygiene

QUERY_HISTORY (30 days) │ ├─► rank by SUM(credits) per query_parameterized_hash ── not by single-run time: │ top 50 = the audit surface the 400×/day 40s query │ outweighs the hourly monster ├─► per offender, three diagnostics: │ scan_pct partitions_scanned / partitions_total ~100% on big table → no pruning │ spill local/remote bytes remote spill → wrong size │ SELECT * wide staging models columnar store, real money │ └─► role hygiene cross-check: who CAN run ad-hoc SQL on which warehouse? ad-hoc on XL "because it was open" ──► route to capped XS/S no resource monitor = unbounded experiment budget

Fixes ranked by effort-to-payoff: clustering keys on the predicate columns the top offenders filter by (one key routinely beats a quarter of query tuning); merge-cadence batching where micro-batched MERGEs churn partitions; column pruning in staging models; and the political one — ad-hoc traffic routed to small capped warehouses by role, which is governance wearing a cost costume. Query tags per team turn next month's audit from forensic reconstruction into a GROUP BY.

07 · Production Evidence: The Audit in Practice

This checklist is the opening move of Vipra's FinOps engagements — the documented result being a 62% TCO reduction worth $125K/year on our cloud modernization engagement (Redshift→BigQuery, where the same audit discipline applied to a different engine). The pattern across estates we've audited: Section 1 findings fund the remediation of everything else within the first month; Section 3 contains the most embarrassing line items (a forgotten Search Optimization on a deprecated table is the genre's classic); and the findings document changes renewal-negotiation leverage materially — walking in with a quantified 30% self-recovery plan changes what the sales team offers. The full remediation sequence, with the warehouse-splitting playbook and the attribution machinery that keeps the bill down, is in The $2M Query.

20–40%
Recoverable Waste —
Typical First Audit
62%
TCO Cut — Vipra
Documented Engagement
2days
One Engineer,
Read-Only, Full Audit
$
Every Finding Carries
Its Dollar Figure

08 · Lessons Learned: The Hard Truths

  • Findings without dollar figures don't get meetings. The identical technical finding, priced, moved from backlog to done in a week. Pricing the waste is most of the audit's value.
  • The idle query embarrasses everyone the first time. No estate we've audited believed its idle ratio before seeing it. Run it first; it buys credibility for everything after.
  • Clustering waste hides because nobody owns it. Warehouse spend has a team's name on it; AUTOMATIC_CLUSTERING_HISTORY belongs to nobody. Assign the storage-layout budget an owner or audit it forever.
  • Serverless features outlive their reasons. Every estate carries at least one meter whose justification left the company two reorgs ago. The monthly METERING_DAILY_HISTORY reconciliation is the only durable defense.
  • Renewal timing is leverage. The audit run a quarter before renewal — with the self-recovery plan quantified — changed negotiation outcomes in ways the same audit run after signing never could.
  • The audit decays in two quarters without attribution. Tags, monitors, and showback (the playbook's Section 5) are what keep the recovered 30% recovered. An audit without follow-through is a very detailed photograph of a leak.

09 · Key Takeaways for Practitioners

💵
Price every finding

Credits × rate × monthly run-rate, attached to each item. Dollar figures get meetings; observations get wikis.

⏱️
Idle first

The metered-vs-active gap is pure waste and the fastest fix. Run that query before any other.

🗂️
Clustering must earn its credits

Recluster spend vs pruning gained, per table. Small tables, UUID keys, and full-scan workloads: drop the key.

👻
Reconcile the quiet meters

METERING_DAILY_HISTORY monthly, by service type. Every serverless feature needs a living justification.

🔝
Rank queries by total credits

Aggregate cost, not anecdote. Fix the tables the top 50 hit; route ad-hoc to capped warehouses.

📅
Audit before renewal

A quantified self-recovery plan is negotiation leverage. After signing, it's just hygiene.

The full remediation playbook — warehouse splitting, pruning fixes, feature governance, and the attribution that keeps costs down — is The $2M Query. The documented engagement behind the numbers: Cloud FinOps & BigQuery Modernization.

FAQ · Frequently Asked Questions

How much can a Snowflake cost audit realistically save?
Typically 20–40% of the monthly bill: 10–20% from warehouse sizing and auto-suspend fixes alone, the rest from clustering cleanup, forgotten serverless features, and query hygiene. The first tranche is config-only and lands within days.
What auto-suspend setting should Snowflake warehouses use?
60 seconds for BI/interactive warehouses; immediate (or job-scoped warehouses) for ETL. Long timers exist to keep caches warm — measure whether cache hits actually justify the idle spend; in most accounts they don't.
When should a Snowflake table have a clustering key?
When it is large (1TB+), queries filter on a small set of columns, and QUERY_HISTORY shows poor partition pruning. Small tables, full-scan workloads, and high-cardinality keys are the three classic mis-clustering patterns that burn reclustering credits for nothing.
Which ACCOUNT_USAGE views matter most for cost?
WAREHOUSE_METERING_HISTORY (credits vs activity = idle waste), QUERY_HISTORY (scans, spilling, repeats), AUTOMATIC_CLUSTERING_HISTORY (reclustering spend per table), and METERING_DAILY_HISTORY by service type (the quiet serverless meters). Those four views power the whole audit.