A warehouse wants tidy rows; genomics produces 200GB BCF files, semi-structured annotations, reference genomes that version, and results that must be re-derivable years later for regulators. Forcing variants through a conventional warehouse loses the provenance chain — which reference build, which caller version, which filter thresholds — that makes a scientific result a result.
The lakehouse keeps raw artifacts, structured variant tables, and full lineage in one governed system: object-storage economics with database semantics. Delta Lake's ACID transactions make pipeline runs atomic, time travel makes any historical analysis reproducible bit-for-bit, and Spark scales horizontally — with RAPIDS GPU acceleration where per-stage benchmarks justify it.
This is a reference architecture: throughput and the 12PB estate are stated design targets, labelled as such. The lakehouse engineering underneath is Vipra production practice — see the multi-cloud Databricks lakehouse we shipped for a different high-cardinality science domain.
01 · Why Genomics Defeats the Standard Warehouse
Three properties make genomics a worst case for conventional platforms. Shape: the atoms are huge semi-structured files (FASTQ, BAM/CRAM, VCF/BCF) with deeply nested per-sample fields — flattening them into warehouse rows either explodes cardinality or amputates the science. Reproducibility: a published cohort analysis must be re-derivable against the exact data, reference build, and code that produced it — sometimes seven years later. Audit: clinical-trial submissions answer to regulators who ask precisely the questions warehouses can't: show me what you knew, when.
The lakehouse resolves the tension because it refuses the false choice: raw artifacts stay in object storage as the immutable scientific record, structured variant tables live beside them with database semantics, and one transaction log binds both into a provenance chain.
02 · The Lakehouse Architecture for Variant Data
Partitioning is the load-bearing decision: partition by chromosome (and by date for ingestion tables), then Z-order on (sample_id, position) so that both access patterns, "everything about this sample" and "everything in this region across the cohort", prune effectively. Whole-genome scans become targeted reads; at biobank scale that saves hours of runtime and a four-figure compute bill per query.
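The layout described above can be sketched as a few rendered statements. This is a minimal illustration, not the production DDL: the table name `genomics.silver.variants` is borrowed from later in this article, while the `chrom`, `ref_allele`, and `alt_allele` columns and all three helper functions are hypothetical.

```python
from typing import Sequence


def create_table_sql(table: str, partition_col: str = "chrom") -> str:
    """Render DDL for a Delta table partitioned by chromosome."""
    # Every column name except sample_id and position is an assumption.
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"  {partition_col} STRING,\n"
        "  position BIGINT,\n"
        "  sample_id STRING,\n"
        "  ref_allele STRING,\n"
        "  alt_allele STRING\n"
        f") USING DELTA PARTITIONED BY ({partition_col})"
    )


def optimize_sql(table: str, zorder_cols: Sequence[str] = ("sample_id", "position")) -> str:
    """Render the maintenance statement that Z-orders files within each partition."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"


def region_query_sql(table: str, chrom: str, start: int, end: int) -> str:
    """A per-region read: the chrom filter prunes partitions, the range prunes files."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE chrom = '{chrom}' AND position BETWEEN {start} AND {end}"
    )


if __name__ == "__main__":
    print(create_table_sql("genomics.silver.variants"))
    print(optimize_sql("genomics.silver.variants"))
    print(region_query_sql("genomics.silver.variants", "chr7", 117_480_000, 117_670_000))
```

A per-sample query would filter on `sample_id` instead and lean on the Z-order statistics rather than on the partition column.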
03 · Ingestion: VCF/BCF as a Streaming Problem
Sequencers don't batch politely — they emit runs continuously, and a sequencing center's output is a stream wearing a filesystem costume. Treat it that way:
Auto Loader: continuous VCF discovery into bronze (PySpark)

    from pyspark.sql.functions import col, input_file_name

    # `spark` is the active SparkSession; chk(), parse_vcf_lines() and
    # extract_reference() are pipeline helpers defined elsewhere.
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "text")
        .option("cloudFiles.schemaLocation", chk("vcf_bronze"))
        .load("s3://seq-center/landing/vcf/")
        .transform(parse_vcf_lines)  # header-aware; per-sample fields → MapType
        .withColumn("source_file", input_file_name())
        .withColumn("ref_build", extract_reference(col("source_file")))  # pinned, always
        .writeStream
        .format("delta")
        .option("checkpointLocation", chk("vcf_bronze"))
        .trigger(availableNow=True)  # cost-controlled micro-batch
        .toTable("genomics.bronze.vcf_records"))
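The `parse_vcf_lines` helper is not shown in this article. As a rough, plain-Python sketch of the per-line logic such a helper might wrap (the function names, the output field names, and the choice to carry `raw_line` in the same record are assumptions, and INFO parsing is left out), consider:

```python
from typing import Dict, List, Optional

# The nine fixed VCF columns that precede the per-sample columns.
FIXED_COLS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_header(line: str) -> List[str]:
    """Return the sample names from the '#CHROM ...' column-header line."""
    cols = line.rstrip("\n").lstrip("#").split("\t")
    return cols[len(FIXED_COLS):]


def parse_record(line: str, samples: List[str]) -> Optional[dict]:
    """Parse one VCF data line; header and meta lines return None."""
    if line.startswith("#"):
        return None
    raw = line.rstrip("\n")
    fields = raw.split("\t")
    # The FORMAT column names the keys used by every per-sample column.
    keys = fields[8].split(":") if len(fields) > 8 else []
    per_sample: Dict[str, Dict[str, str]] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(samples, fields[9:])
    }
    return {
        "chrom": fields[0],
        "position": int(fields[1]),
        "ref": fields[3],
        "alt": fields[4].split(","),  # multi-allelic sites list several ALTs
        "samples": per_sample,        # would map to a MapType column in Spark
        "raw_line": raw,              # kept so history can be re-parsed later
    }
```

In a Spark job this logic would run inside a UDF or a `mapInPandas` step, with the header line resolved once per source file.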
A sustained 50GB/hour per ingestion lane is a comfortable design target on a modest Spark cluster, and lanes scale horizontally — sequencing throughput becomes a provisioning decision, not an architecture one. The rules that keep bronze honest: preserve the raw line per parsed row (re-parse history when a parser bug surfaces, instead of re-requesting files from the lab); pin the reference build as a column, never an assumption; and checksum-verify against the sequencer manifest before a file is considered landed.
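The checksum rule can be sketched as a small gate that runs before a file is handed to the ingestion lane. The manifest shape (file name mapped to a hex digest) and the use of SHA-256 are assumptions; a real sequencer manifest may use a different layout or hash, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large VCF/BCF files never load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_landed(path: Path, manifest: Dict[str, str]) -> bool:
    """True only when the manifest lists the file and the digests match.

    `manifest` maps file name -> expected hex digest (an assumed format).
    """
    expected = manifest.get(path.name)
    return expected is not None and sha256_of(path) == expected.lower()
```

Files that fail the gate stay in the landing area for re-delivery rather than entering bronze.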
04 · Variant Processing as Governed Spark Pipelines
Joint genotyping, annotation, and cohort filtering become Spark jobs over Delta tables instead of shell-script chains over files. The wins are mundane and decisive:
| Shell-script era | Lakehouse era | What changed |
|---|---|---|
| Per-file annotation runs, ad hoc | Versioned table-to-table joins | Annotation source + version recorded per row; re-annotation is a backfill, not a campaign |
| Cohort = directory of files someone curated | Cohort = consent-filtered gold table | Membership is a query with lineage, reproducible at any timestamp |
| Failed job = half-written outputs | Failed job = no commit | Atomicity; downstream never sees partial cohorts |
| "Which filters did we use?" = lab notebook | Filter thresholds in versioned pipeline code | The methods section writes itself, accurately |
Maintenance is scheduled, not heroic: OPTIMIZE with Z-ordering after large ingests, file compaction for streaming landings, and VACUUM windows aligned to the retention policy rather than the default (Section 08 explains why).
05 · ACID + Time Travel = Reproducible Research
The killer feature is not speed; it is the audit answer. Every table version inside the configured retention window stays queryable:
The inspection query: re-derive a 2024 analysis exactly

    -- What did the cohort look like when the paper was submitted?
    SELECT * FROM genomics.gold.cohort_cardio TIMESTAMP AS OF '2024-03-15 00:00:00';

    -- Bind the full provenance chain for the audit response:
    DESCRIBE HISTORY genomics.gold.cohort_cardio;  -- who, what job, which commit
    -- + pipeline git SHA pinned in table properties
    -- + reference genome checksum pinned per row
    -- = the rerun IS the audit response
A pipeline run is a transaction: it commits whole or not at all, so a failed annotation job can never leave a half-updated cohort for a downstream model to silently train on. Pair table history with pinned pipeline-code versions and reference checksums and reproducibility stops being a policy and becomes a property — the difference between "we believe the analysis was correct" and "here is the rerun" when a trial faces inspection.
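One way to picture the pinning is a small provenance record that renders both the statement that stamps the table and the query that replays it. This is a sketch under stated assumptions: the property key `pipeline.git_sha`, the `Provenance` class, and both helpers are hypothetical names, and the table version would come from `DESCRIBE HISTORY` in practice.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    table: str
    table_version: int      # the Delta version the analysis read
    git_sha: str            # full 40-character commit SHA of the pipeline code
    reference_sha256: str   # checksum of the reference genome build


def pin_sql(p: Provenance) -> str:
    """Render the statement that records the pipeline commit on the table."""
    if not re.fullmatch(r"[0-9a-f]{40}", p.git_sha):
        raise ValueError("git_sha must be a full lowercase 40-character SHA")
    # 'pipeline.git_sha' is an illustrative property key, not a Delta built-in.
    return (
        f"ALTER TABLE {p.table} "
        f"SET TBLPROPERTIES ('pipeline.git_sha' = '{p.git_sha}')"
    )


def rerun_sql(p: Provenance) -> str:
    """Render the time-travel read that replays the analysis input exactly."""
    return f"SELECT * FROM {p.table} VERSION AS OF {p.table_version}"
```

Storing the record alongside the published result gives an inspector the three coordinates needed for a rerun: data version, code commit, and reference checksum.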
06 · Where GPUs Earn Their Cost — and Where They Don't
GPU acceleration is real and it is not universal. The honest engineering move is per-stage benchmarking: run each pipeline stage on CPU and GPU fleets, divide speedup by cost ratio, pin each stage to its winner.
| Stage | GPU speedup (typical) | Verdict |
|---|---|---|
| Alignment (BWA-class → Parabricks-class) | 5–10× | Buy. Compute-bound, embarrassingly parallel |
| DL variant calling (DeepVariant-class) | 5–8× | Buy. The model is the workload |
| Annotation joins (gnomAD-class) | ~1× | Skip. I/O-bound; Spark + pruning wins |
| Cohort dataframe ops (RAPIDS) | 2–4×, workload-dependent | Benchmark. Wins on wide aggregations, loses on shuffle-heavy joins |
Mixed fleets orchestrated per-stage are normal and unexciting — which is what you want. The waste pattern we see in the field is symmetrical: estates paying for GPUs on I/O-bound stages, and estates grinding CPU weeks on alignment because "GPUs are expensive." Both lose to a benchmark spreadsheet that takes two days to build.
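The benchmark spreadsheet reduces to one ratio per stage. A minimal sketch of that arithmetic follows; the class, the field names, and the example figures are illustrative placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBenchmark:
    stage: str
    cpu_hours: float   # wall-clock time for the stage on the CPU fleet
    gpu_hours: float   # wall-clock time for the same stage on the GPU fleet
    cpu_rate: float    # fleet cost per hour, any currency
    gpu_rate: float

    @property
    def speedup(self) -> float:
        return self.cpu_hours / self.gpu_hours

    @property
    def cost_ratio(self) -> float:
        return self.gpu_rate / self.cpu_rate

    @property
    def value(self) -> float:
        """Speedup divided by cost ratio; above 1.0 the GPU run is cheaper."""
        return self.speedup / self.cost_ratio


def pin_fleet(b: StageBenchmark) -> str:
    """Pin the stage to whichever fleet wins on cost per run."""
    return "gpu" if b.value > 1.0 else "cpu"


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    for bench in (
        StageBenchmark("alignment", 40.0, 5.0, 10.0, 30.0),
        StageBenchmark("annotation_join", 10.0, 10.0, 10.0, 30.0),
    ):
        print(bench.stage, round(bench.value, 2), pin_fleet(bench))
```

The ratio compares cost per run only; a team that values turnaround time over spend may still pin a stage to GPU below 1.0.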
07 · Governance for Human Genomes
A genome is the most identifying datum that exists, consent is per-sample and revocable, and jurisdictions disagree about where genomes may rest. The platform encodes all three:
Consent enforcement: revocation as a platform property

    -- Gold cohorts are consent-filtered views, never copies:
    CREATE OR REPLACE VIEW genomics.gold.cohort_cardio AS
    SELECT v.*
    FROM genomics.silver.variants v
    JOIN governance.consent_registry c USING (sample_id)
    WHERE c.study_scope = 'cardio_2026'
      AND c.status = 'active';  -- revocation is immediate, by construction

    -- Erasure (GDPR-class) is provable, not aspirational:
    DELETE FROM genomics.silver.variants WHERE sample_id = :revoked;
    -- deletion vectors mark rows immediately; a file rewrite
    -- (for example REORG TABLE ... APPLY (PURGE)) followed by VACUUM
    -- past the retention window physically removes them;
    -- both events land in the audit log
Plus: study-scoped access grants (a researcher holds grants to studies, not tables), region-pinned storage with governed sharing instead of replication for cross-border collaborations, and the audit log itself as a queryable product. A 12PB estate is a realistic scenario target for a national-scale biobank on this design — and the controls are identical at 50TB, which is the right time to install them.
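The consent-as-a-view behaviour can be demonstrated in miniature. The sketch below uses Python's built-in `sqlite3` as a stand-in for the lakehouse engine (an assumption made purely so the example runs anywhere); the unqualified table names, sample IDs, and helper functions are illustrative.

```python
import sqlite3
from typing import List


def build_demo() -> sqlite3.Connection:
    """Create an in-memory variants table, consent registry, and cohort view."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE variants (sample_id TEXT, chrom TEXT, position INTEGER);
        CREATE TABLE consent_registry (sample_id TEXT, study_scope TEXT, status TEXT);
        CREATE VIEW cohort_cardio AS
          SELECT v.*
          FROM variants v
          JOIN consent_registry c USING (sample_id)
          WHERE c.study_scope = 'cardio_2026' AND c.status = 'active';
        INSERT INTO variants VALUES ('S1', 'chr1', 100), ('S2', 'chr1', 200);
        INSERT INTO consent_registry VALUES
          ('S1', 'cardio_2026', 'active'),
          ('S2', 'cardio_2026', 'active');
        """
    )
    return con


def cohort_samples(con: sqlite3.Connection) -> List[str]:
    """Samples currently visible through the consent-filtered view."""
    rows = con.execute(
        "SELECT DISTINCT sample_id FROM cohort_cardio ORDER BY sample_id"
    )
    return [row[0] for row in rows]


def revoke(con: sqlite3.Connection, sample_id: str) -> None:
    """Flip the registry row; no cohort table needs to be rebuilt."""
    con.execute(
        "UPDATE consent_registry SET status = 'revoked' WHERE sample_id = ?",
        (sample_id,),
    )
```

Calling `revoke(con, "S2")` removes S2 from every later read of the view without touching the variants table, which is the property a copied cohort table cannot offer.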
08 · Lessons Learned: The Hard Truths
- Normalization is the cohort. Decomposition and left-alignment differences between labs are the silent killer of cross-site comparability. Normalize deterministically in silver, record the tool version per row, and re-normalize history when the tool upgrades.
- Preserve the raw line. The month-nine parser bug is a certainty, not a risk. Bronze rows that carry their source line turn it into a re-parse backfill instead of a lab-relations incident.
- Small files will eat the metadata layer. Streaming VCF landings produce file counts that degrade planning long before storage hurts. Compaction is a scheduled job from day one, not a remediation.
- VACUUM and audit retention fight — decide deliberately. Time travel depends on retained versions; storage budgets want them gone. Set retention per zone from the compliance calendar (gold: years; bronze: months) and document the trade once, not per incident.
- Consent as a join, not a pipeline step. Filtering consent during ETL bakes yesterday's consent into today's tables. Consent-filtered views make revocation instantaneous and auditable by construction.
- Benchmark GPUs per stage, then stop arguing. The two-day benchmark spreadsheet ended a six-month internal debate. Alignment and DL calling won decisively; everything else stayed CPU. Numbers beat positions.
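The per-zone retention decision from the list above can be written down once as configuration. In this sketch `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration` are Delta Lake table properties, while the zone names, the interval values (placeholders for "months" and roughly seven years), and the helper function are assumptions to be replaced by the compliance calendar.

```python
from typing import Dict

# Placeholder intervals; real values come from the compliance calendar.
RETENTION_BY_ZONE: Dict[str, str] = {
    "bronze": "interval 90 days",
    "gold": "interval 2555 days",  # roughly seven years
}


def retention_sql(table: str, zone: str) -> str:
    """Render the statement that sets time-travel retention for one table.

    Both properties are set because time travel to an old version needs the
    transaction log entries and the data files for that version.
    """
    interval = RETENTION_BY_ZONE[zone]  # KeyError for an undeclared zone
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{interval}', "
        f"'delta.deletedFileRetentionDuration' = '{interval}')"
    )


if __name__ == "__main__":
    print(retention_sql("genomics.bronze.vcf_records", "bronze"))
    print(retention_sql("genomics.gold.cohort_cardio", "gold"))
```

Keeping the mapping in versioned code documents the trade once, and scheduled VACUUM jobs then inherit it per table.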
09 · Key Takeaways for Practitioners
- Raw files immutable in object storage; parsed variants in Delta beside them; one provenance chain binding both.
- Auto Loader lanes at 50GB/hr, checksummed, schema-validated, raw-line-preserving. Throughput is provisioning, not architecture.
- ACID commits + time travel + pinned code and references: the rerun is the audit response.
- Buy GPUs for alignment (5–10×) and DL calling (5–8×); skip them for I/O-bound joins; benchmark the rest. Mixed fleets are normal.
- Chromosome partitions, Z-order on (sample_id, position): per-sample and per-region queries both prune.
- Cohorts as consent-filtered views; revocation immediate; erasure provable via deletion vectors + VACUUM.
The lakehouse engineering pattern here is the one we run in production — documented in the geospatial AI lakehouse case study (a different high-cardinality science domain, same architecture spine). For format selection beneath the pipelines, see Delta vs Iceberg vs Hudi.