
What Is a Data Lakehouse?
Definition, Architecture & When You Need One

A data lakehouse stores data in cheap, open object storage while providing warehouse-grade guarantees — ACID transactions, schema enforcement, time travel, fast SQL — through open table formats. One copy of data serving BI, ML, and streaming. The definitive explainer, with the architecture, the format decision, and the honest cases where you should skip it.

Audience: Architects / Data Leaders
Core Idea: Lake economics + warehouse guarantees
Formats: Iceberg · Delta · Hudi
Vipra Production: Multi-cloud lakehouses shipped
Pattern: Medallion (Bronze/Silver/Gold)
Published: June 2026

Executive Summary

For two decades, enterprises ran two parallel systems: a warehouse (fast, governed, expensive, SQL-only) and a lake (cheap, flexible, ungoverned — and prone to becoming a swamp). Every team paid twice: once to store raw data in the lake, again to copy curated subsets into the warehouse. ML read the lake; finance read the warehouse; the numbers disagreed.

The lakehouse collapses the two: object storage (S3, GCS, ADLS) holds data in open formats, and a table-format layer — Apache Iceberg, Delta Lake, or Apache Hudi — adds the transactional guarantees that previously required a warehouse engine: ACID, schema enforcement and evolution, time travel, and the layout features that make SQL fast.

This explainer is grounded in production practice: Vipra ships and operates lakehouses across industries — a hybrid multi-cloud geospatial platform on Databricks, a GCP multi-region supply-chain lakehouse unifying 15 systems, and petabyte-scale reference architectures. The 'when to skip it' section is just as load-bearing as the rest.

01 · The Problem the Lakehouse Solves

The two-system era failed in a specific, predictable way. The warehouse was correct but expensive and closed: proprietary storage, SQL-only access, and every byte loaded was a byte billed. The lake was cheap and open but lawless: no transactions, no schema enforcement, no delete that meant anything. Files appeared, partial writes corrupted readers, and within three years most lakes earned the "swamp" epithet honestly. So enterprises ran both and paid three times: for storage twice, and for the reconciliation tax forever. ML trained on lake data that finance's warehouse numbers contradicted, and both were "right" per their system.

The lakehouse's claim is precise: one copy of data, in open formats on object storage, with warehouse-grade guarantees added by a metadata layer — so BI, ML, and streaming consume the same governed tables, and the reconciliation tax goes to zero by construction.

02 · The Architecture, Layer by Layer

storage · Object storage: S3 / GCS / ADLS. Parquet files. The cheapest durable bytes in computing; no compute married to them.
format · Table format: Iceberg / Delta / Hudi. The metadata layer that turns files into tables (ACID, schema, time travel, layout). The enabling invention.
catalog · Catalog + governance: Unity Catalog / Glue / Polaris-class. Discovery, access control, lineage, row/column security: the layer audits actually examine.
compute · Engines, plural, interchangeable. Spark for transformation, Trino/warehouse-external for SQL, streaming writers, ML readers: same tables, no copies.
consume · BI dashboards, ML training, streaming apps, governed sharing. One truth; the finance number and the model's training set finally agree.

The architectural sentence worth memorizing: the lakehouse decouples storage, table semantics, and compute — each layer swappable, each priced independently. That decoupling is also the lock-in escape hatch: open formats mean the data outlives any vendor decision above it.

03 · How the Table Format Layer Works

The enabling technology is a metadata layer over Parquet that provides four guarantees:

| Guarantee | Mechanism | What it replaces |
|---|---|---|
| ACID transactions | Atomic metadata swaps; snapshot isolation | "Don't read the lake while the job runs" folklore |
| Schema enforcement & evolution | Schema in metadata; writes validated; columns add/rename safely | Schema-on-read archaeology |
| Time travel | Prior versions retained until vacuumed or expired, and queryable (VERSION AS OF / TIMESTAMP AS OF) | Backup restores and apologies |
| Performance layout | Partition pruning, Z-ordering/clustering, compaction, stats | The "lakes are slow" truism |

The guarantees, demonstrated in four statements (Delta/Spark SQL):

    -- ACID: concurrent writers, readers never see partial state
    MERGE INTO gold.orders t
    USING staging.updates s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;

    -- schema evolution without a migration project
    ALTER TABLE gold.orders ADD COLUMN channel STRING;

    -- time travel: audit, reproduce, recover
    SELECT * FROM gold.orders TIMESTAMP AS OF '2026-03-31 23:59:59';

    -- layout: make the scan skip 99% of the table
    OPTIMIZE gold.orders ZORDER BY (customer_id, order_date);
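
To make the mechanism column concrete, here is a minimal Python sketch of a snapshot log. It is a toy model, not the API of Iceberg, Delta, or Hudi: `Snapshot`, `ToyTable`, `commit`, and `as_of` are hypothetical names, and a real format persists this log as metadata files on object storage rather than in memory. What it illustrates is the shape of the idea: writers publish a new immutable snapshot in one step, readers keep the snapshot they pinned, stale writers are rejected, and old snapshots stay readable.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: a schema plus the data files it references."""
    version: int
    schema: tuple[str, ...]
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table format's metadata log (not a real API)."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base_version: int, new_files: list[str], rows: list[dict]) -> Snapshot:
        head = self.current()
        # Optimistic concurrency: a writer that planned against a stale
        # version must retry instead of overwriting another writer's commit.
        if base_version != head.version:
            raise RuntimeError("conflict: table changed since this write was planned")
        # Schema enforcement: reject rows whose columns differ from the schema.
        for row in rows:
            if set(row) != set(head.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
        # The "atomic swap": data files already exist, but readers only see
        # them once this single append publishes the new snapshot.
        snap = Snapshot(head.version + 1, head.schema, head.files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def as_of(self, version: int) -> Snapshot:
        """Time travel; assumes versions are contiguous from 0 and none expired."""
        return self.snapshots[version]


table = ToyTable([Snapshot(0, ("id", "amount"), ())])
pinned = table.current()  # a reader pins version 0
table.commit(0, ["part-0001.parquet"], [{"id": 1, "amount": 9.5}])

print(pinned.files)           # () : the reader's view did not change mid-query
print(table.current().files)  # ('part-0001.parquet',)
print(table.as_of(0).files)   # ()
```

A second writer that still calls `table.commit(0, ...)` after this point raises `RuntimeError`, which is the toy equivalent of a commit conflict that the writer must retry against the new head.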

04 · Iceberg vs Delta vs Hudi: The Format Decision

| Format | Strengths | Typical home |
|---|---|---|
| Apache Iceberg | Engine-neutral standard, hidden partitioning, broadest vendor adoption (Snowflake, BigQuery, Databricks all read it) | Multi-engine estates; teams optimizing for the longest horizon |
| Delta Lake | Deepest Spark/Databricks integration, most mature tooling, change data feed, Delta Sharing | Databricks-centric platforms; fastest path to production |
| Apache Hudi | Record-level upserts, incremental pulls, near-real-time ingestion primitives | CDC-heavy and streaming-first pipelines |

The honest 2026 summary: the formats have converged on the guarantees and diverge on ecosystem. Choose by estate shape, not benchmarks — Databricks-centric → Delta without agonizing; multi-engine or vendor-cautious → Iceberg's neutrality compounds; streaming-upsert-dominant → Hudi earns its operational quirks. And increasingly the answer is "both via interop" (UniForm, XTable-class) — the format war is ending in mutual readability. The full decision tree with five scenarios called honestly is our companion piece, Delta vs Iceberg vs Hudi.
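
The "choose by estate shape" rule can be written down as a small function, which is useful mainly because it forces the inputs to be stated. This is a sketch under assumptions: the `Estate` fields and the `suggest_format` name are hypothetical, and the precedence (workload shape first, then ecosystem) is one reasonable reading of the paragraph above, not a rule the formats themselves impose.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Estate:
    """Hypothetical description of a data estate; field names are illustrative."""
    databricks_centric: bool = False
    multi_engine: bool = False
    streaming_upsert_dominant: bool = False


def suggest_format(estate: Estate) -> str:
    """Map estate shape to a starting recommendation (assumed precedence)."""
    if estate.streaming_upsert_dominant:
        return "Apache Hudi"
    if estate.databricks_centric and estate.multi_engine:
        # The "both via interop" case: write one format, expose the other.
        return "Delta Lake + Iceberg via interop"
    if estate.databricks_centric:
        return "Delta Lake"
    if estate.multi_engine:
        return "Apache Iceberg"
    return "no strong signal"


print(suggest_format(Estate(databricks_centric=True)))  # Delta Lake
print(suggest_format(Estate(multi_engine=True)))        # Apache Iceberg
```

The "no strong signal" branch is deliberate: when none of the three shapes applies, the format choice is probably not the decision that matters yet.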

05 · The Medallion Pattern: Bronze, Silver, Gold

    sources (apps · CDC · files · streams · APIs)
       │
       ▼
    BRONZE: raw, immutable, as-ingested
       │  exact source payloads + lineage columns; never edited, only appended
       │  the "re-process history when the parser bug surfaces" insurance
       ▼  transformations: dbt / Spark, tested at every gate
    SILVER: cleaned, conformed, deduplicated
       │  typed, unit-normalised, quality-quarantined; entity-resolved
       │  the layer ML trains on and analysts trust
       ▼
    GOLD: business-level aggregates & dimensional models
          what BI queries; what contracts govern; what executives see

    property that pays for everything: each layer is rebuildable from the one below
    → disaster recovery and logic fixes are re-runs, not crises

The medallion pattern's value is not the naming — it is the rebuildability contract: every layer derives deterministically from the one below, so a transformation bug discovered in month nine is a backfill, not an incident review. Production disciplines per layer: bronze is append-only with source lineage columns; silver owns quality gates (quarantine, never silent drops — the two-layer testing discipline lives here); gold is where data contracts bind producers to consumers.
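
The rebuildability contract is easiest to see when each layer is a pure function of the one below. The sketch below is illustrative only: in practice these steps would be dbt models or Spark jobs, and the column names (`order_id`, `amount`, `_source`) and function names are assumptions made for the example. It shows the two properties described above: bad rows are quarantined rather than silently dropped, and re-running the pipeline from bronze reproduces the same gold.

```python
from __future__ import annotations


def to_silver(bronze: list[dict]) -> tuple[list[dict], list[dict]]:
    """Type and deduplicate bronze rows; unparseable rows go to quarantine."""
    clean: dict[str, dict] = {}
    quarantine: list[dict] = []
    for raw in bronze:
        try:
            row = {"order_id": str(raw["order_id"]), "amount": float(raw["amount"])}
        except (KeyError, TypeError, ValueError):
            quarantine.append(raw)  # kept for inspection, never silently dropped
            continue
        clean[row["order_id"]] = row  # last write wins on duplicate ids
    return sorted(clean.values(), key=lambda r: r["order_id"]), quarantine


def to_gold(silver: list[dict]) -> dict:
    """Business-level aggregate derived only from silver."""
    return {"orders": len(silver), "revenue": round(sum(r["amount"] for r in silver), 2)}


# Bronze is append-only and keeps the source payload plus a lineage column.
BRONZE = [
    {"order_id": 1, "amount": "10.50", "_source": "app"},
    {"order_id": 1, "amount": "10.50", "_source": "app"},  # duplicate delivery
    {"order_id": 2, "amount": "oops", "_source": "file"},  # unparseable amount
    {"order_id": 3, "amount": "4.25", "_source": "cdc"},
]

silver, quarantined = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)              # {'orders': 2, 'revenue': 14.75}
print(len(quarantined))  # 1

# Rebuildability: re-deriving from bronze yields the same gold.
assert to_gold(to_silver(BRONZE)[0]) == gold
```

If the parsing rule in `to_silver` turns out to be wrong, the fix is to change the function and re-run it over the untouched bronze rows, which is the "backfill, not an incident review" property in miniature.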

06 · Lakehouse vs Warehouse vs Lake — Honestly

| | Data lake | Warehouse | Lakehouse |
|---|---|---|---|
| Storage cost | Lowest | Highest | Lowest (same object storage) |
| Transactions / governance | None | Excellent | Strong (format + catalog dependent) |
| SQL performance | Poor | Best-in-class | Near-warehouse; gap closing yearly |
| ML / unstructured access | Native | Awkward exports | Native (same tables) |
| Streaming write/read | Fragile | Engine-specific | First-class (format-dependent) |
| Lock-in | None | Substantial | Low (open formats, swappable engines) |
| Operational maturity required | Low (and it shows) | Low (vendor-managed) | Moderate (compaction, vacuum, layout are yours) |

The last row is the one vendor decks omit: a lakehouse hands you warehouse guarantees plus warehouse-adjacent responsibilities — file compaction, snapshot expiry, statistics, layout tuning. Managed platforms absorb much of it; pure-OSS estates own all of it. Price that honestly before the migration, not after.
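
Two of those responsibilities, file compaction and snapshot expiry, reduce to simple planning logic, sketched below in Python. The function names, the 128 MB target, and the 7-day retention are assumptions chosen for illustration; real engines expose their own maintenance commands and defaults, and this sketch only plans the work rather than rewriting any files.

```python
from __future__ import annotations


def plan_compaction(file_sizes_mb: dict[str, int], target_mb: int = 128) -> list[list[str]]:
    """Group files smaller than target_mb into rewrite batches of at most target_mb."""
    small = sorted((size, name) for name, size in file_sizes_mb.items() if size < target_mb)
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for size, name in small:
        if current and current_size + size > target_mb:
            if len(current) > 1:  # a lone small file gains nothing from a rewrite
                batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


def plan_snapshot_expiry(snapshot_age_days: dict[int, int], keep_days: int = 7) -> list[int]:
    """Return versions older than keep_days, always keeping the latest version."""
    latest = max(snapshot_age_days)
    return sorted(v for v, age in snapshot_age_days.items() if age > keep_days and v != latest)


files = {"a": 4, "b": 8, "c": 60, "d": 70, "e": 200}
print(plan_compaction(files))                        # [['a', 'b', 'c']]
print(plan_snapshot_expiry({1: 30, 2: 10, 3: 2}))    # [1, 2]
```

The expiry plan is also where the trade-off lives: every snapshot expired to reclaim storage shortens the window that time travel can reach, so the retention period is a governance decision as much as a cost one.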

07 · Production Evidence: Lakehouses We Operate

This explainer is backed by shipped systems: the hybrid multi-cloud geospatial lakehouse (Databricks across AWS + GCP, high-cardinality spatial data feeding real-estate AI — the architecture detailed in our AVM piece); the GCP multi-region supply-chain lakehouse (15 regional logistics systems unified, 35% forecast accuracy gain); and the petabyte-scale reference architectures in our genomics and IoT digital-twin playbooks, where ACID + time travel carry regulatory weight. Across all of them, the medallion rebuildability contract and the format-layer guarantees are not slideware — they are what the auditors and the 3am incidents actually exercised.

2 · Clouds, One Lakehouse (Vipra Production)
15 · Systems Unified (GCP Multi-Region)
35% · Forecast Accuracy Gain (Documented)
1 · Copy of the Data (The Whole Point)

08 · When You Should Skip the Lakehouse

  • Your data is small and structured. Under a few TB of relational data with BI-only consumers: a warehouse (or even Postgres) is simpler, cheaper to operate, and faster to ship. The lakehouse solves problems you don't have.
  • You have no ML or streaming roadmap. The lakehouse's killer feature is one copy serving many workload types. If the only workload is dashboards, the warehouse's managed simplicity wins.
  • Nobody will own table maintenance. Compaction, vacuum, layout — unowned, they decay into the slow swamp you left. A lakehouse without an owner is a lake with extra YAML.
  • You're escaping a warehouse for cost reasons alone. Run the FinOps audit first: 20–40% of most warehouse bills is recoverable in place, without a migration's risk.
  • The org isn't ready for schema discipline. Open formats enforce schemas; teams used to dumping files will experience that as friction. The cultural readiness is part of the architecture.
  • A migration has no measured outcome attached. "Modernize to lakehouse" is not an outcome. "One copy serving BI + ML, TCO −40%, ML features point-in-time-correct" is — and it makes the project auditable.

09 · Key Takeaways for Practitioners

🏛️ One copy, many workloads
Object storage + table format + swappable engines. BI, ML, and streaming read the same governed truth.

📜 The format layer is the invention
ACID, schema, time travel, layout: metadata over Parquet replacing an entire second system.

🥉 Medallion = rebuildability
Bronze immutable, silver conformed, gold governed. Each layer is re-derivable, so fixes are re-runs.

⚖️ Choose formats by estate shape
Delta for Databricks-centric, Iceberg for multi-engine horizons, Hudi for upsert-heavy streaming. Interop is ending the war.

🔧 Own the maintenance or don't start
Compaction, vacuum, layout are your responsibilities now. Unowned lakehouses decay into swamps with better marketing.

🚪 Skipping is a valid architecture
Small structured data, BI-only, no owner? The warehouse is the right answer, and this explainer says so.

Go deeper: the format decision framework, a production multi-cloud lakehouse anatomized, and the documented engagements — geospatial AI and supply chain.

FAQ · Frequently Asked Questions

What is a data lakehouse in simple terms?
It is cheap cloud file storage that behaves like a database. Open table formats (Iceberg, Delta Lake, Hudi) add transactions, schema enforcement, and fast SQL on top of object storage — so BI, machine learning, and streaming all work from one copy of the data.
What is the difference between a data lake and a lakehouse?
A data lake is raw object storage with no guarantees — easy to fill, hard to trust. A lakehouse adds a transactional metadata layer providing ACID, schema enforcement, and time travel, making the same cheap storage reliable enough for production analytics.
Is BigQuery or Snowflake a lakehouse?
They are warehouses that have absorbed lakehouse features — both can now query open-format tables (Iceberg) in external storage. The architectural distinction is converging; what matters is whether your data lives in open formats you control or proprietary formats you rent.
Iceberg vs Delta Lake — which should we choose?
Choose by ecosystem: Delta Lake if you are Databricks-centric; Iceberg if you need engine neutrality across Spark, Trino, Flink, and warehouse external tables. Both are production-mature in 2026. Hudi remains the specialist for record-level upsert-heavy CDC workloads.
Do small companies need a lakehouse?
Usually not. Under roughly 10TB with SQL-only analytics, a serverless warehouse plus dbt is simpler and cheaper. Adopt a lakehouse when ML workloads, streaming, unstructured data, or multi-engine needs actually arrive — it is a workload decision, not a fashion decision.