TL;DR — Direct Answer
A data lakehouse is an architecture that stores data in cheap, open cloud object storage (like a data lake) while providing warehouse-grade guarantees — ACID transactions, schema enforcement, time travel, and fast SQL — through open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi. It exists so one copy of data can serve BI, machine learning, and streaming workloads without the cost and lock-in of loading everything into a proprietary warehouse.
The problem the lakehouse solves
For two decades, enterprises ran two parallel systems: a data warehouse (fast, governed, expensive, SQL-only) and a data lake (cheap, flexible, ungoverned — and prone to becoming a "data swamp"). Every team paid twice: once to store raw data in the lake, again to copy curated subsets into the warehouse. ML teams read the lake; finance read the warehouse; the numbers disagreed.
The lakehouse collapses the two: object storage (S3, GCS, ADLS) holds the data in open formats, and a table format layer adds the transactional guarantees that previously required a warehouse engine.
How it works: the table format layer
The enabling technology is a metadata layer over Parquet files that provides:
- ACID transactions: concurrent writers cannot corrupt the table, and readers never see partial writes.
- Schema enforcement and evolution: writes that do not match the table schema are rejected, while columns can be added or renamed safely.
- Time travel: query the table as it was last Tuesday; audit and reproduce any historical state.
- Performance features: partition pruning, Z-ordering/clustering, and compaction bring scan speeds close to those of warehouse-native tables.
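
To make the guarantees listed above concrete, here is a minimal in-memory sketch of how a snapshot log can provide atomic commits, schema enforcement, and time travel. It is a teaching toy, not how Iceberg, Delta Lake, or Hudi are actually implemented: `ToyTable`, `Snapshot`, `commit_append`, `files_as_of`, and the file names are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: every data file visible at that point."""
    snapshot_id: int
    files: tuple


@dataclass
class ToyTable:
    """In-memory stand-in for a table format's metadata log (hypothetical)."""
    schema: dict  # column name -> expected Python type
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base_id: int, new_file: str, rows: list) -> Snapshot:
        # Schema enforcement: reject rows whose columns or types do not match.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"column {col!r} expects {typ.__name__}")
        # Optimistic concurrency: succeed only if nobody else committed
        # since this writer read the table.
        if base_id != self.current().snapshot_id:
            raise RuntimeError("conflict: table changed since base snapshot")
        snap = Snapshot(base_id + 1, self.current().files + (new_file,))
        self.snapshots.append(snap)  # publishing the new version is one step
        return snap

    def files_as_of(self, snapshot_id: int) -> tuple:
        # Time travel: return the file list recorded by an older snapshot.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(snapshot_id)


table = ToyTable(schema={"id": int, "city": str})
s1 = table.commit_append(0, "part-0001.parquet", [{"id": 1, "city": "Oslo"}])
s2 = table.commit_append(s1.snapshot_id, "part-0002.parquet", [{"id": 2, "city": "Lima"}])
```

A writer that still holds snapshot 0 as its base would now get a conflict error and have to retry, while `table.files_as_of(1)` keeps returning the table as it looked after the first commit. Real formats persist this log as metadata files in object storage rather than in memory.
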
| Format | Strengths | Typical home |
|---|---|---|
| Apache Iceberg | Engine-neutral standard, hidden partitioning, broad vendor adoption | Multi-engine estates (Spark + Trino + Snowflake/BigQuery external tables) |
| Delta Lake | Deep Spark/Databricks integration, mature tooling, change data feed | Databricks-centric platforms |
| Apache Hudi | Record-level upserts, incremental pulls, near-real-time ingestion | CDC-heavy and streaming-first pipelines |
The medallion architecture
Most production lakehouses organise data in three zones: Bronze (raw, immutable, as-ingested), Silver (cleaned, conformed, deduplicated), and Gold (business-level aggregates and dimensional models that BI tools query). Silver and Gold can each be rebuilt from the layer below, which turns disaster recovery and logic fixes into re-runs instead of crises. Transformations between layers are typically managed in dbt or Spark, with tests at every gate.
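
As a rough illustration of the Bronze to Silver to Gold flow, the sketch below uses plain Python lists in place of tables. The column names (`order_id`, `region`, `amount`), the quality gate, and the functions `to_silver` and `to_gold` are assumptions made for this example; in practice these steps would be dbt models or Spark jobs.

```python
def to_silver(bronze_rows):
    """Clean, conform, and deduplicate raw records (last record per order_id wins)."""
    latest = {}
    for row in bronze_rows:
        if row.get("order_id") is None or row.get("amount") is None:
            continue  # drop records that fail the basic quality gate
        latest[row["order_id"]] = {
            "order_id": row["order_id"],
            "region": row["region"].strip().upper(),  # conform region codes
            "amount": float(row["amount"]),           # cast text to a number
        }
    return list(latest.values())


def to_gold(silver_rows):
    """Aggregate to a business-level view: revenue per region."""
    revenue = {}
    for row in silver_rows:
        revenue[row["region"]] = revenue.get(row["region"], 0.0) + row["amount"]
    return revenue


# Bronze stays exactly as ingested, including duplicates and bad rows.
bronze = [
    {"order_id": 1, "region": " emea ", "amount": "10.50"},
    {"order_id": 1, "region": "EMEA", "amount": "12.00"},    # later correction
    {"order_id": 2, "region": "apac", "amount": "7.25"},
    {"order_id": None, "region": "apac", "amount": "3.00"},  # malformed: no key
]
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because both functions are pure and Bronze is never modified, fixing a bug in `to_silver` only requires re-running the two steps to regenerate Silver and Gold.
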
Lakehouse vs warehouse vs lake
| | Data lake | Warehouse | Lakehouse |
|---|---|---|---|
| Storage cost | Lowest | Highest | Lowest (object storage) |
| ACID / governance | No | Yes | Yes (table format) |
| ML / unstructured | Yes | Limited | Yes |
| BI SQL performance | Poor | Best | Near-warehouse |
| Lock-in risk | Low | High | Low (open formats) |
When you need one — and when you don't
Choose a lakehouse when you have mixed workloads (BI + ML + streaming) on the same data, multi-petabyte growth ahead, unstructured or semi-structured sources, or a strategic aversion to warehouse lock-in. Our geospatial lakehouse engagement is a working example: high-cardinality spatial telemetry on Databricks and Amazon Athena, serving AI valuation models and analytics from one copy of data.
Skip it when you are a SQL-only analytics shop with less than roughly 10 TB of data. A serverless warehouse like BigQuery with dbt is simpler, cheaper to operate, and faster to ship; our 62% TCO reduction migration is exactly that pattern. Architecture should follow workloads, not conference talks.
Common lakehouse mistakes
- Skipping compaction and optimization — thousands of small files quietly destroy scan performance.
- No medallion discipline — letting BI query Bronze recreates the swamp you escaped.
- Choosing the format before the engines — pick the query engines first; let them vote on Iceberg vs Delta.
- Ignoring governance — open storage still needs RBAC, masking, and lineage; "open" must not mean "exposed."
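
To show what compaction does about the small-files problem in the first mistake above, here is a simplified planning sketch: it groups small files into batches that would each be rewritten as one larger file. The function `plan_compaction` and the 128 MB target and 32 MB threshold are illustrative assumptions, not defaults of any particular table format; real formats ship their own maintenance operations for this.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Group small files into rewrite batches of at most roughly target_mb each."""
    small = sorted(
        (size for size in file_sizes_mb if size < small_threshold_mb),
        reverse=True,
    )
    batches, current, current_size = [], [], 0
    for size in small:
        # Start a new batch once adding this file would overshoot the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    # A batch with a single file gains nothing from being rewritten.
    return [batch for batch in batches if len(batch) > 1]


# 100 tiny files plus two files that are already large enough to leave alone.
sizes = [4] * 100 + [256, 512]
plan = plan_compaction(sizes)
files_before = len(sizes)
files_after = 2 + len(plan)  # untouched large files + one output per batch
```

In this example the 100 small files collapse into four rewrite batches, so a scan opens 6 files instead of 102.
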