An automated valuation model is a fusion problem wearing an ML costume: parcels, transactions, listings, tax rolls, school and flood zones, and increasingly satellite imagery — each in its own coordinate system, vintage, and quality regime. Force them through a conventional warehouse and the spatial joins that define the feature set (parcels × flood zones × school districts × comparable sales) either time out or bankrupt you.
The architecture that works in production: a lakehouse with H3 hexagonal indexing as the spatial join currency, imagery-derived features computed once and joined as columns, and a point-in-time-correct feature store feeding valuation models. Real-time market movement detection rides the same spine.
This article builds directly on Vipra's documented production work: a hybrid multi-cloud (AWS + GCP) geospatial data lakehouse on Databricks, engineered for real-estate AI models with high-cardinality spatial data. The architecture below is that engagement's pattern, generalised; scale figures beyond it are labelled reference targets.
01 · Why Property Data Breaks Conventional Platforms
Three properties make real-estate data a worst case for warehouses. Spatial cardinality: a national parcel fabric is 150M+ polygons, and every meaningful feature is a join against another spatial layer — naive geometry joins are O(n×m) comparisons that no amount of warehouse credits can outrun. Vintage chaos: assessor rolls update annually, listings hourly, imagery quarterly, transactions on recording lag; a valuation is only honest if every feature respects its as-of date. Source heterogeneity: county formats vary wildly (the United States has 3,000+ recording jurisdictions), coordinate systems disagree, and address strings are a parsing project of their own.
The lakehouse answers all three: object-storage economics for imagery and raw county files, Delta/BigQuery tables for the structured layers, H3 as the universal spatial key, and time travel for point-in-time correctness. This is not theoretical — it is the architecture of our production geospatial engagement.
02 · The Architecture: Hybrid Multi-Cloud Lakehouse
The multi-cloud split is deliberate, not accidental: heavy Spark geospatial processing and model training live on Databricks (AWS), interactive market analytics on BigQuery (GCP) where analysts already work, with the lakehouse tables as the single source both read. Our production engagement runs exactly this hybrid — the lesson it taught: pick one canonical store per layer and replicate derived tables, never raw ones.
03 · The Data Flow: Imagery + Transactions to Features
The imagery branch deserves emphasis: computer-vision models run once per tile vintage — roof condition, pool presence, vegetation indices, construction activity — and their outputs join the feature tables as ordinary columns keyed by H3-to-parcel mapping. The AVM never touches pixels; it consumes imagery as features with vintages, which keeps training cheap and the lineage auditable.
04 · H3: Making Billion-Row Spatial Joins Tractable
The single most consequential decision in the platform. Raw geometry joins (point-in-polygon, polygon overlap) at parcel scale are computationally hostile; H3 converts them into integer equality joins:
H3 spatial join — geometry becomes equality (PySpark + h3 udf)# one-time: polygons → hex coverings at the working resolution flood_hexes = (flood_zones .withColumn("h3_r9", polyfill_udf("geometry", F.lit(9))) # zone → [hex,...] .select("zone_id", "fema_class", F.explode("h3_r9").alias("h3_r9"))) parcels_h3 = parcels.withColumn("h3_r9", point_to_h3_udf("centroid", F.lit(9))) # the "spatial join" is now a hash join Spark eats for breakfast: parcel_flood = parcels_h3.join(flood_hexes, "h3_r9", "left") # billion-row scale: minutes on a modest cluster, not hours on a huge one
Resolution strategy from production: r7 (~5 km²) for market-area statistics, r8/r9 for neighborhood features and zone overlays, r10 (~0.015 km²) where parcel-adjacency matters. Store keys at all working resolutions (they're cheap integers; parent/child traversal is bit arithmetic) and Z-order feature tables on the join resolution. Edge honesty: hex coverings approximate polygon boundaries — for the ~2% of parcels straddling zone edges, fall back to exact PostGIS point-in-polygon, flagged by a covering-boundary bit. Approximate for the bulk, exact for the edge, explicit about which is which.
05 · The AVM Feature Pipeline
Valuation models are only as good as their features, and valuation features have a vicious time dimension: a model trained on comp statistics computed with post-sale knowledge backtests brilliantly and fails in production. The pipeline rules:
| Feature family | Examples | The discipline |
|---|---|---|
| Structure | sqft, beds, age, assessor quality codes | Vintage = assessor roll date, not query date |
| Location (H3) | school/flood/zoning overlays, distance-to-amenity per hex | Overlay version pinned; zones change, history retained |
| Market | comps per hex: median $/sqft, DOM, inventory, velocity (30/90/365d) | Point-in-time windows ending at as-of date — the classic leak lives here |
| Imagery | roof score, pool, vegetation, construction activity | Tile vintage recorded; features carry their observation date |
Training set assembly is a time-travel join: for each historical sale, features AS OF the listing date. Delta time travel plus as-of columns makes this a query, not an archaeology project — the same reproducibility property our genomics lakehouse piece builds on, applied to comps instead of cohorts. Serving uses the identical feature code via the gold tables: train/serve skew eliminated by construction.
06 · Real-Time Market Movement Detection
Quarterly market reports are history; pricing desks and acquisition teams want to know this week that a submarket turned. The detection layer rides the existing spine: listing events stream into per-hex rolling statistics (new listings, price cuts, DOM, pending-to-active ratios), and humble change-point detection per r7/r8 hex flags movements against each hex's own seasonal baseline.
Detected movements flow three places: the AVM's market-velocity features (models react to turning markets within days, not quarters), an analyst alert queue with the evidence series attached, and the valuation API's confidence intervals — a valuation in a fast-moving hex carries honest, wider bands. Streaming infrastructure here is the standard Kafka pattern from our fraud-detection architecture, at gentler throughput.
07 · Business Implementation: The Production Engagement
The foundation of this article is a documented Vipra engagement: a hybrid multi-cloud geospatial data lakehouse on Databricks, built for a real-estate AI client whose models demanded high-cardinality spatial data that their previous warehouse architecture could not serve. The production scope: AWS + GCP hybrid (Databricks lakehouse + BigQuery analytics), H3-indexed spatial layers, and imagery-derived features feeding valuation and market models — the architecture of Sections 02–05 is that system, generalised.
Implementation sequence that worked: phase 1 — parcel identity and H3 enrichment spine (the unglamorous core everything joins through); phase 2 — three feature families end-to-end (structure, overlays, comps) with point-in-time discipline from day one; phase 3 — imagery branch and market detection, riding rails already proven. The anti-pattern we were called in to replace: feature engineering scattered across data-science notebooks, each with its own subtly different comp logic — the platform's job was making the one correct feature computation cheaper to use than the twelve wrong ones.
Vipra Production (AWS+GCP)
H3 vs Geometry (Typical)
PostGIS Serving Target
Tolerated in Training
08 · Lessons Learned: The Hard Truths
- Parcel identity is the project. County IDs collide, change, and split; address matching is probabilistic. The golden-record service took longer than any pipeline and was worth more than all of them — every feature joins through it.
- Pick H3 resolutions once, deliberately. Re-indexing a feature estate from r8 to r9 mid-project is a full backfill. We model resolution choices against the feature families up front and document the trade.
- Imagery as columns, never as a model-time dependency. The first design had the AVM calling a CV service at inference; latency and version chaos followed. Precompute per vintage, join as features, pin versions.
- Comp leakage is subtle and everywhere. A 90-day median that includes the subject sale, a window ending at query date instead of listing date — every one inflates backtests. The point-in-time join framework exists to make the correct thing the lazy thing.
- Multi-cloud needs one canonical store per layer. Bidirectional sync of raw layers between clouds produced subtle drift; the fix was directional: lakehouse canonical, BigQuery receives derived marts, never the reverse.
- The 2% boundary parcels will find you. Hex-covering approximations on zone edges produced exactly the support tickets you'd predict (flood-zone disputes). The exact-fallback bit and an explanation in the API response ended them.
09 · Key Takeaways for Practitioners
The parcel golden record is the platform's core; budget it like one. Every join depends on it.
Geometry joins become integer equality. Pick resolutions per feature family, store all working keys, Z-order on them.
CV runs once per tile vintage; models consume scores with observation dates, never pixels.
Comp windows end at the as-of date; training joins are time-travel joins; serving reads the same gold tables.
Change-points per hex against seasonal baselines feed models, analysts, and honest confidence intervals.
One canonical store per layer; derived marts replicate one way. Multi-cloud is a design, not an accident.
The production foundation: the geospatial AI lakehouse case study. Sector context on the real estate industry page; the time-travel discipline in depth in our genomics lakehouse piece.