Geospatial Intelligence at Scale: A Multi-Cloud Lakehouse for AI-Driven Property Valuation

Q: Why H3 instead of PostGIS geometry joins everywhere?

Scale: hex-indexed joins are integer equality joins that distribute trivially in Spark — billion-row parcel × overlay joins run in minutes. PostGIS remains in the architecture for exact boundary cases and low-latency serving; H3 handles the bulk, exact methods handle the edge, and the platform flags which is which.

Q: How do you fuse satellite imagery with transaction data?

Imagery is processed once per tile vintage by CV models (roof condition, pools, vegetation, construction); outputs map through H3 to parcels and join feature tables as ordinary vintaged columns. Valuation models consume scores, never pixels — keeping training cheap and lineage auditable.

Q: What prevents data leakage in AVM training?

Point-in-time discipline everywhere: every feature carries an as-of date, comp windows end at the listing date (never the query date), and training sets assemble via time-travel joins against the exact historical state. The platform makes the correct computation the cheap default.

Q: Is this architecture actually in production?

The foundation is: Vipra's documented engagement built a hybrid AWS + GCP geospatial lakehouse on Databricks for real-estate AI with high-cardinality spatial data. Scale figures beyond that engagement (national parcel fabric) are labelled reference targets on the same pattern.

Executive Summary

An automated valuation model is a fusion problem wearing an ML costume: parcels, transactions, listings, tax rolls, school and flood zones, and increasingly satellite imagery — each in its own coordinate system, vintage, and quality regime. Force them through a conventional warehouse and the spatial joins that define the feature set (parcels × flood zones × school districts × comparable sales) either time out or bankrupt you.

The architecture that works in production: a lakehouse with H3 hexagonal indexing as the spatial join currency, imagery-derived features computed once and joined as columns, and a point-in-time-correct feature store feeding valuation models. Real-time market movement detection rides the same spine.

This article builds directly on Vipra's documented production work: a hybrid multi-cloud (AWS + GCP) geospatial data lakehouse on Databricks, engineered for real-estate AI models with high-cardinality spatial data. The architecture below is that engagement's pattern, generalised; scale figures beyond it are labelled reference targets.

01 · Why Property Data Breaks Conventional Platforms

Three properties make real-estate data a worst case for warehouses. Spatial cardinality: a national parcel fabric is 150M+ polygons, and every meaningful feature is a join against another spatial layer — naive geometry joins are O(n×m) comparisons that no amount of warehouse credits can outrun. Vintage chaos: assessor rolls update annually, listings hourly, imagery quarterly, transactions on recording lag; a valuation is only honest if every feature respects its as-of date. Source heterogeneity: county formats vary wildly (the United States has 3,000+ recording jurisdictions), coordinate systems disagree, and address strings are a parsing project of their own.

The lakehouse answers all three: object-storage economics for imagery and raw county files, Delta/BigQuery tables for the structured layers, H3 as the universal spatial key, and time travel for point-in-time correctness. This is not theoretical — it is the architecture of our production geospatial engagement.

02 · The Architecture: Hybrid Multi-Cloud Lakehouse

sources

→

County rolls, MLS feeds, transactions, GIS layers, satellite imagery. Raw files land immutably in S3/GCS; imagery as COG (cloud-optimized GeoTIFF).

normalize

→

Spark standardization. One CRS (EPSG:4326), parsed addresses, parcel identity resolution, geometry validation — rejects quarantined with jurisdiction lineage.

index

→

H3 enrichment. Every entity gets hex keys at resolutions 7–10; polygons become hex coverings. The spatial join currency for everything downstream.

features

→

Feature tables. Parcel-level: structure, zone overlays, imagery-derived signals, comp statistics per hex — point-in-time-correct, Z-ordered on (h3_r8, as_of).

serve

→

AVM training (Databricks) + interactive analytics (BigQuery) + PostGIS serving for low-latency parcel lookups behind the valuation API.

The multi-cloud split is deliberate, not accidental: heavy Spark geospatial processing and model training live on Databricks (AWS), interactive market analytics on BigQuery (GCP) where analysts already work, with the lakehouse tables as the single source both read. Our production engagement runs exactly this hybrid — the lesson it taught: pick one canonical store per layer and replicate derived tables, never raw ones.

03 · The Data Flow: Imagery + Transactions to Features

┌─ COUNTY ROLLS ─┐ ┌─ MLS/LISTINGS ─┐ ┌─ TRANSACTIONS ─┐ ┌─ SATELLITE COG ─┐ │ annual, messy │ │ hourly stream │ │ recording lag │ │ quarterly tiles │ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ └────────┬────────┘ ▼ ▼ ▼ ▼ parse + CRS schema-validate dedupe + lag-date CV models → normalize + parcel match annotation roof/pool/veg │ │ │ scores per tile └─────────┬─────────┴─────────┬─────────┘ │ ▼ ▼ ▼ ┌──────────────────────────────────────┐ ┌───────────────────────┐ │ PARCEL IDENTITY (golden record) │◄───────│ tile → H3 → parcel │ │ one parcel, every source ID, │ │ imagery features as │ │ full merge lineage │ │ columns, not pixels │ └──────────────────┬───────────────────┘ └───────────────────────┘ ▼ ┌──────────────────────────────────────┐ │ H3-KEYED FEATURE TABLES │ comps per hex · zone overlays │ point-in-time-correct, versioned │ market velocity · imagery scores └───────┬──────────────────┬───────────┘ ▼ ▼ AVM TRAINING VALUATION API (PostGIS serving, <100ms parcel lookup)

The imagery branch deserves emphasis: computer-vision models run once per tile vintage — roof condition, pool presence, vegetation indices, construction activity — and their outputs join the feature tables as ordinary columns keyed by H3-to-parcel mapping. The AVM never touches pixels; it consumes imagery as features with vintages, which keeps training cheap and the lineage auditable.

04 · H3: Making Billion-Row Spatial Joins Tractable

The single most consequential decision in the platform. Raw geometry joins (point-in-polygon, polygon overlap) at parcel scale are computationally hostile; H3 converts them into integer equality joins:

H3 spatial join — geometry becomes equality (PySpark + h3 udf)
# one-time: polygons → hex coverings at the working resolution
flood_hexes = (flood_zones
    .withColumn("h3_r9", polyfill_udf("geometry", F.lit(9)))   # zone → [hex,...]
    .select("zone_id", "fema_class", F.explode("h3_r9").alias("h3_r9")))

parcels_h3 = parcels.withColumn("h3_r9", point_to_h3_udf("centroid", F.lit(9)))

# the "spatial join" is now a hash join Spark eats for breakfast:
parcel_flood = parcels_h3.join(flood_hexes, "h3_r9", "left")
# billion-row scale: minutes on a modest cluster, not hours on a huge one

Resolution strategy from production: r7 (~5 km²) for market-area statistics, r8/r9 for neighborhood features and zone overlays, r10 (~0.015 km²) where parcel-adjacency matters. Store keys at all working resolutions (they're cheap integers; parent/child traversal is bit arithmetic) and Z-order feature tables on the join resolution. Edge honesty: hex coverings approximate polygon boundaries — for the ~2% of parcels straddling zone edges, fall back to exact PostGIS point-in-polygon, flagged by a covering-boundary bit. Approximate for the bulk, exact for the edge, explicit about which is which.

05 · The AVM Feature Pipeline

Valuation models are only as good as their features, and valuation features have a vicious time dimension: a model trained on comp statistics computed with post-sale knowledge backtests brilliantly and fails in production. The pipeline rules:

Feature family	Examples	The discipline
Structure	sqft, beds, age, assessor quality codes	Vintage = assessor roll date, not query date
Location (H3)	school/flood/zoning overlays, distance-to-amenity per hex	Overlay version pinned; zones change, history retained
Market	comps per hex: median $/sqft, DOM, inventory, velocity (30/90/365d)	Point-in-time windows ending at as-of date — the classic leak lives here
Imagery	roof score, pool, vegetation, construction activity	Tile vintage recorded; features carry their observation date

Training set assembly is a time-travel join: for each historical sale, features AS OF the listing date. Delta time travel plus as-of columns makes this a query, not an archaeology project — the same reproducibility property our genomics lakehouse piece builds on, applied to comps instead of cohorts. Serving uses the identical feature code via the gold tables: train/serve skew eliminated by construction.

06 · Real-Time Market Movement Detection

Quarterly market reports are history; pricing desks and acquisition teams want to know this week that a submarket turned. The detection layer rides the existing spine: listing events stream into per-hex rolling statistics (new listings, price cuts, DOM, pending-to-active ratios), and humble change-point detection per r7/r8 hex flags movements against each hex's own seasonal baseline.

💡Hexes are the right unit for market detection precisely because they're not neighborhoods: uniform area, no gerrymandered boundaries, trivially aggregable to any custom market definition downstream. Detect at hex grain; report at whatever grain the business speaks.

Detected movements flow three places: the AVM's market-velocity features (models react to turning markets within days, not quarters), an analyst alert queue with the evidence series attached, and the valuation API's confidence intervals — a valuation in a fast-moving hex carries honest, wider bands. Streaming infrastructure here is the standard Kafka pattern from our fraud-detection architecture, at gentler throughput.

07 · Business Implementation: The Production Engagement

The foundation of this article is a documented Vipra engagement: a hybrid multi-cloud geospatial data lakehouse on Databricks, built for a real-estate AI client whose models demanded high-cardinality spatial data that their previous warehouse architecture could not serve. The production scope: AWS + GCP hybrid (Databricks lakehouse + BigQuery analytics), H3-indexed spatial layers, and imagery-derived features feeding valuation and market models — the architecture of Sections 02–05 is that system, generalised.

Implementation sequence that worked: phase 1 — parcel identity and H3 enrichment spine (the unglamorous core everything joins through); phase 2 — three feature families end-to-end (structure, overlays, comps) with point-in-time discipline from day one; phase 3 — imagery branch and market detection, riding rails already proven. The anti-pattern we were called in to replace: feature engineering scattered across data-science notebooks, each with its own subtly different comp logic — the platform's job was making the one correct feature computation cheaper to use than the twelve wrong ones.

Clouds, One Lakehouse —
Vipra Production (AWS+GCP)

10×+

Spatial Join Speedup —
H3 vs Geometry (Typical)

<100ms

Parcel Lookup —
PostGIS Serving Target

Time-Leaked Features
Tolerated in Training

08 · Lessons Learned: The Hard Truths

Parcel identity is the project. County IDs collide, change, and split; address matching is probabilistic. The golden-record service took longer than any pipeline and was worth more than all of them — every feature joins through it.
Pick H3 resolutions once, deliberately. Re-indexing a feature estate from r8 to r9 mid-project is a full backfill. We model resolution choices against the feature families up front and document the trade.
Imagery as columns, never as a model-time dependency. The first design had the AVM calling a CV service at inference; latency and version chaos followed. Precompute per vintage, join as features, pin versions.
Comp leakage is subtle and everywhere. A 90-day median that includes the subject sale, a window ending at query date instead of listing date — every one inflates backtests. The point-in-time join framework exists to make the correct thing the lazy thing.
Multi-cloud needs one canonical store per layer. Bidirectional sync of raw layers between clouds produced subtle drift; the fix was directional: lakehouse canonical, BigQuery receives derived marts, never the reverse.
The 2% boundary parcels will find you. Hex-covering approximations on zone edges produced exactly the support tickets you'd predict (flood-zone disputes). The exact-fallback bit and an explanation in the API response ended them.

09 · Key Takeaways for Practitioners

🆔

Identity before features

The parcel golden record is the platform's core; budget it like one. Every join depends on it.

⬡

H3 is the join currency

Geometry joins become integer equality. Pick resolutions per feature family, store all working keys, Z-order on them.

🛰️

Imagery as vintaged columns

CV runs once per tile vintage; models consume scores with observation dates, never pixels.

🕰️

Point-in-time or it's leakage

Comp windows end at the as-of date; training joins are time-travel joins; serving reads the same gold tables.

📡

Detect markets at hex grain

Change-points per hex against seasonal baselines feed models, analysts, and honest confidence intervals.

☁️

Hybrid with direction

One canonical store per layer; derived marts replicate one way. Multi-cloud is a design, not an accident.

The production foundation: the geospatial AI lakehouse case study. Sector context on the real estate industry page; the time-travel discipline in depth in our genomics lakehouse piece.

FAQ · Frequently Asked Questions

Why H3 instead of PostGIS geometry joins everywhere?

Scale: hex-indexed joins are integer equality joins that distribute trivially in Spark — billion-row parcel × overlay joins run in minutes. PostGIS remains in the architecture for exact boundary cases and low-latency serving; H3 handles the bulk, exact methods handle the edge, and the platform flags which is which.

How do you fuse satellite imagery with transaction data?

Imagery is processed once per tile vintage by CV models (roof condition, pools, vegetation, construction); outputs map through H3 to parcels and join feature tables as ordinary vintaged columns. Valuation models consume scores, never pixels — keeping training cheap and lineage auditable.

What prevents data leakage in AVM training?

Point-in-time discipline everywhere: every feature carries an as-of date, comp windows end at the listing date (never the query date), and training sets assemble via time-travel joins against the exact historical state. The platform makes the correct computation the cheap default.

Is this architecture actually in production?

The foundation is: Vipra's documented engagement built a hybrid AWS + GCP geospatial lakehouse on Databricks for real-estate AI with high-cardinality spatial data. Scale figures beyond that engagement (national parcel fabric) are labelled reference targets on the same pattern.

Geospatial Intelligence at Scale:A Multi-Cloud Lakehouse for AI-Driven Property Valuation

01 · Why Property Data Breaks Conventional Platforms

02 · The Architecture: Hybrid Multi-Cloud Lakehouse

03 · The Data Flow: Imagery + Transactions to Features

04 · H3: Making Billion-Row Spatial Joins Tractable

05 · The AVM Feature Pipeline

06 · Real-Time Market Movement Detection

07 · Business Implementation: The Production Engagement

08 · Lessons Learned: The Hard Truths

09 · Key Takeaways for Practitioners

FAQ · Frequently Asked Questions

Geospatial Intelligence at Scale:
A Multi-Cloud Lakehouse for AI-Driven Property Valuation