How Vipra Software architected a hybrid multi-cloud geospatial lakehouse on Databricks, enabling real-estate AI models with high-cardinality spatial data processed in real time across AWS and Azure.
A real estate technology company had built sophisticated AI models for property valuation, neighbourhood scoring, and demand forecasting — but those models were starved of the spatial data they needed to reach their accuracy potential. The company held 500M+ geospatial property records from multiple data vendors, but the records lived in incompatible formats across AWS S3 and Azure Blob Storage with no unified processing capability.
Geospatial data poses unique engineering challenges beyond standard data warehousing. High-cardinality coordinates, polygon geometries, spatial join operations, and H3 hex grid indexing require specialised processing frameworks that generic SQL warehouses handle poorly. The data science team had resorted to sampling — using 2% of available spatial data for model training because full-scale processing was computationally infeasible with their existing tooling.
The multi-cloud reality was non-negotiable. AWS housed the company's core operational systems and transaction data. Azure held three years of property market intelligence data acquired through a recent acquisition. Any solution needed to operate natively across both cloud environments without forcing a costly cloud consolidation project.
Vipra Software selected Databricks as the unifying compute layer precisely because of its cloud-agnostic architecture and native geospatial processing capabilities. Delta Lake's multi-cloud storage abstraction enabled a single logical lakehouse to span S3 and Azure Blob Storage, with Databricks clusters provisioned in both clouds operating against a unified metastore.
The lakehouse architecture spans two clouds with Databricks Unity Catalog as the governance and access control layer. AWS-hosted Delta tables serve the property transaction and listing data (primary operational source), while Azure-hosted Delta tables serve the acquired market intelligence dataset. Delta Sharing enables cross-cloud reads without data replication — a critical cost optimisation for datasets at this scale.
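The cross-cloud layout is easiest to picture as a single logical namespace whose tables physically live in different clouds. The toy sketch below is purely illustrative — the table names and storage URIs are hypothetical, and in the real platform this mapping lives in Unity Catalog with reads brokered by Delta Sharing — but it captures the key idea: consumers address logical tables, never the cloud that holds the bytes.

```python
# Toy sketch of a unified metastore spanning two clouds. Table names and
# storage URIs below are hypothetical, not the platform's real layout.
CATALOG = {
    # AWS-hosted operational Delta tables
    "property_transactions": "s3://ops-lakehouse/delta/property_transactions",
    "property_listings":     "s3://ops-lakehouse/delta/property_listings",
    # Azure-hosted acquired market intelligence Delta table
    "market_intelligence":   "abfss://intel@acquired.dfs.core.windows.net/delta/market_intelligence",
}

def resolve(table: str) -> str:
    """Map a logical table name to its physical Delta location, whichever cloud it lives in."""
    try:
        return CATALOG[table]
    except KeyError:
        raise KeyError(f"table {table!r} not registered in the catalog") from None

def hosting_cloud(table: str) -> str:
    """Infer the hosting cloud from the storage scheme — no data is copied across clouds."""
    scheme = resolve(table).split("://", 1)[0]
    return {"s3": "aws", "abfss": "azure"}[scheme]
```

The point of the indirection is exactly the cost optimisation the architecture relies on: a query joining `property_transactions` with `market_intelligence` reads each table in place rather than replicating either dataset into the other cloud.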
Geospatial processing leverages Apache Sedona's distributed spatial SQL running on Databricks clusters, enabling spatial joins, polygon intersection tests, and H3 grid generation across hundreds of millions of records in minutes rather than hours. The H3 indexing scheme was chosen for its hierarchical resolution model — data stored at H3 resolution 9 (individual buildings) can be instantly aggregated to resolution 5 (neighbourhood blocks) or resolution 3 (regional markets) without recomputation.
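The "aggregate without recomputation" property is worth making concrete. Real H3 cells are hexagonal and the rollup is done by the `h3` library (`cell_to_parent`); the simplified quadkey-style square grid below is a stand-in that illustrates the same property — a fine-resolution cell id encodes every coarser ancestor, so rolling up to a coarser resolution is a prefix truncation of the id, not a fresh spatial computation over raw coordinates.

```python
# Simplified stand-in for H3's hierarchical indexing: a quadkey-style
# square grid. Real H3 is hexagonal (h3.cell_to_parent does the rollup);
# the illustrated property is the same — coarser cells are prefixes.
from collections import Counter

def cell_id(lat: float, lng: float, resolution: int) -> str:
    """Encode a point as a quadkey: one base-4 digit per resolution level."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    digits = []
    for _ in range(resolution):
        lat_mid = (lat_lo + lat_hi) / 2
        lng_mid = (lng_lo + lng_hi) / 2
        quad = 0
        if lng >= lng_mid:
            quad += 1
            lng_lo = lng_mid
        else:
            lng_hi = lng_mid
        if lat >= lat_mid:
            quad += 2
            lat_lo = lat_mid
        else:
            lat_hi = lat_mid
        digits.append(str(quad))
    return "".join(digits)

def to_parent(cell: str, coarser_resolution: int) -> str:
    """Roll a cell up to a coarser resolution by truncating its id."""
    return cell[:coarser_resolution]

# Aggregate per-building counts (resolution 9) up to neighbourhood-block
# cells (resolution 5) without touching the raw coordinates again.
buildings = [(51.5072, -0.1276), (51.5074, -0.1278), (48.8566, 2.3522)]
fine = Counter(cell_id(lat, lng, 9) for lat, lng in buildings)
coarse = Counter()
for cell, n in fine.items():
    coarse[to_parent(cell, 5)] += n
```

Because the rollup is id arithmetic, the same pre-indexed Delta table serves building-level, neighbourhood-level, and regional-market queries — which is precisely why storing at the finest resolution once is enough.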
The Feature Store pre-computes the 340 spatial features that data scientists had previously calculated ad-hoc for each model training run. This shift from compute-time to storage-time calculation reduced average model training time from 18 hours to 4.5 hours, enabling the data science team to run 4x more experiments per sprint.
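The compute-time-to-storage-time shift can be sketched in a few lines. The feature names and the toy computation below are hypothetical (the real platform's 340 spatial features are far richer); what the sketch shows is the pattern itself: expensive features are computed once per entity, persisted under a stable key, and every subsequent training run pays only a lookup.

```python
# Illustrative sketch of moving feature computation from training time
# ("compute on every run") to a feature store ("compute once, read many").
# Feature names and the toy computation are hypothetical.
import math

def spatial_features(lat: float, lng: float) -> dict:
    """Stand-in for an expensive spatial feature computation."""
    return {
        "dist_to_origin_km": math.hypot(lat, lng) * 111.0,  # crude degrees-to-km
        "lat_band": int(lat // 10),
    }

# Storage-time: compute once per property, keyed by a stable id, and persist.
properties = {"p1": (51.5, -0.13), "p2": (48.9, 2.35)}
feature_store = {
    pid: spatial_features(lat, lng) for pid, (lat, lng) in properties.items()
}

# Training-time reads become cheap lookups — repeated experiments reuse
# the same precomputed rows instead of recomputing them.
def training_rows(ids):
    return [feature_store[pid] for pid in ids]

rows = training_rows(["p1", "p2", "p1"])
```

The amortisation is the whole argument: the one-off computation cost is paid at ingestion, so each additional experiment per sprint is nearly free on the feature-preparation side.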
The most significant impact was qualitative: for the first time, the data science team trained their valuation model against the full 500M+ spatial dataset rather than a 2% sample. The resulting model showed a 23% improvement in valuation accuracy versus the sampled baseline — a direct consequence of spatial data richness previously inaccessible to the platform.
The Feature Store eliminated 60% of the data preparation work that had previously occupied senior data scientists before each model training run. This reallocation of senior engineering capacity toward model architecture rather than data wrangling increased the team's experiment velocity by 4x in the quarter following launch.
The multi-cloud architecture also resolved a business continuity concern that had gone unaddressed: the company's operational dependence on a single cloud provider. Delta Sharing's cross-cloud access patterns now enable a genuine failover capability between cloud providers — a requirement that had appeared in two consecutive enterprise client security audits without a satisfactory resolution until this platform was delivered.