A US franchising company had 140+ tables of valuable operational data locked inside a SaaS vendor's Amazon S3 export — accessible only through the vendor's limited reporting tools. We built a fully automated, self-healing pipeline into Google BigQuery, delivering fresh data every morning with schema evolution detection, idempotent loads, and zero day-to-day manual steps.
The client is a US-based franchising business operating dozens of franchise locations nationwide. Their entire operation — appointments, memberships, sales, payroll, collections, gift cards — runs through a third-party SaaS platform that manages all location-level transactions.
The data existed. The SaaS vendor exported it daily to Amazon S3. But the only way to access it meaningfully was through the vendor's own built-in reports. The business was analytically captive to the analyses the vendor had decided to build. They couldn't create custom dashboards, couldn't run cross-location comparisons, couldn't build revenue forecasts, and couldn't plan franchise expansion using data they nominally owned.
Every record the SaaS platform collected — appointments, revenue, memberships, payroll — lived in Amazon S3 but was analytically inaccessible. The franchising company was limited to the reports the vendor had decided to build. Franchise comparison, cross-location KPI analysis, and financial forecasting were all blocked by vendor-defined tooling.
This is one of the most common data sovereignty problems in the franchise and multi-location retail sector. Vertical SaaS platforms consolidate operations but create analytical blind spots. The vendor owns the reporting layer, which means the business has no analytical independence. We were brought in to change that permanently.
The engagement delivered three independent but interrelated outcomes, each building on the last. The result was not a one-time migration — it was permanent, automated, and self-maintaining analytical infrastructure.
Every record the SaaS vendor had ever collected was transferred into Google BigQuery. This covered 140+ data tables across the full operational footprint: appointments and bookings, sales and revenue transactions, memberships and package data, employee payroll records, collections history, and gift card transactions. Row-count validation confirmed completeness against the source. This load is non-repeating — historical data is now permanently resident in BigQuery.
A production pipeline now runs on a defined schedule with no human involvement under normal conditions. The sequence every day:
SaaS vendors release software updates that occasionally add new data fields — a new column on an appointment record, a new status type in the memberships table. Without schema evolution handling, every such change would require an engineering intervention to update the pipeline.
Our pipeline detects schema changes automatically. When the vendor adds a new field, the Airflow DAG reads the updated data definitions, issues an ALTER TABLE ADD COLUMN on the relevant BigQuery table, and continues loading data without manual intervention. Routine vendor updates are now a zero-touch event.
The pipeline spans two cloud providers (AWS and GCP) and three BigQuery data layers. Every component is managed, monitored, and audited.
A defining engineering decision was implementing a three-layer data architecture rather than loading everything into a single BigQuery dataset. This separation is what provides both data integrity and analytical flexibility — errors in any one layer cannot corrupt the others.
An exact copy of CSV files as received from the vendor, with audit columns added: _load_timestamp and _batch_id. Used for audit trails and source reprocessing if needed. Never modified or overwritten after load. The raw layer is the source of truth for the staging transformation.
The cleaned, deduplicated, business-rule-applied layer. Soft-deleted records are resolved, duplicate rows from vendor retry behavior are removed, and data types are normalized to BigQuery-native types. All analytical models and dashboards reference this layer exclusively.
Purpose-built analytical models for each franchise operational domain: appointments, revenue, memberships, and KPIs. Built with Dataform (BigQuery-native) in Phase 2. These are the models that power Looker dashboards, Power BI reports, and ad-hoc analysis — queryable by any BI tool.
A pipeline that works in a demo can still fail in production if it hasn't been designed for the failure modes that occur in real multi-vendor, multi-cloud environments. These four decisions addressed the specific failure modes we knew would appear.
Every batch of data is tracked in a metadata table with a unique batch ID. Before loading any file, the pipeline checks the log. If the batch has already been successfully processed, the load is skipped. This means that pipeline retries on transient failures never produce duplicate rows in BigQuery — even if the same file is presented to the pipeline multiple times. This is non-negotiable for any production pipeline that handles financial data.
Rather than hardcoding column lists in the pipeline DAG (which breaks the moment the vendor adds a field), we read BigQuery's table schema at runtime and compare it against the incoming CSV headers. For any new column detected, the pipeline executes ALTER TABLE ADD COLUMN before the load task runs. Routine SaaS vendor software updates — which previously required engineering intervention — are now a zero-touch event.
Keeping Raw, Staging, and Marts as distinct BigQuery datasets provides three independent safety nets. Raw preserves the exact vendor output for audit and replay. Staging holds the cleaned, analytics-ready data. Marts are built from Staging for each business domain. An error in the Staging transformation logic doesn't corrupt the Raw data — you can reprocess from source without a full re-migration from S3.
We configured Storage Transfer Service to run every 8 hours rather than once nightly. If the vendor's overnight export is delayed by 2–3 hours (a common occurrence with SaaS platforms on maintenance windows), the pipeline still has two additional transfer windows before business users arrive. An 8-hour cycle provides a meaningful reliability buffer at negligible additional cost.
The schema evolution detection is the most technically differentiated component of this pipeline. Here is the core logic from the Airflow DAG task that handles new vendor columns:
# Read the actual BigQuery table schema def detect_and_apply_schema_changes(table_id: str, incoming_csv_headers: list[str]): client = bigquery.Client() table = client.get_table(table_id) # Get existing column names from BigQuery existing_columns = {field.name for field in table.schema} incoming_columns = {col.lower() for col in incoming_csv_headers} # Detect new columns the vendor has added new_columns = incoming_columns - existing_columns - AUDIT_COLUMNS if new_columns: logging.info(f"Schema evolution detected: {new_columns}") updated_schema = list(table.schema) for col_name in new_columns: # Infer type from sample data — default to STRING for safety bq_type = infer_bq_type(col_name, sample_rows) updated_schema.append( bigquery.SchemaField(col_name, bq_type, mode="NULLABLE") ) logging.info(f"Adding column: {col_name} ({bq_type})") # Apply ALTER TABLE ADD COLUMN equivalent via schema update table.schema = updated_schema client.update_table(table, ["schema"]) logging.info(f"Schema updated for {table_id} — added {len(new_columns)} columns") return len(new_columns) # 0 if no changes needed
When the SaaS vendor's development team ships a new data field — a new appointment status, a new payment type, a new membership attribute — the pipeline detects it on the next run, updates BigQuery's schema, and continues loading. No Slack message to engineering. No pipeline pause. No missed data. The client has been running this in production for months with zero schema-evolution incidents.
# Check metadata table before loading any batch def is_batch_already_processed(batch_id: str, table_id: str) -> bool: client = bigquery.Client() query = f""" SELECT COUNT(*) as cnt FROM `{METADATA_TABLE}` WHERE batch_id = '{batch_id}' AND target_table = '{table_id}' AND status = 'SUCCESS' """ result = client.query(query).result() return next(result).cnt > 0 def load_batch_with_guard(batch_id: str, gcs_uri: str, table_id: str): if is_batch_already_processed(batch_id, table_id): logging.info(f"Batch {batch_id} already loaded for {table_id} — skipping") return # Idempotent: safe to retry at any point # Proceed with load job execute_bq_load_job(gcs_uri, table_id) # Record success in metadata table write_batch_metadata(batch_id, table_id, status='SUCCESS')
_load_timestamp, _batch_id). Historical validation sign-off from client stakeholders.Operational data existed in S3 but was only accessible through the vendor's reporting interface. The client had no independent access to their own records. Custom analysis was impossible without manual CSV exports.
140+ tables of operational data live in a BigQuery environment the client fully controls. No dependency on vendor reporting. Custom dashboards, cross-location analysis, and financial forecasting are all available from the same dataset.
Comparing performance across franchise locations required manually downloading location-specific reports, combining CSVs in Excel, and reconciling inconsistencies. Finance team spent hours weekly on this process.
All 140+ tables from all locations are in a single BigQuery dataset. Cross-location queries execute in seconds. Revenue, membership, and appointment analysis across the full franchise network is now a SQL query away.
Business users were limited to what the vendor's reporting UI showed — often delayed by the vendor's own processing cycles. Real-time insight into previous day's performance was not possible.
Previous day's operational data is in BigQuery before business hours every morning. The 8-hour transfer cycle ensures even delayed vendor exports are captured before the start of the working day. Data lag is consistently under 12 hours.
Any pipeline that hardcodes expected columns from a SaaS vendor will break when the vendor ships an update. Always read the incoming schema at runtime and handle new columns programmatically. The cost of schema evolution logic upfront is far lower than the cost of pipeline failures in production.
Every data load must have a unique, trackable batch ID. Without it, retries produce duplicate rows that corrupt aggregate analyses. A metadata tracking table costs nothing to build and prevents the most common class of pipeline data quality failures.
Raw data should be immutable. Never run business logic transformations directly on the source load. If a transformation logic bug produces incorrect staging data, you need the ability to re-derive from raw without re-migrating from the vendor. Layer separation is cheap; re-migrations are expensive.
An 8-hour transfer cycle instead of a 24-hour one costs a negligible amount more but provides a meaningful reliability buffer for vendor export delays, GCP maintenance windows, and network hiccups. Budget for reliability at the transfer scheduling layer, not just the processing layer.
Configure Airflow alerts on task duration as well as task failure. If a load that normally takes 5 minutes starts taking 25, it will fail at 30. An SLA miss alert at 10 minutes gives you time to investigate before the pipeline fails and business users are impacted.
Vendor CSV exports often have undocumented columns, inconsistent naming conventions, and unexpected data types. The first week of any SaaS pipeline project should include a thorough schema audit and data profiling pass — not assumption-based schema definition. What the vendor's docs say and what the actual exports contain are often different.
The SaaS-vendor S3 export → BigQuery pattern appears across virtually every industry that uses vertical SaaS platforms. If any of these indicators apply to your business, you have the same problem this franchising company had before this engagement:
The total engineering effort for a full historical migration plus automated daily pipeline is typically 6–10 weeks depending on source complexity and table count. The ongoing operational cost — primarily Cloud Composer and BigQuery storage — is a fraction of the value of the analytical capability delivered. The pipeline pays for itself in the first week of analyst time saved on manual CSV exports.
This pattern works for any SaaS platform that offers S3 export access: salon and wellness management, restaurant POS, franchise operations software, property management systems, event management platforms, HR and payroll systems. If the data is in S3, we can get it into BigQuery — automated, idempotent, and self-healing — in 6–10 weeks.