Zero to Data Warehouse: Automated AWS S3 → BigQuery Pipeline for a US Franchising Business

140+

Operational Tables Migrated

Data Layers: Raw / Staging / Marts

8hr

Automated Transfer Cycle

Manual Steps Day-to-Day

∞

Schema Changes Auto-Handled

The Constraint: Data Locked in a Vendor's Ecosystem

The client is a US-based franchising business operating dozens of franchise locations nationwide. Their entire operation — appointments, memberships, sales, payroll, collections, gift cards — runs through a third-party SaaS platform that manages all location-level transactions.

The data existed. The SaaS vendor exported it daily to Amazon S3. But the only way to access it meaningfully was through the vendor's own built-in reports. The business was analytically captive to the analyses the vendor had decided to build. They couldn't create custom dashboards, couldn't run cross-location comparisons, couldn't build revenue forecasts, and couldn't plan franchise expansion using data they nominally owned.

The Strategic Problem

Every record the SaaS platform collected — appointments, revenue, memberships, payroll — lived in Amazon S3 but was analytically inaccessible. The franchising company was limited to the reports the vendor had decided to build. Franchise comparison, cross-location KPI analysis, and financial forecasting were all blocked by vendor-defined tooling.

This is one of the most common data sovereignty problems in the franchise and multi-location retail sector. Vertical SaaS platforms consolidate operations but create analytical blind spots. The vendor owns the reporting layer, which means the business has no analytical independence. We were brought in to change that permanently.

The Broader Pattern

The "SaaS vendor S3 export → BigQuery" pattern is one of the most common data engineering problems in multi-location retail, franchise operations, salon & wellness management, restaurant POS, and property management. If your SaaS vendor exports to S3 and you can only access your data through their reporting UI, you have a data sovereignty problem — regardless of industry. This is a solvable engineering problem, not a business constraint.

What We Built — Three Distinct Deliverables

The engagement delivered three independent but interrelated outcomes, each building on the last. The result was not a one-time migration — it was permanent, automated, and self-maintaining analytical infrastructure.

Deliverable 1: Complete Historical Migration

Every record the SaaS vendor had ever collected was transferred into Google BigQuery. This covered 140+ data tables across the full operational footprint: appointments and bookings, sales and revenue transactions, memberships and package data, employee payroll records, collections history, and gift card transactions. Row-count validation confirmed completeness against the source. This load is non-repeating — historical data is now permanently resident in BigQuery.

Deliverable 2: Fully Automated Daily Incremental Pipeline

A production pipeline now runs on a defined schedule with no human involvement under normal conditions. The sequence every day:

Vendor Side · AWS · Overnight

SaaS Vendor Deposits Files to Amazon S3

The SaaS platform places updated records into a shared Amazon S3 folder — CSV exports organized by GroupID folder, covering all updated tables across all franchise locations. This happens on the vendor's schedule, outside our control.

Cross-Cloud Transfer · AWS → GCP

Storage Transfer Service: S3 → Cloud Storage

Google's Storage Transfer Service automatically copies these files from the vendor's Amazon S3 bucket into a Google Cloud Storage bucket we control. Runs every 8 hours on a repeating schedule — if the vendor's nightly export is delayed, the pipeline has two additional retry windows before business hours. No cross-cloud networking configuration required by the client.

GCP Orchestration Layer

Cloud Composer (Airflow) Processes and Loads

Cloud Composer detects new files arriving in Cloud Storage and executes the processing DAG in sequence: schema validation → format cleaning → dimension table loading → fact table loading → void-record handling → watermark update. Every task is monitored, retried on transient failure, and logged. The Composer environment has alerting configured for any task that exceeds expected duration.

Business Outcome

Clean Data in BigQuery Before Business Hours

By the time franchise managers and finance teams arrive at their desks, yesterday's operational data is cleaned, validated, and ready to query in BigQuery. No CSV exports. No manual upload steps. No engineer required under normal operating conditions.

Deliverable 3: Self-Healing Schema Evolution

SaaS vendors release software updates that occasionally add new data fields — a new column on an appointment record, a new status type in the memberships table. Without schema evolution handling, every such change would require an engineering intervention to update the pipeline.

Our pipeline detects schema changes automatically. When the vendor adds a new field, the Airflow DAG reads the updated data definitions, issues an ALTER TABLE ADD COLUMN on the relevant BigQuery table, and continues loading data without manual intervention. Routine vendor updates are now a zero-touch event.

Full Pipeline Architecture

The pipeline spans two cloud providers (AWS and GCP) and three BigQuery data layers. Every component is managed, monitored, and audited.

End-to-End Architecture — SaaS Vendor S3 to BigQuery Data Warehouse

The Three-Layer BigQuery Architecture

A defining engineering decision was implementing a three-layer data architecture rather than loading everything into a single BigQuery dataset. This separation is what provides both data integrity and analytical flexibility — errors in any one layer cannot corrupt the others.

Layer 1 — Raw

Exact Source Replica

An exact copy of CSV files as received from the vendor, with audit columns added: _load_timestamp and _batch_id. Used for audit trails and source reprocessing if needed. Never modified or overwritten after load. The raw layer is the source of truth for the staging transformation.

Layer 2 — Staging

Business Rules Applied

The cleaned, deduplicated, business-rule-applied layer. Soft-deleted records are resolved, duplicate rows from vendor retry behavior are removed, and data types are normalized to BigQuery-native types. All analytical models and dashboards reference this layer exclusively.

Layer 3 — Data Marts

Domain-Specific Analytics

Purpose-built analytical models for each franchise operational domain: appointments, revenue, memberships, and KPIs. Built with Dataform (BigQuery-native) in Phase 2. These are the models that power Looker dashboards, Power BI reports, and ad-hoc analysis — queryable by any BI tool.

Four Engineering Decisions That Made This Production-Ready

A pipeline that works in a demo can still fail in production if it hasn't been designed for the failure modes that occur in real multi-vendor, multi-cloud environments. These four decisions addressed the specific failure modes we knew would appear.

Idempotent Loads

Every batch of data is tracked in a metadata table with a unique batch ID. Before loading any file, the pipeline checks the log. If the batch has already been successfully processed, the load is skipped. This means that pipeline retries on transient failures never produce duplicate rows in BigQuery — even if the same file is presented to the pipeline multiple times. This is non-negotiable for any production pipeline that handles financial data.

Schema Evolution

Rather than hardcoding column lists in the pipeline DAG (which breaks the moment the vendor adds a field), we read BigQuery's table schema at runtime and compare it against the incoming CSV headers. For any new column detected, the pipeline executes ALTER TABLE ADD COLUMN before the load task runs. Routine SaaS vendor software updates — which previously required engineering intervention — are now a zero-touch event.

3-Layer Separation

Keeping Raw, Staging, and Marts as distinct BigQuery datasets provides three independent safety nets. Raw preserves the exact vendor output for audit and replay. Staging holds the cleaned, analytics-ready data. Marts are built from Staging for each business domain. An error in the Staging transformation logic doesn't corrupt the Raw data — you can reprocess from source without a full re-migration from S3.

8-Hour Transfer Cycle

We configured Storage Transfer Service to run every 8 hours rather than once nightly. If the vendor's overnight export is delayed by 2–3 hours (a common occurrence with SaaS platforms on maintenance windows), the pipeline still has two additional transfer windows before business users arrive. An 8-hour cycle provides a meaningful reliability buffer at negligible additional cost.

Schema Evolution — Airflow Implementation

The schema evolution detection is the most technically differentiated component of this pipeline. Here is the core logic from the Airflow DAG task that handles new vendor columns:

dags/franchising_pipeline/tasks/schema_evolution.py Cloud Composer · Schema Auto-Detection

# Read the actual BigQuery table schema
def detect_and_apply_schema_changes(table_id: str, incoming_csv_headers: list[str]):
    client = bigquery.Client()
    table = client.get_table(table_id)

    # Get existing column names from BigQuery
    existing_columns = {field.name for field in table.schema}
    incoming_columns = {col.lower() for col in incoming_csv_headers}

    # Detect new columns the vendor has added
    new_columns = incoming_columns - existing_columns - AUDIT_COLUMNS

    if new_columns:
        logging.info(f"Schema evolution detected: {new_columns}")
        updated_schema = list(table.schema)

        for col_name in new_columns:
            # Infer type from sample data — default to STRING for safety
            bq_type = infer_bq_type(col_name, sample_rows)
            updated_schema.append(
                bigquery.SchemaField(col_name, bq_type, mode="NULLABLE")
            )
            logging.info(f"Adding column: {col_name} ({bq_type})")

        # Apply ALTER TABLE ADD COLUMN equivalent via schema update
        table.schema = updated_schema
        client.update_table(table, ["schema"])
        logging.info(f"Schema updated for {table_id} — added {len(new_columns)} columns")

    return len(new_columns)  # 0 if no changes needed

Result

When the SaaS vendor's development team ships a new data field — a new appointment status, a new payment type, a new membership attribute — the pipeline detects it on the next run, updates BigQuery's schema, and continues loading. No Slack message to engineering. No pipeline pause. No missed data. The client has been running this in production for months with zero schema-evolution incidents.

Idempotent Load Guard — Metadata Table Pattern

dags/franchising_pipeline/tasks/idempotent_load.py Cloud Composer · No-Duplicate Load Guard

# Check metadata table before loading any batch
def is_batch_already_processed(batch_id: str, table_id: str) -> bool:
    client = bigquery.Client()
    query = f"""
        SELECT COUNT(*) as cnt
        FROM `{METADATA_TABLE}`
        WHERE batch_id = '{batch_id}'
          AND target_table = '{table_id}'
          AND status = 'SUCCESS'
    """
    result = client.query(query).result()
    return next(result).cnt > 0


def load_batch_with_guard(batch_id: str, gcs_uri: str, table_id: str):
    if is_batch_already_processed(batch_id, table_id):
        logging.info(f"Batch {batch_id} already loaded for {table_id} — skipping")
        return  # Idempotent: safe to retry at any point

    # Proceed with load job
    execute_bq_load_job(gcs_uri, table_id)

    # Record success in metadata table
    write_batch_metadata(batch_id, table_id, status='SUCCESS')

8-Week Delivery Timeline

Wk 1–2

Discovery, Schema Mapping & S3 Access SetupFull audit of the vendor's S3 export structure across all 140+ table types. Schema mapping from vendor CSV formats to BigQuery-native data types. Storage Transfer Service configuration. GCS bucket architecture design. Security model: IAM roles, service accounts, cross-account S3 access policy.

Wk 3–4

Historical Migration — All 140+ TablesFull historical data load from S3 into BigQuery Raw layer. Row-count validation against vendor source. Data type anomaly detection and resolution. Audit column injection (_load_timestamp, _batch_id). Historical validation sign-off from client stakeholders.

Wk 5–6

Airflow DAG Development — Incremental PipelineCloud Composer environment provisioning. DAG development: GCS trigger → schema check → dimension load → fact load → void handling → watermark update. Metadata table design for idempotent loads. Schema evolution detection logic. Monitoring alerts and retry policies.

Wk 7

Staging Layer + Business RulesStaging transformation logic for deduplication, soft-delete resolution, and data type standardization across all 140+ tables. Row-count and completeness validation for staging vs raw. Data quality checks added as Airflow tasks.

Wk 8

Production Cutover, Documentation & HandoverProduction deployment of the daily incremental pipeline. End-to-end pipeline test across 2 full transfer cycles. Operations runbook: what to do if a transfer fails, how to trigger a manual backfill, how to monitor DAG health. Handover to client team. Phase 2 scope definition: Dataform Data Marts and BI dashboard build.

Business Impact: What Changed for the Client

Before

Data Sovereignty

Operational data existed in S3 but was only accessible through the vendor's reporting interface. The client had no independent access to their own records. Custom analysis was impossible without manual CSV exports.

After

Data Sovereignty

140+ tables of operational data live in a BigQuery environment the client fully controls. No dependency on vendor reporting. Custom dashboards, cross-location analysis, and financial forecasting are all available from the same dataset.

Before

Cross-Location Analysis

Comparing performance across franchise locations required manually downloading location-specific reports, combining CSVs in Excel, and reconciling inconsistencies. Finance team spent hours weekly on this process.

After

Cross-Location Analysis

All 140+ tables from all locations are in a single BigQuery dataset. Cross-location queries execute in seconds. Revenue, membership, and appointment analysis across the full franchise network is now a SQL query away.

Before

Data Freshness

Business users were limited to what the vendor's reporting UI showed — often delayed by the vendor's own processing cycles. Real-time insight into previous day's performance was not possible.

After

Data Freshness

Previous day's operational data is in BigQuery before business hours every morning. The 8-hour transfer cycle ensures even delayed vendor exports are captured before the start of the working day. Data lag is consistently under 12 hours.

Engineering Best Practices for SaaS-to-BigQuery Pipelines

Never Hardcode Column Lists

Any pipeline that hardcodes expected columns from a SaaS vendor will break when the vendor ships an update. Always read the incoming schema at runtime and handle new columns programmatically. The cost of schema evolution logic upfront is far lower than the cost of pipeline failures in production.

Batch IDs Are Non-Negotiable

Every data load must have a unique, trackable batch ID. Without it, retries produce duplicate rows that corrupt aggregate analyses. A metadata tracking table costs nothing to build and prevents the most common class of pipeline data quality failures.

Separate Raw from Analytics

Raw data should be immutable. Never run business logic transformations directly on the source load. If a transformation logic bug produces incorrect staging data, you need the ability to re-derive from raw without re-migrating from the vendor. Layer separation is cheap; re-migrations are expensive.

Over-Engineer the Transfer Schedule

An 8-hour transfer cycle instead of a 24-hour one costs a negligible amount more but provides a meaningful reliability buffer for vendor export delays, GCP maintenance windows, and network hiccups. Budget for reliability at the transfer scheduling layer, not just the processing layer.

Alert Before Failure, Not After

Configure Airflow alerts on task duration as well as task failure. If a load that normally takes 5 minutes starts taking 25, it will fail at 30. An SLA miss alert at 10 minutes gives you time to investigate before the pipeline fails and business users are impacted.

Document the Vendor's Schema Carefully

Vendor CSV exports often have undocumented columns, inconsistent naming conventions, and unexpected data types. The first week of any SaaS pipeline project should include a thorough schema audit and data profiling pass — not assumption-based schema definition. What the vendor's docs say and what the actual exports contain are often different.

Who Else Has This Problem

The SaaS-vendor S3 export → BigQuery pattern appears across virtually every industry that uses vertical SaaS platforms. If any of these indicators apply to your business, you have the same problem this franchising company had before this engagement:

You run cross-location analysis by manually exporting CSVs and combining them in Excel or Google Sheets.
Your vendor charges extra for API access, data exports, or BI connectors — but offers a basic S3 export for free.
Finance or operations teams wait on reports that could be automated if the data were in a proper warehouse.
Your business is scaling faster than the vendor's reporting tools can support — you've outgrown their dashboard suite.
You suspect data quality issues but have no way to audit the vendor's source records independently.
A BI team wants to build custom dashboards but can't because the data isn't accessible in a queryable format.

The total engineering effort for a full historical migration plus automated daily pipeline is typically 6–10 weeks depending on source complexity and table count. The ongoing operational cost — primarily Cloud Composer and BigQuery storage — is a fraction of the value of the analytical capability delivered. The pipeline pays for itself in the first week of analyst time saved on manual CSV exports.

The Generalization

This pattern works for any SaaS platform that offers S3 export access: salon and wellness management, restaurant POS, franchise operations software, property management systems, event management platforms, HR and payroll systems. If the data is in S3, we can get it into BigQuery — automated, idempotent, and self-healing — in 6–10 weeks.

Pipeline Metrics

140+

Operational table types migrated from S3

BigQuery data layers: Raw, Staging, Marts

8hr

Automated transfer cycle (S3 → GCS)

Manual steps in day-to-day operation

8 wk

Historical migration to production pipeline

← Back to Launchpad Build Your Data Pipeline →

Zero to Data Warehouse — Automated AWS S3 → BigQuery Pipelinefor a US Franchising Business

The Constraint: Data Locked in a Vendor's Ecosystem

The Broader Pattern

What We Built — Three Distinct Deliverables

Deliverable 1: Complete Historical Migration

Deliverable 2: Fully Automated Daily Incremental Pipeline

Deliverable 3: Self-Healing Schema Evolution

Full Pipeline Architecture

The Three-Layer BigQuery Architecture

Exact Source Replica

Business Rules Applied

Domain-Specific Analytics

Four Engineering Decisions That Made This Production-Ready

Schema Evolution — Airflow Implementation

Idempotent Load Guard — Metadata Table Pattern

8-Week Delivery Timeline

Business Impact: What Changed for the Client

Data Sovereignty

Data Sovereignty

Cross-Location Analysis

Cross-Location Analysis

Data Freshness

Data Freshness

Engineering Best Practices for SaaS-to-BigQuery Pipelines

Never Hardcode Column Lists

Batch IDs Are Non-Negotiable

Separate Raw from Analytics

Over-Engineer the Transfer Schedule

Alert Before Failure, Not After

Document the Vendor's Schema Carefully

Who Else Has This Problem

Zero to Data Warehouse —
Automated AWS S3 → BigQuery Pipeline
for a US Franchising Business