How Vipra Software replaced a crippling 10-hour nightly SSIS batch run with a PySpark-native data platform, cutting processing time by 80% and engineering a data masking engine that sustains 12 million records per minute.
A major financial services institution was operating a critical reconciliation platform built on SQL Server Integration Services (SSIS) and Oracle stored procedures — a legacy stack that had accumulated over 15 years of business logic, workarounds, and undocumented transformations. Every night, a cascade of SSIS packages ran for 10 hours, processing transaction and reconciliation data before business operations could commence each morning.
The human cost was severe. The operations team had adapted their entire working pattern around the batch window — staff arriving early to assess whether overnight runs had succeeded, manual intervention protocols for the frequent partial failures, and a growing on-call burden as pipeline fragility increased with data volume. Three failed batch runs in a single quarter had caused material delays to regulatory reporting submissions.
The technical debt was compounding. Oracle licensing costs were escalating, the SSIS developer who had built much of the original pipeline had left the organisation, and newer engineering talent had limited appetite to maintain the ageing stack. The platform needed wholesale modernisation, but 10+ TB of historical data and deeply embedded business logic made a straightforward lift-and-shift migration impractical without a structured approach.
Vipra Software began with an intensive discovery phase to fully map the existing pipeline logic before writing a single line of replacement code. This investment in understanding the legacy system proved critical — our discovery uncovered 14 undocumented business rules embedded in SSIS transformation scripts that would have been lost in a naive migration.
The modernised platform is built on a Hadoop/PySpark foundation with Hive metastore for schema management and Apache Airflow for orchestration. The ingestion layer reads from Oracle source databases via JDBC parallel read partitioning — splitting large Oracle tables across multiple Spark executors for maximum parallelism during the nightly load window.
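To make the parallel read concrete, the sketch below mimics how Spark's JDBC reader carves a numeric column range into per-task predicates. This is a simplified illustration only; the column name and bounds are hypothetical, and the real logic lives inside Spark's JDBC data source.

```python
# Simplified model of Spark's JDBC range partitioning: split [lower, upper)
# on a numeric column into one SQL predicate per partition, so each Spark
# executor reads a disjoint slice of the Oracle table in parallel.
# Column name and bounds here are illustrative, not production values.
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also sweeps up NULLs and values below lowerBound.
            predicates.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended to catch values above upperBound.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return predicates
```

In PySpark the equivalent behaviour is configured declaratively via the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` read options; choosing a roughly uniformly distributed partition column is what keeps the slices balanced across executors.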
Delta Lake provides ACID semantics critical for financial data reconciliation — enabling time-travel queries for audit purposes and ensuring that partial job failures can be cleanly rolled back without manual intervention. Partition pruning on transaction date columns reduces I/O by 85% for the daily incremental loads that constitute the majority of nightly processing.
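The I/O saving from partition pruning follows directly from the layout: with data partitioned by transaction date, a date-range filter lets the engine skip every partition directory outside the range at planning time. The toy model below illustrates the idea with plain Python; the partition scheme and dates are illustrative assumptions.

```python
from datetime import date, timedelta

# Toy model of date-based partition pruning: data is laid out in
# txn_date=YYYY-MM-DD partitions, and a filter on txn_date means only
# the matching partitions are ever scanned. The rest are skipped before
# any file I/O happens. Dates here are illustrative.
def pruned_partitions(all_dates, start, end):
    return [d for d in all_dates if start <= d <= end]

# A year of daily partitions, of which a daily incremental load touches one.
year = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]
touched = pruned_partitions(year, date(2024, 6, 1), date(2024, 6, 1))
```

In Delta Lake this happens automatically once the table is partitioned on the date column and queries filter on it; the 85% figure reflects how small the daily increment is relative to the full table.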
The data masking engine deserves particular note: it operates as a standalone PySpark library that can be invoked as part of any pipeline. Masking rules are configuration-driven (YAML schema definitions), enabling the compliance team to add new sensitive field classifications without code changes. The 12M records/minute throughput was validated under production-equivalent data volumes before go-live.
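The shape of a configuration-driven masking pass can be sketched as follows. In the production engine the rules come from YAML schema files and execute as PySpark column expressions; here a plain dict and pure functions stand in, and the field names, rule types, and salt are illustrative assumptions.

```python
import hashlib

# Illustrative rule set; in production this would be loaded from YAML,
# so compliance can reclassify fields without a code change.
MASKING_RULES = {
    "account_number": {"rule": "hash"},         # deterministic token
    "customer_name":  {"rule": "redact"},       # fixed placeholder
    "balance":        {"rule": "passthrough"},  # non-sensitive, kept as-is
}

def mask_value(value, rule, salt="demo-salt"):
    if rule == "hash":
        # Same input always yields the same token, so joins and
        # referential integrity across tables survive masking.
        return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:16]
    if rule == "redact":
        return "***"
    return value

def mask_record(record):
    return {
        field: mask_value(value, MASKING_RULES.get(field, {}).get("rule", "passthrough"))
        for field, value in record.items()
    }
```

Deterministic hashing rather than random substitution is what makes the masked output usable for reconciliation testing: the same account number maps to the same token in every table, so joins behave exactly as they do in production.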
The nightly reconciliation window dropped from 10 hours to under 2 hours in the first week of production operation — an 80% reduction that fundamentally changed the operations team's working pattern. Early morning manual intervention protocols were retired, and the on-call burden was eliminated through Airflow's automated retry and alerting capabilities.
The data masking engine unlocked a capability the team had been unable to achieve for years: production-equivalent data in development and testing environments. This removed a major bottleneck in the engineering release cycle, as developers could now validate changes against realistic data volumes without the 6-week turnaround previously required to obtain anonymised data sets.
Oracle licensing costs were eliminated entirely, representing an additional $340K annual saving beyond the processing efficiency gains. The modernised platform has since been extended to support two additional reconciliation domains that had previously been excluded from the legacy system due to performance constraints.