How Vipra Software replaced a crippling 10-hour nightly SSIS batch run with a PySpark-native data platform, cutting processing time by 80% and engineering a data masking engine that sustains 12 million records per minute.
A major financial services institution was operating a critical reconciliation platform built on SQL Server Integration Services (SSIS) and Oracle stored procedures — a legacy stack that had accumulated over 15 years of business logic, workarounds, and undocumented transformations. Every night, a cascade of SSIS packages ran for 10 hours, processing transaction and reconciliation data before business operations could commence each morning.
The human cost was severe. The operations team had adapted their entire working pattern around the batch window — staff arriving early to assess whether overnight runs had succeeded, manual intervention protocols for the frequent partial failures, and a growing on-call burden as pipeline fragility increased with data volume. Three failed batch runs in a single quarter had caused material delays to regulatory reporting submissions.
The technical debt was compounding. Oracle licensing costs were escalating, the SSIS developer who had built much of the original pipeline had left the organisation, and newer engineering talent had limited appetite to maintain the ageing stack. The platform needed wholesale modernisation, but 10+ TB of historical data and deeply embedded business logic made a straightforward lift-and-shift migration impractical without a structured approach.
Vipra Software began with an intensive discovery phase to fully map the existing pipeline logic before writing a single line of replacement code. This investment in understanding the legacy system proved critical — our discovery uncovered 14 undocumented business rules embedded in SSIS transformation scripts that would have been lost in a naive migration.
The modernised platform is built on a Hadoop/PySpark foundation with Hive metastore for schema management and Apache Airflow for orchestration. The ingestion layer reads from Oracle source databases via JDBC parallel read partitioning — splitting large Oracle tables across multiple Spark executors for maximum parallelism during the nightly load window.
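To make the parallel read concrete, the sketch below mimics how Spark's JDBC reader carves a numeric column range into per-task predicates. This is a simplified illustration only; the column name and bounds are hypothetical, and the real logic lives inside Spark's JDBC data source.

```python
# Simplified model of Spark's JDBC range partitioning: split [lower, upper)
# on a numeric column into one SQL predicate per partition, so each Spark
# executor reads a disjoint slice of the Oracle table in parallel.
# Column name and bounds here are illustrative, not production values.
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also sweeps up NULLs and values below lowerBound.
            predicates.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended to catch values above upperBound.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return predicates
```

In PySpark the equivalent behaviour is configured declaratively via the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` read options; choosing a roughly uniformly distributed partition column is what keeps the slices balanced across executors.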
Delta Lake provides ACID semantics critical for financial data reconciliation — enabling time-travel queries for audit purposes and ensuring that partial job failures can be cleanly rolled back without manual intervention. Partition pruning on transaction date columns reduces I/O by 85% for the daily incremental loads that constitute the majority of nightly processing.
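The I/O saving from partition pruning follows directly from the layout: with data partitioned by transaction date, a date-range filter lets the engine skip every partition directory outside the range at planning time. The toy model below illustrates the idea with plain Python; the partition scheme and dates are illustrative assumptions.

```python
from datetime import date, timedelta

# Toy model of date-based partition pruning: data is laid out in
# txn_date=YYYY-MM-DD partitions, and a filter on txn_date means only
# the matching partitions are ever scanned. The rest are skipped before
# any file I/O happens. Dates here are illustrative.
def pruned_partitions(all_dates, start, end):
    return [d for d in all_dates if start <= d <= end]

# A year of daily partitions, of which a daily incremental load touches one.
year = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]
touched = pruned_partitions(year, date(2024, 6, 1), date(2024, 6, 1))
```

In Delta Lake this happens automatically once the table is partitioned on the date column and queries filter on it; the 85% figure reflects how small the daily increment is relative to the full table.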
The data masking engine deserves particular note: it operates as a standalone PySpark library that can be invoked as part of any pipeline. Masking rules are configuration-driven (YAML schema definitions), enabling the compliance team to add new sensitive field classifications without code changes. The 12M records/minute throughput was validated under production-equivalent data volumes before go-live.
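The shape of a configuration-driven masking pass can be sketched as follows. In the production engine the rules come from YAML schema files and execute as PySpark column expressions; here a plain dict and pure functions stand in, and the field names, rule types, and salt are illustrative assumptions.

```python
import hashlib

# Illustrative rule set; in production this would be loaded from YAML,
# so compliance can reclassify fields without a code change.
MASKING_RULES = {
    "account_number": {"rule": "hash"},         # deterministic token
    "customer_name":  {"rule": "redact"},       # fixed placeholder
    "balance":        {"rule": "passthrough"},  # non-sensitive, kept as-is
}

def mask_value(value, rule, salt="demo-salt"):
    if rule == "hash":
        # Same input always yields the same token, so joins and
        # referential integrity across tables survive masking.
        return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:16]
    if rule == "redact":
        return "***"
    return value

def mask_record(record):
    return {
        field: mask_value(value, MASKING_RULES.get(field, {}).get("rule", "passthrough"))
        for field, value in record.items()
    }
```

Deterministic hashing rather than random substitution is what makes the masked output usable for reconciliation testing: the same account number maps to the same token in every table, so joins behave exactly as they do in production.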
The nightly reconciliation window dropped from 10 hours to under 2 hours in the first week of production operation — an 80% reduction that fundamentally changed the operations team's working pattern. Early morning manual intervention protocols were retired, and the on-call burden was eliminated through Airflow's automated retry and alerting capabilities.
The data masking engine unlocked a capability the team had been unable to achieve for years: production-equivalent data in development and testing environments. This removed a major bottleneck in the engineering release cycle, as developers could now validate changes against realistic data volumes without the 6-week turnaround previously required to obtain anonymised data sets.
Oracle licensing costs were eliminated entirely, representing an additional $340K annual saving beyond the processing efficiency gains. The modernised platform has since been extended to support two additional reconciliation domains that had previously been excluded from the legacy system due to performance constraints.