Architecture Entry: 001

Zero-Downtime
Database_Migrations

Published: 2025.03.15 Read_Time: 8 min

The Challenge

PHASE_01

Dual-Write Proxy

Every write operation is applied to the legacy Oracle database first and then mirrored asynchronously to the new PostgreSQL cluster. This creates a near-real-time shadow of the production data.

PHASE_02

Shadow Traffic

Read queries are progressively routed to the new database while Oracle remains the source of truth. We compare the results from both databases in real time to validate consistency.
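The shadow-read comparison described above can be sketched roughly as follows. This is a minimal illustration, not the Project Nimbus implementation: `ShadowReader`, its `sampleRate` parameter, and the `query(sql, params)` interface on both connections are hypothetical names chosen for the example, and the comparison assumes flat rows (no nested objects).

```javascript
// Hypothetical sketch: `primary` and `shadow` are any objects exposing an
// async query(sql, params) method. This is not a real driver API.
class ShadowReader {
    constructor(primary, shadow, sampleRate = 0.1, random = Math.random) {
        this.primary = primary;       // source of truth (Oracle in the article)
        this.shadow = shadow;         // candidate database (PostgreSQL in the article)
        this.sampleRate = sampleRate; // fraction of reads that are also sent to the shadow
        this.random = random;         // injectable so the sampling can be tested
        this.matches = 0;
        this.mismatches = 0;
        this.lastComparison = null;   // promise of the most recent background comparison
    }

    async read(sql, params = []) {
        // The caller always receives the answer from the source of truth.
        const primaryRows = await this.primary.query(sql, params);

        if (this.random() < this.sampleRate) {
            // Compare in the background so the caller never waits on the shadow.
            this.lastComparison = this.shadow.query(sql, params)
                .then(shadowRows => this.record(primaryRows, shadowRows))
                .catch(() => { this.mismatches += 1; }); // a shadow failure counts as a mismatch
        }
        return primaryRows;
    }

    record(primaryRows, shadowRows) {
        // Ignore row order and column order: serialize each flat row with
        // sorted keys, then sort the serialized rows.
        const normalize = rows =>
            rows.map(row => JSON.stringify(row, Object.keys(row).sort())).sort();
        const same =
            JSON.stringify(normalize(primaryRows)) === JSON.stringify(normalize(shadowRows));
        if (same) this.matches += 1; else this.mismatches += 1;
        return same;
    }
}
```

Raising `sampleRate` step by step is one way to route reads "progressively" without ever exposing callers to the new database's answers before they have been validated.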

PHASE_03

Progressive Cut-Over

Once measured consistency reaches at least 99.999%, traffic is shifted to PostgreSQL as the primary. Oracle becomes a read-only fallback with automatic rollback capability.
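One way to picture a staged cut-over with automatic rollback is the sketch below. It is an assumption-laden illustration rather than the actual rollout logic: the `STAGES` fractions, the `CutOverController` class, and the hash-based `route` function are all hypothetical. Only the 99.999% threshold comes from the article.

```javascript
// Hypothetical rollout stages: fraction of traffic served by the new database.
const STAGES = [0.01, 0.1, 0.5, 1.0];

class CutOverController {
    constructor(threshold = 0.99999) {
        this.threshold = threshold; // consistency required to advance (from the article)
        this.stage = -1;            // -1 means all traffic stays on the legacy database
    }

    get fraction() {
        return this.stage < 0 ? 0 : STAGES[this.stage];
    }

    // Advance one stage if consistency holds; otherwise send everything back to legacy.
    step(consistency) {
        if (consistency < this.threshold) {
            this.stage = -1;
            return 'ROLLED_BACK';
        }
        if (this.stage < STAGES.length - 1) this.stage += 1;
        return this.fraction === 1 ? 'CUT_OVER_COMPLETE' : 'ADVANCED';
    }

    // Stable routing: the same key always lands on the same side at a given stage.
    route(key) {
        let hash = 0;
        for (const ch of String(key)) {
            hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep as unsigned 32-bit
        }
        return (hash % 10000) / 10000 < this.fraction ? 'postgres' : 'oracle';
    }
}
```

Hashing the key, rather than drawing a random number per request, keeps a given tenant or record on one database for the whole stage, which makes any discrepancy easier to trace.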


Implementation Script

// ZERO_DOWNTIME_MIGRATION :: DUAL_WRITE_PROXY
// ConsistencyMonitor, AlertSystem, and Log are assumed to be provided elsewhere in the codebase.
class DualWriteProxy {
    constructor(sourceDB, targetDB) {
        this.source = sourceDB;
        this.target = targetDB;
        this.metrics = new ConsistencyMonitor();
    }

    async write(operation) {
        // Execute on primary (source) first
        const sourceResult = await this.source.execute(operation);

        // Mirror to target asynchronously
        this.target.execute(operation)
            .then(targetResult => {
                this.metrics.compare(sourceResult, targetResult);
            })
            .catch(err => {
                this.metrics.recordDivergence(operation, err);
                AlertSystem.notify('SHADOW_WRITE_FAIL');
            });

        return sourceResult;
    }

    async cutOver() {
        if (this.metrics.consistency < 0.99999) {
            throw new Error('CONSISTENCY_THRESHOLD_NOT_MET');
        }
        // Swap primary and secondary
        [this.source, this.target] = [this.target, this.source];
        Log.critical('CUT_OVER_COMPLETE');
    }
}

When we took on the Project Nimbus platform redesign, the most daunting challenge wasn't the new architecture; it was the migration. We had 12 TB of media metadata in a legacy Oracle database serving more than 14,000 requests per second at peak. Downtime wasn't an option.

The traditional "maintenance window" migration simply doesn't work at this scale. Even a 30-minute window would mean thousands of failed uploads, broken CDN links, and angry content managers across multiple time zones.

Our solution was a three-phase progressive migration strategy that allowed us to validate every single record before committing to the cut-over. The dual-write proxy pattern was key: it let us build confidence in the new system while keeping the old one as the safety net.

The entire migration took 6 weeks from first shadow write to final cut-over. Zero records lost. Zero requests dropped. The new PostgreSQL cluster now handles the same load with 40% lower latency and significantly reduced operational costs.

Key Learnings

Critical takeaways from executing a zero-downtime migration on a production system serving thousands of concurrent users.

01

Never Trust, Always Verify

Shadow traffic comparison caught 23 edge-case data transformation bugs that unit tests missed entirely. Production traffic is the ultimate integration test.

02

Rollback is Non-Negotiable

Every cut-over step maintained a 30-second automatic rollback. We never needed it, but its existence made the team confident enough to proceed.

03

Metrics Before Commitment

The 99.999% consistency threshold was chosen deliberately. Anything less, and we waited. This discipline prevented two premature cut-over attempts.
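To make the threshold concrete: 99.999% allows at most one divergent write per 100,000. The implementation script refers to a `ConsistencyMonitor` without showing it, so the following is only a guess at a minimal version that exposes the same three members (`compare`, `recordDivergence`, `consistency`). The counting scheme and the JSON-based equality check are assumptions, not the original code.

```javascript
// Hypothetical minimal ConsistencyMonitor compatible with the DualWriteProxy listing.
class ConsistencyMonitor {
    constructor() {
        this.total = 0;        // mirrored writes observed
        this.divergent = 0;    // writes whose results differed or whose mirror failed
        this.divergences = []; // details kept for later investigation
    }

    // Called when both writes succeeded: count a divergence if the results differ.
    compare(sourceResult, targetResult) {
        this.total += 1;
        // Assumption: results are JSON-serializable with stable key order.
        if (JSON.stringify(sourceResult) !== JSON.stringify(targetResult)) {
            this.divergent += 1;
        }
    }

    // Called when the mirrored write failed outright.
    recordDivergence(operation, err) {
        this.total += 1;
        this.divergent += 1;
        const message = err instanceof Error ? err.message : String(err);
        this.divergences.push({ operation, message });
    }

    get consistency() {
        // With no observations, report 0 so a cut-over check cannot pass vacuously.
        return this.total === 0 ? 0 : (this.total - this.divergent) / this.total;
    }
}
```

Under this counting scheme, 99,999 matching writes plus one failure gives exactly 0.99999 and passes the `cutOver()` check in the script, while a second failure in the same 100,000 writes drops it to 0.99998 and blocks the cut-over.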