Migrations: keep the old path working
If the old path breaks during a migration, you lose your escape hatch. We prefer expand/contract patterns that keep rollback real.
Most migration failures are not caused by the final cutover.
They’re caused by the weeks of partial state leading up to it.
During a migration, you are running two versions of reality at once: old code and new code, old data and new data. The dangerous part is the in-between, when it “mostly works” until an edge case hits.
If the old path breaks during that in-between, you lose your escape hatch. You can’t roll back because there’s nothing to roll back to.
Constraints
Data migrations live inside uncomfortable constraints:
- production is serving real traffic while data is changing shape
- multiple versions of the application exist during rollout
- partial backfills are indistinguishable from bugs unless you have clear checks
There are a few additional constraints that don’t get enough airtime:
- migrations compete with customers. Backfills consume DB connections, cache, and queue throughput.
- migrations create mixed populations. Some rows are in the new shape, some aren’t, and both must be valid.
- verification is harder than the script. “Job finished” is not the same as “system is correct.”
- the worst failures look like product bugs. You don’t get a clean exception. You get “some users can’t do X.”
In that world, “we can always roll back” is often fiction.
Rollback only exists if the old path still works.
What we changed
We stopped doing migrations as a single event.
Instead, we favor an expand/contract pattern.
The point is not the shape of the steps. The point is to keep a real escape hatch at every stage.
Expand (add without switching)
First we add the new shape alongside the old.
- new columns, new tables, new indexes
- new code paths that can write or read the new shape (behind a flag)
At this stage we do not remove anything. We make it possible for the new world to exist without making the old world invalid.
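For a relational store, the expand step is usually just additive DDL. A minimal sketch, assuming SQLite-flavored SQL and an existing `orders` table; the table, column, and index names are illustrative, not a real schema:

```python
# Sketch of an "expand" step. Every statement is additive: old readers and
# writers can ignore all of it, so the old world stays valid.
import sqlite3

EXPAND_STEPS = [
    # New column alongside the old one; nothing writes it yet.
    "ALTER TABLE orders ADD COLUMN amount_decimal REAL",
    # New table for the target shape; nothing reads it yet.
    """CREATE TABLE IF NOT EXISTS order_events (
           id INTEGER PRIMARY KEY,
           order_id INTEGER NOT NULL,
           payload TEXT NOT NULL
       )""",
    # Index the new access path before anything depends on it.
    "CREATE INDEX IF NOT EXISTS idx_order_events_order ON order_events(order_id)",
]


def expand(conn: sqlite3.Connection) -> None:
    for statement in EXPAND_STEPS:
        conn.execute(statement)
    conn.commit()
```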
Dual-write (only when needed)
Sometimes we can backfill and then switch reads without dual-writing. Often we can’t.
When we do need dual-write, we treat it like a high-risk feature:
- writes must be idempotent
- failures must be observable (we need a counter we can page/ticket on)
- we choose an ordering (old then new, or new then old) and define what “partial success” means
Dual-write is where migrations quietly fail. If you can’t explain how to recover from “old wrote, new failed,” don’t ship dual-write yet.
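A minimal sketch of old-then-new ordering, with `write_old`, `write_new`, and a `metrics` hook as stand-ins rather than a real API. The point is that the new write is keyed for idempotent retries and that its failure is counted, not swallowed:

```python
# Sketch only: write_old / write_new / metrics are hypothetical stand-ins.
import logging

log = logging.getLogger("dual_write")


def dual_write(order: dict, write_old, write_new, metrics) -> None:
    """Old-then-new ordering: the old store stays the source of truth.

    "Partial success" here means: old succeeded, new failed. We record it and
    move on; a backfill or repair job closes the gap later.
    """
    # 1. Old write first. If this fails, the request fails exactly as before.
    write_old(order)

    # 2. New write second, keyed by the same id so retries are idempotent
    #    (an upsert / INSERT ... ON CONFLICT in the real store).
    try:
        write_new(order)
    except Exception:
        # Observable, not silent: this counter is what we page/ticket on.
        metrics("dual_write.new_failed")
        log.warning("new-store write failed for order %s; old write kept", order["id"])
```

In this sketch, a failure in the new store degrades to "the new copy is behind," which is exactly the state a backfill already knows how to repair.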
Backfill (throttled + measurable)
Backfills are long-running production work. We run them like we run batch jobs:
- small batches
- a throttle you can turn down
- retry with limits
- progress metrics (rows processed, rows remaining, error rate)
- explicit stop conditions
We also make backfills restartable. If a backfill can’t be safely resumed, you don’t have a backfill—you have a cutover.
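One shape this can take, as a sketch: a batched loop with a tunable sleep, a file checkpoint so it can resume, and an explicit error-rate stop condition. The table, column, and checkpoint names are hypothetical, and the SQL assumes a SQLite-style connection:

```python
# Sketch of a throttled, restartable backfill. Stop it at any point; resuming
# picks up from the last committed batch via the checkpoint file.
import json
import time
from pathlib import Path

BATCH_SIZE = 500          # small batches
SLEEP_SECONDS = 0.2       # the throttle you can turn down (or up)
MAX_ERROR_RATE = 0.01     # explicit stop condition
CHECKPOINT = Path("backfill_orders.checkpoint")


def backfill(conn) -> None:
    last_id = json.loads(CHECKPOINT.read_text())["last_id"] if CHECKPOINT.exists() else 0
    processed = errors = 0

    while True:
        rows = conn.execute(
            "SELECT id, amount_cents FROM orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not rows:
            break  # done

        for row_id, amount_cents in rows:
            try:
                # Idempotent: re-running the same row produces the same result.
                conn.execute(
                    "UPDATE orders SET amount_decimal = ? WHERE id = ?",
                    (amount_cents / 100, row_id),
                )
            except Exception:
                errors += 1
            processed += 1
            last_id = row_id

        conn.commit()
        CHECKPOINT.write_text(json.dumps({"last_id": last_id}))  # restartable
        # Progress metrics; in production these would be real counters/gauges.
        print(f"processed={processed} errors={errors} last_id={last_id}")

        if processed and errors / processed > MAX_ERROR_RATE:
            raise SystemExit("error rate above threshold; stopping backfill")

        time.sleep(SLEEP_SECONDS)
```

Pausing is just stopping the process; resuming re-reads the checkpoint and continues from the last committed batch.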
Dual-read / shadow compare (prove it before it’s the truth)
Before we switch the source of truth, we try to prove the new path in production.
Two patterns we use:
- dual-read with old as source of truth: read old, read new, compare, and log mismatches (sampled)
- shadow reads: the request uses the old read path, but we run the new read path in the background and compare
We want checks that catch missing rows, wrong mappings, off-by-one logic, timezone issues, and null-handling differences.
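A sketch of the first pattern, with `read_old` and `read_new` as stand-ins. The caller always gets the old result; only a sample of reads is compared so the overhead stays bounded:

```python
# Sketch of a sampled dual-read with old as source of truth.
import logging
import random

log = logging.getLogger("shadow_compare")
SAMPLE_RATE = 0.05  # compare ~5% of reads to keep the overhead bounded


def get_order(order_id, read_old, read_new):
    old = read_old(order_id)  # source of truth, returned to the caller

    if random.random() < SAMPLE_RATE:
        try:
            new = read_new(order_id)
        except Exception:
            log.warning("shadow read failed for %s", order_id)
        else:
            if new != old:
                # Mismatches are the signal: missing rows, wrong mappings,
                # timezone and null-handling differences all surface here.
                log.warning("shadow mismatch for %s: old=%r new=%r", order_id, old, new)

    return old
```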
Contract (delete last)
Only after the new path has proven itself do we remove the old one.
Contract is not “delete the table.” Contract is a sequence:
- remove dual-read/shadow compare
- remove dual-write
- remove the old read path
- remove the old write path
- remove the old schema
If you delete the old code early, you don’t have rollback. You have a theory.
Two guardrails that matter
We keep the migration honest with two guardrails:
- explicit stop conditions (“if error rate rises, stop the backfill”)
- verification that is not “looks fine” (counts, checksums, spot comparisons)
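To make the second guardrail concrete, here is a sketch of checks that go beyond "looks fine": row counts, an aggregate that should agree between the two shapes, and a random spot comparison. Names are hypothetical and the SQL is SQLite-flavored:

```python
# Sketch of verification checks cheap enough to run after a pause.

def verify(conn) -> dict:
    checks = {}

    # Counts: did every row that should have the new shape get it?
    checks["old_rows"] = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    checks["migrated_rows"] = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount_decimal IS NOT NULL"
    ).fetchone()[0]

    # Checksum-style aggregate: totals should agree between shapes.
    checks["cents_total"] = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM orders"
    ).fetchone()[0]
    checks["decimal_total_cents"] = conn.execute(
        "SELECT COALESCE(ROUND(SUM(amount_decimal) * 100), 0) FROM orders"
    ).fetchone()[0]

    # Spot comparison: a random handful of rows, field by field.
    # (Float comparison simplified for the sketch.)
    checks["spot_mismatches"] = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT * FROM orders ORDER BY RANDOM() LIMIT 100) AS sample "
        "WHERE amount_decimal IS NULL OR ROUND(amount_decimal * 100) != amount_cents"
    ).fetchone()[0]

    return checks
```

These are also the checks the runbook can point to when deciding whether it is safe to resume after a pause.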
We also attach a runbook to the migration work itself.
A migration runbook answers:
- where to see progress
- how to pause/stop the job
- how to turn off the new path (flag / config)
- how to verify correctness after a pause
- when to page vs ticket
If you need to wake up a specific engineer to operate the migration, the migration isn’t ready.
What we stopped doing
We explicitly stopped doing a few “normal” migration habits:
- shipping the new path and immediately deleting the old one because “it’s cleaner”
- backfilling at max speed because “we want it done”
- declaring victory when the job finishes, without defining correctness checks
- treating rollback as “we’ll figure it out if it breaks”
A migration is not a script. It’s a rollout.
Results / Measurements
The visible result is fewer midnight cutovers.
The operational result is smaller incidents when something goes wrong: you can turn off the new path and continue serving traffic.
We look at a few proxies:
- time-to-reverse a migration-related regression. If reversal requires a bespoke plan, the migration was not staged enough.
- migration-induced pages. If the backfill itself pages on-call frequently, the job needs throttling/guardrails.
- time-to-verify after a pause. If you can’t tell whether it’s safe to resume, you don’t have good checks.
We also look for a quieter kind of result: fewer “we’re not sure what state it’s in” conversations. When rollback is real, the incident room spends less time debating theories and more time choosing the next safe step.
Example: migrating a billing table
On one client system, we migrated a billing table that had grown from “a few hundred rows” to “millions of rows with a decade of history.” The first suggestion on the table was a single in-place rewrite: new schema, one script, one big cutover night.
Instead, we applied the expand/contract pattern.
- Expand: we added a new table with the target schema, plus a small adapter to write new events into both tables.
- Backfill: we ran a throttled job that copied data in batches, with progress logged every few thousand rows.
- Dual-read: for read paths, we added a shadow compare that warned us when the new table disagreed with the old one.
- Contract: only after a week of clean comparisons did we flip the primary reads, then gradually remove the old writes.
Halfway through the backfill, a bug in the adapter showed up: a handful of rows in the new table were missing a field that the old table still had.
Because the old path still worked, our options were boring:
- stop the backfill
- fix the adapter
- re-run the failing chunk
If we had rewritten the table in place, the options would have been different: data repair under pressure, or accepting a permanent gap.
How we measure “boring” migrations
A migration is successful when it’s boring for everyone else.
We track that in a few ways:
- Incident shape. Migration-related incidents shift from “full outage” to “degraded behavior on a small slice” that we can reverse.
- On-call posture. Migrations move from “we need everyone in a room” to “one person can watch a dashboard and pause if needed.”
- Lead time to safely retry. If a migration fails, we should be able to fix the issue, reset the job, and try again without inventing a new plan.
When those numbers improve, we know the pattern is working even if no one outside the team ever hears about the migration.
What changes in planning
Treating “keep the old path working” as a requirement changes how we plan work:
- Product conversations include migration cost. We talk about “how we’ll get there” before we commit to “what we’re building.”
- Design docs include rollout steps and rollback steps, not just target schemas and API shapes.
- Estimation includes backfill time, verification work, and cleanup—not just implementation.
That can feel slower at first. In practice, it replaces one big unknown with a series of smaller, reviewable decisions.
We also make a small but important change in ownership: the team that designs the migration owns the runbook and staffs the first rollout. Afterward, anyone on the rotation should be able to run it.
When we started treating “keep the old path working” as non-negotiable, rollback became real. The incident room got quieter because the first ten minutes had an obvious move: turn off the new path, stabilize, then investigate.
Takeaways
A migration is not a script. It’s a rollout.
If you can’t keep the old path working, you don’t have rollback.
Treat “delete the old code” as the final step, not the proof of completion.
Add migration work to the design, not just to the deploy checklist. If you don’t know how to pause, verify, and reverse the change, you don’t have a migration plan yet—you have a batch job.