DELIVERY · 2021-08-03 · BY JONAS "JO" CARLIN

Story: the migration that stalled because nobody owned the last 10%

A cross-team migration mostly worked, then stalled on edge cases. We describe what finally got it finished.

delivery, migrations, ownership, scope

What happened

We planned a migration that touched three teams.

It looked simple on paper:

move a set of records from an old table to a new one
point reads and writes at the new schema
clean up the leftovers

We agreed on the high-level plan.

We did not agree on who owned the last 10%.

The first parts went smoothly:

the new schema was created
the bulk of data was backfilled
most traffic moved to the new path

Then we hit the edge cases:

records with surprising combinations of flags
legacy states nobody had seen in months
partially-migrated data from an even older system

Progress slowed.

Each team assumed another team would pick up the remaining work.

The migration sat in a "mostly done" state for weeks, with both schemas live and extra complexity in code and operations.

Why this mattered

While the migration was stalled, we:

had to keep both schemas healthy
wrote new features twice (or not at all)
carried extra branching logic in hot paths

During an unrelated incident, the existence of two paths made debugging harder:

some calls hit the new schema
some fell back to the old
every dashboard had to be read with the question "which path did this request take?" in mind

The migration’s last 10% had quietly become everyone’s and no one’s job.

What we changed

1. Assign an explicit migration owner

We learned that "shared migration" does not mean "no owner."

For subsequent migrations, we:

assign a single migration owner (a person, not just a team)
give them authority to coordinate work across teams
make it clear they are responsible for getting from 90% to 100%

The owner is not expected to do all the work, but they are expected to:

keep a canonical view of status
push for decisions on edge cases
call out when scope needs to change

2. Make edge cases part of the plan

Instead of treating edge cases as "we’ll see when we get there," we:

identify known legacy states upfront when possible
reserve explicit time and scope for investigating and handling unknown states

We also add a simple section to migration docs:

"What we’ll do with records that don’t fit the plan"

This keeps the last 10% from being a surprise.
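
As a rough illustration of that section, the policy for records that don't fit can be written as a small triage step that runs before the backfill. This is a minimal sketch, not our actual tooling: the state names, the flag combination, and the `Record` shape are all hypothetical placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"        # fits the plan as-is
    TRANSFORM = "transform"    # known legacy state with an agreed mapping
    QUARANTINE = "quarantine"  # does not fit the plan; needs an explicit decision


@dataclass(frozen=True)
class Record:
    id: int
    state: str
    flags: frozenset = frozenset()


# All names below are hypothetical placeholders, not states from a real system.
PLANNED_STATES = {"active", "closed"}
KNOWN_LEGACY_STATES = {"suspended_v1"}
CONFLICTING_FLAGS = [frozenset({"archived", "pending"})]


def classify(record):
    """Decide what the migration does with a single record."""
    # A surprising flag combination overrides an otherwise normal state.
    if any(combo <= record.flags for combo in CONFLICTING_FLAGS):
        return Action.QUARANTINE
    if record.state in PLANNED_STATES:
        return Action.MIGRATE
    if record.state in KNOWN_LEGACY_STATES:
        return Action.TRANSFORM
    # Anything unrecognized is set aside instead of being silently skipped.
    return Action.QUARANTINE


def triage(records):
    """Return per-action counts plus the records that need a decision."""
    counts = Counter()
    quarantined = []
    for record in records:
        action = classify(record)
        counts[action] += 1
        if action is Action.QUARANTINE:
            quarantined.append(record)
    return counts, quarantined
```

The point of the design is that the "doesn't fit" bucket is an explicit output with a count, so the size of the last 10% is visible from the first dry run.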

3. Define exit criteria clearly

We write down what "done" means before the migration starts:

old schema is read-only or removed
new schema is the only path for writes
monitoring and runbooks no longer reference the old path

We avoid calling a migration "done" until these conditions are met.
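
The three conditions above can also be expressed as a check that reports what is still unmet. The sketch below is only an illustration; `MigrationStatus` and its field names are assumptions made for the example, and how each value gets measured is left open.

```python
from dataclasses import dataclass


@dataclass
class MigrationStatus:
    # Hypothetical fields; each one mirrors an exit criterion from the list.
    old_schema_read_only_or_removed: bool
    new_schema_is_only_write_path: bool
    old_path_references_in_runbooks: int  # remaining mentions in monitoring and runbooks


def unmet_exit_criteria(status):
    """Return a human-readable list of the criteria that are not yet satisfied."""
    unmet = []
    if not status.old_schema_read_only_or_removed:
        unmet.append("old schema is still writable")
    if not status.new_schema_is_only_write_path:
        unmet.append("writes still reach a path other than the new schema")
    if status.old_path_references_in_runbooks > 0:
        unmet.append("monitoring or runbooks still reference the old path")
    return unmet


def is_done(status):
    """A migration is done only when no exit criterion is unmet."""
    return not unmet_exit_criteria(status)
```

Returning the list of unmet criteria, instead of a bare boolean, gives the migration owner something concrete to put in the status update.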

4. Time-box and surface stall conditions

We treat long stalls as signals that a decision is overdue:

if a migration sits partially complete beyond an agreed window, it gets escalated
we talk about it in planning and reliability reviews, not just in the migration doc
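
A stall condition like this is simple enough to automate. Here is a minimal sketch, assuming we track a completion percentage and the date progress last changed; the function name, its parameters, and the 14-day window in the usage example are illustrative assumptions, not values we actually use.

```python
from datetime import date, timedelta


def is_stalled(percent_complete, last_progress, today, window):
    """True when an unfinished migration has made no progress within the agreed window.

    All parameters are hypothetical: percent_complete is 0-100, last_progress is
    the date the percentage last changed, and window is the agreed timedelta.
    """
    unfinished = percent_complete < 100
    return unfinished and (today - last_progress) > window


# Example: stuck at 90% for a month against a two-week window.
if is_stalled(90, date(2021, 6, 1), date(2021, 7, 1), timedelta(days=14)):
    print("Migration stalled: escalate to planning and reliability review")
```

The check only flags the stall. What to do about it (push on, cut scope, or keep some legacy behavior) remains a decision for people.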

Sometimes the right answer is to:

cut scope
explicitly keep some legacy behavior

But we make that a decision, not an accident.

Takeaways

Cross-team migrations need a single owner for the last 10%, or they will linger.
Edge cases are part of the work, not an optional add-on.
Clear exit criteria and time-boxed plans keep migrations from becoming permanent complexity.