DELIVERY | 2021-08-03 | BY JONAS "JO" CARLIN

Story: the migration that stalled because nobody owned the last 10%

A cross-team migration mostly worked, then stalled on edge cases. We describe what finally got it finished.

delivery, migrations, ownership, scope

What happened

We planned a migration that touched three teams.

It looked simple on paper:

  • move a set of records from an old table to a new one
  • point reads and writes at the new schema
  • clean up the leftovers

We agreed on the high-level plan.

We did not agree on who owned the last 10%.

The first parts went smoothly:

  • the new schema was created
  • the bulk of data was backfilled
  • most traffic moved to the new path

Then we hit the edge cases:

  • records with surprising combinations of flags
  • legacy states nobody had seen in months
  • partially-migrated data from an even older system

Progress slowed.

Each team assumed another team would pick up the remaining work.

The migration sat in a "mostly done" state for weeks, with both schemas live and extra complexity in code and operations.

Why this mattered

While the migration was stalled, we:

  • had to keep both schemas healthy
  • wrote new features twice, once per schema (or not at all)
  • carried extra branching logic in hot paths

During an unrelated incident, the existence of two paths made debugging harder:

  • some calls hit the new schema
  • some fell back to the old
  • dashboards had to be interpreted through the lens of "which path did this go through?"

The migration’s last 10% had quietly become everyone’s and no one’s job.

What we changed

1. Assign an explicit migration owner

We learned that "shared migration" does not mean "no owner."

For subsequent migrations, we:

  • assign a single migration owner (a person, not just a team)
  • give them authority to coordinate work across teams
  • make it clear they are responsible for getting from 90% to 100%

The owner is not expected to do all the work, but they are expected to:

  • keep a canonical view of status
  • push for decisions on edge cases
  • call out when scope needs to change

2. Make edge cases part of the plan

Instead of treating edge cases as "we’ll see when we get there," we:

  • identify known legacy states upfront when possible
  • reserve explicit time and scope for investigating and handling unknown states

We also add a simple section to migration docs:

  • "What we’ll do with records that don’t fit the plan"

This keeps the last 10% from being a surprise.
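
A minimal sketch of what that section could turn into during a backfill, assuming records are plain dictionaries with a "state" field. The state names and the `triage` helper are hypothetical and only illustrate the idea: records that do not fit the plan are parked for a decision rather than dropped or guessed at.

```python
from dataclasses import dataclass, field

# Hypothetical: the states the migration plan explicitly covers.
KNOWN_STATES = {"active", "suspended", "closed"}


@dataclass
class TriageResult:
    migratable: list = field(default_factory=list)
    needs_decision: list = field(default_factory=list)


def triage(records):
    """Split records into ones the plan covers and ones that need a decision."""
    result = TriageResult()
    for record in records:
        if record.get("state") in KNOWN_STATES:
            result.migratable.append(record)
        else:
            # Unknown or missing state: park it so someone has to decide.
            result.needs_decision.append(record)
    return result
```

The size of `needs_decision` gives the migration owner a concrete number to report, instead of a vague "a few odd records are left."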

3. Define exit criteria clearly

We write down what "done" means:

  • old schema is read-only or removed
  • new schema is the only path for writes
  • monitoring and runbooks no longer reference the old path

We avoid calling a migration "done" until these conditions are met.
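
One way to keep these criteria from drifting is to make them checkable. The sketch below is illustrative only: the `MigrationStatus` fields mirror the three conditions above, and how each flag gets set (by hand, from a script, from a dashboard) is left as an assumption.

```python
from dataclasses import dataclass


@dataclass
class MigrationStatus:
    # Each field mirrors one exit criterion from the list above.
    old_schema_read_only_or_removed: bool
    new_schema_only_write_path: bool
    monitoring_and_runbooks_updated: bool


def unmet_exit_criteria(status):
    """Return the names of exit criteria that are not yet satisfied."""
    return [name for name, met in vars(status).items() if not met]


def is_done(status):
    """A migration is done only when every exit criterion is met."""
    return not unmet_exit_criteria(status)
```

For example, `unmet_exit_criteria(MigrationStatus(True, True, False))` returns `["monitoring_and_runbooks_updated"]`, which is a more useful status line than "mostly done."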

4. Time-box and surface stall conditions

We treat long stalls as signals:

  • if a migration sits partially complete beyond an agreed window, it gets escalated
  • we talk about it in planning and reliability reviews, not just in the migration doc
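
The escalation rule can be as small as a date comparison. This is a sketch under stated assumptions: `last_progress` is whatever date the team agrees counts as the last real step forward, and `agreed_window` is the window chosen when the migration was planned. Neither name comes from a real tool.

```python
from datetime import date, timedelta


def should_escalate(last_progress, today, agreed_window, done=False):
    """True when an unfinished migration has gone longer than the agreed
    window without progress."""
    if done:
        return False
    return today - last_progress > agreed_window


# Hypothetical dates: 30 days without progress against a 14-day window.
stalled = should_escalate(date(2021, 6, 1), date(2021, 7, 1), timedelta(days=14))
```

Running a check like this on a schedule turns "it has been a while" into a trigger that does not depend on someone remembering to look.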

Sometimes the right answer is to:

  • cut scope
  • explicitly keep some legacy behavior

But we make that a decision, not an accident.

Takeaways

  • Cross-team migrations need a single owner for the last 10%, or they will linger.
  • Edge cases are part of the work, not an optional add-on.
  • Clear exit criteria and time-boxed plans keep migrations from becoming permanent complexity.

Further reading