Delivery · 2021-03-10 · By Jonas "Jo" Carlin

Story: the migration that almost doubled latency

We migrated a core read path to a new backend and watched P95 latency climb. This is the story of how we noticed, rolled back, and changed how we plan migrations.

Tags: delivery, migrations, performance, rollback, reliability

What happened

We had a classic good-news problem: the old data store backing a user-facing API was reaching its limits.

Reads were still within SLO, but we were running out of room to grow without increasingly awkward sharding. A newer backend promised better scalability and a cleaner data model.

The plan looked sensible on paper:

  • write new data to both old and new stores
  • backfill historical data into the new store
  • flip reads over once we were confident
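
In code, the shape of that plan is roughly the sketch below. The store and flag interfaces are hypothetical stand-ins, not our actual client libraries.

  class MigratingReader:
      """Read/write wrapper for the dual-write, flip-reads-later pattern.

      Hypothetical sketch: old_store, new_store, and the flags client stand in
      for whatever storage and feature-flag interfaces you already have.
      """

      def __init__(self, old_store, new_store, flags):
          self.old_store = old_store
          self.new_store = new_store
          self.flags = flags

      def write(self, key, value):
          # Step 1: dual-write so the new store stays current from now on.
          # (Historical data is handled separately by the backfill job.)
          self.old_store.write(key, value)
          self.new_store.write(key, value)

      def read(self, key, user_id):
          # Step 3: flip reads for a cohort only; the old store remains the
          # default until the new backend has earned trust.
          if self.flags.in_canary("reads-from-new-backend", user_id):
              return self.new_store.read(key)
          return self.old_store.read(key)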

We did most of that plan.

Where we stumbled was the last step.

The flip

On a Wednesday afternoon, we flipped the read path for a subset of traffic to the new backend.

We had done the math on latency ahead of time:

  • old store P95: ~220ms
  • new store P95 in staging: ~230–240ms

That difference looked acceptable.

Within 20 minutes of the flip, our production P95 climbed to ~400ms for the migrated cohort.

SLO alerts started to fire.

At first, the graphs were muddy:

  • some endpoints were still hitting the old backend
  • retries and fallbacks were blending signals
  • we had not broken out metrics by "old vs new" in all the right places

It took another 15 minutes to make the pattern unmistakable: requests hitting the new backend were materially slower at the tail.

The investigation

We paired an engineer who had lived with the old system for years with someone newer to the team who had led the new backend work.

They walked through what the request actually did end-to-end:

  • auth check
  • fetch from primary data store
  • a couple of denormalized lookups
  • a join against a per-user settings table

Nothing in that list was surprising.

What was surprising was the shape of the latency distribution:

  • median latency was fine
  • P95 and P99 were significantly worse

That usually means one of a few things:

  • cache behavior is different
  • a minority of requests take a slower path
  • contention or locking issues show up only under load

We pulled sample traces for slow requests in both the old and new paths.

In the traces for the new backend, a particular query stood out: a join that we had added for convenience, consolidating what used to be two round-trips into one.

In staging, with small data sets and no contention, this looked elegant.

Under real traffic, against a much larger dataset, it turned into an occasional full-table walk when an index hint wasn’t applied.
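
To make the failure mode concrete, here is a rough sketch of the two query shapes. The table and column names are invented for illustration; the real schema and planner behavior were specific to our backend.

  # Hypothetical illustration only: table and column names are invented.

  # Old path: two small, keyed round-trips that each hit an index.
  ITEMS_FOR_USER = "SELECT id, payload FROM items WHERE user_id = %s"
  SETTINGS_FOR_USER = "SELECT locale, flags FROM user_settings WHERE user_id = %s"

  # New path: one "convenient" consolidated join. When the planner skipped the
  # index on user_settings (absent a hint), some requests paid for a scan of
  # that table, which is the kind of cost that only shows up at the tail.
  ITEMS_WITH_SETTINGS_FOR_USER = """
      SELECT i.id, i.payload, s.locale, s.flags
      FROM items i
      JOIN user_settings s ON s.user_id = i.user_id
      WHERE i.user_id = %s
  """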

The decision

By this point, we were ~45 minutes into the incident.

We had three options:

  1. Try to "fix" the query on the new backend in place.
  2. Reduce the percentage of traffic hitting the new backend and watch.
  3. Roll back all reads to the old backend and regroup.

We chose option 3.

The rule we agreed on ahead of time helped: if a migration threatens user-facing SLOs and we don’t have a small, well-understood change to fix it, we roll back first and debug second.

We rolled back the read path and watched P95 return to baseline over the next few minutes.

What we changed

The interesting part came after the incident.

We didn’t abandon the new backend. We changed how we approach migrations.

1. Break out "old vs new" explicitly

We added explicit dimensions to our metrics and traces for migration state:

  • backend=old|new
  • migration_phase=control|canary|rolled-out

This did two things:

  • made it obvious which cohort was suffering
  • let us compare distributions side by side, not averaged together
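
A minimal sketch of what that looks like in code, assuming a generic metrics client; the client API and metric names here are illustrative, not our exact setup.

  import time

  def timed_read(metrics, backend, migration_phase, fn, *args):
      """Wrap a read and record its latency with explicit migration dimensions.

      Hypothetical sketch: `metrics` stands in for your client of choice
      (statsd, Prometheus, OpenTelemetry, ...); only the labels matter here.
      """
      start = time.monotonic()
      try:
          return fn(*args)
      finally:
          elapsed_ms = (time.monotonic() - start) * 1000
          metrics.histogram(
              "read_latency_ms",
              elapsed_ms,
              tags={
                  "backend": backend,                   # "old" | "new"
                  "migration_phase": migration_phase,   # "control" | "canary" | "rolled-out"
              },
          )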

2. Define SLOs for the migration itself

Previously, we had SLOs for the service, not for the migration.

We added simple rules:

  • During a migration, the canary cohort must not violate SLOs by more than a small margin.
  • If the canary exceeds that margin for more than N minutes, we roll back automatically.

This turned the migration into a thing we could reason about, not just a one-time event.
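
In spirit, the automatic check is roughly the loop below. The budget, the window, and the metrics and flag calls are placeholders for whatever your alerting and flag systems actually provide.

  import time

  CANARY_P95_BUDGET_MS = 250    # hypothetical: SLO target plus a small margin
  BREACH_WINDOW_SECONDS = 600   # the "N minutes" from the rule above

  def watch_canary(metrics, flags):
      """Disable the canary flag if its P95 stays over budget for the full window."""
      breach_started = None
      while flags.is_enabled("reads-from-new-backend"):
          p95 = metrics.p95(
              "read_latency_ms",
              tags={"backend": "new", "migration_phase": "canary"},
          )
          if p95 > CANARY_P95_BUDGET_MS:
              breach_started = breach_started or time.monotonic()
              if time.monotonic() - breach_started > BREACH_WINDOW_SECONDS:
                  flags.disable("reads-from-new-backend")  # automatic rollback
                  return
          else:
              breach_started = None
          time.sleep(60)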

3. Rehearse the rollback

We realized that while we had written down "we can flip reads back," we had never rehearsed it.

After the incident, we treated rollback like any other operation:

  • we wrote explicit steps
  • we ran them in a lower environment
  • we measured how long it took and what signals we would watch

When we later retried the migration, rollback felt like a normal, boring tool—not a last resort.

4. Revisit the data model

The slow join taught us something about the new backend: it made certain cross-cutting reads easy and tempting.

We pulled those reads out of the hot path:

  • denormalized some data into precomputed tables updated asynchronously
  • made other data accessible via separate, clearly labeled endpoints

The result was a hot path that asked simpler questions of the backend, with richer, slower queries moved to places where they wouldn’t hurt user-perceived latency.
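
As a sketch of that split (with invented names): the hot path reads one precomputed row, and a background job refreshes it outside the request.

  def read_profile_summary(store, user_id):
      """Hot path: a single keyed lookup against a precomputed table.

      Names are invented; the point is that the user-facing request asks the
      backend a simple question.
      """
      return store.read("profile_summary", key=user_id)

  def refresh_profile_summary(store, user_id):
      """Background job: run the richer, slower queries here and write the
      result back, so their cost never lands on a user request."""
      items = store.query("items", user_id=user_id)
      settings = store.query("user_settings", user_id=user_id)
      summary = {"item_count": len(items), "locale": settings.get("locale")}
      store.write("profile_summary", key=user_id, value=summary)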

5. Document the "migration shape"

We wrote a short internal guide for future migrations:

  • start with a tiny canary (e.g., 1% or one internal cohort)
  • instrument old vs new explicitly
  • define roll-forward and roll-backward criteria
  • keep the old path healthy until the new one has earned trust

It wasn’t a new idea in the industry. It was just something we had to make explicit for ourselves.

Takeaways

  • Migrations are product changes. They deserve SLOs, canaries, and clear rollback rules.
  • Median latency can look fine while tails get worse; watch the whole distribution.
  • New backends often make it easy to ask more complex questions. Keep those out of the hot path.
  • Practicing rollback before you need it turns a risky decision into a routine operation.
  • Instrumenting "old vs new" explicitly is cheaper than trying to tease them apart under pressure.
