Story: the migration that almost doubled latency
We migrated a core read path to a new backend and watched P95 latency climb. This is the story of how we noticed, rolled back, and changed how we plan migrations.
What happened
We had a classic good-news problem: the old data store backing a user-facing API was reaching its limits.
Reads were still within SLO, but we were running out of room to grow without increasingly awkward sharding. A newer backend promised better scalability and a cleaner data model.
The plan looked sensible on paper:
- write new data to both old and new stores
- backfill historical data into the new store
- flip reads over once we were confident
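For context, the dual-write step usually looks something like the sketch below. The store clients are hypothetical; the property that matters is that the old store stays the source of truth, and a failed write to the new store never fails the user's request.

```python
def save_record(record, old_store, new_store, logger):
    """Dual-write during a migration: the old store remains the source of truth."""
    old_store.write(record)  # hypothetical client; a failure here fails the request, as before
    try:
        new_store.write(record)  # best-effort; the new store is not yet load-bearing
    except Exception as exc:
        # A new-store failure is logged and reconciled later (e.g. by the backfill),
        # but must not affect the user-facing write path.
        logger.warning("dual-write to new store failed: %s", exc)
```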
We did most of that plan.
Where we stumbled was the last step.
The flip
On a Wednesday afternoon, we flipped the read path for a subset of traffic to the new backend.
We had done the math on the headline latency numbers:
- old store P95: ~220ms
- new store P95 in staging: ~230–240ms
That difference looked acceptable.
Within 20 minutes of the flip, our production P95 climbed to ~400ms for the migrated cohort.
SLO alerts started to fire.
At first, the graphs were muddy:
- some endpoints were still hitting the old backend
- retries and fallbacks were blending signals
- we had not broken out metrics by "old vs new" in all the right places
It took another 15 minutes to make the pattern unmistakable: requests hitting the new backend were materially slower at the tail.
The investigation
We paired an engineer who had lived with the old system for years with someone newer to the team who had led the new backend work.
They walked through what the request actually did end-to-end:
- auth check
- fetch from primary data store
- a couple of denormalized lookups
- a join against a per-user settings table
Nothing in that list was surprising.
What was surprising was the shape of the latency distribution:
- median latency was fine
- P95 and P99 were significantly worse
That usually means one of a few things:
- cache behavior is different
- a minority of requests take a slower path
- contention or locking issues show up only under load
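Whatever the cause, the tell is in the distribution, not the average. A minimal sketch of the per-cohort comparison we wanted, with placeholder samples rather than our real data:

```python
import numpy as np

def latency_profile(samples_ms):
    """Summarize a cohort's latency distribution instead of collapsing it to one number."""
    p50, p95, p99 = np.percentile(samples_ms, [50, 95, 99])
    return {"p50": round(p50), "p95": round(p95), "p99": round(p99)}

# Placeholder samples; in practice these come from traces or metrics, per cohort.
old_cohort_ms = [180, 190, 200, 205, 210, 215, 220, 222, 225, 230]
new_cohort_ms = [175, 185, 195, 205, 210, 215, 220, 380, 410, 450]

print("old:", latency_profile(old_cohort_ms))  # median and tail both near baseline
print("new:", latency_profile(new_cohort_ms))  # similar median, much worse P95/P99
```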
We pulled sample traces for slow requests in both the old and new paths.
In the traces for the new backend, a particular query stood out: a join that we had added for convenience, consolidating what used to be two round-trips into one.
In staging, with small data sets and no contention, this looked elegant.
Under real traffic, against a much larger dataset, it turned into an occasional full-table scan when an index hint wasn't applied.
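The failure mode is easy to reproduce in miniature. The sketch below uses an in-memory SQLite database and an invented schema (not our actual backend, and a missing index rather than an ignored hint), but it shows the same cliff: the plan for the consolidated join flips between a full scan and an index lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, owner_id INTEGER, payload TEXT);
    CREATE TABLE user_settings (user_id INTEGER, key TEXT, value TEXT);
""")

# The "convenient" consolidated read: one join instead of two keyed round-trips.
query = """
    SELECT items.payload, user_settings.value
    FROM items JOIN user_settings ON user_settings.user_id = items.owner_id
    WHERE items.id = 1
"""

def show_plan(label):
    print(label)
    for row in conn.execute("EXPLAIN QUERY PLAN " + query):
        print("   ", row[-1])  # last column is the human-readable plan step

show_plan("without an index on user_settings.user_id:")  # typically a SCAN of user_settings
conn.execute("CREATE INDEX idx_settings_user ON user_settings (user_id)")
show_plan("with the index:")                              # typically a SEARCH using the index
```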
The decision
By this point, we were ~45 minutes into the incident.
We had three options:
- Try to "fix" the query on the new backend in place.
- Reduce the percentage of traffic hitting the new backend and watch.
- Roll back all reads to the old backend and regroup.
We chose option 3.
The rule we agreed on ahead of time helped: if a migration threatens user-facing SLOs and we don’t have a small, well-understood change to fix it, we roll back first and debug second.
We rolled back the read path and watched P95 return to baseline over the next few minutes.
What we changed
The interesting part came after the incident.
We didn’t abandon the new backend. We changed how we approach migrations.
1. Break out "old vs new" explicitly
We added explicit dimensions to our metrics and traces for migration state:
- backend=old|new
- migration_phase=control|canary|rolled-out
This did two things:
- made it obvious which cohort was suffering
- let us compare distributions side by side, not averaged together
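For illustration, this is roughly what the instrumentation looks like with a Prometheus-style client in Python; the metric and label names are our own choices for the sketch, not a standard.

```python
from prometheus_client import Histogram

# One histogram, with the migration state attached as labels so the
# old and new cohorts can be graphed and alerted on separately.
REQUEST_LATENCY = Histogram(
    "read_path_latency_seconds",
    "Read path latency, broken out by migration cohort",
    ["backend", "migration_phase"],
)

def record_request(backend, migration_phase, duration_seconds):
    REQUEST_LATENCY.labels(
        backend=backend,                  # "old" or "new"
        migration_phase=migration_phase,  # "control", "canary", or "rolled-out"
    ).observe(duration_seconds)
```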
2. Define SLOs for the migration itself
Previously, we had SLOs for the service, not for the migration.
We added simple rules:
- During a migration, the canary cohort must not violate SLOs by more than a small margin.
- If the canary exceeds that margin for more than N minutes, we roll back automatically.
This turned the migration into a thing we could reason about, not just a one-time event.
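The automatic-rollback rule is small enough to sketch in one function. Everything below is illustrative: fetch_canary_p95 and rollback_reads are assumed helpers standing in for our metrics query and traffic flip, and the thresholds are made up.

```python
import time

SLO_P95_SECONDS = 0.3  # illustrative SLO for the read path
CANARY_MARGIN = 1.1    # the canary may exceed the SLO by at most 10%
GRACE_MINUTES = 10     # how long a breach must last before we act (the "N minutes")

def watch_canary(fetch_canary_p95, rollback_reads, check_interval_s=60):
    """Roll reads back automatically if the canary cohort breaches its budget for too long."""
    breach_started = None
    while True:
        p95 = fetch_canary_p95()  # assumed helper: current canary P95, in seconds
        if p95 > SLO_P95_SECONDS * CANARY_MARGIN:
            breach_started = breach_started or time.monotonic()
            if time.monotonic() - breach_started > GRACE_MINUTES * 60:
                rollback_reads()  # assumed helper: flip the read path back to the old backend
                return
        else:
            breach_started = None  # the breach must be sustained, not one bad scrape
        time.sleep(check_interval_s)
```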
3. Rehearse the rollback
We realized that while we had written down "we can flip reads back," we had never rehearsed it.
After the incident, we treated rollback like any other operation:
- we wrote explicit steps
- we ran them in a lower environment
- we measured how long it took and what signals we would watch
When we later retried the migration, rollback felt like a normal, boring tool—not a last resort.
4. Revisit the data model
The slow join taught us something about the new backend: it made certain cross-cutting reads easy and tempting.
We pulled those reads out of the hot path:
- denormalized some data into precomputed tables updated asynchronously
- made other data accessible via separate, clearly labeled endpoints
The result was a hot path that asked simpler questions of the backend, with richer, slower queries moved to places where they wouldn’t hurt user-perceived latency.
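Schematically, the split looks like the sketch below. The store client and table names are invented; the point is that the request path does a single keyed read, while the expensive consolidation runs in an async worker where its latency doesn't matter.

```python
# Hot path: one keyed lookup against a precomputed table, no joins.
def get_profile_view(store, user_id):
    return store.get("profile_view", key=user_id)  # hypothetical store client

# Off the hot path: an async worker rebuilds the precomputed row when the
# underlying data changes (or on a schedule), absorbing the richer queries here.
def rebuild_profile_view(store, user_id):
    settings = store.get("user_settings", key=user_id)  # hypothetical calls
    items = store.query("items", owner_id=user_id)
    store.put("profile_view", key=user_id,
              value={"settings": settings, "item_count": len(items)})
```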
5. Document the "migration shape"
We wrote a short internal guide for future migrations:
- start with a tiny canary (e.g., 1% or one internal cohort)
- instrument old vs new explicitly
- define roll-forward and roll-backward criteria
- keep the old path healthy until the new one has earned trust
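The same checklist can be captured as a small per-migration spec for canary tooling to read; the fields below are illustrative rather than a real schema.

```python
MIGRATION_SPEC = {
    "name": "read-path-to-new-backend",
    "canary_fraction": 0.01,                             # start tiny (~1% or one internal cohort)
    "required_labels": ["backend", "migration_phase"],   # instrument old vs new explicitly
    "roll_forward_if": {
        "canary_within_slo_for_minutes": 60,             # earn trust before widening
    },
    "roll_back_if": {
        "canary_p95_over_slo_margin": 1.1,               # mirrors the migration SLO rule above
        "sustained_minutes": 10,
    },
}
```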
It wasn’t a new idea in the industry. It was just something we had to make explicit for ourselves.
Takeaways
- Migrations are product changes. They deserve SLOs, canaries, and clear rollback rules.
- Median latency can look fine while tails get worse; watch the whole distribution.
- New backends often make it easy to ask more complex questions. Keep those out of the hot path.
- Practicing rollback before you need it turns a risky decision into a routine operation.
- Instrumenting "old vs new" explicitly is cheaper than trying to tease them apart under pressure.