RELIABILITY · 2019-05-11 · BY STORECODE

Incident report: A cache stampede

A deploy invalidated hot cache keys and the database became the cache. We rolled back and added stampede protection.

Tags: reliability, incident-response, caching, databases, rollback

Summary

On May 11, 2019, a deploy changed cache key behavior for a hot path.

A large fraction of requests became cache misses, shifting load to the database and causing elevated latency and timeouts. The database effectively became the cache.

We mitigated by rolling back. We followed up by adding cache stampede protection (request coalescing and bounded stale serving) and by adding safer rollout checks for cache hit rate.

This is a common incident shape: a cache change that looks like a correctness refactor becomes an availability event because “cache miss” is not free.

Caching is often described as “just performance,” but on high-volume paths it’s also availability. If the cache disappears, your database is now on the hot path.

What we observed

  • Latency and timeout rate increased shortly after deploy.
  • Database CPU and query volume climbed rapidly.
  • Cache hit rate dropped sharply.

Impact

  • Duration: 27 minutes (16:03–16:30 ET).
  • Customer impact: ~3% of requests to a high-traffic page timed out at the peak. Many other requests were slower.
  • Latency impact: P95 latency for the affected path increased from a baseline of ~200–350ms to ~1.0–1.8s.
  • Cache impact: cache hit rate dropped from a baseline in the ~90–95% range to below ~30% for the hot path.
  • Database impact: database CPU increased sharply and query volume spiked as requests fell through.
  • Internal impact: elevated on-call load and a spike in “site is slow” messages that were hard to correlate with specific requests.

We did not observe data loss. The primary user impact was timeouts and degraded performance.

What the user experienced

  • Requests were slow, then some timed out.
  • Refreshing sometimes helped, which is consistent with backend queueing and variable load per request.

What we measured during the window

We were able to correlate the incident quickly because we had three signals on one screen:

  • latency / timeout rate
  • database CPU and query volume
  • cache hit rate

The cache hit rate drop was the key diagnostic. It explained why the database became saturated and why scaling the application made things worse.

Timeline

All times ET.

  • 15:59 — Deploy begins.
  • 16:03 — Latency alert fires; database CPU climbs.
  • 16:05 — On-call opens service dashboards; sees elevated query volume and rising tail latency.
  • 16:06 — Cache hit rate drops sharply.
  • 16:08 — First mitigation attempt: scale instances and database capacity to reduce immediate pressure.
  • 16:10 — Scaling provides limited relief. Query volume continues to climb because cache misses remain high.
  • 16:12 — Investigation focuses on cache behavior changes in the deploy.
  • 16:16 — Hypothesis confirmed: cache key behavior changed; hot keys are now missing for the majority of requests.
  • 16:18 — Decision: rollback, because the system is unstable and the change is not safely reversible in place.
  • 16:20 — Rollback initiated.
  • 16:24 — Cache hit rate begins recovering.
  • 16:26 — Database load stabilizes; latency starts returning toward baseline.
  • 16:30 — Incident closed.

Investigation notes (what we needed to know)

We needed to answer, in order:

  • Did load increase? (No: traffic was within expected range.)
  • Did the database get slower first? (No: database load increased after cache hit rate collapsed.)
  • Did the deploy change cache behavior? (Yes: cache key behavior changed.)

Once we had those answers, rollback was the correct mitigation.

Root cause

The deploy changed the cache key generation for a hot path in a way that effectively invalidated existing cache entries.

What changed in the cache key

The intent of the change was reasonable: clean up how we generated keys so different request shapes didn’t collide.

The operational effect was that most hot keys changed at once.

When keys change, the cache becomes cold even though the data itself didn’t change.

On a low-volume path, that’s a performance hit.

On a high-volume path, it’s an incident.

A safe cache key change needs one of these:

  • a versioned key scheme where old and new keys coexist
  • a staged rollout that warms new keys before removing old ones
  • a fallback (stale serving) that prevents cold-start fan-out

When traffic arrived, many requests missed the cache and fell through to the database. Because the hot path is high volume, the database became saturated.

The failure mode was amplified by the absence of stampede protection:

  • No request coalescing (multiple concurrent cache misses triggered multiple identical database queries).
  • No stale serving (we had no “serve a slightly stale value while rebuilding” behavior).

Why stampedes happen

A cache miss is not one query.

On a hot path, it’s many queries at once.

Without coalescing, the system does this under load:

  • 100 requests arrive
  • the cache is cold for the same key
  • 100 requests each compute the miss independently
  • 100 database queries run concurrently

If the query is expensive or the database is near saturation, that fan-out can create a self-inflicted incident.
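A minimal sketch of request coalescing, assuming a threaded application server and a generic `cache` client with `get`/`set`; the names (`CoalescingCache`, `load_fn`) and the `set` signature are illustrative, not our production code:

```python
import threading

class CoalescingCache:
    """Wrap a cache so concurrent misses for the same key trigger one rebuild."""

    def __init__(self, cache, load_fn, ttl_seconds=60):
        self.cache = cache              # illustrative get/set cache client
        self.load_fn = load_fn          # expensive loader, e.g. a database query
        self.ttl = ttl_seconds
        self._locks = {}                # one lock per in-flight key
        self._locks_guard = threading.Lock()

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            return value
        # Miss: only one caller per key runs the loader; the others wait for it.
        with self._lock_for(key):
            value = self.cache.get(key)     # re-check after acquiring the lock
            if value is not None:
                return value
            value = self.load_fn(key)       # a single database query per cold key
            self.cache.set(key, value, self.ttl)
            return value
```

With this in place, 100 concurrent misses for the same cold key become one database query and 99 waiters, instead of 100 identical queries.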

Stale serving is the second half of the fix.

If the value is safe to serve slightly stale (bounded freshness), you can keep serving traffic while one request rebuilds the cache.
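A minimal sketch of bounded stale serving, under the same illustrative cache-client assumption; the bound and the background rebuild thread are simplifications, and in practice this is combined with coalescing so only one rebuild runs per key:

```python
import threading
import time

STALE_BOUND_SECONDS = 300    # illustrative bound: how stale is still acceptable

def get_with_bounded_stale(cache, key, load_fn, fresh_for=60):
    """Serve a slightly stale value while a rebuild refreshes it off the hot path."""
    entry = cache.get(key)               # entries are (value, stored_at) pairs
    now = time.time()
    if entry is not None:
        value, stored_at = entry
        age = now - stored_at
        if age <= fresh_for:
            return value                 # fresh: a normal hit
        if age <= STALE_BOUND_SECONDS:
            # Stale but inside the bound: serve it, rebuild in the background.
            threading.Thread(
                target=lambda: cache.set(key, (load_fn(key), time.time())),
                daemon=True,
            ).start()
            return value
    # Nothing cached, or too stale to serve: rebuild synchronously.
    value = load_fn(key)
    cache.set(key, (value, now))
    return value
```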

Why scaling didn’t fix it

Scaling the application increased request capacity, which increased database pressure.

Scaling the database bought time but did not remove the underlying cause: a high cache miss rate on a hot path.

The correct mitigation was restoring previous cache key behavior via rollback.

Contributing factors

  • No canary check or alert on cache hit rate during deploy.
  • No stampede protection (coalescing or stale-while-revalidate) for the hot path.
  • No runbook guidance for “database CPU spike after deploy” that included cache hit rate as a first check.

What we would design for now

After this incident we started describing caches in operational terms:

  • What is the expected hit rate on this path?
  • What happens if the hit rate drops by half?
  • What is the database’s failure mode when it becomes the cache?

If we can’t answer those questions quickly, the cache is a liability.

We also started tracking a simple derived metric:

  • “database load per request on the hot path”

If that number changes after a deploy, we treat it the same way we treat a latency regression. Cache changes can hide inside “performance improvements” until the cache is cold.

That metric also gave us a practical alerting rule:

  • page only when database load per request rises and cache hit rate drops

Either signal by itself can be noise. Together, they describe the actual failure mode: the database is doing work the cache used to do.
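As a sketch, the rule looks roughly like this over one evaluation window; the counters and thresholds are illustrative, not our actual alerting config:

```python
def should_page(db_queries, requests, cache_hits, cache_lookups,
                baseline_load_per_request, baseline_hit_rate):
    """Page only when both signals move together in one evaluation window."""
    load_per_request = db_queries / max(requests, 1)
    hit_rate = cache_hits / max(cache_lookups, 1)
    load_regressed = load_per_request > 1.5 * baseline_load_per_request   # illustrative threshold
    hit_rate_dropped = hit_rate < 0.8 * baseline_hit_rate                 # illustrative threshold
    return load_regressed and hit_rate_dropped
```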

What we changed

We treated caching as part of the production system, not just a performance optimization.

Immediate remediation

  • Rolled back to restore previous cache behavior.
  • Verified cache hit rate, database load, and latency returned to baseline.

Stampede protection

  • Added request coalescing for hot keys so that concurrent cache misses do not fan out into duplicate database queries.
  • Added bounded stale serving for a small set of safe-to-stale values so the system can degrade gracefully during rebuild.

We also wrote down the intended behavior for operators:

  • If cache hit rate collapses, rollback is a valid mitigation.
  • If the cache is cold but stable, coalescing should prevent database overload.
  • If the cache is rebuilding, stale serving should keep the system responsive within a bounded freshness window.

Those expectations are important because caches fail quietly. Without documented expected behavior, teams reach for scaling, which can worsen a stampede.

We were explicit about which data is safe to serve stale.

Some values are safe to be a few minutes old (reference data, config, lists).

Some values are not (inventory counts, payments, anything that must be strongly consistent).

Stale serving is a reliability feature only when the data has a clear freshness boundary.

Rollout and observability

  • Added deploy checks/alerts for cache hit rate changes.
  • Added dashboard annotations for deploys to make correlation immediate.
  • Documented the cache key format and the intended change process so “small refactors” don’t invalidate hot keys unintentionally.

We also added a specific review question for cache changes:

  • “What happens to the database if this cache is cold?”

If the answer is “the database will fall over,” then the cache change needs a rollout plan and stampede protection.

We also adopted a rule for cache key changes:

  • If a key change would invalidate hot keys, it needs a rollout plan.

Sometimes that plan is as simple as versioning keys and letting the old and new keys coexist until the new cache is warm.

We also learned that “warming the cache” is a real production activity.

If a hot path depends on cached values, warming those values should not be done implicitly by user traffic.

It should be done deliberately (prefill jobs, staged rollout, or background rebuild), with a stop condition if database load rises.

A concrete rollout pattern we now prefer:

  • Introduce a new key version while continuing to read the old key.
  • Populate the new cache in the background (or on a small percentage of traffic).
  • Measure hit rate and database load while both paths exist.
  • Only then stop reading/writing the old key.
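A rough sketch of the dual-read step, with hypothetical key builders and version labels (`make_key`, `v1`, `v2`, the `request` attributes); the real key format encodes request shape and is documented separately:

```python
KEY_VERSION_OLD = "v1"        # hypothetical version labels
KEY_VERSION_NEW = "v2"

def make_key(version, request):
    # Hypothetical key builders; the real keys encode request shape.
    if version == KEY_VERSION_OLD:
        return f"{version}:{request.path}"
    return f"{version}:{request.path}:{request.normalized_params}"

def read_through(cache, request, load_fn):
    """Dual-read during rollout: try the new key, fall back to the old one."""
    new_key = make_key(KEY_VERSION_NEW, request)
    value = cache.get(new_key)
    if value is not None:
        return value
    value = cache.get(make_key(KEY_VERSION_OLD, request))
    if value is not None:
        return value             # old keys keep serving while the new cache warms
    # Both cold: one database load, write both versions so neither path stays cold.
    value = load_fn(request)
    cache.set(new_key, value)
    cache.set(make_key(KEY_VERSION_OLD, request), value)
    return value
```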

That sounds like overhead.

It’s cheaper than letting peak traffic warm your cache via your database.

Follow-ups

Testing

  • Add a small load test for cache behavior regressions.
  • Add a canary stage that validates cache hit rate stays within a bounded delta during rollout.
  • Add a “hot key” regression test: verify that a cache key refactor does not turn a bounded key into an unbounded one.

The goal of these tests is not perfect simulation. It’s to catch the obvious changes that would make the cache miss rate spike after a deploy.
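A sketch of the canary check, with an illustrative delta bound; in practice the counters come from the metrics system for the canary and baseline groups:

```python
MAX_HIT_RATE_DELTA = 0.05    # illustrative: fail the canary if hit rate drops by >5 points

def canary_hit_rate_ok(baseline_hits, baseline_lookups, canary_hits, canary_lookups):
    """Fail the canary stage when its hit rate drops more than the bounded delta below baseline."""
    baseline_rate = baseline_hits / max(baseline_lookups, 1)
    canary_rate = canary_hits / max(canary_lookups, 1)
    return (baseline_rate - canary_rate) <= MAX_HIT_RATE_DELTA
```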

Runbooks

  • Add a runbook entry: “database CPU spike after deploy,” with cache hit rate and key changes as first checks.

We also made the “safe first action” explicit.

If cache hit rate collapses after a deploy, the first action is to stop the bleeding (rollback or disable the change), not to tune the database under pressure.

Tuning can come later, when the system is stable enough to measure.

This is the same principle we use for migrations: keep rollback real, and prefer staged changes over big-bang behavior shifts.

It’s cheaper.

Correlation

  • Improve correlation from user reports to internal traces/logs so support can attach a reference ID to a slow-page report.

Operational habit

We added one habit that’s hard to encode in code but easy to enforce in review:

  • If a cache change can affect a hot path, we deploy it earlier in the day and leave time to observe.

“Change management” sounds bureaucratic until you’ve done a rollback at peak traffic.
