ARCHITECTURE · 2022-08-19 · BY JONAS "JO" CARLIN

Story: when 'just add caching' made things worse

We added a cache to protect a slow path and accidentally created a new failure mode. This is the story of what went wrong and what we changed.

Tags: architecture, caching, performance, incidents

What happened

A service had a slow, expensive endpoint.

It wasn’t failing SLOs, but it was close. Under peak load, P95 hovered in the "users start to notice" range.

The idea was simple: add a cache.

The data was relatively stable. Many requests asked the same question. A short-lived cache seemed like a harmless way to buy headroom while we planned deeper optimizations.

We put a cache in front of the expensive path.
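
Conceptually, the first version was a plain read-through cache with one fixed TTL. The sketch below is illustrative rather than our real code: it assumes a Redis-like client with get/set and TTL support, and the names (compute_expensive_report, REPORT_TTL_SECONDS) are made up.

    import json

    REPORT_TTL_SECONDS = 60  # one short, fixed TTL for every key

    def compute_expensive_report(request):
        # Placeholder for the slow path (several queries plus aggregation).
        ...

    def get_report(cache, request):
        # One key per distinct request -- in hindsight, far too granular.
        key = "report:" + json.dumps(request, sort_keys=True)

        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        # On a miss, every caller recomputes and writes back independently.
        value = compute_expensive_report(request)
        cache.set(key, json.dumps(value), ex=REPORT_TTL_SECONDS)
        return value

Under light traffic this looks harmless; the problems only show up when many callers miss at the same moment.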

In a small test, it worked beautifully:

  • cache hit rate was high
  • backend load dropped
  • median and P95 latency improved

Then we rolled it out fully.

The regression

A few days later, we hit an unrelated spike in traffic.

Under load, instead of protecting the backend, the cache turned into a new source of pressure:

  • cache misses turned into thundering herds
  • the cache store itself became a bottleneck
  • expiration patterns caused synchronized refreshes

P95 latency was worse than before, and we saw new failure modes:

  • timeouts talking to the cache
  • increased connection errors under bursty load

The caching layer had multiplied the number of ways the system could be slow or broken.

The investigation

We looked at the traces for slow requests and noticed three things:

  1. Many "cache hits" weren’t hits at all.
  2. When the cache key expired, multiple requests recomputed the same value.
  3. The cache backend wasn’t tuned for the new traffic pattern.

The root issues were design choices:

  • cache keys were more granular than necessary
  • we used a simple time-based expiration with no jitter
  • we treated the cache as a free optimization, not a new dependency with its own SLO

What we changed

1. Treat the cache as a dependency

We wrote down expectations for the cache:

  • availability and latency targets
  • what happens if it’s down or slow

We added:

  • metrics and alerts specific to the cache layer
  • a clear fallback path for cache failures (use the underlying store, with bounded retries)

This forced us to consider how much we could rely on the cache under stress.
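
In code, the fallback path amounts to treating cache errors as a normal outcome rather than an exception that propagates to the user. This is a minimal sketch under that assumption; the client, store, and metrics names are illustrative:

    CACHE_RETRIES = 1  # bounded: one retry, then go to the underlying store

    def get_with_fallback(cache, store, metrics, key, recompute):
        for _ in range(1 + CACHE_RETRIES):
            try:
                cached = cache.get(key)   # client configured with a short timeout
                if cached is not None:
                    metrics.increment("cache.hit")
                    return cached
                break                     # genuine miss: stop retrying, recompute
            except (TimeoutError, ConnectionError):
                metrics.increment("cache.error")
                # retry once, then fall through to the underlying store

        value = recompute(store, key)     # slow but correct path
        try:
            cache.set(key, value)
        except (TimeoutError, ConnectionError):
            pass                          # a failed cache write must never fail the request
        return value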

2. Coarser keys and better hit rates

We simplified cache keys to reflect what users actually needed:

  • grouped similar requests that could share a value
  • avoided including high-cardinality fields that didn’t change the response in meaningful ways

This increased hit rates and reduced the work required for each refresh.
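
The key change itself was small. A sketch of the before/after idea, with made-up field names:

    def cache_key(request):
        # Before: every request field ended up in the key, including
        # per-request noise (trace IDs, pagination cursors, exact timestamps),
        # so near-identical requests never shared an entry.
        #
        # After: only the fields that actually change the response participate.
        relevant = {
            "account": request["account_id"],
            "region": request["region"],
            "window": request["time_window"],   # already coarse, e.g. "24h"
        }
        return "report:" + ":".join(f"{k}={relevant[k]}" for k in sorted(relevant))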

3. Staggered expirations and background refresh

Instead of expiring many keys at once and forcing users to pay the recompute cost, we:

  • added jitter to expiration times
  • introduced a background refresh mechanism for the hottest keys

For some flows, we used a "stale-while-revalidate" pattern:

  • serve slightly stale data for a short period
  • refresh in the background

This kept latency predictable while still keeping data reasonably fresh.
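
Roughly, the combination of jitter and stale-while-revalidate looks like the sketch below. The windows and names are illustrative, and the cache here is just an in-process dict; the real mechanism sat in front of the shared cache store.

    import random
    import threading
    import time

    BASE_FRESH_SECONDS = 60     # nominal freshness window
    JITTER = 0.2                # +/- 20% so entries don't all expire together
    SERVEABLE_SECONDS = 300     # how long stale data may still be served

    def _store(entries, key, value):
        # Each entry gets its own jittered freshness deadline at write time.
        fresh_for = BASE_FRESH_SECONDS * (1 + random.uniform(-JITTER, JITTER))
        entries[key] = (value, time.time(), fresh_for)

    def get_report_swr(entries, key, recompute):
        now = time.time()
        entry = entries.get(key)

        if entry is not None:
            value, stored_at, fresh_for = entry
            age = now - stored_at
            if age < fresh_for:
                return value              # fresh: serve directly
            if age < SERVEABLE_SECONDS:
                # Stale but serveable: answer now, refresh in the background.
                threading.Thread(
                    target=lambda: _store(entries, key, recompute()),
                    daemon=True,
                ).start()
                return value

        # Missing or far too old: this caller pays the recompute cost once.
        value = recompute()
        _store(entries, key, value)
        return value

This sketch can still start duplicate background refreshes for the same key; the per-key cap in the next section is what closes that gap.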

4. Limits and backpressure

We added simple protections:

  • caps on concurrent recomputations for the same key (see the sketch below)
  • fallbacks when the cache backend was overloaded

If the cache was clearly struggling, we preferred to:

  • bypass it temporarily for some traffic
  • or reduce load through feature flags
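
The cap on concurrent recomputations is essentially a per-key lock with a bounded wait. A sketch with illustrative names and timeouts (a real version would also evict idle locks):

    import threading

    _recompute_locks = {}
    _locks_guard = threading.Lock()

    def _lock_for(key):
        with _locks_guard:
            return _recompute_locks.setdefault(key, threading.Lock())

    def get_with_single_flight(cache, key, recompute, wait_seconds=2.0):
        cached = cache.get(key)
        if cached is not None:
            return cached

        lock = _lock_for(key)
        if lock.acquire(timeout=wait_seconds):
            try:
                # Re-check: another caller may have filled the key while we waited.
                cached = cache.get(key)
                if cached is not None:
                    return cached
                value = recompute()
                cache.set(key, value)
                return value
            finally:
                lock.release()

        # Couldn't join the single flight within the bounded wait: degrade to
        # the slow path directly instead of queueing behind the cache.
        return recompute()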

Takeaways

  • Caching is an architectural decision, not a free performance knob.
  • A cache introduces a new dependency with its own failure modes and SLOs.
  • Coarse keys and jittered expirations prevent thundering herds.
  • "Stale but fast" is often better for users than "fresh but timing out".
  • Any cache that matters should be visible in your dashboards and runbooks.
