Incident report: Cascading timeouts from a slow dependency
A degraded downstream dependency kept responding, but slowly; those slow responses turned into a wave of timeouts upstream. This report describes the chain of events and what we changed.
Summary
On March 31, 2022, a downstream dependency became significantly slower under load.
The dependency did not fail outright. It continued to return successful responses, but with latency several times higher than normal.
Our upstream service treated this as "still healthy" and continued sending traffic at the usual rate. Over the course of roughly 40 minutes, those slow responses accumulated into a wave of timeouts and retries across multiple services.
We mitigated the incident by lowering timeouts, adding aggressive fallbacks, and temporarily reducing traffic to the dependency. Afterward, we changed how we set and test timeouts so that slow dependencies fail fast instead of dragging the whole system with them.
Impact
- Duration: approximately 1 hour and 5 minutes of elevated latency and timeouts for a subset of user-facing requests.
- User impact:
  - P95 latency for some endpoints increased from ~250ms to ~1.8s at the peak.
  - Roughly 3–4% of affected requests resulted in timeouts or retried failures.
- Internal impact:
  - Increased on-call load for two teams.
  - A spike in support tickets about "spinning" and "stuck" screens.
No data was lost, but some operations were delayed significantly.
Timeline
All times local.
- 10:07 — The downstream dependency begins to show increased latency (P95 ~700ms instead of ~200ms). No alerts fire yet.
- 10:14 — Our upstream service's P95 and P99 latencies begin to climb. Retries per request increase modestly.
- 10:19 — User-facing latency SLO alert triggers for one API. On-call acknowledges and starts investigation.
- 10:23 — On-call notices that application error rates are not spiking in step with latency. The initial hypothesis is "slow dependency, not an internal error."
- 10:27 — Metrics confirm that a single downstream dependency is much slower than baseline. No explicit alert exists for its latency.
- 10:31 — Upstream services start to hit their own timeouts, leading to a noticeable increase in 5xx responses and retries.
- 10:35 — A second team is paged for elevated error rates in a related service that also depends on the slow downstream.
- 10:41 — Joint incident channel created. The teams compare timeout settings and behavior across services.
- 10:48 — Decision is made to reduce the timeout for calls to the dependency and to add a simpler fallback path for some requests.
- 10:52 — Configuration change is rolled out to lower the timeout and cap retries.
- 10:56 — Latency starts to improve as slow calls are cut off earlier and fewer retries are attempted.
- 11:05 — P95 and P99 latencies approach baseline. Timeouts drop back to normal levels.
- 11:12 — Incident wrapped up. Follow-ups are drafted focusing on timeouts, circuit breakers, and dependency SLOs.
Root cause
The immediate cause was a mismatch between the dependency's degraded performance and our timeouts and retry policies.
Key points:
- The dependency's SLO for latency was not clearly defined or enforced.
- Our upstream service used a timeout that was too close to the user-facing SLO, leaving little room for retries and overall processing.
- Retries were configured aggressively and did not include jitter or backoff tuned for this dependency.
When the dependency slowed down, this combination led to:
- many in-flight requests waiting close to their timeout thresholds
- retries stacking on top of slow responses, increasing load
- threads and connections tied up waiting, reducing overall capacity
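To make the capacity effect concrete, here is a back-of-the-envelope sketch using Little's law (in-flight calls ≈ call rate × latency per call). The rates, latencies, and retry counts are illustrative, not measurements from this incident.

```python
# Rough illustration of how slow responses plus retries tie up capacity.
# All numbers are illustrative, not measurements from this incident.

request_rate = 200          # upstream requests per second hitting the dependency
baseline_latency = 0.2      # seconds per call when the dependency is healthy
degraded_latency = 1.8      # seconds per call while the dependency is slow
retries_per_request = 2     # extra attempts made when a call times out

# Little's law: concurrent in-flight calls ~= call rate * latency per call.
inflight_healthy = request_rate * baseline_latency
# During degradation, each user request can turn into (1 + retries) slow calls.
effective_rate = request_rate * (1 + retries_per_request)
inflight_degraded = effective_rate * degraded_latency

print(f"in-flight calls when healthy:  {inflight_healthy:.0f}")    # ~40
print(f"in-flight calls when degraded: {inflight_degraded:.0f}")   # ~1080
```

Against a fixed-size thread or connection pool, a jump of that size is what surfaces as exhausted workers and queued requests upstream.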
The dependency itself did not experience a traditional outage. From its perspective, it was "just slower." From ours, it was effectively unavailable for timely responses.
Contributing factors
- Lack of explicit dependency SLOs. We had not written down what "acceptable latency" for the dependency looked like under normal and degraded conditions.
- Timeouts set by guesswork. Values were chosen once and rarely revisited, rather than derived from measured latency distributions.
- Inconsistent retry policies. Different services called the dependency with different retry counts and patterns, making it hard to reason about combined load under degradation.
- No early-warning alerts on dependency latency. We only noticed the issue after it had cascaded into user-facing timeouts.
What we changed
1. Timeouts based on budgets, not guesses
We recalibrated timeouts for calls to the dependency using a simple budget:
- start from the user-facing SLO (e.g., "P95 ≤ 500ms")
- subtract known overhead (network, application processing, other dependencies)
- allocate a clear slice for this specific dependency
We set the timeout for the dependency low enough that:
- slow responses fail fast and surface as dependency errors
- we have room for a bounded number of retries without violating the overall SLO
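As a concrete sketch of that budget arithmetic (the SLO, overhead, and retry numbers are illustrative, not our actual configuration):

```python
# Sketch of the timeout budget described above. All values are illustrative;
# real budgets come from measured latency distributions, not guesses.

user_facing_slo = 0.500    # seconds, e.g. a "P95 <= 500ms" target for the endpoint
fixed_overhead = 0.120     # network, serialization, other dependencies, app logic
retries_allowed = 1        # bounded retries against this dependency

# Whatever remains after overhead must cover the first attempt plus all retries.
dependency_budget = user_facing_slo - fixed_overhead
per_attempt_timeout = dependency_budget / (1 + retries_allowed)

print(f"budget for the dependency: {dependency_budget * 1000:.0f}ms")    # 380ms
print(f"timeout per attempt:       {per_attempt_timeout * 1000:.0f}ms")  # 190ms
```

A per-attempt timeout derived this way still needs a sanity check against the dependency's healthy P99, so that normal requests are not cut off.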
2. Bounded and coordinated retries
We standardized retry policies for this dependency:
- a small number of retries (e.g., 1–2) with backoff and jitter
- no cross-service retry storms (e.g., one service retrying another that is already retrying)
We documented the policy so new services calling the dependency follow the same pattern.
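A minimal sketch of that retry policy in Python; the function name and limits are illustrative, not the exact client code:

```python
import random
import time

def call_with_retries(call, max_retries=2, base_delay=0.05, max_delay=0.5):
    """Run `call()` with bounded retries, exponential backoff, and full jitter.

    `call` is any zero-argument callable that raises on timeout or failure.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as error:  # in practice, catch only timeouts/transient errors
            last_error = error
            if attempt == max_retries:
                break
            # Exponential backoff with full jitter: sleep a random amount between
            # 0 and min(max_delay, base_delay * 2**attempt).
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise last_error
```

Only the service closest to the dependency retries; callers further up treat its errors as final, which keeps the cross-service retry storms mentioned above from forming.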
3. Circuit breakers and fallbacks
We introduced a circuit breaker around the dependency in the upstream services that rely on it most:
- when error rates or timeouts exceed a threshold, the circuit opens
- while open, we either serve cached or degraded data, or surface a clear partial failure
For some flows, we added explicit fallbacks:
- if the dependency is slow, we return a simpler response and queue a background job to complete non-critical work
This prevents the entire user experience from hanging on a single slow call.
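In simplified form, the circuit breaker behaves like the sketch below. It opens after a run of consecutive failures rather than a true error-rate window, and it omits locking and per-endpoint state; both are simplifications for illustration, not our production implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after consecutive failures, recovers after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cooldown, allow trial requests again ("half-open" behaviour).
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()


breaker = CircuitBreaker()

def fetch_with_fallback(fetch, fallback):
    """Use the dependency while the circuit is closed; otherwise serve the fallback."""
    if not breaker.allow_request():
        return fallback()
    try:
        result = fetch()
        breaker.record_success()
        return result
    except Exception:  # in practice, timeouts and transport errors only
        breaker.record_failure()
        return fallback()
```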
4. Dependency SLOs and alerts
We defined SLOs for the dependency itself:
- target latency distributions (P95, P99)
- acceptable error and timeout rates
We added alerts that fire when the dependency exceeds these targets, even if user-facing services are still within their SLOs.
This gives us a chance to react before latency cascades.
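A rough sketch of the check behind those alerts; the SLO numbers and function names are assumptions for illustration, not our monitoring configuration:

```python
import math

# Illustrative dependency SLO targets; real values come from the agreed SLO.
DEPENDENCY_SLO = {
    "p95_seconds": 0.300,
    "p99_seconds": 0.600,
    "timeout_rate": 0.01,   # at most 1% of calls may time out
}

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def slo_violations(latency_samples, timed_out_calls):
    """Return the SLO targets violated over a non-empty evaluation window.

    `latency_samples` covers all calls in the window, with timed-out calls
    recorded at the timeout value; `timed_out_calls` is how many timed out.
    """
    violations = []
    if percentile(latency_samples, 95) > DEPENDENCY_SLO["p95_seconds"]:
        violations.append("P95 latency above target")
    if percentile(latency_samples, 99) > DEPENDENCY_SLO["p99_seconds"]:
        violations.append("P99 latency above target")
    if timed_out_calls / len(latency_samples) > DEPENDENCY_SLO["timeout_rate"]:
        violations.append("timeout rate above target")
    return violations  # a non-empty result pages before user-facing SLOs are at risk
```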
5. Runbook updates
We updated runbooks for the affected services with:
- steps to differentiate between internal and dependency-caused latency
- where to look for dependency-specific dashboards
- how to adjust timeouts and circuit breaker thresholds safely if needed
We also added a checklist item for new dependencies: "define timeouts, retries, and SLOs before sending production traffic."
Follow-ups
Completed
- Recalibrated timeouts and retry policies for the slow dependency across the main calling services.
- Added dependency-specific SLOs and alerts.
- Implemented circuit breakers and simple fallbacks in the highest-impact paths.
Planned / in progress
- Extend the timeout and retry review to other critical dependencies.
- Build a small tool that suggests timeout values based on observed latency distributions.
- Add load tests that simulate dependency degradation and verify that timeouts and fallbacks behave as expected.
Takeaways
- A dependency that is "only slow" can be as harmful as one that is completely down.
- Timeouts should be derived from user-facing budgets, not picked once and forgotten.
- Retries without coordination and backoff can amplify latency problems.
- Circuit breakers and fallbacks turn slow dependencies into partial, understandable failures instead of cascading timeouts.
- Explicit SLOs and alerts for dependencies give you earlier, more actionable signals.