RELIABILITY · 2019-01-27 · BY STORECODE

Incident report: A retry storm

A vendor degraded, our retries amplified it, and checkout suffered. We changed retry defaults and added clearer degradation paths.

reliability · incident-response · integrations · retries · resilience

Summary

On January 27, 2019, checkout latency increased and a portion of requests timed out after an external dependency began returning slow responses.

The underlying trigger was upstream degradation. The amplifying failure mode was ours: the integration client retried timeouts too aggressively, without sufficient backoff and without jitter. That increased concurrent work precisely when the system needed less of it.

We mitigated by reducing retry pressure and temporarily degrading a non-critical dependency call. We followed up by changing retry defaults, adding circuit-breaker behavior, and documenting safe first actions for upstream instability.

This incident is a reminder that “retry” is not a harmless reliability feature. Retries are traffic multipliers.

A retry policy is part of your capacity plan.

If you don’t bound it, a dependency outage becomes a concurrency outage in your own system.

What we observed

  • Elevated upstream latency from a single dependency correlated with increased checkout latency.
  • Retry attempts increased request concurrency and reduced capacity for unrelated work.
  • Scaling helped briefly but did not change the dynamic because retries kept amplifying load.

Impact

  • Duration: 41 minutes (13:02–13:43 ET).
  • Customer impact: ~4% of checkout attempts timed out at the peak. Some users succeeded after retrying.
  • Latency impact: P95 checkout latency increased from a baseline of ~250–400ms to ~1.2–2.0s during the incident window.
  • Internal impact: elevated on-call load and a spike in “something went wrong” reports with limited actionable details.

We did not observe data loss. The primary user impact was timeouts.

What the user experienced

  • Checkouts intermittently hung and then failed.
  • Retrying often worked, but it was not a reliable recovery path.

What we measured during the window

We did not have perfect instrumentation for “retry pressure” at the start of the incident, which slowed diagnosis.

We could still infer the shape from:

  • upstream latency rising
  • a rise in in-flight request concurrency
  • a spike in timeout-driven retry attempts

Once we added ad-hoc queries for retries and in-flight counts, the feedback loop became obvious.
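For reference, the shape of that measurement is small: a counter for timeout-triggered retries and a gauge for in-flight upstream calls. A minimal in-process sketch (names are illustrative; in production this would feed whatever metrics pipeline you already run):

```python
import threading
from contextlib import contextmanager

# Minimal in-process counters; illustrative names, not a production metrics client.
_lock = threading.Lock()
in_flight = 0      # gauge: upstream calls currently outstanding
retries = 0        # counter: attempts beyond the first one

@contextmanager
def track_upstream_call():
    """Wrap every upstream attempt so in-flight concurrency is visible."""
    global in_flight
    with _lock:
        in_flight += 1
    try:
        yield
    finally:
        with _lock:
            in_flight -= 1

def record_retry():
    """Call once for every timeout-triggered retry attempt."""
    global retries
    with _lock:
        retries += 1
```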

Timeline

All times ET.

  • 12:58 — External dependency begins intermittent slow responses.
  • 13:02 — Checkout latency alert fires; timeouts begin.
  • 13:04 — On-call opens dashboards and confirms the symptom is on the critical checkout path.
  • 13:06 — Initial hypothesis: upstream dependency latency. First action is scaling, to buy time while investigating.
  • 13:10 — Scaling provides limited relief. The system remains unstable because retry volume continues to increase concurrent work.
  • 13:12 — We add ad-hoc queries to measure retry rate and in-flight request counts.
  • 13:14 — Evidence points to a retry storm: upstream calls are slow; timeouts trigger retries; retries increase load; load increases timeouts.
  • 13:18 — We decide on a mitigation sequence:
    • reduce retries immediately
    • add a short-circuit for repeated upstream failures
    • degrade a non-critical dependency call to preserve the critical path
  • 13:21 — Retry policy adjusted (reduced retry attempts and increased backoff).
  • 13:28 — Short-circuit/degrade mode enabled for the non-critical call.
  • 13:35 — Latency begins returning toward baseline; timeout rate declines.
  • 13:43 — Incident closed.

Investigation notes (why it took time)

A retry storm can look like a general capacity problem.

If you only look at the downstream symptom (timeouts), the natural response is scaling.

The faster path is to ask two questions:

  • “What external dependency got slower?”
  • “What did we do in response to it getting slower?”

That second question is where retries live.

Root cause

The integration client retried on timeouts with insufficient backoff and no jitter.

When the upstream began returning slow responses, those retries amplified traffic and increased in-flight concurrency. That reduced capacity for normal checkout handling and increased tail latency.

The amplification math (small example)

Retries are easy to underestimate because the first attempt “looks normal.” The amplification shows up when the upstream is slow.

If a request makes 1 upstream call and retries 3 times on timeout, then a single user action can generate up to 4 upstream attempts.

At 150 requests/second, that can become 600 upstream attempts/second during degradation. If the upstream is slow, those attempts also stay in flight longer, increasing concurrency and tying up worker pools.

This is why “retry count” and “retry time” are different.

  • A small number of retries with a long timeout can still exhaust capacity.
  • A higher retry count with a strict overall budget can be survivable.
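The same arithmetic in code, as a quick sanity check. The request rate and retry count mirror the example above; the 5-second per-attempt timeout is a hypothetical value, and the concurrency estimate is Little's law (arrival rate × time each attempt stays in flight):

```python
# Back-of-the-envelope retry amplification, mirroring the example above.
requests_per_second = 150
attempts_per_request = 1 + 3            # first try + 3 retries on timeout

upstream_attempts_per_second = requests_per_second * attempts_per_request
print(upstream_attempts_per_second)     # 600 attempts/s during degradation

# Little's law: concurrency ≈ arrival rate × time each attempt stays in flight.
attempt_duration_healthy_s = 0.3        # ~300 ms when the upstream is healthy
attempt_duration_degraded_s = 5.0       # hypothetical: attempts hang until a 5 s timeout

print(requests_per_second * 1 * attempt_duration_healthy_s)        # ~45 in flight on a normal day
print(upstream_attempts_per_second * attempt_duration_degraded_s)  # ~3000 in flight during the storm
```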

What made this integration risky

Two characteristics made this dependency fragile in practice:

  • It was on the critical path.
  • It was treated as “required” even though its result was not always needed to complete a checkout.

That combination creates a trap: if it slows, checkout slows.

Contributing factors we’d call out explicitly now

  • We did not have a consistent per-call timeout policy. Timeouts were “best effort” rather than an explicit budget.
  • We did not have an agreed definition of “retriable.” Timeouts were retried by default.
  • We did not have a degrade mode already practiced.

Why scaling didn’t fix it

Scaling increased capacity, but it didn’t break the feedback loop.

More capacity allowed more retries to run concurrently, which kept pressure on the upstream dependency and on our own request handling.

The right fix was reducing retry pressure and adding a controlled degrade mode.

Why jitter matters

Without jitter, retries synchronize.

If many requests time out at the same time, they also retry at the same time. That creates waves of load that can keep an upstream in a degraded state longer.

Jitter spreads retries out. It doesn’t magically fix an outage, but it reduces synchronized spikes and makes recovery smoother.
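To see what that looks like in practice, here is a minimal sketch of “full jitter” backoff (the base delay and cap are illustrative values): each retry waits a random amount between zero and the capped exponential delay, so requests that time out together do not all come back at the same instant.

```python
import random

def backoff_with_full_jitter(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Delay before retry number `attempt` (1-based), in seconds.

    Exponential growth, capped, then fully jittered: a uniform draw between
    zero and the capped exponential value. Illustrative parameters only.
    """
    exp_delay = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, exp_delay)

# Without jitter, 1000 requests that time out together all retry at the same instant.
# With full jitter, their retry times spread across the whole window:
delays = [backoff_with_full_jitter(attempt=3) for _ in range(1000)]
print(min(delays), max(delays))   # roughly 0.0 .. 0.4 seconds, spread out
```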

The bigger lesson is still the same: retries must be bounded by a budget, or they become an incident response that manufactures work.

Contributing factors

  • The integration call ran inline on the critical path.
  • Retry behavior was not bounded by a budget (no “total retry time” limit tied to a user request).
  • We had no alerting on retry rate or in-flight request concurrency. We only saw the downstream symptom (timeouts).
  • We did not have a documented degrade mode for upstream instability.

What we changed

We made the retry behavior explicit, bounded, and observable.

Retry policy changes

  • Changed retry defaults to exponential backoff with jitter.
  • Introduced a strict retry budget per request (time and attempts), so one slow upstream cannot dominate the checkout path.
  • Ensured timeouts are explicit and consistent (no indefinite waits that accumulate work).

We also tightened what “retry” means.

  • Retries require idempotency. If we can’t prove the call is safe to retry, we don’t retry it automatically.
  • We treat timeouts as a signal that we are already in degraded mode. Retrying timeouts without a budget is usually just doing more of the failing thing.

In practice, the safer default is:

  • fewer retries
  • faster failure
  • clear degrade behavior
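Put together, the client-side shape looks roughly like the sketch below: a per-request budget in both attempts and wall-clock time, full-jitter backoff between attempts, and no automatic retry unless the call is known to be idempotent. This is an illustrative sketch, not our production client; the function name and defaults are made up for the example.

```python
import random
import time

class RetryBudgetExceeded(Exception):
    pass

def call_with_retry_budget(call, *, idempotent, max_attempts=2, budget_seconds=1.0):
    """Run one upstream call under a strict retry budget.

    `call` is a zero-arg callable that performs a single attempt with its own
    explicit per-attempt timeout and raises TimeoutError when that timeout fires.
    Defaults are illustrative, not production values.
    """
    deadline = time.monotonic() + budget_seconds
    attempts = max_attempts if idempotent else 1   # never auto-retry non-idempotent calls

    last_error = None
    for attempt in range(1, attempts + 1):
        if deadline - time.monotonic() <= 0:
            break                                  # time budget spent: stop, don't pile on
        try:
            return call()
        except TimeoutError as err:
            last_error = err
            if attempt == attempts:
                break                              # attempt budget spent
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            # Full-jitter backoff, never sleeping past the budget deadline.
            time.sleep(random.uniform(0, min(0.1 * (2 ** (attempt - 1)), remaining)))

    raise RetryBudgetExceeded("retry budget exhausted for this request") from last_error
```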

Circuit breaker / degrade mode

  • Added a circuit-breaker style short-circuit for repeated upstream failures.
  • Implemented a degrade mode for a non-critical upstream call so checkout can proceed when the dependency is unstable.

“Degrade mode” here does not mean “hide errors.” It means we decide, explicitly, what the system should do when a non-critical dependency is slow.

In practice, degrade mode had three parts:

  1. A clear boundary: what we will skip when the dependency is unhealthy.
    • The checkout path remains available.
    • The non-critical call is bypassed.
  2. A visible signal: the system tells operators it is degraded.
    • We emit a single bounded log/metric indicating “dependency bypassed” (so it can be monitored without exploding volume).
    • We track the bypass rate as a percentage of checkouts.
  3. A stop condition: when we turn the bypass back off.
    • After upstream latency recovers and stays stable for a window.
    • After we have a clear rollback path if enabling the call again causes regressions.

This is the difference between “we got lucky” and “we have a plan.”

Without an explicit degrade path, the only available move during upstream instability is to wait (or to keep retrying). Both are failure modes.
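For the shape of it, a minimal breaker-plus-bypass sketch (thresholds, names, and the half-open behavior are illustrative, not our production implementation): after enough consecutive failures the breaker opens, the non-critical call is skipped, and every bypass is recorded so the rate shows up on dashboards.

```python
import time

class CircuitBreaker:
    """Trip after consecutive failures; skip the non-critical call while open."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold   # illustrative threshold
        self.open_seconds = open_seconds             # how long to stay open before probing
        self.consecutive_failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.open_seconds:
            self.opened_at = None                    # half-open: let the next call probe
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def call_non_critical_dependency(call, record_bypass):
    """Degrade mode: skip the call when the breaker is open, and make it visible."""
    if not breaker.allow_request():
        record_bypass()        # bounded metric/log: "dependency bypassed"
        return None            # checkout proceeds without this call
    try:
        result = call()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        record_bypass()
        return None            # fail soft on the non-critical path
```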

Operational documentation

  • Updated the runbook with an “upstream latency” section:
    • how to confirm upstream degradation
    • how to measure retry pressure
    • safe first actions (reduce retries, enable degrade mode)
    • rollback and stop conditions

We also made the safe first actions concrete.

The runbook now starts with:

  • check upstream latency vs our own latency
  • check retry rate and in-flight concurrency
  • reduce retries before scaling

Scaling is sometimes the right move, but it’s the wrong first move when the system is manufacturing work via retries.

Follow-ups

Monitoring

  • Add alerts on retry rate and in-flight request concurrency (not just error rate).
  • Add a synthetic check for upstream dependency latency.
  • Add an alert for “retries per successful checkout” to catch amplification early.

A subtle point: error rate can look “fine” during a retry storm because retries mask failures until capacity collapses. Retry metrics surface the risk earlier.
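The “retries per successful checkout” signal is just a ratio over a rolling window. A sketch of the check, with hypothetical metric values and an illustrative threshold:

```python
def retries_per_successful_checkout(retry_count: int, successful_checkouts: int) -> float:
    """Amplification signal: how many upstream retries each successful checkout costs."""
    if successful_checkouts == 0:
        return float("inf")
    return retry_count / successful_checkouts

# Hypothetical values over a rolling 5-minute window; the threshold is illustrative.
WINDOW_RETRIES = 1800
WINDOW_CHECKOUTS = 4500
ALERT_THRESHOLD = 0.25         # page when retries exceed 0.25 per successful checkout

ratio = retries_per_successful_checkout(WINDOW_RETRIES, WINDOW_CHECKOUTS)
if ratio > ALERT_THRESHOLD:
    print(f"ALERT: retry amplification at {ratio:.2f} retries per successful checkout")
```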

Engineering guardrails

  • Standardize retry defaults across services so behavior is consistent and reviewable.
  • Require a documented degrade mode for any non-critical upstream dependency on a critical path.
  • Require a runbook section for upstream dependencies: “what we do when this gets slow.”

The long-term goal is simple: upstream slowness should degrade the system predictably, not invent a new failure mode.

Testing / rehearsal

A retry storm is easy to reintroduce accidentally.

We added a lightweight rehearsal to keep the behavior honest:

  • In a safe environment, introduce artificial upstream slowness and verify that:
    • retries do not exceed the per-request budget
    • the circuit breaker trips when expected
    • degrade mode can be enabled quickly and is visible in dashboards

This isn’t “chaos engineering.” It’s a check that the knobs we rely on during incidents still work.
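A rehearsal like this can live in an ordinary test suite. The sketch below assumes the budgeted retry wrapper from the retry-policy sketch above (`call_with_retry_budget` and `RetryBudgetExceeded`) and fakes the upstream by raising a timeout on every attempt:

```python
import time

def test_retry_budget_holds_under_upstream_slowness():
    """Simulate a slow upstream and check that the retry budget caps total work."""
    attempts = []

    def slow_upstream():
        # Pretend every attempt hit its per-attempt timeout.
        attempts.append(time.monotonic())
        raise TimeoutError("simulated upstream slowness")

    start = time.monotonic()
    try:
        call_with_retry_budget(slow_upstream, idempotent=True,
                               max_attempts=2, budget_seconds=1.0)
    except RetryBudgetExceeded:
        pass

    elapsed = time.monotonic() - start
    assert len(attempts) <= 2               # attempt cap respected
    assert elapsed <= 1.5                   # wall-clock budget respected (with slack)
```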

Vendor operations

We also documented the operational side of integrations:

  • how to confirm upstream degradation (status checks and synthetic probes)
  • escalation path and expected response time
  • what we tell support while the system is in degrade mode

A vendor outage is not under your control. Your retry policy is.

One practical change we made after this incident: we stopped treating retries as “extra resilience” and started treating them as “planned load.”

If a dependency is down, the system should fail fast or degrade predictably.

Retries are only safe when they are bounded, observable, and tied to a decision.

If you can’t answer “how many retries are we doing right now?” during an incident, you don’t really know whether you’re recovering or digging deeper.

Customer-facing clarity

  • Improve customer-facing error messaging for known upstream failure modes (still generic, but more actionable than “something went wrong”).
