RELIABILITY · 2018-08-23 · BY STORECODE

Incident report: Timeouts caused by request logging

A small observability change increased log volume enough to overload the service. Rollback fixed it; we changed how we ship logging changes.

reliability · incident-response · observability · logging · rollback

Summary

On August 23, 2018, the primary API behind checkout experienced elevated latency and timeouts immediately after a deploy.

The regression was introduced by an observability change: we added a new info-level structured log line on every request path. Under peak traffic, the service became I/O bound on log writes and spent a material amount of time logging instead of serving requests.

We mitigated by rolling back the deploy. Latency returned to baseline within minutes.

This incident is worth writing down because it’s a common failure mode: a “small” telemetry change that looks harmless in code review becomes dominant work in production.

In most systems, observability code ships more often than reliability work.

That creates a trap: teams treat logging and metrics as “safety,” and therefore assume they can’t be the thing that breaks production. But observability runs in the same hot paths as everything else. It consumes the same CPU, the same I/O, and the same human attention.

If you don’t make telemetry changes bounded and reversible, you’re eventually going to ship a telemetry incident.

What we observed

  • Latency increased right after the deploy finished.
  • Timeouts increased without a corresponding spike in application errors.
  • Host-level signals (CPU and disk I/O) rose in parallel with log volume.

Mitigation

  • We attempted to scale instances first. That reduced the symptom briefly but did not address the underlying per-request work.
  • We rolled back, which immediately reduced log volume and restored baseline latency.

Impact

  • Duration: 34 minutes (09:14–09:48 ET).
  • Customer impact: ~6% of requests to the checkout endpoint timed out at the peak. Some users succeeded after retrying.
  • Latency impact: P95 latency for the endpoint increased from a baseline of ~180–250ms to ~1.5–2.5s during the incident window.
  • Internal impact: elevated on-call load and a spike in “checkout is slow / failing” reports. Support could not provide a more specific user-facing explanation because the failure was a generic timeout.

We did not observe data loss. The primary user impact was failed or delayed requests.

What the user experienced

  • Some checkout attempts hung and eventually failed.
  • Retries often succeeded, but not predictably (which increases user frustration and support contact volume).

What we measured during the window

We didn’t have a perfect “log volume per service” dashboard at the time, but we could triangulate from:

  • disk write saturation on instances
  • elevated iowait (CPU time spent waiting on disk I/O)
  • a sharp increase in log lines emitted per request

That combination is the signature of “the service is spending time writing logs.”

Timeline

All times ET.

  • 09:10 — Deploy starts.
  • 09:14 — Latency alert triggers (P95 up; timeout rate starts climbing).
  • 09:16 — On-call opens the service dashboard. The latency change correlates with the deploy window.
  • 09:18 — First mitigation attempt: scale instances. Immediate effect: a short dip in timeouts, followed by regression as traffic and log volume continue.
  • 09:21 — Initial hypotheses:
    • database degradation
    • upstream dependency degradation
    • regressions from the deploy (serialization, new computation on request path)
  • 09:24 — Host metrics show a different pattern: CPU rises and disk writes spike; request handling threads appear to be waiting.
  • 09:27 — Investigation shifts to per-request work added by the deploy.
  • 09:29 — We compare the change diff to the incident signals and notice a new info-level request log line on the critical path.
  • 09:31 — Working hypothesis: synchronous log flushing is making request handling I/O bound.
  • 09:33 — We confirm log volume increased sharply compared to baseline (bytes/second and log lines/second).
  • 09:36 — Rollback initiated.
  • 09:40 — Latency returns to baseline; timeout rate falls.
  • 09:44 — We keep the service under observation to confirm stability and avoid a “flap.”
  • 09:48 — Incident closed; follow-ups drafted.

Investigation notes (what we ruled out)

The first ten minutes were mostly about scoping the blast radius.

  • Database: we did not see the expected signs of a database-originated regression (query latency increases preceding application latency).
  • Upstream dependencies: no correlated latency spikes on the primary dependencies.
  • Code path changes unrelated to logging: no obvious new heavy computation in the deploy.

The turning point was noticing that the shape of the incident looked like “per-request overhead” and then matching that to the deploy diff.

Root cause

The deploy introduced a new info-level structured log line on every request path.

The service’s logging configuration flushed synchronously. Under peak traffic, the additional logging work moved from “background noise” to “dominant cost.” Requests spent time blocked on log writes, increasing tail latency and timeouts.
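
As a sketch of the shape of the change (Python here, with placeholder names; this is not our actual handler), the new line amounted to serializing a payload and writing it synchronously inside the request handler:

    import json
    import logging

    logger = logging.getLogger("request")
    # The handler writes to local disk on the request thread and flushes
    # after every record, so under disk pressure the request blocks on I/O.
    handler = logging.FileHandler("requests.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    def handle_request(request):
        response = do_work(request)   # placeholder for the real request work
        # New in this deploy: one structured INFO line on every request.
        logger.info(json.dumps({
            "path": request.path,
            "status": response.status,
            "duration_ms": response.duration_ms,
        }))
        return response

Each piece is cheap on its own. The problem is that all of it runs inside every request, and the write can block.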

How logging became the bottleneck

Two things happened at once:

  1. We increased the amount of work per request:
    • serializing a structured payload
    • formatting and emitting a log line
  2. We made that work blocking. When the downstream log pipeline slowed (or local disk I/O became saturated), request handling threads waited.

A rough way to think about it:

  • If a request takes 200ms and logging adds 5ms, you won’t notice.
  • If a request takes 200ms and logging adds 50ms under load (because writes block), you will.
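
The numbers are illustrative, but the capacity math behind them is worth spelling out, because per-request overhead also eats worker throughput:

    # Illustrative numbers, not measurements from the incident.
    service_ms = 200     # request work before the change
    logging_ms = 50      # blocking log write under I/O pressure

    before = 1000 / service_ms                 # 5.0 requests/s per worker
    after = 1000 / (service_ms + logging_ms)   # 4.0 requests/s per worker
    print(f"capacity lost per worker: {1 - after / before:.0%}")   # 20%

Lose a fifth of your capacity at peak and workers that were already near saturation start queueing. The queueing, not the extra 50ms itself, is what turns a ~250ms P95 into multi-second timeouts.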

We also learned an uncomfortable fact: scaling can make this worse. Adding instances increases total log volume. If the bottleneck is shared (or if each instance hits the same I/O ceiling), you can spend money and still time out.

Why we didn’t catch it before production

The change looked correct in isolation.

We missed the operational dimension:

  • staging traffic was not comparable to peak production traffic
  • the logging pipeline behaved differently under bursty load
  • we did not have a “log bytes/second” budget for this service

This isn’t a criticism of one deploy. It’s a gap in our guardrails.

Why scaling didn’t fix it

Scaling reduced per-instance request load, but it did not reduce total log volume. In fact, scaling can increase logging overhead when the bottleneck is shared I/O or a shared logging pipeline.

Because the regression was “extra work per request,” adding more instances didn’t remove the extra work. It just spread it out and prolonged the time it took us to reach the correct mitigation.

What made this change risky

This wasn’t “logging exists.” Logging already existed.

The risk was introducing:

  • A log line that fired on every request, on the hottest endpoint.
  • A structured payload that was larger than the previous baseline log messages.
  • A configuration where the write path could block request handling.

Contributing factors

  • We did not have a guardrail alert on log volume, log write latency, or disk write saturation.
  • We did not have a “safe first action” for logging regressions documented in the runbook.
  • Our deploy checklist did not include a specific check for “does this add per-request work?” (logging, serialization, tracing).
  • We did not have a canary signal tied to logging changes (e.g., “log bytes/second increased by X% within Y minutes of deploy”).

What we changed

We treated this as an operational failure mode, not a one-off bug.

Immediate remediation

  • Rolled back to restore the previous behavior.
  • Verified latency and timeout rates returned to baseline.

Logging changes

  • Moved high-volume request logs behind sampling.
  • Made the default request path cheaper by removing unnecessary per-request logging.
  • Reduced or removed unbounded/high-cardinality fields from the default logs.
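
A minimal sketch of what “behind sampling” can look like with Python’s standard logging module (the 1% rate and logger name are illustrative); warnings and errors bypass the sample so failures are never dropped:

    import logging
    import random

    class SampleFilter(logging.Filter):
        """Keep roughly `rate` of INFO/DEBUG records; always keep WARNING and above."""

        def __init__(self, rate: float):
            super().__init__()
            self.rate = rate

        def filter(self, record: logging.LogRecord) -> bool:
            if record.levelno >= logging.WARNING:
                return True
            return random.random() < self.rate

    request_logger = logging.getLogger("request")
    request_logger.addFilter(SampleFilter(rate=0.01))   # keep ~1% of per-request lines

Sampling turns log counts into estimates, so anything that needs exact numbers should come from metrics, not logs.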

We also made the difference between “debugging” and “operating” explicit.

  • Debugging logs are useful, but they are not free.
  • Operating logs should be bounded and queryable.

A rule we adopted after this incident:

  • If it can fire on every request, it must be sampled or gated.

That rule is intentionally blunt. It prevents “harmless” log lines from turning into production work.

Operational controls

  • Made log level and sampling adjustable without requiring a full deploy (see the sketch after this list).
  • Added a “logging regression” section to the runbook, including:
    • what to check (log volume, disk I/O)
    • the safest first actions (turn down sampling, reduce log level)
    • when to roll back
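
A sketch of the “without requiring a full deploy” part, assuming the knobs live in a small config file the service polls (the path, field names, and interval are illustrative):

    import json
    import logging
    import threading

    CONFIG_PATH = "/etc/service/logging.json"   # illustrative path
    request_logger = logging.getLogger("request")

    def apply_logging_config() -> None:
        """Re-read the log level from a file that can be pushed without a deploy."""
        try:
            with open(CONFIG_PATH) as f:
                cfg = json.load(f)
        except (OSError, ValueError):
            return   # keep current settings if the file is missing or malformed
        request_logger.setLevel(cfg.get("request_log_level", "INFO"))

    def watch_logging_config(interval_s: float = 30.0) -> None:
        """Poll so a change takes effect within one interval, with no restart."""
        apply_logging_config()
        timer = threading.Timer(interval_s, watch_logging_config, args=(interval_s,))
        timer.daemon = True
        timer.start()

The same polling loop can carry the sampling rate; the point is that turning logging down is a config change, not a deploy.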

We also made the “first ten minutes” explicit.

The runbook now tells you, in order:

  1. Confirm correlation with deploy.
  2. Check log volume (lines/second, bytes/second) alongside disk writes.
  3. If log volume spiked, turn down sampling / log level first.
  4. If the service is still timing out, roll back.

That ordering matters. The goal is to prevent “random debugging” from becoming the response pattern.

Deployment guardrails

  • Added a deployment checklist item: “does this change increase per-request work (logging, serialization, tracing)?”
  • Added dashboards and alerts for:
    • log lines/second and bytes/second
    • disk write saturation
    • request latency correlated with deploy windows
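
The alert itself lives in the metrics stack; as a sketch of the condition we want (the 50% threshold is illustrative), compare log bytes/second shortly after a deploy to the pre-deploy baseline:

    def log_volume_regressed(bytes_per_s_after_deploy: float,
                             bytes_per_s_baseline: float,
                             max_increase: float = 0.5) -> bool:
        """True if log volume rose more than `max_increase` over the pre-deploy baseline."""
        if bytes_per_s_baseline <= 0:
            return bytes_per_s_after_deploy > 0
        growth = (bytes_per_s_after_deploy - bytes_per_s_baseline) / bytes_per_s_baseline
        return growth > max_increase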

We also changed the review questions we ask on any telemetry change:

  • What decision will this enable during an incident?
  • What’s the expected volume (per request, per minute, per deploy)?
  • How do we turn it down without redeploying?
  • What’s the safe failure mode if the telemetry pipeline is slow?

Follow-ups

Completed

  • Added alerts for abnormal log volume and disk I/O saturation.
  • Added a runbook entry for “latency after deploy” that includes log volume as a first-class suspect.

Planned / in progress

  • Add a simple load test that fails if logging overhead exceeds a threshold for the hot path (sketched after this list).
  • Ensure telemetry changes are reviewed like production changes: scoped, bounded, and reversible.
  • Make log level / sampling configuration discoverable from the runbook (so support and on-call can find it quickly).
  • Add a deploy canary check: alert when log bytes/second increases sharply right after a deploy.
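
The load test in the list above can be blunt: run the hot path with logging on and off and fail if the gap exceeds a budget. A sketch, reusing the handle_request shape from the earlier example (make_hot_path_requests is a placeholder for building representative traffic, and the 5% budget is illustrative):

    import logging
    import time

    def mean_request_seconds(handler_fn, requests, logs_enabled: bool) -> float:
        """Average seconds per request for the hot path."""
        logging.getLogger("request").disabled = not logs_enabled
        start = time.perf_counter()
        for r in requests:
            handler_fn(r)
        return (time.perf_counter() - start) / len(requests)

    def test_logging_overhead_budget():
        requests = make_hot_path_requests()   # placeholder: representative traffic
        with_logs = mean_request_seconds(handle_request, requests, logs_enabled=True)
        without_logs = mean_request_seconds(handle_request, requests, logs_enabled=False)
        logging.getLogger("request").disabled = False   # restore state
        assert with_logs <= without_logs * 1.05   # fail if logging adds more than 5%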

A subtle goal here is reducing “rollback as the only volume knob.” If the only way to reduce log volume is to roll back the whole deploy, you’ll hesitate.

A safe system has multiple knobs:

  • sampling rate
  • log level
  • targeted disabling of one emitter

Those knobs need to be documented and accessible.
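
With standard Python logging, the last two knobs are one-liners; the first is the rate on the sampling filter shown earlier (logger names here are illustrative):

    import logging

    logging.getLogger("request").setLevel(logging.WARNING)   # raise the level
    logging.getLogger("request.audit").disabled = True       # silence one emitter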

What we’d do differently next time

  • Canary telemetry changes the same way we canary feature changes.
  • Prefer “turn it down” mitigations that don’t require rollbacks when safe (sampling/level), but keep rollback as the default when the blast radius is unclear.

Closing note

We treat this as a reliability incident, not a logging incident.

The user doesn’t care what subsystem caused their checkout to time out.

They care that it did.

The lesson we keep is operational: anything on the critical path needs a budget, a rollback story, and a safe way to reduce pressure.

Telemetry is no exception.

After this incident, we started treating “log volume” as something that can regress, and watching it the same way we watch latency.

The goal isn’t to stop logging.

The goal is to make logging predictable: bounded in volume, fast to query, and easy to turn down when it becomes a cost.
