Security · 2024-06-18 · by StoreCode

Incident report: Partial outage from a misconfigured rate limiter tier

A change to rate-limiting configuration in only one tier caused uneven outages. We describe what broke and how we aligned limits with SLOs.

Tags: security, rate-limiting, incidents, reliability

Summary

On June 18, 2024, a misconfigured rate limiter in one of our HTTP tiers caused a partial outage for a subset of traffic.

The change was intended to tighten limits for abusive patterns.

Due to a configuration mistake, it:

  • applied aggressive limits to some legitimate clients
  • did not meaningfully reduce the abusive traffic it targeted

The result was an outage that looked uneven:

  • some users experienced repeated "too many requests" errors
  • others saw normal behavior

We rolled back the change and aligned rate-limiting configuration with our SLOs and application behavior.

Impact

  • Duration: approximately 42 minutes of elevated 429 responses for affected clients.
  • User impact:
    • some users were blocked from completing actions despite low actual request rates
    • support saw a spike in "locked out" and "rate-limited" complaints
  • Internal impact:
    • confusion about whether the issue was application-level or edge-level
    • time spent correlating logs and metrics across multiple layers

No data loss occurred, but trust in the affected flows took time to repair.

Timeline

All times local.

  • 13:02 — Rate-limiter configuration change is deployed to one of the edge tiers.
  • 13:08 — Metrics show a rise in 429 (Too Many Requests) responses from that tier.
  • 13:12 — On-call is paged for elevated 4xx rates on a set of endpoints.
  • 13:18 — Initial investigation focuses on the application. Application logs show no matching increase in request volume or internal error rates.
  • 13:24 — Comparison between regions shows that only the region with the new rate-limiter config has elevated 429s.
  • 13:28 — On-call inspects rate-limiter logs and sees that many blocked clients are within normal request patterns.
  • 13:31 — Decision is made to roll back the new configuration to the previous version.
  • 13:35 — Rollback completes; 429 rates begin to fall toward baseline.
  • 13:44 — Metrics confirm that normal clients are no longer being blocked.
  • 14:05 — Incident closed with follow-ups focused on configuration review and testing.

Root cause

The root cause was a configuration change that:

  • applied new limits based on a header that did not reliably distinguish abusive from normal traffic
  • used thresholds that were out of line with real-world request patterns

Specifically (see the sketch at the end of this section):

  • the limiter was keyed on a combination of client identifier and path
  • the threshold for some paths assumed far fewer requests per minute than legitimate clients actually produced

This issue was compounded by:

  • deploying the change to only one tier first, causing regional and cohort inconsistencies
  • limited pre-deploy testing against realistic traffic patterns
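
To make the failure mode concrete, here is a minimal sketch (not our actual edge limiter): a fixed-window counter keyed on (client identifier, path) whose threshold sits well below real legitimate rates. The 10-requests-per-minute threshold, the class name, and the example path are assumptions for illustration only.

```python
import time
from collections import defaultdict

# Illustrative only: a fixed-window limiter keyed on (client_id, path),
# matching the keying described above. The 10 req/min threshold and the
# example path are assumptions, not our production configuration.
class FixedWindowLimiter:
    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.counts = defaultdict(int)   # (client_id, path) -> count in window
        self.window_start = time.time()

    def allow(self, client_id: str, path: str) -> bool:
        now = time.time()
        if now - self.window_start >= 60:   # start a new one-minute window
            self.counts.clear()
            self.window_start = now
        key = (client_id, path)
        self.counts[key] += 1
        return self.counts[key] <= self.limit

# The misconfigured threshold assumed ~10 requests/minute per (client, path),
# but legitimate clients on some paths routinely made 30+.
limiter = FixedWindowLimiter(limit_per_minute=10)
blocked = sum(
    0 if limiter.allow("legit-client", "/api/checkout") else 1
    for _ in range(30)
)
print(f"blocked {blocked} of 30 legitimate requests")  # -> blocked 20 of 30
```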

Contributing factors

  • Unclear ownership. Multiple teams touched rate-limiting (infrastructure, application security, product), but no single owner had the full picture.
  • Inadequate simulation. We tested the new rules against synthetic traffic, not real client patterns.
  • Sparse documentation. The intended behavior and rationale for the new limits were not clearly documented.

What we changed

1. Align rate limits with SLOs and real usage

We revisited our rate-limiting strategy:

  • for each protected endpoint, we looked at typical and peak legitimate request rates
  • we based limits on those patterns plus safety margins, not guesses (see the sketch below)
  • we ensured that limits did not conflict with our SLOs (e.g., latency, success rates)

We also documented:

  • which endpoints are protected by which limits
  • which client dimensions we rely on (IP, auth token, tenant, etc.)
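
A minimal sketch of deriving a limit from observed legitimate traffic rather than guessing; the sample format, the choice of p99, and the 3x safety margin are illustrative assumptions, not our exact policy.

```python
import math
from statistics import quantiles

# Hypothetical sketch: derive a per-endpoint limit from observed legitimate
# request rates instead of guessing. The p99 choice and the 3x safety margin
# are assumptions for illustration, not our exact policy.
def derive_limit(observed_rpm: list, safety_factor: float = 3.0) -> int:
    """observed_rpm: requests-per-minute samples from legitimate clients on one endpoint."""
    p99 = quantiles(observed_rpm, n=100)[98]   # 99th percentile of observed rates
    return math.ceil(p99 * safety_factor)

# Example: an endpoint whose legitimate clients peak around 40-60 req/min
# ends up with a limit well above those peaks, but still bounded.
samples = [12, 18, 25, 31, 40, 44, 52, 58, 60, 22] * 10
print(derive_limit(samples))
```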

2. Test rules against replayed traffic

Before deploying new rate-limiting rules, we now:

  • replay a slice of real traffic in a lower environment
  • simulate how the new rules would classify and throttle requests

We compare:

  • which requests would be limited
  • whether those are truly abusive or within normal patterns
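
Roughly, the replay check looks like the sketch below. The request format, the rule interface, and the `known_good` labeling are hypothetical; the point is to count how much legitimate traffic a candidate rule would have limited before it is ever enforced.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical replay harness. The request format, the rule interface, and
# the `known_good` labeling are assumptions; the goal is to see how much
# legitimate traffic a candidate rule would have limited.
Request = dict  # e.g. {"client_id": "...", "path": "...", "minute": 17}

def simulate(rule: Callable[[Request], bool],
             requests: Iterable[Request],
             known_good: set) -> Counter:
    outcome = Counter()
    for req in requests:
        if not rule(req):
            outcome["allowed"] += 1
        elif req["client_id"] in known_good:
            outcome["limited_legitimate"] += 1   # the number we watch closely
        else:
            outcome["limited_other"] += 1
    return outcome

# Candidate rule: limit any (client, path) seen more than 20 times in a minute.
def candidate_rule(threshold: int = 20) -> Callable[[Request], bool]:
    counts = Counter()
    def rule(req: Request) -> bool:
        key = (req["client_id"], req["path"], req["minute"])
        counts[key] += 1
        return counts[key] > threshold
    return rule

# replayed = ...  # a slice of real traffic exported from logs (format assumed)
# print(simulate(candidate_rule(), replayed, known_good={"tenant-a", "tenant-b"}))
```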

3. Centralize ownership and review

We clarified ownership for rate-limiting:

  • one team is responsible for the overall strategy and review
  • changes from other teams go through a short review that checks:
    • alignment with existing limits
    • impact on legitimate usage

This doesn’t block experimentation; it makes it visible.

4. Improve observability at the limiting layer

We added metrics and logs that show:

  • which rules are triggering most often
  • breakdown of limited traffic by client attributes
  • correlations with application-level errors and SLOs

Dashboards now have a clear "rate limiting" section so responders can quickly distinguish application-level from edge-level causes during an incident.
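
As an illustration, the kind of structured event we emit for each limit decision looks something like this; the field names and the per-rule identifier are assumptions, not the exact schema.

```python
import json
import logging

log = logging.getLogger("rate_limiter")

# Illustrative structured event emitted for each limit decision. The field
# names and the per-rule identifier are assumptions, not our exact schema;
# the point is that limited traffic can be broken down by rule and client
# attributes and correlated with application-level errors and SLOs.
def emit_limit_event(rule_id: str, client_id: str, tenant: str,
                     path: str, limited: bool) -> None:
    log.info(json.dumps({
        "event": "rate_limit_decision",
        "rule_id": rule_id,       # which rule made the decision
        "client_id": client_id,   # the keying dimension in use
        "tenant": tenant,         # enables cohort breakdowns on dashboards
        "path": path,
        "limited": limited,
    }))
```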

5. Safer rollout patterns

We adjusted how we roll out rate-limiter changes:

  • start in shadow mode where possible (evaluate but don’t enforce; see the sketch after this list)
  • roll out by small cohorts with clear rollback paths
  • avoid serving different behavior to similar users in different regions for extended periods
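
A minimal sketch of shadow mode, assuming a simple per-cohort enforcement flag and a pluggable rule; the names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

# Minimal sketch of shadow mode: evaluate the new rule, record what it would
# have done, but only enforce when the rollout flag says so. The flag and the
# rule interface are assumptions for illustration.
@dataclass
class Rollout:
    enforce: bool = False   # start in shadow mode; flip per cohort later

def handle(request, new_rule: Callable, rollout: Rollout) -> bool:
    """Return True if the request should be rejected with a 429."""
    would_limit = new_rule(request)
    if would_limit and not rollout.enforce:
        # Shadow mode: observe and compare against expectations, don't block.
        print(f"shadow: would have limited {request!r}")
        return False
    return would_limit
```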

Follow-ups

Completed

  • Rolled back the misconfigured rule.
  • Updated rate-limiting configs for key endpoints based on real usage.
  • Improved dashboards for rate-limiting events.

Planned / in progress

  • Build small tools to help simulate and visualize rule impact before deployment.
  • Integrate rate-limiting configuration into the same review and change-management flows as other critical configs.

Takeaways

  • Rate limits need to be designed with real traffic and SLOs in mind, not just worst-case guesses.
  • Misconfigurations at edge tiers can create uneven, hard-to-debug outages.
  • Shadow evaluation and replayed-traffic testing are cheap compared to production surprises.
  • A short review question—"who are we protecting, and how will we see if we’re blocking the wrong people?"—catches many mistakes earlier than detailed reviews of rule syntax do.
