RELIABILITY · 2022-12-12 · BY STORECODE

Incident report: Rate limit applied to the wrong cohort

A misconfigured rate-limit rule throttled legitimate traffic instead of abusive clients. We describe how it happened and what we changed.

reliability · rate-limiting · incident-response · configuration

Summary

On December 12, 2022, a misconfigured rate-limit rule applied aggressive limits to a cohort of legitimate users instead of the abusive clients it was intended to throttle.

The rule was deployed as part of a mitigation for a burst of automated traffic.

Due to an error in how the rule matched requests, it primarily affected a large group of normal users sharing a common characteristic.

For roughly 50 minutes, some legitimate requests were rejected with rate-limit errors while the abusive traffic continued largely unaffected.

We rolled back the rule, restored normal behavior, and then redesigned how we define and test rate limits.

Impact

  • Duration: ~50 minutes of degraded experience for affected users.
  • User impact:
    • a subset of users saw rate-limit errors or were forced to retry actions
    • some flows (such as repeated form submissions) became temporarily unusable
  • Internal impact:
    • increased support tickets about "blocked" actions
    • on-call engineers diverted from other work to debug the rule

No data was lost, but some users were unable to complete actions until the rule was rolled back.

Timeline

All times local.

  • 16:03 — Abnormal traffic pattern detected for a specific endpoint: sustained high request rate from a small number of clients.
  • 16:12 — A new rate-limit rule is drafted to target the suspected abusive pattern.
  • 16:19 — The rule is deployed with a deliberately low threshold, intended to be adjusted later.
  • 16:24 — Error rates begin to rise for the affected endpoint, but not from the suspected source IP ranges.
  • 16:28 — First support tickets arrive from users who report being blocked after a small number of actions.
  • 16:31 — The on-call engineer notices that the rate-limit counter is being incremented by a much broader set of users than expected.
  • 16:36 — Investigation shows that the rule is keying on a property shared by many legitimate users (e.g., a shared tag or region) instead of the abusive pattern.
  • 16:40 — Decision made to roll back the new rule entirely and rely on existing, broader protections while a fix is prepared.
  • 16:45 — Rollback completed. Error rates and reported blocks begin to drop.
  • 16:53 — Metrics confirm that legitimate traffic is no longer being throttled.
  • 17:10 — Incident closed with follow-ups captured.

Root cause

The root cause was a mis-specified match condition in the rate-limit configuration.

Instead of keying on a combination of attributes that identified the abusive clients (such as IP reputation and specific request patterns), the rule:

  • keyed primarily on a header that many legitimate clients shared
  • applied the limit at a scope much broader than intended

This meant legitimate users quickly consumed the shared limit, while the abusive clients slipped through a different path.
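
To make the failure mode concrete, the difference between what shipped and what was intended looked roughly like the sketch below. The rule format and field names are illustrative only, not our actual configuration.

  # Illustrative sketch -- field names and values are hypothetical.

  # What shipped: matched on a header shared by many legitimate clients and
  # counted everyone who matched against a single shared bucket.
  deployed_rule = {
      "match": {"header:x-client-platform": "mobile"},   # too broad
      "key": "header:x-client-platform",                 # one counter for the whole cohort
      "limit": 30,                                        # requests
      "window_seconds": 60,
  }

  # What was intended: match on attributes specific to the abusive pattern
  # and count each client separately.
  intended_rule = {
      "match": {
          "ip_reputation": "poor",
          "path": "/v1/submit",
          "user_agent_pattern": "automation-*",
      },
      "key": "client_ip",                                 # per-client counter
      "limit": 30,
      "window_seconds": 60,
  }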

Contributing factors:

  • Rushed rule design. The rule was created under time pressure, with limited peer review.
  • Insufficient test coverage. We did not have a way to safely simulate how the rule would affect different cohorts before deploying it.
  • Limited visibility. Our dashboards showed aggregate rate-limit counters but not breakdowns by cohort, making it harder to see who was being throttled until user reports arrived.

What we changed

1. Safer rate-limit rule design

We introduced a small design checklist for new rate-limit rules:

  • What is the exact behavior we are trying to constrain?
  • What attributes uniquely identify that behavior?
  • What are the plausible false-positive cohorts?

Rules must:

  • key on attributes that are as specific as possible to the abusive pattern
  • avoid relying solely on broad identifiers shared by normal users
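
One way to enforce the second requirement mechanically is to reject limit keys built only from broad identifiers. The helper below is a minimal sketch of that idea; the attribute names and the function itself are hypothetical, not our production code.

  # Hypothetical helper: build a rate-limit key from request attributes and
  # refuse keys made up only of broad, widely shared identifiers.
  BROAD_ATTRIBUTES = {"platform", "region", "app_version"}

  def build_limit_key(attributes: dict) -> str:
      specific = {k: v for k, v in attributes.items() if k not in BROAD_ATTRIBUTES}
      if not specific:
          raise ValueError(
              "limit key uses only broad attributes; include something "
              "specific to the abusive pattern (e.g. client_ip or token_id)"
          )
      return "|".join(f"{k}={v}" for k, v in sorted(attributes.items()))

  # Example: a per-client key scoped to one endpoint.
  key = build_limit_key({"client_ip": "203.0.113.7", "path": "/v1/submit"})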

2. Shadow evaluation mode

We added a "shadow" mode for rate-limit rules:

  • the rule evaluates and records what would have been limited
  • no real traffic is blocked yet

This allows us to:

  • see which cohorts would be affected
  • adjust thresholds before enforcing the rule
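
A minimal sketch of how shadow evaluation can work, assuming an existing limit check and a metrics client; the names below are illustrative rather than our real interfaces:

  ALLOW, BLOCK = "allow", "block"

  def apply_rule(rule, request, check_limit, cohort_of, metrics):
      """Evaluate a rate-limit rule; shadow rules record hits but never block."""
      would_limit = check_limit(rule, request)    # same evaluation as an enforcing rule
      if rule.get("shadow", False):
          if would_limit:
              # Record who would have been limited so the affected cohorts can be
              # reviewed before the rule is switched to enforcing.
              metrics.increment(
                  "rate_limit.shadow_hit",
                  tags={"rule": rule["name"], "cohort": cohort_of(request)},
              )
          return ALLOW                            # never affects real traffic
      return BLOCK if would_limit else ALLOW

Switching a rule from shadow to enforcing is then a small configuration change, made only after the recorded cohorts look right.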

3. Better breakdowns in dashboards

We updated dashboards to break down rate-limit events by:

  • client attributes (where appropriate, and anonymized)
  • region or other relevant dimensions

This makes it faster to see if a rule is unexpectedly hitting a large group of normal users.
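
The instrumentation behind this is simple; below is a sketch of the kind of event we record, with made-up metric and tag names:

  def record_limit_event(metrics, rule_name, request):
      # Emit the block as a tagged counter so dashboards can break it down
      # by cohort instead of showing only an aggregate total.
      metrics.increment(
          "rate_limit.blocked",
          tags={
              "rule": rule_name,
              "endpoint": request.path,
              "region": request.region,
              "client_type": request.client_type,   # anonymized cohort attribute
          },
      )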

4. Runbook updates and review process

We changed the runbook for responding to abusive traffic:

  • prefer incremental tightening of existing, well-understood rules
  • use shadow mode before enforcing new, narrow rules
  • bring support into the loop earlier when a change could affect many users

We also added a lightweight peer-review step for rate-limit changes, even under time pressure.

In practice, this means that even in noisy situations, one other engineer glances at the rule and asks "who, exactly, will this slow down?" before it goes live.

Follow-ups

Completed

  • Implemented shadow mode for rate-limit rules.
  • Improved dashboards with cohort breakdowns for limit events.
  • Documented a design checklist and review step for new rules.

Planned / in progress

  • Add automated tests that exercise rate-limit rules against synthetic traffic patterns (a rough sketch follows this list).
  • Integrate rate-limit configuration changes into the same review tooling we use for other production config.
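
For the first of these, a rough sketch of what such a test could check; the rule format and the evaluate() helper are hypothetical stand-ins for the real tooling:

  def evaluate(rule, request):
      # Stand-in for the real rule engine: does this request match the rule?
      return all(request.get(k) == v for k, v in rule["match"].items())

  def test_rule_targets_abusive_cohort_only():
      rule = {"match": {"user_agent": "automation-bot", "path": "/v1/submit"}}
      normal = [{"user_agent": "mobile-app", "path": "/v1/submit"}] * 100
      abusive = [{"user_agent": "automation-bot", "path": "/v1/submit"}] * 100
      assert not any(evaluate(rule, r) for r in normal)    # no false positives
      assert all(evaluate(rule, r) for r in abusive)       # abuse still matched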

Takeaways

  • Rate limits are sharp tools; misconfigurations can hurt legitimate users more than attackers.
  • Designing rules under time pressure is risky; shadow evaluation and peer review help catch mistakes.
  • Good visibility into who is being throttled turns rate limiting from guesswork into an observable system.
  • Documenting a small set of "known good" rule patterns gives people safer starting points than inventing each rule from scratch.
  • Treat rate limits as part of the product surface, not just a firewall rule.