Reliability · 2022-07-25 · by Priya Patel

How we reduced alert noise without losing signal

The concrete changes we made to our alerting so pages became rarer, clearer, and more actionable.

reliability · alerting · on-call · observability

For years, our alerting grew one incident at a time.

Each painful event left behind a new rule:

  • alert on this ratio
  • alert on this HTTP code
  • alert on this log pattern

Individually, each alert made sense.

Together, they produced a steady stream of noise:

  • pages about brief blips that self-resolved
  • near-duplicate alerts for the same underlying issue
  • "informational" notifications that trained people to swipe them away

We knew this was unsustainable when:

  • on-call engineers routinely muted channels during off-hours
  • incident reviews spent time just listing which alerts fired
  • we still occasionally missed real incidents because the critical alerts were buried

This post is about what we changed to reduce alert noise without losing signal.

Constraints

  • We could not pause feature work for a quarter to rebuild alerting from scratch.
  • Different teams owned different services and alerts; we needed a shared approach, not central control.
  • Our observability stack was shared, so heavy alert queries affected other teams.

We also had cultural constraints:

  • some teams viewed more alerts as safer
  • others were skeptical of removing any existing alerts

What we changed

We treated alerting as a product with users: the on-call engineers.

1. Define what deserves a page

We wrote down a simple rule:

A page is for something that requires a human to act within the next few minutes to protect users or the system.

Everything else is:

  • a ticket
  • a dashboard
  • or nothing at all

We applied this rule service by service.
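
To make the rule concrete, here is a minimal sketch of the routing decision it implies. The inputs, enum names, and thresholds are illustrative, not our actual tooling:

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # a human must act within minutes
    TICKET = "ticket"        # needs action, but not right now
    DASHBOARD = "dashboard"  # context only, no notification
    NOTHING = "nothing"      # delete the alert


def route_alert(needs_human_action: bool, urgent: bool, useful_context: bool) -> Route:
    """Encode the paging rule: a page is only for problems that need a human
    to act within the next few minutes to protect users or the system."""
    if needs_human_action and urgent:
        return Route.PAGE
    if needs_human_action:
        return Route.TICKET
    if useful_context:
        return Route.DASHBOARD
    return Route.NOTHING
```

Writing the decision down this plainly made it much harder to argue a "nice to know" alert into the page bucket.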

2. Group alerts by user impact

We reorganized alerts into a few categories:

  • User-visible errors or latency. SLO breaches, spikes in 5xx, sustained latency increases.
  • Resource exhaustion. Disk, memory, connection pools approaching limits when they threaten user traffic.
  • Critical control-plane failures. Deploy system or feature flags being unavailable when needed.

If an alert couldn’t be mapped to one of these categories, we asked why it should page.
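
One way we kept ourselves honest was to require every paging alert to declare its impact category up front. A rough sketch of what that check can look like (the AlertRule shape and validate helper are hypothetical, not our real schema):

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    USER_VISIBLE = "user-visible errors or latency"
    RESOURCE_EXHAUSTION = "resource exhaustion"
    CONTROL_PLANE = "critical control-plane failure"


@dataclass
class AlertRule:
    name: str
    pages: bool
    impact: Impact | None = None  # must be set when pages is True


def validate(rule: AlertRule) -> None:
    """Push back on paging rules that can't be mapped to a user-impact category."""
    if rule.pages and rule.impact is None:
        raise ValueError(
            f"{rule.name}: a paging alert must declare one of the impact "
            "categories above, or be demoted to a ticket or dashboard"
        )
```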

3. Reduce duplicates and flapping

We found several patterns:

  • multiple alerts on different metrics for the same incident
  • alerts that fired on small, transient spikes
  • alerts that recovered and re-fired repeatedly

We fixed these by:

  • consolidating related alerts into a single "symptom" alert
  • adding short windows and hysteresis to avoid flapping (sketched after this list)
  • tying alerts more closely to SLOs instead of raw metrics
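
The hysteresis piece is the one people ask about most. The sketch below, with assumed thresholds and window size (the real values vary per service), fires only after the error rate stays above a trigger level for a sustained window, and clears only after it stays below a lower level, so a single incident doesn't re-page over and over:

```python
from collections import deque


class HysteresisAlert:
    """Fire after `window` consecutive samples above `trigger`;
    clear only after `window` consecutive samples below `clear`.
    Thresholds and window size here are illustrative defaults."""

    def __init__(self, trigger: float = 0.05, clear: float = 0.02, window: int = 5):
        assert clear < trigger, "clear threshold must sit below the trigger"
        self.trigger = trigger
        self.clear = clear
        self.samples = deque(maxlen=window)
        self.firing = False

    def observe(self, error_rate: float) -> bool:
        """Feed one sample (e.g. a per-minute error ratio); return firing state."""
        self.samples.append(error_rate)
        full = len(self.samples) == self.samples.maxlen
        if not self.firing and full and all(s > self.trigger for s in self.samples):
            self.firing = True   # sustained breach: page once
        elif self.firing and full and all(s < self.clear for s in self.samples):
            self.firing = False  # sustained recovery: resolve
        return self.firing
```

A one-minute blip never fills the window, so it never pages; recovery requires the rate to stay below the lower threshold, so the alert doesn't flap around a single boundary.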

4. Make every page point to a starting point

Borrowing from our dashboard work, we made each paged alert:

  • link to a single "first dashboard"
  • include a short "first step" in the description

For example:

  • "Check the service dashboard for error rate and latency. If both are elevated, follow the 'latency after deploy' runbook section."

This turned alerts from vague warnings into starting moves.
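
In practice this meant every paging rule carried two extra fields. A minimal sketch of the shape (field names, the condition expression, and the URL are hypothetical, not our actual schema):

```python
from dataclasses import dataclass


@dataclass
class PageDefinition:
    name: str
    condition: str        # e.g. an SLO burn-rate expression
    first_dashboard: str  # the single link the responder opens first
    first_step: str       # one concrete action, not a vague warning


checkout_latency = PageDefinition(
    name="checkout-latency-slo-burn",
    condition="slo_burn_rate{service='checkout'} > 2 for 10m",
    first_dashboard="https://dashboards.example.com/checkout/overview",
    first_step=(
        "Check the service dashboard for error rate and latency. "
        "If both are elevated, follow the 'latency after deploy' runbook section."
    ),
)
```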

5. Run alert reviews like API reviews

We added a light review process:

  • new pages require a short design doc (one or two paragraphs)
  • reviews ask:
    • what action should the on-call take?
    • what’s the expected frequency?
    • what happens if this never fires?

We also set a small budget per service for paged alerts.

When a service hit its budget, adding a new page required revisiting existing ones.
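
The budget can be enforced mechanically, for example as a check in the repository that holds alert definitions. A rough sketch (the budget number and rule representation are assumptions, not our exact tooling):

```python
from collections import Counter
from collections.abc import Iterable

PAGE_BUDGET_PER_SERVICE = 10  # illustrative cap, not our real number


def over_budget(rules: Iterable[tuple[str, bool]]) -> list[str]:
    """Given (service, pages) pairs for every alert rule, return the services
    that exceed their paging budget, so adding a page forces a review of old ones."""
    pages_per_service = Counter(service for service, pages in rules if pages)
    return [s for s, n in pages_per_service.items() if n > PAGE_BUDGET_PER_SERVICE]
```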

Results / Measurements

After a few months of changes, we measured:

  • Page volume. Pages per week for a sample of core services dropped by ~40–50%.
  • Coverage. The percentage of major incidents that had at least one useful early alert increased.
  • On-call sentiment. In informal surveys, engineers reported fewer "useless" pages and more confidence that a page meant something important.

We also saw:

  • faster time-to-engage, because people weren’t conditioned to ignore alerts
  • cleaner incident narratives, with fewer "and then five more alerts fired" sections

We did not eliminate all noisy alerts. Some still slip through, especially during new feature launches. The difference is that now we treat them as regressions and tune them, instead of accepting them as background noise.

Takeaways

  • Pages are an interface. They should be rare, clear, and actionable.
  • Defining what deserves a page (and writing it down) is the first step toward better alerting.
  • Grouping alerts by user impact helps avoid a forest of metric-specific rules.
  • Consolidating and de-flapping alerts reduces fatigue without reducing coverage.
  • Small, ongoing alert reviews keep the system healthy; this is not a one-time cleanup.
