Reliability · 2022-07-25 · by Priya Patel

How we reduced alert noise without losing signal

The concrete changes we made to our alerting so pages became rarer, clearer, and more actionable.

reliability · alerting · on-call · observability

For years, our alerting grew one incident at a time.

Each painful event left behind a new rule:

  • alert on this ratio
  • alert on this HTTP code
  • alert on this log pattern

Individually, each alert made sense.

Together, they produced a steady stream of noise:

  • pages about brief blips that self-resolved
  • near-duplicate alerts for the same underlying issue
  • "informational" notifications that trained people to swipe them away

We knew this was unsustainable when:

  • on-call engineers routinely muted channels during off-hours
  • incident reviews spent time just listing which alerts fired
  • we still occasionally missed real incidents because the critical alerts were buried

This post is about what we changed to reduce alert noise without losing signal.

Constraints

  • We could not pause feature work for a quarter to rebuild alerting from scratch.
  • Different teams owned different services and alerts; we needed a shared approach, not central control.
  • Our observability stack was shared, so heavy alert queries affected other teams.

We also had cultural constraints:

  • some teams viewed more alerts as safer
  • others were skeptical of removing any existing alerts

What we changed

We treated alerting as a product with users: the on-call engineers.

1. Define what deserves a page

We wrote down a simple rule:

A page is for something that requires a human to act within the next few minutes to protect users or the system.

Everything else is:

  • a ticket
  • a dashboard
  • or nothing at all

We applied this rule service by service.
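
To make the rule concrete, here is a minimal sketch of the routing decision it implies. The inputs, enum names, and thresholds are illustrative, not our actual tooling:

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # a human must act within minutes
    TICKET = "ticket"        # needs action, but not right now
    DASHBOARD = "dashboard"  # context only, no notification
    NOTHING = "nothing"      # delete the alert


def route_alert(needs_human_action: bool, urgent: bool, useful_context: bool) -> Route:
    """Encode the paging rule: a page is only for problems that need a human
    to act within the next few minutes to protect users or the system."""
    if needs_human_action and urgent:
        return Route.PAGE
    if needs_human_action:
        return Route.TICKET
    if useful_context:
        return Route.DASHBOARD
    return Route.NOTHING
```

Writing the decision down this plainly made it much harder to argue a "nice to know" alert into the page bucket.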

2. Group alerts by user impact

We reorganized alerts into a few categories:

  • User-visible errors or latency. SLO breaches, spikes in 5xx, sustained latency increases.
  • Resource exhaustion. Disk, memory, connection pools approaching limits when they threaten user traffic.
  • Critical control-plane failures. Deploy system or feature flags being unavailable when needed.

If an alert couldn’t be mapped to one of these categories, we asked why it should page.
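
One way we kept ourselves honest was to require every paging alert to declare its impact category up front. A rough sketch of what that check can look like (the AlertRule shape and validate helper are hypothetical, not our real schema):

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    USER_VISIBLE = "user-visible errors or latency"
    RESOURCE_EXHAUSTION = "resource exhaustion"
    CONTROL_PLANE = "critical control-plane failure"


@dataclass
class AlertRule:
    name: str
    pages: bool
    impact: Impact | None = None  # must be set when pages is True


def validate(rule: AlertRule) -> None:
    """Push back on paging rules that can't be mapped to a user-impact category."""
    if rule.pages and rule.impact is None:
        raise ValueError(
            f"{rule.name}: a paging alert must declare one of the impact "
            "categories above, or be demoted to a ticket or dashboard"
        )
```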

3. Reduce duplicates and flapping

We found several patterns:

  • multiple alerts on different metrics for the same incident
  • alerts that fired on small, transient spikes
  • alerts that recovered and re-fired repeatedly

We fixed these by:

  • consolidating related alerts into a single "symptom" alert
  • adding short windows and hysteresis to avoid flapping (sketched after this list)
  • tying alerts more closely to SLOs instead of raw metrics
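
The hysteresis piece is the one people ask about most. The sketch below, with assumed thresholds and window size (the real values vary per service), fires only after the error rate stays above a trigger level for a sustained window, and clears only after it stays below a lower level, so a single incident doesn't re-page over and over:

```python
from collections import deque


class HysteresisAlert:
    """Fire after `window` consecutive samples above `trigger`;
    clear only after `window` consecutive samples below `clear`.
    Thresholds and window size here are illustrative defaults."""

    def __init__(self, trigger: float = 0.05, clear: float = 0.02, window: int = 5):
        assert clear < trigger, "clear threshold must sit below the trigger"
        self.trigger = trigger
        self.clear = clear
        self.samples = deque(maxlen=window)
        self.firing = False

    def observe(self, error_rate: float) -> bool:
        """Feed one sample (e.g. a per-minute error ratio); return firing state."""
        self.samples.append(error_rate)
        full = len(self.samples) == self.samples.maxlen
        if not self.firing and full and all(s > self.trigger for s in self.samples):
            self.firing = True   # sustained breach: page once
        elif self.firing and full and all(s < self.clear for s in self.samples):
            self.firing = False  # sustained recovery: resolve
        return self.firing
```

A one-minute blip never fills the window, so it never pages; recovery requires the rate to stay below the lower threshold, so the alert doesn't flap around a single boundary.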

4. Make every page point to a starting point

Borrowing from our dashboard work, we made each paged alert:

  • link to a single "first dashboard"
  • include a short "first step" in the description

For example:

  • "Check the service dashboard for error rate and latency. If both are elevated, follow the 'latency after deploy' runbook section."

This turned alerts from vague warnings into starting moves.
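
In practice this meant every paging rule carried two extra fields. A minimal sketch of the shape (field names, the condition expression, and the URL are hypothetical, not our actual schema):

```python
from dataclasses import dataclass


@dataclass
class PageDefinition:
    name: str
    condition: str        # e.g. an SLO burn-rate expression
    first_dashboard: str  # the single link the responder opens first
    first_step: str       # one concrete action, not a vague warning


checkout_latency = PageDefinition(
    name="checkout-latency-slo-burn",
    condition="slo_burn_rate{service='checkout'} > 2 for 10m",
    first_dashboard="https://dashboards.example.com/checkout/overview",
    first_step=(
        "Check the service dashboard for error rate and latency. "
        "If both are elevated, follow the 'latency after deploy' runbook section."
    ),
)
```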

5. Run alert reviews like API reviews

We added a light review process:

  • new pages require a short design doc (one or two paragraphs)
  • reviews ask:
    • what action should the on-call take?
    • what’s the expected frequency?
    • what happens if this never fires?

We also set a small budget per service for paged alerts.

When a service hit its budget, adding a new page required revisiting existing ones.
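
The budget can be enforced mechanically, for example as a check in the repository that holds alert definitions. A rough sketch (the budget number and rule representation are assumptions, not our exact tooling):

```python
from collections import Counter
from collections.abc import Iterable

PAGE_BUDGET_PER_SERVICE = 10  # illustrative cap, not our real number


def over_budget(rules: Iterable[tuple[str, bool]]) -> list[str]:
    """Given (service, pages) pairs for every alert rule, return the services
    that exceed their paging budget, so adding a page forces a review of old ones."""
    pages_per_service = Counter(service for service, pages in rules if pages)
    return [s for s, n in pages_per_service.items() if n > PAGE_BUDGET_PER_SERVICE]
```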

Results / Measurements

After a few months of changes, we measured:

  • Page volume. Pages per week for a sample of core services dropped by ~40–50%.
  • Coverage. The percentage of major incidents that had at least one useful early alert increased.
  • On-call sentiment. In informal surveys, engineers reported fewer "useless" pages and more confidence that a page meant something important.

We also saw:

  • faster time-to-engage, because people weren’t conditioned to ignore alerts
  • cleaner incident narratives, with fewer "and then five more alerts fired" sections

We did not eliminate all noisy alerts. Some still slip through, especially during new feature launches. The difference is that now we treat them as regressions and tune them, instead of accepting them as background noise.

Takeaways

  • Pages are an interface. They should be rare, clear, and actionable.
  • Defining what deserves a page (and writing it down) is the first step toward better alerting.
  • Grouping alerts by user impact helps avoid a forest of metric-specific rules.
  • Consolidating and de-flapping alerts reduces fatigue without reducing coverage.
  • Small, ongoing alert reviews keep the system healthy; this is not a one-time cleanup.
