RELIABILITY · 2018-12-03 · BY ELI NAVARRO

Page or ticket

Not every alert deserves a page. We separate pages from tickets so on-call attention stays reserved for real impact.

Tags: reliability, alerting, on-call, monitoring, operations

An alert is information.

A page is an interruption.

If you use the same channel for both, you train people to ignore the channel.

On-call attention is a finite resource. You will spend it either on real incidents or on noise. You don’t get to not spend it.

Constraints

Most teams inherit alerting the same way they inherit everything else: by accretion.

  • Old alerts don’t get deleted.
  • New alerts are added “just in case.”
  • Thresholds get set without a baseline.
  • Ownership is unclear, so every alert targets “someone.”

The result is predictable: too many pages that don’t correspond to user impact.

Then a real incident happens, and the page doesn’t carry the urgency it should.

Paging is a coordination tool, not a telemetry sink. A page should mean: stop what you’re doing, we need a human now.

False pages tax the next real page. People build workarounds—mute rules, filters, “ignore anything from X”—and the pager stops being reliable.

Two failure modes show up over and over.

1) We page on things we can’t safely act on

CPU is high. Memory is high. A queue is non-empty. A host is noisy.

Those can be useful signals. They just aren’t always page-worthy.

If the page doesn’t tell you what action to take in the next five minutes, it’s not a page. It’s a chart.

A common trap is paging on a metric that is upstream of the real problem. You wake up to a symptom you can’t fix directly.

2) We page on missing information

Absence-of-data alerts fire when monitoring breaks, agents die, or a dashboard query times out.

That’s not impact. That’s blindness.

Blindness matters, but it calls for a different response: restore visibility. Don't drag someone out of sleep without a clear first action.

What we changed

We forced a decision on every alert: page or ticket.

Pages (interruptions)

A page must meet at least one of these:

  • users are actively failing to complete a critical action
  • data loss is likely without intervention
  • the system is approaching a hard limit (disk full, queue exhaustion) on a short horizon

Everything else is a ticket.

This forces clarity. If you can’t write the user-impact sentence, you can’t justify the page.

Examples:

  • “Checkout is timing out for >5% of requests” → page.
  • “Disk is at 82% and growing slowly” → ticket.
  • “Disk will be full in ~2 hours at current write rate” → page.
  • “CPU is high” → almost always ticket, unless it correlates to user failure.
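
To make the split concrete, here is a minimal sketch of the decision in Python. The alert fields and the three-hour horizon are illustrative assumptions, not our actual schema.

    # Sketch of a page-or-ticket decision. The Alert fields and thresholds
    # are illustrative assumptions, not a real schema.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Alert:
        user_impact: Optional[str]               # one sentence: what breaks for the user
        data_loss_likely: bool                   # will we lose data without intervention?
        hard_limit_eta_minutes: Optional[float]  # e.g. minutes until the disk is full

    PAGE_HORIZON_MINUTES = 180  # hard limits closer than this interrupt a human

    def route(alert: Alert) -> str:
        """Return 'page' only when the alert meets one of the page criteria."""
        if alert.user_impact:
            return "page"
        if alert.data_loss_likely:
            return "page"
        eta = alert.hard_limit_eta_minutes
        if eta is not None and eta <= PAGE_HORIZON_MINUTES:
            return "page"
        return "ticket"  # everything else is a ticket

    # "Disk will be full in ~2 hours at current write rate" -> page
    print(route(Alert(user_impact=None, data_loss_likely=False, hard_limit_eta_minutes=120)))
    # "CPU is high" with no user-impact sentence -> ticket
    print(route(Alert(user_impact=None, data_loss_likely=False, hard_limit_eta_minutes=None)))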

Tickets (drift)

Tickets are for things that are real but not urgent:

  • slow-growing saturation (disk, storage, quota)
  • noisy background jobs
  • dashboards that are too slow to load
  • missing runbooks
  • alerts that need baseline work

Tickets still need an owner and a due date. Otherwise “ticket” just means “never.”

We also treat tickets as work with a rhythm: triage weekly, pick a handful, and close them.

If a ticket is about a hard limit, we attach a horizon (“full in ~30 days at current growth”) so it can graduate into a page later when it becomes urgent.
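
The horizon itself is simple math. A rough sketch, assuming the growth rate comes from whatever already tracks your disk or quota usage; the numbers are made up:

    # Sketch of the horizon we attach to saturation tickets.
    # The inputs below are illustrative, not real measurements.
    def hours_until_full(capacity_gb: float, used_gb: float, growth_gb_per_hour: float) -> float:
        """Linear extrapolation of time until a hard limit is hit."""
        if growth_gb_per_hour <= 0:
            return float("inf")  # not growing: no horizon, it stays a ticket
        return (capacity_gb - used_gb) / growth_gb_per_hour

    horizon_hours = hours_until_full(capacity_gb=500, used_gb=410, growth_gb_per_hour=0.12)
    # ~31 days at current growth: file a ticket with the horizon attached.
    # If the same math ever says "~2 hours", the alert graduates into a page.
    print(f"full in ~{horizon_hours / 24:.0f} days at current growth")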

Making pages usable

Once we decided what pages are for, we rewrote pages to be usable:

  • the alert name describes the symptom, not the metric
  • the page links to the one dashboard we expect you to open first
  • the page links to the runbook section with a safe first action and rollback path
  • the page includes a stop condition (“if X isn’t true, stop and roll back”)

We treat the page payload like UI copy. A tired human reads it. It needs to answer “what do I do next?”

A minimal page template we use:

  • What breaks for the user:
  • Where to look first: (dashboard link)
  • Safe first action:
  • Rollback / backout:
  • Escalate when:

If we can’t fill that in, it’s not a page yet.
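
One way to make that check mechanical is to treat the template as a required payload. A minimal sketch, assuming field names that mirror the template above rather than any real tooling:

    # Sketch: the page template as a required payload. Field names mirror
    # the template above; the check and example values are illustrative.
    from dataclasses import dataclass, fields

    @dataclass
    class PagePayload:
        user_impact: str     # what breaks for the user
        dashboard_url: str   # where to look first
        first_action: str    # safe first action
        rollback: str        # rollback / backout path
        escalate_when: str   # stop condition / escalation trigger

    def is_ready_to_page(payload: PagePayload) -> bool:
        """A page is only finished when every field is filled in."""
        return all(getattr(payload, f.name).strip() for f in fields(payload))

    # Hypothetical values for the checkout example above.
    ready = is_ready_to_page(PagePayload(
        user_impact="checkout is timing out for >5% of requests",
        dashboard_url="https://dashboards.example/checkout",
        first_action="roll back the most recent checkout deploy",
        rollback="redeploy the previous release tag",
        escalate_when="error rate is still >5% ten minutes after rollback",
    ))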

No pages for monitoring gaps

Finally, we stopped paging on “absence of information.”

If an alert can be triggered by a monitoring gap, it is not a page. Fix the monitoring.

If the system genuinely becomes unsafe when monitoring is down, create a specific page for that condition (for example: “we cannot see error rate and deployments are in flight”). Don’t let a broken agent masquerade as production impact.
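
A sketch of that narrower condition, assuming your monitoring health and deploy state are both queryable; the function and its inputs are hypothetical:

    # Sketch: page on blindness only when the system is unsafe without visibility.
    # A dead agent on its own stays a ticket: restore visibility instead.
    def should_page_on_blindness(error_rate_visible: bool, deploys_in_flight: bool) -> bool:
        return (not error_rate_visible) and deploys_in_flight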

Keeping alerting clean

We also treat pages as a living system:

  • pages must have an owner
  • pages must be reviewed after they fire
  • pages that never correlate to impact get deleted
  • pages that correlate to impact but are not actionable get rewritten

If the pager gets noisy, we don’t “toughen up.” We reduce noise.

Results / Measurements

The immediate effect was fewer false pages.

In one system, pages dropped from ~15–20/week to ~4–6/week.

The more important effect was behavioral: when a page happened, people treated it as real.

We also watched:

  • % of pages that led to a change (rollback, mitigation, escalation). Pages that never lead to action are expensive.
  • time-to-first-safe-action. When the page includes a first dashboard and a safe first action, the first ten minutes stop being a debate.
  • repeat pages with the same root cause. If the same page fires every week, you don’t have an alerting problem. You have an operational debt problem.
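
All three fall out of a simple page log. A sketch, with an assumed record shape and made-up entries:

    # Sketch: computing the review metrics from a page log.
    # The record shape and the entries are illustrative assumptions.
    from collections import Counter

    pages = [
        {"alert": "checkout-errors", "led_to_change": True,  "minutes_to_first_action": 6,  "root_cause": "bad deploy"},
        {"alert": "disk-horizon",    "led_to_change": True,  "minutes_to_first_action": 12, "root_cause": "log growth"},
        {"alert": "cpu-high",        "led_to_change": False, "minutes_to_first_action": 25, "root_cause": "unknown"},
        {"alert": "disk-horizon",    "led_to_change": True,  "minutes_to_first_action": 9,  "root_cause": "log growth"},
    ]

    actionable = sum(p["led_to_change"] for p in pages) / len(pages)
    first_actions = sorted(p["minutes_to_first_action"] for p in pages)
    repeats = [cause for cause, n in Counter(p["root_cause"] for p in pages).items() if n > 1]

    print(f"{actionable:.0%} of pages led to a change")
    print(f"median time to first safe action: ~{first_actions[len(first_actions) // 2]} min")
    print(f"repeat root causes: {repeats}")  # candidates for operational-debt work, not alert tuning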

Takeaways

Pages are for impact. Tickets are for drift.

If you can’t say what breaks for the user, don’t page.

If a page doesn’t tell a tired person where to look and what to do first, it’s not finished.
