RELIABILITY · 2019-11-02 · BY PRIYA PATEL

The one dashboard

If an alert doesn’t point to a single starting dashboard, the first ten minutes turn into archaeology. We keep one “first dashboard” per service.

Tags: reliability, observability, monitoring, dashboards, on-call

At 2:13am, the alert fired.

Three people joined a call.

And the first question was not “how do we fix this?”

It was “which dashboard is the real one?”

Constraints

Dashboards fail in predictable ways:

  • They grow without purpose. A new graph gets added after every incident. Old graphs never leave.
  • They don’t match the alert. The page is about user impact; the dashboard is about CPU.
  • They can’t be queried quickly. High-cardinality breakdowns make simple questions slow.
  • They don’t encode a starting point. Everyone opens a different tab and narrates their own view.

They also fail organizationally:

  • nobody owns the dashboard, so nobody deletes graphs
  • the “important” dashboard is the one a senior person bookmarked three years ago
  • the dashboard answers yesterday’s incident, not today’s question

In those conditions, a dashboard becomes a museum.

A tired human doesn’t need a museum. They need a first move.

There’s also a subtler failure mode:

  • The dashboard is “correct,” but it’s not fast enough to use.

If the first screen takes 30 seconds to load, the incident room will route around it. People will start guessing.

What we changed

We started treating dashboards like runbooks: an interface for the first ten minutes.

For each service we maintain one “first dashboard.” It’s the page we expect an on-call engineer to open first, and the only dashboard linked from alerts.

The rule is simple:

  • If a graph doesn’t change a decision, it doesn’t belong on the first dashboard.

We also give it a performance budget: it has to load fast enough to be usable during an incident. If the query engine can’t answer the “first questions” quickly, we simplify the breakdowns until it can.
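
A minimal sketch of that budget check, assuming a run_query hook into whatever metrics backend you use; the queries and the three-second number below are illustrative, not our real tooling:

```python
# A rough check that the "first question" queries fit the budget.
# run_query is whatever hook your metrics backend gives you; the
# queries and the three-second budget here are illustrative.
import time

FIRST_QUESTION_QUERIES = {
    "request_rate": 'sum(rate(http_requests_total[5m]))',
    "error_rate": 'sum(rate(http_requests_total{code=~"5.."}[5m]))',
    "latency_p95": 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

BUDGET_SECONDS = 3.0  # "fast enough to be usable during an incident"

def slow_panels(run_query) -> list[str]:
    """Return the first-question queries that blow the budget."""
    offenders = []
    for name, query in FIRST_QUESTION_QUERIES.items():
        started = time.monotonic()
        run_query(query)  # placeholder for your backend call
        elapsed = time.monotonic() - started
        if elapsed > BUDGET_SECONDS:
            offenders.append(f"{name}: {elapsed:.1f}s")
    return offenders
```

If this list is ever non-empty, we simplify the breakdowns before we add anything else.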

A boring template that answers the first questions

Our template is boring on purpose.

It is designed to answer three questions in order.

  1. Is it broken? (top row)
  • request rate (or throughput)
  • error rate
  • latency (P95)
  2. Where is it broken? (second row)
  • top endpoints / routes by error rate (bounded route templates, not raw URLs)
  • dependency health (DB, queue, vendor) if it’s on the critical path
  3. Are we running out of something? (third row)
  • saturation signals that map to failure (queue depth, DB connections, disk)

That’s it.

Everything else can live on deeper dashboards.

Each graph on the first dashboard has a job: it either tells you the system is healthy, or it points to the next place to look (a deeper dashboard, a log view, or a runbook).
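
One way to keep the template honest is to write it down as data, so a review can see the whole dashboard at once. A rough sketch; the panel names and "next" links are illustrative, not a real dashboard schema:

```python
# The first dashboard as data: three rows, each panel carrying the "next"
# place to look. Names and links here are illustrative.
FIRST_DASHBOARD = {
    "row_1_is_it_broken": [
        {"panel": "request_rate", "next": "deep/traffic"},
        {"panel": "error_rate", "next": "logs?level=error"},
        {"panel": "latency_p95", "next": "deep/latency"},
    ],
    "row_2_where_is_it_broken": [
        {"panel": "errors_by_route", "next": "deep/routes"},
        {"panel": "dependency_health", "next": "runbook#dependencies"},
    ],
    "row_3_running_out_of_something": [
        {"panel": "queue_depth", "next": "runbook#drain-queue"},
        {"panel": "db_connections", "next": "deep/database"},
    ],
}

def every_panel_has_a_job(dashboard: dict) -> bool:
    """Each graph either says 'healthy' or points at the next place to look."""
    return all("next" in panel for row in dashboard.values() for panel in row)
```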

What we intentionally keep off the first dashboard

The first dashboard is not an appendix.

We avoid:

  • per-host breakdowns unless they routinely change the first action
  • unbounded label breakdowns (user IDs, request IDs)
  • graphs that are “interesting” but don’t change a decision

If you need a breakdown that requires expensive queries, it’s not a first dashboard widget.

It’s a deep-dive widget.
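
A small lint in review helps hold that line. A sketch, assuming queries spell breakdowns as "by (label, ...)" and that the label set below matches what is actually unbounded in your system:

```python
# Flag first-dashboard panels that group by unbounded labels.
# The regex and the label set are assumptions; adjust to your query language.
import re

UNBOUNDED_LABELS = {"user_id", "request_id", "session_id"}

def unbounded_breakdowns(panel_queries: dict[str, str]) -> list[str]:
    """Return panels whose query groups by a label we consider unbounded."""
    offenders = []
    for panel, query in panel_queries.items():
        groups = re.findall(r"by \(([^)]*)\)", query)
        labels = {label.strip() for group in groups for label in group.split(",")}
        if labels & UNBOUNDED_LABELS:
            offenders.append(panel)
    return offenders
```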

Making it usable under pressure

Then we make it usable:

  • Alerts link directly to the first dashboard. No hunting.
  • Deploy markers are visible. “What changed?” is a graph annotation, not a Slack debate.
  • Units are consistent. Seconds are seconds. Percent is percent. No mixed scales.
  • Default time ranges are sane. The first view should show “before and after” without you touching controls.
  • Links are pre-filtered. From a red graph, you can jump to logs/traces filtered to the same service and time window (a sketch follows this list).
  • The dashboard links back to the runbook. If you can see the problem, you should be able to take a safe first action.
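
The pre-filtered links are cheap to build. A sketch, with a made-up log-viewer URL and parameter names:

```python
# Build a "jump to logs" link scoped to one service and the incident window.
# The host and query parameters are made up; substitute your log viewer's.
from datetime import datetime, timedelta
from urllib.parse import urlencode

def logs_link(service: str, around: datetime, window_minutes: int = 30) -> str:
    """Log-view URL filtered to the same service and time window as the graph."""
    start = around - timedelta(minutes=window_minutes)
    end = around + timedelta(minutes=window_minutes)
    params = {
        "service": service,
        "from": start.isoformat(timespec="seconds"),
        "to": end.isoformat(timespec="seconds"),
    }
    return "https://logs.example.internal/search?" + urlencode(params)
```

Generate these from the alert’s service name and timestamp, and put one behind every red graph. Nobody should be retyping a time range at 2am.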

We still have deep dashboards.

But the first dashboard stays small enough to load quickly and answer the first two questions.

Owning the first dashboard

We assign an owner per service.

Ownership means: keep it small, keep it fast, review changes, delete stale graphs.

We review changes to the first dashboard the same way we review code: small diffs, clear intent, revertable queries.

If a graph is added because of an incident, the change should say which decision it enabled. If we can’t answer that, we shouldn’t add it.

We also track load time for the first dashboard. If it drifts above a few seconds, we simplify queries and drop expensive breakdowns. Fast beats clever during a page.

When an incident teaches us something, we try to add one graph and remove one graph. The first dashboard should not grow; it should rotate.
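
That rule is easy to check mechanically once the dashboard lives in version control. A sketch, with an illustrative panel budget:

```python
# Review check for the "rotate, don't grow" rule. The budget is illustrative.
PANEL_BUDGET = 9  # roughly three rows of three graphs

def review_dashboard_change(old_panels: list[str], new_panels: list[str]) -> list[str]:
    """Flag changes that grow the first dashboard or blow the panel budget."""
    problems = []
    if len(new_panels) > len(old_panels):
        problems.append("the first dashboard grew; remove a graph for the one you added")
    if len(new_panels) > PANEL_BUDGET:
        problems.append(f"{len(new_panels)} panels is over the budget of {PANEL_BUDGET}")
    return problems
```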

We also treat the link itself as part of the contract: alerts and runbooks point to the same first dashboard, and that dashboard points back to the runbook.

Results / Measurements

The outcome we care about is less time spent choosing where to look.

We tracked two practical signals:

  • time-to-first-decision (how long until someone can say “this is upstream latency” or “this is our deploy”)
  • time-to-first-safe-action (rollback, degrade mode, stop a job, turn off a flag)
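
Both are just deltas between timestamps the incident tooling already records. A sketch; the event names and times below are made up for illustration:

```python
# Compute the two signals from an incident timeline. Event names and the
# example timestamps are illustrative, not measured data.
from datetime import datetime

def minutes_between(timeline: dict[str, datetime], start: str, end: str) -> float | None:
    """Minutes from one recorded event to another, if both exist."""
    if start in timeline and end in timeline:
        return (timeline[end] - timeline[start]).total_seconds() / 60
    return None

timeline = {
    "paged": datetime(2019, 11, 2, 2, 13),
    "first_decision": datetime(2019, 11, 2, 2, 21),     # "this is our deploy"
    "first_safe_action": datetime(2019, 11, 2, 2, 26),  # rollback started
}

time_to_first_decision = minutes_between(timeline, "paged", "first_decision")
time_to_first_safe_action = minutes_between(timeline, "paged", "first_safe_action")
```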

On systems where we introduced a single first dashboard, both times dropped noticeably because the incident started from a shared view.

We also saw fewer “monitoring is broken” detours.

When the first dashboard is fast and bounded, you stop wasting time debugging the telemetry in the middle of the incident.

Takeaways

A dashboard is only useful if it reduces the time to a decision.

Pick one starting point. Keep it small. Link it from the alert.

If the first dashboard is slow to load, you don’t have observability—you have expensive storage.
