Reliability · 2020-09-09 · By Priya Patel

Story: the dashboard we stopped updating and the alert that still used it

We let a dashboard drift while an alert still depended on it. The next incident taught us why observability assets need owners.

Tags: reliability, observability, dashboards, alerts

What happened

A few years into running a service, we had more dashboards than anyone could comfortably list.

Some were carefully maintained.

Some were not.

One of the latter had started life as the primary view for a new API. Over time, people created more focused dashboards. The original became a dumping ground:

  • new graphs added after incidents
  • experiments that never graduated to real views
  • half-finished panels wired to stale metrics

Eventually, most people stopped opening it.

The alerting system didn’t.

One morning, an alert fired based on a graph in that dashboard. The numbers looked alarming. The incident channel filled quickly.

Within minutes, we realized the graph was misconfigured and the underlying metric had changed shape months earlier.

The alert was faithfully telling us that a useless graph was moving, not that users were in trouble.

The confusion

In the first ten minutes, we:

  • tried to correlate the alert to user-facing errors
  • checked other dashboards and saw no matching signals
  • questioned whether the newer dashboards were broken instead

The team lost time reconciling contradictory telemetry.

The actual service was healthy.

The broken part was our observability: a dashboard nobody trusted feeding an alert everyone trusted by default.

What we changed

1. Give dashboards owners

We started treating dashboards like code:

  • each important dashboard has an owning team
  • ownership is visible in the dashboard description
  • changes go through light review when they affect alerts

If a team no longer wants to own a dashboard, it must be:

  • handed off explicitly, or
  • deprecated and unhooked from alerts
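
To make ownership enforceable rather than aspirational, a check in CI can refuse dashboards that have no owning team. Below is a minimal sketch, assuming dashboards are stored as JSON files in a repo and ownership is recorded in an "owner:" tag; the directory name and field names are illustrative, not our exact tooling.

    # ci_check_dashboard_owners.py -- hypothetical CI check; paths and tag
    # format are assumptions for illustration.
    import json
    import sys
    from pathlib import Path

    DASHBOARD_DIR = Path("dashboards")  # assumed: dashboard JSON lives in the repo

    def owner_of(dashboard: dict) -> str | None:
        """Return the owning team recorded on the dashboard, if any."""
        # Assumption: ownership is recorded as a tag like "owner:payments-team".
        for tag in dashboard.get("tags", []):
            if tag.startswith("owner:"):
                return tag.split(":", 1)[1]
        return None

    def main() -> int:
        missing = []
        for path in sorted(DASHBOARD_DIR.glob("*.json")):
            dashboard = json.loads(path.read_text())
            if not owner_of(dashboard):
                missing.append(path.name)
        if missing:
            print("Dashboards without an owning team:")
            for name in missing:
                print(f"  - {name}")
            return 1  # fail the build until an owner is assigned
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Running this on every change keeps the "ownership is visible" rule from decaying the same way the dashboards did.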

2. Tie alerts to maintained views

We audited which dashboards our alerts referenced.

For each one, we asked:

  • Is this the primary, maintained view for this signal?
  • Do we have a smaller, more focused dashboard that would be a better anchor?

We:

  • pointed alerts to dashboards that were already part of incident workflows
  • removed or rewired alerts that depended on dashboards nobody used
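
The audit itself does not need to be fancy. Here is a rough sketch, assuming alert rules are exported as JSON and each rule carries an annotation naming the dashboard it is anchored to; the file layout, annotation name, and dashboard names are assumptions.

    # audit_alert_dashboards.py -- illustrative audit script, not our real tooling.
    import json
    from pathlib import Path

    # Dashboards that are actively maintained and used during incidents.
    MAINTAINED_DASHBOARDS = {"api-overview", "checkout-latency"}

    def dashboard_for(rule: dict) -> str | None:
        # Assumption: each exported rule annotates the dashboard it was built from.
        return rule.get("annotations", {}).get("dashboard_uid")

    for path in Path("alert-rules").glob("*.json"):
        rule = json.loads(path.read_text())
        uid = dashboard_for(rule)
        if uid is None:
            print(f"{path.name}: no dashboard annotation at all")
        elif uid not in MAINTAINED_DASHBOARDS:
            print(f"{path.name}: anchored to unmaintained dashboard '{uid}'")

Anything the script flags is exactly the situation from the incident: an alert everyone trusts, anchored to a view nobody maintains.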

3. Deprecate safely

We introduced a simple deprecation flow for dashboards:

  • add a clear "deprecated" label and guidance on where to look instead
  • remove links from navigation and runbooks
  • disconnect any alerts that reference the dashboard

After a cooling-off period, we archive or delete the dashboard.

This avoids silent drift where a dashboard lingers just long enough to mislead someone.
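
The cooling-off period is easier to respect when it is recorded somewhere machine-readable instead of remembered. A small illustrative sketch follows; the field names, dashboard names, and 90-day period are assumptions, not our actual policy.

    # deprecation.py -- illustrative record of a dashboard deprecation.
    from dataclasses import dataclass
    from datetime import date, timedelta

    COOLING_OFF = timedelta(days=90)  # assumed cooling-off period

    @dataclass
    class DashboardDeprecation:
        uid: str
        deprecated_on: date
        replacement: str  # surfaced in the "deprecated" label as "look here instead"

        def safe_to_archive(self, today: date) -> bool:
            """True once the cooling-off period has passed."""
            return today - self.deprecated_on >= COOLING_OFF

    # Example: the old API dashboard, deprecated in favour of a focused view.
    old_api = DashboardDeprecation(
        uid="api-legacy-overview",
        deprecated_on=date(2020, 6, 1),
        replacement="api-overview",
    )
    print(old_api.safe_to_archive(today=date(2020, 9, 9)))  # True: 100 days elapsed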

4. Make broken graphs obvious

On dashboards we kept, we:

  • removed graphs with obviously incorrect or unused queries
  • added small annotations for graphs that were intentionally experimental

If a graph is important enough to page on, it is important enough to:

  • render correctly
  • have labels and units people trust
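
A lightweight lint over dashboard definitions can catch both kinds of problem before they reach on-call. This is a hypothetical sketch; the JSON shape it checks (panels, targets, alert, unit) is an assumption, not a specific tool's schema.

    # lint_panels.py -- hypothetical panel lint over dashboard JSON.
    import json
    from pathlib import Path

    def lint_dashboard(path: Path) -> list[str]:
        dashboard = json.loads(path.read_text())
        problems = []
        for panel in dashboard.get("panels", []):
            title = panel.get("title", "<untitled>")
            if not panel.get("targets"):
                # No query wired up at all: remove it or label it experimental.
                problems.append(f"{title}: panel has no query")
            if panel.get("alert") and not panel.get("unit"):
                # Graphs that page people should at least have trustworthy units.
                problems.append(f"{title}: drives an alert but has no unit set")
        return problems

    for path in Path("dashboards").glob("*.json"):
        for problem in lint_dashboard(path):
            print(f"{path.name}: {problem}")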

Takeaways

  • Dashboards and alerts are part of the same contract; if one is unowned, the other will eventually misfire.
  • It’s safer to delete or deprecate a dashboard than to let it quietly drift out of sync with reality.
  • Alerts should reference views that on-call engineers already use, not forgotten pages.
  • Observability assets deserve ownership and review just like code and runbooks.
