Reliability · 2020-09-09 · By Priya Patel

Story: the dashboard we stopped updating and the alert that still used it

We let a dashboard drift while an alert still depended on it. The next incident taught us why observability assets need owners.

Tags: reliability, observability, dashboards, alerts

What happened

A few years into running a service, we had more dashboards than anyone could comfortably list.

Some were carefully maintained.

Some were not.

One of the latter had started life as the primary view for a new API. Over time, people created more focused dashboards. The original became a dumping ground:

  • new graphs added after incidents
  • experiments that never graduated to real views
  • half-finished panels wired to stale metrics

Eventually, most people stopped opening it.

The alerting system didn’t.

One morning, an alert fired based on a graph in that dashboard. The numbers looked alarming. The incident channel filled quickly.

Within minutes, we realized the graph was misconfigured and the underlying metric had changed shape months earlier.

The alert was faithfully telling us that a useless graph was moving, not that users were in trouble.

The confusion

In the first ten minutes, we:

  • tried to correlate the alert to user-facing errors
  • checked other dashboards and saw no matching signals
  • questioned whether the newer dashboards were broken instead

The team lost time reconciling contradictory telemetry.

The actual service was healthy.

The broken part was our observability: a dashboard nobody trusted feeding an alert everyone trusted by default.

What we changed

1. Give dashboards owners

We started treating dashboards like code:

  • each important dashboard has an owning team
  • ownership is visible in the dashboard description
  • changes go through light review when they affect alerts

If a team no longer wants to own a dashboard, it must be:

  • handed off explicitly, or
  • deprecated and unhooked from alerts
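
To make ownership enforceable rather than aspirational, a check in CI can refuse dashboards that have no owning team. Below is a minimal sketch, assuming dashboards are stored as JSON files in a repo and ownership is recorded in an "owner:" tag; the directory name and field names are illustrative, not our exact tooling.

    # ci_check_dashboard_owners.py -- hypothetical CI check; paths and tag
    # format are assumptions for illustration.
    import json
    import sys
    from pathlib import Path

    DASHBOARD_DIR = Path("dashboards")  # assumed: dashboard JSON lives in the repo

    def owner_of(dashboard: dict) -> str | None:
        """Return the owning team recorded on the dashboard, if any."""
        # Assumption: ownership is recorded as a tag like "owner:payments-team".
        for tag in dashboard.get("tags", []):
            if tag.startswith("owner:"):
                return tag.split(":", 1)[1]
        return None

    def main() -> int:
        missing = []
        for path in sorted(DASHBOARD_DIR.glob("*.json")):
            dashboard = json.loads(path.read_text())
            if not owner_of(dashboard):
                missing.append(path.name)
        if missing:
            print("Dashboards without an owning team:")
            for name in missing:
                print(f"  - {name}")
            return 1  # fail the build until an owner is assigned
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Running this on every change keeps the "ownership is visible" rule from decaying the same way the dashboards did.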

2. Tie alerts to maintained views

We audited which dashboards our alerts referenced.

For each one, we asked:

  • Is this the primary, maintained view for this signal?
  • Do we have a smaller, more focused dashboard that would be a better anchor?

We:

  • pointed alerts to dashboards that were already part of incident workflows
  • removed or rewired alerts that depended on dashboards nobody used
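
The audit itself does not need to be fancy. Here is a rough sketch, assuming alert rules are exported as JSON and each rule carries an annotation naming the dashboard it is anchored to; the file layout, annotation name, and dashboard names are assumptions.

    # audit_alert_dashboards.py -- illustrative audit script, not our real tooling.
    import json
    from pathlib import Path

    # Dashboards that are actively maintained and used during incidents.
    MAINTAINED_DASHBOARDS = {"api-overview", "checkout-latency"}

    def dashboard_for(rule: dict) -> str | None:
        # Assumption: each exported rule annotates the dashboard it was built from.
        return rule.get("annotations", {}).get("dashboard_uid")

    for path in Path("alert-rules").glob("*.json"):
        rule = json.loads(path.read_text())
        uid = dashboard_for(rule)
        if uid is None:
            print(f"{path.name}: no dashboard annotation at all")
        elif uid not in MAINTAINED_DASHBOARDS:
            print(f"{path.name}: anchored to unmaintained dashboard '{uid}'")

Anything the script flags is exactly the situation from the incident: an alert everyone trusts, anchored to a view nobody maintains.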

3. Deprecate safely

We introduced a simple deprecation flow for dashboards:

  • add a clear "deprecated" label and guidance on where to look instead
  • remove links from navigation and runbooks
  • disconnect any alerts that reference the dashboard

After a cooling-off period, we archive or delete the dashboard.

This avoids silent drift where a dashboard lingers just long enough to mislead someone.
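
The cooling-off period is easier to respect when it is recorded somewhere machine-readable instead of remembered. A small illustrative sketch follows; the field names, dashboard names, and 90-day period are assumptions, not our actual policy.

    # deprecation.py -- illustrative record of a dashboard deprecation.
    from dataclasses import dataclass
    from datetime import date, timedelta

    COOLING_OFF = timedelta(days=90)  # assumed cooling-off period

    @dataclass
    class DashboardDeprecation:
        uid: str
        deprecated_on: date
        replacement: str  # surfaced in the "deprecated" label as "look here instead"

        def safe_to_archive(self, today: date) -> bool:
            """True once the cooling-off period has passed."""
            return today - self.deprecated_on >= COOLING_OFF

    # Example: the old API dashboard, deprecated in favour of a focused view.
    old_api = DashboardDeprecation(
        uid="api-legacy-overview",
        deprecated_on=date(2020, 6, 1),
        replacement="api-overview",
    )
    print(old_api.safe_to_archive(today=date(2020, 9, 9)))  # True: 100 days elapsed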

4. Make broken graphs obvious

On dashboards we kept, we:

  • removed graphs with obviously incorrect or unused queries
  • added small annotations for graphs that were intentionally experimental

If a graph is important enough to page on, it is important enough to:

  • render correctly
  • have labels and units people trust
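
A lightweight lint over dashboard definitions can catch both kinds of problem before they reach on-call. This is a hypothetical sketch; the JSON shape it checks (panels, targets, alert, unit) is an assumption, not a specific tool's schema.

    # lint_panels.py -- hypothetical panel lint over dashboard JSON.
    import json
    from pathlib import Path

    def lint_dashboard(path: Path) -> list[str]:
        dashboard = json.loads(path.read_text())
        problems = []
        for panel in dashboard.get("panels", []):
            title = panel.get("title", "<untitled>")
            if not panel.get("targets"):
                # No query wired up at all: remove it or label it experimental.
                problems.append(f"{title}: panel has no query")
            if panel.get("alert") and not panel.get("unit"):
                # Graphs that page people should at least have trustworthy units.
                problems.append(f"{title}: drives an alert but has no unit set")
        return problems

    for path in Path("dashboards").glob("*.json"):
        for problem in lint_dashboard(path):
            print(f"{path.name}: {problem}")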

Takeaways

  • Dashboards and alerts are part of the same contract; if one is unowned, the other will eventually misfire.
  • It’s safer to delete or deprecate a dashboard than to let it quietly drift out of sync with reality.
  • Alerts should reference views that on-call engineers already use, not forgotten pages.
  • Observability assets deserve ownership and review just like code and runbooks.
