RELIABILITY · 2025-06-22 · BY ELI NAVARRO

Story: rethinking on-call dashboards for platform teams

A platform team’s dashboards were great for the platform engineers themselves but hard for service owners to use during incidents. We describe how we changed them.

reliability · platforms · dashboards · on-call

What happened

The platform team had beautiful dashboards.

They showed:

  • CPU and memory for the shared clusters
  • queue depths and throughput for internal buses
  • internal error rates for infrastructure services

Platform engineers knew how to read them.

Service teams did not.

During incidents, this gap showed up as:

  • platform on-call saying "everything looks fine here"
  • service on-call saying "users are clearly not fine"

Both were correct from their vantage point.

The dashboards were built around platform health, not around the reliability of the services built on top.

A specific incident

One incident crystallized the problem.

A shared database cluster started to see intermittent slowdowns.

The platform dashboard showed:

  • CPU within limits
  • memory stable
  • no obvious spikes in low-level error metrics

From the platform view, this looked like noise.

At the same time, a user-facing service saw:

  • increased latency for a specific set of queries
  • retries stacking up in application logs
  • a small but real error-rate increase

The service dashboard made the slowdown visible.

The platform dashboards did not make it easy to connect that slowdown to the platform’s own metrics.

We realized we needed a shared language on dashboards between platform and service teams.

What we changed

1. Add consumer-centric slices

Platform dashboards gained a new set of sections organized by consumer perspective:

  • per-service or per-tenant views
  • metrics that show how platform behavior maps to service SLOs

For example, instead of just "database CPU," we:

  • added breakdowns of query latency by service
  • showed which services were consuming the most connections

This helped answer questions like (see the sketch below):

  • "Which services are most affected right now?"
  • "Is this a platform-wide problem or localized?"

2. Align terminology

We cleaned up naming and labels.

Platform-internal names for clusters and queues were translated into:

  • the service names or domains that most people knew
  • consistent labels used across dashboards and alerts

Instead of "cluster-db-03," the dashboard said which high-level services relied on it.

3. Tie platform metrics to service SLOs

We added simple overlays and annotations (sketched below):

  • when a platform metric crossed a threshold that had historically affected service SLOs
  • pointers from service SLO breaches to relevant platform graphs
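
A rough sketch of how one of those annotations can be derived. The metric, values, and threshold here are invented; the real thresholds came from incidents where service SLOs had actually been affected.

    # Sketch of a threshold annotation; metric, values, and threshold are invented.
    # Hypothetical time series for the shared cluster: (minute, db_p95_latency_ms)
    platform_series = [(0, 40), (1, 45), (2, 130), (3, 160), (4, 55)]

    # Level that has historically preceded SLO impact for consumers of this cluster.
    SLO_IMPACT_THRESHOLD_MS = 120

    annotations = [
        (minute, "above the level historically linked to service SLO impact")
        for minute, latency_ms in platform_series
        if latency_ms >= SLO_IMPACT_THRESHOLD_MS
    ]

    for minute, note in annotations:
        print(f"minute {minute}: {note}")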

This turned some platform dashboards from "interesting metrics" into tools for explaining SLO behavior.

4. Share ownership of key views

We paired platform and service engineers to design a small set of shared dashboards:

  • one that both on-calls could open during a cross-cutting incident
  • with a clear narrative: "what’s happening, who is affected, where to look next" (sketched below)
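
As a sketch, here is how that narrative maps onto sections of the shared view. The panel titles and links are placeholders, not our actual dashboard definition.

    # Sketch of the shared incident view; panel titles and links are placeholders.
    SHARED_INCIDENT_VIEW = {
        "what is happening": [
            "shared DB cluster: p95 query latency",
            "internal buses: queue depth and throughput",
        ],
        "who is affected": [
            "per-service error rate against SLO targets",
            "per-service latency for queries to the shared cluster",
        ],
        "where to look next": [
            "link: low-level platform metrics for the implicated cluster",
            "link: service runbooks for dependency slowdowns",
        ],
    }

    for question, panels in SHARED_INCIDENT_VIEW.items():
        print(question)
        for panel in panels:
            print(f"  - {panel}")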

Ownership for these dashboards was joint:

  • platform owns correctness of low-level metrics
  • service teams help ensure the views answer their questions

Takeaways

  • Platform dashboards need to tell a story that service teams can understand under pressure.
  • Organizing platform views around consumers, not just components, makes incidents less adversarial.
  • Shared dashboards and terminology help when "platform is fine" and "users are not fine" collide.
