Story: rethinking on-call dashboards for platform teams
A platform team’s dashboards worked well for the platform engineers who built them, but were hard for service owners to use during incidents. This is how we changed them.
What happened
The platform team had beautiful dashboards.
They showed:
- CPU and memory for the shared clusters
- queue depths and throughput for internal buses
- internal error rates for infrastructure services
Platform engineers knew how to read them.
Service teams did not.
During incidents, this gap showed up as:
- platform on-call saying "everything looks fine here"
- service on-call saying "users are clearly not fine"
Both were correct from their own vantage points.
The dashboards were built around platform health, not around the reliability of the services built on top.
A specific incident
One incident crystallized the problem.
A shared database cluster began to experience intermittent slowdowns.
The platform dashboard showed:
- CPU within limits
- memory stable
- no obvious spikes in low-level error metrics
From the platform view, this looked like noise.
At the same time, a user-facing service saw:
- increased latency for a specific set of queries
- retries stacking up in application logs
- a small but real error-rate increase
The service dashboard made the slowdown visible.
The platform dashboards did not make it easy to connect that slowdown to the platform’s own metrics.
We realized that platform and service teams needed a shared language on their dashboards.
What we changed
1. Add consumer-centric slices
Platform dashboards gained a new set of sections organized around the consumer’s perspective:
- per-service or per-tenant views
- metrics that show how platform behavior maps to service SLOs
For example, instead of just "database CPU," we:
- added breakdowns of query latency by service (sketched below)
- showed which services were consuming the most connections
This helped answer questions like:
- "Which services are most affected right now?"
- "Is this a platform-wide problem or localized?"
2. Align terminology
We cleaned up naming and labels.
Platform-internal names for clusters and queues were translated into:
- the service names or domains that most people knew
- consistent labels used across dashboards and alerts
Instead of "cluster-db-03," the dashboard said which high-level services relied on it.
3. Tie platform metrics to service SLOs
We added simple overlays and annotations:
- when a platform metric crossed a threshold that had historically affected service SLOs
- pointers from service SLO breaches to relevant platform graphs
This turned some platform dashboards from "interesting metrics" into tools for explaining SLO behavior.
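Here is a rough sketch of the annotation logic, assuming a threshold hand-picked from past incidents; the metric values, timestamps, and threshold are invented for illustration.

```python
# A minimal sketch of SLO-impact annotations: flag the time windows
# where a platform metric exceeded a level that has historically
# coincided with service SLO impact. All data here is made up.
from datetime import datetime, timedelta

# (timestamp, p95 query latency in ms) for a shared database cluster.
start = datetime(2024, 5, 1, 14, 0)
metric = [(start + timedelta(minutes=i), v) for i, v in enumerate(
    [40, 42, 45, 180, 220, 210, 60, 44, 43, 200, 230, 50])]

# A level that past incidents suggest tends to hurt service SLOs;
# in this sketch it is simply a hand-picked constant.
SLO_IMPACT_THRESHOLD_MS = 150

def annotation_windows(series, threshold):
    """Return (start, end) windows where the metric stayed above the
    threshold. These can be rendered as shaded overlays on the graph."""
    windows, window_start = [], None
    for ts, value in series:
        if value > threshold and window_start is None:
            window_start = ts
        elif value <= threshold and window_start is not None:
            windows.append((window_start, ts))
            window_start = None
    if window_start is not None:  # metric still above threshold at the end
        windows.append((window_start, series[-1][0]))
    return windows

for w_start, w_end in annotation_windows(metric, SLO_IMPACT_THRESHOLD_MS):
    print(f"likely SLO impact: {w_start:%H:%M} to {w_end:%H:%M}")
```

The same windows can also anchor the pointers in the other direction, from a service SLO breach back to the platform graphs that were above their thresholds at the time.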
4. Share ownership of key views
We paired platform and service engineers to design a small set of shared dashboards:
- one that both on-calls could open during a cross-cutting incident
- with a clear narrative: "what’s happening, who is affected, where to look next"
Ownership for these dashboards was joint:
- platform owns correctness of low-level metrics
- service teams help ensure the views answer their questions
Takeaways
- Platform dashboards need to tell a story that service teams can understand under pressure.
- Organizing platform views around consumers, not just components, makes incidents less adversarial.
- Shared dashboards and terminology help when "platform is fine" and "users are not fine" collide.