RELIABILITY · 2024-05-27 · BY PRIYA PATEL

Reshaping dashboards for SLO-first operations

What we changed about dashboards once we treated SLOs as the primary lens for reliability work.

reliability · dashboards · slo · observability

When we first introduced SLOs, we treated them as new graphs to add.

For each service, we created a set of SLO charts:

  • target vs actual availability
  • latency percentiles against objectives
  • error budgets consumed vs remaining

We kept all the old dashboards too.

The result was a lot of graphs and not much change in how we operated.

During incidents, on-call engineers still opened the same "classic" dashboards first. SLO charts were something people checked later, if at all.

We decided to flip the model: dashboards should start from SLOs, with everything else hanging off of them.

Constraints

  • We could not redesign every dashboard overnight.
  • Different services had different levels of SLO maturity.
  • We didn’t want to hide detailed metrics that senior engineers relied on.

What we changed

1. Make SLO status the first row

On key dashboards, we moved SLO status to the top row:

  • availability SLO and error budget
  • latency SLO and budget burn

Everything else moved below.

This turned dashboards into answers to two questions:

  1. Are we inside or outside our objectives?
  2. How fast are we burning error budget?

Only after that did we look at CPU, memory, and other internals.
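The arithmetic behind those two questions is small enough to sketch. A minimal example, assuming a 30-day availability SLO and roughly uniform traffic; the function and variable names are illustrative, not our production code:

    # Illustrative only: how fast a measurement window burned error budget,
    # and how much of the 30-day budget that window consumed.

    SLO_WINDOW_S = 30 * 24 * 3600  # 30-day SLO window

    def budget_burn(slo_target: float,
                    window_s: int,
                    total_requests: int,
                    failed_requests: int) -> tuple[float, float]:
        """Return (burn_rate, fraction of total budget spent in this window)."""
        allowed_ratio = 1.0 - slo_target                 # e.g. 0.001 for 99.9%
        observed_ratio = failed_requests / total_requests
        burn_rate = observed_ratio / allowed_ratio       # 1.0 = sustainable pace
        # Assuming roughly uniform traffic over the SLO window, the share of
        # the whole budget spent during this window is:
        budget_spent = burn_rate * (window_s / SLO_WINDOW_S)
        return burn_rate, budget_spent

    # Example: 99.9% target, one hour with 1,000,000 requests and 2,500 failures.
    rate, spent = budget_burn(0.999, 3600, 1_000_000, 2_500)
    print(f"burn rate: {rate:.1f}x, budget spent this hour: {spent:.1%}")

A burn rate above 1.0 means the budget would run out before the end of the SLO window if the current error rate continued.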

2. Tie alerts to SLOs, not raw metrics

We aligned alerts with SLO views:

  • primary pages fired when SLOs were at risk or violated, not when a single metric twitched
  • informational alerts pointed to SLO-adjacent metrics (e.g., increasing error rates that hadn’t burned much budget yet)

This reduced the number of alerts that were "noisy but within error budget."
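As a rough illustration of the shape of those primary pages, here is a minimal multi-window burn-rate check. The thresholds and window sizes are assumptions borrowed from common fast-burn guidance, not the exact values we page on:

    # Illustrative multi-window burn-rate check: page only when the budget is
    # burning fast over both a longer and a shorter window, so a brief spike
    # that has already stopped does not keep paging.

    def should_page(burn_rate_1h: float, burn_rate_5m: float,
                    threshold: float = 14.4) -> bool:
        """Page when both windows exceed the threshold.

        14.4 is a common starting point for fast-burn alerts: it corresponds
        to spending about 2% of a 30-day error budget in a single hour.
        """
        return burn_rate_1h > threshold and burn_rate_5m > threshold

    # A short metric spike that stays within budget does not page:
    print(should_page(burn_rate_1h=2.0, burn_rate_5m=30.0))   # False
    # A sustained fast burn does:
    print(should_page(burn_rate_1h=20.0, burn_rate_5m=25.0))  # True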

3. Group diagnostics by user journey

We restructured dashboards under the SLO summary:

  • per-journey slices (sign-in, checkout, recovery)
  • each slice showing the relevant SLOs and supporting metrics

For example, checkout’s section included:

  • SLO status for success rate and latency
  • breakdown by dependency (payments, inventory, tax)
  • key saturation signals

This made it easier to go from "we’re burning error budget" to "which part of the journey is responsible."
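To make the structure concrete, here is a rough sketch of a journey section described as data. The panel titles and query names are invented for illustration; they are not our real dashboard definitions:

    # Hypothetical layout for one journey section of an SLO-first dashboard.
    # Panel titles and query names are invented for illustration.

    CHECKOUT_SECTION = {
        "journey": "checkout",
        "slo_row": [  # SLO status always comes first
            {"panel": "Checkout success-rate SLO", "query": "checkout_success_slo"},
            {"panel": "Checkout latency SLO",      "query": "checkout_latency_slo"},
        ],
        "dependencies": [  # breakdown by dependency
            {"panel": "Payments error rate",  "query": "payments_error_rate"},
            {"panel": "Inventory error rate", "query": "inventory_error_rate"},
            {"panel": "Tax service latency",  "query": "tax_latency_p99"},
        ],
        "saturation": [  # key saturation signals last
            {"panel": "Checkout worker queue depth", "query": "checkout_queue_depth"},
        ],
    }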

4. Reduce cardinality where it didn’t help SLOs

We used SLOs to decide where high-cardinality metrics were actually valuable.

If a dimension didn’t help us explain or improve SLO behavior, we:

  • removed it from default dashboards
  • kept it accessible for ad-hoc debugging

This made dashboards faster and less overwhelming under pressure.
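As an illustration of that filtering step, a small sketch of dropping label dimensions that don't feed an SLO view before a metric reaches the default dashboards. The label names and the allowlist are hypothetical:

    # Hypothetical example: keep only the label dimensions that help explain
    # SLO behavior on default dashboards; everything else is dropped before
    # the metric is written (it can still be captured in traces or logs for
    # ad-hoc debugging).

    SLO_RELEVANT_LABELS = {"journey", "region", "status_class"}

    def strip_labels(labels: dict[str, str]) -> dict[str, str]:
        """Drop high-cardinality labels (user_id, request_id, ...) from
        dashboard-facing metrics."""
        return {k: v for k, v in labels.items() if k in SLO_RELEVANT_LABELS}

    print(strip_labels({
        "journey": "checkout",
        "region": "eu-west-1",
        "status_class": "5xx",
        "user_id": "8f3a9c",   # useful in a trace, useless on a dashboard
    }))
    # {'journey': 'checkout', 'region': 'eu-west-1', 'status_class': '5xx'}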

5. Add "error budget drills" to practice

We ran short exercises where:

  • we simulated an SLO burn (e.g., injecting errors in staging or replaying traffic; see the sketch after this list)
  • on-call engineers used only the SLO-first dashboards to investigate
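A minimal sketch of the error-injection half of such a drill; the staging endpoint, the failure-triggering header, and the rates are made up for this example:

    # Illustrative drill helper: inject a controlled error rate against a
    # staging endpoint so the SLO dashboards show a realistic budget burn.
    # The endpoint, header, rate, and duration are made up for this sketch.

    import random
    import time
    import urllib.request

    def inject_errors(url: str, error_ratio: float, duration_s: int) -> None:
        """Send requests for duration_s seconds; a fraction of them carry a
        staging-only header asking the service to fail on purpose."""
        deadline = time.time() + duration_s
        while time.time() < deadline:
            req = urllib.request.Request(url)
            if random.random() < error_ratio:
                req.add_header("X-Chaos-Fail", "1")  # hypothetical staging-only header
            try:
                urllib.request.urlopen(req, timeout=2)
            except Exception:
                pass  # failed requests are the point of the drill
            time.sleep(0.1)  # roughly 10 requests per second

    # Example: burn budget visibly for five minutes at a 20% error ratio.
    # inject_errors("https://staging.example.com/checkout", 0.2, 300)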

Feedback from these drills drove further adjustments:

  • missing links between SLO and supporting graphs
  • confusing labels or units
  • gaps where a heavily used metric wasn’t visible from SLO views

Results / Measurements

We saw a few concrete changes:

  • Incident narratives referenced SLOs more. Instead of "latency was high," reviews said "we burned X% of the latency error budget between 10:05 and 10:25."
  • On-call decisions aligned more with objectives. Pages from SLO-based alerts were easier to prioritize over noisy metric-only alerts.
  • Dashboard performance improved. Simplifying high-cardinality views reduced query failures during peak times.

We also noticed that new engineers learned the system through SLO dashboards first, which made their mental models closer to how we wanted to operate.

Takeaways

  • SLOs are not just extra graphs; they should be the first row on your dashboards.
  • Aligning alerts and dashboards around SLOs makes it easier to tell whether a spike matters.
  • Grouping supporting metrics by user journey helps connect reliability work to real impact.
  • Practicing with SLO-first views surfaces missing links and confusing design before incidents do.
