RELIABILITY · 2022-10-28 · BY STORECODE

Q&A: how we decide what gets an SLO

Answers to common questions about which services and behaviors deserve explicit SLOs and how we choose them.

reliability · slo · observability · operations

Q&A

Does every service need an SLO?

No.

An SLO is a tool for setting expectations about reliability and making trade-offs visible.

We prioritize SLOs for:

  • user-facing flows where downtime or slowness is costly
  • internal tools used during incidents (deploys, rollbacks, incident dashboards)
  • shared services with many consumers

Small, low-risk scripts or batch jobs don’t need formal SLOs; they need monitoring and clear failure notices.

How do we choose what the SLO measures?

We start from the user’s perspective:

  • What does "good" look like for them?
  • What does "bad" look like?

For most services, that leads to one or two primary SLOs:

  • availability (percentage of successful requests)
  • latency (e.g., P95 under X ms)

For some internal tools, we care more about:

  • time-to-complete important actions (e.g., rollbacks)
  • freshness of data (e.g., dashboard reflects last N minutes)

We avoid SLOs for metrics that don’t map cleanly to user experience.
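
To make that concrete, here's a minimal sketch of how the two primary SLIs can be computed from raw request records. The Request shape, the 99.9% availability target, and the 250 ms P95 threshold are illustrative assumptions, not our production values:

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    success: bool      # completed without a server-side error
    latency_ms: float  # end-to-end latency as the user experienced it

def availability(requests: list[Request]) -> float:
    """Availability SLI: percentage of successful requests."""
    if not requests:
        return 100.0
    ok = sum(1 for r in requests if r.success)
    return 100.0 * ok / len(requests)

def p95_latency_ms(requests: list[Request]) -> float:
    """Latency SLI: 95th-percentile latency (nearest-rank method)."""
    latencies = sorted(r.latency_ms for r in requests)
    if not latencies:
        return 0.0
    rank = math.ceil(0.95 * len(latencies))  # 1-based nearest rank
    return latencies[rank - 1]

# Hypothetical targets: 99.9% availability, P95 under 250 ms.
def meets_slos(requests: list[Request]) -> bool:
    return availability(requests) >= 99.9 and p95_latency_ms(requests) <= 250.0
```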

How many SLOs should a service have?

Enough to describe what matters, but not so many that nobody can remember them.

We aim for:

  • 1–3 primary SLOs per service or flow
  • a few supporting metrics for debugging

If a service has more than ~5 SLOs, we ask whether some can be combined or dropped.

Who decides the SLO targets?

It’s a joint decision between:

  • the team that owns the service
  • stakeholders who depend on it (product, support, other teams)

We look at:

  • historical performance (what the system already does)
  • user expectations (e.g., "checkout should feel instant")
  • cost of achieving tighter targets

We prefer to start with realistic targets the system already meets, grounded in current performance, and tighten them over time if it makes sense.
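
As a rough illustration of "grounded in current performance": a hypothetical helper that rounds measured availability down to the tightest standard target the service already meets. The candidate ladder is an assumption for the example, not our policy:

```python
# Common "number of nines" targets, from loosest to tightest (illustrative).
CANDIDATE_TARGETS = [99.0, 99.5, 99.9, 99.95, 99.99]

def propose_target(measured_availability: float) -> float:
    """Pick the tightest standard target the service already meets.

    Starting at or just below measured performance keeps the SLO realistic;
    it can be tightened later if the trade-off makes sense.
    """
    achievable = [t for t in CANDIDATE_TARGETS if t <= measured_availability]
    return max(achievable) if achievable else CANDIDATE_TARGETS[0]

# Example: a service measured at 99.93% over the last quarter
# gets a proposed target of 99.9%.
assert propose_target(99.93) == 99.9
```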

What about error budgets?

An SLO is incomplete without an error budget.

The error budget is:

  • how much we’re allowed to miss the SLO over a period
  • the "spend" we can use on risk (deploys, experiments)

When a service consistently burns through its error budget, we:

  • slow down risky changes
  • prioritize reliability work

This turns reliability work into a trade-off, not an afterthought.
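
The budget arithmetic itself is simple. A sketch assuming a request-count budget; the 99.9% target and the 10M-request window are illustrative:

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO allows over the window."""
    return int((1.0 - slo_target / 100.0) * total_requests)

def budget_remaining(slo_target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative = blown)."""
    budget = error_budget(slo_target, total_requests)
    return (budget - failed) / budget if budget else 0.0

# Illustrative: a 99.9% SLO over 10M requests allows 10,000 failures.
assert error_budget(99.9, 10_000_000) == 10_000
```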

How do SLOs show up during incidents?

During incidents, SLOs help us:

  • decide how urgent something is (SLO impact vs. minor blip)
  • choose between "fix now" and "schedule later"
  • communicate impact clearly ("we burned X% of the weekly error budget")

We also look at whether the SLO itself needs adjustment based on real incidents.
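
For the "we burned X% of the weekly error budget" framing, here's a hypothetical time-based version of that calculation:

```python
WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a week

def weekly_budget_minutes(slo_target: float) -> float:
    """Minutes of full outage a time-based SLO allows per week."""
    return (1.0 - slo_target / 100.0) * WEEK_MINUTES

def budget_burned_pct(slo_target: float, outage_minutes: float) -> float:
    """Share of the weekly error budget consumed by one incident."""
    return 100.0 * outage_minutes / weekly_budget_minutes(slo_target)

# Illustrative: a 30-minute outage against a 99.9% weekly SLO burns
# 30 / 10.08 ≈ 298% of the budget -- more than the whole week's
# allowance, which is exactly the kind of fact worth stating in an
# incident update.
print(f"{budget_burned_pct(99.9, 30):.0f}% of weekly budget burned")
```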

Takeaways

  • Not every service needs an SLO, but critical flows and tools do.
  • Good SLOs describe user experience in a handful of numbers.
  • SLO targets should be realistic and negotiated, not aspirational slogans.
  • Error budgets turn reliability into an explicit trade-off instead of a vague goal.
