Note: error budgets for internal tools
Why we started giving internal tools explicit error budgets instead of treating them as best-effort.
Internal tools break in quieter ways than public APIs.
A button is missing. A job queue silently stalls. A report never finishes loading.
Nobody tweets about it. They just send screenshots in a private channel and try again later.
For a long time we treated this as acceptable: "it’s an internal tool; they know it’s not perfect."
Then we realized how often incidents were made worse by those same tools:
- on-call couldn’t see what they needed because an admin UI timed out
- a support dashboard lagged by hours, making it hard to confirm user impact
- a deploy tool flaked under load, turning a simple rollback into a multi-step dance
The failure wasn’t that the tools sometimes broke.
The failure was that we had no shared expectation for how often they could be broken before we would stop and fix them.
What we changed
We started giving critical internal tools explicit error budgets, just like user-facing services.
For each tool, we asked:
- Who relies on this under stress?
- What does a "good" week look like?
- What kind of failure would force people to improvise manual workarounds?
Then we picked a small set of signals (a minimal instrumentation sketch follows this list):
- availability (can it be loaded at all?)
- success rate for key actions (e.g., triggering a rollback, looking up a user)
- latency at the moments that matter (e.g., loading the incident dashboard)
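To make the second and third signals concrete, here’s a minimal sketch of recording them with the Prometheus Python client. The metric names and the `run_rollback`/`do_rollback` functions are hypothetical stand-ins, not our actual tooling:

```python
from prometheus_client import Counter, Histogram

# Hypothetical metric names; point them at your own tool's key actions.
ACTION_RESULTS = Counter(
    "internal_tool_action_total",
    "Key actions attempted, labeled by action and outcome",
    ["action", "outcome"],
)
ACTION_LATENCY = Histogram(
    "internal_tool_action_seconds",
    "Wall-clock time spent on key actions",
    ["action"],
)

def do_rollback(service: str) -> None:
    ...  # stand-in for the deploy tool's real rollback logic

def run_rollback(service: str) -> None:
    """Run a rollback while recording the signals the SLOs are built on."""
    with ACTION_LATENCY.labels(action="rollback").time():
        try:
            do_rollback(service)
            ACTION_RESULTS.labels(action="rollback", outcome="success").inc()
        except Exception:
            ACTION_RESULTS.labels(action="rollback", outcome="failure").inc()
            raise
```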
We defined simple SLOs (the budget arithmetic they imply is sketched after these examples):
- "The deploy tool completes X% of deploy and rollback actions in Y minutes."
- "The incident dashboard loads successfully for Z% of requests over 30 days."
When a tool burned through its error budget (a burn-rate check, sketched after the list below, is one way to notice), we didn’t hold a ceremony.
We did three practical things:
- paused new feature work on that tool
- added it to the main reliability review like any other service
- made a small, time-boxed plan to get it back within budget
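The burn-rate check mentioned above is the standard way to catch a budget going early, before the full window elapses; it isn’t described in detail in this note, so the sketch and numbers here are illustrative:

```python
def burn_rate(target: float, total: int, failures: int) -> float:
    """How fast the error budget is being spent, relative to an even spend.

    1.0 means the budget lasts exactly the SLO window; above 1.0 it runs
    out early. Multi-window burn-rate alerts (per the Google SRE workbook)
    page on a high rate sustained over a short window.
    """
    allowed_error_rate = 1 - target                 # e.g. 0.001 for 99.9%
    observed_error_rate = failures / total if total else 0.0
    return observed_error_rate / allowed_error_rate

# Illustrative: 0.5% failures against a 99.9% target burns budget 5x too fast.
print(burn_rate(target=0.999, total=10_000, failures=50))  # ~5.0
```

A rate held above 1.0 means the budget will run out before the window does, which is a reasonable point to trip the three steps above.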
The effect was quiet but real:
- incident reviews started mentioning internal tools as part of the system, not as an afterthought
- engineers were more willing to say "we can’t ship this until the rollback path is reliable"
- support teams had more leverage to ask for fixes
Takeaways
- Internal tools are part of the production surface; they deserve explicit reliability targets.
- Error budgets give you a language to say "this is broken too often" without escalating every failure.
- A handful of well-chosen SLOs for internal tools can shorten incidents more than another public-facing micro-optimization.