Note: error budgets for internal tools
Why we started giving internal tools explicit error budgets instead of treating them as best-effort.
Internal tools break in quieter ways than public APIs.
A button is missing. A job queue silently stalls. A report never finishes loading.
Nobody tweets about it. They just send screenshots in a private channel and try again later.
For a long time we treated this as acceptable: "it’s an internal tool; they know it’s not perfect."
Then we realized how often incidents were made worse by those same tools:
- on-call couldn’t see what they needed because an admin UI timed out
- a support dashboard lagged by hours, making it hard to confirm user impact
- a deploy tool flaked under load, turning a simple rollback into a multi-step dance
The failure wasn’t that the tools sometimes broke.
The failure was that we had no shared expectation for how often they could be broken before we would stop and fix them.
What we changed
We started giving critical internal tools explicit error budgets, just like user-facing services.
For each tool, we asked:
- Who relies on this under stress?
- What does a "good" week look like?
- What kind of failure would force people to improvise manual workarounds?
Then we picked a small set of signals (a minimal instrumentation sketch follows this list):
- availability (can it be loaded at all?)
- success rate for key actions (e.g., triggering a rollback, looking up a user)
- latency at the moments that matter (e.g., loading the incident dashboard)
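To make the second and third signals concrete, here’s a minimal sketch of recording them with the Prometheus Python client. The metric names and the `run_rollback`/`do_rollback` functions are hypothetical stand-ins, not our actual tooling:

```python
from prometheus_client import Counter, Histogram

# Hypothetical metric names; point them at your own tool's key actions.
ACTION_RESULTS = Counter(
    "internal_tool_action_total",
    "Key actions attempted, labeled by action and outcome",
    ["action", "outcome"],
)
ACTION_LATENCY = Histogram(
    "internal_tool_action_seconds",
    "Wall-clock time spent on key actions",
    ["action"],
)

def do_rollback(service: str) -> None:
    ...  # stand-in for the deploy tool's real rollback logic

def run_rollback(service: str) -> None:
    """Run a rollback while recording the signals the SLOs are built on."""
    with ACTION_LATENCY.labels(action="rollback").time():
        try:
            do_rollback(service)
            ACTION_RESULTS.labels(action="rollback", outcome="success").inc()
        except Exception:
            ACTION_RESULTS.labels(action="rollback", outcome="failure").inc()
            raise
```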
We defined simple SLOs (the budget arithmetic they imply is sketched after these examples):
- "The deploy tool completes X% of deploy and rollback actions in Y minutes."
- "The incident dashboard loads successfully for Z% of requests over 30 days."
When a tool burned through its error budget (a burn-rate check, sketched after the list below, is one way to notice), we didn’t hold a ceremony.
We did three practical things:
- paused new feature work on that tool
- added it to the main reliability review like any other service
- made a small, time-boxed plan to get it back within budget
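The burn-rate check mentioned above is the standard way to catch a budget going early, before the full window elapses; it isn’t described in detail in this note, so the sketch and numbers here are illustrative:

```python
def burn_rate(target: float, total: int, failures: int) -> float:
    """How fast the error budget is being spent, relative to an even spend.

    1.0 means the budget lasts exactly the SLO window; above 1.0 it runs
    out early. Multi-window burn-rate alerts (per the Google SRE workbook)
    page on a high rate sustained over a short window.
    """
    allowed_error_rate = 1 - target                 # e.g. 0.001 for 99.9%
    observed_error_rate = failures / total if total else 0.0
    return observed_error_rate / allowed_error_rate

# Illustrative: 0.5% failures against a 99.9% target burns budget 5x too fast.
print(burn_rate(target=0.999, total=10_000, failures=50))  # ~5.0
```

A rate held above 1.0 means the budget will run out before the window does, which is a reasonable point to trip the three steps above.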
The effect was quiet but real:
- incident reviews started mentioning internal tools as part of the system, not as an afterthought
- engineers were more willing to say "we can’t ship this until the rollback path is reliable"
- support teams had more leverage to ask for fixes
Takeaways
- Internal tools are part of the production surface; they deserve explicit reliability targets.
- Error budgets give you a language to say "this is broken too often" without escalating every failure.
- A handful of well-chosen SLOs for internal tools can shorten incidents more than another public-facing micro-optimization.