Story: the 'temporary' admin tool that became production-critical
An internal admin tool started as a quick experiment and quietly became essential. This is how we found out, and what we did about it.
What happened
The admin tool began as a stopgap.
We needed a way for a small internal team to:
- look up certain records
- trigger a few operations
- check statuses without asking engineers every time
The first version shipped quickly:
- a simple web interface around a handful of APIs
- minimal authentication
- no formal SLOs
Everyone agreed it was "temporary" until we built something "proper."
Years passed.
More flows accreted onto the tool:
- support used it to resolve user issues
- operations used it to run bulk actions
- engineers used it to check the state of long-running tasks
It made work faster, so more teams relied on it.
Then, one day, it broke.
The outage
A routine dependency change removed an endpoint the admin tool relied on.
The user-facing product kept working.
The admin tool did not.
At first, the impact looked small:
- a few internal users saw errors
- some bulk actions had to be retried later
Over the next few hours, we realized how deeply the tool was wired into our operations:
- support couldn’t quickly answer certain questions
- some manual recovery and reconciliation flows stalled
- on-call engineers had to run ad-hoc scripts instead of using the usual interface
The tool had quietly become production-critical, not for our users but for the humans who operate the system.
We had never written that down.
What we changed
1. Admit that internal tools can be critical
We updated our mental model:
- if a tool is required to operate the system safely, it is part of production
- production includes the flows used by humans during incidents and routine operations
We started tagging such tools explicitly:
- labeled "operationally critical"
- assigned owners and SLOs
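One way to make the tag concrete is a small registry that records each tool's criticality, owner, and SLO, plus a check that flags critical tools missing either. This is a minimal sketch, not our actual tooling: the names `InternalTool`, `find_untracked`, and every registry entry below are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InternalTool:
    name: str
    operationally_critical: bool
    owner: Optional[str] = None               # team accountable for the tool
    availability_slo: Optional[float] = None  # e.g. 0.995 during support hours


def find_untracked(tools: list[InternalTool]) -> list[str]:
    """Return names of critical tools that lack an owner or an SLO."""
    return [
        t.name
        for t in tools
        if t.operationally_critical
        and (t.owner is None or t.availability_slo is None)
    ]


# Hypothetical registry entries, for illustration only.
REGISTRY = [
    InternalTool("admin-tool", True, owner="support-platform", availability_slo=0.995),
    InternalTool("bulk-runner", True),         # critical, but no owner or SLO yet
    InternalTool("sandbox-dashboard", False),  # not critical, so nothing is required
]

if __name__ == "__main__":
    for name in find_untracked(REGISTRY):
        print(f"critical tool without owner or SLO: {name}")
```

Running a check like this in CI or in a periodic review keeps the tag from drifting back into tribal knowledge.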
2. Give the tool a proper home
We:
- moved the admin tool into a repository with clear ownership
- set up CI, tests, and deploy pipelines
- added observability: metrics, logs, and dashboards
This wasn’t about making it perfect.
It was about bringing it into the same engineering discipline as the services it controlled.
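The outage began with a removed endpoint, which is the kind of failure a contract check in CI can catch before a deploy. The sketch below assumes the tool declares the endpoints it depends on and that the upstream service publishes its routes in some machine-readable form. Both sets, and the function name `missing_endpoints`, are hypothetical.

```python
from __future__ import annotations

from typing import Set, Tuple

Endpoint = Tuple[str, str]  # (HTTP method, path template)


def missing_endpoints(required: Set[Endpoint], available: Set[Endpoint]) -> list[Endpoint]:
    """Return the endpoints the tool needs that upstream no longer serves."""
    return sorted(required - available)


# Hypothetical: what the admin tool calls.
REQUIRED = {
    ("GET", "/records/{id}"),
    ("POST", "/operations/bulk"),
    ("GET", "/tasks/{id}/status"),
}

# Hypothetical: what the dependency exposes after a "routine" change.
AVAILABLE = {
    ("GET", "/records/{id}"),
    ("POST", "/operations/bulk"),
    ("GET", "/health"),
}

if __name__ == "__main__":
    gone = missing_endpoints(REQUIRED, AVAILABLE)
    if gone:
        raise SystemExit(f"dependency no longer serves: {gone}")
```

The value is less in the set difference than in forcing the tool to write down what it depends on, so that a removal shows up as a failed build rather than as a support escalation.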
3. Define its SLOs
We asked:
- who relies on this tool?
- what does "good" look like for them?
We set simple SLOs:
- availability during core support hours
- acceptable latency for key actions
We also created error budgets for the tool and surfaced them in reliability reviews.
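The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO measured only during core support hours. The 99.5% target and the 10-hour, 5-day support window are illustrative numbers, not our actual SLO.

```python
def error_budget_minutes(slo_target: float, window_minutes: float) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget_minutes(slo_target, window_minutes)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime_minutes / budget


# Illustrative window: 10 support hours a day, 5 days a week, for 4 weeks.
WINDOW = 10 * 60 * 5 * 4  # 12,000 minutes of core support hours

if __name__ == "__main__":
    budget = error_budget_minutes(0.995, WINDOW)  # roughly 60 minutes
    left = budget_remaining(0.995, WINDOW, downtime_minutes=45)
    print(f"budget: {budget:.0f} min, remaining: {left:.0%}")
```

Measuring against support hours rather than the full calendar keeps the budget honest: an hour of downtime at 3 a.m. on a Sunday costs the tool's users far less than an hour on a Monday morning.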
4. Reduce risky operations
Over time, the tool had accumulated powerful actions:
- bulk changes
- force-complete operations
- direct state edits
We:
- audited these actions
- added confirmations and guardrails
- moved the riskiest flows behind stricter permissions or into safer, scripted paths
This reduced the chance that a bug or a rushed click would cause a large incident.
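A guardrail of this kind can be sketched as a check that runs before any risky action. This is an illustration under stated assumptions, not our implementation: the role names, the batch cap of 100, and the "type the record count to confirm" convention are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


class GuardrailError(Exception):
    """Raised when a risky action is blocked before it runs."""


@dataclass(frozen=True)
class BulkAction:
    name: str
    target_ids: tuple[str, ...]
    required_role: str = "operator"
    max_targets: int = 100  # hypothetical cap; larger batches go through a scripted path


def check_guardrails(action: BulkAction, user_roles: set[str], typed_confirmation: str) -> None:
    """Raise GuardrailError unless the caller is permitted and has explicitly confirmed."""
    if action.required_role not in user_roles:
        raise GuardrailError(f"{action.name} requires role {action.required_role!r}")
    if len(action.target_ids) > action.max_targets:
        raise GuardrailError(
            f"{action.name} touches {len(action.target_ids)} records; "
            f"limit is {action.max_targets}"
        )
    # The user must type the exact number of affected records,
    # so a rushed click alone cannot trigger the action.
    if typed_confirmation.strip() != str(len(action.target_ids)):
        raise GuardrailError("confirmation does not match the number of affected records")


if __name__ == "__main__":
    action = BulkAction("force-complete", ("t1", "t2", "t3"), required_role="admin")
    check_guardrails(action, {"admin"}, typed_confirmation="3")  # passes silently
```

Keeping the check separate from the action itself means the same rules can be applied to the web interface and to any scripted path.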
5. Plan for its evolution
We accepted that the tool was no longer temporary.
We created a small roadmap:
- split out the most critical flows into better-defined applications or APIs
- pay down UI and performance debt that slowed incident response
- improve accessibility so more people could use it effectively
Takeaways
- Internal tools can quietly become production-critical; impact is measured in human workflows, not just API errors.
- "Temporary" often means "permanent without a plan." Ownership and SLOs should follow usage, not intention.
- Observability, tests, and guardrails matter as much for admin tools as for user-facing services.
- Treating internal tools as products helps them evolve safely instead of accumulating hidden risk.