STEWARDSHIP | 2024-12-05 | BY STORECODE

Story: the 'temporary' admin tool that became production-critical

An internal admin tool started as a quick experiment and quietly became essential. We describe how we discovered that and what we did about it.

Tags: stewardship, internal-tools, reliability, incidents

What happened

The admin tool began as a stopgap.

We needed a way for a small internal team to:

look up certain records
trigger a few operations
check statuses without asking engineers every time

The first version shipped quickly:

a simple web interface around a handful of APIs
minimal authentication
no formal SLOs

Everyone agreed it was "temporary" until we built something "proper."

Years passed.

More flows accreted onto the tool:

support used it to resolve user issues
operations used it to run bulk actions
engineers used it to check the state of long-running tasks

It made work faster, so more teams relied on it.

Then, one day, it broke.

The outage

A routine dependency change removed an endpoint the admin tool relied on.

The user-facing product kept working.

The admin tool did not.

At first, the impact looked small:

a few internal users saw errors
some bulk actions had to be retried later

Over the next few hours, we realized how deeply the tool was wired into our operations:

support couldn’t quickly answer certain questions
some manual recovery and reconciliation flows stalled
on-call engineers had to run ad-hoc scripts instead of using the usual interface

The tool had quietly become production-critical for the humans who operate the system.

We had never written that down.

What we changed

1. Admit that internal tools can be critical

We updated our mental model:

if a tool is required to operate the system safely, it is part of production
production includes the flows used by humans during incidents and routine operations

We started explicitly tagging such tools as "operationally critical" and giving each one owners and SLOs.
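
One way to make that tagging checkable is a small inventory that records, per tool, whether it is operationally critical, who owns it, and which SLOs it carries. The sketch below is illustrative only: the `ToolRecord` fields, the tool names, and the team name are assumptions, not the schema we actually used.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ToolRecord:
    """One entry in a hypothetical internal-tool inventory."""
    name: str
    owner: str = ""                      # owning team; empty means unowned
    operationally_critical: bool = False
    slos: Tuple[str, ...] = ()           # short, human-readable SLO statements


def stewardship_gaps(tools: Iterable[ToolRecord]) -> List[str]:
    """Return names of critical tools that lack an owner or have no SLOs."""
    return [
        t.name
        for t in tools
        if t.operationally_critical and (not t.owner or not t.slos)
    ]


# Hypothetical inventory: names and owners are made up for illustration.
inventory = [
    ToolRecord(
        "admin-tool",
        owner="support-platform",
        operationally_critical=True,
        slos=("available during core support hours",),
    ),
    ToolRecord("report-exporter", operationally_critical=True),
    ToolRecord("scratch-dashboard"),
]

print(stewardship_gaps(inventory))
```

A check like this can run in CI or in a periodic review, so that a tool tagged as critical without an owner or SLOs is flagged rather than discovered during an outage.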

2. Give the tool a proper home

We:

moved the admin tool into a repository with clear ownership
set up CI, tests, and deploy pipelines
added observability: metrics, logs, and dashboards

This wasn’t about making it perfect.

It was about bringing it into the same engineering discipline as the services it controlled.
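
A test that would likely have caught our outage is a contract check: the tool declares which upstream endpoints it depends on, and CI fails when any of them disappears from what the upstream service publishes. The following is a minimal sketch under assumptions. The endpoint paths are hypothetical, and how the published list is obtained (an API schema, a route dump, or similar) is left out.

```python
from typing import Iterable, List, Tuple

Endpoint = Tuple[str, str]  # (HTTP method, path template)

# Hypothetical: the upstream endpoints the admin tool calls.
REQUIRED_ENDPOINTS = {
    ("GET", "/records/{id}"),
    ("POST", "/operations/bulk"),
    ("GET", "/tasks/{id}/status"),
}


def missing_endpoints(
    required: Iterable[Endpoint], published: Iterable[Endpoint]
) -> List[Endpoint]:
    """Return required endpoints that the upstream no longer publishes."""
    return sorted(set(required) - set(published))


# Example: the upstream has dropped the task-status endpoint.
published_now = [
    ("GET", "/records/{id}"),
    ("POST", "/operations/bulk"),
    ("GET", "/health"),
]

gaps = missing_endpoints(REQUIRED_ENDPOINTS, published_now)
if gaps:
    print("Upstream no longer serves:", gaps)
```

Running the same check in the upstream service's pipeline, not only in the tool's, means the change that removes an endpoint fails before it ships.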

3. Define its SLOs

We asked:

who relies on this tool?
what does "good" look like for them?

We set simple SLOs:

availability during core support hours
acceptable latency for key actions

We also created error budgets for the tool and surfaced them in reliability reviews.
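
To make "availability during core support hours" concrete, the sketch below computes availability and remaining error budget from health-probe results, counting only probes that fall inside core hours. The hours (09:00 to 17:00, Monday to Friday), the 99% default target, and the probe format are all assumptions chosen for illustration, not our actual SLO values.

```python
from datetime import datetime, time
from typing import Iterable, Optional, Tuple

# Assumed core support hours; adjust to the team's real schedule.
CORE_START = time(9, 0)
CORE_END = time(17, 0)


def in_core_hours(ts: datetime) -> bool:
    """True for Monday to Friday between CORE_START and CORE_END."""
    return ts.weekday() < 5 and CORE_START <= ts.time() < CORE_END


def error_budget(
    checks: Iterable[Tuple[datetime, bool]], slo_target: float = 0.99
) -> Optional[dict]:
    """Summarize probe results of the form (timestamp, succeeded).

    Only probes inside core hours count. Returns None if there are none.
    """
    core = [ok for ts, ok in checks if in_core_hours(ts)]
    if not core:
        return None
    failures = sum(1 for ok in core if not ok)
    allowed = (1.0 - slo_target) * len(core)  # failures the SLO tolerates
    return {
        "availability": 1.0 - failures / len(core),
        "failures": failures,
        "allowed_failures": allowed,
        "budget_remaining": allowed - failures,  # negative means SLO missed
    }
```

A negative `budget_remaining` is the signal to raise in a reliability review: the tool failed more often during core hours than its SLO allows.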

4. Reduce risky operations

Over time, the tool had accumulated powerful actions:

bulk changes
force-complete operations
direct state edits

We:

audited these actions
added confirmations and guardrails
moved the riskiest flows behind stricter permissions or into safer, scripted paths

This reduced the chance that a bug or a rushed click would cause a large incident.
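
The two guardrails described above, stricter permissions for the riskiest actions and explicit confirmation for large bulk changes, can be sketched as a single pre-flight check. The action names, the role name, and the threshold of 50 targets are hypothetical values for illustration.

```python
from typing import Iterable, Optional


class GuardrailError(Exception):
    """Raised when an action is blocked by a guardrail."""


# Hypothetical policy values.
RISKY_ACTIONS = frozenset({"force_complete", "direct_state_edit"})
ELEVATED_ROLE = "admin-elevated"
BULK_CONFIRM_THRESHOLD = 50


def check_guardrails(
    action: str,
    target_count: int,
    roles: Iterable[str],
    typed_confirmation: Optional[str] = None,
) -> None:
    """Raise GuardrailError if the action must not proceed; otherwise return."""
    # Risky actions require an elevated role regardless of size.
    if action in RISKY_ACTIONS and ELEVATED_ROLE not in set(roles):
        raise GuardrailError(f"{action} requires the {ELEVATED_ROLE} role")

    # Large bulk changes require the operator to type the action and count,
    # which makes a rushed click on the wrong selection harder.
    if target_count > BULK_CONFIRM_THRESHOLD:
        expected = f"{action} {target_count}"
        if typed_confirmation != expected:
            raise GuardrailError(f"type '{expected}' to confirm")
```

For example, `check_guardrails("bulk_update", 200, {"support"}, "bulk_update 200")` passes, while the same call without the typed confirmation raises `GuardrailError`.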

5. Plan for its evolution

We accepted that the tool was no longer temporary.

We created a small roadmap:

split out the most critical flows into better-defined applications or APIs
pay down UI and performance debt that slowed incident response
improve accessibility so more people could use it effectively

Takeaways

Internal tools can quietly become production-critical; impact is measured in human workflows, not just API errors.
"Temporary" often means "permanent without a plan." Ownership and SLOs should follow usage, not intention.
Observability, tests, and guardrails matter as much for admin tools as for user-facing services.
Treating internal tools as products helps them evolve safely instead of accumulating hidden risk.