DESIGN | 2018-12-18 | BY MARA SABOGAL

The button that looked harmless

A destructive admin action was labeled like a refresh. We changed the UI so the safe path is obvious under stress.

design, operations, internal-tools, reliability

At 4:42 pm, an operations lead clicked a button that said “Rebuild.”

It lived next to “Refresh.” Same size. Same color. Same tone.

They weren’t being careless. They were doing what the UI suggested: the harmless thing.

They were trying to answer a support question: “Did the batch finish yet?” The dashboard looked stale, and the room was loud in the way it gets loud right before an incident. The tool offered two verbs, Refresh and Rebuild, and one of them sounded like “make it correct again.”

What happened

“Rebuild” didn’t refresh a view.

It kicked off a background job that reprocessed a large slice of production data. It was slow and loud, and it competed with real customer traffic.

Within minutes, the service started falling behind. The queue grew. Support saw retries. Engineers got paged.

The job had no visible progress indicator, no estimate of how much data it would touch, and no obvious “stop” button. Everyone’s first instinct was to scale, because that’s what you do when things are slow. But scaling doesn’t fix a job that is doing the wrong amount of work.

Support asked if the customer’s retry “caused it.” Engineering asked if there was a deploy. Nobody had a shared reference to point at, because the tool didn’t emit one.

So the conversation looped:

“What exactly did you click?”

“The rebuild button.”

“The one next to refresh?”

“Yes.”

That’s the operational tax of ambiguous interfaces: you spend time reconstructing what happened instead of responding.

We also learned something about labels. “Rebuild” sounded like it would make things correct again.

It did.

But it got there by doing the most expensive possible thing, immediately, with no throttle.

The incident wasn’t caused by a bad person clicking the wrong thing.

It was caused by an interface that made a high-impact action look like a low-impact one.

What we changed

We treated the admin tool like a production UI.

We separated destructive actions from routine actions.
We changed labels to describe the effect (“Reprocess last 30 days”) instead of the mechanism (“Rebuild”).
We added a confirmation step that forces a pause and shows scope (what will change, how long it might take).
We added a “danger zone” pattern: risky actions live in a different area with deliberate language.
We made the safe alternative visible (“Refresh view”) and the risky action deliberate.
We added permission scoping so only the people who own the job can run it.
We added an audit log + reference ID so support and engineering can talk about the same event.
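
To make the confirmation, permission, and audit items above concrete, here is a minimal sketch, not the tool's actual code. It assumes a typed-phrase confirmation (one possible way to force a pause) and an in-memory audit list. `ActionScope`, `confirmation_copy`, `confirm_and_log`, and the `ADM-` reference format are all hypothetical names chosen for illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionScope:
    """What a risky action will touch. All fields are illustrative."""
    label: str          # effect-based label, e.g. "Reprocess last 30 days"
    records: int        # estimated number of records touched
    est_minutes: int    # rough runtime estimate
    competes_with: str  # what the job shares capacity with


def confirmation_copy(scope: ActionScope) -> str:
    """Build confirmation text that states the scope in numbers."""
    return (
        f"{scope.label}\n"
        f"This will reprocess about {scope.records:,} records "
        f"and may run for roughly {scope.est_minutes} minutes.\n"
        f"It competes with {scope.competes_with}.\n"
        f'Type "{scope.label}" to continue.'
    )


def confirm_and_log(scope, typed, user, owners, audit_log):
    """Return a reference ID if the action may start, else None."""
    if user not in owners:
        return None  # permission scoping: only job owners may run it
    if typed.strip() != scope.label:
        return None  # assumption: the user must retype the label to confirm
    ref = f"ADM-{uuid.uuid4().hex[:8]}"  # hypothetical reference ID format
    audit_log.append(
        {"ref": ref, "user": user, "action": scope.label, "records": scope.records}
    )
    return ref
```

With made-up numbers, `confirmation_copy(ActionScope("Reprocess last 30 days", 1200000, 45, "live customer traffic"))` produces copy that names the record count, the expected runtime, and what the job competes with. The returned reference ID is the thing support and engineering can both quote.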

We also improved the moment after the click:

progress is visible (started, running, percent or batches done)
the job can be paused/stopped
the system makes throttling obvious before scaling becomes the reflex
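
A rough sketch of what "visible progress, stoppable, throttled" can look like for a batch job. It is an illustration under assumptions, not the real job runner: `ReprocessJob`, `handle_batch`, and `delay_s` are hypothetical names, and only stop is shown (pause would need a second flag).

```python
import threading
import time


class ReprocessJob:
    """Batch job that reports progress and checks for a stop request between batches."""

    def __init__(self, batches, handle_batch, delay_s=0.0):
        self.batches = list(batches)
        self.handle_batch = handle_batch
        self.delay_s = delay_s  # throttle: sleep between batches
        self.done = 0
        self.state = "started"
        self._stop = threading.Event()

    def stop(self):
        """Request a stop; it takes effect before the next batch starts."""
        self._stop.set()

    def progress(self):
        """Human-readable progress line for the UI."""
        total = len(self.batches)
        pct = 100 * self.done // total if total else 100
        return f"{self.state}: {self.done}/{total} batches ({pct}%)"

    def run(self):
        self.state = "running"
        for batch in self.batches:
            if self._stop.is_set():
                self.state = "stopped"
                return
            self.handle_batch(batch)
            self.done += 1
            time.sleep(self.delay_s)
        self.state = "finished"
```

The stop check happens between batches, so a stop request never interrupts a batch halfway. Raising `delay_s` slows the job down without touching capacity, which is the lever to reach for before scaling.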

We also made the recovery path obvious:

the job can be paused/stopped
the UI links to the dashboard that shows backlog and lag
the runbook lists a safe first action before “scale it”

We updated the runbook: if backlog rises after an admin action, the safe first step is to stop the job (not to scale blindly).

Patterns we now default to

This incident turned into a small internal rule: if an action can degrade production, the UI must treat it as production work.

Concretely, we now default to:

object + scope labels: “Reprocess last 30 days” beats “Rebuild”
safe primary buttons: the default action should be reversible
confirmation copy with numbers: what it touches, how long it might run, what it will compete with
progress + escape hatches: visible progress, plus pause/stop, plus throttle
auditability: every action produces a reference ID
rate limiting: prevent repeated clicks from stacking work
defaults that prefer safety: the safe option is the easiest option
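
One way to keep repeated clicks from stacking work is to deduplicate by action: while a job for a given action is running, a second click returns the existing reference ID instead of starting another job. This is a sketch of that idea only. It is deduplication rather than time-window rate limiting, and `JobRegistry`, the action key format, and the `ADM-` prefix are hypothetical.

```python
import uuid


class JobRegistry:
    """Allow at most one running job per action key; repeat clicks get the same ID."""

    def __init__(self):
        self._running = {}  # action key -> reference ID of the running job

    def submit(self, action_key):
        """Return (reference_id, started). started is False for a repeat click."""
        if action_key in self._running:
            return self._running[action_key], False
        ref = f"ADM-{uuid.uuid4().hex[:8]}"  # hypothetical reference ID format
        self._running[action_key] = ref
        return ref, True

    def finish(self, action_key):
        """Mark the job as done so the action can be run again later."""
        self._running.pop(action_key, None)
```

Returning the existing reference ID on a repeat click also helps the incident room: everyone who clicked ends up holding the same ID.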

It’s not about making the UI scary.

It’s about making the safe path obvious and the risky path deliberate.

We also changed where the action lives.

The “reprocess” control moved behind a deliberate workflow (ticket → approval → button), not because we love ceremony, but because the cost of an accidental click is real.
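
The ticket → approval → button order can be expressed as a small gate. The sketch below is illustrative only: `ReprocessRequest` and `button_enabled` are hypothetical names, and the rule that the approver must differ from the requester is an assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReprocessRequest:
    ticket_id: Optional[str]           # hypothetical ticket reference
    requested_by: str
    approved_by: Optional[str] = None  # set once someone approves the ticket


def button_enabled(req: ReprocessRequest) -> bool:
    """The button is usable only after a ticket and an approval both exist."""
    if not req.ticket_id:
        return False
    if not req.approved_by:
        return False
    # Assumption: the approver must be someone other than the requester.
    return req.approved_by != req.requested_by
```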

We didn’t remove the ability to do the work.

We removed the ability to do it casually.

Takeaways

Internal tools are not “internal.” They are the interface to production.

If an action can degrade the system, the UI must make that risk obvious.

Clear labels are not polish. They are operational safety.

If a dangerous action looks like a refresh, someone will eventually click it. That’s not human error. That’s design debt.

A good internal tool makes state and scope visible.

If an action competes with customer traffic, say so. If it touches more than one entity, show the number. If it’s going to take a while, show progress. If it’s reversible, put the reversal next to it.

The goal is not to prevent mistakes forever.

The goal is to make the safe path obvious under stress.

When internal tools fail, they fail the same way public products fail: unclear labels, hidden state, and missing feedback.

Treating internal tools as “real UI” is not precious. It’s how you keep production operable.

When a tool is part of the recovery path, it should be designed like a safety system: clear labels, visible scope, and an exit.

If the UI hides state, the incident room has to reconstruct it in chat.
