Error messages are operational tooling
A good error message reduces support load: it tells a stressed human what happened, what to do next, and what to share.
The moment an error message appears, your product stops being a flow and becomes an incident.
Not an all-hands incident. A tiny one. A single person is stuck, annoyed, and trying again.
If the message is vague, the next step becomes guesswork. Guesswork turns into screenshots. Screenshots turn into tickets. Tickets turn into interruptions.
Constraints
Error messages are usually written under constraints that push them toward uselessness:
- They arrive late. The feature is “done” and the error is a placeholder.
- They try to do everything at once: apologize, explain, reassure, and reveal nothing sensitive.
- They aren’t accessible. Focus never reaches the message, so screen readers skip it; color is the only signal; the message disappears before it can be read.
- They’re not connected to operations. Support has no reference ID; engineers can’t correlate a report with logs.
A safe error message is also a security boundary. We can’t dump internals into the UI. But “safe” doesn’t have to mean “opaque.”
What we changed
We started treating errors as part of the operational surface area.
Concretely:
- We standardized on a small pattern: plain-language message + short code + reference ID.
- We made the next action explicit (“try again,” “check your connection,” “contact support with this code”).
- We fixed focus and contrast so the message is usable with a keyboard and a screen reader.
- We added a runbook mapping for the handful of codes that matter: what to check first, and what a safe first action is.
If the runbook is the UI for the on-call engineer, the error message is the UI for the user.
A “good enough” template
We use a boring structure:
- What happened: plain language (“We couldn’t save your changes.”)
- What you can do: retry, check your connection, or contact support
- Reference: short code + reference ID
Example:
“Couldn’t update billing address. Try again. If it keeps happening, contact support with code BILLING-UPDATE and reference 7H2K9.”
It’s not elegant. It’s operable.
Reference IDs change support behavior
Without an ID, tickets start with screenshots.
With an ID, tickets start with a lookup.
We make the reference visible and copyable, and we keep it stable across retries (so support isn’t chasing a moving target).
On the backend, the same ID appears in logs and traces so engineers can jump to the right slice of time without reenacting the user’s steps.
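One way to get both properties, sketched with Python's standard `logging` module. It assumes the client sends the reference back when the user retries. The handler name `handle_save`, the logger name, and the log format are hypothetical:

```python
import io
import logging
import secrets

# An in-memory stream stands in for the real log sink in this sketch.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s ref=%(reference_id)s %(message)s")
)
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_save(operation, reference_id=None):
    """Run `operation`, reusing the caller's reference ID when one is given.

    Returns (ok, reference_id) so the UI can show the same reference again.
    """
    # Mint an ID only on the first attempt; a retry passes the old one back.
    reference_id = reference_id or secrets.token_hex(4).upper()
    # LoggerAdapter attaches the ID to every record logged through it.
    log = logging.LoggerAdapter(logger, {"reference_id": reference_id})
    try:
        operation()
    except Exception as exc:
        log.warning("save failed: %s", exc)
        return False, reference_id
    log.info("save succeeded")
    return True, reference_id


def always_times_out():
    raise TimeoutError("vendor timeout")


ok, ref = handle_save(always_times_out)
# The user clicks "try again": the client sends the same reference back.
ok_again, ref_again = handle_save(always_times_out, reference_id=ref)
```

Both log lines carry the same `ref=` value, so a single search finds every attempt behind one report.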
We also keep the number of user-facing codes small. If every error has a unique code, the code becomes noise. We keep a small set that maps to real operational categories (auth, payments, vendor timeouts, validation).
Accessibility details that matter
We treat the message like a UI component:
- focus moves to the message when it appears
- the message is announced to screen readers, not signaled only by color or placement
- the message stays on screen long enough to read and copy
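A rough server-side sketch of the markup those three points imply, assuming the error is rendered as an HTML fragment. The function name and the element `id` are made up for illustration, and the focus move itself needs a line of client-side script that is not shown here:

```python
from html import escape


def render_error_html(message: str, code: str, reference_id: str) -> str:
    """Render an error fragment that assistive technology can find and read."""
    return (
        # role="alert" asks screen readers to announce the content when it is
        # inserted; tabindex="-1" lets a script move focus onto the element.
        '<div id="form-error" role="alert" tabindex="-1">'
        # A text label means color is not the only signal.
        f"<strong>Error:</strong> {escape(message)} "
        f"Code <code>{escape(code)}</code>, "
        f"reference <code>{escape(reference_id)}</code>"
        "</div>"
    )


fragment = render_error_html(
    "Couldn't update billing address. Try again.", "BILLING-UPDATE", "7H2K9"
)
print(fragment)
```

The fragment carries no timer or auto-dismiss, so the message stays until the user acts. Escaping the inputs keeps the message from turning into an injection point.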
“Safe” should not mean “opaque.” It should mean “doesn’t leak,” while still giving a human the next step.
Making errors operable for support
We keep a tiny internal mapping from code → runbook section.
Support doesn’t need a stack trace. Support needs a next step:
- retry vs wait vs collect info
- what to tell the user
- when to escalate
When we introduce a new code, we ship the runbook mapping with it. Otherwise the code is just a new noun that people will have to learn under stress.
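A sketch of how that rule could be enforced, assuming codes live in an enum and the mapping is a plain dictionary. The code names follow the categories mentioned earlier. The runbook text, the section names, and the function `missing_runbook_entries` are placeholders:

```python
from enum import Enum


class ErrorCode(Enum):
    AUTH = "AUTH"
    PAYMENTS = "PAYMENTS"
    VENDOR_TIMEOUT = "VENDOR-TIMEOUT"
    VALIDATION = "VALIDATION"


# code -> what support needs: first action, what to say, when to escalate.
# All entries below are illustrative placeholders.
RUNBOOK = {
    ErrorCode.AUTH: {
        "section": "runbook#auth",
        "first_action": "collect info",
        "tell_user": "Sign out, sign back in, and send us the reference.",
        "escalate_when": "Several users report it within the same hour.",
    },
    ErrorCode.PAYMENTS: {
        "section": "runbook#payments",
        "first_action": "collect info",
        "tell_user": "Please don't retry the payment yet; send us the reference.",
        "escalate_when": "The user may have been charged.",
    },
    ErrorCode.VENDOR_TIMEOUT: {
        "section": "runbook#vendor-timeouts",
        "first_action": "wait",
        "tell_user": "A partner service is slow. Try again in a few minutes.",
        "escalate_when": "Retries still fail after the vendor reports healthy.",
    },
    ErrorCode.VALIDATION: {
        "section": "runbook#validation",
        "first_action": "retry",
        "tell_user": "Check the highlighted fields and submit again.",
        "escalate_when": "Input that looks valid is rejected.",
    },
}


def missing_runbook_entries():
    """Return every code that was added without a runbook mapping."""
    return [code for code in ErrorCode if code not in RUNBOOK]


# Run in CI so that a new code cannot ship without its mapping.
assert missing_runbook_entries() == []
```

The final assertion is the useful part: adding a member to `ErrorCode` without a `RUNBOOK` entry fails the build instead of surprising support.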
We also teach a simple support script: “please copy the reference and tell us what you were trying to do.”
The point is not to blame the user. It’s to turn a vague report into an actionable one.
We also test error copy like we test other UI: trigger it in staging, run it with a keyboard, and make sure the reference is actually copyable.
If an error message is hard to copy, it will become a screenshot.
Results / Measurements
The best outcome is less back-and-forth.
We watched for:
- fewer support tickets that start with only a screenshot
- faster triage because the report includes a reference ID
- fewer “can you try again while I watch the logs?” loops
We also watched for fewer escalations that were really “I can’t translate this error.”
When support can attach a reference and a code, engineering can answer “is this known?” quickly, and the user gets a clearer next step.
A cheap internal metric: what percentage of tickets include a reference ID on first contact.
When that number is low, it usually means the UI isn’t making the reference easy to find.
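The metric is simple enough to compute from a ticket export. A sketch, assuming each ticket is a dictionary whose `reference_id` field holds whatever the user supplied in their first message. The field name and the sample data are invented:

```python
def reference_rate(tickets):
    """Percentage of tickets that include a reference ID on first contact."""
    if not tickets:
        return 0.0
    # Count a ticket only when the field is present and non-empty.
    with_reference = sum(1 for t in tickets if t.get("reference_id"))
    return 100.0 * with_reference / len(tickets)


sample = [
    {"id": 1, "reference_id": "7H2K9"},
    {"id": 2, "reference_id": None},  # screenshot only
    {"id": 3, "reference_id": "Q4M8T"},
    {"id": 4, "reference_id": "B6XR3"},
]
rate = reference_rate(sample)
print(f"{rate:.0f}% of tickets included a reference on first contact")
```

Tracking the rate per screen or per error code, rather than globally, points more directly at the UI where the reference is hard to find.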
We also watched time-to-triage.
When the ticket includes a reference, the first step becomes lookup instead of reenactment.
That doesn’t just help engineering. It helps users, because the fastest support response is the one that doesn’t require a second email.
We also watched whether support could resolve the common cases without escalation.
If the message gives a clear next step and a reference, support can often close the loop in one reply.
That’s the whole point: fewer loops, fewer screenshots, and fewer escalations.
It’s not glamorous work, but it compounds. The system becomes easier to operate because people can describe what’s happening.
Takeaways
Write error messages like you write runbooks: short, specific, and designed for tired humans.
If you don’t include a reference ID, you’re choosing slower debugging later.
“Safe” should mean “doesn’t leak,” not “doesn’t explain.”