Decision record: One shared error taxonomy
We chose a shared way to categorize errors across services so dashboards, alerts, and user-facing messages line up.
Context
Each service labeled errors in its own way.
Some used HTTP status codes as the primary signal.
Some logged free-form error messages.
Some had their own internal categories like TEMPORARY_FAILURE or UPSTREAM_BAD_DATA.
This made sense locally, but it broke down when we wanted to:
- build cross-service dashboards
- reason about which errors we should page on
- align user-facing messages with operational realities
The same underlying problem could show up as:
- a 500 in one service
- a logged "time-out" string in another
- a RETRYABLE flag buried in a third
We needed a shared language for errors.
Decision
We adopted a shared error taxonomy with a small set of categories and required metadata.
Every error we care about operationally should be classifiable as one of a few top-level kinds, such as:
- user_error
- system_error
- dependency_error
- rate_limited
- unauthorized
Each error event should also carry:
- a stable error code (not just free text)
- the origin service
- whether it is retryable from the system’s perspective
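As an illustration of the shape described above, here is a minimal sketch of an error event carrying the category and required metadata. The class names, field names, log keys, and the example code PAYMENTS_UPSTREAM_TIMEOUT are assumptions made for this sketch, not definitions from the taxonomy itself.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(Enum):
    """Top-level kinds named in this record."""
    USER_ERROR = "user_error"
    SYSTEM_ERROR = "system_error"
    DEPENDENCY_ERROR = "dependency_error"
    RATE_LIMITED = "rate_limited"
    UNAUTHORIZED = "unauthorized"


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical event shape; field names are illustrative."""
    category: ErrorCategory
    code: str            # stable identifier, independent of message wording
    origin_service: str  # service where the error originated
    retryable: bool      # retryable from the system's perspective
    message: str = ""    # free text for humans, not used for grouping

    def to_log_fields(self) -> dict:
        """Flatten the event into structured log fields (key names assumed)."""
        return {
            "error.category": self.category.value,
            "error.code": self.code,
            "error.origin_service": self.origin_service,
            "error.retryable": self.retryable,
            "error.message": self.message,
        }


event = ErrorEvent(
    category=ErrorCategory.DEPENDENCY_ERROR,
    code="PAYMENTS_UPSTREAM_TIMEOUT",  # hypothetical code
    origin_service="payments",         # hypothetical service name
    retryable=True,
    message="timed out calling card processor",
)
print(event.to_log_fields())
```

Keeping the free-text message separate from the code means wording can change without breaking dashboards or alerts that group by code.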
We did not try to model every nuance on day one.
We chose a small, useful set of categories and codes and left room to evolve.
Consequences
Upsides
- Clearer dashboards. We can:
- break down errors by category across services
- focus on system_error and dependency_error for reliability work
- Better user messaging. We can map:
- user_error to specific, actionable messages
- system_error to honest but generic messages that don’t blame users
- Simpler alerting. We can:
- alert on spikes in system_error or dependency_error
- treat user_error spikes differently (e.g., product bugs or UX confusion)
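The messaging and alerting upsides can be sketched as two small lookups keyed on category. Everything specific here is an assumption for illustration: the message texts, the INVALID_EMAIL and DB_WRITE_FAILED codes, the generic fallback for categories this record does not discuss, and the spike rule (a window count compared against a multiple of a baseline).

```python
from enum import Enum


class ErrorCategory(Enum):
    USER_ERROR = "user_error"
    SYSTEM_ERROR = "system_error"
    DEPENDENCY_ERROR = "dependency_error"
    RATE_LIMITED = "rate_limited"
    UNAUTHORIZED = "unauthorized"


# Categories whose spikes are treated as reliability problems.
PAGE_ON = {ErrorCategory.SYSTEM_ERROR, ErrorCategory.DEPENDENCY_ERROR}

# Hypothetical per-code messages for user errors: specific and actionable.
USER_ERROR_MESSAGES = {
    "INVALID_EMAIL": "Enter a valid email address.",
}

# Honest but generic wording that does not blame the user.
GENERIC_MESSAGE = "Something went wrong on our side. Please try again later."


def user_message(category: ErrorCategory, code: str) -> str:
    """Pick user-facing text from the category and stable code."""
    if category is ErrorCategory.USER_ERROR:
        return USER_ERROR_MESSAGES.get(code, "Please check your input and try again.")
    # Assumption: all other categories fall back to the generic message.
    return GENERIC_MESSAGE


def should_page(category: ErrorCategory, errors_in_window: int,
                baseline: float, spike_factor: float = 3.0) -> bool:
    """Page only on spikes in categories that indicate reliability problems."""
    if category not in PAGE_ON:
        return False  # e.g. user_error spikes go to product review instead
    return errors_in_window > baseline * spike_factor


print(user_message(ErrorCategory.USER_ERROR, "INVALID_EMAIL"))
print(should_page(ErrorCategory.SYSTEM_ERROR, errors_in_window=40, baseline=10.0))
```

Because both lookups depend only on the shared category and code, they can live in one place rather than being reimplemented per service.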
Downsides / costs
- Migration work. Teams need to:
- map existing errors to the shared taxonomy
- update logging and metrics
- Discipline required. For each new error type, teams must:
- pick appropriate categories
- choose stable codes that don’t change with wording
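To give a sense of the migration work, here is a rough sketch of mapping the legacy signals mentioned in the Context section (HTTP status codes, free-form messages, and internal categories such as TEMPORARY_FAILURE) onto the shared categories. The specific mappings and retryability values are assumptions chosen for illustration; each team would decide its own.

```python
from typing import Optional, Tuple

# Hypothetical mapping from a service's internal categories to
# (shared category, retryable). Real mappings are a per-team decision.
LEGACY_CATEGORY_MAP = {
    "TEMPORARY_FAILURE": ("system_error", True),
    "UPSTREAM_BAD_DATA": ("dependency_error", False),
}


def classify_legacy(http_status: Optional[int] = None,
                    message: str = "",
                    legacy_category: Optional[str] = None) -> Tuple[str, bool]:
    """Return (shared category, retryable) for a legacy error signal."""
    # An explicit internal category is the most reliable signal, so check it first.
    if legacy_category in LEGACY_CATEGORY_MAP:
        return LEGACY_CATEGORY_MAP[legacy_category]

    # Free-form strings are matched loosely; this is the fragile part
    # that stable codes are meant to replace.
    text = message.lower()
    if "time-out" in text or "timeout" in text:
        return ("dependency_error", True)  # assumption: timeouts come from upstream

    if http_status in (401, 403):
        return ("unauthorized", False)
    if http_status == 429:
        return ("rate_limited", True)
    if http_status is not None and 400 <= http_status < 500:
        return ("user_error", False)

    # Anything else, including 5xx, defaults to a non-retryable system error.
    return ("system_error", False)


print(classify_legacy(http_status=500))
print(classify_legacy(message="upstream time-out after 30s"))
print(classify_legacy(legacy_category="TEMPORARY_FAILURE"))
```

A shim like this is only a bridge: it lets dashboards converge early while services move to emitting the shared category and code directly.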
Guardrails
We added basic checks:
- lint rules for error definitions in some languages
- lightweight review for new error codes and categories
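A lint rule of this kind might look roughly like the following sketch. The dictionary shape for an error definition and the UPPER_SNAKE_CASE convention for codes are assumptions made for the example; the actual lint rules are not specified in this record.

```python
import re

# Categories from the shared taxonomy.
ALLOWED_CATEGORIES = (
    "user_error",
    "system_error",
    "dependency_error",
    "rate_limited",
    "unauthorized",
)

# Assumed convention: codes are UPPER_SNAKE_CASE so they cannot drift
# with message wording.
CODE_PATTERN = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")


def lint_error_definition(defn: dict) -> list:
    """Return a list of human-readable problems; empty means the definition passes."""
    problems = []

    code = defn.get("code", "")
    if not isinstance(code, str) or not CODE_PATTERN.fullmatch(code):
        problems.append("code must be a stable UPPER_SNAKE_CASE identifier")

    if defn.get("category") not in ALLOWED_CATEGORIES:
        problems.append("category must be one of the shared taxonomy categories")

    if not defn.get("origin_service"):
        problems.append("origin_service is required")

    if not isinstance(defn.get("retryable"), bool):
        problems.append("retryable must be an explicit boolean")

    return problems


print(lint_error_definition({
    "code": "Payment timed out!",     # free text instead of a stable code
    "category": "TEMPORARY_FAILURE",  # legacy category, not in the taxonomy
    "origin_service": "payments",
    "retryable": True,
}))
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a definition in a single pass.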
We also clarified that the taxonomy is not a venue for settling internal naming disagreements; its purpose is to make cross-service operations and support easier.