STEWARDSHIP | 2021-04-19 | BY STORECODE

Decision record: One shared error taxonomy

We chose a shared way to categorize errors across services so dashboards, alerts, and user-facing messages line up.

stewardship, errors, observability, design

Context

Each service labeled errors in its own way.

Some used HTTP status codes as the primary signal.

Some logged free-form error messages.

Some had their own internal categories like TEMPORARY_FAILURE or UPSTREAM_BAD_DATA.

This made sense locally, but it broke down when we wanted dashboards, alerts, and user-facing messages to line up across services.

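The same underlying problem could show up as an HTTP status code in one service, a free-form log message in another, and an internal category such as TEMPORARY_FAILURE in a third.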

We needed a shared language for errors.

We adopted a shared error taxonomy with a small set of categories and required metadata.

Every error we care about operationally should be classifiable as one of a few top-level kinds, such as user_error, system_error, and dependency_error.

Each error event should also carry the required metadata, including a stable code.

We did not try to model every nuance on day one.

We chose a small, useful set of categories and codes and left room to evolve.
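
As a rough illustration, an error event under this taxonomy might look like the Python sketch below. Only the three category names come from this record. The `ErrorEvent` type, its `service` and `message` fields, the dotted code format, and the log field names are assumptions made for the example, not the actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(str, Enum):
    # These three names appear in this record; the full set may be larger.
    USER_ERROR = "user_error"
    SYSTEM_ERROR = "system_error"
    DEPENDENCY_ERROR = "dependency_error"


@dataclass(frozen=True)
class ErrorEvent:
    category: ErrorCategory
    code: str          # stable identifier; the dotted format is hypothetical
    service: str       # hypothetical metadata field
    message: str = ""  # human wording, free to change without breaking dashboards

    def to_log_fields(self) -> dict:
        # Flatten to plain strings so any logging or metrics backend can ingest it.
        return {
            "error.category": self.category.value,
            "error.code": self.code,
            "service": self.service,
            "message": self.message,
        }


event = ErrorEvent(
    category=ErrorCategory.DEPENDENCY_ERROR,
    code="inventory.upstream_timeout",
    service="checkout",
    message="Inventory service did not respond in time",
)
fields = event.to_log_fields()
```

Keeping the category a closed enum and the code a separate string is one way to keep the top-level set small while leaving room for codes to grow.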

Clearer dashboards. We can:
- break down errors by category across services
- focus on system_error and dependency_error for reliability work

Better user messaging. We can map:
- user_error to specific, actionable messages
- system_error to honest but generic messages that don’t blame users

Simpler alerting. We can:
- alert on spikes in system_error or dependency_error
- treat user_error spikes differently (e.g., product bugs or UX confusion)
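
The messaging and alerting consequences above could be wired up roughly as follows. This is a minimal sketch: the function names, the message wording, and the route names (`page_oncall`, `product_review`, `unrouted`) are all hypothetical and do not describe real tooling.

```python
from typing import Optional

# Categories that feed reliability alerts, per the list above.
RELIABILITY_CATEGORIES = {"system_error", "dependency_error"}

# Honest but generic wording that does not blame the user (hypothetical text).
GENERIC_MESSAGE = "Something went wrong on our side. Please try again shortly."
# Fallback for user errors that arrive without a specific hint (hypothetical text).
USER_FALLBACK_MESSAGE = "Please check your input and try again."


def user_facing_message(category: str, actionable_hint: Optional[str] = None) -> str:
    """Pick the message shown to the user for an error of the given category."""
    if category == "user_error":
        return actionable_hint or USER_FALLBACK_MESSAGE
    return GENERIC_MESSAGE


def alert_route(category: str) -> str:
    """Decide where a spike in this category should be sent."""
    if category in RELIABILITY_CATEGORIES:
        return "page_oncall"
    if category == "user_error":
        # Spikes here suggest a product bug or UX confusion, not an outage.
        return "product_review"
    return "unrouted"
```

For example, `alert_route("dependency_error")` routes to on-call, while a user_error spike goes to product review instead of paging anyone.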

Migration work. Teams need to:
- map existing errors to the shared taxonomy
- update logging and metrics

Discipline required. New error types must:
- pick appropriate categories
- choose stable codes that don’t change with wording
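
For the migration step, one possible shape is a per-service lookup table from legacy labels to a shared category and stable code. The legacy labels below are the ones mentioned in the Context section. The categories and codes they map to are assumptions chosen for illustration, not the mapping any team actually adopted.

```python
from typing import Dict, Tuple

# legacy label -> (shared category, stable code); the right-hand values are hypothetical.
LEGACY_TO_SHARED: Dict[str, Tuple[str, str]] = {
    "TEMPORARY_FAILURE": ("system_error", "service.temporary_failure"),
    "UPSTREAM_BAD_DATA": ("dependency_error", "upstream.bad_data"),
}


def to_shared(label: str) -> Tuple[str, str]:
    """Translate a legacy label, failing loudly when no mapping exists yet."""
    try:
        return LEGACY_TO_SHARED[label]
    except KeyError:
        raise KeyError(f"no shared-taxonomy mapping for legacy label {label!r}") from None


category, code = to_shared("UPSTREAM_BAD_DATA")
```

Raising on unmapped labels, instead of defaulting to a catch-all category, makes gaps in the migration visible. The code stays fixed even if the human-readable wording attached to it changes later.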

We added basic checks:

We also clarified that the taxonomy is not for internal disagreements about naming; it is for making cross-service operations and support easier.
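
This record does not spell out the checks, so the following only sketches what a basic check could look like. It assumes error events arrive as dictionaries with `category` and `code` keys and that stable codes are lowercase dotted identifiers. Both are assumptions for the example.

```python
import re
from typing import List

# Category names mentioned in this record; the real allowed set may differ.
ALLOWED_CATEGORIES = {"user_error", "system_error", "dependency_error"}

# Hypothetical rule: lowercase identifiers joined by dots, e.g. "db.connection_lost".
CODE_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$")


def check_error_event(fields: dict) -> List[str]:
    """Return a list of problems; an empty list means the event passes."""
    problems = []
    category = fields.get("category")
    if category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    code = fields.get("code")
    if not isinstance(code, str) or not CODE_PATTERN.match(code):
        problems.append(f"code is missing or not a stable identifier: {code!r}")
    return problems
```

A check like this could run in tests or CI, so that a code made of message wording (for instance `"Card was declined!"`) is rejected before it reaches dashboards.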