STEWARDSHIP · 2021-04-19 · BY STORECODE

Decision record: One shared error taxonomy

We chose a shared way to categorize errors across services so dashboards, alerts, and user-facing messages line up.

stewardship · errors · observability · design

Context

Each service labeled errors in its own way.

Some used HTTP status codes as the primary signal.

Some logged free-form error messages.

Some had their own internal categories like TEMPORARY_FAILURE or UPSTREAM_BAD_DATA.

This made sense locally, but it broke down when we wanted to:

  • build cross-service dashboards
  • reason about which errors we should page on
  • align user-facing messages with operational realities

The same underlying problem could show up as:

  • a 500 in one service
  • a logged "time-out" string in another
  • a RETRYABLE flag buried in a third

We needed a shared language for errors.

Decision

We adopted a shared error taxonomy with a small set of categories and required metadata.

Every error we care about operationally should be classifiable as one of a few top-level kinds, such as:

  • user_error
  • system_error
  • dependency_error
  • rate_limited
  • unauthorized

Each error event should also carry:

  • a stable error code (not just free text)
  • the origin service
  • whether it is retryable from the system’s perspective
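
As a rough sketch, an error event carrying this metadata might look like the following in Python. The class name, field names, and the example code PAYMENTS_UPSTREAM_TIMEOUT are illustrative assumptions, not the definitions any service actually ships.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class ErrorCategory(str, Enum):
    """Top-level kinds listed in the decision above."""
    USER_ERROR = "user_error"
    SYSTEM_ERROR = "system_error"
    DEPENDENCY_ERROR = "dependency_error"
    RATE_LIMITED = "rate_limited"
    UNAUTHORIZED = "unauthorized"


@dataclass(frozen=True)
class ErrorEvent:
    # All field names here are hypothetical.
    category: ErrorCategory
    code: str             # stable identifier; does not change when wording changes
    origin_service: str   # service that raised the error
    retryable: bool       # retryable from the system's perspective
    message: str = ""     # free text for humans; never used for grouping

    def to_log_record(self) -> str:
        """Serialize to a JSON line suitable for structured logging."""
        record = asdict(self)
        record["category"] = self.category.value
        return json.dumps(record, sort_keys=True)


event = ErrorEvent(
    category=ErrorCategory.DEPENDENCY_ERROR,
    code="PAYMENTS_UPSTREAM_TIMEOUT",  # hypothetical code
    origin_service="checkout",         # hypothetical service name
    retryable=True,
    message="payment provider did not respond in time",
)
print(event.to_log_record())
```

Keeping the free-text message separate from the code lets the wording change without breaking dashboards or alerts that group by code.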

We did not try to model every nuance on day one.

We chose a small, useful set of categories and codes and left room to evolve.

Consequences

Upsides

  • Clearer dashboards. We can:
    • break down errors by category across services
    • focus on system_error and dependency_error for reliability work
  • Better user messaging. We can map:
    • user_error to specific, actionable messages
    • system_error to honest but generic messages that don’t blame users
  • Simpler alerting. We can:
    • alert on spikes in system_error or dependency_error
    • treat user_error spikes as a separate signal (e.g., of product bugs or UX confusion)
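
The three upsides above can be sketched together in a few lines of Python. This is a minimal illustration under assumptions: events are plain dicts with "category" and "code" keys, the message copy is invented, and the spike rule (a multiple of a baseline count, with a minimum volume) is a placeholder rather than our actual alert policy.

```python
from collections import Counter

# Categories we assume are worth alerting on for reliability work.
ALERT_CATEGORIES = {"system_error", "dependency_error"}

# Hypothetical user-facing copy: specific messages keyed by stable code,
# with a generic fallback per category.
USER_MESSAGES_BY_CODE = {
    "INVALID_POSTCODE": "That postcode doesn't look right. Please check it and try again.",
}
GENERIC_MESSAGES = {
    "user_error": "Something in the request needs fixing. Please review it and try again.",
    "system_error": "Something went wrong on our side. Please try again shortly.",
    "dependency_error": "Something went wrong on our side. Please try again shortly.",
}
DEFAULT_MESSAGE = "Something went wrong. Please try again."


def count_by_category(events):
    """Break down error events by category, regardless of origin service."""
    return Counter(e["category"] for e in events)


def user_message(event):
    """Prefer a specific message for the code, else a generic one for the category."""
    by_code = USER_MESSAGES_BY_CODE.get(event["code"])
    if by_code is not None:
        return by_code
    return GENERIC_MESSAGES.get(event["category"], DEFAULT_MESSAGE)


def spiking_categories(current, baseline, factor=3.0, min_count=10):
    """Return alert-worthy categories whose count exceeds factor x baseline."""
    spiking = []
    for category in sorted(ALERT_CATEGORIES):
        now = current.get(category, 0)
        before = baseline.get(category, 0)
        if now >= min_count and now > factor * before:
            spiking.append(category)
    return spiking


events = [
    {"category": "user_error", "code": "INVALID_POSTCODE", "origin_service": "checkout"},
    {"category": "system_error", "code": "DB_WRITE_FAILED", "origin_service": "orders"},
]
print(count_by_category(events))
print(user_message(events[1]))
```

Note that user_error never appears in the result of spiking_categories, so a surge in user mistakes is not treated as a reliability alert.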

Downsides / costs

  • Migration work. Teams need to:
    • map existing errors to the shared taxonomy
    • update logging and metrics
  • Discipline required. New error types must:
    • pick appropriate categories
    • choose stable codes that don’t change with wording
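
To give a feel for the migration work, here is a hedged sketch of mapping legacy signals onto the shared taxonomy. The specific choices (for example, treating 502 and 504 as dependency_error, or TEMPORARY_FAILURE as a retryable system_error) are assumptions for illustration; each team would decide its own mapping.

```python
# Hypothetical mapping from legacy internal categories to (category, retryable).
LEGACY_CATEGORY_MAP = {
    "TEMPORARY_FAILURE": ("system_error", True),
    "UPSTREAM_BAD_DATA": ("dependency_error", False),
}


def classify_http_status(status):
    """Map an HTTP status code to (category, retryable). Choices are assumptions."""
    if status == 429:
        return ("rate_limited", True)
    if status in (401, 403):
        return ("unauthorized", False)
    if status in (502, 504):
        # Gateway errors are assumed to point at an upstream dependency.
        return ("dependency_error", True)
    if 400 <= status < 500:
        return ("user_error", False)
    if 500 <= status < 600:
        return ("system_error", False)
    raise ValueError(f"not an error status: {status}")


def classify_legacy(signal):
    """Classify a legacy signal; return None when it needs manual mapping."""
    if isinstance(signal, int):
        return classify_http_status(signal)
    return LEGACY_CATEGORY_MAP.get(signal)


print(classify_legacy(500))
print(classify_legacy("TEMPORARY_FAILURE"))
print(classify_legacy("time-out"))  # free text: no automatic mapping
```

Free-form strings deliberately fall through to None: they are exactly the cases where a person has to pick a category and assign a stable code.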

Guardrails

We added basic checks:

  • lint rules for error definitions in some languages
  • lightweight review for new error codes and categories
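
A lint check of this kind could be as small as the sketch below. It assumes error definitions are available as dicts and that stable codes follow an UPPER_SNAKE_CASE convention; both the input shape and the naming rule are assumptions, not a description of our actual lint rules.

```python
import re

ALLOWED_CATEGORIES = {
    "user_error",
    "system_error",
    "dependency_error",
    "rate_limited",
    "unauthorized",
}

# Assumed convention: codes are UPPER_SNAKE_CASE, so they cannot be free text.
CODE_PATTERN = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")


def lint_error_definitions(definitions):
    """Return a list of human-readable problems; an empty list means clean."""
    problems = []
    seen_codes = set()
    for d in definitions:
        code = d.get("code", "")
        if not CODE_PATTERN.fullmatch(code):
            problems.append(f"{code!r}: code must be UPPER_SNAKE_CASE")
        if code in seen_codes:
            problems.append(f"{code!r}: duplicate code")
        seen_codes.add(code)
        if d.get("category") not in ALLOWED_CATEGORIES:
            problems.append(f"{code!r}: unknown category {d.get('category')!r}")
        if not isinstance(d.get("retryable"), bool):
            problems.append(f"{code!r}: retryable must be true or false")
        if not d.get("origin_service"):
            problems.append(f"{code!r}: origin_service is required")
    return problems


definitions = [
    {"code": "DB_WRITE_FAILED", "category": "system_error",
     "retryable": False, "origin_service": "orders"},
    {"code": "Payment timed out!", "category": "TEMPORARY_FAILURE",
     "retryable": "maybe", "origin_service": ""},
]
for problem in lint_error_definitions(definitions):
    print(problem)
```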

We also clarified that the taxonomy is not a venue for settling internal disagreements about naming; its purpose is to make cross-service operations and support easier.
