ARCHITECTURE | 2020-09-14 | BY ELI NAVARRO

Decision record: Keeping one primary auth store

We decided to keep a single primary source of truth for authentication and treat other stores as caches, even when duplication looks convenient.

Tags: architecture, authentication, data-integrity, reliability

Context

As we added more services that needed to check whether a user could perform an action, we kept re-encountering the same temptation: "it would be easier if this system had its own copy of auth state."

Examples:

  • a reporting service that wanted a local table of "active" users
  • a billing integration that wanted to know which accounts were suspended
  • an internal tool that wanted to flag high-risk accounts separately

All of these are reasonable needs. But the implementation path often drifted toward "just copy the bits we need":

  • duplicate tables with user status
  • ad-hoc sync jobs
  • services that assumed their local cache was truth for authorization decisions

Short-term, this reduced perceived latency and complexity.

Medium-term, it created hard-to-debug failure modes:

  • a user’s status would update in one system but not another
  • an account would be locked in the auth service but still able to trigger operations elsewhere
  • incident responders had to look in multiple places to understand "is this user allowed to do this?"

We needed an explicit decision about where truth lived.

Decision

We decided to keep a single primary auth store and treat all other copies as derived caches.

Concretely:

  • There is exactly one service that owns authentication and core authorization state.
  • All writes that change that state go through this service.
  • Other systems may cache subsets of that state, but:
    • they must not accept writes that contradict the primary
    • they must have a clear invalidation strategy
    • they must fail closed when the cache is stale or unavailable
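
The three cache rules above can be sketched in a few lines. This is a minimal illustration, not our production client: the class name `FailClosedAuthCache`, the `fetch_from_primary` callable, and the 30-second default are all assumptions made for the example. The cache never accepts writes of its own, it exposes an explicit `invalidate`, and it denies whenever it holds no fresh entry and cannot reach the primary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CachedStatus:
    allowed: bool
    fetched_at: float  # clock reading taken when the primary was last read


class FailClosedAuthCache:
    """Hypothetical read-only cache in front of the primary auth service."""

    def __init__(
        self,
        fetch_from_primary: Callable[[str], bool],  # assumed client call; raises on failure
        max_age_seconds: float = 30.0,              # illustrative TTL, not a recommendation
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_primary
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, CachedStatus] = {}

    def invalidate(self, user_id: str) -> None:
        """Drop the cached entry so the next check goes back to the primary."""
        self._entries.pop(user_id, None)

    def is_allowed(self, user_id: str) -> bool:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry.fetched_at <= self._max_age:
            return entry.allowed  # fresh entry: serve from cache
        try:
            allowed = self._fetch(user_id)
        except Exception:
            # No fresh entry and the primary is unreachable: fail closed.
            self._entries.pop(user_id, None)
            return False
        self._entries[user_id] = CachedStatus(allowed, now)
        return allowed
```

Note the trade-off this sketch makes: a stale "allowed" entry is never served past `max_age_seconds`, so an outage of the primary turns into denials rather than into decisions based on old state.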

When a new system wants a "local" view of auth state, the default options are:

  1. Call the auth service directly on the critical path, with caching at the client or gateway level.
  2. Subscribe to a stream of changes from the auth service (events or CDC) and maintain a read-only projection.
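
As a rough sketch of option 2, a read-only projection might look like the following. The event shape (`AuthChangeEvent` with a per-user `version`) and the status strings are assumptions for illustration; the real stream format depends on how the auth service publishes changes. The projection changes only by applying events from the primary, ignores duplicates and out-of-order deliveries, and treats unknown users as not active.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AuthChangeEvent:
    """Hypothetical change event emitted by the auth service."""
    user_id: str
    status: str   # e.g. "active", "suspended", "locked" (illustrative values)
    version: int  # assumed to increase with every change to this user in the primary


class AuthProjection:
    """Read-only view of auth state; its only input is the primary's event stream."""

    def __init__(self) -> None:
        self._rows: Dict[str, AuthChangeEvent] = {}

    def apply(self, event: AuthChangeEvent) -> bool:
        """Apply one event. Returns False if it was a duplicate or arrived out of order."""
        current = self._rows.get(event.user_id)
        if current is not None and event.version <= current.version:
            return False
        self._rows[event.user_id] = event
        return True

    def status_of(self, user_id: str) -> Optional[str]:
        row = self._rows.get(user_id)
        return row.status if row is not None else None

    def is_active(self, user_id: str) -> bool:
        # Unknown users are treated as not active, so a gap in the stream denies.
        return self.status_of(user_id) == "active"
```

There is deliberately no method for the downstream system to set a status itself; any correction has to be made in the primary and arrive as an event.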

Explicitly rejected options:

  • independent writable tables of auth state in downstream systems
  • background jobs that "fix up" divergence between multiple sources of truth

Consequences

Upsides

  • Clear debugging path during incidents. When there is a question about whether a user should be able to perform an action, there is one place to look for truth.
  • Safer lockouts and suspensions. When an account is locked in the primary store, downstream systems and caches are expected to align quickly or fail closed.
  • Simpler mental model for engineers. New services don’t have to decide "which table is real"; they depend on the primary and document their cache behavior.

Downsides / costs

  • Latency. Some operations pay an extra network hop instead of reading a local table. We mitigate this with:
    • short-lived caches at the client or gateway
    • read models that are explicitly read-only
  • Operational coupling. The auth service becomes more critical. We need:
    • clear SLOs for its availability and latency
    • strong on-call coverage
    • careful rollout plans for schema and behavior changes
  • Migration work. Existing systems that quietly wrote their own auth state must be migrated. In some cases this means building reconciliation tooling to find where their state has diverged from the primary before they are cut over.
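
For the migration work, a reconciliation tool can start as a read-only comparison. The sketch below assumes both sides can be exported as a simple mapping from user ID to status string; the function and type names are hypothetical. It only reports differences and never writes to either side, which keeps it distinct from the rejected "fix up" jobs.

```python
from typing import Dict, List, NamedTuple, Optional


class Divergence(NamedTuple):
    user_id: str
    primary_status: Optional[str]  # None means the primary has no such user
    downstream_status: str


def find_divergence(
    primary: Dict[str, str], downstream: Dict[str, str]
) -> List[Divergence]:
    """Report every downstream row that disagrees with the primary.

    Only users present downstream are checked, because a downstream
    system may legitimately hold a subset of the primary's users.
    """
    report: List[Divergence] = []
    for user_id in sorted(downstream):
        primary_status = primary.get(user_id)
        if primary_status != downstream[user_id]:
            report.append(Divergence(user_id, primary_status, downstream[user_id]))
    return report
```

Each reported row is then resolved by a person or a migration script in favor of the primary, after which the downstream table can be replaced by a cache or projection.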

Guardrails we added

  • New designs that touch auth state must answer:
    • Where does the primary truth live?
    • Is this system maintaining a cache or a projection, and how is it invalidated?
    • What happens when the cache is stale or the auth service is unavailable?
  • We added checks in code review templates for features that interact with auth or permissions.
  • We added a small "auth state" section to incident runbooks:
    • where to check current status
    • how long caches lag under normal conditions
    • how to force a refresh when necessary

This decision makes some paths less "convenient" in the short term, but it keeps us from trading reliability for one-off speed. We would rather pay for a single well-run primary store than debug a network of inconsistent shadows.
