ARCHITECTURE | 2024-07-04 | BY ELI NAVARRO

Decision record: Centralizing feature flag evaluation

We decided to move flag evaluation into a shared service instead of letting every client evaluate flags on its own.

Tags: architecture, feature-flags, configuration, reliability

Context

Our feature flag usage grew organically.

At first, flags lived in a single service. Evaluation happened close to where the behavior changed.

Over time, more systems added flags:

  • web and mobile clients
  • backend services
  • batch jobs

Each of these systems had its own way to fetch and evaluate flags.

The result:

  • inconsistent behavior across platforms
  • unclear ownership for which flags existed and what they did
  • complex incident debugging when a flag flip affected only some clients

We faced two concrete problems during incidents:

  1. Flags behaved differently in different environments because local evaluation code drifted.
  2. It was hard to answer the simple question "What is the current state of this flag everywhere?"

Decision

We decided to centralize feature flag evaluation into a shared service and treat that service as part of the core architecture.

Concretely:

  • flag definitions, targeting rules, and evaluation logic live in one place
  • clients (web, mobile, backend) call the flag service or use thin SDKs that delegate evaluation to it
  • the service exposes clear APIs and auditing for flag state and changes

Clients may cache results or precompute values, but they no longer implement their own evaluation logic beyond simple, documented cases.
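A minimal sketch of what a thin client could look like under this decision. Everything here is illustrative: `ThinFlagClient`, the `Transport` callable, and the flag key `new-checkout` are hypothetical names, and the transport stands in for whatever HTTP or RPC call the real service exposes. The point is that the client holds no targeting rules, only a fixed default per flag.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Hypothetical transport: takes (flag_key, context) and returns the value the
# central service evaluated, or raises if the service cannot be reached.
Transport = Callable[[str, Mapping[str, str]], bool]


@dataclass
class ThinFlagClient:
    transport: Transport
    defaults: Mapping[str, bool]  # documented fallback value per flag

    def is_enabled(self, flag_key: str, context: Mapping[str, str]) -> bool:
        # Refuse flags without a declared default, even while the service is up,
        # so a missing fallback is caught before an outage rather than during one.
        if flag_key not in self.defaults:
            raise KeyError(f"no default declared for flag {flag_key!r}")
        try:
            # All targeting logic runs in the central service, not here.
            return self.transport(flag_key, context)
        except Exception:
            # Service unavailable: return the fixed default, never a locally
            # re-implemented rule.
            return self.defaults[flag_key]


def unreachable(flag_key: str, context: Mapping[str, str]) -> bool:
    """Stand-in transport that simulates an outage of the flag service."""
    raise ConnectionError("flag service unreachable")


client = ThinFlagClient(transport=unreachable, defaults={"new-checkout": False})
fallback = client.is_enabled("new-checkout", {"platform": "ios"})
```

In this sketch `fallback` is the declared default, because the simulated transport always fails. A real SDK would likely narrow the caught exception types and record a metric on each fallback.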

Consequences

Upsides

  • Consistency. The same rules apply across environments.
  • Visibility. We can answer:
    • which flags exist
    • who owns them
    • what their current values are in different segments
  • Operational control. During incidents, we can:
    • see who changed what, when
    • flip flags globally or per-segment through a single interface

Downsides / costs

  • New dependency. The flag service becomes another critical system. We need:
    • SLOs for its availability and latency
    • clear failover and degradation behavior
  • Migration work. Existing clients must:
    • remove local evaluation code
    • adopt the shared APIs or SDKs
  • Performance considerations. We have to:
    • ensure low-latency responses
    • design caching strategies that don’t reintroduce inconsistency
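One way to bound the inconsistency that caching introduces is a short time-to-live on cached results, so no client can disagree with the service for longer than the TTL. The sketch below assumes a hypothetical `fetch` callable that returns the centrally evaluated value for a flag; `TtlFlagCache` and its parameters are illustrative names, not part of any existing SDK. For brevity it keys the cache on the flag key alone; a real cache would also key on the evaluation context (user, segment, platform).

```python
import time
from typing import Callable, Dict, Tuple


class TtlFlagCache:
    """Caches centrally evaluated flag values for at most ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def get(self, flag_key: str) -> bool:
        now = self._clock()
        entry = self._entries.get(flag_key)
        if entry is not None:
            value, stored_at = entry
            if now - stored_at < self._ttl:
                return value  # still fresh: serve from cache
        # Missing or expired: ask the central service again.
        value = self._fetch(flag_key)
        self._entries[flag_key] = (value, now)
        return value

    def invalidate(self, flag_key: str) -> None:
        # Hook for push-based invalidation, e.g. after an emergency flip.
        self._entries.pop(flag_key, None)
```

The TTL is the upper bound on how long a flip takes to reach every client, so it is worth choosing with incident response in mind rather than latency alone.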

Guardrails

We set a few rules:

  • Clients must define what happens when the flag service is unavailable (default behaviors).
  • New flags must include metadata (owner, lifetime, emergency action) and be created through the shared system.
  • The flag service must be observable: metrics for latency, error rates, and evaluation volume per service.

The decision does not require every toggle in the system to go through this service.

It focuses on flags that affect user-visible behavior or incident response.
