ARCHITECTURE | 2024-07-04 | BY ELI NAVARRO

Decision record: Centralizing feature flag evaluation

We decided to move flag evaluation into a shared service instead of letting every client evaluate flags on its own.

architecture, feature-flags, configuration, reliability

Context

Our feature flag usage grew organically.

At first, flags lived in a single service. Evaluation happened close to where the behavior changed.

Over time, more systems added flags, and each environment had its own way to fetch and evaluate them.

As a result, we faced two concrete problems during incidents:

- Flags behaved differently in different environments because local evaluation code drifted.
- It was hard to answer the simple question "What is the current state of this flag everywhere?"

Decision

We decided to centralize feature flag evaluation into a shared service and treat that service as part of the core architecture.

Concretely:

- flag definitions, targeting rules, and evaluation logic live in one place
- clients (web, mobile, backend) call the flag service or use thin SDKs that delegate evaluation to it
- the service exposes clear APIs and auditing for flag state and changes
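
To make "one place" concrete, here is a minimal sketch of what a central evaluator could look like. The names `Rule`, `Flag`, and `FlagService`, and the rule shape (match one context attribute against a set of allowed values), are assumptions made for illustration; they are not the actual service's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Hypothetical targeting rule: match one context attribute against allowed values."""
    attribute: str
    allowed: frozenset
    value: bool


@dataclass
class Flag:
    """Hypothetical flag definition: a default plus an ordered list of rules."""
    key: str
    default: bool
    rules: list = field(default_factory=list)


class FlagService:
    """Holds definitions, rules, and evaluation logic together (illustrative only)."""

    def __init__(self):
        self._flags = {}

    def define(self, flag: Flag) -> None:
        self._flags[flag.key] = flag

    def evaluate(self, key: str, context: dict) -> bool:
        # Raises KeyError for unknown flags; a real service would pick a policy here.
        flag = self._flags[key]
        # The first matching rule wins; otherwise fall back to the flag's default.
        for rule in flag.rules:
            if context.get(rule.attribute) in rule.allowed:
                return rule.value
        return flag.default


service = FlagService()
service.define(Flag(
    key="new-checkout",
    default=False,
    rules=[Rule(attribute="plan", allowed=frozenset({"beta"}), value=True)],
))
```

Because every client asks the same `evaluate` function, a web client and a backend job passing the same context get the same answer by construction.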

Clients may cache results or precompute values, but they no longer implement their own evaluation logic beyond simple, documented cases.
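
A thin client that caches results without evaluating anything itself might look like the sketch below. `ThinFlagClient`, its `fetch` callable (standing in for the network call to the flag service), and the 30-second TTL are all hypothetical choices, not part of the decision.

```python
import time
from typing import Callable


class ThinFlagClient:
    """Caches answers from the flag service; contains no targeting logic of its own."""

    def __init__(
        self,
        fetch: Callable[[str, dict], bool],  # assumed stand-in for the remote call
        ttl_seconds: float = 30.0,           # assumed staleness bound
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # (flag key, sorted context items) -> (value, expires_at)

    def is_enabled(self, key: str, context: dict) -> bool:
        # Context values must be hashable for this simple cache key to work.
        cache_key = (key, tuple(sorted(context.items())))
        now = self._clock()
        hit = self._cache.get(cache_key)
        if hit is not None and now < hit[1]:
            return hit[0]
        # Cache miss or expired entry: delegate the decision to the service.
        value = self._fetch(key, context)
        self._cache[cache_key] = (value, now + self._ttl)
        return value
```

The TTL bounds how long a client can disagree with the service after a flag is flipped, which is the trade-off to tune against latency.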

Consistency. The same rules apply across environments.
Visibility. We can answer:
- which flags exist
- who owns them
- what their current values are in different segments
Operational control. During incidents, we can:
- see who changed what, when
- flip flags globally or per-segment through a single interface
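
As a rough illustration of the visibility and control points above, the sketch below keeps owners, per-segment values, and an audit trail in one registry. `FlagRegistry`, `AuditEntry`, and the `"*"` marker for the global segment are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

GLOBAL_SEGMENT = "*"  # assumed marker for "applies everywhere"


@dataclass(frozen=True)
class AuditEntry:
    """Who changed what, when."""
    flag: str
    segment: str
    old_value: Optional[bool]
    new_value: bool
    changed_by: str
    changed_at: datetime


class FlagRegistry:
    def __init__(self):
        self._owners = {}
        self._values = {}  # flag -> {segment: value}
        self.audit_log = []

    def register(self, flag: str, owner: str, default: bool) -> None:
        self._owners[flag] = owner
        self._values[flag] = {GLOBAL_SEGMENT: default}

    def set_value(self, flag: str, value: bool, changed_by: str,
                  segment: str = GLOBAL_SEGMENT) -> None:
        segments = self._values[flag]
        old = segments.get(segment)
        segments[segment] = value
        # Every change is recorded, so incident review can reconstruct the timeline.
        self.audit_log.append(AuditEntry(
            flag, segment, old, value, changed_by, datetime.now(timezone.utc)))

    def current_value(self, flag: str, segment: str) -> bool:
        # A segment-specific override wins; otherwise the global value applies.
        segments = self._values[flag]
        return segments.get(segment, segments[GLOBAL_SEGMENT])

    def state(self, flag: str) -> dict:
        """Answers 'what is the current state of this flag everywhere?'"""
        return {"owner": self._owners[flag], "values": dict(self._values[flag])}
```

Calling `set_value("new-checkout", False, changed_by="oncall", segment="eu")` would flip the flag for one segment and leave an audit entry behind, while omitting `segment` flips the global value.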

New dependency. The flag service becomes another critical system. We need:
- SLOs for its availability and latency
- clear failover and degradation behavior
Migration work. Existing clients must:
- remove local evaluation code
- adopt the shared APIs or SDKs
Performance considerations. We have to:
- ensure low-latency responses
- design caching strategies that don’t reintroduce inconsistency

We set a few rules:

- Clients must define what happens when the flag service is unavailable (default behaviors).
- New flags must include metadata (owner, lifetime, emergency action) and be created through the shared system.
- The flag service must be observable: metrics for latency, error rates, and evaluation volume per service.
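
These rules can be sketched in a few lines. The example below is illustrative only: `FlagMetadata`, `FlagServiceUnavailable`, `evaluate_with_default`, and the metric names are assumptions, lifetime is modeled as an expiry date, and latency timing is left out to keep the sketch short.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class FlagMetadata:
    """Required metadata for a new flag (field names are hypothetical)."""
    key: str
    owner: str
    expires_on: date               # assumed way to express the flag's lifetime
    emergency_action: str          # for example "disable"
    default_when_unavailable: bool

    def __post_init__(self):
        # Reject flags that are missing the metadata the rules require.
        if not self.owner.strip():
            raise ValueError(f"flag {self.key!r} needs an owner")
        if not self.emergency_action.strip():
            raise ValueError(f"flag {self.key!r} needs an emergency action")


class FlagServiceUnavailable(Exception):
    """Raised by the (assumed) transport when the flag service cannot be reached."""


metrics = Counter()  # stand-in for a real metrics client


def evaluate_with_default(
    meta: FlagMetadata,
    fetch: Callable[[str, dict], bool],
    context: dict,
    service: str,
) -> bool:
    # Count evaluation volume per calling service.
    metrics[f"evaluations.{service}"] += 1
    try:
        return fetch(meta.key, context)
    except FlagServiceUnavailable:
        # Count errors, then apply the documented default behavior.
        metrics[f"errors.{service}"] += 1
        return meta.default_when_unavailable
```

The point of the design is that the fallback value is declared with the flag, so behavior during an outage is decided in advance and not improvised per client.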

The decision does not require every toggle in the system to go through this service.

It focuses on flags that affect user-visible behavior or incident response.