Note: feature flags under stress
A short list of things we wish we had treated as production-critical in our flag system before traffic spiked.
We leaned on feature flags heavily during a period of rapid change.
That was the right call. Flags bought us reversibility under uncertainty.
But we also learned which parts of the flag system behave like critical infrastructure when everything is under stress.
None of this is new theory. It’s a memo to our future selves: treat these as production dependencies, not as "just config".
1. The flag store is on the critical path
When traffic surged, our flag evaluation path did more work than we expected:
- some services fetched flags synchronously on startup and on a timer
- cache misses fell back to a shared store on the request path
- a few "rare" flags were actually hit on nearly every checkout
Under normal load this was fine.
Under stress, the flag store became an invisible shared dependency. A slowdown there looked like a slowdown everywhere.
We changed two things:
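- treated the flag SDK as a dependency, not a side channel; its latency and failure modes are budgeted like those of any other dependency
- ensured evaluations on the hot paths are local: they read from an in-process cache and never call the shared store on the request path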
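A minimal sketch of what local evaluation on a hot path can look like, assuming a store client that can return every flag in one call. The names LocalFlagCache and fetch_all are hypothetical and not taken from any particular SDK. Reads come from an in-memory snapshot; only the background refresh talks to the shared store.

```python
import threading
import time
from typing import Callable, Dict


class LocalFlagCache:
    """Serves flag reads from memory; only refresh() touches the store."""

    def __init__(self, fetch_all: Callable[[], Dict[str, bool]],
                 defaults: Dict[str, bool]) -> None:
        self._fetch_all = fetch_all      # hypothetical call to the shared flag store
        self._defaults = dict(defaults)  # used until the first successful refresh
        self._snapshot: Dict[str, bool] = {}
        self._lock = threading.Lock()
        self.last_refresh_ok = False

    def refresh(self) -> None:
        """Pull a full snapshot; on failure keep serving the last good one."""
        try:
            fresh = dict(self._fetch_all())
        except Exception:
            self.last_refresh_ok = False
            return
        with self._lock:
            self._snapshot = fresh
        self.last_refresh_ok = True

    def is_enabled(self, name: str) -> bool:
        """Hot-path read: a dictionary lookup, never a network call."""
        with self._lock:
            if name in self._snapshot:
                return self._snapshot[name]
        return self._defaults.get(name, False)

    def start_background_refresh(self, interval_s: float) -> threading.Thread:
        """Refresh on a timer in a daemon thread, off the request path."""
        def loop() -> None:
            while True:
                self.refresh()
                time.sleep(interval_s)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

With this shape, a slow or unreachable store costs freshness rather than request latency: the cache keeps answering from the last good snapshot, and last_refresh_ok can be exported as a health signal.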
2. Default values are a reliability decision
We had cases where the default value for a flag was "on" because that was convenient during rollout.
When the flag backend was slow or unreachable, those features silently stayed enabled.
We now write defaults the way we write fallbacks:
- if the flag system fails, we prefer the more conservative behavior
- we document the safe default next to the feature spec
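One way to make that rule mechanical, sketched under assumptions: every flag must declare a safe default up front, and any lookup that fails or exceeds its time budget returns that default. The flag names, the evaluate helper, and the 50 ms budget are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical flags. Each value is the conservative behavior, which is
# not always "off": a protective check may be safest left on.
SAFE_DEFAULTS: Dict[str, bool] = {
    "new_checkout_flow": False,
    "fraud_checks": True,
}

_executor = ThreadPoolExecutor(max_workers=4)


def evaluate(name: str, fetch: Callable[[str], bool],
             timeout_s: float = 0.05) -> bool:
    """Return the flag value, or its safe default if the lookup fails or is slow."""
    # A KeyError here means the flag was never given a declared safe default.
    safe = SAFE_DEFAULTS[name]
    future = _executor.submit(fetch, name)
    try:
        return bool(future.result(timeout=timeout_s))
    except Exception:
        # Covers both backend errors and the timeout. The worker thread may
        # still finish the slow call in the background; its result is ignored.
        return safe
```

Refusing to evaluate a flag that has no entry in SAFE_DEFAULTS is a deliberate choice in this sketch: it forces the default to be decided at the time the flag is created rather than discovered during an outage.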
3. Flags accumulate operational debt
Old flags were cheap to keep, until we needed to reason about behavior under pressure.
During an incident, it’s hard to remember which of ten launch-related flags actually does something.
We now:
- give each flag an owner and an expected removal date
- remove or hard-code flags in the same sprint we complete the rollout
This isn’t about tidiness. It’s about having one or two meaningful toggles during an incident, not a wall of switches.
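Ownership and removal dates are easier to enforce when they live in data rather than in memory. A small sketch, assuming a registry kept alongside the code; FlagRecord, the team names, and the dates are hypothetical examples.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class FlagRecord:
    name: str
    owner: str
    remove_by: date  # expected removal date agreed at creation time


# Hypothetical registry entries, for illustration only.
EXAMPLE_REGISTRY: List[FlagRecord] = [
    FlagRecord("new_checkout_flow", "payments-team", date(2024, 3, 1)),
    FlagRecord("launch_banner", "web-team", date(2024, 1, 15)),
    FlagRecord("search_v2", "search-team", date(2024, 9, 30)),
]


def overdue_flags(registry: List[FlagRecord], today: date) -> List[FlagRecord]:
    """Return flags whose removal date has passed, most overdue first."""
    late = [flag for flag in registry if flag.remove_by < today]
    return sorted(late, key=lambda flag: flag.remove_by)
```

A check like this could run in CI or in a weekly report, so overdue flags reach their owners before an incident does.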
Takeaways
- Your flag system is part of the architecture, not an afterthought. Budget its latency and failure modes.
- Defaults are not neutral; they are the behavior when the control plane is down.
- Stale flags make incidents harder. Treat cleanup as part of "done," not as optional polish.