ARCHITECTURE | 2020-04-18 | BY JONAS "JO" CARLIN

Note: feature flags under stress

A short list of things we wish we had treated as production-critical in our flag system before traffic spiked.

Tags: architecture, feature-flags, rollout, reliability

We leaned on feature flags heavily during a period of rapid change.

That was the right call. Flags bought us reversibility under uncertainty.

But we also learned which parts of the flag system behave like critical infrastructure when everything is under stress.

None of this is new theory. It’s a memo to our future selves: treat these as production dependencies, not as "just config".

When traffic surged, our flag evaluation path did more work than we expected:

Under normal load this was fine.

Under stress, the flag store became an invisible shared dependency. A slowdown there looked like a slowdown everywhere.

We changed two things:

treated the flag SDK as a library, not a side channel; it must be budgeted like any other dependency
ensured evaluations are local (cached) for the hot paths
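
To make the second change concrete, here is a minimal Python sketch of local evaluation. It is an illustration under assumptions, not our actual SDK: `LocalFlagCache`, `fetch_all`, and the flag names are hypothetical. The idea it shows is that hot-path reads only touch an in-memory snapshot, while the network call to the flag store happens off the request path and a failed refresh keeps the last good snapshot.

```python
import threading
import time
from typing import Callable, Dict, Optional


class LocalFlagCache:
    """Serves flag reads from an in-memory snapshot.

    Hypothetical sketch: `fetch_all` stands in for whatever call returns
    every flag from the flag store. It is never invoked by `is_enabled`.
    """

    def __init__(self, fetch_all: Callable[[], Dict[str, bool]],
                 defaults: Dict[str, bool]) -> None:
        self._fetch_all = fetch_all
        self._defaults = dict(defaults)
        self._snapshot: Dict[str, bool] = {}
        self._lock = threading.Lock()
        self.last_refresh_ok: Optional[float] = None  # monotonic timestamp

    def refresh(self) -> bool:
        """Pull a fresh snapshot; on any failure keep the previous one."""
        try:
            fresh = dict(self._fetch_all())
        except Exception:
            return False
        with self._lock:
            self._snapshot = fresh
            self.last_refresh_ok = time.monotonic()
        return True

    def is_enabled(self, name: str) -> bool:
        """Hot-path read: a dictionary lookup, no network I/O."""
        with self._lock:
            if name in self._snapshot:
                return self._snapshot[name]
        # Unknown to the snapshot: use the declared default, else off.
        return self._defaults.get(name, False)

    def start(self, interval_s: float) -> threading.Thread:
        """Refresh in a daemon thread so requests never wait on the store."""
        def loop() -> None:
            while True:
                self.refresh()
                time.sleep(interval_s)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

With this shape, a slow flag store makes the snapshot stale rather than making every request slow. `last_refresh_ok` gives you something to alert on when staleness grows.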

We had cases where the default value for a flag was "on" because that was convenient during rollout.

When the flag backend was slow or unreachable, evaluation fell back to that default, so those features silently stayed enabled.

We now write defaults the way we write fallbacks:
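
One way to express that idea in code, as a minimal Python sketch. The `Flag` type, its `safe_default` and `reason` fields, and the `evaluate` helper are hypothetical names for illustration, not a real SDK's API. The point is that the default is declared next to the flag, with a stated reason, and is what you get whenever the lookup fails or has no answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Flag:
    name: str
    safe_default: bool  # behavior we want when the flag backend is down
    reason: str         # why that behavior is the safe one


def evaluate(flag: Flag, lookup: Callable[[str], Optional[bool]]) -> bool:
    """Return the backend's value, or the declared fallback.

    `lookup` is a stand-in for the real flag client: it returns None when
    it has no value for the flag and may raise when the backend is slow
    or unreachable.
    """
    try:
        value = lookup(flag.name)
    except Exception:
        return flag.safe_default
    return flag.safe_default if value is None else value


# Hypothetical flag: the new path stays off unless explicitly enabled.
NEW_RANKING = Flag(
    name="new_ranking",
    safe_default=False,
    reason="old ranking path is known-good under load",
)
```

Requiring a `reason` is a small design choice: it forces whoever adds the flag to decide what should happen when the control plane is down, instead of inheriting whatever was convenient during rollout.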

Old flags were cheap to keep, until we needed to reason about behavior under pressure.

During an incident, it’s hard to remember which of ten launch-related flags actually does something.

We now:

This isn’t about tidiness. It’s about having one or two meaningful toggles during an incident, not a wall of switches.
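
As a sketch of how cleanup can be made checkable rather than optional, the Python below gives every flag an owner and an expiry date and lists the ones that are overdue. Everything here is an assumption for illustration: `FlagRecord`, `expired_flags`, and the example flag and team names are hypothetical, and where the check runs (CI, a scheduled job) is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class FlagRecord:
    name: str
    owner: str        # who removes the flag once the rollout is finished
    expires_on: date  # last day the flag is expected to exist


def expired_flags(records: Iterable[FlagRecord], today: date) -> List[str]:
    """Names of flags whose expiry date is strictly before `today`, sorted."""
    return sorted(r.name for r in records if r.expires_on < today)


# Hypothetical registry entries.
REGISTRY = [
    FlagRecord("launch_banner", owner="web", expires_on=date(2020, 3, 1)),
    FlagRecord("new_checkout", owner="payments", expires_on=date(2020, 6, 1)),
]

overdue = expired_flags(REGISTRY, today=date(2020, 4, 18))
if overdue:
    print("Flags past expiry:", ", ".join(overdue))
```

Failing a build or opening a ticket on a non-empty `overdue` list is one way to keep the set of live toggles small enough to reason about during an incident.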

Takeaways

Your flag system is part of the architecture, not an afterthought. Budget its latency and failure modes.
Defaults are not neutral; they are the behavior when the control plane is down.
Expired flags make incidents harder. Treat cleanup as part of "done," not as optional polish.