DELIVERY · 2023-03-21 · BY JONAS "JO" CARLIN

Feature flags that explain themselves

How we changed our flag system and habits so flags carry enough context to be safe to flip during incidents.

delivery, feature-flags, configuration, incidents

In early incidents, changing a feature flag felt like pulling a random lever.

Flags had short names, no descriptions, and ambiguous defaults. The person who added a flag remembered what it did, until they didn’t. Everyone else guessed.

During a page, this looked like:

"Try turning off new_path_v2?"
"What does it actually do?"
"I think it only affects the new checkout, probably."

We got away with this as long as the people who created the flags were on the call.

That stopped working once rotations widened and time zones spread out.

We decided that flags needed to explain themselves well enough that an on-call engineer who had never seen the code could make a safe decision.

Constraints

We already had a flag system in place; swapping it out wasn’t on the table.
Product and engineering both created flags; we couldn’t rely on one group to "fix" the other.
We didn’t want long, freeform documentation that would rot.

What we changed

We made small structural changes and backed them up with habits.

1. Required metadata for new flags

We added a few required fields to flag definitions:

Owner: team or person accountable for the flag.
Intended lifetime: temporary experiment, rollout safety net, or long-term config.
Default behavior: what "off" and "on" mean in plain language.
Safe emergency action: what the on-call engineer should do if the flag is suspected to be involved in an incident.

The definition lives next to the code that reads the flag, not in a separate wiki.
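
A minimal sketch of what such a definition could look like if it were expressed in Python next to the code that reads the flag. The class names, field names, the `checkout-team` owner, and the behavior descriptions are illustrative assumptions; only the four kinds of metadata and the flag name `checkout_disable_optional_addons` come from this post.

```python
from dataclasses import dataclass
from enum import Enum


class Lifetime(Enum):
    """The three intended lifetimes listed above."""
    TEMPORARY_EXPERIMENT = "temporary experiment"
    ROLLOUT_SAFETY_NET = "rollout safety net"
    LONG_TERM_CONFIG = "long-term config"


@dataclass(frozen=True)
class FlagDefinition:
    """Hypothetical shape for a flag definition; not the real system's schema."""
    name: str
    owner: str             # team or person accountable for the flag
    lifetime: Lifetime
    off_means: str         # plain-language behavior when the flag is off
    on_means: str          # plain-language behavior when the flag is on
    emergency_action: str  # what on-call should do if the flag is suspect

    def __post_init__(self) -> None:
        # Reject blank text so "required" really means required.
        for field_name in ("name", "owner", "off_means", "on_means", "emergency_action"):
            if not getattr(self, field_name).strip():
                raise ValueError(
                    f"flag {self.name!r} is missing required metadata: {field_name}"
                )


# Example definition. The owner and descriptions are made up for illustration.
CHECKOUT_DISABLE_OPTIONAL_ADDONS = FlagDefinition(
    name="checkout_disable_optional_addons",
    owner="checkout-team",
    lifetime=Lifetime.ROLLOUT_SAFETY_NET,
    off_means="Optional add-ons are offered during checkout.",
    on_means="Optional add-ons are hidden; the rest of checkout is unchanged.",
    emergency_action="Turn on if add-ons are suspected of breaking checkout.",
)
```

Validating at construction time means a definition with a blank owner or emergency action fails as soon as the module is imported, rather than being discovered during a page.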

2. Better names and scopes

We tightened naming and scoping:

names include the domain and effect (e.g., checkout_disable_optional_addons) instead of opaque labels like v2 or new_flow
flags are scoped as narrowly as possible (service-level, endpoint-level) and are global only when they need to be

This made it easier to understand what flipping a flag would touch.
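
A naming convention like this can be checked mechanically. The sketch below is one possible lint, not the check we actually run: the list of known domains and the list of vague tokens are assumptions chosen for the example.

```python
import re

# Assumption: each team would maintain its own list of domain prefixes.
KNOWN_DOMAINS = {"checkout", "search", "billing"}

# Assumption: tokens that say nothing about what a flag does.
VAGUE_TOKENS = {"v2", "new", "test", "tmp", "flag"}

SNAKE_CASE = re.compile(r"[a-z0-9]+(_[a-z0-9]+)+")


def name_problems(name: str) -> list[str]:
    """Return human-readable problems with a flag name; empty means it passes."""
    if not SNAKE_CASE.fullmatch(name):
        return ["use lowercase snake_case with at least two words"]

    problems = []
    tokens = name.split("_")
    if tokens[0] not in KNOWN_DOMAINS:
        problems.append(f"name should start with a known domain, got {tokens[0]!r}")
    vague = sorted(set(tokens) & VAGUE_TOKENS)
    if vague:
        problems.append(f"name contains vague tokens: {', '.join(vague)}")
    return problems
```

With these lists, `name_problems("checkout_disable_optional_addons")` returns an empty list, while `name_problems("new_path_v2")` reports both an unknown domain and the vague tokens `new` and `v2`.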

3. Flag dashboards tied to incidents

We added a simple flag dashboard per service that shows:

current state of each flag
recent changes
links to the definitions and owners

Incident runbooks now link to these dashboards.

During a page, the on-call engineer can quickly answer:

what changed recently?
is there a flag whose emergency action is "turn it off"?
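
Both questions reduce to simple filters over the dashboard's data. The sketch below assumes each flag carries a machine-readable emergency action alongside its plain-language text, and that the dashboard knows when each flag last changed. The row shape, the 24-hour window, and the flag names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class EmergencyAction(Enum):
    TURN_OFF = "turn it off"
    LEAVE_ALONE = "leave it alone"
    PAGE_OWNER = "page the owner first"


@dataclass(frozen=True)
class FlagRow:
    """One dashboard row (hypothetical shape)."""
    name: str
    enabled: bool
    emergency_action: EmergencyAction
    last_changed: Optional[datetime]  # None if the flag has never been changed


def changed_recently(rows, now, window=timedelta(hours=24)):
    """Answer "what changed recently?", newest change first."""
    recent = [
        r for r in rows
        if r.last_changed is not None and now - r.last_changed <= window
    ]
    return sorted(recent, key=lambda r: r.last_changed, reverse=True)


def turn_off_candidates(rows):
    """Flags that are currently on and whose emergency action is "turn it off"."""
    return [
        r for r in rows
        if r.enabled and r.emergency_action is EmergencyAction.TURN_OFF
    ]


# Made-up rows for a single service.
NOW = datetime(2023, 3, 21, 12, 0, tzinfo=timezone.utc)
ROWS = [
    FlagRow("checkout_enable_gift_wrap", True, EmergencyAction.TURN_OFF,
            NOW - timedelta(hours=2)),
    FlagRow("checkout_enable_saved_cards", True, EmergencyAction.PAGE_OWNER,
            NOW - timedelta(minutes=30)),
    FlagRow("checkout_enable_address_autocomplete", False, EmergencyAction.TURN_OFF,
            NOW - timedelta(days=3)),
    FlagRow("checkout_enable_retry_queue", True, EmergencyAction.LEAVE_ALONE, None),
]
```

For these rows, `changed_recently(ROWS, NOW)` lists the saved-cards flag and then the gift-wrap flag, and `turn_off_candidates(ROWS)` returns only the gift-wrap flag, since the address-autocomplete flag is already off.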

4. A small review habit

We added one question to code review when flags are involved:

"Is this flag safe to flip during an incident based on its metadata and usage?"

If the answer is "no" or "I’m not sure," we:

add missing metadata
adjust the code so flipping the flag doesn’t have surprising side effects

This review is intentionally small; it’s a guardrail, not a ceremony.

5. Cleaning up old flags

We already had a note on treating flags as operational debt.

Here we formalized a rule:

when a temporary flag has been in a stable state for a full release cycle, we either remove it or document why it’s staying

Flags that remain become normal configuration, with owners and clear behavior.
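
The rule is easy to turn into a periodic check. This is a sketch under stated assumptions: the 14-day release cycle is a placeholder (this post does not give the cycle length), and the record fields and flag names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: placeholder value; substitute the team's real release cycle.
RELEASE_CYCLE = timedelta(days=14)


@dataclass(frozen=True)
class FlagStatus:
    """Minimal record needed to apply the cleanup rule (hypothetical shape)."""
    name: str
    temporary: bool                    # True for experiments and rollout safety nets
    last_state_change: date            # last time the flag's state was changed
    keep_reason: Optional[str] = None  # documented reason for keeping the flag


def needs_cleanup_decision(flag: FlagStatus, today: date,
                           cycle: timedelta = RELEASE_CYCLE) -> bool:
    """True when a temporary flag has been stable for a full cycle with no documented reason to stay."""
    if not flag.temporary or flag.keep_reason:
        return False
    return today - flag.last_state_change >= cycle


def cleanup_report(flags, today: date) -> list[str]:
    """Names of flags that should be removed or have their reason for staying documented."""
    return [f.name for f in flags if needs_cleanup_decision(f, today)]
```

A flag with a `keep_reason` drops out of the report, which matches the "remove it or document why it's staying" choice in the rule.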

Results / Measurements

After a few months of applying these changes, we saw:

Faster decision-making during incidents. In incident reviews, less time went to debating what a flag did; metadata and dashboards answered the basics.
Fewer "mystery" flags. The number of flags with no owner or description dropped as we added metadata or removed stale flags.
Better defaults. We caught several cases where the default behavior was risky under failure (e.g., defaulting to "on" when the control plane was down) and fixed them.

Not every flag is perfect. Some older flags still carry historical baggage.

The difference is that new flags start from a safer baseline, and we have a process for retiring or fixing old ones.

Takeaways

Flags are part of the incident interface; they should be self-explanatory.
A little structured metadata (owner, lifetime, defaults, emergency action) makes flags safer to flip.
Service-level flag dashboards turn scattered toggles into a usable tool.
Small review questions and regular cleanup keep the system from decaying back into "v2" and "test_flag".