RELIABILITY · 2024-02-07 · BY ELI NAVARRO

Standardizing safe-mode configs across services

We defined what "safe mode" means for each service so we can degrade predictably instead of improvising under pressure.

reliability · degradation · configuration · incidents

During incidents, we reached for the same knobs every time:

  • turn off non-critical features
  • lower limits on expensive paths
  • relax some validations and tighten others

We did this via:

  • ad-hoc config changes
  • hurried feature-flag edits
  • "temporary" overrides left in place longer than intended

Every service had its own idea of "safe mode."

None of them were written down.

We decided to standardize safe-mode configs across services so we could degrade predictably instead of improvising.

Constraints

  • Services varied widely in behavior and risk.
  • We didn’t want a single global switch that hid important differences.
  • We wanted something simple enough to use during a page.

What we changed

1. Define safe-mode behavior per service

For each service, we documented:

  • what "normal" looks like
  • what "safe mode" should do

Safe mode generally means:

  • reducing optional work
  • prioritizing core flows
  • avoiding expensive operations that have cheap fallbacks

Examples:

  • serve cached or less detailed data instead of live, expensive queries
  • disable non-essential background jobs
  • cap request rates for expensive endpoints while keeping critical ones open
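
To make the first example concrete, here is a minimal sketch of a read path that prefers cached data while safe mode is on. All names here (SAFE_MODE, fetch_live_results, get_search_results) are illustrative, not our production code:

    # Minimal sketch: a read path that degrades to cached data in safe mode.
    SAFE_MODE = False            # in practice, read from versioned config
    _cache: dict[str, list] = {}

    def fetch_live_results(query: str) -> list:
        # Stands in for an expensive live query (database, search cluster, ...).
        results = [f"live result for {query!r}"]
        _cache[query] = results  # keep the cache warm for degraded operation
        return results

    def get_search_results(query: str) -> list:
        if SAFE_MODE:
            # Degraded but predictable: possibly stale, always cheap.
            return _cache.get(query, [])
        return fetch_live_results(query)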

2. Encode safe mode in configuration

We added explicit, versioned configuration for safe mode, such as:

  • safe_mode_enabled: on/off
  • safe_mode_profile: which preset to use (e.g., "degrade-search", "degrade-recommendations")

These map to:

  • specific feature flags
  • rate limits
  • queue priorities
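
As an illustration of that mapping, a profile could expand into concrete settings like this. The profile names come from above, but every flag, limit, and priority value below is made up, not one of our real presets:

    # Sketch: expanding (safe_mode_enabled, safe_mode_profile) into settings.
    from dataclasses import dataclass, field

    @dataclass
    class SafeModeProfile:
        feature_flags: dict[str, bool] = field(default_factory=dict)
        rate_limits: dict[str, int] = field(default_factory=dict)      # req/s per endpoint
        queue_priorities: dict[str, int] = field(default_factory=dict) # higher = deferred

    PROFILES = {
        "degrade-search": SafeModeProfile(
            feature_flags={"search.live_queries": False},
            rate_limits={"/search": 50},
            queue_priorities={"search-reindex": 9},
        ),
        "degrade-recommendations": SafeModeProfile(
            feature_flags={"recs.personalization": False},
            rate_limits={"/recommendations": 20},
            queue_priorities={"recs-model-refresh": 9},
        ),
    }

    def resolve_safe_mode(enabled: bool, profile: str) -> SafeModeProfile | None:
        # Disabled safe mode means no overrides: run the normal config.
        return PROFILES[profile] if enabled else None

Keeping this expansion explicit, rather than in people's heads, is what makes the review and dashboard steps below possible.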

We treated safe-mode config like any other operational configuration:

  • checked into version control
  • reviewed
  • visible in dashboards

3. Make safe mode discoverable in runbooks

Runbooks now include a "Safe mode" section that covers:

  • when to consider enabling safe mode
  • how to enable it (commands, UIs, config changes)
  • what user impact to expect

This makes "enter safe mode" an explicit step rather than an improvised series of toggles.
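
As a sketch, a runbook entry might read like this (the thresholds, profile name, and enabling mechanism are hypothetical and vary by service):

    Safe mode
      When:   p99 latency > 2s for 10 min, or the search cluster is degraded
      How:    set safe_mode_enabled=true, safe_mode_profile=degrade-search
              via the usual config-change workflow
      Impact: search results may be stale; personalization disabled;
              core browse and checkout unaffected
      Exit:   revert the config change once dependency latency recovers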

4. Add metrics for safe-mode usage

We instrumented:

  • when safe mode is enabled or disabled
  • which profile is active
  • key SLOs before/during/after safe-mode intervals

This allows us to:

  • see how often we rely on safe mode
  • evaluate whether the degraded behavior is actually safer
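
A minimal sketch of the toggle instrumentation, assuming a generic metrics client; the StatsClient below is a stand-in for whatever metrics library a service already uses:

    # Sketch: instrumenting safe-mode toggles.
    import time

    class StatsClient:
        def gauge(self, name: str, value: float, tags: dict) -> None:
            print(f"gauge {name}={value} {tags}")

        def event(self, name: str, tags: dict) -> None:
            print(f"event {name} {tags}")

    def record_safe_mode_change(stats: StatsClient, service: str,
                                enabled: bool, profile: str) -> None:
        tags = {"service": service, "profile": profile}
        # A gauge makes "is safe mode on right now?" visible on dashboards.
        stats.gauge("safe_mode.enabled", 1 if enabled else 0, tags)
        # A discrete event marks the transition, so SLO graphs can be split
        # into before/during/after safe-mode intervals.
        stats.event("safe_mode.toggled",
                    {**tags, "enabled": str(enabled), "ts": str(int(time.time()))})

    record_safe_mode_change(StatsClient(), "search", True, "degrade-search")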

5. Practice safe-mode drills

We ran small drills:

  • enable safe mode in a lower environment
  • simulate corresponding failures (e.g., slow dependencies)
  • confirm that behavior matches expectations

For some services, we also ran short, carefully monitored safe-mode tests in production during low-traffic windows.
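
A lower-environment drill can be scripted end to end. In this sketch, the helpers (enable_safe_mode, inject_latency, p99_latency_ms) are illustrative stubs for a real config change, a real fault-injection tool, and a real metrics query:

    # Sketch: a scripted safe-mode drill for a lower environment.
    import random
    import time

    def enable_safe_mode(profile: str) -> None:
        print(f"safe mode on: {profile}")

    def inject_latency(dependency: str, extra_ms: int) -> None:
        print(f"injecting {extra_ms}ms of latency into {dependency}")

    def p99_latency_ms() -> float:
        # Stands in for querying the service's real p99 latency metric.
        return random.uniform(50, 150)

    def run_drill() -> None:
        inject_latency("search-cluster", extra_ms=2000)  # simulate a slow dependency
        enable_safe_mode("degrade-search")
        time.sleep(1)                                    # let the degraded path take traffic
        p99 = p99_latency_ms()
        # Expectation: with safe mode on, p99 stays within the degraded SLO
        # even while the dependency is slow.
        assert p99 < 200, f"safe mode did not hold the line: p99={p99:.0f}ms"

    run_drill()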

Results / Measurements

After standardizing safe mode:

  • incident leads had clearer options than "roll back" or "turn off everything"
  • we saw fewer ad-hoc config changes during pages
  • we could discuss safe-mode design in calm reviews instead of only during outages

We also identified services where a simple safe mode wasn’t enough, which led to broader design changes.

Takeaways

  • Safe mode is a product decision as much as an operational one.
  • Encoding safe mode in configuration and runbooks makes it usable under pressure.
  • Metrics around safe-mode usage tell you whether your degradation paths are doing their job.
