RELIABILITY · 2024-02-07 · BY ELI NAVARRO

Standardizing safe-mode configs across services

We defined what "safe mode" means for each service so we can degrade predictably instead of improvising under pressure.

reliability · degradation · configuration · incidents

During incidents, we reached for the same knobs every time:

  • turn off non-critical features
  • lower limits on expensive paths
  • relax some validations and tighten others

We did this via:

  • ad-hoc config changes
  • hurried feature-flag edits
  • "temporary" overrides left in place longer than intended

Every service had its own idea of "safe mode."

None of them were written down.

We decided to standardize safe-mode configs across services so we could degrade predictably instead of improvising.

Constraints

  • Services varied widely in behavior and risk.
  • We didn’t want a single global switch that hid important differences.
  • We wanted something simple enough to use during a page.

What we changed

1. Define safe-mode behavior per service

For each service, we documented:

  • what "normal" looks like
  • what "safe mode" should do

Safe mode generally means:

  • reducing optional work
  • prioritizing core flows
  • avoiding expensive operations that have cheap fallbacks

Examples:

  • serve cached or less detailed data instead of live, expensive queries
  • disable non-essential background jobs
  • cap request rates for expensive endpoints while keeping critical ones open
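
To make the first example concrete, here is a minimal sketch of a read path that prefers cached data while safe mode is on. All names here (SAFE_MODE, fetch_live_results, get_search_results) are illustrative, not our production code:

    # Minimal sketch: a read path that degrades to cached data in safe mode.
    SAFE_MODE = False            # in practice, read from versioned config
    _cache: dict[str, list] = {}

    def fetch_live_results(query: str) -> list:
        # Stands in for an expensive live query (database, search cluster, ...).
        results = [f"live result for {query!r}"]
        _cache[query] = results  # keep the cache warm for degraded operation
        return results

    def get_search_results(query: str) -> list:
        if SAFE_MODE:
            # Degraded but predictable: possibly stale, always cheap.
            return _cache.get(query, [])
        return fetch_live_results(query)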

2. Encode safe mode in configuration

We added explicit, versioned configuration for safe mode, such as:

  • safe_mode_enabled: on/off
  • safe_mode_profile: which preset to use (e.g., "degrade-search", "degrade-recommendations")

These map to:

  • specific feature flags
  • rate limits
  • queue priorities
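
As an illustration of that mapping, a profile could expand into concrete settings like this. The profile names come from above, but every flag, limit, and priority value below is made up, not one of our real presets:

    # Sketch: expanding (safe_mode_enabled, safe_mode_profile) into settings.
    from dataclasses import dataclass, field

    @dataclass
    class SafeModeProfile:
        feature_flags: dict[str, bool] = field(default_factory=dict)
        rate_limits: dict[str, int] = field(default_factory=dict)      # req/s per endpoint
        queue_priorities: dict[str, int] = field(default_factory=dict) # higher = deferred

    PROFILES = {
        "degrade-search": SafeModeProfile(
            feature_flags={"search.live_queries": False},
            rate_limits={"/search": 50},
            queue_priorities={"search-reindex": 9},
        ),
        "degrade-recommendations": SafeModeProfile(
            feature_flags={"recs.personalization": False},
            rate_limits={"/recommendations": 20},
            queue_priorities={"recs-model-refresh": 9},
        ),
    }

    def resolve_safe_mode(enabled: bool, profile: str) -> SafeModeProfile | None:
        # Disabled safe mode means no overrides: run the normal config.
        return PROFILES[profile] if enabled else None

Keeping this expansion explicit, rather than in people's heads, is what makes the review and dashboard steps below possible.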

We treated safe-mode config like any other operational configuration:

  • checked into version control
  • reviewed
  • visible in dashboards

3. Make safe mode discoverable in runbooks

Runbooks now include a "Safe mode" section that covers:

  • when to consider enabling safe mode
  • how to enable it (commands, UIs, config changes)
  • what user impact to expect

This makes "enter safe mode" an explicit step rather than an improvised series of toggles.
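
As a sketch, a runbook entry might read like this (the thresholds, profile name, and enabling mechanism are hypothetical and vary by service):

    Safe mode
      When:   p99 latency > 2s for 10 min, or the search cluster is degraded
      How:    set safe_mode_enabled=true, safe_mode_profile=degrade-search
              via the usual config-change workflow
      Impact: search results may be stale; personalization disabled;
              core browse and checkout unaffected
      Exit:   revert the config change once dependency latency recovers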

4. Add metrics for safe-mode usage

We instrumented:

  • when safe mode is enabled or disabled
  • which profile is active
  • key SLOs before/during/after safe-mode intervals

This allows us to:

  • see how often we rely on safe mode
  • evaluate whether the degraded behavior is actually safer
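
A minimal sketch of the toggle instrumentation, assuming a generic metrics client; the StatsClient below is a stand-in for whatever metrics library a service already uses:

    # Sketch: instrumenting safe-mode toggles.
    import time

    class StatsClient:
        def gauge(self, name: str, value: float, tags: dict) -> None:
            print(f"gauge {name}={value} {tags}")

        def event(self, name: str, tags: dict) -> None:
            print(f"event {name} {tags}")

    def record_safe_mode_change(stats: StatsClient, service: str,
                                enabled: bool, profile: str) -> None:
        tags = {"service": service, "profile": profile}
        # A gauge makes "is safe mode on right now?" visible on dashboards.
        stats.gauge("safe_mode.enabled", 1 if enabled else 0, tags)
        # A discrete event marks the transition, so SLO graphs can be split
        # into before/during/after safe-mode intervals.
        stats.event("safe_mode.toggled",
                    {**tags, "enabled": str(enabled), "ts": str(int(time.time()))})

    record_safe_mode_change(StatsClient(), "search", True, "degrade-search")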

5. Practice safe-mode drills

We ran small drills:

  • enable safe mode in a lower environment
  • simulate corresponding failures (e.g., slow dependencies)
  • confirm that behavior matches expectations

For some services, we also ran short, carefully monitored safe-mode tests in production during low-traffic windows.
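
A lower-environment drill can be scripted end to end. In this sketch, the helpers (enable_safe_mode, inject_latency, p99_latency_ms) are illustrative stubs for a real config change, a real fault-injection tool, and a real metrics query:

    # Sketch: a scripted safe-mode drill for a lower environment.
    import random
    import time

    def enable_safe_mode(profile: str) -> None:
        print(f"safe mode on: {profile}")

    def inject_latency(dependency: str, extra_ms: int) -> None:
        print(f"injecting {extra_ms}ms of latency into {dependency}")

    def p99_latency_ms() -> float:
        # Stands in for querying the service's real p99 latency metric.
        return random.uniform(50, 150)

    def run_drill() -> None:
        inject_latency("search-cluster", extra_ms=2000)  # simulate a slow dependency
        enable_safe_mode("degrade-search")
        time.sleep(1)                                    # let the degraded path take traffic
        p99 = p99_latency_ms()
        # Expectation: with safe mode on, p99 stays within the degraded SLO
        # even while the dependency is slow.
        assert p99 < 200, f"safe mode did not hold the line: p99={p99:.0f}ms"

    run_drill()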

Results / Measurements

After standardizing safe mode:

  • incident leads had clearer options than "roll back" or "turn off everything"
  • we saw fewer ad-hoc config changes during pages
  • we could discuss safe-mode design in calm reviews instead of only during outages

We also identified services where a simple safe mode wasn’t enough, which led to broader design changes.

Takeaways

  • Safe mode is a product decision as much as an operational one.
  • Encoding safe mode in configuration and runbooks makes it usable under pressure.
  • Metrics around safe-mode usage tell you whether your degradation paths are doing their job.
