STEWARDSHIP · 2022-04-26 · BY PRIYA PATEL

Aligning product experiments with error budgets

How we made sure experiments respect reliability by tying them to error budgets instead of running them until something breaks.

Tags: stewardship, experiments, reliability, slo

Product experiments are supposed to be controlled risk.

In practice, our early experiments behaved more like extra features:

  • they shipped under tight timelines
  • they added load or complexity in hot paths
  • they rarely had a clear exit plan

We paid attention to business metrics.

We did not consistently pay attention to error budgets.

After one quarter with a cluster of experiment-related incidents, we decided that experiments had to live inside the same reliability envelope as everything else.

Constraints

  • We didn’t want to slow experimentation to a crawl.
  • Different teams had different levels of SLO maturity.
  • Some experiments were small UI tweaks; others changed traffic patterns or backend behavior.

What we changed

We made experiments answer to SLOs and error budgets.

1. Make "experiment impact" a design question

Experiment design docs now include a short section:

  • Which services and flows does this touch?
  • Which SLOs could this experiment affect?
  • What’s the plan if error budgets start to burn faster than expected?

This pushed teams to think about reliability impact early instead of after launch.
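
One way to keep that section from being skipped is to treat it as structured data that tooling can lint. Here is a minimal sketch in Python, assuming a hypothetical schema; none of these field names come from our real tooling.

    # Illustrative only: the "experiment impact" section expressed as structured
    # data so a doc linter can check that it was actually filled in.
    # Every name here is a hypothetical example, not a real schema.
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class ExperimentImpact:
        """Reliability questions every experiment design doc must answer."""
        services_touched: List[str]   # which services and flows this touches
        slos_at_risk: List[str]       # which SLOs this experiment could affect
        budget_burn_plan: str         # what we do if budgets burn faster than expected

        def is_complete(self) -> bool:
            # A linter can refuse to mark the design "ready for review"
            # until all three answers are present.
            return bool(self.services_touched
                        and self.slos_at_risk
                        and self.budget_burn_plan.strip())


    impact = ExperimentImpact(
        services_touched=["checkout-api", "pricing-service"],
        slos_at_risk=["checkout-latency-p99", "checkout-availability"],
        budget_burn_plan="Pause via the feature flag if the treatment cohort "
                         "uses more than 20% of the weekly error budget.",
    )
    assert impact.is_complete()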

2. Tie experiment toggles to observability

Experiments live behind feature flags.

We made sure:

  • metrics can be broken down by experiment variant (e.g., by flag value or cohort)
  • dashboards show SLO behavior by cohort when appropriate

This lets us see whether an experiment cohort is burning error budget faster than control.
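
As a concrete illustration of what "broken down by cohort" can look like in the instrumentation, here is a sketch using the open-source Python prometheus_client library; the metric names, the new-pricing-path flag, and the do_checkout / flags.is_enabled helpers are placeholders rather than our actual code.

    # Illustrative sketch using the open-source prometheus_client library.
    # Metric names, the "new-pricing-path" flag, and the do_checkout /
    # flags.is_enabled helpers are placeholders.
    from prometheus_client import Counter, Histogram

    REQUESTS = Counter(
        "checkout_requests_total",
        "Checkout requests, labeled by experiment cohort and outcome",
        ["cohort", "outcome"],
    )
    LATENCY = Histogram(
        "checkout_request_seconds",
        "Checkout request latency, labeled by experiment cohort",
        ["cohort"],
    )


    def handle_checkout(request, flags):
        # Resolve the cohort from the feature flag system:
        # "control" when the flag is off, "treatment" when it is on.
        cohort = "treatment" if flags.is_enabled("new-pricing-path", request.user) else "control"
        with LATENCY.labels(cohort=cohort).time():
            try:
                do_checkout(request)  # placeholder for the real handler
                REQUESTS.labels(cohort=cohort, outcome="success").inc()
            except Exception:
                REQUESTS.labels(cohort=cohort, outcome="error").inc()
                raise

Because every request carries a cohort label, the same dashboards that track the SLO can chart error rate and latency per cohort and show whether treatment is burning budget faster than control.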

3. Define "stop" criteria up front

Each experiment design includes simple stop rules, such as:

  • "If this cohort uses more than X% of the weekly error budget for this SLO, pause the experiment."
  • "If latency for this endpoint exceeds Yms for Z minutes, roll back the experimental path."

These are not precise formulas.

They are guardrails that make it easier for on-call and product to agree when to pause.
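
To make the first rule concrete, here is a hedged sketch of a periodic check that pauses an experiment once its cohort has consumed too much of the weekly error budget; the budget-query and flag helpers are hypothetical stand-ins for whatever your SLO tooling and flag system expose.

    # Illustrative stop-rule check, run periodically (e.g., by a scheduled job).
    # weekly_error_budget_consumed() and pause_experiment() are hypothetical
    # stand-ins for your SLO tooling and feature flag system.
    BUDGET_SHARE_LIMIT = 0.20  # the "X%" from the stop rule


    def enforce_stop_rule(slo: str, experiment: str) -> None:
        # Share of this SLO's weekly error budget consumed by the treatment cohort.
        cohort_burn = weekly_error_budget_consumed(slo, cohort="treatment")

        if cohort_burn > BUDGET_SHARE_LIMIT:
            # Pausing flips the flag off for everyone; the control path keeps serving.
            pause_experiment(
                experiment,
                reason=f"{experiment} used {cohort_burn:.0%} of the weekly error "
                       f"budget for {slo} (limit {BUDGET_SHARE_LIMIT:.0%})",
            )

The exact threshold matters less than having the rule written down, and ideally executable, before the experiment launches.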

4. Integrate experiments into on-call context

We updated on-call runbooks to include:

  • where to see active experiments
  • how to correlate incidents with experiment activity
  • who owns which experiments

When an incident touches a flow under experiment, the runbook makes "turn off the experiment" a first-class candidate mitigation.
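
One lightweight way to give on-call that context is a small command that lists active experiments with their owners and kill flags. This is a sketch against a hypothetical flag-service API, not any particular product.

    # Illustrative sketch: a tiny "what experiments are live right now?" command
    # for on-call. The flag_service module and its list_active() call are
    # hypothetical; substitute whatever your flag system actually exposes.
    import flag_service


    def print_active_experiments() -> None:
        for exp in flag_service.list_active(kind="experiment"):
            # Each row gives on-call what they need during an incident:
            # what it is, who owns it, and which flag turns it off.
            print(f"{exp.name:<30} owner={exp.owner:<15} kill_flag={exp.flag_key}")


    if __name__ == "__main__":
        print_active_experiments()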

5. Review experiments like changes, not like ideas

We treat large experiments like any other risky change:

  • design review that includes SLO and error budget impact
  • rollout plan (cohorts, canaries, timing)
  • rollback plan (flags, configs)

Smaller experiments get lighter versions of the same questions.
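
For the larger experiments, the rollout and rollback plans can live next to the design doc as reviewable data rather than prose. A minimal sketch, with illustrative field names:

    # Illustrative rollout/rollback plan kept next to the design doc as
    # reviewable data. Field names and values are examples, not a real schema.
    ROLLOUT_PLAN = {
        "experiment": "new-pricing-path",
        "cohorts": [
            {"name": "internal-canary", "traffic_pct": 1, "min_soak_hours": 24},
            {"name": "treatment", "traffic_pct": 10, "min_soak_hours": 72},
        ],
        "slos_watched": ["checkout-latency-p99", "checkout-availability"],
        "rollback": {
            # Rolling back is a flag flip plus a config revert, not a deploy.
            "flag": "new-pricing-path",
            "config_revert": "pricing-service/overrides.yaml",
        },
    }

Reviewers can then push back on canary sizing or rollback mechanics the same way they would for any other risky change.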

Results / Measurements

After a few quarters, we saw:

  • fewer incidents where the root cause was "experiment added unexpected load"
  • faster mitigation for experiment-related regressions (flags made it easy to turn off the variant)
  • more predictable error budget usage during heavy testing periods

We also saw cultural shifts:

  • teams talked about "spending error budget" for experiments instead of "running them until something breaks"
  • product and engineering had clearer conversations about which SLOs we were willing to risk for which gains

Takeaways

  • Experiments are still production code; they need to live inside SLO and error budget constraints.
  • Simple stop rules and good observability keep experiments from quietly eating reliability.
  • Including experiment state in on-call context turns "turn it off" into a safe, fast option during incidents.
