Aligning product experiments with error budgets
How we made sure experiments respect reliability by tying them to error budgets instead of running them until something breaks.
Product experiments are supposed to be controlled risk.
In practice, our early experiments behaved more like extra features:
- they shipped under tight timelines
- they added load or complexity in hot paths
- they rarely had a clear exit plan
We paid attention to business metrics.
We did not consistently pay attention to error budgets.
After one quarter with a cluster of experiment-related incidents, we decided that experiments had to live inside the same reliability envelope as everything else.
Constraints
- We didn’t want to slow experimentation to a crawl.
- Different teams had different levels of SLO maturity.
- Some experiments were small UI tweaks; others changed traffic patterns or backend behavior.
What we changed
We made experiments answer to SLOs and error budgets.
1. Make "experiment impact" a design question
Experiment design docs now include a short section:
- Which services and flows does this touch?
- Which SLOs could this experiment affect?
- What’s the plan if error budgets start to burn faster than expected?
This pushed teams to think about reliability impact early instead of after launch.
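To keep the section from devolving into freeform prose, it helps to think of it as structured data that a reviewer (or a doc linter) can check for completeness. A minimal sketch in Python, where the field names, services, and SLO names are purely illustrative:

```python
# Illustrative shape of the "experiment impact" section of a design doc.
# None of these names are a published schema; they mirror the three questions above.
from dataclasses import dataclass


@dataclass
class ExperimentImpact:
    services_touched: list[str]   # which services and flows this experiment touches
    slos_at_risk: list[str]       # which SLOs the experiment could affect
    budget_burn_response: str     # what we do if error budgets burn faster than expected


impact = ExperimentImpact(
    services_touched=["checkout-api", "payment-worker"],          # hypothetical services
    slos_at_risk=["checkout-availability", "checkout-latency-p99"],  # hypothetical SLOs
    budget_burn_response="Pause the variant via its feature flag and notify on-call.",
)
```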
2. Tie experiment toggles to observability
Experiments live behind feature flags.
We made sure:
- metrics can be broken down by experiment variant (e.g., by flag value or cohort)
- dashboards show SLO behavior by cohort when appropriate
This lets us see whether an experiment cohort is burning error budget faster than control.
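The mechanics are mostly about adding the cohort as a metric label at the point where requests are recorded. A minimal sketch using the Python prometheus_client library, where the metric name, flag key, and the variant lookup are assumptions standing in for whatever flag SDK you use:

```python
# Label request outcomes with the experiment cohort so dashboards can
# compare error budget burn between control and variant.
from prometheus_client import Counter

REQUESTS = Counter(
    "checkout_requests_total",                       # hypothetical metric name
    "Checkout requests, labeled by outcome and experiment cohort.",
    ["outcome", "experiment_cohort"],
)


def get_cohort(user_id: str, flag: str) -> str:
    """Stand-in for the feature-flag SDK's variant lookup (e.g. control / variant_a / none)."""
    return "control"


def record_request(user_id: str, succeeded: bool) -> None:
    cohort = get_cohort(user_id, flag="new-checkout-flow")  # hypothetical flag key
    REQUESTS.labels(
        outcome="success" if succeeded else "error",
        experiment_cohort=cohort,
    ).inc()
```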
3. Define "stop" criteria up front
Each experiment design includes simple stop rules, such as:
- "If this cohort uses more than X% of the weekly error budget for this SLO, pause the experiment."
- "If latency for this endpoint exceeds Yms for Z minutes, roll back the experimental path."
These are not precise formulas.
They are guardrails that make it easier for on-call and product to agree on when to pause.
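To show what the first rule looks like in practice, here is a minimal sketch of the budget-share check. The SLO target, the 25% threshold, and the way counts are obtained are all illustrative; the real values come from the experiment's design doc and your metrics backend.

```python
# Stop rule: pause the experiment if the cohort has consumed more than a set
# share of the weekly error budget for this SLO. Numbers are illustrative.

SLO_TARGET = 0.999        # e.g. 99.9% availability
MAX_BUDGET_SHARE = 0.25   # "X%": pause if the cohort burns >25% of the weekly budget


def should_pause(cohort_errors: int, weekly_requests: int) -> bool:
    # The weekly error budget is the number of failures the SLO allows
    # across all traffic for the week.
    weekly_budget = weekly_requests * (1 - SLO_TARGET)
    if weekly_budget == 0:
        return cohort_errors > 0
    budget_used = cohort_errors / weekly_budget
    return budget_used > MAX_BUDGET_SHARE


# Example: 2M requests expected this week => budget of ~2,000 errors;
# a cohort that has already produced 600 errors has spent 30% of it.
print(should_pause(cohort_errors=600, weekly_requests=2_000_000))  # True
```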
4. Integrate experiments into on-call context
We updated on-call runbooks to include:
- where to see active experiments
- how to correlate incidents with experiment activity
- who owns which experiments
When an incident touches a flow under experiment, the runbook makes "turn off the experiment" a first-class candidate mitigation.
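The runbook entry is easier to act on when "where to see active experiments" is a query, not a scavenger hunt. A minimal sketch of such a helper, assuming a hypothetical internal flag-service endpoint and response shape (not a real API):

```python
# List experiments currently active on a service, so on-call can quickly
# evaluate "turn off the experiment" as a mitigation during an incident.
import requests

FLAG_SERVICE_URL = "https://flags.internal.example.com/api/v1/experiments"  # hypothetical


def active_experiments(service: str) -> list[dict]:
    resp = requests.get(
        FLAG_SERVICE_URL,
        params={"service": service, "state": "active"},
        timeout=5,
    )
    resp.raise_for_status()
    # Assumed response shape: [{"flag": "new-checkout-flow", "owner": "payments-team", ...}]
    return resp.json()


if __name__ == "__main__":
    for exp in active_experiments("checkout-api"):
        print(f'{exp["flag"]}  owner={exp["owner"]}  started={exp.get("started_at", "?")}')
```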
5. Review experiments like changes, not like ideas
We treat large experiments like any other risky change:
- design review that includes SLO and error budget impact
- rollout plan (cohorts, canaries, timing)
- rollback plan (flags, configs)
Smaller experiments get lighter versions of the same questions.
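For the rollout and rollback pieces, writing the plan as plain data keeps it reviewable and keeps it next to the flag configuration. A sketch of what that might capture, with stage names, percentages, and the flag key as illustrative placeholders:

```python
# Illustrative rollout/rollback plan for a large experiment, expressed as data
# so it can be reviewed alongside the flag config. Values are placeholders.
ROLLOUT_PLAN = {
    "flag": "new-checkout-flow",  # hypothetical flag key
    "stages": [
        {"name": "canary", "traffic_pct": 1, "min_soak": "24h"},
        {"name": "early-cohort", "traffic_pct": 10, "min_soak": "48h"},
        {"name": "full-experiment", "traffic_pct": 50, "min_soak": "1w"},
    ],
    "rollback": {
        "mechanism": "set flag to 'off' (no deploy needed)",
        "config_reverts": ["checkout-routing.yaml"],  # hypothetical config
        "owner": "payments-team",
    },
}
```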
Results / Measurements
After a few quarters, we saw:
- fewer incidents where the root cause was "experiment added unexpected load"
- faster mitigation for experiment-related regressions (flags made it easy to turn off the variant)
- more predictable error budget usage during heavy testing periods
We also saw cultural shifts:
- teams talked about "spending error budget" for experiments instead of "running them until something breaks"
- product and engineering had clearer conversations about which SLOs we were willing to risk for which gains
Takeaways
- Experiments are still production code; they need to live inside SLO and error budget constraints.
- Simple stop rules and good observability keep experiments from quietly eating reliability.
- Including experiment state in on-call context turns "turn it off" into a safe, fast option during incidents.