Incident report: Trace sampling change hid a regression
A change to trace sampling made a latency regression invisible in our usual views. We describe how it happened and what we changed.
Summary
On August 3, 2025, we shipped a seemingly harmless change to trace sampling.
The goal was to:
- reduce trace volume for a set of high-traffic endpoints
- focus sampling on error and high-latency cases
Shortly afterward, a real latency regression hit those endpoints.
Because of how we had reconfigured sampling, the regression:
- did not show up in our usual trace-based dashboards
- appeared only as a subtle change in metrics
It took longer than it should have to correlate user reports and metrics with the underlying cause.
We treated this as an incident in our observability design, not just in the service that regressed.
Impact
- Duration: roughly three hours of elevated latency before we identified and fixed the regression.
- User impact:
  - increased tail latency for affected endpoints
  - some users experienced slower pages and occasional timeouts
- Internal impact:
  - extra time spent reconciling conflicting signals between metrics and traces
  - erosion of trust in trace-based dashboards for a period
Timeline
All times local.
- 09:10 — Trace sampling configuration is updated for a high-traffic service.
- 09:25 — Latency metrics begin to show a small but real increase in P95 for certain endpoints.
- 10:00 — First user reports arrive indicating slower responses.
- 10:12 — On-call checks trace dashboards; sampled traces do not show a clear concentration of slow spans.
- 10:35 — Further investigation via metrics and logs suggests a new code path is heavier than expected.
- 11:05 — We realize the sampling change has biased trace collection away from the moderately slow requests in the affected cohort.
- 11:20 — Sampling is temporarily increased for the relevant service and endpoints.
- 11:40 — New traces clearly show the problematic code path; rollback is initiated.
- 12:10 — Rollback completes; metrics confirm latency returns to baseline.
Root cause
The immediate technical cause was a performance bug in newly shipped code.
The observability-specific root cause was that our updated sampling strategy:
- concentrated on errors and very slow requests
- down-sampled moderately slow traces that would have highlighted the early stages of the regression
We had unintentionally created a gap between:
- when metrics showed a regression
- when traces gave us enough detail to diagnose it
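To make the gap concrete, here is a minimal sketch of the kind of tail-focused rule we had shipped. The thresholds, rates, and function name are illustrative assumptions, not our production configuration; the point is only that a regression which raises mid-percentile latency without crossing the "very slow" cutoff is almost never kept.

```python
import random

# Illustrative thresholds only; not our production values.
ERROR_SAMPLE_RATE = 1.0       # keep every errored request
SLOW_THRESHOLD_MS = 2000      # the "very slow" cutoff the new strategy focused on
SLOW_SAMPLE_RATE = 1.0        # keep every very slow request
BASELINE_SAMPLE_RATE = 0.001  # aggressive down-sampling for everything else

def keep_trace(duration_ms: float, is_error: bool) -> bool:
    """Tail-focused rule similar in spirit to the one we shipped."""
    if is_error:
        return random.random() < ERROR_SAMPLE_RATE
    if duration_ms >= SLOW_THRESHOLD_MS:
        return random.random() < SLOW_SAMPLE_RATE
    return random.random() < BASELINE_SAMPLE_RATE

# A regression that moves P95 from ~300 ms to ~800 ms produces requests that
# are clearly slower than before but still below SLOW_THRESHOLD_MS, so almost
# none of them are kept: trace dashboards barely change while users feel it.
```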
Contributing factors
- Sampling decisions made per-service, not per-signal. We adjusted sampling globally for the service without thinking through which endpoints and latency ranges we depended on traces to cover.
- Overconfidence in "smart" sampling. We trusted that focusing on errors and top-percentile latency would catch important issues, overlooking mid-percentile regressions.
- Weak coupling between metrics and trace configuration. Metric-based alerts did not automatically drive sampling adjustments.
What we changed
1. Define trace budgets by question, not just volume
We reframed trace sampling as a way to answer specific questions:
- understanding normal behavior
- catching regressions early
- explaining incidents
For key endpoints, we:
- reserved sampling budget for a mix of normal and slow requests
- avoided configurations that only captured the extremes
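As an illustration, a budget expressed as per-latency-bucket rates might look like the sketch below. The bucket boundaries and rates are hypothetical; what matters is that the moderately slow range keeps a meaningful share of traces instead of being sampled almost to zero.

```python
import random

# Hypothetical per-endpoint budget: a kept-trace rate per latency bucket.
# Bucket boundaries and rates are illustrative, not our real numbers.
BUCKETS = [
    (0, 300, 0.01),              # normal requests: small but non-zero sample
    (300, 2000, 0.10),           # moderately slow: the range we previously lost
    (2000, float("inf"), 1.0),   # very slow: keep everything
]

def keep_trace(duration_ms: float, is_error: bool) -> bool:
    """Keep a deliberate mix of normal, moderately slow, and very slow traces."""
    if is_error:
        return True
    for low, high, rate in BUCKETS:
        if low <= duration_ms < high:
            return random.random() < rate
    return False
```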
2. Tie sampling to metric alerts
We added simple hooks:
- when latency metrics for an endpoint cross certain thresholds, sampling for that endpoint temporarily increases
- when alerts clear, sampling gradually returns to baseline
This ensures that when metrics hint at a problem, we collect richer traces automatically.
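Here is a minimal sketch of such a hook, assuming an alerting pipeline that can call into whatever serves sampling configuration to the collectors. The rates, decay window, and function names are illustrative; our real implementation lives in the alerting and configuration plumbing rather than in a single module.

```python
import time

# Hypothetical in-process state; rates and the decay window are illustrative.
BASELINE_RATE = 0.01
BOOSTED_RATE = 0.25
DECAY_SECONDS = 30 * 60  # ease back to baseline over ~30 minutes after the alert clears

_boost_until: dict = {}  # endpoint -> unix time at which the boost fully expires

def on_latency_alert(endpoint: str, firing: bool) -> None:
    """Called by the alerting pipeline when a latency alert changes state."""
    if firing:
        _boost_until[endpoint] = float("inf")              # full boost while firing
    else:
        _boost_until[endpoint] = time.time() + DECAY_SECONDS

def sample_rate(endpoint: str) -> float:
    """Effective sampling rate the tracer should use for this endpoint."""
    expires = _boost_until.get(endpoint, 0.0)
    now = time.time()
    if now >= expires:
        return BASELINE_RATE
    if expires == float("inf"):
        return BOOSTED_RATE
    # Alert has cleared: decay linearly from boosted back to baseline.
    remaining = (expires - now) / DECAY_SECONDS
    return BASELINE_RATE + (BOOSTED_RATE - BASELINE_RATE) * remaining
```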
3. Make sampling visible and versioned
We:
- treated sampling configurations as code with version control
- added dashboards to show effective sampling rates per endpoint over time
This made it easier to see when a sampling change preceded a change in diagnostic power.
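One way to picture both halves of this: a versioned configuration file that is loaded and validated like any other code, plus counters that back the effective-rate dashboard. The file path, field names, and helpers here are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical checked-in file, reviewed and versioned like any other code.
CONFIG_PATH = "observability/sampling/checkout-service.json"

def load_config(path: str = CONFIG_PATH) -> dict:
    """Load and sanity-check the versioned sampling configuration."""
    with open(path) as f:
        config = json.load(f)
    for endpoint, rate in config["rates"].items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"invalid sampling rate for {endpoint}: {rate}")
    return config

# Counters behind the "effective sampling rate per endpoint" dashboard.
_seen = defaultdict(int)
_kept = defaultdict(int)

def record_decision(endpoint: str, kept: bool) -> None:
    _seen[endpoint] += 1
    if kept:
        _kept[endpoint] += 1

def effective_rates() -> dict:
    """What the dashboard charts: kept / seen, per endpoint."""
    return {ep: _kept[ep] / _seen[ep] for ep in _seen if _seen[ep]}
```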
4. Clarify expectations for trace-based dashboards
We updated documentation and runbooks:
- trace dashboards are powerful but not complete
- metrics remain the source of truth for SLOs and regressions
- when metrics and traces disagree, we treat that as a signal to investigate sampling, not to dismiss either side
Follow-ups
Completed
- Rolled back the specific sampling configuration that hid the regression.
- Updated sampling strategies for key services and endpoints.
- Added metric-driven sampling adjustments for latency alerts.
Planned / in progress
- Extend metric-linked sampling to more services.
- Evaluate periodic "baseline" sampling runs to validate that our strategies still capture the behavior we care about.
Takeaways
- Trace sampling is part of your observability design; changes to it can hide or reveal entire classes of problems.
- Sampling strategies should be driven by the questions you need to answer, not just volume targets.
- Metrics and traces should reinforce each other; when they diverge, investigate your sampling before assuming everything is fine.