Reliability · 2025-08-03 · By Priya Patel

Incident report: Trace sampling change hid a regression

A change to trace sampling made a latency regression invisible in our usual views. We describe how it happened and what we changed.

reliability · tracing · observability · incidents

Summary

On August 3, 2025, we shipped a seemingly harmless change to trace sampling.

The goal was to:

  • reduce trace volume for a set of high-traffic endpoints
  • focus sampling on error and high-latency cases

Shortly afterward, a real latency regression hit those endpoints.

Because of how we had reconfigured sampling, the regression:

  • did not show up in our usual trace-based dashboards
  • appeared only as a subtle change in metrics

It took longer than it should have to correlate user reports and metrics with the underlying cause.

We treated this as an incident in our observability design, not just in the service that regressed.

Impact

  • Duration: roughly three hours of elevated latency (about 09:25–12:10 local) before we identified the cause and completed the rollback.
  • User impact:
    • increased tail latency for affected endpoints
    • some users experienced slower pages and occasional timeouts
  • Internal impact:
    • extra time spent reconciling conflicting signals between metrics and traces
    • erosion of trust in trace-based dashboards for a period

Timeline

All times local.

  • 09:10 — Trace sampling configuration is updated for a high-traffic service.
  • 09:25 — Latency metrics begin to show a small but real increase in P95 for certain endpoints.
  • 10:00 — First user reports arrive indicating slower responses.
  • 10:12 — On-call checks trace dashboards; sampled traces do not show a clear concentration of slow spans.
  • 10:35 — Further investigation via metrics and logs suggests a new code path is heavier than expected.
  • 11:05 — We realize the sampling change has biased trace collection away from the moderately slow requests in the affected cohort, leaving too few relevant traces to see the regression.
  • 11:20 — Sampling is temporarily increased for the relevant service and endpoints.
  • 11:40 — New traces clearly show the problematic code path; rollback is initiated.
  • 12:10 — Rollback completes; metrics confirm latency returns to baseline.

Root cause

The immediate technical cause was a performance bug in newly deployed code.

The observability-specific root cause was that our updated sampling strategy:

  • concentrated on errors and very slow requests
  • down-sampled the moderately slow traces that would have highlighted the early stages of the regression (sketched below)

We had unintentionally created a gap between:

  • when metrics showed a regression
  • when traces gave us enough detail to diagnose it
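
To make the gap concrete, here is a minimal sketch of the shape of the tail-sampling decision we had shipped. The thresholds, rates, and field names are illustrative, not our actual configuration.

```python
import random

# Illustrative values only; not our production configuration.
ERROR_KEEP_RATE = 1.0        # always keep traces that contain errors
VERY_SLOW_MS = 2000          # "very slow" cutoff sampled aggressively
VERY_SLOW_KEEP_RATE = 1.0
BASELINE_KEEP_RATE = 0.001   # everything else, including moderately slow traces

def keep_trace(duration_ms: float, has_error: bool) -> bool:
    """Tail-sampling decision biased toward errors and extreme latency.

    Moderately slow traces (say, 500-2000 ms) fall into the tiny baseline
    bucket, so a regression emerging in that range is nearly invisible in
    sampled traces even while metrics already show it.
    """
    if has_error:
        return random.random() < ERROR_KEEP_RATE
    if duration_ms >= VERY_SLOW_MS:
        return random.random() < VERY_SLOW_KEEP_RATE
    return random.random() < BASELINE_KEEP_RATE
```

Everything between "normal" and "very slow" lands in the baseline bucket, which is exactly where the regression first appeared.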

Contributing factors

  • Sampling decisions made per-service, not per-signal. We adjusted sampling globally for the service without thinking through which endpoints and latency ranges we relied on traces for.
  • Overconfidence in "smart" sampling. We trusted that focusing on errors and top-percentile latency would catch important issues, overlooking mid-percentile regressions.
  • Weak coupling between metrics and trace configuration. Metric-based alerts did not automatically drive sampling adjustments.

What we changed

1. Define trace budgets by question, not just volume

We reframed trace sampling as a way to answer specific questions:

  • understanding normal behavior
  • catching regressions early
  • explaining incidents

For key endpoints, we:

  • reserved sampling budget for a mix of normal and slow requests
  • avoided configurations that only captured the extremes (see the sketch after this list)
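
A minimal sketch of that banded budget, with hypothetical latency bands and keep rates; the point is that the middle of the latency distribution gets an explicit, reserved share rather than falling into a catch-all baseline.

```python
import random

# Hypothetical per-endpoint budget: keep rates by latency band (ms).
LATENCY_BANDS = [
    (0,    500,  0.002),  # normal requests: small baseline sample
    (500,  2000, 0.05),   # moderately slow: explicitly reserved budget
    (2000, None, 1.0),    # very slow: keep everything
]

def keep_trace(duration_ms: float, has_error: bool) -> bool:
    """Stratified sampling: errors are always kept; each latency band has its own rate."""
    if has_error:
        return True
    for low, high, rate in LATENCY_BANDS:
        if duration_ms >= low and (high is None or duration_ms < high):
            return random.random() < rate
    return False
```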

2. Tie sampling to metric alerts

We added simple hooks:

  • when latency metrics for an endpoint cross certain thresholds, sampling for that endpoint temporarily increases
  • when alerts clear, sampling gradually returns to baseline

This ensures that when metrics hint at a problem, we collect richer traces automatically.
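
A rough sketch of the hook, assuming a hypothetical alert webhook and an in-process store of per-endpoint keep rates; in practice the override is pushed to the trace collector, and "gradually returns" is approximated here by letting the boost expire after a fixed window.

```python
import time

BASELINE_RATE = 0.002   # illustrative values, not our production settings
BOOSTED_RATE = 0.2
HOLD_SECONDS = 30 * 60  # keep the boost for a while after the alert clears

# endpoint -> (keep rate, expiry timestamp)
_overrides: dict[str, tuple[float, float]] = {}

def on_latency_alert(endpoint: str, firing: bool) -> None:
    """Called from the alerting webhook when a latency alert fires or clears."""
    if firing:
        _overrides[endpoint] = (BOOSTED_RATE, time.time() + HOLD_SECONDS)
    else:
        # Hold the current boost for HOLD_SECONDS more instead of dropping instantly.
        rate, _ = _overrides.get(endpoint, (BASELINE_RATE, 0.0))
        _overrides[endpoint] = (rate, time.time() + HOLD_SECONDS)

def current_keep_rate(endpoint: str) -> float:
    """Rate the sampler should use right now for this endpoint."""
    rate, expires_at = _overrides.get(endpoint, (BASELINE_RATE, 0.0))
    return rate if time.time() < expires_at else BASELINE_RATE
```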

3. Make sampling visible and versioned

We:

  • treated sampling configurations as code with version control
  • added dashboards to show effective sampling rates per endpoint over time

This made it easier to see when a sampling change preceded a change in diagnostic power.
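
As an illustration of what those dashboards plot, the effective sampling rate per endpoint can be derived from counters most stacks already export; the inputs here are hypothetical per-window aggregates of kept traces and total requests.

```python
def effective_sampling_rates(
    kept_traces: dict[str, int],
    total_requests: dict[str, int],
) -> dict[str, float]:
    """Per-endpoint fraction of requests for which a trace was actually kept.

    Plotting this for successive time windows makes it obvious when a
    sampling change took effect for a given endpoint.
    """
    rates: dict[str, float] = {}
    for endpoint, requests in total_requests.items():
        if requests > 0:
            rates[endpoint] = kept_traces.get(endpoint, 0) / requests
    return rates
```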

4. Clarify expectations for trace-based dashboards

We updated documentation and runbooks:

  • trace dashboards are powerful but not complete
  • metrics remain the source of truth for SLOs and regressions
  • when metrics and traces disagree, we treat that as a signal to investigate sampling, not to dismiss either side

Follow-ups

Completed

  • Rolled back the specific sampling configuration that hid the regression.
  • Updated sampling strategies for key services and endpoints.
  • Added metric-driven sampling adjustments for latency alerts.

Planned / in progress

  • Extend metric-linked sampling to more services.
  • Evaluate periodic "baseline" sampling runs to validate that our strategies still capture the behavior we care about.

Takeaways

  • Trace sampling is part of your observability design; changes to it can hide or reveal entire classes of problems.
  • Sampling strategies should be driven by the questions you need to answer, not just volume targets.
  • Metrics and traces should reinforce each other; when they diverge, investigate your sampling before assuming everything is fine.
