STEWARDSHIP · 2022-02-22 · BY PRIYA PATEL

Testing observability changes before production

How we started treating metrics, logs, and tracing changes like production code instead of 'just add it and see.'

observability · testing · telemetry · stewardship

For a long time, we treated observability changes differently from feature code.

If you wanted to add a counter or a new log line, the path was:

  • add it
  • deploy
  • check if it shows up

There was no pre-deploy validation. Staging observability stacks were thin or non-existent. We assumed that if the code compiled, the telemetry would "just work."

That worked fine until it didn't.

We shipped metrics with label typos that made them impossible to correlate. We added logs that exploded cardinality. We created traces sampled at 100% that overloaded the backend.

None of these broke the product directly, but they made incidents harder to debug and added surprising costs.

We needed to treat observability changes like production code: testable, reviewable, and safe to deploy.

Constraints

  • We had multiple telemetry backends (metrics, logs, traces); unifying them was not on the roadmap.
  • Staging environments had lighter-weight telemetry; we couldn't perfectly mirror production behavior.
  • Teams had different levels of observability expertise; we couldn't rely on everyone being an expert.
  • We wanted lightweight checks, not a heavyweight certification process.

What we changed

We introduced a few layers of validation for observability changes.

1. Lint and schema checks for metrics

We added simple linting for metric definitions:

  • metric names follow a consistent naming convention (e.g., service_operation_outcome_total)
  • labels are from an allowed list or explicitly documented as new
  • gauge/counter/histogram types match the semantic intent

These checks run in CI.
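As a rough illustration, the lint can be a small script over collected metric definitions. This is a minimal sketch, assuming definitions are available as simple dicts; the naming regex, the allowed-label list, and the definition shape are illustrative assumptions, not our exact rules.

    import re

    # Hypothetical convention: snake_case names like service_operation_outcome_total,
    # with counters ending in _total.
    METRIC_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")
    ALLOWED_LABELS = {"service", "operation", "outcome", "region"}  # illustrative

    def lint_metric(definition: dict) -> list[str]:
        """Return lint errors for one metric definition."""
        errors = []
        name, kind, labels = definition["name"], definition["type"], definition["labels"]

        if not METRIC_NAME_RE.match(name):
            errors.append(f"{name}: does not match the naming convention")
        if kind == "counter" and not name.endswith("_total"):
            errors.append(f"{name}: counters should end in _total")

        unknown = set(labels) - ALLOWED_LABELS
        if unknown and not definition.get("new_labels_documented"):
            errors.append(f"{name}: undocumented labels {sorted(unknown)}")
        return errors

CI fails the build if any definition comes back with errors.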

If a new metric is introduced, the PR must include:

  • a short description of what it measures
  • expected cardinality (low, medium, high)
  • which dashboard or alert will use it

This forces the question: "why are we adding this?"
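The metadata requirement is easy to enforce mechanically as well. A hedged sketch, assuming each new metric is accompanied by a small metadata block; the field names here are hypothetical.

    REQUIRED_FIELDS = {"description", "expected_cardinality", "used_by"}
    VALID_CARDINALITY = {"low", "medium", "high"}

    def check_metric_metadata(metadata: dict) -> list[str]:
        """Validate the metadata block that must accompany a new metric."""
        errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - metadata.keys())]
        if "expected_cardinality" in metadata and \
                metadata["expected_cardinality"] not in VALID_CARDINALITY:
            errors.append("expected_cardinality must be one of: low, medium, high")
        return errors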

2. Dry-run for high-cardinality changes

For changes that add or modify labels on frequently emitted metrics, we run a cardinality estimator in staging:

  • emit the metric for a sample of traffic
  • estimate the resulting series count
  • flag if it would exceed a threshold

This doesn't catch everything (staging traffic shapes differ), but it surfaces obvious mistakes like unbounded user IDs in labels.
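Conceptually, the estimator is just a distinct-count over sampled label sets. A simplified sketch, assuming we can tap label sets from staging traffic; the threshold and sample size are placeholders.

    from itertools import islice
    from typing import Iterable

    SERIES_THRESHOLD = 10_000  # illustrative per-metric budget

    def estimate_series_count(label_sets: Iterable[dict], sample_size: int = 100_000) -> int:
        """Count distinct label combinations seen in a sample of staging traffic."""
        distinct = set()
        for labels in islice(label_sets, sample_size):
            distinct.add(tuple(sorted(labels.items())))
        return len(distinct)

    def check_cardinality(metric_name: str, label_sets: Iterable[dict]) -> None:
        estimated = estimate_series_count(label_sets)
        if estimated > SERIES_THRESHOLD:
            raise SystemExit(
                f"{metric_name}: ~{estimated} series in sample exceeds {SERIES_THRESHOLD}; "
                "check for unbounded labels (user IDs, request IDs, raw URLs)"
            )

Because staging traffic is lighter and less diverse than production, a sample-based count under-estimates, so the threshold is set conservatively.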

3. Log sampling and volume checks

We track log volume per service in staging and production.

For changes that add new log lines on hot paths:

  • the change must specify a sampling rate or log level
  • we estimate volume impact ("this will add ~X MB/day")
  • if volume would exceed the service's budget, we either sample more aggressively or defer

This prevents "just log everything" from turning into a production cost.
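The volume estimate itself is back-of-the-envelope arithmetic. A sketch, assuming the hot path's request rate is known from existing metrics; the per-service budget is a placeholder.

    def estimated_daily_log_mb(requests_per_second: float,
                               sampling_rate: float,
                               avg_line_bytes: int) -> float:
        """Estimate added log volume in MB/day for a new log line on a hot path."""
        lines_per_day = requests_per_second * sampling_rate * 86_400
        return lines_per_day * avg_line_bytes / 1_000_000

    # Example: 500 rps, 10% sampling, ~300-byte lines -> roughly 1.3 GB/day
    added_mb = estimated_daily_log_mb(500, 0.10, 300)
    if added_mb > 500:  # hypothetical per-service daily budget in MB
        print(f"Adds ~{added_mb:.0f} MB/day; sample more aggressively or defer")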

4. Trace sampling validation

Tracing is expensive.

We require:

  • explicit sampling rates for new trace spans
  • a brief justification if sampling is above a low default (e.g., 1%)

We also flag traces that include high-cardinality tags and ask whether they belong in logs instead.
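A sketch of that check, assuming span definitions carry an explicit sample_rate, an optional justification, and a tag list; the field names and high-cardinality tag hints are hypothetical.

    DEFAULT_SAMPLE_RATE = 0.01  # the low default mentioned above
    HIGH_CARDINALITY_TAG_HINTS = ("user_id", "request_id", "session")  # illustrative

    def check_span(span_def: dict) -> list[str]:
        """Validate one new span definition against the sampling rules."""
        warnings = []
        rate = span_def.get("sample_rate")
        if rate is None:
            warnings.append(f"{span_def['name']}: sample_rate must be explicit")
        elif rate > DEFAULT_SAMPLE_RATE and not span_def.get("justification"):
            warnings.append(f"{span_def['name']}: rate {rate} above default needs a justification")

        for tag in span_def.get("tags", []):
            if any(hint in tag for hint in HIGH_CARDINALITY_TAG_HINTS):
                warnings.append(f"{span_def['name']}: tag '{tag}' looks high-cardinality; consider a log")
        return warnings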

5. Pre-deploy observability smoke tests

Before deploying a service with observability changes, we run a small smoke test in staging:

  • generate sample traffic
  • query for the new metrics/logs
  • verify they appear with expected labels/fields

If the query returns nothing or returns malformed data, the deploy is blocked until fixed.

This catches typos, schema mismatches, and misconfigured backends.
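A minimal version of the smoke test, assuming a Prometheus-compatible metrics backend in staging; the URLs and the send_sample_traffic helper are placeholders for whatever generates the sample traffic.

    import time
    import requests

    STAGING_PROM = "https://prometheus.staging.example.com"  # placeholder URL

    def send_sample_traffic() -> None:
        # Hypothetical: hit a few endpoints so the new telemetry is exercised.
        requests.get("https://service.staging.example.com/healthz", timeout=5)

    def metric_appears(metric_name: str, expected_labels: dict) -> bool:
        """Query staging Prometheus and check the new metric shows up with the expected labels."""
        selector = ",".join(f'{k}="{v}"' for k, v in expected_labels.items())
        resp = requests.get(
            f"{STAGING_PROM}/api/v1/query",
            params={"query": f"{metric_name}{{{selector}}}"},
            timeout=10,
        )
        resp.raise_for_status()
        return len(resp.json()["data"]["result"]) > 0

    def smoke_test(metric_name: str, expected_labels: dict) -> None:
        send_sample_traffic()
        time.sleep(30)  # wait out a scrape interval
        if not metric_appears(metric_name, expected_labels):
            raise SystemExit(f"{metric_name} did not appear in staging; blocking deploy")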

Results / Measurements

After introducing these checks, we saw a few concrete improvements:

  • Fewer post-deploy observability regressions. The rate of "we shipped a metric but it doesn't show up in the dashboard" incidents dropped. We went from 2–3 per month to roughly 1 per quarter.
  • Cardinality explosions caught earlier. In one case, a metric with an unbounded label was flagged in CI; fixing it before deploy saved us from adding ~50K unnecessary time series.
  • Cost visibility improved. Teams started asking "is this log line worth the cost?" during review, not after the bill arrived.

We also observed cultural shifts:

  • Engineers treated observability changes as reviewable, not automatic.
  • Pre-deploy checks became part of the normal workflow, not a special gate.

We did see some friction:

  • Early versions of the linter were too strict and flagged valid use cases. We loosened the rules and made exceptions explicit.
  • Some teams felt the checks added too much ceremony for "just adding a counter." We addressed this by making the simplest cases (low-cardinality counters on existing namespaces) an automatic pass.

Takeaways

  • Observability changes are code; they should be tested and validated before production.
  • Simple linting and schema checks catch typos, naming inconsistencies, and accidental high-cardinality metrics.
  • Dry-runs and volume estimates surface cost and performance issues early.
  • Smoke tests in staging ensure new metrics and logs actually work before deploy.
  • Treating telemetry as reviewable code shifts the conversation from "add it and hope" to "add it intentionally."
