Story: the observability costs we didn’t see coming
We treated observability as "basically free" until the bill and the query latencies told a different story. Here is what we changed as a result.
What happened
For a long time, our observability story was simple:
- metrics were cheap
- logs were cheap
- traces were "still being adopted"
We encouraged people to add more telemetry:
- more tags on metrics
- more structured logs
- more traces around new features
The theory was that storage was inexpensive compared to engineering time, and that better visibility would make incidents easier to resolve.
For a while, this held.
Then the graphs—and the bills—started to bend.
The slow change
Nobody added "too much telemetry" in one commit.
Instead, each team made reasonable local decisions:
- add detailed logs around a new risk check
- tag metrics with a few extra dimensions for a launch
- increase trace sampling for a hot path during debugging and forget to turn it back down
Individually, these were defensible.
Collectively, they pushed us into a different regime:
- cardinality grew
- storage increased
- queries slowed down or timed out at the worst moments
The turning point
Two things happened in the same quarter:
- A set of incident reviews noted "we tried to use dashboards X and Y, but they didn’t load under load."
- Finance asked why observability spend had grown faster than traffic or revenue.
We didn’t have good answers.
We had dashboards for product metrics, but not for telemetry about telemetry.
We decided to treat observability itself as something we needed to measure, budget, and design.
What we changed
1. Make telemetry costs and health visible
We built a small set of dashboards that answered basic questions:
- top services by telemetry footprint (metrics series, log volume, trace volume)
- trend lines over the last few quarters
- coarse indicators of query health (slow or failed queries per service)
We did not aim for perfect attribution.
A rough ranking—"these five services drive most of the metrics"—was enough to start conversations.
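To make that concrete, here is a minimal sketch of the ranking behind such a view. The service names and footprint numbers are invented; in practice the inputs would come from whatever your metrics backend, log pipeline, and trace collector can export.

```python
# Rough ranking of services by telemetry footprint.
# Input data is illustrative; swap in real exports from your backends.
from collections import namedtuple

Footprint = namedtuple("Footprint", ["service", "metric_series", "log_gb_per_day", "spans_per_day"])

footprints = [
    Footprint("checkout", 420_000, 310.0, 9_000_000),
    Footprint("search", 95_000, 1_200.0, 2_500_000),
    Footprint("risk", 260_000, 75.0, 14_000_000),
    Footprint("profile", 30_000, 40.0, 600_000),
]

def top_n(rows, key, n=5):
    """Return the n largest rows by the given attribute."""
    return sorted(rows, key=lambda r: getattr(r, key), reverse=True)[:n]

for dimension in ("metric_series", "log_gb_per_day", "spans_per_day"):
    print(f"\nTop services by {dimension}:")
    for row in top_n(footprints, dimension):
        print(f"  {row.service:<10} {getattr(row, dimension):,}")
```

Even a crude table like this pointed at the handful of services worth talking to first.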
2. Introduce telemetry budgets
We wrote down approximate budgets per service:
- an upper bound on time series count
- target log volume ranges
- default trace sampling rates
These were not hard walls.
They were:
- a way to notice when things drifted
- a way to talk about cost and performance in design reviews
Services that exceeded their budgets were not punished.
They were asked to:
- explain what changed
- consider optimizations (e.g., sampling, pruning unused metrics)
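In code or config, a budget like this can stay very small. The following is a minimal sketch under our assumptions; the numbers, the 20% headroom, and the service names are illustrative, and the point is the shape: soft bounds that trigger a conversation, not hard walls.

```python
# Per-service telemetry budget plus a simple drift check.
from dataclasses import dataclass

@dataclass
class TelemetryBudget:
    service: str
    max_metric_series: int            # upper bound on active time series
    target_log_gb_per_day: float      # rough target for daily log volume
    default_trace_sample_rate: float  # e.g. 0.01 == 1% of requests

BUDGETS = {
    "checkout": TelemetryBudget("checkout", 300_000, 250.0, 0.01),
    "risk": TelemetryBudget("risk", 200_000, 100.0, 0.05),
}

def check_drift(budget: TelemetryBudget, series: int, log_gb: float) -> list[str]:
    """Return human-readable notes when usage drifts past the budget."""
    notes = []
    if series > budget.max_metric_series:
        notes.append(f"{budget.service}: {series:,} series exceeds budget of {budget.max_metric_series:,}")
    if log_gb > budget.target_log_gb_per_day * 1.2:  # 20% headroom before we flag
        notes.append(f"{budget.service}: {log_gb:.0f} GB/day of logs vs target {budget.target_log_gb_per_day:.0f}")
    return notes

# Example: surface checkout's drift so it shows up in the next design review.
for note in check_drift(BUDGETS["checkout"], series=420_000, log_gb=310.0):
    print(note)
```

The exact thresholds matter less than the fact that a drift note exists and lands somewhere a human will read it.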
3. Treat high-cardinality data as a decision
We added a review question for new metrics and logs:
- "What is the expected cardinality, and is this data better as a metric, a log, or a trace?"
Common patterns:
- user-level identifiers moved from metrics into sampled logs
- detailed traces used temporarily for debugging, with a plan to turn them down
This helped keep metrics usable while still giving engineers rich data when they needed it.
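As an illustration of that split, here is a standard-library sketch of the pattern; the function and field names are hypothetical, and a real metrics client would replace the in-memory counter. The metric keeps only bounded dimensions, while the user-level identifier goes into a sampled structured log.

```python
# Bounded-cardinality metric + sampled structured log for the high-cardinality detail.
import json
import logging
import random
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("risk_check")

# Stand-in for a metrics client: dimensions are small, fixed sets (here just `decision`).
risk_check_total = Counter()

LOG_SAMPLE_RATE = 0.05  # keep ~5% of detailed events; tune per service

def record_risk_check(user_id: str, decision: str, latency_ms: float) -> None:
    # Metric: never tag with user_id, or the series count grows with the user base.
    risk_check_total[decision] += 1

    # High-cardinality detail (user_id, exact latency) goes into a sampled structured log.
    if random.random() < LOG_SAMPLE_RATE:
        log.info(json.dumps({
            "event": "risk_check",
            "user_id": user_id,
            "decision": decision,
            "latency_ms": latency_ms,
        }))

record_risk_check("user-8472", "allow", 12.3)
print(dict(risk_check_total))
```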
4. Close the loop after incidents
Incident reviews started to include a section:
- "Which telemetry helped?"
- "Which telemetry caused problems (slow queries, missing data, confusing dashboards)?"
We used this to:
- remove or simplify useless metrics and dashboards
- invest in the kinds of telemetry that actually shortened incidents
Telemetry got treated less like "everything we might ever need" and more like a curated toolset.
5. Budget experiments as well as steady state
We also recognized that experiments spike telemetry:
- temporary debug logs
- extra tags
- higher trace sampling
We made experiments explicit:
- call out temporary telemetry changes in design docs
- set review dates to turn them down or remove them
This avoided "temporary" becoming permanent.
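One lightweight way to enforce the review date is to make the temporary change carry its own expiry. A minimal sketch, assuming overrides live in configuration that is reviewed like any other change; the service, owner, and dates are made up.

```python
# Trace-sampling override with an owner and an expiry date.
# Past the expiry the override is ignored and a warning is emitted,
# so "temporary" cannot silently become permanent.
import logging
from datetime import date

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("telemetry.overrides")

DEFAULT_TRACE_SAMPLE_RATE = 0.01

# Declared in one place, reviewed like any other config change.
SAMPLING_OVERRIDES = {
    # service: (rate, owner, expires)
    "checkout": (0.25, "payments-team", date(2024, 7, 1)),
}

def trace_sample_rate(service: str, today: date | None = None) -> float:
    """Return the effective sampling rate, honouring overrides until they expire."""
    today = today or date.today()
    override = SAMPLING_OVERRIDES.get(service)
    if override is None:
        return DEFAULT_TRACE_SAMPLE_RATE
    rate, owner, expires = override
    if today > expires:
        log.warning("Expired sampling override for %s (owner: %s); using default.", service, owner)
        return DEFAULT_TRACE_SAMPLE_RATE
    return rate

print(trace_sample_rate("checkout", today=date(2024, 6, 1)))  # 0.25 while the experiment runs
print(trace_sample_rate("checkout", today=date(2024, 8, 1)))  # warns, falls back to 0.01
```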
Takeaways
- Observability is infrastructure; it deserves design, budgets, and reviews.
- Costs show up both in bills and in query performance.
- Rough per-service budgets and dashboards are enough to start steering, even without perfect attribution.
- Incident reviews should ask which telemetry actually helped; that’s where investment should go next.