Story: the observability costs we didn’t see coming
We treated observability as "basically free" until the bill and the query latencies told a different story. Here is what we changed as a result.
What happened
For a long time, our observability story was simple:
- metrics were cheap
- logs were cheap
- traces were "still being adopted"
We encouraged people to add more telemetry:
- more tags on metrics
- more structured logs
- more traces around new features
The theory was that storage was inexpensive compared to engineering time, and that better visibility would make incidents easier to resolve.
For a while, this held.
Then the graphs—and the bills—started to bend.
The slow change
Nobody added "too much telemetry" in one commit.
Instead, each team made reasonable local decisions:
- add detailed logs around a new risk check
- tag metrics with a few extra dimensions for a launch
- increase trace sampling for a hot path during debugging and forget to turn it back down
Individually, these were defensible.
Collectively, they pushed us into a different regime:
- cardinality grew
- storage increased
- queries slowed down or timed out at the worst moments
The turning point
Two things happened in the same quarter:
- A set of incident reviews noted "we tried to use dashboards X and Y, but they didn’t load under load."
- Finance asked why observability spend had grown faster than traffic or revenue.
We didn’t have good answers.
We had dashboards for product metrics, but not for telemetry about telemetry.
We decided to treat observability itself as something we needed to measure, budget, and design.
What we changed
1. Make telemetry costs and health visible
We built a small set of dashboards that answered basic questions:
- top services by telemetry footprint (metrics series, log volume, trace volume)
- trend lines over the last few quarters
- coarse indicators of query health (slow or failed queries per service)
We did not aim for perfect attribution.
A rough ranking—"these five services drive most of the metrics"—was enough to start conversations.
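To make that concrete, here is a minimal sketch of the ranking behind such a view. The service names and footprint numbers are invented; in practice the inputs would come from whatever your metrics backend, log pipeline, and trace collector can export.

```python
# Rough ranking of services by telemetry footprint.
# Input data is illustrative; swap in real exports from your backends.
from collections import namedtuple

Footprint = namedtuple("Footprint", ["service", "metric_series", "log_gb_per_day", "spans_per_day"])

footprints = [
    Footprint("checkout", 420_000, 310.0, 9_000_000),
    Footprint("search", 95_000, 1_200.0, 2_500_000),
    Footprint("risk", 260_000, 75.0, 14_000_000),
    Footprint("profile", 30_000, 40.0, 600_000),
]

def top_n(rows, key, n=5):
    """Return the n largest rows by the given attribute."""
    return sorted(rows, key=lambda r: getattr(r, key), reverse=True)[:n]

for dimension in ("metric_series", "log_gb_per_day", "spans_per_day"):
    print(f"\nTop services by {dimension}:")
    for row in top_n(footprints, dimension):
        print(f"  {row.service:<10} {getattr(row, dimension):,}")
```

Even a crude table like this pointed at the handful of services worth talking to first.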
2. Introduce telemetry budgets
We wrote down approximate budgets per service:
- an upper bound on time series count
- target log volume ranges
- default trace sampling rates
These were not hard walls.
They were:
- a way to notice when things drifted
- a way to talk about cost and performance in design reviews
Services that exceeded their budgets were not punished.
They were asked to:
- explain what changed
- consider optimizations (e.g., sampling, pruning unused metrics)
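In code or config, a budget like this can stay very small. The following is a minimal sketch under our assumptions; the numbers, the 20% headroom, and the service names are illustrative, and the point is the shape: soft bounds that trigger a conversation, not hard walls.

```python
# Per-service telemetry budget plus a simple drift check.
from dataclasses import dataclass

@dataclass
class TelemetryBudget:
    service: str
    max_metric_series: int            # upper bound on active time series
    target_log_gb_per_day: float      # rough target for daily log volume
    default_trace_sample_rate: float  # e.g. 0.01 == 1% of requests

BUDGETS = {
    "checkout": TelemetryBudget("checkout", 300_000, 250.0, 0.01),
    "risk": TelemetryBudget("risk", 200_000, 100.0, 0.05),
}

def check_drift(budget: TelemetryBudget, series: int, log_gb: float) -> list[str]:
    """Return human-readable notes when usage drifts past the budget."""
    notes = []
    if series > budget.max_metric_series:
        notes.append(f"{budget.service}: {series:,} series exceeds budget of {budget.max_metric_series:,}")
    if log_gb > budget.target_log_gb_per_day * 1.2:  # 20% headroom before we flag
        notes.append(f"{budget.service}: {log_gb:.0f} GB/day of logs vs target {budget.target_log_gb_per_day:.0f}")
    return notes

# Example: surface checkout's drift so it shows up in the next design review.
for note in check_drift(BUDGETS["checkout"], series=420_000, log_gb=310.0):
    print(note)
```

The exact thresholds matter less than the fact that a drift note exists and lands somewhere a human will read it.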
3. Treat high-cardinality data as a decision
We added a review question for new metrics and logs:
- "What is the expected cardinality, and is this data better as a metric, a log, or a trace?"
Common patterns:
- user-level identifiers moved from metrics into sampled logs
- detailed traces used temporarily for debugging, with a plan to turn them down
This helped keep metrics usable while still giving engineers rich data when they needed it.
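As an illustration of that split, here is a standard-library sketch of the pattern; the function and field names are hypothetical, and a real metrics client would replace the in-memory counter. The metric keeps only bounded dimensions, while the user-level identifier goes into a sampled structured log.

```python
# Bounded-cardinality metric + sampled structured log for the high-cardinality detail.
import json
import logging
import random
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("risk_check")

# Stand-in for a metrics client: dimensions are small, fixed sets (here just `decision`).
risk_check_total = Counter()

LOG_SAMPLE_RATE = 0.05  # keep ~5% of detailed events; tune per service

def record_risk_check(user_id: str, decision: str, latency_ms: float) -> None:
    # Metric: never tag with user_id, or the series count grows with the user base.
    risk_check_total[decision] += 1

    # High-cardinality detail (user_id, exact latency) goes into a sampled structured log.
    if random.random() < LOG_SAMPLE_RATE:
        log.info(json.dumps({
            "event": "risk_check",
            "user_id": user_id,
            "decision": decision,
            "latency_ms": latency_ms,
        }))

record_risk_check("user-8472", "allow", 12.3)
print(dict(risk_check_total))
```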
4. Close the loop after incidents
Incident reviews started to include a section:
- "Which telemetry helped?"
- "Which telemetry caused problems (slow queries, missing data, confusing dashboards)?"
We used this to:
- remove or simplify useless metrics and dashboards
- invest in the kinds of telemetry that actually shortened incidents
Telemetry got treated less like "everything we might ever need" and more like a curated toolset.
5. Budget experiments as well as steady state
We also recognized that experiments spike telemetry:
- temporary debug logs
- extra tags
- higher trace sampling
We made experiments explicit:
- call out temporary telemetry changes in design docs
- set review dates to turn them down or remove them
This avoided "temporary" becoming permanent.
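One lightweight way to enforce the review date is to make the temporary change carry its own expiry. A minimal sketch, assuming overrides live in configuration that is reviewed like any other change; the service, owner, and dates are made up.

```python
# Trace-sampling override with an owner and an expiry date.
# Past the expiry the override is ignored and a warning is emitted,
# so "temporary" cannot silently become permanent.
import logging
from datetime import date

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("telemetry.overrides")

DEFAULT_TRACE_SAMPLE_RATE = 0.01

# Declared in one place, reviewed like any other config change.
SAMPLING_OVERRIDES = {
    # service: (rate, owner, expires)
    "checkout": (0.25, "payments-team", date(2024, 7, 1)),
}

def trace_sample_rate(service: str, today: date | None = None) -> float:
    """Return the effective sampling rate, honouring overrides until they expire."""
    today = today or date.today()
    override = SAMPLING_OVERRIDES.get(service)
    if override is None:
        return DEFAULT_TRACE_SAMPLE_RATE
    rate, owner, expires = override
    if today > expires:
        log.warning("Expired sampling override for %s (owner: %s); using default.", service, owner)
        return DEFAULT_TRACE_SAMPLE_RATE
    return rate

print(trace_sample_rate("checkout", today=date(2024, 6, 1)))  # 0.25 while the experiment runs
print(trace_sample_rate("checkout", today=date(2024, 8, 1)))  # warns, falls back to 0.01
```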
Takeaways
- Observability is infrastructure; it deserves design, budgets, and reviews.
- Costs show up both in bills and in query performance.
- Rough per-service budgets and dashboards are enough to start steering, even without perfect attribution.
- Incident reviews should ask which telemetry actually helped; that’s where investment should go next.