Telemetry budgets for small services
How we introduced simple telemetry budgets so small services stay observable without surprising costs or overload.
When a service is small, it’s easy to treat telemetry as "basically free."
Add a dashboard here, a new metric there, a few extra log lines "just in case." Storage is cheap. Queries feel fast enough.
A year later, the same service accounts for a non-trivial slice of our metrics bill and a surprising amount of CPU on the metrics cluster. Queries that used to be instant now time out when we need them most.
This post is about how we added budgets to telemetry for small services without making engineers feel like they were negotiating every counter and log line.
Constraints
- We did not want a central observability team micromanaging every metric.
- We had a shared metrics and logging stack; one noisy service could degrade the experience for everyone else.
- Some teams had no dedicated on-call rotation; they depended heavily on simple, reliable dashboards.
- We were not prepared to introduce a new vendor or rewrite the entire telemetry layer.
- Our billing summaries were coarse; we could not perfectly attribute every byte.
Culturally:
- Engineers saw telemetry as safety. Any change that looked like "less telemetry" was going to be resisted.
- Teams had different levels of experience reading cost reports and performance graphs.
What we changed
We introduced a few simple ideas:
- Per-service telemetry budgets.
- A "starter pack" of metrics and logs.
- Guardrails for high-cardinality data.
- Lightweight reviews for big changes.
1. Per-service budgets
Instead of pretending telemetry was unbounded, we set target ranges per service for:
- time series count (rough number of active metric series)
- log volume (GB/day)
- trace volume (if applicable)
The goal was not perfect accounting. The goal was a shared sense of "roughly this much is reasonable."
For a small service, a starting point looked like:
- Metrics: a few hundred active series.
- Logs: enough to capture requests, errors, and key events, but not every minor detail.
We wrote these down next to the service’s SLOs.
When telemetry usage drifted above the range for more than a sprint or two, we treated it like any other reliability regression: a thing to understand and correct.
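To make "written down next to the SLOs" concrete, here is a minimal sketch of what a budget record and a drift check could look like. The service name, numbers, and helper are hypothetical; in practice the "check" was usually a human glancing at a dashboard during the monthly telemetry review, not an automated gate.

```python
from dataclasses import dataclass

# Hypothetical shape of a per-service telemetry budget. The numbers below are
# illustrative starting points for a small service, not our real values.
@dataclass
class TelemetryBudget:
    service: str
    max_active_series: int                  # rough count of active metric series
    max_log_gb_per_day: float               # logs shipped per day
    max_traces_per_min: int | None = None   # only for services that emit traces

BUDGETS = {
    "billing-sync": TelemetryBudget("billing-sync",
                                    max_active_series=500,
                                    max_log_gb_per_day=2.0),
}

def budget_violations(budget: TelemetryBudget,
                      active_series: int,
                      log_gb_per_day: float) -> list[str]:
    """Human-readable list of budget overruns (empty when within budget)."""
    violations = []
    if active_series > budget.max_active_series:
        violations.append(f"{budget.service}: {active_series} active series "
                          f"(budget {budget.max_active_series})")
    if log_gb_per_day > budget.max_log_gb_per_day:
        violations.append(f"{budget.service}: {log_gb_per_day:.1f} GB/day of logs "
                          f"(budget {budget.max_log_gb_per_day})")
    return violations
```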
2. A starter pack for new services
New services got a small, opinionated set of signals by default:
- request rate, error rate, P95 latency
- saturation metrics relevant to that service (e.g., queue depth, DB connections)
- a structured request log with a bounded set of fields
Teams could add more, but they had to be explicit about what decision the new metric or log would support during an incident.
This prevented the "blank dashboard" problem and the "everything is a metric" problem.
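As a rough illustration of the starter pack, here is a minimal sketch using the Python prometheus_client and the standard logging module as stand-ins for whatever clients your stack provides. Metric names, labels, and log fields are illustrative, not our exact schema; p95 latency is derived from the histogram at query time rather than emitted directly.

```python
import json
import logging
import time

from prometheus_client import Counter, Gauge, Histogram

# --- Starter-pack metrics (names and label sets are illustrative) ---
REQUESTS = Counter("http_requests_total", "Requests handled",
                   ["method", "route", "status"])
# p95 latency comes from this histogram at query time
# (e.g. histogram_quantile in a Prometheus-style backend).
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                    ["route"])
# Saturation signal relevant to this particular service.
QUEUE_DEPTH = Gauge("work_queue_depth", "Items waiting in the work queue")

# --- Starter-pack request log: a bounded, explicit set of fields ---
log = logging.getLogger("request")

def observe_request(method: str, route: str, status: int, started_at: float) -> None:
    """Record one request against the starter-pack metrics and request log."""
    duration = time.monotonic() - started_at
    REQUESTS.labels(method=method, route=route, status=str(status)).inc()
    LATENCY.labels(route=route).observe(duration)
    log.info(json.dumps({
        "event": "request",
        "method": method,
        "route": route,
        "status": status,
        "duration_ms": round(duration * 1000, 1),
    }))
```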
3. Guardrails for cardinality
We introduced two concrete rules:
- No unbounded user identifiers in metrics labels.
- No high-cardinality fields in default log views.
If someone wanted to add a metric tagged by user ID, we asked:
- Can this be a log field instead, sampled or scoped?
- Can we aggregate by a lower-cardinality key (plan, region, tier)?
In logs, we defaulted to a small schema and made richer debugging fields opt-in (e.g., only at higher log levels or sampling rates).
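Here is a minimal sketch of the two guardrails in practice, again with prometheus_client as a stand-in. The metric, the plan/region labels, and the 1% sample rate are all hypothetical; the point is that bounded keys go into metric labels while the user ID goes into a sampled log line.

```python
import json
import logging
import random

from prometheus_client import Counter

log = logging.getLogger("request")

# Guardrail 1: aggregate metrics by a bounded key (plan, region), never by user ID.
# A per-user label would create one time series per user and blow the budget.
CHECKOUTS = Counter("checkouts_total", "Completed checkouts", ["plan", "region"])

DEBUG_SAMPLE_RATE = 0.01  # hypothetical: keep ~1% of the rich debugging detail

def record_checkout(user_id: str, plan: str, region: str) -> None:
    CHECKOUTS.labels(plan=plan, region=region).inc()

    # Guardrail 2: high-cardinality detail goes to logs, sampled and opt-in,
    # rather than into metric labels or the default log view.
    if random.random() < DEBUG_SAMPLE_RATE:
        log.info(json.dumps({
            "event": "checkout_debug",
            "user_id": user_id,
            "plan": plan,
            "region": region,
        }))
```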
4. Lightweight reviews
We added a small section to our change templates for telemetry-heavy changes:
- What new metrics/logs are added?
- What cardinality do we expect?
- How do we turn this down or off if it misbehaves?
Reviews focused on two questions:
- Will this help us answer a concrete question during an incident?
- Does it respect the service’s telemetry budget?
We did not ban experiments. We time-boxed them:
- Experimental metrics/logs get a review date by which we either keep them (and adjust budgets) or remove them.
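A minimal sketch of how such a time-box might be recorded. The registry format, metric name, and review date are illustrative; ours was closer to a table in the service docs than to code, but the idea is the same: every experimental signal carries a date by which someone must decide to keep it or remove it.

```python
from datetime import date

# Hypothetical registry of experimental telemetry, kept next to the service's budget.
# Each entry records what was added, what question it answers, and when we must
# decide to keep it (and adjust the budget) or remove it.
EXPERIMENTAL_TELEMETRY = [
    {
        "name": "cache_eviction_reasons_total",
        "question": "Which eviction reason dominates during the cache-tuning rollout?",
        "review_by": date(2024, 6, 1),  # illustrative date
    },
]

def overdue_experiments(today: date | None = None) -> list[str]:
    """Names of experimental metrics/logs that are past their review date."""
    today = today or date.today()
    return [e["name"] for e in EXPERIMENTAL_TELEMETRY if e["review_by"] < today]
```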
Results / Measurements
We tracked a few indicators before and after introducing budgets:
- Per-service series count. For a sample of small services, active time series counts flattened instead of steadily climbing. One service that had been adding ~15–20% new series per quarter leveled off after we removed unused metrics and applied the starter pack. (A sketch of how this can be counted follows this list.)
- Query reliability. Dashboards that used to time out during incidents became usable again after we reduced high-cardinality metrics. We saw the percentage of failed or slow (>10s) queries on our metrics backend drop noticeably.
- Cost visibility. Our monthly telemetry review went from "we’re up again" to "we’re up by N%, mostly from these three services and these two new kinds of metrics." That made the next set of decisions concrete.
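For the series-count indicator, here is a minimal sketch of how the number can be pulled, assuming a Prometheus-compatible backend and a `service` label on every metric; both are assumptions about the stack, and the URL is hypothetical.

```python
import requests  # third-party: pip install requests

PROM_URL = "http://prometheus:9090"  # hypothetical address of the metrics backend

def active_series_count(service: str) -> int:
    """Rough count of active time series carrying this service's label."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f'count({{service="{service}"}})'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0
```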
Perhaps most importantly, engineers started to treat telemetry as something that could regress and be fixed.
Instead of adding metrics indefinitely, teams:
- removed unused dashboards
- consolidated similar metrics
- turned off verbose debug logs once a launch had stabilized
We did see some friction early on:
- A few teams felt the budgets were arbitrary. We had to show actual query performance improvements to make the benefit visible.
- Some engineers were concerned that budgets would prevent them from debugging. In practice, the reviews focused on shifting detail into logs and traces where appropriate, not on removing visibility.
Takeaways
- Telemetry is not free, even for small services. It consumes storage, compute, and human attention.
- Simple per-service budgets help teams make better trade-offs without central micromanagement.
- A small, opinionated starter pack of metrics and logs prevents both under- and over-instrumentation.
- Guardrails on cardinality and experimental metrics keep the system healthy when enthusiasm is high.
- Treating telemetry drift as a regression—just like latency or error rate—turns cost conversations into engineering work instead of blame.