RELIABILITY · 2019-10-10 · BY PRIYA PATEL

Cardinality is a budget

Telemetry that can’t be queried is just expensive noise. We treat cardinality and log volume as budgets.

reliability · observability · monitoring · logging · metrics

The first clue we were in trouble wasn’t an alert.

It was that the dashboard stopped loading.

Not because the monitoring vendor was down. Because we had built a telemetry system that could generate data faster than a tired human could query it.

Constraints

Observability has two kinds of cost:

  • Machine cost: ingestion, storage, query latency.
  • Human cost: attention, confusion, false confidence.

Cardinality is where both costs hide.

If you put an unbounded value (user ID, request ID, email, session token) into a metric label, you don’t get “better visibility.” You get an unbounded number of time series.

A small example:

  • requests_total{route="/checkout"} is one series.
  • requests_total{route="/checkout", user_id="…"} is as many series as you have users.

You can’t page a person with “we have 400,000 time series now.” But you will feel it as slow dashboards, missing context, and rising costs.
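
A minimal sketch of the bounded version, using the Python prometheus_client library; the metric, function, and label names here are illustrative, not the ones from our dashboards.

  from prometheus_client import Counter

  # Bounded labels: the total series count is the product of label value counts.
  # ~50 route templates x 5 status classes = a few hundred series, not millions.
  REQUESTS = Counter(
      "requests_total",
      "HTTP requests handled",
      ["route", "status_class"],
  )

  def record_request(route_template: str, status_code: int) -> None:
      status_class = f"{status_code // 100}xx"  # 200 -> "2xx", 503 -> "5xx"
      REQUESTS.labels(route=route_template, status_class=status_class).inc()
      # Adding user_id as a label here would mint one new series per user.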

Logs have the same failure mode. A log line added “for debugging” can become the dominant workload if it fires on every request.

A subtle trap is trying to solve every question with metrics.

Metrics are great for “is it broken?” and “how much?”

They are terrible for “which exact request?” unless you turn them into an index of unique IDs.

So we keep our metrics boring and fast, and we use logs/traces for detail.

That keeps the incident room in one place: a quick dashboard for the first decision, then a smaller set of deep links when you need to zoom in.

If you need request-level detail, treat it as a log/trace concern. Keep the metric labels bounded so the dashboards stay fast.
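
For the “how much?” side, a bounded histogram usually covers it. Another illustrative sketch with prometheus_client:

  from prometheus_client import Histogram

  # One bounded label; the buckets answer "how slow?", not "which request?".
  REQUEST_SECONDS = Histogram(
      "request_duration_seconds",
      "Request latency in seconds",
      ["route"],
  )

  def observe_latency(route_template: str, seconds: float) -> None:
      REQUEST_SECONDS.labels(route=route_template).observe(seconds)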

We also watch out for “bounded in theory” labels:

  • error messages that include user input
  • raw exception strings
  • route labels that include IDs because a router wasn’t configured

Those look bounded until they aren’t.
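
One cheap defense, sketched below with an illustrative helper: label on the exception type or an explicit reason code, never on the rendered message.

  def error_class(exc: Exception) -> str:
      # The type name is bounded by your codebase; str(exc) is not, because
      # messages often embed user input, IDs, or raw query fragments.
      return type(exc).__name__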

What we changed

We started treating telemetry like production work with explicit budgets.

  1. We banned unbounded labels in metrics.
  • no user IDs, request IDs, emails, or raw URLs
  • prefer bounded buckets (status code, route template, tier)
  2. We made “who uses this?” a required question in review. If we can’t answer:
  • who will look at this at 2am
  • what decision it supports
  • what “bad” looks like

…we don’t ship it.

  3. We moved high-volume logs behind sampling (a sketch follows this list).
  • sampling is explicit, not accidental
  • the default path stays cheap
  4. We made telemetry changes reversible. A safe rollback for telemetry is often “turn it down” (lower log level, reduce sampling rate, disable one emitter), not “roll back the whole deploy.”
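
A minimal sketch of what 3 and 4 can look like together, assuming the rate comes from something runtime-adjustable (the environment variable below is a stand-in for whatever dynamic config or flag system you already run):

  import logging
  import os
  import random

  log = logging.getLogger("checkout")

  def debug_sample_rate() -> float:
      # Read the knob at call time so it can be turned down without a deploy.
      # CHECKOUT_DEBUG_SAMPLE_RATE is illustrative; in practice this would come
      # from a live config or feature-flag service rather than the environment.
      return float(os.environ.get("CHECKOUT_DEBUG_SAMPLE_RATE", "0.01"))

  def log_checkout_debug(request_id: str, detail: str) -> None:
      # Explicit sampling: the high-volume line is deliberately opt-in.
      if random.random() < debug_sample_rate():
          log.info("checkout debug request_id=%s detail=%s", request_id, detail)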

The questions we ask in review

Before a new metric label, log field, or trace attribute ships, we answer:

  • what is the expected cardinality (order of magnitude)?
  • is it bounded? if not, why is this not a log we can sample?
  • what decision will this support at 2am?
  • what happens if queries get slow during an incident?
  • can we turn it down without a deploy (sampling knob, level, kill switch)?

We also treat route naming as observability work.

“Raw URL” is unbounded. “Route template” is bounded.

If we don’t have templates, we create them.
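
When the framework can’t hand us a template, even a crude normalizer keeps the label bounded. A sketch; the patterns and the “other” fallback are illustrative:

  import re

  # Most specific patterns first; anything unmatched collapses into one bucket.
  _ROUTE_PATTERNS = [
      (re.compile(r"^/users/\d+/orders/\d+$"), "/users/{id}/orders/{id}"),
      (re.compile(r"^/users/\d+$"), "/users/{id}"),
      (re.compile(r"^/checkout$"), "/checkout"),
  ]

  def route_template(raw_path: str) -> str:
      # Map a raw URL path onto a bounded route template for metric labels.
      for pattern, template in _ROUTE_PATTERNS:
          if pattern.match(raw_path):
              return template
      return "other"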

The budgets we actually write down

We don’t need perfect numbers, but we do need guardrails:

  • a maximum series budget per service for topline metrics
  • a default sampling rate for high-volume logs and traces
  • a denylist of fields we never emit (PII, tokens, emails)
  • a plan for correlation (trace IDs / request IDs in logs, not as metric labels)
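
The budget itself can live in a checked-in file so reviews have something concrete to point at. A sketch; every number and field name below is an example, not a recommendation:

  # telemetry_budget.py -- illustrative guardrails for one service.
  TELEMETRY_BUDGET = {
      "max_active_series": 2_000,           # topline metrics for this service
      "default_log_sample_rate": 0.01,      # high-volume debug logs
      "default_trace_sample_rate": 0.05,
      "never_emit": ["email", "auth_token", "session_cookie", "raw_request_body"],
      # Correlation lives in logs and traces, never in metric labels.
      "correlation_fields": ["request_id", "trace_id"],
  }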

When someone genuinely needs request-level detail, we push that detail into logs and traces:

  • metrics tell us checkout is failing
  • logs/traces tell us which request is failing and why

That separation keeps dashboards fast without removing the ability to investigate.
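
At the call site that split stays small. A sketch using prometheus_client and the standard logging module; reason_class and the field names are illustrative:

  import logging

  from prometheus_client import Counter

  log = logging.getLogger("checkout")

  # Bounded: enough to alert on and graph.
  CHECKOUT_FAILURES = Counter(
      "checkout_failures_total", "Failed checkouts", ["reason_class"]
  )

  def record_checkout_failure(reason_class: str, request_id: str, trace_id: str) -> None:
      CHECKOUT_FAILURES.labels(reason_class=reason_class).inc()
      # The unbounded detail goes to the log, where it can be sampled and searched.
      log.error(
          "checkout failed reason_class=%s request_id=%s trace_id=%s",
          reason_class, request_id, trace_id,
      )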

A small trick that keeps us honest: when we add a new label, we write the expected cardinality right in the PR.

“route_template (≈50), status_class (≈5), region (≈3)” is reviewable: the worst case is the product, roughly 50 × 5 × 3 = 750 series per metric.

“user_id (unknown)” is a red flag.

It also forces a useful question: what is the bounded alternative?

We also set a rough SLO for observability: the first dashboard should load fast enough to use during a page.

If we can’t query it quickly, we reduce breakdowns until we can.

When we really need per-user slices, we keep that in logs behind sampling and in traces behind rate limits. Metrics stay boring.

Fast beats clever during a page, and bounded beats unbounded always.

When we break a budget, we roll back the telemetry change first. Telemetry that makes the dashboard unusable is not “more visibility.” It’s an outage multiplier.

Results / Measurements

The results were boring in the best way:

  • dashboards became fast enough to use during incidents
  • we stopped losing time to “is the monitoring broken?” when the system was broken
  • log ingestion stopped competing with request handling

The easiest metric to track was operational, not philosophical:

  • did a telemetry change ever become the incident?

When the answer trends toward “no,” your observability is helping.

Takeaways

Cardinality is a budget. If you don’t set one, you will pay the bill later.

Telemetry should be reviewable, reversible, and tied to a decision.

If you need a request ID, put it in logs where you can sample it—don’t explode your metrics trying to make them do everything.
