Incident report: Cost alert that arrived too late
A misconfigured batch job and a slow-to-fire cost alert let a runaway backfill spend much more than intended before we noticed. We describe what happened and how we changed our approach.
Summary
On May 14, 2025, a misconfigured batch job and a slow-to-fire cost alert combined to produce an unwelcome surprise.
A routine backfill that was supposed to:
- run in off-peak hours
- process a limited set of records
- stay within a modest cost budget
instead:
- ran longer than expected
- touched more data than intended
- generated infrastructure and third-party costs significantly above plan
The technical impact on users was minimal.
The impact on our cost budget was not.
We treated this as an incident because it revealed that our cost observability was not fast or detailed enough for certain classes of jobs.
Impact
- Duration: several hours of unintended extra spend before we stopped the job.
- User impact:
- none directly; the job ran on background infrastructure
- some minor contention with other jobs sharing the same resources
- Internal impact:
- overspend in one cost category beyond its planned budget for the month
- time spent reconciling and explaining the spike
Timeline
All times local.
- 01:05 — Backfill job starts, expected to take ~60–90 minutes.
- 01:40 — Metrics show higher-than-expected throughput; no alerts are configured at this level for cost.
- 03:10 — Job is still running; resource usage is elevated but within capacity limits.
- 04:32 — A coarse-grained cost alert triggers, based on a %-over-baseline threshold aggregated at the daily level.
- 04:40 — On-call acknowledges the cost alert and correlates it with the long-running backfill.
- 04:52 — Job is paused and then terminated after verifying partial progress.
- 05:20 — Quick analysis estimates spend significantly above the initially approved range.
- Later that day — Incident review scheduled to focus on cost observability and job configuration.
Root cause
The root cause was a mismatch between:
- the speed at which the backfill could generate cost
- the speed and granularity of our cost alerts
The job:
- was configured with parameters that allowed it to process more data than originally planned
- ran in a region and configuration that were more expensive per unit of work
Our cost alerts:
- were tuned for slower-moving baseline shifts, not fast spikes from a single job
- used daily aggregates and percentage thresholds that delayed notification
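To make the mismatch concrete, here is a small illustrative sketch with made-up numbers (not our actual spend, rates, or thresholds) comparing when a daily percentage-over-baseline alert crosses versus a per-job budget during a fast spike:

```python
# Illustrative sketch with made-up numbers; not our real spend, rates, or thresholds.
BASELINE_PER_HOUR = 20.0        # normal background spend in this category
JOB_SPEND_PER_HOUR = 90.0       # extra spend from the runaway backfill
DAILY_BASELINE = BASELINE_PER_HOUR * 24
DAILY_ALERT_THRESHOLD = DAILY_BASELINE * 1.5   # daily alert: 50% over baseline
JOB_BUDGET = 60.0                              # approved spend for the backfill

day_total = 0.0
job_total = 0.0
daily_alert_hour = None
job_alert_hour = None

for hour in range(1, 25):
    day_total += BASELINE_PER_HOUR + JOB_SPEND_PER_HOUR
    job_total += JOB_SPEND_PER_HOUR
    if daily_alert_hour is None and day_total > DAILY_ALERT_THRESHOLD:
        daily_alert_hour = hour
    if job_alert_hour is None and job_total > JOB_BUDGET:
        job_alert_hour = hour

print(f"daily %-over-baseline alert fires after ~{daily_alert_hour} h of overspend")
print(f"per-job budget alert fires after ~{job_alert_hour} h of overspend")
```

With numbers like these, the aggregate alert is working exactly as designed and still arrives hours after a per-job budget would have.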
Contributing factors
- Insufficient rate limits in the job framework. The backfill framework was capable of respecting throughput budgets but was not configured to do so here.
- Loose approval process for job parameters. The job’s configuration deviated from the original plan without a clear review.
- Coarse cost telemetry. We had good monthly and daily visibility, but not enough near-real-time per-job cost signals.
What we changed
1. Introduce per-job cost budgets
We added the concept of a cost budget to our backfill framework:
- jobs specify an expected cost range (or a proxy, like compute-hours or data processed)
- the framework tracks progress against this budget
- if a job exceeds its budget, it:
- slows down
- raises a specific alert
This doesn’t replace aggregate cost monitoring; it adds a job-level safety net.
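A minimal sketch of that safety net, assuming a simplified framework; the class and function names here are illustrative, not our actual backfill API:

```python
import time
from typing import Callable, Iterable


class CostBudget:
    """Tracks a job's spend against an approved budget (illustrative sketch)."""

    def __init__(self, expected_cost: float, hard_limit_factor: float = 1.5):
        self.expected_cost = expected_cost
        self.hard_limit = expected_cost * hard_limit_factor
        self.spent = 0.0

    def record(self, cost: float) -> None:
        self.spent += cost

    def over_budget(self) -> bool:
        return self.spent > self.expected_cost

    def over_hard_limit(self) -> bool:
        return self.spent > self.hard_limit


def run_backfill(
    batches: Iterable,
    process: Callable,                 # does the actual work for one batch
    cost_per_batch: float,             # cost proxy per batch (e.g. compute-hours)
    budget: CostBudget,
    alert: Callable[[str], None],
    slowdown_seconds: float = 30.0,
) -> None:
    """Process batches, throttling and alerting once the budget is exceeded."""
    alerted = False
    for batch in batches:
        process(batch)
        budget.record(cost_per_batch)

        if budget.over_hard_limit():
            alert("backfill exceeded its hard cost limit; stopping early")
            return
        if budget.over_budget():
            if not alerted:
                alert("backfill exceeded its expected cost budget; throttling")
                alerted = True
            time.sleep(slowdown_seconds)   # back off instead of racing ahead
```

In this shape, the expected cost comes from the approved plan, so a job whose parameters drift from that plan shows up as a budget breach within the run rather than as a billing surprise afterwards.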
2. Improve near-real-time cost signals
We worked with our infrastructure and billing tools to:
- derive more timely cost proxies for certain workloads (e.g., storage operations, API calls)
- expose these proxies as metrics and dashboards per job or per service
Alerts for high-risk jobs can now:
- trigger on these proxies within an hour or less
- complement slower, billing-based alerts
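As a rough sketch, a cost proxy is just usage counters multiplied by unit prices and emitted as a per-job metric; the prices and metric name below are placeholders, not our real rates or dashboards:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Placeholder unit prices; real values come from published pricing or billing exports.
UNIT_PRICES = {
    "storage_read_ops": 0.0000004,   # per read operation
    "storage_write_ops": 0.000005,   # per write operation
    "api_calls": 0.00001,            # per third-party API call
    "compute_vcpu_hours": 0.05,      # per vCPU-hour
}


@dataclass
class UsageSample:
    """Usage counters for one job over one short interval (e.g. five minutes)."""
    job_id: str
    counters: Dict[str, float]


def estimated_cost(sample: UsageSample) -> float:
    """Convert raw usage counters into an approximate cost for the interval."""
    return sum(count * UNIT_PRICES.get(name, 0.0) for name, count in sample.counters.items())


def emit_cost_proxy(sample: UsageSample, emit: Callable) -> None:
    """Publish the proxy so per-job dashboards and alerts see cost within minutes."""
    emit("job_estimated_cost_dollars", estimated_cost(sample), {"job_id": sample.job_id})
```

The proxy will never match the invoice exactly, and it does not need to; it only needs to be accurate enough to notice a job burning money far faster than planned.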
3. Tighten review for large jobs
We added a small checklist for jobs that:
- run against large datasets
- can generate significant external or internal costs
The checklist covers:
- expected duration and cost
- limits (time, data volume, budget)
- monitoring and stop conditions
Jobs that don’t meet the checklist require changes or explicit approvals before running.
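We also encode the hard limits from the checklist into the job template, so a missing limit blocks the job rather than relying on reviewers to catch it. A simplified sketch with illustrative field names:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LargeJobSpec:
    """Fields the review checklist requires before a large job may run (illustrative)."""
    name: str
    expected_duration_hours: float
    expected_cost: float
    max_runtime_hours: Optional[float] = None
    max_records: Optional[int] = None
    cost_budget: Optional[float] = None
    stop_condition: Optional[str] = None   # e.g. "pause if budget exceeded"
    approved_by: Optional[str] = None


def missing_checklist_items(spec: LargeJobSpec) -> List[str]:
    """Return unmet checklist items; an empty list means the job may run."""
    missing = []
    if spec.max_runtime_hours is None:
        missing.append("time limit")
    if spec.max_records is None:
        missing.append("data volume limit")
    if spec.cost_budget is None:
        missing.append("cost budget")
    if spec.stop_condition is None:
        missing.append("monitoring/stop condition")
    if spec.approved_by is None:
        missing.append("explicit approval")
    return missing
```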
4. Make cost part of incident drills
Just as we run reliability drills, we started running small "cost drills":
- simulate a misconfigured job in a lower environment
- observe how quickly and clearly the signals appear
- tune alerts and dashboards based on what we learn
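A drill can be as small as replaying a deliberately over-budget synthetic job through the same safety net and checking which alerts fire and how quickly. A sketch that reuses the CostBudget and run_backfill idea from the earlier example (all numbers synthetic):

```python
def cost_drill() -> None:
    """Run a deliberately over-budget synthetic job and report which alerts fired."""
    alerts = []
    budget = CostBudget(expected_cost=10.0)   # from the per-job budget sketch above

    # Simulate a "misconfigured" job: far more batches than the budget allows.
    run_backfill(
        batches=range(40),
        process=lambda batch: None,     # no real work during the drill
        cost_per_batch=1.0,
        budget=budget,
        alert=alerts.append,
        slowdown_seconds=0.0,           # don't actually sleep in the drill
    )

    print(f"spent {budget.spent:.2f} against a budget of {budget.expected_cost:.2f}")
    for message in alerts:
        print(f"alert: {message}")
```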
Follow-ups
Completed
- Added per-job budgets and alerts to the backfill framework.
- Improved cost proxies and metrics for the relevant infrastructure components.
- Updated job templates and checklists for high-cost operations.
Planned / in progress
- Extend per-job budgets to more classes of batch work.
- Integrate cost signals more deeply into planning and review for large projects.
Takeaways
- Cost incidents can be as important to treat systematically as reliability incidents.
- Jobs that can generate large costs quickly need their own budgets and alerts.
- Coarse, daily cost alerts are useful but insufficient for high-intensity workloads.