Incident report: Cost alert that arrived too late
A misconfigured batch job and a slow-to-fire cost alert let a runaway backfill spend much more than intended before we noticed. We describe what happened and how we changed our approach.
Summary
On May 14, 2025, a misconfigured batch job and a slow-to-fire cost alert combined to produce an unwelcome surprise.
A routine backfill that was supposed to:
- run in off-peak hours
- process a limited set of records
- stay within a modest cost budget
instead:
- ran longer than expected
- touched more data than intended
- generated infrastructure and third-party costs significantly above plan
The technical impact on users was minimal.
The impact on our cost budget was not.
We treated this as an incident because it revealed that our cost observability was not fast or detailed enough for certain classes of jobs.
Impact
- Duration: several hours of unintended extra spend before we stopped the job.
- User impact:
- none directly; the job ran on background infrastructure
- some minor contention with other jobs sharing the same resources
- Internal impact:
- overspend in one cost category beyond its planned budget for the month
- time spent reconciling and explaining the spike
Timeline
All times local.
- 01:05 — Backfill job starts, expected to take ~60–90 minutes.
- 01:40 — Metrics show higher-than-expected throughput; no alerts are configured at this level for cost.
- 03:10 — Job is still running; resource usage is elevated but within capacity limits.
- 04:32 — A coarse-grained cost alert triggers, based on a %-over-baseline threshold aggregated at the daily level.
- 04:40 — On-call acknowledges the cost alert and correlates it with the long-running backfill.
- 04:52 — Job is paused and then terminated after verifying partial progress.
- 05:20 — Quick analysis estimates spend significantly above the initially approved range.
- Later that day — Incident review scheduled to focus on cost observability and job configuration.
Root cause
The root cause was a mismatch between:
- the speed at which the backfill could generate cost
- the speed and granularity of our cost alerts
The job:
- was configured with parameters that allowed it to process more data than originally planned
- ran in a region and configuration that were more expensive per unit of work
Our cost alerts:
- were tuned for slower-moving baseline shifts, not fast spikes from a single job
- used daily aggregates and percentage thresholds that delayed notification
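To make the mismatch concrete, here is a small illustrative sketch with made-up numbers (not our actual spend, rates, or thresholds) comparing when a daily percentage-over-baseline alert crosses versus a per-job budget during a fast spike:

```python
# Illustrative sketch with made-up numbers; not our real spend, rates, or thresholds.
BASELINE_PER_HOUR = 20.0        # normal background spend in this category
JOB_SPEND_PER_HOUR = 90.0       # extra spend from the runaway backfill
DAILY_BASELINE = BASELINE_PER_HOUR * 24
DAILY_ALERT_THRESHOLD = DAILY_BASELINE * 1.5   # daily alert: 50% over baseline
JOB_BUDGET = 60.0                              # approved spend for the backfill

day_total = 0.0
job_total = 0.0
daily_alert_hour = None
job_alert_hour = None

for hour in range(1, 25):
    day_total += BASELINE_PER_HOUR + JOB_SPEND_PER_HOUR
    job_total += JOB_SPEND_PER_HOUR
    if daily_alert_hour is None and day_total > DAILY_ALERT_THRESHOLD:
        daily_alert_hour = hour
    if job_alert_hour is None and job_total > JOB_BUDGET:
        job_alert_hour = hour

print(f"daily %-over-baseline alert fires after ~{daily_alert_hour} h of overspend")
print(f"per-job budget alert fires after ~{job_alert_hour} h of overspend")
```

With numbers like these, the aggregate alert is working exactly as designed and still arrives hours after a per-job budget would have.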
Contributing factors
- Insufficient rate limits in the job framework. The backfill framework was capable of respecting throughput budgets but was not configured to do so here.
- Loose approval process for job parameters. The job’s configuration deviated from the original plan without a clear review.
- Coarse cost telemetry. We had good monthly and daily visibility, but not enough near-real-time per-job cost signals.
What we changed
1. Introduce per-job cost budgets
We added the concept of a cost budget to our backfill framework:
- jobs specify an expected cost range (or a proxy, like compute-hours or data processed)
- the framework tracks progress against this budget
- if a job exceeds its budget, it:
- slows down
- raises a specific alert
This doesn’t replace aggregate cost monitoring; it adds a job-level safety net.
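A minimal sketch of that safety net, assuming a simplified framework; the class and function names here are illustrative, not our actual backfill API:

```python
import time
from typing import Callable, Iterable


class CostBudget:
    """Tracks a job's spend against an approved budget (illustrative sketch)."""

    def __init__(self, expected_cost: float, hard_limit_factor: float = 1.5):
        self.expected_cost = expected_cost
        self.hard_limit = expected_cost * hard_limit_factor
        self.spent = 0.0

    def record(self, cost: float) -> None:
        self.spent += cost

    def over_budget(self) -> bool:
        return self.spent > self.expected_cost

    def over_hard_limit(self) -> bool:
        return self.spent > self.hard_limit


def run_backfill(
    batches: Iterable,
    process: Callable,                 # does the actual work for one batch
    cost_per_batch: float,             # cost proxy per batch (e.g. compute-hours)
    budget: CostBudget,
    alert: Callable[[str], None],
    slowdown_seconds: float = 30.0,
) -> None:
    """Process batches, throttling and alerting once the budget is exceeded."""
    alerted = False
    for batch in batches:
        process(batch)
        budget.record(cost_per_batch)

        if budget.over_hard_limit():
            alert("backfill exceeded its hard cost limit; stopping early")
            return
        if budget.over_budget():
            if not alerted:
                alert("backfill exceeded its expected cost budget; throttling")
                alerted = True
            time.sleep(slowdown_seconds)   # back off instead of racing ahead
```

In this shape, the expected cost comes from the approved plan, so a job whose parameters drift from that plan shows up as a budget breach within the run rather than as a billing surprise afterwards.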
2. Improve near-real-time cost signals
We worked with our infrastructure and billing tools to:
- derive more timely cost proxies for certain workloads (e.g., storage operations, API calls)
- expose these proxies as metrics and dashboards per job or per service
Alerts for high-risk jobs can now:
- trigger on these proxies within an hour or less
- complement slower, billing-based alerts
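As a rough sketch, a cost proxy is just usage counters multiplied by unit prices and emitted as a per-job metric; the prices and metric name below are placeholders, not our real rates or dashboards:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Placeholder unit prices; real values come from published pricing or billing exports.
UNIT_PRICES = {
    "storage_read_ops": 0.0000004,   # per read operation
    "storage_write_ops": 0.000005,   # per write operation
    "api_calls": 0.00001,            # per third-party API call
    "compute_vcpu_hours": 0.05,      # per vCPU-hour
}


@dataclass
class UsageSample:
    """Usage counters for one job over one short interval (e.g. five minutes)."""
    job_id: str
    counters: Dict[str, float]


def estimated_cost(sample: UsageSample) -> float:
    """Convert raw usage counters into an approximate cost for the interval."""
    return sum(count * UNIT_PRICES.get(name, 0.0) for name, count in sample.counters.items())


def emit_cost_proxy(sample: UsageSample, emit: Callable) -> None:
    """Publish the proxy so per-job dashboards and alerts see cost within minutes."""
    emit("job_estimated_cost_dollars", estimated_cost(sample), {"job_id": sample.job_id})
```

The proxy will never match the invoice exactly, and it does not need to; it only needs to be accurate enough to notice a job burning money far faster than planned.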
3. Tighten review for large jobs
We added a small checklist for jobs that:
- run against large datasets
- can generate significant external or internal costs
The checklist covers:
- expected duration and cost
- limits (time, data volume, budget)
- monitoring and stop conditions
Jobs that don’t meet the checklist require changes or explicit approvals before running.
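We also encode the hard limits from the checklist into the job template, so a missing limit blocks the job rather than relying on reviewers to catch it. A simplified sketch with illustrative field names:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LargeJobSpec:
    """Fields the review checklist requires before a large job may run (illustrative)."""
    name: str
    expected_duration_hours: float
    expected_cost: float
    max_runtime_hours: Optional[float] = None
    max_records: Optional[int] = None
    cost_budget: Optional[float] = None
    stop_condition: Optional[str] = None   # e.g. "pause if budget exceeded"
    approved_by: Optional[str] = None


def missing_checklist_items(spec: LargeJobSpec) -> List[str]:
    """Return unmet checklist items; an empty list means the job may run."""
    missing = []
    if spec.max_runtime_hours is None:
        missing.append("time limit")
    if spec.max_records is None:
        missing.append("data volume limit")
    if spec.cost_budget is None:
        missing.append("cost budget")
    if spec.stop_condition is None:
        missing.append("monitoring/stop condition")
    if spec.approved_by is None:
        missing.append("explicit approval")
    return missing
```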
4. Make cost part of incident drills
Just as we run reliability drills, we started running small "cost drills":
- simulate a misconfigured job in a lower environment
- observe how quickly and clearly the signals appear
- tune alerts and dashboards based on what we learn
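A drill can be as small as replaying a deliberately over-budget synthetic job through the same safety net and checking which alerts fire and how quickly. A sketch that reuses the CostBudget and run_backfill idea from the earlier example (all numbers synthetic):

```python
def cost_drill() -> None:
    """Run a deliberately over-budget synthetic job and report which alerts fired."""
    alerts = []
    budget = CostBudget(expected_cost=10.0)   # from the per-job budget sketch above

    # Simulate a "misconfigured" job: far more batches than the budget allows.
    run_backfill(
        batches=range(40),
        process=lambda batch: None,     # no real work during the drill
        cost_per_batch=1.0,
        budget=budget,
        alert=alerts.append,
        slowdown_seconds=0.0,           # don't actually sleep in the drill
    )

    print(f"spent {budget.spent:.2f} against a budget of {budget.expected_cost:.2f}")
    for message in alerts:
        print(f"alert: {message}")
```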
Follow-ups
Completed
- Added per-job budgets and alerts to the backfill framework.
- Improved cost proxies and metrics for the relevant infrastructure components.
- Updated job templates and checklists for high-cost operations.
Planned / in progress
- Extend per-job budgets to more classes of batch work.
- Integrate cost signals more deeply into planning and review for large projects.
Takeaways
- Cost incidents can be as important to treat systematically as reliability incidents.
- Jobs that can generate large costs quickly need their own budgets and alerts.
- Coarse, daily cost alerts are useful but insufficient for high-intensity workloads.