RELIABILITY · 2021-11-11 · BY ELI NAVARRO

Incident report: Runaway cron job in a forgotten namespace

A scheduled job we thought was gone kept running in an old namespace, causing periodic load spikes. We describe how we found it and what we changed.

reliability · cron jobs · incidents

Summary

On November 11, 2021, we investigated recurring load spikes that had no obvious cause.

Every few days, at roughly the same time, a set of services would experience:

  • elevated CPU and database load
  • short but noticeable latency increases

The pattern suggested some kind of scheduled work, but our current cron and job dashboards showed nothing unusual.

Eventually we discovered the culprit: a cron job running in a forgotten namespace from an earlier infrastructure migration.

The job:

  • was no longer expected to exist
  • still had access to production data
  • ran a heavy query-and-update cycle

We shut it down and then treated the existence of "forgotten" scheduled work as an incident in its own right.

Impact

  • Duration: roughly 20–30 minutes of elevated load per run, recurring every few days until discovery.
  • User impact:
    • brief latency increases for some endpoints during runs
    • a small bump in timeouts during the worst spike
  • Internal impact:
    • on-call engineers spent several cycles chasing the cause before we found the job
    • some scheduled maintenance had to be rescheduled to avoid overlapping with the spikes

No data corruption was found, but the job was performing work that no longer had a clear owner or purpose.

Timeline

All times are local and refer to the day of discovery.

  • 09:03 — Usual load spike begins; dashboards show familiar pattern.
  • 09:08 — On-call identifies the spike as similar to previous events and opens an incident channel.
  • 09:15 — Current cron dashboards and job queues show no obvious heavy tasks starting at this time.
  • 09:22 — Investigation focuses on database metrics; a specific set of queries appears at each spike.
  • 09:30 — We trace the queries back to a client identity associated with an old namespace (a sketch of this kind of lookup appears after the timeline).
  • 09:37 — Accessing that namespace reveals an old cron configuration still active.
  • 09:42 — We disable the cron job and monitor load.
  • 09:50 — Metrics return to normal; no additional spikes observed.
  • 10:20 — We begin an inventory of other potential scheduled jobs in legacy namespaces.
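
For the 09:22–09:30 steps, one way to map active queries to a client identity is to query the database's activity view directly. The sketch below is illustrative only, not our actual tooling: it assumes PostgreSQL (the post does not name the database), the psycopg2 driver, and a hypothetical DATABASE_URL environment variable holding the connection string.

  # Sketch: group currently running queries by client identity.
  # Assumes PostgreSQL and psycopg2; DATABASE_URL is a placeholder, not a real setting.
  import os
  import psycopg2

  conn = psycopg2.connect(os.environ["DATABASE_URL"])
  with conn, conn.cursor() as cur:
      cur.execute(
          """
          SELECT application_name, usename, client_addr, count(*) AS active_queries
          FROM pg_stat_activity
          WHERE state = 'active'
          GROUP BY application_name, usename, client_addr
          ORDER BY active_queries DESC
          """
      )
      for app, user, addr, n in cur.fetchall():
          print(f"{app or '<unset>'} / {user} @ {addr}: {n} active queries")

An unexpected application name or service account showing up in a view like this is the kind of signal that can point back to a forgotten namespace.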

Root cause

The immediate cause was a cron job that should have been decommissioned.

The deeper causes were:

  • an incomplete migration away from an old namespace
  • lack of centralized visibility into scheduled work across environments
  • no explicit owner for the job once its original service was rewritten

The job itself:

  • enumerated a large set of records
  • performed expensive checks and updates
  • ran on a schedule that overlapped with normal peak traffic

Because it ran in a namespace our primary dashboards no longer watched, it appeared only as "mysterious" load.
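
The job's actual schedule is not reproduced here, but overlap with peak traffic is easy to check mechanically once a job is known. The sketch below uses a hypothetical cron expression and peak window together with the third-party croniter library; neither value is taken from the real job.

  # Sketch: flag upcoming runs of a cron expression that fall in a peak window.
  # The expression and peak hours are hypothetical, not the real job's values.
  from datetime import datetime
  from croniter import croniter  # third-party: pip install croniter

  SCHEDULE = "0 9 */3 * *"   # hypothetical: 09:00 every third day
  PEAK_HOURS = range(8, 12)  # hypothetical peak window, 08:00-12:00

  runs = croniter(SCHEDULE, datetime.now())
  for _ in range(5):
      run = runs.get_next(datetime)
      flag = "overlaps peak" if run.hour in PEAK_HOURS else "ok"
      print(f"{run:%Y-%m-%d %H:%M}  {flag}")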

Contributing factors

  • Inconsistent decommissioning. We removed most—but not all—cron jobs from the old namespace during a previous migration.
  • Limited cross-namespace observability. Our job dashboards focused on current, known environments.
  • No formal job ownership. The cron’s logic had been partially replaced elsewhere, but the job itself remained.

What we changed

1. Inventory and ownership for scheduled work

We created an inventory of:

  • all cron jobs and scheduled tasks
  • their namespaces/environments
  • owning teams

Jobs without a clear owner were either:

  • assigned one, or
  • decommissioned after review
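
As a rough sketch of how such an inventory can be pulled, the snippet below lists scheduled jobs across all namespaces together with an owner annotation. It assumes Kubernetes CronJobs and the official kubernetes Python client, and the "team-owner" annotation key is a hypothetical convention; the post only says "namespace" and "cron job", so treat both as assumptions.

  # Sketch: build a cron job inventory (name, namespace, schedule, owner).
  # Assumes Kubernetes CronJobs and the official "kubernetes" Python client;
  # the "team-owner" annotation key is a hypothetical convention.
  from kubernetes import client, config

  config.load_kube_config()
  batch = client.BatchV1Api()  # CronJobs are batch/v1 on recent clusters

  inventory = []
  for job in batch.list_cron_job_for_all_namespaces().items:
      annotations = job.metadata.annotations or {}
      inventory.append({
          "name": job.metadata.name,
          "namespace": job.metadata.namespace,
          "schedule": job.spec.schedule,
          "owner": annotations.get("team-owner", "UNOWNED"),
      })

  for entry in sorted(inventory, key=lambda e: e["owner"]):
      print(entry)

Entries that come back as UNOWNED are the ones that feed the assign-or-decommission decision above.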

2. Central dashboards for scheduled jobs

We added dashboards that show:

  • scheduled jobs across all namespaces
  • their next run times
  • recent runtime and error metrics

This made it easier to answer:

  • "What is running right now?"
  • "What is scheduled to run at this time?"

3. Decommissioning checklists

We added scheduled work to our decommissioning checklists:

  • when retiring a service or namespace, list and remove associated cron jobs
  • verify via dashboards that no jobs remain scheduled there
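
The verification in the second item can also be scripted. A minimal check, again assuming Kubernetes and the official Python client, that fails when a namespace being retired still has scheduled jobs:

  # Sketch: verify that a namespace being retired has no remaining CronJobs.
  # Assumes Kubernetes and the official "kubernetes" Python client.
  import sys
  from kubernetes import client, config

  config.load_kube_config()
  batch = client.BatchV1Api()

  namespace = sys.argv[1]  # the legacy namespace being decommissioned
  remaining = batch.list_namespaced_cron_job(namespace).items
  if remaining:
      names = ", ".join(job.metadata.name for job in remaining)
      sys.exit(f"{namespace} still has scheduled jobs: {names}")
  print(f"{namespace}: no scheduled jobs remain")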

4. Safer defaults

We changed cron defaults so that:

  • new jobs require an explicit owner annotation
  • jobs without owners show up in a "needs attention" view
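
How the owner requirement is enforced depends on the platform; as a hedged illustration, the check below reports jobs missing an owner annotation, which is the kind of query a "needs attention" view can be built on. It reuses the hypothetical "team-owner" annotation key from the inventory sketch.

  # Sketch: list cron jobs with no owner annotation ("needs attention").
  # Same assumptions as before: Kubernetes, the official Python client,
  # and a hypothetical "team-owner" annotation convention.
  from kubernetes import client, config

  config.load_kube_config()
  batch = client.BatchV1Api()

  for job in batch.list_cron_job_for_all_namespaces().items:
      annotations = job.metadata.annotations or {}
      if "team-owner" not in annotations:
          print(f"needs attention: {job.metadata.namespace}/{job.metadata.name}")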

We also adjusted scheduling guidelines:

  • prefer off-peak windows for heavy jobs
  • coordinate with service teams when jobs touch shared dependencies

Follow-ups

Completed

  • Disabled the forgotten cron job and verified no similar jobs remained in that namespace.
  • Created cross-namespace dashboards for scheduled work.
  • Added ownership metadata to existing critical jobs.

Planned / in progress

  • Extend job inventories to cover third-party schedulers where applicable.
  • Automate checks for ownerless jobs.

Takeaways

  • Scheduled jobs are part of your production surface, even when they live in "old" namespaces.
  • Migrations and decommissions should include a pass over scheduled work, not just services and databases.
  • Centralized visibility and ownership for cron jobs reduce the chance that a "forgotten" script becomes the next root cause.
