Incident report: Schema change that broke derived analytics
An online schema change left transactional traffic healthy but broke analytics pipelines. We describe the disconnect and what we changed.
Summary
On February 20, 2025, an online schema change to a core table rolled out smoothly for transactional workloads.
User-facing APIs stayed within SLOs.
However, several downstream analytics pipelines broke:
- nightly ETL jobs failed or produced partial data
- dashboards built on derived tables showed gaps and misaligned totals
From the perspective of production traffic, the change was a success.
From the perspective of analytics and reporting, it was an incident.
We treated this as a design problem in how we coordinate schema changes across transactional and analytical systems.
Impact
- Duration: roughly 36 hours of degraded analytics and reporting until data was backfilled and pipelines fixed.
- User impact:
  - end-user experiences in the product were unaffected
  - some internal and partner-facing reports were delayed or incomplete
- Internal impact:
  - operations and finance teams could not rely on some daily metrics
  - engineers spent time triaging ETL failures and patching transformations
No user data was lost, but derived data was temporarily inconsistent.
Timeline
All times local.
- 09:05 — Schema change begins: adding new columns and adjusting types on a core table using an online migration tool.
- 09:32 — Migration completes without errors; transactional health checks and dashboards look normal.
- 11:10 — An analytics engineer notices that a staging ETL run failed with a column/type mismatch error that had not appeared before.
- 11:24 — The production ETL run for the same pipeline is paused while investigation continues.
- 12:05 — We confirm that the schema change altered a column’s semantics in a way that existing transforms did not expect.
- 13:20 — Some dashboards begin showing incomplete data for the impacted dimensions.
- 14:02 — Incident channel opened for "derived analytics degradation"; data and platform teams join.
- 15:15 — Quick mitigation: adjust ETL jobs to handle both old and new schema shapes while we design a longer-term fix (a sketch of this dual-shape handling follows the timeline).
- Next day 09:00 — Backfills complete for the missing data; dashboards begin to show consistent values again.
- Next day 15:30 — Incident closed; follow-ups captured.
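The 15:15 mitigation amounted to making transforms tolerant of both schema shapes. A minimal sketch of the idea, assuming hypothetical column names (amount_cents as the old narrow field, amount_micros as the new, wider one); these are illustrative, not our real schema:

```python
# Hypothetical sketch of the dual-shape mitigation: the transform accepts
# rows written under either the old or the new schema. The column names
# (amount_cents, amount_micros) are illustrative, not our real schema.

def normalize_amount(row: dict) -> int:
    """Return the amount in micros regardless of which schema wrote the row."""
    if row.get("amount_micros") is not None:
        # New schema: wider column, already in micros.
        return int(row["amount_micros"])
    if row.get("amount_cents") is not None:
        # Old schema: narrow integer of cents; rescale (1 cent = 10,000 micros).
        return int(row["amount_cents"]) * 10_000
    raise ValueError(f"row has neither amount column: {row!r}")
```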
Root cause
The root cause was a schema change that treated transactional and analytical consumers differently.
The change:
- added a new column with slightly different semantics from the field it replaced
- changed the type of another column from a narrower integer to a wider representation
Transactional code was updated to:
- write to the new column
- interpret the new type correctly
Analytics code was not.
ETL pipelines assumed:
- a specific type and format for the original column
- a particular combination of fields that no longer matched transactional writes
We had no single, shared contract describing how this table was used for analytics.
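To make the mismatch concrete, here is a hypothetical sketch of the kind of assumption one ETL transform had baked in; the column names and units are illustrative, not our real schema:

```python
# Hypothetical sketch of an ETL assumption that the migration invalidated.
# amount_cents stands in for the real column; the units are illustrative.

def to_dollars(row: dict) -> float:
    # Assumed: every row carries amount_cents as a narrow integer of cents.
    # After the migration, transactional code wrote amount_micros (a wider
    # type with different semantics) instead, so this lookup raised KeyError
    # and the nightly job failed or emitted partial output.
    return row["amount_cents"] / 100.0
```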
Contributing factors
- Split ownership. The team owning the transactional schema and the team owning analytics pipelines had separate review processes.
- Incomplete "blast radius" analysis. The schema change review focused on production latency and correctness, not on downstream consumers.
- Lack of schema versioning for analytics. ETL jobs read from the live transactional schema rather than a versioned or deliberately shaped interface.
What we changed
1. Make analytical consumers explicit in schema designs
Schema change proposals now include a section for analytical impact:
- which pipelines read from the changed tables
- which derived tables or reports depend on the affected columns
- how type or semantic changes will be reflected downstream
We now involve analytics owners in design and review when changes touch shared tables.
2. Introduce stable views for analytics
Rather than pointing ETL pipelines at raw transactional tables, we:
- introduced versioned views or intermediate tables as stable contracts for analytics
- made those views the primary interface for reporting pipelines
Schema changes can evolve underneath as long as the views preserve their contracts (or are versioned with deprecation plans).
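As a sketch of what such a contract view can look like, with illustrative table and column names (the SQL sits in a Python constant purely for presentation):

```python
# Hypothetical sketch of a versioned contract view. The underlying table can
# evolve as long as this view keeps its shape; a breaking change would ship
# as analytics.orders_v2 with a deprecation window for v1. Names are
# illustrative, not our real schema.

ORDERS_ANALYTICS_V1 = """
CREATE OR REPLACE VIEW analytics.orders_v1 AS
SELECT
    order_id,
    -- Normalize the widened column back to the unit v1 consumers expect.
    amount_micros / 10000 AS amount_cents,
    status,
    created_at
FROM prod.orders;
"""
```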
3. Align rollout steps for transactional and analytical paths
We changed rollout patterns for schema changes:
- transactional code and analytics code are updated in compatible phases
- for significant changes, we:
  - introduce new fields and update both transactional and ETL code to write/read them
  - backfill derived data
  - only then deprecate or repurpose old columns
This mirrors blue/green and dual-write strategies we already use for transactional migrations.
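As one concrete phase, here is a minimal sketch of the dual-write step, assuming a DB-API-style cursor and the same illustrative column names as above:

```python
# Hypothetical sketch of the dual-write phase: transactional code writes both
# the old and the new column until ETL migration and backfill are complete.
# Column names and the cursor API are illustrative assumptions.

def write_order_amount(cursor, order_id: int, amount_cents: int) -> None:
    cursor.execute(
        """
        UPDATE orders
        SET amount_cents  = %(cents)s,           -- old column, kept during transition
            amount_micros = %(cents)s * 10000    -- new, wider column
        WHERE order_id = %(order_id)s
        """,
        {"cents": amount_cents, "order_id": order_id},
    )
```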
4. Improve monitoring for analytics health
We added monitoring that treats analytics as a first-class surface:
- success/failure rates for ETL jobs
- freshness of key derived tables
- sanity checks comparing metrics from old and new paths during transitions
We also made an "analytics health" dashboard part of the standard review when schema changes roll out.
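A minimal sketch of the kind of freshness check we added, assuming a connection object whose execute returns a cursor (as in sqlite3) and naive-UTC timestamps; the table name and threshold are illustrative:

```python
import datetime as dt

# Hypothetical freshness check for a key derived table. The table name and
# threshold are illustrative; in practice this runs on a schedule and alerts
# the owning team on failure. Timestamps are assumed to be naive UTC datetimes.

FRESHNESS_THRESHOLD = dt.timedelta(hours=26)  # nightly job plus some slack

def check_freshness(conn, table: str = "analytics.daily_orders") -> None:
    latest = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()[0]
    if latest is None or dt.datetime.utcnow() - latest > FRESHNESS_THRESHOLD:
        raise RuntimeError(f"{table} is stale: last update was {latest}")
```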
5. Document shared data contracts
We began documenting data contracts for shared tables:
- field meanings
- expected ranges and types
- how and where they are used in derived analytics
These contracts are now referenced in both schema change proposals and analytics pipeline designs.
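A contract can be as lightweight as a structure checked into the repo; a hypothetical sketch follows, where the field names, ranges, and consumers are all illustrative:

```python
from dataclasses import dataclass

# Hypothetical data-contract entry for one shared field. These records live
# in the repo and are referenced by both schema change proposals and ETL code.

@dataclass(frozen=True)
class FieldContract:
    name: str           # fully qualified column name
    type: str           # expected storage type
    unit: str           # semantic unit, to catch silent rescaling
    valid_range: tuple  # (min, max), where applicable
    consumers: tuple    # pipelines and reports that read this field

AMOUNT_MICROS = FieldContract(
    name="orders.amount_micros",
    type="BIGINT",
    unit="micros (1e-6 of a currency unit)",
    valid_range=(0, 10**15),
    consumers=("nightly_orders_etl", "finance_daily_dashboard"),
)
```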
Follow-ups
Completed
- Restored and backfilled affected analytics tables.
- Introduced a view layer for the most critical analytics consumers.
- Updated schema change templates to include analytical impact sections.
Planned / in progress
- Extend view-based contracts to more shared tables.
- Add automated checks that compare transactional and analytical aggregates during schema changes.
- Formalize joint reviews between transactional and analytics teams for high-risk changes.
Takeaways
- A schema change is not "done" if transactional traffic is healthy but analytics are broken.
- Views or dedicated analytical schemas make it easier to evolve transactional tables safely.
- Monitoring and contracts for analytics should be as deliberate as for user-facing APIs.