Incident report: Schema change that broke derived analytics
An online schema change left transactional traffic healthy but broke analytics pipelines. We describe the disconnect and what we changed.
Summary
On February 20, 2025, an online schema change to a core table rolled out smoothly for transactional workloads.
User-facing APIs stayed within SLOs.
However, several downstream analytics pipelines broke:
- nightly ETL jobs failed or produced partial data
- dashboards built on derived tables showed gaps and misaligned totals
From the perspective of production traffic, the change was a success.
From the perspective of analytics and reporting, it was an incident.
We treated this as a design problem in how we coordinate schema changes across transactional and analytical systems.
Impact
- Duration: roughly 36 hours of degraded analytics and reporting until data was backfilled and pipelines fixed.
- User impact:
  - end-user experiences in the product were unaffected
  - some internal and partner-facing reports were delayed or incomplete
- Internal impact:
  - operations and finance teams could not rely on some daily metrics
  - engineers spent time triaging ETL failures and patching transformations
No user data was lost, but derived data was temporarily inconsistent.
Timeline
All times local.
- 09:05 — Schema change begins: adding new columns and adjusting types on a core table using an online migration tool.
- 09:32 — Migration completes without errors; transactional health checks and dashboards look normal.
- 11:10 — An analytics engineer notices that a staging ETL run failed with a column/type mismatch error that had not appeared before.
- 11:24 — The production ETL run for the same pipeline is paused while investigation continues.
- 12:05 — We confirm that the schema change altered a column’s semantics in a way that existing transforms did not expect.
- 13:20 — Some dashboards begin showing incomplete data for the impacted dimensions.
- 14:02 — Incident channel opened for "derived analytics degradation"; data and platform teams join.
- 15:15 — Quick mitigation: adjust ETL jobs to handle both old and new schema shapes while we design a longer-term fix (a sketch of this dual-shape handling follows the timeline).
- Next day 09:00 — Backfills complete for the missing data; dashboards begin to show consistent values again.
- Next day 15:30 — Incident closed; follow-ups captured.
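The 15:15 mitigation amounted to making transforms tolerant of both schema shapes. A minimal sketch of the idea, assuming hypothetical column names (amount_cents as the old narrow field, amount_micros as the new, wider one); these are illustrative, not our real schema:

```python
# Hypothetical sketch of the dual-shape mitigation: the transform accepts
# rows written under either the old or the new schema. The column names
# (amount_cents, amount_micros) are illustrative, not our real schema.

def normalize_amount(row: dict) -> int:
    """Return the amount in micros regardless of which schema wrote the row."""
    if row.get("amount_micros") is not None:
        # New schema: wider column, already in micros.
        return int(row["amount_micros"])
    if row.get("amount_cents") is not None:
        # Old schema: narrow integer of cents; rescale (1 cent = 10,000 micros).
        return int(row["amount_cents"]) * 10_000
    raise ValueError(f"row has neither amount column: {row!r}")
```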
Root cause
The root cause was a schema change that treated transactional and analytical consumers differently.
The change:
- added a new column with slightly different semantics from the field it replaced
- changed the type of another column from a narrower integer to a wider representation
Transactional code was updated to:
- write to the new column
- interpret the new type correctly
Analytics code was not.
ETL pipelines assumed:
- a specific type and format for the original column
- a particular combination of fields that no longer matched transactional writes
We had no single, shared contract describing how this table was used for analytics.
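To make the mismatch concrete, here is a hypothetical sketch of the kind of assumption one ETL transform had baked in; the column names and units are illustrative, not our real schema:

```python
# Hypothetical sketch of an ETL assumption that the migration invalidated.
# amount_cents stands in for the real column; the units are illustrative.

def to_dollars(row: dict) -> float:
    # Assumed: every row carries amount_cents as a narrow integer of cents.
    # After the migration, transactional code wrote amount_micros (a wider
    # type with different semantics) instead, so this lookup raised KeyError
    # and the nightly job failed or emitted partial output.
    return row["amount_cents"] / 100.0
```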
Contributing factors
- Split ownership. The team owning the transactional schema and the team owning analytics pipelines had separate review processes.
- Incomplete "blast radius" analysis. The schema change review focused on production latency and correctness, not on downstream consumers.
- Lack of schema versioning for analytics. ETL jobs read from the live transactional schema rather than a versioned or deliberately shaped interface.
What we changed
1. Make analytical consumers explicit in schema designs
Schema change proposals now include a section for analytical impact:
- which pipelines read from the changed tables
- which derived tables or reports depend on the affected columns
- how type or semantic changes will be reflected downstream
We now involve analytics owners in design and review when changes touch shared tables.
2. Introduce stable views for analytics
Rather than pointing ETL pipelines at raw transactional tables, we:
- introduced versioned views or intermediate tables as stable contracts for analytics
- made those views the primary interface for reporting pipelines
Schema changes can evolve underneath as long as the views preserve their contracts (or are versioned with deprecation plans).
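As a sketch of what such a contract view can look like, with illustrative table and column names (the SQL sits in a Python constant purely for presentation):

```python
# Hypothetical sketch of a versioned contract view. The underlying table can
# evolve as long as this view keeps its shape; a breaking change would ship
# as analytics.orders_v2 with a deprecation window for v1. Names are
# illustrative, not our real schema.

ORDERS_ANALYTICS_V1 = """
CREATE OR REPLACE VIEW analytics.orders_v1 AS
SELECT
    order_id,
    -- Normalize the widened column back to the unit v1 consumers expect.
    amount_micros / 10000 AS amount_cents,
    status,
    created_at
FROM prod.orders;
"""
```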
3. Align rollout steps for transactional and analytical paths
We changed rollout patterns for schema changes:
- transactional code and analytics code are updated in compatible phases
- for significant changes, we:
  - introduce new fields and update both transactional and ETL code to write/read them
  - backfill derived data
  - only then deprecate or repurpose old columns
This mirrors blue/green and dual-write strategies we already use for transactional migrations.
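As one concrete phase, here is a minimal sketch of the dual-write step, assuming a DB-API-style cursor and the same illustrative column names as above:

```python
# Hypothetical sketch of the dual-write phase: transactional code writes both
# the old and the new column until ETL migration and backfill are complete.
# Column names and the cursor API are illustrative assumptions.

def write_order_amount(cursor, order_id: int, amount_cents: int) -> None:
    cursor.execute(
        """
        UPDATE orders
        SET amount_cents  = %(cents)s,           -- old column, kept during transition
            amount_micros = %(cents)s * 10000    -- new, wider column
        WHERE order_id = %(order_id)s
        """,
        {"cents": amount_cents, "order_id": order_id},
    )
```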
4. Improve monitoring for analytics health
We added monitoring that treats analytics as a first-class surface:
- success/failure rates for ETL jobs
- freshness of key derived tables
- sanity checks comparing metrics from old and new paths during transitions
We also made an "analytics health" dashboard part of the standard review when schema changes roll out.
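A minimal sketch of the kind of freshness check we added, assuming a connection object whose execute returns a cursor (as in sqlite3) and naive-UTC timestamps; the table name and threshold are illustrative:

```python
import datetime as dt

# Hypothetical freshness check for a key derived table. The table name and
# threshold are illustrative; in practice this runs on a schedule and alerts
# the owning team on failure. Timestamps are assumed to be naive UTC datetimes.

FRESHNESS_THRESHOLD = dt.timedelta(hours=26)  # nightly job plus some slack

def check_freshness(conn, table: str = "analytics.daily_orders") -> None:
    latest = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()[0]
    if latest is None or dt.datetime.utcnow() - latest > FRESHNESS_THRESHOLD:
        raise RuntimeError(f"{table} is stale: last update was {latest}")
```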
5. Document shared data contracts
We began documenting data contracts for shared tables:
- field meanings
- expected ranges and types
- how and where they are used in derived analytics
These contracts are now referenced in both schema change proposals and analytics pipeline designs.
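A contract can be as lightweight as a structure checked into the repo; a hypothetical sketch follows, where the field names, ranges, and consumers are all illustrative:

```python
from dataclasses import dataclass

# Hypothetical data-contract entry for one shared field. These records live
# in the repo and are referenced by both schema change proposals and ETL code.

@dataclass(frozen=True)
class FieldContract:
    name: str           # fully qualified column name
    type: str           # expected storage type
    unit: str           # semantic unit, to catch silent rescaling
    valid_range: tuple  # (min, max), where applicable
    consumers: tuple    # pipelines and reports that read this field

AMOUNT_MICROS = FieldContract(
    name="orders.amount_micros",
    type="BIGINT",
    unit="micros (1e-6 of a currency unit)",
    valid_range=(0, 10**15),
    consumers=("nightly_orders_etl", "finance_daily_dashboard"),
)
```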
Follow-ups
Completed
- Restored and backfilled affected analytics tables.
- Introduced a view layer for the most critical analytics consumers.
- Updated schema change templates to include analytical impact sections.
Planned / in progress
- Extend view-based contracts to more shared tables.
- Add automated checks that compare transactional and analytical aggregates during schema changes.
- Formalize joint reviews between transactional and analytics teams for high-risk changes.
Takeaways
- A schema change is not "done" if transactional traffic is healthy but analytics are broken.
- Views or dedicated analytical schemas make it easier to evolve transactional tables safely.
- Monitoring and contracts for analytics should be as deliberate as for user-facing APIs.