DELIVERY · 2025-11-18 · BY JONAS "JO" CARLIN

Story: the feature that only worked in our favorite environment

We built and tested a feature in one staging environment and one region. It behaved very differently elsewhere. Here’s why and what we changed.

Tags: delivery, environments, testing, rollouts

What happened

We built a feature that looked solid.

It worked in:

  • our main staging environment
  • one production region we used for early rollouts

When we enabled it elsewhere, we saw:

  • inconsistent behavior
  • higher error rates
  • confusing logs that didn’t match what we’d seen in staging

The feature wasn’t broken everywhere.

It was broken everywhere that wasn’t our favorite environment.

The blind spots

Our "favorite" environment had:

  • a particular data shape
  • a subset of integrations
  • network and latency characteristics close to one major region

Other regions and environments had:

  • different data distributions
  • different patterns of third-party errors
  • slightly different configuration defaults

We had treated success in one environment as proof.

It was only proof for that environment.

What we changed

1. Make environment differences explicit

We documented key differences between environments and regions:

  • data volume and shape
  • enabled integrations and flags
  • latency profiles

We stopped calling staging "like prod." We started calling it "good for testing these things."
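
To make that stick, we started capturing the differences as data instead of tribal knowledge. Here's a minimal sketch in Python; the environment names, fields, and values are hypothetical, not our real inventory:

    # environment_profiles.py - a hand-maintained registry of the differences
    # that matter for testing. All names and values are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EnvProfile:
        name: str
        data_volume: str         # e.g. "~1M rows, synthetic"
        integrations: frozenset  # which third parties are actually wired up
        p50_latency_ms: int      # rough latency to primary dependencies
        notes: str = ""

    PROFILES = {
        "staging": EnvProfile(
            name="staging",
            data_volume="~1M rows, synthetic",
            integrations=frozenset({"payments", "email"}),
            p50_latency_ms=5,
            notes="Good for testing core logic; not the real traffic mix.",
        ),
        "prod-us-east": EnvProfile(
            name="prod-us-east",
            data_volume="~400M rows, skewed toward large accounts",
            integrations=frozenset({"payments", "email", "tax", "fraud"}),
            p50_latency_ms=12,
        ),
    }

Even a wiki table would do. The point is that "staging is good for testing these things" becomes a reviewable artifact instead of folklore.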

2. Design rollouts to sample diversity

Instead of:

  • staging → one region → everywhere

we aimed for:

  • staging → multiple regions or cohorts with different characteristics → broader rollout

This meant:

  • picking early regions with different traffic patterns
  • including internal or low-risk cohorts from more than one environment
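
Written down as code, the plan looks something like this sketch. The stage names, cohorts, and reasons are all made up for illustration:

    # rollout_plan.py - a staged rollout that deliberately samples cohorts
    # with different characteristics. All names here are illustrative.

    ROLLOUT_STAGES = [
        # (stage,          cohorts,                           why this stage)
        ("internal",       ["staging", "dogfood"],            "catch obvious breakage"),
        ("diverse-early",  ["prod-us-east", "prod-eu-west"],  "different traffic and data shapes"),
        ("broad",          ["prod-us-west", "prod-ap-south"], "remaining regions"),
    ]

    def next_stage(current: str) -> str | None:
        """Return the stage after `current`, or None once rollout is complete."""
        names = [stage for stage, _, _ in ROLLOUT_STAGES]
        i = names.index(current)
        return names[i + 1] if i + 1 < len(names) else None

The useful property is that "diverse early" is a deliberate stage, not an accident of which region happened to be convenient.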

3. Align configuration and flags

We found cases where:

  • flags had different defaults in different regions
  • config values drifted over time

We:

  • standardized flag and config baselines where possible
  • made differences explicit where they were intentional
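
A small check can keep this honest over time: diff flag defaults across regions and fail loudly on any difference that isn't on a documented allowlist. A sketch, assuming defaults can be exported per region as plain dicts (the flag names are invented):

    # flag_drift_check.py - compare flag defaults across regions and report
    # any difference that isn't an explicitly documented, intentional one.

    INTENTIONAL_DIFFERENCES = {"tax_provider"}  # hypothetical per-region flag

    def find_drift(baseline: dict, regions: dict[str, dict]) -> list[str]:
        problems = []
        for region, flags in regions.items():
            for key, expected in baseline.items():
                actual = flags.get(key)
                if actual != expected and key not in INTENTIONAL_DIFFERENCES:
                    problems.append(
                        f"{region}: flag {key!r} is {actual!r}, baseline is {expected!r}"
                    )
        return problems

    if __name__ == "__main__":
        baseline = {"new_checkout": False, "tax_provider": "acme"}
        regions = {
            "us-east": {"new_checkout": False, "tax_provider": "acme"},
            "eu-west": {"new_checkout": True, "tax_provider": "eurotax"},
        }
        for line in find_drift(baseline, regions):
            print(line)

Run against the example data, this flags new_checkout in eu-west but stays quiet about tax_provider, which is documented as intentional.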

4. Test failure modes, not just success

In staging and early rollouts, we:

  • simulated dependency failures common in other regions
  • tested data from more than one segment or locale

This caught some environment-specific issues before they reached users.
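
In test form, that meant wrapping a dependency with a double that fails the way it fails elsewhere, then asserting the feature degrades sanely. A sketch in pytest style; the service, error shape, and fallback behavior are all hypothetical:

    # test_failure_modes.py - inject the dependency failures common in other
    # regions. All names and error shapes here are hypothetical.

    class DependencyError(Exception):
        pass

    class FlakyTaxService:
        """Test double that fails the way a flaky regional integration fails."""
        def __init__(self, failures_before_success: int):
            self.remaining_failures = failures_before_success

        def quote(self, amount: float) -> float:
            if self.remaining_failures > 0:
                self.remaining_failures -= 1
                raise DependencyError("upstream timeout")
            return amount * 0.2

    def checkout_total(amount: float, tax_service, retries: int = 2) -> float:
        """Feature under test: must degrade sanely when tax quoting is flaky."""
        for _ in range(retries + 1):
            try:
                return amount + tax_service.quote(amount)
            except DependencyError:
                continue
        return amount  # degrade: complete the order without tax, don't fail it

    def test_survives_two_timeouts():
        assert checkout_total(100.0, FlakyTaxService(failures_before_success=2)) == 120.0

    def test_degrades_after_persistent_failure():
        assert checkout_total(100.0, FlakyTaxService(failures_before_success=99)) == 100.0

The point is that the unhappy paths we'd seen in other regions get exercised before rollout, not discovered after.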

Takeaways

  • Success in one environment is a useful signal but not a guarantee.
  • Rollouts should sample environments and regions that reflect real diversity.
  • Differences in configuration and data shape matter as much to testing as code paths do.
