DELIVERY · 2025-11-18 · BY JONAS "JO" CARLIN

Story: the feature that only worked in our favorite environment

We built and tested a feature in one staging environment and one region. It behaved very differently elsewhere. Here’s why and what we changed.

Tags: delivery, environments, testing, rollouts

What happened

We built a feature that looked solid.

It worked in:

  • our main staging environment
  • one production region we used for early rollouts

When we enabled it elsewhere, we saw:

  • inconsistent behavior
  • higher error rates
  • confusing logs that didn’t match what we’d seen in staging

The feature wasn’t broken everywhere.

It was broken everywhere that wasn’t our favorite environment.

The blind spots

Our "favorite" environment had:

  • a particular data shape
  • a subset of integrations
  • network and latency characteristics close to one major region

Other regions and environments had:

  • different data distributions
  • different patterns of third-party errors
  • slightly different configuration defaults

We had treated success in one environment as proof.

It was only proof for that environment.

What we changed

1. Make environment differences explicit

We documented key differences between environments and regions:

  • data volume and shape
  • enabled integrations and flags
  • latency profiles

We stopped calling staging "like prod." We started calling it "good for testing these things."
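
To make that stick, we started capturing the differences as data instead of tribal knowledge. Here's a minimal sketch in Python; the environment names, fields, and values are hypothetical, not our real inventory:

    # environment_profiles.py - a hand-maintained registry of the differences
    # that matter for testing. All names and values are illustrative.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EnvProfile:
        name: str
        data_volume: str         # e.g. "~1M rows, synthetic"
        integrations: frozenset  # which third parties are actually wired up
        p50_latency_ms: int      # rough latency to primary dependencies
        notes: str = ""

    PROFILES = {
        "staging": EnvProfile(
            name="staging",
            data_volume="~1M rows, synthetic",
            integrations=frozenset({"payments", "email"}),
            p50_latency_ms=5,
            notes="Good for testing core logic; not the real traffic mix.",
        ),
        "prod-us-east": EnvProfile(
            name="prod-us-east",
            data_volume="~400M rows, skewed toward large accounts",
            integrations=frozenset({"payments", "email", "tax", "fraud"}),
            p50_latency_ms=12,
        ),
    }

Even a wiki table would do. The point is that "staging is good for testing these things" becomes a reviewable artifact instead of folklore.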

2. Design rollouts to sample diversity

Instead of:

  • staging → one region → everywhere

we aimed for:

  • staging → multiple regions or cohorts with different characteristics → broader rollout

This meant:

  • picking early regions with different traffic patterns
  • including internal or low-risk cohorts from more than one environment
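
Written down as code, the plan looks something like this sketch. The stage names, cohorts, and reasons are all made up for illustration:

    # rollout_plan.py - a staged rollout that deliberately samples cohorts
    # with different characteristics. All names here are illustrative.

    ROLLOUT_STAGES = [
        # (stage,          cohorts,                           why this stage)
        ("internal",       ["staging", "dogfood"],            "catch obvious breakage"),
        ("diverse-early",  ["prod-us-east", "prod-eu-west"],  "different traffic and data shapes"),
        ("broad",          ["prod-us-west", "prod-ap-south"], "remaining regions"),
    ]

    def next_stage(current: str) -> str | None:
        """Return the stage after `current`, or None once rollout is complete."""
        names = [stage for stage, _, _ in ROLLOUT_STAGES]
        i = names.index(current)
        return names[i + 1] if i + 1 < len(names) else None

The useful property is that "diverse early" is a deliberate stage, not an accident of which region happened to be convenient.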

3. Align configuration and flags

We found cases where:

  • flags had different defaults in different regions
  • config values drifted over time

We:

  • standardized flag and config baselines where possible
  • made differences explicit where they were intentional
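
A small check can keep this honest over time: diff flag defaults across regions and fail loudly on any difference that isn't on a documented allowlist. A sketch, assuming defaults can be exported per region as plain dicts (the flag names are invented):

    # flag_drift_check.py - compare flag defaults across regions and report
    # any difference that isn't an explicitly documented, intentional one.

    INTENTIONAL_DIFFERENCES = {"tax_provider"}  # hypothetical per-region flag

    def find_drift(baseline: dict, regions: dict[str, dict]) -> list[str]:
        problems = []
        for region, flags in regions.items():
            for key, expected in baseline.items():
                actual = flags.get(key)
                if actual != expected and key not in INTENTIONAL_DIFFERENCES:
                    problems.append(
                        f"{region}: flag {key!r} is {actual!r}, baseline is {expected!r}"
                    )
        return problems

    if __name__ == "__main__":
        baseline = {"new_checkout": False, "tax_provider": "acme"}
        regions = {
            "us-east": {"new_checkout": False, "tax_provider": "acme"},
            "eu-west": {"new_checkout": True, "tax_provider": "eurotax"},
        }
        for line in find_drift(baseline, regions):
            print(line)

Run against the example data, this flags new_checkout in eu-west but stays quiet about tax_provider, which is documented as intentional.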

4. Test failure modes, not just success

In staging and early rollouts, we:

  • simulated dependency failures common in other regions
  • tested data from more than one segment or locale

This caught some environment-specific issues before they reached users.
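
In test form, that meant wrapping a dependency with a double that fails the way it fails elsewhere, then asserting the feature degrades sanely. A sketch in pytest style; the service, error shape, and fallback behavior are all hypothetical:

    # test_failure_modes.py - inject the dependency failures common in other
    # regions. All names and error shapes here are hypothetical.

    class DependencyError(Exception):
        pass

    class FlakyTaxService:
        """Test double that fails the way a flaky regional integration fails."""
        def __init__(self, failures_before_success: int):
            self.remaining_failures = failures_before_success

        def quote(self, amount: float) -> float:
            if self.remaining_failures > 0:
                self.remaining_failures -= 1
                raise DependencyError("upstream timeout")
            return amount * 0.2

    def checkout_total(amount: float, tax_service, retries: int = 2) -> float:
        """Feature under test: must degrade sanely when tax quoting is flaky."""
        for _ in range(retries + 1):
            try:
                return amount + tax_service.quote(amount)
            except DependencyError:
                continue
        return amount  # degrade: complete the order without tax, don't fail it

    def test_survives_two_timeouts():
        assert checkout_total(100.0, FlakyTaxService(failures_before_success=2)) == 120.0

    def test_degrades_after_persistent_failure():
        assert checkout_total(100.0, FlakyTaxService(failures_before_success=99)) == 100.0

The point is that the unhappy paths we'd seen in other regions get exercised before rollout, not discovered after.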

Takeaways

  • Success in one environment is a useful signal but not a guarantee.
  • Rollouts should sample environments and regions that reflect real diversity.
  • Differences in configuration and data shape matter as much to testing as code paths do.
