SECURITY | 2021-12-08 | BY STORECODE

Incident report: Access control misconfig slowed a rollout

A misconfigured permission set blocked a critical operational action during a rollout. We describe how it happened and what we changed.

security, access-control, rollout, permissions, incident-response

Summary

On December 8, 2021, a misconfigured access control policy prevented the on-call engineer from performing a critical operation during a staged rollout.

The operation was a routine safety step: shifting a subset of traffic back to the previous version when error rates exceeded a threshold.

Because the on-call account lacked the necessary permission, the rollback action failed. It took an additional 18 minutes to identify the cause, locate someone with broader permissions, and complete the rollback.

During that window, about 4% of requests in the rollout cohort continued to fail with errors tied to the new version.

We treated this as both a security and a reliability incident. The system behaved "securely" in the narrow sense that it denied an unauthorized action, but unsafely in context, because the wrong people were considered unauthorized.

Impact

Duration: approximately 32 minutes of elevated error rates for the affected cohort, including 18 minutes of delay attributable to the access control issue.
User impact:
- ~4% of requests in the rollout cohort failed with errors tied to the new version.
- Some users experienced retries and repeated failures on a specific flow.
Internal impact:
- On-call engineers could not execute the documented rollback path.
- Incident command attention shifted from debugging the regression to debugging permissions.
- Trust in the rollout tooling was temporarily reduced.

We did not observe data loss.

Timeline

All times local.

14:02 — Rollout of the new version begins for 10% of traffic.
14:09 — Error rate for the cohort ticks up but remains within SLO.
14:14 — Error rate crosses the pre-defined threshold; alert fires.
14:16 — On-call initiates the rollback step in the deployment tool.
14:17 — Rollback action fails with a generic "not authorized" message. No change in traffic distribution.
14:19 — On-call checks logs and the deployment tool UI; no clear explanation is provided beyond an internal error code.
14:22 — Incident channel created; secondary engineer joins.
14:25 — They attempt the rollback via an alternate path (API call) with the same account and receive a more explicit "permission denied" response.
14:29 — They escalate to the team that owns access control and the deployment system.
14:32 — An engineer with broader permissions joins and performs the rollback successfully.
14:35 — Error rates begin to drop as traffic shifts back to the previous version.
14:44 — Error rates and latency return to baseline.
15:10 — Incident closed with follow-ups captured.

Root cause

The root cause was a misconfigured access control policy for the role used by primary on-call engineers.

Specifically:

- The role allowed initiating deployments but not changing the traffic split for an active rollout.
- Historically, these actions were separate. A previous policy review had focused on who could deploy, not who could adjust traffic during a deployment.
- A later system update changed how rollout stages were modeled; traffic-shift operations moved into a different permission category that was not granted to the on-call role.
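
To make the gap concrete, here is a minimal sketch of how such a mismatch can arise. Everything in it is hypothetical: the permission names (`deploy:start`, `rollout:traffic-shift`), the contents of the role, and the assumption that rollback previously mapped to a permission the role already held. It illustrates the failure mode and is not a copy of our real policy.

```python
# Hypothetical permission names; the real policy identifiers differ.
ONCALL_ROLE = {"deploy:start", "deploy:view"}

# Assumed mapping before the system update: rollback needs a permission
# that the on-call role already holds.
REQUIRED_BEFORE = {"rollback": "deploy:start"}

# Assumed mapping after the update: traffic shifts (including rollback)
# live in a new category that was never granted to the on-call role.
REQUIRED_AFTER = {"rollback": "rollout:traffic-shift"}


def is_allowed(role_permissions, required, action):
    """Return True if the role holds the permission the action requires."""
    return required[action] in role_permissions


print(is_allowed(ONCALL_ROLE, REQUIRED_BEFORE, "rollback"))  # True
print(is_allowed(ONCALL_ROLE, REQUIRED_AFTER, "rollback"))   # False
```

The role itself never changed in this sketch. Only the mapping from action to permission moved, which is why a review that looks at roles in isolation can miss the problem.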

Contributing factors:

- Lack of end-to-end testing for permissions. We did not have a routine check that "the on-call role can execute the full documented rollback path."
- Unclear error messages. The first failure mode surfaced as a generic error in the UI, rather than an explicit "you do not have permission to perform this action" with a link to documentation.
- Tight coupling between auth changes and rollout behavior. A security-focused change months earlier had tightened permissions, but its impact on operational flows was not fully evaluated.

What we changed

1. Align roles with operational responsibilities

We updated roles to match reality:

The primary on-call role now has explicit permissions to:
- adjust traffic splits within configured safe bounds
- trigger rollbacks for active rollouts

High-risk actions (e.g., forcing a rollout past guardrails) remain restricted to a smaller group.

We documented these capabilities in the on-call handbook and in the deployment tool itself.

2. Permissions as part of runbook tests

We added a small but important test to our runbook review process:

- At least once per quarter, an engineer using the standard on-call account walks through the full rollback flow in a non-production environment.
- Any permission errors are treated as regressions and fixed before the next rotation.

This is boring by design. The goal is to catch misalignments before they matter.
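
The walkthrough is manual today. As a rough sketch of what an automated version of the same check could look like, the test below compares the permissions a runbook needs against what a role grants. The runbook steps, permission strings, and role contents are all hypothetical placeholders, not our real configuration.

```python
# Hypothetical runbook: each step is paired with the permission it needs.
ROLLBACK_RUNBOOK = [
    ("view rollout status", "rollout:view"),
    ("shift traffic to previous version", "rollout:traffic-shift"),
    ("trigger rollback", "rollout:rollback"),
]

# Hypothetical permissions granted to the standard on-call role.
ONCALL_ROLE = {
    "deploy:start",
    "rollout:view",
    "rollout:traffic-shift",
    "rollout:rollback",
}


def missing_permissions(role_permissions, runbook):
    """Return the (step, permission) pairs the role cannot perform."""
    return [(step, perm) for step, perm in runbook if perm not in role_permissions]


def test_oncall_can_run_full_rollback():
    gaps = missing_permissions(ONCALL_ROLE, ROLLBACK_RUNBOOK)
    assert gaps == [], f"on-call role cannot complete rollback: {gaps}"
```

A check like this only covers the declared policy. It does not replace the walkthrough, which also exercises the tool, the environment, and the documentation.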

3. Better error messages and paths

We improved the deployment tool’s behavior when a user lacks permission to perform an action:

- The UI now shows a clear error message: which permission is missing and which role(s) include it.
- A small "request access" link creates a ticket with pre-filled details instead of relying on ad-hoc chat messages.

In production incidents, we still want to avoid granting broad new powers on the fly, but at least the path is visible.
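
As an illustration of the kind of message we mean, the sketch below builds a denial that names the missing permission and the roles that include it. The role names, permission names, and the `authorize` helper are hypothetical; the real tool's internals look different.

```python
# Hypothetical role and permission names, for illustration only.
ROLE_PERMISSIONS = {
    "primary-oncall": {"deploy:start", "rollout:view"},
    "rollout-admin": {"deploy:start", "rollout:view", "rollout:traffic-shift"},
}


class PermissionDenied(Exception):
    """Raised with enough detail for the user to see what is missing."""


def authorize(user_role, permission):
    """Return None if the role holds the permission, else raise PermissionDenied."""
    if permission in ROLE_PERMISSIONS.get(user_role, set()):
        return
    # List every role that would have allowed the action.
    granting = sorted(
        role for role, perms in ROLE_PERMISSIONS.items() if permission in perms
    )
    raise PermissionDenied(
        f"Role '{user_role}' lacks permission '{permission}'. "
        f"Roles that include it: {', '.join(granting) or 'none'}."
    )


try:
    authorize("primary-oncall", "rollout:traffic-shift")
except PermissionDenied as err:
    print(err)
```

With the sample data above, this prints: Role 'primary-oncall' lacks permission 'rollout:traffic-shift'. Roles that include it: rollout-admin.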

4. Guardrails on traffic-shift actions

To keep the expanded permissions safe, we added guardrails:

- On-call can only change traffic within a bounded range (e.g., from 10% back to 0% or forward to 25%) without additional approval.
- Larger shifts or disabling guardrails entirely require a different role.

This balances the need for quick rollback against the risk of making major changes accidentally.
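
The bounded-range rule can be sketched as a small check. The 0% and 25% bounds come from the example above; the function name, constant names, and return values are hypothetical.

```python
# Bounds taken from the example in the text; the names are hypothetical.
ONCALL_MIN_PERCENT = 0
ONCALL_MAX_PERCENT = 25


def shift_decision(target_percent):
    """Classify a requested traffic percentage for the on-call role."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    if ONCALL_MIN_PERCENT <= target_percent <= ONCALL_MAX_PERCENT:
        return "allowed"
    # Anything outside the on-call bounds goes through a different role.
    return "needs-approval"


print(shift_decision(0))   # allowed: full rollback
print(shift_decision(25))  # allowed: upper bound for on-call
print(shift_decision(50))  # needs-approval
```

Keeping rollback to 0% inside the allowed range is the important property: the safe direction should never need an approval step.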

5. Access control as part of pre-rollout checks

We extended our pre-rollout checklist to include a simple step:

"Verify that the account expected to manage the rollout can perform both forward and backward actions in the target environment."

This is a small manual test, but it forces us to think about access from the perspective of the person holding the pager.

Follow-ups

Completed

- Updated on-call role permissions for rollout management.
- Improved deployment tool error messages and documentation around authorization failures.
- Added quarterly permission checks for rollback flows in lower environments.

Planned / in progress

- Automate some of the permission checks as part of CI for the deployment tool.
- Introduce a clearer mapping between organizational roles (on-call, incident lead, SRE) and the permissions they require in operational tools.
- Review other critical runbooks for hidden assumptions about access (e.g., database failover, feature-flag changes).

This incident reminded us that "secure" and "usable in an emergency" are not opposing goals. They are the same goal viewed from different angles: the right people should be able to do the right safe thing quickly, and everyone else should be guided or blocked.