Access is a production dependency
During an incident, lack of access looks like downtime. Excess access looks like risk. We treat access like any other production system.
Most teams only notice access when it’s missing.
A system can be healthy and still be inoperable if the on-call engineer can’t reach the logs.
In an incident, “I can’t access that” is not a security concern. It’s downtime.
I’ve seen incidents where the system was recoverable in five minutes, but we spent twenty minutes getting into the right place: waiting for someone with admin access to wake up, hunting for a token, or discovering that the “temporary” vendor login was tied to one person’s 2FA.
The outage clock keeps running while you’re negotiating access.
Constraints
Access control gets worse under predictable pressure:
- onboarding is rushed, so roles are copied instead of designed
- credentials are shared because it’s faster than fixing the process
- “temporary” access becomes permanent
- audit trails are missing, so trust becomes memory
In an incident, missing access is an outage multiplier.
In steady state, excessive access is a silent risk.
Access is also a reliability dependency that doesn’t show up in your error rate. Your service can be healthy and you can still fail to restore it quickly if nobody has the right permissions.
Two things are true at once:
- If nobody can access production, recovery is slow.
- If everybody can access everything, you don’t have boundaries.
Access also rots in ways that don’t show up on dashboards:
- vendor consoles with separate logins
- 2FA devices tied to a single person
- long-lived tokens nobody remembers
- service accounts that quietly become “the admin account”
If your recovery path depends on one person’s phone or one person’s memory, you don’t have an on-call rotation.
What we changed
We treated access as part of the operational surface area.
Inventory the access surface
We wrote down every place an on-call engineer might need to go during an incident: metrics, logs, tracing, deploys, feature flags, queue admin, vendor consoles.
For each surface, we wrote:
- how you get in
- who owns it
- what the least-privilege operational role looks like
This sounds bureaucratic.
It prevents the worst incident pattern: the page is firing, the dashboards exist, but the link drops you on a login screen you can’t get past.
It also tells you where the real risks are: places where “admin” is the only role, where the audit trail is missing, or where the recovery path depends on a single person.
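A minimal sketch of what that inventory can look like as data, assuming it lives in version control next to the runbook; the surface names, fields, and owners are placeholders, not a prescribed schema:

```python
# Hypothetical access inventory, kept in version control next to the runbook.
# Surface names, fields, and owners are placeholders, not a prescribed schema.
ACCESS_SURFACES = [
    {
        "surface": "logs",                 # where on-call needs to go
        "entry": "SSO -> logging vendor",  # how you get in
        "owner": "platform-team",          # who fixes the access path when it breaks
        "oncall_role": "logs-read",        # least-privilege operational role
        "break_glass_needed": False,
    },
    {
        "surface": "feature-flags",
        "entry": "SSO -> flag console",
        "owner": "app-team",
        "oncall_role": "flags-kill-switch",  # can flip kill switches, not edit targeting rules
        "break_glass_needed": True,          # broader edits require deliberate escalation
    },
]

def surfaces_missing_owner(surfaces):
    """Flag entries where nobody owns the access path -- the quiet risks."""
    return [s["surface"] for s in surfaces if not s.get("owner")]
```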
Replace shared credentials
Shared credentials feel like speed.
They are also:
- un-auditable
- hard to rotate
- impossible to attribute
We replaced shared credentials with role-based access where possible.
Define minimal operational roles
We made roles explicit.
Not “engineer” and “admin.”
Operational roles:
- on-call: read-only access to logs/metrics, ability to roll back and turn down risky knobs
- support: ability to look up reference IDs and correlate user reports, without broad write access
- deploy/ops: the small set of write actions needed to operate the system
This lets us answer a concrete question: “what access does an on-call engineer need to do the first ten minutes of work?”
A follow-up question matters just as much: “what should on-call explicitly not be able to do at 2am?”
We want recovery actions to be available, and risky configuration changes to require deliberate escalation.
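One way to keep the roles honest is to write them down as data and review changes the way you review code. A rough sketch, with illustrative permission names that would map onto whatever your IAM or SSO groups actually are:

```python
# Hypothetical operational roles written down as data. The permission strings
# are illustrative; map them onto whatever your IAM / SSO groups actually are.
OPERATIONAL_ROLES = {
    "on-call": {
        "read":  ["logs", "metrics", "traces"],
        "write": ["rollback", "feature-flag-kill-switch"],  # recovery actions only
        "deny":  ["schema-migrations", "billing-config"],   # explicitly out of reach at 2am
    },
    "support": {
        "read":  ["reference-id-lookup"],  # correlate user reports, no raw data dumps
        "write": [],
        "deny":  ["deploys", "queue-admin"],
    },
    "deploy-ops": {
        "read":  ["logs", "metrics"],
        "write": ["deploy", "rollback", "queue-admin"],  # the small set of write actions
        "deny":  [],
    },
}
```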
Add a break-glass path
We introduced a break-glass path: time-bounded elevated access with logging.
The point is not to bypass security.
The point is to avoid the worst pattern: permanent admin access “just in case” because escalation is painful.
Break-glass has three requirements:
- it expires
- it is logged
- it requires a reason (even a short one)
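A minimal sketch of the shape of a break-glass grant, assuming a small internal helper in front of your IAM; `add_to_group` and `schedule_revocation` are hypothetical stubs you would wire to your provider:

```python
import logging
from datetime import datetime, timedelta, timezone

log = logging.getLogger("break_glass")

def add_to_group(user: str, role: str) -> None:
    """Stub: call your IAM / SSO provider here."""

def schedule_revocation(user: str, role: str, at: datetime) -> None:
    """Stub: enqueue a job that removes the grant at `at`."""

def grant_break_glass(user: str, role: str, reason: str,
                      ttl: timedelta = timedelta(hours=1)) -> datetime:
    """Time-bounded elevated access: it expires, it is logged, it needs a reason."""
    if not reason.strip():
        raise ValueError("break-glass requires a reason, even a short one")
    expires_at = datetime.now(timezone.utc) + ttl
    add_to_group(user, role)
    schedule_revocation(user, role, at=expires_at)
    log.warning("break-glass: user=%s role=%s reason=%r expires=%s",
                user, role, reason, expires_at.isoformat())
    return expires_at
```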
Put access in the runbook
We put access paths into the runbook.
If the safe first action requires a permission that only one person has, that’s not a safe first action.
A runbook that says “check the logs” is not done if on-call can’t reach the logs.
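As a rough example (the group name, URL, and break-glass role are placeholders), a runbook step that carries its own access path looks something like this:

```
1. Check ingestion lag on the queue dashboard.
   Access: SSO group oncall-metrics-read -> https://metrics.example.internal/queues
   If the link hits a login wall: break-glass role queue-admin-ro (expires after 1h).
```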
Make access testable
We treat access like we treat backups: something you don’t want to discover is broken during the incident.
On a recurring cadence (monthly or quarterly), we do a quick “access drill”:
- a random on-call engineer verifies they can reach logs, dashboards, and the deploy tool
- we validate break-glass still works and still expires
- we fix whatever has drifted (SSO changes, renamed groups, expired vendor accounts)
If you can’t test access, you don’t really have it.
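Part of the drill can be automated. A sketch, assuming each surface exposes an authenticated endpoint the engineer can probe with their own short-lived token; the URLs are placeholders, and the human check (can you actually read a log line, not just reach the endpoint?) still matters:

```python
# Hypothetical access drill: the on-call engineer runs this with their own
# short-lived token. URLs and the auth scheme are placeholders for your setup.
import sys
import urllib.request

SURFACES = {
    "logs":       "https://logs.example.internal/api/health",
    "dashboards": "https://metrics.example.internal/api/health",
    "deploys":    "https://deploy.example.internal/api/whoami",
}

def can_reach(name: str, url: str, token: str) -> bool:
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status < 400
    except Exception as exc:
        print(f"FAIL {name}: {exc}")
        return False

if __name__ == "__main__":
    token = sys.argv[1]  # the engineer's own token, not a shared credential
    ok = [can_reach(name, url, token) for name, url in SURFACES.items()]
    sys.exit(0 if all(ok) else 1)
```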
Review access like a production system
We added a simple recurring review:
- who has access
- why
- when it expires
We also treat offboarding as an operational event: access removed, tokens rotated, shared secrets eliminated.
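A sketch of that review as a script rather than a meeting, assuming you can export grant records from your IAM or SSO provider; the record format and the entries below are invented for illustration:

```python
from datetime import date

# Hypothetical grant records exported from your IAM / SSO provider;
# the two entries are made up for illustration.
GRANTS = [
    {"user": "alice", "role": "deploy-ops", "reason": "on-call rotation", "expires": date(2025, 3, 1)},
    {"user": "bob",   "role": "admin",      "reason": "",                 "expires": None},
]

def needs_review(grant: dict, today: date) -> bool:
    """A grant needs review if it never expires, has expired, or has no reason."""
    return (grant["expires"] is None
            or grant["expires"] < today
            or not grant["reason"])

for g in [g for g in GRANTS if needs_review(g, date.today())]:
    print(f"review: {g['user']} / {g['role']} (expires={g['expires']}, reason={g['reason']!r})")
```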
Results / Measurements
The goal isn’t perfection. The goal is fewer bad surprises.
We watched for:
- fewer incidents delayed by “who can access this?”
- fewer shared secrets living in chat threads
- fewer “temporary” permissions that never get removed
We also watched how often an incident required someone outside the rotation to join purely to grant access. That’s not escalation. That’s a process failure.
One direct operational metric we tracked:
- time-to-first-log-line during an incident
If it takes ten minutes to get access to logs, you don’t have observability.
A practical metric: time-to-restore should not depend on knowing which person has the right login.
Takeaways
Access is part of uptime.
If you can’t get to the logs quickly, you don’t have observability.
If access is shared, you don’t have accountability.
If recovery depends on a single person’s phone, you don’t have an on-call rotation.