What the Data Shows
We've audited, migrated, and operated infrastructure for over 500 organisations. The teams with the lowest P0 incident rates aren't necessarily using the newest tools. They share five consistent practices.
Practice 1: Feature Flags Over Risky Deploys
The practice most strongly correlated with fewer incidents is decoupling deployment from release. Teams that use feature flags ship more frequently with smaller blast radii. When something goes wrong, they switch the flag off, which reverts the behaviour without a redeploy.
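As a rough illustration of deployment being separate from release, here is a minimal sketch of a flag check. The `FlagStore` class, the `new_discount_engine` flag name, and the 10% discount are all hypothetical; a real team would typically back this with a flag service or config store rather than an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """In-memory stand-in for a flag service (hypothetical, for illustration)."""

    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until released.
        return self.flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def checkout_total(prices: list[float], flags: FlagStore) -> float:
    """Both code paths are deployed; the flag decides which one is released."""
    subtotal = sum(prices)
    if flags.is_enabled("new_discount_engine"):
        # New behaviour: an assumed 10% discount, purely for the example.
        return round(subtotal * 0.9, 2)
    # Existing behaviour, still live in the same deployment.
    return round(subtotal, 2)
```

Calling `flags.set("new_discount_engine", False)` returns every caller to the old path on the next check, which is the "rollback" in this model: no build, no deploy, only a state change.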
Practice 2: Runbooks That Are Actually Maintained
Every incident response playbook starts as aspirational and drifts into fiction. High-reliability teams treat runbooks like code: reviewed, versioned, and tested in quarterly gamedays.
Practice 3: Blast Radius First, Feature Velocity Second
The default engineering instinct is to optimise for shipping speed. Reliability teams think differently: if this fails, what's the maximum customer impact? Architecture decisions flow from that question.
Practice 4: Alert Fatigue Is a Reliability Risk
Teams that receive 1,000+ alerts per day respond to none of them well. The best SRE teams periodically audit every alert: "Has this alert led to a meaningful action in the last 30 days?" If not, it gets removed or downgraded to a lower severity.
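The 30-day question can be turned into a simple report. The sketch below assumes you can export two things: the list of configured alert names, and a log of (alert name, timestamp) pairs for each time someone took a meaningful action in response. The function name and data shapes are assumptions for illustration, not any particular monitoring tool's API.

```python
from datetime import datetime, timedelta


def stale_alerts(alert_names, actions, now, window_days=30):
    """Return alerts with no recorded meaningful action inside the window.

    `actions` is an iterable of (alert_name, acted_at) tuples, where
    `acted_at` is the datetime a responder acted on that alert.
    """
    cutoff = now - timedelta(days=window_days)
    # Collect every alert that prompted at least one action since the cutoff.
    acted_recently = {name for name, acted_at in actions if acted_at >= cutoff}
    # Whatever is configured but not in that set is a candidate for review.
    return sorted(set(alert_names) - acted_recently)
```

The output is a list of candidates for review, not an automatic delete list. An alert for a rare but severe failure may legitimately go 30 days without firing.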
Practice 5: Blameless Post-Mortems That Actually Change Behaviour
Blameless post-mortems are industry-standard advice. Running them well is rare: it means capturing contributing factors, not just immediate causes, and following up on action items. The teams that do this consistently show measurable improvement in MTTR quarter-over-quarter.
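To track that quarter-over-quarter trend, MTTR has to be computed the same way each time. A minimal sketch, assuming MTTR here means mean time to recovery and that each incident is recorded as a (detected_at, resolved_at) pair; the function name and the choice to bucket by detection date are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def mttr_by_quarter(incidents):
    """Mean time to recovery in minutes, keyed by (year, quarter) of detection."""
    durations = defaultdict(list)
    for detected_at, resolved_at in incidents:
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        quarter = (detected_at.month - 1) // 3 + 1
        minutes = (resolved_at - detected_at).total_seconds() / 60
        durations[(detected_at.year, quarter)].append(minutes)
    # Sort keys so the result reads chronologically.
    return {key: sum(vals) / len(vals) for key, vals in sorted(durations.items())}
```

A mean is sensitive to a single long outage, so with only a handful of incidents per quarter it may be worth reviewing the median alongside it before drawing conclusions.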
The Counterintuitive Finding
The teams with the best incident records deploy more frequently than average, not less. Smaller, more frequent changes are easier to diagnose and roll back.