Standard: Systems recover quickly and fail safely
Purpose and Strategic Importance
This standard ensures systems are designed to recover quickly and fail safely, reducing the blast radius of incidents and supporting sustainable, high-velocity delivery. It embeds resilience into the architecture, not just the process.
Aligned to our "Resilience Over Uptime" and "Balance Sustainability with Speed" policies, this standard protects user experience and team wellbeing during failure scenarios. Without it, systems become brittle, outages last longer, and recovery depends on manual intervention.
Strategic Impact
- Improves delivery flow and service continuity
- Reduces incident recovery times and blast radius
- Encourages automation and design for failure
- Reinforces psychological safety by reducing stress during incidents
- Enhances trust in systems and delivery teams
Risks of Not Having This Standard
- High-impact outages due to brittle failure modes
- Increased on-call fatigue and manual firefighting
- Slow recovery and unplanned downtime
- Low trust in the reliability of services
- Systems evolve without resilience considerations
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Recovery depends on individual effort and tacit knowledge. Stressful incident response is common. |
| Process & Governance | No defined recovery patterns or safe failure strategies. |
| Technology & Tools | Systems lack basic automation for detection or recovery. |
| Measurement & Metrics | MTTR and failure containment are not measured. |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams acknowledge the need for safer failure handling. Some lessons from incidents are shared. |
| Process & Governance | Manual rollback steps or monitoring alerts exist, but practices vary between teams. |
| Technology & Tools | Monitoring or alerting covers critical failure scenarios. |
| Measurement & Metrics | Recovery time is manually reviewed post-incident. |
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Teams adopt shared patterns for resilience and failure recovery. |
| Process & Governance | Recovery processes (e.g. failover, auto-restart) are standardised. |
| Technology & Tools | Automated rollback, self-healing scripts, and retry logic are in place. |
| Measurement & Metrics | MTTR, recovery success rate, and time-to-detect are tracked. |
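The retry logic named at Level 3 is typically bounded retries with exponential backoff and jitter, so a struggling dependency is given room to recover rather than being hammered. A minimal sketch of that pattern (the function name and parameters are illustrative, not a prescribed implementation):

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry a transiently failing operation with exponential backoff.

    Bounding the attempts and spacing them out is part of failing
    safely: the caller eventually sees the error instead of the
    system cascading into a retry storm.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # exponential backoff with jitter to de-synchronise retries
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

In practice a mature team would reach for a library (or a service mesh policy) rather than hand-rolling this, but the shape of the behaviour is the same.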
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Incident retros include resilience gaps and improvement plans. |
| Process & Governance | Recovery playbooks are rehearsed and tied to service-level expectations. |
| Technology & Tools | System recovery is validated through continuous tests or chaos engineering. |
| Measurement & Metrics | MTTR and recovery trends drive investment in architectural resilience. |
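The continuous recovery validation described at Level 4 boils down to: deliberately inject a failure, then assert that the automated recovery path restores service. A deliberately tiny illustration of that loop, with every class and method name invented here for the example:

```python
class Service:
    """Toy stand-in for a service with a self-healing watchdog."""

    def __init__(self):
        self.healthy = True

    def inject_failure(self):
        """Chaos step: simulate a crash or dependency outage."""
        self.healthy = False

    def restart(self):
        """The automated recovery action (restart, failover, etc.)."""
        self.healthy = True

    def health_check_and_heal(self):
        """One watchdog pass: probe health, recover if the probe fails."""
        if not self.healthy:
            self.restart()
        return self.healthy


def test_recovers_from_injected_failure():
    svc = Service()
    svc.inject_failure()
    # The test passes only if recovery is automatic -- no human in the loop.
    assert svc.health_check_and_heal()
```

Real chaos experiments target production-like environments and real failure modes (killed pods, severed network links), but the assertion is the same: recovery happens without manual intervention.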
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Teams innovate proactively on safe-to-fail design patterns. |
| Process & Governance | Continuous learning drives evolution of recovery capabilities. |
| Technology & Tools | Systems degrade gracefully and recover autonomously. |
| Measurement & Metrics | Resilience metrics improve year-over-year across systems. |
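Graceful degradation, as described at Level 5, means a failing dependency reduces functionality rather than removing it. A common shape is a fallback to a cheaper, static result; a minimal sketch, with the recommendation scenario and all names chosen purely for illustration:

```python
def get_recommendations(user_id, fetch_personalised, popular_items):
    """Serve personalised results, degrading to a popular-items list.

    If the personalisation backend is down, users still get a useful
    (if generic) response -- partial functionality instead of an error.
    """
    try:
        return fetch_personalised(user_id)
    except Exception:
        return popular_items  # degraded but safe default
```

Production versions usually add a timeout and a circuit breaker around the primary call, so a slow dependency cannot stall the whole request.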
Key Measures
- Mean Time to Recover (MTTR)
- Failure containment rate: % of failures contained without user impact
- Automated recovery coverage: % of systems with automated rollback or self-healing
- Graceful degradation score: Ability to provide partial functionality during failures
- Incident recovery rehearsal frequency: Chaos testing, game days, or recovery simulations