Standard: Systems recover quickly and fail safely
Purpose and Strategic Importance
This standard ensures systems are designed to recover quickly and fail safely, reducing the blast radius of incidents and supporting sustainable, high-velocity delivery. It embeds resilience into the architecture, not just the process.
Aligned to our "Resilience Over Uptime" and "Balance Sustainability with Speed" policies, this standard protects user experience and team wellbeing during failure scenarios. Without it, systems become brittle, outages last longer, and recovery depends on manual intervention.
Strategic Impact
- Improves delivery flow and service continuity
- Reduces incident recovery times and blast radius
- Encourages automation and design for failure
- Reinforces psychological safety by reducing stress during incidents
- Enhances trust in systems and delivery teams
Risks of Not Having This Standard
- High-impact outages due to brittle failure modes
- Increased on-call fatigue and manual firefighting
- Slow recovery and unplanned downtime
- Low trust in the reliability of services
- Systems evolve without resilience considerations
CMMI Maturity Model
Level 1 – Initial
| Category | Description |
| --- | --- |
| People & Culture | Recovery depends on individual effort and tacit knowledge. Stressful incident response is common. |
| Process & Governance | No defined recovery patterns or safe failure strategies. |
| Technology & Tools | Systems lack basic automation for detection or recovery. |
| Measurement & Metrics | MTTR and failure containment are not measured. |
Level 2 – Managed
| Category | Description |
| --- | --- |
| People & Culture | Teams acknowledge the need for safer failure handling. Some lessons from incidents are shared. |
| Process & Governance | Manual rollback steps or monitoring alerts exist, but practices vary between teams. |
| Technology & Tools | Monitoring or alerting covers critical failure scenarios. |
| Measurement & Metrics | Recovery time is manually reviewed post-incident. |
Level 3 – Defined
| Category | Description |
| --- | --- |
| People & Culture | Teams adopt shared patterns for resilience and failure recovery. |
| Process & Governance | Recovery processes (e.g. failover, auto-restart) are standardised. |
| Technology & Tools | Automated rollback, self-healing scripts, and retry logic are in place. |
| Measurement & Metrics | MTTR, recovery success rate, and time-to-detect are tracked. |
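The retry logic named at Level 3 is typically bounded retries with exponential backoff and jitter, so a struggling dependency is given room to recover rather than being hammered. A minimal sketch of that pattern (the function name and parameters are illustrative, not a prescribed implementation):

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry a transiently failing operation with exponential backoff.

    Bounding the attempts and spacing them out is part of failing
    safely: the caller eventually sees the error instead of the
    system cascading into a retry storm.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # exponential backoff with jitter to de-synchronise retries
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

In practice a mature team would reach for a library (or a service mesh policy) rather than hand-rolling this, but the shape of the behaviour is the same.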
Level 4 – Quantitatively Managed
| Category | Description |
| --- | --- |
| People & Culture | Incident retros include resilience gaps and improvement plans. |
| Process & Governance | Recovery playbooks are rehearsed and tied to service-level expectations. |
| Technology & Tools | System recovery is validated through continuous tests or chaos engineering. |
| Measurement & Metrics | MTTR and recovery trends drive investment in architectural resilience. |
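The continuous recovery validation described at Level 4 boils down to: deliberately inject a failure, then assert that the automated recovery path restores service. A deliberately tiny illustration of that loop, with every class and method name invented here for the example:

```python
class Service:
    """Toy stand-in for a service with a self-healing watchdog."""

    def __init__(self):
        self.healthy = True

    def inject_failure(self):
        """Chaos step: simulate a crash or dependency outage."""
        self.healthy = False

    def restart(self):
        """The automated recovery action (restart, failover, etc.)."""
        self.healthy = True

    def health_check_and_heal(self):
        """One watchdog pass: probe health, recover if the probe fails."""
        if not self.healthy:
            self.restart()
        return self.healthy


def test_recovers_from_injected_failure():
    svc = Service()
    svc.inject_failure()
    # The test passes only if recovery is automatic -- no human in the loop.
    assert svc.health_check_and_heal()
```

Real chaos experiments target production-like environments and real failure modes (killed pods, severed network links), but the assertion is the same: recovery happens without manual intervention.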
Level 5 – Optimising
| Category | Description |
| --- | --- |
| People & Culture | Teams innovate proactively on safe-to-fail design patterns. |
| Process & Governance | Continuous learning drives evolution of recovery capabilities. |
| Technology & Tools | Systems degrade gracefully and recover autonomously. |
| Measurement & Metrics | Resilience metrics improve year-over-year across systems. |
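Graceful degradation, as described at Level 5, means a failing dependency reduces functionality rather than removing it. A common shape is a fallback to a cheaper, static result; a minimal sketch, with the recommendation scenario and all names chosen purely for illustration:

```python
def get_recommendations(user_id, fetch_personalised, popular_items):
    """Serve personalised results, degrading to a popular-items list.

    If the personalisation backend is down, users still get a useful
    (if generic) response -- partial functionality instead of an error.
    """
    try:
        return fetch_personalised(user_id)
    except Exception:
        return popular_items  # degraded but safe default
```

Production versions usually add a timeout and a circuit breaker around the primary call, so a slow dependency cannot stall the whole request.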
Key Measures
- Mean Time to Recover (MTTR)
- Failure containment rate: % of failures contained without user impact
- Automated recovery coverage: % of systems with automated rollback or self-healing
- Graceful degradation score: Ability to provide partial functionality during failures
- Incident recovery rehearsal frequency: Chaos testing, game days, or recovery simulations