Standard: Failure modes are proactively tested

Purpose and Strategic Importance

This standard ensures teams proactively test failure modes to build resilience and uncover weaknesses before they impact users. It drives a culture of engineering excellence where systems are designed to handle the unexpected gracefully.

Aligned to our "Post-Incident Learning Culture" policy, this standard reduces downtime, improves confidence in releases, and strengthens operational readiness. Without it, failures are harder to diagnose, more costly to fix, and more likely to erode trust.

Strategic Impact

Improved resilience and fault tolerance across systems
Fewer production incidents and faster time-to-recovery
Stronger architectural awareness and disaster readiness
Greater customer and stakeholder trust

Risks of Not Having This Standard

Failures are first discovered in production rather than in testing
Increased customer impact and incident severity
Reduced confidence in change safety and system design
Slower recovery due to poor preparation and visibility

CMMI Maturity Model

Level 1 – Initial

Category	Description
People & Culture	Teams focus on delivery over resilience. Failure scenarios are rarely discussed.
Process & Governance	No formal practice for testing or reviewing failure modes.
Technology & Tools	Failures occur unexpectedly. No tooling exists to simulate or analyse them.
Measurement & Metrics	Failure data is anecdotal or retrospective only.

Level 2 – Managed

Category	Description
People & Culture	Teams have some awareness of failure scenarios. Ad hoc chaos testing begins.
Process & Governance	Common failure types are listed. Manual testing may occur before go-live.
Technology & Tools	Teams use test scripts or environment toggles to simulate basic errors.
Measurement & Metrics	Basic tracking of test runs and failure response time.
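
As an illustration of the "environment toggles" mentioned at Level 2, the following is a minimal sketch, not a prescribed implementation. The `FAULT_<NAME>` variable convention, the `fetch_profile` service call, and the fallback shape are all hypothetical names chosen for the example.

```python
import os


class InjectedFailure(RuntimeError):
    """Raised in place of a real dependency error when a fault toggle is on."""


def fault_enabled(name, env=None):
    """Return True when the hypothetical FAULT_<NAME> toggle is switched on."""
    env = os.environ if env is None else env
    return env.get(f"FAULT_{name.upper()}", "").lower() in {"1", "true", "on"}


def fetch_profile(user_id, env=None):
    """Hypothetical service call that honours the FAULT_PROFILE_TIMEOUT toggle."""
    if fault_enabled("profile_timeout", env):
        raise InjectedFailure("simulated timeout from profile service")
    return {"id": user_id, "name": "example"}


def profile_or_fallback(user_id, env=None):
    """Caller under test: degrade to a placeholder instead of crashing."""
    try:
        return fetch_profile(user_id, env)
    except InjectedFailure:
        return {"id": user_id, "name": None, "degraded": True}


if __name__ == "__main__":
    # Normal path, then the same call with the fault toggle switched on.
    print(profile_or_fallback(7, env={}))
    print(profile_or_fallback(7, env={"FAULT_PROFILE_TIMEOUT": "1"}))
```

Passing the environment as a parameter keeps the toggle testable without mutating the real process environment. A team could equally read the toggle from a feature-flag service.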

Level 3 – Defined

Category	Description
People & Culture	Resilience is valued and owned across teams. Failure testing is a planned activity.
Process & Governance	Failure scenarios are defined for key services. Runbooks guide response.
Technology & Tools	Injected failure scenarios are reproducible in test and staging environments.
Measurement & Metrics	Failure coverage and incident learnings are measured and used for planning.
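
One possible way to make injected failures reproducible, as Level 3 requires, is to drive the injection decisions from a seeded random generator so that a scenario replays identically in test and staging. The sketch below assumes this approach. `FaultInjector` and `run_scenario` are hypothetical names, not part of any specific chaos tool.

```python
import random


class FaultInjector:
    """Fails a fraction of calls, decided by a seeded random generator."""

    def __init__(self, failure_rate, seed):
        self.failure_rate = failure_rate
        # A private generator keeps the sequence independent of global state.
        self._rng = random.Random(seed)

    def should_fail(self):
        """Return True when the next call should receive an injected failure."""
        return self._rng.random() < self.failure_rate


def run_scenario(seed, calls=20, failure_rate=0.3):
    """Return the indexes of the calls that the injector chose to fail."""
    injector = FaultInjector(failure_rate, seed)
    return [i for i in range(calls) if injector.should_fail()]


if __name__ == "__main__":
    # The same seed produces the same failure pattern on every run.
    print(run_scenario(seed=42) == run_scenario(seed=42))
```

Recording the seed alongside each test run would let a team replay the exact failure sequence when investigating a finding.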

Level 4 – Quantitatively Managed

Category	Description
People & Culture	Teams continuously improve based on failure test results. Resilience is a shared priority.
Process & Governance	Failure testing is part of release gates and continuous delivery.
Technology & Tools	Automated chaos testing is used across core systems. Dependency graphs inform blast radius.
Measurement & Metrics	Frequency, severity, and recovery time from simulated failures are tracked over time.
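
To illustrate how a dependency graph can inform blast radius at Level 4, the sketch below walks the graph in reverse from a failed service to find everything that depends on it, directly or transitively. The service names and the `depends_on` mapping format are assumptions made for the example.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can ask "who depends on this service?".
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent != failed and parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical dependency graph: each service maps to the services it calls.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}

if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "inventory")))
```

In this hypothetical graph, a failure in `inventory` reaches `checkout`, `search`, and `web`, while a failure in `web` affects no other service. That difference could guide which experiments run first and under what safeguards.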

Level 5 – Optimising

Category	Description
People & Culture	Teams share learnings from simulated failures and evolve resilience patterns.
Process & Governance	Failure tests evolve with system complexity. Findings feed architecture, runbooks, and training.
Technology & Tools	Real-time failure simulation is integrated into production-safe environments.
Measurement & Metrics	Trends in failure impact, response, and recovery drive system design and team behaviour.

Key Measures

% of services with tested failure modes
Mean time to detect and recover from injected failures
% of incidents whose failure mode had prior proactive test coverage
Number of teams actively running chaos tests
Frequency of resilience testing per system or platform
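
The first two measures above could be computed along the following lines. This is a sketch under stated assumptions: the record fields (`injected_at`, `detected_at`, `recovered_at`, in seconds) and the function names are hypothetical, and a real implementation would read from whatever experiment log the team keeps.

```python
from statistics import mean


def pct_services_tested(services, tested):
    """Percentage of known services with at least one tested failure mode."""
    known = set(services)
    if not known:
        return 0.0
    # Ignore tested names that are not in the known service list.
    return 100.0 * len(known & set(tested)) / len(known)


def mean_time_to_detect(experiments):
    """Mean seconds from fault injection to detection."""
    return mean(e["detected_at"] - e["injected_at"] for e in experiments)


def mean_time_to_recover(experiments):
    """Mean seconds from fault injection to full recovery."""
    return mean(e["recovered_at"] - e["injected_at"] for e in experiments)


# Hypothetical experiment log; timestamps are seconds since the run started.
EXPERIMENTS = [
    {"injected_at": 0, "detected_at": 20, "recovered_at": 60},
    {"injected_at": 100, "detected_at": 140, "recovered_at": 220},
]

if __name__ == "__main__":
    print(pct_services_tested(["a", "b", "c", "d"], ["a", "c"]))
    print(mean_time_to_detect(EXPERIMENTS))
    print(mean_time_to_recover(EXPERIMENTS))
```

Measuring both intervals from the moment of injection is one design choice. Some teams measure recovery from the moment of detection instead, so the definition should be agreed before the figures are compared across teams.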