This standard ensures teams proactively test failure modes to build resilience and uncover weaknesses before they impact users. It drives a culture of engineering excellence where systems are designed to handle the unexpected gracefully.
Aligned to our "Post-Incident Learning Culture" policy, this standard reduces downtime, improves confidence in releases, and strengthens operational readiness. Without it, failures are harder to diagnose, more costly to fix, and more likely to erode trust.

| Category | Description |
|---|---|
| People & Culture | Teams focus on delivery over resilience. Failure scenarios are rarely discussed. |
| Process & Governance | No formal practice for testing or reviewing failure modes. |
| Technology & Tools | Failures occur unexpectedly. No tooling exists to simulate or analyse them. |
| Measurement & Metrics | Failure data is anecdotal or retrospective only. |

| Category | Description |
|---|---|
| People & Culture | Some awareness of failure scenarios. Ad hoc chaos testing begins. |
| Process & Governance | Common failure types are listed. Manual testing may occur before go-live. |
| Technology & Tools | Teams use test scripts or environment toggles to simulate basic errors. |
| Measurement & Metrics | Basic tracking of test runs and failure response time. |
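
As an illustration of the environment-toggle approach in the Technology & Tools row above, the following is a minimal sketch rather than a prescribed implementation. The variable name `SIMULATE_FAILURE`, the `InjectedFailure` exception, and the `fetch_profile` and `profile_or_fallback` functions are all hypothetical names chosen for this example.

```python
import os


class InjectedFailure(RuntimeError):
    """Raised in place of a real dependency error when the toggle is set."""


def fetch_profile(user_id: str) -> dict:
    """Hypothetical service call that fails on demand.

    SIMULATE_FAILURE is an assumed variable name, not an existing convention.
    """
    mode = os.environ.get("SIMULATE_FAILURE", "")
    if mode == "timeout":
        raise InjectedFailure("simulated timeout from profile service")
    if mode == "unavailable":
        raise InjectedFailure("simulated 503 from profile service")
    # Toggle unset or unrecognised: behave normally.
    return {"id": user_id, "name": "example"}


def profile_or_fallback(user_id: str) -> dict:
    """Caller under test: degrade to a placeholder instead of crashing."""
    try:
        return fetch_profile(user_id)
    except InjectedFailure:
        return {"id": user_id, "name": None, "degraded": True}
```

Setting `SIMULATE_FAILURE=timeout` in a test environment before calling `profile_or_fallback` exercises the fallback path without touching the real dependency. Leaving the variable unset keeps normal behaviour, so the toggle is safe to ship as long as it is never set in production.
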

| Category | Description |
|---|---|
| People & Culture | Resilience is valued and owned across teams. Failure testing is a planned activity. |
| Process & Governance | Failure scenarios are defined for key services. Runbooks guide response. |
| Technology & Tools | Injected failure scenarios are reproducible in test and staging environments. |
| Measurement & Metrics | Failure coverage and incident learnings are measured and used for planning. |
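
One way to make injected failure scenarios reproducible, as the Technology & Tools row above requires, is to derive every injected fault from a fixed seed stored with the scenario. The sketch below assumes a hypothetical `FailureScenario` record and `fault_schedule` helper. It is one possible design, not the only one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    """Hypothetical scenario definition kept alongside the service runbook."""
    name: str
    seed: int            # fixed seed so every run injects the same faults
    failure_rate: float  # fraction of calls that should fail, 0.0 to 1.0


def fault_schedule(scenario: FailureScenario, calls: int) -> list[bool]:
    """Return, per call, whether a fault is injected.

    The result is identical for the same scenario and call count because
    a private random.Random instance is seeded from the scenario itself.
    """
    rng = random.Random(scenario.seed)  # isolated from global random state
    return [rng.random() < scenario.failure_rate for _ in range(calls)]
```

Because the schedule depends only on the scenario, a failure seen in staging can be replayed in a test environment by reusing the same `FailureScenario` values.
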

| Category | Description |
|---|---|
| People & Culture | Teams continuously improve based on failure test results. Resilience is a shared priority. |
| Process & Governance | Failure testing is part of release gates and continuous delivery. |
| Technology & Tools | Automated chaos testing is used across core systems. Dependency graphs inform blast radius. |
| Measurement & Metrics | Frequency, severity, and recovery time from simulated failures are tracked over time. |
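
To illustrate how a dependency graph can inform blast radius, as the Technology & Tools row above describes, the sketch below walks the graph from a failed service to every service that depends on it, directly or transitively. The `blast_radius` function and the graph shape (service name mapped to the set of services it depends on) are assumptions made for this example.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], failed: str) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph: for each dependency, record who depends on it.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted edges.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical example graph.
graph = {
    "web": {"api"},
    "api": {"db", "cache"},
    "reports": {"db"},
    "db": set(),
    "cache": set(),
}
```

With this example graph, `blast_radius(graph, "db")` returns `{"api", "web", "reports"}`, while `blast_radius(graph, "web")` returns an empty set because nothing depends on `web`. A chaos experiment could use the size of this set to decide whether a target is safe to test outside a maintenance window.
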

| Category | Description |
|---|---|
| People & Culture | Teams share learnings from simulated failures and evolve resilience patterns. |
| Process & Governance | Failure tests evolve with system complexity. Findings feed architecture, runbooks, and training. |
| Technology & Tools | Real-time failure simulation is integrated into production-safe environments. |
| Measurement & Metrics | Trends in failure impact, response, and recovery drive system design and team behaviour. |