Standard: Failure modes are proactively tested

Purpose and Strategic Importance

This standard ensures teams proactively test failure modes to build resilience and uncover weaknesses before they impact users. It drives a culture of engineering excellence where systems are designed to handle the unexpected gracefully.

Aligned to our "Post-Incident Learning Culture" policy, this standard reduces downtime, improves confidence in releases, and strengthens operational readiness. Without it, failures are harder to diagnose, more costly to fix, and more likely to erode trust.

Strategic Impact

  • Improved resilience and fault tolerance across systems
  • Fewer production incidents and faster time-to-recovery
  • Stronger architectural awareness and disaster readiness
  • Greater customer and stakeholder trust

Risks of Not Having This Standard

  • Failures are first discovered in production rather than in testing
  • Increased customer impact and incident severity
  • Reduced confidence in change safety and system design
  • Slower recovery due to poor preparation and visibility

CMMI Maturity Model

Level 1 – Initial

Category | Description
People & Culture | Teams focus on delivery over resilience. Failure scenarios are rarely discussed.
Process & Governance | No formal practice for testing or reviewing failure modes.
Technology & Tools | Failures occur unexpectedly. No tooling exists to simulate or analyse them.
Measurement & Metrics | Failure data is anecdotal or retrospective only.

Level 2 – Managed

Category | Description
People & Culture | Some awareness of failure scenarios. Ad hoc chaos testing begins.
Process & Governance | Common failure types are listed. Manual testing may occur before go-live.
Technology & Tools | Teams use test scripts or environment toggles to simulate basic errors.
Measurement & Metrics | Basic tracking of test runs and failure response time.
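
As an illustration of the environment toggles mentioned under Technology & Tools, the sketch below makes a dependency call fail whenever an environment variable is set. The variable name `SIMULATE_PAYMENT_FAILURE`, the exception `PaymentUnavailable`, and the functions `charge_card` and `checkout` are all hypothetical and chosen only for this example; a real service would toggle whichever dependency it needs to exercise.

```python
import os


class PaymentUnavailable(Exception):
    """Raised when the (hypothetical) payment dependency cannot be reached."""


def charge_card(amount_pence: int) -> str:
    """Stand-in for a real dependency call; fails when the toggle is set."""
    # Hypothetical toggle: set SIMULATE_PAYMENT_FAILURE=1 in a test environment.
    if os.environ.get("SIMULATE_PAYMENT_FAILURE") == "1":
        raise PaymentUnavailable("simulated outage")
    return f"charged {amount_pence}"


def checkout(amount_pence: int) -> str:
    """Caller under test: should degrade gracefully rather than crash."""
    try:
        return charge_card(amount_pence)
    except PaymentUnavailable:
        # Fallback path that the failure test is meant to exercise.
        return "payment queued for retry"
```

Running `checkout` once with the variable unset and once with it set to `1` exercises both the normal path and the fallback path without touching a real payment provider.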

Level 3 – Defined

Category | Description
People & Culture | Resilience is valued and owned across teams. Failure testing is a planned activity.
Process & Governance | Failure scenarios are defined for key services. Runbooks guide response.
Technology & Tools | Injected failure scenarios are reproducible in test and staging environments.
Measurement & Metrics | Failure coverage and incident learnings are measured and used for planning.
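
One way to make injected failures reproducible, as the Technology & Tools row describes, is to drive the injection from a seeded random generator so that the same seed always produces the same failure pattern. The sketch below is a minimal illustration under that assumption; `InjectedFault`, `make_flaky`, and `run_scenario` are hypothetical names, and the 30% failure rate is arbitrary.

```python
import random
from typing import Callable, List


class InjectedFault(Exception):
    """Marks a failure that was injected deliberately, not a real defect."""


def make_flaky(func: Callable[[int], int], failure_rate: float,
               seed: int) -> Callable[[int], int]:
    """Wrap func so a seed-determined subset of calls raises InjectedFault."""
    # A private generator: the same seed yields the same failure pattern.
    rng = random.Random(seed)

    def wrapper(x: int) -> int:
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure (seed={seed})")
        return func(x)

    return wrapper


def run_scenario(seed: int, calls: int = 20) -> List[bool]:
    """Run one scenario and record which calls succeeded."""
    flaky_double = make_flaky(lambda x: x * 2, failure_rate=0.3, seed=seed)
    outcomes = []
    for i in range(calls):
        try:
            flaky_double(i)
            outcomes.append(True)
        except InjectedFault:
            outcomes.append(False)
    return outcomes
```

Recording the seed alongside each test run means a scenario that exposed a weakness in staging can be replayed exactly while the fix is verified.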

Level 4 – Quantitatively Managed

Category | Description
People & Culture | Teams continuously improve based on failure test results. Resilience is a shared priority.
Process & Governance | Failure testing is part of release gates and continuous delivery.
Technology & Tools | Automated chaos testing is used across core systems. Dependency graphs inform blast radius.
Measurement & Metrics | Frequency, severity, and recovery time from simulated failures are tracked over time.
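
To show how a dependency graph can inform blast radius, the sketch below walks the graph from a failed service to every service that depends on it, directly or transitively. The graph contents and the names `DEPENDS_ON` and `blast_radius` are hypothetical; in practice the graph would come from service discovery or tracing data rather than a hard-coded dictionary.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency graph: each service maps to the services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["orders", "auth"],
    "orders": ["payments", "db"],
    "payments": ["db"],
    "auth": ["db"],
    "db": [],
    "reports": ["db"],
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers: Dict[str, List[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first search over the inverted graph.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

With this example graph, `blast_radius("payments", DEPENDS_ON)` returns `{"orders", "web"}`, while a failure of `db` reaches every other service. A result like that can help decide which chaos experiments need the tightest safeguards.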

Level 5 – Optimising

Category | Description
People & Culture | Teams share learnings from simulated failures and evolve resilience patterns.
Process & Governance | Failure tests evolve with system complexity. Findings feed architecture, runbooks, and training.
Technology & Tools | Real-time failure simulation is integrated into production-safe environments.
Measurement & Metrics | Trends in failure impact, response, and recovery drive system design and team behaviour.

Key Measures

  • % of services with tested failure modes
  • Mean time to detect and recover from injected failures
  • % of incidents whose failure mode had already been covered by a proactive test
  • Number of teams actively running chaos tests
  • Frequency of resilience testing per system or platform
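
The first two measures above could be computed from a simple log of failure-injection runs. The sketch below assumes a hypothetical record type, `InjectedFailure`, with timestamps in seconds; the field names and helper functions are illustrative and not part of this standard.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Set


@dataclass
class InjectedFailure:
    """One failure-injection run (hypothetical record shape)."""
    service: str
    injected_at: float   # seconds since the start of the exercise
    detected_at: float
    recovered_at: float


def pct_services_tested(all_services: Set[str],
                        runs: List[InjectedFailure]) -> float:
    """Percentage of known services with at least one recorded failure test."""
    if not all_services:
        return 0.0
    tested = {run.service for run in runs} & all_services
    return 100.0 * len(tested) / len(all_services)


def mean_time_to_detect(runs: List[InjectedFailure]) -> float:
    """Average seconds from injection to detection (raises if runs is empty)."""
    return mean(run.detected_at - run.injected_at for run in runs)


def mean_time_to_recover(runs: List[InjectedFailure]) -> float:
    """Average seconds from injection to recovery (raises if runs is empty)."""
    return mean(run.recovered_at - run.injected_at for run in runs)
```

For example, two runs against `orders` and `db` in a four-service estate give 50% coverage. Tracking these values per release makes trends visible over time.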

Associated Policies

  • Post-Incident Learning Culture

Associated Practices

  • Self-Healing Systems
  • Runbooks and Playbooks
  • Health Checks & Readiness Probes
  • Exploratory Testing
  • Test-Driven Development (TDD)
  • Behaviour-Driven Development (BDD)
  • Non-functional Requirement Testing
  • Shadow Testing in Production
  • Mutation Testing
  • End-to-End (E2E) Testing
  • Contract Testing
  • Visual Regression Testing
  • Integration Testing
  • Load & Performance Testing
  • Incident Response Playbooks
