Standard: Systems recover quickly and fail safely

Purpose and Strategic Importance

This standard ensures systems are designed to recover quickly and fail safely, reducing the blast radius of incidents and supporting sustainable, high-velocity delivery. It embeds resilience into the architecture, not just the process.

Aligned to our "Resilience Over Uptime" and "Balance Sustainability with Speed" policies, this standard protects user experience and team wellbeing during failure scenarios. Without it, systems become brittle, outages last longer, and recovery depends on manual intervention.

Strategic Impact

  • Improves delivery flow and service continuity
  • Reduces incident recovery times and blast radius
  • Encourages automation and design for failure
  • Reinforces psychological safety by reducing stress during incidents
  • Enhances trust in systems and delivery teams

Risks of Not Having This Standard

  • High-impact outages due to brittle failure modes
  • Increased on-call fatigue and manual firefighting
  • Slow recovery and unplanned downtime
  • Low trust in the reliability of services
  • Systems evolve without resilience considerations

CMMI Maturity Model

Level 1 – Initial

Category | Description
People & Culture | Recovery depends on individual effort and tacit knowledge. Stressful incident response is common.
Process & Governance | No defined recovery patterns or safe failure strategies.
Technology & Tools | Systems lack basic automation for detection or recovery.
Measurement & Metrics | MTTR and failure containment are not measured.

Level 2 – Managed

Category | Description
People & Culture | Teams acknowledge the need for safer failure handling. Some learnings from incidents are shared.
Process & Governance | Manual rollback steps or monitoring alerts exist but vary between teams.
Technology & Tools | Monitoring or alerting covers critical failure scenarios.
Measurement & Metrics | Recovery time is manually reviewed post-incident.

Level 3 – Defined

Category | Description
People & Culture | Teams adopt shared patterns for resilience and failure recovery.
Process & Governance | Recovery processes (e.g. failover, auto-restart) are standardised.
Technology & Tools | Automated rollback, self-healing scripts, and retry logic are in place.
Measurement & Metrics | MTTR, recovery success rate, and time-to-detect are tracked.
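
The "retry logic" named at this level could look something like the following. This is a minimal sketch, not a prescribed implementation: the function name `retry_with_backoff`, its parameters, and the default delays are all assumptions chosen for illustration. It retries a failing call with capped exponential backoff and full jitter, then re-raises once attempts are exhausted so the caller can fail safely.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,      # hypothetical default
    base_delay: float = 0.1,    # seconds, hypothetical default
    max_delay: float = 2.0,     # seconds, hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on failure with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can fall back.
                raise
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter is a design choice that lets tests record delays instead of waiting. In real use, the `except` clause would usually be narrowed to errors known to be transient.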

Level 4 – Quantitatively Managed

Category | Description
People & Culture | Incident retros include resilience gaps and improvement plans.
Process & Governance | Recovery playbooks are rehearsed and tied to service-level expectations.
Technology & Tools | System recovery is validated through continuous tests or chaos engineering.
Measurement & Metrics | MTTR and recovery trends drive investment in architectural resilience.
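
To illustrate what "validated through continuous tests" might mean in practice, the sketch below injects a fault into a toy service and measures how many supervisor passes it takes to recover. Everything here (`Service`, `supervise`, `run_recovery_experiment`, the tick-based budget) is hypothetical and stands in for real infrastructure; a real chaos experiment would target a running system rather than an in-memory object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Hypothetical service that can be crashed by an injected fault."""
    healthy: bool = True
    restarts: int = 0

    def inject_crash(self) -> None:
        self.healthy = False

    def restart(self) -> None:
        self.healthy = True
        self.restarts += 1


def supervise(service: Service) -> None:
    """One pass of a self-healing loop: restart the service if unhealthy."""
    if not service.healthy:
        service.restart()


def run_recovery_experiment(service: Service, max_ticks: int) -> Optional[int]:
    """Inject a crash, then count supervisor passes until recovery.

    Returns the number of passes taken, or None if the service did not
    recover within max_ticks (the assumed recovery budget).
    """
    service.inject_crash()
    for tick in range(1, max_ticks + 1):
        supervise(service)
        if service.healthy:
            return tick
    return None
```

Run on a schedule, a check like `run_recovery_experiment(service, max_ticks=3) is not None` turns a service-level recovery expectation into a test that fails when self-healing regresses.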

Level 5 – Optimising

Category | Description
People & Culture | Teams innovate proactively on safe-to-fail design patterns.
Process & Governance | Continuous learning drives evolution of recovery capabilities.
Technology & Tools | Systems degrade gracefully and recover autonomously.
Measurement & Metrics | Resilience metrics improve year-over-year across systems.
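
One common way to get graceful degradation with autonomous recovery is a circuit breaker with a fallback. The sketch below is a simplified illustration under assumed names and thresholds (`CircuitBreaker`, `failure_threshold`, `reset_after`); it is not drawn from this standard or from any particular library. After repeated failures it stops calling the failing dependency and serves a fallback, then probes the dependency again once a cooldown has passed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker: open after repeated failures, probe after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,   # hypothetical default
        reset_after: float = 30.0,    # seconds, hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the failing dependency and degrade gracefully.
                return fallback()
            # Cooldown elapsed: let this one call through as a probe.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Open (or re-open) the circuit and restart the cooldown.
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A typical use would wrap a call to a remote dependency as `primary` and a cached or reduced response as `fallback`, so users see partial functionality instead of an error while the dependency is down.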

Key Measures

  • Mean Time to Recover (MTTR)
  • Failure containment rate: % of failures contained without user impact
  • Automated recovery coverage: % of systems with automated rollback or self-healing
  • Graceful degradation score: extent to which a system still provides partial functionality during failures
  • Incident recovery rehearsal frequency: how often chaos tests, game days, or recovery simulations are run
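
The first three measures above can be computed from simple incident and system records. The following is a rough sketch under stated assumptions: the `Incident` record, its field names, and the choice to measure MTTR from detection to recovery are all illustrative, and a team may define these measures differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    detected_at: datetime
    recovered_at: datetime
    user_impact: bool


def mean_time_to_recover(incidents: Sequence[Incident]) -> timedelta:
    """Average of (recovered_at - detected_at) across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.recovered_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def failure_containment_rate(incidents: Sequence[Incident]) -> float:
    """Percentage of failures contained without user impact."""
    if not incidents:
        raise ValueError("at least one incident is required")
    contained = sum(1 for i in incidents if not i.user_impact)
    return 100.0 * contained / len(incidents)


def automated_recovery_coverage(systems: Mapping[str, bool]) -> float:
    """Percentage of systems flagged as having automated rollback or self-healing."""
    if not systems:
        raise ValueError("at least one system is required")
    return 100.0 * sum(systems.values()) / len(systems)
```

For example, two incidents lasting 10 and 30 minutes give an MTTR of 20 minutes, and if only one of them affected users the containment rate is 50%.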

Associated Policies

  • Resilience Over Uptime
  • Psychological Safety First

Associated Practices

  • Runbooks and Playbooks
  • Log Correlation for RCA
  • Chaos Engineering
  • On-Call Rotation Health Checks
  • Health Checks & Readiness Probes
  • Container Security Scanning
  • Vulnerability Management Dashboards
  • Threat Modelling Workshops
  • Data Encryption-in-Transit & at-Rest
  • Threat Intelligence Feeds
  • Secure API Gateways
  • Shadow Testing in Production
  • Load & Performance Testing
  • Operational KPIs for Dev Teams
  • Service Mesh Implementation
  • Twelve-Factor App
  • Design for Failure
  • Observability-Driven Design
  • Immutable Infrastructure
  • Auto-scaling Infrastructure
  • Mocking and Stubbing
  • Evolutionary Architecture
  • Serverless Architecture
  • Event Sourcing
  • Security as Code
  • Deployment Freeze Windows
  • Blue-Green Deployments
  • Canary Releases
  • Feedback Loops from Ops to Dev
  • Real-time Event Streaming
