Standard: Services are restored quickly and safely following failure (MTTR)

Purpose and Strategic Importance

This standard ensures services are restored quickly and safely following failure by measuring and improving Mean Time to Recover (MTTR) — a core DORA metric. It reflects how well teams detect, respond to, and learn from incidents.

Aligned to our "Resilience Over Uptime" and "Post-Incident Learning Culture" policies, this standard builds confidence in recovery, supports system design for failure, and reduces the impact of outages. Without it, teams risk prolonged incidents, fragile systems, and burnout from unplanned work.

Strategic Impact

  • Limits the impact of failures on customers and internal teams
  • Builds trust in the engineering organisation’s reliability and responsiveness
  • Encourages design for failure, observability, and automation
  • Reinforces confidence in continuous delivery and experimentation
  • Reduces on-call burden and incident fatigue

Risks of Not Having This Standard

  • Prolonged outages or data incidents that erode user trust
  • Escalation fatigue and inconsistent recovery actions
  • Delays in root cause analysis and missed learning opportunities
  • Reduced system resilience and confidence in platform capabilities
  • Over-reliance on heroics and manual triage

CMMI Maturity Model

Level 1 – Initial

| Category | Description |
| --- | --- |
| People & Culture | Recovery relies on a few individuals with system knowledge. Heroics and manual effort are common. |
| Process & Governance | Incident response is informal and undocumented. |
| Technology & Tools | No dedicated tooling for alerting, recovery, or MTTR tracking. |
| Measurement & Metrics | MTTR is not measured; no visibility into time-to-recovery. |

Level 2 – Managed

| Category | Description |
| --- | --- |
| People & Culture | Teams begin documenting recovery actions after incidents. |
| Process & Governance | Basic post-incident reviews are conducted. Processes vary across teams. |
| Technology & Tools | Alerting tools may be in place but are inconsistently used. |
| Measurement & Metrics | Recovery time is manually tracked for some high-impact incidents. |

Level 3 – Defined

| Category | Description |
| --- | --- |
| People & Culture | Teams follow defined roles and escalation paths during incidents. |
| Process & Governance | Standardised runbooks and playbooks are created and used. |
| Technology & Tools | On-call rotations, alert routing, and monitoring are in place. |
| Measurement & Metrics | MTTR is captured and reported across services. |

Level 4 – Quantitatively Managed

| Category | Description |
| --- | --- |
| People & Culture | Teams analyse MTTR trends to improve performance. On-call feedback is routinely gathered. |
| Process & Governance | Post-incident reviews are structured, include MTTR data, and are followed by tracked actions. |
| Technology & Tools | Automation supports alert enrichment and common recovery actions. |
| Measurement & Metrics | MTTR, detection time, and resolution steps are measured with dashboards and SLOs. |
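
As an illustration of what analysing MTTR trends against a target might look like at this level, the following is a minimal sketch only. It assumes incidents are available as (detected_at, recovered_at) timestamp pairs; the function names and the 45-minute target are hypothetical and not part of this standard.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def monthly_mttr_minutes(incidents: list[tuple[datetime, datetime]]) -> dict[str, float]:
    """Group (detected_at, recovered_at) pairs by detection month and average them."""
    by_month: dict[str, list[float]] = defaultdict(list)
    for detected_at, recovered_at in incidents:
        minutes = (recovered_at - detected_at).total_seconds() / 60
        by_month[detected_at.strftime("%Y-%m")].append(minutes)
    # Sort by month key so the trend reads chronologically.
    return {month: mean(values) for month, values in sorted(by_month.items())}


def months_breaching_target(trend: dict[str, float], target_minutes: float) -> list[str]:
    """Return the months whose MTTR exceeds the recovery target (an assumed value)."""
    return [month for month, mttr in trend.items() if mttr > target_minutes]


# Illustrative data only: two incidents in January, one in February.
sample_incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 30)),
    (datetime(2024, 2, 3, 9, 0), datetime(2024, 2, 3, 9, 20)),
]
trend = monthly_mttr_minutes(sample_incidents)
breaches = months_breaching_target(trend, target_minutes=45)  # hypothetical target
```

In practice the timestamps would come from the incident management or alerting tool, and the target would be whatever recovery objective the team has agreed for the service.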

Level 5 – Optimising

| Category | Description |
| --- | --- |
| People & Culture | Teams rehearse failure modes through chaos engineering and game days. Learning is widely shared. |
| Process & Governance | Recovery process is part of continuous improvement and resilience planning. |
| Technology & Tools | Proactive failure detection and self-healing capabilities are implemented. |
| Measurement & Metrics | MTTR trends inform architectural investments. Performance improves year-over-year. |

Key Measures

  • Mean Time to Recover (MTTR): Average time from incident detection to full recovery
  • Detection to resolution latency: Granular timing for detection, escalation, response, and fix
  • Runbook usage rate: % of incidents resolved using predefined recovery steps
  • Recovery automation coverage: % of incidents with automated response or rollback
  • Post-incident review rate: % of incidents followed by structured review and learning
  • Stakeholder comms latency: Time between detection and notification to impacted parties
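
The measures above could be derived from incident records along these lines. This is a minimal sketch, not a prescribed implementation: the `Incident` record and its field names are hypothetical, and it assumes each incident carries detection and recovery timestamps plus simple flags for runbook use, automated recovery, and review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    detected_at: datetime
    recovered_at: datetime
    notified_at: Optional[datetime] = None  # first notification to impacted parties
    used_runbook: bool = False
    automated_recovery: bool = False
    reviewed: bool = False


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


def key_measures(incidents: list[Incident]) -> dict:
    """Compute MTTR, the percentage measures, and comms latency for a non-empty list."""
    total = len(incidents)
    notified = [i for i in incidents if i.notified_at is not None]
    return {
        "mttr": mean_duration([i.recovered_at - i.detected_at for i in incidents]),
        "runbook_usage_pct": 100 * sum(i.used_runbook for i in incidents) / total,
        "automation_coverage_pct": 100 * sum(i.automated_recovery for i in incidents) / total,
        "review_rate_pct": 100 * sum(i.reviewed for i in incidents) / total,
        # Comms latency is averaged only over incidents where a notification was sent.
        "comms_latency": (
            mean_duration([i.notified_at - i.detected_at for i in notified])
            if notified else None
        ),
    }


# Illustrative data only.
sample = [
    Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30),
             notified_at=datetime(2024, 3, 1, 10, 5), used_runbook=True, reviewed=True),
    Incident(datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30),
             notified_at=datetime(2024, 3, 8, 14, 15), automated_recovery=True, reviewed=True),
]
measures = key_measures(sample)
```

The per-phase timings in "Detection to resolution latency" would need extra timestamps (for example escalation and response start) that this sketch does not model.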

Associated Policies

  • Resilience Over Uptime
  • Post-Incident Learning Culture

Associated Practices

  • Chaos Engineering
  • Design for Failure
  • Auto-scaling Infrastructure
  • Blue-Green Deployments
  • Canary Releases