Standard: Services are restored quickly and safely following failure (MTTR)

Purpose and Strategic Importance

This standard ensures services are restored quickly and safely following failure by measuring and improving Mean Time to Recover (MTTR), a core DORA metric. MTTR reflects how well teams detect, respond to, and learn from incidents.

Aligned to our "Resilience Over Uptime" and "Post-Incident Learning Culture" policies, this standard builds confidence in recovery, supports system design for failure, and reduces the impact of outages. Without it, teams risk prolonged incidents, fragile systems, and burnout from unplanned work.

Strategic Impact

Limits the impact of failures on customers and internal teams
Builds trust in the engineering organisation’s reliability and responsiveness
Encourages design for failure, observability, and automation
Reinforces confidence in continuous delivery and experimentation
Reduces on-call burden and incident fatigue

Risks of Not Having This Standard

Prolonged outages or data incidents that erode user trust
Escalation fatigue and inconsistent recovery actions
Delays in root cause analysis and missed learning opportunities
Reduced system resilience and confidence in platform capabilities
Over-reliance on heroics and manual triage

CMMI Maturity Model

Level 1 – Initial

Category	Description
People & Culture	Recovery relies on a few individuals with system knowledge. Heroics and manual effort are common.
Process & Governance	Incident response is informal and undocumented.
Technology & Tools	No dedicated tooling for alerting, recovery, or MTTR tracking.
Measurement & Metrics	MTTR is not measured; no visibility into time-to-recovery.

Level 2 – Managed

Category	Description
People & Culture	Teams begin documenting recovery actions after incidents.
Process & Governance	Basic post-incident reviews are conducted. Processes vary across teams.
Technology & Tools	Alerting tools may be in place but are inconsistently used.
Measurement & Metrics	Recovery time is manually tracked for some high-impact incidents.

Level 3 – Defined

Category	Description
People & Culture	Teams follow defined roles and escalation paths during incidents.
Process & Governance	Standardised runbooks and playbooks are created and used.
Technology & Tools	On-call rotations, alert routing, and monitoring are in place.
Measurement & Metrics	MTTR is captured and reported across services.

Level 4 – Quantitatively Managed

Category	Description
People & Culture	Teams analyse MTTR trends to improve performance. On-call feedback is routinely gathered.
Process & Governance	Post-incident reviews are structured, include MTTR data, and are followed by tracked actions.
Technology & Tools	Automation supports alert enrichment and common recovery actions.
Measurement & Metrics	MTTR, detection time, and resolution steps are measured with dashboards and SLOs.

Level 5 – Optimising

Category	Description
People & Culture	Teams rehearse failure modes through chaos engineering and game days. Learning is widely shared.
Process & Governance	Recovery process is part of continuous improvement and resilience planning.
Technology & Tools	Proactive failure detection and self-healing capabilities are implemented.
Measurement & Metrics	MTTR trends inform architectural investments. Recovery performance improves year over year.

Key Measures

Mean Time to Recover (MTTR): Average time from incident detection to full recovery
Detection-to-resolution latency: Time spent in each phase of an incident (detection, escalation, response, and fix)
Runbook usage rate: % of incidents resolved using predefined recovery steps
Recovery automation coverage: % of incidents with automated response or rollback
Post-incident review rate: % of incidents followed by structured review and learning
Stakeholder comms latency: Time between detection and notification to impacted parties
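
As an illustration of how several of the measures above could be derived from incident records, here is a minimal Python sketch. It is not a prescribed implementation: the Incident record, its field names, and the sample timestamps are all hypothetical and made up for the example. It assumes each incident carries a detection time, a recovery time, an optional first-notification time, and three yes/no flags.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    detected_at: datetime
    recovered_at: datetime
    notified_at: Optional[datetime] = None  # first notice to impacted parties
    used_runbook: bool = False              # resolved via predefined steps
    automated_recovery: bool = False        # automated response or rollback
    reviewed: bool = False                  # structured review was held


def mean_duration(durations):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def percentage(flags):
    """Share of True values in a non-empty list, as a percentage."""
    return 100.0 * sum(flags) / len(flags)


def key_measures(incidents):
    """Compute the measures for one reporting window of incidents."""
    if not incidents:
        raise ValueError("no incidents in the reporting window")
    notified = [i for i in incidents if i.notified_at is not None]
    return {
        "mttr": mean_duration(
            [i.recovered_at - i.detected_at for i in incidents]
        ),
        "runbook_usage_pct": percentage([i.used_runbook for i in incidents]),
        "automation_coverage_pct": percentage(
            [i.automated_recovery for i in incidents]
        ),
        "review_rate_pct": percentage([i.reviewed for i in incidents]),
        # Comms latency only counts incidents where a notification was sent.
        "mean_comms_latency": mean_duration(
            [i.notified_at - i.detected_at for i in notified]
        ) if notified else None,
    }


# Made-up sample data: one 30-minute and one 90-minute incident.
incidents = [
    Incident(datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 30),
             notified_at=datetime(2024, 1, 8, 10, 5),
             used_runbook=True, automated_recovery=True, reviewed=True),
    Incident(datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 30),
             notified_at=datetime(2024, 1, 20, 14, 15),
             used_runbook=True),
]

measures = key_measures(incidents)
print(measures["mttr"])  # 1:00:00
```

The per-phase breakdown (detection, escalation, response, fix) is left out of the sketch because it would need additional timestamps per incident. A plain mean is also sensitive to a single long outage, so teams may want to report a median or percentile alongside it.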