This standard ensures services are restored quickly and safely after a failure by measuring and improving Mean Time to Recover (MTTR), a core DORA metric. MTTR reflects how well teams detect, respond to, and learn from incidents.
Aligned to our "Resilience Over Uptime" and "Post-Incident Learning Culture" policies, this standard builds confidence in recovery, supports system design for failure, and reduces the impact of outages. Without it, teams risk prolonged incidents, fragile systems, and burnout from unplanned work.
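
As a rough illustration of the measurement this standard asks for, the sketch below computes MTTR as the average time between the start of a failure and the restoration of service. The `Incident` record and its field names are hypothetical, not a prescribed schema, and teams may choose a different start point (for example, time of detection).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; adapt to whatever the incident tool exports.
    service: str
    started_at: datetime   # when the failure began
    restored_at: datetime  # when service was restored


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Average of (restored_at - started_at) across the given incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum((i.restored_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


incidents = [
    Incident("checkout", datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),
    Incident("checkout", datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),
]
mttr = mean_time_to_recover(incidents)
print(mttr)  # 1:00:00 (30 minutes and 90 minutes average to one hour)
```

A mean is sensitive to a single long outage, so some teams may prefer to report a median or percentile alongside it.
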

| Category | Description |
|---|---|
| People & Culture | Recovery relies on a few individuals with system knowledge. Heroics and manual effort are common. |
| Process & Governance | Incident response is informal and undocumented. |
| Technology & Tools | No dedicated tooling for alerting, recovery, or MTTR tracking. |
| Measurement & Metrics | MTTR is not measured; no visibility into time-to-recovery. |

| Category | Description |
|---|---|
| People & Culture | Teams begin documenting recovery actions after incidents. |
| Process & Governance | Basic post-incident reviews are conducted. Processes vary across teams. |
| Technology & Tools | Alerting tools may be in place but are inconsistently used. |
| Measurement & Metrics | Recovery time is manually tracked for some high-impact incidents. |

| Category | Description |
|---|---|
| People & Culture | Teams follow defined roles and escalation paths during incidents. |
| Process & Governance | Standardised runbooks and playbooks are created and used. |
| Technology & Tools | On-call rotations, alert routing, and monitoring are in place. |
| Measurement & Metrics | MTTR is captured and reported across services. |

| Category | Description |
|---|---|
| People & Culture | Teams analyse MTTR trends to improve performance. On-call feedback is routinely gathered. |
| Process & Governance | Post-incident reviews are structured, include MTTR data, and are followed by tracked actions. |
| Technology & Tools | Automation supports alert enrichment and common recovery actions. |
| Measurement & Metrics | MTTR, detection time, and resolution steps are tracked on dashboards and measured against SLOs. |
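
One way to separate detection time from total recovery time and check recovery against an SLO is sketched below. The `IncidentTimeline` fields and the 60-minute target are assumptions chosen for illustration; the standard itself does not prescribe a target.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    # Hypothetical timestamps; real tooling may name or derive these differently.
    started_at: datetime
    detected_at: datetime
    restored_at: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_recover(self) -> timedelta:
        return self.restored_at - self.started_at


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def slo_compliance(timelines: list[IncidentTimeline], target: timedelta) -> float:
    """Fraction of incidents whose recovery time was within the target."""
    if not timelines:
        raise ValueError("compliance is undefined without incidents")
    met = sum(1 for t in timelines if t.time_to_recover <= target)
    return met / len(timelines)


start = datetime(2024, 5, 1, 12, 0)
timelines = [
    IncidentTimeline(start, start + timedelta(minutes=5), start + timedelta(minutes=20)),
    IncidentTimeline(start, start + timedelta(minutes=10), start + timedelta(minutes=45)),
    IncidentTimeline(start, start + timedelta(minutes=30), start + timedelta(minutes=120)),
]

mean_detect = mean_delta([t.time_to_detect for t in timelines])
compliance = slo_compliance(timelines, target=timedelta(minutes=60))
print(mean_detect)          # 0:15:00
print(f"{compliance:.0%}")  # 67%
```

Reporting detection time separately shows whether slow recoveries stem from late alerts or from the recovery work itself.
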

| Category | Description |
|---|---|
| People & Culture | Teams rehearse failure modes through chaos engineering and game days. Learning is widely shared. |
| Process & Governance | The recovery process is part of continuous improvement and resilience planning. |
| Technology & Tools | Proactive failure detection and self-healing capabilities are implemented. |
| Measurement & Metrics | MTTR trends inform architectural investments. Recovery performance improves year over year. |
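
A year-over-year MTTR trend could be derived along the following lines. This is a minimal sketch that assumes incidents are available as `(date, recovery duration)` pairs; the function names and sample data are illustrative only.

```python
from collections import defaultdict
from datetime import date, timedelta


def mttr_by_year(records):
    """Group (incident_date, recovery_duration) pairs by year and average each group."""
    buckets = defaultdict(list)
    for day, duration in records:
        buckets[day.year].append(duration)
    return {
        year: sum(durations, timedelta()) / len(durations)
        for year, durations in sorted(buckets.items())
    }


def year_over_year_change(yearly):
    """Relative MTTR change versus the preceding year in the data.

    Negative values mean recovery got faster. Assumes no year has a zero MTTR.
    """
    years = sorted(yearly)
    return {
        cur: (yearly[cur] - yearly[prev]) / yearly[prev]
        for prev, cur in zip(years, years[1:])
    }


# Illustrative sample data, not real incident history.
records = [
    (date(2022, 2, 10), timedelta(minutes=120)),
    (date(2022, 9, 3), timedelta(minutes=60)),
    (date(2023, 4, 18), timedelta(minutes=60)),
    (date(2023, 11, 7), timedelta(minutes=30)),
]

yearly = mttr_by_year(records)
changes = year_over_year_change(yearly)
print(yearly[2022], yearly[2023])  # 1:30:00 0:45:00
print(f"{changes[2023]:+.0%}")     # -50%
```
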