Standard: Changes are introduced with minimal failures and maximum resilience (CFR)

Purpose and Strategic Importance

This standard ensures that changes are introduced with minimal failures and maximum resilience by measuring and managing Change Failure Rate (CFR), a core DORA metric. It enables high-velocity delivery without compromising quality, stability, or trust.

Aligned to our "Resilience Over Uptime" and "Secure by Design" policies, this standard drives investment in robust testing, observability, and safe deployment practices. Without it, changes introduce unmeasured risk, erode confidence, and limit the ability to innovate at pace.

Strategic Impact

Increases delivery confidence and system stability
Encourages robust testing, observability, and rollback strategies
Enhances reliability across software, data, and infrastructure
Builds trust with stakeholders and users through quality and resilience
Supports high deployment frequency without accumulating operational risk

Risks of Not Having This Standard

Increased operational incidents due to fragile changes
Hidden bugs, regressions, or data issues impact users and downstream teams
Diminished trust in the reliability of engineering delivery
Platform teams become bottlenecks due to fear of risk
A high CFR inhibits experimentation and continuous improvement

CMMI Maturity Model

Level 1 – Initial

Category	Description
People & Culture	Change failures are seen as individual mistakes. There is a culture of blame or silence.
Process & Governance	Failures are not systematically logged. Post-incident learning is informal or skipped.
Technology & Tools	No automated rollback or failure detection. Monitoring is reactive and fragmented.
Measurement & Metrics	CFR is not tracked. No agreed definition of what counts as a failed change.

Level 2 – Managed

Category	Description
People & Culture	Teams acknowledge change-related incidents and start to share learnings. Safe dialogue is emerging.
Process & Governance	Some post-mortems occur after major failures. Processes vary by team or severity.
Technology & Tools	Rollbacks or patches are possible but often manual. Basic monitoring flags regressions.
Measurement & Metrics	Teams start logging incidents and linking to deployments. Failure tracking is ad hoc.

Level 3 – Defined

Category	Description
People & Culture	CFR is a shared accountability. Teams reflect on quality and resilience in retros.
Process & Governance	A clear definition of failed change is established. Quality reviews include CFR trends.
Technology & Tools	Staging, feature flags, and rollback automation are applied to mitigate impact.
Measurement & Metrics	CFR is tracked as a delivery metric. Trends are reviewed and discussed regularly.

Level 4 – Quantitatively Managed

Category	Description
People & Culture	CFR metrics are used to guide investment in quality and reliability. Teams are proactive.
Process & Governance	CFR thresholds trigger deeper reviews. Learning is shared across the org.
Technology & Tools	Failures trigger automated remediation steps. Deployments are validated pre- and post-release.
Measurement & Metrics	CFR dashboards exist. Root causes are categorised. Remediation time is measured.
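
As an illustration of the Level 4 practice "CFR thresholds trigger deeper reviews", the sketch below flags teams whose CFR exceeds a review threshold. The 15% threshold, the function name, and the per-team input shape are all assumptions made for this example; this standard does not prescribe a threshold value.

```python
from typing import Dict, List

# Hypothetical threshold: this standard does not define a specific value.
REVIEW_THRESHOLD_PERCENT = 15.0


def teams_needing_review(
    cfr_by_team: Dict[str, float],
    threshold: float = REVIEW_THRESHOLD_PERCENT,
) -> List[str]:
    """Return, in alphabetical order, the teams whose CFR (%) is above the threshold."""
    return sorted(team for team, cfr in cfr_by_team.items() if cfr > threshold)


# Example with made-up team names and values.
flagged = teams_needing_review({"payments": 22.5, "search": 4.0, "data": 15.0})
```

The comparison is strictly greater-than, so a team sitting exactly on the threshold is not flagged; whether the boundary should be inclusive is a policy decision left to the teams adopting this standard.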

Level 5 – Optimising

Category	Description
People & Culture	Teams champion resilience culture. CFR drives experiments and quality improvements.
Process & Governance	CFR data informs policy and controls. Resilience playbooks evolve through shared learning.
Technology & Tools	Failure insights inform platform evolution. Adaptive tooling supports safer changes.
Measurement & Metrics	CFR trends influence architectural decisions. Patterns drive cross-cutting initiatives.

Key Measures

Change Failure Rate (%): percentage of production changes that result in a P1/P2 incident, a material degradation in user experience, or a security vulnerability or misconfiguration (see "A Failed Change")
Number of rollbacks or patches per change
Post-release incident rate attributable to recent changes
Number of deployments with follow-up remediation activity
Time from change deployment to detection of failure
Incident root causes linked to recent changes
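
A minimal sketch of how two of the measures above could be computed from deployment records: Change Failure Rate, and time from deployment to detection of failure. The `ChangeRecord` type and its field names are hypothetical; in practice the data would come from whatever deployment and incident tooling a team already uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical record of one production change."""
    change_id: str
    deployed_at: datetime
    failed: bool = False
    failure_detected_at: Optional[datetime] = None


def change_failure_rate(changes: Sequence[ChangeRecord]) -> Optional[float]:
    """CFR as a percentage: failed changes / all production changes * 100.

    Returns None when there are no changes, since the rate is undefined.
    """
    if not changes:
        return None
    failed = sum(1 for c in changes if c.failed)
    return 100.0 * failed / len(changes)


def mean_time_to_detection(changes: Sequence[ChangeRecord]) -> Optional[timedelta]:
    """Mean time from deployment to failure detection, over failed changes only."""
    delays = [
        c.failure_detected_at - c.deployed_at
        for c in changes
        if c.failed and c.failure_detected_at is not None
    ]
    if not delays:
        return None
    return sum(delays, timedelta()) / len(delays)


# Example with made-up data: four changes, one of which failed.
t0 = datetime(2024, 1, 1, 12, 0)
records = [
    ChangeRecord("c1", t0),
    ChangeRecord("c2", t0, failed=True, failure_detected_at=t0 + timedelta(minutes=30)),
    ChangeRecord("c3", t0),
    ChangeRecord("c4", t0),
]
cfr = change_failure_rate(records)             # 25.0
detection = mean_time_to_detection(records)    # timedelta(minutes=30)
```

Whether a change counts as failed is decided outside this sketch, using the definition in "A Failed Change".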

A Failed Change

A failed change is one that introduces:

A P1 or P2 incident requiring immediate or urgent response
A material degradation in user experience (e.g. broken functionality, slowness, or data issues)
A security vulnerability or misconfiguration that increases system risk

This definition applies across software deployments, infrastructure changes, data platform updates, and operational configuration changes.
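
The definition above can be expressed as a simple predicate, which may help teams apply it consistently when linking incidents to deployments. This is a sketch only: the `ChangeOutcome` type, its field names, and the priority labels as strings are assumptions for illustration, and judging what counts as "material" degradation still requires human review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeOutcome:
    """Hypothetical summary of what a change introduced after release."""
    incident_priority: Optional[str] = None   # e.g. "P1", "P2", "P3", or None
    material_ux_degradation: bool = False     # broken functionality, slowness, data issues
    security_risk_introduced: bool = False    # vulnerability or misconfiguration


def is_failed_change(outcome: ChangeOutcome) -> bool:
    """True if the change meets any of the three failure criteria."""
    return (
        outcome.incident_priority in {"P1", "P2"}
        or outcome.material_ux_degradation
        or outcome.security_risk_introduced
    )
```

For example, `is_failed_change(ChangeOutcome(incident_priority="P3"))` is `False`, while a change with no incident but `security_risk_introduced=True` still counts as failed.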