Standard: Changes are introduced with minimal failures and maximum resilience (CFR)

Purpose and Strategic Importance

This standard ensures changes are introduced with minimal failures and maximum resilience by measuring and managing Change Failure Rate (CFR), a core DORA metric. It enables high-velocity delivery without compromising quality, stability, or trust.

Aligned to our "Resilience Over Uptime" and "Secure by Design" policies, this standard drives investment in robust testing, observability, and safe deployment practices. Without it, change introduces risk blindly, erodes confidence, and limits the ability to innovate at pace.

Strategic Impact

  • Increases delivery confidence and system stability
  • Encourages robust testing, observability, and rollback strategies
  • Enhances reliability across software, data, and infrastructure
  • Builds trust with stakeholders and users through quality and resilience
  • Supports high deployment frequency without accumulating operational risk

Risks of Not Having This Standard

  • Increased operational incidents due to fragile changes
  • Hidden bugs, regressions, or data issues impact users and downstream teams
  • Diminished trust in the reliability of engineering delivery
  • Platform teams become bottlenecks due to fear of risk
  • Poor CFR inhibits experimentation and continuous improvement

CMMI Maturity Model

Level 1 – Initial

Category | Description
People & Culture | Change failures are seen as individual mistakes. There is a culture of blame or silence.
Process & Governance | Failures are not systematically logged. Post-incident learning is informal or skipped.
Technology & Tools | No automated rollback or failure detection. Monitoring is reactive and fragmented.
Measurement & Metrics | CFR is not tracked. No agreed definition of what counts as a failed change.

Level 2 – Managed

Category | Description
People & Culture | Teams acknowledge change-related incidents and start to share learnings. Safe dialogue is emerging.
Process & Governance | Some post-mortems occur after major failures. Processes vary by team or severity.
Technology & Tools | Rollbacks or patches are possible but often manual. Basic monitoring flags regressions.
Measurement & Metrics | Teams start logging incidents and linking to deployments. Failure tracking is ad hoc.

Level 3 – Defined

Category | Description
People & Culture | CFR is a shared accountability. Teams reflect on quality and resilience in retros.
Process & Governance | A clear definition of failed change is established. Quality reviews include CFR trends.
Technology & Tools | Staging, feature flags, and rollback automation are applied to mitigate impact.
Measurement & Metrics | CFR is tracked as a delivery metric. Trends are reviewed and discussed regularly.
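
As one illustration of tracking CFR as a trend at this level, the sketch below groups changes by ISO week and reports a CFR percentage per week. It assumes each change is available as a (deployment date, failed) pair; the function name `weekly_cfr` and the week label format are hypothetical choices, not something this standard prescribes.

```python
from collections import defaultdict
from datetime import date


def weekly_cfr(changes: list[tuple[date, bool]]) -> dict[str, float]:
    """Map an ISO week label (for example '2024-W03') to a CFR percentage."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for deployed_on, failed in changes:
        iso = deployed_on.isocalendar()          # (ISO year, ISO week, weekday)
        week = f"{iso[0]}-W{iso[1]:02d}"
        totals[week] += 1
        failures[week] += int(failed)
    # Weeks with no changes are simply absent from the result.
    return {week: 100.0 * failures[week] / totals[week] for week in sorted(totals)}
```

A team could feed the output into a regular quality review and look at the direction of travel over several weeks rather than at any single value.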

Level 4 – Quantitatively Managed

Category | Description
People & Culture | CFR metrics are used to guide investment in quality and reliability. Teams are proactive.
Process & Governance | CFR thresholds trigger deeper reviews. Learning is shared across the org.
Technology & Tools | Failures trigger automated remediation steps. Deployments are validated pre- and post-release.
Measurement & Metrics | CFR dashboards exist. Root causes are categorised. Remediation time is measured.
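
The idea of a CFR threshold triggering a deeper review could be sketched as a rolling window over the most recent changes. The class name `CfrThresholdMonitor`, the window size, and the threshold value are all illustrative assumptions; the standard does not specify them.

```python
from collections import deque


class CfrThresholdMonitor:
    """Track the most recent changes and flag when CFR exceeds a threshold."""

    def __init__(self, window_size: int = 20, threshold_pct: float = 15.0):
        # Both defaults are illustrative assumptions, not values from the standard.
        self.threshold_pct = threshold_pct
        self._outcomes = deque(maxlen=window_size)  # True means the change failed

    def record(self, failed: bool) -> None:
        """Add the outcome of one production change; the oldest drops out when full."""
        self._outcomes.append(failed)

    def cfr_pct(self) -> float:
        """CFR over the current window, as a percentage."""
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)

    def needs_deeper_review(self) -> bool:
        """Only signal once the window is full, to avoid noise from tiny samples."""
        full = len(self._outcomes) == self._outcomes.maxlen
        return full and self.cfr_pct() > self.threshold_pct
```

Waiting for a full window before signalling is a design choice in this sketch: with only a handful of changes, a single failure would otherwise push the percentage over almost any threshold.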

Level 5 – Optimising

Category | Description
People & Culture | Teams champion resilience culture. CFR drives experiments and quality improvements.
Process & Governance | CFR data informs policy and controls. Resilience playbooks evolve through shared learning.
Technology & Tools | Failure insights inform platform evolution. Adaptive tooling supports safer changes.
Measurement & Metrics | CFR trends influence architectural decisions. Patterns drive cross-cutting initiatives.

Key Measures

  • Change Failure Rate (%): percentage of production changes that result in a P1/P2 incident, severe degradation, or security breach
  • Number of rollbacks or patches per change
  • Post-release incident rate attributable to recent changes
  • Number of deployments with follow-up remediation activity
  • Time from change deployment to detection of failure
  • Incident root causes linked to recent changes
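
A minimal sketch of how two of the measures above, Change Failure Rate and time from deployment to detection, might be computed. It assumes each production change is logged with a deployment time, a flag for whether it failed, and a detection time for failures. The `Change` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Change:
    """Hypothetical record of one production change."""
    change_id: str
    deployed_at: datetime
    failed: bool = False                     # met the failed-change definition
    detected_at: Optional[datetime] = None   # when the failure was noticed


def change_failure_rate(changes: list[Change]) -> float:
    """Return CFR as the percentage of production changes that failed."""
    if not changes:
        return 0.0  # assumption: with no changes, report 0 rather than raise
    failed = sum(1 for c in changes if c.failed)
    return 100.0 * failed / len(changes)


def mean_time_to_detect(changes: list[Change]) -> Optional[timedelta]:
    """Average time from deployment to detection across failed changes."""
    gaps = [c.detected_at - c.deployed_at
            for c in changes
            if c.failed and c.detected_at is not None]
    if not gaps:
        return None  # nothing failed, or no detection times were recorded
    return sum(gaps, timedelta()) / len(gaps)
```

For example, four changes of which two failed give a CFR of 50%; if those failures were detected 30 and 10 minutes after deployment, the mean detection time is 20 minutes.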

A Failed Change

A failed change is one that introduces:

  • A P1 or P2 incident requiring immediate or urgent response
  • A material degradation in user experience (e.g. broken functionality, slowness, or data issues)
  • A security vulnerability or misconfiguration that increases system risk

This applies across software deployments, infrastructure changes, data platform updates, and operational config changes.
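
The definition above can be expressed as a simple predicate, which helps teams apply it consistently when tagging changes. This is a sketch only: the `ChangeOutcome` record, its fields, and the `Priority` enum are hypothetical, and deciding whether a degradation is "material" remains a human judgement captured here as a boolean.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


@dataclass
class ChangeOutcome:
    """Hypothetical summary of what was observed after a change."""
    incident_priority: Optional[Priority] = None  # highest-priority incident raised, if any
    material_ux_degradation: bool = False         # broken functionality, slowness, data issues
    security_risk_introduced: bool = False        # vulnerability or misconfiguration


def is_failed_change(outcome: ChangeOutcome) -> bool:
    """Apply the three criteria: P1/P2 incident, material degradation, or security risk."""
    urgent_incident = outcome.incident_priority in (Priority.P1, Priority.P2)
    return (urgent_incident
            or outcome.material_ux_degradation
            or outcome.security_risk_introduced)
```

The same predicate applies whether the change was a software deployment, an infrastructure change, a data platform update, or an operational config change.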

Associated Policies

  • Resilience Over Uptime
  • Secure by Design
  • Post-Incident Learning Culture

Associated Practices

  • Self-Healing Systems
  • Chaos Engineering
  • Design for Failure
  • Auto-scaling Infrastructure
  • Static Code Analysis
  • Blue-Green Deployments
  • Canary Releases
