Standard: Failure patterns are used to inform architectural investment

Purpose and Strategic Importance

This standard ensures that patterns from failures are systematically analysed and used to guide architectural decisions. It turns operational pain into long-term improvement, enabling teams to invest in resilience where it matters most.

Aligned to our "Post-Incident Learning Culture" policy, this standard promotes smarter design, reduces repeat failures, and supports continuous learning. Without it, teams risk fixing symptoms instead of causes, which slows progress and weakens system reliability.
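
To illustrate investing in resilience where it matters most, the Python sketch below ranks architectural components by a severity-weighted count of the incidents attributed to them. It assumes incidents are already tagged with a component and a severity. The `Incident` class, the component names and the `SEVERITY_WEIGHT` values are hypothetical placeholders, not part of this standard.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity weights; replace with values that reflect your own severity scale.
SEVERITY_WEIGHT = {"sev1": 10, "sev2": 5, "sev3": 1}


@dataclass
class Incident:
    component: str  # architectural component the incident was attributed to (hypothetical field)
    severity: str   # one of the keys in SEVERITY_WEIGHT


def rank_components(incidents: list[Incident]) -> list[tuple[str, int]]:
    """Rank components by severity-weighted incident count, highest first."""
    scores: dict[str, int] = defaultdict(int)
    for incident in incidents:
        scores[incident.component] += SEVERITY_WEIGHT[incident.severity]
    # Sort by score descending, then by name so ties come out in a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


incidents = [
    Incident("payments-api", "sev1"),
    Incident("payments-api", "sev3"),
    Incident("search", "sev2"),
    Incident("search", "sev2"),
    Incident("notifications", "sev3"),
]
for component, score in rank_components(incidents):
    print(component, score)  # payments-api 11, then search 10, then notifications 1
```

Weighting by severity means one critical outage can outrank several minor ones. If severity data is unreliable, a plain count (every weight set to 1) is a reasonable place to start.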

Strategic Impact

  • Improved consistency and quality across teams
  • Reduced operational friction and delivery risks
  • Stronger ownership and autonomy in technical decision-making
  • More inclusive and sustainable engineering culture

Risks of Not Having This Standard

  • Slower time-to-value and increased rework
  • Accumulation of inconsistency and process debt
  • Reduced trust in engineering data, systems, or ownership
  • Loss of agility in the face of change or failure

CMMI Maturity Model

Level 1 – Initial

People & Culture
  • Failures resolved in isolation without shared learning.
  • Root causes often undocumented or revisited repeatedly.
Process & Governance
  • No standard practice for identifying failure patterns.
  • Architectural decisions rarely informed by incident data.
Technology & Tools
  • Incident management tools used for firefighting only.
  • No analysis or aggregation of failure data.
Measurement & Metrics
  • Lack of visibility into recurring failure trends or architectural impacts.

Level 2 – Managed

People & Culture
  • Some teams capture incident learnings, but sharing is inconsistent.
  • Architectural changes driven by local needs.
Process & Governance
  • Informal feedback loops link incidents to architecture.
  • Improvements are reactive and isolated.
Technology & Tools
  • Incident data occasionally reviewed but not systematically analysed.
  • Tools support manual extraction of failure info.
Measurement & Metrics
  • Basic tracking of incident recurrence without architectural prioritisation.
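
At Level 2, basic tracking of incident recurrence can be as simple as counting how often the same root cause appears across postmortems. The sketch below assumes root causes are recorded as free-text tags (a hypothetical format); the function name `recurring_root_causes` is also illustrative. It normalises case and surrounding whitespace so near-duplicate tags are counted together.

```python
from collections import Counter


def recurring_root_causes(root_causes: list[str], min_occurrences: int = 2) -> dict[str, int]:
    """Return root-cause tags recorded at least `min_occurrences` times."""
    # Normalise free-text tags so "DB timeout" and "db timeout " count as the same cause.
    counts = Counter(cause.strip().lower() for cause in root_causes)
    return {cause: n for cause, n in counts.items() if n >= min_occurrences}


# Hypothetical tags taken from five postmortems.
postmortem_tags = ["DB timeout", "cert expiry", "db timeout ", "config drift", "DB Timeout"]
print(recurring_root_causes(postmortem_tags))  # {'db timeout': 3}
```

Free-text tags drift over time, which is one reason the standardised failure analyses described at Level 3 make this kind of counting more reliable.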

Level 3 – Defined

People & Culture
  • Teams systematically review failure patterns.
  • Architectural teams engaged in incident retrospectives.
Process & Governance
  • Failure analyses standardised and documented.
  • Learnings feed into architecture and design discussions.
Technology & Tools
  • Tools support aggregation and trending of failure data.
  • Dashboards provide visibility into architectural risks.
Measurement & Metrics
  • Metrics on failure types, frequency, and related architectural changes.
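
Level 3 calls for tools that aggregate and trend failure data. A minimal Python sketch, assuming each incident has already been attributed to a component and a failure type (both hypothetical fields), groups incidents by that pair and counts them per calendar month. The result is one time series per failure pattern, the kind of data a dashboard could plot.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    component: str     # hypothetical: architectural component the incident was attributed to
    failure_type: str  # hypothetical: category from a shared failure taxonomy
    occurred: date


def monthly_trend(incidents: list[Incident]) -> dict[tuple[str, str], dict[str, int]]:
    """Group incidents by (component, failure_type) and count them per calendar month."""
    trend: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for incident in incidents:
        month = incident.occurred.strftime("%Y-%m")
        trend[(incident.component, incident.failure_type)][month] += 1
    # Sort months so each series reads chronologically.
    return {pattern: dict(sorted(counts.items())) for pattern, counts in trend.items()}


incidents = [
    Incident("search", "timeout", date(2024, 1, 9)),
    Incident("search", "timeout", date(2024, 1, 23)),
    Incident("search", "timeout", date(2024, 2, 14)),
    Incident("payments-api", "capacity", date(2024, 2, 2)),
]
for pattern, series in monthly_trend(incidents).items():
    print(pattern, series)
```

Keying on the (component, failure type) pair rather than on either field alone is what lets a recurring pattern, such as repeated timeouts in one service, stand out from general noise.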

Level 4 – Quantitatively Managed

People & Culture
  • Organisation-wide awareness of failure patterns.
  • Architecture investment prioritised by data.
Process & Governance
  • Failure trend analysis informs roadmap and resilience strategies.
  • Continuous feedback between incidents and architecture.
Technology & Tools
  • Automated analysis identifies systemic issues.
  • Tools correlate failures with architectural components.
Measurement & Metrics
  • Impact of architectural changes on failure rates tracked and reported.
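
To track the impact of an architectural change on failure rates, one simple approach is to compare incident counts in equal-length windows immediately before and after the change. The function name `incidents_around_change`, the 90-day default window and the example dates below are hypothetical choices for illustration.

```python
from datetime import date


def incidents_around_change(
    incident_dates: list[date], change_date: date, window_days: int = 90
) -> tuple[int, int]:
    """Count incidents in equal-length windows immediately before and after a change."""
    # Before: the `window_days` days ending the day before the change.
    before = sum(1 for d in incident_dates if 1 <= (change_date - d).days <= window_days)
    # After: the `window_days` days starting on the change date itself.
    after = sum(1 for d in incident_dates if 0 <= (d - change_date).days < window_days)
    return before, after


# Hypothetical incidents for one component; a hypothetical architectural change ships on 2024-04-01.
dates = [
    date(2023, 11, 1),  # outside the 90-day "before" window, so not counted
    date(2024, 1, 15),
    date(2024, 2, 10),
    date(2024, 3, 5),
    date(2024, 3, 28),
    date(2024, 5, 20),
]
before, after = incidents_around_change(dates, date(2024, 4, 1))
reduction = (before - after) / before if before else None
print(before, after, reduction)  # 4 1 0.75
```

Equal windows keep the comparison fair, but with small counts a drop like this can still be noise or seasonality, so it is better read as a prompt for review over a longer period than as proof the change worked.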

Level 5 – Optimising

People & Culture
  • Post-incident learning drives architecture evolution.
  • Cross-team collaboration prevents failure recurrence.
Process & Governance
  • Failure patterns shape long-term resilience and platform strategies.
  • Shared learning culture embedded organisation-wide.
Technology & Tools
  • Advanced predictive analytics guide architectural investment.
  • Systems proactively alert on emerging failure patterns.
Measurement & Metrics
  • Demonstrated reduction in critical failures due to architectural improvements.
  • Evidence of organisational learning loops impacting resilience.
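
Proactive alerting on emerging failure patterns can start from something much simpler than predictive analytics. The sketch below is a basic baseline-deviation check, not a predictive model: for a single failure pattern, it flags the latest week when its incident count exceeds the mean of earlier weeks by a set number of standard deviations. The function name `is_emerging` and the thresholds (`sigmas`, `min_count`, `min_history`) are assumptions to tune, not values taken from this standard.

```python
from statistics import mean, pstdev


def is_emerging(
    weekly_counts: list[int], sigmas: float = 3.0, min_count: int = 3, min_history: int = 4
) -> bool:
    """Return True if the latest week's count is well above the baseline of earlier weeks."""
    if len(weekly_counts) < min_history + 1:
        return False  # not enough history to form a baseline
    *baseline, latest = weekly_counts
    mu = mean(baseline)
    # Floor the spread at 1 so a perfectly flat baseline does not turn every small uptick into an alert.
    spread = max(pstdev(baseline), 1.0)
    # Require an absolute minimum too, so a jump from 0 to 1 incident never alerts on its own.
    return latest >= min_count and latest > mu + sigmas * spread


print(is_emerging([1, 0, 1, 2, 1, 0, 7]))  # True: 7 is far above a baseline of about 1 per week
print(is_emerging([2, 3, 2, 3, 2, 3, 4]))  # False: 4 is within normal variation
```

Running a check like this per (component, failure type) series turns a trend dashboard into an alert; more sophisticated forecasting can replace the threshold later without changing the surrounding workflow.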

Key Measures

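  • Adoption and coverage: the share of teams, and of incidents, whose post-incident analyses feed into failure-pattern reviews
  • Impact: changes in repeat failures, delivery metrics, quality, and team health following architectural investment
  • Learning loops: evidence that identified failure patterns lead to owned, governed architectural decisions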
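
One way to make adoption and coverage across teams measurable is to compute, per team, the share of incidents whose analysis records both a root cause and the architectural component involved. The `Incident` fields, the `analysis_coverage` function and the team names below are hypothetical; the idea is that coverage can be derived from existing incident records rather than self-reported.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    team: str
    root_cause: Optional[str]  # None when no analysis has been recorded (hypothetical field)
    component: Optional[str]   # None when the incident is not linked to a component (hypothetical field)


def analysis_coverage(incidents: list[Incident]) -> dict[str, float]:
    """Per team, the fraction of incidents with both a root cause and a linked component."""
    totals: dict[str, int] = defaultdict(int)
    covered: dict[str, int] = defaultdict(int)
    for incident in incidents:
        totals[incident.team] += 1
        if incident.root_cause and incident.component:
            covered[incident.team] += 1
    return {team: covered[team] / totals[team] for team in sorted(totals)}


incidents = [
    Incident("checkout", "db timeout", "payments-api"),
    Incident("checkout", None, None),
    Incident("platform", "cert expiry", "ingress"),
]
print(analysis_coverage(incidents))  # {'checkout': 0.5, 'platform': 1.0}
```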

Associated Policies
  • Post-Incident Learning Culture

Associated Practices
  • Root Cause Analysis (RCA)
  • Exploratory Testing
  • Test-Driven Development (TDD)
  • Behaviour-Driven Development (BDD)
  • Non-functional Requirement Testing
  • Mutation Testing
  • End-to-End (E2E) Testing
  • Contract Testing
  • Visual Regression Testing
  • Integration Testing
  • Blameless Postmortems
