Standard: Monitoring is embedded in design and operations

Purpose and Strategic Importance

This standard ensures that monitoring is a first-class capability built into system design, development, and operations. By instrumenting services, infrastructure, and user workflows with real-time metrics, health checks, and user-experience indicators, teams gain the visibility needed to detect anomalies early, troubleshoot effectively, and maintain high levels of service reliability.

Strategic Impact

  • Early detection and proactive response to anomalies
  • Improved operational excellence through data-driven decisions
  • Enhanced ability to prioritize features and architectural improvements
  • Assurance of SLA compliance and regulatory governance

Risks of Not Having This Standard

  • Blind spots in production leading to user impact
  • Inefficient incident diagnosis and longer outages
  • Decreased customer satisfaction due to unnoticed errors
  • Growth of fragmented and costly monitoring solutions

CMMI Maturity Model

Level 1 – Initial

Category                Description
People & Culture        Monitoring is informal or manual, with little standard practice.
Process & Governance    Monitoring efforts are ad hoc, reactive, and inconsistent.
Technology & Tools      Reliance on logs and manual checks without automation.
Measurement & Metrics   No consistent measurement of detection or alert effectiveness.

Level 2 – Managed

Category                Description
People & Culture        Teams begin to recognise the importance of monitoring and define basic metrics.
Process & Governance    Central collection of key service and infrastructure metrics is established.
Technology & Tools      Basic alerting is in place but varies across teams and systems.
Measurement & Metrics   Some tracking of alert volumes and incident detection times.

Level 3 – Defined

Category                Description
People & Culture        Monitoring is embedded in team practices, with clear ownership.
Process & Governance    Standardised metric schemas and dashboards mandated across teams.
Technology & Tools      SLIs and SLOs defined, tracked, and reported regularly.
Measurement & Metrics   Metrics quality and coverage are monitored for completeness.
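
As an illustration of what "SLIs and SLOs defined, tracked, and reported" can look like in practice, the following is a minimal sketch of a request-based availability SLI compared against an SLO target. The function names, the choice of a request-based SLI, and the 99.9% target are assumptions made for the example; the standard itself does not prescribe them.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (a request-based SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully used, and a negative
    value means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target   # budget granted by the SLO
    actual_failure = 1.0 - sli           # budget consumed so far
    return 1.0 - actual_failure / allowed_failure


# Hypothetical reporting window: 1,000,000 requests, 500 of them failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

Reporting the remaining error budget alongside the raw SLI gives teams a single figure to review regularly, which is one way to make the tracking described at this level concrete.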

Level 4 – Quantitatively Managed

Category                Description
People & Culture        Teams use monitoring data proactively to improve system health.
Process & Governance    Monitoring quality metrics (accuracy, latency) are measured and optimised.
Technology & Tools      Anomaly detection, dynamic thresholds, and alert tuning implemented.
Measurement & Metrics   Quantitative tracking of detection time and alert precision.
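
To make "dynamic thresholds" concrete, here is a minimal sketch of one common approach: flag a sample as anomalous when it deviates from the mean of a recent window by more than k standard deviations. The class name `DynamicThreshold`, the window size, and the value of `k` are hypothetical choices for illustration; production anomaly detection typically also accounts for seasonality and trend.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Rolling mean +/- k standard deviations over the last `window` samples."""

    def __init__(self, window: int = 30, k: float = 3.0, min_samples: int = 5):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.k = k
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous versus the window, then record it."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            anomalous = abs(value - mu) > self.k * sigma
        self.samples.append(value)
        return anomalous


# Hypothetical latency samples in milliseconds; the last one is a spike.
detector = DynamicThreshold(window=10, k=3.0, min_samples=5)
flags = [detector.observe(v) for v in [100, 102, 98, 101, 99, 100, 150]]
print(flags)
```

In this sketch anomalous samples are still added to the window, so a sustained shift eventually becomes the new baseline. Whether that is desirable is a tuning decision, and it is the kind of choice that alert tuning at this level would revisit.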

Level 5 – Optimising

Category Description
People & Culture Predictive analytics and automated remediation are cultural norms.
Process & Governance Continuous monitoring improvement processes reduce noise and improve relevance.
Technology & Tools Intelligent systems anticipate issues and guide preventive action.
Measurement & Metrics Business impact quantified from monitoring-driven improvements.

Key Measures

  • Monitoring coverage (% of systems with standardised metrics)
  • Mean Time to Detect (MTTD) for issues surfaced by monitoring
  • Alert precision (ratio of true positive alerts to total alerts)
  • SLO compliance rate (percentage of time services meet defined objectives)
  • Monitoring data latency (time from event to metric availability)
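
The measures above can be computed from fairly simple inputs. The sketch below shows one possible way to do so; the `Incident` record, its field names, and the function names are assumptions made for illustration, and real figures would come from an incident tracker and alerting system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started_at: datetime    # when the fault began (hypothetical field)
    detected_at: datetime   # when monitoring first raised it (hypothetical field)


def monitoring_coverage(systems_with_metrics: int, total_systems: int) -> float:
    """Percentage of systems emitting standardised metrics."""
    return 100.0 * systems_with_metrics / total_systems


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between fault start and detection by monitoring."""
    gaps = [(i.detected_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def alert_precision(true_positive_alerts: int, total_alerts: int) -> float:
    """Ratio of true positive alerts to all alerts fired."""
    return true_positive_alerts / total_alerts


def slo_compliance_rate(compliant_minutes: int, total_minutes: int) -> float:
    """Percentage of time the service met its defined objectives."""
    return 100.0 * compliant_minutes / total_minutes


def data_latency(event_at: datetime, available_at: datetime) -> timedelta:
    """Time from an event occurring to its metric being queryable."""
    return available_at - event_at


# Hypothetical figures for one reporting period.
t0 = datetime(2024, 1, 1, 12, 0, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4)),
    Incident(t0, t0 + timedelta(minutes=6)),
]
print(monitoring_coverage(45, 60))
print(mean_time_to_detect(incidents))
print(alert_precision(80, 100))
print(slo_compliance_rate(43_170, 43_200))
print(data_latency(t0, t0 + timedelta(seconds=15)))
```

Keeping each measure as a small pure function makes the definitions easy to review and to apply consistently across teams, whatever tooling supplies the inputs.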

Associated Practices

  • Root Cause Analysis (RCA)
  • Self-Healing Systems
  • Vulnerability Management
  • Incident Response Playbooks
