Standard: Monitoring is embedded in design and operations

Purpose and Strategic Importance

This standard ensures that monitoring is a first-class capability built into system design, development, and operations. By instrumenting services, infrastructure, and user workflows with real-time metrics, health checks, and user-experience indicators, teams gain the visibility needed to detect anomalies early, troubleshoot effectively, and maintain high levels of service reliability.

Strategic Impact

Early detection and proactive response to anomalies
Improved operational excellence through data-driven decisions
Enhanced ability to prioritize features and architectural improvements
Assurance of SLA compliance and regulatory governance

Risks of Not Having This Standard

Blind spots in production leading to user impact
Inefficient incident diagnosis and longer outages
Decreased customer satisfaction due to unnoticed errors
Growth of fragmented and costly monitoring solutions

CMMI Maturity Model

Level 1 – Initial

Category	Description
People & Culture	Monitoring is informal or manual, with little standard practice.
Process & Governance	Monitoring efforts are ad hoc, reactive, and inconsistent.
Technology & Tools	Reliance on logs and manual checks without automation.
Measurement & Metrics	No consistent measurement of detection or alert effectiveness.

Level 2 – Managed

Category	Description
People & Culture	Teams begin to recognise the importance of monitoring and define basic metrics.
Process & Governance	Central collection of key service and infrastructure metrics established.
Technology & Tools	Basic alerting in place but varies across teams and systems.
Measurement & Metrics	Some tracking of alert volumes and incident detection times.

Level 3 – Defined

Category	Description
People & Culture	Monitoring is embedded in team practices, with clear ownership.
Process & Governance	Standardised metric schemas and dashboards mandated across teams.
Technology & Tools	SLIs and SLOs defined, tracked, and reported regularly.
Measurement & Metrics	Metrics quality and coverage are monitored for completeness.
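
As an illustration of the SLI and SLO tracking described at Level 3, the sketch below computes a request-based availability SLI and the share of error budget still unspent. The function names, the request-based definition of the SLI, and the example target are assumptions made for illustration; this standard does not prescribe them.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (hypothetical SLI definition)."""
    if total_requests == 0:
        # Assumption: with no traffic, the service is treated as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    consumed = (1.0 - sli) / allowed_failure
    return 1.0 - consumed


# Example: 9,995 good requests out of 10,000 against a 99.9% target.
sli = availability_sli(9_995, 10_000)
budget = error_budget_remaining(sli, slo_target=0.999)
```

Reporting remaining budget alongside the raw SLI gives teams a single figure for deciding when reliability work should take priority over feature work.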

Level 4 – Quantitatively Managed

Category	Description
People & Culture	Teams use monitoring data proactively to improve system health.
Process & Governance	Monitoring quality metrics (accuracy, latency) are measured and optimised.
Technology & Tools	Anomaly detection, dynamic thresholds, and alert tuning implemented.
Measurement & Metrics	Quantitative tracking of detection time and alert precision.
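
One simple way to realise the dynamic thresholds mentioned at Level 4 is a rolling mean plus or minus k standard deviations. The sketch below is a minimal example under that assumption; the class name, window size, and k value are hypothetical, and production systems would typically rely on the anomaly detection features of their monitoring platform.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Flags values that deviate from a rolling baseline (illustrative only)."""

    def __init__(self, window: int = 30, k: float = 3.0, min_samples: int = 5):
        self.values = deque(maxlen=window)  # most recent observations
        self.k = k                          # width of the band in standard deviations
        self.min_samples = min_samples      # history required before judging

    def is_anomalous(self, value: float) -> bool:
        # Compare against history only, then record the new observation.
        history = list(self.values)
        self.values.append(value)
        if len(history) < self.min_samples:
            return False  # not enough data to form a baseline
        baseline = mean(history)
        spread = pstdev(history)
        # Note: with a perfectly flat history (spread of 0), any change is flagged.
        return abs(value - baseline) > self.k * spread


detector = DynamicThreshold(window=10, k=3.0, min_samples=5)
flags = [detector.is_anomalous(v) for v in [100, 101, 99, 100, 102, 98, 150]]
```

Because the threshold follows recent behaviour, the same rule can serve metrics with very different normal ranges, which is one route to the alert tuning this level calls for.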

Level 5 – Optimising

Category	Description
People & Culture	Predictive analytics and automated remediation are cultural norms.
Process & Governance	Continuous monitoring improvement processes reduce noise and improve relevance.
Technology & Tools	Intelligent systems anticipate issues and guide preventive action.
Measurement & Metrics	Business impact quantified from monitoring-driven improvements.

Key Measures

Monitoring coverage (% of systems with standardised metrics)
Mean Time to Detect (MTTD) for issues surfaced by monitoring
Alert precision (ratio of true positive alerts to total alerts)
SLO compliance rate (percentage of time services meet defined objectives)
Monitoring data latency (time from event to metric availability)
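
The measures above can each be expressed as a simple calculation. The following is a minimal sketch, assuming incident and alert records are already available from an incident or alerting system; the Incident type, the function names, and the minute-based compliance calculation are hypothetical choices, not definitions mandated by this standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started_at: datetime   # when the issue began
    detected_at: datetime  # when monitoring raised it


def monitoring_coverage(systems_with_metrics: int, total_systems: int) -> float:
    """Percentage of systems emitting standardised metrics."""
    return 100.0 * systems_with_metrics / total_systems if total_systems else 0.0


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of an issue and its detection."""
    if not incidents:
        raise ValueError("at least one incident is required")
    gaps = [i.detected_at - i.started_at for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)


def alert_precision(true_positive_alerts: int, total_alerts: int) -> float:
    """Ratio of true positive alerts to all alerts fired."""
    return true_positive_alerts / total_alerts if total_alerts else 0.0


def slo_compliance_rate(compliant_minutes: int, total_minutes: int) -> float:
    """Percentage of the period during which the service met its objectives."""
    return 100.0 * compliant_minutes / total_minutes if total_minutes else 0.0


def monitoring_data_latency(event_time: datetime, available_time: datetime) -> timedelta:
    """Delay between an event occurring and its metric becoming queryable."""
    return available_time - event_time


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    Incident(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 10)),
]
mttd = mean_time_to_detect(incidents)  # 7 minutes for these two records
```

Keeping each measure as a small, explicit function makes the definitions easy to review and to apply consistently across teams.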