Why DORA Matters
Most engineering metrics are either vanity metrics (lines of code, story points, velocity) or indicators so lagging that, by the time they signal a problem, the problem has been entrenched for months. DORA's four key metrics are different: they are validated against actual business outcomes, widely benchmarked, and leading indicators of organisational capability rather than just output.
The research behind DORA (DevOps Research and Assessment) has been running since 2014, with findings published annually in the State of DevOps report. The core finding - replicated across thousands of organisations - is that high-performing software delivery teams outperform low-performing teams on business outcomes including revenue growth, market share, and customer satisfaction. And those delivery performance differences are reliably captured by four metrics.
This does not mean DORA metrics are the only thing that matters, or that optimising them directly is the right strategy. It means they are a useful signal of whether your engineering system is healthy - and that improving the underlying practices that drive these metrics will improve your organisation.
The Four Metrics in Depth
Deployment Frequency
What it measures: how often does your organisation deploy to production (or release to end users)?
Why it matters: deployment frequency is a proxy for the size of batches and the amount of risk in each deployment. Organisations that deploy frequently are deploying small changes. Small changes are easier to understand, easier to test, easier to roll back, and faster to deliver value. Organisations that deploy infrequently are accumulating large batches of change, which increases risk and delay.
Elite performance benchmark: multiple times per day, on demand. High performance: between once per day and once per week.
What inhibits deployment frequency: manual deployment processes, long test cycles, environment availability constraints, code freeze policies, fear of change, and organisational approval processes that create queues.
How to improve it: invest in deployment automation, reduce manual gates in the deployment pipeline, work toward trunk-based development (small frequent merges rather than long-lived branches), and build confidence through testing so that deploying feels safe rather than risky.
How teams game it: counting non-production deployments, splitting single features into multiple trivial deployments, or measuring deployments per service rather than per system. If your deployment frequency number doubles overnight without any process change, investigate the definition.
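The arithmetic itself is trivial once the definition is pinned down; the sketch below, using a hypothetical deployment log, shows the calculation while noting where the definitional traps above live:

```python
from datetime import date

# Hypothetical production deployment log: one entry per production deployment
# event for the whole system. Counting only production deployments, and
# counting per system rather than per service, guards against the gaming
# modes described above.
deploys = [
    date(2024, 3, 4), date(2024, 3, 4), date(2024, 3, 5),
    date(2024, 3, 6), date(2024, 3, 7), date(2024, 3, 8),
]

def deploys_per_day(deploy_dates, window_days):
    """Average production deployments per day over a fixed window."""
    return len(deploy_dates) / window_days

freq = deploys_per_day(deploys, window_days=5)
print(freq)  # 1.2 deployments per day on average over the window
```

If this number jumps without any process change, the definition (not the delivery practice) is usually what moved.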
Lead Time for Changes
What it measures: the time from a code commit to that code running in production.
Why it matters: lead time captures the throughput of your delivery pipeline. Long lead times mean slow feedback loops - you cannot learn whether something works until it is in production, and if your lead time is two weeks, you are flying blind for two weeks per change. Long lead times also reflect the degree of friction in your development and deployment process.
Elite performance benchmark: less than one hour. High performance: between one day and one week.
What inflates lead time: long code review queues, slow build and test pipelines, manual quality gates, environment provisioning delays, and change approval processes that require human sign-off.
How to improve it: measure your pipeline stage by stage to find the bottleneck. Is the delay in code review? Build time? Testing? Deployment approval? Fix the bottleneck. Then measure again to find the next one. Lead time improvement is a sustained engineering investment, not a one-time fix.
How teams game it: measuring lead time only from approval to deployment, excluding the time code spends waiting for review. The correct measurement starts at commit.
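To make the measurement concrete, a minimal sketch over hypothetical change records: lead time is computed from commit to production deployment, and the median is reported because it is robust to a single slow outlier.

```python
from datetime import datetime
from statistics import median

# Hypothetical change records: (commit time, production deploy time).
# The clock starts at commit, not at approval, to avoid the gaming
# mode described above.
changes = [
    (datetime(2024, 3, 4, 9, 0),  datetime(2024, 3, 4, 11, 30)),
    (datetime(2024, 3, 4, 14, 0), datetime(2024, 3, 5, 10, 0)),
    (datetime(2024, 3, 5, 8, 0),  datetime(2024, 3, 5, 9, 0)),
]

def lead_times_hours(records):
    """Per-change lead time in hours, commit to production."""
    return [(deployed - committed).total_seconds() / 3600
            for committed, deployed in records]

hours = lead_times_hours(changes)
print(round(median(hours), 1))  # 2.5 hours at the median
```

Keeping the per-change values (rather than only the aggregate) also supports the stage-by-stage bottleneck analysis described above.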
Change Failure Rate
What it measures: what percentage of deployments to production cause a degradation in service that requires a hotfix, rollback, or patch?
Why it matters: change failure rate captures the quality of your delivery process. A high change failure rate means you are frequently introducing problems - which consumes engineering time in remediation, damages user trust, and increases the risk of each deployment, which in turn reduces deployment frequency.
Elite performance benchmark: 0-15%. High performance: 16-30%.
What drives high change failure rates: insufficient automated testing, manual and error-prone deployment processes, inadequate pre-production environments, insufficient monitoring to detect problems quickly, and poor knowledge transfer between team members.
How to improve it: increase automated test coverage (particularly integration and end-to-end tests for critical paths), improve pre-production environment fidelity, invest in feature flags and canary deployments so that changes can be rolled out gradually rather than all at once, and improve monitoring so that failures are detected quickly.
How teams game it: reclassifying incidents as not deployment-related, not tracking rollbacks, or narrowing the definition of "failure" to exclude minor degradations. This metric requires honest classification.
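The calculation is a simple ratio; the hard part is the honest classification described above. A sketch over a hypothetical deployment log:

```python
# Hypothetical deployment outcomes: True if the deployment caused a
# degradation requiring a hotfix, rollback, or patch. Excluding rollbacks
# or "minor" degradations (the gaming modes above) would understate this.
deployment_failed = [False, False, True, False, False,
                     False, True, False, False, False]

def change_failure_rate(outcomes):
    """Share of production deployments that caused a degradation."""
    return sum(outcomes) / len(outcomes)

cfr = change_failure_rate(deployment_failed)
print(f"{cfr:.0%}")  # 20%: within the high band (16-30%), not yet elite (0-15%)
```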
Mean Time to Recover
What it measures: how long does it take to restore service after a production incident?
Why it matters: MTTR is a measure of your organisation's ability to respond to failure. Failures are inevitable in complex systems. What differentiates high-performing organisations is not the absence of failure - it is the ability to detect and recover from failure quickly, minimising the impact on users.
Elite performance benchmark: less than one hour. High performance: less than one day.
What drives high MTTR: slow incident detection (insufficient monitoring or alerting), complex runbooks that require specialist knowledge to execute, inability to roll back changes quickly, poor incident communication and coordination processes, and systems that are hard to diagnose.
How to improve it: invest in observability (the ability to understand what your system is doing from its outputs), practice incident response regularly, build rollback capability into every deployment, document runbooks clearly and keep them current, and run postmortems that identify systemic improvements rather than individual blame.
How teams game it: closing incidents before service is fully restored, splitting incidents into multiple records to reduce apparent duration, or not measuring incidents that do not reach a certain severity threshold.
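As with the other metrics, the computation is straightforward once the incident records are honest. A sketch over hypothetical incidents, measuring from detection to full restoration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents: (detected, service fully restored). Closing an
# incident before full restoration, or splitting one incident into several
# records (the gaming modes above), would understate this number.
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 40)),
    (datetime(2024, 3, 3, 2, 15), datetime(2024, 3, 3, 4, 15)),
    (datetime(2024, 3, 7, 16, 0), datetime(2024, 3, 7, 16, 20)),
]

def mttr_minutes(records):
    """Mean time from detection to full restoration, in minutes."""
    return mean((end - start).total_seconds() / 60 for start, end in records)

print(mttr_minutes(incidents))  # 60.0 minutes: right on the elite boundary
```

Tracking the distribution as well as the mean is worthwhile: one long outage can dominate the mean while most incidents resolve quickly.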
The Fifth Metric: Reliability
The 2021 State of DevOps report added reliability as a fifth metric. Reliability is measured as the percentage of time the system is meeting user needs - capturing both availability (is the system up?) and correctness (is it doing the right thing?).
Reliability is harder to instrument than the four core metrics but increasingly important. A system can have high deployment frequency and low MTTR but still be unreliable if it fails frequently. Tracking reliability alongside the four DORA metrics gives a more complete picture.
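One simple way to operationalise "percentage of time meeting user needs" is to sample system health at a fixed interval, counting a sample as good only if the system is both up and correct. A minimal sketch with hypothetical numbers:

```python
# Hypothetical per-minute health samples over one day: a minute counts as
# good only if the system is both available and returning correct responses.
minutes_in_day = 24 * 60
bad_minutes = 7  # minutes failing either the availability or correctness check
good_minutes = minutes_in_day - bad_minutes

reliability = good_minutes / minutes_in_day
print(f"{reliability:.3%}")  # 99.514% of minutes meeting user needs
```

Real instrumentation is usually weighted by traffic rather than wall-clock minutes, but the ratio of good service to total service is the common core.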
DORA Performance Bands
The DORA research classifies organisations into four performance bands: elite, high, medium, and low. The boundaries shift slightly each year as industry performance improves. The value of the classification is not the specific label - it is the benchmark.
If you are in the low or medium band, you have clear evidence that improvement is possible, because high and elite organisations exist and are not fundamentally different from you. The practices that drive high performance are known and learnable.
If you are already in the high band, elite performance is your target - but the gains become harder and the investment required more significant.
DORA as a Conversation Tool, Not a Performance Management Tool
The most important thing to understand about DORA metrics is how to use them. They are a starting point for conversation, not a final answer.
If your deployment frequency is low, the useful question is: why? What is preventing more frequent deployments? Is it a technical constraint, a process constraint, or a cultural constraint? What would it take to address the root cause?
Used this way, DORA metrics create productive conversations about engineering practices and organisational capability. They point toward systemic issues rather than individual shortcomings.
Used badly - as targets in performance reviews, as the basis for team rankings, or as evidence in blame attribution - DORA metrics will be gamed, resisted, and ultimately useless. Engineers are smart. If they know they are being measured on deployment frequency, they will find a way to increase the number without improving the underlying practices. This produces better-looking metrics and worse-functioning systems.
The right use of DORA is at the organisational level: are we, as an engineering organisation, improving over time? Are our practices maturing? Are we closing the gap to high performance? These are questions for engineering leadership, not for individual teams.
DORA and Team Health
One of the most robust findings in the DORA research is the relationship between delivery performance and team health. High-performing teams report higher job satisfaction, lower burnout rates, and more positive work environments than low-performing teams.
This is not accidental. The practices that drive DORA performance - automated testing, deployment automation, trunk-based development, clear work prioritisation, psychological safety - also reduce the toil and frustration that drive burnout. Improving delivery performance and improving team health are the same problem, not competing priorities.
This is a useful argument when making the investment case for improving DORA metrics. It is not just about delivery speed. It is about whether engineers find their work sustainable and rewarding.