The Data Quality Problem Nobody Talks About
Engineering organisations invest significantly in building metrics dashboards and reporting systems. Most invest much less in understanding whether those systems produce numbers that can be trusted. The result is confident-looking dashboards built on shaky data - metrics that are technically present but misleading, contested, or simply wrong.
Data quality problems in engineering metrics are pervasive for predictable reasons. Engineering teams work across many tools that do not naturally integrate. Data is captured for operational purposes and then repurposed for measurement. Processes that are nominally standard are applied inconsistently in practice. And the people closest to the data - the engineers themselves - are rarely asked to validate whether the numbers reflect reality.
This matters because decisions made on bad data are worse than decisions made without data. If you invest in a reliability improvement programme based on MTTR data that systematically undercounts incidents, you are solving the wrong problem. If you use lead time data to evaluate team performance but your teams capture the "started" date inconsistently, you are comparing incomparable things.
Data quality is not glamorous. It does not generate enthusiasm. But it is the foundation on which every metric-driven conversation depends, and building it deliberately - rather than discovering its absence after a consequential decision goes wrong - is worth the investment.
Common Data Quality Problems in Engineering Metrics
Inconsistent Definitions
The most pervasive data quality problem is measuring the same thing differently in different places without realising it.
Lead time for changes: does it start when a developer commits code, when a branch is created, when a pull request is opened, or when a ticket moves to "in progress"? Different teams in the same organisation frequently answer this question differently. Aggregated lead time data that combines these different definitions is meaningless.
Incidents: what counts as an incident? Is a brief service degradation that was auto-resolved an incident? Is a customer-reported issue that turned out to be a user error an incident? Different teams have different instincts, and if the definition is not written down and enforced, different answers will appear in the data.
Deployment: does a deployment to a staging environment count? A partial canary deployment? A configuration change deployed via a separate mechanism from application code? Without a precise definition, different teams will count differently.
The fix is a metrics glossary - a written document that defines every metric that appears in any dashboard or report, specifying the calculation method, the data source, the time period convention, and the edge cases that have been resolved. This document should be version-controlled and treated as a living standard, updated when definitions change.
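One way to keep a glossary machine-checkable is to store entries as structured records in the same repository as the dashboards. The sketch below is illustrative only - the field names, the lead-time definition, and the edge cases shown are hypothetical examples of what a resolved entry might look like, not a recommended standard.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MetricDefinition:
    """One entry in a version-controlled metrics glossary."""
    name: str
    calculation: str              # how the number is computed
    source: str                   # the system of record
    time_period: str              # aggregation convention
    edge_cases: list = field(default_factory=list)

# Hypothetical entry resolving the lead-time ambiguities discussed above.
LEAD_TIME = MetricDefinition(
    name="lead_time_for_changes",
    calculation="median hours from first commit on the branch to production deploy",
    source="git history + deployment pipeline events",
    time_period="rolling 30 days, reported by calendar week",
    edge_cases=[
        "reverted deploys excluded",
        "hotfix branches included",
    ],
)
```

Keeping entries in code rather than a wiki means definition changes go through review, which is exactly the "living standard" discipline the glossary needs.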
Manual Data Entry
Any metric that relies on human data entry is subject to inconsistency, omission, and bias. Engineers manually updating ticket statuses, estimating time spent on categories of work, or self-reporting incident timelines are introducing noise that compounds over time.
Manual data entry also invites gaming. If engineers know that their time-to-production is being tracked via manual status updates, they will move tickets to "done" at the most flattering point in the process - not necessarily when the work is actually complete.
Where possible, capture data automatically from system events: code commits, pipeline executions, ticket state transitions triggered by code merges, incident creation and resolution timestamps from your incident management tool. Automated capture is more consistent and harder to game.
Where manual input is unavoidable, reduce the number of options and make the correct answer the path of least resistance. A ticket status workflow with twelve states will be used inconsistently. A workflow with four states, where transitions are triggered automatically where possible, will be more reliable.
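A four-state workflow with automated transitions can be sketched as an explicit transition map driven by system events. The state names and event-to-transition mapping below are assumptions for illustration, not a prescribed workflow.

```python
# Legal manual transitions between the four states.
ALLOWED = {
    "todo":        {"in_progress"},
    "in_progress": {"in_review", "todo"},
    "in_review":   {"done", "in_progress"},
    "done":        set(),
}

# Transitions driven by system events rather than manual updates
# (hypothetical event names for illustration).
AUTO_TRIGGERS = {
    "branch_created": ("todo", "in_progress"),
    "pr_opened":      ("in_progress", "in_review"),
    "pr_merged":      ("in_review", "done"),
}

def apply_event(current: str, event: str) -> str:
    """Advance the ticket state if the event maps to a legal transition."""
    if event not in AUTO_TRIGGERS:
        return current
    src, dst = AUTO_TRIGGERS[event]
    if current == src and dst in ALLOWED[current]:
        return dst
    return current
```

Because the transitions fire from code events, the "done" timestamp reflects when the work merged, not when someone remembered to update the ticket.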
Tool Sprawl
Engineering organisations that have accumulated many tools over time often find that the same process is tracked in different tools by different teams. One team uses Jira for work tracking, another uses Linear, a third uses GitHub Issues. Aggregating cycle time or throughput across these three tools requires either expensive integration work or imprecise approximation.
Tool sprawl is a data quality problem as well as a cost problem. The solution is not necessarily standardisation on a single tool - sometimes the right tool varies by use case. But it does require either integration (connecting the tools so data can be aggregated) or clear boundaries (some metrics are only meaningful within a team, not across teams).
The data governance implication of tool sprawl: before publishing an organisation-wide metric that aggregates data from multiple tools, audit the data quality in each source. You may find that the aggregated number hides significant variation in data completeness and accuracy between sources.
Sampling Bias
Sampling bias occurs when your data systematically excludes certain cases, so that your metrics describe a skewed subset of events rather than the whole population.
In incident management: if the oncall rotation is inconsistently executed and some incidents are resolved without being formally logged, your MTTR data only reflects the incidents that made it into the system - which may be systematically different (longer, more severe, more visible) from the ones that did not.
In deployment frequency: if some teams deploy via an automated pipeline that logs to your metrics system, and others deploy via a manual process that does not, your deployment frequency data only captures the automated deployments. Because teams with automated pipelines typically deploy more often, the resulting figure is higher than the true organisational average.
In customer feedback: if you measure satisfaction only among users who choose to complete a feedback survey, you are measuring the opinion of a self-selected group (typically those who feel strongly positive or strongly negative) rather than the median user experience.
Identify sampling bias by asking: what would need to be true for this data to be systematically missing a category of events? Then check whether that category is present in the data.
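That check can often be automated as a cross-source coverage comparison: take a reference set of events from a second system and measure what fraction never reached the metrics pipeline. The deployment identifiers below are hypothetical; in practice the reference set might come from release tags or change records.

```python
def coverage_gap(logged_ids, reference_ids):
    """Return the fraction of reference events missing from the logged set.

    A large gap suggests the metric's source is systematically
    excluding a category of events (sampling bias).
    """
    logged = set(logged_ids)
    missing = [r for r in reference_ids if r not in logged]
    return len(missing) / len(reference_ids) if reference_ids else 0.0

# Hypothetical cross-check: deploys seen by the metrics pipeline
# versus release tags in version control.
pipeline_deploys = ["d1", "d2", "d3"]
release_tags = ["d1", "d2", "d3", "d4", "d5"]
gap = coverage_gap(pipeline_deploys, release_tags)  # 0.4: 40% of deploys unlogged
```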
Instrumentation Standards
Building data quality in from the start requires instrumentation standards - agreed practices for how events are captured, what metadata is attached, and what quality gates apply before data enters the measurement system.
For deployment metrics: every deployment should emit a structured event with a consistent schema including: timestamp, environment, service name, team, duration, outcome (success or failure), and a correlation identifier that links to the change being deployed.
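A minimal sketch of such a schema follows. The field names and example values are assumptions chosen to match the list above; the point is that every deployment path emits the same shape of event.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DeploymentEvent:
    """One structured deployment event with a consistent schema."""
    timestamp: datetime
    environment: str        # e.g. "production", "staging"
    service: str
    team: str
    duration_seconds: float
    outcome: str            # "success" or "failure"
    change_id: str          # correlation id linking to the deployed change

# Hypothetical event for illustration.
event = DeploymentEvent(
    timestamp=datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc),
    environment="production",
    service="billing-api",
    team="payments",
    duration_seconds=142.0,
    outcome="success",
    change_id="pr-4821",
)
```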
For incident metrics: every incident should have a creation timestamp, a severity classification using an agreed taxonomy, a resolution timestamp, and a classification of the primary cause category. These should be captured in the incident management tool, not in a separate spreadsheet.
For work tracking: a consistent definition of what "started" and "done" mean, with automated state transitions where possible, and clear rules for handling edge cases (blocked items, items split mid-flight, items that are reopened after completion).
Instrumentation standards should be documented and enforced. Enforcement mechanisms can be lightweight: automated checks that flag events missing required fields, periodic audits of data completeness, and a named data steward who owns the standard for each metric category.
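The first of those enforcement mechanisms - an automated check for missing required fields - can be as small as the sketch below. The required-field set mirrors the hypothetical deployment schema discussed above; adapt it to whatever your standard actually specifies.

```python
REQUIRED_FIELDS = {"timestamp", "environment", "service", "team",
                   "duration_seconds", "outcome", "change_id"}

def flag_incomplete(events):
    """Return (event, missing_fields) pairs for events failing the standard.

    A lightweight gate: run on each batch before it enters the
    measurement system, and route flagged events to the data steward.
    """
    flagged = []
    for ev in events:
        missing = REQUIRED_FIELDS - ev.keys()
        if missing:
            flagged.append((ev, sorted(missing)))
    return flagged

# Illustrative batch: one complete event, one incomplete.
batch = [
    {"timestamp": "2024-03-01T12:00:00Z", "environment": "production",
     "service": "billing-api", "team": "payments",
     "duration_seconds": 142.0, "outcome": "success", "change_id": "pr-4821"},
    {"timestamp": "2024-03-01T13:00:00Z", "service": "search"},
]
problems = flag_incomplete(batch)  # the second event is flagged
```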
Data Lineage for Engineering Metrics
Data lineage is the documentation of where a metric comes from, what transformations were applied, and what assumptions were made in the calculation. It is the answer to "how was this number calculated?" that your finance partner or senior leader will inevitably ask when the number is surprising.
A data lineage record for a metric includes: the source system, the query or extraction method, the transformation logic (any calculations, filters, or aggregations applied), the refresh schedule, the date from which historical data is available, and the known limitations or edge cases.
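A lineage record holding those fields need not be elaborate - a plain mapping per metric is enough to start. Every value below is an illustrative example, not a real system or query.

```python
# A minimal lineage record for one metric.
lineage = {
    "metric": "deployment_frequency",
    "source_system": "CI/CD pipeline event stream",
    "extraction": "daily batch query over deployment events",
    "transformations": [
        "filter: environment == 'production'",
        "aggregate: count per team per week",
    ],
    "refresh_schedule": "daily at 02:00 UTC",
    "history_from": "2023-01-01",
    "known_limitations": [
        "manual deployments are not captured",
        "config-only changes excluded",
    ],
}
```

When a surprising number appears, walking this record from source to transformations is the debugging path: each entry is a place the anomaly could have been introduced.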
Maintaining data lineage serves two purposes. First, it enables trust - when someone questions a number, you can show your working. Second, it enables debugging - when a metric produces a surprising result, you can trace back through the lineage to identify where the anomaly originates.
The minimum viable data lineage is a spreadsheet or wiki page with one row per metric, capturing the key information. More mature organisations use data cataloguing tools that maintain lineage automatically from the data pipeline.
Catching and Communicating Metric Anomalies
Metrics occasionally produce values that are wrong - not because of a genuine change in the underlying reality, but because of a data pipeline failure, a tool migration that changed the counting logic, or a source system bug.
Building anomaly detection into your metrics system prevents bad data from reaching decision-makers unchecked. Simple statistical approaches - flagging values that fall more than N standard deviations from the recent mean - catch most instrumentation failures. More sophisticated approaches use forecasting models that generate expected ranges and alert when actuals fall outside them.
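The simple statistical gate can be sketched in a few lines. The thresholds here (three standard deviations, eight baseline points) are illustrative defaults, and this approach deliberately catches only abrupt breaks, not gradual drift or seasonality.

```python
import statistics

def is_anomalous(history, latest, n_sigma=3.0, min_points=8):
    """Flag a value more than n_sigma standard deviations from recent history."""
    if len(history) < min_points:
        return False  # not enough data to establish a baseline
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        # A perfectly flat baseline: any change at all is suspicious.
        return latest != mean
    return abs(latest - mean) > n_sigma * sd
```

Run against, say, eight weeks of deployment counts hovering around 10, a sudden reading of 100 is flagged while a reading of 10 passes - exactly the kind of pipeline failure or counting-logic change this gate is meant to intercept.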
When an anomaly is detected - whether automatically or by a human reviewer - the communication protocol matters. State clearly that the anomaly has been identified, that it is under investigation, and that the metric should be treated with caution until resolved. Do not remove the metric from the dashboard silently, as this creates confusion when stakeholders notice its absence. Do not present the anomalous value as if it were real, as this undermines trust when the error is later discovered.
Communicating "we found a problem with our data" is uncomfortable. It requires admitting imperfection in your measurement system. But the alternative - continuing to report bad data until someone catches it - is far more damaging to credibility.