Company

Mapping DORA Metrics Onto Governed Autonomous Reliability

How deployment frequency, lead time, change-failure rate, and MTTR actually move under a control layer where agents propose and humans authorize.


Zof Reliability Team · Engineering and product

February 5, 2026 · 7 min read · Updated February 5, 2026

Summary

DORA's four keys were designed to measure teams of humans shipping code at human cadence. That assumption is gone. When roughly 41% of your codebase is AI-generated and a large share of those changes carry latent defects, the metrics still apply, but the levers that move them are no longer the ones in your runbook. This field guide shows SREs what deployment frequency, lead time, change-failure rate, and MTTR look like under a control layer, and which of them you should stop optimizing directly.

Why the four keys behave differently now

The original DORA insight holds: elite performers ship more often, faster, with fewer failures, and recover quicker, and the four metrics correlate rather than trade off. What has changed is the input distribution. The throughput metrics, deployment frequency and lead time, are now trivially easy to inflate, because agents generate and open changes faster than any human team ever could. The stability metrics, change-failure rate and MTTR, are correspondingly harder to hold, because the same volume that pumps throughput also fattens the tail of risky changes.

Industry research suggests around 45% of AI coding tasks introduce a critical flaw or security issue. So if you let generation run and measure only throughput, you will post beautiful deployment-frequency numbers while your change-failure rate quietly climbs. The four keys stop being a balanced scorecard and become two metrics in tension. The job of a control layer is to restore the balance: keep throughput high without letting stability degrade. That only works if every change is validated and authorized before it lands, which is the core of the model where agents propose and humans authorize.

Deployment frequency: the metric that stops being the goal

Deployment frequency was always a proxy. It mattered because, historically, frequent deploys were *evidence* of small batches, good automation, and low coordination cost. Under governed autonomy, you can manufacture deploy frequency directly: point a fleet at your backlog and watch the number climb. That destroys it as a proxy for health.

Treat deployment frequency as a constraint to satisfy, not a target to maximize. The useful question shifts from "how often do we deploy" to "what fraction of changes reach production without a human in the critical path, and is that fraction safe." A System Graph that maps services, dependencies, and CI/CD into one change-aware model lets you answer that, because it knows the blast radius of each change. Low-blast-radius, fully-validated changes can auto-merge and inflate frequency honestly. High-blast-radius changes route to a human. The number goes up for the right reason.
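A minimal sketch of that routing decision, assuming the dependency model can be reduced to a plain service-to-dependencies map. The `DEPENDS_ON` data, the function names, and the `max_auto_radius` threshold are all hypothetical illustrations, not the System Graph's actual interface. Blast radius here is simply the count of services that transitively depend on the changed one.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": [],
    "ledger": [],
    "docs-site": [],
}

def dependents_of(service, depends_on):
    """Return every service that directly or transitively depends on `service`."""
    # Invert the edges so we can walk from a service to its consumers.
    reverse = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(name)
    # Breadth-first walk over the inverted edges.
    seen, queue = set(), deque([service])
    while queue:
        current = queue.popleft()
        for consumer in reverse.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

def route_change(service, validated, depends_on, max_auto_radius=0):
    """Auto-merge only validated changes whose blast radius fits the threshold."""
    radius = len(dependents_of(service, depends_on))
    if validated and radius <= max_auto_radius:
        return "auto-merge"
    return "human-review"

print(route_change("docs-site", True, DEPENDS_ON))  # nothing depends on it
print(route_change("ledger", True, DEPENDS_ON))     # payments and checkout depend on it
```

A real threshold would likely weigh more than a dependent count (traffic, data sensitivity, rollback cost), but the shape of the decision is the same: validation status and blast radius together pick the path.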

The failure mode to watch: frequency that rises while change-failure rate rises with it. That is not velocity; it is the queue filling with under-validated diffs. If you see both climbing together, your validation gate is not change-aware; it is rubber-stamping volume.
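That warning sign is cheap to check mechanically. The sketch below assumes you can export one `(deploys, failed_deploys)` pair per reporting window, oldest first. The function names and the sample figures are made up for illustration, and "rising" is defined strictly, window over window, which you may want to loosen for noisy data.

```python
def change_failure_rate(deploys, failed_deploys):
    """Fraction of deployments in a window that caused a failure."""
    return failed_deploys / deploys if deploys else 0.0

def both_rising(windows):
    """True when deploy count AND change-failure rate rose in every consecutive window.

    `windows` is a list of (deploys, failed_deploys) tuples, oldest first.
    """
    pairs = list(zip(windows, windows[1:]))
    if not pairs:
        return False  # need at least two windows to talk about a trend
    return all(
        later[0] > earlier[0]
        and change_failure_rate(*later) > change_failure_rate(*earlier)
        for earlier, later in pairs
    )

# Hypothetical data: volume doubles each window while CFR creeps up.
suspicious = [(40, 2), (80, 6), (160, 16)]
# Hypothetical data: volume doubles while CFR falls.
healthy = [(40, 4), (80, 6)]

print(both_rising(suspicious))
print(both_rising(healthy))
```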

Lead time: where the control layer earns its keep

Lead time for changes (commit to production) is where governed autonomy delivers its cleanest win, and also where it is most often misunderstood. The naive read is that adding governance adds latency: more gates, slower releases. The opposite is true when governance is evidence-driven rather than queue-driven.

Traditional lead time is dominated by waiting, not working. A change sits in a review queue because a human has to reconstruct, by hand, whether it is safe. The control layer collapses that wait by attaching the evidence to the change before it arrives at a gate. Testing Fleets plan and execute validation that is aware of what changed and what depends on it, so the approver reads a concrete artifact (which paths were exercised, what regressed, what is reachable) instead of guessing. Decision time drops because the decision is no longer a research project.

Decompose lead time and you will find the real target:

Generation time: now near zero, and not your bottleneck.
Validation time: should be automated, change-scoped, and parallel.
Approval wait: the dominant cost in most pipelines; collapses when the gate is evidence-driven.
Deploy/propagation time: pipeline mechanics, largely unchanged.

Optimize approval wait first. It is almost always the fat slice, and it is the one governance is built to compress.
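To find the fat slice in your own pipeline, you only need five timestamps per change. This is a sketch under assumptions: the field names (`requested`, `committed`, `validated`, `approved`, `deployed`) are hypothetical and would map onto whatever your CI/CD and review tooling actually records, and the sample change is invented.

```python
from datetime import datetime

# (slice name, start timestamp field, end timestamp field); field names are assumptions.
SLICES = [
    ("generation", "requested", "committed"),
    ("validation", "committed", "validated"),
    ("approval_wait", "validated", "approved"),
    ("deploy", "approved", "deployed"),
]

def decompose_lead_time(change):
    """Return seconds spent in each slice for one change (dict of ISO-8601 strings)."""
    stamps = {field: datetime.fromisoformat(value) for field, value in change.items()}
    return {
        name: (stamps[end] - stamps[start]).total_seconds()
        for name, start, end in SLICES
    }

def approval_wait_fraction(changes):
    """Share of total elapsed time, across all changes, spent waiting for approval."""
    totals = {name: 0.0 for name, _, _ in SLICES}
    for change in changes:
        for name, seconds in decompose_lead_time(change).items():
            totals[name] += seconds
    overall = sum(totals.values())
    return totals["approval_wait"] / overall if overall else 0.0

# Hypothetical change: generated in 2 minutes, validated in 10, approved 4 hours later.
example = {
    "requested": "2026-02-05T09:00:00",
    "committed": "2026-02-05T09:02:00",
    "validated": "2026-02-05T09:12:00",
    "approved": "2026-02-05T13:12:00",
    "deployed": "2026-02-05T13:18:00",
}

print(decompose_lead_time(example))
print(round(approval_wait_fraction([example]), 2))
```

Summing seconds across changes before dividing weights long-running changes more heavily than averaging per-change fractions would; pick whichever matches how you report lead time today.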

Change-failure rate: the metric to defend, not chase

If deployment frequency is the metric to stop chasing, change-failure rate is the one to defend. It is the clearest read on whether your autonomy is actually governed or merely fast. And it is under direct attack from AI-generated volume: more changes, a higher per-change defect probability, and reviewers who, when overwhelmed, start approving on autopilot. Around 80% of developers admit to bypassing policy or guardrails when those guardrails slow them down, so a gate that becomes a bottleneck doesn't just slow you down; it gets routed around, and your change-failure rate reflects the bypass, not the policy.

Two mechanisms keep this metric flat while volume scales. The first is change-aware validation: testing that exercises the modified paths and their dependents, not a static suite that ignores the dependency graph and reports green. The second is reachability-based prioritization. Asking whether a flaw sits on a path that is actually reachable in your deployed system can mean 70 to 90% less exploitable exposure to triage. Applied here, an unreachable defect doesn't have to fail a change, while a reachable one blocks it before it ships. You concentrate your failure-prevention budget on changes that can actually fail in production.
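The reachability rule can be sketched as a graph walk. Everything here is illustrative: the call map, the entry points, and the `gate` function are hypothetical stand-ins, and a production reachability analysis would work from real call graphs and deployment topology rather than a hand-written dict.

```python
def reachable_from(entry_points, calls):
    """Return every node reachable from the entry points in a caller -> callees map."""
    seen = set(entry_points)
    stack = list(entry_points)
    while stack:
        node = stack.pop()
        for callee in calls.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen

def gate(defect_locations, entry_points, calls):
    """Block the change only if a defect sits on a reachable path; defer the rest."""
    live = reachable_from(entry_points, calls)
    blocking = sorted(loc for loc in defect_locations if loc in live)
    deferred = sorted(loc for loc in defect_locations if loc not in live)
    return {"block": bool(blocking), "blocking": blocking, "deferred": deferred}

# Hypothetical call map: legacy.export is deployed but never invoked from an entry point.
CALLS = {
    "api.checkout": ["cart.total", "payments.charge"],
    "payments.charge": ["ledger.write"],
    "legacy.export": ["csv.dump"],
}

decision = gate(["ledger.write", "csv.dump"], ["api.checkout"], CALLS)
print(decision)
```

Deferred defects are not ignored; they are recorded and triaged off the critical path, which is what keeps the gate from becoming the bottleneck people route around.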

The governance principle is load-bearing for this metric specifically: agents propose, humans authorize. Unsupervised autonomous merging of high-blast-radius changes is how change-failure rate spikes. The point of the control layer is not to remove the human from the dangerous decisions; it is to remove the human from the safe ones, so their attention lands where failure originates. Governance is where the policy, the approval tiers, and the audit trail live as first-class configuration rather than tribal knowledge.

MTTR: from heroics to a closed loop

MTTR is the metric most distorted by AI volume, because the thing that drove recovery time down historically (an engineer who holds the system in their head) does not scale to a codebase where 41% of the code was written by a model nobody fully read. When the author is an agent and the reviewer skimmed, the institutional memory that powered fast recovery is thin.

A control layer restores recovery time by making the system the source of truth instead of the engineer's memory. The closed loop (understand, test, reproduce, remediate, verify) maps almost one-to-one onto the phases of an incident:

Understand / reproduce: the System Graph localizes blast radius and the dependency path, so triage starts from a map, not a guess.
Remediate: Remediation Fleets propose a governed fix, with the change validated against the same graph before it is offered.
Verify: the fix is re-validated and the evidence recorded before it ships, so you are not trading one incident for the next.

This is governed autonomous fixing, not a robot rewriting production at 3 a.m. The fleet proposes; a human authorizes; the audit trail captures both. For changes that must execute inside a customer boundary or a regulated enclave, Edge Runners run as signed capsules and emit audit-ready evidence from inside the boundary, which matters when your MTTR story has to survive a post-incident review or a compliance audit.
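Measuring recovery by loop phase makes it obvious whether your time goes to diagnosis or to fixing. A minimal sketch, assuming each incident record already carries minutes spent per phase; the record shape, the function names, and the sample incidents are hypothetical.

```python
from statistics import mean

PHASES = ("understand", "remediate", "verify")

def mttr_by_phase(incidents):
    """Mean minutes per loop phase across incidents, plus the summed total."""
    per_phase = {phase: mean(incident[phase] for incident in incidents) for phase in PHASES}
    per_phase["total"] = sum(per_phase[phase] for phase in PHASES)
    return per_phase

def dominant_phase(incidents):
    """Name of the phase with the highest mean duration."""
    breakdown = mttr_by_phase(incidents)
    return max(PHASES, key=lambda phase: breakdown[phase])

# Hypothetical incidents, durations in minutes.
incidents = [
    {"understand": 50, "remediate": 10, "verify": 5},
    {"understand": 30, "remediate": 20, "verify": 15},
]

print(mttr_by_phase(incidents))
print(dominant_phase(incidents))
```

If the dominant phase is `understand`, the lever is the dependency model, not faster fixes.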

What to do Monday morning

You don't need a re-platform to start measuring the right things.

Split deployment frequency by blast radius. Report auto-merged low-risk changes separately from human-authorized high-risk ones. A single blended number now hides more than it reveals.
Decompose lead time into the four slices above. Find your approval-wait fraction. That is your governance target.
Pair every change-failure-rate report with a reachability cut. Stop counting unreachable defects as the same risk as reachable ones.
Instrument MTTR by loop phase. Measure understand, remediate, and verify separately so you know whether your recovery cost is diagnosis or fixing.

The context underneath all four numbers is the same dependency model, which is why a live System Graph is the prerequisite, not a nice-to-have.

The bottom line
