
The Remediation Metrics That Matter: Mean-Time-to-Governed-Fix, Revert Rate, and Recurrence

MTTR rewards fast diffs, not safer systems. Govern autonomous remediation on mean-time-to-governed-fix, revert rate, recurrence, and reachable-risk instead.

Zof Reliability Team · Engineering and Product

January 7, 2025 · 7 min read · Updated January 7, 2025

01

Why MTTR went vanity the moment fixing got automated

MTTR was a defensible proxy when humans wrote the fix. A human applying a patch was an implicit quality gate. They understood the blast radius, they hesitated before touching a payment path, and the time the clock measured included the time they spent thinking. The number correlated with care.

Automated remediation breaks that correlation. An agent can close a ticket in ninety seconds, and the speed tells you nothing about whether the change was correct, whether it touched something it shouldn't have, or whether the same defect recurs next week under a different stack trace. You can drive MTTR to near zero by shipping fast, wrong fixes. The metric will applaud.

This is not a hypothetical edge case at current code volumes. Roughly 41% of codebases are now AI-generated, and around 45% of AI coding tasks introduce critical flaws or security issues. The defects are arriving faster than humans can review them, which is exactly why teams reach for autonomous fixing. But if the metric governing that fixing rewards diff throughput, you have built a machine that generates risk faster and calls it resolution. The cost of poor software quality already sits near $2.41 trillion. Optimizing for speed-to-close without governing for correctness is how you contribute to that number while reporting a green dashboard.

The fix is not to slow down. It is to measure the things that actually distinguish a governed fix from a fast one.

02

1. Mean-Time-to-Governed-Fix (MTTGF)

Replace MTTR with the time from defect detection to a fix that is validated, policy-checked, authorized, and verified in production. The stopwatch does not stop when a diff merges. It stops when the change has cleared the control loop: validated against its real dependencies, checked against policy, authorized by a named human where policy requires it, and confirmed to have actually resolved the underlying defect.

That is a harder number to move, and that is the point. MTTGF refuses to give you credit for a fix that hasn't been proven. It includes the validation and authorization time that vanity MTTR conveniently omits.

A skeptical reader will object that this just makes the number look worse. Correct, at first. But it makes the number *honest*, and an honest number that trends down over a quarter is a defensible ROI story. A vanity number that's already near zero has nowhere to go and proves nothing. The closed loop is what makes MTTGF measurable at all: Understand the change's real scope, Test it, Reproduce the failure deterministically, Remediate, Verify. Each stage stamps the evidence the metric depends on.
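To make the difference between the two stopwatches concrete, here is a minimal sketch, assuming each fix carries a timestamp for every gate. The `FixRecord` type and all of its field names are hypothetical, not part of any Zof API; the point is only that the governed clock stops at the latest required gate, and that a fix missing a gate stays open rather than counting as resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class FixRecord:
    """Timestamps for one remediated defect. All field names are hypothetical."""
    detected_at: datetime
    merged_at: datetime
    validated_at: Optional[datetime] = None
    policy_checked_at: Optional[datetime] = None
    authorized_at: Optional[datetime] = None  # required only when policy demands a named human
    verified_at: Optional[datetime] = None    # confirmed in production
    needs_named_human: bool = False


def governed_at(fix: FixRecord) -> Optional[datetime]:
    """Return when the fix cleared every required gate, or None if it has not yet."""
    gates = [fix.validated_at, fix.policy_checked_at, fix.verified_at]
    if fix.needs_named_human:
        gates.append(fix.authorized_at)
    if any(g is None for g in gates):
        return None
    return max(gates)  # the clock stops at the last gate, not at merge


def mttr_hours(fixes: List[FixRecord]) -> float:
    """Vanity clock: detection to merge."""
    return mean([(f.merged_at - f.detected_at).total_seconds() for f in fixes]) / 3600


def mttgf_hours(fixes: List[FixRecord]) -> Tuple[Optional[float], int]:
    """Governed clock: (mean hours over governed fixes, count of fixes still open)."""
    durations = []
    still_open = 0
    for f in fixes:
        done = governed_at(f)
        if done is None:
            still_open += 1
        else:
            durations.append((done - f.detected_at).total_seconds())
    return (mean(durations) / 3600 if durations else None), still_open


# Illustrative data only: three fixes that all merged two minutes after detection.
t0 = datetime(2025, 1, 7, 9, 0)
m, h = timedelta(minutes=1), timedelta(hours=1)
fixes = [
    FixRecord(t0, t0 + 2 * m, validated_at=t0 + 30 * m,
              policy_checked_at=t0 + 35 * m, verified_at=t0 + 4 * h),
    FixRecord(t0, t0 + 2 * m, validated_at=t0 + 20 * m, policy_checked_at=t0 + 25 * m,
              authorized_at=t0 + 2 * h, verified_at=t0 + 6 * h, needs_named_human=True),
    FixRecord(t0, t0 + 2 * m),  # merged, but never validated or verified
]

mttr = mttr_hours(fixes)
mttgf, still_open = mttgf_hours(fixes)
print(f"MTTR: {mttr:.2f} h, MTTGF: {mttgf:.2f} h, still open: {still_open}")
```

On this sample the merge clock reports about two minutes, while the governed clock reports five hours with one fix still unproven. Reporting the open count next to the mean is a design choice in this sketch: it keeps unverified fixes from silently dropping out of the average.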

03

2. Revert rate (the fix that didn't hold)

Revert rate is the percentage of shipped fixes that get rolled back, hotfixed, or superseded within a defined window. It is the single most honest signal that your remediation is churning diffs rather than reducing risk, because a revert is the system telling you the fix was wrong, incomplete, or had side effects nobody caught.

Vanity MTTR hides reverts. A fix that ships in two minutes and gets reverted in two hours can still report a low MTTR for the original ticket. The revert becomes a *new* ticket with its own fast resolution, and the dashboard shows two quick wins instead of one failure. Revert rate collapses that illusion into a single number you can't game.

Watch for two failure modes:

  • The thrash loop. A fix, a revert, a re-fix, another revert. Each cycle scores well on MTTR and terribly on revert rate. This is the signature of unsupervised autonomous fixing with no verification step.
  • The silent supersede. A fix that's quietly replaced by a different change days later, never formally reverted. Track supersession alongside reverts or the thrash hides in plain sight.

A low revert rate is what earns the trust to widen autonomy. It is the metric that lets you tell a board that agents are fixing more *and* breaking less.
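The definition above, including the two failure modes, can be sketched in a few lines. This is an illustration under stated assumptions: the `ShippedFix` record, the `undone_kind` labels, and the seven-day window are all hypothetical choices, and supersession is counted the same way as a formal revert so that the silent supersede cannot hide.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ShippedFix:
    """One shipped fix. Field names and labels are hypothetical."""
    defect_id: str
    shipped_at: datetime
    undone_at: Optional[datetime] = None
    undone_kind: Optional[str] = None  # "revert", "hotfix", or "supersede"


def did_not_hold(fix: ShippedFix, window: timedelta) -> bool:
    """True if the fix was reverted, hotfixed, or superseded inside the window."""
    return fix.undone_at is not None and fix.undone_at - fix.shipped_at <= window


def revert_rate(fixes: List[ShippedFix], window: timedelta = timedelta(days=7)) -> float:
    """Share of shipped fixes that did not hold within the window."""
    if not fixes:
        return 0.0
    return sum(did_not_hold(f, window) for f in fixes) / len(fixes)


def thrash_loops(fixes: List[ShippedFix], window: timedelta = timedelta(days=7),
                 min_cycles: int = 2) -> List[str]:
    """Defects whose fixes failed to hold at least `min_cycles` times."""
    failures = Counter(f.defect_id for f in fixes if did_not_hold(f, window))
    return sorted(d for d, n in failures.items() if n >= min_cycles)


# Illustrative data only.
t0 = datetime(2025, 1, 7, 9, 0)
d, h = timedelta(days=1), timedelta(hours=1)
fixes = [
    ShippedFix("D1", t0, undone_at=t0 + 2 * h, undone_kind="revert"),
    ShippedFix("D1", t0 + 3 * h, undone_at=t0 + 1 * d, undone_kind="revert"),
    ShippedFix("D1", t0 + 2 * d),                                             # third attempt held
    ShippedFix("D2", t0, undone_at=t0 + 3 * d, undone_kind="supersede"),      # silent supersede
    ShippedFix("D3", t0),                                                     # held
    ShippedFix("D4", t0, undone_at=t0 + 10 * d, undone_kind="supersede"),     # outside the window
]

rate = revert_rate(fixes)
loops = thrash_loops(fixes)
print(f"revert rate: {rate:.0%}, thrash loops: {loops}")
```

Here three of six fixes failed to hold, and defect `D1` surfaces as a thrash loop even though each of its individual tickets closed quickly. The window length is the main knob: too short and late supersessions escape, too long and ordinary follow-up work gets counted as failure.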

04

3. Recurrence rate

Recurrence is the percentage of defects that return after being marked resolved, whether as the identical bug or the same root cause wearing a different stack trace. It answers the question MTTR structurally cannot: did we fix the *problem*, or did we mute the *symptom*?

This is where most remediation programs leak value. A fix that suppresses an error without addressing its cause will pass tests, close the ticket, and reduce MTTR, and the defect will resurface under marginally different conditions. You pay the remediation cost repeatedly and never retire the risk. High recurrence with low MTTR is the precise fingerprint of symptom-patching at scale.

Measuring recurrence properly requires that your remediation is reproduction-grounded. If you can't deterministically reproduce a failure, you can't prove a fix addressed its cause rather than coincidentally clearing the alert. This is the Reproduce stage of the loop doing load-bearing work: a fix verified against a reproduced failure is a fix you can claim retired the defect. Reliability Analytics is where recurrence becomes visible as a trend, because recurrence only shows up over time and across incidents that a per-ticket view can't connect.
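A rough sketch of the calculation, assuming every incident has already been tagged with a root-cause fingerprint that survives a change of stack trace. Producing that fingerprint is the hard part and is not shown; the `Incident` type and the `root_cause` field are hypothetical. A resolved incident counts as recurred when any incident with the same root cause opens after it was resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """One incident. `root_cause` is a hypothetical fingerprint, not the stack trace."""
    incident_id: str
    root_cause: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None


def recurred(incident: Incident, all_incidents: List[Incident]) -> bool:
    """True if the same root cause reappeared after this incident was resolved."""
    if incident.resolved_at is None:
        return False
    return any(
        other.root_cause == incident.root_cause and other.opened_at > incident.resolved_at
        for other in all_incidents
    )


def recurrence_rate(incidents: List[Incident]) -> float:
    """Share of resolved incidents whose root cause came back later."""
    resolved = [i for i in incidents if i.resolved_at is not None]
    if not resolved:
        return 0.0
    return sum(recurred(i, incidents) for i in resolved) / len(resolved)


# Illustrative data only: I2 is the same root cause as I1 under a different stack trace.
t0 = datetime(2025, 1, 7, 9, 0)
d, h = timedelta(days=1), timedelta(hours=1)
incidents = [
    Incident("I1", "null-tenant-id", t0, resolved_at=t0 + 1 * h),
    Incident("I2", "null-tenant-id", t0 + 7 * d, resolved_at=t0 + 7 * d + 1 * h),
    Incident("I3", "retry-timeout", t0 + 1 * d, resolved_at=t0 + 2 * d),
    Incident("I4", "cert-expiry", t0 + 3 * d, resolved_at=t0 + 3 * d + 2 * h),
]

rate = recurrence_rate(incidents)
print(f"recurrence rate: {rate:.0%}")
```

Both `null-tenant-id` incidents closed within an hour, so a per-ticket view shows two quick wins. Grouping by root cause shows that the first fix did not retire the defect, which is why the sample rate is one in four rather than zero.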

05

4. Reachable-risk burndown (not finding count)

Counting how many findings remediation closed is the issue-tracker version of vanity MTTR. Most findings aren't exploitable from a live entry point, so a high close count can represent a lot of motion against low-risk noise while the genuinely dangerous, reachable defects wait in the queue.

Measure burndown of reachable risk instead: the exploitable exposure that is actually reachable from a live path. Reachability-based prioritization can mean 70-90% less exploitable exposure to triage, which means the metric stops flattering busywork and starts tracking risk that matters. A System Graph is what makes this computable, because reachability is a property of the dependency map. It knows whether a vulnerable function sits on a path from a real entry point or in dead code no request ever hits. Remediation governed on reachable-risk burndown puts agent effort where the danger is, instead of wherever the close count is easiest to run up.
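As an illustration of why reachability is a graph property, the sketch below walks a toy call graph breadth-first from its live entry points and sums the scores of open findings that sit on reachable nodes. The graph, node names, finding IDs, and scores are invented for the example, and a plain dictionary of edges stands in for whatever dependency map you actually have.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_from(graph: Dict[str, List[str]], entry_points: Iterable[str]) -> Set[str]:
    """Breadth-first search: every node reachable from at least one live entry point."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reachable_risk(graph, entry_points, findings) -> float:
    """Sum of scores for open findings located on reachable nodes."""
    live = reachable_from(graph, entry_points)
    return sum(f["score"] for f in findings if f["open"] and f["node"] in live)


def close(findings, finding_id):
    """Return a copy of the findings with one finding marked closed."""
    return [dict(f, open=False) if f["id"] == finding_id else f for f in findings]


# Illustrative call graph: `legacy_export` has no path from any live entry point.
graph = {
    "POST /pay": ["charge"],
    "charge": ["parse_card"],
    "legacy_export": ["old_xml"],
}
entry_points = ["POST /pay"]
findings = [
    {"id": "F1", "node": "parse_card", "score": 9.0, "open": True},
    {"id": "F2", "node": "old_xml", "score": 9.0, "open": True},  # dead code
    {"id": "F3", "node": "charge", "score": 5.0, "open": True},
]

before = reachable_risk(graph, entry_points, findings)
after_dead_fix = reachable_risk(graph, entry_points, close(findings, "F2"))
after_live_fix = reachable_risk(graph, entry_points, close(findings, "F1"))
print(before, after_dead_fix, after_live_fix)
```

Closing `F2` raises the close count by one and leaves reachable risk unchanged at 14.0, while closing `F1` drops it to 5.0. The two findings carry the same score, so a finding count or a severity total cannot tell them apart; only the path from a live entry point does.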

06

5. Authorization integrity (the governance metric)

If agents are proposing fixes, you must measure whether the human-authorization model is actually holding. Authorization integrity asks: what percentage of shipped fixes followed the policy that governs them? Were payment-path changes approved by a named human? Did anything reach production through a bypass?

This metric exists because the alternative is the failure mode this entire category is built to prevent. Around 80% of developers already bypass policy and guardrails, and a fast autonomous fixer is the easiest thing in your stack to wave through. "Agents propose; humans authorize" is only real if you can prove it after the fact. The governance posture is the engineering, not the paperwork.

Remediation Fleets propose fixes; Governance records who authorized what, under which policy, with a reproducible audit trail. For teams that can't send code to a vendor cloud, Edge Runners run the loop as signed capsules inside your boundary and produce the same audit-ready evidence. When an incident review or a regulator asks why a change shipped, authorization integrity is the difference between an answer and an apology.
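One way to turn the three questions above into a number is sketched here, assuming an audit log that records the touched paths, the approver, and whether any check was bypassed. The `ShippedChange` record, the `PROTECTED_PREFIXES` policy, and the path layout are hypothetical and do not describe how Zof Governance stores its records.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ShippedChange:
    """One audit-log entry for a shipped fix. Field names are hypothetical."""
    change_id: str
    paths: List[str]
    approved_by: Optional[str] = None  # named human, if any
    bypassed_checks: bool = False


# Hypothetical policy: anything under these prefixes needs a named human approver.
PROTECTED_PREFIXES: Tuple[str, ...] = ("services/payments/",)


def is_compliant(change: ShippedChange,
                 protected: Tuple[str, ...] = PROTECTED_PREFIXES) -> bool:
    """A change complies if nothing was bypassed and protected paths had an approver."""
    if change.bypassed_checks:
        return False
    touches_protected = any(p.startswith(protected) for p in change.paths)
    return (not touches_protected) or bool(change.approved_by)


def authorization_integrity(changes: List[ShippedChange]) -> float:
    """Share of shipped changes that followed the policy governing them."""
    if not changes:
        return 1.0
    return sum(is_compliant(c) for c in changes) / len(changes)


def violations(changes: List[ShippedChange]) -> List[str]:
    """IDs of changes that need an explanation in the incident review."""
    return [c.change_id for c in changes if not is_compliant(c)]


# Illustrative data only.
changes = [
    ShippedChange("c1", ["services/payments/charge.py"], approved_by="a.rivera"),
    ShippedChange("c2", ["services/payments/refund.py"]),             # no named approver
    ShippedChange("c3", ["docs/runbook.md"]),                         # unprotected path
    ShippedChange("c4", ["docs/changelog.md"], bypassed_checks=True), # shipped via bypass
]

integrity = authorization_integrity(changes)
print(f"authorization integrity: {integrity:.0%}, violations: {violations(changes)}")
```

The sample scores 50%, with `c2` and `c4` listed as violations. Note that a bypass counts against the metric even on an unprotected path; that is a deliberately strict assumption in this sketch, and a real policy may draw the line differently.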

07

What to do Monday morning

You don't need to instrument all five at once. You need to stop trusting MTTR and start trusting the metrics that distinguish a governed fix from a fast one.

  • Add revert rate to your remediation dashboard this week. It's the cheapest honest signal, and you almost certainly already have the data.
  • Define your governed-fix clock. Decide what "done" means: validated, policy-checked, authorized, verified. Measure to that line, not to merge.
  • Stop reporting finding counts to leadership. Replace them with reachable-risk burndown so effort tracks danger.
  • Write your authorization policy down. If you can't state which fixes need a named human, you can't measure whether the rule held.

Prove the metrics on one service, watch revert and recurrence fall, then widen autonomy as the evidence compounds.

08

The bottom line
