Enterprise

Activity vs. Outcome: Why Your Reliability Metrics Are Measuring the Wrong Thing

Test counts and run volumes are activity theater. Here's why only outcome metrics, such as escaped defects and proven-safe releases, justify reliability investment.


Zof Reliability Team · Engineering and Product

June 17, 2026 · 7 min read · Updated June 17, 2026

The activity trap

Most reliability metrics are activity metrics in disguise. Test count, suite execution volume, lines of coverage, CI minutes consumed, automation percentage. They share a comforting property: they always trend in the direction you want. Add tests, the count rises. Run them more often, executions climb. Each number rewards motion, and motion feels like progress.

The problem is that none of these metrics is causally connected to the thing your business cares about, which is whether software fails in front of customers. You can double your test count and ship the same incidents. You can hit 90% coverage and still take down payments, because coverage measures lines exercised, not failure modes anticipated. A team can run a hundred thousand assertions a day and learn nothing about whether the one change that mattered was safe to release.

This is activity theater. It produces a dashboard that looks like control and a release process that delivers none. Worse, it is self-justifying. When an incident slips through, the reflex is to write more tests, which raises the activity numbers, which makes the dashboard look healthier right after the system proved it was not. The metric improves precisely when it has failed you.

The tell is simple. Ask whether anyone outside engineering can feel the number move. A VP does not feel test count. A customer does not feel coverage. They feel outages, regressions, the slow tax of a product that breaks in ways that erode trust. If a metric cannot be felt by the people funding it, it is activity, not outcome.

Why this got worse, fast

For most of software history, activity metrics were a tolerable proxy. A diligent team writing more tests usually did ship more reliably, because the same humans who wrote the tests understood the code. That correlation is breaking.

Industry research now estimates that roughly 41% of codebases are AI-generated, and finds that about 45% of AI coding tasks introduce critical flaws or security issues. Read those together. A large and growing share of your code arrives from a source that produces volume at machine speed and ships defects by default. Meanwhile, an estimated 80% of developers bypass policy and guardrails when those guardrails are advisory.

In that environment, activity metrics actively mislead. Test count rises because AI also generates tests, often tests that assert the buggy behavior is correct. Coverage rises while exploitable surface rises faster. The activity dashboard goes green while the actual risk curve bends the wrong way. The cost of poor software quality, estimated near $2.41 trillion, is in large part the bill for organizations that measured how much testing happened instead of whether the right risk was retired.

The honest conclusion: at AI scale, counting activity does not just fail to help. It manufactures false confidence at the exact moment confidence is least warranted.

What an outcome metric actually is

An outcome metric answers a question a non-engineer would ask: did the thing we were afraid of happen, and are we getting better at preventing it? It is felt, not just observed.

A few properties separate outcome from activity:

It maps to a real-world failure, not an internal action. Escaped defects that reached production. Incidents caused by a change that passed your gates. Time a regression lived undetected. These are outcomes. "Tests added" is an action.
It can get worse when you do more work. A genuine outcome metric is not guaranteed to improve with effort. If your number only ever goes up, it is measuring activity. Escaped-defect rate can climb even as test count climbs, which is exactly why it is honest.
It survives the "so what" test. If leadership asks "so what?" and the answer is another internal number, you have activity. If the answer is "fewer customer-facing failures, lower exposure, faster safe releases," you have an outcome.
It is change-aware. A blast-radius-weighted outcome distinguishes a regression on a marketing page from one on the payments path. Treating all failures as equal is its own form of theater.
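
To make the "can get worse" and "change-aware" properties concrete, here is a minimal sketch of a blast-radius-weighted escaped-defect rate. Everything named in it is an assumption for illustration: the surface names, the weights in `BLAST_RADIUS`, and the `EscapedDefect` record are hypothetical, and real weights would have to come from your own view of business impact.

```python
from dataclasses import dataclass

# Hypothetical weights: how much a failure on each surface hurts the business.
BLAST_RADIUS = {"marketing": 1.0, "search": 3.0, "payments": 10.0}


@dataclass(frozen=True)
class EscapedDefect:
    surface: str        # where the failure was felt in production
    passed_gates: bool  # True if the causing change passed every release gate


def weighted_escaped_defect_rate(defects, releases):
    """Blast-radius-weighted escaped defects per release.

    Only defects whose causing change passed the gates are counted,
    because those are the ones the validation process failed to catch.
    """
    if releases <= 0:
        raise ValueError("releases must be positive")
    total = sum(BLAST_RADIUS[d.surface] for d in defects if d.passed_gates)
    return total / releases


# Two quarters with the same number of releases.
q1 = [EscapedDefect("marketing", True), EscapedDefect("marketing", True)]
q2 = [EscapedDefect("payments", True)]

q1_rate = weighted_escaped_defect_rate(q1, releases=20)
q2_rate = weighted_escaped_defect_rate(q2, releases=20)
print(q1_rate, q2_rate)
```

In this toy data the second quarter has fewer escaped defects by raw count, yet a higher weighted rate, because the one defect landed on the payments path. No amount of added tests moves the number unless the escapes themselves change.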

The hardest part is that outcome metrics are uncomfortable. They can make a busy quarter look bad. That discomfort is the point. A metric that cannot deliver bad news cannot justify investment, because it never told you the truth in the first place.

You cannot measure outcomes without a model of the system

Here is the catch that traps most teams. Outcome metrics require knowing what a change actually touched and whether the resulting risk was real. That demands a live model of the system, not a spreadsheet of test runs.

This is the gap a control layer is built to close. A System Graph maintains a live dependency and context map of services, dependencies, and CI/CD, which makes validation change-aware. With that model, a release decision can be scored by what it actually affects instead of by how many assertions ran. It is also what makes reachability-based prioritization possible, which industry research associates with 70 to 90% less exploitable exposure, because you act on what is genuinely reachable in the live graph rather than triaging a flat list.
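
The idea of scoring a change by what it actually affects can be sketched with a plain graph traversal. This is not Zof's System Graph or its API; the `DEPENDENTS` map, the service names, and the `CRITICAL` set are hypothetical stand-ins for whatever live model you maintain.

```python
from collections import deque

# Hypothetical dependency map: each component -> components that depend on it.
DEPENDENTS = {
    "currency-lib": ["pricing"],
    "pricing": ["checkout"],
    "checkout": ["payments-api"],
    "banner-svc": ["marketing-site"],
}

# Hypothetical set of components on the revenue-critical path.
CRITICAL = {"checkout", "payments-api"}


def affected_by(changed, dependents):
    """Return the changed component plus everything downstream of it (BFS)."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def touches_critical_path(changed, dependents=DEPENDENTS, critical=CRITICAL):
    """True if a change can reach any critical component through the graph."""
    return bool(affected_by(changed, dependents) & critical)


print(sorted(affected_by("currency-lib", DEPENDENTS)))
print(touches_critical_path("currency-lib"), touches_critical_path("banner-svc"))
```

A small library change that reaches the payments path gets flagged, while a banner change does not, regardless of how many assertions ran against either. The same traversal, run from entry points instead of from a changed component, is one simple way to approximate reachability when triaging findings.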

Validation has to be an outcome, not a report. Static scripts decay the moment the system moves, and a decaying suite inflates activity numbers while degrading real coverage. Testing Fleets plan, execute, observe, and maintain validation as the system evolves, so the output is a verdict on whether a specific change is safe, not a coverage percentage on a chart.

And the loop has to close with evidence. Zof's operating model runs Understand, Test, Reproduce, Remediate, Verify, where remediation is governed under the principle that agents propose and humans authorize. Remediation Fleets propose fixes; Governance decides whether they execute and records who authorized what. The outcome metric that falls out of this is not "tests run." It is "changes proven safe, with an audit-ready record of why." A serious enterprise does not want more automation it cannot see. It wants control it can prove.

What to do Monday morning

You do not need a platform migration to stop measuring the wrong thing. You need to retire one vanity metric and stand up one outcome metric in its place.

Pick your loudest activity metric and demote it. Test count or raw coverage is the usual suspect. Keep it as a diagnostic, but take it off the leadership dashboard. It was never the story.
Stand up one escaped-defect measure. Count defects and incidents that reached production despite passing your gates, weighted by blast radius. This is the number leadership feels.
Measure time-to-verified, not time-to-tested. Track how long from a change landing to a proven-safe verdict with evidence. Speed of confidence beats volume of activity.
Run the "so what" test on every reliability number you report. If the honest answer is another internal metric, cut it from the executive view.
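
As a rough illustration of time-to-verified, the sketch below computes the median time from a change landing to a proven-safe verdict, and reports changes that never received a verdict separately so they cannot hide inside an average. The record fields (`landed_at`, `verified_at`) and the sample data are assumptions for the example.

```python
from datetime import datetime
from statistics import median


def time_to_verified_hours(changes):
    """Return (median hours from landing to verdict, count of unverified changes).

    A change with verified_at set to None never reached a proven-safe verdict;
    it is counted separately rather than silently dropped.
    """
    durations = [
        (c["verified_at"] - c["landed_at"]).total_seconds() / 3600
        for c in changes
        if c["verified_at"] is not None
    ]
    unverified = sum(1 for c in changes if c["verified_at"] is None)
    return (median(durations) if durations else None), unverified


# Hypothetical sample: three changes landing on the same morning.
day = datetime(2026, 6, 17)
changes = [
    {"id": "chg-1", "landed_at": day.replace(hour=9), "verified_at": day.replace(hour=10)},
    {"id": "chg-2", "landed_at": day.replace(hour=9), "verified_at": day.replace(hour=13)},
    {"id": "chg-3", "landed_at": day.replace(hour=9), "verified_at": None},
]

median_hours, unverified = time_to_verified_hours(changes)
print(median_hours, unverified)
```

The median is used here instead of the mean so that one slow verification does not dominate the figure; the unverified count is arguably the more important half of the result.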

Each move shifts your scorecard from how much testing happened to whether risk was actually retired. That is the only basis on which reliability investment gets justified twice.

The bottom line
