Autonomous Reliability

Measuring Quality Intelligence: The Metrics That Actually Predict Reliability

Pass rate predicts nothing. Move SRE teams to reachability-weighted coverage, escaped-defect trends, and confidence-to-release signals that actually hold.


Zof Reliability Team · Engineering & Product

October 28, 2025 · 7 min read · Updated October 28, 2025

## Why pass rate stopped predicting anything

Pass rate answered a real question in a slower era: of the tests we wrote, how many passed? That was a reasonable proxy for risk when humans wrote most of the code, changes were legible, and the suite roughly tracked the system. None of those conditions hold anymore.

Roughly 41% of codebases are now AI-generated, and around 45% of AI coding tasks introduce critical flaws or security issues. The volume of change and the defect rate climbed together. A pass rate computed against a static suite measures the wrong denominator: it tells you how the tests that exist behaved, not whether the tests that *should* exist for this change were ever run. Green means "nothing we already check broke." It says nothing about the surface this specific diff exposed.

The failure mode is concrete. A diff touches a config three repositories away. CI is green because the cart and checkout suites pass. Nobody asks whether that config is reachable from a revenue-critical path, because pass rate has no concept of reachability. The number went up and to the right while the actual risk moved somewhere the metric cannot see. The cost of poor software quality in the US has been estimated at $2.41 trillion (CISQ, 2022), and a large share of that bill was run up by teams who were watching a healthy-looking dashboard the whole time.

The deeper problem is incentive. A metric that is easy to satisfy gets satisfied. Around 80% of developers bypass policy and guardrails, and a pass-rate gate is among the easiest to game: add shallow tests, skip flaky ones, and the number flatters everyone while predicting nothing.

## The three signals that actually predict reliability

Reliability is a property of the change against the live system, not an average of the suite. Three metrics track that, and they are the ones to put on the wall.

### 1. Reachability-weighted coverage

Raw coverage counts lines or branches executed. It treats a getter in an internal admin tool the same as a code path one hop from your payment processor. Reachability-weighted coverage asks a sharper question: of the code that is actually reachable from a live entry point, how much is validated?

This reweighting changes behavior immediately. It de-emphasizes coverage theater on dead or low-consequence code and concentrates effort where exploitation is possible. The leverage is large: reachability-based prioritization can mean 70-90% less exploitable exposure to triage, because you stop spending validation budget on paths an attacker or a failure cannot reach. The prerequisite is a live model of what is reachable. A System Graph that maps services, dependencies, and CI/CD is what makes "reachable" a computed fact rather than a tribal guess.
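The difference between raw and reachability-weighted coverage can be sketched in a few lines. This is a minimal illustration, not Zof's implementation: the `CodeUnit` type, the `weight` values, and the unit names are all hypothetical, and the `reachable` flag is assumed to come from whatever dependency map you already trust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeUnit:
    name: str
    reachable: bool   # assumed input: reachable from a live entry point
    weight: float     # hypothetical consequence weight (higher = closer to revenue)
    validated: bool   # exercised by validation that ran for this change


def raw_coverage(units: list[CodeUnit]) -> float:
    """Fraction of all units validated, ignoring reachability."""
    if not units:
        return 0.0
    return sum(u.validated for u in units) / len(units)


def reachability_weighted_coverage(units: list[CodeUnit]) -> float:
    """Weighted fraction of reachable units that are validated."""
    reachable = [u for u in units if u.reachable]
    total = sum(u.weight for u in reachable)
    if total == 0:
        return 0.0
    covered = sum(u.weight for u in reachable if u.validated)
    return covered / total


# Made-up data: well-tested admin code, an untested payment path.
units = [
    CodeUnit("admin.get_label", reachable=False, weight=1.0, validated=True),
    CodeUnit("admin.export_csv", reachable=False, weight=1.0, validated=True),
    CodeUnit("cart.add_item", reachable=True, weight=2.0, validated=True),
    CodeUnit("payments.charge", reachable=True, weight=5.0, validated=False),
]

print(f"raw coverage:      {raw_coverage(units):.2f}")
print(f"weighted coverage: {reachability_weighted_coverage(units):.2f}")
```

With these made-up numbers, raw coverage is 0.75 while the reachability-weighted figure is about 0.29, because the one unvalidated unit is the reachable, high-weight one.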

### 2. Escaped-defect trend

The honest measure of a quality program is not what it catches but what it lets through. An escaped defect is one your validation passed and production later surfaced. The single number that matters is the trend: are escapes per release falling, flat, or rising as you ship faster?

Track three cuts:

- **Escape rate by reachability tier.** An escape on a reachable, high-consequence path is a different event from a cosmetic one. Weight accordingly.
- **Time-in-system before escape.** Defects that survive multiple releases before surfacing indicate a blind spot your suite structurally cannot see, not bad luck.
- **Recurrence.** A defect class that escapes twice is a process failure. If your postmortems read like reruns, this metric is already telling you so.

Pass rate can hold at 100% while escaped defects climb. When those two lines diverge, believe the escape trend. It is measuring reality; pass rate is measuring your test list.
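The three cuts above can be computed from a simple ledger of escapes. The sketch below assumes a hypothetical `Escape` record and made-up tier weights; the field names and the choice to bucket by the release in which a defect surfaced are illustrative assumptions, not a prescribed schema.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Escape:
    defect_class: str        # e.g. "stale-config"
    tier: str                # hypothetical tiers: "critical", "standard", "cosmetic"
    introduced_release: int  # release in which the defect shipped
    surfaced_release: int    # release in which production surfaced it


# Assumed weights; tune to your own consequence model.
TIER_WEIGHT = {"critical": 5.0, "standard": 2.0, "cosmetic": 0.5}


def weighted_escapes_by_release(escapes):
    """Cut 1: tier-weighted escape score per release where the defect surfaced."""
    score = defaultdict(float)
    for e in escapes:
        score[e.surfaced_release] += TIER_WEIGHT[e.tier]
    return dict(sorted(score.items()))


def mean_releases_in_system(escapes):
    """Cut 2: average number of releases a defect survived before surfacing."""
    if not escapes:
        return 0.0
    return sum(e.surfaced_release - e.introduced_release for e in escapes) / len(escapes)


def recurring_classes(escapes):
    """Cut 3: defect classes that escaped more than once."""
    counts = Counter(e.defect_class for e in escapes)
    return sorted(c for c, n in counts.items() if n > 1)


# Made-up ledger for illustration.
escapes = [
    Escape("stale-config", "critical", introduced_release=10, surfaced_release=12),
    Escape("typo-in-footer", "cosmetic", introduced_release=12, surfaced_release=12),
    Escape("stale-config", "critical", introduced_release=12, surfaced_release=13),
    Escape("rounding", "standard", introduced_release=11, surfaced_release=13),
]

print(weighted_escapes_by_release(escapes))
print(mean_releases_in_system(escapes))
print(recurring_classes(escapes))
```

In this toy ledger the weighted score rises from release 12 to release 13 and `stale-config` shows up as a recurring class, which is the kind of divergence from a flat pass rate the section describes.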

### 3. Confidence-to-release

The release decision is still, for most teams, a meeting: a green pipeline, some anecdotes about what feels risky, and a senior person saying "ship it." That gut call is the most expensive judgment in your delivery process and it runs on the least evidence.

Confidence-to-release replaces the vibe with a signal scoped to the specific change: validated against its real dependency surface, with reachable risk below a stated policy threshold, and the evidence attached. It is not a status on a chart. It is a defensible answer to one question: *is this change safe to release into this system right now, and what is the proof?* The useful leading indicator is time-to-confidence: how long it takes to get from a merged change to a defensible release decision. When that number falls over a quarter, it is the one your VP of Engineering repeats.

## Why these metrics need a control layer, not another dashboard

Here is the hard part, and the honest objection: you cannot compute any of these three from a dashboard. Reachability weighting needs a live dependency map. Escaped-defect trends need validation that tracks the code instead of decaying behind it. Confidence-to-release needs a policy engine and an audit trail. These are control-plane primitives, not panels you bolt onto observability.

Each signal maps to a mechanism in the closed loop, Understand → Test → Reproduce → Remediate → Verify:

- Reachability comes from *Understand*. The System Graph makes validation change-aware: it knows the cart service calls payments, and that a config three repos away is reachable from checkout.
- Coverage that tracks the system comes from *Test*. Testing Fleets plan, execute, observe, and maintain validation as the system evolves. They are coordinated agents, not static scripts, so coverage tracks the code instead of rotting behind it. That is what keeps the escaped-defect trend honest.
- Confidence-to-release is assembled by Reliability Analytics from accumulated evidence: time-to-validate, reachable-risk trend, remediation cycle time. These are leading indicators an SRE can defend, unlike a coverage percentage that flatters the dashboard.
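To make "reachable" a computed fact, the question "is this changed config on a path from checkout?" reduces to a graph traversal. The sketch below is a plain breadth-first search over a hand-written adjacency map; the node names and the `SYSTEM_GRAPH` structure are hypothetical stand-ins and say nothing about how the System Graph is actually stored or queried.

```python
from collections import deque

# Hypothetical dependency edges: "A": ["B"] means A calls or reads B.
SYSTEM_GRAPH = {
    "checkout": ["cart", "payments"],
    "cart": ["pricing"],
    "payments": ["payments-config"],
    "pricing": [],
    "payments-config": [],
    "admin-tool": ["report-config"],
    "report-config": [],
}


def reachable_from(graph, entry_points):
    """Return every node reachable from the given entry points (breadth-first)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def reachable_changes(graph, entry_points, changed_nodes):
    """Which changed nodes sit on a path from a live entry point?"""
    return sorted(set(changed_nodes) & reachable_from(graph, entry_points))


# A diff touches two configs; only one is reachable from checkout.
print(reachable_changes(SYSTEM_GRAPH, ["checkout"], ["payments-config", "report-config"]))
```

Here the `payments-config` change is flagged and the `report-config` change is not, even though neither lives in the checkout repository.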

Remediation belongs in the loop too, and it is the part to be most careful about. When a metric reveals a defect worth fixing, Remediation Fleets propose a scoped fix. They do not silently ship it. Agents propose; humans authorize. Unsupervised autonomous fixing inside a release gate would be reckless; the Governance layer of policy, approval, and audit is the engineering. The fix is re-validated against the same scope before any metric updates.

This also resolves the bypass problem. The 80% bypass rate is a symptom of gates that are subjective or slow. A gate defined as "reachable critical findings = 0, payment-path change requires one named approval" is concrete. "I bypassed it because it felt fine" is no longer an available excuse. Engineers route *through* a fast, specific, evidence-backed gate, not around it.
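The example gate quoted above is small enough to write down as a checkable function. This is a sketch under stated assumptions: `ChangeEvidence`, `GateDecision`, and `evaluate_gate` are hypothetical names, and a real policy engine would also record who evaluated the gate and when.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvidence:
    change_id: str
    reachable_critical_findings: int
    touches_payment_path: bool
    approvers: tuple[str, ...] = ()  # named approvers, if any


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reasons: tuple[str, ...]  # why the gate blocked; empty when allowed


def evaluate_gate(ev: ChangeEvidence) -> GateDecision:
    """Policy: reachable critical findings = 0, and a payment-path
    change requires at least one named approval."""
    reasons = []
    if ev.reachable_critical_findings != 0:
        reasons.append(f"{ev.reachable_critical_findings} reachable critical finding(s)")
    if ev.touches_payment_path and len(ev.approvers) < 1:
        reasons.append("payment-path change has no named approver")
    return GateDecision(allowed=not reasons, reasons=tuple(reasons))


decision = evaluate_gate(
    ChangeEvidence("chg-001", reachable_critical_findings=0, touches_payment_path=True)
)
print(decision.allowed, decision.reasons)
```

Because the decision carries its reasons, a blocked change says exactly which criterion failed, which is what makes the gate hard to argue with and cheap to satisfy.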

## What to do Monday morning

You do not need to replace your metrics program to start. Instrument one path and let the better signals prove themselves.

1. **Pick one high-consequence path.** Checkout, auth, or a payments dependency. Define what "reachable" means for it using your dependency map, not memory.
2. **Start the escaped-defect ledger.** For your last ten incidents, mark which were escapes your validation passed. That baseline is your real quality number; pass rate was never it.
3. **Write one gate as policy, not a vibe.** If you cannot write the release criterion down in checkable terms, you cannot govern it, and you certainly cannot measure confidence against it.
4. **Measure time-to-confidence on that one path.** Watch it fall as the loop matures. That curve is your ROI story.
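Time-to-confidence for the last step needs only two timestamps per change: when it merged and when a release decision was recorded. The sketch below groups by ISO week and reports the median in hours; the function names, the weekly bucketing, and the sample timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime
from statistics import median


def time_to_confidence_hours(merged_at: datetime, decided_at: datetime) -> float:
    """Hours from merged change to a recorded release decision."""
    return (decided_at - merged_at).total_seconds() / 3600


def median_by_week(records):
    """records: iterable of (merged_at, decided_at) pairs.
    Returns {(iso_year, iso_week): median hours}, keyed by the week of merge."""
    buckets = {}
    for merged_at, decided_at in records:
        iso = merged_at.isocalendar()
        key = (iso[0], iso[1])
        buckets.setdefault(key, []).append(time_to_confidence_hours(merged_at, decided_at))
    return {week: median(hours) for week, hours in sorted(buckets.items())}


# Made-up records for one path over two weeks.
records = [
    (datetime(2025, 10, 6, 9, 0), datetime(2025, 10, 6, 21, 0)),
    (datetime(2025, 10, 7, 9, 0), datetime(2025, 10, 7, 17, 0)),
    (datetime(2025, 10, 13, 9, 0), datetime(2025, 10, 13, 13, 0)),
    (datetime(2025, 10, 14, 10, 0), datetime(2025, 10, 14, 12, 0)),
]

for week, hours in median_by_week(records).items():
    print(week, hours)
```

A median is used rather than a mean so that one change stuck waiting on an approver does not hide an otherwise falling curve.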

For teams that cannot send code or telemetry to a vendor cloud, the same loop and the same evidence run inside your boundary via Edge Runners and signed capsules in a secure enclave, producing an audit-ready record. The metrics do not change; only where they are computed does.

## The bottom line

