Autonomous Reliability

Measuring Quality Intelligence: The Metrics That Actually Predict Reliability

Pass rate predicts nothing. Move SRE teams to reachability-weighted coverage, escaped-defect trends, and confidence-to-release signals that actually hold.

Zof Reliability Team · Engineering & Product

October 28, 2025 · 7 min read · Updated October 28, 2025

01

Why pass rate stopped predicting anything

Pass rate answered a real question in a slower era: of the tests we wrote, how many passed? That was a reasonable proxy for risk when humans wrote most of the code, changes were legible, and the suite roughly tracked the system. None of those conditions hold anymore.

Roughly 41% of codebases are now AI-generated, and around 45% of AI coding tasks introduce critical flaws or security issues. The volume of change and the defect rate climbed together. A pass rate computed against a static suite measures the wrong denominator: it tells you how the tests that exist behaved, not whether the tests that *should* exist for this change were ever run. Green means "nothing we already check broke." It says nothing about the surface this specific diff exposed.

The failure mode is concrete. A diff touches a config three repositories away. CI is green because the cart and checkout suites pass. Nobody asks whether that config is reachable from a revenue-critical path, because pass rate has no concept of reachability. The number went up and to the right while the actual risk moved somewhere the metric cannot see. When the cost of poor software quality is estimated at $2.41 trillion, a large share of that bill is shipped by teams who were watching a healthy-looking dashboard the whole time.

The deeper problem is incentive. A metric that is easy to satisfy gets satisfied. Around 80% of developers bypass policy and guardrails, and a pass-rate gate is among the easiest to game: add shallow tests, skip flaky ones, and the number flatters everyone while predicting nothing.

02

The three signals that actually predict reliability

Reliability is a property of the change against the live system, not an average of the suite. Three metrics track that, and they are the ones to put on the wall.

### 1. Reachability-weighted coverage

Raw coverage counts lines or branches executed. It treats a getter in an internal admin tool the same as a code path one hop from your payment processor. Reachability-weighted coverage asks a sharper question: of the code that is actually reachable from a live entry point, how much is validated?

This reweighting changes behavior immediately. It de-emphasizes coverage theater on dead or low-consequence code and concentrates effort where exploitation is possible. The leverage is large: reachability-based prioritization can mean 70-90% less exploitable exposure to triage, because you stop spending validation budget on paths an attacker or a failure cannot reach. The prerequisite is a live model of what is reachable. A System Graph that maps services, dependencies, and CI/CD is what makes "reachable" a computed fact rather than a tribal guess.
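The calculation can be sketched as follows, assuming the dependency map is available as a plain adjacency dictionary. The graph, the service names, the consequence weights, and the helper functions are hypothetical illustrations, not an actual System Graph API.

```python
from collections import deque


def reachable_from(graph, entry_points):
    """Breadth-first search: every node reachable from any live entry point."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def reachability_weighted_coverage(graph, entry_points, validated, weights=None):
    """Share of reachable, consequence-weighted nodes that are validated."""
    weights = weights or {}
    reachable = reachable_from(graph, entry_points)
    total = sum(weights.get(n, 1.0) for n in reachable)
    if total == 0:
        return 0.0
    covered = sum(weights.get(n, 1.0) for n in reachable if n in validated)
    return covered / total


# Hypothetical dependency graph: caller -> callees.
graph = {
    "checkout": ["cart", "payments"],
    "payments": ["pricing_config"],
    "cart": [],
    "admin_tool": ["report_getter"],  # not reachable from checkout
}
validated = {"checkout", "cart", "admin_tool", "report_getter"}
weights = {"payments": 3.0, "pricing_config": 2.0}  # assumed consequence weights

all_nodes = set(graph) | {d for deps in graph.values() for d in deps}
raw = len(validated & all_nodes) / len(all_nodes)
weighted = reachability_weighted_coverage(graph, ["checkout"], validated, weights)

print(f"raw coverage:      {raw:.2f}")
print(f"weighted coverage: {weighted:.2f}")
```

In this toy graph, raw coverage looks healthy at four of six nodes, while the weighted figure drops to 2/7 because the unvalidated nodes sit on the checkout path.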

### 2. Escaped-defect trend

The honest measure of a quality program is not what it catches but what it lets through. An escaped defect is one your validation passed and production later surfaced. The single number that matters is the trend: are escapes per release falling, flat, or rising as you ship faster?

Track three cuts:

  • Escape rate by reachability tier. An escape on a reachable, high-consequence path is a different event from a cosmetic one. Weight accordingly.
  • Time-in-system before escape. Defects that survive multiple releases before surfacing indicate a blind spot your suite structurally cannot see, not bad luck.
  • Recurrence. A defect class that escapes twice is a process failure. If your postmortems read like reruns, this metric is already telling you so.

Pass rate can hold at 100% while escaped defects climb. When those two lines diverge, believe the escape trend. It is measuring reality; pass rate is measuring your test list.
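A rough sketch of how the three cuts might be computed from an escape ledger. The record layout, the tier weights, and the defect classes are assumptions made for illustration; a real ledger would come from your incident tracker.

```python
from collections import Counter, defaultdict

# Hypothetical ledger rows:
# (release, reachability_tier, defect_class, releases_survived_before_surfacing)
escapes = [
    ("r1", "critical", "config-drift", 1),
    ("r1", "cosmetic", "layout", 0),
    ("r2", "critical", "config-drift", 3),
    ("r3", "critical", "null-deref", 0),
    ("r3", "standard", "timeout", 2),
]
releases = ["r1", "r2", "r3"]  # in ship order
TIER_WEIGHT = {"critical": 5.0, "standard": 2.0, "cosmetic": 0.5}  # assumed weights


def weighted_escapes_per_release(rows):
    """Escape rate by reachability tier: sum tier weights per release."""
    totals = defaultdict(float)
    for release, tier, _cls, _age in rows:
        totals[release] += TIER_WEIGHT[tier]
    return dict(totals)


def long_survivors(rows, min_releases=2):
    """Time-in-system: defects that survived several releases before escaping."""
    return [cls for _r, _t, cls, age in rows if age >= min_releases]


def recurring_classes(rows):
    """Recurrence: defect classes that escaped more than once."""
    counts = Counter(cls for _r, _t, cls, _a in rows)
    return sorted(cls for cls, n in counts.items() if n >= 2)


def slope(series):
    """Least-squares slope over release order; positive means escapes are rising."""
    n = len(series)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


per_release = weighted_escapes_per_release(escapes)
series = [per_release.get(r, 0.0) for r in releases]
print(per_release, slope(series), long_survivors(escapes), recurring_classes(escapes))
```

A positive slope on the weighted series is the divergence signal: pass rate may still read 100% while this number climbs.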

### 3. Confidence-to-release

The release decision is still, for most teams, a meeting: a green pipeline, some anecdotes about what feels risky, and a senior person saying "ship it." That gut call is the most expensive judgment in your delivery process and it runs on the least evidence.

Confidence-to-release replaces the vibe with a signal scoped to the specific change: validated against its real dependency surface, with reachable risk below a stated policy threshold, and the evidence attached. It is not a status on a chart. It is a defensible answer to one question: *is this change safe to release into this system right now, and what is the proof?* The useful leading indicator is time-to-confidence: how long it takes to get from a merged change to a defensible release decision. If that number falls over a quarter, it is the one your VP of Engineering will repeat.
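Time-to-confidence is simple to compute once both timestamps are recorded. The sketch below assumes each change carries a merge time and a release-decision time; the sample records and the choice of the median as the summary statistic are illustrative assumptions.

```python
from datetime import datetime
from statistics import median


def time_to_confidence_hours(changes):
    """Median hours from merged change to release decision."""
    durations = [
        (decided - merged).total_seconds() / 3600
        for merged, decided in changes
    ]
    return median(durations)


# Hypothetical records: (merged_at, release_decision_at)
start_of_quarter = [
    (datetime(2025, 7, 1, 9), datetime(2025, 7, 2, 5)),    # 20 h
    (datetime(2025, 7, 3, 9), datetime(2025, 7, 4, 15)),   # 30 h
    (datetime(2025, 7, 8, 10), datetime(2025, 7, 9, 12)),  # 26 h
]
end_of_quarter = [
    (datetime(2025, 9, 22, 9), datetime(2025, 9, 22, 15)),   # 6 h
    (datetime(2025, 9, 24, 10), datetime(2025, 9, 24, 18)),  # 8 h
    (datetime(2025, 9, 25, 9), datetime(2025, 9, 25, 13)),   # 4 h
]

before = time_to_confidence_hours(start_of_quarter)
after = time_to_confidence_hours(end_of_quarter)
print(f"time-to-confidence: {before:.0f} h -> {after:.0f} h")
```

The median is used here so that one stalled release does not dominate the figure; a percentile such as p90 would be a reasonable alternative if the tail is what worries you.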

03

Why these metrics need a control layer, not another dashboard

Here is the hard part, and the honest objection: you cannot compute any of these three from a dashboard. Reachability weighting needs a live dependency map. Escaped-defect trends need validation that tracks the code instead of decaying behind it. Confidence-to-release needs a policy engine and an audit trail. These are control-plane primitives, not panels you bolt onto observability.

Each signal maps to a mechanism in the closed loop, Understand → Test → Reproduce → Remediate → Verify:

  • Reachability comes from *Understand*. The System Graph makes validation change-aware: it knows the cart service calls payments, and that a config three repos away is reachable from checkout.
  • Coverage that tracks the system comes from *Test*. Testing Fleets plan, execute, observe, and maintain validation as the system evolves. They are coordinated agents, not static scripts, so coverage tracks the code instead of rotting behind it. That is what keeps the escaped-defect trend honest.
  • Confidence-to-release is assembled by Reliability Analytics from accumulated evidence: time-to-validate, reachable-risk trend, remediation cycle time. These are leading indicators an SRE can defend, unlike a coverage percentage that flatters the dashboard.

Remediation belongs in the loop too, and it is the part to be most careful about. When a metric reveals a defect worth fixing, Remediation Fleets propose a scoped fix. They do not silently ship it. Agents propose; humans authorize. Unsupervised autonomous fixing inside a release gate would be reckless; the Governance layer of policy, approval, and audit is the engineering. The fix is re-validated against the same scope before any metric updates.

This also resolves the bypass problem. The 80% bypass rate is a symptom of gates that are subjective or slow. A gate defined as "reachable critical findings = 0, payment-path change requires one named approval" is concrete. "I bypassed it because it felt fine" is no longer an available excuse. Engineers route *through* a fast, specific, evidence-backed gate, not around it.
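The example gate above is concrete enough to write as a function. This is a minimal sketch, not a real policy engine: the `Change` fields and `evaluate_gate` are hypothetical names, and "one named approval" is read here as "at least one named approver".

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """Hypothetical summary of a change as seen by the release gate."""
    touches_payment_path: bool
    reachable_critical_findings: int
    approvers: list = field(default_factory=list)


def evaluate_gate(change):
    """Return (allowed, reasons). Every refusal carries a checkable reason."""
    reasons = []
    if change.reachable_critical_findings != 0:
        reasons.append(
            f"reachable critical findings = {change.reachable_critical_findings}, policy requires 0"
        )
    if change.touches_payment_path and len(change.approvers) < 1:
        reasons.append("payment-path change requires one named approval")
    return (not reasons, reasons)


ok = Change(touches_payment_path=True, reachable_critical_findings=0, approvers=["alice"])
blocked = Change(touches_payment_path=True, reachable_critical_findings=2)

print(evaluate_gate(ok))
print(evaluate_gate(blocked))
```

Returning the reasons alongside the verdict is the design choice that matters: the same output that blocks a release also becomes the audit record for it.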

04

What to do Monday morning

You do not need to replace your metrics program to start. Instrument one path and let the better signals prove themselves.

  • Pick one high-consequence path. Checkout, auth, or a payments dependency. Define what "reachable" means for it using your dependency map, not memory.
  • Start the escaped-defect ledger. For your last ten incidents, mark which were escapes your validation passed. That baseline is your real quality number; pass rate was never it.
  • Write one gate as policy, not a vibe. If you cannot write the release criterion down in checkable terms, you cannot govern it, and you certainly cannot measure confidence against it.
  • Measure time-to-confidence on that one path. Watch it fall as the loop matures. That curve is your ROI story.

For teams that cannot send code or telemetry to a vendor cloud, the same loop and the same evidence run inside your boundary via Edge Runners, signed capsules in a secure enclave that produce an audit-ready record. The metrics do not change; only where they are computed does.

05

The bottom line
