When Should an Agent Defer? Confidence Scoring and Human Authorization for Remediation

A confidence-and-criticality matrix for deciding when an agent auto-applies a fix, waits for approval, or escalates to a human. An SRE's playbook for governed remediation.

Zof Reliability Team · Engineering & Product

February 18, 2025 · 8 min read · Updated February 18, 2025

01

The deferral decision is the product

Most discussion of autonomous remediation focuses on capability: can the agent find the root cause, write the fix, and verify it? Those capabilities are necessary and increasingly solved. The part that decides whether you can actually run remediation in production is governance: a defensible rule for when the agent acts on its own and when it defers to a person.

The stakes are not abstract. Roughly 41% of codebases are now AI-generated, and industry research suggests around 45% of AI coding tasks introduce a critical flaw or security issue. The cost of poor software quality sits near $2.41 trillion. The fixes themselves are increasingly machine-authored, which means the same defect distribution that infected your codebase now infects your remediation. An agent fixing AI-generated code with more AI, unsupervised, is not a control. It is a faster path to the same failure.

So the governing principle is fixed: agents propose, humans authorize. The interesting question is how to make "authorize" cheap and selective rather than a blanket human gate on every fix. The answer is a decision surface that routes each proposed fix by two independent variables: how confident the agent is that the fix is correct, and how much damage it does if the agent is wrong.

02

Two axes, not one

Teams that bolt a single confidence threshold onto an agent learn the hard way that confidence alone is a useless gate. A 95%-confident fix to a feature flag and a 95%-confident fix to your payments authorization path are not the same decision. Confidence tells you how likely the agent is right. Criticality tells you what it costs when it is wrong. You need both, and they are orthogonal.

Criticality is a property of the system, not the change. It comes from what the change touches, computed against a live dependency model rather than the diff. A System Graph that maps services, dependencies, and CI/CD into one change-aware model is what lets you ask the right question: does this fix touch a node that fans out to revenue paths, handles regulated data, or performs an irreversible operation? Criticality is high when the answer is yes, regardless of how small the patch looks. A three-line change to a shared auth library is high-criticality. A six-hundred-line change to an isolated internal tool is not.
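As a minimal sketch of deriving criticality from a dependency model rather than from the diff, the example below walks a small graph outward from the changed components. The graph contents, the `SENSITIVE` set, and the rule that any fan-out counts as medium are all hypothetical choices made for illustration; they are not the System Graph's actual schema or policy.

```python
from collections import deque

# Hypothetical dependency model: each key maps to the components that
# depend on it (its downstream consumers).
DEPENDENTS = {
    "auth-lib": ["checkout-api", "admin-ui"],
    "checkout-api": ["payments-authz"],
    "flag-service": ["admin-ui"],
    "admin-ui": [],
    "payments-authz": [],
    "internal-report-tool": [],
}

# Hypothetical policy list of nodes on revenue, auth, regulated-data,
# or irreversible paths.
SENSITIVE = {"auth-lib", "payments-authz"}


def blast_radius(changed, dependents):
    """Return every node reachable from the changed nodes, inclusive."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def criticality(changed, dependents=DEPENDENTS, sensitive=SENSITIVE):
    """Band a change by what it can reach, ignoring how large the diff is."""
    radius = blast_radius(changed, dependents)
    if radius & set(sensitive):
        return "high"
    # Assumed rule: any fan-out beyond the changed nodes is medium.
    if len(radius) > len(set(changed)):
        return "medium"
    return "low"
```

Note that the function never sees line counts or authorship: under these assumptions a change to `auth-lib` is high regardless of patch size, while a change to `internal-report-tool` stays low.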

Confidence is a property of the evidence behind the fix, not the model's self-reported certainty. This distinction is where most implementations go wrong, so it deserves its own section.

03

Confidence has to be earned, not declared

A language model will happily emit a confidence number. That number is close to worthless as a gate because it is uncalibrated: the model is often most confident exactly where it is wrong. If you wire an agent's self-rated confidence to auto-apply, you have built a system that fast-tracks its own blind spots.

Confidence that can gate a production change has to be assembled from evidence the agent did not author:

  • Reproduction. Was the failing behavior reproduced before the fix, and does the fix make the reproduction pass? A fix for a bug you could not reproduce is a guess.
  • Change-aware validation. Did validation actually exercise the changed paths and their dependents, or did an aggregate suite pass while ignoring the diff? Coordinated Testing Fleets plan and execute validation that is aware of what changed and what depends on it, which is what makes the coverage signal trustworthy rather than decorative.
  • Reachability. For security fixes, is the flaw on a path that is actually reachable in the deployed system? Reachability-based prioritization can mean 70 to 90% less exploitable exposure to triage, and applied here it sharpens the confidence signal: a reachable flaw routes up, an unreachable one does not have to block.
  • Blast-radius agreement. Does the validation evidence cover everything the System Graph says this change can affect? Confidence should fall when coverage and blast radius disagree.

Confidence, in other words, is a computed score over real artifacts. When the evidence is thin, the score is low, and the matrix sends the fix to a human. That is the design working as intended, not a failure.
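One way to picture a computed score over artifacts is the sketch below. The `Evidence` fields, the 0.9 and 0.6 thresholds, and the choice to take the weakest coverage signal are assumptions for illustration only; reachability is left out to keep the example short. The model's self-reported score is carried for audit but deliberately never read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    reproduced_before_fix: bool    # failing behavior was reproduced
    repro_passes_after_fix: bool   # the same reproduction passes with the fix
    changed_paths_covered: float   # fraction of changed paths exercised, 0..1
    blast_radius_covered: float    # fraction of affected nodes validated, 0..1
    model_self_score: float = 0.0  # kept for audit, never used for gating


def confidence_band(ev: Evidence) -> str:
    """Band a fix from external evidence only. Thresholds are assumptions."""
    # Without a reproduction the fix is a guess, so it stays in the low band.
    if not (ev.reproduced_before_fix and ev.repro_passes_after_fix):
        return "low"
    # The weakest coverage signal bounds the score, so coverage that
    # disagrees with the blast radius pulls confidence down.
    weakest = min(ev.changed_paths_covered, ev.blast_radius_covered)
    if weakest >= 0.9:
        return "high"
    if weakest >= 0.6:
        return "medium"
    return "low"
```

For example, `Evidence(True, True, 1.0, 0.3, model_self_score=0.99)` lands in the low band: the changed paths were fully exercised, but most of the blast radius was not, and the model's own certainty does not enter the calculation.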

04

The matrix

Cross the two axes and you get the decision surface. Keep it coarse. Three bands on each axis is enough, and more bands invite false precision.

| | Low criticality | Medium criticality | High criticality |
|---|---|---|---|
| High confidence | Auto-apply, logged | Auto-apply with rollback armed | Single approver |
| Medium confidence | Auto-apply with rollback armed | Single approver | Escalate |
| Low confidence | Single approver | Escalate | Escalate, never auto |

Three things make this matrix honest rather than decorative.

First, the high-criticality column never auto-applies, at any confidence. Payments, authentication and authorization, regulated data, and irreversible operations always cost a human authorization. Confidence can make that authorization a single fast approval instead of a change-advisory ritual, but it cannot remove it. This is the line that keeps the system governed.

Second, auto-apply is always paired with an armed rollback and a recorded verdict. Auto-apply does not mean fire-and-forget. It means the agent applies, verifies in production behavior, and holds a one-action reversal ready. The absence of a human in the path raises the bar on evidence and reversibility; it does not lower it.

Third, the agent never sets its own cell. Criticality is derived from the graph and policy. Confidence is computed from evidence. The agent cannot relabel a high-criticality change as medium to skip the gate, because it does not own either axis. This is the difference between governance and theater.
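The matrix and its first rule can be expressed as a small lookup, sketched below. The names `Action`, `MATRIX`, `check_matrix`, and `route` are hypothetical, and failing closed to escalation on an unrecognized band is an added assumption rather than something the matrix specifies. The bands are passed in by the caller, which mirrors the point that the agent owns neither axis.

```python
from enum import Enum


class Action(Enum):
    AUTO_APPLY_LOGGED = "auto-apply, logged"
    AUTO_APPLY_ROLLBACK = "auto-apply with rollback armed"
    SINGLE_APPROVER = "single approver"
    ESCALATE = "escalate"


AUTO = {Action.AUTO_APPLY_LOGGED, Action.AUTO_APPLY_ROLLBACK}

# (confidence, criticality) -> action, copied cell by cell from the matrix.
# "Escalate, never auto" is represented as ESCALATE.
MATRIX = {
    ("high", "low"): Action.AUTO_APPLY_LOGGED,
    ("high", "medium"): Action.AUTO_APPLY_ROLLBACK,
    ("high", "high"): Action.SINGLE_APPROVER,
    ("medium", "low"): Action.AUTO_APPLY_ROLLBACK,
    ("medium", "medium"): Action.SINGLE_APPROVER,
    ("medium", "high"): Action.ESCALATE,
    ("low", "low"): Action.SINGLE_APPROVER,
    ("low", "medium"): Action.ESCALATE,
    ("low", "high"): Action.ESCALATE,
}


def check_matrix(matrix=MATRIX):
    """Reject any policy edit that lets a high-criticality cell auto-apply."""
    for (confidence, criticality), action in matrix.items():
        if criticality == "high" and action in AUTO:
            raise ValueError(f"high criticality may not auto-apply: {confidence}")


check_matrix()  # runs at import so a bad policy fails before any routing


def route(confidence: str, criticality: str) -> Action:
    # Assumption: unknown bands fail closed to escalation.
    return MATRIX.get((confidence, criticality), Action.ESCALATE)
```

Running `check_matrix` at load time turns the "never auto-applies" rule into an invariant of the configuration itself instead of a convention reviewers have to remember.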

05

Failure modes to design against

The matrix introduces its own ways to fail. Name them so your implementation defends against each.

  • Calibration drift. If the confidence score does not track real correctness, the whole surface misroutes. Sample auto-applied fixes for post-hoc human review and recalibrate the bands against observed outcomes. Treat confidence calibration as an ongoing measurement, not a constant.
  • Criticality staleness. If the dependency map drifts from reality, a high-criticality change gets scored as medium. The graph has to be live and continuously reconciled, or the criticality axis quietly rots.
  • Escalation as a dumping ground. If too much lands in escalate, humans tune out and the gate becomes a rubber stamp. The fix is better evidence, not a lower bar. Most of your volume should be earning its way into auto-apply through reproduction and change-aware validation.
  • Coverage laundering. A fix shows "tests passed" while validation never touched the changed path. Confidence must read coverage of the change, not aggregate suite status.
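The calibration-drift check above can be sketched as a comparison between each confidence band and the correctness humans observed when they sampled fixes from it. The `FLOOR` values and the shape of the review records are assumptions for illustration; real floors would come from your own outcome data.

```python
from collections import defaultdict

# Assumed minimum observed correctness per band; tune against real outcomes.
FLOOR = {"high": 0.95, "medium": 0.80}


def observed_correctness(reviews):
    """reviews: iterable of (confidence_band, was_correct) from post-hoc review."""
    totals = defaultdict(lambda: [0, 0])  # band -> [correct, reviewed]
    for band, was_correct in reviews:
        totals[band][0] += int(was_correct)
        totals[band][1] += 1
    return {band: correct / n for band, (correct, n) in totals.items()}


def drifted_bands(reviews, floor=FLOOR):
    """Return bands whose observed correctness fell below the assumed floor."""
    rates = observed_correctness(reviews)
    return sorted(b for b, rate in rates.items() if rate < floor.get(b, 0.0))
```

A band that shows up in `drifted_bands` is a signal to tighten its evidence thresholds or pause auto-apply for it until the score tracks correctness again.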

Where remediation runs inside a customer boundary or a regulated enclave, the evidence requirement is stricter. Edge Runners execute as signed capsules inside the boundary and emit audit-ready evidence outward, so the matrix decision and its supporting artifacts survive a compliance review instead of living in a CI log someone can edit. All of this is where Governance lives as first-class configuration rather than tribal knowledge, and where Remediation Fleets execute against the policy rather than around it.

06

What to do Monday morning

You do not need to deploy autonomous remediation to start building the decision surface. Build the matrix before you arm the agent.

  1. Write your high-criticality list explicitly. Auth, payments, regulated data, irreversible operations. This is the column that never auto-applies. Be conservative here and nowhere else.
  2. Define what "confidence" means in artifacts. Decide the concrete evidence a fix must carry to count as high confidence: reproduction, change-aware coverage, reachability, blast-radius agreement. Reject self-reported model scores.
  3. Wire criticality to the graph, not the diff. Replace file-count and author heuristics with blast-radius signals from your dependency model.
  4. Shadow-run before you arm. Have the agent classify and propose without applying for two weeks. Compare its matrix cells against what a senior engineer would have decided. Calibrate, then turn on auto-apply for the safe corner only.

Each step moves human attention off the fixes that never needed it and concentrates it on the genuinely dangerous few.
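For the shadow-run step, a simple report like the sketch below may be enough to start calibrating. The action labels, their strictness ordering, and the record format are hypothetical. The figure worth watching is less the raw agreement rate than the cases where the agent would have been more permissive than the engineer.

```python
# Hypothetical ordering from least to most human involvement.
STRICTNESS = {
    "auto-apply": 0,
    "auto-apply+rollback": 1,
    "single approver": 2,
    "escalate": 3,
}


def shadow_report(records):
    """records: list of (agent_action, engineer_action) from the shadow run."""
    agree = sum(1 for agent, engineer in records if agent == engineer)
    # Disagreements where the agent chose a weaker gate than the engineer.
    too_permissive = [
        (agent, engineer)
        for agent, engineer in records
        if STRICTNESS[agent] < STRICTNESS[engineer]
    ]
    return {
        "total": len(records),
        "agreement": agree / len(records) if records else 0.0,
        "too_permissive": too_permissive,
    }
```

Disagreements in the other direction, where the agent escalated something an engineer would have approved quickly, cost attention rather than safety and point to missing evidence rather than a bar that is too high.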

07

The bottom line
