Autonomous Reliability

Control Plane vs Dashboard: Why Visibility Is Not Control

Dashboards show you reliability problems. A control plane authorizes, gates, and acts on them. Here's the architectural line every SRE should draw.

Zof Reliability Team · Engineering & Product

November 5, 2025 · 7 min read · Updated November 5, 2025

01

The category confusion: observe, decide, act

Every reliability system does some subset of three things: it observes state, it decides what that state means, and it acts to change it. Most platforms sold as "reliability platforms" do only the first, and partially the second.

A dashboard observes. It ingests signals, computes derived metrics, and renders them. A good one also encodes a thin layer of decision: thresholds, burn-rate alerts, anomaly flags. But the action boundary is where it stops. The dashboard fires a page; a human decides; a human acts, usually through a different system entirely. The control logic lives in someone's head and in a runbook that may or may not be current.

A control plane closes that loop by design. It observes, it decides against explicit policy, and it acts within authorized bounds, then verifies the result. The distinction is not cosmetic. An observability tool that adds a "remediate" button is still an observability tool with a button, because the authority model, the policy engine, and the audit trail were never first-class. Control is an architecture, not a feature you bolt on.

Here is the line, stated plainly:

  • A dashboard answers "what is happening?" Its output is a human's attention.
  • A control plane answers "what is allowed to happen, and what should I do about it?" Its output is a governed action with evidence.

02

Why visibility plateaus

Visibility has diminishing returns, and the curve bends earlier than most teams expect. The first dashboard catches the obvious fires. The tenth dashboard mostly adds cognitive load. You can see this in the failure modes that visibility-only stacks share.

Alert fatigue is a control failure wearing an observability costume. When every signal routes to a human and no signal can be acted on automatically within policy, the only scaling lever is more humans staring at more screens. The signal-to-noise ratio degrades because the system has no way to resolve a known-safe condition on its own. Detection improves; resolution does not.
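The routing difference can be sketched in a few lines. This is an illustration only, not any product's API: the `Signal` type, the `KNOWN_SAFE` policy table, and the signal and action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str      # hypothetical signal name, e.g. "disk_pressure"
    service: str


# Hypothetical policy: conditions the system may resolve on its own,
# each mapped to a pre-approved action.
KNOWN_SAFE = {
    "disk_pressure": "rotate_logs",
    "stuck_worker": "restart_worker",
}


def route_observe_only(signals):
    """Visibility-only stack: every signal becomes a page."""
    return [("page_human", s) for s in signals]


def route_with_policy(signals):
    """Control-plane style: act on known-safe conditions, page on the rest."""
    routed = []
    for s in signals:
        action = KNOWN_SAFE.get(s.name)
        routed.append((action, s) if action else ("page_human", s))
    return routed


signals = [
    Signal("disk_pressure", "search"),
    Signal("stuck_worker", "billing"),
    Signal("error_rate_spike", "auth"),
]
pages_before = sum(1 for a, _ in route_observe_only(signals) if a == "page_human")
pages_after = sum(1 for a, _ in route_with_policy(signals) if a == "page_human")
print(pages_before, pages_after)  # 3 1
```

Detection is identical in both functions. Only the second has somewhere to send a known-safe condition other than a person.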

Knowing the blast radius does not contain it. Suppose your traces clearly show that a config change to an auth service is degrading three downstream services. The dashboard renders this beautifully. But rendering is not rollback. The containment still depends on a human reading the graph correctly, under pressure, and executing the right action through a separate tool, with no enforced check that the action is permitted or that it worked.
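Computing the blast radius is the cheap part, which is why rendering it should not be mistaken for containing it. A minimal sketch, assuming a hypothetical `DEPENDS_ON` map (the service names are invented for illustration):

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "auth"],
    "payments": ["auth", "ledger"],
    "notifications": ["auth"],
    "ledger": [],
    "auth": [],
    "search": [],
}


def affected_by(changed, depends_on):
    """Return the changed service plus everything that transitively depends on it."""
    # Invert the edges so we can walk from a service to its dependents.
    dependents = {svc: set() for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return seen


print(sorted(affected_by("auth", DEPENDS_ON)))
# ['auth', 'checkout', 'notifications', 'payments']
```

The output is a set of names. Nothing in it rolls anything back, checks that a rollback is permitted, or confirms that it worked.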

Visibility cannot enforce policy on what ships. This is the part that should alarm any SRE. Roughly 41% of code is now reported to be AI-generated, and industry research puts the rate at which AI coding tasks introduce critical flaws or security issues near 45%. Meanwhile, an estimated 80% of developers bypass policy and guardrails when those guardrails are advisory. A dashboard observing a risky release is a spectator. It has no authority to gate it. You can watch the security debt accumulate in real time and remain structurally unable to stop it. The cost of poor software quality, estimated at $2.41 trillion in the US alone, is in large part the bill for systems that could see problems but could not act on them.

The plateau, then, is not a maturity stage you grow out of by buying a better dashboard. It is the ceiling of the observe-only architecture.

03

What a control plane actually adds

A control plane is defined by four properties that dashboards structurally lack. The governing principle throughout is "agents propose, humans authorize": autonomy is real, but it is bounded by policy and accountable to a person.

1. A change-aware model of the system. You cannot gate what you cannot reason about. A control plane needs a live dependency and context map of services, dependencies, and CI/CD so that any proposed action is evaluated against current reality, not a stale diagram. In Zof's architecture this is the System Graph, and it is what makes validation change-aware rather than running the same static suite regardless of what actually moved.

2. Validation that is an action, not a report. Instead of static scripts that rot as the system evolves, Testing Fleets plan, execute, observe, and maintain validation as the system changes. The output is not a coverage number on a chart. It is a verdict the control plane can act on.

3. Governed remediation. Remediation is the hardest and most consequential part of the loop, which is exactly why it must be the most governed. Unsupervised autonomous fixing is reckless; the engineering is in the Governance layer of policy, approval, and audit. Remediation Fleets propose fixes, governance decides whether and how they execute, and every step is attributable. A serious enterprise does not want more AI acting on its production systems. It wants control over what that AI is allowed to do.

4. Evidence as a first-class output. A control plane produces an audit-ready record of what was proposed, what was authorized, who authorized it, what executed, and whether verification passed. That record is the difference between "we think it's fixed" and "we can prove the change was safe."
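To make "agents propose, humans authorize" and the evidence record concrete, here is a minimal sketch. It is not Zof's implementation: `govern`, `EvidenceRecord`, the policy sets, and the example names are all hypothetical, and a real policy engine would be far richer than two sets.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical policy inputs; a real system would load these from configuration.
SENSITIVE_SERVICES = {"payments", "auth"}
FORBIDDEN_ACTIONS = {"drop_table"}


@dataclass
class EvidenceRecord:
    proposed: str                               # what was proposed
    proposed_by: str                            # which agent proposed it
    authorized: bool = False                    # whether it was authorized
    authorized_by: Optional[str] = None         # who (or which policy) authorized it
    executed: bool = False                      # whether it ran
    verification_passed: Optional[bool] = None  # None until verification runs


def govern(service, action, proposed_by, execute, verify, approver=None):
    """Run `execute` only if policy allows it; always return a record."""
    record = EvidenceRecord(proposed=f"{action} on {service}", proposed_by=proposed_by)
    if action in FORBIDDEN_ACTIONS:
        return record                  # denied outright, still recorded
    if service in SENSITIVE_SERVICES:
        if approver is None:
            return record              # waiting on a human; nothing executes
        record.authorized_by = approver
    else:
        record.authorized_by = "policy:auto-approve"
    record.authorized = True
    execute()
    record.executed = True
    record.verification_passed = bool(verify())
    return record


applied = []
fix = dict(
    service="payments",
    action="pin_dependency",
    proposed_by="remediation-agent",
    execute=lambda: applied.append("pin_dependency"),
    verify=lambda: True,
)

pending = govern(**fix)
print(pending.authorized, applied)  # False []

done = govern(**fix, approver="alice@example.com")
print(json.dumps(asdict(done), indent=2))
```

The design choice worth noticing is that the unauthorized path still returns a record. The audit trail covers what was refused or left pending, not only what ran.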

04

The closed loop, concretely

The operating model of a control plane is a loop, not a feed: Understand → Test → Reproduce → Remediate → Verify. A dashboard gives you, at best, the first stage and a notification. The loop is what converts a detected problem into a governed, verified resolution.

Walk it through a hypothetical. Consider a fintech SaaS team that ships a dependency bump on a payments service.

  • Understand. The System Graph identifies which downstream services and CI paths the change touches.
  • Test. Testing Fleets validate the affected surfaces, not a fixed suite, and surface a behavioral regression in idempotency handling.
  • Reproduce. The condition is reproduced deterministically, so the team is debugging a fact, not a theory.
  • Remediate. A Remediation Fleet proposes a scoped fix. Because this is a payments path, policy routes it for human authorization before anything executes.
  • Verify. Post-change validation confirms the regression is gone and nothing adjacent broke, with evidence attached.

A dashboard would have shown the team the error rate climbing after release. The control plane prevents the release from reaching that state, or contains and resolves it under policy, and proves it. Note where the human sits: not babysitting every step, but holding authority at the one decision that genuinely warrants it. Reliability becomes the default, with oversight reserved for genuine risk. This is also how prioritization gets honest: reachability-based analysis can mean 70-90% less exploitable exposure, because you act on what is actually reachable in the live graph instead of triaging a flat list of findings.
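The "loop, not a feed" shape can be sketched as an ordered pipeline that halts at the first stage that cannot proceed. The stage names come from the loop above; the handlers and the context keys are hypothetical stand-ins for real work.

```python
STAGES = ("understand", "test", "reproduce", "remediate", "verify")


def run_loop(handlers, ctx):
    """Run stages in order; stop at the first one that cannot proceed.

    Returns a trail of (stage, proceeded) pairs for the stages that ran.
    """
    trail = []
    for stage in STAGES:
        proceed = handlers[stage](ctx)
        trail.append((stage, proceed))
        if not proceed:
            break
    return trail


# Hypothetical handlers: each returns True when the loop may move on.
handlers = {
    "understand": lambda ctx: bool(ctx["affected"]),
    "test": lambda ctx: ctx["regression"] is not None,
    "reproduce": lambda ctx: ctx["reproducible"],
    "remediate": lambda ctx: ctx["approved_by"] is not None,
    "verify": lambda ctx: ctx["post_checks_pass"],
}

base = {
    "affected": ["payments", "checkout"],
    "regression": "idempotency handling",
    "reproducible": True,
    "approved_by": None,
    "post_checks_pass": True,
}

waiting = run_loop(handlers, base)
resolved = run_loop(handlers, {**base, "approved_by": "alice@example.com"})

print([stage for stage, _ in waiting])
# ['understand', 'test', 'reproduce', 'remediate']
print(len(resolved), all(ok for _, ok in resolved))  # 5 True
```

In the first run the loop stops at `remediate` and `verify` never executes. That is the payments-path behavior from the walkthrough: nothing changes until a person authorizes it.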

05

What to do Monday morning

You do not need a rip-and-replace to test the thesis. You need to find one place where your stack sees but cannot act.

  1. Audit your action boundary. For your last five incidents, mark where the system stopped at "notified a human" and where it could have acted within policy. That gap is your plateau.
  2. Find one advisory guardrail and ask if it is enforceable. If a check is a wiki page or a non-blocking CI warning, it is being bypassed. Make one gate executable and unavoidable.
  3. Pick a low-risk remediation to govern, not automate blindly. Choose a class of fix where you would trust a proposal but want authorization. That is the shape of governed autonomy.
  4. Demand evidence, not dashboards, from one workflow. Require that a single release decision produce an audit-ready record of what was checked and authorized.
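For step 2, the difference between an advisory check and an enforceable one often comes down to the exit code. A minimal sketch, with a hypothetical `run_gate` helper and invented check names:

```python
def run_gate(checks, enforce):
    """Evaluate checks and return a process-style exit code (0 = may proceed)."""
    failures = [name for name, passed in checks.items() if not passed]
    label = "BLOCKED" if enforce else "warning"
    for name in failures:
        print(f"{label}: {name} failed")
    return 1 if (enforce and failures) else 0


# Hypothetical check results for one release candidate.
checks = {"unit_tests": True, "dependency_scan": False}

advisory_code = run_gate(checks, enforce=False)  # prints a warning, returns 0
enforced_code = run_gate(checks, enforce=True)   # prints BLOCKED, returns 1
print(advisory_code, enforced_code)  # 0 1
```

In a CI job, passing the return value to `sys.exit` is what turns the check into a gate, since a nonzero exit status fails the step. The advisory variant reports the same failure and lets the release through anyway.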

If you want the longer argument on why this matters now, the "AI code testing imperative" whitepaper makes the case, and "How it works" shows the loop end to end.

06

The bottom line
