AI test generation alone is not enough

What enterprises need is not simply more tests, but context, execution, telemetry, governance, and remediation.

Zof Reliability Team · Engineering & Product

May 11, 2026 · 11 min read · Updated May 19, 2026

02

Where generation genuinely helps

Generation earns its place at the start of a workflow, when a human still owns the judgment and a maintained system owns what happens next. Used that way, it removes real toil.

Good fits for generation

  • Bootstrapping API contract tests from a schema or spec
  • Drafting edge cases a human might overlook on a first pass
  • Translating acceptance criteria into executable sketches
  • Seeding a new service with a baseline suite before fleets take over
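
As a rough illustration of the first bullet, the sketch below drafts one contract case per path, method, and status code from a spec. The `SPEC` fragment is a hypothetical, heavily simplified stand-in for a real OpenAPI document (a real spec nests required fields under `content` and `schema`), and `draft_contract_cases` and `check_response` are illustrative names, not part of any tool described here.

```python
from typing import Any

# Hypothetical, heavily trimmed OpenAPI-style fragment used only for illustration.
SPEC: dict[str, Any] = {
    "paths": {
        "/payments": {
            "post": {
                "responses": {
                    "201": {"required": ["id", "status"]},
                    "400": {"required": ["error"]},
                }
            }
        },
        "/payments/{id}": {
            "get": {"responses": {"200": {"required": ["id", "status", "amount"]}}}
        },
    }
}


def draft_contract_cases(spec: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one draft contract case per (path, method, status code)."""
    cases = []
    for path, methods in sorted(spec["paths"].items()):
        for method, operation in sorted(methods.items()):
            for status, response in sorted(operation["responses"].items()):
                cases.append(
                    {
                        "name": f"{method.upper()} {path} -> {status}",
                        "required_fields": list(response.get("required", [])),
                        "reviewed": False,  # drafts start unreviewed; a human owns the judgment
                    }
                )
    return cases


def check_response(case: dict[str, Any], body: dict[str, Any]) -> list[str]:
    """Return the required fields that are missing from a response body."""
    return [field for field in case["required_fields"] if field not in body]
```

Each draft starts with `reviewed` set to `False`, which mirrors the point above: generation produces the sketch, and a person decides whether it belongs in the suite.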

03

Where generation fails

Generated tests drift the moment the system changes. They cannot prioritize without a System Graph to tell them what changed and what depends on it. They do not choose safe environments, respect data policy, or produce audit-grade evidence by default. They flag failures; they do not reproduce, remediate, or verify them.

Without governance and maintenance around it, generation becomes another source of CI noise, a suite that turns red for reasons no one trusts, until the team learns to ignore it. This is the same pattern that made static scripts a liability, explored in Testing Fleets, not test scripts.

04

The real problem is operating reliability, not authoring tests

According to Zof's research, AI-generated code now accounts for roughly 41% of codebases, and the volume of change has outrun the suites meant to validate it. The cost center has moved from "writing tests" to "keeping validation accurate under continuous change." Generation addresses the first; it is silent on the second.

Reliability is operated, not authored once. That means deciding what to validate for a given change, executing inside human-defined boundaries, interpreting evidence in context, and closing the loop when something breaks. We make the full version of this argument in The AI code testing imperative.

Generation answers "can we write a test for this?" The enterprise question is "is the right thing still being validated after the system changed 400 times?"

05

A worked example: day 30

Consider a payments service. On day zero, a generation tool drafts 220 tests from the OpenAPI spec and acceptance criteria. They pass, coverage looks healthy, and the suite is merged.

Over the next thirty days the team ships 400 changes: a renamed field, a new idempotency requirement, a refactored retry path, a third-party webhook migration. The generated suite has no map of any of it. Some tests now assert against fields that no longer exist and fail loudly. Others still pass while silently validating dead code paths. Nobody can say which 40 of the 220 actually matter for the change that shipped this morning.
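
The loud half of that drift, tests asserting against fields that no longer exist, can be detected mechanically. The sketch below assumes two hypothetical inputs: a map from test name to the response fields it references, and the set of fields in the live schema. It does not detect the silent half (tests that still pass over dead code paths), which needs coverage or change-impact data.

```python
def find_stale_tests(
    test_fields: dict[str, set[str]], live_fields: set[str]
) -> dict[str, list[str]]:
    """Map each test to the referenced fields that are absent from the live schema."""
    stale = {}
    for name, fields in sorted(test_fields.items()):
        missing = sorted(fields - live_fields)
        if missing:
            stale[name] = missing
    return stale


# Hypothetical data: a field was renamed from txn_ref to transaction_id.
generated_tests = {
    "test_capture": {"amount", "txn_ref"},
    "test_refund": {"amount"},
}
current_schema = {"amount", "transaction_id"}
stale_tests = find_stale_tests(generated_tests, current_schema)
```

With this data, only `test_capture` is reported, because it still references the renamed `txn_ref` field.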

Generation alone versus a maintained fleet on day 30

Day 0    220 generated tests, all green
            │
Day 30   400 changes shipped
            │
  Generation only ──► drift: false reds + silent false greens
            │
  System Graph + Fleets ──► run the 40 that matter, retire the noise,
                            attach evidence to the change that ran them
The gap is not authoring. It is what happens after the system moves.
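
To make "run the 40 that matter" concrete, here is a minimal sketch of change-impact test selection over a dependency graph. It is a toy stand-in, not the System Graph itself: the `DEPENDENTS` edges, the `TEST_COVERS` mapping, and every component and test name are hypothetical.

```python
from collections import deque

# Hypothetical edges: component -> components that depend on it.
DEPENDENTS = {
    "retry_policy": ["payment_client"],
    "payment_client": ["checkout_api"],
    "webhook_parser": ["reconciliation_job"],
    "checkout_api": [],
    "reconciliation_job": [],
}

# Hypothetical mapping of tests to the components they exercise.
TEST_COVERS = {
    "test_retry_backoff": {"retry_policy"},
    "test_checkout_happy_path": {"checkout_api"},
    "test_webhook_signature": {"webhook_parser"},
    "test_reconciliation_totals": {"reconciliation_job"},
}


def impacted_components(changed: set[str]) -> set[str]:
    """Breadth-first walk from the changed components to everything downstream."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in DEPENDENTS.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def select_tests(changed: set[str]) -> list[str]:
    """Return the tests that touch any changed or downstream component."""
    impacted = impacted_components(changed)
    return sorted(name for name, covers in TEST_COVERS.items() if covers & impacted)
```

A change to `retry_policy` selects `test_checkout_happy_path` and `test_retry_backoff` and skips the two webhook-related tests, because nothing on the webhook side depends on the retry path in this toy graph.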

06

The missing pieces: context, execution, telemetry, governance, remediation

Generation tool versus a reliability control plane
Capability   | Generation tool                    | ARI platform
-------------|------------------------------------|--------------------------------------------------------
What to test | Heuristic or prompt at author time | Change impact and risk scored on the System Graph
Execution    | Often local or CI-only             | Governed fleets plus enclave and edge runners
Telemetry    | Pass or fail                       | Artifacts, traces, failure signatures, analytics
Governance   | Minimal or CI permissions          | Policy, RBAC, approval, audit
Remediation  | None                               | Governed remediation fleets, staging-first, human-approved

The columns are not competitors. Generation becomes one input on the left that a control plane consumes on the right. The point is that authoring is a single step in a loop, not the loop itself.

07

"Our generator already maintains its tests"

The common objection is that modern generators self-heal: they re-run, detect a broken selector or assertion, and rewrite it. This helps with surface-level brittleness. It does not solve the harder problem, because self-healing optimizes for keeping a test green, not for keeping it correct.

A test that rewrites its own assertion to match changed behavior can paper over a regression instead of catching it. Maintenance has to be anchored to a model of what the system is supposed to do and what changed, which is what the System Graph provides. Without that anchor, "self-healing" can quietly erode the coverage it claims to preserve.
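
One way to picture that anchor is a guard that lets maintenance rewrite an assertion only when the new value matches a declared, intended change. This is a minimal sketch under stated assumptions: the `intended_changes` mapping stands in for whatever model of intended behavior is available, and `Assertion` and `propose_heal` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    field: str
    expected: object


def propose_heal(
    assertion: Assertion, observed: object, intended_changes: dict[str, object]
) -> tuple[str, Assertion]:
    """Decide whether a failing assertion may be rewritten or must be flagged."""
    if observed == assertion.expected:
        return "keep", assertion
    # Rewrite only when the observed value matches a change declared as intended.
    if (
        assertion.field in intended_changes
        and intended_changes[assertion.field] == observed
    ):
        return "rewrite", Assertion(assertion.field, observed)
    # Anything else is treated as a possible regression, not healed away.
    return "flag_regression", assertion
```

A naive self-healer would take the rewrite branch for every mismatch. Here an undeclared change in `amount` from 100 to 0 comes back as `flag_regression` with the original assertion intact.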

08

Why enterprises need a control plane

A control plane coordinates agents, policies, evidence, and integrations across the whole reliability loop. Generation becomes one capability inside it, not the product. Testing Fleets maintain validation as the system changes, Remediation Fleets propose fixes that humans authorize, and the governance layer keeps every action policy-bound and auditable.

Procurement should score vendors on operated reliability outcomes, not on lines of generated code. The decision is closer to a build-versus-buy question for reliability infrastructure than a tooling purchase, which we cover in Build versus buy test automation.

09

How to evaluate a generation-first vendor

If a tool leads with generation, the right questions are about everything that happens after the draft. Use this checklist in a proof of concept rather than a slide review.

Questions to put in the POC

  1. Does it know what changed in this PR, and can it explain why each test ran?
  2. Where does execution happen, and can it respect environment and data policy?
  3. What evidence does a run attach to the change that triggered it?
  4. When maintenance rewrites a test, what prevents it from masking a regression?
  5. On a failure, does it stop at a signal, or does it reproduce and propose a governed fix?
  6. Who approves a remediation, and is every action in the audit trail?

10

How autonomous reliability infrastructure closes the gap

Autonomous reliability infrastructure connects generation, where it is useful, to maintained fleets, graph context, telemetry, and optional governed remediation. Tests become assets in an operated system, not disposable drafts. The loop runs Understand, Test, Reproduce, Remediate, Verify, with humans setting boundaries at every gate.
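
The shape of that loop can be sketched as a staged pipeline with an approval check before every stage. Only the five stage names come from the text; the `run_loop` function, the `staging_only` policy, and the no-op handlers are hypothetical and say nothing about how any real platform implements its gates.

```python
from typing import Callable

STAGES = ["Understand", "Test", "Reproduce", "Remediate", "Verify"]

# Placeholder handlers that pass the context through unchanged.
NOOP_HANDLERS = {stage: (lambda ctx: ctx) for stage in STAGES}


def staging_only(stage: str, context: dict) -> bool:
    """Example policy: remediation may run only against staging."""
    return stage != "Remediate" or context.get("environment") == "staging"


def run_loop(
    handlers: dict[str, Callable[[dict], dict]],
    approve: Callable[[str, dict], bool],
    context: dict,
) -> dict:
    """Run stages in order, stopping at the first gate that is not approved."""
    audit = []
    for stage in STAGES:
        if not approve(stage, context):
            audit.append((stage, "blocked"))
            break
        context = handlers[stage](context)
        audit.append((stage, "done"))
    return {"context": context, "audit": audit}
```

Run against a context with `environment` set to `"production"`, the loop completes Understand, Test, and Reproduce, then records Remediate as blocked and stops. Every decision lands in the `audit` list.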

The result is validation that stays proportional to risk as the system moves. For a deeper treatment of the operating model, see the autonomous reliability infrastructure guide.

11

Final takeaway

AI test generation is a feature. Enterprise reliability is a platform. The hard part was never writing the first version of a test; it is keeping the right things validated, safely and auditably, while the system changes underneath you.

Evaluate tools on closed-loop outcomes operated over time (escaped defects, reproduction time, maintenance load, and evidence quality), not on demo-day velocity. If a vendor cannot answer the day-30 question, generation is all you are buying.

Frequently asked questions

No. Generation is a useful first step for bootstrapping suites, drafting edge cases, and translating acceptance criteria into executable sketches. The point is that authoring is one input into an operated reliability loop, not the whole product. Zof uses generation where it helps and then maintains, governs, and remediates around it.

Why AI test generation alone is not enough | Zof AI Blog