Security and governance

We Verified What an AI Coding Agent Shipped for Two Weeks. The Loop Caught What Review Missed.

A case-study walkthrough of running the Understand-Test-Reproduce-Remediate-Verify loop on two weeks of AI-generated commits, and the defects it caught that PR review missed.


Zof Reliability Team · Engineering and Product

August 13, 2025 · 8 min read · Updated August 13, 2025

Summary

PR review was built for code a colleague wrote and could explain. It is a poor fit for code an agent generated at 3 a.m. across nine files, with a description that reads plausibly and a diff that compiles cleanly. Reviewers approve the diff in front of them; they rarely re-derive the system behavior behind it. This is a walkthrough of what happens when you stop trusting the diff and start verifying the change, applied to two weeks of AI-authored commits in a hypothetical retail codebase.

The setup is deliberately ordinary. Consider a mid-market e-commerce team running a coding agent inside their normal workflow: it picks up tickets, opens PRs, and a human approves before merge. Over two weeks it shipped roughly forty merged changes touching checkout, catalog search, and the promotions engine. Every one passed human review. Every one passed CI. We then ran each merged change back through a closed reliability loop (Understand, Test, Reproduce, Remediate, Verify) to see what review and green pipelines had missed.


## Why green CI and an approved PR are not the same as a verified change

Industry research now puts AI-generated code at roughly 41% of codebases, and finds that about 45% of AI coding tasks introduce a critical flaw or security issue. Those two numbers compound. When nearly half your code is machine-written and nearly half of those tasks carry a serious defect, the defect rate is no longer an edge case you catch by reading carefully. It is the base rate.

The failure is structural, not a matter of reviewer diligence. A human reviewer evaluates intent and local correctness: does this function do what the description says, and is the logic sound in isolation? An AI agent produces code that is locally plausible and globally wrong: it satisfies the ticket while quietly violating an invariant three services away that no one thought to mention in the diff. CI compounds the gap, because most suites only assert the behaviors someone already thought to test. New code that breaks an untested path stays green. And when roughly 80% of developers admit to bypassing policy and guardrails under deadline pressure, the controls meant to catch this are applied least consistently exactly when changes are moving fastest.

So the question for a test lead is not "did a human look at it." It is "did anything prove this change is safe against the system it actually landed in." That is what the loop is for.

## The loop, applied to a merged change

The Understand-Test-Reproduce-Remediate-Verify loop is not a linear pipeline you run once. It is a control cycle you run per change, and each stage exists to defeat a specific way that AI-generated code lies.

- Understand establishes what the change actually touches, not what the PR says it touches.
- Test validates against the real surface of the system, including paths the author never considered.
- Reproduce turns a suspicious signal into a deterministic, evidence-backed failure.
- Remediate proposes a governed fix: agents propose, humans authorize.
- Verify proves the fix closed the defect without opening a new one.

Here is what each stage surfaced across the two weeks.

### Understand: mapping blast radius the PR description omitted

The first thing the loop does is consult a live dependency map of the system rather than read the diff in isolation. A System Graph of services, shared libraries, and CI/CD wiring makes validation change-aware: it knows that a change to a "format price for display" helper is also consumed by the tax calculator and the promotions engine.

One agent PR was titled "fix currency rounding in cart subtotal." Locally correct. The graph showed the modified helper had eleven downstream callers, two of them in the promotions path. The reviewer saw a three-line diff. The change-aware view saw a blast radius the description never mentioned. That reframing is the whole point of the Understand stage: you cannot test what you do not know you broke.
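As a rough illustration of what "change-aware" means, the blast radius of a changed component can be computed by walking a dependency map in reverse, from the changed node to everything that consumes it. The sketch below assumes a tiny, hypothetical dependency map; the component names and the `blast_radius` function are invented for illustration and are not the System Graph's actual data model or API.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on (calls) the listed values.
DEPENDS_ON = {
    "cart_subtotal": ["format_price"],
    "tax_calculator": ["format_price"],
    "promotions_engine": ["cart_subtotal", "format_price"],
    "checkout_page": ["cart_subtotal", "tax_calculator", "promotions_engine"],
    "catalog_search": [],
}


def blast_radius(changed, depends_on):
    """Return every component that directly or transitively depends on `changed`."""
    # Invert the edges so we can walk from a dependency to its consumers.
    consumers = {}
    for component, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, set()).add(component)

    # Breadth-first walk over the inverted edges.
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in consumers.get(node, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


print(sorted(blast_radius("format_price", DEPENDS_ON)))
# ['cart_subtotal', 'checkout_page', 'promotions_engine', 'tax_calculator']
```

A three-line diff to `format_price` in this toy map reaches four other components, which is the gap between what the reviewer saw and what the change could affect.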

### Test: validating the paths nobody wrote a test for

The team's CI asserted cart subtotals. It did not assert the interaction between a percentage-off promotion and the newly rounded subtotal, because no human had ever shipped a change that connected those two. Testing Fleets (coordinated agents that plan and execute validation against the current system rather than replaying a static script) generated coverage along the graph's flagged paths instead of only the ones already in the suite.

That is the difference between test generation and operated testing. A static script encodes yesterday's assumptions and rots the moment the system moves. A fleet that re-derives what to test from the live graph keeps pace with continuous change. It found that stacking two promotions on the re-rounded subtotal produced a discount one cent larger than intended on a narrow band of inputs. Trivial in isolation. At checkout volume, a systematic revenue leak that no dashboard would have flagged as an error, because nothing errored.
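To make the one-cent effect concrete, here is a minimal sketch of how rounding a subtotal before applying stacked percentage promotions can shift the discount. The price, quantity, promotion rates, and function names are all illustrative assumptions, not figures or code from the case study.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount):
    """Round a monetary amount to whole cents, half up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def stacked_discount(subtotal, rates):
    """Discount from applying each percentage promotion in turn, rounded once at the end."""
    remaining = Decimal(1)
    for rate in rates:
        remaining *= Decimal(1) - rate
    return to_cents(subtotal * (Decimal(1) - remaining))


# Hypothetical cart: 2.5 kg at 4.15 per kg gives an unrounded subtotal of 10.375.
raw_subtotal = Decimal("4.15") * Decimal("2.5")
rates = [Decimal("0.10"), Decimal("0.05")]  # 10% off, then 5% off

intended = stacked_discount(raw_subtotal, rates)           # promotions see 10.375
actual = stacked_discount(to_cents(raw_subtotal), rates)   # promotions see 10.38

print(intended)  # 1.50
print(actual)    # 1.51
```

Both results are valid prices and neither code path raises, which is why a suite that only asserts the subtotal stays green.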

### Reproduce: from "looks wrong" to deterministic evidence

A signal is not a defect until you can reproduce it on demand. The Reproduce stage isolated the exact input combination (a specific item price and a specific stacked-promo configuration) and produced a deterministic failing case with full evidence: the inputs, the code path taken, and the expected versus actual subtotal. This matters for two reasons. First, an intermittent or hard-to-trigger bug gets dismissed in triage; a deterministic repro does not. Second, you cannot prove a fix works later if you cannot reliably produce the failure now.
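One way to picture such an evidence record is a small data structure that pins the inputs and expectation, runs the target several times, and reports whether the failure is identical on every attempt. This is a sketch under assumptions: `ReproCase`, its fields, and the stand-in `buggy_discount_cents` function are hypothetical and do not describe the product's evidence format.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ReproCase:
    """Evidence for one defect: pinned inputs, the expectation, and the path involved."""
    name: str
    inputs: dict
    expected: Any
    code_path: tuple = ()

    def run(self, target: Callable[..., Any], attempts: int = 5) -> dict:
        """Call `target` repeatedly and report whether it fails the same way each time."""
        observed = [target(**self.inputs) for _ in range(attempts)]
        return {
            "case": self.name,
            "inputs": self.inputs,
            "code_path": self.code_path,
            "expected": self.expected,
            "actual": observed[0],
            "deterministic": all(value == observed[0] for value in observed),
            "fails": observed[0] != self.expected,
        }


def buggy_discount_cents(subtotal_millis: int) -> int:
    """Hypothetical code under test: rounds to cents first, then applies 14.5% off."""
    subtotal_cents = (subtotal_millis + 5) // 10       # round half up to cents
    return (subtotal_cents * 145 + 500) // 1000        # round half up again


case = ReproCase(
    name="stacked-promo-rounding",
    inputs={"subtotal_millis": 10375},                 # 10.375 in thousandths
    expected=150,                                      # 14.5% of 10.375, in cents
    code_path=("cart_subtotal", "promotions_engine"),
)
report = case.run(buggy_discount_cents)
print(report["expected"], report["actual"], report["deterministic"])  # 150 151 True
```

The report carries everything triage needs in one place, and the same case can be replayed unchanged against a patched build.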

For regulated or sensitive environments, this evidence can be generated inside the customer boundary. Edge Runners execute as signed capsules within a secure enclave and emit audit-ready evidence without code or data leaving the perimeter. That is the difference between "we think we fixed it" and a record you can hand an auditor.

### Remediate: governed fixing, not autonomous merging

This is the hardest and most consequential stage, and the place where unsupervised autonomy is genuinely reckless. A Remediation Fleet proposed a fix that pinned the rounding order so promotions applied to the pre-rounded value. It did not merge it. The principle holds without exception: agents propose, humans authorize. The proposed fix arrived as a reviewable change with the failing repro attached and the graph-derived blast radius shown, so the human approving it had the context the original reviewer never had.

Governance is the engineering here, not a checkbox bolted on after. Policy decides which classes of change an agent may even propose, approval routes the fix to the right owner, and audit records who authorized what and on what evidence. That is what keeps "autonomous fixing" from becoming "autonomous breaking at scale."
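The three controls named above (policy, approval, audit) can be sketched as a small gate in front of every agent-proposed fix. Everything here is a hypothetical illustration: the change classes, owner groups, `Proposal` fields, and function names are assumptions made for the example, not a description of how any particular product implements governance.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical policy: change classes an agent may propose, mapped to the approving owner.
POLICY = {
    "pricing-logic": "checkout-owners",
    "test-only": "qa-leads",
}

AUDIT_LOG = []


@dataclass
class Proposal:
    change_class: str
    summary: str
    evidence_id: str  # reference to the attached failing repro


def submit(proposal):
    """Policy gate: reject classes the agent may not propose, else route to an owner."""
    owner = POLICY.get(proposal.change_class)
    if owner is None:
        return {"status": "rejected", "reason": "change class not permitted by policy"}
    return {"status": "pending", "route_to": owner, "proposal": proposal}


def authorize(ticket, approver):
    """Approval gate: a named human authorizes a pending ticket, and the act is recorded."""
    if ticket["status"] != "pending":
        raise ValueError("only pending proposals can be authorized")
    AUDIT_LOG.append({
        "approver": approver,
        "change_class": ticket["proposal"].change_class,
        "evidence_id": ticket["proposal"].evidence_id,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return {**ticket, "status": "authorized"}


ticket = submit(Proposal("pricing-logic", "apply promotions before rounding", "repro-001"))
approved = authorize(ticket, approver="alice")
blocked = submit(Proposal("schema-migration", "drop unused column", "repro-002"))
print(approved["status"], blocked["status"])  # authorized rejected
```

Note that nothing in the sketch merges anything: the agent's path ends at `submit`, and the only route to "authorized" passes through a human and leaves an audit entry.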

### Verify: prove it closed, prove nothing else opened

The last stage re-ran the deterministic repro against the patched build to confirm the defect was gone, then re-ran the graph-flagged paths to confirm the fix introduced no regression in the tax calculator or the other promotions branch. Verification is not optional polish; it is the step that distinguishes a fix from a guess. A remediation that closes one defect and silently opens another is a net loss, and only an explicit verify step against the same evidence catches it.
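The two halves of verification can be expressed as a single check: the original repro must now pass, and every graph-flagged path must still pass. The sketch below assumes a hypothetical patched function and hand-picked regression checks; the names and numbers are illustrative only.

```python
def verify(patched, repro, regression_checks):
    """Return a verdict: the repro must now pass and no flagged path may regress."""
    inputs, expected = repro
    defect_closed = patched(*inputs) == expected
    regressions = [name for name, check in regression_checks.items() if not check(patched)]
    return {
        "defect_closed": defect_closed,
        "regressions": regressions,
        "verified": defect_closed and not regressions,
    }


def patched_discount_cents(subtotal_millis: int) -> int:
    """Hypothetical patched build: apply 14.5% to the unrounded subtotal, round once."""
    return (subtotal_millis * 145 + 5000) // 10000


# The deterministic repro: a subtotal of 10.375 must yield a discount of 150 cents.
repro = ((10375,), 150)

# Hypothetical checks along other flagged paths that the fix must not disturb.
regression_checks = {
    "zero subtotal": lambda f: f(0) == 0,
    "whole-dollar subtotal": lambda f: f(20000) == 290,
}

verdict = verify(patched_discount_cents, repro, regression_checks)
print(verdict)
# {'defect_closed': True, 'regressions': [], 'verified': True}
```

A fix that closed the repro but broke a flagged path would come back with `verified` set to `False` and the failing path named, which is the case a re-run of the repro alone would miss.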

## What the loop caught that review missed

Across the two weeks, the defects that slipped past human PR review clustered into a recognizable pattern, none of them the kind a reviewer is built to catch:

- Invariant violations across services the PR description never mentioned (the rounding-promotion leak).
- Untested-path breakage that stayed green because the suite only asserted known behaviors.
- Plausible-but-wrong logic that read correctly line by line and was wrong only in interaction.
- Silent revenue and correctness drift: no exception, no alert, no log line, just slightly wrong output at scale.

The common thread: every one was invisible from inside the diff and visible only from the system. Reachability-aware prioritization is the same principle applied to security. Focusing on what is actually exploitable in context can mean 70-90% less exploitable exposure, instead of drowning in findings that no path reaches.

## What to do Monday morning

You do not need to rebuild your pipeline to start. Pick one high-traffic surface (checkout is the obvious one for retail) and, for the next two weeks, run every AI-authored merge back through the loop as a parallel control, not a blocking gate. Require a change-aware blast radius before approval, a deterministic repro before any fix, and a verify step before close. Keep humans on the authorize button the entire time. The goal is not to slow the agents down. It is to make their output provable.

## The bottom line


