Does the run merge code on its own?

No. The fleet drafts a fix, validates it in staging, and opens a pull request with full evidence. A human reviews that evidence and authorizes the merge. Governance policy decides who that approver is for a given service, and the approval is logged. Agents propose; humans authorize.

How do you avoid acting on flaky, intermittent failures?

The Reproduce step exists precisely for this. Before any fix is proposed, the fleet pins the failure to specific inputs and an environment and confirms that it occurs deterministically. Remediation proceeds only from a reproduction that a reviewer can run once and see for themselves.

What can I show a security or compliance reviewer after a run?

A complete trail: the change-impact map, why each check ran, captured traces and artifacts, the deterministic reproduction, staging validation, the pull request, the named approver and timestamp, and post-merge verification. Every agent action is an auditable event under the governance layer.

Why not just run the full regression suite on every change?

Because that is slow, noisy, and does not tell you why anything ran. The Understand step uses the System Graph to scope validation to what the change can actually break, then records the rationale. The result is validation that is proportional to risk rather than exhaustive and unexplained.

Reliability Operations

Inside a Zof Run: The Five-Step Reliability Loop

Understand, Test, Reproduce, Remediate, Verify—walked through one real checkout release.

Zof Reliability Team · Engineering & product

June 16, 2026 · 14 min read · Updated June 16, 2026

What "autonomous" actually means in a Zof run

The word autonomous does a lot of quiet damage in this category. To some it means software that fixes itself in production while everyone sleeps. That is not what happens in a Zof run, and we would not sell it if it were.

In a run, autonomy means three concrete things. The agents are governed: they act only inside boundaries your team defined. They are evidence-producing: every action leaves an artifact you can inspect. And they are human-authorized: nothing reaches a production-bound branch without a person approving it. Agents propose; humans authorize.

The rest of this piece walks one run end to end. We use a single, ordinary change—a modification to checkout that touches the payments path—because the value of the loop is easiest to see when the stakes are normal, not exotic.

The five-step loop at a glance

Every Zof run follows the same closed loop: Understand, Test, Reproduce, Remediate, Verify. The loop is closed because the last step feeds the first—post-merge verification updates the same context the next run reads.

The closed-loop reliability method

  ┌───────────────────────────────────────────────┐
  │                                               │
  ▼                                               │
Understand ──► Test ──► Reproduce ──► Remediate ──┤
(System Graph) (Fleets) (deterministic) (PR + approval)
  │              │          │            │        │
  └── evidence accumulates at every step ┘        │
                                                  │
            Verify (post-merge re-validation) ────┘

Understand -> Test -> Reproduce -> Remediate -> Verify, with evidence at each step and a human gate before merge.

The loop is not new to anyone who has run an incident retro. What is new is that a governed agent fleet executes it continuously, on every meaningful change, and records why each step happened. For the architecture behind the fleets, see how it works.
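The control flow of the loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: the names run_loop and STEPS and the approve callback are hypothetical and are not Zof's implementation. It shows only the two properties described above: every step records evidence, and the run stops at a human gate before the merge.

```python
STEPS = ["Understand", "Test", "Reproduce", "Remediate", "Verify"]

def run_loop(change_id, approve):
    """Walk the five steps in order, recording evidence for each.

    `approve` is a hypothetical callback standing in for the human reviewer.
    It receives the evidence gathered so far and returns True or False.
    """
    evidence = []
    for step in STEPS:
        if step == "Verify":
            # Human gate: in this sketch, approval stands in for the merge
            # decision, and Verify runs only after it is granted.
            if not approve(list(evidence)):
                return {"change": change_id, "merged": False, "evidence": evidence}
        evidence.append({"step": step, "change": change_id})
    return {"change": change_id, "merged": True, "evidence": evidence}

approved = run_loop("checkout-promo", approve=lambda ev: True)
rejected = run_loop("checkout-promo", approve=lambda ev: False)
```

In the rejected case the run ends with four evidence records and no merge, which is the behavior the human gate is meant to guarantee.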

Step 1 — Understand: context before action

The run begins when a change lands: a developer modifies the checkout service to add a new promotional-discount branch in the payment authorization flow. Before any test executes, the fleet reads the System Graph—the living map of services, workflows, dependencies, tests, incidents, and environments.

Change-impact analysis runs first. The graph shows that the edited function is called by the cart-checkout workflow, sits upstream of the payment-gateway adapter, and shares a serialization helper with the refunds service. It also surfaces that two prior incidents touched this code path. The fleet now knows the blast radius is wider than the diff suggests.
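One way to picture change-impact analysis is as a traversal over dependency edges. The sketch below is illustrative only: the graph shape, the node names, and the impact_map function are assumptions made for this example, not the System Graph's actual data model or API.

```python
from collections import deque

# Hypothetical slice of a dependency graph. Each edge is (neighbor, relation).
# Node names follow the walkthrough; the data structure is an assumption.
GRAPH = {
    "checkout.authorize": [
        ("cart-checkout-workflow", "called by"),
        ("payment-gateway-adapter", "upstream of"),
        ("serialization-helper", "uses"),
    ],
    "serialization-helper": [("refunds-service", "shared with")],
}

def impact_map(graph, changed):
    """Breadth-first traversal that records how each node was reached."""
    reached = {}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for neighbor, relation in graph.get(node, []):
            if neighbor != changed and neighbor not in reached:
                reached[neighbor] = f"{node} {relation} {neighbor}"
                queue.append(neighbor)
    return reached

impact = impact_map(GRAPH, "checkout.authorize")
```

The refunds service never appears in the diff, yet it is reachable in two hops through the shared helper. That is the sense in which the blast radius is wider than the diff suggests.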

This is the difference between running the regression suite and validating what this change can break. Understanding is not a warm-up. It is what makes the rest of the run proportional to risk instead of exhaustive and slow.

What the Understand step produced

A change-impact map: 1 edited service, 3 downstream consumers, 1 shared helper
A risk score weighted by payment-path involvement and incident history
A scoped validation plan—the workflows and surfaces that actually matter for this merge
A record of why each was selected, attached to the change
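A risk score of this kind can be sketched as a weighted sum. Everything numeric here is an assumption for illustration: the weights, the cap at 1.0, and the ImpactSummary fields are hypothetical and say nothing about how Zof actually weights payment-path involvement or incident history.

```python
from dataclasses import dataclass

@dataclass
class ImpactSummary:
    downstream_consumers: int
    touches_payment_path: bool
    prior_incidents: int

def risk_score(summary, payment_weight=0.5, incident_weight=0.15, consumer_weight=0.05):
    """Combine the signals into a score capped at 1.0. Weights are illustrative."""
    score = consumer_weight * summary.downstream_consumers
    score += incident_weight * summary.prior_incidents
    if summary.touches_payment_path:
        score += payment_weight
    return min(1.0, round(score, 2))

# Figures from the walkthrough: 3 downstream consumers, 2 prior incidents, payment path.
checkout_change = ImpactSummary(downstream_consumers=3, touches_payment_path=True, prior_incidents=2)
score = risk_score(checkout_change)
```

The point of the sketch is the shape, not the numbers: payment-path involvement and incident history push the score up, and a higher score widens the scoped validation plan.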

Step 2 — Test: fleets execute across surfaces

With a scoped plan, the Testing Fleets execute. These are governed agents that plan, execute, observe, and maintain validation, rather than a static suite someone wrote eighteen months ago. We make the case for that distinction in Testing Fleets, not test scripts.

For this change the fleet runs across four surfaces. UI: drive the checkout flow with and without the promo code. API: assert the payment-authorization contract holds. Integration: exercise the gateway adapter and the shared serialization helper against the refunds path. Security: check that the new discount branch does not expose an unauthenticated price-override.

The fleet does not just emit pass or fail. It captures evidence—request and response captures, screenshots, traces, and structured failure signatures—and ties each artifact to the change that triggered the run. One check fails: the integration test against the shared serialization helper returns a malformed amount under a specific currency.

What ran, and why

Surface	Check	Source of selection
UI	Checkout flow, promo and no-promo	Edited workflow in graph
Integration	Shared serialization helper vs refunds	Shared-dependency edge
Security	Price-override / auth on discount branch	Payment-path risk score
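A result that carries its own rationale and artifacts might look like the record below. This is a sketch under assumptions: the CheckResult type, its field names, and the artifact placeholder are hypothetical. The rows mirror the three listed above, with the integration check as the one failure.

```python
from dataclasses import dataclass, field

@dataclass
class CheckResult:
    change_id: str          # the change that triggered the run
    surface: str            # UI, API, Integration, or Security
    check: str
    selected_because: str   # rationale recorded alongside the result
    passed: bool
    artifacts: list = field(default_factory=list)  # references to traces, screenshots, captures

results = [
    CheckResult("checkout-promo", "UI", "Checkout flow, promo and no-promo",
                "Edited workflow in graph", True),
    CheckResult("checkout-promo", "Integration", "Shared serialization helper vs refunds",
                "Shared-dependency edge", False, ["trace-placeholder"]),
    CheckResult("checkout-promo", "Security", "Price-override / auth on discount branch",
                "Payment-path risk score", True),
]

# A failure is handed to the next step together with its rationale and artifacts.
failures = [r for r in results if not r.passed]
```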

Step 3 — Reproduce: make the failure deterministic

A failing check is a signal, not a diagnosis. Many reliability programs stall here, because intermittent failures are hard to trust and harder to act on. The Reproduce step exists to convert the signal into something deterministic.

The fleet isolates the failing condition: the serialization helper drops a decimal when the discount produces a sub-unit amount in a zero-decimal currency. It replays the exact inputs, captures a full trace, and confirms the failure occurs every time under those inputs and never outside them. The reproduction is pinned to those inputs, a seed, and an environment, so a reviewer can run it once and see the same result.
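To make the idea concrete, here is a minimal sketch of a pinned reproduction. The serialize_amount function is a hypothetical stand-in written to exhibit the class of bug described, not the real helper. The amounts, the seed, the currency set, and the environment label are likewise assumptions.

```python
import random
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

ZERO_DECIMAL = {"JPY", "KRW"}  # illustrative subset, not an exhaustive list

def serialize_amount(amount: Decimal, currency: str) -> str:
    """Hypothetical stand-in for the faulty helper (not the real code)."""
    if currency in ZERO_DECIMAL:
        # Bug: assumes the amount is already whole and just strips the point.
        return str(amount).replace(".", "")
    cents = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(int(cents * 100))

@dataclass(frozen=True)
class Reproduction:
    amount: Decimal
    currency: str
    expected: str
    seed: int          # recorded with the case so the setup can be replayed
    environment: str   # hypothetical environment label, recorded with the case

def replay(repro: Reproduction, runs: int = 25) -> dict:
    """Replay the pinned input and report whether it fails the same way every time."""
    observed = set()
    for _ in range(runs):
        random.seed(repro.seed)  # reset so any randomized setup would replay identically
        observed.add(serialize_amount(repro.amount, repro.currency))
    return {
        "deterministic": len(observed) == 1,
        "fails": repro.expected not in observed,
        "observed": sorted(observed),
    }

failing = Reproduction(Decimal("849.15"), "JPY", "849", seed=7, environment="staging-example")
control = Reproduction(Decimal("850"), "JPY", "850", seed=7, environment="staging-example")
failing_report = replay(failing)
control_report = replay(control)
```

The failing case produces the same malformed value on every replay, while the whole-number control passes. That pairing is what "every time under those inputs and never outside them" looks like in practice.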

Deterministic reproduction is the hinge of the whole loop. Without it, remediation is guesswork and verification proves nothing.

A failure you cannot reproduce on demand is a rumor. A failure you can is a work item.
— Zof reliability team

Step 4 — Remediate: propose, validate, request approval

Now the Remediation Fleets take the reproduced failure and propose a fix. The proposed diff corrects the rounding in the shared helper and adds a guard for zero-decimal currencies. This is a draft, not a deployment. The full governance model is covered in governed AI remediation.

Before anyone is asked to look, the fix is validated in staging against the reproduction case and the broader scoped plan. The previously failing check now passes; nothing else regresses. Only then does the fleet open a pull request—with the failing check, the deterministic reproduction, the trace, the staging results, and the diff all attached. The reviewer sees the same evidence the agent used.
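A corrected helper and its validation can be sketched as follows. This is not the actual diff: the function name, the half-up rounding policy, the currency set, and the test values are assumptions chosen to illustrate explicit rounding plus a zero-decimal guard.

```python
from decimal import Decimal, ROUND_HALF_UP

ZERO_DECIMAL = {"JPY", "KRW"}  # illustrative subset, not an exhaustive list

def serialize_amount_fixed(amount: Decimal, currency: str) -> str:
    """Sketch of a corrected helper: round explicitly, then emit minor units."""
    if currency in ZERO_DECIMAL:
        # Guard: zero-decimal currencies have no minor unit, so round to a whole number.
        return str(int(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP)))
    cents = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return str(int(cents * 100))

# The reproduction case first, then a few checks standing in for the wider plan.
CASES = [
    (Decimal("849.15"), "JPY", "849"),
    (Decimal("850"), "JPY", "850"),
    (Decimal("849.50"), "JPY", "850"),
    (Decimal("8.4915"), "USD", "849"),
]

failures = [(a, c) for a, c, want in CASES if serialize_amount_fixed(a, c) != want]
```

An empty failures list corresponds to the state described above: the previously failing case passes and nothing else regresses. Only at that point would a pull request be opened.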

The pull request waits. A human reviews it, and a human approves the merge. Governance policy decides who that person is for a payment-path change, and that decision is logged. Autonomy drafted the fix; accountability did not move.
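The approval gate can be pictured as a policy lookup followed by a logged decision. The sketch below assumes a hypothetical policy table, role names, and an in-memory log. It is meant to show the shape of the check, not Zof's governance layer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical policy: which role must approve a change, keyed by category.
APPROVER_POLICY = {"payment-path": "payments-owner", "default": "service-owner"}

@dataclass(frozen=True)
class Approval:
    pull_request: str
    required_role: str
    approver: str
    approved_at: str

def authorize_merge(pull_request, category, approver, approver_role, log):
    """Allow the merge only if the approver holds the role policy requires, and log it."""
    required = APPROVER_POLICY.get(category, APPROVER_POLICY["default"])
    if approver_role != required:
        raise PermissionError(f"{approver} lacks role {required!r}")
    record = Approval(pull_request, required, approver,
                      datetime.now(timezone.utc).isoformat())
    log.append(record)
    return record

audit_log = []
approval = authorize_merge("PR-example", "payment-path", "alice", "payments-owner", audit_log)
```

A reviewer without the required role raises PermissionError and leaves the log untouched, so the log only ever contains approvals that policy allowed.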

Step 5 — Verify: close the loop after merge

Approval is not the end of the run. After the merge, the fleet re-validates—it re-runs the reproduction case and the scoped plan against the merged result to confirm the fix held and the change is clean in the integrated state.

Verification then writes back to the System Graph. The reproduction becomes a retained regression case, the incident history is updated, and the next run that touches this code path will read that context. The loop is closed because Verify feeds Understand.

If verification had failed, the run would not silently pass. It would surface the regression with evidence and re-enter the loop, because closing the loop means proving the fix, not assuming it.
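The write-back and the failure path can be sketched together. Here a plain dict stands in for the System Graph, and the function name, keys, and rerun callback are hypothetical.

```python
def verify_after_merge(graph, code_path, reproduction, rerun):
    """Re-run the reproduction post-merge and record the outcome in the graph.

    `graph` is a plain dict standing in for the System Graph. `rerun` is a
    hypothetical callable returning True when the reproduction case now passes.
    """
    entry = graph.setdefault(code_path, {"regression_cases": [], "history": []})
    if rerun(reproduction):
        # Success: retain the reproduction as a regression case for future runs.
        entry["regression_cases"].append(reproduction)
        entry["history"].append("fix verified post-merge")
        return "closed"
    # Failure: surface the regression instead of passing silently.
    entry["history"].append("regression after merge")
    return "re-enter loop"

graph = {}
held = verify_after_merge(graph, "serialization-helper", "jpy-sub-unit-case", rerun=lambda r: True)
broke = verify_after_merge(graph, "serialization-helper", "jpy-sub-unit-case", rerun=lambda r: False)
```

Either outcome leaves a trace in the graph, so the next run that touches this path starts from what the last one learned.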

What the human did, and what the agents did

Walk back through the run and the division of labor is unambiguous. The agents carried the operational load; the human made the decisions that carry accountability.

Division of labor across the run

Step	Agents did	Human did
Understand	Impact map, risk score, scoped plan	Set graph policy and risk weighting
Test	Executed surfaces, captured evidence	Defined release-ready criteria
Reproduce	Deterministic replay with trace	Nothing required
Remediate	Drafted fix, validated in staging, opened PR	Reviewed evidence, approved the merge
Verify	Re-validated post-merge, updated graph	Reviewed close-out (optional)

The human touched the run in exactly two places that matter: defining the boundaries up front, and authorizing the change at the end. Everything between was governed, evidence-producing work the agents performed so a person did not have to.

The evidence trail a run leaves behind

The deliverable of a Zof run is not a green checkmark. It is an evidence trail a skeptical reviewer—or a security and compliance reviewer six months later—can follow without taking anything on faith.

Artifacts retained for this one run

Change-impact map tying the diff to its real blast radius
Per-check rationale: why each surface and test ran
Captured artifacts: traces, screenshots, request/response captures
A deterministic reproduction pinned to inputs and environment
Staging validation results for the proposed fix
The pull request, the approver, and the timestamped authorization
Post-merge verification confirming the fix held

Every action—read, execute, propose, approve—is an auditable event under the governance layer. Security teams can answer the only questions that matter: who authorized this, what did the agent see, what did it change, and what validated the fix.
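An audit trail of this kind can be modeled as an append-only list of structured events. The sketch below is an assumption-laden illustration: the AuditEvent fields, the actor names, and the JSON-lines encoding are hypothetical choices, and only the four action names come from the text above.

```python
import json
from dataclasses import dataclass, asdict

ALLOWED_ACTIONS = {"read", "execute", "propose", "approve"}

@dataclass(frozen=True)
class AuditEvent:
    actor: str      # agent or human identifier (illustrative)
    action: str     # one of ALLOWED_ACTIONS
    target: str
    change_id: str

def record(log, event):
    """Append an event as one JSON line; reject actions outside the known set."""
    if event.action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {event.action}")
    log.append(json.dumps(asdict(event), sort_keys=True))

def who_authorized(log, change_id):
    """Answer 'who authorized this?' by filtering approve events for the change."""
    events = (json.loads(line) for line in log)
    return [e["actor"] for e in events
            if e["change_id"] == change_id and e["action"] == "approve"]

log = []
record(log, AuditEvent("testing-fleet", "read", "system-graph", "checkout-promo"))
record(log, AuditEvent("testing-fleet", "execute", "integration-check", "checkout-promo"))
record(log, AuditEvent("remediation-fleet", "propose", "PR-example", "checkout-promo"))
record(log, AuditEvent("reviewer", "approve", "PR-example", "checkout-promo"))
authorizers = who_authorized(log, "checkout-promo")
```

The other three questions (what the agent saw, what it changed, what validated the fix) become similar filters over the same log.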

Why one walked-through run generalizes

A single change is easy to reason about by hand. The reason the loop is operated by a fleet is that organizations do not ship one change. They ship hundreds a week, across services no one person fully holds in their head, and a growing share of that code is machine-generated—Zof's research puts AI-generated code at roughly 41% of codebases.

Run the same five steps on every meaningful change, continuously, and reliability stops being a heroic post-incident scramble and becomes a property of the pipeline. One Series C fintech VP of Engineering reported 94% fewer production incidents within 90 days; we report it as one organization's result, not a guarantee.

The shape of the run never changes: Understand, Test, Reproduce, Remediate, Verify, with a human setting the boundaries up front and holding the gate before merge. What changes is that it now happens at the rate your team actually ships.

Final takeaway

A Zof run is the opposite of a black box. It is a closed loop that produces evidence at every step and stops at a human before anything ships. Autonomous, here, means governed, evidence-producing, and human-authorized—not unattended.

If you want to see the loop against your own stack, the most useful starting point is a real change on a real service. Bring the workflow you are most afraid to break, and we will walk the five steps with you.

Continue Reading

Engineering

Testing Fleets, Not Test Scripts

Static scripts cannot keep up with continuous change. Testing fleets bring operational discipline to enterprise validation.

Zof Reliability Team · May 3, 2026 · 12 min read

Security & Governance

Governed AI Remediation: Fixing Software Without Losing Control

Why remediation is the hardest part of autonomous reliability, and how enterprises can adopt AI fixes safely.

Zof Reliability Team · May 5, 2026 · 11 min read