Inside a Zof Run: The Five-Step Reliability Loop

Understand, Test, Reproduce, Remediate, Verify—walked through one real checkout release.

Zof Reliability Team · Engineering & product

June 16, 2026 · 14 min read · Updated June 16, 2026

01

What "autonomous" actually means in a Zof run

The word autonomous does a lot of quiet damage in this category. To some it means software that fixes itself in production while everyone sleeps. That is not what happens in a Zof run, and we would not sell it if it were.

In a run, autonomy means three concrete things. The agents are governed: they act only inside boundaries your team defined. They are evidence-producing: every action leaves an artifact you can inspect. And they are human-authorized: nothing reaches a production-bound branch without a person approving it. Agents propose, humans authorize.

The rest of this piece walks one run end to end. We use a single, ordinary change—a modification to checkout that touches the payments path—because the value of the loop is easiest to see when the stakes are normal, not exotic.

02

The five-step loop at a glance

Every Zof run follows the same closed loop: Understand, Test, Reproduce, Remediate, Verify. The loop is closed because the last step feeds the first—post-merge verification updates the same context the next run reads.

The closed-loop reliability method

  ┌─────────────────────────────────────────────┐
  │                                               │
  ▼                                               │
Understand ──► Test ──► Reproduce ──► Remediate ──┤
(System Graph) (Fleets) (deterministic) (PR + approval)
  │              │          │            │        │
  └── evidence accumulates at every step ─┘        │
                                                   │
            Verify (post-merge re-validation) ─────┘
Understand -> Test -> Reproduce -> Remediate -> Verify, with evidence at each step and a human gate before merge.

The loop is not new to anyone who has run an incident retro. What is new is that a governed agent fleet executes it continuously, on every meaningful change, and records why each step happened. For the architecture behind the fleets, see how it works.

03

Step 1 — Understand: context before action

The run begins when a change lands: a developer modifies the checkout service to add a new promotional-discount branch in the payment authorization flow. Before any test executes, the fleet reads the System Graph—the living map of services, workflows, dependencies, tests, incidents, and environments.

Change-impact analysis runs first. The graph shows that the edited function is called by the cart-checkout workflow, sits upstream of the payment-gateway adapter, and shares a serialization helper with the refunds service. It also surfaces that two prior incidents touched this code path. The fleet now knows the blast radius is wider than the diff suggests.

This is the difference between running the regression suite and validating what this change can break. Understanding is not a warm-up. It is what makes the rest of the run proportional to risk instead of exhaustive and slow.

What the Understand step produced

  • A change-impact map: 1 edited service, 3 downstream consumers, 1 shared helper
  • A risk score weighted by payment-path involvement and incident history
  • A scoped validation plan—the workflows and surfaces that actually matter for this merge
  • A record of why each was selected, attached to the change
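
One way to picture change-impact analysis is as a walk over dependency edges. The sketch below is a minimal illustration under stated assumptions, not the System Graph itself: the node names, the edge list, and the `blast_radius` function are hypothetical, modeled loosely on the checkout example in this section.

```python
from collections import deque

# Hypothetical edges: key -> things affected when the key changes.
# Names are illustrative stand-ins for entries in a dependency graph.
AFFECTS = {
    "checkout-service": [
        "cart-checkout-workflow",
        "payment-gateway-adapter",
        "serialization-helper",
    ],
    "serialization-helper": ["refunds-service"],
}


def blast_radius(graph: dict, changed: str) -> list:
    """Breadth-first walk returning everything reachable from the changed node."""
    seen = {changed}
    queue = deque([changed])
    reached = []
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.append(neighbor)
                queue.append(neighbor)
    return reached


if __name__ == "__main__":
    for name in blast_radius(AFFECTS, "checkout-service"):
        print(name)
```

The refunds service never appears in the diff, yet the walk reaches it through the shared helper, which is the sense in which the blast radius is wider than the diff suggests.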
04

Step 2 — Test: fleets execute across surfaces

With a scoped plan, the Testing Fleets execute. These are governed agents that plan, execute, observe, and maintain validation, rather than a static suite someone wrote eighteen months ago. We make the case for that distinction in "Testing Fleets, not test scripts."

For this change the fleet runs across four surfaces. UI: drive the checkout flow with and without the promo code. API: assert the payment-authorization contract holds. Integration: exercise the gateway adapter and the shared serialization helper against the refunds path. Security: check that the new discount branch does not expose an unauthenticated price-override.

The fleet does not just emit pass or fail. It captures evidence—request and response captures, screenshots, traces, and structured failure signatures—and ties each artifact to the change that triggered the run. One check fails: the integration test against the shared serialization helper returns a malformed amount under a specific currency.

What ran, and why
Surface     | Check                                    | Source of selection
UI          | Checkout flow, promo and no-promo        | Edited workflow in graph
Integration | Shared serialization helper vs refunds   | Shared-dependency edge
Security    | Price-override / auth on discount branch | Payment-path risk score

05

Step 3 — Reproduce: make the failure deterministic

A failing check is a signal, not a diagnosis. Many reliability programs stall here, because intermittent failures are hard to trust and harder to act on. The Reproduce step exists to convert the signal into something deterministic.

The fleet isolates the failing condition: the serialization helper drops a decimal when the discount produces a sub-unit amount in a zero-decimal currency. It replays the exact inputs, captures a full trace, and confirms the failure occurs every time under those inputs and never outside them. The reproduction is pinned to a seed and an environment, so a reviewer can run it once and see the same result.

Deterministic reproduction is the hinge of the whole loop. Without it, remediation is guesswork and verification proves nothing.
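
To make "deterministic" concrete, here is a minimal sketch of a pinned reproduction. Everything in it is an assumption for illustration: the faulty `serialize_amount` is a hypothetical stand-in that mimics the symptom described above, and the prices, discount, and currency set are invented inputs rather than values from the actual run.

```python
from decimal import Decimal

ZERO_DECIMAL = {"JPY", "KRW"}  # illustrative subset, an assumption


def serialize_amount(amount: Decimal, currency: str) -> str:
    """Hypothetical faulty helper: should return the amount in minor units."""
    if currency in ZERO_DECIMAL:
        whole, _, frac = format(amount, "f").partition(".")
        # Bug: the decimal point is dropped and leftover fractional digits
        # are glued onto the whole part instead of being rounded away.
        return whole + frac.rstrip("0")
    return str(int(amount * 100))


# Pinned inputs: the failing case and a control that should stay healthy.
PINNED_CASE = {"price": Decimal("1050"), "discount": Decimal("0.15"), "currency": "JPY"}
CONTROL_CASE = {"price": Decimal("1000"), "discount": Decimal("0.15"), "currency": "JPY"}


def replay(case: dict, runs: int = 25) -> set:
    """Replay the same inputs repeatedly and collect the distinct outputs."""
    amount = case["price"] * (1 - case["discount"])
    return {serialize_amount(amount, case["currency"]) for _ in range(runs)}


if __name__ == "__main__":
    print(replay(PINNED_CASE))   # {'8925'}: 892.50 yen serialized as 8925
    print(replay(CONTROL_CASE))  # {'850'}: whole amounts are unaffected
```

A single-element set on every replay is the property that matters: the pinned case fails the same way each time, and the control case never fails.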

A failure you cannot reproduce on demand is a rumor. A failure you can is a work item.

Zof reliability team
06

Step 4 — Remediate: propose, validate, request approval

Now the Remediation Fleets take the reproduced failure and propose a fix. The proposed diff corrects the rounding in the shared helper and adds a guard for zero-decimal currencies. This is a draft, not a deployment. The full governance model is covered in governed AI remediation.
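
A fix of that shape might look like the following sketch. It is an assumption, not the diff from the run: the function name, the illustrative currency set, and the choice of half-up rounding are picked for the example, and a real helper would follow whatever rounding rule the payment gateway requires.

```python
from decimal import Decimal, ROUND_HALF_UP

ZERO_DECIMAL = {"JPY", "KRW"}  # illustrative subset, an assumption


def serialize_amount(amount: Decimal, currency: str) -> str:
    """Return the amount in the currency's smallest unit, as a string."""
    if currency in ZERO_DECIMAL:
        # Guard: zero-decimal currencies have no minor unit, so round
        # to a whole unit instead of emitting fractional digits.
        whole = amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
        return str(int(whole))
    # Two-decimal currencies: round to cents, then express as minor units.
    cents = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) * 100
    return str(int(cents))


if __name__ == "__main__":
    print(serialize_amount(Decimal("892.50"), "JPY"))  # 893
    print(serialize_amount(Decimal("850.00"), "JPY"))  # 850
    print(serialize_amount(Decimal("19.99"), "USD"))   # 1999
```

Using `Decimal.quantize` keeps the rounding explicit and avoids binary floating-point surprises, which is usually the safer default for money.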

Before anyone is asked to look, the fix is validated in staging against the reproduction case and the broader scoped plan. The previously failing check now passes; nothing else regresses. Only then does the fleet open a pull request—with the failing check, the deterministic reproduction, the trace, the staging results, and the diff all attached. The reviewer sees the same evidence the agent used.

The pull request waits. A human reviews it, and a human approves the merge. Governance policy decides who that person is for a payment-path change, and that decision is logged. Autonomy drafted the fix; accountability did not move.
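
The approval gate can be pictured as a small policy check. This is a sketch under assumptions: the policy table, role names, and `merge_authorized` function are hypothetical and do not describe Zof's governance API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    reviewer: str
    role: str
    is_agent: bool = False


# Hypothetical policy: which roles may approve each class of change.
POLICY = {
    "payment-path": {"payments-owner"},
    "default": {"service-owner", "payments-owner"},
}


def merge_authorized(change_class: str, approvals: list, policy: dict = POLICY) -> bool:
    """True only if a human holding a permitted role has approved."""
    allowed_roles = policy.get(change_class, policy["default"])
    return any(
        (not approval.is_agent) and approval.role in allowed_roles
        for approval in approvals
    )


if __name__ == "__main__":
    agent = Approval("remediation-fleet", "payments-owner", is_agent=True)
    human = Approval("reviewer-1", "payments-owner")
    print(merge_authorized("payment-path", [agent]))         # False
    print(merge_authorized("payment-path", [agent, human]))  # True
```

The agent's own sign-off never counts, whatever role it is given, which mirrors the rule that agents propose and humans authorize.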

07

Step 5 — Verify: close the loop after merge

Approval is not the end of the run. After the merge, the fleet re-validates—it re-runs the reproduction case and the scoped plan against the merged result to confirm the fix held and the change is clean in the integrated state.

Verification then writes back to the System Graph. The reproduction becomes a retained regression case, the incident history is updated, and the next run that touches this code path will read that context. The loop is closed because Verify feeds Understand.

If verification had failed, the run would not silently pass. It would surface the regression with evidence and re-enter the loop, because closing the loop means proving the fix, not assuming it.
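
As a rough sketch of that write-back, assuming a hypothetical `GraphNode` record and a `rerun` callable that stands in for re-running the reproduction against the merged code:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GraphNode:
    """Hypothetical stand-in for one System Graph entry."""
    name: str
    regression_cases: list = field(default_factory=list)
    history: list = field(default_factory=list)


def verify(node: GraphNode, case_id: str, rerun: Callable[[], bool]) -> str:
    """Re-run a reproduction after merge and record the outcome on the node."""
    if rerun():
        # Fix held: retain the reproduction as a regression case.
        if case_id not in node.regression_cases:
            node.regression_cases.append(case_id)
        node.history.append(f"{case_id}: fix verified post-merge")
        return "closed"
    # Fix did not hold: record it and send the run back into the loop.
    node.history.append(f"{case_id}: regression after merge")
    return "re-enter"


if __name__ == "__main__":
    helper = GraphNode("serialization-helper")
    print(verify(helper, "zero-decimal-sub-unit", lambda: True))
    print(helper.regression_cases)
```

Either outcome leaves a record on the node, so the next run that reads it starts with more context than this one had.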

08

What the human did, and what the agents did

Walk back through the run and the division of labor is unambiguous. The agents carried the operational load; the human made the decisions that carry accountability.

Division of labor across the run
Step      | Agents did                                   | Human did
Understand | Impact map, risk score, scoped plan          | Set graph policy and risk weighting
Test      | Executed surfaces, captured evidence         | Defined release-ready criteria
Reproduce | Deterministic replay with trace              | Nothing required
Remediate | Drafted fix, validated in staging, opened PR | Reviewed evidence, approved the merge
Verify    | Re-validated post-merge, updated graph       | Reviewed close-out (optional)

The human touched the run in exactly two places that matter: defining the boundaries up front, and authorizing the change at the end. Everything between was governed, evidence-producing work the agents performed so a person did not have to.

09

The evidence trail a run leaves behind

The deliverable of a Zof run is not a green checkmark. It is an evidence trail a skeptical reviewer—or a security and compliance reviewer six months later—can follow without taking anything on faith.

Artifacts retained for this one run

  • Change-impact map tying the diff to its real blast radius
  • Per-check rationale: why each surface and test ran
  • Captured artifacts: traces, screenshots, request/response captures
  • A deterministic reproduction pinned to inputs and environment
  • Staging validation results for the proposed fix
  • The pull request, the approver, and the timestamped authorization
  • Post-merge verification confirming the fix held

Every action—read, execute, propose, approve—is an auditable event under the governance layer. Security teams can answer the only questions that matter: who authorized this, what did the agent see, what did it change, and what validated the fix.
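
To show how those questions become queries, here is a minimal sketch of an audit log. The `AuditEvent` fields, actor names, and helper function are hypothetical and chosen for illustration; they are not the governance layer's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    actor: str
    actor_kind: str  # "agent" or "human"
    action: str      # "read", "execute", "propose", or "approve"
    target: str


def who_authorized(events: list, target: str) -> list:
    """Return the humans who approved the given target."""
    return [
        event.actor
        for event in events
        if event.action == "approve"
        and event.actor_kind == "human"
        and event.target == target
    ]


# Illustrative log for one run; all names are invented.
LOG = [
    AuditEvent("testing-fleet", "agent", "read", "system-graph"),
    AuditEvent("testing-fleet", "agent", "execute", "integration-check"),
    AuditEvent("remediation-fleet", "agent", "propose", "fix-pr"),
    AuditEvent("reviewer-1", "human", "approve", "fix-pr"),
]

if __name__ == "__main__":
    print(who_authorized(LOG, "fix-pr"))  # ['reviewer-1']
```

The same filter pattern, keyed on a different action, answers what the agent saw or what it changed.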

10

Why one walked-through run generalizes

A single change is easy to reason about by hand. The reason the loop is operated by a fleet is that organizations do not ship one change. They ship hundreds a week, across services no one person fully holds in their head, and a growing share of that code is machine-generated—Zof's research puts AI-generated code at roughly 41% of codebases.

Run the same five steps on every meaningful change, continuously, and reliability stops being a heroic post-incident scramble and becomes a property of the pipeline. One Series C fintech VP of Engineering reported 94% fewer production incidents within 90 days; we report it as one organization's result, not a guarantee.

The shape of the run never changes—Understand, Test, Reproduce, Remediate, Verify, with a human at the boundary and the gate. What changes is that it now happens at the rate your team actually ships.

11

Final takeaway

A Zof run is the opposite of a black box. It is a closed loop that produces evidence at every step and stops at a human before anything ships. Autonomous, here, means governed, evidence-producing, and human-authorized—not unattended.

If you want to see the loop against your own stack, the most useful starting point is a real change on a real service. Bring the workflow you are most afraid to break, and we will walk the five steps with you.

Frequently asked questions

Zof agents never merge their own fixes. The fleet drafts a fix, validates it in staging, and opens a pull request with full evidence. A human reviews that evidence and authorizes the merge. Governance policy decides who that approver is for a given service, and the approval is logged. Agents propose; humans authorize.
