The Closed Loop: Why Reliability Is Five Steps, Not One Tool

A founder's case for why reliability is an operating loop (Understand, Test, Reproduce, Remediate, Verify), not a tool, written for SREs drowning in AI-speed change.

Zof Reliability Team · Engineering and Product

May 20, 2026 · 8 min read · Updated May 20, 2026

Summary

When I talk to SREs, the complaint is rarely "we don't have enough tools." It's the opposite. They have a scanner, a test suite, an observability stack, a chatops bot, and three dashboards, and none of them can answer the only question that matters on call: is the system safe right now, and can I prove it? That gap is why I built Zof around a loop instead of a product. Reliability is not a feature you bolt on. It is five steps you operate continuously (Understand, Test, Reproduce, Remediate, Verify), and any step you skip is where reliability quietly leaks out. This is a founder's walk-through of that thesis, not a feature tour. The argument is that the loop, not any single capability inside it, is the unit of reliability. Once you see it that way, most of the tooling decisions an SRE org agonizes over get simpler.

Reliability is a loop you operate (Understand, Test, Reproduce, Remediate, Verify), run continuously and owned by a governed control layer.

Why a loop and not a tool

A tool runs once and stops. It scans, it tests, it alerts, and then it hands you the output. The system, meanwhile, never stops. A dependency bumps a minor version. A service quietly reshapes a contract. An AI agent rewrites a module overnight. By the time you read this morning's clean scan, it's describing a system that no longer exists.

That is the structural reason a single tool can't deliver reliability: reliability is a property of the whole system over time, and tools model a slice at a moment. The math has gotten worse fast. Roughly 41% of codebases are now AI-generated, and industry research puts the rate at which AI coding tasks introduce critical flaws or security issues near 45%. You are no longer reviewing change at human speed. You are absorbing it at machine speed. A point tool in that environment is a flashlight in a flood.

A loop is the answer because a loop closes. The output of the last step feeds the first. Understand what changed, test it in context, reproduce what fails, remediate under control, verify the fix held, then update your understanding and go again. The point of naming the five steps is operational, not poetic. When you map your own stack against them, you can see exactly where you have nothing.
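Mapping a stack against the five steps can be as small as the sketch below. The tool names and the step each one is assigned to are hypothetical; substitute your own inventory.

```python
# The five steps of the loop, in order.
LOOP_STEPS = ["Understand", "Test", "Reproduce", "Remediate", "Verify"]


def find_gaps(stack):
    """Return the loop steps that no tool in the stack covers, in loop order."""
    covered = set(stack.values())
    return [step for step in LOOP_STEPS if step not in covered]


# Hypothetical inventory: tool name -> the loop step it serves.
example_stack = {
    "dependency scanner": "Test",
    "integration test suite": "Test",
    "observability stack": "Understand",
    "dashboards": "Understand",
}

print(find_gaps(example_stack))
```

For this inventory the uncovered steps are Reproduce, Remediate, and Verify, which is the lopsided shape many teams find when they run the exercise honestly.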

The five steps, and what each one earns

Most SRE orgs are lopsided across these. Strong on Test, decent on alerting, almost nothing on governed Remediate, and a Verify step that amounts to "the deploy didn't page anyone in the first hour." Here's what each step is actually for.

Understand is context. You cannot validate a change you can't locate, and you cannot scope blast radius you can't see. This is the job of the System Graph: a live map of services, dependencies, and CI/CD that makes validation change-aware. The payoff for an SRE is targeting. Without a system map, validation degrades into brute force: re-run everything, every time, and hope. With a graph, you ask the sharper question: given this diff, what's in the blast radius, which contracts are at risk, and which paths are actually reachable? Reachability is where this gets concrete. Reachability-based prioritization can mean 70 to 90% less exploitable exposure to triage, because you stop treating every theoretical finding as equal. That is the difference between a queue of 800 alerts and the 40 that can actually hurt you this week.
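To make "what's in the blast radius" concrete, here is a minimal sketch that assumes the graph is a plain mapping from each service to the services it depends on. The service names are invented for illustration, and this is not a description of how the System Graph is implemented; it only shows the underlying idea, a breadth-first walk over reversed dependency edges.

```python
from collections import deque


def blast_radius(depends_on, changed):
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the edges: for each service, record who depends on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent != changed and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical service map: service -> services it calls.
graph = {
    "checkout": {"payments", "cart"},
    "payments": {"ledger"},
    "cart": {"catalog"},
    "reporting": {"ledger"},
    "catalog": set(),
    "ledger": set(),
}

print(sorted(blast_radius(graph, "ledger")))
```

A change to the hypothetical `ledger` service puts `payments`, `reporting`, and `checkout` in scope, while `cart` and `catalog` can be skipped for this diff.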

Test has to keep pace with the system, which means fleets, not scripts. A test written against last quarter's API is a liability the moment the contract moves: it fails for the wrong reason or, worse, passes while validating nothing. Testing Fleets are coordinated agents that plan, execute, observe, and maintain validation as the system evolves. The distinction from "AI test generation" is the entire argument. Generating a test is a one-time act that leaves you a bigger pile of scripts to maintain. Operating validation is continuous: when the graph reports a contract change, the fleet adapts coverage and retires checks that no longer map to real behavior. The failure mode to watch is coverage theater: a green dashboard that measures lines executed, not risk retired.
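The "adapt coverage, retire stale checks" idea can be sketched as a reconciliation between the checks you have and the contract as it exists today. This assumes each check targets exactly one endpoint; the check names and endpoints are hypothetical, and a real fleet would reason about far more than endpoint names.

```python
def reconcile_coverage(checks, current_endpoints):
    """Split checks into kept and retired, and list endpoints left without a check.

    `checks` maps a check name to the endpoint it validates.
    """
    kept = {name: ep for name, ep in checks.items() if ep in current_endpoints}
    retired = sorted(name for name, ep in checks.items() if ep not in current_endpoints)
    uncovered = sorted(current_endpoints - set(kept.values()))
    return kept, retired, uncovered


# Hypothetical state after a contract change: /v1/charge was replaced by /v2/charge.
checks = {
    "charge_happy_path": "/v1/charge",
    "refund_happy_path": "/v1/refund",
}
current_endpoints = {"/v2/charge", "/v1/refund"}

kept, retired, uncovered = reconcile_coverage(checks, current_endpoints)
print(kept, retired, uncovered)
```

Here the check against the removed endpoint is retired instead of being left to fail for the wrong reason, and the new endpoint shows up as a coverage gap that needs a check.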

Reproduce is the step everyone skips, and it's the quiet discriminator between tools that find problems and systems that fix them. An alert without a reproducible case is a ticket that ages in a backlog. A deterministic reproduction is the seed of a fix and the only fair basis for verifying one. For regulated or security-sensitive teams, reproduction has to happen inside the customer boundary, against realistic state, without code or sensitive data leaving the perimeter. That is what Edge Runners are for: signed capsules that execute inside secure enclaves and produce audit-ready evidence. Reproduction you can't trust as evidence isn't reproduction. It's an anecdote in a screenshot.
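One way to see why a signed reproduction counts as evidence and a screenshot does not: a signature lets anyone holding the key detect tampering. The sketch below uses an HMAC from the Python standard library as a stand-in. It is an assumption for illustration only, not the Edge Runner capsule format, and the capsule fields and key are hypothetical.

```python
import hashlib
import hmac
import json


def sign_capsule(capsule, key):
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    # sort_keys gives a stable byte representation regardless of dict order.
    payload = json.dumps(capsule, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def capsule_is_intact(capsule, signature, key):
    """True if the capsule still matches the signature it was issued with."""
    return hmac.compare_digest(sign_capsule(capsule, key), signature)


# Hypothetical reproduction record.
key = b"demo-key-not-for-production"
capsule = {"failure": "timeout on refund", "commit": "abc123", "seed": 7}
signature = sign_capsule(capsule, key)

tampered = dict(capsule, seed=8)
print(capsule_is_intact(capsule, signature, key))
print(capsule_is_intact(tampered, signature, key))
```

Changing any field of the record, even the random seed, invalidates the signature, which is the property that lets a reproduction be trusted later.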

Remediate is the hardest and most critical step, and it's where most "autonomous" pitches quietly overreach. I'll be blunt about my own conviction here, because it's the one I get the most pushback on.

Remediation is governed, full stop

Letting agents rewrite production code without oversight is not ambition. It is recklessness wearing the costume of progress. The operating principle is simple and load-bearing: agents propose, humans authorize. Remediation Fleets generate candidate fixes grounded in the reproduced failure and the graph's blast-radius analysis. They do not merge on their own authority. Every proposed change flows through Governance: policy that defines what an agent may touch, approval that puts a named human on the decision, and an audit trail that records who authorized what, against which evidence.

This is not bureaucracy bolted onto automation. It is the engineering. Here's the number that proves it: industry research finds that roughly 80% of developers already bypass policy and guardrails when those controls slow them down. A governance layer that lives outside the loop gets routed around. A governance layer that is the remediation path, where the only way to ship the fix is through the approval, is the one that actually holds. Governance is what makes autonomy safe enough to use at all. A serious enterprise doesn't want more autonomy for its own sake. It wants control.
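"Governance is the remediation path" has a simple structural meaning: the merge function has no branch that skips the gate. The sketch below illustrates that shape under stated assumptions. `ProposedFix`, `GovernanceGate`, the path prefixes, and the approver name are all hypothetical and do not describe Zof's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProposedFix:
    fix_id: str
    touched_paths: tuple   # files the agent wants to change
    reproduction_id: str   # the evidence the fix is grounded in


@dataclass
class GovernanceGate:
    allowed_prefixes: tuple  # policy: path prefixes an agent may touch (must be a tuple)
    audit_log: list = field(default_factory=list)

    def within_policy(self, fix: ProposedFix) -> bool:
        return all(path.startswith(self.allowed_prefixes) for path in fix.touched_paths)

    def authorize(self, fix: ProposedFix, approver: Optional[str]) -> bool:
        in_policy = self.within_policy(fix)
        authorized = in_policy and bool(approver)
        # Every decision is recorded, including refusals.
        self.audit_log.append({
            "fix_id": fix.fix_id,
            "evidence": fix.reproduction_id,
            "approver": approver,
            "in_policy": in_policy,
            "authorized": authorized,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return authorized


def merge(fix: ProposedFix, gate: GovernanceGate, approver: Optional[str] = None) -> str:
    # The only route to "merged" runs through the gate.
    return "merged" if gate.authorize(fix, approver) else "blocked"


gate = GovernanceGate(allowed_prefixes=("services/payments/",))
fix = ProposedFix("fix-1", ("services/payments/retry.py",), "repro-42")
print(merge(fix, gate))                    # no named human yet
print(merge(fix, gate, approver="alice"))  # in policy, with a named approver
```

An agent's proposal without a named approver is blocked, the same proposal with one is merged, and both decisions land in the audit log alongside the reproduction they were judged against.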

Verify closes the loop, and "the build is green" is not verification. Verify re-runs the reproduced failure against the remediated system, confirms the regression is gone, and confirms nothing in the blast radius broke in the process. Then it feeds the result back to Understand so the graph updates and the system's known-good state advances. Verification produces evidence tied to a specific change, a specific reproduction, and a specific authorized fix: the kind of record you can hand to an auditor or a skeptical release manager. Reliability Analytics turns that evidence stream into a defensible read on release readiness instead of a feeling.
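As a rough sketch of what that evidence might look like, the function below re-runs a reproduction and a set of blast-radius checks, then returns a record tied to the change, the reproduction, and the approver. The callables, identifiers, and field names are assumptions for illustration, not the format Reliability Analytics consumes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class VerificationEvidence:
    change_id: str
    reproduction_id: str
    approved_by: str
    regression_gone: bool
    blast_radius_failures: tuple

    @property
    def verified(self) -> bool:
        # Verified only if the failure is gone AND nothing nearby broke.
        return self.regression_gone and not self.blast_radius_failures


def verify(change_id: str, reproduction_id: str, approved_by: str,
           failure_still_triggers: Callable[[], bool],
           blast_radius_checks: Dict[str, Callable[[], bool]]) -> VerificationEvidence:
    still_fails = failure_still_triggers()
    failures = tuple(sorted(
        name for name, check in blast_radius_checks.items() if not check()
    ))
    return VerificationEvidence(change_id, reproduction_id, approved_by,
                                regression_gone=not still_fails,
                                blast_radius_failures=failures)


# Hypothetical run: the fix removed the failure but broke a dependent service.
evidence = verify(
    "change-17", "repro-42", "alice",
    failure_still_triggers=lambda: False,
    blast_radius_checks={"payments": lambda: True, "reporting": lambda: False},
)
print(evidence.verified, evidence.blast_radius_failures)
```

In this run the original regression is gone, yet the record is not verified because a check in the blast radius failed, which is exactly the case a green build alone would miss.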

The loop, mapped

Understand → System Graph: change-aware context and blast radius
Test → Testing Fleets: continuous, adaptive validation
Reproduce → Edge Runners in the secure enclave: provable failures
Remediate → Remediation Fleets + Governance: agents propose, humans authorize
Verify → Reliability Analytics: evidence-backed release readiness

What breaks when you skip a step

The thesis is most visible in its gaps, and as an SRE you've felt every one of these.

Skip Understand and your tests run blind, burning compute on irrelevance while the real blast radius goes unchecked. Skip Reproduce and remediation is guesswork: you're fixing a symptom you can't trigger on demand. Skip Governance and you get one of two failure modes: you freeze, because nobody trusts the agents enough to let them act, or you get bypassed, because the controls are friction without leverage and engineers route around them. Skip Verify and you ship hope, then find out at 3 a.m.

Consider a hypothetical fintech platform team merging dozens of AI-assisted PRs a day. Without the loop, they're stuck choosing between a review queue that can't keep up and a velocity that quietly raises their exposure. Across the industry, the aggregate cost of poor software quality is estimated at around $2.41 trillion. With the loop, the graph scopes each change, fleets validate what matters, failures reproduce as evidence, fixes route through a named approval, and verification proves the result. The speed-versus-safety tradeoff doesn't get balanced. It dissolves, not because a human left the loop, but because the loop made the human's authorization fast, contextual, and auditable.

That's the founder thesis in one line: autonomy carries the volume, governance keeps it accountable, and the loop is the structure that lets both be true at once.

The bottom line
