
From Microsoft Scale to a New Category: How TAS23 Became Zof

The founder arc behind Zof: running engineering at Microsoft scale, a 2023 conference talk, and the reframe from QA tooling to governed reliability infrastructure.


Zof Reliability Team · Engineering & product

June 24, 2026 · 7 min read · Updated June 24, 2026

Summary

Most categories don't get invented in a strategy deck. They get named the moment a hard-won pattern finally has language. For me that moment arrived on a conference stage in Atlanta in 2023, at the Test Automation Summit, where I tried to explain something I'd watched fail at scale for years. The talk wasn't about a product. It was about a category error: we keep treating reliability as a testing problem, when it's actually a control problem. This is the arc from there to here, and why it matters if you're building a company on top of AI-generated software.


What scale teaches you that a tutorial never will

Running engineering systems at Microsoft scale changes your intuitions permanently. You stop thinking about software as code and start thinking about it as a living system: thousands of services, dependencies you didn't write, deploy pipelines no single person fully holds in their head, and change arriving continuously from every direction. At that altitude, the failures that hurt are almost never the bug a developer obviously should have caught. They're the second-order failures. A change that was correct in isolation interacts with a dependency three hops away and takes down something nobody connected to it.
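To make the "three hops away" failure concrete, here is a minimal sketch of how a dependency map can surface it. The service names and the `DEPENDS_ON` map are hypothetical, invented for illustration; the function simply walks the map backwards from a changed service and records how many hops away each affected caller sits.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "ledger": ["config-store"],
    "cart": ["catalog"],
    "catalog": [],
    "config-store": [],
}


def blast_radius(changed, depends_on):
    """Return {service: hops} for every service that transitively depends on `changed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    hops = {}
    queue = deque([(changed, 0)])
    while queue:
        node, distance = queue.popleft()
        for caller in dependents.get(node, []):
            if caller not in hops and caller != changed:
                hops[caller] = distance + 1
                queue.append((caller, distance + 1))
    return hops


impact = blast_radius("config-store", DEPENDS_ON)
print(impact)  # {'ledger': 1, 'payments': 2, 'checkout': 3}
```

In this toy map, a change to `config-store` looks harmless in isolation, yet `checkout` sits three hops downstream of it. Nothing in the changed service's own tests would reveal that.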

The lesson that stuck with me is uncomfortable for anyone who sells testing tools: more tests do not equal more reliability. I watched teams with enormous suites still get surprised, because the suite tested what someone thought to test at authoring time, not what the system actually did once it moved. Coverage on paper kept climbing. Confidence at release time did not. The thing that actually correlated with reliability wasn't test count. It was whether anyone could answer, with evidence, a single question: is this specific change safe to ship into this specific system right now?

Almost nobody could answer that. Not because they lacked tools. They had too many. They lacked a place where the answer lived.

The CTO complaint that named the problem

Years later, a CTO put it to me more bluntly than any architecture diagram ever could. He said his engineers didn't hate building software. They hated testing it. They experienced quality work as a tax: brittle, after-the-fact, disconnected from the thing they actually cared about, which was shipping working systems.

That framing is the whole problem in one sentence. When reliability is something you do *to* engineers rather than something the system does *for* them, you get exactly the behavior you'd predict. People route around it. The industry data has since caught up to that intuition: roughly 80% of developers bypass policy or guardrails. Read that not as a discipline problem but as a verdict. A guardrail that four out of five developers route around isn't a guardrail. It's friction with a compliance label on it.

So the question I kept circling wasn't "how do we make better tests?" It was "why is the thing meant to protect the system the first thing everyone abandons under pressure?" The honest answer: because we built it as a checkpoint instead of as infrastructure. Checkpoints get bypassed. Infrastructure gets relied on.

TAS23: reframing QA as reliability infrastructure

The Test Automation Summit talk was where I committed to the reframe out loud. The argument was simple and, at the time, slightly heretical for a room full of test-automation practitioners: the goal is not to automate QA. The goal is to make reliability a property of the system rather than an event in the calendar.

That distinction sounds academic until you trace its consequences. If reliability is an event, you schedule it, staff it, and inevitably compress it when the release date slips. If reliability is infrastructure, it runs continuously, understands the system it's protecting, and produces a defensible answer every time something changes. One is a phase. The other is a control layer.

A few principles fell out of that talk and never left:

Reliability is a system property, not a file property. You can't validate a change without understanding what it touches. That requires a live map of services, dependencies, and pipelines, not a folder of scripts.
Static checks rot the moment the system moves. Anything authored once and frozen is testing yesterday's architecture against today's traffic.
Speed without clarity is just motion. Velocity doesn't kill quality. The absence of visibility into what a change actually reaches does.
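One way to picture the "static checks rot" principle is to record what the system looked like when a check was written and compare it with the live map later. The sketch below is an illustration under assumptions, not a description of any particular product: the check records, the map shape, and the helper names `topology_fingerprint` and `stale_checks` are all hypothetical.

```python
import hashlib
import json


def topology_fingerprint(service, depends_on):
    """Hash a service's direct dependencies so changes to them are detectable."""
    payload = json.dumps(
        {"service": service, "deps": sorted(depends_on.get(service, []))}
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stale_checks(checks, depends_on):
    """Return names of checks whose recorded fingerprint no longer matches the live map."""
    return [
        check["name"]
        for check in checks
        if check["fingerprint"] != topology_fingerprint(check["service"], depends_on)
    ]


# The map as it looked when the checks were authored (hypothetical data).
authored_map = {"checkout": ["payments", "cart"], "cart": ["catalog"]}
checks = [
    {"name": "checkout-smoke", "service": "checkout",
     "fingerprint": topology_fingerprint("checkout", authored_map)},
    {"name": "cart-contract", "service": "cart",
     "fingerprint": topology_fingerprint("cart", authored_map)},
]

# The system moves: checkout gains a dependency on a fraud service.
live_map = {"checkout": ["payments", "cart", "fraud"], "cart": ["catalog"]}
stale = stale_checks(checks, live_map)
print(stale)  # ['checkout-smoke']
```

The check itself may still pass. What the comparison shows is that it was written against an architecture that no longer exists, which is the point at which a frozen script stops being evidence.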

At the time I didn't have a name for the thing those principles implied. I had a critique of the thing that existed.

From a critique to a category

The market then did me a favor by getting much worse. AI turned code generation into a faucet. Today roughly 41% of code is AI-generated, and industry research puts the share of AI coding tasks that introduce critical flaws or security issues near 45%. Generation got radically cheaper. Validation did not. The cost of poor software quality is estimated at around $2.41 trillion in the US alone, and that number is just the aggregate of incidents, breaches, and rework flowing from changes nobody could fully vouch for.

That shift is what turned a critique into a category. When humans wrote most of the code, you could almost pretend that more reviewers and more tests would scale with the risk. They can't anymore. You cannot human-review your way out of machine-speed production. But the answer is not the 2025 fantasy of removing humans entirely and letting agents rewrite production unsupervised. I've operated systems at scale; unsupervised autonomous fixing is how you turn one incident into ten. The answer is governed autonomy: agents propose, humans authorize.

That principle is the spine of how Zof works. The closed loop is the operational form of the TAS23 thesis: understand the system, test against it, reproduce what fails, remediate under governance, verify the fix held.

Understand. A System Graph maps services, dependencies, and CI/CD so validation is change-aware. Reachability-based prioritization built on that map can mean 70 to 90% less exploitable exposure to chase, because you stop triaging findings that can't actually be hit.
Validate continuously. Testing Fleets plan, execute, observe, and maintain validation as the system evolves, instead of decaying between releases like static scripts.
Remediate under control. Remediation Fleets operate inside policy, approval, and audit. The agent proposes the fix. A human owns the decision.
Prove it. The output isn't a green check. It's an audit-ready record of what was tested, what was found, what was fixed, and who signed off.
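The reachability idea in the Understand step can be sketched in a few lines. This is a simplified illustration with a hypothetical call map and made-up finding IDs, not the actual System Graph: findings are kept for triage only when the affected component can be reached from an exposed entry point.

```python
def reachable_from(entry_points, calls):
    """Return every component reachable from the given entry points."""
    seen = set()
    stack = list(entry_points)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(calls.get(node, []))
    return seen


def prioritize(findings, entry_points, calls):
    """Split findings into those on a reachable component and those that are not."""
    reachable = reachable_from(entry_points, calls)
    actionable = [f for f in findings if f["component"] in reachable]
    deferred = [f for f in findings if f["component"] not in reachable]
    return actionable, deferred


# Hypothetical call map: which component invokes which.
CALLS = {
    "public-api": ["auth", "orders"],
    "orders": ["pdf-renderer"],
    "auth": [],
    "pdf-renderer": [],
    "legacy-batch": ["xml-parser"],
    "xml-parser": [],
}
FINDINGS = [
    {"id": "F-1", "component": "pdf-renderer"},
    {"id": "F-2", "component": "xml-parser"},
    {"id": "F-3", "component": "auth"},
]

actionable, deferred = prioritize(FINDINGS, ["public-api"], CALLS)
print([f["id"] for f in actionable])  # ['F-1', 'F-3']
print([f["id"] for f in deferred])    # ['F-2']
```

The toy data says nothing about the size of the reduction in a real system. It only shows the mechanism: a finding on a component nothing exposed can reach drops out of the urgent queue.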

That's the difference between a QA tool and reliability infrastructure. A tool gives you a result. Infrastructure owns the release decision and can prove the call it made. We named that the control layer because that's what it is: the plane above your existing stack that decides, with evidence, what is allowed to ship.
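As a rough sketch of what "agents propose, humans authorize" could look like in code, the example below gates a proposed fix on two things: attached validation evidence and a named human approver. Every name here (`Proposal`, `Decision`, `decide`, the change ID, the approver) is a hypothetical chosen for illustration, and the rule set is deliberately minimal.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Proposal:
    change_id: str
    summary: str
    evidence: list  # names of validation runs that passed for this change
    proposed_by: str = "remediation-agent"


@dataclass
class Decision:
    proposal: Proposal
    allowed: bool
    reason: str
    approved_by: Optional[str]
    decided_at: str


def decide(proposal, approver=None):
    """Allow a change only when it carries evidence and a named human approved it."""
    if not proposal.evidence:
        allowed, reason = False, "no validation evidence attached"
    elif approver is None:
        allowed, reason = False, "awaiting human authorization"
    else:
        allowed, reason = True, "evidence present and human authorized"
    return Decision(
        proposal=proposal,
        allowed=allowed,
        reason=reason,
        approved_by=approver,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )


# Every decision is recorded, whether the change was allowed or not.
audit_log = []
fix = Proposal(
    "chg-101",
    "raise ledger client timeout",
    ["integration-suite", "replay-of-failure"],
)
audit_log.append(asdict(decide(fix)))
audit_log.append(asdict(decide(fix, approver="oncall-lead")))

for record in audit_log:
    print(record["allowed"], record["reason"], record["approved_by"])
```

The agent can attach as much evidence as it likes, but the first call is still refused because no human has signed off. Both outcomes land in the log, so the record shows who made the call and on what basis.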

What this means if you're a founder

I'm not writing this to sell you on a tool category. I'm writing it because the reframe is useful even before you buy anything. If you're building a company on AI-generated software, three things follow directly from the arc above.

First, stop measuring reliability by activity. Test count, scan count, and tool count are inputs, not outcomes. The outcome is whether you can answer the release question with evidence.

Second, find out who actually owns that answer today. If it's one engineer reading five dashboards at deploy time, you don't have a tooling gap. You have a control gap, and adding a sixth tool deepens it.

Third, decide what role you want autonomy to play before the pressure hits, not during an incident. The defensible posture for a serious enterprise isn't more AI. It's more control: autonomy that proposes and humans who authorize, with an audit trail either way. If that resonates, the build-vs-buy comparison and our about page lay out the thesis in more depth.

