From Microsoft Scale to a New Category: How TAS23 Became Zof

The founder arc behind Zof: running engineering at Microsoft scale, a 2023 conference talk, and the reframe from QA tooling to governed reliability infrastructure.

Zof Reliability Team · Engineering & product

June 24, 2026 · 7 min read · Updated June 24, 2026

01

What scale teaches you that a tutorial never will

Running engineering systems at Microsoft scale changes your intuitions permanently. You stop thinking about software as code and start thinking about it as a living system: thousands of services, dependencies you didn't write, deploy pipelines no single person fully holds in their head, and change arriving continuously from every direction. At that altitude, the failures that hurt are almost never the bug a developer obviously should have caught. They're the second-order failures. A change that was correct in isolation interacts with a dependency three hops away and takes down something nobody connected to it.

The lesson that stuck with me is uncomfortable for anyone who sells testing tools: more tests do not equal more reliability. I watched teams with enormous suites still get surprised, because the suite tested what someone thought to test at authoring time, not what the system actually did once it moved. Coverage on paper kept climbing. Confidence at release time did not. The thing that actually correlated with reliability wasn't test count. It was whether anyone could answer, with evidence, a single question: is this specific change safe to ship into this specific system right now?

Almost nobody could answer that. Not because they lacked tools. They had too many. They lacked a place where the answer lived.

02

The CTO complaint that named the problem

Years later, a CTO put it to me more bluntly than any architecture diagram ever could. He said his engineers didn't hate building software. They hated testing it. They experienced quality work as a tax: brittle, after-the-fact, disconnected from the thing they actually cared about, which was shipping working systems.

That framing is the whole problem in one sentence. When reliability is something you do *to* engineers rather than something the system does *for* them, you get exactly the behavior you'd predict. People route around it. The industry data has since caught up to that intuition: roughly 80% of developers bypass policy or guardrails. Read that not as a discipline problem but as a verdict. A guardrail that four out of five developers bypass isn't a guardrail. It's friction with a compliance label on it.

So the question I kept circling wasn't "how do we make better tests?" It was "why is the thing meant to protect the system the first thing everyone abandons under pressure?" The honest answer: because we built it as a checkpoint instead of as infrastructure. Checkpoints get bypassed. Infrastructure gets relied on.

03

TAS23: reframing QA as reliability infrastructure

The Test Automation Summit 2023 (TAS23) talk was where I committed to the reframe out loud. The argument was simple and, at the time, slightly heretical for a room full of test-automation practitioners: the goal is not to automate QA. The goal is to make reliability a property of the system rather than an event on the calendar.

That distinction sounds academic until you trace its consequences. If reliability is an event, you schedule it, staff it, and inevitably compress it when the release date slips. If reliability is infrastructure, it runs continuously, understands the system it's protecting, and produces a defensible answer every time something changes. One is a phase. The other is a control layer.

A few principles fell out of that talk and never left:

  • Reliability is a system property, not a file property. You can't validate a change without understanding what it touches. That requires a live map of services, dependencies, and pipelines, not a folder of scripts.
  • Static checks rot the moment the system moves. Anything authored once and frozen is testing yesterday's architecture against today's traffic.
  • Speed without clarity is just motion. Velocity doesn't kill quality. The absence of visibility into what a change actually reaches does.
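To make the first principle concrete, here is a minimal sketch of asking "what does this change touch?" against a dependency map. The service names, the `DEPENDS_ON` dictionary, and the `blast_radius` function are all hypothetical illustrations, not Zof's implementation; a real map would be discovered from the running system rather than written by hand.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": ["catalog"],
    "search": ["catalog"],
    "ledger": [],
    "catalog": [],
}


def blast_radius(changed, depends_on):
    """Return {service: hops} for every service that transitively
    depends on `changed`, found by breadth-first search."""
    # Invert the edges so we can walk from a service to its callers.
    dependents = {svc: [] for svc in depends_on}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    hops = {}
    seen = {changed}
    queue = deque([(changed, 0)])
    while queue:
        svc, dist = queue.popleft()
        for caller in dependents.get(svc, []):
            if caller not in seen:
                seen.add(caller)
                hops[caller] = dist + 1
                queue.append((caller, dist + 1))
    return hops


if __name__ == "__main__":
    # A change to "catalog" reaches "checkout" two hops away.
    print(blast_radius("catalog", DEPENDS_ON))
```

Even in this toy graph, a change to `catalog` reaches `checkout` through `inventory`, which is the kind of second-order connection a folder of per-service scripts would not surface.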

At the time I didn't have a name for the thing those principles implied. I had a critique of the thing that existed.

04

From a critique to a category

The market then did me a favor by getting much worse. AI turned code generation into a faucet. Today roughly 41% of code is AI-generated, and industry research puts the share of AI coding tasks that introduce critical flaws or security issues near 45%. Generation got radically cheaper. Validation did not. The cost of poor software quality is estimated at around $2.41 trillion in the US alone, and that number is just the aggregate of incidents, breaches, and rework flowing from changes nobody could fully vouch for.

That shift is what turned a critique into a category. When humans wrote most of the code, you could almost pretend that more reviewers and more tests would scale with the risk. They can't anymore. You cannot human-review your way out of machine-speed production. But the answer is not the 2025 fantasy of removing humans entirely and letting agents rewrite production unsupervised. I've operated systems at scale; unsupervised autonomous fixing is how you turn one incident into ten. The answer is governed autonomy: agents propose, humans authorize.

That principle is the spine of how Zof works. The closed loop is the operational form of the TAS23 thesis: understand the system, test against it, reproduce what fails, remediate under governance, verify the fix held.

  • Understand. A System Graph maps services, dependencies, and CI/CD so validation is change-aware. Reachability-based prioritization built on that map can cut the exploitable exposure you have to chase by 70 to 90%, because you stop triaging findings that can't actually be hit.
  • Validate continuously. Testing Fleets plan, execute, observe, and maintain validation as the system evolves, instead of decaying between releases like static scripts.
  • Remediate under control. Remediation Fleets operate inside policy, approval, and audit. The agent proposes the fix. A human owns the decision.
  • Prove it. The output isn't a green check. It's an audit-ready record of what was tested, what was found, what was fixed, and who signed off.
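The "agents propose, humans authorize" rule and the audit record can be sketched as a small approval gate. This is an illustrative sketch under stated assumptions, not Zof's API: the `Proposal` class, the `propose`, `decide`, and `apply_fix` functions, and the actor names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Proposal:
    change_id: str
    summary: str
    proposed_by: str             # identity of the proposing agent
    status: str = "pending"      # pending -> approved | rejected
    audit: list = field(default_factory=list)

    def log(self, actor, action):
        # Append-only trail of who did what, and when (UTC).
        self.audit.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
        })


def propose(change_id, summary, agent):
    """An agent may only create a pending proposal."""
    proposal = Proposal(change_id, summary, agent)
    proposal.log(agent, "proposed")
    return proposal


def decide(proposal, human, approve):
    """A named human, distinct from the proposer, owns the decision."""
    if proposal.status != "pending":
        raise ValueError("proposal already decided")
    if human == proposal.proposed_by:
        raise PermissionError("proposer cannot authorize its own fix")
    proposal.status = "approved" if approve else "rejected"
    proposal.log(human, proposal.status)
    return proposal


def apply_fix(proposal):
    """Refuse to touch anything without an explicit approval."""
    if proposal.status != "approved":
        raise PermissionError("fix requires human approval")
    proposal.log("system", "applied")
    return True


if __name__ == "__main__":
    p = propose("chg-1", "pin vulnerable dependency", "remediation-agent")
    decide(p, "alice", approve=True)
    apply_fix(p)
    for entry in p.audit:
        print(entry["actor"], entry["action"])
```

The design choice worth noticing is that `apply_fix` checks the approval itself rather than trusting the caller, so the gate cannot be skipped by calling it directly, and every transition lands in the same record.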

That's the difference between a QA tool and reliability infrastructure. A tool gives you a result. Infrastructure owns the release decision and can prove the call it made. We named that the control layer because that's what it is: the plane above your existing stack that decides, with evidence, what is allowed to ship.

05

What this means if you're a founder

I'm not writing this to sell you on a tool category. I'm writing it because the reframe is useful even before you buy anything. If you're building a company on AI-generated software, three things follow directly from the arc above.

First, stop measuring reliability by activity. Test count, scan count, and tool count are inputs, not outcomes. The outcome is whether you can answer the release question with evidence.

Second, find out who actually owns that answer today. If it's one engineer reading five dashboards at deploy time, you don't have a tooling gap. You have a control gap, and adding a sixth tool deepens it.

Third, decide what role you want autonomy to play before the pressure hits, not during an incident. The defensible posture for a serious enterprise isn't more AI. It's more control: autonomy that proposes and humans who authorize, with an audit trail either way. If that resonates, the build-vs-buy comparison and our about page lay out the thesis in more depth.

06

The bottom line
