AI Is Missing a Control Layer, Not More Models
More capable models won't make software reliable. A first-principles teardown of why reliability is a system property and the missing piece is a governed control layer.
Capability and correctness are orthogonal
Start from first principles. A model's job is to produce a plausible output for a given input. A reliability system's job is to guarantee that what ships meets a defined bar, every time, with evidence. These are not the same skill made better by scale. They are different functions with different success criteria.
A more capable model produces better first drafts. It does not produce a verdict on whether the change is safe to deploy into your specific system, against your specific dependencies, under your specific policies. No amount of parameter count answers "will this break the payments path under load?" because that answer lives in your runtime, your data, and your topology, not in the model's weights. Correctness is a property of the system the code runs in. Capability is a property of the thing that wrote the code. You cannot scale your way from one to the other.
Here is the uncomfortable corollary: a more capable model can make the reliability problem worse. It writes more code, faster, that looks more correct on inspection, which means more changes clear human review on plausibility rather than proof. The better the model, the more confidently wrong code slips through, because the surface signals humans use to catch sloppy work (awkward structure, obvious gaps) get smoother as capability rises. Fluency is not safety. Often it's camouflage.
The evidence says the gap is governance, not generation
The numbers describe an industry that solved the wrong half of the problem. By some industry estimates, roughly 41% of code is now AI-generated. Generation is effectively solved and getting cheaper by the month. But industry research puts the share of AI coding tasks that introduce critical flaws or security issues near 45%. We got radically better at producing change and barely better at validating it.
That asymmetry is not theoretical. CISQ's 2022 report estimates the cost of poor software quality in the US at around $2.41 trillion, the aggregate bill for incidents, breaches, and rework that flow from shipping changes nobody could fully vouch for. More capable models pour more code into the top of that funnel. They do nothing about the leak at the bottom.
The most damning figure is behavioral: roughly 80% of developers bypass policy or guardrails. Read that as the verdict on controls that are advisory, slow, or disconnected from the work. A better model does not change this. If anything it accelerates the bypass, because it produces more changes faster, and the only controls in the way are the same friction engineers were already routing around. The bottleneck was never intelligence. It was authority and evidence.
Reliability is a system property, not a model property
The deepest reason more models can't fix reliability is architectural. Reliability emerges from the interaction of components, not from the quality of any single one. A change that four tools each bless in isolation can still be the change that takes you down, because none of them models how it propagates through your actual dependency graph.
This is why the model-scaling thesis is a category error. You're optimizing a component when the failure lives in the seams. A perfect model writing a perfect function still has no idea that the function it just wrote is called by a service whose retry logic, under a specific failure mode, amplifies its one edge case into an outage. That knowledge is system-level. It requires a live map of services, dependencies, and CI/CD that turns blind, brute-force checking into targeted, change-aware validation. In our architecture that map is the System Graph, and it is the thing that lets you ask "what can this specific change actually reach?" instead of re-running everything and hoping.
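To make "what can this specific change actually reach?" concrete, here is a minimal sketch of a reach query over a dependency map. It is not the System Graph itself: the service names, the `CALLS` structure, and the `change_reach` helper are all hypothetical, and a real system would build the map from live service and CI/CD data rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical dependency edges: each service maps to the services it calls.
CALLS = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "cart": ["catalog"],
    "reporting": ["ledger"],
    "catalog": [],
    "ledger": [],
}


def reverse_edges(calls):
    """Invert 'A calls B' into 'B is called by A'."""
    callers = {service: [] for service in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    return callers


def change_reach(changed, calls):
    """Return every service that directly or transitively depends on `changed`."""
    callers = reverse_edges(calls)
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


# A change to the ledger reaches payments and reporting directly,
# and checkout through payments.
print(sorted(change_reach("ledger", CALLS)))
```

Under these assumptions, validation for a ledger change can target three services instead of re-running checks for all six, while a change to `checkout` reaches nothing upstream of it.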
Reachability is the clearest illustration of why context beats raw capability. Reachability-based prioritization can mean 70 to 90% less exploitable exposure, because you stop triaging vulnerabilities in code paths that can't actually be hit. But reachability is only as good as your model of the system. A scanner without that context guesses. Context, not capability, is the multiplier.
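A small sketch of reachability-based triage, under stated assumptions: the call graph, entry points, and finding IDs below are invented for illustration, and the traversal is a plain graph walk rather than any particular scanner's method. It also shows the caveat above, since the result is only as trustworthy as the graph it is given.

```python
# Hypothetical call graph: each function maps to the functions it calls.
CALL_GRAPH = {
    "handle_request": ["parse_body", "charge_card"],
    "parse_body": ["yaml_load"],
    "charge_card": [],
    "legacy_export": ["xml_parse"],  # nothing in this graph calls legacy_export
    "yaml_load": [],
    "xml_parse": [],
}
ENTRY_POINTS = ["handle_request"]

# Hypothetical scanner findings, each tied to the function that contains it.
FINDINGS = [
    {"id": "VULN-1", "function": "yaml_load"},
    {"id": "VULN-2", "function": "xml_parse"},
]


def reachable_from(entry_points, graph):
    """Collect every function reachable from the given entry points."""
    seen = set()
    stack = list(entry_points)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(graph.get(fn, []))
    return seen


def triage(findings, entry_points, graph):
    """Split findings into those on a reachable path and those that are not."""
    live = reachable_from(entry_points, graph)
    reachable = [f for f in findings if f["function"] in live]
    unreachable = [f for f in findings if f["function"] not in live]
    return reachable, unreachable


reachable, unreachable = triage(FINDINGS, ENTRY_POINTS, CALL_GRAPH)
print("fix first:", [f["id"] for f in reachable])
print("deprioritize:", [f["id"] for f in unreachable])
```

Unreachable findings are deprioritized here, not dismissed: if the graph is missing an edge (for example, a dynamic call the map never saw), the split is wrong, which is the point about context being the multiplier.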
What the missing layer actually is
A control layer is not another model and not another scanner. It is the plane that sits above your tools and models and answers one question with evidence: does this change meet the bar to ship? Concretely, that requires four capabilities most stacks have never had in one place.
- A live model of the system. Validation has to be change-aware, which means it has to know what changed and what that change touches. Without the System Graph, every check is running blind.
- Validation that keeps pace. Static scripts rot the moment the system moves. Testing Fleets are coordinated agents that plan, execute, observe, and maintain validation as the system evolves, so coverage doesn't silently decay between releases.
- Governed action, not unsupervised action. When something breaks, Remediation Fleets can propose a fix, but they operate under policy, approval, and audit. Agents propose; humans authorize. Letting a model rewrite production code unsupervised isn't autonomy. It's an incident waiting for a postmortem. Remediation is the hardest part of the loop, which is exactly why governance, not raw model power, is the engineering.
- Evidence as the output. The deliverable isn't a green check. It's an audit-ready record of what was tested, what was found, what was fixed, and who approved it.
These tie together as a closed loop: understand the system, test against it, reproduce what fails, remediate under governance, verify the fix held. The loop produces a defensible answer. A model produces a guess. Those are different artifacts for different jobs.
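The loop and its evidence output can be sketched as a release gate. This is an illustration under assumptions, not the product's data model: `EvidenceRecord`, `ProposedFix`, `approve`, and `release_decision` are hypothetical names, and the rules encoded are only the ones stated above (validation ran, fixes are human-approved, and the fix was verified).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProposedFix:
    description: str
    proposed_by: str                   # hypothetical agent identifier
    approved_by: Optional[str] = None  # set only when a human signs off


@dataclass
class EvidenceRecord:
    change_id: str
    tested: List[str] = field(default_factory=list)    # what was tested
    findings: List[str] = field(default_factory=list)  # what was found
    fixes: List[ProposedFix] = field(default_factory=list)
    verified: bool = False                              # did the fix hold?


def approve(fix: ProposedFix, human: str) -> ProposedFix:
    """Record the human authorization for an agent-proposed fix."""
    fix.approved_by = human
    return fix


def release_decision(record: EvidenceRecord) -> Tuple[bool, List[str]]:
    """Return (ship, reasons); ship is True only when no blocking reason remains."""
    reasons = []
    if not record.tested:
        reasons.append("no validation evidence recorded")
    if record.findings and not record.fixes:
        reasons.append("findings without a proposed fix")
    if any(fix.approved_by is None for fix in record.fixes):
        reasons.append("fix awaiting human approval")
    if record.findings and not record.verified:
        reasons.append("fix not verified")
    return (not reasons, reasons)


record = EvidenceRecord(
    change_id="change-123",  # hypothetical identifier
    tested=["payments path under load"],
    findings=["retry storm on timeout"],
    fixes=[ProposedFix("cap retries with backoff", proposed_by="remediation-agent")],
)
print(release_decision(record))  # blocked: unapproved and unverified

approve(record.fixes[0], human="release-owner")
record.verified = True
print(release_decision(record))  # ships, with the record as the evidence
```

The design choice worth noting is that the gate returns reasons alongside the verdict, so a blocked release and a shipped one both leave a record of what was tested, found, fixed, and approved.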
What to do Monday morning
You don't need to swap models or rip out your stack. You need to change what owns the release decision and stop treating capability as a proxy for safety.
- Separate the two budgets. Track what you spend on generation capability versus what you spend on validation, governance, and evidence. If the ratio is lopsided toward writing code faster, you're funding the leak, not fixing it.
- Find the decision-maker. Ask who, or what, actually decides a release is safe. If it's the engineer whose name is on the deploy reading five dashboards, that's a control gap no model upgrade will close.
- Make change-awareness the requirement. Any validation that can't tell you what a specific change reaches is testing in the dark. Prioritize context over raw test volume and over model size.
- Demand evidence, not status. "The model is very capable and tests passed" is not a release argument. "Here is what we tested, what we found, what we fixed, and who signed off" is. Only the second survives an audit, a board question, or a breach.
Consider a hypothetical fintech team merging forty AI-assisted PRs a day across a tangle of services. Upgrading to a stronger coding model gets them fifty PRs a day and the same unanswered question at release. A control layer above the stack gives them one evidence-backed answer per release, scoped to what each change can actually reach. That's the difference between more output and more control.
The bottom line
