Security and governance

41% AI Codebases Shatter Legacy QA Assumptions

Explore how AI-generated code is challenging and transforming traditional QA practices.

Zof Reliability Team · Engineering and product

February 4, 2025 · 7 min read · Updated February 4, 2025

Summary

Traditional QA was not designed badly. It was designed for a world that no longer exists, one where a human wrote every line, change arrived at human speed, and a reviewer understood the intent behind each diff. Roughly 41% of code is now AI-generated, and that single shift invalidates several premises your test strategy quietly depends on. The risk is not that QA gets harder. It is that QA keeps reporting green against assumptions that are no longer true.

This is a trends piece for the people who own that risk: heads of engineering, platform leads, and the SREs who get paged when a "tested" release fails in a way the suite never anticipated. The argument is simple. Most QA practices encode beliefs about how code is produced. AI broke those beliefs faster than teams updated the practices. Below are the five that matter, why each one failed, and what to do before it costs you an incident.

Assumption 1: a human understood the intent behind every change

Code review, the most trusted gate in most organizations, runs on a hidden premise: that the author can explain why the change is the way it is. A reviewer asks "why this approach?" and gets a real answer rooted in someone's mental model of the system. That conversation is where most subtle defects die.

AI-generated change weakens that loop on both ends. The author often did not write the code so much as accept it, and the model that produced it has no durable model of your system, your incident history, or which services cannot tolerate a regression. The reviewer is now reviewing output that nobody fully reasoned about, at a volume that keeps climbing. Industry research puts the share of AI coding tasks that introduce a critical flaw or security issue near 45%. Those flaws are rarely syntax errors a reviewer would catch. They are changes that are locally plausible and globally wrong, which is exactly the class of defect that "looks fine to me" approves.

The fix is not to slow down review. It is to stop treating human attention as the thing that understands the change, and to give the gate a model of the system that the human no longer holds in their head. That is what a System Graph provides: a live map of services, dependencies, and CI/CD so validation knows what a change actually reaches, not just what it looks like.
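To make "what a change actually reaches" concrete, here is a minimal sketch. It assumes the system map can be reduced to a plain dictionary from each service to the services it depends on. The service names and the `blast_radius` helper are hypothetical and chosen only for illustration; this is not the System Graph's actual API.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "checkout": ["pricing", "payments"],
    "pricing": ["catalog"],
    "payments": ["ledger"],
    "search": ["catalog"],
    "catalog": [],
    "ledger": [],
}


def blast_radius(changed, depends_on):
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted edges.
    reached, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in reached:
                reached.add(caller)
                queue.append(caller)
    return reached


print(sorted(blast_radius("catalog", DEPENDS_ON)))
```

A change to the hypothetical `catalog` service reaches `pricing` and `search` directly and `checkout` through `pricing`, even though a diff confined to `catalog` shows none of them.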

Assumption 2: the rate of risky change is roughly proportional to headcount

Validation strategy is usually sized to the team. More engineers, more review capacity, more test maintenance budget. The unspoken math is that risk scales with people, because people are what produce change.

AI severs that proportionality. One engineer with a capable assistant can generate change that would have taken a squad a sprint, while the review and security capacity to inspect it has not grown at all. So the gap between how fast risky change is produced and how fast it can be validated widens with every productivity gain. Counterintuitively, the more "productive" your AI adoption looks on a velocity dashboard, the larger your ungoverned surface may be. Velocity dashboards measure output. They do not measure whether anything verified it.

The practical signal to watch is your override rate. Roughly 80% of developers already bypass policy and guardrails, and that number gets worse, not better, as generation accelerates, because slow blanket gates cannot keep pace with machine-speed output. When a gate punishes every change equally, teams route around it under deadline. The gates that survive are the ones that respond proportionately to what changed.
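As a rough illustration of watching the override rate, the sketch below computes it per gate. It assumes gate outcomes can be exported as simple (gate, outcome) pairs; the gate names, the outcome labels, and the `override_rates` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical gate events: (gate name, outcome), where outcome is
# "passed", "blocked", or "overridden" (someone bypassed the gate).
EVENTS = [
    ("sast-scan", "passed"),
    ("sast-scan", "overridden"),
    ("sast-scan", "overridden"),
    ("sast-scan", "blocked"),
    ("unit-tests", "passed"),
    ("unit-tests", "passed"),
    ("unit-tests", "overridden"),
    ("unit-tests", "passed"),
]


def override_rates(events):
    """Map each gate to the fraction of its runs that ended in an override."""
    total, overridden = Counter(), Counter()
    for gate, outcome in events:
        total[gate] += 1
        if outcome == "overridden":
            overridden[gate] += 1
    return {gate: overridden[gate] / total[gate] for gate in total}


rates = override_rates(EVENTS)
most_bypassed = max(rates, key=rates.get)
print(most_bypassed, rates[most_bypassed])
```

The gate with the highest rate is the first candidate for a proportionate redesign, since a high rate suggests teams already treat it as optional.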

Assumption 3: defects are local, so testing the changed area is enough

Most test prioritization assumes locality: a change here mostly affects things near here, so validate the diff and its immediate neighbors. This is why "we ran the tests for the files that changed" feels like diligence.

AI-generated code breaks locality in a quiet way. Models optimize for producing something that satisfies the prompt, not for respecting the architectural boundaries your team negotiated over years. The result is change that reaches further than it appears to: a helper that silently couples two domains, an "innocent" refactor that alters a contract three services downstream depend on. The defect is not where you are looking, because the change does not respect the map your prioritization assumed.

This is where reachability matters in both directions. You need to know what a change can reach to test it correctly, and you need to know what an attacker can reach to triage findings sanely. Reachability-based prioritization can mean 70 to 90% less exploitable exposure to act on, which is the difference between a queue a team can clear and one it learns to ignore. Neither is possible without an actual model of what connects to what. Guessing the blast radius was always a tax. At AI volume it becomes a liability.
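The attacker-facing direction can be sketched the same way. The example below assumes a call graph and a flat list of scanner findings. The component names, finding IDs, severity scale, and both helpers are hypothetical, and real reachability analysis is considerably more involved than a graph walk.

```python
from collections import deque

# Hypothetical call graph: component -> components it can invoke.
CALLS = {
    "public-api": ["auth", "orders"],
    "orders": ["inventory"],
    "auth": [],
    "inventory": [],
    "batch-report": ["legacy-export"],
    "legacy-export": [],
}

# Hypothetical scanner output: (finding id, affected component, severity 1-10).
FINDINGS = [
    ("F-1", "legacy-export", 9),
    ("F-2", "inventory", 7),
    ("F-3", "auth", 5),
    ("F-4", "batch-report", 8),
]


def reachable_from(entry_points, calls):
    """Return every component reachable from the given entry points."""
    seen, queue = set(entry_points), deque(entry_points)
    while queue:
        for nxt in calls.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def rank_findings(findings, entry_points, calls):
    """Order findings so reachable ones come first, then by descending severity."""
    exposed = reachable_from(entry_points, calls)
    return sorted(findings, key=lambda f: (f[1] not in exposed, -f[2]))


for finding in rank_findings(FINDINGS, ["public-api"], CALLS):
    print(finding)
```

In this toy data the two highest-severity findings sit in components nothing external can reach, so they drop below lower-severity findings that are actually exposed.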

Assumption 4: a green suite means the system is healthy

Test suites are snapshots of intent. They encode what the system was supposed to do when someone last wrote them down. The deep assumption is that the suite and the system stay roughly in sync, so a passing suite is evidence of a working system.

AI accelerates the drift between those two things. The architecture moves faster than the assertions describing it, so suites keep passing while the reality underneath shifts. A green suite that no longer reflects the system is worse than no suite, because it manufactures confidence at precisely the moment you should be skeptical. Add a scanner emitting four hundred findings with no sense of which twelve are reachable, and you have trained your team to ignore the signal entirely. Alert fatigue is just bypass with extra steps.

The replacement for static snapshots is validation that maintains itself as the system evolves. Testing Fleets plan, execute, observe, and maintain coverage as architecture changes, rather than running scripts that rot the moment something moves. The goal is continuity, not a coverage number that looks healthy and means nothing.

Assumption 5: "we reviewed it" is a defensible answer

For years, "a qualified human reviewed and approved this" was a sufficient account of release readiness, to a regulator, a board, or a customer security questionnaire. It worked because review was the binding constraint and review was human.

As AI authorship rises toward the majority of change, that answer thins out. Saying a human reviewed it means less when humans are reviewing a shrinking fraction of total change and the rest is accepted output. The expectation is shifting from attestation to evidence: for this release, here is what changed, what was validated, what policy applied, what was reachable, and who authorized it. The estimated $2.41 trillion cost of poor software quality was accumulated mostly by human-authored code at human speed; the same defect-injection problem at machine speed against a growing majority of your codebase is how a tolerable tax becomes an existential one. Evidence is what keeps that curve from bending the wrong way.
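One way to picture the shift from attestation to evidence is a per-release record with a field for each question listed above. This is a sketch under assumptions: the `ReleaseEvidence` class, its field names, and the sample values are hypothetical, not a standard or a product schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ReleaseEvidence:
    changed: list = field(default_factory=list)    # what changed
    validated: list = field(default_factory=list)  # what was validated
    policy: str = ""                               # what policy applied
    reachable: list = field(default_factory=list)  # what the change could reach
    authorized_by: str = ""                        # who authorized it


def missing_evidence(record):
    """Return the names of evidence fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = ReleaseEvidence(
    changed=["pricing/discounts.py"],
    validated=["pricing unit tests", "checkout contract tests"],
    policy="high-risk-change",
    reachable=["checkout"],
)
print(missing_evidence(record))
```

A release gate could refuse to proceed while `missing_evidence` returns anything, which turns "we reviewed it" into a record that can be checked.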

This is the part teams get wrong by over-correcting. The answer to ungoverned AI is not to remove the human, and it is not to let agents fix and ship unsupervised. Agents propose; humans authorize. Remediation Fleets generate the change, and governance decides whether it proceeds, with an audit trail as a byproduct of normal operation. Remediation is the hardest, highest-consequence part of the loop, so the engineering is in the policy and approval, not the speed of the patch.
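The "agents propose; humans authorize" split can be sketched as a small approval gate. Everything here is illustrative: the `Proposal` class, the `authorize` and `can_ship` helpers, and the names in `HUMANS` are hypothetical, and a real system would verify identity rather than compare strings.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proposal:
    change_id: str
    proposed_by: str                      # the agent that generated the fix
    approved_by: Optional[str] = None     # set only by a successful authorization
    audit_log: list = field(default_factory=list)


def authorize(proposal, approver, human_approvers):
    """Record an approval only if the approver is a named human."""
    if approver not in human_approvers:
        proposal.audit_log.append(f"rejected approval attempt by {approver}")
        return False
    proposal.approved_by = approver
    proposal.audit_log.append(f"approved by {approver}")
    return True


def can_ship(proposal):
    """A proposal ships only once a human approval has been recorded."""
    return proposal.approved_by is not None


HUMANS = {"alice"}
fix = Proposal(change_id="CHG-1", proposed_by="remediation-agent")
authorize(fix, "remediation-agent", HUMANS)  # the agent cannot approve its own change
print(can_ship(fix))
authorize(fix, "alice", HUMANS)
print(can_ship(fix), fix.audit_log)
```

The audit log is written by the same calls that make the decision, so the trail exists as a side effect of normal operation and not as a separate reporting step.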

What to do Monday morning

You do not need a platform migration to start. You need to stop running on assumptions the 41% already invalidated.

Measure your AI authorship share and where it lands. If you cannot state roughly what percentage of recent change is AI-assisted, you are operating on human-paced safety assumptions.
Treat your override rate as your real policy. Find your most-bypassed gate. If it is skipped because it is slow or noisy, the fix is proportionality, not stricter enforcement.
Re-rank one finding queue by reachability. Cutting non-exploitable noise is the fastest way to make a team trust its own tooling again.
Name your authorizer. For your riskiest change class, write down who authorizes release and on what evidence.
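For the first step, a rough measurement is enough to start. The sketch below assumes each commit can be labeled as AI-assisted or not, for example through a commit trailer the team agrees to add. The `ai_assisted` flag, the area names, and the `ai_share_by_area` helper are hypothetical.

```python
# Hypothetical commit records; the `ai_assisted` flag is an assumption,
# for example derived from a commit trailer the team agrees to add.
COMMITS = [
    {"area": "payments", "lines_changed": 120, "ai_assisted": True},
    {"area": "payments", "lines_changed": 80, "ai_assisted": False},
    {"area": "search", "lines_changed": 300, "ai_assisted": True},
    {"area": "search", "lines_changed": 100, "ai_assisted": True},
]


def ai_share_by_area(commits):
    """Return the fraction of changed lines per area that came from AI-assisted commits."""
    totals, ai = {}, {}
    for commit in commits:
        area = commit["area"]
        totals[area] = totals.get(area, 0) + commit["lines_changed"]
        if commit["ai_assisted"]:
            ai[area] = ai.get(area, 0) + commit["lines_changed"]
    return {area: ai.get(area, 0) / totals[area] for area in totals}


print(ai_share_by_area(COMMITS))
```

Breaking the share down by area shows where it lands, not just how large it is, which matters more when the high-share area is also the one that cannot tolerate a regression.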

The bottom line
