
12 Ways AI Coding Assistants Quietly Introduce Critical Flaws

Industry research finds ~45% of AI coding tasks introduce critical flaws. Here are 12 concrete ways that happens, and how to govern it.

Zof Reliability Team · Engineering & product

November 11, 2025 · 8 min read · Updated November 11, 2025

01

Flaws of fabrication: the model invents what should exist

1. Hallucinated dependencies. The assistant imports a package that does not exist, or names a real package incorrectly. In the best case the build fails loudly. In the worst case, an attacker has already registered that hallucinated name on a public registry and seeded it with malware, so your install step pulls hostile code. This "slopsquatting" pattern works precisely because the suggested name looks reasonable. Review rarely catches it, because reviewers read logic, not the provenance of every import.
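One way to make import provenance a mechanical check rather than a reviewer's burden is to compare what a change imports against what the team has approved. The sketch below is a minimal illustration, not a complete supply-chain control: the `APPROVED` allowlist and the package name `fastjsonx` are made up for the example, and it assumes import names match approved names (real projects often need a mapping from import name to distribution name).

```python
import ast

def imported_top_level_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are local code.
            modules.add(node.module.split(".")[0])
    return modules

def unapproved_imports(source: str, approved: set[str]) -> set[str]:
    """Imports that are not on the approved list and need a human look."""
    return imported_top_level_modules(source) - approved

# Hypothetical allowlist, e.g. derived from a reviewed lockfile.
APPROVED = {"json", "requests"}

# Hypothetical assistant-written snippet; `fastjsonx` is an invented name.
snippet = "import json\nimport requests\nfrom fastjsonx import loads\n"

print(unapproved_imports(snippet, APPROVED))
```

A check like this only flags names for review. Deciding whether a flagged package is legitimate still takes a person or a registry lookup.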

2. Phantom APIs and method signatures. The model calls a function that was deprecated two versions ago, or invents a parameter that the library never accepted, because it pattern-matched across training data that spanned incompatible versions. The code reads correctly. It compiles against the wrong assumptions and fails only when a specific path executes in your actual environment.
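A cheap guard against invented parameters is to bind the intended call against the signature of the function actually installed, before any code path depends on it. The following is a sketch under stated assumptions: `fetch` is a hypothetical stand-in for a library function, and the check cannot see inside functions that accept `**kwargs`.

```python
import inspect

def accepts_call(func, *args, **kwargs) -> bool:
    """Return True if `func` could be called with these arguments.

    The function is never executed; the arguments are only bound
    against its declared signature.
    """
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True

# Hypothetical stand-in for a third-party function in your environment.
def fetch(url, timeout=10):
    return (url, timeout)

# A parameter that exists in the installed version binds cleanly.
print(accepts_call(fetch, "https://example.test", timeout=5))

# A parameter the assistant invented does not.
print(accepts_call(fetch, "https://example.test", retries=3))
```

Some built-in callables expose no signature and make `inspect.signature` raise `ValueError`, so treat this as a smoke test for pure-Python APIs rather than a guarantee.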

3. Confident-but-wrong logic. The most dangerous fabrications are subtle: an off-by-one in a pagination boundary, an inverted conditional in a retry, a rounding choice in a financial calculation. The assistant produces clean, idiomatic, well-commented code that is simply incorrect about your domain. There is no syntax error to flag it. It passes the happy-path test the same assistant wrote for it.
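The pagination case is easy to make concrete. In this illustrative sketch (both functions are invented for the example), the buggy version looks idiomatic and passes a single happy-path test, and only a small table of boundary cases written from the requirement exposes it.

```python
def page_count_buggy(total_items: int, page_size: int) -> int:
    # Reads fine and is right for 11 items at 10 per page, but it
    # over-counts whenever the total is an exact multiple of the page size.
    return total_items // page_size + 1

def page_count(total_items: int, page_size: int) -> int:
    if total_items < 0 or page_size <= 0:
        raise ValueError("total_items must be >= 0 and page_size must be > 0")
    # Ceiling division using integers only.
    return -(-total_items // page_size)

# (total_items, page_size, expected_pages), written from the requirement,
# not from whatever the implementation currently returns.
BOUNDARY_CASES = [
    (0, 10, 0),
    (1, 10, 1),
    (9, 10, 1),
    (10, 10, 1),
    (11, 10, 2),
    (20, 10, 2),
]

def failures(fn):
    """Return the boundary cases where `fn` disagrees with the expectation."""
    return [
        (total, size, expected, fn(total, size))
        for total, size, expected in BOUNDARY_CASES
        if fn(total, size) != expected
    ]

print(failures(page_count_buggy))
print(failures(page_count))
```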

02

Flaws of omission: the defaults that ship insecure

4. Insecure defaults. Asked to "set up a database connection" or "add file upload," assistants reach for the simplest configuration that demonstrates the feature: TLS verification disabled, permissive CORS, debug mode on, world-readable storage. Each is a reasonable demo default and a production liability. Nobody decided to weaken security. The assistant optimized for "works in the example," and that bias compounds across every scaffold it generates.
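Demo defaults can be caught before they ship by checking the effective configuration against a short list of known-risky values. The setting names below are hypothetical placeholders, not the keys of any particular framework, so a real check would need to map them onto your stack.

```python
# Hypothetical setting names mapped to the value that is fine in a demo
# and a liability in production.
RISKY_SETTINGS = {
    "verify_tls": False,
    "debug": True,
    "cors_allow_origins": "*",
    "storage_acl": "public-read",
}

def risky_defaults(config: dict) -> list[str]:
    """Return the names of settings that still carry a demo-grade value."""
    return sorted(
        name
        for name, risky_value in RISKY_SETTINGS.items()
        if name in config and config[name] == risky_value
    )

# Example: a scaffold where debug was turned off but the rest was left alone.
generated_config = {
    "verify_tls": False,
    "debug": False,
    "cors_allow_origins": "*",
}

print(risky_defaults(generated_config))
```

Running this as a gate on production configuration turns "nobody decided to weaken security" into an explicit decision someone has to make.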

5. Missing input validation. Generated handlers frequently trust their inputs. The code accepts a request, parses it, and acts on it without bounding length, checking type, or sanitizing what flows into a query or a shell. This is how SQL injection, command injection, and path traversal re-enter codebases that "solved" those problems a decade ago. The assistant wasn't told the input was hostile, so it assumed it wasn't.
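As a minimal sketch of what "treat the input as hostile" looks like, the handler below bounds the input and passes it to the database as a bound parameter instead of splicing it into the SQL string. The `users` table and the 64-character limit are assumptions made for the example.

```python
import sqlite3

MAX_USERNAME_LENGTH = 64  # Assumed limit for this example.

def find_user(conn: sqlite3.Connection, username: str) -> list[tuple]:
    """Look up a user by exact name, treating the input as untrusted."""
    if not isinstance(username, str):
        raise ValueError("username must be a string")
    if not 1 <= len(username) <= MAX_USERNAME_LENGTH:
        raise ValueError("username must be 1 to 64 characters")
    # The `?` placeholder sends the value as data, never as SQL text.
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)])
conn.commit()

print(find_user(conn, "alice"))
# A classic injection string is just a name that matches nobody.
print(find_user(conn, "' OR '1'='1"))
```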

6. Absent error handling around the dangerous path. Assistants tend to write the success case cleanly and treat failure as an afterthought. The result is unhandled exceptions that leak stack traces, partial writes that corrupt state, and retries that double-charge. The code looks complete because the part you read first works.
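The partial-write case can be illustrated with a transfer that must either fully happen or not happen at all. This sketch uses an in-memory SQLite database and an invented `accounts` table; the point is the shape (one transaction around both writes, with explicit checks on each), not the schema.

```python
import sqlite3

def transfer(conn: sqlite3.Connection, src: int, dst: int, amount_cents: int) -> None:
    """Move money between accounts atomically."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a failure after the debit cannot leave the debit behind.
    with conn:
        debited = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? "
            "WHERE id = ? AND balance_cents >= ?",
            (amount_cents, src, amount_cents),
        )
        if debited.rowcount != 1:
            raise LookupError("source account missing or underfunded")
        credited = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, dst),
        )
        if credited.rowcount != 1:
            raise LookupError("destination account missing")

def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT id, balance_cents FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance_cents INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 10000), (2, 0)])
conn.commit()

transfer(conn, 1, 2, 3000)
print(balances(conn))
```

If the destination does not exist, the debit that already ran is rolled back along with everything else in the block, which is exactly the failure path generated code tends to leave out.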

03

Flaws of authority: who is allowed to do what

7. Broken authorization. This is the failure mode that should worry an engineering manager most, because it is invisible to functional testing. The assistant correctly checks that a user is authenticated, then forgets to check whether *this* user is allowed to touch *this* resource. The endpoint works perfectly for the developer testing it with their own account. It also lets any authenticated user read any other user's records by changing an ID in the URL. Static suites pass. The flaw is the absence of a check, and you cannot easily test for the absence of something nobody specified.
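The missing check is usually one comparison. The sketch below uses an invented in-memory `INVOICES` store and made-up exception types to show the object-level check, plus the negative case that a developer testing with their own account never exercises.

```python
from dataclasses import dataclass

class NotFound(Exception):
    pass

class Forbidden(Exception):
    pass

@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int
    total_cents: int

# Hypothetical data store for the example.
INVOICES = {
    1: Invoice(1, owner_id=10, total_cents=5000),
    2: Invoice(2, owner_id=20, total_cents=9900),
}

def get_invoice(current_user_id: int, invoice_id: int) -> Invoice:
    """Return an invoice only if the caller owns it.

    `current_user_id` is assumed to come from an already-authenticated session.
    """
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        raise NotFound(invoice_id)
    # Authentication proved who the caller is. This line checks whether
    # that caller may touch this particular record.
    if invoice.owner_id != current_user_id:
        raise Forbidden(invoice_id)
    return invoice

# The happy path a developer naturally tests.
print(get_invoice(10, 1).total_cents)

# The negative test that has to be written on purpose.
try:
    get_invoice(10, 2)
except Forbidden:
    print("user 10 cannot read invoice 2")
```

Some teams prefer to return "not found" instead of "forbidden" here so that record IDs cannot be enumerated. Either way, the requirement has to be specified before it can be tested.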

8. Privilege and scope creep. Generated infrastructure-as-code and service configurations lean toward broad permissions, because broad permissions never block the demo. An IAM role gets * where it needed three actions. A service account is granted admin where it needed read. Every one of these is a quiet expansion of blast radius that no functional test will ever exercise.
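Wildcard grants are easy to detect mechanically even though no functional test exercises them. This is a rough sketch that assumes the common JSON shape of an AWS IAM policy document (`Statement`, `Effect`, `Action`); it ignores `NotAction`, resources, and conditions, so it is a starting point rather than a policy analyzer.

```python
def wildcard_actions(policy: dict) -> list[str]:
    """Return the wildcard actions granted by Allow statements in a policy."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may carry a single statement object instead of a list.
        statements = [statements]
    findings = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        findings.extend(a for a in actions if a == "*" or a.endswith(":*"))
    return findings

# Example policy of the kind a scaffold might generate.
generated_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:*"], "Resource": "*"},
        {"Effect": "Deny", "Action": "*", "Resource": "*"},
    ],
}

print(wildcard_actions(generated_policy))
```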

04

Flaws of exposure: secrets and data that leak

9. Hardcoded secrets. When an assistant needs a credential to make example code run, it inserts a plausible placeholder, and developers frequently replace it with a real key inline rather than wiring up a secret manager. The key then rides into version control. Industry data on leaked credentials in public repositories has been climbing for years; assistant-authored scaffolds accelerate the pattern by normalizing the inline-key shape.
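The alternative to the inline-key shape is small enough to put in every scaffold. In this sketch the secret is read from the process environment (which a secret manager or deployment platform would populate) and startup fails loudly if it is missing. The variable name `PAYMENTS_API_KEY` is a placeholder.

```python
import os

def require_secret(name: str) -> str:
    """Fetch a secret from the environment or fail fast.

    The error names the missing variable but never echoes a value.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Set here only so the example runs; in real use the deployment
# environment supplies this and no key appears in source control.
os.environ["PAYMENTS_API_KEY"] = "example-value-not-a-real-key"

api_key = require_secret("PAYMENTS_API_KEY")
print(len(api_key) > 0)
```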

10. Sensitive data in logs and responses. To be "helpful," generated code over-shares: full request bodies logged at info level, internal error details returned to the client, PII written to traces. None of it is malicious. It is the model defaulting to verbosity, and verbosity around sensitive data is a breach waiting for a compliance audit.
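One hedge against over-sharing is to redact by key name before anything reaches a log line or an error response. The key list below is an assumption for illustration; a real deployment would derive it from its own data classification.

```python
# Assumed set of sensitive field names for this example.
SENSITIVE_KEYS = {"password", "token", "ssn", "card_number", "authorization"}

def redact(value):
    """Return a copy of a JSON-like structure with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value

request_body = {
    "user": "alice",
    "password": "hunter2",
    "payment": {"card_number": "4111111111111111", "amount_cents": 1200},
    "history": [{"token": "abc123", "ok": True}],
}

# Log the redacted copy, never the raw body.
print(redact(request_body))
```

Key-based redaction misses sensitive values stored under innocuous names, so it complements, rather than replaces, logging less in the first place.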

05

Flaws of context: code that's right in isolation, wrong in your system

11. Locally correct, globally breaking changes. An assistant edits one service competently while remaining blind to the forty services downstream that depend on its current behavior. It changes a response shape, tightens a timeout, or renames a field. The change is correct in the file you are reviewing and catastrophic in the dependency graph you are not. Reachability matters here: knowing whether a change actually touches a critical path is the difference between a routine merge and an incident. This is also why reachability-based prioritization, done well, can mean 70 to 90% less exploitable exposure: you stop chasing flaws that can't be reached and focus on the ones that can.
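At its simplest, "what can this change reach" is a graph traversal. The sketch below walks a toy dependency map backwards to find every service that depends, directly or transitively, on the one being changed. The service names and the `DEPENDS_ON` map are invented, and this is an illustration of the idea rather than how any particular product computes reachability.

```python
from collections import deque

# Hypothetical map: service -> services it calls.
DEPENDS_ON = {
    "checkout": ["payments", "catalog"],
    "payments": ["ledger"],
    "reports": ["ledger"],
    "catalog": [],
    "ledger": [],
}

def affected_services(changed: str, depends_on: dict) -> set[str]:
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the edges: for each service, who calls it?
    dependents = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller != changed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

# A change to the ledger reaches far more than the file under review.
print(sorted(affected_services("ledger", DEPENDS_ON)))
```

A change to `checkout` in this toy graph affects nothing upstream, which is the kind of distinction that lets a team spend review effort where the blast radius actually is.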

12. Confidently wrong test coverage. The most corrosive flaw is the one that hides the other eleven. Ask an assistant to write tests and it will write tests that pass, often by asserting the behavior the code already has, bugs included. Coverage numbers climb. Confidence climbs with them. But a test that codifies a defect is worse than no test, because it converts an open question into a false reassurance and makes the next engineer trust the broken path.

06

Why review and a pile of scanners don't close this

Each flaw above has a tool that catches some of it: a linter, a secret scanner, a SAST pass, an SCA check. So why does the 45% number persist? Because the flaws live in the *seams*. A change that passes four tools in isolation can still be the broken-authorization change, because no individual scanner understands how it propagates through your system. And the human-process answer fails for a measurable reason: roughly 80% of developers bypass policy or guardrails when those guardrails are slow or noisy. A control that gets routed around four times out of five is theater, not protection. This is not a moral failing of engineers; it is what happens when controls add friction without leverage.

The economics are not subtle either. The cost of poor software quality is estimated at around $2.41 trillion. A meaningful share of that is the aggregate of exactly these defects shipping at machine speed and getting caught, if at all, in production.

07

What to do Monday morning

You do not need to ban assistants. You need to stop treating their output as trusted by default and start governing it as untrusted change.

  • Treat every assisted change as untrusted until validated against the system, not the diff. A live System Graph of services, dependencies, and CI/CD lets you ask what a change can actually reach, instead of reviewing it in isolation.
  • Replace static suites with validation that keeps pace. Testing Fleets plan and execute checks that are aware of what changed and what depends on it, so coverage doesn't decay and assistant-written tests don't get to grade their own homework.
  • Make remediation governed, never unsupervised. When the system proposes a fix, a human authorizes it. Remediation under policy, approval, and audit is the engineering work. Agents propose; humans authorize.
  • Demand evidence, not a green check. "Tests passed" is a status. A record of what was tested, what was found, what was fixed, and who signed off is evidence, and it is the only thing that survives an audit or a breach review.
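To make the difference between a status and evidence concrete, here is a minimal sketch of an evidence record covering what was tested, what was found, what was fixed, and who signed off. The field names are assumptions, and the SHA-256 digest only makes later edits detectable; it is not a cryptographic signature and does not prove who created the record.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvidenceRecord:
    change_id: str
    checks_run: tuple[str, ...]
    findings: tuple[str, ...]
    fixes_applied: tuple[str, ...]
    approved_by: str

    def digest(self) -> str:
        """Stable fingerprint of the record's contents."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical record for one assisted change.
record = EvidenceRecord(
    change_id="PR-EXAMPLE",
    checks_run=("unit", "authorization-negative", "dependency-allowlist"),
    findings=("missing ownership check on invoice lookup",),
    fixes_applied=("added owner comparison before returning invoice",),
    approved_by="named.reviewer@example.test",
)

# Store the digest alongside the record; any later edit changes it.
print(record.digest())
```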

Consider a hypothetical fintech team merging forty AI-assisted PRs a day. Adding a sixth scanner gives them a sixth queue. Putting a governed control layer above the stack gives them one verdict per release, scoped to what each change can reach, with a signed record of the call.

08

The bottom line
