Security and Governance

12 Ways AI Coding Assistants Quietly Introduce Critical Flaws

Industry research finds ~45% of AI coding tasks introduce critical flaws. Here are 12 concrete ways that happens, and how to govern it.

Zof Reliability Team · Engineering & Product

November 11, 2025 · 8 min read · Updated November 11, 2025

01

Flaws of fabrication: the model invents what should exist

1. Hallucinated dependencies. The assistant imports a package that does not exist, or names a real package incorrectly. In the best case the build fails loudly. In the worst case, an attacker has already registered that hallucinated name on a public registry and seeded it with malware, so your install step pulls hostile code. This "slopsquatting" pattern works precisely because the suggested name looks reasonable. Review rarely catches it, because reviewers read logic, not the provenance of every import.

2. Phantom APIs and method signatures. The model calls a function that was deprecated two versions ago, or invents a parameter that the library never accepted, because it pattern-matched across training data that spanned incompatible versions. The code reads correctly. It compiles against the wrong assumptions and fails only when a specific path executes in your actual environment.
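An invented parameter can often be caught without running the risky path, by binding the proposed arguments against the signature of the function that is actually installed. This sketch uses the standard library's `inspect.signature`; `fetch` is a hypothetical stand-in for a library function, and some C-implemented callables expose no signature, so the check will not cover everything.

```python
import inspect


def accepts_call(func, *args, **kwargs) -> bool:
    """Return True if `func` could be called with these arguments, judged
    only by its signature. The function itself is never executed."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


def fetch(url, timeout=10):
    """Hypothetical stand-in for a library function."""
    return url, timeout
```

Here `accepts_call(fetch, "https://example.com", timeout=5)` is true, while `accepts_call(fetch, "https://example.com", retries=3)` is false because `fetch` never accepted a `retries` parameter.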

3. Confident-but-wrong logic. The most dangerous fabrications are subtle: an off-by-one in a pagination boundary, an inverted conditional in a retry, a rounding choice in a financial calculation. The assistant produces clean, idiomatic, well-commented code that is simply incorrect about your domain. There is no syntax error to flag it. It passes the happy-path test the same assistant wrote for it.
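The rounding case is easy to reproduce. In the sketch below both functions look reasonable, and they disagree on a half-cent amount because the float `2.675` is stored as a value slightly below 2.675. Whether half-up is the right rule is a domain decision, so treat the `Decimal` version as an illustration of making the choice explicit, not as the correct policy for any particular system.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_amount_float(amount: float) -> float:
    """Looks idiomatic, but inherits binary floating-point representation."""
    return round(amount, 2)


def round_amount_decimal(amount: str) -> Decimal:
    """Parses the amount as an exact decimal and states the rounding rule."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

`round_amount_float(2.675)` yields `2.67`, while `round_amount_decimal("2.675")` yields `Decimal("2.68")`. Neither raises an error, which is why a difference like this survives review.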

02

Flaws of omission: the defaults that ship insecure

4. Insecure defaults. Asked to "set up a database connection" or "add file upload," assistants reach for the simplest configuration that demonstrates the feature: TLS verification disabled, permissive CORS, debug mode on, world-readable storage. Each is a reasonable demo default and a production liability. Nobody decided to weaken security. The assistant optimized for "works in the example," and that bias compounds across every scaffold it generates.
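A small configuration check can turn "nobody decided to weaken security" into an explicit gate. This is a sketch under stated assumptions: the setting names and values in `INSECURE_SETTINGS` are hypothetical and would need to be mapped to whatever keys your frameworks actually use.

```python
# Hypothetical setting names and the values that would be flagged.
INSECURE_SETTINGS = {
    "verify_tls": False,
    "debug": True,
    "cors_allow_origins": "*",
    "storage_acl": "public-read",
}


def insecure_defaults(config: dict) -> list[str]:
    """Return the sorted names of settings in `config` that carry a known
    demo-grade value."""
    return sorted(
        key
        for key, bad_value in INSECURE_SETTINGS.items()
        if key in config and config[key] == bad_value
    )
```

Run against generated scaffolds in CI, a check like this fails the build when, for example, `{"verify_tls": False, "debug": True}` appears in a production profile.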

5. Missing input validation. Generated handlers frequently trust their inputs. The code accepts a request, parses it, and acts on it without bounding length, checking type, or sanitizing what flows into a query or a shell. This is how SQL injection, command injection, and path traversal re-enter codebases that "solved" those problems a decade ago. The assistant wasn't told the input was hostile, so it assumed it wasn't.
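For the SQL case, the fix is well established: bound the input, then pass it as a parameter instead of formatting it into the query string. The sketch below uses the standard library's `sqlite3`; the `users` table, its columns, and the 64-character limit are assumptions for illustration.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list:
    """Look up a user by name, treating `username` as hostile input."""
    # Bound type and length before the value goes anywhere near the database.
    if not isinstance(username, str) or not 1 <= len(username) <= 64:
        raise ValueError("username must be a string of 1 to 64 characters")
    # The "?" placeholder sends the value as data, never as SQL text.
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    )
    return cursor.fetchall()
```

With this shape, a classic payload such as `' OR '1'='1` is simply a username that matches no rows.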

6. Absent error handling around the dangerous path. Assistants tend to write the success case cleanly and treat failure as an afterthought. The result is unhandled exceptions that leak stack traces, partial writes that corrupt state, and retries that double-charge. The code looks complete because the part you read first works.
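The double-charge case usually comes from retrying a request whose first attempt succeeded but whose response was lost. A common mitigation is an idempotency key, so a repeated request becomes a no-op. The sketch below simulates that failure; `FlakyGateway` is a hypothetical in-memory stand-in, not a real payment API.

```python
class FlakyGateway:
    """Hypothetical provider: applies the charge, then loses the first response."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount in cents
        self.calls = 0

    def charge(self, key: str, amount_cents: int) -> int:
        self.calls += 1
        # A repeated key leaves the existing charge untouched.
        self.charges.setdefault(key, amount_cents)
        if self.calls == 1:
            raise TimeoutError("response lost after the charge was applied")
        return self.charges[key]


def charge_with_retry(gateway, key: str, amount_cents: int, attempts: int = 3) -> int:
    """Retry on timeout, reusing the same idempotency key on every attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            return gateway.charge(key, amount_cents)
        except TimeoutError as exc:
            last_error = exc
    raise RuntimeError("charge failed after retries") from last_error
```

The retry succeeds on the second call, and the gateway still holds exactly one charge for the key. Without the key, the same retry loop would have charged twice.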

03

Flaws of authority: who is allowed to do what

7. Broken authorization. This is the failure mode that should worry an engineering manager most, because it is invisible to functional testing. The assistant correctly checks that a user is authenticated, then forgets to check whether *this* user is allowed to touch *this* resource. The endpoint works perfectly for the developer testing it with their own account. It also lets any authenticated user read any other user's records by changing an ID in the URL. Static suites pass. The flaw is the absence of a check, and you cannot easily test for the absence of something nobody specified.
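The missing check is small once it is specified. The sketch below is framework-agnostic and uses hypothetical names (`Record`, `RECORDS`, `Forbidden`); the point is that ownership is verified on every object lookup, and that the negative case gets its own test.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when a user may not access the requested resource."""


@dataclass(frozen=True)
class Record:
    id: int
    owner_id: int
    body: str


# Hypothetical in-memory store standing in for a database table.
RECORDS = {
    1: Record(id=1, owner_id=100, body="alice's record"),
    2: Record(id=2, owner_id=200, body="bob's record"),
}


def get_record(current_user_id: int, record_id: int) -> Record:
    """Return the record only if the authenticated user owns it."""
    record = RECORDS.get(record_id)
    # Same error for "missing" and "not yours", so IDs cannot be probed.
    if record is None or record.owner_id != current_user_id:
        raise Forbidden(f"user {current_user_id} may not read record {record_id}")
    return record
```

A functional test with the developer's own account only exercises `get_record(100, 1)`. The test that matters is the one asserting that `get_record(100, 2)` raises `Forbidden`.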

8. Privilege and scope creep. Generated infrastructure-as-code and service configurations lean toward broad permissions, because broad permissions never block the demo. An IAM role gets * where it needed three actions. A service account is granted admin where it needed read. Every one of these is a quiet expansion of blast radius that no functional test will ever exercise.
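Wildcard grants are mechanical to detect before they merge. This sketch walks a policy document in the AWS IAM JSON shape (`Statement`, `Effect`, `Action`) and returns the `Allow` statements whose actions are `*` or `service:*`. It is a deliberately narrow illustration: it ignores `NotAction`, conditions, and resource scoping, all of which a real policy linter would need to consider.

```python
def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant a wildcard action."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        if any(action == "*" or action.endswith(":*") for action in actions):
            flagged.append(statement)
    return flagged
```

A role that needed three actions and received `*` shows up as a flagged statement, which gives the reviewer a concrete question to ask instead of a wall of JSON to read.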

04

Flaws of exposure: secrets and data that leak

9. Hardcoded secrets. When an assistant needs a credential to make example code run, it inserts a plausible placeholder, and developers frequently replace it with a real key inline rather than wiring up a secret manager. The key then rides into version control. Industry data on leaked credentials in public repositories has been climbing for years; assistant-authored scaffolds accelerate the pattern by normalizing the inline-key shape.
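The alternative to the inline-key shape can be just as short as the placeholder it replaces. The sketch below reads a credential from the environment and fails loudly when it is absent; the variable name `PAYMENTS_API_KEY` is hypothetical, and a secret manager client would slot in behind the same function.

```python
import os


def require_secret(name: str, env=None) -> str:
    """Return the named secret from the environment, or fail at startup
    instead of falling back to a value baked into the source."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Calling `require_secret("PAYMENTS_API_KEY")` at startup keeps the key out of version control and makes a missing credential a deployment error rather than a silent default.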

10. Sensitive data in logs and responses. To be "helpful," generated code over-shares: full request bodies logged at info level, internal error details returned to the client, PII written to traces. None of it is malicious. It is the model defaulting to verbosity, and verbosity around sensitive data is a breach waiting for a compliance audit.
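One hedge against verbose generated logging is a redaction filter attached at the handler, so sensitive values are masked even when a call site forgets. This sketch uses the standard `logging.Filter` hook; the key names in `SENSITIVE_KEYS` and the `key=value` message format are assumptions, and structured or JSON logs would need a different matcher.

```python
import logging
import re

# Hypothetical list of keys whose values should never reach a log sink.
SENSITIVE_KEYS = ("password", "token", "ssn")
_PATTERN = re.compile(
    r"\b(" + "|".join(SENSITIVE_KEYS) + r")=(\S+)", re.IGNORECASE
)


class RedactFilter(logging.Filter):
    """Mask the values of sensitive key=value pairs in formatted messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so values passed as arguments are covered.
        record.msg = _PATTERN.sub(r"\1=[REDACTED]", record.getMessage())
        record.args = None
        return True  # keep the record, now redacted
```

Attach it with `handler.addFilter(RedactFilter())`. A filter like this is a backstop, not a substitute for deciding what should be logged in the first place.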

05

Flaws of context: code that's right in isolation, wrong in your system

11. Locally correct, globally breaking changes. An assistant edits one service competently while remaining blind to the forty services downstream that depend on its current behavior. It changes a response shape, tightens a timeout, or renames a field. The change is correct in the file you are reviewing and catastrophic in the dependency graph you are not. Reachability matters here: knowing whether a change actually touches a critical path is the difference between a routine merge and an incident. This is also why reachability-based prioritization, done well, can mean 70 to 90% less exploitable exposure: you stop chasing flaws that can't be reached and focus on the ones that can.
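At its simplest, the reachability question is a graph traversal: given a changed service, which services depend on it directly or transitively? The sketch below assumes a hypothetical `DEPENDS_ON` map with invented service names; a production system would build this graph from real service and dependency metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls.
DEPENDS_ON = {
    "checkout": ["payments", "catalog"],
    "payments": ["ledger"],
    "reports": ["ledger"],
    "catalog": [],
    "ledger": [],
}


def affected_by(changed: str, depends_on: dict) -> set[str]:
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the edges: for each service, who calls it?
    dependents = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    affected = set()
    queue = deque([changed])
    while queue:  # breadth-first walk up the dependents
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller != changed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

In this toy graph a change to `ledger` reaches `payments`, `reports`, and `checkout`, while a change to `checkout` reaches nothing. The same diff size carries very different risk depending on where it lands.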

12. Confidently wrong test coverage. The most corrosive flaw is the one that hides the other eleven. Ask an assistant to write tests and it will write tests that pass, often by asserting the behavior the code already has, bugs included. Coverage numbers climb. Confidence climbs with them. But a test that codifies a defect is worse than no test, because it converts an open question into a false reassurance and makes the next engineer trust the broken path.
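The difference between a test that pins current behavior and a test derived from the requirement can be shown with a pagination helper. In this sketch, `page_count_generated` is a hypothetical assistant-written function with a boundary defect, and `invariant_failures` checks any implementation against two properties taken from the spec rather than from the code's own output.

```python
def page_count_generated(total_items: int, page_size: int) -> int:
    """Hypothetical generated helper: wrong whenever total_items is an
    exact multiple of page_size (including zero)."""
    return total_items // page_size + 1


def page_count_fixed(total_items: int, page_size: int) -> int:
    """Ceiling division: the smallest page count that holds every item."""
    return (total_items + page_size - 1) // page_size


def invariant_failures(fn, max_total: int = 50, max_size: int = 7) -> list:
    """Return (total, size, pages) triples where `fn` breaks the spec:
    the pages must hold every item, and the last page must not be empty."""
    failures = []
    for size in range(1, max_size + 1):
        for total in range(max_total + 1):
            pages = fn(total, size)
            holds_everything = pages * size >= total
            last_page_used = pages == 0 or (pages - 1) * size < total
            if not (holds_everything and last_page_used):
                failures.append((total, size, pages))
    return failures
```

A pinned test such as `assert page_count_generated(10, 5) == 3` passes and raises coverage while encoding the bug. The invariant check reports that same input as a failure, because its expectations never came from the code under test.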

06

Why review and a pile of scanners don't close this

Each flaw above has a tool that catches some of it: a linter, a secret scanner, a SAST pass, an SCA check. So why does the 45% number persist? Because the flaws live in the *seams*. A change that passes four tools in isolation can still be the broken-authorization change, because no individual scanner understands how it propagates through your system. And the human-process answer fails for a measurable reason: roughly 80% of developers bypass policy or guardrails when those guardrails are slow or noisy. A control that gets routed around four times out of five is theater, not protection. This is not a moral failing of engineers; it is what happens when controls add friction without leverage.

The economics are not subtle either. The cost of poor software quality is estimated at around $2.41 trillion a year in the United States alone, per CISQ's 2022 estimate. A meaningful share of that is the aggregate of exactly these defects shipping at machine speed and getting caught, if at all, in production.

07

What to do Monday morning

You do not need to ban assistants. You need to stop treating their output as trusted by default and start governing it as untrusted change.

  • Treat every assisted change as untrusted until validated against the system, not the diff. A live System Graph of services, dependencies, and CI/CD lets you ask what a change can actually reach, instead of reviewing it in isolation.
  • Replace static suites with validation that keeps pace. Testing Fleets plan and execute checks that are aware of what changed and what depends on it, so coverage doesn't decay and assistant-written tests don't get to grade their own homework.
  • Make remediation governed, never unsupervised. When the system proposes a fix, a human authorizes it. Remediation under policy, approval, and audit is the engineering. Agents propose; humans authorize.
  • Demand evidence, not a green check. "Tests passed" is a status. A record of what was tested, what was found, what was fixed, and who signed off is evidence, and it is the only thing that survives an audit or a breach review.
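To make the last point concrete, an evidence record can be a small, immutable structure with a content digest, so that what was tested, found, fixed, and approved cannot be edited quietly afterwards. This is a minimal sketch with hypothetical field names. The SHA-256 digest gives tamper evidence only; a real audit trail would add a cryptographic signature and durable storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record of one validated change."""
    change_id: str
    checks_run: tuple
    findings: tuple
    fixes: tuple
    approved_by: str

    def digest(self) -> str:
        """Stable SHA-256 over the record's canonical JSON form."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def may_release(evidence: Evidence) -> bool:
    """A change is releasable only with checks on record and a named approver."""
    return bool(evidence.checks_run) and bool(evidence.approved_by.strip())
```

Storing the digest alongside the release means a later reviewer can recompute it and confirm that the record they are reading is the one that was signed off.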

Consider a hypothetical fintech team merging forty AI-assisted PRs a day. Adding a sixth scanner gives them a sixth queue. Putting a governed control layer above the stack gives them one verdict per release, scoped to what each change can reach, with a signed record of the call.

08

The bottom line
