From Prompt to PR: The Checklist for Letting AI Write Production Code Safely
A control-layer checklist for platform engineers: the provenance, validation, reachability, approval, and evidence gates an AI-authored change must clear before merge.
Why "looks fine" is not a control
The default review for an AI-authored PR is a human skim plus a green pipeline. Both signals are weaker than they look. A skim evaluates whether the diff reads plausibly, which is precisely what a language model is optimized to make true. A green pipeline tells you the tests that exist passed; it says nothing about the tests that should exist for the change in front of you. Plausibility and aggregate pass rate are exactly the two signals AI-generated code is best at satisfying while still being wrong.
The deeper problem is incentive. When the gate is slow, subjective, or ceremonial, engineers route around it. Around 80% of developers admit to bypassing policy or guardrails when those guardrails get in the way. A gate that gets bypassed protects nothing, and a gate that treats a copy tweak and a payments-path change identically teaches people to bypass it. The fix is not more ceremony. It is a small number of gates that understand what they are gating and produce evidence a skeptic can read.
The checklist below maps to the control-layer requirements directly: provenance, validation, reachability, approval, and evidence. Treat each as a gate the change must pass, not a box someone clicks.
1. Provenance: know what you are merging and how it got here
Before anything else, the change has to declare itself. You cannot govern what you cannot attribute.
- Author and tool of record. Was this written by an agent, a copilot-assisted human, or a human alone? You are not penalizing AI authorship; you are calibrating scrutiny to it.
- Scope of intent. What was the agent asked to do, and does the diff stay inside that intent? AI changes routinely drift: asked to fix a null check, they refactor a function signature three callers depend on.
- Dependency and supply-chain delta. New packages, bumped versions, new transitive dependencies. Machine-written code pulls in dependencies casually, and a casual import is how supply-chain risk enters.
Provenance is cheap to capture and expensive to reconstruct after an incident. Capture it at PR time, attached to the change, not in a chat log someone has to subpoena later.
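One way to make the three provenance items concrete is a small record attached to the PR, plus a check that the diff stays inside the declared scope. The sketch below is illustrative only: the `Provenance` fields, the glob-based scope patterns, and the `provenance_gate` function are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record captured at PR time."""
    author_kind: str                  # "agent", "assisted-human", or "human"
    tool: str                         # tool of record
    stated_intent: str                # what the author was asked to do
    allowed_paths: tuple[str, ...]    # glob patterns the intent covers
    changed_paths: tuple[str, ...]    # files the diff actually touches
    added_dependencies: tuple[str, ...] = ()


def out_of_scope_paths(p: Provenance) -> list[str]:
    """Return changed files that match none of the declared scope patterns."""
    return [
        path
        for path in p.changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in p.allowed_paths)
    ]


def provenance_gate(p: Provenance) -> list[str]:
    """Return human-readable problems; an empty list means the gate passes."""
    problems = []
    drift = out_of_scope_paths(p)
    if drift:
        problems.append("diff leaves stated intent: " + ", ".join(drift))
    if p.added_dependencies:
        problems.append(
            "new dependencies need review: " + ", ".join(p.added_dependencies)
        )
    return problems
```

In this sketch, a change asked to fix a null check under `cart/*` that also edits `payments/client.py` is reported as drift rather than silently merged.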
2. Validation: test the change, not the system's average health
This is the gate most teams think they have and mostly do not. Running your existing suite against an AI-authored change tells you the system still passes the tests written for a system that no longer exists. It does not tell you whether *this* change is correct.
Change-aware validation requires two things working together. First, a model of what the change actually touches. A live System Graph maps services, dependencies, and CI/CD into one change-aware picture, so validation can be scoped to the real blast radius: it knows the cart service calls payments, that payments has a downstream rate limit, and that a config change three repos away is reachable from checkout. Second, validation that adapts. Static scripts cannot keep pace with systems that change continuously. Coordinated Testing Fleets plan, execute, and maintain validation scoped to the change as the system evolves, rather than re-running a frozen suite and calling the green check proof.
The question the validation gate must answer is concrete: which paths did this change exercise, what regressed, and what coverage is missing for the surface it touches? "The build is green" is not an answer to that question.
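The scoping idea can be sketched as a graph walk. The example below is a toy, assuming a hand-written dependency map in place of a live System Graph: the service names, the `DEPENDS_ON` edges, and the test names are all hypothetical. It computes what transitively depends on the changed component, selects the tests mapped to that set, and reports components in the blast radius that have no tests at all.

```python
from collections import deque

# Hypothetical edges: each key depends on (calls or reads) each listed value.
DEPENDS_ON = {
    "checkout": ["cart"],
    "cart": ["payments"],
    "payments": ["rate-limit-config"],
    "search": ["catalog"],
}


def blast_radius(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return the changed component plus everything that transitively depends on it."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents: dict[str, set[str]] = {}
    for caller, callees in depends_on.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(caller)

    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def plan_validation(changed, depends_on, tests_by_component):
    """Return (tests to run, components in the radius with no tests)."""
    radius = blast_radius(changed, depends_on)
    selected = sorted(t for c in radius for t in tests_by_component.get(c, []))
    uncovered = sorted(c for c in radius if not tests_by_component.get(c))
    return selected, uncovered
```

The second return value is the part a green build never surfaces: the list of touched components for which no test exists.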
3. Reachability: prioritize the risk that can actually be exploited
A 45% flaw-introduction rate produces a finding list long enough to guarantee one of two failures: the team triages everything and ships nothing, or it triages nothing and ships the dangerous ones. Both come from treating every finding as equally urgent.
Reachability is the discriminator. The question is not "is there a vulnerability in this code" but "does this vulnerability sit on a path that is actually reachable in the deployed system." Reachability-based prioritization can cut the exploitable exposure a team has to triage by 70 to 90%. Applied to a merge gate, the rule is clean: a flaw on an unreachable path does not have to block the release, while a reachable one routes straight to a human. You stop spending attention on theoretical risk and concentrate it on the small set that can actually hurt you. For a deeper treatment of how this reshapes the security backlog, the security debt crisis whitepaper is the reference.
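The merge-gate rule can be sketched as a graph search over a call graph. This is a simplified illustration: the `Finding` type, the function names, and the call graph are hypothetical, and a real reachability analysis would also account for dynamic dispatch, configuration, and network exposure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical scanner finding, located at a single function."""
    id: str
    function: str
    severity: str


def reachable_from(entry_points, call_graph):
    """Return every function reachable from the deployed entry points."""
    seen = set(entry_points)
    stack = list(entry_points)
    while stack:
        fn = stack.pop()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


def triage(findings, entry_points, call_graph):
    """Split findings into (route to a human, does not block the release)."""
    live = reachable_from(entry_points, call_graph)
    needs_human = [f for f in findings if f.function in live]
    deferred = [f for f in findings if f.function not in live]
    return needs_human, deferred
```

Deferred findings are not discarded in this sketch; they simply stop competing for attention with the ones on a live path.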
5. Evidence: produce an artifact that survives an audit
The last gate is the one teams skip and regret. Every merge, including every auto-merge, should leave a reproducible record: the provenance, the validation results scoped to the change, the reachability posture, the policy result, and who authorized what. Auto-merged changes are exactly the ones nobody watches, which is exactly why they need the same evidence trail as reviewed ones. Removing a human from the path raises the bar on the record; it does not lower it.
For code that runs inside a customer boundary or a regulated enclave, the requirement is stricter. Edge Runners execute as signed capsules inside secure enclaves and emit audit-ready evidence from inside the boundary, so the merge record satisfies a compliance review rather than living in a CI log someone can edit.
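A minimal sketch of such a record, assuming plain JSON and a SHA-256 digest: the field names and the `build_evidence` and `verify_evidence` helpers are hypothetical. A digest only makes later edits detectable when it is stored somewhere the editor cannot reach; it is not a substitute for the signing and enclave execution described above.

```python
import hashlib
import json


def _canonical(record: dict) -> bytes:
    """Serialize deterministically so the same record always hashes the same."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_evidence(provenance, validation, reachability, policy_result, authorized_by):
    """Bundle the gate outputs into one record and attach its digest."""
    record = {
        "provenance": provenance,
        "validation": validation,
        "reachability": reachability,
        "policy_result": policy_result,
        "authorized_by": authorized_by,  # a named person, or the auto-merge policy
    }
    return {"record": record, "sha256": hashlib.sha256(_canonical(record)).hexdigest()}


def verify_evidence(evidence: dict) -> bool:
    """Recompute the digest and compare it with the stored one."""
    expected = hashlib.sha256(_canonical(evidence["record"])).hexdigest()
    return expected == evidence["sha256"]
```

The same function is called for auto-merges and reviewed merges, so the record has the same shape whether or not a human was in the path.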
What to do Monday morning
You do not need to rebuild the pipeline to start. Pick one path and make it real.
- Tag AI-authored PRs. For two weeks, capture provenance on every change. You are establishing the baseline you currently lack.
- Pick one high-stakes surface and write its merge policy down. "Reachable critical findings = 0; payment-path change requires one named approval." If you cannot write it, you cannot govern it.
- Scope validation to that surface with the graph, not the whole suite. Let the dependency map define blast radius instead of the loudest reviewer.
- Make every merge leave an evidence record. Auto-merges included. Start the trail before you need it.
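Writing the policy down can be as literal as a function. The sketch below encodes the example policy quoted in the list ("Reachable critical findings = 0; payment-path change requires one named approval"). The `Change` type and the `payments/` path prefix are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical: paths that count as the payment surface in this repository.
PAYMENT_PREFIXES = ("payments/",)


@dataclass(frozen=True)
class Change:
    changed_paths: tuple[str, ...]
    reachable_critical_findings: int
    named_approvers: tuple[str, ...] = ()


def evaluate_policy(change: Change) -> list[str]:
    """Return policy violations; an empty list means the change may merge."""
    violations = []
    if change.reachable_critical_findings != 0:
        violations.append(
            f"{change.reachable_critical_findings} reachable critical finding(s); policy requires 0"
        )
    # str.startswith accepts a tuple of prefixes.
    touches_payments = any(p.startswith(PAYMENT_PREFIXES) for p in change.changed_paths)
    if touches_payments and not change.named_approvers:
        violations.append("payment-path change requires one named approval")
    return violations
```

A copy tweak outside `payments/` with no reachable critical findings passes with no approver, while a payment-path change without a named approver does not, which is the differentiated treatment the checklist argues for.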
The bottom line
