Security and Governance

From Prompt to PR: The Checklist for Letting AI Write Production Code Safely

A control-layer checklist for platform engineers: the provenance, validation, reachability, approval, and evidence gates an AI-authored change must clear before merge.

Zof Reliability Team · Engineering and Product

February 11, 2025 · 7 min read · Updated February 11, 2025

01

Why "looks fine" is not a control

The default review for an AI-authored PR is a human skim plus a green pipeline. Both signals are weaker than they look. A skim evaluates whether the diff reads plausibly, which is precisely what a language model is optimized to make true. A green pipeline tells you the tests that exist passed; it says nothing about the tests that should exist for the change in front of you. Plausibility and aggregate pass rate are exactly the two signals AI-generated code is best at satisfying while still being wrong.

The deeper problem is incentive. When the gate is slow, subjective, or ceremonial, engineers route around it. Around 80% of developers admit to bypassing policy or guardrails when those guardrails get in the way. A gate that gets bypassed protects nothing, and a gate that treats a copy tweak and a payments-path change identically teaches people to bypass it. The fix is not more ceremony. It is a small number of gates that understand what they are gating and produce evidence a skeptic can read.

The checklist below maps to the control-layer requirements directly: provenance, validation, reachability, approval, and evidence. Treat each as a gate the change must pass, not a box someone clicks.

02

1. Provenance: know what you are merging and how it got here

Before anything else, the change has to declare itself. You cannot govern what you cannot attribute.

  • Author and tool of record. Was this written by an agent, a copilot-assisted human, or a human alone? You are not penalizing AI authorship; you are calibrating scrutiny to it.
  • Scope of intent. What was the agent asked to do, and does the diff stay inside that intent? AI changes routinely drift: asked to fix a null check, they refactor a function signature three callers depend on.
  • Dependency and supply-chain delta. New packages, bumped versions, new transitive dependencies. Machine-written code pulls in dependencies casually, and a casual import is how supply-chain risk enters.

Provenance is cheap to capture and expensive to reconstruct after an incident. Capture it at PR time, attached to the change, not in a chat log someone has to subpoena later.
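
As a rough illustration of what "attached to the change" could look like, the sketch below models a provenance record and two checks on it: scope drift and dependency delta. Every field name, the glob-based notion of intent scope, and the sample values are assumptions made for this example, not a description of any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Provenance:
    """Provenance captured at PR time (all field names are hypothetical)."""
    author_kind: str                  # "agent", "assisted-human", or "human"
    tool: str                         # tool of record
    intent: str                       # what the author was asked to do
    allowed_paths: tuple[str, ...]    # glob patterns the intent covers
    changed_files: tuple[str, ...]    # files the diff actually touches
    deps_before: frozenset[str] = frozenset()
    deps_after: frozenset[str] = frozenset()

    def scope_drift(self) -> list[str]:
        """Changed files that fall outside every allowed pattern."""
        return [
            path for path in self.changed_files
            if not any(fnmatch(path, pattern) for pattern in self.allowed_paths)
        ]

    def dependency_delta(self) -> dict[str, list[str]]:
        """Pinned dependencies added or removed by the change."""
        return {
            "added": sorted(self.deps_after - self.deps_before),
            "removed": sorted(self.deps_before - self.deps_after),
        }


# Hypothetical PR: asked to fix a null check, but the diff also touches payments.
pr = Provenance(
    author_kind="agent",
    tool="example-agent",
    intent="fix null check in cart totals",
    allowed_paths=("cart/totals.py", "tests/cart/*"),
    changed_files=("cart/totals.py", "payments/client.py"),
    deps_before=frozenset({"example-http==1.0.0"}),
    deps_after=frozenset({"example-http==1.0.0", "example-pad==0.1.0"}),
)
print(pr.scope_drift())
print(pr.dependency_delta())
```

A non-empty drift list or a non-empty "added" list does not have to block the merge on its own; it is a signal that the change deserves more scrutiny than its stated intent suggests.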

03

2. Validation: test the change, not the system's average health

This is the gate most teams think they have and mostly do not. Running your existing suite against an AI-authored change tells you only that the system still passes tests written before the change existed. It does not tell you whether *this* change is correct.

Change-aware validation requires two things working together. First, a model of what the change actually touches. A live System Graph maps services, dependencies, and CI/CD into one change-aware picture, so validation can be scoped to the real blast radius: it knows the cart service calls payments, that payments has a downstream rate limit, and that a config change three repos away is reachable from checkout. Second, validation that adapts. Static scripts cannot keep pace with systems that change continuously. Coordinated Testing Fleets plan, execute, and maintain validation scoped to the change as the system evolves, rather than re-running a frozen suite and calling the green check proof.

The question the validation gate must answer is concrete: which paths did this change exercise, what regressed, and what coverage is missing for the surface it touches? "The build is green" is not an answer to that question.
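
To make "scoped to the real blast radius" concrete, here is a minimal sketch that walks a dependency map backwards from the changed node, selects only the tests that exercise something in that radius, and reports what remains uncovered. The graph, the test-to-node mapping, and all names are hypothetical; a real system graph would be derived from the live system rather than hard-coded.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on (calls or reads) its values.
DEPENDS_ON = {
    "checkout": ["cart"],
    "cart": ["payments"],
    "payments": ["rate-limit-config"],
    "search": ["catalog"],
}

# Hypothetical mapping from test suites to the nodes they exercise.
TESTS = {
    "test_checkout_flow": {"checkout", "cart"},
    "test_payments_retry": {"payments"},
    "test_search_ranking": {"search", "catalog"},
}


def blast_radius(changed, depends_on):
    """Return the changed nodes plus everything that transitively depends on them."""
    # Invert the edges so we can walk from a node to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def plan_validation(changed, depends_on, tests):
    """Select tests that touch the blast radius and list nodes no selected test covers."""
    radius = blast_radius(changed, depends_on)
    selected = sorted(name for name, covers in tests.items() if covers & radius)
    covered = set().union(*(tests[name] for name in selected))
    return {
        "radius": sorted(radius),
        "run": selected,
        "uncovered": sorted(radius - covered),
    }


plan = plan_validation({"rate-limit-config"}, DEPENDS_ON, TESTS)
print(plan)
```

In this toy graph a config change reaches checkout through payments and cart, the search tests are skipped as irrelevant, and the config node itself shows up as uncovered, which is the "what coverage is missing" half of the question.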

04

3. Reachability: prioritize the risk that can actually be exploited

A 45% flaw-introduction rate in AI-authored code produces a finding list long enough to guarantee one of two failures: the team triages everything and ships nothing, or it triages nothing and ships the dangerous flaws along with the rest. Both come from treating every finding as equally urgent.

Reachability is the discriminator. The question is not "is there a vulnerability in this code" but "does this vulnerability sit on a path that is actually reachable in the deployed system." Prioritizing by reachability can mean 70 to 90% less exploitable exposure to triage. Applied to a merge gate, the rule is clean: a flaw on an unreachable path does not have to block the release, while a reachable one routes straight to a human. You stop spending attention on theoretical risk and concentrate it on the small set that can actually hurt you. For a deeper treatment of how this reshapes the security backlog, the security debt crisis whitepaper is the reference.
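
The merge-gate rule above can be sketched as a walk over a call graph from the deployed entry points. This is a simplified illustration: the call graph, the finding format, and the function names are invented for the example, and real reachability analysis also has to handle dynamic dispatch, configuration, and network paths that a static dictionary cannot capture.

```python
def reachable_from(entry_points, calls):
    """Depth-first walk of a call graph; returns every function reachable from the entry points."""
    seen = set()
    stack = list(entry_points)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(calls.get(fn, ()))
    return seen


def triage(findings, entry_points, calls):
    """Split findings into those on a reachable path and those that are not."""
    live = reachable_from(entry_points, calls)
    return {
        "route_to_human": [f["id"] for f in findings if f["function"] in live],
        "non_blocking": [f["id"] for f in findings if f["function"] not in live],
    }


# Hypothetical call graph: caller -> callees.
CALLS = {
    "http_checkout": ["price_cart", "charge_card"],
    "charge_card": ["sign_request"],
    "legacy_export": ["parse_xml"],  # nothing deployed calls legacy_export
}

# Hypothetical scanner output.
FINDINGS = [
    {"id": "F-1", "function": "sign_request"},
    {"id": "F-2", "function": "parse_xml"},
]

result = triage(FINDINGS, ["http_checkout"], CALLS)
print(result)
```

Non-blocking does not mean ignored: an unreachable finding still belongs in the backlog, because a later change can make the path reachable.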

05

4. Approval: agents propose, humans authorize

The governing principle of the whole pipeline is one line: agents propose, humans authorize. An agent can assemble the change, run the validation, compute the risk, and stage a fix. It does not get to authorize the dangerous changes itself. Unsupervised autonomous merging into production is reckless, and the governance around the decision is the engineering, not an afterthought.

The way to make that scale without a bottleneck is to tier by blast radius rather than by line count or author:

  • Auto-merge. Low-criticality nodes, validation passed, no policy-sensitive surface. The control layer records evidence and merges. No human gate.
  • Single approver. High-criticality node, partial coverage, or a touched-but-not-regulated data path.
  • Multi-party / change-control. Authentication and authorization, payments, regulated data, irreversible operations, or any hard policy failure.

The aim is for the safe majority to auto-merge and human attention to land only on the genuinely risky minority. The tier must be derived from the graph and policy, not declared by the author, or you have rebuilt the bottleneck with extra steps. This is the work that Governance holds as first-class configuration: the policy, the tiering, and the approval rules live as code, enforced uniformly on every change.
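
One way to picture "the tier must be derived, not declared" is a pure function from facts about the change to an approval tier. The sketch below follows the three tiers listed above; the field names and surface labels are hypothetical, and routing a failed validation to a single approver is an assumption of this example (a real gate might reject such a change outright).

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_MERGE = "auto-merge"
    SINGLE_APPROVER = "single-approver"
    CHANGE_CONTROL = "multi-party"


# Hypothetical labels for surfaces that always require change control.
CHANGE_CONTROL_SURFACES = frozenset(
    {"authn", "authz", "payments", "regulated-data", "irreversible-op"}
)


@dataclass(frozen=True)
class ChangeFacts:
    """Facts computed by the pipeline, never supplied by the author (names are illustrative)."""
    surfaces: frozenset          # policy-sensitive surfaces the diff touches
    high_criticality: bool       # touches a high-criticality node
    validation_passed: bool
    full_coverage: bool          # change-scoped coverage is complete
    touches_data_path: bool      # touches a data path that is not regulated
    hard_policy_failure: bool = False


def derive_tier(change: ChangeFacts) -> Tier:
    # The strictest rule is checked first so it can never be shadowed.
    if change.hard_policy_failure or change.surfaces & CHANGE_CONTROL_SURFACES:
        return Tier.CHANGE_CONTROL
    if (
        change.high_criticality
        or not change.full_coverage
        or change.touches_data_path
        or not change.validation_passed  # assumption: failed validation needs a human
    ):
        return Tier.SINGLE_APPROVER
    return Tier.AUTO_MERGE


copy_tweak = ChangeFacts(frozenset(), False, True, True, False)
payments_fix = ChangeFacts(frozenset({"payments"}), False, True, True, False)
print(derive_tier(copy_tweak), derive_tier(payments_fix))
```

Because the function takes no author-supplied input, the only way to change a tier is to change the policy or the change itself, and both of those leave a trace.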

06

5. Evidence: produce an artifact that survives an audit

The last gate is the one teams skip and regret. Every merge, including every auto-merge, should leave a reproducible record: the provenance, the validation results scoped to the change, the reachability posture, the policy result, and who authorized what. The auto-merged changes are exactly the ones nobody watches, which is exactly why they need the same evidence trail as reviewed ones. Removing a human from the path raises the bar on the record; it does not lower it.

For code that runs inside a customer boundary or a regulated enclave, the requirement is stricter. Edge Runners execute as signed capsules inside secure enclaves and emit audit-ready evidence from inside the boundary, so the merge record satisfies a compliance review rather than living in a CI log someone can edit.
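
As a small illustration of why an evidence record differs from an editable CI log, the sketch below chains merge records together with SHA-256 digests so that altering any earlier entry invalidates the chain. This is a generic tamper-evidence technique using Python's standard library, not a description of how Edge Runners or signed capsules work; the record fields are hypothetical, and a production system would also sign the digests and store them outside the pipeline's own write access.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest that precedes the first record


def _digest(body):
    # Canonical JSON so the same content always hashes to the same value.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(record, prev_digest):
    """Return the record linked to its predecessor and stamped with its own digest."""
    body = dict(record, prev=prev_digest)
    return dict(body, digest=_digest(body))


def verify(chain):
    """True only if every entry is unmodified and linked to the entry before it."""
    prev = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "digest"}
        if body.get("prev") != prev or _digest(body) != entry["digest"]:
            return False
        prev = entry["digest"]
    return True


# Hypothetical merge records, including an auto-merge with no human approver.
first = seal(
    {"change": "pr-101", "tier": "auto-merge", "policy": "pass", "authorized_by": "control-layer"},
    GENESIS,
)
second = seal(
    {"change": "pr-102", "tier": "single-approver", "policy": "pass", "authorized_by": "named-approver"},
    first["digest"],
)
chain = [first, second]
print(verify(chain))
```

Editing any field of an earlier record after the fact changes its digest, which breaks the link held by the next record, so the edit is detectable by anyone who re-runs the verification.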

07

What to do Monday morning

You do not need to rebuild the pipeline to start. Pick one path and make it real.

  1. Tag AI-authored PRs. For two weeks, capture provenance on every change. You are establishing the baseline you currently lack.
  2. Pick one high-stakes surface and write its merge policy down. "Reachable critical findings = 0; payment-path change requires one named approval." If you cannot write it, you cannot govern it.
  3. Scope validation to that surface with the graph, not the whole suite. Let the dependency map define blast radius instead of the loudest reviewer.
  4. Make every merge leave an evidence record. Auto-merges included. Start the trail before you need it.
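
Step 2 is easier to start when the written policy is also executable. The following is a minimal sketch of the example policy quoted above, expressed as a function that returns the list of violations; the input keys are hypothetical and would come from whatever your pipeline already computes.

```python
def evaluate_merge_policy(change):
    """Return the policy violations for a change; an empty list means the gate passes."""
    violations = []
    if change["reachable_critical_findings"] > 0:
        violations.append("reachable critical findings must be 0")
    if change["touches_payment_path"] and not change["named_approvers"]:
        violations.append("payment-path change requires one named approval")
    return violations


# Hypothetical change: clean scan, touches payments, nobody has approved yet.
pending = {
    "reachable_critical_findings": 0,
    "touches_payment_path": True,
    "named_approvers": [],
}
print(evaluate_merge_policy(pending))
```

Returning every violation, rather than stopping at the first, gives the author one complete list to act on and gives the evidence record a full account of why a merge was held.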

08

The bottom line
