Governing Remediation Fleets: How to Let AI Fix Code Without Losing Control
An SRE's guide to governing autonomous remediation: scope fixes by blast radius, gate approvals with policy, and keep every change reversible.
Why unsupervised fixing is the reckless part
Remediation is the hardest stage of the reliability loop for a simple reason: it mutates production behavior. Understanding a system is read-only. Testing and reproduction are observation. Remediation acts. And it acts in an environment where the inputs have gotten worse, not better. Roughly 41% of codebases are now AI-generated, and industry research suggests around 45% of AI coding tasks introduce a critical flaw or security issue. An autonomous fixer trained on that distribution, left to merge on its own judgment, is just another author with a high defect rate and no accountability trail.
The failure pattern is easy to picture if you run retail reliability. An agent sees elevated latency on the cart service, identifies a hot query, and "fixes" it by adding an aggressive cache. Latency drops. It also starts serving stale inventory counts during a flash sale, and you oversell. The agent solved the symptom it was pointed at and created a worse problem it was never scoped to consider. Nobody authorized that tradeoff because nobody was asked.
This is why the governing principle is non-negotiable: agents propose, humans authorize. Unsupervised autonomous fixing is not advanced automation; it is removing the one checkpoint that contains blast radius. The engineering work in Remediation Fleets is not the fix generation, which is now commoditized. It is the governance: the scoping, the policy, the approval, and the reversal.
Scope first: bound the fix to a known blast radius
A fix you cannot scope is a fix you cannot govern. Before an agent proposes anything, the control layer has to answer one question: if this change is wrong, what breaks and who is exposed? That answer does not come from the diff. It comes from the dependency graph.
This is where a live System Graph does the load-bearing work. Because it maps services, dependencies, and CI/CD into one change-aware model, it can tell you that the proposed cart-service fix sits on the path to the payments authorization service, which has a downstream rate limit, and that a config change two repos away is reachable from checkout. Scope is computed from that topology, not from how many lines moved or who wrote it.
Concretely, every remediation proposal should carry a scope envelope before it reaches a human:
- The change surface. Exactly which services, dependencies, and data paths the fix touches, derived from the graph.
- The blast radius. What is downstream of the change surface, and which of those nodes are revenue-bearing, regulated, or on a critical request path.
- The reachability of the original defect. Whether the flaw being fixed is even exploitable from a live entry point. Reachability-based prioritization can cut the exploitable exposure you have to triage by 70-90%, which keeps your fleet from generating churn against theoretical risk.
A fix whose blast radius touches checkout or payment authorization is a fundamentally different object than one isolated to an internal admin tool. If your remediation system cannot tell those two apart before it asks for approval, it is not governed. It is guessing, and asking you to co-sign the guess.
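A minimal sketch of how a scope envelope could be computed from a dependency graph, assuming the graph is available as plain adjacency maps. The graph contents, the `IMPACT`, `TRAFFIC`, `ENTRY_POINTS`, and `CRITICAL` names, and the `ScopeEnvelope` type are all hypothetical and stand in for whatever your own dependency model exposes.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical graphs for illustration only.
# IMPACT: an edge a -> b means a change in a can alter the behavior of b.
IMPACT = {"cart-service": ["checkout"], "checkout": ["payments-auth"], "admin-tool": []}
# TRAFFIC: an edge a -> b means live requests flow from a to b.
TRAFFIC = {"storefront": ["checkout"], "checkout": ["cart-service", "payments-auth"]}
ENTRY_POINTS = {"storefront"}              # assumed live entry points
CRITICAL = {"checkout", "payments-auth"}   # assumed revenue-bearing or regulated nodes


def reachable(graph, starts):
    """Breadth-first search: every node reachable from `starts`, inclusive."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


@dataclass(frozen=True)
class ScopeEnvelope:
    change_surface: frozenset
    blast_radius: frozenset       # downstream of the surface, excluding the surface
    critical_exposed: frozenset   # critical nodes in the surface or its blast radius
    defect_reachable: bool        # is the defective node on a live request path?


def build_envelope(change_surface, defect_node):
    surface = frozenset(change_surface)
    radius = frozenset(reachable(IMPACT, surface) - surface)
    live = reachable(TRAFFIC, ENTRY_POINTS)
    exposed = frozenset((surface | radius) & CRITICAL)
    return ScopeEnvelope(surface, radius, exposed, defect_node in live)


cart = build_envelope({"cart-service"}, "cart-service")
admin = build_envelope({"admin-tool"}, "admin-tool")
```

Under these assumed graphs, the cart-service envelope exposes checkout and payment authorization, while the admin-tool envelope exposes nothing critical. The two proposals are distinguishable before anyone is asked to approve, and the number of changed lines never enters the calculation.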
Make every fix reversible by construction
Approval is only half of control. The other half is reversal. An SRE's real question during an incident is not "is this fix correct?" (you cannot always know that in the moment) but "if it is wrong, how fast and how cleanly can I undo it?" A remediation system that cannot answer that should not be allowed to act.
Reversibility is a property you design in, not a hope you hold afterward. Three mechanisms make a fix safe to apply:
- A verified rollback path. Before a fix is applied, the control layer should know the exact reverse operation and have validated it against the same scope. A fix with no clean inverse is, by definition, a Tier-4 irreversible operation and routes to the strictest gate.
- Re-validation on the same scope. This is the Verify stage of the loop. The proposed fix is re-tested against the dependency surface it touches, not against a stale aggregate suite. Coordinated Testing Fleets plan and execute validation for *this* change as the system evolves, so "it passed" means the changed path was actually exercised, not that some unrelated green build flattered the dashboard.
- An immutable evidence record. Every applied fix carries who or what authorized it, what validation ran, and what the rollback was. When the incident review six weeks later asks "why did we ship this," the answer is an artifact, not a memory.
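One way to approximate the evidence record described above is a hash-chained, append-only log, sketched here under stated assumptions. The field names, the `EvidenceRecord` type, and the helper functions are hypothetical. Hash chaining makes edits detectable rather than impossible, so it is one possible technique, not a description of any particular product's implementation.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

GENESIS = "0" * 64  # placeholder hash preceding the first record


@dataclass(frozen=True)
class EvidenceRecord:
    fix_id: str
    authorized_by: str        # who or what authorized the fix
    validation_run: str       # what validation ran
    rollback_operation: str   # what the rollback was
    prev_hash: str            # digest of the previous record in the log

    def digest(self) -> str:
        # Canonical JSON (sorted keys) so the same record always hashes the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_record(log, fix_id, authorized_by, validation_run, rollback_operation):
    """Return a new log with one more record chained to the previous one."""
    prev = log[-1].digest() if log else GENESIS
    record = EvidenceRecord(fix_id, authorized_by, validation_run, rollback_operation, prev)
    return log + [record]


def verify_chain(log):
    """True only if every record points at the digest of its predecessor."""
    prev = GENESIS
    for record in log:
        if record.prev_hash != prev:
            return False
        prev = record.digest()
    return True


log = append_record([], "fix-001", "alice@example.com", "cart-path-suite", "revert change 1")
log = append_record(log, "fix-002", "policy:auto-apply", "admin-tool-suite", "revert change 2")

# Editing an earlier record breaks the link held by the record that follows it.
tampered = [replace(log[0], authorized_by="someone-else"), log[1]]
```

Here `verify_chain(log)` succeeds and `verify_chain(tampered)` fails. A chain like this only detects edits to the final record if the latest digest is also stored somewhere the log's writers cannot modify, which is a design decision this sketch leaves open.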
For retail teams operating under PCI scope or inside a customer boundary, the evidence bar is higher and the data cannot leave. Edge Runners run the remediation loop as signed capsules inside secure enclaves and emit audit-ready evidence from within the boundary, so the record survives a compliance review instead of living in a CI log someone can edit.
What to do Monday morning
You do not need to hand the keys to a fleet to start. You need to make one class of fix governed end to end and prove the envelope holds.
- Write your irreversible-operation list. Schema migrations, payment-path changes, anything touching auth or regulated data. This is the list you are conservative about; everything off it is a candidate for automation.
- Require a rollback path as an approval precondition. No verified inverse, no auto-apply. Make it a hard policy check, not a reviewer's discretion.
- Pick one low-criticality service and let the fleet auto-apply behind evidence checks. Measure whether anything breaks. It usually does not, and you learn your scoping is sound before you trust it on checkout.
- Drive tiering from the graph, not the diff. Replace line-count and author heuristics with blast-radius signals from your dependency model.
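The checklist above can be sketched as a single hard policy check. This is a minimal illustration, not a definitive policy engine: the operation names, the `Proposal` fields, and the `gate` function are hypothetical, and a real deployment would likely express the same rules in its existing policy-as-code tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPLY = "auto_apply"
    HUMAN_APPROVAL = "human_approval"


# Hypothetical irreversible-operation list; write your own.
IRREVERSIBLE_OPERATIONS = {
    "schema_migration",
    "payment_path_change",
    "auth_change",
    "regulated_data_change",
}


@dataclass(frozen=True)
class Proposal:
    operation: str
    rollback_verified: bool
    critical_nodes_in_blast_radius: frozenset  # taken from the dependency graph
    lines_changed: int = 0                     # recorded, deliberately never consulted


def gate(proposal):
    """Return (decision, reasons). Any single reason blocks auto-apply."""
    reasons = []
    if proposal.operation in IRREVERSIBLE_OPERATIONS:
        reasons.append("operation is on the irreversible list")
    if not proposal.rollback_verified:
        reasons.append("no verified rollback path")
    if proposal.critical_nodes_in_blast_radius:
        reasons.append("blast radius includes critical nodes")
    decision = Decision.HUMAN_APPROVAL if reasons else Decision.AUTO_APPLY
    return decision, reasons


safe = gate(Proposal("config_tweak", True, frozenset(), lines_changed=400))
risky = gate(Proposal("config_tweak", True, frozenset({"checkout"}), lines_changed=1))
no_inverse = gate(Proposal("config_tweak", False, frozenset()))
```

In this sketch the 400-line change auto-applies while the one-line change that reaches checkout routes to a human, because the gate reads blast radius and rollback status and ignores diff size.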
The bottom line
