CI Is Green and the Release Is Still Broken: A Reliability Post-Mortem

A reliability post-mortem where every static check passed and the release still broke. Why green CI lies, and what change-aware, dependency-grounded validation does instead.

Zof Reliability Team · Engineering & product

December 2, 2025 · 7 min read · Updated December 2, 2025

The incident: every check was green

Consider a hypothetical payments platform. A team merges a change to a transaction-enrichment service. The diff is small: it adds a field to an outbound event and bumps a shared serialization library to pick up a security patch.

CI runs the full battery. Unit tests pass. Integration tests pass against the service's mocked dependencies. Linting, type checks, SAST, license scan, and the dependency-vulnerability gate all come back clean. Coverage is up two points. The pull request is approved in four minutes. The pipeline is green end to end, and the change ships on a Friday afternoon because the dashboard gave everyone permission to relax.

Ninety minutes later, a downstream reconciliation job starts silently dropping a subset of events. No alarms fire, because nothing crashed and latency looks normal. The first real signal arrives the next morning as a finance discrepancy, and the team spends the weekend reconstructing what happened.

Why every gate passed and the system still broke

The post-mortem is uncomfortable because there is no smoking gun. Nobody was negligent. Every check did exactly what it was designed to do. The failure lived in the space between the checks.

The serialization bump changed wire behavior, not API surface. The new library version reordered how optional fields were encoded. Type checks and unit tests passed because the contract in the service's own repo was unchanged. The break was in a *consumer* three hops downstream that parsed the payload positionally.
Integration tests ran against mocks frozen in time. The mocked downstream returned what the team expected the downstream to return, which is not what the live downstream actually did after its own recent change. The test suite validated a system that no longer existed.
The vulnerability gate confirmed the patch was safe to adopt. It had no opinion on whether adopting it was safe for *this* dependency graph. Those are different questions, and only one of them was being asked.
No check was change-aware. Every gate evaluated the diff in isolation or the repo in isolation. Nothing evaluated the diff against the live dependency reality it was about to enter.
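
The first failure mode above can be reproduced in a few lines. This is a minimal sketch, not the incident's actual code: the field names, the pipe-delimited wire format, and the `encode_v1` / `encode_v2` functions are hypothetical stand-ins for a serialization library that reorders optional fields between versions.

```python
# Hypothetical event; field names are illustrative only.
EVENT = {"txn_id": "t-1001", "amount": "25.00", "currency": "USD", "memo": "refund"}

def encode_v1(event):
    # "Old" library version: optional fields emitted as currency, then memo.
    return "|".join([event["txn_id"], event["amount"], event["currency"], event["memo"]])

def encode_v2(event):
    # "New" library version: same fields and types, but the optional
    # fields are emitted in a different order (memo before currency).
    return "|".join([event["txn_id"], event["amount"], event["memo"], event["currency"]])

def parse_positional(payload):
    # Fragile consumer: assumes index 2 is always the currency.
    parts = payload.split("|")
    return {"txn_id": parts[0], "amount": parts[1], "currency": parts[2]}

def reconcile(payload, supported=("USD", "EUR")):
    # An unrecognized currency is skipped rather than raised.
    # Nothing crashes, so nothing alerts: this is the silent drop.
    record = parse_positional(payload)
    return record if record["currency"] in supported else None

print(reconcile(encode_v1(EVENT)))  # a parsed record
print(reconcile(encode_v2(EVENT)))  # None: the event is dropped
```

Both encoders would satisfy the same type checks and the same producer-side unit tests. Only the consumer's unwritten assumption about position tells them apart.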

This is the structural problem, not a tooling gap you can close by adding one more linter. Static checks are change-*present* but dependency-*blind*. They see the change clearly and the system not at all.

The scale that makes this routine, not rare

A decade ago this incident would have been a near-miss caught by an engineer who happened to know the downstream consumer parsed positionally. That tribal knowledge does not scale to current change volume.

Roughly 41% of code is now AI-generated, and around 45% of AI coding tasks introduce critical flaws or security issues. Throughput and defect density went up at the same time. Worse, AI-generated changes are exactly the kind that look locally correct and pass static checks while violating an assumption nobody wrote down. The generated diff compiles and the types check, but the model had no view of the positional parser three services away. Green CI on AI-authored code is a weaker signal than green CI on hand-written code, precisely when teams are leaning on it more.

The aggregate cost of poor software quality is estimated at around $2.41 trillion a year in the US alone (CISQ's 2022 estimate). A meaningful share of that is not dramatic outages. It is exactly this: changes that passed every gate and broke something the gate could not see.

What change-aware, dependency-grounded validation actually means

The fix is not "more tests." A bigger static suite is a bigger snapshot of a system that keeps moving. The fix is to validate every change against a live model of the system it is entering. Two capabilities have to be present, and most pipelines have neither.

First, a live dependency and context map. You cannot reason about blast radius from memory or from a wiki diagram that was accurate two quarters ago. A System Graph that knows the enrichment service feeds the reconciliation job, that the job parses positionally, and that a serialization bump is therefore reachable from a money-moving path is what turns "the diff looks fine" into "this diff touches a consumer with a fragile contract." This is what makes validation change-aware: it scopes every check to the real surface the change reaches, not the repo it lives in.
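
To make the blast-radius idea concrete, here is a minimal sketch of a dependency walk. It is not the System Graph itself: the service names, the `EDGES` and `ANNOTATIONS` data, and both functions are hypothetical, and a real map would be built from live traffic or deploy metadata rather than a hand-written dict.

```python
from collections import deque

# Hypothetical dependency edges: producer -> direct consumers.
EDGES = {
    "transaction-enrichment": ["event-bus"],
    "event-bus": ["ledger-writer", "fraud-scoring"],
    "ledger-writer": ["reconciliation-job"],
    "fraud-scoring": [],
    "reconciliation-job": [],
}

# Hypothetical contract annotations recorded per consumer.
ANNOTATIONS = {
    "reconciliation-job": {"parses": "positional", "moves_money": True},
    "fraud-scoring": {"parses": "by-name", "moves_money": False},
}

def blast_radius(graph, changed):
    """Breadth-first walk: return {service: hops} for everything downstream of `changed`."""
    hops = {changed: 0}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            if consumer not in hops:
                hops[consumer] = hops[node] + 1
                queue.append(consumer)
    del hops[changed]  # the changed service is not part of its own blast radius
    return hops

def fragile_consumers(graph, annotations, changed):
    """Downstream services whose recorded contract is positional, with hop distance."""
    reached = blast_radius(graph, changed)
    return sorted(
        (service, distance)
        for service, distance in reached.items()
        if annotations.get(service, {}).get("parses") == "positional"
    )

print(fragile_consumers(EDGES, ANNOTATIONS, "transaction-enrichment"))
# [('reconciliation-job', 3)]
```

The walk itself is trivial. The value is in the data: the answer is only as good as the edges and annotations are current.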

Second, validation that plans and adapts instead of replaying. Static scripts and frozen mocks decay the moment the system around them changes, which is continuously. Testing Fleets are coordinated agents that plan, execute, observe, and maintain validation as the system evolves. In this incident, that means generating a contract test against the *current* downstream behavior, exercising the actual serialization path, and flagging the positional-parse mismatch before merge, because the validation was grounded in the live dependency rather than a recorded expectation.
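
The difference between a recorded expectation and current behavior can be sketched as a drift check. This is an illustration under stated assumptions, not how Testing Fleets are implemented: `frozen_mock_consumer` and `live_consumer` are hypothetical stand-ins, and in practice the "live" side would come from the real downstream or a fresh capture of it.

```python
def frozen_mock_consumer(payload):
    # Recorded expectation: the team believes downstream reads fields by name.
    return {"txn_id": payload["txn_id"], "currency": payload["currency"]}

def live_consumer(payload):
    # Stand-in for current downstream behavior: it reads by position.
    values = list(payload.values())  # dicts preserve insertion order
    return {"txn_id": values[0], "currency": values[2]}

def contract_drift(payload, recorded, live):
    """Return {field: (recorded_value, live_value)} wherever the two disagree."""
    expected, actual = recorded(payload), live(payload)
    return {
        key: (expected[key], actual.get(key))
        for key in expected
        if expected[key] != actual.get(key)
    }

# Field order as emitted before and after the hypothetical library bump.
before = {"txn_id": "t-1001", "amount": "25.00", "currency": "USD", "memo": "refund"}
after = {"txn_id": "t-1001", "amount": "25.00", "memo": "refund", "currency": "USD"}

print(contract_drift(before, frozen_mock_consumer, live_consumer))  # {}
print(contract_drift(after, frozen_mock_consumer, live_consumer))
# {'currency': ('USD', 'refund')}
```

A test that only ever calls the frozen mock passes for both payloads. The mismatch shows up only when the candidate payload is run against what the downstream does today.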

The contrast in one line: static checks ask "is this code correct in isolation?" Change-aware validation asks "is this specific change safe for this specific system right now, and what is the evidence?"

Reachability is what makes the answer usable

Change-aware validation can surface more signal, and an SRE's reasonable fear is that this just means more noise. That is where reachability prioritization earns its place. The point is not to list every theoretical defect; it is to surface the ones reachable from a live entry point. Reachability-based prioritization can cut the exploitable exposure a team has to triage by 70 to 90%. The serialization mismatch matters not because it exists but because it is reachable from a path that moves money. A verdict an SRE can read in two minutes beats a backlog nobody reads.
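
Reachability filtering is simple to sketch. The call graph, entry point, and findings below are hypothetical, and real reachability analysis works on far richer data, but the shape of the filter is the same: keep only findings on components a live entry point can reach.

```python
def reachable_from(graph, entry_points):
    """Depth-first walk: every component reachable from any live entry point."""
    seen = set(entry_points)
    stack = list(entry_points)
    while stack:
        node = stack.pop()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen

# Hypothetical call graph. "admin_report" is not wired to any live entry point.
CALLS = {
    "POST /payments": ["enrich"],
    "enrich": ["serialize"],
    "serialize": ["reconcile"],
    "admin_report": ["legacy_export"],
}

# Hypothetical findings from static and dynamic checks.
FINDINGS = [
    {"id": "F1", "component": "reconcile", "issue": "positional parse mismatch"},
    {"id": "F2", "component": "legacy_export", "issue": "deprecated cipher"},
    {"id": "F3", "component": "serialize", "issue": "optional field reordering"},
]

def prioritize(findings, graph, entry_points):
    live = reachable_from(graph, entry_points)
    return [finding for finding in findings if finding["component"] in live]

triage = prioritize(FINDINGS, CALLS, ["POST /payments"])
print([finding["id"] for finding in triage])  # ['F1', 'F3']
```

F2 is still a real finding. It just does not belong at the top of a release verdict for a change that cannot reach it.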

This also addresses the human reality behind most bypassed gates. Around 80% of developers route around policy and guardrails, and they do it for a rational reason: slow, noisy, or vague gates tax velocity without obviously protecting anything. A gate that produces a short list of reachable, change-scoped risks is one engineers route *through*, not around. Reliability Analytics then turns the accumulated evidence into the metrics that actually govern the gate: time-to-validate a change, reachable-risk trend, escaped-defect rate.
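
Two of those metrics can be computed from very little data. The sketch below assumes one common definition of escaped-defect rate (defects found after release divided by all defects found) and uses the median for time-to-validate; the records and field names are hypothetical, and a team may reasonably define either metric differently.

```python
from statistics import median

# Hypothetical per-change records.
CHANGES = [
    {"id": "c1", "validate_minutes": 6, "defects_caught": 1, "defects_escaped": 0},
    {"id": "c2", "validate_minutes": 4, "defects_caught": 0, "defects_escaped": 1},
    {"id": "c3", "validate_minutes": 11, "defects_caught": 2, "defects_escaped": 0},
]

def escaped_defect_rate(changes):
    """Share of all known defects that were found only after release."""
    caught = sum(change["defects_caught"] for change in changes)
    escaped = sum(change["defects_escaped"] for change in changes)
    total = caught + escaped
    return escaped / total if total else 0.0

def median_time_to_validate(changes):
    """Median minutes from change submitted to validation verdict."""
    return median(change["validate_minutes"] for change in changes)

print(escaped_defect_rate(CHANGES))      # 0.25
print(median_time_to_validate(CHANGES))  # 6
```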

Where remediation and governance fit

When validation does catch the mismatch, the next question is who fixes it and on what authority. Remediation Fleets can propose a fix (an adapter on the consumer or a pin on the serialization version) with the reproduction attached. They do not ship it silently. Agents propose; humans authorize. Unsupervised autonomous fixing inside a release path would be reckless; the governance layer (policy, named approval, and an audit trail) is the engineering, not an afterthought. For teams that cannot send code or telemetry to a vendor cloud, the same loop runs inside your boundary via Edge Runners as signed capsules with audit-ready evidence.
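
"Agents propose; humans authorize" can be expressed as a small state machine. This is a sketch of the pattern only, not the Remediation Fleets API: the `Proposal` class, the function names, and the approver address are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical policy: the named humans allowed to approve fixes.
ALLOWED_APPROVERS = {"release-owner@example.com"}

@dataclass
class Proposal:
    change: str
    reproduction: str
    approved_by: Optional[str] = None
    audit_log: list = field(default_factory=list)  # (action, actor) pairs

def propose(change, reproduction, agent):
    # An agent may only create a proposal; it cannot approve or apply one.
    proposal = Proposal(change, reproduction)
    proposal.audit_log.append(("proposed", agent))
    return proposal

def authorize(proposal, approver):
    if approver not in ALLOWED_APPROVERS:
        proposal.audit_log.append(("rejected", approver))
        raise PermissionError(f"{approver} is not a named approver")
    proposal.approved_by = approver
    proposal.audit_log.append(("approved", approver))

def apply_fix(proposal):
    # Refuse to ship anything without a named human approval on record.
    if proposal.approved_by is None:
        proposal.audit_log.append(("blocked", "system"))
        raise PermissionError("proposal has no named approver")
    proposal.audit_log.append(("applied", proposal.approved_by))
    return f"applied: {proposal.change}"

fix = propose("pin serialization library to previous version",
              "replay of event t-1001 through reconciliation", "remediation-agent")
authorize(fix, "release-owner@example.com")
print(apply_fix(fix))
print(fix.audit_log)
```

Every path, including the refused ones, leaves an entry in the log. That is what makes the trail usable as evidence later.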

What to do Monday morning

Stop trusting green as readiness. Treat passing static checks as necessary, not sufficient. Write down, for one critical path, what "validated" means beyond a green pipeline.
Map one blast radius for real. Pick your highest-stakes service and document its live downstream consumers and their contract assumptions. The fragile positional parser in your stack is already there; find it before it finds you.
Audit your mocks for decay. Identify integration tests whose mocked dependencies have not been reconciled against live behavior in a quarter. Those are silent liabilities.
Measure escaped defects, not coverage. Coverage flatters the dashboard. Escaped-defect rate and time-to-validate tell you whether your gates are catching what matters.
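
The mock audit in particular is easy to start. A minimal sketch, assuming you can record when each mocked dependency was last reconciled against live behavior; the test names, dates, and the 90-day threshold standing in for "a quarter" are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical inventory: one row per integration test that relies on a mock.
MOCKS = [
    {"test": "test_enrichment_publishes_event",
     "dependency": "reconciliation-job",
     "last_reconciled": date(2025, 5, 14)},
    {"test": "test_fraud_score_lookup",
     "dependency": "fraud-scoring",
     "last_reconciled": date(2025, 11, 3)},
]

def stale_mocks(mocks, today, max_age_days=90):
    """Tests whose mock has not been checked against live behavior within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [mock["test"] for mock in mocks if mock["last_reconciled"] < cutoff]

print(stale_mocks(MOCKS, today=date(2025, 12, 2)))
# ['test_enrichment_publishes_event']
```

The hard part is not the query. It is recording `last_reconciled` honestly in the first place.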

The bottom line

