Mean Time to Reproduce: The Most Underrated Reliability KPI
Why mean time to reproduce, not just MTTR-to-resolve, is the real reliability bottleneck, and how to instrument it with a change-aware System Graph.
The metric hiding inside MTTR
MTTR is a composite, and treating it as one number flattens the part you can most affect. Decompose any incident and you get roughly four spans: time to detect, time to reproduce, time to fix, and time to verify. Detection gets the dashboards and the alerting budget. The fix gets the engineering glory. Verification gets a green build. The span in the middle (getting the failure to happen again, on demand, in an environment you control) gets almost no instrumentation at all, despite often being the longest.
That blind spot is not an accident. Reproduction is hard to measure because it is messy work: re-running with production-like state, chasing a race condition, reconstructing the exact dependency versions and feature flags that were live when the page fired. It rarely produces a clean event you can timestamp. So it disappears into "investigating," and "investigating" is where MTTR quietly bloats.
Call the span what it is: mean time to reproduce (MTTRepro), the elapsed time from a confirmed failure signal to a deterministic, re-runnable reproduction of that failure. It deserves its own line on the board, because you cannot manage what you refuse to name.
Why reproduction is the real bottleneck
Three forces are making reproduction the dominant cost in incident response, and all three are accelerating.
The first is composition. Modern failures are rarely a single bad line. They emerge from the interaction of a service, a dependency that bumped a minor version, a config that drifted, and a load pattern that only appears at peak. Reproducing that means reconstructing a *state*, not finding a *statement*. The search space is the system, not the diff.
The second is the rate of change. Roughly 41% of codebases are now AI-generated, and industry research puts the rate at which AI coding tasks introduce critical flaws or security issues near 45%. More change per day means more candidate causes per incident, and a wider gap between the code you are reading and the code that was running when the failure occurred. Reproduction gets harder precisely as the volume of things to reproduce goes up.
The third is environment drift. The failure happened in production, against real data, under real concurrency. Your reproduction attempt happens in staging, against synthetic data, single-threaded, with a different feature-flag matrix. Every gap between those two worlds is a place where the bug hides. Much of MTTRepro is not investigation at all; it is the labor of closing that gap by hand, one variable at a time.
When reproduction is slow, everything downstream inherits the delay. A fix you cannot reproduce against is a guess. A verification you cannot anchor to a reproduced failure is theater. The reproduction span is load-bearing for the entire loop.
What slow reproduction actually costs
The cost is not only the wall-clock hours on one incident. Slow reproduction degrades the quality of every decision that follows it.
- It pushes teams toward speculative fixes. When reproduction drags, the pressure to "just ship something and watch" wins. Speculative fixes that are never validated against a real reproduction are how you turn one incident into two.
- It corrupts prioritization. A bug nobody can reproduce gets reclassified as "not reproducible" and aged out of the backlog, even when it is firing for real users. The hardest-to-reproduce failures are frequently the most expensive ones.
- It taxes your best engineers. Reproduction is senior work; it is the people who hold the system's mental model who get pulled into it. That is the most expensive labor you have, spent on reconstruction rather than design.
Set against the macro number (industry research puts the cost of poor software quality near $2.41 trillion), the reproduction tax is not a rounding error. It is a structural drain that no resolve-time dashboard will ever show you.
Instrumenting mean time to reproduce
You cannot improve MTTRepro until you can see it. Start by separating it from the MTTR blob, then attack the work itself.
Mark the boundaries. Add two explicit states to your incident lifecycle: *reproduction started* and *reproduction confirmed* (a deterministic, re-runnable case exists). The delta is MTTRepro. This single change turns an invisible span into a managed one, and it will likely embarrass your current dashboard in a useful way.
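As a minimal sketch of that delta, assuming your incident tool can export lifecycle events as (incident, state, timestamp) tuples: the state names `reproduction_started` and `reproduction_confirmed`, the incident IDs, and the timestamps below are illustrative, not any particular tool's schema.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical lifecycle export: (incident_id, state, ISO-8601 timestamp).
EVENTS = [
    ("INC-1", "reproduction_started", "2024-05-01T10:00:00"),
    ("INC-1", "reproduction_confirmed", "2024-05-01T13:30:00"),
    ("INC-2", "reproduction_started", "2024-05-03T09:00:00"),
    ("INC-2", "reproduction_confirmed", "2024-05-03T09:30:00"),
    ("INC-3", "reproduction_started", "2024-05-04T08:00:00"),  # never confirmed
]


def reproduction_spans(events):
    """Return {incident_id: timedelta} for incidents that have both states."""
    started, confirmed = {}, {}
    for incident_id, state, timestamp in events:
        when = datetime.fromisoformat(timestamp)
        if state == "reproduction_started":
            started[incident_id] = when
        elif state == "reproduction_confirmed":
            confirmed[incident_id] = when
    return {i: confirmed[i] - started[i] for i in started if i in confirmed}


def mean_time_to_reproduce(events):
    """Average the reproduction spans; None when nothing was reproduced."""
    spans = reproduction_spans(events)
    if not spans:
        return None
    return timedelta(seconds=mean(s.total_seconds() for s in spans.values()))


print(mean_time_to_reproduce(EVENTS))
```

Incidents that never reach *reproduction confirmed* drop out of the average here. In practice you would likely want to count them separately, since they are the "not reproducible" cases that otherwise vanish.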
Capture state, not just logs. Most reproduction time is spent reconstructing the conditions of failure. The fix is to capture those conditions when the failure happens (the dependency graph, config, flags, and relevant input shape) so a reproduction starts from recorded reality instead of a clean-room guess.
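One way to picture a captured state, as a sketch only: the `FailureSnapshot` record, its field names, and the `drift` helper are hypothetical, and a real capture would pull these values from your own deploy and flag systems.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureSnapshot:
    """Hypothetical record of the conditions live when a failure fired."""
    incident_id: str
    dependency_versions: dict
    config: dict
    feature_flags: dict
    input_shape: dict

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # Short content hash, handy for attaching the snapshot to a ticket.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


def drift(snapshot: FailureSnapshot, environment: dict) -> dict:
    """Map each differing setting to (captured value, environment value)."""
    differences = {}
    for section in ("dependency_versions", "config", "feature_flags"):
        captured = getattr(snapshot, section)
        actual = environment.get(section, {})
        for key in sorted(set(captured) | set(actual)):
            if captured.get(key) != actual.get(key):
                differences[f"{section}.{key}"] = (captured.get(key), actual.get(key))
    return differences


snapshot = FailureSnapshot(
    incident_id="INC-1",
    dependency_versions={"libpay": "2.4.1"},
    config={"timeout_ms": 500},
    feature_flags={"new_checkout": True},
    input_shape={"items": "list[int]"},
)
staging = {
    "dependency_versions": {"libpay": "2.4.0"},
    "config": {"timeout_ms": 500},
    "feature_flags": {"new_checkout": False},
}
print(drift(snapshot, staging))
```

The `drift` output is the list of gaps between the failing environment and the reproduction environment, so the team sees up front what it would otherwise discover one variable at a time.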
Make reproduction change-aware. This is where a System Graph collapses the span. A live map of services, dependencies, and CI/CD topology lets you answer the question reproduction actually turns on: given this failure, what changed in the blast radius, and which of those changes are reachable from the failing path? Reachability-based prioritization can mean 70-90% less exploitable exposure to triage, and the same logic compresses reproduction: you stop bisecting the whole system and start with the handful of changes that could plausibly produce this failure. The graph narrows the search space from "everything" to "these few things."
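The narrowing step can be sketched as a plain graph traversal. This is an illustration of the reachability idea, not the System Graph's actual implementation: the service names, the `DEPENDS_ON` map, and the change records are all made up for the example.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["ledger"],
    "inventory": [],
    "ledger": [],
    "search": ["catalog"],
    "catalog": [],
}

# Hypothetical recent changes, each tagged with the service it touched.
RECENT_CHANGES = [
    {"id": "chg-101", "service": "ledger"},
    {"id": "chg-102", "service": "catalog"},
    {"id": "chg-103", "service": "checkout"},
]


def reachable_from(graph, start):
    """Breadth-first search: every service on a path from the failing one."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependency in graph.get(node, []):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


def candidate_changes(graph, failing_service, changes):
    """Keep only the changes that landed on a reachable service."""
    scope = reachable_from(graph, failing_service)
    return [change for change in changes if change["service"] in scope]


for change in candidate_changes(DEPENDS_ON, "checkout", RECENT_CHANGES):
    print(change["id"], change["service"])
```

With a failure in `checkout`, the change to `catalog` falls out of scope because nothing on the failing path reaches it, leaving two candidates instead of three. On a real system the same filter would run over a much larger change set.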
Run reproduction where the failure lives. For regulated and security-sensitive teams, the realistic state needed to reproduce a failure cannot leave the customer boundary. Edge Runners are signed capsules that execute inside secure enclaves and produce audit-ready evidence, so the reproduced failure is both real and provable, not a screenshot pasted into a ticket. A reproduction you can attach as evidence is worth more than one you have to describe.
How the control layer collapses the span
Instrumenting MTTRepro tells you where the time goes. A control layer is what shortens it, by making reproduction a continuous capability rather than a heroic one-off.
In the closed loop (Understand, Test, Reproduce, Remediate, Verify), reproduction is a named stage with a named owner, not an improvised scramble. Testing Fleets are coordinated agents that plan, execute, observe, and maintain validation as the system evolves. When the graph reports a change in the blast radius, the fleet can drive toward a reproduction against realistic conditions, rather than waiting for an engineer to hand-build one. The output is the deterministic case that seeds a fix and anchors verification.
This is also where the governance line matters. The point is not to let agents reproduce *and* remediate unsupervised; that is the reckless version of autonomy a serious enterprise has no use for. The discipline is: agents propose, humans authorize. Fleets do the expensive reconstruction work and produce a provable reproduction; a named human still authorizes any fix that follows, through Governance: policy, approval, and an audit trail. Reproduction is exactly the right place to spend autonomy: it is high-effort, low-judgment, evidence-producing work. Remediation is where human authority stays load-bearing.
The result is a reliability metric you can finally move. When reproduction is continuous and change-aware, MTTRepro stops being the silent majority of your MTTR and becomes a span you compress on purpose.
What to do Monday morning
- Decompose your last ten incidents into detect / reproduce / fix / verify. You will likely find reproduction is the largest span and the least instrumented.
- Add *reproduction started* and *reproduction confirmed* states to your incident process this week. Begin recording the delta.
- Audit whether your team can reconstruct the failing state from captured evidence, or whether they rebuild it by hand every time.
- Ask whether your validation is change-aware. If reproduction means searching the whole system rather than the blast radius, you are paying the full tax.
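For the first item, the decomposition can be as simple as the sketch below. It assumes each incident record carries five timestamps. The key names and the sample incident are hypothetical, and the reproduce span is measured from detection to a confirmed reproduction.

```python
from datetime import datetime

# Each span is (name, starting timestamp key, ending timestamp key).
SPANS = [
    ("detect", "started", "detected"),
    ("reproduce", "detected", "reproduced"),
    ("fix", "reproduced", "fixed"),
    ("verify", "fixed", "verified"),
]

# Hypothetical incident record; real data would come from your incident tool.
INCIDENT = {
    "started": "2024-05-01T10:00:00",
    "detected": "2024-05-01T10:10:00",
    "reproduced": "2024-05-01T14:10:00",
    "fixed": "2024-05-01T15:10:00",
    "verified": "2024-05-01T15:40:00",
}


def decompose(incident):
    """Return the length of each span in minutes."""
    times = {key: datetime.fromisoformat(value) for key, value in incident.items()}
    return {
        name: (times[end] - times[start]).total_seconds() / 60
        for name, start, end in SPANS
    }


def largest_span(incident):
    """Name the span that consumed the most time."""
    spans = decompose(incident)
    return max(spans, key=spans.get)


print(decompose(INCIDENT))
print(largest_span(INCIDENT))
```

Running this over ten real incidents and tallying `largest_span` is a quick way to test whether reproduction really dominates on your team.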