Remediation Fleet Benchmarks
Measure reproduction speed, root-cause quality, fix proposal safety, and validation reliability, without publishing unverified fix success rates.
Remediation fleets only create enterprise value if they shorten incident cycles while respecting approval policy. Buyers need benchmarks that score governance and verification, not auto-merge hype.
What this suite tracks
| Metric | Unit | Definition |
|---|---|---|
| Time to reproduce bug | minutes | Wall-clock from incident signal to minimal reproducing path with evidence. |
| Time to root-cause | minutes | Time to attach graph-backed hypothesis with supporting telemetry. |
| Time to generate candidate fix | minutes | Time to staged proposal with diff, tests, and rollback plan. |
| Human approval cycle time | minutes | Elapsed time in approval queue excluding engineer idle time. |
| Fix validation success rate | rate | Share of approved fixes that pass verify-after-fix suites. |
| Rollback / verification reliability | rate | Successful rollback or verification when validation fails. |
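As a rough illustration, the timing metrics above could be derived from timestamped run events. The sketch below is not the suite's implementation: the event names (`incident_signal`, `repro_confirmed`, `hypothesis_attached`, `proposal_staged`, `approval_recorded`) are hypothetical, and it assumes each phase is measured from the end of the previous one, which this page does not specify. The timestamps are made-up sample values, not benchmark results.

```python
from datetime import datetime
from typing import Dict, Optional

# Hypothetical event names and phase boundaries (assumptions, not the suite's schema).
PHASE_BOUNDS = {
    "time_to_reproduce": ("incident_signal", "repro_confirmed"),
    "time_to_root_cause": ("repro_confirmed", "hypothesis_attached"),
    "time_to_candidate_fix": ("hypothesis_attached", "proposal_staged"),
    "approval_cycle_time": ("proposal_staged", "approval_recorded"),
}


def phase_minutes(events: Dict[str, str]) -> Dict[str, Optional[float]]:
    """Return minutes spent per phase; None when a boundary event is missing."""
    parsed = {name: datetime.fromisoformat(ts) for name, ts in events.items()}
    result: Dict[str, Optional[float]] = {}
    for metric, (start, end) in PHASE_BOUNDS.items():
        if start in parsed and end in parsed:
            result[metric] = (parsed[end] - parsed[start]).total_seconds() / 60
        else:
            # The run never reached this phase, so no duration is reported.
            result[metric] = None
    return result


# Made-up sample events for one run that has not yet been approved.
example_events = {
    "incident_signal": "2024-01-01T10:00:00",
    "repro_confirmed": "2024-01-01T10:12:30",
    "hypothesis_attached": "2024-01-01T10:30:30",
    "proposal_staged": "2024-01-01T11:00:30",
}
timings = phase_minutes(example_events)
```

Returning `None` for an unreached phase, rather than zero, keeps incomplete runs from dragging timing averages down.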
How we measure
Success requires reproducible steps, graph context, staged proposals, recorded approvals, and verify-after-fix execution. Policy violations fail the run regardless of fix quality.
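The pass/fail rule above can be sketched as a small scoring function. This is a minimal sketch under stated assumptions: the `RunRecord` type, its field names, and the evidence labels are hypothetical and only mirror the wording of the rule. The policy check runs first, so a violation fails the run even when all evidence is present.

```python
from dataclasses import dataclass, field
from typing import Set

# Evidence items named in the success rule; the string labels are assumptions.
REQUIRED_EVIDENCE = (
    "reproducible_steps",
    "graph_context",
    "staged_proposal",
    "recorded_approval",
    "verify_after_fix",
)


@dataclass
class RunRecord:
    """Hypothetical record of one benchmark run."""
    evidence: Set[str] = field(default_factory=set)
    policy_violations: int = 0


def score_run(run: RunRecord) -> str:
    """Return 'pass' or a failure label for a single run."""
    # Policy violations fail the run regardless of fix quality.
    if run.policy_violations > 0:
        return "fail:policy"
    missing = [item for item in REQUIRED_EVIDENCE if item not in run.evidence]
    if missing:
        return "fail:missing:" + ",".join(missing)
    return "pass"


complete = set(REQUIRED_EVIDENCE)
clean_result = score_run(RunRecord(evidence=complete))
violation_result = score_run(RunRecord(evidence=complete, policy_violations=1))
partial_result = score_run(RunRecord(evidence=complete - {"verify_after_fix"}))
```

Here `clean_result` is `"pass"`, `violation_result` is `"fail:policy"`, and `partial_result` names the missing evidence item.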
| Parameter | Details |
|---|---|
| Test environment | Sanitized production-like fixtures with injected defects, policy engine enabled, staging deploy target, evidence store, and approval workflow mirroring enterprise defaults. |
| Dataset / workload | Curated incident narratives spanning UI regressions, API contract breaks, race conditions, and config drift. Adversarial scenarios test policy bypass attempts. |
| Sample size | Minimum 25 incidents × 2 policy profiles (to be confirmed at first run). |
| Number of runs | 3 attempts per incident with fixed seeds; failures classified by phase (repro, RCA, proposal, verify). |
| Variance | Not yet measured. Future runs will report p50, p95, and coefficient of variation. |
| Excluded runs | None defined until first benchmark run is completed. |
| Date last run | Pending first benchmark run |
| Version tested | Pending first benchmark run |
| Repeatability | Incident pack version, policy hash, and agent versions are pinned. Evidence bundles export for third-party replay. |
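The variance figures named in the table (p50, p95, and coefficient of variation) could be computed along the following lines. This is a sketch, not the suite's reporting code: it assumes the nearest-rank percentile definition and the sample standard deviation, neither of which this page specifies, and the input durations are made-up values.

```python
import math
import statistics
from typing import Dict, List


def nearest_rank_percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile (an assumed definition; others interpolate)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Multiply before dividing so exact multiples of 100 stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values: List[float]) -> Dict[str, float]:
    """Summarize per-run durations (minutes) for one metric."""
    mean = statistics.mean(values)
    # Sample standard deviation needs at least two observations.
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {
        "n": len(values),
        "p50": nearest_rank_percentile(values, 50),
        "p95": nearest_rank_percentile(values, 95),
        "cv": stdev / mean if mean else float("nan"),
    }


# Made-up durations in minutes, for illustration only.
summary = summarize([10.0, 12.0, 14.0, 16.0, 18.0])
```

With only a handful of runs per incident, p95 collapses to the maximum under nearest-rank, so percentiles are more informative when pooled across incidents.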
Assumptions
- No auto-apply without explicit approval in benchmark profile.
- Verify-after-fix runs use the same fleet configuration as detection.
- Synthetic incidents may omit org-specific runbooks.
Results pending first benchmark run
This page will not display performance numbers until completed runs pass validation. Once published, results will include confidence ranges and sample sizes.
| Metric | Value | Confidence range | Notes |
|---|---|---|---|
| Time to reproduce bug | Pending | - | Awaiting completed runs |
| Time to root-cause | Pending | - | Awaiting completed runs |
| Time to generate candidate fix | Pending | - | Awaiting completed runs |
| Human approval cycle time | Pending | - | Awaiting completed runs |
| Fix validation success rate | Pending | - | Awaiting completed runs |
| Rollback / verification reliability | Pending | - | Awaiting completed runs |
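For the two rate metrics, one common way to attach a confidence range to a success share is the Wilson score interval. The sketch below assumes that method and a 95% level (`z = 1.96`); this page does not state which interval the published results will use. The counts in the example are placeholders, not benchmark results.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Placeholder counts: 45 validated fixes out of 50 approved fixes.
low, high = wilson_interval(45, 50)
```

Unlike the simple normal approximation, the Wilson interval stays within 0 and 1 and behaves sensibly at small sample sizes, which matters when a policy profile has only a few dozen runs.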
What this benchmark does not claim
- Synthetic incidents may not capture proprietary tooling or change-management constraints.
- Lab policies may differ from your production policy set; map controls during architecture review.
- Until results are published, no fix success rates or speedup percentages are stated.
Enterprise interpretation
Evaluate whether remediation fleets compress reproduction and RCA time while keeping humans in control. Published metrics will separate proposal quality from approval latency.
Continue your evaluation
Evaluate Zof against your reliability requirements
Review methodology, run a structured assessment, or benchmark against your workflow with enterprise architects.
