System Graph Benchmarks
Measure change-impact accuracy, dependency completeness, and targeted test selection, grounded in observed failures, not graph-size vanity metrics.
System Graph's value is measured by whether teams run the right tests for each change. Incomplete graphs waste fleet capacity; inaccurate impact maps miss regressions.
What this suite tracks
| Metric | Unit | What it measures |
|---|---|---|
| Change impact accuracy | score | Precision/recall of predicted impacted services vs observed failures. |
| Impacted service detection | services | Correct identification of services touched by a tagged change. |
| Impacted workflow detection | workflows | Correct identification of user/API workflows affected by change. |
| Targeted test selection accuracy | rate | Share of failures caught by selected tests vs full suite. |
| Dependency mapping completeness | rate | Coverage of known dependencies vs ground-truth architecture inventory. |
| Release-risk scoring usefulness | score | Correlation between risk score bands and observed defect severity. |
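The release-risk metric relates ordered risk bands to observed defect severity. The suite does not state which correlation statistic it uses; the sketch below assumes a Spearman rank correlation, which suits ordinal bands, and every name and data value in it is hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the ranks; None if either side is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


# Hypothetical labels: one entry per tagged change.
risk_bands = [1, 1, 2, 3, 3]   # assumed encoding: 1 = low, 3 = high
severities = [0, 1, 1, 2, 3]   # assumed encoding: 0 = no defect, 3 = critical
rho = spearman(risk_bands, severities)
```

A value of `rho` near 1 would suggest that higher bands track more severe defects; a value near 0 would suggest the bands carry little ordering information.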
How we measure
We score precision/recall for impact sets, selection coverage, and risk band calibration. Graph node counts alone are not reported as success metrics.
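As a minimal sketch of the first two scores, assuming impact sets and test results are available as plain sets of identifiers (the function names and sample data are hypothetical, not part of the suite):

```python
from typing import Optional, Set, Tuple


def precision_recall(
    predicted: Set[str], observed: Set[str]
) -> Tuple[Optional[float], Optional[float]]:
    """Compare a predicted impact set with the services that actually failed.

    Precision is undefined (None) when nothing was predicted;
    recall is undefined (None) when nothing failed.
    """
    true_positives = len(predicted & observed)
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / len(observed) if observed else None
    return precision, recall


def selection_coverage(
    selected_tests: Set[str], full_suite_failures: Set[str]
) -> Optional[float]:
    """Share of full-suite failing tests that the targeted selection also ran."""
    if not full_suite_failures:
        return None
    caught = selected_tests & full_suite_failures
    return len(caught) / len(full_suite_failures)


# Hypothetical data for one tagged change.
predicted = {"checkout", "payments", "search"}
observed = {"checkout", "payments", "inventory"}
precision, recall = precision_recall(predicted, observed)  # 2/3 and 2/3
```

Returning `None` for undefined ratios keeps empty cases from being averaged in as zeros, which would otherwise bias aggregate scores.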
| Parameter | Details |
|---|---|
| Test environment | Multi-service reference application with known dependency graph, connector-fed CI/deploy events, injected failure tags, and exported ground-truth inventory. |
| Dataset / workload | Tagged commits with controlled failure injection across services and workflows. Compare predicted impact sets to executed test outcomes. |
| Sample size | Minimum 40 tagged changes × 3 connector coverage tiers (to be confirmed at first run). |
| Number of runs | 5 graph rebuild cycles per tier; impact predictions frozen per commit SHA. |
| Variance | Not yet measured. Future runs will report p50, p95, and coefficient of variation. |
| Excluded runs | None defined until first benchmark run is completed. |
| Date last run | Pending first benchmark run |
| Version tested | Pending first benchmark run |
| Repeatability | Graph schema version, connector configs, and failure injection manifest published with each benchmark pack. |
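The variance row names p50, p95, and coefficient of variation. One way these could be summarized across the rebuild cycles of a tier is sketched below; the nearest-rank percentile method and the sample scores are assumptions, since the suite does not specify either.

```python
import math
from statistics import mean, stdev


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_runs(scores):
    """Summarize one metric across repeated runs (needs at least two scores)."""
    avg = mean(scores)
    return {
        "p50": percentile_nearest_rank(scores, 50),
        "p95": percentile_nearest_rank(scores, 95),
        # Coefficient of variation: sample standard deviation over the mean.
        "cv": stdev(scores) / avg if avg else None,
    }


# Hypothetical recall scores from 5 rebuild cycles of one tier.
summary = summarize_runs([0.80, 0.82, 0.78, 0.81, 0.79])
```

With only 5 runs per tier, the nearest-rank p95 is simply the largest observed score, so it is better read as a worst-case marker than as a stable tail estimate.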
Assumptions
- Connectors supply merge, deploy, and observability signals.
- Ground-truth inventory maintained independently of System Graph.
- Full-suite runs establish failure labels for scoring.
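Because the inventory is maintained independently, dependency mapping completeness can be read as the share of inventory edges that also appear in the graph. The sketch below assumes both sides can be exported as (caller, callee) pairs; that export format, the function name, and the service names are hypothetical.

```python
def dependency_completeness(graph_edges, inventory_edges):
    """Return (coverage, missing, unlisted) for a graph against the inventory.

    coverage: share of inventory edges present in the graph (None if no inventory)
    missing:  inventory edges the graph did not discover
    unlisted: graph edges absent from the inventory (possible drift or false edges)
    """
    missing = sorted(inventory_edges - graph_edges)
    unlisted = sorted(graph_edges - inventory_edges)
    if not inventory_edges:
        return None, missing, unlisted
    coverage = len(inventory_edges & graph_edges) / len(inventory_edges)
    return coverage, missing, unlisted


# Hypothetical edges as (caller, callee) pairs.
inventory = {
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("search", "catalog"),
    ("payments", "ledger"),
}
graph = {
    ("checkout", "payments"),
    ("search", "catalog"),
    ("checkout", "email"),
}
coverage, missing, unlisted = dependency_completeness(graph, inventory)
```

Reporting the missing and unlisted edges alongside the ratio makes it easier to tell a connector coverage gap from a stale inventory entry.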
Results pending first benchmark run
This page does not display performance numbers until completed runs pass validation. When published, results include confidence ranges and sample sizes.
| Metric | Value | Confidence range | Notes |
|---|---|---|---|
| Change impact accuracy | Pending | - | Awaiting completed runs |
| Impacted service detection | Pending | - | Awaiting completed runs |
| Impacted workflow detection | Pending | - | Awaiting completed runs |
| Targeted test selection accuracy | Pending | - | Awaiting completed runs |
| Dependency mapping completeness | Pending | - | Awaiting completed runs |
| Release-risk scoring usefulness | Pending | - | Awaiting completed runs |
What this benchmark does not claim
- Graph completeness depends on connector coverage in each environment.
- Dynamic dependencies (feature flags, runtime routing) may differ from static models.
- No accuracy percentages are published until labeled runs complete.
Enterprise interpretation
Use this suite to validate whether System Graph reduces unnecessary test execution while maintaining defect detection. Expect confidence intervals on precision/recall, not point estimates.
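To illustrate why an interval matters at this sample size, here is a sketch of a Wilson score interval for a single proportion such as precision or recall. The choice of the Wilson method, the 95% level, and the counts are assumptions; the suite does not state how its confidence ranges will be computed.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 34 of 40 predicted impacts confirmed by observed failures.
low, high = wilson_interval(34, 40)
```

With these hypothetical counts the point estimate is 0.85, yet the interval spans roughly 0.71 to 0.93, which is why a bare percentage from 40 changes says little on its own. This treats each prediction as independent; if predictions cluster by commit, resampling whole changes would be a more cautious alternative.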
