Benchmark framework · results pending

System Graph Benchmarks

Measure change-impact accuracy, dependency completeness, and targeted test selection against observed failures rather than graph-size vanity metrics.


Methodology and measurement definitions are published; performance numbers appear only after completed runs.

Why this benchmark matters

The value of System Graph is measured by whether teams run the right tests for each change. Incomplete graphs waste fleet capacity; inaccurate impact maps miss regressions.

Metrics measured

What this suite tracks

Metric	Unit	What it measures
Change impact accuracy	score	Precision and recall of predicted impacted services vs observed failures.
Impacted service detection	services	Correct identification of services touched by a tagged change.
Impacted workflow detection	workflows	Correct identification of user/API workflows affected by a change.
Targeted test selection accuracy	rate	Share of full-suite failures that the selected tests also catch.
Dependency mapping completeness	rate	Coverage of known dependencies vs the ground-truth architecture inventory.
Release-risk scoring usefulness	score	Correlation between risk score bands and observed defect severity.
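
The two rate metrics above can be read as simple set ratios. The sketch below is illustrative only and is not the suite's implementation: the function names, the test and edge identifiers, and the choice to return 1.0 when the denominator set is empty are all assumptions made for this example.

```python
def selection_catch_rate(full_suite_failures: set[str], selected_tests: set[str]) -> float:
    """Share of tests that fail in the full suite and are also in the selected subset."""
    if not full_suite_failures:
        return 1.0  # assumption: nothing to catch counts as fully caught
    caught = full_suite_failures & selected_tests
    return len(caught) / len(full_suite_failures)


def dependency_completeness(
    mapped_edges: set[tuple[str, str]],
    inventory_edges: set[tuple[str, str]],
) -> float:
    """Share of ground-truth dependency edges that also appear in the mapped graph."""
    if not inventory_edges:
        return 1.0  # assumption: an empty inventory is trivially covered
    return len(mapped_edges & inventory_edges) / len(inventory_edges)


# Hypothetical identifiers, used only to exercise the functions.
failures = {"test_cart", "test_checkout", "test_auth", "test_billing"}
selected = {"test_cart", "test_checkout", "test_auth", "test_search"}
catch_rate = selection_catch_rate(failures, selected)

mapped = {("web", "cart"), ("cart", "db"), ("web", "cache")}
inventory = {("web", "cart"), ("cart", "db"), ("cart", "billing"), ("web", "auth")}
completeness = dependency_completeness(mapped, inventory)
```

Edges that appear in the mapped graph but not in the inventory (here `("web", "cache")`) do not raise the completeness rate; they would have to be reviewed separately.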

Methodology

How we measure

We score precision/recall for impact sets, selection coverage, and risk band calibration. Graph node counts alone are not reported as success metrics.
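
As a minimal sketch of the precision/recall scoring described above, assuming impact sets are plain sets of service names: `predicted` is what the graph flagged for a commit and `observed` is what actually failed. The function name, the service names, and the handling of empty sets are assumptions for illustration, not the suite's definitions.

```python
def impact_precision_recall(predicted: set[str], observed: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of a predicted impact set against observed failures."""
    true_positives = len(predicted & observed)
    # Assumption: an empty prediction makes no false claims, so precision is 1.0;
    # likewise, with no observed failures there is nothing to miss, so recall is 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(observed) if observed else 1.0
    return precision, recall


# Hypothetical commit: four services predicted, three observed failing.
predicted = {"checkout", "cart", "search", "auth"}
observed = {"checkout", "cart", "billing"}
precision, recall = impact_precision_recall(predicted, observed)
```

Low precision here corresponds to unnecessary test execution; low recall corresponds to missed regressions.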

Test environment	Multi-service reference application with known dependency graph, connector-fed CI/deploy events, injected failure tags, and exported ground-truth inventory.
Dataset / workload	Tagged commits with controlled failure injection across services and workflows. Compare predicted impact sets to executed test outcomes.
Sample size	Minimum 40 tagged changes × 3 connector coverage tiers (to be confirmed at first run).
Number of runs	5 graph rebuild cycles per tier; impact predictions frozen per commit SHA.
Variance	Not yet measured. Future runs will report p50, p95, and coefficient of variation.
Excluded runs	None defined until first benchmark run is completed.
Date last run	Pending first benchmark run
Version tested	Pending first benchmark run
Repeatability	Graph schema version, connector configs, and failure injection manifest published with each benchmark pack.
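
The Variance row names p50, p95, and coefficient of variation. One way these could be computed across rebuild cycles is sketched below, assuming one score per cycle. The linear-interpolation percentile, the use of the sample standard deviation, and the input values are assumptions for this example; the scores are placeholders, not benchmark results.

```python
import math
import statistics


def percentile(values: list[float], q: float) -> float:
    """Percentile by linear interpolation between closest ranks, q in [0, 100]."""
    if not values:
        raise ValueError("percentile requires at least one value")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Summarize per-cycle scores as p50, p95, and coefficient of variation."""
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation; needs >= 2 scores
    cv = stdev / mean if mean else math.nan
    return {"p50": percentile(scores, 50), "p95": percentile(scores, 95), "cv": cv}


# Placeholder scores for five rebuild cycles (illustrative only).
summary = summarize_runs([0.80, 0.82, 0.78, 0.81, 0.79])
```

With only five cycles per tier, p95 is close to the maximum observed score, so it is best read alongside the coefficient of variation.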

Assumptions

- Connectors supply merge, deploy, and observability signals.
- The ground-truth inventory is maintained independently of System Graph.
- Full-suite runs establish the failure labels used for scoring.

Results

Results pending first benchmark run

This page does not display performance numbers until completed runs pass validation. When published, results include confidence ranges and sample sizes.

Metric	Value	Confidence range	Notes
Change impact accuracy	Pending	-	Awaiting completed runs
Impacted service detection	Pending	-	Awaiting completed runs
Impacted workflow detection	Pending	-	Awaiting completed runs
Targeted test selection accuracy	Pending	-	Awaiting completed runs
Dependency mapping completeness	Pending	-	Awaiting completed runs
Release-risk scoring usefulness	Pending	-	Awaiting completed runs

Limitations

What this benchmark does not claim

- Graph completeness depends on connector coverage in each environment.
- Dynamic dependencies (feature flags, runtime routing) may differ from static models.
- No accuracy percentages are published until labeled runs complete.

Enterprise interpretation

Use this suite to validate whether System Graph reduces unnecessary test execution while maintaining defect detection. Expect confidence intervals on precision/recall, not point estimates.
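
To illustrate what a confidence interval on a proportion such as recall can look like, the sketch below uses the Wilson score interval. This page does not state which interval method the suite uses, so the method, the function name, and the counts in the example are assumptions, not published results.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denominator = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denominator
    )
    return center - half_width, center + half_width


# Hypothetical counts: 32 of 40 observed failures were in the predicted impact set.
low, high = wilson_interval(32, 40)
```

At a sample size of 40, the interval around an observed 0.80 spans roughly a quarter of the unit range, which is why a point estimate alone would overstate certainty.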

Next steps

Evaluate Zof against your reliability requirements

Review methodology, run a structured assessment, or benchmark against your workflow with enterprise architects.
