Benchmark framework · results pending

System Graph Benchmarks

Measure change-impact accuracy, dependency completeness, and targeted test selection against observed failures, not graph-size vanity metrics.

Methodology and measurement definitions are published; performance numbers appear only after completed runs.
Why this benchmark matters

System Graph value is measured by whether teams run the right tests for each change. An incomplete graph forces broader test runs than a change needs, which wastes fleet capacity; an inaccurate impact map skips tests that would have caught a regression.

Metrics measured

What this suite tracks

score

Change impact accuracy

Precision/recall of predicted impacted services vs observed failures.

services

Impacted service detection

Correct identification of services touched by a tagged change.

workflows

Impacted workflow detection

Correct identification of user/API workflows affected by change.

rate

Targeted test selection accuracy

Share of the failures found by the full suite that the selected tests also catch.

rate

Dependency mapping completeness

Coverage of known dependencies vs ground-truth architecture inventory.

score

Release-risk scoring usefulness

Correlation between risk score bands and observed defect severity.
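The page does not say which correlation statistic is used for risk bands. As a minimal sketch, assuming bands and severities are both encoded as ordered integers (an encoding chosen here for illustration), a Spearman rank correlation with tie handling could look like this. The function names and the sample values are hypothetical, not benchmark output.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the ranks; None when either side is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5

# Made-up example: risk band per change (1 = low, 3 = high) and the
# severity of the worst defect observed for that change (0 = none).
risk_bands = [1, 1, 2, 2, 3, 3]
severities = [0, 1, 1, 2, 2, 3]
print(spearman(risk_bands, severities))
```

A value near 1 would mean higher bands tend to coincide with more severe defects; a value near 0 would mean the bands carry little ordering information.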

Methodology

How we measure

We score precision/recall for impact sets, selection coverage, and risk band calibration. Graph node counts alone are not reported as success metrics.
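The two set-based scores above can be sketched in a few lines. This is an illustration under assumptions, not the suite's implementation: the function names, the service names, and the test IDs are hypothetical, and impact sets are assumed to be plain sets of identifiers.

```python
def precision_recall(predicted, observed):
    """Score a predicted impact set against the set that actually failed."""
    predicted, observed = set(predicted), set(observed)
    true_positives = len(predicted & observed)
    # Undefined ratios are reported as None rather than as 0 or 1.
    precision = true_positives / len(predicted) if predicted else None
    recall = true_positives / len(observed) if observed else None
    return precision, recall

def selection_coverage(selected_tests, full_suite_failures):
    """Share of full-suite failures that the selected tests would also catch."""
    failures = set(full_suite_failures)
    if not failures:
        return None
    return len(failures & set(selected_tests)) / len(failures)

# Hypothetical change: three services predicted, three observed failing.
predicted = {"checkout", "cart", "payments"}
observed = {"checkout", "payments", "inventory"}
print(precision_recall(predicted, observed))

# Hypothetical test IDs: the full suite failed t2 and t3; only t2 was selected.
print(selection_coverage(["t1", "t2"], ["t2", "t3"]))
```

Returning None for an empty denominator keeps changes with no observed failures from inflating or deflating an average.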

Test environment: Multi-service reference application with known dependency graph, connector-fed CI/deploy events, injected failure tags, and exported ground-truth inventory.
Dataset / workload: Tagged commits with controlled failure injection across services and workflows. Compare predicted impact sets to executed test outcomes.
Sample size: Minimum 40 tagged changes × 3 connector coverage tiers (to be confirmed at first run).
Number of runs: 5 graph rebuild cycles per tier; impact predictions frozen per commit SHA.
Variance: Not yet measured. Future runs will report p50, p95, and coefficient of variation.
Excluded runs: None defined until first benchmark run is completed.
Date last run: Pending first benchmark run
Version tested: Pending first benchmark run
Repeatability: Graph schema version, connector configs, and failure injection manifest published with each benchmark pack.
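The variance summary planned for future runs (p50, p95, coefficient of variation) could be computed as in the sketch below. The percentile method is not specified on this page; nearest-rank is assumed here because it stays well defined with only five runs per tier. The function names and the sample recall values are hypothetical.

```python
import math
from statistics import mean, stdev

def nearest_rank_percentile(values, p):
    """Nearest-rank percentile: the smallest value covering p% of the runs."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(values):
    """Return p50, p95, and coefficient of variation for one metric."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(values)
    return {
        "p50": nearest_rank_percentile(values, 50),
        "p95": nearest_rank_percentile(values, 95),
        # Sample standard deviation divided by the mean.
        "cv": stdev(values) / avg if avg else None,
    }

# Made-up recall values from five rebuild cycles of one tier.
recall_per_cycle = [0.80, 0.82, 0.78, 0.81, 0.79]
print(summarize(recall_per_cycle))
```

With five values, the nearest-rank p95 is simply the largest observation, so p95 only becomes informative once more runs are pooled.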

Assumptions

  • Connectors supply merge, deploy, and observability signals.
  • Ground-truth inventory maintained independently of System Graph.
  • Full-suite runs establish failure labels for scoring.
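An independently maintained inventory is what makes dependency mapping completeness scorable. A minimal sketch, assuming both the graph and the inventory can be exported as directed (source, target) edges; the export format, the function name, and the service names are assumptions for illustration.

```python
def dependency_completeness(graph_edges, inventory_edges):
    """Return (share of inventory edges present in the graph, missing edges)."""
    inventory = set(inventory_edges)
    if not inventory:
        return None, set()
    found = inventory & set(graph_edges)
    return len(found) / len(inventory), inventory - found

# Hypothetical ground-truth inventory of four dependencies.
inventory = [
    ("checkout", "payments"),
    ("checkout", "cart"),
    ("cart", "pricing"),
    ("payments", "ledger"),
]
# Hypothetical graph export: three known edges plus one not in the inventory.
graph = [
    ("checkout", "payments"),
    ("checkout", "cart"),
    ("payments", "ledger"),
    ("cart", "search"),
]
score, missing = dependency_completeness(graph, inventory)
print(score, missing)
```

Edges that appear in the graph but not in the inventory are ignored by this ratio; whether they are false positives or inventory gaps needs a separate review.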
Results

Results pending first benchmark run

This page does not display performance numbers until completed runs pass validation. When published, results include confidence ranges and sample sizes.

Metric | Value | Confidence range | Notes
Change impact accuracy | Pending | - | Awaiting completed runs
Impacted service detection | Pending | - | Awaiting completed runs
Impacted workflow detection | Pending | - | Awaiting completed runs
Targeted test selection accuracy | Pending | - | Awaiting completed runs
Dependency mapping completeness | Pending | - | Awaiting completed runs
Release-risk scoring usefulness | Pending | - | Awaiting completed runs
Limitations

What this benchmark does not claim

  • Graph completeness depends on connector coverage in each environment.
  • Dynamic dependencies (feature flags, runtime routing) may differ from static models.
  • No accuracy percentages are published until labeled runs complete.

Enterprise interpretation

Use this suite to validate whether System Graph reduces unnecessary test execution while maintaining defect detection. Expect confidence intervals on precision/recall, not point estimates.
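The interval method is not specified on this page. One common choice for a proportion such as recall is the Wilson score interval, sketched below on the assumption that each observed failure counts as an independent trial. The function name and the counts are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if trials == 0:
        return None
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half_width, center + half_width

# Made-up counts: 36 of 40 observed failures fell inside the predicted set.
low, high = wilson_interval(36, 40)
print(f"recall 0.90, interval {low:.3f} to {high:.3f}")
```

With a sample of 40, a point estimate of 0.90 is compatible with a fairly wide range of true values, which is why reading the interval matters more than reading the headline number.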



System Graph Benchmark | Zof AI