Reliability Operations

Audit-Ready by Default: Tying Every Reliability Metric to a Fleet Run and an Approval

A playbook for compliance and risk officers: make every reliability metric trace to a fleet run, an approval, and System Graph context so audit exports hold up.

Zof Reliability Team · Engineering and Product

October 1, 2025 · 6 min read · Updated October 1, 2025

01

Why your current reliability metrics fail an audit

Most reliability reporting is assembled, not generated. A coverage figure is exported from one tool, a defect trend from another, a release log from a third, and someone in engineering pastes them into a deck for the quarterly risk review. The number on the slide is plausible. It is also unfalsifiable. When an examiner asks "show me the run that produced this," the chain breaks.

The problem is getting worse for a specific, structural reason. Roughly 41% of codebases are now AI-generated, and around 45% of AI coding tasks introduce critical flaws or security issues. The volume of change and its defect rate are rising together. Meanwhile, around 80% of developers bypass policy and guardrails when those controls are slow or subjective. So the underlying activity your metrics are supposed to attest to is becoming both higher-velocity and less governed at the same moment you are being asked to vouch for it.

For insurance specifically, this lands hard. Policy rating engines, claims adjudication workflows, and regulatory reporting pipelines all carry direct financial and conduct risk. When a model regulation examiner or your own internal audit asks how you validated a change to the claims workflow before it shipped, "the build was green" is not an answer that survives. You need the run, the result, the approval, and the context the result was measured against. An assembled dashboard cannot give you that. It was never built to.

02

The unit of evidence: a run, an approval, and context

The fix is to change what a metric *is*. Stop treating a reliability number as a value and start treating it as a record with provenance. A defensible reliability metric carries three things:

  • A fleet run. The metric was produced by a specific, identifiable validation execution, not aggregated from a tool that no longer remembers what it did. Testing Fleets are coordinated agents that plan, execute, and observe validation as the system evolves, so each run is a discrete, replayable event rather than a static script's pass/fail.
  • An approval. A named human authorized the action the metric describes, under a policy you defined. This is the "agents propose, humans authorize" principle made into an artifact: the proposal, the policy it was checked against, and the person who signed.
  • System Graph context. The metric is anchored to the actual dependency surface it was measured against. The System Graph is a live map of services, dependencies, and CI/CD that makes validation change-aware. It is what lets you say *this* coverage figure applies to *this* change against *these* downstream systems, not to a platform average.

When those three travel together, "87% coverage" stops being a slide and becomes a claim an auditor can pull the thread on and find solid every time. Zof's published positioning for Reliability Analytics states it directly: reports tie to fleet runs, approvals, and System Graph context. That is the architectural difference between evidence and a vanity metric.
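As a rough illustration of a metric as a record rather than a value, the sketch below models the three parts as plain Python dataclasses. Every class and field name here (`FleetRun`, `Approval`, `GraphContext`, `ReliabilityMetric`, `missing_provenance`) is hypothetical and chosen for this example; it is not Zof's data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FleetRun:
    run_id: str            # identifier of the validation execution (hypothetical)
    started_at: datetime


@dataclass(frozen=True)
class Approval:
    approver: str          # a named human, not a shared or service account
    policy_id: str         # the policy the proposal was checked against
    proposal_id: str       # the specific proposal that was signed


@dataclass(frozen=True)
class GraphContext:
    change_id: str                        # the change the metric was measured for
    downstream_services: tuple[str, ...]  # dependency surface in scope


@dataclass(frozen=True)
class ReliabilityMetric:
    name: str
    value: float
    run: Optional[FleetRun] = None
    approval: Optional[Approval] = None
    context: Optional[GraphContext] = None

    def missing_provenance(self) -> list[str]:
        """Return the names of the provenance parts that are absent."""
        parts = {"run": self.run, "approval": self.approval, "context": self.context}
        return [name for name, part in parts.items() if part is None]

    def is_audit_ready(self) -> bool:
        """A metric is defensible only when all three parts travel with it."""
        return not self.missing_provenance()


# A bare number, as it would appear on a slide.
bare = ReliabilityMetric(name="coverage", value=0.87)

# The same number with its run, approval, and context attached.
bound = ReliabilityMetric(
    name="coverage",
    value=0.87,
    run=FleetRun("run-001", datetime(2025, 10, 1, tzinfo=timezone.utc)),
    approval=Approval("j.doe", "policy-rating-engine", "proposal-42"),
    context=GraphContext("change-7", ("claims-service", "rating-engine")),
)
```

Under these assumptions, `bare.is_audit_ready()` is false and `bare.missing_provenance()` names exactly what an examiner would ask for, while `bound` passes. A report could refuse to render any metric that fails the check.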

03

The playbook: building audit-readiness by default

Audit-readiness as a periodic scramble is the failure mode. The goal is to make it the resting state of the system, so that producing an export is a query rather than a project. Five moves:

1. Make the run the source of record, not the dashboard. Reporting should read from validation runs, not from a separate analytics store that drifts. If your report and your operations team show different numbers, the auditor has already found the gap. Reliability Analytics is explicit that it serves the same data as operations.

2. Bind every gated action to a policy and an approval. Through Governance, define which changes require a named approval and which can pass on evidence alone. A change to the rating engine requires a human sign-off; a documentation-only change does not. The control layer enforces this uniformly, every release, which means the approval trail is complete by default rather than reconstructed under deadline.

3. Anchor metrics to change scope. Use the System Graph so each metric answers "validated against what?" A claims-workflow coverage number means nothing if its scope includes a change to an unrelated internal tool. Change-aware scoping is what lets you attest to a specific system for a specific examiner.

4. Prioritize by reachability, and record why. Not every finding is exploitable. Reachability-based prioritization can cut the exposure that needs triage by 70-90%, but the audit value is in recording the reasoning: this finding was deprioritized because it is not reachable from a live entry point, and here is the graph evidence. That is a defensible risk decision, not one swept under the rug.

5. Make export a first-class output. The evidence has to leave the system in a form your examiners and internal audit accept. Reliability Analytics supports scheduled delivery and API export, so the same governed evidence flows into your GRC tooling on a cadence rather than being hand-assembled the week before a review.
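Move 4 can be sketched in a few lines. The example below assumes the dependency graph is available as a simple adjacency mapping (an edge from A to B means A calls B) and uses a breadth-first search from live entry points; the graph shape, service names, finding IDs, and the `triage` function are all hypothetical, and this is not how the System Graph is actually queried. The point it illustrates is that each decision carries its reason and the entry points that were checked.

```python
from collections import deque


def reachable_from(graph, entry_points):
    """Breadth-first search: every node reachable from any entry point."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def triage(findings, graph, entry_points):
    """Attach a priority and a recorded reason to each (finding_id, component)."""
    reachable = reachable_from(graph, entry_points)
    decisions = []
    for finding_id, component in findings:
        is_reachable = component in reachable
        decisions.append({
            "finding": finding_id,
            "component": component,
            "priority": "triage" if is_reachable else "deprioritized",
            "reason": (
                "reachable from a live entry point"
                if is_reachable
                else "not reachable from any live entry point"
            ),
            # Recorded so the decision can be re-checked later.
            "entry_points_checked": sorted(entry_points),
        })
    return decisions


# Hypothetical dependency surface: caller -> callees.
graph = {
    "public-api": ["claims-service"],
    "claims-service": ["rating-engine"],
    "batch-tool": ["legacy-report"],
}
findings = [("F-1", "rating-engine"), ("F-2", "legacy-report")]
decisions = triage(findings, graph, entry_points=["public-api"])
```

Here `F-1` stays in the triage queue because `rating-engine` is reachable through `claims-service`, while `F-2` is deprioritized with its reason stored alongside it. Persisting the `decisions` list next to the run is what turns a prioritization shortcut into a reviewable risk decision.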

04

What good looks like, and what to watch for

Consider a hypothetical regional carrier modernizing its claims platform. Engineering ships changes weekly. Under the assembled-dashboard model, the risk officer learns about a validation gap when an incident surfaces it. Under the run-bound model, every claims-path change carries a fleet run, a reachability-scored risk posture, and a named approval before it ships, and the quarterly export is generated from those records directly. The examiner asks for the validation history of a specific rate change; it is one query, with provenance intact.

A few failure modes to govern against:

  • Provenance theater. A timestamp and a username are not provenance. The approval has to be bound to a policy and a specific proposal, or it is decoration.
  • Drift between report and reality. If analytics read from a copy, the copy will eventually disagree with the system. Read from the run.
  • Unsupervised remediation. It is tempting to let agents fix and ship. Don't. Remediation Fleets propose fixes; humans authorize them. Unsupervised autonomous fixing inside a regulated workflow is reckless, and an auditor will treat an unapproved automated change exactly as severely as it deserves. The governance around the fix is the engineering.
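To make the "provenance theater" point concrete, here is a minimal sketch of an approval that is bound to a specific proposal and policy rather than just a username and timestamp. It assumes a proposal can be serialized to JSON and uses a SHA-256 digest of that canonical form as the binding; the function names and record fields are hypothetical, and a digest is not a cryptographic signature, so a production system would additionally sign the record.

```python
import hashlib
import json


def proposal_digest(proposal: dict) -> str:
    """Hash a canonical JSON form so any change to the proposal changes the digest."""
    canonical = json.dumps(proposal, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approve(proposal: dict, policy_id: str, approver: str) -> dict:
    """Record who approved what, under which policy."""
    return {
        "approver": approver,
        "policy_id": policy_id,
        "proposal_digest": proposal_digest(proposal),
    }


def approval_covers(approval: dict, proposal: dict, policy_id: str) -> bool:
    """True only if the approval was given for exactly this proposal and policy."""
    return (
        approval["policy_id"] == policy_id
        and approval["proposal_digest"] == proposal_digest(proposal)
    )


# Hypothetical proposed fix and its approval.
proposal = {"change": "fix rounding in rate table", "target": "rating-engine"}
approval = approve(proposal, policy_id="policy-rating-engine", approver="j.doe")

# The same approval does not cover a proposal that was altered afterwards.
altered = dict(proposal, target="claims-service")
```

With this binding, `approval_covers(approval, proposal, "policy-rating-engine")` holds, but the check fails for `altered` or for a different policy ID. An approval that cannot fail this kind of check is decoration.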

For carriers that cannot send code or telemetry outside their boundary, the same loop runs locally. Edge Runners execute as signed capsules inside a secure enclave and produce the same audit-ready evidence, which matters when data residency and the examination both have to be satisfied at once.

05

The bottom line
