Testing

Test Health

Failure analysis, flakiness detection, and quarantine workflows.

Overview

Test Health is the operational analysis layer for failures, flakiness, and failure clusters across runs. It helps teams prioritize stabilization before release gates block shipping, complementing raw run triage with pattern recognition and quarantine workflows.

Use Test Health when failure volume exceeds manual per-run inspection, during hardening sprints, after major UI refactors, or when pipeline noise erodes trust in smoke gates. The command palette shortcut (⌘K → "Test Health") is common in on-call runbooks.

Quarantine is a governed decision: exclude chronically flaky cases from blocking suites while tracking remediation owners and reinstatement criteria. Document quarantine rationale for audit-sensitive releases.

Who should read this

  • QA engineers, SREs, platform teams, and developers operating Zof Console and APIs.

Prerequisites

  • Multiple runs with failure history for meaningful clustering
  • Permission to view Quality → Test Health and edit quarantine state

When to use this workflow

  • Flaky smoke tests blocking deployments without product defects
  • Post-migration failure spikes needing cluster-level prioritization
  • Monthly hardening sprints with explicit stabilization OKRs

Flakiness quarantine workflow

Standard enterprise flow from detection through reinstatement.

Step-by-step procedure

Open Test Health

Navigate to Quality → Test Health, or press ⌘K and search for "Test Health".

Filter by project, environment, and time window matching your investigation.

Review top clusters sorted by failure rate or recency.
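The two sort orders mentioned above can be reproduced offline, for example on exported cluster data. The following is a minimal sketch under stated assumptions: the `Cluster` fields and the `top_clusters` helper are hypothetical names chosen for illustration and do not describe the Zof data model or API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Cluster:
    # Hypothetical fields; adapt them to whatever your export contains.
    title: str
    failed_runs: int
    total_runs: int
    last_seen: datetime

    @property
    def failure_rate(self) -> float:
        # Guard against clusters with no recorded runs.
        return self.failed_runs / self.total_runs if self.total_runs else 0.0


SORT_KEYS = {
    "failure_rate": lambda c: c.failure_rate,
    "recency": lambda c: c.last_seen,
}


def top_clusters(clusters, by="failure_rate", limit=5):
    """Return the highest-ranked clusters, worst or most recent first."""
    if by not in SORT_KEYS:
        raise ValueError(f"unknown sort order: {by}")
    return sorted(clusters, key=SORT_KEYS[by], reverse=True)[:limit]
```

Sorting by failure rate surfaces chronic problems, while sorting by recency surfaces whatever broke most recently; reviewing both views can be useful after a deployment.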

Analyze failure clusters

Open cluster detail to see affected cases, common error signatures, and related runs.

Differentiate assertion failures from timeout and connectivity patterns.

Compare cluster timeline to deployment events in your change calendar.
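Separating assertion failures from timeout and connectivity patterns can be approximated by matching error messages against a few regular expressions. This is a rough sketch, assuming error signatures are available as plain strings; the patterns and the `classify_signature` name are illustrative assumptions, not Zof's clustering logic.

```python
import re

# Checked in order: the first matching pattern wins.
# These patterns are illustrative and should be tuned to your own error messages.
SIGNATURE_PATTERNS = [
    ("timeout", re.compile(r"tim(e|ed) ?out|deadline exceeded", re.IGNORECASE)),
    ("connectivity", re.compile(
        r"connection (refused|reset)|ECONNREFUSED|ECONNRESET|\bdns\b|unreachable",
        re.IGNORECASE,
    )),
    ("assertion", re.compile(r"assert|expected .* but", re.IGNORECASE)),
]


def classify_signature(message: str) -> str:
    """Map an error message to a coarse failure category."""
    for category, pattern in SIGNATURE_PATTERNS:
        if pattern.search(message):
            return category
    return "unknown"
```

Timeout and connectivity are checked before assertion because an infrastructure problem often surfaces as a downstream assertion message; anything classified as "unknown" still needs manual review.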

Determine root cause category

Product defect: file an engineering ticket with run IDs and artifacts attached.

Test debt: assign a QA owner to update steps, data, or selectors in the test library.

Infrastructure: route to SRE with Error-status runs and agent telemetry.
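The three categories above amount to a small routing table. The sketch below encodes it so that triage notes stay consistent across clusters; the `RootCause` enum, the team labels, and the `route` helper are hypothetical names used only for illustration.

```python
from enum import Enum


class RootCause(Enum):
    PRODUCT_DEFECT = "product_defect"
    TEST_DEBT = "test_debt"
    INFRASTRUCTURE = "infrastructure"


# Destination team and required evidence per category, taken from the steps above.
ROUTING = {
    RootCause.PRODUCT_DEFECT: ("engineering", "run IDs and artifacts"),
    RootCause.TEST_DEBT: ("qa", "steps, data, or selectors to update"),
    RootCause.INFRASTRUCTURE: ("sre", "Error-status runs and agent telemetry"),
}


def route(cause: RootCause, run_ids: list[str]) -> dict:
    """Build a triage note naming the destination team and the evidence to attach."""
    team, evidence = ROUTING[cause]
    return {"team": team, "attach": evidence, "run_ids": list(run_ids)}
```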

Apply quarantine when appropriate

Quarantine the case from smoke or release-blocking suites once flakiness is confirmed and a fix is scheduled.

Record owner, target date, and reinstatement criteria in your issue tracker.

Notify release managers when quarantine changes gate semantics for an upcoming release.
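One way to keep quarantine decisions governed is to reject any record that lacks an owner, a target date, or reinstatement criteria before it reaches the issue tracker. The following is a minimal sketch; `QuarantineRecord` and its fields are hypothetical and do not mirror any Zof or issue-tracker schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineRecord:
    # Hypothetical record shape for illustration only.
    case_id: str
    owner: str
    target_date: date
    reinstatement_criteria: str
    rationale: str = ""

    def __post_init__(self):
        # Refuse records that would leave the quarantine without accountability.
        required = ("case_id", "owner", "reinstatement_criteria")
        missing = [name for name in required if not getattr(self, name).strip()]
        if missing:
            raise ValueError("quarantine record is missing: " + ", ".join(missing))


# Example with made-up values.
record = QuarantineRecord(
    case_id="checkout-smoke-017",
    owner="qa-payments",
    target_date=date(2025, 1, 31),
    reinstatement_criteria="5 consecutive passes in staging",
)
```

Making the record immutable (`frozen=True`) is a design choice: changes to owner or target date then require a new record, which leaves a clearer audit trail.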

Remediate and verify

Implement product fix or test stabilization in staging.

Run a targeted suite that excludes unrelated cases to conserve agent capacity.

Observe consecutive passes until they meet your reinstatement threshold (commonly 5 to 10 green runs in a row).
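The reinstatement check counts only the unbroken streak of passes at the end of the run history, so a single failure resets the count. A minimal sketch, assuming results are ordered oldest to newest and encoded as the strings "pass" and "fail" (both assumptions for illustration):

```python
def trailing_passes(results):
    """Count consecutive passes at the end of the history (oldest first)."""
    count = 0
    for status in reversed(results):
        if status != "pass":
            break
        count += 1
    return count


def ready_to_reinstate(results, threshold=5):
    """True once the most recent runs form a pass streak of at least `threshold`."""
    return trailing_passes(results) >= threshold
```

The default threshold of 5 is the low end of the range mentioned above; raise it for cases that gate releases.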

Reinstate and monitor

Remove quarantine status and restore case to blocking suites.

Watch Test Health for 48-72 hours for cluster recurrence.

Post a retrospective entry if the flakiness stemmed from environment drift or data pollution.
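The post-reinstatement watch can be expressed as a simple time-window check over failure timestamps. This sketch assumes you can obtain the failure times for the reinstated case from your own run history; the `failures_in_watch_window` name and the 72-hour default are illustrative.

```python
from datetime import datetime, timedelta


def failures_in_watch_window(reinstated_at, failure_times, watch_hours=72):
    """Return failures that occurred within the watch window after reinstatement."""
    window_end = reinstated_at + timedelta(hours=watch_hours)
    return [t for t in failure_times if reinstated_at <= t <= window_end]


# Example with made-up timestamps.
reinstated = datetime(2025, 3, 1, 9, 0)
failures = [
    datetime(2025, 2, 28, 12, 0),  # before reinstatement, ignored
    datetime(2025, 3, 2, 10, 0),   # inside the window, counts as recurrence
    datetime(2025, 3, 6, 9, 0),    # after the window, ignored
]
recurrences = failures_in_watch_window(reinstated, failures)
```

Any non-empty result suggests the cluster has recurred and the case may need to return to quarantine.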

Key concepts

Organization scope
All Zof Console and API operations are isolated to your authenticated tenant.
Governed execution
Agent output and remediation follow policy packs with human approval when configured.

Best practices

  • Do not quarantine without an owner and dated remediation plan
  • Prefer fixing root cause over permanent quarantine for P0 cases
  • Use cluster titles in standups to align QA and engineering
  • Reinstate only after the case passes in staging, not just on local developer machines

Common issues

Cluster shows mixed root causes
Split cases into separate remediation tickets; avoid bulk quarantine masking product defects.
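Case is quarantined but the pipeline still fails
CI may reference case IDs directly rather than suite membership, so quarantining the case in a suite does not remove it from the pipeline. Update the pipeline scope to exclude the quarantined case.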
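When a pipeline lists case IDs explicitly, the quarantine list has to be applied to that list as well. The sketch below shows the filtering step in isolation; `blocking_scope` and the case IDs are hypothetical, and how you obtain the quarantined IDs depends on your own tooling.

```python
def blocking_scope(pipeline_case_ids, quarantined_ids):
    """Split an explicit case list into blocking cases and quarantined ones."""
    quarantined = set(quarantined_ids)
    blocking = [c for c in pipeline_case_ids if c not in quarantined]
    skipped = [c for c in pipeline_case_ids if c in quarantined]
    return blocking, skipped


# Example with made-up case IDs.
blocking, skipped = blocking_scope(
    ["login-001", "checkout-017", "search-004"],
    ["checkout-017"],
)
```

Returning the skipped cases separately makes it possible to log or run them in a non-blocking job, so quarantined cases keep producing signal.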


Test Health | Zof AI Documentation