Test Health
Failure analysis, flakiness detection, and quarantine workflows.
Overview
Test Health is the operational analysis layer for failures, flakiness, and failure clusters across runs. It helps teams prioritize stabilization before release gates block shipping, complementing raw run triage with pattern recognition and quarantine workflows.
Use Test Health when failure volume is too high to inspect run by run, during hardening sprints, after major UI refactors, or when pipeline noise erodes trust in smoke gates. The command palette shortcut (⌘K → "Test Health") is common in on-call runbooks.
Quarantine is a governed decision: exclude chronically flaky cases from blocking suites while tracking remediation owners and reinstatement criteria. Document quarantine rationale for audit-sensitive releases.
Who should read this
- QA engineers, SREs, platform teams, and developers operating Zof Console and APIs.
Prerequisites
- Multiple runs with failure history for meaningful clustering
- Permission to view Quality → Test Health and edit quarantine state
When to use this workflow
- Flaky smoke tests blocking deployments without product defects
- Post-migration failure spikes needing cluster-level prioritization
- Monthly hardening sprints with explicit stabilization OKRs
Flakiness quarantine workflow
Standard enterprise flow from detection through reinstatement.
Step-by-step procedure
Open Test Health
Navigate to Quality → Test Health, or press ⌘K and search for "Test Health".
Filter by project, environment, and time window matching your investigation.
Review top clusters sorted by failure rate or recency.
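The two sort orders mentioned above can be illustrated with a minimal sketch. It assumes cluster data has already been exported into plain Python objects; the `Cluster` class, its field names, and `top_clusters` are hypothetical and are not part of any Zof API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Cluster:
    """Hypothetical shape for an exported failure cluster."""
    title: str
    failed_runs: int
    total_runs: int
    last_seen: datetime

    @property
    def failure_rate(self) -> float:
        # Guard against clusters with no recorded runs.
        return self.failed_runs / self.total_runs if self.total_runs else 0.0


def top_clusters(clusters, by="failure_rate", limit=10):
    """Return up to `limit` clusters, highest failure rate or most recent first."""
    if by == "failure_rate":
        key = lambda c: c.failure_rate
    elif by == "recency":
        key = lambda c: c.last_seen
    else:
        raise ValueError(f"unsupported sort key: {by}")
    return sorted(clusters, key=key, reverse=True)[:limit]
```

Sorting by failure rate surfaces the noisiest clusters, while sorting by recency is usually more useful right after a deployment.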
Analyze failure clusters
Open cluster detail to see affected cases, common error signatures, and related runs.
Differentiate assertion failures from timeout and connectivity patterns.
Compare cluster timeline to deployment events in your change calendar.
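As a rough illustration of separating assertion failures from timeout and connectivity patterns, the sketch below buckets raw error messages with regular expressions. The patterns and the `classify_signature` helper are assumptions chosen for illustration; real error signatures depend on your test framework and should be tuned against your own cluster data.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
# Timeouts are checked first so "connection timed out" counts as a timeout.
PATTERNS = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE)),
    ("connectivity", re.compile(
        r"connection (refused|reset)|ECONNREFUSED|ECONNRESET|\bDNS\b|unreachable",
        re.IGNORECASE)),
    ("assertion", re.compile(r"assert|expected .+ (but|to) ", re.IGNORECASE)),
]


def classify_signature(message: str) -> str:
    """Return 'timeout', 'connectivity', 'assertion', or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.search(message):
            return label
    return "unknown"
```

Messages that land in `unknown` are worth reviewing by hand, since they often point to a pattern the rules do not yet cover.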
Determine root cause category
Product defect: file an engineering ticket with run IDs and artifacts attached.
Test debt: assign a QA owner to update steps, data, or selectors in the test library.
Infrastructure: route to SRE with the Error-status runs and agent telemetry.
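The three categories above amount to a small routing table. The sketch below encodes it as data so a triage script could look up the destination team and the evidence to attach. The enum, team labels, and `route` function are hypothetical names used only for illustration.

```python
from enum import Enum


class RootCause(Enum):
    PRODUCT_DEFECT = "product_defect"
    TEST_DEBT = "test_debt"
    INFRASTRUCTURE = "infrastructure"


# Hypothetical routing table mirroring the three categories in this step.
ROUTING = {
    RootCause.PRODUCT_DEFECT: ("engineering", ["run IDs", "artifacts"]),
    RootCause.TEST_DEBT: ("qa", ["steps, data, or selectors to update"]),
    RootCause.INFRASTRUCTURE: ("sre", ["Error-status runs", "agent telemetry"]),
}


def route(cause: RootCause) -> dict:
    """Return the owning team and the evidence to attach for a root cause."""
    team, attachments = ROUTING[cause]
    return {"team": team, "attach": list(attachments)}
```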
Apply quarantine when appropriate
Quarantine the case from smoke or release-blocking suites once flakiness is confirmed and a fix is scheduled.
Record owner, target date, and reinstatement criteria in your issue tracker.
Notify release managers when quarantine changes gate semantics for an upcoming release.
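One way to enforce the record-keeping in this step is to validate a quarantine entry before it is accepted. The sketch below is a minimal example under the assumption that you track quarantine decisions in your own tooling; `QuarantineRecord`, its fields, and `validate` are hypothetical and do not describe a Zof data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineRecord:
    """Hypothetical record of a quarantine decision."""
    case_id: str
    owner: str
    target_date: date
    reinstatement_criteria: str
    rationale: str = ""


def validate(record: QuarantineRecord, today: date) -> list:
    """Return a list of problems; an empty list means the record is complete."""
    problems = []
    if not record.owner.strip():
        problems.append("missing owner")
    if record.target_date <= today:
        problems.append("target date must be in the future")
    if not record.reinstatement_criteria.strip():
        problems.append("missing reinstatement criteria")
    return problems
```

Rejecting incomplete records at this point keeps quarantine from becoming a silent, permanent exclusion.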
Remediate and verify
Implement product fix or test stabilization in staging.
Run a targeted suite that excludes unrelated cases to conserve agent capacity.
Observe consecutive passes until they meet your reinstatement threshold (commonly 5 to 10 green runs in a row).
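The reinstatement threshold can be checked mechanically by counting the unbroken run of passes at the end of a case's history. This is a minimal sketch that assumes results are available as an oldest-to-newest list of status strings; the `"pass"` status value and both function names are assumptions for illustration.

```python
def trailing_passes(results):
    """Count consecutive 'pass' results at the end of an oldest-to-newest list."""
    count = 0
    for status in reversed(results):
        if status != "pass":
            break
        count += 1
    return count


def ready_to_reinstate(results, threshold=5):
    """True when the most recent `threshold` results are all passes."""
    return trailing_passes(results) >= threshold
```

Counting only the trailing streak means a single new failure resets the clock, which matches the intent of requiring consecutive passes.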
Reinstate and monitor
Remove quarantine status and restore case to blocking suites.
Watch Test Health for 48-72 hours for cluster recurrence.
Post retrospective entry if flakiness stemmed from environment drift or data pollution.
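The post-reinstatement watch can be expressed as a simple time-window check. The sketch below assumes you have the reinstatement timestamp and a list of failure timestamps for the case; `recurred` and its parameters are hypothetical, and the 72-hour default reflects the upper end of the window suggested above.

```python
from datetime import datetime, timedelta


def recurred(reinstated_at, failure_times, window_hours=72):
    """True if any failure falls inside the monitoring window after reinstatement."""
    window_end = reinstated_at + timedelta(hours=window_hours)
    return any(reinstated_at <= t <= window_end for t in failure_times)
```

A `True` result during the window is a signal to reopen the remediation ticket rather than wait for the cluster to grow again.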
Key concepts
- Organization scope: All Zof Console and API operations are isolated to your authenticated tenant.
- Governed execution: Agent output and remediation follow policy packs with human approval when configured.
Best practices
- Do not quarantine without an owner and dated remediation plan
- Prefer fixing root cause over permanent quarantine for P0 cases
- Use cluster titles in standups to align QA and engineering
- Reinstate only after the case passes in staging, not just on local developer machines
Common issues
- Cluster shows mixed root causes: Split cases into separate remediation tickets; avoid bulk quarantine that masks product defects.
- Case is quarantined but the pipeline still fails: CI may reference case IDs directly rather than suite membership. Update the pipeline scope.
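When a pipeline lists case IDs explicitly, one possible fix is to filter that list against the current quarantine set before the run starts. The sketch below assumes both lists are available to your CI script as plain strings; `blocking_scope` and the ID format are hypothetical.

```python
def blocking_scope(pipeline_case_ids, quarantined_ids):
    """Split an explicit case list into cases to run and quarantined cases to skip."""
    quarantined = set(quarantined_ids)
    kept = [c for c in pipeline_case_ids if c not in quarantined]
    dropped = [c for c in pipeline_case_ids if c in quarantined]
    return kept, dropped
```

Logging the dropped IDs in the pipeline output keeps the exclusion visible to release managers.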