Introduction
Engineering teams ship faster when failures are classified quickly and consistently. Yet most organizations still triage failures manually—copying stack traces into chats, pasting logs into docs, and debating whether an issue is a test bug, a product bug, or an environment issue. This reactive workflow slows releases and drains team energy.
AI Failure Triage changes that. It automatically classifies failed tests into meaningful, team-defined categories and presents clear, explainable suggestions with confidence scores. The result is faster triage, fewer handoffs, and a feedback loop that gets smarter with every override.
The Problem
Traditional failure triage is noisy, repetitive, and error-prone. Without intelligence, teams struggle to separate signal from noise and to apply labels consistently across builds and squads.
- Inconsistent labeling — Different engineers categorize the same failure differently, degrading analytics and accountability.
- Slow feedback — Manual review adds hours or days to the discovery-to-fix cycle.
- No learning — Overrides and human input rarely feed back into an improving system.
- Cost and privacy risks — Sending large raw logs to external services without redaction can be risky and expensive.
The Solution
AI Failure Triage combines embeddings-first kNN classification with smart reuse of cluster fingerprints to deliver fast, explainable, and privacy-aware categorization. It integrates directly into your builds UI, supports bulk actions, and records explanations so decisions are auditable and repeatable.
- Top-1 prediction + alternatives — Always see the best label plus the top three suggestions, each with a score.
- Confidence buckets — High, Medium, and Low thresholds drive defaults like “To Be Investigated”; see the sketch after this list for how that mapping can work.
- Immediate learning — Human overrides become new examples that improve future results.
- Privacy-first — Reuses the existing redaction pipeline; embeddings cache keeps costs predictable.
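To make the bucket idea concrete, here is a minimal sketch of how a calibrated score could map to a bucket and a default label. The 0.85 and 0.60 thresholds are illustrative assumptions, not the product's actual configuration.

```python
def bucket_and_default(score: float, predicted_label: str) -> tuple[str, str]:
    # Placeholder thresholds (0.85 / 0.60); the real cutoffs are configurable
    # and not specified here.
    if score >= 0.85:
        return "High", predicted_label       # confident enough to pre-apply
    if score >= 0.60:
        return "Medium", predicted_label     # suggest, but flag for review
    return "Low", "To Be Investigated"       # safe default when the model is unsure
```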
Feature Details
Embeddings-first kNN
We reuse the same high-quality embeddings pipeline from Smart Failure Clusters. For each failed test, we retrieve or generate an embedding of the sanitized text (message + stack) and run a nearest-neighbor search over the last 90 days of labeled examples. A weighted vote over the nearest neighbors assigns a top-level category and a calibrated confidence score.
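As a rough illustration of the idea rather than the production implementation, the sketch below runs a similarity-weighted vote over pre-computed embeddings. The cosine-similarity weighting and the vote-share “calibration” are simplifying assumptions.

```python
import numpy as np

def knn_classify(query_vec: np.ndarray,      # embedding of the new failure, shape (d,)
                 example_vecs: np.ndarray,   # embeddings of labeled failures, shape (n, d)
                 example_labels: list[str],  # category of each labeled failure
                 k: int = 10):
    """Similarity-weighted kNN vote over recent labeled examples (sketch)."""
    # Cosine similarity between the new failure and every stored example.
    sims = example_vecs @ query_vec / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(sims)[::-1][:k]

    # Each neighbor contributes its similarity to its label's vote.
    votes: dict[str, float] = {}
    for i in top:
        votes[example_labels[i]] = votes.get(example_labels[i], 0.0) + float(sims[i])

    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    confidence = ranked[0][1] / (sum(votes.values()) or 1.0)  # naive vote-share score
    return ranked[0][0], confidence, ranked[:3]  # top-1 label, score, top-3 suggestions
```

In practice the labeled examples would come from the embeddings cache and the 90-day window mentioned above.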
Cluster-assisted labeling
When failures belong to a known Smart Failure Cluster, we apply the previously chosen category for that cluster instantly with High confidence, making labeling consistent by design.
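A minimal sketch of this cluster-first policy, assuming a hypothetical cluster_labels lookup; the names and shapes are illustrative, not the real API.

```python
def triage(failure: dict, cluster_labels: dict[str, str], knn_fallback):
    """Cluster-first labeling: reuse a cluster's known category, else fall back to kNN."""
    cluster_id = failure.get("cluster_id")
    if cluster_id and cluster_id in cluster_labels:
        # Known Smart Failure Cluster: apply its previously chosen category instantly.
        return {"label": cluster_labels[cluster_id], "confidence": "High", "source": "cluster"}
    # Unknown or unclustered failure: fall back to the embeddings-first kNN vote.
    label, score, alternatives = knn_fallback(failure)
    return {"label": label, "confidence": score, "alternatives": alternatives, "source": "knn"}
```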
Human-in-the-loop learning
Any engineer can accept or change a suggested label, individually or in bulk. Every override stores the normalized text and the chosen category, so the system improves immediately and the override rate drops over time.
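Conceptually, an override simply becomes another labeled example for the kNN index. The sketch below stores one in SQLite purely for illustration; the actual storage and embedding-refresh pipeline are not described here.

```python
import sqlite3
import time

def record_override(db: sqlite3.Connection, normalized_text: str, category: str) -> None:
    """Store a human override so it becomes a labeled example for future kNN votes."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS labeled_examples "
        "(text TEXT, category TEXT, labeled_at REAL)")
    db.execute(
        "INSERT INTO labeled_examples (text, category, labeled_at) VALUES (?, ?, ?)",
        (normalized_text, category, time.time()))
    db.commit()
```

Bulk overrides would simply call this once per selected failure.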
Clear explanations
Every suggestion includes similar past failures with similarity scores and highlighted snippets, so reviewers understand why a label is recommended.
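One way to picture the payload behind each suggestion; the field names and values below are made up purely to show the shape of the evidence a reviewer would see, not the real schema.

```python
from dataclasses import dataclass, field

@dataclass
class Explanation:
    suggested_label: str
    confidence: float
    neighbors: list[dict] = field(default_factory=list)  # similar past failures with evidence

example = Explanation(
    suggested_label="Environment Issue",
    confidence=0.91,
    neighbors=[{
        "test": "checkout_smoke_test",
        "similarity": 0.93,
        "snippet": "Connection refused: db-staging:5432",
    }],
)
```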
Operational guardrails
Hard caps keep triage predictable (e.g., up to 500 failures per run). Processing is synchronous in the MVP for simplicity. Costs are bounded through redaction, truncation, and cached embeddings.
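A possible shape for these guardrails as configuration: the 500-failure cap comes from the description above, while the truncation length and cache size are placeholder assumptions.

```python
from dataclasses import dataclass

@dataclass
class TriageGuardrails:
    max_failures_per_run: int = 500      # hard cap per triage run (from the docs)
    max_chars_per_failure: int = 4000    # truncate sanitized text before embedding (assumed)
    embedding_cache_size: int = 50_000   # reuse embeddings to keep API costs bounded (assumed)

def truncate(sanitized_text: str, limits: TriageGuardrails) -> str:
    # Redaction happens upstream; here we only bound the payload size.
    return sanitized_text[: limits.max_chars_per_failure]
```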
Getting Started
- Open a failed build and navigate to the “AI Failure Triage” tab.
- Run triage to classify failures. Progress messages show clustering and classification steps.
- Review suggestions with confidence and explanations; accept or change labels individually or in bulk.
- Manage categories in Project Settings. Global defaults include Automation Bug, Environment Issue, Product Bug, and To Be Investigated.
Conclusion
Automated, explainable failure classification unlocks faster triage, better analytics, and calmer releases. With AI Failure Triage, teams resolve issues sooner, align on ownership, and continuously improve the accuracy of their labeling—without changing how they write tests.
