Self-Healing Strategies for Agent-Driven Tests
What "Self-Healing" Means
Traditional self-healing (testRigor, Mabl, etc.): When a selector breaks, the tool automatically finds an alternative using heuristics (visual position, text content, nearby elements).
Agent-driven self-healing: When a selector breaks, the AI agent reasons about the failure and decides how to recover. This is fundamentally more flexible because the agent can understand context, not just apply rules.
The Three Levels of Recovery
Level 1: Automatic Retry with Alternative Selectors
When the primary selector fails, the agent tries alternatives:
Agent: Bash("vibe-check click '#submit-btn'")
→ Error: timeout waiting for '#submit-btn'
Agent thinking: "The submit button selector failed. Let me find what buttons
exist on the page."
Agent: Bash("vibe-check find-all 'button' --json")
→ [
{"tag": "button", "text": "Cancel", "selector": "button:nth-of-type(1)"},
{"tag": "button", "text": "Submit Form", "selector": "button:nth-of-type(2)"},
{"tag": "button", "text": "Save Draft", "selector": "button:nth-of-type(3)"}
]
Agent thinking: "I see a button with text 'Submit Form'. That's likely the
submit button with a changed ID. Let me try clicking it."
Agent: Bash("vibe-check click 'button:nth-of-type(2)'")
→ Success
Token cost: ~500 tokens for the recovery (find-all + reasoning + retry) Success rate: ~80% for simple selector changes
Level 2: Page Analysis with Text and Screenshots
When Level 1 doesn't work, the agent takes a deeper look:
Agent: Bash("vibe-check find-all 'button'")
→ No buttons found (the page didn't load correctly)
Agent: Bash("vibe-check screenshot -o debug.png")
Agent: Bash("vibe-check text")
→ "Loading... Please wait"
Agent thinking: "The page is still loading. There's a loading state I need
to wait for."
Agent: Bash("vibe-check wait '.loading' --state hidden --timeout 60000")
Agent: Bash("vibe-check click '#submit-btn'")
→ Success
Token cost: ~800 tokens Success rate: ~90% (catches timing issues, loading states, redirects)
Level 3: MCP Fallback for Semantic Analysis
When the page structure has fundamentally changed:
Agent: "CLI-based recovery hasn't worked. Let me use MCP for semantic analysis."
Agent: browser_snapshot() # MCP tool - returns accessibility tree
→ [role="form" name="Contact Form"]
[role="textbox" name="Your email"]
[role="textbox" name="Your message"]
[role="button" name="Send message"]
Agent thinking: "The form has been completely redesigned. The submit button
is now called 'Send message' and uses different HTML structure. Let me
update my approach."
Agent: Bash("vibe-check click '[aria-label=\"Send message\"]'")
→ Success
Token cost: ~5,500 tokens (expensive but necessary) Success rate: ~95% (handles major redesigns)
Recovery Strategy Matrix
| Failure Type | Detection | Recovery | Cost |
|---|---|---|---|
| ID/class changed | Selector timeout | Find by text content | Low |
| Element moved | Selector found wrong element | Find by context/position | Low |
| Page restructured | Multiple selectors fail | Screenshot + text analysis | Medium |
| Loading issue | Element not yet present | Wait for loading to complete | Low |
| Modal/overlay blocking | "Obscured by" error | Close overlay first | Low |
| Complete redesign | All approaches fail | MCP semantic analysis | High |
| Authentication expired | Redirect to login | Re-authenticate | Medium |
Implementing Self-Healing in Your Framework
Wrapper Function
# smart_click: Click with self-healing
smart_click() {
local primary_selector="$1"
local expected_text="${2:-}" # Optional: what the element should say
# Attempt 1: Primary selector
if vibe-check click "$primary_selector" 2>/dev/null; then
return 0
fi
echo "[heal] Primary selector failed: $primary_selector"
# Attempt 2: Find by text content
if [ -n "$expected_text" ]; then
local alt_selector=$(vibe-check find-all "button, a, [role=button]" --json 2>/dev/null | \
python3 -c "
import json, sys
for el in json.load(sys.stdin):
if '$expected_text'.lower() in el.get('text', '').lower():
print(el['selector'])
break
")
if [ -n "$alt_selector" ] && vibe-check click "$alt_selector" 2>/dev/null; then
echo "[heal] Recovered using text match: $alt_selector"
return 0
fi
fi
# Attempt 3: Screenshot for debugging
vibe-check screenshot -o "heal_debug_$(date +%s).png" 2>/dev/null
echo "[heal] All recovery attempts failed for: $primary_selector"
return 1
}
# Usage:
smart_click "#submit-btn" "Submit"
smart_click ".login-button" "Log in"
Agent-Native Approach
Rather than scripting recovery logic, let the agent handle it naturally:
# Test definition with recovery hints
test: Submit contact form
steps:
- Fill in the contact form
- Click the submit button
recovery_hints:
submit_button:
primary: "#submit-btn"
fallbacks:
- "button[type=submit]"
- "button:has-text('Submit')"
- "button:has-text('Send')"
semantic: "The button that submits the form"
The agent reads the recovery hints and uses them if the primary selector fails.
Self-Healing vs Flaky Tests
What's the Difference?
Flaky test: The test itself is unreliable (race conditions, timing issues, external dependencies). Self-healing doesn't fix this — it masks it.
Broken selector: The application changed, the test selector is stale. Self-healing legitimately fixes this.
How to Tell Them Apart
| Signal | Flaky Test | Broken Selector |
|---|---|---|
| Fails intermittently | Yes | No (consistent failure) |
| Same selector works sometimes | Yes | No |
| UI visually unchanged | Yes | No (UI was updated) |
| Error message | Timeout (element exists but timing varies) | Not found (element genuinely missing) |
The Rule
Self-healing should fix broken selectors, not mask flaky tests. If a test needs healing on every run, it's flaky and needs to be fixed at the source.
Tracking Self-Healing Events
Log every healing event for analysis:
{
"timestamp": "2026-02-09T14:23:05Z",
"test": "login_valid",
"step": 4,
"primary_selector": "#submit-btn",
"healed_selector": "button:has-text('Sign in')",
"healing_level": 1,
"healing_cost_tokens": 450,
"reason": "ID changed from 'submit-btn' to 'sign-in-btn'"
}
Review healing logs weekly:
- High healing rate on one test → Update the test selectors
- High healing rate across tests → Major UI refactor happened, update test suite
- Healing always at Level 3 → Your selectors are too fragile, use more robust strategies
Selector Robustness Hierarchy
From most to least robust:
| Strategy | Example | Resilience | Readability |
|---|---|---|---|
data-testid |
[data-testid="submit"] |
Highest | High |
| ARIA label | [aria-label="Submit form"] |
High | High |
| Role + text | button:has-text("Submit") |
High | High |
| Semantic HTML | form button[type=submit] |
Medium | Medium |
| Class name | .btn-submit |
Medium | Medium |
| ID | #submit-btn |
Medium | High |
| CSS path | div > form > div:nth-child(3) > button |
Lowest | Lowest |
Recommendation for AI-driven tests: Use data-testid attributes where possible (requires collaboration with developers). Fall back to ARIA labels or role+text selectors.
Interview Talking Point
"Our self-healing approach has three tiers. First, when a selector fails, the agent searches for alternative elements by text content — this handles 80% of cases at ~500 tokens. Second, the agent takes a screenshot and reads page text to understand the current state — catching loading issues and redirects. Third, as a fallback, we use MCP's accessibility tree for semantic analysis when the page has been fundamentally redesigned. The key insight is that AI agent self-healing is qualitatively different from traditional rule-based healing: the agent can reason about context, understand error messages, and make judgment calls. But we track healing events carefully — if a test needs healing every run, that's a flaky test that needs fixing, not a valid healing scenario."