Self-Healing Strategies for Agent-Driven Tests

What "Self-Healing" Means

Traditional self-healing (testRigor, Mabl, etc.): When a selector breaks, the tool automatically finds an alternative using heuristics (visual position, text content, nearby elements).

Agent-driven self-healing: When a selector breaks, the AI agent reasons about the failure and decides how to recover. This is fundamentally more flexible because the agent can understand context, not just apply rules.

The Three Levels of Recovery

Level 1: Automatic Retry with Alternative Selectors

When the primary selector fails, the agent tries alternatives:

Agent: Bash("vibe-check click '#submit-btn'")
→ Error: timeout waiting for '#submit-btn'

Agent thinking: "The submit button selector failed. Let me find what buttons
exist on the page."

Agent: Bash("vibe-check find-all 'button' --json")
→ [
    {"tag": "button", "text": "Cancel", "selector": "button:nth-of-type(1)"},
    {"tag": "button", "text": "Submit Form", "selector": "button:nth-of-type(2)"},
    {"tag": "button", "text": "Save Draft", "selector": "button:nth-of-type(3)"}
  ]

Agent thinking: "I see a button with text 'Submit Form'. That's likely the
submit button with a changed ID. Let me try clicking it."

Agent: Bash("vibe-check click 'button:nth-of-type(2)'")
→ Success

Token cost: ~500 tokens for the recovery (find-all + reasoning + retry) Success rate: ~80% for simple selector changes

Level 2: Page Analysis with Text and Screenshots

When Level 1 doesn't work, the agent takes a deeper look:

Agent: Bash("vibe-check find-all 'button'")
→ No buttons found (the page didn't load correctly)

Agent: Bash("vibe-check screenshot -o debug.png")
Agent: Bash("vibe-check text")
→ "Loading... Please wait"

Agent thinking: "The page is still loading. There's a loading state I need
to wait for."

Agent: Bash("vibe-check wait '.loading' --state hidden --timeout 60000")
Agent: Bash("vibe-check click '#submit-btn'")
→ Success

Token cost: ~800 tokens Success rate: ~90% (catches timing issues, loading states, redirects)

Level 3: MCP Fallback for Semantic Analysis

When the page structure has fundamentally changed:

Agent: "CLI-based recovery hasn't worked. Let me use MCP for semantic analysis."

Agent: browser_snapshot()  # MCP tool - returns accessibility tree
→ [role="form" name="Contact Form"]
    [role="textbox" name="Your email"]
    [role="textbox" name="Your message"]
    [role="button" name="Send message"]

Agent thinking: "The form has been completely redesigned. The submit button
is now called 'Send message' and uses different HTML structure. Let me
update my approach."

Agent: Bash("vibe-check click '[aria-label=\"Send message\"]'")
→ Success

Token cost: ~5,500 tokens (expensive but necessary) Success rate: ~95% (handles major redesigns)

Recovery Strategy Matrix

Failure Type	Detection	Recovery	Cost
ID/class changed	Selector timeout	Find by text content	Low
Element moved	Selector found wrong element	Find by context/position	Low
Page restructured	Multiple selectors fail	Screenshot + text analysis	Medium
Loading issue	Element not yet present	Wait for loading to complete	Low
Modal/overlay blocking	"Obscured by" error	Close overlay first	Low
Complete redesign	All approaches fail	MCP semantic analysis	High
Authentication expired	Redirect to login	Re-authenticate	Medium

Implementing Self-Healing in Your Framework

Wrapper Function

# smart_click: Click with self-healing
smart_click() {
  local primary_selector="$1"
  local expected_text="${2:-}"  # Optional: what the element should say

  # Attempt 1: Primary selector
  if vibe-check click "$primary_selector" 2>/dev/null; then
    return 0
  fi

  echo "[heal] Primary selector failed: $primary_selector"

  # Attempt 2: Find by text content
  if [ -n "$expected_text" ]; then
    local alt_selector=$(vibe-check find-all "button, a, [role=button]" --json 2>/dev/null | \
      python3 -c "
import json, sys
for el in json.load(sys.stdin):
    if '$expected_text'.lower() in el.get('text', '').lower():
        print(el['selector'])
        break
")
    if [ -n "$alt_selector" ] && vibe-check click "$alt_selector" 2>/dev/null; then
      echo "[heal] Recovered using text match: $alt_selector"
      return 0
    fi
  fi

  # Attempt 3: Screenshot for debugging
  vibe-check screenshot -o "heal_debug_$(date +%s).png" 2>/dev/null

  echo "[heal] All recovery attempts failed for: $primary_selector"
  return 1
}

# Usage:
smart_click "#submit-btn" "Submit"
smart_click ".login-button" "Log in"

Agent-Native Approach

Rather than scripting recovery logic, let the agent handle it naturally:

# Test definition with recovery hints
test: Submit contact form
steps:
  - Fill in the contact form
  - Click the submit button
recovery_hints:
  submit_button:
    primary: "#submit-btn"
    fallbacks:
      - "button[type=submit]"
      - "button:has-text('Submit')"
      - "button:has-text('Send')"
    semantic: "The button that submits the form"

The agent reads the recovery hints and uses them if the primary selector fails.

Self-Healing vs Flaky Tests

What's the Difference?

Flaky test: The test itself is unreliable (race conditions, timing issues, external dependencies). Self-healing doesn't fix this — it masks it.

Broken selector: The application changed, the test selector is stale. Self-healing legitimately fixes this.

How to Tell Them Apart

Signal	Flaky Test	Broken Selector
Fails intermittently	Yes	No (consistent failure)
Same selector works sometimes	Yes	No
UI visually unchanged	Yes	No (UI was updated)
Error message	Timeout (element exists but timing varies)	Not found (element genuinely missing)

The Rule

Self-healing should fix broken selectors, not mask flaky tests. If a test needs healing on every run, it's flaky and needs to be fixed at the source.

Tracking Self-Healing Events

Log every healing event for analysis:

{
  "timestamp": "2026-02-09T14:23:05Z",
  "test": "login_valid",
  "step": 4,
  "primary_selector": "#submit-btn",
  "healed_selector": "button:has-text('Sign in')",
  "healing_level": 1,
  "healing_cost_tokens": 450,
  "reason": "ID changed from 'submit-btn' to 'sign-in-btn'"
}

Review healing logs weekly:

High healing rate on one test → Update the test selectors
High healing rate across tests → Major UI refactor happened, update test suite
Healing always at Level 3 → Your selectors are too fragile, use more robust strategies

Selector Robustness Hierarchy

From most to least robust:

Strategy	Example	Resilience	Readability
`data-testid`	`[data-testid="submit"]`	Highest	High
ARIA label	`[aria-label="Submit form"]`	High	High
Role + text	`button:has-text("Submit")`	High	High
Semantic HTML	`form button[type=submit]`	Medium	Medium
Class name	`.btn-submit`	Medium	Medium
ID	`#submit-btn`	Medium	High
CSS path	`div > form > div:nth-child(3) > button`	Lowest	Lowest

Recommendation for AI-driven tests: Use data-testid attributes where possible (requires collaboration with developers). Fall back to ARIA labels or role+text selectors.

Interview Talking Point

"Our self-healing approach has three tiers. First, when a selector fails, the agent searches for alternative elements by text content — this handles 80% of cases at ~500 tokens. Second, the agent takes a screenshot and reads page text to understand the current state — catching loading issues and redirects. Third, as a fallback, we use MCP's accessibility tree for semantic analysis when the page has been fundamentally redesigned. The key insight is that AI agent self-healing is qualitatively different from traditional rule-based healing: the agent can reason about context, understand error messages, and make judgment calls. But we track healing events carefully — if a test needs healing every run, that's a flaky test that needs fixing, not a valid healing scenario."