Architecture Decision Records for an AI Test Automation Framework
ADR-001: Agent-Driven vs Traditional Test Execution
Context
Traditional test frameworks (Selenium, Playwright, Cypress) execute tests as deterministic scripts. AI agents introduce non-deterministic reasoning into the test execution loop.
Decision
Use an agent-as-orchestrator model: the AI agent reads test definitions (natural language or structured), decides how to interact with the application, and reports results. The framework provides the tools (vibe-check CLI), but the agent decides the execution strategy.
Consequences
- (+) Tests are more resilient — the agent can reason about unexpected states
- (+) Test definitions can be higher-level ("verify login works") instead of step-by-step
- (+) Self-healing: when selectors break, the agent can find alternatives
- (-) Non-deterministic — the same test might execute differently each time
- (-) Harder to debug — agent reasoning is opaque compared to line-by-line scripts
- (-) Slower per-test than deterministic execution
Mitigation
- Log every command the agent executes (for reproducibility)
- Screenshot on every state change (for debugging)
- Set deterministic timeouts (prevent infinite loops)
- Fallback to deterministic scripts for critical paths
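The mitigations above can be sketched as a thin shell wrapper around each agent-issued command. The wrapper name, log format, and timeout default are illustrative assumptions, not framework API; only the `vibe-check screenshot -o` invocation follows usage shown elsewhere in this document.

```shell
# Hypothetical wrapper: log every command (reproducibility), enforce a hard
# timeout (no infinite loops), and screenshot on failure (debugging).
LOG_FILE="${LOG_FILE:-agent_commands.log}"
STEP_TIMEOUT="${STEP_TIMEOUT:-30}"   # seconds; deterministic upper bound

run_step() {
  # Append a timestamped record of the exact command before running it
  printf '%s\t%s\n' "$(date -u +%FT%TZ)" "$*" >> "$LOG_FILE"
  if ! timeout "$STEP_TIMEOUT" "$@"; then
    mkdir -p failures
    vibe-check screenshot -o "failures/step_$(date +%s).png" || true
    return 1
  fi
}

# Usage: run_step vibe-check click "button[type=submit]"
```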
ADR-002: CLI Skill as Primary Browser Interface
Context
We evaluated three approaches:
- Playwright MCP server (structured tools, accessibility trees)
- Vibium vibe-check skill (CLI commands via Bash)
- Direct Playwright/Vibium client library
Decision
Use vibe-check CLI skill as the primary interface, with Playwright MCP available as fallback for page discovery.
Rationale
| Criterion | MCP | Skill | Library |
|---|---|---|---|
| Token cost per step | ~5,000 | ~130 | N/A (not usable by agent) |
| Agent integration | Native | Via Bash tool | Requires code generation |
| Setup complexity | Medium | Low | High |
| Page understanding | Rich | Text only | Programmatic |
| CI compatibility | Yes | Yes | Yes |
The skill approach leaves 98% of context for reasoning while still providing all needed browser interactions.
Consequences
- (+) ~38x lower token cost per step than MCP (~130 vs ~5,000 tokens, per the table above)
- (+) Simple setup (one `npx skills add` command)
- (+) Composable with other CLI tools
- (-) No structured page understanding (text/screenshots only)
- (-) Requires CSS selector knowledge
- (-) Less rich error context than MCP
ADR-003: Test Definition Format
Context
Tests need to be defined in a format the agent can read and execute. Options:
- Natural language descriptions
- Gherkin (Given/When/Then)
- Structured YAML/JSON
- Hybrid (structured with natural language steps)
Decision
Use structured YAML with natural language steps:
```yaml
test: Login with valid credentials
preconditions:
  - User "test@example.com" exists with password "secret123"
steps:
  - Navigate to the login page
  - Enter email "test@example.com"
  - Enter password "secret123"
  - Click the login button
  - Verify the dashboard loads with welcome message
expected:
  - Dashboard page is displayed
  - Welcome message contains "test@example.com"
selectors: # Optional hints for the agent
  login_page: https://app.example.com/login
  email_input: "input[name=email]"
  password_input: "input[name=password]"
  submit_button: "button[type=submit]"
  welcome_message: ".welcome-msg"
```
Rationale
- Natural language steps give the agent freedom to adapt
- Structured format enables programmatic test management
- Optional selector hints speed up execution when available
- The agent can ignore hints and discover selectors if they're stale
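A runner might hand the natural-language steps to the agent one at a time. This is a minimal sketch assuming the flat YAML layout above; `extract_steps` is an illustrative name, and a real implementation would use a proper YAML parser rather than awk.

```shell
# Sketch: print each entry under the top-level "steps:" key, one per line.
# Assumes the flat test-definition layout shown above.
extract_steps() {
  awk '/^steps:/{flag=1; next} /^[a-z_]+:/{flag=0} flag && /^ *- /{sub(/^ *- /, ""); print}' "$1"
}
```

Each emitted line is then a self-contained instruction the agent can act on.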
ADR-004: Daemon Mode for Development, Oneshot for CI
Context
Vibium supports daemon mode (persistent browser) and oneshot mode (fresh browser per command).
Decision
- Development/local: Daemon mode for speed (~100ms per command vs ~2s)
- CI/CD: Oneshot mode with `--headless` for isolation
Configuration
```bash
# Development
export VIBIUM_MODE=daemon

# CI
export VIBIUM_ONESHOT=1
export VIBIUM_HEADLESS=1
```
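A small bootstrap script can pick the mode automatically. This sketch assumes the CI system exports the conventional `CI` environment variable; the Vibium variable names are the ones from the configuration above.

```shell
# Sketch: select daemon mode locally, oneshot + headless when running in CI.
# Assumes the CI system sets CI (most do, e.g. CI=true).
if [ -n "${CI:-}" ]; then
  export VIBIUM_ONESHOT=1
  export VIBIUM_HEADLESS=1
else
  export VIBIUM_MODE=daemon
fi
```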
Consequences
- (+) Fast feedback during development
- (+) Clean isolation in CI
- (-) State leaks possible in daemon mode (mitigated by explicit cleanup between tests)
ADR-005: Screenshot-Driven Debugging
Context
When a test fails, the agent needs to understand what went wrong. Options:
- HTML dump of the page
- Screenshot capture
- Accessibility tree snapshot
- All of the above
Decision
Capture screenshot + page text on every failure. HTML dump on request.
```bash
# On failure, the agent executes (timestamp captured once so all three
# artifacts share the same name — repeated $(date +%s) calls could straddle
# a second boundary):
TS=$(date +%s)
vibe-check screenshot -o "failures/${TEST_NAME}_${TS}.png"
vibe-check text > "failures/${TEST_NAME}_${TS}.txt"
vibe-check url >> "failures/${TEST_NAME}_${TS}.txt"
```
Rationale
- Screenshots provide visual context for humans reviewing failures
- Page text provides semantic context for the agent to reason about
- HTML dumps are rarely needed but available via `vibe-check html`
- All three are cheap (no MCP overhead)
ADR-006: Error Recovery Strategy
Context
When a command fails (selector not found, timeout, etc.), the agent needs a strategy.
Decision
Three-tier recovery:
Tier 1: Intelligent Retry (Agent Reasoning)
```
Agent: vibe-check click ".btn-submit" → TIMEOUT
Agent: "The submit button wasn't found. Let me investigate."
Agent: vibe-check find-all "button" → lists all buttons
Agent: vibe-check text "button" → reads button text
Agent: "Found a button with text 'Submit'. Trying a different selector."
Agent: vibe-check click "button:has-text('Submit')" → SUCCESS
```
Tier 2: Screenshot Analysis
```
Agent: vibe-check screenshot -o debug.png
Agent: *analyzes screenshot*
Agent: "I see a loading spinner. The page hasn't finished loading."
Agent: vibe-check wait ".spinner" --state hidden --timeout 60000
Agent: vibe-check click ".btn-submit" → SUCCESS
```
Tier 3: MCP Fallback (if available)
```
Agent: "CLI approach failed. Switching to MCP for rich page analysis."
Agent: browser_snapshot() → accessibility tree
Agent: *finds element via semantic analysis*
Agent: "Found the element. It's behind a modal. Need to close the modal first."
```
Consequences
- (+) Most failures resolve at Tier 1 (cheap)
- (+) Tier 2 provides visual debugging even for automated runs
- (+) Tier 3 gives a safety net for complex scenarios
- (-) Multi-tier recovery adds latency to failure cases
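Tier 1 can also be packaged as a small helper that the agent (or a deterministic fallback script) calls. The helper name and fallback strategy are assumptions, not framework API; the selector syntax mirrors the transcript above.

```shell
# Sketch: try the hinted selector first, then an agent-supplied fallback.
# "click_with_fallback" is an illustrative name, not part of the framework.
click_with_fallback() {
  local hint="$1" fallback="$2"
  vibe-check click "$hint" && return 0
  echo "Hint selector failed; retrying with fallback: $fallback" >&2
  vibe-check click "$fallback"
}

# Usage: click_with_fallback ".btn-submit" "button:has-text('Submit')"
```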
ADR-007: Test Result Reporting
Context
Test results need to be consumable by:
- The AI agent (for reasoning about pass/fail)
- Developers (for debugging)
- CI systems (for pass/fail gates)
Decision
Three output formats:
Console output (for CI):
```
PASS login_valid_credentials (2.3s)
PASS login_invalid_password (1.8s)
FAIL login_expired_account (5.1s) — Expected "Account expired", got "Welcome"
PASS signup_new_user (3.2s)

4 tests: 3 passed, 1 failed
```
JSON report (for programmatic consumption):
```json
{
  "suite": "authentication",
  "tests": [
    {
      "name": "login_valid_credentials",
      "status": "pass",
      "duration_ms": 2300,
      "commands": ["navigate", "type", "type", "click", "wait", "text"],
      "screenshots": ["login_step1.png", "login_result.png"]
    }
  ]
}
```
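A CI pass/fail gate can be derived from this JSON with a small helper. `gate` and the report path are illustrative names; the schema is the one shown above, and `jq -e` exits non-zero when the expression is false, which fails the pipeline.

```shell
# Sketch: exit non-zero when any test in the report did not pass.
# "gate" is an illustrative helper, not framework API.
gate() {
  jq -e '[.tests[] | select(.status != "pass")] | length == 0' "$1" > /dev/null
}
```

In a CI script this would be used as `gate report.json || exit 1`.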
Failure artifacts (for debugging):
```
failures/
├── login_expired_account/
│   ├── screenshot.png
│   ├── page_text.txt
│   ├── page_url.txt
│   └── agent_reasoning.md
```
ADR-008: Parallel Test Execution
Context
Running tests sequentially is slow, but parallelizing browser tests raises shared-state concerns.
Decision
- Sequential by default (daemon mode, shared browser)
- Parallel via oneshot (each test gets its own browser instance)
- Parallel limit: Match available CPU cores (each Chrome instance uses ~200MB RAM + 1 core)
```bash
# Parallel execution (4 workers)
cat test_list.txt | xargs -P4 -I{} bash -c '
  VIBIUM_ONESHOT=1 run_test "{}"
'
```
Constraints
- Each parallel worker needs its own Chrome instance (~200MB RAM)
- 8-core CI machine: max ~6 parallel workers (leave 2 cores for OS + agent)
- Network-bound tests may not benefit from parallelism
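The worker-count constraint above can be computed at job start. This is a sketch: the reserve of two cores follows the constraint stated here, while the variable names and the fallback core count are illustrative.

```shell
# Sketch: size the worker pool from available cores, reserving two for the
# OS and the agent. Falls back to assuming 4 cores if detection fails.
CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
WORKERS=$(( CORES > 3 ? CORES - 2 : 1 ))
echo "Using $WORKERS parallel workers"
```

The computed value would then replace the hard-coded `-P4` in the xargs invocation above.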