Architecture Decision Records for an AI Test Automation Framework
ADR-001: Agent-Driven vs Traditional Test Execution
Context
Traditional test frameworks (Selenium, Playwright, Cypress) execute tests as deterministic scripts. AI agents introduce non-deterministic reasoning into the test execution loop.
Decision
Use an agent-as-orchestrator model: the AI agent reads test definitions (natural language or structured), decides how to interact with the application, and reports results. The framework provides the tools (vibe-check CLI), but the agent decides the execution strategy.
Consequences
- (+) Tests are more resilient — the agent can reason about unexpected states
- (+) Test definitions can be higher-level ("verify login works") instead of step-by-step
- (+) Self-healing: when selectors break, the agent can find alternatives
- (-) Non-deterministic — the same test might execute differently each time
- (-) Harder to debug — agent reasoning is opaque compared to line-by-line scripts
- (-) Slower per-test than deterministic execution
Mitigation
- Log every command the agent executes (for reproducibility)
- Screenshot on every state change (for debugging)
- Set deterministic timeouts (prevent infinite loops)
- Fallback to deterministic scripts for critical paths
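The mitigations above can be sketched as a thin shell wrapper around each agent-issued command. The wrapper name, log format, and timeout default are illustrative assumptions, not framework API; only the `vibe-check screenshot -o` invocation follows usage shown elsewhere in this document.

```shell
# Hypothetical wrapper: log every command (reproducibility), enforce a hard
# timeout (no infinite loops), and screenshot on failure (debugging).
LOG_FILE="${LOG_FILE:-agent_commands.log}"
STEP_TIMEOUT="${STEP_TIMEOUT:-30}"   # seconds; deterministic upper bound

run_step() {
  # Append a timestamped record of the exact command before running it
  printf '%s\t%s\n' "$(date -u +%FT%TZ)" "$*" >> "$LOG_FILE"
  if ! timeout "$STEP_TIMEOUT" "$@"; then
    mkdir -p failures
    vibe-check screenshot -o "failures/step_$(date +%s).png" || true
    return 1
  fi
}

# Usage: run_step vibe-check click "button[type=submit]"
```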
ADR-002: CLI Skill as Primary Browser Interface
Context
We evaluated three approaches:
- Playwright MCP server (structured tools, accessibility trees)
- Vibium vibe-check skill (CLI commands via Bash)
- Direct Playwright/Vibium client library
Decision
Use vibe-check CLI skill as the primary interface, with Playwright MCP available as fallback for page discovery.
Rationale
| Criterion | MCP | Skill | Library |
|---|---|---|---|
| Token cost per step | ~5,000 | ~130 | N/A (not usable by agent) |
| Agent integration | Native | Via Bash tool | Requires code generation |
| Setup complexity | Medium | Low | High |
| Page understanding | Rich | Text only | Programmatic |
| CI compatibility | Yes | Yes | Yes |
The skill approach leaves 98% of context for reasoning while still providing all needed browser interactions.
Consequences
- (+) ~38x lower token cost per step than MCP (~130 vs ~5,000 tokens, per the table above)
- (+) Simple setup (one `npx skills add` command)
- (+) Composable with other CLI tools
- (-) No structured page understanding (text/screenshots only)
- (-) Requires CSS selector knowledge
- (-) Less rich error context than MCP
ADR-003: Test Definition Format
Context
Tests need to be defined in a format the agent can read and execute. Options:
- Natural language descriptions
- Gherkin (Given/When/Then)
- Structured YAML/JSON
- Hybrid (structured with natural language steps)
Decision
Use structured YAML with natural language steps:
```yaml
test: Login with valid credentials
preconditions:
  - User "test@example.com" exists with password "secret123"
steps:
  - Navigate to the login page
  - Enter email "test@example.com"
  - Enter password "secret123"
  - Click the login button
  - Verify the dashboard loads with welcome message
expected:
  - Dashboard page is displayed
  - Welcome message contains "test@example.com"
selectors: # Optional hints for the agent
  login_page: https://app.example.com/login
  email_input: "input[name=email]"
  password_input: "input[name=password]"
  submit_button: "button[type=submit]"
  welcome_message: ".welcome-msg"
```
Rationale
- Natural language steps give the agent freedom to adapt
- Structured format enables programmatic test management
- Optional selector hints speed up execution when available
- The agent can ignore hints and discover selectors if they're stale
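A runner might hand the natural-language steps to the agent one at a time. This is a minimal sketch assuming the flat YAML layout above; `extract_steps` is an illustrative name, and a real implementation would use a proper YAML parser rather than awk.

```shell
# Sketch: print each entry under the top-level "steps:" key, one per line.
# Assumes the flat test-definition layout shown above.
extract_steps() {
  awk '/^steps:/{flag=1; next} /^[a-z_]+:/{flag=0} flag && /^ *- /{sub(/^ *- /, ""); print}' "$1"
}
```

Each emitted line is then a self-contained instruction the agent can act on.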
ADR-004: Daemon Mode for Development, Oneshot for CI
Context
Vibium supports daemon mode (persistent browser) and oneshot mode (fresh browser per command).
Decision
- Development/local: Daemon mode for speed (~100ms per command vs ~2s)
- CI/CD: Oneshot mode with `--headless` for isolation
Configuration
```bash
# Development
export VIBIUM_MODE=daemon

# CI
export VIBIUM_ONESHOT=1
export VIBIUM_HEADLESS=1
```
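A small bootstrap script can pick the mode automatically. This sketch assumes the CI system exports the conventional `CI` environment variable; the Vibium variable names are the ones from the configuration above.

```shell
# Sketch: select daemon mode locally, oneshot + headless when running in CI.
# Assumes the CI system sets CI (most do, e.g. CI=true).
if [ -n "${CI:-}" ]; then
  export VIBIUM_ONESHOT=1
  export VIBIUM_HEADLESS=1
else
  export VIBIUM_MODE=daemon
fi
```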
Consequences
- (+) Fast feedback during development
- (+) Clean isolation in CI
- (-) State leaks possible in daemon mode (mitigated by explicit cleanup between tests)
ADR-005: Screenshot-Driven Debugging
Context
When a test fails, the agent needs to understand what went wrong. Options:
- HTML dump of the page
- Screenshot capture
- Accessibility tree snapshot
- All of the above
Decision
Capture screenshot + page text on every failure. HTML dump on request.
```bash
# On failure, the agent executes (timestamp captured once so all three
# artifacts share the same name — repeated $(date +%s) calls could straddle
# a second boundary):
TS=$(date +%s)
vibe-check screenshot -o "failures/${TEST_NAME}_${TS}.png"
vibe-check text > "failures/${TEST_NAME}_${TS}.txt"
vibe-check url >> "failures/${TEST_NAME}_${TS}.txt"
```
Rationale
- Screenshots provide visual context for humans reviewing failures
- Page text provides semantic context for the agent to reason about
- HTML dumps are rarely needed but available via `vibe-check html`
- All three are cheap (no MCP overhead)
ADR-006: Error Recovery Strategy
Context
When a command fails (selector not found, timeout, etc.), the agent needs a strategy.
Decision
Three-tier recovery:
Tier 1: Intelligent Retry (Agent Reasoning)
```
Agent: vibe-check click ".btn-submit" → TIMEOUT
Agent: "The submit button wasn't found. Let me investigate."
Agent: vibe-check find-all "button" → lists all buttons
Agent: vibe-check text "button" → reads button text
Agent: "Found a button with text 'Submit'. Trying a different selector."
Agent: vibe-check click "button:has-text('Submit')" → SUCCESS
```
Tier 2: Screenshot Analysis
```
Agent: vibe-check screenshot -o debug.png
Agent: *analyzes screenshot*
Agent: "I see a loading spinner. The page hasn't finished loading."
Agent: vibe-check wait ".spinner" --state hidden --timeout 60000
Agent: vibe-check click ".btn-submit" → SUCCESS
```
Tier 3: MCP Fallback (if available)
```
Agent: "CLI approach failed. Switching to MCP for rich page analysis."
Agent: browser_snapshot() → accessibility tree
Agent: *finds element via semantic analysis*
Agent: "Found the element. It's behind a modal. Need to close the modal first."
```
Consequences
- (+) Most failures resolve at Tier 1 (cheap)
- (+) Tier 2 provides visual debugging even for automated runs
- (+) Tier 3 gives a safety net for complex scenarios
- (-) Multi-tier recovery adds latency to failure cases
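Tier 1 can also be packaged as a small helper that the agent (or a deterministic fallback script) calls. The helper name and fallback strategy are assumptions, not framework API; the selector syntax mirrors the transcript above.

```shell
# Sketch: try the hinted selector first, then an agent-supplied fallback.
# "click_with_fallback" is an illustrative name, not part of the framework.
click_with_fallback() {
  local hint="$1" fallback="$2"
  vibe-check click "$hint" && return 0
  echo "Hint selector failed; retrying with fallback: $fallback" >&2
  vibe-check click "$fallback"
}

# Usage: click_with_fallback ".btn-submit" "button:has-text('Submit')"
```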
ADR-007: Test Result Reporting
Context
Test results need to be consumable by:
- The AI agent (for reasoning about pass/fail)
- Developers (for debugging)
- CI systems (for pass/fail gates)
Decision
Three output formats:
Console output (for CI):
```
PASS login_valid_credentials (2.3s)
PASS login_invalid_password (1.8s)
FAIL login_expired_account (5.1s) — Expected "Account expired", got "Welcome"
PASS signup_new_user (3.2s)

4 tests: 3 passed, 1 failed
```
JSON report (for programmatic consumption):
```json
{
  "suite": "authentication",
  "tests": [
    {
      "name": "login_valid_credentials",
      "status": "pass",
      "duration_ms": 2300,
      "commands": ["navigate", "type", "type", "click", "wait", "text"],
      "screenshots": ["login_step1.png", "login_result.png"]
    }
  ]
}
```
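A CI pass/fail gate can be derived from this JSON with a small helper. `gate` and the report path are illustrative names; the schema is the one shown above, and `jq -e` exits non-zero when the expression is false, which fails the pipeline.

```shell
# Sketch: exit non-zero when any test in the report did not pass.
# "gate" is an illustrative helper, not framework API.
gate() {
  jq -e '[.tests[] | select(.status != "pass")] | length == 0' "$1" > /dev/null
}
```

In a CI script this would be used as `gate report.json || exit 1`.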
Failure artifacts (for debugging):
```
failures/
├── login_expired_account/
│   ├── screenshot.png
│   ├── page_text.txt
│   ├── page_url.txt
│   └── agent_reasoning.md
```
ADR-008: Parallel Test Execution
Context
Running tests sequentially is slow, but parallelizing browser tests raises shared-state concerns.
Decision
- Sequential by default (daemon mode, shared browser)
- Parallel via oneshot (each test gets its own browser instance)
- Parallel limit: Match available CPU cores (each Chrome instance uses ~200MB RAM + 1 core)
```bash
# Parallel execution (4 workers)
cat test_list.txt | xargs -P4 -I{} bash -c '
  VIBIUM_ONESHOT=1 run_test "{}"
'
```
Constraints
- Each parallel worker needs its own Chrome instance (~200MB RAM)
- 8-core CI machine: max ~6 parallel workers (leave 2 cores for OS + agent)
- Network-bound tests may not benefit from parallelism
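The worker-count constraint above can be computed at job start. This is a sketch: the reserve of two cores follows the constraint stated here, while the variable names and the fallback core count are illustrative.

```shell
# Sketch: size the worker pool from available cores, reserving two for the
# OS and the agent. Falls back to assuming 4 cores if detection fails.
CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
WORKERS=$(( CORES > 3 ? CORES - 2 : 1 ))
echo "Using $WORKERS parallel workers"
```

The computed value would then replace the hard-coded `-P4` in the xargs invocation above.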