Agent Skills for Driving a Browser in UI Test Automation: The Complete QA Engineer's Guide
Prepared for: QA engineers who want a gentle introduction to roles where heavy AI usage is expected. Focus: driving a browser through agent skills (not MCP servers), building an AI-augmented test automation framework, and speaking about it credibly with architect-level developers.
Table of Contents
1. Foundations: How Agent Skills Work — 01-foundations/
- Skill Anatomy — SKILL.md structure with annotated examples
- Skill Lifecycle — Discovery, selection, invocation, execution, teardown
- Token Economics — Why skills cost dozens of tokens vs thousands for MCP
2. Vibium Deep Dive — 02-vibium-deep-dive/
- CLI Command Reference — All 22 commands with examples and edge cases
- Actionability Checks — The five checks with actual JavaScript source code
- Daemon Architecture — Process model, cleanup, zombie prevention
- Extension Commands — How `vibium:find`, `vibium:click`, `vibium:type` work over BiDi
3. Skills vs MCP: When and Why — 03-skills-vs-mcp/
- Architectural Comparison — Deep side-by-side with diagrams
- Token Budget Analysis — Real numbers: how many tokens each approach costs
- Hybrid Strategies — Using both together in one framework
4. Building an AI Test Automation Framework — 04-building-test-framework/
- Architecture Decisions — ADRs for the framework
- Test Patterns — Patterns for AI-driven tests (navigation, forms, assertions, data extraction)
- CI/CD Integration — Running in GitHub Actions, handling headless mode, artifacts
- Reporting and Observability — What to capture, how to present results
- Self-Healing Strategies — How agents recover from broken selectors
5. WebDriver BiDi Protocol — 05-webdriver-bidi/
- Protocol Overview — Message format, sessions, commands vs events
- Evolution from Selenium — Historical context from WebDriver to BiDi
- Vibium Extension Commands — How `vibium:find/click/type` extend the protocol
6. Interview Preparation — 06-interview-preparation/
- Architect QA Scenarios — 20 questions with detailed answers
- Framework Presentation — How to present your framework in 5, 15, and 30 minutes
- Buzzword Decoder — What people actually mean by "agentic testing", "self-healing", "ReAct pattern"
7. Competitive Landscape — 07-competitive-landscape/
- Tool Comparison Matrix — Detailed feature-by-feature comparison
- When to Use What — Decision framework for choosing the right tool
- Future Directions — Where the industry is heading (AI locators, Cortex/Retina, video recording)
1. Foundations
Subfolder: `01-foundations/`
What Are Agent Skills?
Agent skills are reusable capability packages for AI coding agents. Unlike MCP servers that expose tool schemas over a protocol (adding thousands of tokens to context), skills inject procedural knowledge — a markdown file that teaches the agent how to accomplish domain-specific tasks using existing tools (Bash, Read, Write, etc.).
The critical insight: Skills don't add new tools. They teach the agent how to use existing tools for a specific domain.
The SKILL.md Contract
Every skill is defined by a single SKILL.md file with two parts, a YAML frontmatter block between `---` markers and a markdown body of instructions:
---
name: vibe-check # Lowercase, hyphens only, max 64 chars
description: | # THE selection signal — Claude reads this to decide if the skill applies
  Browser automation via CLI. Navigate pages, click elements,
  fill forms, take screenshots, extract text.
allowed-tools: Bash # What tools the skill may use
---
# Instructions for the agent (markdown content)
The `vibe-check` CLI automates Chrome via the command line...
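The contract above can be checked mechanically. The following is a minimal sketch, not an official validator: the helper names are hypothetical, the parser only understands the simple `key: value` layout shown above, and allowing digits in a skill name is an assumption that goes beyond the "lowercase, hyphens only, max 64 chars" rule stated in the example.

```python
import re

# Assumption: lowercase letters, digits, and single hyphens; at most 64 characters.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
MAX_NAME_LENGTH = 64


def split_skill_file(text):
    """Split SKILL.md text into (frontmatter, body) around the '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter marker")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter is missing its closing '---' marker")


def read_name(frontmatter):
    """Return the value of the top-level 'name:' key, dropping trailing comments."""
    for line in frontmatter.splitlines():
        if line.startswith("name:"):
            value = line[len("name:"):]
            return value.split("#", 1)[0].strip()
    raise ValueError("frontmatter has no 'name' key")


def is_valid_skill_name(name):
    """Check the length limit and the character rule assumed above."""
    return len(name) <= MAX_NAME_LENGTH and bool(NAME_PATTERN.match(name))


EXAMPLE = """---
name: vibe-check  # Lowercase, hyphens only, max 64 chars
description: |
  Browser automation via CLI.
allowed-tools: Bash
---
# Instructions
The `vibe-check` CLI automates Chrome via the command line.
"""
```

Calling `split_skill_file(EXAMPLE)` and passing the frontmatter to `read_name` yields the skill name, which `is_valid_skill_name` then accepts or rejects. A real project would likely use a full YAML parser instead of the line scan used here.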
How Selection Works
There is no algorithmic routing. Claude receives a formatted list of all available skills inside the Skill tool description. When a user asks something like "take a screenshot of this page," Claude's language model matches intent to skill descriptions through its forward pass — no embeddings, no classifiers, just comprehension.
How Invocation Works
When Claude decides to invoke a skill:
- A visible "loading" message appears to the user
- The SKILL.md content is injected as a hidden system message into conversation context
- Tool permissions from `allowed-tools` are temporarily granted
- Claude executes the skill's instructions using available tools (primarily Bash for CLI skills)
- Permissions revert when the skill completes
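The grant-then-revert behaviour in the steps above can be pictured with a toy model. This sketch is purely illustrative: `AgentSession` and `invoke_skill` are hypothetical names, not part of any Claude Code API, and the real mechanism is internal to the agent.

```python
from contextlib import contextmanager


class AgentSession:
    """Toy stand-in for an agent session; not a real Claude Code API."""

    def __init__(self, base_tools):
        self.allowed_tools = set(base_tools)
        self.context = []  # messages currently in the conversation


@contextmanager
def invoke_skill(session, skill_body, skill_tools):
    """Inject the skill body, grant its tools, and revert the grant on exit."""
    previous_tools = set(session.allowed_tools)
    session.context.append(skill_body)        # SKILL.md content enters the context
    session.allowed_tools |= set(skill_tools)  # temporary grant from allowed-tools
    try:
        yield session
    finally:
        # Permissions revert even if one of the skill's steps raises.
        session.allowed_tools = previous_tools
```

Wrapping the skill's work in `with invoke_skill(session, body, {"Bash"}):` makes the scope of the extra permission explicit, which is the property worth being able to explain in an architecture discussion.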
Key Files to Read
- `01-foundations/01-skill-anatomy.md` — Full breakdown of SKILL.md structure with annotated examples
- `01-foundations/02-skill-lifecycle.md` — Discovery, selection, invocation, execution, teardown
- `01-foundations/03-token-economics.md` — Why skills cost dozens of tokens vs thousands for MCP
2. Vibium Deep Dive
Subfolder: `02-vibium-deep-dive/`
What Is Vibium?
Vibium is browser automation infrastructure built by the creator of Selenium and Appium. It's a single Go binary (~10MB) that:
- Launches and manages Chrome via WebDriver BiDi
- Exposes 22 CLI commands for browser control
- Runs as a daemon (browser persists between commands) or oneshot (fresh browser per command)
- Also provides an MCP server and JS/Python client libraries
The vibe-check Skill
The vibe-check skill (from skills/vibe-check/SKILL.md) is a single SKILL.md file that teaches an AI agent all 22 CLI commands. When installed, the agent can drive Chrome through Bash:
# The agent executes these through the Bash tool
vibe-check navigate https://app.example.com/login
vibe-check type "input[name=email]" "user@test.com"
vibe-check type "input[name=password]" "secret123"
vibe-check click "button[type=submit]"
vibe-check wait "h1"
vibe-check text "h1" # → "Welcome, User"
vibe-check screenshot -o dashboard.png
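When the same flow is driven from a test runner such as pytest rather than typed by the agent, it can be wrapped in a few lines of Python. This is a sketch under assumptions: `login_steps` and `run_steps` are hypothetical helpers, the selectors and URL are the ones from the example above, and a non-zero exit code is assumed to mean the command failed.

```python
import subprocess

CLI = "vibe-check"


def login_steps(base_url, email, password):
    """Return the login flow as argv lists, one per CLI call."""
    return [
        [CLI, "navigate", f"{base_url}/login"],
        [CLI, "type", "input[name=email]", email],
        [CLI, "type", "input[name=password]", password],
        [CLI, "click", "button[type=submit]"],
        [CLI, "wait", "h1"],
        [CLI, "text", "h1"],
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each step in order and stop at the first non-zero exit code.

    Returns the stripped stdout of the last step. `runner` is injectable so
    the flow can be exercised without a browser.
    """
    output = ""
    for argv in steps:
        result = runner(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Assumption: failures are reported via exit code plus stderr.
            raise RuntimeError(f"{' '.join(argv)} failed: {result.stderr.strip()}")
        output = result.stdout.strip()
    return output
```

Passing argv lists instead of a shell string avoids quoting problems with selectors such as `input[name=email]`. A test would then assert on the returned heading text, for example `assert run_steps(login_steps(url, email, password)).startswith("Welcome")`.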
Architecture: Sense → Think → Act
Vibium's roadmap follows a robotics control loop:
| Layer | Component | Status | Purpose |
|---|---|---|---|
| Act | Clicker | Shipped (V1) | Browser automation via BiDi |
| Sense | Retina | V2 planned | Chrome extension that observes everything |
| Think | Cortex | V2 planned | SQLite-backed memory + navigation planning |
The Five Actionability Checks
Before any interaction, Vibium verifies the target element server-side (the Go binary drives the checks by evaluating JavaScript in the page, so no client library has to reimplement them):
- Visible — Element has non-zero size, not `display:none` or `visibility:hidden`
- Stable — Position unchanged over 50ms (catches animations)
- ReceivesEvents — Not obscured by another element (`elementFromPoint` check)
- Enabled — Not `disabled`, `aria-disabled`, or inside a disabled `<fieldset>`
- Editable — (for `type` only) Accepts text input, not `readonly`
These run in a polling loop with 100ms intervals until all pass or timeout (default 30s).
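The polling behaviour can be sketched in a few lines. This is an illustration of the loop described above, not Vibium's Go implementation: `wait_until_actionable` is a hypothetical name, and the checks are passed in as plain callables rather than evaluated in a browser.

```python
import time


def wait_until_actionable(checks, timeout=30.0, interval=0.1,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until every check passes; raise TimeoutError naming the failures.

    `checks` maps a check name (e.g. "Visible") to a zero-argument callable
    returning True when that check currently passes.
    """
    deadline = clock() + timeout
    while True:
        failing = [name for name, check in checks.items() if not check()]
        if not failing:
            return
        if clock() >= deadline:
            raise TimeoutError(f"not actionable, failing checks: {failing}")
        sleep(interval)
```

All checks are re-run on every pass, so an element that becomes visible but then starts animating is still caught. Reporting which checks were failing at the deadline is what makes a timeout diagnosable in a test report.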
Daemon vs Oneshot
| Mode | How | Best For |
|---|---|---|
| Daemon (default) | Background process keeps browser alive | Interactive sessions, chaining commands |
| Oneshot | Fresh browser per command, torn down after | CI pipelines, isolated test runs |
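For daemon mode, the main risk in automated runs is a browser left behind after a failed test. A minimal sketch of a guard follows, using only the `daemon start` and `daemon stop` commands from the quick reference; `daemon_session` is a hypothetical helper and the exit-code convention is an assumption.

```python
import subprocess
from contextlib import contextmanager


def _run(argv, runner):
    """Run one CLI call and capture its output as text."""
    return runner(argv, capture_output=True, text=True)


@contextmanager
def daemon_session(runner=subprocess.run):
    """Start the daemon, yield, and always stop it, even if the body raises."""
    started = _run(["vibe-check", "daemon", "start"], runner)
    if started.returncode != 0:
        raise RuntimeError(f"daemon failed to start: {started.stderr.strip()}")
    try:
        yield
    finally:
        # Runs on success and on failure, so no browser process is left behind.
        _run(["vibe-check", "daemon", "stop"], runner)
```

In pytest this would typically become a session-scoped fixture, so that every test in a run shares one browser and the teardown happens exactly once.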
Key Files to Read
- `02-vibium-deep-dive/01-cli-command-reference.md` — All 22 commands with examples and edge cases
- `02-vibium-deep-dive/02-actionability-checks.md` — The five checks with actual JavaScript source code
- `02-vibium-deep-dive/03-daemon-architecture.md` — Process model, cleanup, zombie prevention
- `02-vibium-deep-dive/04-extension-commands.md` — How `vibium:find`, `vibium:click`, `vibium:type` work over BiDi
3. Skills vs MCP: When and Why
Subfolder: `03-skills-vs-mcp/`
The Core Trade-off
| Dimension | Skills (CLI) | MCP Server |
|---|---|---|
| Token cost | Dozens up front (only the name and description are listed; the ~200-500 line SKILL.md body loads on invocation) | Thousands (tool schemas + accessibility trees) |
| State management | Stateless between commands (daemon handles browser state) | Persistent connection with rich introspection |
| Setup | `npx skills add <repo>` | `claude mcp add <name> -- <command>` |
| How agent interacts | Bash tool executes CLI commands | Dedicated MCP tools (browser_click, etc.) |
| Error handling | Exit codes + stderr | Structured JSON error responses |
| Best for | High-throughput agents balancing many tasks | Exploratory automation, self-healing loops |
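The token-cost row is easy to sanity-check with rough arithmetic. The sketch below uses the common rule of thumb of about four characters per token, which is an approximation and not a real tokenizer; the function names are hypothetical, and the only measured input is the skill description quoted earlier in this guide.

```python
DESCRIPTION = (
    "Browser automation via CLI. Navigate pages, click elements, "
    "fill forms, take screenshots, extract text."
)


def estimate_tokens(text, chars_per_token=4.0):
    """Rough estimate: about four characters per token for English text."""
    return round(len(text) / chars_per_token)


def listing_cost(descriptions):
    """Up-front cost of advertising several skills by description only."""
    return sum(estimate_tokens(description) for description in descriptions)


def context_left(window, upfront, per_step, steps):
    """Tokens still free after the up-front cost plus `steps` interactions."""
    return window - upfront - per_step * steps
```

Running `estimate_tokens(DESCRIPTION)` puts the advertised cost of the skill in the tens of tokens. For a fair comparison, measure the real schema size of the MCP server you are evaluating and the real per-step output of each approach, then feed both into `context_left`.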
When to Use Skills (CLI Approach)
- Your agent is doing more than just browser work (writing code, running tests, reading files)
- You need to minimize context window consumption
- You want simple, composable commands that chain with other CLI tools
- You're in a CI/CD pipeline where token costs matter
- Playwright's own docs now acknowledge CLI+Skills is more token-efficient
When to Use MCP
- Exploratory testing where the agent needs rich page introspection
- Self-healing test flows that require iterative reasoning over DOM structure
- Long-running autonomous workflows with continuous browser context
- You need accessibility tree analysis for semantic understanding
Key Files to Read
- `03-skills-vs-mcp/01-architectural-comparison.md` — Deep side-by-side with diagrams
- `03-skills-vs-mcp/02-token-budget-analysis.md` — Real numbers: how many tokens each approach costs
- `03-skills-vs-mcp/03-hybrid-strategies.md` — Using both together in one framework
4. Building an AI Test Automation Framework
Subfolder: `04-building-test-framework/`
Framework Architecture Overview
┌─────────────────────────────────────────────────────┐
│ Test Runner │
│ (pytest / Jest / custom orchestrator) │
├─────────────────────────────────────────────────────┤
│ AI Agent Layer (Claude Code) │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ vibe-check │ │ Test Skills │ │ Reporting │ │
│ │ skill │ │ (custom) │ │ skill │ │
│ └──────┬──────┘ └──────┬───────┘ └─────┬──────┘ │
├─────────┼────────────────┼────────────────┼─────────┤
│ │ Bash Tool │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ Vibium CLI │ │ Test Utils │ │ Report Gen │ │
│ │ (vibe-check)│ │ (scripts) │ │ (scripts) │ │
│ └──────┬──────┘ └─────────────┘ └────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Chrome │ │
│ │ (BiDi) │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────┘
Core Design Principles
- Agent as orchestrator, not executor — The AI agent decides what to test and how to interact, the framework handles mechanics
- Natural language test definitions — Tests describe intent ("verify login works with valid credentials") not implementation
- Self-healing selectors — When a selector fails, the agent uses `vibe-check find-all` and `vibe-check text` to reason about alternatives
- Screenshot-driven debugging — Every failure produces a screenshot + page text for the agent to analyze
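The last two principles can be combined into one small helper. This is a simplified sketch: in a real run the agent would derive the alternative selectors by reasoning over `find-all` and `text` output, whereas here the candidates are supplied up front, and `click_with_fallbacks` is a hypothetical name.

```python
def click_with_fallbacks(selectors, click, capture_failure):
    """Try each selector in order and return the one that worked.

    `click(selector)` should raise on failure. `capture_failure(selector, error)`
    is called for every failed attempt so that evidence is kept; in practice it
    might save a screenshot and the page text for the agent to analyze.
    """
    errors = []
    for selector in selectors:
        try:
            click(selector)
            return selector
        except Exception as error:
            capture_failure(selector, error)
            errors.append(f"{selector}: {error}")
    raise LookupError("no selector worked: " + "; ".join(errors))
```

Returning the selector that succeeded lets the framework log the "healed" locator, so a human can later decide whether to update the test definition or fix the application.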
Key Files to Read
- `04-building-test-framework/01-architecture-decisions.md` — ADRs for the framework
- `04-building-test-framework/02-test-patterns.md` — Patterns for AI-driven tests (navigation, forms, assertions, data extraction)
- `04-building-test-framework/03-ci-cd-integration.md` — Running in GitHub Actions, handling headless mode, artifacts
- `04-building-test-framework/04-reporting-and-observability.md` — What to capture, how to present results
- `04-building-test-framework/05-self-healing-strategies.md` — How agents recover from broken selectors
5. WebDriver BiDi Protocol
Subfolder: `05-webdriver-bidi/`
Why This Matters for Your Interview
WebDriver BiDi is the W3C standard that Vibium is built on. Understanding it shows you know the layer below the tools — critical for architect-level conversations.
Evolution: WebDriver → CDP → BiDi
| Protocol | Year | Transport | Direction | Owned By |
|---|---|---|---|---|
| WebDriver | 2018 (W3C) | HTTP+JSON | Request/Response | W3C |
| CDP | 2017 | WebSocket | Bidirectional | Google |
| BiDi | 2021+ | WebSocket+JSON | Bidirectional | W3C |
BiDi combines the best of both: standardized like WebDriver, bidirectional like CDP, cross-browser by design.
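"Bidirectional" means the client both sends commands and receives unsolicited events on the same socket. The sketch below shows the message shapes as the W3C specification describes them (commands carry `id`, `method`, `params`; replies and events carry a `type` field); the helper names are hypothetical and the context id in any example call is a placeholder.

```python
import json


def navigate_command(command_id, context, url):
    """Serialise a browsingContext.navigate command."""
    return json.dumps({
        "id": command_id,
        "method": "browsingContext.navigate",
        "params": {"context": context, "url": url},
    })


def classify(raw):
    """Label an incoming BiDi message as 'success', 'error', or 'event'."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind in ("success", "error", "event"):
        return kind
    raise ValueError(f"unrecognised message type: {kind!r}")
```

A `success` or `error` message is matched to the command that caused it through its `id`, while an `event` such as `browsingContext.load` arrives without any request. The absence of events is the gap in classic WebDriver's request/response model.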
How Vibium Uses BiDi
Client (JS/Python/CLI)
│
▼ WebSocket
Clicker (Go binary) ← BiDi Proxy
│
▼ WebSocket
Chrome (BiDi endpoint)
Vibium's Go binary sits as a proxy between clients and Chrome. It:
- Routes standard BiDi commands directly to Chrome
- Intercepts custom `vibium:*` extension commands and handles them server-side
- Implements actionability checks by sending JavaScript evaluation commands to Chrome
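The routing rule in the list above fits in a few lines. Vibium's proxy is written in Go; this Python sketch only illustrates the dispatch decision, and the `Proxy` class, its handler table, and the `selector` parameter in the usage example are assumptions rather than Vibium's actual wire format.

```python
EXTENSION_PREFIX = "vibium:"


class Proxy:
    """Toy dispatcher: extension commands stay local, the rest go to Chrome."""

    def __init__(self, forward, handlers):
        self.forward = forward    # callable that sends a message on to Chrome
        self.handlers = handlers  # extension method name -> local handler

    def handle(self, message):
        method = message.get("method", "")
        if not method.startswith(EXTENSION_PREFIX):
            return self.forward(message)  # standard BiDi passes straight through
        handler = self.handlers.get(method)
        if handler is None:
            # Shaped like a BiDi error reply for an unrecognised command.
            return {"type": "error", "id": message.get("id"),
                    "error": "unknown command", "message": method}
        return handler(message)
```

A handler registered under `"vibium:click"` would run the actionability checks first and only then issue the standard BiDi input commands, which is how the extension layer stays a thin addition on top of the protocol.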
Key Files to Read
- `05-webdriver-bidi/01-protocol-overview.md` — Message format, sessions, commands vs events
- `05-webdriver-bidi/02-evolution-from-selenium.md` — Historical context from WebDriver to BiDi
- `05-webdriver-bidi/03-vibium-extension-commands.md` — How `vibium:find/click/type` extend the protocol
6. Interview Preparation
Subfolder: `06-interview-preparation/`
What Architects Will Ask About
- "Why not just use Playwright/Selenium directly?" — You need to articulate the agent-native advantage
- "How do you handle flaky tests with AI?" — Self-healing selectors, intelligent retries, screenshot analysis
- "What's your CI/CD strategy?" — Headless mode, oneshot mode for isolation, artifact collection, parallelization
- "How does this scale?" — Token costs, daemon pooling, test isolation
- "What about test maintenance?" — The key selling point: natural language tests + agent reasoning reduce maintenance burden by 60-85%
Key Talking Points (The "Three Levels" Framework)
When discussing your framework, structure answers at three levels:
Level 1 — The What (for PMs and non-technical stakeholders):
"We use AI agents that can drive a real browser just like a human would. They read pages, click buttons, fill forms, and verify results — but they do it through natural language instructions instead of brittle code."
Level 2 — The How (for senior engineers):
"The agent uses the `vibe-check` CLI skill, which gives it 22 browser commands via Bash. Commands auto-wait for elements to be actionable using five server-side checks. The browser runs as a daemon for speed or oneshot for isolation. Under the hood it's WebDriver BiDi over WebSocket to Chrome."
Level 3 — The Why (for architects):
"We chose CLI skills over MCP for browser control because of token economics — a SKILL.md costs dozens of tokens vs thousands for MCP tool schemas. The skill approach means our agent's context window stays available for reasoning about test logic, analyzing failures, and writing code. The BiDi protocol is W3C-standardized, avoiding CDP vendor lock-in. Actionability is implemented server-side in Go so it's consistent across all client languages."
Key Files to Read
- `06-interview-preparation/01-architect-qa-scenarios.md` — 20 questions with detailed answers
- `06-interview-preparation/02-framework-presentation.md` — How to present your framework in 5, 15, and 30 minutes
- `06-interview-preparation/03-buzzword-decoder.md` — What people actually mean by "agentic testing", "self-healing", "ReAct pattern"
7. Competitive Landscape
Subfolder: `07-competitive-landscape/`
The Current Map (2026)
| Tool | Approach | Best For | Limitation |
|---|---|---|---|
| Vibium | CLI skill + BiDi | AI agent integration, token efficiency | Young ecosystem, Chrome-only (V1) |
| Playwright MCP | MCP server + accessibility tree | Rich page understanding, exploratory testing | High token cost, context bloat |
| browser-use | Python library + vision models | Visual testing, complex UIs | Slow (vision API calls), expensive |
| agent-browser (Vercel) | Snapshot + Refs | Minimal context usage (93% reduction) | New, limited ecosystem |
| Selenium 4+ | WebDriver + BiDi support | Enterprise legacy integration | Heavy, complex setup |
| testRigor | NL-first commercial platform | Non-technical testers | Vendor lock-in, cost |
Key Files to Read
- `07-competitive-landscape/01-tool-comparison-matrix.md` — Detailed feature-by-feature comparison
- `07-competitive-landscape/02-when-to-use-what.md` — Decision framework for choosing the right tool
- `07-competitive-landscape/03-future-directions.md` — Where the industry is heading (AI locators, Cortex/Retina, video recording)
Quick Reference: The vibe-check Skill Commands
Navigation
| Command | Purpose |
|---|---|
| `vibe-check navigate <url>` | Go to a page |
| `vibe-check url` | Print current URL |
| `vibe-check title` | Print page title |
Reading Content
| Command | Purpose |
|---|---|
| `vibe-check text` | Get all page text |
| `vibe-check text "<selector>"` | Get text of a specific element |
| `vibe-check html` | Get page HTML |
| `vibe-check find "<selector>"` | Element info (tag, text, bounding box) |
| `vibe-check find-all "<selector>"` | All matching elements |
| `vibe-check eval "<js>"` | Run JavaScript and print result |
| `vibe-check screenshot -o file.png` | Capture screenshot |
Interaction
| Command | Purpose |
|---|---|
| `vibe-check click "<selector>"` | Click an element |
| `vibe-check type "<selector>" "<text>"` | Type into an input |
| `vibe-check hover "<selector>"` | Hover over an element |
| `vibe-check scroll [direction]` | Scroll page |
| `vibe-check keys "<combo>"` | Press keys (Enter, Ctrl+a, etc.) |
| `vibe-check select "<selector>" "<value>"` | Pick a dropdown option |
Waiting
| Command | Purpose |
|---|---|
| `vibe-check wait "<selector>"` | Wait for element (visible/hidden/attached) |
Tabs
| Command | Purpose |
|---|---|
| `vibe-check tabs` | List open tabs |
| `vibe-check tab-new [url]` | Open new tab |
| `vibe-check tab-switch <index\|url>` | Switch tab |
| `vibe-check tab-close [index]` | Close tab |
Daemon
| Command | Purpose |
|---|---|
| `vibe-check daemon start` | Start background browser |
| `vibe-check daemon status` | Check if running |
| `vibe-check daemon stop` | Stop daemon |
Reading Order Recommendation
For efficient preparation, read in this order:
- Start here: `01-foundations/01-skill-anatomy.md` — understand what you're working with
- Then: `03-skills-vs-mcp/01-architectural-comparison.md` — the key architectural decision
- Then: `02-vibium-deep-dive/01-cli-command-reference.md` — the actual tool
- Then: `02-vibium-deep-dive/02-actionability-checks.md` — the "how it works under the hood" that impresses architects
- Then: `04-building-test-framework/01-architecture-decisions.md` — your framework design
- Then: `05-webdriver-bidi/01-protocol-overview.md` — the standard beneath everything
- Then: `06-interview-preparation/01-architect-qa-scenarios.md` — practice answers
- Finally: `07-competitive-landscape/01-tool-comparison-matrix.md` — know the alternatives
Source Repository
All analysis based on: VibiumDev/vibium (Apache 2.0, 2.6k+ stars, v0.1.7)
Key primary sources:
- skills/vibe-check/SKILL.md — The skill definition
- docs/explanation/actionability.md — Actionability checks with source code
- docs/explanation/internals.md — Architecture internals
- docs/explanation/webdriver-bidi.md — BiDi protocol explanation
- V2-ROADMAP.md — Future direction
Additional sources: