# Token Budget Analysis: Real Numbers

## Context Window Sizes (2026)
| Model | Context Window | Effective Budget* |
|---|---|---|
| Claude Opus 4.6 | 200K tokens | ~150K usable |
| Claude Sonnet 4.5 | 200K tokens | ~150K usable |
| Claude Haiku 4.5 | 200K tokens | ~150K usable |
| GPT-4o | 128K tokens | ~100K usable |
| Gemini Pro | 1M+ tokens | ~750K usable |
*Effective budget accounts for system prompts, tool definitions, and overhead.
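The "effective budget" column is just the advertised window minus a reserve for that overhead. A quick sketch, where the ~25% reserve is this table's working assumption rather than a vendor-published figure:

```python
def effective_budget(window_tokens: int, reserve_fraction: float = 0.25) -> int:
    """Usable context after reserving a fraction of the window for the
    system prompt, tool schemas, and response headroom (~25% assumed)."""
    return int(window_tokens * (1 - reserve_fraction))

print(effective_budget(200_000))    # 150000 -> the "~150K usable" rows
print(effective_budget(128_000))    # 96000, i.e. roughly the ~100K figure
print(effective_budget(1_000_000))  # 750000 -> "~750K usable"
```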
## MCP Token Breakdown: A Real Playwright Session

### Tool Schema Cost (Per API Call)

Playwright MCP exposes these tools (actual schema sizes measured):
| Tool | Schema Tokens |
|---|---|
| browser_launch | ~180 |
| browser_navigate | ~220 |
| browser_screenshot | ~200 |
| browser_click | ~250 |
| browser_type | ~280 |
| browser_find | ~240 |
| browser_hover | ~200 |
| browser_select_option | ~260 |
| browser_wait_for_selector | ~250 |
| browser_evaluate | ~220 |
| browser_go_back | ~150 |
| browser_go_forward | ~150 |
| browser_close | ~150 |
| browser_get_text | ~200 |
| browser_get_attribute | ~220 |
| **Total per API call** | **~3,170** |
This is loaded into EVERY API request, even when the agent isn't doing browser work.
### Accessibility Tree Cost (Per Page Read)

A typical web page accessibility snapshot:
| Page Complexity | Elements | A11y Tree Tokens |
|---|---|---|
| Simple (landing page) | 20-50 | ~500-1,500 |
| Medium (form page) | 50-200 | ~1,500-5,000 |
| Complex (dashboard) | 200-500 | ~5,000-15,000 |
| Data-heavy (table) | 500-2000 | ~15,000-50,000+ |
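The ranges above work out to roughly 25-30 tokens per accessible element (role, name, state, and tree indentation all cost tokens). A rough estimator under that assumption; the per-element figure is back-derived from this table, not measured independently:

```python
def estimate_a11y_tokens(element_count: int, tokens_per_element: int = 28) -> int:
    """Rough accessibility-snapshot cost, assuming ~28 tokens/element
    (an inference from the ranges in the table above)."""
    return element_count * tokens_per_element

print(estimate_a11y_tokens(35))   # 980  -> inside the simple-page range
print(estimate_a11y_tokens(350))  # 9800 -> inside the dashboard range
```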
### A Realistic 20-Step Login Test via MCP
```
Step 1: browser_launch → 150 tokens (call) + 100 (response)
Step 2: browser_navigate → 200 tokens + 100
Step 3: [a11y tree loaded] → 3,000 tokens (login form)
Step 4: browser_type (email) → 180 tokens + 80
Step 5: browser_type (password) → 180 tokens + 80
Step 6: browser_click (submit) → 160 tokens + 80
Step 7: [a11y tree loaded] → 5,000 tokens (dashboard)
Step 8: browser_get_text (heading) → 150 tokens + 80
Step 9: browser_screenshot → 150 tokens + 100
Step 10: browser_navigate (profile) → 200 tokens + 100
Step 11: [a11y tree loaded] → 4,000 tokens (profile page)
Step 12: browser_get_text (name) → 150 tokens + 80
Step 13: browser_click (edit) → 160 tokens + 80
Step 14: [a11y tree loaded] → 4,500 tokens (edit form)
Step 15: browser_type (phone) → 180 tokens + 80
Step 16: browser_click (save) → 160 tokens + 80
Step 17: [a11y tree loaded] → 4,000 tokens (profile updated)
Step 18: browser_get_text (success) → 150 tokens + 80
Step 19: browser_screenshot → 150 tokens + 100
Step 20: browser_close → 120 tokens + 60

Tool schemas (loaded every turn): 3,170 × 20 = 63,400 tokens
A11y trees: 20,500 tokens
Tool calls + responses: ~3,700 tokens
────────────────────────────────────────────────────
TOTAL: ~87,600 tokens
```

That's ~58% of usable context on a simple login test.
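Re-adding the trace as a sketch; every input below is one of the estimates listed above, not a fresh measurement:

```python
# Per-tool schema sizes from the table, loaded on every request
schema_tokens = [180, 220, 200, 250, 280, 240, 200, 260,
                 250, 220, 150, 150, 150, 200, 220]
per_call_schema = sum(schema_tokens)              # per-request schema cost
turns = 20
a11y_trees = [3_000, 5_000, 4_000, 4_500, 4_000]  # five page snapshots
calls_and_responses = 3_700                       # sum of per-step call/response figures

total = per_call_schema * turns + sum(a11y_trees) + calls_and_responses
print(per_call_schema)               # 3170
print(total)                         # 87600
print(round(100 * total / 150_000))  # 58 -> percent of a ~150K usable window
```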
## Skill Token Breakdown: The Same Test via vibe-check

### Skill Loading Cost (Once)

| Component | Tokens |
|---|---|
| SKILL.md injection | ~1,000 |
| Skill description in tool list | ~50 per turn |

### The Same 20-Step Login Test via Skill
```
Step 0: Skill invoked (SKILL.md loaded) → 1,000 tokens (once)
Step 1: Bash("vibe-check daemon start") → 30 + 20 = 50
Step 2: Bash("vibe-check navigate https://...") → 40 + 20 = 60
Step 3: Bash("vibe-check type '#email' 'user'") → 45 + 20 = 65
Step 4: Bash("vibe-check type '#pass' 'secret'") → 45 + 20 = 65
Step 5: Bash("vibe-check click '#submit'") → 35 + 20 = 55
Step 6: Bash("vibe-check wait '.dashboard'") → 35 + 20 = 55
Step 7: Bash("vibe-check text 'h1'") → 30 + 30 = 60
Step 8: Bash("vibe-check screenshot -o s1.png") → 40 + 20 = 60
Step 9: Bash("vibe-check navigate .../profile") → 40 + 20 = 60
Step 10: Bash("vibe-check text '.name'") → 30 + 30 = 60
Step 11: Bash("vibe-check click '#edit'") → 35 + 20 = 55
Step 12: Bash("vibe-check type '#phone' '555'") → 40 + 20 = 60
Step 13: Bash("vibe-check click '#save'") → 35 + 20 = 55
Step 14: Bash("vibe-check wait '.success'") → 35 + 20 = 55
Step 15: Bash("vibe-check text '.success'") → 35 + 30 = 65
Step 16: Bash("vibe-check screenshot -o s2.png") → 40 + 20 = 60
Steps 17–20: verification + cleanup → ~200

Skill description (per turn): 50 × 20 = 1,000 tokens
Bash tool schema: ~200 per turn (shared across all tasks, not browser-specific; not counted here)
Skill initial load: 1,000 tokens
Commands + responses: ~1,200 tokens
────────────────────────────────────────────────────
TOTAL: ~3,200 tokens (browser-specific)
That's ~2% of usable context.
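The same tally for the skill path, again using only the estimated figures listed above:

```python
skill_load = 1_000      # SKILL.md injected once, on first invocation
description = 50 * 20   # ~50-token tool-list entry, present every turn
commands = 1_200        # sum of the per-step command/response figures

total = skill_load + description + commands
print(total)                            # 3200
print(round(100 * total / 150_000, 1))  # 2.1 -> percent of a ~150K usable window
```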
## Side-by-Side Summary
| Metric | MCP | Skill | Difference |
|---|---|---|---|
| 20-step test total | ~87,600 tokens | ~3,200 tokens | ~27x cheaper |
| Context consumed | ~58% | ~2% | 56 points more available |
| Context remaining | ~62K tokens | ~147K tokens | ~2.4x more headroom |
| Cost at $15/M tokens (input) | ~$1.31 | ~$0.05 | ~27x cheaper |
## What You Do with the Saved Context

Using skills instead of MCP leaves roughly 147K tokens of usable context free. That headroom can hold:
| Content | Approximate Tokens | What It Enables |
|---|---|---|
| 10 source files (~200 lines each) | ~30,000 | Agent reads and modifies code |
| Full test suite definition | ~10,000 | Agent understands all tests |
| Error analysis + debugging | ~20,000 | Agent reasons about failures |
| Conversation history | ~50,000 | Agent remembers earlier context |
| Total additional capacity | ~110,000 | A richer, more capable agent |
## When Token Cost Doesn't Matter
If you're using Gemini with a 1M+ context window, the token efficiency argument is much weaker. You can afford MCP's overhead and still have plenty of context.
However:
- API cost still matters (you pay per token)
- Latency scales with tokens (more tokens = slower responses)
- Quality can degrade with very long contexts (attention dilution)
So even with large windows, keeping things lean is still beneficial.
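To put the per-token pricing in CI terms, a sketch: the 500 runs/day volume is a hypothetical, and the per-run totals are rounded from this document's estimates:

```python
PRICE_PER_M_INPUT = 15.0  # $/1M input tokens, the rate assumed throughout

def run_cost_usd(tokens_per_run: int) -> float:
    """Input-token cost of one test run at the assumed rate."""
    return tokens_per_run * PRICE_PER_M_INPUT / 1_000_000

runs_per_day = 500  # illustrative CI volume
print(round(run_cost_usd(90_000) * runs_per_day, 2))  # 675.0 -> MCP path, $/day
print(round(run_cost_usd(3_200) * runs_per_day, 2))   # 24.0  -> skill path, $/day
```

At that volume the gap is hundreds of dollars per day, which is why the lean approach matters even when context itself is plentiful.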
## Interview Talking Point

"I've done the math on token costs. A 20-step test via MCP consumes about 88,000 tokens, roughly 58% of the ~150K usable context in a 200K window, primarily from tool schemas loaded every turn and from accessibility trees. The same test via a CLI skill costs about 3,200 tokens, around 2% of the window. That's a 27x reduction. More importantly, it means the agent has roughly 147K tokens available for reasoning about test logic, analyzing failures, and maintaining context across the session. At $15 per million input tokens, each test run costs about $1.31 via MCP versus $0.05 via skills. Across hundreds of CI runs per day, that's the difference between a viable approach and an unsustainable one."