# Claude Code Workflow: Spec-to-Suite

## Why Claude Code Excels at Test Generation
Claude Code operates as a CLI agent with a 200K-token context window and the ability to read files from your filesystem, execute commands, and iterate on results. This makes it uniquely powerful for test generation because it can:
- Read your actual codebase -- no copy-pasting needed
- Match existing test patterns -- it reads your test files and replicates the style
- Run the tests it generates -- it executes pytest/jest, reads failures, and fixes them
- Iterate autonomously -- a single prompt can produce, run, and fix a complete test suite
This is fundamentally different from using a chat interface where you paste snippets back and forth.
## The Three-Step Workflow

### Step 1: Feed Context and Generate
```bash
# Feed the spec and existing test as context, request a test suite
claude "Read the OpenAPI spec at ./docs/api-spec.yaml and the existing
test file at ./tests/test_users.py. Generate a comprehensive test suite
for the /api/v2/orders endpoint following the same patterns and style
as the users tests. Include error cases for all documented 4xx responses."
```
What happens under the hood:
- Claude reads `./docs/api-spec.yaml` (the specification artifact)
- Claude reads `./tests/test_users.py` (the style reference)
- Claude identifies the endpoint constraints, field types, auth requirements
- Claude generates a new test file following the existing patterns
Pro tips for Step 1:
- Point Claude at specific files rather than asking it to "find" things
- Mention the existing test file explicitly so it matches your team's style
- Be specific about what coverage you want ("all documented 4xx responses")
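A generated case from a prompt like this might look as follows. This is a minimal sketch: `create_order` is a hypothetical stand-in for the project's real client helper, and the status codes are illustrative rather than taken from any actual spec.

```python
# Hypothetical stand-in for the application call; a real generated suite
# would go through the project's existing test client fixture instead.
def create_order(payload, token="valid"):
    if token != "valid":
        return 401, {"error": "unauthorized"}
    if "sku" not in payload:
        return 422, {"error": "sku is required"}
    return 201, {"id": 1, **payload}

# One test per documented 4xx response, as the prompt requested.
def test_create_order_rejects_expired_token():
    status, body = create_order({"sku": "A-1"}, token="expired")
    assert status == 401

def test_create_order_rejects_missing_sku():
    status, body = create_order({})
    assert status == 422
    assert "sku" in body["error"]
```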
### Step 2: Run and Self-Heal
```bash
# Ask Claude to run the tests and fix test bugs (not application bugs)
claude "Run the tests you just generated. Fix any failures that are
due to test bugs (not application bugs). Do not change assertions
that are testing real behavior."
```
What happens under the hood:
- Claude executes `pytest tests/test_orders.py -v`
- It reads the failure output
- It distinguishes between:
  - Test bugs: import errors, wrong fixture names, missing setup
  - Application bugs: assertions that fail because the app has a real defect
- It fixes only test bugs and reports application bugs
The critical instruction: "Do not change assertions that are testing real behavior." Without this, Claude might weaken assertions to make tests pass, defeating the purpose.
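The triage rule can be made concrete. The sketch below is an illustrative heuristic over pytest's failure output, not Claude's actual logic; the marker strings are assumptions about what typical test bugs look like.

```python
# Illustrative markers of test bugs in pytest failure output.
TEST_BUG_MARKERS = (
    "ModuleNotFoundError",   # broken import in the test file
    "fixture",               # e.g. "fixture 'clinet' not found" (typo'd name)
    "IndentationError",      # malformed generated code
)

def triage(failure_text):
    """Classify a pytest failure as a test bug or a possible app bug."""
    if any(marker in failure_text for marker in TEST_BUG_MARKERS):
        return "test bug"
    # A plain AssertionError on real behavior is never "fixed" by editing
    # the expected value; it gets reported instead.
    return "possible application bug"
```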
### Step 3: Coverage Gap Analysis
```bash
# Run coverage and generate tests for uncovered lines
claude "Run pytest --cov=app.routes.orders --cov-report=term-missing
and tell me which lines in the orders route are not covered.
Generate additional tests to cover those lines."
```
What happens under the hood:
- Claude runs the coverage command
- It reads the `Missing` column to identify uncovered line ranges
- It reads those specific lines in the source file
- It generates additional tests targeting those lines specifically
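Reading the `Missing` column is mechanical enough to script. The sketch below parses a fabricated `term-missing` report (the file name and numbers are made up for illustration):

```python
import re

# Fabricated term-missing output for illustration.
SAMPLE_REPORT = """\
Name                    Stmts   Miss  Cover   Missing
-----------------------------------------------------
app/routes/orders.py       80      7    91%   45-48, 102, 110-111
"""

def uncovered_lines(report, filename):
    """Expand the Missing column ('45-48, 102') into concrete line numbers."""
    lines = []
    for row in report.splitlines():
        if not row.startswith(filename):
            continue
        missing = row.split("%", 1)[1]  # everything after the Cover column
        for span in re.findall(r"\d+(?:-\d+)?", missing):
            lo, _, hi = span.partition("-")
            lines.extend(range(int(lo), int(hi or lo) + 1))
    return lines
```

Each expanded line number then becomes a target for a new test.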
## Advanced Workflow: Multi-File Suite Generation
For larger features that span multiple files:
```bash
# Step 1: Survey the feature
claude "Read all files in app/services/payment/ and app/routes/payment.py.
List every public function and endpoint, noting which ones have existing
tests in tests/test_payment*.py and which ones do not."

# Step 2: Generate for uncovered functions
claude "For every function you identified as not having tests,
generate a test file. Follow the patterns in tests/test_payment_processing.py.
Organize by module: one test file per source file."

# Step 3: Run and iterate
claude "Run all payment tests: pytest tests/test_payment*.py -v.
Fix test bugs. Report any application bugs you find."

# Step 4: Integration tests
claude "Now generate integration tests that test the full payment flow:
create order -> process payment -> confirm order -> send receipt.
Use the test database and mock external APIs (Stripe, SendGrid)."
```
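The Step 4 prompt typically yields a test in which the external clients are replaced by mocks. A sketch, assuming a hypothetical `process_payment` service (the function and its signature are invented for illustration):

```python
from unittest import mock

# Hypothetical service function; a real suite would import it from
# app/services/payment/ rather than define it inline.
def process_payment(order_id, stripe_client, mailer):
    charge = stripe_client.charge(order_id=order_id, amount=1000)
    if charge["status"] != "succeeded":
        return "failed"
    mailer.send_receipt(order_id)
    return "confirmed"

def test_full_payment_flow_sends_receipt():
    stripe_client = mock.Mock()
    stripe_client.charge.return_value = {"status": "succeeded"}
    mailer = mock.Mock()

    assert process_payment(42, stripe_client, mailer) == "confirmed"
    mailer.send_receipt.assert_called_once_with(42)  # receipt really sent
```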
## Claude Code vs Chat-Based Generation
| Capability | Claude Code (CLI) | Chat Interface |
|---|---|---|
| Read your actual files | Yes (reads from filesystem) | No (you paste excerpts) |
| Match existing style | Yes (reads existing tests) | Partial (you paste examples) |
| Execute tests | Yes (runs pytest/jest) | No |
| Fix failures iteratively | Yes (read output, edit, re-run) | No (you copy errors back) |
| Multi-file generation | Yes (creates multiple files) | Awkward (one file at a time) |
| Token context | 200K (can hold entire spec + tests) | Varies (often smaller) |
## Template: The Complete Claude Code Session
Here is a complete session template you can adapt for any feature:
```bash
# 1. Reconnaissance
claude "Read app/routes/orders.py and docs/openapi.yaml (the /orders section).
Summarize: how many endpoints, what methods, what auth is required,
what fields are validated."
```
```bash
# 2. Style reference
claude "Read tests/test_users.py. Note the patterns:
- How fixtures are used
- How auth headers are set up
- The naming convention for test functions
- How parametrize is used
You will follow these patterns exactly."
```
```bash
# 3. Generation
claude "Generate a comprehensive test suite for the orders endpoints.
Cover:
- All documented success responses
- All documented error responses (4xx)
- Boundary values for quantity (1, 100, 0, 101, -1)
- Boundary values for price fields
- Auth: valid token, expired token, missing token, wrong role
- Idempotency: same request twice with same idempotency key
Save to tests/test_orders.py"
```
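The boundary values in that generation prompt map directly onto `pytest.mark.parametrize`. A sketch, with `validate_quantity` as a hypothetical stand-in for the real orders validator:

```python
import pytest

# Hypothetical validator standing in for the real orders route logic.
def validate_quantity(quantity):
    return 1 <= quantity <= 100

@pytest.mark.parametrize("quantity,valid", [
    (1, True), (100, True),    # smallest and largest legal values
    (0, False), (101, False),  # one step outside each boundary
    (-1, False),               # clearly invalid input
])
def test_quantity_boundaries(quantity, valid):
    assert validate_quantity(quantity) is valid
```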
```bash
# 4. Execution + fix
claude "Run pytest tests/test_orders.py -v --tb=long.
Fix any test bugs (import errors, fixture issues, setup problems).
Do NOT weaken assertions. Report any real application bugs separately."
```
```bash
# 5. Coverage analysis
claude "Run pytest tests/test_orders.py --cov=app.routes.orders
--cov-report=term-missing. Generate additional tests for uncovered lines.
Add them to the same file."
```
```bash
# 6. Flaky check (--count is provided by the pytest-repeat plugin)
claude "Run pytest tests/test_orders.py --count=3 -v.
If any test fails inconsistently, identify the cause and fix it.
Common causes: time dependency, random data, database state."
```
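Of those common causes, time dependency is the easiest to fix structurally: inject the clock instead of reading it. A sketch (the function names are illustrative):

```python
import datetime

# Flaky: compares against the real clock, so repeated runs can disagree
# (e.g. when an expiry boundary is crossed mid-session).
def is_expired_flaky(expiry):
    return expiry < datetime.datetime.now()

# Deterministic: the clock is injectable, so tests can pin it.
def is_expired(expiry, now=None):
    now = now or datetime.datetime.now()
    return expiry < now

def test_expired_token_is_rejected():
    fixed_now = datetime.datetime(2024, 1, 1, 12, 0)
    assert is_expired(datetime.datetime(2023, 12, 31), now=fixed_now)
    assert not is_expired(datetime.datetime(2024, 6, 1), now=fixed_now)
```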
```bash
# 7. Final review
claude "Review all tests in tests/test_orders.py. For each test, ask:
would this test fail if the feature was broken? Flag any tautology tests
(tests that pass regardless of implementation correctness)."
```
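The final review catches tests like the first one below, which re-derives its expected value from the code under test and therefore can never fail (`compute_total` is a trivial stand-in):

```python
# Trivial stand-in for the code under test.
def compute_total(prices):
    return sum(prices)

# Tautology: the expected value comes from the same function being
# tested, so this passes no matter what compute_total does.
def test_total_tautology():
    assert compute_total([2, 3]) == compute_total([2, 3])

# Real test: the expected value is derived independently, by hand.
def test_total_real():
    assert compute_total([2, 3]) == 5
```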
## Common Pitfalls with Claude Code

### Pitfall 1: Letting Claude Weaken Assertions
```bash
# BAD: Claude changes 403 to 200 to make the test pass
# This hides a real authorization bug

# GOOD: instruct Claude explicitly
claude "If a test expects 403 but gets 200, that is an APPLICATION BUG.
Report it but do not change the assertion."
```
### Pitfall 2: Not Specifying the Test Framework

Claude defaults to the most common framework for the language. If you use vitest instead of jest, or pytest-asyncio instead of plain pytest, specify it.

### Pitfall 3: Generating Everything in One Prompt

Very long prompts degrade output quality. Break large generation tasks into focused prompts of 20-30 tests each.

### Pitfall 4: Forgetting to Commit Between Steps

Claude Code edits files in place. If Step 3 goes wrong and corrupts the file, you lose Steps 1-2. Commit (or at least stash) between major steps.
## Metrics: Claude Code Test Generation
Based on practical usage patterns:
| Metric | Typical Value |
|---|---|
| Tests per hour (including review) | 40-60 |
| First-run pass rate | 70-85% |
| Tests needing fix after first run | 15-30% |
| Tests deleted during review | 5-15% |
| Coverage increase per session | 10-25 percentage points |
| Token cost per 50-test suite | $0.50-$2.00 |
## Key Takeaway
Claude Code's killer feature is the generate-run-fix loop. Unlike chat-based generation where you manually copy errors back, Claude Code reads the test output, diagnoses failures, and fixes them autonomously. This turns a multi-hour manual process into a 30-minute supervised session.