# Claude Code Workflow: Spec-to-Suite

## Why Claude Code Excels at Test Generation
Claude Code operates as a CLI agent with a 200K-token context window and the ability to read files from your filesystem, execute commands, and iterate on results. This makes it uniquely powerful for test generation because it can:
- Read your actual codebase -- no copy-pasting needed
- Match existing test patterns -- it reads your test files and replicates the style
- Run the tests it generates -- it executes pytest/jest, reads failures, and fixes them
- Iterate autonomously -- a single prompt can produce, run, and fix a complete test suite
This is fundamentally different from using a chat interface where you paste snippets back and forth.
## The Three-Step Workflow

### Step 1: Feed Context and Generate
```bash
# Feed the spec and existing test as context, request a test suite
claude "Read the OpenAPI spec at ./docs/api-spec.yaml and the existing
test file at ./tests/test_users.py. Generate a comprehensive test suite
for the /api/v2/orders endpoint following the same patterns and style
as the users tests. Include error cases for all documented 4xx responses."
```
What happens under the hood:
- Claude reads `./docs/api-spec.yaml` (the specification artifact)
- Claude reads `./tests/test_users.py` (the style reference)
- Claude identifies the endpoint constraints, field types, auth requirements
- Claude generates a new test file following the existing patterns
Pro tips for Step 1:
- Point Claude at specific files rather than asking it to "find" things
- Mention the existing test file explicitly so it matches your team's style
- Be specific about what coverage you want ("all documented 4xx responses")
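A generated case from a prompt like this might look as follows. This is a minimal sketch: `create_order` is a hypothetical stand-in for the project's real client helper, and the status codes are illustrative rather than taken from any actual spec.

```python
# Hypothetical stand-in for the application call; a real generated suite
# would go through the project's existing test client fixture instead.
def create_order(payload, token="valid"):
    if token != "valid":
        return 401, {"error": "unauthorized"}
    if "sku" not in payload:
        return 422, {"error": "sku is required"}
    return 201, {"id": 1, **payload}

# One test per documented 4xx response, as the prompt requested.
def test_create_order_rejects_expired_token():
    status, body = create_order({"sku": "A-1"}, token="expired")
    assert status == 401

def test_create_order_rejects_missing_sku():
    status, body = create_order({})
    assert status == 422
    assert "sku" in body["error"]
```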
### Step 2: Run and Self-Heal
```bash
# Ask Claude to run the tests and fix test bugs (not application bugs)
claude "Run the tests you just generated. Fix any failures that are
due to test bugs (not application bugs). Do not change assertions
that are testing real behavior."
```
What happens under the hood:
- Claude executes `pytest tests/test_orders.py -v`
- It reads the failure output
- It distinguishes between:
  - Test bugs: import errors, wrong fixture names, missing setup
  - Application bugs: assertions that fail because the app has a real defect
- It fixes only test bugs and reports application bugs
The critical instruction: "Do not change assertions that are testing real behavior." Without this, Claude might weaken assertions to make tests pass, defeating the purpose.
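The triage rule can be made concrete. The sketch below is an illustrative heuristic over pytest's failure output, not Claude's actual logic; the marker strings are assumptions about what typical test bugs look like.

```python
# Illustrative markers of test bugs in pytest failure output.
TEST_BUG_MARKERS = (
    "ModuleNotFoundError",   # broken import in the test file
    "fixture",               # e.g. "fixture 'clinet' not found" (typo'd name)
    "IndentationError",      # malformed generated code
)

def triage(failure_text):
    """Classify a pytest failure as a test bug or a possible app bug."""
    if any(marker in failure_text for marker in TEST_BUG_MARKERS):
        return "test bug"
    # A plain AssertionError on real behavior is never "fixed" by editing
    # the expected value; it gets reported instead.
    return "possible application bug"
```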
### Step 3: Coverage Gap Analysis
```bash
# Run coverage and generate tests for uncovered lines
claude "Run pytest --cov=app.routes.orders --cov-report=term-missing
and tell me which lines in the orders route are not covered.
Generate additional tests to cover those lines."
```
What happens under the hood:
- Claude runs the coverage command
- It reads the `Missing` column to identify uncovered line ranges
- It reads those specific lines in the source file
- It generates additional tests targeting those lines specifically
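Reading the `Missing` column is mechanical enough to script. The sketch below parses a fabricated `term-missing` report (the file name and numbers are made up for illustration):

```python
import re

# Fabricated term-missing output for illustration.
SAMPLE_REPORT = """\
Name                    Stmts   Miss  Cover   Missing
-----------------------------------------------------
app/routes/orders.py       80      7    91%   45-48, 102, 110-111
"""

def uncovered_lines(report, filename):
    """Expand the Missing column ('45-48, 102') into concrete line numbers."""
    lines = []
    for row in report.splitlines():
        if not row.startswith(filename):
            continue
        missing = row.split("%", 1)[1]  # everything after the Cover column
        for span in re.findall(r"\d+(?:-\d+)?", missing):
            lo, _, hi = span.partition("-")
            lines.extend(range(int(lo), int(hi or lo) + 1))
    return lines
```

Each expanded line number then becomes a target for a new test.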
## Advanced Workflow: Multi-File Suite Generation
For larger features that span multiple files:
```bash
# Step 1: Survey the feature
claude "Read all files in app/services/payment/ and app/routes/payment.py.
List every public function and endpoint, noting which ones have existing
tests in tests/test_payment*.py and which ones do not."

# Step 2: Generate for uncovered functions
claude "For every function you identified as not having tests,
generate a test file. Follow the patterns in tests/test_payment_processing.py.
Organize by module: one test file per source file."

# Step 3: Run and iterate
claude "Run all payment tests: pytest tests/test_payment*.py -v.
Fix test bugs. Report any application bugs you find."

# Step 4: Integration tests
claude "Now generate integration tests that test the full payment flow:
create order -> process payment -> confirm order -> send receipt.
Use the test database and mock external APIs (Stripe, SendGrid)."
```
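The Step 4 prompt typically yields a test in which the external clients are replaced by mocks. A sketch, assuming a hypothetical `process_payment` service (the function and its signature are invented for illustration):

```python
from unittest import mock

# Hypothetical service function; a real suite would import it from
# app/services/payment/ rather than define it inline.
def process_payment(order_id, stripe_client, mailer):
    charge = stripe_client.charge(order_id=order_id, amount=1000)
    if charge["status"] != "succeeded":
        return "failed"
    mailer.send_receipt(order_id)
    return "confirmed"

def test_full_payment_flow_sends_receipt():
    stripe_client = mock.Mock()
    stripe_client.charge.return_value = {"status": "succeeded"}
    mailer = mock.Mock()

    assert process_payment(42, stripe_client, mailer) == "confirmed"
    mailer.send_receipt.assert_called_once_with(42)  # receipt really sent
```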
## Claude Code vs Chat-Based Generation
| Capability | Claude Code (CLI) | Chat Interface |
|---|---|---|
| Read your actual files | Yes (reads from filesystem) | No (you paste excerpts) |
| Match existing style | Yes (reads existing tests) | Partial (you paste examples) |
| Execute tests | Yes (runs pytest/jest) | No |
| Fix failures iteratively | Yes (read output, edit, re-run) | No (you copy errors back) |
| Multi-file generation | Yes (creates multiple files) | Awkward (one file at a time) |
| Token context | 200K (can hold entire spec + tests) | Varies (often smaller) |
## Template: The Complete Claude Code Session
Here is a complete session template you can adapt for any feature:
```bash
# 1. Reconnaissance
claude "Read app/routes/orders.py and docs/openapi.yaml (the /orders section).
Summarize: how many endpoints, what methods, what auth is required,
what fields are validated."
```
```bash
# 2. Style reference
claude "Read tests/test_users.py. Note the patterns:
- How fixtures are used
- How auth headers are set up
- The naming convention for test functions
- How parametrize is used
You will follow these patterns exactly."
```
```bash
# 3. Generation
claude "Generate a comprehensive test suite for the orders endpoints.
Cover:
- All documented success responses
- All documented error responses (4xx)
- Boundary values for quantity (1, 100, 0, 101, -1)
- Boundary values for price fields
- Auth: valid token, expired token, missing token, wrong role
- Idempotency: same request twice with same idempotency key
Save to tests/test_orders.py"
```
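The boundary values in that generation prompt map directly onto `pytest.mark.parametrize`. A sketch, with `validate_quantity` as a hypothetical stand-in for the real orders validator:

```python
import pytest

# Hypothetical validator standing in for the real orders route logic.
def validate_quantity(quantity):
    return 1 <= quantity <= 100

@pytest.mark.parametrize("quantity,valid", [
    (1, True), (100, True),    # smallest and largest legal values
    (0, False), (101, False),  # one step outside each boundary
    (-1, False),               # clearly invalid input
])
def test_quantity_boundaries(quantity, valid):
    assert validate_quantity(quantity) is valid
```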
```bash
# 4. Execution + fix
claude "Run pytest tests/test_orders.py -v --tb=long.
Fix any test bugs (import errors, fixture issues, setup problems).
Do NOT weaken assertions. Report any real application bugs separately."
```
```bash
# 5. Coverage analysis
claude "Run pytest tests/test_orders.py --cov=app.routes.orders
--cov-report=term-missing. Generate additional tests for uncovered lines.
Add them to the same file."
```
```bash
# 6. Flaky check (--count is provided by the pytest-repeat plugin)
claude "Run pytest tests/test_orders.py --count=3 -v.
If any test fails inconsistently, identify the cause and fix it.
Common causes: time dependency, random data, database state."
```
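Of those common causes, time dependency is the easiest to fix structurally: inject the clock instead of reading it. A sketch (the function names are illustrative):

```python
import datetime

# Flaky: compares against the real clock, so repeated runs can disagree
# (e.g. when an expiry boundary is crossed mid-session).
def is_expired_flaky(expiry):
    return expiry < datetime.datetime.now()

# Deterministic: the clock is injectable, so tests can pin it.
def is_expired(expiry, now=None):
    now = now or datetime.datetime.now()
    return expiry < now

def test_expired_token_is_rejected():
    fixed_now = datetime.datetime(2024, 1, 1, 12, 0)
    assert is_expired(datetime.datetime(2023, 12, 31), now=fixed_now)
    assert not is_expired(datetime.datetime(2024, 6, 1), now=fixed_now)
```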
```bash
# 7. Final review
claude "Review all tests in tests/test_orders.py. For each test, ask:
would this test fail if the feature was broken? Flag any tautology tests
(tests that pass regardless of implementation correctness)."
```
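The final review catches tests like the first one below, which re-derives its expected value from the code under test and therefore can never fail (`compute_total` is a trivial stand-in):

```python
# Trivial stand-in for the code under test.
def compute_total(prices):
    return sum(prices)

# Tautology: the expected value comes from the same function being
# tested, so this passes no matter what compute_total does.
def test_total_tautology():
    assert compute_total([2, 3]) == compute_total([2, 3])

# Real test: the expected value is derived independently, by hand.
def test_total_real():
    assert compute_total([2, 3]) == 5
```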
## Common Pitfalls with Claude Code

### Pitfall 1: Letting Claude Weaken Assertions
```bash
# BAD: Claude changes 403 to 200 to make the test pass
# This hides a real authorization bug

# GOOD: instruct Claude explicitly
claude "If a test expects 403 but gets 200, that is an APPLICATION BUG.
Report it but do not change the assertion."
```
### Pitfall 2: Not Specifying the Test Framework

Claude defaults to the most common framework for the language. If you use vitest instead of jest, or pytest-asyncio instead of plain pytest, specify it.

### Pitfall 3: Generating Everything in One Prompt

Very long prompts degrade output quality. Break large generation tasks into focused prompts of 20-30 tests each.

### Pitfall 4: Forgetting to Commit Between Steps

Claude Code edits files in place. If Step 3 goes wrong and corrupts the file, you lose Steps 1-2. Commit (or at least stash) between major steps.
## Metrics: Claude Code Test Generation
Based on practical usage patterns:
| Metric | Typical Value |
|---|---|
| Tests per hour (including review) | 40-60 |
| First-run pass rate | 70-85% |
| Tests needing fix after first run | 15-30% |
| Tests deleted during review | 5-15% |
| Coverage increase per session | 10-25 percentage points |
| Token cost per 50-test suite | $0.50-$2.00 |
## Key Takeaway
Claude Code's killer feature is the generate-run-fix loop. Unlike chat-based generation where you manually copy errors back, Claude Code reads the test output, diagnoses failures, and fixes them autonomously. This turns a multi-hour manual process into a 30-minute supervised session.