QA Engineer Skills 2026

Write vs Generate: The Decision Matrix

The Central Question

For every testing task, you face a decision: should I write these tests from scratch, or should I generate them with AI and curate the output? Neither answer is always correct. This file provides a framework for making the right choice consistently.


The Decision Matrix

| Scenario | Write from Scratch | Generate + Curate | Why |
|---|---|---|---|
| Novel business logic with no existing patterns | Yes | No | AI has no reference for your unique domain rules |
| Standard CRUD endpoint tests | No | Yes | CRUD tests follow universal patterns AI knows well |
| Complex stateful workflows (e.g., payment flows) | Hybrid | Hybrid | Write the skeleton manually, AI fills in permutations |
| Data-driven tests (100+ input combinations) | No | Yes | AI excels at systematic enumeration |
| Tests requiring deep domain knowledge (financial calcs) | Write the logic | AI generates permutations | You define the rules, AI explores the space |
| Regression tests for a known bug | Yes | No | You need exact reproduction of the specific bug |
| Cross-browser/cross-device matrix | No | Yes | Combinatorial generation is AI's sweet spot |
| Security-critical tests (auth, encryption) | Hybrid | Hybrid | Write the security assertions, AI generates attack vectors |
| Flaky test debugging | Yes | No | Requires understanding of non-determinism causes |
| Performance/load test scenarios | Hybrid | Hybrid | AI generates scenario variations, you tune thresholds |

Reading the Matrix

  • Write from scratch when the test requires deep understanding of a specific problem that AI lacks context for.
  • Generate + curate when the test follows a known pattern and the main value is in systematic coverage.
  • Hybrid when the test structure requires human judgment but the details benefit from AI's breadth.
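The matrix can be encoded as a small lookup helper, useful as a team convention or a checklist in a PR template. This is a minimal sketch: the scenario keys below are illustrative labels we chose, not a standard taxonomy, and the safe default of "write" when a scenario is unrecognized is our own assumption.

```python
# Sketch of the decision matrix as a lookup table.
# Scenario keys are illustrative labels, not an official taxonomy.
DECISION_MATRIX = {
    "novel_business_logic": "write",
    "standard_crud": "generate",
    "stateful_workflow": "hybrid",
    "data_driven": "generate",
    "deep_domain_knowledge": "hybrid",
    "regression_for_known_bug": "write",
    "cross_browser_matrix": "generate",
    "security_critical": "hybrid",
    "flaky_test_debugging": "write",
    "performance_load": "hybrid",
}

def recommend(scenario: str) -> str:
    """Return 'write', 'generate', or 'hybrid' for a known scenario.
    Defaults to 'write' when unsure -- the conservative choice."""
    return DECISION_MATRIX.get(scenario, "write")
```

Defaulting to "write" for unknown scenarios mirrors the matrix's logic: if a task does not obviously fit a known pattern, it probably requires the judgment that generation lacks.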

The Curation Workflow

When you choose "Generate + Curate," follow this five-step process:

Step 1: GENERATE (5 minutes)

Feed context (spec, existing tests, constraints) and request specific coverage targets.

claude "Read the OpenAPI spec at docs/api.yaml for the /products endpoint.
Generate 30 tests using pytest+httpx covering:
- All documented status codes
- Boundary values for price (min: 0.01, max: 99999.99)
- All enum values for category
- Auth scenarios (valid, expired, missing, wrong role)
Save to tests/test_products_generated.py"

Result: 20-40 tests generated in under 5 minutes.

Step 2: TRIAGE (5 minutes)

Scan test names. Do they make sense? Look for hallucinated APIs or nonsense assertions. Delete obviously wrong tests immediately.

# Quick triage checklist:
# [ ] Test names are descriptive and follow convention
# [ ] No duplicate scenarios (different names, same test)
# [ ] All referenced fixtures exist
# [ ] All API endpoints are real (grep the route file)

# Typical triage result: delete 10-20% of generated tests

Step 3: VALIDATE (15 minutes)

Run the test suite. Fix import errors and missing fixtures. Identify tests that pass but test nothing useful (tautologies).

# Run and observe
pytest tests/test_products_generated.py -v --tb=short

# Common fixes needed:
# - Import paths (AI guesses, not always correctly)
# - Fixture names (AI may use generic names like "client" instead of your "api_client")
# - Base URL configuration
# - Auth helper function signatures

Step 4: STRENGTHEN (10 minutes)

Add missing negative cases. Harden brittle assertions. Add parametrize decorators for data-driven tests.

# BEFORE: AI generated separate tests for each enum value
def test_category_electronics(self): ...
def test_category_clothing(self): ...
def test_category_food(self): ...

# AFTER: Consolidate into parametrized test
@pytest.mark.parametrize("category", ["electronics", "clothing", "food", "other"])
def test_valid_category_accepted(self, client, auth, category):
    response = client.post("/api/products", json={
        "name": "Test", "price": 10.00, "category": category
    }, headers=auth)
    assert response.status_code == 201
    assert response.json()["category"] == category

Step 5: INTEGRATE (5 minutes)

Ensure consistent naming with existing suite. Move into proper test directories. Add to CI configuration.

# Rename to follow convention
mv tests/test_products_generated.py tests/test_products_api.py

# Verify it runs with the full suite
pytest tests/ -v --tb=short

# Verify coverage did not decrease
pytest tests/ --cov=app --cov-report=term-missing

Total time: ~40 minutes for 25-30 production-quality tests. Without AI: ~5-6 hours for the same coverage.


When Hybrid Is the Right Choice

The hybrid approach combines human-written test structure with AI-generated permutations. This is ideal for complex scenarios.

Example: Payment Flow Testing

# HUMAN-WRITTEN: the test structure and key assertions
import pytest

# Hypothetical import path -- adjust to wherever your Stripe error
# wrapper actually lives.
from app.payments import StripeError

class TestPaymentFlow:
    """Integration tests for the full payment flow."""

    def _create_and_process_order(self, client, payment_method, amount):
        """Helper: create order, process payment, return result."""
        order = client.post("/api/orders", json={
            "items": [{"product_id": "prod-1", "quantity": 1}],
            "payment_method": payment_method,
            "amount": amount,
        })
        assert order.status_code == 201
        order_id = order.json()["id"]

        payment = client.post(f"/api/orders/{order_id}/pay")
        return payment

    def test_successful_credit_card_payment(self, client, mock_stripe):
        """Verify complete happy path for credit card payment."""
        mock_stripe.charge.return_value = {"status": "succeeded", "id": "ch_123"}

        result = self._create_and_process_order(client, "credit_card", 99.99)

        assert result.status_code == 200
        assert result.json()["payment_status"] == "completed"
        assert result.json()["stripe_charge_id"] == "ch_123"
        mock_stripe.charge.assert_called_once()

    # AI-GENERATED: permutations of failure modes
    # (generated with: "Given the test structure above, generate tests for
    #  every Stripe error: card_declined, expired_card, incorrect_cvc,
    #  processing_error, rate_limit. Follow the same pattern.")

    @pytest.mark.parametrize("stripe_error,expected_status,expected_message", [
        ("card_declined", "failed", "Your card was declined"),
        ("expired_card", "failed", "Your card has expired"),
        ("incorrect_cvc", "failed", "Incorrect CVC"),
        ("processing_error", "pending", "Payment is being processed"),
        ("rate_limit", "pending", "Please try again in a moment"),
    ])
    def test_stripe_error_handling(self, client, mock_stripe,
                                    stripe_error, expected_status, expected_message):
        mock_stripe.charge.side_effect = StripeError(stripe_error)

        result = self._create_and_process_order(client, "credit_card", 99.99)

        assert result.json()["payment_status"] == expected_status
        assert expected_message in result.json()["error_message"]

The human writes the complex test infrastructure (helpers, mocking setup, key assertions). The AI generates the systematic permutations (every error code, every payment method, every amount range).


The Economics of Write vs Generate

| Factor | Write from Scratch | Generate + Curate |
|---|---|---|
| Time for 50 tests | 4-6 hours | 30-45 minutes |
| Initial quality | Higher (80-90%) | Lower (65-75%) |
| Post-curation quality | Same | Same (after 30 min review) |
| Coverage breadth | Lower (human blind spots) | Higher (systematic) |
| Maintenance cost | Lower (tighter code) | Slightly higher (AI verbosity) |
| Consistency | Varies by author | High (same prompt = same style) |
| Morale | Lower (tedious for CRUD) | Higher (humans focus on interesting cases) |

Red Flags: When to Stop Generating and Start Writing

Switch from "generate + curate" to "write from scratch" when:

  1. You have deleted more than 40% of generated tests. The prompt is wrong or the domain is too specialized for AI.
  2. You spend more time fixing AI tests than it would take to write them. Break-even is typically at 30 minutes of curation for 30 tests.
  3. The tests require understanding race conditions or timing. AI consistently fails at concurrency testing.
  4. The domain has no public documentation. AI cannot generate tests for proprietary protocols or internal APIs with no spec.
  5. You need exact reproduction of a specific bug. Regression tests must replicate the precise conditions. AI generates generic scenarios.
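Red flags 1 and 2 are quantifiable, so they can be checked mechanically after each curation session. A minimal sketch, assuming a hand-writing rate of roughly 8 minutes per test (an illustrative figure derived from the economics table above, not a benchmark):

```python
def should_switch_to_writing(generated: int, deleted: int,
                             curation_minutes: float,
                             minutes_to_write_one: float = 8.0) -> bool:
    """Encode red flags 1 and 2: return True when generation has
    stopped paying for itself and you should write from scratch."""
    if generated == 0:
        return True
    kept = generated - deleted
    deletion_rate = deleted / generated
    too_many_deleted = deletion_rate > 0.40                 # red flag 1
    slower_than_writing = (                                  # red flag 2
        kept == 0 or curation_minutes > kept * minutes_to_write_one
    )
    return too_many_deleted or slower_than_writing
```

Track these two numbers per generation session; a run of True results is the signal to change approach for that area of the codebase.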

Key Takeaway

The write-vs-generate decision is not about AI capability -- it is about where your expertise adds the most value. Use AI for systematic, pattern-based testing (CRUD, validation, enum coverage, boundary values). Write manually for domain-specific, judgment-heavy testing (business rules, security, concurrency, bug reproduction). The hybrid approach -- human structure with AI permutations -- combines the best of both worlds.