The 10x Review Pattern
The Core Principle
AI generates tests fast. You must review them carefully. The 10x Review pattern means spending 10% of the time generating and 90% reviewing. This is not a flaw of AI -- it is the correct allocation of human expertise.
Think of it this way: a prolific junior engineer can write 50 tests in an afternoon. A senior engineer's value is not in writing those 50 tests -- it is in identifying which 5 are tautologies, which 10 have brittle assertions, and which 3 are testing the mock instead of the code. AI is that prolific junior engineer. You are the senior reviewer.
The Quality Evaluation Checklist
Score each AI-generated test on these six dimensions. A test that fails any dimension should be revised or deleted.
- [ ] **CORRECTNESS**
  - [ ] Assertion is actually checking the right thing
  - [ ] Expected value matches the specification (not an AI hallucination)
  - [ ] Test would fail if the feature broke (mutation-test it mentally)
- [ ] **INDEPENDENCE**
  - [ ] Test does not depend on execution order
  - [ ] Test does not share mutable state with other tests
  - [ ] Setup/teardown is self-contained
- [ ] **DETERMINISM**
  - [ ] No dependency on current time/date without mocking
  - [ ] No dependency on random data without seeding
  - [ ] No dependency on external services without mocking
- [ ] **READABILITY**
  - [ ] Test name describes the scenario, not the implementation
  - [ ] Arrange/Act/Assert sections are clearly separated
  - [ ] No "magic numbers" without explanation
- [ ] **COVERAGE VALUE**
  - [ ] This test covers a scenario not already covered by another test
  - [ ] This test would catch a realistic bug
  - [ ] This test is not just asserting that "code runs without error"
- [ ] **MAINTAINABILITY**
  - [ ] Uses page objects/fixtures/helpers, not raw selectors everywhere
  - [ ] Assertion messages are descriptive
  - [ ] Test is under 30 lines (not a novella)
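To make the DETERMINISM dimension concrete, here is a minimal sketch of a test that controls both the clock and the random source. The function `make_invoice_id` is hypothetical, invented for illustration; the point is that its dependencies are injected so the test can pin them instead of relying on "now" and global randomness:

```python
import random
from datetime import datetime, timezone

# Hypothetical function under test: the RNG and clock are injected,
# so callers (including tests) control every source of variation.
def make_invoice_id(rng, now):
    return f"INV-{now:%Y%m%d}-{rng.randint(1000, 9999)}"

def test_should_produce_same_id_when_clock_and_seed_are_fixed():
    # Arrange: pin the clock and seed the RNG
    fixed_now = datetime(2024, 1, 15, tzinfo=timezone.utc)

    # Act: two runs with identical, controlled inputs
    first = make_invoice_id(random.Random(42), fixed_now)
    second = make_invoice_id(random.Random(42), fixed_now)

    # Assert: deterministic inputs give a deterministic result
    assert first == second
    assert first.startswith("INV-20240115-")
```

A test that called `datetime.now()` or the global `random` module directly would pass or fail depending on when it ran, which is exactly what this checklist item is meant to catch.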
How to Apply the Checklist Efficiently
Do not check every dimension for every test sequentially. Instead, use a three-pass approach:
**Pass 1: Scan (2 minutes for 30 tests).** Read only the test names. Do they make sense? Do they follow the naming convention? Are there obvious duplicates? This catches about 20% of issues immediately.
```python
# GOOD test names -- clear scenario and condition
def test_should_create_order_when_valid_payload(): ...
def test_should_reject_order_when_quantity_exceeds_maximum(): ...
def test_should_return_404_when_product_not_found(): ...

# BAD test names -- vague or implementation-focused
def test_order_creation(): ...        # What about order creation?
def test_post_request(): ...          # What post request?
def test_validates_correctly(): ...   # Validates what correctly?
```
**Pass 2: Assertions (5 minutes for 30 tests).** Read only the assertion lines. Are they checking the right thing? Are they specific enough? This catches tautologies and weak assertions.
```python
# WEAK: only checks the status code
assert response.status_code == 200

# STRONG: checks status code AND response content
assert response.status_code == 200
body = response.json()
assert body["name"] == "Widget"
assert body["price"] == 29.99
assert "id" in body
```
**Pass 3: Deep review (10 minutes for 30 tests).** For tests that passed the first two passes, check independence, determinism, and setup correctness. This is where you catch the subtle bugs.
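A typical independence bug that Pass 3 catches looks like this sketch (the `_cart`/`add_item` names are invented for illustration): two tests silently share module-level mutable state, so they only pass in one specific order.

```python
# ANTI-PATTERN: module-level mutable state shared across tests.
_cart = []

def add_item(cart, item):
    cart.append(item)
    return cart

def test_add_first_item():
    add_item(_cart, "widget")
    assert len(_cart) == 1    # passes only when this test runs first

def test_add_second_item():
    add_item(_cart, "gadget")
    assert len(_cart) == 2    # breaks when run alone or in another order

# FIX: each test builds its own cart, so execution order no longer matters.
def test_add_item_appends_to_fresh_cart():
    cart = add_item([], "widget")
    assert cart == ["widget"]
```

Tests like the first two pass under a full sequential run, which is why they survive Passes 1 and 2 and only surface in a deliberate deep review (or when the suite is run in random order).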
The Mental Mutation Test
For every assertion, ask yourself: "If I removed or inverted this assertion, would a real bug go undetected?"
This is the fastest way to evaluate whether a test has actual value.
```python
# Test with LOW mutation score
# (renamed test_get_user_weak: two defs named test_get_user in one
# module would silently shadow each other)
def test_get_user_weak():
    response = client.get("/api/users/1")
    assert response.status_code == 200
    # If the user's name field is empty, this test still passes.
    # If the user's email is wrong, this test still passes.
    # This test only verifies the endpoint exists and returns 200.

# Same test with HIGH mutation score
def test_get_user_strong():
    response = client.get("/api/users/1")
    assert response.status_code == 200
    user = response.json()
    assert user["name"] == "Alice"          # Catches name corruption
    assert user["email"] == "alice@x.com"   # Catches email corruption
    assert user["role"] == "admin"          # Catches role assignment bug
    assert user["created_at"] is not None   # Catches missing timestamp
```
Each additional assertion expands the set of mutations the test can catch. Aim for tests where removing any single assertion would leave a realistic bug undetected.
The Review Workflow in Practice
1. **RECEIVE** AI output (20-40 tests, ~5 minutes of generation time)
2. **PASS 1 -- SCAN** test names (2 minutes)
   - Delete tests with vague names
   - Flag duplicate scenarios
   - Count: typically discard 10-15% here
3. **PASS 2 -- CHECK** assertions (5 minutes)
   - Verify expected values against the spec
   - Flag tautology tests (testing the mock, not the code)
   - Flag weak assertions (status-code-only)
   - Count: typically flag 15-25% for revision
4. **PASS 3 -- DEEP REVIEW** (10 minutes)
   - Check independence (no shared state)
   - Check determinism (no time/random dependencies)
   - Check setup/teardown completeness
   - Verify all referenced APIs exist in the codebase
   - Count: typically catch 5-10% more issues
5. **REVISE** flagged tests (5-10 minutes)
   - Fix assertions
   - Add missing setup/teardown
   - Strengthen weak assertions
   - Delete beyond-repair tests
6. **RUN** the suite (2 minutes)
   - Fix import errors
   - Fix missing fixtures
   - Verify all tests pass

Total: ~30 minutes for a 30-test suite.
Without AI: ~4-6 hours to write the same 30 tests from scratch.
Quantifying Review Quality
Track these metrics over time to measure your review effectiveness:
| Metric | Target | How to Measure |
|---|---|---|
| Tests deleted during review | 10-20% | Count of deleted / total generated |
| Tests revised during review | 20-30% | Count of revised / total generated |
| Post-review test failures | < 5% | Tests that fail after review due to test bugs |
| Tautology rate | < 5% | Tests that pass even when the feature is broken |
| Mutation score | > 70% | Run mutmut/mutpy and measure surviving mutants |
If your deletion rate is consistently above 30%, your prompts need improvement. If it is consistently below 10%, you might be accepting too many weak tests.
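The first three ratios in the table are simple to compute per review session. Here is a minimal sketch (the function name and the hard-coded target bands are illustrative, taken from the table above, not part of any tool):

```python
def review_metrics(generated, deleted, revised, tautologies):
    """Review-quality ratios, each as a fraction of tests generated."""
    return {
        "deletion_rate": deleted / generated,
        "revision_rate": revised / generated,
        "tautology_rate": tautologies / generated,
    }

# A session within target: 30 generated, 4 deleted, 7 revised, 1 tautology
metrics = review_metrics(generated=30, deleted=4, revised=7, tautologies=1)
assert 0.10 <= metrics["deletion_rate"] <= 0.20   # target: 10-20%
assert 0.20 <= metrics["revision_rate"] <= 0.30   # target: 20-30%
assert metrics["tautology_rate"] < 0.05           # target: < 5%
```

Logging these three numbers after each review session is enough to spot the drift described above: a deletion rate creeping past 30% points at the prompts, one stuck below 10% points at the review.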
Scaling Review Across a Team
When the entire team uses AI test generation, you need a shared review standard:
- **Codify the checklist in your PR template.** Add a "Test Review" section with the six dimensions.
- **Pair review sessions.** Have two engineers review AI-generated tests together for the first few sprints. This calibrates everyone's expectations.
- **Track metrics.** A shared dashboard showing deletion rates, revision rates, and mutation scores prevents quality drift.
- **Template improvement loop.** When you consistently delete the same type of test (e.g., tautologies), fix the prompt template to prevent them.
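The PR-template point above could look like this fragment (section title and wording are a sketch, not a standard, condensing the six checklist dimensions):

```markdown
## Test Review (AI-generated tests)

- [ ] Correctness: assertions match the spec; tests fail when the feature breaks
- [ ] Independence: no execution-order or shared-state coupling
- [ ] Determinism: time, randomness, and external services are controlled
- [ ] Readability: names describe scenarios; Arrange/Act/Assert is clear
- [ ] Coverage value: each test adds a scenario and would catch a realistic bug
- [ ] Maintainability: uses fixtures/helpers; assertions are descriptive
```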
Key Takeaway
The 10x Review pattern is the difference between "using AI to generate tests" (which anyone can do) and "using AI to produce production-quality test suites" (which requires engineering judgment). The review is where your expertise as a QA engineer creates value. AI writes the first draft; you are the editor-in-chief.