The Critic-Actor Pattern
Adversarial Quality Through Separation of Concerns
The Critic-Actor pattern separates test generation from test review. One agent (the Actor) generates tests. Another agent (the Critic) reviews and challenges them. The design is inspired by adversarial training in machine learning, most famously Generative Adversarial Networks (GANs).
The key insight: a single agent generating and evaluating its own output has a conflict of interest. It tends to approve what it created. Separating these roles produces measurably higher-quality tests.
Architecture
+----------+         +----------+
|  ACTOR   |--------→|  CRITIC  |
| (writes  |         | (reviews |
|  tests)  |←--------|  tests)  |
+----------+         +----------+
      |
      | After critic approves
      v
+----------+
|  FINAL   |
|  SUITE   |
+----------+
The flow:
- Actor generates tests from the specification
- Critic reviews each test against the specification
- Critic returns feedback: APPROVE, REVISE, or REJECT per test
- Actor revises based on feedback
- Repeat until the Critic approves 90%+ of tests (or max rounds reached)
- Final suite = all approved tests
Implementation
class CriticActorPipeline:
    def __init__(self, actor: Agent, critic: Agent, max_rounds: int = 3):
        self.actor = actor
        self.critic = critic
        self.max_rounds = max_rounds

    def generate_reviewed_tests(self, spec: str) -> TestSuite:
        # Actor generates the initial test suite
        tests = self.actor.generate(spec)
        print(f"Actor generated {len(tests)} tests")

        for round_num in range(self.max_rounds):
            # Critic reviews all tests against the spec and marks each
            # test's status (approved / revise / rejected)
            review = self.critic.review(tests, spec)
            print(f"Round {round_num + 1}: "
                  f"{review.approved_count} approved, "
                  f"{review.revise_count} need revision, "
                  f"{review.rejected_count} rejected")

            if review.approval_rate >= 0.9:  # 90%+ of tests approved
                print(f"Critic satisfied after {round_num + 1} rounds")
                break

            # Actor revises flagged tests based on the critic's feedback
            tests = self.actor.revise(tests, review.feedback)

        # Final suite: only tests the critic approved
        return TestSuite(tests=[t for t in tests if t.status == "approved"])
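The snippet leans on a few types it leaves implicit. A minimal sketch of what `TestSuite` and the review object might look like; these names and fields are assumptions, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class GeneratedTest:
    name: str
    code: str
    status: str = "pending"   # the critic sets: "approved" | "revise" | "rejected"

@dataclass
class TestSuite:
    tests: list               # only critic-approved tests end up here

@dataclass
class Review:
    approved_count: int
    revise_count: int
    rejected_count: int
    feedback: dict            # test name -> critic feedback string

    @property
    def approval_rate(self) -> float:
        total = self.approved_count + self.revise_count + self.rejected_count
        return self.approved_count / total if total else 0.0

# e.g. 15 approved, 10 revise, 5 rejected out of 30 tests
r = Review(approved_count=15, revise_count=10, rejected_count=5, feedback={})
print(r.approval_rate)  # 0.5
```

Keeping `approval_rate` as a derived property avoids the counts and the rate ever drifting out of sync.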
The Critic's Review Prompt
The Critic is the quality engine. Its prompt must be rigorous and specific:
You are a senior QA architect reviewing AI-generated tests. For each test:
1. Does the assertion actually verify the requirement? (not a tautology)
2. Would this test catch the bug it claims to test? (mutation analysis)
3. Is this test independent and deterministic?
4. Is there a simpler way to test the same thing?
5. Is anything missing from the specification that should be tested?
Rate each test: APPROVE, REVISE (with specific feedback), or REJECT (with reason).
Also identify any GAPS: scenarios from the spec that no test covers.
Specification:
{spec}
Tests to review:
{tests}
Output as JSON:
{
"reviews": [
{
"test_name": "...",
"verdict": "APPROVE|REVISE|REJECT",
"feedback": "...", // Required for REVISE and REJECT
"mutation_survives": true|false // Would removing the assertion miss a bug?
}
],
"gaps": [
"Description of untested scenario"
]
}
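The pipeline then has to turn that JSON response into verdict counts. A sketch of the parsing step; the `parse_review` helper is an assumption, and note that the `//` comments in the schema above are for human readers, so the model should be instructed to emit strict JSON:

```python
import json

def parse_review(raw: str):
    """Tally verdicts and collect feedback from the critic's JSON response."""
    data = json.loads(raw)
    counts = {"APPROVE": 0, "REVISE": 0, "REJECT": 0}
    feedback = {}
    for r in data["reviews"]:
        counts[r["verdict"]] += 1
        if r["verdict"] in ("REVISE", "REJECT"):
            feedback[r["test_name"]] = r["feedback"]
    total = sum(counts.values())
    approval_rate = counts["APPROVE"] / total if total else 0.0
    return approval_rate, feedback, data.get("gaps", [])

raw = '''{"reviews": [
  {"test_name": "t1", "verdict": "APPROVE", "feedback": ""},
  {"test_name": "t2", "verdict": "REVISE", "feedback": "add body assertions"}
], "gaps": ["no boundary test"]}'''

rate, feedback, gaps = parse_review(raw)
print(rate)      # 0.5
print(feedback)  # {'t2': 'add body assertions'}
```

Only REVISE and REJECT feedback is kept, since that is all the Actor needs for the next round.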
Example Critic Output
{
"reviews": [
{
"test_name": "test_create_user_valid",
"verdict": "REVISE",
"feedback": "Assertion only checks status code 201. Add assertions for response body: verify 'id' is present, 'email' matches input, 'created_at' is recent. Without these, a mutation that breaks user creation but returns 201 would go undetected.",
"mutation_survives": true
},
{
"test_name": "test_create_user_duplicate_email",
"verdict": "APPROVE",
"feedback": "Good test. Verifies 409 status and error message mentioning 'email'. Tight assertions.",
"mutation_survives": false
},
{
"test_name": "test_get_user",
"verdict": "REJECT",
"feedback": "This is a tautology. The mock returns {'name': 'Alice'} and the assertion checks name == 'Alice'. This tests the mock, not the code. Replace with a test that verifies the actual database query or service logic.",
"mutation_survives": true
}
],
"gaps": [
"No test for creating a user with an email longer than 254 characters (RFC 5321 limit)",
"No test for concurrent creation of two users with the same email (race condition)",
"No test for the password hashing (verify stored password is not plaintext)"
]
}
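The rejected `test_get_user` shows the most common failure mode: a test that can only confirm its own mock. A minimal illustration of the difference; `FakeUserRepo` and `format_display_name` are hypothetical names, not from the spec:

```python
class FakeUserRepo:
    """Stub standing in for the database layer."""
    def get_user(self, user_id: int) -> dict:
        return {"id": user_id, "name": "Alice"}

def format_display_name(user: dict) -> str:
    # Hypothetical service logic worth testing: trims and title-cases the name
    return user["name"].strip().title()

# Tautology: the assertion merely echoes the mock's canned value,
# so no change to production code could ever make it fail.
def test_get_user_tautology():
    assert FakeUserRepo().get_user(1)["name"] == "Alice"

# Better: the stub supplies input, but the assertion exercises real logic.
def test_display_name_normalizes_whitespace_and_case():
    assert format_display_name({"id": 1, "name": "  alice  "}) == "Alice"
```

The critic's `mutation_survives: true` flag corresponds to the first case: deleting the service logic entirely would leave that test green.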
The Actor's Revision Prompt
When the Critic returns REVISE feedback, the Actor receives targeted instructions:
The following tests need revision based on code review feedback.
For each test, apply the suggested changes while maintaining the test's intent.
Original test:
{original_test_code}
Reviewer feedback:
{critic_feedback}
Requirements:
1. Apply the specific feedback
2. Do not change other aspects of the test
3. Maintain the same naming convention
4. Keep the test self-contained (no new external dependencies)
Also generate new tests for these identified gaps:
{gaps}
Multi-Round Convergence
The Critic-Actor loop typically converges in 2-3 rounds:
Round 1: Actor generates 30 tests
         Critic: 15 approved, 10 revise, 5 rejected
         Approval rate: 50%

Round 2: Actor revises 10, generates 5 new for gaps
         Critic: 25 approved, 5 revise, 0 rejected
         Approval rate: 83%

Round 3: Actor revises 5
         Critic: 29 approved, 1 revise, 0 rejected
         Approval rate: 97% → DONE
Why max_rounds matters: Without a limit, the Critic-Actor loop could iterate indefinitely, with the Critic finding increasingly minor issues. Set max_rounds=3 to balance quality and cost.
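The stopping rule can be isolated into a few lines: iterate while the approval rate is below the threshold and the round cap has not been hit. A toy version, with scripted per-round rates standing in for real critic reviews:

```python
def rounds_to_converge(rates: list[float],
                       threshold: float = 0.9,
                       max_rounds: int = 3) -> int:
    """Return the number of review rounds actually run.

    `rates` is the approval rate the critic would report each round.
    """
    for i, rate in enumerate(rates[:max_rounds], start=1):
        if rate >= threshold:
            return i          # critic satisfied: stop early
    return max_rounds         # cap reached without hitting the threshold

print(rounds_to_converge([0.50, 0.83, 0.97]))  # 3 -- converges on round 3
print(rounds_to_converge([0.95]))              # 1 -- satisfied immediately
print(rounds_to_converge([0.50, 0.55, 0.60]))  # 3 -- stopped by max_rounds
```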
Quality Benchmarks
In benchmarks, critic-actor pipelines produce measurably better tests:
| Metric | Single-Agent Generation | Critic-Actor Pipeline | Improvement |
|---|---|---|---|
| Tautology rate | 15-20% | 2-5% | 75% reduction |
| Mutation score | 55-65% | 70-80% | +15-25 points |
| Coverage gaps identified | 0 (by definition) | 3-5 per spec | N/A |
| Assertion density (per test) | 1.5 | 3.2 | 2x |
| Review time (human) | 30 min / 30 tests | 10 min / 30 tests | 3x faster |
The human review time drops because the Critic has already caught the most common issues (tautologies, weak assertions, missing scenarios).
Variant: The Three-Agent Pipeline
For even higher quality, add a third agent -- the Specifications Analyst:
+------------+       +----------+       +----------+
|    SPEC    |------→|  ACTOR   |------→|  CRITIC  |
|  ANALYST   |       | (writes  |       | (reviews |
| (extracts  |       |  tests)  |←------|  tests)  |
|    test    |       +----------+       +----------+
| scenarios) |
+------------+
The Spec Analyst reads the specification and outputs a structured list of test scenarios before the Actor writes any code. This prevents the Actor from missing scenarios that the Critic would later identify as gaps.
class SpecAnalyst:
    def analyze(self, spec: str) -> list[TestScenario]:
        return self.llm.generate(f"""
        Analyze this specification and enumerate every testable scenario:

        {spec}

        For each scenario:
        - Category: happy_path | error | boundary | security | performance
        - Description: one sentence
        - Input: what data to use
        - Expected outcome: what should happen
        - Priority: must_have | should_have | nice_to_have
        """)
When to Use the Critic-Actor Pattern
Best for:
- When test quality matters more than speed
- Critical paths (authentication, payments, data integrity)
- Regulated environments (healthcare, finance) where test quality is auditable
- Teams new to AI test generation (the Critic catches AI failure modes)
Trade-off: 2-3x more expensive than single-agent generation (multiple LLM calls per round). But the quality improvement often saves more time in human review than it costs in tokens.
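The cost side of that trade-off can be roughed out with a toy model. All weights here are illustrative assumptions, not measurements: reviews and targeted revisions usually consume fewer tokens than a full generation pass.

```python
def relative_cost(rounds: int,
                  review_w: float = 0.5,
                  revise_w: float = 0.5) -> float:
    """Token cost of critic-actor relative to one single-agent pass.

    One full generation, plus a review per round, plus a revise call
    every round except the last (the loop breaks once the critic approves).
    """
    return 1 + rounds * review_w + (rounds - 1) * revise_w

print(relative_cost(2))  # 2.5 -- within the 2-3x range cited above
print(relative_cost(3))  # 3.5 -- worst case at max_rounds=3
```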
Key Takeaway
The Critic-Actor pattern produces the highest-quality AI-generated tests by separating generation from evaluation. The Critic catches tautologies, weak assertions, and coverage gaps that single agents miss. In benchmarks, this pattern achieves mutation scores 15-25 points higher than single-pass generation, making it the gold standard for quality-critical test suites.