The Critic-Actor Pattern
Adversarial Quality Through Separation of Concerns
The Critic-Actor pattern separates test generation from test review. One agent (the Actor) generates tests. Another agent (the Critic) reviews and challenges them. The design is inspired by adversarial training in machine learning, most famously Generative Adversarial Networks (GANs).
The key insight: a single agent generating and evaluating its own output has a conflict of interest. It tends to approve what it created. Separating these roles produces measurably higher-quality tests.
Architecture
+----------+         +----------+
|  ACTOR   |--------→|  CRITIC  |
| (writes  |         | (reviews |
|  tests)  |←--------|  tests)  |
+----------+         +----------+
      |
      | After critic approves
      v
+----------+
|  FINAL   |
|  SUITE   |
+----------+
The flow:
- Actor generates tests from the specification
- Critic reviews each test against the specification
- Critic returns feedback: APPROVE, REVISE, or REJECT per test
- Actor revises based on feedback
- Repeat until the Critic approves 90%+ of tests (or max rounds reached)
- Final suite = all approved tests
Implementation
class CriticActorPipeline:
    def __init__(self, actor: Agent, critic: Agent, max_rounds: int = 3):
        self.actor = actor
        self.critic = critic
        self.max_rounds = max_rounds

    def generate_reviewed_tests(self, spec: str) -> TestSuite:
        # Actor generates the initial test suite
        tests = self.actor.generate(spec)
        print(f"Actor generated {len(tests)} tests")

        for round_num in range(self.max_rounds):
            # Critic reviews all tests against the spec and marks each
            # test's status (approved / revise / rejected)
            review = self.critic.review(tests, spec)
            print(f"Round {round_num + 1}: "
                  f"{review.approved_count} approved, "
                  f"{review.revise_count} need revision, "
                  f"{review.rejected_count} rejected")

            if review.approval_rate >= 0.9:  # 90%+ of tests approved
                print(f"Critic satisfied after {round_num + 1} rounds")
                break

            # Actor revises flagged tests based on the critic's feedback
            tests = self.actor.revise(tests, review.feedback)

        # Final suite: only tests the critic approved
        return TestSuite(tests=[t for t in tests if t.status == "approved"])
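The snippet leans on a few types it leaves implicit. A minimal sketch of what `TestSuite` and the review object might look like; these names and fields are assumptions, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class GeneratedTest:
    name: str
    code: str
    status: str = "pending"   # the critic sets: "approved" | "revise" | "rejected"

@dataclass
class TestSuite:
    tests: list               # only critic-approved tests end up here

@dataclass
class Review:
    approved_count: int
    revise_count: int
    rejected_count: int
    feedback: dict            # test name -> critic feedback string

    @property
    def approval_rate(self) -> float:
        total = self.approved_count + self.revise_count + self.rejected_count
        return self.approved_count / total if total else 0.0

# e.g. 15 approved, 10 revise, 5 rejected out of 30 tests
r = Review(approved_count=15, revise_count=10, rejected_count=5, feedback={})
print(r.approval_rate)  # 0.5
```

Keeping `approval_rate` as a derived property avoids the counts and the rate ever drifting out of sync.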
The Critic's Review Prompt
The Critic is the quality engine. Its prompt must be rigorous and specific:
You are a senior QA architect reviewing AI-generated tests. For each test:
1. Does the assertion actually verify the requirement? (not a tautology)
2. Would this test catch the bug it claims to test? (mutation analysis)
3. Is this test independent and deterministic?
4. Is there a simpler way to test the same thing?
5. Is anything missing from the specification that should be tested?
Rate each test: APPROVE, REVISE (with specific feedback), or REJECT (with reason).
Also identify any GAPS: scenarios from the spec that no test covers.
Specification:
{spec}
Tests to review:
{tests}
Output as JSON:
{
"reviews": [
{
"test_name": "...",
"verdict": "APPROVE|REVISE|REJECT",
"feedback": "...", // Required for REVISE and REJECT
"mutation_survives": true|false // Would removing the assertion miss a bug?
}
],
"gaps": [
"Description of untested scenario"
]
}
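The pipeline then has to turn that JSON response into verdict counts. A sketch of the parsing step; the `parse_review` helper is an assumption, and note that the `//` comments in the schema above are for human readers, so the model should be instructed to emit strict JSON:

```python
import json

def parse_review(raw: str):
    """Tally verdicts and collect feedback from the critic's JSON response."""
    data = json.loads(raw)
    counts = {"APPROVE": 0, "REVISE": 0, "REJECT": 0}
    feedback = {}
    for r in data["reviews"]:
        counts[r["verdict"]] += 1
        if r["verdict"] in ("REVISE", "REJECT"):
            feedback[r["test_name"]] = r["feedback"]
    total = sum(counts.values())
    approval_rate = counts["APPROVE"] / total if total else 0.0
    return approval_rate, feedback, data.get("gaps", [])

raw = '''{"reviews": [
  {"test_name": "t1", "verdict": "APPROVE", "feedback": ""},
  {"test_name": "t2", "verdict": "REVISE", "feedback": "add body assertions"}
], "gaps": ["no boundary test"]}'''

rate, feedback, gaps = parse_review(raw)
print(rate)      # 0.5
print(feedback)  # {'t2': 'add body assertions'}
```

Only REVISE and REJECT feedback is kept, since that is all the Actor needs for the next round.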
Example Critic Output
{
"reviews": [
{
"test_name": "test_create_user_valid",
"verdict": "REVISE",
"feedback": "Assertion only checks status code 201. Add assertions for response body: verify 'id' is present, 'email' matches input, 'created_at' is recent. Without these, a mutation that breaks user creation but returns 201 would go undetected.",
"mutation_survives": true
},
{
"test_name": "test_create_user_duplicate_email",
"verdict": "APPROVE",
"feedback": "Good test. Verifies 409 status and error message mentioning 'email'. Tight assertions.",
"mutation_survives": false
},
{
"test_name": "test_get_user",
"verdict": "REJECT",
"feedback": "This is a tautology. The mock returns {'name': 'Alice'} and the assertion checks name == 'Alice'. This tests the mock, not the code. Replace with a test that verifies the actual database query or service logic.",
"mutation_survives": true
}
],
"gaps": [
"No test for creating a user with an email longer than 254 characters (RFC 5321 limit)",
"No test for concurrent creation of two users with the same email (race condition)",
"No test for the password hashing (verify stored password is not plaintext)"
]
}
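The rejected `test_get_user` shows the most common failure mode: a test that can only confirm its own mock. A minimal illustration of the difference; `FakeUserRepo` and `format_display_name` are hypothetical names, not from the spec:

```python
class FakeUserRepo:
    """Stub standing in for the database layer."""
    def get_user(self, user_id: int) -> dict:
        return {"id": user_id, "name": "Alice"}

def format_display_name(user: dict) -> str:
    # Hypothetical service logic worth testing: trims and title-cases the name
    return user["name"].strip().title()

# Tautology: the assertion merely echoes the mock's canned value,
# so no change to production code could ever make it fail.
def test_get_user_tautology():
    assert FakeUserRepo().get_user(1)["name"] == "Alice"

# Better: the stub supplies input, but the assertion exercises real logic.
def test_display_name_normalizes_whitespace_and_case():
    assert format_display_name({"id": 1, "name": "  alice  "}) == "Alice"
```

The critic's `mutation_survives: true` flag corresponds to the first case: deleting the service logic entirely would leave that test green.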
The Actor's Revision Prompt
When the Critic returns REVISE feedback, the Actor receives targeted instructions:
The following tests need revision based on code review feedback.
For each test, apply the suggested changes while maintaining the test's intent.
Original test:
{original_test_code}
Reviewer feedback:
{critic_feedback}
Requirements:
1. Apply the specific feedback
2. Do not change other aspects of the test
3. Maintain the same naming convention
4. Keep the test self-contained (no new external dependencies)
Also generate new tests for these identified gaps:
{gaps}
Multi-Round Convergence
The Critic-Actor loop typically converges in 2-3 rounds:
Round 1: Actor generates 30 tests
         Critic: 15 approved, 10 revise, 5 rejected
         Approval rate: 50%

Round 2: Actor revises 10, generates 5 new for gaps
         Critic: 25 approved, 5 revise, 0 rejected
         Approval rate: 83%

Round 3: Actor revises 5
         Critic: 29 approved, 1 revise, 0 rejected
         Approval rate: 97% → DONE
Why max_rounds matters: Without a limit, the Critic-Actor loop could iterate indefinitely, with the Critic finding increasingly minor issues. Set max_rounds=3 to balance quality and cost.
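The stopping rule can be isolated into a few lines: iterate while the approval rate is below the threshold and the round cap has not been hit. A toy version, with scripted per-round rates standing in for real critic reviews:

```python
def rounds_to_converge(rates: list[float],
                       threshold: float = 0.9,
                       max_rounds: int = 3) -> int:
    """Return the number of review rounds actually run.

    `rates` is the approval rate the critic would report each round.
    """
    for i, rate in enumerate(rates[:max_rounds], start=1):
        if rate >= threshold:
            return i          # critic satisfied: stop early
    return max_rounds         # cap reached without hitting the threshold

print(rounds_to_converge([0.50, 0.83, 0.97]))  # 3 -- converges on round 3
print(rounds_to_converge([0.95]))              # 1 -- satisfied immediately
print(rounds_to_converge([0.50, 0.55, 0.60]))  # 3 -- stopped by max_rounds
```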
Quality Benchmarks
In benchmarks, critic-actor pipelines produce measurably better tests:
| Metric | Single-Agent Generation | Critic-Actor Pipeline | Improvement |
|---|---|---|---|
| Tautology rate | 15-20% | 2-5% | 75% reduction |
| Mutation score | 55-65% | 70-80% | +15-25 points |
| Coverage gaps identified | 0 (by definition) | 3-5 per spec | N/A |
| Assertion density (per test) | 1.5 | 3.2 | 2x |
| Review time (human) | 30 min / 30 tests | 10 min / 30 tests | 3x faster |
The human review time drops because the Critic has already caught the most common issues (tautologies, weak assertions, missing scenarios).
Variant: The Three-Agent Pipeline
For even higher quality, add a third agent -- the Specifications Analyst:
+------------+       +----------+       +----------+
|    SPEC    |------→|  ACTOR   |------→|  CRITIC  |
|  ANALYST   |       | (writes  |       | (reviews |
| (extracts  |       |  tests)  |←------|  tests)  |
|    test    |       +----------+       +----------+
| scenarios) |
+------------+
The Spec Analyst reads the specification and outputs a structured list of test scenarios before the Actor writes any code. This prevents the Actor from missing scenarios that the Critic would later identify as gaps.
class SpecAnalyst:
    def analyze(self, spec: str) -> list[TestScenario]:
        return self.llm.generate(f"""
        Analyze this specification and enumerate every testable scenario:

        {spec}

        For each scenario:
        - Category: happy_path | error | boundary | security | performance
        - Description: one sentence
        - Input: what data to use
        - Expected outcome: what should happen
        - Priority: must_have | should_have | nice_to_have
        """)
When to Use the Critic-Actor Pattern
Best for:
- When test quality matters more than speed
- Critical paths (authentication, payments, data integrity)
- Regulated environments (healthcare, finance) where test quality is auditable
- Teams new to AI test generation (the Critic catches AI failure modes)
Trade-off: 2-3x more expensive than single-agent generation (multiple LLM calls per round). But the quality improvement often saves more time in human review than it costs in tokens.
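The cost side of that trade-off can be roughed out with a toy model. All weights here are illustrative assumptions, not measurements: reviews and targeted revisions usually consume fewer tokens than a full generation pass.

```python
def relative_cost(rounds: int,
                  review_w: float = 0.5,
                  revise_w: float = 0.5) -> float:
    """Token cost of critic-actor relative to one single-agent pass.

    One full generation, plus a review per round, plus a revise call
    every round except the last (the loop breaks once the critic approves).
    """
    return 1 + rounds * review_w + (rounds - 1) * revise_w

print(relative_cost(2))  # 2.5 -- within the 2-3x range cited above
print(relative_cost(3))  # 3.5 -- worst case at max_rounds=3
```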
Key Takeaway
The Critic-Actor pattern produces the highest-quality AI-generated tests by separating generation from evaluation. The Critic catches tautologies, weak assertions, and coverage gaps that single agents miss. In benchmarks, this pattern achieves mutation scores 15-25 points higher than single-pass generation, making it the gold standard for quality-critical test suites.