Practical ReAct Implementation

From Prototype to Production

The core ReAct loop is straightforward to understand but requires careful engineering to use in production. This file covers the practical concerns: error handling, prompt optimization, history management, and testing the test agent itself.


Robust Error Handling

A production ReAct agent must handle three categories of errors:
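The snippets in this section pass around `Action` and `ActionResult` objects without defining them. Their exact fields are an implementation choice; a minimal sketch (the `selector` property is one way to reconcile the `payload` string with the `action.selector` access used later) might look like:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Action:
    type: str     # e.g. "CLICK"
    payload: str  # e.g. "button[type=submit]"

    @property
    def selector(self) -> str:
        # For CLICK/TYPE actions, the selector is the first payload token
        return self.payload.split()[0] if self.payload else ""

@dataclass
class ActionResult:
    status: str  # "ok", "recovered", "failed", "timeout", "nav_error"
    message: str = ""
    alternatives: list = field(default_factory=list)
    screenshot: Optional[str] = None
    url: Optional[str] = None
```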

Category 1: Agent Errors (the agent makes a bad decision)

class AgentDecisionError(Exception):
    """Agent produced an unparseable or invalid action."""
    pass

def parse_decision(self, raw_response: str, is_retry: bool = False) -> Action:
    """Parse the LLM's response into a structured action."""
    raw = raw_response.strip()

    # Handle malformed responses
    if not raw:
        raise AgentDecisionError("Empty response from LLM")

    # Try to parse known action formats
    for prefix in ["NAVIGATE", "CLICK", "TYPE", "ASSERT", "DONE"]:
        if raw.upper().startswith(prefix):
            return Action(type=prefix, payload=raw[len(prefix):].strip())

    # If no known action matches, ask the LLM to retry -- but only once,
    # or two bad responses in a row would recurse without bound
    if is_retry:
        raise AgentDecisionError(f"Unparseable action after retry: {raw}")

    retry_prompt = f"""
    Your previous response was not a valid action: "{raw}"
    Please respond with exactly one of:
    - NAVIGATE <url>
    - CLICK <selector>
    - TYPE <selector> <text>
    - ASSERT <condition>
    - DONE <pass|fail> <reason>
    """
    retry_response = self.llm.generate(retry_prompt)
    return self.parse_decision(retry_response, is_retry=True)

Category 2: Environment Errors (the system under test fails)

def execute_with_recovery(self, action: Action) -> ActionResult:
    """Execute an action with environment error recovery."""
    try:
        return self.execute(action)
    except ElementNotFoundError as e:
        # Try alternative selectors
        alternatives = self.browser.find_similar(action.selector)
        if alternatives:
            return ActionResult(
                status="recovered",
                message=f"Original selector '{action.selector}' not found. "
                        f"Found alternatives: {alternatives}",
                alternatives=alternatives
            )
        return ActionResult(status="failed", message=str(e))
    except TimeoutError:
        # Take a screenshot for debugging
        screenshot = self.browser.screenshot(f"timeout_step_{self.step_count}.png")
        return ActionResult(
            status="timeout",
            message="Action timed out",
            screenshot=screenshot
        )
    except NavigationError as e:
        # Page failed to load
        return ActionResult(
            status="nav_error",
            message=f"Navigation failed: {e}",
            url=self.browser.current_url()
        )

Category 3: Infrastructure Errors (the agent infrastructure fails)

def run_with_infrastructure_safety(self, objective: str) -> TestResult:
    """Run a test with infrastructure-level safety nets."""
    try:
        return self.run(objective)
    except LLMRateLimitError:
        return TestResult(status="ABORTED", reason="LLM rate limit exceeded")
    except LLMTimeoutError:
        return TestResult(status="ABORTED", reason="LLM response timeout")
    except BrowserCrashError:
        return TestResult(status="ABORTED", reason="Browser process crashed")
    except Exception as e:
        # Catch-all for unexpected errors
        return TestResult(
            status="ERROR",
            reason=f"Unexpected infrastructure error: {type(e).__name__}: {e}",
            screenshot=self.safe_screenshot()
        )
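Rate limits are usually transient, so aborting on the first one can be wasteful. A common pattern is to retry with exponential backoff before letting the safety net abort; a sketch (the `LLMRateLimitError` class here stands in for whatever exception your LLM client raises):

```python
import time

class LLMRateLimitError(Exception):
    """Stand-in for the rate-limit exception raised by your LLM client."""
    pass

def call_llm_with_backoff(generate, prompt: str,
                          max_attempts: int = 4, base_delay: float = 1.0) -> str:
    """Call an LLM, retrying transient rate-limit errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return generate(prompt)
        except LLMRateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of retries; let the infrastructure safety net abort
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Timeouts and browser crashes are less likely to be transient, so retrying those is usually not worth the extra wall-clock time.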

Prompt Optimization for Speed and Quality

The THINK phase prompt is the most critical component. Every extra token in the prompt costs money and latency.
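A cheap way to keep prompt budgets honest is to estimate token counts before sending. The sketch below uses the rough ~4-characters-per-token heuristic for English text; real tokenizers vary by model, so treat these numbers as order-of-magnitude guides:

```python
def estimate_tokens(prompt: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(prompt) // 4)

def estimate_cost_usd(prompt: str, price_per_1k_tokens: float) -> float:
    """Approximate input cost for one THINK call at a given per-1k-token price."""
    return estimate_tokens(prompt) / 1000 * price_per_1k_tokens
```

Multiplying the per-call estimate by the expected step count gives a per-test budget you can compare across the three prompt styles below.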

Minimal Prompt (Fast, Cheap, Less Accurate)

minimal_prompt = f"""
Goal: {objective}
URL: {observation.url}
Text: {observation.text[:500]}
Last action: {self.history[-1] if self.history else 'none'}
Next action (NAVIGATE/CLICK/TYPE/ASSERT/DONE):"""

Use when: Smoke tests, high-confidence flows, cost-sensitive environments.

Standard Prompt (Balanced)

standard_prompt = f"""
Objective: {objective}
Current URL: {observation.url}
Page content (truncated): {observation.text[:2000]}
Console errors: {observation.errors}
History: {self.history[-5:]}

What should I do next? Choose one:
- NAVIGATE <url>
- CLICK <selector>
- TYPE <selector> <text>
- ASSERT <condition>
- DONE <pass|fail> <reason>
"""

Use when: Standard testing, staging environments.

Detailed Prompt (Slow, Expensive, Most Accurate)

detailed_prompt = f"""
You are a senior QA engineer testing: {objective}

CURRENT STATE:
  URL: {observation.url}
  Title: {observation.title}
  Visible text: {observation.text[:3000]}
  Visible buttons: {observation.buttons}
  Form fields: {observation.form_fields}
  Console errors: {observation.errors}
  Network failures: {observation.network_errors}

TEST HISTORY (last 10 steps):
{json.dumps(self.history[-10:], indent=2)}

CONSTRAINTS:
  - Steps remaining: {self.max_steps - self.step_count}
  - Allowed domains: {self.config.allowed_domains}
  - Do not click delete/remove buttons unless the objective requires it

REASONING REQUIRED:
1. Analyze the current state
2. Identify what has changed since the last step
3. Decide the most efficient next action toward the objective
4. Explain your reasoning in one sentence

ACTION (choose one):
  NAVIGATE <url>
  CLICK <selector>
  TYPE <selector> <text>
  ASSERT <condition>
  DONE <pass|fail> <reason>
"""

Use when: Complex flows, exploratory testing, debugging sessions.


History Management Strategies

The history buffer directly affects agent performance. Too little history causes loops; too much wastes tokens.
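The telltale symptom of a starved history buffer is the agent repeating the same action. A cheap guard, sketched here assuming history entries carry an "action" key as in the snippets in this section, is to detect consecutive repeats and abort or escalate:

```python
def is_looping(history: list[dict], window: int = 3) -> bool:
    """True if the last `window` actions are identical -- a likely stuck loop."""
    if len(history) < window:
        return False
    recent = [h.get("action") for h in history[-window:]]
    return len(set(recent)) == 1
```

Checking this in the ACT phase lets the agent fail fast with a clear reason instead of burning its remaining step budget.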

Strategy 1: Fixed Window

# Keep the last N steps
self.history = self.history[-5:]

Simple and predictable. Works well for tests under 15 steps.

Strategy 2: Summarized History

def summarize_history(self) -> str:
    """Compress history into a summary to save tokens."""
    if len(self.history) <= 3:
        return json.dumps(self.history)

    # Keep first step, summary of middle, and last 2 steps
    summary = f"Started at {self.history[0]['state_url']}. "
    summary += f"Took {len(self.history) - 3} intermediate steps. "
    if any(h.get('result') == 'error' for h in self.history[1:-2]):
        summary += "Encountered errors along the way. "
    summary += f"Recent actions: {json.dumps(self.history[-2:])}"
    return summary

Saves tokens while preserving context for long-running agents.

Strategy 3: Milestone-Based History

def milestone_history(self) -> list[dict]:
    """Keep only steps that represent significant state changes."""
    milestones = []
    last_url = None
    for step in self.history:
        # URL changed = navigation milestone
        if step.get("state_url") != last_url:
            milestones.append(step)
            last_url = step.get("state_url")
        # Error occurred = error milestone
        elif step.get("result") == "error":
            milestones.append(step)
        # Assertion made = assertion milestone
        elif "ASSERT" in step.get("action", ""):
            milestones.append(step)
    return milestones[-5:]  # Keep last 5 milestones

Best for long multi-page workflows where only navigation and assertions matter.


Testing the Test Agent

An often-overlooked concern: how do you test your ReAct agent itself?

Unit Testing the Decision Parser

class TestDecisionParser:
    def test_parses_navigate_command(self):
        action = agent.parse_decision("NAVIGATE https://example.com")
        assert action.type == "NAVIGATE"
        assert action.payload == "https://example.com"

    def test_parses_click_with_complex_selector(self):
        action = agent.parse_decision('CLICK button[data-testid="submit"]')
        assert action.type == "CLICK"
        assert action.payload == 'button[data-testid="submit"]'

    def test_handles_malformed_response(self):
        with pytest.raises(AgentDecisionError):
            agent.parse_decision("I think we should click the button")

    def test_handles_empty_response(self):
        with pytest.raises(AgentDecisionError):
            agent.parse_decision("")

Integration Testing with Mock LLM

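The tests below rely on a ScriptedLLM test double that returns canned responses in order; it is not a real library class. A minimal sketch:

```python
class ScriptedLLM:
    """Test double that returns pre-scripted responses in order."""
    def __init__(self, responses: list[str]):
        self.responses = list(responses)
        self.calls = 0

    def generate(self, prompt: str) -> str:
        if not self.responses:
            raise AssertionError("ScriptedLLM ran out of scripted responses")
        self.calls += 1
        return self.responses.pop(0)
```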
class TestReActAgentFlow:
    def test_completes_login_flow(self, mock_browser):
        """Agent should complete a login flow with a scripted LLM."""
        mock_llm = ScriptedLLM([
            'NAVIGATE https://app.example.com/login',
            'TYPE input[name=email] test@test.com',
            'TYPE input[name=password] secret',
            'CLICK button[type=submit]',
            'DONE pass "Successfully navigated to dashboard"',
        ])

        agent = ReActTestAgent(llm=mock_llm, browser=mock_browser)
        result = agent.run("Log in with valid credentials")

        assert result.status == "pass"
        assert result.steps_taken == 5

    def test_times_out_on_stuck_flow(self, mock_browser):
        """Agent should timeout if it cannot complete the objective."""
        mock_llm = ScriptedLLM([
            'CLICK #nonexistent-button',  # Same bad action, repeated
        ] * 20)

        agent = ReActTestAgent(llm=mock_llm, browser=mock_browser, max_steps=5)
        result = agent.run("Click the submit button")

        assert result.status == "TIMEOUT"
        assert result.steps_taken == 5

Regression Testing Agent Behavior

Record agent decisions on a known-good run, then replay to detect regressions:

class TestAgentRegression:
    def test_login_agent_matches_baseline(self, real_browser, real_llm):
        """Agent decisions should not drift from established baseline."""
        agent = ReActTestAgent(llm=real_llm, browser=real_browser)
        result = agent.run("Log in with test@test.com / password123")

        # Save the decision sequence
        decisions = [h["action"] for h in agent.history]

        # Compare against baseline (saved from a known-good run)
        baseline = load_baseline("login_test_baseline.json")

        # Allow some flexibility (exact decisions may vary)
        assert result.status == baseline["status"]
        assert len(decisions) <= baseline["max_steps"] * 1.5  # Allow 50% more steps

Performance Benchmarks

Configuration              Steps   LLM Latency   Total Time   Token Cost
Minimal prompt, Haiku        8     100ms/step        3s         $0.005
Standard prompt, Sonnet     12     300ms/step        8s         $0.04
Detailed prompt, Opus       15     500ms/step       15s         $0.15

Key Takeaway

A production ReAct agent requires more than the core loop. It needs robust error handling (agent errors, environment errors, infrastructure errors), optimized prompts for the right speed/quality tradeoff, smart history management to prevent token waste, and -- critically -- tests for the agent itself. The engineering around the loop is as important as the loop itself.