The Determinism Spectrum
The Central Tension
Agentic testing's greatest strength (adaptability) is also its greatest weakness (non-determinism). A test that takes a different path on every run is powerful for exploration but useless as a CI gate. Understanding where your tests fall on the determinism spectrum is critical for choosing the right approach.
The Spectrum
```
FULLY DETERMINISTIC <-------------------------------------------> FULLY AUTONOMOUS
  (pytest scripts)                                             (free-roaming agent)

|-- Recorded tests   |-- Parameterized     |-- Constrained       |-- Exploratory
|   (fixed steps,    |   agents            |   agents            |   agents
|   fixed data)      |   (fixed objective, |   (fixed objective, |   (no objective,
|                    |   agent picks path) |   bounded steps)    |   find anything)
|                    |                     |                     |
|   REGRESSION       |   TARGETED          |   SMOKE/SANITY      |   DISCOVERY
|   Best for CI      |   Best for staging  |   Best for deploy   |   Best for sprint
|   gates            |   environments      |   verification      |   exploration
```
Level 1: Fully Deterministic (Recorded Tests)
```python
# Traditional test: every step is predetermined
def test_login():
    driver.navigate("https://app.example.com/login")
    driver.type("#email", "test@test.com")
    driver.type("#password", "password123")
    driver.click("#submit")
    assert driver.url == "https://app.example.com/dashboard"
```
Properties: Same input, same path, same result every time. Use for: CI gates, regression testing, deployability verification. Limitation: Breaks when the UI changes. No adaptability.
Level 2: Parameterized Agents
```python
# Agent has a fixed objective but chooses its own path
def test_login_agent():
    agent = TestAgent(objective="Log in with test@test.com / password123")
    result = agent.run(max_steps=15)
    assert result.final_url == "https://app.example.com/dashboard"
```
Properties: Same objective, potentially different path, same expected outcome. Use for: Staging tests, self-healing regression tests. Limitation: May take different paths that have different performance characteristics.
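One way to keep that path variance in check is to assert on the outcome while also enforcing a budget on steps and wall time, so a run that wanders is flagged even when it eventually reaches the right state. A minimal sketch, assuming a hypothetical `AgentResult` shape (the field names are illustrative, not part of any particular harness):

```python
# Sketch: outcome assertion plus a path budget for a parameterized agent.
# `AgentResult` and its fields stand in for whatever your harness returns.
from dataclasses import dataclass


@dataclass
class AgentResult:
    final_url: str
    steps_taken: int
    duration_seconds: float


def assert_within_budget(result: AgentResult, expected_url: str,
                         max_steps: int = 15, max_seconds: float = 60.0) -> None:
    # The outcome must match exactly; the path only has to stay inside budget.
    assert result.final_url == expected_url, f"landed on {result.final_url}"
    assert result.steps_taken <= max_steps, f"took {result.steps_taken} steps"
    assert result.duration_seconds <= max_seconds, f"ran {result.duration_seconds:.1f}s"
```

The budget turns "different paths have different performance" from a silent drift into an explicit test failure.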
Level 3: Constrained Agents
```python
# Agent has a bounded scope but explores within it
def test_checkout_smoke():
    agent = TestAgent(
        objective="Complete a checkout with any valid product",
        config=HarnessConfig(max_steps=20, timeout_seconds=120),
    )
    result = agent.run()
    assert result.status == "pass"
    assert "order confirmation" in result.final_observation.lower()
```
Properties: Bounded exploration with a goal. Different paths, different products. Use for: Deploy verification, smoke tests, sanity checks. Limitation: Non-deterministic results make debugging harder.
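The debugging pain can be softened by recording the trajectory of every run, so a failure that never reproduces can still be inspected after the fact. A minimal sketch; the trace format is illustrative, not part of any particular harness:

```python
# Sketch: record each agent step so a failing non-deterministic run can be
# inspected after the fact.
import json
import time


class TrajectoryRecorder:
    def __init__(self):
        self.steps = []

    def record(self, action: str, observation: str) -> None:
        self.steps.append({
            "index": len(self.steps),
            "time": time.time(),
            "action": action,
            "observation": observation,
        })

    def dump(self, path: str) -> None:
        # Attach this file to the test report whenever the run fails.
        with open(path, "w") as f:
            json.dump(self.steps, f, indent=2)
```

With a trace attached to every failure, "it took a weird path once" becomes a reviewable artifact instead of an anecdote.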
Level 4: Fully Autonomous (Exploratory Agents)
```python
# Agent explores freely, looking for anything interesting
def test_explore_app():
    agent = ExploratoryAgent(
        starting_url="https://app.example.com",
        config=HarnessConfig(max_steps=100, timeout_seconds=600),
    )
    findings = agent.explore()
    assert len(findings.critical_issues) == 0
```
Properties: No predetermined path. Agent discovers issues autonomously. Use for: Sprint exploration, security audits, discovering untested states. Limitation: Completely non-deterministic. Cannot be a CI gate.
The Value Matrix
| Scenario | Autonomous Agent Value | Non-Determinism Risk | Recommendation |
|---|---|---|---|
| Exploratory testing of new UI | High | Medium | Use as supplement, not gate |
| Regression testing known flows | Low | High | Use deterministic scripts |
| API fuzz testing | High | Low (fuzzing is stochastic) | Agent-driven fuzzing ideal |
| Flaky test detection | High | Low (results are aggregated over repeated runs) | Strong use case |
| Visual regression | Medium | Medium (vision model varies) | Pixel-diff + agent overlay |
| Cross-browser testing | Medium | Low | Good with constrained harness |
Making Agents Deterministic Enough for CI
If you want to use agents in CI pipelines where non-determinism breaks builds, apply these techniques:
Technique 1: Fix the Seed
```python
# Deterministic model output via temperature=0 and fixed seed
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0,
    seed=42,  # Fixed seed for reproducibility
)
```
Limitation: Even with temperature=0 and a fixed seed, LLM outputs are not guaranteed to be identical across API versions or model updates.
Technique 2: Cache Agent Decisions
```python
import json
import os


class DeterministicAgent:
    """Replays cached decisions for reproducibility."""

    def __init__(self, cache_path: str):
        self.cache_path = cache_path
        self.cache = self.load_cache(cache_path)
        self.step = 0

    def load_cache(self, path: str) -> dict:
        """Load previously recorded decisions, if any."""
        if os.path.exists(path):
            with open(path) as f:
                # JSON keys are strings; convert back to step indices
                return {int(k): v for k, v in json.load(f).items()}
        return {}

    def next_action(self):
        if self.step in self.cache:
            action = self.cache[self.step]   # Replay cached decision
        else:
            action = self.llm.decide()       # Generate new decision
            self.cache[self.step] = action   # Record for next run
        self.step += 1
        return action

    def save_cache(self):
        """Save decisions for future deterministic replays."""
        with open(self.cache_path, "w") as f:
            json.dump(self.cache, f)
```
How it works: On the first run, the agent generates decisions and caches them. On subsequent runs, it replays the cached decisions exactly. If the page changes and a cached decision becomes invalid, the cache is invalidated and a new decision is generated.
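The invalidation step can be implemented by keying each cached decision on a fingerprint of the observed page state, so a changed page automatically misses the cache instead of replaying a stale action. A minimal sketch; the names and the fingerprinting scheme are illustrative assumptions:

```python
# Sketch: key cached decisions by (step, page fingerprint). If the page
# changes, the key changes, so the stale cached action is skipped and a
# fresh decision is generated.
import hashlib


def page_fingerprint(page_text: str) -> str:
    # Normalize whitespace so cosmetic changes don't bust the cache
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]


class StateCachedAgent:
    def __init__(self, decide):
        self.decide = decide   # Fallback: the LLM decision function
        self.cache = {}        # (step, fingerprint) -> action
        self.step = 0

    def next_action(self, page_text: str):
        key = (self.step, page_fingerprint(page_text))
        if key not in self.cache:
            # Cache miss: first run, or the page changed since recording
            self.cache[key] = self.decide(page_text)
        action = self.cache[key]
        self.step += 1
        return action
```

A repeat run against an unchanged page replays without calling the model; a changed page triggers exactly one fresh decision at the step that diverged.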
Technique 3: Assert on Outcomes, Not Paths
```python
# BAD: asserts the exact sequence of agent actions
assert agent.history == [
    "navigate /login",
    "type email test@test.com",
    "type password secret",
    "click submit",
]

# GOOD: asserts the final state regardless of how the agent got there
assert agent.browser.url == "https://app.example.com/dashboard"
assert "Welcome" in agent.browser.text("h1")
```
This is the most practical technique. The agent may take a different path on each run, but the test passes as long as the final state is correct.
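One small refinement that makes outcome assertions robust: normalize the final URL before comparing, since different agent paths may legitimately land on the same page with different query parameters or a trailing slash. A sketch using only the standard library:

```python
# Sketch: compare final URLs by scheme, host, and path only, so incidental
# query parameters picked up along different agent paths don't fail an
# otherwise correct outcome assertion.
from urllib.parse import urlsplit


def normalized(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"
```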
Choosing Your Position on the Spectrum
| Your Need | Spectrum Position | Implementation |
|---|---|---|
| CI gate (must pass every time) | Left (deterministic) | Recorded scripts or cached agents |
| Staging validation (mostly pass) | Center-left (parameterized) | Agents with fixed objectives, outcome assertions |
| Deploy verification (quick check) | Center (constrained) | Agents with bounded steps, broad objectives |
| Exploratory testing (find issues) | Right (autonomous) | Unconstrained agents, findings-based reporting |
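The table above can be captured as one harness profile per spectrum position. The `HarnessConfig` dataclass and the specific numbers below are illustrative (they echo the limits used in the earlier examples), not a prescription:

```python
# Sketch: one harness profile per spectrum position, mirroring the table.
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    max_steps: int
    timeout_seconds: int
    cache_decisions: bool     # Replay cached decisions for CI determinism
    objective_required: bool  # Autonomous exploration has no fixed objective


PROFILES = {
    "ci_gate":      HarnessConfig(max_steps=10,  timeout_seconds=60,  cache_decisions=True,  objective_required=True),
    "staging":      HarnessConfig(max_steps=15,  timeout_seconds=120, cache_decisions=False, objective_required=True),
    "deploy_check": HarnessConfig(max_steps=20,  timeout_seconds=120, cache_decisions=False, objective_required=True),
    "exploration":  HarnessConfig(max_steps=100, timeout_seconds=600, cache_decisions=False, objective_required=False),
}
```

Keeping the profiles in one place makes the team's chosen positions on the spectrum explicit and reviewable, rather than scattered across individual tests.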
Key Takeaway
There is no single "right" position on the determinism spectrum. CI gates need deterministic tests. Exploratory testing needs autonomous agents. The art is choosing the right position for each test category and implementing the appropriate techniques (seed fixing, decision caching, outcome assertions) to achieve the desired level of reproducibility.