The Determinism Spectrum
The Central Tension
Agentic testing's greatest strength (adaptability) is also its greatest weakness (non-determinism). A test that takes a different path on every run is powerful for exploration but useless as a CI gate. Understanding where your tests fall on the determinism spectrum is critical for choosing the right approach.
The Spectrum
```
FULLY DETERMINISTIC <-------------------------------------------> FULLY AUTONOMOUS
  (pytest scripts)                                             (free-roaming agent)

|-- Recorded tests   |-- Parameterized     |-- Constrained       |-- Exploratory
|   (fixed steps,    |   agents            |   agents            |   agents
|   fixed data)      |   (fixed objective, |   (fixed objective, |   (no objective,
|                    |   agent picks path) |   bounded steps)    |   find anything)
|                    |                     |                     |
|   REGRESSION       |   TARGETED          |   SMOKE/SANITY      |   DISCOVERY
|   Best for CI      |   Best for staging  |   Best for deploy   |   Best for sprint
|   gates            |   environments      |   verification      |   exploration
```
Level 1: Fully Deterministic (Recorded Tests)
```python
# Traditional test: every step is predetermined
def test_login():
    driver.navigate("https://app.example.com/login")
    driver.type("#email", "test@test.com")
    driver.type("#password", "password123")
    driver.click("#submit")
    assert driver.url == "https://app.example.com/dashboard"
```
Properties: Same input, same path, same result every time. Use for: CI gates, regression testing, deployability verification. Limitation: Breaks when the UI changes. No adaptability.
Level 2: Parameterized Agents
```python
# Agent has a fixed objective but chooses its own path
def test_login_agent():
    agent = TestAgent(objective="Log in with test@test.com / password123")
    result = agent.run(max_steps=15)
    assert result.final_url == "https://app.example.com/dashboard"
```
Properties: Same objective, potentially different path, same expected outcome. Use for: Staging tests, self-healing regression tests. Limitation: May take different paths that have different performance characteristics.
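One way to keep that path variance in check is to assert on the outcome while also enforcing a budget on steps and wall time, so a run that wanders is flagged even when it eventually reaches the right state. A minimal sketch, assuming a hypothetical `AgentResult` shape (the field names are illustrative, not part of any particular harness):

```python
# Sketch: outcome assertion plus a path budget for a parameterized agent.
# `AgentResult` and its fields stand in for whatever your harness returns.
from dataclasses import dataclass


@dataclass
class AgentResult:
    final_url: str
    steps_taken: int
    duration_seconds: float


def assert_within_budget(result: AgentResult, expected_url: str,
                         max_steps: int = 15, max_seconds: float = 60.0) -> None:
    # The outcome must match exactly; the path only has to stay inside budget.
    assert result.final_url == expected_url, f"landed on {result.final_url}"
    assert result.steps_taken <= max_steps, f"took {result.steps_taken} steps"
    assert result.duration_seconds <= max_seconds, f"ran {result.duration_seconds:.1f}s"
```

The budget turns "different paths have different performance" from a silent drift into an explicit test failure.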
Level 3: Constrained Agents
```python
# Agent has a bounded scope but explores within it
def test_checkout_smoke():
    agent = TestAgent(
        objective="Complete a checkout with any valid product",
        config=HarnessConfig(max_steps=20, timeout_seconds=120),
    )
    result = agent.run()
    assert result.status == "pass"
    assert "order confirmation" in result.final_observation.lower()
```
Properties: Bounded exploration with a goal. Different paths, different products. Use for: Deploy verification, smoke tests, sanity checks. Limitation: Non-deterministic results make debugging harder.
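The debugging pain can be softened by recording the trajectory of every run, so a failure that never reproduces can still be inspected after the fact. A minimal sketch; the trace format is illustrative, not part of any particular harness:

```python
# Sketch: record each agent step so a failing non-deterministic run can be
# inspected after the fact.
import json
import time


class TrajectoryRecorder:
    def __init__(self):
        self.steps = []

    def record(self, action: str, observation: str) -> None:
        self.steps.append({
            "index": len(self.steps),
            "time": time.time(),
            "action": action,
            "observation": observation,
        })

    def dump(self, path: str) -> None:
        # Attach this file to the test report whenever the run fails.
        with open(path, "w") as f:
            json.dump(self.steps, f, indent=2)
```

With a trace attached to every failure, "it took a weird path once" becomes a reviewable artifact instead of an anecdote.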
Level 4: Fully Autonomous (Exploratory Agents)
```python
# Agent explores freely, looking for anything interesting
def test_explore_app():
    agent = ExploratoryAgent(
        starting_url="https://app.example.com",
        config=HarnessConfig(max_steps=100, timeout_seconds=600),
    )
    findings = agent.explore()
    assert len(findings.critical_issues) == 0
```
Properties: No predetermined path. Agent discovers issues autonomously. Use for: Sprint exploration, security audits, discovering untested states. Limitation: Completely non-deterministic. Cannot be a CI gate.
The Value Matrix
| Scenario | Autonomous Agent Value | Non-Determinism Risk | Recommendation |
|---|---|---|---|
| Exploratory testing of new UI | High | Medium | Use as supplement, not gate |
| Regression testing known flows | Low | High | Use deterministic scripts |
| API fuzz testing | High | Low (fuzzing is stochastic) | Agent-driven fuzzing ideal |
| Flaky test detection | High | Low (results are aggregated over repeated runs) | Strong use case |
| Visual regression | Medium | Medium (vision model varies) | Pixel-diff + agent overlay |
| Cross-browser testing | Medium | Low | Good with constrained harness |
Making Agents Deterministic Enough for CI
If you want to use agents in CI pipelines where non-determinism breaks builds, apply these techniques:
Technique 1: Fix the Seed
```python
# Deterministic model output via temperature=0 and fixed seed
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0,
    seed=42,  # Fixed seed for reproducibility
)
```
Limitation: Even with temperature=0 and a fixed seed, LLM outputs are not guaranteed to be identical across API versions or model updates.
Technique 2: Cache Agent Decisions
```python
import json
import os


class DeterministicAgent:
    """Replays cached decisions for reproducibility."""

    def __init__(self, cache_path: str):
        self.cache_path = cache_path
        self.cache = self.load_cache(cache_path)
        self.step = 0

    def load_cache(self, path: str) -> dict:
        """Load previously recorded decisions, if any."""
        if os.path.exists(path):
            with open(path) as f:
                # JSON keys are strings; convert back to step indices
                return {int(k): v for k, v in json.load(f).items()}
        return {}

    def next_action(self):
        if self.step in self.cache:
            action = self.cache[self.step]   # Replay cached decision
        else:
            action = self.llm.decide()       # Generate new decision
            self.cache[self.step] = action   # Record for next run
        self.step += 1
        return action

    def save_cache(self):
        """Save decisions for future deterministic replays."""
        with open(self.cache_path, "w") as f:
            json.dump(self.cache, f)
```
How it works: On the first run, the agent generates decisions and caches them. On subsequent runs, it replays the cached decisions exactly. If the page changes and a cached decision becomes invalid, the cache is invalidated and a new decision is generated.
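The invalidation step can be implemented by keying each cached decision on a fingerprint of the observed page state, so a changed page automatically misses the cache instead of replaying a stale action. A minimal sketch; the names and the fingerprinting scheme are illustrative assumptions:

```python
# Sketch: key cached decisions by (step, page fingerprint). If the page
# changes, the key changes, so the stale cached action is skipped and a
# fresh decision is generated.
import hashlib


def page_fingerprint(page_text: str) -> str:
    # Normalize whitespace so cosmetic changes don't bust the cache
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]


class StateCachedAgent:
    def __init__(self, decide):
        self.decide = decide   # Fallback: the LLM decision function
        self.cache = {}        # (step, fingerprint) -> action
        self.step = 0

    def next_action(self, page_text: str):
        key = (self.step, page_fingerprint(page_text))
        if key not in self.cache:
            # Cache miss: first run, or the page changed since recording
            self.cache[key] = self.decide(page_text)
        action = self.cache[key]
        self.step += 1
        return action
```

A repeat run against an unchanged page replays without calling the model; a changed page triggers exactly one fresh decision at the step that diverged.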
Technique 3: Assert on Outcomes, Not Paths
```python
# BAD: asserts the exact sequence of agent actions
assert agent.history == [
    "navigate /login",
    "type email test@test.com",
    "type password secret",
    "click submit",
]

# GOOD: asserts the final state regardless of how the agent got there
assert agent.browser.url == "https://app.example.com/dashboard"
assert "Welcome" in agent.browser.text("h1")
```
This is the most practical technique. The agent may take a different path on each run, but the test passes as long as the final state is correct.
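One small refinement that makes outcome assertions robust: normalize the final URL before comparing, since different agent paths may legitimately land on the same page with different query parameters or a trailing slash. A sketch using only the standard library:

```python
# Sketch: compare final URLs by scheme, host, and path only, so incidental
# query parameters picked up along different agent paths don't fail an
# otherwise correct outcome assertion.
from urllib.parse import urlsplit


def normalized(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"
```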
Choosing Your Position on the Spectrum
| Your Need | Spectrum Position | Implementation |
|---|---|---|
| CI gate (must pass every time) | Left (deterministic) | Recorded scripts or cached agents |
| Staging validation (mostly pass) | Center-left (parameterized) | Agents with fixed objectives, outcome assertions |
| Deploy verification (quick check) | Center (constrained) | Agents with bounded steps, broad objectives |
| Exploratory testing (find issues) | Right (autonomous) | Unconstrained agents, findings-based reporting |
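The table above can be captured as one harness profile per spectrum position. The `HarnessConfig` dataclass and the specific numbers below are illustrative (they echo the limits used in the earlier examples), not a prescription:

```python
# Sketch: one harness profile per spectrum position, mirroring the table.
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    max_steps: int
    timeout_seconds: int
    cache_decisions: bool     # Replay cached decisions for CI determinism
    objective_required: bool  # Autonomous exploration has no fixed objective


PROFILES = {
    "ci_gate":      HarnessConfig(max_steps=10,  timeout_seconds=60,  cache_decisions=True,  objective_required=True),
    "staging":      HarnessConfig(max_steps=15,  timeout_seconds=120, cache_decisions=False, objective_required=True),
    "deploy_check": HarnessConfig(max_steps=20,  timeout_seconds=120, cache_decisions=False, objective_required=True),
    "exploration":  HarnessConfig(max_steps=100, timeout_seconds=600, cache_decisions=False, objective_required=False),
}
```

Keeping the profiles in one place makes the team's chosen positions on the spectrum explicit and reviewable, rather than scattered across individual tests.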
Key Takeaway
There is no single "right" position on the determinism spectrum. CI gates need deterministic tests. Exploratory testing needs autonomous agents. The art is choosing the right position for each test category and implementing the appropriate techniques (seed fixing, decision caching, outcome assertions) to achieve the desired level of reproducibility.