The Dead Man's Switch Pattern
The Last Line of Defense
Even with all five guardrails in place, there are scenarios where the harness itself fails. The agent might hang in a system call. The harness process might segfault. A network partition might freeze the LLM call indefinitely. The Dead Man's Switch is the external safety net that kills the test run when the harness itself becomes unresponsive.
Implementation in CI/CD
The Dead Man's Switch lives outside the agent and harness. It is configured at the CI runner level:
GitHub Actions
- name: Run agentic tests
  timeout-minutes: 15  # Hard CI-level timeout
  env:
    AGENT_MAX_STEPS: 30
    AGENT_MAX_TOKENS: 50000
    AGENT_ALLOWED_DOMAINS: "staging.myapp.com,api.staging.myapp.com"
  run: |
    # --timeout=300 is the per-test timeout (pytest-timeout plugin)
    python -m pytest tests/agentic/ \
      --timeout=300 \
      --agent-budget=$AGENT_MAX_TOKENS
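On the other side of these environment variables, the harness has to parse the budgets with safe defaults for local runs. A minimal sketch (the names `AgentBudget` and `budget_from_env` are illustrative, not part of the workflow above):

```python
import os
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AgentBudget:
    max_steps: int
    max_tokens: int
    allowed_domains: Tuple[str, ...]


def budget_from_env(env=None) -> AgentBudget:
    """Parse CI-provided budgets, falling back to conservative defaults."""
    env = os.environ if env is None else env
    domains = env.get("AGENT_ALLOWED_DOMAINS", "")
    return AgentBudget(
        max_steps=int(env.get("AGENT_MAX_STEPS", "30")),
        max_tokens=int(env.get("AGENT_MAX_TOKENS", "50000")),
        allowed_domains=tuple(d.strip() for d in domains.split(",") if d.strip()),
    )
```

Keeping the defaults at least as strict as the CI values means a developer who forgets to export the variables locally still gets a bounded run.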
The layered timeout structure:
Layer 1: Per-action timeout (30s)
  → Catches: hung browser, slow element lookup
  → Set by: the harness (action-level)

Layer 2: Per-test timeout (300s / 5 min)
  → Catches: stuck agent loop, slow LLM response chain
  → Set by: pytest-timeout plugin

Layer 3: CI job timeout (15 min)
  → Catches: harness hang, process deadlock, zombie processes
  → Set by: GitHub Actions timeout-minutes

Layer 4: CI workflow timeout (60 min)
  → Catches: entire workflow stuck (multiple jobs)
  → Set by: timeout-minutes on each job (GitHub Actions has no single workflow-wide timeout setting, so every job needs its own)
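Layer 1 lives inside the harness itself. One way to sketch it (assuming each action can run in a worker thread; `run_with_timeout` is an illustrative name, not from the original harness):

```python
import concurrent.futures

ACTION_TIMEOUT_S = 30  # Layer 1 budget


def run_with_timeout(fn, *args, timeout=ACTION_TIMEOUT_S):
    """Run one agent action in a worker thread; raise TimeoutError on overrun.

    The hung worker thread itself cannot be killed from here, which is
    exactly why the outer layers (per-test and CI timeouts) also exist.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=timeout)
    finally:
        pool.shutdown(wait=False)  # do not block on the possibly hung worker
```

Note the limitation stated in the docstring: Python threads cannot be forcibly terminated, so this layer detects but does not reclaim a hung action. Reclamation is the job of the outer layers.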
GitLab CI
agentic_tests:
  timeout: 15 minutes
  variables:
    AGENT_MAX_STEPS: "30"
    AGENT_MAX_TOKENS: "50000"
  script:
    - python -m pytest tests/agentic/ --timeout=300
  after_script:
    - python scripts/cleanup_agent_processes.py
Jenkins
pipeline {
    options {
        timeout(time: 15, unit: 'MINUTES')  // Layer 3: job-level kill switch
    }
    stages {
        stage('Agentic Tests') {
            steps {
                // coreutils `timeout` caps the whole run at 300s;
                // pytest-timeout caps each individual test at 120s
                sh '''
                    timeout 300 python -m pytest tests/agentic/ \
                        --timeout=120
                '''
            }
            post {
                always {
                    sh 'pkill -f "vibe-check daemon" || true'
                    sh 'pkill -f "chromedriver" || true'
                }
            }
        }
    }
}
Process Cleanup
When a Dead Man's Switch triggers, it kills the process abruptly. This can leave zombie processes (Chrome instances, daemon processes, temp files). Always implement cleanup:
#!/usr/bin/env python3
"""cleanup_agent_processes.py — Run after agentic tests to clean up."""
import glob
import os
import subprocess


def _remove_all(pattern: str) -> None:
    """Delete every file matching pattern, tolerating concurrent removal."""
    for path in glob.glob(pattern):
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # another cleanup pass got there first


def cleanup():
    # Kill any remaining Chrome processes started by tests
    subprocess.run(["pkill", "-f", "chrome.*--test-type"], capture_output=True)
    # Kill any vibe-check daemon processes
    subprocess.run(["pkill", "-f", "vibe-check daemon"], capture_output=True)
    # Kill any chromedriver processes
    subprocess.run(["pkill", "-f", "chromedriver"], capture_output=True)
    # Remove temporary screenshots, stale lock files, and shared state files
    _remove_all("/tmp/agent_screenshot_*.png")
    _remove_all("/tmp/agent_lock_*")
    _remove_all("/tmp/agent_state_*.json")
    print("Agent cleanup complete")


if __name__ == "__main__":
    cleanup()
Heartbeat Monitoring
For long-running agent sessions (nightly exploratory tests), implement a heartbeat:
import os
import sys
import threading
import time


class HeartbeatMonitor:
    """Kills the process if the agent stops making progress."""

    def __init__(self, max_idle_seconds: int = 120):
        self.max_idle = max_idle_seconds
        self.last_heartbeat = time.time()
        self._running = True
        self._thread = threading.Thread(target=self._monitor, daemon=True)
        self._thread.start()

    def beat(self):
        """Call this every time the agent takes an action."""
        self.last_heartbeat = time.time()

    def _monitor(self):
        while self._running:
            idle = time.time() - self.last_heartbeat
            if idle > self.max_idle:
                print(f"DEAD MAN'S SWITCH: No heartbeat for {idle:.0f}s. "
                      f"Killing process.", file=sys.stderr)
                os._exit(1)  # Hard exit: bypass cleanup
            time.sleep(10)  # Check every 10 seconds

    def stop(self):
        self._running = False


# Usage in the harness
monitor = HeartbeatMonitor(max_idle_seconds=120)


class HarnessWithHeartbeat(ConstrainedTestHarness):
    def execute(self, objective):
        try:
            for step in range(self.config.max_steps):
                monitor.beat()  # Signal that we are still making progress
                action = self.agent.next_action()
                result = self.agent.execute_action(action)
                if result.is_terminal:
                    return result
        finally:
            monitor.stop()  # Disarm even if the loop exits early or raises
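An in-process monitor shares the fate of the harness: if the interpreter wedges in a native call that holds the GIL, the watchdog thread may never get scheduled. A stricter variant moves the watchdog to a separate process that checks the modification time of a heartbeat file the harness touches on every action. Everything below (the file path, `seconds_idle`, `watch`) is an illustrative sketch, not part of the original harness:

```python
import os
import signal
import sys
import time

HEARTBEAT_FILE = "/tmp/agent_heartbeat"  # assumption: harness touches this per action
MAX_IDLE_S = 120


def seconds_idle(path: str, now: float) -> float:
    """Seconds since the heartbeat file was last touched (0 if not yet created)."""
    try:
        return now - os.path.getmtime(path)
    except FileNotFoundError:
        return 0.0  # harness has not started acting yet


def watch(pid: int, path: str = HEARTBEAT_FILE, max_idle: float = MAX_IDLE_S):
    """Poll the heartbeat file; SIGKILL the harness process if it goes stale."""
    while True:
        idle = seconds_idle(path, time.time())
        if idle > max_idle:
            print(f"DEAD MAN'S SWITCH: heartbeat stale for {idle:.0f}s; "
                  f"killing PID {pid}", file=sys.stderr)
            os.kill(pid, signal.SIGKILL)
            return
        time.sleep(10)
```

Because the watchdog is a separate process, it survives anything short of the whole machine failing, at which point the CI job timeout takes over.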
The Complete Safety Stack
+----------------------------------------------------------+
| Layer 4: CI Workflow Timeout (60 min) |
| +------------------------------------------------------+|
| | Layer 3: CI Job Timeout (15 min) ||
| | +--------------------------------------------------+||
| | | Layer 2: Per-Test Timeout (300s) |||
| | | +----------------------------------------------+|||
| | | | Layer 1: Per-Action Timeout (30s) ||||
| | | | +------------------------------------------+||||
| | | | | Guardrails: steps, tokens, domains, |||||
| | | | | actions — enforced by the harness |||||
| | | | +------------------------------------------+||||
| | | +----------------------------------------------+|||
| | +--------------------------------------------------+||
| +------------------------------------------------------+|
+----------------------------------------------------------+
Each layer catches failures that the inner layer misses:
- Guardrails catch agent misbehavior (too many steps, wrong domain)
- Per-action timeout catches hung browser operations
- Per-test timeout catches stuck agent loops
- CI job timeout catches harness/process hangs
- CI workflow timeout catches infrastructure-level failures
Anti-Pattern: No Kill Switch
# DANGEROUS: no timeouts at any level
- name: Run agentic tests
  run: python -m pytest tests/agentic/
This job can run forever. If the agent enters an infinite loop and the harness fails to detect it, the CI runner is blocked until manually killed. In a shared CI environment, this blocks other teams.
Always set at least two timeout layers: one in the test framework and one in the CI runner.
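One way to make the harness-side layer unskippable is to refuse to start without an explicit wall-clock budget. A defensive sketch (the names `MissingKillSwitch` and `require_deadline` are hypothetical, not from the original):

```python
import time
from typing import Optional


class MissingKillSwitch(RuntimeError):
    """Raised when agentic tests are started without any wall-clock budget."""


def require_deadline(max_runtime_s: Optional[float]) -> float:
    """Return an absolute deadline, refusing to run without a budget."""
    if not max_runtime_s or max_runtime_s <= 0:
        raise MissingKillSwitch(
            "Refusing to run agentic tests without a wall-clock budget; "
            "configure a timeout even if the CI runner also has one."
        )
    return time.time() + max_runtime_s
```

The harness then checks `time.time()` against the returned deadline at the top of its loop, so the "no kill switch" configuration fails loudly at startup instead of silently running forever.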
Key Takeaway
The Dead Man's Switch pattern ensures that agentic tests always terminate, even when the agent, harness, and test framework all fail to self-terminate. Implement it as layered timeouts (action, test, job, workflow) plus process cleanup in CI post-scripts. The CI timeout is the outermost safety net -- the one guarantee that your pipeline will not hang indefinitely.