QA Engineer Skills 2026QA-2026The Dead Man's Switch Pattern

The Dead Man's Switch Pattern

The Last Line of Defense

Even with all five guardrails in place, there are scenarios where the harness itself fails. The agent might hang in a system call. The harness process might segfault. A network partition might freeze the LLM call indefinitely. The Dead Man's Switch is the external safety net that kills the test run when the harness itself becomes unresponsive.


Implementation in CI/CD

The Dead Man's Switch lives outside the agent and harness. It is configured at the CI runner level:

GitHub Actions

- name: Run agentic tests
  timeout-minutes: 15           # Hard CI-level timeout
  env:
    AGENT_MAX_STEPS: 30
    AGENT_MAX_TOKENS: 50000
    AGENT_ALLOWED_DOMAINS: "staging.myapp.com,api.staging.myapp.com"
  run: |
    python -m pytest tests/agentic/ \
      --timeout=300 \            # Per-test timeout (pytest-timeout)
      --agent-budget=$AGENT_MAX_TOKENS

The layered timeout structure:

Layer 1: Per-action timeout (30s)
  → Catches: hung browser, slow element lookup
  → Set by: the harness (action-level)

Layer 2: Per-test timeout (300s / 5 min)
  → Catches: stuck agent loop, slow LLM response chain
  → Set by: pytest-timeout plugin

Layer 3: CI job timeout (15 min)
  → Catches: harness hang, process deadlock, zombie processes
  → Set by: GitHub Actions timeout-minutes

Layer 4: CI workflow timeout (60 min)
  → Catches: entire workflow stuck (multiple jobs)
  → Set by: GitHub Actions workflow-level timeout

GitLab CI

agentic_tests:
  timeout: 15 minutes
  variables:
    AGENT_MAX_STEPS: "30"
    AGENT_MAX_TOKENS: "50000"
  script:
    - python -m pytest tests/agentic/ --timeout=300
  after_script:
    - python scripts/cleanup_agent_processes.py

Jenkins

pipeline {
    options {
        timeout(time: 15, unit: 'MINUTES')
    }
    stages {
        stage('Agentic Tests') {
            steps {
                sh '''
                    timeout 300 python -m pytest tests/agentic/ \
                        --timeout=120
                '''
            }
            post {
                always {
                    sh 'pkill -f "vibe-check daemon" || true'
                    sh 'pkill -f "chromedriver" || true'
                }
            }
        }
    }
}

Process Cleanup

When a Dead Man's Switch triggers, it kills the process abruptly. This can leave zombie processes (Chrome instances, daemon processes, temp files). Always implement cleanup:

#!/usr/bin/env python3
"""cleanup_agent_processes.py — Run after agentic tests to clean up."""

import subprocess
import os
import signal
import glob

def cleanup():
    # Kill any remaining Chrome processes started by tests
    subprocess.run(["pkill", "-f", "chrome.*--test-type"], capture_output=True)

    # Kill any vibe-check daemon processes
    subprocess.run(["pkill", "-f", "vibe-check daemon"], capture_output=True)

    # Kill any chromedriver processes
    subprocess.run(["pkill", "-f", "chromedriver"], capture_output=True)

    # Remove temporary screenshot files
    for f in glob.glob("/tmp/agent_screenshot_*.png"):
        os.remove(f)

    # Remove stale lock files
    for f in glob.glob("/tmp/agent_lock_*"):
        os.remove(f)

    # Remove shared state files from agent runs
    for f in glob.glob("/tmp/agent_state_*.json"):
        os.remove(f)

    print("Agent cleanup complete")

if __name__ == "__main__":
    cleanup()

Heartbeat Monitoring

For long-running agent sessions (nightly exploratory tests), implement a heartbeat:

import threading
import time
import sys

class HeartbeatMonitor:
    """Kills the process if the agent stops making progress."""

    def __init__(self, max_idle_seconds: int = 120):
        self.max_idle = max_idle_seconds
        self.last_heartbeat = time.time()
        self._running = True
        self._thread = threading.Thread(target=self._monitor, daemon=True)
        self._thread.start()

    def beat(self):
        """Call this every time the agent takes an action."""
        self.last_heartbeat = time.time()

    def _monitor(self):
        while self._running:
            idle = time.time() - self.last_heartbeat
            if idle > self.max_idle:
                print(f"DEAD MAN'S SWITCH: No heartbeat for {idle:.0f}s. "
                      f"Killing process.", file=sys.stderr)
                os._exit(1)  # Hard exit — bypass cleanup
            time.sleep(10)  # Check every 10 seconds

    def stop(self):
        self._running = False

# Usage in the harness
monitor = HeartbeatMonitor(max_idle_seconds=120)

class HarnessWithHeartbeat(ConstrainedTestHarness):
    def execute(self, objective):
        for step in range(self.config.max_steps):
            monitor.beat()  # Signal that we are still making progress
            action = self.agent.next_action()
            result = self.agent.execute_action(action)
            if result.is_terminal:
                monitor.stop()
                return result

The Complete Safety Stack

+----------------------------------------------------------+
|  Layer 4: CI Workflow Timeout (60 min)                    |
|  +------------------------------------------------------+|
|  |  Layer 3: CI Job Timeout (15 min)                    ||
|  |  +--------------------------------------------------+||
|  |  |  Layer 2: Per-Test Timeout (300s)                |||
|  |  |  +----------------------------------------------+|||
|  |  |  |  Layer 1: Per-Action Timeout (30s)           ||||
|  |  |  |  +------------------------------------------+||||
|  |  |  |  |  Guardrails: steps, tokens, domains,     |||||
|  |  |  |  |  actions — enforced by the harness       |||||
|  |  |  |  +------------------------------------------+||||
|  |  |  +----------------------------------------------+|||
|  |  +--------------------------------------------------+||
|  +------------------------------------------------------+|
+----------------------------------------------------------+

Each layer catches failures that the inner layer misses:

  • Guardrails catch agent misbehavior (too many steps, wrong domain)
  • Per-action timeout catches hung browser operations
  • Per-test timeout catches stuck agent loops
  • CI job timeout catches harness/process hangs
  • CI workflow timeout catches infrastructure-level failures

Anti-Pattern: No Kill Switch

# DANGEROUS: no timeouts at any level
- name: Run agentic tests
  run: python -m pytest tests/agentic/

This job can run forever. If the agent enters an infinite loop and the harness fails to detect it, the CI runner is blocked until manually killed. In a shared CI environment, this blocks other teams.

Always set at least two timeout layers: one in the test framework and one in the CI runner.


Key Takeaway

The Dead Man's Switch pattern ensures that agentic tests always terminate, even when the agent, harness, and test framework all fail to self-terminate. Implement it as layered timeouts (action, test, job, workflow) plus process cleanup in CI post-scripts. The CI timeout is the outermost safety net -- the one guarantee that your pipeline will not hang indefinitely.