
Game Days: Incident Response Testing

What Is a Game Day?

A game day is a scheduled chaos exercise where the team practices incident response procedures under controlled conditions. Unlike automated chaos experiments that validate system resilience, game days validate human processes: communication, decision-making, tooling familiarity, and runbook accuracy.

Think of it as a fire drill for your engineering organization. The building (system) may survive a fire automatically, but the people need to know the evacuation routes (runbooks), where the fire extinguishers are (tools), and who leads the response (incident commander).


Why QA Architects Should Lead Game Days

QA architects are uniquely positioned to design and facilitate game days because they:

  1. Know the system's weak points from performance and chaos testing
  2. Understand the user impact of different failure scenarios
  3. Can design realistic scenarios based on production incident history
  4. Sit between development and operations, bridging communication gaps
  5. Have experience designing test cases, which is exactly what game day scenarios are

Game Day Checklist

Pre-Game (1-2 Weeks Before)

  1. Scenario Design -- Define the failure scenario and success criteria

    • What failure will you simulate?
    • What is the expected system behavior?
    • What is the expected human response?
    • How will you measure success?
  2. Participant Briefing -- Notify all participants

    • On-call engineers for affected services
    • Incident commander (rotating role)
    • Communications lead
    • Optional observers (management, new team members)
  3. Safety Measures -- Define boundaries

    • What blast radius is acceptable?
    • What is the kill switch procedure?
    • What hours will the exercise run?
    • Is customer impact acceptable? If not, use staging.
  4. Tool Readiness -- Verify all incident response tools work

    • PagerDuty / Opsgenie configured and routing correctly
    • Slack/Teams incident channels ready
    • Monitoring dashboards accessible
    • Runbooks up to date and accessible
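The safety measures above can be encoded as a small guard the facilitator runs during the exercise. A minimal sketch, assuming hypothetical metric names (`error_rate`, `affected_users`) and thresholds agreed in the pre-game review:

```python
from dataclasses import dataclass

@dataclass
class SafetyBoundary:
    """Limits agreed in the pre-game safety review (field names are illustrative)."""
    max_error_rate: float    # abort if the error-rate fraction exceeds this
    max_affected_users: int  # abort if more users than this are impacted

def should_abort(boundary: SafetyBoundary, error_rate: float, affected_users: int) -> bool:
    """True when the blast radius exceeds the agreed boundary and the kill switch should fire."""
    return error_rate > boundary.max_error_rate or affected_users > boundary.max_affected_users

# Example: a staging exercise capped at 5% errors and 100 affected test users
boundary = SafetyBoundary(max_error_rate=0.05, max_affected_users=100)
should_abort(boundary, error_rate=0.02, affected_users=40)  # False: within bounds
should_abort(boundary, error_rate=0.08, affected_users=40)  # True: error rate breached
```

In practice the inputs would come from your monitoring system; the decision logic itself should stay this simple so it can be executed under stress.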

During the Game

  1. Inject the Failure -- Execute the chaos experiment

    • Use Litmus, Gremlin, or manual intervention
    • Record the exact time of injection
  2. Observe the Response -- Track metrics without intervening

    • Detection Time: How long until someone notices?
    • Acknowledgment Time: How long until the on-call responds?
    • Communication Time: How long until the team is assembled?
    • Diagnosis Time: How long until root cause is identified?
    • Recovery Time: How long until service is restored?
    • Customer Impact: How many users were affected, for how long?
  3. Document Everything -- A dedicated observer takes notes

    • What happened and when (timeline)
    • What went well
    • What went poorly
    • Gaps in tooling, monitoring, or runbooks
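The timing metrics above fall out mechanically once the observer records a timestamped timeline. A sketch, assuming illustrative event names rather than any standard schema:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

def minutes_between(t0: str, t1: str) -> float:
    """Elapsed minutes between two timestamps in FMT."""
    return (datetime.strptime(t1, FMT) - datetime.strptime(t0, FMT)).total_seconds() / 60

def response_metrics(timeline: dict) -> dict:
    """Derive the game day metrics from an event -> timestamp map."""
    t0 = timeline["injection"]
    return {
        "detection_min": minutes_between(t0, timeline["detected"]),
        "acknowledgment_min": minutes_between(t0, timeline["acknowledged"]),
        "diagnosis_min": minutes_between(t0, timeline["diagnosed"]),
        "recovery_min": minutes_between(t0, timeline["recovered"]),
    }

timeline = {
    "injection":    "2026-03-01T14:00:00",
    "detected":     "2026-03-01T14:04:00",
    "acknowledged": "2026-03-01T14:06:00",
    "diagnosed":    "2026-03-01T14:18:00",
    "recovered":    "2026-03-01T14:27:00",
}
response_metrics(timeline)
# {'detection_min': 4.0, 'acknowledgment_min': 6.0, 'diagnosis_min': 18.0, 'recovery_min': 27.0}
```

Recording only event timestamps during the exercise, and computing durations afterward, keeps the observer focused on watching rather than doing arithmetic.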

Post-Game (Within 48 Hours)

  1. Post-Game Review -- Blameless retrospective
    • Review the timeline with all participants
    • Identify action items (runbook updates, monitoring gaps, tool improvements)
    • Assign owners and deadlines for each action item
    • Share findings with the broader organization

Designing Game Day Scenarios

Scenario Template

## Game Day Scenario: [Name]

### Failure Type
[Pod kill / Network partition / Dependency failure / etc.]

### Target Service
[Service name and environment]

### Hypothesis
"When [failure], the system should [expected behavior], and the team should
[expected response] within [time limit]."

### Injection Method
[Litmus experiment / Gremlin attack / Manual kubectl / etc.]

### Success Criteria
- [ ] Detection within 5 minutes
- [ ] Incident channel created within 10 minutes
- [ ] Root cause identified within 20 minutes
- [ ] Service restored within 30 minutes
- [ ] Customer impact limited to < 1% of users

### Kill Switch
[Command to abort the experiment immediately]

### Blast Radius
[Which services, users, and regions are affected]
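The template can also be kept machine-checkable, so the post-game review starts from an objective pass/fail per criterion. A sketch with illustrative field names; the kill-switch string is a placeholder, not a real command:

```python
from dataclasses import dataclass

@dataclass
class GameDayScenario:
    name: str
    failure_type: str
    target_service: str
    kill_switch: str  # placeholder string; substitute your real abort command
    criteria: dict    # metric name -> limit (minutes, or a fraction for impact)

def evaluate(scenario: GameDayScenario, measured: dict) -> dict:
    """Pass/fail per success criterion: the measured value must not exceed the limit."""
    return {metric: measured[metric] <= limit for metric, limit in scenario.criteria.items()}

scenario = GameDayScenario(
    name="Checkout pod kill",
    failure_type="Pod kill",
    target_service="checkout (staging)",
    kill_switch="[abort command]",
    criteria={"detection_min": 5, "channel_min": 10, "root_cause_min": 20, "restore_min": 30},
)
results = evaluate(scenario, {"detection_min": 3, "channel_min": 8,
                              "root_cause_min": 25, "restore_min": 28})
# root_cause_min fails (25 > 20); the other three criteria pass
```

Starting the retrospective from explicit pass/fail results keeps the discussion grounded in the hypothesis rather than in impressions.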

Scenario Ideas by Maturity Level

Beginner (First 3 Game Days):

| Scenario | What It Tests | Difficulty |
| --- | --- | --- |
| Kill a non-critical service pod | Auto-restart, monitoring, alerting | Low |
| Simulate a dependency timeout | Circuit breaker behavior, fallback logic | Low |
| Revoke a database password | Secret rotation process, runbook accuracy | Medium |

Intermediate:

| Scenario | What It Tests | Difficulty |
| --- | --- | --- |
| Kill 50% of a critical service's pods | Auto-scaling, load balancing, SLO resilience | Medium |
| Inject 2-second latency on inter-service network | Timeout configuration, retry logic, cascading failures | Medium |
| Simulate a full region failure | Multi-region failover, DNS routing, data consistency | High |

Advanced:

| Scenario | What It Tests | Difficulty |
| --- | --- | --- |
| Corrupt a database table during peak traffic | Backup/restore process, data integrity checks | High |
| Simulate a supply chain attack (compromised dependency) | Incident response for security events, rollback speed | High |
| Combined failure: high traffic + partial outage + new deployment | Real-world incident complexity, team coordination | Very High |

Measuring Game Day Effectiveness

Track these metrics across game days to measure organizational improvement:

| Metric | First Game Day Target | Mature Target |
| --- | --- | --- |
| Mean Time to Detect (MTTD) | < 15 minutes | < 5 minutes |
| Mean Time to Acknowledge (MTTA) | < 10 minutes | < 2 minutes |
| Mean Time to Resolve (MTTR) | < 60 minutes | < 15 minutes |
| Runbook accuracy | 50% (many gaps) | 90%+ |
| Communication effectiveness | Ad hoc | Structured incident command |
| Action items completed (% within 2 weeks) | 30% | 80%+ |
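Tracking these numbers per game day makes the trend visible over time. A minimal sketch with made-up history, checking the latest exercise against the mature targets:

```python
# Hypothetical per-game-day metrics, oldest first, in minutes
history = [
    {"mttd": 14, "mtta": 9, "mttr": 55},
    {"mttd": 9,  "mtta": 6, "mttr": 40},
    {"mttd": 4,  "mtta": 2, "mttr": 18},
]
MATURE_TARGETS = {"mttd": 5, "mtta": 2, "mttr": 15}

def meets_targets(metrics: dict, targets: dict) -> dict:
    """Which targets does a single game day meet?"""
    return {m: metrics[m] <= limit for m, limit in targets.items()}

meets_targets(history[-1], MATURE_TARGETS)
# {'mttd': True, 'mtta': True, 'mttr': False} -- MTTR still above the 15-minute target
```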

Running a Game Day: Step-by-Step

T-60 minutes: Pre-brief

Gather participants in a shared call. Brief them:

  • "We are running a game day exercise starting in 60 minutes."
  • "This is a learning exercise, not a test. There are no wrong actions."
  • "We will inject a failure in the [staging/production] environment."
  • "Please respond as you would to a real incident."
  • For production exercises: "Customer impact is expected to be minimal. If impact exceeds [threshold], we will abort."

T-0: Inject the failure

Execute the chaos experiment. Start the clock.
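For a manual injection, it helps to script the command and record the injection timestamp in one step, since every response metric is measured from T-0. A sketch for a pod-kill scenario; the pod and namespace names are placeholders, and the default `dry_run=True` only builds the command:

```python
import subprocess
from datetime import datetime, timezone

def pod_kill_command(pod: str, namespace: str) -> list:
    """Manual injection via the standard `kubectl delete pod` invocation."""
    return ["kubectl", "delete", "pod", pod, "-n", namespace]

def inject(pod: str, namespace: str, dry_run: bool = True) -> str:
    """Run the injection and return the exact UTC injection time for the timeline."""
    if not dry_run:
        subprocess.run(pod_kill_command(pod, namespace), check=True)
    return datetime.now(timezone.utc).isoformat()

# Rehearse with dry_run=True; flip to False only during the agreed exercise window
injection_time = inject("checkout-7d9f-abc12", "staging")
```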

T+5 to T+30: Observe without intervening

The facilitator observes and documents but does not guide the response. The goal is to see how the team responds naturally. Only intervene if:

  • The blast radius exceeds the agreed boundary
  • The team is about to take a destructive action
  • The exercise has gone on significantly longer than planned

T+30 to T+60: Wrap up

  • Abort the chaos experiment if it has not resolved naturally
  • Verify the system has returned to steady state
  • Collect initial impressions from participants

T+24h to T+48h: Post-game review

Run a blameless retrospective:

  1. Timeline reconstruction -- What happened and when?
  2. What went well? -- Celebrate effective responses
  3. What surprised us? -- Unexpected system behavior or process gaps
  4. Action items -- Concrete improvements with owners and deadlines

Game Day Anti-Patterns

| Anti-Pattern | Why It Is Harmful | Do Instead |
| --- | --- | --- |
| "Gotcha" exercises | Surprises breed resentment, not learning | Always pre-announce game days |
| Blame-focused reviews | Discourages participation in future exercises | Run blameless retrospectives |
| No follow-through on actions | Team loses trust in the process | Track and report on action item completion |
| Only testing happy paths | "Kill one pod" every time teaches nothing new | Escalate scenario complexity over time |
| No observers | Participants cannot accurately self-assess during stress | Assign a dedicated observer/facilitator |
| Game days only in staging | Staging does not represent production | Graduate to production as the team matures |

Building a Game Day Program

Quarter 1: Foundation

  • Run 1 game day per month in staging
  • Focus on single-service failures
  • Establish the post-game review process
  • Build initial runbooks based on findings

Quarter 2: Expansion

  • Run 2 game days per month
  • Include multi-service scenarios
  • Run first production game day (low-traffic window)
  • Start tracking MTTD/MTTA/MTTR trends

Quarter 3: Maturity

  • Weekly game days (rotating services)
  • Production game days during normal hours
  • Include security incident scenarios
  • Cross-team exercises (backend + frontend + mobile)

Quarter 4: Continuous

  • Automated game days (scheduled chaos with human response evaluation)
  • Game days integrated into on-call rotation onboarding
  • Quarterly "all hands" game days with company-wide participation
  • Publish a game day report to the entire organization

Interview Talking Point: "I approach performance and resilience as two sides of the same coin. Performance testing tells you how fast the system runs under load; chaos engineering tells you what happens when that load arrives during a failure. I integrate both into CI -- k6 validates our SLOs in staging, and Litmus chaos experiments verify we stay within SLO even when pods are being killed. For LLM-backed features, I add dedicated metrics like time-to-first-token and tokens-per-second, because traditional latency percentiles do not capture the user experience of a streaming AI response. The goal is not to prove the system never fails -- it is to prove that when it fails, it fails gracefully within our error budget."