Game Days: Incident Response Testing
What Is a Game Day?
A game day is a scheduled chaos exercise where the team practices incident response procedures under controlled conditions. Unlike automated chaos experiments that validate system resilience, game days validate human processes: communication, decision-making, tooling familiarity, and runbook accuracy.
Think of it as a fire drill for your engineering organization. The building (system) may survive a fire automatically, but the people need to know the evacuation routes (runbooks), where the fire extinguishers are (tools), and who leads the response (incident commander).
Why QA Architects Should Lead Game Days
QA architects are uniquely positioned to design and facilitate game days because they:
- Know the system's weak points from performance and chaos testing
- Understand the user impact of different failure scenarios
- Can design realistic scenarios based on production incident history
- Sit between development and operations, bridging communication gaps
- Have experience designing test cases, and a game day scenario is exactly that: a test case for people and processes
Game Day Checklist
Pre-Game (1-2 Weeks Before)
Scenario Design -- Define the failure scenario and success criteria
- What failure will you simulate?
- What is the expected system behavior?
- What is the expected human response?
- How will you measure success?
Participant Briefing -- Notify all participants
- On-call engineers for affected services
- Incident commander (rotating role)
- Communications lead
- Optional observers (management, new team members)
Safety Measures -- Define boundaries
- What blast radius is acceptable?
- What is the kill switch procedure?
- What hours will the exercise run?
- Is customer impact acceptable? If not, use staging.
Tool Readiness -- Verify all incident response tools work
- PagerDuty / Opsgenie configured and routing correctly
- Slack/Teams incident channels ready
- Monitoring dashboards accessible
- Runbooks up to date and accessible
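A tool-readiness check can be scripted so it runs the same way before every game day. The sketch below is a minimal harness, not a real integration: the probe callables are stubs, and real ones would hit the PagerDuty API, post a test message to the incident channel, or fetch the runbook URL.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCheck:
    name: str
    probe: Callable[[], bool]  # returns True when the tool is ready

def verify_readiness(checks: list[ToolCheck]) -> dict[str, bool]:
    """Run every probe and report pass/fail per tool."""
    results = {}
    for check in checks:
        try:
            results[check.name] = check.probe()
        except Exception:
            results[check.name] = False  # an unreachable tool counts as not ready
    return results

# Stubbed probes for illustration; swap in real API calls per tool.
checks = [
    ToolCheck("paging", lambda: True),
    ToolCheck("dashboards", lambda: True),
    ToolCheck("runbooks", lambda: False),  # e.g. a runbook link returns 404
]
results = verify_readiness(checks)
go_no_go = all(results.values())
print(results)
print("GO" if go_no_go else "NO-GO")
```

Running this as a pre-game gate makes "tool readiness" a binary GO/NO-GO decision rather than a vague assumption.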
During the Game
Inject the Failure -- Execute the chaos experiment
- Use Litmus, Gremlin, or manual intervention
- Record the exact time of injection
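For manual injection, it helps to wrap the command so the injection timestamp is recorded automatically. A minimal sketch, defaulting to dry-run; the namespace and pod name are placeholders, and the `kubectl delete pod` call stands in for whatever injection method the scenario specifies.

```python
import subprocess
from datetime import datetime, timezone

def build_pod_kill_command(namespace: str, pod: str) -> list[str]:
    # Manual-intervention equivalent of a pod-kill chaos experiment.
    return ["kubectl", "delete", "pod", pod, "-n", namespace, "--wait=false"]

def inject(namespace: str, pod: str, dry_run: bool = True) -> str:
    cmd = build_pod_kill_command(namespace, pod)
    injected_at = datetime.now(timezone.utc).isoformat()  # exact time of injection
    if not dry_run:
        subprocess.run(cmd, check=True)
    print(f"[{injected_at}] injected: {' '.join(cmd)}")
    return injected_at

inject("staging", "checkout-7d9f8b-abcde")  # dry run: logs the command, does not execute it
```

The recorded timestamp becomes T-0 for every response metric measured afterward.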
Observe the Response -- Track metrics without intervening
- Detection Time: How long until someone notices?
- Acknowledgment Time: How long until the on-call responds?
- Communication Time: How long until the team is assembled?
- Diagnosis Time: How long until root cause is identified?
- Recovery Time: How long until service is restored?
- Customer Impact: How many users were affected, for how long?
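The response metrics above fall out of a timestamped event log. A sketch with a hypothetical timeline (the event names and times are illustrative); diagnosis and recovery are measured from injection, the others from the preceding milestone.

```python
from datetime import datetime

# Hypothetical game-day timeline: milestone -> wall-clock time (UTC).
timeline = {
    "injection":    datetime(2024, 5, 7, 14, 0),
    "detection":    datetime(2024, 5, 7, 14, 4),   # first alert fires
    "acknowledged": datetime(2024, 5, 7, 14, 6),   # on-call acks the page
    "assembled":    datetime(2024, 5, 7, 14, 9),   # incident channel staffed
    "diagnosed":    datetime(2024, 5, 7, 14, 18),  # root cause identified
    "recovered":    datetime(2024, 5, 7, 14, 27),  # service restored
}

def minutes_between(start: str, end: str) -> float:
    return (timeline[end] - timeline[start]).total_seconds() / 60

metrics = {
    "detection_time_min":     minutes_between("injection", "detection"),
    "ack_time_min":           minutes_between("detection", "acknowledged"),
    "communication_time_min": minutes_between("acknowledged", "assembled"),
    "diagnosis_time_min":     minutes_between("injection", "diagnosed"),
    "recovery_time_min":      minutes_between("injection", "recovered"),
}
print(metrics)
```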
Document Everything -- A dedicated observer takes notes
- What happened and when (timeline)
- What went well
- What went poorly
- Gaps in tooling, monitoring, or runbooks
Post-Game (Within 48 Hours)
Post-Game Review -- Blameless retrospective
- Review the timeline with all participants
- Identify action items (runbook updates, monitoring gaps, tool improvements)
- Assign owners and deadlines for each action item
- Share findings with the broader organization
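Action items with owners and deadlines are easy to track in a structured form. A minimal sketch, using invented items and owners for illustration; the point is that "overdue" becomes a query, not a memory exercise.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has passed."""
    return [i for i in items if not i.done and i.deadline < today]

# Illustrative items from a hypothetical post-game review.
items = [
    ActionItem("Update checkout runbook with new dashboard links", "alice",
               date(2024, 5, 20), done=True),
    ActionItem("Add alert on queue depth", "bob", date(2024, 5, 15)),
]
late = overdue(items, today=date(2024, 5, 21))
print([i.owner for i in late])
```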
Designing Game Day Scenarios
Scenario Template
## Game Day Scenario: [Name]
### Failure Type
[Pod kill / Network partition / Dependency failure / etc.]
### Target Service
[Service name and environment]
### Hypothesis
"When [failure], the system should [expected behavior], and the team should
[expected response] within [time limit]."
### Injection Method
[Litmus experiment / Gremlin attack / Manual kubectl / etc.]
### Success Criteria
- [ ] Detection within 5 minutes
- [ ] Incident channel created within 10 minutes
- [ ] Root cause identified within 20 minutes
- [ ] Service restored within 30 minutes
- [ ] Customer impact limited to < 1% of users
### Kill Switch
[Command to abort the experiment immediately]
### Blast Radius
[Which services, users, and regions are affected]
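The template's success criteria can be evaluated mechanically once the game day's measurements are in. A sketch with hypothetical measured values; every criterion is a "measured value at or below limit" check.

```python
# Success criteria from the template (minutes, except impact in % of users).
criteria = {
    "detection_min": 5,
    "incident_channel_min": 10,
    "root_cause_min": 20,
    "restore_min": 30,
    "customer_impact_pct": 1.0,
}

# Hypothetical measured results from one game day.
measured = {
    "detection_min": 4,
    "incident_channel_min": 8,
    "root_cause_min": 25,   # missed: took longer than 20 minutes
    "restore_min": 28,
    "customer_impact_pct": 0.3,
}

results = {name: measured[name] <= limit for name, limit in criteria.items()}
print(results)
print("PASS" if all(results.values()) else "FAIL")
```

A failed criterion is not a failed game day; it is a concrete input to the post-game review.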
Scenario Ideas by Maturity Level
Beginner (First 3 Game Days):
| Scenario | What It Tests | Difficulty |
|---|---|---|
| Kill a non-critical service pod | Auto-restart, monitoring, alerting | Low |
| Simulate a dependency timeout | Circuit breaker behavior, fallback logic | Low |
| Revoke a database password | Secret rotation process, runbook accuracy | Medium |
Intermediate:
| Scenario | What It Tests | Difficulty |
|---|---|---|
| Kill 50% of a critical service's pods | Auto-scaling, load balancing, SLO resilience | Medium |
| Inject 2-second latency on inter-service network | Timeout configuration, retry logic, cascading failures | Medium |
| Simulate a full region failure | Multi-region failover, DNS routing, data consistency | High |
Advanced:
| Scenario | What It Tests | Difficulty |
|---|---|---|
| Corrupt a database table during peak traffic | Backup/restore process, data integrity checks | High |
| Simulate a supply chain attack (compromised dependency) | Incident response for security events, rollback speed | High |
| Combined failure: high traffic + partial outage + new deployment | Real-world incident complexity, team coordination | Very High |
Measuring Game Day Effectiveness
Track these metrics across game days to measure organizational improvement:
| Metric | First Game Day Target | Mature Target |
|---|---|---|
| Mean Time to Detect (MTTD) | < 15 minutes | < 5 minutes |
| Mean Time to Acknowledge (MTTA) | < 10 minutes | < 2 minutes |
| Mean Time to Resolve (MTTR) | < 60 minutes | < 15 minutes |
| Runbook accuracy | 50% (many gaps) | 90%+ |
| Communication effectiveness | Ad hoc | Structured incident command |
| Action items completed (% within 2 weeks) | 30% | 80%+ |
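Trends across game days matter more than any single result. A sketch comparing a (hypothetical) history of MTTD/MTTA/MTTR values against the mature targets from the table above:

```python
# MTTD/MTTA/MTTR in minutes per game day, oldest first -- illustrative numbers.
history = [
    {"mttd": 14, "mtta": 9, "mttr": 55},  # game day 1
    {"mttd": 9,  "mtta": 6, "mttr": 40},  # game day 2
    {"mttd": 6,  "mtta": 3, "mttr": 22},  # game day 3
]
mature_targets = {"mttd": 5, "mtta": 2, "mttr": 15}

report = {}
for metric, target in mature_targets.items():
    values = [gd[metric] for gd in history]
    report[metric] = {
        "latest": values[-1],
        "improving": values[-1] < values[0],   # better than the first game day?
        "at_target": values[-1] <= target,     # at the mature target yet?
    }
print(report)
```

In this example every metric is improving but none has reached the mature target, which is a healthy mid-program result.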
Running a Game Day: Step-by-Step
T-60 minutes: Pre-brief
Gather participants in a shared call. Brief them:
- "We are running a game day exercise starting in 60 minutes."
- "This is a learning exercise, not a test. There are no wrong actions."
- "We will inject a failure in the [staging/production] environment."
- "Please respond as you would to a real incident."
- For production exercises: "Customer impact is expected to be minimal. If impact exceeds [threshold], we will abort."
T-0: Inject the failure
Execute the chaos experiment. Start the clock.
T+5 to T+30: Observe without intervening
The facilitator observes and documents but does not guide the response. The goal is to see how the team responds naturally. Only intervene if:
- The blast radius exceeds the agreed boundary
- The team is about to take a destructive action
- The exercise has gone on significantly longer than planned
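The intervention rules can be codified so the facilitator's abort decision is pre-agreed rather than improvised. A minimal sketch; the thresholds and the idea of a polled error-rate feed are assumptions for illustration.

```python
# Facilitator guard: encode the three agreed abort conditions.
MAX_ERROR_RATE = 0.05    # abort if more than 5% of requests fail (blast radius)
MAX_DURATION_MIN = 45    # abort if the exercise overruns its planned window

def should_abort(error_rate: float, elapsed_min: float,
                 destructive_action_pending: bool) -> bool:
    if error_rate > MAX_ERROR_RATE:
        return True   # blast radius exceeds the agreed boundary
    if destructive_action_pending:
        return True   # team is about to take a destructive action
    if elapsed_min > MAX_DURATION_MIN:
        return True   # exercise has gone on significantly longer than planned
    return False

print(should_abort(error_rate=0.02, elapsed_min=20, destructive_action_pending=False))
```

When `should_abort` returns True, the facilitator executes the scenario's kill switch.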
T+30 to T+60: Wrap up
- Abort the chaos experiment if it has not resolved naturally
- Verify the system has returned to steady state
- Collect initial impressions from participants
T+24h to T+48h: Post-game review
Run a blameless retrospective:
- Timeline reconstruction -- What happened and when?
- What went well? -- Celebrate effective responses
- What surprised us? -- Unexpected system behavior or process gaps
- Action items -- Concrete improvements with owners and deadlines
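Timeline reconstruction usually means merging several sources (observer notes, alert logs, chat history) into one ordered list. A sketch with invented entries; tuples of `(timestamp, event)` sort chronologically for free.

```python
from datetime import datetime

# Illustrative entries from two sources collected during the game day.
observer_notes = [
    (datetime(2024, 5, 7, 14, 6), "on-call acked the page"),
    (datetime(2024, 5, 7, 14, 12), "wrong runbook opened first"),
]
alert_log = [
    (datetime(2024, 5, 7, 14, 4), "ALERT: checkout 5xx rate > 2%"),
    (datetime(2024, 5, 7, 14, 27), "RESOLVED: checkout 5xx rate"),
]

# Tuples sort by their first element, so this orders events by timestamp.
timeline = sorted(observer_notes + alert_log)
for ts, event in timeline:
    print(ts.strftime("%H:%M"), event)
```

The merged view often surfaces gaps (here, eight minutes lost to the wrong runbook) that no single source shows on its own.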
Game Day Anti-Patterns
| Anti-Pattern | Why It Is Harmful | Instead Do |
|---|---|---|
| "Gotcha" exercises | Surprises breed resentment, not learning | Always pre-announce game days |
| Blame-focused reviews | Discourages participation in future exercises | Run blameless retrospectives |
| No follow-through on actions | Team loses trust in the process | Track and report on action item completion |
| Only testing happy paths | "Kill one pod" every time teaches nothing new | Escalate scenario complexity over time |
| No observers | Participants cannot accurately self-assess during stress | Assign a dedicated observer/facilitator |
| Game days only in staging | Staging does not represent production | Graduate to production as the team matures |
Building a Game Day Program
Quarter 1: Foundation
- Run 1 game day per month in staging
- Focus on single-service failures
- Establish the post-game review process
- Build initial runbooks based on findings
Quarter 2: Expansion
- Run 2 game days per month
- Include multi-service scenarios
- Run first production game day (low-traffic window)
- Start tracking MTTD/MTTA/MTTR trends
Quarter 3: Maturity
- Weekly game days (rotating services)
- Production game days during normal hours
- Include security incident scenarios
- Cross-team exercises (backend + frontend + mobile)
Quarter 4: Continuous
- Automated game days (scheduled chaos with human response evaluation)
- Game days integrated into on-call rotation onboarding
- Quarterly "all hands" game days with company-wide participation
- Publish a game day report to the entire organization
Interview Talking Point: "I approach performance and resilience as two sides of the same coin. Performance testing tells you how fast the system runs under load; chaos engineering tells you what happens when that load arrives during a failure. I integrate both into CI -- k6 validates our SLOs in staging, and Litmus chaos experiments verify we stay within SLO even when pods are being killed. For LLM-backed features, I add dedicated metrics like time-to-first-token and tokens-per-second, because traditional latency percentiles do not capture the user experience of a streaming AI response. The goal is not to prove the system never fails -- it is to prove that when it fails, it fails gracefully within our error budget."