Game Days: Incident Response Testing
What Is a Game Day?
A game day is a scheduled chaos exercise where the team practices incident response procedures under controlled conditions. Unlike automated chaos experiments that validate system resilience, game days validate human processes: communication, decision-making, tooling familiarity, and runbook accuracy.
Think of it as a fire drill for your engineering organization. The building (system) may survive a fire automatically, but the people need to know the evacuation routes (runbooks), where the fire extinguishers are (tools), and who leads the response (incident commander).
Why QA Architects Should Lead Game Days
QA architects are uniquely positioned to design and facilitate game days because they:
- Know the system's weak points from performance and chaos testing
- Understand the user impact of different failure scenarios
- Can design realistic scenarios based on production incident history
- Sit between development and operations, bridging communication gaps
- Have experience designing test cases, and a game day scenario is exactly that: a test case for people and processes
Game Day Checklist
Pre-Game (1-2 Weeks Before)
Scenario Design -- Define the failure scenario and success criteria
- What failure will you simulate?
- What is the expected system behavior?
- What is the expected human response?
- How will you measure success?
Participant Briefing -- Notify all participants
- On-call engineers for affected services
- Incident commander (rotating role)
- Communications lead
- Optional observers (management, new team members)
Safety Measures -- Define boundaries
- What blast radius is acceptable?
- What is the kill switch procedure?
- What hours will the exercise run?
- Is customer impact acceptable? If not, use staging.
Tool Readiness -- Verify all incident response tools work
- PagerDuty / Opsgenie configured and routing correctly
- Slack/Teams incident channels ready
- Monitoring dashboards accessible
- Runbooks up to date and accessible
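A tool-readiness check can be scripted so it runs the same way before every game day. The sketch below is a minimal harness, not a real integration: the probe callables are stubs, and real ones would hit the PagerDuty API, post a test message to the incident channel, or fetch the runbook URL.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCheck:
    name: str
    probe: Callable[[], bool]  # returns True when the tool is ready

def verify_readiness(checks: list[ToolCheck]) -> dict[str, bool]:
    """Run every probe and report pass/fail per tool."""
    results = {}
    for check in checks:
        try:
            results[check.name] = check.probe()
        except Exception:
            results[check.name] = False  # an unreachable tool counts as not ready
    return results

# Stubbed probes for illustration; swap in real API calls per tool.
checks = [
    ToolCheck("paging", lambda: True),
    ToolCheck("dashboards", lambda: True),
    ToolCheck("runbooks", lambda: False),  # e.g. a runbook link returns 404
]
results = verify_readiness(checks)
go_no_go = all(results.values())
print(results)
print("GO" if go_no_go else "NO-GO")
```

Running this as a pre-game gate makes "tool readiness" a binary GO/NO-GO decision rather than a vague assumption.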
During the Game
Inject the Failure -- Execute the chaos experiment
- Use Litmus, Gremlin, or manual intervention
- Record the exact time of injection
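For manual injection, it helps to wrap the command so the injection timestamp is recorded automatically. A minimal sketch, defaulting to dry-run; the namespace and pod name are placeholders, and the `kubectl delete pod` call stands in for whatever injection method the scenario specifies.

```python
import subprocess
from datetime import datetime, timezone

def build_pod_kill_command(namespace: str, pod: str) -> list[str]:
    # Manual-intervention equivalent of a pod-kill chaos experiment.
    return ["kubectl", "delete", "pod", pod, "-n", namespace, "--wait=false"]

def inject(namespace: str, pod: str, dry_run: bool = True) -> str:
    cmd = build_pod_kill_command(namespace, pod)
    injected_at = datetime.now(timezone.utc).isoformat()  # exact time of injection
    if not dry_run:
        subprocess.run(cmd, check=True)
    print(f"[{injected_at}] injected: {' '.join(cmd)}")
    return injected_at

inject("staging", "checkout-7d9f8b-abcde")  # dry run: logs the command, does not execute it
```

The recorded timestamp becomes T-0 for every response metric measured afterward.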
Observe the Response -- Track metrics without intervening
- Detection Time: How long until someone notices?
- Acknowledgment Time: How long until the on-call responds?
- Communication Time: How long until the team is assembled?
- Diagnosis Time: How long until root cause is identified?
- Recovery Time: How long until service is restored?
- Customer Impact: How many users were affected, for how long?
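The response metrics above fall out of a timestamped event log. A sketch with a hypothetical timeline (the event names and times are illustrative); diagnosis and recovery are measured from injection, the others from the preceding milestone.

```python
from datetime import datetime

# Hypothetical game-day timeline: milestone -> wall-clock time (UTC).
timeline = {
    "injection":    datetime(2024, 5, 7, 14, 0),
    "detection":    datetime(2024, 5, 7, 14, 4),   # first alert fires
    "acknowledged": datetime(2024, 5, 7, 14, 6),   # on-call acks the page
    "assembled":    datetime(2024, 5, 7, 14, 9),   # incident channel staffed
    "diagnosed":    datetime(2024, 5, 7, 14, 18),  # root cause identified
    "recovered":    datetime(2024, 5, 7, 14, 27),  # service restored
}

def minutes_between(start: str, end: str) -> float:
    return (timeline[end] - timeline[start]).total_seconds() / 60

metrics = {
    "detection_time_min":     minutes_between("injection", "detection"),
    "ack_time_min":           minutes_between("detection", "acknowledged"),
    "communication_time_min": minutes_between("acknowledged", "assembled"),
    "diagnosis_time_min":     minutes_between("injection", "diagnosed"),
    "recovery_time_min":      minutes_between("injection", "recovered"),
}
print(metrics)
```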
Document Everything -- A dedicated observer takes notes
- What happened and when (timeline)
- What went well
- What went poorly
- Gaps in tooling, monitoring, or runbooks
Post-Game (Within 48 Hours)
Post-Game Review -- Blameless retrospective
- Review the timeline with all participants
- Identify action items (runbook updates, monitoring gaps, tool improvements)
- Assign owners and deadlines for each action item
- Share findings with the broader organization
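Action items with owners and deadlines are easy to track in a structured form. A minimal sketch, using invented items and owners for illustration; the point is that "overdue" becomes a query, not a memory exercise.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has passed."""
    return [i for i in items if not i.done and i.deadline < today]

# Illustrative items from a hypothetical post-game review.
items = [
    ActionItem("Update checkout runbook with new dashboard links", "alice",
               date(2024, 5, 20), done=True),
    ActionItem("Add alert on queue depth", "bob", date(2024, 5, 15)),
]
late = overdue(items, today=date(2024, 5, 21))
print([i.owner for i in late])
```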
Designing Game Day Scenarios
Scenario Template
## Game Day Scenario: [Name]
### Failure Type
[Pod kill / Network partition / Dependency failure / etc.]
### Target Service
[Service name and environment]
### Hypothesis
"When [failure], the system should [expected behavior], and the team should
[expected response] within [time limit]."
### Injection Method
[Litmus experiment / Gremlin attack / Manual kubectl / etc.]
### Success Criteria
- [ ] Detection within 5 minutes
- [ ] Incident channel created within 10 minutes
- [ ] Root cause identified within 20 minutes
- [ ] Service restored within 30 minutes
- [ ] Customer impact limited to < 1% of users
### Kill Switch
[Command to abort the experiment immediately]
### Blast Radius
[Which services, users, and regions are affected]
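The template's success criteria can be evaluated mechanically once the game day's measurements are in. A sketch with hypothetical measured values; every criterion is a "measured value at or below limit" check.

```python
# Success criteria from the template (minutes, except impact in % of users).
criteria = {
    "detection_min": 5,
    "incident_channel_min": 10,
    "root_cause_min": 20,
    "restore_min": 30,
    "customer_impact_pct": 1.0,
}

# Hypothetical measured results from one game day.
measured = {
    "detection_min": 4,
    "incident_channel_min": 8,
    "root_cause_min": 25,   # missed: took longer than 20 minutes
    "restore_min": 28,
    "customer_impact_pct": 0.3,
}

results = {name: measured[name] <= limit for name, limit in criteria.items()}
print(results)
print("PASS" if all(results.values()) else "FAIL")
```

A failed criterion is not a failed game day; it is a concrete input to the post-game review.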
Scenario Ideas by Maturity Level
Beginner (First 3 Game Days):
| Scenario | What It Tests | Difficulty |
|---|---|---|
| Kill a non-critical service pod | Auto-restart, monitoring, alerting | Low |
| Simulate a dependency timeout | Circuit breaker behavior, fallback logic | Low |
| Revoke a database password | Secret rotation process, runbook accuracy | Medium |
Intermediate:
| Scenario | What It Tests | Difficulty |
|---|---|---|
| Kill 50% of a critical service's pods | Auto-scaling, load balancing, SLO resilience | Medium |
| Inject 2-second latency on inter-service network | Timeout configuration, retry logic, cascading failures | Medium |
| Simulate a full region failure | Multi-region failover, DNS routing, data consistency | High |
Advanced:
| Scenario | What It Tests | Difficulty |
|---|---|---|
| Corrupt a database table during peak traffic | Backup/restore process, data integrity checks | High |
| Simulate a supply chain attack (compromised dependency) | Incident response for security events, rollback speed | High |
| Combined failure: high traffic + partial outage + new deployment | Real-world incident complexity, team coordination | Very High |
Measuring Game Day Effectiveness
Track these metrics across game days to measure organizational improvement:
| Metric | First Game Day Target | Mature Target |
|---|---|---|
| Mean Time to Detect (MTTD) | < 15 minutes | < 5 minutes |
| Mean Time to Acknowledge (MTTA) | < 10 minutes | < 2 minutes |
| Mean Time to Resolve (MTTR) | < 60 minutes | < 15 minutes |
| Runbook accuracy | 50% (many gaps) | 90%+ |
| Communication effectiveness | Ad hoc | Structured incident command |
| Action items completed (% within 2 weeks) | 30% | 80%+ |
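Trends across game days matter more than any single result. A sketch comparing a (hypothetical) history of MTTD/MTTA/MTTR values against the mature targets from the table above:

```python
# MTTD/MTTA/MTTR in minutes per game day, oldest first -- illustrative numbers.
history = [
    {"mttd": 14, "mtta": 9, "mttr": 55},  # game day 1
    {"mttd": 9,  "mtta": 6, "mttr": 40},  # game day 2
    {"mttd": 6,  "mtta": 3, "mttr": 22},  # game day 3
]
mature_targets = {"mttd": 5, "mtta": 2, "mttr": 15}

report = {}
for metric, target in mature_targets.items():
    values = [gd[metric] for gd in history]
    report[metric] = {
        "latest": values[-1],
        "improving": values[-1] < values[0],   # better than the first game day?
        "at_target": values[-1] <= target,     # at the mature target yet?
    }
print(report)
```

In this example every metric is improving but none has reached the mature target, which is a healthy mid-program result.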
Running a Game Day: Step-by-Step
T-60 minutes: Pre-brief
Gather participants in a shared call. Brief them:
- "We are running a game day exercise starting in 60 minutes."
- "This is a learning exercise, not a test. There are no wrong actions."
- "We will inject a failure in the [staging/production] environment."
- "Please respond as you would to a real incident."
- For production exercises: "Customer impact is expected to be minimal. If impact exceeds [threshold], we will abort."
T-0: Inject the failure
Execute the chaos experiment. Start the clock.
T+5 to T+30: Observe without intervening
The facilitator observes and documents but does not guide the response. The goal is to see how the team responds naturally. Only intervene if:
- The blast radius exceeds the agreed boundary
- The team is about to take a destructive action
- The exercise has gone on significantly longer than planned
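The intervention rules can be codified so the facilitator's abort decision is pre-agreed rather than improvised. A minimal sketch; the thresholds and the idea of a polled error-rate feed are assumptions for illustration.

```python
# Facilitator guard: encode the three agreed abort conditions.
MAX_ERROR_RATE = 0.05    # abort if more than 5% of requests fail (blast radius)
MAX_DURATION_MIN = 45    # abort if the exercise overruns its planned window

def should_abort(error_rate: float, elapsed_min: float,
                 destructive_action_pending: bool) -> bool:
    if error_rate > MAX_ERROR_RATE:
        return True   # blast radius exceeds the agreed boundary
    if destructive_action_pending:
        return True   # team is about to take a destructive action
    if elapsed_min > MAX_DURATION_MIN:
        return True   # exercise has gone on significantly longer than planned
    return False

print(should_abort(error_rate=0.02, elapsed_min=20, destructive_action_pending=False))
```

When `should_abort` returns True, the facilitator executes the scenario's kill switch.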
T+30 to T+60: Wrap up
- Abort the chaos experiment if it has not resolved naturally
- Verify the system has returned to steady state
- Collect initial impressions from participants
T+24h to T+48h: Post-game review
Run a blameless retrospective:
- Timeline reconstruction -- What happened and when?
- What went well? -- Celebrate effective responses
- What surprised us? -- Unexpected system behavior or process gaps
- Action items -- Concrete improvements with owners and deadlines
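Timeline reconstruction usually means merging several sources (observer notes, alert logs, chat history) into one ordered list. A sketch with invented entries; tuples of `(timestamp, event)` sort chronologically for free.

```python
from datetime import datetime

# Illustrative entries from two sources collected during the game day.
observer_notes = [
    (datetime(2024, 5, 7, 14, 6), "on-call acked the page"),
    (datetime(2024, 5, 7, 14, 12), "wrong runbook opened first"),
]
alert_log = [
    (datetime(2024, 5, 7, 14, 4), "ALERT: checkout 5xx rate > 2%"),
    (datetime(2024, 5, 7, 14, 27), "RESOLVED: checkout 5xx rate"),
]

# Tuples sort by their first element, so this orders events by timestamp.
timeline = sorted(observer_notes + alert_log)
for ts, event in timeline:
    print(ts.strftime("%H:%M"), event)
```

The merged view often surfaces gaps (here, eight minutes lost to the wrong runbook) that no single source shows on its own.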
Game Day Anti-Patterns
| Anti-Pattern | Why It Is Harmful | Instead Do |
|---|---|---|
| "Gotcha" exercises | Surprises breed resentment, not learning | Always pre-announce game days |
| Blame-focused reviews | Discourages participation in future exercises | Run blameless retrospectives |
| No follow-through on actions | Team loses trust in the process | Track and report on action item completion |
| Only testing happy paths | "Kill one pod" every time teaches nothing new | Escalate scenario complexity over time |
| No observers | Participants cannot accurately self-assess during stress | Assign a dedicated observer/facilitator |
| Game days only in staging | Staging does not represent production | Graduate to production as the team matures |
Building a Game Day Program
Quarter 1: Foundation
- Run 1 game day per month in staging
- Focus on single-service failures
- Establish the post-game review process
- Build initial runbooks based on findings
Quarter 2: Expansion
- Run 2 game days per month
- Include multi-service scenarios
- Run first production game day (low-traffic window)
- Start tracking MTTD/MTTA/MTTR trends
Quarter 3: Maturity
- Weekly game days (rotating services)
- Production game days during normal hours
- Include security incident scenarios
- Cross-team exercises (backend + frontend + mobile)
Quarter 4: Continuous
- Automated game days (scheduled chaos with human response evaluation)
- Game days integrated into on-call rotation onboarding
- Quarterly "all hands" game days with company-wide participation
- Publish a game day report to the entire organization
Interview Talking Point: "I approach performance and resilience as two sides of the same coin. Performance testing tells you how fast the system runs under load; chaos engineering tells you what happens when that load arrives during a failure. I integrate both into CI -- k6 validates our SLOs in staging, and Litmus chaos experiments verify we stay within SLO even when pods are being killed. For LLM-backed features, I add dedicated metrics like time-to-first-token and tokens-per-second, because traditional latency percentiles do not capture the user experience of a streaming AI response. The goal is not to prove the system never fails -- it is to prove that when it fails, it fails gracefully within our error budget."