QA Engineer Skills 2026

Incident Communication

Saying the Right Things to the Right People at the Right Time

During an active incident, communication is as important as the technical fix. Poor communication turns a 30-minute outage into a trust crisis. Stakeholders panic when they do not know what is happening. Customers flood support channels when the status page is silent. Engineers duplicate work when there is no central coordination. The QA engineer's role during incidents is unique: you can assess scope, verify fixes, and communicate quality status in ways that no other role can.


Communicating During an Active Incident

Severity Levels

Before communicating about an incident, everyone needs to agree on its severity. This determines who is notified, how quickly, and through which channels.

| Severity | Definition | Response Time | Communication Cadence | Notification Scope |
|---|---|---|---|---|
| Sev-1 (Critical) | Complete service outage, data loss, or security breach | Immediate (< 5 min) | Every 15 minutes | All stakeholders, customers, executives |
| Sev-2 (Major) | Significant degradation, core feature broken, large user impact | < 15 minutes | Every 30 minutes | Engineering, product, support, affected customers |
| Sev-3 (Minor) | Partial degradation, non-core feature broken, limited user impact | < 1 hour | Every 1-2 hours | Engineering, product |
| Sev-4 (Low) | Cosmetic issue, minor inconvenience, workaround available | Next business day | As needed | Engineering team |

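The matrix above can also be encoded so that tooling (a pager bot or status dashboard, for example) computes the next update deadline automatically. A minimal Python sketch; the class and field names are illustrative assumptions, as is the choice of 90 minutes for Sev-3's "every 1-2 hours":

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    response_minutes: int   # maximum time to the first update
    cadence_minutes: int    # interval between subsequent updates
    scope: tuple            # who gets notified

# Hypothetical encoding of the severity matrix above.
SEVERITY_MATRIX = {
    "sev1": SeverityPolicy("Complete outage, data loss, or breach", 5, 15,
                           ("stakeholders", "customers", "executives")),
    "sev2": SeverityPolicy("Core feature broken, large user impact", 15, 30,
                           ("engineering", "product", "support", "affected customers")),
    "sev3": SeverityPolicy("Non-core feature broken, limited impact", 60, 90,
                           ("engineering", "product")),
}

def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the next status update must go out for this severity."""
    policy = SEVERITY_MATRIX[severity]
    return last_update + timedelta(minutes=policy.cadence_minutes)
```

A bot built on this lookup could post a reminder in the war room whenever a deadline approaches, so the cadence never depends on someone remembering it under pressure.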
Status Update Template (During Active Incident)

## Incident Update -- [Timestamp UTC]

**Status:** Investigating / Identified / Monitoring / Resolved
**Severity:** Sev-[1/2/3]
**Incident Commander:** [Name]

### Current Situation
One to two sentences: what is happening right now.

### Impact
Who is affected and how. Be specific.

### Actions Being Taken
What the team is doing right now to resolve the issue.

### Next Update
When the next update will be provided.

### What Teams Can Do
Any actions other teams should take (e.g., "Do not deploy",
"Redirect customer inquiries to [link]").

Example:

## Incident Update -- 2026-02-09 14:30 UTC

**Status:** Identified
**Severity:** Sev-1
**Incident Commander:** Alice Chen

### Current Situation
Checkout is returning 500 errors for approximately 40% of users.
The root cause has been identified as a database connection pool
exhaustion caused by a slow query.

### Impact
Users cannot complete purchases. Approximately 3,200 users
affected. Revenue impact estimated at $500/minute.

### Actions Being Taken
- The slow query has been killed manually
- Connection pool is recovering (currently at 60% capacity)
- A database index fix is being deployed
- QA is verifying checkout functionality in staging

### Next Update
14:45 UTC or sooner if status changes.

### What Teams Can Do
- Support: Direct customers to retry in 15 minutes
- Marketing: Pause any active promotions
- Engineering: Do not deploy any changes until all-clear

Communication Channels During Incidents

| Channel | Purpose | Who Uses It |
|---|---|---|
| War room (Slack channel or video call) | Real-time coordination among responders | Engineering, QA, DevOps |
| Status page | Customer-facing updates | Incident commander |
| Email to stakeholders | Formal notification to executives and partners | Incident commander or engineering manager |
| Support team channel | Updates for customer support staff | Incident commander or designated liaison |
| Social media | Public acknowledgment of widespread issues | Communications team |

Communication Rules

  1. Update early and often. A status update that says "we are investigating" is better than silence.
  2. Underpromise and overdeliver. Say "we expect resolution within 2 hours" if you think it will take 1 hour.
  3. Be honest about what you do not know. "We have not yet identified the root cause" is better than speculation.
  4. Use consistent terminology. "Investigating, Identified, Monitoring, Resolved" -- everyone knows where you are.
  5. Separate facts from guesses. "We have confirmed that..." vs. "We believe that..."

Writing Incident Reports

The incident report is the permanent record of what happened. It is written after the incident is resolved and serves as the input for the root cause analysis.

Incident Report Template

# Incident Report: [Title]
Date: [Date]
Duration: [Start time] -- [End time] ([total duration])
Severity: [Sev-1/2/3]
Author: [Name]

## Summary
A 2-3 sentence summary of what happened, who was impacted,
and how it was resolved.

## Impact
- **Users affected:** [number and description]
- **Duration:** [time]
- **Revenue impact:** [if applicable]
- **Data impact:** [if applicable]
- **SLA impact:** [if applicable]
- **Support tickets generated:** [number]

## Timeline

| Time (UTC) | Event | Source |
|---|---|---|
| 14:00 | Alert triggered: DB connection pool at 90% | PagerDuty |
| 14:05 | On-call engineer acknowledged | PagerDuty |
| 14:12 | Checkout failures reported by users | Zendesk |
| 14:15 | Incident declared as Sev-1 | Slack #incidents |
| 14:15 | War room opened | Slack #inc-20260209 |
| 14:25 | Root cause identified: slow query | DB monitoring |
| 14:30 | Slow query killed, pool recovering | Manual action |
| 14:35 | Checkout functionality restored | QA verification |
| 14:45 | Database index deployed | Deploy pipeline |
| 15:00 | All-clear declared | Slack #incidents |

## Resolution
What was done to resolve the incident. Both the immediate fix
and any temporary measures.

## Root Cause
Brief description (full analysis in separate RCA document).

## What Went Well
- Alert triggered within minutes of the issue starting
- Root cause identified quickly through database monitoring
- Cross-team coordination was effective
- QA verified the fix before declaring all-clear

## What Could Be Improved
- A migration failure two weeks earlier went undetected
- Staging data does not match production volumes
- No circuit breaker on database connection pool
- Status page was updated 10 minutes late

## Action Items
Link to the RCA corrective actions.

## Related Documents
- [Root Cause Analysis](link)
- [Status page timeline](link)
- [Customer communication](link)

Internal vs. External Communication

The same incident requires very different communication for different audiences.

Internal Communication (Engineering)

Tone: Technical, detailed, honest about unknowns. Content: Full technical details, timeline, root cause hypotheses, action items.

"At 14:12 UTC, the PostgreSQL connection pool on prod-db-01 was exhausted due to a query on the orders table performing a full table scan (50M rows, no index on created_at). The query was introduced in v2.4.0 (PR #3847). Connection pool recovered after manually killing the query at 14:30 UTC. Index deployed at 14:45 UTC."

External Communication (Customers)

Tone: Empathetic, clear, non-technical. Focus on impact and resolution, not technical details.

"Earlier today, some customers experienced errors when trying to complete their purchases. The issue lasted approximately 23 minutes and has been fully resolved. All orders placed during this time have been verified, and no data was lost. We apologize for the inconvenience and have implemented measures to prevent this from happening again."

External Communication (Partners and Regulators)

Tone: Formal, precise, comprehensive. May need to include specific data about impact, root cause, and corrective actions.

"On February 9, 2026, between 14:12 and 14:35 UTC, the checkout service experienced an outage affecting approximately 3,200 users. Root cause: a missing database index caused query timeouts that exhausted the connection pool. No customer data was compromised. Corrective actions include migration failure alerting, query performance testing, and connection pool circuit breakers. Full RCA report is attached."

Status Page Updates

Status page updates are the most visible form of incident communication. They set expectations and reduce the support burden.

Update cadence by severity:

| Severity | First Update | Subsequent Updates |
|---|---|---|
| Sev-1 | Within 5 minutes | Every 15 minutes |
| Sev-2 | Within 15 minutes | Every 30 minutes |
| Sev-3 | Within 1 hour | Every 1-2 hours |

Status page update examples:

Investigating:

"We are investigating reports of errors during checkout. Some customers may experience issues completing purchases. Our team is actively working on this. We will provide an update within 15 minutes."

Identified:

"We have identified the cause of the checkout errors and are implementing a fix. Purchases may still fail for some customers. We expect to resolve this within 30 minutes."

Monitoring:

"A fix has been deployed and checkout is operational. We are monitoring to confirm stability. If you experienced a failed purchase, please try again."

Resolved:

"The checkout issue has been fully resolved. All services are operating normally. We apologize for the disruption and have implemented measures to prevent recurrence."


The QA Engineer's Role During Incidents

QA engineers bring specific skills to incident response that other roles lack.

During the Incident

| Activity | Why QA Is Uniquely Suited |
|---|---|
| Reproduction | QA knows how to systematically reproduce issues, isolate variables, and document exact steps |
| Scope assessment | QA can quickly determine which features and user flows are affected by testing adjacent functionality |
| Fix verification | QA can verify the fix works and does not introduce regressions, in staging and then in production |
| Evidence collection | QA is trained to capture screenshots, logs, network traces, and other evidence |

After the Incident

| Activity | Why QA Is Uniquely Suited |
|---|---|
| Regression testing | Verify that the fix holds and no regressions were introduced |
| Test gap analysis | Identify what tests should have caught this and add them |
| Monitoring setup | Define what to monitor to detect similar issues early |
| RCA contribution | Provide the testing perspective: why did existing tests miss this? |
| Checklist updates | Update deployment verification checklists based on what was learned |

QA's Incident Response Checklist

## QA Incident Response

### Immediate (during incident)
- [ ] Join the war room
- [ ] Reproduce the issue and document exact reproduction steps
- [ ] Assess scope: what features are affected beyond the reported issue?
- [ ] When a fix is ready, verify in staging before production deployment
- [ ] Verify the fix in production after deployment
- [ ] Run smoke tests on adjacent functionality

### Short-term (24-48 hours)
- [ ] Run full regression suite against production
- [ ] Write test cases that would have caught this issue
- [ ] Review and update deployment verification checklist
- [ ] Contribute to the incident report with testing perspective

### Medium-term (1-2 weeks)
- [ ] Automate the new test cases and add to CI
- [ ] Participate in the RCA / post-mortem meeting
- [ ] Update monitoring alerts based on lessons learned
- [ ] Review test strategy: does it need adjustment?

Post-Incident Testing

Regression Verification

After an incident fix is deployed, regression testing is critical. The fix might solve the immediate issue but introduce new problems.

Regression testing strategy post-incident:

  1. Smoke test the fix itself (does the reported issue still happen?)
  2. Test adjacent functionality (what could the fix have affected?)
  3. Run the full automated regression suite
  4. Perform targeted exploratory testing around the affected area
  5. Monitor production metrics for 24-48 hours after the fix
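Steps 1 and 2 above can be scripted so the same checks run identically in staging and production. A minimal sketch of a smoke-test runner in Python; the check names and the lambda bodies are hypothetical placeholders for real probes of your system:

```python
from typing import Callable

def run_smoke_suite(checks: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run each named check, treating any raised exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

# Hypothetical checks; in practice each would hit a real endpoint or flow.
checks = {
    "checkout_completes": lambda: True,   # the reported issue itself
    "cart_still_works": lambda: True,     # adjacent functionality the fix touched
}
```

Because exceptions are caught per check, one broken probe cannot mask the results of the others, and the returned dict can be pasted directly into the war room as verification evidence.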

Monitoring Setup

Every incident should result in better monitoring. Work with DevOps to ensure that:

  • The specific failure mode that caused this incident has an alert
  • Related failure modes have alerts
  • Dashboards show the relevant metrics (connection pool usage, query latency, error rates)
  • Alert thresholds are set low enough to catch the issue before it becomes customer-facing
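To make the last point concrete, here is a sketch of what such an alert could look like as a Prometheus-style rule for the connection-pool failure mode in this chapter's example. The metric names, threshold, and labels are assumptions for illustration, not taken from any real system:

```yaml
groups:
  - name: database
    rules:
      # Hypothetical metrics; substitute whatever your exporter actually emits.
      - alert: DBConnectionPoolNearExhaustion
        expr: db_connection_pool_used / db_connection_pool_max > 0.8
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "DB connection pool above 80% for 2 minutes"
```

Firing at 80% rather than 100% is the point of the bullet above: the on-call engineer gets paged while there is still headroom, before checkout starts returning 500s.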

Building Incident Communication Templates and Runbooks

Template Library

Create a library of pre-written templates that can be customized during an incident. Writing from scratch under pressure leads to poor communication.

| Template | When to Use |
|---|---|
| Severity assessment | Immediately when an incident is reported |
| Internal status update | Every update during the incident |
| Status page update (investigating) | First public acknowledgment |
| Status page update (identified) | When root cause is found |
| Status page update (monitoring) | When fix is deployed |
| Status page update (resolved) | When incident is confirmed resolved |
| Customer email (apology) | After Sev-1 or Sev-2 incidents |
| Partner notification | When partners are affected |
| Post-incident summary (internal) | Within 24 hours of resolution |
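Templates like these can be stored with placeholders and filled in seconds during an incident. A minimal sketch using Python's standard `string.Template`; the stored template text and the placeholder names (`$feature`, `$next_update_minutes`) are illustrative assumptions:

```python
from string import Template

# A stored "investigating" status-page template with named placeholders.
INVESTIGATING = Template(
    "We are investigating reports of errors during $feature. "
    "Some customers may experience issues. Our team is actively "
    "working on this. We will provide an update within "
    "$next_update_minutes minutes."
)

def render(template: Template, **fields) -> str:
    """Fill a stored template; substitute() raises if a field is missing."""
    return template.substitute(**fields)
```

Using `substitute` rather than `safe_substitute` is deliberate: it fails loudly if a field is forgotten, so a half-filled update can never reach the status page.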

Incident Communication Runbook

# Incident Communication Runbook

## Step 1: Assess Severity
Use the severity matrix to classify the incident.
Severity determines communication cadence and audience.

## Step 2: Open Communication Channels
- Create a dedicated Slack channel: #inc-YYYYMMDD-brief-description
- Start a war room video call for Sev-1
- Designate an incident commander and a communication lead

## Step 3: First Status Update (within SLA)
- Post internal update using the status update template
- Update status page using the appropriate template
- Notify affected stakeholders via email (Sev-1 and Sev-2)

## Step 4: Ongoing Updates
- Follow the cadence for the severity level
- Each update includes: current status, impact, actions being taken, next update time
- Keep the Slack channel as the single source of truth

## Step 5: Resolution Communication
- Post final status update (internal and external)
- Update status page to "Resolved"
- Send resolution email to stakeholders
- Schedule post-mortem within 48 hours

## Step 6: Post-Incident
- Write incident report within 24 hours
- Conduct post-mortem within 1 week
- Complete RCA and track corrective actions
- Send customer communication if required

Hands-On Exercise

  1. Write a status page update sequence (investigating, identified, monitoring, resolved) for a recent incident at your company
  2. Create a severity matrix customized to your product and organization
  3. Write an incident report for a recent production issue using the template above
  4. Draft both an internal and customer-facing communication for the same incident. Notice the differences in detail, tone, and technical content.
  5. Build an incident communication runbook for your team, including templates, channels, and escalation procedures.