Root Cause Analysis
Finding the Real Cause, Not Just the Obvious One
A root cause analysis (RCA) that stops at the first plausible explanation is not an analysis -- it is a guess. The purpose of RCA is not to find someone to blame or something to patch. It is to understand why the system allowed the failure to happen and to change the system so that similar failures cannot recur. The difference between a team that keeps having the same types of incidents and a team that genuinely improves is the quality of their root cause analysis.
RCA Frameworks
The 5 Whys
The simplest and most widely used RCA technique. Start with the problem and ask "why" repeatedly until you reach a root cause that is systemic, not symptomatic.
Example: Production outage due to database connection exhaustion
| Level | Question | Answer |
|---|---|---|
| Problem | Why did the application go down? | The database connection pool was exhausted |
| Why 1 | Why was the connection pool exhausted? | A slow query was holding connections for 30+ seconds |
| Why 2 | Why was the query slow? | The query was doing a full table scan on a 50M-row table |
| Why 3 | Why was there a full table scan? | The query was missing an index on the created_at column |
| Why 4 | Why was the index missing? | The migration that should have added the index failed silently |
| Why 5 | Why did the migration fail silently? | Our migration runner does not alert on failures; it logs to a file nobody monitors |
Root cause: Migration failures are not monitored or alerted. Contributing cause: No performance testing catches slow queries before production.
Corrective actions:
- Add alerting for failed migrations (prevents this class of issue)
- Add the missing index (fixes this specific issue)
- Add query performance testing to CI pipeline (catches slow queries earlier)
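The first corrective action, alerting on migration failures, can be sketched as a thin wrapper around the migration command. This is a minimal illustration, not a specific migration tool's API: `send_alert` is a placeholder for whatever paging or chat integration the team uses.

```python
import subprocess

def send_alert(message: str) -> None:
    # Placeholder: in practice this would page via PagerDuty, Slack, etc.
    print(f"ALERT: {message}")

def run_migration(command: list[str]) -> bool:
    """Run a migration command; alert loudly instead of failing silently."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # The original failure mode was a log line nobody read;
        # here a nonzero exit code always produces a visible alert.
        send_alert(f"Migration failed: {' '.join(command)}\n{result.stderr}")
        return False
    return True

# A failing command now triggers an alert instead of a silent log entry.
print(run_migration(["false"]))  # False
```

The point is not the wrapper itself but the invariant it enforces: a migration cannot fail without a human being notified.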
Common mistakes with 5 Whys:
| Mistake | Problem | Fix |
|---|---|---|
| Stopping too early | "The query was slow" is a symptom, not a root cause | Keep asking why until you reach a process or system failure |
| Only one chain | Complex incidents have multiple contributing causes | Ask "why" from multiple starting points |
| Landing on a person | "Because Alice did not add the index" blames a person, not a system | Ask why the system allowed that to happen |
| Too abstract | "Because our process is bad" is too vague to act on | Be specific: what process, what gap, what change |
Fishbone Diagram (Ishikawa)
The fishbone diagram organizes potential causes into categories, which is useful for complex incidents with multiple contributing factors.
     People                  Process                 Tools
        \                       \                       \
     No query review         No migration            No slow query
     process                 monitoring              detection
          \                       \                       \
  ─────────────────────────────────────────────────────────────→  Database Outage
          /                       /
     Staging DB has          No load testing with
     1K rows, not 50M        production data volumes
        /                       /
     Environment             Testing
Standard fishbone categories (the 6 Ms):
- Methods: Processes, procedures, policies
- Machines: Tools, infrastructure, environments
- Materials: Data, inputs, dependencies
- Measurements: Monitoring, alerting, metrics
- Manpower: Skills, training, staffing
- Mother Nature: External factors, third-party services
Fault Tree Analysis
Fault tree analysis works backward from the failure using Boolean logic (AND/OR gates) to identify all possible cause combinations.
                    Database Outage
                           |
                          AND
                      /         \
             Slow Query        No Connection
               Exists          Pool Recovery
                 |                   |
                 OR                 AND
              /      \            /      \
         Missing  Unoptimized  No pool     No timeout
          index   query plan   monitoring  configured
When to use fault tree analysis: Complex, safety-critical systems where you need to understand all possible failure paths. Common in aviation, medical devices, and nuclear systems. Less common in web applications, but valuable for critical infrastructure.
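The AND/OR gate logic in the tree above can be evaluated mechanically. The sketch below mirrors the example tree; the event names and truth values are illustrative, not drawn from a real system:

```python
# Minimal fault-tree evaluation: gates are Boolean functions over basic events.
def AND(*branches): return all(branches)
def OR(*branches): return any(branches)

def outage(missing_index, unoptimized_plan, no_pool_monitoring, no_timeout):
    # Either basic event on the left branch makes the query slow (OR gate).
    slow_query = OR(missing_index, unoptimized_plan)
    # Recovery fails only if BOTH safeguards are absent (AND gate).
    no_recovery = AND(no_pool_monitoring, no_timeout)
    # The top event requires both branches (AND gate).
    return AND(slow_query, no_recovery)

print(outage(True, False, True, True))   # True
print(outage(True, False, True, False))  # False: a timeout would have recovered
```

Evaluating the tree this way makes the value of AND gates concrete: any single safeguard on an AND branch (here, a connection timeout) is enough to block the top-level failure.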
Writing RCA Reports
RCA Report Template
# Root Cause Analysis: [Incident Title]
Date: [Date of incident]
Author: [Name]
Severity: [Sev-1 / Sev-2 / Sev-3]
Status: [Draft / In Review / Final]
## Summary
One paragraph: what happened, when, how long it lasted, what was affected.
## Timeline
| Time (UTC) | Event |
|---|---|
| 14:00 | Monitoring alert: database connection pool at 90% |
| 14:05 | On-call engineer acknowledged alert |
| 14:12 | Database connection pool exhausted; application returning 503 |
| 14:15 | Incident declared; war room opened |
| 14:25 | Slow query identified via database monitoring |
| 14:30 | Query killed manually; connections began recovering |
| 14:35 | Application fully recovered |
| 14:40 | Root cause identified: missing index on orders.created_at |
| 14:45 | Index added to production database |
| 15:00 | Monitoring confirmed stable; incident resolved |
## Impact
- Duration: 23 minutes (14:12 - 14:35)
- Users affected: approximately 3,200 (all users attempting checkout)
- Revenue impact: estimated $8,500 in lost transactions
- Support tickets: 47
## Root Cause
The migration that should have added an index to the `orders.created_at`
column failed silently during the v2.3.0 deployment (2 weeks prior).
Without the index, a new report query introduced in v2.4.0 performed a
full table scan on 50 million rows, consuming database connections
for 30+ seconds each.
## Contributing Factors
1. **Migration monitoring gap:** Migration failures log to a file but
do not trigger alerts. The team was unaware of the failed migration.
2. **No query performance testing:** The CI pipeline does not test
query performance against production-scale data volumes.
3. **Staging data mismatch:** Staging has 1,000 rows in the orders
table; production has 50 million. The query performed well in staging.
4. **No connection pool circuit breaker:** When the pool fills, the
application queues requests indefinitely instead of failing fast.
## Corrective Actions
| Action | Owner | Priority | Due Date | Status |
|---|---|---|---|---|
| Add alerting for migration failures | DevOps | P1 | 2026-02-21 | In progress |
| Add query performance tests to CI | QA | P2 | 2026-03-07 | Not started |
| Configure connection pool circuit breaker | Backend | P2 | 2026-03-07 | Not started |
| Create staging data seeding script (production-scale) | QA + DevOps | P3 | 2026-03-21 | Not started |
| Add slow query monitoring dashboard | DevOps | P2 | 2026-02-28 | Not started |
## Lessons Learned
- Silent failures are the most dangerous kind. If something can fail,
the failure must be visible.
- Testing against unrealistic data volumes gives false confidence.
- Connection pool exhaustion cascades: one slow query can take down
the entire application. Defense in depth (circuit breakers,
timeouts, connection limits) is essential.
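The circuit breaker mentioned above can be sketched in a few lines. This is an illustrative fail-fast breaker, not a specific library; real implementations add a half-open state and a reset timeout:

```python
class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls immediately
    instead of queueing them against an exhausted resource."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise CircuitOpenError("failing fast: breaker is open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(lambda: 1 / 0)  # simulated failing dependency
    except ZeroDivisionError:
        pass
try:
    breaker.call(lambda: "ok")
except CircuitOpenError as e:
    print(e)  # failing fast: breaker is open
```

Failing fast turns a cascading outage (every request queued behind a dead pool) into a bounded, visible error that monitoring can catch.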
Blameless Post-Mortems
Language That Focuses on Systems, Not People
The single most important principle of blameless post-mortems is this: people do not cause incidents; systems allow incidents to happen. When someone makes a mistake, the question is not "why did they do that?" but "why did the system make it easy to make that mistake and hard to catch it?"
| Blaming Language | Blameless Language |
|---|---|
| "Alice forgot to add the index" | "The migration that should have added the index failed silently" |
| "Bob deployed without testing" | "The deployment process did not include a mandatory test verification step" |
| "The developer wrote a bad query" | "The query was not tested against production-scale data volumes" |
| "QA missed this bug" | "Our test coverage did not include this scenario" |
| "The on-call engineer was slow to respond" | "The alert routing did not reach the on-call engineer's phone" |
How to Facilitate a Blameless Post-Mortem
- Set the tone at the start. "This meeting is about learning and improving, not assigning blame. We assume everyone involved made the best decisions they could with the information they had."
- Focus on the timeline. Walk through what happened chronologically. Facts first, analysis second.
- Ask "what" and "how," not "who." "What made it possible for this to happen?" not "Who caused this?"
- Celebrate the response. Acknowledge what went well during the incident response, not just what went wrong.
- End with actions, not judgments. Every corrective action should change a system or process, not punish a person.
Tracking Corrective Actions to Completion
The most common failure in the RCA process is not the analysis -- it is the follow-through. Teams write thorough RCA reports with excellent corrective actions, and then those actions sit in a spreadsheet until the next incident.
Tracking System
- Create tickets for every corrective action in the same system where you track development work (Jira, Linear, GitHub Issues). If it is not in the backlog, it will not get done.
- Assign an owner and a due date. "The team will fix this" means nobody will fix this.
- Review progress weekly in standup or a dedicated 15-minute meeting.
- Close the RCA only when all actions are complete (or explicitly deprioritized with documented rationale).
Corrective Action Categories
| Category | Description | Example |
|---|---|---|
| Immediate fix | Fixes the specific issue that caused this incident | Add the missing database index |
| Detection improvement | Makes similar issues visible earlier | Add migration failure alerting |
| Prevention | Changes the system so this class of issue cannot occur | Add query performance tests to CI |
| Process change | Updates a procedure to prevent recurrence | Add mandatory slow query review for new queries |
| Documentation | Captures knowledge for future reference | Document connection pool limits and circuit breaker config |
RCA Templates with Real-World Examples
Example 1: Production Outage (E-commerce)
Incident: Checkout flow returned 500 errors for 45 minutes during Black Friday.
Root cause: The third-party payment provider changed its rate limit from 1,000 to 500 requests per minute without notice. The application did not handle rate limiting gracefully.
Corrective actions:
- Add rate limit handling with exponential backoff
- Implement a request queue with overflow to a secondary payment provider
- Add rate limit monitoring and alerting
- Negotiate contractual rate limit guarantees with the payment provider
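The first corrective action, backoff on rate limiting, can be sketched as a generic retry loop. Assume the HTTP client raises something like the hypothetical `RateLimitedError` below on a 429 response; the function and exception names are illustrative, not a specific payment API:

```python
import random
import time

class RateLimitedError(Exception):
    pass

def call_with_backoff(request, max_retries=5, base_delay=0.5):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimitedError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; let the caller fail over
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Simulated dependency that succeeds on the third attempt:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitedError()
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # ok
```

On final failure the exception propagates, which is where the second corrective action (overflow to a secondary provider) would take over.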
Example 2: Data Breach (SaaS Platform)
Incident: Customer data was exposed through an API endpoint that lacked authentication.
Root cause: A new API endpoint was added without the authentication middleware. The code review did not catch it because the reviewer was not aware of the authentication requirement for API routes.
Corrective actions:
- Add automated security scanning that flags unauthenticated endpoints
- Update the code review checklist to include authentication verification
- Add integration tests that verify all API endpoints require authentication
- Implement default-deny: all new endpoints require authentication unless explicitly marked as public
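The default-deny action is the key structural fix: instead of remembering to add authentication, engineers must explicitly opt a route out of it. A framework-agnostic sketch, with route names invented for illustration:

```python
# Default-deny: every route requires authentication unless explicitly
# registered as public. Forgetting to register a route fails closed (401),
# not open (data exposure).
PUBLIC_ROUTES = {"/health", "/login"}

def handle(path: str, authenticated: bool) -> int:
    """Return an HTTP status code for a request under default-deny."""
    if path not in PUBLIC_ROUTES and not authenticated:
        return 401
    return 200

print(handle("/api/customers", authenticated=False))  # 401
print(handle("/health", authenticated=False))         # 200
```

With this inversion, the mistake that caused the breach (a new endpoint added without middleware) becomes an availability bug caught in the first test run, not a silent security hole.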
Example 3: Performance Degradation (Mobile App)
Incident: App startup time degraded from 2s to 8s over a period of 3 weeks.
Root cause: A new analytics SDK was added that performed synchronous network calls during app initialization. The performance regression was gradual (added across 3 PRs) and fell below the threshold of any single performance test.
Corrective actions:
- Add startup time performance budget to CI (fail if startup exceeds 3s)
- Require async initialization for all third-party SDKs
- Add performance trend monitoring that detects gradual degradation
- Audit all existing SDK initializations for synchronous calls
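The performance-budget action can be sketched as a CI gate: measure startup, fail the build if it exceeds a fixed threshold. A minimal illustration, assuming the CI job can invoke the app's startup path as a callable:

```python
import time

STARTUP_BUDGET_SECONDS = 3.0  # build fails if startup exceeds this

def measure_startup(start_fn) -> float:
    """Time a startup routine in seconds."""
    t0 = time.perf_counter()
    start_fn()
    return time.perf_counter() - t0

def check_budget(elapsed: float, budget: float = STARTUP_BUDGET_SECONDS) -> bool:
    if elapsed > budget:
        # Nonzero exit fails the CI job.
        raise SystemExit(f"Startup {elapsed:.2f}s exceeds budget {budget}s")
    return True

# Stand-in for real app startup; here it sleeps briefly.
print(check_budget(measure_startup(lambda: time.sleep(0.01))))  # True
```

A hard budget catches the large regressions; the trend-monitoring action above is still needed for the gradual, below-threshold degradation that caused this incident.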
Common Anti-Patterns in RCA
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Stopping at the first "why" | The symptom gets fixed; the root cause remains | Keep asking until you reach a systemic cause |
| Blame-shifting | "The vendor's API was bad" avoids asking why you had no fallback | Ask why your system was fragile to the vendor's failure |
| Fixing symptoms, not causes | Adding a band-aid fix without addressing why it was needed | Fix both the immediate issue and the systemic cause |
| Analysis paralysis | Spending weeks on the RCA while corrective actions wait | Timebox the analysis; implement obvious fixes immediately |
| Copy-paste RCA | Using the same generic template without tailoring to the incident | Each RCA should be specific, detailed, and actionable |
| No follow-through | Writing a great RCA and never implementing the actions | Track actions in the sprint backlog with owners and dates |
| Hero worship | "Alice saved us by finding the fix at 3 AM" normalizes heroics | Ask why the system required heroics instead of handling it gracefully |
Making RCAs Actionable: From Analysis to Prevention
The measure of a good RCA is not the quality of the analysis. It is whether the same type of incident stops happening.
The Prevention Hierarchy
- Eliminate: Remove the possibility of the failure entirely (best but often impossible)
- Automate detection: Catch the issue before it reaches production (CI checks, automated scans)
- Limit blast radius: If the failure occurs, limit its impact (circuit breakers, feature flags, canary deployments)
- Speed recovery: Make it easy to detect and recover from the failure quickly (monitoring, runbooks, automated rollback)
- Document: At minimum, document the failure so future teams can recognize it (worst option, but better than nothing)
Always aim for the highest level possible. If you can only document it, your RCA has not gone far enough.
Hands-On Exercise
- Take a recent production incident and write a full RCA report using the template above
- Practice the 5 Whys technique on a recent bug. Did you reach a systemic root cause, or did you stop at a symptom?
- Review a past RCA from your team. Are the corrective actions implemented? If not, why not?
- Rewrite a past incident report to remove any blaming language using the blameless language examples
- Create a fishbone diagram for a complex incident, categorizing causes by the 6 Ms