Root Cause Analysis
Finding the Real Cause, Not Just the Obvious One
A root cause analysis (RCA) that stops at the first plausible explanation is not an analysis -- it is a guess. The purpose of RCA is not to find someone to blame or something to patch. It is to understand why the system allowed the failure to happen and to change the system so that similar failures cannot recur. The difference between a team that keeps having the same types of incidents and a team that genuinely improves is the quality of their root cause analysis.
RCA Frameworks
The 5 Whys
The simplest and most widely used RCA technique. Start with the problem and ask "why" repeatedly until you reach a root cause that is systemic, not symptomatic.
Example: Production outage due to database connection exhaustion
| Level | Question | Answer |
|---|---|---|
| Problem | Why did the application go down? | The database connection pool was exhausted |
| Why 1 | Why was the connection pool exhausted? | A slow query was holding connections for 30+ seconds |
| Why 2 | Why was the query slow? | The query was doing a full table scan on a 50M-row table |
| Why 3 | Why was there a full table scan? | The query was missing an index on the created_at column |
| Why 4 | Why was the index missing? | The migration that should have added the index failed silently |
| Why 5 | Why did the migration fail silently? | Our migration runner does not alert on failures; it logs to a file nobody monitors |
Root cause: Migration failures are not monitored or alerted. Contributing cause: No performance testing catches slow queries before production.
Corrective actions:
- Add alerting for failed migrations (prevents this class of issue)
- Add the missing index (fixes this specific issue)
- Add query performance testing to CI pipeline (catches slow queries earlier)
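The first corrective action, alerting on migration failures, can be sketched as a thin wrapper around the migration command. This is a minimal illustration, not a specific migration tool's API: `send_alert` is a placeholder for whatever paging or chat integration the team uses.

```python
import subprocess

def send_alert(message: str) -> None:
    # Placeholder: in practice this would page via PagerDuty, Slack, etc.
    print(f"ALERT: {message}")

def run_migration(command: list[str]) -> bool:
    """Run a migration command; alert loudly instead of failing silently."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # The original failure mode was a log line nobody read;
        # here a nonzero exit code always produces a visible alert.
        send_alert(f"Migration failed: {' '.join(command)}\n{result.stderr}")
        return False
    return True

# A failing command now triggers an alert instead of a silent log entry.
print(run_migration(["false"]))  # False
```

The point is not the wrapper itself but the invariant it enforces: a migration cannot fail without a human being notified.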
Common mistakes with 5 Whys:
| Mistake | Problem | Fix |
|---|---|---|
| Stopping too early | "The query was slow" is a symptom, not a root cause | Keep asking why until you reach a process or system failure |
| Only one chain | Complex incidents have multiple contributing causes | Ask "why" from multiple starting points |
| Landing on a person | "Because Alice did not add the index" blames a person, not a system | Ask why the system allowed that to happen |
| Too abstract | "Because our process is bad" is too vague to act on | Be specific: what process, what gap, what change |
Fishbone Diagram (Ishikawa)
The fishbone diagram organizes potential causes into categories, which is useful for complex incidents with multiple contributing factors.
     People                  Process                 Tools
        \                       \                       \
     No query review         No migration            No slow query
     process                 monitoring              detection
          \                       \                       \
  ─────────────────────────────────────────────────────────────→  Database Outage
          /                       /
     Staging DB has          No load testing with
     1K rows, not 50M        production data volumes
        /                       /
     Environment             Testing
Standard fishbone categories (the 6 Ms):
- Methods: Processes, procedures, policies
- Machines: Tools, infrastructure, environments
- Materials: Data, inputs, dependencies
- Measurements: Monitoring, alerting, metrics
- Manpower: Skills, training, staffing
- Mother Nature: External factors, third-party services
Fault Tree Analysis
Fault tree analysis works backward from the failure using Boolean logic (AND/OR gates) to identify all possible cause combinations.
                    Database Outage
                           |
                          AND
                      /         \
             Slow Query        No Connection
               Exists          Pool Recovery
                 |                   |
                 OR                 AND
              /      \            /      \
         Missing  Unoptimized  No pool     No timeout
          index   query plan   monitoring  configured
When to use fault tree analysis: Complex, safety-critical systems where you need to understand all possible failure paths. Common in aviation, medical devices, and nuclear systems. Less common in web applications, but valuable for critical infrastructure.
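The AND/OR gate logic in the tree above can be evaluated mechanically. The sketch below mirrors the example tree; the event names and truth values are illustrative, not drawn from a real system:

```python
# Minimal fault-tree evaluation: gates are Boolean functions over basic events.
def AND(*branches): return all(branches)
def OR(*branches): return any(branches)

def outage(missing_index, unoptimized_plan, no_pool_monitoring, no_timeout):
    # Either basic event on the left branch makes the query slow (OR gate).
    slow_query = OR(missing_index, unoptimized_plan)
    # Recovery fails only if BOTH safeguards are absent (AND gate).
    no_recovery = AND(no_pool_monitoring, no_timeout)
    # The top event requires both branches (AND gate).
    return AND(slow_query, no_recovery)

print(outage(True, False, True, True))   # True
print(outage(True, False, True, False))  # False: a timeout would have recovered
```

Evaluating the tree this way makes the value of AND gates concrete: any single safeguard on an AND branch (here, a connection timeout) is enough to block the top-level failure.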
Writing RCA Reports
RCA Report Template
# Root Cause Analysis: [Incident Title]
Date: [Date of incident]
Author: [Name]
Severity: [Sev-1 / Sev-2 / Sev-3]
Status: [Draft / In Review / Final]
## Summary
One paragraph: what happened, when, how long it lasted, what was affected.
## Timeline
| Time (UTC) | Event |
|---|---|
| 14:00 | Monitoring alert: database connection pool at 90% |
| 14:05 | On-call engineer acknowledged alert |
| 14:12 | Database connection pool exhausted; application returning 503 |
| 14:15 | Incident declared; war room opened |
| 14:25 | Slow query identified via database monitoring |
| 14:30 | Query killed manually; connections began recovering |
| 14:35 | Application fully recovered |
| 14:40 | Root cause identified: missing index on orders.created_at |
| 14:45 | Index added to production database |
| 15:00 | Monitoring confirmed stable; incident resolved |
## Impact
- Duration: 23 minutes (14:12 - 14:35)
- Users affected: approximately 3,200 (all users attempting checkout)
- Revenue impact: estimated $8,500 in lost transactions
- Support tickets: 47
## Root Cause
The migration that should have added an index to the `orders.created_at`
column failed silently during the v2.3.0 deployment (2 weeks prior).
Without the index, a new report query introduced in v2.4.0 performed a
full table scan on 50 million rows, consuming database connections
for 30+ seconds each.
## Contributing Factors
1. **Migration monitoring gap:** Migration failures log to a file but
do not trigger alerts. The team was unaware of the failed migration.
2. **No query performance testing:** The CI pipeline does not test
query performance against production-scale data volumes.
3. **Staging data mismatch:** Staging has 1,000 rows in the orders
table; production has 50 million. The query performed well in staging.
4. **No connection pool circuit breaker:** When the pool fills, the
application queues requests indefinitely instead of failing fast.
## Corrective Actions
| Action | Owner | Priority | Due Date | Status |
|---|---|---|---|---|
| Add alerting for migration failures | DevOps | P1 | 2026-02-21 | In progress |
| Add query performance tests to CI | QA | P2 | 2026-03-07 | Not started |
| Configure connection pool circuit breaker | Backend | P2 | 2026-03-07 | Not started |
| Create staging data seeding script (production-scale) | QA + DevOps | P3 | 2026-03-21 | Not started |
| Add slow query monitoring dashboard | DevOps | P2 | 2026-02-28 | Not started |
## Lessons Learned
- Silent failures are the most dangerous kind. If something can fail,
the failure must be visible.
- Testing against unrealistic data volumes gives false confidence.
- Connection pool exhaustion cascades: one slow query can take down
the entire application. Defense in depth (circuit breakers,
timeouts, connection limits) is essential.
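The circuit breaker mentioned above can be sketched in a few lines. This is an illustrative fail-fast breaker, not a specific library; real implementations add a half-open state and a reset timeout:

```python
class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls immediately
    instead of queueing them against an exhausted resource."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise CircuitOpenError("failing fast: breaker is open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(lambda: 1 / 0)  # simulated failing dependency
    except ZeroDivisionError:
        pass
try:
    breaker.call(lambda: "ok")
except CircuitOpenError as e:
    print(e)  # failing fast: breaker is open
```

Failing fast turns a cascading outage (every request queued behind a dead pool) into a bounded, visible error that monitoring can catch.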
Blameless Post-Mortems
Language That Focuses on Systems, Not People
The single most important principle of blameless post-mortems is this: people do not cause incidents; systems allow incidents to happen. When someone makes a mistake, the question is not "why did they do that?" but "why did the system make it easy to make that mistake and hard to catch it?"
| Blaming Language | Blameless Language |
|---|---|
| "Alice forgot to add the index" | "The migration that should have added the index failed silently" |
| "Bob deployed without testing" | "The deployment process did not include a mandatory test verification step" |
| "The developer wrote a bad query" | "The query was not tested against production-scale data volumes" |
| "QA missed this bug" | "Our test coverage did not include this scenario" |
| "The on-call engineer was slow to respond" | "The alert routing did not reach the on-call engineer's phone" |
How to Facilitate a Blameless Post-Mortem
- Set the tone at the start. "This meeting is about learning and improving, not assigning blame. We assume everyone involved made the best decisions they could with the information they had."
- Focus on the timeline. Walk through what happened chronologically. Facts first, analysis second.
- Ask "what" and "how," not "who." "What made it possible for this to happen?" not "Who caused this?"
- Celebrate the response. Acknowledge what went well during the incident response, not just what went wrong.
- End with actions, not judgments. Every corrective action should change a system or process, not punish a person.
Tracking Corrective Actions to Completion
The most common failure in the RCA process is not the analysis -- it is the follow-through. Teams write thorough RCA reports with excellent corrective actions, and then those actions sit in a spreadsheet until the next incident.
Tracking System
- Create tickets for every corrective action in the same system where you track development work (Jira, Linear, GitHub Issues). If it is not in the backlog, it will not get done.
- Assign an owner and a due date. "The team will fix this" means nobody will fix this.
- Review progress weekly in standup or a dedicated 15-minute meeting.
- Close the RCA only when all actions are complete (or explicitly deprioritized with documented rationale).
Corrective Action Categories
| Category | Description | Example |
|---|---|---|
| Immediate fix | Fixes the specific issue that caused this incident | Add the missing database index |
| Detection improvement | Makes similar issues visible earlier | Add migration failure alerting |
| Prevention | Changes the system so this class of issue cannot occur | Add query performance tests to CI |
| Process change | Updates a procedure to prevent recurrence | Add mandatory slow query review for new queries |
| Documentation | Captures knowledge for future reference | Document connection pool limits and circuit breaker config |
RCA Templates with Real-World Examples
Example 1: Production Outage (E-commerce)
Incident: Checkout flow returned 500 errors for 45 minutes during Black Friday.
Root cause: The third-party payment provider changed its rate limit from 1,000 to 500 requests per minute without notice. The application did not handle rate limiting gracefully.
Corrective actions:
- Add rate limit handling with exponential backoff
- Implement a request queue with overflow to a secondary payment provider
- Add rate limit monitoring and alerting
- Negotiate contractual rate limit guarantees with the payment provider
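The first corrective action, backoff on rate limiting, can be sketched as a generic retry loop. Assume the HTTP client raises something like the hypothetical `RateLimitedError` below on a 429 response; the function and exception names are illustrative, not a specific payment API:

```python
import random
import time

class RateLimitedError(Exception):
    pass

def call_with_backoff(request, max_retries=5, base_delay=0.5):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except RateLimitedError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; let the caller fail over
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Simulated dependency that succeeds on the third attempt:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitedError()
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # ok
```

On final failure the exception propagates, which is where the second corrective action (overflow to a secondary provider) would take over.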
Example 2: Data Breach (SaaS Platform)
Incident: Customer data was exposed through an API endpoint that lacked authentication.
Root cause: A new API endpoint was added without the authentication middleware. The code review did not catch it because the reviewer was not aware of the authentication requirement for API routes.
Corrective actions:
- Add automated security scanning that flags unauthenticated endpoints
- Update the code review checklist to include authentication verification
- Add integration tests that verify all API endpoints require authentication
- Implement default-deny: all new endpoints require authentication unless explicitly marked as public
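The default-deny action is the key structural fix: instead of remembering to add authentication, engineers must explicitly opt a route out of it. A framework-agnostic sketch, with route names invented for illustration:

```python
# Default-deny: every route requires authentication unless explicitly
# registered as public. Forgetting to register a route fails closed (401),
# not open (data exposure).
PUBLIC_ROUTES = {"/health", "/login"}

def handle(path: str, authenticated: bool) -> int:
    """Return an HTTP status code for a request under default-deny."""
    if path not in PUBLIC_ROUTES and not authenticated:
        return 401
    return 200

print(handle("/api/customers", authenticated=False))  # 401
print(handle("/health", authenticated=False))         # 200
```

With this inversion, the mistake that caused the breach (a new endpoint added without middleware) becomes an availability bug caught in the first test run, not a silent security hole.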
Example 3: Performance Degradation (Mobile App)
Incident: App startup time degraded from 2s to 8s over a period of 3 weeks.
Root cause: A new analytics SDK was added that performed synchronous network calls during app initialization. The performance regression was gradual (added across 3 PRs) and fell below the threshold of any single performance test.
Corrective actions:
- Add startup time performance budget to CI (fail if startup exceeds 3s)
- Require async initialization for all third-party SDKs
- Add performance trend monitoring that detects gradual degradation
- Audit all existing SDK initializations for synchronous calls
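The performance-budget action can be sketched as a CI gate: measure startup, fail the build if it exceeds a fixed threshold. A minimal illustration, assuming the CI job can invoke the app's startup path as a callable:

```python
import time

STARTUP_BUDGET_SECONDS = 3.0  # build fails if startup exceeds this

def measure_startup(start_fn) -> float:
    """Time a startup routine in seconds."""
    t0 = time.perf_counter()
    start_fn()
    return time.perf_counter() - t0

def check_budget(elapsed: float, budget: float = STARTUP_BUDGET_SECONDS) -> bool:
    if elapsed > budget:
        # Nonzero exit fails the CI job.
        raise SystemExit(f"Startup {elapsed:.2f}s exceeds budget {budget}s")
    return True

# Stand-in for real app startup; here it sleeps briefly.
print(check_budget(measure_startup(lambda: time.sleep(0.01))))  # True
```

A hard budget catches the large regressions; the trend-monitoring action above is still needed for the gradual, below-threshold degradation that caused this incident.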
Common Anti-Patterns in RCA
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Stopping at the first "why" | The symptom gets fixed; the root cause remains | Keep asking until you reach a systemic cause |
| Blame-shifting | "The vendor's API was bad" avoids asking why you had no fallback | Ask why your system was fragile to the vendor's failure |
| Fixing symptoms, not causes | Adding a band-aid fix without addressing why it was needed | Fix both the immediate issue and the systemic cause |
| Analysis paralysis | Spending weeks on the RCA while corrective actions wait | Timebox the analysis; implement obvious fixes immediately |
| Copy-paste RCA | Using the same generic template without tailoring to the incident | Each RCA should be specific, detailed, and actionable |
| No follow-through | Writing a great RCA and never implementing the actions | Track actions in the sprint backlog with owners and dates |
| Hero worship | "Alice saved us by finding the fix at 3 AM" normalizes heroics | Ask why the system required heroics instead of handling it gracefully |
Making RCAs Actionable: From Analysis to Prevention
The measure of a good RCA is not the quality of the analysis. It is whether the same type of incident stops happening.
The Prevention Hierarchy
- Eliminate: Remove the possibility of the failure entirely (best but often impossible)
- Automate detection: Catch the issue before it reaches production (CI checks, automated scans)
- Limit blast radius: If the failure occurs, limit its impact (circuit breakers, feature flags, canary deployments)
- Speed recovery: Make it easy to detect and recover from the failure quickly (monitoring, runbooks, automated rollback)
- Document: At minimum, document the failure so future teams can recognize it (worst option, but better than nothing)
Always aim for the highest level possible. If you can only document it, your RCA has not gone far enough.
Hands-On Exercise
- Take a recent production incident and write a full RCA report using the template above
- Practice the 5 Whys technique on a recent bug. Did you reach a systemic root cause, or did you stop at a symptom?
- Review a past RCA from your team. Are the corrective actions implemented? If not, why not?
- Rewrite a past incident report to remove any blaming language using the blameless language examples
- Create a fishbone diagram for a complex incident, categorizing causes by the 6 Ms