# Alert Design: Detecting Real Problems Without Alert Fatigue
## The Alert Fatigue Problem
Alert fatigue -- the condition where on-call engineers ignore alerts because most of them are false positives -- is the enemy of observability-driven testing. Good alert design is as important as good test design. A test suite that produces too many false positives gets ignored; an alerting system that produces too many false positives gets the same treatment.
## Core Alert Design Principles
### 1. Alert on Symptoms, Not Causes
Alert on what users experience, not what the infrastructure is doing:
| Symptom (Good Alert) | Cause (Bad Alert) |
|---|---|
| "Error rate > 1% for checkout API" | "CPU > 80% on checkout-pod-3" |
| "p99 latency > 2s for search" | "Memory usage > 90% on search-worker" |
| "Zero orders processed in 5 minutes" | "Database connection pool at 95%" |
High CPU that causes no user impact is not alert-worthy. Database connection pool at 95% might be perfectly normal under load. Focus on what the user sees.
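The symptom-first rule can be sketched in a few lines: evaluate what users experience (error ratio) rather than machine-level stats. The function name and the 1% threshold below are illustrative, not from any particular system.

```python
def should_alert(total_requests: int, failed_requests: int,
                 error_budget: float = 0.01) -> bool:
    """Fire only when the user-facing error ratio exceeds the budget."""
    if total_requests == 0:
        return False  # no traffic means no user impact
    return failed_requests / total_requests > error_budget


# High CPU with a 0.2% error ratio does not fire; a 5% error ratio does.
assert should_alert(total_requests=10_000, failed_requests=20) is False
assert should_alert(total_requests=10_000, failed_requests=500) is True
```

Note that CPU, memory, and connection-pool figures never appear in the decision: they may explain *why* the error ratio rose, but they are diagnosis inputs, not alert conditions.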
### 2. Use Multi-Window, Multi-Burn-Rate Alerts
Instead of a single threshold, use a tiered approach that catches both fast-burn (sudden outage) and slow-burn (gradual degradation):
| Alert Type | Short Window | Long Window | For Duration | Severity |
|---|---|---|---|---|
| Fast burn | 5 min | 1 hour | 2 min | Critical (page) |
| Moderate | 15 min | 3 hours | 5 min | High (page) |
| Slow burn | 30 min | 6 hours | 15 min | Warning (ticket) |
The dual-window requirement reduces false positives: a brief spike triggers the short window but not the long window, so it does not fire.
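The dual-window condition from the table can be sketched as follows. The error ratios per window are assumed to come from your metrics backend; here they are plain arguments, and the 0.1% error budget corresponds to a 99.9% SLO.

```python
SLO_ERROR_BUDGET = 0.001  # 99.9% availability target

def burn_alert(short_ratio: float, long_ratio: float, burn_rate: float) -> bool:
    """Fire only when BOTH windows burn the budget at `burn_rate`x or faster."""
    threshold = burn_rate * SLO_ERROR_BUDGET
    return short_ratio > threshold and long_ratio > threshold


# A brief spike: the 5-minute window is hot, the 1-hour window is not -> no page.
assert burn_alert(short_ratio=0.05, long_ratio=0.002, burn_rate=14.4) is False
# A sustained outage: both windows are hot -> page.
assert burn_alert(short_ratio=0.05, long_ratio=0.03, burn_rate=14.4) is True
```

The 14.4x figure is the fast-burn multiplier: at that rate a 30-day error budget is exhausted in roughly two days (30 / 14.4 ≈ 2.1).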
### 3. Every Alert Must Have a Runbook
An alert without a runbook is a question without an answer. The runbook should include:

```markdown
## Runbook: Checkout Error Rate > SLO

### What this alert means
The checkout service error rate has exceeded the SLO burn rate for the
specified window, indicating a potential reliability issue.

### Impact
Users may be unable to complete purchases. Revenue impact is proportional
to the duration and severity.

### Diagnosis steps
1. Check the error rate dashboard: [link]
2. Check recent deployments: `kubectl rollout history deployment/checkout`
3. Check dependency health: [payment-service dashboard link]
4. Check logs: `query: service=checkout level=error | last 15m`

### Common causes and fixes
- **Recent deployment**: Roll back with `kubectl rollout undo deployment/checkout`
- **Payment provider outage**: Enable fallback payment processor
- **Database connection exhaustion**: Scale up connection pool
- **Rate limiting from downstream**: Check rate limit headers

### Escalation
If not resolved within 30 minutes, escalate to:
- #checkout-team Slack channel
- Checkout team lead (page)
```
### 4. Page Only for Immediate Human Action
| Action Needed | Notification Type | Response Time |
|---|---|---|
| Immediate human intervention | Page (PagerDuty) | Minutes |
| Fix within a day | Ticket (Jira) | Hours |
| Awareness, no action | Dashboard / weekly report | Days |
If the answer to "what should the on-call do?" is "nothing, it will resolve itself," it should not be a page.
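The routing rule in the table reduces to two questions. A minimal sketch, with placeholder channel names:

```python
def route(needs_immediate_action: bool, needs_fix_today: bool) -> str:
    """Map 'what should the on-call do?' to a notification channel."""
    if needs_immediate_action:
        return "page"       # wake someone up (PagerDuty)
    if needs_fix_today:
        return "ticket"     # fix within a day (Jira)
    return "dashboard"      # awareness only, review weekly


assert route(needs_immediate_action=True, needs_fix_today=True) == "page"
assert route(needs_immediate_action=False, needs_fix_today=True) == "ticket"
assert route(needs_immediate_action=False, needs_fix_today=False) == "dashboard"
```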
## Designing Alerts for Common Scenarios
### Web Application SLO Alerts

```yaml
# Best practice: multi-burn-rate SLO alerts (0.001 = error budget for a 99.9% SLO)
groups:
  - name: web-app-slo
    rules:
      - alert: WebAppHighErrorRate_FastBurn
        expr: |
          (error_ratio_5m > 14.4 * 0.001) and (error_ratio_1h > 14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate burning budget at 14.4x (2-day exhaustion)"

      - alert: WebAppHighErrorRate_SlowBurn
        expr: |
          (error_ratio_30m > 3 * 0.001) and (error_ratio_6h > 3 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error rate burning budget at 3x (10-day exhaustion)"
```
### Certificate Expiration

```yaml
- alert: CertificateExpiringSoon
  expr: |
    (cert_expiry_timestamp_seconds - time()) / 86400 < 14
  labels:
    severity: warning  # ticket, not page
  annotations:
    summary: "TLS certificate expires in {{ $value | humanize }} days"

- alert: CertificateExpiringCritical
  expr: |
    (cert_expiry_timestamp_seconds - time()) / 86400 < 3
  labels:
    severity: critical  # page -- 3 days is urgent
```

Note the expression already converts seconds to days, so the annotation prints `$value` as days rather than passing it through `humanizeDuration` (which would misinterpret it as seconds).
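The same days-to-expiry check is useful outside the alerting pipeline, e.g. in a pre-deploy script. A sketch mirroring the two rules above (the function name is illustrative):

```python
import time
from typing import Optional


def cert_severity(expiry_ts: float, now: Optional[float] = None) -> Optional[str]:
    """Mirror of the PromQL rules: (expiry - now) / 86400 = days remaining."""
    days_left = (expiry_ts - (time.time() if now is None else now)) / 86400
    if days_left < 3:
        return "critical"  # page -- 3 days is urgent
    if days_left < 14:
        return "warning"   # ticket, not page
    return None


now = 1_700_000_000
assert cert_severity(now + 30 * 86400, now=now) is None
assert cert_severity(now + 10 * 86400, now=now) == "warning"
assert cert_severity(now + 2 * 86400, now=now) == "critical"
```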
### Deployment Failure

```yaml
- alert: DeploymentNotConverged
  expr: |
    kube_deployment_status_observed_generation
      != kube_deployment_metadata_generation
  for: 10m
  labels:
    severity: warning  # not yet user-facing, but investigate
  annotations:
    summary: "Deployment {{ $labels.deployment }} has not converged for 10m (possible failed rollout or rollback)"
```

Strictly, a generation mismatch means the controller has not yet processed the latest deployment spec; sustained for ten minutes, it indicates a stuck or failed rollout rather than a confirmed rollback, so the annotation hedges accordingly.
## Alert Testing
Alerts are code. They deserve testing like any other code.
### Unit Testing Alert Rules
```python
# test_alert_rules.py
"""
Test Prometheus alerting rules via promtool's rule-test format.
"""
import subprocess
import tempfile

import yaml


def test_fast_burn_alert_fires_on_high_error_rate():
    """Verify the fast-burn alert fires when the error rate exceeds the 14.4x threshold."""
    # promtool expects a top-level file with `rule_files` and `tests` keys
    test_file = {
        "rule_files": ["alert_rules.yaml"],  # the rules under test
        "evaluation_interval": "1m",
        "tests": [
            {
                "interval": "1m",
                "input_series": [
                    {
                        "series": 'http_requests_total{service="checkout",status="500"}',
                        "values": "0+10x60",  # 10 errors per minute for 60 minutes
                    },
                    {
                        "series": 'http_requests_total{service="checkout",status="200"}',
                        "values": "0+100x60",  # 100 successes per minute
                    },
                ],
                "alert_rule_test": [
                    {
                        "eval_time": "10m",
                        "alertname": "HighErrorRateFastBurn",
                        "exp_alerts": [
                            {"exp_labels": {"severity": "critical", "team": "checkout"}}
                        ],
                    }
                ],
            }
        ],
    }
    # Write the test file and run promtool against it
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        yaml.dump(test_file, f)
    result = subprocess.run(
        ["promtool", "test", "rules", f.name],
        capture_output=True, text=True,
    )
    assert result.returncode == 0, f"Alert test failed: {result.stderr}"
```
### Chaos-Based Alert Testing
The best way to test alerts is to trigger the conditions they detect:
- Run a chaos experiment (kill pods, inject latency)
- Verify the expected alert fires within the expected timeframe
- Verify the runbook link is correct and the runbook is up to date
- Verify the alert resolves when the chaos experiment ends
This is a natural extension of game day exercises -- include "did the right alert fire?" as a success criterion.
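The "did the right alert fire?" check from the list above amounts to polling the alert source until the alert appears or a deadline passes. A sketch with the alert fetcher injected, so the same helper works against a real Alertmanager API client or a test stub; the helper name and polling interval are illustrative:

```python
import time


def wait_for_alert(get_active_alerts, alertname: str,
                   timeout_s: float = 300, poll_s: float = 10) -> bool:
    """Return True if `alertname` appears among active alerts before timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if any(a.get("alertname") == alertname for a in get_active_alerts()):
            return True
        time.sleep(poll_s)
    return False


# Stub standing in for the Alertmanager API during a game day dry run:
responses = iter([[], [{"alertname": "WebAppHighErrorRate_FastBurn"}]])
assert wait_for_alert(lambda: next(responses), "WebAppHighErrorRate_FastBurn",
                      timeout_s=1, poll_s=0.01) is True
```

Running the same helper after the chaos experiment ends, with the assertion inverted, covers the "alert resolves" check.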
## Alert Hygiene Practices
### Weekly Alert Review
Every week, review the past week's alerts:
| Question | Action if "Yes" |
|---|---|
| Did any page not require action? | Downgrade to ticket or dashboard |
| Did any page go unacknowledged > 10 min? | Check routing and on-call assignment |
| Were there > 5 pages from the same alert? | Add deduplication or increase thresholds |
| Did any incident go undetected? | Add a new alert for the gap |
| Were any alerts flapping (firing/resolving repeatedly)? | Add hysteresis or increase for duration |
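Most rows in the review table can be answered mechanically from the week's alert history. As one example, flapping detection reduces to counting how often each alert fired; the `(name, state)` event format is an assumption about your alert history export:

```python
from collections import Counter


def flapping_alerts(events, max_cycles: int = 3):
    """Return alert names that fired more than `max_cycles` times in the window."""
    fired = Counter()
    for name, state in events:
        if state == "firing":
            fired[name] += 1
    return sorted(name for name, count in fired.items() if count > max_cycles)


week = [("DiskFull", "firing"), ("DiskFull", "resolved")] * 5 \
     + [("HighErrorRate", "firing")]
assert flapping_alerts(week) == ["DiskFull"]
```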
### Alert SLOs
Yes, your alerting system itself should have SLOs:
| Metric | Target |
|---|---|
| False positive rate | < 20% of pages |
| Mean time to acknowledge | < 5 minutes |
| Alert-to-action ratio | > 70% |
| Pages per on-call shift | < 5 |
| Undetected incidents | 0 |
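The first two targets in the table can be computed directly from page records, provided each page is annotated during the weekly review with whether it required action. The record field name below is an assumption:

```python
def alerting_slo_report(pages):
    """Compute false-positive rate and alert-to-action ratio from page records."""
    total = len(pages)
    actionable = sum(p["required_action"] for p in pages)
    return {
        "false_positive_rate": (total - actionable) / total if total else 0.0,
        "alert_to_action_ratio": actionable / total if total else 1.0,
    }


pages = [{"required_action": True}] * 8 + [{"required_action": False}] * 2
report = alerting_slo_report(pages)
assert report["false_positive_rate"] == 0.2   # right at the 20% boundary
assert report["alert_to_action_ratio"] == 0.8  # above the > 70% target
```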
Good alert design is a continuous practice, not a one-time configuration. Treat your alerts with the same rigor you treat your test suite.