# Alert Design: Detecting Real Problems Without Alert Fatigue
## The Alert Fatigue Problem
Alert fatigue -- the condition where on-call engineers ignore alerts because most of them are false positives -- is the enemy of observability-driven testing. Good alert design is as important as good test design. A test suite that produces too many false positives gets ignored; an alerting system that produces too many false positives gets the same treatment.
## Core Alert Design Principles
### 1. Alert on Symptoms, Not Causes
Alert on what users experience, not what the infrastructure is doing:
| Symptom (Good Alert) | Cause (Bad Alert) |
|---|---|
| "Error rate > 1% for checkout API" | "CPU > 80% on checkout-pod-3" |
| "p99 latency > 2s for search" | "Memory usage > 90% on search-worker" |
| "Zero orders processed in 5 minutes" | "Database connection pool at 95%" |
High CPU that causes no user impact is not alert-worthy. Database connection pool at 95% might be perfectly normal under load. Focus on what the user sees.
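The symptom-first rule can be sketched in a few lines: evaluate what users experience (error ratio) rather than machine-level stats. The function name and the 1% threshold below are illustrative, not from any particular system.

```python
def should_alert(total_requests: int, failed_requests: int,
                 error_budget: float = 0.01) -> bool:
    """Fire only when the user-facing error ratio exceeds the budget."""
    if total_requests == 0:
        return False  # no traffic means no user impact
    return failed_requests / total_requests > error_budget


# High CPU with a 0.2% error ratio does not fire; a 5% error ratio does.
assert should_alert(total_requests=10_000, failed_requests=20) is False
assert should_alert(total_requests=10_000, failed_requests=500) is True
```

Note that CPU, memory, and connection-pool figures never appear in the decision: they may explain *why* the error ratio rose, but they are diagnosis inputs, not alert conditions.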
### 2. Use Multi-Window, Multi-Burn-Rate Alerts
Instead of a single threshold, use a tiered approach that catches both fast-burn (sudden outage) and slow-burn (gradual degradation):
| Alert Type | Short Window | Long Window | For Duration | Severity |
|---|---|---|---|---|
| Fast burn | 5 min | 1 hour | 2 min | Critical (page) |
| Moderate | 15 min | 3 hours | 5 min | High (page) |
| Slow burn | 30 min | 6 hours | 15 min | Warning (ticket) |
The dual-window requirement reduces false positives: a brief spike triggers the short window but not the long window, so it does not fire.
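The dual-window condition from the table can be sketched as follows. The error ratios per window are assumed to come from your metrics backend; here they are plain arguments, and the 0.1% error budget corresponds to a 99.9% SLO.

```python
SLO_ERROR_BUDGET = 0.001  # 99.9% availability target

def burn_alert(short_ratio: float, long_ratio: float, burn_rate: float) -> bool:
    """Fire only when BOTH windows burn the budget at `burn_rate`x or faster."""
    threshold = burn_rate * SLO_ERROR_BUDGET
    return short_ratio > threshold and long_ratio > threshold


# A brief spike: the 5-minute window is hot, the 1-hour window is not -> no page.
assert burn_alert(short_ratio=0.05, long_ratio=0.002, burn_rate=14.4) is False
# A sustained outage: both windows are hot -> page.
assert burn_alert(short_ratio=0.05, long_ratio=0.03, burn_rate=14.4) is True
```

The 14.4x figure is the fast-burn multiplier: at that rate a 30-day error budget is exhausted in roughly two days (30 / 14.4 ≈ 2.1).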
### 3. Every Alert Must Have a Runbook
An alert without a runbook is a question without an answer. The runbook should include:

```markdown
## Runbook: Checkout Error Rate > SLO

### What this alert means
The checkout service error rate has exceeded the SLO burn rate for the
specified window, indicating a potential reliability issue.

### Impact
Users may be unable to complete purchases. Revenue impact is proportional
to the duration and severity.

### Diagnosis steps
1. Check the error rate dashboard: [link]
2. Check recent deployments: `kubectl rollout history deployment/checkout`
3. Check dependency health: [payment-service dashboard link]
4. Check logs: `query: service=checkout level=error | last 15m`

### Common causes and fixes
- **Recent deployment**: Roll back with `kubectl rollout undo deployment/checkout`
- **Payment provider outage**: Enable fallback payment processor
- **Database connection exhaustion**: Scale up connection pool
- **Rate limiting from downstream**: Check rate limit headers

### Escalation
If not resolved within 30 minutes, escalate to:
- #checkout-team Slack channel
- Checkout team lead (page)
```
### 4. Page Only for Immediate Human Action
| Action Needed | Notification Type | Response Time |
|---|---|---|
| Immediate human intervention | Page (PagerDuty) | Minutes |
| Fix within a day | Ticket (Jira) | Hours |
| Awareness, no action | Dashboard / weekly report | Days |
If the answer to "what should the on-call do?" is "nothing, it will resolve itself," it should not be a page.
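The routing rule in the table reduces to two questions. A minimal sketch, with placeholder channel names:

```python
def route(needs_immediate_action: bool, needs_fix_today: bool) -> str:
    """Map 'what should the on-call do?' to a notification channel."""
    if needs_immediate_action:
        return "page"       # wake someone up (PagerDuty)
    if needs_fix_today:
        return "ticket"     # fix within a day (Jira)
    return "dashboard"      # awareness only, review weekly


assert route(needs_immediate_action=True, needs_fix_today=True) == "page"
assert route(needs_immediate_action=False, needs_fix_today=True) == "ticket"
assert route(needs_immediate_action=False, needs_fix_today=False) == "dashboard"
```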
## Designing Alerts for Common Scenarios
### Web Application SLO Alerts

```yaml
# Best practice: multi-burn-rate SLO alerts (0.001 = error budget for a 99.9% SLO)
groups:
  - name: web-app-slo
    rules:
      - alert: WebAppHighErrorRate_FastBurn
        expr: |
          (error_ratio_5m > 14.4 * 0.001) and (error_ratio_1h > 14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate burning budget at 14.4x (2-day exhaustion)"

      - alert: WebAppHighErrorRate_SlowBurn
        expr: |
          (error_ratio_30m > 3 * 0.001) and (error_ratio_6h > 3 * 0.001)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error rate burning budget at 3x (10-day exhaustion)"
```
### Certificate Expiration

```yaml
- alert: CertificateExpiringSoon
  expr: |
    (cert_expiry_timestamp_seconds - time()) / 86400 < 14
  labels:
    severity: warning  # ticket, not page
  annotations:
    summary: "TLS certificate expires in {{ $value | humanize }} days"

- alert: CertificateExpiringCritical
  expr: |
    (cert_expiry_timestamp_seconds - time()) / 86400 < 3
  labels:
    severity: critical  # page -- 3 days is urgent
```

Note the expression already converts seconds to days, so the annotation prints `$value` as days rather than passing it through `humanizeDuration` (which would misinterpret it as seconds).
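The same days-to-expiry check is useful outside the alerting pipeline, e.g. in a pre-deploy script. A sketch mirroring the two rules above (the function name is illustrative):

```python
import time
from typing import Optional


def cert_severity(expiry_ts: float, now: Optional[float] = None) -> Optional[str]:
    """Mirror of the PromQL rules: (expiry - now) / 86400 = days remaining."""
    days_left = (expiry_ts - (time.time() if now is None else now)) / 86400
    if days_left < 3:
        return "critical"  # page -- 3 days is urgent
    if days_left < 14:
        return "warning"   # ticket, not page
    return None


now = 1_700_000_000
assert cert_severity(now + 30 * 86400, now=now) is None
assert cert_severity(now + 10 * 86400, now=now) == "warning"
assert cert_severity(now + 2 * 86400, now=now) == "critical"
```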
### Deployment Failure

```yaml
- alert: DeploymentNotConverged
  expr: |
    kube_deployment_status_observed_generation
      != kube_deployment_metadata_generation
  for: 10m
  labels:
    severity: warning  # not yet user-facing, but investigate
  annotations:
    summary: "Deployment {{ $labels.deployment }} has not converged for 10m (possible failed rollout or rollback)"
```

Strictly, a generation mismatch means the controller has not yet processed the latest deployment spec; sustained for ten minutes, it indicates a stuck or failed rollout rather than a confirmed rollback, so the annotation hedges accordingly.
## Alert Testing
Alerts are code. They deserve testing like any other code.
### Unit Testing Alert Rules
```python
# test_alert_rules.py
"""
Test Prometheus alerting rules via promtool's rule-test format.
"""
import subprocess
import tempfile

import yaml


def test_fast_burn_alert_fires_on_high_error_rate():
    """Verify the fast-burn alert fires when the error rate exceeds the 14.4x threshold."""
    # promtool expects a top-level file with `rule_files` and `tests` keys
    test_file = {
        "rule_files": ["alert_rules.yaml"],  # the rules under test
        "evaluation_interval": "1m",
        "tests": [
            {
                "interval": "1m",
                "input_series": [
                    {
                        "series": 'http_requests_total{service="checkout",status="500"}',
                        "values": "0+10x60",  # 10 errors per minute for 60 minutes
                    },
                    {
                        "series": 'http_requests_total{service="checkout",status="200"}',
                        "values": "0+100x60",  # 100 successes per minute
                    },
                ],
                "alert_rule_test": [
                    {
                        "eval_time": "10m",
                        "alertname": "HighErrorRateFastBurn",
                        "exp_alerts": [
                            {"exp_labels": {"severity": "critical", "team": "checkout"}}
                        ],
                    }
                ],
            }
        ],
    }
    # Write the test file and run promtool against it
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        yaml.dump(test_file, f)
    result = subprocess.run(
        ["promtool", "test", "rules", f.name],
        capture_output=True, text=True,
    )
    assert result.returncode == 0, f"Alert test failed: {result.stderr}"
```
### Chaos-Based Alert Testing
The best way to test alerts is to trigger the conditions they detect:
- Run a chaos experiment (kill pods, inject latency)
- Verify the expected alert fires within the expected timeframe
- Verify the runbook link is correct and the runbook is up to date
- Verify the alert resolves when the chaos experiment ends
This is a natural extension of game day exercises -- include "did the right alert fire?" as a success criterion.
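The "did the right alert fire?" check from the list above amounts to polling the alert source until the alert appears or a deadline passes. A sketch with the alert fetcher injected, so the same helper works against a real Alertmanager API client or a test stub; the helper name and polling interval are illustrative:

```python
import time


def wait_for_alert(get_active_alerts, alertname: str,
                   timeout_s: float = 300, poll_s: float = 10) -> bool:
    """Return True if `alertname` appears among active alerts before timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if any(a.get("alertname") == alertname for a in get_active_alerts()):
            return True
        time.sleep(poll_s)
    return False


# Stub standing in for the Alertmanager API during a game day dry run:
responses = iter([[], [{"alertname": "WebAppHighErrorRate_FastBurn"}]])
assert wait_for_alert(lambda: next(responses), "WebAppHighErrorRate_FastBurn",
                      timeout_s=1, poll_s=0.01) is True
```

Running the same helper after the chaos experiment ends, with the assertion inverted, covers the "alert resolves" check.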
## Alert Hygiene Practices
### Weekly Alert Review
Every week, review the past week's alerts:
| Question | Action if "Yes" |
|---|---|
| Did any page not require action? | Downgrade to ticket or dashboard |
| Did any page go unacknowledged > 10 min? | Check routing and on-call assignment |
| Were there > 5 pages from the same alert? | Add deduplication or increase thresholds |
| Did any incident go undetected? | Add a new alert for the gap |
| Were any alerts flapping (firing/resolving repeatedly)? | Add hysteresis or increase for duration |
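Most rows in the review table can be answered mechanically from the week's alert history. As one example, flapping detection reduces to counting how often each alert fired; the `(name, state)` event format is an assumption about your alert history export:

```python
from collections import Counter


def flapping_alerts(events, max_cycles: int = 3):
    """Return alert names that fired more than `max_cycles` times in the window."""
    fired = Counter()
    for name, state in events:
        if state == "firing":
            fired[name] += 1
    return sorted(name for name, count in fired.items() if count > max_cycles)


week = [("DiskFull", "firing"), ("DiskFull", "resolved")] * 5 \
     + [("HighErrorRate", "firing")]
assert flapping_alerts(week) == ["DiskFull"]
```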
### Alert SLOs
Yes, your alerting system itself should have SLOs:
| Metric | Target |
|---|---|
| False positive rate | < 20% of pages |
| Mean time to acknowledge | < 5 minutes |
| Alert-to-action ratio | > 70% |
| Pages per on-call shift | < 5 |
| Undetected incidents | 0 |
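The first two targets in the table can be computed directly from page records, provided each page is annotated during the weekly review with whether it required action. The record field name below is an assumption:

```python
def alerting_slo_report(pages):
    """Compute false-positive rate and alert-to-action ratio from page records."""
    total = len(pages)
    actionable = sum(p["required_action"] for p in pages)
    return {
        "false_positive_rate": (total - actionable) / total if total else 0.0,
        "alert_to_action_ratio": actionable / total if total else 1.0,
    }


pages = [{"required_action": True}] * 8 + [{"required_action": False}] * 2
report = alerting_slo_report(pages)
assert report["false_positive_rate"] == 0.2   # right at the 20% boundary
assert report["alert_to_action_ratio"] == 0.8  # above the > 70% target
```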
Good alert design is a continuous practice, not a one-time configuration. Treat your alerts with the same rigor you treat your test suite.