Metrics and Alerting
The Role of Metrics in Observability
Metrics are aggregated numerical measurements over time. Unlike logs (one entry per event) or traces (one per request), metrics are pre-aggregated, making them cheap to store, fast to query, and ideal for dashboards and alerting. They answer the question: "Is this system healthy right now?"
The Three Pillars Compared
| Pillar | What It Captures | Best For | Cardinality | Storage Cost |
|---|---|---|---|---|
| Logs | Discrete events with context | Debugging specific incidents | High (one per event) | High |
| Metrics | Aggregated numerical measurements | Alerting, dashboards, trends | Low (pre-aggregated) | Low |
| Traces | Request flow across services | Understanding distributed behavior | Medium (sampled) | Medium |
Metrics are the first line of defense: they tell you something is wrong. Traces tell you where. Logs tell you why.
Prometheus Metric Types
Prometheus is the de facto standard for metrics collection in cloud-native environments:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing value | http_requests_total | Request count, error count |
| Gauge | Value that can go up or down | temperature_celsius | Queue depth, active connections |
| Histogram | Observations bucketed by value | http_request_duration_seconds | Latency distributions |
| Summary | Pre-calculated percentiles | request_duration_seconds{quantile="0.99"} | Client-side percentiles |
Choosing Between Histogram and Summary
- Histogram: Use when you need server-side aggregation (multiple pods, dashboards). Prometheus can calculate percentiles across instances.
- Summary: Use for client-side percentiles when you do not need cross-instance aggregation.
In most cases, choose Histogram. It is more flexible and works with Prometheus recording rules and alerts.
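The reason Histogram aggregates and Summary does not can be shown with a small dependency-free sketch: bucket counts from two pods can simply be summed before estimating a percentile, whereas each pod's locally computed p50 cannot be meaningfully averaged. The bucket bounds and latencies below are illustrative.

```python
# Pure-Python sketch (no Prometheus dependency) of why histogram buckets
# aggregate across instances while pre-computed percentiles do not.
from bisect import bisect_left

BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0]  # "le" upper bounds, in seconds

def to_buckets(observations):
    """Count observations per bucket, then make counts cumulative,
    like the _bucket{le=...} series a Prometheus histogram exposes."""
    counts = [0] * (len(BUCKETS) + 1)  # last slot = +Inf
    for v in observations:
        counts[bisect_left(BUCKETS, v)] += 1
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts

def quantile(counts, q):
    """Crude histogram_quantile(): first bucket bound whose cumulative
    count covers the q-th fraction of all observations."""
    target = q * counts[-1]
    for bound, c in zip(BUCKETS + [float('inf')], counts):
        if c >= target:
            return bound
    return float('inf')

# Two pods' latencies: their bucket counts can simply be summed...
pod_a = to_buckets([0.02, 0.04, 0.3, 0.9])
pod_b = to_buckets([0.03, 0.2, 0.6, 1.5])
merged = [a + b for a, b in zip(pod_a, pod_b)]
print(quantile(merged, 0.5))  # → 0.25, a fleet-wide median estimate
# ...whereas averaging each pod's own pre-computed p50 (the Summary
# situation) yields a statistically meaningless number.
```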
Instrumenting Application Metrics
```python
# metrics_setup.py -- Application metrics with Prometheus client
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Request metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Business metrics
orders_created = Counter(
    'orders_created_total',
    'Total orders created',
    ['payment_method', 'status']
)

active_sessions = Gauge(
    'active_sessions',
    'Number of active user sessions'
)

# Start metrics endpoint
start_http_server(8000)  # serves /metrics on port 8000
```
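In practice these metrics are updated from request-handling code, typically via a decorator or middleware. The sketch below shows the wiring pattern with a `record()` stub standing in for the actual `Counter.labels(...).inc()` and `Histogram.labels(...).observe()` calls; the handler and label values are illustrative.

```python
# Sketch of the usual instrumentation pattern: a decorator that times each
# handler and records method/path/status labels. record() is a stand-in
# for the prometheus_client calls named in the comment above.
import time
from functools import wraps

recorded = []  # stand-in for the Prometheus registry

def record(metric, labels, value):
    recorded.append((metric, labels, value))

def instrumented(method, path):
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = 500  # assume failure until the handler returns
            try:
                status, body = handler(*args, **kwargs)
                return status, body
            finally:
                # Runs on success AND on exception, so errors are counted too
                record('http_requests_total', (method, path, str(status)), 1)
                record('http_request_duration_seconds', (method, path),
                       time.perf_counter() - start)
        return wrapper
    return decorator

@instrumented('GET', '/orders')
def list_orders():
    return 200, ['order-1']

list_orders()
```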
The USE and RED Methods
Two frameworks for choosing what to measure:
USE Method (for infrastructure):
- Utilization: How full is the resource? (CPU %, memory %, disk %)
- Saturation: How much extra work is queued? (queue depth, thread pool usage)
- Errors: How often does work fail? (I/O errors, connection failures)
RED Method (for services):
- Rate: How many requests per second?
- Errors: How many of those requests fail?
- Duration: How long do those requests take?
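The three RED signals can be computed from a window of raw request records. A minimal sketch, with a hypothetical list of (status, duration) tuples and a crude nearest-rank p99:

```python
# Computing Rate, Errors, and Duration from one window of request records.
def red_signals(requests, window_seconds):
    """requests: list of (status_code, duration_seconds) seen in the window."""
    rate = len(requests) / window_seconds                          # Rate: req/s
    errors = sum(1 for s, _ in requests if s >= 500) / max(len(requests), 1)
    durations = sorted(d for _, d in requests)                     # Duration: p99
    p99 = durations[min(int(0.99 * len(durations)), len(durations) - 1)] if durations else 0.0
    return rate, errors, p99

reqs = [(200, 0.05), (200, 0.08), (500, 0.30), (200, 0.04)]
rate, err, p99 = red_signals(reqs, window_seconds=2)
print(rate, err, p99)  # → 2.0 req/s, 0.25 error ratio, 0.3 s p99
```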
Multi-Burn-Rate Alerts
One of the most significant advances in alerting over the past decade is the multi-burn-rate alert. Instead of a single threshold ("error rate > 1%"), it detects both fast-burn (sudden outage) and slow-burn (gradual degradation) problems.
How Burn Rate Works
If your SLO allows 0.1% errors over 30 days, a burn rate of 1x means you are consuming your error budget at exactly the rate that exhausts it at the end of the 30-day window; higher burn rates exhaust it proportionally sooner.
| Burn Rate | What It Means | Budget Duration |
|---|---|---|
| 1x | Sustainable rate | Budget lasts 30 days |
| 3x | Slow burn | Budget exhausted in 10 days |
| 6x | Moderate burn | Budget exhausted in 5 days |
| 14.4x | Fast burn | Budget exhausted in 2 days |
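The arithmetic behind the table is simple: burn rate is the observed error rate divided by the SLO's allowed error rate, and the budget lasts the SLO window divided by the burn rate. A sketch, assuming the 99.9%/30-day SLO used throughout this section:

```python
# Burn-rate arithmetic for a 99.9% SLO over a 30-day window.
SLO_ERROR_BUDGET = 0.001   # 99.9% SLO → 0.1% of requests may fail
WINDOW_DAYS = 30

def burn_rate(observed_error_rate):
    """How many times faster than sustainable we are spending the budget."""
    return observed_error_rate / SLO_ERROR_BUDGET

def days_until_exhausted(observed_error_rate):
    """How long the 30-day budget lasts at the current burn rate."""
    return WINDOW_DAYS / burn_rate(observed_error_rate)

print(round(burn_rate(0.0144), 1))             # 14.4
print(round(days_until_exhausted(0.0144), 2))  # 2.08 days, matching the table
```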
Prometheus Alert Rules
```yaml
# prometheus-alerting-rules.yaml
groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming error budget at 14.4x the sustainable rate
      # Will exhaust 30-day budget in 2 days if unchecked
      - alert: HighErrorRateFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate burning budget fast (14.4x)"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"
          dashboard: "https://grafana.internal/d/checkout-slo"

      # Slow burn: consuming at 3x the sustainable rate
      # Will exhaust 30-day budget in 10 days if unchecked
      - alert: HighErrorRateSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[30m]))
            /
            sum(rate(http_requests_total{service="checkout"}[30m]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[6h]))
            /
            sum(rate(http_requests_total{service="checkout"}[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout error rate burning budget slowly (3x)"
          runbook: "https://wiki.internal/runbooks/checkout-elevated-errors"
```
The dual-window technique (short window AND long window) reduces false positives. A brief spike that resolves quickly will trigger the short window but not the long window, preventing unnecessary pages.
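The dual-window logic reduces to a conjunction: page only when both windows exceed the burn-rate threshold. A sketch with illustrative error rates, using the same 0.1% budget and 14.4x factor as the rule above:

```python
# Dual-window check: both the short AND the long window must exceed
# the burn-rate threshold before anyone is paged.
SLO_ERROR_BUDGET = 0.001  # 99.9% SLO

def should_page(short_window_error_rate, long_window_error_rate, burn_factor=14.4):
    threshold = burn_factor * SLO_ERROR_BUDGET  # 0.0144 for fast burn
    return short_window_error_rate > threshold and long_window_error_rate > threshold

# Sustained outage: both windows elevated → page.
print(should_page(0.05, 0.03))   # True
# Brief spike, already resolving: short window elevated, long window not → no page.
print(should_page(0.05, 0.002))  # False
```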
Alert Design Principles
Alert on symptoms, not causes. Alert on "users are seeing errors" not "CPU is high." High CPU that causes no user impact is not an alert-worthy event.
Use multi-window, multi-burn-rate alerts. Catch both sudden outages and gradual degradation.
Every alert must have a runbook. If there is no documented response procedure, the alert is not ready for production.
Page only for things that need immediate human action. Everything else should be a ticket, a dashboard, or a weekly report.
Alert Classification Matrix
| Signal | Page (wake someone up) | Ticket (fix this week) | Dashboard (awareness) |
|---|---|---|---|
| Burn rate > 10x (fast burn) | Yes | -- | -- |
| Burn rate > 3x (slow burn) | -- | Yes | Yes |
| Latency p99 > 2x SLO | Depends on duration | Yes | Yes |
| Single pod crash | No | No | Yes |
| Deployment failed | No (auto-rollback) | Yes | Yes |
| Certificate expiring in 7 days | No | Yes | -- |
| Disk 90% full | -- | Yes | Yes |
| Disk 98% full | Yes | -- | -- |
Alert Fatigue: The Enemy of Observability
Alert fatigue occurs when on-call engineers receive so many alerts that they begin ignoring them. This is the single biggest risk to an observability-driven testing program.
Diagnosing Alert Fatigue
| Symptom | Indicates |
|---|---|
| >10 pages per on-call shift | Too many alerts, insufficient filtering |
| >50% of pages are "no action needed" | Too many false positives |
| Average acknowledgment time > 10 minutes | Engineers are ignoring alerts |
| Snooze rate > 30% | Alerts are not actionable |
Curing Alert Fatigue
- Audit every alert. For each alert, ask: "Did this require immediate human action?" If the answer is consistently "no," downgrade from page to ticket.
- Group correlated alerts. If a single incident triggers 15 alerts, create a meta-alert that groups them.
- Set maintenance windows. During deployments, suppress known transient alerts.
- Review monthly. Track alert-to-action ratio. Target >70% of pages resulting in meaningful action.
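The monthly review can be automated from a pager export. A minimal sketch, with a hypothetical list of pages and an illustrative `action_taken` field:

```python
# Computing the page-to-action ratio for the monthly alert audit.
pages = [  # hypothetical export from the paging system
    {'alert': 'HighErrorRateFastBurn', 'action_taken': True},
    {'alert': 'SinglePodCrash',        'action_taken': False},
    {'alert': 'DiskAlmostFull',        'action_taken': True},
    {'alert': 'FlappingHealthCheck',   'action_taken': False},
]

def action_ratio(pages):
    """Fraction of pages that led to meaningful human action."""
    return sum(p['action_taken'] for p in pages) / len(pages)

print(action_ratio(pages))  # 0.5 -- below the 70% target, so audit the noisy alerts
```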
Metrics and alerting are the real-time nervous system of your production environment. Well-designed alerts catch problems before users notice; poorly designed alerts create noise that hides real problems.