Metrics and Alerting
The Role of Metrics in Observability
Metrics are aggregated numerical measurements over time. Unlike logs (one entry per event) or traces (one per request), metrics are pre-aggregated, making them cheap to store, fast to query, and ideal for dashboards and alerting. They answer the question: "Is this system healthy right now?"
The Three Pillars Compared
| Pillar | What It Captures | Best For | Cardinality | Storage Cost |
|---|---|---|---|---|
| Logs | Discrete events with context | Debugging specific incidents | High (one per event) | High |
| Metrics | Aggregated numerical measurements | Alerting, dashboards, trends | Low (pre-aggregated) | Low |
| Traces | Request flow across services | Understanding distributed behavior | Medium (sampled) | Medium |
Metrics are the first line of defense: they tell you something is wrong. Traces tell you where. Logs tell you why.
Prometheus Metric Types
Prometheus is the de facto standard for metrics collection in cloud-native environments:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing value | http_requests_total | Request count, error count |
| Gauge | Value that can go up or down | temperature_celsius | Queue depth, active connections |
| Histogram | Observations bucketed by value | http_request_duration_seconds | Latency distributions |
| Summary | Pre-calculated percentiles | request_duration_seconds{quantile="0.99"} | Client-side percentiles |
Choosing Between Histogram and Summary
- Histogram: Use when you need server-side aggregation (multiple pods, dashboards). Prometheus can calculate percentiles across instances.
- Summary: Use for client-side percentiles when you do not need cross-instance aggregation.
In most cases, choose Histogram. It is more flexible and works with Prometheus recording rules and alerts.
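The reason Histogram aggregates and Summary does not can be shown with a small dependency-free sketch: bucket counts from two pods can simply be summed before estimating a percentile, whereas each pod's locally computed p50 cannot be meaningfully averaged. The bucket bounds and latencies below are illustrative.

```python
# Pure-Python sketch (no Prometheus dependency) of why histogram buckets
# aggregate across instances while pre-computed percentiles do not.
from bisect import bisect_left

BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0]  # "le" upper bounds, in seconds

def to_buckets(observations):
    """Count observations per bucket, then make counts cumulative,
    like the _bucket{le=...} series a Prometheus histogram exposes."""
    counts = [0] * (len(BUCKETS) + 1)  # last slot = +Inf
    for v in observations:
        counts[bisect_left(BUCKETS, v)] += 1
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]
    return counts

def quantile(counts, q):
    """Crude histogram_quantile(): first bucket bound whose cumulative
    count covers the q-th fraction of all observations."""
    target = q * counts[-1]
    for bound, c in zip(BUCKETS + [float('inf')], counts):
        if c >= target:
            return bound
    return float('inf')

# Two pods' latencies: their bucket counts can simply be summed...
pod_a = to_buckets([0.02, 0.04, 0.3, 0.9])
pod_b = to_buckets([0.03, 0.2, 0.6, 1.5])
merged = [a + b for a, b in zip(pod_a, pod_b)]
print(quantile(merged, 0.5))  # → 0.25, a fleet-wide median estimate
# ...whereas averaging each pod's own pre-computed p50 (the Summary
# situation) yields a statistically meaningless number.
```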
Instrumenting Application Metrics
```python
# metrics_setup.py -- Application metrics with Prometheus client
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Request metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Business metrics
orders_created = Counter(
    'orders_created_total',
    'Total orders created',
    ['payment_method', 'status']
)

active_sessions = Gauge(
    'active_sessions',
    'Number of active user sessions'
)

# Start metrics endpoint
start_http_server(8000)  # serves /metrics on port 8000
```
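In practice these metrics are updated from request-handling code, typically via a decorator or middleware. The sketch below shows the wiring pattern with a `record()` stub standing in for the actual `Counter.labels(...).inc()` and `Histogram.labels(...).observe()` calls; the handler and label values are illustrative.

```python
# Sketch of the usual instrumentation pattern: a decorator that times each
# handler and records method/path/status labels. record() is a stand-in
# for the prometheus_client calls named in the comment above.
import time
from functools import wraps

recorded = []  # stand-in for the Prometheus registry

def record(metric, labels, value):
    recorded.append((metric, labels, value))

def instrumented(method, path):
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = 500  # assume failure until the handler returns
            try:
                status, body = handler(*args, **kwargs)
                return status, body
            finally:
                # Runs on success AND on exception, so errors are counted too
                record('http_requests_total', (method, path, str(status)), 1)
                record('http_request_duration_seconds', (method, path),
                       time.perf_counter() - start)
        return wrapper
    return decorator

@instrumented('GET', '/orders')
def list_orders():
    return 200, ['order-1']

list_orders()
```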
The USE and RED Methods
Two frameworks for choosing what to measure:
USE Method (for infrastructure):
- Utilization: How full is the resource? (CPU %, memory %, disk %)
- Saturation: How much extra work is queued? (queue depth, thread pool usage)
- Errors: How often does work fail? (I/O errors, connection failures)
RED Method (for services):
- Rate: How many requests per second?
- Errors: How many of those requests fail?
- Duration: How long do those requests take?
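The three RED signals can be computed from a window of raw request records. A minimal sketch, with a hypothetical list of (status, duration) tuples and a crude nearest-rank p99:

```python
# Computing Rate, Errors, and Duration from one window of request records.
def red_signals(requests, window_seconds):
    """requests: list of (status_code, duration_seconds) seen in the window."""
    rate = len(requests) / window_seconds                          # Rate: req/s
    errors = sum(1 for s, _ in requests if s >= 500) / max(len(requests), 1)
    durations = sorted(d for _, d in requests)                     # Duration: p99
    p99 = durations[min(int(0.99 * len(durations)), len(durations) - 1)] if durations else 0.0
    return rate, errors, p99

reqs = [(200, 0.05), (200, 0.08), (500, 0.30), (200, 0.04)]
rate, err, p99 = red_signals(reqs, window_seconds=2)
print(rate, err, p99)  # → 2.0 req/s, 0.25 error ratio, 0.3 s p99
```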
Multi-Burn-Rate Alerts
One of the most significant advances in alerting over the past decade is the multi-burn-rate alert. Instead of a single threshold ("error rate > 1%"), it detects both fast-burn (sudden outage) and slow-burn (gradual degradation) problems.
How Burn Rate Works
If your SLO allows 0.1% errors over 30 days, a burn rate of 1x means you are consuming your error budget at exactly the rate that exhausts it at the end of the 30-day window; higher burn rates exhaust it proportionally sooner.
| Burn Rate | What It Means | Budget Duration |
|---|---|---|
| 1x | Sustainable rate | Budget lasts 30 days |
| 3x | Slow burn | Budget exhausted in 10 days |
| 6x | Moderate burn | Budget exhausted in 5 days |
| 14.4x | Fast burn | Budget exhausted in 2 days |
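The arithmetic behind the table is simple: burn rate is the observed error rate divided by the SLO's allowed error rate, and the budget lasts the SLO window divided by the burn rate. A sketch, assuming the 99.9%/30-day SLO used throughout this section:

```python
# Burn-rate arithmetic for a 99.9% SLO over a 30-day window.
SLO_ERROR_BUDGET = 0.001   # 99.9% SLO → 0.1% of requests may fail
WINDOW_DAYS = 30

def burn_rate(observed_error_rate):
    """How many times faster than sustainable we are spending the budget."""
    return observed_error_rate / SLO_ERROR_BUDGET

def days_until_exhausted(observed_error_rate):
    """How long the 30-day budget lasts at the current burn rate."""
    return WINDOW_DAYS / burn_rate(observed_error_rate)

print(round(burn_rate(0.0144), 1))             # 14.4
print(round(days_until_exhausted(0.0144), 2))  # 2.08 days, matching the table
```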
Prometheus Alert Rules
```yaml
# prometheus-alerting-rules.yaml
groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming error budget at 14.4x the sustainable rate
      # Will exhaust 30-day budget in 2 days if unchecked
      - alert: HighErrorRateFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate burning budget fast (14.4x)"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"
          dashboard: "https://grafana.internal/d/checkout-slo"

      # Slow burn: consuming at 3x the sustainable rate
      # Will exhaust 30-day budget in 10 days if unchecked
      - alert: HighErrorRateSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[30m]))
            /
            sum(rate(http_requests_total{service="checkout"}[30m]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[6h]))
            /
            sum(rate(http_requests_total{service="checkout"}[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout error rate burning budget slowly (3x)"
          runbook: "https://wiki.internal/runbooks/checkout-elevated-errors"
```
The dual-window technique (short window AND long window) reduces false positives. A brief spike that resolves quickly will trigger the short window but not the long window, preventing unnecessary pages.
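The dual-window logic reduces to a conjunction: page only when both windows exceed the burn-rate threshold. A sketch with illustrative error rates, using the same 0.1% budget and 14.4x factor as the rule above:

```python
# Dual-window check: both the short AND the long window must exceed
# the burn-rate threshold before anyone is paged.
SLO_ERROR_BUDGET = 0.001  # 99.9% SLO

def should_page(short_window_error_rate, long_window_error_rate, burn_factor=14.4):
    threshold = burn_factor * SLO_ERROR_BUDGET  # 0.0144 for fast burn
    return short_window_error_rate > threshold and long_window_error_rate > threshold

# Sustained outage: both windows elevated → page.
print(should_page(0.05, 0.03))   # True
# Brief spike, already resolving: short window elevated, long window not → no page.
print(should_page(0.05, 0.002))  # False
```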
Alert Design Principles
Alert on symptoms, not causes. Alert on "users are seeing errors" not "CPU is high." High CPU that causes no user impact is not an alert-worthy event.
Use multi-window, multi-burn-rate alerts. Catch both sudden outages and gradual degradation.
Every alert must have a runbook. If there is no documented response procedure, the alert is not ready for production.
Page only for things that need immediate human action. Everything else should be a ticket, a dashboard, or a weekly report.
Alert Classification Matrix
| Signal | Page (wake someone up) | Ticket (fix this week) | Dashboard (awareness) |
|---|---|---|---|
| Burn rate > 10x (fast burn) | Yes | -- | -- |
| Burn rate > 3x (slow burn) | -- | Yes | Yes |
| Latency p99 > 2x SLO | Depends on duration | Yes | Yes |
| Single pod crash | No | No | Yes |
| Deployment failed | No (auto-rollback) | Yes | Yes |
| Certificate expiring in 7 days | No | Yes | -- |
| Disk 90% full | -- | Yes | Yes |
| Disk 98% full | Yes | -- | -- |
Alert Fatigue: The Enemy of Observability
Alert fatigue occurs when on-call engineers receive so many alerts that they begin ignoring them. This is the single biggest risk to an observability-driven testing program.
Diagnosing Alert Fatigue
| Symptom | Indicates |
|---|---|
| >10 pages per on-call shift | Too many alerts, insufficient filtering |
| >50% of pages are "no action needed" | Too many false positives |
| Average acknowledgment time > 10 minutes | Engineers are ignoring alerts |
| Snooze rate > 30% | Alerts are not actionable |
Curing Alert Fatigue
- Audit every alert. For each alert, ask: "Did this require immediate human action?" If the answer is consistently "no," downgrade from page to ticket.
- Group correlated alerts. If a single incident triggers 15 alerts, create a meta-alert that groups them.
- Set maintenance windows. During deployments, suppress known transient alerts.
- Review monthly. Track alert-to-action ratio. Target >70% of pages resulting in meaningful action.
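The monthly review can be automated from a pager export. A minimal sketch, with a hypothetical list of pages and an illustrative `action_taken` field:

```python
# Computing the page-to-action ratio for the monthly alert audit.
pages = [  # hypothetical export from the paging system
    {'alert': 'HighErrorRateFastBurn', 'action_taken': True},
    {'alert': 'SinglePodCrash',        'action_taken': False},
    {'alert': 'DiskAlmostFull',        'action_taken': True},
    {'alert': 'FlappingHealthCheck',   'action_taken': False},
]

def action_ratio(pages):
    """Fraction of pages that led to meaningful human action."""
    return sum(p['action_taken'] for p in pages) / len(pages)

print(action_ratio(pages))  # 0.5 -- below the 70% target, so audit the noisy alerts
```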
Metrics and alerting are the real-time nervous system of your production environment. Well-designed alerts catch problems before users notice; poorly designed alerts create noise that hides real problems.