
Metrics and Alerting

The Role of Metrics in Observability

Metrics are aggregated numerical measurements over time. Unlike logs (one entry per event) or traces (one per request), metrics are pre-aggregated, making them cheap to store, fast to query, and ideal for dashboards and alerting. They answer the question: "Is this system healthy right now?"


The Three Pillars Compared

Pillar | What It Captures | Best For | Cardinality | Storage Cost
Logs | Discrete events with context | Debugging specific incidents | High (one per event) | High
Metrics | Aggregated numerical measurements | Alerting, dashboards, trends | Low (pre-aggregated) | Low
Traces | Request flow across services | Understanding distributed behavior | Medium (sampled) | Medium

Metrics are the first line of defense: they tell you something is wrong. Traces tell you where. Logs tell you why.


Prometheus Metric Types

Prometheus is the de facto standard for metrics collection in cloud-native environments:

Type | Description | Example | Use Case
Counter | Monotonically increasing value | http_requests_total | Request count, error count
Gauge | Value that can go up or down | temperature_celsius | Queue depth, active connections
Histogram | Observations bucketed by value | http_request_duration_seconds | Latency distributions
Summary | Pre-calculated percentiles | request_duration_quantile | Client-side percentiles

Choosing Between Histogram and Summary

  • Histogram: Use when you need server-side aggregation (multiple pods, dashboards). Prometheus can calculate percentiles across instances.
  • Summary: Use for client-side percentiles when you do not need cross-instance aggregation.

In most cases, choose Histogram. It is more flexible and works with Prometheus recording rules and alerts.
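
To make the distinction concrete, here is a small sketch using the Python client (the metric names are invented for the example). A Histogram exports one counter per bucket, which Prometheus can aggregate across pods with a histogram_quantile() query; the official Python client's Summary exports only a running count and sum, so it cannot give you cross-instance percentiles.

# histogram_vs_summary.py -- Sketch contrasting the two duration metric styles (names are illustrative)
from prometheus_client import Histogram, Summary

# Histogram: one counter per bucket boundary. Prometheus computes percentiles
# server-side, typically with a query such as:
#   histogram_quantile(0.99, sum by (le) (rate(checkout_latency_seconds_bucket[5m])))
checkout_latency = Histogram(
    'checkout_latency_seconds',
    'Checkout latency in seconds',
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)

# Summary: in the Python client this exposes only _count and _sum, which is
# enough for average latency but not for percentiles across instances.
payment_latency = Summary(
    'payment_latency_seconds',
    'Payment call latency in seconds'
)

checkout_latency.observe(0.21)  # both types are fed observations the same way
payment_latency.observe(0.08)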


Instrumenting Application Metrics

# metrics_setup.py -- Application metrics with Prometheus client
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Request metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Business metrics
orders_created = Counter(
    'orders_created_total',
    'Total orders created',
    ['payment_method', 'status']
)

active_sessions = Gauge(
    'active_sessions',
    'Number of active user sessions'
)

# Start metrics endpoint
start_http_server(8000)  # /metrics endpoint on port 8000
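
The block above only defines and exposes the metrics; they still have to be updated from the request path. A minimal sketch of what that might look like (the handler wrapper and label values are hypothetical; only the metric objects come from metrics_setup.py):

# record_metrics.py -- Updating the metrics defined in metrics_setup.py (illustrative sketch)
import time

from metrics_setup import request_count, request_duration, active_sessions, orders_created

def instrumented_handler(method, path, process):
    # 'process' is whatever callable produces the response and returns its status code
    active_sessions.inc()                       # Gauge goes up on entry...
    start = time.perf_counter()
    status = "500"
    try:
        status = process()
        return status
    finally:
        elapsed = time.perf_counter() - start
        request_count.labels(method=method, path=path, status=status).inc()
        request_duration.labels(method=method, path=path).observe(elapsed)
        active_sessions.dec()                   # ...and back down on exit

# Business metrics are incremented at the point where the event happens:
# orders_created.labels(payment_method="card", status="confirmed").inc()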

The USE and RED Methods

Two frameworks for choosing what to measure; a small instrumentation sketch follows the two lists below:

USE Method (for infrastructure):

  • Utilization: How full is the resource? (CPU %, memory %, disk %)
  • Saturation: How much extra work is queued? (queue depth, thread pool usage)
  • Errors: How often does work fail? (I/O errors, connection failures)

RED Method (for services):

  • Rate: How many requests per second?
  • Errors: How many of those requests fail?
  • Duration: How long do those requests take?
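
The RED signals are already covered by the request metrics in metrics_setup.py. The USE signals are usually gauges sampled by a background loop; a rough sketch, assuming psutil is available for host readings (the work queue and sampling interval are hypothetical):

# use_metrics.py -- Sketch of USE-style infrastructure metrics (assumes psutil is installed)
import psutil
from prometheus_client import Counter, Gauge

cpu_utilization = Gauge('cpu_utilization_percent', 'Host CPU utilization')          # Utilization
memory_utilization = Gauge('memory_utilization_percent', 'Host memory utilization')
work_queue_depth = Gauge('work_queue_depth', 'Items waiting in the work queue')     # Saturation
io_errors = Counter('io_errors_total', 'Failed I/O operations')                     # Errors

def sample_use_metrics(work_queue):
    # Call periodically (e.g. every 15 seconds) from a background thread
    cpu_utilization.set(psutil.cpu_percent(interval=None))
    memory_utilization.set(psutil.virtual_memory().percent)
    work_queue_depth.set(work_queue.qsize())

# The Errors signal is a plain counter increment at each failure site:
# io_errors.inc()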

Multi-Burn-Rate Alerts

One of the most important advances in alerting over the past decade is the multi-burn-rate alert. Instead of a single static threshold ("error rate > 1%"), burn-rate alerts detect both fast-burn problems (a sudden outage) and slow-burn problems (gradual degradation).

How Burn Rate Works

If your SLO allows 0.1% errors over 30 days, then a constant 0.1% error rate would consume the entire error budget in exactly 30 days. A "burn rate" of 1x means you are consuming the budget at exactly that sustainable rate; higher burn rates exhaust it proportionally faster, as the table and the short calculation below show.

Burn Rate | What It Means | Budget Duration
1x | Sustainable rate | Budget lasts 30 days
3x | Slow burn | Budget exhausted in 10 days
6x | Moderate burn | Budget exhausted in 5 days
14.4x | Fast burn | Budget exhausted in 2 days
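
The arithmetic behind this table is straightforward: over a 30-day SLO window, a burn rate of N exhausts the budget in 30 / N days, and the corresponding alert threshold is N times the error budget. A short sketch of the same calculation (the 0.1% budget is the example SLO from above):

# burn_rate.py -- Burn-rate arithmetic for a 30-day, 99.9% availability SLO
SLO_WINDOW_DAYS = 30
ERROR_BUDGET = 0.001  # 0.1% of requests may fail over the window

def days_to_exhaustion(burn_rate):
    # At burn rate N the budget is consumed N times faster than the sustainable rate
    return SLO_WINDOW_DAYS / burn_rate

def alert_threshold(burn_rate):
    # The error-rate threshold used in the alert expressions below
    return burn_rate * ERROR_BUDGET

for rate in (1, 3, 6, 14.4):
    print(f"{rate:>5}x burn: threshold {alert_threshold(rate):.4f}, "
          f"budget gone in {days_to_exhaustion(rate):.1f} days")
# 1x -> 30.0 days, 3x -> 10.0 days, 6x -> 5.0 days, 14.4x -> ~2.1 days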

Prometheus Alert Rules

# prometheus-alerting-rules.yaml
groups:
  - name: slo-alerts
    rules:
      # Fast burn: consuming error budget at 14.4x the sustainable rate
      # Will exhaust 30-day budget in 2 days if unchecked
      - alert: HighErrorRateFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          team: checkout
        annotations:
          summary: "Checkout error rate burning budget fast (14.4x)"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"
          dashboard: "https://grafana.internal/d/checkout-slo"

      # Slow burn: consuming at 3x the sustainable rate
      # Will exhaust 30-day budget in 10 days if unchecked
      - alert: HighErrorRateSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[30m]))
            /
            sum(rate(http_requests_total{service="checkout"}[30m]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5..", service="checkout"}[6h]))
            /
            sum(rate(http_requests_total{service="checkout"}[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout error rate burning budget slowly (3x)"
          runbook: "https://wiki.internal/runbooks/checkout-elevated-errors"

The dual-window technique (short window AND long window) reduces false positives. A brief spike that resolves quickly will trigger the short window but not the long window, preventing unnecessary pages.
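
The same decision logic can be written down outside PromQL, which makes it easy to sanity-check the thresholds (the function and its inputs are illustrative; in production the two windows come from the rate() expressions above):

# dual_window.py -- Sketch of the multi-window, multi-burn-rate decision (illustrative)
ERROR_BUDGET = 0.001  # 99.9% availability SLO

def should_fire(short_window_error_rate, long_window_error_rate, burn_rate):
    # Fire only when BOTH windows exceed the burn-rate threshold:
    # the short window reacts quickly, the long window filters out brief spikes.
    threshold = burn_rate * ERROR_BUDGET
    return (short_window_error_rate > threshold and
            long_window_error_rate > threshold)

# A brief spike visible in the 5-minute window but diluted in the 1-hour window:
print(should_fire(short_window_error_rate=0.05, long_window_error_rate=0.002, burn_rate=14.4))  # False
# A genuine outage visible in both windows:
print(should_fire(short_window_error_rate=0.05, long_window_error_rate=0.03, burn_rate=14.4))   # True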


Alert Design Principles

  1. Alert on symptoms, not causes. Alert on "users are seeing errors" not "CPU is high." High CPU that causes no user impact is not an alert-worthy event.

  2. Use multi-window, multi-burn-rate alerts. Catch both sudden outages and gradual degradation.

  3. Every alert must have a runbook. If there is no documented response procedure, the alert is not ready for production.

  4. Page only for things that need immediate human action. Everything else should be a ticket, a dashboard, or a weekly report.


Alert Classification Matrix

Signal | Page (wake someone up) | Ticket (fix this week) | Dashboard (awareness)
Error rate > 10x SLO burn rate | Yes | -- | --
Error rate > 3x SLO burn rate | -- | Yes | Yes
Latency p99 > 2x SLO | Depends on duration | Yes | Yes
Single pod crash | No | No | Yes
Deployment failed | No (auto-rollback) | Yes | Yes
Certificate expiring in 7 days | No | Yes | --
Disk 90% full | -- | Yes | Yes
Disk 98% full | Yes | -- | --

Alert Fatigue: The Enemy of Observability

Alert fatigue occurs when on-call engineers receive so many alerts that they begin ignoring them. This is the single biggest risk to an observability-driven testing program.

Diagnosing Alert Fatigue

Symptom | Indicates
>10 pages per on-call shift | Too many alerts, insufficient filtering
>50% of pages are "no action needed" | Too many false positives
Average acknowledgment time > 10 minutes | Engineers are ignoring alerts
Snooze rate > 30% | Alerts are not actionable
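
All four symptoms can be computed from a pager export. A rough sketch, assuming each page record notes its acknowledgment delay, whether it was snoozed, and whether any action was taken (the record format is invented for the example):

# alert_fatigue_report.py -- Sketch of a periodic alert-fatigue check (record format is hypothetical)
def fatigue_report(pages, shifts):
    # pages: list of dicts like {"ack_seconds": 240, "snoozed": False, "action_taken": True}
    # shifts: number of on-call shifts in the reporting period
    total = len(pages)
    if total == 0 or shifts == 0:
        return {}
    no_action = sum(1 for p in pages if not p["action_taken"])
    snoozed = sum(1 for p in pages if p["snoozed"])
    return {
        "pages_per_shift": total / shifts,                                      # >10 means too many alerts
        "no_action_rate": no_action / total,                                    # >0.5 means too many false positives
        "avg_ack_minutes": sum(p["ack_seconds"] for p in pages) / total / 60,   # >10 suggests alerts are ignored
        "snooze_rate": snoozed / total,                                         # >0.3 means alerts are not actionable
        "alert_to_action_ratio": 1 - no_action / total,                         # monthly target: > 0.7
    }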

Curing Alert Fatigue

  1. Audit every alert. For each alert, ask: "Did this require immediate human action?" If the answer is consistently "no," downgrade from page to ticket.
  2. Group correlated alerts. If a single incident triggers 15 alerts, create a meta-alert that groups them.
  3. Set maintenance windows. During deployments, suppress known transient alerts.
  4. Review monthly. Track alert-to-action ratio. Target >70% of pages resulting in meaningful action.

Metrics and alerting are the real-time nervous system of your production environment. Well-designed alerts catch problems before users notice; poorly designed alerts create noise that hides real problems.