SLOs, SLIs, and Error Budgets
Why QA Architects Need SRE Fundamentals
Modern QA architects sit at the intersection of quality engineering and site reliability engineering. Understanding SLOs, error budgets, and their operational implications is not optional at the architect level. These concepts provide the quantitative framework for making quality decisions: when to ship, when to stop, and when to invest in reliability over features.
The SRE Vocabulary
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of a service attribute | p99 latency of the checkout API |
| SLO (Service Level Objective) | A target value or range for an SLI | p99 latency < 500ms, 99.9% of the time |
| SLA (Service Level Agreement) | A business contract with financial consequences | 99.9% uptime or credits issued |
| Error Budget | The allowed failure margin: 100% - SLO | 0.1% = ~43 minutes of downtime/month |
The Relationship Between Them
SLI = The measurement (what you observe)
SLO = The target (what you commit to internally)
SLA = The contract (what you promise customers, with penalties)
Error Budget = SLO's tolerance (how much failure is acceptable)
Always: SLA <= SLO (your internal target should be stricter than your customer promise)
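The relationship can be sketched in a few lines of arithmetic (all figures below are hypothetical):

```python
# SLI -> SLO -> error budget, with made-up numbers for one service.
successful_requests = 999_620
total_requests = 1_000_000

sli = successful_requests / total_requests   # what you observe
slo = 0.9995                                 # internal target (99.95%)
sla = 0.999                                  # customer contract (99.9%)

error_budget = 1 - slo                       # allowed failure fraction
budget_consumed = (1 - sli) / error_budget   # fraction of budget spent so far

assert sla <= slo, "internal target must be at least as strict as the contract"
print(f"SLI: {sli:.4%}, error budget consumed: {budget_consumed:.0%}")
```

With these numbers the service is within its SLO but has already spent most of its budget, which is exactly the kind of signal the rest of this section builds on.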
Choosing Good SLIs
Not all metrics make good SLIs. The best SLIs are:
- User-facing. Measure what users experience, not what servers report.
- Measurable. You must be able to collect the data reliably.
- Actionable. When the SLI degrades, the team can do something about it.
SLI Categories
| Category | Good SLIs | Bad SLIs |
|---|---|---|
| Availability | Successful requests / total requests | Server uptime |
| Latency | p99 request duration | Average response time |
| Quality | Responses with correct content / total responses | Test pass rate |
| Freshness | Data updated within threshold / total queries | Cron job success rate |
| Throughput | Requests served at target rate / total minutes | CPU utilization |
Why "server uptime" is a bad SLI: A server can be "up" (responding to health checks) while returning errors to every user request. Uptime measures infrastructure, not user experience.
Why "average response time" is a bad SLI: Averages hide outliers. If 99% of requests take 100ms and 1% take 30 seconds, the average is ~400ms -- which looks fine but masks a terrible experience for 1% of users.
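The average-versus-percentile gap is easy to demonstrate with the distribution just described (a synthetic example, using only the standard library):

```python
import statistics

# 10,000 requests: 99% take 100 ms, 1% take 30,000 ms (30 s).
latencies_ms = [100] * 9_900 + [30_000] * 100

mean = statistics.fmean(latencies_ms)                 # ~399 ms -- looks healthy
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # 99th percentile

print(f"mean = {mean:.0f} ms, p99 = {p99:.0f} ms")    # p99 lands in the 30 s tail
```

The mean suggests a fast service; the p99 exposes the 1% of users having a terrible experience.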
Defining SLOs
An SLO pairs an SLI with a target and a time window:
```yaml
# slo-definitions.yaml
service: checkout-api
slos:
  - name: availability
    sli: successful_requests / total_requests
    # "successful" = status code < 500 (client errors are not server failures)
    target: 99.95%
    window: 30d
    error_budget: 0.05%  # ~21.6 minutes of downtime per month
  - name: latency
    sli: requests_completed_under_500ms / total_requests
    target: 99.0%
    window: 30d
    error_budget: 1.0%
    # 1% of requests can exceed 500ms without breaching the SLO
  - name: correctness
    sli: orders_with_correct_total / total_orders
    target: 99.99%
    window: 30d
    error_budget: 0.01%
```
SLO Design Guidelines
- Start with user expectations. If your users expect checkout to take under 2 seconds, your latency SLO should be stricter than that.
- Use percentiles, not averages. p99 or p95 latency SLOs protect the tail of the distribution.
- Use rolling windows. A 30-day rolling window is standard. Calendar months create end-of-month panic.
- Do not aim for 100%. A 100% SLO means zero tolerance for any failure, which halts all development. Even Google targets 99.99%, not 100%.
- Fewer is better. 3-5 SLOs per service is enough. Too many SLOs create confusion about priorities.
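The "percentiles over a rolling window" guidance can be sketched as a simple in-memory check (the window size and thresholds are illustrative stand-ins for a real 30-day window):

```python
from collections import deque

# Rolling-window latency SLO check: fraction of requests under 500 ms
# over the last N observations, compared against a 99.0% target.
WINDOW = 10_000        # stand-in for a 30-day rolling window
TARGET = 0.99
THRESHOLD_MS = 500

window = deque(maxlen=WINDOW)  # old observations fall off automatically

def record(latency_ms):
    window.append(latency_ms)

def latency_sli():
    return sum(1 for v in window if v < THRESHOLD_MS) / len(window)

for ms in [120] * 9_950 + [800] * 50:   # 0.5% of requests are slow
    record(ms)

print(f"SLI = {latency_sli():.3%}, meets SLO: {latency_sli() >= TARGET}")
```

Because the window rolls continuously, a burst of slow requests degrades the SLI immediately and then ages out gradually, with no month-boundary reset.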
Error Budgets: The Key Insight
The error budget is the most powerful concept in SRE. It turns reliability into a measurable resource that can be spent:
| SLO Target | Error Budget (30 days) | Equivalent Downtime |
|---|---|---|
| 99% | 1.0% | ~7.2 hours |
| 99.5% | 0.5% | ~3.6 hours |
| 99.9% | 0.1% | ~43.2 minutes |
| 99.95% | 0.05% | ~21.6 minutes |
| 99.99% | 0.01% | ~4.3 minutes |
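The budget-to-downtime conversion is straightforward to verify for a 30-day window:

```python
# Convert SLO targets into allowed downtime for a 30-day window.
WINDOW_MIN = 30 * 24 * 60   # 43,200 minutes in 30 days

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    budget = 1 - target                 # allowed failure fraction
    downtime_min = budget * WINDOW_MIN  # minutes of full downtime allowed
    print(f"{target:.2%} target -> budget {budget:.2%} -> {downtime_min:.1f} min")
```

Note these figures assume total downtime; partial degradation (e.g., 10% of requests failing) burns the budget proportionally slower.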
Error Budget Policy
The error budget policy defines what happens as the budget is consumed:
```yaml
# error-budget-policy.yaml
service: checkout-api
slos:
  - name: availability
    sli: successful_requests / total_requests
    target: 99.95%
    window: 30d
    error_budget: 0.05%
  - name: latency
    sli: requests_under_500ms / total_requests
    target: 99.0%
    window: 30d
    error_budget: 1.0%
policy:
  budget_remaining_above_50pct:
    - Deploy normally
    - Run chaos experiments
    - Ship new features
    - Experiment with new architectures
  budget_remaining_25_to_50pct:
    - Reduce deployment frequency
    - Pause non-critical chaos experiments
    - Prioritize reliability work in sprint planning
    - Review recent deployments for regressions
  budget_remaining_below_25pct:
    - Feature freeze for this service
    - All engineering effort on reliability
    - Incident review for every budget-consuming event
    - Escalate to engineering leadership
  budget_exhausted:
    - Full deployment freeze except hotfixes
    - Executive escalation
    - Postmortem required for next deployment
    - Consider rollback of recent changes
```
Why Error Budgets Change the Conversation
Without error budgets, the reliability discussion is adversarial:
- Product team: "We need to ship this feature."
- QA/SRE team: "It is not ready. More testing needed."
- Result: Endless negotiation, no objective criteria.
With error budgets, the discussion is data-driven:
- Product team: "We need to ship this feature."
- QA/SRE team: "We have 60% of our error budget remaining. We can ship, but we need to monitor closely."
- Result: Objective decision based on measurable risk.
Monitoring Error Budget Consumption
Prometheus/Grafana Setup
```yaml
# prometheus-error-budget-recording-rules.yaml
groups:
  - name: error-budget
    interval: 1m
    rules:
      # SLI: availability (successful requests / total requests)
      - record: sli:checkout:availability:5m
        expr: |
          sum(rate(http_requests_total{service="checkout",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))
      # Longer-window SLIs, derived from the 5m series
      # (an unweighted average over time; accurate enough when traffic is steady)
      - record: sli:checkout:availability:1h
        expr: avg_over_time(sli:checkout:availability:5m[1h])
      - record: sli:checkout:availability:30d
        expr: avg_over_time(sli:checkout:availability:5m[30d])
      # Error budget remaining (30-day window)
      - record: error_budget:checkout:availability:remaining
        expr: |
          1 - (
            (1 - sli:checkout:availability:30d) / (1 - 0.9995)
          )
        # Result: 1.0 = full budget, 0.0 = budget exhausted, <0 = over budget
      # Error budget burn rate
      - record: error_budget:checkout:availability:burn_rate:1h
        expr: |
          (1 - sli:checkout:availability:1h) / (1 - 0.9995)
        # Result: 1.0 = sustainable rate, >1.0 = burning faster than allowed
```
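The same burn-rate arithmetic is useful offline, for example to project when the budget runs out if the current error rate holds. A minimal sketch, assuming the 99.95% SLO and 30-day window used above:

```python
# Burn rate and projected time-to-exhaustion for an availability SLO.
SLO = 0.9995
WINDOW_DAYS = 30

def burn_rate(sli_over_window):
    """Ratio of actual to allowed error rate; >1.0 exhausts the budget early."""
    return (1 - sli_over_window) / (1 - SLO)

def days_to_exhaustion(rate, budget_remaining):
    """Days until the budget hits zero if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return WINDOW_DAYS * budget_remaining / rate

rate = burn_rate(0.9985)   # error rate is 3x the sustainable level
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in ~{days_to_exhaustion(rate, 0.8):.1f} days")
```

Projections like this are what turn a burn-rate graph into a paging decision: a high burn rate with days of budget left is a ticket, the same rate with hours left is a page.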
Practical Implementation Steps
Step 1: Measure Before You Set Targets
Before defining SLOs, measure your current performance for 30 days. Your initial SLO should be set at or slightly below your current baseline -- a target you already meet -- not at an aspirational level you have never achieved.
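Deriving a target from a measured baseline can be as simple as this (the daily figures are hypothetical placeholders for 30 days of real measurements):

```python
import statistics

# Propose an initial SLO from measured baseline availability:
# aim at or slightly below current performance, never above it.
daily_availability = [0.9993, 0.9997, 0.9991, 0.9996, 0.9989]  # ... 30 days' worth

baseline = statistics.fmean(daily_availability)
proposed_slo = round(baseline - 0.0002, 4)   # small safety margin below baseline

print(f"measured baseline: {baseline:.4%}")
print(f"proposed initial SLO: {proposed_slo:.2%}")
```

Starting below the baseline means the team begins with budget in hand, and the target can be tightened in later quarterly reviews once the process has credibility.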
Step 2: Start with One Service
Pick your most critical service (usually the one that generates revenue) and define 2-3 SLOs. Prove the process works before expanding.
Step 3: Automate Budget Tracking
Manual error budget tracking does not scale. Use Prometheus recording rules or your monitoring platform's SLO feature (Datadog SLOs, Grafana SLO).
Step 4: Tie Budget to Actions
The error budget policy must have teeth. If the budget hits 25% and the policy says "feature freeze," leadership must enforce it. Otherwise, the entire system loses credibility.
Step 5: Review Quarterly
SLOs are not permanent. Review them quarterly:
- Are they too tight? (Constant feature freezes = targets too ambitious)
- Are they too loose? (Never consume budget = targets not meaningful)
- Do they reflect user expectations? (User complaints with green SLOs = wrong metrics)
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| SLO set at 100% | No room for any failure; all development stops | Use 99.9% or 99.95% |
| Too many SLOs | Teams cannot prioritize | 3-5 per service maximum |
| SLO based on infrastructure metrics | Does not reflect user experience | Use request success rate, not CPU |
| No error budget policy | SLOs without consequences are just dashboards | Define actions at 50%, 25%, and 0% budget |
| Aspirational SLOs | Permanently exhausted budget demoralizes teams | Set SLOs based on measured baseline |