SLOs, SLIs, and Error Budgets
Why QA Architects Need SRE Fundamentals
Modern QA architects sit at the intersection of quality engineering and site reliability engineering. Understanding SLOs, error budgets, and their operational implications is not optional at the architect level. These concepts provide the quantitative framework for making quality decisions: when to ship, when to stop, and when to invest in reliability over features.
The SRE Vocabulary
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of a service attribute | p99 latency of the checkout API |
| SLO (Service Level Objective) | A target value or range for an SLI | p99 latency < 500ms, 99.9% of the time |
| SLA (Service Level Agreement) | A business contract with financial consequences | 99.9% uptime or credits issued |
| Error Budget | The allowed failure margin: 100% - SLO | 0.1% = ~43 minutes of downtime/month |
The Relationship Between Them
SLI = The measurement (what you observe)
SLO = The target (what you commit to internally)
SLA = The contract (what you promise customers, with penalties)
Error Budget = SLO's tolerance (how much failure is acceptable)
Always: SLA <= SLO (your internal target should be stricter than your customer promise)
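The relationship can be sketched in a few lines of arithmetic (all figures below are hypothetical):

```python
# SLI -> SLO -> error budget, with made-up numbers for one service.
successful_requests = 999_620
total_requests = 1_000_000

sli = successful_requests / total_requests   # what you observe
slo = 0.9995                                 # internal target (99.95%)
sla = 0.999                                  # customer contract (99.9%)

error_budget = 1 - slo                       # allowed failure fraction
budget_consumed = (1 - sli) / error_budget   # fraction of budget spent so far

assert sla <= slo, "internal target must be at least as strict as the contract"
print(f"SLI: {sli:.4%}, error budget consumed: {budget_consumed:.0%}")
```

With these numbers the service is within its SLO but has already spent most of its budget, which is exactly the kind of signal the rest of this section builds on.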
Choosing Good SLIs
Not all metrics make good SLIs. The best SLIs are:
- User-facing. Measure what users experience, not what servers report.
- Measurable. You must be able to collect the data reliably.
- Actionable. When the SLI degrades, the team can do something about it.
SLI Categories
| Category | Good SLIs | Bad SLIs |
|---|---|---|
| Availability | Successful requests / total requests | Server uptime |
| Latency | p99 request duration | Average response time |
| Quality | Responses with correct content / total responses | Test pass rate |
| Freshness | Data updated within threshold / total queries | Cron job success rate |
| Throughput | Requests served at target rate / total minutes | CPU utilization |
Why "server uptime" is a bad SLI: A server can be "up" (responding to health checks) while returning errors to every user request. Uptime measures infrastructure, not user experience.
Why "average response time" is a bad SLI: Averages hide outliers. If 99% of requests take 100ms and 1% take 30 seconds, the average is ~400ms -- which looks fine but masks a terrible experience for 1% of users.
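The average-versus-percentile gap is easy to demonstrate with the distribution just described (a synthetic example, using only the standard library):

```python
import statistics

# 10,000 requests: 99% take 100 ms, 1% take 30,000 ms (30 s).
latencies_ms = [100] * 9_900 + [30_000] * 100

mean = statistics.fmean(latencies_ms)                 # ~399 ms -- looks healthy
p99 = statistics.quantiles(latencies_ms, n=100)[98]   # 99th percentile

print(f"mean = {mean:.0f} ms, p99 = {p99:.0f} ms")    # p99 lands in the 30 s tail
```

The mean suggests a fast service; the p99 exposes the 1% of users having a terrible experience.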
Defining SLOs
An SLO pairs an SLI with a target and a time window:
```yaml
# slo-definitions.yaml
service: checkout-api
slos:
  - name: availability
    sli: successful_requests / total_requests
    # "successful" = status code < 500 (client errors are not server failures)
    target: 99.95%
    window: 30d
    error_budget: 0.05%  # ~21.6 minutes of downtime per month
  - name: latency
    sli: requests_completed_under_500ms / total_requests
    target: 99.0%
    window: 30d
    error_budget: 1.0%
    # 1% of requests can exceed 500ms without breaching the SLO
  - name: correctness
    sli: orders_with_correct_total / total_orders
    target: 99.99%
    window: 30d
    error_budget: 0.01%
```
SLO Design Guidelines
- Start with user expectations. If your users expect checkout to take under 2 seconds, your latency SLO should be stricter than that.
- Use percentiles, not averages. p99 or p95 latency SLOs protect the tail of the distribution.
- Use rolling windows. A 30-day rolling window is standard. Calendar months create end-of-month panic.
- Do not aim for 100%. A 100% SLO means zero tolerance for any failure, which halts all development. Even Google targets 99.99%, not 100%.
- Fewer is better. 3-5 SLOs per service is enough. Too many SLOs create confusion about priorities.
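The "percentiles over a rolling window" guidance can be sketched as a simple in-memory check (the window size and thresholds are illustrative stand-ins for a real 30-day window):

```python
from collections import deque

# Rolling-window latency SLO check: fraction of requests under 500 ms
# over the last N observations, compared against a 99.0% target.
WINDOW = 10_000        # stand-in for a 30-day rolling window
TARGET = 0.99
THRESHOLD_MS = 500

window = deque(maxlen=WINDOW)  # old observations fall off automatically

def record(latency_ms):
    window.append(latency_ms)

def latency_sli():
    return sum(1 for v in window if v < THRESHOLD_MS) / len(window)

for ms in [120] * 9_950 + [800] * 50:   # 0.5% of requests are slow
    record(ms)

print(f"SLI = {latency_sli():.3%}, meets SLO: {latency_sli() >= TARGET}")
```

Because the window rolls continuously, a burst of slow requests degrades the SLI immediately and then ages out gradually, with no month-boundary reset.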
Error Budgets: The Key Insight
The error budget is the most powerful concept in SRE. It turns reliability into a measurable resource that can be spent:
| SLO Target | Error Budget (30 days) | Equivalent Downtime |
|---|---|---|
| 99% | 1.0% | ~7.2 hours |
| 99.5% | 0.5% | ~3.6 hours |
| 99.9% | 0.1% | ~43.2 minutes |
| 99.95% | 0.05% | ~21.6 minutes |
| 99.99% | 0.01% | ~4.3 minutes |
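The budget-to-downtime conversion is straightforward to verify for a 30-day window:

```python
# Convert SLO targets into allowed downtime for a 30-day window.
WINDOW_MIN = 30 * 24 * 60   # 43,200 minutes in 30 days

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    budget = 1 - target                 # allowed failure fraction
    downtime_min = budget * WINDOW_MIN  # minutes of full downtime allowed
    print(f"{target:.2%} target -> budget {budget:.2%} -> {downtime_min:.1f} min")
```

Note these figures assume total downtime; partial degradation (e.g., 10% of requests failing) burns the budget proportionally slower.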
Error Budget Policy
The error budget policy defines what happens as the budget is consumed:
```yaml
# error-budget-policy.yaml
service: checkout-api
slos:
  - name: availability
    sli: successful_requests / total_requests
    target: 99.95%
    window: 30d
    error_budget: 0.05%
  - name: latency
    sli: requests_under_500ms / total_requests
    target: 99.0%
    window: 30d
    error_budget: 1.0%
policy:
  budget_remaining_above_50pct:
    - Deploy normally
    - Run chaos experiments
    - Ship new features
    - Experiment with new architectures
  budget_remaining_25_to_50pct:
    - Reduce deployment frequency
    - Pause non-critical chaos experiments
    - Prioritize reliability work in sprint planning
    - Review recent deployments for regressions
  budget_remaining_below_25pct:
    - Feature freeze for this service
    - All engineering effort on reliability
    - Incident review for every budget-consuming event
    - Escalate to engineering leadership
  budget_exhausted:
    - Full deployment freeze except hotfixes
    - Executive escalation
    - Postmortem required for next deployment
    - Consider rollback of recent changes
```
Why Error Budgets Change the Conversation
Without error budgets, the reliability discussion is adversarial:
- Product team: "We need to ship this feature."
- QA/SRE team: "It is not ready. More testing needed."
- Result: Endless negotiation, no objective criteria.
With error budgets, the discussion is data-driven:
- Product team: "We need to ship this feature."
- QA/SRE team: "We have 60% of our error budget remaining. We can ship, but we need to monitor closely."
- Result: Objective decision based on measurable risk.
Monitoring Error Budget Consumption
Prometheus/Grafana Setup
```yaml
# prometheus-error-budget-recording-rules.yaml
groups:
  - name: error-budget
    interval: 1m
    rules:
      # SLI: availability (successful requests / total requests)
      - record: sli:checkout:availability:5m
        expr: |
          sum(rate(http_requests_total{service="checkout",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))
      # Longer-window SLIs, derived from the 5m series
      # (an unweighted average over time; accurate enough when traffic is steady)
      - record: sli:checkout:availability:1h
        expr: avg_over_time(sli:checkout:availability:5m[1h])
      - record: sli:checkout:availability:30d
        expr: avg_over_time(sli:checkout:availability:5m[30d])
      # Error budget remaining (30-day window)
      - record: error_budget:checkout:availability:remaining
        expr: |
          1 - (
            (1 - sli:checkout:availability:30d) / (1 - 0.9995)
          )
        # Result: 1.0 = full budget, 0.0 = budget exhausted, <0 = over budget
      # Error budget burn rate
      - record: error_budget:checkout:availability:burn_rate:1h
        expr: |
          (1 - sli:checkout:availability:1h) / (1 - 0.9995)
        # Result: 1.0 = sustainable rate, >1.0 = burning faster than allowed
```
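The same burn-rate arithmetic is useful offline, for example to project when the budget runs out if the current error rate holds. A minimal sketch, assuming the 99.95% SLO and 30-day window used above:

```python
# Burn rate and projected time-to-exhaustion for an availability SLO.
SLO = 0.9995
WINDOW_DAYS = 30

def burn_rate(sli_over_window):
    """Ratio of actual to allowed error rate; >1.0 exhausts the budget early."""
    return (1 - sli_over_window) / (1 - SLO)

def days_to_exhaustion(rate, budget_remaining):
    """Days until the budget hits zero if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return WINDOW_DAYS * budget_remaining / rate

rate = burn_rate(0.9985)   # error rate is 3x the sustainable level
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in ~{days_to_exhaustion(rate, 0.8):.1f} days")
```

Projections like this are what turn a burn-rate graph into a paging decision: a high burn rate with days of budget left is a ticket, the same rate with hours left is a page.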
Practical Implementation Steps
Step 1: Measure Before You Set Targets
Before defining SLOs, measure your current performance for 30 days. Your initial SLO should be set at or slightly below your current baseline -- a target you already meet -- not at an aspirational level you have never achieved.
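Deriving a target from a measured baseline can be as simple as this (the daily figures are hypothetical placeholders for 30 days of real measurements):

```python
import statistics

# Propose an initial SLO from measured baseline availability:
# aim at or slightly below current performance, never above it.
daily_availability = [0.9993, 0.9997, 0.9991, 0.9996, 0.9989]  # ... 30 days' worth

baseline = statistics.fmean(daily_availability)
proposed_slo = round(baseline - 0.0002, 4)   # small safety margin below baseline

print(f"measured baseline: {baseline:.4%}")
print(f"proposed initial SLO: {proposed_slo:.2%}")
```

Starting below the baseline means the team begins with budget in hand, and the target can be tightened in later quarterly reviews once the process has credibility.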
Step 2: Start with One Service
Pick your most critical service (usually the one that generates revenue) and define 2-3 SLOs. Prove the process works before expanding.
Step 3: Automate Budget Tracking
Manual error budget tracking does not scale. Use Prometheus recording rules or your monitoring platform's SLO feature (Datadog SLOs, Grafana SLO).
Step 4: Tie Budget to Actions
The error budget policy must have teeth. If the budget hits 25% and the policy says "feature freeze," leadership must enforce it. Otherwise, the entire system loses credibility.
Step 5: Review Quarterly
SLOs are not permanent. Review them quarterly:
- Are they too tight? (Constant feature freezes = targets too ambitious)
- Are they too loose? (Never consume budget = targets not meaningful)
- Do they reflect user expectations? (User complaints with green SLOs = wrong metrics)
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| SLO set at 100% | No room for any failure; all development stops | Use 99.9% or 99.95% |
| Too many SLOs | Teams cannot prioritize | 3-5 per service maximum |
| SLO based on infrastructure metrics | Does not reflect user experience | Use request success rate, not CPU |
| No error budget policy | SLOs without consequences are just dashboards | Define actions at 50%, 25%, and 0% budget |
| Aspirational SLOs | Permanently exhausted budget demoralizes teams | Set SLOs based on measured baseline |