Service-Level Objectives in Practice
TL;DR: An SLO is a target percentage of “good events” over a window. Pick 2-3 user-facing SLIs per service. Set realistic targets (99.9%, not 99.999%). Compute error budget = (1 - target) × total events. When the budget runs out, pause feature work. An SLO without enforcement is decoration.
With Tempo covered, the next layer is operational discipline. SLOs (Service Level Objectives) turn metrics into commitments. This post covers the practical setup, not the theory (see Google’s SRE book for that).
SLI, SLO, SLA — three terms
SLI (Service Level Indicator): the metric you measure. “Percentage of API requests returning 2xx within 200ms.”
SLO: the target for the SLI over a window. “99.5% of requests over 30 days.”
SLA: contractual commitment (with penalties). “If we drop below 99.5% per month, customers get a 10% refund.”
You set SLIs and SLOs internally. SLAs are negotiated externally. Most teams have SLOs without SLAs.
Picking SLIs
Two per service is usually enough:
Availability SLI: fraction of requests that succeed.
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))
Latency SLI: fraction of requests faster than threshold.
sum(rate(http_request_duration_seconds_bucket{service="api",le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="api"}[5m]))
For batch jobs / async workers: success rate of jobs, not requests.
For databases: query latency p99 < 100ms.
For UI: page load p95 < 2s.
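Both request SLIs above are the same shape: good events divided by total events. A minimal Python sketch of that ratio, assuming you already have raw counts for one window (the `WindowCounts` type and its field names are made up for illustration, not part of any library):

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one service over one window (hypothetical shape)."""
    total: int            # all requests
    server_errors: int    # responses with a 5xx status
    under_threshold: int  # requests that finished within the latency threshold


def availability_sli(c: WindowCounts) -> float:
    """Fraction of requests that did not return 5xx."""
    if c.total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return (c.total - c.server_errors) / c.total


def latency_sli(c: WindowCounts) -> float:
    """Fraction of requests faster than the latency threshold."""
    if c.total == 0:
        return 1.0  # same no-traffic assumption as above
    return c.under_threshold / c.total


counts = WindowCounts(total=10_000, server_errors=12, under_threshold=9_870)
print(f"availability: {availability_sli(counts):.4f}")  # 0.9988
print(f"latency:      {latency_sli(counts):.4f}")       # 0.9870
```

How to treat a window with zero traffic is a policy choice; returning 1.0 here is just one option.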
Match the SLI to user pain. “Requests succeed” is what users care about. “CPU < 80%” is what you care about.
Setting targets
Common ranges:
- 99%: two 9s. ~7.2 hours downtime/month. Easy to maintain; reasonable for non-critical services
- 99.5%: solid. ~3.6 hours downtime/month
- 99.9%: three 9s. ~43 minutes/month
- 99.99%: four 9s. ~4 minutes/month
- 99.999%: five 9s. ~26 seconds/month. Requires fault-tolerant architecture
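The downtime figures follow directly from the target. A quick sketch of the arithmetic, assuming a 30-day month and treating all unavailability as one full outage (`allowed_downtime_seconds` is an illustrative helper name):

```python
def allowed_downtime_seconds(target: float, window_days: int = 30) -> float:
    """Seconds of full outage a target tolerates over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1 - target) * window_seconds


for target in (0.99, 0.995, 0.999, 0.9999, 0.99999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target * 100:g}% -> {seconds / 60:.1f} min ({seconds:.0f} s)")
```

Partial outages spend the same allowance more slowly, so real incidents rarely map this cleanly to wall-clock minutes.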
Set realistic targets based on:
- Historical performance (where are you already?)
- Customer expectations
- Cost of getting to the next 9
The jump from 99.9% to 99.99% is often 10× the operational investment. Worth it only if customers genuinely need it.
Error budget
The math:
Error budget = (1 - target) × total events over window
For 99.9% SLO on 30M requests/month:
0.001 × 30M = 30,000 allowed bad events
Each SLO has its own budget. Each 5xx eats the availability budget. Each request over 500ms eats the latency budget. Running out before month end is a problem.
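The same math as a small Python sketch, assuming you know (or forecast) the total event count for the window. The function names are illustrative:

```python
def error_budget(target: float, total_events: int) -> int:
    """Allowed bad events for a target over a window with total_events."""
    return round((1 - target) * total_events)


def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; negative once the SLO is blown.

    Assumes target < 1.0, otherwise the budget is zero and this divides by zero.
    """
    budget = error_budget(target, total_events)
    return (budget - bad_events) / budget


budget = error_budget(0.999, 30_000_000)
print(budget)  # 30000
print(f"{budget_remaining(0.999, 30_000_000, bad_events=12_000):.0%} of budget left")  # 60%
```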
Calculate burn rate, starting from the error rate over the last hour:
1 - (
sum(rate(http_requests_total{service="api",status!~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api"}[1h]))
)
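Burn rate is this error rate divided by the budget fraction (1 - target). If the error rate is 0.001 and your SLO is 99.9% (a budget fraction of 0.001), you’re burning budget at 1×, which is exactly sustainable. An error rate of 0.005 is 5×: you’ll run out in 1/5 of the window.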
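The comparison above amounts to dividing the observed error rate by the budget fraction. A minimal sketch, assuming a 30-day window and a constant error rate (both helper names are illustrative):

```python
def burn_rate(error_rate: float, target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - target)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days a full budget lasts if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / rate


rate = burn_rate(error_rate=0.005, target=0.999)
print(f"burn rate: {rate:.1f}x")                                  # 5.0x
print(f"budget lasts {days_until_exhausted(rate):.1f} days")      # 6.0 days
```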
Operationalizing — the policy
SLOs without consequences don’t help. The Google policy:
If you blow your error budget, feature work stops until you’ve fixed reliability.
This is the cultural shift that makes SLOs real. Without it, you have aspirational targets nobody acts on.
Soft version (what most companies actually do):
- Under 75% of budget burned: feature work as normal
- 75-100% burned: investigate; consider deferring risky changes
- 100% burned: stop deploying risky changes; focus on stability
The mechanism creates pressure to make reliability investments. Without it, reliability work loses to feature work every time.
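The soft policy is simple enough to encode, which helps when you want a dashboard or a deploy gate to show the current state. A sketch, assuming “burn” means the fraction of the window’s budget already consumed (`policy_action` is a hypothetical helper, and the exact boundaries are a judgment call):

```python
def policy_action(budget_consumed: float) -> str:
    """Map fraction of error budget consumed (0.0 to 1.0+) to a soft-policy action."""
    if budget_consumed >= 1.0:
        return "stop deploying risky changes; focus on stability"
    if budget_consumed >= 0.75:
        return "investigate; consider deferring risky changes"
    return "feature work as normal"


for consumed in (0.40, 0.80, 1.10):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```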
Dashboards for SLOs
Three panels per SLO:
- Current SLO performance (stat, big number, color-coded)
- Error budget remaining (gauge or stat)
- Burn rate (time series, last 24h)
For multi-window/multi-burn-rate alerts (Google’s recommendation, thresholds shown for a 30-day window):
Alert if:
(1h burn rate > 14.4) AND (5m burn rate > 14.4)
→ fast page (using 2% of budget in 1 hour)
OR
(6h burn rate > 6) AND (30m burn rate > 6)
→ slower page (using 5% of budget in 6 hours)
Catches both sudden outages (fast burn) and slow degradations (gradual burn).
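A sketch of the paging logic in Python, using the thresholds from Google’s SRE Workbook for a 30-day window: a burn rate of 14.4 sustained for 1 hour spends 2% of the budget, and a burn rate of 6 sustained for 6 hours spends 5%. The dict keys and function names are illustrative. In practice this logic lives in Prometheus alert rules, not application code.

```python
from typing import Dict, Optional


def page_severity(burn: Dict[str, float]) -> Optional[str]:
    """Return "fast", "slow", or None given burn rates keyed by window length.

    The short window must also exceed the threshold, so the alert stops
    firing soon after the problem does.
    """
    if burn["1h"] > 14.4 and burn["5m"] > 14.4:
        return "fast"
    if burn["6h"] > 6 and burn["30m"] > 6:
        return "slow"
    return None


def budget_fraction_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Share of the window's budget spent by a constant burn rate over `hours`."""
    return rate * hours / (window_days * 24)


print(page_severity({"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 4.0}))  # fast
print(page_severity({"5m": 1.0, "30m": 7.0, "1h": 7.0, "6h": 6.5}))     # slow
print(f"{budget_fraction_consumed(14.4, 1):.0%}")                        # 2%
```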
SLOs for things you don’t control
External dependencies (Stripe, AWS, third-party APIs) have their own SLOs. You consume them. Two options:
- Inherit their SLO. If Stripe is 99.99%, your service can be at most 99.99% (assuming you depend on every Stripe call).
- Decouple via async + retry. Your service success doesn’t require Stripe success right now. Pattern: queue Stripe operations, retry, only fail user request on prolonged outage.
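To see why inheriting caps you, multiply the availabilities. A rough sketch, assuming every request needs every dependency and failures are independent (real failures often correlate, so treat this as an estimate, and `availability_ceiling` as an illustrative name):

```python
from math import prod
from typing import Sequence


def availability_ceiling(own: float, dependencies: Sequence[float]) -> float:
    """Best-case availability when every request needs every dependency."""
    return own * prod(dependencies)


# Your code alone at 99.95%, with one synchronous 99.99% dependency:
print(f"{availability_ceiling(0.9995, [0.9999]):.6f}")  # 0.999400
```

Each extra synchronous dependency lowers the ceiling further, which is the arithmetic behind the decoupling option.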
The decoupling work is real. The SLO budget often forces it.
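A toy in-memory sketch of the queue-and-retry pattern, to show where the user request stops depending on the provider. Everything here is hypothetical: `ChargeQueue`, the `send` callable standing in for the provider call, and the 15-minute cutoff. A real system would use a durable queue, backoff between attempts, and idempotency keys.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional


@dataclass
class PendingCharge:
    """A queued payment operation (hypothetical shape)."""
    order_id: str
    enqueued_at: float = field(default_factory=time.monotonic)
    attempts: int = 0


class ChargeQueue:
    def __init__(self, send: Callable[[str], None], max_age_seconds: float = 900.0):
        self._send = send  # stand-in for the provider call; raises on failure
        self._max_age = max_age_seconds
        self._pending: Deque[PendingCharge] = deque()
        self.failed: List[str] = []  # orders given up on after a prolonged outage

    def __len__(self) -> int:
        return len(self._pending)

    def submit(self, order_id: str) -> None:
        """User request path: enqueue and return without calling the provider."""
        self._pending.append(PendingCharge(order_id))

    def drain(self, now: Optional[float] = None) -> None:
        """Worker path: try each queued item once; requeue until it ages out."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self._pending)):
            item = self._pending.popleft()
            try:
                self._send(item.order_id)
            except Exception:
                item.attempts += 1
                if now - item.enqueued_at > self._max_age:
                    self.failed.append(item.order_id)  # surface the failure
                else:
                    self._pending.append(item)  # retry on the next drain
```

With this shape, a short provider outage costs the user nothing: `submit` succeeds immediately, and only items older than the cutoff count against your SLO.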
Common Pitfalls
SLI for the wrong thing. “CPU < 80%” isn’t an SLI. Users don’t care about CPU.
Targets without basis. Picking 99.99% because it sounds good. Look at history first.
No consequences. SLO is paper if nothing happens when you miss it.
Per-endpoint SLOs. Granular but unmanageable. Aggregate to service-level.
Alerting on the wrong threshold. “Alert when SLO breached” = page when you’ve already failed. Use burn rate alerts (predictive).
Forgetting maintenance windows. Planned downtime should be excluded from SLO accounting. Document the policy.
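For the maintenance-window pitfall, one way to apply the exclusion is to drop events that fall inside a documented window before computing the ratio. A sketch, assuming events are available as (timestamp, was_good) pairs. The types and function name are illustrative:

```python
from datetime import datetime
from typing import Iterable, Sequence, Tuple

Event = Tuple[datetime, bool]        # (timestamp, was_good)
Window = Tuple[datetime, datetime]   # planned maintenance, start inclusive, end exclusive


def sli_excluding_maintenance(events: Iterable[Event], windows: Sequence[Window]) -> float:
    """Good/total ratio over events that fall outside every maintenance window."""
    counted = [
        good
        for ts, good in events
        if not any(start <= ts < end for start, end in windows)
    ]
    if not counted:
        return 1.0  # assumption: nothing counted means nothing was missed
    return sum(counted) / len(counted)


maintenance = [(datetime(2024, 1, 7, 2, 0), datetime(2024, 1, 7, 3, 0))]
events = [
    (datetime(2024, 1, 7, 1, 30), True),
    (datetime(2024, 1, 7, 2, 15), False),  # inside the window, not counted
    (datetime(2024, 1, 7, 3, 30), False),  # outside the window, counted
    (datetime(2024, 1, 7, 4, 0), True),
]
print(f"{sli_excluding_maintenance(events, maintenance):.3f}")  # 0.667
```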
Wrapping Up
Pick 2 SLIs per service, set a realistic target, compute error budget from Prometheus, build burn-rate alerts. Monday: error budgets and burn rate in depth.