Service-Level Objectives in Practice


September 23, 2022 · 4 min read · by Muhammad Amal · programming

TL;DR — SLO = a target percentage of “good events” over a window. Pick 2-3 user-facing SLIs per service. Realistic targets (99.9% not 99.999%). Compute error budget = (1 - target) × total. When budget runs out, pause feature work. SLO without enforcement is decoration.

After Tempo, the operational discipline layer. SLOs (Service Level Objectives) turn metrics into commitments. This post covers the practical setup, not the theory (see Google’s SRE book for that).

SLI, SLO, SLA — three terms

SLI (Service Level Indicator): the metric you measure. “Percentage of API requests returning 2xx within 200ms.”

SLO: the target for the SLI over a window. “99.5% of requests over 30 days.”

SLA: contractual commitment (with penalties). “If we drop below 99.5% per month, customers get a 10% refund.”

You set SLIs and SLOs internally. SLAs are negotiated externally. Most teams have SLOs without SLAs.

Picking SLIs

Two per service is usually enough:

Availability SLI: fraction of requests that succeed.

sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))

Latency SLI: fraction of requests faster than threshold.

sum(rate(http_request_duration_seconds_bucket{service="api",le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="api"}[5m]))
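Both queries are a ratio of good events to total events. A minimal Python sketch of the same arithmetic, assuming you already have raw counts for a window (the function and parameter names are illustrative, not part of any Prometheus client API):

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not return a 5xx."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as meeting the SLI.
        return 1.0
    return (total_requests - server_errors) / total_requests


def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Fraction of requests that finished within the latency threshold."""
    if total_requests == 0:
        return 1.0
    return requests_under_threshold / total_requests


# Hypothetical counts for one window.
print(availability_sli(total_requests=10_000, server_errors=12))
print(latency_sli(requests_under_threshold=9_640, total_requests=10_000))
```

How to treat an empty window is a policy choice. Returning 1.0 here is one option, not the only reasonable one.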

For batch jobs / async workers: success rate of jobs, not requests.

For databases: query latency p99 < 100ms.

For UI: page load p95 < 2s.

Match the SLI to user pain. “Requests succeed” is what users care about. “CPU < 80%” is what you care about.

Setting targets

Common ranges:

99%: two 9s. ~7.2 hours downtime/month. Easy to maintain; reasonable for non-critical services
99.5%: solid. ~3.6 hours downtime/month
99.9%: three 9s. ~43 minutes/month
99.99%: four 9s. ~4 minutes/month
99.999%: five 9s. ~26 seconds/month. Requires fault-tolerant architecture
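The downtime figures follow directly from the target and the window length. A small sketch, assuming a 30-day month and treating every minute of the window as equally weighted (the helper name is made up for illustration):

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - target) * window_minutes


for target in (0.99, 0.995, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%}: {minutes:.2f} minutes per 30 days")
```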

Set realistic targets based on:

Historical performance (where are you already?)
Customer expectations
Cost of getting to the next 9

The jump from 99.9% to 99.99% is often 10× the operational investment. Worth it only if customers genuinely need it.

Error budget

The math:

Error budget = (1 - target) × total events over window

For 99.9% SLO on 30M requests/month:

0.001 × 30M = 30,000 allowed bad events

Each 5xx eats budget. Each request over 500ms eats budget. Run out before month end = problem.
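The same calculation as a short Python sketch. The function names are illustrative, and the total is assumed to be the expected event count for the whole window, so mid-month you would plug in a forecast rather than the count so far:

```python
def error_budget(target: float, total_events: int) -> float:
    """Number of bad events the SLO allows over the window."""
    return (1 - target) * total_events


def budget_remaining_fraction(target: float, total_events: int, bad_events: int) -> float:
    """Share of the error budget still unspent; negative once it is blown."""
    budget = error_budget(target, total_events)
    return (budget - bad_events) / budget


# 99.9% SLO on 30M requests/month, with 12,000 bad events so far (hypothetical).
print(error_budget(0.999, 30_000_000))
print(budget_remaining_fraction(0.999, 30_000_000, 12_000))
```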

Calculate burn rate, starting from the observed error rate over the last hour:

1 - (
  sum(rate(http_requests_total{service="api",status!~"5.."}[1h]))
  /
  sum(rate(http_requests_total{service="api"}[1h]))
)

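Burn rate is this error rate divided by the allowed error rate (1 - target). If the query returns 0.001 and your SLO is 99.9% (allowed error rate 0.001), you’re burning budget at 1×, exactly sustainable. At 0.005 you’re burning at 5× and will run out in 1/5 of the window.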
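To make the ratio explicit, here is a minimal sketch that turns an observed error rate into a burn rate and a time to exhaustion. It assumes the error rate stays constant and the budget starts full; the names are illustrative:

```python
import math


def burn_rate(error_rate: float, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - target)


def days_to_exhaustion(burn: float, window_days: int = 30) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    if burn <= 0:
        return math.inf  # no errors, the budget never runs out
    return window_days / burn


rate = burn_rate(error_rate=0.005, target=0.999)
print(rate, days_to_exhaustion(rate))
```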

Operationalizing — the policy

SLOs without consequences don’t help. The Google policy:

If you blow your error budget, feature work stops until you’ve fixed reliability.

This is the cultural shift that makes SLOs real. Without it, you have aspirational targets nobody acts on.

Soft version (what most companies actually do):

Under 75% of budget consumed: feature work as normal
75-100% of budget consumed: investigate; consider deferring risky changes
100% consumed (budget exhausted): stop deploying risky changes; focus on stability
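A tiered policy like this is easy to encode so that a dashboard or deploy gate can read it. The sketch below reads the percentages as the share of error budget already consumed, which is an assumption about the tiers, and the action labels are placeholders:

```python
def policy_action(budget_consumed: float) -> str:
    """Map the share of error budget consumed (0.0 to 1.0+) to a policy tier."""
    if budget_consumed >= 1.0:
        return "freeze-risky-deploys"
    if budget_consumed >= 0.75:
        return "investigate"
    return "normal"


for consumed in (0.40, 0.80, 1.05):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```

Writing the thresholds down in code matters less than agreeing on them before the budget is blown.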

The mechanism creates pressure to make reliability investments. Without it, reliability work loses to feature work every time.

Dashboards for SLOs

Three panels per SLO:

Current SLO performance (stat, big number, color-coded)
Error budget remaining (gauge or stat)
Burn rate (time series, last 24h)

For multi-window/multi-burn-rate alerts (Google’s recommendation):

Alert if:
  (1h burn rate > 14.4) AND (5m burn rate > 14.4)
  → fast page (using 2% of a 30-day budget in 1 hour)

OR

  (6h burn rate > 6) AND (30m burn rate > 6)
  → slower page (using 5% of a 30-day budget in 6 hours)

Catches both sudden outages (fast burn) and slow degradations (gradual burn).
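The two-window check can be sketched in a few lines of Python. This illustrates the logic only, not an Alertmanager or Prometheus rule; it assumes you have already computed error rates per window, and the dictionary keys and class name are made up:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    long_window: str   # confirms the burn is significant
    short_window: str  # confirms the burn is still happening
    factor: float      # burn-rate multiple that triggers a page


WINDOWS = (
    BurnWindow("1h", "5m", 14.4),  # fast burn
    BurnWindow("6h", "30m", 6.0),  # slower burn
)


def should_page(error_rates: dict[str, float], target: float) -> bool:
    """Page when both windows of any pair exceed factor x allowed error rate."""
    allowed = 1 - target
    for w in WINDOWS:
        threshold = w.factor * allowed
        if error_rates[w.long_window] > threshold and error_rates[w.short_window] > threshold:
            return True
    return False


# Hypothetical sudden outage against a 99.9% SLO.
print(should_page({"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.02}, target=0.999))
```

The short window is what stops the alert from firing long after an incident has recovered: once the 5m or 30m error rate drops, the pair no longer matches.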

SLOs for things you don’t control

External dependencies (Stripe, AWS, third-party APIs) have their own SLOs. You consume them. Two options:

Inherit their SLO. If Stripe is 99.99%, your service can be at most 99.99% (assuming you depend on every Stripe call).
Decouple via async + retry. Your service success doesn’t require Stripe success right now. Pattern: queue Stripe operations, retry, only fail user request on prolonged outage.
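The ceiling in the first option comes from multiplying availabilities. A rough sketch, assuming every request needs every dependency and that failures are independent (real outages are often correlated, so treat the result as an estimate):

```python
import math


def availability_ceiling(own: float, dependencies: list[float]) -> float:
    """Best-case availability when each request needs every dependency."""
    return own * math.prod(dependencies)


# A perfect service in front of a single 99.99% dependency.
print(availability_ceiling(1.0, [0.9999]))

# Hypothetical: a 99.95% service with two hard dependencies.
print(availability_ceiling(0.9995, [0.9999, 0.999]))
```

With two hard dependencies the ceiling can fall below a 99.9% target even if your own code almost never fails, which is the pressure behind the decoupling option.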

The decoupling work is real. The SLO budget often forces it.

Common Pitfalls

SLI for the wrong thing. “CPU < 80%” isn’t an SLI. Users don’t care about CPU.

Targets without basis. Picking 99.99% because it sounds good. Look at history first.

No consequences. SLO is paper if nothing happens when you miss it.

Per-endpoint SLOs. Granular but unmanageable. Aggregate to service-level.

Alerting on the wrong threshold. “Alert when SLO breached” = page when you’ve already failed. Use burn rate alerts (predictive).

Forgetting maintenance windows. Planned downtime should be excluded from SLO accounting. Document the policy.

Wrapping Up

Pick 2 SLIs per service, set a realistic target, compute error budget from Prometheus, build burn-rate alerts. Monday: error budgets and burn rate in depth.
