Error Budgets and Burn Rates
TL;DR — Error budget = (1 - SLO target) × total events. Burn rate = current bad-event rate / sustainable rate. Multi-window alerts (5m + 1h, 30m + 6h) catch both fast outages and slow degradations. Action policy: pause risk-introducing changes when budget burns hot.
After SLOs in practice, the alerting half. Burn rate is the practical lever for “when to act on SLO budget.”
The math
For a 99.9% availability SLO over 30 days on a service doing 1,000 req/min:
- Total requests/30 days = 43.2M
- Allowed errors = 0.001 × 43.2M = 43,200 errors
- Allowed error rate (avg) = 0.1% = 1 per 1000 requests
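The arithmetic above can be reproduced with a small helper. This is a minimal sketch; the function name `error_budget` and its parameters are illustrative assumptions, not part of any library.

```python
def error_budget(slo_target: float, requests_per_minute: float, window_days: int = 30) -> dict:
    """Return total events, allowed bad events, and allowed error ratio for an SLO window."""
    total_requests = requests_per_minute * 60 * 24 * window_days
    allowed_error_ratio = 1.0 - slo_target
    return {
        "total_requests": total_requests,
        "allowed_errors": total_requests * allowed_error_ratio,
        "allowed_error_ratio": allowed_error_ratio,
    }


# The example from the text: 99.9% SLO, 1,000 req/min, 30-day window.
budget = error_budget(slo_target=0.999, requests_per_minute=1000)
print(f"{budget['total_requests']:,.0f} requests, {budget['allowed_errors']:,.0f} allowed errors")
```

Changing `slo_target` to 0.9999 cuts the allowed errors by a factor of ten, which is a quick way to see how expensive an extra nine is.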
Burn rate = how fast you’re using the budget relative to sustainable:
burn rate = (errors / total) / SLO error rate
= current error rate / 0.001
- Current 0.001 (0.1%): burn rate 1×, exactly sustainable
- Current 0.005 (0.5%): burn rate 5×, using up the 30-day budget in 6 days
- Current 0.05 (5%): burn rate 50×, full budget gone in about 14.4 hours
The higher the burn rate, the more urgent.
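As a rough sketch of the relationship between error ratio, burn rate, and time to exhaustion (the helper names `burn_rate` and `hours_to_exhaustion` are hypothetical, chosen for this example):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Current error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# The three error ratios from the list above, against a 99.9% SLO.
for ratio in (0.001, 0.005, 0.05):
    rate = burn_rate(ratio, slo_target=0.999)
    print(f"error ratio {ratio:.1%}: burn rate {rate:.1f}x, "
          f"budget gone in {hours_to_exhaustion(rate):.1f} h")
```

A 5× burn gives 144 hours (6 days) and a 50× burn gives 14.4 hours, matching the figures above.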
Multi-window alerts
A single threshold alert (“burn rate > 10”) flaps. The multi-window pattern from the Google SRE Workbook:
Fast burn:
- alert: SLOFastBurn
  expr: |
    (
      slo:burn_rate:5m{service="api"} > 14.4
      AND
      slo:burn_rate:1h{service="api"} > 14.4
    )
  for: 2m
  labels: { severity: critical }
14.4× burn rate over both the 5m AND 1h windows = consuming 2% of the 30-day budget in 1 hour (14.4 × 1h / 720h). Real fire.
Slow burn:
- alert: SLOSlowBurn
  expr: |
    (
      slo:burn_rate:30m{service="api"} > 6
      AND
      slo:burn_rate:6h{service="api"} > 6
    )
  for: 15m
  labels: { severity: warning }
6× burn over 30m and 6h = consuming 5% of the 30-day budget in 6 hours (6 × 6h / 720h). Trending bad.
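The thresholds 14.4 and 6 are not arbitrary: each is the burn rate that spends a chosen fraction of a 30-day budget within the long alert window. A minimal sketch of that derivation, with hypothetical function names:

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `alert_window_hours`."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def budget_consumed(rate: float, hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the budget spent after `hours` at a constant burn rate."""
    return rate * hours / (slo_window_days * 24)


# 2% of the budget in 1 hour and 5% in 6 hours, over a 30-day (720 h) window.
print(f"fast burn threshold: {burn_rate_threshold(0.02, 1):.1f}")
print(f"slow burn threshold: {burn_rate_threshold(0.05, 6):.1f}")
```

Teams with a different SLO window (say 28 days) can plug it in as `slo_window_days` to get thresholds that mean the same thing in budget terms.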
Two windows prevent false alarms: the short window says “must be currently bad,” the long window says “not just a blip.”
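To make the two-window logic concrete, here is a small sketch that applies the fast-burn condition to three made-up scenarios. The burn-rate values are invented for illustration; only the 14.4 threshold comes from the alert above.

```python
def should_alert(short_window_burn: float, long_window_burn: float, threshold: float) -> bool:
    """Fire only when both the short and the long window exceed the threshold."""
    return short_window_burn > threshold and long_window_burn > threshold


# (5m burn rate, 1h burn rate) for three hypothetical situations.
scenarios = {
    "brief blip": (40.0, 2.0),          # 5m spikes, 1h barely moves
    "ongoing outage": (40.0, 20.0),     # both windows elevated
    "recovered incident": (0.5, 16.0),  # 1h still high, 5m back to normal
}

for name, (short_burn, long_burn) in scenarios.items():
    print(f"{name}: alert={should_alert(short_burn, long_burn, threshold=14.4)}")
```

Only the ongoing outage fires. The blip is filtered by the long window, and the recovered incident stops alerting as soon as the short window clears, rather than waiting for the 1h average to decay.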
Recording rules for burn rate
Computing burn rate directly in alert expressions is slow, because every evaluation rescans the raw series over windows as long as 6h. Pre-compute with recording rules:
- name: api_slo
  interval: 30s
  rules:
    # Error ratio per window. "by (service)" keeps the service label,
    # which the alert selectors {service="api"} need in order to match.
    - record: slo:errors:rate5m
      expr: |
        sum by (service) (rate(http_requests_total{service="api",status=~"5.."}[5m]))
        /
        sum by (service) (rate(http_requests_total{service="api"}[5m]))
    - record: slo:errors:rate1h
      expr: |
        sum by (service) (rate(http_requests_total{service="api",status=~"5.."}[1h]))
        /
        sum by (service) (rate(http_requests_total{service="api"}[1h]))
    - record: slo:errors:rate30m
      expr: |
        sum by (service) (rate(http_requests_total{service="api",status=~"5.."}[30m]))
        /
        sum by (service) (rate(http_requests_total{service="api"}[30m]))
    - record: slo:errors:rate6h
      expr: |
        sum by (service) (rate(http_requests_total{service="api",status=~"5.."}[6h]))
        /
        sum by (service) (rate(http_requests_total{service="api"}[6h]))
    # Burn rate = error ratio / allowed error ratio (1 - 0.999 = 0.001).
    - record: slo:burn_rate:5m
      expr: slo:errors:rate5m / 0.001
    - record: slo:burn_rate:1h
      expr: slo:errors:rate1h / 0.001
    - record: slo:burn_rate:30m
      expr: slo:errors:rate30m / 0.001
    - record: slo:burn_rate:6h
      expr: slo:errors:rate6h / 0.001
Alerts query the recording rules, not the raw request counters.
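The four windows differ only in the range selector, so the rule list can be generated instead of hand-copied. The sketch below builds the rule definitions as plain dictionaries and prints them as JSON. The function name is hypothetical, and the `by (service)` grouping is an assumption made so that the recorded series keep a `service` label for the alerts to select on.

```python
import json


def recording_rules(service: str, slo_target: float, windows: list[str]) -> list[dict]:
    """Build error-ratio and burn-rate recording rules for each window."""
    allowed = round(1.0 - slo_target, 10)  # e.g. 0.001 for a 99.9% SLO
    labels = f'service="{service}"'
    rules = []
    for window in windows:
        errors = f'sum by (service) (rate(http_requests_total{{{labels},status=~"5.."}}[{window}]))'
        total = f'sum by (service) (rate(http_requests_total{{{labels}}}[{window}]))'
        rules.append({"record": f"slo:errors:rate{window}", "expr": f"{errors} / {total}"})
        rules.append({"record": f"slo:burn_rate:{window}",
                      "expr": f"slo:errors:rate{window} / {allowed}"})
    return rules


rules = recording_rules("api", slo_target=0.999, windows=["5m", "1h", "30m", "6h"])
print(json.dumps({"name": "api_slo", "interval": "30s", "rules": rules}, indent=2))
```

JSON is used here only to avoid a third-party YAML dependency; since YAML is a superset of JSON, the output can be pasted into a rule file under `groups:` or converted with whatever tooling is already in use.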
Dashboard
Per service, an SLO panel:
┌──────────────────────────────────────────────────────┐
│ API Availability SLO                                 │
├──────────────────────────────────────────────────────┤
│ Current 30d performance:  99.965% (target 99.9%)     │
│ Error budget remaining:   65% (good)                 │
│ Burn rate (1h):           0.4× (sustainable)         │
│ Burn rate (6h):           0.6× (sustainable)         │
└──────────────────────────────────────────────────────┘
Plus a 30-day error budget burndown chart. It answers “are we using budget linearly?” Spikes in consumption correlate with incidents.
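The “budget remaining” figure on such a panel follows directly from event counts over the window. A minimal sketch, using invented counts (15,120 bad requests out of 43.2M) and a hypothetical function name:

```python
def budget_remaining(total_events: int, bad_events: int, slo_target: float) -> float:
    """Fraction of the error budget left; negative means the budget is overspent."""
    allowed_bad = total_events * (1.0 - slo_target)
    return 1.0 - bad_events / allowed_bad


# Assumed example numbers for a 30-day window at 1,000 req/min.
total, bad = 43_200_000, 15_120
availability = 1.0 - bad / total
remaining = budget_remaining(total, bad, slo_target=0.999)
print(f"availability {availability:.3%}, budget remaining {remaining:.0%}")
```

Note how small the gap is between the availability figure and the target even when a third of the budget is gone, which is why the budget percentage is usually the more readable number on a dashboard.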
Action policy
SLOs without action policies are decoration. Three concrete tiers:
Budget healthy (> 50%): normal operation. Ship features. Take risks.
Budget tight (10-50%): review risky changes carefully. No code freezes; just elevated scrutiny.
Budget exhausted (< 10%): pause risky feature work. Focus on stability. Postmortem the incidents that ate the budget.
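The three tiers can be encoded as a small function that a deploy pipeline or a status bot could call. This is a sketch only: the tier names and the treatment of the exact 10% and 50% boundaries are assumptions, since the text leaves them open.

```python
def budget_tier(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0 to 1.0, may be negative) to a policy tier."""
    if remaining_fraction > 0.50:
        return "healthy"    # normal operation, ship features
    if remaining_fraction >= 0.10:
        return "tight"      # elevated scrutiny on risky changes
    return "exhausted"      # pause risky work, focus on stability


for remaining in (0.65, 0.30, 0.05, -0.20):
    print(f"{remaining:+.0%} remaining -> {budget_tier(remaining)}")
```

Keeping the policy in code, even in a form this simple, makes the thresholds reviewable and removes the argument about what “low” means during an incident.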
This is the SRE-book version. Real adoption varies:
- Most teams: informal, “we should be careful when budget’s low”
- Mature SRE shops: explicit code freeze policy when budget gone
- Big tech: PRs blocked from deploy automatically when budget violated
Even informal awareness changes behavior. A monthly “here’s our SLO status” review surfaces patterns.
Resetting budget
Two reset patterns:
Rolling 30-day window: budget continuously calculated over last 30 days. New errors enter; old errors leave. Smooth.
Calendar-aligned (monthly): budget resets the 1st of each month. Discontinuity but easier to communicate.
Engineering teams typically use rolling windows for internal SLOs. Customer-facing SLAs often use calendar periods.
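The difference between the two reset patterns shows up right after a month boundary. The sketch below counts errors both ways for an invented incident on 28 January, checked on 2 February; the dates, counts, and function names are all illustrative.

```python
from datetime import date, timedelta


def rolling_window_errors(daily_errors: dict[date, int], today: date,
                          window_days: int = 30) -> int:
    """Sum errors over the last `window_days` days, including today."""
    start = today - timedelta(days=window_days - 1)
    return sum(count for day, count in daily_errors.items() if start <= day <= today)


def calendar_month_errors(daily_errors: dict[date, int], today: date) -> int:
    """Sum errors from the 1st of the current month through today."""
    return sum(count for day, count in daily_errors.items()
               if (day.year, day.month) == (today.year, today.month) and day <= today)


# Hypothetical data: a large incident late in January, a small one in February.
daily_errors = {date(2024, 1, 28): 20_000, date(2024, 2, 1): 500}
today = date(2024, 2, 2)

rolling = rolling_window_errors(daily_errors, today)
calendar = calendar_month_errors(daily_errors, today)
print(f"rolling 30d: {rolling} errors, calendar month: {calendar} errors")
```

The rolling window still carries the January incident, while the calendar budget has already forgotten it. That discontinuity is the trade-off described above.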
When SLOs and alerts conflict
You might have an SLO alert AND a separate “high error rate” alert. Don’t.
Pick one source of truth for “is this service OK.” Usually: SLO burn rate is the alert.
Separately, you can have alerts for:
- “Something we want to know but isn’t user-facing” (cron job failed)
- “Predictive” (disk filling, expiry coming up)
- “Anomaly detection” (unusual traffic pattern)
Reserve “page someone” for SLO burn rate alerts on user-facing services.
Common Pitfalls
Burn rate alerts on too-short window. 5-minute burn rate alone is noisy. Always pair with longer window.
Alerting on SLO breach not burn rate. By the time you’ve broken SLO, it’s too late.
No exclusion for planned events. Maintenance windows shouldn’t count. Implement via silences or recording-rule filters.
Different SLOs per region without rollup. Customers don’t care about us-east-1 vs eu-west-1. Roll the regions up into a combined SLO.
Not watching the burndown shape. Budget consumed roughly linearly is healthy. Spikes = incidents. Spikes should be unusual; if every week has them, your SLO is too tight or your reliability is too low.
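On the regional rollup pitfall: a combined SLO should be computed from summed event counts, not by averaging per-region ratios, or a small region will be over-weighted. A minimal sketch with invented traffic numbers and a hypothetical function name:

```python
def combined_error_ratio(regions: dict[str, tuple[int, int]]) -> float:
    """`regions` maps region name to (bad_events, total_events)."""
    bad = sum(bad_events for bad_events, _ in regions.values())
    total = sum(total_events for _, total_events in regions.values())
    return bad / total


# Assumed example: a large healthy region and a small degraded one.
regions = {
    "us-east-1": (900, 1_000_000),  # 0.09% errors
    "eu-west-1": (300, 100_000),    # 0.30% errors
}

combined = combined_error_ratio(regions)
naive_average = sum(bad / total for bad, total in regions.values()) / len(regions)
print(f"combined: {combined:.4%}, naive average of ratios: {naive_average:.4%}")
```

The same principle applies in PromQL: sum the numerators and denominators across regions before dividing, rather than averaging pre-divided ratios.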
Wrapping Up
Multi-window burn rate alerts on pre-computed recording rules. Action policy that ties budget to behavior. Wednesday: Prometheus cardinality + cost control.