March 27, 2026 · 9 min read · by Muhammad Amal programming

TL;DR

  • An AI pipeline SLO must cover latency, errors, and answer quality, not just uptime.
  • Sloth generates burn-rate rules from a short spec so you don’t hand-write PromQL.
  • Multi-window alerts page fast on a real fire and stay quiet on a slow leak.

A distributed AI pipeline fails in ways a CRUD service never does. The retriever returns stale chunks. The model endpoint rate-limits you and a retry storm doubles latency. A reranker silently degrades after a deploy and the answers get worse without a single 500 in the logs. Traditional uptime monitoring — is the service returning 200s — catches none of this. The service is “up” the whole time it’s quietly serving garbage.

That’s why AI pipelines need SLOs framed around outcomes the user actually experiences, plus an error budget that turns reliability into a number you can spend. An error budget reframes the argument: instead of “is it reliable enough,” the question becomes “how much unreliability is left this month, and do we spend it on a risky deploy or save it.” That’s a decision a team can make calmly instead of a 3am judgment call.

This post builds SLOs and error budgets for an AI pipeline using Prometheus 3.x for the metrics, Sloth to generate the rules, and Grafana 11 for the burn-down view. We’ll define multiple SLIs across the pipeline stages and wire multi-window burn-rate alerts. The alerting mechanics here build on building real-time alerting dashboards with Prometheus and Grafana.

What an SLO Actually Is

An SLI — service level indicator — is a measured ratio of good events to total events. An SLO is a target for that ratio over a window, say 99.5% over 30 days. The error budget is the inverse: 100% minus the SLO. At 99.5%, you have 0.5% of events you’re allowed to fail. Over 30 days that’s roughly 3.6 hours of full outage, or an equivalent trickle of failures.
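The budget arithmetic is easy to check by hand. Here is a minimal sketch of it in Python; the function name error_budget is made up for illustration and is not part of any library.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> tuple[float, timedelta]:
    """Return (budget as a fraction of events, equivalent full-outage time)."""
    budget_ratio = (100.0 - slo_percent) / 100.0
    # If every request failed, the budget would last this long.
    return budget_ratio, window * budget_ratio


ratio, downtime = error_budget(99.5, timedelta(days=30))
hours = downtime.total_seconds() / 3600
print(f"budget: {ratio:.3%} of events, about {hours:.1f} hours of full outage")
```

The same function shows how quickly the budget shrinks as the target tightens: try 99.9 or 99.99 in place of 99.5.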

The budget is the point. When it’s healthy, ship fast and take risks. When it’s nearly spent, freeze risky changes and stabilize. It converts a vague reliability conversation into a quantity both engineering and product can see.

For an AI pipeline, “good event” needs care. A request that returns HTTP 200 in 8 seconds with an empty retrieval is not a good event. You’ll typically want three SLIs:

  • Availability — the request completed without a server error.
  • Latency — the request completed within a target, say 95% under 3 seconds end to end.
  • Quality — a measurable proxy for answer usefulness: retrieval returned at least one chunk above a relevance threshold, or a groundedness check passed.
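One way to picture the three SLIs is as three independent verdicts on the same request. The sketch below is illustrative only: RequestOutcome, sli_verdicts, and the threshold defaults are assumptions for the example, not part of the pipeline code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestOutcome:
    status_code: int
    duration_seconds: float
    top_score: Optional[float]  # best retrieval score, None if nothing came back


def sli_verdicts(outcome: RequestOutcome,
                 latency_target: float = 3.0,
                 min_score: float = 0.72) -> dict:
    """Judge one request against each SLI separately."""
    return {
        "availability": outcome.status_code < 500,
        "latency": outcome.duration_seconds <= latency_target,
        "quality": outcome.top_score is not None and outcome.top_score >= min_score,
    }


# The HTTP 200 that took 8 seconds and retrieved nothing:
slow_and_empty = RequestOutcome(status_code=200, duration_seconds=8.0, top_score=None)
print(sli_verdicts(slow_and_empty))
```

That request passes availability and fails the other two, which is exactly the case uptime monitoring misses.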

Instrumenting the Pipeline

You can’t have an SLO without the SLI metric. Instrument each request with the outcome of all three dimensions. Here’s a Python pipeline wrapper exposing Prometheus metrics.

# slo_metrics.py — prometheus-client 0.21.x
import time
from contextlib import contextmanager
from prometheus_client import Counter, Histogram

pipeline_requests = Counter(
    "ai_pipeline_requests_total",
    "AI pipeline requests by stage outcome.",
    ["stage", "outcome"],          # outcome: success | error
)

pipeline_latency = Histogram(
    "ai_pipeline_duration_seconds",
    "End-to-end pipeline latency.",
    buckets=[0.25, 0.5, 1, 2, 3, 5, 8, 13, 21],
)

retrieval_quality = Counter(
    "ai_pipeline_retrieval_total",
    "Retrieval outcomes for the quality SLI.",
    ["result"],                     # result: relevant | empty | low_score
)


@contextmanager
def track_request():
    """Wrap a full pipeline invocation and record SLI events."""
    start = time.perf_counter()
    try:
        yield
        pipeline_requests.labels(stage="pipeline", outcome="success").inc()
    except Exception:
        pipeline_requests.labels(stage="pipeline", outcome="error").inc()
        raise
    finally:
        pipeline_latency.observe(time.perf_counter() - start)


def record_retrieval(chunks: list, min_score: float = 0.72) -> None:
    """Classify a retrieval result for the quality SLI."""
    if not chunks:
        retrieval_quality.labels(result="empty").inc()
    elif max(c.score for c in chunks) < min_score:
        retrieval_quality.labels(result="low_score").inc()
    else:
        retrieval_quality.labels(result="relevant").inc()

The quality SLI here is deliberately a proxy. You won’t have a human grading every answer, so pick something cheap and correlated: top retrieval score, a groundedness classifier verdict, citation count. It doesn’t need to be perfect, it needs to move when quality moves.

Defining the SLO with Sloth

Hand-writing burn-rate PromQL is error-prone — there are eight windows per SLO and the math is fiddly. Sloth takes a short YAML spec and generates all the recording and alerting rules. Here’s the spec for the pipeline.

# slo/ai-pipeline.yml — Sloth v0.12.x spec
version: prometheus/v1
service: ai-pipeline
labels:
  team: ml-platform
slos:
  - name: availability
    objective: 99.5
    description: "Pipeline requests complete without a server error."
    sli:
      events:
        error_query: sum(rate(ai_pipeline_requests_total{stage="pipeline",outcome="error"}[{{.window}}]))
        total_query: sum(rate(ai_pipeline_requests_total{stage="pipeline"}[{{.window}}]))
    alerting:
      name: AIPipelineAvailability
      page_alert:
        labels: { severity: page }
      ticket_alert:
        labels: { severity: ticket }

  - name: latency-p95-3s
    objective: 95.0
    description: "95% of pipeline requests finish within 3 seconds."
    sli:
      events:
        error_query: |
          sum(rate(ai_pipeline_duration_seconds_bucket{le="3"}[{{.window}}]))
        total_query: sum(rate(ai_pipeline_duration_seconds_count[{{.window}}]))
      # Caution: le="3" counts GOOD requests, but Sloth treats error_query as BAD events.
      # As written this SLI is inverted; the corrected query follows this spec.
    alerting:
      name: AIPipelineLatency
      page_alert:
        labels: { severity: page }
      ticket_alert:
        labels: { severity: ticket }

  - name: retrieval-quality
    objective: 98.0
    description: "Retrieval returns at least one relevant chunk."
    sli:
      events:
        error_query: |
          sum(rate(ai_pipeline_retrieval_total{result=~"empty|low_score"}[{{.window}}]))
        total_query: sum(rate(ai_pipeline_retrieval_total[{{.window}}]))
    alerting:
      name: AIPipelineRetrievalQuality
      page_alert:
        labels: { severity: page }
      ticket_alert:
        labels: { severity: ticket }

One subtlety: Sloth’s error_query must count bad events. For latency, the histogram bucket le="3" counts requests at or under the threshold, which are the good ones, so the latency SLO as written above would report every fast request as an error. The cleanest fix is to express the bad query as total - good:

    sli:
      events:
        error_query: |
          (
            sum(rate(ai_pipeline_duration_seconds_count[{{.window}}]))
            -
            sum(rate(ai_pipeline_duration_seconds_bucket{le="3"}[{{.window}}]))
          )
        total_query: sum(rate(ai_pipeline_duration_seconds_count[{{.window}}]))
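To see why total - good is the right shape, it helps to remember that Prometheus histogram buckets are cumulative. The sketch below uses made-up bucket counts and a hypothetical helper name, and it works on raw counts rather than rates.

```python
# Cumulative counts, as a Prometheus histogram exposes them: each "le" bucket
# includes every observation at or below its bound. The numbers are invented.
buckets = {"0.25": 120, "0.5": 410, "1": 760, "2": 905, "3": 962, "5": 990, "+Inf": 1000}


def latency_error_ratio(cumulative: dict, threshold_le: str) -> float:
    """Fraction of requests slower than the threshold bucket."""
    total = cumulative["+Inf"]       # same value as the _count series
    good = cumulative[threshold_le]  # requests at or under the threshold
    return (total - good) / total


print(f"bad ratio:      {latency_error_ratio(buckets, '3'):.3f}")
print(f"wrong polarity: {buckets['3'] / buckets['+Inf']:.3f}")
```

With the polarity reversed, the "error" ratio is the share of fast requests, so the SLO would look permanently on fire.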

Generate the Prometheus rules:

# Sloth v0.12.x
sloth generate -i slo/ai-pipeline.yml -o slo/ai-pipeline.rules.yml

That produces SLI recording rules, error-budget recording rules, and a complete set of multi-window burn-rate alerts. Wire the output into Prometheus:

# prometheus.yml — Prometheus 3.x
rule_files:
  - slo/ai-pipeline.rules.yml

Multi-Window Burn-Rate Alerting

The reason to use generated rules is burn rate. Burn rate measures how fast you’re consuming the error budget relative to “on pace.” A burn rate of 1 means you’ll spend exactly the month’s budget in the month. A burn rate of 14.4 means you’ll spend it all in roughly 2 days.
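As a rough sketch of that arithmetic, assuming a 30-day period and the 99.5% objective (the helper names are hypothetical):

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than 'on pace' the budget is being spent."""
    budget = (100.0 - slo_percent) / 100.0
    return error_ratio / budget


def days_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Days until the whole period's budget is gone at a constant burn rate."""
    return period_days / rate


# A sustained 7.2% error ratio against a 0.5% budget:
rate = burn_rate(0.072, 99.5)
print(f"burn rate {rate:.1f}, budget gone in {days_to_exhaustion(rate):.2f} days")
```

Read the other way round, a burn-rate threshold is just an error-ratio threshold: 14.4 times a 0.005 budget is a 7.2% error ratio.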

A naive alert fires when burn rate exceeds some threshold over a single window. That’s bad: a short window is noisy and a long window is slow. Multi-window alerts — which Sloth generates — combine a long and a short window so the alert only fires when both agree.

# excerpt of Sloth-generated output — do not hand-edit
- alert: AIPipelineAvailability
  expr: |
    (
      slo:sli_error:ratio_rate5m{sloth_id="ai-pipeline-availability"} > (14.4 * 0.005)
      and
      slo:sli_error:ratio_rate1h{sloth_id="ai-pipeline-availability"} > (14.4 * 0.005)
    )
    or
    (
      slo:sli_error:ratio_rate30m{sloth_id="ai-pipeline-availability"} > (6 * 0.005)
      and
      slo:sli_error:ratio_rate6h{sloth_id="ai-pipeline-availability"} > (6 * 0.005)
    )
  labels:
    severity: page

The fast pair (5m and 1h at 14.4x) catches a hard outage in minutes. The slow pair (30m and 6h at 6x) catches a gradual leak that would otherwise quietly drain the whole budget in about five days. The short window in each pair is what makes the alert resolve quickly once the incident is over, so on-call isn’t chasing a stale page. This is the Google SRE multi-window approach, and it’s worth understanding even though Sloth writes the rules for you.
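The page condition is easier to reason about as plain boolean logic. This is a sketch of the same expression in Python, not Sloth’s implementation; should_page and the sample error ratios are invented for illustration.

```python
BUDGET = 0.005  # 1 - 0.995, the availability error budget


def should_page(ratios: dict, budget: float = BUDGET) -> bool:
    """Mirror the page alert: each pair fires only when both windows agree."""
    fast = ratios["5m"] > 14.4 * budget and ratios["1h"] > 14.4 * budget
    slow = ratios["30m"] > 6 * budget and ratios["6h"] > 6 * budget
    return fast or slow


scenarios = {
    # A one-minute spike: the 5m window jumps, the longer windows barely move.
    "blip": {"5m": 0.20, "30m": 0.02, "1h": 0.01, "6h": 0.002},
    # A hard outage: every window is far above threshold.
    "outage": {"5m": 0.50, "30m": 0.50, "1h": 0.30, "6h": 0.05},
    # A steady 4% error ratio: under the fast threshold, over the slow one.
    "leak": {"5m": 0.04, "30m": 0.04, "1h": 0.04, "6h": 0.035},
    # Incident over: long windows still remember it, short windows are clean.
    "recovered": {"5m": 0.0, "30m": 0.0, "1h": 0.30, "6h": 0.05},
}

for name, ratios in scenarios.items():
    print(f"{name:10s} page={should_page(ratios)}")
```

The blip and the recovered case stay quiet because a short window and a long window disagree, which is the whole point of pairing them.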

The Error Budget Dashboard

Grafana 11 shows the budget burning down. The key panels: remaining budget as a percentage, current burn rate, and a 30-day projection. Sloth emits slo:error_budget:ratio, so the panels are thin.

{
  "title": "AI Pipeline — SLO Error Budget",
  "schemaVersion": 39,
  "refresh": "1m",
  "time": { "from": "now-30d", "to": "now" },
  "panels": [
    {
      "title": "Availability budget remaining",
      "type": "gauge",
      "gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit", "min": 0, "max": 1,
          "thresholds": { "mode": "absolute", "steps": [
            { "color": "red", "value": null },
            { "color": "orange", "value": 0.25 },
            { "color": "green", "value": 0.5 }
          ]}
        }
      },
      "targets": [
        { "expr": "1 - (sum(slo:sli_error:ratio_rate30d{sloth_id=\"ai-pipeline-availability\"}) / 0.005)" }
      ]
    },
    {
      "title": "Burn rate (1h)",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 16, "x": 8, "y": 0 },
      "targets": [
        { "expr": "slo:sli_error:ratio_rate1h{sloth_id=\"ai-pipeline-availability\"} / 0.005",
          "legendFormat": "availability" },
        { "expr": "slo:sli_error:ratio_rate1h{sloth_id=\"ai-pipeline-retrieval-quality\"} / 0.02",
          "legendFormat": "quality" }
      ]
    }
  ]
}

The gauge answering “how much budget is left” is the panel product and engineering should look at together in a weekly review. When it’s green, ship. When it drops into orange, the next risky deploy waits.
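For a sanity check outside Grafana, the gauge expression and its colour bands reduce to a few lines. The sketch below mirrors the panel JSON above; the function names are hypothetical.

```python
def budget_remaining(error_ratio_30d: float, budget: float = 0.005) -> float:
    """Same arithmetic as the gauge target: 1 - (30d error ratio / budget)."""
    return 1.0 - error_ratio_30d / budget


def gauge_color(remaining: float) -> str:
    """Colour bands matching the panel thresholds (0.25 and 0.5)."""
    if remaining >= 0.5:
        return "green"
    if remaining >= 0.25:
        return "orange"
    return "red"


for ratio in (0.001, 0.003, 0.006):
    left = budget_remaining(ratio)
    print(f"30d error ratio {ratio:.3f}: {left:+.0%} remaining, {gauge_color(left)}")
```

A negative result means the budget is overspent. The gauge clamps its display at 0 because of "min": 0, so it’s worth keeping the raw number visible somewhere as well.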

Common Pitfalls

  • SLO = uptime only. An AI pipeline can be 100% “up” and serving degraded answers. Include a quality SLI or you’re blind to the failure mode that actually matters.
  • Targeting 100%. A 100% SLO leaves zero error budget, so every blip is an incident and the budget concept dies. Pick a number with room — 99.5%, not 99.99%.
  • Single-window burn alerts. Either noisy or slow. Use the multi-window pairs Sloth generates.
  • Hand-writing burn-rate PromQL. Eight windows per SLO, easy to get the multiplier wrong. Let Sloth own the math.
  • Wrong error_query polarity. With histograms, le buckets count good events. Sloth wants bad events — express it as total - good.
  • Quality proxy that never moves. If your quality SLI is always 100%, it isn’t measuring anything. Tune the threshold against known-bad examples.
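The last pitfall can be checked offline before the SLO goes live. Here is a minimal sketch of a threshold sweep, assuming you have a handful of requests labelled good or bad by hand; the sample data and the flag_rates helper are invented for illustration.

```python
# Hypothetical labelled samples: (top retrieval score, was the answer actually bad?)
samples = [
    (0.91, False), (0.84, False), (0.81, False), (0.78, False), (0.66, False),
    (0.74, True), (0.69, True), (0.55, True),
]


def flag_rates(labelled: list, min_score: float) -> tuple:
    """Return (share of bad answers flagged, share of good answers flagged)."""
    bad = [score for score, is_bad in labelled if is_bad]
    good = [score for score, is_bad in labelled if not is_bad]
    caught = sum(score < min_score for score in bad) / len(bad)
    false_alarms = sum(score < min_score for score in good) / len(good)
    return caught, false_alarms


for threshold in (0.50, 0.72, 0.76):
    caught, false_alarms = flag_rates(samples, threshold)
    print(f"min_score={threshold:.2f} caught={caught:.0%} false_alarms={false_alarms:.0%}")
```

A threshold that flags none of the known-bad examples is the "never moves" case. One that flags most good answers will burn the quality budget on noise.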

Troubleshooting

Symptom: Sloth-generated availability SLO sits at 100% even during a known incident. Cause: error_query and total_query polarity swapped — counting good events as errors or vice versa. Fix: Confirm error_query returns bad events. For latency histograms use total - good.

Symptom: Burn-rate alert fires and clears repeatedly within an hour. Cause: Only the short window is being evaluated, or the long-window series has gaps. Fix: Check both ratio_rate5m and ratio_rate1h exist in Prometheus. Gaps usually mean the SLI recording rule isn’t loaded — verify rule_files includes the Sloth output.

Symptom: Error budget gauge shows over 100% remaining. Cause: The 30-day SLI series hasn’t accumulated a full window yet, so the ratio is artificially low. Fix: Expected for the first 30 days after rollout. Annotate the dashboard with the SLO start date and treat early numbers as provisional.

Symptom: Latency SLO disagrees with the latency panel on the RED dashboard. Cause: Different bucket boundary — the SLO uses le="3" but the panel computes p95 by interpolation. Fix: Align both on the same threshold. If the SLO is “95% under 3s,” dashboard and SLO should both reference the le="3" bucket.

Wrapping Up

SLOs and error budgets give a distributed AI pipeline something uptime monitoring never could: a shared, quantified definition of “good enough” that spans latency, errors, and answer quality. Let Sloth generate the burn-rate rules, watch the budget burn down in Grafana, and use the remaining budget to decide when to ship and when to stabilize. Start with one SLO per pipeline stage, keep the targets honest, and let the budget — not a 3am hunch — drive the release decision.
