March 20, 2026 · 8 min read · by Muhammad Amal · programming

TL;DR

  • Alert on symptoms users feel, not on every twitchy gauge.
  • Recording rules keep dashboards fast and alert expressions honest.
  • Route through Alertmanager so the right person gets paged once, not the whole team five times.

Every team I’ve joined has had the same dashboard graveyard: forty panels, twelve alerts, and an on-call rotation that has muted half of them because they cried wolf at 3am for a month. The dashboards weren’t wrong. They were built to display everything instead of answering one question — is the service healthy right now, and if not, who needs to act?

Prometheus and Grafana alerting dashboards earn their keep when they’re built around that question. That means a small set of alerts tied to user-visible symptoms, recording rules that pre-aggregate the expensive math, and Alertmanager routing that turns a firing rule into exactly one notification to exactly one team. Get those three layers right and the dashboard becomes the thing on-call actually opens during an incident instead of a wall of green they ignore.

This is a build-it guide for Prometheus 3.x, Grafana 11, and Alertmanager. We’ll instrument a service, write recording and alerting rules, route them, and build a dashboard that shows the four signals that matter. If you’re alerting on a vector store specifically, the symptom approach here pairs well with vector database performance monitoring.

The Layers That Make Alerting Trustworthy

A real-time alerting stack has four moving parts and people usually skip the middle two:

  1. Instrumentation — the app exposes counters and histograms.
  2. Recording rules — Prometheus precomputes rates and quantiles on a schedule so both dashboards and alerts read cheap, consistent values.
  3. Alerting rules — expressions that fire when a recorded value crosses a threshold for a sustained window.
  4. Alertmanager — deduplicates, groups, silences, and routes the firing alerts to humans.

Skipping recording rules is the most common mistake. When a dashboard panel and an alert both compute histogram_quantile(0.99, rate(...[5m])) independently, they can disagree because they’re evaluated at different instants, and often over different windows or label sets. On-call ends up debugging the monitoring instead of the outage. Compute it once, name it, reuse it everywhere.

Instrumenting the Service

Start with the application. Here’s a Go HTTP service exposing the RED metrics — Rate, Errors, Duration — which is all you need to alert on user-facing health.

// metrics.go — prometheus/client_golang v1.21.x
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests by route and status.",
		},
		[]string{"route", "method", "status"},
	)

	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency.",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
		},
		[]string{"route", "method"},
	)
)

// instrument wraps a handler and records RED metrics.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next(sw, r)
		dur := time.Since(start).Seconds()
		httpDuration.WithLabelValues(route, r.Method).Observe(dur)
		httpRequests.WithLabelValues(route, r.Method, strconv.Itoa(sw.status)).Inc()
	}
}

type statusWriter struct {
	http.ResponseWriter
	status int
}

func (s *statusWriter) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

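// searchHandler is a placeholder so the example compiles; swap in the real search logic.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	_, _ = w.Write([]byte("ok"))
}

func main() {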
	mux := http.NewServeMux()
	mux.HandleFunc("/api/search", instrument("/api/search", searchHandler))
	mux.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", mux)
}

Histogram bucket choice matters more than people think. If your SLO is 250ms, you want a bucket boundary near 0.25 so histogram_quantile interpolates accurately around the threshold you’ll alert on. Default buckets often have a gap exactly where you need precision.

Recording Rules

Now precompute the values dashboards and alerts share. Prometheus 3.x evaluates each rule group on its interval (falling back to the global evaluation_interval when the group doesn’t set one) and stores each result as a new series.

# rules/recording.yml
groups:
  - name: http_red
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, route) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job, route) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_ratio:rate5m
        expr: |
          job:http_errors:rate5m
            /
          clamp_min(job:http_requests:rate5m, 1e-9)

      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, route, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

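The clamp_min guards against divide-by-zero when a route has no traffic. Without it the ratio is 0/0 = NaN: panels render gaps, and any comparison against NaN is false, so an alert built on the ratio silently never fires for that route. That’s a bug I’ve shipped, and it’s invisible until the one night you need the alert. Note that clamp_min only helps when both series exist. A route that has never returned a 5xx has no error series at all, so the ratio is empty rather than zero until the first error is recorded.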

Alerting Rules

Alerts read the recorded series. Two principles: alert on symptoms (error ratio, latency) not causes (high CPU), and require a sustained for window so a single bad scrape doesn’t page anyone.

# rules/alerts.yml
groups:
  - name: http_slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_ratio:rate5m > 0.02
        for: 5m
        labels:
          severity: page
          team: search
        annotations:
          summary: "Error ratio above 2% on {{ $labels.route }}"
          description: >
            {{ $labels.route }} is returning errors at
            {{ $value | humanizePercentage }} over the last 5m.
          runbook: "https://runbooks.internal/http-high-error-rate"

      - alert: HighLatencyP99
        expr: job:http_request_duration:p99_5m > 0.5
        for: 10m
        labels:
          severity: page
          team: search
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.route }}"
          description: "p99 is {{ $value | humanizeDuration }}."

      - alert: NoTraffic
        expr: job:http_requests:rate5m == 0
        for: 10m
        labels:
          severity: ticket
          team: search
        annotations:
          summary: "No traffic on {{ $labels.route }} for 10m"

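NoTraffic is the alert teams forget. A service returning zero errors because it’s serving zero requests looks perfectly healthy on an error-ratio panel. Always alert on the absence of traffic somewhere. One caveat: == 0 only matches series that still exist. If the target disappears entirely, the series goes stale and the expression returns nothing, so pair this rule with a check such as up == 0 or absent(job:http_requests:rate5m).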

Wire the rule files into Prometheus and point it at Alertmanager:

# prometheus.yml — Prometheus 3.x
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - rules/recording.yml
  - rules/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: search-api
    static_configs:
      - targets: ['search-api:8080']

Routing with Alertmanager

A firing rule is not a notification. Alertmanager turns the stream of firing alerts into grouped, deduplicated messages and decides who gets them. This config routes page severity to PagerDuty and ticket severity to a Slack channel, with grouping so a cascade becomes one message.

# alertmanager.yml
route:
  receiver: default-slack
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: pagerduty-search
      group_wait: 10s
      continue: false
    - matchers:
        - severity = ticket
      receiver: default-slack

receivers:
  - name: default-slack
    slack_configs:
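      # Alertmanager does not expand environment variables or templates in this
      # field; read the secret from a file instead (the path is an example).
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url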
        channel: '#alerts-search'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: >-
          {{ range .Alerts }}{{ .Annotations.description }}
          {{ if .Annotations.runbook }}<{{ .Annotations.runbook }}|runbook>{{ end }}
          {{ end }}

  - name: pagerduty-search
    pagerduty_configs:
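      # Read the key from a file; environment variables are not expanded here
      # (the path is an example).
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_routing_key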
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'

inhibit_rules:
  - source_matchers: [severity = page]
    target_matchers: [severity = ticket]
    equal: ['team', 'route']

The inhibit_rules block is what stops alert storms. When HighErrorRate pages, the lower-priority ticket alerts for the same route are suppressed — on-call gets one signal, not a feed. See the Alertmanager configuration docs for the full matcher syntax.
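
To see how grouping and inhibition combine, here is a toy model of the two steps. It is a sketch of the idea, not Alertmanager's implementation, which applies inhibition at notification time and tracks much more state. The alert type, the helper functions, the CertExpiringSoon alert, and the extra routes are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// alert is a label set: label name -> value.
type alert map[string]string

// inhibited reports whether a ticket-severity target is muted by a firing
// page-severity alert that agrees with it on every label listed in equal.
func inhibited(target alert, firing []alert, equal []string) bool {
	if target["severity"] != "ticket" {
		return false
	}
	for _, src := range firing {
		if src["severity"] != "page" {
			continue
		}
		match := true
		for _, l := range equal {
			if src[l] != target[l] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

// groupKey builds the notification group identity from the group_by labels.
func groupKey(a alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, l := range groupBy {
		parts = append(parts, l+"="+a[l])
	}
	return strings.Join(parts, ",")
}

// notifications drops inhibited alerts, then buckets the rest by group key.
// Each resulting bucket stands for one outgoing message.
func notifications(firing []alert, groupBy, equal []string) map[string][]alert {
	groups := map[string][]alert{}
	for _, a := range firing {
		if inhibited(a, firing, equal) {
			continue
		}
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	firing := []alert{
		{"alertname": "HighErrorRate", "severity": "page", "team": "search", "route": "/api/search"},
		{"alertname": "HighErrorRate", "severity": "page", "team": "search", "route": "/api/suggest"},
		{"alertname": "CertExpiringSoon", "severity": "ticket", "team": "search", "route": "/api/search"},
		{"alertname": "CertExpiringSoon", "severity": "ticket", "team": "search", "route": "/api/export"},
	}
	groups := notifications(firing, []string{"alertname", "team"}, []string{"team", "route"})

	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s -> %d alert(s)\n", k, len(groups[k]))
	}
}
```

Four firing alerts collapse into two messages: one page covering both routes, and one ticket for the route that has no page to hide behind.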

The Dashboard

With recording rules in place, the Grafana 11 dashboard is thin — it just renders the named series. Here’s the JSON model for a four-panel RED dashboard. Import it via Dashboards → New → Import.

{
  "title": "Search API — RED",
  "schemaVersion": 39,
  "refresh": "15s",
  "time": { "from": "now-3h", "to": "now" },
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        { "expr": "job:http_requests:rate5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "Error ratio",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 0.02 }
            ]
          }
        }
      },
      "targets": [
        { "expr": "job:http_error_ratio:rate5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "p99 latency",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "fieldConfig": { "defaults": { "unit": "s" } },
      "targets": [
        { "expr": "job:http_request_duration:p99_5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "Firing alerts",
      "type": "alertlist",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "options": { "alertName": "", "stateFilter": { "firing": true } }
    }
  ]
}

Each panel reads a recorded series, so the dashboard loads instantly even over a long time range and shows exactly the same number the alert evaluated. Set the error-ratio panel threshold to the same 0.02 the alert uses — when the line goes red, on-call already knows a page is imminent.

Common Pitfalls

  • Alerting on causes. High CPU isn’t an incident; slow responses are. Page on symptoms, keep cause metrics on the dashboard for diagnosis only.
  • No for window. An alert without a sustained window fires on a single scrape blip. Five to ten minutes filters noise without hiding real outages.
  • Duplicated PromQL. Dashboard and alert computing the same quantile separately will drift. Always go through a recording rule.
  • No send_resolved. Without it, on-call never learns the alert cleared and keeps investigating a dead incident.
  • Forgetting NoTraffic. A silent service looks healthy on every error-rate panel. Alert on zero traffic explicitly.
  • One giant route in Alertmanager. Without group_by and inhibit_rules, one outage becomes a notification flood.

Troubleshooting

Symptom: Alert shows as firing in Prometheus but no notification arrives. Cause: Prometheus can’t reach Alertmanager, or Alertmanager has no matching route. Fix: Confirm which Alertmanager endpoints Prometheus has discovered (Status → Alertmanager discovery in the 3.x UI, or GET /api/v1/alertmanagers), then run amtool config routes test --config.file=alertmanager.yml severity=page team=search to confirm the alert matches a receiver.

Symptom: Error-ratio panel shows NaN or no data. Cause: Division by zero on a route with no traffic. Fix: Wrap the denominator in clamp_min(..., 1e-9) in the recording rule.

Symptom: The same incident pages five times in ten minutes. Cause: group_interval too short or group_by missing the right labels. Fix: Group by alertname and team, raise group_interval to 5m, and add inhibit_rules so high-severity alerts suppress related low-severity ones.

Symptom: Dashboard p99 and alert disagree. Cause: Panel and alert compute the quantile independently with different windows. Fix: Point both at the job:http_request_duration:p99_5m recording rule.

Wrapping Up

Trustworthy Prometheus and Grafana alerting dashboards come from discipline, not panel count: instrument RED metrics, precompute with recording rules, alert on sustained symptoms, and route through Alertmanager so each incident pages one team once. Build that and on-call starts trusting the alerts again, which is the only metric that really matters. From here, codify these dashboards so they’re reproducible and version-controlled.
