Building Real-Time Alerting Dashboards with Prometheus and Grafana


March 20, 2026 · 8 min read · by Muhammad Amal programming

TL;DR — Alert on symptoms users feel, not on every twitchy gauge / Recording rules keep dashboards fast and alert expressions honest / Route through Alertmanager so the right person gets paged once, not the whole team five times.

Every team I’ve joined has had the same dashboard graveyard: forty panels, twelve alerts, and an on-call rotation that has muted half of them because they cried wolf at 3am for a month. The dashboards weren’t wrong. They were built to display everything instead of answering one question — is the service healthy right now, and if not, who needs to act?

Alerting dashboards built on Prometheus and Grafana earn their keep when they’re built around that question. That means a small set of alerts tied to user-visible symptoms, recording rules that pre-aggregate the expensive math, and Alertmanager routing that turns a firing rule into exactly one notification to exactly one team. Get those three layers right and the dashboard becomes the thing on-call actually opens during an incident instead of a wall of green they ignore.

This is a build-it guide for Prometheus 3.x, Grafana 11, and Alertmanager. We’ll instrument a service, write recording and alerting rules, route them, and build a dashboard that shows the four signals that matter. If you’re alerting on a vector store specifically, the symptom approach here pairs well with vector database performance monitoring.

The Layers That Make Alerting Trustworthy

A real-time alerting stack has four moving parts, and people usually skip the middle two:

Instrumentation — the app exposes counters and histograms.
Recording rules — Prometheus precomputes rates and quantiles on a schedule so both dashboards and alerts read cheap, consistent values.
Alerting rules — expressions that fire when a recorded value crosses a threshold for a sustained window.
Alertmanager — deduplicates, groups, silences, and routes the firing alerts to humans.

Skipping recording rules is the most common mistake. When a dashboard panel and an alert both compute histogram_quantile(0.99, rate(...[5m])) independently, they are evaluated at different instants and often over different windows, so they can disagree, and on-call ends up debugging the monitoring instead of the outage. Compute it once, name it, reuse it everywhere.

Instrumenting the Service

Start with the application. Here’s a Go HTTP service exposing the RED metrics — Rate, Errors, Duration — which is all you need to alert on user-facing health.

// metrics.go — prometheus/client_golang v1.21.x
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests by route and status.",
		},
		[]string{"route", "method", "status"},
	)

	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency.",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
		},
		[]string{"route", "method"},
	)
)

// instrument wraps a handler and records RED metrics.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next(sw, r)
		dur := time.Since(start).Seconds()
		httpDuration.WithLabelValues(route, r.Method).Observe(dur)
		httpRequests.WithLabelValues(route, r.Method, strconv.Itoa(sw.status)).Inc()
	}
}

type statusWriter struct {
	http.ResponseWriter
	status int
}

func (s *statusWriter) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

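// searchHandler is a placeholder so the example compiles; swap in your real handler.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	_, _ = w.Write([]byte("ok"))
}

func main() {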
	mux := http.NewServeMux()
	mux.HandleFunc("/api/search", instrument("/api/search", searchHandler))
	mux.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", mux)
}
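
One detail of the wrapper is easy to miss: statusWriter starts at 200 because a Go handler that only writes a body never calls WriteHeader. The sketch below checks that behavior with the standard library's httptest package and no Prometheus dependency. The observedStatus helper and the two sample handlers are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusWriter mirrors the wrapper above: it remembers the status code.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (s *statusWriter) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// implicitOK writes a body without ever calling WriteHeader.
func implicitOK(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "results")
}

// explicitErr reports a 503 through http.Error, which calls WriteHeader.
func explicitErr(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "backend down", http.StatusServiceUnavailable)
}

// observedStatus (hypothetical helper) runs a handler against an in-memory
// recorder and returns the status the wrapper captured.
func observedStatus(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/search", nil)
	sw := &statusWriter{ResponseWriter: rec, status: http.StatusOK}
	h(sw, req)
	return sw.status
}

func main() {
	fmt.Println(observedStatus(implicitOK))  // 200
	fmt.Println(observedStatus(explicitErr)) // 503
}
```

Without the StatusOK default, every successful request that skips WriteHeader would be recorded with status "0" and would never match a 2xx selector.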

Histogram bucket choice matters more than people think. If your SLO is 250ms, you want a bucket boundary near 0.25 so histogram_quantile interpolates accurately around the threshold you’ll alert on. Default buckets often have a gap exactly where you need precision.
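
To make the interpolation point concrete, here is a simplified sketch of the linear interpolation histogram_quantile applies to classic cumulative buckets. It is not Prometheus's implementation: it assumes sorted buckets ending in +Inf, at least one observation, and q greater than zero. The traffic numbers are invented: 900 requests at or under 100ms and 100 requests between 100ms and 250ms.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative bucket: the count of observations <= upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// Same invented traffic, two bucket layouts. Only "fine" has a 0.25 boundary.
var coarse = []bucket{{0.1, 900}, {0.5, 1000}, {1, 1000}, {math.Inf(1), 1000}}
var fine = []bucket{{0.1, 900}, {0.25, 1000}, {0.5, 1000}, {1, 1000}, {math.Inf(1), 1000}}

// quantile estimates the q-quantile by interpolating linearly inside the
// first bucket whose cumulative count reaches the target rank.
func quantile(q float64, buckets []bucket) float64 {
	last := len(buckets) - 1
	rank := q * buckets[last].count

	i := 0
	for i < last && buckets[i].count < rank {
		i++
	}
	if i == last {
		// The rank lands in the +Inf bucket: report the highest finite bound.
		return buckets[last-1].upperBound
	}

	lower, below := 0.0, 0.0
	if i > 0 {
		lower = buckets[i-1].upperBound
		below = buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

func main() {
	fmt.Printf("coarse p99: %.3f\n", quantile(0.99, coarse))
	fmt.Printf("fine p99:   %.3f\n", quantile(0.99, fine))
}
```

With these numbers the coarse layout reports a p99 of roughly 0.46s even though no request took longer than 250ms, while the layout with a 0.25 boundary reports roughly 0.235s. The estimate can only be as precise as the bucket the rank falls into.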

Recording Rules

Now precompute the values dashboards and alerts share. Prometheus 3.x evaluates each rule group on its own interval, falling back to the global evaluation_interval when the group doesn’t set one, and stores the result as a new series.

# rules/recording.yml
groups:
  - name: http_red
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, route) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job, route) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_ratio:rate5m
        expr: |
          job:http_errors:rate5m
            /
          clamp_min(job:http_requests:rate5m, 1e-9)

      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, route, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

The clamp_min guards against divide-by-zero when a route has no traffic — without it the ratio becomes NaN and your alert silently never fires. That’s a bug I’ve shipped, and it’s invisible until the one night you need the alert.
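
The same arithmetic can be reproduced outside Prometheus, since PromQL values are IEEE 754 floats just like Go's float64. This sketch is only an analogy for the rule above: naiveRatio, clampedRatio, and breaches are hypothetical helper names, and 0.02 is the threshold used later in the alert.

```go
package main

import (
	"fmt"
	"math"
)

// naiveRatio divides without a guard, like errors / requests in PromQL.
func naiveRatio(errs, reqs float64) float64 { return errs / reqs }

// clampedRatio mirrors errors / clamp_min(requests, 1e-9).
func clampedRatio(errs, reqs float64) float64 { return errs / math.Max(reqs, 1e-9) }

// breaches mirrors the alert expression `ratio > 0.02`.
func breaches(ratio float64) bool { return ratio > 0.02 }

func main() {
	idle := naiveRatio(0, 0) // a route with no traffic at all
	fmt.Println(math.IsNaN(idle), breaches(idle))

	fmt.Println(clampedRatio(0, 0), breaches(clampedRatio(0, 0)))
	fmt.Println(clampedRatio(3, 100), breaches(clampedRatio(3, 100)))
}
```

Any comparison against NaN is false, so the unguarded ratio can never satisfy the threshold and renders as a gap on the panel. The clamped version yields a plain 0 for idle routes and leaves real ratios unchanged.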

Alerting Rules

Alerts read the recorded series. Two principles: alert on symptoms (error ratio, latency) not causes (high CPU), and require a sustained for window so a single bad scrape doesn’t page anyone.

# rules/alerts.yml
groups:
  - name: http_slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_ratio:rate5m > 0.02
        for: 5m
        labels:
          severity: page
          team: search
        annotations:
          summary: "Error ratio above 2% on {{ $labels.route }}"
          description: >
            {{ $labels.route }} is returning errors at
            {{ $value | humanizePercentage }} over the last 5m.
          runbook: "https://runbooks.internal/http-high-error-rate"

      - alert: HighLatencyP99
        expr: job:http_request_duration:p99_5m > 0.5
        for: 10m
        labels:
          severity: page
          team: search
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.route }}"
          description: "p99 is {{ $value | humanizeDuration }}."

      - alert: NoTraffic
        expr: job:http_requests:rate5m == 0
        for: 10m
        labels:
          severity: ticket
          team: search
        annotations:
          summary: "No traffic on {{ $labels.route }} for 10m"

NoTraffic is the alert teams forget. A service returning zero errors because it’s serving zero requests looks perfectly healthy on an error-ratio panel. Always alert on the absence of traffic somewhere.
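
The effect of the for window is easier to see as a small state machine. The sketch below is a simplified model, not Prometheus's rule manager: it ignores keep_firing_for and restarts, and the tracker type, replay helper, and sample sequences are hypothetical. It assumes one evaluation per minute and a five-minute window.

```go
package main

import (
	"fmt"
	"time"
)

type state string

const (
	inactive state = "inactive"
	pending  state = "pending"
	firing   state = "firing"
)

const (
	evalInterval = time.Minute     // assumed evaluation cadence
	forWindow    = 5 * time.Minute // plays the role of `for: 5m`
)

// Invented evaluation results: true means the expression returned a sample.
var blip = []bool{false, true, false, false}
var outage = []bool{true, true, true, true, true, true, true}

// tracker remembers when the condition first became true.
type tracker struct {
	active      bool
	activeSince time.Time
}

// eval processes one rule evaluation and returns the resulting alert state.
func (t *tracker) eval(now time.Time, conditionTrue bool, hold time.Duration) state {
	if !conditionTrue {
		t.active = false // any false evaluation resets the window
		return inactive
	}
	if !t.active {
		t.active = true
		t.activeSince = now
	}
	if now.Sub(t.activeSince) >= hold {
		return firing
	}
	return pending
}

// replay feeds evaluation results at a fixed interval and collects the states.
func replay(samples []bool, interval, hold time.Duration) []state {
	var t tracker
	start := time.Unix(0, 0)
	out := make([]state, 0, len(samples))
	for i, s := range samples {
		out = append(out, t.eval(start.Add(time.Duration(i)*interval), s, hold))
	}
	return out
}

func main() {
	fmt.Println(replay(blip, evalInterval, forWindow))
	fmt.Println(replay(outage, evalInterval, forWindow))
}
```

The single blip only ever reaches pending and then resets, while the sustained outage stays pending for five evaluations and fires on the sixth. That is the behavior that keeps one bad scrape from paging anyone.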

Wire the rule files into Prometheus and point it at Alertmanager:

# prometheus.yml — Prometheus 3.x
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - rules/recording.yml
  - rules/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: search-api
    static_configs:
      - targets: ['search-api:8080']

Routing with Alertmanager

A firing rule is not a notification. Alertmanager turns the stream of firing alerts into grouped, deduplicated messages and decides who gets them. This config routes page severity to PagerDuty and ticket severity to a Slack channel, with grouping so a cascade becomes one message.

# alertmanager.yml
route:
  receiver: default-slack
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: pagerduty-search
      group_wait: 10s
      continue: false
    - matchers:
        - severity = ticket
      receiver: default-slack

receivers:
  - name: default-slack
    slack_configs:
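      # Alertmanager does not expand environment variables in its config;
      # read the webhook from a mounted secret file instead (example path).
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url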
        channel: '#alerts-search'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: >-
          {{ range .Alerts }}{{ .Annotations.description }}
          {{ if .Annotations.runbook }}<{{ .Annotations.runbook }}|runbook>{{ end }}
          {{ end }}

  - name: pagerduty-search
    pagerduty_configs:
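      # Read the key from a mounted secret file (example path); there is no
      # env template function in Alertmanager.
      - routing_key_file: /etc/alertmanager/secrets/pd_routing_key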
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'

inhibit_rules:
  - source_matchers: [severity = page]
    target_matchers: [severity = ticket]
    equal: ['team', 'route']

The inhibit_rules block is what stops alert storms. When HighErrorRate pages, the lower-priority ticket alerts for the same route are suppressed — on-call gets one signal, not a feed. See the Alertmanager configuration docs for the full matcher syntax.
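
As a rough mental model of that rule, the sketch below reimplements the matching logic in a few lines of Go. It is a simplification of what Alertmanager does, hard-coded to the severity = page and severity = ticket matchers from the config above. The alert type, the sample alerts, and the ElevatedLatencyP95 name are hypothetical.

```go
package main

import "fmt"

type alert struct {
	name   string
	labels map[string]string
}

// Labels that must match between source and target, as in `equal` above.
var equalLabels = []string{"team", "route"}

var pageAlert = alert{"HighLatencyP99", map[string]string{
	"severity": "page", "team": "search", "route": "/api/search"}}

// Hypothetical ticket-level alerts used only for this illustration.
var ticketSameRoute = alert{"ElevatedLatencyP95", map[string]string{
	"severity": "ticket", "team": "search", "route": "/api/search"}}
var ticketOtherRoute = alert{"ElevatedLatencyP95", map[string]string{
	"severity": "ticket", "team": "search", "route": "/api/suggest"}}

// inhibited reports whether target is muted by any firing source alert:
// the source must be severity=page, the target severity=ticket, and every
// label listed in equal must carry the same value on both.
func inhibited(target alert, firing []alert, equal []string) bool {
	if target.labels["severity"] != "ticket" {
		return false
	}
	for _, src := range firing {
		if src.labels["severity"] != "page" {
			continue
		}
		match := true
		for _, l := range equal {
			if src.labels[l] != target.labels[l] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	firing := []alert{pageAlert}
	fmt.Println(inhibited(ticketSameRoute, firing, equalLabels))  // true
	fmt.Println(inhibited(ticketOtherRoute, firing, equalLabels)) // false
}
```

The equal list is what keeps inhibition narrow: a page on one route mutes tickets for that route only, and tickets for other routes still get through.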

The Dashboard

With recording rules in place, the Grafana 11 dashboard is thin — it just renders the named series. Here’s the JSON model for a four-panel RED dashboard. Import it via Dashboards → New → Import.

{
  "title": "Search API — RED",
  "schemaVersion": 39,
  "refresh": "15s",
  "time": { "from": "now-3h", "to": "now" },
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        { "expr": "job:http_requests:rate5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "Error ratio",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 0.02 }
            ]
          }
        }
      },
      "targets": [
        { "expr": "job:http_error_ratio:rate5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "p99 latency",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "fieldConfig": { "defaults": { "unit": "s" } },
      "targets": [
        { "expr": "job:http_request_duration:p99_5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "Firing alerts",
      "type": "alertlist",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "options": { "alertName": "", "stateFilter": { "firing": true } }
    }
  ]
}

Each panel reads a recorded series, so the dashboard stays fast even over a long time range and shows the same number the alert evaluated. Set the error-ratio panel threshold to the same 0.02 the alert uses: when the line goes red, on-call already knows a page is imminent.

Common Pitfalls

Alerting on causes. High CPU isn’t an incident; slow responses are. Page on symptoms, keep cause metrics on the dashboard for diagnosis only.
No for window. An alert without a sustained window fires on a single scrape blip. Five to ten minutes filters noise without hiding real outages.
Duplicated PromQL. Dashboard and alert computing the same quantile separately will drift. Always go through a recording rule.
No send_resolved. Without it, on-call never learns the alert cleared and keeps investigating a dead incident.
Forgetting NoTraffic. A silent service looks healthy on every error-rate panel. Alert on zero traffic explicitly.
One giant route in Alertmanager. Without group_by and inhibit_rules, one outage becomes a notification flood.
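
To illustrate the last pitfall, here is a small sketch of how the choice of group_by labels changes the number of notifications a cascade produces. It models grouping only, leaving out timing such as group_wait and repeat_interval. The routes and the cascade of alerts are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// alert is just a label set for the purposes of grouping.
type alert map[string]string

// An invented cascade: one outage trips alerts on several routes.
var cascade = []alert{
	{"alertname": "HighErrorRate", "team": "search", "route": "/api/search"},
	{"alertname": "HighErrorRate", "team": "search", "route": "/api/suggest"},
	{"alertname": "HighErrorRate", "team": "search", "route": "/api/index"},
	{"alertname": "HighLatencyP99", "team": "search", "route": "/api/search"},
}

// groupKey builds a grouping key from the labels listed in groupBy.
func groupKey(a alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, l := range groupBy {
		parts = append(parts, l+"="+a[l])
	}
	return strings.Join(parts, ",")
}

// notifications returns one entry per group, mapping the group key to the
// number of alerts batched into that single notification.
func notifications(alerts []alert, groupBy []string) map[string]int {
	out := map[string]int{}
	for _, a := range alerts {
		out[groupKey(a, groupBy)]++
	}
	return out
}

func main() {
	fmt.Println(len(notifications(cascade, []string{"alertname", "team"})))          // 2
	fmt.Println(len(notifications(cascade, []string{"alertname", "team", "route"}))) // 4
}
```

Grouping by alertname and team folds the three HighErrorRate alerts into one message. Adding route to group_by splits them back into one notification per route.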

Troubleshooting

Symptom: Alert shows as firing in Prometheus but no notification arrives. Cause: Prometheus can’t reach Alertmanager, or Alertmanager has no matching route. Fix: Check Status → Runtime & Build in Prometheus for the Alertmanager endpoint, then run amtool config routes test --config.file=alertmanager.yml severity=page to confirm the alert matches a receiver.

Symptom: Error-ratio panel shows NaN or no data. Cause: Division by zero on a route with no traffic. Fix: Wrap the denominator in clamp_min(..., 1e-9) in the recording rule.

Symptom: The same incident pages five times in ten minutes. Cause: group_interval too short or group_by missing the right labels. Fix: Group by alertname and team, raise group_interval to 5m, and add inhibit_rules so high-severity alerts suppress related low-severity ones.

Symptom: Dashboard p99 and alert disagree. Cause: Panel and alert compute the quantile independently with different windows. Fix: Point both at the job:http_request_duration:p99_5m recording rule.
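
For the first symptom above, the Alertmanager endpoints Prometheus has discovered are also available from its HTTP API at /api/v1/alertmanagers, which is handy in a script. The sketch below parses that response shape offline. The sample payload and hostnames are invented, and in practice the body would come from an HTTP GET against your Prometheus server.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoint is one Alertmanager URL as reported by Prometheus.
type endpoint struct {
	URL string `json:"url"`
}

// alertmanagersResponse models the parts of /api/v1/alertmanagers used here.
type alertmanagersResponse struct {
	Status string `json:"status"`
	Data   struct {
		Active  []endpoint `json:"activeAlertmanagers"`
		Dropped []endpoint `json:"droppedAlertmanagers"`
	} `json:"data"`
}

// sampleBody is an invented payload in the documented response shape.
var sampleBody = []byte(`{
  "status": "success",
  "data": {
    "activeAlertmanagers": [{"url": "http://alertmanager:9093/api/v2/alerts"}],
    "droppedAlertmanagers": []
  }
}`)

// activeAlertmanagers returns the URLs Prometheus is actively sending to.
func activeAlertmanagers(body []byte) ([]string, error) {
	var resp alertmanagersResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode alertmanagers response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected status %q", resp.Status)
	}
	urls := make([]string, 0, len(resp.Data.Active))
	for _, e := range resp.Data.Active {
		urls = append(urls, e.URL)
	}
	return urls, nil
}

func main() {
	urls, err := activeAlertmanagers(sampleBody)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(urls) == 0 {
		fmt.Println("Prometheus has no active Alertmanager; check the alerting block")
		return
	}
	fmt.Println("active:", urls)
}
```

An empty active list points at the alerting block in prometheus.yml or at service discovery. A populated list means the problem is more likely on the routing side.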

Wrapping Up

Trustworthy alerting dashboards on Prometheus and Grafana come from discipline, not panel count: instrument RED metrics, precompute with recording rules, alert on sustained symptoms, and route through Alertmanager so each incident pages one team once. Build that and on-call starts trusting the alerts again, which is the only metric that really matters. From here, codify these dashboards so they’re reproducible and version-controlled.
