Building Real-Time Alerting Dashboards with Prometheus and Grafana
TL;DR — Alert on symptoms users feel, not on every twitchy gauge / Recording rules keep dashboards fast and alert expressions honest / Route through Alertmanager so the right person gets paged once, not the whole team five times.
Every team I’ve joined has had the same dashboard graveyard: forty panels, twelve alerts, and an on-call rotation that has muted half of them because they cried wolf at 3am for a month. The dashboards weren’t wrong. They were built to display everything instead of answering one question — is the service healthy right now, and if not, who needs to act?
Prometheus and Grafana alerting dashboards earn their keep when they’re built around that question. That means a small set of alerts tied to user-visible symptoms, recording rules that pre-aggregate the expensive math, and Alertmanager routing that turns a firing rule into exactly one notification to exactly one team. Get those three layers right and the dashboard becomes the thing on-call actually opens during an incident instead of a wall of green they ignore.
This is a build-it guide for Prometheus 3.x, Grafana 11, and Alertmanager. We’ll instrument a service, write recording and alerting rules, route them, and build a dashboard that shows the four signals that matter. If you’re alerting on a vector store specifically, the symptom approach here pairs well with vector database performance monitoring.
The Layers That Make Alerting Trustworthy
A real-time alerting stack has four moving parts and people usually skip the middle two:
- Instrumentation — the app exposes counters and histograms.
- Recording rules — Prometheus precomputes rates and quantiles on a schedule so both dashboards and alerts read cheap, consistent values.
- Alerting rules — expressions that fire when a recorded value crosses a threshold for a sustained window.
- Alertmanager — deduplicates, groups, silences, and routes the firing alerts to humans.
Skipping recording rules is the most common mistake. When a dashboard panel and an alert both compute histogram_quantile(0.99, rate(...[5m])) independently, they are evaluated at different instants, often with different range windows or step sizes, so they can disagree, and on-call ends up debugging the monitoring instead of the outage. Compute it once, name it, reuse it everywhere.
Instrumenting the Service
Start with the application. Here’s a Go HTTP service exposing the RED metrics — Rate, Errors, Duration — which is all you need to alert on user-facing health.
// metrics.go (prometheus/client_golang v1.21.x)
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// httpRequests counts every request by route, method, and status code.
	httpRequests = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests by route and status.",
		},
		[]string{"route", "method", "status"},
	)

	// httpDuration records latency; note the 0.25 boundary for a 250ms SLO.
	httpDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency.",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
		},
		[]string{"route", "method"},
	)
)

// instrument wraps a handler and records RED metrics.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200 because handlers may write a body without calling WriteHeader.
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next(sw, r)
		dur := time.Since(start).Seconds()
		httpDuration.WithLabelValues(route, r.Method).Observe(dur)
		httpRequests.WithLabelValues(route, r.Method, strconv.Itoa(sw.status)).Inc()
	}
}

// statusWriter captures the status code the wrapped handler sends.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (s *statusWriter) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// searchHandler is a placeholder so the example compiles; substitute the real handler.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	_, _ = w.Write([]byte("ok"))
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/search", instrument("/api/search", searchHandler))
	mux.Handle("/metrics", promhttp.Handler())
	// Surface bind failures instead of discarding the error.
	log.Fatal(http.ListenAndServe(":8080", mux))
}
Histogram bucket choice matters more than people think. If your SLO is 250ms, you want a bucket boundary near 0.25 so histogram_quantile interpolates accurately around the threshold you’ll alert on. Default buckets often have a gap exactly where you need precision.
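To see why a boundary near the SLO matters, here is a simplified sketch, in plain Go with no Prometheus dependency, of the linear interpolation that histogram_quantile applies to classic buckets. The bucket type, the estimateQuantile and demoBuckets helpers, and the observation counts are all made up for illustration, and the real function handles more edge cases than this does.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket mirrors one `le` series of a classic histogram:
// Count is cumulative, i.e. the number of observations <= UpperBound.
type bucket struct {
	UpperBound float64
	Count      float64
}

// estimateQuantile is a simplified sketch of bucket interpolation.
// It assumes 0 < q <= 1 and buckets sorted by bound, ending with +Inf.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].Count >= rank })
	if i == len(buckets)-1 {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].UpperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	inBucket := buckets[i].Count - below
	// Assume observations are spread evenly inside the bucket.
	return lower + (buckets[i].UpperBound-lower)*(rank-below)/inBucket
}

// demoBuckets returns the same hypothetical traffic under two layouts:
// 900 requests <= 0.1s, 995 <= 0.25s, and all 1000 <= 0.5s.
func demoBuckets() (coarse, fine []bucket) {
	coarse = []bucket{{0.1, 900}, {0.5, 1000}, {math.Inf(1), 1000}}
	fine = []bucket{{0.1, 900}, {0.25, 995}, {0.5, 1000}, {math.Inf(1), 1000}}
	return coarse, fine
}

func main() {
	coarse, fine := demoBuckets()
	fmt.Printf("p99 without a 0.25 boundary: %.3fs\n", estimateQuantile(0.99, coarse))
	fmt.Printf("p99 with a 0.25 boundary:    %.3fs\n", estimateQuantile(0.99, fine))
}
```

With these invented counts the coarse layout estimates about 0.46s and the layout with a 0.25 boundary about 0.24s, so the same traffic lands on opposite sides of a 250ms threshold depending only on where the bucket edges sit.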
Recording Rules
Now precompute the values dashboards and alerts share. Prometheus evaluates each rule group on its interval, which falls back to the global evaluation_interval when the group does not set one, and stores the result as a new series.
# rules/recording.yml
groups:
  - name: http_red
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, route) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job, route) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_ratio:rate5m
        expr: |
          job:http_errors:rate5m
          /
          clamp_min(job:http_requests:rate5m, 1e-9)
      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, route, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
The clamp_min guards against divide-by-zero when a route has no traffic. Without it, 0/0 evaluates to NaN, and because any comparison against NaN is false, the panel goes blank and the alert silently never fires for that series. That’s a bug I’ve shipped, and it’s invisible until the one night you need the alert.
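The NaN behaviour is ordinary IEEE 754 arithmetic, so it can be reproduced outside Prometheus. The following sketch uses hypothetical helper names (unguardedRatio, guardedRatio, breaches) and math.Max as a stand-in for clamp_min; it is meant only to illustrate the arithmetic, not how Prometheus evaluates the rule.

```go
package main

import (
	"fmt"
	"math"
)

// unguardedRatio divides the rates directly, as the rule would without clamp_min.
func unguardedRatio(errorsPerSec, requestsPerSec float64) float64 {
	return errorsPerSec / requestsPerSec
}

// guardedRatio mirrors errors / clamp_min(requests, floor).
func guardedRatio(errorsPerSec, requestsPerSec, floor float64) float64 {
	return errorsPerSec / math.Max(requestsPerSec, floor)
}

// breaches mirrors the alert comparison `ratio > threshold`.
func breaches(ratio, threshold float64) bool {
	return ratio > threshold
}

func main() {
	idle := unguardedRatio(0, 0)
	fmt.Println(idle, breaches(idle, 0.02)) // NaN false

	safe := guardedRatio(0, 0, 1e-9)
	fmt.Println(safe, breaches(safe, 0.02)) // 0 false

	busy := guardedRatio(5, 100, 1e-9)
	fmt.Println(busy, breaches(busy, 0.02)) // 0.05 true
}
```

The floor of 1e-9 is small enough that it does not distort the ratio for any route with real traffic; it only changes the idle case from NaN to 0.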
Alerting Rules
Alerts read the recorded series. Two principles: alert on symptoms (error ratio, latency) not causes (high CPU), and require a sustained for window so a single bad scrape doesn’t page anyone.
# rules/alerts.yml
groups:
  - name: http_slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_ratio:rate5m > 0.02
        for: 5m
        labels:
          severity: page
          team: search
        annotations:
          summary: "Error ratio above 2% on {{ $labels.route }}"
          description: >
            {{ $labels.route }} is returning errors at
            {{ $value | humanizePercentage }} over the last 5m.
          runbook: "https://runbooks.internal/http-high-error-rate"
      - alert: HighLatencyP99
        expr: job:http_request_duration:p99_5m > 0.5
        for: 10m
        labels:
          severity: page
          team: search
        annotations:
          summary: "p99 latency above 500ms on {{ $labels.route }}"
          description: "p99 is {{ $value | humanizeDuration }}."
      - alert: NoTraffic
        expr: job:http_requests:rate5m == 0
        for: 10m
        labels:
          severity: ticket
          team: search
        annotations:
          summary: "No traffic on {{ $labels.route }} for 10m"
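NoTraffic is the alert teams forget. A service returning zero errors because it’s serving zero requests looks perfectly healthy on an error-ratio panel. Always alert on the absence of traffic somewhere. One caveat: == 0 only matches while the series still exists. If the target disappears and its series go stale, the expression returns nothing, so pair this rule with an up == 0 or absent() check for the fully-down case.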
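The effect of the for window is easier to reason about as a small state machine: a true evaluation moves the alert to pending, it only becomes firing once the condition has held for the whole window, and any false evaluation resets the clock. The sketch below is a simplified model under those assumptions; the alert type, simulate, and demoStates are hypothetical names, and it ignores options such as keep_firing_for.

```go
package main

import (
	"fmt"
	"time"
)

type state string

const (
	inactive state = "inactive"
	pending  state = "pending"
	firing   state = "firing"
)

// alert tracks one alert instance across rule evaluations.
type alert struct {
	holdFor  time.Duration // the rule's `for` value
	activeAt time.Time     // when the condition first became true
	state    state
}

// evaluate applies the result of one rule evaluation at time now.
func (a *alert) evaluate(now time.Time, conditionTrue bool) state {
	if !conditionTrue {
		a.state = inactive // any false evaluation resets the timer
		return a.state
	}
	if a.state != pending && a.state != firing {
		a.state, a.activeAt = pending, now
	}
	if now.Sub(a.activeAt) >= a.holdFor {
		a.state = firing
	}
	return a.state
}

// simulate feeds a series of evaluation results, one per step, into a fresh alert.
func simulate(holdFor, step time.Duration, results []bool) []state {
	a := &alert{holdFor: holdFor}
	start := time.Unix(0, 0)
	states := make([]state, 0, len(results))
	for i, r := range results {
		states = append(states, a.evaluate(start.Add(time.Duration(i)*step), r))
	}
	return states
}

// demoStates models a two-minute blip followed by a sustained breach,
// evaluated once a minute against `for: 5m`.
func demoStates() []state {
	results := []bool{true, true, false, true, true, true, true, true, true}
	return simulate(5*time.Minute, time.Minute, results)
}

func main() {
	for minute, s := range demoStates() {
		fmt.Printf("t=%dm %s\n", minute, s)
	}
}
```

In this run the blip at minutes 0 and 1 never leaves pending, and the sustained breach that starts at minute 3 fires at minute 8, five minutes after it began.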
Wire the rule files into Prometheus and point it at Alertmanager:
# prometheus.yml (Prometheus 3.x)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - rules/recording.yml
  - rules/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: search-api
    static_configs:
      - targets: ['search-api:8080']
Routing with Alertmanager
A firing rule is not a notification. Alertmanager turns the stream of firing alerts into grouped, deduplicated messages and decides who gets them. This config routes page severity to PagerDuty and ticket severity to a Slack channel, with grouping so a cascade becomes one message.
# alertmanager.yml
route:
  receiver: default-slack
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: pagerduty-search
      group_wait: 10s
      continue: false
    - matchers:
        - severity = ticket
      receiver: default-slack

receivers:
  - name: default-slack
    slack_configs:
      # Alertmanager does not expand environment variables in its config,
      # so the webhook is read from a file. The path is an assumption.
      - api_url_file: /etc/alertmanager/secrets/slack_webhook_url
        channel: '#alerts-search'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: >-
          {{ range .Alerts }}{{ .Annotations.description }}
          {{ if .Annotations.runbook }}<{{ .Annotations.runbook }}|runbook>{{ end }}
          {{ end }}
  - name: pagerduty-search
    pagerduty_configs:
      # Same approach for the PagerDuty key; the path is an assumption.
      - routing_key_file: /etc/alertmanager/secrets/pd_routing_key
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'

inhibit_rules:
  - source_matchers: [severity = page]
    target_matchers: [severity = ticket]
    equal: ['team', 'route']
The inhibit_rules block is what stops alert storms. When HighErrorRate pages, the lower-priority ticket alerts for the same route are suppressed, so on-call gets one signal, not a feed. See the Alertmanager configuration docs for the full matcher syntax.
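The inhibition decision can be sketched in a few lines of Go. This is a simplified model, not Alertmanager's implementation: matchers are reduced to exact label equality, and the labels and inhibitRule types and the inhibited helper are hypothetical names. The second route in the usage example is also invented.

```go
package main

import "fmt"

type labels map[string]string

// inhibitRule mirrors one inhibit_rules entry, with matchers reduced
// to exact label equality for the purposes of this sketch.
type inhibitRule struct {
	source labels   // e.g. severity=page
	target labels   // e.g. severity=ticket
	equal  []string // labels that must agree between source and target
}

// matches reports whether l carries every label/value pair in want.
func matches(l, want labels) bool {
	for k, v := range want {
		if l[k] != v {
			return false
		}
	}
	return true
}

// inhibited reports whether candidate is muted by any currently firing alert.
func inhibited(rule inhibitRule, candidate labels, firing []labels) bool {
	if !matches(candidate, rule.target) {
		return false
	}
	for _, src := range firing {
		if !matches(src, rule.source) {
			continue
		}
		same := true
		for _, name := range rule.equal {
			// A missing label reads as "", so absent and empty compare equal.
			if src[name] != candidate[name] {
				same = false
				break
			}
		}
		if same {
			return true
		}
	}
	return false
}

func main() {
	rule := inhibitRule{
		source: labels{"severity": "page"},
		target: labels{"severity": "ticket"},
		equal:  []string{"team", "route"},
	}
	firing := []labels{
		{"alertname": "HighErrorRate", "severity": "page", "team": "search", "route": "/api/search"},
	}
	sameRoute := labels{"alertname": "NoTraffic", "severity": "ticket", "team": "search", "route": "/api/search"}
	otherRoute := labels{"alertname": "NoTraffic", "severity": "ticket", "team": "search", "route": "/api/suggest"}

	fmt.Println(inhibited(rule, sameRoute, firing))  // true
	fmt.Println(inhibited(rule, otherRoute, firing)) // false
}
```

The equal list is what keeps inhibition narrow: a page on one route mutes tickets for that route only, which is why both alerts need to carry the route label.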
The Dashboard
With recording rules in place, the Grafana 11 dashboard is thin — it just renders the named series. Here’s the JSON model for a four-panel RED dashboard. Import it via Dashboards → New → Import.
{
  "title": "Search API - RED",
  "schemaVersion": 39,
  "refresh": "15s",
  "time": { "from": "now-3h", "to": "now" },
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        { "expr": "job:http_requests:rate5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "Error ratio",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 0.02 }
            ]
          }
        }
      },
      "targets": [
        { "expr": "job:http_error_ratio:rate5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "p99 latency",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "fieldConfig": { "defaults": { "unit": "s" } },
      "targets": [
        { "expr": "job:http_request_duration:p99_5m", "legendFormat": "{{route}}" }
      ]
    },
    {
      "title": "Firing alerts",
      "type": "alertlist",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "options": { "alertName": "", "stateFilter": { "firing": true } }
    }
  ]
}
Each panel reads a recorded series, so the dashboard loads instantly even over a long time range and shows exactly the same number the alert evaluated. Set the error-ratio panel threshold to the same 0.02 the alert uses — when the line goes red, on-call already knows a page is imminent.
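One way to keep the panel threshold and the alert threshold from drifting is to generate the panel JSON from a single constant. The sketch below does that with only the Go standard library; the struct shapes cover just the fields used in the error-ratio panel, errorRatioThreshold and errorRatioPanel are hypothetical names, and gridPos is left out for brevity. Treat it as a starting point rather than a complete dashboard generator.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorRatioThreshold is assumed to match the `> 0.02` in the HighErrorRate rule.
const errorRatioThreshold = 0.02

// step is one threshold step; a nil Value encodes as null, the base step.
type step struct {
	Color string   `json:"color"`
	Value *float64 `json:"value"`
}

type target struct {
	Expr         string `json:"expr"`
	LegendFormat string `json:"legendFormat"`
}

type panel struct {
	Title       string         `json:"title"`
	Type        string         `json:"type"`
	FieldConfig map[string]any `json:"fieldConfig"`
	Targets     []target       `json:"targets"`
}

// errorRatioPanel builds the error-ratio panel with the given red threshold.
func errorRatioPanel(threshold float64) panel {
	return panel{
		Title: "Error ratio",
		Type:  "timeseries",
		FieldConfig: map[string]any{
			"defaults": map[string]any{
				"unit": "percentunit",
				"thresholds": map[string]any{
					"mode":  "absolute",
					"steps": []step{{Color: "green"}, {Color: "red", Value: &threshold}},
				},
			},
		},
		Targets: []target{{Expr: "job:http_error_ratio:rate5m", LegendFormat: "{{route}}"}},
	}
}

func main() {
	out, err := json.MarshalIndent(errorRatioPanel(errorRatioThreshold), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

If the alert rule file is rendered from the same constant, changing the SLO becomes a one-line edit that updates both the page condition and the red line on the panel.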
Common Pitfalls
- Alerting on causes. High CPU isn’t an incident; slow responses are. Page on symptoms, keep cause metrics on the dashboard for diagnosis only.
- No for window. An alert without a sustained window fires on a single scrape blip. Five to ten minutes filters noise without hiding real outages.
- Duplicated PromQL. Dashboard and alert computing the same quantile separately will drift. Always go through a recording rule.
- No send_resolved. Without it, on-call never learns the alert cleared and keeps investigating a dead incident.
- Forgetting NoTraffic. A silent service looks healthy on every error-rate panel. Alert on zero traffic explicitly.
- One giant route in Alertmanager. Without group_by and inhibit_rules, one outage becomes a notification flood.
Troubleshooting
Symptom: Alert shows as firing in Prometheus but no notification arrives.
Cause: Prometheus can’t reach Alertmanager, or Alertmanager has no matching route.
Fix: Check Status → Runtime & Build in Prometheus for the Alertmanager endpoint, then amtool config routes test severity=page to confirm the alert matches a receiver.
Symptom: Error-ratio panel shows NaN or no data.
Cause: Division by zero on a route with no traffic.
Fix: Wrap the denominator in clamp_min(..., 1e-9) in the recording rule.
Symptom: The same incident pages five times in ten minutes.
Cause: group_interval too short or group_by missing the right labels.
Fix: Group by alertname and team, raise group_interval to 5m, and add inhibit_rules so high-severity alerts suppress related low-severity ones.
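How much grouping helps depends entirely on which labels are in group_by. As a rough model, alerts that share the same values for every group_by label end up in one notification. The sketch below illustrates that bucketing; the alert type, groupKey, groupAlerts, and demoAlerts are hypothetical names, and two of the three routes are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

type alert map[string]string

// groupKey builds a group identity from the values of the group_by labels.
func groupKey(a alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+a[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alerts so that each bucket would become one notification.
func groupAlerts(alerts []alert, groupBy []string) map[string][]alert {
	groups := make(map[string][]alert)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

// demoAlerts models one outage that trips HighErrorRate on three routes.
func demoAlerts() []alert {
	routes := []string{"/api/search", "/api/suggest", "/api/index"}
	alerts := make([]alert, 0, len(routes))
	for _, r := range routes {
		alerts = append(alerts, alert{"alertname": "HighErrorRate", "team": "search", "route": r})
	}
	return alerts
}

func main() {
	byTeam := groupAlerts(demoAlerts(), []string{"alertname", "team"})
	byRoute := groupAlerts(demoAlerts(), []string{"alertname", "team", "route"})
	fmt.Println(len(byTeam), len(byRoute)) // 1 3
}
```

Grouping by alertname and team collapses the outage into a single notification, while adding route to group_by would send one per route.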
Symptom: Dashboard p99 and alert disagree.
Cause: Panel and alert compute the quantile independently with different windows.
Fix: Point both at the job:http_request_duration:p99_5m recording rule.
Wrapping Up
Trustworthy Prometheus Grafana alerting dashboards come from discipline, not panel count: instrument RED metrics, precompute with recording rules, alert on sustained symptoms, and route through Alertmanager so each incident pages one team once. Build that and on-call starts trusting the alerts again — which is the only metric that really matters. From here, codify these dashboards so they’re reproducible and version-controlled.