Grafana Dashboards That Actually Help
September 12, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Use RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for resources. Three-tier dashboards: overview, service, machine. Six panels per screen max. Color thresholds match alert thresholds. Less is more.

After Node instrumentation come the dashboards built on top of all that data. This post is opinions plus heuristics.

RED method for services

For any user-facing service, three panels:

R - Rate. Requests per second.

sum(rate(http_requests_total{service="api"}[5m]))

E - Errors. Rate of 5xx (and 4xx if you care).

sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))

Error rate as a ratio (multiply by 100 for a percentage):

sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))

D - Duration. P50, P95, P99 latency.

histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))

These three answer “is the service healthy?” in 10 seconds.
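To make the Duration panel less of a black box, here is a minimal Python sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. The bucket bounds and counts are made-up illustration values, and the function is a simplified model (it assumes non-negative bounds and 0 < q <= 1), not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model of PromQL's histogram_quantile for classic histograms:
    find the bucket that contains the target rank, then interpolate linearly
    inside it. Assumes 0 < q <= 1 and non-negative bucket bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)          # the +Inf bucket sorts last
    total = buckets[-1][1]
    if total == 0:
        return math.nan                # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second bucket rates for the "api" service (illustration only).
example_buckets = [(0.1, 50.0), (0.25, 90.0), (0.5, 98.0), (math.inf, 100.0)]
p50 = histogram_quantile(0.50, example_buckets)
p95 = histogram_quantile(0.95, example_buckets)
p99 = histogram_quantile(0.99, example_buckets)
```

In this sketch the P99 lands in the +Inf bucket, so the estimate is capped at the highest finite bound. If the same holds for real histograms, bucket bounds need to bracket the latencies you care about, or the P99 panel flatlines at the top bucket.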

USE method for resources

For machines and infrastructure:

U - Utilization. Percentage of resource in use.

1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))    # CPU
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)  # Memory

S - Saturation. Where resource is overcommitted.

node_load5 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})  # Load avg

E - Errors. Hardware errors.

rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

USE for the box. RED for the service running on the box.

Three-tier dashboards

Three tiers of dashboards, increasingly specific:

Overview — fleet-wide. One row per service: rate, error rate, p95. Color-coded. “Which service is unhealthy right now?”

Service — per service. RED + saturation panels. Templated by environment + instance. “What’s happening to API?”

Machine — per host. USE + per-process. “What’s the host doing?”

Links between levels: clicking a service in overview goes to that service’s dashboard.

The 6-panel rule

A dashboard with 20 panels is harder to read than one with 6. The most-actionable signals deserve the most visual real estate.

Pick the 6 most-important panels. The rest go in linked dashboards.

For an API service overview:

  1. Request rate (over time, big)
  2. Error rate (% with threshold colors)
  3. P95 latency (over time)
  4. Instances healthy / total
  5. Top 5 slowest endpoints
  6. Recent error log links

Six panels. One screen. Full picture.
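The rule is easy to enforce mechanically once dashboards are exported as JSON. The sketch below counts panels in a dashboard JSON model and flags anything over budget. It assumes the usual export layout (a top-level "panels" list, with row panels carrying nested "panels" when collapsed); the sample dashboard is hypothetical and trimmed to the fields the check reads.

```python
MAX_PANELS = 6  # the limit argued for in this post


def count_panels(dashboard):
    """Count content panels in a Grafana dashboard JSON model.

    Row panels are layout only, so they are not counted themselves. Panels
    nested inside a collapsed row are counted, because they still cost a
    query once the row is expanded.
    """
    total = 0
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            total += len(panel.get("panels", []))
        else:
            total += 1
    return total


def over_budget(dashboard, limit=MAX_PANELS):
    """True when the dashboard has more panels than the limit allows."""
    return count_panels(dashboard) > limit


# Hypothetical export of the API service overview listed above.
api_overview = {
    "title": "API overview",
    "panels": [
        {"type": "timeseries", "title": "Request rate"},
        {"type": "stat", "title": "Error rate"},
        {"type": "timeseries", "title": "P95 latency"},
        {"type": "stat", "title": "Instances healthy / total"},
        {"type": "table", "title": "Top 5 slowest endpoints"},
        {"type": "logs", "title": "Recent error logs"},
    ],
}
```

A check like this could run in CI against committed dashboard JSON, so the budget is reviewed whenever someone adds a seventh panel.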

Thresholds and colors

Color = quick decision. Tune thresholds to alert thresholds:

  • Green: 0-95% normal
  • Yellow: 95-99% normal (warning)
  • Red: >99% normal or alert firing

A panel showing “OK” green when an alert just fired is worse than no panel.

Configure per stat:

Thresholds (error rate):
0    Green
0.01 Yellow (1% errors = warning)
0.05 Red    (5% errors = critical)

Same threshold drives alert + dashboard color. Single source of truth.
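One way to keep that single source of truth honest is to generate both artifacts from the same constants. This is a sketch under assumptions: the alert names and the 5m "for" duration are invented for illustration, and the threshold block follows the "mode" plus "steps" shape that Grafana panel JSON uses under fieldConfig.defaults.thresholds.

```python
# Single source of truth: error-ratio levels taken from the thresholds above.
ERROR_RATIO_LEVELS = [
    ("warning", "yellow", 0.01),
    ("critical", "red", 0.05),
]

ERROR_RATIO_EXPR = (
    'sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="api"}[5m]))'
)


def grafana_thresholds(levels):
    """Build the thresholds block for a panel's field config."""
    steps = [{"color": "green", "value": None}]  # base step has no lower bound
    steps += [{"color": color, "value": value} for _, color, value in levels]
    return {"mode": "absolute", "steps": steps}


def alert_rules(levels, expr=ERROR_RATIO_EXPR):
    """Build Prometheus alerting-rule entries from the same levels."""
    return [
        {
            "alert": f"ApiErrorRatio{severity.capitalize()}",  # hypothetical name
            "expr": f"({expr}) > {value}",
            "for": "5m",                                       # assumed duration
            "labels": {"severity": severity},
        }
        for severity, _, value in levels
    ]


thresholds = grafana_thresholds(ERROR_RATIO_LEVELS)
rules = alert_rules(ERROR_RATIO_LEVELS)
```

Changing 0.05 to 0.03 in ERROR_RATIO_LEVELS then moves the red band and the critical alert together, which is the whole point.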

Time range and refresh

Default time range matters. Most dashboards: “last 1 hour” makes sense.

For executive dashboards: “last 24h” or “last 7d”.

For incident dashboards: “last 15m” so the eye focuses on now.

Auto-refresh: 30s typical. For incident response: 5s. Don’t auto-refresh executive dashboards every 5s; it’s distracting.
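These defaults live in the dashboard JSON as the top-level "time" and "refresh" fields, so they can be applied in bulk. A small sketch, assuming a hypothetical "audience" label per dashboard; the 5m refresh for executive dashboards is my own pick for "not every 5s", not a figure from above.

```python
# Defaults from the guidance above, keyed by a hypothetical audience label.
TIME_DEFAULTS = {
    "service":   {"from": "now-1h",  "refresh": "30s"},
    "executive": {"from": "now-24h", "refresh": "5m"},   # assumed slow refresh
    "incident":  {"from": "now-15m", "refresh": "5s"},
}


def apply_time_defaults(dashboard, audience):
    """Return a copy of a dashboard JSON model with time range and refresh set."""
    preset = TIME_DEFAULTS[audience]
    updated = dict(dashboard)  # shallow copy; the input is left untouched
    updated["time"] = {"from": preset["from"], "to": "now"}
    updated["refresh"] = preset["refresh"]
    return updated


incident_view = apply_time_defaults({"title": "API incident"}, "incident")
```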

Variables for reuse

Templated dashboards. Variables for:

  • $service — service name
  • $env — environment (prod, staging)
  • $instance — specific replica
  • $__rate_interval — built in; Grafana sets it automatically for rate() queries

One dashboard serves all services via variable selection. Don’t create 12 near-identical dashboards.

sum(rate(http_requests_total{service="$service",env="$env"}[$__rate_interval]))
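To see what variable selection buys you, the sketch below mimics the substitution Grafana performs, using Python's string.Template. It is an illustration only: the rate window is fixed at 5m to keep the example independent of Grafana's built-in interval variables, and the default values are assumptions.

```python
from string import Template

# One query template, shared by every service and environment.
QUERY = Template(
    'sum(rate(http_requests_total{service="$service",env="$env"}[5m]))'
)

# Assumed defaults, so the dashboard never opens on an empty selection.
VARIABLE_DEFAULTS = {"service": "api", "env": "prod"}


def render(selection=None):
    """Mimic Grafana's variable substitution for one dashboard selection."""
    values = {**VARIABLE_DEFAULTS, **(selection or {})}
    return QUERY.substitute(values)


# Two "dashboards" from one template.
api_prod = render()
bff_staging = render({"service": "bff", "env": "staging"})
```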

Anti-patterns

Speedometers / radial gauges everywhere. Pretty; bad for comparison. Use line charts.

Y-axis not starting at 0. Subtle; misleading. Set the axis minimum to 0 unless explicitly comparing.

Logarithmic axes by default. Useful sometimes; confusing in dashboards meant for non-experts.

Heat maps without explanation. Powerful but require legends and context.

Dashboards with no time correlation. All panels should share the dashboard’s time range. Avoid hard-coded time ranges on individual panels.

“More data = better.” A 30-panel dashboard takes minutes to load. Each loaded panel runs at least one Prometheus query. Cull aggressively.

Annotations

Mark events on time series:

  • Deploys (from CI webhook → grafana_annotations table)
  • Incidents (manual ack from on-call)
  • Maintenance windows
  • Feature flags toggled

Pattern: cross-reference deploys to performance changes. “Latency rose after the 14:30 deploy” → you can see it on the dashboard.
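For the deploy case, a CI job can push an annotation through Grafana's HTTP API (POST /api/annotations). The sketch below only builds the request body; the tag scheme, service name, version string, and timestamp are hypothetical, and the actual HTTP call is left out so the example stays runnable offline.

```python
import json
import time


def deploy_annotation(service, version, when=None):
    """Build a request body for Grafana's POST /api/annotations endpoint.

    `when` is epoch seconds; it defaults to the current time.
    """
    seconds = when if when is not None else time.time()
    return {
        "time": int(seconds * 1000),                 # Grafana expects epoch milliseconds
        "tags": ["deploy", f"service:{service}"],    # hypothetical tag scheme
        "text": f"Deployed {service} {version}",
    }


# Hypothetical deploy of the "api" service at a fixed timestamp.
payload = deploy_annotation("api", "v1.4.2", when=1662993000)
body = json.dumps(payload)
```

The CI step would send `body` with a JSON content type and an API token in the Authorization header. Filtering the dashboard's annotation query on the deploy tag then draws a marker on every time series panel.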

A real-world overview dashboard

┌────────────────────────────────────────────────────────────────────┐
│ Fleet Status — selected services                                   │
├─────────────────┬─────────────────┬─────────────────┬─────────────┤
│ api             │ bff             │ worker          │ scheduler   │
│ 1240 req/s      │ 380 req/s       │ -               │ -           │
│ Err: 0.02%      │ Err: 0.0%       │ Err: 0.1%       │ Err: 0.0%   │
│ P95: 45ms       │ P95: 120ms      │ -               │ -           │
│ ●●● healthy     │ ●●● healthy     │ ●●○ degraded    │ ●●● healthy │
├────────────────────────────────────────────────────────────────────┤
│ [Time series: requests/sec across all services, last 1h]           │
│                                                                    │
├────────────────────────────────────────────────────────────────────┤
│ [Time series: error rate %, last 1h, threshold lines]              │
│                                                                    │
├────────────────────────────────────────────────────────────────────┤
│ [Table: recent alerts firing]                                      │
└────────────────────────────────────────────────────────────────────┘

Four stat panels, two time series, one table. Click any stat panel → that service’s detail dashboard.

Common Pitfalls

Building dashboards before alerts. Dashboards aren’t monitoring. Alerts page; dashboards explain. Build alerts first.

Templates without defaults. Every dashboard view starts with “select environment.” Annoying. Set sensible defaults.

Forgetting to share dashboards with the team. Grafana scopes dashboard permissions by folder. Make sure your team can see them.

No version control. Dashboards drift; revisions lost. Export to JSON, commit to git.

Color theory mistakes. Red-green colorblindness is common. Use blue for “good” and orange for “bad” as a colorblind-friendly default.

One huge dashboard with everything. Slow to load; hard to use. Three focused dashboards beat one giant one.
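On the version-control pitfall: raw exports make noisy diffs, because fields such as "id" and "version" change between saves or differ between Grafana instances. A minimal sketch of normalizing an export before committing it follows. Which keys count as volatile is an assumption you may need to adjust for your setup, and the sample export is hypothetical.

```python
import json

# Assumed volatile keys: they change on save or differ per Grafana instance.
VOLATILE_KEYS = ("id", "version")


def normalize_dashboard(dashboard):
    """Return stable JSON text for committing a dashboard export to git."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}
    # Sorted keys and fixed indentation keep diffs limited to real changes.
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


# Hypothetical export, trimmed to a few fields.
export = {"id": 42, "uid": "api-overview", "version": 17, "title": "API overview", "panels": []}
committed_text = normalize_dashboard(export)
```

Two saves that differ only in "version" then produce identical files, so git shows a diff only when a panel or query actually changed.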

Wrapping Up

RED + USE + three-tier + 6-panel rule + threshold colors. Wednesday: Alertmanager — turning metric signals into pages.