Alerting with Prometheus Alertmanager
TL;DR — Alertmanager receives alerts from Prometheus, groups by labels, dedupes, routes via tree of policies to receivers (Slack, PagerDuty). Inhibition suppresses noise; silencing handles maintenance. Tune for “alert when something is actually wrong”; never for “alert on every blip.”
After dashboards, the alerting layer. Dashboards explain; Alertmanager pages. Prometheus generates alert events; Alertmanager makes them actionable.
Architecture
[Prometheus] alert rules → fire events → [Alertmanager] → route → [Slack/PagerDuty/email]
Alertmanager handles:
- Grouping (50 alerts about the same service → 1 notification)
- Deduplication (same alert sent by redundant Prometheus replicas → 1)
- Inhibition (one alert suppresses others)
- Silencing (planned maintenance windows)
- Routing (which alert goes where)
Config: alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'
  routes:
    - matchers: ['severity="critical"']
      receiver: 'pagerduty-critical'
      continue: true
    - matchers: ['team="api"']
      receiver: 'slack-api-team'
receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_KEY'
  - name: 'slack-api-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#api-team'
inhibit_rules:
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: ['service']
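The tree: routes match alert labels and are evaluated top-down. An alert goes to the first sibling route that matches, descending into that route's children if it has any. Setting continue: true on a route keeps evaluation going to later siblings, so one alert can reach multiple receivers. An alert that matches no child route falls back to the root receiver.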
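To make the traversal concrete, here is a minimal Python sketch of how the routing tree in the config above resolves receivers. It is a simplified model and not Alertmanager's own code: matchers are reduced to exact label equality, and the Route class and resolve function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # exact-equality matchers only (assumption)
    cont: bool = False                          # models `continue: true`
    routes: list = field(default_factory=list)  # child routes


def resolve(route, labels):
    """Return the receivers an alert with these labels is delivered to."""
    # A node that does not match contributes nothing.
    if any(labels.get(k) != v for k, v in route.match.items()):
        return []
    receivers = []
    for child in route.routes:
        matched = resolve(child, labels)
        receivers.extend(matched)
        # Stop at the first matching sibling unless it sets continue.
        if matched and not child.cont:
            break
    # No child matched: the alert stays on this node's receiver.
    return receivers or [route.receiver]


tree = Route(
    receiver="default-slack",
    routes=[
        Route("pagerduty-critical", {"severity": "critical"}, cont=True),
        Route("slack-api-team", {"team": "api"}),
    ],
)

print(resolve(tree, {"severity": "critical", "team": "api"}))
print(resolve(tree, {"severity": "warning", "team": "api"}))
print(resolve(tree, {"severity": "warning", "team": "web"}))
```

In this model a critical API alert reaches both PagerDuty and the API team's Slack channel, a warning API alert reaches only the Slack channel, and anything else falls back to the default receiver.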
Grouping
group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
- group_by: combine alerts sharing these labels into one notification
- group_wait: wait 30s after the first alert in a group fires before sending (lets related alerts join)
- group_interval: wait 5m before sending an updated notification for the group
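A network outage causing 50 services to alert: with grouping by alertname alone, you get one Slack message listing all 50. Without grouping, you get 50. Note that adding service to group_by, as in the config above, creates one group per service, so an outage that fires the same alert across 50 services still produces 50 notifications. Pick group_by labels to match how you want incidents bundled.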
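The effect of the group_by choice can be sketched in a few lines of Python. This is an illustrative model only: the group_alerts helper, the alert name, and the service names are all hypothetical, and real Alertmanager grouping also involves the timers described above.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with identical values for every group_by label share a group.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return groups


# 50 hypothetical services firing the same alert during a network outage.
alerts = [
    {"alertname": "ServiceUnreachable", "service": f"svc-{i}"}
    for i in range(50)
]

by_name = group_alerts(alerts, ["alertname"])
by_name_and_service = group_alerts(alerts, ["alertname", "service"])

print(len(by_name))              # 1 group, so 1 notification
print(len(by_name_and_service))  # 50 groups, so 50 notifications
```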
Inhibition
When one alert means others are noise:
inhibit_rules:
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: ['service']
Reads: “If ServiceDown is firing for service X, suppress any severity=warning alerts about the same service.”
If your API is down, you don’t need 14 warning alerts about its slow latency. The down alert subsumes them.
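The rule's logic can be sketched in Python as follows. This is a simplified model under stated assumptions: matchers are exact label equality, alerts are plain label dicts, and is_inhibited is a hypothetical helper, not an Alertmanager API.

```python
def is_inhibited(target, firing, rule):
    """True if some firing alert matching the rule's source mutes the target."""
    # The target must match the rule's target matchers.
    if any(target.get(k) != v for k, v in rule["target"].items()):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if any(source.get(k) != v for k, v in rule["source"].items()):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False


rule = {
    "source": {"alertname": "ServiceDown"},
    "target": {"severity": "warning"},
    "equal": ["service"],
}

firing = [
    {"alertname": "ServiceDown", "service": "api", "severity": "critical"},
    {"alertname": "HighLatency", "service": "api", "severity": "warning"},
    {"alertname": "HighLatency", "service": "billing", "severity": "warning"},
]

delivered = [a for a in firing if not is_inhibited(a, firing, rule)]
for alert in delivered:
    print(alert["alertname"], alert["service"])
```

Here the latency warning for api is suppressed because ServiceDown is firing for the same service, while the billing warning still goes out.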
Other useful inhibitions:
- “Whole cluster down” inhibits “individual node alerts”
- “Database master down” inhibits “replica lag warnings”
- “Network partition” inhibits “service unreachable” alerts (planned migrations are better handled with silences)
Silencing
Planned maintenance? Silence the relevant alerts.
UI: Alertmanager → Silences → New. Matchers + start/end time + creator name.
API: POST to /api/v2/silences.
CLI:
amtool silence add alertname=HighErrorRate service=api --duration=1h --comment="Deploying v1.2"
Always set a short, explicit end time. Every silence carries an expiry, so the real risk is one set months out: far-future silences are how alerts vanish from awareness.
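For scripted maintenance windows, a silence can be created through the API with the Python standard library alone. The sketch below builds the JSON body for POST /api/v2/silences with an explicit expiry. The build_silence and post_silence helpers, the deploy-bot author, and the localhost URL in the usage note are assumptions for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, duration, created_by, comment, now=None):
    """Build a silence body with an explicit, bounded end time."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, payload):
    """POST the silence to Alertmanager and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_silence(
    {"alertname": "HighErrorRate", "service": "api"},
    duration=timedelta(hours=1),
    created_by="deploy-bot",  # hypothetical author name
    comment="Deploying v1.2",
)
print(json.dumps(payload, indent=2))
```

Calling post_silence("http://localhost:9093", payload) would submit it, assuming Alertmanager listens on that address.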
Receivers — getting to Slack and PagerDuty
Slack:
- name: 'slack-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/T.../B.../...'
      channel: '#alerts'
      title: '{{ .GroupLabels.alertname }}'
      text: |
        {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }}
        *Severity:* {{ .Labels.severity }}
        *Service:* {{ .Labels.service }}
        {{ end }}
The {{ ... }} is Go template syntax over the alert payload. Customize message format.
PagerDuty:
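- name: 'pagerduty-critical'
  pagerduty_configs:
    # routing_key is the Events API v2 integration key; the severity
    # field is only sent with v2, not with a v1 service_key.
    - routing_key: 'YOUR_INTEGRATION_KEY'
      severity: '{{ .CommonLabels.severity }}'
      description: '{{ .CommonAnnotations.summary }}'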
PD then handles escalation, on-call rotations, ack/resolve.
Webhook (custom):
- name: 'custom-webhook'
  webhook_configs:
    - url: 'https://internal-alerter.example.com/hook'
      send_resolved: true
For pushing to your own systems.
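On the receiving side, Alertmanager POSTs a JSON document whose alerts array holds one entry per alert, each with status, labels, and annotations. A minimal receiver sketch using only the Python standard library is shown below. The summarize and serve helpers, the handler class, and port 8080 are assumptions for illustration, and a production receiver would need authentication and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append(
            f'{alert.get("status", "unknown")}: '
            f'{labels.get("alertname", "?")} '
            f'({labels.get("service", "n/a")})'
        )
    return lines


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body Alertmanager sends.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted (hypothetical port)."""
    HTTPServer(("0.0.0.0", port), HookHandler).serve_forever()
```

Because send_resolved is true in the config, the same endpoint sees both firing and resolved entries, which is why summarize prints the status first.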
Alert rules — best practices
Recap from Prometheus 101:
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api"}[5m]))
          > 0.05
        for: 2m
        labels:
          severity: warning
          team: api
          service: api  # sum() drops the series labels, so set service explicitly
        annotations:
          summary: "API error rate > 5%"
          description: "{{ $value | humanizePercentage }} 5xx for 2 min"
          runbook: "https://wiki.example.com/runbooks/api-error-rate"
Three labels every alert should have:
- severity: critical / warning / info
- team: for routing
- service: for grouping
Three annotations:
- summary: short, fits in a Slack title
- description: longer, can include $value
- runbook: link to actions to take
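These conventions are easy to check mechanically before rules ship. The sketch below is a hypothetical lint helper in Python that operates on a rule already loaded into a dict (for example from the rules file). The lint_rule function and the DiskAlmostFull rule are illustrative assumptions, not part of Prometheus tooling.

```python
REQUIRED_LABELS = ("severity", "team", "service")
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook")


def lint_rule(rule):
    """Return a list of required labels/annotations missing from one rule."""
    problems = []
    for name in REQUIRED_LABELS:
        if name not in rule.get("labels", {}):
            problems.append(f"missing label: {name}")
    for name in REQUIRED_ANNOTATIONS:
        if name not in rule.get("annotations", {}):
            problems.append(f"missing annotation: {name}")
    return problems


# Hypothetical rule that forgot its service label and most annotations.
rule = {
    "alert": "DiskAlmostFull",
    "labels": {"severity": "warning", "team": "infra"},
    "annotations": {"summary": "Disk usage above 90%"},
}

for problem in lint_rule(rule):
    print(f'{rule["alert"]}: {problem}')
```

Running a check like this in CI is one way to keep the label and annotation conventions from drifting.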
Severity discipline
- Critical: page someone. Stop the world. Real user impact.
- Warning: notify in Slack. Look at when convenient. Trending bad.
- Info: dashboard or low-priority channel. Awareness only.
If everything is critical, nothing is. Calibrate.
Common Pitfalls
Routing all alerts to one channel. Becomes noise; everyone mutes. Route by team / severity.
No grouping. Outage = 50 individual messages. Group.
No inhibition. Cascading failures = cascading notifications. Inhibit hierarchy.
Silences with far-off end times. Alerts vanish from awareness. Keep every silence short and explicit.
Critical severity for non-critical. Page fatigue.
No runbook link. On-call gets paged, has no idea what to do. Always link the runbook.
Repeat interval too short. Same alert pings every 30s. People mute. 4-hour default is reasonable.
Wrapping Up
Alertmanager: routing tree + grouping + inhibition + silencing + tuned severity = useful alerts. Friday: Loki for log aggregation — the logs pillar.