Alerting with Prometheus Alertmanager

September 14, 2022 · 4 min read · by Muhammad Amal

TL;DR: Alertmanager receives alerts from Prometheus, groups them by label, deduplicates them, and routes them through a tree of policies to receivers (Slack, PagerDuty). Inhibition suppresses noise; silencing covers maintenance. Tune for “alert when something is actually wrong,” never for “alert on every blip.”

After dashboards, the alerting layer. Dashboards explain; Alertmanager pages. Prometheus generates alert events; Alertmanager makes them actionable.

Architecture

[Prometheus] alert rules → fire events → [Alertmanager] → route → [Slack/PagerDuty/email]

Alertmanager handles:

Grouping (50 alerts about the same service → 1 notification)
Deduplication (same alert from clustered Prometheus → 1)
Inhibition (one alert suppresses others)
Silencing (planned maintenance windows)
Routing (which alert goes where)

Config: alertmanager.yml

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'

  routes:
    - matchers: ['severity="critical"']
      receiver: 'pagerduty-critical'
      continue: true
    - matchers: ['team="api"']
      receiver: 'slack-api-team'

receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_KEY'

  - name: 'slack-api-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#api-team'

inhibit_rules:
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: ['service']

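The tree: an alert enters at the top-level route and is checked against the child routes in order. The first child whose matchers fit takes the alert, and the siblings after it are skipped unless that child sets continue: true, which lets the alert cascade to further receivers. If no child matches, the parent's receiver handles it.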
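The matching order is easier to see in a small model. The following is a minimal sketch in Python, not Alertmanager's own code: the Route class and resolve function are hypothetical names, and matching is reduced to plain label equality (no regex or negative matchers). It mirrors the config above.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Simplified routing-tree node (hypothetical model for illustration)."""
    receiver: str
    match: dict = field(default_factory=dict)   # label -> required value
    continue_: bool = False                      # mirrors `continue: true`
    routes: list = field(default_factory=list)   # child routes, in order


def matches(route, labels):
    # An empty match dict matches every alert, like the top-level route.
    return all(labels.get(k) == v for k, v in route.match.items())


def resolve(route, labels):
    """Return the receivers an alert with these labels would reach."""
    if not matches(route, labels):
        return []
    receivers = []
    for child in route.routes:
        found = resolve(child, labels)
        if found:
            receivers.extend(found)
            if not child.continue_:
                break  # first match wins unless it asked to continue
    # No child matched: the current node handles the alert itself.
    return receivers or [route.receiver]


root = Route(
    receiver="default-slack",
    routes=[
        Route("pagerduty-critical", {"severity": "critical"}, continue_=True),
        Route("slack-api-team", {"team": "api"}),
    ],
)

print(resolve(root, {"severity": "critical", "team": "api"}))
print(resolve(root, {"severity": "warning", "team": "api"}))
print(resolve(root, {"severity": "warning", "team": "db"}))
```

Under these assumptions a critical API alert reaches both PagerDuty and the API team channel, a warning API alert reaches only the team channel, and anything unmatched falls back to default-slack.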

Grouping

group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m

group_by: combine alerts sharing these labels into one notification
group_wait: wait 30s after first alert in a group fires before sending (lets related alerts join)
group_interval: once a group has sent its first notification, wait 5m before notifying about new alerts that join it

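A network outage causing 50 services to alert: with group_by: ['alertname'], you get one Slack message listing all 50. Without grouping, you get 50. Note that the config above also groups by service, so it would send one message per affected service; drop service from group_by if you want a single outage-wide notification.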
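To make the effect of group_by concrete, here is a minimal Python sketch of the bucketing idea. The group_alerts function, the alert name, and the service names are all made up for illustration; real Alertmanager also applies the timing settings above, which this sketch ignores.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by their group_by label values; one bucket = one notification."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value for grouping purposes.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# 50 hypothetical services firing the same alert during a network outage.
alerts = [
    {"alertname": "ServiceUnreachable", "service": f"svc-{i}"}
    for i in range(50)
]

print(len(group_alerts(alerts, ["alertname"])))             # 1
print(len(group_alerts(alerts, ["alertname", "service"])))  # 50
```

The choice of group_by labels decides how many notifications an outage produces, so it is worth checking against the label sets your rules actually emit.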

Inhibition

When one alert means others are noise:

inhibit_rules:
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: ['service']

Reads: “If ServiceDown is firing for service X, suppress any severity=warning alerts about the same service.”

If your API is down, you don’t need 14 warning alerts about its slow latency. The down alert subsumes them.
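The rule can be read as a small predicate. Below is a rough Python sketch of that logic for a single inhibit rule, assuming equality-only matchers; is_inhibited and the sample alerts are hypothetical, and the real implementation also tracks alert state over time.

```python
def has(labels, want):
    """True if every label in `want` is present with the same value."""
    return all(labels.get(k) == v for k, v in want.items())


def is_inhibited(target, firing, source_match, target_match, equal):
    """True if `target` is suppressed by a firing alert under one inhibit rule."""
    if not has(target, target_match):
        return False
    for source in firing:
        if not has(source, source_match):
            continue
        # An alert matching both sides cannot be inhibited by another
        # alert that also matches both sides (this includes itself).
        if has(target, source_match) and has(source, target_match):
            continue
        # Missing and empty label values are treated as equal.
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False


SOURCE = {"alertname": "ServiceDown"}
TARGET = {"severity": "warning"}
EQUAL = ["service"]

firing = [
    {"alertname": "ServiceDown", "service": "api", "severity": "critical"},
    {"alertname": "HighLatency", "service": "api", "severity": "warning"},
    {"alertname": "HighLatency", "service": "billing", "severity": "warning"},
]

for alert in firing:
    print(alert["alertname"], alert["service"], is_inhibited(alert, firing, SOURCE, TARGET, EQUAL))
```

Only the API latency warning is suppressed here: the billing warning belongs to a different service, and ServiceDown itself is not a warning.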

Other useful inhibitions:

“Whole cluster down” inhibits “individual node alerts”
“Database master down” inhibits “replica lag warnings”
“Network partition” inhibits “service unreachable” alerts during planned migrations

Silencing

Planned maintenance? Silence the relevant alerts.

UI: Alertmanager → Silences → New. Matchers + start/end time + creator name.

API: POST to /api/v2/silences.

CLI:

amtool silence add alertname=HighErrorRate service=api --duration=1h --comment="Deploying v1.2"

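Always set a deliberate, short end time. Alertmanager requires one, but a silence that expires years from now is effectively forever, and forever-silences are how alerts vanish from awareness.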
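For the API route, the request body can be built ahead of time. This is a sketch in Python that only constructs the JSON and does not send it. The field names (matchers, startsAt, endsAt, createdBy, comment) follow the v2 API as I understand it, so verify them against the OpenAPI spec for your Alertmanager version. build_silence and the creator address are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, now=None):
    """Build a JSON-serializable body for POST /api/v2/silences."""
    if hours <= 0:
        raise ValueError("a silence needs a positive duration")
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Equality matchers only; regex matchers would set isRegex to True.
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


body = build_silence(
    {"alertname": "HighErrorRate", "service": "api"},
    hours=1,
    created_by="oncall@example.com",  # hypothetical creator
    comment="Deploying v1.2",
)
print(json.dumps(body, indent=2))
```

Forcing a positive duration in the helper keeps the end time explicit, which matches the advice above.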

Receivers — getting to Slack and PagerDuty

Slack:

- name: 'slack-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/T.../B.../...'
      channel: '#alerts'
      title: '{{ .GroupLabels.alertname }}'
      text: |
        {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }}
        *Severity:* {{ .Labels.severity }}
        *Service:* {{ .Labels.service }}
        {{ end }}

The {{ ... }} is Go template syntax evaluated over the notification payload. Use it to customize the message format.

PagerDuty:

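- name: 'pagerduty-critical'
  pagerduty_configs:
    # routing_key is the Events API v2 integration key; severity is only sent with v2.
    - routing_key: 'YOUR_INTEGRATION_KEY'
      # Must resolve to critical, error, warning, or info.
      severity: '{{ .CommonLabels.severity }}'
      description: '{{ .CommonAnnotations.summary }}'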

PD then handles escalation, on-call rotations, ack/resolve.

Webhook (custom):

- name: 'custom-webhook'
  webhook_configs:
    - url: 'https://internal-alerter.example.com/hook'
      send_resolved: true

For pushing to your own systems.
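On the receiving side, your service gets a JSON document containing the group's alerts. The following Python sketch shows one way to digest it, assuming the payload carries an alerts list whose entries have status, labels, and annotations. The sample payload is trimmed and hand-written, and summarize_webhook is a hypothetical helper, not part of any library.

```python
def summarize_webhook(payload):
    """Turn an Alertmanager webhook payload into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} service={service}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                service=labels.get("service", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Trimmed, hand-written sample; real payloads carry more fields.
payload = {
    "status": "firing",
    "receiver": "custom-webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "service": "api"},
            "annotations": {"summary": "API error rate > 5%"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "HighLatency", "service": "api"},
            "annotations": {"summary": "API latency back to normal"},
        },
    ],
}

for line in summarize_webhook(payload):
    print(line)
```

With send_resolved: true, the same endpoint receives both firing and resolved alerts, so the handler should branch on each alert's status rather than assume everything is firing.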

Alert rules — best practices

Recap from Prometheus 101:

groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api"}[5m]))
          > 0.05
        for: 2m
        labels:
          severity: warning
          team: api
          service: api  # sum() drops the service label, so set it explicitly for grouping and inhibition
        annotations:
          summary: "API error rate > 5%"
          description: "{{ $value | humanizePercentage }} 5xx for 2 min"
          runbook: "https://wiki.example.com/runbooks/api-error-rate"

Three labels every alert should have:

severity — critical / warning / info
team — for routing
service — for grouping

Three annotations:

summary — short, fits in a Slack title
description — longer, can include $value
runbook — link to actions to take
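These conventions are easy to enforce mechanically before rules ship. Here is a minimal Python sketch of such a check, assuming each rule has already been loaded into a dict (for example from the rules YAML). lint_rule and the sample rule are hypothetical, and the required names simply mirror the lists above.

```python
REQUIRED_LABELS = ("severity", "team", "service")
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook")
ALLOWED_SEVERITIES = ("critical", "warning", "info")


def lint_rule(rule):
    """Return a list of problems found in one alerting rule dict."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    for name in REQUIRED_LABELS:
        if name not in labels:
            problems.append(f"missing label: {name}")
    for name in REQUIRED_ANNOTATIONS:
        if name not in annotations:
            problems.append(f"missing annotation: {name}")
    severity = labels.get("severity")
    if severity is not None and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {severity}")
    return problems


# Hypothetical rule that forgot the service label and the runbook link.
rule = {
    "alert": "DiskAlmostFull",
    "labels": {"severity": "warning", "team": "infra"},
    "annotations": {"summary": "Disk > 90% full", "description": "Disk usage is high"},
}

print(lint_rule(rule))
```

A check like this could run in CI alongside promtool check rules, which validates syntax but knows nothing about your team's label conventions.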

Severity discipline

Critical: page someone. Stop the world. Real user impact.
Warning: notify in Slack. Look at when convenient. Trending bad.
Info: dashboard or low-priority channel. Awareness only.

If everything is critical, nothing is. Calibrate.

Common Pitfalls

Routing all alerts to one channel. Becomes noise; everyone mutes. Route by team / severity.

No grouping. Outage = 50 individual messages. Group.

No inhibition. Cascading failures = cascading notifications. Inhibit hierarchy.

Silences without end times. Alerts vanish forever.

Critical severity for non-critical. Page fatigue.

No runbook link. On-call gets paged, has no idea what to do. Always link the runbook.

Repeat interval too short. Same alert pings every 30s. People mute. 4-hour default is reasonable.

Wrapping Up

Alertmanager: routing tree + grouping + inhibition + silencing + tuned severity = useful alerts. Friday: Loki for log aggregation — the logs pillar.
