Alerting with Prometheus Alertmanager
September 14, 2022 · 4 min read · by Muhammad Amal · programming

TL;DR: Alertmanager receives alerts from Prometheus, groups them by label, deduplicates them, and routes them through a tree of policies to receivers such as Slack and PagerDuty. Inhibition suppresses noise; silencing covers maintenance. Tune for “alert when something is actually wrong,” not for “alert on every blip.”

After dashboards comes the alerting layer. Dashboards explain; Alertmanager notifies. Prometheus evaluates alert rules and fires alerts; Alertmanager turns them into actionable notifications.

Architecture

[Prometheus] alert rules → fire events → [Alertmanager] → route → [Slack/PagerDuty/email]

Alertmanager handles:

  • Grouping (50 alerts about the same service → 1 notification)
  • Deduplication (same alert from redundant Prometheus replicas → 1)
  • Inhibition (one alert suppresses others)
  • Silencing (planned maintenance windows)
  • Routing (which alert goes where)

Config: alertmanager.yml

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-slack'

  routes:
    - matchers: ['severity="critical"']
      receiver: 'pagerduty-critical'
      continue: true
    - matchers: ['team="api"']
      receiver: 'slack-api-team'

receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_KEY'

  - name: 'slack-api-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#api-team'

inhibit_rules:
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: ['service']

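The tree: every alert enters at the top-level route and descends into child routes whose matchers fit its labels. Among siblings, the first match wins and the deepest matching route handles the alert; continue: true keeps evaluating later siblings, so one alert can reach multiple receivers. An alert that matches no child route is handled by the parent's receiver.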
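
A nested variant of the tree can make the traversal concrete. This is only a sketch: it reuses the receiver names from the config above, and the nesting is an assumption for illustration rather than part of the original setup.

```yaml
route:
  receiver: 'default-slack'                # fallback when no child route matches
  routes:
    - matchers: ['team="api"']
      receiver: 'slack-api-team'           # team=api alerts with no deeper match
      routes:
        - matchers: ['severity="critical"']
          receiver: 'pagerduty-critical'   # team=api AND severity=critical
```

With this layout an alert labelled team="api", severity="critical" goes to pagerduty-critical only, a team="api" warning goes to slack-api-team, and everything else lands in default-slack. If amtool is installed, `amtool config routes test --config.file=alertmanager.yml team=api severity=critical` should print the receiver a given label set resolves to.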

Grouping

group_by: ['alertname', 'service']
group_wait: 30s
group_interval: 5m
  • group_by: alerts with identical values for these labels are combined into one notification
  • group_wait: wait 30s after the first alert in a new group fires before sending (lets related alerts join)
  • group_interval: wait 5m before sending a notification about new alerts added to a group that has already been notified

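A network outage causing 50 services to alert: grouped by alertname alone, you get one Slack message listing all 50. With service also in group_by, as in the config above, you get one message per service. With no grouping at all, you get one message per alert.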
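
Grouping can also be overridden per route. The sketch below keeps per-service grouping as the default but collapses an outage-wide alert into a single notification. The alert name NetworkUnreachable is hypothetical.

```yaml
route:
  group_by: ['alertname', 'service']   # default: one notification per alert name and service
  group_wait: 30s
  group_interval: 5m
  receiver: 'default-slack'
  routes:
    - matchers: ['alertname="NetworkUnreachable"']   # hypothetical alert name
      group_by: ['alertname']                        # all affected services share one group
      receiver: 'default-slack'
```

Child routes inherit group_wait and group_interval from the parent unless they set their own values.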

Inhibition

When one alert means others are noise:

inhibit_rules:
  - source_matchers: ['alertname="ServiceDown"']
    target_matchers: ['severity="warning"']
    equal: ['service']

Reads: “If ServiceDown is firing for service X, suppress any severity=warning alerts about the same service.”

If your API is down, you don’t need 14 warning alerts about its slow latency. The down alert subsumes them.

Other useful inhibitions:

  • “Whole cluster down” inhibits “individual node alerts”
  • “Database master down” inhibits “replica lag warnings”
  • “Network partition” inhibits “service unreachable” alerts during planned migrations
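
The first two of these could be written roughly as follows. The alert names (ClusterDown, DatabasePrimaryDown, ReplicaLagHigh), the Node prefix, and the cluster and db_cluster labels are all hypothetical; substitute whatever your rules actually emit.

```yaml
inhibit_rules:
  # Whole cluster down: mute per-node alerts from the same cluster.
  - source_matchers: ['alertname="ClusterDown"']
    target_matchers: ['alertname=~"Node.*"']
    equal: ['cluster']

  # Database primary down: mute replica lag warnings for the same database cluster.
  - source_matchers: ['alertname="DatabasePrimaryDown"']
    target_matchers: ['alertname="ReplicaLagHigh"']
    equal: ['db_cluster']
```

Alertmanager treats a label that is missing on both the source and the target as equal, so it is worth checking that both alerts really carry the labels listed in equal. Otherwise a rule may inhibit more than intended.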

Silencing

Planned maintenance? Silence the relevant alerts.

UI: Alertmanager → Silences → New Silence. Fill in matchers, start/end time, creator name, and a comment.

API: POST to /api/v2/silences.

CLI (amtool needs to know where Alertmanager lives, via --alertmanager.url or its config file):

amtool silence add alertname=HighErrorRate service=api --duration=1h --comment="Deploying v1.2"

Keep the end time tight. Alertmanager requires one, but a silence set to expire months out is how alerts vanish from awareness.
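
For maintenance that recurs on a schedule, a config-based alternative to creating silences by hand is a mute time interval. This is a sketch assuming Alertmanager 0.24 or later. The interval name and the Sunday 02:00 to 04:00 window are made up for illustration, and times are interpreted as UTC by default.

```yaml
time_intervals:
  - name: 'weekly-maintenance'          # hypothetical interval name
    time_intervals:
      - weekdays: ['sunday']
        times:
          - start_time: '02:00'
            end_time: '04:00'

route:
  receiver: 'default-slack'
  routes:
    - matchers: ['service="api"']
      receiver: 'slack-api-team'
      mute_time_intervals: ['weekly-maintenance']   # no notifications during the window
```

Unlike a silence, this lives in version control and repeats every week, so it suits fixed windows better than one-off deploys.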

Receivers — getting to Slack and PagerDuty

Slack:

- name: 'slack-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/T.../B.../...'
      channel: '#alerts'
      title: '{{ .GroupLabels.alertname }}'
      text: |
        {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }}
        *Severity:* {{ .Labels.severity }}
        *Service:* {{ .Labels.service }}
        {{ end }}

The {{ ... }} blocks are Go templates evaluated against the notification data. Use them to customize the message format.
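
A slightly richer variant might look like the sketch below, which adds the group status and alert count to the title and links the runbook. It assumes your alerts carry the runbook annotation used later in this post; send_resolved is an optional extra.

```yaml
- name: 'slack-alerts'
  slack_configs:
    - api_url: 'https://hooks.slack.com/services/T.../B.../...'
      channel: '#alerts'
      send_resolved: true   # also post when the group resolves
      title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} ({{ len .Alerts }} alerts)'
      text: |
        {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }}
        *Runbook:* {{ .Annotations.runbook }}
        {{ end }}
```

.Status is either firing or resolved for the whole group, and toUpper is one of the helper functions Alertmanager adds to its templates.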

PagerDuty:

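- name: 'pagerduty-critical'
  pagerduty_configs:
    - routing_key: 'YOUR_INTEGRATION_KEY'   # Events API v2 integration key
      severity: '{{ .CommonLabels.severity }}'
      description: '{{ .CommonAnnotations.summary }}'

PagerDuty then handles escalation, on-call rotations, and ack/resolve. The older service_key option (the “Prometheus” integration type) still works, but the severity field applies to Events API v2 only.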

Webhook (custom):

- name: 'custom-webhook'
  webhook_configs:
    - url: 'https://internal-alerter.example.com/hook'
      send_resolved: true

For pushing to your own systems.

Alert rules — best practices

Recap from Prometheus 101:

groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api"}[5m]))
          > 0.05
        for: 2m
        labels:
          severity: warning
          team: api
          service: api   # sum() drops the service label, so set it explicitly for grouping and inhibition
        annotations:
          summary: "API error rate > 5%"
          description: "{{ $value | humanizePercentage }} 5xx for 2 min"
          runbook: "https://wiki.example.com/runbooks/api-error-rate"

Three labels every alert should have:

  • severity — critical / warning / info
  • team — for routing
  • service — for grouping

Three annotations:

  • summary — short, fits in a Slack title
  • description — longer, can include $value
  • runbook — link to actions to take
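
As an illustration of the same conventions on a paging alert, here is a sketch of the ServiceDown rule that the inhibition example refers to. The job="api" selector, the 1m hold time, and the runbook URL are assumptions, not values from the original setup.

```yaml
groups:
  - name: api_availability
    rules:
      - alert: ServiceDown
        expr: up{job="api"} == 0        # assumed job name for the API scrape targets
        for: 1m
        labels:
          severity: critical
          team: api
          service: api
        annotations:
          summary: "API target {{ $labels.instance }} is down"
          description: "Prometheus has failed to scrape {{ $labels.instance }} for 1 min"
          runbook: "https://wiki.example.com/runbooks/service-down"   # hypothetical URL
```

Because it carries service: api, this alert can act as the source of the inhibit rule shown earlier and mute warnings for the same service.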

Severity discipline

  • Critical: page someone. Stop the world. Real user impact.
  • Warning: notify in Slack. Look at when convenient. Trending bad.
  • Info: dashboard or low-priority channel. Awareness only.

If everything is critical, nothing is. Calibrate.
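
One possible way to encode these three tiers in the routing tree is sketched below. The slack-low-priority receiver and the repeat_interval values are assumptions, and the receiver would still need to be defined under receivers.

```yaml
route:
  receiver: 'default-slack'
  routes:
    - matchers: ['severity="critical"']
      receiver: 'pagerduty-critical'
      repeat_interval: 1h              # assumed value: re-notify while still firing
    - matchers: ['severity="warning"']
      receiver: 'default-slack'
    - matchers: ['severity="info"']
      receiver: 'slack-low-priority'   # hypothetical receiver
      repeat_interval: 24h             # assumed value: awareness only
```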

Common Pitfalls

Routing all alerts to one channel. Becomes noise; everyone mutes. Route by team / severity.

No grouping. Outage = 50 individual messages. Group.

No inhibition. Cascading failures = cascading notifications. Inhibit hierarchy.

Silences with far-off end times. Alerts stay hidden long after the maintenance is over.

Critical severity for non-critical issues. Pager fatigue.

No runbook link. On-call gets paged, has no idea what to do. Always link the runbook.

Repeat interval too short. Same alert pings every 30s. People mute. The 4-hour default is reasonable.

Wrapping Up

Alertmanager: routing tree + grouping + inhibition + silencing + tuned severity = useful alerts. Friday: Loki for log aggregation — the logs pillar.