Alerting on Sensor Anomalies in IIoT
August 24, 2022 · 5 min read · by Muhammad Amal programming

TL;DR — Three alert classes: threshold (value > X), rate-of-change (delta > X/min), missing data (no reading in N min). Grafana 9 has built-in alerting; Alertmanager routes to Slack/PagerDuty/SMS. Severity matters: factory-fire vs degraded-quality go to different channels.

After Grafana dashboards, the operational concern: catching real problems automatically.

Three alert types

Threshold: value exceeds a static limit. “Press 42 temperature > 90°C.” Simplest. Most common.

Rate of change: value changing too fast. “Pressure dropped > 0.2 bar in 30 seconds.” Catches things thresholds miss (fast leak vs slow drift).

Missing data: no reading for N minutes. “Press 42 hasn’t reported in 5 minutes.” Catches device or network failures.

All three matter; threshold alone is not enough.
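To make the difference concrete, here is a minimal Python sketch of the three checks run against one device's readings. The `Reading` type, the function names, and the sample trace are illustrative assumptions, not part of Grafana or any library.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    ts: float      # seconds since some reference point
    value: float

def check_threshold(readings, limit):
    """Threshold: fire if the latest value exceeds a static limit."""
    return bool(readings) and readings[-1].value > limit

def check_rate_of_change(readings, window_s, max_delta):
    """Rate of change: fire if max - min inside the trailing window exceeds max_delta."""
    if not readings:
        return False
    cutoff = readings[-1].ts - window_s
    recent = [r.value for r in readings if r.ts >= cutoff]
    return max(recent) - min(recent) > max_delta

def check_missing(readings, now, max_silence_s):
    """Missing data: fire if no reading arrived within max_silence_s."""
    return not readings or now - readings[-1].ts > max_silence_s

# Hypothetical pressure trace (bar): in range, then a fast drop, then silence.
trace = [Reading(0, 6.0), Reading(10, 6.0), Reading(20, 5.9), Reading(30, 5.6)]
print(check_threshold(trace, limit=8.0))                        # False
print(check_rate_of_change(trace, window_s=30, max_delta=0.2))  # True
print(check_missing(trace, now=400, max_silence_s=300))         # True
```

The threshold check stays quiet on this trace because the pressure never leaves its allowed range. Only the other two checks notice the leak and the silence that follows.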

Grafana alerting basics

Grafana 9 has “unified alerting” — alerts attached to dashboards or standalone, evaluated server-side, routed via contact points.

Steps:

  1. Define an alert rule (a query that produces a value)
  2. Define conditions (when to fire)
  3. Define a contact point (email, Slack, PagerDuty)
  4. Define a notification policy (which alerts route where)

Threshold alert example

In Grafana, Alert rules → New alert rule:

Query:

SELECT
  $__timeGroupAlias(ts, '1m'),
  device_id,
  max(value) AS "max_temp"
FROM sensor_readings
WHERE metric = 'temperature' AND $__timeFilter(ts)
GROUP BY 1, device_id

Conditions:

  • max_temp IS ABOVE 85 → Critical
  • max_temp IS ABOVE 75 → Warning

For: 2 minutes (alert only if condition persists)

Labels:

  • severity: critical (for the higher threshold)
  • team: maintenance

Routes to PagerDuty for critical, Slack for warning.
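The "For" setting is easiest to understand as a small state machine. The sketch below is a simplified model of that idea, not Grafana's actual implementation; `PendingGate` and the sample temperatures are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingGate:
    """Tracks one alert instance: Normal -> Pending -> Firing."""
    for_seconds: float
    breach_started: Optional[float] = None

    def evaluate(self, now: float, breached: bool) -> str:
        if not breached:
            self.breach_started = None   # condition cleared: reset the timer
            return "Normal"
        if self.breach_started is None:
            self.breach_started = now    # first evaluation in breach
        if now - self.breach_started >= self.for_seconds:
            return "Firing"
        return "Pending"

gate = PendingGate(for_seconds=120)
# (evaluation time in seconds, max_temp), evaluated once a minute
samples = [(0, 70), (60, 88), (120, 74), (180, 86), (240, 87), (300, 89)]
states = [gate.evaluate(t, temp > 85) for t, temp in samples]
print(states)
# ['Normal', 'Pending', 'Normal', 'Pending', 'Pending', 'Firing']
```

The one-minute blip at t=60 never leaves Pending, so nobody gets paged for it. Only the breach that holds for the full two minutes reaches Firing.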

Rate-of-change alert

SELECT
  $__timeGroupAlias(ts, '30s'),
  device_id,
  metric,
  max(value) - min(value) AS "delta"
FROM sensor_readings
WHERE metric = 'pressure' AND $__timeFilter(ts)
GROUP BY 1, device_id, metric

Conditions: delta IS ABOVE 0.5 → alarm

Per-window delta (max minus min), so a fast drop or a fast rise triggers in the window where it happens.

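For trend changes over longer windows, widen the window and measure the total swing (add LAG if you need per-reading deltas):

-- '$device' is a dashboard template variable. Grafana alert rules do not
-- interpolate these, so use a literal device ID (or GROUP BY device_id) there.
WITH pressure_15min AS (
  SELECT ts, value FROM sensor_readings
  WHERE device_id = '$device' AND metric = 'pressure'
    AND ts > now() - interval '15 minutes'
)
SELECT
  max(value) - min(value) AS swing
FROM pressure_15min;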
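For intuition about what LAG adds over a plain swing, here is a small Python sketch. It mimics `value - LAG(value) OVER (ORDER BY ts)` on an in-memory list; the pressure values are invented for illustration.

```python
def lagged_deltas(values):
    """Mimic SQL LAG(): pair each value with its predecessor and take the difference."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

pressure = [6.0, 6.0, 5.5, 5.0, 5.0]   # bar, oldest first (illustrative data)

deltas = lagged_deltas(pressure)
swing = max(pressure) - min(pressure)
largest_drop = min(deltas)

print(deltas)         # [0.0, -0.5, -0.5, 0.0]
print(swing)          # 1.0
print(largest_drop)   # -0.5
```

The swing says how far the value moved in the window. The signed deltas say in which direction and how quickly, which is what distinguishes a leak from a pressure build-up.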

Missing-data alert

SELECT
  device_id,
  EXTRACT(EPOCH FROM (now() - max(ts))) AS seconds_since_last
FROM sensor_readings
WHERE ts > now() - interval '10 minutes'
GROUP BY device_id
HAVING max(ts) < now() - interval '2 minutes'

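Returns rows for devices whose last reading is 2+ minutes old but still inside the 10-minute lookback. Each row triggers an alert. A device silent for longer than the lookback returns no row at all, so widen the WHERE window (or join against a device list) if long outages must keep alerting.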

Tune the threshold to your reporting rate. For devices reporting at 1 Hz, 2 minutes of silence means ~120 missed readings, which is definitely a problem.
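The arithmetic behind that tuning fits in two helper functions. This is a sketch with hypothetical names; the number of misses you tolerate is your own judgment call.

```python
def missed_readings(silence_s: float, report_interval_s: float) -> int:
    """How many expected readings fit into a silence of silence_s seconds."""
    return int(silence_s // report_interval_s)

def silence_threshold_s(report_interval_s: float, tolerated_misses: int) -> float:
    """Silence threshold that tolerates a given number of missed readings."""
    return report_interval_s * tolerated_misses

print(missed_readings(120, 1))      # 120  (1 Hz device, 2-minute threshold)
print(missed_readings(120, 60))     # 2    (one reading per minute)
print(silence_threshold_s(60, 5))   # 300  (five misses at one reading per minute)
```

The same 2-minute threshold is a clear failure for a 1 Hz device and only two missed readings for a once-a-minute device, so a slow reporter needs a longer silence window.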

Contact points

Grafana 9 supports many: email, Slack, PagerDuty, Webhook, Telegram, Microsoft Teams, OpsGenie, etc.

For our setup:

  • Slack #alerts-iot-warning — warning level
  • Slack #alerts-iot-critical + PagerDuty — critical level
  • Email backup — for ops manager when on PTO

Critical alerts page on-call via PagerDuty. Warnings stay in Slack — visible but don’t wake people.

Routing policies

Notification policies route by labels. Example:

  • severity=critical AND team=maintenance → PagerDuty + Slack critical
  • severity=warning → Slack warning
  • team=quality → Slack quality channel
  • Default → Slack default

Labels are set on alert rules. The routing matches first-fit; specific policies before general ones.
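First-fit matching is simple enough to model in a few lines. The sketch below uses the example policies above; the contact point names are placeholders and the data structure is an assumption, not Grafana's configuration format.

```python
# Ordered list of (label matchers, contact points); the first match wins.
POLICIES = [
    ({"severity": "critical", "team": "maintenance"}, ["pagerduty", "slack-critical"]),
    ({"severity": "warning"}, ["slack-warning"]),
    ({"team": "quality"}, ["slack-quality"]),
]
DEFAULT = ["slack-default"]

def route(labels):
    """Return the contact points of the first policy whose matchers all match."""
    for matchers, contact_points in POLICIES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return contact_points
    return DEFAULT

print(route({"severity": "critical", "team": "maintenance"}))  # ['pagerduty', 'slack-critical']
print(route({"severity": "warning", "team": "quality"}))       # ['slack-warning']
print(route({"severity": "info"}))                             # ['slack-default']
```

The second call shows why order matters: a quality-team warning lands in the warning channel because that policy comes first. If it should go to the quality channel, move the `team=quality` policy higher.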

Alertmanager (alternative)

For more sophisticated routing, deduplication, grouping, and silencing, Alertmanager (from the Prometheus stack) is better than Grafana’s built-in routing.

Grafana exports alerts to Alertmanager; Alertmanager handles the rest:

  • Grouping: bundle 50 alerts of the same kind into one notification
  • Silencing: shut up alerts during planned maintenance
  • Inhibition: if “site down” fires, suppress per-device alerts

For factories with 100+ devices, Alertmanager grouping is essential. Without it, every offline device pages on a network outage.
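A toy model shows what grouping and inhibition do to a network outage. This is plain Python, not Alertmanager configuration; the alert names (`SiteDown`, `DeviceOffline`) and site labels are hypothetical.

```python
from collections import defaultdict

def inhibit(alerts):
    """Drop per-device alerts for any site that also has a firing 'SiteDown' alert."""
    down_sites = {a["site"] for a in alerts if a["alertname"] == "SiteDown"}
    return [a for a in alerts
            if a["alertname"] == "SiteDown" or a["site"] not in down_sites]

def group(alerts, keys=("alertname", "site")):
    """Bundle alerts sharing the same grouping labels into one notification each."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a[k] for k in keys)].append(a)
    return dict(groups)

# 50 devices go offline at plant-a during a site outage; one unrelated device at plant-b.
alerts = [{"alertname": "DeviceOffline", "site": "plant-a", "device": f"press-{i}"}
          for i in range(50)]
alerts.append({"alertname": "SiteDown", "site": "plant-a"})
alerts.append({"alertname": "DeviceOffline", "site": "plant-b", "device": "press-7"})

print(len(alerts))                  # 52 raw alerts
print(len(group(alerts)))           # 3 notifications with grouping alone
print(len(group(inhibit(alerts))))  # 2 notifications with inhibition + grouping
```

Grouping alone turns 52 pages into 3. Inhibition removes the redundant plant-a device group as well, leaving the site outage and the unrelated plant-b device.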

Avoiding alert fatigue

The #1 mistake: too many alerts. Within a week, the team mutes everything.

Rules I follow:

Tier 1: paging alerts. Wake someone up. Threshold: actual safety issue or production stoppage. Examples: pressure spike on hydraulic press, line stopped mid-shift, fire suppression triggered. ~5-10 alerts per year per critical line if tuned well.

Tier 2: warning alerts. Slack, no pager. Threshold: degraded but not stopped. Examples: temperature high but not critical, pressure variance high, one machine offline (out of N). ~10-20 per week.

Tier 3: info events. Slack low-priority or dashboard only. Threshold: notable but not actionable. Examples: shift change, weekly status.

The “for: 5 minutes” gate before firing is critical. Most blips don’t last 5 minutes. Eliminates flapping.

Acknowledging and resolving

Two patterns:

  • Auto-resolve when condition clears. “Pressure dropped back to normal” → alert resolves automatically. Default Grafana behavior.
  • Manual acknowledgment. Some alerts (e.g., “safety incident”) need human ack even after the condition clears. Use an external incident-management tool (PagerDuty, Opsgenie) with ack flows.

For most IIoT, auto-resolve is fine. The state of the system tells you everything.

Common Pitfalls

Alert on every threshold cross. Flapping. Use a for: 2 minutes gate.

Per-device alerts on shared causes. Network outage = 50 device alerts. Use Alertmanager inhibition.

Critical thresholds set too low. “Page on temperature > 70” — fires hourly. Set thresholds based on actual danger, not nominal operating range.

No escalation path. Primary on-call doesn’t ack within 15 minutes — what happens? Define escalation in PagerDuty.

Alerts that don’t route to anyone. “We’ll set up routing later.” Means alerts vanish.

Silencing without expiration. Maintenance window ends; silence is forever. Always set expire time.
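One way to make the last pitfall hard to hit is to model a silence so that it cannot exist without an end time. The sketch below is an illustrative data structure, not the Alertmanager or Grafana API; `Silence` and the `press-42` matcher are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Silence:
    matchers: dict
    starts_at: datetime
    ends_at: datetime   # required field: no open-ended silences

    def mutes(self, labels: dict, now: datetime) -> bool:
        """True if the silence is active at `now` and every matcher matches."""
        in_window = self.starts_at <= now < self.ends_at
        return in_window and all(labels.get(k) == v for k, v in self.matchers.items())

start = datetime(2022, 8, 24, 6, 0, tzinfo=timezone.utc)
maintenance = Silence({"device_id": "press-42"}, start, start + timedelta(hours=2))

alert = {"device_id": "press-42", "severity": "warning"}
print(maintenance.mutes(alert, start + timedelta(hours=1)))   # True  (inside the window)
print(maintenance.mutes(alert, start + timedelta(hours=3)))   # False (silence expired)
```

Once the two-hour window ends, alerts for that device flow again without anyone having to remember to lift the silence.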

Wrapping Up

Three alert types, three tiers of severity, routing by labels, escalation via PagerDuty. Alert fatigue is the enemy; tune thresholds aggressively. Friday: securing MQTT.