Alerting on Sensor Anomalies in IIoT
TL;DR — Three alert classes: threshold (value > X), rate-of-change (delta > X/min), missing data (no reading in N min). Grafana 9 has built-in alerting; Alertmanager routes to Slack/PagerDuty/SMS. Severity matters: factory-fire vs degraded-quality go to different channels.
After Grafana dashboards, the operational concern: catching real problems automatically.
Three alert types
Threshold: value exceeds a static limit. “Press 42 temperature > 90°C.” Simplest. Most common.
Rate of change: value changing too fast. “Pressure dropped > 0.2 bar in 30 seconds.” Catches things thresholds miss (fast leak vs slow drift).
Missing data: no reading for N minutes. “Press 42 hasn’t reported in 5 minutes.” Catches device or network failures.
All three matter; threshold alone is not enough.
Grafana alerting basics
Grafana 9 has “unified alerting” — alerts attached to dashboards or standalone, evaluated server-side, routed via contact points.
Steps:
- Define an alert rule (a query that produces a value)
- Define conditions (when to fire)
- Define a contact point (email, Slack, PagerDuty)
- Define a notification policy (which alerts route where)
Threshold alert example
In Grafana, Alert rules → New alert rule:
Query:
SELECT
$__timeGroupAlias(ts, '1m'),
device_id,
max(value) AS "max_temp"
FROM sensor_readings
WHERE metric = 'temperature' AND $__timeFilter(ts)
GROUP BY 1, device_id
Conditions:
- max_temp IS ABOVE 85 → Critical
- max_temp IS ABOVE 75 → Warning
For: 2 minutes (alert only if condition persists)
Labels:
- severity: critical (for the higher threshold)
- team: maintenance
Routes to PagerDuty for critical, Slack for warning.
Rate-of-change alert
SELECT
$__timeGroupAlias(ts, '30s'),
device_id,
metric,
max(value) - min(value) AS "delta"
FROM sensor_readings
WHERE metric = 'pressure' AND $__timeFilter(ts)
GROUP BY 1, device_id, metric
Conditions: delta IS ABOVE 0.5 → alarm
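Per-window delta: max minus min within each 30-second bucket. A fast swing in either direction trips the condition as soon as its window is evaluated.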
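For trend changes over longer windows, measure the swing across the whole window (here, the last 15 minutes for one device):
-- One device's pressure readings from the last 15 minutes
WITH pressure_15min AS (
SELECT ts, value FROM sensor_readings
WHERE device_id = '$device' AND metric = 'pressure'
AND ts > now() - interval '15 minutes'
)
-- Swing = highest minus lowest reading in that window
SELECT
max(value) - min(value) AS swing
FROM pressure_15min;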
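A max-minus-min swing does not say whether pressure rose or fell. Comparing each reading with the one before it (the job SQL's LAG window function does) keeps the sign. The following is a minimal Python sketch of that idea, assuming readings arrive as time-sorted (timestamp in seconds, pressure in bar) pairs; the function names and the 0.2 bar per 30 s limit are illustrative, not part of the setup described here.

```python
from typing import List, Tuple

Reading = Tuple[float, float]  # (timestamp in seconds, pressure in bar)


def signed_rates(readings: List[Reading]) -> List[Tuple[float, float]]:
    """Return (timestamp, bar_per_second) for each reading vs. its predecessor."""
    rates = []
    for (prev_ts, prev_val), (ts, val) in zip(readings, readings[1:]):
        dt = ts - prev_ts
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((ts, (val - prev_val) / dt))
    return rates


def fast_drops(readings: List[Reading], max_drop_per_s: float) -> List[float]:
    """Timestamps where pressure fell faster than max_drop_per_s (bar/s)."""
    return [ts for ts, rate in signed_rates(readings) if rate < -max_drop_per_s]


# Hypothetical readings: slow drift, then a 0.38 bar drop within 10 seconds.
readings = [(0, 6.0), (10, 5.98), (20, 5.6), (30, 5.59)]
limit = 0.2 / 30  # "more than 0.2 bar in 30 seconds", expressed per second
print(fast_drops(readings, limit))
```

Because the rate is signed, a fast rise in pressure is not flagged by `fast_drops`; a separate check with the opposite sign would be needed for that case.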
Missing-data alert
SELECT
device_id,
EXTRACT(EPOCH FROM (now() - max(ts))) AS seconds_since_last
FROM sensor_readings
WHERE ts > now() - interval '10 minutes'
GROUP BY device_id
HAVING max(ts) < now() - interval '2 minutes'
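Returns one row per device whose last reading is between 2 and 10 minutes old. Each row triggers an alert. A device silent for longer than the 10-minute lookback returns no row at all, so widen the WHERE window or compare against a device list if long outages must keep alerting.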
Tune the threshold to your reporting rate. 2 minutes for devices reporting at 1 Hz = ~120 missed readings, definitely a problem.
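A query over sensor_readings alone can only report devices that appear in the lookback window; a device that has never reported, or has been silent longer than the window, produces no row. One way around that is to compare against a list of devices that should be reporting. The sketch below assumes such an inventory exists (the `expected` list and `last_seen` mapping are hypothetical names, not part of the schema above).

```python
from typing import Dict, Iterable, List


def stale_devices(
    expected: Iterable[str],
    last_seen: Dict[str, float],
    now: float,
    max_age_s: float,
) -> List[str]:
    """Return expected devices with no reading in the last max_age_s seconds.

    Devices missing from last_seen (never reported) count as stale too.
    """
    stale = []
    for device_id in expected:
        ts = last_seen.get(device_id)
        if ts is None or now - ts > max_age_s:
            stale.append(device_id)
    return sorted(stale)


# Hypothetical data: press-42 last reported 200 s ago, press-43 never did.
expected = ["press-41", "press-42", "press-43"]
last_seen = {"press-41": 990.0, "press-42": 800.0}
print(stale_devices(expected, last_seen, now=1000.0, max_age_s=120.0))
```

The same comparison can be done in SQL with a LEFT JOIN from a device table to the latest reading per device, if such a table is available.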
Contact points
Grafana 9 supports many: email, Slack, PagerDuty, webhook, Telegram, Microsoft Teams, Opsgenie, etc.
For our setup:
- Slack #alerts-iot-warning — warning level
- Slack #alerts-iot-critical + PagerDuty — critical level
- Email backup — for ops manager when on PTO
Critical alerts page on-call via PagerDuty. Warnings stay in Slack — visible but don’t wake people.
Routing policies
Notification policies route by labels. Example:
- severity=critical AND team=maintenance → PagerDuty + Slack critical
- severity=warning → Slack warning
- team=quality → Slack quality channel
- Default → Slack default
Labels are set on alert rules. Routing is first-match, so list specific policies before general ones.
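To see why order matters, here is a small Python model of first-match routing over the example policies above. It is a sketch of the matching idea only, not Grafana's policy engine, and the contact point names are placeholders.

```python
from typing import Dict, List, Tuple

Policy = Tuple[Dict[str, str], List[str]]  # (label matchers, contact points)

# Ordered from most specific to most general; names are hypothetical.
POLICIES: List[Policy] = [
    ({"severity": "critical", "team": "maintenance"}, ["pagerduty", "slack-critical"]),
    ({"severity": "warning"}, ["slack-warning"]),
    ({"team": "quality"}, ["slack-quality"]),
]
DEFAULT: List[str] = ["slack-default"]


def route(labels: Dict[str, str]) -> List[str]:
    """Return the contact points of the first policy whose matchers all match."""
    for matchers, contact_points in POLICIES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return contact_points
    return DEFAULT


print(route({"severity": "critical", "team": "maintenance"}))
print(route({"severity": "warning", "team": "quality"}))
print(route({"severity": "critical", "team": "quality"}))
```

Under this model a warning from the quality team stops at the severity=warning policy and never reaches the quality channel, and a critical quality alert lands in the quality channel without paging anyone. Walking a few label combinations through the list like this is a cheap way to spot such gaps before they matter.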
Alertmanager (alternative)
For more sophisticated routing, deduplication, grouping, silencing — Alertmanager (Prometheus stack) is better than Grafana’s built-in.
Grafana exports alerts to Alertmanager; Alertmanager handles the rest:
- Grouping: bundle 50 alerts of the same kind into one notification
- Silencing: shut up alerts during planned maintenance
- Inhibition: if “site down” fires, suppress per-device alerts
For factories with 100+ devices, Alertmanager grouping is essential. Without it, every offline device pages on a network outage.
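The effect of grouping and inhibition on a network outage can be shown with a toy model. The Python below is not Alertmanager's implementation or configuration format; the alert names (`site_down`, `device_offline`) and the `site` label are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]


def inhibit(alerts: List[Alert], source: str = "site_down", equal: str = "site") -> List[Alert]:
    """Drop alerts that share the `equal` label with a firing `source` alert."""
    down = {a.get(equal) for a in alerts if a.get("alertname") == source}
    return [
        a for a in alerts
        if a.get("alertname") == source or a.get(equal) not in down
    ]


def group(alerts: List[Alert], by: Tuple[str, ...] = ("alertname", "site")) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bundle alerts with identical `by` labels; one notification per bundle."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for a in alerts:
        groups[tuple(a.get(k, "") for k in by)].append(a)
    return dict(groups)


# Hypothetical outage: 50 devices offline at plant-a, plus a site-level alert.
alerts = [
    {"alertname": "device_offline", "site": "plant-a", "device_id": f"press-{i}"}
    for i in range(50)
]
alerts.append({"alertname": "site_down", "site": "plant-a"})

print(len(alerts), "raw alerts")
print(len(group(alerts)), "notifications after grouping")
print(len(group(inhibit(alerts))), "notifications after inhibition and grouping")
```

Grouping alone turns 51 alerts into two notifications; adding inhibition leaves only the site-level one, which is the alert that actually describes the problem.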
Avoiding alert fatigue
The #1 mistake: too many alerts. Within a week, the team mutes everything.
Rules I follow:
Tier 1: paging alerts. Wake someone up. Threshold: actual safety issue or production stoppage. Examples: pressure spike on hydraulic press, line stopped mid-shift, fire suppression triggered. ~5-10 alerts per year per critical line if tuned well.
Tier 2: warning alerts. Slack, no pager. Threshold: degraded but not stopped. Examples: temperature high but not critical, pressure variance high, one machine offline (out of N). ~10-20 per week.
Tier 3: info events. Slack low-priority or dashboard only. Threshold: notable but not actionable. Examples: shift change, weekly status.
The “for: 5 minutes” gate before firing is critical. Most blips don’t last 5 minutes. Eliminates flapping.
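The gate works as a small state machine: a breached condition starts a pending timer, any clear sample resets it, and the alert fires only once the breach has lasted the full duration. The Python sketch below is a simplified model of that behavior, assuming one evaluation per sample; it is not Grafana's evaluator, and the function name is illustrative.

```python
from typing import List, Optional, Tuple


def firing_times(samples: List[Tuple[float, bool]], for_s: float) -> List[float]:
    """Return timestamps at which the alert moves from pending to firing.

    samples: time-sorted (timestamp in seconds, condition breached?) pairs.
    for_s:   how long the breach must persist before the alert fires.
    """
    fired: List[float] = []
    pending_since: Optional[float] = None
    firing = False
    for ts, breached in samples:
        if not breached:
            # Condition cleared: reset the timer and resolve the alert.
            pending_since = None
            firing = False
            continue
        if pending_since is None:
            pending_since = ts
        if not firing and ts - pending_since >= for_s:
            firing = True
            fired.append(ts)
    return fired


# Hypothetical one-minute evaluations: a single blip, then a sustained breach.
samples = [(0, True), (60, False)] + [(t, True) for t in range(120, 541, 60)]
print(firing_times(samples, for_s=300))
```

The blip at t=0 never fires because it clears before five minutes pass; the sustained breach starting at t=120 fires once, five minutes later.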
Acknowledging and resolving
Two patterns:
- Auto-resolve when condition clears. “Pressure dropped back to normal” → alert resolves automatically. Default Grafana behavior.
- Manual acknowledgment. Some alerts (e.g., “safety incident”) need human ack even after the condition clears. Use an external incident tool (PagerDuty, Opsgenie) with ack flows.
For most IIoT, auto-resolve is fine. The state of the system tells you everything.
Common Pitfalls
Alert on every threshold cross. Flapping. Use a for: 2 minutes gate.
Per-device alerts on shared causes. Network outage = 50 device alerts. Use Alertmanager inhibition.
Critical thresholds set too low. “Page on temperature > 70” — fires hourly. Set thresholds based on actual danger, not nominal operating range.
No escalation path. Primary on-call doesn’t ack within 15 minutes — what happens? Define escalation in PagerDuty.
Alerts that don’t route to anyone. “We’ll set up routing later.” Means alerts vanish.
Silencing without expiration. Maintenance window ends; silence is forever. Always set an expiry time.
Wrapping Up
Three alert types, three tiers of severity, routing by labels, escalation via PagerDuty. Alert fatigue is the enemy; tune thresholds aggressively. Friday: securing MQTT.