Alerting on Sensor Anomalies in IIoT

August 24, 2022 · 5 min read · by Muhammad Amal programming

TL;DR — Three alert classes: threshold (value > X), rate-of-change (delta > X/min), missing data (no reading in N min). Grafana 9 has built-in alerting; Alertmanager routes to Slack/PagerDuty/SMS. Severity matters: factory-fire vs degraded-quality go to different channels.

After Grafana dashboards, the next operational concern is catching real problems automatically.

Three alert types

Threshold: value exceeds a static limit. “Press 42 temperature > 90°C.” Simplest. Most common.

Rate of change: value changing too fast. “Pressure dropped > 0.2 bar in 30 seconds.” Catches things thresholds miss (fast leak vs slow drift).

Missing data: no reading for N minutes. “Press 42 hasn’t reported in 5 minutes.” Catches device or network failures.

All three matter; threshold alone is not enough.

Grafana alerting basics

Grafana 9 has “unified alerting” — alerts attached to dashboards or standalone, evaluated server-side, routed via contact points.

Steps:

Define an alert rule (a query that produces a value)
Define conditions (when to fire)
Define a contact point (email, Slack, PagerDuty)
Define a notification policy (which alerts route where)

Threshold alert example

In Grafana, Alert rules → New alert rule:

Query:

SELECT
  $__timeGroupAlias(ts, '1m'),
  device_id,
  max(value) AS "max_temp"
FROM sensor_readings
WHERE metric = 'temperature' AND $__timeFilter(ts)
GROUP BY 1, device_id

Conditions:

max_temp IS ABOVE 85 → Critical
max_temp IS ABOVE 75 → Warning

For: 2 minutes (alert only if condition persists)

Labels:

severity: critical (for the higher threshold)
team: maintenance

Routes to PagerDuty for critical, Slack for warning.
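To make the “For: 2 minutes” behaviour concrete, here is a minimal Python sketch of a duration gate. It is a simplified model, not Grafana’s implementation: the `ForGate` class and the `normal` / `pending` / `firing` state names are assumptions chosen for illustration, and the sample readings are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForGate:
    """Tracks how long a threshold condition has been continuously true."""
    threshold: float
    for_seconds: float
    breach_started: Optional[float] = None  # timestamp when the current breach began

    def update(self, ts: float, value: float) -> str:
        """Return 'normal', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition cleared: reset the timer.
            self.breach_started = None
            return "normal"
        if self.breach_started is None:
            self.breach_started = ts
        if ts - self.breach_started >= self.for_seconds:
            return "firing"
        return "pending"


# Hypothetical readings: (seconds, max_temp in °C), evaluated once a minute.
gate = ForGate(threshold=85.0, for_seconds=120.0)
samples = [(0, 80.0), (60, 88.0), (120, 90.0), (180, 91.0), (240, 70.0)]
states = [gate.update(ts, value) for ts, value in samples]
print(states)
```

The reading at 60 s breaches the limit but only moves the rule to pending; it fires at 180 s, once the breach has lasted the full two minutes, and returns to normal as soon as the value drops.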

Rate-of-change alert

SELECT
  $__timeGroupAlias(ts, '30s'),
  device_id,
  metric,
  max(value) - min(value) AS "delta"
FROM sensor_readings
WHERE metric = 'pressure' AND $__timeFilter(ts)
GROUP BY 1, device_id, metric

Conditions: delta IS ABOVE 0.5 → alarm

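This is a per-window delta (max minus min within each 30-second bucket), so a fast drop triggers in the window where it happens, well before an absolute threshold would be crossed. It fires on fast rises too, since the delta ignores direction.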

For trend changes over longer windows, measure the swing over a wider lookback:

-- Swing (max minus min) for one device over the last 15 minutes.
WITH pressure_15min AS (
  SELECT ts, value FROM sensor_readings
  WHERE device_id = '$device' AND metric = 'pressure'
    AND ts > now() - interval '15 minutes'
)
SELECT
  max(value) - min(value) AS swing
FROM pressure_15min;
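The per-window delta can be mirrored in a few lines of Python, which is handy for replaying recorded data against a candidate threshold before touching a live rule. This is a sketch under assumptions: `window_deltas` is a hypothetical helper, the readings are invented, and timestamps are plain seconds.

```python
from collections import defaultdict


def window_deltas(readings, window_seconds=30):
    """Bucket (ts, value) readings into fixed windows; return {window_start: max - min}."""
    buckets = defaultdict(list)
    for ts, value in readings:
        window_start = int(ts // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: max(vals) - min(vals) for start, vals in sorted(buckets.items())}


# Hypothetical pressure readings in bar: steady, then a fast drop after t=30.
readings = [(0, 6.0), (10, 6.0), (20, 5.9), (30, 5.9), (40, 5.2), (50, 5.1)]
deltas = window_deltas(readings)

# Same condition as the rule: delta IS ABOVE 0.5.
alarms = [start for start, delta in deltas.items() if delta > 0.5]
print(alarms)
```

Only the window starting at 30 s exceeds the limit. One caveat of fixed buckets: a drop that straddles a window boundary is split across two windows and may stay under the limit in both, which is a good reason to pair this check with a wider-window swing.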

Missing-data alert

SELECT
  device_id,
  EXTRACT(EPOCH FROM (now() - max(ts))) AS seconds_since_last
FROM sensor_readings
WHERE ts > now() - interval '10 minutes'
GROUP BY device_id
HAVING max(ts) < now() - interval '2 minutes'

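Returns one row per device whose last reading is between 2 and 10 minutes old, and each row triggers an alert. A device silent for longer than the 10-minute lookback in the WHERE clause drops out of the result entirely, so size the lookback to cover the longest outage you still want to alert on.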

Tune the threshold to your reporting rate. 2 minutes for devices reporting at 1 Hz = ~120 missed readings, definitely a problem.
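A query over sensor_readings can only report devices that have written at least one row. The sketch below shows the same staleness check in Python against an explicit list of expected devices, so a device that never reported is flagged as well. The `stale_devices` helper, the expected-device list, and the device names are assumptions for illustration; they are not part of the setup described here.

```python
def stale_devices(last_seen, expected, now, max_age_seconds=120):
    """Return {device_id: seconds since last reading} for stale devices.

    A value of None means the device has never reported.
    """
    stale = {}
    for device_id in expected:
        ts = last_seen.get(device_id)
        if ts is None:
            stale[device_id] = None
        elif now - ts > max_age_seconds:
            stale[device_id] = now - ts
    return stale


# Hypothetical state: timestamps are seconds on an arbitrary clock.
now = 1_000.0
last_seen = {"press-41": 990.0, "press-42": 700.0}
expected = ["press-41", "press-42", "press-43"]

result = stale_devices(last_seen, expected, now)
print(result)
```

Here press-42 is flagged as 300 seconds stale and press-43 as never seen, while press-41 is healthy.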

Contact points

Grafana 9 supports many: email, Slack, PagerDuty, Webhook, Telegram, Microsoft Teams, Opsgenie, etc.

For our setup:

Slack #alerts-iot-warning — warning level
Slack #alerts-iot-critical + PagerDuty — critical level
Email backup — for ops manager when on PTO

Critical alerts page on-call via PagerDuty. Warnings stay in Slack — visible but don’t wake people.

Routing policies

Notification policies route by labels. Example:

severity=critical AND team=maintenance → PagerDuty + Slack critical
severity=warning → Slack warning
team=quality → Slack quality channel
Default → Slack default

Labels are set on alert rules. Routing is first match wins, so list specific policies before general ones.
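First-match routing is easy to get subtly wrong, so it helps to model the example policies and run a few label sets through them. The following is a toy model, not Grafana’s policy engine: the `route` function and the receiver names are hypothetical.

```python
def route(labels, policies, default):
    """Return the receivers of the first policy whose matchers all match the labels."""
    for matchers, receivers in policies:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receivers
    return default


# The example policies, in order: most specific first.
POLICIES = [
    ({"severity": "critical", "team": "maintenance"}, ["pagerduty", "slack-critical"]),
    ({"severity": "warning"}, ["slack-warning"]),
    ({"team": "quality"}, ["slack-quality"]),
]
DEFAULT = ["slack-default"]

a = route({"severity": "critical", "team": "maintenance"}, POLICIES, DEFAULT)
b = route({"severity": "warning", "team": "quality"}, POLICIES, DEFAULT)
c = route({"severity": "critical", "team": "quality"}, POLICIES, DEFAULT)
d = route({"severity": "info"}, POLICIES, DEFAULT)
print(a, b, c, d)
```

Two results are worth noticing. A quality-team warning lands in the warning channel, not the quality channel, because the severity policy comes first. A critical alert from the quality team never reaches PagerDuty, because only the maintenance policy pages. Whether that is intended is exactly the kind of question to settle before the first real incident.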

Alertmanager (alternative)

For more sophisticated routing, deduplication, grouping, and silencing, a standalone Alertmanager (from the Prometheus stack) is a better fit than Grafana’s built-in routing.

Grafana forwards alerts to Alertmanager; Alertmanager handles the rest:

Grouping: bundle 50 alerts of the same kind into one notification
Silencing: shut up alerts during planned maintenance
Inhibition: if “site down” fires, suppress per-device alerts

For factories with 100+ devices, Alertmanager grouping is essential. Without it, every offline device pages on a network outage.
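To see why grouping and inhibition matter during a network outage, here is a small Python model of both behaviours. It illustrates the concepts only and is not Alertmanager’s configuration or code: the `inhibit` and `group` functions, the alert names, and the `site` label are all assumptions.

```python
from collections import defaultdict


def inhibit(alerts):
    """Drop per-device alerts for any site that also has a 'site_down' alert firing."""
    down_sites = {a["site"] for a in alerts if a["alertname"] == "site_down"}
    return [
        a for a in alerts
        if a["alertname"] == "site_down" or a["site"] not in down_sites
    ]


def group(alerts, keys=("alertname", "site")):
    """Bundle alerts that share the same values for `keys` into one notification each."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a[k] for k in keys)].append(a)
    return dict(groups)


# Hypothetical outage: 50 devices at plant-a go offline, plus one unrelated device.
alerts = [
    {"alertname": "device_offline", "site": "plant-a", "device": f"press-{i}"}
    for i in range(50)
]
alerts.append({"alertname": "site_down", "site": "plant-a"})
alerts.append({"alertname": "device_offline", "site": "plant-b", "device": "press-7"})

grouped_only = group(alerts)
notifications = group(inhibit(alerts))
print(len(alerts), len(grouped_only), len(notifications))
```

Grouping alone turns 52 alerts into 3 notifications, one of which carries all 50 offline devices. Adding inhibition removes that bundle entirely, leaving the site-down notification and the unrelated plant-b device.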

Avoiding alert fatigue

The #1 mistake: too many alerts. Within a week, the team mutes everything.

Rules I follow:

Tier 1: paging alerts. Wake someone up. Threshold: actual safety issue or production stoppage. Examples: pressure spike on hydraulic press, line stopped mid-shift, fire suppression triggered. ~5-10 alerts per year per critical line if tuned well.

Tier 2: warning alerts. Slack, no pager. Threshold: degraded but not stopped. Examples: temperature high but not critical, pressure variance high, one machine offline (out of N). ~10-20 per week.

Tier 3: info events. Slack low-priority or dashboard only. Threshold: notable but not actionable. Examples: shift change, weekly status.

The “for: 5 minutes” gate before firing is critical: most blips don’t last 5 minutes, so the gate filters them out and eliminates most flapping.

Acknowledging and resolving

Two patterns:

Auto-resolve when condition clears. “Pressure dropped back to normal” → alert resolves automatically. Default Grafana behavior.
Manual acknowledgment. Some alerts (e.g., “safety incident”) need human ack even after the condition clears. Use external alert manager (PagerDuty, Opsgenie) with ack flows.

For most IIoT, auto-resolve is fine. The state of the system tells you everything.
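The difference between the two patterns can be sketched as a tiny state model. This is an illustration only and does not reflect how PagerDuty or Opsgenie implement acknowledgment: the `Alert` class, its state names, and the alert names are hypothetical.

```python
class Alert:
    """Minimal alert lifecycle with an optional manual-acknowledgment requirement."""

    def __init__(self, name, requires_ack=False):
        self.name = name
        self.requires_ack = requires_ack
        self.condition_active = False
        self.awaiting_ack = False

    def evaluate(self, condition_active):
        """Record one evaluation; a new breach on an ack-required alert demands an ack."""
        if condition_active and not self.condition_active and self.requires_ack:
            self.awaiting_ack = True
        self.condition_active = condition_active

    def acknowledge(self):
        self.awaiting_ack = False

    @property
    def state(self):
        if self.condition_active:
            return "firing"
        if self.awaiting_ack:
            return "cleared, awaiting ack"
        return "resolved"


# Auto-resolve: the alert closes as soon as the condition clears.
pressure = Alert("pressure_high")
pressure.evaluate(True)
pressure.evaluate(False)

# Manual ack: the alert stays open after the condition clears.
safety = Alert("safety_incident", requires_ack=True)
safety.evaluate(True)
safety.evaluate(False)
state_before_ack = safety.state
safety.acknowledge()
print(pressure.state, state_before_ack, safety.state)
```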

Common Pitfalls

Alerting on every threshold crossing. This causes flapping; use a “for: 2 minutes” gate.

Per-device alerts on shared causes. Network outage = 50 device alerts. Use Alertmanager inhibition.

Critical thresholds set too low. “Page on temperature > 70” — fires hourly. Set thresholds based on actual danger, not nominal operating range.

No escalation path. Primary on-call doesn’t ack within 15 minutes — what happens? Define escalation in PagerDuty.

Alerts that don’t route to anyone. “We’ll set up routing later.” Means alerts vanish.

Silencing without expiration. Maintenance window ends; silence is forever. Always set expire time.
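One way to make the expiry habit hard to skip is to model a silence so that an end time is mandatory. The sketch below is a hypothetical `Silence` class for illustration, not the Grafana or Alertmanager API; the device label and the timestamps are made up.

```python
from dataclasses import dataclass


@dataclass
class Silence:
    """A silence that cannot be created without an explicit end time."""
    matchers: dict
    starts_at: float
    ends_at: float  # required: there is deliberately no "forever" default

    def __post_init__(self):
        if self.ends_at <= self.starts_at:
            raise ValueError("silence must end after it starts")

    def mutes(self, labels, now):
        """True if the silence is active at `now` and every matcher matches the labels."""
        active = self.starts_at <= now < self.ends_at
        return active and all(labels.get(k) == v for k, v in self.matchers.items())


# Hypothetical one-hour maintenance window on a single press (times in seconds).
silence = Silence({"device_id": "press-42"}, starts_at=0.0, ends_at=3600.0)
labels = {"device_id": "press-42", "severity": "warning"}

during = silence.mutes(labels, now=1800.0)  # inside the maintenance window
after = silence.mutes(labels, now=3600.0)   # window over, alerts flow again
print(during, after)
```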

Wrapping Up

Three alert types, three tiers of severity, routing by labels, escalation via PagerDuty. Alert fatigue is the enemy; tune thresholds aggressively. Friday: securing MQTT.
