Real Time Anomaly Detection for IIoT Telemetry
TL;DR — Start with robust statistical baselines per tag, layer a small multivariate model where signals correlate, and spend most of your effort on alert hygiene rather than model accuracy.
Real-time anomaly detection sounds like a modeling problem and usually isn’t. The hard parts are deciding what counts as an anomaly for a specific asset, building thresholds that don’t drift into either silence or noise, and making the alerts actionable for the people on the floor. I’ve shipped a few of these systems and the pattern that holds up is layered. Statistics first, ML second, business rules around both.
This post is the recipe I default to in 2024. It runs at the edge for tight loops and in a stream processor for cross-signal patterns. The stack is unsentimental. Python at the edge with TFLite 2.16 if a model is needed, a stream processor like Flink, Faust or a custom Kafka consumer for everything else, and a tiny alert service on top.
The bias I’m bringing. Most “anomaly detection” projects fail at deployment, not training. The model is fine. The alerts are unactionable. We’ll address that explicitly.
The Three Layers
I think of anomaly detection as three layers, each catching a different kind of problem.
The first layer is per-tag statistical. For each metric, maintain a rolling robust estimate of its baseline and dispersion, and flag points outside an adaptive envelope. This catches sensor failures, frozen values, and gross out-of-range readings. It runs locally and doesn’t need ML.
The second layer is multivariate, catching patterns across correlated signals. A 1D-CNN autoencoder, an Isolation Forest, or a small Transformer scoring reconstruction error or anomaly score. This catches the subtle ones, where no single tag is unusual but the combination is.
The third layer is business rules. Curated, named, owned by a human, encoding domain knowledge that no model will learn from data alone. “If pump A is on and pump B is on and the tank level isn’t dropping, alarm.”
Most of the value comes from layers one and three. Layer two is the headline-grabber and the smallest contributor in production.
Layer One, Robust Statistics
For per-tag baselines, the median and the median absolute deviation (MAD) are vastly more useful than mean and standard deviation. They survive sensor spikes without inflating the threshold. The standard “modified z-score” using MAD is the workhorse:
import numpy as np
from collections import deque
class RobustEnvelope:
def __init__(self, window=600, k=4.0):
self.buf = deque(maxlen=window)
self.k = k
def update(self, x: float) -> bool:
# Return True if x is flagged as anomalous given the current window.
if len(self.buf) < self.buf.maxlen // 4:
self.buf.append(x)
return False
med = np.median(self.buf)
mad = np.median(np.abs(np.array(self.buf) - med)) + 1e-9
z = 0.6745 * (x - med) / mad
flagged = abs(z) > self.k
if not flagged:
self.buf.append(x)
return flagged
Note that we don’t update the window when a point is flagged. This prevents anomalies from contaminating the baseline. The 0.6745 constant comes from MAD’s relationship to sigma under normality and is standard.
This 30-line class catches most of the day-to-day “is the sensor okay” alerts you actually want. It runs at the edge on the gateway, costs almost nothing, and you don’t need a model.
Layer Two, Multivariate Models
When per-tag statistics aren’t enough, train a small model on healthy data and score reconstruction error or anomaly score in real time. For sensor windows in the 30 to 120 sample range, a 1D-CNN autoencoder with maybe 30k parameters trained on a few weeks of healthy data is a solid baseline.
The deployment path is what matters here. Train offline on the historian, validate against a labelled holdout, convert to TFLite int8, push to the gateway. Score each window and publish the score to Kafka. A downstream consumer compares against an adaptive threshold per asset and decides whether to alert.
import numpy as np, tflite_runtime.interpreter as tflite
interp = tflite.Interpreter(model_path="ae_line3.tflite", num_threads=2)
interp.allocate_tensors()
inp_idx = interp.get_input_details()[0]["index"]
out_idx = interp.get_output_details()[0]["index"]
def score(window: np.ndarray) -> float:
interp.set_tensor(inp_idx, window[None, ...].astype(np.float32))
interp.invoke()
recon = interp.get_tensor(out_idx)[0]
return float(np.mean((recon - window) ** 2))
A common mistake here is using a global threshold. Use a per-asset rolling quantile (say the 99.5th percentile of recent scores) as the threshold. Assets behave differently and a single global cutoff will either spam or stay silent.
Layer Three, Business Rules
This is the layer engineers underrate and operators love. Encode the rules the maintenance team would tell you over a coffee, the ones that aren’t in any historian. Use a simple DSL or just declarative YAML:
- name: "tank_3_not_filling"
when:
all:
- "pump_a.state == 'ON'"
- "pump_b.state == 'ON'"
- "tank_3.level.delta_5m > -0.1"
for: "2m"
severity: "warning"
owner: "process-team"
The rule engine sits next to the stream processor, evaluates on each tick, and emits an alert with the rule name. The alert is now self-explanatory, owned, and easy to silence if it’s wrong.
Alert Hygiene
Alerts that don’t lead to action become noise. Noise becomes apathy. Apathy means you’ll miss the one alert that mattered.
A few rules that I enforce on every deployment:
- Every alert has a named rule or model, an owner, and a runbook link.
- Every alert can be silenced for a specific asset for a specific reason for a specific time, with a comment.
- Re-alerting is debounced. Default 30 minutes for warning, 5 minutes for critical.
- Every silenced alert is audited weekly. Patterns of silence usually mean the rule needs tuning, not the silence.
Without these, the project will fail regardless of model quality. With them, even a mediocre model will be useful.
Latency Budget
Real-time means different things in different contexts. For most industrial workloads, “alert within 10 to 30 seconds of the event” is plenty. Sub-second alerting is a different system, usually built into the control loop itself and not the IIoT platform.
A typical pipeline I’d ship:
- Edge worker scores layer one and two locally, publishes scores at the same cadence as raw telemetry.
- Stream processor evaluates layer three rules on a sliding window.
- Alert service deduplicates, applies severity, routes to the right channel.
End-to-end p99 latency is comfortably under 5 seconds on this shape.
Common Pitfalls
- Training on contaminated data. If “healthy” data includes the historical anomalies, the model learns them as normal. Spend time labelling. Domain experts beat unsupervised every time.
- One model for the whole plant. Assets differ. Train per-asset or per-asset-class, even if it means more models.
- Static thresholds. Operating conditions change. Use rolling thresholds tied to the regime.
- Alerting on the model output directly. Always alert on a rule built on top of the score, not the score itself. The rule is what an operator can read.
- No feedback loop. If alerts can’t be marked “useful” or “noise”, you’ll never improve. Build the feedback into the alert UI on day one.
Wrapping Up
Real-time anomaly detection for IIoT works in 2024, but the value comes from layered design and alert hygiene, not from clever models. Start with robust statistics, layer in a small model where it earns its keep, wrap business rules around both, and obsess over the alert experience.
The earlier posts on edge AI on industrial gateways and time series for IIoT cover the runtime and storage this sits on. For the classic robust statistics behind layer one, the NIST handbook section on outlier detection is still a great reference.
Alerts you’d answer at 3 AM, and none you wouldn’t. That’s the bar.