Edge AI on Industrial Gateways, TFLite and ONNX in 2024
TL;DR: TFLite owns ARM and microcontroller-class hardware. ONNX Runtime owns x86 gateways with NPUs or iGPUs. Both are production-ready in 2024 and the choice is hardware-driven, not religious.
Edge AI on industrial gateways stopped being a science project somewhere around 2022. By August 2024, the toolchains are mature enough that I’d default to on-gateway inference for any anomaly detection, vision inspection or predictive maintenance use case where round-trip latency to the cloud is more than a few hundred milliseconds. Bandwidth costs help the argument too, especially on cellular sites.
The two runtimes that matter are TensorFlow Lite 2.16 and ONNX Runtime 1.18. Both work on Linux, both have decent Python and C++ APIs, and both have hardware acceleration paths for the chips you’ll actually see on industrial PCs and gateways. Picking between them is mostly a hardware question. This post walks through the choice, the deployment shape I use, and the operational stuff that keeps it boring.
I’m going to assume your edge gateway is one of the usual suspects: an NXP i.MX 8 or Rockchip RK3588 if you’re on ARM, a small x86 industrial PC (maybe with an Intel NPU or a Hailo card) if you need more compute, or an Nvidia Jetson Orin Nano if you’ve decided a GPU is the right answer for vision.
The Runtime Decision
For ARM-class hardware, TensorFlow Lite is the path of least resistance. The XNNPACK delegate is excellent for CPU inference, GPU delegates work on Mali and Adreno, and there’s good support for vendor NPUs through Edge TPU and Hexagon delegates. The model format is a single .tflite file, the conversion path from Keras or PyTorch is well documented, and the deployment story is just shipping a binary.
For x86 gateways, especially anything with an Intel CPU and integrated GPU, ONNX Runtime is the better choice. The OpenVINO execution provider in ORT 1.18 gives you near-vendor-tool performance with a portable model file. If you’re on Nvidia Jetson, you can use either TFLite or ONNX Runtime, but TensorRT through ORT is usually the fastest path.
Here’s a TFLite 2.16 inference snippet on a Python edge worker, deliberately simple:
```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Stock TFLite builds typically apply the XNNPACK delegate to float models by
# default, so no explicit delegate is loaded; num_threads caps its CPU threads.
interp = tflite.Interpreter(
    model_path="anomaly.tflite",
    num_threads=4,
)
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]


def infer(window: np.ndarray) -> float:
    """Run one inference on a window already shaped like the model input."""
    interp.set_tensor(inp["index"], window.astype(np.float32))
    interp.invoke()
    return float(interp.get_tensor(out["index"])[0, 0])
```
And the equivalent ONNX Runtime 1.18 sample, with the OpenVINO provider:
```python
import numpy as np
import onnxruntime as ort

# OpenVINO on the first GPU device, with the CPU provider listed as fallback.
sess = ort.InferenceSession(
    "anomaly.onnx",
    providers=[
        ("OpenVINOExecutionProvider", {"device_type": "GPU.0"}),
        "CPUExecutionProvider",
    ],
)
input_name = sess.get_inputs()[0].name


def infer(window: np.ndarray) -> float:
    """Run one inference and return the scalar output."""
    out = sess.run(None, {input_name: window.astype(np.float32)})
    return float(out[0][0, 0])
```
Same shape, different runtime. The boring similarity is the point.
Latency Budgets
The number that matters more than peak throughput is p99 latency under contention. Industrial gateways do a lot of things at once. Modbus polling, OPC UA subscriptions, MQTT publishing, log shipping, and now ML inference. If your inference grabs a CPU core and starves the comms threads, you’ll see Modbus timeouts before you see model performance complaints.
I budget edge inference as 30–40 percent of one core, max, on a multi-core gateway. That usually means a model in the 1–10 MB range, quantized to int8, with a per-inference latency under 20 ms. For a 1 Hz tag stream that’s enormous headroom. For vision at 15 fps, it’s tight, and you’ll likely need the NPU or iGPU.
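The budget is simple duty-cycle arithmetic: per-inference latency multiplied by inference rate gives the fraction of one core consumed. A minimal sketch, where the function names and the 0.35 default (the middle of the 30–40 percent range) are illustrative choices, not part of any library:

```python
def core_fraction(latency_ms: float, rate_hz: float) -> float:
    """Fraction of one CPU core used by single-threaded inference."""
    return (latency_ms / 1000.0) * rate_hz


def fits_budget(latency_ms: float, rate_hz: float, budget: float = 0.35) -> bool:
    """True if the workload stays within the per-core budget (assumed 35%)."""
    return core_fraction(latency_ms, rate_hz) <= budget


# 20 ms per inference on a 1 Hz tag stream: about 2% of a core.
print(f"1 Hz tags:  {core_fraction(20, 1):.0%}")
# 20 ms per frame at 15 fps: about 30% of a core, right at the edge of the budget.
print(f"15 fps cam: {core_fraction(20, 15):.0%}")
```

At 30 fps the same 20 ms model would need 60 percent of a core, which is where offloading to the NPU or iGPU stops being optional.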
Quantization is essentially free quality-wise for most industrial signal models. TFLite’s post-training int8 quantization with representative data gets within a percent or two of float32 accuracy for the kind of LSTM, GRU and 1D-CNN architectures I use for sensor data. ONNX Runtime has equivalent quantization tools. Use them.
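For intuition on why int8 costs so little accuracy, here is a sketch of the affine (scale and zero-point) mapping that int8 quantization schemes are built on. It is a plain-Python illustration, not the TFLite or ONNX Runtime implementation; the function names and sample values are made up for the example. Representative data is, roughly, how a converter finds ranges like `lo` and `hi` for each tensor.

```python
def quant_params(lo: float, hi: float) -> tuple[float, int]:
    """Scale and zero point mapping the float range [lo, hi] onto int8 [-128, 127]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain 0.0
    if hi == lo:
        raise ValueError("calibration range is empty")
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # saturate to the int8 range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale


# Calibrate on an assumed sensor range of [-1.0, 3.0], then round-trip a few readings.
scale, zp = quant_params(-1.0, 3.0)
readings = [-0.8, 0.0, 1.234, 2.9]
errors = [abs(x - dequantize(quantize(x, scale, zp), scale, zp)) for x in readings]
print(f"scale={scale:.5f} zero_point={zp} max_error={max(errors):.5f}")
```

For values inside the calibrated range the round-trip error is bounded by half a quantization step, which is why a calibration set that covers the real operating range matters more than the bit width.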
The Deployment Shape
A model on a gateway is half the work. The other half is getting models on and off the gateway safely, and reasoning about which version is running where. My standard pattern:
- Models live in object storage, keyed by {site}/{line}/{model_name}/{version}.tflite.
- The gateway has a small supervisor that watches a per-device MQTT topic for a “load this version” command.
- On command, it downloads, verifies a signature, atomically swaps the model file, and restarts the inference worker.
- The worker publishes a heartbeat with the loaded model version, so the central system always knows ground truth.
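The key layout, the load command and the heartbeat can be sketched with the standard library alone. The command payload shape, the field names and the helper names below are all assumptions made for illustration; the pattern above does not prescribe them.

```python
import json
import time


def model_key(site: str, line: str, model_name: str, version: str) -> str:
    """Object-storage key following {site}/{line}/{model_name}/{version}.tflite."""
    return f"{site}/{line}/{model_name}/{version}.tflite"


def parse_load_command(payload: bytes, site: str, line: str) -> str:
    """Turn a 'load this version' message into the object key to download.

    Assumes a JSON payload like {"model_name": "anomaly", "version": "1.4.2"}.
    """
    cmd = json.loads(payload)
    for field in ("model_name", "version"):
        value = cmd.get(field)
        # Reject anything that could escape the expected key prefix.
        if not isinstance(value, str) or not value or "/" in value or ".." in value:
            raise ValueError(f"suspicious or missing command field: {field}")
    return model_key(site, line, cmd["model_name"], cmd["version"])


def heartbeat(device: str, loaded_version: str) -> str:
    """JSON heartbeat reporting which model version is actually loaded."""
    return json.dumps(
        {"device": device, "version": loaded_version, "ts": int(time.time())}
    )
```

The heartbeat should report the version the worker actually loaded, not the version last requested, so a failed swap shows up centrally as a mismatch.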
Atomic swap matters. Don’t overwrite the model in place. Download to a temp path, then rename. On crash recovery, the previous model is still there.
For signing, plain ed25519 over the file is fine. Don’t overthink it. The threat model is “someone who can publish to our MQTT topics or write to the bucket pushes a model we didn’t build”, not “nation-state APT”. Use one signing key per environment, rotated annually.
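A minimal sketch of the verify-then-rename sequence, using only the standard library. The `verify` callable stands in for the ed25519 check and is an assumption of this example, as is the function name; in practice it would wrap whatever signature library the gateway image ships.

```python
import os
import tempfile
from typing import Callable


def install_model(data: bytes, dest: str, verify: Callable[[bytes], bool]) -> None:
    """Verify downloaded model bytes, then atomically replace the file at dest."""
    if not verify(data):
        raise ValueError("model failed verification; keeping the current model")

    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temp file lives next to dest: a rename is only atomic within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # get the bytes onto disk before the swap
        os.replace(tmp_path, dest)  # atomic on POSIX: readers see old or new, never half
    except BaseException:
        # Remove the partial temp file; dest still holds the previous model.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A temp path under /tmp would often sit on a different filesystem than the model directory, in which case the final move degrades to a copy and loses the atomicity this pattern depends on.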
Models Worth Using
For anomaly detection on multivariate sensor windows, a 1D-CNN autoencoder or a small Transformer with maybe 50k parameters gets you a long way. Train on healthy data, score reconstruction error, alert above a threshold tuned on a labelled validation set. The model fits in a few hundred KB.
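The scoring and threshold-tuning step can be sketched without any ML framework. This assumes the autoencoder's reconstruction is already available as a list of floats, and it picks the threshold by maximizing F1 on the labelled validation set, which is one reasonable criterion among several rather than the one this post necessarily uses.

```python
def reconstruction_error(window: list[float], reconstructed: list[float]) -> float:
    """Mean squared error between a sensor window and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(window, reconstructed)) / len(window)


def tune_threshold(scores: list[float], labels: list[bool]) -> float:
    """Pick the score threshold with the best F1 on labelled validation data.

    labels[i] is True when validation window i is a known anomaly.
    A window is flagged when its score is >= the threshold.
    """
    best_threshold, best_f1 = max(scores), -1.0
    for candidate in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= candidate and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= candidate and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < candidate and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold
```

If missed anomalies cost more than false alarms on a given line, swapping F1 for a recall-weighted score is a small change to the loop.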
For vision inspection, YOLOv8-nano or YOLOv8-small quantized to int8 with TensorRT or OpenVINO is the workhorse in 2024. It runs at 30+ fps on a Jetson Orin Nano with room to spare.
For predictive maintenance regression, gradient boosted trees converted to ONNX via onnxmltools are unbeatable. Tiny, fast, accurate, and explainable. Don’t reach for deep learning when XGBoost will do the job.
Monitoring
Inference at the edge has its own failure modes. Drift is the obvious one. The less obvious ones are silent quantization mismatches when the runtime version changes, and latency creep when the gateway’s other workloads grow.
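One cheap guard against silent quantization mismatches is a golden-input check at worker startup: run a few stored inputs through the freshly loaded model and compare against outputs recorded when the model was validated. A sketch, assuming an `infer` callable that returns a single float; the tolerance and the golden-set format are illustrative.

```python
from typing import Callable, Sequence


def golden_check(
    infer: Callable[[Sequence[float]], float],
    golden: Sequence[tuple[Sequence[float], float]],
    tolerance: float = 0.02,
) -> list[str]:
    """Return mismatch descriptions; an empty list means outputs match the record."""
    failures = []
    for i, (inputs, expected) in enumerate(golden):
        actual = infer(inputs)
        if abs(actual - expected) > tolerance:
            failures.append(f"case {i}: expected {expected:.4f}, got {actual:.4f}")
    return failures
```

A worker that gets a non-empty list back can refuse to serve and report the failure, which turns a silent numeric shift after a runtime upgrade into a loud startup error.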
I publish three things from every inference worker, on MQTT, every minute. Model version. p50 and p99 inference latency. A running histogram of input feature distributions, summarized to a small JSON. The histogram is what catches drift before the alerts get noisy.
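```python
import json

# `client` is assumed to be an already-connected MQTT client (for example
# paho-mqtt); site, line, device, MODEL_VERSION, p50, p99 and hist_summary
# come from the worker's own state.
client.publish(
    f"plant/{site}/{line}/edge/{device}/model_stats",
    json.dumps({
        "version": MODEL_VERSION,
        "p50_ms": p50, "p99_ms": p99,
        "input_hist": hist_summary,
    }),
    qos=1,
)
```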
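The publish call above takes `p50`, `p99` and `hist_summary` as given. One way to produce them with the standard library is sketched below; the window size, the bin edges and the class name are assumptions for illustration, not part of the original worker.

```python
import bisect
import math
from collections import deque


class InferenceStats:
    """Rolling latency percentiles plus a fixed-bin histogram of one input feature."""

    def __init__(self, bin_edges: list[float], window: int = 1000):
        self.latencies_ms = deque(maxlen=window)  # keeps only the most recent samples
        self.bin_edges = sorted(bin_edges)
        # One bin below the first edge, one above the last, one between each pair.
        self.counts = [0] * (len(self.bin_edges) + 1)

    def record(self, latency_ms: float, feature_value: float) -> None:
        self.latencies_ms.append(latency_ms)
        self.counts[bisect.bisect_right(self.bin_edges, feature_value)] += 1

    def percentile(self, pct: float) -> float:
        """Nearest-rank percentile over the rolling window (0.0 if empty)."""
        ordered = sorted(self.latencies_ms)
        if not ordered:
            return 0.0
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    def summary(self) -> dict:
        """Small JSON-serializable histogram summary."""
        return {"edges": self.bin_edges, "counts": list(self.counts)}
```

With this shape, `p50` is `stats.percentile(50)`, `p99` is `stats.percentile(99)` and `hist_summary` is `stats.summary()`. Fixed bin edges are a deliberate choice: they keep histograms from different gateways and different days directly comparable.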
The central system aggregates and compares against a baseline. Drift over a configurable threshold triggers a review, not an automatic retrain. Auto-retrain on edge is a great way to teach yourself a hard lesson.
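For the baseline comparison, one common drift score over binned histograms is the Population Stability Index (PSI). The post does not name the metric its central system uses, so treat PSI, the function names and the 0.2 default threshold (a rule of thumb, not a value from this setup) as assumptions of this sketch.

```python
import math


def psi(baseline_counts: list[int], current_counts: list[int], eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms with identical bins."""
    if len(baseline_counts) != len(current_counts):
        raise ValueError("histograms must use the same bins")
    b_total, c_total = sum(baseline_counts), sum(current_counts)
    if b_total == 0 or c_total == 0:
        raise ValueError("histograms must not be empty")
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        # eps keeps empty bins from producing log(0) or a division by zero.
        p = max(b / b_total, eps)
        q = max(c / c_total, eps)
        score += (q - p) * math.log(q / p)
    return score


def needs_review(baseline_counts: list[int], current_counts: list[int],
                 threshold: float = 0.2) -> bool:
    """Flag drift for human review; deliberately does not trigger a retrain."""
    return psi(baseline_counts, current_counts) > threshold
```

Identical distributions score zero and the score grows as mass moves between bins, so the configurable threshold maps directly onto "how much shift before a human looks".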
Common Pitfalls
- Shipping float32 models to gateways that have int8 NPUs. The NPU sits idle and the CPU melts. Quantize.
- Forgetting that the gateway also has to do its day job. Pin the inference worker to specific cores with taskset, leave the comms cores alone.
- Bundling the model with the firmware. You’ll want to update models more often than firmware. Decouple them.
- No version stamp in the output. When you compare predictions across sites, you need to know which model produced what. Always include the model version in the published result.
- Trusting Python on a 64 MB MCU. Below a certain class of hardware, Python isn’t the right runtime. TFLite Micro in C++ is. Know where the line is.
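The core-pinning advice can also be applied from inside the worker. On Linux, Python exposes the same mechanism that taskset uses as `os.sched_setaffinity`; the sketch below assumes the operator decides which cores belong to inference (cores 2 and 3 in the usage note are an arbitrary example).

```python
import os
from typing import Iterable, Optional


def pin_to_cores(cores: Iterable[int]) -> Optional[set]:
    """Restrict this process to the given CPU cores (Linux only).

    Returns the affinity set actually applied, or None where unsupported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not Linux; fall back to the platform's own tooling
    requested = set(cores)
    allowed = os.sched_getaffinity(0)  # cores this process may use right now
    target = requested & allowed       # ignore cores that are not available here
    if not target:
        raise ValueError(
            f"none of {sorted(requested)} are available; allowed: {sorted(allowed)}"
        )
    os.sched_setaffinity(0, target)    # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Calling `pin_to_cores([2, 3])` at worker startup has the same effect as launching the worker under `taskset -c 2,3`, with the advantage that the chosen cores can come from the worker's own config.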
Wrapping Up
Edge AI on industrial gateways is one of the rare technologies that’s both genuinely useful and now genuinely easy. Pick TFLite 2.16 for ARM, ONNX Runtime 1.18 for x86. Quantize. Budget your CPU. Sign your models. Monitor drift. Don’t auto-retrain.
For the data plumbing that feeds these workers, see the earlier IIoT reference architecture post. The official ONNX Runtime documentation has the most current execution provider matrix if you’re evaluating hardware accelerators.
Smart models, dumb pipelines. That ordering works.