August Retro, IIoT Production Lessons
August 29, 2022 · 5 min read · by Muhammad Amal · programming

TL;DR: Two months into the IIoT consulting gig, Mosquitto + TimescaleDB + Grafana has been a sufficient stack. Web-backend instincts (low latency, eventual consistency, retry-on-failure) sometimes misled me. OT engineers know things web engineers don’t. Listen.

End of August. Closing out the IIoT theme. Honest retro on what I learned in the trenches of an actual factory monitoring project.

What’s running in production

The factory project has been live two months. State:

  • Mosquitto 2.0.15 on a $20/month VM, single instance, no HA
  • TimescaleDB 2.7 on Postgres 14 with 90-day raw retention, 1-year aggregates
  • One Go ingest service consuming MQTT and batching to TimescaleDB
  • Grafana 9 for dashboards (3 main: overview, per-line, per-machine)
  • One OPC UA bridge (custom Go, ~500 lines) connecting 4 Siemens PLCs
  • 120 sensors across 30 machines, 1 Hz average rate
  • 5 alert rules (3 threshold, 1 rate-of-change, 1 missing-data)

Throughput: ~1500 msg/s steady, ~200 MB/day to disk. Trivial for the hardware.
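For the ingest service's batching step, here is a minimal sketch of a size-triggered batcher. The `Reading` type, the `Batcher` API, and the flush callback are hypothetical names for illustration, not the production code; the real service would hand each batch to a TimescaleDB insert.

```go
package main

import "fmt"

// Reading is a hypothetical shape for one decoded MQTT message.
type Reading struct {
	Device string
	Value  float64
}

// Batcher collects readings and hands them to flush once size is reached.
// It is not goroutine-safe; guard it with a mutex or own it from one goroutine.
type Batcher struct {
	size  int
	buf   []Reading
	flush func([]Reading) error
}

func NewBatcher(size int, flush func([]Reading) error) *Batcher {
	return &Batcher{size: size, buf: make([]Reading, 0, size), flush: flush}
}

// Add queues one reading and flushes when the batch is full.
func (b *Batcher) Add(r Reading) error {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.size {
		return b.Flush()
	}
	return nil
}

// Flush writes whatever is pending. On error the buffer is kept,
// so the next Add or Flush retries with the same rows plus any new ones.
func (b *Batcher) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	batch := make([]Reading, len(b.buf))
	copy(batch, b.buf)
	if err := b.flush(batch); err != nil {
		return err
	}
	b.buf = b.buf[:0]
	return nil
}

// Pending reports how many readings are waiting for the next flush.
func (b *Batcher) Pending() int { return len(b.buf) }

func main() {
	b := NewBatcher(2, func(rs []Reading) error {
		fmt.Printf("flushing %d rows\n", len(rs))
		return nil
	})
	for i := 0; i < 5; i++ {
		_ = b.Add(Reading{Device: "m01", Value: float64(i)})
	}
	fmt.Println("pending:", b.Pending())
}
```

In a long-running service, calling `Flush` from a ticker as well keeps rows from sitting in memory during quiet periods.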

What worked

Mosquitto handles the scale comfortably. I kept running into the “you must use EMQX for production” claim; it is not true at this scale. Mosquitto has been rock-solid. Resource usage: ~50 MB RAM, <2% CPU at peak.

The OPC UA bridge in Go. ~3 days of work; production-stable. The gopcua library was new to me; quality was good enough. Beats commercial gateways for our scale.

Per-device credentials with config management. When we onboarded 20 new sensors, generating creds + ACL updates was scripted. Took 15 minutes. Without that structure it would have been a day.
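A rough sketch of what scripted onboarding can look like, assuming one Mosquitto user per device and write access limited to that device's topic subtree. The `Device` struct, the topic prefix, and the helper names are made up for illustration; the `user` and `topic write` lines follow Mosquitto's ACL file format.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// Device is a hypothetical inventory record for one sensor.
type Device struct {
	Username    string
	TopicPrefix string // e.g. "factory/line1/m07" (assumed naming)
}

// aclEntry renders one Mosquitto ACL stanza: the device may only
// publish under its own topic prefix.
func aclEntry(d Device) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "user %s\n", d.Username)
	fmt.Fprintf(&sb, "topic write %s/#\n", d.TopicPrefix)
	return sb.String()
}

// newPassword returns 16 random bytes as 32 hex characters.
func newPassword() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	d := Device{Username: "sensor-042", TopicPrefix: "factory/line1/m07"}
	pw, err := newPassword()
	if err != nil {
		panic(err)
	}
	fmt.Print(aclEntry(d))
	fmt.Printf("# generated a %d-character password for %s\n", len(pw), d.Username)
}
```

The generated password would then be registered with the broker, for example through `mosquitto_passwd` in batch mode, and handed to the device through whatever config management is in place.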

Grafana variable-driven dashboards. One dashboard with $device and $line variables serves the OT team’s queries. Saved building 30 dashboards.

TimescaleDB continuous aggregates. A year-long temperature chart loads in 200ms because it hits a materialized view, not raw data. The investment in the aggregate definition paid off the first week.
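To make the idea concrete: a continuous aggregate precomputes per-bucket summaries so a long-range chart reads a few thousand rows instead of millions. The sketch below does the same rollup in plain Go. It illustrates the concept only, it is not how TimescaleDB implements it, and `Point`, `Bucket`, and the `at` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Point is one raw reading.
type Point struct {
	T time.Time
	V float64
}

// Bucket is the precomputed summary for one time window.
type Bucket struct {
	Start         time.Time
	Min, Max, Avg float64
	Count         int
}

// rollup groups points into fixed-width windows and summarizes each one.
func rollup(points []Point, width time.Duration) []Bucket {
	type acc struct {
		min, max, sum float64
		n             int
	}
	byStart := map[int64]*acc{}
	for _, p := range points {
		k := p.T.Truncate(width).UnixNano()
		a, ok := byStart[k]
		if !ok {
			byStart[k] = &acc{min: p.V, max: p.V, sum: p.V, n: 1}
			continue
		}
		if p.V < a.min {
			a.min = p.V
		}
		if p.V > a.max {
			a.max = p.V
		}
		a.sum += p.V
		a.n++
	}
	out := make([]Bucket, 0, len(byStart))
	for k, a := range byStart {
		out = append(out, Bucket{
			Start: time.Unix(0, k).UTC(),
			Min:   a.min, Max: a.max, Avg: a.sum / float64(a.n), Count: a.n,
		})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Start.Before(out[j].Start) })
	return out
}

// at builds a UTC timestamp on an arbitrary example day.
func at(hour, minute int) time.Time {
	return time.Date(2022, 8, 1, hour, minute, 0, 0, time.UTC)
}

func main() {
	pts := []Point{{at(10, 5), 60}, {at(10, 40), 72}, {at(11, 10), 65}}
	for _, b := range rollup(pts, time.Hour) {
		fmt.Printf("%s min=%.1f max=%.1f avg=%.1f n=%d\n",
			b.Start.Format("15:04"), b.Min, b.Max, b.Avg, b.Count)
	}
}
```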

What surprised me

Network reliability assumptions. Web backends treat the network as mostly reliable with occasional blips. Factory networks have brief outages constantly: switches reboot, Ethernet cables get unplugged for maintenance, ARP storms happen. The “buffer on disconnect” pattern from the Go edge post saved real data within the first week.
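As a reminder of the shape of that pattern, here is a minimal in-memory sketch. It is an illustration under assumptions, not the code from the edge post: `Outbox`, `Msg`, and `errOffline` are hypothetical names, the publish function stands in for the MQTT client, and a real edge gateway would likely persist the queue to disk.

```go
package main

import (
	"errors"
	"fmt"
)

// errOffline is a stand-in for whatever error the MQTT client returns.
var errOffline = errors.New("broker unreachable")

type Msg struct {
	Topic   string
	Payload []byte
}

// Outbox queues messages while publishing fails and replays them in order.
type Outbox struct {
	publish func(Msg) error
	queue   []Msg
	max     int
	Dropped int // messages discarded because the queue was full
}

func NewOutbox(max int, publish func(Msg) error) *Outbox {
	return &Outbox{publish: publish, max: max}
}

// Send enqueues m behind any backlog, then tries to drain, so ordering holds.
func (o *Outbox) Send(m Msg) {
	if len(o.queue) >= o.max {
		o.queue = o.queue[1:] // queue full: drop the oldest message
		o.Dropped++
	}
	o.queue = append(o.queue, m)
	o.Drain()
}

// Drain publishes queued messages in order and stops at the first failure.
func (o *Outbox) Drain() {
	for len(o.queue) > 0 {
		if err := o.publish(o.queue[0]); err != nil {
			return
		}
		o.queue = o.queue[1:]
	}
}

// Pending reports how many messages are waiting to be published.
func (o *Outbox) Pending() int { return len(o.queue) }

func main() {
	online := false
	var sent []string
	o := NewOutbox(1000, func(m Msg) error {
		if !online {
			return errOffline
		}
		sent = append(sent, m.Topic)
		return nil
	})
	o.Send(Msg{Topic: "a", Payload: []byte("1")})
	o.Send(Msg{Topic: "b", Payload: []byte("2")})
	fmt.Println("pending while offline:", o.Pending())
	online = true
	o.Drain() // call this from the client's reconnect hook
	fmt.Println("sent after reconnect:", sent)
}
```

Dropping the oldest message when the queue fills is one choice among several; for slow-moving process values the newest readings are usually the ones worth keeping.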

Time-series query patterns. My instinct was “give me all data for the last hour.” OT engineers ask “give me the max for the last hour.” The aggregation is the query, not the post-query step. Caused me to redesign three dashboards.

Industrial protocol quirks. OPC UA’s “Quality” field has more states than just good/bad. PLCs sometimes report BadCommunicationError for a single read but recover. Treating it as binary “ok/not-ok” lost information.
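One way to keep that information, sketched under assumptions: classify the OPC UA StatusCode by its severity bits (the top two bits of the 32-bit code: 00 good, 01 uncertain, 10 bad) and only mark a tag as down after several consecutive bad reads. The `Health` tracker, its labels, and the streak limit are hypothetical; this sketch does not call the gopcua API.

```go
package main

import "fmt"

type Severity int

const (
	Good Severity = iota
	Uncertain
	Bad
)

// A few well-known StatusCode values from the OPC UA specification.
const (
	StatusGood                  uint32 = 0x00000000
	StatusUncertain             uint32 = 0x40000000
	StatusBadCommunicationError uint32 = 0x80050000
)

// severity reads the top two bits of a StatusCode.
// The reserved pattern 11 is treated as bad here.
func severity(status uint32) Severity {
	switch status >> 30 {
	case 0:
		return Good
	case 1:
		return Uncertain
	default:
		return Bad
	}
}

// Health turns a stream of status codes into "ok", "degraded", or "down".
type Health struct {
	limit     int // consecutive bad reads before reporting "down"
	badStreak int
}

func NewHealth(limit int) *Health { return &Health{limit: limit} }

// Observe records one read. A single bad read only degrades the tag;
// any good or uncertain read resets the streak.
func (h *Health) Observe(status uint32) string {
	switch severity(status) {
	case Good:
		h.badStreak = 0
		return "ok"
	case Uncertain:
		h.badStreak = 0
		return "degraded"
	default:
		h.badStreak++
		if h.badStreak >= h.limit {
			return "down"
		}
		return "degraded"
	}
}

func main() {
	h := NewHealth(3)
	reads := []uint32{StatusGood, StatusBadCommunicationError, StatusGood, StatusUncertain}
	for _, s := range reads {
		fmt.Printf("0x%08X -> %s\n", s, h.Observe(s))
	}
}
```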

Time zones. Factory shift schedules are in local time. Dashboards using UTC confused operators. Set Grafana timezone to factory local; never UTC for shop-floor displays.
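The same rule applies to anything computed per shift in backend code. A small sketch, with loud assumptions: the fixed UTC+7 offset and the three shift boundaries are invented for illustration, and `utcHour` is only a helper for the example.

```go
package main

import (
	"fmt"
	"time"
)

// factoryZone is a hypothetical fixed offset. With a real site, prefer
// time.LoadLocation with the IANA zone name so DST rules are applied.
var factoryZone = time.FixedZone("factory", 7*60*60)

// shiftFor maps an instant to a shift using factory-local wall-clock time.
// The boundaries (06:00, 14:00, 22:00) are assumed, not from the project.
func shiftFor(t time.Time) string {
	h := t.In(factoryZone).Hour()
	switch {
	case h >= 6 && h < 14:
		return "morning"
	case h >= 14 && h < 22:
		return "afternoon"
	default:
		return "night"
	}
}

// utcHour builds a UTC timestamp on an arbitrary example day.
func utcHour(hour int) time.Time {
	return time.Date(2022, 8, 29, hour, 0, 0, 0, time.UTC)
}

func main() {
	for _, h := range []int{0, 8, 16, 23} {
		t := utcHour(h)
		fmt.Printf("%s UTC = %s local -> %s shift\n",
			t.Format("15:04"), t.In(factoryZone).Format("15:04"), shiftFor(t))
	}
}
```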

Alert fatigue threshold. The plant engineer set initial thresholds tight (“alert on > 70°C”). Within a day that produced 30 alerts per shift, and the alerts were switched off. It took two weeks of co-tuning to get to “5 alerts per shift, all actionable.” Calibration is collaborative.
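One common way to cut noise from a tight threshold is to require the condition to hold for several consecutive samples before firing. This is a generic technique sketched here as an assumption; it is not necessarily what the co-tuning settled on, and the `Debounce` type and its numbers are hypothetical.

```go
package main

import "fmt"

// Debounce fires once when a value stays above Limit for Need samples in a row.
type Debounce struct {
	Limit  float64
	Need   int
	streak int
	firing bool
}

// Observe returns true only on the sample where the alert starts firing.
// Dropping back to or below Limit resets the rule.
func (d *Debounce) Observe(v float64) bool {
	if v <= d.Limit {
		d.streak = 0
		d.firing = false
		return false
	}
	d.streak++
	if d.streak >= d.Need && !d.firing {
		d.firing = true
		return true
	}
	return false
}

func main() {
	rule := &Debounce{Limit: 70, Need: 3}
	for _, temp := range []float64{71, 69, 71, 72, 73, 74, 60, 75} {
		if rule.Observe(temp) {
			fmt.Printf("ALERT at %.0f°C\n", temp)
		}
	}
}
```

A naive "alert on > 70" rule would fire six times on this series; the debounced rule fires once, on the third consecutive high sample.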

What didn’t work

Sparkplug B. Tried it. Library quality (in Go especially) was rough. Tooling sparse. Reverted to plain MQTT with JSON. Same data flow; less ceremony. Will revisit Sparkplug for the next project where we control both ends.

Real-time MQTT streaming to Grafana. Plugin existed; reliability poor. Reverted to “TimescaleDB with 5-second dashboard refresh.” Operators don’t actually need sub-second; 5s is fine.

One-Big-Service architecture. Started with a single Go service doing OPC UA → MQTT → TimescaleDB. Hard to scale parts independently. Refactored into three: OPC UA bridge, MQTT ingest, alert evaluator. Each ~200-300 lines. Modular.

InfluxDB. Briefly tried InfluxDB 2 instead of TimescaleDB. Flux query language adds friction; the team knows SQL. Reverted to TimescaleDB. Both work technically; SQL won on team familiarity.

What I’d do differently

Listen to OT engineers earlier. I spent a week designing the dashboard from web-developer instincts. The OT team rewrote half of it on day one of the demo. Their tacit knowledge of “what operators look at” is years deep.

Start with retention policy on day one. Forgot. Two weeks in, the DB was 4 GB and growing. Adding retention then required some chunk-by-chunk cleanup. Should have been in the initial schema.
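A sketch of what "in the initial schema" could mean in a Go migration step: generate the TimescaleDB `add_retention_policy` call alongside the table creation. The table name `sensor_readings` and the helper are hypothetical; the 90-day value matches the raw retention mentioned earlier.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe restricts table names to plain lowercase identifiers so the
// name can be embedded in SQL text safely.
var identRe = regexp.MustCompile(`^[a-z_][a-z0-9_]*$`)

// retentionSQL builds the statement that registers a retention policy
// on a hypertable. if_not_exists makes the migration safe to re-run.
func retentionSQL(table string, days int) (string, error) {
	if !identRe.MatchString(table) {
		return "", fmt.Errorf("unsafe table name %q", table)
	}
	if days <= 0 {
		return "", fmt.Errorf("retention must be positive, got %d days", days)
	}
	return fmt.Sprintf(
		"SELECT add_retention_policy('%s', INTERVAL '%d days', if_not_exists => true);",
		table, days), nil
}

func main() {
	stmt, err := retentionSQL("sensor_readings", 90) // hypothetical hypertable name
	if err != nil {
		panic(err)
	}
	fmt.Println(stmt)
}
```

Running the resulting statement right after the hypertable is created means old chunks are dropped by the background job from the start, with no later cleanup pass.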

Document the topic taxonomy upfront. Topic naming evolved during the project. Three migrations later, all subscribers still work but the names are inconsistent. Plan upfront; commit; stick.
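One way to make a taxonomy stick is to encode it as a type that every publisher and subscriber must go through. The four-level `site/line/machine/sensor` layout below is an invented example, not the project's actual naming.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Topic is a hypothetical four-level taxonomy: site/line/machine/sensor.
type Topic struct {
	Site, Line, Machine, Sensor string
}

// segRe allows lowercase letters, digits, and hyphens. This also rules out
// the MQTT wildcards "+" and "#" and empty segments.
var segRe = regexp.MustCompile(`^[a-z0-9-]+$`)

// String renders the topic in its canonical form.
func (t Topic) String() string {
	return strings.Join([]string{t.Site, t.Line, t.Machine, t.Sensor}, "/")
}

// ParseTopic validates a topic string against the taxonomy.
func ParseTopic(s string) (Topic, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 4 {
		return Topic{}, fmt.Errorf("topic %q: want 4 segments, got %d", s, len(parts))
	}
	for _, p := range parts {
		if !segRe.MatchString(p) {
			return Topic{}, fmt.Errorf("topic %q: bad segment %q", s, p)
		}
	}
	return Topic{Site: parts[0], Line: parts[1], Machine: parts[2], Sensor: parts[3]}, nil
}

func main() {
	for _, s := range []string{"plant1/line2/m07/temp", "plant1/line2/temp", "Plant1/line2/m07/temp"} {
		t, err := ParseTopic(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", t)
	}
}
```

Rejecting nonconforming topics at the ingest boundary surfaces naming drift on the first message instead of three migrations later.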

Use environment-specific brokers. Started with one Mosquitto for dev + prod. “Just be careful.” Predictably, dev test traffic showed up in prod TimescaleDB. Two brokers from day one.

What August didn’t cover

Honest gaps:

  • AWS IoT Core / Azure IoT Hub. The managed cloud offerings. Different cost model; different operational story. Out of scope.
  • Edge ML / inference. Running models on the edge gateway. Different problem space.
  • Wireless sensor networks. Battery-powered sensors over LoRa / NB-IoT / 5G. We’re hardwired.
  • Functional safety. SIL-rated systems for high-consequence environments. Outside my expertise.
  • Modbus, Profibus, HART. Older industrial protocols. We only saw OPC UA.

If you need those, note that this month was not a complete IIoT guide. It is a pragmatic starting point.

What September looks like

September theme: System Reliability, centralized logging and monitoring with Prometheus and Grafana. It bridges nicely from August’s IIoT monitoring + alerting into the full observability stack. The pattern is the same; the scale is bigger; the tooling extends beyond MQTT.

The factory project’s monitoring will become the running example. Some specifics:

  • Prometheus 2.37 (just released)
  • Grafana 9 (continuing from August)
  • Loki 2.6 for logs
  • Tempo 1.5 for traces
  • Alertmanager
  • SLOs in practice

Same shape: 13 articles, M/W/F. See you Friday.