Backpressure and Reliability Patterns for IIoT Pipelines
TL;DR — IIoT pipelines fail at the seams between components, not inside them / Every hop needs bounded buffers, a drop policy, and a metric you alert on / “Don’t drop data” is a non-goal; “know what you dropped, why, and how recently” is the real goal.
A pipeline that never drops data is a pipeline you haven’t load-tested. The interesting question isn’t whether you’ll drop data — under sufficiently bad conditions you will, every system does — but whether the drop is bounded, observable, and the right data. This is the post I wish I’d written for myself three years ago.
We’ve spent the month walking through the IIoT stack: brokers, storage, Kafka, edge, OPC UA. This final post is about the seams between those components and how to make them honest. I’ll pull from the Kafka post for the durable-buffer specifics.
The Pipeline as a Series of Queues
Every IIoT pipeline is a chain of queues, even when it doesn’t look like one:
[Sensor]
-> OPC UA monitored item queue (queuesize=N)
-> Bridge in-memory buffer
-> MQTT broker offline message queue (max_queued_messages)
-> MQTT bridge to upstream broker (max_queued_messages)
-> Upstream broker
-> Kafka producer batch buffer (linger.ms)
-> Kafka topic partition (retention.ms)
-> Consumer fetch buffer
-> Time-series DB write buffer
-> Disk
Every one of those queues has a depth, a fill rate, a drain rate, and a behavior when full. The pipeline is reliable when every queue’s drain rate matches or exceeds its sustained fill rate, and every queue’s full-behavior is one of:
- Drop oldest (telemetry: I’d rather have recent data than complete data)
- Drop newest (command/control: never overwrite a queued command with a new one of unknown ordering)
- Block upstream (only safe if upstream can absorb the block — e.g. the sensor has its own buffer)
- Spill to disk (a transient option, not a permanent strategy)
The architecture decision is which policy at which queue.
End-to-End Backpressure Decisions
Working through the stack:
OPC UA bridge. The OPC UA subscription’s queuesize should be small (5–10). The bridge’s internal asyncio buffer should also be small. Both should drop-oldest when full. The PLC has its own internal buffer; the bridge does not need to be a second one.
MQTT broker (edge). This is the longest buffer in the chain — it has to cover WAN outages. Sized for hours of capacity, drop-oldest, persistence enabled. The Mosquitto tuning post covers the specifics.
MQTT bridge edge to hub. Bounded queue (max_queued_messages), drop-oldest. If the WAN is down longer than the buffer, you’ve lost the tail — that’s the design.
Kafka producer. linger.ms + batch.size + buffer.memory. When buffer.memory fills (because Kafka brokers are slow), the producer’s send() blocks for up to max.block.ms, then throws. Catch that exception and either backoff-retry or drop to local disk.
Kafka topic retention. This is your replay window. Size it for the longest acceptable consumer outage plus margin. 7 days is a reasonable default for telemetry; 30 days if you have downstream batch consumers.
Consumer to time-series DB. This is the hop where most production incidents happen. If the DB is slow, the consumer falls behind. If the consumer falls behind past retention.ms, you lose data permanently. Alert on consumer lag with a threshold well below retention.ms.
Dead-Letter Queues That You Actually Look At
Every consumer that does validation needs a DLQ for malformed messages. The DLQ is useless without monitoring; a DLQ that no one watches is just /dev/null with extra steps.
# Prometheus alert on DLQ growth
groups:
- name: iot-dlq
rules:
- alert: DlqGrowing
expr: rate(kafka_topic_partition_high_water[5m]{topic="iot.dlq"}) > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Messages landing in iot.dlq for {{ $value }} msg/sec"
runbook: https://runbooks.example.com/iot-dlq
- alert: DlqStuck
expr: kafka_consumergroup_lag{topic="iot.dlq"} > 1000
for: 1h
labels:
severity: ticket
annotations:
summary: "iot.dlq has unprocessed messages and is not being drained"
The DLQ consumer is a separate, low-throughput service whose job is: classify the message, file a ticket or auto-remediate, and either replay to the main topic or archive. The classify-and-route logic is the easiest place to use a small rules table because the failure modes are predictable.
Graceful Degradation: Modes, Not Failures
Instead of “up” and “down,” think in modes:
- Mode Green: Everything works. Hot path writes to time-series DB with sub-second latency. Real-time dashboards are live.
- Mode Yellow: Time-series DB is slow or unavailable. Kafka buffers normally. Dashboards show a “data delayed” indicator. Alert is raised, but no data is lost yet.
- Mode Orange: Kafka is degraded. Edge brokers buffer locally. Backhaul is queued. Sensor data is still being collected at the edge. No real-time analytics.
- Mode Red: Edge is disconnected. Sensors continue publishing to local broker. Local control logic still functions. WAN restoration triggers replay.
Each mode has explicit detection criteria, explicit user-visible effects, and an explicit recovery path. Document this. The on-call runbook should be “we’re in mode Orange because X; expect Y” not “stuff is broken.”
A Concrete Consumer With Backpressure
A Go consumer writing to TimescaleDB with bounded inflight, retries, and a DLQ:
// Kafka 3.5, franz-go v1.14, pgx v5
package main
import (
"context"
"encoding/json"
"errors"
"log/slog"
"time"
"github.com/jackc/pgx/v5/pgxpool"
"github.com/twmb/franz-go/pkg/kgo"
)
type Reading struct {
DeviceID string `json:"device_id"`
Metric string `json:"metric"`
Value float64 `json:"value"`
TS int64 `json:"ts"`
Quality int `json:"quality"`
}
const insertSQL = `
INSERT INTO sensor_readings (ts, device_id, metric, value, quality)
SELECT to_timestamp(t / 1000.0), d, m, v, q
FROM unnest($1::bigint[], $2::text[], $3::text[], $4::double precision[], $5::smallint[])
AS u(t, d, m, v, q)
ON CONFLICT (device_id, metric, ts) DO NOTHING
`
func writeBatch(ctx context.Context, db *pgxpool.Pool, batch []Reading) error {
ts := make([]int64, len(batch))
dev := make([]string, len(batch))
met := make([]string, len(batch))
val := make([]float64, len(batch))
qu := make([]int16, len(batch))
for i, r := range batch {
ts[i], dev[i], met[i], val[i], qu[i] = r.TS, r.DeviceID, r.Metric, r.Value, int16(r.Quality)
}
cctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
_, err := db.Exec(cctx, insertSQL, ts, dev, met, val, qu)
return err
}
func main() {
log := slog.Default()
ctx := context.Background()
db, _ := pgxpool.New(ctx, "postgres://writer@timescale/iot?pool_max_conns=8")
defer db.Close()
cl, _ := kgo.NewClient(
kgo.SeedBrokers("kafka-0:9092"),
kgo.ConsumerGroup("ts-writer"),
kgo.ConsumeTopics("iot.telemetry.normalized"),
kgo.DisableAutoCommit(),
kgo.FetchMaxBytes(20<<20),
)
defer cl.Close()
backoff := 100 * time.Millisecond
for {
f := cl.PollFetches(ctx)
batch := make([]Reading, 0, 1024)
dlq := make([]*kgo.Record, 0)
f.EachRecord(func(r *kgo.Record) {
var rd Reading
if err := json.Unmarshal(r.Value, &rd); err != nil {
dlq = append(dlq, &kgo.Record{Topic: "iot.dlq", Key: r.Key, Value: r.Value})
return
}
batch = append(batch, rd)
})
if len(batch) > 0 {
err := writeBatch(ctx, db, batch)
for err != nil {
log.Warn("ts write failed", "err", err, "backoff", backoff)
time.Sleep(backoff)
if backoff < 30*time.Second {
backoff *= 2
}
err = writeBatch(ctx, db, batch)
if errors.Is(err, context.Canceled) { return }
}
backoff = 100 * time.Millisecond
}
if len(dlq) > 0 {
cl.ProduceSync(ctx, dlq...)
}
if err := cl.CommitUncommittedOffsets(ctx); err != nil {
log.Error("commit", "err", err)
}
}
}
The exponential backoff up to 30s is what makes this consumer well-behaved when TimescaleDB is degraded. It doesn’t drop messages on the floor; it lets Kafka be the buffer and waits patiently. The ON CONFLICT DO NOTHING makes the write idempotent so consumer-side reprocessing is safe.
What to Measure
If I were starting an IIoT deployment tomorrow, these are the dashboards I’d build before anything else:
- End-to-end latency p50/p99/p999 — from
SourceTimestampat the sensor tonow()when written to TS DB. Histogrammed. - Drop rate per queue — Mosquitto bridge drops, Kafka producer drops, consumer DLQ rate. One panel per queue.
- Consumer lag, all consumers —
kafka_consumergroup_lagpanels per group. - Mode — single big indicator showing Green/Yellow/Orange/Red derived from the above.
If you can answer “what’s the worst lag in the system right now and where is it” in under 10 seconds, you’ve done the observability right.
Common Pitfalls
- Treating “no error logs” as “everything is fine.” Silent backpressure is the most common failure mode. Logs say nothing; latency p999 climbs from 200ms to 30s and your alert thresholds don’t catch it. Alert on percentiles, not error counts.
- Unbounded retry loops at every layer. Retries are useful at one layer; retries at every layer multiply load on the failing component and turn a recoverable incident into a self-DoS. Pick one layer to absorb retries (usually Kafka producer) and make every other layer fail-fast.
- DLQ that grows forever. A DLQ without a drain process is a memory leak. Either auto-replay after a fixed retention, or have a humans-in-the-loop process that drains it weekly.
- Sync writes on the hot path. A consumer that writes one row at a time to the TS DB is leaving 50x throughput on the floor. Batch, and use
COPYor array-based inserts as shown above. - No load test. Production behavior at 80% capacity is wildly different from behavior at 110%. Run a synthetic load test that pushes you over the cliff and see how the system fails. That failure mode is the one you ship.
Wrapping Up
Reliability in IIoT is a property of the seams between components, not the components themselves. Pick a drop policy for every buffer, alert on the metrics that show backpressure before it becomes data loss, and design for modes of degradation rather than binary up/down. That’s the closing thought for this August series — next month I’ll get into the analytics side of the same pipeline, starting with anomaly detection on streaming sensor data.