Optimizing MQTT Clusters for Critical Environmental Monitoring
TL;DR — A single EMQX node is fine until it isn’t / Cluster routing, backpressure, and session storage are the three knobs that decide whether you lose data / Tune them before an air-quality alert gets dropped, not after.
Environmental monitoring has a property that makes it unforgiving: the messages you care about most are the rare ones. A particulate sensor reporting a normal reading every ten seconds is noise. The one reading that crosses a regulatory threshold is the entire reason the system exists. If your broker drops that message under load, the deployment has failed at the only job that mattered.
I ran into this on a network of roughly 4,000 air-quality and water-quality sensors feeding a regional monitoring authority. The first version ran on a single EMQX node. It worked for months, then a firmware rollout caused every sensor to reconnect inside a 90-second window. The node spent its CPU on TLS handshakes, the inflight queues backed up, and QoS 1 messages started timing out. Nobody lost regulatory data that day, but it was luck, not design.
This post is about turning luck into design. We’ll build a three-node EMQX 5.8 cluster, walk through the routing internals that decide where a message goes, tune backpressure so a slow consumer can’t take down the broker, and configure session persistence so a node failure doesn’t drop inflight messages. The companion piece on Go sensor event processing covers the consumer side; here we stay inside the broker.
Why a Cluster, and What It Actually Buys You
A common misconception is that an EMQX cluster is a load balancer with a shared brain. It isn’t. Each node owns its own TCP connections and its own sessions. What the cluster shares is the routing table: a distributed map from topic filters to the nodes that have subscribers for them.
When a publisher on node A sends to env/sector-7/pm25, node A consults the routing table, sees that node C has a matching subscriber, and forwards the message over the inter-node Erlang distribution channel. There’s no central coordinator. This matters for two reasons. First, cross-node forwarding has a cost, and topic design directly affects how much of it happens. Second, the routing table is replicated, so a node join or leave triggers table reconciliation — a real CPU event during the exact firmware-rollout scenario that bit me.
Cluster gives you three things that matter for environmental monitoring: connection capacity (more nodes, more concurrent sensors), fault tolerance (a node dies, its sensors reconnect elsewhere), and rolling upgrades (drain one node at a time without a maintenance window). It does not give you free horizontal throughput for a single hot topic — that’s still bounded by the node hosting the subscriber.
Standing Up the Cluster
EMQX 5.8 clusters via static node discovery or DNS. For a fixed sensor deployment I prefer static — it’s predictable and there’s no extra service to fail. Here’s cluster.hocon for node 1 of 3.
# /etc/emqx/cluster.hocon
node {
name = "[email protected]"
cookie = "env-monitor-prod-cookie-rotate-me"
data_dir = "/var/lib/emqx"
# Reserve schedulers; leave headroom for the OS and TLS offload
process_limit = 2097152
max_ports = 1048576
}
cluster {
name = emqxcl
discovery_strategy = static
static {
seeds = ["[email protected]", "[email protected]", "[email protected]"]
}
# core-replicant split: cores hold routing state, replicants only serve connections
core_nodes = ["[email protected]", "[email protected]", "[email protected]"]
}
# Inter-node distribution port range must be open in the firewall
rpc {
port_discovery = manual
tcp_server_port = 5370
tcp_client_num = 8
}
A note on cookie: it’s the Erlang distribution shared secret. If it differs between nodes, they silently fail to cluster and you get no obvious error — just two single-node clusters that think they’re alone. Set it identically everywhere and treat it like a credential.
The core_nodes / replicant split is the single most useful EMQX 5.x feature for this workload. Core nodes participate in the replicated database (routing, sessions). Replicant nodes connect to cores, serve client connections, but don’t carry database replication overhead. For 4,000 sensors three cores is plenty; if you scale to 40,000 you add replicants and leave the core count at three or five.
Bring up the listener with sane limits. The defaults assume a chat workload, not a sensor fleet that all reconnects at once.
# listeners.ssl.default in cluster.hocon
listeners.ssl.default {
bind = "0.0.0.0:8883"
max_connections = 50000
# Accept rate limiter: cap new TLS handshakes per second per node.
# This is the knob that saved me during the firmware-rollout storm.
max_conn_rate = 500
ssl_options {
certfile = "/etc/emqx/certs/server.pem"
keyfile = "/etc/emqx/certs/server.key"
cacertfile = "/etc/emqx/certs/ca.pem"
verify = verify_peer
fail_if_no_peer_cert = true
# Session tickets cut reconnect-storm CPU dramatically
reuse_sessions = true
}
tcp_options {
backlog = 1024
nodelay = true
send_timeout = "15s"
}
}
max_conn_rate is the difference between a reconnect storm being a non-event and being an outage. At 500 handshakes/sec per node, 4,000 sensors across three nodes drain in under three seconds, and the CPU spent on TLS is bounded.
Topic Design Drives Routing Cost
Before tuning the broker, fix the topic tree. Routing cost in a cluster is dominated by two things: the number of distinct topics in the routing table, and how many wildcard subscriptions have to be evaluated per publish.
Use a hierarchical, predictable structure with the high-cardinality identifier deep in the tree:
env/{region}/{site}/{sensor_type}/{device_id}/measurement
env/jawa-barat/site-12/pm25/dev-0a3f/measurement
env/jawa-barat/site-12/co2/dev-0a3f/measurement
Backend consumers subscribe with shared subscriptions so the broker load-balances delivery across consumer instances:
$share/ingest/env/+/+/+/+/measurement
The $share/ingest/ prefix is MQTT 5.0 shared subscriptions. Without it, every consumer instance gets every message and you’ve built a fan-out, not a load balancer. With it, EMQX round-robins messages across the group named ingest. For environmental monitoring I run the ingest group at six members across three consumer hosts.
One rule that pays off: never put a device ID in a topic level that consumers wildcard over individually. A subscription like env/+/+/+/dev-0a3f/measurement per device means 4,000 routing entries. A single shared wildcard means one. The routing table goes from 4,000 entries to a handful.
Backpressure, the Part Everyone Skips
Backpressure is what stops a slow subscriber from turning into broker-wide memory exhaustion. EMQX implements it per-session through the inflight window and the message queue. Get these wrong and a single misbehaving consumer triggers an out-of-memory kill that takes a whole node’s connections down with it.
Configure zone-level overrides for the consumer-facing path:
# mqtt zone defaults — apply cluster-wide
mqtt {
max_inflight = 32
max_mqueue_len = 10000
mqueue_store_qos0 = false
# Drop oldest when the queue is full instead of dropping new messages.
# For monitoring, the freshest reading wins.
mqueue_default_priority = lowest
max_packet_size = "1MB"
retry_interval = "20s"
# Idle sensors that never PINGREQ get reaped
keepalive_multiplier = 1.5
}
# Force-shed flow control per connection
force_shutdown {
enable = true
max_mailbox_size = 100000
max_heap_size = "64MB"
}
Three decisions here are worth defending. max_inflight = 32 keeps the per-session memory bounded while still pipelining enough QoS 1 traffic to saturate a healthy consumer. mqueue_store_qos0 = false means QoS 0 messages are never queued for an offline session — for monitoring, a QoS 0 reading is a routine sample and a stale one is worthless. force_shutdown.max_heap_size is the circuit breaker: if a session’s process heap blows past 64 MB, EMQX kills that one session instead of letting it OOM the node.
Set a global overload protection threshold so the node sheds load before it falls over:
overload_protection {
enable = true
backoff_delay = "1s"
backoff_gc = true
backoff_hibernation = true
backoff_new_conn = true
}
When CPU saturation is detected, EMQX delays new connections and forces garbage collection rather than accepting work it can’t complete. During the firmware storm, this is what kept the node responsive enough to actually serve the existing sensors.
Session Persistence Across Node Failure
In EMQX 5.8, durable sessions store subscriptions and inflight QoS 1/2 messages in the replicated database, so a session survives the death of the node that hosted it. The client reconnects to a surviving node, the session is found, and inflight messages resume. This is the feature that turns “node failure drops data” into “node failure is a reconnect”.
session_persistence {
enable = true
# Messages live in the durable store until acked or expired
message_retention_period = "2h"
# Flush batched writes; trade a little latency for throughput
storage {
builtin_raft {
n_sites = 3
replication_factor = 3
egress {
batch_size = 1000
flush_interval = "100ms"
}
}
}
}
replication_factor = 3 means every durable message is on all three cores; a single node loss never loses a persisted message. The egress batching trades up to 100 ms of write latency for a large throughput gain — acceptable when the alternative is a fsync per message.
Pair this with client-side Session Expiry Interval in the MQTT 5.0 CONNECT packet. Sensors should request something like 4 hours so a sensor that drops off briefly keeps its durable session and its queued alerts:
# paho-mqtt 2.1.0 — sensor side
import paho.mqtt.client as mqtt
from paho.mqtt.properties import Properties
from paho.mqtt.packettypes import PacketTypes
connect_props = Properties(PacketTypes.CONNECT)
connect_props.SessionExpiryInterval = 14400 # 4 hours
client = mqtt.Client(
client_id="dev-0a3f",
protocol=mqtt.MQTTv5,
callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
)
client.tls_set(ca_certs="/etc/sensor/ca.pem")
client.connect("mqtt.env-monitor.local", 8883,
clean_start=False, properties=connect_props)
clean_start=False plus a non-zero expiry is the contract: the broker keeps the session and its inflight alerts even across a broker node failure.
Verifying the Cluster Behaves
Don’t trust a config until you’ve measured it. EMQX exposes a Prometheus endpoint and a CLI. Check routing table size and cross-node forwarding:
# Cluster status and per-node connection counts
emqx ctl cluster status
emqx ctl listeners
# Routing table size — should be small if topic design is right
emqx ctl topics list | wc -l
# Live metrics worth watching
curl -s http://localhost:18083/api/v5/prometheus/stats \
| grep -E 'emqx_routes_count|emqx_messages_dropped|emqx_messages_forward'
Load-test the reconnect storm explicitly with emqtt-bench before going live:
# Simulate 4000 sensors connecting; watch max_conn_rate hold the line
emqtt_bench conn -h mqtt.env-monitor.local -p 8883 \
-c 4000 -i 2 --ssl --cacertfile /etc/emqx/certs/ca.pem
If emqx_messages_dropped stays at zero through that test and emqx_routes_count stays in the dozens rather than the thousands, the design is sound.
Common Pitfalls
- Mismatched Erlang cookie. Nodes silently refuse to cluster. Always verify with
emqx ctl cluster statusshowing all three nodes, not just one. - Per-device subscriptions. Putting device IDs in subscriber wildcards explodes the routing table. Use shared subscriptions over a wildcard instead.
- Forgetting
max_conn_rate. Without it, a reconnect storm is a CPU spike that cascades into QoS timeouts. It’s the cheapest insurance in this entire config. clean_start=Trueon sensors. This throws away the durable session on every reconnect, defeating session persistence entirely. The alert queued during a brief dropout is gone.- Treating a cluster as throughput scaling for one topic. A single hot topic is still bottlenecked on the node hosting its subscriber. Shared subscriptions spread the load; more nodes alone don’t.
- No overload protection. A node under CPU pressure that keeps accepting connections degrades for everyone. Enable
overload_protectionso it sheds gracefully.
Troubleshooting
Symptom: New nodes show as separate single-node clusters.
Cause: Erlang cookie or RPC port mismatch between nodes.
Fix: Confirm identical node.cookie and an open rpc.tcp_server_port (5370) in the firewall. Restart and check emqx ctl cluster status.
Symptom: QoS 1 publishes time out under load even though CPU has headroom.
Cause: max_inflight too low, so the pipeline stalls waiting for PUBACKs across a high-latency link.
Fix: Raise max_inflight toward 64 and verify network RTT between sensors and broker; the inflight window must cover the bandwidth-delay product.
Symptom: A node OOM-kills under a slow consumer.
Cause: Unbounded message queue for the slow session.
Fix: Set force_shutdown.max_heap_size and max_mqueue_len; the offending session is killed instead of the node.
Symptom: Messages lost after a node restart despite session persistence enabled.
Cause: session_persistence.enable = false, or clients connecting with clean_start=True.
Fix: Enable durable sessions on the broker and set clean_start=False with a non-zero SessionExpiryInterval on every sensor.
Symptom: emqx_routes_count in the tens of thousands.
Cause: Per-device or non-shared wildcard subscriptions.
Fix: Move backend consumers to $share/ shared subscriptions over a single wildcard.
Wrapping Up
A well-tuned EMQX 5.8 cluster makes node failure a reconnect and a reconnect storm a non-event — but only if topic design, backpressure, and durable sessions are configured together, not bolted on after the first incident. Measure the routing table size and dropped-message counter under a simulated storm before you trust the deployment. Next, look at QoS and persistence tuning to decide exactly which messages are worth the durability cost.
See the EMQX clustering documentation for the full set of discovery strategies and core-replicant guidance.