Advanced MQTT Clustering with EMQX 5.8, A Production Guide
TL;DR — EMQX 5.8 (December 2024) finally made clustered persistent sessions usable out of the box. Use Mria with
core/replicantroles, turn on session persistence with RocksDB, use shared subscriptions to fan-out work, and budget for rolling upgrades. Three nodes is enough for most fleets. Five if you can’t tolerate any quorum loss.
EMQX has gone through enough redesigns that “MQTT clustering” depends entirely on what version you’re on. EMQX 4.x used Mnesia and was a pain. EMQX 5.0-5.5 introduced Mria but session persistence was experimental. EMQX 5.8, the December 2024 release, is the first I’d call genuinely production-ready for stateful workloads.
This post is for the engineer being told “we need MQTT clustering” by a product manager who doesn’t know what they’re asking for. We’ll cover the cluster topology, session persistence (the bit that actually breaks at 3am), shared subscriptions, and the upgrade dance.
If you’ve been following the series, the previous post on Go 1.24 telemetry processing consumed from a single broker. Here we’re standing up the broker tier that those consumers connect to.
1. The cluster topology that actually works
There are roughly two cluster shapes worth running. Pick by the failure mode you’re trying to survive.
Shape A: Symmetric 3-node Shape B: Core + Replicants
+--------+ +--------+ +--------+ +--------+ +--------+
| Core 1 |---| Core 2 |---| Core 3 | | Core 1 |---| Core 2 |
+--------+ +--------+ +--------+ +--------+ +--------+
\ | / | |
\ | / +--------+ +--------+
+- clients-+ | Repl 1 | | Repl 2 |
+--------+ +--------+
\ /
+- clients-+
Shape A (three cores) is the simplest. All nodes are equal, all replicate state, any one can fail and the others handle the load. This works up to about 200k concurrent connections per node before the replication overhead starts to hurt.
Shape B (cores plus replicants) splits the cluster: cores handle metadata and replication, replicants handle client connections. Replicants don’t participate in the consensus group, so adding more of them doesn’t slow down the cluster. This is what you want above ~500k concurrent clients.
For the rest of this post I’ll use Shape A because it covers 90% of real deployments.
2. Bootstrapping a 3-node cluster
I’ll use Docker Compose for the demo so you can run it on a laptop. The same configs apply to bare-metal or Kubernetes with minor adjustments.
# docker-compose.yml
version: "3.9"
services:
emqx1:
image: emqx/emqx:5.8.4
container_name: emqx1
hostname: emqx1
environment:
EMQX_NODE_NAME: emqx@emqx1
EMQX_CLUSTER__DISCOVERY_STRATEGY: static
EMQX_CLUSTER__STATIC__SEEDS: '[emqx@emqx1,emqx@emqx2,emqx@emqx3]'
EMQX_CLUSTER__CORE_NODES: '[emqx@emqx1,emqx@emqx2,emqx@emqx3]'
EMQX_DASHBOARD__DEFAULT_PASSWORD: "change-me-please"
EMQX_DURABLE_SESSIONS__ENABLE: "true"
EMQX_DURABLE_STORAGE__MESSAGES__BACKEND: "builtin_raft"
ports:
- "1883:1883"
- "8083:8083"
- "18083:18083"
volumes:
- emqx1_data:/opt/emqx/data
networks: [emqx_net]
emqx2:
image: emqx/emqx:5.8.4
container_name: emqx2
hostname: emqx2
environment:
EMQX_NODE_NAME: emqx@emqx2
EMQX_CLUSTER__DISCOVERY_STRATEGY: static
EMQX_CLUSTER__STATIC__SEEDS: '[emqx@emqx1,emqx@emqx2,emqx@emqx3]'
EMQX_CLUSTER__CORE_NODES: '[emqx@emqx1,emqx@emqx2,emqx@emqx3]'
EMQX_DURABLE_SESSIONS__ENABLE: "true"
EMQX_DURABLE_STORAGE__MESSAGES__BACKEND: "builtin_raft"
volumes:
- emqx2_data:/opt/emqx/data
networks: [emqx_net]
emqx3:
image: emqx/emqx:5.8.4
container_name: emqx3
hostname: emqx3
environment:
EMQX_NODE_NAME: emqx@emqx3
EMQX_CLUSTER__DISCOVERY_STRATEGY: static
EMQX_CLUSTER__STATIC__SEEDS: '[emqx@emqx1,emqx@emqx2,emqx@emqx3]'
EMQX_CLUSTER__CORE_NODES: '[emqx@emqx1,emqx@emqx2,emqx@emqx3]'
EMQX_DURABLE_SESSIONS__ENABLE: "true"
EMQX_DURABLE_STORAGE__MESSAGES__BACKEND: "builtin_raft"
volumes:
- emqx3_data:/opt/emqx/data
networks: [emqx_net]
volumes:
emqx1_data:
emqx2_data:
emqx3_data:
networks:
emqx_net:
driver: bridge
Bring it up:
docker compose up -d
sleep 15
docker exec emqx1 emqx ctl cluster status
# Cluster status: #{running_nodes => ['emqx@emqx1','emqx@emqx2','emqx@emqx3'],
# stopped_nodes => []}
The builtin_raft backend for durable storage is a 5.8 addition. Earlier versions used builtin_local which only persisted on the local node. Raft replicates session state across the cluster.
3. Persistent sessions, the thing that breaks if you skip it
MQTT sessions persist subscriptions and queued QoS 1/2 messages for clients that reconnect. In a cluster, the question is: when a client reconnects to a different node, can it still find its session?
EMQX 5.8 answers yes, if you configure it correctly.
3.1 Server-side configuration
You set this once cluster-wide. The durable_sessions block in emqx.conf (or via env vars as above).
# emqx.conf
durable_sessions {
enable = true
batch_size = 100
idle_poll_interval = 100ms
heartbeat_interval = 5000ms
session_gc_interval = 10m
}
durable_storage {
messages {
backend = builtin_raft
n_shards = 16
replication_factor = 3
}
}
n_shards = 16 splits the durable storage into 16 Raft groups. Each group has its own quorum. This is critical for write throughput: a single Raft group caps out around 5k ops/sec. With 16 shards you get roughly 16x that.
3.2 Client-side, MQTT 5 with session expiry
# mqtt5_client.py
import paho.mqtt.client as mqtt
from paho.mqtt.properties import Properties
from paho.mqtt.packettypes import PacketTypes
props = Properties(PacketTypes.CONNECT)
props.SessionExpiryInterval = 3600 # 1 hour
client = mqtt.Client(
callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
client_id="device-001",
protocol=mqtt.MQTTv5,
)
client.connect("emqx1", 1883, keepalive=30, clean_start=False, properties=props)
client.subscribe("device/001/cmd", qos=1)
client.loop_forever()
The combination of clean_start=False and SessionExpiryInterval > 0 is what tells EMQX to persist the session. Without it, your messages evaporate the moment the client disconnects.
4. Shared subscriptions for fan-out workers
Shared subscriptions ($share/group/topic) let multiple consumers share the load of a high-volume topic. EMQX uses a round-robin or hash strategy to spread messages.
+-----------------+
| Topic: ingest/+ |
+--------+--------+
|
+---------------+---------------+
| | |
$share/etl/ingest/+ ... (one |
| group) |
+-------+-------+-------+-------+ |
| | | | | ...
etl-1 etl-2 etl-3 etl-4 etl-5
each gets ~1/5 of messages
Configure load balancing strategy:
broker {
shared_subscription_strategy = random
# options: random, round_robin, sticky, hash_clientid, hash_topic
}
For most workloads, random is the right default. sticky is what you want if individual workers cache per-topic state. hash_topic is for sharded processing where each topic must always go to the same worker.
// Worker subscribing to a shared subscription
package main
import (
"context"
"log"
"net"
"github.com/eclipse/paho.golang/paho"
)
func main() {
conn, err := net.Dial("tcp", "emqx1:1883")
if err != nil { log.Fatal(err) }
c := paho.NewClient(paho.ClientConfig{Conn: conn})
_, err = c.Connect(context.Background(), &paho.Connect{
ClientID: "etl-worker-1",
KeepAlive: 30,
CleanStart: false, // persistent session
Properties: &paho.ConnectProperties{
SessionExpiryInterval: uint32Ptr(3600),
},
})
if err != nil { log.Fatal(err) }
_, err = c.Subscribe(context.Background(), &paho.Subscribe{
Subscriptions: []paho.SubscribeOptions{{
Topic: "$share/etl/ingest/+",
QoS: 1,
}},
})
if err != nil { log.Fatal(err) }
select {}
}
func uint32Ptr(v uint32) *uint32 { return &v }
Start five of these and they’ll split the ingest/+ topic between them. Stop two and the remaining three pick up the slack.
5. Rolling upgrades without dropping clients
The 5.x line supports rolling upgrades between minor versions (e.g., 5.8.3 to 5.8.4) without restarting the cluster as a whole. The pattern is: drain, upgrade, rejoin, repeat.
5.1 Drain a node
# Tell the load balancer to stop sending new connections to emqx1
# (Depends on your LB; for HAProxy, set the backend to "drain")
# Force existing clients to disconnect and reconnect (they'll land on a different node)
docker exec emqx1 emqx ctl broker kick all
# Or, more graceful:
docker exec emqx1 emqx ctl listeners stop tcp:default
# This stops accepting new connections; existing keep working until they disconnect
5.2 Upgrade the node
docker compose stop emqx1
# Edit docker-compose.yml: bump image from 5.8.3 to 5.8.4
docker compose up -d emqx1
# Wait for it to rejoin
docker exec emqx1 emqx ctl cluster status
5.3 Rinse and repeat
Move to the next node. Always wait for full sync between nodes before draining the next one.
The dangerous moment is between a major version upgrade (5.8 -> 5.9, hypothetically). Cluster protocol changes can require a coordinated cutover. Check the EMQX release notes before any major upgrade.
6. Production monitoring
EMQX exposes Prometheus metrics on /api/v5/prometheus. The metrics that actually matter:
# Hot list, monitor these closely
emqx_connections_count # current concurrent clients
emqx_messages_received_rate # ingress rate
emqx_messages_dropped # if this climbs, you have a problem
emqx_durable_sessions_count # persistent session count
emqx_messages_qos1_received # at-least-once volume
emqx_node_running # 1/0 per node (alert on 0)
emqx_cluster_nodes_running # quorum check
emqx_vm_memory_proc_used # Erlang process memory
emqx_vm_memory_total # total VM memory
Set up an alert for emqx_cluster_nodes_running < n_cores - 1. If you’ve lost more than one node, you’re one failure from losing quorum.
7. Common Pitfalls
Pitfall 1, single-shard durable storage
The default n_shards = 12 is usually right, but I’ve seen people set it to 1 “to keep things simple.” That caps your durable write throughput at one Raft group’s worth, around 5k ops/sec. Bump it to at least 12 and align with your node count.
Pitfall 2, mixing clean_start=true with persistent session expectations
If any client connects with clean_start=true, the broker wipes its session, including queued messages. I’ve seen libraries default to true and silently throw away QoS 1 deliveries. Always be explicit, and audit your client config.
Pitfall 3, running cores across network partitions
Erlang’s distribution protocol assumes low latency between cores. Running cores in two different cloud regions is a recipe for split brain. Keep cores in one AZ (or at most one region) and use MQTT bridges between regions if you need cross-region replication.
Pitfall 4, no rate limiting on the ingress listener
A misbehaving client can publish a million messages per second and DoS your cluster. EMQX has per-client rate limits, configure them.
listeners.tcp.default {
bind = "0.0.0.0:1883"
max_connections = 1000000
limiter {
messages = 1000 # messages per second per client
bytes = 1048576 # 1 MB/s per client
}
}
8. Troubleshooting
“Node not in cluster” after restart
Almost always a node name mismatch. EMQX uses the long-form node name emqx@hostname. If the hostname changes between restarts (e.g., a Kubernetes pod with a new IP), the node can’t rejoin. Use a stable DNS name or EMQX_NODE_NAME env var that doesn’t change.
Durable session count goes up but never down
Session GC is configured by durable_sessions.session_gc_interval. The default is 10 minutes. If your clients use SessionExpiryInterval > 0, sessions persist until that expires, even after client disconnect. That’s by design. If you’re seeing unbounded growth, check the session expiry on the client side.
Clients reconnect successfully but don’t receive queued messages
The session reconnected without the broker recognizing it. This is usually because client_id is randomized on the client side (some libraries do this). Pin your client_id to something stable, like the device serial number.
9. Capacity planning, brief notes
I’ve found two heuristics that get you close to the right hardware spec without a formal load test.
Connections per core. An EMQX node on modern x86 (Xeon Gold or Ryzen 7) handles roughly 100k idle MQTT 5 connections per core, dropping to about 25k connections per core when each client is publishing at 10 msg/sec. ARM (Graviton, Ampere) is in the same ballpark. So a 16-core node handles 400k active clients comfortably. Above that, add nodes, not cores.
RAM per session. Each persistent session costs about 8 KB plus the size of any queued messages. A million persistent sessions with no backlog is about 8 GB of RAM. A million sessions with 10 queued messages each at 200 bytes is closer to 10 GB. Provision RAM accordingly and leave headroom for the Erlang VM (about 2 GB baseline) and your OS.
Disk for durable storage scales with retention. The Raft log keeps the last N hours of messages, replicated replication_factor times. A modest 5k msg/sec at 200 bytes each, retained for 24 hours, replicated 3x: that’s 5000 * 200 * 86400 * 3 = ~260 GB. Use NVMe; spinning rust will be the bottleneck.
10. Wrapping Up
EMQX 5.8 is the first version I’d recommend without caveats for stateful production workloads. The Raft-backed durable storage is fast enough, the operations are sane, and clustering works the way you’d expect. Stand up three cores, enable durable sessions, use shared subscriptions for your worker pools, and budget for rolling upgrades. Three nodes for most fleets, five if quorum loss is unacceptable.
The next post will move from the broker out to the model serving side, looking at ONNX Runtime 1.20 on edge devices. The broker is fed by sensors and feeds inference workers; we’ve now got the middle layer figured out.
EMQX’s official docs are at emqx.io/docs/en/v5.8 and the Mria deep-dive paper is genuinely worth reading.