background-shape
EMQX 5.1 Clustering for Multi-Site IIoT Without the Foot-Guns
August 9, 2023 · 6 min read · by Muhammad Amal programming

TL;DR — EMQX 5’s core/replicant split is the architecture you actually want for IIoT, not the flat clusters from EMQX 4 / Keep cores small and stateful, push connection load onto replicants / Bridges and Rule Engine actions are where most production incidents originate, not the broker itself.

EMQX 5.1 has been out long enough now that I can talk about it without hedging on “but the 5.0 release notes said…” The clustering model — Mria with optional Khepri backend, plus the core/replicant role split — is one of the few genuine architectural improvements in the IIoT broker space this year. It also has enough new operational surface area to embarrass you if you treat it like EMQX 4.

This is the deployment pattern I’ve been using for multi-plant industrial customers in 2023: one logical EMQX cluster per region, federated via MQTT bridges, with rule-engine fan-out into Kafka and a time-series store. If you haven’t picked a broker yet, my comparison post covers when EMQX is the right choice in the first place.

Core vs Replicant: The Split That Matters

EMQX 5 nodes come in two roles:

  • Core nodes hold the routing metadata, session state, and cluster membership. They use Mria (an extended Mnesia) and replicate among themselves synchronously.
  • Replicant nodes are read replicas of the metadata, but handle client connections, TLS, and Rule Engine execution. They’re stateless from a cluster-coordination standpoint.

In practice: cores are the brain, replicants are the muscle. For a 300k-device deployment I run 3 core nodes (small, e.g. 4 vCPU / 8 GB) and 6–12 replicant nodes (large, 8–16 vCPU each). Adding capacity means adding replicants. Adding cores is a careful operation you do once at deployment and rarely after.

The Manifest That Survives a Cluster Upgrade

Here’s the EMQX Operator manifest I’m shipping with new deployments:

# emqx.yaml — EMQX Operator 2.2.x, EMQX 5.1.5
apiVersion: apps.emqx.io/v2beta1
kind: EMQX
metadata:
  name: emqx-prod
  namespace: iot
spec:
  image: emqx/emqx-enterprise:5.1.5
  imagePullPolicy: IfNotPresent
  config:
    data: |
      cluster.discovery_strategy = k8s
      cluster.k8s.apiserver = "https://kubernetes.default.svc:443"
      cluster.k8s.service_name = "emqx-prod-headless"
      cluster.k8s.address_type = hostname
      cluster.k8s.suffix = "svc.cluster.local"
      mqtt.max_packet_size = 1MB
      mqtt.max_clientid_len = 128
      mqtt.keepalive_multiplier = 1.5
      authentication = [
        {
          mechanism = password_based
          backend = http
          method = post
          url = "http://auth-svc.iot.svc.cluster.local/mqtt/auth"
          body { username = "${username}", password = "${password}", clientid = "${clientid}" }
        }
      ]
  coreTemplate:
    metadata:
      labels:
        app.kubernetes.io/component: core
    spec:
      replicas: 3
      resources:
        requests: { cpu: "2",   memory: "4Gi" }
        limits:   { cpu: "4",   memory: "8Gi" }
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: core
      volumeClaimTemplates:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-iops
        resources:
          requests:
            storage: 50Gi
  replicantTemplate:
    metadata:
      labels:
        app.kubernetes.io/component: replicant
    spec:
      replicas: 6
      resources:
        requests: { cpu: "4",   memory: "8Gi" }
        limits:   { cpu: "8",   memory: "16Gi" }
  listenersServiceTemplate:
    metadata:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
        service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    spec:
      type: LoadBalancer
      externalTrafficPolicy: Local

Three things in there that I have to argue for on every project:

  1. topologySpreadConstraints on cores. If two cores end up in the same AZ and that AZ goes, you’ve got a degraded cluster and probably a split-brain investigation. Spread them, hard.
  2. externalTrafficPolicy: Local. Preserves the source IP, which you need for both rate-limiting and any geo-aware auth. Yes, it costs you a tiny bit of load-balancing fairness.
  3. HTTP authentication backend. Don’t put auth in EMQX’s built-in user table. Keep it in your existing identity service. Cache aggressively at the EMQX side (authentication.cache.enable = true).

The EMQX 5.1 cluster docs cover the Mria internals if you want to understand what’s happening under the hood.

Rule Engine: The Place Where IIoT Actually Lives

The Rule Engine is what makes EMQX worth the operational tax. Every customer I’ve worked with has the same fan-out requirement: telemetry hits the broker, a hot copy goes to a real-time dashboard / alerting consumer, a warm copy lands in a time-series database, and a cold copy goes to object storage. Doing that without a rule engine means three sidecars.

-- Rule: high-frequency vibration data
-- Route raw payload to Kafka for stream processing,
-- downsampled to InfluxDB for dashboards
SELECT
  payload,
  topic,
  timestamp,
  clientid AS device_id
FROM
  "plant/+/machines/+/vibration"
WHERE
  payload.rms > 0.0

Action 1 — Kafka producer:

# Kafka bridge for raw vibration stream
bridges.kafka_producer.vibration_raw {
  bootstrap_hosts = "kafka-0:9092,kafka-1:9092,kafka-2:9092"
  topic = "iot.vibration.raw"
  message {
    key = "${device_id}"
    value = "${payload}"
  }
  producer {
    required_acks = all
    compression = snappy
    max_batch_bytes = 896KB
    partition_strategy = key_dispatch
  }
  ssl { enable = true, verify = verify_peer }
}

Action 2 — InfluxDB writer:

# InfluxDB 2.7 line-protocol writer
bridges.influxdb_v2.vibration_agg {
  server = "influxdb.iot.svc:8086"
  bucket = "vibration_5s"
  org = "plant-east"
  token = "${INFLUX_TOKEN}"
  write_syntax = "vibration,device=${device_id} rms=${payload.rms},peak=${payload.peak} ${timestamp}"
}

Two production lessons:

  • Set required_acks = all on Kafka. If you don’t, you’ll lose messages during broker leader elections and never notice until your data scientist asks why the histogram has gaps.
  • Use partition_strategy = key_dispatch keyed on device_id. This keeps a single device’s messages ordered through Kafka, which downstream stream processors will appreciate.

Multi-Site Federation

For multi-plant deployments I don’t try to stretch one cluster across regions. The Mria sync cost across a WAN with packet loss is not worth the operational simplicity it might look like on paper. Instead: one cluster per plant, MQTT bridges back to a central hub.

# Outbound bridge from a plant cluster to the regional hub
bridges.mqtt.to_hub {
  server = "mqtts://hub.iot.example.com:8883"
  username = "plant-east"
  password = "${HUB_PASSWORD}"
  clientid = "plant-east-bridge"
  ssl { enable = true, verify = verify_peer }

  ingress {
    remote { topic = "hub/plant-east/cmd/#", qos = 1 }
    local  { topic = "cmd/${1}",            qos = 1 }
  }
  egress {
    local  { topic = "plant/+/alerts/#" }
    remote { topic = "hub/plant-east/alerts/${1}/${2}", qos = 1 }
  }
}

The bridge is the thing that breaks. Alert on bridge state with the Prometheus exporter, and budget for the bridge queue to fill — set a max queue size with an overflow policy that drops oldest, because dropping the bridge entirely (and losing forward progress) is worse than losing the tail.

Common Pitfalls

  • Treating replicants like cores. Replicants don’t hold the source of truth. If you lose all your cores and only have replicants, you lose routing metadata and need to rebootstrap. Back up data/mnesia/ from at least one core node, daily.
  • Auth cache TTL too low. Default auth cache TTL is short. On a 100k-device cluster with HTTP auth backend, that’s a thundering herd against your auth service every reconnect storm. Set it to 5–10 minutes for sensor-class clients and have the auth service invalidate proactively if a credential is revoked.
  • Shared subscriptions without sticky strategy. $share/<group>/topic with the default hash strategy will reshuffle on every consumer change. For ordered processing pipelines use sticky or local strategies and accept the trade-offs.
  • Rule Engine SQL hot reloads not actually atomic. The dashboard makes it feel atomic. In a busy cluster a change to a high-throughput rule briefly drops messages while the new rule compiles on replicants. Stage rule changes during low-traffic windows or use the staged-disable-update-enable pattern.
  • mqtt.max_packet_size defaults. The default is generous, but in 2023 I still see deployments where someone sized it for 2 KB telemetry and then started sending firmware blobs over MQTT. Either route firmware over HTTP and just signal availability over MQTT, or raise the limit deliberately.

Wrapping Up

EMQX 5.1’s clustering is genuinely good, but it puts more responsibility on whoever owns the broker. Plan the core/replicant ratio for your worst-case connection storm, federate across sites rather than stretching one cluster, and treat the Rule Engine like production code with reviews and staging. Next post I’ll cover what happens after the broker — the Telegraf and InfluxDB pipeline that consumes all this telemetry.