n8n Queue Mode with Redis at Scale: A Production Walkthrough
TL;DR: n8n queue mode on Redis 7.4 scales linearly to about 5000 jobs per minute on a single Redis node. Past that, add Sentinel for HA and partition queues for throughput. Persistence is RDB plus AOF with appendfsync everysec. Watch the worker concurrency dial more than anything else.
The default Redis you get from docker run redis is fine for development and small production deployments up to maybe a thousand executions per minute. Past that, things start to bite. Bull’s blocking pop becomes a bottleneck. RDB-only persistence drops jobs on a crash. A single Redis node becomes the single point of failure for the whole n8n cluster.
This article is the operational manual for running queue mode at real production scale. We’ll size the Redis instance, configure persistence properly, set up Sentinel for failover, partition queues for throughput beyond what one Redis can handle, and walk through the failure modes. The architecture builds on the advanced n8n architecture article but goes deeper on the broker side.
Opinions stated plainly. Don’t run Redis on the same host as n8n. Don’t use Redis Cluster for Bull; Sentinel is the right HA pattern. Don’t disable persistence because “the queue is ephemeral”; you’ll lose in-flight jobs on every crash. Pin Redis to a specific minor version and treat upgrades like database upgrades.
1. Sizing the Redis Instance
Bull is light on memory but chatty. Each waiting job is a small hash, maybe 500 bytes. Each delayed job is similar. Active jobs live until completion. Completed and failed jobs are kept for a configured window.
A rough memory budget:
memory_bytes = (waiting + delayed + active) * 500
             + (completed_window + failed_window) * 500
             + queue_overhead_bytes
For 10K jobs per minute with an average duration of 5 seconds, you have around 800 active at any time (about 167 jobs per second times 5 seconds). Add a 1K-completed retention window and a 1K-failed window and you’re at roughly 1.5MB of job data. The actual memory pressure comes from Bull’s job metadata in Redis hashes and the script execution overhead, which together push it to maybe 50MB at that throughput. Redis with 1GB of memory is comfortable.
The CPU side is more interesting. Bull issues a Lua script on every job state transition. At 10K jobs per minute that’s roughly 600 script calls per second if you assume three to four transitions per job (add, move to active, complete, plus the occasional retry). Redis 7.4 handles this on a single core without breaking a sweat, but you want to leave headroom for replication and AOF rewrites.
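The memory budget and the script-call estimate above are easy to turn into a small calculator. This is a minimal sketch, not a benchmark: the 500-byte job size comes from the text, the 3.6 calls-per-job factor is an assumption chosen so that 10K jobs per minute lands near 600 calls per second, and queue_overhead_bytes is left out because it has to be measured. All function and field names are made up for illustration.

```typescript
// Rough Redis sizing for a Bull queue. Constants are estimates, not measurements.

interface QueueLoad {
  jobsPerMinute: number;
  avgDurationSeconds: number;
  waiting: number;         // jobs queued but not yet picked up
  delayed: number;         // jobs scheduled for later
  completedWindow: number; // retained completed jobs
  failedWindow: number;    // retained failed jobs
}

const BYTES_PER_JOB = 500;        // per-job estimate from the text
const SCRIPT_CALLS_PER_JOB = 3.6; // assumption: roughly 3 to 4 transitions per job

// Little's law: jobs in flight = arrival rate * time in system.
function activeJobs(load: QueueLoad): number {
  return (load.jobsPerMinute / 60) * load.avgDurationSeconds;
}

// Job data only; queue overhead and script overhead are not included.
function jobDataBytes(load: QueueLoad): number {
  const live = load.waiting + load.delayed + activeJobs(load);
  const retained = load.completedWindow + load.failedWindow;
  return Math.round((live + retained) * BYTES_PER_JOB);
}

function scriptCallsPerSecond(load: QueueLoad): number {
  return Math.round((load.jobsPerMinute / 60) * SCRIPT_CALLS_PER_JOB);
}

const example: QueueLoad = {
  jobsPerMinute: 10_000,
  avgDurationSeconds: 5,
  waiting: 0,
  delayed: 0,
  completedWindow: 1_000,
  failedWindow: 1_000,
};

console.log("active jobs:", Math.round(activeJobs(example)));
console.log("job data bytes:", jobDataBytes(example));
console.log("script calls per second:", scriptCallsPerSecond(example));
```

Plug in your own retention windows first; on a queue with generous retention they usually outweigh the in-flight jobs in this estimate.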
Production sizing I use as a starting point:
+---------------------+--------+---------+--------+
| jobs/minute target  | RAM    | CPU     | disk   |
+---------------------+--------+---------+--------+
| up to 2,000         | 2 GB   | 2 cores | 20 GB  |
| up to 10,000        | 4 GB   | 4 cores | 50 GB  |
| up to 50,000        | 16 GB  | 8 cores | 200 GB |
+---------------------+--------+---------+--------+
The disk sizing accounts for AOF growth between rewrites. If you tune auto-aof-rewrite-percentage aggressively, you can shrink the budget.
2. Persistence Configuration
Redis offers RDB snapshots and AOF (append-only file). For Bull, you want both. RDB gives you a fast restore on cold start. AOF gives you durability between snapshots.
# /etc/redis/redis.conf
appendonly yes
appendfsync everysec
appendfilename "appendonly.aof"
aof-use-rdb-preamble yes
save 900 1
save 300 10
save 60 10000
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
maxmemory 4gb
maxmemory-policy noeviction
Three things worth flagging. appendfsync everysec is the right default for queues. It loses at most one second of jobs on a crash. appendfsync always is overkill for n8n and halves throughput. appendfsync no is a footgun.
maxmemory-policy noeviction is critical. Stock Redis defaults to noeviction, but some managed services and packaged configs override it with an LRU policy such as allkeys-lru, which will silently evict your queue keys when memory fills up. Set noeviction explicitly and let the queue back-pressure n8n instead.
The aof-use-rdb-preamble yes setting writes a compact RDB-format snapshot as the AOF preamble, then appends operations after. Restart times drop dramatically on large AOF files.
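A preflight check over redis.conf catches a wrong eviction policy or missing AOF before the instance takes traffic. The sketch below is an illustration only: parseRedisConf and checkPersistence are hypothetical helpers, not part of Redis or n8n, and the parser handles only simple "directive value" lines. It also won't see settings changed at runtime with CONFIG SET.

```typescript
// Preflight check for the persistence settings discussed above.
// Both helpers are illustrative; they are not Redis or n8n tooling.

type RedisConf = Map<string, string[]>;

// Parses "directive value..." lines, skipping blanks and # comments.
function parseRedisConf(text: string): RedisConf {
  const conf: RedisConf = new Map();
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const parts = line.split(/\s+/);
    const key = (parts.shift() ?? "").toLowerCase();
    const values = conf.get(key) ?? [];
    values.push(parts.join(" "));
    conf.set(key, values);
  }
  return conf;
}

function checkPersistence(conf: RedisConf): string[] {
  const problems: string[] = [];
  // When a directive repeats, the last occurrence wins.
  const last = (key: string): string | undefined => {
    const values = conf.get(key);
    return values ? values[values.length - 1] : undefined;
  };

  if (last("appendonly") !== "yes") problems.push("appendonly should be yes");
  // Redis itself defaults appendfsync to everysec, so a missing line is fine.
  if ((last("appendfsync") ?? "everysec") !== "everysec") {
    problems.push("appendfsync should be everysec");
  }
  // Required explicitly here, because packaged defaults can differ.
  if (last("maxmemory-policy") !== "noeviction") {
    problems.push("maxmemory-policy should be noeviction");
  }
  if (last("maxmemory") === undefined) problems.push("maxmemory is not set");
  // save "" disables RDB snapshots, so it does not count as a save point.
  const savePoints = (conf.get("save") ?? []).filter((v) => v !== "" && v !== '""');
  if (savePoints.length === 0) problems.push("no RDB save points configured");
  return problems;
}
```

An empty result means the file matches the settings in this section; anything else is a list of human-readable problems you can fail a deploy on.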
3. The Bull-Specific Tuning
Bull supports a few Redis options that matter at scale.
# n8n env mapping for Bull
QUEUE_BULL_REDIS_HOST=redis-master.svc.cluster.local
QUEUE_BULL_REDIS_PORT=6379
QUEUE_BULL_REDIS_DB=0
QUEUE_BULL_REDIS_PASSWORD=...
# Bull-specific
QUEUE_BULL_PREFIX=bull
QUEUE_RECOVERY_INTERVAL=60
QUEUE_HEALTH_CHECK_ACTIVE=true
QUEUE_WORKER_LOCK_DURATION=30000
QUEUE_WORKER_LOCK_RENEW_TIME=15000
QUEUE_WORKER_STALLED_INTERVAL=30000
QUEUE_WORKER_MAX_STALLED_COUNT=1
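QUEUE_WORKER_LOCK_DURATION is the lock lease in milliseconds, and QUEUE_WORKER_LOCK_RENEW_TIME is the heartbeat interval: each worker renews its lock on an active job every LOCK_RENEW_TIME. If a worker crashes, the lock expires after LOCK_DURATION and another worker can pick the job up. Keep the renew time well under the lock duration (half, as here, is the usual ratio) so one late heartbeat doesn’t cost the lock. A long-running job is fine as long as the worker keeps renewing; the risk is a worker too busy to renew in time, for example during a heavy step in a workflow that occasionally takes 5 minutes. For those workflows, raise both values, otherwise Bull will mark the job stalled and retry it on a different worker while the original is still working.
QUEUE_WORKER_MAX_STALLED_COUNT=1 is conservative. A stalled job is re-processed once; if it stalls again it goes to failed. The alternative is endless reruns of a job that crashes the worker.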
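The relationship between the lock settings can be checked mechanically. The sketch below is a rough sanity check, not Bull’s own logic: checkLockSettings and worstCaseRecoveryMs are hypothetical helpers, and the recovery bound assumes a crashed worker’s job is picked up on the first stalled-job scan after its lock expires.

```typescript
// Sanity checks for the worker lock settings above (all values in milliseconds).
// Hypothetical helpers for illustration; Bull does not ship these.

interface LockSettings {
  lockDurationMs: number;    // QUEUE_WORKER_LOCK_DURATION
  lockRenewTimeMs: number;   // QUEUE_WORKER_LOCK_RENEW_TIME
  stalledIntervalMs: number; // QUEUE_WORKER_STALLED_INTERVAL
  maxStalledCount: number;   // QUEUE_WORKER_MAX_STALLED_COUNT
}

function checkLockSettings(s: LockSettings): string[] {
  const warnings: string[] = [];
  if (s.lockRenewTimeMs >= s.lockDurationMs) {
    warnings.push("renew time must be shorter than lock duration");
  } else if (s.lockRenewTimeMs > s.lockDurationMs / 2) {
    warnings.push("renew time above half the lock duration leaves no room for a late heartbeat");
  }
  if (s.maxStalledCount < 0) {
    warnings.push("max stalled count cannot be negative");
  }
  return warnings;
}

// Rough upper bound on how long a crashed worker's job waits for recovery:
// the lock has to expire, then the next stalled-job scan has to run.
function worstCaseRecoveryMs(s: LockSettings): number {
  return s.lockDurationMs + s.stalledIntervalMs;
}

const articleSettings: LockSettings = {
  lockDurationMs: 30_000,
  lockRenewTimeMs: 15_000,
  stalledIntervalMs: 30_000,
  maxStalledCount: 1,
};

console.log(checkLockSettings(articleSettings));
console.log(worstCaseRecoveryMs(articleSettings), "ms until a crashed job is retried, at worst");
```

With the values from this section the bound works out to a minute, which is the number to keep in mind when you raise the lock duration for slow workflows.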
4. Sentinel for High Availability
Single-node Redis is fine until it isn’t. The HA pattern for Bull is Redis Sentinel, not Redis Cluster. Cluster doesn’t play well with Bull’s multi-key Lua scripts because it can’t guarantee that all the keys a script touches hash to the same slot.
A Sentinel setup is three Sentinels and three Redis nodes (one master, two replicas). The Sentinels do health checks and orchestrate failover. n8n connects via Sentinel and gets the current master transparently.
+-----------+        +-----------+        +-----------+
| Sentinel  |<------>| Sentinel  |<------>| Sentinel  |
+-----+-----+        +-----+-----+        +-----+-----+
      |                    |                    |
      | monitor            | monitor            | monitor
      v                    v                    v
+-----+-----+        +-----+-----+        +-----+-----+
|   Redis   |<-rep--+|   Redis   |<-rep--+|   Redis   |
|  master   |        |  replica  |        |  replica  |
+-----------+        +-----------+        +-----------+
The Redis side:
# replica config
replicaof redis-master 6379
replica-read-only yes
masterauth supersecret
requirepass supersecret
The Sentinel side:
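# /etc/redis/sentinel.conf
port 26379
# needed on Redis 6.2+ when "sentinel monitor" uses a hostname instead of an IP
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor n8n-master redis-master 6379 2
sentinel auth-pass n8n-master supersecret
sentinel down-after-milliseconds n8n-master 5000
sentinel parallel-syncs n8n-master 1
sentinel failover-timeout n8n-master 30000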
n8n needs Sentinel-aware connection config:
export QUEUE_BULL_REDIS_HOST= # leave empty
export QUEUE_BULL_REDIS_SENTINELS='[{"host":"sentinel-1","port":26379},{"host":"sentinel-2","port":26379},{"host":"sentinel-3","port":26379}]'
export QUEUE_BULL_REDIS_MASTER_NAME=n8n-master
export QUEUE_BULL_REDIS_PASSWORD=supersecret
Bull’s underlying ioredis client handles the Sentinel discovery and automatic reconnect on failover. Expected failover time on this config is around 15 seconds, mostly the down-after-milliseconds plus the failover election. n8n workers will see ECONNRESET during the window and reconnect.
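To make the connection settings concrete, here is a minimal sketch that turns the environment variables above into an options object with sentinels, name and password fields, which is the shape ioredis accepts for Sentinel connections. buildSentinelOptions and quorumIsMajority are hypothetical helpers written for this article, not n8n internals, and the sketch deliberately avoids importing ioredis so it runs on its own.

```typescript
// Builds Sentinel connection options from the env vars shown above.
// Hypothetical helpers; n8n's real config loading may differ.

interface SentinelNode {
  host: string;
  port: number;
}

interface SentinelOptions {
  sentinels: SentinelNode[];
  name: string; // master group name, must match sentinel.conf
  password?: string;
}

function buildSentinelOptions(env: Record<string, string | undefined>): SentinelOptions {
  const raw = env.QUEUE_BULL_REDIS_SENTINELS;
  const name = env.QUEUE_BULL_REDIS_MASTER_NAME;
  if (!raw || !name) {
    throw new Error("sentinel list and master name are both required");
  }
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed) || parsed.length === 0) {
    throw new Error("sentinel list must be a non-empty JSON array");
  }
  const sentinels = parsed.map((node: unknown): SentinelNode => {
    const { host, port } = (node ?? {}) as { host?: unknown; port?: unknown };
    if (typeof host !== "string" || typeof port !== "number") {
      throw new Error("each sentinel needs a string host and a numeric port");
    }
    return { host, port };
  });
  return { sentinels, name, password: env.QUEUE_BULL_REDIS_PASSWORD };
}

// A quorum below a strict majority lets a minority of Sentinels flag the master as down.
function quorumIsMajority(sentinelCount: number, quorum: number): boolean {
  return quorum >= Math.floor(sentinelCount / 2) + 1 && quorum <= sentinelCount;
}

const options = buildSentinelOptions({
  QUEUE_BULL_REDIS_SENTINELS:
    '[{"host":"sentinel-1","port":26379},{"host":"sentinel-2","port":26379},{"host":"sentinel-3","port":26379}]',
  QUEUE_BULL_REDIS_MASTER_NAME: "n8n-master",
  QUEUE_BULL_REDIS_PASSWORD: "supersecret",
});

console.log(options.sentinels.length, "sentinels for", options.name);
console.log("quorum 2 of 3 is a majority:", quorumIsMajority(3, 2));
```

Validating the JSON at startup is the point: a typo in the sentinel list otherwise shows up as a worker that never connects.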
5. Partitioning Queues for Throughput
Past about 20K jobs per minute on a single Redis, you start seeing tail latency on the Lua script execution. The fix is to partition queues across multiple Redis instances. n8n 1.78 supports running multiple worker fleets pointed at different QUEUE_NAME values, each backed by a separate Redis.
                  +--------------+
                  |     Main     |
                  +------+-------+
                         |
          +--------------+--------------+
          |              |              |
    queue:billing   queue:syncs   queue:webhooks
          |              |              |
     +----+----+    +----+----+   +-----+-----+
     | Redis A |    | Redis B |   |  Redis C  |
     +----+----+    +----+----+   +-----+-----+
          |              |              |
      workers-A      workers-B      workers-C
      (4 nodes)      (8 nodes)      (2 nodes)
The main process can write to multiple queue names based on workflow tags. The implementation is a small router that runs as part of the trigger and stamps each execution with a target queue.
// in a custom trigger or pre-execution hook
function routeQueue(workflow: { tags?: string[] }): string {
  if (workflow.tags?.includes('billing')) return 'billing';
  if (workflow.tags?.includes('sync')) return 'syncs';
  if (workflow.tags?.includes('webhook')) return 'webhooks';
  return 'default';
}
Workers are started with QUEUE_NAME=billing to pull from the billing queue only. Different fleets can scale independently and a noisy-neighbor workflow can’t starve the rest.
There’s no built-in router in n8n 1.78 for this; you implement it in a Code node at the start of the workflow or via a custom trigger node. Most teams don’t need this until they’re past 50K jobs per minute.
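Since the router is yours to write, it helps to keep the tag-to-queue rules and the queue-to-Redis mapping as data rather than a chain of if statements. The sketch below is one way to do that, under stated assumptions: the hostnames are placeholders matching the diagram, the default queue sharing Redis A is my assumption, and routeByRules and brokerFor are hypothetical names.

```typescript
// Table-driven version of the router, plus a queue-to-Redis lookup.
// Hostnames and the default-queue placement are placeholders, not n8n settings.

interface RouteRule {
  tag: string;
  queue: string;
}

interface Broker {
  host: string;
  port: number;
}

// First matching rule wins, same order as the if-chain above.
const ROUTE_RULES: RouteRule[] = [
  { tag: "billing", queue: "billing" },
  { tag: "sync", queue: "syncs" },
  { tag: "webhook", queue: "webhooks" },
];

const BROKERS: Record<string, Broker> = {
  billing: { host: "redis-a", port: 6379 },
  syncs: { host: "redis-b", port: 6379 },
  webhooks: { host: "redis-c", port: 6379 },
  default: { host: "redis-a", port: 6379 }, // assumption: default shares Redis A
};

function routeByRules(tags: string[] = []): string {
  const rule = ROUTE_RULES.find((r) => tags.includes(r.tag));
  return rule ? rule.queue : "default";
}

function brokerFor(queue: string): Broker {
  const broker = BROKERS[queue];
  if (!broker) {
    throw new Error(`no Redis configured for queue "${queue}"`);
  }
  return broker;
}

const queue = routeByRules(["sync", "nightly"]);
console.log(queue, "->", brokerFor(queue).host);
```

Failing loudly on an unknown queue name is deliberate: a job enqueued to a Redis that no worker fleet reads from is the "queue depth grows while workers are idle" problem from the troubleshooting section.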
6. Observability for Redis
Redis exposes a lot of metrics. The ones that matter for Bull:
- redis_db_keys{db="db0"}: total keys in the n8n DB. Steady growth means the completed/failed window is too generous.
- redis_memory_used_bytes: watch against maxmemory.
- redis_commands_processed_total: total throughput.
- redis_blocked_clients: Bull’s blocking-pop (BRPOPLPUSH) clients count here. Should equal total worker concurrency.
- redis_connected_clients: should be stable. Spikes mean reconnect storms.
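The rules of thumb in that list translate directly into checks. The sketch below evaluates one snapshot of the metrics against them; the 80 percent memory threshold and the 1.5x client-spike factor are assumptions of mine, not values from Redis or n8n, and checkRedisHealth is a hypothetical helper. In production you would express the same conditions as Prometheus alert rules.

```typescript
// Evaluates one snapshot of Redis metrics against the rules of thumb above.
// Thresholds are illustrative assumptions; tune them for your deployment.

interface RedisSnapshot {
  memoryUsedBytes: number;  // redis_memory_used_bytes
  maxMemoryBytes: number;   // configured maxmemory
  blockedClients: number;   // redis_blocked_clients
  connectedClients: number; // redis_connected_clients
}

interface Expectations {
  workerConcurrency: number; // total concurrency across all workers
  baselineClients: number;   // normal connected-client count
}

const MEMORY_WARN_RATIO = 0.8; // assumption: warn at 80 percent of maxmemory
const CLIENT_SPIKE_FACTOR = 1.5; // assumption: 50 percent above baseline is a spike

function checkRedisHealth(s: RedisSnapshot, e: Expectations): string[] {
  const alerts: string[] = [];
  if (s.maxMemoryBytes > 0 && s.memoryUsedBytes / s.maxMemoryBytes > MEMORY_WARN_RATIO) {
    alerts.push("memory above 80 percent of maxmemory");
  }
  if (s.blockedClients !== e.workerConcurrency) {
    alerts.push("blocked clients do not match worker concurrency");
  }
  if (s.connectedClients > e.baselineClients * CLIENT_SPIKE_FACTOR) {
    alerts.push("connected clients spiked, possible reconnect storm");
  }
  return alerts;
}

const alerts = checkRedisHealth(
  { memoryUsedBytes: 3.5e9, maxMemoryBytes: 4e9, blockedClients: 40, connectedClients: 60 },
  { workerConcurrency: 40, baselineClients: 55 },
);
console.log(alerts);
```

The blocked-clients check is the cheapest early warning of the three: a mismatch usually means a worker fleet is down or pointed at the wrong Redis.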
Scrape with redis_exporter, the standard Prometheus exporter for Redis (a community project, listening on port 9121 by default):
# prometheus.yml
scrape_configs:
  - job_name: redis
    static_configs:
      - targets: ['redis-exporter:9121']
The Bull-specific metrics come from n8n’s own Prometheus endpoint, covered in the observability article.
The official Redis 7.4 persistence docs cover the AOF and RDB internals if you want to go deeper on tuning.
Common Pitfalls
Four mistakes that bite at scale.
Running Redis without maxmemory set. The OS will OOM-kill it when memory fills up, taking the queue with it. Always set maxmemory to about 75 percent of the host RAM, and maxmemory-policy noeviction.
Sharing one Redis across staging and production via different DB numbers. It works until a slow query in staging blocks production for a beat. The Redis process is single-threaded for command execution. Use separate Redis instances per environment.
Using appendfsync always. Halves throughput for nearly no durability gain over everysec. The always mode is for financial transaction logs, not n8n queues.
Trusting Redis Cluster with Bull. I see this every year. Bull’s Lua scripts touch multiple keys per call, and Cluster’s slot routing doesn’t guarantee they’re on the same node. The symptoms are intermittent script errors with the CROSSSLOT code. Use Sentinel.
Troubleshooting
Three failure modes.
Workers stop processing jobs after a Sentinel failover. ioredis sometimes caches the old master address. The fix is to make sure the n8n worker pods have a graceful reconnect, and set enableReadyCheck=true in the ioredis config. n8n 1.78 sets this by default but older versions need it explicit. Check worker logs for ECONNREFUSED repeated more than a few times after failover.
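Queue depth grows even though workers are idle. The usual cause is workers subscribed to a different queue name than the one the main process is enqueuing to. Confirm with redis-cli LLEN bull:default:wait for the depth (or LRANGE bull:default:wait 0 9 to peek at the first few entries without dumping the whole list) and compare against QUEUE_NAME on the worker.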
Redis memory grows unbounded. Either EXECUTIONS_DATA_PRUNE is off (so completed Bull jobs accumulate), or you have a stuck workflow holding hundreds of items in active state. Check INFO memory and XLEN for any stream keys. The usual culprit is EXECUTIONS_DATA_SAVE_ON_SUCCESS=all combined with a high-volume sync.
Wrapping Up
Queue mode at production scale is a Redis problem more than an n8n problem. Size the instance for your job rate. Configure RDB plus AOF with everysec. Use Sentinel for HA, not Cluster. Partition queues when one Redis can’t keep up. And treat Redis observability as a first-class concern with proper alerting on memory, blocked clients, and queue depth.
The next article gets into credentials and secrets management for enterprise n8n, which is the security side of the same production picture. After that we’ll get into error handling and retries, the operational side of running these workflows long-term.