Advanced n8n Architecture in 2025, Queue Mode and Worker Scaling
August 4, 2025 · 10 min read · by Muhammad Amal programming

TL;DR — n8n 1.78 ships a battle-tested queue mode with a clear main/worker split. Redis 7.4 is the broker, Postgres 17 holds execution state, and worker concurrency is the dial you actually need to tune. Don’t run anything past a few hundred executions a day on the default single-process mode.

I’ve been running n8n in anger since the 0.220 days, and the difference between the toy single-process setup and a real queue-mode deployment is the difference between a demo and a system that survives a Monday morning. Most teams discover this the hard way, usually around the time a long-running HTTP node blocks the entire instance and the on-call channel lights up.

This article is what I wish I could have handed those teams six months earlier. We’ll cover the actual process topology of n8n 1.78, how queue mode splits work between a main process and N workers, the role of Redis and Postgres, the concurrency knobs that matter, and the failure modes that show up at volume. I’ll save the deeper Kubernetes packaging for the self-hosted on Kubernetes guide later in this series.

Opinions, stated plainly. Queue mode is not optional for production. The default execution mode is for development. Workers should be CPU-bound stateless drones. The main process should never run a workflow. Treat n8n like any other Node service, not a magic box.

1. The Process Topology in n8n 1.78

n8n ships two execution modes. regular runs everything in one Node.js process. queue splits the work across a main process, one or more worker processes, and optionally a dedicated webhook process. In August 2025 with n8n 1.78, queue is the only mode I deploy.

                       +---------------------+
                       |   HTTP clients      |
                       +----------+----------+
                                  |
                       +----------v----------+
                       |   Main (UI + API)   |
                       |   - editor          |
                       |   - REST API        |
                       |   - cron scheduler  |
                       |   - active triggers |
                       +----+----------+-----+
                            |          |
                writes      |          | enqueues
                executions  |          | jobs
                            |          |
                  +---------v--+  +----v-----------+
                  | Postgres 17|  |   Redis 7.4    |
                  | exec state |  |  Bull queue    |
                  +-----+------+  +----+-----------+
                        ^              |
                  reads |              | pops jobs
                  state |              |
                        |        +-----v------+
                        +--------+  Worker 1  |
                        +--------+  Worker N  |
                                 +------------+

The main process owns the editor, the public API, the cron scheduler, and any active triggers (webhooks unless you split them out, polling triggers, and the like). It writes execution rows to Postgres and pushes jobs onto a Bull queue in Redis. Workers pop jobs and execute nodes. Workers do not own triggers, they do not serve the UI, and they do not hold any persistent state of their own.

Set the mode explicitly. The defaults change between minor versions, and I’ve been bitten by silent fallback to regular mode when Redis was misconfigured.

# main process
export EXECUTIONS_MODE=queue
export N8N_RUNNERS_ENABLED=true
export QUEUE_BULL_REDIS_HOST=redis.svc.cluster.local
export QUEUE_BULL_REDIS_PORT=6379
export QUEUE_BULL_REDIS_DB=0
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=pg.svc.cluster.local
export DB_POSTGRESDB_DATABASE=n8n
export N8N_ENCRYPTION_KEY=$(cat /run/secrets/n8n_enc_key)

The N8N_RUNNERS_ENABLED flag turns on the new task runner subsystem that landed stable in 1.74 and is the default code-execution path in 1.78. It isolates Code node execution into a child process, which matters for memory and security. Don’t disable it unless you have a specific reason.
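
Since a wrong or missing mode is easy to overlook, a preflight check in the container entrypoint can catch it before the process starts. The following is a minimal sketch, not something n8n ships: check_queue_env is a hypothetical helper, and it only inspects the variables from the main-process block above.

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of n8n). It prints "ok" and
# returns 0 only when the process is configured for queue mode.
check_queue_env() {
  cqe_missing=""
  if [ "${EXECUTIONS_MODE:-}" != "queue" ]; then
    echo "EXECUTIONS_MODE is '${EXECUTIONS_MODE:-unset}', expected 'queue'"
    return 1
  fi
  # Every process type needs these to join the same cluster.
  for cqe_name in QUEUE_BULL_REDIS_HOST DB_TYPE DB_POSTGRESDB_HOST N8N_ENCRYPTION_KEY; do
    eval "cqe_value=\${$cqe_name:-}"
    [ -n "$cqe_value" ] || cqe_missing="$cqe_missing $cqe_name"
  done
  if [ -n "$cqe_missing" ]; then
    echo "missing:$cqe_missing"
    return 1
  fi
  echo "ok"
}

# Example entrypoint usage (commented out so the sketch has no side effects):
# check_queue_env && exec n8n worker
```

Failing fast here is a design choice: a container that exits with a clear message is easier to triage than one that quietly runs in the wrong mode.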

2. How Jobs Flow Through Redis

Bull is the queue library. In Redis you’ll see a handful of keys per queue, namespaced under bull:jobs:. Understanding the shape helps when you have to triage a stuck job.

bull:jobs:wait        # list of job IDs waiting for a worker (FIFO)
bull:jobs:active      # list of job IDs currently being processed
bull:jobs:delayed     # sorted set scored by when the job should run
bull:jobs:failed      # sorted set of failed job IDs, kept for inspection
bull:jobs:completed   # sorted set of completed job IDs, kept for the configured window
bull:jobs:<job-id>    # hash with the job payload and metadata

A job payload in n8n is small. It’s not the full workflow data, it’s a pointer to the execution row in Postgres plus a few flags. Workers read the execution from Postgres, run the workflow, and write results back. This is why your Postgres latency matters at least as much as your Redis latency.

A quick sanity check from redis-cli:

redis-cli -h redis.svc.cluster.local
> LLEN bull:jobs:wait
(integer) 0
> ZCARD bull:jobs:delayed
(integer) 14
> LRANGE bull:jobs:active 0 -1
1) "8431"
2) "8432"

If bull:jobs:wait grows without bound, you don’t have enough worker concurrency. If bull:jobs:active has stale entries that never complete, you’ve got a stuck worker, usually a blocked HTTP node or a runaway Code node.
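
The two rules of thumb above can be turned into a small triage helper. This is a sketch under assumptions: queue_verdict is a hypothetical function, the capacity figure (workers times per-worker concurrency) is something you supply yourself, and the wording of the verdicts is mine.

```shell
#!/bin/sh
# queue_verdict WAITING ACTIVE CAPACITY
#   WAITING  = length of bull:jobs:wait
#   ACTIVE   = length of bull:jobs:active
#   CAPACITY = workers x per-worker concurrency (your own number)
queue_verdict() {
  qv_wait=$1
  qv_active=$2
  qv_cap=$3
  if [ "$qv_wait" -gt 0 ] && [ "$qv_active" -ge "$qv_cap" ]; then
    # Every slot is busy and jobs are still queuing.
    echo "saturated: add worker concurrency"
  elif [ "$qv_wait" -gt 0 ]; then
    # Jobs are waiting although slots look free.
    echo "backlog with idle slots: check worker connectivity"
  else
    echo "healthy"
  fi
}

# Example wiring (commented out; needs a reachable Redis):
# queue_verdict \
#   "$(redis-cli -h redis.svc.cluster.local LLEN bull:jobs:wait)" \
#   "$(redis-cli -h redis.svc.cluster.local LLEN bull:jobs:active)" \
#   20
```

A single snapshot can mislead, so it is worth sampling a few times before acting on the verdict.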

Visibility of jobs in transit

n8n stores the actual workflow execution data in Postgres in the execution_entity table. Bull is just the dispatch mechanism. This split is important for two reasons. First, Redis stays small even when you have thousands of executions per minute. Second, Postgres is the source of truth, so a Redis crash loses scheduling state but not history.

3. Worker Concurrency, the One Dial That Matters

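Each worker process is a Node.js event loop. Each event loop is single-threaded. The N8N_CONCURRENCY_PRODUCTION_LIMIT env var controls how many workflows a single worker will execute concurrently. The n8n worker command also accepts a --concurrency flag for the same purpose, so check which of the two your deployment actually sets.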

# worker process
export EXECUTIONS_MODE=queue
export N8N_RUNNERS_ENABLED=true
export N8N_CONCURRENCY_PRODUCTION_LIMIT=10
export QUEUE_BULL_REDIS_HOST=redis.svc.cluster.local
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=pg.svc.cluster.local
export N8N_ENCRYPTION_KEY=$(cat /run/secrets/n8n_enc_key)

# start as worker
n8n worker

The right value depends on what your workflows do. For IO-bound flows that mostly call external APIs and wait, 10 to 20 is reasonable. For CPU-bound flows with heavy Code nodes or data transformations, drop it to 2 or 3. The wrong move is to crank concurrency to 50 because “Node is async” and then watch event loop lag spike to seconds.

A rough sizing heuristic I use:

concurrency_per_worker = floor( target_p99_latency_ms / avg_node_blocking_ms )
total_workers          = ceil( peak_executions_per_sec * avg_exec_seconds /
                               concurrency_per_worker )

If your average workflow blocks the loop for 50ms of cumulative time and your p99 latency budget is 500ms, that’s 10 concurrent slots per worker. If you peak at 20 executions per second with an average duration of 3 seconds, you need 6 workers. Round up and add headroom.
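
Here is the same heuristic as a few lines of shell, so you can plug in your own numbers. It is a rough sketch: size_workers is a hypothetical name, the inputs must be whole numbers because it relies on integer arithmetic, and it adds no headroom.

```shell
#!/bin/sh
# size_workers P99_BUDGET_MS AVG_BLOCKING_MS PEAK_EXEC_PER_SEC AVG_EXEC_SECONDS
size_workers() {
  # floor(target_p99_latency_ms / avg_node_blocking_ms)
  sw_conc=$(( $1 / $2 ))
  # Never size below one slot per worker.
  [ "$sw_conc" -ge 1 ] || sw_conc=1
  # Executions in flight at peak.
  sw_inflight=$(( $3 * $4 ))
  # ceil(in-flight / concurrency) using integer math.
  sw_workers=$(( (sw_inflight + sw_conc - 1) / sw_conc ))
  echo "concurrency=$sw_conc workers=$sw_workers"
}

# The worked example from the text: 500ms budget, 50ms blocking,
# 20 executions per second at peak, 3 seconds average duration.
size_workers 500 50 20 3
```

For the example inputs this prints concurrency=10 workers=6, matching the numbers above. Add your headroom on top of the result.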

Workflow-level overrides

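n8n 1.78 has no per-workflow concurrency setting. Workflow settings such as settings.executionOrder and settings.timezone change how a workflow runs, not how many copies run at once, and the concurrency cap is global per worker. If you have a single workflow that hogs the loop, isolate it on a dedicated worker pool. n8n does not route individual workflows to separate queues, so in practice that means a second n8n stack (its own main and workers) for the heavy workflow, separated from the first by its own key prefix via QUEUE_BULL_PREFIX or its own Redis DB via QUEUE_BULL_REDIS_DB.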

4. Splitting Out the Webhook Process

By default the main process serves webhooks. That’s fine until a webhook handler does something expensive synchronously, at which point your editor gets slow. The fix is the dedicated webhook process.

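# main process: stop main from handling production webhooks itself
export N8N_DISABLE_PRODUCTION_MAIN_PROCESS=true

# webhook process
export EXECUTIONS_MODE=queue
export QUEUE_BULL_REDIS_HOST=redis.svc.cluster.local
export DB_TYPE=postgresdb
export DB_POSTGRESDB_HOST=pg.svc.cluster.local
export N8N_ENCRYPTION_KEY=$(cat /run/secrets/n8n_enc_key)

# start as webhook processor
n8n webhook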

Webhook processes are stateless HTTP listeners. They receive the webhook, persist the execution row, push the job to Redis, and return a response. The actual workflow runs on a worker. Run as many as you have ingress capacity for. I usually run two for redundancy even in small clusters.

Process inventory at the end

+-----------+--------+---------+-----------------------------------+
| process   | count  | scales  | responsibility                    |
+-----------+--------+---------+-----------------------------------+
| main      |   1    | no      | UI, API, cron, active triggers    |
| webhook   |   2+   | yes     | inbound HTTP, enqueue only        |
| worker    |   N    | yes     | execute workflows                 |
+-----------+--------+---------+-----------------------------------+

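You scale workers and webhooks horizontally. You do not scale main horizontally on the community edition in 1.78. A multi-main setup (N8N_MULTI_MAIN_SETUP_ENABLED) exists, but it is an Enterprise feature and runs as a leader with followers rather than true active-active. Without it, if you want HA for main, run a passive standby and accept a failover gap.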

5. Database Sizing and Execution Retention

n8n writes every execution to Postgres, including the data payloads of every node if EXECUTIONS_DATA_SAVE_ON_SUCCESS=all. At volume this table grows fast. I’ve seen single-tenant deployments hit 200GB of execution data in a month.

The relevant knobs:

# don't save full data for successful runs in production
export EXECUTIONS_DATA_SAVE_ON_ERROR=all
export EXECUTIONS_DATA_SAVE_ON_SUCCESS=none
export EXECUTIONS_DATA_SAVE_ON_PROGRESS=false
export EXECUTIONS_DATA_SAVE_MANUAL_EXECUTIONS=false

# auto-prune executions older than 14 days
export EXECUTIONS_DATA_PRUNE=true
export EXECUTIONS_DATA_MAX_AGE=336
export EXECUTIONS_DATA_PRUNE_MAX_COUNT=100000

The pruner runs hourly. It deletes rows from execution_entity and the related execution_data table. On Postgres 17 with partitioning, you can do much better by partitioning execution_entity by startedAt and dropping old partitions in one shot, but that requires a custom schema migration that survives n8n upgrades. I usually just live with the row-by-row pruner and a generous EXECUTIONS_DATA_MAX_AGE.

Tune Postgres for write-heavy workloads. synchronous_commit=off on a replicated cluster (accepting that a crash can lose the most recent commits), wal_compression=zstd, and a generous checkpoint_timeout of 30 minutes. The n8n schema doesn’t need anything exotic, but the execution data table benefits from TOAST compression, which Postgres applies to large values by default.

6. Observability Hooks You Want From Day One

The basic Prometheus endpoint is off by default in 1.78. Enable it explicitly and scrape /metrics from every process type.

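export N8N_METRICS=true
export N8N_METRICS_PREFIX=n8n_
export N8N_METRICS_INCLUDE_DEFAULT_METRICS=true
export N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
export N8N_METRICS_INCLUDE_NODE_TYPE_LABEL=true
# queue depth gauges are opt-in; confirm the exact metric names in the docs for your version
export N8N_METRICS_INCLUDE_QUEUE_METRICS=true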

The metrics that matter on day one:

  • n8n_workflow_executions_total{status="success|error|crashed"}
  • n8n_workflow_execution_duration_seconds_bucket
  • n8n_queue_jobs_waiting
  • n8n_queue_jobs_active
  • n8n_event_loop_lag_seconds

If n8n_queue_jobs_waiting rises while n8n_queue_jobs_active is at the concurrency cap, add workers. If event loop lag rises on a single worker, you have a blocking node. The deeper observability story including OpenTelemetry traces is in the observability article.

For the full catalog of metrics and their semantics in 1.78, the official n8n metrics docs are kept up to date.

Common Pitfalls

Four traps I’ve watched teams fall into.

Running main and worker as the same process. This happens by accident when EXECUTIONS_MODE is unset or set to regular. The symptom is that the editor freezes when a workflow runs. The fix is to set EXECUTIONS_MODE=queue everywhere and start workers separately. Verify with ps aux | grep n8n that you actually have distinct processes.

Sharing one Redis DB between staging and production. Bull doesn’t namespace queues across environments by default. Two n8n clusters pointed at the same Redis DB will steal jobs from each other. Use QUEUE_BULL_REDIS_DB and a different number per environment, or use entirely separate Redis instances.

Forgetting the encryption key on workers. N8N_ENCRYPTION_KEY must be identical on main, webhook, and all workers. If a worker has the wrong key, it can’t decrypt credentials, and every workflow that uses any credential fails with a vague “cannot decrypt” error. Mount the key as a Kubernetes secret and reference it on all process types.
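
One way to confirm the key matches everywhere without printing the secret is to compare a short hash of it from each process. A minimal sketch, assuming sha256sum from GNU coreutils is available in the image; key_fingerprint is a hypothetical helper name.

```shell
#!/bin/sh
# Prints the first 12 hex characters of the SHA-256 of N8N_ENCRYPTION_KEY.
# The fingerprint is safe to paste into a ticket; the key itself is not.
key_fingerprint() {
  printf '%s' "${N8N_ENCRYPTION_KEY:-}" | sha256sum | cut -c1-12
}

# Run this inside the main, webhook, and worker containers and compare:
# key_fingerprint
```

If any process prints a different fingerprint, that process is the one failing to decrypt credentials. Watch for a trailing newline in the mounted secret, which changes the hash and the key.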

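Trusting the default execution retention. Since 1.0, n8n prunes by default (after 336 hours or 10,000 executions), but it also saves full data for every successful run, and those limits were not chosen for your workload. If pruning is disabled or loosened, the table grows without bound: at 1000 executions per day with 1MB of data each, that’s 30GB per month. By the time the on-call notices, the disk is full and Postgres is in recovery mode. Set the save and prune env vars explicitly on day one.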
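
The retention arithmetic is worth running against your own volume. A back-of-envelope sketch, assuming decimal units (1GB = 1000MB), whole-number inputs, and no pruning; both function names are hypothetical.

```shell
#!/bin/sh
# monthly_growth_gb EXECUTIONS_PER_DAY MB_PER_EXECUTION
# Growth over a 30-day month, in GB.
monthly_growth_gb() {
  echo $(( $1 * $2 * 30 / 1000 ))
}

# days_until_full FREE_DISK_GB EXECUTIONS_PER_DAY MB_PER_EXECUTION
# Days until the free space is consumed at a constant rate.
days_until_full() {
  echo $(( $1 * 1000 / ($2 * $3) ))
}

monthly_growth_gb 1000 1     # the example from the text
days_until_full 200 1000 1   # a 200GB volume at the same rate
```

With the numbers from the text this prints 30 and then 200. Real tables also carry index and WAL overhead, so treat the result as a lower bound.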

Troubleshooting

Three failure modes that show up at volume.

Workers stop picking up jobs. Check bull:jobs:wait in Redis. If it has entries but workers are idle, the workers have lost their Redis connection. Look for ECONNRESET in worker logs. The usual culprit is a Redis failover with a stale DNS cache. Restart the workers, or set family=4 and a short DNS TTL in your sidecar.

Executions stuck in “running” state forever. A worker crashed mid-execution. Bull doesn’t have a heartbeat by default. Set QUEUE_HEALTH_CHECK_ACTIVE=true (the variable has no N8N_ prefix) and QUEUE_RECOVERY_INTERVAL=60 so the main process reaps stale executions every minute. Confirm with a query: SELECT id, status, "startedAt" FROM execution_entity WHERE status='running' AND "startedAt" < NOW() - INTERVAL '10 minutes'.
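
To run that check on a schedule, it helps to wrap the query so the age threshold is a parameter. A small sketch: stale_executions_sql is a hypothetical helper that only builds the SQL text, and the psql connection string in the comment is a placeholder.

```shell
#!/bin/sh
# stale_executions_sql MINUTES
# Prints the query for executions stuck in 'running' longer than MINUTES.
stale_executions_sql() {
  # Accept digits only, so the value is safe to place inside the SQL text.
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: stale_executions_sql MINUTES" >&2
      return 2
      ;;
  esac
  printf '%s\n' "SELECT id, status, \"startedAt\" FROM execution_entity WHERE status = 'running' AND \"startedAt\" < NOW() - INTERVAL '$1 minutes';"
}

# Example (commented out; the connection string is a placeholder):
# psql "postgresql://n8n@pg.svc.cluster.local/n8n" -c "$(stale_executions_sql 10)"
```

Pick a threshold comfortably above your longest legitimate workflow, otherwise healthy long runs show up as false positives.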

Editor “Cannot load workflow” after a worker upgrade. Workers and main must run the same n8n version. A mixed-version cluster will fail to deserialize execution data written by a newer worker. Roll your deploy so main and all workers switch versions together, or use a blue-green rollout. An n8n-nodes-base package version mismatch is the typical signature in the logs.

Wrapping Up

Queue mode in n8n 1.78 is the only architecture I’d ship to production in August 2025. Split main, webhook, and worker into distinct process types. Pin worker concurrency based on what your workflows actually do, not what feels async. Treat Postgres execution retention as a first-class concern, not a cleanup task you’ll get to later. And wire up the Prometheus endpoint before you wire up anything else.

The next article in this series goes through writing custom n8n nodes in TypeScript so you can stop wrapping every API in HTTP Request nodes. After that we’ll get into the data sync patterns that this architecture was built to support.