Error Handling and Retries for Production n8n Workflows
August 18, 2025 · 10 min read · by Muhammad Amal programming

TL;DR — n8n 1.78’s per-node retry is good for transient errors, useless for everything else. Layer error workflows for terminal failures, a circuit breaker for cascading outages, and a dead letter table for poison messages. The default settings will fail you in production.

The most common n8n production failure isn’t a workflow that doesn’t work. It’s a workflow that works 99 percent of the time and silently fails on the other 1 percent. The retry settings look reasonable, the executions list shows mostly green, and one weekend the on-call ends up triaging a partial outage that started three days earlier. This is preventable.

This article is the playbook. We’ll cover per-node retry and what it can’t fix, error workflows for terminal failures, circuit breakers for cascading outages, dead letter routing, and the alerting hooks that make all this visible. The patterns assume a queue-mode cluster like the one in the advanced n8n architecture article and the data-sync patterns from the enterprise data syncs article.

Opinions stated plainly. Retry on fail is not error handling, it’s noise reduction. Catch-all error workflows that “log and continue” hide bugs. Dead letter tables are a feature, not a workaround. And no workflow should be marked active until you can answer “what happens when this fails.”

1. Per-Node Retry, the First Layer

Every n8n node has a “Retry On Fail” setting. Toggle it on, set the count, set the wait time. For transient errors (network blips, 502s, rate limits), this catches most of them.

{
  "name": "Call billing API",
  "type": "n8n-nodes-base.httpRequest",
  "parameters": {
    "url": "https://billing.acme.internal/v1/invoices",
    "method": "POST"
  },
  "retryOnFail": true,
  "maxTries": 5,
  "waitBetweenTries": 2000,
  "continueOnFail": false,
  "onError": "stopWorkflow"
}

The retry waits the same fixed interval between attempts. That’s flat, not exponential backoff. With the 5 tries configured above there are only four waits between attempts, so against a service that’s restarting you get about 8 seconds of coverage. If the restart takes 20 seconds, every attempt fails and the workflow throws.

The fix is custom backoff via a Code node. For anything past the simplest retry, I write the loop myself.

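// Code node: HTTP with exponential backoff
const maxAttempts = 6;
const baseMs = 500;
let lastError;

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  let response;
  try {
    response = await this.helpers.httpRequest({
      method: 'POST',
      url: 'https://billing.acme.internal/v1/invoices',
      body: $input.first().json,
      json: true,
      returnFullResponse: true,
      // Don't throw on 4xx/5xx, so the status code can be inspected below
      ignoreHttpStatusErrors: true,
    });
  } catch (err) {
    // Network-level failure (reset, refused, timeout): treat as retryable
    lastError = err;
  }

  if (response) {
    const status = response.statusCode;
    if (status < 400) return [{ json: response.body }];
    // A 4xx other than 429 will fail the same way every time: stop immediately
    if (status < 500 && status !== 429) {
      throw new Error(`Non-retryable response: ${status}`);
    }
    lastError = new Error(`Retryable response: ${status}`);
  }

  if (attempt === maxAttempts) break;

  // Exponential backoff with jitter, capped at 30 seconds.
  // Runs after both failure paths, so thrown errors back off too.
  const jitter = Math.random() * baseMs;
  const wait = Math.min(30000, baseMs * Math.pow(2, attempt - 1) + jitter);
  await new Promise(r => setTimeout(r, wait));
}

throw new Error(`Final attempt failed: ${lastError.message}`);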

The retry logic is what you’d write in any well-behaved HTTP client. Exponential backoff with jitter, max attempt cap, terminal error after the cap. The reason it’s in a Code node and not a node setting is that n8n’s built-in retry doesn’t expose enough knobs.
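To make the coverage difference concrete, here is a small sketch that totals the time each strategy spends waiting before it gives up. The function names are hypothetical and jitter is ignored. The inputs mirror the settings used earlier: 5 tries at a flat 2000 ms, and 6 attempts from a 500 ms base with a 30-second cap.

```javascript
// Total time spent waiting between attempts with a fixed delay.
// There is always one wait fewer than there are attempts.
function flatCoverageMs(maxTries, waitMs) {
  return Math.max(0, maxTries - 1) * waitMs;
}

// Total time spent waiting with capped exponential backoff (jitter ignored).
function exponentialCoverageMs(maxAttempts, baseMs, capMs) {
  let total = 0;
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    total += Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  }
  return total;
}

console.log(flatCoverageMs(5, 2000));              // 8000
console.log(exponentialCoverageMs(6, 500, 30000)); // 15500
console.log(exponentialCoverageMs(7, 500, 30000)); // 31500
```

Six attempts from a 500 ms base still wait under 16 seconds in total, so a dependency that needs 20 seconds to restart calls for a seventh attempt or a larger base.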

When retry helps and when it doesn’t

Retry helps for: network resets, 502/503/504 responses, 429 rate limits, timeouts.

Retry does not help for: 4xx client errors (you sent something invalid), data validation failures, missing fields, anything where the input is wrong.

A retry on a 400 will get 400 every time, just five times now instead of once. Inspect the response code and don’t retry on 4xx (except 429).
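The rule above fits in one predicate. This is a minimal sketch, not an n8n API: the `failure` shape and the list of network error codes are assumptions, and it treats every 5xx as retryable, which is broader than the 502/503/504 list.

```javascript
// Network-level error codes that usually indicate a transient failure.
// Illustrative, not exhaustive.
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'EAI_AGAIN']);

// Decide whether a failed call is worth retrying.
// `failure` is a hypothetical shape: { status?: number|string, code?: string }.
function isRetryable(failure) {
  if (failure.code && TRANSIENT_CODES.has(failure.code)) return true;
  const status = Number(failure.status); // NaN when status is missing
  if (status === 429) return true;                  // rate limited: back off, then retry
  if (status >= 500 && status <= 599) return true;  // server side, possibly transient
  return false;                                     // other 4xx and unknowns: do not retry
}

console.log(isRetryable({ status: 503 }));         // true
console.log(isRetryable({ status: 400 }));         // false
console.log(isRetryable({ code: 'ECONNRESET' }));  // true
```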

2. The Error Workflow

Per-node retry handles transient errors. After retries are exhausted, the workflow throws. This is where the error workflow comes in.

Configure each workflow with an error workflow setting (Workflow settings > Error Workflow). When the workflow throws, n8n fires the configured error workflow with the failure context.

+------------------+
| Main workflow    |
| - retries exh.   |
| - throws         |
+--------+---------+
         |
         | error event
         v
+--------+---------+
| Error workflow   |
| - Error Trigger  |
| - Categorize     |
| - Route          |
+----+--------+----+
     |        |
     v        v
+------+--+ +-+------+
|  DLQ    | | Alert  |
|  table  | | Slack  |
+---------+ +--------+

The error trigger payload includes the original workflow, the failed execution ID, the node that failed, the error message, and the item data at the point of failure.

// Code node: categorize the error
const e = $input.first().json.execution.error;
const message = e.message || '';
// httpCode can arrive as a string (or be absent), so normalize it before comparing
const httpCode = Number(e.httpCode);

let category = 'unknown';
if (message.includes('ECONNREFUSED')) category = 'network';
else if (httpCode === 429) category = 'rate-limit';
else if (httpCode >= 400 && httpCode < 500) category = 'client-error';
else if (httpCode >= 500) category = 'server-error';
else if (message.includes('validation')) category = 'validation';
else if (message.includes('timeout')) category = 'timeout';

return [{ json: { ...$input.first().json, category } }];

Routing then depends on category. Network and 5xx errors get alerts and a retry queue. Validation errors go straight to the DLQ. Rate limits trigger a backoff in the parent workflow.
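One way to keep that routing readable is a lookup table in place of nested IF nodes. The sketch below is illustrative: the action flags, the `routeFor` name, the treatment of timeouts, and the fallback for unmapped categories are assumptions, not n8n features.

```javascript
// Map each error category to the actions the error workflow should take.
const ROUTES = {
  'network':      { alert: true,  retryQueue: true,  deadLetter: false, backoff: false },
  'server-error': { alert: true,  retryQueue: true,  deadLetter: false, backoff: false },
  'timeout':      { alert: true,  retryQueue: true,  deadLetter: false, backoff: false }, // assumption
  'rate-limit':   { alert: false, retryQueue: false, deadLetter: false, backoff: true  },
  'validation':   { alert: false, retryQueue: false, deadLetter: true,  backoff: false },
  'client-error': { alert: false, retryQueue: false, deadLetter: true,  backoff: false },
};

// Assumption: anything unmapped is dead-lettered and alerted so it never disappears.
const FALLBACK = { alert: true, retryQueue: false, deadLetter: true, backoff: false };

function routeFor(category) {
  return Object.prototype.hasOwnProperty.call(ROUTES, category)
    ? ROUTES[category]
    : FALLBACK;
}

console.log(routeFor('validation').deadLetter); // true
console.log(routeFor('unknown').alert);         // true
```

A Switch node can then branch on these flags, and a new category becomes a one-line change.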

The error workflow itself should be the simplest workflow in your cluster. It should not call external services that might fail. If the error workflow throws, n8n doesn’t fire another error workflow; the error is just logged and dropped.

3. Circuit Breaker for Cascading Failures

Retries are dangerous in aggregate. If a downstream service is down, every n8n workflow that calls it will retry, increasing the load on the recovering service and prolonging the outage. The fix is a circuit breaker.

The pattern: track the recent error rate per dependency. If errors exceed a threshold, open the circuit for a cooldown period. While open, fail fast without calling the dependency.

Implement it in Redis with a couple of keys per dependency.

// Code node: circuit breaker check
// Requiring ioredis here assumes it is allowed via NODE_FUNCTION_ALLOW_EXTERNAL.
const Redis = require('ioredis');

const dep = 'billing-api';
const window = 60;        // bucket size in seconds
const threshold = 0.5;    // 50% error rate
const cooldown = 30;      // seconds

const r = new Redis({ host: process.env.QUEUE_BULL_REDIS_HOST });

try {
  // Fail fast while the circuit is open
  const openUntil = await r.get(`cb:${dep}:open_until`);
  if (openUntil && parseInt(openUntil, 10) > Date.now()) {
    throw new Error(`Circuit open for ${dep}, fail fast`);
  }

  // Read both counters from the same bucket
  const bucket = Math.floor(Date.now() / 1000 / window);
  const total = parseInt(await r.get(`cb:${dep}:total:${bucket}`) || '0', 10);
  const errors = parseInt(await r.get(`cb:${dep}:errors:${bucket}`) || '0', 10);

  // Require more than 20 calls so a tiny sample can't trip the breaker
  if (total > 20 && errors / total > threshold) {
    await r.set(`cb:${dep}:open_until`, Date.now() + cooldown * 1000, 'EX', cooldown + 5);
    throw new Error(`Circuit opened for ${dep}`);
  }
} finally {
  // Close the connection so each execution doesn't leak a socket
  await r.quit();
}

return $input.all();

The check runs before any call to the dependency. After the call, success or failure increments the appropriate counter:

// after call, in error handler
// Assumes r, dep, and window are set up as in the check above, and callFailed is set by the caller.
// Compute the bucket once so both counters land in the same window.
const bucket = Math.floor(Date.now() / 1000 / window);
const totalKey = `cb:${dep}:total:${bucket}`;
await r.incr(totalKey);
await r.expire(totalKey, window * 2);
if (callFailed) {
  const errorsKey = `cb:${dep}:errors:${bucket}`;
  await r.incr(errorsKey);
  await r.expire(errorsKey, window * 2);
}

The fixed one-minute bucket approximates a rolling rate over the last minute. Counters reset at each bucket boundary, so it lacks the precision of a true sliding window. Good enough for a circuit breaker.
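The bucket arithmetic and the trip decision are easier to reason about as pure functions, separate from Redis. This is a sketch with hypothetical helper names; the key layout, the 0.5 threshold, and the 20-call minimum match the check above.

```javascript
// Build the Redis key for one counter in the fixed window containing nowMs.
function bucketKey(dep, kind, nowMs, windowSec) {
  const bucket = Math.floor(nowMs / 1000 / windowSec);
  return `cb:${dep}:${kind}:${bucket}`;
}

// Decide whether the breaker should open, given one bucket's counters.
function shouldTrip(total, errors, threshold = 0.5, minCalls = 20) {
  return total > minCalls && errors / total > threshold;
}

// Two timestamps one millisecond apart can land in different buckets.
console.log(bucketKey('billing-api', 'total', 119999, 60)); // cb:billing-api:total:1
console.log(bucketKey('billing-api', 'total', 120000, 60)); // cb:billing-api:total:2

console.log(shouldTrip(21, 11)); // true
console.log(shouldTrip(20, 20)); // false
```

The last line shows the cost of the minimum-call guard: 20 failures out of 20 calls do not open the circuit, because the guard requires more than 20 calls in the bucket.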

For the canonical formal pattern with half-open state and progressive recovery, see the Microsoft Azure circuit breaker docs. The simplified version above is what I deploy and what’s been enough in practice.

4. Dead Letter Routing

Some errors will never resolve. Bad input data, schema mismatches, business rules violated by the source. These belong in a dead letter table for human review.

CREATE TABLE workflow_dead_letter (
  id            BIGSERIAL PRIMARY KEY,
  workflow_name TEXT NOT NULL,
  execution_id  TEXT NOT NULL,
  node_name     TEXT NOT NULL,
  category      TEXT NOT NULL,
  error_message TEXT NOT NULL,
  payload       JSONB NOT NULL,
  retry_count   INT NOT NULL DEFAULT 0,
  status        TEXT NOT NULL DEFAULT 'pending',
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

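CREATE INDEX idx_dlq_pending ON workflow_dead_letter (status, created_at) WHERE status = 'pending';

-- Without a unique constraint, ON CONFLICT DO NOTHING on insert never matches and duplicates slip in.
-- One failure per execution and node is an assumption; adjust the columns to your own dedup rule.
CREATE UNIQUE INDEX idx_dlq_execution_node ON workflow_dead_letter (execution_id, node_name);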

The error workflow inserts into this table when the category is validation, client-error, or other non-retryable types.

INSERT INTO workflow_dead_letter
  (workflow_name, execution_id, node_name, category, error_message, payload)
VALUES ($1, $2, $3, $4, $5, $6::jsonb)
ON CONFLICT DO NOTHING;

A separate “DLQ replay” workflow runs daily, scans for items that should be retried (after the underlying issue is fixed), and re-feeds them to the original workflow with a dlq_replay=true flag in the item to skip duplicate-detection or special handling.

The replay logic:

+----------+   +------------+   +-------------+   +----------+
|  Cron    |-->|  SELECT *  |-->|  for each   |-->| Original |
|          |   |  FROM DLQ  |   |  call WF    |   | workflow |
+----------+   +------------+   +-------------+   +-----+----+
                                                        |
                                                  +-----v-----+
                                                  | success?  |
                                                  +-----+-----+
                                                        |
                                            +-----------+----------+
                                            | UPDATE status='done' |
                                            +----------------------+
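The diagram shows only the success path. A sketch of the full decision, including a cap on replays, might look like the following. The `status`, `retry_count`, `'pending'`, and `'done'` names come from the table above; the `MAX_REPLAYS` value, the `'abandoned'` status, and both function names are hypothetical.

```javascript
const MAX_REPLAYS = 3; // hypothetical threshold

// Pick the rows that are still worth replaying.
function selectReplayable(rows, maxReplays = MAX_REPLAYS) {
  return rows.filter(row => row.status === 'pending' && row.retry_count < maxReplays);
}

// Compute a row's next state after one replay attempt.
function nextState(row, succeeded, maxReplays = MAX_REPLAYS) {
  if (succeeded) return { ...row, status: 'done' };
  const retry_count = row.retry_count + 1;
  // Stop replaying once the cap is reached; a human takes over from here.
  const status = retry_count >= maxReplays ? 'abandoned' : 'pending';
  return { ...row, retry_count, status };
}

const rows = [
  { id: 1, status: 'pending', retry_count: 0 },
  { id: 2, status: 'pending', retry_count: 3 },
  { id: 3, status: 'done', retry_count: 1 },
];
console.log(selectReplayable(rows).map(row => row.id)); // [ 1 ]
console.log(nextState(rows[0], false).retry_count);     // 1
```

In the workflow, the result of `nextState` maps onto a single UPDATE per row after each replay attempt.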

The DLQ table is also the source of the weekly “what broke last week” report. Categorize by error type, group by source, and you have a real picture of where your integrations are flaky.

5. Idempotency, the Hidden Foundation

Retries only work if the underlying operations are idempotent. POSTing the same payment twice is a bug. The fix is in the workflow design, not the retry logic.

For destination APIs that support idempotency keys (Stripe-style), include one with every mutation.

// Code node before the HTTP POST
const crypto = require('crypto');
const idemKey = crypto
  .createHash('sha256')
  .update(`${$execution.id}:${$json.invoice_id}`)
  .digest('hex')
  .slice(0, 32);

return [{ json: { ...$json, _idempotencyKey: idemKey } }];

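The HTTP node sends Idempotency-Key: {{ $json._idempotencyKey }}. Node-level retries within the same execution produce the same key, so the destination treats the second call as a no-op. A manual retry or a DLQ replay runs as a new execution with a new ID, and therefore a new key; drop $execution.id from the hash if those should dedupe too.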
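A variant worth considering derives the key from business identifiers alone, which keeps it stable across separate executions as well. This is a sketch using Node’s built-in crypto module; the function and its `scope` and `businessId` parameters are hypothetical names.

```javascript
const crypto = require('crypto');

// Derive a stable key from business identifiers only: no timestamps, no random values.
// `scope` namespaces the key per operation so unrelated mutations never collide.
function idempotencyKey(scope, businessId) {
  return crypto
    .createHash('sha256')
    .update(`${scope}:${businessId}`)
    .digest('hex')
    .slice(0, 32);
}

const first = idempotencyKey('create-invoice', 'inv_1001');
const second = idempotencyKey('create-invoice', 'inv_1001');
console.log(first === second);                                       // true
console.log(first === idempotencyKey('create-invoice', 'inv_1002')); // false
```

The trade-off is that a legitimate second submission of the same invoice is also deduplicated, so this only fits operations that should happen once per business identifier.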

For destinations without native idempotency, build it in the workflow with a check-before-write pattern:

SELECT id FROM destination_records WHERE source_key = $1

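If a row exists, skip the insert. The lookup adds a round trip per item but eliminates duplicates from sequential retries. Two concurrent executions can still both pass the check, so back it with a unique constraint on source_key if that matters.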

6. Alerting and Visibility

The last piece is making errors visible. Every error workflow should emit metrics to Prometheus and route critical errors to Slack or PagerDuty.

// Code node: emit metric and route alert
const { category, workflow } = $input.first().json;

await this.helpers.httpRequest({
  method: 'POST',
  url: 'http://pushgateway:9091/metrics/job/n8n_errors',
  body: `n8n_workflow_errors_total{workflow="${workflow.name}",category="${category}"} 1\n`,
  headers: { 'Content-Type': 'text/plain' },
});

if (category === 'server-error' || category === 'timeout') {
  await this.helpers.httpRequest({
    method: 'POST',
    url: process.env.SLACK_WEBHOOK_URL,
    body: { text: `:rotating_light: ${workflow.name} failed: ${$input.first().json.execution.error.message}` },
    json: true,
  });
}

return [];

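Alert thresholds belong in Prometheus alerting rules, not in the workflow. The workflow emits counts, Prometheus evaluates the rate-of-change rules, and Alertmanager handles routing and silencing. One caveat: Pushgateway keeps only the last value pushed for a metric in a group, so pushing 1 on every error leaves the series stuck at 1. Push a cumulative count instead, or use an aggregating gateway. The official n8n error workflows docs cover the trigger payload shape in detail.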
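Workflow names are free text, so a quote or backslash in one would break a metric line built by plain string interpolation. The sketch below escapes label values the way the Prometheus text exposition format requires (backslash, double quote, newline). The helper names are hypothetical.

```javascript
// Escape a label value for the Prometheus text exposition format.
// Backslashes must be escaped first so later escapes aren't doubled.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Build one sample line: name{label="value",...} value
function formatMetricLine(name, labels, value) {
  const pairs = Object.entries(labels)
    .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
    .join(',');
  return `${name}{${pairs}} ${value}\n`;
}

const line = formatMetricLine(
  'n8n_workflow_errors_total',
  { workflow: 'Sync "A"', category: 'timeout' },
  1
);
process.stdout.write(line);
```

The returned string can be passed as the `body` of the push request in place of the hand-built template string.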

Common Pitfalls

Four mistakes.

Setting retry on every node by default. Half your nodes don’t need retry. A Code node that transforms data doesn’t fail intermittently, it either works or has a bug. Retrying it just delays the bug report.

Retrying on 4xx errors. A 400 means the request was malformed. Retrying produces the same 400 with extra latency. Filter on status code before retrying.

Error workflow that catches and continues silently. Every error becomes invisible. The workflow looks healthy because executions are green. Months later you discover data has been silently dropping. Always emit a metric, always log, even if you choose not to alert.

Idempotency keys that include a timestamp. The key changes between retries, defeating the dedup. The key should be a function of the business identifier (invoice ID, customer ID), not the wall clock.

Troubleshooting

Three failure modes.

Workflow retries fire but no executions appear in the UI. n8n only shows the top-level retries, not Code-node-internal loops. If your retry is in a Code node, you only see the final outcome. Add logging via console.log or emit metrics from inside the loop to track attempts.

Error workflow throws and is silently dropped. The error workflow itself failed. n8n doesn’t fire an error-workflow-for-the-error-workflow. Look in the n8n process logs for Error workflow failed. Most common cause is a credential the error workflow doesn’t have access to.

DLQ replay creates infinite loops. The replay re-fires the same failure, which dead-letters again, which gets replayed. Increment the retry_count column on every replay and skip rows above a threshold. Better, fix the underlying data issue before replay.

Wrapping Up

Production n8n error handling is layered. Per-node retry for transient flaps. Error workflows for terminal failures, with categorization. Circuit breakers to prevent cascading load on recovering services. Dead letter tables for poison messages, with explicit replay. Idempotency keys at every mutation. Metrics and alerts driven by the error workflow.

The next article gets into Kubernetes packaging for the whole stack, including how all of these patterns play with Helm and HPA-driven scaling. After that we’ll round out the series with the observability story.