Error Handling and Retries in n8n
May 20, 2022 · 7 min read · by Muhammad Amal programming

TL;DR: n8n has per-node retry, “continue on fail” branches, and an Error Trigger workflow that fires when a workflow pointed at it fails. Wire those three together with a dead-letter pattern and Slack alerts on consecutive failures. Reliable automation is mostly about handling failure gracefully.

After webhooks, the next reliability concern: what happens when a workflow fails? Every external API has off days. Every node can hit a timeout. Most “automation went bad” stories are really “automation failed silently for two weeks.” This post covers the n8n primitives for not letting that happen.

Three levels of failure handling

In rough order of granularity:

  1. Per-node retry. Built into most n8n nodes. Configurable on the node’s settings tab.
  2. Per-branch continue-on-fail. A node’s error output feeds into a different branch. “If this fails, do that.”
  3. Per-workflow Error Trigger. A dedicated workflow that fires when a workflow that names it as its Error Workflow fails. Catches the things that escape per-node handling.

You’ll use all three.

Per-node retry

Open any node. Bottom-left, “Settings” tab. Three relevant options:

Retry On Fail: ON
Max Tries: 3
Wait Between Tries: 5000 ms

The node retries up to 3 times with 5 seconds between attempts. Transient failures (a 503 from an API, a brief network blip) usually clear within those retries.
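To make the semantics concrete, here is a minimal sketch of what those three settings amount to, written as plain JavaScript. It is not n8n's internal code: the helper name retryOnFail and its option names are hypothetical, and it assumes Max Tries counts total attempts, including the first. The wait is fixed, with no backoff.

```javascript
// Hypothetical helper illustrating fixed-interval retry; not n8n's internal code.
// Assumption: maxTries counts total attempts, including the first one.
async function retryOnFail(fn, { maxTries = 3, waitMs = 5000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Wait only if another attempt is coming.
      if (attempt < maxTries) {
        await new Promise((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

If the final attempt still fails, the error propagates, which is the point where continue-on-fail or the Error Trigger takes over.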

When to enable per-node retry:

  • Every external API call (HTTP, Jira, Slack, GitHub, Stripe, etc.)
  • Every DB write
  • Every queue operation

When to skip retry:

  • Signature verification (it’s not transient; retrying won’t help)
  • Validation steps (same)
  • Operations that mutate state non-idempotently (you could double-charge a customer)

The defaults of 3 tries + 5 sec are fine for most cases. For long-running APIs that occasionally take 30+ seconds, bump wait to 30 sec.

Per-branch continue-on-fail

Sometimes failure is part of normal flow. Example: “look up user by email; if not found, create them.”

[HTTP: GET /users?email=...]
    ├─ on success → [continue normally with found user]
    └─ on fail   → [HTTP: POST /users to create] → [continue]

In the failing node’s settings:

Continue On Fail: ON

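Now a failure in this node no longer stops the workflow. Depending on your n8n version, the node either exposes a separate error output or passes the error along on its normal output; either way, route the success and failure cases to different downstream branches (with an IF node if there is only one output).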

Failure output includes the error in $json.error. Useful for distinguishing “user not found” (404 → expected) from “server error” (5xx → real problem). In the failure branch’s first node, check:

IF $json.error.httpCode = 404 → handle as not-found
otherwise → re-throw via Code node:  throw new Error(JSON.stringify($json.error));
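As a rough sketch, the same check can live in a small function inside a Code node. The function name routeLookupError and the returned action label are hypothetical; the only thing taken from the check above is the httpCode field, which may arrive as a string or a number.

```javascript
// Hypothetical helper for the first node of the failure branch.
// Assumption: the failed node's error object carries an `httpCode` field.
function routeLookupError(error) {
  const code = Number(error?.httpCode);
  if (code === 404) {
    // Expected outcome: the user does not exist yet, so take the "create" path.
    return { action: 'create-user' };
  }
  // Anything else is a real problem: re-throw so the failure stays visible.
  throw new Error(JSON.stringify(error ?? { message: 'unknown error' }));
}
```

In a Code node this would be called as routeLookupError($json.error), with the result returned as the item's json.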

Error Trigger — the safety net

The most under-used n8n feature. Create a new workflow with an “Error Trigger” node as the first node. This workflow fires whenever a workflow that has it set as its Error Workflow fails (i.e., a node throws and isn’t handled).

[Error Trigger]
[Code: format the failure into a Slack message]
[Slack: post to #automation-alerts]

The Error Trigger receives:

{
  "execution": {
    "id": "...",
    "url": "https://n8n.example.com/execution/...",
    "error": { "message": "...", "stack": "...", "node": "..." },
    "lastNodeExecuted": "..."
  },
  "workflow": {
    "id": "...",
    "name": "Jira Auto Assign"
  }
}

You can build a one-shot alerter from there:

Code:
const item = $input.first().json;
const w = item.workflow.name;
const errMsg = item.execution.error.message;
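// Prefer lastNodeExecuted; fall back to the node attached to the error, if any.
const node = item.execution.lastNodeExecuted ?? item.execution.error.node?.name ?? 'unknown';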
const url = item.execution.url;

return [{ json: {
  text: `🚨 Workflow *${w}* failed at node *${node}*\n` +
        `Error: \`${errMsg}\`\n` +
        `<${url}|View execution>`
}}];

Slack: post to #automation-alerts

In every production workflow’s settings: set “Error Workflow” to point at this alerter. Now every unhandled failure lands in the alerts channel where someone will see it.
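Because error messages and stack traces can contain URLs with embedded credentials, it may be worth scrubbing the text before it reaches Slack. The sketch below is one possible approach; the function name redactSecrets and the list of parameter names are assumptions, and the patterns are illustrative rather than exhaustive.

```javascript
// Hypothetical sanitizer; the patterns are illustrative, not exhaustive.
function redactSecrets(text) {
  return String(text)
    // user:password@ embedded in a URL
    .replace(/(\w+:\/\/)[^\/\s:@]+:[^\/\s@]+@/g, '$1[REDACTED]@')
    // common secret-bearing query parameters
    .replace(/([?&](?:token|api_key|apikey|access_token|key|secret)=)[^&\s]+/gi, '$1[REDACTED]')
    // Authorization-style bearer tokens
    .replace(/(Bearer\s+)[A-Za-z0-9._~+\/-]+=*/g, '$1[REDACTED]');
}
```

In the alerter, this would wrap the error message before it is interpolated into the Slack text.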

Dead-letter pattern

For workflows that process events in batches (webhook receivers, queue consumers), failed items shouldn’t block the rest. The pattern:

[Webhook receives batch of events]
[Split: one item at a time]
[Process each — wrap critical work in try/catch (Code node)]
[If failed: store in dead-letter table; continue]
[At end: log summary; ack webhook]

Code-node skeleton:

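// Assumptions: doWork() and deadLetter.insert() are placeholders for your own
// processing logic and dead-letter writer; neither is an n8n built-in.
const items = $input.all();
const results = [];

for (const item of items) {
  try {
    const out = await doWork(item.json);
    results.push({ json: { ok: true, ...out } });
  } catch (err) {
    // Park the failed item and keep going so one bad event doesn't block the batch.
    await deadLetter.insert({
      workflow_name: $workflow.name, // the table below requires this column
      payload: item.json,
      error: err.message,
      stack: err.stack,
      received_at: new Date().toISOString(),
    });
    results.push({ json: { ok: false, err: err.message } });
  }
}

return results;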

Dead-letter is just a Postgres table:

CREATE TABLE deadletter_events (
  id bigserial PRIMARY KEY,
  workflow_name text NOT NULL,
  payload jsonb NOT NULL,
  error text NOT NULL,
  stack text,
  received_at timestamptz NOT NULL DEFAULT now(),
  retried_at timestamptz,
  resolved boolean NOT NULL DEFAULT false
);

A separate “replay” workflow can pick from this table on a schedule. Once a fix is deployed, you re-run the failed events.
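A minimal sketch of what that replay step might look like, operating on rows shaped like the deadletter_events table. The function name replayDeadLetters and the handler callback are hypothetical; in a real workflow the rows would come from a Postgres node and the returned updates would be written back by another one.

```javascript
// Hypothetical replay step over rows shaped like deadletter_events.
// `handler` stands in for whatever re-processes a payload; it is an assumption, not an n8n API.
async function replayDeadLetters(rows, handler, now = () => new Date().toISOString()) {
  const updates = [];
  for (const row of rows) {
    if (row.resolved) continue; // already handled, skip
    let resolved = false;
    try {
      await handler(row.payload);
      resolved = true;
    } catch {
      // Still failing: leave it unresolved so the next run picks it up again.
    }
    updates.push({ id: row.id, retried_at: now(), resolved });
  }
  return updates;
}
```

Recording retried_at even for items that fail again makes it easy to spot events that have been replayed repeatedly without success.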

Consecutive-failure alerting

A single failed execution is usually noise. Consecutive failures = real problem.

In your Error Trigger workflow, instead of alerting on every failure, increment a counter (Postgres or n8n’s “Static Data” feature):

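// db.query() and buildSlackMsg() are placeholders for your own Postgres client and message formatter.
const item = $input.first().json;
const w = item.workflow.name;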
const result = await db.query(
  `INSERT INTO workflow_failures (workflow_name, count) VALUES ($1, 1)
   ON CONFLICT (workflow_name) DO UPDATE SET count = workflow_failures.count + 1
   RETURNING count`,
  [w]
);
const count = result.rows[0].count;

if (count === 1 || count % 5 === 0) {
  // alert on 1st failure and every 5th after
  return [{ json: { ...buildSlackMsg(item), count } }];
}
return []; // don't alert

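Reset the counter when the original workflow next succeeds, for example with a final node in that workflow that sets its count back to 0. Without the reset, the counter measures total failures rather than consecutive ones.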

This avoids “20 alerts for the same outage” and “no alert because each individual failure was filed away as noise.”
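The counting logic is easier to reason about in isolation. Below is a small sketch that uses an in-memory Map as a stand-in for the workflow_failures table; the function names are hypothetical, and the alert rule is the same "1st failure and every 5th after" used in the snippet above.

```javascript
// Hypothetical in-memory stand-in for the workflow_failures table.
const failureCounts = new Map();

// Alert on the 1st failure and every 5th after.
function shouldAlert(count) {
  return count === 1 || count % 5 === 0;
}

// Increment the streak for a workflow and report whether to alert.
function recordFailure(workflowName) {
  const count = (failureCounts.get(workflowName) ?? 0) + 1;
  failureCounts.set(workflowName, count);
  return { count, alert: shouldAlert(count) };
}

// Call this when the workflow succeeds, so the count only measures a streak.
function recordSuccess(workflowName) {
  failureCounts.delete(workflowName);
}
```

With this rule, seven failures in a row produce two alerts (on the 1st and 5th), and a success in between starts the count over.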

Rate limits as expected errors

External APIs return 429 when rate-limited. n8n’s HTTP node sees this as an error. You probably want to retry, not alert.

Pattern: catch 429 explicitly, sleep based on Retry-After, retry. n8n’s per-node retry is too crude for this — use a Code node with manual logic:

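// Assumes fetch is available in your Code node runtime; if it isn't,
// swap in the HTTP helper your n8n version provides.
const url = '...';
const maxTries = 5;
let backoff = 1000;

for (let i = 0; i < maxTries; i++) {
  const resp = await fetch(url, { headers: { Authorization: '...' } });
  if (resp.ok) return [{ json: await resp.json() }];
  if (resp.status === 429) {
    // Retry-After may be seconds or an HTTP-date; fall back to 5 s if it isn't a plain number.
    const parsed = parseInt(resp.headers.get('retry-after') ?? '', 10);
    const retryAfter = Number.isNaN(parsed) ? 5 : parsed;
    await new Promise(r => setTimeout(r, retryAfter * 1000));
    continue;
  }
  if (resp.status >= 500) {
    // Server error: exponential backoff (1 s, 2 s, 4 s, ...).
    await new Promise(r => setTimeout(r, backoff));
    backoff *= 2;
    continue;
  }
  // Any other status is treated as permanent: fail immediately.
  throw new Error(`API ${resp.status}: ${await resp.text()}`);
}
throw new Error('max retries exceeded');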

Later n8n releases may add more sophisticated retry options. For now, Code nodes carry the load.
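One detail worth handling: HTTP allows Retry-After to be either a number of seconds or an HTTP-date. A small parser covering both forms might look like the sketch below; the function name retryAfterMs and the 5-second fallback are assumptions chosen to match the loop above.

```javascript
// Hypothetical parser for the Retry-After header value.
// Handles both allowed forms: a delay in seconds, or an HTTP-date.
function retryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 5000) {
  if (headerValue == null || String(headerValue).trim() === '') return fallbackMs;
  const value = String(headerValue).trim();
  // Plain non-negative integer: delay in seconds.
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Otherwise try to read it as a date.
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;
  // A date in the past means "retry now".
  return Math.max(0, dateMs - nowMs);
}
```

The result can be passed straight to setTimeout in place of retryAfter * 1000.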

Audit trail: every failure traceable

Three sources of truth:

  • n8n’s built-in execution history: searchable in the UI, retained for EXECUTIONS_DATA_MAX_AGE hours (720, i.e. 30 days, in our setup)
  • Slack alerts channel: human-visible, ephemeral
  • Postgres dead-letter table: long-term, queryable, scriptable

The Slack channel is for “did this fail right now?” Dead-letter is for “did this fail twice last month?” n8n’s execution history is for “exactly what happened in that one failure.”

Common Pitfalls

No Error Workflow set on production workflows. Failures vanish into n8n’s execution log; nobody notices until something downstream surfaces it days later.

Alerting on every failure. Alert fatigue. After two days, the channel is muted and real failures are missed.

Retrying non-idempotent operations. Sending duplicate notifications, charging twice, creating duplicate tickets. Always think about idempotency before enabling retry.

Dead-letter without a replay path. Items pile up; nobody knows how to reprocess them. Build the replay workflow from day one.

Returning 200 from webhook handlers that then fail downstream. Sender thinks success; the event is lost. Either process synchronously or write to dead-letter before responding.

Putting credentials in error messages. Stack traces sometimes include URL with embedded auth. Sanitize before sending to Slack.

Treating every error as transient. A 401 won’t fix itself. Distinguish transient (timeout, 5xx, 429) from permanent (other 4xx such as 401 or 404, validation errors). Retry the first kind; alert or dead-letter the second.
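As a rough illustration of that last distinction, a classifier can be as small as the sketch below. The function name classifyFailure and the convention that a missing status means a timeout or network error are assumptions; adjust the groupings to the APIs you call.

```javascript
// Hypothetical classifier following the transient/permanent rule of thumb.
function classifyFailure(status) {
  if (status == null) return 'transient';  // no HTTP response: timeout or network error
  if (status === 429) return 'transient';  // rate limited: retry after a wait
  if (status >= 500) return 'transient';   // server-side trouble: retry with backoff
  if (status >= 400) return 'permanent';   // other 4xx: retrying won't help
  return 'ok';
}
```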

Wrapping Up

Per-node retry + per-branch continue-on-fail + error trigger workflow + dead-letter + consecutive-failure alerting. Five patterns; sufficient for almost any production n8n setup. Monday: a standup bot in n8n + Slack — a workflow that ties together everything we’ve covered so far.