Monitoring n8n in Production
May 27, 2022 · 6 min read · by Muhammad Amal · programming

TL;DR — Health probes hit /healthz. Scrape execution stats from Postgres into Prometheus. Ship n8n logs to Loki. Dashboard: per-workflow success rate, p95 execution duration, failure count. Alert on consecutive failures or sustained low success rate.

Final how-to post of the month, before the retro. After securing n8n, the last operational concern: knowing when things break. n8n’s built-in execution history is useful but only when you remember to check. Production needs metrics, dashboards, alerts.

What to monitor

Three levels of question:

  1. Is n8n up? Process health, DB connectivity.
  2. Are workflows running? Schedule triggers firing, webhooks receiving.
  3. Are workflows succeeding? Execution success rate per workflow over time.

Different signals, different alerts.

Level 1: process health

Run a healthcheck against n8n’s /healthz endpoint (or / for older versions):

# docker-compose.yml
n8n:
  healthcheck:
    test: ["CMD", "wget", "-qO-", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3

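Kubernetes liveness/readiness probes do the equivalent. When the liveness probe on /healthz fails, Kubernetes restarts the container automatically. Plain Docker only marks the container unhealthy, so pair the healthcheck with something that acts on that status.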

For external monitoring, ping /healthz from your status page or uptime checker (Uptime Robot, Better Stack, or your own Prometheus blackbox exporter).

This catches process crashes. Doesn’t catch “n8n is up but every workflow is failing.”
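
If you want the same check from a script outside the container, here is a minimal Python sketch of it. The function names are mine, and the "three failures in a row" rule simply mirrors retries: 3 from the compose file above; adjust both to your setup.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP error.
        return False


def is_unhealthy(results: list[bool], retries: int = 3) -> bool:
    """Unhealthy only after `retries` failed probes in a row (like `retries: 3`)."""
    recent = results[-retries:]
    return len(recent) == retries and not any(recent)
```

Call probe("http://localhost:5678/healthz") every 30 seconds, append the result to a list, and act when is_unhealthy(...) turns true. One failed probe is not an outage.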

Level 2: workflows running

The most actionable signal: is the workflow you expected to run actually running?

n8n’s Postgres stores execution metadata. Useful queries:

Recently failed executions:

SELECT
  w.name AS workflow,
  e.id AS execution_id,
  e.started_at,
  e.finished,
  e.mode,
  e.data->>'error' AS error
FROM execution_entity e
JOIN workflow_entity w ON w.id = e.workflow_id
WHERE e.started_at > now() - interval '1 hour'  -- started_at, so unfinished rows (NULL finished_at) are kept
  AND (e.finished = false OR e.data ? 'error')
ORDER BY e.started_at DESC;

Success rate per workflow, last 24 hours:

SELECT
  w.name AS workflow,
  count(*) FILTER (WHERE e.finished AND NOT (e.data ? 'error')) AS succeeded,
  count(*) FILTER (WHERE e.finished AND e.data ? 'error') AS failed,
  round(100.0 * count(*) FILTER (WHERE e.finished AND NOT (e.data ? 'error')) /
        NULLIF(count(*), 0), 2) AS success_pct
FROM execution_entity e
JOIN workflow_entity w ON w.id = e.workflow_id
WHERE e.started_at > now() - interval '24 hours'
GROUP BY w.name
ORDER BY success_pct ASC;

Workflows below 95% success rate are the ones to investigate.
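
The same 95% rule as a small Python sketch, for when the rows land in a script rather than a dashboard. The WorkflowStats shape is an assumption of mine that mirrors the succeeded/failed columns of the query. Unlike the SQL, it divides by finished runs only, so in-progress executions don't drag the rate down.

```python
from typing import NamedTuple, Optional


class WorkflowStats(NamedTuple):
    name: str
    succeeded: int
    failed: int


def success_pct(stats: WorkflowStats) -> Optional[float]:
    """Success rate over finished runs; None when the workflow did not run."""
    total = stats.succeeded + stats.failed
    if total == 0:
        return None
    return round(100.0 * stats.succeeded / total, 2)


def needs_investigation(rows, threshold: float = 95.0):
    """Return (name, pct) pairs below the threshold, worst first."""
    flagged = []
    for row in rows:
        pct = success_pct(row)
        if pct is not None and pct < threshold:
            flagged.append((row.name, pct))
    return sorted(flagged, key=lambda item: item[1])


rows = [
    WorkflowStats("standup-bot", 499, 1),
    WorkflowStats("slack-slash-jira", 189, 11),
    WorkflowStats("idle-workflow", 0, 0),
]
print(needs_investigation(rows))  # [('slack-slash-jira', 94.5)]
```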

Prometheus scraping

Two paths:

Path A — n8n’s experimental metrics endpoint. n8n 0.176 has an experimental /metrics endpoint (set N8N_METRICS=true). Exposes basic Prometheus-style counters. Useful but limited; expect more in n8n 1.0.

Path B — postgres_exporter on n8n’s database. Run prometheus-community/postgres_exporter pointed at the n8n DB, with custom queries that derive metrics from execution_entity:

# postgres_exporter custom-queries.yaml
n8n_executions:
  query: |
    SELECT
      w.name,
      sum(CASE WHEN e.finished AND NOT (e.data ? 'error') THEN 1 ELSE 0 END) AS succeeded,
      sum(CASE WHEN e.finished AND e.data ? 'error' THEN 1 ELSE 0 END) AS failed,
      sum(CASE WHEN NOT e.finished THEN 1 ELSE 0 END) AS in_progress
    FROM execution_entity e
    JOIN workflow_entity w ON w.id = e.workflow_id
    WHERE e.started_at > now() - interval '5 minutes'
    GROUP BY w.name
  metrics:
    - name: { usage: LABEL, description: "workflow name" }
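    - succeeded: { usage: GAUGE, description: "succeeded executions, last 5 min" }
    - failed: { usage: GAUGE, description: "failed executions, last 5 min" }
    - in_progress: { usage: GAUGE, description: "in-progress executions" }

Now Prometheus scrapes n8n_executions_succeeded{name="standup-bot"}. Per-workflow visibility. All three are gauges: each value is a count over a sliding 5-minute window that goes up and down, so graph and alert on it directly rather than wrapping it in rate().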
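
To make the naming concrete, here is a rough Python sketch of how one query row turns into scrapeable series: the query key becomes the prefix, each value column becomes a metric, and the LABEL column becomes a label. This is a simplification for illustration, not the exporter's actual code; real output also carries HELP/TYPE lines and may add labels of its own.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline for a Prometheus label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(namespace: str, rows, label: str = "name"):
    """Render one exposition-format line per value column of each row."""
    lines = []
    for row in rows:
        label_value = escape_label(str(row[label]))
        for column, value in row.items():
            if column == label:
                continue  # the label column is not a metric itself
            lines.append(f'{namespace}_{column}{{{label}="{label_value}"}} {value}')
    return lines


row = {"name": "standup-bot", "succeeded": 12, "failed": 1, "in_progress": 0}
for line in render_metrics("n8n_executions", [row]):
    print(line)
# n8n_executions_succeeded{name="standup-bot"} 12
# n8n_executions_failed{name="standup-bot"} 1
# n8n_executions_in_progress{name="standup-bot"} 0
```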

Grafana dashboard

Useful panels:

  • Execution count per workflow (last 24h) — stacked bar
  • Success rate per workflow (last 24h) — table sorted ascending
  • P50/P95/P99 execution duration (per workflow) — line chart
  • Failures in last hour — single stat, big number
  • Workflow run age — for each scheduled workflow: time since last successful run

The “time since last success” panel is the most actionable. A workflow whose last success was 3 days ago is either broken or shouldn’t exist anymore.
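
For the P50/P95/P99 panel, it helps to know what the number means. A minimal sketch using the nearest-rank method on raw durations; this is one of several percentile definitions, and Grafana or Prometheus may interpolate and show slightly different values.

```python
def percentile(values, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no durations to summarise")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Ceiling of pct * n / 100 using integer arithmetic only.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


durations_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(percentile(durations_ms, 50))  # 500
print(percentile(durations_ms, 95))  # 1000
```

With only ten samples, P95 and P99 are both the slowest run. High percentiles need volume before they say anything.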

Log shipping

n8n logs to stdout (set N8N_LOG_OUTPUT=console). Ship via your normal Docker / k8s log pipeline:

  • Loki + Promtail: Promtail tails container logs, ships to Loki
  • Fluent Bit → Elasticsearch: alternative for ELK shops
  • Cloud-native: CloudWatch / Cloud Logging / Azure Monitor

Set N8N_LOG_LEVEL=info in prod. Debug logs are noisy and contain sensitive payloads.

Useful log queries (in Loki):

# Errors in last 15 min
{container="n8n"} |= "error" | json | level="error"

# Specific workflow failures
{container="n8n"} | json | workflow="standup-bot" | level="error"

# Slow executions
{container="n8n"} | json | duration > 5000
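
The same filters in Python, handy for checking a saved log file before the Loki pipeline exists. It assumes what the queries above assume: one JSON object per line with level, workflow and duration fields. Those field names depend on your log format, so verify them against real output first.

```python
import json


def parse_line(line: str):
    """Return the parsed JSON object, or None for lines that are not JSON objects."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def slow_or_failed(lines, max_duration_ms: int = 5000):
    """Keep records with level == "error" or a duration above the limit."""
    matches = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip plain-text lines instead of crashing on them
        is_error = record.get("level") == "error"
        duration = record.get("duration")
        is_slow = isinstance(duration, (int, float)) and duration > max_duration_ms
        if is_error or is_slow:
            matches.append(record)
    return matches


sample = [
    '{"level": "info", "workflow": "standup-bot", "duration": 120}',
    '{"level": "error", "workflow": "jira-auto-assign", "message": "Jira API 429"}',
    '{"level": "info", "workflow": "notion-jira-sync", "duration": 7300}',
    "plain text line that is not JSON",
]
print([record["workflow"] for record in slow_or_failed(sample)])
# ['jira-auto-assign', 'notion-jira-sync']
```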

Alerting

Three alerts I recommend, in order of priority:

1. n8n process down for > 2 minutes.

max_over_time(up{job="n8n"}[2m]) == 0

Pages immediately. Means n8n itself is broken.

2. Consecutive workflow failures.

A simple proxy that needs no per-run ordering: count failed executions per workflow over the last hour and alert once the count reaches a threshold:

SELECT w.name, count(*) AS failures
FROM workflow_entity w
JOIN execution_entity e ON e.workflow_id = w.id
WHERE e.started_at > now() - interval '1 hour'
  AND e.data ? 'error'
GROUP BY w.name
HAVING count(*) >= 5;

Alert on rows. Paired with the n8n Error Trigger workflow from the error handling post, this catches workflows that are failing repeatedly.
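
An hourly count can't tell five failures in a row from five failures scattered among hundreds of successes. If you want strictly consecutive failures, you need run order. A minimal sketch, assuming you have fetched (workflow, started_at, failed) tuples yourself; the tuple shape and function names are mine.

```python
from collections import defaultdict


def trailing_failures(executions):
    """Count, per workflow, how many of the most recent runs failed in a row.

    executions: iterable of (workflow, started_at, failed) in any order,
    where started_at is any sortable timestamp.
    """
    by_workflow = defaultdict(list)
    for workflow, started_at, failed in executions:
        by_workflow[workflow].append((started_at, failed))

    streaks = {}
    for workflow, runs in by_workflow.items():
        streak = 0
        # Walk from the newest run backwards until the first success.
        for _, failed in sorted(runs, key=lambda run: run[0], reverse=True):
            if not failed:
                break
            streak += 1
        streaks[workflow] = streak
    return streaks


def workflows_to_alert(streaks, threshold: int = 5):
    """Names of workflows whose current failure streak reached the threshold."""
    return sorted(name for name, streak in streaks.items() if streak >= threshold)
```

A single success resets the streak, so a flaky workflow that fails every other run never trips this alert. That case is what the sustained success-rate check is for.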

3. Workflow that should have run but didn’t.

For scheduled workflows, check time since last success:

SELECT
  w.name,
  max(e.finished_at) AS last_success
FROM workflow_entity w
LEFT JOIN execution_entity e ON e.workflow_id = w.id
  AND e.finished = true AND NOT (e.data ? 'error')
WHERE w.active = true
  AND w.name IN ('standup-bot', 'daily-report', 'nightly-sync')
GROUP BY w.name;

If last_success for any of these is older than ~2 hours past schedule, alert. Workflow is silently not running. A NULL last_success means the workflow has never succeeded; alert on that too.
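
The staleness rule as a Python sketch. The expected interval per workflow and the two-hour grace period are values you supply; nothing here reads them from n8n. A missing last success (NULL from the LEFT JOIN) counts as stale.

```python
from datetime import datetime, timedelta, timezone


def stale_workflows(last_success, expected_interval, now, grace=timedelta(hours=2)):
    """Return workflows whose last success is older than interval + grace, or missing.

    last_success: {name: datetime or None}
    expected_interval: {name: timedelta between scheduled runs}
    """
    stale = []
    for name, interval in expected_interval.items():
        finished_at = last_success.get(name)
        if finished_at is None or now - finished_at > interval + grace:
            stale.append(name)
    return sorted(stale)


now = datetime(2022, 5, 27, 12, 0, tzinfo=timezone.utc)
daily = timedelta(hours=24)
last_success = {
    "standup-bot": now - timedelta(minutes=18),
    "nightly-sync": now - timedelta(hours=30),
    "daily-report": None,  # never succeeded
}
expected = {"standup-bot": daily, "nightly-sync": daily, "daily-report": daily}
print(stale_workflows(last_success, expected, now))
# ['daily-report', 'nightly-sync']
```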

Per-workflow SLOs

For workflows the team depends on (standup bot, deploy bot, on-call assignment), define explicit SLOs:

  • Standup bot success rate: 99% over 30 days
  • Standup bot execution latency: P95 < 5 seconds
  • Jira auto-assign: 95% success over 7 days
  • Webhook receivers: P99 response time < 1 second

These don’t need to be in fancy SLO tooling. A weekly review of “did each critical workflow meet its target?” is enough.
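
The weekly review fits in a few lines of Python. A sketch under stated assumptions: the Slo class and the counts dictionary are hypothetical shapes, and the succeeded/failed numbers come from whatever query you already run for the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    workflow: str
    target_pct: float


def review(slos, counts):
    """Return (workflow, actual_pct, met) per SLO; no data counts as not met.

    counts: {workflow: (succeeded, failed)} over the SLO window.
    """
    report = []
    for slo in slos:
        succeeded, failed = counts.get(slo.workflow, (0, 0))
        total = succeeded + failed
        actual = 100.0 * succeeded / total if total else None
        met = actual is not None and actual >= slo.target_pct
        report.append((slo.workflow, actual, met))
    return report


slos = [Slo("standup-bot", 99.0), Slo("jira-auto-assign", 95.0)]
counts = {"standup-bot": (598, 2), "jira-auto-assign": (90, 10)}
for workflow, actual, met in review(slos, counts):
    print(workflow, "met" if met else "MISSED")
# standup-bot met
# jira-auto-assign MISSED
```

Treating "no runs at all" as a miss is deliberate. A critical workflow that never ran should not pass the review.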

Dashboard mockup — the panels I actually look at

┌─────────────────────────────────────────────────────────────────────┐
│ n8n: Up ✓    Postgres: Up ✓    Last hour failures: 2                │
├─────────────────────────────────────────────────────────────────────┤
│ Success rate per workflow (last 24h):                                │
│   standup-bot         99.8%  ██████████████████████████              │
│   jira-auto-assign    97.2%  █████████████████████████               │
│   notion-jira-sync    100.0% ██████████████████████████              │
│   slack-slash-jira    94.5%  ████████████████████████                │
├─────────────────────────────────────────────────────────────────────┤
│ Time since last success:                                             │
│   standup-bot         18 min                                         │
│   jira-auto-assign    3 min                                          │
│   notion-jira-sync    4 hours  ⚠                                     │
│   slack-slash-jira    1 hour                                         │
├─────────────────────────────────────────────────────────────────────┤
│ Recent errors (last 1h):                                             │
│   12:14  jira-auto-assign:  Jira API 429                             │
│   13:02  notion-jira-sync:  Notion 502                               │
└─────────────────────────────────────────────────────────────────────┘

One screen, weekly review on Mondays. That’s it.

Common Pitfalls

Watching n8n’s UI execution log instead of an external dashboard. Nobody checks the n8n UI proactively. Push metrics to a dashboard you already look at.

Alerting on every failure. Alert fatigue. Threshold (consecutive failures, sustained rate) is what makes alerts useful.

Not retaining execution data long enough. Setting EXECUTIONS_DATA_MAX_AGE to 7 days (168, since the value is in hours) means you can’t debug a “this broke 2 weeks ago” report. 30 days (720) is sane.

Retaining forever. The execution_entity table grows fast. Workflows that produce 1MB JSON payloads × 1000 runs/day × 365 days come to roughly 365 GB a year. Prune.

Confusing “process is up” with “workflows are running.” They’re independent failure modes. Monitor both.

Logging payloads with PII. Be specific about what to log. Never the full payload.
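
One way to be specific is an allowlist: name the fields that may be logged and drop everything else. A minimal sketch; the field names are hypothetical, so pick the ones your workflows actually emit.

```python
# Fields that are safe to log. Anything not listed here is dropped.
ALLOWED_FIELDS = frozenset({"workflow", "execution_id", "status", "duration_ms"})


def loggable(event: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Return a copy of the event containing only allowlisted top-level fields."""
    return {key: value for key, value in event.items() if key in allowed}


event = {
    "workflow": "standup-bot",
    "execution_id": 4211,
    "status": "error",
    "payload": {"email": "someone@example.com"},  # never reaches the log
}
print(loggable(event))
# {'workflow': 'standup-bot', 'execution_id': 4211, 'status': 'error'}
```

An allowlist fails safe: a new field carrying PII stays out of the logs until someone adds it on purpose. A blocklist has the opposite failure mode.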

Treating n8n metrics as an afterthought. If automation is load-bearing, observability is too. Set it up before you need it.

Wrapping Up

Process health, execution metrics, log aggregation, three targeted alerts. n8n in production is just another service that needs observability — the patterns are the same as your other Node services. Monday: the May retro wrapping up the month.