Observability for n8n in 2025, Metrics, Logs, and Traces

August 22, 2025 · 10 min read · by Muhammad Amal programming

TL;DR — n8n 1.78 exposes Prometheus metrics natively, writes structured logs if asked, and accepts OpenTelemetry SDK wiring with a small init script. Wire all three. Alert on event loop lag, queue depth, and execution error rate. Correlate via execution ID across logs and traces.

The first time an n8n workflow silently breaks in production, you’ll wish you had observability. The second time, you’ll have it. This article skips the first time.

We’ll cover metrics from the native Prometheus endpoint, structured JSON logs with correlation IDs, OpenTelemetry tracing with spans per node, the dashboards I actually use, and the alerting rules that catch real incidents. The stack assumes the Kubernetes deployment from the self-hosted on Kubernetes article and the queue mode setup from earlier in this series.

Opinions stated plainly. Metrics catch trends, logs catch facts, traces catch causation. You need all three. A workflow execution without a trace is a debugging nightmare across a queue-mode cluster. The default n8n logs are too terse and unstructured for any real log pipeline.

1. Metrics, the Foundation

n8n 1.78 exposes a /metrics endpoint with Prometheus-format metrics when N8N_METRICS=true. The metric set covers process internals (event loop, GC), HTTP server, and workflow executions.

export N8N_METRICS=true
export N8N_METRICS_PREFIX=n8n_
export N8N_METRICS_INCLUDE_DEFAULT_METRICS=true
export N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
export N8N_METRICS_INCLUDE_NODE_TYPE_LABEL=true
export N8N_METRICS_INCLUDE_QUEUE_METRICS=true
export N8N_METRICS_INCLUDE_API_ENDPOINTS=true

Scrape config for Prometheus 3.0:

scrape_configs:
  - job_name: n8n
    kubernetes_sd_configs:
      - role: pod
        namespaces: { names: [n8n] }
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: n8n
        action: keep
      - source_labels: [__meta_kubernetes_pod_label_component]
        target_label: component
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__address__]
        target_label: __address__
        regex: '([^:]+)(?::\d+)?'
        replacement: '${1}:5678'

The relabeling captures the component label (main/webhook/worker), which is essential for slicing dashboards by process type.
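
One thing to watch with the __address__ rewrite: with role: pod, the discovered address may already carry a port, so the pattern has to drop it before appending 5678. Here is a small sketch of that matching logic in JavaScript. The function name is mine, and it only handles IPv4 addresses and hostnames; Prometheus itself applies fully anchored RE2 regexes.

```javascript
// Mimics a relabel rule that forces the scrape port.
// Prometheus anchors relabel regexes at both ends, hence the ^...$ here.
const ADDRESS_PATTERN = /^([^:]+)(?::\d+)?$/;

// Returns the host with the given port, dropping any port already present.
function rewriteAddress(address, port = 5678) {
  return address.replace(ADDRESS_PATTERN, `$1:${port}`);
}

console.log(rewriteAddress('10.0.3.17'));      // 10.0.3.17:5678
console.log(rewriteAddress('10.0.3.17:3000')); // 10.0.3.17:5678
```

A pattern that simply captures the whole address and appends the port would turn the second input into 10.0.3.17:3000:5678, which is not a valid scrape target.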

The metrics that matter

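A short list of metrics worth alerting on. Exact names vary with the n8n version and with N8N_METRICS_PREFIX, so confirm them against your own /metrics output before copying any query: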

+-----------------------------------------+--------------------------------+
| metric                                  | what it tells you              |
+-----------------------------------------+--------------------------------+
| n8n_workflow_executions_total           | success vs error rate          |
| n8n_workflow_execution_duration_seconds | p50/p95/p99 latency            |
| n8n_queue_jobs_waiting                  | backlog                        |
| n8n_queue_jobs_active                   | in-flight                      |
| n8n_event_loop_lag_seconds              | worker health                  |
| n8n_process_resident_memory_bytes       | memory growth                  |
| n8n_http_request_duration_seconds       | API latency                    |
+-----------------------------------------+--------------------------------+

Custom metrics from within workflows

The native metrics cover infrastructure. Business metrics (records synced, invoices processed, etc.) come from the workflow itself. Use a small custom node or HTTP Request to a Prometheus pushgateway.

// Code node: emit a business metric
const value = $input.first().json.invoiceCount;
const labels = `workflow="billing-sync",destination="warehouse"`;

await this.helpers.httpRequest({
  method: 'POST',
  url: 'http://pushgateway:9091/metrics/job/n8n_business',
  body: `# TYPE n8n_invoices_synced_total counter\nn8n_invoices_synced_total{${labels}} ${value}\n`,
  headers: { 'Content-Type': 'text/plain' },
});

return $input.all();

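The pushgateway is the right pattern for short-lived workflows where Prometheus can’t pull. Keep in mind that it stores the last value pushed for a given job and label set and does not sum pushes, so push a running total (or treat the value as a per-run gauge) rather than expecting per-run counts to accumulate. Don’t push generic metrics this way, just the business counters that don’t have an obvious scrape target.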
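
Building the request body by string concatenation works until a label value contains a quote or a backslash. A minimal sketch of a body builder that escapes label values according to the Prometheus text exposition format; the helper names are hypothetical and not part of n8n.

```javascript
// Escapes a label value per the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Builds a Pushgateway body: a TYPE line followed by one sample line.
function buildMetricBody(name, type, labels, value) {
  const labelText = Object.entries(labels)
    .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
    .join(',');
  return `# TYPE ${name} ${type}\n${name}{${labelText}} ${value}\n`;
}

const body = buildMetricBody(
  'n8n_invoices_synced_total',
  'counter',
  { workflow: 'billing-sync', destination: 'warehouse' },
  17,
);
console.log(body);
```

In a Code node, the returned string would be passed as the body of the httpRequest call shown earlier.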

2. Structured Logs

The default n8n log format is plain text. For a real log pipeline, switch to JSON.

export N8N_LOG_LEVEL=info
export N8N_LOG_OUTPUT=console
export N8N_LOG_FORMAT=json

Each log line becomes a JSON object with timestamp, level, message, and a few context fields. Pipe to Vector or Logstash, parse, and ship to Loki or Elasticsearch.

A typical log line:

{
  "timestamp": "2025-08-22T09:14:33.211Z",
  "level": "info",
  "message": "Execution finished",
  "workflow": { "id": "42", "name": "billing-sync" },
  "execution": { "id": "9182", "status": "success", "duration": 1432 }
}

The execution.id is the correlation key. Every log line emitted during an execution carries it, and every trace span should carry the same ID. Metrics should reference it only sparingly, because it is a high-cardinality value (see Common Pitfalls).

Log enrichment

For logs emitted from within a workflow (Code nodes, etc.), add the same correlation fields manually.

// Code node: structured log
console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  level: 'info',
  message: 'fetched records',
  workflow_id: $workflow.id,
  workflow_name: $workflow.name,
  execution_id: $execution.id,
  records: $input.all().length,
}));

return $input.all();

In a queue-mode cluster, this lets you grep across all worker logs for execution_id=9182 and see exactly what happened in one go.
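
The same lookup is easy to script when the logs are already on disk. A rough sketch that filters JSON log lines by execution ID, accepting both shapes used above: the nested execution.id of the native log line and the flat execution_id of the Code node log. The sample lines are made up for illustration, including the plain-text one.

```javascript
// Returns the parsed log entries that belong to one execution.
// Non-JSON lines are skipped instead of aborting the scan.
function filterByExecution(lines, executionId) {
  const wanted = String(executionId);
  const matches = [];
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // plain-text line, not part of the structured stream
    }
    if (entry === null || typeof entry !== 'object') continue;
    // Flat field first (Code node logs), then the nested native shape.
    const id = entry.execution_id ?? entry.execution?.id;
    if (id !== undefined && String(id) === wanted) matches.push(entry);
  }
  return matches;
}

const sample = [
  '{"level":"info","message":"Execution finished","execution":{"id":"9182"}}',
  '{"level":"info","message":"fetched records","execution_id":"9182","records":40}',
  '{"level":"info","message":"fetched records","execution_id":"9183","records":12}',
  'Editor is now accessible',
];
// Prints the messages of the two lines that belong to execution 9182.
console.log(filterByExecution(sample, 9182).map((entry) => entry.message));
```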

Log sampling

Don’t ship every log line at full volume. Workflows that run thousands of times an hour generate noise. The bluntest way to cut volume is to raise the log level at the source:

export N8N_LOG_LEVEL=warn  # workers, drops info

Keep info on main for the audit trail, drop to warn on workers for steady-state.
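
Raising the level drops every info line. If you would rather keep a slice of them, a log shipper or a small filter can sample instead. The sketch below is one way to do that, under my own assumptions: entries are parsed JSON with level and execution_id fields, and the hash is FNV-1a computed over UTF-16 code units. It always keeps warn and error lines and keeps a fixed percentage of executions for everything else.

```javascript
// FNV-1a 32-bit hash: small, dependency-free, and stable across processes.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Keeps every warn/error line and roughly `percent`% of the rest.
// Hashing the execution ID keeps or drops an execution as a whole,
// so the executions that survive still have complete log trails.
function shouldShip(entry, percent) {
  if (entry.level === 'warn' || entry.level === 'error') return true;
  return fnv1a(String(entry.execution_id)) % 100 < percent;
}

const entries = [
  { level: 'info', execution_id: '9182', message: 'fetched records' },
  { level: 'error', execution_id: '9183', message: 'request failed' },
];
console.log(entries.filter((entry) => shouldShip(entry, 10)).length);
```

Because the decision depends only on the execution ID, every worker makes the same choice for the same execution without any coordination.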

3. OpenTelemetry Tracing

The third leg. n8n 1.78 doesn’t have first-class OTel support, but the Node.js auto-instrumentation works because n8n is a normal Node.js process. The trick is preloading the OTel SDK.

Init script

// otel-init.ts (compile to otel-init.js; the --require preload below loads the .js output)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'n8n',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.78.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.DEPLOY_ENV || 'prod',
    'n8n.component': process.env.N8N_COMPONENT || 'main',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-ioredis': { enabled: true },
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

Preload it via NODE_OPTIONS:

export NODE_OPTIONS="--require /opt/otel/otel-init.js"
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc.cluster.local:4318

The auto-instrumentation captures HTTP requests, Postgres queries, and Redis commands. That’s most of the interesting work n8n does. You get a trace per HTTP request to the API or webhook, with a span tree showing the downstream calls.
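
One fragile spot in the init script is the exporter URL: if OTEL_EXPORTER_OTLP_ENDPOINT is missing, string concatenation quietly produces "undefined/v1/traces". A small guard like the following (a hypothetical helper, not part of the OpenTelemetry SDK) fails loudly at startup instead.

```javascript
// Builds the OTLP/HTTP traces URL from a base endpoint.
// Throws early instead of silently exporting to "undefined/v1/traces".
function tracesUrl(baseEndpoint) {
  if (!baseEndpoint) {
    throw new Error('OTEL_EXPORTER_OTLP_ENDPOINT is not set');
  }
  // Trim trailing slashes so the path is not doubled.
  return baseEndpoint.replace(/\/+$/, '') + '/v1/traces';
}

console.log(tracesUrl('http://otel-collector.observability.svc.cluster.local:4318/'));
// http://otel-collector.observability.svc.cluster.local:4318/v1/traces
```

In the init script, the exporter option would then read url: tracesUrl(process.env.OTEL_EXPORTER_OTLP_ENDPOINT).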

Custom spans per node

Auto-instrumentation doesn’t know about n8n workflow nodes. To get per-node spans, instrument from within the workflow.

// Code node at start of workflow
// Requires NODE_FUNCTION_ALLOW_EXTERNAL to include @opentelemetry/api.
const { trace, context, propagation } = require('@opentelemetry/api');
const tracer = trace.getTracer('n8n-workflow');

const span = tracer.startSpan('workflow-execution', {
  attributes: {
    'n8n.workflow.id': $workflow.id,
    'n8n.workflow.name': $workflow.name,
    'n8n.execution.id': $execution.id,
  },
});

// Serialize the span context into a plain object so it can travel with the items.
// propagation.inject() writes into the carrier and returns nothing.
const ctx = trace.setSpan(context.active(), span);
const carrier = {};
propagation.inject(ctx, carrier);

return $input.all().map(item => ({
  json: {
    ...item.json,
    _otel: carrier,
  },
}));

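Subsequent nodes carry the trace context in _otel, and a final Code node ends the span. Note that _otel holds only the serialized context, not the span object, so the closing node needs its own handle on the span (for example a process-level registry keyed by execution ID, which only works when both nodes run in the same process).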

A cleaner approach is a custom n8n node that wraps the trace lifecycle. I keep one in our internal nodes package called Trace Start and Trace End. The startup overhead per execution is around 1ms, negligible compared to the workflow body.
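
When debugging propagation, it helps to read what actually landed in _otel. With the default W3C propagator, the carrier holds a traceparent string. Here is a minimal parser for it, assuming the version-00 layout from the W3C Trace Context specification; the sample value is the one used in that specification.

```javascript
// Parses a W3C Trace Context `traceparent` value:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
// Strict on purpose: accepts the version-00 layout only.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null when the value is malformed or uses all-zero IDs.
function parseTraceparent(value) {
  const match = TRACEPARENT.exec(String(value ?? ''));
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const item = {
  json: { _otel: { traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' } },
};
console.log(parseTraceparent(item.json._otel.traceparent));
```

The traceId from this output is the value to paste into the tracing backend when a workflow item looks wrong.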

4. Correlation Across Signals

The three signals are useful individually. They become magical when correlated.

The pattern: execution.id is the shared key.

+------------+   execution.id   +------------+
|  Metric    |<---------------->|   Trace    |
|  labels    |                  |   span     |
+------------+                  +------------+
       ^                              ^
       |          execution.id        |
       +----------+----------+--------+
                  |
            +-----v-----+
            |    Log    |
            |   field   |
            +-----------+

In Grafana, an “all signals for execution X” view can be a single dashboard with three queries, each filtered by the execution ID. When the on-call sees a failed execution in the n8n UI, they paste the ID into Grafana and get the metric trail, the log lines, and the trace flame graph.

The wiring:

Metrics: include execution_id as a label only on a small, bounded set of series (for example a current_active or last_failed gauge). High-cardinality labels destroy Prometheus.
Logs: include execution_id on every line.
Traces: include n8n.execution.id as a span attribute. Use the OpenTelemetry collector’s tail_sampling to keep all spans for any execution that errored.
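
To make the shared key concrete, here is a sketch that turns one execution ID into the log and trace queries for such a dashboard. It assumes Loki and Tempo as backends and a namespace="n8n" stream label, neither of which is required by n8n; adjust the selectors to your own labels.

```javascript
// Escapes a value for use inside a double-quoted query string literal.
function quote(value) {
  return '"' + String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';
}

// Builds the per-signal queries for one execution.
function queriesForExecution(executionId) {
  const id = quote(executionId);
  return {
    // LogQL: the json parser flattens nested keys, so execution.id
    // and a flat execution_id both end up as execution_id.
    logs: `{namespace="n8n"} | json | execution_id=${id}`,
    // TraceQL: match spans by the attribute set from the workflow.
    traces: `{ span.n8n.execution.id = ${id} }`,
  };
}

console.log(queriesForExecution('9182'));
```

Metrics are deliberately missing from the output: with execution_id kept out of most labels, the metric panel is filtered by workflow and time range instead.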

5. Dashboards That Actually Get Used

Two dashboards I keep in every n8n deployment.

Dashboard 1: Cluster Health

Five panels:

Executions per second (success / error / crashed), stacked.
Queue waiting (line per queue name).
Worker event loop lag p99 (line per pod).
Worker memory usage (line per pod, hide axis label).
Errors per workflow over last hour (table sorted desc).

The “scan in 10 seconds” dashboard. If anything is red, you see it immediately. The reverse is also true: if everything is green, you can move on.

Dashboard 2: Workflow Detail

Variables: workflow_name dropdown. Then per-workflow panels:

Executions per minute (success/error).
Duration p50/p95/p99.
Error rate as % of executions.
Recent failed executions table (execution_id, error message, link to trace).
Custom business metrics if any (records synced, etc.).

The deep-dive view. After an alert fires on a workflow, this is where the on-call lives.

6. Alerting Rules

Three categories of alert.

Infrastructure alerts

Fire on cluster-level issues that affect all workflows.

- alert: N8NWorkerEventLoopLagging
  expr: max by (pod) (n8n_event_loop_lag_seconds) > 1
  for: 5m
  annotations:
    summary: "Worker {{ $labels.pod }} event loop lagging >1s"

- alert: N8NQueueBacklogGrowing
  expr: n8n_queue_jobs_waiting > 200
  for: 10m
  annotations:
    summary: "Queue depth >200 for 10m, add workers"

- alert: N8NWorkerMemoryHigh
  expr: n8n_process_resident_memory_bytes / 1024 / 1024 / 1024 > 1.8
  for: 5m
  annotations:
    summary: "Worker {{ $labels.pod }} memory at 1.8Gi, near limit"

Workflow alerts

Fire on individual workflow degradation.

- alert: N8NWorkflowErrorRate
  expr: |
    sum by (workflow_id) (rate(n8n_workflow_executions_total{status="error"}[5m])) /
    sum by (workflow_id) (rate(n8n_workflow_executions_total[5m])) > 0.1
  for: 5m
  annotations:
    summary: "Workflow {{ $labels.workflow_id }} error rate >10%"

- alert: N8NWorkflowDurationP99
  expr: |
    histogram_quantile(0.99, sum by (workflow_id, le) (rate(n8n_workflow_execution_duration_seconds_bucket[5m]))) > 30
  for: 10m
  annotations:
    summary: "Workflow {{ $labels.workflow_id }} p99 duration >30s"

Business alerts

Fire on application-level KPIs. These are workflow-specific.

- alert: BillingSyncLagging
  expr: time() - n8n_sync_watermark_unix_seconds{sync="billing"} > 1800
  for: 5m
  annotations:
    summary: "Billing sync watermark >30 min behind"
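
The watermark metric in that rule has to come from the workflow itself. A minimal sketch of what the sync could push after each successful batch, plus the same comparison the alert makes. The function names are hypothetical, and the body would go to the pushgateway in the same way as the invoice counter earlier.

```javascript
// Gauge body for the watermark: the Unix timestamp (in seconds) of the
// newest record the sync has fully processed.
function watermarkBody(sync, lastProcessedAt) {
  const seconds = Math.floor(lastProcessedAt.getTime() / 1000);
  return (
    '# TYPE n8n_sync_watermark_unix_seconds gauge\n' +
    `n8n_sync_watermark_unix_seconds{sync="${sync}"} ${seconds}\n`
  );
}

// Mirrors the alert expression: time() - watermark > threshold.
function isLagging(nowSeconds, watermarkSeconds, thresholdSeconds = 1800) {
  return nowSeconds - watermarkSeconds > thresholdSeconds;
}

console.log(watermarkBody('billing', new Date()));
```

Pushing the timestamp of the data, rather than the time of the push, is what lets the alert catch a sync that keeps running but has stopped making progress.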

7. The OpenTelemetry Collector in the Middle

Don’t ship to your backend directly from n8n. Run the OTel Collector as a pipeline layer.

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch: { timeout: 10s, send_batch_size: 1024 }
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 5000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlphttp/tempo:
    endpoint: http://tempo.observability:4318
  prometheusremotewrite:
    endpoint: http://mimir.observability/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo]

The tail-sampling policy keeps all error traces, all slow traces, and 5 percent of the rest. Tracing volume drops by roughly 95 percent without losing the interesting traces.
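
The decision logic is simple enough to sketch in a few lines. This is an illustration of how the three policies combine, not the collector's implementation: the trace shape is invented, and the exact boundary handling of the latency policy may differ in the real processor.

```javascript
// Returns true when a finished trace should be kept.
// A trace is kept if ANY of the three policies matches.
// `random` is injectable so the probabilistic branch can be tested.
function keepTrace(trace, random = Math.random) {
  // Policy "errors": any span ended with an ERROR status.
  if (trace.spans.some((span) => span.status === 'ERROR')) return true;
  // Policy "slow": the whole trace took longer than 5000 ms.
  if (trace.durationMs > 5000) return true;
  // Policy "sample": keep about 5% of everything else.
  return random() * 100 < 5;
}

const failed = { durationMs: 120, spans: [{ status: 'OK' }, { status: 'ERROR' }] };
const slow = { durationMs: 8200, spans: [{ status: 'OK' }] };
console.log(keepTrace(failed), keepTrace(slow)); // true true
```

The decision needs the complete trace, which is why decision_wait exists: the collector buffers spans for 30 seconds before deciding.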

For the canonical OTel Collector reference, the OpenTelemetry Collector docs cover every receiver, processor, and exporter.

Common Pitfalls

Four mistakes.

Adding execution_id as a Prometheus label. Cardinality explodes. Prometheus performance dies. Labels should be low cardinality (workflow name, component, status). Put execution_id in logs and traces, not metric labels.

Sampling traces at 100 percent. At a few thousand executions per minute, this generates tens of GB of trace data per day. Tail-sample to 5-10 percent plus all errors. Keep what matters.

No correlation between logs and metrics. The dashboard has a panel of errors, the user wants the log line, and there’s no link. Always wire execution_id into logs and add log links in Grafana dashboards.

Alerting on raw error counts instead of rates. “Errors > 10 in 5 minutes” looks reasonable until you have a workflow that runs 100K times a day. Use error rate as a fraction, not absolute count.
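
A quick worked example of that last point, with made-up numbers. A workflow running 100K times a day sees about 350 executions per 5-minute window, so 12 errors there is background noise, while the same 12 errors on a workflow that ran 20 times is a real problem. The sketch computes the ratio between two scrapes of a counter pair and ignores counter resets for brevity.

```javascript
// Error ratio between two scrapes of a pair of counters.
// Returns null when there was no traffic, so "no executions"
// is not mistaken for "0% errors".
function errorRatio(prev, curr) {
  const total = curr.total - prev.total;
  const errors = curr.errors - prev.errors;
  if (total <= 0) return null;
  return errors / total;
}

// Two hypothetical workflows, each with 12 errors in the window.
const busy = errorRatio({ total: 0, errors: 0 }, { total: 350, errors: 12 });
const quiet = errorRatio({ total: 0, errors: 0 }, { total: 20, errors: 12 });
console.log(busy > 0.1, quiet > 0.1); // false true
```

A count-based rule would page for both workflows; a 10 percent rate threshold pages only for the second.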

Troubleshooting

Three issues.

Metrics endpoint returns 404. Usually N8N_METRICS is false or unset; check with kubectl exec n8n-main-x -- env | grep METRICS. If the variable is set and the endpoint still returns 404, check that you are port-forwarding to 5678, since metrics share that port with the main API.

OTel traces stop appearing after a deploy. The OTel SDK init script is loading but the exporter URL is wrong. Check the n8n process logs for OTLPTraceExporter errors. Usually a missing env var or a network policy blocking the collector.

Log messages from Code nodes don’t appear in Loki. Loki is parsing each line as JSON, and raw console.log output from a Code node isn’t JSON. Wrap the payload in JSON.stringify({...}) so every emitted line is valid JSON.

For the official guide on the metrics catalog, the n8n metrics docs list every metric and label.

Wrapping Up

Observability for n8n in 2025 is three layers. Prometheus metrics from the native endpoint, structured JSON logs with execution_id correlation, OpenTelemetry traces from auto-instrumentation plus per-node spans. Correlate all three by execution ID. Build dashboards for cluster health and per-workflow detail. Alert on rates, not counts. Tail-sample traces.

That closes this series. From queue mode and Redis to Kubernetes deployment, custom nodes, enterprise data syncs, Jira integrations, credentials management, error handling, and finally observability, you have the production playbook for n8n in August 2025. Treat it like any other distributed system, not a magic automation box, and it will treat you back the same way.
