Observability for n8n in 2025, Metrics, Logs, and Traces
TL;DR: n8n 1.78 exposes Prometheus metrics natively, writes structured logs if asked, and accepts OpenTelemetry SDK wiring with a small init script. Wire all three. Alert on event loop lag, queue depth, and execution error rate. Correlate via execution ID across logs and traces.
The first time an n8n workflow silently breaks in production, you’ll wish you had observability. The second time, you’ll have it. This article skips the first time.
We’ll cover metrics from the native Prometheus endpoint, structured JSON logs with correlation IDs, OpenTelemetry tracing with spans per node, the dashboards I actually use, and the alerting rules that catch real incidents. The stack assumes the Kubernetes deployment from the self-hosted on Kubernetes article and the queue mode setup from earlier in this series.
Opinions stated plainly. Metrics catch trends, logs catch facts, traces catch causation. You need all three. A workflow execution without a trace is a debugging nightmare across a queue-mode cluster. The default n8n logs are too terse and unstructured for any real log pipeline.
1. Metrics, the Foundation
n8n 1.78 exposes a /metrics endpoint with Prometheus-format metrics when N8N_METRICS=true. The metric set covers process internals (event loop, GC), HTTP server, and workflow executions.
export N8N_METRICS=true
export N8N_METRICS_PREFIX=n8n_
export N8N_METRICS_INCLUDE_DEFAULT_METRICS=true
export N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
export N8N_METRICS_INCLUDE_NODE_TYPE_LABEL=true
export N8N_METRICS_INCLUDE_QUEUE_METRICS=true
export N8N_METRICS_INCLUDE_API_ENDPOINTS=true
Scrape config for Prometheus 3.0:
scrape_configs:
  - job_name: n8n
    kubernetes_sd_configs:
      - role: pod
        namespaces: { names: [n8n] }
    relabel_configs:
      # Keep only pods labeled app=n8n.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: n8n
        action: keep
      - source_labels: [__meta_kubernetes_pod_label_component]
        target_label: component
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Force the scrape port to 5678, replacing any port already present
      # in the discovered address (IPv4 pod addresses assumed).
      - source_labels: [__address__]
        target_label: __address__
        regex: '([^:]+)(?::\d+)?'
        replacement: '${1}:5678'
The relabeling captures the component label (main/webhook/worker), which is essential for slicing dashboards by process type.
The metrics that matter
A short list of metrics worth alerting on:
+-----------------------------------------+--------------------------------+
| metric                                  | what it tells you              |
+-----------------------------------------+--------------------------------+
| n8n_workflow_executions_total           | success vs error rate          |
| n8n_workflow_execution_duration_seconds | p50/p95/p99 latency            |
| n8n_queue_jobs_waiting                  | backlog                        |
| n8n_queue_jobs_active                   | in-flight                      |
| n8n_event_loop_lag_seconds              | worker health                  |
| n8n_process_resident_memory_bytes       | memory growth                  |
| n8n_http_request_duration_seconds       | API latency                    |
+-----------------------------------------+--------------------------------+
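Before building alerts on these names, it’s worth checking that your n8n version actually exposes them. The sketch below parses a scrape of /metrics and reports which expected metrics are missing. The function names are my own, the parser is deliberately minimal (it ignores timestamps and escaped characters inside label values), and the sample text is made up for illustration.

```javascript
// Parse Prometheus text exposition format into { name, labels, value } samples.
// Minimal sketch: no timestamps, no escaped quotes inside label values.
function parseMetrics(text) {
  const samples = [];
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) continue; // skip HELP/TYPE lines
    const match = trimmed.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/);
    if (!match) continue;
    const [, name, rawLabels, rawValue] = match;
    const labels = {};
    if (rawLabels) {
      for (const pair of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
        labels[pair[1]] = pair[2];
      }
    }
    samples.push({ name, labels, value: Number(rawValue) });
  }
  return samples;
}

// Return the names from `expected` that never appear in the scrape.
function missingMetrics(text, expected) {
  const seen = new Set(parseMetrics(text).map((s) => s.name));
  return expected.filter((name) => !seen.has(name));
}

// Made-up scrape output for illustration.
const sampleScrape = [
  '# TYPE n8n_queue_jobs_waiting gauge',
  'n8n_queue_jobs_waiting 12',
  'n8n_workflow_executions_total{status="error",workflow_id="42"} 3',
].join('\n');

console.log(parseMetrics(sampleScrape));
console.log(missingMetrics(sampleScrape, ['n8n_queue_jobs_waiting', 'n8n_event_loop_lag_seconds']));
```

Run it against the real output of curl on port 5678 after each n8n upgrade; an alert on a metric that no longer exists never fires.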
Custom metrics from within workflows
The native metrics cover infrastructure. Business metrics (records synced, invoices processed, etc.) come from the workflow itself. Use a small custom node or HTTP Request to a Prometheus pushgateway.
// Code node: emit a business metric
const value = $input.first().json.invoiceCount;
const labels = `workflow="billing-sync",destination="warehouse"`;

// POST replaces any previously pushed metric with the same name in this
// job group, so the stored value is the latest count, not a running total.
await this.helpers.httpRequest({
  method: 'POST',
  url: 'http://pushgateway:9091/metrics/job/n8n_business',
  body: `# TYPE n8n_invoices_synced_total counter\nn8n_invoices_synced_total{${labels}} ${value}\n`,
  headers: { 'Content-Type': 'text/plain' },
});

// Pass the input items through unchanged.
return $input.all();
The pushgateway is the right pattern for short-lived workflows where Prometheus can’t pull. Don’t push generic metrics this way, just the business counters that don’t have an obvious scrape target.
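Hand-built label strings break as soon as a value contains a quote or a backslash. As a sketch, the helper below builds the push body with label values escaped according to the Prometheus text format. The function names buildPushBody and escapeLabelValue are hypothetical, not part of n8n or the pushgateway.

```javascript
// Escape a label value for the Prometheus text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Build a one-sample push body with its TYPE line.
function buildPushBody(name, type, labels, value) {
  if (!Number.isFinite(value)) {
    // A missing JSON field would otherwise be pushed as "undefined".
    throw new Error(`metric ${name} needs a finite numeric value, got ${value}`);
  }
  const labelText = Object.entries(labels)
    .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
    .join(',');
  const series = labelText ? `${name}{${labelText}}` : name;
  return `# TYPE ${name} ${type}\n${series} ${value}\n`;
}

const pushBody = buildPushBody(
  'n8n_invoices_synced_total',
  'counter',
  { workflow: 'billing-sync', destination: 'warehouse' },
  17,
);
console.log(pushBody);
```

The finite-number check is the part that pays off in practice: a missing invoiceCount fails the node loudly instead of sending a body the pushgateway rejects.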
2. Structured Logs
The default n8n log format is plain text. For a real log pipeline, switch to JSON.
export N8N_LOG_LEVEL=info
export N8N_LOG_OUTPUT=console
export N8N_LOG_FORMAT=json
Each log line becomes a JSON object with timestamp, level, message, and a few context fields. Pipe to Vector or Logstash, parse, and ship to Loki or Elasticsearch.
A typical log line:
{
"timestamp": "2025-08-22T09:14:33.211Z",
"level": "info",
"message": "Execution finished",
"workflow": { "id": "42", "name": "billing-sync" },
"execution": { "id": "9182", "status": "success", "duration": 1432 }
}
The execution.id is the correlation key. Every log line emitted during an execution carries it. Every workflow-level metric and trace span should use the same ID.
Log enrichment
For logs emitted from within a workflow (Code nodes, etc.), add the same correlation fields manually.
// Code node: structured log
console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  level: 'info',
  message: 'fetched records',
  workflow_id: $workflow.id,
  workflow_name: $workflow.name,
  execution_id: $execution.id,
  records: $input.all().length,
}));

return $input.all();
In a queue-mode cluster, this lets you grep across all worker logs for execution_id=9182 and see exactly what happened in one go.
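If you don’t have Loki in front of you, the same lookup can be sketched in a few lines. This assumes you have already collected raw log lines from the workers into an array; logsForExecution is a hypothetical helper, and it accepts both the nested execution.id field from n8n’s own lines and the flat execution_id field from the Code node example.

```javascript
// Return the parsed log entries for one execution, oldest first.
// Lines that are not JSON objects are skipped rather than treated as errors.
function logsForExecution(rawLines, executionId) {
  const wanted = String(executionId);
  const hits = [];
  for (const line of rawLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // plain-text line, ignore
    }
    if (entry === null || typeof entry !== 'object') continue;
    const id = entry.execution?.id ?? entry.execution_id;
    if (id !== undefined && String(id) === wanted) hits.push(entry);
  }
  // ISO 8601 UTC timestamps sort correctly as plain strings.
  return hits.sort((a, b) => String(a.timestamp).localeCompare(String(b.timestamp)));
}

// Made-up lines from two workers, deliberately out of order.
const workerLines = [
  '{"timestamp":"2025-08-22T09:14:33.211Z","level":"info","message":"Execution finished","execution":{"id":"9182"}}',
  'plain text line from a library',
  '{"timestamp":"2025-08-22T09:14:32.100Z","level":"info","message":"fetched records","execution_id":"9182"}',
  '{"timestamp":"2025-08-22T09:14:32.500Z","level":"info","message":"fetched records","execution_id":"9183"}',
];

console.log(logsForExecution(workerLines, 9182).map((e) => e.message));
```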
Log sampling
Don’t ship every log line at full volume. Workflows that run thousands of times an hour generate noise. Sample at the source:
export N8N_LOG_LEVEL=warn # workers, drops info
Keep info on main for the audit trail, drop to warn on workers for steady-state.
3. OpenTelemetry Tracing
The third leg. n8n 1.78 doesn’t have first-class OTel support, but the Node.js auto-instrumentation works because n8n is a normal Node.js process. The trick is preloading the OTel SDK.
Init script
// otel-init.ts
// Written against the 1.x OpenTelemetry JS packages; SDK 2.x replaces
// `new Resource(...)` with `resourceFromAttributes(...)`.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  // Resource attributes identify this process in every exported span.
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'n8n',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.78.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.DEPLOY_ENV || 'prod',
    'n8n.component': process.env.N8N_COMPONENT || 'main',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs spans are high-volume noise; keep HTTP, Postgres, and Redis.
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-ioredis': { enabled: true },
    }),
  ],
});

sdk.start();

// Flush pending spans before the pod terminates.
process.on('SIGTERM', () => sdk.shutdown());
Compile it to JavaScript, then preload the compiled file via NODE_OPTIONS:
export NODE_OPTIONS="--require /opt/otel/otel-init.js"
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc.cluster.local:4318
The auto-instrumentation captures HTTP requests, Postgres queries, and Redis commands. That’s most of the interesting work n8n does. You get a trace per HTTP request to the API or webhook, with a span tree showing the downstream calls.
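One sharp edge in the init script: string concatenation on the endpoint variable yields the literal URL "undefined/v1/traces" when the variable is missing, and the exporter then fails quietly in the background. A small guard makes that failure loud at startup. This is a sketch; buildTracesUrl is a hypothetical helper you would call in place of the concatenation.

```javascript
// Build the OTLP/HTTP traces URL from the base endpoint, failing fast
// when the endpoint is missing or malformed.
function buildTracesUrl(endpoint) {
  if (typeof endpoint !== 'string' || endpoint.trim() === '') {
    throw new Error('OTEL_EXPORTER_OTLP_ENDPOINT is not set');
  }
  const url = new URL(endpoint.trim()); // throws on a malformed URL
  // Tolerate a trailing slash or a path prefix on the base endpoint.
  url.pathname = url.pathname.replace(/\/+$/, '') + '/v1/traces';
  return url.toString();
}

console.log(buildTracesUrl('http://otel-collector.observability.svc.cluster.local:4318'));
console.log(buildTracesUrl('http://otel-collector.observability.svc.cluster.local:4318/'));
```

Throwing during preload stops the n8n process from booting without tracing, which is usually what you want in production and easy to relax in development.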
Custom spans per node
Auto-instrumentation doesn’t know about n8n workflow nodes. To get per-node spans, instrument from within the workflow.
// Code node at start of workflow
// Requires NODE_FUNCTION_ALLOW_EXTERNAL to include @opentelemetry/api.
const { trace, context, propagation } = require('@opentelemetry/api');

const tracer = trace.getTracer('n8n-workflow');
const span = tracer.startSpan('workflow-execution', {
  attributes: {
    'n8n.workflow.id': $workflow.id,
    'n8n.workflow.name': $workflow.name,
    'n8n.execution.id': $execution.id,
  },
});

// Serialize the span context into a plain object that later nodes can read.
// inject() writes into the carrier and returns nothing.
const ctx = trace.setSpan(context.active(), span);
const carrier = {};
propagation.inject(ctx, carrier);

return $input.all().map(item => ({
  json: {
    ...item.json,
    _otel: carrier,
  },
}));
Subsequent nodes carry the trace context in _otel. A final Code node ends the span.
A cleaner approach is a custom n8n node that wraps the trace lifecycle. I keep one in our internal nodes package called Trace Start and Trace End. The startup overhead per execution is around 1ms, negligible compared to the workflow body.
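When debugging propagation, it helps to look inside the carrier. Assuming the SDK’s default W3C Trace Context propagator is active, _otel should hold a traceparent entry of the form version-traceid-spanid-flags. The sketch below validates and splits that value; parseTraceparent is a hypothetical helper, and it only accepts the four-field form.

```javascript
// Parse a W3C traceparent value such as
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
// Returns null when the value is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    String(header).trim(),
  );
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 1) === 1, // lowest bit is the sampled flag
  };
}

const exampleCarrier = {
  traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
};
console.log(parseTraceparent(exampleCarrier.traceparent));
```

A null result in a downstream node usually means no SDK was registered in that process, in which case the API hands out no-op spans and injects nothing.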
4. Correlation Across Signals
The three signals are useful individually. They become magical when correlated.
The pattern: execution.id is the shared key.
+------------+ execution.id +------------+
| Metric |<---------------->| Trace |
| labels | | span |
+------------+ +------------+
^ ^
| execution.id |
+----------+----------+--------+
|
+-----v-----+
| Log |
| field |
+-----------+
In Grafana, a panel that surfaces “all signals for execution X” can be a single dashboard with three queries, each filtered by the execution ID. When the on-call sees a failed execution in the n8n UI, they paste the ID into Grafana and get the metric trail, the log lines, and the trace flame graph.
The wiring:
- Metrics: include execution_id as a label only on a low-cardinality subset (current_active, last_failed). High-cardinality labels destroy Prometheus.
- Logs: include execution_id on every line.
- Traces: include n8n.execution.id as a span attribute. Use the OpenTelemetry collector’s tail_sampling to keep all spans for any execution that errored.
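For the log and trace queries, the lookup can be generated from the pasted ID. The sketch below assumes Loki and Tempo as backends, an app="n8n" stream label (an assumption about your log pipeline), and the span attribute named above. Loki’s json parser flattens nested fields with underscores, so execution_id matches both n8n’s nested execution.id and the flat field from Code nodes. correlationQueries is a hypothetical helper.

```javascript
// Build the log and trace queries for one execution ID.
// The stream selector {app="n8n"} is an assumption about your Loki labels.
function correlationQueries(executionId) {
  const id = String(executionId);
  // Refuse anything that could break out of the quoted string in a query.
  if (!/^[A-Za-z0-9_-]+$/.test(id)) {
    throw new Error(`unexpected execution id: ${id}`);
  }
  return {
    logql: `{app="n8n"} | json | execution_id="${id}"`,
    traceql: `{ span.n8n.execution.id = "${id}" }`,
  };
}

console.log(correlationQueries(9182));
```

In Grafana the same strings become panel queries with the ID supplied by a dashboard variable.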
5. Dashboards That Actually Get Used
Two dashboards I keep in every n8n deployment.
Dashboard 1: Cluster Health
Five panels:
- Executions per second (success / error / crashed), stacked.
- Queue waiting (line per queue name).
- Worker event loop lag p99 (line per pod).
- Worker memory usage (line per pod, hide axis label).
- Errors per workflow over last hour (table sorted desc).
The “scan in 10 seconds” dashboard. If anything is red, you see it immediately. The reverse is also true: if everything is green, you can move on.
Dashboard 2: Workflow Detail
Variables: workflow_name dropdown. Then per-workflow panels:
- Executions per minute (success/error).
- Duration p50/p95/p99.
- Error rate as % of executions.
- Recent failed executions table (execution_id, error message, link to trace).
- Custom business metrics if any (records synced, etc.).
The deep-dive view. After an alert fires on a workflow, this is where the on-call lives.
6. Alerting Rules
Three categories of alert.
Infrastructure alerts
Fire on cluster-level issues that affect all workflows.
- alert: N8NWorkerEventLoopLagging
  expr: max by (pod) (n8n_event_loop_lag_seconds) > 1
  for: 5m
  annotations:
    summary: "Worker {{ $labels.pod }} event loop lagging >1s"
- alert: N8NQueueBacklogGrowing
  expr: n8n_queue_jobs_waiting > 200
  for: 10m
  annotations:
    summary: "Queue depth >200 for 10m, add workers"
- alert: N8NWorkerMemoryHigh
  expr: n8n_process_resident_memory_bytes / 1024 / 1024 / 1024 > 1.8
  for: 5m
  annotations:
    summary: "Worker {{ $labels.pod }} memory at 1.8Gi, near limit"
Workflow alerts
Fire on individual workflow degradation.
- alert: N8NWorkflowErrorRate
  expr: |
    sum by (workflow_id) (rate(n8n_workflow_executions_total{status="error"}[5m])) /
    sum by (workflow_id) (rate(n8n_workflow_executions_total[5m])) > 0.1
  for: 5m
  annotations:
    summary: "Workflow {{ $labels.workflow_id }} error rate >10%"
- alert: N8NWorkflowDurationP99
  expr: |
    histogram_quantile(0.99, sum by (workflow_id, le) (rate(n8n_workflow_execution_duration_seconds_bucket[5m]))) > 30
  for: 10m
  annotations:
    summary: "Workflow {{ $labels.workflow_id }} p99 duration >30s"
Business alerts
Fire on application-level KPIs. These are workflow-specific.
- alert: BillingSyncLagging
  expr: time() - n8n_sync_watermark_unix_seconds{sync="billing"} > 1800
  for: 5m
  annotations:
    summary: "Billing sync watermark >30 min behind"
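The watermark metric in that rule has to come from the workflow itself. A minimal sketch of the value side follows; the helper names are hypothetical. The detail that bites people is units: PromQL’s time() returns seconds, while JavaScript dates are in milliseconds, so a watermark pushed in milliseconds lands far in the future and the alert never fires.

```javascript
// Convert a Date to Unix seconds, the unit PromQL's time() uses.
function toUnixSeconds(date) {
  return Math.floor(date.getTime() / 1000);
}

// One exposition-format sample for the watermark gauge.
function watermarkSample(sync, lastSyncedAt) {
  return `n8n_sync_watermark_unix_seconds{sync="${sync}"} ${toUnixSeconds(lastSyncedAt)}\n`;
}

// Same comparison as the alert expression, for testing thresholds offline.
function isLagging(watermarkSeconds, nowSeconds, thresholdSeconds = 1800) {
  return nowSeconds - watermarkSeconds > thresholdSeconds;
}

const lastSyncedAt = new Date('2025-08-22T09:00:00Z');
console.log(watermarkSample('billing', lastSyncedAt));
console.log(isLagging(toUnixSeconds(lastSyncedAt), toUnixSeconds(lastSyncedAt) + 1801));
```

Push the watermark of the last record successfully synced, not the time the workflow ran, otherwise a workflow that runs on schedule but syncs nothing looks healthy.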
7. The OpenTelemetry Collector in the Middle
Don’t ship to your backend directly from n8n. Run the OTel Collector as a pipeline layer.
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch: { timeout: 10s, send_batch_size: 1024 }
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 5000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
exporters:
  otlphttp/tempo:
    endpoint: http://tempo.observability:4318
  prometheusremotewrite:
    endpoint: http://mimir.observability/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo]
The tail-sampling policy keeps all error traces, all slow traces, and 5 percent of the rest. As long as errors and slow traces are rare, tracing volume drops by close to 95 percent without losing the interesting traces.
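To estimate what those three policies do to your own traffic, the decision logic can be mirrored in a few lines. This is a simplified model, not the collector’s implementation: a trace is kept if any policy matches, and the exact boundary handling at the 5000 ms threshold is an assumption.

```javascript
// Simplified model of the three policies: keep the trace if any one matches.
// Returns the name of the matching policy, or null when the trace is dropped.
function samplingDecision(trace, random = Math.random) {
  if (trace.hasError) return 'errors';
  if (trace.durationMs > 5000) return 'slow';
  if (random() < 0.05) return 'sample';
  return null;
}

// Expected share of traces kept, given the share that is an error or slow.
function expectedKeptFraction(interestingFraction) {
  return interestingFraction + (1 - interestingFraction) * 0.05;
}

console.log(samplingDecision({ hasError: true, durationMs: 120 }));
console.log(expectedKeptFraction(0.02).toFixed(3));
```

With 2 percent of traces erroring or slow, about 6.9 percent of traces survive, so the reduction is closer to 93 percent than 95. During an incident the kept share climbs with the error rate, which is exactly when you want the extra data, but size the backend for it.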
For the canonical OTel Collector reference, the OpenTelemetry Collector docs cover every receiver, processor, and exporter.
Common Pitfalls
Four mistakes.
Adding execution_id as a Prometheus label. Cardinality explodes. Prometheus performance dies. Labels should be low cardinality (workflow name, component, status). Put execution_id in logs and traces, not metric labels.
Sampling traces at 100 percent. At a few thousand executions per minute, this generates tens of GB of trace data per day. Tail-sample to 5-10 percent plus all errors. Keep what matters.
No correlation between logs and metrics. The dashboard has a panel of errors, the user wants the log line, and there’s no link. Always wire execution_id into logs and add log links in Grafana dashboards.
Alerting on raw error counts instead of rates. “Errors > 10 in 5 minutes” looks reasonable until you have a workflow that runs 100K times a day. Use error rate as a fraction, not absolute count.
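The arithmetic behind that last pitfall, as a sketch. It assumes executions are spread evenly across the day, which real traffic rarely is, and the function name is my own.

```javascript
// Compare a count-based alert ("errors > 10 in the window") with a
// rate-based alert ("error rate > 10%") for a given daily volume.
function compareAlerts(executionsPerDay, errorsInWindow, windowMinutes = 5) {
  const executionsInWindow = (executionsPerDay * windowMinutes) / (24 * 60);
  const errorRate = errorsInWindow / executionsInWindow;
  return {
    executionsInWindow,
    errorRate,
    countAlertFires: errorsInWindow > 10,
    rateAlertFires: errorRate > 0.1,
  };
}

// High-volume workflow: 100K runs a day, 11 errors in five minutes.
console.log(compareAlerts(100000, 11));
// Low-volume workflow: 288 runs a day (one per five minutes), 1 error.
console.log(compareAlerts(288, 1));
```

At 100K runs a day there are roughly 347 executions per five-minute window, so 11 errors is a 3 percent error rate: the count alert pages someone, the rate alert stays quiet. The low-volume case fails the other way, where a single error is a 100 percent failure rate that the count alert never sees.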
Troubleshooting
Three issues.
Metrics endpoint returns 404. N8N_METRICS is false or unset. Check kubectl exec n8n-main-x -- env | grep METRICS. If the variable is set but the endpoint still 404s, the port-forward is to the wrong port (metrics share port 5678 with the main API).
OTel traces stop appearing after a deploy. The OTel SDK init script is loading but the exporter URL is wrong. Check the n8n process logs for OTLPTraceExporter errors. Usually a missing env var or a network policy blocking the collector.
Log messages from Code nodes don’t appear in Loki. Loki is parsing each line as JSON, and the Code node’s console.log output isn’t JSON. Wrap the payload in JSON.stringify({...}) so every emitted line is valid JSON.
For the official guide on the metrics catalog, the n8n metrics docs list every metric and label.
Wrapping Up
Observability for n8n in 2025 is three layers: Prometheus metrics from the native endpoint, structured JSON logs with execution_id correlation, and OpenTelemetry traces from auto-instrumentation plus per-node spans. Correlate all three by execution ID. Build dashboards for cluster health and per-workflow detail. Alert on rates, not counts. Tail-sample traces.
That closes this series. From queue mode and Redis to Kubernetes deployment, custom nodes, enterprise data syncs, Jira integrations, credentials management, error handling, and finally observability, you have the production playbook for n8n in August 2025. Treat it like any other distributed system, not a magic automation box, and it will treat you back the same way.