Promtail Pipelines and Log Parsing
September 19, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Promtail pipelines transform logs before sending to Loki. Parse JSON, regex-extract fields, set labels, drop noisy messages. Per-job pipelines keep label cardinality controlled. The right pipeline turns chaotic logs into queryable data.

After Loki basics, this post covers the agent-side processing. Promtail is more than a log shipper: its pipeline stages parse, label, and filter logs before they leave the host.

What a pipeline does

A pipeline is a sequence of stages applied to each log line:

[Log line] → [JSON parse] → [Extract fields] → [Set labels] → [Drop if matches] → [Send to Loki]

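Each stage reads or modifies the entry: its line, its labels, its timestamp, or a shared map of extracted fields that later stages can use. The final output goes to Loki with the assigned labels.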
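To make the stage model concrete, here is a minimal Python sketch of a pipeline as a list of functions applied in order. It is not Promtail's implementation; the Entry class, the stage functions, and the "first word is the level" rule are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Entry:
    line: str
    labels: dict = field(default_factory=dict)
    extracted: dict = field(default_factory=dict)  # scratch space shared by stages


# A stage returns the (possibly modified) entry, or None to drop it.
Stage = Callable[[Entry], Optional[Entry]]


def run_pipeline(stages: List[Stage], entry: Entry) -> Optional[Entry]:
    for stage in stages:
        entry = stage(entry)
        if entry is None:  # a stage dropped the entry; later stages never run
            return None
    return entry


def extract_level(entry: Entry) -> Entry:
    # Toy extraction (assumption): the first word of the line is the level.
    entry.extracted["level"] = entry.line.split(" ", 1)[0].lower()
    return entry


def promote_level(entry: Entry) -> Entry:
    entry.labels["level"] = entry.extracted["level"]
    return entry


def drop_debug(entry: Entry) -> Optional[Entry]:
    return None if entry.labels.get("level") == "debug" else entry


stages = [extract_level, promote_level, drop_debug]
kept = run_pipeline(stages, Entry("ERROR db timeout"))
dropped = run_pipeline(stages, Entry("DEBUG cache hit"))
print(kept.labels, dropped)  # {'level': 'error'} None
```

The point of the sketch is the ordering: promote_level only works because extract_level ran first, which is the same dependency real stages have.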

Parsing JSON logs

A common case: services log JSON. Promtail extracts fields, makes them queryable.

scrape_configs:
  - job_name: api
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/api(-\d+)?'
        action: keep
    pipeline_stages:
      - json:
          expressions:
            level:
            user_id:
            message:
            timestamp: ts
      - labels:
          level:
      - timestamp:
          source: timestamp
          format: RFC3339

Three stages:

  1. json — parse the line, extract level, user_id, message, timestamp (from ts field)
  2. labels — promote level to a label (queryable as {level="error"})
  3. timestamp — use the log’s own timestamp instead of ingestion time

Note: user_id is extracted but NOT promoted to a label. It’s available in | json queries but doesn’t bloat cardinality.
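The three stages can be imitated in a few lines of Python to see what ends up where. This is a sketch under assumptions, not Promtail code: the process function and the sample log line are hypothetical, and RFC3339 parsing is approximated with datetime.fromisoformat.

```python
import json
from datetime import datetime


def process(line):
    """Imitate the json -> labels -> timestamp stages for one log line."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError:
        doc = {}  # unparseable line: nothing is extracted
    if not isinstance(doc, dict):
        doc = {}

    # json stage: copy selected keys into the extracted map ("timestamp" comes from "ts")
    extracted = {
        "level": doc.get("level"),
        "user_id": doc.get("user_id"),
        "message": doc.get("message"),
        "timestamp": doc.get("ts"),
    }

    # labels stage: only level is promoted; user_id stays out of the label set
    labels = {"level": extracted["level"]} if extracted["level"] else {}

    # timestamp stage: use the log's own time when present
    ts = None
    if extracted["timestamp"]:
        ts = datetime.fromisoformat(extracted["timestamp"].replace("Z", "+00:00"))

    return labels, extracted, ts


sample = '{"level":"error","user_id":"u-42","message":"boom","ts":"2022-09-19T10:15:00Z"}'
labels, extracted, ts = process(sample)
print(labels)                # {'level': 'error'}
print(extracted["user_id"])  # u-42
```

Note how user_id is present in the extracted map but never reaches the labels, which is exactly what keeps cardinality down.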

Regex extraction

For non-JSON logs (Nginx access logs, Postgres, etc.):

pipeline_stages:
  - regex:
      expression: '^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<bytes>\d+)'
  - labels:
      method:
      status:
  - timestamp:
      source: ts
      format: 02/Jan/2006:15:04:05 -0700

Named capture groups, written (?P<name>...), become extracted fields. Promote selected ones to labels.
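Before deploying an expression like this, it helps to try it against a sample line. Python's re module accepts the same (?P<name>...) syntax, so a quick check might look like the sketch below. The access-log line is invented, and the strptime pattern is only the Python counterpart of the Go layout used in the config.

```python
import re
from datetime import datetime

# Same expression as in the regex stage above
ACCESS_LOG = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<bytes>\d+)'
)

# Hypothetical Nginx access-log line
line = '203.0.113.7 - - [19/Sep/2022:10:15:00 +0000] "GET /orders/17 HTTP/1.1" 200 512'

match = ACCESS_LOG.match(line)
fields = match.groupdict()  # every named group becomes a field

# Promote only the bounded fields, as the labels stage does
labels = {name: fields[name] for name in ("method", "status")}

# Python equivalent of the Go layout 02/Jan/2006:15:04:05 -0700
ts = datetime.strptime(fields["ts"], "%d/%b/%Y:%H:%M:%S %z")

print(labels)          # {'method': 'GET', 'status': '200'}
print(fields["path"])  # /orders/17
```

Promtail uses Go's RE2 engine, which lacks some features Python supports (lookarounds, backreferences), so an expression that passes here should still be kept simple.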

Dropping noisy logs

Health-check logs are common noise:

pipeline_stages:
  - match:
      # Stream selector plus a line filter: only matching lines are dropped
      selector: '{container="api"} |~ "/healthz|/metrics"'
      action: drop

Or simpler:

pipeline_stages:
  - drop:
      expression: 'GET /healthz'

This cuts log volume substantially. Health checks at 1 Hz produce 86,400 lines per day per service (1 line/s × 86,400 s). Drop them.
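The effect of the simpler drop rule, and the volume arithmetic, can be checked with a short Python sketch. The sample lines are invented, and the keep function only imitates a drop stage whose expression is matched against the whole line.

```python
import re

DROP_EXPRESSION = re.compile(r"GET /healthz")


def keep(line):
    """Return False for lines the drop rule would discard."""
    return DROP_EXPRESSION.search(line) is None


lines = [
    "GET /healthz 200",
    "POST /orders 201",
    "GET /healthz 200",
    "GET /orders/17 500",
]
kept = [line for line in lines if keep(line)]
print(kept)  # ['POST /orders 201', 'GET /orders/17 500']

# One health check per second, all day
checks_per_second = 1
lines_per_day = checks_per_second * 60 * 60 * 24
print(lines_per_day)  # 86400
```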

Multi-line logs (stack traces)

Java/Python stack traces span multiple log lines. Group them:

pipeline_stages:
  - multiline:
      firstline: '^\d{4}-\d{2}-\d{2}'    # new log line starts with date
      max_wait_time: 3s

Now a WARN line followed by its "at com.foo.bar" frames is one log entry instead of N. This is critical for Java and Python, where the stack trace is the diagnostic.
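The grouping rule is simple enough to sketch in Python: a line matching firstline starts a new entry, and everything else is appended to the current one. This is only an approximation of the multiline stage; it ignores max_wait_time, and the sample lines are made up.

```python
import re

FIRSTLINE = re.compile(r"^\d{4}-\d{2}-\d{2}")  # same pattern as the config above


def group_multiline(lines):
    """Merge continuation lines into the entry that precedes them."""
    entries, current = [], []
    for line in lines:
        if FIRSTLINE.match(line) and current:
            # A new first line closes the entry collected so far
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


raw = [
    "2022-09-19 10:15:00 WARN request failed",
    "java.lang.IllegalStateException: boom",
    "    at com.foo.bar.Handler.run(Handler.java:42)",
    "2022-09-19 10:15:01 INFO recovered",
]
entries = group_multiline(raw)
print(len(entries))  # 2
```

Running the sample through a pattern that matches nothing, or everything, is a quick way to see the two failure modes before they show up in Loki.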

Tenant routing

For multi-tenant Loki:

pipeline_stages:
  - tenant:
      label: tenant

The value of each entry's tenant label becomes its tenant ID, so the log is written to that tenant's storage in Loki. Useful for SaaS where you want per-customer log isolation.
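Conceptually, tenant routing means entries are split into per-tenant batches before they are pushed. The Python sketch below shows that idea only; the batch_by_tenant function, the entry shape, and the "fallback" default are hypothetical, standing in for whatever default tenant the client is configured with.

```python
from collections import defaultdict


def batch_by_tenant(entries, default_tenant="fallback"):
    """Group log lines by the value of their tenant label."""
    batches = defaultdict(list)
    for entry in entries:
        # Entries without a tenant label go to the default tenant (assumption)
        tenant = entry["labels"].get("tenant", default_tenant)
        batches[tenant].append(entry["line"])
    return dict(batches)


entries = [
    {"labels": {"tenant": "acme"}, "line": "order created"},
    {"labels": {"tenant": "globex"}, "line": "login failed"},
    {"labels": {"tenant": "acme"}, "line": "order shipped"},
    {"labels": {}, "line": "agent started"},
]
batches = batch_by_tenant(entries)
print(sorted(batches))  # ['acme', 'fallback', 'globex']
```

Each batch would then be sent with its own tenant ID, so one customer's logs never land in another customer's storage.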

Static labels per job

Add labels every log gets, regardless of content:

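scrape_configs:
  - job_name: api
    static_configs:
      - targets: ['localhost']
        labels:
          env: production
          team: backend
          __path__: /var/log/api/*.log   # example path; static_configs need __path__ to know which files to tail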

These labels apply to all logs from this scrape config. Common: env, cluster, region.

Sampling

For absurdly chatty logs, sample:

pipeline_stages:
  - sampling:
      rate: 0.1    # keep 10%

Sampling loses data, so it is valid only for debug-level or very high-frequency info logs. Don’t sample error or warning logs.
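A rough Python sketch of what rate-based sampling does to a chatty stream, with errors and warnings exempted. The exemption is an assumption of this sketch (in a real pipeline you would have to scope the sampling yourself, for example inside a match block), and the sample function and entry shape are hypothetical.

```python
import random

ALWAYS_KEEP = {"error", "warn"}


def sample(entries, rate, rng):
    """Keep every error/warn entry; keep other entries with probability `rate`."""
    return [
        entry for entry in entries
        if entry["level"] in ALWAYS_KEEP or rng.random() < rate
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
entries = [{"level": "debug", "n": i} for i in range(10_000)]
entries += [{"level": "error", "n": i} for i in range(5)]

kept = sample(entries, 0.1, rng)
kept_debug = sum(1 for entry in kept if entry["level"] == "debug")
kept_error = sum(1 for entry in kept if entry["level"] == "error")
print(kept_debug, kept_error)  # roughly 1000 debug lines, and all 5 errors
```

The kept debug count hovers around 10% rather than hitting it exactly, which is worth remembering when you later count sampled lines in a dashboard.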

A complete pipeline example

For a Go service emitting structured JSON logs, in the style of the tracing post:

scrape_configs:
  - job_name: services
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(api|bff|worker|scheduler)(-.*)?'
        target_label: 'service'
        replacement: '$1'
      - action: keep
        regex: '/(api|bff|worker|scheduler).*'
        source_labels: ['__meta_docker_container_name']
    pipeline_stages:
      # Skip noise
      - match:
          selector: '{service="api"}'
          stages:
            - drop:
                expression: 'GET /healthz|GET /metrics'

      # Parse JSON structure
      - json:
          expressions:
            level:
            service:
            method:
            path:
            request_id:
            error:
            timestamp:

      # Promote to labels (bounded cardinality only)
      - labels:
          level:
          service:

      # Use log's timestamp
      - timestamp:
          source: timestamp
          format: RFC3339

The result in Loki: structured logs with service and level labels (plus env, if you attach it as a static label). request_id, path, error etc. are available via | json in queries.

Label hygiene revisited

Cardinality math: streams ≈ product of label values.

{service=4 values} × {level=4 values} × {env=2 values} = 32 streams

That’s healthy. Add path (50 values) → 1600 streams. Add user_id (10K) → 16M streams. Loki dies.
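The arithmetic above is a plain product, which makes it easy to check a label set before rolling it out. A small Python sketch follows; the stream_count helper is hypothetical, and the result is a worst-case upper bound, since not every combination of values necessarily occurs.

```python
from math import prod


def stream_count(label_values):
    """Upper bound on streams: the product of distinct values per label."""
    return prod(label_values.values())


base = {"service": 4, "level": 4, "env": 2}
with_path = {**base, "path": 50}
with_user = {**with_path, "user_id": 10_000}

print(stream_count(base))       # 32
print(stream_count(with_path))  # 1600
print(stream_count(with_user))  # 16000000
```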

Rule: labels should be bounded by what you query in dashboards/alerts. Anything else stays in the log line.

Common Pitfalls

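JSON parse failures going unnoticed. The json stage does not drop a line it cannot parse; the line is shipped without the extracted fields, so its labels and timestamp are missing or wrong. Older versions fail silently. Promtail 2.6 logs the parse errors. Watch them.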

Regex without named groups. Pipelines can’t extract unnamed captures. Always (?P<name>...).

Multiline misconfigured. A firstline regex that matches every line leaves each line as its own entry, so stack traces stay fragmented; one that never matches glues unrelated lines into a single entry.

Setting too many labels. Stream explosion. Promote carefully.

Pipeline order matters. Parse before label promotion. Drop early, not late.

Dropping too aggressively. Filtering health checks is fine; filtering all 200-status logs loses important data. Be specific.

Wrapping Up

Promtail pipelines transform raw logs into Loki-friendly structured data. JSON parse + label promotion + drop noise = clean dashboard inputs. Wednesday: Tempo for distributed tracing — the third pillar.