Building an Observability Stack in 2022

September 2, 2022 · 3 min read · by Muhammad Amal · programming

TL;DR: the Grafana stack. Prometheus for metrics, Loki for logs, Tempo for traces, Grafana as the UI, Alertmanager for routing. Self-host in Compose for small setups; move to managed Grafana Cloud once you outgrow that. The three pillars (metrics, logs, traces) are unified under one query interface.

After August’s IIoT monitoring, September goes broader: the full observability stack. The factory project from August is the running example, but the patterns apply to any backend.

This first post is the map: what goes where, why this combination, and what the month covers.

The three pillars

Industry-standard framing:

Metrics — numeric measurements over time. “Request count,” “memory usage,” “queue depth.” Aggregated. Low storage cost per point. → Prometheus.

Logs — discrete events with context. “User 42 failed login from 1.2.3.4.” High volume; cheap to ingest; expensive to retain. → Loki.

Traces — span-of-time recordings of distributed operations. “HTTP request crossed 6 services; here’s where the 800ms went.” Sampled. → Tempo.

Each answers different questions. Metrics: “is it broken?” Logs: “what happened?” Traces: “where exactly?”

The Grafana stack

Why all-Grafana:

One UI: dashboards across metrics, logs, traces from one place
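Same query language family: PromQL (Prometheus), LogQL (Loki), and TraceQL (Tempo) share structure. TraceQL requires Tempo 2.0 or later; the 1.5.0 image pinned in the Compose file predates it.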
Cross-correlation: click a metric anomaly, jump to logs from the same time window
Self-host or managed: same stack, your choice

The alternatives (Datadog, New Relic, Splunk) are competitive on UX, but they bill by ingest volume. Our factory project runs on about $30/month of self-hosted infrastructure, which makes the Grafana stack roughly 50× cheaper than the equivalent SaaS. The math changes at other scales.

A working Compose stack

For local dev or small production:

name: obs-stack

services:
  prometheus:
    image: prom/prometheus:v2.37.1
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:9.1.0
    depends_on: [prometheus, loki, tempo]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports: ["3000:3000"]

  alertmanager:
    image: prom/alertmanager:v0.24.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]

  loki:
    image: grafana/loki:2.6.1
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail:2.6.1
    volumes:
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

  tempo:
    image: grafana/tempo:1.5.0
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    command: -config.file=/etc/tempo.yaml
    ports:
      - "3200:3200"   # tempo HTTP
      - "4317:4317"   # OTLP gRPC

volumes:
  prom-data:
  grafana-data:
  loki-data:
  tempo-data:

Six containers. ~2 GB RAM total. Enough for a small backend: several thousand active series and several GB/day of logs.

What each piece does

Prometheus — pulls metrics from your services at intervals. PromQL queries them. Retention typically 15-30 days (longer storage via Mimir / Cortex / Thanos).
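The Compose file mounts a `./prometheus.yml` that this post does not show. A minimal sketch of what it might contain follows. The `factory-api:8080` target is a hypothetical service name, not something from the factory project; the other hostnames are the Compose service names.

```yaml
# prometheus.yml: a minimal sketch, not the project's actual config
global:
  scrape_interval: 15s       # how often targets are pulled
  evaluation_interval: 15s   # how often alert rules are evaluated

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # Compose service name and port

scrape_configs:
  # Prometheus scraping its own /metrics endpoint
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Hypothetical application exposing /metrics on port 8080
  - job_name: factory-api
    static_configs:
      - targets: ["factory-api:8080"]
```

Static targets are enough at this size. Service discovery only starts to pay off once targets come and go on their own.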

Alertmanager — receives alerts from Prometheus, groups/deduplicates, routes to Slack/PagerDuty/email.
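The grouping and routing described here could look roughly like the sketch below for the mounted `./alertmanager.yml`. The receiver names, the `service` and `severity` labels, the Slack channel, and both credentials are placeholders chosen for illustration.

```yaml
# alertmanager.yml: a routing sketch with placeholder receivers
route:
  receiver: slack-default          # fallback when no child route matches
  group_by: [alertname, service]   # alerts sharing these labels become one notification
  group_wait: 30s                  # wait before sending the first notification for a group
  group_interval: 5m               # wait before sending updates for that group
  repeat_interval: 4h              # re-notify if the alert is still firing
  routes:
    - matchers:
        - severity="critical"      # assumes alert rules set a severity label
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook URL
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME    # placeholder integration key
```

Everything lands in Slack by default, and only alerts labeled critical page someone.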

Loki — accepts logs (pushed by Promtail or via API), indexes by labels only (not content), stores compressed. Cheap to scale.

Promtail — log shipper. Tails container/file/journal logs, attaches labels, sends to Loki.
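As a sketch of the mounted `./promtail-config.yaml`, the fragment below tails files under `/var/log` and discovers containers through the Docker socket. Both paths are mounted by the Compose file. The job names, the `container` label, and the positions file location are illustrative choices.

```yaml
# promtail-config.yaml: a sketch assuming the mounts from the Compose file
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml   # where Promtail records how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push   # Loki push endpoint via Compose service name

scrape_configs:
  # Plain files under the mounted /var/log
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log

  # Containers discovered through the mounted Docker socket
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Container names arrive as "/name"; strip the slash and keep as a label
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```

Keep the label set small. Loki indexes labels only, so every extra label value creates more streams.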

Tempo — accepts OpenTelemetry traces, stores in object storage (S3, GCS, MinIO). Indexed by trace ID for retrieval.
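The Compose file gives Tempo a local volume at `/tmp/tempo` rather than an object store. A `./tempo-config.yaml` for that setup might look like the sketch below, modeled on Tempo's single-binary local example. The `blocks` and `wal` subdirectory names are assumptions.

```yaml
# tempo-config.yaml: a local-disk sketch for the Compose setup
server:
  http_listen_port: 3200   # matches the published HTTP port

distributor:
  receivers:
    otlp:
      protocols:
        grpc:              # OTLP gRPC receiver, default port 4317

storage:
  trace:
    backend: local         # swap for s3 or gcs when moving to object storage
    wal:
      path: /tmp/tempo/wal
    local:
      path: /tmp/tempo/blocks
```

Local disk is fine for development. The object-storage backends matter once trace volume or durability becomes a concern.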

Grafana — connects to all of them. Dashboards, alerts, explore mode for ad-hoc queries.
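Grafana can pick up its data sources from the mounted `./grafana/provisioning` directory instead of manual setup in the UI. A sketch, assuming a file at `./grafana/provisioning/datasources/datasources.yaml` (the file name is arbitrary) and the service names and ports from the Compose file:

```yaml
# grafana/provisioning/datasources/datasources.yaml: a provisioning sketch
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # Grafana's backend makes the requests
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```

With this in place, a fresh `grafana-data` volume comes up with all three sources already connected.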

What the month covers

12 more posts:

Sep 5: Prometheus 101
Sep 7: Instrumenting Go for Prometheus
Sep 9: Instrumenting Node for Prometheus
Sep 12: Grafana dashboards that help
Sep 14: Alertmanager basics
Sep 16: Loki for log aggregation
Sep 19: Promtail pipelines
Sep 21: Tempo for distributed tracing
Sep 23: SLOs in practice
Sep 26: Error budgets and burn rate
Sep 28: Cardinality and cost control
Sep 30: Month retro

Common Pitfalls (preview)

All three pillars on day one. Go one pillar at a time: start with metrics, add logs, then add traces.
Stack without dashboards. Empty observability is decoration. Build dashboards for actual operational questions.
Retention defaults. Disk fills. Decide retention up front per pillar.
Cardinality explosion. One bad label blows up Prometheus memory. The Sep 28 post covers the math.
Self-host without an SRE. Operational cost of running observability is real. Budget for it.
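For the retention pitfall, one way to make the decision explicit for Prometheus is to set the retention flags in Compose. This is a sketch, and the 15d and 10GB values are illustrative, not recommendations. Setting `command` replaces the image's default arguments, so the config and storage paths are restated.

```yaml
# Compose fragment: explicit Prometheus retention (values are illustrative)
services:
  prometheus:
    image: prom/prometheus:v2.37.1
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d    # drop samples older than this
      - --storage.tsdb.retention.size=10GB   # or once the blocks exceed this size
```

When both limits are set, Prometheus deletes data as soon as either one is reached. Loki and Tempo each have their own retention settings and need the same decision.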

Wrapping Up

A six-container Compose stack covers metrics, logs, and traces. Monday: Prometheus fundamentals, the metrics layer first.
