Building an Observability Stack in 2022
September 2, 2022 · 3 min read · by Muhammad Amal · programming

TL;DR — The Grafana-stack: Prometheus for metrics, Loki for logs, Tempo for traces, Grafana as the UI, Alertmanager for routing. Self-host in Compose for small setups; managed Grafana Cloud past a scale point. Three pillars (metrics, logs, traces) unified under one query interface.

After August’s IIoT monitoring, September goes broader: the full observability stack. The factory project from August is the running example, but the patterns apply to any backend.

This first post is the map. What goes where, why this combination, what the month covers.

The three pillars

Industry-standard framing:

Metrics — numeric measurements over time. “Request count,” “memory usage,” “queue depth.” Aggregated. Low storage cost per point. → Prometheus.

Logs — discrete events with context. “User 42 failed login from 1.2.3.4.” High volume; cheap to ingest; expensive to retain. → Loki.

Traces — span-of-time recordings of distributed operations. “HTTP request crossed 6 services; here’s where the 800ms went.” Sampled. → Tempo.

Each answers different questions. Metrics: “is it broken?” Logs: “what happened?” Traces: “where exactly?”

The Grafana stack

Why all-Grafana:

  • One UI: dashboards across metrics, logs, traces from one place
  • Same query language family: PromQL (Prometheus), LogQL (Loki), and TraceQL (Tempo) share structure (TraceQL ships with Tempo 2.0, not the 1.5.0 image used below)
  • Cross-correlation: click a metric anomaly, jump to logs from the same time window
  • Self-host or managed: same stack, your choice

The alternatives (Datadog, New Relic, Splunk) are competitive on UX, but they are priced by ingest volume. For our factory project, at $30/month of self-hosted infrastructure, the Grafana stack comes out ~50× cheaper than equivalent SaaS. The math changes at different scales.

A working Compose stack

For local dev or small production:

name: obs-stack

services:
  prometheus:
    image: prom/prometheus:v2.37.1
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:9.1.0
    depends_on: [prometheus, loki, tempo]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports: ["3000:3000"]

  alertmanager:
    image: prom/alertmanager:v0.24.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]

  loki:
    image: grafana/loki:2.6.1
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail:2.6.1
    volumes:
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

  tempo:
    image: grafana/tempo:1.5.0
    volumes:
      - ./tempo-config.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    command: -config.file=/etc/tempo.yaml
    ports:
      - "3200:3200"   # Tempo HTTP API
      - "4317:4317"   # OTLP gRPC

volumes:
  prom-data:
  grafana-data:
  loki-data:
  tempo-data:

Six containers. ~2 GB RAM total. Enough for a small backend; scales to several thousand series and several GB/day of logs.

What each piece does

Prometheus — pulls metrics from your services at intervals. PromQL queries them. Retention typically 15-30 days (longer storage via Mimir / Cortex / Thanos).
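
A minimal sketch of the prometheus.yml that the Compose file mounts. The `app` job and its port are hypothetical stand-ins for your own service, and the 15s interval is an illustrative choice, not a recommendation from this series.

```yaml
# prometheus.yml (sketch; the "app" target is a hypothetical service)
global:
  scrape_interval: 15s        # how often Prometheus pulls each target
  evaluation_interval: 15s    # how often alerting/recording rules run

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # Compose service name and port

scrape_configs:
  - job_name: prometheus      # Prometheus scraping its own /metrics
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: app             # assumption: your service exposes /metrics on 8000
    static_configs:
      - targets: ["app:8000"]
```

Targets use Compose service names because all containers share the default Compose network.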

Alertmanager — receives alerts from Prometheus, groups/deduplicates, routes to Slack/PagerDuty/email.
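
A hedged sketch of an alertmanager.yml showing the grouping and routing described here. Receiver names, the `severity` label, the channel, and both credentials are placeholders; the timing values are illustrative.

```yaml
# alertmanager.yml (sketch; receiver names and secrets are placeholders)
route:
  receiver: slack-default          # fallback receiver for everything
  group_by: [alertname, job]       # alerts sharing these labels become one notification
  group_wait: 30s                  # wait to collect the first batch of a new group
  group_interval: 5m               # wait before sending updates for an existing group
  repeat_interval: 4h              # re-notify while an alert keeps firing
  routes:
    - matchers:
        - severity="critical"      # assumes your rules set a "severity" label
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/YOURS   # placeholder
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME    # placeholder integration key
```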

Loki — accepts logs (pushed by Promtail or via API), indexes by labels only (not content), stores compressed. Cheap to scale.
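
A single-node loki-config.yaml sketch for the 2.6.x image, storing everything on the `/loki` volume. It is modelled on the sample local config shipped with Loki; treat the schema date and version as assumptions and check them against the docs for your release.

```yaml
# loki-config.yaml (sketch for a single-node, filesystem-backed Loki 2.6.x)
auth_enabled: false              # no multi-tenancy for a small setup

server:
  http_listen_port: 3100

common:
  path_prefix: /loki             # matches the loki-data volume mount
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory            # fine for one instance; not for clusters

schema_config:
  configs:
    - from: 2020-10-24           # assumed start date for this schema
      store: boltdb-shipper      # label index
      object_store: filesystem   # compressed chunks
      schema: v11
      index:
        prefix: index_
        period: 24h
```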

Promtail — log shipper. Tails container/file/journal logs, attaches labels, sends to Loki.
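
A sketch of the promtail-config.yaml mounted in the Compose file, using Docker service discovery through the mounted socket. The `container` label name and the positions path are illustrative choices.

```yaml
# promtail-config.yaml (sketch; label names are illustrative)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml    # read offsets; lost if the container is recreated

clients:
  - url: http://loki:3100/loki/api/v1/push   # Loki push API on the Compose network

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock    # discover running containers
        refresh_interval: 5s
    relabel_configs:
      # Docker reports names as "/name"; strip the slash and expose it as a label
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
```

Keep the label set small: every distinct label combination becomes its own stream in Loki.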

Tempo — accepts OpenTelemetry traces, stores in object storage (S3, GCS, MinIO). Indexed by trace ID for retrieval.
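
A sketch of a tempo-config.yaml for the 1.5.x image. It assumes local disk rather than object storage, to match the `tempo-data` volume in the Compose file; the 48h retention is an illustrative value.

```yaml
# tempo-config.yaml (sketch; local-disk backend for a small setup)
server:
  http_listen_port: 3200         # matches the published HTTP port

distributor:
  receivers:
    otlp:
      protocols:
        grpc:                    # OTLP gRPC, listens on 4317 by default

compactor:
  compaction:
    block_retention: 48h         # assumed retention; pick your own

storage:
  trace:
    backend: local               # swap for s3/gcs when you outgrow one disk
    local:
      path: /tmp/tempo/blocks
    wal:
      path: /tmp/tempo/wal
```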

Grafana — connects to all of them. Dashboards, alerts, explore mode for ad-hoc queries.
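
A sketch of a datasource provisioning file so Grafana starts with all three backends wired up. The file name is hypothetical; it would live under `./grafana/provisioning/datasources/`, which the Compose file mounts into the container.

```yaml
# grafana/provisioning/datasources/datasources.yaml (hypothetical file name)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```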

What the month covers

12 more posts follow.

Common Pitfalls (preview)

  • All three pillars on day one. Go one pillar at a time: start with metrics, then add logs, then traces.
  • Stack without dashboards. Empty observability is decoration. Build dashboards for actual operational questions.
  • Retention defaults. Disk fills. Decide retention up front per pillar.
  • Cardinality explosion. One bad label blows up Prometheus memory. We cover the math on Sep 28.
  • Self-host without an SRE. Operational cost of running observability is real. Budget for it.
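
For the retention pitfall, a hedged sketch of pinning Prometheus retention in Compose. The 15d value is an assumption (it matches the Prometheus default). Setting `command` replaces the image's default arguments, so the config and storage paths are restated here.

```yaml
# Compose override fragment (sketch): explicit Prometheus retention
services:
  prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml   # restated default
      - --storage.tsdb.path=/prometheus                # restated default
      - --storage.tsdb.retention.time=15d              # assumed value; decide per pillar
```

Loki and Tempo take their retention settings from their own config files rather than from flags like this.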

Wrapping Up

A six-container Compose stack covers metrics, logs, and traces. Monday: Prometheus fundamentals, the metrics layer first.