Tempo for Distributed Tracing

September 21, 2022 · 4 min read · by Muhammad Amal programming

TL;DR — Tempo stores traces in object storage indexed only by trace ID. Cheap. Use OpenTelemetry SDK to instrument services; auto-instrumentation libraries handle HTTP/gRPC/DB. Sample to ~1% in prod. Click metric anomalies → jump to traces in Grafana.

After the Promtail post, we reach the third observability pillar: tracing. Distributed tracing answers “which step in the request was slow?” Tempo is Grafana’s tracing backend.

What traces are

A trace = the recorded path of one request through your services.

[user] → [API] → [auth-service]
              → [billing-service] → [postgres]
                                  → [stripe API]

Each step is a “span” with start time, duration, attributes. Spans share a trace_id; one span (the root) has no parent.
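To make the span model concrete, here is a minimal sketch in plain Go. The Span type, its field names, and the timings are made up for illustration; real SDK spans carry more (attributes, status, events). It finds the root (the span with no parent) and prints the tree as an indented waterfall.

```go
package main

import (
    "fmt"
    "time"
)

// Span is a simplified, hypothetical stand-in for a real tracing span.
type Span struct {
    TraceID  string
    SpanID   string
    ParentID string // empty for the root span
    Name     string
    Start    time.Duration // offset from the start of the trace
    Duration time.Duration
}

// Root returns the span with no parent, or false if none exists.
func Root(spans []Span) (Span, bool) {
    for _, s := range spans {
        if s.ParentID == "" {
            return s, true
        }
    }
    return Span{}, false
}

// Children groups spans by the ID of their parent.
func Children(spans []Span) map[string][]Span {
    byParent := make(map[string][]Span)
    for _, s := range spans {
        if s.ParentID != "" {
            byParent[s.ParentID] = append(byParent[s.ParentID], s)
        }
    }
    return byParent
}

// exampleTrace mirrors the request path in the diagram, with invented timings.
func exampleTrace() []Span {
    ms := time.Millisecond
    return []Span{
        {"t1", "a", "", "API", 0, 120 * ms},
        {"t1", "b", "a", "auth-service", 5 * ms, 10 * ms},
        {"t1", "c", "a", "billing-service", 20 * ms, 95 * ms},
        {"t1", "d", "c", "postgres", 25 * ms, 15 * ms},
        {"t1", "e", "c", "stripe API", 45 * ms, 65 * ms},
    }
}

// printTree writes one line per span, indented by depth in the tree.
func printTree(s Span, kids map[string][]Span, depth int) {
    fmt.Printf("%*s%s start=%v dur=%v\n", depth*2, "", s.Name, s.Start, s.Duration)
    for _, c := range kids[s.SpanID] {
        printTree(c, kids, depth+1)
    }
}

func main() {
    spans := exampleTrace()
    if root, ok := Root(spans); ok {
        printTree(root, Children(spans), 0)
    }
}
```

Every span carries the same TraceID; only the ParentID links give the trace its shape.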

Tracing shows you:

Where the latency is (visual waterfall)
Whether services are talking to each other correctly
Cross-service errors (which span threw, which propagated)

Tempo’s model

Unlike Jaeger or Zipkin, Tempo doesn’t index span attributes. It indexes only trace_id. To find traces:

From metrics/logs that include trace_id (most common)
By searching attributes via TraceQL (Tempo 2.0; not stable in Sep 2022)
Random sampling for browsing

The bet: most trace use is “I see this slow request in my metrics; show me its trace.” That’s a trace_id lookup, which is cheap.
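A trace_id lookup is a single HTTP GET against Tempo's query port. The sketch below assumes Tempo is reachable at http://localhost:3200 and uses the /api/traces/<trace id> endpoint; the helper names traceURL and fetchTrace are hypothetical.

```go
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "os"
    "time"
)

// traceURL builds the lookup URL for one trace ID.
// base is an assumption, e.g. "http://localhost:3200".
func traceURL(base, traceID string) string {
    return base + "/api/traces/" + url.PathEscape(traceID)
}

// fetchTrace returns the raw response body for a trace ID.
func fetchTrace(client *http.Client, base, traceID string) ([]byte, error) {
    resp, err := client.Get(traceURL(base, traceID))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("tempo returned %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

func main() {
    if len(os.Args) < 2 {
        fmt.Println("usage: lookup <trace-id>")
        return
    }
    client := &http.Client{Timeout: 5 * time.Second}
    body, err := fetchTrace(client, "http://localhost:3200", os.Args[1])
    if err != nil {
        fmt.Println("lookup failed:", err)
        return
    }
    fmt.Printf("%d bytes of trace data\n", len(body))
}
```

In practice Grafana makes this call for you when you click a trace ID; the point is that the only key involved is the ID itself.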

For “find traces where customer_id=42 and duration > 1s”, Tempo can’t do that efficiently. Use Jaeger or a commercial backend for that kind of search.

Setup

tempo:
  image: grafana/tempo:1.5.0
  command: -config.file=/etc/tempo.yaml
  volumes:
    - ./tempo-config.yaml:/etc/tempo.yaml
    - tempo-data:/tmp/tempo
  ports:
    - "3200:3200"   # HTTP query
    - "4317:4317"   # OTLP gRPC receiver

tempo-config.yaml:

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h    # 7 days

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

For production, switch backend to s3 or gcs. Object storage is cheap; traces live there.

Instrumenting Go with OpenTelemetry

OpenTelemetry is the standard. The Go SDK:

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.12.0"
)

// initTracer wires an OTLP/gRPC exporter to Tempo and registers the
// provider globally so otel.Tracer(...) works anywhere in the service.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("tempo:4317"),
        otlptracegrpc.WithInsecure(), // plaintext; fine inside a private network
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp), // export spans in batches
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("api"),
            semconv.ServiceVersionKey.String("1.2.3"),
        )),
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.01)), // 1%
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func main() {
    ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    // Shutdown flushes spans still sitting in the batcher.
    defer func() { _ = tp.Shutdown(ctx) }()

    // ... use otelhttp or otel.Tracer
}

Auto-instrumentation libraries handle the common stuff:

otelhttp for net/http server + client
otelgrpc for gRPC
otelsql (a community wrapper such as XSAM/otelsql) for database/sql; otelmongo for MongoDB
otelchi, otelmux for popular routers

Once wrapped, every HTTP request automatically gets a span.
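What “wrapped” means is easiest to see in a hand-rolled version. This is a sketch of the middleware pattern using only the standard library, not otelhttp itself: the record and recorder types are hypothetical, and a real instrumentation library would also start a proper span and propagate trace context.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "sync"
    "time"
)

// record is a hypothetical stand-in for a finished span.
type record struct {
    Name     string
    Status   int
    Duration time.Duration
}

// recorder collects finished records; a real SDK would export them instead.
type recorder struct {
    mu      sync.Mutex
    records []record
}

func (r *recorder) add(rec record) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.records = append(r.records, rec)
}

// statusWriter remembers the status code the handler wrote.
type statusWriter struct {
    http.ResponseWriter
    status int
}

func (w *statusWriter) WriteHeader(code int) {
    w.status = code
    w.ResponseWriter.WriteHeader(code)
}

// wrap times every request passing through next and records the result.
func wrap(next http.Handler, rec *recorder) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(sw, r)
        rec.add(record{
            Name:     r.Method + " " + r.URL.Path,
            Status:   sw.status,
            Duration: time.Since(start),
        })
    })
}

// demo sends one request through a wrapped handler and returns the recorder.
func demo() *recorder {
    rec := &recorder{}
    h := wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusNoContent)
    }), rec)
    h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/refunds", nil))
    return rec
}

func main() {
    for _, r := range demo().records {
        fmt.Printf("%s status=%d took=%v\n", r.Name, r.Status, r.Duration)
    }
}
```

The handler itself never changes; the wrapper sits at the boundary, which is why one line of setup covers every route.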

Manual spans

For business operations:

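// Needs go.opentelemetry.io/otel plus its attribute, codes, and trace packages.
tracer := otel.Tracer("billing")

// Start a child of whatever span ctx already carries.
// subID is a string and amount an int64 here.
ctx, span := tracer.Start(ctx, "process-refund",
    trace.WithAttributes(
        attribute.String("subscription_id", subID),
        attribute.Int64("amount_cents", amount),
    ))
defer span.End()

if err := processRefund(ctx, subID, amount); err != nil {
    span.RecordError(err)                    // attach the error as a span event
    span.SetStatus(codes.Error, err.Error()) // mark the span as failed
    return err
}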

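The span shows up as a node in the trace, with its attributes visible in Grafana’s trace view. Searching by attribute needs TraceQL, since Tempo indexes only trace_id.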

Pass ctx to downstream calls — they pick up the trace context automatically.
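Why passing ctx matters can be shown without the SDK. The sketch below is an illustration of the mechanism only: the spanKey type and startSpan helper are hypothetical, and the real SDK stores a full span object in the context rather than a string ID.

```go
package main

import (
    "context"
    "fmt"
)

// spanKey is the private context key under which the current span ID is stored.
type spanKey struct{}

// link records one parent/child relationship; hypothetical, for illustration.
type link struct {
    Parent string
    Child  string
}

// startSpan reads the parent span ID from ctx (empty if there is none),
// records the link, and returns a ctx carrying the new span as current.
func startSpan(ctx context.Context, id string, links *[]link) context.Context {
    parent, _ := ctx.Value(spanKey{}).(string)
    *links = append(*links, link{Parent: parent, Child: id})
    return context.WithValue(ctx, spanKey{}, id)
}

// run starts a root span, then one child with the caller's ctx and one
// with a fresh context.Background().
func run() []link {
    var links []link
    ctx := startSpan(context.Background(), "root", &links)
    startSpan(ctx, "connected", &links)                 // parent is "root"
    startSpan(context.Background(), "orphaned", &links) // no parent found
    return links
}

func main() {
    for _, l := range run() {
        fmt.Printf("span=%s parent=%q\n", l.Child, l.Parent)
    }
}
```

The span started from context.Background() has no parent, so in a real system it would begin a separate trace.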

Node.js instrumentation

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://tempo:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

getNodeAutoInstrumentations() auto-wraps Express, Fastify, HTTP, gRPC, Postgres, MongoDB, Redis, and ~20 more. Most services get full coverage with one line.

Linking metrics, logs, traces

The win of one stack: cross-reference.

In Grafana, configure trace ID linking:

# Grafana data source (Loki) provisioning
- name: Loki
  type: loki
  jsonData:
    derivedFields:
      - name: TraceID
        matcherRegex: "trace_id=(\\w+)"
        url: "$${__value.raw}"   # $$ escapes the dollar sign in provisioning files
        datasourceUid: tempo

Now any log line with trace_id=abc123 becomes a clickable link → opens the trace in Tempo.
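Before relying on the link, it is worth checking that your log format actually matches the expression. A quick sketch that runs the same pattern as matcherRegex against a few invented log lines (the helper name extractTraceID is hypothetical):

```go
package main

import (
    "fmt"
    "regexp"
)

// traceIDPattern is the same expression as matcherRegex in the provisioning snippet.
var traceIDPattern = regexp.MustCompile(`trace_id=(\w+)`)

// extractTraceID returns the captured ID, or false when the line has none.
func extractTraceID(line string) (string, bool) {
    m := traceIDPattern.FindStringSubmatch(line)
    if m == nil {
        return "", false
    }
    return m[1], true
}

func main() {
    lines := []string{
        `level=info msg="refund processed" trace_id=abc123 amount=500`,
        `level=info msg="healthcheck ok"`,
        `level=warn msg="slow" traceID=abc123`, // different key: not matched
    }
    for _, l := range lines {
        id, ok := extractTraceID(l)
        fmt.Printf("matched=%v id=%q\n", ok, id)
    }
}
```

If your logger writes traceID= or a JSON field instead, adjust the regex to match; otherwise no link appears.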

Similarly, exemplars in Prometheus link metric spikes to specific traces.

Sampling

100% sampling generates too much trace data. Sample.

sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.01))   // 1%
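The idea behind ratio sampling by trace ID: derive the keep/drop decision from the ID itself, so every service holding the same ID and ratio reaches the same verdict. The sketch below illustrates that idea with the standard library; shouldSample is a hypothetical helper, and the SDK's own implementation may differ in detail.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// shouldSample makes a deterministic keep/drop decision from the trace ID alone.
func shouldSample(traceID [16]byte, ratio float64) bool {
    if ratio >= 1 {
        return true
    }
    if ratio <= 0 {
        return false
    }
    // Treat 8 bytes of the ID as a number in [0, 2^63) and keep the trace
    // when that number falls below ratio * 2^63.
    x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
    bound := uint64(ratio * float64(1<<63))
    return x < bound
}

func main() {
    low := [16]byte{}                  // sampled bytes are all zero
    high := [16]byte{8: 0xff, 9: 0xff} // large value in the sampled bytes

    fmt.Println(shouldSample(low, 0.01))  // true: 0 is below the bound
    fmt.Println(shouldSample(high, 0.01)) // false: far above the bound
    fmt.Println(shouldSample(high, 0.01) == shouldSample(high, 0.01))
}
```

Because no randomness is involved at decision time, a trace is either kept whole or dropped whole, provided every service uses the same ratio.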

Tail-based sampling is smarter: keep 100% of error traces and 1% of the rest. It requires the OTel Collector with its tail_sampling processor sitting between your services and Tempo. Worth the setup at scale.
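The tail-based decision differs from head sampling in one way: it runs after the whole trace has arrived, so it can look at what happened. A minimal sketch of that decision rule, with hypothetical types and names; the Collector's processor is configured in YAML and handles buffering, timeouts, and more policy types than shown here.

```go
package main

import "fmt"

// finishedSpan is a hypothetical, minimal view of a completed span.
type finishedSpan struct {
    Name  string
    Error bool
}

// keepTrace decides on a whole trace after all of its spans have arrived:
// keep every trace containing an error, and otherwise defer to baseline.
func keepTrace(spans []finishedSpan, baseline func() bool) bool {
    for _, s := range spans {
        if s.Error {
            return true
        }
    }
    return baseline()
}

func main() {
    // never stands in for a 1% baseline that happened to say no.
    never := func() bool { return false }

    healthy := []finishedSpan{{"GET /refunds", false}, {"postgres", false}}
    failed := []finishedSpan{{"GET /refunds", false}, {"stripe API", true}}

    fmt.Println(keepTrace(healthy, never)) // false
    fmt.Println(keepTrace(failed, never))  // true
}
```

The cost is that something has to hold every span in memory until the trace is complete, which is why this lives in the Collector rather than in each service.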

For our factory project: 5% sampling. Trace volume manageable; we catch enough slow paths.

Common Pitfalls

No service name set. Spans appear as “unknown”. Set service.name resource attribute.

100% sampling in prod. Trace volume balloons. Use 1-5%.

Forgetting to propagate ctx. New context starts a new trace; spans don’t connect. Pass ctx down everywhere.

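Sampling per span or per service, not per trace. Each hop decides independently, so only some spans of a trace survive and the waterfall has holes. Use trace-ID-based sampling so every service reaches the same verdict.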

Span without End(). The span never finishes, so it is never exported and its memory stays held. Defer span.End() immediately after start.

Tracing the wrong thing. Database queries, external HTTP calls — those make sense. Tracing every internal function = noise. Pick the boundaries.

Wrapping Up

Tempo + OpenTelemetry SDK = distributed tracing at low cost. Click metrics → jump to traces → see the slow span. Friday: SLOs in practice.
