Tempo for Distributed Tracing
September 21, 2022 · 4 min read · by Muhammad Amal · programming

TL;DR — Tempo stores traces in object storage indexed only by trace ID. Cheap. Use OpenTelemetry SDK to instrument services; auto-instrumentation libraries handle HTTP/gRPC/DB. Sample to ~1% in prod. Click metric anomalies → jump to traces in Grafana.

After logs with Promtail, tracing is the third observability pillar. Distributed tracing answers “which step in the request was slow?” Tempo is Grafana’s tracing backend.

What traces are

A trace = the recorded path of one request through your services.

[user] → [API] → [auth-service]
              → [billing-service] → [postgres]
                                  → [stripe API]

Each step is a “span” with start time, duration, attributes. Spans share a trace_id; one span (the root) has no parent.
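
To make the span model concrete, here is a minimal sketch using only the standard library. The Span struct, its field names, and the IDs and durations are hypothetical and chosen to mirror the diagram; real OpenTelemetry spans carry more fields.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Span is a simplified, hypothetical stand-in for a real tracing span.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Duration time.Duration
}

// Root returns the one span that has no parent.
func Root(spans []Span) (Span, bool) {
	for _, s := range spans {
		if s.ParentID == "" {
			return s, true
		}
	}
	return Span{}, false
}

// Children returns the spans whose parent is parentID, in input order.
func Children(spans []Span, parentID string) []Span {
	var out []Span
	for _, s := range spans {
		if s.ParentID == parentID {
			out = append(out, s)
		}
	}
	return out
}

// PrintTree prints a span and its descendants, indented by depth.
func PrintTree(spans []Span, s Span, depth int) {
	fmt.Printf("%s%s (%v)\n", strings.Repeat("  ", depth), s.Name, s.Duration)
	for _, c := range Children(spans, s.SpanID) {
		PrintTree(spans, c, depth+1)
	}
}

// exampleTrace mirrors the diagram above; IDs and durations are made up.
func exampleTrace() []Span {
	const id = "trace-1"
	ms := time.Millisecond
	return []Span{
		{id, "a", "", "API", 480 * ms},
		{id, "b", "a", "auth-service", 20 * ms},
		{id, "c", "a", "billing-service", 430 * ms},
		{id, "d", "c", "postgres", 15 * ms},
		{id, "e", "c", "stripe API", 400 * ms},
	}
}

func main() {
	spans := exampleTrace()
	if root, ok := Root(spans); ok {
		PrintTree(spans, root, 0)
	}
}
```

Every span shares the same TraceID, and the parent links alone are enough to rebuild the tree that a trace viewer draws as a waterfall.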

Tracing shows you:

  • Where the latency is (visual waterfall)
  • Whether services are talking to each other correctly
  • Cross-service errors (which span threw, which propagated)

Tempo’s model

Unlike Jaeger or Zipkin, Tempo doesn’t index span attributes. It indexes only trace_id. To find traces:

  • From metrics/logs that include trace_id (most common)
  • By searching attributes via TraceQL (Tempo 2.0; not stable in Sep 2022)
  • Random sampling for browsing

The bet: most trace use is “I see this slow request in my metrics; show me its trace.” That’s a trace_id lookup, which is cheap.
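
A trace_id lookup can be sketched as a single HTTP GET. This assumes Tempo's query API serves traces at /api/traces/<trace_id> on the HTTP port (3200 in the setup in this post); the helper names TraceURL and FetchTrace are made up for illustration, so check the path against the docs for your Tempo version.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strings"
)

// Trace IDs are hex strings of up to 32 characters.
var traceIDPattern = regexp.MustCompile(`^[0-9a-fA-F]{1,32}$`)

// TraceURL builds the lookup URL for one trace.
// Assumption: Tempo exposes GET <base>/api/traces/<trace_id>.
func TraceURL(base, traceID string) (string, error) {
	if !traceIDPattern.MatchString(traceID) {
		return "", fmt.Errorf("invalid trace ID %q", traceID)
	}
	return strings.TrimRight(base, "/") + "/api/traces/" + traceID, nil
}

// FetchTrace downloads the raw trace payload for one trace ID.
func FetchTrace(client *http.Client, base, traceID string) ([]byte, error) {
	u, err := TraceURL(base, traceID)
	if err != nil {
		return nil, err
	}
	resp, err := client.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("tempo returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Only builds the URL here; FetchTrace needs a reachable Tempo.
	u, err := TraceURL("http://tempo:3200", "4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

In day-to-day use Grafana makes this call for you; the point is that the only key involved is the trace ID.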

For “find traces where customer_id=42 and duration > 1s”, Tempo can’t do that efficiently. Use Jaeger or a commercial tracing product instead.

Setup

tempo:
  image: grafana/tempo:1.5.0
  command: -config.file=/etc/tempo.yaml
  volumes:
    - ./tempo-config.yaml:/etc/tempo.yaml
    - tempo-data:/tmp/tempo
  ports:
    - "3200:3200"   # HTTP query
    - "4317:4317"   # OTLP gRPC receiver

tempo-config.yaml:

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h    # 7 days

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

For production, switch backend to s3 or gcs. Object storage is cheap; traces live there.

Instrumenting Go with OpenTelemetry

OpenTelemetry is the standard. The Go SDK:

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.12.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Export spans over OTLP/gRPC to Tempo's receiver (plaintext, no TLS).
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("tempo:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("api"),
            semconv.ServiceVersionKey.String("1.2.3"),
        )),
        // Sample 1% of new traces; follow the caller's decision when a parent span exists.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))),
    )
    otel.SetTracerProvider(tp)
    // Without a propagator, trace context is not carried across service calls.
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}

func main() {
    ctx := context.Background()
    tp, err := initTracer(ctx)
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    // Flush buffered spans on exit.
    defer tp.Shutdown(ctx)

    // ... use otelhttp or otel.Tracer
}

Auto-instrumentation libraries handle the common stuff:

  • otelhttp for net/http server + client
  • otelgrpc for gRPC
  • otelsql wrappers (for example XSAM/otelsql) or the packages in signalfx/splunk-otel-go for database/sql drivers
  • otelchi, otelmux for popular routers

Once wrapped, every HTTP request automatically gets a span.
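
Conceptually, an HTTP instrumentation wrapper is ordinary middleware: note the start time, call the wrapped handler, then record what happened. The sketch below is a standard-library toy that shows that shape. It is not otelhttp, and Recorder, SpanRecord, and Wrap are hypothetical names; the real library also creates proper spans and reads and writes trace context headers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// SpanRecord is a toy stand-in for the span a real library would emit.
type SpanRecord struct {
	Name     string
	Status   int
	Duration time.Duration
}

// Recorder collects records in memory; a real SDK would export them.
type Recorder struct {
	mu    sync.Mutex
	spans []SpanRecord
}

func (r *Recorder) add(s SpanRecord) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.spans = append(r.spans, s)
}

// Spans returns a copy of everything recorded so far.
func (r *Recorder) Spans() []SpanRecord {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]SpanRecord(nil), r.spans...)
}

// statusWriter remembers the status code the handler wrote.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Wrap records one SpanRecord per request handled by next.
func Wrap(next http.Handler, rec *Recorder) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		rec.add(SpanRecord{
			Name:     r.Method + " " + r.URL.Path,
			Status:   sw.status,
			Duration: time.Since(start),
		})
	})
}

// runDemo sends two in-memory requests through a wrapped mux.
func runDemo() []SpanRecord {
	rec := &Recorder{}
	mux := http.NewServeMux()
	mux.HandleFunc("/refunds", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})
	h := Wrap(mux, rec)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodPost, "/refunds", nil))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/missing", nil))
	return rec.Spans()
}

func main() {
	for _, s := range runDemo() {
		fmt.Printf("%s status=%d took=%v\n", s.Name, s.Status, s.Duration)
	}
}
```

Because the wrapper sits at the handler boundary, application code stays untouched, which is why wrapping the router once covers every route.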

Manual spans

For business operations:

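import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)

// Inside a function that has ctx, subID (string), and amount (int64) in scope:
tracer := otel.Tracer("billing")

ctx, span := tracer.Start(ctx, "process-refund",
    trace.WithAttributes(
        attribute.String("subscription_id", subID),
        attribute.Int64("amount_cents", amount),
    ))
defer span.End()

if err := processRefund(ctx, subID, amount); err != nil {
    // Attach the error to the span and mark the span as failed.
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
    return err
}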

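The span shows up as a node in the trace, with its attributes visible in the span details. Searching by attribute is a separate matter: as noted earlier, Tempo indexes only trace_id.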

Pass ctx to downstream calls — they pick up the trace context automatically.
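
On the wire, the trace context travels as a header. Assuming the W3C Trace Context propagator is in use, each outgoing request carries a traceparent header of the form version-traceid-spanid-flags. The parser below is a standard-library sketch of that format for version 00 only; ParseTraceParent and TraceParent are hypothetical names, and in real services the OpenTelemetry propagator does this for you.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Layout for version 00: 2 hex version, 32 hex trace ID, 16 hex span ID, 2 hex flags.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// TraceParent holds the fields a downstream service needs to join a trace.
type TraceParent struct {
	TraceID string
	SpanID  string // ID of the caller's span, which becomes the parent
	Sampled bool
}

// ParseTraceParent parses a version-00 traceparent header value.
func ParseTraceParent(h string) (TraceParent, error) {
	m := traceparentRe.FindStringSubmatch(strings.TrimSpace(h))
	if m == nil {
		return TraceParent{}, fmt.Errorf("malformed traceparent %q", h)
	}
	// All-zero IDs are reserved as invalid.
	if m[1] == strings.Repeat("0", 32) || m[2] == strings.Repeat("0", 16) {
		return TraceParent{}, fmt.Errorf("all-zero ID in traceparent %q", h)
	}
	flags, err := strconv.ParseUint(m[3], 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	// The lowest flag bit carries the sampling decision.
	return TraceParent{TraceID: m[1], SpanID: m[2], Sampled: flags&1 == 1}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("trace=%s parent=%s sampled=%v\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

This is why dropping ctx breaks a trace: without it the client has no trace ID or parent span ID to put in the header, so the next service starts from scratch.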

Node.js instrumentation

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://tempo:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

getNodeAutoInstrumentations() auto-wraps Express, Fastify, HTTP, gRPC, Postgres, MongoDB, Redis, and ~20 more. Most services get full coverage with one line.

Linking metrics, logs, traces

The win of one stack: cross-reference.

In Grafana, configure trace ID linking:

# Grafana data source (Loki) provisioning
- name: Loki
  type: loki
  jsonData:
    derivedFields:
      - name: TraceID
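        matcherRegex: "trace_id=(\\w+)"
        # $$ escapes $ so provisioning does not expand it as an environment variable
        url: "$${__value.raw}"
        # must match the uid of the provisioned Tempo data source
        datasourceUid: tempo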

Now any log line with trace_id=abc123 becomes a clickable link → opens the trace in Tempo.
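
To check that the regex matches what your services actually log, it helps to run it outside Grafana. The sketch below applies the same pattern with Go's regexp package; the log lines are made-up examples in logfmt style, and ExtractTraceID is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as matcherRegex in the provisioning snippet.
var traceIDField = regexp.MustCompile(`trace_id=(\w+)`)

// ExtractTraceID returns the first trace_id value found in a log line.
func ExtractTraceID(line string) (string, bool) {
	m := traceIDField.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	lines := []string{
		`level=error msg="refund failed" trace_id=abc123 subscription_id=sub_42`,
		`level=info msg="health check ok"`,
	}
	for _, l := range lines {
		if id, ok := ExtractTraceID(l); ok {
			fmt.Println("link to trace", id)
		} else {
			fmt.Println("no trace link")
		}
	}
}
```

If your logger writes JSON or uses a different key such as traceID, the pattern has to change to match, otherwise no links appear.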

Similarly, exemplars in Prometheus link metric spikes to specific traces.

Sampling

100% sampling generates too much trace data. Sample.

// 1% of new traces; child spans follow the decision made at the root
sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01)))
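
The reason ratio sampling keyed on the trace ID keeps traces whole is that the decision is a pure function of the ID, so every service that sees the same ID reaches the same answer. Here is a standard-library sketch of the idea; it is similar in spirit to what SDK samplers do, but the exact bytes and arithmetic used here are an assumption, not a copy of any SDK.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ShouldSample decides from the trace ID alone, so the result is the
// same in every service that handles the trace.
func ShouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Treat the low 8 bytes as a number in [0, 2^63).
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	// Keep the trace if that number falls in the lowest `ratio` share.
	bound := uint64(ratio * float64(1<<63))
	return x < bound
}

func main() {
	raw, err := hex.DecodeString("4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	var id [16]byte
	copy(id[:], raw)
	fmt.Println("sampled at 1%:", ShouldSample(id, 0.01))
	fmt.Println("sampled at 100%:", ShouldSample(id, 1))
}
```

This only works when trace IDs are effectively random, which is the case for IDs generated by the SDK.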

Tail-based sampling is smarter: keep 100% of error traces and 1% of the rest. It requires running the OTel Collector with the tail_sampling processor between your services and Tempo. Worth the setup at scale.
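
The decision rule itself is small once all spans of a trace are in hand. The sketch below shows that rule with the standard library only; FinishedSpan and KeepTrace are hypothetical names, and the real Collector processor also has to buffer spans and wait for a trace to complete, which this leaves out.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// FinishedSpan is a toy view of a completed span.
type FinishedSpan struct {
	TraceID string
	IsError bool
}

// ratioKeep keeps roughly `percent` out of 100 traces, decided by a
// hash of the trace ID so the result is stable for a given trace.
func ratioKeep(traceID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%100 < percent
}

// KeepTrace applies the policy: keep every trace that contains an
// error span, and `percent` percent of the rest.
func KeepTrace(spans []FinishedSpan, percent uint32) bool {
	if len(spans) == 0 {
		return false
	}
	for _, s := range spans {
		if s.IsError {
			return true
		}
	}
	return ratioKeep(spans[0].TraceID, percent)
}

func main() {
	failed := []FinishedSpan{{"t1", false}, {"t1", true}}
	healthy := []FinishedSpan{{"t2", false}, {"t2", false}}
	fmt.Println("failed trace kept:", KeepTrace(failed, 1))
	fmt.Println("healthy trace kept:", KeepTrace(healthy, 1))
}
```

The cost is in the buffering: the Collector must hold spans in memory until it decides, which is the setup work mentioned above.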

For our factory project: 5% sampling. Trace volume manageable; we catch enough slow paths.

Common Pitfalls

No service name set. Spans show up under a default name such as unknown_service. Set the service.name resource attribute.

100% sampling in prod. Trace volume balloons. Use 1-5%.

Forgetting to propagate ctx. New context starts a new trace; spans don’t connect. Pass ctx down everywhere.

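Sampling per span, not per trace. If each service or span makes its own random decision, traces arrive with spans missing. Use trace-ID-based sampling so every service reaches the same decision for a given trace.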

Span without End(). The span never finishes, so it is never exported and goes missing from the trace. Defer span.End() immediately after starting the span.

Tracing the wrong thing. Database queries, external HTTP calls — those make sense. Tracing every internal function = noise. Pick the boundaries.

Wrapping Up

Tempo + OpenTelemetry SDK = distributed tracing at low cost. Click metrics → jump to traces → see the slow span. Friday: SLOs in practice.