Tempo for Distributed Tracing
TL;DR — Tempo stores traces in object storage indexed only by trace ID. Cheap. Use OpenTelemetry SDK to instrument services; auto-instrumentation libraries handle HTTP/gRPC/DB. Sample to ~1% in prod. Click metric anomalies → jump to traces in Grafana.
After Promtail, the third observability pillar. Distributed tracing surfaces “which step in the request was slow.” Tempo is Grafana’s tracing backend.
What traces are
A trace = the recorded path of one request through your services.
[user] → [API] → [auth-service]
               → [billing-service] → [postgres]
                                   → [stripe API]
Each step is a “span” with start time, duration, attributes. Spans share a trace_id; one span (the root) has no parent.
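To make the span model concrete, here is a minimal sketch in Go. The `Span` struct, the `SlowestLeaf` helper, and all timings are hypothetical illustrations of the description above, not Tempo's or OpenTelemetry's actual data model.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified, hypothetical stand-in for a real span record.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string        // empty for the root span
	Name     string
	Start    time.Duration // offset from the start of the trace
	Duration time.Duration
}

// Root returns the span that has no parent.
func Root(spans []Span) (Span, bool) {
	for _, s := range spans {
		if s.ParentID == "" {
			return s, true
		}
	}
	return Span{}, false
}

// SlowestLeaf returns the longest span that has no children,
// which is usually where the latency actually originates.
func SlowestLeaf(spans []Span) (Span, bool) {
	hasChild := make(map[string]bool)
	for _, s := range spans {
		if s.ParentID != "" {
			hasChild[s.ParentID] = true
		}
	}
	var best Span
	found := false
	for _, s := range spans {
		if hasChild[s.SpanID] {
			continue
		}
		if !found || s.Duration > best.Duration {
			best, found = s, true
		}
	}
	return best, found
}

// exampleTrace mirrors the diagram above with made-up timings.
func exampleTrace() []Span {
	ms := time.Millisecond
	return []Span{
		{"t1", "a", "", "API", 0, 480 * ms},
		{"t1", "b", "a", "auth-service", 5 * ms, 20 * ms},
		{"t1", "c", "a", "billing-service", 30 * ms, 440 * ms},
		{"t1", "d", "c", "postgres", 35 * ms, 15 * ms},
		{"t1", "e", "c", "stripe API", 55 * ms, 400 * ms},
	}
}

func main() {
	spans := exampleTrace()
	root, _ := Root(spans)
	leaf, _ := SlowestLeaf(spans)
	fmt.Printf("root=%s slowest leaf=%s (%v)\n", root.Name, leaf.Name, leaf.Duration)
}
```

A waterfall view in Grafana answers the same question visually; the code only shows that parent links plus durations are enough to reconstruct it.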
Tracing shows you:
- Where the latency is (visual waterfall)
- Whether services are talking to each other correctly
- Cross-service errors (which span threw, which propagated)
Tempo’s model
Unlike Jaeger or Zipkin, Tempo doesn’t index span attributes. It indexes only trace_id. To find traces:
- From metrics/logs that include trace_id (most common)
- By searching attributes via TraceQL (Tempo 2.0; not stable in Sep 2022)
- Random sampling for browsing
The bet: most trace use is “I see this slow request in my metrics; show me its trace.” That’s a trace_id lookup, which is cheap.
Tempo can’t efficiently answer queries like “find traces where customer_id=42 and duration > 1s.” For that, use Jaeger or a commercial backend.
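The trace-ID lookup can be sketched as a plain HTTP call against Tempo's query endpoint, `GET /api/traces/<traceID>`. The base URL `http://localhost:3200` is an assumption that matches the port used in the setup below; the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// traceURL builds the Tempo query URL for a single trace ID.
func traceURL(base, traceID string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/traces/" + traceID
	return u.String(), nil
}

// fetchTrace returns the raw response body for one trace.
func fetchTrace(client *http.Client, base, traceID string) ([]byte, error) {
	u, err := traceURL(base, traceID)
	if err != nil {
		return nil, err
	}
	resp, err := client.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("tempo returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: tracelookup <trace-id>")
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	// Assumed address: Tempo's HTTP port as published in the compose file.
	body, err := fetchTrace(client, "http://localhost:3200", os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	fmt.Printf("got %d bytes of trace data\n", len(body))
}
```

In practice Grafana makes this call for you when you click a trace link; the sketch is mainly useful for scripting or debugging.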
Setup
tempo:
  image: grafana/tempo:1.5.0
  command: -config.file=/etc/tempo.yaml
  volumes:
    - ./tempo-config.yaml:/etc/tempo.yaml
    - tempo-data:/tmp/tempo
  ports:
    - "3200:3200" # HTTP query
    - "4317:4317" # OTLP gRPC receiver
tempo-config.yaml:
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
ingester:
  trace_idle_period: 10s
  max_block_bytes: 1_000_000
  max_block_duration: 5m
compactor:
  compaction:
    block_retention: 168h # 7 days
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000
For production, switch backend to s3 or gcs. Object storage is cheap; traces live there.
Instrumenting Go with OpenTelemetry
OpenTelemetry is the standard. The Go SDK:
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.12.0"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Export spans to Tempo's OTLP gRPC receiver (plaintext, so keep it on a private network).
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("tempo:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("api"),
			semconv.ServiceVersionKey.String("1.2.3"),
		)),
		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.01)), // 1%
	)
	otel.SetTracerProvider(tp)
	// The default global propagator is a no-op; set W3C Trace Context so
	// trace IDs cross service boundaries.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracer(ctx)
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	defer tp.Shutdown(ctx) // flushes spans still sitting in the batch
	// ... use otelhttp or otel.Tracer
}
Auto-instrumentation libraries handle the common stuff:
- otelhttp for net/http server + client
- otelgrpc for gRPC
- splunk/otel-collector-builder for many DBs
- otelchi, otelmux for popular routers
Once wrapped, every HTTP request automatically gets a span.
Manual spans
For business operations:
// Needs go.opentelemetry.io/otel plus its attribute, codes, and trace packages.
tracer := otel.Tracer("billing")
ctx, span := tracer.Start(ctx, "process-refund",
	trace.WithAttributes(
		attribute.String("subscription_id", subID),
		attribute.Int64("amount_cents", amount),
	))
defer span.End()

if err := processRefund(ctx, subID, amount); err != nil {
	span.RecordError(err)                    // attaches the error as a span event
	span.SetStatus(codes.Error, err.Error()) // marks the span as failed
	return err
}
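The span shows up as a node in the trace, with its attributes visible in the span details. Searching by those attributes depends on TraceQL, since Tempo indexes only trace_id.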
Pass ctx to downstream calls — they pick up the trace context automatically.
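Inside one process the trace context travels in ctx. Across process boundaries it travels as an HTTP header, which in the W3C Trace Context format is `traceparent: version-traceid-parentid-flags`. The sketch below parses and re-emits that header by hand to show what instrumentation libraries do for you when the Trace Context propagator is in use. The type and function names are hypothetical; real code should rely on the OpenTelemetry propagators.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// traceparentRe matches version 00 of the W3C traceparent header.
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// TraceContext holds the fields carried by a traceparent header.
type TraceContext struct {
	TraceID  string // 16 bytes, hex encoded
	ParentID string // span ID of the caller, 8 bytes, hex encoded
	Sampled  bool   // bit 0 of the flags byte
}

// ParseTraceparent validates and decodes a traceparent header value.
func ParseTraceparent(h string) (TraceContext, error) {
	m := traceparentRe.FindStringSubmatch(strings.TrimSpace(h))
	if m == nil {
		return TraceContext{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	if m[1] == strings.Repeat("0", 32) || m[2] == strings.Repeat("0", 16) {
		return TraceContext{}, fmt.Errorf("all-zero IDs are invalid: %q", h)
	}
	flags, err := strconv.ParseUint(m[3], 16, 8)
	if err != nil {
		return TraceContext{}, err
	}
	return TraceContext{TraceID: m[1], ParentID: m[2], Sampled: flags&1 == 1}, nil
}

// Header renders the context back into a traceparent value.
func (tc TraceContext) Header() string {
	flags := 0
	if tc.Sampled {
		flags = 1
	}
	return fmt.Sprintf("00-%s-%s-%02x", tc.TraceID, tc.ParentID, flags)
}

func main() {
	in := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	tc, err := ParseTraceparent(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("trace_id=%s parent=%s sampled=%v\n", tc.TraceID, tc.ParentID, tc.Sampled)
	// A downstream call keeps the trace ID but sends its own span ID as the parent.
	tc.ParentID = "b7ad6b7169203331"
	fmt.Println(tc.Header())
}
```

If spans from two services never join into one trace, checking whether this header is present on the outgoing request is a quick first diagnostic.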
Node.js instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://tempo:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
getNodeAutoInstrumentations() auto-wraps Express, Fastify, HTTP, gRPC, Postgres, MongoDB, Redis, and ~20 more. Most services get full coverage with one line.
Linking metrics, logs, traces
The win of one stack: cross-reference.
In Grafana, configure trace ID linking:
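# Grafana data source (Loki) provisioning
- name: Loki
  type: loki
  jsonData:
    derivedFields:
      - name: TraceID
        matcherRegex: "trace_id=(\\w+)"
        # $$ escapes the $ so provisioning does not expand it as an environment variable
        url: "$${__value.raw}"
        datasourceUid: tempo # must match the uid of the Tempo data source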
Now any log line with trace_id=abc123 becomes a clickable link → opens the trace in Tempo.
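Before relying on the link, it can help to check the pattern against real log lines. This is a small sketch using Go's regexp package with the same expression as matcherRegex; the sample log lines are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as matcherRegex above. YAML needs \\w; a Go raw string needs \w.
var traceIDRe = regexp.MustCompile(`trace_id=(\w+)`)

// extractTraceID returns the first trace ID found in a log line.
func extractTraceID(line string) (string, bool) {
	m := traceIDRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	lines := []string{
		`level=info msg="refund processed" trace_id=abc123 amount_cents=500`,
		`level=info msg="no trace here"`,
		// JSON logs use a colon and quotes, so this pattern does not match them.
		`{"msg":"refund processed","trace_id":"abc123"}`,
	}
	for _, l := range lines {
		id, ok := extractTraceID(l)
		fmt.Printf("matched=%v id=%q\n", ok, id)
	}
}
```

If your services log JSON, the regex has to change accordingly, otherwise no link appears in Grafana.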
Similarly, exemplars in Prometheus link metric spikes to specific traces.
Sampling
100% sampling generates too much trace data. Sample.
sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.01)) // 1%
Tail-based sampling is smarter: keep 100% of error traces and 1% of the rest. It requires running the OTel Collector with the tail_sampling processor between your services and Tempo. Worth the setup at scale.
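The two decisions can be sketched in a few lines of Go. `ratioKeep` derives a deterministic keep/drop decision from the trace ID alone, in the spirit of trace-ID ratio sampling, so every service reaches the same answer for the same trace. `tailKeep` adds the "always keep errors" rule that is only possible once the whole trace has been seen. Both function names are hypothetical, and this is an illustration of the logic, not the SDK's or the Collector's implementation.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ratioKeep keeps roughly `fraction` of all traces, deciding from the
// trace ID only, so the decision is identical in every service.
func ratioKeep(traceID [16]byte, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	bound := uint64(fraction * (1 << 63))
	// Use the low 8 bytes of the ID as a 63-bit number.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

// tailKeep is the tail-sampling rule from the text: every trace that
// contains an error is kept, the rest are sampled by ratio.
func tailKeep(traceID [16]byte, hasError bool, fraction float64) bool {
	return hasError || ratioKeep(traceID, fraction)
}

func main() {
	raw, err := hex.DecodeString("4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil || len(raw) != 16 {
		fmt.Println("bad trace id")
		return
	}
	var id [16]byte
	copy(id[:], raw)
	fmt.Println("healthy trace kept:", tailKeep(id, false, 0.01))
	fmt.Println("error trace kept:  ", tailKeep(id, true, 0.01))
}
```

The important property is that `hasError` is only known after the request finishes, which is why this rule cannot run inside the SDK at span start and needs a collector that buffers complete traces.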
For our factory project: 5% sampling. Trace volume manageable; we catch enough slow paths.
Common Pitfalls
No service name set. Spans appear as “unknown”. Set service.name resource attribute.
100% sampling in prod. Trace volume balloons. Use 1-5%.
Forgetting to propagate ctx. New context starts a new trace; spans don’t connect. Pass ctx down everywhere.
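Sampling per span instead of per trace. If each service makes its own random decision, only some spans of a trace are kept and the waterfall has holes. Use trace-ID-based sampling so every service reaches the same decision.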
Span without End(). The span is never exported, so it silently goes missing from the trace. Defer span.End() immediately after starting it.
Tracing the wrong thing. Database queries, external HTTP calls — those make sense. Tracing every internal function = noise. Pick the boundaries.
Wrapping Up
Tempo + OpenTelemetry SDK = distributed tracing at low cost. Click metrics → jump to traces → see the slow span. Friday: SLOs in practice.