Observability for Go gRPC Services with OpenTelemetry
TL;DR:
- OpenTelemetry has converged enough in 2023 that it’s the default for new Go services.
- The otelgrpc interceptor handles span propagation across gRPC boundaries with zero handler-level code.
- Trace sampling is the most important config you’ll set; head sampling at 1-10% is usually fine for high-traffic services.
The observability story for Go has settled. OpenTelemetry won. The Go tracing API and SDK have been at 1.x since late 2021 and are stable enough to commit to. The contrib repo and its ecosystem have instrumentation for gRPC, HTTP, database/sql, and pgx: all the things you’d actually instrument.
This post walks through wiring up a Go gRPC service with OpenTelemetry traces, metrics, and the propagation glue that connects them. I’ll skip the philosophy of observability — there’s plenty of writing on that — and focus on the code and the tuning that matters.
We’ve covered interceptors and connection pooling in earlier posts this month. Observability is where you confirm both are working in production.
The Setup: TracerProvider, Exporter, Propagator
OpenTelemetry has a few core abstractions you need to instantiate at startup:
- TracerProvider: produces tracers. Tracers produce spans.
- MeterProvider: produces meters. Meters produce instruments (counters, histograms).
- Exporter: ships spans/metrics out (OTLP, Jaeger, Prometheus).
- Propagator: serializes/deserializes trace context across boundaries.
In code:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracing installs a global TracerProvider and propagator and returns
// the provider's Shutdown function so the caller can flush spans on exit.
func initTracing(ctx context.Context, serviceName, version string) (func(context.Context) error, error) {
    // Describe this process using semantic-convention attribute keys.
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion(version),
        ),
        resource.WithFromEnv(),
        resource.WithProcess(),
    )
    if err != nil {
        return nil, err
    }

    // Export spans to the collector over OTLP/gRPC (plaintext here).
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        // Honor the upstream decision; sample 5% of root traces.
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.05),
        )),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
    return tp.Shutdown, nil
}
Things to defend:
The resource attributes follow semantic conventions. Use the semconv package — don’t invent your own attribute names. Backends like Jaeger and Tempo expect specific keys for service.name, service.version, host.name. Custom keys won’t show up in default dashboards.
OTLP over gRPC is the right exporter. It’s the OpenTelemetry-native protocol, supported by every collector and most backends directly. Avoid the Jaeger/Zipkin direct exporters; they’re legacy.
The sampler is ParentBased(TraceIDRatioBased(0.05)). This means: if the incoming request already has a trace decision (from upstream), honor it. Otherwise, sample 5% based on the trace ID. This is what keeps sampling consistent in a distributed system: every service touched by a trace must reach the same decision, or you end up with partial traces.
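To make the "honor the parent, otherwise decide from the trace ID" rule concrete, here is a minimal standard-library sketch. It is not the SDK’s implementation; the types traceID and parentDecision and both functions are hypothetical names invented for illustration, and the threshold arithmetic is only modeled on the general idea of comparing a number derived from the ID against a bound.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// traceID is a 16-byte identifier (hypothetical stand-in type).
type traceID [16]byte

// parentDecision models what an upstream service already decided, if anything.
type parentDecision int

const (
    noParent parentDecision = iota
    parentSampled
    parentNotSampled
)

// ratioSample keeps a trace when a number derived from its ID falls below a
// bound proportional to ratio. The same ID gives the same answer everywhere.
func ratioSample(id traceID, ratio float64) bool {
    if ratio >= 1 {
        return true
    }
    if ratio <= 0 {
        return false
    }
    bound := uint64(ratio * (1 << 63))
    x := binary.BigEndian.Uint64(id[8:16]) >> 1
    return x < bound
}

// parentBasedSample honors an upstream decision and falls back to the
// ratio only for root spans.
func parentBasedSample(p parentDecision, id traceID, ratio float64) bool {
    switch p {
    case parentSampled:
        return true
    case parentNotSampled:
        return false
    default:
        return ratioSample(id, ratio)
    }
}

func main() {
    low := traceID{} // trailing bytes are zero, so the derived number is 0
    high := traceID{}
    for i := 8; i < 16; i++ {
        high[i] = 0xff // derived number is the largest possible
    }
    fmt.Println(parentBasedSample(noParent, low, 0.05))       // true
    fmt.Println(parentBasedSample(noParent, high, 0.05))      // false
    fmt.Println(parentBasedSample(parentSampled, high, 0.05)) // true
}
```

The third call is the important one: an ID that would lose the 5% lottery is still kept because the upstream service already said yes.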
tp.Shutdown flushes pending spans on exit. Call it in your shutdown path. Lose it and the last few seconds of spans never leave the process.
gRPC Interceptors for Tracing
The otelgrpc package gives you interceptors that handle span creation and propagation.
Server side:
srv := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        otelgrpc.UnaryServerInterceptor(),
        recovery.UnaryServerInterceptor(),
        logging.UnaryServerInterceptor(logger),
        auth.UnaryServerInterceptor(verifier),
    ),
    grpc.ChainStreamInterceptor(
        otelgrpc.StreamServerInterceptor(),
        recovery.StreamServerInterceptor(),
        logging.StreamServerInterceptor(logger),
        auth.StreamServerInterceptor(verifier),
    ),
)
Note the order: tracing first, then recovery, then logging, then auth. The first interceptor in the chain is the outermost, so with tracing first, a panic caught by recovery comes back as an error that the span records. Logging gets the trace ID from context for correlation. Auth being last lets you record auth failures in spans (codes.Unauthenticated shows up as a span status).
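The effect of that ordering can be sketched without gRPC at all. The following standard-library toy is an assumption-laden model, not the grpc or otelgrpc code: handler, interceptor, chain, observe, and recoverPanics are hypothetical names, and the "tracing" layer just records what it sees.

```go
package main

import (
    "context"
    "fmt"
)

type handler func(ctx context.Context) error
type interceptor func(ctx context.Context, next handler) error

// chain wraps h so that the first interceptor listed is the outermost.
func chain(h handler, ics ...interceptor) handler {
    for i := len(ics) - 1; i >= 0; i-- {
        ic, next := ics[i], h
        h = func(ctx context.Context) error { return ic(ctx, next) }
    }
    return h
}

// observe records entry and, if the call returns normally, the error seen.
func observe(name string, events *[]string) interceptor {
    return func(ctx context.Context, next handler) error {
        *events = append(*events, "enter "+name)
        err := next(ctx)
        *events = append(*events, fmt.Sprintf("exit %s err=%v", name, err))
        return err
    }
}

// recoverPanics turns a panic from inner layers into an ordinary error.
func recoverPanics(events *[]string) interceptor {
    return func(ctx context.Context, next handler) (err error) {
        *events = append(*events, "enter recovery")
        defer func() {
            if r := recover(); r != nil {
                err = fmt.Errorf("recovered: %v", r)
            }
        }()
        return next(ctx)
    }
}

// runChain runs a panicking handler through tracing, recovery, logging, auth.
func runChain() []string {
    var events []string
    h := chain(
        func(ctx context.Context) error { panic("boom") },
        observe("tracing", &events),
        recoverPanics(&events),
        observe("logging", &events),
        observe("auth", &events),
    )
    _ = h(context.Background())
    return events
}

func main() {
    for _, e := range runChain() {
        fmt.Println(e)
    }
}
```

In this model the layers inside recovery never get to record an exit, because the panic unwinds straight through them. Only the layer outside recovery sees the converted error, which is the argument for putting tracing first.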
Client side, on the dial:
conn, err := grpc.Dial(target,
    grpc.WithTransportCredentials(creds),
    grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
)
This injects trace context into outgoing metadata. The server-side interceptor on the other end extracts it and continues the trace. You get end-to-end propagation with no handler-level code.
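What travels in the metadata, when the TraceContext propagator is configured, is a W3C traceparent header of the form version-traceid-spanid-flags. The sketch below shows that wire format using only the standard library. The spanContext type and the inject and extract functions are hypothetical illustrations, not the propagator’s API, and a plain map stands in for gRPC metadata.

```go
package main

import (
    "encoding/hex"
    "errors"
    "fmt"
    "strings"
)

// spanContext holds the fields a traceparent header carries.
type spanContext struct {
    TraceID [16]byte
    SpanID  [8]byte
    Sampled bool
}

// inject writes the context into md as a version-00 traceparent value.
func inject(sc spanContext, md map[string]string) {
    flags := "00"
    if sc.Sampled {
        flags = "01"
    }
    md["traceparent"] = fmt.Sprintf("00-%s-%s-%s",
        hex.EncodeToString(sc.TraceID[:]), hex.EncodeToString(sc.SpanID[:]), flags)
}

// extract parses a version-00 traceparent value back out of md.
func extract(md map[string]string) (spanContext, error) {
    var sc spanContext
    parts := strings.Split(md["traceparent"], "-")
    if len(parts) != 4 || parts[0] != "00" ||
        len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
        return sc, errors.New("malformed traceparent")
    }
    if _, err := hex.Decode(sc.TraceID[:], []byte(parts[1])); err != nil {
        return spanContext{}, err
    }
    if _, err := hex.Decode(sc.SpanID[:], []byte(parts[2])); err != nil {
        return spanContext{}, err
    }
    flags, err := hex.DecodeString(parts[3])
    if err != nil {
        return spanContext{}, err
    }
    sc.Sampled = flags[0]&0x01 == 1 // lowest bit is the sampled flag
    return sc, nil
}

func main() {
    incoming := map[string]string{
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    }
    sc, err := extract(incoming)
    if err != nil {
        fmt.Println("extract failed:", err)
        return
    }
    outgoing := map[string]string{}
    inject(sc, outgoing)
    fmt.Println(outgoing["traceparent"] == incoming["traceparent"], sc.Sampled) // true true
}
```

The sampled flag in the last field is how a downstream ParentBased sampler learns the upstream decision.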
Adding Span Detail in Handlers
The interceptor gives you one span per RPC. For interesting work inside a handler, create child spans:
// Two "codes" packages are in play here: codes is go.opentelemetry.io/otel/codes
// (span status), grpccodes is google.golang.org/grpc/codes (RPC status).
func (s *invoiceServer) CreateInvoice(
    ctx context.Context,
    req *billingv1.CreateInvoiceRequest,
) (*billingv1.Invoice, error) {
    // Child span of the RPC span created by the interceptor.
    ctx, span := s.tracer.Start(ctx, "InvoiceService.process",
        trace.WithAttributes(
            attribute.String("customer.id", req.GetCustomerId()),
            attribute.Int("line_items.count", len(req.GetItems())),
        ),
    )
    defer span.End()

    total := computeTotal(req.GetItems())
    span.SetAttributes(attribute.Int64("invoice.total_cents", total))

    inv, err := s.repo.Insert(ctx, req, total)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, "insert failed")
        return nil, status.Errorf(grpccodes.Internal, "insert: %v", err)
    }
    return toProto(inv), nil
}
Attribute keys should follow semantic conventions where they exist. For domain-specific attributes, namespace them (customer.id, invoice.total_cents).
span.RecordError adds an exception event to the span. span.SetStatus(codes.Error, ...) marks the span as failed. Both are useful — RecordError preserves the error message, SetStatus shows red in the UI.
The repo layer should add its own spans, but most database libraries do this automatically once you wire up their instrumentation.
Database Instrumentation
For pgx v5, use otelpgx:
// Error handling elided for brevity; check both errors in real code.
cfg, _ := pgxpool.ParseConfig(connString)
cfg.ConnConfig.Tracer = otelpgx.NewTracer()
pool, _ := pgxpool.NewWithConfig(ctx, cfg)
Every query becomes a span, with the SQL statement and database attributes set per semantic conventions. You’ll see the full chain: gRPC handler → application span → SQL query.
For database/sql, the equivalent is otelsql. Same idea, slightly different wiring.
Metrics: What to Measure
Traces tell you what happened on individual requests. Metrics tell you what’s happening in aggregate. You want both.
The standard four for RPC services (the RED method for requests, plus saturation borrowed from the USE method for resources):
- Rate: requests per second, per method, per status.
- Errors: count of errors, per method, per error code.
- Duration: histogram of latencies, per method.
- Saturation: queue depth, pool utilization, etc.
otelgrpc also publishes metrics if you pass a meter provider. The default metrics map to the OTel RPC semantic conventions.
// metric here is go.opentelemetry.io/otel/sdk/metric.
mp := metric.NewMeterProvider(/* ... */)
otel.SetMeterProvider(mp)

srv := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        otelgrpc.UnaryServerInterceptor(
            otelgrpc.WithMeterProvider(mp),
        ),
    ),
)
You get histograms for request duration, message sizes, and counts of completed RPCs. Wire up your metric exporter (Prometheus is the most common target, either through the collector or through the direct Prometheus exporter) and you have the raw material for RED dashboards without writing any per-handler metric code.
Sampling: The Knob That Matters
A high-traffic service that traces every request will produce more data than you can afford to store. Sampling cuts this. Two choices:
Head sampling (decide at the start of a trace): cheap, deterministic per trace ID, biased against rare events. TraceIDRatioBased(0.05) is 5% head sampling.
Tail sampling (decide at the collector after seeing the whole trace): expensive, requires the collector to buffer, but can keep all errors and slow traces.
For most services, head sampling at 1-10% is fine. Combine with tail sampling at the collector for “always keep errors” rules:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: rate-limit
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
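Of the traces that reach it, the collector keeps 100% of error traces and 100% of slow traces (≥1s), and probabilistically keeps 5% of the rest. This gives you actionable data without breaking the budget. One caveat: tail sampling only sees what the SDKs export. With 5% head sampling upstream, the error and slow rules apply to that 5%, so services whose errors you can’t afford to miss need a higher head rate.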
Common Pitfalls
The set that recurs:
- Forgetting to set a propagator. Without otel.SetTextMapPropagator, the gRPC interceptors don’t actually inject/extract trace context. Traces are per-service, not distributed.
- Sampling inconsistency across services. If service A samples at 5% and service B at 100%, you’ll see traces that start in B with no parent and never extend into A. Use ParentBased everywhere.
- Reinventing semantic conventions. “service_name” instead of “service.name”; “http_status” instead of “http.status_code”. Use the semconv package.
- Not calling Shutdown. Spans batch internally; without Shutdown, the last batch is lost. Wire it into your signal handler.
- Span attributes containing PII. Customer IDs are probably fine; emails, full names, payment details are not. Audit your attributes.
- Sampling 100% in production. You will fill your storage. You will pay for it. Sample.
- Tracer/Meter from otel.Tracer("") everywhere. The string is the instrumentation name. Use your package name. Otherwise you can’t filter by which library produced a span.
- Mixing OpenCensus and OpenTelemetry. OpenCensus is deprecated. Some old libraries still use it. The bridge package exists but adds complexity. If you can, migrate everything to OTel.
Wrapping Up
Observability isn’t an add-on; it’s the contract between you and your future self at 3 AM. The OpenTelemetry SDK is mature enough now to commit to. The combination of OTLP exporter, gRPC interceptors, database instrumentation, and parent-based sampling gives you end-to-end traces with consistent metrics, all from a few lines of init code. The final post in this series ties everything together — testing strategies for the gRPC services we’ve been building.