OpenTelemetry for gRPC Services in Go: A Production Walkthrough
TL;DR: In OpenTelemetry Go 1.32 the trace and metric APIs are stable, and the log modules, though still versioned v0.x, are settled enough to wire in alongside them. Sending all three over OTLP through a small collector layer gives you the observability stack most teams build incrementally and badly. Here’s the end-to-end setup that worked for me.
I instrumented my first Go service with OpenTelemetry in 2020. The API churned every six months, the SDKs were beta, and half the integrations didn’t propagate context correctly. By 2023 the traces API was stable, but metrics was still moving. In 2025, with OTel Go 1.32, traces and metrics are GA, logs ship as v0.x modules (note the v0.8.0 log exporter below) but work end to end, and the patterns are settled enough that you can wire them up once and not revisit for a year.
This is the setup I now use as a baseline for every Go gRPC service. It assumes you have an OTel Collector running somewhere reachable — in cluster as a DaemonSet for most teams. Backend choice (Tempo/Grafana, Honeycomb, Datadog, New Relic, anything OTLP-native) is orthogonal; the collector translates.
If you’re new to gRPC instrumentation specifically, the gRPC patterns post covers the underlying behavior the interceptors expose. This post focuses on the OTel side and the production knobs.
1. The Three Signals, and Why You Want All Three
+---------+             +-----------+               +----------+
| service | -- OTLP --> | collector | -- export --> | backend  |
| traces  |             | batch     |               | (Tempo,  |
| metrics |             | filter    |               |  Mimir,  |
| logs    |             | sample    |               |  Loki)   |
+---------+             +-----------+               +----------+
- Traces answer “what happened in this one request?” Spans across services, linked by trace ID.
- Metrics answer “what is the system doing in aggregate?” Counters, histograms, gauges, exported on a schedule.
- Logs answer “what did the code say at this moment?” Linked to traces by trace/span ID for correlation.
In 2020 you had three different tools (Jaeger, Prometheus, syslog) with three different config surfaces. OTel collapses them into one wire protocol and one SDK. That’s the value.
2. SDK Setup
Add the dependencies:
go get \
go.opentelemetry.io/otel@v1.32.0 \
go.opentelemetry.io/otel/sdk@v1.32.0 \
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.32.0 \
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc@v1.32.0 \
go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc@v0.8.0 \
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.57.0
The init function:
package telemetry
import (
	"context"
	"errors"
	"fmt"
	"os"
	"time"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/log/global"
	"go.opentelemetry.io/otel/propagation"
	sdklog "go.opentelemetry.io/otel/sdk/log"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)
// Shutdown flushes and stops all three providers.
type Shutdown func(context.Context) error
// Init wires traces, metrics, and logs to OTLP/gRPC exporters and registers
// the providers globally.
func Init(ctx context.Context, serviceName, version string) (Shutdown, error) {
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(serviceName),
			semconv.ServiceVersion(version),
			semconv.DeploymentEnvironment(os.Getenv("DEPLOY_ENV")),
		),
		resource.WithHost(),
		resource.WithProcessPID(),
		// Reads OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES. Listed last so
		// values from the environment take precedence over the ones above.
		resource.WithFromEnv(),
	)
	if err != nil {
		return nil, fmt.Errorf("resource: %w", err)
	}
	// Traces: batched export, 10% head sampling that honors the parent's decision.
	tracerExp, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("trace exporter: %w", err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(tracerExp, sdktrace.WithBatchTimeout(5*time.Second)),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	// Metrics: exported every 15 seconds.
	metricExp, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("metric exporter: %w", err)
	}
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(res),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExp,
			sdkmetric.WithInterval(15*time.Second))),
	)
	otel.SetMeterProvider(mp)
	// Logs: batched export. The provider must be registered globally, otherwise
	// bridges such as otelslog fall back to a no-op provider.
	logExp, err := otlploggrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("log exporter: %w", err)
	}
	lp := sdklog.NewLoggerProvider(
		sdklog.WithResource(res),
		sdklog.WithProcessor(sdklog.NewBatchProcessor(logExp)),
	)
	global.SetLoggerProvider(lp)
	return func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
		// Shut down all three and report every failure, not just the first.
		return errors.Join(tp.Shutdown(ctx), mp.Shutdown(ctx), lp.Shutdown(ctx))
	}, nil
}
The exporters default to reading endpoint and headers from environment variables:
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer ...
OTEL_SERVICE_NAME=orders
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod
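Setting them via env keeps the code library-agnostic. The pattern I follow: the SDK init takes only the few attributes that don’t make sense as env vars; everything else is env. One caveat: OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES are read by the resource package, not by the exporters, and only when the resource is built with resource.WithFromEnv() (or resource.Default()).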
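Because configuration lives in the environment, a missing variable fails quietly: with no endpoint set, the OTLP gRPC exporters fall back to localhost:4317 and the service starts without complaint. A small startup check can make that visible. This is a sketch under assumptions: the `missingEnv` helper is hypothetical, and which variables you treat as required is your call.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingEnv (hypothetical helper) returns the names in required that are
// unset or blank according to lookup. Passing the lookup function in makes
// the check easy to test without touching the real environment.
func missingEnv(lookup func(string) (string, bool), required ...string) []string {
	var missing []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || strings.TrimSpace(v) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	missing := missingEnv(os.LookupEnv,
		"OTEL_EXPORTER_OTLP_ENDPOINT",
		"OTEL_SERVICE_NAME",
	)
	if len(missing) > 0 {
		// Warn rather than exit here; whether this should be fatal is a
		// deployment decision.
		fmt.Println("telemetry env not set:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("telemetry env looks complete")
}
```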
3. Wiring the gRPC Interceptors
The contrib package gives you stats handlers for both client and server (they replace the older interceptors):
import (
	"google.golang.org/grpc"
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
)
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
		// other options...
	)
}
func newClient(target string) (*grpc.ClientConn, error) {
	return grpc.NewClient(
		target,
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
		// other options; grpc.NewClient returns an error unless transport
		// credentials are set, e.g. grpc.WithTransportCredentials(...)
	)
}
StatsHandler is preferred over the older UnaryInterceptor/StreamInterceptor approach because it captures wire-level events (message sizes, stream lifecycle) that interceptors don’t see. The contrib package emits the standard semantic-convention attributes (rpc.system=grpc, rpc.service, rpc.method, etc.) which means your backend dashboards work out of the box.
4. Custom Spans Inside Handlers
The auto-instrumentation gives you one span per RPC. For interesting work inside, create child spans:
import (
	"context"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)
// pb (the generated protobuf package) and Server come from the service itself.
var tracer = otel.Tracer("orders")
func (s *Server) GetOrder(ctx context.Context, req *pb.GetOrderRequest) (*pb.GetOrderResponse, error) {
	// Child of the RPC span that the stats handler already started.
	ctx, span := tracer.Start(ctx, "load-order",
		trace.WithAttributes(attribute.String("order.id", req.GetId())))
	defer span.End()
	order, err := s.db.Find(ctx, req.GetId())
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, err
	}
	span.SetAttributes(attribute.Int("order.line_count", len(order.Lines)))
	return &pb.GetOrderResponse{Order: toProto(order)}, nil
}
A few discipline points:
- Span name is a low-cardinality verb-noun. load-order is fine; load-order-42 is wrong.
- High-cardinality data (IDs, user emails) goes in attributes, not the span name.
- RecordError + SetStatus(Error) is the two-step that backends usually need to surface a span as failed.
5. Metrics with Histograms
For latency you want histograms, not gauges. OTel 1.32 supports explicit bucket boundaries:
import (
	"context"
	"time"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)
var (
	meter = otel.Meter("orders")
	// Instrument creation errors are ignored for brevity; check them in real code.
	orderLatency, _ = meter.Float64Histogram(
		"orders.handler.latency",
		metric.WithUnit("s"),
		metric.WithDescription("Order handler latency"),
		metric.WithExplicitBucketBoundaries(
			0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
		),
	)
	ordersCreated, _ = meter.Int64Counter(
		"orders.created.total",
		metric.WithDescription("Orders created"),
	)
)
func (s *Server) CreateOrder(ctx context.Context, req *pb.CreateOrderRequest) (*pb.CreateOrderResponse, error) {
	start := time.Now()
	// Record latency on every return path, tagged with a low-cardinality attribute.
	defer func() {
		orderLatency.Record(ctx, time.Since(start).Seconds(),
			metric.WithAttributes(attribute.String("customer.tier", req.GetTier())))
	}()
	// ...
	ordersCreated.Add(ctx, 1)
	return &pb.CreateOrderResponse{}, nil
}
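Pick boundaries that match your SLO. Don’t lean on the defaults here: the SDK’s default histogram boundaries are explicit, not exponential (0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000), and they are scaled for milliseconds. With a unit of seconds they lump nearly every request into the (0, 5] bucket.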
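To see why the boundaries should line up with your SLO, it helps to look at how a recorded value lands in a bucket. The sketch below reuses the boundaries from the histogram above and assumes upper-inclusive buckets, meaning bucket i covers (bounds[i-1], bounds[i]]. The `bucketIndex` helper is illustrative only, not an SDK function.

```go
package main

import (
	"fmt"
	"sort"
)

// Same boundaries as the orders.handler.latency histogram, in seconds.
var bounds = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// bucketIndex returns the index of the bucket that counts v. Index i covers
// (bounds[i-1], bounds[i]]; index len(bounds) is the overflow bucket for
// values above the last boundary.
func bucketIndex(bounds []float64, v float64) int {
	// SearchFloat64s finds the smallest i with bounds[i] >= v.
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	for _, latency := range []float64{0.003, 0.25, 0.26, 12} {
		fmt.Printf("%.3fs -> bucket %d\n", latency, bucketIndex(bounds, latency))
	}
}
```

With a 250 ms SLO, having 0.25 as a boundary means "requests within SLO" is an exact sum of buckets 0 through 5; without that boundary the backend has to interpolate inside a wider bucket.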
6. Log Correlation
OTel logs are still less mature than traces and metrics, but the bridge with log/slog is solid in Go 1.24. Wire the slog handler that ships with the contrib package:
import (
	"log/slog"
	"go.opentelemetry.io/contrib/bridges/otelslog"
)
func newLogger() *slog.Logger {
	return otelslog.NewLogger("orders")
}
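Every call through this logger now emits to the OTel log pipeline, provided a LoggerProvider has been registered globally (otelslog.NewLogger falls back to the global provider unless you pass otelslog.WithLoggerProvider). Any log made with a context carrying an active span includes the trace and span IDs. In your backend, clicking a span to “see logs” works because they’re joined on those IDs. To route the package-level slog functions through the bridge as well, install it once with slog.SetDefault(newLogger()); after that, calls like this one are correlated too: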
slog.InfoContext(ctx, "order created", "order_id", id, "amount_cents", amt)
7. Sampling Strategy
Tracing every request at the SDK level is wasteful at scale. The pattern that works:
- Head sampling at 10% in the SDK (TraceIDRatioBased(0.1)).
- Tail sampling in the collector for the interesting 100%: errors, slow requests, specific routes.
Collector config:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
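The 5% probabilistic policy gives you a baseline view; the errors and slow policies keep everything that matters among the traces the collector receives. Keep in mind that tail sampling only sees what head sampling lets through: with a 10% head sampler, the collector keeps every error and slow trace within that 10%, and the baseline works out to 0.1 × 0.05 = 0.5% of all requests. If you need every error from a service, raise its head rate and let the collector do the cutting.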
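Ratio-based head sampling works across services because the decision is a pure function of the trace ID: every service that sees the same ID reaches the same verdict. The sketch below is a simplified illustration of that idea, not the SDK's implementation. It treats the low 8 bytes of the trace ID as a number and keeps the trace when that number falls below a threshold proportional to the ratio.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sampled reports whether a trace is kept at the given ratio. The decision
// depends only on the trace ID, so it is stable across processes.
func sampled(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Map the ratio onto the 63-bit range and compare the ID against it.
	threshold := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < threshold
}

func main() {
	var low, high [16]byte // low is all zeros
	for i := 8; i < 16; i++ {
		high[i] = 0xff
	}
	fmt.Println("low ID kept:", sampled(low, 0.1))
	fmt.Println("high ID kept:", sampled(high, 0.1))

	// Head and tail rates multiply for traces matched only by the
	// probabilistic policy.
	head, tail := 0.10, 0.05
	fmt.Printf("baseline kept: %.1f%% of all traces\n", head*tail*100)
}
```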
8. Common Pitfalls
8.1 Forgetting Context Propagation
The otelgrpc handlers propagate context across gRPC boundaries automatically. HTTP boundaries are not automatic — wrap your HTTP client with otelhttp.NewTransport and your server with otelhttp.NewHandler. If a downstream span shows no parent, this is almost always the cause.
8.2 High-Cardinality Attributes
Every unique attribute combination is a new time series in your metrics backend. Putting a user_id on a counter blows up cardinality and your bill. Reserve attributes for low-cardinality dimensions (tier, region, route).
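A quick way to sanity-check an attribute set before shipping it is to multiply the number of distinct values each attribute can take. The sketch below does that; the counts are made-up illustrative numbers, and the result is a worst-case upper bound since not every combination occurs in practice.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one
// instrument: the product of distinct values across its attributes.
func seriesCount(distinct map[string]int) int {
	total := 1
	for _, n := range distinct {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical distinct-value counts for a single counter.
	safe := map[string]int{"tier": 3, "region": 5, "route": 20}
	fmt.Println("low-cardinality attributes:", seriesCount(safe))

	// Adding one high-cardinality attribute multiplies everything.
	risky := map[string]int{"tier": 3, "region": 5, "route": 20, "user_id": 100000}
	fmt.Println("with user_id:", seriesCount(risky))
}
```

For histograms the cost is higher still, because each series carries one count per bucket.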
8.3 Synchronous Exporters in Hot Paths
sdktrace.WithSyncer blocks each span on export. Use WithBatcher, as the init above does. The same applies to logs: prefer sdklog.NewBatchProcessor over sdklog.NewSimpleProcessor.
8.4 Trusting the SDK to Recover From Collector Outages
If the collector is unreachable, the exporter retries and eventually drops. Spans get lost. The fix is collector HA (two replicas with a service) and an in-pod sidecar collector that buffers locally. The SDK is not your buffer.
8.5 Ignoring Resource Attributes
service.name, service.version, and deployment.environment are the three attributes that make multi-service dashboards possible. Set them everywhere, consistently, ideally from env.
9. Troubleshooting
9.1 Spans Show Up Locally But Not in Backend
The collector is receiving but not exporting. Check otelcol_exporter_send_failed_* metrics on the collector. Common: wrong endpoint, missing auth header, TLS mismatch.
9.2 Trace IDs Don’t Match Across Services
The propagator isn’t set. The init above sets TraceContext{} and Baggage{}. If one service is set and the other isn’t, you’ll see broken traces that start fresh on the boundary.
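When debugging a broken trace, it helps to know what the TraceContext propagator puts on the wire: a `traceparent` entry (gRPC metadata or HTTP header) of the form `00-<trace ID>-<parent span ID>-<flags>`. The sketch below parses and validates that value so you can check what a downstream service received. It handles only version 00 of the W3C format, and the `parseTraceParent` helper is hypothetical rather than part of any OTel package.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields of a version-00 W3C traceparent value.
type traceParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// parseTraceParent validates and splits a value of the form
// 00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(h), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("not a version-00 traceparent")
	}
	if len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return traceParent{}, errors.New("wrong field length")
	}
	for _, p := range parts[1:] {
		if _, err := hex.DecodeString(p); err != nil || p != strings.ToLower(p) {
			return traceParent{}, errors.New("fields must be lowercase hex")
		}
	}
	if strings.Trim(parts[1], "0") == "" || strings.Trim(parts[2], "0") == "" {
		return traceParent{}, errors.New("all-zero trace or span ID")
	}
	flags, _ := hex.DecodeString(parts[3]) // validated as hex above
	return traceParent{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&1 == 1, // lowest bit is the sampled flag
	}, nil
}

func main() {
	// Example value from the W3C Trace Context specification.
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp, err)

	// What the downstream sees when the upstream never injected anything.
	_, err = parseTraceParent("")
	fmt.Println(err)
}
```

If the value is absent or malformed on the receiving side, the downstream starts a fresh trace, which is exactly the "starts fresh on the boundary" symptom.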
9.3 Metrics Have Massive Cardinality Bills
You attached a high-cardinality attribute. The fix is to remove the attribute from metrics (still keep it on spans, where cardinality is fine) using a view:
sdkmetric.NewMeterProvider(
	sdkmetric.WithView(sdkmetric.NewView(
		sdkmetric.Instrument{Name: "orders.handler.latency"},
		sdkmetric.Stream{AttributeFilter: attribute.NewAllowKeysFilter("customer.tier", "route")},
	)),
)
10. Wrapping Up
OpenTelemetry in Go is finally the boring, settled thing it always wanted to be. Wire it up once with the init function above, instrument your gRPC handlers with the stats handler, add a few business-meaningful custom spans and counters, and you’re done. Resist the urge to over-instrument; the auto-spans plus a handful of curated metrics will cover 80% of your debugging needs.
The OpenTelemetry Go docs are the canonical reference. They’ve improved significantly in the last year and now mostly match what the SDK does. Next post in the series covers rate limiting and resilience — the pieces that keep your nicely-instrumented services from falling over under load.