OpenTelemetry for gRPC Services in Go, A Production Walkthrough

July 28, 2025 · 7 min read · by Muhammad Amal · programming

TL;DR: With OpenTelemetry Go 1.32, the trace and metric APIs are stable, and the log packages, while still published as v0.x modules, have settled enough to wire in alongside them. Wiring all three through a single OTLP exporter and a small collector layer gives you the observability stack most teams build incrementally and badly. Here’s the end-to-end setup that worked for me.

I instrumented my first Go service with OpenTelemetry in 2020. The API churned every six months, the SDKs were beta, and half the integrations didn’t propagate context correctly. By 2023 the traces API was stable, but metrics was still moving. In 2025, with OTel Go 1.32, traces and metrics are GA, logs are close behind (the log modules are still versioned v0.x), and the patterns are settled enough that you can wire all three up once and not revisit for a year.

This is the setup I now use as a baseline for every Go gRPC service. It assumes you have an OTel Collector running somewhere reachable — in cluster as a DaemonSet for most teams. Backend choice (Tempo/Grafana, Honeycomb, Datadog, New Relic, anything OTLP-native) is orthogonal; the collector translates.

If you’re new to gRPC instrumentation specifically, the gRPC patterns post covers the underlying behavior the interceptors expose. This post focuses on the OTel side and the production knobs.

1. The Three Signals, and Why You Want All Three

+---------+              +-----------+              +----------+
| service | -- OTLP -->  | collector | -- export -->| backend  |
| traces  |              |   batch   |              | (Tempo,  |
| metrics |              |   filter  |              |  Mimir,  |
| logs    |              |   sample  |              |  Loki)   |
+---------+              +-----------+              +----------+

Traces answer “what happened in this one request?” Spans across services, linked by trace ID.
Metrics answer “what is the system doing in aggregate?” Counters, histograms, gauges, exported on a schedule.
Logs answer “what did the code say at this moment?” Linked to traces by trace/span ID for correlation.

In 2020 you had three different tools (Jaeger, Prometheus, syslog) with three different config surfaces. OTel collapses them into one wire protocol and one SDK. That’s the value.

2. SDK Setup

Add the dependencies:

go get \
  go.opentelemetry.io/otel@v1.32.0 \
  go.opentelemetry.io/otel/sdk@v1.32.0 \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc@v1.32.0 \
  go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc@v1.32.0 \
  go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc@v0.8.0 \
  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.57.0

The init function:

package telemetry

import (
	"context"
	"errors"
	"fmt"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/log/global"
	"go.opentelemetry.io/otel/propagation"
	sdklog "go.opentelemetry.io/otel/sdk/log"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// Shutdown flushes and stops all telemetry providers.
type Shutdown func(context.Context) error

// Init configures the global tracer, meter, and logger providers and returns
// a Shutdown function that must be called before the process exits.
func Init(ctx context.Context, serviceName, version string) (Shutdown, error) {
	// Resource attributes are attached to every span, metric, and log record.
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(serviceName),
			semconv.ServiceVersion(version),
			semconv.DeploymentEnvironment(os.Getenv("DEPLOY_ENV")),
		),
		resource.WithHost(),
		resource.WithProcessPID(),
	)
	if err != nil {
		return nil, fmt.Errorf("resource: %w", err)
	}

	// Traces: batched export, 10% head sampling that honors the parent's decision.
	tracerExp, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("trace exporter: %w", err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(tracerExp, sdktrace.WithBatchTimeout(5*time.Second)),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	// Metrics: pushed to the collector every 15 seconds.
	metricExp, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("metric exporter: %w", err)
	}
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(res),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExp,
			sdkmetric.WithInterval(15*time.Second))),
	)
	otel.SetMeterProvider(mp)

	// Logs: batched export. The provider has to be registered globally,
	// otherwise bridges such as otelslog fall back to a no-op provider.
	logExp, err := otlploggrpc.New(ctx)
	if err != nil {
		return nil, fmt.Errorf("log exporter: %w", err)
	}
	lp := sdklog.NewLoggerProvider(
		sdklog.WithResource(res),
		sdklog.WithProcessor(sdklog.NewBatchProcessor(logExp)),
	)
	global.SetLoggerProvider(lp)

	return func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
		// Shut down all three providers and report every failure, not just the first.
		return errors.Join(tp.Shutdown(ctx), mp.Shutdown(ctx), lp.Shutdown(ctx))
	}, nil
}

The exporters default to reading endpoint and headers from environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer ...
OTEL_SERVICE_NAME=orders
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod

Setting them via env keeps the code library-agnostic. The pattern I follow: the SDK init takes only the few attributes that don’t make sense as env vars; everything else is env.

3. Wiring the gRPC Interceptors

The contrib package gives you stats handlers for both the client and the server side:

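import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

// newServer returns a gRPC server that records a span and metrics per RPC.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
		// other options...
	)
}

// newClient returns a client connection that starts a span per outgoing RPC
// and injects the trace context into the request metadata.
func newClient(target string) (*grpc.ClientConn, error) {
	return grpc.NewClient(
		target,
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
		// other options, including transport credentials
		// (NewClient returns an error if none are configured)...
	)
}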

StatsHandler is preferred over the older UnaryInterceptor/StreamInterceptor approach because it captures wire-level events (message sizes, stream lifecycle) that interceptors don’t see. The contrib package emits the standard semantic-convention attributes (rpc.system=grpc, rpc.service, rpc.method, etc.) which means your backend dashboards work out of the box.
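The rpc.service and rpc.method attributes are derived from the gRPC full method name, which has the shape /package.Service/Method. Here is a rough sketch of that split, useful when you want to predict which attribute values a dashboard query should match. This is an illustration with a hypothetical helper (splitFullMethod), not the contrib package’s own code.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFullMethod breaks a gRPC full method name of the form
// "/package.Service/Method" into the values that end up in the
// rpc.service and rpc.method attributes. ok is false when the input
// does not follow that shape.
func splitFullMethod(fullMethod string) (service, method string, ok bool) {
	name, found := strings.CutPrefix(fullMethod, "/")
	if !found {
		return "", "", false
	}
	i := strings.LastIndex(name, "/")
	if i <= 0 || i == len(name)-1 {
		return "", "", false
	}
	return name[:i], name[i+1:], true
}

func main() {
	service, method, ok := splitFullMethod("/orders.v1.OrderService/GetOrder")
	fmt.Println(service, method, ok) // orders.v1.OrderService GetOrder true
}
```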

4. Custom Spans Inside Handlers

The auto-instrumentation gives you one span per RPC. For interesting work inside, create child spans:

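import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes" // OTel span status codes, not gRPC status codes
	"go.opentelemetry.io/otel/trace"
	// plus pb, your generated protobuf package
)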

var tracer = otel.Tracer("orders")

func (s *Server) GetOrder(ctx context.Context, req *pb.GetOrderRequest) (*pb.GetOrderResponse, error) {
	ctx, span := tracer.Start(ctx, "load-order",
		trace.WithAttributes(attribute.String("order.id", req.GetId())))
	defer span.End()

	order, err := s.db.Find(ctx, req.GetId())
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, err
	}
	span.SetAttributes(attribute.Int("order.line_count", len(order.Lines)))
	return &pb.GetOrderResponse{Order: toProto(order)}, nil
}

A few discipline points:

Span name is a low-cardinality verb-noun. load-order is fine; load-order-42 is wrong.
High-cardinality data (IDs, user emails) goes in attributes, not the span name.
RecordError + SetStatus(Error) is the two-step that backends usually need to surface a span as failed.

5. Metrics with Histograms

For latency you want histograms, not gauges. OTel 1.32 supports explicit bucket boundaries:

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	// plus pb, your generated protobuf package
)

var (
	meter = otel.Meter("orders")

	orderLatency, _ = meter.Float64Histogram(
		"orders.handler.latency",
		metric.WithUnit("s"),
		metric.WithDescription("Order handler latency"),
		metric.WithExplicitBucketBoundaries(
			0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
		),
	)

	ordersCreated, _ = meter.Int64Counter(
		"orders.created.total",
		metric.WithDescription("Orders created"),
	)
)

func (s *Server) CreateOrder(ctx context.Context, req *pb.CreateOrderRequest) (*pb.CreateOrderResponse, error) {
	start := time.Now()
	defer func() {
		orderLatency.Record(ctx, time.Since(start).Seconds(),
			metric.WithAttributes(attribute.String("customer.tier", req.GetTier())))
	}()
	// ...
	ordersCreated.Add(ctx, 1)
	return &pb.CreateOrderResponse{}, nil
}

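Pick boundaries that match your SLO. If you set none, the SDK falls back to its default explicit boundaries (0, 5, 10, 25 and so on up to 10000), which are sized for millisecond values. For a histogram recorded in seconds, as here, almost every request would land in a single bucket, so set the boundaries explicitly.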
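To see why the SLO threshold should be one of the boundaries, here is a small sketch of how explicit buckets behave. Each bucket is upper-inclusive, so a value equal to a boundary counts toward that boundary’s bucket. The helper names (bucketIndex, fractionWithin) and the sample latencies are made up for illustration; this is not SDK code.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketIndex returns the bucket a value lands in for explicit boundaries.
// Bucket i covers (bounds[i-1], bounds[i]]; the final bucket holds everything
// above the last boundary.
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

// fractionWithin reports the share of observations at or below threshold.
// It only answers when threshold is exactly one of the boundaries; otherwise
// the histogram cannot tell which side of the threshold a bucket's values fall on.
func fractionWithin(bounds []float64, counts []uint64, threshold float64) (float64, bool) {
	i := sort.SearchFloat64s(bounds, threshold)
	if i == len(bounds) || bounds[i] != threshold {
		return 0, false
	}
	var within, total uint64
	for j, c := range counts {
		total += c
		if j <= i {
			within += c
		}
	}
	if total == 0 {
		return 0, false
	}
	return float64(within) / float64(total), true
}

func main() {
	bounds := []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
	counts := make([]uint64, len(bounds)+1)
	for _, latency := range []float64{0.003, 0.04, 0.2, 0.25, 0.7, 12} {
		counts[bucketIndex(bounds, latency)]++
	}

	frac, ok := fractionWithin(bounds, counts, 0.25) // 250 ms SLO is a boundary
	fmt.Println(frac, ok)

	_, ok = fractionWithin(bounds, counts, 0.3) // 300 ms is not, so no exact answer
	fmt.Println(ok)
}
```

With a 250 ms SLO and a boundary at 0.25, the share of requests inside the SLO is an exact ratio of bucket counts. With a 300 ms SLO and the same boundaries, the backend has to interpolate inside the 0.25 to 0.5 bucket.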

6. Log Correlation

OTel logs are still less mature than traces and metrics, but the bridge with log/slog is solid in Go 1.24. Wire the slog handler that ships with the contrib package:

import (
	"log/slog"
	"go.opentelemetry.io/contrib/bridges/otelslog"
)

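// newLogger returns a slog.Logger backed by the otelslog bridge. Unless
// otelslog.WithLoggerProvider is passed, the bridge uses the global
// LoggerProvider (package go.opentelemetry.io/otel/log/global).
func newLogger() *slog.Logger {
	return otelslog.NewLogger("orders")
}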

Once that logger is installed as the default with slog.SetDefault(newLogger()), every slog call emits to the OTel log pipeline, and any log made with a context carrying an active span includes the trace and span IDs. In your backend, clicking a span to “see logs” works because they’re joined on those IDs.

slog.InfoContext(ctx, "order created", "order_id", id, "amount_cents", amt)

7. Sampling Strategy

Tracing every request at the SDK level is wasteful at scale. The pattern that works:

Head sampling at 10% in the SDK (TraceIDRatioBased(0.1)).
Tail sampling in the collector for the interesting 100%: errors, slow requests, specific routes.

Collector config:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

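The 5% probabilistic policy gives you a baseline view; the error and latency policies keep every failed or slow trace that reaches the collector. Keep in mind that tail sampling only sees traces that survived head sampling, so with a 10% head rate these policies act on that 10%. If you need every error, raise the head rate and let the collector do the thinning.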
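Two details are worth making concrete. First, ratio-based head sampling is a pure function of the trace ID, which is why every service that sees the same trace reaches the same keep or drop decision without coordination. Second, the two stages multiply: a 10% head rate followed by a 5% probabilistic policy keeps 0.5% of all traffic as baseline. The sketch below is written in the spirit of trace-ID ratio sampling and is not the SDK’s code; sampled and baselineRate are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sampled makes a deterministic keep/drop decision from the trace ID alone.
// It reads the low 8 bytes of the ID as a number and keeps the trace when
// that number falls below a threshold proportional to ratio.
func sampled(traceID [16]byte, ratio float64) bool {
	if ratio <= 0 {
		return false
	}
	if ratio >= 1 {
		return true
	}
	upperBound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

// baselineRate is the share of all traces kept by a probabilistic tail policy
// after head sampling has already thinned the stream.
func baselineRate(headRatio, tailPercent float64) float64 {
	return headRatio * tailPercent / 100
}

func main() {
	id := [16]byte{
		0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36,
	}
	// Two services evaluating the same trace ID agree on the decision.
	fmt.Println(sampled(id, 0.1) == sampled(id, 0.1))

	fmt.Printf("baseline share of all traffic: %.3f%%\n", baselineRate(0.1, 5)*100)
}
```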

8. Common Pitfalls

8.1 Forgetting Context Propagation

The otelgrpc handlers propagate context across gRPC boundaries automatically. HTTP boundaries are not automatic — wrap your HTTP client with otelhttp.NewTransport and your server with otelhttp.NewHandler. If a downstream span shows no parent, this is almost always the cause.
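What the TraceContext propagator carries across the boundary is a traceparent value (gRPC metadata or an HTTP header) in the W3C Trace Context format: version, trace ID, parent span ID, and flags, separated by dashes. When debugging a broken trace it helps to dump the incoming headers and check that value by hand. Here is a sketch of a parser for the version-00 format; parseTraceParent is a hypothetical debugging helper, not the SDK’s propagator.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields of a version-00 W3C traceparent value.
type traceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the flags byte
}

// parseTraceParent validates and splits a version-00 traceparent value.
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("traceparent: expected version 00 with four fields")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 {
		return traceParent{}, errors.New("traceparent: wrong field length")
	}
	if header != strings.ToLower(header) {
		return traceParent{}, errors.New("traceparent: hex digits must be lowercase")
	}
	flagBytes, err := hex.DecodeString(flags)
	if err != nil {
		return traceParent{}, fmt.Errorf("traceparent: flags: %w", err)
	}
	for _, id := range []string{traceID, spanID} {
		if _, err := hex.DecodeString(id); err != nil {
			return traceParent{}, fmt.Errorf("traceparent: id: %w", err)
		}
		if strings.Trim(id, "0") == "" {
			return traceParent{}, errors.New("traceparent: all-zero id is invalid")
		}
	}
	return traceParent{
		TraceID: traceID,
		SpanID:  spanID,
		Sampled: flagBytes[0]&0x01 == 0x01,
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.SpanID, tp.Sampled, err)
}
```

If the value is missing entirely on the receiving side, the caller never injected it; if it is present but the server span still has no parent, the server side is not extracting it.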

8.2 High-Cardinality Attributes

Every unique attribute combination is a new time series in your metrics backend. Putting a user_id on a counter blows up cardinality and your bill. Reserve attributes for low-cardinality dimensions (tier, region, route).
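The arithmetic is worth doing before you add an attribute. The worst-case series count for one instrument is the product of the distinct values of each attribute. A quick sketch, with made-up numbers for the tiers, regions, routes, and users:

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one
// instrument: the product of the distinct values each attribute can take.
func seriesCount(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical figures, chosen only to show the scale of the difference.
	low := map[string]int{"customer.tier": 3, "region": 5, "route": 20}
	high := map[string]int{"customer.tier": 3, "region": 5, "route": 20, "user_id": 100_000}

	fmt.Println(seriesCount(low))  // 300
	fmt.Println(seriesCount(high)) // 30000000
}
```

For a histogram, each of those series also carries one count per bucket, so the stored data grows by a further factor of the bucket count.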

8.3 Synchronous Exporters in Hot Paths

sdktrace.WithSyncer blocks each span on export. Use WithBatcher, as the init above does. The same applies to logs: prefer sdklog.NewBatchProcessor over sdklog.NewSimpleProcessor.

8.4 Trusting the SDK to Recover From Collector Outages

If the collector is unreachable, the exporter retries for a while and then drops the data. Spans get lost. The fix is collector HA (two replicas behind a Service) and an in-pod sidecar collector that buffers locally. The SDK is not your buffer.
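The reason is that the batch processor’s queue is bounded (the spec default for the span queue is 2048 entries) and enqueueing never blocks the application. Here is a toy model of that behavior during an outage, assuming nothing drains the queue; boundedQueue is a hypothetical stand-in, not the SDK’s implementation.

```go
package main

import "fmt"

// boundedQueue is a toy stand-in for an SDK export queue: fixed capacity,
// never blocks the caller, and drops new items once it is full.
// It is not safe for concurrent use; the sketch calls it from one goroutine.
type boundedQueue struct {
	items   chan string
	dropped int
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{items: make(chan string, capacity)}
}

// enqueue returns false when the queue is full and the item was dropped.
func (q *boundedQueue) enqueue(span string) bool {
	select {
	case q.items <- span:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	q := newBoundedQueue(2048)

	// Simulate a collector outage: spans keep arriving, nothing is exported.
	for i := 0; i < 5000; i++ {
		q.enqueue(fmt.Sprintf("span-%d", i))
	}
	fmt.Println("queued:", len(q.items), "dropped:", q.dropped) // queued: 2048 dropped: 2952
}
```

At a few hundred spans per second that queue fills within seconds, which is why the buffering belongs in a collector with a persistent or at least much larger queue.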

8.5 Ignoring Resource Attributes

service.name, service.version, and deployment.environment are the three attributes that make multi-service dashboards possible. Set them everywhere, consistently, ideally from env.

9. Troubleshooting

9.1 Spans Show Up Locally But Not in Backend

The collector is receiving but not exporting. Check the otelcol_exporter_send_failed_* metrics on the collector. Common causes: wrong endpoint, missing auth header, TLS mismatch.

9.2 Trace IDs Don’t Match Across Services

The propagator isn’t set on one side. The init above sets TraceContext{} and Baggage{}. If one service sets the propagator and the other doesn’t, the trace breaks at that boundary and the downstream service starts a fresh trace with a new ID.

9.3 Metrics Have Massive Cardinality Bills

You attached a high-cardinality attribute. The fix is to remove the attribute from metrics (still keep it on spans, where cardinality is fine) using a view:

mp := sdkmetric.NewMeterProvider(
	// ...resource and reader options as in the init function...
	sdkmetric.WithView(sdkmetric.NewView(
		sdkmetric.Instrument{Name: "orders.handler.latency"},
		sdkmetric.Stream{AttributeFilter: attribute.NewAllowKeysFilter("customer.tier", "route")},
	)),
)
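The effect of an allow-list filter is that data points differing only in a dropped attribute merge into one series. A standard-library sketch of that collapse, using hypothetical helpers (filterKeys, seriesKey) and made-up data points rather than the SDK’s types:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filterKeys keeps only the allowed attribute keys, in the spirit of an
// allow-keys attribute filter on a view.
func filterKeys(attrs map[string]string, allowed ...string) map[string]string {
	out := make(map[string]string, len(allowed))
	for _, k := range allowed {
		if v, ok := attrs[k]; ok {
			out[k] = v
		}
	}
	return out
}

// seriesKey builds a stable identity for an attribute set, so two points
// with the same attributes map to the same series.
func seriesKey(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + attrs[k] + ";")
	}
	return b.String()
}

func main() {
	points := []map[string]string{
		{"customer.tier": "gold", "route": "/orders", "user_id": "u-1"},
		{"customer.tier": "gold", "route": "/orders", "user_id": "u-2"},
		{"customer.tier": "free", "route": "/orders", "user_id": "u-3"},
	}
	before, after := map[string]bool{}, map[string]bool{}
	for _, p := range points {
		before[seriesKey(p)] = true
		after[seriesKey(filterKeys(p, "customer.tier", "route"))] = true
	}
	fmt.Println(len(before), len(after)) // 3 2
}
```

Three users produce three series before the filter and two after it, one per tier and route pair. The user_id stays available on the spans, where it belongs.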

10. Wrapping Up

OpenTelemetry in Go is finally the boring, settled thing it always wanted to be. Wire it up once with the init function above, instrument your gRPC handlers with the stats handler, add a few business-meaningful custom spans and counters, and you’re done. Resist the urge to over-instrument; the auto-spans plus a handful of curated metrics will cover 80% of your debugging needs.

The OpenTelemetry Go docs are the canonical reference. They’ve improved significantly in the last year and now mostly match what the SDK does. Next post in the series covers rate limiting and resilience — the pieces that keep your nicely-instrumented services from falling over under load.
