Rust Service Observability in 2024: Metrics, Logs, and Traces That Help
March 25, 2024 · 7 min read · by Muhammad Amal · programming

TL;DR: In 2024, the boring, correct way to instrument a Rust service is the metrics crate for Prometheus-style counters and histograms, the tracing crate for structured logs and spans, and opentelemetry plus its tracing bridge (tracing-opentelemetry) for distributed traces. Wire all three at startup, attach a trace ID to every log line, and never let sampling drop your error traces.

I’ve spent enough time on call to know what makes a Rust service debuggable at 3am: every request has a unique ID, that ID appears in every log line and every metric label that matters, and the trace for that request can be pulled up in a viewer with one click. Anything less and you’re guessing.

The Rust ecosystem now has all three legs of the observability stool well-supported. The work is gluing them together properly, which is what this post covers. If you’re using Axum specifically, the HTTP service and tracing post covers the HTTP layer instrumentation that complements what we’ll set up here.

What I Mean By “Observability”

Three signals, each answering a different question.

Metrics answer “is the system healthy right now, in aggregate?” Counters, gauges, histograms. Low cardinality. Scraped or pushed. Prometheus is still the lingua franca.

Logs answer “what exactly happened in this one request?” High cardinality, structured, searchable. JSON to stdout in production, pretty in dev.

Traces answer “where did the latency in this request go?” Spans across services, parent-child relationships, baggage. OpenTelemetry is the standard; the exporter can ship to Tempo, Jaeger, Honeycomb, etc.

Each signal has weaknesses. Logs without metrics drown you in volume. Metrics without traces tell you a percentile is high but not why. Traces without logs tell you a span took 800ms but not what it was doing. You need all three, correlated.

The Dependency List

These are the versions I’m running in production as of March 2024.

[dependencies]
tokio   = { version = "1.36", features = ["full"] }
axum    = "0.7"
tower   = "0.4"
tower-http = { version = "0.5", features = ["trace"] }

# Logs and spans
tracing             = "0.1"
tracing-subscriber  = { version = "0.3", features = ["env-filter", "json", "fmt"] }
tracing-opentelemetry = "0.23"

# Metrics
metrics             = "0.22"
metrics-exporter-prometheus = "0.13"

# OTel
opentelemetry        = { version = "0.22", features = ["trace"] }
opentelemetry_sdk    = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp   = { version = "0.15", features = ["tonic", "trace"] }

anyhow = "1"

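The tracing-opentelemetry and opentelemetry versions move in lockstep: tracing-opentelemetry 0.23 targets opentelemetry 0.22, and opentelemetry-otlp 0.15 belongs to the same release. A mismatch pulls two copies of the opentelemetry crate into the build and produces baffling trait-bound errors. I pin both carefully.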

Tracing Subscriber Setup

The subscriber is what tracing events flow into. In production I want JSON to stdout plus an OTel exporter. In development I want pretty-printed text. One init function handles both.

use opentelemetry::trace::TracerProvider as _;
use opentelemetry::KeyValue;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace as sdktrace, Resource};
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::EnvFilter;

pub fn init(service: &str) -> anyhow::Result<()> {
    let endpoint = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:4317".into());

    let exporter = opentelemetry_otlp::new_exporter().tonic().with_endpoint(endpoint);
    let tracer_provider = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(exporter)
        .with_trace_config(
            sdktrace::config().with_resource(Resource::new(vec![
                KeyValue::new("service.name", service.to_string()),
                KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
            ])),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    let tracer = tracer_provider.tracer(service.to_string());
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);

    let env_filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,tower_http=debug,axum=debug"));

    let fmt_layer = if std::env::var("LOG_FORMAT").as_deref() == Ok("json") {
        tracing_subscriber::fmt::layer().json().with_current_span(false).boxed()
    } else {
        tracing_subscriber::fmt::layer().pretty().boxed()
    };

    tracing_subscriber::registry()
        .with(env_filter)
        .with(fmt_layer)
        .with(otel_layer)
        .init();

    Ok(())
}

Two things to call out. First, the OTel layer is added to the same subscriber as the fmt layer, so every info! and error! call made inside a span is also recorded as an event on that span in the trace. Second, install_batch exports spans from a background task, so the request path never blocks on the collector. Don’t use install_simple outside of tests.

Metrics with the metrics Crate

metrics is a facade — like log for logs. You call macros, and a recorder behind the scenes does the work. The Prometheus recorder is the one I use 95% of the time.

use metrics_exporter_prometheus::{Matcher, PrometheusBuilder};

pub fn init_metrics() -> anyhow::Result<()> {
    PrometheusBuilder::new()
        .set_buckets_for_metric(
            Matcher::Full("http_request_duration_seconds".into()),
            &[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
        )?
        .with_http_listener(([0, 0, 0, 0], 9090))
        .install()?;
    Ok(())
}

That exposes /metrics on port 9090 in Prometheus exposition format. From anywhere in your code:

use metrics::{counter, histogram, gauge};

counter!("http_requests_total", "route" => "/api/v1/users", "method" => "GET").increment(1);
histogram!("http_request_duration_seconds", "route" => "/api/v1/users").record(elapsed.as_secs_f64());
gauge!("worker_queue_depth").set(queue.len() as f64);
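
The bucket boundaries passed to set_buckets_for_metric are cumulative upper bounds: each observation is counted in every bucket whose bound is greater than or equal to it, plus an implicit +Inf bucket. A minimal std-only sketch of that counting rule follows; the function name cumulative_buckets is hypothetical and this is not how the exporter is implemented internally.

```rust
/// Count samples into cumulative Prometheus-style buckets.
/// `bounds` is assumed to be sorted ascending. The returned vector has one
/// extra trailing entry for the implicit `+Inf` bucket.
fn cumulative_buckets(bounds: &[f64], samples: &[f64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for &sample in samples {
        // Increment every bucket whose upper bound covers the sample;
        // this is what makes the buckets cumulative ("le" = less or equal).
        for (i, &upper) in bounds.iter().enumerate() {
            if sample <= upper {
                counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        counts[bounds.len()] += 1;
    }
    counts
}
```

A request slower than the largest bound lands only in +Inf, so quantile estimates beyond that bound are unreliable. Pick bounds that bracket the latencies you actually alert on.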

The big rule: keep label cardinality low. Per-user-id labels will explode your Prometheus storage and tank query performance. Use trace IDs for per-request correlation; use metric labels for things with a bounded set of values (routes, status code classes, regions).
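
One way to keep labels bounded is to normalize values before they reach the metrics macros. The sketch below is std-only and illustrative: status_class and route_label are hypothetical helper names, and the "numeric segment means ID" rule is an assumption that will not fit every URL scheme.

```rust
/// Collapse an HTTP status code into a small, fixed set of class labels.
fn status_class(status: u16) -> &'static str {
    match status {
        100..=199 => "1xx",
        200..=299 => "2xx",
        300..=399 => "3xx",
        400..=499 => "4xx",
        500..=599 => "5xx",
        _ => "other",
    }
}

/// Replace purely numeric path segments with a placeholder so that
/// `/api/v1/users/42` and `/api/v1/users/43` share one label value.
fn route_label(path: &str) -> String {
    path.split('/')
        .map(|segment| {
            let is_numeric =
                !segment.is_empty() && segment.chars().all(|c| c.is_ascii_digit());
            if is_numeric { ":id" } else { segment }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

In axum, the matched route template is also available through the MatchedPath extractor, which is usually a more reliable source for a route label than rewriting the raw path.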

Connecting Logs to Traces

The thing that makes 3am debugging tolerable: every log line carries the trace ID of the request it’s part of. tracing-opentelemetry gives every tracing span an OTel span context, but the fmt layer won’t print the trace ID on its own. You need to open a span per request, extract the trace context from inbound headers, and record the trace ID as a span field so it rides along on every log line inside that span.

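use axum::extract::Request;
use axum::http::HeaderMap;
use axum::middleware::Next;
use axum::response::Response;
use opentelemetry::propagation::Extractor;
use opentelemetry::trace::TraceContextExt;
use tracing::{info_span, Instrument};
use tracing_opentelemetry::OpenTelemetrySpanExt;

// Small adapter so the OTel propagator can read an http HeaderMap.
struct HeaderExtractor<'a>(&'a HeaderMap);

impl<'a> Extractor for HeaderExtractor<'a> {
    fn get(&self, key: &str) -> Option<&str> {
        self.0.get(key).and_then(|v| v.to_str().ok())
    }
    fn keys(&self) -> Vec<&str> {
        self.0.keys().map(|k| k.as_str()).collect()
    }
}

// In axum 0.7, `Next` and the middleware `Request` are no longer generic over the body.
pub async fn trace_request(req: Request, next: Next) -> Response {
    // Pull the upstream trace context (if any) out of the inbound headers.
    let parent_ctx = opentelemetry::global::get_text_map_propagator(|p| {
        p.extract(&HeaderExtractor(req.headers()))
    });
    let span = info_span!(
        "http_request",
        method  = %req.method(),
        route   = %req.uri().path(),
        trace_id = tracing::field::Empty,
    );
    span.set_parent(parent_ctx);
    // Record the trace ID as a span field so the fmt layer prints it on every log line.
    let trace_id = span.context().span().span_context().trace_id().to_string();
    span.record("trace_id", tracing::field::display(&trace_id));

    next.run(req).instrument(span).await
}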

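HeaderExtractor is a small adapter that lets the OTel propagator read from axum::http::HeaderMap. The OpenTelemetry docs at opentelemetry.io have the propagator boilerplate; copy-paste it once and forget about it. Note that extraction only does anything if a propagator was registered at startup with opentelemetry::global::set_text_map_propagator (typically TraceContextPropagator); the default propagator is a no-op.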
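
For orientation, this is roughly what a W3C Trace Context propagator looks for in those headers: a traceparent value of the form version-traceid-parentid-flags. The std-only sketch below parses version 00 only. TraceParent and parse_traceparent are hypothetical names, and in the real service the propagator does this for you.

```rust
/// The fields carried by a version-00 `traceparent` header.
#[derive(Debug, PartialEq, Eq)]
struct TraceParent {
    trace_id: String,       // 32 lowercase hex characters
    parent_span_id: String, // 16 lowercase hex characters
    sampled: bool,          // lowest bit of the flags byte
}

/// Parse a `traceparent` header value, returning None if it is malformed.
/// Simplification: only version "00" is accepted.
fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, span_id, flags) = (parts[0], parts[1], parts[2], parts[3]);

    let is_lower_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    if version != "00"
        || !is_lower_hex(trace_id, 32)
        || !is_lower_hex(span_id, 16)
        || !is_lower_hex(flags, 2)
    {
        return None;
    }
    // All-zero trace and span IDs are defined as invalid.
    if trace_id.bytes().all(|b| b == b'0') || span_id.bytes().all(|b| b == b'0') {
        return None;
    }

    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_span_id: span_id.to_string(),
        sampled: flags & 0x01 == 0x01,
    })
}
```

If an upstream service sends no traceparent, or sends a malformed one, the request simply starts a new trace, which is why a missing propagator tends to show up as many disconnected single-service traces.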

Now every info!, warn!, error! inside a request handler will be tagged with that trace_id field. Your log aggregator can hyperlink straight to the trace in your tracing backend.

Errors Deserve Special Care

Two mistakes I see constantly. First, logging an error and then returning it: each caller up the stack logs it again, and the same error shows up two or three times in your dashboard. Pick one log site per error, usually the boundary where it becomes an HTTP response.

Second, swallowing context. ? is great until you propagate an error six layers up and lose all the locals that explain what happened. anyhow::Context fixes this cheaply.

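use anyhow::Context;
use sqlx::PgPool;

// Assumes a Postgres pool and a `User` struct whose fields match the columns.
async fn load_user(pool: &PgPool, id: i64) -> anyhow::Result<User> {
    // query_as! maps the row straight into `User`; with_context attaches the
    // id so it survives however far the error propagates.
    sqlx::query_as!(User, "SELECT * FROM users WHERE id = $1", id)
        .fetch_one(pool)
        .await
        .with_context(|| format!("load_user failed for id={id}"))
}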

When that error bubbles up, the error! at the HTTP boundary logs the full chain with all the IDs, provided you format it with {:#} or {:?}; plain {} prints only the outermost context. Searchable, debuggable.
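
To make "the full chain" concrete, here is a std-only sketch that walks an error’s source() links and joins the messages, similar in spirit to what anyhow prints for its alternate {:#} format. DbError, LoadUserError, and render_chain are hypothetical names used only for illustration.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical low-level failure, standing in for a database driver error.
#[derive(Debug)]
struct DbError;

impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connection refused")
    }
}

impl Error for DbError {}

/// Hypothetical higher-level error that adds the request-specific context.
#[derive(Debug)]
struct LoadUserError {
    id: i64,
    source: DbError,
}

impl fmt::Display for LoadUserError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "load_user failed for id={}", self.id)
    }
}

impl Error for LoadUserError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Join an error and all of its causes into a single log-friendly line.
fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

Logging only the outermost message would give you "load_user failed for id=42" with no hint of the cause; logging only the innermost would give "connection refused" with no hint of which request hit it. The chain gives you both in one searchable line.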

A Health Endpoint Worth Having

Two endpoints, not one. /healthz is liveness — is the process alive? Always returns 200 unless the process is wedged. /readyz is readiness — is the process able to serve traffic? Returns 503 if a critical dependency is down.

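use axum::extract::State;
use axum::http::StatusCode;
use axum::response::{IntoResponse, Response};

// Assumes AppState holds an sqlx pool in a `db` field.
async fn readyz(State(state): State<AppState>) -> Response {
    // A trivial round trip proves the pool can still reach the database.
    let db_ok = sqlx::query("SELECT 1").fetch_one(&state.db).await.is_ok();
    if db_ok {
        (StatusCode::OK, "ok").into_response()
    } else {
        (StatusCode::SERVICE_UNAVAILABLE, "db down").into_response()
    }
}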

Kubernetes uses these to decide when to restart a pod vs. when to take it out of the load-balancer pool. Don’t conflate them.

Common Pitfalls

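Sampling away errors. If your tracer head-samples at 1%, the decision is made before the request’s outcome is known, so you’ll lose roughly 99% of the traces for 500-class errors. Use tail sampling (for example in the OpenTelemetry Collector), which decides after the trace completes and can keep every trace that contains an error.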

Forgetting to flush on shutdown. opentelemetry::global::shutdown_tracer_provider() must be called before process exit, or batched spans get dropped. Wire it into your signal handler.

Cardinality explosions. Tagging a Prometheus metric with user_id looks fine on day one and kills Prometheus on day 30. Audit labels in code review.

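Mixing log and tracing. Many crates still emit log records. The tracing-log bridge forwards them to your subscriber; with tracing-subscriber’s default features, calling .init() installs it for you, so you only need to wire it by hand if you disabled default features or set the global subscriber some other way.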

Async work outside a span. tokio::spawn without .instrument(Span::current()) loses the trace context. Always attach a span to spawned futures you care about tracing.
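
Going back to the sampling pitfall: the core of a tail-sampling policy is a decision made after the trace has finished, when its error status is known. A minimal std-only sketch of such a policy follows. FinishedTrace and keep_trace are hypothetical names, and in practice this logic usually lives in a collector rather than in the service itself.

```rust
/// Hypothetical summary of a completed trace, as a tail sampler might see it.
struct FinishedTrace {
    trace_id: u128,
    has_error: bool,
}

/// Decide whether to keep a finished trace.
/// Error traces are always kept; the rest are kept at `sample_ratio`,
/// decided deterministically from the low 64 bits of the trace ID so the
/// same trace always gets the same answer.
fn keep_trace(trace: &FinishedTrace, sample_ratio: f64) -> bool {
    if trace.has_error {
        return true;
    }
    let ratio = sample_ratio.clamp(0.0, 1.0);
    if ratio >= 1.0 {
        return true;
    }
    let threshold = (ratio * u64::MAX as f64) as u64;
    (trace.trace_id as u64) < threshold
}
```

With sample_ratio set to 0.01 this still drops about 99% of healthy traffic, but every trace that contains an error survives, which is the property you want at 3am.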

Wrapping Up

Observability is mostly setup work, paid once. Get the subscriber, metrics recorder, and OTel pipeline wired correctly at startup, and the rest of your service code just calls info! and counter! and histogram! and Things Work. When something does go wrong at 3am, you’ll spend ten minutes finding the cause instead of an hour.

The Rust ecosystem here is as good as any other language’s. The hard part is discipline about labels, error boundaries, and shutdown — not the tools.