LLM Observability in Practice, Logs, Traces, and a Useful Dashboard


November 22, 2023 · 6 min read · by Muhammad Amal

TL;DR

Treat LLM calls like any other RPC: instrument them with traces, log structured events, and alert on latency and error rates.
The LLM-specific things to watch are token usage per call, retrieval recall, and refusal and fallback rates.
You will not catch quality regressions from infrastructure metrics alone. You need the eval pipeline running too.

This morning Sam Altman was reinstated as OpenAI CEO after the most chaotic five days the AI industry has had. If you run anything on the OpenAI API in production, you spent the weekend updating contingency runbooks, evaluating Anthropic and Azure OpenAI as failovers, and answering questions from executives about vendor risk.

I want to write about something more durable: what you should be monitoring in your LLM application regardless of which provider you use. Because when the platform under you wobbles, your observability is what tells you whether your users felt it.

The Layers to Instrument

A production RAG application has at least four layers worth tracing:

The user-facing request handler (HTTP or Slack or whatever)
The retriever (embedding generation, vector search, reranking)
The LLM call (the chat completion or assistants run)
Any downstream tools the LLM invokes

Each layer has its own failure modes. Instrument them as separate spans. OpenTelemetry is fine for this; the libraries are mature, and Datadog, Honeycomb, and Jaeger all consume OTel data cleanly.

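# opentelemetry-api==1.21.0, opentelemetry-sdk==1.21.0, openai>=1.0
from opentelemetry import trace
from openai import OpenAI

tracer = trace.get_tracer(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `retriever` and `build_messages` are assumed to be defined elsewhere in the app.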

def answer_question(user_id: str, question: str) -> str:
    with tracer.start_as_current_span("rag.answer") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("question.length", len(question))

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            nodes = retriever.retrieve(question)
            retrieve_span.set_attribute("retrieve.node_count", len(nodes))
            retrieve_span.set_attribute("retrieve.top_score", nodes[0].score if nodes else 0)

        with tracer.start_as_current_span("rag.generate") as gen_span:
            response = client.chat.completions.create(
                model="gpt-4-1106-preview",
                messages=build_messages(question, nodes),
                temperature=0.0,
            )
            usage = response.usage
            gen_span.set_attribute("llm.model", "gpt-4-1106-preview")
            gen_span.set_attribute("llm.prompt_tokens", usage.prompt_tokens)
            gen_span.set_attribute("llm.completion_tokens", usage.completion_tokens)
            gen_span.set_attribute("llm.total_tokens", usage.total_tokens)

        return response.choices[0].message.content

The attributes matter. If you only span the boundaries without tagging them, you have shapes without information.

The Metrics That Matter

Beyond the standard latency and error rate, the LLM-specific metrics I keep on the dashboard:

Tokens per request, by model. Histogram. The p99 catches the cases where retrieval bloated the context. The mean tracks cost trends. Tag by model so when you migrate (see my GPT-4 Turbo migration post) you can compare before and after.

End-to-end latency, broken out by phase. Retrieve, generate, tool calls. Generate dominates for most workloads, but I’ve seen retrieval blow out at p99 when the vector index hit memory pressure.

Retrieval top-score distribution. The cosine similarity (or RRF score) of the top retrieved chunk. When this drifts down over time, your corpus and your queries have diverged. Worth knowing before users complain.

Refusal rate. How often does the model say “I don’t have enough information”? If this is climbing, something changed — either content, or query distribution, or a prompt regression.

Fallback rate. If you have a fallback model (gpt-3.5 when gpt-4 fails or is rate-limited), track how often you use it. Spikes correlate with provider incidents.

Cost per request. Compute it from token usage and current price table. Surface a running monthly total. Surprise bills are an avoidable problem.
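As a rough sketch of the cost-per-request calculation, the function below multiplies token counts by a per-model price table. The table is an assumption: the prices are illustrative per-1K-token figures hard-coded for the example, and the names PRICES_PER_1K and request_cost_usd are hypothetical. Check your provider's current pricing before trusting the numbers.

```python
from decimal import Decimal

# Assumed USD prices per 1K tokens as (prompt, completion). Illustrative only;
# keep this table in config and update it when the provider changes pricing.
PRICES_PER_1K = {
    "gpt-4-1106-preview": (Decimal("0.01"), Decimal("0.03")),
    "gpt-3.5-turbo-1106": (Decimal("0.001"), Decimal("0.002")),
}


def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Return the estimated cost of one call in USD."""
    try:
        prompt_price, completion_price = PRICES_PER_1K[model]
    except KeyError:
        # Fail loudly so a newly added model cannot silently cost "nothing".
        raise ValueError(f"no price configured for model {model!r}") from None
    total = Decimal(prompt_tokens) * prompt_price + Decimal(completion_tokens) * completion_price
    return total / Decimal(1000)
```

Decimal avoids float rounding drift when you sum many small per-request costs into a monthly total. The result can be attached to the generate span as one more attribute next to the token counts.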

Logging the Full Trace, Carefully

You want enough logging to reproduce any answer the bot gave. You also want to not leak secrets or PII into your logging pipeline.

import structlog

log = structlog.get_logger()

# `hash_question` is assumed to be defined elsewhere (for example, a keyed SHA-256 digest).

def log_rag_call(user_id, question, retrieved_node_ids, response, usage, latency_ms):
    log.info(
        "rag.call",
        user_id=user_id,
        question_hash=hash_question(question),
        question_length=len(question),
        retrieved_node_ids=retrieved_node_ids,
        response_length=len(response),
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        latency_ms=latency_ms,
        model="gpt-4-1106-preview",
    )

For audit purposes (see the security post) you may need to log the full question and response too. Route that to a separate, more restricted log stream with shorter retention and stricter access controls. Don’t co-mingle audit logs with operational logs.

Hash the question content for operational logs. If a regression hits and you need to investigate, you can match the hash to the audit log entry without exposing user content in the dashboard surface.
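The post does not show hash_question, so here is one possible version, offered as a sketch rather than the implementation behind the logging code above. It assumes a keyed hash (HMAC-SHA256) so that short, guessable questions cannot be recovered by hashing candidate strings. The environment variable QUESTION_HASH_KEY, the constant HASH_KEY, and the normalization step are all assumptions.

```python
import hashlib
import hmac
import os

# Hypothetical key source. In production, load this from a secret manager
# and never fall back to a hard-coded default.
HASH_KEY = os.environ.get("QUESTION_HASH_KEY", "dev-only-key").encode("utf-8")


def hash_question(question: str, key: bytes = HASH_KEY) -> str:
    """Return a keyed SHA-256 hex digest of the normalized question text."""
    # Collapse whitespace and lowercase so trivially different copies of the
    # same question map to the same digest.
    normalized = " ".join(question.split()).lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

The audit stream has to store the same digest, computed with the same key and normalization, or the two logs cannot be joined during an investigation.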

Alerts Worth Paging On

The principle: alert on user impact, not on infrastructure. Most LLM infra metrics are non-actionable as alerts because the action is “wait for the provider” or “investigate later.”

5xx rate above 2% over 5 minutes. Page. Something is broken.

p95 latency above 12 seconds for 10 minutes. Page. Users are timing out.

Fallback rate above 25% for 10 minutes. Page. Primary provider is degraded.

Daily spend above 1.5x the 30-day moving average. Notify (not page). Could be legitimate traffic, could be a runaway loop, could be abuse.

Refusal rate above 40% over 1 hour. Notify. Quality may have regressed.

Retrieval top-score weekly average drops more than 10% week-over-week. Notify. Investigate corpus drift.

That’s six alerts. You should have a few more for the specifics of your stack, but the list should fit on a small page. More than that and people start ignoring them.
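Most metrics backends express these rules in their own alerting language, but the logic behind a rate alert is small enough to sketch. The class below tracks the fraction of flagged events (fallbacks, refusals, 5xx responses) over a trailing window. It reads "above 25% for 10 minutes" as "above 25% across the trailing 10 minutes", which is a simplification. RollingRate, breached, and the min_events guard are hypothetical names and choices, not something from the stack described here.

```python
import time
from collections import deque
from typing import Optional


class RollingRate:
    """Fraction of flagged events within the last `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._events = deque()  # (timestamp, flagged) pairs, oldest first
        self._flagged = 0

    def record(self, flagged: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._events.append((now, flagged))
        self._flagged += int(flagged)
        self._evict(now)

    def rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        return self._flagged / len(self._events) if self._events else 0.0

    def breached(self, threshold: float, min_events: int = 20,
                 now: Optional[float] = None) -> bool:
        current = self.rate(now)  # also drops events older than the window
        # Require a minimum sample so one failure at 3 a.m. does not page anyone.
        return len(self._events) >= min_events and current > threshold

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_s
        while self._events and self._events[0][0] < cutoff:
            _, flagged = self._events.popleft()
            self._flagged -= int(flagged)
```

For the fallback alert you would create RollingRate(600), call record(used_fallback) after every request, and page when breached(0.25) returns True.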

Provider Status as Part of Your Stack

Last weekend’s events were a reminder: your provider’s status page is part of your observability surface. The OpenAI status page is published as a structured feed. Scrape it. Surface it on your dashboard next to your own metrics.

If you see your error rate climb and the provider status is green, the bug is yours. If both are red, you’re waiting. Knowing which it is in the first 30 seconds of an incident saves you 30 minutes of misdirected investigation.

I now run a simple poller against the OpenAI status page and emit a metric. Same for Anthropic. It’s low-fidelity but it’s something.
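The poller itself is not shown in this post, so the following is only a sketch of the idea. It assumes the provider exposes a Statuspage-style JSON summary with a status.indicator field. The URL is an assumption to verify for each provider, and parse_status, poll_status, and INDICATOR_LEVELS are hypothetical names.

```python
import json
import urllib.request

# Assumed endpoint following the Statuspage v2 convention. Verify the URL and
# payload shape for each provider before relying on it.
STATUS_URL = "https://status.openai.com/api/v2/status.json"

# Statuspage "indicator" values mapped to a numeric gauge for dashboards.
INDICATOR_LEVELS = {"none": 0, "minor": 1, "major": 2, "critical": 3}


def parse_status(payload: object) -> int:
    """Map a Statuspage-style payload to a severity level; -1 if unrecognized."""
    status = payload.get("status") if isinstance(payload, dict) else None
    indicator = status.get("indicator") if isinstance(status, dict) else None
    if not isinstance(indicator, str):
        return -1
    return INDICATOR_LEVELS.get(indicator, -1)


def poll_status(url: str = STATUS_URL, timeout_s: float = 5.0) -> int:
    """Fetch the status document and return its severity level, or -1 on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return parse_status(json.load(resp))
    except (OSError, ValueError):
        # Network errors and malformed JSON both count as "unknown".
        return -1
```

Emitting -1 for "unknown" rather than 0 keeps a broken poller from looking like a healthy provider on the dashboard.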

Common Pitfalls

Logging full prompts to your standard log stream. Prompts contain retrieved chunks, which may contain PII, which now lives in your log retention.

Not tagging logs by model version. When you migrate from gpt-4 to gpt-4-turbo, you want to compare token usage and latency cleanly. Untagged logs make this painful.

Treating LLM latency like database latency. P99 of 15 seconds is normal for long-context generation. Setting an alert at 5 seconds will page you forever.

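Sampling traces. Sampling at low rates means you lose the traces for the rare interesting cases. The OpenTelemetry SDK's default sampler keeps every trace unless a parent span says otherwise, so the loss usually comes from a ratio sampler someone configured or from sampling in a collector or vendor agent. For LLM workloads at modest QPS, sample at 100% until cost forces otherwise.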

Trusting a single metric for “quality.” Refusal rate is a useful signal but it’s not quality. Faithfulness and correctness come from the eval pipeline, not from the production metrics.

Ignoring the embedding API. Your embedding API is its own dependency with its own latency and rate limits. Trace it as a separate span. When users hit rate limits at the embedding step, the symptom looks like a retrieval bug.

What’s Next

I want to write about the agent layer next. Function calling, tool use, the cases where the LLM acts on its outputs. That’s where observability stops being optional and becomes the only thing keeping you sane.

Wrapping Up

Observability for LLM systems is mostly the same as observability for any other distributed system. The pieces that are different — token accounting, retrieval health, refusal rates, provider status — are not exotic. You just have to add them deliberately. Once you do, you’ll spend less time guessing whether something is wrong and more time fixing what is.

Last weekend’s industry chaos will fade. The discipline of knowing the state of your own system will keep paying off.
