Why traditional monitoring falls short
Traditional application monitoring answers: "Is the service up? Is it fast enough?" For LLM applications, the service can be up, fast, and returning 200 status codes — while producing outputs that are completely wrong, hallucinated, or harmful.
LLM observability needs to answer a harder question: "Is the output good?" — and it needs to answer it continuously, at scale, without a human reviewing every response.
The four layers of LLM observability
Layer 1 — Operational metrics
These are the basics that any production service needs:
- Latency — Time to first token, total generation time, end-to-end response time
- Throughput — Requests per second, tokens per second
- Error rates — API failures, timeouts, rate limits
- Availability — Uptime of inference endpoints, retrieval systems, and supporting services
If you're using a managed API, most of these are available through the provider's dashboard. For self-hosted models, you'll need to instrument your inference server.
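For self-hosted instrumentation, the key streaming metrics (time to first token, total generation time, tokens per second) can be captured with a thin wrapper around the token iterator. A minimal sketch, assuming Python; `fake_stream` is a hypothetical stand-in for a real streaming LLM response:

```python
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    time_to_first_token: float  # seconds until the first token arrives
    total_time: float           # seconds for the whole generation
    tokens: int                 # output tokens observed

def measure_stream(token_stream):
    """Consume a token iterator, recording the Layer 1 timing metrics."""
    start = time.monotonic()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.monotonic() - start  # time to first token
        count += 1
    total = time.monotonic() - start
    return RequestMetrics(first if first is not None else total, total, count)

# Hypothetical stand-in for a real streaming response from your model server.
def fake_stream():
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

metrics = measure_stream(fake_stream())
```

In a real deployment these values would be emitted to your metrics backend per request rather than returned to the caller.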
Layer 2 — Cost metrics
LLM costs can spike unpredictably. Track:
- Token consumption — Input and output tokens per request, per user, per feature
- Cost per request — Broken down by model, by task type
- Budget burn rate — Current spend vs. projected monthly spend
- Anomaly detection — Alert on requests that consume unusually high token counts (often a sign of runaway agent loops or prompt injection)
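Cost attribution and a basic anomaly flag can be sketched as below. The price table is illustrative only; real per-token prices vary by provider and model, and the 5× factor mirrors the anomaly heuristic above:

```python
# Illustrative per-1K-token prices (input, output); NOT real provider pricing.
PRICES = {"small-model": (0.0005, 0.0015), "large-model": (0.005, 0.015)}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one request from its token counts and the model's price tier."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

def is_token_anomaly(tokens, recent_counts, factor=5.0):
    """Flag a request whose token count exceeds factor x the recent mean.
    Spikes like this often indicate runaway agent loops or prompt injection."""
    if not recent_counts:
        return False
    return tokens > factor * (sum(recent_counts) / len(recent_counts))
```

Breaking `request_cost` down by user and feature (not just by model) is what makes budget burn-rate dashboards actionable.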
Layer 3 — Quality metrics
This is where LLM observability diverges from traditional monitoring. Quality metrics attempt to measure whether the model's output is actually good:
- Format compliance — Does the output match the expected schema or structure?
- Safety checks — Does the output contain harmful, biased, or policy-violating content?
- Factual consistency — For RAG systems, does the output align with the retrieved sources?
- Task completion — For agents, did the workflow complete successfully?
Most of these require either rule-based checks (for format and safety) or a secondary model call (for factual consistency and quality scoring). The secondary model call adds cost but is essential for meaningful quality monitoring.
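A rule-based format-compliance check is the cheapest quality metric to start with. A minimal sketch, assuming the application expects JSON output with an `answer` and `sources` field (both hypothetical; substitute your own schema):

```python
import json

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical expected schema

def check_format(raw_output: str):
    """Rule-based format compliance: is the output valid JSON with the
    required keys? Returns (passed, reason)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

The failure reason is worth logging alongside the pass/fail flag, since the distribution of failure modes (invalid JSON vs. missing fields) points at different fixes.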
Layer 4 — User feedback signals
The ultimate quality metric is whether users are satisfied. Track:
- Explicit feedback — Thumbs up/down, ratings, correction submissions
- Implicit signals — Retry rates, time spent with the output, follow-up questions that suggest the first answer was inadequate
- Escalation rates — How often users fall back to human support or manual processes
These signals are noisy individually but powerful in aggregate. A sudden spike in retry rates or a drop in thumbs-up rates is often the first signal of a quality regression.
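Aggregating one of these signals can be as simple as a rolling window over recent feedback events. A sketch, assuming a fixed baseline thumbs-up rate (the 0.8 baseline and 10-point drop threshold are illustrative, not recommendations):

```python
from collections import deque

class FeedbackMonitor:
    """Track a rolling thumbs-up rate and flag drops against a baseline."""

    def __init__(self, window=100, baseline=0.8, drop_threshold=0.10):
        self.events = deque(maxlen=window)  # 1 = thumbs up, 0 = thumbs down
        self.baseline = baseline
        self.drop_threshold = drop_threshold

    def record(self, thumbs_up: bool):
        self.events.append(1 if thumbs_up else 0)

    def rate(self):
        return sum(self.events) / len(self.events) if self.events else None

    def regressed(self):
        """True when the rolling rate has dropped more than the threshold
        below baseline, i.e. a likely quality regression."""
        r = self.rate()
        return r is not None and (self.baseline - r) > self.drop_threshold
```

The same pattern applies to retry rates and escalation rates; the window size controls how quickly a regression surfaces versus how noisy the signal is.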
Building your tracing strategy
For multi-step systems (RAG, agents, chains), per-request tracing is essential for debugging. A good trace captures:
- The full input (prompt, system message, context)
- Each intermediate step (retrieval results, tool calls, intermediate outputs)
- The final output
- Timing information for each step
- Token counts and cost attribution per step
This lets you answer the question that comes up in every production incident: "Why did the model produce this output for this input?"
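A trace with those fields can be modeled as a pair of simple records, one per step and one per request. A minimal sketch (field names are illustrative; dedicated tracing tools define their own schemas):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One intermediate step: retrieval, a tool call, or a generation."""
    name: str
    input: str
    output: str
    duration_s: float   # timing per step
    tokens: int = 0     # token count per step
    cost: float = 0.0   # cost attribution per step

@dataclass
class Trace:
    """One end-to-end request: full input, each step, and the final output."""
    request_id: str
    prompt: str
    steps: list = field(default_factory=list)
    final_output: str = ""

    def add_step(self, step: Step):
        self.steps.append(step)

    def total_cost(self):
        return sum(s.cost for s in self.steps)
```

Persisting the whole `Trace` object per request is what makes the incident question answerable: you can replay exactly what the model saw at each step.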
Alerting strategy
Not every metric deserves an alert. A practical alerting hierarchy:
- P1 (page immediately): Service down, error rate above 10%, cost anomaly above 5× daily average
- P2 (alert in Slack): Latency above SLA, quality score drop above 10%, safety check failure rate above threshold
- P3 (review daily): Token consumption trends, user feedback trends, retrieval recall metrics
The most important alert is the one for quality regressions — and it's the hardest to get right because "quality" is inherently fuzzy. Start with simple checks (format compliance, length bounds) and add sophistication over time.
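The hierarchy above can be encoded as a simple triage function. A sketch with the thresholds from the text hard-coded for illustration (a service-down signal would also route to P1; tune everything for your own SLAs):

```python
def triage(error_rate, cost_vs_daily_avg, latency_over_sla, quality_drop):
    """Map metric conditions to the P1/P2/P3 hierarchy.
    Thresholds mirror the text: 10% error rate and 5x cost for P1,
    SLA breach or a >10% quality-score drop for P2; everything else P3."""
    if error_rate > 0.10 or cost_vs_daily_avg > 5.0:
        return "P1"  # page immediately
    if latency_over_sla or quality_drop > 0.10:
        return "P2"  # alert in Slack
    return "P3"      # review daily
```

Keeping the routing in one function makes the thresholds easy to audit and adjust as your baseline quality checks mature.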
Tool choices in 2026
The LLM observability market has consolidated around a few approaches:
- Dedicated LLM platforms — Tools like LangSmith, Langfuse, Braintrust, and Arize focus specifically on LLM monitoring. They offer the best out-of-the-box experience for tracing and evaluation.
- Extended APM tools — Traditional monitoring platforms (Datadog, New Relic) have added LLM-specific features. Good if you already use them and want a unified dashboard.
- Custom solutions — Built on top of standard logging (structured logs to your existing stack) with custom dashboards. More work but maximum flexibility.
For most teams, the right starting point is a dedicated LLM platform for tracing and evaluation, combined with their existing APM for operational metrics.
The goal of observability isn't to collect data. It's to make problems visible before they become incidents. Invest in dashboards and alerts, not just logging.