Why traditional monitoring falls short
Traditional application monitoring answers: "Is the service up? Is it fast enough?" For LLM applications, the service can be up, fast, and returning 200 status codes — while producing outputs that are completely wrong, hallucinated, or harmful.
LLM observability needs to answer a harder question: "Is the output good?" — and it needs to answer it continuously, at scale, without a human reviewing every response.
The four layers of LLM observability
Layer 1 — Operational metrics
These are the basics that any production service needs:
- Latency — Time to first token, total generation time, end-to-end response time
- Throughput — Requests per second, tokens per second
- Error rates — API failures, timeouts, rate limits
- Availability — Uptime of inference endpoints, retrieval systems, and supporting services
If you're using a managed API, most of these are available through the provider's dashboard. For self-hosted models, you'll need to instrument your inference server.
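For self-hosted instrumentation, the key streaming metrics (time to first token, total generation time, tokens per second) can be captured with a thin wrapper around the token iterator. A minimal sketch, assuming Python; `fake_stream` is a hypothetical stand-in for a real streaming LLM response:

```python
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    time_to_first_token: float  # seconds until the first token arrives
    total_time: float           # seconds for the whole generation
    tokens: int                 # output tokens observed

def measure_stream(token_stream):
    """Consume a token iterator, recording the Layer 1 timing metrics."""
    start = time.monotonic()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.monotonic() - start  # time to first token
        count += 1
    total = time.monotonic() - start
    return RequestMetrics(first if first is not None else total, total, count)

# Hypothetical stand-in for a real streaming response from your model server.
def fake_stream():
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

metrics = measure_stream(fake_stream())
```

In a real deployment these values would be emitted to your metrics backend per request rather than returned to the caller.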
Layer 2 — Cost metrics
LLM costs can spike unpredictably. Track:
- Token consumption — Input and output tokens per request, per user, per feature
- Cost per request — Broken down by model, by task type
- Budget burn rate — Current spend vs. projected monthly spend
- Anomaly detection — Alert on requests that consume unusually high token counts (often a sign of runaway agent loops or prompt injection)
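Cost attribution and a basic anomaly flag can be sketched as below. The price table is illustrative only; real per-token prices vary by provider and model, and the 5× factor mirrors the anomaly heuristic above:

```python
# Illustrative per-1K-token prices (input, output); NOT real provider pricing.
PRICES = {"small-model": (0.0005, 0.0015), "large-model": (0.005, 0.015)}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one request from its token counts and the model's price tier."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

def is_token_anomaly(tokens, recent_counts, factor=5.0):
    """Flag a request whose token count exceeds factor x the recent mean.
    Spikes like this often indicate runaway agent loops or prompt injection."""
    if not recent_counts:
        return False
    return tokens > factor * (sum(recent_counts) / len(recent_counts))
```

Breaking `request_cost` down by user and feature (not just by model) is what makes budget burn-rate dashboards actionable.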
Layer 3 — Quality metrics
This is where LLM observability diverges from traditional monitoring. Quality metrics attempt to measure whether the model's output is actually good:
- Format compliance — Does the output match the expected schema or structure?
- Safety checks — Does the output contain harmful, biased, or policy-violating content?
- Factual consistency — For RAG systems, does the output align with the retrieved sources?
- Task completion — For agents, did the workflow complete successfully?
Most of these require either rule-based checks (for format and safety) or a secondary model call (for factual consistency and quality scoring). The secondary model call adds cost but is essential for meaningful quality monitoring.
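A rule-based format-compliance check is the cheapest quality metric to start with. A minimal sketch, assuming the application expects JSON output with an `answer` and `sources` field (both hypothetical; substitute your own schema):

```python
import json

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical expected schema

def check_format(raw_output: str):
    """Rule-based format compliance: is the output valid JSON with the
    required keys? Returns (passed, reason)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

The failure reason is worth logging alongside the pass/fail flag, since the distribution of failure modes (invalid JSON vs. missing fields) points at different fixes.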
Layer 4 — User feedback signals
The ultimate quality metric is whether users are satisfied. Track:
- Explicit feedback — Thumbs up/down, ratings, correction submissions
- Implicit signals — Retry rates, time spent with the output, follow-up questions that suggest the first answer was inadequate
- Escalation rates — How often users fall back to human support or manual processes
These signals are noisy individually but powerful in aggregate. A sudden spike in retry rates or a drop in thumbs-up rates is often the first signal of a quality regression.
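Aggregating one of these signals can be as simple as a rolling window over recent feedback events. A sketch, assuming a fixed baseline thumbs-up rate (the 0.8 baseline and 10-point drop threshold are illustrative, not recommendations):

```python
from collections import deque

class FeedbackMonitor:
    """Track a rolling thumbs-up rate and flag drops against a baseline."""

    def __init__(self, window=100, baseline=0.8, drop_threshold=0.10):
        self.events = deque(maxlen=window)  # 1 = thumbs up, 0 = thumbs down
        self.baseline = baseline
        self.drop_threshold = drop_threshold

    def record(self, thumbs_up: bool):
        self.events.append(1 if thumbs_up else 0)

    def rate(self):
        return sum(self.events) / len(self.events) if self.events else None

    def regressed(self):
        """True when the rolling rate has dropped more than the threshold
        below baseline, i.e. a likely quality regression."""
        r = self.rate()
        return r is not None and (self.baseline - r) > self.drop_threshold
```

The same pattern applies to retry rates and escalation rates; the window size controls how quickly a regression surfaces versus how noisy the signal is.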
Building your tracing strategy
For multi-step systems (RAG, agents, chains), per-request tracing is essential for debugging. A good trace captures:
- The full input (prompt, system message, context)
- Each intermediate step (retrieval results, tool calls, intermediate outputs)
- The final output
- Timing information for each step
- Token counts and cost attribution per step
This lets you answer the question that comes up in every production incident: "Why did the model produce this output for this input?"
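A trace with those fields can be modeled as a pair of simple records, one per step and one per request. A minimal sketch (field names are illustrative; dedicated tracing tools define their own schemas):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One intermediate step: retrieval, a tool call, or a generation."""
    name: str
    input: str
    output: str
    duration_s: float   # timing per step
    tokens: int = 0     # token count per step
    cost: float = 0.0   # cost attribution per step

@dataclass
class Trace:
    """One end-to-end request: full input, each step, and the final output."""
    request_id: str
    prompt: str
    steps: list = field(default_factory=list)
    final_output: str = ""

    def add_step(self, step: Step):
        self.steps.append(step)

    def total_cost(self):
        return sum(s.cost for s in self.steps)
```

Persisting the whole `Trace` object per request is what makes the incident question answerable: you can replay exactly what the model saw at each step.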
Alerting strategy
Not every metric deserves an alert. A practical alerting hierarchy:
- P1 (page immediately): Service down, error rate above 10%, cost anomaly above 5× daily average
- P2 (alert in Slack): Latency above SLA, quality score drop above 10%, safety check failure rate above threshold
- P3 (review daily): Token consumption trends, user feedback trends, retrieval recall metrics
The most important alert is the one for quality regressions — and it's the hardest to get right because "quality" is inherently fuzzy. Start with simple checks (format compliance, length bounds) and add sophistication over time.
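The hierarchy above can be encoded as a simple triage function. A sketch with the thresholds from the text hard-coded for illustration (a service-down signal would also route to P1; tune everything for your own SLAs):

```python
def triage(error_rate, cost_vs_daily_avg, latency_over_sla, quality_drop):
    """Map metric conditions to the P1/P2/P3 hierarchy.
    Thresholds mirror the text: 10% error rate and 5x cost for P1,
    SLA breach or a >10% quality-score drop for P2; everything else P3."""
    if error_rate > 0.10 or cost_vs_daily_avg > 5.0:
        return "P1"  # page immediately
    if latency_over_sla or quality_drop > 0.10:
        return "P2"  # alert in Slack
    return "P3"      # review daily
```

Keeping the routing in one function makes the thresholds easy to audit and adjust as your baseline quality checks mature.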
Tool choices in 2026
The LLM observability market has consolidated around a few approaches:
- Dedicated LLM platforms — Tools like LangSmith, Langfuse, Braintrust, and Arize focus specifically on LLM monitoring. They offer the best out-of-the-box experience for tracing and evaluation.
- Extended APM tools — Traditional monitoring platforms (Datadog, New Relic) have added LLM-specific features. Good if you already use them and want a unified dashboard.
- Custom solutions — Built on top of standard logging (structured logs to your existing stack) with custom dashboards. More work but maximum flexibility.
For most teams, the right starting point is a dedicated LLM platform for tracing and evaluation, combined with their existing APM for operational metrics.
The goal of observability isn't to collect data. It's to make problems visible before they become incidents. Invest in dashboards and alerts, not just logging.