The bill nobody warns you about
When teams set up LLM observability, the conversation is almost always about what to monitor — token usage, latency, quality scores, error rates, prompt versions, retrieval recall. Whatever the chosen tool, it gets adopted, dashboards get built, and everyone agrees that the visibility is worth it. Then the next quarter's bill arrives, and somebody on the finance side asks an awkward question: why does our observability platform cost as much as our inference?
This isn't an exaggeration. I've seen teams where the LLM observability bill was 60% of the inference spend. I've seen one where it was higher than the inference spend. The reason is structural — LLM traffic generates an enormous amount of data per request, and most observability platforms charge by ingest volume, retention duration, or both. Without deliberate management, the cost compounds quickly and quietly.
The teams that keep this under control aren't the ones that monitor less. They're the ones that monitor intentionally, with a clear sense of which data is worth paying to store and which isn't.
Why LLM observability is unusually expensive
A traditional web service generates a few hundred bytes of structured logs per request: timestamp, endpoint, status code, latency, user ID, maybe a few custom fields. An LLM request, properly logged, generates orders of magnitude more:
- The full prompt — often several KB, sometimes much more for systems with heavy context
- The full response — KB to tens of KB depending on the use case
- Tool calls and results — variable, sometimes huge
- Retrieved documents — KB per chunk, often dozens of chunks per request
- Token-level metadata — counts, model used, finish reason, latency breakdown
- Trace structure — the call graph for agents, with intermediate steps
- Eval scores — produced asynchronously, joined back to the original request
Add it up and a single agent run can generate 50–200 KB of observability data. Multiply by a million requests a day and you're ingesting 50–200 GB daily. At the prices most observability vendors charge, that's serious money before anyone has even looked at a dashboard.
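To make the arithmetic concrete, here's a back-of-the-envelope sketch in Python. The per-GB ingest price is an illustrative assumption, not any particular vendor's rate; plug in your own.

```python
# Back-of-the-envelope ingest cost.
# PRICE_PER_GB_INGEST is an assumed, illustrative number, not a real vendor rate.
KB_PER_REQUEST = (50, 200)       # observability data per agent run (low / high end)
REQUESTS_PER_DAY = 1_000_000
PRICE_PER_GB_INGEST = 0.50       # assumed USD per GB ingested

for kb in KB_PER_REQUEST:
    gb_per_day = kb * REQUESTS_PER_DAY / 1_000_000          # KB -> GB (decimal)
    monthly_cost = gb_per_day * PRICE_PER_GB_INGEST * 30
    print(f"{kb} KB/request -> {gb_per_day:,.0f} GB/day, "
          f"~${monthly_cost:,.0f}/month ingest")
```

Even at this deliberately modest assumed price, the gap between the low and high end of per-request volume is a 4x difference in the monthly bill, which is why the levers below start with per-request volume rather than vendor negotiation.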
The four cost levers you actually have
Lever 1 — What to log at all
The most powerful lever, and the one most teams skip. Not every request needs full observability. A useful tiering strategy:
- Critical paths (production user-facing requests): full logging, every request
- Background and batch jobs: structured metrics only, no full payloads
- Internal tools and dev environments: sampled logging at 10% or less
- Failed requests: full logging always, regardless of tier
Just splitting your traffic this way and applying different retention policies typically cuts observability cost by 40–60%, with no meaningful loss of debugging capability.
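As a sketch, the tiering decision can live in a single function at the logging boundary. The tier rules and request fields below are assumptions for illustration, not any specific SDK's API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    environment: str      # e.g. "production", "staging", "dev" (assumed field)
    is_user_facing: bool  # assumed field
    failed: bool          # assumed field

def logging_policy(req: Request) -> dict:
    """Decide whether to keep full payloads and at what sample rate."""
    if req.failed:                        # failed requests: full logging, always
        return {"payloads": True, "sample_rate": 1.0}
    if req.environment != "production":   # internal tools / dev: sampled at 10%
        return {"payloads": True, "sample_rate": 0.10}
    if req.is_user_facing:                # critical path: every request
        return {"payloads": True, "sample_rate": 1.0}
    return {"payloads": False, "sample_rate": 1.0}   # background/batch: metrics only
```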
Lever 2 — Sampling
For high-volume workloads, full logging of every request is overkill. A 10% sample is usually enough to spot patterns, debug common issues, and run statistical analyses. The key is smart sampling — keep 100% of failed requests, 100% of slow requests, 100% of requests from a debug header, and a small random sample of everything else.
The sampling logic matters more than the sampling rate. A naive 10% random sample loses the rare, important requests that you actually need to debug. A rule-aware sample, one that always keeps the failures, the outliers, and the explicitly flagged traffic, keeps the volume down without losing the signal.
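A minimal sketch of that logic, assuming a hypothetical debug header name and a slow-request threshold you'd tune to your own latency profile:

```python
import random

SLOW_THRESHOLD_MS = 10_000   # assumed threshold; tune to your latency distribution
BASE_SAMPLE_RATE = 0.10      # small random sample of everything else

def should_keep_trace(status_ok: bool, latency_ms: float, headers: dict) -> bool:
    if not status_ok:                            # keep 100% of failed requests
        return True
    if latency_ms >= SLOW_THRESHOLD_MS:          # keep 100% of slow requests
        return True
    if headers.get("x-debug-trace") == "1":      # keep 100% of flagged requests
        return True
    return random.random() < BASE_SAMPLE_RATE    # sample the rest
```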
Lever 3 — What to keep, and for how long
Retention costs add up because data doesn't get cheaper to store as it ages — it just sits there. The teams managing this well use a tiered retention model:
- Hot tier (7 days): full payload access, fast queries, used for active debugging
- Warm tier (30 days): full payload but slower queries, used for trend analysis
- Cold tier (90+ days): aggregated metrics only, individual payloads dropped, used for long-term analytics
- Compliance tier (as required): hashed or redacted, kept only because regulation demands it
Most observability platforms offer tiered storage. Most teams use it badly — they keep everything in the hot tier "just in case." That insurance is expensive, and the actual probability of needing a 6-month-old payload is low enough that the insurance rarely pays off.
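Sketched as plain code rather than any particular platform's retention API, the age-to-tier mapping is only a few lines; compliance data is routed separately, by regulatory requirement rather than age:

```python
def retention_tier(age_days: int) -> str:
    """Map a record's age to a storage tier (compliance data handled separately)."""
    if age_days <= 7:
        return "hot"    # full payloads, fast queries, active debugging
    if age_days <= 30:
        return "warm"   # full payloads, slower queries, trend analysis
    return "cold"       # aggregated metrics only; individual payloads dropped
```

The design choice that matters is the last line: the cold tier drops payloads entirely rather than parking them in cheaper storage, because aggregates answer the long-term questions.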
Lever 4 — What to compress or hash
Some fields are valuable for analytics but don't need to be stored verbatim. The full text of retrieved documents, for example, can usually be replaced with a content hash and a reference to where the original lives. The full prompt template can be stored once and referenced by version ID rather than duplicated on every request. These substitutions shrink per-request storage by 60–80%, and because each record still points back to the original, little is lost for debugging.
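One way this might look in practice, with hypothetical logging helpers; the hash still lets you tell when two requests retrieved the same chunk, without storing the chunk text twice:

```python
import hashlib

def log_retrieved_chunk(chunk_text: str, source_uri: str) -> dict:
    """Store a fingerprint and a pointer to the original, not the full text."""
    return {
        "content_sha256": hashlib.sha256(chunk_text.encode("utf-8")).hexdigest(),
        "source_uri": source_uri,          # where the original document lives
        "length_chars": len(chunk_text),   # cheap field that still supports analytics
    }

def log_prompt(template_version: str, variables: dict) -> dict:
    """Reference the prompt template by version ID instead of duplicating it."""
    return {
        "template_version": template_version,  # e.g. "support-reply-v12" (hypothetical)
        "variables": variables,                # only the per-request values
    }
```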
The questions to ask before adding any new dimension
Every new field, label, or dimension that gets added to your observability pipeline costs money to ingest, store, and query. The question to ask before adding any of them:
- What decision will this data help us make? If you can't name a specific decision, you don't need the data.
- How often will we look at it? Data that's queried weekly is worth more than data that's queried once a year.
- Can we derive it from other fields when we need it? A field you can compute on-demand from existing data doesn't need to be stored independently.
- What's the cost of not having it? Sometimes the answer is "nothing" — and that's a sign the field isn't worth the cost of having it.
Most observability bills grow not because any single decision was wrong, but because every individual addition seemed reasonable at the time. The discipline is in pushing back on reasonable-looking additions until they prove they earn their place.
What observability is actually for
It's worth stating the obvious: observability is a cost, not a goal. The goal is to be able to debug production issues quickly, understand quality trends, catch regressions before users do, and improve the product over time. Observability data that contributes to those goals is valuable. Data that just sits in a database in case someone might need it someday is overhead.
The teams with the best LLM observability practices aren't the ones with the most data. They're the ones whose dashboards get used, whose alerts fire on real issues, and whose engineers can find the answer to a debugging question in under five minutes. That's the actual measure of whether the spend is justified.
The right observability bill isn't zero. It's whatever amount produces the visibility you actually use, and not a dollar more. Most teams are spending two to three times that amount and have no idea.
A pragmatic starting point
If your observability spend has crept up and you don't know how to bring it down, the highest-leverage first step is almost always the same: audit what you're logging that nobody ever queries. Pull a list of every dimension and field in your observability pipeline. For each one, check the last 30 days of query logs and count how many times it was actually used. Anything used zero times is a candidate for removal. Anything used once is a candidate for sampling or summarization. Anything used hundreds of times stays.
This audit usually surfaces 30–50% of the observability data as removable without complaint. The savings are immediate, and the debugging experience doesn't degrade because nobody was looking at that data anyway. It's the cheapest, fastest cost optimization most LLM teams have available, and one almost nobody runs until the bill forces them to.
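If your platform can export query history, the audit itself is a short script. The sketch below assumes a hypothetical export format of JSON lines with a `fields_referenced` list per query; adapt it to whatever your vendor actually provides.

```python
import json
from collections import Counter

def audit_field_usage(query_log_path: str, all_fields: set) -> None:
    """Count how often each logged field was referenced by queries in the last 30 days."""
    usage = Counter()
    with open(query_log_path) as f:
        for line in f:
            record = json.loads(line)                      # one query per line (assumed format)
            for field in record.get("fields_referenced", []):
                usage[field] += 1

    for field in sorted(all_fields, key=lambda name: usage[name]):
        count = usage[field]
        if count == 0:
            verdict = "candidate for removal"
        elif count == 1:
            verdict = "candidate for sampling or summarization"
        else:
            verdict = "keep"
        print(f"{field:40s} {count:6d} queries -> {verdict}")
```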