The iceberg problem
Ask a team how much their AI system costs and they'll quote their API bill. That number is real, but it's typically only 30–50% of the total. The rest is hidden in infrastructure, tooling, evaluation, and human time that never shows up on a single invoice.
This post breaks down every cost center in a production AI system so you can budget accurately and optimize where it actually matters.
Layer 1 — Inference costs
This is the obvious one: the cost of generating predictions or outputs from your model. The variables that drive inference cost are:
- Model size — Larger models cost more per token, often dramatically so
- Input length — Most providers charge per input token, and long context multiplies this
- Output length — Output tokens are typically 2–5× more expensive than input tokens
- Volume — Some providers offer volume discounts; others don't
For API-based deployments, this is a pure variable cost. For self-hosted models, it's mostly fixed: GPU lease or purchase costs amortized over the requests you actually serve, which makes utilization the lever that sets your effective per-request price.
The single most impactful cost optimization is often the simplest: reduce the length of your prompts. Cutting a system prompt from 2,000 tokens to 500 removes 1,500 input tokens (75% of that prompt's cost) from every single request.
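As a back-of-envelope check on that claim, here is a toy cost model. The per-token prices, traffic volume, and token counts are all illustrative assumptions, not any provider's actual rates:

```python
# Toy inference cost model. PRICE_IN / PRICE_OUT are illustrative
# placeholders, not real provider rates (output priced at 5x input).
PRICE_IN = 3.00 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.00 / 1_000_000  # assumed $ per output token

def monthly_cost(requests, input_tokens, output_tokens):
    """Estimate monthly spend for a given traffic profile."""
    return requests * (input_tokens * PRICE_IN + output_tokens * PRICE_OUT)

# 1M requests/month, 300 user tokens in and 400 tokens out per request,
# with a 2,000-token vs. a 500-token system prompt:
before = monthly_cost(1_000_000, 2_000 + 300, 400)
after = monthly_cost(1_000_000, 500 + 300, 400)
print(f"before: ${before:,.0f}/mo  after: ${after:,.0f}/mo")
# → before: $12,900/mo  after: $8,400/mo
```

The saving compounds with volume: the shorter prompt removes those 1,500 tokens from every one of the million requests.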
Layer 2 — Infrastructure beyond inference
Retrieval infrastructure
If you're using RAG, you need a vector database, embedding compute, and the infrastructure to keep both up to date. Costs include the vector store itself (which scales with the number of documents), the embedding API calls to populate it, and the compute for the retrieval layer.
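A quick sketch of the embedding side of that bill, with every number (corpus size, chunking overhead, refresh rate, price) an illustrative assumption:

```python
# Rough cost to embed a corpus for RAG. All figures here are
# illustrative assumptions: corpus size, chunk overlap, and price.
EMBED_PRICE = 0.10 / 1_000_000  # assumed $ per embedded token

docs = 200_000          # documents in the corpus
tokens_per_doc = 1_500  # average document length
overlap_factor = 1.2    # ~20% extra tokens from overlapping chunks
refresh_rate = 0.05     # fraction of the corpus re-embedded each month

initial = docs * tokens_per_doc * overlap_factor * EMBED_PRICE
monthly = initial * refresh_rate
print(f"initial: ${initial:,.0f}  refresh: ${monthly:,.2f}/mo")
# → initial: $36  refresh: $1.80/mo
```

Under these assumptions the embedding calls themselves are nearly free; the recurring vector-store hosting, which scales with document count, is usually the larger line item.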
Caching
Semantic caching — storing and reusing responses for similar queries — can reduce inference costs by 20–40% for applications with repeated query patterns. But the cache itself needs compute and storage, and the similarity logic needs tuning to avoid serving stale or incorrect cached results.
Queuing and orchestration
Agent-based systems need a queue to manage tool calls, retries, and multi-step workflows. This is typically a small cost but adds up in high-throughput environments.
Layer 3 — Evaluation and quality assurance
This is the cost center most teams underestimate. If you're serious about production quality, you need:
- Automated evaluation suites — Running your test cases against every model update, prompt change, or config change. If you use an LLM-as-judge approach, this is an inference cost on top of your production inference.
- Human evaluation — For tasks where automated metrics don't capture quality well, budget for periodic human review. This can be internal (engineer time) or external (contractor labeling).
- A/B testing infrastructure — If you're testing prompt variants or model versions, you need the tooling to run controlled experiments and measure outcomes.
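The skeleton of an automated suite is small; the cost is in running it constantly. In this sketch, `run_model` and `judge` are hypothetical stand-ins stubbed so the harness is runnable — in production each would be a model call, and an LLM judge adds one inference per case per run:

```python
# Minimal regression-eval harness. `run_model` and `judge` are hypothetical
# stubs so the harness runs standalone; in production each is a model call
# (and an LLM-as-judge is itself an inference cost on every case).
def run_model(prompt):
    return prompt.upper()  # stub: pretend the model uppercases its input

def judge(output, expected):
    return output == expected  # stub: a real judge might be an LLM call

def evaluate(cases):
    """Run every case and return the pass rate; gate deploys on a threshold."""
    passed = sum(judge(run_model(p), e) for p, e in cases)
    return passed / len(cases)

cases = [("hello", "HELLO"), ("mixed Case", "MIXED CASE")]
print(f"pass rate: {evaluate(cases):.0%}")  # → pass rate: 100%
```

Run this on every model update, prompt change, or config change, and the eval inference bill scales with how often you ship.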
Layer 4 — Data and training
If you're fine-tuning models:
- Data collection and labeling — Even with synthetic data, you need seed examples from real-world sources
- Training compute — GPU hours for fine-tuning runs, hyperparameter searches, and failed experiments
- Data storage — Training datasets, model checkpoints, evaluation results
- Training infrastructure — Tools like Weights & Biases, MLflow, or equivalent
The first fine-tuning run is always the most expensive because you're building the pipeline. Subsequent runs are cheaper — if you invested in automation.
Layer 5 — The hidden costs
Engineer time
The most expensive resource in any AI system is the engineers who build and maintain it. Prompt engineering, debugging hallucinations, investigating quality regressions, tuning retrieval, and managing model migrations all take time that doesn't show up on an infrastructure bill.
A rough heuristic: for every $1 spent on inference, a well-run team spends $2–3 on engineering time to make that inference production-ready.
Model migrations
Every model provider releases new versions on their own schedule. Each migration requires evaluation, prompt adjustment, and regression testing. Budget for 2–4 model migrations per year per provider.
Compliance and security
Data classification, access controls, audit logging, PII detection, and compliance documentation all have costs — in tooling and in engineering time.
Optimization strategies that actually work
- Right-size your models — Use the smallest model that meets your quality bar for each task
- Compress your prompts — Shorter prompts with the same information save on every request
- Cache aggressively — Semantic caching, prefix caching, and response caching all reduce inference volume
- Batch when possible — Batch APIs are typically 50% cheaper than real-time APIs
- Monitor and alert — A runaway agent loop can consume thousands of dollars in hours; set up cost alerts
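The last point is cheap to implement and expensive to skip. Here is a minimal sketch of a rolling-window spend cap that halts a runaway loop; the dollar amounts are illustrative, not recommendations:

```python
import time

class CostGuard:
    """Cap spend within a rolling window; the caps below are illustrative."""
    def __init__(self, budget_usd, window_s=3600):
        self.budget = budget_usd
        self.window = window_s
        self.events = []  # (timestamp, cost) pairs

    def record(self, cost_usd):
        """Log one call's cost; raise if the window budget is blown."""
        now = time.time()
        self.events = [(t, c) for t, c in self.events if now - t < self.window]
        self.events.append((now, cost_usd))
        spent = sum(c for _, c in self.events)
        if spent > self.budget:
            raise RuntimeError(f"spent ${spent:.2f}, over the ${self.budget:.2f} cap")

guard = CostGuard(budget_usd=1.00)
try:
    for _ in range(100):    # a stuck agent loop retrying forever...
        guard.record(0.25)  # ...at 25 cents per model call
except RuntimeError as err:
    print("halted:", err)   # trips on the 5th call, at $1.25
```

In production you would alert before halting, but the principle is the same: a hard ceiling turns a thousand-dollar incident into a paged engineer.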
A template for AI budgeting
For a production AI system processing ~1M requests per month using a frontier API, a realistic monthly budget breakdown looks roughly like:
- Inference: 40–50% of total cost
- Infrastructure (retrieval, caching, orchestration): 15–20%
- Evaluation and QA: 10–15%
- Engineering time (pro-rated): 20–30%
- Data and training: 5–10% (if fine-tuning)
Your actual numbers will vary, but if any category is zero, you're probably under-investing in it.
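One way to use the template: scale a known inference bill into a full-stack estimate. The shares below are midpoints of the ranges above, which deliberately don't sum to exactly 100% (the ranges are rough), and the $10,000 inference figure is an assumption, so read the output as an envelope, not a forecast:

```python
# Midpoints of the template ranges above. They don't sum to exactly
# 100% (the ranges are rough), so treat the output as an envelope.
SHARES = {
    "inference": 0.45,
    "infrastructure": 0.175,
    "evaluation_qa": 0.125,
    "engineering_time": 0.25,
    "data_training": 0.075,
}

def budget_from_inference(inference_usd):
    """Scale a known monthly inference bill into a full-stack estimate."""
    total = inference_usd / SHARES["inference"]
    return {category: total * share for category, share in SHARES.items()}

for category, usd in budget_from_inference(10_000).items():  # assumed $10k bill
    print(f"{category:>18}: ${usd:,.0f}/mo")
```

If the engineering line this produces looks too big, recall the heuristic from Layer 5: it usually isn't.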