The thing every platform team eventually has to learn
If you're calling an LLM provider's API, multi-tenancy isn't your problem — it's theirs. They batch your requests with everyone else's, share GPU memory across thousands of users, and present you with a clean per-token bill. You don't have to think about it.
The moment you start hosting models yourself — for cost reasons, for latency reasons, for compliance reasons — multi-tenancy becomes your problem, and it's the kind of problem that catches teams off guard. Serving one user well on a single GPU is easy. Serving a thousand users on a fleet of GPUs, with different access patterns, different latency expectations, and different willingness to pay, is where a lot of self-hosted LLM platforms quietly fall apart.
The math doesn't behave the way intuition suggests. Let's walk through why.
The two numbers that determine everything
Multi-tenant LLM serving comes down to a tradeoff between two metrics:
- Time-per-request — how long any individual user waits for their response
- Tokens-per-GPU-second — how much useful work the hardware produces per unit time
These metrics fight each other. Optimizing one degrades the other, and the entire job of running a platform is finding the right point on the tradeoff curve for your specific users.
The thing that bridges them is batching. Modern LLM serving runtimes group multiple requests together so they can be processed in a single forward pass. A batch of 32 requests doesn't take 32 times longer than one; it takes maybe 1.5 to 2 times longer, because during decoding the dominant cost is streaming the model weights from GPU memory into the compute units, and that cost is paid once per forward pass no matter how many requests share it.
This is where the economics come from. A GPU running one request at a time is wildly underutilized. The same GPU running batched requests can serve dozens of users at near-linear cost. The whole game is keeping batch sizes high without making any individual user wait too long.
Why naive batching breaks under real workloads
The simplest batching strategy is static batching: collect requests for a few milliseconds, group them, send them through the model together. This works in benchmarks. It breaks in production for a specific reason: requests in a batch don't finish at the same time.
In an LLM, generation happens token by token. A request that needs 50 output tokens finishes after 50 forward passes; a request that needs 500 output tokens occupies the batch for 10x as long. With static batching, the entire batch waits for the longest request. Your throughput collapses to whatever the slowest user happens to need.
The fix is continuous batching (sometimes called "in-flight batching"): the runtime drops finished requests out of the batch and adds new ones in their place, on every forward pass. Requests join and leave the batch dynamically, and the GPU stays busy regardless of how variable the response lengths are. This is the technique that made high-throughput LLM serving viable, and it's what frameworks like vLLM popularized.
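A toy step-level simulation makes the gap concrete. One "step" below is one forward pass advancing every in-flight request by one output token; the workload mix and both helper functions are illustrative sketches, not any real runtime's API:

```python
# Toy comparison of static vs continuous batching. A "step" is one
# forward pass that advances every in-flight request by one token.
# Workload shape and batch size are illustrative assumptions.

def static_batching_steps(lengths, batch_size):
    """Static: requests are grouped once, and each group occupies the
    GPU until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Continuous: a finished request's slot is refilled on the next
    step, so short requests stop paying for long ones."""
    pending = list(lengths)
    in_flight, steps = [], 0
    while pending or in_flight:
        while pending and len(in_flight) < batch_size:
            in_flight.append(pending.pop(0))  # admit waiting requests
        steps += 1
        in_flight = [r - 1 for r in in_flight if r > 1]  # drop finished
    return steps

# 28 short responses plus 4 long ones, one long per static batch of 8:
workload = ([50] * 7 + [500]) * 4
print("static:    ", static_batching_steps(workload, 8), "steps")
print("continuous:", continuous_batching_steps(workload, 8), "steps")
```

On this workload, continuous batching finishes in roughly a third of the forward passes static batching needs, purely because the short requests stop waiting for the long one in their batch.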
If you're self-hosting and your runtime doesn't do continuous batching, you're probably running at less than half the throughput your hardware is capable of.
The fairness problem nobody plans for
Once batching is working, a new problem emerges: fairness. Some tenants generate short responses; others generate long ones. Some send bursts of traffic; others send a steady trickle. Without explicit fairness controls, the heavy users dominate the batches and the light users see their latency degrade — even though the system is technically working as designed.
The patterns that solve this:
Per-tenant token budgets
Each tenant gets a maximum number of in-flight tokens at any given time. When they hit the limit, new requests queue until older ones complete. This prevents one tenant from monopolizing the batch and keeps latency predictable for everyone else.
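A minimal sketch of such a budget gate, assuming the scheduler can estimate a request's token footprint at admission time (real systems estimate, then reconcile on completion); the class name, methods, and budget numbers are all hypothetical:

```python
from collections import defaultdict, deque

class TokenBudgetGate:
    """Per-tenant admission control: cap the in-flight tokens a tenant
    may hold; excess requests wait in a per-tenant FIFO queue.
    Illustrative sketch, not a production scheduler."""

    def __init__(self, budgets: dict):
        self.budgets = budgets                  # tenant -> max tokens
        self.in_flight = defaultdict(int)       # tenant -> tokens held
        self.waiting = defaultdict(deque)       # tenant -> queued sizes

    def submit(self, tenant: str, est_tokens: int) -> bool:
        """Admit immediately if the tenant is under budget; else queue."""
        if self.in_flight[tenant] + est_tokens <= self.budgets[tenant]:
            self.in_flight[tenant] += est_tokens
            return True
        self.waiting[tenant].append(est_tokens)
        return False

    def complete(self, tenant: str, est_tokens: int) -> list:
        """Release a finished request's tokens, then admit queued
        requests that now fit under the budget."""
        self.in_flight[tenant] -= est_tokens
        admitted, q = [], self.waiting[tenant]
        while q and self.in_flight[tenant] + q[0] <= self.budgets[tenant]:
            size = q.popleft()
            self.in_flight[tenant] += size
            admitted.append(size)
        return admitted
```

With a 1,000-token budget, a tenant can hold an 800-token request in flight; a second 400-token request queues until the first completes, at which point the gate admits it.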
Priority lanes
Different tenants get different priority levels — usually tied to their pricing tier. High-priority requests jump to the front of the queue and get included in batches faster than low-priority ones. The infrastructure stays the same; the scheduling layer enforces the SLA.
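A sketch of that scheduling layer using a heap keyed on (tier, arrival order); the tier names and the two-method interface are invented for illustration:

```python
import heapq
import itertools

class PriorityLane:
    """Tier-aware request queue: lower tier number = higher priority;
    within a tier, arrival order wins. Tier names are assumptions."""

    TIER = {"enterprise": 0, "pro": 1, "free": 2}  # hypothetical tiers

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival-order tiebreaker

    def push(self, tenant_tier: str, request_id: str) -> None:
        heapq.heappush(
            self._heap, (self.TIER[tenant_tier], next(self._seq), request_id)
        )

    def pop_for_batch(self, free_slots: int) -> list:
        """Fill free batch slots, highest tier first."""
        out = []
        while self._heap and len(out) < free_slots:
            _, _, request_id = heapq.heappop(self._heap)
            out.append(request_id)
        return out
```

One caveat worth naming: pure priority ordering can starve the lowest tier under sustained high-priority load, so production schedulers usually layer aging or weighted fair queuing on top of a scheme like this.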
Reserved capacity
For tenants who need guaranteed performance, dedicate a portion of your GPU pool to them exclusively. They pay a premium for the reservation; you get predictable utilization on the reserved hardware and avoid the unpredictability they would otherwise create on the shared pool.
Isolation: the harder problem
Fairness keeps tenants from stepping on each other's performance. Isolation keeps them from stepping on each other's data and security. This is the harder of the two, and it's the one most home-grown platforms get wrong.
The questions that actually matter:
- Can tenant A's requests influence the model's behavior on tenant B's requests? (Usually yes, in subtle ways: through shared prefix/KV caches or batch-dependent numerical effects.)
- Can a malicious prompt from tenant A leak into tenant B's response? (Should be no, but bugs happen, and the failure modes are embarrassing.)
- Are tenant logs, traces, and metrics properly segregated? (Often no, especially in early-stage platforms.)
- Can tenants with stricter compliance requirements be served on isolated infrastructure? (Almost always a requirement for enterprise deals.)
The cleanest answer to all of these is to run separate inference pools for separate trust domains — different GPUs, different processes, different networks. This sacrifices utilization for safety. Most mature platforms end up with a tiered architecture: a high-utilization shared pool for low-sensitivity workloads, dedicated pools for tenants with isolation requirements, and a billing model that covers the cost difference.
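The routing layer for such a tiered architecture can start as a simple lookup; the pool names, tenant IDs, and `requires_isolation` flag below are placeholders for whatever your deployment defines:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pool:
    name: str
    shared: bool

# Hypothetical tier layout: one high-utilization shared pool plus
# dedicated pools keyed by tenant. All names are illustrative.
SHARED = Pool("shared-a100", shared=True)
DEDICATED = {"bank-co": Pool("isolated-bank-co", shared=False)}

def route(tenant: str, requires_isolation: bool) -> Pool:
    """Send isolation-requiring tenants to their dedicated pool;
    everyone else lands on the shared pool."""
    if requires_isolation:
        return DEDICATED[tenant]  # separate GPUs, processes, network
    return SHARED
```

The hard part isn't the lookup; it's guaranteeing that the dedicated pools really are separate at the process, network, and logging layers. The router just encodes the decision.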
The reason multi-tenant LLM platforms are expensive isn't the GPUs. It's the layers of fairness, isolation, and observability that turn a single-tenant system into something businesses can actually trust.
What this means for build-vs-buy
For teams considering whether to self-host, multi-tenancy is the underrated cost. Building a single-tenant inference server is a weekend project. Building a multi-tenant platform with good batching, fairness, isolation, observability, and tenant-aware billing is a multi-quarter engineering investment, and one most teams dramatically underestimate at the start.
The honest version of the build-vs-buy decision is: if your workload is large enough that hosted inference is genuinely more expensive than self-hosting at the unit-economics level, you also need to be large enough to absorb the platform engineering work that real multi-tenancy requires. The break-even isn't where the per-token math says it is — it's higher than that, by a lot, and it's in that gap that most self-hosting projects discover they wish they hadn't started.