The number that finally stopped meaning what you thought
For years, the size of a language model was the headline number. GPT-3 had 175B. Llama 2 had 70B. Every new release was measured against that single metric. Then MoE — Mixture of Experts — models started shipping, and the number on the tin stopped meaning what people thought it meant.
NVIDIA's Nemotron 3 Super, announced at GTC in March, is a clean example: 120 billion total parameters, but only 12 billion active on any given forward pass. Mistral's Large 3 is similar: a 675B total footprint that behaves closer to a 70B model at inference time. These aren't edge cases anymore. MoE has quietly become the default architecture at the frontier.
What MoE actually does
A traditional dense transformer passes every token through every parameter. A mixture-of-experts model does something different: the feed-forward block in each layer is replaced by multiple parallel "expert" sub-networks, and a small router network decides which experts each token is sent to. Typically only 1–4 experts out of dozens are active for any given token.
The effect is counterintuitive. You get the capacity of a very large model — because the total parameter count is huge — but the computational cost of a much smaller one, because only a fraction of the parameters are doing work at any moment.
This decoupling of capacity from compute is the whole point. It's why MoE models can post benchmarks in the same neighborhood as dense models several times larger, at a fraction of the inference cost.
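The routing step is easier to see in code than in prose. Here's a minimal sketch of top-k expert routing for a single token, in NumPy. This is illustrative only, not any shipping model's implementation: the dimensions, the ReLU feed-forward experts, and the softmax-over-selected-experts weighting are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a tiny two-matrix feed-forward network.
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]            # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0) @ w2)  # ReLU feed-forward expert
    return out, top

token = rng.standard_normal(d_model)
y, chosen = moe_layer(token)
print(len(chosen), "of", n_experts, "experts did any work")  # 2 of 8
```

All eight experts' weights must sit in memory, but only two of them touched this token. That gap is the whole architecture in miniature.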
The two numbers you need to know
When evaluating an MoE model, the two numbers that actually matter are:
- Total parameters — This tells you about the model's memory footprint and the capacity it can draw on. It determines how much GPU memory you need to load it.
- Active parameters per token — This tells you about the inference compute cost and latency. It's what you pay for on every forward pass.
A 120B total / 12B active model still has to fit all 120 billion parameters in memory, but each token it generates costs roughly what a 12B dense model would cost. That's a fundamentally different tradeoff than a 120B dense model, and it leads to different deployment decisions.
If you're comparing MoE and dense models on cost, use the active parameter count. If you're comparing them on hardware requirements, use the total. Using the wrong one misleads you in both directions: the total overstates what the model costs to run, and the active count understates what you need to host it.
The deployment implications
Memory dominates
The thing that catches teams off guard when they first deploy an MoE model is memory. A 120B-total model still needs ~240GB of GPU memory in FP16. You can't run it on a single consumer GPU, and you often can't run it on a single data center GPU either. Multi-GPU setups with careful expert partitioning are typical.
This means the operational profile of MoE models is different from their dense equivalents. You're provisioning for memory, not compute. The hardware bill looks strange — lots of memory, relatively modest compute utilization — but that's exactly the point.
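A rough provisioning estimate makes the point concrete. The 20% overhead factor below is an assumption standing in for KV cache and activations (real overhead varies with batch size and context length), and 80 GB is just a representative data center GPU:

```python
import math

def gpus_needed(total_params_b, bytes_per_param=2, gpu_mem_gb=80, overhead=1.2):
    """How many GPUs just to hold the weights, with a rough 20% margin
    for KV cache and activations (an assumption, not a measured figure)."""
    need_gb = total_params_b * bytes_per_param * overhead
    return math.ceil(need_gb / gpu_mem_gb)

print(gpus_needed(120))  # the 120B-total MoE: 4 GPUs before serving a single request
print(gpus_needed(12))   # a 12B dense model with the same per-token cost: 1 GPU
```

Two models with the same per-token compute bill, a 4x difference in the memory bill. That's the "provisioning for memory, not compute" profile in one calculation.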
Load balancing is a real concern
MoE routers are supposed to distribute tokens evenly across experts, but in practice, some experts become "popular" and get overloaded while others sit idle. Good implementations handle this with auxiliary losses during training and load-balancing constraints at inference. Bad implementations produce latency spikes when traffic patterns shift the routing distribution.
If you're self-hosting an MoE model, monitor expert utilization as carefully as you monitor GPU utilization. Imbalance is a leading indicator of degraded performance.
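One simple imbalance metric worth tracking is the busiest expert's load relative to a perfectly even split. A sketch (the metric name and the traffic numbers are illustrative, not from any particular serving stack):

```python
from collections import Counter

def expert_imbalance(assignments, n_experts):
    """Ratio of the busiest expert's load to a perfectly even load.
    1.0 = balanced; well above 1.0 = hot experts and likely latency spikes."""
    counts = Counter(assignments)
    max_load = max(counts.get(e, 0) for e in range(n_experts))
    even_load = len(assignments) / n_experts
    return max_load / even_load

# 8 experts, 80 routed tokens: expert 0 has become "popular"
# and takes about a third of the traffic.
routed = [0] * 26 + [e for e in range(1, 8) for _ in range(7)] + [1] * 5
print(round(expert_imbalance(routed, 8), 2))   # 2.6 — one expert doing 2.6x its share

balanced = list(range(8)) * 10
print(expert_imbalance(balanced, 8))           # 1.0 — the healthy baseline
```

Alerting when this ratio drifts above your baseline catches routing skew before it shows up as p99 latency.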
Fine-tuning is more delicate
Fine-tuning MoE models requires more care than fine-tuning dense models. You can easily break the router's learned distribution of tokens to experts, which causes subtle quality regressions that don't show up on small eval sets but bite you in production. Adapter-based methods like LoRA work, but the default hyperparameters often need adjustment.
When MoE is the right choice
MoE isn't automatically better than dense. It's better when:
- You need frontier-level capability but can't afford frontier-level inference cost
- Your workload has enough volume that the upfront complexity of multi-GPU deployment amortizes
- You can live with the memory requirements and have infrastructure that supports them
For teams running moderate-volume workloads on well-defined tasks, a small dense model — fine-tuned well — is usually a simpler and cheaper answer. MoE earns its keep at the high end of the capability curve, where there's no dense alternative with comparable quality.
The direction the industry is going
Nearly every frontier model released in the past year has been MoE or a hybrid. The reason is straightforward: the architecture hits a better point on the capability-vs-cost curve than dense models, and the training recipes have matured enough that the quality gap has closed.
For most engineering teams, the practical takeaway is this: when you read a model spec sheet, stop looking at the total parameter count as the primary metric. Look for the active count. That's the number that determines what the model costs to run, how fast it responds, and whether it's the right choice for your use case. The headline number is increasingly just marketing.