The process of using a trained machine learning model to make predictions or generate outputs on new, unseen data — the production phase that follows training and deployment.
In Depth
Inference is the flip side of training. Training iteratively adjusts model parameters to minimize prediction error over a training dataset; inference uses those now-fixed parameters to process new, unseen inputs and produce predictions or outputs. Training is the education; inference is the exam. And unlike training, which happens once (or periodically, when models are retrained), inference runs continuously in deployed systems.
The computational profile of inference differs markedly from training. Training computes gradients and updates weights across billions of parameters, making it both compute- and memory-intensive. Inference performs only the forward pass — computing outputs with no gradient computation or weight updates — so each call is faster and lighter on memory. At scale, however (millions of requests per day for a production AI system), inference efficiency becomes the dominant cost: a model that costs $10M to train may cost $100M per year to serve.
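The forward-pass-only nature of inference can be sketched with a toy model. This is a hypothetical, minimal example (a logistic-regression "model" with made-up weights, not any real system): the parameters are frozen after training, and serving a request is just arithmetic on them — nothing is written back.

```python
import math

# Toy logistic-regression "model": parameters are FIXED after training.
# Inference is just the forward pass -- no gradients, no weight updates.
WEIGHTS = [0.8, -0.4, 0.1]   # hypothetical learned parameters
BIAS = -0.2

def predict(features):
    """Forward pass: weighted sum + sigmoid. Read-only with respect to the model."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

# A new, unseen input arrives at serving time:
score = predict([1.0, 2.0, 0.5])
print(round(score, 3))   # ≈ 0.463
```

Training would need to store activations and compute a gradient for every parameter here; inference needs only this one pass, which is why the same model is so much cheaper per call to serve than to train.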
Inference optimization is a major engineering discipline. Techniques include quantization (reducing parameter precision from 32-bit floats to 8-bit or even 4-bit representations), pruning (removing weights that contribute little to outputs), knowledge distillation (training a smaller 'student' model to mimic a larger 'teacher'), batch inference (processing multiple requests together to amortize GPU overhead), and hardware specialization (running models on TPUs, NPUs, or custom silicon optimized for matrix multiplication). For LLMs, KV caching, speculative decoding, and FlashAttention dramatically reduce per-token inference cost.
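Of these techniques, quantization is the easiest to illustrate. The sketch below shows symmetric per-tensor 8-bit quantization on a handful of made-up weights — an illustrative scheme, not any particular library's implementation: pick one scale factor, round each weight to an integer code in [-127, 127], and dequantize to see how small the round-trip error is.

```python
# Minimal sketch of symmetric int8 quantization (illustrative only):
# map float weights to integer codes in [-127, 127] with a single
# per-tensor scale, then dequantize and measure the round-trip error.
weights = [0.31, -1.24, 0.07, 2.05, -0.66]   # hypothetical float32 weights

scale = max(abs(w) for w in weights) / 127    # one scale for the whole tensor
q = [round(w / scale) for w in weights]       # int8 codes (4x smaller than float32)
dq = [qi * scale for qi in q]                 # dequantized approximation

max_err = max(abs(w - d) for w, d in zip(weights, dq))
print(q)
print(round(max_err, 4))   # rounding error is bounded by scale / 2
```

The storage win is the point: each weight shrinks from 32 bits to 8, and integer matrix multiplies are cheaper on most hardware, at the cost of a bounded rounding error per weight. Production schemes refine this idea with per-channel scales, zero points for asymmetric ranges, and calibration data.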
Inference is the moment AI earns its keep — when the patterns learned during training meet real-world data. But at scale, inference efficiency can matter as much as model quality: a model that is too slow or costly to serve is effectively useless.

