A family of techniques that reduce the size, memory footprint, and computational cost of AI models while preserving as much performance as possible — enabling deployment on resource-constrained devices and reducing inference costs.
In Depth
Model compression addresses a fundamental tension in AI: the most powerful models are enormous (billions to trillions of parameters), but many deployment environments have strict constraints on size, speed, memory, and power consumption. A model running on a smartphone, an embedded sensor, or even a cost-sensitive cloud endpoint cannot draw on the computational resources of the data center where the model was trained and originally served. Model compression techniques bridge this gap by producing smaller, faster models that retain most of the original's capability.
The three primary compression techniques are quantization, pruning, and knowledge distillation. Quantization reduces the numerical precision of model weights, for example from 32-bit floating point down to 16-bit floats or 8-bit and even 4-bit integers, dramatically shrinking model size and speeding up computation on hardware that supports low-precision arithmetic. Pruning removes individual weights, or entire neurons and layers, that contribute little to the model's output, producing a sparser, more efficient network. Knowledge distillation trains a smaller 'student' model to mimic the outputs of a larger 'teacher' model, transferring the teacher's knowledge into a compact form. The sketch below illustrates all three.
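As a rough illustration, here is a minimal sketch of each technique on toy tensors in PyTorch: symmetric 8-bit post-training quantization, global magnitude pruning, and the softened-logits distillation loss. The function names, shapes, and hyperparameters (quantize_int8, magnitude_prune, the 50% sparsity target, the temperature of 2.0) are illustrative choices for this sketch, not part of any particular framework's API.

```python
# Toy versions of quantization, pruning, and distillation.
# Shapes and hyperparameters are illustrative only.
import torch
import torch.nn.functional as F

# --- Quantization: map float32 weights to int8 and back ---
def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: the scale is chosen so the
    # largest-magnitude weight maps to 127.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

# --- Pruning: zero out the smallest-magnitude weights ---
def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5):
    # Keep only weights above the magnitude threshold implied by the
    # target sparsity (here, drop roughly half of all weights).
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    return w * mask

# --- Distillation: student matches the teacher's softened outputs ---
def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # KL divergence between temperature-softened distributions,
    # scaled by T^2 as in Hinton et al. (2015).
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Demonstrate on random weights and logits.
w = torch.randn(256, 256)
q, scale = quantize_int8(w)
print("mean quantization error:", (dequantize(q, scale) - w).abs().mean().item())
print("sparsity after pruning:", (magnitude_prune(w) == 0).float().mean().item())

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
print("distillation loss:", distillation_loss(student_logits, teacher_logits).item())
```

In practice each step is more involved: quantization is usually applied per-channel with calibration data, pruning is followed by fine-tuning to recover accuracy, and the distillation loss is combined with the ordinary task loss on labeled data.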
Model compression has become increasingly important as the AI industry matures from research to production. Running inference on frontier models like GPT-4 at scale costs millions of dollars, and compression directly reduces these costs. The open-source community relies heavily on quantization to run large language models on consumer hardware; 4-bit quantization, for example, lets a 70-billion-parameter model run on a gaming laptop. Edge AI deployment on phones, watches, and IoT devices would be impossible without aggressive compression. Techniques continue to advance: structured pruning, mixed-precision quantization, and compression-aware training increasingly achieve near-lossless compression.
Model compression shrinks AI models through quantization, pruning, and distillation — enabling efficient deployment on devices, reducing cloud costs, and making powerful AI accessible beyond data centers.