A technique that normalizes the inputs to each layer of a neural network within a mini-batch — stabilizing training, enabling higher learning rates, and allowing much deeper networks to converge reliably.
In Depth
Batch Normalization, introduced by Ioffe and Szegedy in 2015, addresses a fundamental challenge in training deep neural networks: as data passes through many layers, the distribution of inputs to each layer shifts during training, a phenomenon the authors called 'internal covariate shift.' This makes optimization difficult — each layer must constantly readjust to changing input distributions. Batch Normalization solves this by normalizing the inputs to each layer to have zero mean and unit variance, computed across the current mini-batch of data.
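In equation form (following the notation of the original paper, with a small constant epsilon added for numerical stability), the per-feature normalization over a mini-batch of m values is:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
```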
The mechanics are straightforward. For each feature in a layer's input, the algorithm computes the mean and variance across the mini-batch, normalizes the values to zero mean and unit variance, then applies two learned parameters (scale and shift) that allow the network to undo the normalization if that is optimal. During inference, the batch statistics are replaced with running averages accumulated during training. This simple operation has profound effects: training becomes more stable, higher learning rates can be used (speeding convergence by 5-10x), and networks with many more layers can be trained successfully.
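A minimal NumPy sketch of this forward pass may help make the mechanics concrete. It is illustrative only: the names gamma, beta, running_mean, running_var, and momentum, and the particular running-average update rule, are assumptions for this sketch rather than details from the text above.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       momentum=0.1, eps=1e-5, training=True):
    """Batch-normalize x of shape (batch, features)."""
    if training:
        # Statistics are computed per feature across the current mini-batch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Running averages are accumulated for later use at inference time
        # (the exponential-moving-average update here is one common convention).
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # At inference, the accumulated running statistics replace batch statistics.
        mean, var = running_mean, running_var

    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    y = gamma * x_hat + beta                  # learned scale and shift
    return y, running_mean, running_var

# Usage: normalize a mini-batch of 32 examples with 4 features.
x = np.random.randn(32, 4)
gamma, beta = np.ones(4), np.zeros(4)
running_mean, running_var = np.zeros(4), np.ones(4)
y, running_mean, running_var = batch_norm_forward(
    x, gamma, beta, running_mean, running_var, training=True)
```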
Batch Normalization became nearly ubiquitous in deep learning architectures after its introduction, but it has limitations. Its dependence on batch statistics makes it problematic for small batch sizes and for sequential models. Alternatives have emerged: Layer Normalization, preferred in Transformers, normalizes across features rather than across the batch; Group Normalization and Instance Normalization offer different tradeoffs; and Root Mean Square Normalization (RMSNorm), which drops the mean subtraction entirely, is increasingly popular in modern LLMs. Despite these alternatives, understanding Batch Normalization remains essential, as it set the conceptual foundation for all subsequent normalization techniques.
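To make the contrast concrete, here is a hedged sketch of the different normalization axes for a (batch, features) activation. It shows only the axis choices; real implementations also include learned parameters and handle additional dimensions such as channels or sequence length.

```python
import numpy as np

x = np.random.randn(32, 64)   # (batch, features)
eps = 1e-5

# Batch Normalization: statistics per feature, computed across the batch dimension.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer Normalization: statistics per example, computed across the feature dimension.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

# RMSNorm: like Layer Normalization but rescales by the root mean square,
# with no mean subtraction.
rms = x / np.sqrt((x ** 2).mean(axis=1, keepdims=True) + eps)
```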
Batch Normalization stabilizes deep network training by normalizing layer inputs — enabling faster convergence, higher learning rates, and the practical training of very deep architectures.