The number of training examples processed together in a single forward and backward pass during model training — a hyperparameter that balances training speed, memory usage, and gradient estimate quality.
In Depth
Batch size determines how many training examples the model processes before updating its parameters. In full-batch gradient descent, every training example is processed before each update, which yields the most accurate gradient estimate but is computationally expensive for large datasets. In stochastic gradient descent (SGD), a single example is processed per update: fast and memory-efficient, but very noisy. Mini-batch gradient descent uses batches of roughly 32 to 512 examples, the practical standard that balances gradient quality and computational efficiency.
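To make the three regimes concrete, here is a minimal PyTorch sketch with a toy model and hypothetical sizes; the DataLoader's batch_size argument selects the regime (1 for SGD, the full dataset size for full-batch, 64 here for mini-batch).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data: 1,024 examples, 10 features (hypothetical sizes).
X, y = torch.randn(1024, 10), torch.randn(1024, 1)
dataset = TensorDataset(X, y)

# batch_size selects the regime: 1 -> SGD, len(dataset) -> full-batch,
# anything like 32-512 -> mini-batch (used here).
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)  # forward pass over one batch of 64
    loss.backward()                # backward pass: gradient estimate from this batch
    optimizer.step()               # one parameter update per batch
```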
Batch size has important effects on training dynamics. Small batches introduce noise into the gradient estimate — which can actually be beneficial, acting as regularization and helping the model escape local minima. Large batches produce stable, accurate gradient estimates but may converge to 'sharp minima' that generalize poorly. The 'linear scaling rule' suggests that when batch size is multiplied by k, the learning rate should also be multiplied by k to maintain similar training dynamics — though this relationship breaks down at very large batch sizes.
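As a worked example of the linear scaling rule, the hypothetical helper below (scaled_lr is our own name, not a library function) computes the adjusted learning rate:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: multiply the LR by k = new_batch / base_batch."""
    return base_lr * (new_batch / base_batch)

# A learning rate of 0.1 tuned at batch size 256, moved to batch size 1024 (k = 4):
print(scaled_lr(0.1, 256, 1024))  # 0.4
```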
Memory constraints are the practical ceiling on batch size. GPU memory must hold the activations, gradients, and optimizer states for the entire batch simultaneously. Techniques like gradient accumulation (accumulating gradients over multiple small batches before updating weights) allow effectively larger batch sizes without the memory cost. Mixed-precision training (using 16-bit floats) approximately doubles the effective batch size within the same memory envelope.
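A minimal sketch of gradient accumulation in PyTorch, assuming a toy model and data; the key detail is dividing each micro-batch loss by the number of accumulation steps so the accumulated gradient matches the average over the larger effective batch:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1)),
                    batch_size=64, shuffle=True)

accum_steps = 4  # 4 micro-batches of 64 behave like one batch of 256

optimizer.zero_grad()
for step, (xb, yb) in enumerate(loader):
    # Divide so the accumulated gradients average rather than sum.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients add up in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # one update per effective batch of 256
        optimizer.zero_grad()
```

Only one micro-batch of activations is live in memory at a time, which is where the savings come from.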
Batch size is one of the most consequential training hyperparameters — it determines how often the model learns, how much memory is required, and even the character of the model it produces at convergence.
Frequently Asked Questions
How does batch size affect training?
Larger batches: more stable gradients, faster per-epoch training on GPUs, but can converge to sharp minima (worse generalization) and require more memory. Smaller batches: noisier gradients (which can help escape local minima), better generalization in many cases, but slower per-epoch training. Common sizes range from 16 to 512. The optimal batch size balances speed, GPU memory, and model quality.
What is the relationship between batch size and learning rate?
A widely used rule: when you increase batch size by a factor of k, increase the learning rate by a factor of √k (or k, depending on the method). Larger batches produce more accurate gradient estimates, so the optimizer can safely take larger steps. The linear scaling rule (with warm-up) is standard practice for scaling batch sizes across multiple GPUs in distributed training.
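A small illustration of warm-up under the linear scaling rule; warmup_lr is a hypothetical helper, and real training code would usually combine it with a decay schedule:

```python
def warmup_lr(step: int, target_lr: float, warmup_steps: int) -> float:
    """Linear warm-up: ramp the learning rate from near zero to the
    scaled target over the first warmup_steps updates, then hold it."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

# Scaling batch 256 -> 2048 (k = 8) under the linear rule: target LR = 0.1 * 8 = 0.8.
# The sqrt rule would instead give 0.1 * 8 ** 0.5, roughly 0.28.
for step in (0, 249, 499, 2000):
    print(step, round(warmup_lr(step, target_lr=0.8, warmup_steps=500), 3))
```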
What batch size should I use with limited GPU memory?
Start with the largest batch size that fits in memory; if you need a larger effective batch, use gradient accumulation, which accumulates gradients over multiple smaller batches before updating weights. For example, accumulating over 4 batches of 64 is functionally equivalent to one batch of 256. Mixed-precision training (FP16) can also roughly double the feasible batch size by halving the memory per sample.
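A minimal mixed-precision sketch using PyTorch's autocast and GradScaler, with a toy model and hypothetical shapes; the halved activation memory is what allows the batch size to roughly double:

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# GradScaler rescales the loss so small FP16 gradients do not underflow to zero.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

xb = torch.randn(256, 10, device=device)
yb = torch.randn(256, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=torch.float16,
                    enabled=(device == "cuda")):
    loss = loss_fn(model(xb), yb)  # forward runs in FP16 where numerically safe

scaler.scale(loss).backward()      # backward pass on the scaled loss
scaler.step(optimizer)             # unscales gradients, then applies the update
scaler.update()                    # adjusts the scale factor for the next step
```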