Technical Concepts · Intermediate · Also known as: Mini-batch Size

Batch Size

Definition

The number of training examples processed together in a single forward and backward pass during model training — a hyperparameter that balances training speed, memory usage, and gradient estimate quality.

In Depth

Batch size determines how many training examples the model processes before updating its parameters. In full-batch gradient descent, every training example is processed before each update, which produces the most accurate gradient estimate but is computationally expensive for large datasets. In stochastic gradient descent (SGD), a single example is processed per update: fast and memory-efficient, but very noisy. Mini-batch gradient descent uses batches of 32–512 examples, the practical standard that balances gradient quality and computational efficiency.
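The mini-batch update loop can be sketched on a toy linear-regression problem. This is a minimal illustration, not a production training loop; the dataset, learning rate, and epoch count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y = 3x + 2 plus a little noise (illustrative values only)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0      # model parameters
lr = 0.1             # learning rate
batch_size = 32      # the hyperparameter under discussion

for epoch in range(20):
    perm = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        # Gradient of mean squared error computed over this mini-batch only;
        # the parameters update once per batch, not once per epoch.
        w -= lr * 2.0 * np.mean(err * xb)
        b -= lr * 2.0 * np.mean(err)
```

Setting `batch_size = len(X)` recovers full-batch gradient descent, and `batch_size = 1` recovers classic SGD; everything in between is mini-batch training.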

Batch size has important effects on training dynamics. Small batches introduce noise into the gradient estimate — which can actually be beneficial, acting as regularization and helping the model escape local minima. Large batches produce stable, accurate gradient estimates but may converge to 'sharp minima' that generalize poorly. The 'linear scaling rule' suggests that when batch size is multiplied by k, the learning rate should also be multiplied by k to maintain similar training dynamics — though this relationship breaks down at very large batch sizes.

Memory constraints set the practical ceiling on batch size. GPU memory must hold the activations, gradients, and optimizer states for the entire batch simultaneously. Techniques like gradient accumulation (accumulating gradients over multiple small batches before updating weights) allow effectively larger batch sizes without the memory cost. Mixed-precision training (using 16-bit floats) roughly halves activation memory, allowing approximately twice the batch size within the same memory envelope.
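Gradient accumulation can be sketched with the same toy linear model: gradients from several small "micro-batches" are summed (scaled by the number of accumulation steps) and the parameters update only once per effective batch. The micro-batch size and step count here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 2.0

w, b = 0.0, 0.0
lr = 0.1
micro_batch = 16     # what actually fits in memory at once
accum_steps = 4      # effective batch size = 16 * 4 = 64

grad_w = grad_b = 0.0
for step, start in enumerate(range(0, len(X), micro_batch), start=1):
    xb, yb = X[start:start + micro_batch, 0], y[start:start + micro_batch]
    err = (w * xb + b) - yb
    # Accumulate averaged gradients instead of updating immediately
    grad_w += 2.0 * np.mean(err * xb) / accum_steps
    grad_b += 2.0 * np.mean(err) / accum_steps
    if step % accum_steps == 0:
        # One optimizer step per effective (large) batch
        w -= lr * grad_w
        b -= lr * grad_b
        grad_w = grad_b = 0.0
```

Only one micro-batch of activations is ever held in memory, but the parameter updates behave as if the batch size were four times larger.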

Key Takeaway

Batch size is one of the most consequential training hyperparameters — it determines how often the model learns, how much memory is required, and even the character of the model it produces at convergence.

Real-World Applications

01 GPU memory optimization: choosing the largest batch size that fits in VRAM to maximize training throughput.
02 Distributed training: large batch sizes enabling data parallelism across hundreds of GPUs for LLM pre-training.
03 Fine-tuning with small datasets: small batch sizes (8-32) providing regularization through noisy gradients when limited labeled data is available.
04 Gradient accumulation: simulating large batches on memory-constrained hardware by accumulating gradients over multiple small forward passes.
05 Hyperparameter sweeps: studying the interaction between batch size and learning rate to identify optimal training configurations.