The number of training examples processed together in a single forward and backward pass during model training — a hyperparameter that balances training speed, memory usage, and gradient estimate quality.
In Depth
Batch size determines how many training examples the model processes before updating its parameters. In full-batch gradient descent, every training example is processed before each update, which yields the most accurate gradient estimate but is computationally expensive for large datasets. In stochastic gradient descent (SGD), a single example is processed per update: fast and memory-efficient, but very noisy. Mini-batch gradient descent uses batches of roughly 32–512 examples, the practical standard that balances gradient quality and computational efficiency.
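The mini-batch loop can be sketched in a few lines. This is a minimal illustration on least-squares linear regression with NumPy; the data sizes, batch size of 64, and learning rate are hypothetical values chosen for the example, not a prescription.

```python
import numpy as np

# Synthetic regression problem (sizes chosen only for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
batch_size, lr = 64, 0.1
for epoch in range(50):
    perm = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error, estimated on this batch only.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad                          # one parameter update per batch
```

The key structural point is that the parameter update happens once per batch, so batch size directly sets how many updates an epoch contains.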
Batch size has important effects on training dynamics. Small batches introduce noise into the gradient estimate — which can actually be beneficial, acting as regularization and helping the model escape local minima. Large batches produce stable, accurate gradient estimates but may converge to 'sharp minima' that generalize poorly. The 'linear scaling rule' suggests that when batch size is multiplied by k, the learning rate should also be multiplied by k to maintain similar training dynamics — though this relationship breaks down at very large batch sizes.
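The linear scaling rule itself is a one-line computation. A small hypothetical helper (the function name and example values are invented for illustration) makes the arithmetic explicit:

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: multiply the learning rate by the same
    factor k by which the batch size was multiplied."""
    return base_lr * (new_batch / base_batch)

# A recipe tuned at batch size 256 with lr 0.1, scaled up 4x:
print(scale_lr(0.1, 256, 1024))  # 4x the batch -> 4x the learning rate
```

As the text notes, this heuristic degrades at very large batch sizes, where the scaled learning rate becomes too aggressive for stable training.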
Memory constraints are the practical ceiling on batch size. GPU memory must hold the activations, gradients, and optimizer states for the entire batch simultaneously. Techniques like gradient accumulation (summing gradients over multiple small batches before updating the weights) simulate a larger batch size without the memory cost. Mixed-precision training (storing activations in 16-bit floats) roughly halves activation memory, allowing roughly double the batch size within the same memory envelope.
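Gradient accumulation can be sketched as an inner loop that averages gradients over several micro-batches before applying a single update. This is an illustrative NumPy sketch on a noise-free regression problem; the micro-batch size, accumulation steps, and learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noise-free targets for the sketch

def mse_grad(w, Xb, yb):
    """Mean-squared-error gradient on one micro-batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
micro_batch, accum_steps, lr = 16, 4, 0.1   # effective batch = 16 * 4 = 64
for epoch in range(100):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), micro_batch * accum_steps):
        grad = np.zeros_like(w)
        for step in range(accum_steps):
            lo = start + step * micro_batch
            idx = perm[lo:lo + micro_batch]
            # Accumulate the average gradient; only one micro-batch of
            # activations needs to be in memory at a time.
            grad += mse_grad(w, X[idx], y[idx]) / accum_steps
        w -= lr * grad                      # single update per accumulated batch
```

In a deep-learning framework the same pattern appears as calling the backward pass on each micro-batch and stepping the optimizer only every `accum_steps` iterations.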
Batch size is one of the most consequential training hyperparameters — it determines how often the model learns, how much memory is required, and even the character of the model it produces at convergence.

