A hyperparameter that controls the size of the weight updates during gradient descent — determining how quickly or slowly a model learns from its training data.
In Depth
The learning rate is arguably the single most important hyperparameter in training a neural network. It controls the step size taken in the opposite direction of the gradient during each weight update: with plain gradient descent, a learning rate of 0.01 means each weight moves by 0.01 times its gradient at each step. Set it too high, and training diverges: the optimizer overshoots minima and the loss explodes. Set it too low, and training is painfully slow and may stall in shallow local minima.
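The trade-off above can be seen on a toy problem. This is a minimal sketch (not from the source) of plain gradient descent on f(w) = w², whose gradient is 2w, run with three different learning rates:

```python
def descend(lr, steps=50, w=1.0):
    """Minimize f(w) = w**2 by gradient descent; the minimum is at w = 0."""
    for _ in range(steps):
        grad = 2 * w      # gradient of w**2
        w -= lr * grad    # the learning rate scales each update
    return w

print(abs(descend(0.01)))  # too low: after 50 steps, still far from 0
print(abs(descend(0.1)))   # reasonable: converges close to 0
print(abs(descend(1.1)))   # too high: |w| grows each step and diverges
```

Even on this one-parameter problem, a 10x change in the learning rate is the difference between crawling, converging, and exploding.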
Finding the right learning rate is typically the first and most impactful tuning step when training a new model. The learning rate range test, which increases the learning rate exponentially over a short training run and observes where the loss decreases fastest and where it begins to diverge, provides a practical starting point. Common defaults have evolved with experience: the Adam optimizer usually works well with learning rates around 1e-3 to 3e-4, while fine-tuning pre-trained models typically requires much lower rates (1e-5 to 5e-5) to avoid destroying learned representations.
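The range test can be sketched in a few lines. This is an illustrative toy (again using f(w) = w² as a stand-in for a real model and loss), not a production implementation; the function name and defaults are hypothetical:

```python
def lr_range_test(lr_min=1e-5, lr_max=10.0, num_steps=100, w=1.0):
    """Grow the learning rate exponentially each step on a toy
    quadratic loss, recording (lr, loss) so we can see where the
    loss falls fastest and where it starts to blow up."""
    growth = (lr_max / lr_min) ** (1 / (num_steps - 1))
    history = []
    for step in range(num_steps):
        lr = lr_min * growth ** step
        loss = w * w
        history.append((lr, loss))
        w -= lr * (2 * w)   # one gradient step at the current lr
    return history
```

In practice one plots loss against learning rate and picks a value somewhat below the point where the loss curve bottoms out and starts rising.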
Learning rate schedules vary the learning rate over the course of training. Warm-up schedules start with a very small learning rate and gradually increase to the target value, stabilizing early training when gradients are noisy and model weights are far from optimal. Decay schedules (step decay, cosine annealing, exponential decay) reduce the learning rate as training progresses, allowing finer parameter adjustments as the model approaches convergence. Modern optimizers like Adam adapt the learning rate per-parameter based on gradient history, partially automating this process.
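A common combination of the two ideas is linear warm-up followed by cosine annealing. The sketch below is illustrative (the function and its parameters are made up for this example, not taken from a particular library):

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warm-up to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Ramp linearly from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine anneal from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))          # tiny at the very start
print(lr_at(99, total))         # reaches base_lr at the end of warm-up
print(lr_at(total - 1, total))  # near zero at the end of training
```

The schedule is a pure function of the step count, so it can be queried or plotted without running training, and plugged into any optimizer that lets you set the learning rate per step.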
The learning rate is the dial between 'learns nothing' and 'learns chaotically' — finding the right value, and scheduling how it changes over training, is often the difference between a model that converges and one that doesn't.

