Technical Concepts · Intermediate · Also known as: Step Size, LR

Learning Rate

Definition

A hyperparameter that controls the size of the weight updates during gradient descent — determining how quickly or slowly a model learns from its training data.

In Depth

The learning rate is widely regarded as the single most important hyperparameter in training a neural network. It controls the step size taken in the opposite direction of the gradient during each weight update: with plain SGD, a learning rate of 0.01 means each weight is nudged by 1% of its gradient at every update step. Set it too high, and training diverges as the optimizer overshoots minima and the loss explodes. Set it too low, and training is painfully slow and may get stuck in shallow local minima.
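To make the update rule concrete, here is a minimal sketch of a plain gradient descent step on a toy quadratic loss; the target vector, learning rate, and step count are illustrative values, not drawn from any particular model.

```python
import numpy as np

# Toy quadratic loss: L(w) = 0.5 * ||w - w_target||^2, so grad L = w - w_target
w_target = np.array([3.0, -2.0])
w = np.zeros(2)

learning_rate = 0.01  # each update moves w by 1% of the current gradient

for step in range(1000):
    grad = w - w_target          # gradient of the toy loss at the current w
    w -= learning_rate * grad    # step in the opposite direction of the gradient

print(w)  # close to [3.0, -2.0]; with a rate above 2.0 this toy example diverges
```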

Finding the right learning rate is typically the first and highest-impact tuning step when training a new model. The learning rate range test, which increases the learning rate exponentially over a short training run and notes where the loss falls most steeply before it starts to blow up, provides a practical starting point. Common defaults have evolved with experience: the Adam optimizer typically works well with learning rates around 1e-3 to 3e-4, while fine-tuning pre-trained models usually requires much lower rates (1e-5 to 5e-5) to avoid destroying learned representations.
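As a sketch of that range test, the loop below sweeps the learning rate exponentially over a short run and records the loss at each step; `model`, `loader`, and `loss_fn` are placeholders for your own network, DataLoader, and loss function rather than any library API, and the sweep bounds are illustrative.

```python
import math
import torch

def lr_range_test(model, loader, loss_fn, lr_start=1e-7, lr_end=1.0, num_steps=200):
    """Exponentially increase the learning rate and record (lr, loss) pairs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start)
    gamma = (lr_end / lr_start) ** (1.0 / num_steps)  # per-step LR multiplier
    history = []

    data_iter = iter(loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:              # restart the loader on short datasets
            data_iter = iter(loader)
            x, y = next(data_iter)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

        lr = optimizer.param_groups[0]["lr"]
        history.append((lr, loss.item()))
        if not math.isfinite(loss.item()):   # stop once the loss blows up
            break
        for group in optimizer.param_groups: # grow the learning rate exponentially
            group["lr"] *= gamma

    return history  # plot loss vs. lr and pick a value on the steep descent
```

Plotting loss against learning rate and choosing a value slightly below the point where the curve is steepest is the usual way to read the result.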

Learning rate schedules vary the learning rate over the course of training. Warm-up schedules start with a very small learning rate and gradually increase to the target value, stabilizing early training when gradients are noisy and model weights are far from optimal. Decay schedules (step decay, cosine annealing, exponential decay) reduce the learning rate as training progresses, allowing finer parameter adjustments as the model approaches convergence. Modern optimizers like Adam adapt the learning rate per-parameter based on gradient history, partially automating this process.
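One common way to express the warm-up-plus-decay pattern is as a small function of the training step; the sketch below assumes a linear warm-up into a cosine decay, and the peak rate, warm-up length, and total step count are illustrative defaults rather than values from any specific recipe.

```python
import math

def warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=1_000,
                     total_steps=100_000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (peak_lr - min_lr) * cosine             # cosine decay
```

In a training loop the returned value would be written into the optimizer's parameter groups before each step.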

Key Takeaway

The learning rate is the dial between 'learns nothing' and 'learns chaotically' — finding the right value, and scheduling how it changes over training, is often the difference between a model that converges and one that doesn't.

Real-World Applications

01 LLM pre-training: cosine annealing schedules that reduce learning rate from 1e-4 to near zero over hundreds of billions of training tokens.
02 Fine-tuning BERT: low learning rates (2e-5 to 5e-5) preventing catastrophic forgetting of pre-trained representations.
03 Learning rate finder: automatic tools in PyTorch Lightning and fastai that identify the optimal learning rate in a few hundred training steps.
04 Cyclical learning rates: varying the rate between bounds in cycles, enabling models to escape local minima and improve final performance (see the sketch after this list).
05 Adaptive optimizers: Adam and AdaFactor automatically adjusting per-parameter learning rates based on gradient history during LLM training.
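For the cyclical schedule in item 04, a minimal sketch using PyTorch's built-in `CyclicLR` scheduler might look like the following; the bounds, cycle length, and one-layer model are placeholders for illustration.

```python
import torch

model = torch.nn.Linear(10, 1)   # stand-in for a real network

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,        # lower bound of each cycle
    max_lr=1e-2,         # upper bound of each cycle
    step_size_up=2000,   # batches taken to climb from base_lr to max_lr
    mode="triangular",
)

# Inside the training loop, call optimizer.step() and then scheduler.step()
# once per batch so the learning rate oscillates between the two bounds.
```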

Frequently Asked Questions

What is a good starting learning rate?

Common defaults: 1e-3 to 3e-4 for the Adam optimizer on most tasks, and 1e-5 to 5e-5 for fine-tuning pre-trained models (to avoid destroying learned representations). For precise selection, use a learning rate finder: gradually increase the learning rate during a short training run and pick the rate where the loss decreases most steeply. Libraries like PyTorch Lightning and fastai include automated finders.

What is a learning rate schedule?

A schedule varies the learning rate during training. Common schedules: cosine annealing (gradually decreases to near zero, widely used for LLMs), step decay (reduce by a factor at specific epochs), warm-up + decay (start low, ramp up, then decay — standard for Transformer training), and cyclical (oscillate between bounds). Schedules help balance fast early progress with fine-grained later convergence.
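One way to build the warm-up + decay pattern from PyTorch's built-in schedulers is to chain a `LinearLR` warm-up into `CosineAnnealingLR` with `SequentialLR`, as sketched below; the peak rate, warm-up length, total step count, and the one-layer model are illustrative.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 1)              # stand-in for a real network
total_steps, warmup_steps = 100_000, 1_000

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # peak learning rate

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
decay = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay],
                         milestones=[warmup_steps])

# In the training loop: optimizer.step(), then scheduler.step(), once per batch.
```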

Why does fine-tuning require a lower learning rate?

Pre-trained models already contain valuable learned representations. A high learning rate would make large updates that destroy these representations — a phenomenon called catastrophic forgetting. Low learning rates (10-100x smaller than pre-training) make small, careful adjustments that adapt the model to the new task while preserving most of the pre-trained knowledge.
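A minimal fine-tuning setup reflecting this might look like the sketch below, assuming the Hugging Face `transformers` library is available; the checkpoint name, label count, and exact rate are illustrative choices.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load pre-trained weights; the classification head is newly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A rate of 2e-5 is 10-100x smaller than typical pre-training rates (~1e-4),
# so updates nudge the pre-trained weights rather than overwrite them.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```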