Deep Learning · Intermediate · Also known as: Steepest Descent

Gradient Descent

Definition

An iterative optimization algorithm that minimizes a loss function by updating model parameters in small steps in the direction opposite to the gradient — progressively reducing prediction error.
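In symbols, one update step takes the current parameters, the learning rate, and the gradient of the loss, and moves against the gradient. The notation below is the conventional one (θ for parameters, η for the learning rate, L for the loss), not taken from this entry:

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```

Since the gradient points in the direction of steepest increase of the loss, stepping against it decreases the loss whenever η is small enough.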

In Depth

Gradient Descent is the optimization engine of deep learning. The intuition is geometric: imagine standing on a hilly landscape where altitude represents the loss (prediction error). At each step, Gradient Descent asks 'which direction is downhill?' and takes a small step that way. Repeat this enough times and you descend from wherever you started toward a valley, a local or global minimum of the loss function.
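As a minimal sketch of that loop, here is gradient descent on a toy one-dimensional loss (the function, starting point, learning rate, and step count are illustrative choices, not part of the glossary entry):

```python
# Minimal gradient descent on a toy 1-D loss: L(w) = (w - 3)^2.
# The minimum is at w = 3; the gradient is dL/dw = 2 * (w - 3).

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting point
learning_rate = 0.1  # step size

for step in range(50):
    w -= learning_rate * grad(w)  # small step opposite to the gradient

print(w, loss(w))  # w ends up very close to 3, loss close to 0
```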

In practice, computing the gradient over the entire training dataset (Batch Gradient Descent) is computationally prohibitive for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single random example per step — much faster, but noisy. Mini-batch Gradient Descent, the standard in practice, uses small batches (typically 32-512 examples) to balance speed and stability. Modern optimizers like Adam combine gradient descent with momentum and adaptive learning rates, converging faster and more reliably than plain SGD.
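A sketch of the mini-batch variant, using synthetic linear-regression data as an assumed toy setup (the batch size, learning rate, and epoch count below are illustrative, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: y = X @ w_true + noise.
n_samples, n_features = 10_000, 5
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.1 * rng.normal(size=n_samples)

w = np.zeros(n_features)  # parameters to learn
lr = 0.05                 # learning rate
batch_size = 64           # a typical mini-batch size

for epoch in range(5):
    perm = rng.permutation(n_samples)           # reshuffle each epoch
    for start in range(0, n_samples, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error, estimated on this mini-batch only.
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad                          # one noisy descent step

print(np.max(np.abs(w - w_true)))  # small: the noisy steps still converge
```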

The Learning Rate hyperparameter controls the step size at each iteration. Too large, and the optimizer overshoots minima, causing training to diverge. Too small, and convergence is agonizingly slow. Learning rate schedules — which reduce the rate over time — and techniques like warm-up (starting with a small rate and gradually increasing it) are standard practices for training large models. Finding the right learning rate is often the single most impactful hyperparameter decision.
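One common schedule shape is linear warm-up followed by cosine decay; a sketch is below. The peak rate, warm-up length, and total step count are assumptions for illustration, not values from this entry:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1_000, total_steps=100_000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warm-up: ramp the rate up from near zero so early, noisy gradients
        # don't throw the parameters far from their initialization.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: smoothly shrink the rate over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(1_000), lr_at(100_000))  # tiny -> peak -> ~0
```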

Key Takeaway

Gradient Descent is how neural networks correct their mistakes — repeatedly measuring which direction reduces error most and taking small steps in that direction, until prediction accuracy can't be meaningfully improved further.

Real-World Applications

01 Neural network training: adjusting billions of parameters using Adam-optimized gradient descent across thousands of GPU hours.
02 Linear regression optimization: finding the best-fit line by minimizing mean squared error with gradient descent, on a problem that also admits a closed-form solution (see the sketch after this list).
03 Reinforcement learning: using policy gradient methods to optimize agent behavior by ascending the reward gradient.
04 Generative model training: minimizing reconstruction loss in VAEs or adversarial loss in GANs through gradient updates.
05 Hyperparameter optimization: differentiable hyperparameter search methods that gradient-descend through the hyperparameter space.
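To make item 02 concrete, here is a small sketch comparing full-batch gradient descent against the closed-form least-squares solution; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D linear regression: fit y ≈ slope * x + intercept.
x = rng.uniform(-1, 1, size=200)
y = 2.5 * x - 0.7 + 0.1 * rng.normal(size=200)

# Design matrix with a bias column, so params = [slope, intercept].
X = np.column_stack([x, np.ones_like(x)])

# Closed-form least-squares solution.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full-batch gradient descent on the same mean-squared-error objective.
w = np.zeros(2)
lr = 0.1
for _ in range(2_000):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    w -= lr * grad

print(w_closed)  # roughly [2.5, -0.7]
print(w)         # gradient descent lands on (almost) the same values
```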

Frequently Asked Questions

What is the difference between Gradient Descent, SGD, and Adam?

Batch Gradient Descent computes the gradient over the entire dataset per step — accurate but slow. Stochastic Gradient Descent (SGD) uses one random sample — fast but noisy. Mini-batch SGD (the practical standard) uses small batches of 32-512 samples. Adam improves on SGD by adapting the learning rate per-parameter and incorporating momentum, making it faster and more reliable for most tasks.
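One way to see the difference is to write the two update rules side by side. The sketch below follows the standard published Adam formulas with the usual default constants; the parameter and gradient values are made up for illustration:

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Plain SGD: the same step size for every parameter."""
    return params - lr * grads

def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (m) plus a per-parameter adaptive scale (v)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grads       # running mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grads ** 2  # running mean of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])              # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return params - lr * m_hat / (np.sqrt(v_hat) + eps)

params = np.array([1.0, -2.0])
state = {"t": 0, "m": np.zeros_like(params), "v": np.zeros_like(params)}
grads = np.array([0.5, -0.1])  # pretend gradients from some loss

print(sgd_step(params, grads))
print(adam_step(params, grads, state))  # each parameter gets its own effective step size
```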

What happens if the learning rate is too high or too low?

Too high: the optimizer takes large steps that overshoot the minimum, causing the loss to oscillate or explode — the model never converges. Too low: the optimizer takes tiny steps, making training extremely slow and potentially getting stuck in shallow local minima. The optimal learning rate enables fast convergence without instability.
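A quick numerical illustration on the toy loss L(w) = w², whose gradient is 2w; the three rates below are arbitrary picks to show the regimes:

```python
def run(lr, steps=30, w0=5.0):
    """Gradient descent on L(w) = w^2 (minimum at w = 0, gradient 2w)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

print(run(1.5))     # too high: |w| explodes (each step multiplies w by -2)
print(run(0.0005))  # too low: w has barely moved after 30 steps
print(run(0.25))    # reasonable: w is essentially 0, the minimum
```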

Can gradient descent get stuck in local minima?

In theory, yes — gradient descent follows the steepest path downhill and can get trapped in local minima. In practice, for high-dimensional neural networks, true local minima are rare; most apparent traps are saddle points, which optimizers like Adam handle well. Techniques like learning rate scheduling, momentum, and stochastic noise from mini-batches help escape poor regions.
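A toy sketch of the saddle-point behaviour described above, on the classic surface L(x, y) = x² − y². The surface, starting point, and noise level are illustrative assumptions, and on this unbounded toy surface "escaping" simply means the iterate leaves the saddle rather than reaching a minimum:

```python
import numpy as np

rng = np.random.default_rng(2)

def grad(x, y):
    # L(x, y) = x^2 - y^2 has a saddle at (0, 0): curving up along x, down along y.
    return np.array([2.0 * x, -2.0 * y])

def descend(noise_scale, steps=200, lr=0.05):
    p = np.array([1.0, 0.0])  # start exactly on the saddle's ridge (y = 0)
    for _ in range(steps):
        g = grad(*p) + noise_scale * rng.normal(size=2)  # noisy gradient estimate
        p -= lr * g
    return p

print(descend(noise_scale=0.0))   # exact GD parks at ~(0, 0), the saddle point
print(descend(noise_scale=0.01))  # a little noise kicks y off 0 and the iterate escapes
```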