An iterative optimization algorithm that minimizes a loss function by updating model parameters in small steps in the direction opposite to the gradient — progressively reducing prediction error.
In Depth
Gradient Descent is the optimization engine of deep learning. Its geometry is intuitive: imagine standing on a hilly landscape where altitude represents the loss (prediction error). At each step, Gradient Descent asks "which direction is downhill?" and takes a small step that way. Repeat this enough times, and you descend from wherever you started toward a valley — a local or global minimum of the loss function.
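The "step downhill" loop can be sketched in a few lines. This is a minimal, illustrative example, not from the original text: full-batch gradient descent fitting the slope of a one-variable linear regression by minimizing mean squared error.

```python
import numpy as np

# Toy data: y ≈ 3.0 * x plus a little noise (all names/values illustrative)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0     # start anywhere on the loss landscape
lr = 0.1    # learning rate: the step size

for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # dL/dw of the MSE loss
    w -= lr * grad                       # step in the downhill direction

print(w)  # converges close to the true slope, 3.0
```

The entire algorithm is the last line of the loop: parameters minus learning rate times gradient. Everything else in deep learning optimization is a refinement of that update.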
In practice, computing the gradient over the entire training dataset (Batch Gradient Descent) is computationally prohibitive for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single random example per step — much faster, but noisy. Mini-batch Gradient Descent, the standard in practice, uses small batches (typically 32-512 examples) to balance speed and stability. Modern optimizers like Adam combine gradient descent with momentum and adaptive learning rates, converging faster and more reliably than plain SGD.
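The mini-batch variant changes only one thing: each update estimates the gradient from a small random subset rather than the full dataset. A sketch under the same toy-regression assumptions as above (batch size and constants are illustrative):

```python
import numpy as np

# Same toy regression, but updated with noisy mini-batch gradients
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

w, lr, batch_size = 0.0, 0.05, 32

for _ in range(500):
    idx = rng.integers(0, len(x), size=batch_size)  # sample a mini-batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)          # noisy gradient estimate
    w -= lr * grad
```

Each individual step is noisier than a full-batch step, but each step is far cheaper, and the noise averages out over many updates — which is why mini-batch training dominates in practice.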
The Learning Rate hyperparameter controls the step size at each iteration. Too large, and the optimizer overshoots minima, causing training to diverge. Too small, and convergence is agonizingly slow. Learning rate schedules — which reduce the rate over time — and techniques like warm-up (starting with a small rate and gradually increasing it) are standard practices for training large models. Finding the right learning rate is often the single most impactful hyperparameter decision.
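A common way to combine warm-up and decay is a schedule function evaluated at each training step. The sketch below uses linear warm-up followed by cosine decay, one popular shape among many; the function name and constants are illustrative, not from the text.

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Learning rate for a given step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Warm-up: ramp linearly from near 0 up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Decay: cosine curve from base_lr down toward 0 over remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The optimizer simply reads `lr_at(step)` before each update. Warm-up protects the early, unstable phase of training from large steps; the decay lets the optimizer settle into a minimum rather than bouncing around it.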
Gradient Descent is how neural networks correct their mistakes — repeatedly measuring which direction reduces error most and taking small steps in that direction, until prediction accuracy can't be meaningfully improved further.

