Deep Learning · Intermediate · Also known as: Steepest Descent

Gradient Descent

Definition

An iterative optimization algorithm that minimizes a loss function by updating model parameters in small steps in the direction opposite to the gradient — progressively reducing prediction error.

In Depth

Gradient Descent is the optimization engine of deep learning. Its geometry is intuitive: imagine standing on a hilly landscape where altitude represents the loss (prediction error). Gradient Descent asks 'which direction is downhill?' at each step, and takes a small step in that direction. Repeat this enough times, and you descend from wherever you started toward a valley — a local or global minimum of the loss function.
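The downhill intuition above can be sketched in a few lines. This is a minimal illustration, not a production optimizer: it minimizes the one-dimensional function f(x) = (x − 3)², whose gradient is 2(x − 3), and the step size and iteration count are arbitrary choices for the example.

```python
# Minimal sketch: gradient descent on f(x) = (x - 3)^2.
# The gradient is f'(x) = 2 * (x - 3); the minimum sits at x = 3.
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # small step opposite to the gradient
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # converges toward 3.0
```

Each iteration asks "which way is downhill?" (the negative gradient) and moves a little in that direction; after enough steps the estimate settles into the valley at x = 3.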

In practice, computing the gradient over the entire training dataset (Batch Gradient Descent) is computationally prohibitive for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single random example per step — much faster, but noisy. Mini-batch Gradient Descent, the standard in practice, uses small batches (typically 32-512 examples) to balance speed and stability. Modern optimizers like Adam combine gradient descent with momentum and adaptive learning rates, converging faster and more reliably than plain SGD.
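The mini-batch variant described above can be sketched for a toy linear regression. This is an illustrative example, assuming synthetic data with a known true slope and intercept; the batch size, learning rate, and epoch count are arbitrary demonstration values.

```python
import numpy as np

# Sketch: mini-batch gradient descent fitting y = w*x + b by minimizing
# mean squared error. True parameters: w = 2.0, b = 0.5 (plus small noise).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 0.5 + rng.normal(0, 0.05, size=200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = (w * X[batch] + b) - y[batch]
        grad_w = 2 * np.mean(err * X[batch])  # d(MSE)/dw over the mini-batch
        grad_b = 2 * np.mean(err)             # d(MSE)/db over the mini-batch
        w -= lr * grad_w
        b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to the true 2.0 and 0.5
```

Each mini-batch gradient is a noisy but cheap estimate of the full-dataset gradient, which is exactly the speed/stability trade-off the paragraph above describes.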

The Learning Rate hyperparameter controls the step size at each iteration. Too large, and the optimizer overshoots minima, causing training to diverge. Too small, and convergence is agonizingly slow. Learning rate schedules — which reduce the rate over time — and techniques like warm-up (starting with a small rate and gradually increasing it) are standard practices for training large models. Finding the right learning rate is often the single most impactful hyperparameter decision.
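A common way to combine the warm-up and decay ideas above is a schedule function queried at every step. The sketch below assumes a linear warm-up followed by cosine decay; the base rate, warm-up length, and total step count are illustrative values, not canonical ones.

```python
import math

# Sketch: linear warm-up followed by cosine decay of the learning rate.
def lr_schedule(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        # warm-up: ramp linearly from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # decay: cosine curve from base_lr down to zero
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_schedule(0))     # tiny rate at the start (warm-up)
print(lr_schedule(99))    # full base_lr at the end of warm-up
print(lr_schedule(1000))  # decays to zero by the final step
```

The optimizer would call `lr_schedule(step)` before each parameter update, so early updates are gentle and late updates fine-tune rather than overshoot.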

Key Takeaway

Gradient Descent is how neural networks correct their mistakes — repeatedly measuring which direction reduces error most and taking small steps in that direction, until prediction accuracy can't be meaningfully improved further.

Real-World Applications

01 Neural network training: adjusting billions of parameters using Adam-optimized gradient descent across thousands of GPU hours.
02 Linear regression optimization: finding the best-fit line by minimizing mean squared error with gradient descent, even though the problem also admits a closed-form solution.
03 Reinforcement learning: using policy gradient methods to optimize agent behavior by ascending the reward gradient.
04 Generative model training: minimizing reconstruction loss in VAEs or adversarial loss in GANs through gradient updates.
05 Hyperparameter optimization: differentiable hyperparameter search methods that gradient-descend through the hyperparameter space.
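Application 01 mentions Adam-optimized training; the core update rule can be sketched for a single scalar parameter. The beta and epsilon values below are the commonly cited defaults, and the toy objective is the same quadratic used throughout this entry.

```python
import math

# Sketch: the Adam update rule for one scalar parameter.
def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first moment: momentum
        v = beta2 * v + (1 - beta2) * g * g  # second moment: adaptive scaling
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2; the iterate approaches the minimum at x = 3.
print(adam_minimize(lambda x: 2 * (x - 3), x0=0.0))
```

Momentum smooths the noisy gradient direction while the second-moment estimate scales each step by recent gradient magnitude, which is why Adam typically converges faster and more reliably than plain SGD.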