A mathematical function applied to each neuron's output that introduces non-linearity, enabling neural networks to learn complex, non-linear patterns rather than just linear combinations of inputs.
In Depth
Without activation functions, a neural network with any number of layers would be mathematically equivalent to a single-layer linear model — no matter how deep the stack, the overall transformation would remain linear. Activation functions introduce non-linearity at each neuron, making it possible for networks to learn the complex, non-linear patterns that appear in real data: curved decision boundaries, hierarchical abstractions, and intricate feature interactions.
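The collapse-to-linear argument above can be checked directly: composing two linear layers with no activation in between is the same as applying a single linear layer whose weight matrix is the product of the two. A minimal NumPy sketch, with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # layer 1: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))  # layer 2: 4 hidden units -> 2 outputs
x = rng.normal(size=(3,))     # one input vector

# Two "layers" applied in sequence, with no non-linearity between them...
deep = W2 @ (W1 @ x)

# ...are equivalent to one linear layer with weight matrix W2 @ W1.
shallow = (W2 @ W1) @ x

print(np.allclose(deep, shallow))  # True
```

Inserting any non-linear function between `W1` and `W2` breaks this equivalence, which is exactly what gives depth its value.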
The most widely used activation today is ReLU (Rectified Linear Unit): max(0, x). It outputs the input if positive and zero otherwise. Despite its simplicity, networks using ReLU train faster and typically perform better than those using earlier alternatives like Sigmoid (which squashes outputs between 0 and 1) and Tanh (which squashes to -1 to 1), partly because ReLU avoids the 'vanishing gradient' problem that plagued deep networks built on those earlier activations. Leaky ReLU and GELU are common variants that address ReLU's own limitation of 'dying' neurons.
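The functions named above are each a one-line formula. A sketch of ReLU, Leaky ReLU, and Sigmoid in NumPy (the `alpha` slope for Leaky ReLU is an assumed default; 0.01 is a common choice):

```python
import numpy as np

def relu(x):
    # max(0, x): pass positive values through, zero out the rest
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # like ReLU, but negative inputs keep a small slope (alpha),
    # so the neuron's gradient never goes fully to zero ("dies")
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # [0.  0.  3. ]
print(leaky_relu(x))  # [-0.02  0.    3.  ]
print(sigmoid(0.0))   # 0.5
```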
The choice of activation function is a hyperparameter that affects training dynamics significantly. Sigmoid and Softmax are still used in output layers — Sigmoid for binary classification (outputting a probability between 0 and 1), Softmax for multi-class classification (outputting a probability distribution across classes). Understanding which activation to use where is fundamental to designing functional neural architectures.
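For the multi-class output case, Softmax exponentiates each logit and normalizes, so the outputs are non-negative and sum to 1 — a probability distribution over classes. A minimal sketch (the max-subtraction is the standard trick for numerical stability, not a change in the result):

```python
import numpy as np

def softmax(logits):
    # subtracting the max does not change the output but
    # prevents np.exp from overflowing on large logits
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())     # 1.0 -- a valid probability distribution
print(p.argmax())  # 0   -- the largest logit gets the largest probability
```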
Activation functions are the source of a neural network's expressive power — without them, even the deepest network would be no more powerful than a simple linear regression.

