A mathematical function applied to each neuron's output that introduces non-linearity, enabling neural networks to learn complex, non-linear patterns rather than just linear combinations of inputs.
In Depth
Without activation functions, a neural network with any number of layers would be mathematically equivalent to a single-layer linear model — no matter how deep the stack, the overall transformation would remain linear. Activation functions introduce non-linearity at each neuron, making it possible for networks to learn the complex, non-linear patterns that appear in real data: curved decision boundaries, hierarchical abstractions, and intricate feature interactions.
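To make this concrete, here is a minimal NumPy sketch (illustrative only, not taken from any particular framework) showing that two stacked linear layers with no activation between them collapse into one equivalent linear layer:

import numpy as np

rng = np.random.default_rng(0)

# Two "layers" of weights and biases, with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Stacked linear layers: y = W2 @ (W1 @ x + b1) + b2
y_stacked = W2 @ (W1 @ x + b1) + b2

# The same mapping as a single linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

print(np.allclose(y_stacked, y_single))  # True: the extra layer added no expressive power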
The most widely used activation today is ReLU (Rectified Linear Unit): max(0, x). It outputs the input if positive and zero otherwise. Despite its simplicity, ReLU typically trains faster and performs better in deep networks than earlier alternatives like Sigmoid (which squashes outputs between 0 and 1) and Tanh (which squashes them between -1 and 1), largely because it mitigates the 'vanishing gradient' problem that plagued deep networks built on those activations. Leaky ReLU addresses ReLU's own limitation of 'dying' neurons (units that get stuck outputting zero), while the smoother GELU has become standard in Transformer models.
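For illustration, a minimal NumPy sketch of these activations. The GELU here uses the common tanh approximation, and the 0.01 negative slope for Leaky ReLU is just a typical default, not a required value:

import numpy as np

def relu(x):
    # max(0, x): passes positive values through, zeroes out negatives
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # like ReLU, but keeps a small slope for negative inputs so neurons don't "die"
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes any real number into (-1, 1)
    return np.tanh(x)

def gelu(x):
    # common tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.   ]
print(sigmoid(x))     # values strictly between 0 and 1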
The choice of activation function is a hyperparameter that affects training dynamics significantly. Sigmoid and Softmax are still used in output layers — Sigmoid for binary classification (outputting a probability between 0 and 1), Softmax for multi-class classification (outputting a probability distribution across classes). Understanding which activation to use where is fundamental to designing functional neural architectures.
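As a sketch of the output-layer choices described above (plain NumPy, with made-up logit values purely for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the max for numerical stability, then exponentiate and normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Binary classification: one logit -> one probability of the positive class
logit = 1.2
print(sigmoid(logit))        # ~0.77

# Multi-class classification: one logit per class -> a distribution summing to 1
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
print(probs, probs.sum())    # ~[0.79 0.18 0.04] ... 1.0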
Activation functions are the source of a neural network's expressive power — without them, even the deepest network would be no more powerful than a simple linear regression.
Frequently Asked Questions
Why do neural networks need activation functions?
Without activation functions, a neural network is just a series of linear transformations — which can only model straight-line relationships, no matter how many layers it has. Activation functions introduce non-linearity, allowing the network to learn complex patterns like curves, edges, and abstract concepts. They are what give neural networks their expressive power.
Which activation function should I use?
ReLU (Rectified Linear Unit) is the default for most hidden layers — it's fast, effective, and avoids the vanishing gradient problem. Use Sigmoid for binary classification outputs, Softmax for multi-class outputs, and GELU for Transformer-based models (BERT, GPT). Tanh is sometimes used in RNNs. Leaky ReLU or Swish can help if standard ReLU leads to 'dead neurons.'
What is the vanishing gradient problem?
Sigmoid and Tanh activations saturate for large inputs — their gradients approach zero. During backpropagation through many layers, these tiny gradients multiply together and shrink to near zero, making early layers unable to learn. This is the vanishing gradient problem. ReLU largely solves it because its gradient is either 0 or 1, not a tiny fraction.
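A small NumPy demonstration of the effect (illustrative numbers only):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 and approaches 0 for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 0 or 1

x = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(x))  # [0.25  0.105  0.0066  0.000045]: saturates quickly
print(relu_grad(x))     # [0. 1. 1. 1.]

# Backpropagation multiplies one such factor per layer; even Sigmoid's best case (0.25)
# shrinks rapidly in a deep stack:
print(0.25 ** 20)       # ~9.1e-13, effectively no learning signal for early layers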