Advanced Recurrent Neural Network variants that use gating mechanisms to selectively retain or forget information — overcoming the vanishing gradient problem and enabling learning of long-term dependencies in sequences.
In Depth
LSTM (Long Short-Term Memory), introduced by Hochreiter & Schmidhuber in 1997, was a breakthrough solution to the vanishing gradient problem that plagued standard RNNs. An LSTM cell contains three gates — input, forget, and output — that regulate what information flows into, persists in, and is read from the cell state. The forget gate decides what past information to discard; the input gate determines what new information to store; the output gate controls what part of the cell state is exposed as the hidden state passed to the next time step. This selective memory allows LSTMs to maintain relevant context across hundreds of time steps.
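The gate mechanics can be sketched with the standard LSTM update equations. This is a minimal, scalar illustration — real cells use weight matrices and vector states, and the weight names (`wf`, `uf`, etc.) here are placeholders, not any library's API:

```python
import math

def sigmoid(x):
    # Squashes to (0, 1): each gate outputs a "how much to let through" factor.
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, W):
    # W holds scalar weights/biases for illustration; real LSTMs use matrices.
    f = sigmoid(W["wf"] * x + W["uf"] * h_prev + W["bf"])        # forget gate
    i = sigmoid(W["wi"] * x + W["ui"] * h_prev + W["bi"])        # input gate
    o = sigmoid(W["wo"] * x + W["uo"] * h_prev + W["bo"])        # output gate
    c_tilde = math.tanh(W["wc"] * x + W["uc"] * h_prev + W["bc"])  # candidate memory
    c = f * c_prev + i * c_tilde   # keep a fraction of old memory, add new
    h = o * math.tanh(c)           # expose a gated view of the cell state
    return h, c
```

Note the additive update `c = f * c_prev + i * c_tilde`: when the forget gate saturates near 1, the cell state (and its gradient) passes through nearly unchanged, which is exactly how LSTMs avoid vanishing gradients over long sequences.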
GRU (Gated Recurrent Unit), introduced by Cho et al. in 2014, is a simplified variant with two gates — reset and update — that achieves similar performance to LSTMs with fewer parameters and faster training. The update gate combines the LSTM's forget and input gates; the reset gate controls how much past information to combine with the current input. GRUs are often preferred when computational efficiency is critical or when datasets are smaller.
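For contrast, a GRU step in the same scalar sketch style (again, weight names are illustrative placeholders, and real cells operate on vectors). With only two gates and no separate cell state, the parameter savings over the LSTM are visible directly:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h_prev, W):
    z = sigmoid(W["wz"] * x + W["uz"] * h_prev + W["bz"])  # update gate
    r = sigmoid(W["wr"] * x + W["ur"] * h_prev + W["br"])  # reset gate
    # Reset gate scales how much past state enters the candidate.
    h_tilde = math.tanh(W["wh"] * x + W["uh"] * (r * h_prev) + W["bh"])
    # Update gate interpolates between old state and candidate
    # (one gate doing the job of LSTM's separate forget and input gates).
    h = (1.0 - z) * h_prev + z * h_tilde
    return h
```

Because the hidden state is a convex combination of `h_prev` and `h_tilde`, a near-zero update gate copies the old state forward almost untouched, giving the GRU the same gradient-preserving shortcut as the LSTM's cell state.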
Despite their power for sequential modeling, both LSTMs and GRUs have been largely surpassed by Transformers for NLP tasks, because Transformers process entire sequences in parallel and scale far more effectively with data and compute. However, LSTMs and GRUs retain relevance for real-time applications requiring sequential inference with low latency, edge device deployment where transformer scale is impractical, and time series modeling where their inductive bias toward sequential order is an advantage.
LSTMs and GRUs solved the vanishing gradient problem with learnable gates — giving RNNs a surgical, trainable memory that can preserve important context across long sequences.