A neural network architecture that processes entire sequences in parallel using self-attention mechanisms — eliminating recurrence and enabling the large-scale training that underlies modern LLMs and many state-of-the-art AI systems.
In Depth
The Transformer architecture, introduced in the landmark 2017 paper 'Attention Is All You Need' by Vaswani et al. at Google, is arguably the most consequential deep learning innovation of the past decade. It replaced the sequential, step-by-step processing of RNNs with a fundamentally different approach: processing all elements of a sequence simultaneously, using self-attention to determine which elements are relevant to each other.
The core mechanism is multi-head self-attention. For each element in a sequence (each token in a sentence), the attention mechanism computes a weighted sum over all elements' representations, including its own, where the weights are obtained by comparing the token's learned query vector against every token's key vector and normalizing the resulting scores with a softmax. A word like 'it' can therefore attend strongly to 'the cat' several sentences earlier if needed. This is computed in parallel for all tokens simultaneously — unlike RNNs, which process one token at a time — enabling efficient training on modern GPU hardware at scales that were previously impractical.
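A minimal, single-head sketch of this computation in NumPy is shown below. The dimensions and random projection matrices are illustrative placeholders only; the original architecture splits the computation across several heads with separate learned projections and adds an output projection on top.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X: (seq_len, d_model) token representations.
    W_q, W_k, W_v: (d_model, d_k) projection matrices (learned in practice, random here).
    Returns (seq_len, d_k): one context vector per token, each a relevance-weighted
    sum over all tokens, computed for every position at once.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise query-key relevance logits
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of value vectors

# Toy usage with random inputs and weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8)
```

Multi-head attention simply runs several such computations in parallel with different projection matrices and concatenates the results, letting each head specialize in a different kind of relationship between tokens.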
The Transformer's ability to scale with data and compute is its defining property. As datasets grow larger and compute budgets increase, Transformer-based models improve in a remarkably predictable way — an empirical 'scaling law' that has guided the development of GPT-2, GPT-3, GPT-4, Gemini, Claude, and virtually every major frontier model. Transformers are now applied far beyond language: Vision Transformers (ViT) for images, Audio Spectrogram Transformers for sound, and multi-modal models that jointly process text, images, and video.
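As a rough illustration of what such a scaling law states, empirical studies such as Kaplan et al. (2020) fit test loss to a power law in model size. The form below is a sketch; the constant and exponent are left symbolic because the fitted values depend on the dataset, tokenizer, and training setup.

```latex
% Empirical power-law form: loss L falls predictably as parameter count N grows
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N,\ N_c \ \text{fit empirically from training runs}
```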
The Transformer's self-attention mechanism made it possible to train language and multimodal models of unprecedented scale — shifting AI from the era of narrow specialists to broad, powerful generalists.

