Deep Learning · Advanced · Also known as: Seq2Seq, Sequence-to-Sequence

Encoder-Decoder Architecture

Definition

A neural network design pattern consisting of two components: an encoder that compresses input into a compact internal representation, and a decoder that expands that representation into the desired output format.

In Depth

The Encoder-Decoder architecture is a fundamental design pattern in deep learning for tasks where the input and output have different structures, lengths, or modalities. The encoder processes the entire input (a sentence, an image, a document) and compresses it into a fixed or variable-length internal representation — often called a latent vector, context vector, or hidden state. The decoder then takes this representation and generates the output step by step or all at once, depending on the architecture.
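The split described above can be sketched in a few lines of plain Python. This is a toy illustration, not a trained model: the function names and the hash-free pseudo-embedding scheme are invented here to show the shape of the pattern, namely that the encoder maps a variable-length input to a fixed-size context vector, which the decoder then expands into an output sequence.

```python
def embed(token, dim):
    """Toy deterministic 'embedding': derive dim numbers from the token's characters."""
    base = sum(map(ord, token))
    return [((base * (i + 1)) % 97) / 97.0 for i in range(dim)]

def encode(tokens, dim=4):
    """Compress a token sequence into a fixed-size context vector
    by averaging the toy embeddings of its tokens."""
    context = [0.0] * dim
    for tok in tokens:
        vec = embed(tok, dim)
        for i in range(dim):
            context[i] += vec[i]
    return [c / max(len(tokens), 1) for c in context]

def decode(context, steps=3):
    """Expand the context vector into an output sequence, one step at a
    time, carrying a (toy) internal state between steps."""
    output, state = [], list(context)
    for _ in range(steps):
        output.append(round(sum(state), 3))   # toy "generation" step
        state = [s * 0.5 for s in state]      # toy state update
    return output

context = encode(["the", "cat", "sat"])
print(len(context))        # fixed size, regardless of input length
print(decode(context, steps=5))
```

Whatever the input length, `encode` returns the same-sized vector, and `decode` can produce an output of a different length: this decoupling is what lets the pattern handle inputs and outputs with mismatched structures.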

This pattern was originally popularized for machine translation using RNNs: the encoder reads a source sentence word by word and produces a context vector, from which the decoder generates the target translation word by word. The addition of the Attention Mechanism was a breakthrough: it allowed the decoder to selectively focus on relevant parts of the encoder's output at each generation step, dramatically improving translation quality, especially for long sentences. The Transformer architecture generalized this with self-attention in both encoder and decoder, leading to encoder-decoder models such as T5 and BART; GPT, by contrast, adopted a decoder-only variant of the Transformer.
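The attention step mentioned above can be sketched concretely. Assuming the common scaled dot-product form (a modeling choice, not spelled out in the text): the decoder scores every encoder state against its current query, normalizes the scores with a softmax, and takes the weighted sum as a context vector focused on the relevant inputs.

```python
import math

def attend(query, encoder_states):
    """Scaled dot-product attention over a list of encoder state vectors.
    Returns the attention weights and the weighted context vector."""
    d = len(query)
    # score each encoder state against the query
    scores = [sum(q * h for q, h in zip(query, state)) / math.sqrt(d)
              for state in encoder_states]
    # softmax (subtracting the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of encoder states = context vector for this step
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(d)]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([1.0, 0.0], states)
print(weights)  # the first state, most similar to the query, gets the largest weight
```

At each generation step the decoder recomputes these weights with a fresh query, so different output words can focus on different parts of the input, which is exactly what fixed context vectors could not do.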

The encoder-decoder pattern extends far beyond text. In computer vision, U-Net uses an encoder-decoder structure for image segmentation: the encoder extracts features at progressively lower resolutions, and the decoder reconstructs a pixel-level segmentation map. Autoencoders use this pattern for unsupervised learning, compressing data to a low-dimensional bottleneck and reconstructing it. Variational Autoencoders (VAEs) are built directly on this pattern, and latent Diffusion Models rely on a VAE's encoder and decoder to move between pixel space and latent space. Understanding this architecture provides a unified lens for many seemingly different AI systems.
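The autoencoder bottleneck mentioned above can be illustrated with a toy example (no learning involved, and the compression scheme here is invented for illustration): the "encoder" halves the dimensionality by averaging adjacent values, and the "decoder" reconstructs the original size by repeating each latent value. Smooth inputs survive the bottleneck almost intact; fine detail is lost.

```python
def compress(x):
    """Toy encoder: halve the dimension by averaging each adjacent pair."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def reconstruct(z):
    """Toy decoder: restore the original dimension by repeating each latent value."""
    return [v for v in z for _ in range(2)]

x = [1.0, 1.0, 4.0, 4.0, 2.0, 3.0]
z = compress(x)            # latent code, half the size of the input
x_hat = reconstruct(z)
error = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
print(z)
print(error)               # small for inputs whose neighbors are similar
```

A real autoencoder learns its compression and reconstruction functions from data, but the structure is the same: information must squeeze through a representation smaller than the input, forcing the model to keep only what matters.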

Key Takeaway

The Encoder-Decoder architecture compresses input into a compact representation and then decodes it into output — a versatile pattern that powers translation, summarization, image segmentation, and generative AI.

Real-World Applications

01 Machine translation: encoding a sentence in one language and decoding it into another, as in the original Seq2Seq and modern Transformer translation models.
02 Text summarization: encoding a long document and decoding a concise summary that captures the essential information.
03 Image segmentation: U-Net's encoder-decoder architecture produces pixel-level labels for medical imaging, satellite imagery, and autonomous driving.
04 Speech-to-text: encoding audio spectrograms and decoding them into text transcriptions using models like Whisper.
05 Image captioning: encoding an image through a vision model and decoding a natural language description of its contents.