Deep Learning · Advanced · Also known as: Self-Attention, Scaled Dot-Product Attention

Attention Mechanism

Definition

A technique that allows neural networks to dynamically focus on the most relevant parts of an input when producing each element of the output — enabling models to capture long-range dependencies without recurrence.

In Depth

Attention was first introduced as a way to improve sequence-to-sequence models for machine translation. Instead of compressing an entire input sentence into a single fixed-length vector — a bottleneck that caused performance to degrade on long sentences — attention allowed the decoder to look back at all encoder states and weight their contributions at each decoding step. This simple addition dramatically improved translation quality.

Self-attention — the version used in Transformers — extends this idea to within a single sequence. Each element (token) is projected into learned Query, Key, and Value vectors; relevance scores are computed as the scaled dot product of one token's Query with every token's Key. High scores mean 'pay attention to this element'; low scores mean 'ignore it'. After a softmax over the scores, the output for each token is a weighted average of all tokens' Value vectors. This allows a word to gather context from anywhere in the sequence, regardless of distance.
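The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random stand-ins for what a real model would learn, and the sequence length and dimensions are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core self-attention step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # weighted average of Values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
X = rng.standard_normal((seq_len, d_k))             # token embeddings
# Stand-in "learned" Q/K/V projections (random here for illustration)
Wq, Wk, Wv = (rng.standard_normal((d_k, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)              # (4, 8): one context-mixed vector per token
print(weights.sum(axis=-1))   # each token's attention weights sum to 1
```

Note the 1/sqrt(d_k) scaling: without it, dot products grow with the dimension and push the softmax into near-one-hot regions, which makes gradients vanish during training.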

Multi-head attention runs this process in parallel with multiple sets of learned Q/K/V matrices (heads), each attending to different types of relationships — syntactic, semantic, coreference, and more. The outputs of all heads are concatenated and projected, giving the model a rich, multi-perspective representation of each token. Combined with positional encoding (which tells the model the position of each token since there's no inherent order in the parallel computation), attention underpins all modern Transformer-based systems.
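The split-attend-concatenate-project pattern described above can be sketched as follows. This is an illustrative sketch with random stand-in weight matrices; a trained model learns Wq, Wk, Wv, and the output projection Wo.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split each projection into heads: (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Each head attends independently over the full sequence
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh
    # Concatenate heads back to (seq_len, d_model) and project
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 16, 4
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (5, 16): same shape as the input, enriched per head
```

Because each head works in a d_model / n_heads subspace, running many heads costs roughly the same as one full-width head, while letting different heads specialize in different relationships.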

Key Takeaway

Attention is how AI models focus — dynamically weighting which parts of the input are most relevant to each step of the output, enabling Transformers to capture context over arbitrarily long sequences.

Real-World Applications

01 Machine translation: attention allowing the decoder to focus on different source words when generating each target word.
02 Document summarization: attention identifying the most salient sentences and phrases across a long document.
03 Question answering: attention focusing on the relevant passage segment when answering a question from a long context.
04 Image captioning with Vision Transformers: attention determining which image regions are most relevant to each generated word.
05 Protein structure prediction: attention in AlphaFold 2 modeling the interactions between pairs of amino acid residues.