Deep Learning · Advanced · Also: Self-Attention, Scaled Dot-Product Attention

Attention Mechanism

Definition

A technique that allows neural networks to dynamically focus on the most relevant parts of an input when producing each element of the output — enabling models to capture long-range dependencies without recurrence.

In Depth

Attention was first introduced as a way to improve sequence-to-sequence models for machine translation. Instead of compressing an entire input sentence into a single fixed-length vector — a bottleneck that caused performance to degrade on long sentences — attention allowed the decoder to look back at all encoder states and weight their contributions at each decoding step. This simple addition dramatically improved translation quality.
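
As a rough sketch of that original additive (Bahdanau-style) attention, here is one decoding step in NumPy; the parameter names W_s, W_h, and v are hypothetical stand-ins for the learned weights:

    import numpy as np

    def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
        """One decoding step of additive (Bahdanau-style) attention.

        decoder_state:  (d,)   current decoder hidden state s_t
        encoder_states: (T, d) all encoder hidden states h_1..h_T
        W_s, W_h, v:    learned parameters (shapes (d, d), (d, d), (d,))
        """
        # Alignment score per encoder position: v . tanh(W_s s_t + W_h h_i)
        scores = np.tanh(decoder_state @ W_s + encoder_states @ W_h) @ v  # (T,)
        # Softmax turns scores into weights that sum to 1
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # Context vector: the weighted average of encoder states the decoder
        # consumes at this step
        context = weights @ encoder_states  # (d,)
        return context, weights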

Self-attention — the version used in Transformers — extends this idea to operate within a single sequence. Each element (token) attends to every token in the sequence, including itself, computing relevance scores as dot products between learned Query and Key vectors; the scores are scaled by the square root of the key dimension and passed through a softmax (hence 'scaled dot-product attention'). High scores mean 'pay attention to this element'; low scores mean 'ignore it'. The output for each token is a weighted average of all tokens' Value vectors, weighted by these scores. This allows a word to gather context from anywhere in the sequence, regardless of distance.
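
A minimal NumPy sketch of this computation; X holds the token embeddings, and the W_* matrices are illustrative names for the learned projections:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        """X: (T, d_model) token embeddings; W_*: (d_model, d_k) projections."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project every token to Q/K/V
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, T) scaled dot products
        weights = softmax(scores, axis=-1)       # each row sums to 1
        return weights @ V                       # (T, d_k) context-mixed outputs

Row i of weights is token i's attention distribution over the whole sequence, which is why distance between positions carries no extra cost.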

Multi-head attention runs this process in parallel with multiple sets of learned Q/K/V matrices (heads), each attending to different types of relationships — syntactic, semantic, coreference, and more. The outputs of all heads are concatenated and projected, giving the model a rich, multi-perspective representation of each token. Combined with positional encoding (which tells the model the position of each token since there's no inherent order in the parallel computation), attention underpins all modern Transformer-based systems.
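
Multi-head attention is only a few lines more: run the same computation once per head with separate projections, then concatenate and project. A hedged sketch building on the single-head version above (the heads list and W_o are illustrative names):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, heads, W_o):
        """X: (T, d_model); heads: one (W_q, W_k, W_v) triple per head;
        W_o: (n_heads * d_k, d_model) output projection."""
        outputs = []
        for W_q, W_k, W_v in heads:
            Q, K, V = X @ W_q, X @ W_k, X @ W_v
            scores = Q @ K.T / np.sqrt(K.shape[-1])
            outputs.append(softmax(scores) @ V)        # (T, d_k) per head
        # Concatenate the per-head views, then mix them back to d_model
        return np.concatenate(outputs, axis=-1) @ W_o  # (T, d_model)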

Key Takeaway

Attention is how AI models focus — dynamically weighting which parts of the input are most relevant to each step of the output, enabling Transformers to capture context over arbitrarily long sequences.

Real-World Applications

01 Machine translation: attention allowing the decoder to focus on different source words when generating each target word.
02 Document summarization: attention identifying the most salient sentences and phrases across a long document.
03 Question answering: attention focusing on the relevant passage segment when answering a question from a long context.
04 Image captioning with Vision Transformers: attention determining which image regions are most relevant to each generated word.
05 Protein structure prediction: attention in AlphaFold 2 modeling the interactions between pairs of amino acid residues.

Frequently Asked Questions

What is the difference between attention and self-attention?

Attention (or cross-attention) relates elements from two different sequences — e.g., a decoder attending to encoder outputs in translation ('what French words are relevant to generating this English word?'). Self-attention relates elements within the same sequence — each word attends to all other words in the same sentence to build a context-aware representation.
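
The distinction is visible in where Q, K, and V come from. In this illustrative NumPy sketch, Queries are drawn from the decoder sequence while Keys and Values come from the encoder; self-attention is the special case where both inputs are the same sequence:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention(decoder_X, encoder_X, W_q, W_k, W_v):
        """decoder_X: (T_dec, d); encoder_X: (T_enc, d); W_*: (d, d_k)."""
        Q = decoder_X @ W_q                      # queries from the decoder
        K = encoder_X @ W_k                      # keys from the encoder
        V = encoder_X @ W_v                      # values from the encoder
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T_dec, T_enc)
        return softmax(scores) @ V               # decoder positions read encoder info

    # Self-attention is simply: cross_attention(X, X, W_q, W_k, W_v)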

What are Query, Key, and Value in attention?

These are three learned projections of the input. The Query represents 'what am I looking for?', the Key represents 'what do I contain?', and the Value represents 'what information do I provide?'. Attention scores are computed as the dot-product similarity between Query and Key vectors, normalized with a softmax, and then used to weight the corresponding Value vectors. The result is a context-enriched representation for each position.
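
A tiny worked example with made-up numbers shows the flow: one Query is scored against three Keys, the scores pass through a softmax, and the resulting weights blend the Values:

    import numpy as np

    q = np.array([1.0, 0.0])            # query: "what am I looking for?"
    K = np.array([[1.0, 0.0],           # key 0: strong match with q
                  [0.0, 1.0],           # key 1: orthogonal, low score
                  [0.5, 0.5]])          # key 2: partial match
    V = np.array([[10.0, 0.0],          # values: the information on offer
                  [0.0, 10.0],
                  [5.0, 5.0]])

    scores = K @ q / np.sqrt(2)                      # [0.71, 0.00, 0.35]
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: ~[0.46, 0.22, 0.32]
    output = weights @ V                             # ~[6.2, 3.8], led by value 0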

Why use multi-head attention?

A single attention head can only focus on one type of relationship at a time. Multi-head attention runs several attention computations in parallel, each with different learned projections. One head might capture syntactic relationships (subject-verb), another semantic similarity, another positional patterns. The outputs are concatenated and combined, giving the model a richer, multi-perspective understanding of the input.
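
In practice you rarely hand-roll this; frameworks ship it directly. For instance, a sketch using PyTorch's nn.MultiheadAttention (the 512-dim, 8-head configuration mirrors the original Transformer; the input tensor is random illustrative data):

    import torch
    import torch.nn as nn

    # 8 heads over a 512-dim model, each head attending in a 64-dim subspace
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    x = torch.randn(1, 10, 512)    # (batch, seq_len, d_model)
    out, weights = mha(x, x, x)    # self-attention: query = key = value = x
    print(out.shape)               # torch.Size([1, 10, 512])
    print(weights.shape)           # torch.Size([1, 10, 10]), averaged over heads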