A neural network architecture specialized for grid-structured data (especially images) that uses learned filters to detect local features — edges, textures, shapes — in a hierarchical, spatially aware manner.
In Depth
A Convolutional Neural Network is purpose-built for data with spatial structure — most famously images. Its core operation is the convolution: a small learned filter (kernel) slides across the input, computing a dot product at each position. This produces a feature map that highlights where that filter's pattern appears in the image. Early filters detect simple features like horizontal or vertical edges; deeper layers combine these into progressively complex patterns — textures, parts, objects.
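The sliding dot product described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it performs a "valid" convolution (no padding, stride 1) and, like most deep-learning libraries, actually computes cross-correlation (the kernel is not flipped). The image and kernel values are made up for the example; the kernel is the classic vertical-edge detector.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each
    position. Returns the resulting feature map ("valid" mode, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 6x6 toy image: dark on the left, bright on the right,
# i.e. a vertical edge down the middle.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A vertical-edge filter: responds where intensity increases
# from left to right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d(image, kernel)
print(feature_map[0])  # [0. 3. 3. 0.] — strong response only at the edge
```

The feature map is large exactly where the filter's pattern (a left-to-right brightness jump) appears in the image, and zero in the flat regions, which is the "highlights where that filter's pattern appears" behavior described above.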
CNNs have three key architectural advantages over plain neural networks for image tasks. First, weight sharing: the same filter is applied everywhere in the image, dramatically reducing the number of parameters compared to a fully connected network. Second, translation equivariance: a filter that detects an eye produces a response wherever the eye appears, whether in the top-left or bottom-right of the image; pooling layers then add a degree of translation invariance on top of this. Third, local connectivity: neurons connect only to a small region of the input, reflecting the local structure of visual patterns.
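The scale of the parameter savings from weight sharing is easy to make concrete with some back-of-the-envelope arithmetic. The layer sizes below (a 224×224 RGB input, 32 output channels, 3×3 filters) are illustrative assumptions, not taken from any specific model:

```python
# Compare parameter counts for producing 32 output channels/units
# from a 224x224 RGB image (illustrative sizes).
h, w, c = 224, 224, 3

# Convolutional layer: 32 filters of size 3x3xc, plus one bias each.
# The same 27 weights per filter are reused at every spatial position.
conv_params = 32 * (3 * 3 * c + 1)

# Fully connected layer mapping the flattened image to just 32 units:
# every unit needs its own weight for every input value.
fc_params = 32 * (h * w * c + 1)

print(conv_params)  # 896
print(fc_params)    # 4816928
```

Under these assumptions the fully connected layer needs over five thousand times more parameters than the convolutional one, and the gap widens as the image grows, since the conv layer's parameter count does not depend on image size at all.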
The CNN revolution began with AlexNet's ImageNet victory in 2012, and architectures like VGG, ResNet, and EfficientNet have pushed the boundaries ever since. While Transformers (specifically Vision Transformers) are increasingly competitive for image tasks, CNNs remain the dominant choice for embedded systems, mobile applications, and scenarios where computational efficiency is critical. They also remain central to video analysis, medical imaging, satellite imagery, and any domain with spatial data.
CNNs process images in a way loosely analogous to the early stages of human vision: hierarchically, detecting local features first and combining them into complex objects. This makes them remarkably efficient at tasks involving spatial or visual data.

