A computer vision task that classifies every pixel of an image into a semantic category, producing a dense map that labels each pixel as belonging to a road, building, sky, person, or any other class.
In Depth
Semantic Segmentation is the most granular form of scene understanding in computer vision. Rather than placing a bounding box around objects (detection) or labeling an entire image (classification), semantic segmentation assigns a class label to every single pixel. The output is a segmentation mask — an image-sized map where each pixel carries the identity of the object or region it belongs to: road, sky, car, pedestrian, vegetation.
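The mask itself is easy to picture in code. The sketch below uses random numbers in place of a real network's output: the model produces one score per class per pixel, and taking the argmax across the class axis yields the dense label map described above (class names and sizes here are hypothetical).

```python
import numpy as np

# Hypothetical per-pixel class scores for a tiny 4x4 image over 3 classes
# (0 = road, 1 = sky, 2 = car). In a real network these logits come from
# the final layer of the decoder; here they are random for illustration.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))  # (num_classes, height, width)

# The segmentation mask assigns each pixel the class with the highest score.
mask = logits.argmax(axis=0)  # (height, width), values in {0, 1, 2}

print(mask.shape)  # one class ID per pixel
```

Note that the output has the same spatial dimensions as the input image, which is exactly why the decoder must upsample back to full resolution.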
The dominant architectural pattern for semantic segmentation is the encoder-decoder: a CNN or Transformer encoder progressively compresses the image into a rich, abstract feature representation, then a decoder progressively upsamples it back to full resolution, recovering spatial detail. U-Net (2015), originally developed for biomedical image segmentation, introduced skip connections between encoder and decoder stages that preserve fine spatial detail — a design now widely adopted. DeepLab and its variants use atrous (dilated) convolutions to maintain resolution without sacrificing receptive field size.
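The trick behind atrous convolution is worth seeing concretely. A minimal 1-D sketch (not any particular library's implementation): with dilation rate d, a k-tap kernel skips d-1 inputs between taps, so the same k weights cover a span of (k-1)*d + 1 inputs. Stacking dilated layers therefore grows the receptive field without pooling away resolution.

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """Valid 1-D convolution of signal x with kernel w at dilation rate d."""
    k = len(w)
    span = (k - 1) * d + 1  # how many inputs each output position sees
    return np.array([
        sum(w[j] * x[i + j * d] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])  # 3 weights, regardless of dilation

# Same 3 weights: dilation 1 sees a span of 3 inputs, dilation 2 a span of 5.
out_d1 = dilated_conv1d(x, w, 1)
out_d2 = dilated_conv1d(x, w, 2)
print(out_d1)
print(out_d2)
```

DeepLab applies the 2-D analogue of this at several rates in parallel (atrous spatial pyramid pooling), trading nothing in parameter count for a much larger receptive field.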
Instance Segmentation extends semantic segmentation by distinguishing individual instances of the same class — not just 'person' but 'person 1', 'person 2', 'person 3'. Mask R-CNN is the standard architecture for this task, adding a pixel-level mask prediction branch to the Faster R-CNN detection pipeline. Panoptic Segmentation combines both, labeling countable 'things' (person, car) with per-instance identities and amorphous 'stuff' (sky, road) with class labels alone, in a single unified output — the most complete form of scene understanding available in computer vision today.
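A toy numpy sketch of how a panoptic output can be assembled, assuming a semantic map and per-instance masks are already available (in practice the masks would come from a detector such as Mask R-CNN; the scene, class IDs, and the class*100 + instance encoding here are all hypothetical, though similar in spirit to the class*1000 + instance convention used by some benchmarks):

```python
import numpy as np

# Hypothetical 4x4 scene: semantic map with class IDs
# (0 = sky 'stuff', 1 = road 'stuff', 2 = person 'thing').
semantic = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 2, 1, 2],
    [1, 2, 1, 2],
])

# Boolean instance masks for the 'person' class, here split by column
# purely for illustration (a detector would predict these).
person1 = (semantic == 2) & (np.arange(4) < 2)   # left person
person2 = (semantic == 2) & (np.arange(4) >= 2)  # right person

# Panoptic output: one integer per pixel encoding (class_id, instance_id).
# 'Stuff' classes keep instance_id 0; each 'thing' gets a unique instance_id.
panoptic = semantic * 100
for inst_id, m in enumerate([person1, person2], start=1):
    panoptic[m] = 2 * 100 + inst_id

print(panoptic)  # sky = 0, road = 100, the two people = 201 and 202
```

The key property is that every pixel gets exactly one label, and pixels of the same 'thing' class are still separable by instance.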
Semantic Segmentation gives AI a pixel-perfect understanding of scene structure — not just where objects are, but exactly which pixels belong to them — enabling applications that require precise spatial reasoning.

