A computer vision task that identifies and locates multiple objects within an image or video, typically outputting both a class label and a bounding box for each detected instance.
In Depth
Object Detection goes beyond image classification — instead of assigning one label to an entire image, it answers: what objects are in this image, and exactly where are they? For each detected object, the model outputs a class label (car, person, dog) and a bounding box specifying the object's location and size. This spatial, instance-level understanding is what makes object detection essential for applications that need to reason about the physical layout of a scene.
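In code, a detector's per-image output is just a list of labeled, scored boxes. The sketch below is an illustrative structure, not any particular library's API; the `Detection` class and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object instance (hypothetical structure)."""
    label: str    # class name, e.g. "car"
    score: float  # model confidence in [0, 1]
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) pixel corners

# A detector maps one image to zero or more Detection instances:
detections = [
    Detection(label="dog", score=0.97, box=(34.0, 50.0, 210.0, 300.0)),
    Detection(label="person", score=0.88, box=(180.0, 20.0, 320.0, 310.0)),
]

for d in detections:
    print(f"{d.label} ({d.score:.2f}) at {d.box}")
```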
Two dominant paradigms exist. Two-stage detectors (Faster R-CNN, Mask R-CNN) first propose candidate regions that might contain objects, then classify and refine each proposal; they are accurate but relatively slow. Single-stage detectors (YOLO, SSD) predict class labels and bounding boxes directly in a single pass over the image; they are faster and suitable for real-time applications. DETR (DEtection TRansformer) also detects in one end-to-end pass, but reframes detection as set prediction: a Transformer encoder-decoder on top of a CNN backbone replaces hand-designed components such as anchor boxes and non-maximum suppression, demonstrating that Transformers apply to detection tasks.
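To make the two-stage workflow concrete, here is a minimal inference sketch using torchvision's pretrained Faster R-CNN; the image path is a placeholder, and the dict-of-tensors output format follows torchvision's detection convention.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

# Load a COCO-pretrained two-stage detector (region proposals + per-region heads).
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

# "street.jpg" is a placeholder path; preprocessing comes from the weight metadata.
img = read_image("street.jpg")
batch = [weights.transforms()(img)]

with torch.no_grad():
    # torchvision detectors return one dict per image: boxes, labels, scores.
    (output,) = model(batch)

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score >= 0.5:  # keep confident detections only
        print(weights.meta["categories"][int(label)], score.item(), box.tolist())
```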
Object detection accuracy is measured using metrics like mAP (mean Average Precision), which evaluates both classification accuracy and localization precision across multiple IoU (Intersection over Union) thresholds. Modern single-stage models like YOLOv8 run in real time on standard hardware, while Transformer-based detectors like DINO push state-of-the-art accuracy on complex scenes with dozens of overlapping objects. The frontier challenge is open-vocabulary detection: identifying objects of any class from a text description, even categories unseen during training.
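IoU itself is straightforward to compute from corner coordinates. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box shifted relative to the ground truth:
print(iou((0, 0, 100, 100), (25, 25, 125, 125)))  # ~0.39
```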
Object detection transforms images from visual scenes into structured inventories — telling a machine not just what is present, but where each object is, enabling spatial awareness in AI systems.
Frequently Asked Questions
What is the difference between object detection and image classification?
Image classification assigns one label to an entire image ('this is a photo of a dog'). Object detection identifies multiple objects and their locations within the image ('there is a dog at coordinates [x1,y1,x2,y2] and a cat at [x3,y3,x4,y4]'). Detection outputs both class labels and bounding boxes for each instance, enabling spatial reasoning about scenes.
What is YOLO and why is it popular?
YOLO (You Only Look Once) is a single-stage detector that predicts bounding boxes and classes in one pass — making it fast enough for real-time applications. YOLO models (v5, v8, v11) balance speed and accuracy, running at 30-150+ FPS on modern GPUs. This makes YOLO the default choice for applications needing instant results: autonomous driving, security, robotics, and live video analytics.
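For reference, running a pretrained YOLOv8 model takes a few lines with the ultralytics package; the checkpoint name and image path below are placeholders, and the field access follows ultralytics' Results API.

```python
from ultralytics import YOLO  # pip install ultralytics

# Load a small COCO-pretrained checkpoint; "yolov8n.pt" downloads on first use.
model = YOLO("yolov8n.pt")

# Single-pass inference on a placeholder image path.
results = model("street.jpg")

for box in results[0].boxes:
    cls_id = int(box.cls)               # predicted class index
    print(model.names[cls_id],          # class name, e.g. "car"
          float(box.conf),              # confidence score
          box.xyxy[0].tolist())         # (x1, y1, x2, y2) corners
```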
How is object detection accuracy measured?
The standard metric is mAP (mean Average Precision), which measures both how well the model classifies objects and how precisely it localizes them. It uses IoU (Intersection over Union) to decide whether a predicted bounding box overlaps enough with the ground truth to count as a match. mAP@0.5 requires at least 50% overlap; mAP@0.5:0.95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, giving a stricter, more comprehensive evaluation.
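As a sketch of how this is computed in practice, torchmetrics provides a COCO-style mAP metric; this assumes torchmetrics (and its pycocotools backend) is installed, and the boxes below are made-up toy values.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One image: a single ground-truth box and one confident prediction.
preds = [{
    "boxes": torch.tensor([[24.0, 28.0, 105.0, 98.0]]),
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
target = [{
    "boxes": torch.tensor([[20.0, 30.0, 100.0, 100.0]]),
    "labels": torch.tensor([0]),
}]

# Averages AP over IoU thresholds 0.5 to 0.95 in 0.05 steps by default (COCO-style).
metric = MeanAveragePrecision()
metric.update(preds, target)
scores = metric.compute()
print(scores["map"], scores["map_50"], scores["map_75"])
```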