A computer vision task that identifies and locates multiple objects within an image or video, typically outputting both a class label and a bounding box for each detected instance.
In Depth
Object Detection goes beyond image classification — instead of assigning one label to an entire image, it answers: what objects are in this image, and exactly where are they? For each detected object, the model outputs a class label (car, person, dog) and a bounding box specifying the object's location and size. This spatial, instance-level understanding is what makes object detection essential for applications that need to reason about the physical layout of a scene.
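The per-object output described above can be pictured as a simple record; this minimal sketch (field names are illustrative, not any particular library's API) shows the typical label-plus-box-plus-confidence structure a detector emits for each instance:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                        # class name, e.g. "car"
    score: float                      # model confidence in [0, 1]
    box: tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) in pixels

# A detector's output for one image is a list of such records,
# one per detected instance (values here are made up for illustration):
detections = [
    Detection("car", 0.92, (34, 50, 210, 160)),
    Detection("person", 0.81, (220, 40, 270, 180)),
]
```

Real frameworks vary in box convention (corner coordinates vs. center-plus-size, absolute pixels vs. normalized), but the label/score/box triple is the common denominator.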
Two dominant paradigms exist. Two-stage detectors (Faster R-CNN, Mask R-CNN) first propose candidate regions that might contain objects, then classify and refine each proposal — accurate but relatively slow. Single-stage detectors (YOLO, SSD, DETR) predict class labels and bounding boxes directly in a single pass over the image — faster and suitable for real-time applications. DETR (DEtection TRansformer) goes further, replacing hand-designed components such as anchor boxes and non-maximum suppression with a Transformer encoder-decoder that predicts a set of boxes directly, demonstrating that attention-based architectures apply to detection.
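Most convolutional detectors in both paradigms produce many overlapping candidate boxes per object and rely on non-maximum suppression (NMS) to merge them — the hand-designed step DETR dispenses with. A minimal greedy NMS sketch, assuming corner-format boxes and a simple score-sorted loop:

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    discard any remaining box that overlaps it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate boxes and one distant box:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the duplicate of box 0 is suppressed
```

Production detectors use vectorized, often class-wise variants of this loop, but the greedy keep-and-suppress logic is the same.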
Object detection accuracy is measured using metrics like mAP (mean Average Precision), which evaluates both classification accuracy and localization precision across multiple IoU (Intersection over Union) thresholds. Modern models like YOLOv8 and DINO achieve real-time performance on standard hardware while handling complex scenes with dozens of overlapping objects. The frontier challenge is open-vocabulary detection — identifying objects of any class based on text descriptions, even categories unseen during training.
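The IoU threshold mentioned above decides whether a predicted box counts as a correct localization: IoU is the overlap area divided by the combined area of the two boxes. A short worked sketch, assuming corner-format boxes:

```python
def iou(pred, gt):
    """IoU between predicted and ground-truth (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

# A 10x10 prediction shifted 2px from a 10x10 ground truth:
# intersection = 8 * 8 = 64, union = 100 + 100 - 64 = 136
score = iou((2, 2, 12, 12), (0, 0, 10, 10))  # 64/136 ≈ 0.47
```

At the common 0.5 threshold this visually close prediction already counts as a miss, which is why mAP averages over a range of thresholds (e.g. COCO's 0.5 to 0.95) to reward progressively tighter localization.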
Object detection transforms images from visual scenes into structured inventories — telling a machine not just what is present, but where each object is, enabling spatial awareness in AI systems.

