SSD and YOLO
To balance accuracy and speed, recent CNN work has focused on one-stage object detection, which aims to deliver real-time inference. The two most popular architectures are the Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO). These models perform detection in a single pass of the network, which considerably reduces inference time and makes real-time classification and detection feasible.
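To make the "single pass" idea concrete, the following is a minimal sketch (not either paper's actual implementation) of the YOLO-style output layout: one forward pass yields a single tensor over an S x S grid, where each cell predicts B boxes and C class probabilities. The grid and channel sizes here are illustrative assumptions.

```python
import numpy as np

# Hypothetical YOLO-style output: an S x S grid where each cell predicts
# B boxes (x, y, w, h, confidence) plus C class probabilities, all produced
# by a single forward pass of the network.
S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)  # stand-in for the network's output tensor

def decode_cell(cell, B, C):
    """Split one grid cell's prediction vector into its boxes and class scores."""
    boxes = cell[:B * 5].reshape(B, 5)   # each row: x, y, w, h, confidence
    class_probs = cell[B * 5:]           # class distribution shared by the cell
    return boxes, class_probs

boxes, class_probs = decode_cell(output[0, 0], B, C)
print(boxes.shape, class_probs.shape)  # (2, 5) (20,)
```

Because every box and class score comes from the same tensor, no separate proposal stage is needed; decoding is just slicing and reshaping.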
For illustration purposes, the architecture of SSD with a VGG-16 base network is shown in Figure 2-49. The SSD network combines predictions from multiple feature maps of different resolutions to handle objects of various sizes. It eliminates proposal generation and the subsequent pixel or feature resampling stages, encapsulating all computation in a single network and providing a unified framework for both training and inference [60].
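The multi-scale design can be sketched with the feature-map sizes and default-box counts of the SSD300 configuration from [60]: each map contributes size x size x k predictions, and summing over all maps gives the total number of default boxes scored in one forward pass.

```python
# Sketch of how SSD accumulates predictions across feature maps of
# different resolutions; the (grid size, default boxes per cell) pairs
# follow the SSD300 configuration described in [60].
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(size * size * k for size, k in feature_maps)
print(total)  # 8732 default boxes evaluated in a single forward pass
```

Large early feature maps (38x38) handle small objects, while the coarse 3x3 and 1x1 maps cover large ones, which is how a single network spans a wide range of object scales.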
Figure 2-49 Single Shot MultiBox Detector. (Liu, et al., 2016)
YOLO was first introduced by Redmon, et al. [61], followed by YOLOv2 and YOLO9000 [62]. YOLO9000 can detect over 9000 object categories in real time. More recently, an incremental improvement resulted in YOLOv3, which uses Darknet-53, a feature extractor consisting of 53 convolutional layers with skip connections [63].
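The skip connections in Darknet-53 follow the standard residual pattern: a block's output is its input plus a transformed copy, so the identity path lets signals and gradients bypass the convolutions. The sketch below is a toy illustration of that pattern, with a simple function standing in for the block's actual convolution pair.

```python
import numpy as np

# Minimal sketch of the skip (residual) connection pattern used in
# Darknet-53: output = input + transform(input). The `transform` argument
# stands in for the block's real 1x1/3x3 convolution stack.
def residual_block(x, transform):
    return x + transform(x)  # identity path plus transformed path

x = np.ones(4)
y = residual_block(x, lambda t: 0.5 * t)  # toy stand-in for the conv stack
print(y)  # [1.5 1.5 1.5 1.5]
```

Even if the transform initially contributes little, the identity path guarantees the block can pass its input through unchanged, which is what makes very deep stacks like 53 layers trainable.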