RetinaNet
Lin, et al. [64] identified the extreme foreground-background class imbalance encountered during the training of dense detectors as the central cause of the inferior accuracy of one-stage object detection models (such as YOLO and SSD) relative to two-stage approaches (e.g., R-CNN). To address this imbalance, they proposed a new loss function, the Focal Loss, which reshapes the standard cross-entropy loss so that it down-weights the loss assigned to well-classified examples. This focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To demonstrate its effectiveness, they designed a simple dense detector, RetinaNet, and trained it with the Focal Loss. Their experiments showed that RetinaNet matches the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. The architecture of RetinaNet is shown in Figure 2-50.
Figure 2-50 The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture [64].
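To make the reshaping of the cross-entropy loss concrete, the following is a minimal PyTorch sketch of a binary focal loss. The function name, the defaults (alpha = 0.25, gamma = 2, the values reported as working best in [64]), and the sum reduction are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw predictions, shape (N,)
    targets: float tensor of binary labels in {0., 1.}, shape (N,)
    alpha:   weighting factor for the positive (foreground) class
    gamma:   focusing parameter; gamma = 0 recovers cross entropy
    """
    p = torch.sigmoid(logits)
    # Per-example cross entropy, kept unreduced so it can be re-weighted
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the model's estimated probability for the true class
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights well-classified (easy) examples,
    # so abundant easy negatives contribute little to the total loss
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```

Note that the paper normalizes this sum by the number of anchors assigned to ground-truth boxes rather than by the total number of anchors; the bare sum above is kept for simplicity.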
RetinaNet is composed of a backbone network and two task-specific subnetworks. The backbone is a Feature Pyramid Network (FPN) built on top of a ResNet, producing a rich multi-scale convolutional feature pyramid. Attached to this backbone are the two subnetworks: one for classification and one for localization (bounding-box regression).
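The sketch below illustrates how such a pair of subnetworks could look, following the head design described in [64] (four 3x3 convolutions with ReLU, shared across all pyramid levels, followed by a final prediction convolution). The class, module, and parameter names here are hypothetical, and the defaults (256 channels, 9 anchors per location) are assumptions taken from the paper's configuration.

```python
import torch
from torch import nn

class RetinaNetHead(nn.Module):
    """Sketch of the classification and box-regression subnetworks
    applied to every FPN feature map, with weights shared across levels."""

    def __init__(self, in_channels=256, num_anchors=9, num_classes=80):
        super().__init__()

        def subnet(out_channels):
            layers = []
            for _ in range(4):  # four 3x3 conv + ReLU blocks
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.ReLU(inplace=True)]
            # final conv produces per-anchor predictions at each location
            layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
            return nn.Sequential(*layers)

        # classification subnet: K class scores for each of the A anchors
        self.cls_subnet = subnet(num_anchors * num_classes)
        # box subnet: 4 regression offsets for each of the A anchors
        self.box_subnet = subnet(num_anchors * 4)

    def forward(self, features):
        # features: list of FPN feature maps, one per pyramid level
        cls_outputs = [self.cls_subnet(f) for f in features]
        box_outputs = [self.box_subnet(f) for f in features]
        return cls_outputs, box_outputs
```

The classification outputs of this head are what the focal loss is applied to during training, while the box outputs are trained with a standard regression loss.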