Deformable Convolutional Networks
A key challenge in visual recognition is accommodating geometric variations arising from object scale, pose, viewpoint, and part deformation. The CNNs discussed so far lack internal mechanisms to handle such geometric transformations because they rely on fixed geometric structures: fixed sampling locations in convolution units, fixed-ratio pooling, fixed spatial bins for RoI pooling, and so on. To address this issue, Dai et al. [39] introduced two new modules that enhance CNNs' capability to model geometric transformations.
The first is deformable convolution, which enables free-form deformation of the sampling grid by adding learned 2D offsets to the regular grid sampling locations of the standard convolution (Figure 2-50). The second is deformable Region of Interest (RoI) pooling, which adds an offset to each bin position in the regular bin partition of standard RoI pooling (Figure 2-51). Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes. Extensive experiments showed that learning dense spatial transformations in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation.
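To make the idea of learned sampling offsets concrete, the following is a minimal sketch of a deformable convolution layer in PyTorch, assuming torchvision is available. As in the scheme described above, the per-location offsets are predicted by an ordinary convolution applied to the same input feature map; the class name, channel sizes, and initialization details here are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are learned."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # One (dy, dx) pair per kernel element at every output location,
        # i.e., 2 * kH * kW offset channels.
        self.offset_conv = nn.Conv2d(
            in_channels, 2 * kernel_size * kernel_size,
            kernel_size=kernel_size, padding=padding)
        # Zero-initialized offsets make the layer start out as a regular
        # convolution; deformation is then learned during training.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels, kernel_size, kernel_size))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_channels))
        self.padding = padding

    def forward(self, x):
        offsets = self.offset_conv(x)  # shape (N, 2*kH*kW, H, W)
        return deform_conv2d(
            x, offsets, self.weight, self.bias, padding=self.padding)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)
    block = DeformableConvBlock(64, 128)
    print(block(feat).shape)  # torch.Size([1, 128, 32, 32])
```

Deformable RoI pooling follows the same pattern: a small branch predicts one offset per pooling bin from the pooled features and the RoI, and the bins are shifted accordingly before aggregation.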