Convolution Operation
Convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. It is defined as the integral of the product of the two functions after one is reversed and shifted, as written in Eq. 1 [37].
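The definition referenced as Eq. 1 is the standard continuous convolution of f and g:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
```

Here the second function is reversed (the argument $t - \tau$) and shifted by $t$ before the product is integrated.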
Since we mostly deal with discrete inputs, such as pixels in images, the convolution operation reduces to an element-wise multiplication between the filter and an image patch followed by a summation, in effect a dot product, as shown in Figure 2-24.
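A minimal sketch of this per-pixel computation, using a hypothetical 3x3 patch and a 3x3 filter (the values are illustrative, not taken from the figure):

```python
import numpy as np

# Hypothetical 3x3 image patch and 3x3 filter.
patch = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]])

# One output pixel: element-wise multiplication followed by a sum,
# i.e., a dot product between the flattened patch and flattened filter.
out_pixel = np.sum(patch * kernel)
print(out_pixel)  # → 0
```

Repeating this at every filter position yields the full filtered image.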
As illustrated in Figure 2-24, the pixel value (O55) in the original image becomes the pixel value (P55) in the filtered image after convolution. To see the convolution operation in action, an animation is viewable at [38].
Given the 3x3 filters in Figure 2-23, padding is needed to keep the feature map the same size as the input image. Normally zero padding is used, with zero-valued pixels added around the edges of the image, as shown in Figure 2-25.
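Zero padding can be sketched with NumPy's `np.pad`; for a 3x3 filter with stride 1, one ring of zeros preserves the input size (the 4x4 image here is a made-up example):

```python
import numpy as np

# Hypothetical 4x4 image; one ring of zero padding makes it 6x6, so a
# 3x3 filter sliding with stride 1 produces a 4x4 output, same as the input.
image = np.arange(16).reshape(4, 4)
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (6, 6)
```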
The same filter “scans” the image by sliding one pixel at a time (referred to as stride 1), horizontally and vertically, until it has covered the whole image. One step of horizontal scanning along the top row of the image is illustrated in Figure 2-26.
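The scanning procedure can be sketched as a direct nested loop; the image and vertical-edge filter below are illustrative stand-ins, not the ones from the figures. With no padding, an input of width W and a filter of width F scanned at stride S yields an output of width (W - F) / S + 1:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the filter across the image `stride` pixels at a time,
    taking a dot product at each stop. (As is common in CNNs, the kernel
    is not flipped first, since the filters are learned anyway.)"""
    H, W = image.shape
    F = kernel.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + F, c:c + F] * kernel)
    return out

# A toy image with a vertical edge: left half dark (0), right half bright (1).
image = np.concatenate([np.zeros((6, 3)), np.ones((6, 3))], axis=1)
vertical_edge_filter = np.array([[-1, 0, 1],
                                 [-1, 0, 1],
                                 [-1, 0, 1]])
response = conv2d(image, vertical_edge_filter)
print(response.shape)  # (4, 4)
```

The response is strongest at the positions where the filter straddles the dark-to-bright boundary and zero in the flat regions, which is how the vertical edge gets extracted.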
The filtered image in Figure 2-27 shows that the vertical edge has been extracted.
Similarly, Figure 2-28 shows the result of applying the horizontal filter.
Note that the purpose of the example above is to illustrate how convolution and feature extraction work in general. Several variations have been proposed, including transposed convolution (deconvolution), separable convolution, deformable convolution [39], and dilated (atrous) convolution [40].
In a CNN, the filters (weights) are parameters learned from data during the training stage. A convolutional layer often consists of a stack of filters, producing a stack of feature maps. This can be visualized in Figure 2-29: each convolutional layer transforms a 3D input volume into a 3D output volume of neuron activations.
As shown in Figure 2-29, the input layer is an image: its width and height are the dimensions of the image, and its depth is 3 (the Red, Green, and Blue channels). Successive convolution operations commonly reduce the width and height of each subsequent 3D volume. The depth of a subsequent 3D volume corresponds to the number of filters used, which is a hyperparameter of the CNN. The number of filters determines how many feature maps are generated from the 3D input volume of the previous layer. It should be pointed out that the connectivity of each filter along the depth axis always equals the depth of the input volume. In other words, each filter is itself a 3D shape whose depth equals the depth of the input volume (see the square-based prism in dashed lines in Figure 2-29). For classification problems, the last layer is typically a fully connected layer, resulting in a 1x1xN volume (where N is the number of classes).
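The volume-to-volume transformation can be sketched as follows; the input size (32x32x3) and the number of filters (8) are arbitrary choices for illustration. Each 3x3x3 filter spans the full input depth and produces one 2D feature map, and stacking the maps gives the output volume's depth:

```python
import numpy as np

def conv_layer(volume, filters, stride=1):
    """Apply a stack of K 3D filters (each F x F x D) to an H x W x D
    input volume. Each filter yields one 2D feature map; the K maps
    stacked together form the output volume of depth K."""
    H, W, D = volume.shape
    K, F, _, Df = filters.shape
    assert Df == D, "filter depth must equal input-volume depth"
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                out[i, j, k] = np.sum(volume[r:r + F, c:c + F, :] * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))      # RGB input volume
filters = rng.random((8, 3, 3, 3))   # 8 filters, each 3x3x3
activations = conv_layer(image, filters)
print(activations.shape)  # (30, 30, 8)
```

Note how the output depth (8) equals the number of filters, while each individual filter's depth (3) matches the depth of the input volume, exactly as described above.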