Typical Layer Structure
Four major types of layers are commonly used to construct a CNN: the convolution layer, the activation layer (e.g., ReLU layer), the pooling layer, and the fully connected layer. Convolution layers form the backbone of a CNN, while the other layers complement them to create a functional CNN for a specific application. We will discuss these layers in detail.
To learn complex high-level features, which are inherently nonlinear, the CNN structure must be designed to account for such nonlinearity. Mimicking the firing mechanism of biological neurons, this nonlinearity is introduced by an activation function. In modern CNN practice, the most commonly used activation function is the rectified linear unit (ReLU), which can be written as:
f(x) = max(0, x)    (2)
ReLU is linear for positive inputs and suppresses negative inputs to zero; as a whole it is a piecewise linear function. Albeit simple, ReLU offers many advantages over its traditional counterparts (e.g., the sigmoid or tanh functions), such as better gradient propagation, efficient computation, and scale invariance. Note that the gradients of ReLU are very simple: zero for negative x and 1.0 for positive x, as shown in Figure 2-31. Some variants of ReLU (e.g., Parametric or Leaky ReLU, the Exponential Linear Unit, etc.) have been proposed and may work better in certain situations. For this primer, we will focus on ReLU. Interested readers are referred to He, et al. [41] for Leaky ReLU and Clevert, et al. [42] for the Exponential Linear Unit.
Figure 2-31 ReLU activation function
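As a concrete illustration (written in Python with the NumPy library, an implementation choice not prescribed by this primer), ReLU and its gradient can be sketched in a few lines; the input values are arbitrary:

import numpy as np

def relu(x):
    # Element-wise max(0, x), Equation (2)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient of ReLU: 0 for negative x, 1 for positive x
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # arbitrary example inputs
print(relu(x))        # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]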
For example, for the filtered image in Figure 2-27, applying ReLU produces an activation map (or feature map), as shown in Figure 2-32.
Figure 2-32 Activation map
In the CNN literature, ReLU is often treated as a layer that does not change the shape of the input volume; it simply performs an element-wise ReLU computation on the input volume. In practice, every convolution layer is typically followed by a ReLU layer, resulting in a stack of activation maps or feature maps (as shown in Figure 2-29). The number of feature maps is indicated by the depth of the 3D volume in Figure 2-29. An example of how these layers are organized in a CNN is shown in Figure 2-33.
Figure 2-33 Classification example using CNN (Source: cs231n, n.d.; a live demo can be found at [43])
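For readers who prefer code to diagrams, the following minimal sketch (assuming the PyTorch library, which this primer does not prescribe; the image size and filter count are arbitrary) shows that a convolution layer produces one feature map per filter and that the subsequent ReLU layer leaves the shape of the volume unchanged:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)           # one RGB image, 32x32 pixels (arbitrary size)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
relu = nn.ReLU()

features = conv(x)                       # shape (1, 8, 32, 32): depth 8 = number of filters
activated = relu(features)               # element-wise ReLU; shape is unchanged
print(features.shape, activated.shape)   # torch.Size([1, 8, 32, 32]) for both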
As noted in Figure 2-33, a pooling layer is inserted after every two successive pairs of Conv/ReLU layers. What a pooling layer does is essentially down-sampling, which reduces the dimensions of the previous layer and allows the following layer to capture feature structure at a larger scale. Pooling layers also reduce sensitivity to small movements (e.g., shifts) of feature positions in the original image. This is because the feature maps from convolutional layers keep track of the “precise” positions of features and are therefore sensitive to the locations of the features in the original image. The pooling layers, in contrast, retain the relative (rather than precise) locations of the features (i.e., the coarse structure of the feature elements), which results in a lower-resolution version of the feature maps that is more robust to changes in the positions of the features in the original image. This is often referred to as local translation invariance.
It should be noted that down-sampling can also be achieved by using larger strides in the convolutional layers. However, the use of a pooling layer is more common and more robust for this purpose. Two typical pooling functions are max pooling and average pooling. Figure 2-34 shows an example of 2x2 average pooling, which outputs the average value of the pooled area. In practice, max pooling has been found to work better than average pooling for computer vision tasks.
Figure 2-34 Illustration of 2x2 average pooling
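To make the down-sampling step concrete, here is a minimal Python/NumPy sketch of 2x2 pooling with stride 2 (purely illustrative, not an optimized implementation; the feature-map values are arbitrary):

import numpy as np

def pool2x2(fmap, mode="max"):
    # fmap: 2D feature map with even height and width
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)    # group values into 2x2 blocks
    if mode == "max":
        return blocks.max(axis=(1, 3))             # max pooling
    return blocks.mean(axis=(1, 3))                # average pooling

fmap = np.array([[1., 3., 2., 0.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 4.],
                 [0., 1., 3., 3.]])
print(pool2x2(fmap, "max"))   # [[6. 2.] [7. 9.]]
print(pool2x2(fmap, "avg"))   # [[3.75 1.25] [2.5  4.75]]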
In summary, a stack of convolutional layers can extract features at different levels. The ReLU layers add the nonlinearity needed to learn complex features. The pooling layers provide better local translation invariance and capture feature structures at larger scales. Figure 2-35 illustrates the working concept of a CNN; note that the ReLU layers are not shown, to reduce clutter. The “peeping at a leopard through a pipe” example in Figure 2-35 echoes a famous Chinese metaphor for a biased, narrow-minded person who sees only a partial view of a big problem.
Figure 2-35 Do you see a spot or the leopard?
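Putting these pieces together, a small classification CNN like the one in Figure 2-33 can be sketched as follows (again assuming PyTorch; the layer sizes and the number of output classes are arbitrary illustrative choices, not values prescribed by this primer):

import torch
import torch.nn as nn

# Conv -> ReLU -> Conv -> ReLU -> Pool, repeated, followed by a fully connected classifier
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),             # fully connected layer, 10 classes (arbitrary)
)

scores = cnn(torch.randn(1, 3, 32, 32))    # class scores, shape (1, 10)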
In modern practice, if the training data set is large, a batch normalization (BatchNorm) layer is often used after each ReLU layer to normalize the activations. Batch normalization provides an elegant way of reparametrization, which significantly reduces the problem of coordinating updates across the many layers of a deep neural network [31].
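As a simplified sketch of this reparametrization (Python/NumPy, per-feature normalization only; gamma and beta stand for the learnable scale and shift parameters), BatchNorm can be written as:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations from the previous layer
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta               # learnable scale and shift

x = np.random.randn(8, 4) * 3.0 + 5.0         # arbitrary activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))          # approximately 0 and 1 per feature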
It has been shown that BatchNorm makes the optimization landscape significantly smoother. This ensures, in particular, that the gradients are more predictive, thus allowing for a larger range of learning rates and faster network convergence [44]. Additionally, it makes network training less sensitive to the weight initialization. To prevent overfitting, dropout was proposed as an effective regularizer [45]. Dropout is typically applied in a random fashion, which is equivalent to sampling many “thinned” networks from the original network. The word “thinned” arises from the fact that the original network is reduced because some of its neurons are randomly dropped during training. Srivastava, et al. [45] found that, as a side effect of dropout, the activations of the hidden units become sparse, even when no sparsity-inducing regularizers are present. An example of dropout is illustrated in Figure 2-36.
Figure 2-36 Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped [45].
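The dropout mechanism itself can be sketched in a few lines of Python/NumPy (for illustration only; this uses the common “inverted dropout” variant, which rescales the surviving activations at training time rather than rescaling at test time as in [45]):

import numpy as np

def dropout(activations, p=0.5, training=True):
    # Randomly zero each activation with probability p during training
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p).astype(float)
    return activations * mask / (1.0 - p)    # rescale so the expected value is unchanged

a = np.ones((2, 6))                           # arbitrary hidden-unit activations
print(dropout(a, p=0.5))                      # roughly half the units zeroed, i.e., a "thinned" network

Each call to dropout samples a different random mask, so every training pass effectively trains a different thinned network drawn from the original one.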