Training of CNN
This section describes how learning occurs. The weights connecting neurons in different layers are the parameters of the network, which are learned through the training process. Training a CNN is similar to training a traditional multilayer neural network; the backpropagation algorithm with gradient descent is typically used. At the start of training, all weights and biases are initialized by some random scheme. A series of example pairs (an input together with its known output, i.e., the label) is presented to the network. The network propagates each input through all the layers (e.g., convolutional layers, ReLU layers, pooling layers, etc.) to produce an output, which is compared with the correct answer (i.e., the label). The disagreement between the network output and the label is computed with a loss function (e.g., root-mean-square error, cross-entropy, etc.).

For classification, cross-entropy loss is commonly used to measure the performance of a model whose output is a probability. Cross-entropy measures the distance between the model-predicted distribution and the true distribution. As an example, consider a simple binary case: suppose we have a large image data set in which each image contains either a cat or a dog. We want to compare the predicted distribution with the label, and the difference is expressed as a loss in the form of cross-entropy. For this dog-and-cat example, the cross-entropy (CE) is computed as follows:
$$\mathrm{CE}(P,Q) = -P(\text{dog})\log Q(\text{dog}) - P(\text{cat})\log Q(\text{cat}) \qquad (3)$$
where Q is the probability distribution predicted by the model and P is the true probability distribution given by the label.
For example, consider several possible predicted distributions for a particular image; Table 3-2-2 shows the resulting cross-entropy values.
Table 3-2-2 Example calculations of cross-entropy
As shown in Table 3-2-2, the cross-entropy (loss) decreases as the predicted distribution gets closer to the true distribution (the label) and increases as it moves farther away. The goal of training is to minimize this cross-entropy loss by adjusting the weights and biases.
To generalize to a multi-class model with n classes, the cross-entropy can be written as:

$$\mathrm{CE}(P,Q) = -\sum_{i=1}^{n} P(\text{class}_i)\,\log Q(\text{class}_i)$$
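To make Eq. 3 and its multi-class generalization concrete, the following is a minimal NumPy sketch; the example distributions are illustrative only and are not taken from Table 3-2-2.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a true distribution p and a predicted distribution q."""
    q = np.clip(q, eps, 1.0)              # avoid log(0)
    return -np.sum(p * np.log(q))

# Binary dog/cat example in the spirit of Eq. 3: the label says "dog".
p_true  = np.array([1.0, 0.0])            # [P(dog), P(cat)]
q_close = np.array([0.9, 0.1])            # prediction close to the label
q_far   = np.array([0.3, 0.7])            # prediction far from the label

print(cross_entropy(p_true, q_close))     # ~0.105  (small loss)
print(cross_entropy(p_true, q_far))       # ~1.204  (large loss)
```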
A very large dataset is normally needed to train a CNN; for example, ImageNet [47] contains over 14 million images. Computing gradients over the entire training set at once (referred to as batch optimization) is often impractical with limited memory and computing power. Another issue with the batch optimization approach is that it is hard to incorporate new data in an 'online' setting. A simple strategy to address these issues is to compute the gradient and update the parameters using only a single training example or a small number of examples (i.e., a mini-batch). Stochastic gradient descent (SGD) is a commonly used algorithm that processes one mini-batch of the training data at a time; the "stochastic" part of the name comes from the fact that each mini-batch is a random sample of the training set. Once a mini-batch is processed, the loss and its gradient are computed, and the weights are updated layer by layer in a backward fashion (backpropagation). This continues until all data in the training set has been processed, which constitutes an epoch. The algorithm stops when the loss falls below a predetermined threshold or a maximum number of epochs is reached. The training and validation accuracy and loss curves can be plotted and compared to check for overfitting or underfitting.
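A minimal PyTorch sketch of this mini-batch SGD training loop is shown below; the model `model`, the dataset `train_set`, and the hyperparameter values are placeholders for illustration and are not part of the original text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Assumed to exist elsewhere: a CNN `model` and a labelled image dataset `train_set`.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)   # random mini-batches
criterion = nn.CrossEntropyLoss()                                   # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

max_epochs, loss_threshold = 50, 0.05
for epoch in range(max_epochs):                  # one epoch = one pass over the training set
    epoch_loss = 0.0
    for images, labels in train_loader:          # process one mini-batch at a time
        optimizer.zero_grad()                    # clear gradients from the previous step
        outputs = model(images)                  # forward pass through all layers
        loss = criterion(outputs, labels)        # compare network output with the labels
        loss.backward()                          # backpropagate gradients layer by layer
        optimizer.step()                         # update weights and biases
        epoch_loss += loss.item()
    if epoch_loss / len(train_loader) < loss_threshold:   # stop once the loss is low enough
        break
```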
Unlike traditional multilayer networks, which are shallow (i.e., have only a few layers), a CNN is often composed of many layers (i.e., a deep structure), which makes training quite challenging. Figure 2-37 shows the architecture of the VGG16 network [48], which predicts 1000 categories in ImageNet [47].
Figure 2-37 VGG16 network architecture [48, 49]
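As a side note, the VGG16 architecture discussed above is available in common deep learning libraries; the short PyTorch/torchvision sketch below is only an illustration (the `weights` argument applies to recent torchvision versions).

```python
import torch
from torchvision import models

vgg16 = models.vgg16(weights=None)   # use weights="IMAGENET1K_V1" for pretrained ImageNet weights
print(vgg16)                         # 13 convolutional layers followed by 3 fully connected layers

x = torch.randn(1, 3, 224, 224)      # one 224x224 RGB image
logits = vgg16(x)
print(logits.shape)                  # torch.Size([1, 1000]) -- one score per ImageNet category
```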
This deep structure requires a long path to back-propagate gradients from the output layer all the way to the input layer. At each backward step, the local gradient of the current layer is multiplied by the gradient flowing back from the layer above, according to the chain rule. If these gradients are very small (i.e., close to zero), the repeated multiplication leads to vanishing gradients, a common problem when training CNNs with traditional sigmoid or tanh activation functions. As shown in Figure 2-38, when the input value gets large (either positive or negative, as indicated by the regions in red), the gradient approaches zero (red curve).
Figure 2-38 Activation functions: sigmoid and tanh
Since the gradient determines how much the weights and biases are adjusted, gradients approaching zero mean that little or no adjustment is made, and learning effectively stops. This is one of the main reasons that the ReLU activation has become so popular for training deeper modern neural networks such as CNNs. Krizhevsky et al. [50] demonstrated that deep CNNs with ReLUs train several times faster than their equivalents with tanh units.
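The effect can be illustrated numerically. The short NumPy sketch below compares the gradients of sigmoid, tanh, and ReLU at increasingly large inputs; the sample input values are arbitrary.

```python
import numpy as np

def sigmoid_grad(x):                 # derivative of the sigmoid function
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):                    # derivative of tanh
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):                    # derivative of ReLU (1 for positive inputs)
    return float(x > 0)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  "
          f"tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.0f}")

# As x grows, the sigmoid and tanh gradients shrink toward zero (vanishing gradient),
# while the ReLU gradient stays at 1 for positive inputs.
```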
Other common techniques for mitigating the vanishing gradient problem include batch normalization layers [46] and residual blocks [51]. Batch normalization normalizes the input to each layer so that most of the values fall in the green region in Figure 2-38. Residual blocks add shortcut (skip) connections that allow the signal and its gradients to bypass a block of layers. Both techniques can lead to a much smoother loss landscape, which enables a broader range of learning rates and makes training significantly faster.
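As an illustration of these two ideas, here is a minimal PyTorch sketch of a residual block that combines batch normalization with a skip connection; the layer sizes are arbitrary and the block is a simplification of the designs in [46, 51].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch normalization and an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)   # normalizes layer inputs over the mini-batch
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)             # skip connection: gradients can bypass the block

x = torch.randn(8, 64, 32, 32)                # a mini-batch of 64-channel feature maps
print(ResidualBlock(64)(x).shape)             # torch.Size([8, 64, 32, 32])
```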