Introduction
Convolutional Neural Networks, typically referred to as CNNs or ConvNets, are perhaps the greatest success story of biologically inspired artificial intelligence [31]. They are well known for tasks such as image classification, video analysis, object detection and tracking, and natural language processing, and the domain of their applications continues to expand. CNNs are constructed with convolutional layers that mimic the mammalian visual cortex, which contains neurons that individually respond to small regions of the visual field, as discovered by Hubel and Wiesel [33], who later received the Nobel Prize for their work. In a convolutional layer, each neuron receives input from a local subarea of the previous layer, called the receptive field. For an image, the receptive field is typically square in shape (e.g., 3×3) with depth (e.g., 3 channels for RGB color images). CNNs also use a parameter-sharing scheme, in which each filter (i.e., the same set of weights) scans or slides over the entire input area, resulting in significantly fewer parameters (or weights) than in traditional multilayer neural networks, where each neuron receives input from every element of the previous layer. When the receptive field grows to the same size as the previous layer, the convolutional layer becomes identical to the fully connected (or dense) layer of traditional multilayer neural networks. In this sense, CNNs can be considered a more generic form that contains traditional multilayer neural networks as a special case. This built-in versatility makes CNNs extremely scalable to large inputs (e.g., image data). Figure 2-18 illustrates the concept of parameter sharing with a simple one-dimensional (1D) example.
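To make parameter sharing concrete, here is a minimal 1D sketch in plain Python (the signal and kernel values are invented for illustration): a single filter of three shared weights slides over the input, so the same three numbers produce every output position, whereas a dense layer would need a separate weight for every input element.

```python
# Minimal 1D convolution illustrating parameter sharing:
# the SAME three weights are reused at every position of the input.
def conv1d(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [1, 2, 3, 4, 5, 6]   # toy 1D input (assumed values)
kernel = [1, 0, -1]           # one shared filter of only 3 weights

print(conv1d(signal, kernel))  # -> [-2, -2, -2, -2]
```

Note that the 3 shared weights cover all 4 output positions; a fully connected layer mapping 6 inputs to 4 outputs would instead require 24 weights.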
In practice, CNNs have been widely used to deal with images. However, their applications are not limited to image data. CNNs can discover or learn patterns from any data or signals with certain structures in spatial and/or temporal dimensions. For example, Abdoli et al. [34] classified environmental sound using a 1D CNN (Figure 2-19), which learns a representation directly from the audio signal.
Each convolutional layer typically has a number of “filters” that extract or assemble specific features from the previous layer. In the context of image processing, a filter is a set of numbers, typically arranged as one or more square matrices. Each filter looks for a particular type of feature, such as vertical edges or horizontal edges (we will see an example of this shortly). A filter in a CNN is analogous to an optical filter, which selectively transmits light of different wavelengths. For example, to identify a unique object such as a stop sign, a hierarchy of filters is needed to extract specific features at different levels, including color (red and white), edges (vertical, horizontal, and slanting lines), shape (octagon), and the four letters (STOP). Zeiler et al. [35] showed that CNNs can capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts, and complete objects (see Figure 2-20).
The extracted features from images are normally general and can be used for many different tasks, such as classification, as shown in Figure 2-21.
Now, let us dig a bit deeper to understand how the “magic” happens. When a filter or kernel (the term “filter” is commonly used for CNNs) is applied to extract information from an image, a convolution operation is carried out between the filter and the image. For brevity, and without loss of generality, we will use a grayscale image, which has one channel, as compared to three-channel RGB images. The pixel value, or gray level, is represented by one byte (i.e., 8 bits) that encodes 2^8 = 256 integer values, with 0 for black and 255 for white; any value in between represents an intermediate level of gray. A simple 9×9 grayscale image is shown in Figure 2-22. The image has two distinct features: one vertical edge and one horizontal edge. How can we extract these features separately by applying two different filters?
To accomplish this task, we could use the two filters (i.e., matrices) in Figure 2-23. Those filters are also referred to as Sobel kernels [36].
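As a self-contained sketch of the idea, the NumPy snippet below applies the two Sobel kernels to a toy grayscale image containing a single vertical edge. The image values here are assumed for illustration and do not reproduce the exact pixels of Figure 2-22; the kernel for vertical edges is the standard Sobel x-kernel, and its transpose detects horizontal edges.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (kernel flipped, per the mathematical definition)."""
    kf = np.flipud(np.fliplr(kernel))
    h, w = kf.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kf)
    return out

# Toy 6x6 grayscale image with one vertical edge:
# left half white (255), right half black (0). Values assumed for illustration.
img = np.zeros((6, 6))
img[:, :3] = 255

# Sobel kernels: sobel_x responds to vertical edges, sobel_y to horizontal ones
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

gx = conv2d(img, sobel_x)  # large magnitudes along the vertical edge
gy = conv2d(img, sobel_y)  # all zeros: this image has no horizontal edge
```

Running this, `gx` peaks in the two output columns straddling the white-to-black transition, while `gy` is zero everywhere, which is exactly the selective behavior we want: each filter responds only to its own feature.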
To apply the filters, a convolution operation is performed, which is discussed in detail in the following section.