Convolutional Neural Networks (CNNs) - Steering Control of Cars on Urban Roads Based on Deep-Learning Semantic Segmentation

Chapter 2 Preliminaries

2.4 Convolutional Neural Networks (CNNs)

A Convolutional Neural Network (CNN) is a type of feed-forward artificial neural network whose connectivity pattern between neurons is inspired by the organization of the animal visual cortex. The response of an individual biological neuron can be approximated mathematically by a convolution operation.

Unlike typical artificial neural networks, CNNs exploit the spatially local correlation present in natural images and have three distinguishing features: 3D volumes of neurons, local connectivity, and shared weights. First, a CNN arranges the neurons of a layer in three dimensions (Figure 2-4): width, height, and depth. Here, depth refers to the third dimension of a feature-map volume, not to the depth of a neural network, which is the total number of layers. Second, CNNs exploit spatial locality by enforcing a local connectivity pattern between adjacent layers: each neuron is connected only to a small region of the preceding layer. Third, each convolutional kernel is applied across the entire image. That is, the same kernel is applied at many locations in the image, so all of those locations share the same weights.
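The effect of local connectivity and shared weights on the number of weights can be made concrete with a small count. The sketch below is only illustrative: the layer dimensions (a 32 × 32 × 3 input, eight 3 × 3 × 3 filters, a 30 × 30 × 8 output) and the function names are assumptions chosen for the example, and biases are ignored.

```python
def conv_layer_weights(kernel_w, kernel_h, in_depth, num_filters):
    """Weights in a convolutional layer: one W x H x D filter per output map."""
    return kernel_w * kernel_h * in_depth * num_filters


def dense_layer_weights(in_w, in_h, in_depth, out_w, out_h, out_depth):
    """Weights needed if every output neuron saw every input neuron."""
    return (in_w * in_h * in_depth) * (out_w * out_h * out_depth)


# Hypothetical layer: 32 x 32 x 3 input, eight 3 x 3 x 3 filters,
# producing a 30 x 30 x 8 output volume.
shared = conv_layer_weights(3, 3, 3, 8)
dense = dense_layer_weights(32, 32, 3, 30, 30, 8)

print(shared)  # 216
print(dense)   # 22118400
```

Under these assumptions, the shared-weight layer needs about five orders of magnitude fewer weights than a densely connected layer producing an output of the same size.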

A typical CNN contains convolutional, pooling, and fully connected layers.

Different types of layers are connected locally and stacked together to form a CNN architecture. In a CNN, a convolutional layer convolves its filters with the input feature maps and generates output feature maps. A filter in a convolutional layer has size W × H × D, where W is the width of the filter, H is the height of the filter, and D is the number of input feature maps. Each W × H slice of the W × H × D weights serves as the convolutional kernel for one particular input feature map, and the D per-map responses are summed to form a single output value. With N filters, each filter produces one output feature map by convolving with the input feature maps, giving N output feature maps in total. For example, in Figure 2-5, a W × H × 3 filter convolves with three input feature maps of size 32 × 32 and produces an output feature map of size 30 × 30, which is consistent with W = H = 3, a stride of 1, and no padding.
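The operation of a single filter can be sketched in plain Python as follows. This is a minimal illustration, not an efficient implementation: it assumes a stride of 1 and no padding, omits the bias term, and, like most deep learning frameworks, computes a cross-correlation (the kernel is not flipped). The function name `conv2d_valid` and the all-ones test data are made up for the example.

```python
def conv2d_valid(feature_maps, filt):
    """Apply one W x H x D filter to D input feature maps (stride 1, no padding).

    feature_maps: list of D maps, each a list of rows.
    filt: list of D kernels, each a list of H rows holding W values.
    Returns a single output feature map as a list of rows.
    """
    depth = len(feature_maps)
    in_h, in_w = len(feature_maps[0]), len(feature_maps[0][0])
    k_h, k_w = len(filt[0]), len(filt[0][0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            total = 0.0
            for d in range(depth):  # one W x H kernel per input map
                for i in range(k_h):
                    for j in range(k_w):
                        total += feature_maps[d][y + i][x + j] * filt[d][i][j]
            output[y][x] = total  # per-map responses are summed over depth
    return output


# Three 32 x 32 input maps filled with ones and an all-ones 3 x 3 x 3 filter.
maps = [[[1.0] * 32 for _ in range(32)] for _ in range(3)]
ones_filter = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
out_map = conv2d_valid(maps, ones_filter)

print(len(out_map), len(out_map[0]))  # 30 30
print(out_map[0][0])                  # 27.0
```

The 30 × 30 result matches the output size quoted for Figure 2-5, and each output value sums 3 × 3 × 3 = 27 products.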

Figure 2-5: A filter is used to produce a new feature map.

Instead of relying on kernels predefined by humans, as is common in classical image processing, CNNs learn the values of their convolution kernels during training. To reduce the computational complexity of 2D convolution, a convolutional layer connects each neuron in a feature map only to a small region of the input. In Figure 2-6, multiple neurons at the same location in the output feature maps are connected to the same region in the input feature map. This spatial connectivity is controlled by the filter size: a filter of size W × H × D defines the spatial connectivity between the input and output feature maps.

During the forward pass, each convolutional kernel slides over every valid position in an input feature map, and the responses at those positions form a new feature map.
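The number of positions a kernel can take along one spatial dimension determines the size of the new feature map. A commonly used relation is given below; the symbols are introduced here for illustration and do not appear in the original text: I is the input size, K the kernel size, P the zero padding added on each side, S the stride, and O the output size.

```latex
O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1
```

With I = 32, K = 3, P = 0, and S = 1, this gives O = 30, which agrees with the 32 × 32 to 30 × 30 example of Figure 2-5.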

Figure 2-6: Local connectivity of a convolutional layer.

Pooling layers slide a window over the input much as convolutional layers do, but instead of computing a weighted sum they take the maximum (max pooling) or the average (average pooling) of the values within the region defined by a 2D window. Figure 2-7 assumes a 2 × 2 window with stride 2, and each color marks a pooling position and its result. It is common to insert a pooling layer periodically between convolutional layers in a CNN. The purpose of a pooling layer is to decrease the spatial size of a feature map, which in turn reduces the number of parameters in subsequent layers and the amount of computation in the network. Pooling layers are also not involved in the training process of CNNs and have no parameters associated with them, since their operations are fixed.

Figure 2-7: Illustration of the max pooling operation.
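The max pooling operation described above can be sketched as follows, using the same 2 × 2 window and stride 2 as Figure 2-7. The function name and the 4 × 4 input values are made up for illustration and are not taken from the figure.

```python
def max_pool2d(feature_map, window=2, stride=2):
    """Max pooling over a single 2D feature map given as a list of rows."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    out_h = (in_h - window) // stride + 1
    out_w = (in_w - window) // stride + 1
    pooled = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Collect the values under the window at this pooling position.
            region = [
                feature_map[y * stride + i][x * stride + j]
                for i in range(window)
                for j in range(window)
            ]
            row.append(max(region))
        pooled.append(row)
    return pooled


# Made-up 4 x 4 feature map.
fmap = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled_map = max_pool2d(fmap)
print(pooled_map)  # [[6, 8], [3, 4]]
```

Replacing `max(region)` with `sum(region) / len(region)` would give average pooling instead. In both cases the function has no trainable parameters.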

Finally, a fully connected layer is a layer in which each neuron is connected to all neurons in the previous layer. Figure 2-8 illustrates this idea: the green layer on the right is a fully connected layer in which each neuron (white circle) connects to all neurons in the previous layer (shown in blue), and the connections are drawn as lines of different colors. As a result, a fully connected layer has dense connections between neurons and requires a significant amount of storage and computation.

Figure 2-8: Illustration of a fully connected layer.
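A fully connected layer of this kind amounts to one weighted sum per output neuron. The sketch below assumes a bias term per output neuron, which the text above does not mention, and uses made-up weights; the names `fully_connected`, `fc_weights`, and `fc_biases` are illustrative.

```python
def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum over all inputs plus a bias.

    weights[k][i] is the connection from input neuron i to output neuron k.
    """
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical layer: 4 input neurons fully connected to 2 output neurons.
fc_inputs = [1.0, 2.0, 3.0, 4.0]
fc_weights = [
    [0.5, 0.0, -1.0, 0.25],
    [1.0, 1.0, 1.0, 1.0],
]
fc_biases = [0.5, -1.0]

fc_outputs = fully_connected(fc_inputs, fc_weights, fc_biases)
num_params = len(fc_weights) * len(fc_inputs) + len(fc_biases)

print(fc_outputs)  # [-1.0, 9.0]
print(num_params)  # 10
```

The parameter count grows with the product of the two layer sizes, which is why fully connected layers dominate storage when the layers are large.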

In addition to the aforementioned layers, a CNN can also contain non-linear activation and normalization layers. A non-linear activation applies a non-linear transformation to the output of a convolutional or fully connected layer.

Common non-linear functions used in CNNs are the rectified linear unit (ReLU), sigmoid, and hyperbolic tangent (tanh). Figure 2-9 plots these functions. ReLU is defined as

$$f(x) = \max(0, x)$$

The sigmoid function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Figure 2-9: Plots for various non-linear activation functions.

(Left: ReLU, center: sigmoid, right: tanh)
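The three activation functions plotted in Figure 2-9 can be written directly with Python's standard `math` module. This is a scalar sketch for illustration only; a practical implementation would operate on whole feature maps at once.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, clamps negatives to 0."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: squashes x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes x into the open interval (-1, 1)."""
    return math.tanh(x)


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), sigmoid(value), tanh(value))
```

At x = 0 the three functions return 0, 0.5, and 0 respectively, which matches the shapes of the curves described in the caption.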

A normalization layer is usually used to improve training efficiency and model accuracy by controlling the distribution of the inputs to each layer. Batch normalization [28] is a common choice for normalization layers; it performs the normalization over each training mini-batch. Usually, the distribution of the layer input is normalized to zero mean and unit standard deviation. In batch normalization, the normalized value is then scaled and shifted by learnable parameters.
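The normalize-then-scale-and-shift step can be sketched for a single feature over one mini-batch as follows. The parameter names `gamma` (scale), `beta` (shift), and `eps` (a small constant for numerical stability) are assumptions of this sketch, as are the sample values. Only the training-time computation is shown.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]


# Made-up activations of one neuron across a mini-batch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized_batch = batch_norm(activations)

new_mean = sum(normalized_batch) / len(normalized_batch)
new_var = sum((x - new_mean) ** 2 for x in normalized_batch) / len(normalized_batch)
print(new_mean, new_var)  # approximately 0 and 1
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance; other values of `gamma` and `beta` let the layer restore a different scale and offset.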

A common CNN architecture pattern stacks a convolutional layer, a ReLU activation layer, and a pooling layer. The pattern repeats until the image has been downsampled to a small size. Finally, fully connected layers are appended to the end of the CNN, and the last fully connected layer holds the output, such as the class scores.
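How this repeated pattern shrinks the image can be traced with a few lines of arithmetic. The sketch below assumes 3 × 3 convolutions with a padding of 1 (so the convolution keeps the spatial size) and 2 × 2 pooling with stride 2; these settings, the 32 × 32 input, and the function names are illustrative choices, not values taken from the text.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Spatial output size of a pooling layer along one dimension."""
    return (size - window) // stride + 1


def trace_shapes(size, num_blocks, kernel=3, padding=1):
    """Spatial size after each conv -> ReLU -> pool block.

    ReLU is applied element-wise, so it does not change the size.
    """
    sizes = [size]
    for _ in range(num_blocks):
        size = conv_out(size, kernel, padding)
        size = pool_out(size)
        sizes.append(size)
    return sizes


print(trace_shapes(32, 3))  # [32, 16, 8, 4]
```

Each block halves the width and height under these assumptions, so three blocks reduce a 32 × 32 input to 4 × 4 before the fully connected layers.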

Many deep CNNs have been proposed over the past two decades, each with a different network architecture in terms of the arrangement of layers: for example, the types of layers, their order, and their number. In addition, the configuration of filters, such as the width, height, and depth of each filter, distinguishes one CNN from another.

Figure 2-10: Architecture of VGG-16.

VGG-16, proposed by Simonyan and Zisserman [29], is a widely adopted CNN architecture (Figure 2-10). It consists of 16 weight layers: 13 convolutional layers and 3 fully connected layers. The core idea of VGG-16 is to stack small 3 × 3 filters, which have fewer weights, to achieve the same effect as a larger filter; for example, two stacked 3 × 3 convolutions cover the same 5 × 5 region as a single 5 × 5 filter. Therefore, all convolutional layers in VGG-16 use a 3 × 3 filter size. In addition, the three fully connected layers at the end of the network have 4096, 4096, and 1000 neurons, respectively.
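The trade-off between small and large filters can be checked with a short calculation. The sketch below compares two stacked 3 × 3 convolutions with a single 5 × 5 convolution, assuming stride 1, no bias terms, and the same hypothetical channel count (64) at the input and output of every layer; the function names are illustrative.

```python
def conv_weights(kernel, channels_in, channels_out):
    """Number of weights in a kernel x kernel convolutional layer (no biases)."""
    return kernel * kernel * channels_in * channels_out


def stacked_receptive_field(kernel, num_layers):
    """Input region seen by one output neuron after stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


channels = 64  # hypothetical channel count, identical for all layers
two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

print(stacked_receptive_field(3, 2))  # 5
print(two_3x3, one_5x5)               # 73728 102400
```

Under these assumptions the two 3 × 3 layers see the same 5 × 5 input region while using 18C² weights instead of 25C², where C is the channel count, and they also apply two non-linearities instead of one.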
