Car Steering Angle Prediction - Urban Road Car Steering Control Based on Deep Learning Semantic Segmentation

Chapter 3 Methodology

3.3 Car Steering Angle Prediction

After obtaining the semantic segmentation from the previous stage, we move on to the second stage. The result of the semantic segmentation is a 2D map that assigns each pixel of the original input a label, which is an integer value that represents a class. For example, a value of 2 represents the road class, so pixels in the input image that are covered by road are labeled 2. In this stage, we use a CNN called the Control Network to predict steering angles from the segmentation result. Because the input to the second stage is a 2D segmentation map, a CNN-based approach is a natural choice for extracting features from the map and predicting steering angles.
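To make the input format concrete, the following minimal sketch builds a toy label map and extracts the road pixels. Only the road class ID of 2 comes from the text; the map size, the other class IDs, and the helper names `road_mask` and `road_fraction` are hypothetical and chosen for illustration.

```python
# Toy 4 x 6 segmentation map: each entry is an integer class ID.
# ID 2 (road) follows the text; IDs 0 and 1 are hypothetical placeholders.
ROAD = 2

seg_map = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 2, 2, 2, 2, 1],
    [2, 2, 2, 2, 2, 2],
]


def road_mask(label_map, road_id=ROAD):
    """Return a binary mask with 1 wherever a pixel is labeled as road."""
    return [[1 if label == road_id else 0 for label in row] for row in label_map]


def road_fraction(label_map, road_id=ROAD):
    """Return the fraction of pixels labeled as road."""
    total = sum(len(row) for row in label_map)
    road = sum(row.count(road_id) for row in label_map)
    return road / total


mask = road_mask(seg_map)
print(mask[2])                 # [0, 1, 1, 1, 1, 0]
print(road_fraction(seg_map))  # 10 of 24 pixels, about 0.417
```

The Control Network receives a map of this kind, at full resolution, as its only input.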

Figure 3-7: Overview of the second stage.

Figure 3-7 provides a high-level overview of the second stage. In our system, the Control Network serves as both the decision module and the control module of an autonomous driving system, because the system only solves the lane following problem and does not deal with other high-level decision problems, such as switching lanes or overtaking vehicles ahead. The Control Network decides how to drive along the lane based on the input semantic segmentation and executes the decision by adjusting the steering angle.

Since the segmentation result has removed many details from the input image, the design goals of the Control Network are compactness and efficiency. A compact CNN architecture is preferred because the segmentation map contains few features. In addition, we want to reduce the computation time while still achieving promising results. Therefore, the Control Network should have as few layers as possible and should not contain any complex structures.

The basic idea of the Control Network is to downsample the segmentation map and feed it to a fully connected layer for steering angle prediction. The Control Network architecture for mapping a semantic segmentation to a steering angle is shown in Figure 3-8.

Figure 3-8: Architecture of the Control Network.

The Control Network is composed of 4 blocks. Each block has a convolutional layer with 16 filters of size 3 × 3, followed by a 2 × 2 max-pooling layer with stride 2 that downsamples the output feature maps of the block. Finally, two fully connected layers with 256 neurons and 1 neuron, respectively, are attached to the end of the Control Network. The last fully connected layer predicts the steering angle value.
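The architecture described above can be traced layer by layer to see how compact it is. The sketch below is only an illustration under stated assumptions: the input resolution (1 × 64 × 128), the use of zero padding that preserves spatial size in each convolution, and the inclusion of bias terms are not given in the text, and the function names are hypothetical.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of a k x k convolutional layer."""
    return in_ch * out_ch * k * k + out_ch


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def trace_control_network(height, width, in_ch=1, blocks=4, filters=16, hidden=256):
    """Return (output shape of each block, total parameter count)."""
    shapes, params = [], 0
    h, w, c = height, width, in_ch
    for _ in range(blocks):
        # 3 x 3 convolution; padding is ASSUMED to keep h and w unchanged.
        params += conv_params(c, filters)
        c = filters
        # 2 x 2 max-pooling with stride 2 halves each spatial dimension.
        h, w = h // 2, w // 2
        shapes.append((c, h, w))
    flat = c * h * w
    params += dense_params(flat, hidden)  # fully connected layer, 256 neurons
    params += dense_params(hidden, 1)     # single output neuron: steering angle
    return shapes, params


# ASSUMED input: a single-channel 64 x 128 label map.
shapes, total = trace_control_network(64, 128)
print(shapes)  # [(16, 32, 64), (16, 16, 32), (16, 8, 16), (16, 4, 8)]
print(total)   # 138705
```

Under these assumptions, most of the parameters sit in the first fully connected layer, so the input resolution and the number of blocks largely determine the model size.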

To meet our design goal, a 3 × 3 filter size is used in every convolutional layer. With a small filter size such as 3 × 3, finer and more local information may be preserved during feature extraction. With larger filter sizes, such as 5 × 5 and 7 × 7, more context is considered, but finer and more local information may be lost. In addition, a smaller filter size results in faster execution and a smaller model size.

Another design choice of the Control Network is the number of neurons in a fully connected layer. The number of neurons affects both the execution time and the model size: more neurons result in longer execution time and a larger model. In addition, the number of neurons determines the learning capacity of the network. When the number of neurons is large, the CNN has a better learning capability, but it is also more likely to overfit the training data. Therefore, the number of neurons in the second-to-last fully connected layer is set to 256, which is the best-performing value found by experimenting with different values.
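The effect of the layer width on model size can be illustrated with a short calculation. This is a sketch only: the flattened feature size of 512 is a hypothetical value (the text does not state the input resolution), bias terms are assumed, and the candidate widths other than 256 are illustrative rather than the values tried in the experiments.

```python
def fc_head_params(flat_features, hidden):
    """Parameters of the two fully connected layers: flat -> hidden -> 1."""
    first = flat_features * hidden + hidden  # weights + biases
    second = hidden * 1 + 1                  # single steering-angle output
    return first + second


FLAT = 512  # HYPOTHETICAL flattened feature size after the last block

sizes = {hidden: fc_head_params(FLAT, hidden) for hidden in (64, 128, 256, 512)}
for hidden, count in sizes.items():
    print(hidden, count)
# 64 32897
# 128 65793
# 256 131585
# 512 263169
```

The parameter count grows linearly with the number of neurons, which is the trade-off between capacity and model size that the paragraph above describes.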

In general, the parameters of the Control Network are driven by empirical results. For example, we conducted an experiment that varied the number of convolutional blocks, and 4 blocks of a convolutional layer and a max-pooling layer bring the best outcome. The same experiments show that using 16 filters of size 3 × 3 gives the best result. In addition, we conducted experiments that replaced each max-pooling layer with a convolutional layer with 16 filters of size 3 × 3 and stride 2. Both configurations downsample a feature map by a factor of two, but using a max-pooling layer gives a better result.
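The two downsampling options compared above can be checked with the standard output-size formulas. In this sketch the padding of 1 for the strided convolution and the inclusion of biases are assumptions, and the function names are hypothetical.

```python
def pool_out(n, k=2, stride=2):
    """Output length of unpadded k-wide max-pooling along one dimension."""
    return (n - k) // stride + 1


def strided_conv_out(n, k=3, stride=2, pad=1):
    """Output length of a padded, strided convolution along one dimension."""
    return (n + 2 * pad - k) // stride + 1


def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of a k x k convolutional layer."""
    return in_ch * out_ch * k * k + out_ch


# Both options halve an even-sized feature map.
results = [(n, pool_out(n), strided_conv_out(n)) for n in (64, 32, 16, 8)]
print(results)  # [(64, 32, 32), (32, 16, 16), (16, 8, 8), (8, 4, 4)]

# Max-pooling has no learnable parameters; a 16-filter 3 x 3 convolution
# on 16 input channels adds this many per replaced pooling layer:
print(conv_params(16, 16))  # 2320
```

Both options produce feature maps of the same size, so the comparison in the experiments is between a parameter-free operation and a learned one.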

The output representation is an important design decision in the network architecture. In an autonomous driving system, the Control Module outputs a driving control, e.g., a steering angle. In general, the output driving controls should be continuous scalar values. The output of the Control Network should therefore be a steering angle value that reflects the input semantic segmentation, and the output steering angles should be continuous and range from sharp left to sharp right. For this reason, we adopt a single value representation for the output of the Control Network.

The single value representation uses a one-neuron fully connected layer to represent a steering angle value, as shown in Figure 3-8. The goal of the Control Network is to predict a steering angle from a segmentation map, which can be viewed as a single value regression problem. Therefore, mean squared error (MSE) is used as the loss function for training the Control Network.

MSE is defined by the following equation:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2$$

where $\hat{y}_i$ is the prediction for the i-th training sample, $y_i$ is the actual value for the i-th training sample, and N is the total number of training samples.
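As a worked example of the MSE loss with the symbols defined above, the following minimal sketch averages the squared differences between predictions and actual values. The steering angle values are made up for illustration and are not taken from the experiments.

```python
def mse(predictions, targets):
    """Mean squared error: the average of (prediction - target)^2 over N samples."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


# HYPOTHETICAL steering angles (arbitrary units), N = 4 samples.
y_hat = [0.10, -0.25, 0.00, 0.40]  # predictions
y = [0.00, -0.25, 0.10, 0.20]      # actual values

loss = mse(y_hat, y)
print(loss)  # approximately 0.015
```

Because the errors are squared, the last sample (error 0.2) contributes four times as much to the loss as the first (error 0.1), so large steering errors are penalized more heavily than small ones.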
