
3.4 Experimental Results and Discussions



Implementation Platform

CAFFE [125] is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. It is now a very popular deep learning platform, and we implemented our CRCNN framework on top of it so that future practitioners can easily extend the framework and integrate their own implementations.

Early and Late Fusion Schemes

We evaluate our comparative method with two different schemes: early fusion and late fusion [132]. The framework described in this study adopts the late fusion scheme, i.e., we extract features from the input image and from each baseline sample separately and then fully connect all the information in the final layer of the deep architecture. Alternatively, the early fusion scheme first combines the input image with the baseline samples and then extracts information from both types of images simultaneously. Both fusion schemes are optimized, tested, and compared with state-of-the-art results.
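As a rough illustration of the two schemes (not the actual CRCNN code), the following toy Python sketch contrasts them with a generic feature extractor and classifier; all names, shapes, and weights here are placeholders of our own choosing.

```python
import numpy as np

def phi(x, W):
    """Toy feature extractor: a single linear map followed by ReLU."""
    return np.maximum(0, x @ W)

def g(features, V):
    """Toy classifier head: linear scores."""
    return features @ V

rng = np.random.default_rng(0)
x = rng.standard_normal(64)          # flattened input image (toy size)
b = rng.standard_normal(64)          # one flattened baseline sample

# Early fusion: combine the input and the baseline first, then extract
# features jointly with a single set of weights.
W_shared = rng.standard_normal((128, 16))
V_early = rng.standard_normal((16, 2))
scores_early = g(phi(np.concatenate([x, b]), W_shared), V_early)

# Late fusion: extract features from each image separately (its own weights),
# then fully connect the concatenated features into the final layer.
W_x, W_b = rng.standard_normal((64, 16)), rng.standard_normal((64, 16))
V_late = rng.standard_normal((32, 2))
scores_late = g(np.concatenate([phi(x, W_x), phi(b, W_b)]), V_late)

print(scores_early.shape, scores_late.shape)   # both yield comparative scores
```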

3.4.2 Optimization of Our CRCNN Framework

In this section, we present the optimization of our deep architecture to provide insights into the sensitivity of the parameters associated with our CRCNN framework. First, the performance of the comparative stage under different settings of the deep architecture parameters (e.g., fusion strategy, baseline, region detection, etc.) is shown in Figure 3.3. Each sub-figure represents the performance of a parameter for its different values (or choices). Empirically optimal values of our CRCNN parameters obtained from the experiments are summarized in Table 3.1.

Fig. 3.3 Optimization of our CRCNN approach: Performance for the different settings of the deep architecture’s parameters.

Table 3.1 Optimized setting of our CRCNN method.

Deep architecture’s parameters        Optimized value
Fusion                                Early
Number of baseline samples            5
Region detection                      Yes
Number of convolutional layers        3
Number of locally connected layers    0
Number of fully connected layers      1
Batch size                            32

Second, the sensitivity between the parameters is presented in Figure 3.4. Each sub-figure represents the correlation coefficient between a parameter and the others, computed from the obtained performances (of the comparative stage). The lower the correlation coefficient (close to 0), the more independent the two parameters are; the higher the correlation coefficient (close to 1), the more the two parameters depend on each other in terms of the performance of the comparative stage. For example, in Figure 3.4(g), the correlation coefficient between batch size (BS) and dropout (D) is less than 0.5 (weakly related), and the correlation coefficient of BS with itself is naturally 1 (perfectly related). Note that the raw image pixels are used as the extracted features.

Fig. 3.4 Optimization of our CRCNN approach: Sensitivity of the deep architecture’s parameters.
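As an illustration of how such a sensitivity value can be obtained, the following sketch computes a correlation coefficient between two performance profiles with NumPy; the accuracy values are made up for the example and are not taken from Figure 3.4.

```python
import numpy as np

# Hypothetical accuracies (%) of the comparative stage recorded over the same
# set of experimental runs while varying two parameters; illustrative only.
acc_batch_size = np.array([88.3, 87.9, 88.1, 88.0, 87.8])
acc_dropout    = np.array([88.3, 86.5, 87.2, 85.9, 88.0])

# Correlation coefficient between the two performance profiles: values close
# to 0 suggest the parameters can be tuned independently, values close to 1
# suggest a strong dependency.
r = abs(np.corrcoef(acc_batch_size, acc_dropout)[0, 1])
print(f"correlation(BS, D) = {r:.2f}")
```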

CRCNN Parameters

Fusion strategy (F): Early and late fusion differ in terms of weight sharing. In early fusion, both types of images (input and baseline) share the same set of weights, whereas in late fusion, each image has its own set of weights. As shown in Figure 3.3(a), the first value (88.3%) represents the accuracy when early fusion is applied to our CRCNN framework, and the second value (83.9%) represents the accuracy when late fusion is applied. In other words, Figure 3.3(a) shows better accuracy with early fusion. This observation intuitively corresponds to the fact that learning shared weights better captures the inner relation between the input image and the baseline. We observe in Figure 3.4(a) that the optimization of each fusion strategy depends on the entire deep architecture (i.e., convolutional layers, locally connected layers, and fully connected layers) and on the value of the dropout.
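For readers implementing a shared-weight variant in CAFFE, weight sharing between two branches can be expressed by giving their layers identical parameter names; the snippet below is a generic illustration of this mechanism (dummy data layers, arbitrary filter counts), not the exact CRCNN network definition.

```python
import caffe
from caffe import layers as L

n = caffe.NetSpec()
# Two dummy data layers standing in for the input image and a baseline sample.
n.img = L.DummyData(shape=[dict(dim=[1, 3, 60, 60])])
n.base = L.DummyData(shape=[dict(dim=[1, 3, 60, 60])])

# Giving both convolution layers the same parameter names ("shared_w",
# "shared_b") makes Caffe use a single set of weights for both branches,
# which is the weight-sharing ingredient discussed above.
shared = [dict(name='shared_w', lr_mult=1), dict(name='shared_b', lr_mult=2)]
n.conv_img = L.Convolution(n.img, kernel_size=5, num_output=32, param=shared)
n.conv_base = L.Convolution(n.base, kernel_size=5, num_output=32, param=shared)

print(str(n.to_proto()))
```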

Baseline (B): Each baseline sample is used as a reference representing a range of possible ages (i.e., labels). In our optimization, we consider M baseline samples per label with M = 1 and M = 5. As expected and as shown in Figure 3.3(b), a more robust computation is obtained when M > 1 baseline samples represent each label. Correlations exist between this parameter and the region detection, as well as several deep learning parameters, such as the momentum and the weight penalty (Figure 3.4(b)).
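A minimal sketch of building such a baseline set, assuming the training data are given as parallel lists of images and age labels; the selection strategy shown (uniform random sampling) is only illustrative.

```python
import random
from collections import defaultdict

def build_baseline_set(samples, labels, m=5, seed=0):
    """Pick m baseline samples per age label from the training data.

    `samples` is any sequence of images and `labels` the matching age labels;
    the grouping and random choice here only illustrate the idea of using
    M > 1 references per label.
    """
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    return {label: rng.sample(group, min(m, len(group)))
            for label, group in by_label.items()}

# Example: 9 age-group labels with M = 5 baselines each gives k = 45 references.
```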

Region detection (R): We optimized our method with and without region detection; in other words, this is equivalent to optimizing our CRCNN method combined with either R-CNN [122] or a classical CNN [111]. Figure 3.3(c) shows the results of this optimization, and it is clear that the region detection ΨR extracts more robust features and improves performance. The benefit of applying this detection depends on the settings of its input (e.g., the baseline) and output (e.g., the convolutional layers), as observed in Figure 3.4(c).
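The role of the region detector as a preprocessing step can be sketched as follows, assuming the proposal boxes are already available from a detector such as R-CNN; the cropping and resizing code below is an illustration of ours, not part of the R-CNN implementation.

```python
import numpy as np

def crop_regions(image, boxes, out_size=(60, 60)):
    """Crop candidate regions (x1, y1, x2, y2) from an H x W x C image and
    resize them to a fixed input size with simple nearest-neighbour sampling.
    This only sketches the preprocessing role of the region detector; the
    proposal boxes themselves would come from the detector."""
    crops = []
    for x1, y1, x2, y2 in boxes:
        region = image[y1:y2, x1:x2]
        ys = np.linspace(0, region.shape[0] - 1, out_size[0]).astype(int)
        xs = np.linspace(0, region.shape[1] - 1, out_size[1]).astype(int)
        crops.append(region[np.ix_(ys, xs)])
    return np.stack(crops)

# e.g. crop_regions(face_image, [(10, 12, 90, 100)]) -> shape (1, 60, 60, C)
```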

Convolutional layers (CL): We optimized the convolutional layers ΨC with respect to the influence of the number of layers. We experimented with several numbers of layers, and the results are shown in Figure 3.3(d). We observe that three convolutional layers provide the best results, and that the number of layers is logically correlated with its preceding and following layers (the region detector and the locally connected layers ΨL), as well as with the value of the dropout and the early/late fusion choice (Figure 3.4(d)).

Locally connected layers (LL): We optimized the locally connected layers ΨL. Figure 3.3(e) shows the results for different numbers of layers. The most accurate result is obtained when the convolutional layers ΨC are directly connected to the fully connected layer ΨF, i.e., with no locally connected layer. Its influence on the other parameters is the same as that of the convolutional layers (Figure 3.4(e)).

Fully connected layers (FL): The optimization of the fully connected layers ΨF is shown in Figure 3.3(f). We observe that a single fully connected layer is sufficient to yield the best results. Furthermore, the number of fully connected layers can be set independently of the other parameters (Figure 3.4(f)).

Batch size (BS): "Batch" learning accumulates the contributions of all data points and then updates the parameters. We instead use "mini-batch" learning [133], where the parameters are updated after every n data points (i.e., the dataset is divided into small piles and each pile is learned separately). The computation time for learning the deep architecture depends on the number of epochs and on the size of the batches. Figure 3.3(g) compares two different batch sizes. Empirically, we take BS = 32, and the batch size can be optimized independently (Figure 3.4(g)).
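A minimal sketch of mini-batch iteration with n = 32; the data arrays and the update step are placeholders of our own, not part of the CAFFE solver.

```python
import numpy as np

def minibatches(data, labels, batch_size=32, seed=0):
    """Yield shuffled (data, label) piles of `batch_size` samples; the
    parameters are then updated once per pile rather than once per epoch."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], labels[idx]

# One epoch of updates (update_step stands in for the actual solver step):
# for x_batch, y_batch in minibatches(train_x, train_y, batch_size=32):
#     update_step(x_batch, y_batch)
```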


Activation function (AF): The nonlinear activation function is typically chosen to be either the logistic sigmoid (sigm) function or the rectified linear unit (ReLU). We observe in Figure 3.3(h) that ReLU gives better accuracy than sigm. In general, ReLU trains faster and outperforms the other activation functions. This parameter can also be set independently (Figure 3.4(h)).
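For reference, the two candidate activation functions can be written as follows (a plain NumPy sketch).

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid: saturates for large |x|, which slows training."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified linear unit: identity for positive inputs, zero otherwise;
    its non-saturating positive side is the usual reason it trains faster."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigm(x))   # [0.119 0.378 0.5   0.622 0.881] (rounded)
print(relu(x))   # [0.  0.  0.  0.5 2. ]
```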

Dropout (D): In the dropout process, each hidden unit is randomly omitted from the deep architecture with a given probability, so that a hidden unit cannot rely on other hidden units being present. We observe that this parameter is correlated with the deep architecture (Figure 3.4(i)). We also observed above a dependency between the influence of this parameter and the early/late fusion choice. Therefore, each fusion strategy leads to its own setting: D = 0.5 for early fusion (Figure 3.3(i)) and D = 0 for late fusion.
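A minimal sketch of the dropout mechanism (inverted dropout, which rescales the surviving units; the bookkeeping of CAFFE's dropout layer may differ slightly).

```python
import numpy as np

def dropout(activations, p=0.5, seed=0):
    """Randomly omit each hidden unit with probability p and rescale the
    surviving units by 1 / (1 - p) so that the expected activation is
    unchanged at test time."""
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

hidden = np.ones(8)
print(dropout(hidden, p=0.5))   # roughly half of the units are zeroed out
```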

Learning rate (LR) and momentum (M): We continue the analysis with the learning rate and the momentum. Each iteration updates the weights according to the computed gradient. The learning rate controls the convergence speed, while the momentum parameter introduces a damping effect on the search procedure, thereby avoiding oscillations in irregular areas of the error surface by averaging gradient components with opposite signs and accelerating the convergence in long flat areas. In our experiments, we observed in Figures 3.3(j) and 3.3(k) that a unit learning-rate step and a momentum value near 1 lead to better convergence. Thus, we set LR = 1 and M = 0.9, which must be tuned jointly (Figures 3.4(j) and 3.4(k)). In particular, the use of M in the age estimation task helps prevent the search procedure from being trapped in a local minimum, and it improves the convergence of the back-propagation algorithm in general.
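A toy sketch of the update rule with LR = 1 and M = 0.9 on a one-dimensional error surface (plain Python, not the CAFFE solver).

```python
def momentum_step(w, grad, velocity, lr=1.0, momentum=0.9):
    """One gradient-descent update with momentum: the velocity accumulates
    past gradients, which damps oscillations and speeds up progress along
    long flat areas of the error surface."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy usage on the error surface E(w) = 0.5 * w**2, whose gradient is w.
w, v = 5.0, 0.0
for _ in range(20):
    w, v = momentum_step(w, w, v, lr=1.0, momentum=0.9)
```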

Weight penalty (WP): The last parameter is a constraint on the weight update. We observe in Figures 3.3(l) and 3.4(l) that the penalty can be set to WP = 1e-2; furthermore, this parameter influences the setting of several other parameters, such as the momentum, the baseline, and the fully connected layers.
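Read as an L2 weight-decay term, the penalty simply adds WP · w to the gradient before the momentum update; a minimal extension of the previous sketch:

```python
def momentum_decay_step(w, grad, velocity, lr=1.0, momentum=0.9,
                        weight_penalty=1e-2):
    """Momentum update with an L2 weight penalty: the decay term
    weight_penalty * w is added to the gradient, constraining how far the
    weights can grow between updates."""
    velocity = momentum * velocity - lr * (grad + weight_penalty * w)
    return w + velocity, velocity
```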

In summary, the architecture of our CNN consists of three convolutional layers (CL), each followed by rectification, max-pooling, and normalization, plus one fully connected layer (FL). The network architecture is detailed as follows (a sketch of this stack in CAFFE is given after the list):

1. CL: The kernel size is 5 × 5, 1 stride - ReLU - pool 3 × 3, 2 stride - local response normalization (LRN).

2. CL: The kernel size is 5 × 5, 1 stride - ReLU - pool 3 × 3, 2 stride - LRN.

3. CL: The kernel size is 5 × 5, 1 stride - ReLU - pool 3 × 3, 2 stride - LRN.

4. FL.

5. SoftMax Loss Layer.
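The stack above can be sketched with CAFFE's Python NetSpec interface as follows. The data layer, the number of filters per convolution, the LRN constants, and the input size are placeholders (they are not specified above), so the snippet illustrates the layer ordering rather than the exact CRCNN prototxt; the nine-way output matches the nine age labels used in our experiments.

```python
import caffe
from caffe import layers as L, params as P

def conv_block(bottom, num_filters):
    """conv 5x5, stride 1 -> ReLU -> max-pool 3x3, stride 2 -> LRN."""
    conv = L.Convolution(bottom, kernel_size=5, stride=1,
                         num_output=num_filters)
    relu = L.ReLU(conv, in_place=True)
    pool = L.Pooling(relu, pool=P.Pooling.MAX, kernel_size=3, stride=2)
    return L.LRN(pool, local_size=5, alpha=1e-4, beta=0.75)  # assumed constants

n = caffe.NetSpec()
# Placeholder data layer: a batch of 32 images and labels (sizes illustrative).
n.data, n.label = L.DummyData(shape=[dict(dim=[32, 3, 60, 60]),
                                     dict(dim=[32, 1, 1, 1])], ntop=2)
n.block1 = conv_block(n.data, 32)              # 1. CL
n.block2 = conv_block(n.block1, 32)            # 2. CL
n.block3 = conv_block(n.block2, 32)            # 3. CL
n.fc = L.InnerProduct(n.block3, num_output=9)  # 4. FL (nine age labels)
n.loss = L.SoftmaxWithLoss(n.fc, n.label)      # 5. SoftMax loss layer

print(str(n.to_proto()))
```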

Computational Cost

Given an input image, our comparative approach compares it with all k baseline samples rather than with all N training samples. For example, in our experiments, each age label is represented by one baseline sample, and we have nine labels, which makes k = 9. In other words, we only need to compute the comparative relationship of the input image k times, where k is small and much less than N. Therefore, the computational cost of our approach is reasonable.
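As a small back-of-the-envelope check (the training-set size N below is hypothetical):

```python
# Number of comparative evaluations needed for one input image.
num_labels = 9            # age labels in our experiments
baselines_per_label = 1   # the example above uses one baseline per label
k = num_labels * baselines_per_label   # k = 9 comparisons per input
N = 50000                              # hypothetical training-set size
print(f"comparisons per input: {k} "
      f"(vs. {N} if compared with every training sample)")
```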