Experiment – Network Configuration Selection

One difficulty of applying DCN to visual recognition is the large number of meta-parameters in the model. Learning algorithms such as SVM usually have only 1∼2 meta-parameters in practice, which can be selected using cross-validation. DCN has at least an order of magnitude more meta-parameters, and these meta-parameters cannot be determined independently. Combined with the time-consuming training process, this makes it computationally infeasible to perform cross-validation for the meta-parameters. Although [5] suggests random search over grid search, it still relies on cross-validation for parameter search.
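To give a rough sense of scale, the sketch below counts the training runs an exhaustive grid search with k-fold cross-validation would need. The numbers of candidate values per meta-parameter (5), folds (5), and DCN meta-parameters (20) are illustrative assumptions, not figures from our experiments.

```python
def grid_search_runs(num_params: int, values_per_param: int, folds: int) -> int:
    """Training runs needed for exhaustive grid search with k-fold cross-validation."""
    # Every combination of candidate values is trained once per fold.
    return (values_per_param ** num_params) * folds


# Assumed settings: 5 candidate values per meta-parameter and 5 folds.
svm_runs = grid_search_runs(num_params=2, values_per_param=5, folds=5)
# Assumed: 20 meta-parameters, i.e. an order of magnitude more than SVM.
dcn_runs = grid_search_runs(num_params=20, values_per_param=5, folds=5)

print(f"SVM-like model: {svm_runs} runs")
print(f"DCN-like model: {dcn_runs:.3e} runs")
```

The count grows exponentially with the number of meta-parameters, and each DCN run is itself far more expensive than an SVM run.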

Instead, we believe that by studying the correlation between the properties of the target problem and the optimal or sub-optimal parameters, we can learn a heuristic that gives a reasonable range of meta-parameters based on the problem itself, rather than relying on a computation-intensive parameter-search procedure.

To determine the meta-parameters for the frame-based video recognizer, we study the effect of meta-parameters on performance using the Yahoo!-Flickr and ILSVRC2012 data sets, and we set the meta-parameters based on these observations instead of performing cross-validation. In particular, we study the input image resolution and the required depth of the network, which have the greatest effect on the computational cost of DCN. We also study how to sample frames from video, which shows that a large training set with high diversity is necessary for training DCN.

5.5.1 Image Resolution

Early experiments with DCN are mostly based on data sets with very low resolution images. While it is claimed that these thumbnail images are still recognizable by humans [65], images with higher resolution clearly contain more detailed information that may be helpful for visual recognition. In fact, most existing visual recognition benchmarks, as well as real image corpora such as Flickr, are composed of images with much higher resolution [16, 71], and experiments on thumbnail images do not approximate real images.

Table 5.1: Performance comparison of different input image resolutions on the ILSVRC2012 and Yahoo!-Flickr data sets. For ILSVRC2012, decreasing the image resolution significantly degrades the performance, while the performance on Yahoo!-Flickr is less sensitive to input image resolution. The results reflect the fact that ILSVRC2012 is an object-level recognition data set, where the object may take up only a small part of the image, so decreasing the image resolution makes the object unrecognizable. Yahoo!-Flickr, on the other hand, is a scene-level recognition data set, where the category depends on the entire image and therefore remains recognizable in thumbnail images. Nevertheless, high resolution images generally yield better performance. Also note the poor performance of the 3-convolution-layer network with 32x32 input image resolution, which indicates the limit on depth imposed by input image resolution. Therefore, we use 256x256 input image resolution for the following experiments.

Res.       Depth      ILSVRC2012          Yahoo!-Flickr
                      Top-1    Top-5      Accuracy    MAP
32x32      2-conv.    0.26     0.47       0.48        0.46
           3-conv.    0.23     0.43       0.45        0.43
64x64      2-conv.    0.31     0.55       0.51        0.50
           3-conv.    0.31     0.55       0.49        0.47
128x128    2-conv.    0.31     0.54       0.50        0.50
           3-conv.    0.39     0.63       0.51        0.51
256x256    2-conv.    0.40     0.64       0.54        0.53
           3-conv.    0.46     0.71       0.56        0.56

However, higher resolution introduces higher computational cost, which grows roughly quadratically with the image side length in DCN. Therefore, we investigate whether high resolution images are necessary for DCN in general visual recognition.
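The quadratic growth can be illustrated with a small calculation. This is a minimal sketch that assumes the cost of the convolution layers is proportional to the number of pixel positions in the input; the function name and the 32x32 baseline are chosen here for illustration only.

```python
def relative_conv_cost(side: int, base_side: int = 32) -> float:
    """Cost relative to a base resolution, assuming cost is proportional to pixel count."""
    return (side / base_side) ** 2


# The four square resolutions used in our resolution experiments.
resolutions = [32, 64, 128, 256]
costs = {side: relative_conv_cost(side) for side in resolutions}

for side, cost in costs.items():
    print(f"{side}x{side}: about {cost:.0f}x the cost of 32x32")
```

Under this assumption, each doubling of the side length multiplies the cost by four, so 256x256 input is about 64 times as expensive as 32x32 input.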

To study the performance of DCN under different image resolutions, we resize the images of Yahoo!-Flickr and ILSVRC2012 to four different resolutions ranging from 256x256 down to 32x32. We then train DCNs with either 2 or 3 convolution layers at each resolution. The results are in Table 5.1, where the two data sets respond differently to image resolution. For the ILSVRC2012 data set, the performance is very sensitive to the image resolution, and we can consistently obtain 10%∼15% relative improvement by doubling the resolution. Also note that the 3-convolution-layer network performs worse than the 2-convolution-layer network at 32x32 input image resolution, which implies a limit on depth imposed by image resolution. In fact, we have to abandon max pooling after the second convolution layer of the network to have a large enough hidden layer; otherwise the network is nearly unlearnable and performs extremely poorly. The Yahoo!-Flickr data set, on the other hand, shows only moderate performance degradation when reducing the resolution, and we can achieve reasonable performance even with 32x32 thumbnail images.
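The limit on depth imposed by resolution can be seen by tracking the spatial size of the feature maps. The sketch below is only illustrative: it assumes padded convolutions that preserve spatial size and a pooling step that halves each side, neither of which is stated in the text, and the function and parameter names are hypothetical.

```python
def feature_map_sides(input_side, num_conv_layers, pool_factor=2, skip_pool_after=()):
    """Side length of the feature map after each convolution layer.

    Assumptions (not taken from the experiments): convolutions are padded
    and keep the spatial size; each pooling step divides the side length
    by pool_factor. Layers listed in skip_pool_after have no pooling.
    """
    sides = []
    side = input_side
    for layer in range(1, num_conv_layers + 1):
        if layer not in skip_pool_after:
            side = max(1, side // pool_factor)
        sides.append(side)
    return sides


print(feature_map_sides(32, 3))                       # [16, 8, 4]
print(feature_map_sides(32, 3, skip_pool_after={2}))  # [16, 16, 8]
print(feature_map_sides(256, 3))                      # [128, 64, 32]
```

Under these assumptions, a 32x32 input leaves very little spatial extent by the third layer unless one pooling step is dropped, whereas a 256x256 input still has room for additional layers.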

The difference between the two data sets stems from the fact that they are designed for different purposes and have different properties. In particular, while the ILSVRC2012 data set is designed for object recognition, where the object may be present in only a small part of the image, the Yahoo!-Flickr data set is designed for tag prediction, where the tags are mostly high-level concepts that describe the entire image. The results indicate that for such scene-level tasks we may use smaller images without significant loss of performance while reducing the computational cost quadratically. Nevertheless, increasing the resolution generally yields better performance and enables the use of deeper networks, so we use 256x256 resolution in the following experiments.

5.5.2 Depth of Architecture

In this section, we compare the performance of DCNs with different numbers of convolution layers. Previous works on ILSVRC use a single depth for their networks [40, 42]. Although a significant performance degradation with fewer layers is mentioned in [40], it is unclear whether the same conclusion holds across all situations. Since adding layers to the network significantly increases the computational cost, we would like to use networks that are as shallow as possible if the additional layers contribute nothing, or even contribute negatively, to the performance.

To evaluate the effect of different depths on the performance, we learn convolution networks with 2∼4 convolution layers on the Yahoo!-Flickr data set with 256x256 image resolution. The results are in Table 5.2. The 3-convolution-layer network turns out to have the best performance, although the performance gain from the 3rd convolution layer is relatively minor. We also carry out the same experiment on ILSVRC2012. The performance gain from the additional convolution layer is more significant than that in Yahoo!-Flickr, which

Table 5.2: Performance comparison of DCNs with different depths on the Yahoo!-Flickr and ILSVRC2012 data sets. For the Yahoo!-Flickr data set, the 3-convolution-layer network achieves the best performance, but the performance gain over the 2-convolution-layer network is minor. The 4-convolution-layer network has much worse performance, which shows that a deeper network may even degrade the performance. For the ILSVRC2012 data set, on the other hand, the additional layers significantly improve the performance, and a deep network is necessary to achieve state-of-the-art performance. A possible reason for the difference is the number of classes in each data set: Yahoo!-Flickr has only 10 classes and therefore 10 output units in the network, while ILSVRC2012 has 1,000 output units in the network. This postulation is supported by the results on the CCV data set, as shown in Table 5.5. Note that additional layers significantly increase the computational cost, so we use 2-convolution-layer networks in the following experiments to reduce the computation overhead, since the CCV data set has only 20 categories and should benefit less from the additional depth.

The difference is related to the innate properties of the data sets. In particular, the complexity of ILSVRC2012 lies in the large number of classes, while that of Yahoo!-Flickr lies in the intra-class diversity and the noisy data. We believe the requirement for depth mainly stems from the large number of output classes. Because our target CCV data set is more similar to the Yahoo!-Flickr data set in terms of the number of classes, a 2-convolution-layer network should have reasonable performance, as confirmed in Table 5.5.

5.5.3 Training Data Number and Diversity

In this section, we discuss how to sample frames from videos for the DCN. Based on the results on the ILSVRC2012 and Yahoo!-Flickr data sets, we train a 2-convolution-layer network as a frame-level recognizer. Our first attempt is to use the keyframes for training, where a total of 8,508 keyframes from the CCV training set are used. These keyframes are resized to 256x256 resolution for training. The resulting performance, however, is poor, with significant overfitting. Our postulation is that a larger training set with sufficient diversity is necessary. This will be an immediate obstacle for many visual recognition tasks, where obtaining a large enough training set with ground truth may be challenging.

Table 5.3: The performance of 2-convolution-layer networks on Yahoo!-Flickr with different numbers of training samples and cycles. When only 20k samples per class are used for training, the network suffers from significant overfitting and has poor performance on the test set. The results indicate that a large training set is necessary for learning a robust DCN. We choose 20k samples per class for comparison because the performance of linear SVM saturates around 20k training samples per class.

Cycles           5               10              20
Training Size    20k     150k    20k     150k    20k     150k
Loss (Train)     0.62    1.27    0.33    1.30    0.19    1.12
Loss (Test)      1.65    1.54    1.67    1.49    1.72    1.44
Accuracy         0.44    0.51    0.44    0.52    0.43    0.54
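As a reading aid for Table 5.3, the gap between test loss and training loss gives a rough indicator of overfitting. The sketch below computes it from the loss values reported in the table; the helper name and the choice of this gap as the indicator are ours, introduced only for illustration.

```python
# Loss values copied from Table 5.3 (columns: 5, 10, and 20 training cycles).
cycles = [5, 10, 20]
train_loss = {"20k": [0.62, 0.33, 0.19], "150k": [1.27, 1.30, 1.12]}
test_loss = {"20k": [1.65, 1.67, 1.72], "150k": [1.54, 1.49, 1.44]}


def generalization_gap(train, test):
    """Test loss minus training loss per column, rounded to two decimals."""
    return [round(te - tr, 2) for tr, te in zip(train, test)]


gaps = {size: generalization_gap(train_loss[size], test_loss[size])
        for size in train_loss}

for size, gap in gaps.items():
    for n_cycles, value in zip(cycles, gap):
        print(f"{size} per class, {n_cycles} cycles: gap = {value:.2f}")
```

With 20k samples per class the gap widens as training continues, while with 150k samples per class it stays small.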

To verify the postulation, we study the effect of training set size on the Yahoo!-Flickr data set. We choose Yahoo!-Flickr because studying the dependency of performance on training set size is part of the purpose the data set was designed for. The ILSVRC2012 data set, on the other hand, is not suitable for this study, because reducing the training set may either leave very few samples for some classes or change the ratio between classes.

We train 2-convolution-layer networks on Yahoo!-Flickr using both 20k and 150k training samples for each class. We choose 20k training samples for comparison because previous results show that the performance of a linear classifier saturates at 20k training samples per class. The results in Table 5.3 clearly show that a smaller training set leads to significant overfitting and therefore poor performance. For the CCV data set, we try to increase the amount of training data by sampling the videos uniformly. We compare the results using 1-fps sampling and 4-fps sampling, which lead to roughly 400k and 1.6 million training samples respectively. Although 4-fps sampling leads to a much larger training set, experimental results show that its performance is nearly identical to that of 1-fps sampling, while it takes much more storage and computation.

The reason is that 4-fps sampling results in a large data set with small visual diversity, and the redundant training data are not helpful for learning. Therefore, we use 1-fps sampling to produce the training data from the CCV data set.

Table 5.4: The performance of SVM using features generated by DCN. The network is trained on the Yahoo!-Flickr data set and has two convolution layers and two fully connected layers. The outputs of the second convolution layer (Layer 2), the first fully connected layer (Layer 3), and the second fully connected layer (Layer 4) are extracted as features. The features extracted by DCN do not perform well, and optimizing the features on the target problem is necessary.

Feature Layer              MAP
Layer 2 + Linear SVM       0.366
Layer 3 + Linear SVM       0.369
Layer 4 + Linear SVM       0.370
SIFT + Linear SVM          0.407
SIFT + χ² SVM              0.486³
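The uniform frame sampling used for the CCV data set can be sketched as follows. This is a minimal illustration, not our actual extraction code: the function name is hypothetical, and the 10-second clip at 30 frames per second is an assumed example.

```python
import math
from typing import List


def sample_frame_indices(num_frames: int, native_fps: float, target_fps: float) -> List[int]:
    """Indices of frames taken uniformly at target_fps from a video at native_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = native_fps / target_fps          # native frames between two samples
    count = math.ceil(num_frames / step)    # samples that fit inside the video
    # Clamp to the last frame to guard against floating-point rounding.
    return [min(int(k * step), num_frames - 1) for k in range(count)]


# Assumed example: a 10-second clip recorded at 30 frames per second.
one_fps = sample_frame_indices(num_frames=300, native_fps=30, target_fps=1)
four_fps = sample_frame_indices(num_frames=300, native_fps=30, target_fps=4)

print(len(one_fps), one_fps[:3])
print(len(four_fps), four_fps[:3])
```

Sampling at 4 fps yields four times as many frames, but consecutive samples are only a few native frames apart and therefore nearly identical, which matches the low visual diversity observed with 4-fps sampling.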

5.6 Experiment – Video Recognition with Transfer
