Transfer Learning with Deep Convolution Network

In this section, we describe the transfer learning approaches we apply to utilize image information in video recognition. Transfer learning helps to learn a more generalizable DCN. This is important because DCNs are prone to overfitting, especially when training data is scarce. While increasing the amount of training data helps to solve the problem, there are cases where collecting new data with complete ground truth is difficult.

Transfer learning addresses this problem by using labeled data from other domains, where a large amount of training data is available, to improve the network.

The goal is similar to that of the pre-training process in DBN and Stacked Auto-Encoder (SAE), in the sense that it improves generalizability by learning a better intermediate representation [20]. A good representation does not necessarily minimize the training loss, which is what supervised backpropagation optimizes in a deep neural network; instead, it should capture important patterns that are general to all data. While DBN and SAE achieve this by performing unsupervised pre-training before supervised training, we learn the representations from other domains and then optimize the representation through transfer learning.

[Figure 5.2 diagram: image-label and video-label output units on top of a shared representation]

Figure 5.2: Mixing training sets for transfer learning. The DCN is trained with both images and video frames simultaneously. Because the middle layers are shared by all output units, the internal representation benefits from both the image and video domains, and the resulting network is more robust. This helps to avoid overfitting and to learn general visual patterns in natural images.

5.2.1 Feature Extraction with Neural Network

Motivated by [45, 62], which show that the intermediate values of DBNs may be extracted as discriminative image features even if the network is trained on another, non-overlapping data set, the first transfer learning approach is to use a pre-trained network as a feature extractor. Because the number of training samples in the target data set is too small to learn a new DCN, this approach utilizes DCNs trained on other data sets that are large enough. The intermediate values, that is, the activations of the middle layers for each image, are taken as image features. Under the assumption that important visual patterns are similar across all natural images, the pre-trained network should have learned these patterns and the extracted features should be universal, as suggested by the results of [45].
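The feature extraction step can be illustrated with a minimal sketch. It is not the network used in this work: a small fully connected network with random weights stands in for the pre-trained DCN, and the names `forward_activations`, `extract_features`, and `pretrained_layers` are hypothetical. The sketch only shows the idea of keeping the weights fixed and reading the activations of one middle layer as the feature vector.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward_activations(x, layers):
    """Run a batch through the network and return the activations of every layer."""
    activations = []
    h = x
    for W, b in layers:
        h = relu(h @ W + b)
        activations.append(h)
    return activations

def extract_features(x, layers, layer_index):
    """Use the activations of one intermediate layer as the feature vector."""
    return forward_activations(x, layers)[layer_index]

# Assumption: random weights stand in for a network pre-trained on a large
# source data set. The weights are never updated in this approach.
rng = np.random.default_rng(0)
pretrained_layers = [
    (rng.standard_normal((64, 32)) * 0.1, np.zeros(32)),
    (rng.standard_normal((32, 16)) * 0.1, np.zeros(16)),
    (rng.standard_normal((16, 10)) * 0.1, np.zeros(10)),  # source-domain output layer
]

# Five flattened samples from the (small) target data set.
target_images = rng.standard_normal((5, 64))

# Take the second layer's activations as features for the target samples.
features = extract_features(target_images, pretrained_layers, layer_index=1)
print(features.shape)  # (5, 16)
```

The extracted `features` would then be passed to a separate classifier trained on the target data set; which layer to read is a design choice that this sketch leaves open.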

5.2.2 Mixing Data Sets

The second approach for transfer learning is to mix image data sets with the video data set. Equivalently, we train a DCN that simultaneously recognizes images from the image data sets and frames from the video data set, as illustrated in Fig. 5.2. This is possible because the intermediate layers of a neural network are shared by all output units and may therefore benefit from the training data of other classes. If the lower layers can learn general visual patterns that are shared across different data sets, the additional classes and training data can help to learn better low- and middle-level features and avoid overfitting.

Figure 5.3: First convolution layer kernels learned from the Yahoo!-Flickr and ILSVRC2012 data sets. Despite being learned from non-overlapping data sets, some kernels are visually similar, which supports the assumption that some important image patterns are shared across different data sets, provided that enough training data is available.
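The mixed training set of Section 5.2.2 can be sketched as follows. This is an illustration under stated assumptions rather than the training code of this work: a single shared hidden layer stands in for the shared layers of the DCN, the data are random, and the helper names `mix_data_sets` and `sgd_step` are hypothetical. The video labels are shifted past the image labels so that one softmax layer holds an output unit for every image class and every video class.

```python
import numpy as np

def mix_data_sets(image_x, image_y, video_x, video_y, num_image_classes):
    """Merge two labeled sets; video labels are shifted past the image labels."""
    x = np.concatenate([image_x, video_x], axis=0)
    y = np.concatenate([image_y, video_y + num_image_classes], axis=0)
    return x, y

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(x, y, W1, W2, lr=0.1):
    """One gradient step on cross-entropy; W1 is shared, W2 is the joint output layer."""
    h = np.maximum(x @ W1, 0.0)   # shared representation
    p = softmax(h @ W2)           # one unit per image class and per video class
    rows = np.arange(len(y))
    loss = -np.log(p[rows, y]).mean()
    d_logits = p.copy()
    d_logits[rows, y] -= 1.0
    d_logits /= len(y)
    dW2 = h.T @ d_logits
    dh = d_logits @ W2.T
    dh[h <= 0] = 0.0              # ReLU gradient
    dW1 = x.T @ dh                # receives gradient from image and video samples
    return W1 - lr * dW1, W2 - lr * dW2, loss

# Toy data: random vectors stand in for images and video frames.
rng = np.random.default_rng(0)
num_image_classes, num_video_classes = 3, 2
image_x = rng.standard_normal((6, 20))
image_y = rng.integers(0, num_image_classes, size=6)
video_x = rng.standard_normal((4, 20))
video_y = rng.integers(0, num_video_classes, size=4)

mixed_x, mixed_y = mix_data_sets(image_x, image_y, video_x, video_y, num_image_classes)
W1 = rng.standard_normal((20, 8)) * 0.1
W2 = rng.standard_normal((8, num_image_classes + num_video_classes)) * 0.1
W1, W2, loss = sgd_step(mixed_x, mixed_y, W1, W2)
```

Because every sample in the mixed batch contributes to `dW1`, the shared layer is updated by both domains, which is the effect the approach relies on.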

5.2.3 Transfer Mid-level Features

The third approach for transfer learning is to transfer the learned features from images to videos by initializing the learnable parameters of the DCN with those of a network pre-trained on the image domain. The network is then fine-tuned on the target data set to optimize the features for the target domain. This approach is motivated by the fact that image features rely on important visual patterns that are shared across all natural images, so they can be used to characterize images outside the training set. Because the convolution kernels in a DCN learn important low-level visual patterns in natural images [42], these kernels play the role of the low-level features used in traditional visual recognition, and they may also be similar across all natural images and data sets. This can be seen in Fig. 5.3, where some of the first-layer kernels are very similar even though they are learned from two non-overlapping data sets. Therefore, the network parameters learned from one data set may be useful for another data set.

The fine-tuning step is performed to further optimize the features. This is especially important for the higher layers in the DCN, such as the fully connected layers, because these layers capture more complex patterns [42] that may not generalize well to other domains. For example, while lines and corners are common to all natural images, a face pattern appears only in specific domains and may not be useful in all problems.

Because training a neural network is a non-convex optimization problem, different initial values of the learnable parameters lead to different networks. In fact, it is known that good initialization leads to better performance [64]. Therefore, by initializing the learnable parameters with learned visual patterns, we should be able to learn more generalizable networks, as suggested by the pre-training process.
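The initialization and fine-tuning steps of this approach can be sketched as follows. This is a minimal illustration, not the implementation used in this work: a small fully connected network with random weights stands in for the pre-trained DCN, and the names `init_from_pretrained` and `fine_tune_step` are hypothetical. Replacing the output layer with a freshly initialized one is an assumption of the sketch, made because the source and target data sets generally have different classes.

```python
import numpy as np

def init_from_pretrained(pretrained, num_target_classes, rng):
    """Copy every layer except the last; create a fresh output layer (assumption)."""
    layers = [(W.copy(), b.copy()) for W, b in pretrained[:-1]]
    hidden_dim = pretrained[-1][0].shape[0]
    W_out = rng.standard_normal((hidden_dim, num_target_classes)) * 0.01
    layers.append((W_out, np.zeros(num_target_classes)))
    return layers

def fine_tune_step(layers, x, y, lr):
    """One SGD step on softmax cross-entropy that updates every layer."""
    # Forward pass; acts[i] is the input to layer i.
    acts = [x]
    for W, b in layers[:-1]:
        acts.append(np.maximum(acts[-1] @ W + b, 0.0))
    W_out, b_out = layers[-1]
    logits = acts[-1] @ W_out + b_out
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    rows = np.arange(len(y))
    loss = -np.log(p[rows, y]).mean()
    # Backward pass from the output layer down to the first layer.
    grad = p
    grad[rows, y] -= 1.0
    grad /= len(y)
    updated = []
    for i in range(len(layers) - 1, -1, -1):
        W, b = layers[i]
        dW = acts[i].T @ grad
        db = grad.sum(axis=0)
        if i > 0:
            grad = (grad @ W.T) * (acts[i] > 0)  # ReLU gradient
        updated.append((W - lr * dW, b - lr * db))
    return updated[::-1], loss

rng = np.random.default_rng(0)
# Assumption: random weights stand in for a network pre-trained on images.
pretrained = [
    (rng.standard_normal((64, 32)) * 0.1, np.zeros(32)),
    (rng.standard_normal((32, 16)) * 0.1, np.zeros(16)),
    (rng.standard_normal((16, 10)) * 0.1, np.zeros(10)),
]
target_layers = init_from_pretrained(pretrained, num_target_classes=4, rng=rng)

# Toy target data: 8 samples, 4 classes.
x = rng.standard_normal((8, 64))
y = rng.integers(0, 4, size=8)
tuned_layers, loss = fine_tune_step(target_layers, x, y, lr=0.01)
```

A small learning rate is used here so that the copied parameters move only slightly from their pre-trained values; the appropriate rate for a real network is not specified by this sketch.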
