
5.6 Experiment – Video Recognition with Transfer Learning

5.6.1 Feature Extraction by DCN

In this section, we examine the features extracted by the pre-trained DCN. To see whether a DCN can extract useful features from images of a different data set, we take the 2-convolution-layer network trained on Yahoo!-Flickr and use the outputs of its middle layers as image features. We then train linear SVMs on these features as recognizers.

The results are in Table 5.4, where we can see that the features extracted by the DCN are not as good as SIFT features. This indicates that the intermediate features learned by DCNs from the image domain are not discriminative for video recognition, and that optimizing the learned features on the target problem is necessary.
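The pipeline in this section, frozen network outputs fed to a linear SVM, can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup used in the experiments: the frozen layer is a hypothetical stand-in (a fixed scaling), the data are toy values, the classifier is binary, and the SVM is fit by plain subgradient descent on the regularized hinge loss because the text does not name a solver.

```python
def extract_features(x, frozen_layers):
    """Pass an input through frozen layers and return the intermediate output.

    No parameter is updated here: the layers only act as a fixed feature map.
    """
    out = x
    for layer in frozen_layers:
        out = layer(out)
    return out


def train_linear_svm(feats, labels, lam=0.01, lr=0.1, steps=500):
    """Binary linear SVM (labels +1/-1) via full-batch subgradient descent."""
    n, dim = len(feats), len(feats[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 regularizer
        grad_b = 0.0
        for x, y in zip(feats, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # the hinge loss is active for this sample
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical stand-in for the pre-trained middle layers: a fixed scaling.
frozen_layers = [lambda v: [2.0 * t for t in v]]

# Toy inputs and labels, for illustration only.
raw_inputs = [[1.0, 0.0], [1.5, 0.5], [-1.0, 0.0], [-1.5, -0.5]]
labels = [1, 1, -1, -1]

feats = [extract_features(x, frozen_layers) for x in raw_inputs]
w, b = train_linear_svm(feats, labels)
predictions = [predict(w, b, f) for f in feats]
```

Only `w` and `b` are learned; the feature map itself never changes, which is what distinguishes this setting from the fine-tuning experiments later in the section. A multi-class recognizer could be built from one such classifier per category, although the text does not specify the scheme.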

5.6.2 Mixing Data Sets

Next, we examine the approach of mixing image and video data sets. In practice, we subsample roughly the same number of images as video frames (0.4 million) from the Yahoo!-Flickr data set and mix these images with the CCV data set to create a new training set. Each sample keeps its original label. The results are shown in Table 5.5. Although the two data sets come from very different sources and share little visual similarity, mixing in images from Yahoo!-Flickr does improve the performance. The first convolution layer kernels in Fig. 5.5 suggest an explanation: when images are mixed into the video data set, the network learns the high-frequency signals that are ignored when it is trained on CCV only. These signals are known to be important, and ignoring them may degrade generalizability.

Table 5.5: Experiment results on the CCV data set. Depth indicates the architecture of the network, which is described in detail in section 5.4. The results indicate that additional depth brings only a minor improvement in MAP but significant computational overhead. Initialization indicates how the learnable parameters in the network are initialized: Random means random initialization, which is standard for neural networks, while Yahoo!-Flickr and ILSVRC2012 indicate initialization with the network parameters pre-trained on the respective data set. Because the networks learn important patterns for natural images from the image data sets, pre-trained networks start with a better intermediate representation and generalize better. Training set indicates whether we use only the CCV video data set or mix it with the image data set for training. Because the intermediate layers are shared by all output units, they can learn simultaneously from the image and video domains, which leads to better performance. Update policy indicates which parameters are updated during training, where FC means fully connected layers and CONV means convolution layers.

By restricting the update of convolution layers, we reduce the number of learnable parameters and avoid overfitting. The results show that our transfer learning approaches improve the performance of DCN, which suffers from significant overfitting when only the standard training process is adopted.

Depth                  Initialization   Training Set          Update Policy   MAP
2-convolution-layers   Random           CCV                   FC + CONV       0.445
2-convolution-layers   Random           CCV + Yahoo!-Flickr   FC + CONV       0.469
2-convolution-layers   Yahoo!-Flickr    CCV                   FC + CONV       0.484
2-convolution-layers   Yahoo!-Flickr    CCV                   FC              0.497
2-convolution-layers   Yahoo!-Flickr    CCV + Yahoo!-Flickr   FC + CONV       0.494
2-convolution-layers   Yahoo!-Flickr    CCV + Yahoo!-Flickr   FC              0.499
2-convolution-layers   ILSVRC2012       CCV                   FC + CONV       0.489
2-convolution-layers   ILSVRC2012       CCV                   FC              0.490
5-convolution-layers   Random           CCV                   FC + CONV       0.469
5-convolution-layers   Random           CCV + Yahoo!-Flickr   FC + CONV       0.482
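The mixing step described in this section can be sketched as follows. This is a minimal illustration with toy data: the function name `mix_datasets` and the sample names are hypothetical, and tagging each label with its source, so that image and video categories remain separate output units, is an assumption about how the two label sets coexist rather than something the text states.

```python
import random


def mix_datasets(video_frames, images, seed=0):
    """Subsample images to match the number of video frames, then merge.

    Each item is a (sample, label) pair. Labels are kept unchanged and are
    only tagged with their source ("video" or "image").
    """
    rng = random.Random(seed)
    n_images = min(len(images), len(video_frames))
    sampled_images = rng.sample(images, n_images)

    mixed = [(x, ("video", y)) for x, y in video_frames]
    mixed += [(x, ("image", y)) for x, y in sampled_images]
    rng.shuffle(mixed)  # interleave the two sources for training
    return mixed


# Toy stand-ins for CCV frames and Yahoo!-Flickr images.
video_frames = [(f"frame{i}", i % 3) for i in range(6)]
images = [(f"img{i}", i % 4) for i in range(20)]

mixed = mix_datasets(video_frames, images)
n_video = sum(1 for _, (source, _) in mixed if source == "video")
n_image = sum(1 for _, (source, _) in mixed if source == "image")
```

With the toy inputs above, the mixed set holds as many images as video frames, mirroring the roughly balanced 0.4 million samples per source used in the experiment.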

Figure 5.6: The first convolution layer kernels of networks that use a pre-trained network as initialization. Although both sets of kernels are learned from the CCV data set, they show different visual patterns, each closer to its own initialization point.

Note the clear one-to-one correspondence between the kernels above and those in Fig. 5.3. In fact, most of the kernels do not change significantly after fine-tuning on the CCV data set, which indicates that the patterns learned from either Yahoo!-Flickr or ILSVRC2012 are helpful for recognizing CCV videos, even though the data sets come from different domains.

5.6.3 Transfer Mid-level Features

In this section, we examine the approach of transferring mid-level features. In practice, we initialize the network for CCV with the network trained on Yahoo!-Flickr or on ILSVRC2012. During training, we either update all the parameters in the network or update only the fully connected layers and keep the convolution kernels unchanged. We keep the convolution kernels unchanged to reduce the number of learnable parameters and thereby avoid overfitting. If the lower-layer convolution kernels do learn important patterns such as lines or corners, they should be similar and reusable across data sets, so keeping them unchanged should not degrade the performance.

The results are in Table 5.5. Initializing the network with pre-trained networks improves the performance significantly, and updating only the fully connected layers is better than updating all parameters. This indicates that when the training data are insufficient, leaving the low-level features in DCNs fixed reduces overfitting and leads to better performance. Note that the networks learn different sets of convolution kernels under different initializations, as shown in Fig. 5.6, while both achieve good performance.
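The two update policies compared in Table 5.5 can be sketched as a single SGD step that skips frozen parameter groups. This is a minimal illustration: the group names (`conv1`, `fc1`, and so on), the toy parameter values and the gradients are all hypothetical, and plain SGD without momentum is assumed for brevity.

```python
# Hypothetical parameter-group names, used only for this illustration.
CONV_GROUPS = {"conv1", "conv2"}
FC_GROUPS = {"fc1", "fc2"}

UPDATE_POLICIES = {
    "FC + CONV": CONV_GROUPS | FC_GROUPS,  # fine-tune every layer
    "FC": FC_GROUPS,                       # keep convolution kernels fixed
}


def sgd_step(params, grads, policy, lr=0.01):
    """Return updated parameters; groups outside the policy are copied as-is."""
    trainable = UPDATE_POLICIES[policy]
    new_params = {}
    for name, values in params.items():
        if name in trainable:
            new_params[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            new_params[name] = list(values)  # frozen: the gradient is ignored
    return new_params


# Toy values standing in for parameters copied from a pre-trained network.
params = {
    "conv1": [0.5, -0.2],
    "conv2": [0.1, 0.1],
    "fc1": [0.1, 0.3],
    "fc2": [0.0, 1.0],
}
grads = {name: [1.0, -1.0] for name in params}

fc_only = sgd_step(params, grads, "FC", lr=0.1)
full = sgd_step(params, grads, "FC + CONV", lr=0.1)
```

Under the FC policy the convolution kernels stay exactly at their pre-trained values, so the number of parameters fitted to the small video data set is reduced to those of the fully connected layers.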

Finally, we combine the second and third approaches. That is, we initialize the network with the one trained on the Yahoo!-Flickr data set and fine-tune the network with both images and video frames. The recognition performance is further improved by the combination.

Chapter 6 Conclusion

In this work, we address the emerging challenge of scalable mobile visual classification. Although scalable visual classification has been addressed in previous works, and applications based on mobile visual recognition are gaining attention in practice, the technical challenges of combining the two have not been discussed. Our analysis shows the intrinsic limitations of mobile visual classification and the drawbacks of applying existing techniques to the problem.

We propose two solutions that work under different environment settings. When the mobile network is not accessible to the system, we develop a novel linear dimension reduction algorithm, KPP, that extends multidimensional scaling based on feature map methods and preserves the classification performance. The performance of KPP benefits from the correlations between dimensions since it is a linear distance metric, as discussed in [38, 56, 48, 32]. Because the exact correlation is unknown, we try to learn it by approximating the RBF kernel matrix, as discussed in section 3.2.4. Our experimental results on three popular data sets (Scene, Caltech-256 and ImageNet) show that it outperforms the state-of-the-art linear hashing algorithms widely adopted in mobile visual search, both supervised and unsupervised, and that the dimension reduction algorithm is compatible with the mobile computation framework.

When the network is available under limited bandwidth, we conduct a systematic evaluation of various strategies for mobile visual recognition under the client-server framework in terms of recognition bitrate. In particular, we compare the recognition bitrate of thumbnail images, various image features and feature signatures. The results show that even a tiny image contains sufficient information for visual recognition, and that by utilizing multiple features extracted from thumbnail images, the recognition bitrate of thumbnail images is much lower than that of raw image features. Although fusing multiple image signatures may achieve a lower bitrate than thumbnail images, extracting multiple (local) features on mobile devices may not be feasible due to CPU and battery constraints. We therefore recommend combining a single local feature signature with the scaled-down thumbnail image, which achieves near-optimal performance under the constraints of the current mobile environment.

Using this strategy, we significantly reduce the average data transmission from 102,570 bytes to 4,661 bytes (i.e., thumbnail images scaled down to 1/8 of the original size and the 8,192-bit signature generated by random projection from the Hessian affine feature with a 64-center VLAD descriptor), while the recognition accuracy only decreases from 0.67 to 0.59, which is still better than any single raw feature.
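The figures quoted above can be put in proportion with a short calculation. This is only arithmetic on the reported numbers; the split of the 4,661 bytes into thumbnail and signature assumes that the payload consists of exactly those two parts, which the text implies but does not state outright.

```python
RAW_BYTES = 102_570      # average transmission before the strategy
COMBINED_BYTES = 4_661   # thumbnail plus signature
SIGNATURE_BITS = 8_192   # random-projection signature

signature_bytes = SIGNATURE_BITS // 8
# Assumption: the payload holds nothing besides the thumbnail and the signature.
thumbnail_bytes = COMBINED_BYTES - signature_bytes

reduction_factor = RAW_BYTES / COMBINED_BYTES
accuracy_before, accuracy_after = 0.67, 0.59
absolute_drop = accuracy_before - accuracy_after
relative_drop = absolute_drop / accuracy_before

print(f"signature: {signature_bytes} bytes, thumbnail: {thumbnail_bytes} bytes")
print(f"transmission reduced about {reduction_factor:.1f}-fold")
print(f"accuracy drop: {absolute_drop:.2f} absolute, {relative_drop:.1%} relative")
```

In other words, the transmitted data shrink roughly 22-fold while the accuracy falls by 0.08, or about 12% in relative terms.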

Finally, we investigate the properties of DCNs. Our preliminary study reveals the correlation between meta-parameters and performance given different data set properties. In particular, we study the effect of depth and image resolution to provide a heuristic for meta-parameter selection in DCNs. We also tackle the lack-of-training-sample problem by transfer learning, which lets the network start with a better intermediate representation that improves generalizability. The transfer learning process makes it possible to train a DCN with scarce training data. Given publicly available image corpora, we can use them to facilitate the learning process on scarce training data, and we achieve reasonable performance using only 4k videos for training. These experiences provide the basis for future study of how DCNs affect mobile visual recognition.

Bibliography

[1] D. Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Sys. Sci., 66(4):671–687, June 2003.

[2] T. Ahonen et al. Face recognition with local binary patterns. In ECCV, 2004.

[3] H. Bay et al. Surf: Speeded up robust features. In ECCV, 2006.

[4] A. Berg, J. Deng, and F.-F. Li. Large scale visual recognition challenge 2010.

[5] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. JMLR, 13:281–305, 2012.

[6] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pages 245–250, 2001.

[7] A. Bosch et al. Representing shape with a spatial pyramid kernel. In ACM CIVR, 2007.

[8] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

[9] V. Chandrasekhar et al. Comparison of local feature descriptors for mobile visual search. In ICIP, 2010.

[10] V. Chandrasekhar et al. Compressed histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vision, 96(3):384–399, 2012.

[11] C.-C. Chang and C.-J. Lin. Libsvm: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), May 2011.

[12] D. Chen et al. Residual enhanced visual vectors for on-device image matching. In Asilomar Conference on Signals, Systems, and Computers, 2011.

[13] L.-C. Dai et al. Imshare: instantly sharing your mobile landmark images by search-based reconstruction. In ACM MM, 2012.

[14] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.

[15] J. Deng, A. C. Berg, K. Li, and F.-F. Li. What does classifying more than 10,000 image categories tell us? In Proc. of the 11th European Conf. Computer Vision, pages 71–84, 2010.

[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 248–255, 2009.

[17] J. Deng et al. Large scale visual recognition challenge 2012.

[18] T. Deselaers, S. Hasan, O. Bender, and H. Ney. A deep learning approach to machine transliteration. In Proceedings of the Fourth Workshop on Statistical Machine Translation, 2009.

[19] M. Douze et al. Evaluation of gist descriptors for web-scale image search. In CIVR, 2009.

[20] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? JMLR, 11:625–660, 2010.

[21] M. Everingham et al. The pascal visual object classes challenge 2007 (voc2007) results, 2007.

[22] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge 2009 (voc2009) results. http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html.

[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008.

[24] E. Gavves, C. G. M. Snoek, and A. W. M. Smeulders. Convex reduction of high-dimensional kernels for visual classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3610–3617, 2012.

[25] B. Girod, V. Chandrasekhar, D. Chen, N.-M. Cheung, R. Grzeszczuk, Y. Reznik, G. Takacs, S. Tsai, and R. Vedantham. Mobile visual search. IEEE Signal Processing Mag., 28(4):61–76, July 2011.

[26] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.

[27] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.

[28] G. Griffin et al. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.

[29] J. He, J. Feng, X. Liu, T. Cheng, T.-H. Lin, H. Chung, and S.-F. Chang. Mobile product search with bag of hash bits and boundary reranking. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3005–3012, 2012.

[30] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2957–2964, 2012.

[31] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[32] Y. Hong, Q.-N. Li, J.-Y. Jiang, and Z.-W. Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In Proc. 13th IEEE Int. Conf. Computer Vision, pages 906–913, 2011.

[33] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3304–3311, 2010.

[34] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.

[35] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding, 2013.

[36] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, 2011.

[37] A. Joly and O. Buisson. Random maximum margin hashing. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 873–880, 2011.

[38] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In Proc. Conf. Advances in Neural Information Processing Systems, pages 657–664, 2003.

[39] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep., 2009.

[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[41] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2169–2178, 2006.

[42] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.

[43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[44] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area v2. In NIPS, 2007.

[45] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[46] J.-G. Li et al. Face recognition using feature of integral gabor-haar transformation. In ICIP, 2007.

[47] S. Litayem, A. Joly, and N. Boujemaa. Hash-based support vector machines approximation for large scale prediction. In Proc. British Machine Vision Conference, 2012.

[48] W. Liu, S.-Q. Ma, D.-C. Tao, J.-Z. Liu, and P. Liu. Semi-supervised sparse metric learning using alternating linearization optimization. In Proc. 16th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pages 1139–1148, 2010.

[49] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.

[50] S. Maji and A. C. Berg. Max-margin additive classifiers for detection. In Proc. 11th IEEE Int. Conf. Computer Vision, pages 40–47, 2009.

[51] K. Mikolajczyk et al. A comparison of affine region detectors. Int. J. Comput. Vision, 65(1-2):43–72, 2005.

[52] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. Int. J. Comput. Vision, 60(1):63–86, 2004.

[53] A. Mohamed, G. E. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. Trans. Audio, Speech and Lang. Proc., 20(1):14–22, 2012.

[54] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, 2001.

[55] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In Proc. of the 11th European Conf. Computer Vision, pages 143–156, 2010.

[56] G.-J. Qi, J.-H. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang. An efficient sparse metric learning in high-dimensional space via l1-penalized log-determinant regularization. In Proc. 26th Int. Conf. Machine Learning, pages 841–848, 2009.

[57] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000.

[58] O. Russakovsky et al. Large scale visual recognition challenge 2013.

[59] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1665–1672, 2011.

[60] M. Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, Univ. British Columbia, 2010.

[61] J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. In Proc. 9th IEEE Int. Conf. Computer Vision, pages 1470–1477, 2003.

[62] K. Sohn, D. Y. Jung, H. Lee, and A. O. Hero. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.

[63] Y.-C. Su, T.-H. Chiu, G.-L. Wu, C.-Y. Yeh, F. Wu, and W. Hsu. Flickr-tag prediction using multi-modal fusion and meta information. In ACM MM, 2013.

[64] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.

[65] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.

[66] A. Vedaldi and B. Fulkerson. Vlfeat: An open and portable library of computer vision algorithms, 2008.

[67] J. Wang and S.-F. Chang. Sequential projection learning for hashing with compact codes. In Proc. 27th Int. Conf. Machine Learning, pages 1127–1134, 2010.

[68] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 3360–3367, 2010.

[69] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009.

[70] G.-L. Wu, Y.-H. Kuo, T.-H. Chiu, W. H. Hsu, and L. Xie. Scalable mobile video retrieval with sparse projection learning and pseudo label mining. IEEE Multimedia, 20(3):47–57, 2013.

[71] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

[72] M.-Q. Xu, X. Zhou, Z. Li, B.-Q. Dai, and T. Huang. Extended hierarchical gaussianization for scene classification. In Proc. 17th IEEE Conf. Image Processing, pages 1837–1840, 2010.

[73] X. Yang and K.-T. Cheng. Accelerating surf detector on mobile devices. In ACM MM, 2012.

[74] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification. Proc. IEEE, 100(9):2584–2603, 2012.
