
4.2 Experiment Settings

4.2.3 Testing Details

We follow the testing procedure of previous work [26]. For each testing video, we uniformly sample 25 frames together with the corresponding features, i.e., RGB frames and stacked optical flows. We then crop each frame and its horizontal flip at the four corners and the center with a 224 × 224 window, yielding 10 crops per frame. The model predicts action probabilities for each crop, so every video produces 250 predictions. We average these probabilities over all crops and frames, and the action with the maximum averaged probability is taken as the prediction for the video.

We then compare the predictions with the ground truth to evaluate the accuracy of the model. The overall accuracy is video-wise, i.e., the number of correctly classified videos divided by the total number of testing videos.
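To make the protocol concrete, the following is a minimal sketch of this 25-frame, 10-crop evaluation and of the video-wise accuracy, written in PyTorch under the assumption that frames are PIL images and that the model returns class logits; the function and variable names are illustrative, not the actual implementation.

```python
import torch
import torchvision.transforms.functional as TF

def ten_crop_tensors(frame, size=224):
    """4 corner crops + center crop of the frame and of its horizontal flip (10 crops)."""
    crops = TF.ten_crop(frame, size)                        # tuple of 10 cropped PIL images
    return torch.stack([TF.to_tensor(c) for c in crops])    # (10, C, 224, 224)

def predict_video(model, frames, num_classes):
    """Average class probabilities over len(frames) x 10 crops (25 x 10 = 250 here)."""
    probs = torch.zeros(num_classes)
    with torch.no_grad():
        for frame in frames:                                 # 25 uniformly sampled frames
            batch = ten_crop_tensors(frame)
            probs += torch.softmax(model(batch), dim=1).sum(dim=0)
    probs /= len(frames) * 10
    return int(probs.argmax())                               # predicted action index

def video_accuracy(model, videos, labels, num_classes):
    """Video-wise accuracy: correctly classified videos / all testing videos."""
    correct = sum(predict_video(model, v, num_classes) == y
                  for v, y in zip(videos, labels))
    return correct / len(videos)
```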

For fairness, we use the same condition in every experiment and only test the performance of the CNN model. This condition is illustrated in Figure 4.10.

4.3 Results

First, we choose a baseline CNN for our task. Our candidate models are DenseNet [11], Inception V3 [34], and ResNet [10]. We evaluate these models on UCF101 split 1 and HMDB51 split 1, and we compare their performance on different features

Figure 4.10: For fairness, we use the same setting to test our model.

Figure 4.11: An illustration to show how accuracy is evaluated.

such as RGB, optical flow, and the two-stream combination. The results are shown in Table 4.2 and Table 4.3.

Table 4.2: Evaluation of CNN models on UCF101

CNN model       RGB(%)   Optical flow(%)   Two-stream(%)
DenseNet-121    79.62    75.68             86.10
DenseNet-169    82.66    76.16             87.95
DenseNet-201    83.32    76.45             88.08
Inception-V3    82.87    75.52             87.87
ResNet-50       78.83    78.59             86.34

Table 4.3: Evaluation of CNN models on HMDB51

CNN model       RGB(%)   Optical flow(%)   Two-stream(%)
DenseNet-121    40.20    37.32             44.44
DenseNet-169    41.31    41.83             46.27
DenseNet-201    41.44    41.70             46.54
Inception-V3    43.14    42.29             49.61
ResNet-50       40.13    44.05             47.51

As can be seen, Inception V3 performs relatively well on both human action recognition datasets: its two-stream accuracy is 87.87% on UCF101 and 49.61% on HMDB51. Furthermore, the total training time of Inception V3 is about 12 hours, while the other networks take about 24 hours. We therefore choose Inception V3 for our task because of its training efficiency and accuracy.

In the next evaluation, we demonstrate that the multitask learning (MTL) framework is effective.

We take UCF101 split 1 as the source data and HMDB51 split 1 as the target data. Under the MTL framework, recognition performance on HMDB51 can be improved, and the improvement is larger than that obtained by finetuning from UCF101 or by direct training. We use Inception V3 and set the loss weights wS = 0.375 and wT = 1.0 according to the relative sizes of the two datasets.
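As an illustration of this weighted multitask objective, the sketch below combines a source (UCF101) head and a target (HMDB51) head on a shared backbone and forms the loss L = wS · LS + wT · LT; the module structure and names are assumptions for exposition, not the exact architecture used in the thesis.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared backbone with one classifier per dataset (a sketch only)."""
    def __init__(self, backbone, feat_dim, n_source=101, n_target=51):
        super().__init__()
        self.backbone = backbone                           # e.g., Inception V3 without its top layer
        self.fc_source = nn.Linear(feat_dim, n_source)     # UCF101 head
        self.fc_target = nn.Linear(feat_dim, n_target)     # HMDB51 head

    def forward(self, x):
        feat = self.backbone(x)
        return self.fc_source(feat), self.fc_target(feat)

def mtl_loss(model, source_batch, target_batch, w_s=0.375, w_t=1.0):
    """Weighted sum of the two cross-entropy losses: L = w_s * L_S + w_t * L_T."""
    ce = nn.CrossEntropyLoss()
    xs, ys = source_batch                                  # UCF101 inputs and labels
    xt, yt = target_batch                                  # HMDB51 inputs and labels
    logits_s, _ = model(xs)
    _, logits_t = model(xt)
    return w_s * ce(logits_s, ys) + w_t * ce(logits_t, yt)
```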

The comparison result is shown in Table 4.4.

In addition, we demonstrate that MTL is effective for two-stream networks [26]. A two-stream network consists of two major parts: a spatial network and a temporal network. The spatial network takes RGB frames as input, while the temporal network takes optical flows as input, and the outputs of both networks are fused to predict the action.

Table 4.4: Evaluation of MTL on the HMDB51 dataset

CNN model       HMDB51(%)
Inception-V3    43.14
Finetuning      46.21
MTL             49.54

We use MTL to improve the accuracy of the temporal network and of the full two-stream network. Both streams are based on Inception V3, and UCF101 is used to improve the performance on HMDB51. The results are shown in Table 4.5.

Table 4.5: Evaluation of Two-stream on the HMDB51 dataset under the MTL framework

CNN model          HMDB51(%)   MTL(%)
Spatial Network    43.14       49.54
Temporal Network   42.29       44.71
Two-stream         49.61       56.14

We observe that the two-stream model benefits from the MTL framework: without MTL, the accuracy on HMDB51 is 49.61%, and with MTL the overall accuracy increases to 56.14%.
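The fusion step itself can be summarized in a short sketch: per-video class probabilities from the two streams are combined by a weighted sum and the highest-scoring class is reported. The equal fusion weights below are an assumption; the exact fusion rule used in the thesis may differ.

```python
import torch

def two_stream_predict(spatial_probs, temporal_probs, w_spatial=0.5, w_temporal=0.5):
    """Fuse per-video class probabilities from the spatial (RGB) stream and the
    temporal (optical-flow) stream; each input is a tensor of shape (num_classes,)."""
    fused = w_spatial * spatial_probs + w_temporal * temporal_probs
    return int(fused.argmax())     # index of the predicted action
```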

We evaluate our model on the Drone dataset under four settings.

First, we directly train Inception V3 on the Drone dataset. Second, we show that the detection step can improve the recognition performance of Inception V3. Third, we combine MTL with Inception V3 to show the effectiveness of MTL and compare it with the previous setting. Finally, we combine the detection step and MTL to increase the generalization ability of our model. The results are shown in Table 4.6.

Table 4.6: Evaluation of the detection step and the MTL step on the Drone dataset

CNN model                  Drone dataset(%)
Inception-V3               36.61
Detection+Inception-V3     41.96
Inception-V3+MTL           42.86
All                        44.64
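For clarity, the sketch below shows one plausible form of the detection step at inference time: a person detector proposes a bounding box, the frame is cropped around it, and the crop is classified. The detector interface, the single-box choice, and the fallback to the full frame are assumptions, not the exact pipeline of this work.

```python
import torch
import torch.nn.functional as F

def detect_then_classify(detector, classifier, frame):
    """Crop the detected person region before classification so the CNN focuses
    on the human. `detector` is a hypothetical person detector returning integer
    boxes (x1, y1, x2, y2); `frame` is a (C, H, W) float tensor."""
    boxes = detector(frame)
    if len(boxes) == 0:
        crop = frame                           # no detection: fall back to the whole frame
    else:
        x1, y1, x2, y2 = boxes[0]              # take the most confident person box
        crop = frame[:, y1:y2, x1:x2]
    crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                         mode='bilinear', align_corners=False)
    return torch.softmax(classifier(crop), dim=1)   # (1, num_classes) probabilities
```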

4.4 Discussion

In this section, we discuss the relation between MTL performance and the weights w of the loss functions. In the MTL step, the weight of each loss function must be set before training, and we examine how the ratio between these weights affects recognition performance. To verify this, we use UCF101 data to improve the HMDB51 task in the MTL step. We fix wT = 1.0 and vary wS from 0.125 to 0.875. The HMDB51 performance is shown in Figure 4.12.

Figure 4.12: HMDB51 performance versus weights of loss functions.

Before investigating this factor, we assume that the weights should balance the data sizes of the different sources. For example, the training set of UCF101 contains about 9.5k videos and that of HMDB51 about 3.5k, so we set the weight of each loss function to the reciprocal of its data size, i.e., 1/9.5k : 1/3.5k ≃ 0.37 : 1.0, giving wS = 0.375 and wT = 1.0. In Figure 4.12, the peak value of 50.59% appears at wS = 0.25, which is close to the 49.54% obtained at wS = 0.375. In sum, the weight of each loss function can be set to the reciprocal of the data size of its source.
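A small helper makes this rule explicit; the dataset sizes used below are the approximate counts quoted above, and the function name is illustrative.

```python
def loss_weights(n_source, n_target):
    """Weights proportional to the reciprocal of each dataset's size, normalized
    so that the target weight is 1.0 (e.g., UCF101 ~9.5k vs. HMDB51 ~3.5k)."""
    w_source = (1.0 / n_source) / (1.0 / n_target)   # = n_target / n_source
    w_target = 1.0
    return w_source, w_target

print(loss_weights(9500, 3500))   # -> (0.368..., 1.0), roughly the 0.375 : 1.0 used above
```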

Another issue is that the detector can fail to detect humans in some cases. For example, in Figure 4.13 the camera looks down from directly above the person. In this situation, the detector is unlikely to detect the human because the person appears as a tiny circle rather than a human body shape. A possible improvement is to include more such cases in the training data.

Figure 4.13: Failure cases of the detector.

In addition, the optical flow of drone videos is extremely difficult to use for training a temporal network. The purpose of the temporal network is to capture human motion in videos; however, the optical flow of drone videos mainly captures the moving background rather than human motion because the drone camera itself is moving. Figure 4.14 illustrates this problem: a person is running in the center of the frame, but the optical flow mainly contains useless background information. The moving camera induces apparent motion of the background, and the motion signal of the human becomes weak. Training a temporal network under this condition is therefore challenging.

Figure 4.14: An example of optical flows from drone videos.
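As a rough illustration of why drone optical flow is dominated by camera motion, the snippet below computes dense Farneback flow between two consecutive frames with OpenCV and reports the median flow magnitude; the file names are placeholders, and this is only a diagnostic sketch, not part of the training pipeline.

```python
import cv2
import numpy as np

# Two consecutive frames from a drone video (placeholder file names).
prev = cv2.cvtColor(cv2.imread('frame_t.jpg'), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread('frame_t1.jpg'), cv2.COLOR_BGR2GRAY)

# Dense optical flow (Farneback): positional args are pyr_scale, levels, winsize,
# iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# When the drone moves, most of this magnitude comes from the background,
# not from the person in the frame.
print('median flow magnitude:', np.median(magnitude))
```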

Chapter 5 Conclusion

In conclusion, we present a new learning framework that improves recognition accuracy for the action recognition problem on drones. The framework has two stages: the detection step and the MTL step. The detection step helps the CNN model focus on human objects, and the MTL step enhances accuracy given limited drone data.

Furthermore, we propose a new human action dataset captured by drones. The dataset has 14 different action categories and is challenging due to small human objects and data scarcity.

In future work, we plan to apply two recently proposed human action datasets to our problem. The first, Kinetics [15], was proposed by Google DeepMind.

This dataset contains 400 action categories. The second dataset is SLAC [43], presented jointly by Facebook Research and MIT. We prefer to use pretrained models since training on these datasets is time-consuming, so we will study them for our task after pretrained models are released. In addition, we want to extend our drone dataset from 14 to 20 actions, including some anomalous human actions. The final goal is to detect anomalous behaviors in real time with drone technology. However, some action samples, such as shooting and stealing, are difficult to collect. To address this issue, we will use virtual-world data instead of real-world data for the detection task and then transfer the model to drones in the real world to detect human actions.

Bibliography

[1] M. Barekatain, M. Martí, H.-F. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger. Okutama-action: An aerial view video dataset for concurrent human action detection. In 1st Joint BMTT-PETS Workshop on Tracking and Surveillance, CVPR, pages 1–8, 2017.

[2] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3034–3042, June 2016.

[3] N. F. Davar, T. de Campos, D. Windridge, J. Kittler, and W. Christmas. Domain adaptation in the context of sport video action recognition. In Domain Adaptation Workshop, in conjunction with NIPS, 2011.

[4] C. R. de Souza, A. Gaidon, Y. Cabon, and A. M. López. Procedural generation of videos to train deep action recognition networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2594–2604, July 2017.

[5] C. R. de Souza, A. Gaidon, E. Vig, and A. M. López. Sympathy for the details: Dense trajectories and hybrid classification architectures for action recognition. In ECCV, pages 697–716. Springer, 2016.

[6] A. Diba, V. Sharma, and L. V. Gool. Deep temporal linear encoding networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1541–1550, July 2017.

[7] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.

[8] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7445–7454, July 2017.

[9] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1933–1941, June 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.

[11] G. Huang, Z. Liu, L. Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, July 2017.

[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[13] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1175–1183, July 2017.

[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, June 2014.

[15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[17] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563, Nov 2011.

[18] Z. Lan, Y. Zhu, A. G. Hauptmann, and S. Newsam. Deep local video feature for action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1219–1225, July 2017.

[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec 1989.

[20] G. Lev, G. Sadeh, B. Klein, and L. Wolf. RNN fisher vectors for action recognition and image annotation. In ECCV, pages 833–850. Springer, 2016.

[21] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli. Attention transfer from web images for video recognition. In ACM MM, pages 1–9. ACM, 2017.

[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016.

[23] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4694–4702, June 2015.

[24] Z. Qiu, T. Yao, and T. Mei. Deep quantization: Encoding convolutional activations with deep generative model. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4085–4094, July 2017.

[25] Z. Shen, Z. Liu, J. Li, Y. G. Jiang, Y. Chen, and X. Xue. DSOD: Learning deeply supervised object detectors from scratch. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1937–1945, Oct 2017.

[26] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[28] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, UCF Center for Research in Computer Vision, 2012.

[29] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, pages 843–852, 2015.

[30] Y.-C. Su, T.-H. Chiu, C.-Y. Yeh, H.-F. Huang, and W. H. Hsu. Transfer learning for video recognition with scarce training data for deep convolutional neural network. arXiv preprint arXiv:1409.4127, 2014.

[31] L. Sun, K. Jia, D. Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4597–4605, Dec 2015.

[32] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.

[33] C. Szegedy, W. Liu, Y.-Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.

[34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, June 2016.

[35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, Dec 2015.

[36] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.

[37] H. Wang, A. Kläser, C. Schmid, and C. L. Liu. Action recognition by dense trajectories. In CVPR 2011, pages 3169–3176, June 2011.

[38] H. Wang and C. Schmid. Action recognition with improved trajectories. In 2013 IEEE International Conference on Computer Vision, pages 3551–3558, Dec 2013.

[39] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4305–4314, June 2015.

[40] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36. Springer, 2016.

[41] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

[42] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.

[43] H. Zhao, Z. Yan, H. Wang, L. Torresani, and A. Torralba. SLAC: A sparsely labeled dataset for action classification and localization. arXiv preprint arXiv:1712.09374, 2017.

[44] F. Zhu and L. Shao. Enhancing action recognition by cross-domain dictionary learning. In BMVC. Citeseer, 2013.

[45] W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao. A key volume mining deep framework for action recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1991–1999, June 2016.
