Results - Training Action Recognition Models with Adversarial Object Synthesis to Mitigate Their Bias

EPIC-KITCHENS. Table 5.1 shows the results of our AdvOST method and the Fallacious Object Reliance (FOR) metrics on three different backbones, i.e., I3D, TSN, and SlowFast. On both the seen and unseen validation sets of EPIC-KITCHENS, all backbones have intermediate or strong FOR scores, indicating that even recent state-of-the-art action models suffer from the FOR issue. Moreover, the FOR scores on the unseen validation set are much higher than those on the seen validation set. This is because the unseen validation set has a substantially different object-action joint distribution, which demonstrates that the FOR metrics we propose can expose and reflect the issue we found.

Table 5.1: Performance on the seen and unseen validation sets of the EPIC-KITCHENS dataset. After applying AdvOST, the baselines achieve higher accuracy and lower Fallacious Object Reliance scores in most cases.

| Split | Method | Top1 Acc.↑ | Top5 Acc.↑ | FORcorr ↓ | FORslope ↓ |
|---|---|---|---|---|---|
| Seen | I3D | 47.59% | 79.70% | 0.5346 | 0.8701 |
| Seen | I3D+AdvOST | 49.21% | 80.58% | 0.5145 | 0.8095 |
| Seen | TSN | 40.68% | 79.64% | 0.4865 | 0.7112 |
| Seen | TSN+AdvOST | 41.02% | 79.64% | 0.4610 | 0.6735 |
| Seen | SlowFast | 56.48% | 82.53% | 0.5984 | 0.9548 |
| Seen | SlowFast+AdvOST | 56.05% | 82.56% | 0.5879 | 1.024 |
| Unseen | I3D | 43.05% | 74.24% | 0.8231 | 1.122 |
| Unseen | I3D+AdvOST | 43.43% | 74.14% | 0.8230 | 1.114 |
| Unseen | TSN | 33.36% | 71.59% | 0.7383 | 0.9422 |
| Unseen | TSN+AdvOST | 34.31% | 72.21% | 0.7674 | 0.9490 |
| Unseen | SlowFast | 49.57% | 77.83% | 0.8747 | 1.193 |
| Unseen | SlowFast+AdvOST | 51.60% | 78.02% | 0.8632 | 1.297 |

After applying AdvOST, Top1 accuracy improves for the three backbones in almost all cases (the only exception is SlowFast on the seen split), showing that our proposed training procedure does enhance the robustness of the backbones.
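For readers unfamiliar with the Top1 and Top5 columns, the following is a minimal sketch of how top-k accuracy is commonly computed from per-class scores. The function name `topk_accuracy` and the toy scores are illustrative assumptions, not taken from the evaluation code behind the tables.

```python
import numpy as np

def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order inside the top k is irrelevant).
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example (made-up scores): 4 clips, 3 action classes.
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
])
labels = np.array([0, 1, 0, 2])
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
```

In this toy case two of the four clips are ranked first correctly, and all four true labels fall within the top two predictions.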

As for the FOR scores, on the seen set most of the backbones have lower FOR scores after using AdvOST. On the unseen set, however, only the two SOTA models, I3D and SlowFast, show a reduced FORcorr (and the FORslope of SlowFast rises), while both FOR scores of TSN increase.

This result may imply that TSN cannot achieve accuracy as high as the other models on the unseen set because it does not learn enough valid object associations.

Moments in Time. As the Moments in Time dataset has a greater variety of objects in its training data, the FOR issue is less severe. Even so, applying AdvOST still improves accuracy, as AdvOST forces the model to learn more about the motion itself while keeping its dependency on objects low.

Table 5.2: Performance on the Moments in Time dataset.

| Method | Top1 Acc.↑ | Top5 Acc.↑ | FORcorr ↓ | FORslope ↓ |
|---|---|---|---|---|
| I3D | 25.60% | 51.38% | -0.2410 | -0.4197 |
| I3D+AdvOST | 27.05% | 53.73% | -0.2262 | -0.3968 |

Table 5.3: Performance on the HMDB51 dataset. Our method improves both the action recognition accuracy and the FOR scores in most cases.

| Split | Method | Top1 Acc.↑ | Top5 Acc.↑ | FORcorr ↓ | FORslope ↓ |
|---|---|---|---|---|---|
| 1 | I3D | 59.80% | 88.69% | 0.7392 | 0.6797 |
| 1 | I3D+AdvOST | 60.26% | 88.23% | 0.7239 | 0.6796 |
| 2 | I3D | 60.84% | 87.40% | 0.8221 | 0.7698 |
| 2 | I3D+AdvOST | 61.11% | 87.64% | 0.7675 | 0.6868 |
| 3 | I3D | 60.91% | 87.45% | 0.8236 | 0.8457 |
| 3 | I3D+AdvOST | 61.50% | 88.16% | 0.7845 | 0.8626 |

HMDB51. We also test AdvOST with an I3D backbone on HMDB51. Table 5.3 shows that, on all three test splits, AdvOST consistently improves Top1 accuracy and lowers FORcorr, while FORslope decreases on two of the three splits.

Table 5.4: Ablation study on the EPIC-KITCHENS dataset with I3D as the backbone.

| Components enabled (✓ marks among Synthesizer / Discriminator / Lflow_overlap) | Top1 Acc.↑ | Top5 Acc.↑ | FORcorr ↓ | FORslope ↓ |
|---|---|---|---|---|
| none | 47.59% | 79.70% | 0.5346 | 0.8701 |
| ✓ ✓ | 48.27% | 79.67% | 0.4823 | 0.8386 |
| ✓ ✓ | 47.35% | 80.01% | 0.5097 | 0.8619 |
| ✓ ✓ ✓ | 49.21% | 80.58% | 0.5145 | 0.8095 |

5.4 Ablation Study

To validate each component of AdvOST, we conducted an ablation study on the EPIC-KITCHENS dataset with I3D as the backbone. From Table 5.4, we observe that all three components are necessary for AdvOST. Without the discriminator, the synthesizer generates unreasonable augmented videos, so the performance drops significantly. Likewise, without the flow overlap loss, it is trivial for the synthesizer to fool the model by blocking the region where the motion takes place.
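To make the role of the flow overlap loss more concrete, the sketch below shows one plausible form such a penalty could take. It is an assumption for illustration only: this section does not restate the definition of Lflow_overlap, and the function name `flow_overlap_penalty`, the normalization, and the toy arrays are all hypothetical. The idea is simply that a pasted object is penalized in proportion to the optical-flow magnitude it covers.

```python
import numpy as np

def flow_overlap_penalty(object_mask, flow):
    """Average normalized flow magnitude covered by the pasted-object mask.

    object_mask: (H, W) array in [0, 1], 1 where the synthesized object is placed.
    flow:        (H, W, 2) array of per-pixel (dx, dy) displacements.
    Hypothetical formulation, not the thesis's definition of Lflow_overlap.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    magnitude = magnitude / (magnitude.max() + 1e-8)  # scale to [0, 1]
    return float((object_mask * magnitude).sum() / (object_mask.sum() + 1e-8))

# Toy frame: motion only in the left half of a 4x4 image.
flow = np.zeros((4, 4, 2))
flow[:, :2, 0] = 1.0

mask_on_motion = np.zeros((4, 4))
mask_on_motion[:, :2] = 1.0   # object pasted over the moving region
mask_off_motion = np.zeros((4, 4))
mask_off_motion[:, 2:] = 1.0  # object pasted over the static region

penalty_on = flow_overlap_penalty(mask_on_motion, flow)
penalty_off = flow_overlap_penalty(mask_off_motion, flow)
```

Under this assumed form, a placement that hides the moving region receives a penalty close to 1, while a placement over static background receives a penalty of 0, which discourages the trivial strategy of occluding the motion.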

Chapter 6 Discussion

6.1 Per-class Improvement and Confusion Matrix

To dig deeper into what AdvOST contributes, we visualize the per-class score improvement and the difference between the confusion matrices of I3D before and after applying AdvOST in Fig. 6.1. The per-class improvement in (a) demonstrates the power of AdvOST, especially for actions that often occur in particular locations or with specific objects, such as "peel", "pour", "dry", and "roll". For these classes, the original I3D may repeatedly spot the surrounding objects and learn to exploit these hints.

The confusion matrix difference before and after applying AdvOST is shown in Fig. 6.1 (b). This figure reveals more details hidden in (a). For example, the left black dashed box shows that the improved F1 score of the action "dry" is due to fewer misclassifications as "put", "open", and "close", actions that can take place in more general scenes. The right black dashed box exposes similar information: after applying AdvOST, the fallacious association between objects and actions is alleviated. Note that we merge the seen and unseen validation sets in the two figures and filter out classes with fewer than 20 examples in the validation set for clearer visualization.
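The analysis above can be reproduced in outline as follows. This is a minimal sketch under stated assumptions: the helper names (`confusion_matrix`, `per_class_f1`, `compare_models`) are hypothetical, the confusion matrix difference is taken over raw counts (the figure may use a normalized form), and the toy labels are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_f1(cm):
    """F1 = 2TP / (2TP + FP + FN) per class; 0 where the denominator is 0."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

def compare_models(y_true, pred_before, pred_after, num_classes, min_support=20):
    """Per-class F1 gain and confusion matrix difference, keeping only
    classes with at least `min_support` validation examples."""
    cm_before = confusion_matrix(y_true, pred_before, num_classes)
    cm_after = confusion_matrix(y_true, pred_after, num_classes)
    keep = cm_before.sum(axis=1) >= min_support
    f1_gain = per_class_f1(cm_after) - per_class_f1(cm_before)
    cm_diff = cm_after - cm_before
    return f1_gain[keep], cm_diff[np.ix_(keep, keep)], np.flatnonzero(keep)

# Toy example: class 0 is confused with class 1 less often after retraining.
y_true = [0, 0, 0, 1, 1, 1]
pred_before = [0, 1, 1, 1, 1, 1]
pred_after = [0, 0, 1, 1, 1, 1]
f1_gain, cm_diff, kept = compare_models(
    y_true, pred_before, pred_after, num_classes=2, min_support=3
)
```

A negative off-diagonal entry in `cm_diff` marks a confusion that became rarer, which is the kind of change highlighted by the dashed boxes in the figure.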

6.2 Grad-CAM Comparison

We show several Grad-CAM visualizations in Fig. 6.2 to compare the internal behavior of pure I3D and I3D with AdvOST. They show that AdvOST is effective in guiding action models to pay more attention to motion.
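As a reminder of what such a visualization computes, here is a minimal NumPy sketch of the Grad-CAM weighting step applied to a spatio-temporal feature map. It assumes that the activations of a chosen convolutional layer and the gradients of the target class score with respect to them have already been extracted; the function name `grad_cam`, the (K, T, H, W) layout, and the toy arrays are assumptions for illustration, not the code used to produce Fig. 6.2.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map for one clip.

    activations: (K, T, H, W) feature maps of the chosen layer.
    gradients:   (K, T, H, W) gradients of the class score w.r.t. those maps.
    Returns a (T, H, W) map scaled to [0, 1].
    """
    # Channel weights: global average of the gradients over time and space.
    weights = gradients.mean(axis=(1, 2, 3))                  # (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.tensordot(weights, activations, axes=(0, 0))     # (T, H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 opposes it.
activations = np.zeros((2, 1, 2, 2))
activations[0, 0, 0, 0] = 1.0
activations[1, 0, 1, 1] = 1.0
gradients = np.ones((2, 1, 2, 2))
gradients[1] = -1.0

cam = grad_cam(activations, gradients)
```

The resulting map is usually upsampled to the input resolution and overlaid on the video frames.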

(a) Per-class Improvement (b) Confusion Matrix Difference

Figure 6.1: (a) The per-class F1 improvement and (b) the changes in the confusion matrix after applying AdvOST to I3D on the EPIC-KITCHENS validation sets. (a) shows that AdvOST helps the classifier improve on most classes, especially actions tied to certain places or particular objects. The changes in the confusion matrix after applying AdvOST (b) show in detail where the improvement comes from. Please refer to Section 6.1 for an in-depth discussion.

(a) Input Video (b) CAM of I3D (c) CAM of I3D with AdvOST

Figure 6.2: Grad-CAM visualizations of pure I3D and I3D with AdvOST. Compared to pure I3D, I3D with AdvOST focuses more on the hands performing the action instead of the subjects of the action. In addition, as the first two rows show, I3D with AdvOST can attend to both hands when they are present, while pure I3D cannot.

Chapter 7 Conclusion

In this paper, we propose the Object Reliance Level and Fallacious Object Reliance (FOR) to measure the erroneous dependency of action recognition models on objects in videos, based on our observations of object bias in the datasets and of the invalid object-dependent behavior of CNN models. Furthermore, we propose a novel model-agnostic approach, Adversarial Object Synthesis Training (AdvOST), which reduces the FOR scores of models by increasing the object diversity of the training dataset with an object synthesizer. Experiments on the EPIC-KITCHENS and HMDB51 datasets suggest that our method can effectively improve the accuracy of SOTA action recognition models, including TSN, I3D, and SlowFast.

