Ablation studies - Cross-Modality Co-Attention for Audio-Visual Event Localization

In this section, we verify the design and the contributions of different types of global information for the decoder LSTM in AVSDN [28]. We also verify our intra- and inter-frame visual encoders when combined with existing co-attention mechanisms [40, 24, 2, 34]. These experiments support the learning and exploitation of intra- and inter-frame visual representations for audio-visual event localization.

Global representation in AVSDN. In Table 4.5, the first two rows show the results when the initial state is taken only from the visual or the audio content, respectively. Further, we want to explore whether the encoder LSTM can learn global event information. To this end, the last hidden states of the visual and audio modalities are each guided by video-level labels through a simple multilayer perceptron (MLP). The third row in Table 4.5 shows that the result improves compared with using only one modality. However, extra loss functions for guiding the last hidden states are not needed: the last hidden states are already well learned during the training of our AVSDN [28].
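The guidance of the last hidden states by video-level labels can be sketched as an auxiliary classification loss. The snippet below is only an illustration under stated assumptions: the one-hidden-layer ReLU MLP, the cross-entropy loss, all layer sizes, the number of classes, and every function name are hypothetical and are not taken from AVSDN [28]; random vectors stand in for the encoder outputs.

```python
import numpy as np

def mlp_logits(h, W1, b1, W2, b2):
    """One-hidden-layer ReLU MLP mapping a last hidden state to class logits."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
hidden_dim, mlp_dim, num_classes = 128, 64, 5   # hypothetical sizes

# Random stand-ins for the last hidden states of the visual and audio encoder LSTMs.
h_visual = rng.standard_normal(hidden_dim)
h_audio = rng.standard_normal(hidden_dim)
video_label = 2   # hypothetical video-level label Y

def init_mlp():
    """Small random weights for one modality-specific MLP (assumed design)."""
    return (rng.standard_normal((hidden_dim, mlp_dim)) * 0.1, np.zeros(mlp_dim),
            rng.standard_normal((mlp_dim, num_classes)) * 0.1, np.zeros(num_classes))

visual_mlp, audio_mlp = init_mlp(), init_mlp()

# One auxiliary term per modality, each supervised by the same video-level label.
aux_loss = (cross_entropy(mlp_logits(h_visual, *visual_mlp), video_label)
            + cross_entropy(mlp_logits(h_audio, *audio_mlp), video_label))
print(f"auxiliary video-level loss: {aux_loss:.4f}")
```

In such a setup, the auxiliary term would be added to the segment-level localization loss during training; the ablation above indicates that this extra term is not required.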

Table 4.6: Comparisons of recent audio-visual co-attention mechanisms [40, 24, 2, 34] with/without integrating our intra-frame visual encoder (Intra-V) in the fully supervised setting (i.e., all ground truth $y_t$ observed during training). The numbers in bold indicate the best results (i.e., with our cross-modality co-attention).

Table 4.7: Comparisons of recent audio-visual co-attention mechanisms [40, 24, 2, 34] with/without our intra-frame visual encoder (Intra-V) in the weakly supervised setting (i.e., only ground truth $Y$ observed during training). The numbers in bold indicate the best results (i.e., with our cross-modality co-attention).

Intra-frame visual representation. We note that existing co-attention mechanisms typically operate on each single frame when associating visual and audio information. Recall that $v \in \mathbb{R}^{49 \times 128}$ denotes the visual feature, $v_r \in \mathbb{R}^{1 \times 128}$ denotes the visual feature of the $r$-th region, and $a \in \mathbb{R}^{1 \times 128}$ denotes the audio feature (for simplicity, we omit the superscript $t$ that indicates the time step). One way to compute the attention map is to directly measure the attention score through inner products between $v_r$ and $a$ for every $r$, which produces an attention map $M \in \mathbb{R}^{1 \times 49}$. This type of attention can be interpreted as calculating the cosine similarity between the visual and audio features for each video frame (exactly so when the features are L2-normalized), while taking their similarities as the attention weights.

The other way to generate the attention map is to add $a$ to each $v_r$ and feed the summed features to an MLP, which then outputs an attention map $M \in \mathbb{R}^{1 \times 49}$. Both co-attention methods can work jointly with our intra-frame visual encoding component, as long as $v$ is replaced with the intra-frame visual feature. In Tables 4.6 and 4.7, we compare these two types of co-attention mechanisms (denoted as Dot and Add, respectively) with our full model, with and without the intra-frame visual features (denoted as Intra-V). More specifically, we apply three types of co-attention (Dot, Add, CM-CoAtt) along with a baseline model without any co-attention mechanism. For each method of interest, two types of classifiers (AVEL, AVSDN) are deployed.
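The two baseline co-attention maps described above (Dot and Add) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation of [40, 24, 2, 34]: the softmax normalization of the scores, the one-hidden-layer tanh MLP, the hidden size, and all function names are assumptions; only the shapes of $v$, $a$, and $M$ come from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_co_attention(v, a):
    """Inner product between every region feature v_r and the audio feature a.

    v: (49, 128) region features, a: (1, 128) audio feature -> M: (1, 49).
    Normalizing the scores with a softmax is an assumption of this sketch.
    """
    return softmax(a @ v.T)

def add_co_attention(v, a, W1, b1, w2):
    """Add a to every v_r and score the summed features with a small MLP.

    The one-hidden-layer tanh MLP is an assumed architecture.
    """
    hidden = np.tanh((v + a) @ W1 + b1)   # (49, hidden_dim); a broadcasts over regions
    return softmax((hidden @ w2).T)       # (49, 1) -> (1, 49)

rng = np.random.default_rng(0)
num_regions, feat_dim, hidden_dim = 49, 128, 64   # hidden_dim is hypothetical

v = rng.standard_normal((num_regions, feat_dim))
a = rng.standard_normal((1, feat_dim))
W1 = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((hidden_dim, 1)) * 0.1

M_dot = dot_co_attention(v, a)
M_add = add_co_attention(v, a, W1, b1, w2)
attended = M_dot @ v   # (1, 128) attention-weighted visual feature

print(M_dot.shape, M_add.shape, attended.shape)
```

Because both functions only require a region-feature matrix of shape (49, 128), passing the intra-frame visual feature in place of `v` leaves them unchanged, which is what allows either mechanism to be combined with Intra-V.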

From the results listed in Table 4.6 (for the supervised setting) and Table 4.7 (for the weakly supervised setting), we can see that exploiting intra-frame visual features for encoding and observing local image regions is preferable. The main reason is that dot-based co-attention directly computes the attention scores between the visual regions and the audio feature, which implies that the attention scores are determined solely by the semantic relation between the audio feature and the image features. Our intra-frame visual representation additionally exploits local image regions to observe local and consecutive semantic information, and thus carries more information into the attention process. We note that intra-frame visual features with AVSDN resulted in slightly degraded performance. This is probably because it is generally more difficult to train additional network models with more parameters, such as LSTMs, for calculating attention scores. Nevertheless, the above results confirm that exploiting intra-frame visual features for encoding and observing local image regions is preferable.

Table 4.8: Ablation studies on exploiting inter-frame visual information (Inter-V) in different temporal relational modules. Note that the fully supervised setting is considered in this table, and the numbers in bold indicate the best performances.

Inter-frame visual representation. To study the effects of learning inter-frame visual representations for cross-modality co-attention, we consider different methods for modeling such inter-frame visual features. To model visual representations across frames, we utilize 3D convolutional networks [41] (Conv3D) and LSTM networks [20] in our work. We note that standard convolutional neural networks [36] and the recent I3D network [7], both of which are based on consecutive video frames and optical flow, are also able to perform such modeling. In this ablation study, for fair comparison, we only consider Conv3D and LSTM, which do not require the calculation of optical flow. With Conv3D, the inter-frame visual features can be modeled directly. An LSTM, however, only receives a sequence of 1D embeddings over time. Thus, we take the features at the same spatial location in every video frame as a sequence of 1D embedding vectors, and apply the LSTM to model temporal features until every location has been processed across frames.
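The per-location LSTM baseline can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the code used in the experiments: the number of frames, the hidden size, the sharing of LSTM weights across locations, and the function names are all hypothetical; only the 49 locations and 128-dimensional region features follow the notation used earlier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_per_location(frames, W, U, b):
    """Run one LSTM over time independently at every spatial location.

    frames: (T, R, D) region features for T frames and R locations.
    W: (D, 4H), U: (H, 4H), b: (4H,) are LSTM weights shared by all locations
    (weight sharing is an assumption of this sketch).
    Returns the hidden states, shape (T, R, H).
    """
    T, R, _ = frames.shape
    H = U.shape[0]
    h = np.zeros((R, H))
    c = np.zeros((R, H))
    outputs = np.empty((T, R, H))
    for t in range(T):
        # Each of the R locations is its own 1D embedding sequence; rows never mix.
        gates = frames[t] @ W + h @ U + b          # (R, 4H)
        i, f, g, o = np.split(gates, 4, axis=1)    # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs[t] = h
    return outputs

rng = np.random.default_rng(0)
T, R, D, H = 10, 49, 128, 32   # T and H are hypothetical; R and D follow the text
frames = rng.standard_normal((T, R, D))
W = rng.standard_normal((D, 4 * H)) * 0.1
U = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)

inter_v = lstm_per_location(frames, W, U, b)
print(inter_v.shape)
```

Treating the location axis as a batch dimension processes all locations in one pass while keeping them independent, which is equivalent to looping over the locations one at a time.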

We note that the visual features derived from Conv3D and LSTM can be utilized in current co-attention methods [40, 24, 34, 2]. There are two typical co-attention mechanisms: add and dot co-attention. Therefore, we not only present different methods for encoding inter-frame visual features but also test them with both co-attention methods.

As shown in Table 4.8, our cross-modality co-attention performs favorably against other models with inter-frame visual encoding. In this table, the suffix of each temporal mechanism is the co-attention method (e.g., add [40, 24] and dot [34, 2]). It is also worth noting that our method performs favorably against the different co-attention mechanisms. Another advantage of our approach is that our inter-frame visual features are calculated by MLPs, whose computation cost is lower than that of models using Conv3D and LSTM. Based on the above results and observations, we can also confirm that learning inter-frame visual features is preferable in our cross-modality co-attention model, and results in satisfactory event localization performance.

Chapter 5 Conclusion

In this work, we presented the Audio-Visual sequence-to-sequence dual network (AVSDN) for video event localization, which can be learned in a fully or weakly supervised fashion. Our network takes both audio and visual local features, together with an integrated global representation, to perform event localization in a sequence-to-sequence manner.

In addition, we presented a deep learning framework for cross-modality co-attention that can be applied to existing methods and to our AVSDN [28], with the goal of addressing audio-visual event localization in fully or weakly supervised learning settings. Our model jointly exploits intra- and inter-frame visual representations while observing audio features. Together with a self-attention-based mechanism, co-attention across the above feature modalities can be performed. Beyond promising performance on event localization, our model also allows instance-level attention, which attends to the proper image region (at the instance level) associated with the sound/event of interest.

From our experimental results and ablation studies, the use and design of our proposed framework can be successfully verified.

Bibliography

[1] R. Arandjelović and A. Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[2] R. Arandjelović and A. Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[3] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NIPS), 2016.

[4] Y. Bai, J. Fu, T. Zhao, and T. Mei. Deep attention neural tensor network for visual question answering. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[5] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi. Action recognition with dynamic image networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.

[6] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3034–3042, 2016.

[7] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.

[8] L. Chen, S. Srivastava, Z. Duan, and C. Xu. Deep cross-modal audio-visual generation. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pages 349–357. ACM, 2017.

[9] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[10] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[12] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.

[13] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 2018.

[14] R. Gao, R. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[15] R. Gao and K. Grauman. 2.5d-visual-sound. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[16] J. F. Gemmeke, D. P. W. Ellis, et al. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[17] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. Lstm: A search space odyssey. CoRR, abs/1503.04069, 2015.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[19] S. Hershey, S. Chaudhuri, et al. Cnn architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[21] D. Hu, X. Li, et al. Temporal multimodal learning in audiovisual speech recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[22] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[23] D. Kiela, E. Grave, A. Joulin, and T. Mikolov. Efficient large-scale multi-modal classification. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.

[24] J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim, J. Ha, and B. Zhang. Multimodal residual learning for visual QA. In Advances in Neural Information Processing Systems (NIPS), 2016.

[25] K. Kim, S. Choi, J. Kim, and B. Zhang. Multimodal dual attention memory for video story question answering. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[26] G. Lev, G. Sadeh, B. Klein, and L. Wolf. RNN fisher vectors for action recognition and image annotation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 833–850, 2016.

[27] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. M. Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding (CVIU).

[28] Y.-B. Lin, Y.-J. Li, and Y.-C. F. Wang. Dual-modality seq2seq network for audio-visual event localization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[29] D.-K. Nguyen and T. Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[30] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multi-sensory features. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[31] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[32] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.

[33] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (NIPS), 2017.

[34] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[35] Y. Shi, T. Furlanello, S. Zha, and A. Anandkumar. Question type guided attention in visual question answering. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[36] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), pages 568–576. Curran Associates, Inc., 2014.

[37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[38] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014.

[39] Y. Tian, C. Guan, J. Goodman, M. Moore, and C. Xu. An attempt towards interpretable audio-visual video captioning. CoRR, abs/1812.02872, 2018.

[40] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[41] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: generic features for video analysis. CoRR, abs/1412.0767, 2014.

[42] D. Tran, J. Ray, Z. Shou, S. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning. CoRR, abs/1708.05038, 2017.

[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), 2017.

[44] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.

[45] O. Wiles, A. S. Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[46] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[47] X. Yang, P. Ramesh, R. Chitta, S. Madhvanath, E. A. Bernal, and J. Luo. Deep multimodal representation learning from temporal data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[48] D. Yu, J. Fu, T. Mei, and Y. Rui. Multi-level attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[49] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[50] B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[51] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.

[52] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Visual to sound: Generating natural sound for videos in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[53] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2923–2932, 2017.

[54] M. Zolfaghari, K. Singh, and T. Brox. ECO: efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
