Conclusion - Content Analysis and Classification of Lecture Videos Based on Visual and Auditory Cues

5.1 Conclusion

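This study proposes a method that combines image and speech analysis of lecture videos. It extracts features such as the lecturer's posture and the semantics of the lecturer's speech, and uses them to analyze and extract the key teaching points and other information.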

The workflow of this study can be divided into the following steps:

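1. Lecturer posture analysis: locate the lecturer's head and hand regions, record the related features, and build a decision tree that classifies them with an SVM-based approach.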

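2. Lecturer speech analysis: first use basic acoustic features to classify the emotion of the speech, then use the speech emotion to analyze the lecturer's teaching state.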

3. Integrate the two analyses above into a learning-attention model.
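The integration step can be illustrated with a small late-fusion sketch. This is not the model used in this study: the `Segment` fields, the weighted-sum rule, the `speech_weight` default, and the idea of scaling the speech term by the emotion recognizer's confidence are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    # All three fields are hypothetical inputs in the range [0, 1].
    gesture_score: float       # how strongly the posture classifier flags emphasis
    emotion_score: float       # how strongly the speech emotion suggests emphasis
    emotion_confidence: float  # confidence of the speech emotion classifier


def attention_score(seg: Segment, speech_weight: float = 0.5) -> float:
    """Weighted-sum fusion (an assumed rule, not the thesis's model).

    The speech term is scaled by the recognizer's confidence, so an
    unreliable emotion label contributes less to the final score.
    """
    w = speech_weight * seg.emotion_confidence
    return (1.0 - w) * seg.gesture_score + w * seg.emotion_score


def rank_segments(segments: List[Segment]) -> List[int]:
    """Return segment indices ordered from highest to lowest attention score."""
    return sorted(
        range(len(segments)),
        key=lambda i: attention_score(segments[i]),
        reverse=True,
    )
```

Scaling the speech weight by confidence is one simple way to keep a weak emotion recognizer from dragging down the combined score; with `emotion_confidence` at 0 the score falls back to the posture result alone.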

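The experimental results show that this study performs well on the analysis of the lecturer's body gestures. The contribution of speech analysis depends on the quality of emotion recognition: when speech emotion recognition is accurate, the accuracy of the overall system improves to some degree, whereas when the recognition rate is poor, the reliability of the system may decline.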

5.2 Contributions

The main contributions of this study are as follows:

1. We construct a method, based on human attention, for structuring blackboard lecture videos. Because blackboard video frames ...

Improvements to the camera recording method:

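(1) To detect the lecturer's body, conventional methods mount the camera at a fixed distance, and the lens cannot zoom in or out. As a result, the key regions the lecturer is explaining cannot be enlarged for learners, which also lowers learning efficiency. Whether such zooming can be offered in the future without impairing lecturer detection remains to be explored.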

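(2) The current recording method cannot provide a panoramic view of the blackboard. It only follows the lecturer's movement, so some important regions may fall outside the frame. Whether multiple cameras can be added in the future to provide more diverse services remains to be explored.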

