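5.1 Conclusion
This study proposes a method for lecture videos that combines image analysis with speech analysis. It extracts features such as the lecturer's posture and the semantics of the lecturer's voice, then analyzes them to identify the key teaching points and other information.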
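The procedure can be divided into the following steps:
1. Lecturer posture analysis: locate the lecturer's head and hand regions, record the relevant features, construct a decision tree, and classify with an SVM-based approach.
2. Lecturer speech analysis: first classify speech emotion from basic acoustic features, then use the speech emotion to analyze the lecturer's teaching state.
3. Integrate the two results above into a model of learning attention.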
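The integration in step 3 can be illustrated with a minimal sketch. It is not the fusion model used in this study. It assumes each video segment already carries a posture score and a speech-emotion score in [0, 1], plus a confidence value for the emotion classifier. The names `Segment`, `attention_score`, `key_segments`, the weight `speech_weight`, and the threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float        # segment start time in seconds
    posture_score: float  # 0..1, assumed output of the posture classifier
    emotion_score: float  # 0..1, assumed emphasis level from speech emotion
    emotion_conf: float   # 0..1, assumed confidence of the emotion classifier


def attention_score(seg, speech_weight=0.5):
    """Blend posture and speech cues into one score.

    The speech cue is discounted by the emotion classifier's confidence,
    so an unreliable emotion result contributes less (an assumption of
    this sketch, not a rule stated in the study).
    """
    w = speech_weight * seg.emotion_conf
    return (1.0 - w) * seg.posture_score + w * seg.emotion_score


def key_segments(segments, threshold=0.6):
    """Return the start times of segments whose blended score reaches the threshold."""
    return [s.start_s for s in segments if attention_score(s) >= threshold]


# Made-up example values, for illustration only.
segments = [
    Segment(0.0, 0.9, 0.8, 0.9),   # strong posture cue, confident speech cue
    Segment(10.0, 0.3, 0.9, 0.2),  # weak posture cue, unreliable speech cue
    Segment(20.0, 0.5, 0.9, 1.0),  # moderate posture cue, confident speech cue
]

print(key_segments(segments))  # [0.0, 20.0]
```

Scaling the speech weight by the classifier's confidence is one simple way to keep a poorly recognized emotion from dominating the combined score. Other fusion schemes would fit the same interface.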
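The experimental results show that the method performs well on the analysis of the lecturer's body posture. The benefit of speech analysis depends on the quality of emotion recognition. When speech emotion recognition is accurate, it improves the accuracy of the overall system to some degree. When the recognition rate is poor, it may reduce the system's reliability.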
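5.2 Contributions
The main contributions of this study are as follows:
1. A method, based on human attention, for structuring blackboard lecture videos. Because the blackboard video ...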
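Improvements to the camera recording method:
(1) To detect the lecturer's body, conventional methods mount the camera at a fixed distance and cannot zoom in or out. As a result, the key region the lecturer is explaining cannot be enlarged for the learner, which also lowers learning efficiency. Future work could examine whether this capability can be provided without impairing lecturer detection.
(2) The current recording method cannot give the viewer a full view of the blackboard. It only follows the lecturer as the lecturer moves, so some important regions may fall outside the frame. Future work could examine whether adding multiple cameras would allow a wider range of services.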