Conclusion and Future Work - An Emotion-Recognition-Based Music Player System for Companion Robots

An emotional-speech-based music player has been proposed and implemented on an embedded system platform targeted at personal robotics. To allow the system to select a song automatically based on the user's emotional state, a method was developed to map an input speech utterance onto a two-dimensional emotional plane of valence and arousal. Using a reference database of songs, whose arousal and valence values were manually annotated by several users, the system finds the song that best matches the detected location on the emotional plane. Furthermore, speech features were selected for low-complexity implementation, and the results show that they can be used to detect emotional content in speech. A neural network architecture was designed to map speech to arousal and valence values. Performance was tested using three different emotional speech databases. Three off-line tests, an online test, and an evaluation survey demonstrated the feasibility of the proposed system. The obtained arousal and valence values were converted to emotional categories so that the system's performance could be compared with other works. The tests show an overall recognition rate of 59.24%, a good result compared with the 73.5% and 49.12% reported in [39] and [40], respectively. A questionnaire survey further shows that 80% of the subjects somewhat or totally agree with the songs selected by the proposed cheer-up strategy based on the emotional model.
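The song-selection and category-conversion steps summarized above can be sketched as follows. This is a minimal illustration, not the implemented system: the song titles, their annotated values, the [-1, 1] range of both axes, the use of Euclidean distance as the matching criterion, and the four quadrant labels are all assumptions introduced here.

```python
import math

# Hypothetical reference database: song title -> (valence, arousal).
# Titles and values are illustrative; both axes are assumed to span [-1, 1].
SONGS = {
    "song_a": (0.8, 0.7),
    "song_b": (-0.6, 0.5),
    "song_c": (-0.7, -0.6),
    "song_d": (0.6, -0.5),
}


def nearest_song(valence, arousal, songs=SONGS):
    """Return the title whose annotated point is closest to the input point.

    Euclidean distance is an assumed matching criterion.
    """
    point = (valence, arousal)
    return min(songs, key=lambda title: math.dist(songs[title], point))


def quadrant_category(valence, arousal):
    """Convert a point on the plane to a coarse category by quadrant.

    The four labels are assumed for illustration only.
    """
    if arousal >= 0:
        return "happy" if valence >= 0 else "angry"
    return "calm" if valence >= 0 else "sad"


if __name__ == "__main__":
    v, a = -0.5, -0.5  # example output of a speech-to-plane mapping
    print(quadrant_category(v, a), nearest_song(v, a))
```

In this sketch the detected point is matched directly to the closest song. A strategy that steers the listener toward a different region of the plane would shift the target point before the lookup.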

The results of the present work show that several aspects of the system can be improved further, both to increase the accuracy of emotional mapping and to make the implementation suitable for a practical system such as a pet robot:

• A more powerful embedded platform would allow more demanding algorithms to be used, improving system performance and, in turn, user-robot interaction.

• Additional sensors such as a video camera would enable image-based emotion recognition, providing another means of detecting emotion and allowing better music recommendation even when the user is not speaking.

• A microphone with better sensitivity and noise rejection could be included so that the user does not have to be very close to the device when speaking.

• To improve music recommendation, more songs can be added to the current music database; furthermore, music emotion recognition technology can be added so that users can load their own personalized music databases.

• Adding more emotion-related features and using a different neural network could improve mapping onto the emotional plane and robustness in speaker-independent mode.

• A better data set containing far more speech utterances, and possibly natural language, could improve speaker-independent recognition. Manual annotation of arousal and valence values by many speakers could also improve performance, since every utterance used for training would have a better emotional representation in the dimensional plane.

References

Conference on pattern recognition, China, November 2009, pp. 1-5.

[3] D. Ververidis and C. Kotropoulos, “Emotional Speech Recognition: Resources, Features, and Methods,” Speech Communication, vol. 48, no. 9, pp. 1162-1181, September 2006.

[4] Z. Xiao, Recognition of Emotions in Audio Signals, Ph.D. Dissertation, Ecole Doctorale Informatique et Information pour la Société, Lyon, France, 2008. Available: http://www.google.com [Accessed Apr. 2011].

[5] M. Han, J. Hsu, K. T. Song and F. Chang, “A New Information Fusion Method for Bimodal Robotic Emotion Recognition,” in Proc. of IEEE International Conference on Systems, Man and Cybernetics, Montreal, Quebec, Canada, October 2007, pp. 2656-2661.

[6] K. Jang and O. Kwon, “Speech Emotion Recognition for Affective Human-Robot Interaction,” in Proc. of International Conference on Speech and Computer, St. Petersburg, Russia, June 2006, pp. 25-29.

[7] B. Schuller, G. Rigoll, S. Can and H. Feussner, “Emotion Sensitive Speech Control for Human-Robot Interaction in Minimal Invasive Surgery,” in Proc. of IEEE International Symposium on Robot and Human Interactive Communication, Munich, Germany, August 2008, pp. 453-458.

[8] S. Dornbush, K. Fisher, K. McKay, A. Prikhodko and Z. Segall, “XPOD – A Human Activity and Emotion Aware Mobile Music Player,” in Proc. of International Conference on Mobile Technology, Applications and Systems, Guangzhou, China, 2009, pp. 1-6.

[9] Y. H. Yang, Y. C. Lin, H. T. Cheng and H. Chen, “Mr. Emo: Music Retrieval in the Emotion Plane,” in Proc. of ACM International Conference on Multimedia, Vancouver, Canada, 2008, pp. 1003-1004.

[10] V. A. Petrushin, “Emotion in Speech: Recognition and Application to Call Centers,” in Proc. of International Conference on Artificial Neural Networks, Edinburgh, England, 1999, pp. 7-10.

[11] C. M. Thibeault, O. Sessions, P. H. Goodman and F. C. Harris Jr., “Real-Time Emotional Speech Processing for Neurobotics Applications,” in Proc. of International Conference on Computer Applications in Industry and Engineering, Las Vegas, NV, 2010, pp. 239-244.

[12] N. Sebe, I. Cohen and T. S. Huang, “Multimodal Emotion Recognition,” in Handbook of Pattern Recognition and Computer Vision, World Scientific, 2005, pp. 1-23.

[13] (2011, May 5). BeagleBoard-XM [Online]. Available: http://www.beagleboard.org

[14] I. R. Murray and J. L. Arnott, “Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion,” Journal of the Acoustical Society of America, vol. 93, no. 2, pp. 1097-1108, February 1993.

[15] P. Oudeyer, “The Production and Recognition of Emotions in Speech: Features and Algorithms,” International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 157-183, special issue on Affective Computing, 2003.

[16] A. Iliev, Emotion Recognition Using Glottal and Prosodic Features, Ph.D. Thesis, University of Miami, Florida, 2009.

[17] (2011, September 20). Acoustic Phonetics [Online]. Available: http://www.kfs.oeaw.ac.at/content/blogsection/26/396/

[18] R. Jang. (2011, June 10). Audio Signal Processing and Recognition [Online]. Available: http://neural.cs.nthu.edu.tw/jang/books/audioSignalProcessing/

[19] G. Saha, “A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications,” in Proc. of National Conference on Communications, India, 2005, pp. 291-295.

[20] X. Huang, A. Acero and H. Hon, Spoken Language Processing. New Jersey: Prentice Hall, 2001.

[21] M. Mansoorizadeh and N. M. Charkari, “Speech Emotion Recognition: Comparison of Speech Segmentation Approaches,” in Proc. of International Conference on Knowledge Technology, Mashhad, Iran, 2007, pp. 133-136.

[22] T. Iliou and C. N. Anagnostopoulos, “Classification on Speech Emotion Recognition – a Comparative Study,” Journal on Advances in Life Sciences, vol. 2, no. 1-2, pp. 18-28, 2010.

[23] J. Sidorova, Speech Emotion Recognition, Ph.D. Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2007.

[24] D. Gharavian, M. Sheikhan and M. Jainipour, “Pitch in Emotional Speech and Emotional Speech Recognition Using Pitch Frequency,” Majlesi Journal of Electrical Engineering, vol. 4, no. 1, pp. 18-28, 2010.

[25] C. Lee, 2011 Short Course on Digital Speech Processing and Applications, unpublished, June 2011.

[26] D. Morrison, R. Wang and L. C. De Silva, “Spoken Affect Classification Using Neural Networks,” in Proc. of IEEE International Conference on Granular Computing, Beijing, China, July 2005, pp. 583-586.

[27] C. Breazeal and L. Aryananda, “Recognition of Affective Communicative Intent in Robot-Directed Speech,” Autonomous Robots, vol. 12, pp. 83-104, 2002.

[28] Ch. Kim, K. D. Seo and W. Sung, “A Robust Formant Extraction Algorithm Combining Spectral Peak Picking and Root Polishing,” EURASIP Journal on Applied Signal Processing, vol. 2006, pp. 1-16, 2006.

[29] E. Bozkurt, E. Erzin, Ç. E. Erdem and A. T. Erdem, “Formant Position Based Weighted Spectral Features for Emotion Recognition,” Speech Communication, vol. 53, pp. 1186-1197, 2011.

[30] R. Jang. (2011, November 10). Preprocessing Data for Neural Networks [Online]. Available: http://www.tradertech.com/preprocessing_data.asp

[31] Y. Yoshitomi, “Effect of Sensor Fusion for Recognition of Emotional States Using Voice, Face Image and Thermal Image of Face,” in Proc. of International Workshop on Robot and Human Interactive Communication, Osaka, Japan, 2000, pp. 178-183.

[32] R. E. Thayer, The Biopsychology of Mood and Arousal. New York: Oxford University Press, 1989.

[33] J. Heaton, Introduction to Neural Networks with JAVA. Missouri: Heaton Research, 2008.

[34] Y. H. Yang, Y. F. Su, Y. Ch. Lin and H. Chen, “Music Emotion Recognition: The Role of Individuality,” in Proc. of the International Workshop on Human-Centered Multimedia, Augsburg, Bavaria, Germany, 2007, pp. 13-22.

[35] (2011, May 10). Android Developers [Online]. Available: http://developer.android.com/guide/basics/what-is-android.html

[36] P. Reddy, “Gender Based Emotion Recognition System for Telegu Rural Dialects Using Hidden Markov Models,” Journal of Computing, vol. 2, no. 6, pp. 94-98, 2010.

[37] S. Wu, T. H. Falk and W. Y. Chan, “Automatic Recognition of Speech Emotion Using Long-Term Spectro-Temporal Features,” in Proc. of the International Conference on Digital Signal Processing, Santorini, Greece, 2009, pp. 1-6.

[38] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier and B. Weiss, “A Database of German Emotional Speech,” in Proc. of the International Speech Communication Association, Lisbon, Portugal, 2005, pp. 1517-1520.

[39] B. Yang and M. Lugger, “Emotion Recognition From Speech Signals Using New Harmony Features,” Signal Processing, vol. 90, no. 5, pp. 1415-1423, 2010.

[40] Y. Huang, G. Zhang, X. Li and F. Da, “Improved Emotion Recognition with Novel Global Utterance Level Features,” Applied Mathematics & Information Sciences, vol. 5, no. 2, pp. 147-153, 2011.
