
國立交通大學

電控工程研究所

博士論文

機器人情感模型及情感辨識設計

Design of Robotic Emotion Model and

Human Emotion Recognition

研 究 生:韓孟儒

指導教授:宋開泰 博士


機器人情感模型及情感辨識設計

Design of Robotic Emotion Model and

Human Emotion Recognition

研 究 生:韓孟儒 Student: Meng-Ju Han

指導教授:宋開泰 博士 Advisor: Dr. Kai-Tai Song

國 立 交 通 大 學

電 控 工 程 研 究 所

博 士 論 文

A Dissertation

Submitted to Institute of Electrical Control Engineering College of Electrical and Computer Engineering

National Chiao Tung University in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in

Electrical Control Engineering January 2013

Hsinchu, Taiwan, Republic of China


機器人情感模型及情感辨識設計

學生:韓孟儒

指導教授:宋開泰 博士

國立交通大學電控工程研究所

摘要

本論文之主旨在研究機器人情感模型(emotion model)及其互動設計,文中提出一種 擬人化之心情變遷(mood transition)設計方法,提高機器人與人類作自主情感互動之能 力。為使機器人能產生富有類人情感表達之互動行為,本論文提出一個二維的情感模 型,同時考慮機器人之情感(emotion)、心情(mood)與人格特性(personality)等狀態,使產 生擬人化情感反應。在本設計中,機器人之人格特性模型建立係參考心理學家所提出之 五大性格特質(Big Five factor)來達成,而機器人之心情變遷所造成之影響,則可藉由此 五大人格特質參數來決定。

為能經由連續的互動行為來呈現機器人自主情感狀態,本論文亦提出一種可融合基 本情緒行為之方法,來建立不同心情狀態下之行為表達方式。根據上述之心理學研究成 果,本研究以模糊 Kohonen 群集網路(fuzzy Kohonen clustering networks)之方法,將人格 特性、心情與情緒行為三者整合成一情感模型,使之能具體實現於機器人上。與其他研 究相比,具有客觀之學理依據,而非憑藉研究人員本身主觀經驗來做假設。

在情感辨識方面,本論文提出結合影像與聲音之雙模情緒辨識以及語音情緒辨識等 二種方法,使機器人可辨識使用者之情緒狀態。在雙模情緒辨識之設計中,論文中提出 基於支持向量機(support vector machine)之分類特性與機率策略,用以決定二種特徵資料


之融合權重(fusing weights)。融合權重係根據待測資料與切割平面之距離,以及學習樣 本之標準差所決定。而在分類階段,融合權重較高之特徵所辨識之結果,將成為最後系 統辨識之結果。此外,在語音情緒辨識之設計中,本論文提出採用聲音訊號進行處理與 分類。首先,在預處理時先將語音訊號進行端點偵測(end-point detection)以取得音框所 在位置,而後再以統計方式將能量計算成特徵之型態,並以費雪線性辨別分析法(Fisher's linear discriminant analysis)來增強辨識率。

本論文寫作基於 DSP 之影像、語音處理系統驗證所發展的辨識方法,並整合至機 器人上展示與人情感互動的功能。為了評估所開發之情感模型,文中並建立一人臉模擬 器展示情緒表情之變化。為了解所提方法對於使用者之感受,本研究透過觀察人臉模擬 器對使用者之情感表達狀況,以問卷調查方式來作評估。評估結果顯示,受訪者之感受 與原設計目標相符。


Design of Robotic Emotion Model and

Human Emotion Recognition

Student: Meng-Ju Han

Advisor: Dr. Kai-Tai Song

Institute of Electrical Control Engineering

National Chiao Tung University

ABSTRACT

This thesis aims to develop a robotic emotion model and mood transition method for autonomous emotional interaction with humans. A two-dimensional (2-D) emotional model is proposed to combine robotic emotion, mood and personality in order to generate emotional behaviors. In this design, the robot personality is programmed by adjusting the Big Five factors adopted from psychology. Using the Big Five personality traits, the influence factors of robot mood transition are analyzed.

A method to fuse basic robotic emotional behaviors is proposed in this work in order to manifest robotic emotional states via continuous facial expressions. Drawing on published psychological results, we developed the relationship between personality and mood transition for robotic emotion generation. Based on these relationships, personality, mood transition and emotional behaviors have been integrated into the robotic emotion model. Compared with existing models, the proposed method has the merit of a theoretical basis to support the human-robot interaction design.

In order to recognize the user’s emotional state, both bimodal emotion recognition and speech-signal-based emotion recognition methods are studied. In the design of the bimodal


emotion recognition system, a novel probabilistic strategy has been proposed for a classification design to determine statistically suitable fusing weights for two feature modalities. The fusion weights are determined from the distance between the test data and the classification hyperplane, together with the standard deviation of the training samples. In the subsequent bimodal SVM classification, the recognition result of the modality with the higher weight is selected.

The proposed speech-signal-based emotion recognition method uses voice-signal processing and classification. Firstly, end-point detection and frame setting are accomplished in the pre-processing stage. Then, the statistical features of the energy contour are computed. Fisher's linear discriminant analysis (FLDA) is used to enhance the recognition rate.

In this thesis, the proposed emotion recognition methods have been implemented on a DSP-based system in order to demonstrate the functionality of human-robot interaction. We have realized an artificial face simulator to show the effectiveness of the proposed methods. Questionnaire surveys have been carried out to evaluate the effectiveness of the proposed emotional model by observing robotic responses to user’s emotional expressions. Evaluation results show that the feelings of the testers coincide with the original design.


誌謝

一路走來,由衷感謝我的指導教授宋開泰博士,感謝他多年來在我多次感到氣餒之 時不斷給予鼓勵,使我終能順利到站,展開人生新的旅途;另一方面,在專業及論文寫 作上的指導,不厭其煩的給我建議及修正方向,使我受益良多,也讓本論文得以順利完 成。亦感謝論文口試委員-傅立成教授、王文俊教授、蔡清池教授、胡竹生教授、莊仁 輝教授,對於本論文的建議與指引,強化本論文的嚴整性與可讀性。 感謝實驗室的夥伴嘉豪、晉懷、濬尉及仕傑在實務驗證時所提供的協助,也感謝學 長戴任詔博士、蔡奇謚博士和博士班學弟妹巧敏、信毅對本論文的建議與討論,同時亦 感謝過往相互鼓勵的碩士班學弟妹崇民、松歭、俊瑋、富聖、振暘、煥坤、舒涵、科棟、 奕彣、哲豪、仕晟、建宏、上畯、章宏等在生活上所帶來的樂趣。 另外,特別感謝我的父母,由於他們辛苦栽培,在生活上給予我細心地關愛與照料, 使我才得以順利完成此論文;也感謝我妻子在我身邊的全力支持,在我最無助及失意的 時候給予背後安定的力量。願能貢獻一己棉薄之力,成就有益家庭、社會之能事。


Contents

摘要 ... i
Abstract ... iii
誌謝 ... v
Contents ... vi
List of Figures ... ix

List of Tables ... xii

Chapter 1 Introduction ... 1

1.1 Motivation ... 1

1.2 Literature Survey ... 2

1.3 Research Objectives and Contributions ... 7

1.4 Organization of the Thesis ... 8

Chapter 2 Robotic Emotion Model and Emotional State Generation ... 10

2.1 Robotic Mood Model and Mood Transition ... 12

2.1.1 Robotic Mood Model ... 13

2.1.2 Robot Personality ... 14

2.1.3 Facial Expressions in Two-Dimensional Mood Space ... 16

2.1.4 Robotic Mood State Generation ... 16

2.2 Emotional Behavior Generation ... 18

2.2.1 Rule Table for Behavior Fusion ... 20

2.2.2 Evaluation of Fusion Weight Generation Scheme ... 22

2.2.3 Animation of Artificial Face Simulator ... 26

2.3 Summary ... 29

Chapter 3 Human Emotion Recognition ... 30


3.1.1 Facial Image Processing ... 32

3.1.2 Speech Signal Processing ... 36

3.1.3 Bimodal Information Fusion Algorithm ... 37

3.1.4 Hierarchical SVM Classifiers ... 41

3.2 Speech-signal-based Emotion Recognition ... 42

3.2.1 Speech Signal Pre-processing ... 43

3.2.2 Feature Extraction... 48

3.2.3 Emotional State Classification ... 50

3.2.4 Implementation of the Emotion Recognition Embedded System ... 53

3.3 Summary ... 55

Chapter 4 Experimental Results ... 56

4.1 Experimental Results of Robotic Emotion Generation ... 56

4.1.1 Experiments on an Anthropomorphic Robotic Head ... 56

4.1.2 Experimental Setup for the Artificial Face Simulator ... 58

4.1.3 Evaluation of Robotic Mood Transition Due to Individual Personality ... 60

4.1.4 Evaluation of Emotional Interaction Scheme ... 65

4.2 Experiments on Bimodal Information Fusion Algorithm ... 69

4.2.1 Off-line Experimental Results ... 70

4.2.2 On-line Experimental Results ... 71

4.3 Experiments on Speech-signal-based Emotion Recognition ... 73

4.3.1 Experiments Using the Self-built Database ... 73

4.3.2 Experiments with the Entertainment Robot... 75

4.4 Experiments on Image-based Emotional State Recognition ... 77

4.5 Summary ... 80

Chapter 5 Conclusions and Future Work ... 81


5.2 Future Directions ... 82

Appendix A Evaluation Questionary of Emotional Interaction ... 83

Bibliography ... 86

Vita ... 94


List of Figures

Fig. 1-1: Structure of the thesis. ... 9

Fig. 2-1: Block diagram of the autonomous emotional interaction system (AEIS) for an artificial face. ... 10

Fig. 2-2: Two-dimensional scaling for facial expressions based on pleasure-displeasure and arousal-sleepiness ratings. ... 17

Fig. 2-3: The fuzzy-neuro network for fusion weight generation. ... 19

Fig. 2-4: Fusion weights distribution for seven facial expressions. ... 25

Fig. 2-5: Illustration of the proposed robotic emotion model. ... 29

Fig. 3-1: The experimental setup. ... 31

Fig. 3-2: Block diagram of the robotic audio-visual emotion recognition system. ... 31

Fig. 3-3: The functional block diagram of facial image processing. ... 33

Fig. 3-4: Face detection procedure. (a) Original image, (b) Color segmentation and closing operation, (c) Candidate face areas, (d) Final result obtained by attentional cascade. .. ... 33

Fig. 3-5: Definition of the facial feature points and feature values. ... 34

Fig. 3-6: Test results of feature extraction of eyes and eyebrows. (a) Binary operation using IOD, (b) Edge detection, (c) AND operation. (d) Extracted feature points. ... 35

Fig. 3-7: Feature extraction of lips. ... 36

Fig. 3-8: The functional block diagram of facial image processing. ... 37

Fig. 3-9: Representing recognition reliability using the distance between test sample and hyperplane. ... 38

Fig. 3-10: Representing recognition reliability using the standard deviation of training samples. (a) Smaller standard deviation, (b) Larger standard deviation. ... 39


Fig. 3-12: Block diagram of the proposed speech-signal-based emotion recognition system. 42

Fig. 3-13: Energy of a speech signal. ... 44

Fig. 3-14: Zero-crossing rate of a speech signal. ... 45

Fig. 3-15: Example of real human speech detection. ... 46

Fig. 3-16: An example of end-point detection. ... 46

Fig. 3-17: Frame-signal separation using a Hamming window. ... 47

Fig. 3-18: Procedure for speech signal extraction in each frame. (a) A frame of original speech signal. (b) Hamming window. (c) Result of original speech signal multiplied by Hamming window. ... 48

Fig. 3-19: The original time response of the speech signal and the results for feature extraction of the fundamental frequency. ... 49

Fig. 3-20: Structure of the hierarchical SVM classifier. ... 52

Fig. 3-21: The TMS320C6416 DSK codec interface. ... 53

Fig. 3-22: Interaction scenario for a user and the entertainment robot... 54

Fig. 3-23: Control architecture of the entertainment robot. ... 55

Fig. 4-1: Architecture of the self-built anthropomorphic robotic head... 57

Fig. 4-2: Examples of facial expressions of the robotic head. ... 57

Fig. 4-3: Interaction scenario of a user and robotic head. ... 58

Fig. 4-4: Experiment setup: interaction scenario with an artificial face. ... 59

Fig. 4-5: Robotic mood transition of RobotA... 63

Fig. 4-6: Robotic mood transition of RobotB. ... 63

Fig. 4-7: Weights variation for RobotA (active trait). ... 64

Fig. 4-8: Weights variation for RobotB (passive trait). ... 65

Fig. 4-9: Questionary result of psychological impact. ... 66

Fig. 4-10: Questionary result of Natural vs. Artificial. ... 67


Fig. 4-12: Examples of database. ... 70
Fig. 4-13: Experimental results of recognition rate for any two emotional categories. ... 74
Fig. 4-14: Block diagram of the emotional interaction system. ... 76
Fig. 4-15: Interactive response of the robot as the user says, “I am angry!” (a) The robot puts down its hands to portray fear. (b) The robot continues to put down its hands to the lowest position. (c) The robot raises its hands back to the original position. ... 76
Fig. 4-16: Interactive response of the robot, when the user speaks in a surprised tone. (a) The robot shakes its head to the right. (b) The robot shakes its head to the left. (c) The robot puts its head back to the original position. ... 77
Fig. 4-17: Examples of user emotional state recognition. ... 79


List of Tables

Table 2-1: Big five model of personality. ... 15

Table 2-2: Rule table for interactive emotional behavior generation. ... 20

Table 2-3: Basic facial expressions with various weights executed in the simulator. ... 27

Table 2-4: Linear combined facial expressions with various weights on the simulator. ... 28

Table 3-1: The description of facial feature values. ... 35

Table 3-2: The description of speech feature values. ... 38

Table 3-3: The description of speech feature values. ... 49

Table 4-1: List of the conversation dialogue and corresponding subject’s facial expressions. ... 59
Table 4-2: Regulated user emotion intensity of conversion sentence 1 and 2. ... 60

Table 4-3: Definition of personality scales using Big Five factors. ... 60

Table 4-4: Facial expressions for the RobotA. ... 62

Table 4-5: Facial expressions for the RobotB. ... 62

Table 4-6: Estimation of personality parameters by questionnaire survey. ... 68

Table 4-7: Standard deviation of questionnaire results. ... 68

Table 4-8: Experimental results using speech features. ... 70

Table 4-9: Experimental results using image features. ... 71

Table 4-10: Experimental results using information fusion. ... 71

Table 4-11: On-line experimental results using only image features. ... 72

Table 4-12: On-line experimental results using information fusion. ... 72

Table 4-13: Meaning of sentence content for five emotional categories. ... 73

Table 4-14: Experimental results of recognizing five emotional categories. ... 75


Chapter 1

Introduction

1.1 Motivation

In recent years, many useful domestic and service robots, including museum guide robots, personal companion robots and entertainment robots, have been developed for various applications [1]. It has been forecast that edutainment and personal robots will be very attractive products in the near future [2-3]. One of the most interesting features of intelligent service robots is their human-centered functions. Indeed, intelligent interaction with a user is a key feature for service robots in healthcare, companion and entertainment applications. For a robot to engage in friendly interaction with humans, the function of emotional expression will play an important role in many real-life application scenarios. However, making a robot display human-like emotional expressions is still a challenge in robot design.

On the other hand, the ability to recognize a user’s emotion is also important in human-robot interaction applications. The emotion communicator Kotohana, developed by NEC [4], is a successful example of vocal emotion recognition. Kotohana is a flower-shaped terminal equipped with Light Emitting Diodes (LEDs). It can recognize a visitor’s emotional speech and respond with a color display to convey the interaction. The terminal responds in a lively manner to the detected emotional state, via color variation in the flower. For human beings, facial expression and voice reveal a person’s emotion most clearly. They also provide important communicative cues during social interaction. A robotic emotion recognition system will enhance the interaction between human and robot in a natural manner. Based on the above discussion, it is observed that a proper emotion model is desirable in robotic


emotional behavior generation. This also motivates us to investigate mood transition algorithms based on psychological findings about humans.

1.2 Literature Survey

Methodologies for developing emotional robotic behaviors have drawn much attention in robotic research community [5]. Breazeal et al. [6] presented the sociable robot Leonardo, which has an expressive face capable of near human-level expression and possesses a binocular vision system to recognize human facial features. The humanoid robot Nexi [7] demonstrated a wide range of facial expressions to communicate with people. Wu et al. [8] explored the process of self-guided learning of realistic facial expression by a robotic head. Mavridis et al. [9-10] developed the Arabic-language conversational android robot; it can become an exciting educational or persuasive robot in practical use. Hashimoto et al. [11-12] developed a reception robot SAYA to realize realistic speaking and natural interactive behaviors with six typical facial expressions. In [13], a singer robot EveR-2 is able to acquire visual and speech information, while expressing facial emotion during performing robotic singing. For some application scenarios such as persuasive robotics [14] or longer-term human-robot interaction [15], interactive facial expression has been demonstrated to be very useful.

There have been increasing interests in the study of robotic emotion generation schemes in order to give a robot more human-like behaviors. Reported approaches to emotional robot design often adopted results from psychology in order to design robot behaviors to mimic human beings. Miwa et al. proposed a mental model to build the robotic emotional state from external sensory inputs [16-17]. Duhaut [18] presented a computational model which includes emotion and personality in the robotic behaviors. The TAME (Traits, Attributes, Moods, and Emotions) framework proposed by Moshkina et al. gives a model of time-varying affective response for humanoid robots [19]. Itoh et al. [20] proposed an emotion generation model


which can assess the robot’s individuality and internal state through mood transitions. Their experiments showed that the robot could provide more human-like communications to users based on the emotional model. Banik et al. [21] demonstrated an emotion-based task sharing approach to a cooperative multi-agent robotic system. Their approach can give a robot a kind of personality through accumulation of past emotional experience. Park et al. [22] developed a hybrid emotion generation architecture. They proposed a robot personality model based on human personality factors to generate robotic interactions. Kim et al. [23] utilized the probability-based computational algorithm to develop the cognitive appraisal theory for designing artificial emotion generation systems. Their method was applied to a sample of interactive tasks and led to a more positive human–robot interaction experience. In order to allow a robot to express complex emotion, Lee et al. [24] proposed a general behavior generation procedure for emotional robots. It features behavior combination functions to express complex and gradational emotions. In [25], a design of autonomous robotic facial expression generation is presented.

Previous related works show abundant tools for designing emotional robots. It is observed, however, that a proper mood state transition plays a key role in robotic emotional behavior generation. Robotic mood transition from current to next mood state directly influences the interaction behavior of robot and also a user’s feeling to the robot. Most existing models treat mood transition by simple and intuitive representations. These representations lack a theoretical basis to support the assumptions for their mood state transition design. This motivated us to investigate a human-like mood transition model for a robot by adopting well-studied mood state conversion criteria from psychological findings. The transition among mood states would become smoother and thus might enable a robot to respond with more natural emotional expressions. We further combine personality into the robotic mood model to represent the trait of individual robot.


Besides the mood model, the robot’s interactive emotional behaviors need to be designed. The relationship between mood states and the responding behavior of a robot should not be a fixed, one-to-one relation. A continuous robotic facial expression would be more interesting and natural for manifesting the mood state transition. Instead of being arbitrarily defined, the relationships between robot emotional behaviors (e.g., in the form of facial expressions) and mood state can be modeled from psychological analysis and utilized to build the interaction patterns in the design of expressive behaviors.

To respond to a user sensationally, a robot needs first to understand the user’s emotion state. There are many approaches to building-up a robotic emotion recognition system. The majority of studies focus on image-based facial expression recognition [26-27]. Approaches using speech signal processing have also been investigated for sociable robotics [28-29]. Recently, there has been an increasing interest in audio-visual biometrics [30]. The combination of audio and visual information provides more reliable estimate of emotional states. The complementary relationship of these two modalities makes a recognition decision more accurate than using only a single modality. De Silva et al. [31] proposed to process audio and visual data separately. They have shown that some emotional states are visual dominant and some are audio dominant. They exploited this observation to recognize emotion efficiently by assigning a weight matrix to each emotion state. In [32], De Silva combined the audio and visual features using a rule-based technique to obtain improved recognition results. Rather simple rules were used in his design. For example, a rule is such that if a sample has been classified as certain emotion by both audio and visual processing methods, then the final result is that emotion. If samples have been classified differently by audio and visual analyses, the dominant mode is used as the emotion decision. Negative emotional expressions, such as anger and sadness, were assigned to be audio dominant, while joy and surprise were assigned to be visual dominant. Go et al. [33] combined audio and visual features directly to recognize different emotions using a neural network classifier. However, they did not give comparative experimental results between using bimodal and single modality. Wang et al. [34] proposed to


use cascade audio and visual feature data to classify variant emotions. They built one-against-all (OAA) linear discriminant analysis (LDA) classifiers for each emotion state and computed the probability of each emotion type. They set two rules in the decision module with several multi-class classifiers to determine the most possible emotion.

It is clear that audio and visual information are related to each other. In many situations, they offer a complementary effect for recognizing emotional states. However, current related works do not deal with the robustness of emotion classification in such bimodal systems. Existing approaches to combining audio-visual information employ straightforward and simple rules. The reliability of the individual modalities is not taken into consideration in the decision stage. One solution to this problem is to have the classifier output a calibrated posterior probability P(class|input) for post-processing. Platt [35] proposed a probabilistic support vector machine (SVM) to produce a calibrated posterior probability. The method trains the parameters of a sigmoid function to map SVM outputs into probabilities. Although this method is valid for estimating the posterior probability, a sigmoid function cannot represent all the modes of SVM outputs. In this study, we develop a new method for reliable emotion recognition utilizing audio-visual information. We emphasize the decision mechanism of the recognition procedure when fusing visual and audio information. By setting proper weights for each modality based on their recognition reliability, a more accurate recognition decision can be obtained.

In the design of emotion recognition systems that use speech signals, most methods employ vocal features, including the statistics of fundamental frequency, energy contour, duration of silence and voice quality [36]. In order to improve the recognition rate when more than two emotional categories are to be classified, Nwe et al. [37] used short time log frequency power coefficients (LFPC) to represent speech signals and a discrete hidden Markov model (HMM) as the classifier. Based on the assumption that the pitch contour has a Gaussian distribution, Hyun et al. [38] proposed a Bayesian classifier for emotion recognition


in speech information. They reported that the zero value of a pitch contour causes errors in the Gaussian distribution and proposed a non-zero-pitch method for speech feature extraction. Pao and Chen [39] used 16-bit linear predictive coding (LPC) and twenty Mel-frequency cepstral coefficients (MFCC) to identify the emotional state of a speaker. Five emotional categories were classified using the minimum-distance method and the nearest mean classifier. Neiberg et al. [40] modeled the pitch feature by using standard MFCC and MFCC-low, which is calculated between 20 and 300 Hz. Their experiments showed that MFCC-low outperformed the pitch features.

You et al. [41] indicated that the effectiveness of principal component analysis (PCA) and linear discriminant analysis (LDA) is limited by their underlying assumption that the data is in a linear subspace. For nonlinear structures, these methods fail to detect the real number of degrees of freedom of the data, so they proposed the method of Lipschitz embedding [42]. The method is not limited by an underlying assumption that the data belong to a linear subspace, so it can analyze the speech signal in more practical situations. Schuller et al. [29] considered an initially large set of more than 200 features; they ranked the statistical features according to LDA results and selected important features by ranking statistical features. Chuang and Wu [43] showed that the contours of the fundamental frequency and energy are not smooth. In order to remove discontinuities in the contour, they used the Legendre polynomial technique to smooth the contours of these features. Their feature extraction procedures firstly estimated the fundamental frequency, energy, formant 1 (F1) and zero-crossing rate. From these four features, the feature values are transformed to 33 statistical features. PCA was then used to select 14 principal components from these 33 statistical features, for the analysis of emotional speech. Busso et al. [44] indicated that gross fundamental frequency contour statistics, such as mean, maximum, minimum and range, are more emotionally prominent than features that describe the shape of the fundamental frequency. Using psychoacoustic harmony perception from music theory, Yang et al. [45]


proposed a new set of harmony features for speech emotion recognition. They reported improved recognition by the use of harmony parameters and state of the art features.

For robotics applications, Li et al. [46] developed a prototype chatting robot, which can communicate with a user in a speech dialogue. The recognition of the speech emotion of a specific person was successful for two emotional categories. Kim et al. [47] focused on speech emotion recognition for a thinking robot. They proposed a speaker-independent feature, namely the ratio of a spectral flatness measure to a spectral center, to solve the problem of diverse interactive users. Similarly, Park et al. [48] also studied the issue of service robots interacting with diverse users who are in various emotional states. Acoustically similar characteristics between emotions and variable speaker characteristics, caused by different users’ style of speech, may degrade the accuracy of speech emotion recognition. They proposed feature vector classification for speech emotion recognition, to improve performance in service robots.

For practical application, several important problems exist. Firstly, a robust speech signal acquisition system must be built at the front end of the design. It is also required that the robot be equipped with a stand-alone system for realistic human-robot interaction. One of the greatest challenges in emotion recognition for robotic applications is the performance required in natural, daily-life environments.

1.3 Research Objectives and Contributions

The objective of this thesis is to develop a robot emotion model in order to interact with people emotionally. A two-dimensional (2-D) emotional model is proposed to represent robot emotion, mood transition and personality in order to generate human-like emotional expressions. In this design, the robot personality is programmed by adjusting the factors of the Five Factors model proposed by psychologists. From Big Five personality traits, the influence factors of robot mood transition are determined.


A method to fuse basic robotic emotional behaviors is proposed in order to manifest robotic emotional states via continuous facial expressions. An artificial face on a screen is an effective way to evaluate a robot with a human-like appearance. An artificial face simulator has been implemented to show the effectiveness of the proposed methods. Questionnaire surveys have been carried out to evaluate the effectiveness of the proposed method by observing robotic responses to the user’s emotional expressions. Preliminary experimental results on a robotic head show that the proposed mood state transition scheme appropriately responds to a user’s emotional changes in a continuous manner.

The second part of this thesis aims to develop suitable emotion recognition methods for human-robot interaction. A bimodal emotion recognition method was proposed in this thesis. In the design of the bimodal emotion recognition system, a probabilistic strategy has been studied for a support vector machine (SVM)-based classification design to assign statistically selected fusing weights to two feature modalities. The fusion weights are determined by the distance between test data and the classification hyperplane and the standard deviation of training samples. In the latter bimodal SVM classification, the recognition result with higher weight is selected.
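To make the fusion-weight idea concrete, the following Python sketch interprets the rule described above: each modality's reliability is taken as the test sample's distance to its SVM hyperplane, normalized by the spread of the training margins, and the more reliable modality decides. This is not the thesis implementation (whose exact formula is given in Chapter 3); the function names, classifier setup and synthetic data are assumptions for illustration, with scikit-learn's SVC standing in for the hierarchical SVM classifiers.

```python
import numpy as np
from sklearn.svm import SVC

def modality_weight(clf, X_train, x_test):
    """Reliability weight of one modality: |distance of the test sample to the
    SVM hyperplane| normalized by the standard deviation of the training margins
    (a rough reading of the distance / standard-deviation rule described above)."""
    train_margins = clf.decision_function(X_train)
    return abs(clf.decision_function(x_test.reshape(1, -1))[0]) / (np.std(train_margins) + 1e-9)

def bimodal_decision(face_clf, speech_clf, Xf_train, Xs_train, x_face, x_speech):
    """Pick the label of the modality whose classifier is more reliable for this sample."""
    w_face = modality_weight(face_clf, Xf_train, x_face)
    w_speech = modality_weight(speech_clf, Xs_train, x_speech)
    clf, x = (face_clf, x_face) if w_face >= w_speech else (speech_clf, x_speech)
    return clf.predict(x.reshape(1, -1))[0]

# Toy example with synthetic binary data (two feature views of the same samples).
rng = np.random.default_rng(0)
Xf = rng.normal(size=(40, 5)); y = (Xf[:, 0] > 0).astype(int)   # "facial" features
Xs = rng.normal(size=(40, 3))                                   # "speech" features
face_clf = SVC(kernel="linear").fit(Xf, y)
speech_clf = SVC(kernel="linear").fit(Xs, y)
print(bimodal_decision(face_clf, speech_clf, Xf, Xs, Xf[0], Xs[0]))
```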

In the design of the speech-signal-based emotion recognition method, speech signals are used to recognize several basic human emotional states. The proposed method uses voice signal processing and classification. In order to determine the effectiveness of emotional human-robot interaction, an embedded system was constructed and integrated with a self-built entertainment robot.
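The pre-processing chain of the speech path (end-point detection, framing and energy statistics, followed by FLDA) can be sketched as below. This is only a minimal illustration under assumed parameter values (frame length, hop size, energy threshold); the thesis's actual feature set and thresholds are specified in Chapter 3, and scikit-learn's LinearDiscriminantAnalysis is used here in place of the FLDA step.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def frame_signal(x, frame_len=256, hop=128):
    """Split a speech signal (1-D array, assumed longer than one frame)
    into Hamming-windowed frames."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * w for i in range(n_frames)])

def endpoint_detect(frames, energy_ratio=0.1):
    """Crude end-point detection: keep frames whose short-time energy exceeds a
    fraction of the maximum frame energy (the threshold value is an assumption)."""
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > energy_ratio * energy.max()], energy

def energy_features(energy):
    """Illustrative statistical features of the energy contour."""
    return np.array([energy.mean(), energy.std(), energy.max(), energy.min()])

# With per-utterance feature vectors and emotion labels collected, FLDA would then be
# applied to improve class separability before classification, e.g.:
#   lda = LinearDiscriminantAnalysis().fit(features, labels)
#   projected = lda.transform(features)
```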

1.4 Organization of the Thesis

Figure 1-1 shows the organization of this thesis. In Chapter 2, a novel robotic emotion generation system is developed based on a mood transition model. A robotic mood state


generation algorithm is proposed using a two-dimensional emotional model. An interactive emotional behavior generation method is then proposed to generate an unlimited number of emotional expressions by fusing seven basic facial expressions. In Chapter 3, several human emotion recognition methods are developed to provide the user’s emotional state. Here, a bimodal information fusion algorithm and a speech-signal-based emotion recognition method are proposed for human-robot interaction. Simulation and experimental results of the proposed robotic emotion generation system and the proposed human emotion recognition methods are reported and discussed in Chapter 4. Chapter 5 concludes the contributions of this work and provides recommendations for future research.

Fig. 1-1: Structure of the thesis. Chapter 2 covers robotic emotional state modeling (robotic mood transition model and emotional behavior generation); Chapter 3 covers human emotion recognition (bimodal information fusion algorithm and speech-signal-based emotion recognition of the user’s emotional state); Chapter 4 presents experiments with the artificial face and robotic head.


Chapter 2

Robotic Emotion Model and Emotional State Generation

Figure 2-1 shows the block diagram of the proposed autonomous emotional interaction system (AEIS). Taking a robotic facial expression as the emotion behavior, the robotic interaction is expected not only to react to user’s emotional state, but also to reflect the mood state of the robot itself. We attempt to integrate three modules to construct the AEIS, namely, user emotional state recognizer, robotic mood state generator and emotional behavior decision maker. An artificial face is employed to demonstrate the effectiveness of the design. A camera is provided to capture the user’s face in front of the robot. The acquired images are sent to the image processing stage for emotional state recognition [49]. The user emotional state recognizer is responsible for obtaining user’s emotional state and its intensity. In this design, user’s emotional state at instant k ($UE_k^n$) is recognized and represented as a vector of four emotional intensities: neutral ($ue_{N,k}^n$), happy ($ue_{H,k}^n$), angry ($ue_{A,k}^n$) and sad ($ue_{S,k}^n$). Several existent emotional intensity estimation methods [50-53] provide effective tools to recognize the intensity of human’s emotional state. Their results can be applied and combined into the AEIS.

Fig. 2-1: Block diagram of the autonomous emotional interaction system (AEIS) for an artificial face.

In this work, an image-based emotional intensity recognition module (see 4.5) has been designed and implemented for the current design of the AEIS. The recognized emotional intensity covers the basic emotional categories at each sampling instant, each represented by a value between 0 and 1. These intensities are sent to the robotic mood state generator. Moreover, other emotion recognition modalities and methods (e.g., emotional speech recognition) can also serve as inputs to the AEIS, provided that the recognized emotional states contain intensity values between 0 and 1.

In the robotic mood state generator, the recognized user’s emotional intensities are transformed into interactive robotic mood variables represented by (Δαk, Δβk) (see 2.1.1 for a detailed description). These two variables represent the way the user’s emotional state influences the robotic mood state transition. Furthermore, the robotic emotional behavior depends not only on the user’s emotional state, but also on the robot personality and previous mood state. Therefore, the proposed method takes into account the interactive robotic mood variables (Δαk, Δβk), the previous robotic mood state (RMk-1) and the robot personality parameters (Pα, Pβ) to compute the current robotic mood state (RMk) (see 2.1.4). Note that the previous robotic mood state (RMk-1) is temporarily stored in a buffer. In this work, the current robotic mood state is represented as a point in the two-dimensional (2D) emotional plane. Furthermore, robotic personality parameters are created to describe the distinct human-like personality of a robot. Based on the current robotic mood state, the emotional behavior decision unit autonomously generates suitable robot behavior in response to the user’s emotional state.


intensities, a set of fusion weights (FWi, i=0~6) corresponding to each basic emotional behavior are generated by using a fuzzy Kohonen clustering network (FKCN) [54] (see 2.2). As with human beings, the facial expression of a robotic face is very complex and is difficult to classify into a limited number of categories. In order to demonstrate interaction behaviors similar to those of humans, FKCN is adopted to generate an unlimited number of emotional expressions by fusing seven basic facial expressions. Outputs of the FKCN are sent to the artificial face simulator to generate the interactive behaviors (facial expressions in this work). An artificial face has been designed exploiting the method in [55] to demonstrate the facial expressions generated in human-robot interaction. Seven basic facial expressions are simulated, including neutral, happiness, surprise, fear, sadness, disgust and anger. The facial expressions are depicted by moving control points determined from Ekman’s model [56]. In the practical interaction scenario, each expression can be generated with different proportions of the seven basic facial expressions. The actual facial expression of the robot is generated by the summation of each behavior output multiplied by its corresponding fusion weight. Therefore, more subtle emotional expressions can be generated as desired. Detailed designs of the proposed robotic mood transition model, emotional behavior generation and image-based emotional state recognition will be described in the following sections.

2.1 Robotic Mood Model and Mood Transition

Emotion is a complex psychological experience of an individual’s state of mind as interacting with people or environmental influences. For humans, emotion involves “physiological arousal, expressive behaviors, and conscious experience” [57]. Emotional interaction behavior is associated with mood, temperament, personality, disposition, and motivation. In this study, the emotion for robotic behavior is simplified to association with mood and personality. We apply the concept that emotional behavior is controlled by current emotional state and mood, while the mood is influenced by personality. In this thesis, a novel


robotic mood state transition method is proposed for a given human-like personality. Furthermore, the corresponding interaction behavior will be generated autonomously for a determined mood state.

2.1.1 Robotic Mood Model

A simple way to develop robotic emotional behaviors that can interact with people is to allow a robot to respond with emotional behaviors that mimic humans. In human-robot emotional interaction, users’ emotional expressions can be treated as trigger inputs to drive the robotic mood transition. Furthermore, the transition of the robotic mood depends not only on the users’ emotional states, but also on the robot’s own mood and personality. For a robot to interact with several individuals or a group of people, the users’ current (at instant k) emotional intensities ($UE_k^n$) are sampled and transformed into interactive mood variables Δαk and Δβk to represent how the users’ emotional states influence the variation of the robotic mood state transition.

From the experience of emotional interaction among human beings, a user’s neutral intensity, for instance, usually affects the arousal-sleepiness mood variation directly. Thus, the robotic mood state tends toward arousal when the user’s neutral intensity is low. Similarly, the user’s happiness, anger and sadness intensities affect the pleasure-displeasure axis. Thus, the user’s happy intensity will lead the robotic mood into pleasure. On the other hand, the robotic mood state shows more displeasure when the user’s angry and sad intensities are high. Based on the above observations, a straightforward case is designed for the interactive robotic mood variables (Δαk, Δβk), which represent the reaction to the current users’ emotional intensities on the pleasure-arousal plane, such that:

$$\Delta\alpha_k = \frac{1}{N_s}\sum_{n=1}^{N_s}\left[ue_{H,k}^{n}-\left(ue_{A,k}^{n}+ue_{S,k}^{n}\right)/2\right] \quad (2.1)$$

$$\Delta\beta_k = \frac{1}{N_s}\sum_{n=1}^{N_s}2\left(0.5-ue_{N,k}^{n}\right) \quad (2.2)$$

$$UE_k^n=\begin{bmatrix}ue_{N,k}^{n}\\ ue_{H,k}^{n}\\ ue_{A,k}^{n}\\ ue_{S,k}^{n}\end{bmatrix}=\begin{bmatrix}\text{neutral intensity for the }n\text{th user}\\ \text{happiness intensity for the }n\text{th user}\\ \text{anger intensity for the }n\text{th user}\\ \text{sadness intensity for the }n\text{th user}\end{bmatrix} \quad (2.3)$$

where Ns denotes the number of users and $UE_k^n$ represents the four kinds of emotional intensities of the nth user. By using (2.1)-(2.3), the effect on the robotic mood from multiple users’ emotional inputs is represented. However, in this work only one user is considered, in order to concentrate on illustrating the proposed model, i.e., Ns=1 in the following discussion. It is worthwhile to extend the number of users in the next stage of this study, such that a scenario like the Massachusetts Institute of Technology mood meter [58] can be investigated. Furthermore, the mapping between the facial expressions of the interacting human and the robotic internal state may be modeled in a more sophisticated way. For example, Δαk can be designed

as $\left(ue_{A,k}^{i}+ue_{S,k}^{i}\right)/2-ue_{H,k}^{i}$, such that alternative (opposite) responses to a user can be obtained.
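A minimal Python sketch of (2.1)-(2.3), with the single-user case Ns = 1 shown in the example; the function and key names are mine, not the thesis's.

```python
def interactive_mood_variables(user_intensities):
    """Compute (delta_alpha_k, delta_beta_k) from the users' emotional intensities.

    user_intensities: one dict per user with keys 'neutral', 'happy', 'angry',
    'sad', each holding an intensity in [0, 1] (the vector UE of Eq. (2.3)).
    """
    ns = len(user_intensities)
    # Eq. (2.1): happiness pulls the mood toward pleasure; anger and sadness toward displeasure.
    delta_alpha = sum(ue['happy'] - (ue['angry'] + ue['sad']) / 2.0
                      for ue in user_intensities) / ns
    # Eq. (2.2): a low neutral intensity pushes the robotic mood toward arousal.
    delta_beta = sum(2.0 * (0.5 - ue['neutral']) for ue in user_intensities) / ns
    return delta_alpha, delta_beta

# Single-user example (Ns = 1): a clearly happy, non-neutral user.
print(interactive_mood_variables([{'neutral': 0.1, 'happy': 0.8, 'angry': 0.05, 'sad': 0.05}]))
# -> (0.75, 0.8): the robotic mood is driven toward pleasure and arousal.
```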

2.1.2 Robot Personality

McCrae et al. [59] proposed Big Five factors (Five Factor model) to describe the traits of human personality. Big Five model is an empirically based result, not a theory of personality. The Big Five factors were created through a statistical procedure, which was used to analyze how ratings of various personality traits are correlated for general humans. Table 2-1 lists the Big Five factors and their descriptions [60]. Besides, Mehrabian [61] utilized the Big Five factors to represent the pleasure-arousability-dominance (PAD) temperament model. Through linear regression analysis, the scale of each PAD value is estimated by using the Big Five factors [62]. These results are summarized as three equations of temperament, which includes pleasure, arousability and dominance.

In this work, we adopted the Big Five model to represent the robot personality and determine the mood state transition on a two-dimensional pleasure-arousal plane. Hence, only two equations are utilized to represent the relationship between robot personality and


Table 2-1: Big five model of personality.

Factor: Description
Openness: Open mindedness, interest in culture.
Conscientiousness: Organized, persistent in achieving goals.
Extraversion: Preference for and behavior in social situations.
Agreeableness: Interactions with others.
Neuroticism: Tendency to experience negative thoughts.

pleasure-arousal plane. The reason we utilize this two-dimensional pleasure-arousal plane rather than the three-dimensional PAD model is based on Russell’s study. Russell and Pratt [63] indicated that pleasure and arousal each account for a large proportion of the variance in the meaning of affect terms, while each dimension beyond these two accounts for only a tiny proportion. More importantly, these secondary dimensions become more and more clearly interpretable as cognitive rather than emotional in nature. The secondary dimensions thus appear to be aspects of the cognitive appraisal system that has been suggested for emotions. Here, elements of the Big Five factors are assigned based on a reasonable realization of Table 2-1. Referring to [61], the robot personality parameters (Pα, Pβ) are adopted such that:

$$P_\alpha = 0.21E + 0.59A + 0.19N \quad (2.4)$$

$$P_\beta = 0.15O + 0.30A - 0.57N \quad (2.5)$$

where O, E, A and N represent the Big Five factors of openness, extraversion, agreeableness and neuroticism, respectively. Therefore, the robot personality parameters (Pα, Pβ) are given once the robot personality is known, i.e., O, E, A and N are determined constants. Later we will show that (Pα, Pβ) work as the mood transition weightings on the pleasure (α) and arousal (β) axes of the plane.

Note that the conscientiousness factor of the Big Five was not used in this design, because this factor only influences the dominance axis of the three-dimensional PAD model. In this study, the pleasure-arousal plane of the two-dimensional emotional model was applied, so only four out of the five parameters are used to translate the mood transition weightings from the Big Five factors.
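The mapping of (2.4)-(2.5) is simple enough to state directly in code; the two trait profiles in the example are hypothetical, and the scaling of the factor scores is not fixed by the text above.

```python
def personality_parameters(O, E, A, N):
    """Map the Big Five factors to the mood-transition weightings on the
    pleasure (alpha) and arousal (beta) axes."""
    p_alpha = 0.21 * E + 0.59 * A + 0.19 * N   # Eq. (2.4)
    p_beta = 0.15 * O + 0.30 * A - 0.57 * N    # Eq. (2.5)
    return p_alpha, p_beta

# Hypothetical trait profiles (factor scores taken in [0, 1] for illustration):
print(personality_parameters(O=0.7, E=0.9, A=0.8, N=0.2))  # ~ (0.70, 0.23): responsive on both axes
print(personality_parameters(O=0.3, E=0.2, A=0.3, N=0.9))  # ~ (0.39, -0.38): weak pleasure gain, negative arousal gain
```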

2.1.3 Facial Expressions in Two-Dimensional Mood Space

The relationship between mood states and emotional behaviors has been studied by psychologists. Russell and Bullock [64] proposed a two-dimensional scaling on the pleasure-displeasure and arousal-sleepiness axes to model the relationships between the facial expressions and mood state. In this work, the results from [64] are employed to model the relationship between mood state and output emotional behavior. Figure 2-2 illustrates a two-dimensional scaling result for general adult's facial expressions based on pleasure-displeasure and arousal-sleepiness ratings. The scaling result was analyzed by the Guttman-Lingoes smallest space analysis procedure [65]. This two-dimensional scaling procedure provides a geometric representation (stress and orientation) of the relations among the facial expressions by placing them in a space (Euclidean space is used here) of specified dimensionality. Greater similarity between two facial expressions is represented by their closeness in the space. Hence the coordinate in this space can be used to represent the characteristic of each facial expression. As shown in Fig. 2-2, axis α and β represent the amount of pleasure and arousal respectively. Eleven facial expressions are analyzed and located on the plane. The location of each facial expression is represented by a square along with its coordinates. The coordinates of each facial expression is obtained by measuring the location in the figure (interested readers are referred to [64]). The relationship between robotic mood and output behavior, facial expression in this case, is determined.

2.1.4 Robotic Mood State Generation

As mentioned in 2.1.1, both user’s current emotional intensity and robot personality affect the robotic mood transition. The way that robot personality affects the mood transition


Fig. 2-2: Two-dimensional scaling for facial expressions based on pleasure-displeasure and arousal-sleepiness ratings.

is described by the robot personality parameters (Pα, Pβ). As given in 2.1.2, these two parameters act as weighting factors on the α and β axes, respectively. When Pα and Pβ vary, the speed of the mood transition along the α and β axes is affected. On the other hand, the interactive mood variables (Δαk, Δβk) give the influence of the user’s emotional intensity on the variation of the robotic mood state transition. To reveal the relationship between robot personality and mood transition, we propose multiplying the robot personality parameters (Pα, Pβ) by the interactive mood variables (Δαk, Δβk). The result reflects the influence on the robotic mood transition of both the current user’s emotional intensity and the robot personality.

Furthermore, the manifested emotional state is determined not only by current robotic emotional variable but also by previous robotic emotional states. The manifested robotic mood state at sample instant k (RMk) is calculated such that:

$$RM_k \equiv (\alpha_k,\ \beta_k) = RM_{k-1} + \left(P_\alpha\cdot\Delta\alpha_k,\ P_\beta\cdot\Delta\beta_k\right), \quad (2.6)$$

where (αk, βk) ∈ [-1, 1] represents the coordinates of the robotic mood state at sample instant k on the pleasure-arousal plane. By using (2.6), the current robotic mood state is determined and located on the emotional plane. Moreover, the mood transition is influenced by personality, which is reflected by the Big Five factors. After obtaining the manifested robotic mood state (RMk), the coordinates (αk, βk) will be mapped onto the pleasure-arousal plane, and a suitable corresponding facial expression can be determined, as shown in Fig. 2-2.
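A one-step sketch of the mood update (2.6). Clamping the state to the [-1, 1] range of the pleasure-arousal plane is my addition, since the text only states that (αk, βk) lies in that range; the variable names are illustrative.

```python
def update_mood(rm_prev, delta, personality):
    """Eq. (2.6): RM_k = RM_{k-1} + (P_alpha * dAlpha_k, P_beta * dBeta_k).

    rm_prev, delta and personality are (alpha, beta)-style pairs.
    """
    clamp = lambda v: max(-1.0, min(1.0, v))   # keep the state on the mood plane (assumed)
    return (clamp(rm_prev[0] + personality[0] * delta[0]),
            clamp(rm_prev[1] + personality[1] * delta[1]))

# Example: a slightly sleepy, neutral mood nudged by a pleased and aroused user input.
rm = update_mood(rm_prev=(0.0, -0.2), delta=(0.75, 0.8), personality=(0.70, 0.23))
print(rm)   # roughly (0.53, -0.02): the mood drifts toward pleasure and arousal
```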

2.2 Emotional Behavior Generation

After the robotic mood state is determined by using (2.6), a suitable emotional behavior is expected to respond to the user. In this work, we propose a design based on fuzzy Kohonen clustering network (FKCN) to generate smooth variation of interaction behaviors (facial expressions) as mood state transits gradually.

In this approach, pattern recognition techniques were adopted to generate interactive robotic behaviors [25, 54]. By adopting FKCN, robotic mood state, obtained from (2.6), is mapped to fusion weights of basic robotic emotional behaviors. The output will be a linear combination of weighted basic behaviors. In the current design, the basic facial expression behaviors are neutral, happiness, surprise, fear, sadness, disgust and anger, as shown in Fig. 2-1. FKCN is employed to determine the fusion weight of each basic emotional behavior based on the current robotic mood. Figure 2-3 illustrates the structure of the fuzzy-neuro network for fusion weight generation. In the input layer of the network, the robotic mood state (αk, βk) is regarded as inputs of FKCN. In the distance layer, the distance between input pattern and each prototype pattern is calculated such that:

$$d_{ij} = \lVert X_i - P_j\rVert^2 = (X_i - P_j)^T\,(X_i - P_j), \quad (2.7)$$

where Xi denotes the input pattern and Pj denotes the jth prototype pattern (see 2.3.2). In this layer, the degree of difference between the current robotic mood state and the prototype pattern is calculated. If the robotic mood state is not similar to the built-in prototype patterns, then the distance will reflect the dissimilarity. The membership layer is provided to map the distance dij to membership values uij, and it calculates the similarity degree between the input



Fig. 2-3: The fuzzy-neuro network for fusion weight generation.

pattern and the prototype patterns. If an input pattern does not match any prototype pattern, then the similarity between the input pattern and each individual prototype pattern is represented by a membership value from 0 to 1. The determination of the membership value is given such that:

$$u_{ij} = \begin{cases}1, & \text{if } d_{ij}=0\\ 0, & \text{if } d_{ij}>0 \text{ and } d_{ik}=0,\ 0\le k\le c-1\end{cases} \quad (2.8)$$

where c denotes the number of prototype patterns, otherwise,

$$u_{ij} = \left[\sum_{l=0}^{c-1}\frac{d_{ij}}{d_{il}}\right]^{-1}. \quad (2.9)$$

Note that the sum of the outputs of the membership layer equals 1. Using the rule table (see later) and the obtained membership values, the current fusion weights (FWi, i=0~6) are determined such that:

$$FW_i = \sum_{j=0}^{c-1} u_{ij}\, w_{ji}, \quad (2.10)$$

where wji represents the prototype-pattern weight of ith output behavior. The prototype-pattern weights are designed in a rule table to define basic primitive emotional behaviors corresponding to carefully chosen input states.


2.2.1 Rule Table for Behavior Fusion

In the current design, several representative input emotional states were selected from the two-dimensional model in Fig. 2-2, which gives the relationship between facial expressions and mood states. Each location of facial expression on the mood plane in Fig. 2-2 is used as a prototype pattern for FKCN. Thus, a rule table is constructed accordingly following the structure of FKCN. As shown in Table 2-2, seven basic facial expressions were selected to build the rule table. The IF-part of the rule table is the emotional state of αk and βk of the pleasure-arousal space and the THEN-part is the prototype-pattern weight (wji) of seven basic expressions. For example, the neutral expression in Fig. 2-2 occurs at (0.61, -0.47), which forms the IF-part of the first rule and the prototype pattern for neutral behavior. The THEN part of this rule is the neutral behavior expressed by a vector of prototype-pattern weights (1, 0, 0, 0, 0, 0, 0). The other rules and prototype patterns are set up similarly following the values in Fig. 2-2. Some facial expressions are located at two distinct points on the mood space, both locations are employed, and two rules are set up following the analysis results from psychologist. There are all together 13 rules as shown in Table 2-2. Note that Table 2-2 gives us suitable rules to mimic the behavior of human, since the content of Fig. 2-2 is referenced from psychology results. However, other alternatives and more general rules can

Table 2-2: Rule table for interactive emotional behavior generation.

IF-part (prototype patterns): rule #j with mood coordinates (αk, βk). THEN-part (weighting): each rule assigns prototype-pattern weight 1 to exactly one of the seven basic expressions (Neutral, Happiness, Surprise, Fear, Sadness, Disgust, Anger) and 0 to the others.
Rule #1: (0.61, -0.47); #2: (0.81, 0.66); #3: (0.88, 0.59); #4: (-0.75, 0.61); #5: (-0.71, 0.58); #6: (-0.83, 0.54); #7: (-0.91, 0.52); #8: (-0.47, -0.56); #9: (-0.54, -0.59); #10: (-0.95, -0.06); #11: (-0.84, -0.19); #12: (-0.95, 0.17); #13: (-0.98, 0.1).


also be employed. FKCN works to generalize from these prototype patterns all possible situations (robotic mood state in this case) that may happen to the robot. In the FKCN generalization process, proper fusion weights for the corresponding pattern are calculated. After obtaining the fusion weights of output behaviors from FKCN, the robot’s behavior is determined from seven basic facial expressions weighted by their corresponding fusion weights such that:

$$\text{Facial Expression} = FW_0\times RB_N + FW_1\times RB_H + FW_2\times RB_{Sur} + FW_3\times RB_F + FW_4\times RB_{Sad} + FW_5\times RB_D + FW_6\times RB_A, \quad (2.11)$$

where RBN, RBH, RBSur, RBF, RBSad, RBD, RBA, represent the seven basic facial expressions of neutral, happiness, surprise, fear, sadness, disgust and anger respectively. It is seen that (2.11) gives us a method to generate facial expressions by combining and weighting the seven basic expressions.

The linear combination of basic facial expressions gives a straightforward yet effective way to express various emotional behaviors. In order to make the combined facial expression more consistent with human experience, an evaluation and adjustment procedure was carried out by a panel of students in the lab. The features of the seven basic facial expressions were adjusted to be as distinguishable as possible, to approach human perceptual experience. Some results of the linear combination are demonstrated using a face expression simulator; please refer to 2.2.3.

In fact, human emotional expressions are difficult to represent by a mathematical model or several typical rules. On the other hand, FKCN is very suitable for building up the emotional expressions. The merit of FKCN is its capacity to generalize from a limited set of assigned rules (prototypes). Furthermore, dissimilar emotional types can be designed by adjusting the rules. For the artificial face, facial expressions are defined as the variation of control points, which are the positions of the eyebrows, eyes, lips and wrinkles of the artificial face.
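The whole FKCN path of (2.7)-(2.11) can be put together in a short sketch. The prototype coordinates are taken from Table 2-2; the text only confirms which basic expression belongs to prototype 1 (neutral) and prototypes 2-3 (happiness), so the remaining prototype-to-expression assignments below follow the column order of Table 2-2 and should be read as an assumption, as should all function names.

```python
import numpy as np

EXPRESSIONS = ['neutral', 'happiness', 'surprise', 'fear', 'sadness', 'disgust', 'anger']

# Prototype mood coordinates from Table 2-2. Only the neutral and happiness
# assignments are confirmed in the text; the others are assumed from column order.
PROTOTYPES = [
    ((0.61, -0.47), 'neutral'),
    ((0.81, 0.66), 'happiness'), ((0.88, 0.59), 'happiness'),
    ((-0.75, 0.61), 'surprise'), ((-0.71, 0.58), 'surprise'),
    ((-0.83, 0.54), 'fear'),     ((-0.91, 0.52), 'fear'),
    ((-0.47, -0.56), 'sadness'), ((-0.54, -0.59), 'sadness'),
    ((-0.95, -0.06), 'disgust'), ((-0.84, -0.19), 'disgust'),
    ((-0.95, 0.17), 'anger'),    ((-0.98, 0.10), 'anger'),
]

def fusion_weights(mood):
    """Eqs. (2.7)-(2.10): map a mood state (alpha_k, beta_k) to seven fusion weights."""
    x = np.asarray(mood, dtype=float)
    d = np.array([np.sum((x - np.asarray(p)) ** 2) for p, _ in PROTOTYPES])   # Eq. (2.7)
    u = np.zeros(len(PROTOTYPES))
    if np.any(d == 0.0):
        u[d == 0.0] = 1.0                       # Eq. (2.8): exact match with a prototype
    else:
        u = (1.0 / d) / np.sum(1.0 / d)         # Eq. (2.9): memberships sum to 1
    fw = np.zeros(len(EXPRESSIONS))
    for u_j, (_, expr) in zip(u, PROTOTYPES):   # Eq. (2.10): w_ji is one-hot per rule
        fw[EXPRESSIONS.index(expr)] += u_j
    return dict(zip(EXPRESSIONS, fw.round(3)))

# Eq. (2.11): the displayed face is the fusion-weighted sum of the basic expressions,
# e.g. blended_face = sum(fw[e] * basic_expression[e] for e in EXPRESSIONS).
print(fusion_weights((0.7, 0.6)))   # dominated by happiness, with a small neutral share
```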


2.2.2 Evaluation of Fusion Weight Generation Scheme

In order to verify the result of fusion-weight generation using FKCN, we applied the rules in Table 2-2 and simulated the weight distribution for various emotional states. The purpose is to evaluate how the proposed FKCN generalizes any input emotional state (αk, βk) and gives a set of output fusion weights corresponding to the input. Figure 2-4 shows the simulation results of the weight distribution vs. the robotic mood variation on the pleasure-arousal plane. In order to check the seven fusion weights corresponding to the seven basic emotional expressions for a given mood transition from (αk-1, βk-1) to (αk, βk), the simulation outputs for the seven basic emotional expressions are illustrated respectively. The blue squares in Fig. 2-4 indicate the robotic mood transition from (αk-1, βk-1) to (αk, βk). Every position or point in this two-dimensional mood space has corresponding fusion weights. Figure 2-4(a) shows the weight distribution of the neutral expression over the whole robotic mood space.

Fig. 2-4: Fusion weight distribution over the pleasure-arousal mood space (horizontal axis: Displeasure-Pleasure; vertical axis: Sleepiness-Arousal) for each of the seven basic facial expressions: (a) Neutral, (b) Happiness, (c) Surprise, (d) Fear, (e) Sadness, (f) Disgust, (g) Anger.

The same contour curve in the figure has an identical neutral weight. The maximum weight (1) occurs at (0.61, -0.47) in the pleasure-arousal plane. It is seen that the neutral weight decreases as the robotic mood state moves away from (0.61, -0.47). Figure 2-4(b) shows the weight distribution of the happiness expression for the whole robotic mood space. The maximum weight (1) occurs at (0.81, 0.66) and (0.88, 0.59) in the pleasure-arousal plane. It is seen that the happiness weight increases as the robotic mood state moves to the upper right quadrant. Figures 2-4(c)-(g) show similar results, in which the maximum weight positions are located at the corresponding coordinates in Fig. 2-2. These results coincide with the two-dimensional emotional states of the facial expressions in Fig. 2-2. Furthermore, the correlation among the seven basic emotional behaviors is also checked in the simulation. It is seen that a point on the mood plane maps to a corresponding fusion weight for each of the seven basic emotional expressions.

2.2.3 Animation of Artificial Face Simulator

To evaluate the effectiveness of the FKCN-based behavior fusion on actual emotional expressions, we developed an artificial face simulator exploiting the method in [55] to examine robotic facial expressions. The method follows a muscle-based approach and thus mimics the way biological faces operate. The artificial face renders an expression based on the contraction of facial muscles, and it can also dynamically generate features such as wrinkles [55]. Emotions are the high-level concept that the simulator aims to display via facial expressions. Each emotion influences a different set of muscles; for each emotion and each intensity level, the muscles were adjusted to match a reference drawing.

In this simulation, seven basic facial expressions, namely neutral, happiness, surprise, fear, sadness, disgust and anger, are first designed by specifying the muscle tensions of each expression, and each expression can then be rendered at different fusion weights. Table 2-3 shows examples of the seven basic facial expressions generated by the simulator at various weights.


Table 2-3: Basic facial expressions rendered at various weights (30%, 60%, 80% and 100%) in the simulator. Rows correspond to Happiness (RBH), Surprise (RBSur), Fear (RBF), Sadness (RBSad), Disgust (RBD), Anger (RBA) and Neutral (RBN); the cells are facial images and are not reproduced here.


One observes that the facial expression changes from smiling to laughing as the weight of happiness increases, and from staring to screaming as the weight of surprise increases. Similarly, the facial expression changes from dread to panic as the weight of fear increases, and from gloom to crying as the weight of sadness increases. Note that the neutral facial expression remains unchanged because it is defined as the normal facial expression.

Finally, fused emotional expressions are rendered by a linear combination of the weighted basic facial expressions. Table 2-4 shows examples of facial expressions generated by this linear combination. Facial expressions with different fusion weights of sadness, anger, surprise and fear are fused to show the complex variation during emotion transitions. This provides a quantitative and vivid way to express human-like emotion.

Table 2-4: Linearly combined facial expressions with various weights on the simulator. The blended pairs are: Sadness 70% / Anger 30%, Sadness 50% / Anger 50%, Sadness 30% / Anger 70%; Surprise 70% / Anger 30%, Surprise 50% / Anger 50%, Surprise 30% / Anger 70%; Fear 70% / Sadness 30%, Fear 50% / Sadness 50%, Fear 30% / Sadness 70%. The cells are facial images and are not reproduced here.
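As a concrete illustration of this linear combination, the sketch below blends hypothetical muscle-tension vectors of two basic expressions according to their fusion weights. The muscle names and tension values are invented for illustration and are not taken from the simulator of [55].

```python
import numpy as np

# Hypothetical muscle-tension vectors at 100% intensity; in the simulator each
# basic expression is defined by the contraction values of its facial muscles.
MUSCLES = ["frontalis", "corrugator", "zygomatic_major",
           "depressor_anguli", "orbicularis_oculi"]
BASIC = {
    "sadness": np.array([0.6, 0.7, 0.0, 0.8, 0.2]),   # illustrative values
    "anger":   np.array([0.1, 0.9, 0.0, 0.5, 0.7]),   # illustrative values
}

def fuse_expression(weights):
    """Weighted linear combination of basic expressions (cf. Table 2-4)."""
    fused = np.zeros(len(MUSCLES))
    for name, w in weights.items():
        fused += w * BASIC[name]
    return dict(zip(MUSCLES, fused))

# Example corresponding to the first cell of Table 2-4: 70% sadness, 30% anger.
print(fuse_expression({"sadness": 0.7, "anger": 0.3}))
```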


2.3 Summary

A method of robotic mood transition for autonomous emotional interaction has been developed. An emotional model is proposed for mood-state transition exploiting a robotic personality approach. We apply the concept that emotional behavior is controlled by the current emotional state and mood, while the mood is influenced by personality. Here the psychological Big Five factors are utilized to represent the personality. By referring to Eqs. (2.4) and (2.5), the relationship between personality and mood is described. Furthermore, a two-dimensional scaling result (see Fig. 2-2) is adopted to represent a general adult's facial expressions based on pleasure-displeasure and arousal-sleepiness ratings. Based on the above, the proposed robotic emotion model is illustrated in Fig. 2-5. Finally, by adopting the psychological Big Five factors in the 2-D emotional model, the proposed method generates facial expressions in a more natural manner. The FKCN architecture, together with rule tables derived from psychological findings, provides sufficient behavior-fusion capability for a robot to generate emotional interactions.


Chapter 3

Human Emotion Recognition

The capability of recognizing human emotion is an important factor in human-robot interaction. For human beings, facial expressions and voice reveal a person's emotions the most, and they also provide important communicative cues during social interaction. A robotic emotion recognition system will therefore enhance human-robot interaction in a natural manner. In this chapter, several emotion recognition methods are proposed. In Section 3.1, a bimodal information fusion algorithm is proposed to recognize human emotion using both facial images and speech signals. In Section 3.2, a speech-signal-based emotion recognition method is presented.

3.1 Bimodal Information Fusion Algorithm

An embedded speech and image processing system has been designed and realized for real-time audio-video data acquisition and processing. Figure 3-1 illustrates the experimental setup of the emotion recognition system. The stand-alone vision system uses a CMOS image sensor to acquire facial images. The image data from the CMOS sensor are first stored in a frame buffer and then passed to a DSP board for further processing. The audio signals are acquired through the analogue I/O port of the DSP board. The recognition results are transmitted via an RS-232 serial link to a host computer (PC) to generate the interaction responses of a pet robot.


Fig. 3-1: The experimental setup.

Figure 3-2 shows the block diagram of the robotic audio-visual emotion recognition (RAVER) system. After a face is detected in the image frame, facial feature points are extracted and twelve feature values are computed for facial expression recognition. Meanwhile, the speech signal is acquired from a microphone; through a pre-processing procedure, statistical feature values are calculated for each voice frame [66]. After the feature extraction procedures for both sensors are completed, the two feature modalities are sent to an SVM-based classifier [67] with the proposed bimodal decision scheme. The detailed design of the facial image processing, speech signal processing and bimodal information fusion is described in the following sections.
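As a rough illustration of what the statistical feature values of a voice frame may look like, the sketch below splits a speech signal into frames and computes simple statistics of the short-time log-energy. The frame lengths and the particular statistics are assumptions made here for illustration; the feature set actually used follows [66].

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a speech signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def speech_statistics(x):
    """Illustrative statistics computed over the short-time log-energy."""
    frames = frame_signal(np.asarray(x, dtype=float))
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    return np.array([energy.mean(), energy.std(), energy.max(), energy.min()])

# Example on one second of synthetic 16 kHz audio.
print(speech_statistics(np.random.randn(16000)))
```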


In this section we propose a probabilistic bimodal SVM algorithm. As shown in Fig. 3-2, the features extracted from the visual and audio sensors are sent to a facial expression classifier and an audio emotion classifier, respectively. In the current design, five emotional categories are considered, namely anger, happiness, sadness, surprise and neutral. Cascade SVM classifiers are developed for each modality to determine the current emotional state.
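The following sketch shows one plausible organization of such a bimodal decision: one SVM per modality, with the result of the modality whose decision value is larger relative to the spread observed on its training data taken as the final output. This is a simplified stand-in for the proposed probabilistic scheme described later in this chapter; the scikit-learn classes and the normalization rule are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

class BimodalSVM:
    """Simplified bimodal decision scheme (illustrative, not the exact RAVER rule)."""

    def __init__(self):
        self.face_svm = SVC(kernel="rbf", decision_function_shape="ovr")
        self.audio_svm = SVC(kernel="rbf", decision_function_shape="ovr")
        self.face_scale = 1.0
        self.audio_scale = 1.0

    def fit(self, face_X, audio_X, y):
        self.face_svm.fit(face_X, y)
        self.audio_svm.fit(audio_X, y)
        # Spread of the training decision values, used to normalize confidences.
        self.face_scale = np.std(self.face_svm.decision_function(face_X)) + 1e-12
        self.audio_scale = np.std(self.audio_svm.decision_function(audio_X)) + 1e-12

    def predict(self, face_x, audio_x):
        face_x = np.asarray(face_x).reshape(1, -1)
        audio_x = np.asarray(audio_x).reshape(1, -1)
        # Normalized distance of the test sample to the classification hyperplane.
        f_conf = np.max(self.face_svm.decision_function(face_x)) / self.face_scale
        a_conf = np.max(self.audio_svm.decision_function(audio_x)) / self.audio_scale
        if f_conf >= a_conf:
            return self.face_svm.predict(face_x)[0]
        return self.audio_svm.predict(audio_x)[0]
```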

3.1.1 Facial Image Processing

The facial image processing part consists of a face detection module and a feature extraction module. The functional block diagram of the proposed facial image processing is illustrated in Fig. 3-3. After an image frame is captured from the CMOS image sensor, color segmentation and the attentional cascade procedure [68] are performed to detect human faces. Once a face is detected and segmented, the feature extraction stage locates the eye, eyebrow and lip regions in the face area. The system employs edge detection and adaptive thresholding to find the feature points. From the distances between selected pairs of feature points, the feature values are obtained for later emotion recognition. The processing steps are described in more detail in the following paragraphs.
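To illustrate how feature-point locations are turned into feature values, the sketch below computes a few point-to-point distances. The landmark names and the six distances shown are hypothetical examples and do not correspond one-to-one to the twelve feature values used by the system.

```python
import numpy as np

def facial_features(pts):
    """Illustrative geometric feature values from detected facial feature points.

    `pts` is assumed to map landmark names to (x, y) image coordinates of the
    eyebrows, eyes and mouth found in the feature extraction stage.
    """
    d = lambda a, b: float(np.linalg.norm(np.asarray(pts[a]) - np.asarray(pts[b])))
    return np.array([
        d("left_brow_inner", "left_eye_top"),     # brow raise (left)
        d("right_brow_inner", "right_eye_top"),   # brow raise (right)
        d("left_eye_top", "left_eye_bottom"),     # eye opening (left)
        d("right_eye_top", "right_eye_bottom"),   # eye opening (right)
        d("mouth_left", "mouth_right"),           # mouth width
        d("mouth_top", "mouth_bottom"),           # mouth opening
    ])
```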

A. Face Detection

The first step of the proposed emotion recognition system is to detect the human face in the image frame. As shown in Fig. 3-4(a), skin color is used to segment possible face areas in a test image. A morphological closing procedure is then performed to reduce noise in the image frame, as shown in Fig. 3-4(b). Color region mapping is applied to obtain the face candidates, depicted by the two white squares in Fig. 3-4(c). Finally, the attentional cascade method is used to determine which candidate is indeed a human face. In Fig. 3-4(d), the black square indicates the detected face region.
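A minimal sketch of this detection chain, assuming an OpenCV implementation, is given below. The YCrCb skin thresholds, the minimum candidate size and the use of the stock Haar cascade are common defaults chosen for illustration, not the parameters tuned in this work.

```python
import cv2

# Stock Haar cascade as the attentional-cascade verification step (OpenCV 4.x).
CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(bgr):
    """Skin-color segmentation + morphological closing + cascade verification."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))      # skin-color mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    skin = cv2.morphologyEx(skin, cv2.MORPH_CLOSE, kernel)        # remove small holes
    contours, _ = cv2.findContours(skin, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    for c in contours:                                            # face candidates
        x, y, w, h = cv2.boundingRect(c)
        if w * h < 400:                                           # skip tiny regions
            continue
        faces = CASCADE.detectMultiScale(gray[y:y + h, x:x + w])  # verify candidate
        if len(faces) > 0:
            fx, fy, fw, fh = faces[0]
            return (x + fx, y + fy, fw, fh)                       # face bounding box
    return None
```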
