
Chapter 2 Robotic Emotion Model and Emotional State Generation

2.3 Summary

A method of robotic mood transition for autonomous emotional interaction has been developed. An emotional model is proposed for mood state transition that exploits a robotic personality approach. We apply the concept that emotional behavior is controlled by the current emotional state and mood, while the mood is influenced by personality. Here the psychological Big Five factors are utilized to represent the personality, and the relationship between personality and mood is described by Eqs. (2.4) and (2.5). Furthermore, a two-dimensional scaling result (see Fig. 2-2) is adopted to represent a general adult's facial expressions based on pleasure-displeasure and arousal-sleepiness ratings. Based on the above, the proposed robotic emotion model is illustrated in Fig. 2-5.

Finally, by adopting the psychological Big Five factors in the 2-D emotional model, the proposed method generates facial expressions in a more natural manner. The FKCN architecture, together with rule tables derived from psychological findings, provides sufficient behavior fusion capability for a robot to generate emotional interactions.

Fig. 2-5: Illustration of the proposed robotic emotion model.

Chapter 3

Human Emotion Recognition

The capability of recognizing human emotion is an important factor in human-robot interaction. For human beings, facial expressions and voice reveal a person's emotions most clearly. They also provide important communicative cues during social interaction. A robotic emotion recognition system will therefore enhance the interaction between humans and robots in a natural manner.

In this chapter, several emotion recognition methods are proposed. In Section 3.1, a bimodal information fusion algorithm is proposed to recognize human emotion using both facial images and speech signals. In Section 3.2, a speech-signal-based emotion recognition method is presented.

3.1 Bimodal Information Fusion Algorithm

An embedded speech and image processing system has been designed and realized for real-time audio-video data acquisition and processing. Figure 3-1 illustrates the experimental setup of the emotion recognition system. The stand-alone vision system uses a CMOS image sensor to acquire facial images. The image data from the CMOS sensor are first stored in a frame buffer. Then the image data are passed to a DSP board for further processing. The audio signals are acquired through the analog I/O port of the DSP board. The recognition results are transmitted via an RS-232 serial link to a host computer (PC) to generate the interaction responses of a pet robot.

Fig. 3-1: The experimental setup.

Figure 3-2 shows the block diagram of the robotic audio-visual emotion recognition (RAVER) system. After a face is detected in the image frame, facial feature points are extracted. Twelve feature values are then computed for facial expression recognition.

Meanwhile, the speech signal is acquired from a microphone. Through a pre-processing procedure, statistical feature values are calculated for each voice frame [66]. After the feature extraction procedures of both sensors are completed, the two feature modalities are sent to an SVM-based classifier [67] with the proposed bimodal decision scheme. The detailed design of the facial image processing, speech signal processing and bimodal information fusion is described in the following sections.


Fig. 3-2: Block diagram of the robotic audio-visual emotion recognition system.

In this section, we propose a probabilistic bimodal SVM algorithm. As shown in Fig. 3-2, the features extracted from the visual and audio sensors are sent to a facial expression classifier and an audio emotion classifier, respectively. In the current design, five emotional categories are considered, namely anger, happiness, sadness, surprise and neutral. Cascade SVM classifiers are developed for each modality to determine the current emotional state.

3.1.1 Facial Image Processing

The facial image processing part consists of a face detection module and a feature extraction module. The functional block diagram of the proposed facial image processing is illustrated in Fig. 3-3. After an image frame is captured from the CMOS image sensor, color segmentation and an attentional cascade procedure [68] are performed to detect human faces. Once a face is detected and segmented, the feature extraction stage locates the eye, eyebrow and lip regions in the face area. The system employs edge detection and adaptive thresholding to find the feature points. According to the distances between selected pairs of feature points, several feature values are obtained for later emotion recognition. The processing steps are described in more detail in the following paragraphs.

A. Face Detection

The first step of the proposed emotion recognition system is to detect the human face in the image frame. As shown in Fig. 3-4(a), skin color is utilized to segment possible human face areas in a test image. A morphological closing operation is then performed to reduce the noise in the image frame, as shown in Fig. 3-4(b). Color region mapping is applied to obtain the human face candidates, depicted by two white squares in Fig. 3-4(c). Finally, the attentional cascade method is used to determine which candidate is indeed a human face. In Fig. 3-4(d), the black square indicates the detected human face region.
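The following Python sketch outlines this face detection pipeline. It is a minimal sketch under stated assumptions: OpenCV is used, the skin-color bounds and area threshold are illustrative, and a standard Haar cascade detector stands in for the attentional cascade of [68].

```python
import cv2
import numpy as np

def detect_face(bgr_image):
    """Sketch of the face detection steps: skin-color segmentation,
    morphological closing, candidate region mapping, cascade verification."""
    # 1) Skin-color segmentation in YCrCb space (bounds are illustrative).
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    skin_mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))

    # 2) Morphological closing to reduce noise in the binary mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    skin_mask = cv2.morphologyEx(skin_mask, cv2.MORPH_CLOSE, kernel)

    # 3) Connected-component (color region) mapping to obtain face candidates.
    num_labels, _, stats, _ = cv2.connectedComponentsWithStats(skin_mask)
    candidates = [stats[i] for i in range(1, num_labels)
                  if stats[i, cv2.CC_STAT_AREA] > 1000]

    # 4) Verify each candidate with a cascade detector (a Haar cascade is used
    #    here in place of the attentional cascade of [68]).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    for x, y, w, h, _ in candidates:
        roi = cv2.cvtColor(bgr_image[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        if len(cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)) > 0:
            return (x, y, w, h)  # detected face region in image coordinates
    return None
```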

Fig. 3-3: The functional block diagram of facial image processing.


Fig. 3-4: Face detection procedure. (a) Original image, (b) Color segmentation and closing operation, (c) Candidate face areas, (d) Final result obtained by attentional cascade.

B. Facial Feature Extraction

The feature extraction module finds feature points in a frontal face image. The feature points are represented by a vector of numerical data that describes the positions of the facial features such as the eyes, eyebrows, and lips. To search for the positions of the eyes and eyebrows on the upper part of the face image, the characteristic that the eyeballs are the darkest areas of the upper face is utilized. Further, the system employs integral optical density (IOD) [69] to find the areas of the eyes and eyebrows. IOD works on binary images and gives reliable position information for both eyes.

In order to increase the robustness of feature point extraction, our method combines IOD and edge detection. By applying an AND operation to the two resulting binary images, the outlines of the eyes and eyebrows can be extracted. Figure 3-5 illustrates the definition of all facial feature values and Table 3-1 lists the corresponding detailed descriptions. We define three feature points for each eye and two feature points for each eyebrow: the upper, lower and inner points of each eye, and the central and inner points of each eyebrow. Further, there are four feature points for the lips, as shown in Fig. 3-5. Figure 3-6 shows the image processing results of extracting the eye and eyebrow feature points. In Fig. 3-6(a), the detected facial image is processed using IOD, while edge detection is performed in Fig. 3-6(b).
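As a rough sketch of this step (assuming OpenCV and a grayscale crop of the upper face; the threshold and Canny parameters are illustrative, and adaptive thresholding stands in for the IOD binarization), the dark-region image and the edge image can be combined with an AND operation as follows:

```python
import cv2

def extract_eye_eyebrow_mask(gray_upper_face):
    """Combine a dark-region binary image (IOD-style) with an edge image
    to isolate eye and eyebrow outlines, as described above."""
    # Binary image of dark regions (eyeballs/eyebrows are the darkest areas);
    # adaptive thresholding stands in for the IOD binarization.
    dark = cv2.adaptiveThreshold(gray_upper_face, 255,
                                 cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY_INV, 15, 10)

    # Edge image of the same region.
    edges = cv2.Canny(gray_upper_face, 50, 150)

    # AND the two binary images to keep the edges that lie on dark regions,
    # i.e. the outlines of the eyes and eyebrows.
    mask = cv2.bitwise_and(dark, edges)

    # The feature points (upper/lower/inner eye points, eyebrow points) would
    # then be picked from the extreme coordinates of the components in mask.
    return mask
```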


Fig. 3-5: Definition of the facial feature points and feature values.

Table 3-1: The description of facial feature values.

Features    Description
E1    Distance between the center of the right eyebrow and the right eye
E2    Distance between the right eyebrow and the right eye
E3    Distance between the left eyebrow and the left eye
E4    Distance between the center of the left eyebrow and the left eye
E5    Distance between the upper and lower contours of the right eye
E6    Distance between the upper and lower contours of the left eye
E7    Distance between the right and left eyebrows
E8    Distance between the right side of the lips and the right eye
E9    Distance between the upper lip and the eyes
E10   Distance between the left side of the lips and the left eye
E11   Distance between the upper and lower lips
E12   Distance between the right and left sides of the lips


Fig. 3-6: Test results of feature extraction of eyes and eyebrows. (a) Binary operation using IOD, (b) Edge detection, (c) AND operation. (d) Extracted feature points.

In Fig. 3-6(c), the AND operation of the IOD and edge detection results is performed. The feature extraction result is shown in Fig. 3-6(d). Similarly, Fig. 3-7 depicts the result of feature point extraction for the lips.

Fig. 3-7: Feature extraction of lips.

The candidate lip area in Fig. 3-7(a) is processed using IOD. The binary detection result is shown in Fig. 3-7(b). Finally, the feature extraction result is obtained as shown in Fig. 3-7(c).

After obtaining the positions of the facial feature points, we calculate twelve significant feature values, which are the distances between selected pairs of feature points, as shown in Table 3-1. In order to reduce the influence of the distance between the user and the CMOS image sensor, these feature values are normalized before emotion recognition.
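A minimal sketch of this computation is given below, assuming the feature points are already available as (x, y) coordinates; the point names and the choice of the inner-eye distance as the normalization reference are illustrative assumptions, since the text does not state which reference length is used.

```python
import numpy as np

def facial_feature_vector(points):
    """Compute distance-based feature values (cf. Table 3-1) from a dict of
    facial feature points and normalize them by a reference length.

    `points` maps names such as 'right_eye_inner', 'right_eye_upper',
    'lip_left', ... to (x, y) pixel coordinates (names are illustrative).
    """
    def dist(a, b):
        return float(np.linalg.norm(np.asarray(points[a]) - np.asarray(points[b])))

    # A few of the twelve distances from Table 3-1 (the rest follow the same pattern).
    features = np.array([
        dist('right_eyebrow_center', 'right_eye_upper'),  # E1
        dist('right_eye_upper', 'right_eye_lower'),       # E5
        dist('lip_upper', 'lip_lower'),                    # E11
        dist('lip_right', 'lip_left'),                     # E12
    ])

    # Normalize by a reference length so the values are insensitive to how far
    # the user stands from the CMOS image sensor (reference is an assumption).
    reference = dist('right_eye_inner', 'left_eye_inner')
    return features / reference
```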

3.1.2 Speech Signal Processing

The functional block diagram of the proposed speech signal processing method is shown in Fig. 3-8. The procedure of speech signal processing is divided into two parts. The first part is the pre-processing of speech signal, including endpoint detection and frame setting. The second part is responsible for extracting speech features. The processing steps will be described in more detail in the following paragraphs.

A. Frame Detection

The endpoint detection determines the location of real speech signals by short-time energy detection and zero-crossing rate detection. We use the first 128 samples to determine the threshold value for energy detection and then divide the signal into 32 ms frames for further feature extraction processing.

Fig. 3-8: The functional block diagram of speech signal processing.

The basic idea of estimating emotion from the speech signal is to select features that convey emotional information.

B. Speech Feature Extraction

In this work, the contours of pitch and energy are analyzed [29] for human emotion recognition. The pitch contour is obtained by autocorrelation: the lag of the maximum autocorrelation peak in each frame is used to calculate the pitch value. The energy contour is obtained by calculating the short-time energy of each frame. The speech feature values are then obtained by computing statistical quantities of the pitch and energy contours. Altogether, twelve speech feature values are obtained for emotion recognition. The elements of the speech features are listed in Table 3-2.
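A compact sketch of this feature extraction is shown below, assuming the signal has already been divided into frames; the autocorrelation search range (about 50-400 Hz) is an illustrative assumption, while the twelve statistics follow Table 3-2.

```python
import numpy as np

def pitch_of_frame(frame, fs=8000, fmin=50.0, fmax=400.0):
    """Estimate the pitch of one frame from the autocorrelation maximum."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])   # strongest periodicity
    return fs / lag

def speech_features(frames, fs=8000):
    """Statistical pitch/energy features (cf. Table 3-2) from a list of frames."""
    pitch = np.array([pitch_of_frame(f, fs) for f in frames])
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    p_d, e_d = np.diff(pitch), np.diff(energy)       # frame-to-frame derivatives
    return np.array([
        pitch.mean(), pitch.std(), pitch.max(), pitch.min(),
        p_d.mean(), p_d.std(), p_d.max(),
        energy.mean(), energy.std(), energy.max(),
        e_d.mean(), e_d.std(),
    ])
```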

3.1.3 Bimodal Information Fusion Algorithm

In order to determine the final result by taking into account both the audio and visual classification results, we developed a bimodal information fusion algorithm that provides a fusion weight for each classifier. According to the principle of the SVM, the larger the distance between a test sample and the hyperplane, the greater the recognition reliability.

Table 3-2: The description of speech feature values.

Features    Description
Pave    Average pitch
Pstd    Standard deviation of pitch
Pmax    Maximum pitch
Pmin    Minimum pitch
PDave   Average of the pitch derivative
PDstd   Standard deviation of the pitch derivative
PDmax   Maximum of the pitch derivative
Eave    Average energy
Estd    Standard deviation of energy
Emax    Maximum energy
EDave   Average of the energy derivative
EDstd   Standard deviation of the energy derivative

Figure 3-9 shows a trained SVM hyperplane and the distances of two test samples to the hyperplane. It can be seen from the figure that the test samples x1 and x2 belong to the same class. However, the distance d1 is smaller than d2. Thus, the recognition reliability of test sample x2 is greater than that of x1, because the position of x2 can tolerate a larger shift of the hyperplane.

Furthermore, if the training samples are distributed widely, the trained hyperplane will yield a smaller recognition reliability. This may result in a false recognition even if the distance between a test sample and the hyperplane is large. Figure 3-10 shows two cases of training sample distributions.


Fig. 3-9: Representing recognition reliability using the distance between test sample and hyperplane.


Fig. 3-10: Representing recognition reliability using the standard deviation of training samples. (a) Smaller standard deviation, (b) Larger standard deviation.

In Figs. 3-10(a) and (b), the mean values of both distributions are the same, but the standard deviation of hyperplane 1 (σ1) is smaller than that of hyperplane 2 (σ2). The recognition reliability of hyperplane 1 is thus greater than that of hyperplane 2, because the training samples are more concentrated in the former case. We can conclude that the recognition result is more reliable if the distance between the test sample and the hyperplane is larger and the standard deviation of the training data set is smaller.

Based on the above observation, we propose the following algorithm of bimodal information fusion:

1) Assume the number of training samples is N for both the visual and the audio SVM classifier. Compute the average distances DFave and DAave between the training samples and the hyperplanes of the facial and speech classifiers, respectively:

$$D^{F}_{ave} = \frac{1}{N}\sum_{i=1}^{N} d\left(x^{F}_{i}, H^{F}\right), \qquad D^{A}_{ave} = \frac{1}{N}\sum_{i=1}^{N} d\left(x^{A}_{i}, H^{A}\right),$$

where xFi and xAi represent the ith training samples of the facial and speech training data respectively, HF and HA represent the SVM hyperplanes of the facial and speech data respectively, and d(·,·) is the distance between a sample and the corresponding hyperplane.

2) Compute the standard deviations σF and σA of these training-sample distances for the facial and speech data, respectively.

3) Compute the distances d(xF, HF) and d(xA, HA) between the test samples and the corresponding hyperplanes, where xF and xA represent the facial and speech test samples respectively.

4) Calculate and normalize the weights ZF and ZA of the facial classification and the speech classification respectively, such that each weight increases with the distance of the test sample to its hyperplane and decreases with the standard deviation of the corresponding training distances.

5) If the classified results of the two modalities are not the same, the decision machine compares the magnitudes of the facial and speech classification weights to obtain the classified result: if ZF ≥ ZA, adopt the recognition result of the facial features; if ZF < ZA, adopt the recognition result of the speech features.
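As a rough Python sketch of this fusion scheme (assuming binary scikit-learn classifiers with a decision_function, and using the test-sample distance divided by the standard deviation of the training margins as the weight, since the exact weight formula is not reproduced here), the decision step could look as follows; facial_svm, audio_svm and the array names are hypothetical.

```python
import numpy as np

def fusion_weights(facial_svm, audio_svm, X_train_f, X_train_a, x_test_f, x_test_a):
    """Reliability weights ZF and ZA for the facial and audio SVM decisions.

    Each weight grows with the test sample's distance to its hyperplane and
    shrinks with the spread of the training-sample distances (an assumption
    consistent with the discussion above, not necessarily the exact formula).
    """
    # Distances of the training samples to each hyperplane.
    train_d_f = np.abs(facial_svm.decision_function(X_train_f))
    train_d_a = np.abs(audio_svm.decision_function(X_train_a))

    # Distances of the single test sample to each hyperplane.
    d_f = abs(facial_svm.decision_function([x_test_f])[0])
    d_a = abs(audio_svm.decision_function([x_test_a])[0])

    # Larger margin and smaller training spread -> larger weight.
    z_f = d_f / (train_d_f.std() + 1e-9)
    z_a = d_a / (train_d_a.std() + 1e-9)

    total = z_f + z_a
    return z_f / total, z_a / total   # normalized weights ZF, ZA

# Usage: when the two modalities disagree, keep the label with the larger weight.
# zf, za = fusion_weights(facial_svm, audio_svm, Xf_train, Xa_train, xf, xa)
# final_label = facial_label if zf >= za else audio_label
```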

3.1.4 Hierarchical SVM Classifiers

In this work, five emotional categories are classified according to both the facial and the speech information. A single SVM hyperplane distinguishes only two categories; therefore, two four-stage classifiers are constructed, as shown in Fig. 3-11. Each stage selects one of two emotion categories, and the selected category proceeds to the next stage until a final expression is determined. For instance, when an unknown sample appears, the SVM first classifies happiness vs. sadness and surprise vs. neutral. The corresponding results are then further classified at the next stage. For example, suppose the results of the first-stage classifiers are happiness and surprise (shown as ① and ② in Fig. 3-11). At the second stage, the classifier determines whether the unknown data is surprise or anger. If the facial image recognition result is surprise but the speech recognition result is anger (shown as ③ and ④), a fused result is obtained by comparing the weights of the two modalities. Suppose here that the weight ZF of the facial image data is larger than the weight ZA of the speech data; the result of anger (from the speech features) vs. surprise (from the facial features) is then classified as surprise.

Fig. 3-11: SVM bimodal recognition procedure.

At the last stage, the classifier determines whether the unknown data is happiness or surprise, as shown in Fig. 3-11, and the system eventually arrives at a final recognition result.
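A minimal sketch of one such four-stage cascade (for a single modality) is given below. The bracket follows the worked example above, and pair_svm is a hypothetical helper that evaluates the binary SVM trained on two categories and returns the winning label.

```python
def cascade_classify(x, pair_svm):
    """Four-stage SVM cascade over the five emotion categories.

    `pair_svm(a, b, x)` is assumed to evaluate the binary SVM trained on
    categories a vs. b for the sample x and return the winning label.
    """
    w1 = pair_svm('happiness', 'sadness', x)   # first stage, classifier 1
    w2 = pair_svm('surprise', 'neutral', x)    # first stage, classifier 2
    w3 = pair_svm(w2, 'anger', x)              # second stage
    return pair_svm(w1, w3, x)                 # last stage: final result
```

Running this cascade once on the facial features and once on the speech features yields the two labels that the bimodal fusion weights of Section 3.1.3 arbitrate between.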

3.2 Speech-signal-based Emotion Recognition

An embedded speech processing system was designed and implemented for real-time speech signal acquisition and processing. Figure 3-12 shows the block diagram of the proposed speech-signal-based emotion recognition system. Speech signals are acquired from a microphone. Through a speech signal pre-processing procedure, the speech voice frames are determined by end-point detection [70]. In the speech feature extraction stage, the fundamental frequency and energy features of each speech frame are extracted to represent the speech signal of interest. After the features of the speech frames are obtained, Fisher's linear discriminant analysis (FLDA) is utilized to transform the feature values into a more suitable space [71].

The feature values in the transformed space represent significant emotional traits and improve the recognition rate. Finally, a hierarchical support vector machine (SVM) classifies the emotional categories. In order to simplify the design of the emotion recognition system for an entertainment robot, it is assumed that each sentence corresponds to only one emotional category.

Fig. 3-12: Block diagram of the proposed speech-signal-based emotion recognition system.

The detailed design of the emotion recognition system is presented in the following subsections.
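A minimal sketch of this recognition pipeline is given below, assuming scikit-learn: LinearDiscriminantAnalysis plays the role of the FLDA projection and a linear SVC stands in for the hierarchical SVM classifier, so this is an illustrative pipeline rather than the exact implementation described here.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_speech_emotion_model(X, y):
    """X: (n_sentences, 12) pitch/energy features; y: emotion labels.

    FLDA projects the 12-dimensional features onto a space that better
    separates the emotion classes, then an SVM classifies in that space
    (a flat multi-class SVM is used here in place of the hierarchical one).
    """
    model = make_pipeline(
        LinearDiscriminantAnalysis(n_components=min(4, len(set(y)) - 1)),
        SVC(kernel='linear'),
    )
    return model.fit(X, y)

# Usage (hypothetical data shapes):
# model = train_speech_emotion_model(features, labels)
# predicted_emotion = model.predict(new_sentence_features)
```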

3.2.1 Speech Signal Pre-processing

Before extracting the features of the speech signal for recognition, a voice signal pre-processing stage separates speech frames from the acquired signal. In this design, pre-processing consists of analog to digital conversion, end-point detection and frame signal separation.

Speech signals acquired from the microphone are analog voltage signals. Through amplification and sampling, the analog voltage signal is converted to a digital signal in discrete form. Based on the sampling theorem, the sampling frequency is set to more than twice the bandwidth of the input signal in order to avoid signal distortion. In general, the spectrum of human speech lies below 4 kHz, so the sampling frequency is set to 8 kHz in this study.

Furthermore, a normalization scheme is used to reduce the influence of varying input signal amplitudes. The normalized speech signal is obtained such that:

$$x(n) = \frac{x_{ori}(n)}{x_{max}}, \qquad (3.11)$$

where x(n) represents the normalized speech signal, x_ori(n) represents the original speech signal and x_max is the maximum value in the sequence x_ori(n). By dividing by x_max, as shown in Equation (3.11), the amplitudes of the whole speech signal are normalized to between -1 and 1.

In order to extract the emotional features in a voice, a frame size must first be determined for the digitized speech signal. The short-time energy, an acoustic feature that reflects the amplitudes of the samples in each voice frame, is calculated such that:

$$E(k) = \sum_{m=0}^{N-1} x^{2}(n+m), \qquad (3.13)$$

where E(k) is the short-time energy of the kth frame, x(n) represents the normalized speech signal and N is the frame size. Starting and terminal thresholds are then set by empirical rules to determine the starting and terminal points of the voice segment. Once the value of E(k) exceeds the starting threshold, the starting point is determined; the terminal point is determined when the value of E(k) falls below the terminal threshold. The segment between these two points is then taken as the real speech signal. As shown in Figure 3-13, the starting and terminal points of a speech frame are determined by the starting threshold and the terminal threshold, respectively.

The zero-crossing rate (ZCR) is then used for audio frame setting. The zero-crossing rate is a basic acoustic feature: it is equal to the number of times the speech signal crosses the zero amplitude level within a given frame. In general, the zero-crossing rate of non-speech segments and environmental noise is lower than that of human speech [72]. The zero-crossing rate is calculated such that:

$$Z(k) = \frac{1}{2}\sum_{m=1}^{N-1} \left|\operatorname{sgn}\big(x(n+m)\big) - \operatorname{sgn}\big(x(n+m-1)\big)\right|, \qquad (3.14)$$

Fig. 3-13: Energy of a speech signal.

$$\operatorname{sgn}\big[x(n)\big] = \begin{cases} 1, & \text{if } x(n) \ge 0 \\ -1, & \text{if } x(n) < 0 \end{cases} \qquad (3.15)$$

where Z(k) is the zero-crossing rate of the kth frame. In practice, the short-time energy is used to estimate the starting and terminal points of the whole speech segment, wherein the speech voice occurs. Then, the zero-crossing rate is used to find the real speech signal more precisely. As shown in Figure 3-14, the real speech signal is determined by the ZCR threshold.

In this design, the zero-crossing rate and the short-time energy are both used to detect the starting and terminal points of the real human speech. Figure 3-15 illustrates the four rules used to find the real human speech signal:

(1) If E(k) is lower than the terminal threshold, the frame belongs to non-speech.

(2) If E(k) is higher than the starting threshold, the starting point of the human speech signal is determined.

(3) If E(k) is lower than the starting threshold but Z(k) is higher than the ZCR threshold, this point is determined as the starting point of the human speech signal.

Fig. 3-14: Zero-crossing rate of a speech signal.


Fig. 3-15: Example of real human speech detection.

(4) If E(k) falls below the terminal threshold after the starting point, this point is determined as the terminal point of the human speech signal.

Using the above rules, the starting and terminal points of speech signals are determined. The boundary of real human speech is also determined. Figure 3-16 shows an example of end-point detection.

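Under the assumptions already stated (8 kHz sampling, 32 ms frames), the end-point detection described above can be sketched in Python as follows; the threshold factors and the use of the first few frames to estimate them are illustrative stand-ins for the empirical rules mentioned in the text.

```python
import numpy as np

def detect_speech(x_ori, fs=8000, frame_ms=32):
    """End-point detection with short-time energy and zero-crossing rate,
    following the four rules above (threshold factors are illustrative)."""
    x = x_ori / np.max(np.abs(x_ori))            # Eq. (3.11): normalize to [-1, 1]
    N = int(fs * frame_ms / 1000)                # frame size: 32 ms -> 256 samples
    frames = [x[i:i + N] for i in range(0, len(x) - N + 1, N)]

    energy = np.array([np.sum(f ** 2) for f in frames])                  # Eq. (3.13)
    sign = [np.where(f >= 0, 1, -1) for f in frames]                     # Eq. (3.15)
    zcr = np.array([0.5 * np.sum(np.abs(np.diff(s))) for s in sign])     # Eq. (3.14)

    # Empirical thresholds estimated from the first few (assumed silent) frames.
    start_thr = 2.0 * np.mean(energy[:4])
    end_thr = 1.2 * np.mean(energy[:4])
    zcr_thr = 1.5 * np.mean(zcr[:4])

    start, end = None, None
    for k, (e, z) in enumerate(zip(energy, zcr)):
        if start is None:
            # Rules (2) and (3): high energy, or a high zero-crossing rate,
            # marks the starting point of real human speech.
            if e > start_thr or z > zcr_thr:
                start = k
        elif e < end_thr:
            end = k                              # Rule (4): terminal point
            break
    return start, end                            # frame indices of the real speech
```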