
III. Experiments and Results

3.3 Off-Line Experiments

To test the performance of the proposed system, off-line experiments were performed in which each database was divided into two parts: one part for training the neural network and the other used only for validating the architecture. The obtained data are plotted in the arousal and valence (AV) emotional plane for analysis. The results are the arousal and valence values, each in the range of -1 to 1, detected by the AV mapping module for every input utterance. Although the expected output of the AV mapping module is a pair of arousal and valence values, these values are also converted to the corresponding emotion categories so that the results can be compared with other works that use emotion categories as output. This conversion uses the emotional plane in Figure 18 and the Euclidean distance equation in (17). For each arousal and valence pair obtained from the validation database, the Euclidean distance to each of the target emotion category locations used for training (Figure 18) is calculated, and the pair is assigned to the emotion category whose location is closest. This process is shown graphically in Figure 27:

Figure 27: Method for classifying an arousal and valence pair as an emotion category.

where an obtained arousal and valence pair (red triangle) is classified as “anger” since the Euclidean distance d1 between it and the anger location in the emotional plane (black dot) is shorter than the distances d2, d3, d4, d5 and d6 to the locations of the other emotion categories (black squares).
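The nearest-category rule described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the coordinates in `TARGETS` are hypothetical placeholders for the target locations of Figure 18, and the function name `classify_av` is an assumption.

```python
import math

# Hypothetical target locations as (arousal, valence) pairs in [-1, 1].
# The actual values are those of Figure 18 and are not reproduced here.
TARGETS = {
    "anger": (0.8, -0.6),
    "boredom": (-0.6, -0.4),
    "happiness": (0.5, 0.8),
    "neutral": (0.0, 0.0),
    "sadness": (-0.4, -0.7),
    "surprise": (0.9, 0.4),
}


def classify_av(arousal, valence, targets=TARGETS):
    """Return the emotion category whose target location is closest
    (Euclidean distance) to the given arousal and valence pair."""
    return min(
        targets,
        key=lambda name: math.dist((arousal, valence), targets[name]),
    )


# A pair lying near the (hypothetical) anger location.
print(classify_av(0.7, -0.5))
```

With these placeholder coordinates the example pair is assigned to "anger", mirroring the situation drawn in Figure 27.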

Once all of the obtained arousal and valence values have been classified into emotion categories, a confusion matrix is calculated in which the expected emotion categories of all the validation utterances are compared with the obtained emotion categories. This information is tabulated to give a percentage that indicates how well the input utterances have been classified as emotions. The conversion is done to assess the reliability of the system. It should be noted, however, that the desired characteristic is the ability to recommend the song that best matches the input emotional speech. This is better done in the emotional plane, where each input utterance has a continuous location; since no two input utterances are ever the same, the difference between them can be inferred from the obtained arousal and valence values.
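The tabulation step can be sketched as below. It is only an illustration: the function names are hypothetical, and it assumes that each row of the matrix corresponds to an expected category and is normalised to percentages, and that the overall rate is the unweighted mean of the diagonal.

```python
from collections import Counter


def confusion_matrix_percent(expected, obtained, categories):
    """Row-normalised confusion matrix.

    Rows are expected categories, columns are obtained categories,
    and every entry is a percentage of the row total.
    """
    counts = Counter(zip(expected, obtained))
    matrix = {}
    for row in categories:
        total = sum(counts[(row, col)] for col in categories)
        matrix[row] = {
            col: (100.0 * counts[(row, col)] / total if total else 0.0)
            for col in categories
        }
    return matrix


def average_recognition_rate(matrix):
    """Unweighted mean of the per-category recognition rates (diagonal)."""
    return sum(matrix[c][c] for c in matrix) / len(matrix)


# Toy data: two anger and two neutral utterances.
expected = ["anger", "anger", "neutral", "neutral"]
obtained = ["anger", "neutral", "neutral", "neutral"]
matrix = confusion_matrix_percent(expected, obtained, ["anger", "neutral"])
print(matrix["anger"])                   # {'anger': 50.0, 'neutral': 50.0}
print(average_recognition_rate(matrix))  # 75.0
```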

For every database used, a different FFBP neural network structure (see Figure 21) was obtained after the training phase. The configuration of each structure is shown in Table 2:

Table 2: Number of nodes per layer for the neural network architectures, according to the speech database used.

The first test was performed using the proprietary database, which has a total of 456 utterances. Since every speaker spoke different emotional phrases for each of the six emotion categories, about 75% of the phrases were taken for training and the other 25% were used for validation. In this way the training dataset has 348 utterances and the validation dataset 72 utterances. The result is plotted in the two-dimensional emotional plane in Figure 28 and the corresponding confusion matrix is given in Table 3. As seen in Figure 28, the arousal and valence pairs with boredom (b), surprise (f) and neutral (c) emotional content fall around the target emotion location. The anger distribution can be confused with neutral, and some of its values have a lower arousal level in the emotional plane. Values with sad, neutral and bored emotional content seem difficult to place in the correct area, since these emotion categories have very similar arousal and valence characteristics. The confusion matrix in Table 3 shows that values with neutral emotional content are classified correctly in most cases, and the same holds for surprise. In contrast, values with happiness and sadness emotional content have lower classification rates. Many sad utterances are classified as the closest emotion category, such as boredom or neutral, and values with happiness emotional content lie very close to those with surprise emotional content. The overall recognition result is quite acceptable, since it reaches 68% and only values with sadness emotional content have a lower classification rate.

The second test was performed using the ISCI Lab database, which has also been used in past research on emotion recognition. The dataset has a total of 750 utterances. Every speaker uttered 3 different phrases for every emotion category, and every phrase was uttered 10 times. The training set is composed of the first nine utterances of every phrase, for a total of 675 utterances, and the tenth utterance of every phrase is taken for the validation set, for a total of 75 utterances. The results are plotted in the emotional plane in Figure 29 and the confusion matrix is given in Table 4. The distribution of values in Figure 29 shows that, overall, the arousal and valence values detected for the input utterances are located much closer to the target emotions, and the distribution is tightest for anger. Happiness and surprise have the worst classification, since some samples are confused with neutral emotional content because these categories are located closer to one another. The confusion matrix in Table 4 confirms that values with anger emotional content have the best classification rate, whereas values with surprise emotional content are confused with happiness and neutral, hence the lower classification rate. The results using the ISCI Lab database may be better than those using the proprietary database because, in the former, a single phrase is uttered several times and the recording conditions were better monitored.
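The repetition-based split described above can be sketched as follows. The record layout (dictionaries with a `repetition` field) and the function name are hypothetical; the count of five speakers is not stated in the text and is assumed here only because 750 = 5 speakers x 5 emotions x 3 phrases x 10 repetitions.

```python
def split_by_repetition(utterances, holdout_repetition=10):
    """Hold out one repetition of every phrase for validation and
    keep all the other repetitions for training."""
    train = [u for u in utterances if u["repetition"] != holdout_repetition]
    validation = [u for u in utterances if u["repetition"] == holdout_repetition]
    return train, validation


# Placeholder records standing in for the real recordings.
# Assumption: 5 speakers (inferred from the totals, not stated in the text).
utterances = [
    {"speaker": s, "emotion": e, "phrase": p, "repetition": r}
    for s in range(5)        # speakers (assumed count)
    for e in range(5)        # emotion categories
    for p in range(3)        # phrases per emotion
    for r in range(1, 11)    # repetitions 1..10
]

train, validation = split_by_repetition(utterances)
print(len(train), len(validation))  # 675 75
```

Under these assumptions the split reproduces the 675 and 75 utterance counts reported for the training and validation sets.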

The third experiment was done using the Berlin emotional speech database (Emo-DB). This database is quite popular and has been referenced in other works on emotional speech recognition. The emotion is acted and short utterances are used, so it was used as a means to compare the performance of the system architecture. Emo-DB has almost 500 utterances comprising seven emotions (anger, boredom, disgust, anxiety, happiness, sadness and neutral). For fairness in the comparison, disgust and anxiety were taken out, leaving a total of 416 emotional speech utterances. The works in [39] and [40] are used for comparison. They use the leave-one-speaker-out validation method, where part of the uttered phrases of a subject is used as the validation set and the rest as the training set. This strategy is adopted in this work, so a total of 389 utterances are used as the training set and 27 as the validation set. The results are plotted in the two-dimensional emotional plane in Figure 30 and the corresponding confusion matrix is given in Table 5. Figure 30 and Table 5 show that the arousal and valence values have an acceptable overall classification for all emotion categories, since the classification rate reaches almost 60% or above except for sadness, which is confused with boredom because their values are located very close to each other. Table 5 also shows that, although there are many utterances and the recording is professionally controlled, the results using the proprietary and ISCI Lab databases are better. This may be because the utterances in the German database are not evenly distributed, with a different number of phrases for each emotion category.
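For reference, a generic leave-one-speaker-out split is sketched below. It is the textbook form of the scheme, in which all utterances of one speaker are held out per fold; it is not claimed to reproduce the exact 389/27 partition used here, where only part of a subject's phrases forms the validation set. The record layout and function name are hypothetical.

```python
def leave_one_speaker_out(utterances):
    """Yield (held_out_speaker, train, validation) for every speaker.

    In each fold the validation set contains only the held-out
    speaker and the training set contains everyone else.
    """
    speakers = sorted({u["speaker"] for u in utterances})
    for held_out in speakers:
        train = [u for u in utterances if u["speaker"] != held_out]
        validation = [u for u in utterances if u["speaker"] == held_out]
        yield held_out, train, validation


# Placeholder records: 3 speakers with 4 utterances each.
utterances = [
    {"speaker": f"spk{s}", "utterance": i}
    for s in range(3)
    for i in range(4)
]

for speaker, train, validation in leave_one_speaker_out(utterances):
    print(speaker, len(train), len(validation))
```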

Figure 28: System output using the proprietary database. The green square indicates the target emotional value, the blue asterisks indicate the AV values detected for the input utterances and the red circle indicates the song proposed by the system.

Table 3: Confusion matrix (%) for the emotion categories obtained using the proprietary database.

             Anger   Boredom   Happiness   Neutral   Sadness   Surprise
Anger         66.7       0.0         0.0      25.0       8.3        0.0
Boredom        0.0      75.0         0.0      16.7       8.3        0.0
Happiness      0.0       8.3        58.3      16.7       0.0       16.7
Neutral        0.0       0.0         0.0      75.0      25.0        0.0
Sadness        8.3      25.0         0.0      16.7      50.0        0.0
Surprise       0.0       0.0        16.7       0.0       0.0       83.3

Figure 29: System output using the ISCI Lab database. The green square indicates the target emotional value, the blue asterisks indicate the AV values detected for the input utterances and the red circle indicates the song proposed by the system.

Table 4: Confusion matrix (%) for the emotion categories obtained using the ISCI Lab database.

             Anger   Happiness   Neutral   Sadness   Surprise
Anger         93.3           0       6.7         0          0
Happiness      6.7        60.0      26.7         0        6.7
Neutral          0           0      86.7      13.3          0
Sadness        6.7           0      20.0      80.0          0
Surprise         0        20.0      26.7         0       53.3

Figure 30: System output using the Emo-DB database. The green square indicates the target emotional value, the blue asterisks indicate the AV values detected for the input utterances and the red circle indicates the song proposed by the system.

Table 5: Confusion matrix (%) for the emotion categories obtained using the Emo-DB database.

             Anger   Boredom   Happiness   Neutral   Sadness
Anger         62.2      15.0         8.7       8.7       5.5
Boredom       13.6      70.4         2.5       7.4       6.2
Happiness     12.9      12.9        64.3       7.1       2.9
Neutral       16.7      10.3        11.5      57.7       3.8
Sadness        8.3      36.7         6.7       6.7      41.7
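As a quick check on the overall rates, the diagonal entries of Tables 3, 4 and 5 can be averaged as sketched below. This assumes that the overall recognition rate is the unweighted mean of the per-category rates, which the text does not state explicitly; the variable and function names are illustrative only.

```python
# Diagonal entries (per-category recognition rates, in %) copied
# from Tables 3, 4 and 5.
DIAGONALS = {
    "proprietary (Table 3)": [66.7, 75.0, 58.3, 75.0, 50.0, 83.3],
    "ISCI Lab (Table 4)": [93.3, 60.0, 86.7, 80.0, 53.3],
    "Emo-DB (Table 5)": [62.2, 70.4, 64.3, 57.7, 41.7],
}


def mean_rate(diagonal):
    """Unweighted mean of the per-category recognition rates."""
    return sum(diagonal) / len(diagonal)


averages = {name: mean_rate(d) for name, d in DIAGONALS.items()}
for name, value in averages.items():
    print(f"{name}: {value:.2f}%")
```

Under this assumption the averages come out at roughly 68%, 75% and 59%, which agrees with the 68% quoted for the proprietary database and with the ordering of the three databases discussed in the text.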

The results from the emotional databases plotted in the emotional plane show that an input utterance is adequately placed in the emotional plane, since for a given emotion category the obtained arousal and valence pairs (blue asterisks) are located in the area where the target emotion is supposed to be. This desired characteristic is useful for later selecting songs based on the detected arousal and valence values. The result of this music selection process is also seen in the emotional planes, where the songs proposed by the system are plotted (red circles); only the song with the closest emotional content is selected. The confusion matrices also show that the choice of database is very important, since the results differ considerably from one database to another. This must be related to the subjects who participated in the recordings, the kind of phrases uttered and the spoken language. Comparing the proprietary and ISCI Lab databases, both in Mandarin Chinese, the system performs better with the ISCI Lab one. This is thought to be due to the kind of phrases used: in the ISCI Lab database one phrase is uttered ten times by the same speaker, and the experimental set-up is known to have been better controlled.

The recognition rate obtained using Emo-DB is slightly lower than that obtained with both of the other databases. Comparison of the confusion matrices obtained in the off-line test using Emo-DB with those of [39] and [40], which use different kinds of features, shows that recognition depends highly on the features extracted from the speech. In [39] the overall recognition rate is higher than in [40] and in this work, but comparing the speaker-independent row in [40] with this work shows similar results, although better results are obtained in this work. The features chosen in this work are very commonly used in emotion recognition, and they are less complex to implement than those used in [39] and [40]; this was the desired choice, since an embedded system was the target platform. The confusion matrices of all the off-line tests carried out in this work and those of [39] and [40] are combined in Table 6. The overall recognition rate using Emo-DB is quite comparable to [39] and [40], proving the feasibility of the proposed architecture. It should still be noted that a mapping on the emotional plane is difficult to classify into discrete emotions, since similar emotional content can merge together, as with bored, neutral and sad, or happy and surprised. This does not mean that the output is incorrect, however, since the main purpose is to map different emotional content onto the emotional plane so that the differences can be used to choose the right songs, and this has been accomplished.
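The song selection step mentioned above, in which the song with the closest emotional content is chosen, can be sketched as follows. The song entries and their arousal and valence annotations are hypothetical, and the use of Euclidean distance for this step is an assumption carried over from the category classification.

```python
import math


def select_song(arousal, valence, songs):
    """Return the song whose annotated (arousal, valence) location is
    closest to the values detected for the input utterance."""
    return min(
        songs,
        key=lambda s: math.dist((arousal, valence), (s["arousal"], s["valence"])),
    )


# Hypothetical song annotations in the emotional plane.
SONGS = [
    {"title": "song_a", "arousal": 0.7, "valence": -0.6},
    {"title": "song_b", "arousal": 0.6, "valence": 0.7},
    {"title": "song_c", "arousal": -0.5, "valence": -0.5},
]

print(select_song(0.8, -0.5, SONGS)["title"])  # song_a
```

Because the selection works directly on the continuous arousal and valence values, two utterances assigned to the same discrete category can still lead to different songs.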

Table 6: Comparison of average emotion recognition rates.

Proprietary   ISCI   Emo-DB   Work in [39] using Emo-DB

An online experiment has been carried out in order to test system performance in a real scenario. Figure 31 shows a picture of the set-up used for the online experiments. It is
