Chapter 2 Literature Review
2.2. Support Vector Machine (SVM)
2.2.3. Nonlinear problem
The formulation of the SVM can also be extended to tackle nonlinear problems. The SVM maps the original features into a higher-dimensional space and solves the problem linearly in the new space (see Figure 2-8). If φ is our mapping function, the
However, the inner product in equation (2-11) would increase the computational load. A key property of SVM recognizers is the use of the so-called kernel function $K(x^{(i)}, x^{(j)}) = \varphi(x^{(i)})^T \cdot \varphi(x^{(j)})$ to replace the inner product. There is no need to find the mapping function φ explicitly, and any function that satisfies Mercer's theorem can be used as a kernel function. Table 2-1 lists four frequently used basic kernel functions.
Figure 2-8 Mapping a nonlinear problem to a higher-dimensional space.
Table 2-1 Basic kernel functions
linear: $K(x_i, x_j) = x_i^T x_j$
polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$
Gaussian (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$
sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$
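The four kernels of Table 2-1 can be written down directly. The following is only a minimal sketch in plain Python: the function names, the default parameter values, and the degree parameter `d` of the polynomial kernel are assumptions made for illustration and are not taken from the experiments in this thesis.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=3):
    # gamma > 0; d is the polynomial degree (illustrative default).
    return (gamma * dot(x, y) + r) ** d


def rbf_kernel(x, y, gamma=1.0):
    # gamma > 0; uses the squared Euclidean distance between x and y.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)
```

Each function returns the kernel value without ever forming the mapped vectors φ(x) explicitly.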
Chapter 3
Database and Feature Extraction
3.1. Berlin Emotional Speech Database (EMO-DB)
The popular Berlin Emotional Speech Database [14] is used in the pilot simulations of this study. Clean speech samples are uttered by five female and five male actors. Each actor speaks ten sentences in German, and each sentence has a duration of 2 to 5 seconds. The sentence contents are listed in Table 3-1. The database contains the emotions of anger (126), happiness (70), sadness (62), fear (66), disgust (44), boredom (80), and neutral (78). Only those utterances scoring higher than 80% emotion recognition rate in a subjective listening test are included in the database. Hence, there are 526 sentences in total with seven classes of emotions. The original speech samples are recorded at a 16 kHz sampling frequency under studio conditions and are downsampled to 8 kHz so that the 5.3-octave cochlear filterbank of our auditory model (see section 2.1) covers the fundamental frequencies of male speakers.
White noise and babble noise are obtained from the NOISEX-92 database [17] and added to clean speech to simulate various SNR conditions. A simple energy-based VAD is first applied to each clean utterance to determine its active regions. Only durations of active regions are considered in calculating SNR.
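The noise-mixing step described above can be sketched as follows. This is a hedged illustration, not the procedure actually used: the frame length (200 samples, i.e. 25 ms at 8 kHz), the relative energy threshold of the VAD, and all function names are assumptions. The speech power is measured only over frames marked active, as stated in the text, and the input is assumed to be non-silent with the noise at least as long as the speech.

```python
import math
import random


def frame_energies(signal, frame_len):
    """Energy of consecutive non-overlapping frames."""
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def active_power(signal, frame_len=200, rel_threshold=0.1):
    """Mean power over frames whose energy exceeds a fraction of the maximum.

    The threshold rule is an assumed, very simple energy-based VAD.
    """
    energies = frame_energies(signal, frame_len)
    threshold = rel_threshold * max(energies)
    active = [e for e in energies if e > threshold]
    return sum(active) / (len(active) * frame_len)


def add_noise(clean, noise, snr_db, frame_len=200, rel_threshold=0.1):
    """Scale the noise so that active-speech power / noise power = snr_db."""
    noise = noise[:len(clean)]
    p_speech = active_power(clean, frame_len, rel_threshold)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]
```

Because silent regions are excluded from the speech power, an utterance with long pauses receives the same effective SNR in its active regions as one without pauses.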
Table 3-1 The German content of EMO-DB and its English translation
code | German text | English translation
a01 | Der Lappen liegt auf dem Eisschrank. | The tablecloth is lying on the fridge.
a02 | Das will sie am Mittwoch abgeben. | She will hand it in on Wednesday.
a04 | Heute abend könnte ich es ihm sagen. | Tonight I could tell him.
a05 | Das schwarze Stück Papier befindet sich da oben neben dem Holzstück. | The black sheet of paper is located up there besides the piece of timber.
a07 | In sieben Stunden wird es soweit sein. | In seven hours it will be.
b01 | Was sind denn das für Tüten, die da unter dem Tisch stehen? | What about the bags standing there under the table?
b02 | Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter. | They just carried it upstairs and now they are going down again.
b03 | An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht. | Currently at the weekends, I always went home and saw Agnes.
b09 | Ich will das eben wegbringen und dann mit Karl was trinken gehen. | I will just discard this and then go for a drink with Karl.
b10 | Die wird auf dem Platz sein, wo wir sie immer hinlegen. | It will be in the place where we always store it.
3.2. FAU AIBO database
The FAU AIBO corpus [15] contains recordings of children interacting with SONY's pet robot AIBO. The most important characteristic of these recordings is that they are natural, with non-acted emotions. The children were invited to play with the AIBO and asked to guide it through certain missions, such as moving from point A to point B along a particular route. The children believed that the AIBO was responding to their commands directly, whereas it was actually controlled by a human operator to behave either obediently or disobediently, thereby provoking emotional reactions. The data were collected at two different German schools, Mont and Ohm, from 51 children (aged 10 to 13; 21 boys and 30 girls). Speaker independence is assured by using the data from one school for training and the data from the other school for testing. The original recordings are sampled at 16 kHz. For the same reason as stated in section 3.1, the speech samples are downsampled to 8 kHz. The original recordings were segmented automatically into "turns" at pauses longer than 1 s.
Five labelers (advanced students of linguistics) annotated each turn at the word level as neutral (default) or as one of ten other emotion classes. Majority voting (MV) was then applied, that is, only those words on which at least three labelers agreed were included in the corpus. The classes and the number of speech samples in each class were: joyful (101), surprised (0), emphatic (2,528), helpless (3), touchy (225), angry (84), motherese (1,260), bored (11), reprimanding (310), rest (3), neutral (39,169).
We follow the criteria of the INTERSPEECH 2009 emotion challenge [18], which divide the classification task into a five-class problem and a two-class problem.
For the five-class classification problem, emotions are grouped into Anger (angry, touchy, and reprimanding), Emphatic, Neutral, Positive (motherese and joyful), and Rest. The two-class problem deals with NEGative (subsuming angry, touchy, reprimanding, and emphatic) and IDLe (consisting of all non-negative states) emotions. The numbers of speech samples are listed in Table 3-2 and Table 3-3. As in section 3.1, white and babble noises are added to the original clean speech to test the robustness of various features.
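The grouping described above amounts to a small lookup table. The sketch below is illustrative only: the function and variable names are hypothetical, and assigning every label not named in the text (surprised, helpless, bored, rest) to the Rest class is an assumption. It also ignores how word-level labels are aggregated to the unit that is finally classified.

```python
# Five-class grouping: A (Anger), E (Emphatic), N (Neutral), P (Positive), R (Rest).
FIVE_CLASS = {
    "angry": "A", "touchy": "A", "reprimanding": "A",
    "emphatic": "E",
    "neutral": "N",
    "motherese": "P", "joyful": "P",
}

# Labels subsumed by the NEGative class of the two-class problem.
NEGATIVE_LABELS = {"angry", "touchy", "reprimanding", "emphatic"}


def to_five_class(label):
    # Any label not listed above falls into Rest (assumption).
    return FIVE_CLASS.get(label, "R")


def to_two_class(label):
    # IDLe collects all non-negative states.
    return "NEG" if label in NEGATIVE_LABELS else "IDL"
```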
Table 3-2 Number of instances for the 5-class problem
# A E N P R sum
train 881 2093 5590 674 721 9959
test 611 1508 5377 215 546 8257
sum 1492 3601 10967 889 1267 18216
Table 3-3 Number of instances for the 2-class problem
# NEG IDL sum
train 3358 6601 9959
test 2465 5792 8257
sum 5823 12393 18216
3.3. Rate-Scale (RS) Features
As mentioned in section 2.1.3, rate-scale plots reveal the joint spectro-temporal modulations of speech. The slow modulations, which are related to the speaking rate (i.e., the rate of change of the vocal tract), appear in low-rate regions. On the other hand, the energy of the resolved pitch is captured in high-rate regions. In this study, we consider rates of $\pm 2^{1,\ldots,9}$ Hz to cover the complete temporal structure (speaking rate and pitch) of speech. As for the scale region, we focus on $2^{-1,\ldots,3}$ cycle/octave to cover the complete frequency structure, from formants (captured by low scales) to harmonics (captured by high scales). Therefore, 90 rate-scale features (9 rates, 5 scales, and both directions) are extracted per frame. The mean and standard deviation of these 90 RS features are then calculated over the entire utterance. Finally, 180 RS features per utterance are retained for emotion recognition.
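The bookkeeping behind the 180 utterance-level RS features can be sketched as follows. The sketch assumes that the 90 per-frame rate-scale responses have already been computed by the auditory model; the function name, the ordering of the output, and the use of the population standard deviation are assumptions.

```python
import statistics

RATES_HZ = [2 ** k for k in range(1, 10)]      # 2, 4, ..., 512 Hz
SCALES_CPO = [2 ** k for k in range(-1, 4)]    # 0.5, 1, 2, 4, 8 cycle/octave
DIRECTIONS = ["upward", "downward"]            # negative and positive rates


def rs_utterance_features(frames):
    """Collapse per-frame RS vectors (90 values each) into 180 statistics.

    Returns the 90 means followed by the 90 standard deviations.
    """
    n_dims = len(RATES_HZ) * len(SCALES_CPO) * len(DIRECTIONS)
    if any(len(frame) != n_dims for frame in frames):
        raise ValueError(f"each frame must hold {n_dims} RS responses")
    columns = list(zip(*frames))               # one column per RS feature
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means + stds
```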
3.4. MFCC Features
The mel-frequency cepstral coefficients (MFCCs) are widely used in the speech analysis field. Here, the first 13 MFCCs (including the zeroth-order coefficient) are extracted from 25 ms Hamming-windowed frames every 10 ms with a pre-emphasis coefficient of 0.97.
The mean, standard deviation, skewness, and kurtosis of these 13 MFCCs, their deltas, and their double-deltas are computed, yielding 13 × 3 × 4 = 156 features per utterance. This feature set is referred to as MFCC156.
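The four utterance-level statistics can be sketched as below, assuming the 39-dimensional frame vectors (13 MFCCs, 13 deltas, 13 double-deltas) are already available. The function names, the use of population moments, the non-excess definition of kurtosis, and the output ordering are all assumptions for illustration.

```python
import math


def moments(values):
    """Return (mean, std, skewness, kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    std = math.sqrt(sum(c ** 2 for c in centered) / n)
    if std == 0.0:
        # Constant contour: higher-order moments are undefined, report zeros.
        return mean, 0.0, 0.0, 0.0
    skew = sum(c ** 3 for c in centered) / n / std ** 3
    kurt = sum(c ** 4 for c in centered) / n / std ** 4
    return mean, std, skew, kurt


def utterance_functionals(frames):
    """Map a list of per-frame vectors to 4 statistics per dimension."""
    features = []
    for column in zip(*frames):
        features.extend(moments(column))
    return features
```

With 39 values per frame, `utterance_functionals` returns 39 × 4 = 156 numbers, matching the size of MFCC156.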
Figure 3-1 Block diagram for extracting MFCCs
3.5. Prosodic Features
The 180 RS features mentioned above contain both pitch and timbre (i.e., formant structure) information, whereas conventional MFCCs only carry timbre information. To make a fair comparison, prosodic features (pitch, energy, and duration) are extracted and combined with the MFCC features.
The fundamental frequency (F0) contour is extracted by STRAIGHT [19]. The algorithm estimates the aperiodic power (AP) of each frame. Frames with high AP are assumed to be unvoiced and are assigned zero F0. Only low-AP frames are treated as voiced frames and return valid F0 estimates. The energy contour is extracted every 10 ms with a 25 ms window. Duration-related features are derived from the voiced/unvoiced decisions obtained in F0 estimation.
The statistics of the prosodic features used in this study are similar to those used by other researchers [3, 4]. However, to avoid forming a huge feature set with 1000 to 4000 parameters, a reasonably small feature set is constructed. As a result, some features are omitted or replaced. For example, the means of the positive and the negative dF0 are calculated separately to represent the upward and the downward trends, respectively, instead of the mean of all dF0. As for the energy, the minimum value must be close to zero, so the min value, the relative position of the min, and the range would not provide crucial information and are dropped from our feature list. Finally, 30 prosodic features are extracted and referred to as the PRO30 feature set. The description of this feature set is given in Table 3-4.
Table 3-4 30 prosodic features
F0 (8 features): mean, std, max value, relative position of max, min value, relative position of min, range, number of local max points
dF0 (8 features): mean of positive, mean of negative, std, max value, relative position of max, min value, relative position of min, ratio of positive
logE (3 features): std, max value, relative position of max
dlogE (8 features): mean of positive, mean of negative, std, max value, relative position of max, min value, relative position of min, ratio of positive
Duration (3 features): speaking rate, std of voiced duration, mean pause time
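As an illustration of the dF0 row of Table 3-4, the eight statistics could be computed along the following lines. The exact definitions are not given in the text, so several choices here are assumptions: dF0 is taken as the first difference of the voiced F0 contour, a relative position is the index of the extreme divided by the contour length, and the ratio of positive is the fraction of positive differences. The function names are hypothetical.

```python
import math


def delta(contour):
    """First-order difference of a contour (assumed definition of dF0)."""
    return [b - a for a, b in zip(contour, contour[1:])]


def df0_features(f0_voiced):
    """Eight dF0 statistics from an F0 contour of voiced frames (in Hz)."""
    d = delta(f0_voiced)
    n = len(d)
    pos = [v for v in d if v > 0]
    neg = [v for v in d if v < 0]
    mean = sum(d) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in d) / n)
    i_max = max(range(n), key=d.__getitem__)
    i_min = min(range(n), key=d.__getitem__)
    return {
        "mean_of_positive": sum(pos) / len(pos) if pos else 0.0,
        "mean_of_negative": sum(neg) / len(neg) if neg else 0.0,
        "std": std,
        "max_value": d[i_max],
        "rel_pos_of_max": i_max / n,
        "min_value": d[i_min],
        "rel_pos_of_min": i_min / n,
        "ratio_of_positive": len(pos) / n,
    }
```

Keeping the positive and negative means separate preserves the upward and downward trends that a single mean of all dF0 values would average away.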
3.6. INTERSPEECH 2009 Emotion Challenge Acoustic Features
For the AIBO database, we compare the acoustic features adopted in the INTERSPEECH 2009 emotion challenge with the proposed RS features under noisy conditions. This default feature set provides the baseline results for both the HMM and the linear-kernel SVM recognizers in the 2009 challenge and is fully reproducible with the open-source openSMILE feature extraction toolkit [20]. It includes the most common features pertaining to prosody, spectral shape, and voice quality, as well as their derivatives. In detail, the 16 low-level descriptors chosen are: zero-crossing rate (ZCR) from the time signal, root mean square (RMS) frame energy, pitch frequency (normalized to 500 Hz), harmonics-to-noise ratio (HNR) by autocorrelation function, and mel-frequency cepstral coefficients (MFCC) 1-12 in full accordance with HTK-based computation. The delta coefficients of each of these 16 features are included as well. Next, as depicted in Table 3-5, 12 functionals are derived for each low-level descriptor and its delta on a chunk basis: mean; standard deviation; kurtosis; skewness; minimum and maximum value, their relative positions, and range; and two linear regression coefficients with their mean square error (MSE). Thus, the final feature vector contains 16 × 2 × 12 = 384 attributes and is referred to as the Inter384 feature set. In this thesis, we conduct experiments in section 4.3 to compare the robustness of the Inter384 features with that of the proposed RS features.
Table 3-5 Features used in INTERSPEECH 2009 emotion challenge
LLD (16×2) | Functionals (12)
(∆) ZCR | mean
(∆) RMS Energy | standard deviation
(∆) F0 | kurtosis, skewness
(∆) HNR | extremes: value, rel. position, range
(∆) MFCC 1-12 | linear regression: offset, slope, MSE
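The count 16 × 2 × 12 = 384 can be checked by enumerating every combination of low-level descriptor, delta flag, and functional. The attribute names produced below are hypothetical labels chosen for readability; they are not the names emitted by openSMILE.

```python
from itertools import product

# 16 low-level descriptors (LLDs).
LLDS = ["ZCR", "RMS energy", "F0", "HNR"] + [f"MFCC {i}" for i in range(1, 13)]

# 12 functionals applied to every LLD contour and to its delta contour.
FUNCTIONALS = [
    "mean", "standard deviation", "kurtosis", "skewness",
    "min value", "max value", "min rel. position", "max rel. position", "range",
    "regression offset", "regression slope", "regression MSE",
]


def inter384_names():
    """List one illustrative name per attribute of the Inter384 set."""
    names = []
    for lld, is_delta, func in product(LLDS, (False, True), FUNCTIONALS):
        prefix = "delta " if is_delta else ""
        names.append(f"{func} of {prefix}{lld}")
    return names
```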
Chapter 4
Simulation Result
4.1. Experimental Setup
As mentioned in section 2.2, many kinds of kernels are available for the SVM to map problems onto higher-dimensional spaces. Although the radial basis function (RBF) kernel is suggested as the first choice, different choices of the parameters C and γ affect the results radically [16]. These parameters need to be fine-tuned by a grid search for each training condition. Therefore, a simpler linear kernel is adopted in this study, whose only purpose is to investigate the robustness of the features. Before building the SVM, all training and testing features are linearly scaled to [0, 1]. To evaluate the robustness of RS features in unknown environments, mismatched tests (clean data for training and noisy data for testing) are performed under various SNR conditions.
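The linear scaling to [0, 1] can be sketched as follows. The text does not say where the scaling range comes from; this sketch assumes that the per-feature minimum and maximum are estimated on the training set and then reused for the test set, so noisy test values may fall outside [0, 1]. The function names are hypothetical.

```python
def fit_minmax(train_rows):
    """Per-feature minimum and maximum over the training feature vectors."""
    columns = list(zip(*train_rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_minmax(rows, mins, maxs):
    """Linearly map each feature so that the training range becomes [0, 1]."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0   # constant feature -> 0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled
```

Under the mismatched condition, `fit_minmax` would be called on the clean training features only, and `apply_minmax` on both the clean training and the noisy testing features.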
To address the problem of insufficient speech samples in the Berlin database, a 10-fold cross-validation procedure is adopted in our tests. Speech samples are randomly divided into 10 subsets. In each trial, one subset is used for testing while the other nine subsets are used for training the SVM recognizer. Final recognition rates are obtained by averaging over the 10 trials. Features extracted from the Berlin database are further processed through intra-speaker normalization. That is, for each speaker, features from all sentences, covering all seven emotion classes, are normalized by their mean and standard deviation. As for the FAU AIBO database evaluation, the 10-fold method is not used because the database contains sufficient data samples. The data of one school, Ohm, are used for training and the data of the other school, Mont, are used for testing. Therefore, speaker independence is assured for the FAU AIBO database evaluation since there is no overlap between training speakers and testing speakers.
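A minimal sketch of the intra-speaker normalization and of the random 10-fold split is given below. The function names, the fixed random seed, and the use of the population standard deviation are assumptions; the actual partition used in the experiments is not specified beyond being random.

```python
import math
import random
from collections import defaultdict


def speaker_normalize(features, speakers):
    """Z-normalize each feature using the statistics of its own speaker."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speakers):
        by_speaker[spk].append(idx)
    normalized = [None] * len(features)
    for indices in by_speaker.values():
        columns = list(zip(*(features[i] for i in indices)))
        means = [sum(col) / len(col) for col in columns]
        stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
                for col, m in zip(columns, means)]
        for i in indices:
            normalized[i] = [(v - m) / s if s > 0 else 0.0
                             for v, m, s in zip(features[i], means, stds)]
    return normalized


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Randomly split sample indices into n_folds nearly equal subsets."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[k::n_folds] for k in range(n_folds)]
```

In each trial, one of the returned subsets serves as the test set and the remaining nine are merged into the training set.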
Recognition results are reported in the form of the total recognition rate (RR), the mean of the class-wise recognition rates (CL), and their harmonic mean F, where
$$F = \frac{2 \cdot RR \cdot CL}{RR + CL} \qquad \text{(4-1)}$$
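Equation (4-1) and the two rates it combines can be written as a short sketch. The function names and the dictionary-based inputs (numbers of correctly recognized and total instances per class) are assumptions made for illustration.

```python
def harmonic_mean(rr, cl):
    """F-measure of equation (4-1): harmonic mean of RR and CL."""
    return 2.0 * rr * cl / (rr + cl)


def recognition_measures(class_correct, class_total):
    """Return (RR, CL, F) in percent from per-class counts."""
    rr = 100.0 * sum(class_correct.values()) / sum(class_total.values())
    recalls = [100.0 * class_correct[c] / class_total[c] for c in class_total]
    cl = sum(recalls) / len(recalls)
    return rr, cl, harmonic_mean(rr, cl)
```

As a check against the clean condition of Table 4-1, `harmonic_mean(74.32, 70.83)` evaluates to about 72.53, which agrees with the F value reported there.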
These three measures are reported because the number of instances is unbalanced among classes. Classes with more instances have a more substantial influence on RR than classes with fewer instances. Thus, the RR measure tends to over-estimate the performance. In contrast, the CL measure increases the influence of minority classes and thus under-estimates the performance. Therefore, the F-measure is commonly used to give a fair performance estimate when class sizes are not balanced [21]. However, since the FAU AIBO database is severely unbalanced, the classifier loses its ability to detect the minority classes. To cope with this problem, the majority classes are under-sampled. We randomly down-sample the other classes to the same number of instances as the smallest class, which is NEG in the 2-class problem and P in the 5-class problem. Final recognition rates are obtained by averaging over 10 trials. In this totally balanced condition, the RR and CL measures produce the same results; hence, we list only one measure for the FAU AIBO database.
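The random under-sampling step can be sketched as follows. The function name and the seed argument are hypothetical; repeating the call with ten different seeds would correspond to the ten trials that are averaged. Which portion of the data is balanced is taken from the text as given and is not decided by this sketch.

```python
import random
from collections import defaultdict


def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced random subset.

    Every class is randomly reduced to the size of the smallest class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)
```

For the 2-class training set of Table 3-3, this keeps all 3358 NEG instances and a random 3358 of the 6601 IDL instances.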
4.2. Results on Berlin Database
Table 4-1 Recognition rates (in %) of RS180 under additive white noise
RS180 H A S F N B D CL RR F
clean 42.86 91.35 100.00 52.86 81.96 81.25 45.50 70.83 74.32 72.53
20dB 42.86 88.14 100.00 52.86 80.71 81.25 48.00 70.55 73.57 72.03
15dB 44.29 87.31 98.33 51.43 79.29 81.25 47.50 69.91 72.98 71.41
10dB 44.29 84.87 98.33 52.86 76.61 80.00 54.50 70.21 72.60 71.38
5dB 50.00 83.21 91.90 53.57 72.86 80.00 61.00 70.36 72.22 71.28
0dB 45.71 64.17 75.71 50.71 78.04 65.00 57.00 62.33 62.91 62.62
Table 4-2 Recognition rates (in %) of MFCC156+PRO30 under additive white noise
Table 4-3 Recognition rates (in %) of RS180 under additive babble noise
RS180 H A S F N B D CL RR F
clean 44.29 91.28 100.00 50.95 81.07 90.00 47.00 72.08 75.48 73.74
20dB 45.71 90.45 100.00 52.62 82.14 87.50 47.00 72.20 75.49 73.81
15dB 44.29 88.85 100.00 55.48 80.89 85.00 46.50 71.57 74.73 73.12
10dB 45.71 84.17 100.00 54.29 79.64 78.75 44.00 69.51 72.25 70.85
5dB 42.86 61.15 100.00 46.43 92.50 61.25 44.50 64.10 64.62 64.36
0dB 35.71 15.77 100.00 24.29 87.50 52.50 62.00 53.97 49.46 51.62
Table 4-4 Recognition rates (in %) of MFCC156+PRO30 under additive babble noise
MFCC156+PRO30 H A S F N B D CL RR F
clean 61.43 83.40 100.00 75.71 92.14 88.75 69.50 81.56 82.54 82.05
20dB 40.00 87.31 81.90 59.29 79.11 87.50 59.00 70.59 73.50 72.01
15dB 45.71 81.79 59.76 46.67 68.57 80.00 66.00 64.07 66.55 65.29
10dB 45.71 76.09 65.71 33.57 49.64 83.75 57.50 58.85 61.38 60.09
5dB 77.14 63.91 50.00 18.33 51.25 73.75 50.00 54.91 56.78 55.83
0dB 68.57 37.50 19.52 13.81 21.43 57.50 40.00 36.90 37.53 37.21
Figure 4-1 F-measure for additive white noise
Figure 4-2 F-measure for additive babble noise
Tables 4-1 to 4-4 show the detailed performance of the RS180 and MFCC156+PRO30 features in additive white and babble noise. The class-wise recognition rates (from H to D) are shown in separate columns. CL is the mean of the class-wise recognition rates and RR is the total recognition rate. The F-measure, which provides a fair comparison, is given in the last column of each table and summarized in Figures 4-1 and 4-2.
Clearly, RS180 outperforms MFCC156+PRO30 in all noisy conditions (SNR from 20 dB down to 0 dB), but not in the clean condition. The MFCC156+PRO30 features of the training samples depict magnitude spectra and pitch values with high precision. Such precise representations produce good matches in the clean condition, but are also prone to degradation by noise. On the other hand, RS features only carry the information of spectro-temporal amplitude modulations, which is equivalent to the spectro-temporal envelopes without the fine-structure (phase) information of the carriers. While not providing accurate matches in the clean condition, RS features are more resistant to deterioration caused by the spectro-temporal envelopes of noise.
Using RS features, Anger and Sadness are the two most recognizable emotions, whereas Fear, Disgust, and Happiness are more difficult to classify. With additive background noise, the recognition of Fear is particularly prone to deterioration for both the MFCC plus prosodic features and the RS features. Moreover, the recognition rates of Neutral are severely degraded under noisy conditions when conventional features are used, whereas they are very well preserved by the RS features.
Figure 4-3 shows an example of a sentence spoken by the same speaker with Anger and Neutral emotions. Panels (a) and (c) are the auditory spectrograms of the utterances with Anger and Neutral emotions, respectively. Panels (b) and (d) are their corresponding rate-scale plots.
As seen in these figures, the pitch-related response (high rate, high scale) of Neutral is more intense than that of Anger. The reason is that the speaker's pitch moves up and down more dramatically in the Anger emotion than in the Neutral emotion. Hence, the mean response at each specific pitch-related rate-scale point is weaker in Anger than in Neutral. On the other hand, the low-rate region encodes the coarse temporal AM structure of the utterance. In Neutral speech, pitch and formant contours usually decline smoothly toward the end of the sentence. This declination trend is revealed as a notable positive-rate (downward) response, as mentioned in section 2.1.3. In contrast, Anger speech has no such declination trend and therefore produces comparable responses at positive and negative rates.
To better demonstrate the robustness of our RS features, Figure 4-4 shows the response curves of a pitch-related region (rate=256 Hz, scale=4 cycle/octave) and an AM-related region (rate=4 Hz, scale=0.25 cycle/octave) along the time axis under clean (panels (a) and (b)) and 5 dB noisy conditions (panels (c) and (d)). Both curves are derived from the Anger sentence used in Figure 4-3. Figure 4-5 shows the spectrogram and RS plot of white noise alone. As observed in Figure 4-5 (b), white noise activates a high-rate/high-scale response, which is quite different from the response of speech. For speech with added white noise, the low-rate/low-scale regions are less affected (see Figure 4-4 (b) and (d) for the clean and 5 dB SNR conditions), while the pitch-related RS regions are more affected. However, comparing Figure 4-4 (a) and (c), the distortion roughly resembles a dynamic range compression, and the original trend along the time axis is not damaged. In contrast, conventional pitch extraction methods may become totally invalid at low SNR. A similar trend can also be observed for a Neutral sentence, as shown in Figure 4-6.
Figures 4-7 and 4-8 show the distributions of the specific pitch-related RS feature (rate=256 Hz, scale=4 cycle/octave) under clean and 5 dB noisy conditions, respectively. The distributions are derived from the same sentences used in Figures 4-4 and 4-6. The response for Neutral is greater than that for Anger, as mentioned earlier. White noise does not damage the distributions dramatically but only shifts them slightly. These figures help explain the superior performance of our RS features over conventional MFCCs plus prosodic features in low SNR conditions.
Figure 4-3 (a), (b) spectrogram and RS plot of a Berlin Anger sentence; (c), (d) spectrogram and RS plot of the same sentence with Neutral emotion. Spectrogram axes: time (ms) versus frequency (Hz).
Figure 4-4 One Berlin Anger sentence: (a) and (b) depict a high rate/high scale (pitch-related) response and a low rate/low scale (AM-related) response plotted along the time axis under clean condition; (c) and (d) depict the responses of the same rate-scale combinations as in (a) and (b) under 5 dB noisy condition
Figure 4-5 White noise: (a) spectrogram, (b) RS plot
Figure 4-6 One Berlin Neutral sentence: (a) and (b) depict a high rate/high scale (pitch-related) response and a low rate/low scale (AM-related) response plotted along the time axis under clean condition; (c) and (d) depict the responses of the same rate-scale combinations as in (a) and (b) under 5 dB noisy condition
Figure 4-7 The distribution of a pitch-related feature (rate=256 Hz, scale=4 cycle/octave) under clean condition. Axes: feature value versus probability; curves: Anger and Neutral.
Figure 4-8 The distribution of a pitch-related feature (rate=256 Hz, scale=4 cycle/octave) under 5 dB noisy condition
Figure 4-9 Recognition rate (in %) of RS180 by SFFS method (SFFS curve for clean training and 10 dB babble noise testing)
A feature selection method, sequential forward floating selection (SFFS) [22], is used to examine the contributions of individual features within RS180. It starts from an empty feature set, sequentially includes a feature into (or excludes a feature from) the selected set, and then evaluates the performance of the newly constructed feature set. As shown in Figure 4-9, the performance peaks at around 100 features and does not vary much between 60 and 140 features. Tests on other SNR conditions show a similar trend. These results imply that our RS features are highly redundant, which is not unexpected given the highly overlapping two-dimensional filters in the cortical module [2]. Therefore, RS180 can be further reduced to RS92 by choosing the rate-scale combinations marked as gray spots in Figure 4-10. Note that only the downward direction (positive rate) is shown in the figure.
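The floating search can be sketched generically as follows. This is a simplified illustration of the SFFS idea, not the implementation used to produce Figure 4-9: the `score` callable stands for any subset evaluation (for example, a recognition rate obtained with the selected features), and the function name, its arguments, and the tie-breaking behaviour are assumptions.

```python
def sffs(n_features, score, max_size):
    """Sequential forward floating selection (simplified sketch).

    score(subset) must return a number, larger meaning better.
    Returns a dict: subset size -> (best score, best subset found).
    """
    selected = []
    best = {}
    while len(selected) < max_size:
        # Forward step: include the feature that helps the most.
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        current = score(selected)
        size = len(selected)
        if size not in best or current > best[size][0]:
            best[size] = (current, list(selected))
        # Floating step: exclude features while doing so beats the best
        # subset of the smaller size found so far.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            reduced_score = score(reduced)
            if reduced_score > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_score, list(reduced))
            else:
                break
    return best
```

Plotting the best score against the subset size would give a curve of the kind described above, from which a plateau such as the one between 60 and 140 features can be read off.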
Figure 4-10 Rate-scale selections (gray areas) of RS92. Axes: rate (2, 4, 8, 16, 32, 64, 128, 256, 512 Hz) versus scale (0.5, 1, 2, 4, 8 cycle/octave).
Two subsets of MFCC156 are selected for comparison with our reduced RS92 features. The first subset (MFCC78) contains the mean and standard deviation of the 13 MFCCs, 13 ∆MFCCs, and 13 ∆∆MFCCs. The second subset (MFCC52) contains the mean, standard deviation, skewness, and kurtosis of the 13 MFCCs. Both subsets are then combined with