Thesis Organization - Using Spectro-Temporal Modulations for Robust Speech Emotion Recognition

Chapter 1 Introduction

1.4. Thesis Organization

This thesis is organized as follows. Chapter 2 gives a brief literature review of the two-module spectro-temporal auditory model and the support vector machine. Chapter 3 introduces the two emotional speech databases and the four feature sets used in this study. Chapter 4 describes the experimental setup and presents the recognition results on the two databases. Chapter 5 closes with conclusions and discussion.

Chapter 2 Literature Review

2.1. Auditory Model

The auditory features adopted in this study are extracted from the stages of a physiologically based auditory model. To motivate the design of the model, the physiology of human hearing is briefly introduced first. The auditory model, which consists of an early cochlear (ear) module and a central cortical (A1) module, is then discussed in sections 2.1.2 and 2.1.3.

2.1.1. Hearing Physiology

The cross-sectional view of the human ear is shown in Figure 2-1. It can be divided into three parts: the outer ear, the middle ear, and the inner ear. Sound waves enter the outer ear and travel through the ear canal to the tympanic membrane (eardrum). The vibrations of the eardrum are transmitted to the inner ear through three ossicles (the malleus, incus, and stapes) in the middle ear. The stapes (stirrup) touches a liquid-filled sac, and the vibrations travel into the cochlea, which is shaped like a shell. The cochlea attaches to hundreds of nerve fibers, which transmit information along the auditory pathway to the brain. Finally, the brain processes the information from the ear for various tasks.

Figure 2-1 Cross-sectional view of the human ear (http://mail.pittsfield.net/teachersites/Whelihan_Kathleen/).

The major functions of the outer ear are localization, amplification, and protection. The shape of the outer ear helps people collect sound waves and judge the direction of the sound source. The three ossicles convert the acoustic vibrations into mechanical vibrations and compensate for part of the energy loss that occurs when sound passes from the air into the liquid.

The cochlea in the inner ear plays a significant role in the auditory system. Its structure is shown in Figure 2-2. The left panel shows the side view and top view of the stretched cochlea with the basilar membrane (BM), which is about 35 mm long, with its width increasing and its stiffness decreasing non-uniformly from base to apex. The right panel shows a schematic plot of the maximally responsive frequencies along the basilar membrane. When a mechanical vibration reaches the oval window, a traveling wave is generated and propagates along the basilar membrane. Because the stiffness varies along the BM, traveling waves of different frequencies reach their maximum response and die out at different locations on the BM. The lower the frequency, the farther the traveling wave reaches. A linear relationship was observed between the traveling distance from the cochlear base and the log-frequency of the input sound. The range of resonance frequencies is about 20-20,000 Hz, which is the audible frequency range of human beings. Due to the mechanical properties of the traveling wave, the maximum response at a specific frequency inhibits its neighboring frequencies on the BM. This might explain the well-known “frequency masking” phenomenon of human audition.

Figure 2-2 Structure of the cochlea (left) and responses for different frequencies (right) (Hearing Physiology Handout, AAIP).

There are about 3000 inner hair cells distributed along the basilar membrane. When a traveling wave displaces the BM, the hair cells are stimulated and emit electrical signals via the auditory nerve to the midbrain. There are two types of hair cells: inner hair cells and outer hair cells. Most of the transformation of mechanical vibrations into electrical signals is done by the inner hair cells. The outer hair cells, on the other hand, are often active in further amplifying or attenuating extreme sounds.

Because a relaxation time is needed between consecutive firings of auditory neurons, firing rates cannot keep up with high-frequency vibrations, as demonstrated in Figure 2-3. Firing rates of inner hair cells are bounded by 4-5 kHz, while the rates in the midbrain are bounded by about 1 kHz.

Figure 2-3 The firing rate of the auditory nerve in response to a single-tone input (left) and the adaptation mechanism of the auditory nerve (right) (Hearing Physiology Handout, AAIP).

2.1.2. Cochlear Module

Figure 2-4 Stages of the early cochlear module (adopted from [2]). Panels: basilar membrane filters, hair cell stages, lateral inhibitory network, and auditory spectrogram, with intermediate signals y1 to y5; axes: time (ms) and frequency (Hz).

The cochlear module models the functions of the peripheral auditory system. As shown in Figure 2-4, it first consists of a bank of 128 overlapping asymmetric constant-Q bandpass filters ($Q_{3\,\mathrm{dB}} \approx 4$) that mimic the frequency selectivity of the cochlea. These filters are evenly distributed over 5.3 octaves with a resolution of 24 filters/octave. The output of each filter is fed into a non-linear compression stage and a lateral inhibitory network (LIN), and is then processed by an envelope extractor. The non-linear compression models the saturation of the inner hair cells, and the LIN models the frequency masking effect. In this study, a simplified linear version of this module without the hair cell stage is used. All tested speech signals are normalized in advance to avoid the high-gain compression performed by the hair cells. The outputs of the different stages of this module can be written as:

$y_1(t, \omega) = s(t) *_t h(t; \omega)$ (2-1)

$y_3(t, \omega) = \partial_\omega y_1(t, \omega)$ (2-2)

$y_4(t, \omega) = \max(y_3(t, \omega), 0)$ (2-3)

$y_5(t, \omega) = y_4(t, \omega) *_t \mu(t; \tau)$ (2-4)

where $s(t)$ is the input speech signal; $h(t; \omega)$ is the impulse response of the constant-Q cochlear filter with center frequency $\omega$; $*_t$ denotes convolution in time; the integration window $\mu(t; \tau) = e^{-t/\tau} u(t)$ with time constant $\tau$ models the current leakage along the neural pathway to the midbrain; and $u(t)$ is the unit step function.

The output $y_5(t, \omega)$ is referred to as an auditory spectrogram, which represents neuron activities along the time and log-frequency axes. Intuitively, it is similar to the magnitude response of a mel-scaled FFT-based spectrogram: the constant-Q criterion approximates the mel scale, and the local envelope approximates the magnitude of an FFT-based spectrogram.
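Stages (2-2) to (2-4) can be illustrated with a minimal discrete-time sketch. It assumes the filterbank outputs $y_1$ are already available as a list of channels, approximates the frequency derivative by a first difference between adjacent channels, and computes the leaky integration recursively. The function name, the sampling step `dt`, and the handling of the lowest channel are illustrative assumptions, not details taken from [2].

```python
import math

def auditory_spectrogram(y1, dt, tau):
    """Apply stages (2-2) to (2-4) to filterbank outputs y1[channel][time]."""
    n_ch, n_t = len(y1), len(y1[0])
    decay = math.exp(-dt / tau)  # e^{-t/tau} evaluated one sample apart
    y5 = []
    for c in range(n_ch):
        # (2-2) derivative along the frequency axis as a first difference;
        # the lowest channel is differenced against zero (an assumption).
        lower = y1[c - 1] if c > 0 else [0.0] * n_t
        y3 = [y1[c][n] - lower[n] for n in range(n_t)]
        # (2-3) half-wave rectification
        y4 = [max(v, 0.0) for v in y3]
        # (2-4) convolution with mu(t; tau) = e^{-t/tau} u(t), done recursively
        acc, out = 0.0, []
        for v in y4:
            acc = decay * acc + v * dt
            out.append(acc)
        y5.append(out)
    return y5
```

The recursion `acc = decay * acc + v * dt` is a sampled form of the convolution in (2-4): each output is the sum of past rectified samples weighted by the exponential window.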

2.1.3. Cortical Module and Rate-Scale Representation

The second module models the spectro-temporal selectivity of neurons in the auditory cortex (A1). Briefly, the auditory spectrogram $y_5(t, \omega)$ is further analyzed by A1 neurons, which are modeled by two-dimensional filters tuned to different spectro-temporal modulation parameters [2]. The rate (or velocity) parameter, in Hz, reflects how fast the local spectro-temporal envelope varies along the temporal axis. The scale (or density) parameter, in cycle/octave, characterizes how broadly the signal's local spectro-temporal envelope is distributed along the log-frequency axis.

In addition to the rate and scale, cortical neurons are also found to be sensitive to the direction of the FM sweep. This directionality is characterized in this module by the sign of the rate (negative for upward sweeps; positive for downward sweeps). From a functional point of view, this module models cortical neurons as performing a joint spectro-temporal multi-resolution analysis (due to the various rate-scale combinations) on the input auditory spectrogram. The excitation pattern of cortical neurons to a single t-f point in the spectrogram is referred to as the rate-scale representation of that t-f point. Each rate-scale representation is labeled by the neurons' tuning characteristics of rate, scale, and directionality.

Two rate-scale plots averaged over the frequency axis, around 200 ms and 550 ms, are given in Figure 2-5. Two aspects are clearly shown in each rate-scale plot: (1) spectro-temporal modulations of envelopes and (2) resolved pitch below 512 Hz. Take the 550 ms frame as an example. The resolved pitch around 230 Hz excites {high rate, fine scale} neurons and thus produces the corresponding rate-scale representation. On the other hand, the envelope of the almost flat harmonic structure at 230, 460, and 1150 Hz excites neurons tuned to {low rate (due to the flatness), low scale (2 cycles within 2.32 octaves)} and produces strong rate-scale responses in the regions below 8 Hz and below 1 cycle/octave. Since flat envelopes do not favor any sweeping direction, responses symmetric in rate are clearly shown in the {low rate, low scale} region. A more detailed description and the mathematical formulation of this cortical module can be found in [2].

Figure 2-5 Rate-scale representation produced by the cortical module.

2.2. Support Vector Machine (SVM)

The SVM is a supervised learning algorithm commonly used for classification and regression. It has become very popular in recent years due to its remarkable performance. In this thesis, we adopt the SVM as our emotion classifier. This section gives a brief introduction to the SVM. The detailed setups for our experiments are given in section 4.1.

2.2.1. Separable problem

For a supervised learning algorithm, we consider a set of training samples $(x^{(i)}, y^{(i)})$, where $x^{(i)} = [x_1^{(i)}, \ldots, x_m^{(i)}]^T$ is the m-dimensional feature vector of the i-th training sample, $i = 1, 2, \ldots, n$, and $y^{(i)} \in \{-1, 1\}$ represents the class label of the i-th training sample for a basic two-class classification problem. As Figure 2-6 indicates, we want to find a hyperplane that can perfectly separate these two classes. The hyperplane can be represented as:

$g(x) = w^T x + b$ (2-5)

Then, the data of the two classes satisfy:

$y^{(i)} (w^T x^{(i)} + b) \ge 1, \quad i = 1, 2, \ldots, n$ (2-6)

Figure 2-6 The optimal hyperplane for a separable problem using SVM.

There may be many choices of $(w, b)$ that separate the two classes; however, the goal of the SVM is to find the hyperplane that possesses the largest separation, or margin, between the two classes. That is, we want to choose a hyperplane so that the distance from it to the nearest data point on each side is maximized. From Figure 2-6, the margin can be represented as $2 / \|w\|$. The optimal separating hyperplane can be found by solving the following problem:

$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1, \; i = 1, 2, \ldots, n$ (2-7)

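Equation (2-7) can be solved by introducing Lagrange multipliers $\alpha_i \ge 0$ in the following primal form:

$L_P(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right]$ (2-8)

which is minimized over $w$ and $b$ and maximized over the nonnegative Lagrange multipliers $\alpha_i \ge 0$. At the saddle point, one obtains:

$\partial L_P / \partial w = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)}$ (2-9)

$\partial L_P / \partial b = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y^{(i)} = 0$ (2-10)

Substituting (2-9) and (2-10) into (2-8) gives the dual form:

$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} \quad \text{subject to} \quad \alpha_i \ge 0, \; \sum_{i=1}^{n} \alpha_i y^{(i)} = 0$ (2-11)

Invoking the Karush-Kuhn-Tucker dual complementarity conditions, the solution of (2-11) further satisfies:

$\alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right] = 0, \quad i = 1, 2, \ldots, n$ (2-12)

Hence $\alpha_i$ can be nonzero only for training samples that lie on the margin; these samples are called support vectors. Thus, the computations in equation (2-9) can be further reduced.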

Finally, the $\alpha_i$ are solved by quadratic programming, and the parameters $w$ and $b$ are then obtained from equations (2-9) and (2-12). Hence, the final classifier $g(\cdot)$ is derived and used to predict the class of a new test point $x$ by:

$g(x) = \mathrm{sign}(w^T x + b)$ (2-13)

2.2.2. Binary non-separable problem

The simple problem discussed in section 2.2.1 can be extended to a non-separable problem (Figure 2-7) by introducing additional slack variables $\xi_i$ and a cost parameter $C$.

We can relax the constraint in equation (2-7) to $y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i$ to tolerate some outliers. The parameter $C$ keeps the total relaxation $\sum_k \xi_k$ within a reasonably small range. The lower the value of $C$, the smaller the penalty for outliers and the softer the margin.
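The role of the slack variables and of $C$ can be illustrated with a minimal sketch for a fixed hyperplane. It assumes the standard soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ and takes each $\xi_i$ as the smallest value satisfying the relaxed constraint. The function names and the convention that a decision value of exactly zero maps to class +1 are illustrative assumptions.

```python
def decision_value(w, b, x):
    """Evaluate w^T x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Classifier of the form sign(w^T x + b); zero is mapped to +1 (assumption)."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def slack(w, b, x, y):
    """Smallest xi >= 0 such that y (w^T x + b) >= 1 - xi holds."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

def soft_margin_objective(w, b, samples, C):
    """0.5 ||w||^2 + C * sum of slacks over (x, y) pairs (assumed standard form)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(slack(w, b, x, y) for x, y in samples)
```

A sample well outside the margin has zero slack, a sample inside the margin has slack between 0 and 1, and a misclassified sample has slack above 1, so lowering `C` reduces how strongly such outliers influence the chosen hyperplane.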

Figure 2-7 Non-separable problem, with an outlier marked.

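The primal form of equation (2-7) can be reformulated as follows:

$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, 2, \ldots, n$ (2-14)

and the dual form of equation (2-11) can be modified into:

$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} \quad \text{subject to} \quad 0 \le \alpha_i \le C, \; \sum_{i=1}^{n} \alpha_i y^{(i)} = 0$ (2-15)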

2.2.3. Nonlinear problem

The formulations of the SVM can also be extended to tackle nonlinear problems. The SVM maps the original features into a higher dimensional space and solves the problem linearly in the new space (see Figure 2-8). If $\varphi$ is our mapping function, the inner product $(x^{(i)})^T x^{(j)}$ in the dual form of equation (2-11) is replaced by $\varphi(x^{(i)})^T \varphi(x^{(j)})$.

However, the inner product in equation (2-11), when evaluated in the higher dimensional space, would increase the computational load. A key property of SVM recognizers is the use of a so-called kernel function $K(x^{(i)}, x^{(j)}) = \varphi(x^{(i)})^T \cdot \varphi(x^{(j)})$ to replace the inner product. There is no need to find the mapping function $\varphi$ explicitly, and any function that satisfies Mercer's theorem can be used as a kernel function. Table 2-1 lists four frequently used basic kernel functions.

Figure 2-8 Mapping a nonlinear problem to a higher dimensional space.

Table 2-1 Basic kernel functions

linear          $K(x_i, x_j) = x_i^T \cdot x_j$
polynomial      $K(x_i, x_j) = (\gamma x_i^T \cdot x_j + r)^d, \; \gamma > 0$
Gaussian (RBF)  $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \; \gamma > 0$
sigmoid         $K(x_i, x_j) = \tanh(\gamma x_i^T \cdot x_j + r)$
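The four kernels of Table 2-1 can be written down directly, as in the minimal sketch below. The parameter names `gamma` and `r` follow the table; the polynomial degree `d` and all function names are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, gamma, r, d):
    # d is the polynomial degree (assumed parameter), gamma > 0
    return (gamma * dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma):
    # exp(-gamma * ||xi - xj||^2), gamma > 0
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(xi, xj, gamma, r):
    return math.tanh(gamma * dot(xi, xj) + r)
```

Each function returns the value that replaces the inner product in the dual problem, so the mapping function itself never has to be evaluated.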

Chapter 3 Database and Feature Extraction

3.1. Berlin Emotional Speech Database (EMO-DB)

The popular Berlin Emotional Speech Database [14] is used for pilot simulations in this study. Clean speech samples are uttered by five female and five male actors. Each actor speaks ten sentences in German, and each sentence lasts 2 to 5 seconds. The sentences are listed in Table 3-1. The database contains the emotions anger (126), happiness (70), sadness (62), fear (66), disgust (44), boredom (80), and neutral (78). Only utterances scoring higher than an 80% emotion recognition rate in a subjective listening test are included in the database. Hence, there are 526 utterances in total in seven emotion classes. The original speech samples were recorded at a 16 kHz sampling frequency under studio conditions and are downsampled to 8 kHz so that the 5.3-octave coverage of the cochlear filterbank in our auditory model (see section 2.1) includes the fundamental frequencies of male speakers.

White noise and babble noise are obtained from the NOISEX-92 database [17] and added to the clean speech to simulate various SNR conditions. A simple energy-based voice activity detector (VAD) is first applied to each clean utterance to determine its active regions. Only the active regions are considered when calculating the SNR.
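The noise-mixing step can be sketched as follows. This is a minimal illustration that assumes a frame-based energy VAD with a threshold relative to the loudest frame; the frame length, the threshold value, and the function names are hypothetical choices, not the settings used in this study.

```python
import math

def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    return sum(v * v for v in samples) / len(samples)

def active_speech_power(speech, frame_len=200, rel_threshold=0.01):
    """Mean power over frames whose energy exceeds a relative threshold."""
    frames = [speech[i:i + frame_len] for i in range(0, len(speech), frame_len)]
    energies = [mean_power(f) for f in frames]
    threshold = rel_threshold * max(energies)  # assumed VAD rule
    active = [v for f, e in zip(frames, energies) if e > threshold for v in f]
    if not active:
        raise ValueError("no active speech frames found")
    return mean_power(active)

def mix_at_snr(speech, noise, snr_db, frame_len=200, rel_threshold=0.01):
    """Scale the noise so that the SNR over active speech equals snr_db."""
    noise = noise[:len(speech)]
    p_speech = active_speech_power(speech, frame_len, rel_threshold)
    p_noise = mean_power(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain
```

Measuring the speech power over active frames only keeps long silences from lowering the estimate, which would otherwise make the added noise weaker than the nominal SNR suggests.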

Table 3-1 The German content of EMO-DB and its English translation

code | German text | English translation
a01 | Der Lappen liegt auf dem Eisschrank. | The tablecloth is lying on the fridge.
a02 | Das will sie am Mittwoch abgeben. | She will hand it in on Wednesday.
a04 | Heute abend könnte ich es ihm sagen. | Tonight I could tell him.
a05 | Das schwarze Stück Papier befindet sich da oben neben dem Holzstück. | The black sheet of paper is located up there besides the piece of timber.
a07 | In sieben Stunden wird es soweit sein. | In seven hours it will be.
b01 | Was sind denn das für Tüten, die da unter dem Tisch stehen? | What about the bags standing there under the table?
b02 | Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter. | They just carried it upstairs and now they are going down again.
b03 | An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht. | Currently at the weekends, I always went home and saw Agnes.
b09 | Ich will das eben wegbringen und dann mit Karl was trinken gehen. | I will just discard this and then go for a drink with Karl.
b10 | Die wird auf dem Platz sein, wo wir sie immer hinlegen. | It will be in the place where we always store it.

3.2. FAU AIBO database

The FAU AIBO corpus [15] contains recordings of children interacting with SONY's pet robot AIBO. The most important characteristic of these recordings is that they are natural, with non-acted emotions. The children were invited to play with the AIBO and asked to guide it through certain missions, such as moving from point A to point B along a particular route. The children believed that the AIBO was responding to their commands directly, whereas it was actually controlled by a human operator to behave obediently or disobediently and thereby provoke emotional reactions. The data were collected at two different German schools, Mont and Ohm, from 51 children (aged 10 to 13; 21 boys and 30 girls). Speaker independence is assured by using the data from one school for training and the data from the other school for testing. The original recordings are sampled at 16 kHz. For the same reason as stated in section 3.1, the speech samples are downsampled to 8 kHz. The original recordings were segmented automatically into “turns” at pauses longer than 1 s.

Five labelers (advanced students of linguistics) annotated each turn at the word level as neutral (default) or as one of ten other emotion classes. Majority voting (MV) was then used; that is, only words on which three or more labelers agreed were included in the corpus. The classes and the number of speech samples in each class were: joyful (101), surprised (0), emphatic (2,528), helpless (3), touchy (225), angry (84), motherese (1,260), bored (11), reprimanding (310), rest (3), neutral (39,169).

We follow the criteria of the INTERSPEECH 2009 Emotion Challenge [18], which cast the classification task as a five-class problem and a two-class problem.

For the five-class classification problem, emotions are grouped into Anger (angry, touchy, and reprimanding), Emphatic, Neutral, Positive (motherese and joyful), and Rest. The two-class problem deals with NEGative (subsuming angry, touchy, reprimanding, and emphatic) and IDLe (consisting of all nonnegative states) emotions. More details about numbers of speech samples are listed in Table 3-2 and Table 3-3. Similar to section 3.1, white and babble noises are added to the original clean speech to test the robustness of various features.

Table 3-2 Number of instances for the 5-class problem (A: Anger, E: Emphatic, N: Neutral, P: Positive, R: Rest)

#      A     E     N      P    R     sum
train  881   2093  5590   674  721   9959
test   611   1508  5377   215  546   8257
sum    1492  3601  10967  889  1267  18216

Table 3-3 Number of instances for the 2-class problem (NEG: negative, IDL: idle)

#      NEG   IDL    sum
train  3358  6601   9959
test   2465  5792   8257
sum    5823  12393  18216
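The two label groupings described above can be sketched as a simple lookup. This is only an illustration of the grouping rule: the lowercase label strings and function names are hypothetical, and sending every label that is not explicitly listed to Rest is an assumption.

```python
# Raw label -> five-class label, following the grouping in the text.
FIVE_CLASS = {
    "angry": "A", "touchy": "A", "reprimanding": "A",
    "emphatic": "E",
    "neutral": "N",
    "motherese": "P", "joyful": "P",
}

# Raw labels subsumed under NEGative in the two-class problem.
NEGATIVE = {"angry", "touchy", "reprimanding", "emphatic"}

def five_class(label):
    """Map a raw label to A, E, N, P, or R (R for anything unlisted: assumption)."""
    return FIVE_CLASS.get(label, "R")

def two_class(label):
    """Map a raw label to NEG or IDL (all non-negative states)."""
    return "NEG" if label in NEGATIVE else "IDL"
```

For example, `five_class("touchy")` gives `"A"`, while `two_class("joyful")` gives `"IDL"`.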

3.3. Rate-Scale (RS) Features

As mentioned in section 2.1.3, rate-scale plots reveal joint spectro-temporal modulations of speech. The slow modulations, which are related to the speaking rate (i.e., the rate of change of the vocal tract), appear in low-rate regions. On the other hand, the energy of the resolved pitch is captured in high-rate regions. In this study, we consider rates of $\pm 2^{1, \ldots, 9}$ Hz to cover the complete temporal structure (speaking rate and pitch) of speech. As for the scale, we focus on $2^{-1, \ldots, 3}$ cycle/octave to cover the complete frequency structure, from formants (captured by low scales) to harmonics (captured by high scales). Therefore, 90 rate-scale features (9 rates, 5 scales, and both directions) are extracted per frame. The mean and standard deviation of these 90 RS features are then calculated over the entire utterance. Finally, 180 RS features per utterance are retained for emotion recognition.
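The feature count and the utterance-level statistics can be sketched as follows. The sketch assumes the per-frame rate-scale features are already available as lists of numbers and uses the population standard deviation; both the normalization and the names are illustrative assumptions.

```python
import math

# Rate and scale grids described in the text: 9 rates in both directions, 5 scales.
RATES_HZ = [sign * 2 ** k for sign in (1, -1) for k in range(1, 10)]
SCALES_CYC_PER_OCT = [2.0 ** k for k in range(-1, 4)]

def utterance_stats(frames):
    """Concatenate per-dimension means and standard deviations over all frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + stds
```

With 18 signed rates and 5 scales, each frame has 90 values, so `utterance_stats` returns 180 values per utterance regardless of the utterance length.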

3.4. MFCC Features

The mel-frequency cepstral coefficients (MFCCs) are widely used in the speech analysis field. Here, the first 13 MFCCs (including the zero-order coefficient) are extracted from 25 ms Hamming-windowed frames every 10 ms with a pre-emphasis coefficient of 0.97.

The mean, standard deviation, skewness, and kurtosis of these 13 MFCCs, their deltas, and their double-deltas are computed, giving 156 features per utterance (13 coefficients × 3 contours × 4 statistics). This feature set is referred to as MFCC156.
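The statistics step can be sketched as below, assuming the 13 MFCCs per frame have already been extracted. The delta computation here is a simple central difference with clamped edges, and the kurtosis is the plain (non-excess) fourth standardized moment; both are stand-in assumptions, since the exact definitions are not stated in the text.

```python
import math

def moments(values):
    """Return [mean, std, skewness, kurtosis] using population moments."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return [mean, 0.0, 0.0, 0.0]  # flat contour: higher moments set to zero
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return [mean, math.sqrt(var), m3 / var ** 1.5, m4 / var ** 2]

def delta(contour):
    """Central difference with clamped edges (assumed delta definition)."""
    last = len(contour) - 1
    return [
        (contour[min(n + 1, last)] - contour[max(n - 1, 0)]) / 2.0
        for n in range(len(contour))
    ]

def mfcc_functionals(frames):
    """Four statistics of each coefficient, its delta, and its double-delta."""
    features = []
    for k in range(len(frames[0])):
        static = [f[k] for f in frames]
        d1 = delta(static)
        d2 = delta(d1)
        for contour in (static, d1, d2):
            features.extend(moments(contour))
    return features
```

For frames with 13 coefficients, `mfcc_functionals` returns 13 × 3 × 4 = 156 values, matching the size of the MFCC156 set.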

Figure 3-1 Block diagram for extracting MFCCs.

3.5. Prosodic Features

The 180 RS features mentioned above contain both pitch and timbre (i.e., formant structure) information, whereas conventional MFCCs carry only timbre information. To make a fair comparison, prosodic features (pitch, energy, and duration) are extracted and combined with the MFCC features.

The fundamental frequency (F0) contour is extracted by STRAIGHT [19]. The algorithm estimates the aperiodic power (AP) of each frame. Frames with high AP are assumed to be unvoiced and are assigned zero F0. Only low-AP frames are treated as voiced frames and return valid F0 estimates. The energy contour is extracted every 10 ms with a 25 ms window. Duration-related features are derived from the voiced/unvoiced distinction obtained in F0 estimation.

The statistics of the prosodic features used in this study are similar to those used by other researchers [3, 4]. However, to avoid forming a huge feature set with 1000 to 4000 parameters, a reasonably small feature set is constructed. As a result, some features are omitted or replaced. For example, the means of the positive and the negative dF0 are calculated separately to represent the upward and the downward trends, respectively, instead of the mean of all dF0. As for the energy, the minimum value must be close to zero, so the min value, the relative position of the min, and the range would not provide crucial information and are dropped from our feature list. Finally, 30 prosodic features are extracted and referred to as the PRO30 feature set. The description of this feature set is given in Table 3-4.

Table 3-4 30 prosodic features

F0 (8 features): mean, std, max value, relative position of max, min value, relative position of min, range, number of local max points
dF0 (8 features): mean of positive, mean of negative, std, max value, relative position of max, min value, relative position of min, ratio of positive
logE (3 features): std, max value, relative position of max
dlogE (8 features): mean of positive, mean of negative, std, max value, relative position of max, min value, relative position of min, ratio of positive
Duration (3 features): speaking rate, std of voiced duration, mean pause time
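The eight delta-contour statistics listed for dF0 and dlogE can be sketched as follows. The sketch assumes the input contour contains only valid (for F0, voiced) frames, takes the delta as a first difference, and defines the relative position as the frame index divided by the last index; these definitions and the dictionary keys are illustrative assumptions.

```python
import math

def delta_contour_features(contour):
    """Eight statistics of the first difference of a contour."""
    d = [b - a for a, b in zip(contour, contour[1:])]
    pos = [v for v in d if v > 0]
    neg = [v for v in d if v < 0]
    mean = sum(d) / len(d)
    span = max(len(d) - 1, 1)  # avoids division by zero for very short contours
    return {
        "mean_pos": sum(pos) / len(pos) if pos else 0.0,  # upward trend
        "mean_neg": sum(neg) / len(neg) if neg else 0.0,  # downward trend
        "std": math.sqrt(sum((v - mean) ** 2 for v in d) / len(d)),
        "max": max(d),
        "rel_pos_max": d.index(max(d)) / span,
        "min": min(d),
        "rel_pos_min": d.index(min(d)) / span,
        "ratio_pos": len(pos) / len(d),
    }
```

Keeping the positive and negative means separate preserves the upward and downward trends that a single mean over all deltas would average out.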

3.6. INTERSPEECH 2009 Emotion Challenge Acoustic Features

For the AIBO database, we compare the acoustic features adopted in the INTERSPEECH 2009 Emotion Challenge with the proposed RS features under noisy conditions. This default feature set provides baseline results for both HMM and linear-kernel SVM recognizers in the 2009 challenge and is fully transparent thanks to the accessible open-source openSMILE feature extraction toolkit [20]. It includes the most common features pertaining to prosody, spectral shape, and voice quality, as well as their derivatives. In detail, the 16
