The Research Paradigm - On the Robustness of Spectro-Temporal Modulation Features in Emotion Recognition Frameworks

Chapter 1 Introduction

1.5 The Research Paradigm

Our research paradigm, or “research programme” (in Imre Lakatos’ phrase), has always been clear: to develop a robust affect recognizer. Robustness against noise is the first benefit that spectro-temporal modulation features bring. The ultimate goal is to make the recognition system robust even against speaker variation. Within the scope of this thesis, I narrow that ambition to a simpler task: to verify the robustness that spectro-temporal modulation features can offer.

To sum up, my research paradigm has the following structure:

1. Hard core: To verify that RS features are still robust (against noise) in stricter cases (e.g. when recognizing spontaneous emotion).

2. Protective belts: Factors that prevent RS features from reaching their full potential should be ruled out.

3. Positive heuristics:

Robustness against noise

Robust performance on both acted and spontaneous emotions

An obvious positive heuristic here is to impose stricter training and testing conditions and see whether the RS feature set still performs robustly. More specifically, the paradigm asks us to test under mismatched conditions. Experiments under matched conditions, i.e. with training and testing samples that share the same distortion or noise, have already been studied and their findings corroborated. However, the matched condition is just a start, never the end. Yeh began the study of mismatched conditions, and it is now our turn to carry it through.

The only thing I try to do is to verify that, even under very strict conditions, some features are not disqualified. Many features change their characteristics significantly under noise, the energy profile being one of them. Since human beings can recognize emotion, speech content, and other meaningful information in noisy speech, there is no reason in principle why machines cannot. This motivates us to examine RS features, which are inspired by human audition.

This thesis is structured as follows. Chapter 2 covers related work and background knowledge. Chapter 3 describes the materials and methods in detail. Chapter 4 presents the experiments and discussion. Finally, Chapter 5 gives the big picture and outlines some specific future work.

Chapter 2 Literature Review, Related Work and Background Knowledge

2.1 Machine Learning

Recognition is essentially a problem of classification (or regression), and the whole process can be treated purely mathematically. Rooted deeply in statistics and aided by computers, automatic classification has become a powerful, well-structured instrument.

Machine learning, which offers such instruments, has become one of the most prominent research fields in recent years. Several well-known classifiers have been employed even in real-world applications: artificial neural networks (ANN), the naïve Bayes classifier (NBC), Bayesian logistic regression (BLR), and the relevance vector machine (RVM), to name just a few. This section explains why the support vector machine (SVM) (Vapnik, 1995) is adopted and briefly reviews the necessary background.

2.1.1 Kernel methods and sparse kernel machines

In supervised learning, there are three main approaches to a regression or classification problem. Generative models attempt to model the distribution of the inputs as well as the outputs, explicitly or implicitly. Discriminative models model only the posterior distribution. Discriminant functions, in contrast, involve no distribution at all and seek the decision boundaries directly. The SVM, which belongs to the last category, can still be interpreted in probabilistic terms. This interpretation is given later.

Kernel method and the dual form

All regression or classification problems have a similar form of solution:

$$y = f\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}) + b\right)$$

$y$: algorithm output (regression target or class label)

$t$: ground truth (regression target or class label)

$\mathbf{w}$: weighting vector (or equivalently, the normal vector of the decision boundary in the classification case)

$b$: bias term

$\mathbf{x}$: input vector

$\boldsymbol{\phi}(\mathbf{x})$: feature vector

$f(\cdot)$: activation function that transforms the regression target into class labels

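Now, let us consider a linear regression case (it shares the same core as classification) whose parameters (the weighting vector) are determined by minimizing a regularized minimum mean squared error (MMSE) criterion, where the error function is given by

$$J(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) - t_n\right\}^{2} + \frac{\lambda}{2}\mathbf{w}^{T}\mathbf{w}$$

where $\lambda \ge 0$ and $n = 1, 2, \ldots, N$ denotes the sample index. Setting the derivative of $J(\mathbf{w})$ with respect to $\mathbf{w}$ to zero, the optimal solution for $\mathbf{w}$ takes the form

$$\mathbf{w} = -\frac{1}{\lambda}\sum_{n=1}^{N}\left\{\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) - t_n\right\}\boldsymbol{\phi}(\mathbf{x}_n) = \sum_{n=1}^{N} a_n \boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\Phi}^{T}\mathbf{a}$$

with

$$a_n = -\frac{1}{\lambda}\left\{\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) - t_n\right\}$$

If we substitute $\mathbf{w}$ with $\boldsymbol{\Phi}^{T}\mathbf{a}$ and define the kernel matrix as $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^{T}$, where $\boldsymbol{\Phi}$ is the design matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^{T}$, we have

$$J(\mathbf{a}) = \frac{1}{2}\mathbf{a}^{T}\mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{a}^{T}\mathbf{K}\mathbf{t} + \frac{1}{2}\mathbf{t}^{T}\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^{T}\mathbf{K}\mathbf{a}$$

Setting the gradient of $J(\mathbf{a})$ with respect to $\mathbf{a}$ to $\mathbf{0}$, we obtain

$$\frac{dJ}{d\mathbf{a}} = \mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{K}\mathbf{t} + \lambda\mathbf{K}\mathbf{a} = \mathbf{0} \;\Rightarrow\; \mathbf{a} = \left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}$$

Substituting $\mathbf{a}$ back into $y(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})$, we get

$$y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^{T}\left(\mathbf{K} + \lambda\mathbf{I}_N\right)^{-1}\mathbf{t}$$

where $\mathbf{k}(\mathbf{x})$ is the vector with elements $k_n(\mathbf{x}) = k(\mathbf{x}_n, \mathbf{x})$, and $k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{T}\boldsymbol{\phi}(\mathbf{x}')$ is known as the kernel function.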

This is the dual form of the original problem. In the dual form, the prediction (regression target) is expressed entirely through kernel evaluations on the training set (cf. [Bishop, 2006] and [Ng, 2009] for Mercer’s theorem and the other conditions a valid kernel must satisfy). This form of decision (the kernel method) belongs to the discriminant-function category, in which probability is not involved, at least on the surface. In the next section, we attempt to link the two.
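The equivalence of the primal and dual solutions can be checked numerically. The following is a minimal sketch, not the implementation used in this thesis: it assumes NumPy, a linear kernel, a design matrix that stores one feature vector per row, and synthetic data; all function and variable names are illustrative.

```python
import numpy as np

def fit_primal(Phi, t, lam):
    """Primal solution: w = (Phi^T Phi + lam * I_M)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

def fit_dual(K, t, lam):
    """Dual coefficients: a = (K + lam * I_N)^{-1} t."""
    N = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(N), t)

def predict_dual(k_x, a):
    """Prediction y(x) = k(x)^T a, where k_x[n] = k(x_n, x)."""
    return k_x @ a

# Synthetic data (assumption): N = 20 samples, M = 3 features,
# each row of Phi is one feature vector phi(x_n)^T.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
phi_new = rng.normal(size=3)        # feature vector of an unseen input
lam = 0.1

w = fit_primal(Phi, t, lam)
a = fit_dual(Phi @ Phi.T, t, lam)   # linear kernel: K = Phi Phi^T
y_primal = w @ phi_new
y_dual = predict_dual(Phi @ phi_new, a)
```

Both routes give the same prediction; the dual route never forms the weighting vector explicitly and only needs kernel values between the new input and the training samples.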

Probability interpretation

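Assume the prior distribution of $\mathbf{w}$ is an isotropic Gaussian of the form

$$\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \alpha^{-1}\mathbf{I}\right)$$

Given the training samples and basis functions (or equivalently, the kernel function), we have the design matrix (or equivalently, the kernel matrix). Then $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}$ is a linear combination of jointly Gaussian variables and hence is itself a Gaussian vector, with

$$E[\mathbf{y}] = \boldsymbol{\Phi}E[\mathbf{w}] = \mathbf{0}$$

$$\mathrm{cov}[\mathbf{y}] = E\left[\mathbf{y}\mathbf{y}^{T}\right] = \boldsymbol{\Phi}E\left[\mathbf{w}\mathbf{w}^{T}\right]\boldsymbol{\Phi}^{T} = \boldsymbol{\Phi}\left(\frac{1}{\alpha}\mathbf{I}\right)\boldsymbol{\Phi}^{T} = \frac{1}{\alpha}\mathbf{K}$$

Therefore, $\mathbf{y} \sim \mathcal{N}\left(\mathbf{0}, \alpha^{-1}\mathbf{K}\right)$. The kernel matrix is the dominating factor of the covariance matrix. This bridges the gap between the two paradigms of discriminant function and discriminative model.

Now we further assume that we can model the problem using a Gaussian process:

$$t_n = y_n + \epsilon_n, \quad \text{where } \epsilon_n \sim \mathcal{N}\left(0, \sigma^{2}\right) \text{ is the prediction error}$$

The joint distribution of the regression targets $\mathbf{t} = (t_1, \ldots, t_N)^{T}$ conditioned on the values of $\mathbf{y} = (y_1, \ldots, y_N)^{T}$ is given by an isotropic Gaussian of the form

$$\mathbf{t} = \mathbf{y} + \boldsymbol{\epsilon}, \quad \text{where } \boldsymbol{\epsilon} \sim \mathcal{N}\left(\mathbf{0}, \sigma^{2}\mathbf{I}\right)$$

Combining

1) $\mathbf{t} = \mathbf{y} + \boldsymbol{\epsilon}$,

2) $\mathbf{y} \sim \mathcal{N}\left(\mathbf{0}, \alpha^{-1}\mathbf{K}\right)$,

3) $\boldsymbol{\epsilon} \sim \mathcal{N}\left(\mathbf{0}, \sigma^{2}\mathbf{I}\right)$,

we obtain $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$, where $\mathbf{C} = \sigma^{2}\mathbf{I} + \alpha^{-1}\mathbf{K}$. This concludes the training phase.

In the testing phase, the regression target, say $t_{N+1}$, is correlated with the targets in the training phase. Appending the test target $t_{N+1}$ to the training targets $\mathbf{t}$ gives the vector $\mathbf{t}_{N+1} = (t_1, \ldots, t_N, t_{N+1})^{T}$, and we know

$$\mathbf{t}_{N+1} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{C}_{N+1}\right), \quad \text{where } \mathbf{C}_{N+1} = \begin{pmatrix} \mathbf{C}_N & \mathbf{k} \\ \mathbf{k}^{T} & c \end{pmatrix}$$

Here $\mathbf{C}_N = \mathbf{C}$ is the covariance of the training targets, $\mathbf{k}$ has elements $\alpha^{-1}k(\mathbf{x}_n, \mathbf{x}_{N+1})$, $n = 1, 2, \ldots, N$, and $c = \alpha^{-1}k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \sigma^{2}$. Therefore, according to Bayes’ theorem for Gaussian variables,

$$t_{N+1} \mid \mathbf{t} \sim \mathcal{N}\left(\mathbf{k}^{T}\mathbf{C}_N^{-1}\mathbf{t},\; c - \mathbf{k}^{T}\mathbf{C}_N^{-1}\mathbf{k}\right)$$

Since $\mathbf{C}_N$ is determined by the training data, and $c$ and $\mathbf{k}$ are determined once the test sample is given, the decision follows naturally. Starting from divergent points of view, the discriminant function and the probabilistic discriminative model finally reach the same end.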
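The agreement between the two views can be made explicit with a short worked step. It is a sketch under two assumptions that the text leaves implicit: the prior precision $\alpha$ scales the kernel in the covariance, so that the cross-covariance vector is $\mathbf{k} = \alpha^{-1}\mathbf{k}(\mathbf{x}_{N+1})$, and $\mathbf{k}(\mathbf{x}_{N+1})$ denotes the vector of raw kernel values $k(\mathbf{x}_n, \mathbf{x}_{N+1})$ as in the dual form. The predictive mean $m(\mathbf{x}_{N+1})$ is then:

```latex
\begin{aligned}
m(\mathbf{x}_{N+1})
  &= \mathbf{k}^{T}\mathbf{C}_N^{-1}\mathbf{t}
   = \alpha^{-1}\,\mathbf{k}(\mathbf{x}_{N+1})^{T}
     \left(\sigma^{2}\mathbf{I}_N + \alpha^{-1}\mathbf{K}\right)^{-1}\mathbf{t} \\
  &= \mathbf{k}(\mathbf{x}_{N+1})^{T}
     \left(\mathbf{K} + \alpha\sigma^{2}\mathbf{I}_N\right)^{-1}\mathbf{t}
\end{aligned}
```

This has the same form as the dual prediction $y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^{T}(\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$ with $\lambda = \alpha\sigma^{2}$, so the regularization weight of the discriminant-function view plays the role of the noise-to-prior ratio in the probabilistic view.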

Given the same purpose and the same mathematical analysis, why should anyone adopt kernel methods and the dual form when, in practice, the number of samples N is usually larger than the feature dimension M (the kernel matrix is N × N, whereas solving for w directly involves only an M × M matrix, where M is the dimension of the basis functions)? The reason is that in some cases not all training samples are needed for prediction, so many dual coefficients are zero and only the corresponding part of the kernel matrix matters. Such a sparse kernel machine effectively works with a reduced N × N matrix.

Sometimes the reduced N can be very small, even smaller than M. This justifies the use of the dual form. In the next section, we introduce one type of sparse kernel machine: the support vector machine (SVM).

2.1.2 Support Vector Machines

Attempts to solve binary classification problems date back to the early days of the field; Frank Rosenblatt’s perceptron is among them (Rosenblatt, 1962). The perceptron has several shortcomings: it cannot handle overlapping classes, and its decision boundary might not be optimal (Bishop, 2006).

Figure 2.1.1

Left: Correct binary classification without maximizing margin.

Right: Maximum margin classifier.

Source: Lecture notes from Machine Learning. Wang, 2011.

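Originally devised for linearly separable binary classification problems, the SVM attempts to maximize the margin between classes. Let the class labels be 1 and -1, and let $y(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}) + b$. The distance of any correctly classified point $\mathbf{x}_n$ to the decision boundary is

$$\frac{t_n\, y(\mathbf{x}_n)}{\lVert\mathbf{w}\rVert}, \quad t_n \in \{-1, 1\}$$

Define the geometric margin as “the shortest distance between the decision boundary and the class”, i.e. the distance between the boundary and the sample nearest to it:

$$\gamma = \min_n \frac{t_n\, y(\mathbf{x}_n)}{\lVert\mathbf{w}\rVert} = \min_n \frac{t_n\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) + b\right)}{\lVert\mathbf{w}\rVert}$$

Now we try to find a decision boundary that has maximum margin, so we solve

$$\arg\max_{\mathbf{w}, b}\, \gamma = \arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\lVert\mathbf{w}\rVert} \min_n \left[ t_n\, y(\mathbf{x}_n) \right] \right\}$$

This optimization problem is difficult to solve directly. Because rescaling $\mathbf{w}$ and $b$ by a common factor does not change the distance, we are free to set the functional margin $t_n\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) + b\right)$ of the nearest sample to unity, so that

$$t_n\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) + b\right) \ge 1, \quad n = 1, \ldots, N$$

The problem becomes

$$\arg\min_{\mathbf{w}, b} \frac{1}{2}\lVert\mathbf{w}\rVert^{2}, \quad \text{subject to } t_n\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) + b\right) \ge 1, \quad n = 1, \ldots, N$$

which is a (solvable) quadratic programming problem.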

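To solve this problem, we introduce Lagrange multipliers $a_n \ge 0$ such that

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\lVert\mathbf{w}\rVert^{2} - \sum_{n=1}^{N} a_n\left\{ t_n\left(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}_n) + b\right) - 1 \right\}$$

where $\mathbf{a} = (a_1, a_2, \ldots, a_N)^{T}$. Taking derivatives, we have

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \boldsymbol{\phi}(\mathbf{x}_n)$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_{n=1}^{N} a_n t_n$$

Substituting $\mathbf{w}$, we have

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m)$$

subject to the Karush-Kuhn-Tucker (KKT) conditions (if it has a solution):

$$a_n \ge 0$$

$$t_n\, y(\mathbf{x}_n) - 1 \ge 0$$

$$a_n\left\{ t_n\, y(\mathbf{x}_n) - 1 \right\} = 0$$

To predict the label of a test sample,

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n\, k(\mathbf{x}, \mathbf{x}_n) + b$$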

Based on the kernel method introduced previously, the SVM makes predictions directly from the training samples. With a suitable kernel function, the SVM can also solve non-linearly separable problems.
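As an illustration of the dual objective, the KKT conditions, and the prediction rule, the following sketch works through a two-sample toy problem by hand. It is a hypothetical example, not part of the experiments: the data, the linear kernel, and all function names are assumptions chosen so that the solution can be verified on paper.

```python
def linear_kernel(u, v):
    """k(u, v) = u^T v for vectors stored as tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(a, t, X, kernel):
    """L~(a) = sum_n a_n - 1/2 sum_n sum_m a_n a_m t_n t_m k(x_n, x_m)."""
    n = len(a)
    quad = sum(a[i] * a[j] * t[i] * t[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(a) - 0.5 * quad

def decision(x, a, t, X, b, kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b."""
    return sum(an * tn * kernel(x, xn) for an, tn, xn in zip(a, t, X)) + b

# Toy data (assumption): one sample per class on the real line.
X = [(1.0,), (-1.0,)]
t = [1, -1]

# The constraint sum_n a_n t_n = 0 forces a_1 = a_2 = s, so
# L~ = 2s - 2s^2, which is maximised at s = 0.5.
a = [0.5, 0.5]

# Both samples are support vectors, so b follows from t_n * y(x_n) = 1.
b = t[0] - sum(an * tn * linear_kernel(X[0], xn) for an, tn, xn in zip(a, t, X))

margins = [tn * decision(xn, a, t, X, b, linear_kernel) for xn, tn in zip(X, t)]
```

Here both functional margins equal one, so the complementary condition holds with non-zero multipliers, and the recovered boundary lies midway between the two samples.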

Figure 2.1.2 Non-linearly separable problem using a radial basis function (rbf) kernel.

Source: Lecture notes from Machine Learning. Wang, 2011.

Extension to overlapping classes (Non-separable problems)

Since the class distributions overlap, the technique described in the previous section cannot be applied directly. Consequently, we introduce slack variables $\xi_n \ge 0$ that allow sample points to be on the wrong side.

Figure 2.1.3 A simple example explaining how the slack variables work.

Source: Lecture notes from Machine Learning. Wang, 2011.

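The introduction of slack variables changes the constraint to

$$t_n\, y(\mathbf{x}_n) \ge 1 - \xi_n, \quad n = 1, \ldots, N$$

Similar to the previous section, we again introduce additional Lagrange multipliers $\mu_n \ge 0$ to solve the optimization problem

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \mathbf{a}, \boldsymbol{\mu}) = \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n\left\{ t_n\, y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N}\mu_n\xi_n$$

subject to the KKT conditions (if it has a solution):

$$a_n \ge 0$$

$$t_n\, y(\mathbf{x}_n) - 1 + \xi_n \ge 0$$

$$a_n\left\{ t_n\, y(\mathbf{x}_n) - 1 + \xi_n \right\} = 0$$

$$\mu_n \ge 0$$

$$\xi_n \ge 0$$

$$\mu_n \xi_n = 0$$

Taking derivatives,

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \boldsymbol{\phi}(\mathbf{x}_n)$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; 0 = \sum_{n=1}^{N} a_n t_n$$

$$\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n$$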

And, of course, by substituting $\mathbf{w}$, we can obtain the dual representation. The introduction to the mathematics of the standard SVM ends here. An alternative form of the SVM, known as ν-SVM and introduced by Schölkopf et al., has the equivalent form of maximizing

$$\tilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m\, k(\mathbf{x}_n, \mathbf{x}_m)$$

subject to

$$0 \le a_n \le \frac{1}{N}$$

$$\sum_{n=1}^{N} a_n t_n = 0$$

$$\sum_{n=1}^{N} a_n \ge \nu$$

Schölkopf proved that if

1) the kernel is analytic, and

2) the training samples are independent and identically distributed,

then the value ν is

1. an upper bound on the fraction of margin errors, and

2. a lower bound on the fraction of support vectors.

Figure 2.1.4 Illustration of SVM applied to an overlapping 2-dimensional data set. The support vectors are indicated by green circles.

Source: Pattern Recognition and Machine Learning. Bishop, 2006.

Philosophy of science in SVM

Vapnik’s design of the SVM is an instantiation of falsificationism in the philosophy of science. In defining the error function, the SVM adopts a hinge function, which assigns zero error to samples lying well on the correct side of the boundary, so that only the wrong or nearly wrong samples matter. When the decision boundary is computed, it is the samples that are misclassified or nearly misclassified that become support vectors. Even in the theory of the Vapnik-Chervonenkis dimension, the concept of falsifiability is clearly visible (Vapnik, 2006).
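The behaviour of the hinge function can be seen in a few lines. This is a minimal sketch with made-up decision values; the function name and the sample values are illustrative assumptions.

```python
def hinge_loss(t, y):
    """Hinge error max(0, 1 - t*y) for a label t in {-1, +1} and a decision value y."""
    return max(0.0, 1.0 - t * y)

# (label, decision value) pairs; the values are made up for illustration.
samples = [
    (+1, 2.5),   # far on the correct side
    (+1, 1.0),   # exactly on the margin
    (+1, 0.3),   # correct side but inside the margin
    (-1, 0.4),   # misclassified
]
losses = [hinge_loss(t, y) for t, y in samples]
```

Only the last two samples receive a non-zero error. In the soft-margin formulation this error coincides with the slack that the corresponding sample needs, which is why samples far from the boundary have no influence on the solution.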

Disadvantages

Originally devised for two-class separable problems, the SVM has some disadvantages when applied to non-separable multi-class problems.

1. It gives only hard decisions (binary outputs) instead of soft ones (probabilistic, numeric outputs). In some fields of application (e.g. weather forecasting), a probabilistic prediction is preferred to a clear-cut outcome.

2. Multi-class problems cannot be solved directly by a binary classifier. Commonly adopted schemes, including one-against-one (OAO) and one-against-all (OAA) decomposition of the original multi-class problem, leave unresolvable regions in which the class of a sample cannot be determined (an intuitive explanation is shown in Fig. 2.1.5; for a comparison of multi-class SVM methods, cf. (Hsu and Lin, 2002)).

3. Misclassified samples all become support vectors, so when classes overlap in the feature space, the number of support vectors increases. A highly non-separable problem therefore makes the kernel expansion much less sparse, increasing training and testing time.

4. Tuning the hyperparameters is time-consuming. For example, for a regression problem with an rbf kernel, we have to tune C (regularization parameter), ε (regression tolerance parameter), and γ (rbf parameter). There is no way to know in advance what values the hyperparameters should take. Grid search is the most commonly suggested way to try parameter values, but it is not time-saving either.
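The cost of grid search mentioned in item 4 is easy to see in a sketch. The code below is illustrative only: the exponentially spaced candidate ranges are placeholders rather than the settings used in this thesis, and `toy_score` is a hypothetical stand-in for a cross-validated accuracy, which in practice would require training one model per grid point.

```python
import itertools
import math

def grid_search(score, grid):
    """Return (best_params, best_score, n_evaluations) over the Cartesian product of `grid`."""
    names = list(grid)
    best_params, best_score, n_evaluations = None, float("-inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(**params)
        n_evaluations += 1
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score, n_evaluations

# Exponentially spaced candidates (placeholder ranges, an assumption).
grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}

def toy_score(C, gamma):
    """Hypothetical stand-in for cross-validated accuracy; peaks at C = 2**3, gamma = 2**-3."""
    return -((math.log2(C) - 3) ** 2 + (math.log2(gamma) + 3) ** 2)

best_params, best_score, n_evaluations = grid_search(toy_score, grid)
```

Even this coarse two-parameter grid needs 110 evaluations, and every additional hyperparameter multiplies that count by the number of its candidate values.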

Despite all the above-mentioned disadvantages, the SVM is still widely adopted in our own and other researchers’ experiments for its simplicity.

Figure 2.1.5: Binary classifier solving multi-class problem. Unresolvable areas are shaded green.

Source: Pattern Recognition and Machine Learning. Bishop, 2006.
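How an unresolvable region arises under one-against-one decomposition can be sketched with a simple voting rule. The class labels and the pairwise outcomes below are hypothetical, and returning `None` on a tie is one possible convention, not the rule of any particular SVM library.

```python
from collections import Counter
from itertools import combinations

def one_against_one_vote(pairwise_winner, classes):
    """Return the class with the most pairwise wins, or None when the top vote is tied."""
    votes = Counter(pairwise_winner[pair] for pair in combinations(classes, 2))
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the sample falls in an unresolvable region
    return ranked[0][0]

classes = ["anger", "joy", "sadness"]  # hypothetical labels

# Each binary classifier names its winner for one test sample.
clear = {
    ("anger", "joy"): "anger",
    ("anger", "sadness"): "anger",
    ("joy", "sadness"): "joy",
}
cyclic = {
    ("anger", "joy"): "anger",
    ("anger", "sadness"): "sadness",
    ("joy", "sadness"): "joy",
}

decided = one_against_one_vote(clear, classes)
undecided = one_against_one_vote(cyclic, classes)
```

In the second case every class wins exactly one pairwise contest, so the vote cannot single out a label; this is the situation that the shaded regions of the figure depict.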

2.1.3 Imbalanced Datasets

Long present in everyday applications, imbalanced datasets are a major and troublesome issue in machine learning. In recent years, the imbalanced learning problem has drawn a significant amount of interest from academia, industry, and government funding agencies (He and Garcia, 2009). The fundamental issue is that imbalanced data can significantly compromise the performance of most standard learning algorithms. Most algorithms assume or expect balanced class distributions or equal misclassification costs. Therefore, when presented with complex imbalanced data sets, these algorithms fail to properly represent the distributive characteristics of the data and consequently yield unfavorable accuracies across the classes. In real-world domains, imbalanced learning is a recurring problem of high importance with wide-ranging implications, warranting further exploration.

There are three major approaches to handling data imbalance. The most intuitive is sampling: under-sampling the majority class, over-sampling the minority class, and synthetic sampling are the most popular variants. Cluster-based methods and data cleaning methods such as Tomek links are usually applied as auxiliaries. Cost-sensitive methods are the second main category. In some domains, cost-sensitive methods are even superior to sampling methods (He and Garcia, 2009). The third category, kernel-based methods, modifies the learning algorithms themselves.
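The simplest of the sampling methods, random over-sampling of the minority class, can be sketched as follows. This is an illustration under assumed data (an 8:2 class ratio with hypothetical labels), not the balancing procedure used in the experiments.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y

samples = list(range(10))                   # stand-ins for feature vectors
labels = ["neutral"] * 8 + ["anger"] * 2    # hypothetical 8:2 imbalance
bal_x, bal_y = random_oversample(samples, labels)
class_counts = Counter(bal_y)
```

Over-sampling adds no new information, since the extra minority samples are exact copies; synthetic sampling addresses this by generating new points instead of duplicating existing ones.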

2.2 Auditory Model

The proposed auditory features were extracted from the stages of an auditory model, which is based on physiological evidence and consists of an early cochlear (ear) module and a central cortical (A1) module.

Figure 2.2.1 Detailed block diagram of the auditory model (feature extractor). The stage outputs are labeled y1, y2, y3, and y4.

2.2.1 Cochlear Module

The cochlear module models the functions of the peripheral auditory system. The cochlea behaves like a frequency analyzer. As Fig. 2.2.1 shows, the cochlear module consists of a bank of 128 overlapping asymmetric constant-Q band-pass filters ($Q_{3\mathrm{dB}} \approx 4$) that mimic the frequency selectivity of the cochlea. These filters are distributed evenly over 5.3 octaves with a frequency resolution of 24 filters/octave. The output of each filter is fed into a non-linear compression stage and a lateral inhibitory network (LIN), and then processed by an envelope extractor (a half-wave rectifier followed by a low-pass filter). The non-linear high-gain compression models the saturation of the inner hair cells, which transduce the vibrations of the basilar membrane along the cochlea into intracellular hair cell potentials. The auditory nerve then transmits the hair cell potentials to the cochlear nucleus of the central auditory system. This transmission is simulated by the LIN, which generates a spectral profile by detecting discontinuities along the frequency axis. This is followed by integration over a few milliseconds. This study uses a simplified linear version of this module with a disabled hair cell stage. This approach normalizes all speech signals in advance to avoid the non-linear high-gain compression of the hair cells. As in Fig. 2.2.1, the outputs at different stages of this module can be written as:

$$y_1(t, f) = s(t) *_t h(t, f) \qquad (1)$$

$$y_2(t, f) = \partial_f\, y_1(t, f) \qquad (2)$$

$$y_3(t, f) = \max\left(y_2(t, f), 0\right) \qquad (3)$$

$$y_4(t, f) = y_3(t, f) *_t \mu(t, \tau) \qquad (4)$$

where $s(t)$ is the input speech, $h(t, f)$ is the impulse response of the constant-Q cochlear filter with center frequency $f$, $*_t$ denotes convolution in time, $\partial_f$ is the partial derivative along the $f$ axis, the integration window $\mu(t, \tau) = e^{-t/\tau}u(t)$ with the time constant $\tau$ models the current leakage along the neural pathway to the cochlear nucleus (midbrain), and $u(t)$ is the unit step function.
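Stages (2) to (4) of the simplified linear module can be sketched in discrete time. The code below is a rough illustration, not the auditory model implementation used in this study: it assumes NumPy, takes the filter-bank outputs of stage (1) as given, replaces the derivative along the frequency axis with a first difference between adjacent channels, and uses a placeholder time constant of 8 ms with a truncated, normalized integration window.

```python
import numpy as np

def cochlear_stages(y1, fs, tau=0.008):
    """Apply stages (2)-(4) to filter-bank outputs y1 of shape (channels, samples)."""
    # (2) lateral inhibition: first difference along the frequency (channel) axis
    y2 = np.diff(y1, axis=0, prepend=y1[:1])
    # (3) half-wave rectification
    y3 = np.maximum(y2, 0.0)
    # (4) leaky integration: causal convolution with exp(-t / tau) u(t),
    #     truncated at 5 * tau and normalized to unit sum (both are assumptions)
    n = max(1, int(round(5 * tau * fs)))
    window = np.exp(-np.arange(n) / (tau * fs))
    window /= window.sum()
    y4 = np.stack([np.convolve(ch, window)[: y1.shape[1]] for ch in y3])
    return y4

# Random stand-in for the outputs of 4 cochlear filters (0.2 s at 1 kHz).
rng = np.random.default_rng(0)
y1_demo = rng.normal(size=(4, 200))
y4_demo = cochlear_stages(y1_demo, fs=1000.0)
```

The result has the same shape as the input and is non-negative everywhere, as expected of an envelope; the lowest channel is zero here only because the first difference has no lower neighbour to compare against.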

The output $y_4(t, f)$ is an auditory spectrogram that represents neuron activities along the time ($t$) and log-frequency ($f$) axes. The auditory spectrogram produced by this simplified linear cochlear module is similar to the magnitude response of a Mel-scaled FFT-based spectrogram. The constant-Q criterion of the filter bank has an effect similar to that of the Mel scale, and the local envelope approximates the magnitude of an FFT-based spectrogram. Note that the LIN accounts for the spectral masking effect provided that hair cells behave non-linearly. However, since this study does not consider the hair cell stage, the LIN effectively only sharpens the constant-Q cochlear filters.

2.2.2 Cortical Module and Rate-Scale Representation

The second module models the spectro-temporal selectivity of neurons in the auditory cortex (A1). The auditory spectrogram $y_4(t, f)$ is further analyzed (filtered) by cortical neurons, which are modeled by two-dimensional filters tuned to different spectro-temporal modulation parameters (Chi et al. 2005). The rate (or velocity) parameter $\omega$ (in Hz) reflects how fast the local spectro-temporal envelope varies along the temporal axis. The scale (or density) parameter $\Omega$ (in cycle/octave) represents the distribution of the local spectro-temporal envelope along the log-frequency axis. In addition to the rate and the scale, cortical neurons are also sensitive to the sweeping direction of the FM of the sound. This module characterizes directional selectivity using the sign of the rate: negative for upward sweeping direction, and positive for downward sweeping direction.

Therefore, the 4-dimensional output of this cortical module can be formulated as

$$r(t, f, \omega, \Omega) = y_4(t, f) *_{tf} \mathrm{STIR}(t, f, \omega, \Omega) \qquad (5)$$

where $\mathrm{STIR}(t, f, \omega, \Omega)$ is the joint two-dimensional spectro-temporal impulse response (STIR) of the direction-selective filter tuned to $\omega$ and $\Omega$, and $*_{tf}$ is the two-dimensional convolution in the time and log-frequency domains. More detailed formulations and derivations of $\mathrm{STIR}(t, f, \omega, \Omega)$ are available in (Chi et al. 2005).

The local energy of the four-dimensional output is then computed as

$$E(t, f, \omega, \Omega) = \left| r(t, f, \omega, \Omega) + jH\left(r(t, f, \omega, \Omega)\right) \right| \qquad (6)$$

where $H(\cdot)$ is the Hilbert transform along the log-frequency ($f$) axis. From a functional point of view, cortical neurons perform a joint spectro-temporal multi-resolution analysis (due to various rate-scale combinations) on the input auditory spectrogram. The excitation pattern of cortical neurons associated with a single time-frequency (T-F) unit at $(t, f)$ of the input auditory spectrogram is referred to as the rate-scale (RS) representation of that particular T-F unit, and is expressed as $E(t, f, \omega, \Omega)$.

The frame-based RS representation of an utterance can be obtained by averaging the RS representations of T-F units over the frequency axis as follows:

$$P(\omega, \Omega, t) = \sum_{f} E(t, f, \omega, \Omega) \qquad (7)$$

The bottom panels of Fig. 2.2.2 show the time-varying RS representation $P(\omega, \Omega, t)$ of a sample speech signal around 200 and 550 ms. Each plot of the RS representation clearly shows two attributes: (1) spectro-temporal modulations of envelopes and (2) resolved pitch below 512 Hz. Consider the 550 ms frame as an example. The resolved pitch around 230 Hz produces a strong response around the high-rate, high-scale (pitch-related) region. On the other hand, the envelope of the almost flat harmonic structure at 230, 460, and 1150 Hz produces strong {low rate (due to the flatness, no FM), low scale (2 cycles/periods within 2.32 octaves)} responses in the region below 8 Hz and below 1 cycle/octave. Since flat envelopes do not favor any sweeping direction, the {low rate, low scale} region exhibits symmetric rate responses. Figure 2.2.2 shows that the frame-based $P(\omega, \Omega, t)$ encodes the information of the spectro-temporal structures, including but not limited to pitch, harmonicity, formant spacing, and AM and FM of an input sound at each time instant. Some of these structures, such as pitch, AM, and FM, are associated with the prosody of the sound, while others are associated with the spectral characteristics of the sound. Variations of these two types of features (prosodic and spectral features) commonly appear in speech emotion recognition research (Cowie et al. 2001; Mozziconacci 2002; Scherer 2003; Nwe et al. 2003; Ververidis and Kotropoulos 2006; Schuller et al. 2007a; Busso et al. 2009). Therefore, the proposed time-varying RS representation could be a good candidate for speech emotion recognition.

The left and right panels of Fig. 2.2.2 show the long-term averaged $P(\omega, \Omega, t)$ of clean speech and white noise, respectively. The long-term averaged RS representation of clean speech shown in Figure 2.2.2 was produced from 30 clean utterances extracted from the NOIZEUS corpus (Loizou 2007). Clearly, the white noise primarily affects the pitch region (> 128 Hz) of speech. In addition to the pitch region, speech possesses high energies in the low-scale low-rate region (< 4 cycle/octave, < 32 Hz), while white noise activates the high-rate high-scale region (> 2 cycle/octave, > 32 Hz) due to differences in the structures of their spectro-temporal envelopes. This indicates that local spectro-temporal speech envelopes are mostly smoother than white noise envelopes along either the time or the frequency axis.

These spectro-temporal envelopes critically encode the amplitude modulation and the frequency modulation of the sound, which are vital cues for humans to segregate individual sound streams from a sound mixture (Grimault et al. 2002; Carlyon et al. 2000). This segregation process of human hearing perception is very important to people’s daily lives, and is referred to as auditory scene analysis (ASA) (Bregman 1990). Since speech envelope modulation is critical to hearing perception and vastly different from white noise envelope modulation, this study uses the time-varying $P(\omega, \Omega, t)$, which decomposes modulations of local envelopes in a multi-resolution fashion, to assess speech emotions under noisy conditions.

Figure 2.2.2 Rate-scale representation of a speech frame.

2.3 Emotion Psychology

The analysis of emotion has three main perspectives. Discrete theory, having
