Chapter 1 Introduction
1.3 Organization of Thesis
This thesis is organized as follows. Chapter 2 reviews the ICA algorithm and the GPD method. Chapter 3 describes the proposed structure of the speaker recognition system, including the MFCCs, the ICA features, and the GMM with GPD. Chapter 4 describes the database used and presents experimental results that verify the performance of the speaker recognition system described in Chapter 3. Chapter 5 gives the conclusions of this thesis and the future work.
Chapter 2 Framework of Independent Component Analysis and the Generalized Probabilistic Descent Method Used in the Speaker Recognition System
2.1 Introduction
As mentioned above, ICA is used to find the most important and independent components of the MFCC features, and the GPD method is used to reduce the overall system error by taking the whole recognizer into account. Both are central to this thesis, so we introduce them in more detail in this chapter.
2.2 Independent Component Analysis
In this subsection, we first show the principle by which independent components are found from the input vectors. We then describe a technique called FastICA that finds them.
2.2.1 Policy of ICA
Assume that the input vector $x$ is distributed according to the ICA data model and that $s$ is the vector of independent components:
$$x = As, \tag{2.1}$$
where $A$ is the mixing matrix. For simplicity, we also assume that all the independent components have identical distributions and that the unknown mixing matrix $A$ is square. After estimating the matrix $A$, we can obtain the independent components by:
$$s = Wx, \quad \text{where } W = A^{-1}. \tag{2.2}$$
Thus one of the independent components can be considered as a linear combination of the $x_i$, denoted by:
$$y = w^T x = \sum_i w_i x_i, \tag{2.3}$$
where $w$ is a weight vector to be determined. If $w$ were one of the rows of the inverse of $A$, then $y$ would actually be one of the independent components. In practice, however, we cannot determine such a $w$ exactly, because the matrix $A$ is unknown.
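As a minimal sketch of the data model in eqs. (2.1)-(2.3), the following Python fragment mixes two independent sources and recovers them with the inverse of the mixing matrix. The 2x2 matrix `A`, the uniform sources, and the helper names are hypothetical choices made only for illustration; in practice $A$ is unknown and must be estimated.

```python
import random

def inverse_2x2(a):
    """Invert a 2x2 matrix given as a list of two rows."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("the mixing matrix must be invertible")
    return [[s / det, -q / det], [-r / det, p / det]]

def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

# Hypothetical 2x2 mixing matrix and two independent sources (illustration only).
A = [[2.0, 1.0], [1.0, 3.0]]
rng = random.Random(0)
s = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]

x = mat_vec(A, s)        # observed mixture, eq. (2.1)
W = inverse_2x2(A)       # W = A^{-1}, eq. (2.2)
s_hat = mat_vec(W, x)    # recovered independent components

w = W[0]                 # first row of the inverse of A
y = sum(w_i * x_i for w_i, x_i in zip(w, x))   # y = w^T x, eq. (2.3)
```

With `w` taken as a row of the inverse of `A`, the projection `y` coincides with the first source, which is the situation that ICA tries to approximate without knowing `A`.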
In order to find an estimator that gives a good approximation of $w$, let us redefine the variables as:
$$z = A^T w, \tag{2.4}$$
so that
$$y = w^T x = w^T A s = z^T s. \tag{2.5}$$
Because a sum of even two independent random variables is more Gaussian than the original variables, $z^T s$ is more Gaussian than any of the $s_i$, and it is least Gaussian when it equals one of the $s_i$. In addition, the $s_i$ were assumed to have identical distributions, so in that case only one of the elements $z_i$ of $z$ is nonzero. Therefore, we can take $w$ as a vector that maximizes the non-gaussianity of $w^T x$. To find several independent components, we need to find all of these local maxima.
To simplify the computation, we assume that $y$ is centered (zero-mean) and has unit variance. The classical measure of non-gaussianity is kurtosis, defined by:
$$\mathrm{kurt}(y) = E\{y^4\} - 3\left(E\{y^2\}\right)^2 = E\{y^4\} - 3. \tag{2.6}$$
For any Gaussian variable, the kurtosis is zero, whereas it is nonzero for most non-Gaussian random variables. Random variables with a positive kurtosis are called super-Gaussian, and those with a negative kurtosis are called sub-Gaussian. Super-Gaussian random variables typically have a spiky pdf with heavy tails, whereas sub-Gaussian random variables have a flat pdf. Both are illustrated in Fig. 2-1.
Fig. 2-1 The pdfs of super-Gaussian and sub-Gaussian random variables
Therefore we can use the absolute value or the square of kurtosis to measure non-gaussianity in ICA.
In practice, we could start from some weight vector $w$ and use a gradient method or one of its extensions to find a new $w$. However, the kurtosis method has some drawbacks, the main one being that kurtosis is very sensitive to outliers. In other words, kurtosis is not a robust measure of non-gaussianity.
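The behaviour of kurtosis described above can be checked numerically. The sketch below estimates the kurtosis of eq. (2.6) from samples that are first standardized to zero mean and unit variance. The three test distributions (Gaussian, uniform, Laplacian), the sample size, and the outlier value are assumptions chosen for illustration.

```python
import random

def kurtosis(samples):
    """Kurtosis of eq. (2.6) after standardizing the samples to zero mean and unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [v - mean for v in samples]
    m2 = sum(v ** 2 for v in centered) / n
    m4 = sum(v ** 4 for v in centered) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)
n = 20000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
uniform = [rng.uniform(-1.0, 1.0) for _ in range(n)]                           # sub-Gaussian
laplace = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(n)]   # super-Gaussian

k_gauss = kurtosis(gaussian)
k_uniform = kurtosis(uniform)
k_laplace = kurtosis(laplace)
# A single outlier among 20000 Gaussian samples changes the estimate considerably.
k_outlier = kurtosis(gaussian + [20.0])
```

The Gaussian estimate stays near zero, the uniform one is negative, and the Laplacian one is positive. The last line illustrates the lack of robustness: one extreme value is enough to move the Gaussian estimate far from zero.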
For this reason, we seek other measures of non-gaussianity. Another important measure is negentropy, which is based on the information-theoretic quantity of entropy. Entropy is closely related to the coding length of the random variable.
Let us define the entropy $H$ of a discrete random variable $Y$ as:
$$H(Y) = -\sum_i P(Y = a_i) \log P(Y = a_i), \tag{2.7}$$
where the $a_i$ are the possible values of $Y$.
If the random variables and vectors are continuous, this kind of entropy is called differential entropy. The differential entropy $H$ of a random vector $y$ with density $f(y)$ is defined as:
$$H(y) = -\int f(y) \log f(y)\, dy. \tag{2.7}$$
Because a Gaussian variable has the largest entropy among all random variables of equal variance [18], we can use entropy as a measure of non-gaussianity. In order to obtain a measure of non-gaussianity that is zero for a Gaussian variable and always nonnegative, a slightly modified version of differential entropy, called negentropy, is defined as:
$$J(y) = H(y_{gauss}) - H(y), \tag{2.8}$$
where $y_{gauss}$ is a Gaussian random variable with the same covariance matrix as $y$. By eq. (2.8), negentropy is always non-negative, and it is zero only when $y$ has a Gaussian distribution. The drawback of negentropy is that it is difficult to compute, so simpler approximations of negentropy are discussed next.
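For a few one-dimensional distributions the entropies are known in closed form, so the definitions above can be illustrated directly. The sketch below uses the standard closed-form differential entropies of the Gaussian, uniform, and Laplacian densities (in nats); the choice of these distributions and the function names are assumptions made for illustration.

```python
import math

def discrete_entropy(probs):
    """Entropy -sum_i P(Y = a_i) log P(Y = a_i) of a discrete variable, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gaussian_entropy(variance):
    """Differential entropy of a one-dimensional Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def negentropy_uniform(half_width):
    """J(y) = H(y_gauss) - H(y) for y uniform on [-half_width, half_width]."""
    variance = half_width ** 2 / 3.0
    entropy = math.log(2.0 * half_width)
    return gaussian_entropy(variance) - entropy

def negentropy_laplace(scale):
    """J(y) = H(y_gauss) - H(y) for a Laplacian y with the given scale parameter."""
    variance = 2.0 * scale ** 2
    entropy = 1.0 + math.log(2.0 * scale)
    return gaussian_entropy(variance) - entropy
```

Both negentropies are positive and do not depend on the width or scale of the distribution, which agrees with the statement that only a Gaussian variable attains zero negentropy.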
The classical method of approximating negentropy uses higher-order moments as follows:
$$J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48} \mathrm{kurt}(y)^2. \tag{2.9}$$
However, the validity of such approximations may be rather limited. In particular, they suffer from the same non-robustness as kurtosis.
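Therefore, new approximations were developed based on the maximum-entropy principle [19]. The approximation is shown below:
$$J(y) \approx \sum_i k_i \left[ E\{G_i(y)\} - E\{G_i(\nu)\} \right]^2, \tag{2.10}$$
where the $k_i$ are positive constants, $\nu$ is a Gaussian variable of zero mean and unit variance, the variable $y$ is assumed to be of zero mean and unit variance, and the functions $G_i$ are some non-quadratic functions. In this case, we use only one non-quadratic function $G$, and then the approximation becomes: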
$$J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2. \tag{2.11}$$
If $y$ is symmetric, eq. (2.11) is a generalization of the moment-based approximation in eq. (2.9). In particular, by choosing a $G$ that does not grow too fast, one can obtain more robust estimators. Thus, we obtain approximations of negentropy that give a very good compromise between the properties of the two classical non-gaussianity measures given by kurtosis and negentropy. They are conceptually simple and fast to compute, yet have appealing statistical properties, especially robustness.
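The approximation in eq. (2.11) can be evaluated from samples once a function $G$ is fixed. The sketch below assumes the slowly growing choice $G(u) = -\exp(-u^2/2)$, for which the Gaussian expectation $E\{G(\nu)\}$ equals $-1/\sqrt{2}$ in closed form. This particular $G$, the test distributions, and the sample size are assumptions made for illustration.

```python
import math
import random

def G(u):
    """Assumed slowly growing contrast function G(u) = -exp(-u^2 / 2)."""
    return -math.exp(-u * u / 2.0)

# E{G(nu)} for a standard Gaussian nu, in closed form: -1 / sqrt(2).
E_G_GAUSS = -1.0 / math.sqrt(2.0)

def negentropy_approx(samples):
    """Right-hand side of eq. (2.11) for samples standardized to zero mean and unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in samples) / n)
    mean_G = sum(G((v - mean) / std) for v in samples) / n
    return (mean_G - E_G_GAUSS) ** 2

rng = random.Random(0)
n = 20000
j_gauss = negentropy_approx([rng.gauss(0.0, 1.0) for _ in range(n)])
j_uniform = negentropy_approx([rng.uniform(-1.0, 1.0) for _ in range(n)])
j_laplace = negentropy_approx(
    [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(n)]
)
```

The value for Gaussian samples is close to zero, while both the sub-Gaussian and the super-Gaussian samples give clearly larger values, so the measure is nonnegative and separates Gaussian from non-Gaussian data regardless of the sign of the kurtosis.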
2.2.2 Implementation of FastICA Algorithm
We have introduced the measures of non-gaussianity, which are the objective functions for ICA estimation. In practice, we also need an algorithm for maximizing a contrast function such as eq. (2.11). In the following, we introduce a very efficient maximization method suited to this task and show how to find the ICA basis. We first present the one-unit version of FastICA and then extend it to the several-unit version.
FastICA for one unit
Here, the ‘unit’ means a computational unit, that is, an artificial neuron with a weight vector $w$ that can be updated by a learning rule. The learning rule of FastICA finds a direction $w$ such that the projection $w^T x$ maximizes the non-gaussianity measured by eq. (2.11). Recall that the variance of $w^T x$ must be constrained to unity, which for whitened data is equivalent to constraining the norm of $w$ to be unity.
Because the non-quadratic function G used in eq. (2.11) must not grow too fast for obtaining a robust estimator, we choose G as [4]:
( )= − ⋅exp(− 2/ 2)
The basic form of the FastICA algorithm is shown below:
1. Center the data to make its mean zero.
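Since only the first step of the algorithm is listed here, the following sketch should be read as an illustration of the commonly published one-unit fixed-point iteration, not as the exact procedure of this thesis. It assumes $G(u) = -\exp(-u^2/2)$, so that $g(u) = u\exp(-u^2/2)$ is its derivative, and it assumes the update $w \leftarrow E\{x\,g(w^T x)\} - E\{g'(w^T x)\}\,w$ followed by normalization. The test data are two unit-variance sources mixed by a rotation, so the mixture is already approximately white; all names and parameters are hypothetical.

```python
import math
import random

def g(u):
    """Derivative of the assumed contrast function G(u) = -exp(-u^2 / 2)."""
    return u * math.exp(-u * u / 2.0)

def g_prime(u):
    """Derivative of g."""
    return (1.0 - u * u) * math.exp(-u * u / 2.0)

def normalize(w):
    norm = math.sqrt(sum(c * c for c in w))
    return [c / norm for c in w]

def center(data):
    """Step 1: subtract the sample mean from every vector."""
    n, dim = len(data), len(data[0])
    mean = [sum(x[d] for x in data) / n for d in range(dim)]
    return [[x[d] - mean[d] for d in range(dim)] for x in data]

def fastica_one_unit(data, w0, max_iter=200, tol=1e-10):
    """Assumed fixed-point iteration on centered, whitened data (list of vectors)."""
    n, dim = len(data), len(w0)
    w = normalize(w0)
    for _ in range(max_iter):
        proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in data]
        mean_gp = sum(g_prime(u) for u in proj) / n
        w_new = normalize([
            sum(x[d] * g(u) for x, u in zip(data, proj)) / n - mean_gp * w[d]
            for d in range(dim)
        ])
        # Converged when the direction no longer changes (up to sign).
        converged = abs(sum(a * b for a, b in zip(w_new, w))) > 1.0 - tol
        w = w_new
        if converged:
            break
    return w

def demo_data(n=5000, theta=0.6, seed=1):
    """Uniform and Laplacian unit-variance sources mixed by a rotation (hypothetical test data)."""
    rng = random.Random(seed)
    c, s = math.cos(theta), math.sin(theta)
    data = []
    for _ in range(n):
        s1 = rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))
        s2 = rng.expovariate(math.sqrt(2.0)) * rng.choice((-1.0, 1.0))
        data.append([c * s1 - s * s2, s * s1 + c * s2])
    return center(data), [(c, s), (-s, c)]
```

For the rotated mixture, the true unmixing directions are the two columns of the rotation matrix, so the returned unit vector can be compared with them directly, for example with `data, directions = demo_data()` followed by `w = fastica_one_unit(data, [1.0, 0.3])`.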
FastICA for several units
The one-unit FastICA algorithm estimates only one of the independent components, or one projection pursuit direction. To estimate several independent components, we run the one-unit FastICA algorithm using several units with weight vectors $w_1, \ldots, w_n$. To prevent different vectors from converging to the same maxima, we decorrelate the outputs $w_1^T x, \ldots, w_n^T x$ in every iteration.
A simple way of achieving decorrelation is a deflation scheme based on a Gram-Schmidt-like decorrelation. This means that we must estimate the independent components one by one. When we have estimated $p$ independent components, or $p$ vectors $w_1, \ldots, w_p$, we run the one-unit algorithm for $w_{p+1}$, and after every iteration step we subtract from $w_{p+1}$ the projections $(w_{p+1}^T w_j) w_j$, $j = 1, \ldots, p$, and then renormalize $w_{p+1}$. The detailed steps are listed below:
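1. Choose $n$, the number of independent components to estimate. Set $p \leftarrow 1$.
2. Initialize $w_p$ randomly.
3. Do an iteration of a one-unit algorithm on $w_p$.
4. Do the following decorrelation:
$$w_p \leftarrow w_p - \sum_{j=1}^{p-1} (w_p^T w_j) w_j$$
5. Normalize $w_p$ by dividing it by its norm.
6. If $w_p$ has not converged, go back to step 3.
7. Set $p \leftarrow p+1$. If $p$ does not exceed the desired number of independent components, go back to step 2.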
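The decorrelation and renormalization used in the deflation scheme can be sketched as a small helper. The function name `deflate` and the plain-list vector representation are assumptions for illustration; the previously estimated vectors are assumed to be orthonormal.

```python
import math

def deflate(w, found):
    """Subtract from w its projections onto the already estimated vectors, then renormalize.

    `found` is a list of previously estimated unit vectors w_1, ..., w_p.
    """
    dots = [sum(a * b for a, b in zip(w, w_j)) for w_j in found]
    residual = [
        w[d] - sum(dot * w_j[d] for dot, w_j in zip(dots, found))
        for d in range(len(w))
    ]
    norm = math.sqrt(sum(c * c for c in residual))
    if norm == 0.0:
        raise ValueError("w lies in the span of the vectors already found")
    return [c / norm for c in residual]
```

Calling `deflate([0.5, 0.5, 2.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])` leaves only the component orthogonal to the two known vectors, so repeated one-unit iterations cannot converge to a direction that has already been found.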
When the algorithm stops, we obtain the independent components of the original MFCC features. Figure 2-2 shows the flowchart of the FastICA algorithm.
Fig. 2-2 Block diagram of the FastICA algorithm
2.3 Generalized Probabilistic Descent Method
Traditionally, the classifiers of most existing recognizers have been designed on the principle of maximum likelihood (ML). That is, the expectation-maximization (EM) method, which is an extended ML estimation method for incomplete data [21], and segmental k-means clustering [22] are used for training the acoustic model. However, the conventional ML-based approach has two basic problems: the functional form of the class distribution (the conditional probability density function) to be estimated is rarely known in practice, and maximizing the likelihood of the estimated functions does not directly minimize classification errors. Besides, the ML-based approach covers only the classifier design; it does not optimize the overall system [23].
One solution to the above problems, which also meets the need for better recognition performance, is the generalized probabilistic descent (GPD) method, which is based on the discriminative function approach (DFA) developed for classifier design [20]. The GPD algorithm has been shown to be consistent with the objective of minimizing the classification error rate and to be very useful in various pattern recognition tasks. This thesis therefore applies the GPD approach to speaker recognition in a GMM-based system.
2.3.1 Discriminative Function Approach (DFA)
Consider a set of training samples $X = \{x_1, x_2, \ldots, x_N\}$, where each $x_j$ is a $D$-dimensional vector and is known to belong to one of $S$ classes $C_s$, $s = 1, 2, \ldots, S$. A classifier comprises a set of parameters and a decision rule.
In DFA, a discriminant function $g_s(x_j; \lambda_s)$ is introduced for $C_s$ to measure the class membership of the input $x_j$, where $\lambda_s$ denotes the parameters of class $C_s$. The discriminant function can be a probability function, a distance, a similarity, or any reasonable type of measure. The discriminant function is then used to implement the decision rule shown below [23]:
$$C(x_j) = C_k \quad \text{iff} \quad k = \arg\max_s g_s(x_j; \lambda_s). \tag{2.16}$$
This approach is more direct with regard to the minimization of classification errors than the ML-based approach, where the class model parameters are designed independently of each other. However, there is plenty of room left for improvement in the DFA, as summarized in the following:
1) Execution of rule (2.16) using an arbitrary measure as the discriminant function does not necessarily lead to the minimum error probability situation.
2) The design scope does not cover the overall recognizer.
3) Most of the existing training procedures are empirical or heuristic; that means their mathematical optimality is unclear.
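The decision rule of (2.16) amounts to evaluating every discriminant function on the input and choosing the class with the largest value. In the sketch below, the discriminants are hypothetical negative squared distances to one-dimensional class prototypes, chosen only to make the example self-contained.

```python
def classify(x, discriminants):
    """Return the index k of the class whose discriminant function is largest on x."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical discriminants: negative squared distance to a class prototype.
prototypes = [0.0, 5.0, 10.0]
discriminants = [lambda x, m=m: -(x - m) ** 2 for m in prototypes]

label = classify(4.0, discriminants)   # index of the prototype nearest to 4.0
```

Any other measure, such as a likelihood or a similarity, can be substituted for the distance-based discriminants without changing `classify`.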
2.3.2 Generalized Probabilistic Descent (GPD) Method
For the reasons given in subsection 2.3.1, GPD was devised as a method for pursuing the overall optimality of a recognizer.
The fundamental concept of the GPD formalization is to embed the overall process of classifying a pattern $x_j$ directly in a smooth functional form that is suited to a practical optimization method, especially gradient search optimization [22],[24]. In the following, we describe an embodiment of GPD for the GMM classifier in detail. GPD is formalized in the following three steps:
1) Choose GMM as a discriminant function
A Gaussian mixture density is a weighted sum of $M$ component densities, as shown in Fig. 2-3. Here $x_j$ is a $D$-dimensional vector, $b_i(x_j)$ are the component densities, and $w_i$ are the mixture weights, where $i = 1, \ldots, M$.
Fig. 2-3 Depiction of an M Component Gaussian Mixture Density
Each component density is a $D$-variate Gaussian function of the form:
$$b_i(x_j) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x_j - \mu_i)' \Sigma_i^{-1} (x_j - \mu_i) \right\}$$
with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The mixture weights satisfy the constraint that
$$\sum_{i=1}^{M} w_i = 1.$$
In order to simplify the following computation, we use the Baum-Welch algorithm [25]. Based on an existing model $\lambda$, this algorithm transforms the objective function $p(x_j | \lambda)$ into a new function $Q(\lambda, \lambda')$ that essentially measures a divergence between the existing model $\lambda$ and an updated model $\lambda'$. It can be shown that $Q(\lambda, \lambda') \geq Q(\lambda, \lambda)$ implies $p(x_j | \lambda') \geq p(x_j | \lambda)$. Therefore, we define the discriminant function as:
for classifier
s,
one. Accordingly, the method achieves a smooth discriminant function for the pattern $x_j$. Then, we use the discriminant function to implement the decision rule stated in eq. (2.16):
$$C(x_j) = C_k \quad \text{iff} \quad k = \arg\max_s g_s(x_j; \lambda_s).$$
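A Gaussian mixture density of the kind used in this step can be evaluated as follows. The sketch assumes diagonal covariance matrices, so each covariance is represented by a list of variances. It computes only the mixture density, not the discriminant function of this thesis; the function names and the example parameters are hypothetical.

```python
import math

def gaussian_diag(x, mean, var):
    """D-variate Gaussian density with a diagonal covariance (variances in `var`)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-0.5 * (d * math.log(2.0 * math.pi) + log_det + maha))

def gmm_density(x, weights, means, variances):
    """Weighted sum of M component densities; the weights must sum to one."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return sum(
        w * gaussian_diag(x, m, v) for w, m, v in zip(weights, means, variances)
    )

# Hypothetical two-component mixture in two dimensions.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
p = gmm_density([2.5, 3.0], weights, means, variances)
```

Evaluating one such mixture per speaker on the same input and taking the largest value gives a simple instance of the decision rule (2.16).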
2) Define a smooth misclassification measure
The smooth optimization criterion is a function of the discriminant functions $g_s(x_j; \lambda_s)$, $s = 1, \ldots, S$. Again, the classifier makes its decision for each pattern $x_j$ by choosing the largest of the discriminant functions evaluated on $x_j$. The key to the new error criterion is to express the decision rule of (2.16) in a functional form. Among many possibilities, the following is a typical definition of the class misclassification measure for $x_j$ ($\in C_k$):
$$d_k(x_j; \lambda_k) = -g_k(x_j; \lambda_k) + \left[ \frac{1}{S-1} \sum_{s \neq k} g_s(x_j; \lambda_s)^{\mu} \right]^{1/\mu}, \tag{2.19}$$
where $\mu$ is a positive constant [26]. This misclassification measure is a continuous function of the classifier parameter $\lambda$ and attempts to emulate the decision rule. A large $d_k(x_j; \lambda_k)$ implies that the input is more definitely misclassified. By varying the value of $\mu$, we can take all the competing classes into consideration in the process of optimizing the classifier parameter $\lambda$. To complete the definition of the objective criterion, the misclassification measure of (2.19) is used in the third step, where the recognition error is counted.
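The role of the constant $\mu$ can be illustrated with a small numerical sketch. It assumes the commonly used form in which the misclassification measure is the negated discriminant of the correct class plus the $\mu$-th power mean of the competing discriminants, $d_k = -g_k + [\frac{1}{S-1}\sum_{s \neq k} g_s^{\mu}]^{1/\mu}$, and it assumes positive discriminant values such as likelihoods. The function name and the example scores are hypothetical.

```python
def misclassification(scores, k, mu):
    """Assumed measure d_k = -g_k + [ (1/(S-1)) * sum_{s != k} g_s^mu ]^(1/mu).

    `scores` holds positive discriminant values g_s; k is the index of the true class.
    """
    if any(g <= 0.0 for g in scores):
        raise ValueError("this sketch assumes positive discriminant values")
    rivals = [g for s, g in enumerate(scores) if s != k]
    power_mean = (sum(g ** mu for g in rivals) / len(rivals)) ** (1.0 / mu)
    return -scores[k] + power_mean

scores = [0.6, 0.3, 0.1]                              # hypothetical discriminant values
d_average = misclassification(scores, 0, mu=1.0)      # rivals are averaged
d_sharp = misclassification(scores, 0, mu=50.0)       # dominated by the best rival
d_wrong = misclassification(scores, 2, mu=50.0)       # true class has the lowest score
```

With $\mu = 1$ all competing classes contribute equally, while a large $\mu$ makes the measure approach the difference between the best competing class and the correct class. A negative value corresponds to a correct decision and a positive value to a misclassification.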
3) Define the loss function
A general form of the loss function can be defined as:
$$l_k(x_j; \lambda_k) = l(d_k(x_j; \lambda_k)), \tag{2.20}$$
which is expressed as a function of the misclassification measure. The loss function $l$ is a sigmoid function. For minimum error classification, the following loss function is merely one of several possibilities:
$$l(d_k) = \frac{1}{1 + e^{-\alpha d_k}}, \quad (\alpha > 0). \tag{2.21}$$
When the misclassification measure $d_k$ is positive, it leads to a penalty which becomes a classification/recognition error count. That is, this formulation allows us to directly minimize the expected recognition error by gradient descent search methods.
This three-step method is suitable for classifier parameter optimization. Based on the criterion of (2.21), we use it to minimize the expected loss for the classifier parameter search.
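The way a sigmoid loss turns the misclassification measure into a smooth error count can be sketched as follows. The slope parameter `alpha` and the example values of the measure are assumptions for illustration.

```python
import math

def sigmoid_loss(d, alpha=1.0):
    """Sigmoid of alpha * d, computed in a numerically stable way."""
    z = alpha * d
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical misclassification measures for five training patterns.
measures = [-2.0, -0.5, 0.3, -1.2, 1.5]
hard_errors = sum(1 for d in measures if d > 0.0)             # actual error count
smooth_errors = sum(sigmoid_loss(d, alpha=20.0) for d in measures)
```

For a steep sigmoid the sum of the losses is close to the number of misclassified patterns, but unlike the hard count it is differentiable with respect to the classifier parameters.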
Optimization Method
There are various minimization algorithms which can be used to minimize the expected loss. Among them, the GPD method is a powerful algorithm that can accomplish this task. In the GPD-based minimization algorithm, the expected loss function $L(\lambda) = E[l_k(x_j; \lambda_k)]$ is minimized according to an iterative procedure.
We seek to minimize $L$ by adaptively adjusting $\lambda$ in response to the loss incurred each time a training pattern $x_j$ is presented. The adjustment of $\lambda$ is made according to:
$$\lambda_{t+1} = \lambda_t + \delta\lambda_t, \tag{2.22}$$
where $\lambda_t$ denotes the parameter set at the $t$-th iteration. The adjustment term $\delta\lambda_t$ is a function of the input pattern $x_j$ ($\in C_k$) and the current parameter set $\lambda_t$:
$$\delta\lambda_t = \delta\lambda(x_j, \lambda_t). \tag{2.23}$$
Then, we can obtain the equation:
$$E[L(\lambda_{t+1}) - L(\lambda_t)] = E[\delta L(\lambda_t)] = E[\delta\lambda(x_j, \lambda_t)] \nabla L(\lambda_t). \tag{2.24}$$
Therefore, the goal is to find an adaptation rule such that $E[\delta L(\lambda_t)] < 0$ and such that $\lambda_t$ converges to an at least locally optimal solution $\lambda^*$. The probabilistic descent algorithm is summarized in the following theorem.
Probabilistic Descent Theorem [27]:
Assume that a given pattern $x_j$ belongs to class $C_k$.
If the classifier parameter adjustment $\delta\lambda(x_j, \lambda_t)$ is specified by
$$\delta\lambda(x_j, \lambda_t) = -\varepsilon U \nabla l(x_j; \lambda_t), \tag{2.25}$$
where $\varepsilon$ is a small positive real number and $U$ is a positive-definite matrix, which is often assumed for simplicity to be a unit matrix, then
$$E[\delta L(\lambda_t)] \leq 0. \tag{2.26}$$
Furthermore, if an infinite sequence of randomly selected samples $x_j$ is used for learning and the adjustment rule of (2.25) is utilized with a corresponding learning weight sequence $\varepsilon(t)$ which satisfies
$$\sum_{t=1}^{\infty} \varepsilon(t) \to \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \varepsilon(t)^2 < \infty,$$
then the parameter sequence $\lambda_t$ converges with probability one to $\lambda^*$, which is at least a local minimum of $L(\lambda)$.
It is obviously unrealistic to carry out infinitely many probabilistic descent adjustments. In practice, the learning coefficient $\varepsilon(t)$ is usually approximated by a finite, monotonically decreasing function such as
$$\varepsilon(t) = \varepsilon(0) \left( 1 - \frac{t}{T} \right), \tag{2.30}$$
where $T$ is a preset number of adjustment repetitions.
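The probabilistic descent adjustment with the linearly decreasing learning coefficient of eq. (2.30) can be illustrated on a toy problem. The sketch below is not the GMM adjustment rule of this thesis. It assumes a one-dimensional, two-class classifier with prototype discriminants $g_s(x) = -(x - m_s)^2$, the two-class misclassification measure $d_k = -g_k + g_{other}$, a sigmoid loss with slope `alpha`, and a unit matrix $U$. All names and parameter values are hypothetical.

```python
import math
import random

def learning_rate(t, total_steps, eps0):
    """Linearly decreasing learning coefficient eps(t) = eps(0) * (1 - t / T)."""
    return eps0 * (1.0 - t / total_steps)

def sigmoid(z):
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def smooth_loss(x, k, means, alpha=1.0):
    """Sigmoid loss of the two-class misclassification measure for a pattern x of class k."""
    other = 1 - k
    d = (x - means[k]) ** 2 - (x - means[other]) ** 2   # d_k = -g_k + g_other
    return sigmoid(alpha * d)

def mean_loss(samples, means, alpha=1.0):
    return sum(smooth_loss(x, k, means, alpha) for x, k in samples) / len(samples)

def gpd_train(samples, means, total_steps, eps0=0.05, alpha=1.0, seed=0):
    """Adjust the class prototypes by one gradient step per randomly drawn pattern."""
    rng = random.Random(seed)
    m = list(means)
    for t in range(total_steps):
        x, k = rng.choice(samples)
        other = 1 - k
        loss = smooth_loss(x, k, m, alpha)
        slope = alpha * loss * (1.0 - loss)          # derivative of the sigmoid loss w.r.t. d
        grad_k = slope * (-2.0) * (x - m[k])         # d(loss)/d(m_k)
        grad_other = slope * 2.0 * (x - m[other])    # d(loss)/d(m_other)
        eps = learning_rate(t, total_steps, eps0)
        m[k] -= eps * grad_k                         # adjustment -eps * gradient (U = I)
        m[other] -= eps * grad_other
    return m
```

Starting from poorly placed prototypes, for example `gpd_train(samples, [2.0, 2.5], total_steps=2000)` on patterns drawn around 0 for class 0 and around 3 for class 1, the adjustments pull the prototypes apart and lower the average smooth loss on the training patterns.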
The resulting adjustment rule using loss function (2.21) for the GMM
In eq. (2.31)-(2.33), the adjustment is done for all of the patterns.
2.3.3 Summary of the Advantages of the GPD Formalization
The most important point of the GPD concept is to embed the entire process of a given recognition task into a smooth function. Therefore, we can optimize all of the adjustable system parameters consistently with the design objective of minimizing recognition errors.
In addition, GPD has both mathematical rigor and a great degree of practicality.
GPD was shown to provide attractive solutions to three of the four major DFA issues:
1) The design objective;
2) Optimization method;
3) Design consistency with unknown samples.
The fourth DFA issue, which is the selection of the discriminant function form, has not been fully studied yet.
Because of the above advantages, we choose GPD to modify the GMM for speaker recognition.
Chapter 3 Speaker Recognition System Based on ICA and GPD Optimizer
3.1 Overall Speaker Recognition System
The framework of our speaker recognition system is shown in Fig. 3-1 and Fig. 3-2.
In the training phase, MFCC features are extracted from the original speech signal of speaker s, and the FastICA algorithm is then used to find the independent components of the MFCCs. The MFCCs are transformed into ICA features (ICAfts) using the basis found in this step. Next, the ICAfts are used as the input of the GMM to train the model. Within this structure, the GPD method is used to optimize the GMM recognizer. Through these steps, we obtain the speaker recognition model of each speaker s.
In the test phase of the speaker recognition system, we also extract MFCCs from the speech signal and transform them with the ICA basis obtained in the training phase. Then, we use the new features to evaluate the degree (score) to which they match the GMM of each speaker. If the largest score, obtained from the model of some speaker k, is smaller than a threshold set in advance, we reject the speaker as an impostor. Otherwise, we accept the speaker as a customer.
Fig. 3-1 Training phase of our speaker recognition system for each speaker s.
Fig. 3-2 Test phase of our speaker recognition system
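The accept/reject decision of the test phase can be sketched as follows. The speaker names, the score values, and the threshold are hypothetical; in the system described above the scores would come from the GMM of each enrolled speaker.

```python
def verify(scores, threshold):
    """Return the best-matching speaker, or None if the largest score is below the threshold.

    `scores` maps each enrolled speaker to the matching score of the test utterance.
    """
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None          # rejected as an impostor
    return best              # accepted as a customer

# Hypothetical scores for three enrolled speakers.
scores = {"speaker_a": -42.1, "speaker_b": -37.5, "speaker_c": -55.0}
accepted = verify(scores, threshold=-40.0)
rejected = verify(scores, threshold=-30.0)
```

Raising the threshold rejects more impostors at the cost of rejecting more genuine customers, so its value has to be chosen in advance according to the application.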
3.2 Each Block of Speaker Recognition System
In this section, we will decompose the entire speaker recognition system into blocks. After that, we will detail each block of the recognition system.
3.2.1 Feature Extraction
MFCC is widely used in automatic speech recognition (ASR) applications, primarily for three reasons [28]: 1) the cepstral features are roughly orthogonal because of the DCT, 2) cepstral mean subtraction eliminates static channel noise, and 3) MFCC is less sensitive to additive noise than linear prediction cepstral coefficients (LPCC). The key component of MFCC responsible for noise robustness is the filter bank; the filters smooth the spectrum, reducing variation due to additive noise across the bandwidth of each filter.
First, the speech signal is pre-processed by a high-pass filter. Next, a segment (frame) of speech is windowed and transformed to the frequency domain via the fast Fourier transform (FFT). The magnitude spectrum of the utterance is then passed through a bank of triangular-shaped filters whose center frequencies are spaced along the perceptually motivated Mel frequency scale. Finally, the energy output from each filter is log-compressed and transformed to the cepstral domain via the discrete cosine transform (DCT). The block of feature extraction is shown in Fig. 3-3.
Fig. 3-3 Block diagram of Feature Extraction
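Two stages of the feature extraction, the Mel-spaced filter centers and the final log compression followed by the DCT, can be sketched in a few lines. The Mel formula $2595 \log_{10}(1 + f/700)$ is a commonly used convention that is assumed here, since the exact formula is not given above. The function names, the number of filters, and the frequency range are likewise hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Assumed Mel-scale convention: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def filter_centers(n_filters, low_hz, high_hz):
    """Center frequencies (Hz) of n_filters triangular filters spaced evenly on the Mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]

def cepstrum(filter_energies, n_coeffs):
    """Log-compress the filter-bank energies and apply a DCT-II to obtain cepstral coefficients."""
    log_e = [math.log(e) for e in filter_energies]
    m = len(log_e)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / m) for i, v in enumerate(log_e))
        for k in range(n_coeffs)
    ]

centers = filter_centers(n_filters=20, low_hz=0.0, high_hz=4000.0)
coeffs = cepstrum([2.0] * 8, n_coeffs=4)   # a flat spectrum gives zero higher coefficients
```

The centers are densely spaced at low frequencies and sparsely spaced at high frequencies, which reflects the perceptual motivation of the Mel scale. The pre-emphasis, windowing, and FFT stages are omitted from this sketch.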
3.2.2 ICA Algorithm
ICA can find a linear non-orthogonal coordinate system in multivariate data, determined by high-order statistics. Its goal is to linearly transform the data such that the transformed variables are as statistically independent from each other as possible [29], [30]. Like data mining, ICA can extract hidden predictive information from large databases, and it is a powerful technique with great potential for finding the most important information in the data.
ICA not only decorrelates the signals but also reduces higher-order statistical dependencies. We use it to find the most important and independent components of MFCC.
The block diagram of the ICA algorithm is shown in Fig. 2-2.
3.2.3 GPD-Based GMM
The most important concept of the GPD method is to formalize the overall procedure of the task into an optimized design process. Its objective is to directly minimize the recognition error rate.
One advantage of using GPD as the optimizer of the speaker recognition model is that the structure of the conventional speaker recognizer can be kept intact without modification. This demonstrates the practical value of the GPD method when it is incorporated into existing recognizer designs.
In addition, for reducing our computation, we will rewrite the equations in subsection 2.3.2. We assume that the covariance matrix is diagonal and the values of