

1 Introduction

1.3 The Organization of This Dissertation

The remainder of this dissertation is organized as follows. Chapters 2 and 3 describe, respectively, the MVE training methods and the kernel discriminant analysis techniques used to improve the characterization of the alternative hypothesis. Chapter 4 introduces the proposed DFA framework for improving the GMM-UBM method. Then, in Chapter 5, we present our conclusions.

Chapter 2

Improving the Characterization of the Alternative Hypothesis via Minimum Verification Error Training

To handle the speaker-verification problem more effectively, we propose a framework that characterizes the alternative hypothesis by exploiting information available from background models, such that the utterances of impostors can be more effectively distinguished from those of the target speaker. The framework is built on either a weighted geometric combination (WGC) or a weighted arithmetic combination (WAC) of the likelihoods computed for background models. In contrast to the geometric mean in LGeo(U) defined in Eq. (1.6) or the arithmetic mean in LAri(U) defined in Eq. (1.5), both of which are independent of the system training, our combination scheme treats the background models unequally according to how close each individual model is to the target speaker model, and quantifies this unequal treatment by a set of weights optimized in the training phase.

The optimization is carried out with the minimum verification error (MVE) criterion [Chou 2003; Rosenberg 1998], which minimizes both the false acceptance probability and the false rejection probability. Since the characterization of the alternative hypothesis is closely related to the verification accuracy, the resulting system is expected to be more effective and robust than systems based on conventional methods.

The concept of MVE training stems from minimum classification error (MCE) training [Juang 1997; Siohan 1998; McDermott 2007; Ma 2003]; the former can be viewed as a special case of the latter when the classes to be distinguished are binary. Although MVE training has been extensively studied in the literature [Chou 2003; Rosenberg 1998; Sukkar 1996, 1998; Rahim 1997; Kuo 2003; Siu 2006], most studies focus on better estimating the parameters of the target model. In contrast, we try to improve the characterization of the alternative hypothesis by applying MVE training to optimize the parameters associated with the combinations of the likelihoods from a set of background models. Traditionally, MVE training has been realized by gradient descent algorithms, e.g., the generalized probabilistic descent (GPD) algorithm [Chou 2003], but such approaches are only guaranteed to converge to a local optimum.

To overcome such a limitation, we propose a new MVE training method, called evolutionary MVE (EMVE) training, for learning the parameters associated with WAC and WGC based on a genetic algorithm (GA) [Eiben 2003]. It has been shown in many applications that GA-based optimization is superior to gradient-based optimization, because of GA’s global scope and parallel searching power. To facilitate the EMVE training, we designed a new mutation operator, called the one-step gradient descent operator (GDO), for the genetic algorithm. The results of speaker verification experiments conducted on the Extended M2VTS Database (XM2VTSDB) [Messer 1999] demonstrate that the proposed methods outperform conventional LR-based approaches.

The remainder of this chapter is organized as follows. Section 2.1 presents the proposed methods for characterizing the alternative hypothesis. Sections 2.2 and 2.3 describe, respectively, the gradient-based MVE training and the EMVE training used to optimize our methods. Section 2.4 contains the experimental results.

2.1. Characterization of the Alternative Hypothesis

To characterize the alternative hypothesis, we generate a set of background models using data that does not belong to the target speaker. Instead of using the heuristic arithmetic mean or geometric mean, our goal is to design a function Ψ(⋅) that optimally exploits the information available from background models. In this section, we present our approach, which is based on either the weighted arithmetic combination (WAC) or the weighted geometric combination (WGC) of the useful information available. Moreover, the LR measure based on WAC or WGC can be viewed as a generalized and trainable version of LUBM(U) in Eq. (1.3), LMax(U) in Eq. (1.4), LAri(U) in Eq. (1.5), or LGeo(U) in Eq. (1.6).

2.1.1. The Weighted Arithmetic Combination (WAC)

First, we define the function Ψ(⋅) in Eq. (1.7) based on the weighted arithmetic combination as

$$p(U \mid \bar{\lambda}) = \Psi\big(p(U \mid \lambda_1), p(U \mid \lambda_2), \ldots, p(U \mid \lambda_N)\big) = \sum_{i=1}^{N} w_i\, p(U \mid \lambda_i), \qquad (2.1)$$

where wi is the weight of the likelihood p(U | λi), subject to $\sum_{i=1}^{N} w_i = 1$. This function assigns different weights to the N background models to indicate their individual contributions to the alternative hypothesis. Suppose all the N background models are Gaussian Mixture Models (GMMs); then, Eq. (2.1) can be viewed as a mixture of Gaussian mixture density functions.

From this perspective, the alternative hypothesis model λ̄ can be viewed as a GMM with two layers of mixture weights, where one layer represents each background model and the other represents the combination of background models.
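As an illustration, the WAC score can be computed directly from per-utterance background likelihoods. The sketch below assumes scikit-learn GaussianMixture models and uses the common per-frame average log-likelihood as log p(U | λi); the function name and interface are ours, not part of the dissertation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # background models assumed already fitted

def wac_log_likelihood(frames, background_gmms, weights):
    """Weighted arithmetic combination (Eq. (2.1)), returned in the log domain.

    frames: (T, D) array of feature vectors for utterance U.
    background_gmms: N fitted GaussianMixture models (lambda_1, ..., lambda_N).
    weights: (N,) array of combination weights w_i summing to 1.
    """
    # Per-frame average log-likelihood stands in for log p(U | lambda_i).
    log_likes = np.array([g.score_samples(frames).mean() for g in background_gmms])
    # log sum_i w_i * p(U | lambda_i), computed with log-sum-exp for stability.
    a = np.log(weights) + log_likes
    m = a.max()
    return float(m + np.log(np.exp(a - m).sum()))
```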

2.1.2. The Weighted Geometric Combination (WGC)

Alternatively, we can define the function Ψ(⋅) in Eq. (1.7) from the perspective of the weighted geometric combination as

$$p(U \mid \bar{\lambda}) = \Psi\big(p(U \mid \lambda_1), p(U \mid \lambda_2), \ldots, p(U \mid \lambda_N)\big) = \prod_{i=1}^{N} p(U \mid \lambda_i)^{w_i}. \qquad (2.2)$$

Similar to the weighted arithmetic combination, Eq. (2.2) considers the individual contribution of each background model to the alternative hypothesis by assigning a weight to each likelihood value. One additional advantage of WGC is that it avoids the problem of p(U | λ̄) → 0. This problem can arise with the heuristic geometric mean because some likelihood values may be rather small when the background models λi are irrelevant to an input utterance U, i.e., p(U | λi) → 0. However, if a weight is attached to each background model, Ψ(⋅) defined in Eq. (2.2) should be less sensitive to a tiny likelihood value; hence, it should be more robust and reliable than the heuristic geometric mean.
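In the log domain, Eq. (2.2) is just a weighted sum of log-likelihoods, which also makes the robustness argument concrete: a vanishing p(U | λi) contributes a large negative term only in proportion to its weight. A companion sketch to the WAC function above, under the same assumptions:

```python
import numpy as np

def wgc_log_likelihood(frames, background_gmms, weights):
    """Weighted geometric combination (Eq. (2.2)) in the log domain:
    log p(U | lambda_bar) = sum_i w_i * log p(U | lambda_i).

    Working with logs avoids underflow when some p(U | lambda_i) -> 0, and a
    small weight w_i shrinks the influence of an irrelevant background model.
    """
    log_likes = np.array([g.score_samples(frames).mean() for g in background_gmms])
    return float(np.dot(weights, log_likes))
```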

2.1.3. Relation to Conventional LR Measures

We observe that Eq. (2.1) and Eq. (2.2) are equivalent to the arithmetic mean and the geometric mean, respectively, when wi = 1/N, i = 1, 2, …, N; in other words, when all the background models are assumed to contribute equally. It is also clear that both Eq. (2.1) and Eq. (2.2) degenerate to a maximum function if we set $w_{i^*} = 1$, where $i^* = \arg\max_{1 \le i \le N} p(U \mid \lambda_i)$, and $w_i = 0$, $\forall i \neq i^*$. Furthermore, the logarithmic LR measure based on Eq. (2.1) or Eq. (2.2) degenerates to LUBM(U) in Eq. (1.3) if only a UBM Ω is used as the background model.

Thus, both WAC- and WGC-based logarithmic LR measures can be viewed as generalized and trainable versions of LUBM(U) in Eq. (1.3), LMax(U) in Eq. (1.4), LAri(U) in Eq. (1.5), or LGeo(U) in Eq. (1.6).
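These degenerate cases double as a sanity check for any implementation. A small numeric illustration with hypothetical likelihood values:

```python
import numpy as np

# Hypothetical likelihoods p(U | lambda_i) for N = 3 background models.
p = np.array([0.20, 0.05, 0.50])

w = np.ones(3) / 3                                    # equal weights
assert np.isclose(w @ p, p.mean())                    # WAC -> arithmetic mean
assert np.isclose(w @ np.log(p), np.log(p).mean())    # WGC -> geometric mean (log domain)

w = np.eye(3)[p.argmax()]                             # w_{i*} = 1, all other w_i = 0
assert np.isclose(w @ p, p.max())                     # WAC -> maximum
assert np.isclose(w @ np.log(p), np.log(p.max()))     # WGC -> maximum (log domain)
```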

In the WAC method, we refer to the alternative hypothesis model λ̄ defined in Eq. (2.1) as a 2-layer GMM (GMM2), since it involves both inner and outer mixture weights. GMM2 differs from the UBM Ω in that it characterizes the relationship between individual background models through the outer mixture weights, rather than simply pooling all the available data and training a single background model represented by a GMM. Note that the inner and outer mixture weights are trained by different algorithms. Specifically, the inner mixture weights are estimated using the standard expectation-maximization (EM) algorithm [Huang 2001], while the outer mixture weights are estimated using minimum verification error (MVE) training or evolutionary MVE (EMVE) training, which we discuss in Sections 2.2 and 2.3, respectively. In other words, GMM2 integrates Bayesian learning and discriminative training algorithms. The objective is to optimize the LR measure by considering the null hypothesis and the alternative hypothesis jointly.

2.1.4. Background Model Selection

In general, the more speakers that are used as background models, the better the characterization of the alternative hypothesis will be. However, it has been found [Reynolds 1995; Rosenberg 1992; Liu 1996; Higgins 1991; Auckenthaler 2000; Sturim 2005] that using a set of pre-selected representative models usually makes the system more effective and efficient than using the entire collection of available speakers. For this reason, we present two approaches for selecting background models to strengthen our WAC- and WGC-based methods.

A. Combining cohort models and the world model

Our first approach selects B + 1 background models, comprising the B cohort models used in LMax(U), LAri(U), and LGeo(U), and the one world model used in LUBM(U), for WAC in Eq. (2.1) and WGC in Eq. (2.2). Depending on the definition of a cohort, we consider two commonly-used methods [Reynolds 1995]. One selects the B closest speaker models {λcst 1, λcst 2, …, λcst B} for each target speaker; the other selects the B/2 closest speaker models {λcst 1, λcst 2, …, λcst B/2}, plus the B/2 farthest speaker models {λfst 1, λfst 2, …, λfst B/2}, for each target speaker. Here, the degree of closeness is measured in terms of the pairwise distance defined in [Reynolds 1995]:

$$d(\lambda_i, \lambda_j) = \log\frac{p(U_i \mid \lambda_i)}{p(U_i \mid \lambda_j)} + \log\frac{p(U_j \mid \lambda_j)}{p(U_j \mid \lambda_i)}, \qquad (2.3)$$

where λi and λj are speaker models trained using the i-th speaker’s utterances Ui and the j-th speaker’s utterances Uj, respectively. As a result, each target speaker has a sequence of background models, {Ω, λcst 1, λcst 2, …, λcst B} or {Ω, λcst 1, …, λcst B/2, λfst 1, …, λfst B/2}, for Eqs. (1.7), (2.1), and (2.2).
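A possible realization of this selection step, assuming fitted per-speaker GMMs and their training frames are at hand (the helper names and the per-frame log-likelihood normalization are our assumptions):

```python
def pairwise_distance(frames_i, frames_j, gmm_i, gmm_j):
    """Symmetric model distance of Eq. (2.3), with per-frame average
    log-likelihoods standing in for log p(U | lambda)."""
    return ((gmm_i.score_samples(frames_i).mean() - gmm_j.score_samples(frames_i).mean())
            + (gmm_j.score_samples(frames_j).mean() - gmm_i.score_samples(frames_j).mean()))

def select_cohort(target_id, speakers, B=50, closest_only=True):
    """speakers: dict id -> (frames, gmm). Returns the B closest cohort ids,
    or the B/2 closest plus the B/2 farthest."""
    frames_t, gmm_t = speakers[target_id]
    ranked = sorted((pairwise_distance(frames_t, f, gmm_t, g), sid)
                    for sid, (f, g) in speakers.items() if sid != target_id)
    if closest_only:
        return [sid for _, sid in ranked[:B]]
    return ([sid for _, sid in ranked[:B // 2]] +
            [sid for _, sid in ranked[-(B // 2):]])
```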

B. Combining multiple types of anti-models

As shown in Eqs. (1.3) – (1.6), various types of anti-models have been studied for conventional LR measures. However, none of the LR measures developed thus far has proved to be absolutely superior to any other. Usually, LUBM(U) tends to be weak in rejecting impostors with voices similar to the target speaker’s voice, while LMax(U) is prone to falsely rejecting a target speaker; LAri(U) and LGeo(U) are between these two extremes. The advantages and disadvantages of different LR measures motivate us to combine them into a unified LR measure because of the complementary information that each anti-model can contribute.

Consider K different LR measures Li(U), each with an anti-model λ̄i, i = 1, 2, …, K. If we treat each anti-model λ̄i as a background model, the function Ψ(⋅) in Eq. (1.7) can be rewritten as

$$p(U \mid \bar{\lambda}) = \Psi\big(p(U \mid \bar{\lambda}_1), p(U \mid \bar{\lambda}_2), \ldots, p(U \mid \bar{\lambda}_K)\big). \qquad (2.4)$$

Using WAC or WGC to realize Eq. (2.4), we can form a trainable version of the conventional LR measures in Eqs. (1.3) – (1.6), where each anti-model λ̄i, i = 1, …, 4, is computed, respectively, by

$$p(U \mid \bar{\lambda}_1) = p(U \mid \Omega), \qquad (2.5)$$

$$p(U \mid \bar{\lambda}_2) = \max_{1 \le i \le B} p(U \mid \lambda_i), \qquad (2.6)$$

$$p(U \mid \bar{\lambda}_3) = \frac{1}{B} \sum_{i=1}^{B} p(U \mid \lambda_i), \qquad (2.7)$$

$$p(U \mid \bar{\lambda}_4) = \Big( \prod_{i=1}^{B} p(U \mid \lambda_i) \Big)^{1/B}. \qquad (2.8)$$

As a result, for Eq. (1.7), each target speaker has the following sequence of background models: {λ̄1, λ̄2, λ̄3, λ̄4}. We denote systems that combine multiple anti-models as hybrid anti-model systems.
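For illustration, the four anti-model likelihoods can be assembled from a UBM and a cohort in the log domain and then fed into the WAC/WGC combination; this sketch uses our own names and SciPy's logsumexp for the arithmetic mean:

```python
import numpy as np
from scipy.special import logsumexp

def hybrid_anti_loglikes(frames, ubm, cohort_gmms):
    """Log-likelihoods of the four conventional anti-models, Eqs. (2.5)-(2.8),
    computed from a UBM and B cohort GMMs (a sketch with our own names).
    The K = 4 values then enter Eq. (2.4) as "background models" whose
    WAC/WGC weights are trained with MVE."""
    cohort = np.array([g.score_samples(frames).mean() for g in cohort_gmms])
    return np.array([
        ubm.score_samples(frames).mean(),            # Eq. (2.5): UBM
        cohort.max(),                                # Eq. (2.6): maximum
        logsumexp(cohort) - np.log(len(cohort)),     # Eq. (2.7): arithmetic mean
        cohort.mean(),                               # Eq. (2.8): geometric mean
    ])
```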

2.2. Gradient-based Minimum Verification Error Training

After representing Ψ(⋅) as a trainable combination of likelihoods, the task becomes one of solving for the associated weights. To obtain an optimal set of weights, we propose using minimum verification error (MVE) training [Chou 2003; Rosenberg 1998].

The concept of MVE training stems from MCE training; the former can be viewed as a special case of the latter when the classes to be distinguished are binary. To be specific, consider a set of class discriminant functions gi(U), i = 0, 1, …, M − 1. The misclassification measure in the MCE method [Juang 1997] is defined as

$$d_i(U) = -g_i(U) + \log\Big[ \frac{1}{M-1} \sum_{j \neq i} \exp\big(\eta\, g_j(U)\big) \Big]^{1/\eta}, \qquad (2.9)$$

where η is a positive constant. When M = 2, with g0(U) and g1(U) taken as the log-likelihoods of the null and alternative hypotheses, di(U) is reduced to the mis-verification measure defined in the MVE method:

$$d_i(U) = \begin{cases} -L(U), & i = 0 \ \text{(true-speaker trials)}, \\ \phantom{-}L(U), & i = 1 \ \text{(impostor trials)}, \end{cases} \qquad (2.10)$$

where L(U) is the logarithmic LR,

$$L(U) = \log p(U \mid \lambda) - \log p(U \mid \bar{\lambda}). \qquad (2.11)$$

We further express L(U) as the following equivalent test

$$L(U) - \theta \ \begin{cases} \geq 0, & \text{accept } H_0, \\ < 0, & \text{reject } H_0, \end{cases} \qquad (2.12)$$

so that the decision threshold θ can also be included in the optimization process; the mis-verification measures accordingly become d0(U) = −L(U) + θ and d1(U) = L(U) − θ. Then, the mis-verification measure is converted into a value between 0 and 1 using a sigmoid function

$$s_i(U) = \frac{1}{1 + \exp\big(-\gamma\, d_i(U)\big)}, \qquad (2.13)$$

where γ is the slope of the sigmoid.

Next, we define the loss of each hypothesis as the average of the mis-verification measures of the training samples:

$$\ell_j = \frac{1}{N_j} \sum_{n=1}^{N_j} s_j\big(U_n^{(j)}\big), \quad j = 0, 1, \qquad (2.14)$$

where ℓ0 denotes the loss associated with false rejection errors, ℓ1 denotes the loss associated with false acceptance errors, and N0 and N1 are the numbers of utterances from true speakers and impostors, respectively. Finally, we define the overall expected loss as

$$D = x_0\, \ell_0 + x_1\, \ell_1, \qquad (2.15)$$

where x0 and x1 indicate which type of error is of greater concern in a practical application.

Accordingly, our goal is to find the weights wi in Eq. (2.1) and Eq. (2.2) such that Eq. (2.15) is minimized. This can be achieved by using the gradient descent algorithm [Chou 2003]. To ensure that the weights satisfy $\sum_{i=1}^{N} w_i = 1$, we solve wi by means of an intermediate parameter αi, where

$$w_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{N} \exp(\alpha_j)}, \qquad (2.16)$$

which is similar to the strategy used in [Juang 1997]. Parameter αi is iteratively optimized using

$$\alpha_i^{new} = \alpha_i^{old} - \delta\, \frac{\partial D}{\partial \alpha_i}, \qquad (2.17)$$

where δ is the step size; the gradient ∂D/∂αi, whose exact form depends on whether WAC or WGC is used, is given in Eq. (2.18), and the corresponding gradient with respect to the threshold, ∂D/∂θ, is given in Eq. (2.23). In our implementation, the overall expected loss is set as

$$D = C_{Miss} \times \ell_0 \times P_{Target} + C_{FalseAlarm} \times \ell_1 \times (1 - P_{Target}). \qquad (2.24)$$

Eq. (2.24) simulates the Detection Cost Function (DCF) [Van Leeuwen 2006]

$$C_{DET} = C_{Miss} \times P_{Miss} \times P_{Target} + C_{FalseAlarm} \times P_{FalseAlarm} \times (1 - P_{Target}), \qquad (2.25)$$

where CMiss denotes the cost of a miss (false rejection) error; CFalseAlarm denotes the cost of a false alarm (false acceptance) error; PMiss ≈ ℓ0 is the miss (false rejection) probability; PFalseAlarm ≈ ℓ1 is the false alarm (false acceptance) probability; and PTarget is the a priori probability of the target speaker.
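To make the training procedure concrete, the following is a minimal numpy sketch of one gradient step for the WGC-based measure, with w = softmax(α) as in Eq. (2.16) and the sigmoid losses of Eqs. (2.13) – (2.15); all function and variable names are ours, and the per-utterance log-likelihoods are assumed to be precomputed:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mve_gradient_step(alpha, theta, tgt_log, bg_tgt, imp_log, bg_imp,
                      gamma=1.0, x0=0.5, x1=0.5, delta=0.1):
    """One gradient-descent step of MVE training for the WGC-based log-LR
    L(U) = log p(U|lambda) - sum_i w_i log p(U|lambda_i), with w = softmax(alpha).

    tgt_log (N0,) and imp_log (N1,) hold log p(U|lambda) for true-speaker and
    impostor utterances; bg_tgt (N0, N) and bg_imp (N1, N) hold log p(U|lambda_i)
    under the N background models.
    """
    w = softmax(alpha)
    L0 = tgt_log - bg_tgt @ w                       # log-LRs of target trials
    L1 = imp_log - bg_imp @ w                       # log-LRs of impostor trials
    s0 = 1 / (1 + np.exp(gamma * (L0 - theta)))     # sigmoid of d0 = -L(U) + theta
    s1 = 1 / (1 + np.exp(-gamma * (L1 - theta)))    # sigmoid of d1 =  L(U) - theta
    g0 = -gamma * s0 * (1 - s0) / len(L0)           # d ell_0 / d L(U), per sample
    g1 = gamma * s1 * (1 - s1) / len(L1)            # d ell_1 / d L(U), per sample
    # Chain rule: dL/dw_i = -log p(U|lambda_i), then the softmax Jacobian.
    dD_dw = -(x0 * (g0 @ bg_tgt) + x1 * (g1 @ bg_imp))
    dD_dalpha = w * (dD_dw - dD_dw @ w)
    # theta enters d0 with a + sign and d1 with a - sign, flipping g0 and g1.
    dD_dtheta = -(x0 * g0.sum() + x1 * g1.sum())
    return alpha - delta * dD_dalpha, theta - delta * dD_dtheta
```

The softmax Jacobian keeps the weights on the simplex throughout training, which is exactly the role of the intermediate parameters αi.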

2.3. Evolutionary Minimum Verification Error Training

As the gradient descent approach may converge to an inferior local optimum, we propose an evolutionary MVE (EMVE) training method that uses a genetic algorithm (GA) to train the weights wi and the threshold θ in the WAC- and WGC-based LR measures. As noted earlier, GA-based optimization has proved superior to gradient-based optimization in many applications because of its global scope and parallel searching power.

Genetic algorithms belong to a particular class of evolutionary algorithms inspired by the process of natural evolution [Eiben 2003]. As shown in Fig. 2.1, the operators involved in the evolutionary process are: encoding, parent selection, crossover, mutation, and survivor selection. GAs maintain a population of candidate solutions and perform parallel searches in the search space via the evolution of these candidate solutions.

To adapt the GA to EMVE training, the fitness function of the GA is set as the reciprocal of the overall expected loss D defined in Eq. (2.15), where $x_0 = C_{Miss} \times P_{Target}$ and $x_1 = C_{FalseAlarm} \times (1 - P_{Target})$. The details of the GA operations in EMVE training are described in the following.

Fig. 2.1. The general scheme of a GA: an initialized population evolves through repeated parent selection, crossover, mutation, and survivor selection until a termination condition is met.

1) Encoding: Each chromosome is a string {α1, α2, …, αN, θ} of length N + 1, which is the concatenation of all the intermediate parameters αi in Eq. (2.16) and the threshold θ in Eq. (2.12). Chromosomes are initialized by randomly assigning a real value to each gene.

2) Parent selection: Five chromosomes are randomly selected from the population with replacement, and the one with the best fitness value (i.e., the smallest overall expected loss) is selected as a parent. The procedure is repeated until a pre-defined number of parents (equal to the population size in this study) has been selected. This is known as tournament selection [Eiben 2003].

3) Crossover: We use the N-point crossover [Eiben 2003] in this work. Two chromosomes are randomly selected from the parent population with replacement, and they may interchange the pair of genes at each position according to a crossover probability pc.

4) Mutation: In most cases, the function of the mutation operator is to randomly perturb the alleles of genes in a chromosome. For example, while mutating a gene, we can simply draw a number at random from a normal distribution and add it to the allele of the gene. However, this method does not guarantee that the fitness will improve steadily. We therefore designed a new mutation operator, called the one-step gradient descent operator (GDO). The concept of the GDO is similar to that of the one-step K-means operator (KMO) [Krishna 1999; Lu 2004; Cheng 2006], which guarantees an improvement of the fitness function after mutation by performing one iteration of the K-means algorithm.

The GDO performs one gradient descent iteration to update the parameters αi, i = 1, 2, …, N, as follows:

$$\alpha_i^{new} = \alpha_i^{old} - \delta\, \frac{\partial D}{\partial \alpha_i}, \qquad (2.26)$$

where $\alpha_i^{new}$ and $\alpha_i^{old}$ are, respectively, the parameter αi in a chromosome after and before mutation; δ is the step size; and ∂D/∂αi is computed by Eq. (2.18). Similarly, the GDO for the threshold θ is computed by

$$\theta^{new} = \theta^{old} - \delta\, \frac{\partial D}{\partial \theta}, \qquad (2.27)$$

where $\theta^{new}$ and $\theta^{old}$ are, respectively, the threshold θ in a chromosome after and before mutation, and ∂D/∂θ is computed by Eq. (2.23).

5) Survivor selection: We adopt the generational model [Eiben 2003], in which the whole population is replaced by its offspring.

The process of fitness evaluation, parent selection, crossover, mutation, and survivor selection is repeated following the principle of survival of the fittest to produce better approximations of the optimal solution. Accordingly, it is hoped that the verification errors will decrease from generation to generation. When the maximum number of generations is reached, the best chromosome in the final population is taken as the solution of the weights.
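Putting steps 1) – 5) together, the overall loop can be sketched as follows; this is an illustrative skeleton under our own naming (fitness, gdo_mutate), not the dissertation's implementation:

```python
import random
import numpy as np

def emve_train(fitness, gdo_mutate, n_genes, pop_size=50, generations=100, pc=0.5):
    """Skeleton of EMVE training. fitness(c) should return 1/D for a chromosome
    c = {alpha_1, ..., alpha_N, theta}; gdo_mutate applies the one-step GDO of
    Eqs. (2.26)-(2.27)."""
    pop = [np.random.randn(n_genes) for _ in range(pop_size)]   # random init
    for _ in range(generations):
        # Tournament selection: best of 5 random draws, with replacement.
        parents = [max(random.choices(pop, k=5), key=fitness)
                   for _ in range(pop_size)]
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a.copy(), b.copy()
            swap = np.random.rand(n_genes) < pc     # positionwise gene exchange
            a[swap], b[swap] = b[swap], a[swap]
            offspring += [gdo_mutate(a), gdo_mutate(b)]
        pop = offspring                             # generational replacement
    return max(pop, key=fitness)                    # best chromosome = solution
```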

As the proposed EMVE training method searches for the solution in a global manner, its computational complexity is expected to be higher than that of gradient-based MVE training. Assume that the population size of the GA is P, while the numbers of iterations (or generations) of gradient-based MVE training and EMVE training are k1 and k2, respectively. The computational complexity of EMVE training is then about Pk2/k1 times that of gradient-based MVE training. In our experiments (as shown in Fig. 2.2), the number of generations required for the convergence of EMVE training is roughly equal to the number of iterations required for the convergence of gradient-based MVE training; hence, EMVE training requires roughly P times the computation of gradient-based MVE training.

2.4. Experiments and Analysis

We evaluated the proposed approaches via speaker verification experiments conducted on speech data extracted from the Extended M2VTS Database (XM2VTSDB) [Messer 1999]. The first set of experiments followed Configuration II of XM2VTSDB, as defined in [Luettin 1998]. The second set of experiments followed a configuration that was modified from Configuration II of XM2VTSDB to conform to the NIST Speaker Recognition Evaluation (NIST SRE) [Przybocki 2007; Van Leeuwen 2006].

In the experiments, the population size of the GA was set to 50, the maximum number of generations was set to 100, and the crossover probability pc was set to 0.5 for the EMVE training; the gradient-based MVE training for the WAC and WGC methods was initialized with equal weights wi, and the threshold θ was set to 0. For the DCF in Eq. (2.25), the costs CMiss and CFalseAlarm were both set to 1, and the a priori probability PTarget was set to 0.5. This special case of the DCF is known as the Half Total Error Rate (HTER) [Lindberg 1998]. All the experiments were conducted on a 3.2 GHz Intel Pentium IV computer with 1.5 GB of RAM, running Windows XP.

2.4.1. Evaluation based on Configuration II

In accordance with Configuration II of XM2VTSDB, the database was divided into three subsets: “Training”, “Evaluation”*, and “Test”. We used the “Training” subset to build each target speaker’s model and the background models. The “Evaluation” subset was used to optimize the weights wi in Eq. (2.1) or Eq. (2.2), along with the threshold θ. Then, the speaker verification performance was evaluated on the “Test” subset. As shown in Table 2.1, a total of 293 speakers in the database were divided into 199 clients (target speakers), 25 “evaluation impostors”, and 69 “test impostors”. Each speaker participated in four recording sessions at about one-month intervals, and each recording session consisted of two shots. In each shot, the speaker was prompted to utter three sentences:

a) “0 1 2 3 4 5 6 7 8 9”.

b) “5 0 6 9 2 8 1 3 7 4”.

c) “Joe took father’s green shoe bench out”.

* This is usually called the “Development” set by the speech recognition community. We use “Evaluation” in accordance with the configuration of XM2VTSDB.

Table 2.1. Configuration II of XM2VTSDB.

Session  Shot  199 clients  25 impostors  69 impostors
   1      1    Training     Evaluation    Test
   1      2    Training     Evaluation    Test
   2      1    Training     Evaluation    Test
   2      2    Training     Evaluation    Test
   3      1    Evaluation   Evaluation    Test
   3      2    Evaluation   Evaluation    Test
   4      1    Test         Evaluation    Test
   4      2    Test         Evaluation    Test

Each utterance, sampled at 32 kHz, was converted into a stream of 24-dimensional feature vectors by a 32-ms Hamming-windowed frame with 10-ms shifts; each vector consisted of 12 Mel-scale frequency cepstral coefficients [Huang 2001] and their first time derivatives.
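For reference, this front end could be reproduced with a standard library such as librosa; the use of librosa here is our assumption, not the dissertation's original toolchain:

```python
import numpy as np
import librosa  # one common way to realize this front end

def extract_features(wav, sr=32000):
    """12 MFCCs + first time derivatives from 32-ms Hamming windows, 10-ms shifts."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=12,
                                n_fft=int(0.032 * sr),       # 32-ms window
                                hop_length=int(0.010 * sr),  # 10-ms shift
                                window="hamming")
    delta = librosa.feature.delta(mfcc)                      # first time derivatives
    return np.vstack([mfcc, delta]).T                        # (T, 24) feature stream
```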

We used 12 (2×2×3) utterances/client from sessions 1 and 2 to train each client model, represented by a GMM with 64 mixture components.* For each client, we used the utterances of the other 198 clients in sessions 1 and 2 to generate the world model, represented by a GMM with 512 mixture components. We then chose B speakers from those 198 clients as the cohort. In the experiments, B was set to 50, and each cohort model was also represented by a GMM with 64 mixture components. Table 2.2 summarizes all the parametric models used in each system.

To optimize the weights, wi, and the threshold, θ, we used 6 utterances/client from session 3 and 24 (4×2×3) utterances/evaluation-impostor over the four sessions, which yielded 1,194 (6×199) client samples and 119,400 (24×25×199) impostor samples. To speed up the gradient-based MVE and EMVE training processes, only 2,250 impostor samples randomly selected from the total of 119,400 samples were used. In the performance evaluation, we tested 6 utterances/client in session 4 and 24 utterances/test-impostor over the four sessions, which involved 1,194 (6×199) client trials and 329,544 (24×69×199) impostor trials.

* We omitted 2 speakers (ID numbers 313 and 342) because of partial data corruption.

Table 2.2. A summary of the parametric models used in each system.

System   H0: 64-mixture   H1: 512-mixture   H1: B 64-mixture
         client GMM       world model       cohort GMMs
LUBM     √                √
LMax     √                                  √
LAri     √                                  √
LGeo     √                                  √
WGC      √                √                 √
WAC      √                √                 √

A. Experimental results

First, we compared the learning ability of gradient-based MVE training and EMVE training in the proposed WGC- and WAC-based LR measures. The background models comprised either (i) the world model and the 50 closest cohort models (“w_50c”), or (ii) the world model and the 25 closest cohort models, plus the 25 farthest cohort models (“w_25c_25f”). The WGC- and WAC-based LR systems were implemented in four ways:

a) Using gradient-based MVE training and “w_50c” (“WGC_MVE_w_50c”; “WAC_MVE_w_50c”),

b) Using gradient-based MVE training and “w_25c_25f” (“WGC_MVE_w_25c_25f”; “WAC_MVE_w_25c_25f”),

c) Using EMVE training and “w_50c” (“WGC_EMVE_w_50c”; “WAC_EMVE_w_50c”),

d) Using EMVE training and “w_25c_25f” (“WGC_EMVE_w_25c_25f”; “WAC_EMVE_w_25c_25f”).

… than the EMVE training method without GDO and the gradient-based MVE training method.

For the performance comparison, we used the following LR systems as our baselines:
