

2.3 Evolutionary Minimum Verification Error Training

As the gradient descent approach may converge to an inferior local optimum, we propose an evolutionary MVE (EMVE) training method that uses a genetic algorithm (GA) to train the weights wi and the threshold θ in the WAC- and WGC-based LR measures. GA-based optimization has been shown in many applications to be superior to gradient-based optimization because of its global search scope and parallel searching power.

Genetic algorithms belong to a particular class of evolutionary algorithms inspired by the process of natural evolution [Eiben 2003]. As shown in Fig. 2.1, the operators involved in the evolutionary process are: encoding, parent selection, crossover, mutation, and survivor selection. GAs maintain a population of candidate solutions and perform parallel searches in the search space via the evolution of these candidate solutions.

To adapt the GA to EMVE training, the fitness function of the GA is set as the reciprocal of the overall expected loss D defined in Eq. (2.15), where x0 = CMiss × PTarget and x1 = CFalseAlarm × (1 − PTarget). The details of the GA operations in EMVE training are described in the following.
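To make the fitness evaluation concrete, the following Python sketch computes the fitness of one chromosome. Since Eq. (2.15) is not reproduced in this section, the sigmoid-smoothed miss and false-alarm rates below are an assumed form of the overall expected loss; client_scores and impostor_scores are assumed to be the LR scores of the training samples computed with the weights encoded in the chromosome.

```python
import numpy as np

def fitness(chromosome, client_scores, impostor_scores,
            c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Fitness = 1 / D, where D is the overall expected loss of Eq. (2.15).

    Sketch only: the smoothed error rates below stand in for the exact
    loss of Eq. (2.15); the scores are assumed to be computed with the
    weights encoded in this chromosome.
    """
    x0 = c_miss * p_target            # x0 = CMiss x PTarget
    x1 = c_fa * (1.0 - p_target)      # x1 = CFalseAlarm x (1 - PTarget)
    theta = chromosome[-1]            # decision threshold (last gene)
    # Sigmoid-smoothed (differentiable) miss and false-alarm rates.
    p_miss = np.mean(1.0 / (1.0 + np.exp(client_scores - theta)))
    p_fa = np.mean(1.0 / (1.0 + np.exp(theta - impostor_scores)))
    d = x0 * p_miss + x1 * p_fa       # overall expected loss D
    return 1.0 / d                    # GA fitness: reciprocal of D
```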

Fig. 2.1. The general scheme of a GA: starting from an initialized population, parent selection, crossover, mutation, and survivor selection are iterated to produce each new generation of candidate solutions until termination.

1) Encoding: Each chromosome is a string {α1, α2, …, αN, θ} of length N + 1, which is the concatenation of all the intermediate parameters αi in Eq. (2.16) and the threshold θ in Eq. (2.12).

Chromosomes are initialized by randomly assigning a real value to each gene.

2) Parent selection: Five chromosomes are randomly selected from the population with replacement, and the one with the best fitness value (i.e., with the smallest overall expected loss) is selected as a parent. The procedure is repeated until a pre-defined number of parents (equal to the population size in this study) is selected. This is known as tournament selection [Eiben 2003]; a minimal sketch follows.
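A minimal Python sketch of this tournament selection step, assuming fitness has already been evaluated for the whole population (smaller loss means better fitness):

```python
import random

def tournament_select(population, losses, k=5):
    """Sample k chromosomes with replacement; return the one with the
    smallest overall expected loss (i.e., the best fitness)."""
    contestants = random.choices(range(len(population)), k=k)
    best = min(contestants, key=lambda i: losses[i])
    return population[best]

def select_parents(population, losses):
    # Repeat the tournament until as many parents as population members
    # have been selected (the setting used in this study).
    return [tournament_select(population, losses)
            for _ in range(len(population))]
```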

3) Crossover: We use the N-point crossover [Eiben 2003] in this work. Two chromosomes are randomly selected from the parent population with replacement, and they interchange each pair of genes at the same position according to a crossover probability pc; a minimal sketch follows.
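A sketch of this crossover step as described above: each gene position of two selected parents is swapped independently with probability pc (the exact pairing scheme used in the original experiments is not specified here).

```python
import random

def crossover(parent_a, parent_b, pc=0.5):
    """Exchange the genes of two parents at each position with
    crossover probability pc (pc = 0.5 in the experiments below)."""
    child_a, child_b = list(parent_a), list(parent_b)
    for j in range(len(child_a)):
        if random.random() < pc:
            child_a[j], child_b[j] = child_b[j], child_a[j]
    return child_a, child_b
```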

4) Mutation: In most cases, the function of the mutation operator is to change the allele of a gene in a chromosome at random. For example, to mutate a gene, one can simply draw a number from a normal distribution and add it to the gene's allele. However, this method does not guarantee that the fitness will improve steadily. We therefore designed a new mutation operator, called the one-step gradient descent operator (GDO). The concept of the GDO is similar to that of the one-step K-means operator (KMO) [Krishna 1999; Lu 2004; Cheng 2006], which guarantees an improvement in the fitness after mutation by performing one iteration of the K-means algorithm.

The GDO performs one gradient descent iteration to update the parameters αi, i = 1, 2, …, N as follows:

$$\alpha_i^{\mathrm{new}} = \alpha_i^{\mathrm{old}} - \delta \frac{\partial D}{\partial \alpha_i}, \qquad (2.26)$$

where αi^new and αi^old are, respectively, the parameter αi in a chromosome after and before mutation; δ is the step size; and ∂D/∂αi is computed by Eq. (2.18). Similarly, the GDO for the threshold θ is computed by

$$\theta^{\mathrm{new}} = \theta^{\mathrm{old}} - \delta \frac{\partial D}{\partial \theta}, \qquad (2.27)$$

where θ^new and θ^old are, respectively, the threshold θ in a chromosome after and before mutation; and ∂D/∂θ is computed by Eq. (2.23).
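In code, the GDO mutation is a single gradient-descent step applied to all genes. The sketch below assumes a helper grad_d that returns the gradient of D with respect to the whole chromosome, i.e., Eq. (2.18) for the αi and Eq. (2.23) for θ; the step size δ is illustrative.

```python
import numpy as np

def gdo_mutate(chromosome, grad_d, delta=0.01):
    """One-step gradient descent operator (GDO), Eqs. (2.26)-(2.27):
    mutate a chromosome by one descent step on the expected loss D.
    `grad_d` is assumed to implement Eqs. (2.18) and (2.23)."""
    chromosome = np.asarray(chromosome, dtype=float)
    return chromosome - delta * grad_d(chromosome)
```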

5) Survivor selection: We adopt the generational model [Eiben 2003], in which the whole population is replaced by its offspring.

The process of fitness evaluation, parent selection, crossover, mutation, and survivor selection is repeated, following the principle of survival of the fittest, to produce successively better approximations of the optimal solution. Accordingly, it is hoped that the verification errors will decrease from generation to generation. When the maximum number of generations is reached, the best chromosome in the final population is taken as the solution for the weights and the threshold. A sketch of the complete loop follows.
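Putting the pieces together, a minimal sketch of the whole EMVE loop, reusing the helpers sketched above and an assumed loss_fn that evaluates the overall expected loss D of a chromosome:

```python
import numpy as np

def emve_train(init_population, loss_fn, grad_d,
               n_generations=100, pc=0.5, delta=0.01):
    """Generational GA with GDO mutation (a sketch, not the original code)."""
    population = [np.asarray(c, dtype=float) for c in init_population]
    for _ in range(n_generations):
        losses = [loss_fn(c) for c in population]
        parents = select_parents(population, losses)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            child_a, child_b = crossover(a, b, pc)
            offspring.append(gdo_mutate(child_a, grad_d, delta))
            offspring.append(gdo_mutate(child_b, grad_d, delta))
        population = offspring                  # generational replacement
    losses = [loss_fn(c) for c in population]
    return population[int(np.argmin(losses))]  # best final chromosome
```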

As the proposed EMVE training method searches for the solution in a global manner, it is expected that its computational complexity is higher than that of the gradient-based MVE training. Assume that the population size of GA is P, while the numbers of iterations (or generations) of gradient-based MVE training and EMVE training are k1 and k2, respectively.

The computational complexity of EMVE training is about Pk2/k1 times that of gradient-based MVE training. In our experiments (as shown in Fig. 2.2), the number of generations required for EMVE training to converge is roughly equal to the number of iterations required for gradient-based MVE training to converge; hence, EMVE training requires roughly P times the computation of gradient-based MVE training.

2.4. Experiments and Analysis

We evaluated the proposed approaches via speaker verification experiments conducted on speech data extracted from the Extended M2VTS Database (XM2VTSDB) [Messer 1999]. The first set of experiments followed Configuration II of XM2VTSDB, as defined in [Luettin 1998]. The second set of experiments followed a configuration that was modified from Configuration II of XM2VTSDB to conform to the NIST Speaker Recognition Evaluation (NIST SRE) [Przybocki 2007; Van Leeuwen 2006].

In the experiments, the population size of the GA was set to 50, the maximum number of generations was set to 100, and the crossover probability pc was set to 0.5 for the EMVE training; the gradient-based MVE training for the WAC and WGC methods was initialized with equal weights wi, and the threshold θ was set to 0. For the DCF in Eq. (2.25), the costs CMiss and CFalseAlarm were both set to 1, and the a priori probability PTarget was set to 0.5. This special case of the DCF is known as the Half Total Error Rate (HTER) [Lindberg 1998]. All the experiments were conducted on a 3.2 GHz Intel Pentium IV computer with 1.5 GB of RAM, running Windows XP.
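For reference, assuming Eq. (2.25) takes the standard NIST form, the DCF used here is

DCF = CMiss × PMiss × PTarget + CFalseAlarm × PFalseAlarm × (1 − PTarget),

which, with CMiss = CFalseAlarm = 1 and PTarget = 0.5, reduces to HTER = (PMiss + PFalseAlarm)/2.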

2.4.1. Evaluation based on Configuration II

In accordance with Configuration II of XM2VTSDB, the database was divided into three subsets: “Training”, “Evaluation”*, and “Test”. We used the “Training” subset to build each target speaker’s model and the background models. The “Evaluation” subset was used to optimize the weights wi in Eq. (2.1) or Eq. (2.2), along with the threshold θ. Then, the speaker verification performance was evaluated on the “Test” subset. As shown in Table 2.1, a total of 293 speakers in the database were divided into 199 clients (target speakers), 25 “evaluation impostors”, and 69 “test impostors”. Each speaker participated in four recording sessions at about one-month intervals, and each recording session consisted of two shots. In each shot, the speaker was prompted to utter three sentences:

a) “0 1 2 3 4 5 6 7 8 9”.

b) “5 0 6 9 2 8 1 3 7 4”.

c) “Joe took father’s green shoe bench out”.

* This is usually called the “Development” set by the speech recognition community. We use “Evaluation” in accordance with the configuration of XM2VTSDB.

Table 2.1. Configuration II of XM2VTSDB.

Session   Shot   199 clients   25 impostors   69 impostors
1         1      Training      Evaluation     Test
          2      Training      Evaluation     Test
2         1      Training      Evaluation     Test
          2      Training      Evaluation     Test
3         1      Evaluation    Evaluation     Test
          2      Evaluation    Evaluation     Test
4         1      Test          Evaluation     Test
          2      Test          Evaluation     Test

Each utterance, sampled at 32 kHz, was converted into a stream of 24-dimensional feature vectors using 32-ms Hamming-windowed frames with 10-ms shifts; each vector consisted of 12 Mel-scale frequency cepstral coefficients [Huang 2001] and their first time derivatives.
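One way to reproduce this front-end is sketched below with the librosa library; librosa is an implementation choice of ours, not something used in the original work.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """32-ms Hamming windows, 10-ms shifts, 12 MFCCs plus their first
    time derivatives (24-dimensional vectors), as described above."""
    y, sr = librosa.load(wav_path, sr=32000)      # 32 kHz sampling rate
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=12,
        n_fft=int(0.032 * sr),                    # 32-ms frame (1024 samples)
        hop_length=int(0.010 * sr),               # 10-ms shift (320 samples)
        window="hamming")
    delta = librosa.feature.delta(mfcc)           # first time derivatives
    return np.vstack([mfcc, delta]).T             # (frames, 24)
```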

We used 12 (2×2×3) utterances/client from sessions 1 and 2 to train each client model, represented by a GMM with 64 mixture components.† For each client, we used the utterances of the other 198 clients in sessions 1 and 2 to generate the world model, represented by a GMM with 512 mixture components. We then chose B speakers from those 198 clients as the cohort. In the experiments, B was set to 50, and each cohort model was also represented by a GMM with 64 mixture components. Table 2.2 summarizes all the parametric models used in each system.

† We omitted 2 speakers (ID numbers 313 and 342) because of partial data corruption.

To optimize the weights, wi, and the threshold, θ, we used 6 utterances/client from session 3 and 24 (4×2×3) utterances/evaluation-impostor over the four sessions, which yielded 1,194 (6×199) client samples and 119,400 (24×25×199) impostor samples. To speed up the gradient-based MVE and EMVE training processes, only 2,250 impostor samples randomly selected from the total of 119,400 samples were used. In the performance evaluation, we tested 6 utterances/client in session 4 and 24 utterances/test-impostor over the four sessions, which involved 1,194 (6×199) client trials and 329,544 (24×69×199) impostor trials.

Table 2.2. A summary of the parametric models used in each system.

System   H0: 64-mixture client GMM   H1: 512-mixture world model   H1: B 64-mixture cohort GMMs
LUBM     √                           √
LMax     √                                                         √
LAri     √                                                         √
LGeo     √                                                         √
WGC      √                           √                             √
WAC      √                           √                             √
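As a hedged illustration of how these models could be trained, the sketch below uses scikit-learn's GaussianMixture; the original work specifies only the mixture counts, so the diagonal covariances and EM settings here are assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components):
    """Fit a GMM to a (frames x 24) feature matrix; diagonal covariances
    are a common choice in speaker verification, assumed here."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=100)
    gmm.fit(frames)
    return gmm

# Usage per the setup above (variable names are illustrative):
# client_gmm = train_gmm(client_frames, 64)          # per-client model
# world_gmm  = train_gmm(other_clients_frames, 512)   # world model
# cohort_gmm = train_gmm(cohort_speaker_frames, 64)   # per cohort speaker
```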

A. Experiment results

First, we compared the learning ability of gradient-based MVE training and EMVE training in the proposed WGC- and WAC-based LR measures. The background models comprised either (i) the world model and the 50 closest cohort models (“w_50c”), or (ii) the world model, the 25 closest cohort models, and the 25 farthest cohort models (“w_25c_25f”). The WGC- and WAC-based LR systems were implemented in four ways:

a) Using gradient-based MVE training and “w_50c” (“WGC_MVE_w_50c”; “WAC_MVE_w_50c”),

b) Using gradient-based MVE training and “w_25c_25f” (“WGC_MVE_w_25c_25f”; “WAC_MVE_w_25c_25f”),

c) Using EMVE training and “w_50c” (“WGC_EMVE_w_50c”; “WAC_EMVE_w_50c”), and

d) Using EMVE training and “w_25c_25f” (“WGC_EMVE_w_25c_25f”; “WAC_EMVE_w_25c_25f”).

For the performance comparison, we used the following LR systems as our baselines:

a) LUBM(U) (“Lubm”),

b) LMax(U) with the 50 closest cohort models (“Lmax_50c”),

c) LGeo(U) with the 50 closest cohort models (“Lgeo_50c”),

d) LGeo(U) with the 25 closest cohort models and the 25 farthest cohort models (“Lgeo_25c_25f”),

e) LAri(U) with the 50 closest cohort models (“Lari_50c”), and

f) LAri(U) with the 25 closest cohort models and the 25 farthest cohort models (“Lari_25c_25f”).

Figs. 2.2(a) and 2.2(b) show the learning curves of the different MVE training methods for WGC and WAC on the “Evaluation” subset, respectively, where “WGC_EMVE_w_50c_withoutGDO” and “WGC_EMVE_w_25c_25f_withoutGDO” denote the EMVE training algorithms that use the conventional mutation operator, which changes the allele of a gene in a chromosome at random, while the others are based on the GDO mutation. From Fig. 2.2, we observe that the GDO-based EMVE training method reduces the overall expected loss more effectively and steadily than both the EMVE training method without GDO and the gradient-based MVE training method.

Fig. 2.2. The learning curves of gradient-based MVE and EMVE for the “Evaluation” subset in Configuration II: (a) WGC methods; (b) WAC methods.

Fig. 2.3 shows the Detection Error Tradeoff (DET) curves [Martin 1997] obtained by evaluating the above systems on the “Test” subset, where Fig. 2.3(a) compares the WGC-based approach with the geometric mean approach, and Fig. 2.3(b) compares the WAC-based approach with the arithmetic mean approach. From the figure, we observe that all the WGC-based LR systems outperform the baseline LR systems “Lubm”, “Lmax_50c”, “Lgeo_50c”, and “Lgeo_25c_25f”, while all the WAC-based LR systems outperform the baseline LR systems “Lubm”, “Lari_50c”, and “Lari_25c_25f”. From Fig. 2.3(a), we observe that “Lgeo_25c_25f” yields the poorest performance. This is because the heuristic geometric mean can produce singular scores if any cohort model λi is poorly matched with the input utterance U, i.e., p(U|λi) → 0. In contrast, the results show that the WGC-based LR systems sidestep this problem with the aid of the weighting strategy. Figs. 2.3(a) and 2.3(b) also show that “WGC_EMVE_w_50c”, “WGC_EMVE_w_25c_25f”, and “WAC_EMVE_w_25c_25f” outperform “WGC_MVE_w_50c”, “WGC_MVE_w_25c_25f”, and “WAC_MVE_w_25c_25f”, respectively. However, there is no significant difference between “WAC_MVE_w_50c” and “WAC_EMVE_w_50c”.

In addition to the above systems, we also evaluated the WAC- and WGC-based LR measures using the hybrid anti-model defined in Eq. (2.4). The hybrid anti-model comprised five conventional anti-models extracted from “Lubm”, “Lmax_50c”, “Lgeo_50c”, “Lari_50c”, and “Lari_25c_25f”. Note that the anti-model of “Lgeo_25c_25f” was not included because of its poor performance. The hybrid anti-model systems were implemented in the following ways:

a) Using WAC and gradient-based MVE training (“WAC_MVE_5anti”),

b) Using WGC and gradient-based MVE training (“WGC_MVE_5anti”),

c) Using WAC and EMVE training (“WAC_EMVE_5anti”), and

d) Using WGC and EMVE training (“WGC_EMVE_5anti”).

Fig. 2.4 compares the performance of the hybrid anti-model systems with all the baseline systems in DET curves, evaluated on the “Test” subset. Clearly, all the hybrid anti-model systems using either the WAC or the WGC method outperform any baseline LR system with a single anti-model.

Fig. 2.3. DET curves for the “Test” subset in Configuration II: (a) geometric mean versus WGC; (b) arithmetic mean versus WAC.

Fig. 2.4. Hybrid anti-model systems versus all baselines: DET curves for the “Test” subset in Configuration II.

B. Discussion

Table 2.3 summarizes the above experiment results in terms of the DCF, which reflects the performance at a specific operating point on the DET curve. For each baseline system, the value of the decision threshold θ was carefully tuned to minimize the DCF on the “Evaluation” subset, and then applied to the “Test” subset. In contrast, the decision thresholds of the proposed WAC- and WGC-based LR measures were optimized automatically using the “Evaluation” subset, and then applied to the “Test” subset.

Table 2.3. DCFs for the “Evaluation” and “Test” subsets in Configuration II.

System               min DCF for “Evaluation”   DCF for “Test”
Lubm                 0.0651                     0.0545
Lmax_50c             0.0762                     0.0575
Lari_50c             0.0677                     0.0526
Lari_25c_25f         0.0587                     0.0496
Lgeo_50c             0.0749                     0.0542
WGC_MVE_w_50c        0.0576                     0.0450
WGC_EMVE_w_50c       0.0488                     0.0417
WGC_MVE_w_25c_25f    0.0633                     0.0478
WGC_EMVE_w_25c_25f   0.0493                     0.0429
WAC_MVE_w_50c        0.0576                     0.0460
WAC_EMVE_w_50c       0.0571                     0.0443
WAC_MVE_w_25c_25f    0.0573                     0.0462
WAC_EMVE_w_25c_25f   0.0543                     0.0444
WGC_MVE_5anti        0.0588                     0.0475
WGC_EMVE_5anti       0.0568                     0.0460
WAC_MVE_5anti        0.0634                     0.0480
WAC_EMVE_5anti       0.0597                     0.0469

Several conclusions can be drawn from Table 2.3. First, all the proposed WAC- and WGC-based LR systems, with either the hybrid anti-model or the background model set (the world model plus a cohort), outperform all the baseline LR systems. Second, the performances of the proposed systems using the background model set are slightly better than those achieved using the hybrid anti-model. Third, the performances of the WAC- and WGC-based LR systems are similar. Fourth, EMVE training outperforms gradient-based MVE training. Among all the systems, “WGC_EMVE_w_50c” achieves the best performance, with a 15.93% relative improvement in terms of the DCF for the “Test” subset compared to the best baseline system, “Lari_25c_25f”.

2.4.2. Evaluation based on the NIST SRE-like Configuration

To conform to the NIST SRE [Przybocki 2007; Van Leeuwen 2006], we conducted another series of experiments on XM2VTSDB, which was re-configured as shown in Table 2.4. The 293 speakers in XM2VTSDB were divided into 100 clients (target speakers), 100 background speakers, 24 “development impostors”, and 69 “test impostors”. As shown in the table, the “Development” set comprised two subsets: “Development training” and “Development test”.

In the “Development training” subset, we pooled the utterances of the 100 background speakers from sessions 1 and 2 to build a world model (UBM), represented by a GMM with 512 mixture components. For each background speaker, we used 12 (2×2×3) utterances/background-speaker from sessions 1 and 2 to generate his/her model. The cohort for each background speaker was selected from the other 99 background speakers. In the “Development test” subset, to estimate the weights wi and the threshold θ, we used 12 (2×2×3) utterances/background-speaker from sessions 3 and 4, as well as 24 (4×2×3) utterances/development-impostor over the four sessions. This yielded 1,200 (12×100) client samples and 57,600 (24×24×100) impostor samples. To speed up the gradient-based MVE and EMVE training processes, only 5,760 impostor samples randomly selected from the total of 57,600 samples were used.

For each client (target speaker), we used 12 (2×2×3) utterances/client from sessions 1 and 2 to generate the client GMM. The cohort models for each client were selected from the GMMs of the 100 background speakers in the “Development training” subset. The parametric models used in each system were the same as those in Table 2.2. In addition, we implemented two current state-of-the-art systems for the text-independent speaker verification task, namely T-norm [Auckenthaler 2000] and “Lubm_MAP”. “Lubm_MAP” is based on the UBM-MAP adaptation method [Reynolds 2000]; each client model, with 512 mixture Gaussian components, was adapted from the UBM via maximum a posteriori (MAP) estimation [Gauvain 1994] using the speaker’s 12 (2×2×3) “Training” utterances from sessions 1 and 2.
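A minimal sketch of mean-only UBM-MAP adaptation in the style of [Reynolds 2000]; the relevance factor r = 16 is a common default rather than a value stated in this text, and weight/variance adaptation is omitted.

```python
import numpy as np

def map_adapt_means(ubm_means, posteriors, frames, r=16.0):
    """Adapt UBM mixture means to a speaker's data (mean-only MAP).

    posteriors: (T, K) mixture occupancies of the speaker's T frames;
    frames: (T, D) feature vectors; ubm_means: (K, D).
    """
    n_k = posteriors.sum(axis=0)                        # soft counts (K,)
    # Per-mixture first-order statistics, shape (K, D).
    e_k = (posteriors.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + r))[:, None]                  # data-dependent weight
    return alpha * e_k + (1.0 - alpha) * ubm_means      # adapted means (K, D)
```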

In the performance evaluation, we tested 12 (2×2×3) utterances/client from sessions 3 and 4, and 24 (4×2×3) utterances/test-impostor over the four sessions, which involved 1,200 (12×100) client trials and 165,600 (24×69×100) impostor trials.

Table 2.4. The NIST SRE-like configuration of XM2VTSDB.

Session   Shot   100 clients                100 background speakers                24 impostors                  69 impostors
1         1, 2   Training (client models)   Development training (UBM, a cohort)   Development test (wi and θ)   Test
2         1, 2   Training (client models)   Development training (UBM, a cohort)   Development test (wi and θ)   Test
3         1, 2   Test                       Development test (wi and θ)            Development test (wi and θ)   Test
4         1, 2   Test                       Development test (wi and θ)            Development test (wi and θ)   Test

A. Experiment results

As in Section 2.4.1, we implemented four WGC-based LR systems: “WGC_MVE_w_50c”, “WGC_EMVE_w_50c”, “WGC_MVE_w_25c_25f”, and “WGC_EMVE_w_25c_25f”; four WAC-based LR systems: “WAC_MVE_w_50c”, “WAC_EMVE_w_50c”, “WAC_MVE_w_25c_25f”, and “WAC_EMVE_w_25c_25f”; and four hybrid anti-model systems: “WAC_MVE_5anti”, “WAC_EMVE_5anti”, “WGC_MVE_5anti”, and “WGC_EMVE_5anti”. For the performance comparison, we used five conventional LR systems: “Lubm”, “Lmax_50c”, “Lgeo_50c”, “Lari_50c”, and “Lari_25c_25f”, plus two state-of-the-art systems: “Lubm_MAP” and the T-norm system with the 50 closest cohort models (“Tnorm_50c”), as our baselines.

Since the experiment results in Section 2.4.1 show that the proposed WGC- and WAC-based LR systems perform better with EMVE training than with gradient-based MVE training, Fig. 2.5 only compares the performance of the proposed WGC- and WAC-based LR systems using EMVE training with the two state-of-the-art systems and the two best baseline systems from Section 2.4.1, namely “Lubm” and “Lari_25c_25f”, evaluated on the “Test” subset in DET curves. From the figure, we observe that all the proposed WGC- and WAC-based LR systems using EMVE training outperform “Lubm_MAP”, “Tnorm_50c”, “Lubm”, and “Lari_25c_25f”. Interestingly, the baseline system “Lubm” outperforms “Lubm_MAP”, which is widely recognized as a state-of-the-art method for the text-independent speaker verification task. This may be because the training and test utterances in XM2VTSDB have the same content.

Table 2.5 summarizes the experiment results for all systems in terms of the DCF. For each baseline system, the decision threshold θ was tuned to minimize the DCF on the “Development test” subset, and then applied to the “Test” subset. The decision thresholds of the proposed methods were optimized automatically using the “Development test” subset, and then applied to the “Test” subset. From Table 2.5, it is clear that all the proposed WGC- and WAC-based LR systems, using either gradient-based MVE training or EMVE training, outperform all the conventional LR systems (“Lubm”, “Lmax_50c”, “Lgeo_50c”, “Lari_50c”, and “Lari_25c_25f”) and the two state-of-the-art systems (“Lubm_MAP” and “Tnorm_50c”). The DCFs for the “Test” subset demonstrate that “WGC_EMVE_w_50c” achieved a 13.01% relative improvement over “Tnorm_50c”, the best baseline system.

Fig. 2.5. DET curves for the “Test” subset in the NIST SRE-like configuration.

We also evaluated the training and verification time of the above systems. In the offline training phase, in addition to training 100 background speaker models and a UBM, the proposed WAC and WGC methods need to train the weights wi. From the fourth column of Table 2.5, we observe that EMVE training is slower than gradient-based MVE training, and that the training time of WGC is slightly shorter than that of WAC. The computational cost of gradient-based MVE or EMVE training mainly comes from calculating the likelihoods of each training utterance with respect to the background speaker models and the UBM, and from selecting the cohort models for each background speaker. The fifth column of Table 2.5 shows the training time for enrolling a new target speaker. “Lubm_MAP” and “Lubm” need less enrollment time than the other systems because they do not need to select cohort models for the new target speaker. The last column of Table 2.5 shows the verification time for an input test utterance; the average duration of the test utterances is around 1.5 sec. As expected, “Lubm_MAP” is the fastest method, since only one background model (i.e., the UBM) is involved and the fast scoring scheme [Reynolds 2000] is used. Although the proposed systems are slightly slower than the baseline systems, because both the cohort models and the UBM are involved, they are still capable of supporting a real-time response.
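For illustration, the top-C trick behind the fast scoring scheme of [Reynolds 2000] can be sketched as follows; the component_logpdf interface is hypothetical, and C = 5 is a value commonly used with UBM-MAP systems rather than one stated in this text.

```python
import numpy as np

def fast_frame_llr(frame, ubm, spk, top_c=5):
    """Evaluate all UBM mixtures once, then rescore only the top-C
    mixtures under the MAP-adapted speaker model (the mixtures of the
    two models are coupled by adaptation, so the top-C sets coincide)."""
    ubm_lp = ubm.component_logpdf(frame)            # (K,) log(w_k * N_k(frame))
    top = np.argsort(ubm_lp)[-top_c:]               # indices of top-C mixtures
    spk_lp = spk.component_logpdf(frame, idx=top)   # only C evaluations
    ubm_ll = np.logaddexp.reduce(ubm_lp[top])
    spk_ll = np.logaddexp.reduce(spk_lp)
    return spk_ll - ubm_ll                          # per-frame log-LR
```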

Table 2.5. DCFs for the “Development test” and “Test” subsets, together with the running time evaluation in the NIST SRE-like configuration.

System      min DCF for “Development test”   DCF for “Test”   Training time for the weights wi in WAC/WGC (offline)   Training time for enrolling a target speaker   Verification time for an input test utterance
Lubm_MAP    0.0704                           0.0601           n/a                                                     5.79 sec                                       0.08 sec
Lubm        0.0575                           0.0573           n/a                                                     7.87 sec                                       0.12 sec
Tnorm_50c   0.0607                           0.0569           n/a                                                     27.46 sec                                      0.75 sec
