where σ is a tunable parameter; this is equivalent to the following Radial Basis Function (RBF) kernel with two inputs x1 and x2:

k2(x1, x2) = exp(− ||x1 − x2||² / 2σ²). (3.19)

3.3. Concepts Related to the Characteristic Vector

In this section, we compare the proposed classifiers with several approaches related to the characteristic vector. It is worth noting that the major advantage of our classifiers lies in a trainable mechanism, which tries to optimally exploit useful information from background models, rather than make an ad hoc modification or use a combination of existing approaches.

3.3.1. Direct Fusion of Multiple LRs

The most intuitive way to improve the conventional LR-based speaker verification method would be to fuse multiple LR measures directly. Similar to the fusion approaches in [Ben-Yacoub 1999; Cheng 2005], we define a fusion-based LR as

LFusion(U) = w′x̄  { ≥ θ  accept
                  { < θ  reject,  (3.20)

where x̄ = [L1(U) L2(U) … LR(U)]′ is a vector of R LR measures, and w = [w1 w2 … wR]′ is a weight vector.

As with WGC and WAC, the weight vector w can be trained using KFD or SVM. A preliminary result reported in [Chao 2006] shows that, compared to approaches that use a single LR, such a fusion scheme improves speaker verification performance noticeably. However, we found that direct fusion is often dominated by one particular LR, or it is limited by some inferior LRs.

3.3.2. Relation to the Anchor Modeling Approach

The concept of our methods is similar to that of the anchor modeling approach [Sturim 2001; Mami 2006] used in speaker indexing and speaker identification applications. The objective of the anchor modeling approach is to construct a speaker space based on a set of pre-trained representative models {A1, A2, …, AN}, called anchor models. Then, any speech utterance U can be projected into the space, and represented as a characteristic vector x [Sturim 2001],

x = [p(U |A1) p(U |A2) … p(U |AN)]′. (3.22)

The speaker of an unknown utterance U can be identified by computing the distance between the characteristic vector x and the typical vectors of the target speakers. The characteristic vector defined in Eq. (3.22) is similar to the characteristic vector used in this study. However, to find the location of a target speaker in the speaker space, the anchor modeling approach only considers the projection of the speech utterance from the target speaker, which is different from the proposed discriminative framework. More specifically, the decision functions based on WGC and WAC characterize a target speaker by locating the boundary that optimally separates the characteristic vectors of a target speaker from those of non-target speakers; hence, the proposed methods are expected to be more effective than the anchor modeling approach.
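To make Eq. (3.22) concrete, the sketch below projects a toy utterance onto two anchor models. It is a minimal illustration only: the diagonal-covariance GMMs, the use of average log-likelihoods in place of raw likelihoods p(U|Ai), and all names are assumptions, not the thesis' implementation.

```python
import numpy as np

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of an utterance under a diagonal-covariance GMM."""
    diff = frames[:, None, :] - means[None, :, :]                    # (T, M, D)
    expo = -0.5 * np.sum(diff ** 2 / variances, axis=2)              # (T, M)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)  # (M,)
    log_comp = np.log(weights) + log_norm + expo                     # (T, M)
    m = log_comp.max(axis=1, keepdims=True)                          # log-sum-exp over mixtures
    per_frame = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return per_frame.mean()

def characteristic_vector(frames, anchors):
    """x = [log p(U|A1) ... log p(U|AN)]' for a list of anchor GMMs."""
    return np.array([gmm_avg_loglik(frames, *a) for a in anchors])

# Toy example: two single-Gaussian "anchor models"; the utterance lies near the first.
rng = np.random.default_rng(0)
anchor1 = (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
anchor2 = (np.array([1.0]), np.full((1, 2), 5.0), np.ones((1, 2)))
utterance = rng.normal(0.0, 1.0, size=(50, 2))
x = characteristic_vector(utterance, [anchor1, anchor2])
```

Taking logarithms here is a standard numerical-stability choice; it does not change the ordering of the vector entries.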

3.4. Experiments and Analysis

We conducted the speaker-verification experiments on two databases: the XM2VTSDB and the ISCSLP2006 speaker recognition evaluation (ISCSLP2006-SRE) database [Zheng 2006].

3.4.1. Evaluation on the XM2VTSDB

The first set of experiments was conducted on XM2VTSDB following Configuration II. We built the world model with 256 Gaussian mixture components. The cohort size B was set to 20. The remaining experiment setup was the same as that in Section 2.4.1. Because a kernel-based technique can be intractable when a large number of training samples are involved, we also reduced the number of evaluation-impostor samples from 119,400 to 2,250 for estimating w.

A. Weighted Geometric Combination versus Geometric Mean

The first experiment evaluated the proposed weighted geometric combination of background models, i.e., LWGC(U) defined in Eq. (3.3). The set of background models was comprised of (i) the world model and the 20 closest cohort models ("w_20c"), or (ii) the world model and the 10 closest cohort models plus the 10 farthest cohort models ("w_10c_10f"). The weight vector was optimized by kernel-based discrimination solutions (KFD or SVM). We derived the following eight WGC-based systems:

a) KFD with k1(⋅) defined in Eq. (3.17) and "w_20c" ("WGC_dot_KFD_w_20c"),
b) KFD with k1(⋅) defined in Eq. (3.17) and "w_10c_10f" ("WGC_dot_KFD_w_10c_10f"),
c) SVM with k1(⋅) defined in Eq. (3.17) and "w_20c" ("WGC_dot_SVM_w_20c"),
d) SVM with k1(⋅) defined in Eq. (3.17) and "w_10c_10f" ("WGC_dot_SVM_w_10c_10f"),
e) KFD with k2(⋅) defined in Eq. (3.19) and "w_20c" ("WGC_RBF_KFD_w_20c"),
f) KFD with k2(⋅) defined in Eq. (3.19) and "w_10c_10f" ("WGC_RBF_KFD_w_10c_10f"),
g) SVM with k2(⋅) defined in Eq. (3.19) and "w_20c" ("WGC_RBF_SVM_w_20c"), and
h) SVM with k2(⋅) defined in Eq. (3.19) and "w_10c_10f" ("WGC_RBF_SVM_w_10c_10f").

Both SVM and KFD used an RBF kernel function k2(⋅) with σ = 5. We used the SSVM tool [Lee 2001] to implement the SVM experiments, where the parameter C of SVM was set to 1.

For the performance comparison, we used three systems as our baselines:

a) LUBM(U) ("GMM-UBM"),
b) LGeo(U) with the 20 closest cohort models ("Geo_20c"), and
c) LGeo(U) with the 10 closest cohort models plus the 10 farthest cohort models ("Geo_10c_10f").

Fig. 3.1 shows the speaker verification results of the above systems evaluated on the XM2VTSDB "Test" subset in terms of Detection Error Tradeoff (DET) curves [Martin 1997]. Figures 3.1(a) and 3.1(b) compare the DET curves derived by KFD-based systems and SVM-based systems, respectively.

From Fig. 3.1, we observe that all the WGC-based systems with kernel functions k1(⋅) or k2(⋅) outperform the baseline systems "GMM-UBM", "Geo_20c", and "Geo_10c_10f". We also observe that "Geo_10c_10f" in Fig. 3.1(a) yields the poorest performance. In addition, both Fig. 3.1(a) and Fig. 3.1(b) show that the WGC-based systems with k2(⋅) outperform the WGC-based systems with k1(⋅). Thus, in the subsequent experiments, we focused on investigating the performance achieved by the kernel-based discrimination solutions using the kernel function k2(⋅).

Fig. 3.1. Geometric Mean versus WGC: DET curves for the "Test" subset in XM2VTSDB. (a) KFD-based systems; (b) SVM-based systems.

B. Weighted Arithmetic Combination versus Arithmetic Mean

The second experiment evaluated the proposed weighted arithmetic combination of background models, i.e., LWAC(U) defined in Eq. (3.6). We implemented the WAC-based systems using the kernel-based discrimination solution in four ways:

a) KFD with "w_20c" ("WAC_RBF_KFD_w_20c"),
b) KFD with "w_10c_10f" ("WAC_RBF_KFD_w_10c_10f"),
c) SVM with "w_20c" ("WAC_RBF_SVM_w_20c"), and
d) SVM with "w_10c_10f" ("WAC_RBF_SVM_w_10c_10f").

In the above cases, SVM and KFD used an RBF kernel function k2(⋅) with σ = 60. For the performance comparison, we used three systems as our baselines:

a) LUBM(U) ("GMM-UBM"),
b) LAri(U) with the 20 closest cohort models ("Ari_20c"), and
c) LAri(U) with the 10 closest cohort models plus the 10 farthest cohort models ("Ari_10c_10f").

Fig. 3.2 shows the results of the above systems evaluated on the XM2VTSDB "Test" subset in terms of DET curves. Clearly, all the WAC-based systems based on either KFD or SVM outperform the baseline systems "GMM-UBM", "Ari_20c", and "Ari_10c_10f". We also observe that the performances of SVM and KFD are similar.

Fig. 3.2. Arithmetic Mean versus WAC: DET curves for the "Test" subset in XM2VTSDB.

C. Discussion

An analysis of the experiment results based on the DCF with CMiss = 1, CFa = 1, and PTarget = 0.5 is given in Table 3.1. In addition to the above systems, we evaluated four related systems:

a) LMax(U) with the 20 closest cohort models (“Max_20c”);

b) LBengio(U) using an RBF kernel function with σ = 10 (“GMM-UBM/SVM”);

c) LFusion(U) with a fusion of five baseline LR measures, namely, “GMM-UBM”, “Max_20c”,

“Ari_20c”, “Ari_10c_10f”, and “Geo_20c”, by KFD (“Fusion_KFD”); and

d) LFusion(U) with a fusion of five baseline LR measures, namely, “GMM-UBM”, “Max_20c”,

“Ari_20c”, “Ari_10c_10f”, and “Geo_20c”, by SVM (“Fusion_SVM”).

In the fusion systems, KFD and SVM used an RBF kernel function with σ = 5. For each

approach, the decision threshold was carefully tuned to minimize the DCF using the

“Evaluation” subset, and then applied to the “Test” subset.
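For reference, the DCF used throughout this section is DCF = CMiss·PMiss·PTarget + CFa·PFa·(1 − PTarget). The sketch below tunes a decision threshold to minimize it; the function names and toy scores are illustrative assumptions, not the thesis' tooling.

```python
import numpy as np

def dcf(scores_target, scores_impostor, threshold,
        c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Detection cost function at a fixed decision threshold."""
    p_miss = np.mean(scores_target < threshold)    # target trials wrongly rejected
    p_fa = np.mean(scores_impostor >= threshold)   # impostor trials wrongly accepted
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def min_dcf(scores_target, scores_impostor, **kw):
    """Sweep every observed score as a candidate threshold; return (min DCF, threshold)."""
    candidates = np.unique(np.concatenate([scores_target, scores_impostor]))
    costs = [dcf(scores_target, scores_impostor, t, **kw) for t in candidates]
    i = int(np.argmin(costs))
    return costs[i], candidates[i]

# Perfectly separable toy scores: the minimum DCF is zero at threshold 2.0.
tgt = np.array([2.0, 3.0, 4.0])
imp = np.array([-1.0, 0.0, 1.0])
best_cost, best_t = min_dcf(tgt, imp)
```

In the tables, a "min DCF" sweeps the threshold on the same subset, whereas the "actual DCF" applies the threshold tuned on the "Evaluation" subset to the "Test" subset.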

Table 3.1. DCFs for the "Evaluation" and "Test" subsets in the XM2VTS database

System    min DCF for "Evaluation"    actual DCF for "Test"

GMM-UBM 0.0633 0.0519

Max_20c 0.0776 0.0635

Ari_20c 0.0676 0.0535

Ari_10c_10f 0.0589 0.0515

Geo_20c 0.0734 0.0583

GMM-UBM/SVM 0.0590 0.0508

Fusion_KFD 0.0496 0.0475

Fusion_SVM 0.0505 0.0469

WGC_RBF_KFD_w_20c 0.0247 0.0357

WGC_RBF_KFD_w_10c_10f 0.0232 0.0389

WGC_RBF_SVM_w_20c 0.0320 0.0414

WGC_RBF_SVM_w_10c_10f 0.0310 0.0417

WAC_RBF_KFD_w_20c 0.0462 0.0443

WAC_RBF_KFD_w_10c_10f 0.0469 0.0445

WAC_RBF_SVM_w_20c 0.0460 0.0454

WAC_RBF_SVM_w_10c_10f 0.0479 0.0450

Several conclusions can be drawn from Table 3.1. First, the two direct fusion systems,

“Fusion_KFD” and “Fusion_SVM”, as well as “GMM-UBM/SVM”, outperform the baseline LR systems. Second, the proposed WGC- and WAC-based systems not only outperform all the baseline LR systems, “GMM-UBM”, “Max_20c”, “Ari_20c”, “Ari_10c_10f”, and

“Geo_20c”, they are also better than the fusion systems and the “GMM-UBM/SVM” system.

The WGC- and WAC-based SVM systems are better than the “GMM-UBM/SVM” system because they consider multiple background models (including the world model), whereas the

“GMM-UBM/SVM” system only considers the world model. Third, the WGC-based systems slightly outperform the WAC-based systems. Fourth, both KFD and SVM perform well in terms of finding nonlinear discrimination solutions. From the actual DCF for the “Test”

subset, we observe that "WGC_RBF_KFD_w_20c" achieved a 30.68% relative improvement compared to "Ari_10c_10f", the best baseline LR system. Table 3.2 compares the correlation of correct and incorrect decisions between "WGC_RBF_KFD_w_20c" and "Ari_10c_10f" for the actual DCF [Van Leeuwen 2006]. Based on McNemar's test [Gillick 1989] with a significance level of 0.001, we can conclude that "WGC_RBF_KFD_w_20c" performs significantly better than "Ari_10c_10f", since the resulting P-value < 0.001.

Table 3.2. Comparison of errors made by “WGC_RBF_KFD_w_20c” and “Ari_10c_10f,”

where P and N denote the number of positive (target speaker) trials and the number of negative (impostor) trials, respectively. There are 1,194 P and 329,544 N in total.

Trial counts                     Ari_10c_10f correct    Ari_10c_10f incorrect
WGC_RBF_KFD_w_20c correct        1,107P + 315,200N      32P + 6,019N
WGC_RBF_KFD_w_20c incorrect      5P + 3,056N            50P + 5,269N
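McNemar's test uses only the discordant trials in Table 3.2, i.e., the trials that exactly one system decides correctly. The sketch below applies the common chi-squared approximation with continuity correction to those counts; the original analysis may have used an exact binomial form, so this is illustrative only.

```python
def mcnemar_chi2(b, c):
    """Continuity-corrected chi-squared statistic (1 d.o.f.) for discordant counts b, c."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Discordant counts from Table 3.2:
# b: WGC_RBF_KFD_w_20c correct, Ari_10c_10f incorrect -> 32P + 6,019N
# c: WGC_RBF_KFD_w_20c incorrect, Ari_10c_10f correct -> 5P + 3,056N
b = 32 + 6019
c = 5 + 3056
chi2 = mcnemar_chi2(b, c)
# chi2 far exceeds 10.83, the 1-d.o.f. critical value at significance level 0.001,
# consistent with the reported P-value < 0.001.
```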

3.4.2. Evaluation on the ISCSLP2006-SRE Database

We also evaluated the proposed methods on a text-independent single-channel speaker verification task conforming to the ISCSLP2006 Speaker Recognition Evaluation (ISCSLP2006-SRE) Plan [Chinese Corpus Consortium 2006]. Unlike the XM2VTSDB task, the ISCSLP2006-SRE database was divided into two subsets: a "Development Data Set" and an "Evaluation Data Set". The "Development Data Set" contained 300 speakers. Each speaker made two utterances, each of which was cut into one long segment, which was longer than 30 seconds, and several short segments. In the experiments, we pooled the speakers' long segments to build a UBM with 1,024 Gaussian mixture components, and used the two long segments of each speaker to train that speaker's 1,024-mixture GMM through UBM-MAP adaptation. For each speaker, B speakers' GMMs were chosen from the other 299 speakers as

the cohort models. The remaining short segments of all the speakers were used to estimate θ, w, and w0. In the implementation, each short segment served as a positive sample for its associated speaker, but acted as a negative sample for each of the 20 randomly-selected speakers from the remaining 299 speakers. This yielded 1,551 positive samples and 31,020 (1,551×20) negative samples for estimating θ or w0. Moreover, we used 1,551 positive samples and 1,551 randomly-selected negative samples to estimate w in the proposed systems.

The “Evaluation Data Set” contained 800 target speakers that did not overlap with the speakers in the “Development Data Set”. Each target speaker made one long training utterance, ranging in duration from 21 to 85 seconds, with an average length of 37.06 seconds. This was used to generate the speaker’s 1024-mixture GMM through UBM-MAP adaptation. For each target speaker, B speakers’ GMMs were chosen from the 300 speakers in the “Development Data Set” as the cohort models. In addition, there were 5,933 test utterances (trials) in the “Evaluation Data Set”, each of which ranged in duration from 5 seconds to 54 seconds, with an average length of 15.66 seconds. Each test utterance was associated with the claimed speaker’s ID, and the task involved judging whether it was true or false. The answer sheet was released after the evaluation finished.

The acoustic feature extraction process was the same as that applied in the XM2VTSDB task.

A. Experiment results

The GMM-UBM and T-norm systems are the current state-of-the-art approaches for the text-independent speaker verification task. Thus, in this part, we focus on the performance improvement of our methods over these two baseline systems. As with the GMM-UBM system, we used the fast scoring method [Reynolds 2000] for likelihood ratio computation in

the proposed methods. Both the target speaker model λ and the B cohort models were adapted from the UBM Ω. Because the mixture indices were retained after UBM-MAP adaptation, each element of the characteristic vector x was computed approximately by only considering the C mixture components corresponding to the top C scoring mixtures in the UBM [Reynolds 2000]. In our experiments, C was set to 5, and B was set to 20.
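Because the speaker models are MAP-adapted from the UBM, mixture components stay aligned, so the fast scoring method evaluates only the top-C UBM mixtures per frame and reuses those component indices in the adapted model. A minimal sketch of this approximation, with illustrative names and diagonal-covariance components assumed:

```python
import numpy as np

def component_logdens(frame, means, variances):
    """Per-component log N(frame; mean_m, diag(var_m)) for all mixtures at once."""
    diff = frame - means                                  # (M, D)
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + diff ** 2 / variances, axis=1)

def fast_log_lr(frames, ubm, spk, top_c=5):
    """Frame-averaged log LR using only the top-C scoring UBM mixtures per frame."""
    w_u, mu_u, var_u = ubm
    w_s, mu_s, var_s = spk
    total = 0.0
    for frame in frames:
        log_u = np.log(w_u) + component_logdens(frame, mu_u, var_u)
        top = np.argsort(log_u)[-top_c:]                  # indices of top-C UBM mixtures
        log_s = np.log(w_s[top]) + component_logdens(frame, mu_s[top], var_s[top])
        # log-sum-exp over the same C components for both models
        lse = lambda v: v.max() + np.log(np.exp(v - v.max()).sum())
        total += lse(log_s) - lse(log_u[top])
    return total / len(frames)

# Sanity check: a "speaker model" identical to the UBM gives a log LR of zero.
rng = np.random.default_rng(1)
M, D = 8, 3
w = np.full(M, 1.0 / M)
mu = rng.normal(size=(M, D))
var = np.ones((M, D))
frames = rng.normal(size=(20, D))
same = fast_log_lr(frames, (w, mu, var), (w, mu, var), top_c=5)
```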

The experiment results of the XM2VTSDB task showed that there was no significant performance difference between the two cohort selection methods used to construct the characteristic vector x. Thus, in the following experiments, we only used one type of characteristic vector, i.e., the vector associated with the UBM and the 20 closest cohort models (“w_20c”), to compute WGC- and WAC-based decision functions. This yielded the following four systems:

a) LWGC(U) using SVM with k2(⋅) and “w_20c” (“WGC_RBF_SVM_w_20c”), b) LWGC(U) using KFD with k2(⋅) and “w_20c” (“WGC_RBF_KFD_w_20c”), c) LWAC(U) using SVM with k2(⋅) and “w_20c” (“WAC_RBF_SVM_w_20c”), and d) LWAC(U) using KFD with k2(⋅) and “w_20c” (“WAC_RBF_KFD_w_20c”).

We compared the proposed systems with the GMM-UBM system, the T-norm system with the 50 closest cohort models ("Tnorm_50c"), and Bengio et al.'s system ("GMM-UBM/SVM"). The kernel parameters for SVM and KFD were the same as those used in the XM2VTSDB task. Following the ISCSLP2006-SRE Plan, the performance was measured by the DCF with CMiss = 10, CFa = 1, and PTarget = 0.05. In each system, the decision threshold was tuned to minimize the DCF using the (1,551 + 31,020) samples in the

“Development Data Set”, and then applied to the “Evaluation Data Set”. Table 3.3 summarizes the minimum DCFs and the actual DCFs derived from 5,933 trials in the

“Evaluation Data Set”, and Fig. 3.3 shows the experiment results for all systems in terms of

DET curves. It is clear that all the proposed systems outperform “GMM-UBM”,

“Tnorm_50c”, and “GMM-UBM/SVM.” The actual DCFs in Table 3.3 show that

“WGC_RBF_KFD_w_20c” achieved a 52.72% relative improvement over “Tnorm_50c”.

Table 3.4 compares the correlation of correct and incorrect decisions between

"WGC_RBF_KFD_w_20c" and "Tnorm_50c" for the actual DCF. Based on McNemar's test with a significance level of 0.001, we can conclude that "WGC_RBF_KFD_w_20c" performs significantly better than "Tnorm_50c", since the resulting P-value < 0.001.
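For context, the "Tnorm_50c" baseline applies test normalization (T-norm): the claimed speaker's score for a test utterance is standardized by the mean and standard deviation of the same utterance's scores against a set of cohort models. A minimal sketch, with illustrative names and toy numbers:

```python
import numpy as np

def tnorm_score(raw_score, cohort_scores):
    """T-norm: standardize the claimed-speaker score by the cohort score distribution."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / sigma

# The same test utterance scored against the cohort models yields cohort_scores;
# a raw score well above the cohort mean maps to a large positive T-norm score.
cohort_scores = np.array([0.1, -0.2, 0.0, 0.3, -0.1])
s = tnorm_score(2.0, cohort_scores)
```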

Table 3.3. Minimum DCFs and actual DCFs for the ISCSLP2006-SRE "Evaluation Data Set"

System    Minimum DCF    Actual DCF

GMM-UBM 0.0184 0.0228

Tnorm_50c 0.0151 0.0184

GMM-UBM/SVM 0.0143 0.0146

WGC_RBF_KFD_w_20c 0.0081 0.0087

WAC_RBF_KFD_w_20c 0.0087 0.0112

WGC_RBF_SVM_w_20c 0.0091 0.0105

WAC_RBF_SVM_w_20c 0.0093 0.0105

Table 3.4. Comparison of errors made by "WGC_RBF_KFD_w_20c" and "Tnorm_50c",

where P and N denote the number of positive (target speaker) trials and the number of negative (impostor) trials, respectively. There are 347 P and 5,586 N in total.

Trial counts                     Tnorm_50c correct    Tnorm_50c incorrect
WGC_RBF_KFD_w_20c correct        342P + 5,508N        2P + 52N
WGC_RBF_KFD_w_20c incorrect      0P + 12N             3P + 14N

Fig. 3.3. Baseline systems versus WAC and WGC: DET curves for the ISCSLP2006-SRE

“Evaluation Data Set”. The stars and circles indicate the actual and minimum DCFs, respectively.

Chapter 4

Improving GMM-UBM Speaker Verification Using Discriminative Feedback Adaptation

In this chapter, we focus on the current state-of-the-art GMM-UBM approach [Reynolds 2000] for text-independent speaker verification, which uses the UBM-MAP technique to generate the target model λ and the anti-model λ̄. This approach pools the speech data of a large number of background speakers to form a universal background model (UBM) as λ̄ via the expectation-maximization (EM) algorithm. It then adapts the UBM to λ via the maximum a posteriori (MAP) estimation technique. GMM-UBM is effective because its generalization ability allows λ to handle acoustic patterns not covered by the limited training data of the target speaker. However, since λ and λ̄ are trained according to separate criteria, the optimization procedure cannot distinguish a target speaker from background speakers optimally. In particular, since GMM-UBM uses a common UBM λ̄ for all target speakers, it tends to be weak in rejecting impostors' voices that are similar to the target speaker's voice. Moreover, as λ is derived from λ̄, both models may correspond to a similar probability distribution.
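The UBM-MAP step described above can be sketched for the mixture means using the standard relevance-factor form of [Reynolds 2000]; the relevance factor of 16 and all names here are illustrative assumptions, and weight/variance adaptation is omitted.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """MAP-adapt diagonal-GMM mixture means toward a speaker's training frames."""
    # Posterior responsibility of each mixture for each frame.
    diff = frames[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances) + diff ** 2 / variances, axis=2))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=1, keepdims=True)               # (T, M)
    n = resp.sum(axis=0)                                  # soft counts per mixture
    first_moment = resp.T @ frames / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]                # data-dependent mixing weight
    return alpha * first_moment + (1.0 - alpha) * means   # interpolate toward the UBM

# Toy example: frames sit near the first of two 1-D components, so only that
# component's mean moves; the unseen component keeps its UBM mean.
ubm_w = np.array([0.5, 0.5])
ubm_means = np.array([[0.0], [10.0]])
ubm_vars = np.ones((2, 1))
spk_frames = np.full((30, 1), 0.5)
adapted = map_adapt_means(spk_frames, ubm_w, ubm_means, ubm_vars)
```

This data-dependent interpolation is what gives GMM-UBM its generalization ability: components with little adaptation data stay close to the UBM.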

One possible way to improve the performance of GMM-UBM is to use discriminative training methods, such as the minimum classification error (MCE) method [Juang 1997] and the maximum mutual information (MMI) method [Ma 2003]. In [Rosenberg 1998], a minimum verification error (MVE) training method was developed by adapting MCE training to the binary classification problem, in which the parameters of λ and λ̄ are estimated using the generalized probabilistic descent (GPD) approach [Chou 2003]. However, because MVE training requires a large number of positive and negative samples to estimate a model's parameters, it tends to over-train the model if the amount of training data is insufficient. In addition, it is difficult to select the optimal stopping point in GPD-based training.

To resolve the limitation of MVE training, we propose a framework called discriminative feedback adaptation (DFA), which improves the discrimination ability of GMM-UBM while preserving its generalization ability. The rationale behind DFA is that only mis-verified training samples are considered in the discriminative training process, rather than all the training samples used in the conventional MVE method. More specifically, DFA regards the UBM and the target speaker model obtained by the GMM-UBM approach as initial models, and then reinforces the discriminability between the models by using the mis-verified training samples. Since the reinforcement is based on model adaptation rather than training from scratch, it does not destroy the generalization ability of the two models, even if they are updated iteratively until convergence. However, recognizing that a small number of mis-verified training samples may not be able to adapt a large number of model parameters, to implement DFA, we propose two adaptation techniques: a linear regression-based minimum verification squared-error (LR-MVSE) adaptation method and an eigenspace-based minimum verification squared-error (E-MVSE) adaptation method.

LR-MVSE is motivated by the minimum classification error linear regression (MCELR)

techniques [Chengalvarayan 1998; Wu 2002; He 2003], which have been studied in the context of automatic speech recognition; while E-MVSE is motivated by the MCE/eigenvoice technique [Valente 2003], which has been studied in the context of speaker identification.

The remainder of this chapter is organized as follows. In Section 4.1, we introduce the proposed DFA framework. Sections 4.2 and 4.3 describe, respectively, the proposed LR-MVSE and E-MVSE adaptation techniques used to implement DFA. Section 4.4 presents simplified versions of LR-MVSE and E-MVSE. Then, in Section 4.5, we detail the experiment results.

4.1. Discriminative Feedback Adaptation

Fig. 4.1 shows a block diagram of the proposed discriminative feedback adaptation (DFA) framework, which is divided into two phases. The first phase, indicated by the dotted line, utilizes the conventional GMM-UBM approach. The initial target speaker model and the UBM obtained in the first phase serve as the initial models for DFA in the second phase. The basic strategy of DFA is to reinforce the discriminability between the initial target speaker model and the UBM for ambiguous data that is mis-verified by the GMM-UBM approach. The reinforcement strategy is based on two concepts. First, since the GMM-UBM approach uses a single anti-model, the UBM, for all target speakers, it tends to be weak in rejecting impostors' voices that are similar to the target speaker's voice. To resolve this problem, DFA tries to generate a discriminative anti-model exclusively for each target speaker by using the negative samples from the cohort [Rosenberg 1992] of each target speaker to adapt both λ and λ̄. Since the models may affect each other, the DFA framework also uses the positive samples to avoid increasing the miss probability while reducing the false alarm probability. The resulting λ and λ̄ are then updated iteratively. Second, since the DFA framework only uses mis-verified training samples as adaptation data in each iteration, it actually fine-tunes the models' parameters based on a small amount of adaptation data. It thus preserves the generalization ability of the GMM-UBM approach while reinforcing the discrimination between H0 and H1. To implement the above concepts, we developed the following algorithms.
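The feedback loop described above can be outlined structurally as follows; `score` and `adapt` are placeholders for the LR-MVSE/E-MVSE machinery developed in later sections, and the toy usage treats the "model" as a scalar bias, so this is a sketch of the control flow only.

```python
def dfa(target_model, anti_model, positives, negatives,
        score, adapt, threshold=0.0, max_iters=10):
    """Discriminative feedback adaptation: adapt models only on mis-verified samples."""
    for _ in range(max_iters):
        # Collect the mis-verified samples under the current models.
        false_rejects = [u for u in positives
                         if score(u, target_model, anti_model) < threshold]
        false_accepts = [u for u in negatives
                         if score(u, target_model, anti_model) >= threshold]
        if not false_rejects and not false_accepts:
            break  # converged: every training sample is verified correctly
        target_model, anti_model = adapt(
            target_model, anti_model, false_rejects, false_accepts)
    return target_model, anti_model

# Toy usage: each adaptation step raises the scalar "model" by 0.5 until the
# positive sample at -1.0 is no longer falsely rejected.
m, a = dfa(0.0, None, positives=[1.0, -1.0], negatives=[-3.0],
           score=lambda u, m, a: u + m,
           adapt=lambda m, a, fr, fa: (m + 0.5, a))
```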

[Fig. 4.1 block diagram: background speaker data → EM → UBM; target speaker data + UBM → MAP → initial target speaker model; cohort selection supplies cohort data; DFA then adapts the target speaker model and anti-model.]

Fig. 4.1. The proposed discriminative feedback adaptation framework.

4.1.1. Minimum Verification Squared-Error (MVSE) Adaptation Strategy

We modify the minimum verification error (MVE) training method [Rosenberg 1998] to fit our requirement that only mis-verified training samples should be considered. This is called the minimum verification squared-error (MVSE) adaptation strategy. The goal of DFA is to minimize the overall expected loss D, defined as

D = x0 l0 + x1 l1, (4.1)

where x0 and x1 reflect which type of error is of more concern in a practical application; and li is a loss function that describes the average false rejection loss (i = 0) or false acceptance loss (i = 1), defined as

li = (1/Ni) ΣU ℓ(d(U)), i = 0, 1,

where the sum for i = 0 runs over the training utterances from the target speaker and the sum for i = 1 runs over the training utterances from the cohort; N0 and N1 are the numbers of training utterances from the target speaker and the cohort, respectively; ℓ(⋅) is a per-utterance loss; and d(U) is a mis-verification measure defined as

d(U) = { −L(U), if U is from the target speaker,
       {  L(U), if U is from the cohort,

where L(U) is the logarithmic LR defined as

L(U) = log p(U | λ) − log p(U | λ̄).
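Numerically, the expected loss in Eq. (4.1) can be sketched as below. The squared hinge on the mis-verification measure is an assumption suggested by the "squared-error" name; under it, correctly verified samples (d(U) ≤ 0) contribute nothing, matching the requirement that only mis-verified utterances be considered.

```python
def mis_verification(log_lr, is_target, threshold=0.0):
    """d(U): positive exactly when the sample is mis-verified at the given threshold."""
    return threshold - log_lr if is_target else log_lr - threshold

def expected_loss(target_lrs, cohort_lrs, x0=1.0, x1=1.0, threshold=0.0):
    """D = x0*l0 + x1*l1 with a squared-hinge per-sample loss (assumed form)."""
    l0 = sum(max(0.0, mis_verification(s, True, threshold)) ** 2
             for s in target_lrs) / len(target_lrs)       # average false-rejection loss
    l1 = sum(max(0.0, mis_verification(s, False, threshold)) ** 2
             for s in cohort_lrs) / len(cohort_lrs)       # average false-acceptance loss
    return x0 * l0 + x1 * l1

# Target scores [1.0, -2.0]: only -2.0 is falsely rejected (loss 4); cohort scores
# [-1.0, 3.0]: only 3.0 is falsely accepted (loss 9), giving D = 2.0 + 4.5.
D = expected_loss([1.0, -2.0], [-1.0, 3.0])
```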

To reflect the requirement that only mis-verified training utterances should be considered,

