

Chapter 4 Improving GMM-UBM Speaker Verification Using Discriminative Feedback Adaptation

4.4 Simplified Versions of LR-MVSE and E-MVSE

As far as reliability is concerned, a target speaker model trained with the GMM-UBM approach may be effective in characterizing the target speaker's voice. In contrast, a UBM generated from a number of background speakers may not be able to represent the impostors with respect to each specific target speaker; in other words, it may not be able to distinguish between impostors and the target speaker. Thus, it is more important to reinforce discriminability in the UBM than in the target speaker model. Moreover, in our experience, the training samples of target speakers are seldom mis-verified; i.e., nearly all the mis-verified training samples are from the cohort. Accordingly, to adapt the UBM to the target-speaker-dependent anti-model, it might be sufficient to use only negative training samples in our DFA framework. In this case, the training goal can be simplified to one of minimizing the average false acceptance (false alarm) loss l1. For LR-MVSE adaptation, the parameter W is iteratively optimized to minimize l1; likewise, for E-MVSE adaptation, the eigen-coefficients w_z, z = 1, ..., Z+1, are iteratively optimized to minimize l1. As a result, since only the negative training samples need to be processed, the training times of the simplified versions of LR-MVSE and E-MVSE are about one-quarter of those of the respective original versions.
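As a toy illustration of the simplified training goal, the sketch below minimizes an average false-acceptance loss over negative samples only. The sigmoid form, the setting a = 3, and the single scalar stand-in for the adapted anti-model parameters are assumptions made for illustration; this is not the actual LR-MVSE update of W.

```python
import numpy as np

def sigmoid(d, a=3.0, b=0.01):
    # Assumed smoothed loss s(d) = 1 / (1 + exp(-a*d + b)); the exact
    # form of Eq. (4.1) may differ from this parameterization.
    return 1.0 / (1.0 + np.exp(-a * d + b))

def false_alarm_loss(llr_neg, theta=0.0):
    # l1: average squared sigmoid loss over negative (impostor) samples;
    # d > 0 means the sample would be falsely accepted.
    d = llr_neg - theta
    return float(np.mean(sigmoid(d) ** 2))

def adapt_anti_model_bias(llr_neg, steps=50, lr=1.0):
    # Toy stand-in for the simplified adaptation: a single scalar bias
    # added to the anti-model score, trained by gradient descent on l1.
    bias = 0.0
    for _ in range(steps):
        d = llr_neg - bias
        s = sigmoid(d)
        # d(l1)/d(bias) = mean(2*s * a*s*(1-s) * (-1)) with a = 3
        grad = np.mean(2.0 * s * 3.0 * s * (1.0 - s) * (-1.0))
        bias -= lr * grad
    return bias
```

With only impostor scores supplied, the bias grows until nearly no negative sample is falsely accepted, mirroring the idea that all mis-verified training samples come from the cohort.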

4.5 Experiments and Analysis

A. Experiment setup

In our experiments, we used the NIST 2001 cellular speaker recognition evaluation (NIST2001-SRE) database, and divided it into two subsets: an evaluation set and a development set. The evaluation set contained 74 male and 100 female speakers. On average, each speaker had approximately 2 minutes of training utterances and 10 test segments. The development set contained 38 male and 22 female background speakers that did not overlap with the speakers in the evaluation set. To scale up the number of background speakers, we also included 139 male and 191 female speakers extracted from the NIST2002-SRE corpus. Thus, we collected the training utterances of 177 male and 213 female background speakers to build two gender-dependent UBMs, each containing 1,024 mixture components. To train each target speaker's GMM, we only adapted the mean vectors from the speaker's corresponding gender-dependent UBM in the GMM-UBM method. Then, for each male or female target speaker, we chose the B closest speakers from the 177 male or 213 female background speakers, respectively, as a cohort, based on the degree of closeness measured in terms of the pairwise distance defined in Eq. (2.3). For each cohort speaker, we extracted J 3-second speech segments from his/her training utterances as negative samples of the target speaker; thus, each target speaker had J×B negative samples in total. All the 3-second segments extracted from each target speaker's training utterances served as positive samples in LR-MVSE or E-MVSE adaptation.
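The cohort construction above can be sketched as follows. The helper names select_cohort and cut_segments are hypothetical, and the 300-frame segment length (3 seconds at a 10-ms frame shift) is an illustrative assumption.

```python
import numpy as np

def select_cohort(distances, B):
    # distances: dict mapping background-speaker id -> pairwise distance
    # to the target speaker (cf. Eq. (2.3)); smaller means closer.
    # Returns the B closest background speakers as the cohort.
    return sorted(distances, key=distances.get)[:B]

def cut_segments(frames, seg_frames=300, J=4):
    # Cut up to J non-overlapping 3-second segments (300 frames at a
    # 10-ms shift) from one cohort speaker's training frames, to serve
    # as negative samples of the target speaker.
    segs = [frames[i:i + seg_frames]
            for i in range(0, len(frames) - seg_frames + 1, seg_frames)]
    return segs[:J]
```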

To remove silence/noise frames, we processed all the speech data with a Voice Activity Detector (VAD). Then, using a 32-ms Hamming-windowed frame with 10-ms shifts, we converted each utterance into a stream of 30-dimensional feature vectors, each consisting of 15 Mel-scale frequency cepstral coefficients (MFCCs) and their first time derivatives. To compensate for channel mismatch effects, we applied feature warping [Pelecanos 2001] after MFCC extraction.
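The front end described above can be sketched minimally as follows: framing with a 32-ms Hamming window and 10-ms shifts, plus a regression-based first derivative appended to the static coefficients. The 8-kHz sampling rate and the delta-window width are assumptions, and the actual MFCC and feature-warping stages are omitted.

```python
import numpy as np

def frame_signal(x, fs=8000, win_ms=32, shift_ms=10):
    # Split a waveform into 32-ms Hamming-windowed frames with 10-ms shifts.
    win, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
    n = 1 + (len(x) - win) // shift
    frames = np.stack([x[i * shift: i * shift + win] for i in range(n)])
    return frames * np.hamming(win)

def add_deltas(c, width=2):
    # Append first time derivatives (regression deltas) to the static
    # cepstral coefficients, e.g. 15 MFCCs -> 30-dimensional vectors.
    pad = np.pad(c, ((width, width), (0, 0)), mode='edge')
    num = sum(t * (pad[width + t: len(c) + width + t]
                   - pad[width - t: len(c) + width - t])
              for t in range(1, width + 1))
    den = 2 * sum(t * t for t in range(1, width + 1))
    return np.hstack([c, num / den])
```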

In the experiments, the parameters a and b in the sigmoid function s(·) defined in Eq. (4.1) were set at 3 and 0.01, respectively. For E-MVSE adaptation, we generated two gender-dependent Z-dimensional eigenspaces using the GMMs of the 177 male and 213 female background speakers, respectively, with Z set to 70 or 140. The LR-MVSE and E-MVSE adaptation procedures were trained until they almost converged, i.e., until the number of mis-verified training samples approximated zero. For the overall expected loss D, the weights x0 and x1 were set as CMiss × PTarget and CFalseAlarm × (1 - PTarget), respectively, according to the NIST Detection Cost Function (DCF) in Eq. (2.25). Following the NIST2001-SRE protocol, CMiss, CFalseAlarm, and PTarget were set at 10, 1, and 0.01, respectively.
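The operating-point weights follow directly from the DCF settings quoted above; a minimal computation, assuming the DCF takes the standard weighted form of Eq. (2.25):

```python
# NIST DCF weights under the NIST2001-SRE settings.
C_MISS, C_FA, P_TARGET = 10.0, 1.0, 0.01

x0 = C_MISS * P_TARGET          # weight on the miss (false rejection) loss
x1 = C_FA * (1.0 - P_TARGET)    # weight on the false alarm loss

def dcf(p_miss, p_fa):
    # Detection cost at one operating point: x0 * P(miss) + x1 * P(fa).
    return x0 * p_miss + x1 * p_fa
```

With these settings x0 = 0.1 and x1 = 0.99, so the cost is dominated by false acceptances, which is consistent with emphasizing the false alarm loss l1 in the simplified adaptation.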

B. Experiment results

To evaluate the performance of the DFA framework, we used the Detection Error Tradeoff (DET) curve and the NIST DCF; the latter reflects the performance at a single operating point on the former. We implemented the proposed DFA framework in three ways:

a) LR-MVSE adaptation (“MAP + LR-MVSE”),

b) E-MVSE adaptation with the first 70 eigenvectors (“MAP + E-MVSE70”), and

c) E-MVSE adaptation with the first 140 eigenvectors (“MAP + E-MVSE140”).

For the performance comparison, we used two baseline systems:

a) GMM-UBM (“MAP”) and

b) conventional MVE (MCE) training with the sigmoid function (“MAP + MVE”).

The target speaker GMM and the UBM obtained from the GMM-UBM method served as the initial models for the proposed DFA-related methods and the conventional MVE method.

Fig. 4.3 plots the minimum DCFs against the total number of negative training samples per target speaker for each adaptation method. The experiments involved 2,038 target speaker trials and 20,380 impostor trials of the evaluation set. We considered different numbers of negative samples, but not different numbers of positive samples because the same target speaker data had been used to train the initial target speaker model in the GMM-UBM method.
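The minimum DCF reported in Fig. 4.3 can be obtained by sweeping a decision threshold over the pooled trial scores and keeping the lowest cost. The exhaustive sweep below is an illustrative sketch, not the NIST scoring tool.

```python
import numpy as np

def min_dcf(target_scores, impostor_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    # Try every observed score as a threshold; the minimum over all
    # thresholds is the single best operating point on the DET curve.
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    best = np.inf
    for th in thresholds:
        p_miss = np.mean(target_scores < th)    # false rejection rate
        p_fa = np.mean(impostor_scores >= th)   # false acceptance rate
        cost = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        best = min(best, cost)
    return best
```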

From the figure, we observe that “MAP + E-MVSE70” achieves the lowest minDCF in cases where the adaptation data only includes 6 or 12 negative training samples per target speaker, while “MAP + LR-MVSE” achieves the lowest minDCF in cases where the adaptation data includes 36 or 60 negative training samples per target speaker. As expected, a small amount of adaptation data favors the methods in which a smaller number of model parameters must be estimated. Note that the larger the number of negative training samples used, the lower the minDCF that can be achieved.

Fig. 4.3. The minimum DCFs versus the number (J×B) of 3-second negative training samples per target speaker.

Fig. 4.4 shows the DET curves obtained by evaluating the above systems for the case with 60 negative training samples per target speaker. It is clear that the performances of the three proposed methods, “MAP + LR-MVSE”, “MAP + E-MVSE70”, and “MAP + E-MVSE140”, are comparable; and they all outperform the conventional methods “MAP” and “MAP + MVE”.

Interestingly, the performance of “MAP + MVE” is not always better than that of “MAP”. This is because MVE tends to over-train the models obtained from the GMM-UBM method, and it is difficult to select the optimal stopping point in MVE training.

Fig. 4.4. Experiment results in DET curves. The circles indicate the minimum DCFs.

In the above experiments, we found that nearly all the mis-verified training samples in each adaptation iteration were negative training samples. Thus, we further compared the simplified versions of the LR-MVSE and E-MVSE methods with the respective original versions. Fig. 4.5 shows the DET curves for the case of 60 negative training samples per target speaker. It is clear that the simplified versions perform comparably to the respective original versions. This confirms our assumption that reinforcing the discriminability in the UBM is more beneficial than reinforcing the discriminability in the target speaker model.

Table 4.1 summarizes the minimum DCFs of each system shown in Figs. 4.4 and 4.5.

We observe that “MAP + LR-MVSE” achieves a 14.35% relative DCF reduction over the baseline GMM-UBM system (“MAP”) and a 9.22% relative DCF reduction over the “MAP + MVE” method. In fact, “MAP + simLR-MVSE” even performs slightly better than the original version “MAP + LR-MVSE”.

(a) LR-MVSE vs. the simplified version of LR-MVSE (simLR-MVSE)

(b) E-MVSE70 vs. the simplified version of E-MVSE70 (simE-MVSE70)

(c) E-MVSE140 vs. the simplified version of E-MVSE140 (simE-MVSE140)

Fig. 4.5. The DET curves of the LR-MVSE and E-MVSE systems and their simplified versions. The circles indicate the minimum DCFs.

Table 4.1. Summary of the minimum DCFs in Figs. 4.4 and 4.5.

Methods              minDCF
MAP                  0.0460
MAP + MVE            0.0434
MAP + LR-MVSE        0.0394
MAP + E-MVSE70       0.0413
MAP + E-MVSE140      0.0415
MAP + simLR-MVSE     0.0390
MAP + simE-MVSE70    0.0420
MAP + simE-MVSE140   0.0416
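The relative reductions quoted in the text follow directly from the minDCF values in Table 4.1:

```python
def rel_reduction(baseline, improved):
    # Relative minDCF reduction (in percent) of an improved system
    # over a baseline system.
    return 100.0 * (baseline - improved) / baseline

# "MAP + LR-MVSE" (0.0394) vs. "MAP" (0.0460) and "MAP + MVE" (0.0434)
print(round(rel_reduction(0.0460, 0.0394), 2))  # 14.35
print(round(rel_reduction(0.0434, 0.0394), 2))  # 9.22
```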

Chapter 5 Conclusions

In this dissertation, we have proposed a framework to improve the characterization of the alternative hypothesis for speaker verification. The framework is built on either a weighted arithmetic combination (WAC) or a weighted geometric combination (WGC) of useful information extracted from a set of pre-trained background models. The proposed combinations are more effective and robust than the simple geometric mean and arithmetic mean used in conventional approaches. The parameters associated with WAC or WGC are then optimized using the minimum verification error (MVE) criterion, such that both the false acceptance probability and the false rejection probability are minimized. In addition to applying the conventional gradient-based MVE training method to this problem, we also proposed an evolutionary MVE (EMVE) training scheme to further reduce the verification errors. The results of our speaker verification experiments conducted on the Extended M2VTS Database (XM2VTSDB) demonstrate that the proposed systems along with the MVE or EMVE training achieve higher verification accuracy than conventional LR-based approaches. Although they need more training time than conventional LR-based approaches in the offline training phase, the increase of the training time for enrolling a new target speaker or the verification time for an input test utterance is negligible. The proposed systems are still capable of supporting a real-time response.
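The two combinations can be sketched minimally as follows. The actual WAC and WGC definitions and the MVE/EMVE training of their weights follow the equations in the earlier chapters, which are not reproduced here; with uniform weights, the sketch reduces to the arithmetic and geometric means used in conventional approaches.

```python
import numpy as np

def wac(likelihoods, weights):
    # Weighted arithmetic combination of background-model likelihoods.
    return float(np.dot(weights, likelihoods))

def wgc(likelihoods, weights):
    # Weighted geometric combination: a product of likelihoods raised to
    # trainable weights, computed in the log domain for stability.
    return float(np.exp(np.dot(weights, np.log(likelihoods))))
```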

Alternatively, we have also presented two novel WGC- and WAC-based decision functions for solving the speaker-verification problem. The new decision functions are treated as nonlinear discriminant classifiers that can be solved by using kernel-based techniques, such as the Kernel Fisher Discriminant and Support Vector Machine, to optimally separate samples of the null hypothesis from those of the alternative hypothesis. The proposed approaches have two advantages over existing methods. The first is that they embed a trainable mechanism in the decision functions. The second is that they convert variable-length utterances into fixed-dimension characteristic vectors, which are easily processed by kernel discriminant analysis. The results of experiments on two speaker verification tasks, the XM2VTSDB and ISCSLP2006-SRE tasks, show notable improvements in performance over classical approaches. It is worth noting that although we only consider the speaker verification problem in this dissertation, the above proposed approach is not limited to this application. It can be applied to other types of data and hypothesis testing problems.
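The fixed-dimension mapping can be illustrated minimally: averaging per-frame log-likelihoods under M pre-trained background models yields a vector of the same dimensionality for utterances of any length. The exact characteristic-vector definition in the dissertation may include additional terms; this is a simplified sketch.

```python
import numpy as np

def characteristic_vector(frame_logliks):
    # frame_logliks: (T, M) per-frame log-likelihoods of one utterance
    # under M pre-trained background models. Averaging over the T frames
    # maps a variable-length utterance to a fixed M-dimensional vector
    # that a kernel classifier (KFD or SVM) can then separate.
    return frame_logliks.mean(axis=0)
```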

Finally, we have proposed a discriminative feedback adaptation (DFA) framework to improve the state-of-the-art GMM-UBM speaker verification approach. The framework not only preserves the generalization ability of the GMM-UBM approach, but also reinforces the discrimination between H0 and H1. Our method is based on the minimum verification squared-error (MVSE) adaptation strategy, which is modified from the MVE training method so that only mis-verified training utterances are considered. Because a small number of mis-verified training samples may not be able to adapt a large number of model parameters, to implement DFA, we developed two adaptation techniques: the linear regression-based minimum verification squared-error (LR-MVSE) method and the eigenspace-based minimum verification squared-error (E-MVSE) method. In addition, we use a fast LR scoring approach and the simplified versions of LR-MVSE and E-MVSE to improve the efficiency and effectiveness of the DFA framework. The results of experiments conducted on the NIST2001-SRE database show that the proposed DFA framework can substantially improve the performance of the conventional GMM-UBM approach.

Bibliography

Auckenthaler, R., M. Carey, and H. Lloyd-Thomas, “Score Normalization for Text-Independent Speaker Verification System”, Digital Signal Processing, vol. 10, no. 1, pp. 42-54, 2000.

Bengio, S. and J. Mariéthoz, “Learning the Decision Function for Speaker Verification”, in Proc. ICASSP, Salt Lake City, USA, 2001, pp. 425-428.

Ben-Yacoub, S., “Multi-modal Data Fusion for Person Authentication Using SVM”, in Proc. AVBPA, Washington DC, USA, 1999, pp. 25-30.

Bimbot, F., J. F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, “A Tutorial on Text-Independent Speaker Verification”, EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, 2004.

Burges, C., “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, vol.2, pp. 121-167, 1998.

Campbell, W. M., J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, “Support Vector Machines for Speaker and Language Recognition”, Computer Speech and Language, vol. 20, no. 2-3, pp. 210-229, 2006.

Campbell, W. M., D. E. Sturim, and D. A. Reynolds, “Support Vector Machine Using GMM Supervectors for Speaker Verification”, IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.

Campbell, W. M., J. P. Campbell, T. P. Gleason, D. A. Reynolds, and W. Shen, “Speaker Verification Using Support Vector Machines and High-Level Features”, IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2085-2094, 2007.

Chao, Y. H., W. H. Tsai, H. M. Wang, and R. C. Chang, “A Kernel-based Discrimination Framework for Solving Hypothesis Testing Problems with Application to Speaker Verification”, in Proc. ICPR, Hong Kong, China, 2006, pp. 229-232.

Cheng, S. S., Y. H. Chao, H. M. Wang, and H. C. Fu, “A Prototypes Embedded Genetic Algorithm for K-means Clustering”, in Proc. ICPR2006.

Cheng, H. T., Y. H. Chao, S. L. Yen, C. S. Chen, H. M. Wang, and Y. P. Hung, “An Efficient Approach to Multi-Modal Person Identity Verification by Fusing Face and Voice Information”, in Proc. ICME, Amsterdam, The Netherlands, July 2005.

Chengalvarayan, R., “Speaker Adaptation Using Discriminative Linear Regression on Time-Varying Mean Parameters in Trended HMM”, IEEE Signal Processing Letters, vol. 5, no. 3, pp. 63-65, 1998.

Chinese Corpus Consortium (CCC), “Evaluation Plan for ISCSLP’2006 Special Session on Speaker Recognition”, 2006.

Chou, W. and B. H. Juang, Pattern Recognition in Speech and Language Processing, CRC Press, 2003.

Duda, R. O., P. E. Hart, and D. G. Stork, Pattern Classification, 2nd. ed., John Wiley & Sons, New York, 2001.

Eiben, A. E. and J. E. Smith, Introduction to Evolutionary Computing, Springer, Berlin, 2003.

Faundez-Zanuy, M. and E. Monte-Moreno, “State-of-the-Art in Speaker Recognition”, IEEE Aerospace and Electronic Systems Magazine, vol. 20, no. 5, pp.7-12, 2005.

Fauve, B. G. B., D. Matrouf, N. Scheffer, J. F. Bonastre, and J. S. D. Mason, “State-of-the-Art Performance in Text-Independent Speaker Verification Through Open-Source Software”, IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 7, pp. 1960-1968, 2007.

Gauvain, J. L. and C. H. Lee, “Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observation of Markov Chains”, IEEE Trans. on Speech and Audio Processing, vol. 2, no. 2, pp. 291-298, 1994.

Gillick, L. and S. J. Cox, “Some Statistical Issues in the Comparison of Speech Recognition Algorithms”, in Proc. ICASSP, Glasgow, UK, 1989, pp. 532-535.

He, X. D. and W. Chou, “Minimum Classification Error Linear Regression for Acoustic Model Adaptation of Continuous Density HMMs”, in Proc. ICASSP2003.

He, X. D. and W. Chou, “Minimum Classification Error (MCE) Model Adaptation of Continuous Density HMMs”, in Proc. Eurospeech2003.

Herbrich, R., Learning Kernel Classifiers: Theory and Algorithms, MIT Press, Cambridge, 2002.

Higgins, A., L. Bahler, and J. Porter, “Speaker Verification Using Randomized Phrase Prompting”, Digital Signal Processing, vol. 1, no. 2, pp. 89-106, 1991.

Huang, X., A. Acero, and H. W. Hon, Spoken Language Processing, Prentice Hall, New Jersey, 2001.

Juang, B. H., W. Chou, and C. H. Lee, “Minimum Classification Error Rate Methods for Speech Recognition”, IEEE Trans. Speech and Audio Processing, vol. 5, pp. 257-265, 1997.

Krishna, K. and M. N. Murty, “Genetic K-Means Algorithm”, IEEE Trans. Systems, Man, and Cybernetics – Part B, vol. 29, no. 3, pp. 433-439, June 1999.

Kuhn, R., J. C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid Speaker Adaptation in Eigenvoice Space”, IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 695-707, 2000.

Kuo, H. K. J., C. H. Lee, I. Zitouni, and E. Fosler-Lussiert, “Minimum Verification Error Training for Topic Verification”, in Proc. ICASSP2003.

Lee, Y. J. and O. L. Mangasarian, “SSVM: Smooth Support Vector Machine for Classification”, Computational Optimization and Applications, vol. 20, no. 1, pp. 5-22, 2001.

Lindberg, J., J. Koolwaaij, H. P. Hutter, D. Genoud, J. B. Pierrot, M. Blomberg, and F. Bimbot, “Techniques for A Priori Decision Threshold Estimation in Speaker Verification”, in Proc. RLA2C, pp. 89-92, 1998.

Liu, C. S., H. C. Wang, and C. H. Lee, “Speaker Verification Using Normalized Log-Likelihood Score”, IEEE Trans. Speech and Audio Processing, vol. 4, no. 1, pp. 56-60, 1996.

Lu, Y., S. Lu, F. Fotouhi, Y. Deng, and S. J. Brown, “FGKA: A Fast Genetic K-means Clustering Algorithm”, in Proc. ACM Symposium on Applied Computing, 2004.

Luettin, J. and G. Maitre, Evaluation Protocol for the Extended M2VTS Database (XM2VTSDB), IDIAP-COM 98-05, IDIAP, 1998.

Ma, C. and E. Chang, “Comparison of Discriminative Training Methods for Speaker Verification”, Proc. ICASSP2003.

Mami, Y. and D. Charlet, “Speaker Recognition by Location in the Space of Reference Speakers”, Speech Communication, vol. 48, pp. 127-141, 2006.

Martin, A., G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET Curve in Assessment of Detection Task Performance”, in Proc. Eurospeech1997.

McDermott, E., T. J. Hazen, J. Le Roux, A. Nakamura, and S. Katagiri, “Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error”, IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 1, pp. 203-223, 2007.

Messer, K., J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The Extended M2VTS Database”, in Proc. AVBPA1999.

Mika, S., G. Rätsch, J. Weston, B. Schölkopf, and K. R. Müller, “Fisher Discriminant Analysis with Kernels”, in Proc. Neural Networks for Signal Processing IX, Madison, WI, USA, 1999, pp. 41-48.

Mika, S., “Kernel Fisher Discriminants”, Ph.D. thesis, University of Technology, Berlin, 2002.

Pelecanos, J. and S. Sridharan, “Feature Warping for Robust Speaker Verification”, in Proc. Odyssey2001.

Przybocki, M. A., A. F. Martin, and A. N. Le, “NIST Speaker Recognition Evaluations Utilizing the Mixer Corpora—2004, 2005, 2006”, IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 7, pp. 1951-1959, 2007.

Rahim, M. G. and C. H. Lee, “String based Minimum Verification Error (SB-MVE) Training for Flexible Speech Recognition”, Computer Speech and Language, vol. 11, no. 2, pp. 147-160, 1997.

Reynolds, D. A., “Speaker Identification and Verification Using Gaussian Mixture Speaker Models”, Speech Communication, vol.17, no. 1-2, pp. 91-108, 1995.

Reynolds, D. A., T. F. Quatieri, and R. B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models”, Digital Signal Processing, vol. 10, no. 1, pp. 19-41, 2000.

Rosenberg, A. E., J. DeLong, C. H. Lee, B. H. Juang and F. K. Soong, “The Use of Cohort Normalized Scores for Speaker Verification”, in Proc. ICSLP1992.

Rosenberg, A. E., O. Siohan, and S. Parthasarathy, “Speaker Verification Using Minimum Verification Error Training”, in Proc. ICASSP1998.

Siohan, O., A. E. Rosenberg, and S. Parthasarathy, “Speaker Identification Using Minimum Classification Error Training”, Proc. ICASSP1998.

Siu, M. H., B. Mak, and W. H. Au, “Minimization of Utterance Verification Error Rate as a Constrained Optimization Problem”, IEEE Signal Processing Letters, vol. 13, no. 12, pp. 760-763, 2006.

Strang, G., Linear Algebra and Its Applications, 4th. ed., Brooks/Cole, 2005.

Sturim, D. E., D. A. Reynolds, E. Singer, and J. P. Campbell, “Speaker Indexing in Large Audio Databases Using Anchor Models”, in Proc. ICASSP, Salt Lake City, USA, 2001, vol.1, pp. 429-432.

Sturim, D. E. and D. A. Reynolds, “Speaker Adaptive Cohort Selection for Tnorm in Text-Independent Speaker Verification”, in Proc. ICASSP2005.

Sukkar, R. A., A. R. Setlur, M. G. Rahim, and C. H. Lee, “Utterance Verification of Keyword Strings Using Word-Based Minimum Verification Error (WB-MVE) Training”, in Proc. ICASSP1996.

Sukkar, R. A., “Subword-based Minimum Verification Error (SB-MVE) Training for Task Independent Utterance Verification”, in Proc. ICASSP1998.

Thyes, O., R. Kuhn, P. Nguyen, and J.-C. Junqua, “Speaker Identification and Verification Using Eigenvoices”, in Proc. ICSLP2000.

Valente, F. and C. Wellekens, “Minimum Classification Error/Eigenvoices Training for Speaker Identification”, in Proc. ICASSP2003.

Van Leeuwen, D. A., A. F. Martin, M. A. Przybocki, and J. S. Bouten, “NIST and NFI-TNO Evaluations of Automatic Speaker Recognition”, Computer Speech and Language, vol.

20, pp. 128-158, 2006.

Vapnik, V., Statistical Learning Theory, John Wiley & Sons, New York, 1998.

Wan, V. and S. Renals, “Speaker Verification Using Sequence Discriminant Support Vector Machines”, IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 203-210, 2005.

Wu, J. and Q. Huo, “Supervised Adaptation of MCE-Trained CDHMMs Using Minimum Classification Error Linear Regression”, in Proc. ICASSP2002.

Zheng, T. F., Z. Song, L. Zhang, M. Brasser, W. Wu, and J. Deng, “CCC Speaker Recognition Evaluation 2006: Overview, Methods, Data, Results and Perspective”, in Proc. ISCSLP, Kent Ridge, Singapore, Dec. 2006.
