Gaussian Mixture Models Training Method

CHAPTER 2 THE PROPOSED SYSTEM

2.3 Gaussian Mixture Models Training Method

For a text-independent speaker identification or verification system, we do not limit what the speaker will say. In [5-8, 12, 13], the Gaussian mixture model (GMM) has been used to represent the distribution of a speaker’s speech features.

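The GMM can be denoted as $\lambda = \{w_m, \mu_m, \Sigma_m\}$, $m = 1, 2, \ldots, M$, where $M$ is the mixture number and $w_m$, $\mu_m$, and $\Sigma_m$ are the weight, mean vector, and covariance matrix of the $m$-th mixture component.

For an $L$-dimensional feature vector $\tilde{x}_{h_{real}}$, we can calculate its probability in the GMM as below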
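In its standard form [5], this probability is a weighted sum of $M$ Gaussian densities. The block below is a sketch under the assumption that $w_m$ denotes the mixture weights (summing to one), $\mu_m$ the mean vectors, and $\Sigma_m$ the covariance matrices; the symbol $b_m$ for the component density is introduced here only for illustration.

```latex
p(\tilde{x} \mid \lambda) = \sum_{m=1}^{M} w_m \, b_m(\tilde{x}),
\qquad
b_m(\tilde{x}) = \frac{1}{(2\pi)^{L/2} \, |\Sigma_m|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (\tilde{x} - \mu_m)^{\top} \Sigma_m^{-1} (\tilde{x} - \mu_m) \right),
\qquad
\sum_{m=1}^{M} w_m = 1
```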

According to our system architecture, there are two parts: the training part and the testing part. In the training part, the GMM for each speaker is established. In the testing part, the speech feature vectors of a speaker are input into each speaker’s GMM to calculate the corresponding probabilities. Then the speech is considered to be spoken by the speaker with the highest probability.

In the training part, the speeches of each speaker are collected as the training speeches. First, the feature vectors of these training speeches are extracted. Secondly, the K-means [...] feature vectors are used to estimate the maximum-likelihood model parameters with the iterative expectation-maximization (EM) algorithm [5, 15]. The EM algorithm refines the GMM parameters iteratively and monotonically increases the likelihood of the estimated model.
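The EM refinement described above can be sketched as follows. This is a minimal illustration rather than the thesis implementation: it assumes diagonal covariance matrices, initialises the means from randomly chosen frames instead of a K-means step, and uses hypothetical names (`em_fit`, `var_floor`). The returned `history` holds the total log-likelihood measured at the start of each iteration, which is expected not to decrease.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance (assumption)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - mi) ** 2 / v
                      for xi, mi, v in zip(x, mean, var))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def em_fit(frames, M, n_iter=20, seed=0, var_floor=1e-3):
    """Fit an M-component diagonal GMM to a list of feature vectors by EM."""
    rng = random.Random(seed)
    H, L = len(frames), len(frames[0])
    # Assumed initialisation: M distinct frames as means, global variances.
    means = [list(f) for f in rng.sample(frames, M)]
    g_mean = [sum(f[d] for f in frames) / H for d in range(L)]
    g_var = [max(sum((f[d] - g_mean[d]) ** 2 for f in frames) / H, var_floor)
             for d in range(L)]
    variances = [list(g_var) for _ in range(M)]
    weights = [1.0 / M] * M
    history = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        resp, total = [], 0.0
        for x in frames:
            log_joint = [math.log(weights[m]) + log_gauss(x, means[m], variances[m])
                         for m in range(M)]
            norm = log_sum_exp(log_joint)
            total += norm
            resp.append([math.exp(lj - norm) for lj in log_joint])
        history.append(total)
        # M-step: re-estimate weights, means, and variances.
        for m in range(M):
            n_m = max(sum(r[m] for r in resp), 1e-10)
            weights[m] = n_m / H
            means[m] = [sum(r[m] * x[d] for r, x in zip(resp, frames)) / n_m
                        for d in range(L)]
            variances[m] = [max(sum(r[m] * (x[d] - means[m][d]) ** 2
                                    for r, x in zip(resp, frames)) / n_m, var_floor)
                            for d in range(L)]
    return weights, means, variances, history
```

Working in the log domain with `log_sum_exp` is a design choice made here to avoid underflow for high-dimensional frames; the variance floor is likewise an assumption that keeps a component from collapsing onto a single frame.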

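The feature vectors, $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{H_{real}}\}$, are assumed independent. The probability of $\tilde{X}$ [...] probability of a frame less than $10^{-25}$ is reset to $10^{-25}$. For implementation convenience, the log-likelihood under the estimated GMM $\hat{\lambda}$ is used, as in the formula below:

$$\log p(\tilde{X} \mid \hat{\lambda}) = \sum_{h_{real}=1}^{H_{real}} \log p(\tilde{x}_{h_{real}} \mid \hat{\lambda})$$
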

In the testing part, the feature vectors of all real speech frames of an input speech are extracted, and the log-likelihood under each speaker’s GMM is evaluated as described above. The speech is determined to be spoken by the speaker $\hat{S}$ with the highest log-likelihood:

$$\hat{S} = \arg\max_{1 \le i \le S} \log p(\tilde{X} \mid \hat{\lambda}_i) \qquad (18)$$

where $S$ is the number of speakers and $\hat{\lambda}_i$ is the GMM of the $i$-th speaker.
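The scoring and decision steps can be sketched as follows. The per-frame floor of 1e-25 comes from the text and also keeps the logarithm finite; the diagonal-covariance density, the `(weights, means, variances)` tuple layout, and the function names are assumptions made for this illustration.

```python
import math

P_FLOOR = 1e-25  # per-frame probability floor stated in the text


def frame_probability(x, weights, means, variances):
    """GMM density of one frame, assuming diagonal covariances."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_b = -0.5 * sum(math.log(2 * math.pi * v) + (xi - mi) ** 2 / v
                           for xi, mi, v in zip(x, mu, var))
        total += w * math.exp(log_b)
    return total


def log_likelihood(frames, gmm):
    """Sum of per-frame log probabilities, flooring each frame probability."""
    return sum(math.log(max(frame_probability(x, *gmm), P_FLOOR))
               for x in frames)


def identify(frames, speaker_gmms):
    """Return the key of the speaker whose GMM gives the highest log-likelihood."""
    return max(speaker_gmms,
               key=lambda s: log_likelihood(frames, speaker_gmms[s]))


# Toy usage with two hypothetical one-dimensional, single-component models.
gmms = {
    "A": ([1.0], [[0.0]], [[1.0]]),
    "B": ([1.0], [[5.0]], [[1.0]]),
}
print(identify([[4.8], [5.3]], gmms))  # B
```

Summing logarithms instead of multiplying probabilities avoids underflow when an utterance has many frames.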

CHAPTER 3 EXPERIMENT RESULTS

In this chapter, we present the experimental results of our system. The databases used in our experiments are the CMU PDA Database and our own database, recorded from our lab members’ speeches.

The CMU PDA Database is a free database released on the internet by Carnegie Mellon University. It contains 16 speakers, with 51 different speeches recorded for each speaker. Each speech is recorded by 5 recording devices at the same time, so each speech has five recording files and each speaker has 255 recording files in total. The sampling frequency is 16000 Hz, and the durations of the speeches are 3-5 seconds.

Our database contains speeches recorded from our lab members. There are 8 speakers, and 5 different speeches are recorded for each speaker. Each speech is spoken 5 times using the same recording device, so each speaker has 25 recording files in total. The sampling frequency is 44100 Hz, and the durations of the speeches are 10-15 seconds.

In our experiments on the CMU PDA Database, we take 100 speeches as the training speeches and the remaining 155 speeches as the testing speeches for each speaker, and the 30th-percentile energy is used as the threshold t_silent. In our database, we use 10 speeches as the training speeches and the remaining 15 speeches as the testing speeches, and the 20th-percentile energy is used as the threshold t_silent. Each experiment is repeated 4 times with different training speeches, and the results report the average identification rates and the standard deviations.
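The repeated-split protocol can be sketched as below. Several details are assumptions, since the text does not state them: the training files are drawn at random, the sample (n - 1) standard deviation is reported, and `evaluate` is a hypothetical callback that trains on the first index list, tests on the second, and returns an identification rate in percent.

```python
import random
import statistics


def random_split(n_files, n_train, rng):
    """Randomly split file indices 0..n_files-1 into training and testing sets."""
    indices = list(range(n_files))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def run_experiment(n_files, n_train, evaluate, n_runs=4, seed=0):
    """Repeat the split n_runs times and summarise the identification rates."""
    rng = random.Random(seed)
    rates = [evaluate(*random_split(n_files, n_train, rng))
             for _ in range(n_runs)]
    # Sample standard deviation is an assumption; the text does not say which.
    return statistics.mean(rates), statistics.stdev(rates)


# Split sizes for one speaker of the CMU PDA Database (255 files, 100 for training).
train, test = random_split(255, 100, random.Random(0))
print(len(train), len(test))  # 100 155
```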

3.1 Feature Dimension and Mixture Number Decision

For our system, we need to determine the dimension L of the MFCC feature vector and the mixture number M of the GMM, because the identification rate is affected by both. In this experiment, we try the dimensions L = 15, 16, 17, 18, 19 and the mixture numbers M = 4, 6, 8, 10, 12. The results for the CMU PDA Database are shown in Fig. 6 and Table 2, and the results for our database are shown in Fig. 7 and Table 3.

According to these experimental results, we choose the dimension L = 16 and the mixture number M = 8, which give the highest identification rate for the CMU PDA Database. In our database, L = 18, M = 10 and L = 18, M = 12 have the same identification rate. A larger mixture number increases the computational complexity; hence we choose the dimension L = 18 and the mixture number M = 10 for our database.
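The selection rule described above (highest identification rate, with ties resolved in favour of the smaller mixture number) can be sketched as follows. The rates are transcribed from Table 3; the function name and the secondary tie-break on the smaller dimension are assumptions for illustration.

```python
def choose_setting(rates):
    """Pick (L, M) with the highest rate; ties go to smaller M, then smaller L."""
    return min(rates, key=lambda lm: (-rates[lm], lm[1], lm[0]))


DIMS = [15, 16, 17, 18, 19]
# Identification rates (%) for our database, one row per mixture number M.
TABLE_3 = {
    4:  [95.21, 97.29, 98.13, 98.96, 98.96],
    6:  [97.08, 98.33, 98.13, 98.75, 98.54],
    8:  [96.67, 98.54, 98.75, 98.75, 98.75],
    10: [97.08, 97.92, 98.54, 99.17, 98.96],
    12: [96.88, 97.92, 98.96, 99.17, 98.96],
}
rates = {(L, M): r for M, row in TABLE_3.items() for L, r in zip(DIMS, row)}
print(choose_setting(rates))  # (18, 10)
```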

Fig. 6. Identification rates using different dimensions and mixture numbers for CMU PDA Database.

IR (SD)  L=15            L=16            L=17            L=18            L=19
M=4      98.94% (0.24%)  99.04% (0.40%)  98.85% (0.55%)  98.66% (0.94%)  98.86% (0.63%)
M=6      99.28% (0.28%)  99.35% (0.18%)  98.91% (0.86%)  98.80% (1.39%)  97.88% (1.79%)
M=8      99.35% (0.23%)  99.53% (0.25%)  96.96% (2.04%)  98.24% (1.54%)  98.86% (0.94%)
M=10     99.44% (0.31%)  99.20% (0.58%)  96.85% (2.27%)  97.81% (2.16%)  99.20% (0.45%)
M=12     99.49% (0.34%)  98.88% (0.70%)  98.40% (1.65%)  98.87% (0.97%)  99.47% (0.35%)

Table 2. Identification rates (IR) and standard deviations (SD) using different dimensions and mixture numbers for CMU PDA Database.

Fig. 7. Identification rates using different dimensions and mixture numbers for our database.

IR (SD)  L=15            L=16            L=17            L=18            L=19
M=4      95.21% (2.49%)  97.29% (2.08%)  98.13% (2.19%)  98.96% (0.80%)  98.96% (0.80%)
M=6      97.08% (3.23%)  98.33% (1.80%)  98.13% (2.19%)  98.75% (1.08%)  98.54% (1.42%)
M=8      96.67% (2.45%)  98.54% (1.42%)  98.75% (1.08%)  98.75% (1.08%)  98.75% (1.08%)
M=10     97.08% (2.59%)  97.92% (1.73%)  98.54% (1.42%)  99.17% (0.68%)  98.96% (0.80%)
M=12     96.88% (2.29%)  97.92% (1.73%)  98.96% (0.80%)  99.17% (0.68%)  98.96% (0.80%)

Table 3. Identification rates (IR) and standard deviations (SD) using different dimensions and mixture numbers for our database.

3.2 Comparison of Different Thresholds t_silent

In our channel effect remover, we need to set a threshold t_silent to classify frames as silent frames or real speech frames. The threshold t_silent is set according to a percentile of the frame energy, and different percentiles affect the identification rate. In this experiment on the CMU PDA Database, we use the percentiles 10%, 15%, 20%, 25%, 30%, 35%, and 40% with L = 16, M = 8. The experimental results are shown in Table 4.

According to the results, we choose the 30th-percentile energy as the threshold t_silent for the CMU PDA Database.
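A percentile-based threshold of this kind can be sketched as follows. The linear-interpolation percentile, the strict comparison (a frame counts as real speech only if its energy exceeds t_silent), and the function names are assumptions; the text specifies only that t_silent is a percentile of the frame energies.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def real_speech_mask(energies, q):
    """True for frames whose energy exceeds t_silent, the q-th percentile energy."""
    t_silent = percentile(energies, q)
    return [e > t_silent for e in energies]


# Ten hypothetical frame energies; with q = 30 the three quietest frames are silent.
energies = [float(e) for e in range(1, 11)]
print(real_speech_mask(energies, 30).count(True))  # 7
```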

For our database, we also use the percentiles 10%, 15%, 20%, 25%, 30%, 35%, and 40%, with L = 18, M = 10. The experimental results are shown in Table 5. According to the results, we choose the 20th-percentile energy as the threshold t_silent for our database.

Table 4. Identification rates and standard deviations using different percentiles of CMU PDA Database.

Percentile  Identification Rates  Standard Deviations
10%         98.54%                0.42%
15%         98.96%                0.80%
20%         99.17%                0.68%
25%         96.88%                3.00%
30%         97.08%                3.08%
35%         96.67%                2.81%
40%         96.46%                3.22%

Table 5. Identification rates and standard deviations using different percentiles of our database.

3.3 Comparison of Different Methods

In this experiment, we compare the identification rates of the proposed method and other methods using different feature vectors. According to the above experimental results, L = 16, M = 8 has the highest identification rate for the CMU PDA Database and is therefore used in this experiment. For our database, L = 18, M = 10 has the highest identification rate and is used in this experiment.

The methods compared in this experiment are the proposed method, the MFCC, the MFCC of the real speech frames, the MFCC using the traditional cepstral mean subtraction (CMS) [9, 10], the delta-cepstrum of MFCC [11], and the MFCC using CMS of the real speech frames. The experimental results for the CMU PDA Database are shown in Table 6, and those for our database are shown in Table 7.

These experimental results show that the proposed method has the highest identification rate: it is 4.09 percentage points higher than that of the original MFCC for the CMU PDA Database, and 2.50 percentage points higher for our database. Compared with the real speech frames using CMS, the identification rate of the proposed method is 0.25 percentage points higher for the CMU PDA Database and 0.42 percentage points higher for our database. The proposed method has the highest identification rate and the lowest standard deviation for the CMU PDA Database, and the highest identification rate for our database.

Method                        Identification Rates  Standard Deviations
MFCC                          95.44%                3.02%
Real speech frames            96.59%                2.41%
MFCC using CMS                98.17%                1.33%
Delta-cepstrum of MFCC        99.02%                0.65%
Real speech frames using CMS  99.28%                0.60%
Proposed method               99.53%                0.25%

Table 6. Identification rates and standard deviations of different methods for CMU PDA Database.

Method                        Identification Rates  Standard Deviations
MFCC                          96.67%                1.18%
Real speech frames            97.71%                1.05%
MFCC using CMS                98.13%                0.80%
Delta-cepstrum of MFCC        98.54%                0.80%
Real speech frames using CMS  98.75%                0.48%
Proposed method               99.17%                0.68%

Table 7. Identification rates and standard deviations of different methods for our database.

3.4 System Robustness Testing

In this experiment, we test the robustness of our system. For both databases, we use half of the training speeches to train the GMM of each speaker. The experimental results are shown in Tables 8 and 9, which compare the mentioned methods with our proposed method.

Method                          Identification Rates  Standard Deviations
MFCC + Delta-cepstrum (L = 32)  88.43%                2.97%
MFCC                            94.22%                4.30%
Real speech frames              95.20%                4.51%
MFCC using CMS                  97.12%                1.06%
Delta-cepstrum of MFCC          98.64%                1.21%
Real speech frames using CMS    98.88%                0.94%
Proposed method                 99.16%                0.57%

Table 8. Identification rates and standard deviations of half training speeches in different methods for CMU PDA Database.

These experimental results using half of the training speeches show that the proposed method again has the highest identification rate: it is 0.28 percentage points higher than that of the real speech frames using CMS for the CMU PDA Database, and 0.31 percentage points higher for our database. The proposed method has the highest identification rate and the lowest standard deviation for the CMU PDA Database, and the highest identification rate for our database.

Method                          Identification Rates  Standard Deviations
MFCC + Delta-cepstrum (L = 32)  42.66%                19.31%
MFCC                            95.00%                1.35%
Real speech frames              96.56%                1.20%
MFCC using CMS                  97.66%                1.39%
Delta-cepstrum of MFCC          98.13%                0.88%
Real speech frames using CMS    98.28%                0.60%
Proposed method                 98.59%                0.79%

Table 9. Identification rates and standard deviations of half training speeches in different methods for our database.

CHAPTER 4 CONCLUSIONS AND FUTURE WORKS

In this thesis, we proposed a speaker identification system in which a new channel effect remover is provided to obtain a higher identification rate. The channel effect remover reduces the channel effects of speeches recorded by different recording devices or in a noisy environment. In our system, the MFCC feature vectors of each input speech are first extracted. Secondly, these feature vectors are fed into the proposed channel effect remover to obtain new feature vectors. Finally, in the training part, the new feature vectors are used to build the GMM of each speaker, and in the testing part, they are fed into each GMM to determine the speaker. The experimental results show that the proposed method provides a higher identification rate.

In our channel effect remover, the threshold used to classify frames as silent or real speech is adapted to each database: a constant percentile of the frame energies is used as the threshold for all speeches in the same database. In the future, we want to develop a method that adapts the threshold to each speech. With an automatically adapted threshold, the real speech frames and silent frames can be classified more precisely, so that the identification rate can be improved.

REFERENCES

[1] J. P. Campbell, Jr., “Speaker Recognition: A Tutorial,” Proceedings of the IEEE, Vol. 85, pp. 1437-1462, Sep. 1997.

[2] R. Vergin, D. O’Shaughnessy, and V. Gupta, “Compensated Mel Frequency Cepstrum Coefficients,” 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996 (ICASSP ’96), Vol. 1, pp. 323-326, Atlanta, GA, USA, 07-10 May 1996.

[3] S. Molau, M. Pitz, R. Schlüter, and H. Ney, “Computing Mel-Frequency Cepstral Coefficients on the Power Spectrum,” 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP ’01), Vol. 1, pp. 73-76, Salt Lake City, UT, USA, 07-11 May 2001.

[4] J. C. Wang, J. F. Wang, and Y. S. Weng, “Chip Design of MFCC Extraction for Speech Recognition,” Integration, the VLSI Journal, Vol. 32, pp. 111-131, Nov. 2002.

[5] D. A. Reynolds and R. C. Rose, “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models,” IEEE Transactions on Speech and Audio Processing, Vol. 3, pp. 72-83, Jan. 1995.

[6] D. A. Reynolds, “A Gaussian Mixture Modeling Approach to Text-Independent Speaker Identification,” Georgia Institute of Technology, Ph. D., Aug. 1992.

[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, Vol. 10, pp. 19-41, Jan. 2000.

[8] D. A. Reynolds, “The Effects of Handset Variability on Speaker Recognition Performance: Experiments on the Switchboard Corpus,” 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996. Conference Proceedings (ICASSP ’96), Vol. 1, pp. 113-116, Atlanta, GA, USA, 07-10 May 1996.

[9] B. S. Atal, “Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification,” Journal of the Acoustical Society of America, Vol. 55, pp. 1304-1312, June 1974.

[10] S. Furui, “Cepstral Analysis Technique for Automatic Speaker Verification,” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 29, pp. 254-272, Apr. 1981.

[11] B. A. Hanson, T. H. Applebaum, and J. C. Junqua, “Spectral Dynamics for Speech Recognition Under Adverse Conditions,” in Automatic Speech and Speaker Recognition: Advanced Topics, C. H. Lee, F. K. Soong, and K. K. Paliwal, Eds. Boston, MA: Kluwer, 1996.

[12] Y. Chen and Q. Y. Hong, “Voiceprint Verification Based on Two-Level Decision HMM-UBM,” 2009 1st International Conference on Information Science and Engineering (ICISE), pp. 3556-3559, Nanjing, China, 26-28 Dec. 2009.

[13] P. K. Ajmera, D. V. Jadhav, and R. S. Holambe, “Text-Independent Speaker Identification Using Radon and Discrete Cosine Transforms Based Features from Speech Spectrogram,” Pattern Recognition, Vol. 44, pp. 2749-2759, Oct.-Nov. 2011.

[14] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering Algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, No. 1, pp. 100-108, 1979.

[15] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), Vol. 39, No. 1, pp. 1-38, 1977.
