Chapter 2 Framework of Independent Component Analysis and General
2.3 Generalized Probabilistic Descent Method
2.3.3 Summary of the Advantages of the GPD Formalization
The most important point of the GPD concept is to embed the entire process of a given recognition task into a smooth function. Therefore, we can optimize all of the adjustable system parameters consistently with the design objective of minimizing recognition errors.
In addition, GPD has both mathematical rigor and a great degree of practicality.
GPD was shown to provide attractive solutions to three of the four major DFA issues:
1) The design objective;
2) Optimization method;
3) Design consistency with unknown samples.
The fourth DFA issue, which is the selection of the discriminant function form, has not been fully studied yet.
Because of the above advantages, we choose GPD to modify the GMM for speaker recognition.
Chapter 3
Speaker Recognition System Based on ICA and GPD Optimizer
3.1 Overall Speaker Recognition System
The framework of our speaker recognition system is shown in Fig. 3-1 and Fig. 3-2.
In the training phase, MFCC features are extracted from the original speech signal of speaker s, and the FastICA algorithm is then used to find the independent components of the MFCCs. Using the basis found in this step, we transform the MFCCs into ICA-based features, denoted ICAfts. Next, the ICAfts are used as the input to train the GMM, and the GPD method is used to optimize the GMM recognizer. These steps yield a speaker recognition model for each speaker s.
In the test phase, we also extract MFCCs from the speech signal and transform them with the ICA basis obtained in the training phase.
Then, we use the new features to evaluate the degree of matching (score) against the GMM of each speaker. If the largest score, obtained from the model of some speaker k, is smaller than a threshold set in advance, we reject the speaker as an imposter. Otherwise, we regard the speaker as a customer.
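As a minimal sketch of the test-phase decision described above, assuming that each enrolled speaker model returns one matching score and that a higher score means a better match, the accept/reject rule could be written as follows. The function name `verify_speaker` and the example scores and threshold are hypothetical and are not taken from the thesis.

```python
import numpy as np

def verify_speaker(scores, threshold):
    """Return the index of the best-matching speaker model, or None to reject.

    scores: one matching score per enrolled speaker model (higher = better).
    threshold: rejection threshold chosen in advance (hypothetical value).
    """
    scores = np.asarray(scores, dtype=float)
    k = int(np.argmax(scores))      # model with the largest score
    if scores[k] < threshold:       # even the best match is too weak
        return None                 # treat the speaker as an imposter
    return k                        # accept the speaker as customer k

print(verify_speaker([2.1, 7.4, 3.3], threshold=5.0))  # 1
print(verify_speaker([2.1, 4.4, 3.3], threshold=5.0))  # None
```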
Fig. 3-1 Training phase of our speaker recognition system for each speaker s.
Fig. 3-2 Test phase of our speaker recognition system
3.2 Each Block of Speaker Recognition System
In this section, we decompose the entire speaker recognition system into blocks and describe each block in detail.
3.2.1 Feature Extraction
MFCC is widely used in automatic speech recognition (ASR) applications, primarily for three reasons [28]: 1) the cepstral features are roughly orthogonal because of the DCT; 2) cepstral mean subtraction eliminates static channel noise; and 3) MFCC is less sensitive to additive noise than linear prediction cepstral coefficients (LPCC). The key component of MFCC responsible for noise robustness is the filter bank; the filters smooth the spectrum, reducing variation due to additive noise across the bandwidth of each filter.
First, the speech signal is pre-processed by a high-pass filter. Next, a segment (frame) of speech is windowed and transformed to the frequency domain via the fast Fourier transform (FFT), and the magnitude spectrum of the frame is passed through a bank of triangular filters whose center frequencies are spaced along the perceptually motivated Mel frequency scale. Finally, the energy output from each filter is log-compressed and transformed to the cepstral domain via the discrete cosine transform (DCT). The block diagram of the feature extraction is shown in Fig. 3-3.
Fig. 3-3 Block diagram of Feature Extraction
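The feature extraction steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the pre-emphasis coefficient 0.97, the Hamming window, the Mel-scale formula, and the default of 12 cepstral coefficients are common choices that are not specified in this section, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=8000, frame_len=256, hop=128,
         n_filters=16, n_ceps=12, preemph=0.97):
    # 1) High-pass pre-emphasis filter (coefficient is an assumption).
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2) Split into overlapping frames and apply a Hamming window (assumption).
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3) Magnitude spectrum of each frame via the FFT.
    mag = np.abs(np.fft.rfft(frames, n=frame_len, axis=1))
    # 4) Mel filter-bank outputs, log-compressed (small offset avoids log(0)).
    energies = mag @ mel_filterbank(n_filters, frame_len, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 5) DCT-II of the log energies gives the cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return log_energies @ basis.T

rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(8000))   # one second of noise at 8 kHz
print(features.shape)  # (61, 12)
```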
3.2.2 ICA Algorithm
ICA can find a linear non-orthogonal coordinate system in multivariate data determined by high-order statistics. Its goal is to linearly transform the data such that the transformed variables are as statistically independent from each other as possible [29], [30]. Like data mining, ICA can extract the hidden predictive information from large databases and it is a powerful novel technology with great potential for finding the most important information in the data.
ICA not only decorrelates the signals but also reduces higher-order statistical dependencies. We use it to find the most important and independent components of the MFCCs.
The block diagram of the ICA algorithm is shown in Fig. 2-2.
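To make the transformation concrete, the following is a minimal sketch of a symmetric FastICA iteration on whitened data. It assumes the tanh contrast function and no dimension reduction, neither of which is stated in this section, and it is not the implementation used in the thesis; all names are hypothetical.

```python
import numpy as np

def whiten(X):
    """Center X (n_samples, n_features) and scale it to identity covariance."""
    mean = X.mean(axis=0)
    Xc = X - mean
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return Xc @ whitening, mean, whitening

def sym_decorrelate(W):
    """W <- (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return u @ np.diag(1.0 / np.sqrt(s)) @ u.T @ W

def fast_ica(X, n_iter=200, tol=1e-6, seed=0):
    Z, mean, whitening = whiten(X)
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        Y = Z @ W.T                      # current component estimates
        G = np.tanh(Y)                   # contrast nonlinearity g (assumption)
        G_prime = 1.0 - G ** 2           # its derivative g'
        # Fixed-point update: w <- E{z g(w^T z)} - E{g'(w^T z)} w, for every row w
        W_new = (G.T @ Z) / n - np.diag(G_prime.mean(axis=0)) @ W
        W_new = sym_decorrelate(W_new)
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ whitening, mean           # unmixing basis and training mean

# Toy usage: two independent sources mixed linearly (made-up data).
rng = np.random.default_rng(1)
sources = np.column_stack([rng.uniform(-1, 1, 2000), rng.laplace(size=2000)])
mixed = sources @ np.array([[1.0, 0.5], [0.3, 1.0]]).T
basis, mean = fast_ica(mixed)
ica_features = (mixed - mean) @ basis.T  # same transform is reused at test time
```

The returned basis and mean would be stored after training so that test features can be projected with the same transform.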
3.2.3 GPD-Based GMM
The most important concept of the GPD method is to formalize the overall procedure of the task into an optimized design process. Its objective is to directly minimize the recognition error rate.
One advantage of using GPD as the optimizer of the speaker recognition model is that the structure of the conventional speaker recognizer can be kept intact without modification. This demonstrates the practical value of the GPD method when it is incorporated into existing recognizer designs.
In addition, to reduce computation, we rewrite the equations of subsection 2.3.2. We assume that the covariance matrix of each Gaussian is diagonal and that all of its diagonal elements are equal. That is, a single variance V can be used in place of the covariance matrix $\Sigma_i$. Then, eq. (2.17) is
The definitions of the discriminant function $g_s(x_j; \lambda_s)$, the classification decision rule, the class misclassification measure $d_k(x_j; \lambda_k)$, and the loss function $l_k(x_j; \lambda_k)$ are the same as in eqs. (2.16)-(2.21).
1) Discriminant Function:
And then, the adjustment rule using the loss function for the GMM parameter { , , }G
( )(1 ( ))
The block diagram of the GPD-based GMM model for the training phase is shown in Fig. 3-4.
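The simplification used in this subsection, one variance per Gaussian instead of a full covariance matrix, can be illustrated with a short sketch of the GMM log-likelihood. This is a generic form written under that assumption and is not a reproduction of eq. (2.17); the function and argument names are hypothetical.

```python
import numpy as np

def spherical_gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood of a GMM whose i-th Gaussian has covariance V_i * I.

    X: (n_frames, dim), weights: (n_mix,), means: (n_mix, dim), variances: (n_mix,)
    """
    X = np.atleast_2d(X)
    dim = X.shape[1]
    # Squared Euclidean distance from every frame to every mean: (n_frames, n_mix)
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_comp = (np.log(weights)
                - 0.5 * dim * np.log(2.0 * np.pi * variances)
                - 0.5 * sq_dist / variances)
    # Log-sum-exp over the mixture components for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

# Toy usage with made-up parameters: 2 mixtures in 3 dimensions.
frames = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
ll = spherical_gmm_loglik(frames,
                          weights=np.array([0.4, 0.6]),
                          means=np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]),
                          variances=np.array([1.0, 0.5]))
print(ll.shape)  # (2,)
```

With a single variance per Gaussian, the quadratic form reduces to a scaled squared distance, so no matrix inversion or determinant is needed.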
In the test phase, we use the misclassification measure to decide whether the speaker is an imposter. A larger $d_k(x_j; \lambda_k)$ indicates a higher degree of misclassification, whereas a smaller $d_k(x_j; \lambda_k)$ indicates that the speaker is classified more correctly. The choice of the threshold is therefore important. If the threshold is large, the rejection rate for imposters becomes low; if the threshold is small, the identification rate of customers is reduced. We must find a balance between the rejection rate and the identification rate.
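The trade-off can be seen in a small sketch, assuming that a trial is rejected whenever its misclassification measure exceeds the threshold. The measure values below are made up for illustration and the function name is hypothetical.

```python
import numpy as np

def error_rates(d_customers, d_imposters, threshold):
    """Return (false acceptance rate, false rejection rate) for one threshold.

    A trial is rejected when its misclassification measure d exceeds the threshold.
    """
    d_customers = np.asarray(d_customers, dtype=float)
    d_imposters = np.asarray(d_imposters, dtype=float)
    false_reject = float(np.mean(d_customers > threshold))    # customers wrongly rejected
    false_accept = float(np.mean(d_imposters <= threshold))   # imposters wrongly accepted
    return false_accept, false_reject

d_cust = [-3.0, -2.5, -1.0, 0.5]   # made-up measures for customers
d_imp = [-0.5, 1.0, 2.0, 3.5]      # made-up measures for imposters
for t in (-2.0, 0.0, 2.0):
    print(t, error_rates(d_cust, d_imp, t))
# -2.0 (0.0, 0.5)
# 0.0 (0.25, 0.25)
# 2.0 (0.75, 0.0)
```

In this toy data, a small threshold rejects half of the customers while a large one accepts most of the imposters, which is the balance discussed above.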
The block diagram of GPD-based GMM model for the test phase is illustrated in Fig. 3-5.
Fig. 3-4 Block diagram of GPD-based GMM model for the training phase.
Fig. 3-5 Block diagram of GPD-based GMM model for the test phase.
Chapter 4
Experiment Results and Discussion
4.1 Introduction
In the previous chapter, we described the structure of the proposed speaker recognition system. To investigate and show the contribution and efficiency of the methods we applied, four sets of experiments were carried out. In the first set, we evaluated the ML-based GMM with MFCC features. In the second set, we evaluated the ML-based GMM with ICA features; these two sets are intended to show the advantage of the ICA features for the speaker recognition task. In the third set, we adopted MFCC as the features and the proposed GPD-based GMM as the classifier, which shows the improvement due to the classifier optimization. Finally, in the fourth set, we combined the ICA features and the GPD-based GMM into the overall speaker recognition system. The experimental results show the contribution of this model.
For these experiments, several processing steps take place in the front-end speech analysis. First, the speech signal was divided into frames of 256 samples with an overlap of 128 samples (the sampling rate is 8 kHz). For each frame, the FFT was computed, providing 256 squared-magnitude values that represent the short-term power spectrum in the 0-4 kHz band. This Fourier power spectrum was then used to compute 16 mel-spaced filter bank coefficients. Finally, we computed the power accumulated in each filter bank and applied the discrete cosine transform (DCT) to obtain the cepstral coefficients, called MFCCs, of order 30.
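As a quick check of the framing parameters stated above, the frame duration, frame shift, and number of frames can be computed as follows. The helper `num_frames` is hypothetical, and discarding an incomplete final frame is an assumption.

```python
SAMPLE_RATE = 8000   # Hz, as stated above
FRAME_LEN = 256      # samples per frame
OVERLAP = 128        # samples shared by consecutive frames

hop = FRAME_LEN - OVERLAP                       # frame shift in samples
frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE     # frame duration in milliseconds
hop_ms = 1000.0 * hop / SAMPLE_RATE             # frame shift in milliseconds

def num_frames(n_samples):
    """Number of complete frames in a signal (incomplete tail is dropped)."""
    if n_samples < FRAME_LEN:
        return 0
    return 1 + (n_samples - FRAME_LEN) // hop

print(frame_ms, hop_ms)   # 32.0 16.0
print(num_frames(8000))   # 61
```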
4.2 Experiment Database
The database for the experiments is the TIMIT acoustic-phonetic speech corpus. This corpus is widely used throughout the world and provides a standard that permits direct comparison of experimental results obtained by different methodologies. In this thesis, we used only a subset of the DR2 portion of the TIMIT database. This set contains 76 speakers of the same (North America) dialect. There are 52 males and 23 females in this set. The corpus consists of 10 sentences recorded from each speaker. We randomly chose 8 sentences to train the speaker models and used the other 2 sentences for testing. For speaker recognition, we used 5, 10, and 20 speakers from the DR2 speaker corpus as the customers in separate experiments and used the remaining speakers as the imposters to evaluate the rejection capability.
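The data partition described above can be sketched as follows. The random seed, the sentence and speaker identifiers, and the choice of taking the first speakers as customers are all assumptions made for illustration; the thesis does not state how the customers were selected.

```python
import random

def split_sentences(sentence_ids, n_train=8, seed=0):
    """Randomly pick n_train sentences for training; the rest are for testing."""
    rng = random.Random(seed)
    shuffled = list(sentence_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])

def split_speakers(speaker_ids, n_customers):
    """Take the first n_customers speakers as customers (assumption), rest as imposters."""
    speaker_ids = list(speaker_ids)
    return speaker_ids[:n_customers], speaker_ids[n_customers:]

train_ids, test_ids = split_sentences(range(10))
print(len(train_ids), len(test_ids))          # 8 2

for n_customers in (5, 10, 20):
    customers, imposters = split_speakers(range(76), n_customers)
    print(len(customers), len(imposters))
# 5 71
# 10 66
# 20 56
```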
4.3 Experiment Result
In the following, four sets of experiments are carried out to evaluate our recognition system.
We assigned one class to each set of features and then determined the identity of the speaker by voting over these classifications. The recognition rate was calculated from the number of correct classifications.
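One plausible reading of this voting step, assuming that each feature vector is first assigned to a speaker class and that the most frequent class wins, is sketched below. The labels are made up and the function name is hypothetical.

```python
from collections import Counter

def vote(feature_labels):
    """Return the class that received the most per-feature classifications."""
    counts = Counter(feature_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Made-up per-feature decisions for one test utterance.
print(vote(["s3", "s1", "s3", "s2", "s3"]))  # s3
```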
Experiment I
Fig. 4-1 Sketch of the feature and the GMM model of Experiment I
Table 4-1 Recognition Results of Experiment I for 5 customers (71 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                    5  69  71  71  72  73  74  74  74  75  75  75
Error True No.              71   7   5   5   4   3   2   2   2   1   1   1
Error False No.              0   0   0   0   0   0   0   0   0   0   0   0
Table 4-2 Recognition Results of Experiment I for 10 customers (66 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   10  66  68  70  69  69  71  72  74  74  74  74
Error True No.              66  10   8   6   6   6   4   3   0   0   0   0
Error False No.              0   0   0   0   1   1   1   1   2   2   2   2
Table 4-3 Recognition Results of Experiment I for 20 customers (56 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   20  66  67  68  68  70  70  72  73  71  69  67
Error True No.              56   9   8   7   7   5   5   3   2   3   4   5
Error False No.              0   1   1   1   1   1   1   1   1   2   3   4
In these tables, “Right No.” is the number of right classifications among the 76 speakers; “Error True No.” is the number of false classifications of imposters; and “Error False No.” is the number of false rejections of customers.
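To relate these counts to rates, the sketch below converts one column of Table 4-1 into a recognition rate, an error true rate, and an error false rate. Normalizing the error true count by the number of imposters and the error false count by the number of customers is an assumption; the thesis does not state the denominators it uses in its figures.

```python
TOTAL_SPEAKERS = 76

def rates(right, error_true, error_false, n_customers, total=TOTAL_SPEAKERS):
    """Convert the three table counts into rates (denominators are assumptions)."""
    n_imposters = total - n_customers
    return {
        "recognition_rate": right / total,
        "error_true_rate": error_true / n_imposters,    # imposters not rejected
        "error_false_rate": error_false / n_customers,  # customers rejected
    }

# Table 4-1, rejection threshold 5 (x10^3): Right 69, Error True 7, Error False 0.
r = rates(69, 7, 0, n_customers=5)
print(round(r["recognition_rate"], 4))   # 0.9079
print(round(r["error_true_rate"], 4))    # 0.0986
print(r["error_false_rate"])             # 0.0
```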
Experiment II
Fig. 4-2 Sketch of the feature and the GMM model of experiment II
Table 4-4 Recognition Results of Experiment II for 5 customers (71 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                    5  70  71  72  72  74  74  74  74  75  76  76
Error True No.              71   6   5   4   4   2   2   2   2   1   0   0
Error False No.              0   0   0   0   0   0   0   0   0   0   0   0
Table 4-5 Recognition Results of Experiment II for 10 customers (66 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   10  66  68  70  71  71  72  72  73  73  74  74
Error True No.              66  10   8   6   5   4   3   3   2   2   1   1
Error False No.              0   0   0   0   0   1   1   1   1   1   1   1
Table 4-6 Recognition Results of Experiment II for 20 customers (56 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   20  66  67  68  70  71  71  73  73  73  73  73
Error True No.              56   9   8   7   5   4   4   2   2   2   2   2
Error False No.              0   1   1   1   1   1   1   1   1   1   1   1
Experiment III
Fig. 4-3 Sketch of the feature and the GMM model of experiment III
Table 4-7 Recognition Results of Experiment III for 5 customers (71 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                    5  72  73  74  74  75  75  75  76  76  76  76
Error True No.              71   4   3   2   2   1   1   1   0   0   0   0
Error False No.              0   0   0   0   0   0   0   0   0   0   0   0
Table 4-8 Recognition Results of Experiment III for 10 customers (66 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   10  69  70  70  73  73  75  75  75  75  75  74
Error True No.              66   7   5   5   2   2   0   0   0   0   0   0
Error False No.              0   0   1   1   1   1   1   1   1   1   1   2
Table 4-9 Recognition Results of Experiment III for 20 customers (56 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   20  69  70  70  70  71  71  73  73  72  73  73
Error True No.              56   6   5   5   5   4   4   2   2   2   1   1
Error False No.              0   1   1   1   1   1   1   1   1   2   2   2
Experiment IV
Fig. 4-4 Sketch of the feature and the GMM model of experiment IV
Table 4-10 Recognition Results of Experiment IV for 5 customers (71 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                    5  72  74  75  75  75  75  76  76  76  76  76
Error True No.              71   4   2   1   1   1   1   0   0   0   0   0
Error False No.              0   0   0   0   0   0   0   0   0   0   0   0
Table 4-11 Recognition Results of Experiment IV for 10 customers (66 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   10  71  73  73  73  73  75  75  75  75  75  75
Error True No.              66   5   2   2   2   2   0   0   0   0   0   0
Error False No.              0   0   1   1   1   1   1   1   1   1   1   1
Table 4-12 Recognition Results of Experiment IV for 20 customers (56 imposters)
Rejection threshold (×10³)   0   5   6   7   8   9  10  11  12  13  14  15
Right No.                   20  70  73  73  73  73  73  73  73  73  73  73
Error True No.              56   5   2   2   2   2   2   2   2   2   2   2
Error False No.              0   1   1   1   1   1   1   1   1   1   1   1
We can see that if the rejection threshold is set to 0, no one (neither customers nor imposters) is rejected; that is, the error false number is zero, the error true number equals the number of imposters, and the number of right classifications equals the number of customers. Therefore, the recognition rate is at its worst.
In addition, as the rejection threshold increases, more imposters are rejected, and hence the recognition rate rises. This means that the scores (probabilities) of the customers are greater than those of the imposters. However, a customer may be rejected if the threshold is too large.
Comparison
5 customers (71 imposters)
Fig. 4-5 The recognition rates of the four experiments
Fig. 4-6 The error true rates of the four experiments
Fig. 4-7 The error false rates of the four experiments
10 customers (66 imposters)
Fig. 4-8 The recognition rates of the four experiments
Fig. 4-9 The error true rates of the four experiments
Fig. 4-10 The error false rates of the four experiments
20 customers (56 imposters)
Fig. 4-11 The recognition rates of the four experiments
Fig. 4-12 The error true rates of the four experiments
Fig. 4-13 The error false rates of the four experiments
From the above figures, Experiment IV has the best performance in terms of the recognition rate, the error true rate, and the error false rate, whereas Experiment I has the worst recognition rate.
In addition, the recognition rate is higher when the customers are fewer.
4.4 Discussion
By using ICA to transform the MFCCs onto an independent basis, we obtained better features for the GMM recognizer. The experiments also showed that the GPD-based GMM performed better than the ML-based GMM. We therefore combined the two algorithms in our speaker recognition system, which achieved the best recognition rate of the four systems. These results indicate that the proposed recognition system improves on the conventional speaker recognition system.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
In this thesis, we develop a text-independent speaker recognition system. The system is built on two main components. One is ICA, which is used to find an independent basis that transforms the MFCCs into more important features and reduces their dimension. The other is the GPD optimizer, which is applied to modify the GMM recognizer. We show that the formulation of the GPD algorithm can be blended into the GMM recognizer design.
A series of experiments is conducted to examine the effectiveness of ICA and the GPD algorithm. Because the ICA-based features contain the most important components of the MFCCs, they perform better than the MFCCs themselves. In addition, the features transformed by the ICA basis have fewer dimensions, which saves computation. This is shown in Experiments I and II.
A GPD algorithm is analyzed and applied to a conventional GMM-based speaker recognizer. We show that the formulation of the GPD algorithm is compatible with GMM, and we also present an implementation of the GPD method in a GMM-based speaker recognizer.
Experiments I-IV show the performance of the GPD-based GMM. Compared with the conventional system (Experiment I), our proposed system (Experiment IV) achieves an improvement of approximately 5%.
5.2 Future Work
By using ICA, we can find the hidden predictive information in the speech signals and reduce the dimension of the data. However, how many dimensions should be selected to obtain the best performance remains an interesting problem. If this were known, we could raise the recognition rate without wasting computation.
In other words, if we knew what each independent component represents, such as formants or pitch, we could use the components directly instead of choosing them empirically.
For GPD, the most important point is the discovery of a desirable form of the discriminant function. Solving this problem will advance speaker recognition technology, but it is clearly difficult and requires significant research effort. Another important point is to find a reasonable method of controlling the smoothness of the functions, for example the smooth classification error count loss.
In addition, GPD-based training suffers from a scaling problem: extensive computation is involved in evaluating the interclass competition over the tremendous number of possible classes in a large-scale task, such as speaker identification over a large population. This problem also occurs in computing the misclassification measure. It makes the optimization used in GPD slower than conventional methods, e.g., the expectation-maximization (EM) method. The L∞ norm may then be needed to reduce the adjustment computation in the training phase.
Besides, the success of the GPD method depends on a good selection of some parameters chosen by the designer, such as ε and µ. This selection is usually performed experimentally because of a lack of theory; a more theoretically grounded selection method is needed.
Finally, we can apply this speaker recognition system to speech recognition, since both are recognition tasks. Of course, this requires some modification of the system. For example, we should use an HMM in place of the GMM for continuous speech signals.
Bibliography
[1] D. O’Shaughnessy, “Speaker Recognition,” IEEE ASSP Mag., pp. 4-17, Oct. 1986.
[2] Q. Li, B. H. Juang, C. H. Lee, Q. Zhou, and F. K. Soong, “Recent Advancements in Automatic Speaker Authentication,” IEEE Robotics & Automation Mag., pp. 24-34, Mar. 1999.
[3] D. A. Reynolds, R. C. Rose, and M. J. T. Smith, “PC-based TMS320C30 Implementation of the Gaussian Mixture Model Text-Independent Speaker Recognition System,” in Proc. Int. Conf. Signal Processing Applications Technol., pp. 967-973, Nov. 1992.
[4] J. Fortuna, D. Schuurman, and D. Capson, “A Comparison of PCA and ICA for Object Recognition under Varying Illumination,” Proc. of 2002 Int. Conf. on Pattern Recognition, vol. 3, pp. 11-15, Aug. 2002.
[5] J. H. Lee, H. Y. Jung, T. W. Lee, and S. Y. Lee, “Speech Feature Extraction Using Independent Component Analysis,” in Proc. ICASSP, Istanbul, Turkey, vol. 3, pp. 1631-1634, Jun. 2000.
[6] H. Y. Jung, M. Park, H. R. Kim, and M. Hahn, “Speaker Adaptation Using ICA-Based Feature Transformation,” ETRI Journal, vol. 24, no. 6, pp. 469-472, Dec. 2002.
[7] V. Moonasar and G. K. Venayagamoorthy, “Speaker Identification Using a Combination of Different Parameters as Feature Inputs to an Artificial Neural Network Classifier,” Proc. of 1999 Int. Conf. on Africon, vol. 1, pp. 189-194, 1999.
[8] T. Kinnunen and I. Karkkainen, “Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification,” SSPR & SPR 2002, LNCS 2396, pp. 681-688, 2002.
[9] R. Soganci, F. Gurgen, and H. Topcuoglu, “Parallel Implementation of a VQ-Based Text-Independent Speaker Identification,” ADVIS 2004, LNCS 3261, pp. 291-300, 2004.
[10] C. Miyajima, Y. Hattori, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Speaker Identification Using Gaussian Mixture Models Based on Multi-Space Probability Distribution,” in Proc. Int. Conf. Acoustics, Speech, Signal Processing (ICASSP), vol. 1, pp. 433-436, May 2001.
[11] D. A. Reynolds and R. C. Rose, “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
[12] O. W. Kwon and C. K. Un, “Discriminative Weighting of HMM State-Likelihoods Using the GPD Method,” IEEE Signal Processing Letters, vol. 3, no. 9, pp. 257-259, Sep. 1996.
[13] M. Inman, D. Danforth, S. Hangai, and K. Sato, “Speaker Identification Using Hidden Markov Models,” Proc. of 1998 Int. Conf. on Signal Processing, vol. 1, pp. 609-612, Oct. 1998.
[14] J. E. Higgins and R. I. Damper, “An HMM-Based Subband Processing Approach to Speaker Identification,” AVBPA 2001, LNCS 2091, pp. 169-174, 2001.
[15] D. A. Reynolds, “Speaker Identification and Verification Using Gaussian Mixture Speaker Models,” Speech Communication, vol. 17, pp. 91-108, 1995.
[16] D. A. Reynolds, “Large Population Speaker Identification Using Clean and Telephone Speech,” IEEE Signal Processing Letters, vol. 2, no. 3, pp. 46-48, Mar.
[17] P. C. Chang and B. H. Juang, “Discriminative Training of Dynamic Programming Based Speech Recognizers,” IEEE Trans. on Speech and Audio Processing, vol. 1, no. 2, pp. 135-143, Apr. 1993.
[18] T. M. Cover and J. A. Thomas, “Elements of Information Theory,” Wiley.
[19] A. Hyvärinen, “New Approximations of Differential Entropy for Independent Component Analysis and Projection Pursuit,” in Advances in Neural Information Processing Systems, vol. 10, pp. 273-279, 1998.
[20] S. Katagiri, C. H. Lee, and B. H. Juang, “A Generalized Probabilistic Descent Method,” in Proc. ASJ Autumn Conf., pp. 141-142, 1990.
[21] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Roy. Statist. Soc., vol. 39, no. 1, pp. 1-38, 1977.
[22] B. H. Juang and L. Rabiner, “The Segmental K-means Algorithm for Estimating Parameters of Hidden Markov Models,” IEEE Trans. Audio Speech Signal Processing, vol. 38, pp. 1639-1641, Sep. 1990.
[23] S. Katagiri, B. H. Juang, and C. H. Lee, “Pattern Recognition Using a Family of Design Algorithms Based upon the Generalized Probabilistic Descent Method,” Proc. of the IEEE, vol. 86, no. 11, pp. 2345-2373, Nov. 1998.
[24] S. Katagiri, C. H. Lee, and B. H. Juang, “New Discriminative Training Algorithms Based on the Generalized Probabilistic Descent Method,” in Proc. IEEE Workshop Neural Networks for Signal Processing, pp. 299-308, 1991.
[25] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains,” Ann. Math. Stat., vol. 41, pp. 164-171, 1970.
[26] B. H. Juang and S. Katagiri, “Discriminative Learning for Minimum Error Classification,” IEEE Trans. Signal Processing, vol. 40, pp. 3043-3054, Dec. 1992.
[27] W. Chou and B. H. Juang, “Adaptive Discriminative Learning in Pattern Recognition,” Tech Rep., AT&T Bell Labs, Murray Hill, NJ.
[28] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. Acoustics, Speech, Signal Processing, vol. 28, pp. 357-366, 1980.
[29] J. F. Cardoso and B. Laheld, “Equivariant Adaptive Source Separation,” IEEE Trans. Signal Processing, vol. 45, no. 2, pp. 434-444, 1996.
[30] T. W. Lee, M. Girolami, A. J. Bell, and T. J. Sejnowski, “A Unifying Framework for Independent Component Analysis,” Computers and Math. with