Chapter 4 Evaluation
4.3 Discussions
From the experimental results in the previous section, it is clear that ACCs are more robust than MFCCs. In addition, the STMF process further enhances the performance of ACCs. Possible reasons for these observations are discussed below.
The ACCs are derived from the auditory spectrum, which represents speech energy along a log-frequency axis with a resolution of 24 cochlear filters per octave. This constant resolution is fine enough to characterize the 1/3 to 1/6 octave critical bandwidth measured in human hearing. In addition, the lateral inhibitory network in the first cochlear module sharpens the cochlear filters, giving them narrower bandwidths. The MFCCs, on the other hand, use the FFT to transform a time-domain signal into the frequency domain. This conventional approach is subject to a trade-off between time and frequency resolution. The contrast between a constant log-frequency resolution and the FFT's time-frequency resolution trade-off might be the main reason why hearing-based features usually perform better than conventional FFT-based features.
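The difference in frequency resolution can be made concrete with a small calculation. The following is a minimal sketch, not the front end used in this thesis: the 16 kHz sampling rate, the 512-point FFT, and the 125 Hz lowest channel are assumed values chosen only for illustration. It counts how many analysis channels fall inside one octave for a log-frequency axis with 24 channels per octave and for a uniformly spaced FFT.

```python
import numpy as np

FILTERS_PER_OCTAVE = 24  # resolution of the auditory spectrum, from the text


def log_axis_center_freqs(f_low_hz, n_octaves, per_octave=FILTERS_PER_OCTAVE):
    """Center frequencies spaced uniformly on a log2 frequency axis."""
    k = np.arange(n_octaves * per_octave + 1)
    return f_low_hz * 2.0 ** (k / per_octave)


def fft_bins_in_octave(f_hz, sample_rate_hz, n_fft):
    """Number of FFT bin widths that fit in the octave [f_hz, 2 * f_hz)."""
    bin_width_hz = sample_rate_hz / n_fft
    return f_hz / bin_width_hz


# Assumed analysis settings (hypothetical, for illustration only).
SAMPLE_RATE_HZ = 16000
N_FFT = 512
F_LOW_HZ = 125.0

centers = log_axis_center_freqs(F_LOW_HZ, n_octaves=5)
channel_ratio = centers[1:] / centers[:-1]  # constant: 2 ** (1 / 24)

for f in (125.0, 500.0, 2000.0):
    print(f"octave starting at {f:6.0f} Hz: "
          f"{FILTERS_PER_OCTAVE} log-axis channels, "
          f"{fft_bins_in_octave(f, SAMPLE_RATE_HZ, N_FFT):.0f} FFT bins")
```

Under these assumed settings the FFT places only 4 bins in the 125 to 250 Hz octave but 64 bins in the 2 to 4 kHz octave, whereas the log-frequency axis always provides 24 channels per octave.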
The STMF process comprises several significant concepts. First, it works on joint spectro-temporal modulations rather than on spectral or temporal modulations separately. Thus, we can extract more high-level features, such as the speaking rate (from temporal modulations) and FM sweep directions (from joint spectro-temporal modulations).
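The following toy example illustrates why the joint analysis matters for sweep direction. It is a minimal sketch that uses a plain 2-D FFT on synthetic ripples as a stand-in for the modulation analysis; the array sizes and ripple parameters are assumptions, and the helper names are hypothetical. Two ripples that sweep in opposite directions have identical temporal and identical spectral modulation magnitudes when these are computed separately, but they occupy different quadrants of the joint modulation plane.

```python
import numpy as np


def ripple(n_frames, n_channels, rate, scale, direction):
    """Synthetic spectrogram: a drifting sinusoidal ripple.

    rate is in cycles per analysis window, scale in cycles per frequency
    span; direction (+1 or -1) selects the drift direction.
    """
    t = np.arange(n_frames)[:, None] / n_frames        # normalized time
    x = np.arange(n_channels)[None, :] / n_channels    # normalized log-frequency
    return np.cos(2 * np.pi * (rate * t + direction * scale * x))


def joint_quadrant(spec):
    """Sign of (rate * scale) at the strongest joint modulation component."""
    mag = np.abs(np.fft.fft2(spec))
    k, l = np.unravel_index(np.argmax(mag), mag.shape)
    rate = np.fft.fftfreq(spec.shape[0], d=1.0 / spec.shape[0])[k]
    scale = np.fft.fftfreq(spec.shape[1], d=1.0 / spec.shape[1])[l]
    return int(np.sign(rate * scale))


sweep_a = ripple(64, 48, rate=4, scale=2, direction=+1)
sweep_b = ripple(64, 48, rate=4, scale=2, direction=-1)

# Separate analyses: magnitudes along time only, and along frequency only.
same_temporal = np.allclose(np.abs(np.fft.fft(sweep_a, axis=0)),
                            np.abs(np.fft.fft(sweep_b, axis=0)))
same_spectral = np.allclose(np.abs(np.fft.fft(sweep_a, axis=1)),
                            np.abs(np.fft.fft(sweep_b, axis=1)))

print(same_temporal, same_spectral)                      # separate views agree
print(joint_quadrant(sweep_a), joint_quadrant(sweep_b))  # joint view differs
```

Which quadrant is labeled as an upward or a downward sweep is a matter of convention; the point of the sketch is only that the two directions become separable once rate and scale are analyzed jointly.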
Secondly, the 10^-2/1 mask (10^-2 for non-speech regions and 1 for speech regions) is intuitively similar to the conventional VAD approach. The major difference is that VAD masks non-speech regions frame by frame in the time domain, while the STMF process masks non-speech t-f units in the joint spectro-temporal domain.
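The contrast between the two masking granularities can be sketched as follows. This is an illustration under stated assumptions, not the STMF procedure itself: it operates on a small time-frequency energy array rather than on the modulation-domain representation, and the simple energy threshold used to decide which units count as speech is a hypothetical stand-in for the actual decision rule. Only the mask values (10^-2 and 1) are taken from the text.

```python
import numpy as np

FLOOR = 1e-2  # mask value for non-speech regions, from the text


def frame_level_mask(energy, threshold):
    """VAD-like mask: one decision per frame, shared by all channels."""
    frame_energy = energy.mean(axis=1, keepdims=True)
    decision = np.where(frame_energy > threshold, 1.0, FLOOR)
    return np.broadcast_to(decision, energy.shape).copy()


def unit_level_mask(energy, threshold):
    """Unit-wise mask: an independent decision for every t-f unit."""
    return np.where(energy > threshold, 1.0, FLOOR)


# Toy energy array: 4 frames (rows) by 3 channels (columns).
energy = np.array([[0.1, 0.1, 0.1],
                   [0.1, 5.0, 0.1],
                   [4.0, 6.0, 0.2],
                   [0.1, 0.1, 0.1]])
THRESHOLD = 1.0  # assumed value

vad_mask = frame_level_mask(energy, THRESHOLD)
tf_mask = unit_level_mask(energy, THRESHOLD)

print(vad_mask)  # frames 1 and 2 are kept entirely
print(tf_mask)   # only the three high-energy units are kept
```

In this toy case the frame-level mask passes six units, including low-energy units that happen to share a frame with speech, while the unit-level mask passes only the three units that exceed the threshold.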
As shown in Table 2 and Table 3, with the GMM-based recognizer the parameter set STMF(B) performs better than the parameter set STMF(A) at 0~10 dB, but worse in the clean and 15 dB conditions. This is not surprising, since adopting the higher threshold inevitably degrades the intelligibility of clean speech, so that recognition rates decrease in high-SNR conditions. On the other hand, STMF(A) produces better results than STMF(B) with the MAP-GMM-based recognizer, as shown in Table 4 and Table 5. The reason is that the UBM is trained on clean speech that has passed through the STMF process, and MAP-GMM then derives each speaker's model by adapting this well-trained UBM.
As mentioned above, the parameter set STMF(B) performs worse in clean conditions and therefore yields a worse UBM. MAP-GMM speaker models adapted from this UBM consequently perform worse, because the UBM captures speaker variability less well. In summary, the parameter set STMF(B) is suitable for GMM-based recognizers, while the parameter set STMF(A) is more favorable for MAP-GMM-based recognizers.
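The dependence of the speaker models on the UBM can be seen from the adaptation step itself. The sketch below follows the mean-only MAP adaptation described by Reynolds et al. [12]; it is a minimal illustration rather than the recognizer used in the experiments, and the diagonal covariances, the relevance factor of 16 (a commonly used value), and the function name are assumptions. Each adapted mean is a data-dependent interpolation between the speaker's statistics and the UBM mean, so components that see little speaker data simply inherit the UBM parameters.

```python
import numpy as np


def map_adapt_means(ubm_weights, ubm_means, ubm_vars, frames, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (UBM).

    ubm_weights: (M,), ubm_means and ubm_vars: (M, D), frames: (T, D).
    Returns the adapted means with shape (M, D).
    """
    # Log-likelihood of every frame under every Gaussian component.
    diff = frames[:, None, :] - ubm_means[None, :, :]            # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2 * np.pi * ubm_vars), axis=1))
    # Posterior probability of each component for each frame.
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # (T, M)
    # Sufficient statistics: soft counts and first-order means.
    n = post.sum(axis=0)                                         # (M,)
    first = post.T @ frames / np.maximum(n, 1e-10)[:, None]      # (M, D)
    # Data-dependent interpolation between speaker data and the UBM.
    alpha = (n / (n + relevance))[:, None]
    return alpha * first + (1.0 - alpha) * ubm_means


# Toy UBM with two one-dimensional components, and 100 speaker frames at 1.0.
weights = np.array([0.5, 0.5])
means = np.array([[0.0], [10.0]])
variances = np.ones((2, 1))
speaker_frames = np.full((100, 1), 1.0)

adapted = map_adapt_means(weights, means, variances, speaker_frames)
print(adapted)
```

In the toy case the first component moves most of the way toward the speaker data, while the second component, which receives almost no data, stays at its UBM value. A poorly trained UBM is therefore carried over directly into every adapted speaker model.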
Chapter 5
Conclusion and Future Work
In this thesis, we propose auditory spectral features (ACCs), further enhanced by the spectro-temporal modulation filtering (STMF) process, for speaker recognition in additive noise, and demonstrate that they are more robust than conventional MFCCs. We also compare our proposed features with the recently developed ANTCCs reported in [7]. For a randomly drawn 70-speaker test set from the TIMIT corpus, our STMF features are more robust than ANTCCs. For the GRID corpus, ANTCCs achieve higher recognition rates in high-SNR conditions (15 and 10 dB), while our STMF features achieve higher recognition rates in low-SNR conditions (5 and 0 dB). We also demonstrate that MAP-GMM can further improve the performance of the proposed features when the amount of training data is insufficient.
However, the downside of MAP-GMM is its computational complexity, which makes real-time implementation infeasible for now.
The STMF process produces a spectrogram smoothed jointly along the spectral and temporal dimensions, which highlights certain high-level information about the speaker. Such high-level information appears to be less variable than low-level information (for example, the static spectrum or MFCCs) in the presence of noise. In the future, we will investigate the benefit of the STMF process in the presence of convolutional noise (such as channel or handset mismatch).
REFERENCES
[1] D.A. Reynolds, "Speaker Identification and Verification using Gaussian Mixture Speaker Models," Speech Comm., vol. 17, pp. 91–108, 1995.
[2] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Comm., vol. 52, pp. 12–40, 2010.
[3] D.A. Reynolds, et al., "The SuperSID project: exploiting high-level information for high-accuracy speaker recognition," in Proc. ICASSP, pp. 784–787, 2003.
[4] R. Saeidi, J. Pohjalainen, T. Kinnunen, and P. Alku, "Temporally Weighted Linear Prediction Features for Tackling Additive Noise in Speaker Verification," IEEE Signal Processing Letters, vol. 17, pp. 599–602, 2010.
[5] J. Ming, et al., "Robust speaker recognition in noisy conditions," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1711–1723, July 2007.
[6] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 29, pp. 254–272, 1981.
[7] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, Oct. 1994.
[8] M. Rahim, Y. Bengio, and Y. Lecun, “Discriminative feature and model design for automatic speech recognition,” in Proc. Eurospeech’97, Rhodes, Greece, 1997, pp. 75–78.
[9] D. A. Reynolds, “Channel robust speaker verification via feature mapping,” in Proc. ICASSP, 2003, vol. 2, pp. II-53–II-6.
[10] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” in Proc. A Speaker Odyssey—The Speaker Recognition Workshop, Crete, Greece, 2001, pp. 213–218.
[11] B. Xiang, U. Chaudhari, J. Navratil, G. Ramaswamy, and R. Gopinath, "Short-time Gaussianization for robust speaker verification," in Proc. ICASSP'02, Orlando, FL, 2002, pp. 681–684.
[12] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, pp. 19–41, 2000.
[13] C. Barras and J. L. Gauvain, "Feature and score normalization for speaker verification of cellular data," in Proc. ICASSP'03, Hong Kong, China, 2003, pp. 49–52.
[14] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Process., vol. 10, pp. 42–54, 2000.
[15] R. Teunen, B. Shahshahani, and L. P. Heck, "A model based transformational approach to robust speaker identification," in Proc. ICSLP, 2000, vol. 2, pp. 495–498.
[16] C. H. Lee, C. H. Lin, and B. H. Juang, "A study on speaker adaptation of the parameters of continuous density hidden Markov models," ASSP-39, vol. 39, no. 4, pp. 806–814, April 1991.
[17] Yiu, K.K., Mak, M.W., Kung, S.Y., “Environment adaptation for robust speaker verification by cascading maximum likelihood linear regression and reinforced learning,” Computer Speech and Language, vol. 21, pp. 231-246, 2007.
[18] J. Ortega-Garcia and L. Gonzalez-Rodriguez, "Overview of speech enhancement techniques for automatic speaker recognition," in Proc. ICSLP'96, Philadelphia, PA, 1996, pp. 929–932.
[19] Suhadi, S. Stan, T. Fingscheidt, and C. Beaugeant, “An evaluation of VTS and IMM for speaker verification in noise,” in Proc. Eurospeech’03, Geneva, Switzerland, 2003, pp. 1669–1672.
[20] J. A. Nolazco-Flores and L. P. Garcia-Perera, “Enhancing acoustic models for robust speaker verification,” in Proc. ICASSP, pp. 4837–4840, Las Vegas, U.S.A., April 2008.
[21] S. G. Pillay, A. Ariyaeeinia, M. Pawlewski, and P. Sivakumaran, "Speaker verification under mismatched data conditions," IET Signal Processing, vol. 4, no. 3, pp. 236–246, July 2009.
[22] R.P. Lippmann, "Speech recognition by machines and humans," Speech Comm., vol. 22, pp. 1–15, 1997.
[23] W.H. Abdulla, "Robust speaker modeling using perceptually motivated feature", Pattern Recognition letters, pp.1333-1342, 2007.
[24] Y. Shao, et al., "Incorporating auditory feature uncertainties in robust speaker identification," in Proc. ICASSP, vol. IV, pp. 277-280, 2007.
[25] Q. Wu and L. Zhang, "Auditory Sparse Representation for Robust Speaker Recognition Based on Tensor Structure," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, Article ID 578612, 2008.
[26] T. Chi, P. Ru, and S.A. Shamma, "Multi-resolution spectro-temporal analysis of complex sounds," J. Acoust. Soc. Am., vol. 118, no. 2, pp. 887-906, 2005.
[27] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, pp. 19–41, Jan. 2000.
[28] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. on Communications, vol. COM-28, pp. 84–95, 1980.
[29] Y.-N. Hung, "Speech Enhancement Method based on Auditory Perceptual Model," Master's thesis, Communication Engineering, National Chiao Tung University, Hsin-Chu, Taiwan, 2008.
[30] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738–1752, 1990.
[31] M. Elhilali, T. Chi and S. A. Shamma, "A spectro-temporal modulation index (stmi) for assessment of speech intelligibility," Speech Comm., 41(2-3), pp.331–348, 2003.
[32] T. Chi, et al., "Spectro-temporal modulation transfer functions and speech intelligibility," J. Acoust. Soc. Am., vol. 106, no. 5, pp. 2719–2732, 1999.
[33] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2421–2424, 2006.
[34] A. Varga and H.J.M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Comm., vol. 12, no. 3, pp. 247–251, 1993.
[35] D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995.