The Results of Graph Regularized-based Manifold Learning

This section reports three sets of experiments. First, we compare the low-dimensional structure methods with their manifold variants on the standard MFCC features. Next, we combine the best-performing method with well-known robustness methods. Finally, we evaluate the effect of the low-dimensional structure methods and their manifold variants by means of power spectral density curves.

In the first set of experiments, we compare the low-dimensional structure methods with their manifold variants in terms of how much they improve ASR performance when using the standard MFCC features.

Table 6-1 Word error rates (%) for the detailed results of baselines (i.e., NMF) by using MFCC features

NMF

Mic.   Clean   Car     Babble   Restaurant   Street   Airport   Train
Wv1    5.72    25.28   38.91    39.62        37.68    37.03     39.06
Wv2    15.43   38.65   49.79    50.63        49.49    48.65     49.65

Table 6-2 Word error rates (%) for the overall results of baselines (including MFCC, K-SVD and NMF) by using MFCC features

Overall Performance

Method      SetA   SetB    SetC    SetD    Avg.
MFCC        3.75   49.93   22.55   60.32   34.14
OMP+K-SVD   5.90   35.09   20.38   46.45   26.96
NMF         5.72   36.26   15.43   47.81   26.31
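The set-level figures in Table 6-2 can be reproduced from the per-condition figures in Table 6-1. The sketch below assumes the usual Aurora-4 grouping (Set A: clean, primary microphone; Set B: six noise types, primary microphone; Set C: clean, second microphone; Set D: six noise types, second microphone). It also assumes that Avg. is the unweighted mean of the four set scores, which is what the tabulated numbers are consistent with. The function name is illustrative.

```python
from statistics import mean

# NMF word error rates (%) copied from Table 6-1.
# Column order: Clean, Car, Babble, Restaurant, Street, Airport, Train.
NMF_WER = {
    "Wv1": [5.72, 25.28, 38.91, 39.62, 37.68, 37.03, 39.06],
    "Wv2": [15.43, 38.65, 49.79, 50.63, 49.49, 48.65, 49.65],
}

def summarize_sets(wer):
    """Collapse per-condition WERs into Set A-D scores and their mean."""
    sets = {
        "SetA": wer["Wv1"][0],         # clean, primary microphone
        "SetB": mean(wer["Wv1"][1:]),  # six noise types, primary microphone
        "SetC": wer["Wv2"][0],         # clean, second microphone
        "SetD": mean(wer["Wv2"][1:]),  # six noise types, second microphone
    }
    # Assumption: Avg. is the unweighted mean of the four set scores.
    sets["Avg."] = mean(list(sets.values()))
    return {name: round(value, 2) for name, value in sets.items()}

print(summarize_sets(NMF_WER))
```

Under these assumptions the result agrees with the NMF row of Table 6-2 (5.72, 36.26, 15.43, 47.81 and 26.31).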

Table 6-3 Word error rates (%) for the detailed results of various graph regularized based methods by using MFCC features

Table 6-4 Word error rates (%) for the overall results of various graph regularized based methods by using MFCC features

Method                 SetA   SetB    SetC    SetD    Avg.
GraphSC (HeatKernel)   5.98   33.49   14.29   43.96   24.43
GraphSC (Cosine)       3.83   33.49   10.42   45.99   23.43

The corresponding results are shown in Tables 6-1 to 6-4, from which we can draw three noteworthy observations:

1. K-SVD and NMF both improve significantly on the baseline MFCC system, reducing the average WER from 34.14% to 26.96% and 26.31%, respectively. It appears sufficient to use only a few atoms to linearly reconstruct the MFCC-based modulation spectrum. These improvements suggest that the linguistic information conveyed in the modulation spectra may lie in a union of such linear subspaces.

2. To account for the manifold structure of the magnitude modulation spectra, we consider three measures for building the affinity graph, among which the cosine-based affinity graph performs best. A possible reason is that the magnitude modulation spectrum of speech features merely offers a holistic embedding of a given utterance, so an absolute distance is not well suited to capturing the relationship between two utterances.

3. With the cosine-based affinity graph, GraphSC outperforms GNMF, leading to a further WER reduction of 2.86%. It is worth noting that GraphSC yields a sparse representation of a noisy magnitude modulation spectrum. The result supports the view that noise components are dense: reconstructing the clean magnitude modulation spectrum from the union of a few important atoms discards redundant or noisy components.
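To make the difference between the affinity measures in the second observation concrete, the following sketch builds a k-nearest-neighbor affinity graph over utterance-level magnitude modulation spectra with either cosine or heat-kernel weights. It is a minimal illustration, not the configuration used in the experiments: the function name, the neighborhood size `k` and the kernel width `sigma` are assumptions.

```python
import numpy as np

def knn_affinity(X, k=5, weight="cosine", sigma=1.0):
    """Build a symmetric k-NN affinity matrix W and Laplacian L for the rows of X.

    X      : (n_utterances, n_dims) nonnegative magnitude modulation spectra
    weight : "cosine" (scale-invariant) or "heat" (based on absolute distance)
    k and sigma are illustrative hyperparameters, not values from the thesis.
    Requires k < n_utterances.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if weight == "cosine":
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        unit = X / np.maximum(norms, 1e-12)
        S = unit @ unit.T                     # cosine similarity
    elif weight == "heat":
        sq = np.sum(X**2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
        S = np.exp(-d2 / (2.0 * sigma**2))    # heat (Gaussian) kernel
    else:
        raise ValueError("weight must be 'cosine' or 'heat'")
    np.fill_diagonal(S, -np.inf)              # exclude self-matches
    nearest = np.argsort(-S, axis=1)[:, :k]   # k most similar items per row
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = S[rows, cols]
    W = np.maximum(W, W.T)                    # symmetrize
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W
    return W, L
```

Rescaling one utterance's spectrum leaves the cosine weights unchanged but alters the heat-kernel weights, which is one way to read the remark that an absolute distance may be less suitable here.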

In the second set of experiments, we investigate the synergy of the proposed GraphSC method with two state-of-the-art robustness methods that directly normalize the MFCC components at each speech frame instead of the modulation spectra: cepstral mean and variance normalization (CMVN) and the ETSI advanced front-end (AFE). As is evident from Table 6-5 and Table 6-6, combining GraphSC with these existing methods, which directly enhance the MFCC features, brings additional gains over the latter: the average WER drops from 23.48% to 22.18% with CMVN and from 16.16% to 16.09% with AFE. This also shows the complementary robustness gained by additionally normalizing the magnitude modulation spectra of speech features.

Table 6-5 Word error rates (%) for the detailed synergy of the GraphSC based method and several state-of-the-art methods.

Table 6-6 Word error rates (%) for the overall results of synergy of the GraphSC based method and several state-of-the-art methods.

Overall Performance

Method                  SetA   SetB    SetC    SetD    Avg.
CMVN                    3.92   33.96   9.81    46.22   23.48
AFE                     3.70   21.14   9.45    30.36   16.16
GraphSC(Cosine)+CMVN    4.18   30.32   11.40   42.83   22.18
GraphSC(Cosine)+AFE     3.79   20.91   9.42    30.26   16.09
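For reference, CMVN operates on the cepstral time sequence itself rather than on its modulation spectrum. A minimal per-utterance sketch follows; the function name and the small variance floor `eps` are assumptions, and practical front-ends may instead use segmental or recursive statistics.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features : (n_frames, n_coeffs) array of MFCCs for one utterance.
    Returns an array in which every coefficient track has zero mean and,
    unless the track is constant, unit variance over the utterance.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # eps guards against division by zero for constant tracks.
    return (features - mean) / np.maximum(std, eps)
```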

Apart from recognition performance, we also compare NMF and GNMF with regard to their capability of reducing the noise-induced mismatch in the power spectral density (PSD) of the MFCC-based cepstral sequence. Figs. 6-2(a) to 6-2(c) depict the averaged PSD curves of the unprocessed, NMF-processed and GNMF-processed first MFCC feature c1 for Aurora-4 test utterances under four acoustic conditions, with SNR levels varying from 5 dB to 15 dB. For the unprocessed case in Fig. 6-2(a), the various noise sources lead to a significant PSD mismatch over the entire modulation frequency band [0, 50 Hz].

Figs. 6-2(b) and 6-2(c) show that both NMF and GNMF considerably reduce the PSD distortions, while GNMF additionally preserves the proximity relationships among utterances. Furthermore, GNMF appears to be more effective than NMF at mitigating the PSD mismatch across all frequency bands.

Figure 6-2 The average c1 PSD curves for Aurora-4 test utterances under four conditions (clean, airport noise, clean with channel distortion, and airport noise with channel distortion), processed by: (a) the MFCC baseline (without normalization), (b) NMF and (c) GNMF.
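Averaged PSD curves of this kind can be approximated along the following lines. This is a sketch rather than the exact procedure behind Figure 6-2: it assumes a 100 Hz frame rate (consistent with the [0, 50 Hz] modulation band), a plain periodogram with a fixed FFT length, and hypothetical function and parameter names.

```python
import numpy as np

def average_psd(c1_tracks, frame_rate=100.0, n_fft=1024):
    """Average periodogram of c1 trajectories from several utterances.

    c1_tracks  : iterable of 1-D arrays, one c1 time sequence per utterance
    frame_rate : frames per second (assumed 100 Hz, i.e. a 10 ms frame shift)
    n_fft      : FFT length; shorter tracks are zero-padded, longer ones truncated
    Returns (modulation frequencies in Hz, averaged PSD).
    """
    psds = []
    for track in c1_tracks:
        x = np.asarray(track, dtype=float)[:n_fft]
        spectrum = np.fft.rfft(x, n=n_fft)
        # Periodogram estimate, normalized by the number of actual samples
        # (two-sided scaling; non-DC bins are not doubled).
        psds.append(np.abs(spectrum) ** 2 / (frame_rate * len(x)))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate)
    return freqs, np.mean(psds, axis=0)
```

Calling the function once per condition (for example, clean and airport noise) and overlaying the returned curves gives a view of the mismatch comparable to the one described in the text.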

Chapter 7

Conclusions and Future Work

_____________________________________________________________________

In this thesis, we have explored three kinds of low-dimensional structure methods. The first uses sparse representation to enhance the modulation spectra of speech features, mapping the original modulation spectra into the space spanned by a set of representative basis vectors. The second is the novel use of low-rank representation (LRR) methods to discover the intrinsic subspace structures residing in the modulation spectra of speech features while simultaneously alleviating the negative effects of environmental noise. The third explores two graph regularization methods, GNMF and GraphSC, to discover the intrinsic low-dimensional manifold structures residing in the magnitude modulation spectra of speech features; the noisy magnitude modulation spectra are projected onto a pre-learned manifold structure to alleviate the negative effects of environmental noise. Moreover, we empirically compare our methods with state-of-the-art methods on the Aurora-4 dataset and task. The experimental results demonstrate that LRR- and graph regularization-based feature normalization conducted in the modulation spectrum domain can significantly improve the baseline MFCC system, as well as the CMVN- and AFE-based systems.
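As an illustration of the graph-regularized factorization summarized above, the sketch below implements the multiplicative updates commonly used for GNMF with a Frobenius-norm loss, minimizing ||X - U V^T||_F^2 + lam * tr(V^T L V) with L = D - W. It is not the thesis implementation: the function name, the regularization weight `lam`, the rank and the iteration count are illustrative assumptions, and the affinity matrix `W` is taken as given.

```python
import numpy as np

def gnmf(X, W, rank=8, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Graph-regularized NMF via multiplicative updates (illustrative sketch).

    X : (n_dims, n_samples) nonnegative data, one magnitude modulation
        spectrum per column
    W : (n_samples, n_samples) symmetric nonnegative affinity matrix
    Returns U (n_dims, rank), V (n_samples, rank) and the objective history.
    """
    rng = np.random.default_rng(seed)
    n_dims, n_samples = X.shape
    U = rng.random((n_dims, rank)) + eps
    V = rng.random((n_samples, rank)) + eps
    D = np.diag(W.sum(axis=1))  # degree matrix
    L = D - W                   # graph Laplacian

    def objective():
        residual = X - U @ V.T
        return np.sum(residual**2) + lam * np.trace(V.T @ L @ V)

    history = [objective()]
    for _ in range(n_iter):
        # Basis update: the ratio of the negative and positive gradient parts.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # Coefficient update: W pulls neighbors together, D balances the pull.
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
        history.append(objective())
    return U, V, history
```

Setting `lam=0` removes the graph term, in which case the updates reduce to those of standard NMF.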

As to future work, we envisage several directions. First, we plan to investigate more sophisticated robust methods for recovering the subspace structures inherent in the modulation spectra of speech features. In addition, we will try to incorporate knowledge of the noise, leveraging other source separation methods, to reduce the mismatch between training and test conditions. Furthermore, we are interested in exemplar-based sparse representations for noise-robust automatic speech recognition. Finally, we plan to investigate how more advanced deep neural network techniques can be leveraged effectively in graph regularization-based speech feature normalization.

