
THE RESULTS OF LOW-RANK REPRESENTATION

This subsection comprises three sets of experiments. We first assess the effectiveness of the LRR method on the standard MFCC features. Next, we conduct LRR-based experiments on well-known robust features. Finally, we evaluate the de-noising effect of the methods by means of power spectral density curves.

In the first set of experiments, we focus on the LRR method as used to model the magnitude modulation spectra of the standard MFCC features. The corresponding results are shown in Table 5-1 and Table 5-2, from which we can make the following observations:

1. K-SVD and its variants, as well as LRR, significantly boost the performance of the baseline MFCC system. Each of them uses only a few dictionary atoms (the number of atoms is set to 5 in this paper) to linearly reconstruct the MFCC-based modulation spectrum. This suggests that the linguistic information conveyed in the modulation spectra may lie in a union of these linear subspaces.

2. Compared with the two K-SVD variants (i.e., K-SVD+MP and K-SVD+OMP), NN-K-SVD+NNSC lowers the average word error rate (WER) by 0.86% and 0.91%, respectively. This suggests that a nonnegativity constraint should be imposed when unveiling the low-dimensional structures and sparse representations of the magnitude modulation spectra of the MFCC features.

Table 5-1 Word error rates (%) for the detailed results of LRR by using MFCC features

MFCC

Table 5-2 Word error rates (%) for the overall results of the baselines (including various K-SVD based methods) and LRR by using MFCC features

Overall Performance

3. LRR stands out in performance compared with NN-K-SVD+NNSC, achieving a further WER reduction of 3.84%. It should be noted that LRR adopts a low-rank representation of the noisy magnitude modulation spectrum: it reconstructs the clean magnitude modulation spectrum from the union of a few important atoms while simultaneously squeezing the residual noise out into the sparse error component. In contrast, NN-K-SVD+NNSC uses non-negative sparse coding to diminish the redundant information residing in the noisy modulation spectrum.
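The split into a low-rank part and a sparse error part described in the third observation can be illustrated with a minimal sketch. This is not the LRR solver used in the experiments: it assumes the simplified case in which the dictionary is the identity matrix and the error is penalized with an entry-wise l1 norm (the principal component pursuit problem), solved with a plain ADMM loop. The function names and the default values of `lam` and `mu` are illustrative assumptions.

```python
import numpy as np


def singular_value_threshold(M, tau):
    """Shrink the singular values of M by tau (proximal step for the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def soft_threshold(M, tau):
    """Shrink every entry of M toward zero by tau (proximal step for the l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def low_rank_plus_sparse(X, lam=None, mu=None, n_iter=500, tol=1e-7):
    """Split X into a low-rank part L and a sparse error part E with X ~ L + E.

    Assumption: identity dictionary and l1 error penalty, not the full LRR model.
    """
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))  # common default weight for the sparse term
    if mu is None:
        mu = 0.25 * m * n / max(np.abs(X).sum(), 1e-12)  # heuristic penalty parameter
    L = np.zeros_like(X)
    E = np.zeros_like(X)
    Y = np.zeros_like(X)  # Lagrange multipliers for the constraint X = L + E
    for _ in range(n_iter):
        L = singular_value_threshold(X - E + Y / mu, 1.0 / mu)
        E = soft_threshold(X - L + Y / mu, lam / mu)
        R = X - L - E  # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, E
```

Calling `low_rank_plus_sparse(X)` on a matrix whose columns are noisy magnitude modulation spectra would return `L` as the low-rank estimate and `E` as the sparse residual. The full LRR model additionally involves a dictionary and a representation matrix, which this sketch leaves out.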

In the second set of experiments, we investigate the synergy of the proposed LRR method with two state-of-the-art robustness methods that directly normalize the MFCC components at each time frame rather than the modulation spectra: cepstral mean and variance normalization (CMVN) and the ETSI advanced front-end based method (AFE). AFE is regarded as one of the most elaborate and effective robustness methods, and it yields a WER of 16.16%. As is evident from Table 5-3 and Table 5-4, combining LRR with these two methods, which directly enhance the MFCC features, brings considerable additional gains, with WER reductions of 2.16% (CMVN) and 0.5% (AFE), respectively. This observation confirms the complementary robustness gained by additionally normalizing the magnitude modulation spectra of speech features.

Table 5-3 Word error rates (%) for the detailed synergy of the LRR based method and several state-of-the-art methods.

Table 5-4 Word error rates (%) for the overall synergy of the LRR based method and several state-of-the-art methods.
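For readers unfamiliar with CMVN, the frame-level normalization mentioned above can be sketched as follows. The sketch assumes that the statistics are computed per utterance and per cepstral coefficient; the exact configuration used in the experiments is not stated here, and the function name `cmvn` and the `eps` guard are illustrative.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features: array of shape (n_frames, n_coeffs), one MFCC vector per frame.
    Returns an array of the same shape whose columns have zero mean and
    (approximately) unit standard deviation.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # eps (an assumption of this sketch) avoids division by zero for constant columns
    return (features - mean) / (std + eps)
```

Because CMVN acts on each coefficient trajectory in the temporal domain while LRR acts on the magnitude modulation spectrum, the two can be applied in sequence, which is the combination evaluated in Table 5-3 and Table 5-4.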

In the last set of experiments, we compare NN-K-SVD+NNSC and LRR with regard to their capability of reducing the mismatch in the power spectral density (PSD) of the MFCC-based cepstral feature sequence. Fig. 5-2(a) to 5-2(c) depict the average PSD curves of the unprocessed, NN-K-SVD+NNSC processed and LRR processed first MFCC feature component (c1) for the Aurora-4 test utterances contaminated with four types of environmental noise, with SNR levels varying from 5 dB to 15 dB. First, for the unprocessed case shown in Fig. 5-2(a), the various noise sources cause a significant PSD mismatch over the entire modulation frequency band [0, 50 Hz]. Fig. 5-2(b) and 5-2(c) show that both NN-K-SVD+NNSC and LRR considerably reduce the PSD distortion, and that LRR appears to be more effective than NN-K-SVD+NNSC at mitigating the PSD mismatch in all frequency bands.

Figure 5-2 The average c1 PSD curves for Aurora-4 test utterances with various noise types, i.e., clean, airport noise, clean with channel distortion and airport noise with channel distortion, which were processed by two normalization methods: (a) the MFCC baseline (without normalization), (b) NN-K-SVD+NNSC and (c) LRR.
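An average PSD curve of the kind discussed above can be computed along the following lines. This is a minimal sketch under stated assumptions: a frame rate of 100 Hz (which gives the modulation frequency band [0, 50 Hz]), a plain periodogram estimate, and zero-padding or truncation of every c1 trajectory to a common FFT length. The function name and the parameters `frame_rate` and `n_fft` are illustrative, not taken from the experiments.

```python
import numpy as np


def average_psd(sequences, frame_rate=100.0, n_fft=256):
    """Average periodogram of several one-dimensional feature trajectories.

    sequences: iterable of 1-D arrays, e.g. the c1 trajectory of each utterance.
    Returns (freqs, psd), where freqs spans 0 .. frame_rate / 2 in Hz.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / frame_rate)
    total = np.zeros(freqs.size)
    count = 0
    for x in sequences:
        x = np.asarray(x, dtype=float)
        n_used = min(len(x), n_fft)      # rfft truncates or zero-pads to n_fft
        spectrum = np.fft.rfft(x, n=n_fft)
        total += np.abs(spectrum) ** 2 / (frame_rate * n_used)
        count += 1
    return freqs, total / count
```

Plotting the returned curve for each noise condition, before and after processing, would give a picture comparable in spirit to Fig. 5-2: the closer the curves of the noisy conditions are to the clean one, the smaller the PSD mismatch.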

Chapter 6

Manifold Learning

_____________________________________________________________________

In this chapter, we endeavor to explore the intrinsic geometric low-dimensional manifold structures inherent in the modulation spectra of speech features, in the hope of obtaining more noise-robust speech features.

Recently, a prevalent trend has been to explore the intrinsic structures of data instances (for example, speech features or their intermediate representations) in a low-dimensional space, showing that such latent structures are related to the correlations between the data instances in their original high-dimensional ambient space [60]-[64]. Following this line of thought, we may consider the case where each data instance and its neighboring instances lie on or close to a locally linear patch of the manifold in an ambient space. Specifically, let ℳ be a d-dimensional subspace of a Riemannian manifold that is also embedded in a high-dimensional ambient (Euclidean) space ℝ^M, i.e., ℳ^d ⊂ ℝ^M. To some extent, exploring low-dimensional structures by sampling the geometric distribution of data instances, or preserving the locally coherent structures of data instances in a high-dimensional space, is useful for various tasks, including clustering and classification, to name just a few [63]-[64]. To capture the low-dimensional manifold structures of data instances, the methods presented in [60]-[62] (based on the notion of local invariance) assume that neighboring data instances in a high-dimensional space are likely to have similar embeddings. There is also another line of research, known as graph-regularization based methods [63]-[64], which argues that conventional matrix factorization methods fail to preserve the geometric structures present in the original ambient space. As such, the graph Laplacian is incorporated as a regularization term in the original objective function, in the hope that the derived basis vectors retain the intrinsic Riemannian manifold structure rather than the ambient Euclidean structure.
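To make the graph Laplacian regularization term concrete, the sketch below builds a nearest-neighbor graph over the data instances and evaluates the usual smoothness penalty tr(VᵀLV), which equals half the weighted sum of squared distances between the embeddings of connected instances. The binary edge weights, the choice of `k`, and all function names are assumptions made for illustration; the graph construction used in [63]-[64] may differ.

```python
import numpy as np


def knn_graph(X, k=3):
    """Symmetric binary k-nearest-neighbor adjacency matrix.

    X: array of shape (n_samples, n_features), with n_samples > k.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # exclude self-neighbors
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest samples
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                 # connect i and j if either is a neighbor


def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W


def smoothness_penalty(V, L):
    """tr(V^T L V), where row i of V is the low-dimensional embedding of sample i."""
    return float(np.trace(V.T @ L @ V))
```

Adding a multiple of `smoothness_penalty(V, L)` to a matrix factorization objective penalizes embeddings that place neighboring instances far apart, which is how the local invariance assumption enters the optimization.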

It has been observed that the articulatory parameterizations of speech production for certain phoneme classes lie on a low-dimensional manifold [65]-[66].

Along this research direction, we hypothesize that the intrinsic structures of the magnitude modulation spectra of speech features may lie on a latent manifold of low dimensionality embedded in their original high-dimensional ambient space. In this way, noise components can be ruled out by projecting noisy magnitude modulation spectra onto a pre-learned basis space of the manifold. Specifically, we endeavor to explore the intrinsic geometric low-dimensional manifold structures inherent in the magnitude modulation spectra of speech features, in the hope of obtaining more noise-robust speech features. The main contribution of this paper is a novel use of graph-regularization based methods to enhance speech features by preserving the inherent manifold structures of the magnitude modulation spectra and excluding irrelevant ones.

Furthermore, we thoroughly compare our methods with several well-practiced methods that also explore low-dimensional structures of data instances. To the best of our knowledge, this work is the first attempt to leverage such a modeling paradigm in the modulation frequency domain of speech features for robust ASR.
