A robust training algorithm for adverse speech recognition
Wei-Tyng Hong a,*, Sin-Horng Chen b
a E000/CCL, Industrial Technology Research Institute, Chutung, Hsinchu, Taiwan, ROC
b Department of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan, ROC
Received 10 November 1998
Abstract
In this paper, a new robust training algorithm is proposed for generating a set of bias-removed, noise-suppressed reference speech HMM models in an adverse environment suffering from both channel bias and additive noise. Its main idea is to incorporate a signal bias-compensation operation and a PMC noise-compensation operation into its iterative training process. This makes the resulting speech HMM models better suited to a robust speech recognition method that uses the same signal bias-compensation and PMC noise-compensation operations in the recognition process. Experimental results showed that the speech HMM models it generated outperformed both the clean-speech HMM models and those generated by the conventional k-means algorithm for two adverse Mandarin speech recognition tasks. It is therefore a promising robust training algorithm. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Robust training algorithm; PMC noise-compensation; Signal bias-compensation; Mandarin speech recognition
1. Introduction
Background noise and channel bias are the two major interference factors that seriously degrade the performance of speech recognizers operating in adverse environments, such as telephone speech through the public switching network. Recently, IBM built an HMM-based Mandarin telephone speech recognition system using a large telephone speech database called the ‘Mandarin call home database’ (Liu et al., 1996). The vocabulary contained about 44 000 words. The word and syllable error rates were, respectively, 70.5% and 58.7%, which were much worse than those achieved in microphone-speech recognition (Lee and Juang, 1996). Many studies have been devoted to the field of robust speech recognition for adverse environments (Juang, 1991; Furui, 1992; Gong, 1995; Junqua and Haton, 1996). The major efforts of those studies were put into developing robust recognition algorithms that compensate for or eliminate the noise/channel effect, based on a given set of reference speech models usually trained in a clean-speech environment. In the non-linear noise subtraction method (Lockwood and Boudy, 1992; Mokbel and Chollet, 1995), a noise model was first estimated from the non-speech precursor of the testing utterance and then subtracted from the speech part in the linear spectrum domain to obtain noise-suppressed features, which were recognized using the clean-speech reference models. In (Acero and Stern, 1990, 1991), the CDCN (codeword-dependent cepstral normalization) algorithm was proposed to estimate equalization vectors for the best transformation, in the maximum likelihood sense, from the universal codebook into the testing acoustic space in order to eliminate both the noise and channel effects. In the RASTA method (Hermansky and Morgan, 1994), a filter was used to eliminate the speaker/channel bias and obtain bias-removed recognition features. In the parallel-model-combination (PMC) method (Gales and Young, 1996), clean-speech HMM models were combined with the current noise model to form noise-compensated composite HMM models for recognizing noisy speech. In the state-based Wiener filtering method (Hansen and Clements, 1991; Ephraim, 1992; Vaseghi and Milner, 1997), a two-stage recognition method was used. In the first stage, it used the Viterbi algorithm to find the best state sequence for the input noisy testing speech; in the second stage, it applied state-based Wiener filtering to estimate the clean speech and recognized it using the clean-speech HMM models. In (Zhao, 1996), a two-step procedure was employed to detect a spectral bias vector for the input testing utterance by using Gaussian-distributed phone models. It then removed the estimated bias vector from the testing utterance for recognition. In the stochastic matching algorithm (Sankar and Lee, 1996; Lee, 1998), the parameters of mapping functions between the testing speech and the reference HMM models were estimated iteratively using the expectation maximization (EM) algorithm (Dempster et al., 1977). In (Minami and Furui, 1996), an integrated method for adapting HMM models to additive noise and channel distortion was proposed. This method first estimated the signal-to-noise ratio by maximizing the likelihood of the PMC-compensated HMM models given the input speech, and then estimated the cepstral bias by Sankar's method (Sankar and Lee, 1996). The procedure was applied iteratively until convergence was reached.
*Corresponding author. E-mail addresses: jfhong@taiwan.com (W.-T. Hong), schen@cc.nctu.edu.tw (S.-H. Chen).
www.elsevier.nl/locate/specom. PII: S0167-6393(99)00057-6
Apart from the above-mentioned main research stream, the robust training issue is also important for adverse speech recognition when clean-speech reference models are not available. Its main concern is to train a set of robust reference speech models directly from a database collected in an adverse environment. The issue is important because the set of reference speech models obtained by the conventional segmental k-means algorithm (Juang and Rabiner, 1990) is usually not robust. This is mainly owing to the high variability in the characteristics of the training speech signals collected in the adverse environment. For example, a training data set collected from telephone calls through the public switching network will suffer from diverse recording conditions caused by different background noises, different types of transducers, different telephone channels, etc. This makes the speech patterns spread more widely in the feature space, so that they overlap one another more seriously and the trained speech models lose discrimination capability.
In the past, many robust training algorithms have been proposed. In the signal bias removal (SBR) algorithm (Rahim and Juang, 1996), a codebook-based iterative signal bias removal technique was performed in both the training and testing phases to minimize the channel-induced variations. In (Anastaskos et al., 1997), the speaker-specific characteristics were first modeled by a linear-regressive transformation between the independent models and the dependent models. A speaker-adaptive training algorithm based on the EM algorithm was then employed to iteratively estimate the parameters of the transformation and the compact speaker-normalized HMM models. In (Gong, 1997), a source normalization training algorithm, which modeled the environmental corruption as a form of linear transformation, was proposed to estimate the HMM models. The noise and channel effects were modeled implicitly in the linear transformation. In the testing stage, the MLLR adaptation (Gales and Woodland, 1996) was applied to estimate the state-dependent transformation matrices and the bias terms for recognition. Those training algorithms have been shown to be effective in removing the channel biases and/or the speaker variations. However, the noise effect is still seldom considered in the robust training issue.
In this study, we are interested in the robust training issue with both the signal bias and noise effects being considered. A robust training algorithm, referred to as the robust environment-effects suppression training (REST) algorithm, is proposed. The design goal of the REST algorithm is twofold. One goal is to counteract the large variability of the corrupted training samples in order to obtain a set of compact reference speech HMM models with both signal bias and noise suppressed. The other is to make the generated compact reference speech HMM models better suited to a given robust speech recognition method. The REST algorithm is an iterative training procedure that sequentially optimizes the following three operations: parameter estimation for environment characterization, environment-effect compensation for speech segmentation, and environment-effect suppression for HMM model re-estimation. The parameter estimation for environment characterization detects the signal bias and estimates the noise statistics for each training utterance. It assumes that each utterance has its own environmental characteristics. Based on an assumed environment contamination model, the environment-effect compensation uses the estimated environment characterization parameters to adapt the HMM models to match the current training utterance for optimal segmentation. Using the segmentation results and the same environment contamination model, the environment-effect suppression removes the signal bias and the noise from the corrupted speech for updating the HMM models. Owing to the involvement of the environment-effect compensation operation in the training process of the REST algorithm, we expect that it will generate better reference speech HMM models for a robust recognition method that employs the same environment-effect compensation operation in the recognition process. This is especially true when the environment-effect compensation operation is not perfect, due either to the non-existence of a perfect one or to the use of an inaccurate environment contamination model in its derivation.
The paper is organized as follows. Section 2 presents the proposed REST algorithm in detail. Section 3 describes the robust speech recognition method using the reference speech HMM models generated by the REST algorithm. The effectiveness of the REST algorithm is evaluated by simulations discussed in Section 4. Some conclusions are given in Section 5.
2. The REST algorithm
The proposed REST training algorithm consists of an iterative procedure which sequentially performs the following three steps:
1. optimally segment each training utterance by using the environment-compensated HMM models,
2. estimate the environment characteristics and enhance the speech by eliminating the noise using the state-based Wiener filtering method and by removing the signal bias using the SBR method, and
3. re-estimate the speech HMM models.
Operations performed in these three steps are derived from a presumed environment contamination model. A schematic diagram of the model is displayed in Fig. 1. It assumes that, for each utterance, the observed speech z is generated from the clean speech x by corrupting it first with a convolutional channel b and then with an additive noise n. Here b is assumed to be time-invariant and n to be stationary throughout the utterance. In the linear spectrum domain, the model can be expressed by

$$y_t(f) = b(f)\, x_t(f), \tag{1a}$$

$$z_t(f) = y_t(f) + n_t(f), \tag{1b}$$

where the subscript t denotes the frame index and $y_t(f)$ is an intermediate signal showing the corruption of the clean speech by the channel bias only. We can also express the relation between x and y in the cepstrum domain by

$$y_t(m) = b(m) + x_t(m), \tag{1c}$$

where m denotes the order of the cepstral coefficient. Obviously, it is troublesome to directly estimate the original clean speech x in either the linear spectrum domain or the cepstrum domain when both noise interference and channel distortion exist. As the above formulations suggest, it is better to deal separately with the channel distortion in the cepstrum domain and the noise interference in the linear spectrum domain. In the following discussions, we specify a signal in the linear spectrum domain and in the cepstrum domain by attaching the arguments f and m to it, respectively.
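The contamination model of Eqs. (1a)-(1c) can be illustrated with a minimal Python sketch. The spectrum values, the function name `corrupt_frame` and the three-bin frame are hypothetical and serve only to show that the multiplicative channel of Eq. (1a) becomes an additive bias once logarithms are taken, which is the property Eq. (1c) relies on (the cepstrum being a linear transform of the log spectrum).

```python
import math

def corrupt_frame(x_spec, b_spec, n_spec):
    """Apply Eqs. (1a) and (1b) to one frame of linear-spectrum values."""
    y_spec = [b * x for b, x in zip(b_spec, x_spec)]  # channel: y(f) = b(f) x(f)
    z_spec = [y + n for y, n in zip(y_spec, n_spec)]  # noise:   z(f) = y(f) + n(f)
    return y_spec, z_spec

# Toy values (hypothetical, not taken from the experiments).
x = [4.0, 2.0, 1.0]   # clean speech spectrum
b = [0.5, 1.0, 2.0]   # time-invariant channel response
n = [0.1, 0.1, 0.1]   # stationary additive noise

y, z = corrupt_frame(x, b, n)

# In the log domain the channel acts as an additive bias: log y = log b + log x.
log_bias = [math.log(yi) - math.log(xi) for yi, xi in zip(y, x)]
```

The additive noise of Eq. (1b), by contrast, stays additive only in the linear spectrum domain, which is why the two distortions are handled in different domains.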
The REST algorithm is derived as follows. Assume that the training data set contains R utterances. Let $\Lambda_e = \{\Lambda_n^{(r)}, b^{(r)}\}_{r=1,\ldots,R}$ denote the set of environmental interference models of the whole training data set, where $b^{(r)}$ and $\Lambda_n^{(r)} = \{\mu_n^{(r)}, \Sigma_n^{(r)}\}$ are, respectively, the signal bias and the noise model of the rth training utterance; $\mu_n^{(r)}$ and $\Sigma_n^{(r)}$ are the mean vector and covariance matrix of $\Lambda_n^{(r)}$. Let $Z^{(r)} = \{z_1^{(r)}, \ldots, z_{T_r}^{(r)}\}$ and $X^{(r)} = \{x_1^{(r)}, \ldots, x_{T_r}^{(r)}\}$ be, respectively, the observed and clean-speech feature vector sequences of the rth utterance, and let $\Lambda_x$ denote the set of environment-effect normalized speech HMM models that we want to generate. Based on the maximum likelihood criterion, the goal of an ideal robust training algorithm is to jointly estimate $\Lambda_x$ and $\Lambda_e$ given $\{Z^{(r)}\}_{r=1,\ldots,R}$ by

$$(\hat{\Lambda}_x, \hat{\Lambda}_e) = \arg\max_{\Lambda_x, \Lambda_e} L\bigl(\{Z^{(r)}\}_{r=1,\ldots,R} \mid \Lambda_x, \Lambda_e\bigr), \tag{2}$$

where L is the likelihood function of the observation sequences $Z^{(r)}$ given the parameter set $(\Lambda_x, \Lambda_e)$.
However, because it is generally difficult to derive a closed-form solution for the above joint maximization problem, we use a three-step iterative training procedure in the REST algorithm to obtain a sub-optimal solution. The three steps are:
1. Form the environment-compensated speech HMM models $\Lambda_z^{(r)}$ by using the current $(\Lambda_x, \Lambda_e)$ and use them to optimally segment the training utterance $Z^{(r)}$.
2. Based on the segmentation result, estimate $\Lambda_n^{(r)}$ and enhance the adverse speech $Z^{(r)}$ to obtain $Y^{(r)}$ by the state-based Wiener filtering method; then estimate $b^{(r)}$ and further enhance the speech $Y^{(r)}$ to obtain $X^{(r)}$ by the SBR method.
3. Update the current speech HMM models $\Lambda_x$ using the enhanced speech $\{X^{(r)}\}_{r=1,\ldots,R}$.
We discuss these three steps in more detail as follows.
The first step of the REST algorithm is to optimally segment each training utterance using the current speech HMM models $\Lambda_{x,k-1}$ and the environmental interference model $\Lambda_{e,k-1}$ given by the previous iteration, where the subscript k denotes the iteration index. Based on the maximum likelihood criterion, the task can be accomplished by solving the following optimization problem to find the best state sequence $U_k^{(r)} = \{u_{1,k}^{(r)}, \ldots, u_{T_r,k}^{(r)}\}$ and the best mixture component sequence $V_k^{(r)} = \{v_{1,k}^{(r)}, \ldots, v_{T_r,k}^{(r)}\}$ of the optimal segmentation:

$$\begin{aligned}
(U_k^{(r)}, V_k^{(r)}) &= \arg\max_{U^{(r)}, V^{(r)}} \Pr\bigl(Z^{(r)}, U^{(r)}, V^{(r)} \mid \Lambda_{x,k-1}, \Lambda_{e,k-1}\bigr) \\
&= \arg\max_{u_1^{(r)},\ldots,u_{T_r}^{(r)};\; v_1^{(r)},\ldots,v_{T_r}^{(r)}} \left\{ \prod_{t=1}^{T_r} a_{u_{t-1}^{(r)},\,u_t^{(r)}} \Pr\bigl(z_t^{(r)} \mid u_t^{(r)}, v_t^{(r)}, \Lambda_{z,k-1}^{(r)}\bigr) \right\},
\end{aligned} \tag{3}$$

where $a_{i,j}$ denotes the transition probability from state i to state j. Eq. (3) is solved in this study by first forming the environment-compensated speech HMM models $\Lambda_{z,k-1}^{(r)}$ using $\Lambda_{x,k-1}$ and $\Lambda_{e,k-1}$, and then using the Viterbi search to simultaneously find $U_k^{(r)}$ and $V_k^{(r)}$. The formation of $\Lambda_{z,k-1}^{(r)}$ from $\Lambda_{x,k-1}$ and $\Lambda_{e,k-1}$ is based on the assumed environment contamination model defined in Eqs. (1b) and (1c), and is realized by the following two sub-steps:
(1.1) Calculate $\Lambda_{y,k-1}^{(r)}$ in the cepstrum domain by

$$\mu_{y,j,q,k-1}^{(r)}(m) = \mu_{x,j,q,k-1}(m) + b_{k-1}^{(r)}(m), \tag{4a}$$

$$\Sigma_{y,j,q,k-1}^{(r)}(m) = \Sigma_{x,j,q,k-1}(m), \tag{4b}$$

where $\mu_{y,j,q,k-1}^{(r)}(m)$ and $\Sigma_{y,j,q,k-1}^{(r)}(m)$ are, respectively, the mean vector and covariance matrix of the qth Gaussian mixture in the jth state of $\Lambda_{y,k-1}^{(r)}$, and $b_{k-1}^{(r)}(m)$ is the bias vector given in $\Lambda_{e,k-1}$.
(1.2) Use the PMC method to form $\Lambda_{z,k-1}^{(r)}$ by first transforming $\Lambda_{y,k-1}^{(r)}$ from the cepstrum domain to the linear spectrum domain, then combining it with $\Lambda_{n,k-1}^{(r)}$ in the linear spectrum domain, and lastly transforming the result back to the cepstrum domain.
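Sub-steps (1.1) and (1.2) can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in this study: it assumes diagonal covariances, works on a single log-spectral bin (the cosine transforms between the cepstrum and log-spectrum domains are omitted), uses the standard log-normal moment-matching formulas, and assumes unit gain between speech and noise. All function names are hypothetical.

```python
import math

def add_bias(mean_cep, bias_cep):
    """Eq. (4a): shift a cepstral mean by the bias; Eq. (4b) leaves variances unchanged."""
    return [m + b for m, b in zip(mean_cep, bias_cep)]

def log_to_linear(mu_log, var_log):
    """Mean and variance of exp(X) for Gaussian X with mean mu_log, variance var_log."""
    mu_lin = math.exp(mu_log + 0.5 * var_log)
    var_lin = mu_lin ** 2 * (math.exp(var_log) - 1.0)
    return mu_lin, var_lin

def linear_to_log(mu_lin, var_lin):
    """Inverse mapping: fit a log-normal to the given linear-domain moments."""
    var_log = math.log(var_lin / mu_lin ** 2 + 1.0)
    mu_log = math.log(mu_lin) - 0.5 * var_log
    return mu_log, var_log

def pmc_combine(speech_log, noise_log):
    """Combine speech and noise (mean, variance) pairs for one log-spectral bin."""
    s_mu, s_var = log_to_linear(*speech_log)
    n_mu, n_var = log_to_linear(*noise_log)
    # Independent additive sources: means and variances add in the linear domain.
    return linear_to_log(s_mu + n_mu, s_var + n_var)
```

When the noise level is far below the speech level, `pmc_combine` returns parameters close to those of the speech Gaussian, as expected; as the noise grows, the combined mean moves upward and the variance shrinks.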
The second step of the REST algorithm is to enhance the adverse speech by first suppressing the noise using the state-based Wiener filtering method (Hansen and Clements, 1991; Ephraim, 1992; Vaseghi and Milner, 1997) and then removing the signal bias by the SBR method (Rahim and Juang, 1996). It consists of the following two sub-steps:
(2.1) Noise suppression: Given the segmentation information $U_k^{(r)}$, estimate the noise model $\Lambda_{n,k}^{(r)}$ and eliminate it from the input adverse speech $z_t^{(r)}(f)$, in the linear spectrum domain, by the state-based Wiener filtering method to obtain the intermediate signal $y_{t,k}^{(r)}(f)$. The noise model $\Lambda_{n,k}^{(r)}$ and its average power spectrum density $P_{n,k}^{(r)}(f)$ for the rth utterance are re-estimated from the non-speech frames by

$$\mu_{n,k}^{(r)}(m) = \frac{\sum_{t=1}^{T_r} z_t^{(r)}(m)\, I\bigl(u_{t,k}^{(r)} \in \text{non-speech}\bigr)}{\sum_{t=1}^{T_r} I\bigl(u_{t,k}^{(r)} \in \text{non-speech}\bigr)}, \tag{5a}$$

$$\Sigma_{n,k}^{(r)}(m) = \frac{\sum_{t=1}^{T_r} \bigl(z_t^{(r)}(m)\bigr)^2 I\bigl(u_{t,k}^{(r)} \in \text{non-speech}\bigr)}{\sum_{t=1}^{T_r} I\bigl(u_{t,k}^{(r)} \in \text{non-speech}\bigr)} - \bigl(\mu_{n,k}^{(r)}(m)\bigr)^2, \tag{5b}$$

$$P_{n,k}^{(r)}(f) = \frac{\sum_{t=1}^{T_r} \hat{P}_{z,t}^{(r)}(f)\, I\bigl(u_{t,k}^{(r)} \in \text{non-speech}\bigr)}{\sum_{t=1}^{T_r} I\bigl(u_{t,k}^{(r)} \in \text{non-speech}\bigr)}, \tag{5c}$$

where $\hat{P}_{z,t}^{(r)}(f)$ is the periodogram of $z_t^{(r)}$, which is defined as

$$\hat{P}_{z,t}^{(r)}(f) = \frac{1}{L} \bigl| z_t^{(r)}(f) \bigr|^2, \tag{6}$$

L is the analysis length of the FFT operation, and $I(\cdot)$ is the zero-one indicator function. Based on Eq. (1b) of the assumed environment contamination model, the Wiener filter for the jth state of the speech model and the rth training utterance is constructed as

$$W_{j,k}^{(r)}(f) = \frac{P_{y,j,k-1}(f)}{P_{y,j,k-1}(f) + P_{n,k}^{(r)}(f)}, \tag{7a}$$

where $P_{y,j,k-1}(f)$ is the average power density spectrum corresponding to the jth state of the bias-compensated speech HMM models. After forming all state-based Wiener filters, we calculate the enhanced signal by

$$y_{t,k}^{(r)}(f) = W_{u_{t,k}^{(r)},k}^{(r)}(f)\, z_t^{(r)}(f), \quad \text{for } t = 1, \ldots, T_r \text{ and } u_{t,k}^{(r)} \notin \text{non-speech}. \tag{7b}$$
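The noise re-estimation of Eq. (5c) and the Wiener filtering of Eqs. (7a) and (7b) can be sketched as below. The sketch assumes that periodograms and state power spectra are already available as plain lists of per-bin values; the function names and the toy numbers are hypothetical.

```python
def noise_psd(periodograms, is_nonspeech):
    """Eq. (5c): average the periodograms of the frames labelled non-speech."""
    frames = [p for p, ns in zip(periodograms, is_nonspeech) if ns]
    if not frames:
        raise ValueError("no non-speech frames available for noise estimation")
    n_bins = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(n_bins)]

def wiener_gain(state_psd, noise):
    """Eq. (7a): W(f) = P_y(f) / (P_y(f) + P_n(f)) for one HMM state."""
    return [py / (py + pn) for py, pn in zip(state_psd, noise)]

def enhance(frame_spec, gain):
    """Eq. (7b): scale the noisy spectrum of a speech frame by its state's gain."""
    return [w * z for w, z in zip(gain, frame_spec)]

# Toy utterance: two non-speech frames followed by one speech frame.
periodograms = [[1.0, 1.0], [3.0, 1.0], [9.0, 5.0]]
labels = [True, True, False]

pn = noise_psd(periodograms, labels)
gain = wiener_gain([6.0, 1.0], pn)
enhanced = enhance([4.0, 2.0], gain)
```

Because the gain always lies between 0 and 1, bins dominated by noise are attenuated strongly while bins dominated by speech pass almost unchanged.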
(2.2) SBR: Given the segmentation information $(U_k^{(r)}, V_k^{(r)})$, estimate the signal bias and remove it from the intermediate signal $y_{t,k}^{(r)}(f)$ to obtain the environment-normalized speech estimate. The SBR method is realized by first transforming $y_{t,k}^{(r)}(f)$ to $y_{t,k}^{(r)}(m)$, then making the simplifying assumption that $\Sigma_{z,j,q}^{(r)}$ is an identity matrix in Eq. (A.11) of Appendix A to obtain

$$b_k^{(r)}(m) = \frac{\sum_{t=1}^{T_r} \bigl( y_{t,k}^{(r)}(m) - \mu_{x,u_{t,k}^{(r)},v_{t,k}^{(r)},k-1}(m) \bigr)\, I\bigl(u_{t,k}^{(r)} \notin \text{non-speech}\bigr)}{\sum_{t=1}^{T_r} I\bigl(u_{t,k}^{(r)} \notin \text{non-speech}\bigr)}, \tag{8a}$$

and lastly removing the signal bias by

$$x_{t,k}^{(r)}(m) = y_{t,k}^{(r)}(m) - b_k^{(r)}(m). \tag{8b}$$

The third step of the REST algorithm is to re-estimate the speech HMM models $\Lambda_{x,k}$ and the average power density spectra $\{P_{y,j,k-1}(f)\}_{j=1,\ldots,N_j}$ using, respectively, the enhanced speech signals $\{X_k^{(r)}(m)\}_{r=1,\ldots,R}$ and $\{Y_k^{(r)}(m)\}_{r=1,\ldots,R}$, based on the current segmentation information $\{(U_k^{(r)}, V_k^{(r)})\}_{r=1,\ldots,R}$, where $N_j$ denotes the total number of states in the HMM models.
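The bias estimation and removal of Eqs. (8a) and (8b) amount to averaging, over the speech frames, the residual between each noise-suppressed cepstral vector and the mean of the mixture component it was aligned to. A minimal Python sketch follows; the function names are hypothetical, and the aligned means are assumed to have been looked up from the segmentation beforehand.

```python
def estimate_bias(y_cep, aligned_means, is_speech):
    """Eq. (8a): average residual between enhanced cepstra and aligned model means."""
    dims = len(y_cep[0])
    total = [0.0] * dims
    count = 0
    for y, mu, speech in zip(y_cep, aligned_means, is_speech):
        if speech:  # non-speech frames do not contribute to the bias estimate
            for m in range(dims):
                total[m] += y[m] - mu[m]
            count += 1
    if count == 0:
        raise ValueError("no speech frames available for bias estimation")
    return [t / count for t in total]

def remove_bias(y_cep, bias):
    """Eq. (8b): subtract the utterance-level bias from every frame."""
    return [[ym - bm for ym, bm in zip(y, bias)] for y in y_cep]

# Toy utterance: two speech frames and one non-speech frame.
y_cep = [[1.5, 0.0], [2.5, 1.0], [9.0, 9.0]]
means = [[1.0, -0.5], [2.0, 0.5], [0.0, 0.0]]
speech = [True, True, False]

bias = estimate_bias(y_cep, means, speech)
x_cep = remove_bias(y_cep, bias)
```

In this toy case both speech frames sit exactly 0.5 above their aligned means in every dimension, so the estimated bias is 0.5 per dimension and the bias-removed speech frames coincide with the model means.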
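The combination of all operations in the above three steps can be interpreted as a sequential optimal estimation procedure, listed in the following:
For iteration k
For utterance r = 1 to R, do

$$(U_k^{(r)}, V_k^{(r)}) = \arg\max_{U^{(r)}, V^{(r)}} \Pr\bigl(Z^{(r)}, U^{(r)}, V^{(r)} \mid \Lambda_{x,k-1}, \Lambda_{e,k-1}\bigr), \tag{9a}$$

$$\Lambda_{n,k}^{(r)} = \arg\max_{\Lambda_n^{(r)}} \Pr\bigl(Z^{(r)} \mid \Lambda_n^{(r)}, U_k^{(r)}, V_k^{(r)}\bigr), \tag{9b}$$

$$Y_k^{(r)} = \arg\max_{Y^{(r)}} \Pr\bigl(Y^{(r)} \mid Z^{(r)}, U_k^{(r)}, \Lambda_{n,k}^{(r)}, \{P_{y,j,k-1}\}_{j=1,\ldots,N_j}\bigr), \tag{9c}$$

$$b_k^{(r)} = \arg\max_{b^{(r)}} \Pr\bigl(Y_k^{(r)} \mid b^{(r)}, U_k^{(r)}, V_k^{(r)}, \Lambda_{x,k-1}\bigr), \tag{9d}$$

$$X_k^{(r)} = \arg\max_{X^{(r)}} \Pr\bigl(X^{(r)} \mid Y_k^{(r)}, b_k^{(r)}\bigr). \tag{9e}$$

End loop for r

$$\{P_{y,j,k}\}_{j=1,\ldots,N_j} = \arg\max_{\{P_{y,j}\}_{j=1,\ldots,N_j}} \Pr\bigl(\{Y_k^{(r)}\}_{r=1,\ldots,R} \mid \{P_{y,j}\}_{j=1,\ldots,N_j}, \{U_k^{(r)}\}_{r=1,\ldots,R}\bigr), \tag{9f}$$

$$\Lambda_{x,k} = \arg\max_{\Lambda_x} \Pr\bigl(\{X_k^{(r)}\}_{r=1,\ldots,R} \mid \Lambda_x, \{U_k^{(r)}, V_k^{(r)}\}_{r=1,\ldots,R}\bigr). \tag{9g}$$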
A similar idea was used in (Lim and Oppenheim, 1978; Hansen and Clements, 1991), where a sequential MAP estimation procedure was employed in an iterative algorithm to sequentially estimate the linear prediction coefficients, the gain, and the noise-free speech waveform for frame-level speech enhancement.
The REST algorithm can also be derived by using the EM algorithm (Dempster et al., 1977), so its convergence can be guaranteed. Detailed derivations of the EM procedure for estimating $(\Lambda_x, \Lambda_e)$ are given in Appendix A.
Like other iterative algorithms, the REST algorithm must be initialized with an initial set of speech HMM models, an initial set of state-averaged power density spectra, an initial channel bias vector, and an initial noise model. The initial speech HMM models and the initial state-averaged power density spectra can be constructed by a conventional ML training algorithm using either an enhanced version of the given adverse-speech training set or another training set with high SNR. In this study, we adopt the former approach and use an enhanced speech training set obtained by subtracting the given initial noise model from the adverse-speech training set. The initial noise models are obtained from non-speech frames of the adverse-speech training set detected by an RNN-based speech segmentation method (Hong and Chen, 1997). It uses an RNN classifier, trained directly from adverse speech, to classify the input speech pattern into three broad classes: initial, final and non-speech. The speech segmentation method has been shown to perform well in noisy environments (Hong et al., 1999). The initial bias vector is obtained by the SBR method using the above enhanced speech training set.
3. The PMC–SBC method for Mandarin base-syllable recognition
Mandarin Chinese is a tonal language. Each Chinese character is pronounced as a syllable with a tone. There are, in total, about 1300 syllables. If the tones are disregarded, there are only 411 phonologically allowed base-syllables. The phonetic structures of these 411 base-syllables are very regular and relatively simple compared with English. A base-syllable can be decomposed into an optional initial and a final. There are in total 22 initials (including a null) and 39 finals. Although the base-syllable set is only of medium size, its recognition is actually very difficult because it comprises many highly confusable sets. Specifically, all 411 base-syllables can be categorized into 39 confusable sets according to their finals. As in the English E-set, all base-syllables in each confusable set differ only in their initial consonants and are therefore difficult to distinguish (Chang et al., 1993; Lee and Juang, 1996). Besides, cross-set confusions among these 39 sets also occur easily. Medial confusion and nasal-ending confusion are the two most common types of cross-set confusion. Highly discriminative speech models are therefore needed to tackle this difficult task. In this study, a set of sub-syllable HMM models containing 100 3-state right-final-dependent initial models and 39 5-state context-independent final models is used as the basic recognition units (Wang and Chen, 1998). In each state, a Gaussian mixture distribution with diagonal covariance matrices is used. The number of mixture components in each state is variable and depends on the number of training samples, but a fixed maximum value is set for it. Besides, a single-state, single-mixture, utterance-dependent model is used for noise.
An integrated PMC-based Mandarin base-syllable recognition method is employed in this work to test the reference speech HMM models generated by the proposed REST training algorithm. It is a modified version of the PMC method for additive and convolutional noise (Gales and Young, 1995; Nakamura et al., 1996) that additionally considers broad-class based likelihood compensation (Hong and Chen, 1997). It can be regarded as the combination of the PMC method and a signal bias compensation (SBC) method and is referred to as the PMC–SBC method. A block diagram of the new recognizer is displayed in Fig. 2. Each input testing utterance is first processed in the RNN-based Speech Segmentation (Hong and Chen, 1997) to detect non-speech frames. The RNN-based speech segmentation uses a three-layer simple RNN to classify each input frame into one of three broad classes: initial, final and non-speech.
Non-speech frames are then detected by comparing the RNN non-speech output with a pre-determined threshold and are used in the Noise Estimation to estimate the noise model. The input utterance is then processed in the Noise Subtraction and Signal Bias Estimation by first subtracting the noise model estimate to obtain an enhanced speech and then transforming it to the cepstrum domain to estimate the signal bias by the SBR method (Rahim and Juang, 1996). The SBR method estimates the signal bias by first encoding the feature vectors of the enhanced speech using a codebook and then calculating the average encoding residuals. The codebook is formed by collecting the mean vectors of the mixture components of all reference speech HMM models. The bias estimate is then used in the Bias Compensation to convert all reference speech HMM models into bias-compensated speech HMM models. These models are then further converted, in the PMC Noise Compensation, into noise- and bias-compensated speech HMM models using the above noise model estimate. The PMC noise-compensation method adopts the log-normal approximation (Gales and Young, 1993) for its noise-combination operator. These noise- and bias-compensated speech HMM models are then used in the One-stage DP Search to generate the recognized base-syllable sequence for the input adverse testing utterance. The One-stage DP Search uses a Viterbi search algorithm with cumulative bounded-state-duration constraints (Wang and Chen, 1998) to accomplish its task with the help of the Likelihood Compensation. The likelihood compensation (LC) scheme is the one proposed previously for improving the PMC-based recognition method for noisy Mandarin speech (Hong and Chen, 1997; Hong et al., 1999). The LC scheme uses the broad-class classification information, provided by the RNN outputs, to help reduce the recognition errors caused by misalignments of syllable boundaries. Due to its importance, the LC scheme is briefly discussed as follows.
Although the PMC method is effective in adapting the clean-speech HMM models to match the testing noise environment, the discrimination capability of the noise-compensated HMM models is still degraded by the noise perturbation of the distributions of the recognition features of speech patterns. This noise-perturbation effect makes all speech phones more difficult to distinguish, not only from one another but also from the background noise. The PMC method can do nothing to compensate for this effect. This noise-induced confusion was also confirmed in a study by Junqua et al. (1994) on a simple 10-digit noisy speech recognition task. They found that a large portion of the recognition errors was owing to word boundary misalignments caused by confusion between speech signals and the background noise. To partially remedy this weakness of the PMC method, the LC scheme uses the broad-class classification information provided by the RNN to assist in the recognition. It directly takes the three RNN outputs as weighting factors to add additional scores to the log-likelihood scores of the HMM states associated with the three broad classes, i.e.,

$$q_j^c(z_t) = \begin{cases} q_j(z_t) + \alpha \log W_I(t), & j \in \text{initial}, \\ q_j(z_t) + \alpha \log W_F(t), & j \in \text{final}, \\ q_j(z_t) + \alpha \log W_N(t), & j \in \text{non-speech}, \end{cases} \tag{10}$$

where $W_I(t)$, $W_F(t)$ and $W_N(t)$ are the initial, final and non-speech outputs of the RNN, $q_j(z_t)$ is the log-likelihood score of state j, and $\alpha$ is a scaling factor that controls the degree of the log-likelihood compensation. Note that, if hard decisions are made in the broad-class classification so that $W_I(t)$, $W_F(t)$ and $W_N(t)$ become 0–1 functions, the LC scheme is equivalent to a restricted recognition search scheme in which only sub-syllables belonging to the detected broad class need to be considered.
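The likelihood compensation of Eq. (10) can be written as a one-line score adjustment. In the Python sketch below, the function name `compensate`, the dictionary representation of the RNN outputs and the small floor that guards against log(0) are all assumptions made for illustration; the floor in particular is not part of Eq. (10).

```python
import math

def compensate(log_likelihood, state_class, rnn_outputs, alpha=1.0, floor=1e-10):
    """Eq. (10): add alpha * log W_c(t) to the score of a state of broad class c.

    rnn_outputs maps "initial", "final" and "non-speech" to the RNN outputs
    W_I(t), W_F(t) and W_N(t) for the current frame.
    """
    weight = max(rnn_outputs[state_class], floor)  # floor is an added safeguard
    return log_likelihood + alpha * math.log(weight)

# Toy frame in which the RNN strongly favours the "initial" class.
outputs = {"initial": 0.8, "final": 0.15, "non-speech": 0.05}
score_initial = compensate(-10.0, "initial", outputs, alpha=2.0)
score_nonspeech = compensate(-10.0, "non-speech", outputs, alpha=2.0)
```

States whose broad class agrees with the RNN decision are penalized only slightly, while states of the other classes receive a large negative offset; with alpha set to zero the original scores are recovered.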
4. Evaluation
The performance of the proposed REST algorithm was evaluated on two multi-speaker Mandarin base-syllable recognition tasks. Because previous studies on robust training for eliminating the noise effect were still very few, we examined in detail the effectiveness of the REST training algorithm in eliminating the noise effect in the first task. For this task, both the REST training algorithm and the PMC–SBC recognition method were simplified by discarding the parts related to signal bias compensation. In the second task, the complete function of the REST algorithm in eliminating both the signal bias and noise effects was examined. In the following experiments, the base-syllable accuracy rate defined below was used to evaluate the recognition performance:

$$\text{base-syllable accuracy rate} = \left(1 - \frac{\mathrm{Subs} + \mathrm{Dels} + \mathrm{Ins}}{\text{number of testing base-syllables}}\right) \times 100\%, \tag{11}$$

where Subs, Dels and Ins denote the numbers of substitution, deletion and insertion errors, respectively.
4.1. Performance evaluation I
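As an aid for reading the accuracy figures reported in this section, the base-syllable accuracy rate of Eq. (11) can be computed as in the following Python sketch. The function name and the error counts in the usage line are hypothetical and are not results from the experiments.

```python
def base_syllable_accuracy(subs, dels, ins, n_ref):
    """Eq. (11): accuracy in percent from substitution, deletion and insertion counts."""
    if n_ref <= 0:
        raise ValueError("the number of testing base-syllables must be positive")
    return (1.0 - (subs + dels + ins) / n_ref) * 100.0

# Hypothetical counts for a 1000-syllable test set.
accuracy = base_syllable_accuracy(subs=100, dels=50, ins=50, n_ref=1000)
```

Because insertions are counted as errors, this measure can fall below the plain percentage of correctly recognized syllables and can even become negative when insertions are numerous.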
In the first task, the performance of the REST algorithm in an adverse environment with only additive noise interference was examined. The noisy-speech databases used in this study were generated by artificially adding noise to a clean-speech database composed of 1200 utterances from four speakers, two male and two female. Each utterance comprised several syllables and was spoken in such a way that every syllable was clearly pronounced. The database contained in total 6197 syllables, including 5124 training syllables and 1073 testing syllables. All speech signals were digitally recorded in a laboratory using a PC with a 16-bit Sound Blaster card and a head-set microphone. A sampling rate of 16 kHz was used. Two noisy-speech databases were artificially generated from the clean-speech database by adding noises of two different types: the Lynx helicopter noise from NOISEX-92 (Varga, 1993) and a computer-generated white Gaussian noise. For simplicity, these two noise types are referred to as Lynx and White noises, respectively. For each noise type, the training database contained three noisy-speech data sets with SNRs of 12, 24 and 36 dB.
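Adding noise to clean speech at a prescribed SNR can be sketched as follows. The exact SNR definition used to build the databases is not spelled out above, so this Python sketch assumes the ratio of average signal powers over the whole utterance; the function name `mix_at_snr` is hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("the noise segment must be at least as long as the speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("the noise segment has zero power")
    # Choose the gain g so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, noise)]

# Toy waveforms (hypothetical); a 12 dB mixture as in the lowest training SNR.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5, 0.5]
noisy = mix_at_snr(speech, noise, 12.0)
```

A definition based on speech-only frames or on peak levels would give a different gain for the same nominal SNR, so the choice should be stated when comparing results.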
The open test used another three data sets for each noise type, with SNRs of 9, 18 and 30 dB. All speech signals were first pre-processed in 20 ms Hamming-windowed frames with a 10 ms shift. Then, a set of 25 recognition features, comprising 12 MFCCs, 12 delta MFCCs and a delta log-energy, was computed for each frame. The maximum number of mixture components in each HMM state was set to 5.
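With a 16 kHz sampling rate, the 20 ms frame and 10 ms shift correspond to 320 and 160 samples, respectively. The framing and windowing stage can be sketched in Python as below; the function names are hypothetical, the symmetric Hamming window formula is an assumption, and the MFCC computation itself is omitted.

```python
import math

SAMPLE_RATE = 16000  # Hz, as used for the recordings
FRAME_MS = 20        # frame length in milliseconds
SHIFT_MS = 10        # frame shift in milliseconds

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_frames(signal, sample_rate=SAMPLE_RATE):
    """Cut a waveform into overlapping Hamming-windowed frames."""
    size = sample_rate * FRAME_MS // 1000    # 320 samples at 16 kHz
    shift = sample_rate * SHIFT_MS // 1000   # 160 samples at 16 kHz
    window = hamming(size)
    frames = []
    for start in range(0, len(signal) - size + 1, shift):
        chunk = signal[start:start + size]
        frames.append([w * s for w, s in zip(window, chunk)])
    return frames

one_second = [1.0] * SAMPLE_RATE
framed = split_frames(one_second)
```

Trailing samples that do not fill a complete frame are dropped in this sketch; padding the last frame would be an equally reasonable choice.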
We first examined the compactness of the speech HMM models generated by the REST algorithm using the F-ratio measure (Nicholson et al., 1997). The F-ratio is a measure of class separability in the acoustic feature space and can be roughly defined by

$$F\text{-ratio} = \frac{\text{variance of means}}{\text{mean of variances}}. \tag{12}$$

In this test, the classes were defined to include all states of the speech HMM models. The variance of means is the sample variance of all state means of these HMM models, and the mean of variances is the sample mean of all state variances. A larger F-ratio indicates a larger separation among the states of the speech HMM models, which in turn roughly indicates a higher discrimination capability. In this study, two schemes of the REST training algorithm with two different sets of initial models were tested. The first set of initial models, denoted INIT1, was formed by the clean-speech HMM models, the clean-speech state average power density spectra, and the exact noise models. Since INIT1 was an ideal model, the first scheme was not practical and was taken for reference only. The other set of initial models, denoted INIT2, was a practical one. It was generated by first segmenting all training utterances by the RNN-based speech segmentation method (Hong and Chen, 1997), then estimating the initial utterance-dependent noise models from the non-speech frames of those training utterances, and lastly estimating the initial speech HMM models and the initial state average power density spectra from the enhanced version of the original training set obtained by subtracting the initial noise model. Figs. 3 and 4 show the feature-based F-ratio measures of the resulting HMM models for the two cases using Lynx and White noises, respectively. It can be seen from these two figures that the F-ratio measures for the two schemes of the REST algorithm, with INIT1 and INIT2, are comparable and are both better than those of the HMMB models (to be defined later) trained by the conventional k-means algorithm. This is especially true for the lower-order recognition features. So the speech HMM models generated by the proposed REST algorithm are more compact and hence are expected to possess better discrimination capability. Fig. 5 shows the learning curve of the REST algorithm. It can be seen from Fig. 5 that the average log-likelihood score increases monotonically with the iteration number. This empirically shows the convergence of the REST algorithm.
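The F-ratio of Eq. (12) can be computed per feature dimension from the state means and variances, as in the Python sketch below. The function name is hypothetical, and the use of the unbiased sample variance (divisor n - 1) is an assumption, since Eq. (12) gives only a rough definition.

```python
def f_ratio(state_means, state_variances):
    """Eq. (12) for one feature dimension: variance of means over mean of variances."""
    n = len(state_means)
    if n < 2 or len(state_variances) != n:
        raise ValueError("need matching means and variances for at least two states")
    grand_mean = sum(state_means) / n
    # Unbiased sample variance of the state means (divisor n - 1 is an assumption).
    var_of_means = sum((m - grand_mean) ** 2 for m in state_means) / (n - 1)
    mean_of_vars = sum(state_variances) / n
    return var_of_means / mean_of_vars

# Toy example: identical means, but the second model has tighter states.
loose = f_ratio([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
tight = f_ratio([0.0, 2.0, 4.0], [0.5, 0.5, 0.5])
```

Halving the state variances while keeping the means fixed doubles the F-ratio, which matches the interpretation that more compact state distributions are easier to separate.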
We then examined the recognition performance of the speech HMM models generated by the REST algorithm. The performance of the HMM method when both training and testing data were clean speech was also tested and taken as a benchmark. Its base-syllable recognition rate was 80.5%. In this test, four sets of reference speech HMM models were compared. They included:
M1. HMMC: The HMM models trained from the clean-speech database by the ML-based segmental k-means algorithm.
M2. HMMB: The HMM models trained from the noisy-speech database with three different SNRs by the ML-based segmental k-means algorithm.
M3. HMMR: The HMM models trained from the noisy-speech database with three different SNRs by the proposed REST algorithm.
M4. HMMM: The HMM models trained from a noisy-speech data set with SNR matched with the testing speech by the ML-based segmental k-means algorithm. That is, the HMM models trained from the 9, 18 or 30 dB noisy-speech data set were used to recognize noisy speech with the same SNR.
Fig. 4. The F-ratio measures of the speech HMM models trained from the noisy speech training database corrupted with White noise.
For comparing the performances of these four sets of reference speech HMM models on noisy speech recognition, the following three recognition schemes were used:
Tables 1 and 2 show the experimental results of the open tests for the two cases using Lynx and White noises, respectively. It is noted that, in the implementation of the PMC recognition method using HMMB as reference models, the mean of the estimated noise model, $\hat{\mu}_n^{(r)}(f)$, was intuitively modified by
$$\hat{\mu}_n^{(r)}(f) = \begin{cases} \hat{\mu}_n^{(r)}(f) - \hat{\mu}_{n0}(f), & \text{if } \hat{\mu}_n^{(r)}(f) > \hat{\mu}_{n0}(f), \\ 0, & \text{otherwise}, \end{cases} \tag{13}$$
to account for the noise effect embedded in the HMMB models. Here $\hat{\mu}_{n0}(f)$ is the noise mean of the training database estimated in the training process of generating the HMMB models. From Tables 1 and 2, the following observations can be found:
S1-1. The 'NC' scheme: The conventional HMM recognition method without noise compensation.
S1-2. The 'PMC' scheme: The conventional PMC method (Gales and Young, 1993) with the noise model being estimated based on RNN-based speech segmentation. Its noise-compensation operation used the log-normal approximation.
S1-3. The 'PMC/LC' scheme: An extended version of the 'PMC' scheme that additionally invokes the likelihood compensation scheme. It is a degenerate version of the PMC–SBC method discussed in Section 3 with the parts related to signal-bias compensation being discarded.
O1. For HMMB, the NC scheme gave fair performance for both noise types at SNR = 18 dB and SNR = 30 dB, but performed very badly for both noise types at SNR = 9 dB.
O2. For HMMM, the NC scheme performed very well for both noise types at all three SNRs.
O3. For HMMB, the PMC scheme performed only slightly better than the NC scheme for both noise types at SNR = 18 dB and SNR = 30 dB, and much better at SNR = 9 dB.
O4. The NC scheme with HMMM performed better than the PMC scheme with HMMC for both noise types at all three SNRs.
Table 1
The recognition results of the open tests for noisy speech corrupted with Lynx noise (unit: %)
SNR (dB)  HMMB–NC  HMMB–PMC  HMMC–PMC  HMMC–PMC/LC  HMMR–PMC  HMMR–PMC/LC  HMMM–NC
9         -12.1    34.9      39.1      42.3         43.6      48.7         45.0
18        51.4     52.0      58.6      62.5         62.8      67.7         66.3
30        62.3     65.1      71.2      75.1         73.6      78.3         75.6
Table 2
The recognition results of the open tests for noisy speech corrupted with White noise (unit: %)
SNR (dB)  HMMB–NC  HMMB–PMC  HMMC–PMC  HMMC–PMC/LC  HMMR–PMC  HMMR–PMC/LC  HMMM–NC
9         -35.9    29.8      26.9      33.0         35.0      38.1         33.6
18        42.8     45.2      48.3      52.0         54.2      58.0         57.0
30        58.4     59.9      65.4      71.8         68.2      73.8         68.6
Based on these observations, the following conclusions can be drawn:
An extra test on noisy English digit recognition using the NOISEX-92 database (Varga and Steeneken, 1993) was performed to examine the validity of the proposed REST algorithm. The database contains utterances of isolated digits and digit triples uttered by one male and one female speaker. Here only the part of isolated-digit utterances was used. The database contains in total 400 digits including 200 training tokens and 200 testing tokens. Each testing utterance comprises 100 digits and was uttered in such a way that every digit was clearly pronounced. All speech signals were first pre-processed in 25 ms Hamming-windowed frames with a 10 ms shift. Then, 12 MFCCs were computed for each frame and taken as the recognition features. For each digit, an 8-state HMM model with observations in each state being modeled by a mixture Gaussian distribution was trained. The number of mixture components in each state was set to 2. Besides, a single-state, single-mixture model was used for noise.
In the test, we considered the performance of the REST algorithm in an adverse environment with additive noise interference only. Noisy-speech databases were artificially generated from the clean-speech database by adding computer-generated white Gaussian noise. The noisy training database contained four data sets of 0, 6, 12 and 24 dB in SNR. The open test used another five data sets of -3, 0, 3, 9 and 18 dB in SNR. The same accuracy rate defined in Eq. (11) was used to evaluate the recognition performance. We note that the benchmark of the recognition performance achieved by the conventional ML-trained HMM method for the clean-speech case is 100%. Three recognition schemes used in the first test were compared. They included:
C1-1. From O1–O2, the conventional HMM method without noise compensation can be used in noisy speech recognition only when the noise level of the training data set is the same as that of the testing speech. If the training database contains noisy speech with diverse noise levels, its performance will degrade seriously.
C1-2. From O1–O3, the HMM models generated by the conventional k-means training algorithm are good for the NC scheme in the noise-level matched condition, fair in the noise-level interpolation condition, and bad in the noise-level extrapolation condition.
C1-3. From O1 and O3, the performance improvements for the HMM method using HMMB reference models by the PMC noise compensation are very limited.
C1-4. From O4, the log-normal approximation of the noise-compensation operation used in the PMC scheme is not perfect.
C1-5. From O5 and C1-4, the REST algorithm is a very efficient training algorithm to generate noise-suppressed HMM models directly from a noisy speech database with diverse noise levels. The resulting HMM models perform very well in the PMC scheme for testing noisy speech with untrained noise levels. They are even better than the clean-speech HMM models for the PMC method when the noise-compensation operation is not perfect. So it is a very promising robust training algorithm.
C1-6. From O6–O7, the likelihood compensation scheme is very helpful for the PMC-based noisy speech recognition. Actually, the PMC/LC scheme using HMMR reference models performed best in all cases of the test.
O5. For both PMC and PMC/LC schemes, HMMR performed better than HMMC for both noise types
with all the three SNRs.
O6. For both HMMR and HMMC, the PMC/LC scheme performed much better than the PMC scheme.
1. HMMB–NC: The conventional HMM method without noise compensation using HMM models trained from noisy speech.
2. HMMC–PMC: The PMC method using clean-speech HMM models.
3. HMMR–PMC: The PMC method using HMM models trained by the REST algorithm.
Table 3 shows the experimental results. It can be seen from the table that HMMB–NC performed the worst, HMMC–PMC the next, and HMMR–PMC the best. This result is consistent with what we obtained in the first test of the study on adverse Mandarin speech recognition.
4.2. Performance evaluation II
In the second task, the performance of the REST algorithm in an adverse environment with both channel bias and noise interference was examined. A simulated telephone-speech database generated by corrupting a clean-speech database with both convolutional channel bias and additive white noise was used in this study. The clean-speech database was generated by 10 speakers including 8 males and 2 females. It was a superset of the clean-speech database used in the first task with the same recording condition. It contained, in total, 3050 utterances including 2572 training utterances (12 800 syllables) and 478 testing utterances (2666 syllables). To generate the adverse-speech database, each clean-speech utterance was first corrupted by computer-generated white Gaussian noise and then passed through a filter which simulated a telephone channel. This was realized simply by first adding the white noise in the time domain and then adding the channel bias in the frequency domain. It is noted that the assumed environment contamination model shown in Fig. 1 is still suitable for modeling the simulated database. In the training database generation, noises with levels of 12, 24 and 36 dB in SNR were separately added to three subsets of the clean-speech training database. These three subsets contained utterances of three, three and four speakers, respectively. In the testing database generation, noises with levels of 9, 18 and 30 dB in SNR were added to the whole clean-speech testing database. To simulate the channel variations on the telephone speech through the public switching network, a set of 227 simulated filters was generated from a large telephone-speech database provided by Chunghwa Telecommunication Laboratories. Each filter was obtained by performing a frame-based cepstrum average over the long utterance of a telephone call through the public switching network. Fig. 6 shows their frequency responses.
Among these 227 channel filters, 195 were used to generate the training database while all others were used in the testing database generation. It is noted that the stationarity of the environment characteristics for each utterance is guaranteed in this simulated adverse-speech database via the use of an utterance-dependent channel filter and noise level.
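The way such a simulated database can be produced is sketched below, purely as an illustration. The helper names `add_white_noise` and `add_channel_bias`, the sinusoidal test signal, and the choice to apply the channel as an additive vector on cepstral frames are assumptions made for this example; the actual filters and feature pipeline of the study are not reproduced here.

```python
import math
import random


def add_white_noise(clean, snr_db, rng):
    """Return clean + white Gaussian noise scaled to the requested SNR (dB)."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in clean]


def add_channel_bias(cepstra, bias):
    """Add one fixed bias vector to every cepstral frame.

    Assumption: a convolutional channel acts as an additive
    constant in the cepstral domain.
    """
    return [[c + b for c, b in zip(frame, bias)] for frame in cepstra]


rng = random.Random(0)
clean = [math.sin(0.05 * n) for n in range(20000)]  # stand-in for speech
noisy = add_white_noise(clean, snr_db=18.0, rng=rng)

# Check the SNR that was actually realized.
noise = [y - x for x, y in zip(clean, noisy)]
measured_snr_db = 10.0 * math.log10(
    sum(x * x for x in clean) / sum(e * e for e in noise)
)

biased = add_channel_bias([[1.0, 2.0], [0.0, -1.0]], [0.5, -0.5])
```

Using one noise level and one bias vector per utterance mirrors the utterance-dependent, stationary environment assumed in the text.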
Table 3
The recognition results of the NOISEX-92 database corrupted by White noise (unit: %)
SNR (dB)  HMMB–NC  HMMC–PMC  HMMR–PMC
-3        39.5     72.0      82.5
0         63.5     86.0      94.5
3         82.0     94.0      99.0
9         93.0     98.5      99.5
18        94.0     99.5      99.5
The same format of speech HMM models as in the first task was used here. The only difference was that the maximal number of mixtures used in each HMM state was increased to 20. In the REST algorithm, the initial condition was generated from the same adverse training database by a four-step procedure. First, segment all training utterances by the RNN-based speech segmentation method (Hong and Chen, 1997). Second, estimate the initial utterance-dependent noise model from the non-speech frames of each training utterance. Third, estimate the initial speech HMM models and the initial state average power density spectra from the enhanced version of the original training set obtained by subtracting the initial noise models. Last, estimate the initial channel bias vectors from the same enhanced training set by the SBR method (Rahim and Juang, 1996).
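The second and third steps of this initialization can be illustrated by the following minimal sketch. It assumes that speech/non-speech labels are already available (in the study they come from the RNN-based segmentation), that frames are given as power spectra, and that negative values after subtraction are clipped to a small floor; the function names and the floor value are hypothetical.

```python
def estimate_noise_spectrum(power_frames, is_speech):
    """Average the power spectra of the frames labelled as non-speech."""
    noise_frames = [f for f, s in zip(power_frames, is_speech) if not s]
    n_bins = len(power_frames[0])
    return [
        sum(f[k] for f in noise_frames) / len(noise_frames)
        for k in range(n_bins)
    ]


def subtract_noise(power_frames, noise_spectrum, floor=1e-3):
    """Subtract the noise estimate from every frame.

    Assumption: results below `floor` are clipped to `floor`
    so that the enhanced power spectrum stays positive.
    """
    return [
        [max(p - n, floor) for p, n in zip(frame, noise_spectrum)]
        for frame in power_frames
    ]


# Toy utterance: four frames, two frequency bins (made-up values).
frames = [[1.0, 2.0], [1.2, 1.8], [5.0, 9.0], [0.9, 2.1]]
labels = [False, False, True, False]  # True marks a speech frame
noise_hat = estimate_noise_spectrum(frames, labels)
enhanced = subtract_noise(frames, noise_hat)
```

The enhanced frames would then be used to train the initial speech HMM models and to estimate the initial channel bias.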
In this test, the following recognition schemes were compared:
Table 4 shows the base-syllable recognition results of these six schemes for adverse speech corrupted with channel bias and White noise. It can be found from Table 4 that, according to the recognition rate, these six schemes can be ordered as: REST/LC > REST > REST-noise or REST-bias > BASELINE > CLEAN. Based on the experimental results, the following conclusions can be made:
S2-1. The 'BASELINE' scheme: The conventional HMM method using the reference speech models trained directly from the adverse training database by the segmental k-means algorithm.
S2-2. The 'CLEAN' scheme: The PMC–SBC recognition method using the clean-speech reference HMM models, but without invoking the LC scheme.
S2-3. The 'REST-bias' scheme: The SBC recognition method using the reference HMM models trained by the REST algorithm without considering noise suppression.
S2-4. The 'REST-noise' scheme: The PMC recognition method using the reference HMM models trained by the REST algorithm without considering signal bias removal.
S2-5. The 'REST' scheme: The PMC–SBC recognition method using the REST-trained reference HMM models, but without invoking the LC scheme.
S2-6. The 'REST/LC' scheme: The PMC–SBC recognition method using the REST-trained reference HMM models.
Table 4
The recognition results of the open tests for adverse speech corrupted with channel bias and White noise (unit: %)
SNR (dB) BASELINE CLEAN REST-bias REST-noise REST REST/LC
9 23.4 14.8 24.5 29.3 33.0 35.2
18 46.7 27.3 50.2 48.4 53.7 56.5
30 60.2 45.6 62.7 61.8 65.5 66.7
A final test was carried out to check whether the REST training scheme is also applicable to a clean-speech environment. It is worth noting that some robust training algorithms, designed to improve the performance of speech recognizers in adverse-speech environments, do not perform well in a clean-speech environment. In the test, two sets of HMM models were generated, respectively, by the conventional ML training method and by the REST training scheme using the same clean-speech database. The base-syllable recognition rate was 76.05% for the ML method and 76.24% for the REST scheme. This result confirmed that the REST algorithm did not degrade the system performance when the training data were clean speech.
5. Conclusions
A robust training algorithm for generating a set of speech HMM models directly from a training database collected in an adverse environment for adverse speech recognition has been discussed in this paper. Its main advantage lies in the incorporation of the signal bias-compensation and PMC noise-compensation operations of a given robust adverse speech recognition method into its iterative training process, so as to make the resulting speech HMM models more suitable for use in the given robust adverse speech recognition method. Its effectiveness in generating robust speech HMM models has been confirmed by simulations. Experimental results showed that the HMM models it generated were even better than the clean-speech HMM models for use in the given robust adverse speech recognition method when the PMC noise-compensation and/or channel bias-compensation operations are imperfect. So it is a promising robust training algorithm.
Acknowledgements
This work was supported by the National Science Council of Taiwan under Contract no. NSC87-2213-E-009-056. The telephone-speech database was provided by the Chunghwa Telecommunication Laboratories.
C2-1. The conventional HMM method using the reference models trained by the k-means algorithm gave fair performance in adverse speech recognition.
C2-2. The result that the CLEAN scheme performed much worse than the BASELINE scheme is mainly owing to the imperfection of the channel bias compensation performed in the SBC method. Actually, the CLEAN scheme totally failed to compensate for the mismatch between the testing speech and the clean-speech HMM models. This primarily resulted from the large deviation of the estimated signal bias from the real channel bias.
C2-3. Although the channel bias-compensation operation of the SBC method is imperfect, the REST training algorithm can still take advantage of it by embedding it into the iterative training process to make the resulting HMM models more suitable to be used with the channel bias compensation of the testing process. This has been confirmed by the fact that both the REST-bias and REST schemes performed better than the BASELINE scheme.
C2-4. The HMM models generated by the REST algorithm which considers both noise suppression and signal bias removal are better than those obtained by the REST algorithm considering only noise suppression or only signal bias removal.
C2-5. The likelihood compensation scheme is still effective in assisting adverse speech recognition.
Appendix A. The EM procedure for estimating $\{\Lambda_x, \Lambda_e\}$
Eq. (2) can be solved using an iterative EM procedure (Dempster et al., 1977) which tries to find the locally optimal estimate $\hat{\Theta} = \{\hat{\Lambda}_x, \hat{\Lambda}_e\}$ with the following two intermediate parameter sequences involved: the hidden state sequences $\{U^{(r)} = (u_1^{(r)}, \ldots, u_{T_r}^{(r)})\}_{r=1,\ldots,R}$ and the mixture component sequences $\{V^{(r)} = (v_1^{(r)}, \ldots, v_{T_r}^{(r)})\}_{r=1,\ldots,R}$. The first (expectation) step of the EM procedure is to compute the auxiliary Q-function defined as
$$Q(\Theta, \hat{\Theta}_{k-1}) = E\left\{ \log L\left(\{Z^{(r)}, U^{(r)}, V^{(r)}\}_{r=1,\ldots,R} \mid \Theta\right) \,\middle|\, \{Z^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1} \right\}. \tag{A.1}$$
Here the subscript $k-1$ denotes the iteration index. In the second (maximization) step, new values of $\hat{\Theta}_k$ are computed based on the maximization of $Q(\Theta, \hat{\Theta}_{k-1})$:
$$\hat{\Theta}_k = \arg\max_{\Theta} Q(\Theta, \hat{\Theta}_{k-1}). \tag{A.2}$$
The detailed derivation of the EM procedure is described as follows. Let
$$\Lambda_z^{(r)} = \begin{cases} G(\Lambda_x, \Lambda_n^{(r)}, b^{(r)}) & \text{for adverse-speech model}, \\ \Lambda_n^{(r)} & \text{for non-speech model} \end{cases} \tag{A.3}$$
be the environment-compensated HMM models, constructed from $\Lambda_x$ and $(\Lambda_n^{(r)}, b^{(r)})$, for the rth observation utterance $Z^{(r)}$. Here $G$ denotes a mapping function that transforms $\Lambda_x$ to match the current environment of $Z^{(r)}$. By assuming that, in $\Lambda_z^{(r)}$, observations are mixture-Gaussian-distributed, we can calculate the mean vector $\mu_{z,j,q}^{(r)}$ and covariance matrix $\Sigma_{z,j,q}^{(r)}$ of the qth mixture component in the jth state of $\Lambda_z^{(r)}$, based on the assumed environment contamination model defined in Eqs. (1b) and (1c), by
$$\mu_{z,j,q}^{(r)} = \begin{cases} (\mu_{x,j,q} + b^{(r)}) \otimes \Lambda_n^{(r)}, & j \in \text{adverse-speech model}, \\ \mu_n^{(r)}, & j \in \text{non-speech model}, \end{cases} \tag{A.4a}$$
$$\Sigma_{z,j,q}^{(r)} = \begin{cases} \Sigma_{x,j,q} \otimes \Lambda_n^{(r)}, & j \in \text{adverse-speech model}, \\ \Sigma_n^{(r)}, & j \in \text{non-speech model}, \end{cases} \tag{A.4b}$$
where $\otimes$ denotes the PMC noise-compensation operator (Gales and Young, 1993), and $\mu_{x,j,q}$ and $\Sigma_{x,j,q}$ are, respectively, the mean vector and covariance matrix of the qth mixture component in the jth state of $\Lambda_x$. By further assuming that the state-based Wiener filtering is the inverse operation of the PMC (Gales and Young, 1993; Vaseghi and Milner, 1997), we can express the compensated cepstral mean $\mu_{z,j,q}^{(r)}$ in Eq. (A.4a) by (Gales and Young, 1993; Vaseghi and Milner, 1997)
$$\mu_{z,j,q}^{(r)} = \begin{cases} \mu_{x,j,q} + b^{(r)} - h_j, & j \in \text{adverse-speech model}, \\ \mu_n^{(r)}, & j \in \text{non-speech model}, \end{cases} \tag{A.5}$$
where $h_j$ denotes the cepstral coefficients of the state-based Wiener filter of the jth state, which is constructed from an estimate of the signal power density spectrum at the jth state and an estimate of the noise power density spectrum of the rth utterance.
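To make the operator $\otimes$ more concrete, the following sketch implements the log-normal approximation of PMC for a single Gaussian with diagonal covariance. It is a simplified illustration under stated assumptions: the parameters are taken to be already in the log-spectral domain (the cepstral-to-log-spectral transform is omitted), the gain-matching term is set to one, and the function name `pmc_lognormal_diag` is hypothetical.

```python
import math


def pmc_lognormal_diag(mu_s, var_s, mu_n, var_n):
    """Combine speech and noise Gaussians under the log-normal approximation.

    All inputs are per-channel means/variances in the log-spectral
    domain (diagonal covariance assumed). Returns (mu_z, var_z),
    the parameters of the noisy-speech Gaussian in the same domain.
    """
    mu_z, var_z = [], []
    for ms, vs, mn, vn in zip(mu_s, var_s, mu_n, var_n):
        # Moments of the corresponding log-normal variables
        # in the linear spectral domain.
        lin_mean_s = math.exp(ms + 0.5 * vs)
        lin_var_s = lin_mean_s ** 2 * (math.exp(vs) - 1.0)
        lin_mean_n = math.exp(mn + 0.5 * vn)
        lin_var_n = lin_mean_n ** 2 * (math.exp(vn) - 1.0)
        # Speech and noise are assumed independent and additive.
        lin_mean = lin_mean_s + lin_mean_n
        lin_var = lin_var_s + lin_var_n
        # Map the combined moments back to the log-spectral domain.
        v = math.log(lin_var / lin_mean ** 2 + 1.0)
        var_z.append(v)
        mu_z.append(math.log(lin_mean) - 0.5 * v)
    return mu_z, var_z


# With negligible noise the speech parameters are returned unchanged.
mu_a, var_a = pmc_lognormal_diag([1.0], [0.2], [-50.0], [0.1])
# With noise as strong as the speech the mean rises and the variance shrinks.
mu_b, var_b = pmc_lognormal_diag([0.0], [0.1], [0.0], [0.1])
```

In the setting of Eq. (A.4a), the speech mean passed to such a routine would be the bias-shifted mean $\mu_{x,j,q} + b^{(r)}$.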
Based on the above expression of $\Lambda_z^{(r)}$, the auxiliary Q-function can be rewritten as (Sankar and Lee, 1996)
$$\begin{aligned} Q(\Theta, \hat{\Theta}_{k-1}) &= Q\big(\Lambda_x, \{\Lambda_n^{(r)}, b^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) = Q\big(\{\Lambda_z^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) \\ &= Q_{k-1} + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,k-1}^{(r)}(j,q)\, \log \Pr\big(z_t^{(r)}, u_t = j, v_t = q \mid \Theta\big) \\ &= Q_{k-1} + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,k-1}^{(r)}(j,q)\, \log \mathcal{N}\big(z_t^{(r)}; \mu_{z,j,q}^{(r)}, \Sigma_{z,j,q}^{(r)}\big), \end{aligned} \tag{A.6}$$
where
$$\gamma_{t,k-1}^{(r)}(j,q) = \Pr\big(z_t^{(r)}, u_t^{(r)} = j, v_t^{(r)} = q \mid \hat{\Theta}_{k-1}\big) \tag{A.7}$$
is the probability of the observation $z_t^{(r)}$ produced from the qth mixture component of the jth state; $N_j$ and $N_q$ denote, respectively, the total numbers of states and mixture components; $\mathcal{N}$ represents the normal distribution; and $Q_{k-1}$ is a function depending only on the transition probability and mixture probability of $\hat{\Lambda}_{z,k-1}^{(r)}$ (which are assumed to be the same as those of $\hat{\Lambda}_{x,k-1}$). But, due to the fact that it is generally difficult to derive a closed-form solution for the above joint maximization problem, a multi-stage sequential maximization procedure is employed to approximate the local optimum of $\hat{\Theta}_k$. In each stage, only one type of parameters is optimally estimated.
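We first estimate the parameters of the noise model $\hat{\Lambda}_{n,k}^{(r)}$ to maximize the Q-function in Eq. (A.6). They can be obtained by
$$\hat{\mu}_{n,k}^{(r)} = \frac{\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,n,k-1}^{(r)}(j,q)\, z_t^{(r)}}{\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,n,k-1}^{(r)}(j,q)}, \tag{A.8a}$$
$$\hat{\Sigma}_{n,k}^{(r)} = \frac{\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,n,k-1}^{(r)}(j,q)\, \big(z_t^{(r)} - \hat{\mu}_{n,k}^{(r)}\big)\big(z_t^{(r)} - \hat{\mu}_{n,k}^{(r)}\big)^{T}}{\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,n,k-1}^{(r)}(j,q)}, \tag{A.8b}$$
where $\gamma_{t,n,k-1}^{(r)}(j,q) = \gamma_{t,k-1}^{(r)}(j,q)\, I(j \in \text{non-speech})$ and $I(\cdot)$ is the zero–one indicator function.
We then estimate the signal bias $\hat{b}_k^{(r)}$. After replacing $\Lambda_n^{(r)}$ and $\Lambda_x$ with $\hat{\Lambda}_{n,k}^{(r)}$ and $\hat{\Lambda}_{x,k-1}$, the Q-function becomes
$$\begin{aligned} Q\big(\hat{\Lambda}_{x,k-1}, \{\hat{\Lambda}_{n,k}^{(r)}, b^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) &= Q'_{k-1} + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \log \mathcal{N}\big(z_t^{(r)}; \hat{\mu}_{x,j,q,k-1} + b^{(r)} - h_{j,k}, \Sigma_{z,j,q,k}^{(r)}\big) \\ &= Q'_{k-1} + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \log \mathcal{N}\big(z_t^{(r)} + h_{j,k} - b^{(r)}; \hat{\mu}_{x,j,q,k-1}, \Sigma_{z,j,q,k}^{(r)}\big) \\ &= Q'_{k-1} + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \log \mathcal{N}\big(y_{t,j,k}^{(r)} - b^{(r)}; \hat{\mu}_{x,j,q,k-1}, \Sigma_{z,j,q,k}^{(r)}\big), \end{aligned} \tag{A.9}$$
where $\gamma_{t,s,k-1}^{(r)}(j,q) = \gamma_{t,k-1}^{(r)}(j,q)\, I(j \in \text{speech})$; $h_{j,k}$ and $\Sigma_{z,j,q,k}^{(r)}$ are updated versions of $h_j$ and $\Sigma_{z,j,q}^{(r)}$ with $\Lambda_n^{(r)}$ and $\Lambda_x$ being replaced with $\hat{\Lambda}_{n,k}^{(r)}$ and $\hat{\Lambda}_{x,k-1}$; and $y_{t,j,k}^{(r)} = z_t^{(r)} + h_{j,k}$ is the Wiener-filtered version of $z_t^{(r)}$ at the jth state. By setting
$$\frac{\partial Q\big(\hat{\Lambda}_{x,k-1}, \{\hat{\Lambda}_{n,k}^{(r)}, b^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big)}{\partial b^{(r)}} = 0, \tag{A.10}$$
the pth element of $\hat{b}_k^{(r)}$ can be obtained by (Sankar and Lee, 1996)
$$\hat{b}_k^{(r)}(p) = \frac{\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \big[\Sigma_{z,j,q,k}^{(r)}(p,p)\big]^{-1} \big[y_{t,j,k}^{(r)}(p) - \hat{\mu}_{x,j,q,k-1}(p)\big]}{\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \big[\Sigma_{z,j,q,k}^{(r)}(p,p)\big]^{-1}}, \tag{A.11}$$
where $y_{t,j,k}^{(r)}(p)$ and $\hat{\mu}_{x,j,q,k-1}(p)$ denote, respectively, the pth elements of $y_{t,j,k}^{(r)}$ and $\hat{\mu}_{x,j,q,k-1}$, and $\Sigma_{z,j,q,k}^{(r)}(p,p)$ is the $(p,p)$th element of $\Sigma_{z,j,q,k}^{(r)}$.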
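The bias estimate of Eq. (A.11) is a precision-weighted average of the residuals between the Wiener-filtered observations and the current clean-speech means. A minimal sketch for one utterance is given below; it assumes diagonal covariances, flattens each (state, mixture) pair into a single index `c`, and uses the hypothetical name `estimate_bias` and made-up toy values.

```python
def estimate_bias(y, mu, var, gamma):
    """Precision-weighted bias estimate for one utterance.

    y[t][c][p]  : Wiener-filtered observation of frame t under component c
    mu[c][p]    : current clean-speech mean of component c
    var[c][p]   : diagonal of the compensated covariance of component c
    gamma[t][c] : occupancy of component c at frame t
                  (zero for non-speech frames)
    The index c stands for a flattened (state, mixture) pair.
    """
    n_dims = len(mu[0])
    num = [0.0] * n_dims
    den = [0.0] * n_dims
    for t, gamma_t in enumerate(gamma):
        for c, g in enumerate(gamma_t):
            for p in range(n_dims):
                w = g / var[c][p]  # occupancy times inverse variance
                num[p] += w * (y[t][c][p] - mu[c][p])
                den[p] += w
    return [n / d for n, d in zip(num, den)]


# Toy check: observations are the means shifted by a known bias.
mu = [[0.0, 0.0], [1.0, 2.0]]
var = [[1.0, 2.0], [0.5, 4.0]]
true_bias = [0.5, -0.25]
gamma = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]
y = [
    [[m + b for m, b in zip(mu[c], true_bias)] for c in range(2)]
    for _ in gamma
]
b_hat = estimate_bias(y, mu, var, gamma)
```

When every residual equals the same vector, as in this toy case, the weighted average recovers that vector regardless of the occupancies and variances.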
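We then estimate $\hat{\Lambda}_{x,k}$. After replacing $\Lambda_n^{(r)}$ and $b^{(r)}$ with $\hat{\Lambda}_{n,k}^{(r)}$ and $\hat{b}_k^{(r)}$, the Q-function becomes
$$Q\big(\Lambda_x, \{\hat{\Lambda}_{n,k}^{(r)}, \hat{b}_k^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) = Q'_{k-1} + \sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \log \mathcal{N}\big(x_{t,j,k}^{(r)}; \mu_{x,j,q}, \Sigma_{x,j,q}\big), \tag{A.12}$$
where
$$x_{t,j,k}^{(r)} = y_{t,j,k}^{(r)} - \hat{b}_k^{(r)} = z_t^{(r)} + h_{j,k} - \hat{b}_k^{(r)} \tag{A.13}$$
is the signal bias-removed and Wiener-filtered signal of the jth state. The Q-function is now in the same form as that in the conventional EM algorithm for estimating HMM parameters. So, the mean and covariance of $\hat{\Lambda}_{x,k}$ can be estimated in the same way by
$$\hat{\mu}_{x,j,q,k} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, x_{t,j,k}^{(r)}}{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)}, \tag{A.14a}$$
$$\hat{\Sigma}_{x,j,q,k} = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)\, \big(x_{t,j,k}^{(r)} - \hat{\mu}_{x,j,q,k}\big)\big(x_{t,j,k}^{(r)} - \hat{\mu}_{x,j,q,k}\big)^{T}}{\sum_{r=1}^{R}\sum_{t=1}^{T_r}\sum_{j=1}^{N_j}\sum_{q=1}^{N_q} \gamma_{t,s,k-1}^{(r)}(j,q)}. \tag{A.14b}$$
The HMM state transition probabilities and the mixture component coefficients can also be estimated by the standard EM method.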
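The re-estimation of Eqs. (A.13) and (A.14) amounts to removing the bias from the Wiener-filtered frames and then taking occupancy-weighted moments. The sketch below illustrates this for a single mixture component with diagonal covariance; the helper names `compensate_frames` and `update_gaussian` and the toy values are assumptions made for this example, and the frames of all utterances are taken to be already pooled into one list.

```python
def compensate_frames(z, h, b):
    """Bias removal after Wiener filtering: x_t = z_t + h - b, per element."""
    return [
        [zp + hp - bp for zp, hp, bp in zip(frame, h, b)]
        for frame in z
    ]


def update_gaussian(x, gamma):
    """Occupancy-weighted mean and diagonal variance of one component.

    x[t][p]  : compensated frame t
    gamma[t] : occupancy of this component at frame t
    """
    total = sum(gamma)
    n_dims = len(x[0])
    mean = [
        sum(g * xt[p] for g, xt in zip(gamma, x)) / total
        for p in range(n_dims)
    ]
    var = [
        sum(g * (xt[p] - mean[p]) ** 2 for g, xt in zip(gamma, x)) / total
        for p in range(n_dims)
    ]
    return mean, var


# Toy data: two frames, two cepstral dimensions (made-up values).
z = [[1.5, 2.5], [3.5, 4.5]]
h = [-0.25, -0.25]  # cepstrum of the state-based Wiener filter
b = [0.25, 0.25]    # estimated channel bias
x = compensate_frames(z, h, b)
new_mean, new_var = update_gaussian(x, gamma=[3.0, 1.0])
```

The noise-model update of Eq. (A.8) has the same weighted-moment form, with the non-speech occupancies and the uncompensated frames in place of `gamma` and `x`.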
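It can be verified that the Q-function will increase at each stage of the sequential maximization procedure, i.e.,
$$\begin{aligned} Q(\hat{\Theta}_{k-1}, \hat{\Theta}_{k-1}) &= Q\big(\hat{\Lambda}_{x,k-1}, \{\hat{\Lambda}_{n,k-1}^{(r)}, \hat{b}_{k-1}^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) \\ &\le Q\big(\hat{\Lambda}_{x,k-1}, \{\hat{\Lambda}_{n,k}^{(r)}, \hat{b}_{k-1}^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) \\ &\le Q\big(\hat{\Lambda}_{x,k-1}, \{\hat{\Lambda}_{n,k}^{(r)}, \hat{b}_{k}^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) \\ &\le Q\big(\hat{\Lambda}_{x,k}, \{\hat{\Lambda}_{n,k}^{(r)}, \hat{b}_{k}^{(r)}\}_{r=1,\ldots,R}, \hat{\Theta}_{k-1}\big) = Q(\hat{\Theta}_{k}, \hat{\Theta}_{k-1}). \end{aligned} \tag{A.15}$$
This in turn leads to an increase in the likelihood of the training data in each iteration (Dempster et al., 1977), i.e.,
$$L\big(\{Z^{(r)}\}_{r=1,\ldots,R} \mid \hat{\Theta}_{k}\big) \ge L\big(\{Z^{(r)}\}_{r=1,\ldots,R} \mid \hat{\Theta}_{k-1}\big). \tag{A.16}$$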
Hence, the EM procedure is guaranteed to converge.
In practical implementation, the above EM procedure needs to be modified by incorporating the segmental k-means algorithm (Juang and Rabiner, 1990) in order to increase its computational efficiency. It adds an additional pre-segmentation stage into the above iterative re-estimation procedure. In each iteration, all training utterances are first optimally segmented by the Viterbi algorithm (Forney, 1973) to determine the best state sequences $\{\hat{U}_k^{(r)}\}_{r=1,\ldots,R}$ and the best mixture component sequences $\{\hat{V}_k^{(r)}\}_{r=1,\ldots,R}$. Then, parameters of all models are re-estimated based on the given $\{\hat{U}_k^{(r)}, \hat{V}_k^{(r)}\}_{r=1,\ldots,R}$. All formulations of the above EM procedure listed in Eqs. (A.3)–(A.14) still hold except that $\gamma_{t,s,k-1}^{(r)}(j,q)$ and $\gamma_{t,n,k-1}^{(r)}(j,q)$ are now associated only with $\{\hat{U}_k^{(r)}, \hat{V}_k^{(r)}\}$ and hence all $\sum_{j=1}^{N_j}\sum_{q=1}^{N_q}$ in Eqs. (A.6), (A.8), (A.9), (A.11), (A.12) and (A.14) have to be taken away.
A final modification of the above re-estimation procedure is needed to replace the optimal signal bias estimation with the conventional SBR method. By making a simplified assumption of $\hat{\Sigma}_{z,j,q,k-1}^{(r)} = I$, the modified version of Eq. (A.11) can be reduced to Eq. (8a). This completes the derivations of the REST training algorithm.
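The effect of these two simplifications on Eq. (A.11) can be written out explicitly. The short derivation below is only a sketch of the algebra: the symbol $\mathcal{T}_s^{(r)}$ (the set of frames of the rth utterance aligned to speech states) and the shorthand $(\hat{u}_t, \hat{v}_t)$ for the Viterbi-aligned state and mixture of frame $t$ are introduced here for illustration and do not appear in the original notation.

```latex
% Hard alignment: the occupancy is 1 for the Viterbi-selected pair
% (\hat{u}_t, \hat{v}_t) on speech frames and 0 otherwise, so the
% sums over j and q in Eq. (A.11) disappear.
% Identity covariance: every weight [\Sigma(p,p)]^{-1} equals 1.
\hat{b}_k^{(r)}(p)
  = \frac{\sum_{t \in \mathcal{T}_s^{(r)}}
          \left[ y_{t,\hat{u}_t,k}^{(r)}(p)
               - \hat{\mu}_{x,\hat{u}_t,\hat{v}_t,k-1}(p) \right]}
         {\left| \mathcal{T}_s^{(r)} \right|}
```

That is, under these assumptions the bias becomes the plain average, over the speech frames of the utterance, of the difference between each Wiener-filtered frame and the mean of the mixture component it is aligned to.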
References
Acero, A., Stern, R.M., 1990. Environmental robustness in automatic speech recognition. In: Proceedings of ICASSP-90, pp. 849–852.
Acero, A., Stern, R.M., 1991. Robust speech recognition by normalization of the acoustic space. In: Proceedings of ICASSP-91, pp. 893–896.
Anastasakos, T., McDonough, J., Makhoul, J., 1997. Speaker adaptive training: a maximum likelihood approach to speaker normalization. In: Proceedings of ICASSP-97, pp. 1043–1046.
Chang, P.-C., Chen, S.-H., Juang, B.-H., 1993. Discriminative analysis of distortion sequences in speech recognition. IEEE Trans. Speech and Audio Process. 1, 326–333.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. 39, 1–38.
Ephraim, Y., 1992. Statistical-model-based speech enhancement systems. Proc. IEEE 80, 1526–1555.
Forney, G., 1973. The Viterbi algorithm. Proc. IEEE 61, 268–278.
Furui, S., 1992. Toward robust speech recognition under adverse conditions. In: Proceedings of the ESCA Workshop on Speech Processing in Adverse Conditions, pp. 31–24.
Gales, M.J.F., Woodland, P.C., 1996. Mean and variance adaptation within the MLLR framework. Comput. Speech and Language 10, 249–264.
Gales, M.J.F., Young, S.J., 1993. Cepstral parameter compensation for HMM recognition in noise. Speech Communication 12, 231–239.
Gales, M.J.F., Young, S.J., 1995. Robust speech recognition in additive and convolutional noise using parallel model combination. Comput. Speech and Language 9, 289–307.
Gales, M.J.F., Young, S.J., 1996. Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech and Audio Process. 5, 352–359.
Gong, Y., 1995. Speech recognition in noisy environments: A survey. Speech Communication 16, 261–291.
Gong, Y., 1997. Source normalization training for HMM applied to noisy telephone speech recognition. In: Proceedings of EuroSpeech-97, Vol. 3, pp. 1555–1558.
Hansen, J.H.L., Clements, M.A., 1991. Constrained iterative speech enhancement with application to speech recognition. IEEE Trans. Signal Process. 39, 795–805.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech and Audio Process. 2, 578–589.
Hong, W.-T., Chen, S.-H., 1997. A robust RNN-based pre-classification for noisy Mandarin speech recognition. In: Proceedings of EuroSpeech-97, Vol. 3, pp. 1083–1086.
Hong, W.-T., Liao, Y.-F., Wang, Y.-R., Chen, S.-H., 1999. RNN-based speech segmentation and its applications to robust noisy Mandarin speech recognition. J. Acoust. Soc. Amer., revised.
Juang, B.-H., 1991. Speech recognition in adverse environment. Comput. Speech and Language 5, 275–294.
Juang, B.-H., Rabiner, L.R., 1990. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 38, 1639–1641.
Junqua, J.-C., Haton, J.-P., 1996. Robustness in Automatic Speech Recognition: Fundamentals and Applications. Kluwer Academic Press, Boston, MA.
Junqua, J.S., Mak, B., Reaves, B., 1994. A robust algorithm for word boundary detection in the presence of noise. IEEE Trans. Speech and Audio Process. 2, 406–412.
Lee, C.-H., 1998. On stochastic feature and model compensation approaches to robust speech recognition. Speech Communication 25, 29–47.
Lee, C.-H., Juang, B.-H., 1996. A survey on automatic speech recognition with an illustrative example on continuous speech recognition of Mandarin. J. Comput. Linguist. Chinese Language Process. 1, 1–36.
Lim, J.S., Oppenheim, A.V., 1978. All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Sig. Process. 26, 197–210.
Liu, F.-H., Picheny, M., Srinivasa, P., Monkowaski, M., Chen, J., 1996. Speech recognition on Mandarin call home: a large vocabulary, conversational and telephone speech corpus. In: Proceedings of ICASSP-96, Vol. 1, pp. 157–160.
Lockwood, P., Boudy, J., 1992. Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the projection, for robust speech recognition in cars. Speech Communication 11, 215–228.
Mokbel, C.E., Chollet, G.F.A., 1995. Automatic word recognition in cars. IEEE Trans. Speech and Audio Process. 3, 346–356.
Minami, Y., Furui, S., 1996. Adaptation method based on HMM composition and EM algorithm. In: Proceedings of ICASSP-96, pp. 327–330.
Nakamura, S., Takiguchi, T., Shikano, K., 1996. Noise and room acoustics distorted speech recognition by HMM composition. In: Proceedings of ICASSP-96, Vol. 1, pp. 69–72.
Nicholson, S., Milner, B., Cox, S., 1997. Evaluating features set performance using the F-ratio and J-measures. In: Proceedings of EuroSpeech-97, Vol. 1, pp. 413–416.
Rahim, M., Juang, B.-H., 1996. Signal bias removal by maximum likelihood estimation for robust telephone speech recognition. IEEE Trans. Speech and Audio Process. 4, 19–30.
Sankar, A., Lee, C.-H., 1996. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. Speech and Audio Process. 4, 190–202.
Varga, A., Steeneken, H.J.M., 1993. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12, 247–251.
Vaseghi, S.V., Milner, B.P., 1997. Noise compensation methods for hidden Markov model speech recognition in adverse environments. IEEE Trans. Speech and Audio Process. 5, 11–21.
Wang, Y.-R., Chen, S.-H., 1998. Mandarin telephone speech recognition for automatic telephone number directory service. In: Proceedings of ICASSP-98, Vol. 2, pp. 841–844.
Zhao, Y., 1996. Self-learning speaker and channel adaptation based on spectral variation source decomposition. Speech Communication 18, 65–77.