A Perceptually Constrained GSVD-Based Approach for Enhancing Speech Corrupted by Colored Noise

Gwo-Hwa Ju and Lin-Shan Lee, Fellow, IEEE

Abstract—The singular value decomposition (SVD)-based method for single-channel speech enhancement has been shown to be very useful when the additive noise is white. For colored noise, this approach requires whitening the noise spectrum before the SVD-based processing and applying the inverse whitening afterwards. A truncated quotient SVD (QSVD)-based approach has been proposed to handle this problem and found very useful. In this paper, a generalized SVD (GSVD)-based subspace approach for speech enhancement is first extended from the concept of the truncated QSVD-based approach, in which the dimension of the signal subspace can be precisely and automatically determined for each frame of the noisy signal. With this new approach, however, some residual noise is still perceivable under lower signal-to-noise ratio conditions. Therefore, a perceptually constrained GSVD (PCGSVD)-based approach is further proposed to incorporate the masking properties of the human auditory system so that the undesired residual noise is nearly unperceivable. Closed-form solutions are obtained for both the GSVD- and PCGSVD-based enhancement approaches. Carefully performed objective evaluations and subjective listening tests show that the PCGSVD-based approach proposed here can offer improved speech quality, intelligibility, and recognition accuracy, whether the noise is stationary or nonstationary, especially when the additive noise is nonwhite.

Index Terms—Auditory masking thresholds, colored noise, generalized singular value decomposition (GSVD), signal subspace, speech enhancement.

I. INTRODUCTION

VOICE quality and intelligibility are always important for communication systems, whether wired or wireless, in both human-to-human and human-to-machine interactions. In order to obtain near-transparent speech communications, for example via mobile phones, speech enhancement techniques have been employed to improve the quality and intelligibility of the noise-corrupted speech and/or the speech recognition performance. The corrupting noise sources are usually classified as additive or convolutional. The former very often dominates in real-world applications, and the spectral subtraction (SS) approach has been a very popular example solution for it [1]–[3]. To subtract the noise components from the input noisy speech, the SS algorithm has to estimate the statistics of the additive noise in the frequency domain. Under low signal-to-noise ratio (SNR) conditions, a spectral flooring process is usually applied to prevent over-subtraction. However, all such processes very often produce some unnatural residual noise in the enhanced speech, the so-called musical noise, due to the inevitable random tone peaks generated in the time-frequency spectrogram. Previous studies have pointed out that this perceivable residual noise can be effectively alleviated by considering the masking effect in the human auditory system [4]–[8], i.e., the residual noise will not be perceived if it is under the masking thresholds of human hearing.

Manuscript received March 14, 2005; revised January 27, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alex Acero.

G.-H. Ju was with the Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, R.O.C. He is now with the Telecommunication Laboratories, Chunghwa Telecom Corporation, Ltd., Taoyuan 32617, Taiwan, R.O.C. (e-mail: [email protected]).

L.-S. Lee is with the Graduate Institute of Communication Engineering, National Taiwan University, Taipei 10617, Taiwan, R.O.C. (e-mail: lslee@gate.sinica.edu.tw).

Digital Object Identifier 10.1109/TASL.2006.876868

The singular value decomposition (SVD)-based subspace approach has been found useful for noise reduction in recent years [9]–[12]. With this approach, by diagonalizing the Hankel-form matrices constructed from the noisy speech samples with the SVD, we can properly decompose the vector space for the input speech samples into two orthogonal subspaces. It assumes that the clean speech is present only in the signal subspace, whereas the additive noise spans both the signal and noise subspaces. We can thus discard the noise subspace components and reconstruct the speech signal from those of the signal subspace only. This approach was found very effective when the additive noise is white. When the noise is not white, a reasonable approach is to whiten the noise spectrum prior to the SVD-based processing and perform the inverse whitening afterwards. To avoid such extra processes, a truncated quotient SVD (QSVD)-based approach was proposed [12] to perform the noise whitening together with the SVD algorithm in an integrated enhancement framework, and was found very useful in handling colored noise. This truncated QSVD-based approach is extended in this paper so that more precise and flexible determination of the dimensions of the signal and noise subspaces becomes possible for each frame of the noisy signal using well-defined procedures [13]. This extended speech enhancement algorithm is referred to as the generalized SVD (GSVD [14])-based approach here. In fact, similar concepts using Karhunen–Loève transform (KLT)-based subspace techniques have also been proposed recently for enhancing speech corrupted by colored noise [15]–[17].

Although this GSVD-based speech enhancement approach has been shown to provide better performance than the previous SVD-based approach, some musical noise is still perceivable in the enhanced speech under lower SNR conditions [12], [13]. This is why the auditory masking thresholds (AMTs) of the human auditory system are further integrated with the above GSVD-based algorithm to establish an improved framework for speech enhancement in this paper [18], referred to as the perceptually constrained GSVD (PCGSVD)-based approach here. Because this PCGSVD-based approach operates in the generalized singular domain, whereas the conventional AMTs are well defined in the frequency domain, the previously proposed transformation between the frequency domain and the eigen domain [19] is extended to be performed between the frequency domain and the generalized singular domain [18], with which a closed-form solution for the PCGSVD-based speech enhancement approach is obtained. Experimental results based on various objective and subjective tests (e.g., time/frequency domain evaluations, speech recognition accuracies, paired-utterance listening comparison, mean opinion score (MOS) rating, etc.) show that the proposed PCGSVD-based approach can effectively alleviate the musical noise of the previous GSVD-based approach, enhance the quality and intelligibility of the processed speech, and improve the accuracy of the speech recognition system, regardless of whether the additive noise is stationary or not, especially when the noise is nonwhite.

1558-7916/$20.00 © 2006 IEEE

The rest of the paper is organized as follows. The framework for the GSVD-based speech enhancement approach is first summarized in Section II. The procedures for obtaining the AMTs and transforming them to the generalized singular domain, and the proposed PCGSVD-based approach are then presented in Section III. Experiments, objective/subjective performance evaluation results, and some discussions are offered in Section IV, with the conclusions finally given in Section V. Detailed derivations of the closed-form solutions for the GSVD- and PCGSVD-based approaches are presented in the Appendix.

II. SUMMARY OF THE GSVD-BASED SPEECH ENHANCEMENT APPROACH

Two versions of the GSVD-based approach are summarized here (GSVD-MVE and GSVD-LSE), with the differences from the previously proposed truncated QSVD-based approach [12] clearly indicated. Let $y(n)$ be the $n$th sample of the input noisy speech signal $y$, expressed as the sum of the samples $x(n)$ and $w(n)$ of the clean speech $x$ and the noise $w$:

$$y(n) = x(n) + w(n) \tag{1}$$

and the goal here is to estimate $x(n)$ from $y(n)$. Fig. 1 depicts the framework of the GSVD-based speech enhancement approach, which includes five phases as described next [13].

A. Phase (I): Framer, Nonspeech Detector, and Buffer

The input speech is first segmented into overlapped frames with window length $N$, and then the following enhancement process is repeated for each frame. A voice activity detection (VAD) algorithm is used to identify and accumulate the nonspeech frames in the input signal.
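The framing step of Phase (I) can be sketched as follows. This is a minimal illustration, assuming a rectangular window, a 512-sample frame, and a 256-sample hop (the settings reported later in the experiments); the function name `frame_signal` is hypothetical, and the VAD and nonspeech buffer are not included.

```python
import numpy as np

def frame_signal(y, frame_len=512, hop=256):
    """Split a 1-D signal into overlapped rectangular-window frames.

    frame_len and hop are illustrative defaults (32 ms frames with 50%
    overlap at 16 kHz); trailing samples that do not fill a frame are dropped.
    """
    y = np.asarray(y, dtype=float)
    if len(y) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(y) - frame_len) // hop
    # Row m holds samples y[m*hop : m*hop + frame_len].
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return y[idx]
```

Each row of the returned array is one frame, so the per-frame enhancement steps can be applied row by row.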

B. Phase (II): Construction of the Hankel-Form Sample Matrices

To employ the subspace concept for speech enhancement, two series of Hankel-form sample matrices of order $L \times K$, as in Fig. 2, are constructed: $H_y$ for each noisy speech frame and $H_{\hat{w}}$ for the latest buffered nonspeech frame, where $L + K - 1$ equals the frame size $N$ and in general $K$ is much smaller than $L$. From (1), it is clear that the matrix $H_y$ can be represented as the summation of two other Hankel-form sample matrices $H_x$ and $H_w$, $H_y = H_x + H_w$, which are, respectively, constructed from the clean speech frame and the real noise frame. Under noise-free conditions, the column dimension $K$ of the matrix $H_y$ (in this case $H_w$ is a zero matrix with $L \times K$ zeros and thus $H_y = H_x$) is chosen such that $H_x$ is rank deficient, i.e., $\mathrm{rank}(H_x) < K$ [12]. Both $H_x$ and $H_w$ are unknown, yet $H_w$ can be approximated by $\mu H_{\hat{w}}$, where the scaling factor $\mu$ can be either a constant or a time-varying variable. With the GSVD algorithm as described below, we can estimate $H_x$ and thus the clean speech frame from the matrices $H_y$ and $H_{\hat{w}}$. In other words, we can apply the GSVD algorithm and the subspace concept to the rank-deficient least squares (LS) problem to estimate the clean speech signal from the noisy observations.

Fig. 1. Framework of the GSVD-based speech enhancement approach.

Fig. 2. Construction of the Hankel-form sample matrices $H_y$ and $H_{\hat{w}}$.
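The Hankel-form construction of Phase (II) can be sketched as below. This is a minimal illustration under the assumption that entry (i, j) of the matrix holds sample i + j of the frame; the exact sample ordering used in Fig. 2 may differ, and the function name `hankel_matrix` is hypothetical. With a 512-sample frame and 40 columns the matrix has 473 rows, matching the sizes reported in the experiments.

```python
import numpy as np

def hankel_matrix(frame, K=40):
    """Build an L x K Hankel-form sample matrix from one frame.

    L = len(frame) - K + 1, and H[i, j] = frame[i + j] (assumed ordering),
    so every antidiagonal of H holds a single sample value.
    """
    frame = np.asarray(frame, dtype=float)
    L = len(frame) - K + 1
    if L < 1:
        raise ValueError("frame is too short for the requested column count")
    rows = np.arange(L)[:, None]
    cols = np.arange(K)[None, :]
    return frame[rows + cols]
```

The same function would be applied both to the noisy speech frame and to the latest buffered nonspeech frame.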


C. Phase (III): GSVD Algorithm

The GSVD algorithm is useful in several constrained least squares problems [14]. With GSVD, a nonsingular matrix $X \in \mathbb{R}^{K \times K}$ and two real matrices $U \in \mathbb{R}^{L \times K}$ and $V \in \mathbb{R}^{L \times K}$, whose columns are, respectively, orthonormal vectors, can be found to transform both $H_y$ and $H_{\hat{w}}$ into nonnegative, bounded diagonal matrices $C$ and $S$ simultaneously

$$U^T H_y X = C = \mathrm{diag}(c_1, c_2, \ldots, c_K) \tag{2}$$

$$V^T H_{\hat{w}} X = S = \mathrm{diag}(s_1, s_2, \ldots, s_K) \tag{3}$$

subject to

$$C^T C + S^T S = I \quad \text{or} \quad c_i^2 + s_i^2 = 1, \; 1 \le i \le K \tag{4}$$

where the superscript "$T$" means the transpose, the diagonal elements $c_i$ and $s_i$, or the transformed components, of the matrices $C$ and $S$ are arranged in descending and ascending orders, respectively, and $I \in \mathbb{R}^{K \times K}$ is an identity matrix. The constraint in (4) is helpful for achieving efficient numerical solutions here [14], [20], and useful in determining the precise dimension of the signal subspace of the matrices $H_y$ and $H_{\hat{w}}$ as presented below, which was not discussed at all in the previously proposed truncated QSVD-based speech enhancement approach. The values $c_i / s_i$, $1 \le i \le K$, and the columns of the matrix $X$ are, respectively, referred to as the generalized singular values and generalized singular vectors of the matrices $H_y$ and $H_{\hat{w}}$. The conventional SVD algorithm can be considered as a special case of the above GSVD algorithm if the matrix $H_{\hat{w}}$ is replaced by an identity matrix.

D. Phase (IV): GSVD-MVE- and GSVD-LSE-Based Subspace Approaches

The diagonal elements of the matrix $C$ in (2) can be partitioned into two sets: the principal set [in which the transformed (or projected) noisy speech components obtained in (2) are dominant (i.e., those $c_i$ such that $c_i > s_i$, $1 \le i \le P$), associated with the signal subspace of $H_y$ and $H_{\hat{w}}$] and the minor set [in which the transformed noise components obtained in (3) are dominant (i.e., those $c_i$ such that $c_i \le s_i$, $P < i \le K$), associated with the noise subspace of $H_y$ and $H_{\hat{w}}$]. The dimensions of the signal subspace and the noise subspace are, respectively, denoted as $P$ and $K - P$ here, where the variable $P$ is the number of the coefficients $c_i$ or $s_i$ such that $c_i > s_i$. In other words, because the $c_i$ are arranged in descending order and the $s_i$ in ascending order, with the constraints in (4), $c_i^2 + s_i^2 = 1$, we have $c_i > s_i$ for $1 \le i \le P$ and $c_i \le s_i$ for $P < i \le K$, and it can be shown that the value of $P$ is proportional to the instantaneous SNR of each speech frame. In this way the dimension of the signal subspace can be precisely, flexibly, and automatically determined for each Hankel-form matrix and therefore for each frame of the noisy speech signal. This is the major difference between the newly extended GSVD-based approach and the previously proposed truncated QSVD-based approach, in which the dimensions of the signal and noise subspaces are empirically determined considering the SNR conditions based on the concept of "parsimonious order" [12], [21]. The signal subspace for $H_y$ and $H_{\hat{w}}$ is then constructed by the first $P$ row vectors of the matrix $X^{-1}$, whereas the remaining $K - P$ rows of $X^{-1}$ span the noise subspace of $H_y$ and $H_{\hat{w}}$. Fig. 3 illustrates an example of the computation results for the respective diagonal elements $c_i$ and $s_i$ of the matrices $C$ and $S$ obtained in (2)–(4) for a typical 512-sample voiced frame corrupted by white noise at 5-dB SNR, with column size $K = 40$ for the matrices $H_y$ and $H_{\hat{w}}$. It can be found that in this frame of noisy speech $P = 30$, because $c_i > s_i$ for $1 \le i \le 30$ but $c_i \le s_i$ for $31 \le i \le 40$, so the first 30 row vectors of the matrix $X^{-1}$ are used to construct the signal subspace and the rest, $K - P = 10$, are used to construct the noise subspace. Of course, $P$ can be different for different frames of noisy speech. This is how the dimensions $P$ and $K - P$ are determined precisely and automatically for each frame of noisy speech.

The minimum variance estimation (MVE) algorithm, a linear estimator with the lowest residual noise level [9], is then used to estimate the matrix $H_x$ by finding a transformation matrix $T \in \mathbb{R}^{K \times K}$ which minimizes the Frobenius distance between the two matrices $H_y T$ and $H_x$ (by approximating the rank of $H_x$ by $P$)

$$\hat{T} = \arg\min_{T} \left\| H_y T - H_x \right\|_F^2 \tag{5}$$

where $\|A\|_F^2 = \sum_i \sum_j a_{ij}^2$ is the Frobenius norm square of a matrix $A$, and $a_{ij}$ is an element of the matrix $A$. $\hat{T}$ is the estimated result of the transformation matrix $T$. A closed-form solution for estimating $H_x$ based on (5) can be obtained by weighting the components of the principal set and nulling those of the minor set of the matrix $C$ [obtained in (2)]

$$\hat{H}_x = H_y \hat{T} = U \hat{C} X^{-1} = \sum_{i=1}^{P} \hat{c}_i \, u_i \bar{x}_i^T \tag{6}$$

where the matrices $U$ and $X$ are those in (2) and (3), and the vectors $u_i$ and $\bar{x}_i^T$ are the $i$th column and row, $1 \le i \le K$, of the matrices $U$ and $X^{-1}$, respectively. From (2), (3), and (6), we know that the matrices $H_y$, $H_{\hat{w}}$, and $\hat{H}_x$ can be transformed onto the same vector space (spanned by the row vectors of $X^{-1}$). The matrix $\hat{C}$ is diagonal with diagonal elements given as follows:

(7)

where $c_i$ and $s_i$, $1 \le i \le K$, are those in (2) and (3). Fig. 4 is the same as Fig. 3 for the same example voiced frame, except that the diagonal elements $\hat{c}_i$ of the matrix $\hat{C}$ obtained from (6) and (7) are plotted in addition. It can be seen that $\hat{c}_i$ decreases monotonically for $1 \le i \le P$ and becomes zero for $P < i \le K$.

Fig. 3. Typical example of the computation results of the diagonal elements $c_i$ (circle) and $s_i$ (x-mark) of the matrices $C$ and $S$, respectively, in (2) and (3) for a voiced frame with $K = 40$.

Fig. 4. Typical example of the computation results of the diagonal elements $c_i$ (circle), $s_i$ (x-mark), and $\hat{c}_i$ (square, for GSVD-MVE) of the matrices $C$, $S$, and $\hat{C}$ (for GSVD-MVE), respectively, in (2), (3), and (6) and (7) for a voiced frame with $K = 40$.

The estimated matrix $\hat{H}_x$ obtained here may not have the Hankel-form structure. We can simply average the antidiagonal elements of $\hat{H}_x$ to recover the Hankel-form sample matrix and thus the enhanced speech frames, as depicted in Fig. 5 [12].
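The decomposition in (2)–(4) and the automatic choice of the signal subspace dimension described in this section can be sketched numerically as follows. NumPy does not provide a GSVD routine, so this illustration builds one from a QR factorization of the stacked matrix followed by an SVD, which is one standard construction and not necessarily the numerical method used by the authors. The function names `gsvd` and `signal_subspace_dim`, and the factorization orientation (each input written as an orthonormal-column matrix times a diagonal matrix times a shared matrix `Xinv`), are assumptions made for this sketch.

```python
import numpy as np

def gsvd(A, B):
    """Sketch of a GSVD of two L x K matrices with full-column-rank stack.

    Returns U, V, c, s, Xinv such that
        A = U @ diag(c) @ Xinv,   B = V @ diag(s) @ Xinv,
    with c descending, s ascending, and c**2 + s**2 == 1.
    """
    L = A.shape[0]
    Q, R = np.linalg.qr(np.vstack([A, B]))   # stacked matrix = Q R
    Q1, Q2 = Q[:L], Q[L:]                    # Q1^T Q1 + Q2^T Q2 = I
    U, c, Wt = np.linalg.svd(Q1, full_matrices=False)
    c = np.clip(c, 0.0, 1.0)                 # guard against rounding
    s = np.sqrt(1.0 - c ** 2)
    # Columns of Q2 @ W are orthogonal with norms s_i; normalize them.
    V = (Q2 @ Wt.T) / np.maximum(s, 1e-12)
    Xinv = Wt @ R                            # rows span the shared row space
    return U, V, c, s, Xinv

def signal_subspace_dim(c, s):
    """Number of components with c_i > s_i (the principal set)."""
    return int(np.sum(c > s))
```

Because the values in `c` are sorted in descending order, the first `signal_subspace_dim(c, s)` rows of `Xinv` would play the role of the signal subspace basis and the remaining rows that of the noise subspace.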

For comparison purposes, the matrix $H_x$ can also be estimated using the least squares estimation (LSE) algorithm in the framework of the GSVD-based speech enhancement approach (GSVD-LSE). The simplest estimate of $H_x$, or $\hat{H}_x$, given $H_y$, is obtained by approximating $H_y$ by a matrix of rank $P$ in the least-squares sense [12], where the value of $P$ can be obtained with the same procedure as mentioned previously in this section

$$\min_{\mathrm{rank}(\hat{H}_x) = P} \left\| H_y - \hat{H}_x \right\|_F^2 \tag{8}$$

Again with the GSVD algorithm, the solution for (8) is straightforward

$$\hat{H}_x = U \begin{bmatrix} C_P & 0 \\ 0 & 0 \end{bmatrix} X^{-1} \tag{9}$$

where the diagonal matrix $C_P \in \mathbb{R}^{P \times P}$ consists of the $P$ most informative diagonal elements of the matrix $C$ (principal set) obtained in (2). The reconstructed Hankel-form sample matrix and the enhanced speech frames can be similarly obtained.

E. Phase (V): Frame Overlap and Add

Finally, the enhanced speech signal can be obtained by concatenating the estimated speech frames with the overlap-add method.
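The overlap-add step can be sketched as follows. This is a minimal illustration assuming rectangular-window frames with a fixed hop; dividing by the number of frames covering each sample is an assumption of this sketch (it makes the operation an exact inverse of rectangular framing), and the function name `overlap_add` is hypothetical.

```python
import numpy as np

def overlap_add(frames, hop=256):
    """Concatenate enhanced frames by overlap-add.

    frames has shape (n_frames, frame_len). Overlapping samples are summed
    and then divided by the number of frames that cover them.
    """
    frames = np.asarray(frames, dtype=float)
    n_frames, frame_len = frames.shape
    if hop > frame_len:
        raise ValueError("hop larger than frame length leaves gaps")
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    cover = np.zeros_like(out)
    for m in range(n_frames):
        start = m * hop
        out[start:start + frame_len] += frames[m]
        cover[start:start + frame_len] += 1.0
    return out / cover
```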

Fig. 5. Evaluation of the Hankel-form sample matrix from the estimated matrix $\hat{H}_x$.
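The antidiagonal averaging that recovers a frame from an estimated matrix can be sketched as below. It assumes the indexing convention that entry (i, j) corresponds to sample i + j of the frame; the function name `hankel_to_frame` is hypothetical.

```python
import numpy as np

def hankel_to_frame(H):
    """Average the antidiagonals of an L x K matrix into L + K - 1 samples.

    Entries H[i, j] with the same i + j are averaged, which restores the
    Hankel structure in the least-squares sense and yields one frame.
    """
    H = np.asarray(H, dtype=float)
    L, K = H.shape
    total = np.zeros(L + K - 1)
    count = np.zeros(L + K - 1)
    for j in range(K):
        total[j:j + L] += H[:, j]   # column j contributes to samples j .. j+L-1
        count[j:j + L] += 1.0
    return total / count
```

For a matrix that already has Hankel structure the function returns the underlying samples unchanged.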

III. PCGSVD-BASED APPROACH

Though the GSVD-based approach described previously has been shown to provide better performance than other popular enhancement approaches [13], some musical noise is still perceivable in the enhanced speech under lower SNR conditions. To obtain better-sounding speech, we further propose to integrate the masking properties of the human auditory system into the GSVD-based approach to establish an improved framework for speech enhancement in this paper [18], referred to as the perceptually constrained GSVD (PCGSVD)-based approach here. In this section, we first briefly summarize the procedure for evaluating the human auditory masking thresholds (AMTs) in the frequency domain, and then present two versions of the PCGSVD-based approach (PCGSVD-MVE and PCGSVD-LSE). Furthermore, because the PCGSVD-based approach operates in the generalized singular domain, a transformation of AMTs between the frequency domain and the generalized singular domain is proposed [18], with which closed-form solutions for the PCGSVD-MVE- and PCGSVD-LSE-based subspace approaches can be obtained. Finally, in Section III-F, we discuss the influence of the scaling factor $\mu$, as defined in Section II-B for obtaining the matrix $H_w$, on the performance of the proposed enhancement approaches.

A. Evaluation of the Auditory Masking Thresholds

Noise masking is a well-known psychoacoustic property of the human auditory system that has been applied with good success to speech and audio coding in order to partially or totally mask the distortion introduced in the coding processes [4], [8]. The masking effect occurs when the human auditory system is incapable of distinguishing two signals that are close enough in the time or frequency domain. The maximum allowable level of the noise spectrum (or distortion spectrum) below which the distortion is not discernible by a human listener is referred to as the masking threshold. It is obtained as the minimum threshold of audibility for a given masker signal. Both temporal and simultaneous masking properties of human perception have been investigated, but here we only use the simultaneous masking properties evaluated in the frequency domain in the PCGSVD-based approach proposed in this paper. The evaluation procedure for simultaneous AMTs is briefly described as follows.

The perceptible frequency range for human auditory system (20 Hz–20 kHz) is usually modeled by 25 critical bands. The magnitude square of the discrete Fourier transform (DFT) components of the clean speech signal can be summed in each critical band, and then convolved with a spreading function to consider the cross correlation between the critical bands. This spread sequence is further divided by a set of relative threshold values based on the noise-like or tone-like nature for each critical band of the input speech frame. The AMTs are finally obtained by renormalizing the above sequence to compensate for the gain modification of the convolution process, and make sure they are not below the absolute masking thresholds of human hearing [4]–[6].

B. Formulation of the PCGSVD-MVE-Based Subspace Approach

The enhancement framework of the PCGSVD-based subspace approach using the MVE algorithm (PCGSVD-MVE) is almost identical to that of the GSVD-MVE-based approach described in Section II, except for Phase (IV) in Section II-D. Therefore, only Phase (IV) of the framework of the PCGSVD-MVE-based approach is presented here. By incorporating the auditory masking effect into the GSVD-MVE-based subspace approach to further suppress the perceivable residual noise, the goal here is to find an optimal transformation matrix $T$ for which not only is the Frobenius distance between the two matrices $H_y T$ and $H_x$ minimized, but the normalized energies of the projections of the residual noise signal are also constrained: they must not be greater than the transformed AMTs for the first $P$ projections ($P$ is the dimensionality of the signal subspace of the matrices $H_y$ and $H_{\hat{w}}$ here), and must be zero for the remaining $K - P$ projections. In other words, the residual noise components cannot be perceived by the human ear

subject to (10)

where the vectors $u_i$ and $\bar{x}_i^T$ are, respectively, the $i$th column and row, $1 \le i \le K$, of the matrices $U$ and $X^{-1}$ obtained in (2) and (3), and the transformed AMTs are the AMTs projected onto the generalized singular domain, which can be evaluated by the procedure presented below [18]. Everything else is the same as the procedures summarized in Section II.

C. Estimating AMTs Projected Onto the Generalized Singular Domain

Because the PCGSVD-based approach operates in the generalized singular domain, whereas the conventional AMTs are well defined in the frequency domain, the previously proposed transformation between the frequency domain and the eigen domain [19] is used here to perform the transformation between the frequency domain and the generalized singular domain [18], with which closed-form solutions for the PCGSVD-MVE- and PCGSVD-LSE-based subspace approaches can be obtained. The power spectrum of the clean speech signal is required for evaluating the AMTs in the frequency domain, but the clean speech is not known here. This power spectrum is estimated by the Blackman–Tukey frequency estimation technique [22] as summarized below.

With (6) and (7) in Section II-D, the $K$-dimensional autocorrelation matrix of the clean speech frame can be approximately obtained from the matrix $\hat{H}_x$

(11)

where $N$ is the frame size and $K$ is the column dimension of the Hankel-form sample matrices $H_y$ and $H_{\hat{w}}$. With the autocorrelation matrix obtained in (11), the estimated power spectrum of the clean speech frame can be approximately obtained by the principal component version of the Blackman–Tukey frequency estimation with Bartlett window [22, pp. 470–471]

(12)

where the elements $c_i$ and $s_i$, $1 \le i \le P$ ($P$ is the dimension of the signal subspace of the matrices $H_y$ and $H_{\hat{w}}$), are those in (2) and (3), respectively, the vector in (12) holds the magnitude square of the $M$-point DFT ($M$ is the number of the AMTs) of the $i$th row of the matrix $X^{-1}$ [i.e., $\bar{x}_i^T$ in (10)] in (2) and (3), and the value in the bracket of (12) is in fact the value in (7). With the estimated power spectrum in (12), the vector of AMTs in the frequency domain can then be evaluated by the procedure described in Section III-A. With this vector, the AMTs projected onto the generalized singular domain can be obtained with a process similar to that proposed by Jabloun and Champagne [19]

(13)

where the quantity in (13) is the $i$th element of the vector given as follows:

(14)

where the transformation matrix in (14) is the matrix whose $i$th row is the vector for the magnitude square of the $M$-point DFT of the $i$th row vector of the matrix $X^{-1}$ in (2) and (3).

D. Solution for the PCGSVD-MVE-Based Subspace Approach

With the formulations above in (10)–(14), it can be shown that the estimation of the matrix $H_x$ has a closed-form solution

(15)

where the diagonal elements, $1 \le i \le K$, of the nonnegative diagonal matrix in (15) are

(16)

respectively for the principal set ($c_i > s_i$, $1 \le i \le P$) and the minor set ($c_i \le s_i$, $P < i \le K$) of the matrix $C$. The result in (15) and (16) is derived in detail in the Appendix. From (16), we notice that for the components of the principal set of the matrix $C$, the diagonal elements of the solution are the smaller of two terms. The former term is in fact the solution for the GSVD-MVE-based approach as in (7), which is chosen when the normalized projection of the residual noise signal onto the signal subspace of the matrices $H_y$ and $H_{\hat{w}}$ is below the value of the corresponding transformed AMT [as described in the first constraint of (10)], and thus this projected residual noise component cannot be perceived. Otherwise, the second term in (16) will be chosen. Note that this second term is proportional to the square root of the $i$th transformed AMT, but inversely proportional to $s_i$, which is arranged to be increasing with index $i$ as in (3); the factor $1/s_i$ is therefore more dominant in this second term for smaller $i$, which corresponds to the more informative signal subspace components. This AMT-related term will be chosen when the $i$th constraint of (10) is active [i.e., the equality in the first constraint of (10) holds]. In this case, the residual noise component is nearly unperceivable. Fig. 6 is the same as Figs. 3 and 4 for the same example voiced frame, except that the sequence of AMT-related terms in (16), $1 \le i \le K$, and the diagonal elements of the estimated matrix for PCGSVD-MVE in (15) and (16) are plotted in addition. It can be found that for $1 \le i \le P$ the diagonal elements for PCGSVD-MVE are upper bounded by the AMT-related sequence, while for $P < i \le K$ they are all zero.

Fig. 6. Typical example of the computation results of the diagonal elements $c_i$ (circle), $s_i$ (x-mark), the GSVD-MVE weights (square), and the PCGSVD-MVE weights (triangle), together with the AMT-related sequence in (16), $1 \le i \le K$ (diamond), respectively, in (2), (3), (6), and (7), (15) and (16) for a voiced frame with $K = 40$.

E. PCGSVD-LSE-Based Subspace Approach

For comparison purposes, the PCGSVD-LSE-based subspace approach can be similarly formulated as follows:

subject to (17)

The solution for (17) is similar to (15) and (16)

(18)

where the diagonal elements, $1 \le i \le K$, of the nonnegative diagonal matrix in (18) are

(19)

F. Estimation of the Scaling Factor

It is well known that entropy is a metric of uncertainty for random variables. In the preliminary study, it was found that setting the value of the scaling factor $\mu$, used in estimating the matrix $H_w$ from $H_{\hat{w}}$ as specified in Section II-B, to be inversely proportional to the spectral entropy of the additive noise in some sense can improve the performance of the PCGSVD-MVE- and PCGSVD-LSE-based approaches. The spectral entropy is formulated as follows [23]:

$$\mathcal{E} = -\sum_{k=0}^{M-1} p_k \log p_k \tag{20}$$

where $M$ is the DFT size and $p_k$, $0 \le k \le M-1$, is the probability density function obtained by normalizing the spectral energy of the $k$th frequency component of the estimated noise over all frequency components

$$p_k = \frac{|\hat{W}(k)|^2}{\sum_{j=0}^{M-1} |\hat{W}(j)|^2} \tag{21}$$

For the PCGSVD-based approach, with a smaller value of $\mu$, less signal distortion was observed in the tests for the estimated speech, especially for the low-energy frequency components, but that also led to increased residual noise. So, there is a tradeoff in choosing the value of $\mu$ that gives the best performance. We found that the best value for $\mu$ has to be inversely proportional to the spectral flatness of the additive noise in some sense, which is reasonably well measured by the spectral entropy as defined above in (20) and (21). Moreover, if the additive noise is stationary, the evaluation result for the spectral entropy ought to be invariant whether the noise level is high or low. However, changing the value of $\mu$ does not significantly affect the performance of the GSVD-MVE- and GSVD-LSE-based approaches.
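The spectral entropy of a noise frame, as described in this subsection, can be sketched as follows. The natural logarithm and the use of the full DFT are assumptions of this sketch (the source does not state the log base), and the function name `spectral_entropy` is hypothetical. The mapping from entropy to the scaling factor is not specified beyond "inversely proportional in some sense", so it is deliberately left out.

```python
import numpy as np

def spectral_entropy(noise_frame, n_fft=512):
    """Spectral entropy of one noise frame.

    The power spectrum is normalized to sum to one and treated as a
    probability distribution; bins with zero probability contribute zero.
    """
    power = np.abs(np.fft.fft(noise_frame, n_fft)) ** 2
    total = power.sum()
    if total <= 0.0:
        raise ValueError("frame has no energy")
    p = power / total
    p = p[p > 0.0]
    return float(-np.sum(p * np.log(p)))
```

A spectrally flat frame gives the maximum value (the logarithm of the DFT size), while a frame concentrated in a few bins gives a small value, and scaling the frame amplitude leaves the result unchanged.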


IV. EXPERIMENTS AND PERFORMANCE EVALUATION

The experimental environment was as follows. We randomly selected 30 clean utterances produced by two females and two males from the TIMIT speech corpus with sampling rate 16 kHz for testing [24]. Four types of noise source, "White," "Volvo-car," "Babble (speech-like)," and "Factory," chosen from the NOISEX-92 database [25], were artificially added to the test speech with resulting SNR ranged from 20 to 10 dB with 5-dB step size for evaluation. The Babble and Factory noise sources are nonstationary whereas the Volvo-car noise is stationary; none of the three is white. The following parameter settings were used for the GSVD-/PCGSVD-based approaches. The rectangular window was used for the framer as in Section II-A due to its better performance. The frame size was 32 ms (512 samples) with 50% frame overlap. The value of $M$, the number of AMTs in Section III-A, was the same as the frame size $N$. The value for the scaling factor $\mu$, as discussed in Section III-F, was set to 1.0 for the GSVD-MVE- and GSVD-LSE-based approaches, and to 0.6, 1.15, 0.9, and 0.9, respectively, for the White, Volvo, Babble, and Factory noise cases for the PCGSVD-MVE- and PCGSVD-LSE-based approaches [$\mu$ was roughly estimated (via the spectral entropy method mentioned in Section III-F) from the first two nonspeech frames of a typical utterance in each respective case]. The row and column sizes ($L$ and $K$) of the two Hankel-form sample matrices $H_y$ and $H_{\hat{w}}$ were 473 and 40, respectively. The column dimension of 40 for the Hankel-form matrices was sufficient for the matrix $H_x$ to be rank deficient under noise-free conditions, and it was adequate for us to choose a boundary between the signal and noise subspaces.

For comparison, we also implemented the conventional spectral subtraction algorithm in the power spectral domain (abbreviated as PSS) [2], a modified version of PSS using the auditory masking effect in the enhancement process (abbreviated as PSS-AMT) [6], and a previously proposed perceptual subspace approach, in which a transformation from the frequency domain to the eigen domain was used to obtain a perceptual upper bound for the residual noise to be applied with the subspace concept (abbreviated as PKLT), which was shown to offer performance competitive with the conventional KLT-based subspace approach in subjective listening tests [19]. The VAD algorithm recommended by ETSI-EAFE for frame-dropping (referred to as ETSI-EAFE VAD here) [26] was employed in all the algorithms under consideration for detecting the nonspeech frames and estimating the noise statistics. Compared with other popular VAD algorithms, this algorithm was reported to have a low false alarm rate [27]. We also forced the first two frames of each utterance to be silence frames for estimating the noise statistics initially.

Equation (22) formulates the processes of the PSS and the PSS-AMT algorithms in the frequency domain

$$|\hat{X}(k)|^2 = \begin{cases} |Y(k)|^2 - \alpha(k)\,|\hat{W}(k)|^2, & \text{if } |Y(k)|^2 - \alpha(k)\,|\hat{W}(k)|^2 > \beta(k)\,|\hat{W}(k)|^2 \\ \beta(k)\,|\hat{W}(k)|^2, & \text{otherwise} \end{cases} \qquad \angle \hat{X}(k) = \angle Y(k) \tag{22}$$

where $k$, $0 \le k \le M-1$, is the frequency index, the symbol "$\angle$" denotes the phase, the value of $M$, or the DFT size, was set to 512, $|Y(k)|^2$ and $|\hat{X}(k)|^2$ are, respectively, the power spectrum of the noisy speech and enhanced speech, $|\hat{W}(k)|^2$ is the smoothed power spectrum of the estimated noise, and $\alpha(k)$ and $\beta(k)$ are the weighting factor and flooring coefficient, respectively. For PSS, $\alpha(k)$ was set to 1.6 and $\beta(k)$ to 0.15, whereas for PSS-AMT, $\alpha(k)$ and $\beta(k)$ are both functions of AMTs. The input noisy speech signal was segmented into overlapped frames via Hamming window and transformed to the frequency domain. The frame size and shift for both the PSS and PSS-AMT were 32 ms (512 samples) and 16 ms (256 samples), respectively. Each detected silence frame from the ETSI-EAFE VAD algorithm was employed to update the noise statistics in the frequency domain

$$|\hat{W}(k)|^2 = \lambda\,|\hat{W}_{\mathrm{prev}}(k)|^2 + (1 - \lambda)\,|Y(k)|^2 \tag{23}$$

where $|Y(k)|^2$ is here the power spectrum of the detected silence frame, $|\hat{W}_{\mathrm{prev}}(k)|^2$ is the previously estimated version, and $\lambda$ is the smoothing factor. For both PSS and PSS-AMT, $\lambda$ was set to 0.9. This updating process was carried out whenever a new silence frame was detected. For PKLT, the frame size and shift were 32 ms (512 samples) and 2 ms (32 samples), respectively, for computation load reduction. This setting was verified to behave almost as well as the 16 ms (256 samples) of frame updating rate in preliminary informal tests. Various objective and subjective measures were used to evaluate the different approaches as given next.
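The PSS baseline and its noise update can be sketched as follows. This is a minimal illustration that assumes the common form of power spectral subtraction in which the subtracted power is floored at a fraction of the noise power and the noisy phase is reused; the exact expression in (22) may differ. The weighting factor 1.6, flooring coefficient 0.15, smoothing factor 0.9, Hamming window, and DFT size 512 are taken from the text, while the function names are hypothetical.

```python
import numpy as np

def update_noise_psd(prev_psd, silence_frame, smoothing=0.9, n_fft=512):
    """Recursively smooth the noise power spectrum with a new silence frame."""
    window = np.hamming(len(silence_frame))
    frame_psd = np.abs(np.fft.rfft(silence_frame * window, n_fft)) ** 2
    return smoothing * prev_psd + (1.0 - smoothing) * frame_psd

def power_spectral_subtraction(noisy_frame, noise_psd, alpha=1.6, beta=0.15,
                               n_fft=512):
    """Enhance one frame by power spectral subtraction with flooring.

    The noise power (scaled by alpha) is subtracted from the noisy power;
    wherever the result falls below beta times the noise power, the floor is
    used instead (assumed form). The noisy phase is kept unchanged.
    """
    window = np.hamming(len(noisy_frame))
    spectrum = np.fft.rfft(noisy_frame * window, n_fft)
    noisy_psd = np.abs(spectrum) ** 2
    subtracted = noisy_psd - alpha * noise_psd
    floor = beta * noise_psd
    enhanced_psd = np.where(subtracted > floor, subtracted, floor)
    enhanced = np.sqrt(enhanced_psd) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n_fft)[:len(noisy_frame)]
```

The returned frame is still Hamming-windowed, so it would be combined with its neighbors by overlap-add at a 256-sample shift.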

A. Segmental Signal-to-Noise Ratio

As is well known, the segmental SNR (SegSNR) measure indicates speech distortion more accurately than the global SNR. The SegSNR is measured by computing the SNR (in decibels) for each frame and averaging these SNR values over the entire utterance [28]. To emphasize the processing effect on the estimated speech signal, we manually excluded the nonspeech segments of the test speech in the following SegSNR and SegLSD evaluations. The SegSNR measure used here had a segment length of 256. The results are illustrated in Fig. 7, in which the abbreviations “Noisy” (original noisy speech), “PSS” (power spectral subtraction), “PSS-AMT” (modified version of PSS with AMTs), “PKLT” (the previously proposed perceptual subspace approach [19]), and “GSVD-LSE,” “GSVD-MVE,” “PCGSVD-LSE,” and “PCGSVD-MVE” (as proposed in this paper) denote the different tests. From Fig. 7(a) for the White noise case, we see that GSVD-MVE obtained the best results at almost every SNR. However, for the colored noise cases [Fig. 7(b)–(d)], PCGSVD-MVE outperformed GSVD-MVE and the other approaches in many cases, especially when the input SNR was low (e.g., −10 dB). We also see that GSVD-LSE was not as good as GSVD-MVE in most cases, and likewise PCGSVD-LSE was not as good as PCGSVD-MVE. This can be understood from the closed-form solution of GSVD-LSE (9), in which all the information (including clean speech and corrupting noise) was kept in the signal subspace of and , and similarly for PCGSVD-LSE versus PCGSVD-MVE. This observation remained true for all the following objective measures and subjective listening tests. Therefore, we excluded the results of GSVD-LSE and PCGSVD-LSE in the following experiments.

Fig. 7. Segmental SNR measures for (a) White noise, (b) Volvo-car noise, (c) Babble noise, and (d) Factory noise at different SNR values.
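The SegSNR computation described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: it assumes non-overlapping frames of 256 samples, time-aligned signals, and nonspeech segments already removed. The function name `segmental_snr` and the small constant `eps` are hypothetical choices made for this sketch.

```python
import numpy as np


def segmental_snr(clean, enhanced, frame_len=256, eps=1e-12):
    """Average the per-frame SNR (in dB) over non-overlapping frames.

    Assumes both signals are time-aligned and that nonspeech segments
    have already been removed by the caller.
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")

    frame_snrs = []
    for m in range(n_frames):
        seg = slice(m * frame_len, (m + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        # eps (an assumption of this sketch) avoids division by zero and log(0).
        ratio = (signal_energy + eps) / (error_energy + eps)
        frame_snrs.append(10.0 * np.log10(ratio))
    return float(np.mean(frame_snrs))
```

Averaging in the decibel domain, rather than averaging energies first, is what lets low-energy frames contribute as much as loud ones, which is why the measure tracks distortion more closely than the global SNR.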

B. Segmental Log Spectral Distance Measures

The segmental log spectral distance (SegLSD) measure is formulated as follows:

(24)

where the two spectral terms are, respectively, the individual spectral components of each frame of the clean and enhanced utterances, taken over the nonsilence frames only. The number of spectral components was set to 512. Fig. 8 depicts the SegLSD measures. From Fig. 8(a) for the White noise case, GSVD-MVE offered the smallest SegLSD, whereas PSS-AMT offered satisfactory results as well. However, for the nonwhite noise cases in Fig. 8(b)–(d), PCGSVD-MVE performed the best on average. This result indicates that the proposed PCGSVD-based approach generated relatively less spectral distortion in the frequency domain when the additive noise did not span the entire spectrum.

Fig. 8. Segmental log spectral distance measures for (a) White noise, (b) Volvo-car noise, (c) Babble noise, and (d) Factory noise at different SNR values.
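A log spectral distance of this kind can be sketched as below. The exact form of (24) is not reproduced here, so the sketch uses one common variant as an assumption: the root-mean-square difference of the log magnitude spectra (in dB) per frame, averaged over frames. The function name `segmental_lsd`, the non-overlapping framing, the use of the frame length as the DFT size, and `eps` are all hypothetical.

```python
import numpy as np


def segmental_lsd(clean, enhanced, frame_len=512, eps=1e-12):
    """Mean per-frame RMS log spectral distance in dB (one common variant).

    Assumes time-aligned signals with silence frames already removed.
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")

    distances = []
    for m in range(n_frames):
        seg = slice(m * frame_len, (m + 1) * frame_len)
        # Log magnitude spectra of the two frames; eps avoids log(0).
        log_clean = 20.0 * np.log10(np.abs(np.fft.rfft(clean[seg])) + eps)
        log_enh = 20.0 * np.log10(np.abs(np.fft.rfft(enhanced[seg])) + eps)
        distances.append(np.sqrt(np.mean((log_clean - log_enh) ** 2)))
    return float(np.mean(distances))
```

Unlike SegSNR, a smaller value is better here: identical spectra give a distance of zero, and a uniform gain error of a factor of 10 gives 20 dB.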

C. English Phoneme/Digit Recognition Accuracy

The third measure we adopted was the English phonemic accuracy obtained in free-phoneme decoding (without lexicon or language model) for the 30 test utterances mentioned previously in this section, which can be used to test the intelligibility of the estimated speech. The acoustic model used here consisted of 48 left-to-right continuous hidden Markov models (CHMMs) for the 48 context-independent phoneme units [29], trained on the 4-h TIMIT speech corpus. Most of the CHMMs included five states, two of which were nonemitting, and each emitting state consisted of eight Gaussian mixtures. The total number of Gaussian mixtures was about 1100. The dimension of the acoustic feature vectors was 39, comprising 12 mel-frequency cepstrum coefficients (MFCCs) and the normalized log energy, plus their first and second derivatives. The frame size for obtaining the acoustic features was 30 ms (480 samples) with a shift of 10 ms (160 samples). HTK [30] was used for feature extraction, acoustic model training, and recognition in this experiment. Without dropping the silence frames in the recognition phase, the baseline phoneme accuracy for clean speech for the 30 test sentences was 54.48%.

Fig. 9 shows the phoneme recognition accuracy results. In Fig. 9(a), all six approaches improved the recognition accuracy for the White noise case. For the colored and/or nonstationary noise situations in Fig. 9(b)–(d), however, PCGSVD-MVE again outperformed the other enhancement algorithms in most cases, especially for the Volvo-car and Factory noise cases. Extra tests verified that the recognition performance could be further improved if the training speech corpus for the acoustic model was similarly processed a priori; these results are not reported here to save space.

Fig. 9. TIMIT phoneme recognition accuracies for (a) White noise, (b) Volvo-car noise, (c) Babble noise, and (d) Factory noise at different SNR values.

We also carried out English digit recognition under the AURORA2 testing environment [31]. We adopted the 200 shortest utterances out of the total of 1001 for test sets A and B of the AURORA2 corpus under the clean training condition. We excluded test set C because it includes channel distortion, which is not handled here. The experimental setup was identical to that mentioned previously, except that the sampling rate for the AURORA2 task was 8 kHz and the CHMMs consisted of 11 digit units (0–9 plus OH) and a silence model; each digit HMM contained 16 emitting states and each state comprised two Gaussian mixtures. The scaling factor used to obtain the matrix mentioned in Section II-B was fixed at 0.9 for the PCGSVD-based approach (because most of the noise sources in the test sets are nonstationary, the average of the values obtained as mentioned above for all the different types of noise in test sets A and B was used). Fig. 10(a) and (b) show the recognition results for different SNR values (averaged over all noise types) and different types of noise (averaged over all SNR values), respectively. As with the TIMIT recognition accuracy for the colored noise cases in Fig. 9(b)–(d), PCGSVD-MVE outperformed the other enhancement algorithms in almost every case. Compared with Noisy, average word error rate reductions of 24.61% and 30.66% were achieved by the PCGSVD-MVE approach for test sets A and B, respectively. The above results indicate that the proposed PCGSVD-MVE could be useful as a noise removal preprocessor for a speech recognition system.

Fig. 10. AURORA2 word accuracies for clean condition training and test sets A and B for (a) different SNR values averaged over all noise types and (b) different types of noise averaged over all SNR values.

Fig. 11. Spectrogram plots for a typical utterance corrupted by White noise at 10-dB SNR. (a) Clean. (b) Noisy (10-dB White). (c) PSS. (d) PSS-AMT. (e) PKLT. (f) GSVD-MVE. (g) PCGSVD-MVE.
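For readers comparing word accuracies with word error rate reductions, the arithmetic can be sketched as follows. This assumes the reported reductions are relative to the error rate of the unprocessed noisy speech and that the word error rate is 100% minus the word accuracy; neither convention is stated explicitly in the text. The function name and the accuracy values in the usage note are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_accuracy, enhanced_accuracy):
    """Relative word error rate reduction (%) from two word accuracies (%).

    Assumes WER = 100 - accuracy for both systems.
    """
    baseline_wer = 100.0 - baseline_accuracy
    enhanced_wer = 100.0 - enhanced_accuracy
    if baseline_wer <= 0.0:
        raise ValueError("baseline word error rate must be positive")
    return 100.0 * (baseline_wer - enhanced_wer) / baseline_wer
```

With made-up accuracies of 60% before and 75% after enhancement, the absolute accuracy gain is 15 points while the relative error reduction is 37.5%, which is why the two kinds of figures can differ substantially.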

D. Spectrogram and Time Domain Waveform Plots

Figs. 11 and 12 are, respectively, the time-frequency spectrogram plots and the time-domain waveforms for the various versions of a test utterance, “To many experts, this trend was inevitable,” produced by a male speaker and corrupted by White noise at 10-dB SNR. From Fig. 11(c), we can see in the spectrogram of the PSS-processed speech some undesired random tone peaks in the nonspeech regions (e.g., region A) and in the low-energy, noise-like speech segments (e.g., segments B, C, D, and E), compared with the clean speech version in Fig. 11(a); these are perceivable as musical noise. This phenomenon was reduced in the PSS-AMT, PKLT, and GSVD-MVE processed utterances, as shown in Fig. 11(d)–(f), respectively, although parallel informal listening tests found that the residual noise was still quite perceivable. It is further evident in Fig. 11(g) that PCGSVD-MVE recovered almost the same detailed information of the speech spectrum as GSVD-MVE did in Fig. 11(f), but with far fewer random tone peaks in the silence segments and low-energy speech regions. Some residual noise still occurred in the PCGSVD-MVE processed speech for the White noise case; this is due to the de-emphasis of the Hankel-form matrix constructed from the estimated noise by the scaling factor (0.6 in this case). However, in the parallel informal subjective listening tests, many subjects agreed that the musical noise was less perceivable for the utterances processed by PCGSVD-MVE than for those processed by GSVD-MVE and the other approaches of interest. Another test utterance, “The small boy put the worm on the hook,” produced by a female speaker, was used in the same tests, except that the additive noise was Volvo-car noise at −10-dB SNR; the results are in Figs. 13 and 14. Again, PCGSVD-MVE kept most of the original speech information and eliminated most of the residual noise present in the spectrograms of the PSS and PSS-AMT processed speech. Furthermore, from Figs. 13 and 14, we can see that the high-frequency components in the spectrogram of the PKLT-processed utterance are seriously attenuated, and the waveform is quite different from the original speech, which means that under highly noisy environments, PKLT not only eliminates the noise but damages the speech signal as well.

Fig. 12. Waveforms for a typical utterance contaminated by White noise at 10-dB SNR. (a) Clean. (b) Noisy (10-dB White). (c) PSS. (d) PSS-AMT. (e) PKLT. (f) GSVD-MVE. (g) PCGSVD-MVE.

Fig. 13. Spectrogram plots for a typical utterance corrupted by Volvo-car noise at −10-dB SNR. (a) Clean. (b) Noisy (−10-dB Volvo). (c) PSS. (d) PSS-AMT. (e) PKLT. (f) GSVD-MVE. (g) PCGSVD-MVE.

Fig. 14. Waveforms for a typical utterance contaminated by Volvo-car noise at −10-dB SNR. (a) Clean. (b) Noisy (−10-dB Volvo). (c) PSS. (d) PSS-AMT. (e) PKLT. (f) GSVD-MVE. (g) PCGSVD-MVE.

TABLE I

PERCENTAGE (%) OF SUBJECTS WHO PREFERRED THE UTTERANCES PROCESSED BY PCGSVD-MVE AS COMPARED TO NOISY AND THOSE PROCESSED BY PSS, PSS-AMT, PKLT, AND GSVD-MVE, RESPECTIVELY, FOR DIFFERENT NOISY CONDITIONS

E. Subjective Listening Tests

Two sets of subjective listening tests were performed and are reported here.

1) Listening Preference Comparison: As shown in the first two columns of Table I, the White, Volvo-car, Babble, and Factory noises were artificially added to the test speech at several SNR values (−10 to 10 dB) with corresponding input SegSNR values. For each combination of noise type and SNR value, five test utterance pairs were generated. In each pair, one utterance was processed by the proposed PCGSVD-MVE approach and the other was either the Noisy utterance or one processed by PSS, PSS-AMT, PKLT, or GSVD-MVE. The first two rows, for White noise at SNR values of 10 and 0 dB, targeted moderately and highly noisy conditions, respectively. The next two rows, for Volvo-car noise at SNR values of 0 and −10 dB, corresponded to a car traveling in the city at 40 to 50 km/h and on the highway at about 100 km/h, respectively. The other two cases, for Babble and Factory noises at 5-dB SNR, were also close to real-world situations. A total of 26 subjects, 8 females and 18 males between 20 and 45 years of age, participated in the test. Each subject was asked to evaluate four different sets of utterances (20 test utterance pairs in total) with ordinary headphones and to choose the one they preferred in each pair without knowing which was which; on average, each utterance pair was evaluated by 17 subjects. From the evaluation results in Table I, it is clear that the PCGSVD-MVE approach proposed in this paper outperformed the other approaches for almost all noise types and SNR values. On average, 69.32% to 83.61% of subjects preferred the PCGSVD-MVE processed speech over that of the other approaches considered.

2) MOS Comparison: Mean opinion score (MOS) rating is the most widely used measure for subjective quality tests, in which the subjects rate the test speech on a scale from 5 to 1, corresponding to “Excellent,” “Good,” “Fair,” “Poor,” and “Unsatisfactory,” respectively [28]. The same group of 26 subjects as mentioned above participated in this MOS rating, using exactly the same noisy and enhanced test utterances as those in Section IV-E1, plus two clean utterances to provide a reference MOS rating. Fig. 15 depicts the rating results. The averaged MOS rating for the two clean utterances was 4.39. It is clear that the proposed PCGSVD-MVE approach received the highest MOS rating, whether the noise was white or colored, stationary or nonstationary.

Fig. 15. Mean opinion score rating results for different types of noisy environment.

F. Discussions

For the objective measures in Sections IV-A–IV-C, GSVD-MVE roughly offered the best performance in the White noise case. This is apparently because GSVD-MVE actually optimized the estimated speech in the signal processing sense. PCGSVD-MVE did not perform as well in those objective evaluations for the White noise situation, because the spectral components of noise-like speech usually have lower energy while the White noise spectrum is flat. For such wide-band noise cases, in order to mask the residual frequency components based on the auditory masking effect, PCGSVD-MVE tended to attenuate the high-frequency components (which very often cannot be masked by the speech signal under adverse conditions) to make sure the residual noise is nearly unperceivable, which may cause some distortion in the estimated speech. Although PSS-AMT also utilized the human auditory effect, it only applied the AMTs to determine the bounded weighting factor and flooring coefficient in the spectral subtraction algorithm in order to achieve a good tradeoff between the perceivable residual noise and the signal distortion [6]; therefore, PSS-AMT may introduce less distortion in the high-frequency parts than PCGSVD-MVE in the White noise case. To reduce the signal distortion of the PCGSVD-processed speech for such broad-band noise situations, we may de-emphasize the Hankel-form matrix by setting the scaling factor, as specified in Sections II-B and III-F, to less than unity. Moreover, in the subjective listening tests in Section IV-E, the proposed PCGSVD-MVE approach indeed offered the best speech quality for this broad-band noise case. In fact, during the tests most of the subjects reported that the musical noise induced by PCGSVD-MVE was less perceivable than that induced by GSVD-MVE, PSS, PSS-AMT, and PKLT.

As for the colored noise cases, it is evident from both the objective and subjective evaluations that PCGSVD-MVE offered the best performance on average for Volvo-car, Babble, and Factory noises, whether the SNR was high or low. Such real-world noise may be narrow-band and low-passed, and so is often masked by voiced speech (although this is usually not true for unvoiced speech); therefore, PCGSVD-MVE may behave better than GSVD-MVE and the other approaches for human perception. A case especially worth mentioning is the Volvo-car noise at 0-dB SNR, for which the MOS rating for PCGSVD-MVE in Fig. 15 was almost as good as that of the clean speech. This is because the Volvo-car noise is low-passed and narrow-band, and thus easily masked by the voiced segments of the input speech. Besides, from the listening preference comparison in Table I, it is interesting to point out that many subjects disliked GSVD-MVE processed utterances more than those processed by PSS-AMT and PKLT, especially when the additive noise was white. Because the artificially generated residual noises were quite different for the various enhancement algorithms, subjects may have preferred one kind of residual noise to the others in the listening preference comparison, which explains the remarkable result in Table I.

We also evaluated SegSNR and SegLSD without silence detection error (i.e., using manually labeled silence frames). Experimental results showed that the difference was in general insignificant for stationary noise, whether white or not. On the other hand, for nonstationary noise, correct detection of noise segments could improve the performance.

Finally, we also estimated the computational complexity of the different enhancement algorithms discussed here. The primary CPU load of PSS comes from the DFT and inverse DFT operations. For PSS-AMT, the extra computations are for the AMTs, which are relatively limited compared with the DFT operation. PKLT requires performing the KLT on the autocorrelation matrix of each speech frame with complexity proportional to , where is the dimension of the autocorrelation matrix of the speech frame. However, a recursive algorithm exists that approximates the KLT and can reduce its computational load by roughly one order [32]. The GSVD algorithm, on the other hand, requires roughly operations per frame [33], where and are, respectively, the row and column sizes of the Hankel-form matrices and as in Section II-B; this is roughly more than one order higher in computational complexity than PSS. In addition, PCGSVD-MVE needs to compute the AMTs in the frequency domain and transform them to the generalized singular domain, so extra operations are needed. Besides, some special techniques for reducing the computations of the GSVD algorithm have been developed as well [34], in which the complexity of one GSVD update can be reduced to , and thus real-time implementation of the GSVD- and PCGSVD-based approaches on commercial communication products is achievable.

G. Summary of the Performance Evaluation

All the above results are briefly summarized here. Compared with the unenhanced noisy speech, the proposed GSVD- and PCGSVD-based approaches achieved, respectively, on average over the different types of noise sources, 6.62 and 6.67 dB improvements in SegSNR evaluations, 1.60 and 1.31 dB reductions in SegLSD measures, 7.59% and 7.81% absolute phoneme recognition accuracy improvements for the TIMIT test utterances, and 4.83% and 13.96% absolute word accuracy improvements for the AURORA2 digit recognition task. In subjective listening tests, on average 69.32% to 83.61% of subjects preferred the PCGSVD-MVE processed speech over that of the other approaches considered. In particular, the good performance of the proposed approach with respect to colored noise is quite clear.

V. CONCLUSION

In this paper, we first proposed a GSVD-based speech enhancement approach, which was extended from the concept of the truncated QSVD-based approach. Based on this algorithm, a new PCGSVD-based approach that integrates the auditory masking thresholds transformed onto the generalized singular domain has been presented. Closed-form solutions for both the GSVD- and PCGSVD-based enhancement approaches were obtained. Objective measures, including time- and frequency-domain evaluations and speech recognition accuracy measures, were used for comparison with three other transformation-based speech enhancement algorithms under different noisy environments. The results indicated that the newly proposed PCGSVD-based approach can effectively alleviate the perceivable residual noise introduced by the enhancement processes, retain the features of the original speech, and improve the accuracy of the speech recognition system, whether the additive noise is stationary or nonstationary, especially when the noise is nonwhite.

APPENDIX

The proof of (15) and (16) [referred to here as the solution for the constrained optimization problem (10) of PCGSVD-MVE] is given here. It will be shown later that (6) and (7) [referred to here as the solution for the unconstrained optimization problem (5) of GSVD-MVE] turn out to be a special case of (15) and (16). For the constrained optimization problem of (10), an objective function together with the Kuhn–Tucker conditions can be introduced to convert this constrained optimization problem into an unconstrained one [35], in which two Lagrange multiplier vectors and are introduced such that

(25)

condition on

(26)

where is the trace of a matrix, , , and , , are the Lagrange multipliers or the components of the vectors and , respectively. The condition described in (26) is the complementarity condition, i.e., a Lagrange multiplier is zero when its corresponding constraint is inactive; otherwise, it should be a positive value. Therefore, the constrained PCGSVD-MVE optimization problem defined by (10) can be transformed to the unconstrained GSVD-MVE problem defined by (5) by setting all the Lagrange multipliers ( and ) to zero. We take the gradient of with respect to to obtain the estimate

(27)

such that

(28)

We further define a diagonal matrix whose diagonal elements are exactly the Lagrange multipliers , , and , ,

(29)

Equation (28) can then be written in matrix form using the matrix defined in (29)

(30)

The matrices and in (30) can be further decomposed via the GSVD algorithm [see (2) and (3)], leading to the following result:

(31)

For further development, with the GSVD algorithm, we can whiten the matrix by multiplying it by a transformation matrix

(32)

where denotes the whitened version of , and the matrices , , and are given in (2) and (3). The autocorrelation matrix for the whitened version of the estimated noise can be approximated by ; the identity autocorrelation matrix indicates that it is completely whitened. From Section II-B, we have .
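The whitening idea can be illustrated numerically with the sketch below. It is not the GSVD-based transformation of (32): as a stand-in, it uses the ordinary SVD of the noise matrix alone, and it omits any normalization by the number of rows. The function name `whitening_transform`, the moving-average coloring filter, and the matrix sizes are hypothetical.

```python
import numpy as np


def whitening_transform(noise_matrix):
    """Return W such that (N @ W).T @ (N @ W) equals the identity.

    Uses the SVD N = U diag(s) V^T of a full-column-rank noise matrix,
    so that N @ V @ diag(1/s) = U has orthonormal columns.
    """
    _, s, vt = np.linalg.svd(noise_matrix, full_matrices=False)
    if np.min(s) <= 0.0:
        raise ValueError("noise matrix must have full column rank")
    # Broadcasting divides column j of V by s[j], i.e. V @ diag(1/s).
    return vt.T / s


# Build a Hankel-form matrix from colored noise (illustrative sizes only).
rng = np.random.default_rng(0)
colored = np.convolve(rng.standard_normal(300), [1.0, 0.9, 0.5], mode="valid")
rows, cols = 200, 20
hankel = colored[np.arange(rows)[:, None] + np.arange(cols)]

whitened = hankel @ whitening_transform(hankel)
```

After the transformation, `whitened.T @ whitened` is the identity up to rounding error, which is the numerical counterpart of the statement that an identity autocorrelation matrix indicates complete whitening.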

In order to whiten the noise component in the noisy speech signal as in (32), we can multiply the matrix with accordingly

(33)

where the extra subscript for the matrices and again denotes the version of and with the noise component whitened. Again with the GSVD algorithm, the matrix can be factorized as follows:

(34)

where the matrices and , respectively, consist of the first orthonormal column vectors and the remaining orthonormal columns of the matrix . The diagonal matrices and , respectively, comprise the preceding diagonal elements and the remaining diagonal elements of the matrix , and and can be similarly obtained from the matrix . In a similar way, with the SVD algorithm, the matrix can be decomposed as follows:

(35)

where each column of the matrices or is orthonormal, is a diagonal matrix whose rank is not known a priori, but can be reasonably approximated as the dimension of the signal subspace of the matrices and (i.e., ), the diagonal matrix consists of the nonzero diagonal elements of the matrix , and consist of the first columns and the remaining columns of the matrix , respectively, and (whose columns span the signal subspace of ) and (whose columns span the noise subspace of ) are similarly obtained from the matrix . Substituting (34) and (35) into (33) with the relation , we have

(36)

The relations among the matrices , , , , , and others in (36) can be obtained as follows:

(37)

(38)

(39)

(40)

(41)

where the matrix consists of the first orthonormal column vectors of the matrix . Moreover, from (39) and (40) it is evident that

(42)

With the derivation from (35)–(42), the matrix can be written as follows:

(43)

Substituting (43) into (31), we have

(44)

Based on the assumption that the clean speech is uncorrelated with the additive noise, the matrices and (reconstructed from the signal subspace of the matrix ) are uncorrelated accordingly. Afterwards we substitute the result of (44) into (31) with the assumption that the equal sign of (31) still applies

(45)

or

(46)

where the matrices , , and are those defined in (29) and (34), respectively. We then define a matrix , , and from (46) we know that the matrix is diagonal with its last diagonal elements being zero, i.e., the rank of is . Equation (46) can be rearranged as follows:

(47)

and thus

(48)

or

(49)

where the diagonal matrices and consist of the first diagonal elements of the matrices and , respectively. Thus the diagonal elements of the matrix , , , are as follows:

(50)

where , , and are those defined in (2), (3), and (25), respectively. Because the last diagonal elements of the matrix are zero, the equality part of the constraints of (10) is always guaranteed. For the inequality part of the constraints, the energy of the transformed residual noise components as in (10) can be evaluated as follows:

(51)

(52)

(53)

(54)

(55)

which leads to

(56)

or

(57)

where the vectors and , , are, respectively, the th column and row of a -dimensional identity matrix and the matrix , and denotes the norm operation. From (51), with the Kuhn–Tucker conditions, the Lagrange multipliers , , can be classified according to the following two cases:

Constraint-inactivated case

(58)

and

Constraint-activated case

(59)

For the constraint-inactivated case, we set , , to zero and obtain the result [refer to (50) and (57)]

(60)

where

(61)

For the constraint-activated case, on the other hand, the equal sign of (57) holds, and therefore the th diagonal element of the matrix , , , in (50) has the form

(62)
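The two-case structure above follows the usual Kuhn–Tucker pattern, which can be sketched on a toy problem. The sketch below is not the paper's solution (50)–(63): it assumes a hypothetical per-component problem in which each nonnegative gain g is chosen as close as possible to an unconstrained gain g0 while the residual noise energy g² · n stays below a threshold t. The names `constrained_gains`, `g0`, `n`, and `t` are illustrative only.

```python
import math


def constrained_gains(unconstrained, noise_energy, thresholds):
    """Solve min (g - g0)^2 subject to g^2 * n <= t for each component.

    Assumes g0 >= 0, n > 0, and t > 0. Returns the gains and the
    Lagrange multipliers (zero where the constraint is inactive).
    """
    gains, multipliers = [], []
    for g0, n, t in zip(unconstrained, noise_energy, thresholds):
        g_max = math.sqrt(t / n)  # largest gain that keeps noise below t
        if g0 <= g_max:
            # Constraint inactive: multiplier is zero, keep the
            # unconstrained solution.
            gains.append(g0)
            multipliers.append(0.0)
        else:
            # Constraint active: equality holds, so g sits on the boundary.
            # Stationarity 2(g - g0) + 2*mu*g*n = 0 gives a positive mu.
            gains.append(g_max)
            multipliers.append((g0 - g_max) / (g_max * n))
    return gains, multipliers
```

For example, with unconstrained gains `[0.5, 0.9]`, noise energies `[1.0, 4.0]`, and thresholds `[1.0, 1.0]`, the first component is left unchanged with a zero multiplier, while the second is pulled back to the boundary gain 0.5 with a positive multiplier.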

With (50)–(62), the diagonal elements , , of the transformation matrix are

(63)

Hence, the estimated matrix for the clean speech frame can be expressed as follows:

(64)

where is a diagonal matrix with its diagonal elements being those in (16). For the unconstrained optimization problem for GSVD-MVE as formulated in (5), it is easy to verify its solution as given in (6) and (7) by assigning all the Lagrange multipliers , , of (50) to zero. This concludes the proof for (15) and (16) for the solution of the constrained optimization problem of PCGSVD-MVE, and (6) and (7) for the solution of the unconstrained optimization problem of GSVD-MVE.

ACKNOWLEDGMENT

The authors would like to thank Dr. Y.-S. Lee, Dr. B.-S. Jeng, Dr. S.-W. Swan, and Dr. R.-K. Chen of Chunghwa Telecommunication Laboratories, Taiwan, R.O.C., for their continuous support and encouragement. They would also like to thank the anonymous reviewers and the associate editor for their valuable comments.

REFERENCES

[1] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.

[2] R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft decision noise suppression filter,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 2, pp. 137–145, Apr. 1980.

[3] B.-L. Sim, Y.-C. Tong, J.-S. Chang, and C.-T. Tan, “A parametric formulation of the generalized spectral subtraction method,” IEEE Trans. Speech Audio Process., vol. 6, no. 4, pp. 328–337, Jul. 1998.

[4] J. D. Johnston, “Transform coding of audio signals using perceptual noise criteria,” IEEE J. Select. Areas Commun., vol. 6, no. 2, pp. 314–323, Feb. 1988.

[5] D. E. Tsoukalas, J. N. Mourjopoulos, and G. Kokkinakis, “Speech enhancement based on audible noise suppression,” IEEE Trans. Speech Audio Process., vol. 5, no. 5, pp. 497–514, Nov. 1997.

[6] N. Virag, “Single channel speech enhancement based on masking properties of the human auditory system,” IEEE Trans. Speech Audio Process., vol. 7, no. 2, pp. 126–137, Mar. 1999.

[7] C.-H. You, S.-N. Koh, and S. Rahardja, “Subspace speech enhancement for audible noise reduction,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2005, pp. 145–148.

[8] E. Zwicker and H. Fastl, Psychoacoustics, 2nd ed. New York: Springer-Verlag, 1999.

[9] S. V. Huffel, “Enhanced resolution based on minimum variance estimation and exponential data modeling,” Signal Process., vol. 33, pp. 333–355, Sep. 1993.

[10] B. T. Lilly and K. K. Paliwal, “Robust speech recognition using singular value decomposition based speech enhancement,” in Proc. IEEE TENCO-Speech and Image Tech. Comput. Telecommun., 1997, pp. 257–260.

[11] M. Klein and P. Kabal, “Signal subspace speech enhancement with perceptual post-filtering,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2002, pp. 537–540.

[12] S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sørensen, “Reduction of broad-band noise in speech by truncated QSVD,” IEEE Trans. Speech Audio Process., vol. 3, no. 6, pp. 439–448, Nov. 1995.

[13] G.-H. Ju and L.-S. Lee, “Speech enhancement based on generalized singular value decomposition approach,” in Proc. ICSLP, 2002, pp. 1801–1804.

[14] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.

[15] U. Mittal and N. Phamdo, “Signal/noise KLT based approach for enhancing speech degraded by colored noise,” IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 159–167, Mar. 2000.

[16] Y. Hu and P. Loizou, “A subspace approach for enhancing speech corrupted by colored noise,” IEEE Signal Process. Lett., vol. 9, no. 7, pp. 204–206, Jul. 2002.

[17] H. Lev-Ari and Y. Ephraim, “Extension of the signal subspace speech enhancement approach to colored noise,” IEEE Signal Process. Lett., vol. 10, no. 4, pp. 104–106, Apr. 2003.

[18] G.-H. Ju and L.-S. Lee, “Perceptually constrained generalized singular value decomposition-based approach for enhancing speech corrupted by colored noise,” in Proc. Eurospeech, 2003, pp. 533–536.

[19] F. Jabloun and B. Champagne, “A perceptual signal subspace approach for speech enhancement in colored noise,” in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2002, pp. 569–572.

[20] C. C. Paige and M. A. Saunders, “Towards a generalized singular value decomposition,” SIAM J. Numer. Anal., vol. 18, pp. 398–405, 1981.

[21] M. Dendrinos, S. Bakamidis, and G. Garayannis, “Speech enhancement from noise: A regenerative approach,” Speech Commun., vol. 10, pp. 45–57, Feb. 1991.

[22] M. H. Hayes, Statistical Digital Signal Processing and Modeling. New York: Wiley, 1999.

[23] J.-L. Shen, J.-W. Hung, and L.-S. Lee, “Robust entropy-based end-point detection for speech recognition in noisy environments,” in Proc. ICSLP, 1998, pp. 232–235.

[24] J. S. Garofolo, Getting Started With the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database. Gaithersburg, MD: National Inst. Stand. Technol. (NIST), 1988.

[25] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Commun., vol. 12, pp. 247–251, Jul. 1993.

[26] Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Extended Advanced Front-End Feature Extraction Algorithm; Compression Algorithms; Back-End Speech Reconstruction Algorithm, ETSI Std. ES 202 212 V1.1.1 Recommendation, Nov. 2003.

[27] J. Ramirez, J. C. Segura, C. Benitez, Á. Torre, and A. Rubio, “A new adaptive long-term spectral estimation voice activity detector,” in Proc. Eurospeech, 2003, pp. 3041–3044.

[28] S. Quackenbush, T. Barnwell, and M. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[29] K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 11, pp. 1641–1648, Nov. 1989.

[30] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book Version 3.0. Cambridge, U.K.: Cambridge Univ. Press, 1999.

[31] H. G. Hirsch and D. Pearce, “The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions,” in Proc. ISCA ITRW ASR2000, 2000, pp. 181–188.

[32] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Trans. Speech Audio Process., vol. 9, no. 2, pp. 87–95, Feb. 2001.

[33] F. T. Luk, “A parallel method for computing the generalized singular value decomposition,” J. Paral. Distrib. Comput., vol. 2, pp. 250–260, Aug. 1985.

[34] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single and multimicrophone speech enhancement,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002.

[35] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 2000.

Gwo-Hwa Ju received the M.S. degree in electrical engineering and the Ph.D. degree in communication engineering from National Taiwan University, Taipei, Taiwan, R.O.C., in 1990 and 2006, respectively.

He has been a Researcher with the Multimedia Applications Technology Laboratory, Telecommunication Laboratories, Chunghwa Telecom Corporation, Ltd., Taoyuan, Taiwan, since 1990. In 1992, he was a Visiting Researcher with the Speech Recognition and Synthesis Group, Human Interface Laboratories, NTT Yokosuka, Japan. His research interests include robust speech recognition, speech synthesis, speech coding, digital signal processing, and embedded system design.

Lin-Shan Lee (F’94) received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA.

He has been a Professor of electrical engineering and computer science at the National Taiwan University, Taipei, Taiwan, R.O.C., since 1982 and holds a joint appointment as a Research Fellow of Academia Sinica, Taipei. His research interests include digital communications and spoken language processing. He developed several of the earliest versions of Chinese spoken language processing systems in the world, including a text-to-speech system, natural language analyzer, dictation systems, and voice information retrieval system.

Dr. Lee was Guest Editor of a Special Issue on Intelligent Signal Processing in Communications of the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS in December 1994 and January 1995. He was Vice President for International Affairs (1996–1997) and the Awards Committee chair (1998–1999) of the IEEE Communications Society. He has been a member of the Permanent Council of the International Conference on Spoken Language Processing (ICSLP), was the convener of COCOSDA (International Coordinating Committee of Speech Databases and Assessment, 2000–2001), and is currently a member of the Board of the International Speech Communication Association (ISCA).