
Optimal design of minimum mean-square error noise reduction algorithms using the simulated annealing technique

Mingsian R. Bai,a) Ping-Ju Hsieh, and Kur-Nan Hur

Department of Mechanical Engineering, National Chiao-Tung University, 1001 Ta-Hsueh Road, Hsin-Chu 300, Taiwan

(Received 8 August 2008; revised 20 November 2008; accepted 21 November 2008)

The performance of the minimum mean-square error noise reduction (MMSE-NR) algorithm in conjunction with time-recursive averaging (TRA) for noise estimation is found to be very sensitive to the choice of two recursion parameters. To address this problem in a more systematic manner, this paper proposes an optimization method to efficiently search for the optimal parameters of the MMSE-TRA-NR algorithms. The objective function is based on a regression model, whereas the optimization process is carried out with the simulated annealing algorithm, which is well suited for problems with many local optima. Another NR algorithm proposed in the paper employs linear prediction coding as a preprocessor for extracting the correlated portion of human speech. Objective and subjective tests were undertaken to compare the optimized MMSE-TRA-NR algorithm with several conventional NR algorithms. The results of subjective tests were processed by using analysis of variance to justify the statistical significance. A post hoc test, Tukey’s Honestly Significant Difference, was conducted to further assess the pairwise difference between the NR algorithms. © 2009 Acoustical Society of America. [DOI: 10.1121/1.3050292]

PACS number(s): 43.60.Dh, 43.60.Np, 43.60.Uv, 43.72.Kb [EJS] Pages: 934–943

I. INTRODUCTION

In recent years, applications of mobile communication, video conferencing, and peer-to-peer internet telephony networks, such as SKYPE®, hands-free car kits, etc., have been advancing rapidly in modern daily life. In these applications, effective communication in noisy environments has been one of the pressing problems. To enhance speech quality, noise reduction (NR) technology has been extensively studied in the communication community. The main problem with most NR algorithms is that sheer NR does not necessarily lead to the general preference of the users. Overly aggressive NR schemes often result in processing artifacts and degradation of speech quality. How to effectively reduce background noise without impairing speech quality has become an imminent issue for NR algorithm design.

NR algorithms fall into three categories: spectral-subtraction algorithms, statistical-model-based algorithms, and subspace algorithms. Spectral-subtraction algorithms (Refs. 1–6) directly subtract the estimated noise spectrum from the spectrum of the noisy speech. Statistical-model-based algorithms estimate Fourier coefficients using statistically optimal linear or nonlinear estimators of clean signals. The Wiener algorithm (Refs. 7–10) and the minimum mean-square error (MMSE) algorithm (Refs. 1 and 11) belong to this class. Subspace algorithms are based on the principle that the vector space of the noisy signal can be decomposed into the “signal” and “noise” subspaces. Noise is suppressed by projecting the noisy signals onto the signal subspace and nullifying the components in the noise subspace. The decomposition of these two orthogonal subspaces can be done by using the singular value decomposition or the eigenvalue decomposition. The Karhunen–Loève transform (KLT) algorithm (Refs. 11 and 12) falls into this category. All NR algorithms require the information of noise spectra or noise covariance matrices, which must be estimated and updated from frame to frame. Noise estimation can be carried out either during speech pauses, which requires a voice activity detector (VAD), or continuously using time-recursive averaging (TRA) algorithms. A more comprehensive review of speech enhancement and NR methods can be found in the monograph by Loizou (Ref. 11).

In this paper, a MMSE-NR algorithm based on TRA (Refs. 11 and 13) noise estimation (denoted as MMSE-TRA-NR) is investigated. This algorithm is found to be very sensitive to the choice of two recursion parameters. To address this problem in a more systematic manner, this paper proposes an optimization method to efficiently search for the optimal parameters of the MMSE-TRA-NR algorithms. A global optimization technique, the simulated annealing (SA) algorithm (Refs. 14–16), is exploited for locating the optimal parameters. The objective function is a combined objective measure for NR and the incurred distortion of processed signals. Sensitivity analysis of the TRA parameters obtained using the SA optimization was undertaken for nine types of background noise. In addition to the optimized MMSE-TRA-NR, the possibility of using linear prediction coding (LPC) (Refs. 6 and 17–19) as a preprocessor to the NR algorithm is also explored.

In order to evaluate the proposed optimized algorithm and the other NR algorithms, objective and subjective tests were carried out. The objective tests were conducted according to ITU-T P.862 (Ref. 20). The subjective listening tests were conducted according to ITU-T P.835 (Ref. 21). The test data were processed by using analysis of variance (ANOVA) to justify the statistical significance of the difference among the NR algorithms. A post hoc test, Tukey’s HSD, was also employed in the paired comparison between the NR algorithms.

a) Author to whom correspondence should be addressed. Electronic mail:

II. NOISE REDUCTION ALGORITHMS

Figure 1 illustrates the general three-step structure of NR algorithms (Ref. 11). The noisy signal is forward transformed using unitary transformations, e.g., the Fourier transform, the discrete cosine transform, and the KLT. Next, gain modification, the major NR operation, takes place in the transformed domain. Finally, the time-domain signal of the enhanced speech is recovered by an overlap-and-add procedure. In this section, the MMSE-NR algorithm will be reviewed. The other traditional NR algorithms to be compared in this paper, such as the spectral subtraction, the Wiener filtering, and the KLT, are only mentioned in Sec. I with references.

A. Statistical-model-based noise reduction algorithm

The MMSE-NR algorithm is also based on a statistical model. Instead of the complex spectrum as in the Wiener filter method, a nonlinear estimator of the magnitude spectrum is optimized in the MMSE-NR algorithm. It is assumed that the discrete Fourier transform (DFT) coefficients are statistically independent and follow the Gaussian distribution. The mean-square error between the estimated ($\hat{S}_k$) and the true ($S_k$) magnitudes of the clean speech signal is

$$E_{\mathrm{mse}} = E\{(\hat{S}_k - S_k)^2\}. \tag{1}$$

This expectation can be estimated using the following Bayesian mean-square error approach:

$$B_{\mathrm{mse}}(\hat{S}_k) = \iint (S_k - \hat{S}_k)^2\, p(\mathbf{Y}, S_k)\, d\mathbf{Y}\, dS_k, \tag{2}$$

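where $\mathbf{Y} = [Y(\omega_0)\; Y(\omega_1)\; \cdots\; Y(\omega_{N-1})]$ is the noisy speech spectrum and $p(\mathbf{Y}, S_k)$ is the joint probability density function (pdf). The posterior pdf of $S_k$ can be determined by using Bayes’ rule. Minimization of the Bayesian MSE with respect to $\hat{S}_k$ leads to the optimal MMSE estimator,

$$\hat{S}_k = E[S_k \mid Y(\omega_k)] = \int_0^{\infty} s_k\, p(s_k \mid Y(\omega_k))\, ds_k = \frac{\int_0^{\infty} s_k\, p(Y(\omega_k) \mid s_k)\, p(s_k)\, ds_k}{\int_0^{\infty} p(Y(\omega_k) \mid s_k)\, p(s_k)\, ds_k}, \tag{3}$$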

where $s_k$ is a realization of the random variable $S_k$ and $p(s_k \mid Y(\omega_k))$ is the conditional posterior pdf of $s_k$ under the observation $Y(\omega_k)$. Assuming that the pdf of the noise Fourier coefficients is Gaussian, it was shown by Ephraim and Malah that the statistically optimal MMSE magnitude estimator takes the form (Ref. 1)

$$\hat{S}_k = \frac{\sqrt{\pi}}{2}\, \frac{\sqrt{v_k}}{\gamma_k} \exp\!\left(-\frac{v_k}{2}\right) \left[(1 + v_k)\, I_0\!\left(\frac{v_k}{2}\right) + v_k\, I_1\!\left(\frac{v_k}{2}\right)\right] Y_k, \tag{4}$$

where $I_0(\cdot)$ and $I_1(\cdot)$ are the modified Bessel functions of the zeroth and the first order, respectively, $Y_k$ is the spectral magnitude of the noisy signal, and $v_k$ is defined by

$$v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k, \tag{5}$$

where $\gamma_k$ denotes the a posteriori signal-to-noise ratio (SNR) given by

$$\gamma_k = \frac{Y_k^2}{P_{vv}(\omega_k)} = \frac{Y_k^2}{E\{|V(\omega_k)|^2\}}. \tag{6}$$

In practice, the noise variance and hence the a priori SNR $\xi_k$ are unknown, given the noisy signal $y(n)$. Thus, the noise spectrum must be estimated prior to NR processing. First, the noise variance is estimated during speech pauses with the aid of a VAD (Ref. 22) provided the noise is stationary. For example, the following statistical-model-based VAD can be used:

$$\frac{1}{N} \sum_{k=1}^{N-1} \log\left\{\frac{1}{1 + \xi_k} \exp\left(\frac{\gamma_k \xi_k}{1 + \xi_k}\right)\right\} \underset{H_0}{\overset{H_1}{\gtrless}} \Delta, \tag{7}$$

where $N$ is the fast Fourier transform size, $H_0$ and $H_1$ denote the hypotheses of speech absence and speech presence, respectively, and the threshold $\Delta$ is usually set to 0.15. Here, the MMSE-NR algorithm used in conjunction with VAD for noise estimation is denoted as “MMSE-VAD-NR.” Next, the a priori SNR $\xi_k$ is estimated with a “decision-directed” approach using the recursive formula

$$\hat{\xi}_k(m) = a\, \frac{\hat{S}_k^2(m-1)}{P_{vv}(\omega_k, m-1)} + (1 - a) \max(\gamma_k(m) - 1, 0), \tag{8}$$

where $m$ is the frame number and $0 < a < 1$ is a weighting factor commonly chosen to be $a = 0.98$.
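The VAD decision of Eq. (7) and the decision-directed update of Eq. (8) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the caller is assumed to pass only the bins included in the sum of Eq. (7), and the logarithm of the likelihood ratio is expanded analytically to avoid overflow in the exponential.

```python
import math


def vad_statistic(xi, gamma, n_fft):
    """Left-hand side of Eq. (7) for per-bin a priori SNRs xi and a posteriori SNRs gamma."""
    total = 0.0
    for xi_k, gamma_k in zip(xi, gamma):
        # log{ exp(gamma*xi/(1+xi)) / (1+xi) } = gamma*xi/(1+xi) - log(1+xi)
        total += gamma_k * xi_k / (1.0 + xi_k) - math.log(1.0 + xi_k)
    return total / n_fft


def speech_present(xi, gamma, n_fft, threshold=0.15):
    """Decide H1 (speech present) when the statistic exceeds the threshold Delta."""
    return vad_statistic(xi, gamma, n_fft) > threshold


def decision_directed_xi(prev_mag_sq, prev_noise_psd, gamma, a=0.98):
    """A priori SNR estimate of Eq. (8) for one bin.

    prev_mag_sq    : squared magnitude estimate of the previous frame
    prev_noise_psd : noise power P_vv of the previous frame
    gamma          : a posteriori SNR of the current frame
    """
    return a * prev_mag_sq / prev_noise_psd + (1.0 - a) * max(gamma - 1.0, 0.0)


if __name__ == "__main__":
    print(speech_present([1.0, 1.0], [4.0, 4.0], n_fft=2))
    print(decision_directed_xi(4.0, 1.0, 3.0))
```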

As mentioned above, Eq. (4) is only a spectral magnitude estimator. To recover the enhanced signal, one needs to estimate the phase of the clean speech signal. It was shown by Ephraim and Malah (Ref. 1) that the optimal phase estimate is simply the noisy phase. Thus, the enhanced complex signal spectrum is calculated by combining the preceding estimated magnitude spectrum $\hat{S}_k$ and the noisy signal phase spectrum $\theta_y(\omega_k)$, i.e., $\hat{S}(\omega_k) = \hat{S}_k \exp(j\theta_y(\omega_k))$.
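The recombination of the estimated magnitude with the noisy phase amounts to one line per frequency bin. The following sketch, with a hypothetical function name, shows the operation for a single complex DFT coefficient of the noisy signal.

```python
import cmath


def enhanced_coefficient(mag_estimate, noisy_coefficient):
    """Combine the estimated clean magnitude with the phase of the noisy DFT coefficient."""
    theta_y = cmath.phase(noisy_coefficient)   # noisy phase
    return mag_estimate * cmath.exp(1j * theta_y)


if __name__ == "__main__":
    # A noisy coefficient with phase pi/2 and an estimated magnitude of 0.5.
    print(enhanced_coefficient(0.5, complex(0.0, 2.0)))
```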

III. ENHANCEMENT OF MMSE-NR ALGORITHMS

In this section, three approaches of technical refinement are exploited to enhance the aforementioned MMSE-NR algorithm.

FIG. 1. Block diagram labels: noisy signal, unitary transform (forward, analysis), gain modification ($g_1, g_2, \ldots, g_N$), unitary transform (inverse, synthesis), enhanced speech.

A. MMSE-time recursive averaging noise reduction

As mentioned earlier in the MMSE-VAD-NR algorithm, the noise variance can be estimated and updated during speech pauses via a VAD provided the noise is stationary. In practice, however, many background noises are transient and nonstationary. For background noise of this kind, a more practical noise estimation algorithm called the TRA algorithm (Ref. 13) can be used.

In the TRA algorithm, the noise variance $\hat{\sigma}_v^2(\lambda, k)$ at the frame $\lambda$ and the frequency $k$ is estimated with the following recursive formula:

$$\hat{\sigma}_v^2(\lambda, k) = \alpha(\lambda, k)\, \hat{\sigma}_v^2(\lambda - 1, k) + (1 - \alpha(\lambda, k))\, |Y(\lambda, k)|^2, \tag{9}$$

where $|Y(\lambda, k)|$ is the noisy speech magnitude spectrum and $\alpha(\lambda, k)$ is a time- and frequency-dependent smoothing factor. The smoothing factor $\alpha$ in the one-pole recursive formula is used to avoid excessive fluctuations during the process of noise estimation. Various algorithms have been proposed to determine the smoothing factor $\alpha(\lambda, k)$ on the basis of the estimated SNR or the probability of speech presence. In this paper, a SNR-based smoothing factor $\alpha(\lambda, k)$ is selected to follow a sigmoid function,

$$\alpha(\lambda, k) = \frac{1}{1 + e^{-\beta[\gamma_k(\lambda) - \delta]}}, \tag{10}$$

where $\beta$ and $\delta$ are constants and the a posteriori SNR $\gamma_k(\lambda)$ is calculated by averaging the estimated noise variance in the past ten frames,

$$\gamma_k(\lambda) = \frac{|Y(\lambda, k)|^2}{\frac{1}{10} \sum_{m=1}^{10} \hat{\sigma}_v^2(\lambda - m, k)}. \tag{11}$$

Figure 2 plots the smoothing factor $\alpha$ for different values of the parameter $\beta$ and $\delta = 1$. Equations (10) and (11) can be interpreted as follows. If the speech is present, the a posteriori SNR $\gamma_k(\lambda)$ will be large, and therefore $\alpha(\lambda, k) \approx 1$. In this case, the noise update will cease and the noise estimate will remain the same as that of the previous frame [the first term of Eq. (9)]. Conversely, if the speech is absent, the a posteriori SNR $\gamma_k(\lambda)$ will be small, and therefore $\alpha(\lambda, k) \approx 0$. That is, the noise estimate will follow the power spectral density of the noisy spectrum [the second term of Eq. (9)]. In a long stationary noise period, $\alpha$ would stay at a very small value. As a consequence, $\hat{\sigma}_v^2(\lambda, k) \approx |Y(\lambda, k)|^2$. This ensures an accurate and robust estimation of noise level, which gives rise to good reduction performance. Thus, $\alpha$ is strongly dependent on the a posteriori SNR $\gamma_k(\lambda)$. The choice of parameters $\beta$ and $\delta$ dictates the slope and the location of the transition of the sigmoid function. This transition can be considered as a “soft switch” between the two states of speech presence and absence. The selection of these two parameters is crucial to the resulting NR performance, as will be explored in the subsequent sections.
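The recursion of Eqs. (9)–(11) can be sketched for a single frequency bin as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the ten-frame history is seeded with an initial noise estimate `init_noise` (the text does not say how the recursion is started), and `init_noise` is assumed to be positive.

```python
import math
from collections import deque


def smoothing_factor(gamma, beta, delta):
    """Sigmoid smoothing factor alpha of Eq. (10)."""
    return 1.0 / (1.0 + math.exp(-beta * (gamma - delta)))


def tra_noise_track(power_frames, beta, delta, init_noise, history=10):
    """Track the noise variance of one bin over frames using Eqs. (9)-(11).

    power_frames : sequence of |Y(lambda, k)|^2 values, one per frame
    init_noise   : assumed starting noise variance used to seed the history
    """
    past = deque([init_noise] * history, maxlen=history)
    estimates = []
    for power in power_frames:
        gamma = power / (sum(past) / history)                  # Eq. (11)
        alpha = smoothing_factor(gamma, beta, delta)           # Eq. (10)
        estimate = alpha * past[-1] + (1.0 - alpha) * power    # Eq. (9)
        past.append(estimate)
        estimates.append(estimate)
    return estimates


if __name__ == "__main__":
    # A speech-like burst (large power) should leave the estimate nearly unchanged
    # when the sigmoid is steep.
    print(tra_noise_track([1.0, 100.0, 1.0], beta=20.0, delta=1.5, init_noise=1.0))
```

With a steep sigmoid (large $\beta$) the update behaves almost like a hard switch, whereas a small $\beta$ blends the previous estimate and the current power over a wide range of SNR values.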

When noise is strong and the SNR becomes rather low, the distinction between speech and noise segments can be difficult. Moreover, with a VAD the noise is estimated intermittently and updated only during the speech silent periods. This may cause problems if the noise is nonstationary, which is the case in many applications. The recursive nature of the TRA algorithm enables estimating noise variance continuously, even during speech activities, which is advantageous in dealing with nonstationary noises. Figure 3 compares NR performance between the VAD and the TRA algorithms. The test signal is a speech signal corrupted by random noise (solid line) varied with three different levels (low-high-medium). The noise (dotted line) estimated using the VAD and the TRA algorithms is also superimposed in the left side of the top and the bottom panels in Fig. 3, respectively. Unlike the VAD algorithm, which fails to respond to the noise level variation, the TRA is capable of estimating noise with drastically transient fluctuation. In other words, VAD and TRA deal with different noise scenarios. VAD is suited for the estimation of stationary noise during speech absence, while TRA is preferred for estimating transient noise, where synchronization of noise estimation is crucial. As a result, a marked difference in NR performance is observed in the enhanced signals using these two noise estimation methods. The right side of the top and the bottom panels in Fig. 3 shows the signals (dotted lines) processed by the MMSE-NR using VAD and TRA, respectively, for noise estimation. The noisy signals (solid lines) are also superimposed to ease comparison. Clearly, the TRA is superior to the VAD in estimating nonstationary background noise. Thus, the MMSE with TRA noise estimation (denoted as MMSE-TRA-NR) will be employed in the following presentation.

B. Intelligent tuning of the parameters for the MMSE-TRA-NR algorithm

As mentioned previously, the parameters $\beta$ and $\delta$ are used in the sigmoid function of the TRA algorithm for noise estimation. Conventionally, choices such as $\delta = 1.5$ and $15 \le \beta \le 30$ are recommended in the literature (Ref. 11). To our surprise, we found that these two parameters $\beta$ and $\delta$ have a profound impact on noise estimation and hence on the NR performance of the MMSE-TRA-NR algorithm. Therefore, it is worth exploring how to adjust these two parameters such that NR performance can be maximized without too much speech quality degradation. In the following, a procedure based on the SA optimization method is presented for automated tuning of the TRA parameters.

FIG. 2. The smoothing factor $\alpha(\lambda, k)$ calculated according to Eq. (10) for different values of the parameter $\beta$ with $\delta = 1$. (Solid line: $\beta = 5$; dashed line: …)

1. Simulated annealing algorithm

SA is a generic probabilistic meta-algorithm for the global optimization problem, namely, locating a good approximation to the global optimum of a given function in a large search space (Refs. 14–16). SA is well suited for solving global optimization problems with many local optima. The flowchart of the SA is illustrated in Fig. 4. In the SA method, each point in the search space is analogous to the thermal state of the annealing process in metallurgy. The objective function $Q$ to be maximized is analogous to the internal energy of the system in that state. The goal of the search is to bring the system from an initial state to a randomly generated state with the maximum possible objective function. Two conditions are used to determine whether or not to accept a new state. If the objective function is increased, the new state is always accepted. Conversely, if the objective function is decreased, the new state is accepted only if the following condition holds:

$$p_{SA} = \exp(\Delta Q / T) > \varphi, \tag{12}$$

where $p_{SA}$ is the acceptance probability function, $\Delta Q$ denotes the increment of the objective function, $T$ is the temperature that follows a certain annealing schedule, and $\varphi$ is a random number generated subject to the uniform distribution on the interval [0, 1]. It follows that the system may possibly move to a new state that is “worse” than the present one. It is this mechanism that prevents the search from being trapped in a local maximum.

Initially, the high temperature $T$ results in a high probability of accepting a move that decreases the objective function, which is analogous to a steel piece whose thermal state is highly active at high temperatures. As the annealing process goes on and $T$ decreases, the probability of accepting such a move becomes increasingly small until the search finally converges to a stable solution.

A simple annealing schedule is the exponential cooling, which begins at some initial temperature $T_0$ and decreases temperature in steps according to

$$T_{k+1} = \alpha_c T_k, \tag{13}$$

where $0 < \alpha_c < 1$ is a cooling factor. It is likely that a number of moves are accepted at each temperature before proceeding to the new state. The SA search is terminated at some final value $T_f$. An empirical choice for $\alpha_c$ is 0.95, and $T_0$ should be chosen such that the initial acceptance probability is higher than 0.8.
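The acceptance rule of Eq. (12) and the exponential cooling of Eq. (13) can be sketched as a short maximization loop. The sketch below is an illustration under assumptions, not the authors' code: the uniform random perturbation, the `step` size, the parameter `bounds`, the bookkeeping of the best state seen so far, and the toy objective (a smooth bump peaking at $\beta = 0.6$, $\delta = 0.5$) are all hypothetical choices standing in for the real objective $Q$, which requires running the NR algorithm.

```python
import math
import random


def simulated_annealing(objective, x0, step, bounds,
                        t0=1.0, tf=1e-9, cooling=0.95, moves_per_temp=1, seed=0):
    """Maximize objective(x) with the acceptance rule of Eq. (12) and cooling of Eq. (13)."""
    rng = random.Random(seed)
    x, q = list(x0), objective(x0)
    best_x, best_q = list(x), q          # best state seen so far (bookkeeping only)
    t, levels = t0, 0
    while t > tf:
        for _ in range(moves_per_temp):
            # Random perturbation, clipped to the allowed parameter range.
            cand = [min(max(xi + rng.uniform(-step, step), lo), hi)
                    for xi, (lo, hi) in zip(x, bounds)]
            q_new = objective(cand)
            dq = q_new - q
            # Always accept improvements; accept a worse state if exp(dQ/T) > phi.
            if dq >= 0 or math.exp(dq / t) > rng.random():
                x, q = cand, q_new
                if q > best_q:
                    best_x, best_q = list(x), q
        t *= cooling                     # Eq. (13)
        levels += 1
    return best_x, best_q, levels


def toy_objective(params):
    """Hypothetical stand-in for Q with a single peak at beta = 0.6, delta = 0.5."""
    beta, delta = params
    return -(beta - 0.6) ** 2 - (delta - 0.5) ** 2


if __name__ == "__main__":
    best, q, levels = simulated_annealing(
        toy_objective, [1.6, 1.0], step=0.1, bounds=[(0.0, 2.0), (0.0, 2.0)])
    print(best, q, levels)
```

With $T_0 = 1$, $T_f = 10^{-9}$, and $\alpha_c = 0.95$, the loop visits 405 temperature levels, since $0.95^{404}$ is still slightly above $10^{-9}$.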

FIG. 3. Comparison of the VAD and TRA algorithms. The noise estimated using the VAD and the TRA algorithms are superimposed in the left side of the top and the bottom panels. The processed speech signals using the MMSE-VAD-NR and MMSE-TRA-NR algorithms are superimposed in the right side of the top and the bottom panels.


2. Objective function Q

Two objective measures, the segmental SNR (denoted as SNRseg) and the perceptual evaluation of speech quality (PESQ) (Ref. 21), are considered in constructing the objective function for optimizing the performance of the MMSE-TRA-NR algorithm. The index SNRseg calculates the SNR based on the noisy signals and the processed signals averaged over frames,

$$\mathrm{SNRseg} = \frac{10}{M_s} \sum_{m=0}^{M_s - 1} \log_{10} \frac{\sum_{n=N_s m}^{N_s m + N_s - 1} s^2(n)}{\sum_{n=N_s m}^{N_s m + N_s - 1} (s(n) - \hat{s}(n))^2}, \tag{14}$$

where $N_s$ is the frame length and $M_s$ is the number of frames. The SNRseg is a widely used objective measure for assessing NR performance in the telephony industry. The index PESQ is a more sophisticated objective measure for assessing speech quality, which takes into account psychoacoustic aspects of human hearing. The original and the processed signals are first level-equalized to a standard listening level and filtered by a filter with a response similar to that of a standard telephone handset. The signals are aligned in time to correct for time delays and then processed through an auditory transform to obtain the loudness spectra. More detailed information on the PESQ can be found in ITU-T P.862 (Ref. 21).

SNRseg and PESQ reflect the NR performance and the sound quality, respectively, of the processed signals. Hence, an objective function $Q$ is constructed by combining the SNRseg and the PESQ using a weighting factor $r$, i.e.,

$$Q = r \times \mathrm{SNRseg} + \mathrm{PESQ}. \tag{15}$$
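A minimal sketch of Eqs. (14) and (15) is given below. It assumes the usual segmental SNR form with frame-wise sums of the reference signal energy and the error energy, uses non-overlapping frames, and adds a small constant `eps` to avoid division by zero; these details and the function names are assumptions of the sketch. The PESQ score is taken as an input because computing it requires an implementation of ITU-T P.862.

```python
import math


def segmental_snr(reference, processed, frame_len, eps=1e-12):
    """Segmental SNR in dB, averaged over non-overlapping frames, following Eq. (14)."""
    n_frames = len(reference) // frame_len
    total = 0.0
    for m in range(n_frames):
        start, stop = m * frame_len, (m + 1) * frame_len
        signal_energy = sum(s * s for s in reference[start:stop])
        error_energy = sum((s - e) ** 2
                           for s, e in zip(reference[start:stop], processed[start:stop]))
        total += math.log10((signal_energy + eps) / (error_energy + eps))
    return 10.0 * total / n_frames


def objective_q(snrseg, pesq, r=1.867):
    """Combined objective of Eq. (15); pesq is assumed to come from a P.862 tool."""
    return r * snrseg + pesq


if __name__ == "__main__":
    ref = [1.0, 1.0, 1.0, 1.0]
    out = [0.9, 0.9, 0.9, 0.9]
    print(segmental_snr(ref, out, frame_len=2))
    print(objective_q(1.0, 2.0))
```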

The weighting factor $r$ will be found from a subjective listening test. Two kinds of background noise at the SNR level of 5 dB, white noise and car noise, were processed using five NR algorithms including spectral subtraction, Wiener filtering, MMSE-VAD-NR, MMSE-TRA-NR, and KLT-NR. The TRA parameters in MMSE-TRA-NR are chosen to be $\beta = 0.6$ and $\delta = 1.5$. Figure 6 shows the clean speech signal used in the simulation. The test signal is a male speech sentence sampled at 8 kHz and separated into 25 ms frames with 50% overlap. The test signals are 2 s in duration. All test signals were adjusted to the same level of loudness. A headset was used as the means of audio rendering.

Owing to the space limitation, we show only the results processed using the MMSE-TRA-NR algorithm. Figures 5(a) and 5(b) show the spectrograms of the noise and the signal processed by the MMSE-TRA-NR algorithm for the white noise case. Figures 6(a) and 6(b) show the spectrograms of the noise and the signal processed by the MMSE-TRA-NR algorithm for the car noise case. Thirty-two experienced listeners participated in the listening test. Three subjective indices including NR, sound quality, and total preference were employed in the listening test. The grading scale is set to be −3 to 3. A multiple regression analysis based on five NR algorithms and two background noises was utilized to establish a linear model between the NR, sound quality, and total preference. The results of the multiple regression analysis determine the weighting factors between the SNRseg and the PESQ for the objective function. This gives the weighting factor in Eq. (15), $r = 1.867$, which will be used in the objective function in optimizing the MMSE-TRA-NR algorithm using the SA method next.

3. SA optimization of the MMSE-TRA-NR algorithm

The objective function with $r = 1.867$ is employed in the SA optimization of the MMSE-TRA-NR algorithm. Initially, the TRA parameters are arbitrarily chosen to be $\beta = 1.6$ and $\delta = 1$. The parameters of SA are chosen as $T_0 = 1$ K, $T_f = 10^{-9}$ K, and $\alpha_c = 0.95$. With the SA optimization, the optimal parameters are obtained for the white noise ($\beta = 0.6117$ and $\delta = 0.5214$) and the car noise ($\beta = 0.7128$ and $\delta = 0.5265$). Figure 7 shows the “learning curve” of the SA for the car noise scenario, where the objective function $Q$ settles to a constant value after about 500 iterations. To see the effect of optimization, NR performances in terms of the SNRseg and PESQ attained using the initial and the optimal parameters $\beta$ and $\delta$ are compared in Table I. In comparison with the initial nonoptimal setting, a marked improvement in performance is obtained using the optimal TRA parameters.

FIG. 5. (Color online) The spectrograms of a male speech sentence in the white noise scenario, plotted as frequency (Hz) versus time (sec). (a) Speech corrupted by the white noise. (b) Enhanced speech signal processed by the MMSE-TRA-NR algorithm.

To further justify the optimized NR algorithm, a subjective listening test was conducted. The test speech signal and the test conditions are the same as those used in the listening test for the preceding regression analysis. The grading scale is set to be 1–5, as recommended by ITU-T P.835 (Ref. 22). Three subjective indices, including scale of signal distortion (SIG), scale of background intrusiveness (BAK), and scale of overall quality (OVL), were employed in the listening test. Every subject participating in the test was instructed with the definitions of the subjective indices prior to the listening test. Figures 8(a) and 8(b) show the results of the listening test for the white noise and car noise, respectively. The grades were also processed by using the Multivariate Analysis of Variance (MANOVA) (Ref. 23) to justify the statistical significance of the test results. The averages (a 5%–95% bracket is shown in the figure) and the significance levels of the grades are summarized in Table II. Cases with significance levels below 0.05 indicate that a statistically significant difference exists among methods. Although there is no significant difference in OVL, the difference in SIG and BAK between the initial and optimal results is significant. The trade-off between NR (BAK) and signal distortion (SIG) is clearly visible: the optimized algorithm has attained remarkable NR performance at some expense of speech quality. Thus, we choose the optimized MMSE-TRA-NR algorithm for the following objective and subjective comparison with several other NR algorithms.

C. Linear prediction coding preprocessor

FIG. 6. (Color online) The spectrograms of a male speech sentence in the car noise scenario, plotted as frequency (Hz) versus time (sec). (a) Speech corrupted by the car noise. (b) Enhanced speech signal processed by the MMSE-TRA-NR algorithm.

FIG. 7. (Color online) The learning curve of the SA optimization algorithm applied to the car noise (objective function $Q$ in car noise at SNR 5 dB).

TABLE I. The NR performance of the MMSE-TRA-NR algorithm in terms of the objective measures SNRseg and PESQ for the initial and the optimized parameters $\beta$ and $\delta$ (the optimal parameters are marked with *).

| Noise type | $\beta$ | $\delta$ | SNRseg | PESQ | Q |
|---|---|---|---|---|---|
| White noise | 1.6 | 1 | −1.0942 | 1.9639 | −0.1984 |
| White noise | 0.6117* | 0.5214* | 1.5155 | 2.1619 | 4.8106 |
| Car noise | 1.6 | 1 | −1.5609 | 2.2168 | −0.2998 |
| Car noise | 0.7128* | 0.5265* | 0.7061 | 2.3145 | 3.9544 |

Another possibility of enhancing NR algorithms is to use LPC as the preprocessor. The underlying idea is that the highly correlated portion of human speech can be extracted by using the LPC approach. The timbral quality of voice is preserved as the spectral envelope is captured using the LPC. Figure 9(a) illustrates the one-step forward linear prediction problem (Refs. 17–19). The current input $x(n)$ is predicted by a linear combination of past input samples,

$$\hat{x}(n) = \sum_{k=1}^{p} A_k x(n - k), \tag{16}$$

where $p$ is the prediction order and $A_k$ are the prediction coefficients. The associated prediction finite impulse response (FIR) filter is

$$P(z) = \sum_{k=1}^{p} A_k z^{-k}. \tag{17}$$

By minimizing the mean square of the one-step forward prediction error, $E_p = E\{e^2(n)\} = E\{[x(n) - \hat{x}(n)]^2\}$, the following equation for the linear prediction problem can be derived:

$$\sum_{k=0}^{p} A_k \gamma_{xx}(l - k) = \begin{cases} E_p^f, & l = 0 \\ 0, & l = 1, 2, \ldots, p, \end{cases} \tag{18}$$

where $E_p^f$ is the mean-square forward prediction error of order $p$ and

$$\gamma_{xx}(m) = E\{x^*(n) x(n + m)\} = \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{N} x^*(n) x(n + m) \tag{19}$$

is the autocorrelation sequence. The optimal LPC coefficients of the prediction filter can be efficiently calculated by using the Levinson–Durbin algorithm. With the LPC coefficients, the noisy input can be preprocessed by using the prediction filter $P(z)$ in Eq. (17) to extract the correlated input with minimal timbral distortion for the MMSE-TRA-NR module. Figure 9(b) illustrates a MMSE algorithm concatenated with the LPC as its preprocessor (denoted as LPC-MMSE-TRA-NR).
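The Levinson–Durbin recursion and the prediction filter of Eq. (17) can be sketched in a few lines. The sketch below is an illustration under assumptions: it handles real-valued signals only, uses a finite-length (biased) autocorrelation estimate in place of the limit in Eq. (19), and adopts the sign convention of Eq. (16), so that the coefficients satisfy $\sum_{k=1}^{p} A_k \gamma_{xx}(l - k) = \gamma_{xx}(l)$ for $l = 1, \ldots, p$. The function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate of a real sequence for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor x_hat(n) = sum_k A_k x(n-k).

    r     : autocorrelation values r[0..order]
    Returns (coefficients [A_1..A_p], mean-square prediction error).
    """
    a = [0.0] * (order + 1)              # a[k] holds A_k; a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        refl = acc / err                 # reflection coefficient of stage i
        prev = a[:]
        a[i] = refl
        for k in range(1, i):
            a[k] = prev[k] - refl * prev[i - k]
        err *= 1.0 - refl * refl
    return a[1:], err


def predict(x, coeffs):
    """Apply the prediction filter P(z) of Eq. (17); samples before n = 0 are taken as zero."""
    p = len(coeffs)
    return [sum(coeffs[k - 1] * x[n - k] for k in range(1, p + 1) if n - k >= 0)
            for n in range(len(x))]


if __name__ == "__main__":
    # Autocorrelation of a first-order process with correlation 0.5 between neighbors.
    coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(coeffs, err)
    print(predict([1.0, 2.0, 4.0], coeffs))
```

For the first-order example, the recursion returns $A_1 = 0.5$ and $A_2 = 0$, as expected for a signal whose correlation decays geometrically.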

IV. OBJECTIVE AND SUBJECTIVE EVALUATIONS OF NR ALGORITHMS

Objective and subjective experiments were undertaken to compare the proposed optimized LPC-MMSE-TRA-NR algorithm with a number of other widely used NR algorithms.

A. Performance evaluation of NR algorithms by objective measures

The preceding objective measures SNRseg and PESQ are employed to assess the performance of six NR algorithms (spectral subtraction, Wiener filtering, MMSE-VAD-NR, MMSE-TRA-NR, LPC-MMSE-TRA-NR, and KLT-NR algorithms) for the speech signal corrupted by two kinds of background noise (white noise and car noise). All test signals and conditions are similar to those used in the previous test.

According to Table III, the Wiener filtering algorithm tends to underestimate noise level and yield high residual noise (or low SNRseg). The KLT-NR algorithm attains the highest SNRseg. In addition, LPC seems to slightly improve the speech quality over the MMSE-TRA-NR algorithm for the white noise case. As for the PESQ objective evaluation, there seems to be no significant difference in speech quality resulting from these NR algorithms.

FIG. 8. (Color online) Comparison of the MMSE-TRA-NR algorithm with and without SA optimization. The results of the listening test are processed by using the MANOVA. (a) White noise. (b) Car noise.

TABLE II. The MANOVA output of the subjective listening test to compare the MMSE-TRA-NR algorithm with and without optimization. The background noises are the white noise and the car noise. Cases with significance value p below 0.05 indicate that a statistically significant difference exists among all methods.

| Noise type | SIG (significance value) | BAK (significance value) | OVL (significance value) |
|---|---|---|---|
| White noise | 0.040 | 0.000 | 0.117 |
| Car noise | 0.017 | 0.000 | 0.784 |

FIG. 9. The NR algorithm cascaded with a LPC preprocessor. (a) Feedforward linear prediction structure, with input $x(n)$, prediction filter $P(z)$, prediction $\hat{x}(n)$, and error $e(n)$. (b) The cascaded LPC-NR system.

B. Performance evaluation of NR algorithms by subjective measures

In order to further compare the preceding NR algorithms, subjective listening tests were conducted according to ITU-T P.835 (Ref. 22). Thirty-two experienced listeners participated in the subjective tests. The six NR algorithms used in the objective test are compared again in this subjective test. The test signals and conditions remain the same as the preceding listening tests (Table IV). The mean and spread of the listening test results are shown in Figs. 10(a) and 10(b). The test results were processed using MANOVA (Ref. 23) with significance levels summarized in Table V. Cases with significance levels below 0.05 indicate that a statistically significant difference exists among methods. From Table V, the difference of the indices SIG, BAK, and OVL among the NR methods was found to be statistically significant (except for OVL in the car noise scenario).

TABLE III. Comparison of processing time and objective NR measures for six NR algorithms.

| NR algorithm | SNRseg (white noise) | SNRseg (car noise) | PESQ (white noise) | PESQ (car noise) |
|---|---|---|---|---|
| Spectral subtraction | 2.115 | 1.450 | 2.224 | 2.118 |
| Wiener filtering | 0.878 | 0.073 | 2.162 | 2.322 |
| MMSE-VAD-NR | 2.215 | 1.224 | 2.250 | 2.394 |
| MMSE-TRA-NR | 1.515 | 0.7061 | 2.161 | 2.314 |
| LPC-MMSE-TRA-NR | 1.439 | 0.3110 | 2.234 | 2.162 |
| KLT-NR | 3.177 | 1.856 | 2.400 | 2.367 |

TABLE IV. The optimal parameters $\beta$ and $\delta$ obtained using the SA search for nine types of background noise (babble, station, car, airport, street, train, exhibition, restaurant, and white noise).

| Background noise | Optimal $\beta$ | Optimal $\delta$ |
|---|---|---|
| White noise | 0.6117 | 0.5214 |
| Babble | 0.7178 | 0.8710 |
| Station | 0.6889 | 0.5350 |
| Car | 0.7128 | 0.5265 |
| Airport | 0.6259 | 0.5016 |
| Street | 0.5266 | 0.5016 |
| Train | 0.4609 | 0.5043 |
| Exhibition | 0.5440 | 0.5026 |
| Restaurant | 0.5103 | 0.5310 |

FIG. 10. (Color online) Comparison of six NR algorithms in time-domain waveforms. (a) The noisy and processed signals in the white noise condition. (b) The noisy and processed signals in the car noise condition (dotted line: noisy speech signals; solid line: processed speech signals).

TABLE V. The MANOVA output of the listening test of the six NR algorithms. Cases with significance value p below 0.05 indicate that a statistically significant difference exists among all methods.

| Noise type | SIG (significance value p) | BAK (significance value p) | OVL (significance value p) |
|---|---|---|---|
| White noise | 0.008 | 0.000 | 0.008 |

Next, a post hoc Tukey HSD test (Ref. 23) was employed to perform multiple paired comparisons of the NR algorithms. Post hoc tests are generally performed after ANOVA, which is able to determine whether or not a significant difference is present in the data of a number of cases. Tukey’s HSD test is one of the commonly used post hoc tests for the assessment of differences in the means between pairs of populations following the ANOVA test. Table VI summarizes the results of the test in terms of the subjective indices SIG, BAK, and OVL. To facilitate the comparison, the NR algorithms that have attained good subjective performance (with no statistical difference) are marked with asterisks in the table. In Figs. 10(a) and 10(b), surprisingly, in contrast to the results of objective evaluation, the KLT-NR algorithm performed quite poorly in SIG for all noise conditions. The price paid for high NR using the KLT-NR algorithm is obviously the signal distortion, which was noticed by many subjects. Despite the excellent performance in SIG, the Wiener filtering algorithm received the lowest scores in BAK for all noise conditions, which is consistent with the observation in the objective evaluation. The spectral-subtraction algorithm received the lowest grade in BAK for all noise conditions because of the “musical noise” problem (Ref. 11), which is quite disturbing to the listeners. There is no significant difference in OVL among all NR algorithms for the car noise scenario. The spectral-subtraction and KLT-NR algorithms received lower scores in OVL than the other algorithms in the white noise case. It can be concluded that the MMSE-VAD-NR, MMSE-TRA-NR, and LPC-MMSE-TRA-NR algorithms are superior to the other algorithms.

Overall, these three algorithms performed equally well in terms of all subjective indices in the two noise scenarios. For background noise with rapidly varying levels, however, the MMSE-TRA-NR algorithm should be more practical than the MMSE-VAD-NR algorithm. The LPC preprocessor may contribute to enhancing the NR algorithms, although this observation is not statistically significant.
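As a rough illustration of the ANOVA step that precedes the Tukey HSD test, the sketch below computes the one-way F statistic for a single subjective index. The function name `one_way_anova_f` and the scores are hypothetical and are not taken from the paper. The significance value p of the kind reported in Table V would additionally require the F distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of score lists, one list per NR algorithm.
    Assumes at least two groups and a nonzero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical listening scores (NOT data from the paper), one list per algorithm.
scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(scores)
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom corresponds to a small p, which is the condition under which a pairwise post hoc test becomes meaningful.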

C. Sensitivity analysis in the MMSE-TRA-NR algorithm

In this section, a sensitivity analysis is presented to demonstrate the effect of the choice of TRA parameters. The SA method is employed to search for the optimal parameters β and δ of the aforementioned MMSE-TRA-NR algorithm in dealing with nine types of background noise at the SNR level of 5 dB. These nine types of noise include babble, station, car, airport, street, train, exhibition, restaurant, and white noise, which were taken from the database of Ref. 11. The optimal parameters β and δ summarized in Table IV are plotted in a scatter diagram in Fig. 11 for each noise condition. It is worth noting that the NR performance of the MMSE-TRA-NR algorithm is very sensitive to the choice of the parameter β. The optimal parameter β falls in the range of 0 ≤ β ≤ 1 for all noise conditions, which is quite different from the values of 15 ≤ β ≤ 30 recommended in Ref. 11. By contrast, the optimal parameter δ is relatively constant (≈0.5) for all types of background noise except for “babble” (δ = 0.871), which is also different from the value δ = 1.5 recommended in Ref. 11. The parameter δ should be in the range of 0.5 ≤ δ ≤ 1.5 because δ decides the transition point of the sigmoid function in the previous TRA algorithm. The transition point can be considered as a threshold to discriminate speech presence from speech absence according to the a posteriori SNR. In the present study, the more reasonable range of 0.5 ≤ δ ≤ 1.5 is therefore recommended.
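The roles of β and δ can be illustrated with a minimal sketch. It assumes that the sigmoid smoothing factor takes the common form α = 1/(1 + exp(−β(γ − δ))), where γ is the a posteriori SNR, and that the noise spectrum is updated by first-order recursive averaging. This form, the single-bin simplification, and the function names are assumptions made for illustration; the exact expressions of the TRA algorithm are those defined earlier in the paper.

```python
import math


def smoothing_factor(posterior_snr, beta, delta):
    """Sigmoid mapping from a posteriori SNR to a smoothing factor in (0, 1).

    Assumed form: alpha = 1 / (1 + exp(-beta * (snr - delta))).
    delta sets the transition point, beta sets the steepness.
    """
    return 1.0 / (1.0 + math.exp(-beta * (posterior_snr - delta)))


def update_noise_psd(prev_noise_psd, noisy_power, beta, delta):
    """One time-recursive averaging step for a single frequency bin."""
    posterior_snr = noisy_power / prev_noise_psd
    alpha = smoothing_factor(posterior_snr, beta, delta)
    # alpha near 1 (speech likely present): hold the previous noise estimate.
    # alpha near 0 (speech likely absent): track the current noisy power.
    return alpha * prev_noise_psd + (1.0 - alpha) * noisy_power


# Steepness comparison at the same a posteriori SNR and transition point.
print(smoothing_factor(2.0, 20.0, 1.5))  # steep sigmoid, beta in the 15-30 range
print(smoothing_factor(2.0, 0.7, 1.5))   # gradual sigmoid, beta below 1
```

Under this assumed form, α equals 0.5 exactly when γ = δ. A large β makes the switch between holding and updating the noise estimate nearly binary, whereas β below 1 yields a gradual transition, which is one way to read the sensitivity to β reported above.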

V. CONCLUSIONS

An optimized MMSE-TRA-NR algorithm has been presented. The SA optimization technique is exploited to search for the optimal TRA parameters, especially the parameter β, which has a profound impact on the estimation of the noise spectrum and hence the resulting NR performance of the algorithm. The optimal parameter β generally falls in the range of 0 ≤ β ≤ 1, whereas the optimal parameter δ stays at a relatively constant value of about 0.5 for many types of background noise. In addition, an LPC preprocessor has been presented to enhance the MMSE-TRA-NR algorithm.
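The SA search over (β, δ) can be sketched as follows, under stated assumptions. The regression-based objective function of the paper is not reproduced here, so `toy_objective` is a hypothetical stand-in with several local minima. The cooling schedule, step size, and search box are likewise illustrative choices and not the settings used in the study.

```python
import math
import random


def simulated_annealing(objective, x0, bounds, t0=1.0, cooling=0.95,
                        steps_per_temp=20, t_min=1e-3, step=0.1, seed=0):
    """Minimize objective(x) over a box with a basic SA loop."""
    rng = random.Random(seed)
    current = list(x0)
    f_current = objective(current)
    best, f_best = list(current), f_current
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            # Propose a random neighbour and clip it to the search box.
            candidate = [
                min(hi, max(lo, xi + rng.uniform(-step, step)))
                for xi, (lo, hi) in zip(current, bounds)
            ]
            f_candidate = objective(candidate)
            delta_f = f_candidate - f_current
            # Metropolis rule: always accept improvements; accept a worse
            # move with probability exp(-delta_f / t).
            if delta_f <= 0 or rng.random() < math.exp(-delta_f / t):
                current, f_current = candidate, f_candidate
                if f_current < f_best:
                    best, f_best = list(current), f_current
        t *= cooling  # geometric cooling schedule
    return best, f_best


def toy_objective(x):
    # Hypothetical cost surface, NOT the paper's regression model.
    beta, delta = x
    return ((beta - 0.6) ** 2 + (delta - 0.5) ** 2
            + 0.05 * math.sin(20.0 * beta) ** 2)


bounds = [(0.0, 1.0), (0.5, 1.5)]  # illustrative box for (beta, delta)
best, f_best = simulated_annealing(toy_objective, [0.9, 1.2], bounds)
print(best, f_best)
```

Accepting occasional uphill moves at high temperature is what allows the search to leave a local minimum, which is the property that makes SA attractive for objective functions with many local optima.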

The proposed NR algorithms have been compared with several other widely used algorithms via extensive objective and subjective tests. These methods differ in how they trade off noise reduction performance against speech quality. It can be concluded that the MMSE-VAD-NR, MMSE-TRA-NR, and LPC-MMSE-TRA-NR algorithms are superior to the other algorithms. Overall, these three algorithms performed equally well in terms of all subjective indices in the white and car noise scenarios. For background noise with rapidly varying levels, however, the MMSE-TRA-NR algorithm is more practical than the MMSE-VAD-NR algorithm.

TABLE VI. The post hoc Tukey HSD test of the subjective measures SIG, BAK, and OVL obtained using six NR algorithms. The NR algorithms that have attained good subjective performance (with no statistical difference) are marked with asterisks.

NR algorithms      SIG          BAK          OVL
Noise condition    White Car    White Car    White Car
Spectral subtraction * * *
Wiener filtering * * * *
MMSE-VAD-NR * * * * * *
MMSE-TRA-NR * * * * *
LPC-MMSE-TRA-NR * * * * *
KLT-NR * * *

FIG. 11. (Color online) Sensitivity analysis of the optimal parameters β and

ACKNOWLEDGMENTS

This work was supported by the National Science Council of the Republic of China, under Project No. NSC 95-2221-E-009-179.

1. Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process. 32, 1109–1121 (1984).

2. R. J. McAulay and M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Trans. Acoust., Speech, Signal Process. 28, 137–145 (1980).

3. E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach (Wiley, New York, 2004).

4. R. E. Crochiere, “A weighted overlap-add method of short-time Fourier analysis/synthesis,” IEEE Trans. Acoust., Speech, Signal Process. 28(1), 99–102 (1980).

5. M. R. Portnoff, “Implementation of the digital phase vocoder using the fast Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process. 24, 243–248 (1976).

6. U. Zölzer, DAFX: Digital Audio Effects (Wiley, New York, 2002).

7. S. L. Gay and J. Benesty, Acoustic Signal Processing for Telecommunication (Kluwer Academic, Norwell, MA, 2000).

8. N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series with Engineering Applications (Wiley, New York, 1949).

9. B. Farhang-Boroujeny, Adaptive Filters Theory and Application (Wiley, New York, 2000).

10. S. V. Vaseghi, Advanced Signal Processing and Digital Noise Reduction (Wiley, New York, 1996).

11. P. C. Loizou, Speech Enhancement Theory and Practice (CRC, New York, 2007).

12. Y. Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored noise,” IEEE Trans. Speech Audio Process. 11, 334–341 (2003).

13. L. Lin, W. Holmes, and E. Ambikairajah, “Adaptive noise estimation algorithm for speech enhancement,” Electron. Lett. 39, 754–755 (2003).

14. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chem. Phys. 21, 1087–1092 (1953).

15. Quantum Annealing and Related Optimization Methods, edited by A. Das and B. K. Chakrabarti (Springer, Heidelberg, 2005).

16. J. De Vicente, J. Lanchares, and R. Hermida, “Placement by thermodynamic simulated annealing,” Phys. Lett. A 317, 415–423 (2003).

17. J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE 63, 561–580 (1975).

18. J. D. Markel and A. H. Gray, Linear Prediction of Speech (Springer-Verlag, Berlin, 1976).

19. S. J. Orfanidis, Optimum Signal Processing: An Introduction (McGraw-Hill, New York, 1996).

20. ITU-T Rec. P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” International Telecommunications Union, Geneva, Switzerland, 2000.

21. ITU-T Rec. P.835, “Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm,” International Telecommunications Union, Geneva, Switzerland, 2003.

22. R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Speech Audio Process. 9, 504–512 (2001).

23. G. Keppel and S. Zedeck, Data Analysis for Research Designs (Freeman, New York, 1989).

