A robust speech enhancement system for vehicular applications using H(infinity) adaptive filtering


2006 IEEE International Conference on Systems, Man, and Cybernetics

October 8-11, 2006, Taipei, Taiwan

A Robust Speech Enhancement System for Vehicular Applications Using H∞ Adaptive Filtering

Chieh-Cheng Cheng, Wei-Han Liu, Chia-Hsing Yang, and Jwu-Sheng Hu, Member, IEEE

Abstract-This work proposes a novel and robust adaptive speech enhancement system for vehicle environments, which contains both time-domain and frequency-domain beamformers using the H∞ filtering approach. A corresponding microphone array data acquisition hardware is also designed and implemented. Traditionally, mutually matched microphones are needed, but this requirement is not practical. To overcome this issue, the proposed system adapts to the mismatch dynamics to allow unmatched microphones to be used in an array. Furthermore, to achieve a satisfactory speech recognition performance, the speech recognizer usually has to be retrained for different vehicle environments because of their different noise characteristics and channel effects. The channel effect usually causes modeling error in a channel recovery process because of the long channel response. The proposed system, using the H∞ filtering approach, which makes no assumptions about noise and disturbance, is robust to the modeling error. Consequently, the proposed frequency-domain beamformer provides a satisfactory performance without the need to retrain the speech recognizer.

I. INTRODUCTION

The use of mobile phones and electronic systems in vehicles is increasing. For driving safety and convenience, mobile phones and many in-car electronic systems, such as the global positioning system (GPS), CD, and air conditioner, should not be operated by hand while driving. Consequently, intelligent hands-free interfaces with speech recognition have been proposed in recent years. However, the echo of the far-end speech and environmental noises degrade the recognition performance and result in low consumer acceptance of hands-free interfaces. Therefore, methods such as single-channel [1]-[2] and multi-channel speech enhancement techniques [3]-[8] have been introduced. Although single-channel methods reduce the hardware complexity, their performance degrades due to various problems [3].

Jwu-Sheng Hu is with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, ROC (e-mail: [email protected]).

Chieh-Cheng Cheng is with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, ROC (e-mail: [email protected]).

Wei-Han Liu is with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, ROC (e-mail: [email protected]).

Chia-Hsing Yang is with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, ROC (e-mail: chyang.ece92ggnctu.edu.tw).

To overcome this limitation, microphone array based noise suppression approaches, such as the Frost beamformer [4], the robust adaptive beamformer [5], and the generalized sidelobe canceller (GSC) [6], have been proposed. However, these methods still suffer from several non-ideal factors. For example, the microphones are required to be mutually matched and no coherent interference signal may exist. Dahl et al. [7] proposed a finite impulse response (FIR) based normalized least-mean-square (NLMS) filtering approach that performs indirect microphone calibration and minimizes the sound signal distortion due to the channel effect by using a pre-recorded speech signal and a desired signal acquired when the environment was quiet. Because the variation between the pre-recorded signals and the desired signal contains useful information about the dynamics of the channel and the microphones' characteristics, this method outperforms other un-calibrated algorithms in real applications [8]. However, an FIR filter with a finite number of taps is unlikely to completely characterize the channel dynamics [9]. Moreover, the NLMS based formulation assumes that the disturbance is uncorrelated with the source, zero mean, and Gaussian distributed. These assumptions limit the performance of speech enhancement.

In contrast, the proposed H∞ filtering approaches are robust to the modeling error caused by the finite tap length of FIR filters and make no assumption regarding the characteristics of environmental noises [10]-[11]. Furthermore, the method of using the pre-recorded signals and the desired signal can suppress the gain from noises to the output, and the characteristics of the received multi-channel signals can be automatically adjusted to those of the desired signal. Therefore, extra training processes for the speech recognizer in vehicles are not needed in this work. First, for speech communication via hands-free mobile phones, a time-domain beamformer using H∞ is proposed to produce a cleaner and undisturbed speech waveform. Second, for speech recognition applications, a frequency-domain beamformer using H∞ is proposed to reduce the effect of uncertainty in the signal transformation between the time domain and the frequency domain by treating several frames as a single block. The proposed approaches using two microphones outperform the dual-channel delay-and-sum beamformer with a high-pass filter introduced in [3]. Different choices of the number of microphones are also compared.

The remainder of this paper is organized as follows. The proposed speech enhancement system and the microphone array data acquisition hardware are introduced in Section II. Section III presents the two proposed H∞ filtering based beamformers in the time domain and the frequency domain. Several representative experiments in a real vehicle are shown and the experimental results are discussed in Section IV. Finally, the conclusion is given in the last section.

Fig. 1. Overall system architecture.

Fig. 2. Microphone array data acquisition board. Labeled parts: amplifiers and filters, power control, analog switch, sample and hold, and the socket for connecting the A/D converter with the USB interface platform.

II. SPEECH ENHANCEMENT SYSTEM AND MICROPHONE ARRAY DATA ACQUISITION HARDWARE IMPLEMENTATION

The overall system architecture is illustrated in Fig. 1 and can be divided into two sub-systems. The first sub-system consists of a microphone array whose geometry can be flexibly arranged and a data acquisition electronics prototype designed in this work. The main feature of this design is that the system can digitize the received sound signals and transmit them in real time via a USB interface. The second sub-system is the speech enhancement system.

A. Microphone Array and Microphone Array Data Acquisition Board

The microphone array consists of M omni-directional condenser microphones and a headset microphone. The frequency response of the microphones ranges from 50 Hz to 16 kHz. The microphone array acquisition board, made of a four-layer board, can be divided into three parts.

In the first part, the microphone signals are amplified and filtered by six amplifiers and six band-pass filters designed to take the microphone sensitivity and anti-aliasing into consideration. The gains of the M microphones and the headset microphone are set to 60 dB and 20 dB, respectively. The second part comprises six sample-and-hold (S/H) circuits, one analog switch circuit, and one analog-to-digital (A/D) converter. The third part contains the control and data transmission lines, which are controlled by the universal serial bus (USB) interface. Through the control line, the USB interface platform can control the timing of the sample-and-hold circuits, the switch, and the A/D converter. The switching frequency and the timing of the system can be selected flexibly, and the sampling frequency in this work is set to 8 kHz. The converted 16-bit digital data are transmitted in real time through the USB interface.
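As a rough check on the real-time USB transfer, the raw payload rate implied by these figures can be computed directly. The sketch below assumes, hypothetically, that all six sample-and-hold channels are digitized at the 8 kHz sampling frequency with 16 bits per sample, and it ignores any USB protocol overhead; the function name is illustrative and not part of the described hardware.

```python
def payload_rate_bits_per_second(channels, sample_rate_hz, bits_per_sample):
    """Raw sample payload in bit/s (assumption: no framing or USB overhead)."""
    return channels * sample_rate_hz * bits_per_sample


# Assumed configuration: six channels, 8 kHz sampling, 16-bit samples.
rate_bits = payload_rate_bits_per_second(channels=6, sample_rate_hz=8000, bits_per_sample=16)
rate_bytes = rate_bits // 8

print(rate_bits)   # 768000 bit/s
print(rate_bytes)  # 96000 byte/s
```

Under these assumptions the board has to sustain roughly 96 kB/s of sample data.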

Fig. 3. Installation of the microphone array inside the vehicle (labels: microphone array, headset microphone).

A picture of the microphone array data acquisition board is shown in Fig. 2. Fig. 3 shows the installation of the array inside the vehicle. Note that the headset microphone is used only for collecting the desired signal, i.e., the user does not need the headset microphone during online applications.

B. Speech Enhancement System

This system operates in two stages, the silent stage and the speech stage, selected by a voice activity detector (VAD) that determines whether the received signals contain speech. The voice activity detection algorithm can be found in [12]. If the VAD output equals zero, which means that no speech exists, the system runs in the silent stage. When the VAD output equals one, the system is switched to the speech stage.

The pre-recorded speech signals shown in the silent stage in Fig. 1 are collected when the environment is quiet and the speaker is at the desired location. They contain both the characteristics of the microphones and the acoustical characteristics of the desired location. The desired signal, $d(n)$, is obtained from a headset microphone at the same time as the pre-recorded speech signals are collected. Since the headset is close to the mouth, the desired signal contains little channel distortion. The desired signal only needs to be collected again when the desired location changes, so the headset microphone is not needed during online applications. In the silent stage, the environmental noise signals without speech are recorded online. The environmental noise is assumed to be additive, so the signals received when a speaker is talking in a noisy environment can be expressed as a linear combination of the speech signals and the environmental noises. Therefore, in this stage, the system combines the online recorded environmental noise signals, $n_1(n), \ldots, n_M(n)$, with the pre-recorded speech signals, $s_1(n), \ldots, s_M(n)$, to construct the training signals, $x_1(n), \ldots, x_M(n)$. The training signals are used to adapt the weighting vector using the H∞ based adaptive filtering approach. In the speech stage, the trained weighting vector is passed to the lower beamformer to purify and recover the noisy received signals, $y_1(n), \ldots, y_M(n)$.
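The construction of the training signals under the additive-noise assumption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the list-of-lists layout (one list of samples per microphone) are assumptions made for the example.

```python
def build_training_signals(speech, noise):
    """Return x_i(n) = s_i(n) + n_i(n) for every microphone channel i.

    speech and noise are lists with one sample list per microphone
    (hypothetical layout). Each channel is truncated to the shorter recording.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same number of channels")
    return [
        [s + v for s, v in zip(speech_channel, noise_channel)]
        for speech_channel, noise_channel in zip(speech, noise)
    ]


# Toy data for M = 2 microphones and three samples per channel.
pre_recorded_speech = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
online_noise = [[0.25, -0.25, 0.5], [1.0, 0.0, -1.0]]

training = build_training_signals(pre_recorded_speech, online_noise)
print(training)  # [[1.25, 1.75, 3.5], [1.5, 0.5, -0.5]]
```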

III. PROPOSED SPEECH ENHANCEMENT APPROACHES

A. Time-Domain Beamformer Using H∞ Filtering Approach

Based on the system architecture shown in Fig. 1, the microphone array speech enhancement system can be formulated as the following linear model:

$$d(n) = \boldsymbol{x}^{T}(n)\,\boldsymbol{w} + e(n) \tag{1}$$

In this work, italic fonts represent scalars, bold italic fonts represent vectors, and bold upright fonts represent matrices. $M$ denotes the number of microphones, $P$ denotes the filter order of each microphone, and the superscript $T$ denotes the transpose operation. $d(n)$ is the desired signal and $\boldsymbol{x}(n) = [\boldsymbol{x}_1^{T}(n) \; \cdots \; \boldsymbol{x}_M^{T}(n)]^{T}$ is an $MP \times 1$ training signal vector, where $\boldsymbol{x}_i(n) = [x_i(n) \; \cdots \; x_i(n-P+1)]^{T}$ is a $P \times 1$ training signal vector. In addition, $\boldsymbol{w}$ is the $MP \times 1$ unknown filter coefficient vector of the time-domain beamformer that we intend to estimate. $e(n)$ is the unknown estimation disturbance, which may also include modeling error.

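To apply the adaptive H∞ filtering algorithm, the linear model in (1) is transformed into its state-space form:

$$\boldsymbol{w}(n+1) = \boldsymbol{w}(n), \qquad d(n) = \boldsymbol{x}^{T}(n)\,\boldsymbol{w}(n) + e(n), \qquad \text{with } \boldsymbol{w}(n) = \boldsymbol{w} \tag{2}$$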

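The criterion in the sense of H∞ is:

$$\min_{\hat{\boldsymbol{w}}(n)} \; \max_{(e(n),\, \boldsymbol{w}(0))} J = -\frac{1}{2}\gamma^{2}\mu^{-1}\,\|\boldsymbol{w} - \hat{\boldsymbol{w}}(0)\|^{2} + \frac{1}{2}\sum_{n=0}^{N}\left[\|\boldsymbol{w} - \hat{\boldsymbol{w}}(n)\|^{2} - \gamma^{2}\,|e(n)|^{2}\right] \tag{3}$$

where $\mu$ is a weighting parameter and $\hat{\boldsymbol{w}}(n)$ is the $MP \times 1$ estimated filter coefficient vector. $\|\cdot\|^{2}$ denotes the square of the 2-norm. According to [13], the solution $\hat{\boldsymbol{w}}(n)$ can be approximated by the iteration:

$$\mathbf{M}^{-1}(n+1) = \mathbf{M}^{-1}(n) + \boldsymbol{x}(n)\boldsymbol{x}^{T}(n) - \gamma^{-2}\mathbf{I} \tag{4}$$

$$\hat{\boldsymbol{w}}(n+1) = \hat{\boldsymbol{w}}(n) + \frac{\mathbf{M}(n)\boldsymbol{x}(n)\left(d(n) - \boldsymbol{x}^{T}(n)\hat{\boldsymbol{w}}(n)\right)}{1 + \boldsymbol{x}^{T}(n)\mathbf{M}(n)\boldsymbol{x}(n)} \tag{5}$$

$$\hat{\boldsymbol{w}}(0) = \boldsymbol{0}, \qquad \mathbf{M}^{-1}(0) = \left(\mu_0^{-1} - \gamma^{-2}\right)\mathbf{I} \tag{6}$$

where $\mathbf{M}(n)$ is an $MP \times MP$ matrix and $(\cdot)^{-1}$ denotes the matrix inverse operation. To ensure that $\mathbf{M}(n)$ remains positive definite, $\gamma$ should be chosen such that $\mathbf{M}^{-1}(n) + \boldsymbol{x}(n)\boldsymbol{x}^{T}(n) - \gamma^{-2}\mathbf{I} > 0$. For this reason, $\gamma$ is selected as $\xi\,\mathrm{eig}\left(\mathbf{M}^{-1}(n) + \boldsymbol{x}(n)\boldsymbol{x}^{T}(n)\right)$ during the iteration, where $\mathrm{eig}(z)$ denotes the maximum eigenvalue of $z$ and $\xi > 1$ in order to keep $\gamma$ greater than the minimum value.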

The adaptation of the filter coefficient vector is performed in the silent stage. When the system is switched to the speech stage, the adaptation stops and the filter coefficient vector is passed to the lower beamformer. The output of the speech purification system can be calculated by

$$\hat{s}(n) = \boldsymbol{y}^{T}(n)\,\hat{\boldsymbol{w}}(n) \tag{7}$$

where $\hat{s}(n)$ is the purified result, and $\boldsymbol{y}(n)$ is the $MP \times 1$ online recorded polluted speech signal vector acquired by the microphone array.
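The recursion in (4)-(6) and the output in (7) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, a fixed $\gamma$ chosen large enough for the toy data replaces the adaptive selection of $\gamma$ during the iteration, and the demonstration identifies a made-up two-tap filter from noise-free synthetic data.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hinf_update(w_hat, m_inv, x, d, gamma):
    """One step of (5) for the weights and (4) for the inverse matrix."""
    m = invert(m_inv)                     # M(n) from M^{-1}(n)
    mx = [dot(row, x) for row in m]       # M(n) x(n)
    err = d - dot(x, w_hat)               # d(n) - x^T(n) w_hat(n)
    den = 1.0 + dot(x, mx)                # 1 + x^T(n) M(n) x(n)
    w_next = [wi + g * err / den for wi, g in zip(w_hat, mx)]
    size = len(x)
    m_inv_next = [[m_inv[i][j] + x[i] * x[j] - (gamma ** -2 if i == j else 0.0)
                   for j in range(size)] for i in range(size)]
    return w_next, m_inv_next


def train(xs, ds, mu0, gamma):
    """Run the iteration from the initial values in (6)."""
    size = len(xs[0])
    w_hat = [0.0] * size
    c = 1.0 / mu0 - gamma ** -2
    m_inv = [[c if i == j else 0.0 for j in range(size)] for i in range(size)]
    for x, d in zip(xs, ds):
        w_hat, m_inv = hinf_update(w_hat, m_inv, x, d, gamma)
    return w_hat


def purify(y, w_hat):
    """Beamformer output of (7) for one polluted input vector y(n)."""
    return dot(y, w_hat)


# Toy demonstration (assumed values): one channel, two taps, noise-free data.
rng = random.Random(0)
w_true = [0.5, -0.3]
u = [rng.uniform(-1.0, 1.0) for _ in range(1001)]
xs = [[u[n], u[n - 1]] for n in range(1, 1001)]
ds = [dot(x, w_true) for x in xs]
w_hat = train(xs, ds, mu0=0.9, gamma=10.0)
print([round(v, 3) for v in w_hat])
```

With a fixed $\gamma$ the matrix $\mathbf{M}^{-1}(n)$ stays positive definite only if the input is sufficiently exciting, which holds for this synthetic data; on real recordings the positive-definiteness condition would have to be checked at every step.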

B. Frequency-Domain Beamforming Using H∞ Filtering Approach

The unknown estimation disturbance at frame $k$ and frequency $\omega$ can be written as:

$$E(\omega,k) = D(\omega,k) - \boldsymbol{W}^{H}(\omega,k)\,\boldsymbol{X}(\omega,k), \qquad \text{with } \boldsymbol{W}(\omega,k) = \boldsymbol{W}(\omega) \tag{8}$$

where $D(\omega,k)$ is the desired signal in the frequency domain and $\boldsymbol{W}(\omega)$ denotes the $M \times 1$ unknown weighting vector at frequency $\omega$. The superscript $H$ denotes the Hermitian operation. $\boldsymbol{X}(\omega,k)$, $\boldsymbol{N}(\omega,k)$, and $\boldsymbol{S}(\omega,k)$ represent the frequency-domain training signal vector, the online recorded environmental noise vector, and the pre-recorded speech signal vector, respectively.

In general, the window size in the STFT has to equal that used in ASR in order to obtain a more accurate result. However, the window size may be too small to capture the acoustic channel response. For this reason, a previous work [8] proposed an approach called the soft penalty frequency domain block beamformer (SPFDBB). However, the NLMS algorithm used in that work [8] limits its performance because of its inherent assumptions on the disturbances and channel dynamics. Consequently, the H∞ based filtering approach is adopted to improve the performance further. The H∞ iterative solution can be written as:

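$$\boldsymbol{W}(\omega,k+1) = \boldsymbol{W}(\omega,k) + \mathbf{K}(\omega,k)\left[\boldsymbol{D}_L(\omega,k) - \mathbf{H}(\omega,k)\boldsymbol{W}(\omega,k)\right] \tag{9}$$

$$\mathbf{K}(\omega,k) = \mathbf{P}(\omega,k)\mathbf{H}^{H}(\omega,k)\left(\mathbf{I} + \mathbf{H}(\omega,k)\mathbf{P}(\omega,k)\mathbf{H}^{H}(\omega,k)\right)^{-1} \tag{10}$$

$$\mathbf{P}^{-1}(\omega,k+1) = \mathbf{P}^{-1}(\omega,k) + \mathbf{H}^{H}(\omega,k)\mathbf{H}(\omega,k) - \gamma^{-2}\mathbf{I}_M \tag{11}$$

$$\mathbf{H}(\omega,k) = \left[\boldsymbol{X}(\omega,k) \;\cdots\; \boldsymbol{X}(\omega,k+L-1) \;\; \frac{\sqrt{\lambda}}{L}\sum_{j=k}^{k+L-1}\boldsymbol{S}(\omega,j)\right]^{H} \tag{12}$$

$$\boldsymbol{D}_L(\omega,k) = \left[D^{*}(\omega,k) \;\cdots\; D^{*}(\omega,k+L-1) \;\; \frac{\sqrt{\lambda}}{L}\sum_{j=k}^{k+L-1}D^{*}(\omega,j)\right]^{T} \tag{13}$$

$$\mathbf{P}(\omega,0) = \mu_0\mathbf{I}_M \quad \text{and} \quad \boldsymbol{W}(\omega,0) = [0 \;\; 0 \;\cdots\; 0]^{T} \tag{14}$$

where $\boldsymbol{W}(\omega)$ denotes an unknown weighting vector, the superscript $*$ denotes the complex conjugate, and $\mathbf{H}(\omega,k)$ is an $(L+1) \times M$ dimensional matrix at the $k$th block. $\mathbf{I}_M$ is an identity matrix with dimension $M \times M$. The value of $\gamma$ during the iteration is chosen as $\xi\,\mathrm{eig}\left(\mathbf{P}^{-1}(\omega,k) + \mathbf{H}^{H}(\omega,k)\mathbf{H}(\omega,k)\right)$ to keep $\gamma$ greater than the minimum value.

Consequently, the purified output signal at the $k$th block can be obtained by the following equation:

$$\hat{\boldsymbol{S}}(\omega,k) = \boldsymbol{W}^{H}(\omega,k)\,\mathbf{Y}(\omega,k) \tag{17}$$

where $\hat{\boldsymbol{S}}(\omega,k)$ is the purified output signal and $\mathbf{Y}(\omega,k)$ is the $M \times L$ online recorded polluted speech signal matrix. The step $k$ is chosen as $0, L, 2L, 3L, \ldots$ to perform the adaptation process every $L$ frames.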

IV. EXPERIMENTAL RESULTS

A. Experimental Conditions and Parameters

The experiment was performed on the passenger seat of a mini-van instead of the driver's seat for driving safety. A uniform linear microphone array of five un-calibrated microphones with 0.07 m spacing is mounted in front of the passenger seat. The distance between the microphone array and the mouth of the speaker who sits in the passenger seat is about 0.62 m. To show the performance of the proposed approaches, 341 pairs of vehicle identification numbers and ten conditions (C1-C10 of Table I) were used. The average SNRs in the ten conditions are shown in Table I, and a music piece containing vocal sound is played repeatedly by six built-in loudspeakers when the in-car audio system is turned on. The desired signal utilized in this experiment is derived from the headset microphone, which contains the lowest channel distortion. The first and second microphones are utilized for the dual-microphone case (M = 2), the first, second, and third microphones are used when M = 3, and so on. For comparison, the delay-and-sum beamformer with a high-pass filter (DS+HP) introduced in [3] is implemented.

B. Time-Domain Performance Evaluation

TABLE I
TEN EXPERIMENTAL CONDITIONS AND ISOLATED AVERAGE SNR

Condition   Speed (km/hr)   In-car audio system   Average SNR (dB)
C1          20              Off                    4.20
C2          40              Off                    2.84
C3          60              Off                    2.72
C4          80              Off                   -1.90
C5          100             Off                   -3.04
C6          20              On                    -0.08
C7          40              On                    -2.19
C8          60              On                    -2.28
C9          80              On                    -4.75
C10         100             On                    -5.40

Instead of using the signal-to-noise ratio (SNR), two performance indices, signal recover ratio (SRR) and noise power ratio (NPR), are defined to evaluate the degree of signal distortion and noise suppression. This is because a higher SNR does not imply that the signal distortion is low. SRR is defined as:

$$\mathrm{SRR}(n) = \frac{\operatorname{cov}\left(d(n),\, \boldsymbol{w}(n)^{T}\boldsymbol{s}(n)\right)}{\sqrt{\operatorname{cov}\left(d(n),\, d(n)\right) \times \operatorname{cov}\left(\boldsymbol{w}(n)^{T}\boldsymbol{s}(n),\, \boldsymbol{w}(n)^{T}\boldsymbol{s}(n)\right)}} \tag{18}$$

where $\operatorname{cov}(\cdot)$ is the covariance operation. Further, NPR is defined as:

$$\mathrm{NPR}(n) = \frac{\sum_{n=1}^{V}\left[\boldsymbol{w}(n)^{T}\boldsymbol{n}(n)\right]^{2}}{\sum_{n=1}^{V} n_1^{2}(n)} \tag{19}$$

where $V$ in (19) denotes the length of the desired signal. SRR is the correlation coefficient between the desired signal and the recovered signal, $\boldsymbol{w}(n)^{T}\boldsymbol{s}(n)$. Consequently, a higher SRR means better speech recovery. NPR represents the ratio of the noise power after beamformer processing, $\boldsymbol{w}(n)^{T}\boldsymbol{n}(n)$, to the noise power measured at the silent stage. A smaller NPR represents a cleaner speech signal. The order of the time-domain beamformer using the H∞ filtering approach was set to 128, and $\mu_0$ and $\gamma$ were set to 0.9 and 0.95, respectively.
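The two indices can be computed from sampled signals as sketched below. The sketch assumes that the beamformer outputs for the speech and noise components have already been computed and are passed in as plain lists, and that the reference noise power is taken from a single channel recorded in the silent stage; the function names are illustrative.

```python
import math


def covariance(a, b):
    """Population covariance of two equal-length sample lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)


def signal_recover_ratio(desired, recovered):
    """SRR: correlation coefficient between desired and recovered signals."""
    return covariance(desired, recovered) / math.sqrt(
        covariance(desired, desired) * covariance(recovered, recovered)
    )


def noise_power_ratio(filtered_noise, reference_noise):
    """NPR: noise power after the beamformer over the silent-stage noise power."""
    return sum(v * v for v in filtered_noise) / sum(v * v for v in reference_noise)


# Toy data: a recovered signal that is a scaled copy of the desired signal,
# and a beamformer that halves the noise amplitude.
srr = signal_recover_ratio([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
npr = noise_power_ratio([0.5, -0.5], [1.0, -1.0])
print(srr)  # 1.0
print(npr)  # 0.25
```

Because SRR is a correlation coefficient, a pure gain change leaves it at 1, which is why it isolates distortion from level differences.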

The values of SRR and NPR after the DS+HP and the time-domain beamformer using the H∞ filtering technique for the ten testing conditions are illustrated in Fig. 5. As shown in Fig. 5(a) and Fig. 5(b), the SRRs of the proposed approach are higher than those of the DS+HP when two microphones are utilized, including the cases when the in-car audio system is turned on (conditions C6 to C10). This is because the proposed system can recover the channel distortion and is robust to modeling error. Moreover, the high-pass filter in DS+HP suppresses the magnitude of the low-frequency components of the speech signal, which may decrease the SRR further. The NPR values of the proposed method are also better than those of the traditional DS+HP in conditions C1 to C10. The NPR values in C6 to C10 are larger than those in C1 to C5 because turning on the in-car audio system raises the complexity of the noise. The improvements in SRR and NPR are consistent with the number of microphones used, which means that a larger number of microphones could provide better sound quality.

Fig. 5. SRRs and NPRs of conditions C1 to C10: (a) SRRs of conditions C1 to C5, (b) SRRs of conditions C6 to C10, (c) NPRs of conditions C1 to C5, (d) NPRs of conditions C6 to C10. Horizontal axes: conditions. Curves: DS+HP and M = 2, 3, 4, 5.

C. Frequency-Domain Performance Evaluation

The results of the frequency-domain beamformer using the H∞ filtering approach are directly delivered to a benchmark speech recognizer, HTK [14]. In the experiments, $\mu_0$ and $\gamma$ were set to 0.9 and 0.95, respectively, and the soft penalty $\lambda$ was set to 2. In addition, the frame number $L$ was set to 40. The window contains a 32 ms speech signal and 256 zero-padded samples, which gives a total of 512 samples. The best possible recognition rate using the desired signal is 97.15%. A baseline recognition rate using the first microphone only is established. As shown in Fig. 6 and Fig. 7, the baseline performance is poor, as expected. When only the car noises exist (conditions C1-C5), the DS+HP improves the recognition rate by 15.52% to 25.25% compared with the baseline. Because the DS+HP only attempts to suppress the noise signals instead of dealing with the channel distortion, the performance cannot be satisfactory unless the recognizer is re-trained. As indicated by Fig. 6, the improvement of the proposed method over DS+HP becomes more significant when the environmental noise is higher. The improvements are more significant when the music is turned on (Fig. 7). The recognition rate for DS+HP drops because it can only suppress a small amount of the wideband music signal. Comparing Fig. 6 and Fig. 7, the proposed method maintains a similar recognition performance at a vehicle speed with or without music background.

Fig. 6. Speech recognition rate of conditions 1 to 5 (curves: baseline, DS+HP, and M = 2, 3, 4, 5).

Fig. 7. Speech recognition rate of conditions 6 to 10 (curves: baseline, DS+HP, and M = 2, 3, 4, 5).

REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 113-120, Apr. 1979.

[2] A. Kawamura, Y. Iiguni, and Y. Itoh, "A noise reduction method based on linear prediction with variable step-size," IEICE Trans. Fundamentals, vol. E88-A, no. 4, pp. 855-861, Apr. 2005.

[3] S. Ahn and H. Ko, "Background noise reduction via dual-channel scheme for speech recognition in vehicular environment," IEEE Trans. Consumer Electronics, vol. 51, no. 1, pp. 22-27, Feb. 2005.

[4] O. L. Frost, "An algorithm for linearly constrained adaptive array processing," Proc. IEEE, vol. 60, no. 8, pp. 926-935, Aug. 1972.

[5] H. Cox, R. M. Zeskind, and M. M. Owen, "Robust adaptive beamforming," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, pp. 1365-1376, Oct. 1987.

[6] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propagation, vol. AP-30, pp. 27-34, Jan. 1982.

[7] M. Dahl and I. Claesson, "Acoustic noise and echo canceling with microphone array," IEEE Trans. Vehicular Technology, vol. 48, pp. 1518-1526, Sept. 1999.

[8] J. S. Hu and Chieh-Cheng Cheng, "Frequency domain microphone array calibration and beamforming for automatic speech recognition," IEICE Trans. Fundamentals, vol. E88-A, no. 9, pp. 2401-2411, Sep. 2005.

[9] H. Kuttruff, Room Acoustics. London: Elsevier, 1991, chapter 3, pp. 56.

[10] W. Zhuang, "Adaptive H infinity channel equalization for wireless personal communications," IEEE Trans. Vehicular Technology, vol. 48, no. 1, pp. 126-136, Jan. 1999.

[11] B. Hassibi and T. Kailath, "H∞ adaptive filtering," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 949-952, May 1995.

[12] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre, et al., "Efficient voice activity detection algorithms using long-term speech information," Speech Communication, vol. 42, pp. 271-287, Apr. 2004.

[13] X. Shen and L. Deng, "A dynamic system approach to speech enhancement using the H∞ filtering algorithm," IEEE Trans. Speech and Audio Process., vol. 7, pp. 391-399, July 1999.

[14] Hidden Markov Model Toolkit (http://htk.eng.cam.ac.uk/)

V. CONCLUSION

A time-domain and a frequency-domain adaptive beamformer using H∞ filtering approaches are proposed. The methods can be applied as a hands-free speech acquisition interface for communication or speech recognition in a vehicle. The performance indexes (SRR, NPR, and speech recognition rate) for different numbers of microphones are introduced and compared to provide a design tradeoff among the number of microphones used, performance, and circuit complexity. The experimental results show that the proposed system could improve the communication quality and the speech recognition rate significantly without the time-consuming re-training process for the