Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 2005, 219(2), 133. Published by SAGE on behalf of the Institution of Mechanical Engineers. Version of Record: 1 March 2005.
Processing of speech signals using a
microphone array for intelligent robots
J Hu*, C C Cheng, and W H Liu
Department of Electrical and Control Engineering, National Chiao Tung University, Taiwan, Republic of China
The manuscript was received on 22 January 2004 and was accepted after revision for publication on 11 November 2004.
DOI: 10.1243/095965105X9461
Abstract: For intelligent robots to interact with people, an efficient human–robot communication interface (e.g. voice command) is very important. However, recognizing a voice command or speech represents only part of speech communication: the physics of speech signals carries other information as well, such as the speaker's direction. Moreover, although a basic element of processing the speech signal is recognition at the acoustic level, recognition performance depends greatly on the quality of the received signal, and in a noisy environment the success rate can be very poor. As a result, prior to speech recognition it is important to process the speech signals so as to extract the desired content while rejecting other components, such as background noise. This paper presents a speech purification system for robots that improves the signal-to-noise ratio of the received speech, together with an algorithm built around a multidirection calibration beamformer.
Keywords: beamforming, beamformer, DOA, microphone array, robot hearing, speech enhancement
1 INTRODUCTION

With the advent of the computing power of microprocessors and digital signal processors, the possibility of constructing an intelligent robot to perform complex tasks is not such a far-reaching goal. Among the various features offered by an intelligent robot, the communication interface is still an on-going research topic. It is generally believed that the interface should not be restricted to keyboard, mouse, or remote controller, but should also embrace natural language. For these reasons, robot hearing research has received much attention over the years. Chun and Caudell [1] tried to use the inferior colliculus structure and head related transfer function (HRTF) information, combined with image processing techniques, to find general rules of human hearing. Schauer and Gross [2] use interaural time difference (ITD) and interaural intensity difference (IID) signals to perform a 360° direction of arrival (DOA) estimation.

Speech recognition will inevitably be incorporated into an intelligent robot to make it understand what people say or which command is given. Although speech recognition can have high accuracy in a quiet environment, undesirable signal components due to ambient noise and channel distortion render the recognizer unusable for real-world applications. An adaptive microphone array system is thus designed to purify the polluted signal and to improve the recognition rate.

Adaptive microphone array algorithms for enhancing speech reception in a noisy environment have been developed for many years. Earlier approaches, such as the Frost beamformer [3], the GSC [4], and the robust adaptive beamformer [5], are only good in the ideal case, which here means that the microphones are mutually matched and the environment is a free space. To cope with these limitations, Hoshuyama et al. [6] proposed two robust constraints on the blocking matrix design. Weinstein [7] proposed a new channel estimation method for the standard GSC architecture in the frequency domain; however, its estimation accuracy is degraded by louder ambient noise and by circuit noise. Dahl and Claesson [8] proposed an adaptive algorithm which calibrates both the microphone mismatch and the channel effect using a priori information. This a priori information is a set of speech data recorded by the same microphone array in a quiet environment. It then serves as a reference signal to update the coefficients of the filters when the speaker is silent (i.e. in non-speech segments) and the environment is noisy. With this a priori information, the calibration problem is solved implicitly. Dahl's algorithm is suitable in the car environment, where the speaker's position is fixed (e.g. the driver). To apply the algorithm to mobile robots, it would be necessary to record reference signals from all directions, since the speaker's position might not be fixed. In this paper, a beamforming architecture modified from the method proposed by Dahl and Claesson [8] is constructed by using a beam-steer filter with only one set of pre-recorded speech source. As a result, the memory requirement and the effort of pre-recording are reduced tremendously. This modified architecture could be more suitable for a robot hearing application.

The direction of the speaker must be known before the beam is formed in the speaker direction. In a noisy environment, the conventional delay estimation methods in the time domain [9] or in the frequency domain [10–13] are not able to obtain satisfactory results. In order to make a sound source direction available, a customized wide-band eigenstructure-based DOA estimation algorithm is proposed in this system. This method is based on a blind DOA estimation algorithm called MUSIC (multiple signals classification) [14], with modifications to decrease the computing time and increase the accuracy of the DOA estimation.

The overall system is shown in Fig. 1. The first part consists of a speech activity detection to decide when the adaptive beamformer should be switched on or off. The second part is a DOA estimation and adaptation of the upper beamformer. By incorporating the DOA knowledge, the beam-steer filter is used to steer the direction of the beam for acquiring the clean speech of a speaker. Because the target is a speech signal, a broadband beam-steer filter is needed. The third part is to apply the beamformer computation to increase the signal-to-noise ratio (SNR).

The paper is organized as follows. The customized wide-band eigenstructure-based DOA estimation algorithm is described in section 2. Section 3 discusses the modified beamformer, speech activity detection, and the beam-steer filter. Section 4 provides experimental results of the DOA and beamformer obtained with the speaker in several different directions. Finally, a conclusion is given in section 5.

2 DIRECTION OF ARRIVAL (DOA) ESTIMATION

The idea of a blind DOA estimation algorithm called MUSIC [14] is adopted in this platform to detect the speaker's direction. The received signal contains d sources and can be presented as

x_m(t) = \sum_{k=1}^{d} a_{mk} s_k(t - \tau_{mk}) + n_m(t)    (1)

Generally, the sources here may include the speech source and interference signals from the acoustic environment. The noise n_m(t) refers to non-directional interference signals such as electronic noise (called non-directional noise in the following context). In order to express the delay relations as phase shifts, the received signal is transformed into the
frequency domain over a finite observation interval T

X_m(\omega_l) = \frac{1}{T} \int_{-T/2}^{T/2} x_m(t) \, e^{-j\omega_l t} \, dt, \qquad \omega_l = \frac{2\pi l}{T}, \quad l = 1, \ldots, L    (2)

where \omega_1 and \omega_L are the lowest and highest frequencies included in the bandwidth B. The original model can be described as

X_m(\omega_l) = \sum_{k=1}^{d} a_{mk} S_k(\omega_l) \, e^{-j\omega_l \tau_{mk}} + N_m(\omega_l)    (3)

Rewriting equation (3) in matrix form gives

X(\omega_l) = A(\omega_l) S(\omega_l) + N(\omega_l)    (4)

where

X^T(\omega_l) = [X_1(\omega_l), \ldots, X_M(\omega_l)]
N^T(\omega_l) = [N_1(\omega_l), \ldots, N_M(\omega_l)]
S^T(\omega_l) = [S_1(\omega_l), \ldots, S_d(\omega_l)]

A(\omega_l) =
\begin{bmatrix}
a_{11} e^{-j\omega_l \tau_{11}} & \cdots & a_{1d} e^{-j\omega_l \tau_{1d}} \\
\vdots & & \vdots \\
a_{M1} e^{-j\omega_l \tau_{M1}} & \cdots & a_{Md} e^{-j\omega_l \tau_{Md}}
\end{bmatrix}

Note that each column represents the delay relations between the microphones caused by a different source; the ith column vector of A(\omega_l) is denoted by A_i(\omega_l) and referred to as the direction vector.

Suppose the noises are mutually independent. If the noise correlation matrix is the diagonal matrix \sigma^2(\omega_l) I, the received signal correlation matrix can be described as

R_{xx}(\omega_l) = A(\omega_l) R_{ss}(\omega_l) A^H(\omega_l) + \sigma^2(\omega_l) I    (5)

where

R_{ss}(\omega_l) = E[S(\omega_l) S^H(\omega_l)]

and the eigenvalue decomposition is

R_{xx}(\omega_l) = \sum_{i=1}^{M} \lambda_i(\omega_l) E_i(\omega_l) E_i^H(\omega_l)    (6)

with eigenvalues \lambda_1(\omega_l) \geq \lambda_2(\omega_l) \geq \cdots \geq \lambda_M(\omega_l). From equations (4) and (5), the source part correlation matrix is

C_{xx}(\omega_l) = A(\omega_l) R_{ss}(\omega_l) A^H(\omega_l) = \sum_{i=1}^{d} [\lambda_i(\omega_l) - \sigma_n^2(\omega_l)] E_i(\omega_l) E_i^H(\omega_l)    (7)

and the rank of C_{xx}(\omega_l) is d. Then the following relation can be derived

RangeSpace(C_{xx}(\omega_l)) = span{A_1(\omega_l), \ldots, A_d(\omega_l)} = span{E_1(\omega_l), \ldots, E_d(\omega_l)}

Combining the equations above, span{E_1(\omega_l), \ldots, E_d(\omega_l)} is the source subspace and span{E_{d+1}(\omega_l), \ldots, E_M(\omega_l)} is the non-directional noise subspace. Because the source subspace is orthogonal to the non-directional noise subspace,

E_j^H(\omega_l) A_i(\omega_l) = 0, \qquad i = 1, \ldots, d; \; j = d+1, \ldots, M    (8)

By equation (8), a non-directional noise projection matrix P_N(\omega_l) can be established as

P_N(\omega_l) = \sum_{i=d+1}^{M} E_i(\omega_l) E_i^H(\omega_l)    (9)

The number of sources d can be determined from the distribution of the eigenvalues. The DOA can be detected by projecting the direction vector on to the non-directional noise projection matrix, since

P_N(\omega_l) A_i(\omega_l) = 0    (10)

Usually, the d maximum values of the following wide-band spectrum are regarded as the d source directions

\frac{1}{(1/L) \sum_{l=1}^{L} \| E_j^H(\omega_l) A_i(\omega_l) \|_2^2} = \frac{1}{(1/L) \sum_{l=1}^{L} A_i^H(\omega_l) P_N(\omega_l) A_i(\omega_l)}    (11)

The computing requirement of equation (11) can be reduced by considering only the significant frequencies of concern. The selection criterion is based on the assumption that the non-directional noises are mutually independent. Therefore, the non-diagonal components of the correlation matrix exclude the non-directional noise terms; that is, the noise contribution to the following terms in the correlation matrix (5) should be small

R_{x_i x_j}(\omega_l) = \sum_{p=1}^{d} \sum_{o=1}^{d} a_{ip} a_{jo} R_{s_p s_o}(\omega_l), \qquad \forall i \neq j    (12)

Then the Q significant frequencies \hat{\omega}_1, \ldots, \hat{\omega}_Q can be selected as

\hat{\omega}_q = \left[ \sum_{i=1}^{M} \sum_{j=i+1}^{M} | R_{x_i x_j}(\omega_l) | \right]_q    (13)

where [\cdot]_q denotes the frequency \omega_l giving the qth biggest value. As a result, the d source directions can be estimated by searching for the d maximum values of

J(h_i) = \frac{1}{(1/Q) \sum_{q=1}^{Q} A_i^H(\hat{\omega}_q) P_N(\hat{\omega}_q) A_i(\hat{\omega}_q)}    (14)

Searching the spectrum for d peaks to determine the direction of arrival still requires plenty of processing time when the accuracy requirement is high. This is the drawback of this method, which requires further improvement. Although there is the root-MUSIC [15] algorithm to calculate the DOA without searching the spectrum, a uniformly shaped array is needed. Because the shape of the microphone array on the robot may change with different applications, the root-finding method is not implemented in the proposed platform.
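The estimation procedure of equations (5) to (14) can be sketched in a few lines of NumPy. The sketch below is illustrative only: it assumes a far-field uniform linear array (with the 7 cm spacing used later in section 4), unit-amplitude direction vectors, and synthetic snapshots at two hypothetical frequencies rather than the significant-frequency selection of equation (13).

```python
import numpy as np

def steering_vector(theta_deg, omega, M=6, spacing=0.07, c=343.0):
    """Far-field direction vector A_i(omega) for a uniform linear array."""
    m = np.arange(M)
    tau = m * spacing * np.sin(np.deg2rad(theta_deg)) / c  # per-microphone delays
    return np.exp(-1j * omega * tau)

def music_spectrum(X, omegas, angles, d=1):
    """Average the narrowband MUSIC cost of equation (14) over the
    selected frequencies. X has shape (len(omegas), M, snapshots)."""
    J = np.zeros(len(angles))
    for Xl, wl in zip(X, omegas):
        R = Xl @ Xl.conj().T / Xl.shape[1]      # correlation matrix R_xx
        _, E = np.linalg.eigh(R)                # eigenvalues in ascending order
        En = E[:, :-d]                          # noise subspace E_{d+1}..E_M
        Pn = En @ En.conj().T                   # projection matrix P_N, eq. (9)
        for i, th in enumerate(angles):
            a = steering_vector(th, wl)
            J[i] += np.real(a.conj() @ Pn @ a)
    return 1.0 / (J / len(omegas))              # peaks at the source angles

# toy data: one source at 30 degrees observed at two frequencies
rng = np.random.default_rng(0)
omegas = 2 * np.pi * np.array([800.0, 1200.0])
snapshots = 200
X = []
for wl in omegas:
    a = steering_vector(30.0, wl)[:, None]
    s = rng.standard_normal((1, snapshots)) + 1j * rng.standard_normal((1, snapshots))
    n = 0.1 * (rng.standard_normal((6, snapshots)) + 1j * rng.standard_normal((6, snapshots)))
    X.append(a @ s + n)

angles = np.arange(-90, 91)
spec = music_spectrum(np.array(X), omegas, angles, d=1)
print(angles[np.argmax(spec)])   # estimated DOA, close to 30
```

Averaging the cost over frequencies is what makes the estimator wide-band: each bin contributes its own noise-subspace projection, so a single broadband source reinforces the same angular peak.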
3 SPEECH ENHANCEMENT
3.1 The modified beamformer approach

The approach can be arranged in the following steps.

Step 1 is to pre-record the speech source.

Step 2 is speech activity detection, described in section 3.2.

Step 3 is to adjust the pre-recorded speech source by the beam-steer filter in order to produce the correct reference signals. The DOA information is obtained by the MUSIC algorithm mentioned above. Generally, the MUSIC spectrum contains the directional information of both the speaker and any interference signal during the speech segment. In order to determine the speaker's direction, the MUSIC spectrum is computed contiguously, and the speaker's direction can then be obtained by comparing the spectrums before and after the speech activity is detected. The design of the beam-steer filter is described in section 3.3, and the modified reference signals are denoted as \hat{r}_1[n], \ldots, \hat{r}_M[n].

In step 4, the weighting matrix of the upper beamformer is modified in the non-speech segments, and the newly updated weighting matrix is passed to the lower beamformer in the speech segments. The LMS method is used here to perform the adaptation in the non-speech segments. If a speech segment is detected, the data flow through the lower beamformer and the output data sequence \hat{y}[n] is produced.

Assume that the order of the weighting vector for each microphone is F. The LMS adaptation algorithm is

w[k+1] = w[k] + \mu (y[k] - y_b[k]) (\hat{r}[k] + \phi[k])    (15)

where

w^T[k] = [w_{11}[k], \ldots, w_{1F}[k], \ldots, w_{M1}[k], \ldots, w_{MF}[k]]
\phi^T[k] = [\phi_1[k], \ldots, \phi_M[k]]
\hat{r}^T[k] = [\hat{r}_1[k], \ldots, \hat{r}_M[k]]
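As an illustration of the update in equation (15), the following sketch runs a single-channel LMS loop in which a filter driven by a reference-plus-noise signal adapts to track a desired signal. The 4-tap order, step size, signal length, and simple delay target are hypothetical choices for the demonstration, not values taken from the paper.

```python
import numpy as np

def lms_update(w, u, y, mu):
    """One step in the form of equation (15): w[k+1] = w[k] + mu * e[k] * u[k],
    where e[k] = y[k] - y_b[k] and u[k] stacks the reference-plus-noise samples."""
    y_b = w @ u                # upper-beamformer output for this snapshot
    e = y - y_b                # error signal
    return w + mu * e * u, y_b

# hypothetical single-channel demo: adapt a 4-tap filter so that the
# beamformer output tracks a delayed copy of the reference signal
rng = np.random.default_rng(1)
F = 4
w = np.zeros(F)
ref = rng.standard_normal(2000)            # pre-recorded reference r^
noise = 0.05 * rng.standard_normal(2000)   # environmental noise phi
u_sig = ref + noise                        # filter input r^[k] + phi[k]
target_delay = 2
for k in range(F, 2000):
    u = u_sig[k - F:k][::-1]               # most recent F input samples
    y = ref[k - target_delay]              # desired signal y[k]
    w, _ = lms_update(w, u, y, mu=0.01)
print(np.round(w, 2))  # weight near 1 at the tap matching the delay
```

In the paper's architecture the same recursion runs only while the speech activity detector reports a non-speech segment, so the weights are calibrated against noise alone before being handed to the lower beamformer.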
3.2 Speech activity detection

Two possible speech detection methods, energy-based and entropy-based [16], can be used. Both are based on the assumption that the noise is stationary or slowly varying in time. The entropy-based method is chosen in this paper because it is able to detect voice activity in a low-SNR environment. Observation of the spectrogram of very noisy speech signals shows that the speech segments are more organized than the noise segments. Because of this fact, Shannon's entropy [17] can be used to measure the organization of the speech signals; it is defined as

H(G) = -\sum_{u=1}^{U} f(g(u)) \log_2 [f(g(u))]    (16)

where f(g(u)) is the probability density function of a speech signal of symbol u. The concept of entropy applied to speech activity detection is based on the assumption that the signal is more organized in speech segments than in non-speech segments. The measure of entropy is redefined in the spectral domain as

H(|G(\omega, z)|^2) = -\sum_{l=1}^{L} \frac{|G(\omega_l, z)|^2}{\sum_{l=1}^{L} |G(\omega_l, z)|^2} \log \left[ \frac{|G(\omega_l, z)|^2}{\sum_{l=1}^{L} |G(\omega_l, z)|^2} \right]    (17)

where z denotes the zth frame and

|G(z)|^2 = [|G(\omega_1, z)|^2, \ldots, |G(\omega_L, z)|^2]^T

is the magnitude spectrum for frame z. When the input is white noise, H(|G(\omega, z)|^2) is maximized and the maximum value is log(L). On the other hand, H(|G(\omega, z)|^2) is minimized when the input is a pure tone, and the minimum value is zero. The dynamic range of H(|G(\omega, z)|^2) is thus bounded between 0 and log(L), and the entropy of the non-speech segments should be larger than that of the speech segments.

Figure 2 shows the waveform for the utterance 'nine three eight' (in Mandarin) contaminated by white Gaussian noise with a global SNR of −5 dB, the measured entropy distribution, and the detection of the non-speech segments with a fixed threshold of 2.85. The entropy detection shows an acceptable detection of non-speech segments in highly noisy conditions.

Fig. 2 Noisy signal at an SNR of −5 dB in white Gaussian noise for 'nine three eight', measured entropy distribution, and the detection of non-speech segments with a fixed threshold of 2.85
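The entropy measure of equation (17) and the fixed-threshold decision can be sketched as follows. The frame length, FFT size, use of the natural logarithm, and the synthetic noise and tone inputs are illustrative assumptions; the paper's threshold of 2.85 is kept for the decision rule.

```python
import numpy as np

def spectral_entropy(frame, nfft=256):
    """Entropy of the normalized magnitude spectrum, as in equation (17)."""
    G2 = np.abs(np.fft.rfft(frame, nfft)) ** 2
    p = G2 / (np.sum(G2) + 1e-12)          # probability-like normalization
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def detect_speech(x, frame_len=256, threshold=2.85):
    """Flag frames whose entropy falls below a fixed threshold; organized
    (speech-like) frames have lower entropy than noise-only frames."""
    n = len(x) // frame_len
    H = np.array([spectral_entropy(x[i * frame_len:(i + 1) * frame_len])
                  for i in range(n)])
    return H < threshold, H

rng = np.random.default_rng(2)
noise = rng.standard_normal(4096)                          # noise-only segment
tone = np.sin(2 * np.pi * 440 / 16000 * np.arange(4096))   # organized segment
flags_noise, H_noise = detect_speech(noise)
flags_tone, H_tone = detect_speech(tone)
print(H_noise.mean() > H_tone.mean())  # noise entropy exceeds tone entropy
```

Because the spectrum is normalized frame by frame, the decision is insensitive to the absolute signal level, which is what makes the entropy criterion usable at low SNR.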
3.3 Beam-steer filter

A simple delay-and-sum algorithm is used for the beam-steering filter. To cope with the fractional delay problem, an optimal fractional-delay FIR filter design technique [18] is implemented. Without loss of generality, the signals are assumed to have no frequency components above a\pi rad/s (0 < a < 1), and the optimal estimate \hat{c}(i) obtained through a linear combination of the sample values is

\hat{c}(i) = \sum_{v=0}^{V} h_v c(v)    (18)

with the filter coefficients given by

\begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ \vdots \\ h_V \end{bmatrix}
=
\begin{bmatrix}
K(0,0) & K(0,1) & \cdots & K(0,V) \\
K(1,0) & K(1,1) & \cdots & K(1,V) \\
K(2,0) & K(2,1) & \cdots & K(2,V) \\
\vdots & \vdots & & \vdots \\
K(V,0) & K(V,1) & \cdots & K(V,V)
\end{bmatrix}^{-1}
\begin{bmatrix} K(0,i) \\ K(1,i) \\ K(2,i) \\ \vdots \\ K(V,i) \end{bmatrix}    (19)

where K(\cdot) is the sinc kernel (see the notation in the Appendix).
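A minimal sketch of the fractional-delay design of equations (18) and (19) follows, assuming K is the sinc kernel of signals bandlimited to a\pi rad/s. The filter order V = 3, the band factor a = 0.9, and the 0.4-sample test delay are illustrative choices, not values from the paper.

```python
import numpy as np

def frac_delay_fir(delay, V=3, a=0.9):
    """Fractional-delay FIR taps from equations (18)-(19), with K the
    sinc kernel of signals bandlimited to a*pi rad/s."""
    def K(x):
        return a * np.sinc(a * x)          # sin(a*pi*x) / (pi*x)
    v = np.arange(V + 1)
    Kmat = K(v[:, None] - v[None, :])      # Gram matrix K(v, u)
    kvec = K(v - delay)                    # target correlations K(v, i)
    return np.linalg.solve(Kmat, kvec)

# hypothetical check: delay a sampled sinusoid by 0.4 samples
h = frac_delay_fir(0.4, V=3)
n = np.arange(200)
x = np.sin(2 * np.pi * 0.05 * n)
y = np.convolve(x, h)[:200]               # y[n] approximates x[n - 0.4]
print(np.round(h, 3))
```

Steering the beam to an arbitrary angle then amounts to applying one such filter per microphone with the per-channel fractional delays of the delay-and-sum geometry, which is why only a single pre-recorded source set is needed.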
4 EXPERIMENTAL RESULTS

A uniform linear array of six microphones is constructed for the experiment. A larger spacing between the microphones could achieve a better beamforming result, but the MUSIC algorithm needs a smaller spacing to prevent the spatial aliasing effect in the lower frequency range. Because the frequency range 0–2400 Hz contains the major information of the speech source, the spacing between the microphones is chosen as 7 cm. The amplified microphone signals are sampled by a 16 kHz, 16-bit A/D (analogue-to-digital) card, and the computing platform is a Pentium III 550 MHz PC. The array is mounted on an easel at a height of 1 m and 3 m from the nearest wall. The environment is a 20 m × 15 m room full of office furniture, to simulate a real environment. The interference signals in the experiment are mutually uncorrelated white noise.

Fig. 3 Testing scenario 1: array of six microphones in a noisy environment

The first scenario (Fig. 3) tests the performance under a fixed interference signal and different speech source directions. Loudspeakers are used to produce these signals. The interference signal comes from 60° at a distance of 150 cm. The second scenario (Fig. 4) tests the performance under a fixed speech source and a different number of interference signals. In addition to the proposed algorithm (Fig. 1), the original adaptive beamformer proposed by Dahl and Claesson [8] is also tested for comparison. The results are shown in the following sections.

Table 2 Beamforming result with order 30

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
45              5.7539      22.3684               21.4832
30              5.6336      21.2468               20.2601
15              4.0356      19.4224               19.1934
0               4.3570      20.3941               20.3941
−15             3.5473      21.3124               21.0396
−30             4.5161      23.9333               22.3824
−45             4.0351      21.7139               20.9475
4.1 Scenario 1

4.1.1 DOA result

Table 1 shows the statistics of the estimation results of the proposed DOA algorithm; the input SNR at the different angles can be seen in Table 2. This result is compared with the DOA algorithm that processes all frequencies in the signal bandwidth. Although the proposed algorithm chooses only ten significant frequencies to estimate the power spectrum (as listed in the left half of the table), the statistical result shows that it has better accuracy than the algorithm that processes all frequencies in the signal bandwidth. In Fig. 5, the dotted line and the solid line represent the estimated MUSIC spectrum in the non-speech segment and in the speech segment, respectively. By comparing these two spectrums, the speaker source direction can be determined.

Fig. 5 Customized DOA spectrum
4.1.2 Beamforming result

Tables 2 to 4 show the SNR improvements in the experiments when the filter tap length in the beamformer is 30, 60, and 90, respectively. For the modified algorithm, the beam-steer filter's tap length is 4 (section 3.3). The results show a little degradation of the modified algorithm compared with the original one by Dahl and Claesson. However, the modified algorithm records only one set of the source signal, at 0°. This shows that, with correct DOA information, a simple delay-and-sum beam-steering can simulate the source signal well in different directions for the adaptive algorithm to be effective. However, this does not mean that the delay-and-sum beam-steering captures the spatial characteristics accurately. In other words, performance may be degraded by other uncertainties such as misplacement of the sensors or mismatch in the delay time. Figure 6 shows the time-domain waveforms of the source signal, the interference, and the enhanced results. In general, the SNR can be enhanced to about 19.2–25 dB from about 3.5–5.7 dB. As the filter tap length increases, the SNR improves, as shown in Fig. 7.

Fig. 4 Testing scenario 2: array of six microphones in a noisy environment

Table 1 Customized DOA estimation result

                      Ten significant frequencies selected   All frequencies selected
Correct angle (deg)   Mean        Standard deviation         Mean        Standard deviation
−45                   −43.7619    1.3381                     −43.8571    2.1974
−30                   −30.2381    2.644                      −30.4762    3.0922
−15                   −15         2.4698                     −14.4762    3.4441
0                     2.9524      3.7878                     2.6667      5.0133
15                    14.8095     2.2939                     14.3333     3.3066
30                    29.5238     2.9431                     29.4286     3.0589
45                    43.4762     1.4703                     43.0476     2.4388

Table 3 Beamforming result with order 60

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
45              5.7539      21.3245               21.0223
30              5.6336      22.3814               21.3591
15              4.0356      21.9316               19.3706
0               4.3570      20.5921               20.5921
−15             3.5473      23.0127               21.4250
−30             4.5161      24.5836               22.4966
−45             4.0351      22.9967               22.2750

Table 4 Beamforming result with order 90

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
45              5.7539      22.3891               22.0821
30              5.6336      22.8585               21.4578
15              4.0356      20.9760               19.2551
0               4.3570      21.7993               21.7993
−15             3.5473      22.4586               21.5892
−30             4.5161      25.3235               22.3848
−45             4.0351      22.9700               22.0310

4.2 Scenario 2

4.2.1 DOA result

In this scenario, a speaker source is fixed in one direction, with different interference signals arriving from other directions. As shown in Table 5, the standard deviation of the DOA estimation increases with the number of interference signals. This is because
Fig. 7 Average SNR
Table 5 DOA result in scenario 2

                      Without interference signals   Interference at 60° and −30°   Interference at 60°, −30°, and −60°
Correct angle (deg)   Mean     Standard deviation    Mean     Standard deviation     Mean     Standard deviation
0                     1.45     1.3168                2.8      4.1624                 2.9      6.5042
30                    31.85    1.6944                29.25    5.8658                 28       6.8133
−15                   −14.7    1.7501                −17.7    4.2932                 −18.6    5.0928
increasing the number of interference signals leads to a lower SNR and fewer degrees of freedom in the noise subspace. Although the estimation accuracy decreases in the more complex environment, it still remains within an acceptable range.

4.2.2 Beamforming result

Tables 6 and 7 give the beamforming results with a 60th-order weighting vector applied for each microphone. Compared with Table 3, the modified beamformer still works well as the number of interference signals increases.

Table 6 Beamforming result with noise angles of 60° and −30°

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
30              2.8234      20.0548               18.9461
0               1.2637      17.3820               17.3820
−15             0.4372      17.9555               16.7834

Table 7 Beamforming result with noise angles of 60°, −30°, and −60°

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
30              −0.2980     16.8331               15.7307
0               −1.8639     14.4653               14.4653
−15             −2.6842     14.8471               13.2040

4.3 Improvement of the MFCC error distance

Besides the noise power reduction, another important point that should be considered is whether the cepstrum feature of the reference signal is changed after processing. The purified signal may be used to perform speech recognition in order to understand voice commands for robots. If the feature of the recorded speech is changed after processing, the proposed beamformer would not be suitable when speech recognition is required. Because the Mel-frequency cepstral coefficient (MFCC) is the most popular feature for speech recognition, minimizing the cepstral error distance would increase the speech recognition rate. The cepstral error distance is defined as

E_c = \sum_{p=1}^{P} \| MFCC_{pure}(p) - MFCC_{comparison}(p) \|_2^2    (20)
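The distance of equation (20) reduces to a summed squared Euclidean norm over frames, as in the sketch below; the MFCC matrices here are random placeholders standing in for features computed from real speech, and the frame and coefficient counts are arbitrary.

```python
import numpy as np

def cepstral_error_distance(mfcc_pure, mfcc_comparison):
    """Equation (20): summed squared Euclidean distance between the
    frame-wise MFCC vectors of the clean and the compared signal."""
    diff = mfcc_pure - mfcc_comparison      # shape (P frames, coefficients)
    return float(np.sum(diff ** 2))

# hypothetical MFCC matrices for P = 3 frames of 13 coefficients each
rng = np.random.default_rng(3)
pure = rng.standard_normal((3, 13))
processed = pure + 0.1 * rng.standard_normal((3, 13))   # lightly perturbed
polluted = pure + 1.0 * rng.standard_normal((3, 13))    # heavily perturbed
print(cepstral_error_distance(pure, processed)
      < cepstral_error_distance(pure, polluted))        # smaller after processing
```

A lower E_c against the clean reference indicates that beamforming has preserved the cepstral feature, which is the property the recognizer depends on.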
Figure 8 shows the MFCC of one frame. The solid line denotes the MFCC of the pre-recorded speech source, which is the ideal situation for speech recognition. In the same environment as scenario 1, the average cepstral error distance increased to 10.699, which means that the cepstrum feature of the reference signal is changed by environmental noise and channel distortion. After the contaminated signal is processed by the proposed beamformer, the average cepstral error distance drops to 0.8941 (solid line), which shows that the influence of the interference is greatly reduced.

Fig. 8 MFCC distance

5 CONCLUSION

A microphone array with a customized wide-band eigenstructure-based DOA estimation algorithm and a modified beamformer is proposed in this paper. The experimental results show that the customized DOA estimation can detect the speaker direction within an acceptable error range. Further, the modified beamformer can also reduce the cepstral distance, overcome the calibration problem caused by the mismatch between microphones, and enhance the SNR. With the beam-steer filter, the extra memory needed to form a beam in an arbitrary direction is greatly decreased, and the number of possible beam directions is unrestricted. The modified beamformer is easy to implement, and the hardware cost is low compared with other robust beamformers.

REFERENCES

1 Chun, G. D. and Caudell, T. P. A model for auditory localization in robotic systems based on the neurobiology of the inferior colliculus and analysis of HRTF data. In Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), 2001.
2 Schauer, C. and Gross, H.-M. Model and application of a binaural 360 degree sound localization system. In International Joint INNS–IEEE Conference on Neural Networks, Washington DC, 14–19 July 2001.
3 Frost, O. L. An algorithm for linearly constrained adaptive array processing. Proc. IEEE, August 1972, 60(8), 926–935.
4 Griffiths, L. J. and Jim, C. W. An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propagation, January 1982, AP-30, 27–34.
5 Henry, C. Robust adaptive beamforming. IEEE Trans. Acoust., Speech, Signal Processing, October 1987, ASSP-35, 1365–1376.
6 Hoshuyama, O., Sugiyama, A., and Hirano, A. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Trans. Signal Processing, October 1999, 47(10).
7 Gannot, S., Burshtein, D., and Weinstein, E. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Processing, August 2001, 49, 1614–1626.
8 Dahl, M. and Claesson, I. Acoustic noise and echo cancelling with microphone array. IEEE Trans. Vehicular Technol., September 1999, 48(5), 1518–1526.
9 Abdallah, S., Montrésor, S., and Baudry, M. Speech signal detection in noisy environment using a local entropic criterion. In Eurospeech, Rhodes, Greece, September 1997.
10 Knapp, C. H. and Carter, G. C. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Processing, August 1976, ASSP-24(4), 320–327.
11 Brandstein, M. S. and Silverman, H. F. A robust method for speech signal time-delay estimation in reverberant rooms. In ICASSP-97, Vol. 1, April 1997.
12 Hu, J., Su, T. M., Cheng, C. C., Liu, W. H., and Wu, T. I. A self-calibrated speaker tracking system using both audio and video data. In IEEE Conference on Control Applications, September 2002.
13 Hu, J., Cheng, C. C., Liu, W. H., and Su, T. M. A speaker tracking system with distance estimation using microphone array. In IEEE/ASME International Conference on Advanced Manufacturing Technologies and Education, August 2002.
14 Schmidt, R. O. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas and Propagation, AP-34, 276–280.
15 Rao, B. D. and Hari, K. V. S. Performance analysis of root-MUSIC. IEEE Trans. Acoust., Speech, Signal Processing, 1989, ASSP-37, 1939–1949.
16 Junqua, J.-C., Mak, B., and Reaves, B. A robust algorithm for word boundary detection in presence of noise. IEEE Trans. Speech and Audio Processing, July 1994, 2(3), 406–412.
17 Gokhale, D. V. Maximum entropy characterization of some distributions. In Statistical Distributions in Scientific Work (Eds Patil, Kotz, and Ord), 1975, Vol. 3, pp. 299–304 (Reidel, Boston, Massachusetts).
18 Yu, S. H. and Hu, J. Optimal synthesis of a fractional delay FIR filter in a reproducing kernel Hilbert space. IEEE Signal Processing Lett., June 2001, 8(6).

APPENDIX

Notation

a_mk                               amplitude from the kth speech source to the mth microphone
A(\omega_l)                        direction matrix at frequency \omega_l
A_i(\omega_l)                      direction vector at frequency \omega_l
c(v)                               undelayed original signal
\hat{c}(i)                         estimated delayed signal
C_xx(\omega_l)                     source part correlation matrix at frequency \omega_l
d                                  number of sources
D                                  number of significant frequencies
DOA                                direction of arrival
e[n]                               error signal
E_c                                MFCC error distance
E_1(\omega_l), ..., E_M(\omega_l)  eigenvectors of R_xx(\omega_l)
f(.)                               probability density function
G = [g(1), ..., g(U)]              speech signal of U symbols
h_v                                vth component of the beam-steer filter
H(.)                               entropy
HRTF                               head related transfer function
IID                                interaural intensity difference
ITD                                interaural time difference
J(h_i)                             cost function for a DOA estimation at h_i
K(.)                               sinc function
L                                  number of frequency components
LMS                                least mean square
M                                  number of microphones
MFCC_comparison(p)                 MFCC of the polluted or processed signal in the pth frame
MFCC_pure(p)                       MFCC of the original signal in the pth frame
MUSIC                              multiple signals classification
n_1(t), ..., n_M(t)                non-directional noises at microphones 1 to M in the continuous time domain
N_1(\omega_l), ..., N_M(\omega_l)  non-directional noises at microphones 1 to M at frequency \omega_l
N(\omega_l)                        non-directional noise vector in the frequency domain
P                                  number of frames of calculated data
P_N(\omega_l)                      non-directional noise projection matrix at frequency \omega_l
r_1[n], ..., r_M[n]                pre-recorded speech sources at microphones 1 to M in the discrete time domain
\hat{r}_1[n], ..., \hat{r}_M[n]    modified reference signals at microphones 1 to M in the discrete time domain
\hat{r}[k]                         modified reference signal vector at the kth iteration
R_ss(\omega_l)                     source correlation matrix at frequency \omega_l
R_{s_p s_o}(\omega_l)              correlation between source p and source o at frequency \omega_l
R_xx(\omega_l)                     received signal correlation matrix at frequency \omega_l
R_{x_i x_j}(\omega_l)              correlation between received signal i and received signal j at frequency \omega_l
s_1(t), ..., s_d(t)                sources in the continuous time domain
S_1(\omega_l), ..., S_d(\omega_l)  sources at frequency \omega_l
S(\omega_l)                        source vector at frequency \omega_l
SNR                                signal-to-noise ratio
T                                  finite observation interval
U                                  number of symbols
V                                  order of the beam-steer filter
w[k]                               weighting vector at the kth iteration
x_1[n], ..., x_M[n]                received signals at microphones 1 to M in the discrete time domain
x_1(t), ..., x_M(t)                received signals at microphones 1 to M in the continuous time domain
X_1(\omega_l), ..., X_M(\omega_l)  received signals at microphones 1 to M at frequency \omega_l
X(\omega_l)                        received signal vector at frequency \omega_l
y[n]                               desired signal
y_b[n]                             output data signal of the upper beamformer
\hat{y}[n]                         output data signal of the lower beamformer
\tau_mk                            time delay from the kth speech source to the mth microphone
\phi_1[n], ..., \phi_M[n]          environmental noises at microphones 1 to M in the discrete time domain
\phi[k]                            environmental noise vector at the kth iteration
\omega                             frequency value
\omega_c                           central frequency
\omega_l                           lth frequency component
\hat{\omega}_q                     qth significant frequency
\lambda_1(\omega_l), ..., \lambda_M(\omega_l)   eigenvalues of R_xx(\omega_l)
\mu                                step size for the LMS
[.]_q                              qth biggest value