
Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 2005, 219(2): 133
DOI: 10.1243/095965105X9461
The online version of this article can be found at: http://pii.sagepub.com/content/219/2/133

J Hu, C C Cheng and W H Liu
Processing of speech signals using a microphone array for intelligent robots

Published by SAGE (http://www.sagepublications.com) on behalf of the Institution of Mechanical Engineers.


Version of Record: 1 Mar 2005


Processing of speech signals using a microphone array for intelligent robots

J Hu*, C C Cheng, and W H Liu

Department of Electrical and Control Engineering, National Chiao Tung University, Taiwan, Republic of China

The manuscript was received on 22 January 2004 and was accepted after revision for publication on 11 November 2004.

DOI: 10.1243/095965105X9461

Abstract: For intelligent robots to interact with people, an efficient human–robot communication interface (e.g. voice command) is very important. However, recognizing a voice command represents only part of speech communication; the physics of the speech signal carries other information, such as the speaker direction. Moreover, a basic element of processing the speech signal is recognition at the acoustic level, and recognition performance depends greatly on the quality of the received signal: in a noisy environment, the success rate can be very poor. As a result, prior to speech recognition, it is important to process the speech signals to extract the needed content while rejecting the rest (such as background noise). This paper presents a speech purification system for robots to improve the signal-to-noise ratio of reception, together with a multidirection calibration beamforming algorithm.

Keywords: beamforming, beamformer, DOA, microphone array, robot hearing, speech enhancement

* Corresponding author: Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu, Taiwan, Republic of China.

1 INTRODUCTION

With the advent of the computing power of microprocessors and digital signal processors, the possibility of constructing an intelligent robot to perform complex tasks is not such a far-reaching goal. Among the various features offered by an intelligent robot, the communication interface is still an on-going research topic. It is generally believed that the interface should not be restricted to keyboard, mouse, or remote controller, but should also include natural language. For these reasons, robot hearing research has received much attention over the years. Chun and Caudell [1] tried to use the inferior colliculus structure and head related transfer function (HRTF) information, combined with image processing techniques, to find general rules of human hearing. Schauer and Gross [2] use interaural time difference (ITD) and interaural intensity difference (IID) signals to perform a 360° direction of arrival (DOA) estimation. Speech recognition will inevitably be incorporated into an intelligent robot to make it understand what people say or which command is given. Although speech recognition can have high accuracy in a quiet environment, undesirable signal components due to ambient noise and channel distortion render the recognizer unusable for real-world applications. An adaptive microphone array system is thus designed to purify the polluted signal and to improve the recognition rate.

Using adaptive microphone array algorithms to enhance speech reception in a noisy environment has been studied for many years. Earlier approaches, such as the Frost beamformer [3], the GSC [4], and the robust adaptive beamformer [5], are only good in the ideal case. The ideal case here means that the microphones are mutually matched and the environment is a free space. To cope with these limitations, Hoshuyama et al. [6] proposed two robust constraints on the blocking matrix design. Weinstein [7] proposed a new channel estimation method for the standard GSC architecture in the frequency domain. However, its estimation accuracy would be decreased by a louder noise and circuit noise. Dahl and Claesson [8] proposed an adaptive algorithm which calibrates both the microphone mismatch and the channel effect using a priori information. This a priori information is a set of speech data recorded by the same microphone array in a


quiet environment. It then serves as a reference signal to update the coefficients of the filters when the speaker is silent (i.e. in non-speech segments) and the environment is noisy. With this a priori information, the calibration problem would be solved implicitly. Dahl's algorithm is suitable in the car environment, where the speaker's position is fixed (e.g. the driver). To apply the algorithm to mobile robots, it would be necessary to record reference signals from all directions, since the speaker's position might not be fixed. In this paper, a beamforming architecture modified from the method proposed by Dahl and Claesson [8] is constructed by using a beam-steer filter with only one set of pre-recorded speech source. As a result, the memory requirement and the effort of pre-recording are reduced tremendously. This modified architecture could be more suitable for a robot hearing application.

The direction of the speaker must be known before the beam is formed in the speaker direction. In a noisy environment, the conventional delay estimation method in the time domain [9] or in the frequency domain [10–13] is not able to obtain satisfactory results. In order to make a sound source direction available, a customized wide-band eigenstructure-based DOA estimation algorithm is proposed in this system. This method is based on a blind DOA estimation algorithm called MUSIC (multiple signals classification) [14], with modifications to decrease the computing time and increase the accuracy of the DOA estimation.

The overall system is shown in Fig. 1. The first part consists of speech activity detection to decide when the adaptive beamformer should be switched on or off. The second part is the DOA estimation and the adaptation of the upper beamformer. By incorporating the DOA knowledge, the beam-steer filter is used to steer the direction of the beam for acquiring clean speech of a speaker. Because the target is a speech signal, a broadband beam-steer filter is needed. The third part is to apply the beamformer computation to increase the signal-to-noise ratio (SNR).

Fig. 1 Overall system structure

The paper is organized as follows. The customized wide-band eigenstructure-based DOA estimation algorithm is described in section 2. Section 3 discusses the modified beamformer, speech activity detection, and the beam-steer filter. Section 4 provides experimental results of the DOA and beamformer obtained with the speaker in several different directions. Finally, a conclusion is given in section 5.

2 DIRECTION OF ARRIVAL (DOA) ESTIMATION

The idea of a blind DOA estimation algorithm called MUSIC [14] is adopted in this platform to detect the speaker's direction. The received signal contains d sources and can be presented as

x_m(t) = \sum_{k=1}^{d} a_{mk} s_k(t - \tau_{mk}) + n_m(t)   (1)

Generally, the sources here may include the speech source and interference signals from the acoustic environment. The noise n_m(t) refers to non-directional interference signals such as electronic noise (called non-directional noise in the following context). In order to express the delay relations as phase shifts, the received signal is transformed into the


frequency domain over a finite observation interval T:

X_m(\omega_l) = \frac{1}{T} \int_{-T/2}^{T/2} x_m(t)\, e^{-j\omega_l t}\, dt, \qquad \omega_l = \frac{2\pi l}{T}, \quad l = 1, \ldots, L   (2)

where \omega_1 and \omega_L are the lowest and highest frequencies included in the bandwidth B. The original model can be described as

X_m(\omega_l) = \sum_{k=1}^{d} a_{mk} S_k(\omega_l)\, e^{-j\omega_l \tau_{mk}} + N_m(\omega_l)   (3)

Rewriting equation (3) in matrix form gives

X(\omega_l) = A(\omega_l) S(\omega_l) + N(\omega_l)   (4)

where

X^T(\omega_l) = [X_1(\omega_l), \ldots, X_M(\omega_l)]
N^T(\omega_l) = [N_1(\omega_l), \ldots, N_M(\omega_l)]
S^T(\omega_l) = [S_1(\omega_l), \ldots, S_d(\omega_l)]

A(\omega_l) = \begin{bmatrix} a_{11} e^{-j\omega_l \tau_{11}} & \cdots & a_{1d} e^{-j\omega_l \tau_{1d}} \\ \vdots & & \vdots \\ a_{M1} e^{-j\omega_l \tau_{M1}} & \cdots & a_{Md} e^{-j\omega_l \tau_{Md}} \end{bmatrix}

Note that each column presents the delay relations between the microphones caused by one source; the ith column vector of A(\omega_l) is denoted by A_i(\omega_l) and referred to as the direction vector.

Suppose the noises are mutually independent. If the noise correlation matrix is the diagonal matrix \sigma^2(\omega_l) I, the received signal correlation matrix can be described as

R_{xx}(\omega_l) = A(\omega_l) R_{ss}(\omega_l) A^H(\omega_l) + \sigma^2(\omega_l) I   (5)

where R_{ss}(\omega_l) = E[S(\omega_l) S^H(\omega_l)], and the eigenvalue decomposition is

R_{xx}(\omega_l) = \sum_{i=1}^{M} \lambda_i(\omega_l) E_i(\omega_l) E_i^H(\omega_l)   (6)

with eigenvalues \lambda_1(\omega_l) \geq \lambda_2(\omega_l) \geq \cdots \geq \lambda_M(\omega_l). From equations (4) and (5), the source part correlation matrix is

C_{xx}(\omega_l) = A(\omega_l) R_{ss}(\omega_l) A^H(\omega_l) = \sum_{i=1}^{d} [\lambda_i(\omega_l) - \sigma_n^2(\omega_l)] E_i(\omega_l) E_i^H(\omega_l)   (7)

and the rank of C_{xx}(\omega_l) is d. Then the following relations can be derived:

RangeSpace(C_{xx}(\omega_l)) = span\{A_1(\omega_l), \ldots, A_d(\omega_l)\} = span\{E_1(\omega_l), \ldots, E_d(\omega_l)\}
RangeSpace(A(\omega_l))^{\perp} = span\{E_{d+1}(\omega_l), \ldots, E_M(\omega_l)\}

Combining the equations above, span\{E_1(\omega_l), \ldots, E_d(\omega_l)\} is the source subspace and span\{E_{d+1}(\omega_l), \ldots, E_M(\omega_l)\} is the non-directional noise subspace. Because the source subspace is orthogonal to the non-directional noise subspace,

E_j^H(\omega_l) A_i(\omega_l) = 0, \qquad i = 1, \ldots, d; \ j = d+1, \ldots, M   (8)

By equation (8), a non-directional noise projection matrix P_N(\omega_l) can be established as

P_N(\omega_l) = \sum_{i=d+1}^{M} E_i(\omega_l) E_i^H(\omega_l)   (9)

The number of sources d can be determined from the distribution of the eigenvalues. The DOA can be detected by projecting the direction vector on to the non-directional noise projection matrix, since

P_N(\omega_l) A_i(\omega_l) = 0   (10)

Usually, the maximum d values of

\frac{1}{(1/L) \sum_{l=1}^{L} \| E_j^H(\omega_l) A_i(\omega_l) \|_2^2} = \frac{1}{(1/L) \sum_{l=1}^{L} A_i^H(\omega_l) P_N(\omega_l) A_i(\omega_l)}   (11)

are regarded as the d source directions.

The computing requirement of equation (11) can be reduced by considering only the significant frequencies of concern. The selection criterion is based on the assumption that the non-directional noises are mutually independent, so the non-diagonal components of the correlation matrix exclude the non-directional noise terms. It means that the following terms in the correlation matrix (5) should be small:

R_{x_i x_j}(\omega_l) = \sum_{p=1}^{d} \sum_{o=1}^{d} a_{ip} a_{jo} R_{s_p s_o}(\omega_l), \qquad \forall i \neq j   (12)

Then the Q significant frequencies \hat{\omega}_1, \ldots, \hat{\omega}_Q can be selected as

\hat{\omega}_q = \left\lceil \sum_{i=1}^{M} \sum_{j=i+1}^{M} | R_{x_i x_j}(\omega_l) | \right\rceil_q   (13)

where \lceil \cdot \rceil_q denotes the qth biggest value over the frequency index l.
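To make the estimator concrete, the subspace computation of equations (5) to (11) and the frequency-averaged spectrum can be sketched as below. This is an illustrative sketch, not the authors' implementation: the 6-microphone, 7 cm geometry follows the paper's experiment, but the simulation (frequency-domain snapshots generated directly per bin rather than via an STFT of recorded audio), the scanning grid, and the noise level are assumptions.

```python
# Illustrative sketch of the wide-band MUSIC estimator (equations (5)-(11), (14)).
import numpy as np

C = 343.0  # assumed speed of sound (m/s)

def steering_vector(theta_deg, omega, mic_pos):
    # Direction vector A_i(omega): phase shifts of the per-microphone delays.
    tau = mic_pos * np.sin(np.deg2rad(theta_deg)) / C
    return np.exp(-1j * omega * tau)

def noise_projector(R, d):
    # P_N(omega): projector onto the span of the M-d smallest eigenvectors (eq (9)).
    _, E = np.linalg.eigh(R)          # eigenvalues returned in ascending order
    En = E[:, : R.shape[0] - d]
    return En @ En.conj().T

def wideband_music(snapshots, omegas, mic_pos, d, angles_deg):
    # snapshots: one (M, K) array of K frequency-domain frames per selected bin.
    # Returns the pseudo-spectrum averaged over the selected frequencies (eq (14)).
    denom = np.zeros(len(angles_deg))
    for X, omega in zip(snapshots, omegas):
        R = X @ X.conj().T / X.shape[1]          # correlation matrix (eq (5))
        PN = noise_projector(R, d)
        for i, th in enumerate(angles_deg):
            a = steering_vector(th, omega, mic_pos)
            denom[i] += np.real(a.conj() @ PN @ a)
    return len(omegas) / denom

# Toy scenario: one source at +20 degrees plus light non-directional noise.
rng = np.random.default_rng(0)
mic_pos = 0.07 * np.arange(6)
omegas = 2 * np.pi * np.array([500.0, 1000.0, 1500.0, 2000.0])
snapshots = []
for omega in omegas:
    a = steering_vector(20.0, omega, mic_pos)[:, None]
    S = rng.standard_normal((1, 200)) + 1j * rng.standard_normal((1, 200))
    N = 0.05 * (rng.standard_normal((6, 200)) + 1j * rng.standard_normal((6, 200)))
    snapshots.append(a @ S + N)

angles = np.arange(-90.0, 91.0)
J = wideband_music(snapshots, omegas, mic_pos, d=1, angles_deg=angles)
print(angles[np.argmax(J)])  # peak of the pseudo-spectrum, near +20
```

Note that the 7 cm spacing keeps the spacing below half a wavelength up to the highest frequency used here, which is why the peak is unambiguous over the full ±90° scan.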


As a result, the d source directions can be estimated by searching for the maximum d values of

J(\theta_i) = \frac{1}{(1/Q) \sum_{q=1}^{Q} A_i^H(\hat{\omega}_q) P_N(\hat{\omega}_q) A_i(\hat{\omega}_q)}   (14)

Searching the spectrum for d peaks to determine the direction of arrival still requires plenty of processing time when the accuracy requirement is high. This is the drawback of this method, which requires further improvement. Although the root-finding MUSIC [15] algorithm can calculate the DOA without searching the spectrum, it needs a uniform-shaped array. Because the shape of the microphone array on the robot may change with different applications, the root-finding method is not implemented in the proposed platform.

3 SPEECH ENHANCEMENT

3.1 The modified beamformer approach

The approach can be arranged in the following steps.

Step 1 is to pre-record the speech source.

Step 2 is speech activity detection, described in section 3.2.

Step 3 is to adjust the pre-recorded speech source by the beam-steer filter in order to produce the correct reference signals. The DOA information is obtained by the MUSIC algorithm mentioned above. Generally, the MUSIC spectrum contains both the directional information of the speaker and that of an interference signal during the speech segment. In order to determine the speaker's direction, the MUSIC spectrum is computed contiguously, and the speaker's direction can then be obtained by comparing the spectra before and after the speech activity is detected. The design of the beam-steer filter will be described in section 3.3, and the modified reference signals are denoted as \hat{r}_1[n], \ldots, \hat{r}_M[n].

In step 4, the weighting matrix of the upper beamformer is modified in the non-speech segments, and the newly updated weighting matrix is passed to the lower beamformer in the speech segments. The LMS method is used here to perform the adaptation in the non-speech segments. If speech segments are detected, the data flow through the lower beamformer and the output data sequence \hat{y}[n] is produced.

Assume that the order of the weighting vector in each microphone is F. The LMS adaptation is

w[k+1] = w[k] + \mu (y[k] - y_b[k]) (\hat{r}[k] + \phi[k])
w^T[k] = [w_{11}[k], \ldots, w_{1F}[k], \ldots, w_{M1}[k], \ldots, w_{MF}[k]]
\phi^T[k] = [\phi_1[k], \ldots, \phi_M[k]]
\hat{r}^T[k] = [\hat{r}_1[k], \ldots, \hat{r}_M[k]]   (15)

3.2 Speech activity detection

Two possible speech detection methods, energy-based and entropy-based [16], can be used. They are based on the assumption that the noise is stationary or slowly varying in time. The entropy-based method is chosen in this paper because it is able to detect voice activity in a low SNR environment. Observation of the spectrogram of very noisy speech signals shows that the speech segments are more organized than the noise segments. Because of this fact, Shannon's entropy [17] can be used to measure the organization of the speech signals; it is defined as

H(G) = -\sum_{u=1}^{U} f(g(u)) \log_2 [f(g(u))]   (16)

where f(g(u)) is the probability density function of a speech signal at symbol u. The concept of entropy applied to speech activity detection is based on the assumption that the signal is more organized in speech segments than in non-speech segments. The measure of entropy is redefined in the spectral domain as

H(|G(\omega, z)|^2) = -\sum_{l=1}^{L} \frac{|G(\omega_l, z)|^2}{\sum_{l=1}^{L} |G(\omega_l, z)|^2} \log \left[ \frac{|G(\omega_l, z)|^2}{\sum_{l=1}^{L} |G(\omega_l, z)|^2} \right]   (17)

where z denotes the zth frame and

|G(z)|^2 = [|G(\omega_1, z)|^2, \ldots, |G(\omega_L, z)|^2]^T

is the magnitude spectrum for frame z. When the input is white noise, H(|G(\omega, z)|^2) is maximized and the maximum value is \log(L). On the other hand, H(|G(\omega, z)|^2) is minimized when the input is a pure tone, and the minimum value is zero. The dynamic range of H(|G(\omega, z)|^2) is thus bounded between 0 and \log(L), and the entropy of the non-speech segments should be larger than that of the speech segments.

Figure 2 shows the waveform for the utterance 'nine three eight' (in Mandarin) contaminated by white Gaussian noise with a global SNR of −5 dB, the measured entropy distribution, and the detection of non-speech segments with a fixed threshold of 2.85.
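As a minimal sketch of the spectral-entropy measure in equations (16) and (17) — the frame length, sampling rate, and test signals below are illustrative assumptions, not the paper's configuration:

```python
# Spectral entropy of a frame (eq (17)): normalize the magnitude-squared
# spectrum into a probability mass and take its Shannon entropy.
import numpy as np

def spectral_entropy(frame):
    p = np.abs(np.fft.rfft(frame)) ** 2
    p = p / p.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return float(-(p * np.log(p)).sum())

fs = 16000
n = np.arange(512)
tone = np.sin(2 * np.pi * 1000 * n / fs)               # organized: low entropy
noise = np.random.default_rng(1).standard_normal(512)  # white: near-maximal entropy

print(spectral_entropy(tone) < spectral_entropy(noise))  # True
```

A fixed threshold on this quantity (such as the 2.85 used in Fig. 2) then labels high-entropy frames as non-speech.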


Fig. 2 Noisy signal at an SNR of −5 dB in white Gaussian noise for 'nine three eight', measured entropy distribution, and the detection of non-speech segments with a fixed threshold of 2.85

The entropy detection shows an acceptable detection of non-speech segments in highly noisy conditions.

3.3 Beam-steer filter

A simple delay-and-sum algorithm is used for the beam-steering filter. To cope with the fractional delay problem, an optimal fractional delay FIR filter design technique [18] is implemented. Without loss of generality, the signals are assumed to have no frequency components above \alpha\pi rad/s (0 < \alpha < 1), and the optimal estimate \hat{c}(i) through a linear combination of the sample values is

\hat{c}(i) = \sum_{v=0}^{V} h_v c(v)   (18)

\begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ \vdots \\ h_V \end{bmatrix} = \begin{bmatrix} K(0,0) & K(0,1) & \cdots & K(0,V) \\ K(1,0) & K(1,1) & \cdots & K(1,V) \\ K(2,0) & K(2,1) & \cdots & K(2,V) \\ \vdots & \vdots & & \vdots \\ K(V,0) & K(V,1) & \cdots & K(V,V) \end{bmatrix}^{-1} \begin{bmatrix} K(0,i) \\ K(1,i) \\ K(2,i) \\ \vdots \\ K(V,i) \end{bmatrix}   (19)

4 EXPERIMENTAL RESULTS

A uniform linear array of six microphones is constructed for the experiment. A larger spacing between the microphones could achieve a better beamforming result, but the MUSIC algorithm needs a smaller spacing to prevent the spatial aliasing effect in the lower frequency range. Because the frequency range 0–2400 Hz contains the major information of the speech source, the spacing between the microphones is chosen as 7 cm. The amplified microphone signals are sampled by a 16 kHz, 16 bit A/D (analogue-to-digital) card, and the computing platform is a Pentium III 550 MHz PC. The array is mounted on an easel at a height of 1 m and 3 m from the nearest wall. The environment is a 20 m × 15 m room full of office furniture, to simulate a real environment. The interference signals in the experiment are mutually uncorrelated white noise. The first scenario (Fig. 3) tests the performance under a fixed interference signal and different speech source directions.

Fig. 3 Testing scenario 1: array of six microphones in a noisy environment
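A small sketch of the design in equations (18) and (19), assuming the kernel K(.,.) is the sinc kernel of signals bandlimited to \alpha\pi; the value of \alpha, the order V, and the test signal are illustrative choices (the paper's beam-steer filter uses a tap length of 4):

```python
# Fractional-delay FIR design (eqs (18)-(19)): solve h = K^{-1} k, where K is
# the sinc Gram matrix over the sample grid and k evaluates the kernel at the
# fractional index i.
import numpy as np

def fractional_delay_fir(i_frac, V, alpha=0.5):
    v = np.arange(V + 1)
    K = np.sinc(alpha * (v[:, None] - v[None, :]))  # K(v, w)
    k = np.sinc(alpha * (v - i_frac))               # K(v, i)
    return np.linalg.solve(K, k)

# Interpolate a bandlimited signal at a fractional sample index.
h = fractional_delay_fir(3.3, V=8)
n = np.arange(9)
sig = np.cos(0.2 * np.pi * n)   # frequency well inside the alpha*pi band
est = float(h @ sig)            # estimate of c(3.3) via eq (18)
```

A quick sanity check of the design: at an integer index the right-hand side k is a column of K, so the solved weights reduce to a unit vector and the filter returns the sample exactly.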


Loudspeakers are used to produce these signals. The interference signal comes from 60° at a distance of 150 cm. The second scenario (Fig. 4) tests the performance under a fixed speech source and a different number of interference signals. In addition to the proposed algorithm (Fig. 1), the original adaptive beamformer proposed by Dahl and Claesson [8] is also tested for comparison. The results are shown in the following sections.

Fig. 4 Testing scenario 2: array of six microphones in a noisy environment

4.1 Scenario 1

4.1.1 DOA result

Table 1 shows the statistics of the estimation result of the proposed DOA algorithm; the input SNR at the different angles can be seen in Table 2. This result is compared with the DOA algorithm that processes all frequencies in the signal bandwidth. Although the proposed algorithm chooses only ten significant frequencies to estimate the power spectrum (as listed in the left half of the table), the statistical result shows that it has a better accuracy than the algorithm that processes all frequencies in the signal bandwidth. In Fig. 5, the dotted line and the solid line represent the estimated MUSIC spectrum in the non-speech segment and in the speech segment, respectively. By comparing these two spectra, the speaker source direction can be determined.

Fig. 5 Customized DOA spectrum

Table 1 Customized DOA estimation result

Correct angle (deg) | Ten significant frequencies: mean | std | All frequencies: mean | std
−45 | −43.7619 | 1.3381 | −43.8571 | 2.1974
−30 | −30.2381 | 2.644 | −30.4762 | 3.0922
−15 | −15 | 2.4698 | −14.4762 | 3.4441
0 | 2.9524 | 3.7878 | 2.6667 | 5.0133
15 | 14.8095 | 2.2939 | 14.3333 | 3.3066
30 | 29.5238 | 2.9431 | 29.4286 | 3.0589
45 | 43.4762 | 1.4703 | 43.0476 | 2.4388

4.1.2 Beamforming result

Tables 2 to 4 show the SNR improvements in the experiments when the filter tap length in the beamformer is 30, 60, and 90 respectively. For the modified algorithm, the beam-steer filter's tap length is 4 (section 3.3).

Table 2 Beamforming result with order 30

Correct angle (deg) | Input SNR (dB) | Original beamformer (dB) | Modified beamformer (dB)
45 | 5.7539 | 22.3684 | 21.4832
30 | 5.6336 | 21.2468 | 20.2601
15 | 4.0356 | 19.4224 | 19.1934
0 | 4.3570 | 20.3941 | 20.3941
−15 | 3.5473 | 21.3124 | 21.0396
−30 | 4.5161 | 23.9333 | 22.3824
−45 | 4.0351 | 21.7139 | 20.9475

The results show a little degradation of the modified algorithm compared with the original one by Dahl and Claesson. However, the modified algorithm only records one set of the source signal, at 0°.
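The LMS adaptation of equation (15) that produces these beamformer weights can be sketched generically as follows; the single-tap-per-channel structure, the step size \mu, and the noiseless signal model are simplifying assumptions for illustration, not the experimental configuration:

```python
# Generic LMS weight update in the form of eq (15): the regressor u[k] plays
# the role of r_hat[k] + phi[k], y[k] is the desired signal, and y_b is the
# upper-beamformer output.
import numpy as np

rng = np.random.default_rng(2)
M, K = 6, 5000
w_true = rng.standard_normal(M)      # unknown calibration/channel weights
u = rng.standard_normal((K, M))      # regressor sequence (r_hat[k] + phi[k])
y = u @ w_true                       # desired signal y[k] (noiseless here)

w = np.zeros(M)
mu = 0.01                            # step size for the LMS
for k in range(K):
    y_b = w @ u[k]                   # beamformer output y_b[k]
    w = w + mu * (y[k] - y_b) * u[k]  # eq (15)

print(np.max(np.abs(w - w_true)))    # converges toward w_true
```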


Table 3 Beamforming result with order 60

Correct angle (deg) | Input SNR (dB) | Original beamformer (dB) | Modified beamformer (dB)
45 | 5.7539 | 21.3245 | 21.0223
30 | 5.6336 | 22.3814 | 21.3591
15 | 4.0356 | 21.9316 | 19.3706
0 | 4.3570 | 20.5921 | 20.5921
−15 | 3.5473 | 23.0127 | 21.4250
−30 | 4.5161 | 24.5836 | 22.4966
−45 | 4.0351 | 22.9967 | 22.2750

Table 4 Beamforming result with order 90

Correct angle (deg) | Input SNR (dB) | Original beamformer (dB) | Modified beamformer (dB)
45 | 5.7539 | 22.3891 | 22.0821
30 | 5.6336 | 22.8585 | 21.4578
15 | 4.0356 | 20.9760 | 19.2551
0 | 4.3570 | 21.7993 | 21.7993
−15 | 3.5473 | 22.4586 | 21.5892
−30 | 4.5161 | 25.3235 | 22.3848
−45 | 4.0351 | 22.9700 | 22.0310

This shows that, with correct DOA information, a simple delay-and-sum beam-steering can simulate the source signal well in different directions for the adaptive algorithm to be effective. However, this does not mean that the delay-and-sum beam-steering captures the spatial characteristics accurately. In other words, performance may be degraded by other uncertainties such as misplacement of the sensors or mismatch in the delay time. Figure 6 shows the time-domain waveforms of the source signal, the interference, and the enhanced results. In general, the SNR is enhanced to about 19.2–25 dB from about 3.5–5.7 dB. As the filter tap length increases, the SNR improves, as shown in Fig. 7.

4.2 Scenario 2

4.2.1 DOA result

In this scenario, a speaker source is fixed in one direction, with different interference signals from other directions. As shown in Table 5, the standard deviation of the DOA estimation increases with the number of interference signals.


Fig. 7 Average SNR

Table 5 DOA result in scenario 2

Correct angle (deg) | No interference: mean | std | Interference at 60° and −30°: mean | std | Interference at 60°, −30°, and −60°: mean | std
0 | 1.45 | 1.3168 | 2.8 | 4.1624 | 2.9 | 6.5042
30 | 31.85 | 1.6944 | 29.25 | 5.8658 | 28 | 6.8133
−15 | −14.7 | 1.7501 | −17.7 | 4.2932 | −18.6 | 5.0928

This is because increasing the number of interference signals leads to a lower SNR and fewer degrees of freedom in the noise subspace. Although the estimation accuracy decreases in the complex environment, it still remains in an acceptable range.

4.2.2 Beamforming result

Tables 6 and 7 are the beamforming results with a 60th-order weighting vector applied for each microphone. Compared with Table 3, the modified beamformer still works well as the number of interference signals increases.

Table 6 Beamforming result with noise angles of 60° and −30°

Correct angle (deg) | Input SNR (dB) | Original beamformer (dB) | Modified beamformer (dB)
30 | 2.8234 | 20.0548 | 18.9461
0 | 1.2637 | 17.3820 | 17.3820
−15 | 0.4372 | 17.9555 | 16.7834

Table 7 Beamforming result with noise angles of 60°, −30°, and −60°

Correct angle (deg) | Input SNR (dB) | Original beamformer (dB) | Modified beamformer (dB)
30 | −0.2980 | 16.8331 | 15.7307
0 | −1.8639 | 14.4653 | 14.4653
−15 | −2.6842 | 14.8471 | 13.2040

4.3 Improvement of the MFCC error distance

Besides the noise power reduction, another important point that should be considered is whether the cepstrum feature of the reference signal is changed after processing. The purified signal may be used to perform speech recognition in order to understand voice commands for robots; if the feature of the recorded speech is changed after processing, the proposed beamformer would not be suitable when speech recognition is required. Because the Mel-frequency cepstral coefficient (MFCC) is the most popular feature for speech recognition, minimizing the cepstral error distance would increase the speech recognition rate. The cepstral error distance is defined as

E_c = \sum_{p=1}^{P} \| MFCC_{pure}(p) - MFCC_{comparison}(p) \|_2^2   (20)

Figure 8 shows the MFCC of one frame. The solid line denotes the MFCC of the pre-recorded speech source, in the ideal situation for speech recognition.
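The distance of equation (20) is straightforward to compute once MFCC frames are available; the matrices below are random stand-ins for real MFCC features (the MFCC extraction front-end itself, e.g. a standard speech toolkit, is assumed and not shown):

```python
# Cepstral error distance E_c (eq (20)): sum over frames of the squared
# Euclidean distance between MFCC vectors.
import numpy as np

def cepstral_error_distance(mfcc_pure, mfcc_cmp):
    diff = mfcc_pure - mfcc_cmp          # (P frames, coefficients)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(3)
pure = rng.standard_normal((100, 13))                  # P = 100 frames, 13 coefficients
noisy = pure + 0.3 * rng.standard_normal((100, 13))    # heavily perturbed features
processed = pure + 0.03 * rng.standard_normal((100, 13))  # lightly perturbed features

# A smaller E_c for the processed signal mirrors the paper's 10.699 -> 0.8941 drop.
print(cepstral_error_distance(pure, processed) < cepstral_error_distance(pure, noisy))
```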


In the same environment as scenario 1, the average cepstral error distance increased to 10.699 (dotted line), which means that the cepstrum feature of the reference signal is changed by the environmental noise and channel distortion. After the contaminated signal is processed by the proposed beamformer, the average cepstral error distance drops to 0.8941 (solid line), which greatly reduces the influence of the interference.

Fig. 8 MFCC distance

5 CONCLUSION

A microphone array with a customized wide-band eigenstructure-based DOA estimation algorithm and a modified beamformer is proposed in this paper. The experimental results show that the customized DOA estimator can detect the speaker direction within an acceptable error range. Further, the modified beamformer can also reduce the cepstral distance, overcome the calibration problem caused by the mismatch between microphones, and enhance the SNR. With a beam-steer filter, the extra memory needed to form a beam in an arbitrary direction is greatly decreased, and the number of possible beam directions is unrestricted. The modified beamformer is easy to implement, and its hardware cost is low compared with other robust beamformers.

REFERENCES

1 Chun, G. D. and Caudell, T. P. A model for auditory localization in robotic systems based on the neurobiology of the inferior colliculus and analysis of HRTF data. In Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), 2001.
2 Schauer, C. and Gross, H.-M. Model and application of a binaural 360 degree sound localization system. In International Joint INNS–IEEE Conference on Neural Networks, Washington DC, 14–19 July 2001.
3 Frost, O. L. An algorithm for linearly constrained adaptive array processing. Proc. IEEE, August 1972, 60(8), 926–935.
4 Griffiths, L. J. and Jim, C. W. An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propagation, January 1982, AP-30, 27–34.
5 Cox, H., Zeskind, R. M., and Owen, M. M. Robust adaptive beamforming. IEEE Trans. Acoust., Speech, Signal Processing, October 1987, ASSP-35, 1365–1376.
6 Hoshuyama, O., Sugiyama, A., and Hirano, A. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Trans. Signal Processing, October 1999, 47(10).
7 Gannot, S., Burshtein, D., and Weinstein, E. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Processing, August 2001, 49, 1614–1626.
8 Dahl, M. and Claesson, I. Acoustic noise and echo cancelling with microphone array. IEEE Trans. Vehicular Technol., September 1999, 48(5), 1518–1526.
9 Abdallah, S., Montrésor, S., and Baudry, M. Speech signal detection in a noisy environment using a local entropic criterion. In Eurospeech, Rhodes, Greece, September 1997.
10 Knapp, C. H. and Carter, G. C. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Processing, August 1976, ASSP-24(4), 320–327.
11 Brandstein, M. S. and Silverman, H. F. A robust method for speech signal time-delay estimation in reverberant rooms. In ICASSP-97, Vol. 1, April 1997.
12 Hu, J., Su, T. M., Cheng, C. C., Liu, W. H., and Wu, T. I. A self-calibrated speaker tracking system using both audio and video data. In IEEE Conference on Control Applications, September 2002.
13 Hu, J., Cheng, C. C., Liu, W. H., and Su, T. M. A speaker tracking system with distance estimation using microphone array. In IEEE/ASME International Conference on Advanced Manufacturing Technologies and Education, August 2002.
14 Schmidt, R. O. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas and Propagation, 1986, AP-34, 276–280.
15 Rao, B. D. and Hari, K. V. S. Performance analysis of root-MUSIC. IEEE Trans. Acoust., Speech, Signal Processing, 1989, ASSP-37, 1939–1949.
16 Junqua, J.-C., Mak, B., and Reaves, B. A robust algorithm for word boundary detection in the presence of noise. IEEE Trans. Speech and Audio Processing, July 1994, 2(3), 406–412.
17 Gokhale, D. V. Maximum entropy characterization of some distributions. In Statistical Distributions in Scientific Work (Eds Patil, Kotz, and Ord), 1975, Vol. 3, pp. 299–304 (D. Reidel, Boston, Massachusetts).


18 Yu, S. H. and Hu, J. Optimal synthesis of a fractional delay FIR filter in a reproducing kernel Hilbert space. IEEE Signal Processing Lett., June 2001, 8(6).

APPENDIX

Notation

a_{mk} amplitude from the kth speech source to the mth microphone
A(\omega_l) direction matrix at frequency \omega_l
A_i(\omega_l) direction vector at frequency \omega_l
c(v) undelayed original signal
\hat{c}(i) estimated delayed signal
C_{xx}(\omega_l) source part correlation matrix at frequency \omega_l
d number of sources
D number of significant frequencies
DOA direction of arrival
e[n] error signal
E_c MFCC error distance
E_1(\omega_l), \ldots, E_M(\omega_l) eigenvectors of R_{xx}(\omega_l)
f(\cdot) probability density function
G = [g(1), \ldots, g(U)] speech signal of U symbols
h_v vth component of the beam-steer filter
H(\cdot) entropy
HRTF head related transfer function
IID interaural intensity difference
ITD interaural time difference
J(\theta_i) cost function for a DOA estimation at \theta_i
K(\cdot) sinc function
L number of frequency components
LMS least mean square
M number of microphones
MFCC_{comparison}(p) MFCC of the polluted or processed signal in the pth frame
MFCC_{pure}(p) MFCC of the original signal in the pth frame
MUSIC multiple signals classification
n_1(t), \ldots, n_M(t) non-directional noises from microphones 1 to M in the continuous time domain
N_1(\omega_l), \ldots, N_M(\omega_l) non-directional noises from microphones 1 to M at frequency \omega_l
N(\omega_l) non-directional noise vector in the frequency domain
P number of frames of calculated data
P_N(\omega_l) non-directional noise projection matrix at frequency \omega_l
r_1[n], \ldots, r_M[n] pre-recorded speech sources from microphones 1 to M in the discrete time domain
\hat{r}_1[n], \ldots, \hat{r}_M[n] modified reference signals from microphones 1 to M in the discrete time domain
\hat{r}[k] modified reference signal vector at the kth iteration
R_{ss}(\omega_l) source correlation matrix at frequency \omega_l
R_{s_p s_o}(\omega_l) correlation between source p and source o at frequency \omega_l
R_{xx}(\omega_l) received signal correlation matrix at frequency \omega_l
R_{x_i x_j}(\omega_l) correlation between received signals i and j at frequency \omega_l
s_1(t), \ldots, s_d(t) sources in the continuous time domain
S_1(\omega_l), \ldots, S_d(\omega_l) sources at frequency \omega_l
S(\omega_l) source vector at frequency \omega_l
SNR signal-to-noise ratio
T finite observation interval
U number of symbols
V order of the beam-steer filter
w[k] weighting vector at the kth iteration
x_1[n], \ldots, x_M[n] practical received signals from microphones 1 to M in the discrete time domain
x_1(t), \ldots, x_M(t) practical received signals from microphones 1 to M in the continuous time domain
X_1(\omega_l), \ldots, X_M(\omega_l) practical received signals from microphones 1 to M at frequency \omega_l
X(\omega_l) practical received signal vector at frequency \omega_l
y[n] desired signal
y_b[n] output data signal of the upper beamformer
\hat{y}[n] output data signal of the lower beamformer
\tau_{mk} time delay from the kth speech source to the mth microphone
\phi_1[n], \ldots, \phi_M[n] environmental noises from microphones 1 to M in the discrete time domain
\phi[k] environmental noise vector at the kth iteration
\omega frequency value
\omega_c central frequency
\omega_l lth frequency component
\hat{\omega}_q qth significant frequency
\lambda_1(\omega_l), \ldots, \lambda_M(\omega_l) eigenvalues of R_{xx}(\omega_l)
\mu step size for the LMS
\lceil \cdot \rceil_q qth biggest value
