Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, 2005, 219(2), 133. Published by SAGE on behalf of the Institution of Mechanical Engineers. Version of Record: 1 March 2005.
Processing of speech signals using a
microphone array for intelligent robots
J Hu*, C C Cheng, and W H Liu
Department of Electrical and Control Engineering, National Chiao Tung University, Taiwan, Republic of China
The manuscript was received on 22 January 2004 and was accepted after revision for publication on 11 November 2004.
DOI: 10.1243/095965105X9461
Abstract: For intelligent robots to interact with people, an efficient human–robot communication interface (e.g. voice command) is very important. However, recognizing a voice command or speech represents only part of speech communication: the physics of speech signals carries other information as well, such as the speaker's direction. Moreover, although a basic element of processing the speech signal is recognition at the acoustic level, recognition performance depends greatly on the quality of the received signal, and in a noisy environment the success rate can be very poor. As a result, prior to speech recognition it is important to process the speech signals so as to extract the desired content while rejecting other components, such as background noise. This paper presents a speech purification system for robots that improves the signal-to-noise ratio of the received speech, together with an algorithm built around a multidirection calibration beamformer.
Keywords: beamforming, beamformer, DOA, microphone array, robot hearing, speech enhancement
1 INTRODUCTION

With the advent of the computing power of microprocessors and digital signal processors, the possibility of constructing an intelligent robot to perform complex tasks is not such a far-reaching goal. Among the various features offered by an intelligent robot, the communication interface is still an on-going research topic. It is generally believed that the interface should not be restricted to keyboard, mouse, or remote controller, but should also embrace natural language. For these reasons, robot hearing research has received much attention over the years. Chun and Caudell [1] tried to use the inferior colliculus structure and head related transfer function (HRTF) information, combined with image processing techniques, to find general rules of human hearing. Schauer and Gross [2] use interaural time difference (ITD) and interaural intensity difference (IID) signals to perform a 360° direction of arrival (DOA) estimation.

Speech recognition will inevitably be incorporated into an intelligent robot to make it understand what people say or which command is given. Although speech recognition can have high accuracy in a quiet environment, undesirable signal components due to ambient noise and channel distortion render the recognizer unusable for real-world applications. An adaptive microphone array system is thus designed to purify the polluted signal and to improve the recognition rate.

Adaptive microphone array algorithms for enhancing speech reception in a noisy environment have been developed for many years. Earlier approaches, such as the Frost beamformer [3], the GSC [4], and the robust adaptive beamformer [5], are only good in the ideal case, which here means that the microphones are mutually matched and the environment is a free space. To cope with these limitations, Hoshuyama et al. [6] proposed two robust constraints on the blocking matrix design. Weinstein [7] proposed a new channel estimation method for the standard GSC architecture in the frequency domain; however, its estimation accuracy is degraded by louder ambient noise and by circuit noise. Dahl and Claesson [8] proposed an adaptive algorithm which calibrates both the microphone mismatch and the channel effect using a priori information. This a priori information is a set of speech data recorded by the same microphone array in a quiet environment. It then serves as a reference signal to update the coefficients of the filters when the speaker is silent (i.e. in non-speech segments) and the environment is noisy. With this a priori information, the calibration problem is solved implicitly. Dahl's algorithm is suitable in the car environment, where the speaker's position is fixed (e.g. the driver). To apply the algorithm to mobile robots, it would be necessary to record reference signals from all directions, since the speaker's position might not be fixed. In this paper, a beamforming architecture modified from the method proposed by Dahl and Claesson [8] is constructed by using a beam-steer filter with only one set of pre-recorded speech source. As a result, the memory requirement and the effort of pre-recording are reduced tremendously. This modified architecture could be more suitable for a robot hearing application.

The direction of the speaker must be known before the beam is formed in the speaker direction. In a noisy environment, the conventional delay estimation methods in the time domain [9] or in the frequency domain [10–13] are not able to obtain satisfactory results. In order to make a sound source direction available, a customized wide-band eigenstructure-based DOA estimation algorithm is proposed in this system. This method is based on a blind DOA estimation algorithm called MUSIC (multiple signals classification) [14], with modifications to decrease the computing time and increase the accuracy of the DOA estimation.

The overall system is shown in Fig. 1. The first part consists of a speech activity detection to decide when the adaptive beamformer should be switched on or off. The second part is a DOA estimation and adaptation of the upper beamformer. By incorporating the DOA knowledge, the beam-steer filter is used to steer the direction of the beam for acquiring the clean speech of a speaker. Because the target is a speech signal, a broadband beam-steer filter is needed. The third part is to apply the beamformer computation to increase the signal-to-noise ratio (SNR).

The paper is organized as follows. The customized wide-band eigenstructure-based DOA estimation algorithm is described in section 2. Section 3 discusses the modified beamformer, speech activity detection, and the beam-steer filter. Section 4 provides experimental results of the DOA and beamformer obtained with the speaker in several different directions. Finally, a conclusion is given in section 5.

2 DIRECTION OF ARRIVAL (DOA) ESTIMATION

The idea of a blind DOA estimation algorithm called MUSIC [14] is adopted in this platform to detect the speaker's direction. The received signal contains d sources and can be presented as

x_m(t) = \sum_{k=1}^{d} a_{mk} s_k(t - \tau_{mk}) + n_m(t)    (1)

Generally, the sources here may include the speech source and interference signals from the acoustic environment. The noise n_m(t) refers to non-directional interference signals such as electronic noise (called non-directional noise in the following context). In order to express the delay relations as phase shifts, the received signal is transformed into the
frequency domain over a finite observation interval T

X_m(\omega_l) = \frac{1}{T} \int_{-T/2}^{T/2} x_m(t) \, e^{-j\omega_l t} \, dt, \qquad \omega_l = \frac{2\pi l}{T}, \quad l = 1, \ldots, L    (2)

where \omega_1 and \omega_L are the lowest and highest frequencies included in the bandwidth B. The original model can be described as

X_m(\omega_l) = \sum_{k=1}^{d} a_{mk} S_k(\omega_l) \, e^{-j\omega_l \tau_{mk}} + N_m(\omega_l)    (3)

Rewriting equation (3) in matrix form gives

X(\omega_l) = A(\omega_l) S(\omega_l) + N(\omega_l)    (4)

where

X^T(\omega_l) = [X_1(\omega_l), \ldots, X_M(\omega_l)]
N^T(\omega_l) = [N_1(\omega_l), \ldots, N_M(\omega_l)]
S^T(\omega_l) = [S_1(\omega_l), \ldots, S_d(\omega_l)]

A(\omega_l) =
\begin{bmatrix}
a_{11} e^{-j\omega_l \tau_{11}} & \cdots & a_{1d} e^{-j\omega_l \tau_{1d}} \\
\vdots & & \vdots \\
a_{M1} e^{-j\omega_l \tau_{M1}} & \cdots & a_{Md} e^{-j\omega_l \tau_{Md}}
\end{bmatrix}

Note that each column represents the delay relations between the microphones caused by a different source; the ith column vector of A(\omega_l) is denoted by A_i(\omega_l) and referred to as the direction vector.

Suppose the noises are mutually independent. If the noise correlation matrix is the diagonal matrix \sigma^2(\omega_l) I, the received signal correlation matrix can be described as

R_{xx}(\omega_l) = A(\omega_l) R_{ss}(\omega_l) A^H(\omega_l) + \sigma^2(\omega_l) I    (5)

where

R_{ss}(\omega_l) = E[S(\omega_l) S^H(\omega_l)]

and the eigenvalue decomposition is

R_{xx}(\omega_l) = \sum_{i=1}^{M} \lambda_i(\omega_l) E_i(\omega_l) E_i^H(\omega_l)    (6)

with eigenvalues \lambda_1(\omega_l) \geq \lambda_2(\omega_l) \geq \cdots \geq \lambda_M(\omega_l). From equations (4) and (5), the source part correlation matrix is

C_{xx}(\omega_l) = A(\omega_l) R_{ss}(\omega_l) A^H(\omega_l) = \sum_{i=1}^{d} [\lambda_i(\omega_l) - \sigma_n^2(\omega_l)] E_i(\omega_l) E_i^H(\omega_l)    (7)

and the rank of C_{xx}(\omega_l) is d. Then the following relation can be derived

RangeSpace(C_{xx}(\omega_l)) = span{A_1(\omega_l), \ldots, A_d(\omega_l)} = span{E_1(\omega_l), \ldots, E_d(\omega_l)}

Combining the equations above, span{E_1(\omega_l), \ldots, E_d(\omega_l)} is the source subspace and span{E_{d+1}(\omega_l), \ldots, E_M(\omega_l)} is the non-directional noise subspace. Because the source subspace is orthogonal to the non-directional noise subspace,

E_j^H(\omega_l) A_i(\omega_l) = 0, \qquad i = 1, \ldots, d; \; j = d+1, \ldots, M    (8)

By equation (8), a non-directional noise projection matrix P_N(\omega_l) can be established as

P_N(\omega_l) = \sum_{i=d+1}^{M} E_i(\omega_l) E_i^H(\omega_l)    (9)

The number of sources d can be determined from the distribution of the eigenvalues. The DOA can be detected by projecting the direction vector on to the non-directional noise projection matrix, since

P_N(\omega_l) A_i(\omega_l) = 0    (10)

Usually, the d maximum values of the following wide-band spectrum are regarded as the d source directions

\frac{1}{(1/L) \sum_{l=1}^{L} \| E_j^H(\omega_l) A_i(\omega_l) \|_2^2} = \frac{1}{(1/L) \sum_{l=1}^{L} A_i^H(\omega_l) P_N(\omega_l) A_i(\omega_l)}    (11)

The computing requirement of equation (11) can be reduced by considering only the significant frequencies of concern. The selection criterion is based on the assumption that the non-directional noises are mutually independent. Therefore, the non-diagonal components of the correlation matrix exclude the non-directional noise terms; that is, the noise contribution to the following terms in the correlation matrix (5) should be small

R_{x_i x_j}(\omega_l) = \sum_{p=1}^{d} \sum_{o=1}^{d} a_{ip} a_{jo} R_{s_p s_o}(\omega_l), \qquad \forall i \neq j    (12)

Then the Q significant frequencies \hat{\omega}_1, \ldots, \hat{\omega}_Q can be selected as

\hat{\omega}_q = \left[ \sum_{i=1}^{M} \sum_{j=i+1}^{M} | R_{x_i x_j}(\omega_l) | \right]_q    (13)

where [\cdot]_q denotes the frequency \omega_l giving the qth biggest value. As a result, the d source directions can be estimated by searching for the d maximum values of

J(h_i) = \frac{1}{(1/Q) \sum_{q=1}^{Q} A_i^H(\hat{\omega}_q) P_N(\hat{\omega}_q) A_i(\hat{\omega}_q)}    (14)

Searching the spectrum for d peaks to determine the direction of arrival still requires plenty of processing time when the accuracy requirement is high. This is the drawback of this method, which requires further improvement. Although there is the root-MUSIC [15] algorithm to calculate the DOA without searching the spectrum, a uniformly shaped array is needed. Because the shape of the microphone array on the robot may change with different applications, the root-finding method is not implemented in the proposed platform.
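The estimation procedure of equations (5) to (14) can be sketched in a few lines of NumPy. The sketch below is illustrative only: it assumes a far-field uniform linear array (with the 7 cm spacing used later in section 4), unit-amplitude direction vectors, and synthetic snapshots at two hypothetical frequencies rather than the significant-frequency selection of equation (13).

```python
import numpy as np

def steering_vector(theta_deg, omega, M=6, spacing=0.07, c=343.0):
    """Far-field direction vector A_i(omega) for a uniform linear array."""
    m = np.arange(M)
    tau = m * spacing * np.sin(np.deg2rad(theta_deg)) / c  # per-microphone delays
    return np.exp(-1j * omega * tau)

def music_spectrum(X, omegas, angles, d=1):
    """Average the narrowband MUSIC cost of equation (14) over the
    selected frequencies. X has shape (len(omegas), M, snapshots)."""
    J = np.zeros(len(angles))
    for Xl, wl in zip(X, omegas):
        R = Xl @ Xl.conj().T / Xl.shape[1]      # correlation matrix R_xx
        _, E = np.linalg.eigh(R)                # eigenvalues in ascending order
        En = E[:, :-d]                          # noise subspace E_{d+1}..E_M
        Pn = En @ En.conj().T                   # projection matrix P_N, eq. (9)
        for i, th in enumerate(angles):
            a = steering_vector(th, wl)
            J[i] += np.real(a.conj() @ Pn @ a)
    return 1.0 / (J / len(omegas))              # peaks at the source angles

# toy data: one source at 30 degrees observed at two frequencies
rng = np.random.default_rng(0)
omegas = 2 * np.pi * np.array([800.0, 1200.0])
snapshots = 200
X = []
for wl in omegas:
    a = steering_vector(30.0, wl)[:, None]
    s = rng.standard_normal((1, snapshots)) + 1j * rng.standard_normal((1, snapshots))
    n = 0.1 * (rng.standard_normal((6, snapshots)) + 1j * rng.standard_normal((6, snapshots)))
    X.append(a @ s + n)

angles = np.arange(-90, 91)
spec = music_spectrum(np.array(X), omegas, angles, d=1)
print(angles[np.argmax(spec)])   # estimated DOA, close to 30
```

Averaging the cost over frequencies is what makes the estimator wide-band: each bin contributes its own noise-subspace projection, so a single broadband source reinforces the same angular peak.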
3 SPEECH ENHANCEMENT
3.1 The modified beamformer approach

The approach can be arranged in the following steps.

Step 1 is to pre-record the speech source.

Step 2 is speech activity detection, described in section 3.2.

Step 3 is to adjust the pre-recorded speech source by the beam-steer filter in order to produce the correct reference signals. The DOA information is obtained by the MUSIC algorithm mentioned above. Generally, the MUSIC spectrum contains the directional information of both the speaker and any interference signal during the speech segment. In order to determine the speaker's direction, the MUSIC spectrum is computed contiguously, and the speaker's direction can then be obtained by comparing the spectrums before and after the speech activity is detected. The design of the beam-steer filter is described in section 3.3, and the modified reference signals are denoted as \hat{r}_1[n], \ldots, \hat{r}_M[n].

In step 4, the weighting matrix of the upper beamformer is modified in the non-speech segments, and the newly updated weighting matrix is passed to the lower beamformer in the speech segments. The LMS method is used here to perform the adaptation in the non-speech segments. If a speech segment is detected, the data flow through the lower beamformer and the output data sequence \hat{y}[n] is produced.

Assume that the order of the weighting vector for each microphone is F. The LMS adaptation algorithm is

w[k+1] = w[k] + \mu (y[k] - y_b[k]) (\hat{r}[k] + \phi[k])    (15)

where

w^T[k] = [w_{11}[k], \ldots, w_{1F}[k], \ldots, w_{M1}[k], \ldots, w_{MF}[k]]
\phi^T[k] = [\phi_1[k], \ldots, \phi_M[k]]
\hat{r}^T[k] = [\hat{r}_1[k], \ldots, \hat{r}_M[k]]
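As an illustration of the update in equation (15), the following sketch runs a single-channel LMS loop in which a filter driven by a reference-plus-noise signal adapts to track a desired signal. The 4-tap order, step size, signal length, and simple delay target are hypothetical choices for the demonstration, not values taken from the paper.

```python
import numpy as np

def lms_update(w, u, y, mu):
    """One step in the form of equation (15): w[k+1] = w[k] + mu * e[k] * u[k],
    where e[k] = y[k] - y_b[k] and u[k] stacks the reference-plus-noise samples."""
    y_b = w @ u                # upper-beamformer output for this snapshot
    e = y - y_b                # error signal
    return w + mu * e * u, y_b

# hypothetical single-channel demo: adapt a 4-tap filter so that the
# beamformer output tracks a delayed copy of the reference signal
rng = np.random.default_rng(1)
F = 4
w = np.zeros(F)
ref = rng.standard_normal(2000)            # pre-recorded reference r^
noise = 0.05 * rng.standard_normal(2000)   # environmental noise phi
u_sig = ref + noise                        # filter input r^[k] + phi[k]
target_delay = 2
for k in range(F, 2000):
    u = u_sig[k - F:k][::-1]               # most recent F input samples
    y = ref[k - target_delay]              # desired signal y[k]
    w, _ = lms_update(w, u, y, mu=0.01)
print(np.round(w, 2))  # weight near 1 at the tap matching the delay
```

In the paper's architecture the same recursion runs only while the speech activity detector reports a non-speech segment, so the weights are calibrated against noise alone before being handed to the lower beamformer.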
3.2 Speech activity detection

Two possible speech detection methods, energy-based and entropy-based [16], can be used. Both are based on the assumption that the noise is stationary or slowly varying in time. The entropy-based method is chosen in this paper because it is able to detect voice activity in a low-SNR environment. Observation of the spectrogram of very noisy speech signals shows that the speech segments are more organized than the noise segments. Because of this fact, Shannon's entropy [17] can be used to measure the organization of the speech signals; it is defined as

H(G) = -\sum_{u=1}^{U} f(g(u)) \log_2 [f(g(u))]    (16)

where f(g(u)) is the probability density function of a speech signal of symbol u. The concept of entropy applied to speech activity detection is based on the assumption that the signal is more organized in speech segments than in non-speech segments. The measure of entropy is redefined in the spectral domain as

H(|G(\omega, z)|^2) = -\sum_{l=1}^{L} \frac{|G(\omega_l, z)|^2}{\sum_{l=1}^{L} |G(\omega_l, z)|^2} \log \left[ \frac{|G(\omega_l, z)|^2}{\sum_{l=1}^{L} |G(\omega_l, z)|^2} \right]    (17)

where z denotes the zth frame and

|G(z)|^2 = [|G(\omega_1, z)|^2, \ldots, |G(\omega_L, z)|^2]^T

is the magnitude spectrum for frame z. When the input is white noise, H(|G(\omega, z)|^2) is maximized and the maximum value is log(L). On the other hand, H(|G(\omega, z)|^2) is minimized when the input is a pure tone, and the minimum value is zero. The dynamic range of H(|G(\omega, z)|^2) is thus bounded between 0 and log(L), and the entropy of the non-speech segments should be larger than that of the speech segments.

Figure 2 shows the waveform for the utterance 'nine three eight' (in Mandarin) contaminated by white Gaussian noise with a global SNR of −5 dB, the measured entropy distribution, and the detection of the non-speech segments with a fixed threshold of 2.85. The entropy detection shows an acceptable detection of non-speech segments in highly noisy conditions.

Fig. 2 Noisy signal at an SNR of −5 dB in white Gaussian noise for 'nine three eight', measured entropy distribution, and the detection of non-speech segments with a fixed threshold of 2.85
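The entropy measure of equation (17) and the fixed-threshold decision can be sketched as follows. The frame length, FFT size, use of the natural logarithm, and the synthetic noise and tone inputs are illustrative assumptions; the paper's threshold of 2.85 is kept for the decision rule.

```python
import numpy as np

def spectral_entropy(frame, nfft=256):
    """Entropy of the normalized magnitude spectrum, as in equation (17)."""
    G2 = np.abs(np.fft.rfft(frame, nfft)) ** 2
    p = G2 / (np.sum(G2) + 1e-12)          # probability-like normalization
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def detect_speech(x, frame_len=256, threshold=2.85):
    """Flag frames whose entropy falls below a fixed threshold; organized
    (speech-like) frames have lower entropy than noise-only frames."""
    n = len(x) // frame_len
    H = np.array([spectral_entropy(x[i * frame_len:(i + 1) * frame_len])
                  for i in range(n)])
    return H < threshold, H

rng = np.random.default_rng(2)
noise = rng.standard_normal(4096)                          # noise-only segment
tone = np.sin(2 * np.pi * 440 / 16000 * np.arange(4096))   # organized segment
flags_noise, H_noise = detect_speech(noise)
flags_tone, H_tone = detect_speech(tone)
print(H_noise.mean() > H_tone.mean())  # noise entropy exceeds tone entropy
```

Because the spectrum is normalized frame by frame, the decision is insensitive to the absolute signal level, which is what makes the entropy criterion usable at low SNR.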
3.3 Beam-steer filter

A simple delay-and-sum algorithm is used for the beam-steering filter. To cope with the fractional delay problem, an optimal fractional-delay FIR filter design technique [18] is implemented. Without loss of generality, the signals are assumed to have no frequency components above a\pi rad/s (0 < a < 1), and the optimal estimate \hat{c}(i) obtained through a linear combination of the sample values is

\hat{c}(i) = \sum_{v=0}^{V} h_v c(v)    (18)

with the filter coefficients given by

\begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ \vdots \\ h_V \end{bmatrix}
=
\begin{bmatrix}
K(0,0) & K(0,1) & \cdots & K(0,V) \\
K(1,0) & K(1,1) & \cdots & K(1,V) \\
K(2,0) & K(2,1) & \cdots & K(2,V) \\
\vdots & \vdots & & \vdots \\
K(V,0) & K(V,1) & \cdots & K(V,V)
\end{bmatrix}^{-1}
\begin{bmatrix} K(0,i) \\ K(1,i) \\ K(2,i) \\ \vdots \\ K(V,i) \end{bmatrix}    (19)

where K(\cdot) is the sinc kernel (see the notation in the Appendix).
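A minimal sketch of the fractional-delay design of equations (18) and (19) follows, assuming K is the sinc kernel of signals bandlimited to a\pi rad/s. The filter order V = 3, the band factor a = 0.9, and the 0.4-sample test delay are illustrative choices, not values from the paper.

```python
import numpy as np

def frac_delay_fir(delay, V=3, a=0.9):
    """Fractional-delay FIR taps from equations (18)-(19), with K the
    sinc kernel of signals bandlimited to a*pi rad/s."""
    def K(x):
        return a * np.sinc(a * x)          # sin(a*pi*x) / (pi*x)
    v = np.arange(V + 1)
    Kmat = K(v[:, None] - v[None, :])      # Gram matrix K(v, u)
    kvec = K(v - delay)                    # target correlations K(v, i)
    return np.linalg.solve(Kmat, kvec)

# hypothetical check: delay a sampled sinusoid by 0.4 samples
h = frac_delay_fir(0.4, V=3)
n = np.arange(200)
x = np.sin(2 * np.pi * 0.05 * n)
y = np.convolve(x, h)[:200]               # y[n] approximates x[n - 0.4]
print(np.round(h, 3))
```

Steering the beam to an arbitrary angle then amounts to applying one such filter per microphone with the per-channel fractional delays of the delay-and-sum geometry, which is why only a single pre-recorded source set is needed.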
4 EXPERIMENTAL RESULTS

A uniform linear array of six microphones is constructed for the experiment. A larger spacing between the microphones could achieve a better beamforming result, but the MUSIC algorithm needs a smaller spacing to prevent the spatial aliasing effect in the lower frequency range. Because the frequency range 0–2400 Hz contains the major information of the speech source, the spacing between the microphones is chosen as 7 cm. The amplified microphone signals are sampled by a 16 kHz, 16-bit A/D (analogue-to-digital) card, and the computing platform is a Pentium III 550 MHz PC. The array is mounted on an easel at a height of 1 m and 3 m from the nearest wall. The environment is a 20 m × 15 m room full of office furniture, to simulate a real environment. The interference signals in the experiment are mutually uncorrelated white noise.

Fig. 3 Testing scenario 1: array of six microphones in a noisy environment

The first scenario (Fig. 3) tests the performance under a fixed interference signal and different speech source directions. Loudspeakers are used to produce these signals. The interference signal comes from 60° at a distance of 150 cm. The second scenario (Fig. 4) tests the performance under a fixed speech source and a different number of interference signals. In addition to the proposed algorithm (Fig. 1), the original adaptive beamformer proposed by Dahl and Claesson [8] is also tested for comparison. The results are shown in the following sections.

Table 2 Beamforming result with order 30

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
45              5.7539      22.3684               21.4832
30              5.6336      21.2468               20.2601
15              4.0356      19.4224               19.1934
0               4.3570      20.3941               20.3941
−15             3.5473      21.3124               21.0396
−30             4.5161      23.9333               22.3824
−45             4.0351      21.7139               20.9475
4.1 Scenario 1

4.1.1 DOA result

Table 1 shows the statistics of the estimation results of the proposed DOA algorithm; the input SNR at the different angles can be seen in Table 2. This result is compared with the DOA algorithm that processes all frequencies in the signal bandwidth. Although the proposed algorithm chooses only ten significant frequencies to estimate the power spectrum (as listed in the left half of the table), the statistical result shows that it has better accuracy than the algorithm that processes all frequencies in the signal bandwidth. In Fig. 5, the dotted line and the solid line represent the estimated MUSIC spectrum in the non-speech segment and in the speech segment, respectively. By comparing these two spectrums, the speaker source direction can be determined.

Fig. 5 Customized DOA spectrum
4.1.2 Beamforming result

Tables 2 to 4 show the SNR improvements in the experiments when the filter tap length in the beamformer is 30, 60, and 90, respectively. For the modified algorithm, the beam-steer filter's tap length is 4 (section 3.3). The results show a little degradation of the modified algorithm compared with the original one by Dahl and Claesson. However, the modified algorithm records only one set of the source signal, at 0°. This shows that, with correct DOA information, a simple delay-and-sum beam-steering can simulate the source signal well in different directions for the adaptive algorithm to be effective. However, this does not mean that the delay-and-sum beam-steering captures the spatial characteristics accurately. In other words, performance may be degraded by other uncertainties such as misplacement of the sensors or mismatch in the delay time. Figure 6 shows the time-domain waveforms of the source signal, the interference, and the enhanced results. In general, the SNR can be enhanced to about 19.2–25 dB from about 3.5–5.7 dB. As the filter tap length increases, the SNR improves, as shown in Fig. 7.

Fig. 4 Testing scenario 2: array of six microphones in a noisy environment

Table 1 Customized DOA estimation result

                      Ten significant frequencies selected   All frequencies selected
Correct angle (deg)   Mean        Standard deviation         Mean        Standard deviation
−45                   −43.7619    1.3381                     −43.8571    2.1974
−30                   −30.2381    2.644                      −30.4762    3.0922
−15                   −15         2.4698                     −14.4762    3.4441
0                     2.9524      3.7878                     2.6667      5.0133
15                    14.8095     2.2939                     14.3333     3.3066
30                    29.5238     2.9431                     29.4286     3.0589
45                    43.4762     1.4703                     43.0476     2.4388

Table 3 Beamforming result with order 60

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
45              5.7539      21.3245               21.0223
30              5.6336      22.3814               21.3591
15              4.0356      21.9316               19.3706
0               4.3570      20.5921               20.5921
−15             3.5473      23.0127               21.4250
−30             4.5161      24.5836               22.4966
−45             4.0351      22.9967               22.2750

Table 4 Beamforming result with order 90

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
45              5.7539      22.3891               22.0821
30              5.6336      22.8585               21.4578
15              4.0356      20.9760               19.2551
0               4.3570      21.7993               21.7993
−15             3.5473      22.4586               21.5892
−30             4.5161      25.3235               22.3848
−45             4.0351      22.9700               22.0310

4.2 Scenario 2

4.2.1 DOA result

In this scenario, a speaker source is fixed in one direction, with different interference signals arriving from other directions. As shown in Table 5, the standard deviation of the DOA estimation increases with the number of interference signals. This is because
Fig. 7 Average SNR
Table 5 DOA result in scenario 2

                      Without interference signals   Interference at 60° and −30°   Interference at 60°, −30°, and −60°
Correct angle (deg)   Mean     Standard deviation    Mean     Standard deviation     Mean     Standard deviation
0                     1.45     1.3168                2.8      4.1624                 2.9      6.5042
30                    31.85    1.6944                29.25    5.8658                 28       6.8133
−15                   −14.7    1.7501                −17.7    4.2932                 −18.6    5.0928
increasing the number of interference signals leads to a lower SNR and fewer degrees of freedom in the noise subspace. Although the estimation accuracy decreases in the more complex environment, it still remains within an acceptable range.

4.2.2 Beamforming result

Tables 6 and 7 give the beamforming results with a 60th-order weighting vector applied for each microphone. Compared with Table 3, the modified beamformer still works well as the number of interference signals increases.

Table 6 Beamforming result with noise angles of 60° and −30°

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
30              2.8234      20.0548               18.9461
0               1.2637      17.3820               17.3820
−15             0.4372      17.9555               16.7834

Table 7 Beamforming result with noise angles of 60°, −30°, and −60°

Correct angle   Input SNR   Original beamformer   Modified beamformer
(deg)           (dB)        (dB)                  (dB)
30              −0.2980     16.8331               15.7307
0               −1.8639     14.4653               14.4653
−15             −2.6842     14.8471               13.2040

4.3 Improvement of the MFCC error distance

Besides the noise power reduction, another important point that should be considered is whether the cepstrum feature of the reference signal is changed after processing. The purified signal may be used to perform speech recognition in order to understand voice commands for robots. If the feature of the recorded speech is changed after processing, the proposed beamformer would not be suitable when speech recognition is required. Because the Mel-frequency cepstral coefficient (MFCC) is the most popular feature for speech recognition, minimizing the cepstral error distance would increase the speech recognition rate. The cepstral error distance is defined as

E_c = \sum_{p=1}^{P} \| MFCC_{pure}(p) - MFCC_{comparison}(p) \|_2^2    (20)
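The distance of equation (20) reduces to a summed squared Euclidean norm over frames, as in the sketch below; the MFCC matrices here are random placeholders standing in for features computed from real speech, and the frame and coefficient counts are arbitrary.

```python
import numpy as np

def cepstral_error_distance(mfcc_pure, mfcc_comparison):
    """Equation (20): summed squared Euclidean distance between the
    frame-wise MFCC vectors of the clean and the compared signal."""
    diff = mfcc_pure - mfcc_comparison      # shape (P frames, coefficients)
    return float(np.sum(diff ** 2))

# hypothetical MFCC matrices for P = 3 frames of 13 coefficients each
rng = np.random.default_rng(3)
pure = rng.standard_normal((3, 13))
processed = pure + 0.1 * rng.standard_normal((3, 13))   # lightly perturbed
polluted = pure + 1.0 * rng.standard_normal((3, 13))    # heavily perturbed
print(cepstral_error_distance(pure, processed)
      < cepstral_error_distance(pure, polluted))        # smaller after processing
```

A lower E_c against the clean reference indicates that beamforming has preserved the cepstral feature, which is the property the recognizer depends on.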
Figure 8 shows the MFCC of one frame. The solid line denotes the MFCC of the pre-recorded speech source, which is the ideal situation for speech recognition. In the same environment as scenario 1, the average cepstral error distance increased to 10.699, which means that the cepstrum feature of the reference signal is changed by environmental noise and channel distortion. After the contaminated signal is processed by the proposed beamformer, the average cepstral error distance drops to 0.8941 (solid line), which shows that the influence of the interference is greatly reduced.

Fig. 8 MFCC distance

5 CONCLUSION

A microphone array with a customized wide-band eigenstructure-based DOA estimation algorithm and a modified beamformer is proposed in this paper. The experimental results show that the customized DOA estimation can detect the speaker direction within an acceptable error range. Further, the modified beamformer can also reduce the cepstral distance, overcome the calibration problem caused by the mismatch between microphones, and enhance the SNR. With the beam-steer filter, the extra memory needed to form a beam in an arbitrary direction is greatly decreased, and the number of possible beam directions is unrestricted. The modified beamformer is easy to implement, and the hardware cost is low compared with other robust beamformers.

REFERENCES

1 Chun, G. D. and Caudell, T. P. A model for auditory localization in robotic systems based on the neurobiology of the inferior colliculus and analysis of HRTF data. In Proceedings of the International Joint Conference on Neural Networks (IJCNN '01), 2001.
2 Schauer, C. and Gross, H.-M. Model and application of a binaural 360 degree sound localization system. In International Joint INNS–IEEE Conference on Neural Networks, Washington DC, 14–19 July 2001.
3 Frost, O. L. An algorithm for linearly constrained adaptive array processing. Proc. IEEE, August 1972, 60(8), 926–935.
4 Griffiths, L. J. and Jim, C. W. An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propagation, January 1982, AP-30, 27–34.
5 Henry, C. Robust adaptive beamforming. IEEE Trans. Acoust., Speech, Signal Processing, October 1987, ASSP-35, 1365–1376.
6 Hoshuyama, O., Sugiyama, A., and Hirano, A. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Trans. Signal Processing, October 1999, 47(10).
7 Gannot, S., Burshtein, D., and Weinstein, E. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Processing, August 2001, 49, 1614–1626.
8 Dahl, M. and Claesson, I. Acoustic noise and echo cancelling with microphone array. IEEE Trans. Vehicular Technol., September 1999, 48(5), 1518–1526.
9 Abdallah, S., Montrésor, S., and Baudry, M. Speech signal detection in noisy environment using a local entropic criterion. In Eurospeech, Rhodes, Greece, September 1997.
10 Knapp, C. H. and Carter, G. C. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Processing, August 1976, ASSP-24(4), 320–327.
11 Brandstein, M. S. and Silverman, H. F. A robust method for speech signal time-delay estimation in reverberant rooms. In ICASSP-97, Vol. 1, April 1997.
12 Hu, J., Su, T. M., Cheng, C. C., Liu, W. H., and Wu, T. I. A self-calibrated speaker tracking system using both audio and video data. In IEEE Conference on Control Applications, September 2002.
13 Hu, J., Cheng, C. C., Liu, W. H., and Su, T. M. A speaker tracking system with distance estimation using microphone array. In IEEE/ASME International Conference on Advanced Manufacturing Technologies and Education, August 2002.
14 Schmidt, R. O. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas and Propagation, AP-34, 276–280.
15 Rao, B. D. and Hari, K. V. S. Performance analysis of root-MUSIC. IEEE Trans. Acoust., Speech, Signal Processing, 1989, ASSP-37, 1939–1949.
16 Junqua, J.-C., Mak, B., and Reaves, B. A robust algorithm for word boundary detection in presence of noise. IEEE Trans. Speech and Audio Processing, July 1994, 2(3), 406–412.
17 Gokhale, D. V. Maximum entropy characterization of some distributions. In Statistical Distributions in Scientific Work (Eds Patil, Kotz, and Ord), 1975, Vol. 3, pp. 299–304 (Reidel, Boston, Massachusetts).
18 Yu, S. H. and Hu, J. Optimal synthesis of a fractional delay FIR filter in a reproducing kernel Hilbert space. IEEE Signal Processing Lett., June 2001, 8(6).

APPENDIX

Notation

a_mk                               amplitude from the kth speech source to the mth microphone
A(\omega_l)                        direction matrix at frequency \omega_l
A_i(\omega_l)                      direction vector at frequency \omega_l
c(v)                               undelayed original signal
\hat{c}(i)                         estimated delayed signal
C_xx(\omega_l)                     source part correlation matrix at frequency \omega_l
d                                  number of sources
D                                  number of significant frequencies
DOA                                direction of arrival
e[n]                               error signal
E_c                                MFCC error distance
E_1(\omega_l), ..., E_M(\omega_l)  eigenvectors of R_xx(\omega_l)
f(.)                               probability density function
G = [g(1), ..., g(U)]              speech signal of U symbols
h_v                                vth component of the beam-steer filter
H(.)                               entropy
HRTF                               head related transfer function
IID                                interaural intensity difference
ITD                                interaural time difference
J(h_i)                             cost function for a DOA estimation at h_i
K(.)                               sinc function
L                                  number of frequency components
LMS                                least mean square
M                                  number of microphones
MFCC_comparison(p)                 MFCC of the polluted or processed signal in the pth frame
MFCC_pure(p)                       MFCC of the original signal in the pth frame
MUSIC                              multiple signals classification
n_1(t), ..., n_M(t)                non-directional noises at microphones 1 to M in the continuous time domain
N_1(\omega_l), ..., N_M(\omega_l)  non-directional noises at microphones 1 to M at frequency \omega_l
N(\omega_l)                        non-directional noise vector in the frequency domain
P                                  number of frames of calculated data
P_N(\omega_l)                      non-directional noise projection matrix at frequency \omega_l
r_1[n], ..., r_M[n]                pre-recorded speech sources at microphones 1 to M in the discrete time domain
\hat{r}_1[n], ..., \hat{r}_M[n]    modified reference signals at microphones 1 to M in the discrete time domain
\hat{r}[k]                         modified reference signal vector at the kth iteration
R_ss(\omega_l)                     source correlation matrix at frequency \omega_l
R_{s_p s_o}(\omega_l)              correlation between source p and source o at frequency \omega_l
R_xx(\omega_l)                     received signal correlation matrix at frequency \omega_l
R_{x_i x_j}(\omega_l)              correlation between received signal i and received signal j at frequency \omega_l
s_1(t), ..., s_d(t)                sources in the continuous time domain
S_1(\omega_l), ..., S_d(\omega_l)  sources at frequency \omega_l
S(\omega_l)                        source vector at frequency \omega_l
SNR                                signal-to-noise ratio
T                                  finite observation interval
U                                  number of symbols
V                                  order of the beam-steer filter
w[k]                               weighting vector at the kth iteration
x_1[n], ..., x_M[n]                received signals at microphones 1 to M in the discrete time domain
x_1(t), ..., x_M(t)                received signals at microphones 1 to M in the continuous time domain
X_1(\omega_l), ..., X_M(\omega_l)  received signals at microphones 1 to M at frequency \omega_l
X(\omega_l)                        received signal vector at frequency \omega_l
y[n]                               desired signal
y_b[n]                             output data signal of the upper beamformer
\hat{y}[n]                         output data signal of the lower beamformer
\tau_mk                            time delay from the kth speech source to the mth microphone
\phi_1[n], ..., \phi_M[n]          environmental noises at microphones 1 to M in the discrete time domain
\phi[k]                            environmental noise vector at the kth iteration
\omega                             frequency value
\omega_c                           central frequency
\omega_l                           lth frequency component
\hat{\omega}_q                     qth significant frequency
\lambda_1(\omega_l), ..., \lambda_M(\omega_l)   eigenvalues of R_xx(\omega_l)
\mu                                step size for the LMS
[.]_q                              qth biggest value