An investigation of time delay estimation in room acoustic environment using magnitude ratio


Fast communication


Jwu-Sheng Hu, Chia-Hsin Yang, Wei-Han Liu

Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan

Article info

Article history:
Received 19 January 2009
Received in revised form 20 May 2009
Accepted 8 June 2009
Available online 16 June 2009

Keywords:
Time delay estimation
Magnitude ratio

Abstract

This paper investigates the relation between a nonstationary sound source and the frequency-domain magnitude ratio of two microphones, based on short-term frequency analysis. The fluctuation level of the nonstationary source is modeled as the exponential of a polynomial, following the concept of the moving pole model. Based on this model, a sufficient condition for using the fluctuation level and the magnitude ratio to estimate the time delay between two microphones is suggested. Simulation results are presented to show the performance of the suggested method.

1. Introduction

Time delay estimation (TDE) between two spatially separated sensors provides useful information for many applications, such as source localization, beamforming, and sonar systems. The technique has been widely used in room acoustic environments for sound source localization and speech enhancement over the past several years. Natural sound sources are usually nonstationary, and real environments contain reverberation. However, very little past research on TDE has emphasized the nonstationary nature of sound sources in a reverberant environment.

Among the various TDE techniques, the generalized cross correlation (GCC) proposed by Knapp and Carter is the most popular method [1]. In GCC, the time delay is estimated by finding the time lag that maximizes the cross correlation between two filtered versions of the received signals. The GCC method performs fairly well in non-reverberant environments; however, its performance is limited in reverberant conditions [2]. Some algorithms [3,4] have been proposed to improve the performance of the GCC method in the presence of reverberation or when the interference is directional. Chen et al. proposed a multichannel TDE algorithm based on the multichannel cross correlation coefficient (MCCC) [5,6]. This method uses more than two microphones and takes advantage of their redundancy. They also found that robustness to noise and reverberation improves as the number of microphones increases.
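The correlation-based principle can be illustrated with a minimal sketch. It implements only the unweighted special case of GCC (plain cross correlation with no prefiltering), not the weighted estimator of [1]. The function name `cross_correlation_delay`, the white-noise test signal, and the 3-sample delay are assumptions made for illustration.

```python
import math
import random

def cross_correlation_delay(x1, x2, max_lag):
    """Return the lag (in samples) that maximizes sum_n x1[n] * x2[n + lag].

    A positive result means x2 is a delayed copy of x1.
    """
    best_lag, best_value = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i in range(len(x1)):
            j = i + lag
            if 0 <= j < len(x2):
                value += x1[i] * x2[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

# Toy check: white noise and a copy delayed by 3 samples, no reverberation.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(512)]
true_delay = 3
mic1 = source
mic2 = [0.0] * true_delay + source[:-true_delay]
estimated_delay = cross_correlation_delay(mic1, mic2, max_lag=10)
print(estimated_delay)
```

In this anechoic toy case the correlation peak sits at the true delay; with reflections added to either signal, secondary peaks appear, which is the limitation noted above.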

Unlike the cross-correlation-based methods, Benesty [7] proposed an adaptive eigenvalue decomposition algorithm for TDE. This method focuses directly on identifying the impulse responses between the sound source and the microphones in order to estimate the time delay. In this method, the eigenvector corresponding to the minimum eigenvalue of the correlation matrix of the received signal contains the impulse response information. The time delay is determined by finding the direct paths in the two estimated impulse responses.

For estimating the time delay between two microphones, the phase difference between the microphones is an intuitive cue. The magnitude ratio is less reliable than the phase difference for TDE because of its ambiguity problem [8], especially if the source signal is nonstationary. Hence, most work on TDE focuses on processing phase information, and very little research estimates the time delay using only the magnitude ratio. The work in [8] utilizes magnitude ratio information to estimate the sound source location, and multiple microphone pairs are employed to resolve the magnitude ratio ambiguity. It therefore seems unlikely that the time delay can be estimated using only the magnitude ratio of two microphones. This paper nevertheless identifies a relation between the magnitude ratio and a nonstationary sound source that can be used to estimate the time delay, and presents a preliminary investigation into the possibility of using the magnitude ratio for TDE. The idea of the moving pole model [9] is employed to model the nonstationary sound source, and an acoustic room model is used to simulate the reverberant environment. It is shown that the time delay can be obtained by estimating, with the least-squares method, the slope of the relation between the magnitude ratio and the source fluctuation level parameter. The performance of the proposed algorithm is evaluated by simulation, and the estimation error is also discussed.

Signal Processing, journal homepage: www.elsevier.com/locate/sigpro

Corresponding author. Lab 905, Engineering Building No. 5, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan. Tel.: +886 3 5712121x54424; fax: +886 3 5715998.

E-mail addresses: jshu@cn.nctu.edu.tw (J.-S. Hu), chyang.ece92g@nctu.edu.tw (C.-H. Yang), lukeliu.ece89g@nctu.edu.tw (W.-H. Liu).

2. Nonstationary sound source time delay estimation using magnitude ratio

Before the TDE method based on the magnitude ratio is described, the ambiguity problem of TDE using the magnitude ratio in a free-space environment is presented in Section 2.1.

2.1. Magnitude ratio formulation and ambiguity problem

Consider a sound source and two microphones in a generic free-space environment. According to the acoustic inverse-square law, the signal received by the i-th microphone can be expressed as

$$y_i(n) = \frac{s(n)}{d_i} \tag{1}$$

where $n$ denotes the discrete time index, $s(n)$ represents the input sound signal, and $d_i$ is the distance from the sound source to the i-th microphone. Thus, the energy received by the i-th microphone can be obtained by summing the squared signal over the discrete time interval from $0$ to $N_e$:

$$E_i = \sum_{n=0}^{N_e} y_i^2(n) = \frac{1}{d_i^2} \sum_{n=0}^{N_e} s^2(n) \tag{2}$$

Eq. (2) means that the received energy decreases as the inverse of the square of the distance to the source. It yields a simple relationship between the first and the second microphone:

$$E_1 d_1^2 = E_2 d_2^2 \tag{3}$$

Let $(x, y)$ and $(x_i, y_i)$ be the coordinates of the sound source and the i-th microphone. Then $d_i^2 = (x - x_i)^2 + (y - y_i)^2$. It was shown in [8] that Eq. (3) can be written as the equation of a circle when $E_1 \neq E_2$:

$$\left(x - \frac{c_x}{c_e}\right)^2 + \left(y - \frac{c_y}{c_e}\right)^2 = \frac{E_1 E_2 d_{12}^2}{c_e^2} \tag{4}$$

where $c_e = E_1 - E_2$, $c_x = E_1 x_1 - E_2 x_2$, $c_y = E_1 y_1 - E_2 y_2$, and $d_{12}^2 = (x_1 - x_2)^2 + (y_1 - y_2)^2$.

According to Eq. (4), the sound source is constrained to lie on a circle centered at $(c_x/c_e,\, c_y/c_e)$ with a radius of $d_{12}\sqrt{E_1 E_2}/|c_e|$. When $E_1 = E_2$, Eq. (3) can instead be written as

$$2 c_x x + 2 c_y y = E_1 (x_1^2 + y_1^2) - E_2 (x_2^2 + y_2^2) \tag{5}$$

Eq. (5) states that the sound source is located on the line that passes through the mid-point of the two microphones and is perpendicular to the segment joining them. Let us define the magnitude ratio as $\Delta E = \sqrt{E_1/E_2}$. According to Eqs. (4) and (5), the relation between the log value of the magnitude ratio ($10 \log \Delta E$) and the sound source location is shown in Fig. 1. The two microphones are located at (−0.5, 0) and (0.5, 0). As can be seen, the magnitude ratio can be used to judge whether the sound source is to the left or to the right of the microphone pair, but there is an ambiguity problem for TDE using the magnitude ratio. For example, points A and B in Fig. 1 have the same magnitude ratio but clearly different time delays. Hence, with only two microphones, the magnitude ratio may not suffice to estimate the time delay even in a free-space environment, which is why few works estimate the time delay between two microphones using only the magnitude ratio. From a different point of view, this paper investigates the relation between a nonstationary sound source and the magnitude ratio, and some important findings regarding the use of the magnitude ratio to calculate the time delay are presented in the next section.

2.2. Proposed time delay estimation method
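Before turning to the reverberant case, the free-space ambiguity of Section 2.1 can be illustrated numerically. The sketch below evaluates the circle of Eq. (4) for one energy ratio and picks two points on it. These two points are hypothetical and are not the points A and B of Fig. 1; the speed of sound (343 m/s) and the energy values are also assumptions.

```python
import math

MIC1 = (-0.5, 0.0)
MIC2 = (0.5, 0.0)
SPEED_OF_SOUND = 343.0  # m/s, assumed value

def equal_ratio_circle(e1, e2, mic1=MIC1, mic2=MIC2):
    """Center and radius of the circle in Eq. (4); requires e1 != e2."""
    c_e = e1 - e2
    c_x = e1 * mic1[0] - e2 * mic2[0]
    c_y = e1 * mic1[1] - e2 * mic2[1]
    d12 = math.dist(mic1, mic2)
    center = (c_x / c_e, c_y / c_e)
    radius = d12 * math.sqrt(e1 * e2) / abs(c_e)
    return center, radius

def magnitude_ratio_and_delay(source, mic1=MIC1, mic2=MIC2):
    """Free-space magnitude ratio sqrt(E1/E2) = d2/d1 and delay (d1 - d2)/c."""
    d1 = math.dist(source, mic1)
    d2 = math.dist(source, mic2)
    return d2 / d1, (d1 - d2) / SPEED_OF_SOUND

# E1/E2 = 4, so every point on the circle has magnitude ratio 2.
center, radius = equal_ratio_circle(e1=4.0, e2=1.0)
point_a = (center[0] + radius, center[1])
point_b = (center[0], center[1] + radius)
ratio_a, delay_a = magnitude_ratio_and_delay(point_a)
ratio_b, delay_b = magnitude_ratio_and_delay(point_b)
print(ratio_a, ratio_b)   # same magnitude ratio
print(delay_a, delay_b)   # different time delays (seconds)
```

Both points give the same magnitude ratio while their delays differ, which is the ambiguity that the single-pair, free-space formulation cannot resolve.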

Now, consider a sound source situated within a reverberant environment. The received signal at the i-th microphone can be expressed as

$$y_i(n) = h_i(n) * s(n) = \sum_{l=0}^{L-1} h_{i,l}\, s(n-l), \quad h_{i,l} \geq 0 \tag{6}$$

where $h_i(n) = \sum_{l=0}^{L-1} h_{i,l}\,\delta(n-l)$ is the room impulse response (RIR) of length $L$ between the sound source and the i-th microphone, and $h_{i,l}$ are the coefficients of the finite impulse response (FIR) model of the RIR. Without loss of generality, the stationary input signal is assumed to be a complex exponential signal with frequency $\hat{\omega}_k$ and constant amplitude $A$:

$$s(n) = A e^{j\hat{\omega}_k n} \tag{7}$$

where $\hat{\omega}_k = 2\pi k/N$ represents the sampled frequency of an N-point short-time Fourier transform (STFT) and $k$ is an integer between $0$ and $N/2 - 1$. To analyze the relation between the magnitude ratio and a nonstationary sound source, a parameterized model of the nonstationary sound source is needed. Based on the studies of nonstationary sound source modeling in [9], a nonstationary sound source in an analysis window can be expressed as a sum of moving pole models. In this work, the idea of approximating the source signal amplitude as the exponential of a polynomial is utilized [9]. Hence, for the nonstationary sound signal, the constant $A$ in Eq. (7) is replaced by a time-varying amplitude $A_n$, which can be expressed as

$$A_n = e^{\sum_{t=0}^{N_a} a_t (n/f_s)^t} \tag{8}$$

where $N_a$ is the degree of the polynomial, $a_t$ are the coefficients of the polynomial, and $f_s$ denotes the sampling rate. To simplify the analysis, we leave out the terms with $t \geq 2$; therefore, $A_n$ is modeled as

$$A_n = e^{a_0 + (n/f_s) a_1} \tag{9}$$
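The source model of Eqs. (7) and (9) can be synthesized directly. The following is a minimal sketch; the function name `moving_pole_source` and the particular values of `k`, `a0` and `a1` are assumptions chosen for illustration (the 16 kHz sampling rate and the 1024-point STFT size match the simulation settings reported later in the paper).

```python
import cmath
import math

def moving_pole_source(num_samples, k, stft_size, a0, a1, fs):
    """s(n) = exp(a0 + a1 * n / fs) * exp(j * omega_k * n), omega_k = 2*pi*k/N."""
    omega_k = 2.0 * math.pi * k / stft_size
    return [
        math.exp(a0 + a1 * n / fs) * cmath.exp(1j * omega_k * n)
        for n in range(num_samples)
    ]

fs = 16000.0
signal = moving_pole_source(num_samples=1024, k=32, stft_size=1024,
                            a0=0.0, a1=30.0, fs=fs)

# The log-amplitude grows linearly in n with slope a1 / fs per sample,
# so multiplying one step of the log-amplitude by fs recovers a1.
log_amp_step = math.log(abs(signal[101])) - math.log(abs(signal[100]))
print(log_amp_step * fs)
```

The parameter $a_1$ is thus the slope of the log-amplitude in time, which is the fluctuation-level parameter that the rest of the derivation relies on.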

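Therefore, for the defined sound source, the sound received by the i-th microphone can be represented as

$$
\begin{aligned}
y_i(n) &= \sum_{l=0}^{L-1} h_{i,l}\, s(n-l) = \sum_{l=0}^{L-1} h_{i,l}\, A_{n-l}\, e^{j\hat{\omega}_k (n-l)} \\
&= \sum_{l=0}^{L-1} A_{n-l}\, h_{i,l}\, e^{-j\hat{\omega}_k l}\, e^{j\hat{\omega}_k n}
\end{aligned}
\tag{10}
$$

Taking the STFT at frequency $\hat{\omega}_k$ gives

$$
\begin{aligned}
Y_i(n,\hat{\omega}_k) &= \sum_{t=0}^{N-1} y_i(n+t)\, e^{-j\hat{\omega}_k (n+t)} \\
&= \sum_{t=0}^{N-1} \sum_{l=0}^{L-1} A_{n+t-l}\, h_{i,l}\, e^{-j\hat{\omega}_k l}\, e^{j\hat{\omega}_k (n+t)}\, e^{-j\hat{\omega}_k (n+t)} \\
&= \sum_{t=0}^{N-1} \sum_{l=0}^{L-1} A_{n+t-l}\, h_{i,l}\, e^{-j\hat{\omega}_k l}
\end{aligned}
\tag{11}
$$

Substituting $A_n = e^{a_0 + (n/f_s) a_1}$ into Eq. (11), $Y_i(n,\hat{\omega}_k)$ can be rewritten as

$$
\begin{aligned}
Y_i(n,\hat{\omega}_k) ={}& \left[ e^{a_0 + (n/f_s) a_1} + e^{a_0 + ((n+1)/f_s) a_1} + \cdots + e^{a_0 + ((n+N-1)/f_s) a_1} \right] h_{i,0}\, e^{-j\hat{\omega}_k 0} \\
&+ \left[ e^{a_0 + ((n-1)/f_s) a_1} + e^{a_0 + (n/f_s) a_1} + \cdots + e^{a_0 + ((n+N-2)/f_s) a_1} \right] h_{i,1}\, e^{-j\hat{\omega}_k 1} \\
&\;\;\vdots \\
&+ \left[ e^{a_0 + ((n-(L-1))/f_s) a_1} + e^{a_0 + ((n-(L-2))/f_s) a_1} + \cdots + e^{a_0 + ((n+(N-L))/f_s) a_1} \right] h_{i,L-1}\, e^{-j\hat{\omega}_k (L-1)}
\end{aligned}
\tag{12}
$$

Eq. (12) can be rearranged as

$$
\begin{aligned}
Y_i(n,\hat{\omega}_k) ={}& e^{a_0 + (n/f_s) a_1} \left[ 1 + e^{(1/f_s) a_1} + \cdots + e^{((N-1)/f_s) a_1} \right] h_{i,0}\, e^{-j\hat{\omega}_k 0} \\
&+ e^{-(a_1/f_s)}\, e^{a_0 + (n/f_s) a_1} \left[ 1 + e^{(1/f_s) a_1} + \cdots + e^{((N-1)/f_s) a_1} \right] h_{i,1}\, e^{-j\hat{\omega}_k 1} \\
&\;\;\vdots \\
&+ e^{-(L-1)(a_1/f_s)}\, e^{a_0 + (n/f_s) a_1} \left[ 1 + e^{(1/f_s) a_1} + \cdots + e^{((N-1)/f_s) a_1} \right] h_{i,L-1}\, e^{-j\hat{\omega}_k (L-1)} \\
={}& e^{a_0 + (n/f_s) a_1}\, \frac{1 - e^{N (a_1/f_s)}}{1 - e^{(a_1/f_s)}} \left( \sum_{l=0}^{L-1} e^{-(a_1/f_s) l}\, h_{i,l}\, e^{-j\hat{\omega}_k l} \right)
\end{aligned}
\tag{13}
$$

Therefore, the natural logarithm of the magnitude ratio between the two microphones is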

$$
M(n,\hat{\omega}_k) = \ln \left| \frac{Y_1(n,\hat{\omega}_k)}{Y_2(n,\hat{\omega}_k)} \right| = \ln \left| \frac{\sum_{l=0}^{L-1} e^{-(a_1/f_s) l}\, h_{1,l}\, e^{-j\hat{\omega}_k l}}{\sum_{l=0}^{L-1} e^{-(a_1/f_s) l}\, h_{2,l}\, e^{-j\hat{\omega}_k l}} \right|
\tag{14}
$$

From Eq. (14), the value of the magnitude ratio depends on the coefficients of the room impulse response models, $h_{i,l}$, and on the value of $a_1$, which is the slope of the natural logarithm of $A_n$. This result shows that the magnitude ratio between the two microphones is still influenced by the reverberation in the room. However, the term $e^{-(a_1/f_s) l}$ in Eq. (14) decreases as $l$ increases when $a_1$ is positive. This means that the reflection part of the channel model is weighted less and the influence of the direct path becomes more significant. Notice that the numerator and the denominator of Eq. (14) are each a linear combination of $L$ vectors. The direction of each vector is determined by the frequency $\hat{\omega}_k$ and by $l$, and its magnitude is controlled by $e^{-(a_1/f_s) l} h_{i,l}$. Since the values of $e^{-(a_1/f_s) l}$ and $h_{i,l}$ decrease as $l$ increases, the direct-path vector, $e^{-(a_1/f_s) l_{i,D1}}\, h_{i,l_{i,D1}}\, e^{-j\hat{\omega}_k l_{i,D1}}$, is less influenced by the reflection vectors ($e^{-(a_1/f_s) l_{i,Dm}}\, h_{i,l_{i,Dm}}\, e^{-j\hat{\omega}_k l_{i,Dm}}$, $m \geq 2$). Hence, when $a_1$ is positive, $M(n,\hat{\omega}_k)$ can be approximated by

$$
\begin{aligned}
M(n,\hat{\omega}_k) &\approx \ln \left| \frac{e^{-(a_1/f_s) l_{1,D1}}\, h_{1,l_{1,D1}}\, e^{-j\hat{\omega}_k l_{1,D1}}}{e^{-(a_1/f_s) l_{2,D1}}\, h_{2,l_{2,D1}}\, e^{-j\hat{\omega}_k l_{2,D1}}} \right| = \ln \frac{e^{-(a_1/f_s) l_{1,D1}}\, h_{1,l_{1,D1}}}{e^{-(a_1/f_s) l_{2,D1}}\, h_{2,l_{2,D1}}} \\
&= -\frac{(l_{1,D1} - l_{2,D1})}{f_s}\, a_1 + \ln \frac{h_{1,l_{1,D1}}}{h_{2,l_{2,D1}}}
\end{aligned}
\tag{15}
$$

where $l_{1,D1}$ and $l_{2,D1}$ denote the propagation delays, in samples, of the direct paths from the sound source to the two microphones. Consequently, the relation between the natural logarithm of the magnitude ratio and $a_1$ is approximately linear with a slope of $-(l_{1,D1} - l_{2,D1})/f_s$, where $l_{1,D1} - l_{2,D1}$ is the time delay in samples between the microphones. Estimating the time delay between the microphones is therefore equivalent to estimating the slope of the linear relation between $M(n,\hat{\omega}_k)$ and $a_1$.
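The relation between the exact expression of Eq. (14) and the direct-path approximation of Eq. (15) can be checked on a toy example. In the sketch below, the impulse responses are hypothetical sparse tap sets written as `{tap index: coefficient}`; they are not generated by a room model, and the function names are assumptions.

```python
import cmath
import math

def weighted_response(rir, a1, fs, omega_k):
    """Sum over l of exp(-(a1/fs) l) * h_l * exp(-j omega_k l), as in Eq. (14)."""
    return sum(
        math.exp(-(a1 / fs) * l) * h * cmath.exp(-1j * omega_k * l)
        for l, h in rir.items()
    )

def log_magnitude_ratio(rir1, rir2, a1, fs, omega_k):
    """Exact M of Eq. (14); the common factor that depends on n cancels."""
    num = abs(weighted_response(rir1, a1, fs, omega_k))
    den = abs(weighted_response(rir2, a1, fs, omega_k))
    return math.log(num / den)

def direct_path_approximation(rir1, rir2, a1, fs):
    """Eq. (15), taking the earliest tap of each RIR as the direct path."""
    l1, l2 = min(rir1), min(rir2)
    return -(l1 - l2) / fs * a1 + math.log(rir1[l1] / rir2[l2])

fs = 16000.0
omega_k = 2.0 * math.pi * 32 / 1024  # bin k = 32 of a 1024-point STFT (500 Hz)
a1 = 30.0

# Hypothetical toy impulse responses.
anechoic_1, anechoic_2 = {28: 0.6}, {32: 0.5}
reverb_1 = {28: 0.6, 150: 0.2, 310: 0.1}
reverb_2 = {32: 0.5, 160: 0.18, 290: 0.12}

exact_anechoic = log_magnitude_ratio(anechoic_1, anechoic_2, a1, fs, omega_k)
approx_anechoic = direct_path_approximation(anechoic_1, anechoic_2, a1, fs)
exact_reverb = log_magnitude_ratio(reverb_1, reverb_2, a1, fs, omega_k)
approx_reverb = direct_path_approximation(reverb_1, reverb_2, a1, fs)
print(exact_anechoic, approx_anechoic)  # identical up to rounding
print(exact_reverb, approx_reverb)      # differ once reflections are present
```

With a single tap per channel the two expressions coincide; adding reflection taps introduces a frequency-dependent mismatch, which is the approximation error discussed in the text.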

In summary, to estimate the time delay between two microphones, a set of sound sources with $T$ values of $a_1$ is emitted. Because the $T$ values of $a_1$ are chosen by us, we choose positive $a_1$ to suppress the influence of the reflection part in Eq. (14):

$$
\begin{bmatrix} \hat{M}(n_{a_1(1)},\hat{\omega}_k) \\ \vdots \\ \hat{M}(n_{a_1(T)},\hat{\omega}_k) \end{bmatrix}
=
\begin{bmatrix} a_1(1) & 1 \\ \vdots & \vdots \\ a_1(T) & 1 \end{bmatrix}
\begin{bmatrix} -\dfrac{(l_{1,D1} - l_{2,D1})}{f_s} \\[2ex] \ln \dfrac{h_{1,l_{1,D1}}}{h_{2,l_{2,D1}}} \end{bmatrix}
\tag{16}
$$

where $a_1(t)$, $t = 1, \ldots, T$, denotes the set of values of $a_1$, and $\hat{M}(n_{a_1(t)},\hat{\omega}_k)$, $t = 1, \ldots, T$, denotes the magnitude ratio obtained with $a_1(t)$. To simplify the expression, we let

$$
l_{1,D1} - l_{2,D1} = D, \qquad
\begin{bmatrix} \hat{M}(n_{a_1(1)},\hat{\omega}_k) \\ \vdots \\ \hat{M}(n_{a_1(T)},\hat{\omega}_k) \end{bmatrix} = \hat{Y}, \qquad
\begin{bmatrix} a_1(1) & 1 \\ \vdots & \vdots \\ a_1(T) & 1 \end{bmatrix} = X
$$

Finally, the time delay $D$ in samples can be estimated by the least-squares method:

$$
\hat{D} = [\,-f_s \;\; 0\,]\,(X^T X)^{-1} X^T \hat{Y}
\tag{17}
$$
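The least-squares step of Eq. (17) amounts to fitting a straight line through the points $(a_1(t), \hat{M})$ and scaling its slope by $-f_s$. The sketch below writes this out in closed form for the two-column matrix $X$. The function name, the delay of −4 samples and the direct-path gain ratio of 1.2 are hypothetical, and the measurements are ideal values generated from Eq. (15) rather than from a room simulation.

```python
import math

def estimate_delay_samples(a1_values, log_ratios, fs):
    """Least-squares fit M = slope * a1 + intercept; returns D_hat = -fs * slope."""
    t = len(a1_values)
    sum_a = sum(a1_values)
    sum_m = sum(log_ratios)
    sum_aa = sum(a * a for a in a1_values)
    sum_am = sum(a * m for a, m in zip(a1_values, log_ratios))
    det = t * sum_aa - sum_a ** 2  # determinant of X^T X
    slope = (t * sum_am - sum_a * sum_m) / det
    return -fs * slope

fs = 16000.0
a1_values = [float(a) for a in range(26, 36)]  # ten values, 26 to 35
true_delay = -4      # hypothetical l1 - l2, in samples
gain_ratio = 1.2     # hypothetical h1 / h2 of the direct paths

# Ideal measurements that follow Eq. (15) exactly (no reverberation, no noise).
log_ratios = [-true_delay / fs * a + math.log(gain_ratio) for a in a1_values]
delay_hat = estimate_delay_samples(a1_values, log_ratios, fs)
print(round(delay_hat, 6))
```

Only the slope is used; the intercept absorbs the log ratio of the direct-path gains, so those gains do not need to be known.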

2.3. Estimation error analysis

Eq. (14) can be approximated by Eq. (15) because $e^{-(a_1/f_s) l}$ and $h_{i,l}$ decrease as $l$ increases. However, a delay estimation error occurs when the reflections are strong. The delay estimation error can be defined as

$$
\begin{aligned}
D - \hat{D} &= [\,-f_s \;\; 0\,]\,(X^T X)^{-1} X^T (Y - \hat{Y}) \\
&= [\,-f_s \;\; 0\,]\,(X^T X)^{-1} X^T
\begin{bmatrix}
C(a_1(1)) - \ln \left| \dfrac{\sum_{l=0}^{L-1} e^{-(a_1(1)/f_s) l}\, h_{1,l}\, e^{-j\hat{\omega}_k l}}{\sum_{l=0}^{L-1} e^{-(a_1(1)/f_s) l}\, h_{2,l}\, e^{-j\hat{\omega}_k l}} \right| \\
\vdots \\
C(a_1(T)) - \ln \left| \dfrac{\sum_{l=0}^{L-1} e^{-(a_1(T)/f_s) l}\, h_{1,l}\, e^{-j\hat{\omega}_k l}}{\sum_{l=0}^{L-1} e^{-(a_1(T)/f_s) l}\, h_{2,l}\, e^{-j\hat{\omega}_k l}} \right|
\end{bmatrix} \\
&= \frac{1}{\Delta} \left[ -T f_s a_1(1) + f_s \sum_{i=1}^{T} a_1(i),\; \ldots,\; -T f_s a_1(T) + f_s \sum_{i=1}^{T} a_1(i) \right]
\begin{bmatrix}
C(a_1(1)) - P(a_1(1)) + Q(a_1(1)) \\
\vdots \\
C(a_1(T)) - P(a_1(T)) + Q(a_1(T))
\end{bmatrix}
\end{aligned}
\tag{18}
$$

where

$$
\Delta = T \sum_{i=1}^{T} a_1(i)^2 - \left( \sum_{i=1}^{T} a_1(i) \right)^2
$$

$$
C(a_1(i)) = -\frac{(l_{1,D1} - l_{2,D1})}{f_s}\, a_1(i) + \ln \frac{h_{1,l_{1,D1}}}{h_{2,l_{2,D1}}}
$$

$$
P(a_1(i)) = \frac{1}{2} \ln \left[ \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{1,l} \cos(\hat{\omega}_k l) \right)^2 + \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{1,l} \sin(\hat{\omega}_k l) \right)^2 \right]
$$

$$
Q(a_1(i)) = \frac{1}{2} \ln \left[ \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{2,l} \cos(\hat{\omega}_k l) \right)^2 + \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{2,l} \sin(\hat{\omega}_k l) \right)^2 \right]
$$

Eq. (18) can be considered as a function of $\hat{\omega}_k$, and only the term $\hat{Y}$ depends on $\hat{\omega}_k$. Hence, for different frequencies, Eq. (18) is a combination of constant values, cosine signals and sine signals under a fixed room impulse response. This means that the delay estimation error varies with $\hat{\omega}_k$: different frequencies cause different estimation errors even when the impulse response is unchanged. Moreover, it is easy to see that the estimation error oscillates with frequency, and a strongly reflective environment causes a larger oscillation amplitude.
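The frequency dependence of the estimation error can be explored with a small numerical sketch. It evaluates $D - \hat{D}$ over a range of STFT bins for two hypothetical sparse impulse responses, one with a single tap per channel and one with an added reflection tap. The tap positions and gains, the bin range and the function names are assumptions; the values of $a_1$, the sampling rate and the STFT size follow the simulation settings given in the paper.

```python
import cmath
import math

def log_ratio(rir1, rir2, a1, fs, omega):
    """M of Eq. (14) for sparse toy RIRs given as {tap index: coefficient}."""
    def magnitude(rir):
        return abs(sum(math.exp(-(a1 / fs) * l) * h * cmath.exp(-1j * omega * l)
                       for l, h in rir.items()))
    return math.log(magnitude(rir1) / magnitude(rir2))

def delay_error(rir1, rir2, a1_values, fs, omega):
    """D - D_hat in samples, with D_hat from the line fit of Eq. (17)."""
    m = [log_ratio(rir1, rir2, a, fs, omega) for a in a1_values]
    t = len(a1_values)
    sum_a, sum_m = sum(a1_values), sum(m)
    sum_aa = sum(a * a for a in a1_values)
    sum_am = sum(a * v for a, v in zip(a1_values, m))
    slope = (t * sum_am - sum_a * sum_m) / (t * sum_aa - sum_a ** 2)
    true_delay = min(rir1) - min(rir2)  # difference of the direct-path taps
    return true_delay - (-fs * slope)

fs, stft_size = 16000.0, 1024
a1_values = [float(a) for a in range(26, 36)]
bins = range(7, 65)  # roughly 100 Hz to 1000 Hz at this bin spacing

# Hypothetical toy impulse responses, not generated by the image method.
anechoic = ({28: 0.6}, {32: 0.5})
reverberant = ({28: 0.6, 150: 0.2}, {32: 0.5, 160: 0.18})

def errors(rirs):
    return [delay_error(rirs[0], rirs[1], a1_values, fs,
                        2.0 * math.pi * k / stft_size) for k in bins]

anechoic_errors = errors(anechoic)
reverberant_errors = errors(reverberant)
print(max(abs(e) for e in anechoic_errors))
print(min(reverberant_errors), max(reverberant_errors))
```

In the single-tap case the error vanishes at every bin up to rounding, while the added reflection makes the error change from bin to bin, consistent with the oscillation described above.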

3. Simulation results

This section provides simulation results to assess the capability of the time delay estimation using the magnitude ratio proposed in this paper. In these simulations, the image method [10] is adopted to model the room impulse response, and the reflection coefficient is varied between 0 and 1. The sampling rate is 16 kHz. To test the proposed approach carefully, the source signal is a synthetic signal with known parameters ($a_1$ and $\hat{\omega}_k$). The values of $a_1$ are selected to be ten values ($a_1 = 26, 27, \ldots, 35$, [...] source signals to generate microphone signals, and the STFT size is 1024. The enclosure is a room of size 10 m × 6 m × 3.6 m with different reflection coefficients, and two microphones with 10 cm spacing are located at (5, 2, 1.2) and (5.1, 2, 1.2). Three experiments are carried out in this section, and one performance index, the root mean square error (RMSE), is defined below to evaluate the performance of the suggested method:

$$
\mathrm{RMSE} = \sqrt{ \frac{1}{N_T} \sum_{i=1}^{N_T} \left( \hat{D}_i - D_i \right)^2 }
\tag{19}
$$

where $N_T$ is the total number of estimations, $\hat{D}_i$ is the i-th time delay estimate, and $D_i$ is the i-th correct delay in samples, which is an integer. The smaller the RMSE, the better the estimator.
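Eq. (19) translates directly into a few lines of code. The sketch below is a plain implementation of the formula; the function name `rmse` and the sample values are made up for illustration and are not results from the paper.

```python
import math

def rmse(estimated_delays, true_delays):
    """Eq. (19): root mean square error over N_T delay estimates (in samples)."""
    if len(estimated_delays) != len(true_delays) or not true_delays:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - d) ** 2 for e, d in zip(estimated_delays, true_delays)]
    return math.sqrt(sum(squared) / len(squared))

# Three made-up estimates against integer true delays: errors are -1, 0 and 2.
example_rmse = rmse([3.0, 5.0, -2.0], [4, 5, -4])
print(example_rmse)  # sqrt(5/3)
```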

3.1. Reverberant environment

The first experiment is performed in a reverberant environment, and the sound source is placed at a distance of 60 cm from the mid-point of the two microphones in several directions. For each test, the source frequency is chosen from the range 100–1000 Hz. Noise is absent in this experiment, and the room reverberation time $T_{60}$ is computed by Sabine's formula. Fig. 2 illustrates the RMSE as a function of the reverberation time. The total number of estimations $N_T$ is 300. As can be seen from Fig. 2, in the non-reverberant environment ($T_{60} = 0$ s), the proposed method identifies the time delay accurately. This is because, when the environment is non-reverberant, the impulse response contains only one nonzero coefficient, $h_{i,l_{i,D1}}$, so Eq. (14) equals Eq. (15) exactly. Estimation error appears as $T_{60}$ increases. This can be explained by the fact that strong reflection vectors influence the magnitude of the direct-path vector and cause the approximation error of Eq. (15). Fig. 2 also shows that the proposed method has a small RMSE in slightly reverberant environments.

3.2. Noisy and non-reverberant environment

In this section, we evaluate the performance of the proposed algorithm in a non-reverberant but noisy environment. White noise is properly scaled and added to each microphone signal to control the signal-to-noise ratio (SNR). The total number of estimations $N_T$ is 300, and the source frequency is in the range 100–1000 Hz. Fig. 3 presents the RMSE with respect to varying SNR. The result shows that the RMSE decreases as the SNR increases. The RMSE is < 1 even at the lower SNRs. Comparing Fig. 2 with Fig. 3, it can further be noticed that the proposed method is significantly affected by the reverberation time and is relatively insensitive to noise. In addition, noise created by a speech source is also considered. As can be seen from Fig. 3, nonstationary noise affects the performance more seriously than stationary noise.
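Scaling white noise to reach a target SNR can be sketched as follows. This is one common way to do it and is an assumption about the procedure, since the paper only states that the noise is "properly scaled". The function names are hypothetical, and a real-valued 500 Hz tone stands in for the microphone signal for simplicity.

```python
import math
import random

def power(samples):
    """Mean squared value of a real-valued sequence."""
    return sum(x * x for x in samples) / len(samples)

def add_white_noise(signal, snr_db, rng):
    """Return (noisy signal, scaled noise) with 10*log10(Ps/Pn) equal to snr_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * v for v in noise]
    return [x + v for x, v in zip(signal, scaled)], scaled

rng = random.Random(1)
fs = 16000.0
tone = [math.cos(2.0 * math.pi * 500.0 * n / fs) for n in range(1024)]
noisy, noise = add_white_noise(tone, snr_db=10.0, rng=rng)
realized_snr_db = 10.0 * math.log10(power(tone) / power(noise))
print(realized_snr_db)  # equals the 10 dB target up to rounding
```

Because the gain is computed from the realized noise power rather than from its nominal variance, the target SNR is met exactly for each generated noise sequence.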

3.3. Estimation error versus frequency analysis

The RMSE results of Sections 3.1 and 3.2 are statistical results over different source frequencies. In fact, different source frequencies lead to different estimation errors under a fixed impulse response. This section analyzes the relation between the estimation error and the source frequency. The estimation error is defined in Eq. (18), and the simulation result is depicted in Fig. 4. The source is located at (4.7, 2.52, 1.2). As can be seen, the estimation error remains at zero for different frequencies when $T_{60} = 0$ s. This is expected, since Eq. (18) becomes frequency independent when the environment has no reverberation. However, the estimation error starts to oscillate with frequency when $T_{60} > 0$ s, because the magnitude ratio components are combinations of exponential signals. The oscillation amplitude becomes larger as the reverberation time increases. Fig. 4 also demonstrates that, for a fixed impulse response, there exist some frequencies at which the estimation error is zero.


In summary, the simulation results show that the proposed method can estimate the time delay exactly using only two microphones and magnitude ratio information in a non-reverberant environment, but the performance degrades when reverberation is present. This paper presents a preliminary investigation into the possibility of using the magnitude ratio for TDE, and the sound source is required to follow the moving pole model with known parameters ($a_1$ and $\hat{\omega}_k$). To apply the proposed method to real nonstationary sound sources (such as speech), or to make it more robust to reverberant environments, more complex models may be incorporated. This is left as a topic for further research.

4. Conclusion

This paper investigates the relation between a nonstationary sound source and the magnitude ratio when the STFT is utilized. From the investigation, a method that can be used to estimate the time delay is suggested. In this method, the time delay is obtained by estimating the slope of the relation between the magnitude ratio and a parameter of the moving pole model of the nonstationary sound source. The performance of the proposed method under different reverberation conditions and SNRs is presented through simulation, and the relation between the performance and the source signal frequency is also discussed.

Fig. 3. RMSE versus SNR.

References

[1] C.H. Knapp, G.C. Carter, The generalized correlation method for estimation of time delay, IEEE Trans. Acoust. Speech Signal Process. 24 (1976) 320–327.

[2] B. Champagne, S. Bedard, A. Stephenne, Performance of time-delay estimation in the presence of room reverberation, IEEE Trans. Speech Audio Process. 4 (2) (1996) 148–152.

[3] M. Brandstein, H. Silverman, A robust method for speech signal time-delay estimation in reverberant rooms, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1997, pp. 375–378.

[4] C. Nikias, R. Pan, Time delay estimation in unknown Gaussian spatially correlated noise, IEEE Trans. Acoust. Speech Signal Process. 36 (1988) 1706–1714.

[5] J. Chen, J. Benesty, Y. Huang, Robust time delay estimation exploiting redundancy among multiple microphones, IEEE Trans. Speech Audio Process. 11 (2003) 549–557.

[6] J. Benesty, Y. Huang, J. Chen, Time delay estimation via minimum entropy, IEEE Signal Process. Lett. 14 (2007) 157–160.

[7] J. Benesty, Adaptive eigenvalue decomposition algorithm for passive acoustic source localization, J. Acoust. Soc. Am. 107 (1) (2000) 384–391.

[8] S.T. Birchfield, R. Gangishetty, Acoustic localization by interaural level difference, IEEE Int. Conf. Acoust. Speech Signal Process. 4 (2005) 1109–1112.

[9] F. Casacuberta, E. Vidal, A nonstationary model for the analysis of transient speech signals, IEEE Trans. Acoust. Speech Signal Process. 35 (2) (1987) 226–228.

[10] J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am. 65 (1978) 943–950.