Fast communication
An investigation of time delay estimation in room acoustic environment using magnitude ratio
Jwu-Sheng Hu, Chia-Hsin Yang, Wei-Han Liu
Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan
Article info
Article history:
Received 19 January 2009
Received in revised form 20 May 2009
Accepted 8 June 2009
Available online 16 June 2009
Keywords: Time delay estimation; Magnitude ratio
Abstract
This paper investigates the relation between the nonstationary sound source and the frequency domain magnitude ratio of two microphones based on short-term frequency analysis. The fluctuation level of nonstationary sound sources is modeled by the exponent of polynomials from the concept of moving pole model. According to this model, the sufficient condition for utilizing the fluctuation level and magnitude ratio to estimate the time delay between two microphones is suggested. Simulation results are presented to show the performance of the suggested method.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Time delay estimation (TDE) between two spatially separated sensors provides useful information for many applications such as source localization, beamforming and sonar systems. This technique has been widely used in room acoustic environments for sound source localization and speech enhancement for the past several years. Natural sound sources are usually nonstationary and real environments contain reverberation. However, very little past research on TDE has emphasized the nonstationary nature of sound sources in a reverberant environment.
Among various TDE techniques, the generalized cross correlation (GCC) proposed by Knapp and Carter is the most popular method [1]. In GCC, the time delay is estimated by finding the time lag which maximizes the cross correlation between two filtered versions of the received signals. The GCC method performs fairly well in non-reverberant environments; however, its performance is limited in reverberant conditions [2]. Some algorithms [3,4] have been proposed to improve the performance of the GCC method in the presence of reverberation or when the interference is directional. Chen et al. proposed a multichannel TDE algorithm based on the multichannel cross correlation coefficient (MCCC) [5,6]. This method uses more than two microphones and takes advantage of the redundancy. They also found that the robustness to noise and reverberation improves as the number of microphones increases.
Unlike the cross correlation based methods, Benesty [7] proposed an adaptive eigenvalue decomposition algorithm for TDE. This method focuses directly on identifying the impulse responses between the sound source and the microphones in order to estimate the time delay. In this method, the eigenvector corresponding to the minimum eigenvalue of the correlation matrix of the received signal contains the impulse response information. The time delay is determined by finding the direct paths from the two estimated impulse responses.
For estimating the time delay between two microphones, the phase difference between the microphones is an intuitive cue. The magnitude ratio is less reliable than the phase difference for TDE because of its ambiguity problem [8], especially if the source signal is nonstationary. Hence, most work on TDE focuses on processing the phase information, and very little research estimates the time delay using only the magnitude ratio. The work in [8] utilizes magnitude ratio information to estimate the sound source location, and multiple microphone pairs are employed to resolve the magnitude ratio ambiguity. It therefore seems unlikely that the time delay can be estimated using only the magnitude ratio with two microphones. This paper identifies a relation between the magnitude ratio and a nonstationary sound source that can be used to estimate the time delay, and presents a preliminary investigation into the possibility of using the magnitude ratio for TDE. The idea of the moving pole model [9] is employed to model the nonstationary sound source, and an acoustic room model is used to simulate the reverberant environment. It is shown that the time delay can be obtained by estimating the slope between the magnitude ratio and the source fluctuation level parameter using the least-squares method. The performance of the proposed algorithm is evaluated by simulation, and the estimation error is also discussed.
Signal Processing. Journal homepage: www.elsevier.com/locate/sigpro
0165-1684/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2009.06.009
Corresponding author. Lab 905, Engineering Building No. 5, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan. Tel.: +886 3 5712121x54424; fax: +886 3 5715998.
E-mail addresses: jshu@cn.nctu.edu.tw (J.-S. Hu), chyang.ece92g@nctu.edu.tw (C.-H. Yang), lukeliu.ece89g@nctu.edu.tw (W.-H. Liu).
2. Nonstationary sound source time delay estimation using magnitude ratio
Before describing the investigation of TDE method using magnitude ratio, the ambiguity problem of TDE using magnitude ratio in a free space environment is presented in Section 2.1.
2.1. Magnitude ratio formulation and ambiguity problem
Consider a sound source and two microphones in a generic free space environment. According to the acoustic inverse-square law, the signal received by the i-th microphone can be expressed as
$$y_i(n) = \frac{s(n)}{d_i} \tag{1}$$
where n denotes the discrete time index, s(n) represents the input sound signal and $d_i$ is the distance from the sound source to the i-th microphone. Thus, the energy received by the i-th microphone can be obtained by summing the squared signal over the discrete time interval $[0, N_e]$:
$$E_i = \sum_{n=0}^{N_e} y_i^2(n) = \frac{1}{d_i^2} \sum_{n=0}^{N_e} s^2(n) \tag{2}$$
Eq. (2) means that the received energy decreases as the inverse of the square of the distance to the source. It gives a simple relationship between the first and the second microphone:
$$E_1 d_1^2 = E_2 d_2^2 \tag{3}$$
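Eq. (3) implies that the energy ratio depends only on the distance ratio $d_2/d_1$, so two source positions with the same distance ratio cannot be distinguished by energy alone. The following minimal sketch illustrates this numerically. The microphone coordinates, the two test points, the speed of sound of 343 m/s and the function name `energy_ratio_and_tdoa` are assumptions chosen for illustration and are not taken from the paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for illustration

def energy_ratio_and_tdoa(src, mic1, mic2, c=SPEED_OF_SOUND):
    """Return (E1/E2, time difference of arrival in seconds) in free space."""
    d1 = math.dist(src, mic1)
    d2 = math.dist(src, mic2)
    # Eq. (3): E1 * d1^2 = E2 * d2^2, hence E1 / E2 = (d2 / d1)^2.
    ratio = (d2 / d1) ** 2
    tdoa = (d1 - d2) / c
    return ratio, tdoa

# Assumed microphone positions, 1 m apart on the x-axis.
MIC1, MIC2 = (-0.5, 0.0), (0.5, 0.0)

# Two hypothetical source points that both satisfy d2 / d1 = 2.
POINT_A = (-1.0 / 6.0, 0.0)
POINT_B = (-5.0 / 6.0, 2.0 / 3.0)

ratio_a, tdoa_a = energy_ratio_and_tdoa(POINT_A, MIC1, MIC2)
ratio_b, tdoa_b = energy_ratio_and_tdoa(POINT_B, MIC1, MIC2)
print(ratio_a, ratio_b)   # same energy ratio
print(tdoa_a, tdoa_b)     # different time delays
```

Both points give an energy ratio of approximately 4, yet their time delays differ, which is the ambiguity discussed in this section.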
Let (x, y) and $(x_i, y_i)$ be the coordinates of the sound source and the i-th microphone. Then $d_i^2 = (x - x_i)^2 + (y - y_i)^2$. It was shown in [8] that Eq. (3) can be written as the equation of a circle when $E_1 \neq E_2$:
$$\left(x - \frac{c_x}{c_e}\right)^2 + \left(y - \frac{c_y}{c_e}\right)^2 = \frac{E_1 E_2 d_{12}^2}{c_e^2} \tag{4}$$
where $c_e = E_1 - E_2$, $c_x = E_1 x_1 - E_2 x_2$, $c_y = E_1 y_1 - E_2 y_2$ and $d_{12}^2 = (x_1 - x_2)^2 + (y_1 - y_2)^2$.
According to Eq. (4), the sound source is constrained to lie on a circle centered at $(c_x/c_e, c_y/c_e)$ with a radius of $d_{12}\sqrt{E_1 E_2}/|c_e|$. When $E_1 = E_2$, Eq. (3) can instead be written as
$$2 c_x x + 2 c_y y = E_1 (x_1^2 + y_1^2) - E_2 (x_2^2 + y_2^2) \tag{5}$$
This equation states that the sound source is located on the line passing through the mid-point of the two microphones and perpendicular to the segment joining them. Let us define the magnitude ratio as $\Delta_E = \sqrt{E_1/E_2}$. According to Eqs. (4) and (5), the relation between the log value of the magnitude ratio ($10 \log \Delta_E$) and the sound source location is shown in Fig. 1. The two microphones are located at (−0.5, 0) and (0.5, 0). As can be seen, the magnitude ratio can be used to judge whether the sound source is to the left or to the right of the microphone pair, but there is an ambiguity problem for TDE using the magnitude ratio. For example, points A and B in Fig. 1 have the same magnitude ratio but clearly have different time delays. Hence, with only two microphones, the magnitude ratio may not be sufficient to estimate the time delay even in a free space environment, and this is why little work estimates the time delay between two microphones using only the magnitude ratio. From a different point of view, this paper investigates the relation between a nonstationary sound source and the magnitude ratio, and some important findings regarding the application of the magnitude ratio to time delay calculation are presented in the next section.
2.2. Proposed time delay estimation method
Now, consider a sound source situated within a reverberant environment. The received signal at the i-th microphone can be expressed as
$$y_i(n) = h_i(n) * s(n) = \sum_{l=0}^{L-1} h_{i,l}\, s(n-l), \quad h_{i,l} \ge 0 \tag{6}$$
where $h_i(n) = \sum_{l=0}^{L-1} h_{i,l}\, \delta(n-l)$ is the room impulse response (RIR) with length L between the sound source and the i-th microphone, and $h_{i,l}$ are the coefficients of the finite impulse response (FIR) model of the RIR. Without loss of generality, the stationary input signal is assumed to be a complex exponential signal with frequency $\hat{\omega}_k$ and constant amplitude A:
$$s(n) = A e^{j\hat{\omega}_k n} \tag{7}$$
where $\hat{\omega}_k = 2\pi k/N$ represents the sampled frequency of an N-point short-time Fourier transform (STFT) and k is an integer between 0 and N/2 − 1. To analyze the relation between the magnitude ratio and a nonstationary sound source, a parameterized model of the nonstationary sound source is needed. Based on the studies of modeling nonstationary sound sources in [9], a nonstationary sound source in an analysis window can be expressed as a sum of moving pole models. In this work, the idea of approximating the source signal amplitude as the exponential of a polynomial is utilized [9]. Hence, for the nonstationary sound signal, the constant A in Eq. (7) is replaced by the time-varying amplitude $A_n$, which can be expressed as
$$A_n = e^{\sum_{t=0}^{N_a} a_t (n/f_s)^t} \tag{8}$$
where $N_a$ is the degree of the polynomial, $a_t$ are the coefficients of the polynomial and $f_s$ denotes the sampling rate. To simplify the analysis, we leave out the terms with $t \ge 2$; therefore, $A_n$ is modeled as
$$A_n = e^{a_0 + (n/f_s) a_1} \tag{9}$$
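The source model of Eqs. (7) and (9) can be sketched in a few lines. The function name `moving_pole_tone` and the parameter values below (for example k = 32, which corresponds to 500 Hz for N = 1024 and a 16 kHz sampling rate) are assumptions chosen for illustration. The check at the end recovers $a_1$ as the slope of the natural logarithm of the amplitude with respect to time in seconds.

```python
import numpy as np

def moving_pole_tone(a0, a1, k, n_fft, fs, num_samples):
    """Complex exponential at bin k with amplitude A_n = exp(a0 + (n/fs) * a1)."""
    n = np.arange(num_samples)
    amplitude = np.exp(a0 + (n / fs) * a1)       # Eq. (9)
    omega_k = 2.0 * np.pi * k / n_fft            # sampled STFT frequency
    return amplitude * np.exp(1j * omega_k * n)  # Eq. (7) with A replaced by A_n

FS = 16000.0
s = moving_pole_tone(a0=0.0, a1=30.0, k=32, n_fft=1024, fs=FS, num_samples=2048)

# ln|s(n)| is a straight line in time whose slope (per second) equals a1.
log_amp = np.log(np.abs(s))
slope_per_second = (log_amp[-1] - log_amp[0]) / ((len(s) - 1) / FS)
print(slope_per_second)
```

A positive $a_1$ therefore describes a source whose envelope grows over the analysis window, and a larger $a_1$ means a faster growth.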
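Therefore, for the defined sound source, the signal received by the i-th microphone can be represented as
$$y_i(n) = \sum_{l=0}^{L-1} h_{i,l}\, s(n-l) = \sum_{l=0}^{L-1} h_{i,l} A_{n-l}\, e^{j\hat{\omega}_k (n-l)} = \sum_{l=0}^{L-1} A_{n-l}\, h_{i,l}\, e^{-j\hat{\omega}_k l}\, e^{j\hat{\omega}_k n} \tag{10}$$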
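Take the STFT at frequency $\hat{\omega}_k$:
$$Y_i(n,\hat{\omega}_k) = \sum_{t=0}^{N-1} y_i(n+t)\, e^{-j\hat{\omega}_k (n+t)} = \sum_{t=0}^{N-1} \sum_{l=0}^{L-1} A_{n+t-l}\, h_{i,l}\, e^{-j\hat{\omega}_k l}\, e^{j\hat{\omega}_k (n+t)}\, e^{-j\hat{\omega}_k (n+t)} = \sum_{t=0}^{N-1} \sum_{l=0}^{L-1} A_{n+t-l}\, h_{i,l}\, e^{-j\hat{\omega}_k l} \tag{11}$$
Substituting $A_n = e^{a_0 + (n/f_s) a_1}$ into Eq. (11), $Y_i(n,\hat{\omega}_k)$ can be rewritten as
$$\begin{aligned}
Y_i(n,\hat{\omega}_k) ={}& \left[ e^{a_0 + (n/f_s) a_1} + e^{a_0 + ((n+1)/f_s) a_1} + \cdots + e^{a_0 + ((n+N-1)/f_s) a_1} \right] h_{i,0}\, e^{-j\hat{\omega}_k \cdot 0} \\
&+ \left[ e^{a_0 + ((n-1)/f_s) a_1} + e^{a_0 + (n/f_s) a_1} + \cdots + e^{a_0 + ((n+N-2)/f_s) a_1} \right] h_{i,1}\, e^{-j\hat{\omega}_k \cdot 1} \\
&\;\;\vdots \\
&+ \left[ e^{a_0 + ((n-(L-1))/f_s) a_1} + e^{a_0 + ((n-(L-2))/f_s) a_1} + \cdots + e^{a_0 + ((n+(N-L))/f_s) a_1} \right] h_{i,L-1}\, e^{-j\hat{\omega}_k (L-1)}
\end{aligned} \tag{12}$$
Eq. (12) can be rearranged as
$$\begin{aligned}
Y_i(n,\hat{\omega}_k) ={}& e^{a_0 + (n/f_s) a_1} \left[ 1 + e^{(1/f_s) a_1} + \cdots + e^{((N-1)/f_s) a_1} \right] h_{i,0}\, e^{-j\hat{\omega}_k \cdot 0} \\
&+ e^{-(a_1/f_s)}\, e^{a_0 + (n/f_s) a_1} \left[ 1 + e^{(1/f_s) a_1} + \cdots + e^{((N-1)/f_s) a_1} \right] h_{i,1}\, e^{-j\hat{\omega}_k \cdot 1} \\
&\;\;\vdots \\
&+ e^{-(L-1)(a_1/f_s)}\, e^{a_0 + (n/f_s) a_1} \left[ 1 + e^{(1/f_s) a_1} + \cdots + e^{((N-1)/f_s) a_1} \right] h_{i,L-1}\, e^{-j\hat{\omega}_k (L-1)} \\
={}& e^{a_0 + (n/f_s) a_1}\, \frac{1 - e^{N (a_1/f_s)}}{1 - e^{(a_1/f_s)}} \left( \sum_{l=0}^{L-1} e^{-(a_1/f_s) l}\, h_{i,l}\, e^{-j\hat{\omega}_k l} \right)
\end{aligned} \tag{13}$$
Therefore, the natural logarithm of the magnitude ratio between the two microphones is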
$$M(n,\hat{\omega}_k) = \ln \left| \frac{Y_1(n,\hat{\omega}_k)}{Y_2(n,\hat{\omega}_k)} \right| = \ln \left| \frac{\sum_{l=0}^{L-1} e^{-(a_1/f_s) l}\, h_{1,l}\, e^{-j\hat{\omega}_k l}}{\sum_{l=0}^{L-1} e^{-(a_1/f_s) l}\, h_{2,l}\, e^{-j\hat{\omega}_k l}} \right| \tag{14}$$
By observing Eq. (14), we find that the value of the magnitude ratio depends on the coefficients $h_{i,l}$ of the room impulse response models and on the value of $a_1$, which is the slope of the natural logarithm of $A_n$. This result shows that the magnitude ratio between the two microphones is still influenced by the reverberation in the room. However, the term $e^{-(a_1/f_s) l}$ in Eq. (14) decreases with increasing l when $a_1$ is positive. This means that the reflections in the channel model are weighted less and the influence of the direct path becomes more significant. Notice that the numerator and the denominator of Eq. (14) are each a linear combination of L vectors in the complex plane. The direction of each vector is determined by the frequency $\hat{\omega}_k$ and l, and its magnitude is controlled by $e^{-(a_1/f_s) l} h_{i,l}$. Since the values $e^{-(a_1/f_s) l}$ and $h_{i,l}$ decrease with increasing l, the direct path vector, $e^{-(a_1/f_s) l_{i,D1}}\, h_{i,l_{i,D1}}\, e^{-j\hat{\omega}_k l_{i,D1}}$, is less influenced by the reflection vectors ($e^{-(a_1/f_s) l_{i,Dm}}\, h_{i,l_{i,Dm}}\, e^{-j\hat{\omega}_k l_{i,Dm}}$, $m \ge 2$). Hence, when $a_1$ is positive, $M(n,\hat{\omega}_k)$ can be approximated by
$$M(n,\hat{\omega}_k) \approx \ln \left| \frac{e^{-(a_1/f_s) l_{1,D1}}\, h_{1,l_{1,D1}}\, e^{-j\hat{\omega}_k l_{1,D1}}}{e^{-(a_1/f_s) l_{2,D1}}\, h_{2,l_{2,D1}}\, e^{-j\hat{\omega}_k l_{2,D1}}} \right| = \ln \frac{e^{-(a_1/f_s) l_{1,D1}}\, h_{1,l_{1,D1}}}{e^{-(a_1/f_s) l_{2,D1}}\, h_{2,l_{2,D1}}} = -\frac{(l_{1,D1} - l_{2,D1})}{f_s}\, a_1 + \ln \frac{h_{1,l_{1,D1}}}{h_{2,l_{2,D1}}} \tag{15}$$
where $l_{1,D1}$ and $l_{2,D1}$ denote the propagation delays, in samples, of the direct paths from the sound source to the microphones. Consequently, the relation between the natural logarithm of the magnitude ratio and $a_1$ is approximately linear with a slope of $-(l_{1,D1} - l_{2,D1})/f_s$, where $l_{1,D1} - l_{2,D1}$ is the time delay in samples between the microphones. Estimating the time delay between the microphones is therefore equivalent to estimating the slope of the linear relation between $M(n,\hat{\omega}_k)$ and $a_1$.
In summary, to estimate the time delay between two microphones, a set of sound sources with T values of $a_1$ is emitted. The T values of $a_1$ are chosen by us.
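Therefore, we choose positive $a_1$ to suppress the influence of the reflections in Eq. (14):
$$\begin{bmatrix} \hat{M}(n_{a_1(1)},\hat{\omega}_k) \\ \vdots \\ \hat{M}(n_{a_1(T)},\hat{\omega}_k) \end{bmatrix} = \begin{bmatrix} a_1(1) & 1 \\ \vdots & \vdots \\ a_1(T) & 1 \end{bmatrix} \begin{bmatrix} -\dfrac{(l_{1,D1} - l_{2,D1})}{f_s} \\ \ln \dfrac{h_{1,l_{1,D1}}}{h_{2,l_{2,D1}}} \end{bmatrix} \tag{16}$$
where $a_1(t)$, $t = 1, \ldots, T$, denotes the set of $a_1$ values and $\hat{M}(n_{a_1(t)},\hat{\omega}_k)$, $t = 1, \ldots, T$, denotes the magnitude ratio obtained with $a_1(t)$. To simplify the expression, we let $D = l_{1,D1} - l_{2,D1}$,
$$\hat{Y} = \begin{bmatrix} \hat{M}(n_{a_1(1)},\hat{\omega}_k) \\ \vdots \\ \hat{M}(n_{a_1(T)},\hat{\omega}_k) \end{bmatrix} \quad \text{and} \quad X = \begin{bmatrix} a_1(1) & 1 \\ \vdots & \vdots \\ a_1(T) & 1 \end{bmatrix}$$
Finally, the time delay in samples, D, can be estimated by the least-squares method:
$$\hat{D} = \begin{bmatrix} -f_s & 0 \end{bmatrix} (X^T X)^{-1} X^T \hat{Y} \tag{17}$$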
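The estimator of Eqs. (16) and (17) can be sketched end to end in the time domain. The example below is a minimal illustration under stated assumptions and is not the authors' simulation code: the two impulse responses are hypothetical single-tap (non-reverberant) filters with direct-path delays of 20 and 23 samples, $a_0$ is set to zero, and the names `log_magnitude_ratio` and `estimate_delay` are invented here. `numpy.linalg.lstsq` is used in place of the explicit normal equations and yields the same least-squares solution.

```python
import numpy as np

def log_magnitude_ratio(a1, h1, h2, k, n_fft, fs):
    """ln|Y1/Y2| at STFT bin k for a moving-pole source with slope a1."""
    start = max(len(h1), len(h2))      # skip the start-up transient of the FIR filters
    total = start + n_fft
    n = np.arange(total)
    omega_k = 2.0 * np.pi * k / n_fft
    # Source of Eqs. (7) and (9) with a0 = 0 (a0 cancels in the ratio).
    s = np.exp((n / fs) * a1) * np.exp(1j * omega_k * n)
    y1 = np.convolve(s, h1)[:total]    # Eq. (6), microphone 1
    y2 = np.convolve(s, h2)[:total]    # Eq. (6), microphone 2
    t = np.arange(start, total)        # N-sample analysis window
    basis = np.exp(-1j * omega_k * t)
    Y1 = np.sum(y1[t] * basis)         # Eq. (11)
    Y2 = np.sum(y2[t] * basis)
    return np.log(np.abs(Y1) / np.abs(Y2))  # Eq. (14)

def estimate_delay(a1_values, h1, h2, k, n_fft, fs):
    """Least-squares delay estimate in samples, following Eq. (17)."""
    a1_values = np.asarray(a1_values, dtype=float)
    Y_hat = np.array([log_magnitude_ratio(a, h1, h2, k, n_fft, fs)
                      for a in a1_values])
    X = np.column_stack([a1_values, np.ones_like(a1_values)])
    theta, *_ = np.linalg.lstsq(X, Y_hat, rcond=None)  # [slope, intercept]
    return -fs * theta[0]              # D_hat = [-fs 0] * theta

fs, n_fft, k = 16000.0, 1024, 32
h1 = np.zeros(64); h1[20] = 0.8        # hypothetical direct path, microphone 1
h2 = np.zeros(64); h2[23] = 0.6        # hypothetical direct path, microphone 2
D_hat = estimate_delay(np.arange(26, 36), h1, h2, k, n_fft, fs)
print(D_hat)                           # close to l1 - l2 = -3
```

With single-tap responses Eq. (15) holds exactly, so the fitted slope reproduces the delay of −3 samples up to rounding. Adding reflection taps to `h1` or `h2` makes Eq. (15) only approximate and biases the estimate.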
2.3. Estimation error analysis
Eq. (14) can be approximated by Eq. (15) because $e^{-(a_1/f_s) l}$ and $h_{i,l}$ decrease as l increases. However, a delay estimation error occurs when the reflections are strong. The delay estimation error can be defined as
$$\begin{aligned}
D - \hat{D} &= \begin{bmatrix} -f_s & 0 \end{bmatrix} (X^T X)^{-1} X^T (Y - \hat{Y}) \\
&= \begin{bmatrix} -f_s & 0 \end{bmatrix} (X^T X)^{-1} X^T \begin{bmatrix} C(a_1(1)) - \ln \left| \dfrac{\sum_{l=0}^{L-1} e^{-(a_1(1)/f_s) l}\, h_{1,l}\, e^{-j\hat{\omega}_k l}}{\sum_{l=0}^{L-1} e^{-(a_1(1)/f_s) l}\, h_{2,l}\, e^{-j\hat{\omega}_k l}} \right| \\ \vdots \\ C(a_1(T)) - \ln \left| \dfrac{\sum_{l=0}^{L-1} e^{-(a_1(T)/f_s) l}\, h_{1,l}\, e^{-j\hat{\omega}_k l}}{\sum_{l=0}^{L-1} e^{-(a_1(T)/f_s) l}\, h_{2,l}\, e^{-j\hat{\omega}_k l}} \right| \end{bmatrix} \\
&= \frac{1}{\Delta} \left[ -T f_s a_1(1) + f_s \sum_{i=1}^{T} a_1(i),\ \ldots,\ -T f_s a_1(T) + f_s \sum_{i=1}^{T} a_1(i) \right] \begin{bmatrix} C(a_1(1)) - P(a_1(1)) + Q(a_1(1)) \\ \vdots \\ C(a_1(T)) - P(a_1(T)) + Q(a_1(T)) \end{bmatrix}
\end{aligned} \tag{18}$$
where
$$\Delta = T \sum_{i=1}^{T} a_1(i)^2 - \left( \sum_{i=1}^{T} a_1(i) \right)^2$$
$$C(a_1(i)) = -\frac{(l_{1,D1} - l_{2,D1})}{f_s}\, a_1(i) + \ln \frac{h_{1,l_{1,D1}}}{h_{2,l_{2,D1}}}$$
$$P(a_1(i)) = \frac{1}{2} \ln \left[ \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{1,l} \cos(\hat{\omega}_k l) \right)^2 + \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{1,l} \sin(\hat{\omega}_k l) \right)^2 \right]$$
$$Q(a_1(i)) = \frac{1}{2} \ln \left[ \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{2,l} \cos(\hat{\omega}_k l) \right)^2 + \left( \sum_{l=0}^{L-1} e^{-(a_1(i)/f_s) l}\, h_{2,l} \sin(\hat{\omega}_k l) \right)^2 \right]$$
Eq. (18) can be considered as a function of $\hat{\omega}_k$, and only the term $\hat{Y}$ depends on $\hat{\omega}_k$. Hence, for a fixed room impulse response, Eq. (18) is a combination of constant values, cosine signals and sine signals as the frequency varies. This means that the delay estimation error varies with $\hat{\omega}_k$: different frequencies cause different estimation errors when the impulse response is unchanged. Moreover, it is easy to see that the estimation error oscillates with frequency. A strongly reflective environment causes a larger oscillation amplitude.
3. Simulation results
This section provides simulation results to assess the capability of the time delay estimation using the magnitude ratio proposed in this paper. In these simulations, the image method [10] is adopted to model the room impulse response, and the reflection coefficient is varied between 0 and 1. The sampling rate is 16 kHz. To test the proposed approach carefully, the source signal is a synthetic signal with known parameters ($a_1$ and $\hat{\omega}_k$). Ten values of $a_1$ are selected ($a_1$ = 26, 27, …, 35), the corresponding source signals are used to generate the microphone signals, and the STFT size is 1024. The room size is 10 m × 6 m × 3.6 m with different reflection coefficients, and two microphones with 10 cm spacing are located at (5, 2, 1.2) and (5.1, 2, 1.2). Three experiments are carried out in this section, and one performance index, the root mean square error (RMSE), is defined below to evaluate the performance of the suggested method:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N_T} \sum_{i=1}^{N_T} (\hat{D}_i - D_i)^2} \tag{19}$$
where $N_T$ is the total number of estimations, $\hat{D}_i$ is the i-th time delay estimate and $D_i$ is the i-th true delay, expressed as an integer number of samples. The smaller the RMSE, the better the estimator.
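Eq. (19) translates directly into code. The following short sketch uses only the Python standard library; the function name `rmse` and the sample values are illustrative assumptions.

```python
import math

def rmse(estimated_delays, true_delays):
    """Root mean square error of Eq. (19), in samples."""
    if len(estimated_delays) != len(true_delays) or not true_delays:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(d_hat - d) ** 2 for d_hat, d in zip(estimated_delays, true_delays)]
    return math.sqrt(sum(squared) / len(true_delays))

# Two hypothetical trials: one exact estimate and one that is off by two samples.
example = rmse([3.0, 1.0], [3, 3])
print(example)
```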
3.1. Reverberant environment
The first experiment is performed in a reverberant environment, and the sound source is placed at a distance of 60 cm from the mid-point of the two microphones in several directions. For each test, the source frequency is chosen from the range 100–1000 Hz. Noise is absent in this experiment, and the room reverberation time $T_{60}$ is computed by Sabine's formula. Fig. 2 illustrates the RMSE as a function of the reverberation time. The total number of estimations $N_T$ is 300. As can be seen from Fig. 2, in the non-reverberant environment ($T_{60}$ = 0 s), the proposed method can accurately identify the time delay. This is because, when the environment is non-reverberant, the impulse response contains only one nonzero coefficient, $h_{i,l_{i,D1}}$. Hence, Eq. (14) is exactly equal to Eq. (15). Estimation error appears as $T_{60}$ increases. This can be explained by the fact that strong reflection vectors influence the magnitude of the direct path vector and cause the approximation error of Eq. (15). Fig. 2 also shows that the proposed method has a small RMSE in slightly reverberant environments.
3.2. Noisy and non-reverberant environment
In this section, we evaluate the performance of the proposed algorithm in a non-reverberant but noisy environment. White noise is properly scaled and added to each microphone signal to control the signal-to-noise ratio (SNR). The total number of estimations $N_T$ is 300 and the source frequency range is 100–1000 Hz. Fig. 3 presents the RMSE with respect to varying SNR. The result shows that the RMSE decreases as the SNR increases. The RMSE is less than 1 even at low SNR. Comparing Fig. 2 with Fig. 3, it can further be noticed that the proposed method is significantly affected by the reverberation time and is relatively insensitive to noise. In addition, noise created by a speech source is also tested. As can be seen from Fig. 3, the nonstationary noise affects the performance more seriously than the stationary noise.
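Scaling white noise to a target SNR, as described above, can be sketched as follows. This is an illustrative assumption about how such scaling is commonly done, not the authors' code: the noise is taken to be Gaussian, the signal is real-valued, and the function name `add_white_noise` is invented here.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB."""
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that signal_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

rng = np.random.default_rng(0)
fs = 16000
clean = np.sin(2.0 * np.pi * 500.0 * np.arange(fs) / fs)   # 1 s test tone at 500 Hz
noisy = add_white_noise(clean, snr_db=10.0, rng=rng)

measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(measured_snr_db)
```

Because the gain is computed from the empirical noise power of the same realization, the measured SNR matches the requested value up to rounding.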
3.3. Estimation error versus frequency analysis
The RMSE results of Sections 3.1 and 3.2 are statistical results over different source frequencies. In fact, different source frequencies lead to different estimation errors under a fixed impulse response. This section analyzes the relation between the estimation error and the source frequency. The estimation error is defined in Eq. (18) and the simulation result is depicted in Fig. 4. The source is located at (4.7, 2.52, 1.2). As can be seen, the estimation error remains at zero for all frequencies when $T_{60}$ = 0 s. This is expected since Eq. (18) becomes frequency independent when the environment has no reverberation. However, the estimation error starts to oscillate with frequency when $T_{60}$ > 0 s, because the magnitude ratio components are combinations of exponential signals. The oscillation amplitude grows as the reverberation time increases. Fig. 4 also demonstrates that, for a fixed impulse response, there exist some frequencies at which the estimation error is zero.
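The frequency dependence of the error can be reproduced qualitatively with a toy example. The sketch below evaluates Eq. (14) directly on hypothetical impulse responses, fits the slope as in Eq. (17), and returns $D - \hat{D}$ for each frequency bin. The tap positions and gains are assumptions chosen for illustration (a single reflection per channel), not impulse responses generated by the image method, and the name `delay_error` is invented here.

```python
import numpy as np

def delay_error(h1, h2, true_delay, k, n_fft, fs, a1_values):
    """D - D_hat, with D_hat fitted to Eq. (14) evaluated on the given RIRs."""
    a1_values = np.asarray(a1_values, dtype=float)
    l = np.arange(len(h1))             # h1 and h2 are assumed to have equal length
    omega_k = 2.0 * np.pi * k / n_fft
    Y_hat = np.empty(len(a1_values))
    for idx, a1 in enumerate(a1_values):
        w = np.exp(-(a1 / fs) * l) * np.exp(-1j * omega_k * l)
        Y_hat[idx] = np.log(np.abs(np.sum(w * h1)) / np.abs(np.sum(w * h2)))
    X = np.column_stack([a1_values, np.ones_like(a1_values)])
    theta, *_ = np.linalg.lstsq(X, Y_hat, rcond=None)
    return true_delay - (-fs * theta[0])

fs, n_fft = 16000.0, 1024
a1_values = np.arange(26, 36)
bins = np.arange(7, 65)                # roughly 109 Hz to 1000 Hz

dry1 = np.zeros(128); dry1[20] = 0.8   # direct paths only
dry2 = np.zeros(128); dry2[23] = 0.6
wet1 = dry1.copy(); wet1[60] = 0.3     # one hypothetical reflection per channel
wet2 = dry2.copy(); wet2[75] = 0.2

err_dry = np.array([delay_error(dry1, dry2, -3, k, n_fft, fs, a1_values) for k in bins])
err_wet = np.array([delay_error(wet1, wet2, -3, k, n_fft, fs, a1_values) for k in bins])
print(np.max(np.abs(err_dry)))
print(err_wet.min(), err_wet.max())
```

Without reflections the error stays at zero for every bin, whereas a single reflection per channel already makes the error swing with frequency, in line with the behaviour described for Fig. 4.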
In summary, the simulation results show that the proposed method can estimate the time delay exactly using only two microphones and magnitude ratio information in a non-reverberant environment, but the performance degrades when reverberation is present. This paper presents a preliminary investigation into the possibility of using the magnitude ratio for TDE, and the sound source is required to follow the moving pole model with known parameters ($a_1$ and $\hat{\omega}_k$). To apply the proposed method to real nonstationary sound sources (such as speech), or to make it more robust to reverberant environments, more complex models may be incorporated. This is left as a topic for further research.
4. Conclusion
This paper investigates the relation between a nonstationary sound source and the magnitude ratio when the STFT is utilized. From the investigation, a method which can be used to estimate the time delay is suggested. In this method, the time delay is obtained by estimating the slope between the magnitude ratio and a parameter of the moving pole model of the nonstationary sound source. The performance of the proposed method under different reverberation conditions and SNRs is presented by simulation, and the relation between the performance and the source signal frequency is also discussed.
Fig. 3. RMSE versus SNR.
References
[1] C.H. Knapp, G.C. Carter, The generalized correlation method for estimation of time delay, IEEE Trans. Acoust. Speech Signal Process. 24 (1976) 320–327.
[2] B. Champagne, S. Bedard, A. Stephenne, Performance of time-delay estimation in the presence of room reverberation, IEEE Trans. Speech Audio Process. 4 (2) (1996) 148–152.
[3] M. Brandstein, H. Silverman, A robust method for speech signal time-delay estimation in reverberant rooms, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1997, pp. 375–378.
[4] C. Nikias, R. Pan, Time delay estimation in unknown Gaussian spatially correlated noise, IEEE Trans. Acoust. Speech Signal Process. 36 (1988) 1706–1714.
[5] J. Chen, J. Benesty, Y. Huang, Robust time delay estimation exploiting redundancy among multiple microphones, IEEE Trans. Speech Audio Process. 11 (2003) 549–557.
[6] J. Benesty, Y. Huang, J. Chen, Time delay estimation via minimum entropy, IEEE Signal Process. Lett. 14 (2007) 157–160.
[7] J. Benesty, Adaptive eigenvalue decomposition algorithm for passive acoustic source localization, J. Acoust. Soc. Am. 107 (1) (2000) 384–391.
[8] S.T. Birchfield, R. Gangishetty, Acoustic localization by interaural level difference, IEEE Int. Conf. Acoust. Speech Signal Process. 4 (2005) 1109–1112.
[9] F. Casacuberta, E. Vidal, A nonstationary model for the analysis of transient speech signals, IEEE Trans. Acoust. Speech Signal Process. 35 (2) (1987) 226–228.
[10] J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am. 65 (1979) 943–950.