Chapter 1 Introduction
1.5 Organization of this Dissertation
This dissertation is organized as follows. This chapter provides a brief introduction to general sound source localization algorithms, including methods that follow the human auditory system and methods based on microphone arrays. Moreover, this chapter discusses the main contributions and the organization of this dissertation. In the next chapter, the binaural room distribution pattern and the related GMBRDM are introduced. In Chapter 3, a sound-based robot location and orientation detection system is proposed and discussed. The related experiments are presented in Chapter 4. Chapter 5 gives concluding remarks and avenues for future research.
Chapter 2
Nonstationary Sound Source Localization Using Binaural Room Distribution Pattern
2.1 Introduction
As discussed in the first chapter, localizing a nonstationary sound source in a reverberant environment is complicated by the temporal fluctuation of interaural cues. Recent research on sound source localization has revealed the importance of the temporal fluctuation of IPDs and ILDs. Rather than eliminating the influence of these fluctuations, these studies attempted to describe them using statistical models for sound source localization [56]. The work in [48] investigated the temporal fluctuation exhibited by the IPD and ILD localization cues when sound sources are nonstationary and short-term frequency analysis, such as the short-time Fourier transform (STFT), is utilized. In [48], distribution patterns of IPDs and ILDs were calculated from the superposition of sound sources recorded in an anechoic
patterns were applied to estimate the azimuth and elevation of a sound source using Bayesian maximum a posteriori estimation. The experiments demonstrated good results in both quiet and noisy conditions. Further, Smaragdis and Boufounos [49] also used a Gaussian model to model the empirical features in a reverberant room. In their work, the fluctuation of the relative magnitude and phase of the cross spectra is modeled for sound source localization, and the wrapping effect is handled using the proposed wrapped Gaussian model.
This study attempts to probe further the cause of the IPD and ILD distribution patterns that arise when a sound source is nonstationary and the STFT is utilized. To simplify the description, the distribution patterns of IPDs and ILDs are called binaural room distribution patterns (BRDPs) in the remainder of this work. The moving pole model is employed to model nonstationary sound sources; consequently, the level fluctuation is modeled as the exponential of a polynomial. Based on this model, it can be shown that BRDPs depend on the content of the nonstationary source signals. This dependency is analyzed to explain the phenomenon of multiple peaks in the BRDPs.
In a real environment, more than one peak can exist in the measured BRDPs. For example, Fig. 2-1 illustrates the IPDs and ILDs measured at the location marked “A” in Fig. 2-2.
Figure 2-1 The histograms of IPDs and ILDs measured at the location marked “A”. (a) The histogram of IPDs at the location marked “A”. (b) The histogram of ILDs at the location marked “A”.
Figure 2-2 The recording environment.
As shown in Fig. 2-1, the IPD and ILD histograms contain multiple peaks. This phenomenon can be explained with the proposed model.
Since the BRDPs can contain multiple peaks, a modeling method that deals with complicated distribution patterns is needed. Although the work in [48] utilized normalized histograms to model distribution patterns, the memory requirement is considerable when histogram resolution is high. Therefore, this work adopts GMMs to model BRDPs and proposes a GMBRDM to parameterize them. Because the proposed GMBRDM is a linear combination of the phase difference GMM and the magnitude ratio GMM, a method is proposed to obtain the optimal weights of the linear combination to enhance the localization ability. Additionally, because BRDPs contain information on direct paths and reflections, localizing a sound source in the azimuth, elevation and distance using the proposed GMBRDM is possible.
The remainder of this chapter is organized as follows. The next section discusses how the nonstationary sound source can influence the IPD and ILD. A simulation of a simplified environment is performed to verify the discussion in Section 2.3. Section 2.4 presents the formulation of the proposed GMBRDM. The summary is given in Section 2.5.
2.2 The Relation between the Nonstationary Sound Source and the BRDP
2.2.1 IPDs and ILDs of Stationary Sound Source
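A linear time-invariant (LTI) room acoustic channel is represented by a K-tap finite impulse response (FIR) model
$$y[n] = \sum_{k=0}^{K-1} b_k\, x[n-k]$$ (2-1)
where $x[n]$ is the signal emitted by the sound source, $y[n]$ is the signal received by the ear, and $b_k$ are the coefficients of the FIR model for the room impulse response (RIR) from the sound source to an ear. Without loss of generality, the stationary input signal is assumed to be a complex exponential signal with frequency $\hat{\omega}$ and constant level $A$:
$$x[n] = A e^{j\varphi} e^{j\hat{\omega} n}$$ (2-2)
where $\hat{\omega}$ is a sampled frequency of an N-point STFT.
For such an input, the corresponding output is
$$y[n] = \sum_{k=0}^{K-1} b_k A e^{j\varphi} e^{j\hat{\omega}(n-k)} = A e^{j\varphi} e^{j\hat{\omega} n} \sum_{k=0}^{K-1} b_k e^{-j\hat{\omega} k}$$ (2-3)
Taking the N-point STFT at frequency $\hat{\omega}$ gives
$$Y(\hat{\omega}) = \sum_{n=0}^{N-1} y[n] e^{-j\hat{\omega} n} = N A e^{j\varphi} \sum_{k=0}^{K-1} b_k e^{-j\hat{\omega} k}$$ (2-4)
With the subscripts L and R denoting the left and right ears, the IPD and ILD are
$$\mathrm{IPD} = \angle Y_L(\hat{\omega}) - \angle Y_R(\hat{\omega}) = \angle \sum_{k=0}^{K-1} b_{L,k} e^{-j\hat{\omega} k} - \angle \sum_{k=0}^{K-1} b_{R,k} e^{-j\hat{\omega} k}, \qquad \mathrm{ILD} = \ln\left|Y_L(\hat{\omega})\right| - \ln\left|Y_R(\hat{\omega})\right| = \ln\left|\sum_{k=0}^{K-1} b_{L,k} e^{-j\hat{\omega} k}\right| - \ln\left|\sum_{k=0}^{K-1} b_{R,k} e^{-j\hat{\omega} k}\right|$$ (2-5)
where the natural logarithm is taken for computing the magnitude ratio. As shown in (2-5), the IPD and ILD between $Y_L(\hat{\omega})$ and $Y_R(\hat{\omega})$ depend only on the frequency responses of the channels and the measured frequency, as discussed in related research.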
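The level independence of the stationary case can be checked numerically. The following Python sketch is illustrative only: it assumes a rectangular analysis window, N = 512, bin index 9 (which corresponds to 140.625 Hz at a sampling rate of 8000 Hz), and two hypothetical 5-tap impulse responses `b_left` and `b_right` that are not taken from this work. It computes the IPD and ILD from the single-bin STFT of the two received signals for two different source levels and initial phases.

```python
import numpy as np

def stft_bin(y, omega):
    """Single-frequency N-point DFT of one frame y[0..N-1] (rectangular window)."""
    n = np.arange(len(y))
    return np.sum(y * np.exp(-1j * omega * n))

def received(b, level, phi, omega, N):
    """Steady-state FIR output y[n] = sum_k b[k] x[n-k] for a complex exponential input."""
    n = np.arange(N)[:, None]          # frame samples
    k = np.arange(len(b))[None, :]     # FIR taps
    x = level * np.exp(1j * phi) * np.exp(1j * omega * (n - k))
    return x @ b

def ipd_ild(b_left, b_right, level, phi, omega, N):
    yl = stft_bin(received(b_left, level, phi, omega, N), omega)
    yr = stft_bin(received(b_right, level, phi, omega, N), omega)
    ipd = np.angle(yl * np.conj(yr))         # phase difference, wrapped to (-pi, pi]
    ild = np.log(np.abs(yl) / np.abs(yr))    # natural-log magnitude ratio
    return ipd, ild

N = 512                                       # assumed STFT length
omega = 2 * np.pi * 9 / N                     # assumed analysis bin
b_left = np.array([1.0, 0.0, 0.5, 0.0, 0.2])  # hypothetical RIR taps
b_right = np.array([0.0, 0.8, 0.0, 0.3, 0.1]) # hypothetical RIR taps

ref = ipd_ild(b_left, b_right, level=1.0, phi=0.0, omega=omega, N=N)
other = ipd_ild(b_left, b_right, level=7.5, phi=1.2, omega=omega, N=N)
print(ref, other)
```

Under these assumptions the two printed pairs coincide, because the level and initial phase of the source scale both ear signals by the same factor.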
2.2.2 IPDs and ILDs of Nonstationary Sound Source
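Although the nonstationarity of a sound source can be tested in many different domains [50], this work considers only time-domain variation. To model the time-domain variation of a sound source, the level of the complex exponential signal in (2-2) is assumed to be time varying:
$$x[n] = A_n e^{j\varphi} e^{j\hat{\omega} n}$$ (2-6)
where $A_n$ is a time-varying sound level. Accordingly, the output $y[n]$ can be formulated as:
$$y[n] = \sum_{k=0}^{K-1} b_k A_{n-k} e^{j\varphi} e^{j\hat{\omega}(n-k)}$$ (2-7)
Taking the N-point STFT at frequency $\hat{\omega}$ gives
$$Y(\hat{\omega}) = \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} b_k A_{n-k} e^{j\varphi} e^{-j\hat{\omega} k}$$
so that the IPD and ILD become
$$\mathrm{IPD} = \angle \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} b_{L,k} A_{n-k} e^{-j\hat{\omega} k} - \angle \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} b_{R,k} A_{n-k} e^{-j\hat{\omega} k}, \qquad \mathrm{ILD} = \ln\left|\sum_{n=0}^{N-1} \sum_{k=0}^{K-1} b_{L,k} A_{n-k} e^{-j\hat{\omega} k}\right| - \ln\left|\sum_{n=0}^{N-1} \sum_{k=0}^{K-1} b_{R,k} A_{n-k} e^{-j\hat{\omega} k}\right|$$ (2-10)
As shown in (2-10), the phase difference and magnitude ratio become content dependent when the STFT is utilized and $A_n$ is nonstationary.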
2.2.3 Modeling the Nonstationary Sound Source Using Moving Pole Model
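To analyze how the nonstationarity of a sound source influences the IPD and ILD, a parameterized model for nonstationary sound is needed. Based on the discussion in [51] and [52], a nonstationary sound source in an analysis window can be expressed as a sum of moving pole models. In this work, the idea in [52] of approximating $A_n$ as the exponential of a polynomial is utilized. In [52],
$$A_n = \exp\left(\sum_{t=0}^{N} a_t \left(\frac{n}{f_s}\right)^{t}\right)$$ (2-11)
where N is the degree of the polynomial, $a_t$ is the coefficient of the polynomial, and $f_s$ denotes the sampling frequency. To simplify the analysis, we omit the terms with $t \ge 2$, as in [30]; hence, $A_n$ is modeled as:
$$A_n = \exp\left(a_0 + a_1 \frac{n}{f_s}\right)$$ (2-12)
Substituting this model into the STFT of the left-ear signal and rearranging gives:
$$Y_L(\hat{\omega}) = e^{a_0} e^{j\varphi} \left(\sum_{n=0}^{N-1} e^{a_1 n / f_s}\right) \sum_{k=0}^{K-1} b_{L,k}\, e^{-a_1 k / f_s} e^{-j\hat{\omega} k}$$
Through the same procedure, we have
$$Y_R(\hat{\omega}) = e^{a_0} e^{j\varphi} \left(\sum_{n=0}^{N-1} e^{a_1 n / f_s}\right) \sum_{k=0}^{K-1} b_{R,k}\, e^{-a_1 k / f_s} e^{-j\hat{\omega} k}$$
Consequently, the IPD and ILD are
$$\mathrm{IPD} = \angle \sum_{k=0}^{K-1} b_{L,k}\, e^{-a_1 k / f_s} e^{-j\hat{\omega} k} - \angle \sum_{k=0}^{K-1} b_{R,k}\, e^{-a_1 k / f_s} e^{-j\hat{\omega} k}, \qquad \mathrm{ILD} = \ln\left|\sum_{k=0}^{K-1} b_{L,k}\, e^{-a_1 k / f_s} e^{-j\hat{\omega} k}\right| - \ln\left|\sum_{k=0}^{K-1} b_{R,k}\, e^{-a_1 k / f_s} e^{-j\hat{\omega} k}\right|$$ (2-17)
By observing (2-17), this study finds that the IPD and ILD values depend on the coefficients of the FIR models and the value of a1, which is the slope of the natural logarithm of $A_n$.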
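The dependence on a1 can be illustrated with a short numerical check. The sketch below is an assumption-laden illustration rather than the procedure used in this work: it takes a source level of the form exp(a0 + a1 n / fs), a rectangular window, and two hypothetical sparse impulse responses, and compares the IPD and ILD obtained by weighting each FIR tap with exp(-a1 k / fs) against values computed directly from the simulated STFT of the received signals. The function names and tap values are hypothetical.

```python
import numpy as np

def ipd_ild_closed_form(b_left, b_right, a1, omega, fs):
    """IPD/ILD from FIR taps weighted by exp(-a1*k/fs) (taps must have equal length)."""
    k = np.arange(len(b_left))
    w = np.exp(-a1 * k / fs) * np.exp(-1j * omega * k)
    hl, hr = np.sum(b_left * w), np.sum(b_right * w)
    return np.angle(hl * np.conj(hr)), np.log(np.abs(hl) / np.abs(hr))

def ipd_ild_simulated(b_left, b_right, a0, a1, omega, fs, N):
    """IPD/ILD from the single-bin STFT of the simulated ear signals."""
    n = np.arange(N)[:, None]
    k = np.arange(len(b_left))[None, :]
    level = np.exp(a0 + a1 * (n - k) / fs)          # time-varying level A_{n-k}
    x = level * np.exp(1j * omega * (n - k))        # delayed source samples
    e = np.exp(-1j * omega * np.arange(N))          # analysis kernel
    yl = np.sum((x @ b_left) * e)
    yr = np.sum((x @ b_right) * e)
    return np.angle(yl * np.conj(yr)), np.log(np.abs(yl) / np.abs(yr))

fs, N = 8000.0, 512                  # N is an assumed STFT length
omega = 2 * np.pi * 9 / N            # assumed analysis bin
b_left = np.zeros(60)
b_right = np.zeros(60)
b_left[5], b_left[40] = 1.0, 0.6     # hypothetical direct path and reflection
b_right[7], b_right[55] = 0.9, 0.5   # hypothetical direct path and reflection

for a1 in (-200.0, 0.0, 150.0):
    print(a1,
          ipd_ild_closed_form(b_left, b_right, a1, omega, fs),
          ipd_ild_simulated(b_left, b_right, 0.3, a1, omega, fs, N))
```

For each a1 the two computations agree, while different values of a1 give different IPDs and ILDs for the same pair of channels.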
2.3 Simulation Verification and Discussion of Proposed Model
2.3.1 Content Dependency of BRDPs Obtained from Nonstationary Sound Source
To verify the proposed analysis, a simplified simulation environment (Fig. 2-3) is assumed. Although the simplified environment is used as an example here, the following discussion of the relationship between BRDPs and nonstationary sound sources also applies to general cases.
Figure 2-3 Simulation configuration.
As depicted in Fig. 2-3, the only cause of reflection is the infinite wall located at x = 0. The two microphones are located at (x1, y1, z1) = (4.8 m, 0.5 m, 0 m) and (x2, y2, z2) = (5.2 m, 0.5 m, 0 m), and the sound source is located at (x3, y3, z3) = (5 m, 0 m, 0 m). The models from the sound source to the microphones are simulated by the image method introduced in [47] with sound speed c = 340 m/s and sampling rate fs = 8000 Hz. The wall is assumed to be rigid. The simulated model is depicted in Fig. 2-4.
Figure 2-4 Simulated model from sound source to the microphones.
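For a single rigid wall, the image method reduces to one mirrored source per microphone. The following Python sketch builds such a two-tap channel for the geometry described above. It is a simplified illustration, not the implementation of [47]: it assumes that delays are rounded to the nearest sample and that each path is attenuated by the inverse of its length, and the function name `single_wall_rir` is hypothetical.

```python
import numpy as np

C = 340.0      # sound speed in m/s (from the text)
FS = 8000.0    # sampling rate in Hz (from the text)

def single_wall_rir(source, mic, c=C, fs=FS):
    """FIR taps for a direct path plus one reflection from a rigid wall at x = 0.

    Assumptions of this sketch: integer-sample delays and 1/distance attenuation.
    """
    source = np.asarray(source, dtype=float)
    mic = np.asarray(mic, dtype=float)
    image = source * np.array([-1.0, 1.0, 1.0])   # mirror the source across x = 0
    dists = [np.linalg.norm(mic - source), np.linalg.norm(mic - image)]
    delays = [int(round(d / c * fs)) for d in dists]
    b = np.zeros(max(delays) + 1)
    for d, k in zip(dists, delays):
        b[k] += 1.0 / d
    return b, delays

source = (5.0, 0.0, 0.0)
b_1, delays_1 = single_wall_rir(source, (4.8, 0.5, 0.0))
b_2, delays_2 = single_wall_rir(source, (5.2, 0.5, 0.0))
print(delays_1, delays_2)
```

Because the source is equidistant from the two microphones, the direct-path taps coincide, while the reflection reaches the microphone nearer to the wall a few samples earlier.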
Two different sources are input into the simulation model to show the content dependency of the IPD and ILD histograms. For the first source, the value of a1 in the measured frames is uniformly distributed in [−500, 0]. The IPDs and ILDs at a frequency of 140.625 Hz are computed 1000 times. Fig. 2-5 presents the histograms, which represent the probability distributions, of the IPDs and ILDs. The second source is similar to the first, except that the value of a1 is uniformly distributed in [−500, 200] in the measured frames. The histograms are illustrated in Fig. 2-6.
Figure 2-5 The histograms of IPDs and ILDs of the first sound source. (a) The histogram of IPDs of the first sound source. (b) The histogram of ILDs of the first sound source.
Figure 2-6 The histograms of IPDs and ILDs of the second sound source. (a) The histogram of IPDs of the second sound source. (b) The histogram of ILDs of the second sound source.
The simulation results in Figs. 2-5 and 2-6 demonstrate that when the sound sources are nonstationary, the IPD and ILD histograms depend on the content of the source signal. Therefore, conditions on the nonstationary sound source must be established such that the BRDPs can be utilized for localization. In view of the aforementioned discussion, a sufficient condition for a sound source to be applicable for localization is that the distribution of a1 of the sound source is stationary. Care must be exercised when using IPDs and ILDs obtained from nonstationary sound sources for sound source localization to avoid performance degradation.
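The content dependency can be reproduced qualitatively with a few lines of Python. The sketch below is a rough approximation of the experiment, not a replication: the two-tap channels (direct path plus one wall reflection) are hypothetical values derived from the stated geometry under the assumptions of integer-sample delays and 1/distance attenuation, and the IPD is evaluated by weighting each tap with exp(-a1 k / fs). The random seed and histogram resolution are arbitrary choices.

```python
import numpy as np

FS = 8000.0
OMEGA = 2 * np.pi * 140.625 / FS      # analysis frequency in rad/sample

# Hypothetical (delay in samples, gain) pairs: direct path and wall reflection.
LEFT = [(13, 1 / 0.5385), (231, 1 / 9.8127)]
RIGHT = [(13, 1 / 0.5385), (240, 1 / 10.2122)]

def weighted_response(taps, a1):
    """sum_k b_k * exp(-a1*k/fs) * exp(-j*omega*k) over the listed taps."""
    return sum(g * np.exp(-a1 * k / FS) * np.exp(-1j * OMEGA * k) for k, g in taps)

def ipd(a1):
    return np.angle(weighted_response(LEFT, a1) * np.conj(weighted_response(RIGHT, a1)))

rng = np.random.default_rng(0)
a1_first = rng.uniform(-500.0, 0.0, size=1000)      # first source
a1_second = rng.uniform(-500.0, 200.0, size=1000)   # second source
ipd_first = np.array([ipd(a) for a in a1_first])
ipd_second = np.array([ipd(a) for a in a1_second])

bins = np.linspace(-np.pi, np.pi, 73)
hist_first, _ = np.histogram(ipd_first, bins=bins, density=True)
hist_second, _ = np.histogram(ipd_second, bins=bins, density=True)

# Fraction of frames whose IPD lies close to the direct-path value of 0 rad.
print(np.mean(np.abs(ipd_first) < 0.05), np.mean(np.abs(ipd_second) < 0.05))
```

In this sketch only the second source, whose a1 takes positive values, places a substantial share of frames near the direct-path IPD, so the two normalized histograms differ even though the channels are identical.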
2.3.2 The Formation of Peaks in the Distribution Patterns of IPDs
As shown by the simulation in Section 2.3.1, the distribution patterns of IPDs exhibit multiple peaks. This phenomenon also appears in the empirical results obtained in real environments. The result derived in (2-17) can be adopted to explain this phenomenon.
According to (2-17), there are several possible reasons for the formation of peaks in the distribution patterns of IPDs. First, if the a1 of a sound source is concentrated at a certain value, a peak in the histogram will result. An obvious example is a stationary sound source. For a stationary sound source, a1 = 0 for all measured frames, which makes the IPD a fixed value and results in a peak in the distribution pattern.
Secondly, the term $e^{-a_1 k / f_s}$ in (2-17) decreases as k increases when a1 is positive. This means that the weight of the reflection part in the channel model is reduced and the influence of the direct path is increased. Hence, when a1 exceeds a certain
to microphones. Based on (2-18), the phase difference between the direct paths from a sound source to the microphones is emphasized and can dominate the measured IPDs.
Since the IPDs are approximately the same for all a1 exceeding a certain level, a peak can be formed in the distribution pattern. This derivation explains why some previous research on IPD-based time delay estimation suggested utilizing speech onsets to improve the accuracy [24]. On the contrary, when a1 is negative, the value of $e^{-a_1 k / f_s}$ increases with k. In this case, the influence of the direct path is suppressed and the reflections can dominate the measured IPDs.
The second simulation in Section 2.3.1 is utilized to interpret the relationship between a1 and the IPD (Fig. 2-7).
Figure 2-7 Relation between the value of a1 and the IPD.
In Fig. 2-7, when a1 > 100, the IPD approaches 0, which is the phase difference caused by the direct paths from the sound source to the microphones. On the other hand, when a1 < −300, the value converges to 1.1, representing the phase difference influenced by the wall reflection. This explains why there are two peaks at 0 and 1.1 in Fig. 2-6 (a). Generally, reflections appear later in the propagation model than direct paths, meaning that a negative value of a1 is required to emphasize the effect of reflections. Consequently, the more the wall or boundary absorbs the energy of the sound source, the more negative a1 must be to emphasize the effect of reflections.
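The two limiting regimes can be examined with a small sketch. It again uses hypothetical two-tap channels (direct path plus wall reflection, integer-sample delays, rounded gains), so the limiting value it produces for strongly negative a1, roughly 0.99 rad, need not coincide with the value read from Fig. 2-7; only the qualitative behavior is intended to carry over.

```python
import numpy as np

FS = 8000.0
OMEGA = 2 * np.pi * 140.625 / FS

# Hypothetical (delay in samples, gain) pairs: direct path, then wall reflection.
LEFT = [(13, 1.857), (231, 0.1019)]
RIGHT = [(13, 1.857), (240, 0.0979)]

def response(taps, a1):
    """Channel response with each tap weighted by exp(-a1*k/fs)."""
    return sum(g * np.exp(-a1 * k / FS) * np.exp(-1j * OMEGA * k) for k, g in taps)

def ipd(a1):
    return np.angle(response(LEFT, a1) * np.conj(response(RIGHT, a1)))

# Phase differences produced by each path alone.
ipd_direct = OMEGA * (RIGHT[0][0] - LEFT[0][0])
ipd_reflect = OMEGA * (RIGHT[1][0] - LEFT[1][0])

for a1 in (-500, -300, -100, 0, 100, 300):
    print(f"a1 = {a1:5d}  IPD = {ipd(a1):.3f} rad")
```

For large positive a1 the computed IPD approaches `ipd_direct`, and for strongly negative a1 it approaches `ipd_reflect`, with a transition in between.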
2.3.3 The Formation of Peaks in the Distribution Patterns of ILDs
Similar to the discussion of IPDs, when the a1 of a sound source is concentrated at a certain value, a peak results in the ILD distribution pattern. However, ILDs behave quite differently from IPDs when a1 is either large or small. When a1 is larger than a
Therefore, the relationship between ILDs and a1 is approximately linear (with a slope of when a1 is smaller than a certain level, the influence of the direct path is de-emphasized and the reflection part starts dominating the measured ILDs. The second simulation in Section 2.3.1 is again utilized as an example of the discussion above. Figure 2-8 shows the simulation results for the relationship between the value of a1 and the ILD.
Figure 2-8 Relation between the value of a1 and the ILD.
In Fig. 2-8, when a1 > 100, the measured ILD is about 0 because the simulation sets $k_{D,1} - k_{D,2} = 0$ and $b_{L,k_{D,1}} = b_{L,k_{D,2}}$. This results in a peak at 0 in the histogram, as shown in Fig. 2-6 (b). In addition, when a1 < −300, the measured ILDs change linearly with the value of a1, resulting in a flat area in Fig. 2-6 (b).
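The linear regime of the ILD can be checked in the same simplified setting. In the sketch below, which uses hypothetical two-tap channels with integer-sample delays and rounded gains, the ILD for strongly negative a1 is dominated by the reflection taps, so its slope with respect to a1 should be close to the difference of the reflection delays divided by the sampling rate. This slope expression follows from the tap weighting exp(-a1 k / fs) under the stated assumptions and is not quoted from this work.

```python
import numpy as np

FS = 8000.0
OMEGA = 2 * np.pi * 140.625 / FS

# Hypothetical (delay in samples, gain) pairs: direct path, then wall reflection.
LEFT = [(13, 1.857), (231, 0.1019)]
RIGHT = [(13, 1.857), (240, 0.0979)]

def response(taps, a1):
    """Channel response with each tap weighted by exp(-a1*k/fs)."""
    return sum(g * np.exp(-a1 * k / FS) * np.exp(-1j * OMEGA * k) for k, g in taps)

def ild(a1):
    """Natural-log magnitude ratio between the two channels."""
    return np.log(np.abs(response(LEFT, a1)) / np.abs(response(RIGHT, a1)))

# Finite-difference slope in the reflection-dominated regime.
slope = (ild(-400.0) - ild(-500.0)) / 100.0
expected_slope = (RIGHT[1][0] - LEFT[1][0]) / FS

print(f"ILD at a1 = 300: {ild(300.0):.5f}")
print(f"slope = {slope:.6f}, expected = {expected_slope:.6f}")
```

With a1 drawn uniformly, a linear ILD spreads the samples evenly over a range of values, which is consistent with a flat region in the histogram.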
2.3.4 Localization of Nonstationary Sound Source Using BRDPs
As mentioned in Chapter 1, detecting the location of sound sources located on the median plane or on a “cone of confusion” is difficult when only the IPDs and ILDs of direct paths are utilized. However, sound sources at different locations propagate through different reflections, and with the property of nonstationary sound sources discussed above, a nonstationary sound can result in distinguishable distribution
azimuth, elevation, and distance using BRDPs.
2.4 GMBRDM for Nonstationary Sound Source Localization
As discussed in Section 2.3.1, if the environment and head position are unchanged and the distribution of a1 of the sound source is stationary, using BRDPs for sound source localization is possible. Sections 2.3.2 and 2.3.3 also show that BRDPs can be non-Gaussian and contain multiple peaks. Consequently, modeling these distribution patterns with a simple distribution (such as a single Gaussian distribution) can eliminate important details. Utilizing a high-resolution normalized histogram to model the distribution pattern requires a considerable amount of memory. In this work, GMMs are employed to model BRDPs (the resulting model is called the GMBRDM) to reduce the memory requirement through parameterization.
2.4.1 The Training Procedure of the Proposed GMBRDM
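Let $P_x(n_f, \omega_b)$ and $M_x(n_f, \omega_b)$ denote the phase difference and magnitude ratio, respectively, obtained at frame $n_f$ for constructing the GMM at frequency $\omega_b$, $b \in \{1, \ldots, B\}$, which means that B frequencies are utilized to construct the model. The phase difference and magnitude ratio GMMs are defined as the weighted sums of $N_1$ and $N_2$ Gaussian component densities:
$$G\left(\mathbf{P}_x(n_f) \mid \lambda_P\right) = \sum_{i=1}^{N_1} \rho_{P,i}\, g_i\left(\mathbf{P}_x(n_f)\right),$$ (2-20)
$$G\left(\mathbf{M}_x(n_f) \mid \lambda_M\right) = \sum_{i=1}^{N_2} \rho_{M,i}\, g_i\left(\mathbf{M}_x(n_f)\right),$$ (2-21)
where $\mathbf{P}_x(n_f) = \left[P_x(n_f, \omega_1) \cdots P_x(n_f, \omega_B)\right]^T$ and $\mathbf{M}_x(n_f) = \left[M_x(n_f, \omega_1) \cdots M_x(n_f, \omega_B)\right]^T$. $\rho_{P,i}$ and $\rho_{M,i}$ are the weights of the ith mixture, and $g_i\left(\mathbf{P}_x(n_f)\right)$ and $g_i\left(\mathbf{M}_x(n_f)\right)$ are the Gaussian density functions. Notably, the mixture weights must satisfy the constraints:
$$\sum_{i=1}^{N_1} \rho_{P,i} = 1 \quad \text{and} \quad \sum_{i=1}^{N_2} \rho_{M,i} = 1$$ (2-22)
The terms $\lambda_P$ and $\lambda_M$ represent the parameters of the $N_1$ and $N_2$ component densities:
$$\lambda_P = \left\{\boldsymbol{\rho}_P, \boldsymbol{\mu}_P, \boldsymbol{\Sigma}_P\right\} \quad \text{and} \quad \lambda_M = \left\{\boldsymbol{\rho}_M, \boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M\right\}$$ (2-23)
where
$\boldsymbol{\rho}_P = \left[\rho_{P,1} \cdots \rho_{P,N_1}\right]$ denotes the phase difference mixture weight vector with dimensions $1 \times N_1$.
$\boldsymbol{\rho}_M = \left[\rho_{M,1} \cdots \rho_{M,N_2}\right]$ denotes the magnitude ratio mixture weight vector with dimensions $1 \times N_2$.
$\boldsymbol{\mu}_P = \left[\boldsymbol{\mu}_{P,1} \cdots \boldsymbol{\mu}_{P,N_1}\right]$ denotes the phase difference mean matrix with dimensions $B \times N_1$.
$\boldsymbol{\mu}_M = \left[\boldsymbol{\mu}_{M,1} \cdots \boldsymbol{\mu}_{M,N_2}\right]$ denotes the magnitude ratio mean matrix with dimensions $B \times N_2$.
$\boldsymbol{\Sigma}_P = \left[\boldsymbol{\Sigma}_{P,1} \cdots \boldsymbol{\Sigma}_{P,N_1}\right]$ denotes the phase difference covariance matrix with dimensions $B \times BN_1$.
$\boldsymbol{\Sigma}_M = \left[\boldsymbol{\Sigma}_{M,1} \cdots \boldsymbol{\Sigma}_{M,N_2}\right]$ denotes the magnitude ratio covariance matrix with dimensions $B \times BN_2$.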
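The mixture density in (2-20) and the parameter layout listed above can be made concrete with a short Python sketch. It is only an illustration: the function names and the numerical parameter values are hypothetical, and the storage layout (a B × N1 mean matrix and a B × B·N1 matrix of horizontally stacked covariances) simply follows the dimensions stated in the text.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate normal density g_i(x) for a B-dimensional feature vector."""
    B = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** B * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_density(x, weights, means, covs):
    """Weighted sum of Gaussian component densities.

    weights: (N1,), means: (B, N1), covs: (B, B*N1) with covariances stacked side by side.
    """
    assert np.isclose(np.sum(weights), 1.0), "mixture weights must sum to one"
    n_mix = len(weights)
    cov_blocks = np.hsplit(covs, n_mix)      # N1 blocks of size B x B
    return sum(w * gaussian_density(x, m, c)
               for w, m, c in zip(weights, means.T, cov_blocks))

# Hypothetical model with B = 2 frequencies and N1 = 2 mixtures.
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 1.1],
                  [0.0, 0.5]])
covs = np.hstack([0.05 * np.eye(2), 0.1 * np.eye(2)])

density = gmm_density(np.array([0.0, 0.0]), weights, means, covs)
print(density)
```

A model of this kind stores only the weights, means, and covariances, whereas a normalized histogram needs one value per bin.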
The parameters $\lambda_P$ and $\lambda_M$ in (2-23) can be estimated by the iterative EM algorithm [53], which guarantees a monotonic increase in the model's log-likelihood value. Denoting the training sequence length as $N_F$, the iterative procedure can be divided into the expectation step and the maximization step:
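Expectation step:
$$p\left(i \mid \mathbf{P}_x(n_f), \lambda_P\right) = \frac{\rho_{P,i}\, g_i\left(\mathbf{P}_x(n_f)\right)}{\sum_{j=1}^{N_1} \rho_{P,j}\, g_j\left(\mathbf{P}_x(n_f)\right)}$$
Maximization step:
(i). Estimate the mixture weights:
$$\bar{\rho}_{P,i} = \frac{1}{N_F} \sum_{n_f=1}^{N_F} p\left(i \mid \mathbf{P}_x(n_f), \lambda_P\right)$$
(ii). Estimate the mean vector:
$$\bar{\boldsymbol{\mu}}_{P,i} = \frac{\sum_{n_f=1}^{N_F} p\left(i \mid \mathbf{P}_x(n_f), \lambda_P\right) \mathbf{P}_x(n_f)}{\sum_{n_f=1}^{N_F} p\left(i \mid \mathbf{P}_x(n_f), \lambda_P\right)}$$
(iii). Estimate the variances:
$$\bar{\sigma}^2_{P,i}(b) = \frac{\sum_{n_f=1}^{N_F} p\left(i \mid \mathbf{P}_x(n_f), \lambda_P\right) P_x(n_f, \omega_b)^2}{\sum_{n_f=1}^{N_F} p\left(i \mid \mathbf{P}_x(n_f), \lambda_P\right)} - \bar{\mu}_{P,i}(b)^2$$
The parameters of the magnitude ratio GMM are re-estimated in the same way, with $\mathbf{P}_x$, $\lambda_P$, and $N_1$ replaced by $\mathbf{M}_x$, $\lambda_M$, and $N_2$.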
The EM algorithm is sensitive to the choice of initial model. A good choice of initial model results in a lower number of iterations of the EM algorithm. K-means related approaches are known to be effective in finding a suitable initial model [54]. This work utilizes an accelerated K-means algorithm proposed by Elkan [55], which can significantly reduce the computational power requirement.
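The training procedure can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from this work: diagonal covariances, a variance floor, plain Lloyd's K-means for initialization (a stand-in for the accelerated variant of [55]), and synthetic two-cluster training data. All function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means, used here only to obtain initial means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):          # keep the old center if a cluster empties
                centers[i] = X[labels == i].mean(axis=0)
    return centers

def log_gauss_diag(X, means, variances):
    """Log density of each row of X (NF x B) under each diagonal Gaussian (n_mix x B)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)

def em_gmm(X, n_mix, n_iter=30, var_floor=1e-6, seed=0):
    means = kmeans(X, n_mix, seed=seed)
    variances = np.tile(X.var(axis=0), (n_mix, 1)) + var_floor
    weights = np.full(n_mix, 1.0 / n_mix)
    log_liks = []
    for _ in range(n_iter):
        # Expectation step: posterior probability of each mixture for each frame.
        log_joint = np.log(weights) + log_gauss_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        post = np.exp(log_joint - log_norm[:, None])
        log_liks.append(log_norm.sum())
        # Maximization step: re-estimate weights, means, and variances.
        counts = post.sum(axis=0) + 1e-12
        weights = counts / counts.sum()
        means = post.T @ X / counts[:, None]
        variances = np.maximum(post.T @ X**2 / counts[:, None] - means**2, var_floor)
    return weights, means, variances, np.array(log_liks)

# Synthetic training frames with B = 2 features and two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.3, size=(200, 2)),
               rng.normal(2.0, 0.3, size=(200, 2))])
weights, means, variances, log_liks = em_gmm(X, n_mix=2)
print(weights, means, log_liks[-1])
```

The log-likelihood recorded at each iteration should not decrease, which gives a simple sanity check on an implementation of the two steps.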
The proposed GMBRDM at location l is defined as the linear combination of the phase difference GMM and the magnitude ratio GMM:
$$\mathrm{GMBRDM}(l) = \alpha_P\, G\left(\mathbf{P}_x(n_f) \mid \lambda_P(l)\right) + \alpha_M\, G\left(\mathbf{M}_x(n_f) \mid \lambda_M(l)\right)$$ (2-32)
where $\alpha_P$ and $\alpha_M$ represent the weighting factors. The values of $\alpha_P$ and $\alpha_M$ can be chosen based on the sum of the correlation values among the trained locations for the phase difference GMM and the magnitude ratio GMM. The GMM with the higher correlation sum is assigned a lower weight, since its ability to discriminate between locations is considered lower under this circumstance, and vice versa. Under this principle, $\alpha_P$ and $\alpha_M$ are determined by the following formula:
In addition,
The proofs of (2-37) and (2-38) are shown in the appendix.
2.4.2 The Testing Procedure of the Proposed GMBRDM
The location is determined by finding the maximum a posteriori location probability for a given observation sequence:
( ) ( ( ) ) ( ( ) )
difference and magnitude ratio computed from the testing sequences denoted as
location is equally likely for a blind search. Moreover, because the probability densities $p(\mathbf{P}_Y)$ and $p(\mathbf{M}_Y)$ are the same for all location models, the detection rule
2.5 Summary
In this chapter, the relation between the nonstationary sound source and the BRDPs is discussed. Moreover, based on this discussion, a model named GMBRDM is proposed for nonstationary sound source localization. Theoretically, the GMBRDM is capable of localizing a sound source in azimuth, elevation, and distance. The performance of the proposed method is examined in Chapter 4.
Appendix
Proofs of (2-37) and (2-38):
The problem is formulated as
( ) ( )
Setting the first derivative with respect to $\alpha_P$ to zero gives
( ) ( )
−∑
2( ) ( )
= 0
Therefore,
( ) ( )
( ) ( )
( ) ( )
∑
∑
=
=
M p
T M M M M
T P P P P
p M
q q
q q
q q
UC C
UC C
α α1 (A-5)
Chapter 3
Indoor Sound Field Feature Matching for Robot’s Location and Orientation Detection
3.1 Introduction
Indoor robot localization is an important issue in the field of robotics. Various types of equipment, such as cameras, radio frequency identification (RFID), infrared (IR) sensors, ultrasonic sensors, lasers, wireless LAN based methods, and inertial navigation sensors, have been adopted to provide different solutions [57 - 64].
For indoor robots, audio devices such as loudspeakers and microphones are becoming basic equipment. These sound-related devices can generally provide a more natural way for robots to communicate with humans. Additionally, some researchers believe that these devices can be utilized for robot localization [65, 66].
The BRDPs introduced in Chapter 2 are treated as sound field features and this
matching for robot’s location and orientation detection and proposes a robust sound-based indoor robot’s pose detection system utilizing two microphones.
3.1.1 Traditional Sound Based Robot Localization Methods and Known Problems
The idea of using multiple microphones to localize sound sources has a long history. Among the various kinds of sound source localization methods, generalized cross correlation (GCC) based methods [10, 11, 67, 68] have been discussed for robot localization applications [65]. In general, a sound-based robot localization system uses a loudspeaker mounted on the robot to produce sound and estimates the location of the sound source, which is the robot's location, with a microphone array installed in the room [65, 66]. The main difficulty for indoor robot localization using sound waves is their complex propagation behavior, such as reflection and diffraction. Theoretically, the values of the phase difference and magnitude ratio among microphones are directly related to the arrival direction of the sound wave and the distance between a sound source and the microphones. However, these straightforward relations only exist in free space or in environments with simple geometry. In real environments, these values exhibit stochastic phenomena due to the distributed nature of the propagation path dynamics and the limitation of finite-length data, as discussed in Chapter 2. Furthermore, complex boundary conditions, near-field effects, and local sound scattering make these values hard to correlate with the source location. These variations generally result in uncertain estimation errors and make sound-based localization methods unreliable. Moreover, for indoor applications, the robot may move to a location that is non-line-of-sight to the sensors, i.e., without direct paths between the robot and the microphones. Under this circumstance, traditional methods cannot locate the robot accurately.
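For reference, the kind of time delay estimate on which GCC-based methods rely can be sketched as follows. The example uses the PHAT weighting, which is one common choice and is not necessarily the weighting used in the cited works; the function name `gcc_phat` and the synthetic test signal are hypothetical. It assumes a single direct path with a pure integer-sample delay, which is exactly the free-space condition that the preceding discussion identifies as unrealistic indoors.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate how much x2 lags x1 (in seconds) with PHAT-weighted GCC."""
    n = len(x1) + len(x2)                      # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12             # PHAT weighting keeps only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(cc) - max_shift) / fs

fs = 8000.0
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1024)
x2 = np.concatenate((np.zeros(5), x1[:-5]))    # x2 is x1 delayed by 5 samples
tau = gcc_phat(x1, x2, fs)
print(tau)
```

With reflections of comparable strength or without a direct path, the cross-correlation may contain several competing peaks, which is one way the estimation errors described above can arise.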
Another well-known problem of sound-based robot localization methods is the microphone mismatch problem. If the microphones are not mutually matched, then the phase difference information among microphones may be distorted. However, pre-matched microphones are relatively expensive, and mismatched microphones are difficult to calibrate accurately since the characteristics of microphones change with the sound direction. Consequently, the estimation accuracy varies across microphone pairs and is difficult to evaluate.
3.1.2 The Proposed Method
Traditional sound source localization algorithms attempt to suppress the effects of complex propagation behavior, as well as estimate the direction of the direct sound source. Instead of trying to eliminate the influence of reflection and diffraction, this work treats the distribution patterns of phase difference and magnitude ratio as a local