Chapter 2 Nonstationary Sound Source Localization Using Binaural Room
[...] difference and magnitude ratio computed from the testing sequences, denoted as [...] location is equally likely for a blind search. Moreover, because the probability densities $p(P_Y)$ and $p(M_Y)$ are the same for all location models, the detection rule [...]
2.5 Summary
In this chapter, the relation between the nonstationary sound source and the BRDPs is discussed. Based on this discussion, a model named GMBRDM is proposed for nonstationary sound source localization. Theoretically, the GMBRDM is capable of localizing a sound source in azimuth, elevation, and distance. The performance of the proposed method is examined in Chapter 4.
Appendix
Proofs of (2-37) and (2-38):
The problem is formulated as
[...]
Setting the first derivative with respect to $\alpha_P$ to zero gives
[...]
Therefore,
[...] (A-5)
Chapter 3
Indoor Sound Field Feature Matching for Robot’s Location and Orientation Detection
3.1 Introduction
Indoor robot localization is an important issue in robotics. Various kinds of equipment, such as cameras, radio frequency identification (RFID), infrared (IR) sensors, ultrasonic sensors, lasers, wireless LAN based methods, and inertial navigation sensors, have been adopted to provide different solutions [57 - 64].
For indoor robots, audio devices such as loudspeakers and microphones are becoming basic equipment. These sound-related devices generally provide a more natural way for robots to communicate with humans. Additionally, some researchers believe that these devices can be utilized for robot localization [65, 66].
The BRDPs introduced in Chapter 2 are treated as sound field features, and this chapter applies sound field feature matching to the detection of the robot’s location and orientation and proposes a robust sound-based indoor robot pose detection system that uses two microphones.
3.1.1 Traditional Sound Based Robot Localization Methods and Known Problems
The idea of using multiple microphones to localize sound sources has a long history. Among the various sound source localization methods, generalized cross correlation (GCC) based methods [10, 11, 67, 68] have been discussed for robot localization [65]. In general, a sound-based robot localization system uses a loudspeaker mounted on the robot to produce sound and estimates the location of the sound source, which is the robot’s location, with a microphone array installed in the room [65, 66]. The main difficulty of indoor robot localization using sound waves is complex propagation behavior such as reflection and diffraction. Theoretically, the phase difference and magnitude ratio among microphones are directly related to the arrival direction of the sound wave and the distance between the sound source and the microphones. However, these straightforward relations only exist in free space or in environments with simple geometry. In real environments, these values exhibit stochastic behavior due to the distributed nature of the propagation path dynamics and the limitation of finite-length data, as discussed in Chapter 2. Furthermore, complex boundary conditions, near-field effects, and local sound scattering make these values hard to correlate with the source location. These variations generally result in uncertain estimation errors and make sound-based localization methods unreliable. Moreover, in indoor applications the robot may move to a location that is non-line-of-sight to the sensors, i.e., without direct paths between the robot and the microphones. Under this circumstance, traditional methods cannot locate the robot accurately.
Another well-known problem of sound-based robot localization methods is microphone mismatch. If the microphones are not mutually matched, the phase difference information among microphones may be distorted. However, pre-matched microphones are relatively expensive, and mismatched microphones are difficult to calibrate accurately because the characteristics of microphones change with the sound direction. Consequently, the estimation accuracy varies between microphone pairs and is difficult to evaluate.
3.1.2 The Proposed Method
Traditional sound source localization algorithms attempt to suppress the effects of complex propagation behavior, as well as estimate the direction of the direct sound source. Instead of trying to eliminate the influence of reflection and diffraction, this work treats the distribution patterns of phase difference and magnitude ratio as a local feature and uses it to detect the robot’s location and orientation. As discussed in Chapter 2, the complex propagation behaviors of a sound source result in location or orientation dependent phase difference and magnitude ratio distributions. This work adopts GMMs to model these distributions and proposes two models, robot localization model (RLM) and robot orientation model (ROM). The first model (RLM) is used for robot’s location detection and the second model (ROM) is used for robot’s orientation detection. The unique advantage of the proposed method is the detection of location and orientation in non-line-of-sight cases, i.e., when no direct path is available between the robot and the microphones. To adapt to the environmental
The remainder of this chapter is organized as follows. Section 3.2 introduces the overall system architecture. Section 3.3 describes the design of the directional sound pattern for orientation detection. Section 3.4 presents the formulations of the proposed RLM and ROM. Finally, a summary is given in Section 3.5.
3.2 System Architecture
As shown in Fig. 3-1, the proposed system contains two loudspeakers on the robot and a robot’s location and orientation detection agent (RLODA) with two microphones. The RLODA can be placed anywhere in the room as long as the reception of sound from the robot is clear enough. The sound patterns generated by Speaker 1 (SP1) are received by the RLODA, and the RLMs are obtained from the location-dependent phase difference and magnitude ratio distributions between the two microphones. When the system builds the ROMs, both SP1 and SP2 are used to generate a directional sound pattern; the details are described in Section 3.3. Because the sound pattern generated by SP1 and SP2 is directional, the sound field features change with the robot’s orientation and can be utilized for orientation detection.
Figure 3-1 Speaker and microphone configuration of the proposed system.
Figure 3-2 depicts the overall system architecture. Stage I in Fig. 3-2 is the pre-recording stage, in which the robot moves and changes its orientation while the environment is quiet and produces sound through the loudspeakers to build a pre-recorded database. Since the sound is recorded by the two microphones, the sound field features and the microphone responses are both captured in this database.
Figure 3-2 Overall system architecture.
Once the pre-recording stage is finished, the system enters Stage II, called the silent stage. In this stage, the robot remains silent and the RLODA records the environmental noises. Assuming that the noise signals are additive, the sound recorded in a real application can be considered the linear combination of the robot’s sound and the environmental noises. Therefore, this stage adds the environmental noises to the pre-recorded database to construct the training features, phase difference and magnitude ratio, of the RLMs and ROMs. Through this process, the models are adapted to the environmental noises.
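The silent-stage adaptation and the feature extraction can be sketched as follows. This is a minimal illustration only: the function names, the Hann window, and the 512-sample window with an 80-sample shift (borrowed from the settings reported in Section 4.1.1) are assumptions, not the thesis implementation.

```python
import numpy as np

def stft(x, win_len=512, hop=80):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def adapt_to_noise(prerecorded, noise):
    """Additive-noise assumption: training signal = pre-recorded robot sound + noise.

    Both inputs are two-channel arrays of shape (2, samples).
    """
    n = min(prerecorded.shape[1], noise.shape[1])
    return prerecorded[:, :n] + noise[:, :n]

def sound_field_features(mic1, mic2, bins):
    """Phase difference and magnitude ratio between two microphones at selected bins."""
    s1 = stft(mic1)[:, bins]
    s2 = stft(mic2)[:, bins]
    phase_diff = np.angle(s1 * np.conj(s2))        # wrapped to (-pi, pi]
    mag_ratio = np.abs(s1) / (np.abs(s2) + 1e-12)  # small constant avoids division by zero
    return phase_diff, mag_ratio
```

Each row of the returned arrays is one frame and each column one selected frequency, which is the layout a GMM training routine would typically expect.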
When the robot needs to know its location or orientation, the system switches to the sounding stage, in which the robot emits a sound into the room for the RLODA to detect the robot’s location or orientation. If the robot’s location is required, SP1 is used to generate the sound; if the robot’s orientation is needed, both SP1 and SP2 are excited. Because the same microphones are used in all three stages, the mismatched characteristics between the microphones are captured in the pre-recorded database and do not influence the detection results of the proposed system. The sounding and silent stages can be alternated for location or orientation detection and for adaptation to environmental noises.
Figure 3-3 illustrates the flowchart of the proposed system.
Figure 3-3 Flowchart of the proposed system.
3.3 Directional Sound Pattern Design for Robot Orientation Detection
To detect the robot’s orientation by the sound field features, the sound pattern generated by the robot should be correlated with the robot’s orientation. However, a general omni-directional sound pattern may lead to the same sound fields when the robot changes its orientation because the emitted sound has the same characteristics in all directions. Therefore, a directional sound emission approach must be designed. To realize a directional sound pattern, the idea of speaker array beamforming [69, 70] is adopted in this work to guarantee the directivity of the generated sound pattern.
Besides directivity, another constraint on the generated sound pattern is the number of symmetric axes (β ) in the horizontal plane. Figure 3-4 shows an example of how β affects the orientation detection, where the solid line denotes the generated sound pattern, the dotted line denotes the symmetric axes, and the arrow denotes the robot’s orientation.
As shown in Fig. 3-4, the sound patterns generated when the robot’s orientation is 0°, 90°, 180°, and 270° are exactly the same when $\beta = 4$. In general, a sound pattern generated when the robot points in a certain direction (0° in the example) has $\beta - 1$ identical sound patterns at other orientations.
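The $\beta - 1$ count can be checked numerically with a toy pattern. The sketch below assumes a hypothetical pattern $1 + 0.5\cos(\beta\theta)$, chosen only because it has $\beta$ symmetric axes; it is not the pattern used in the thesis.

```python
import numpy as np

def identical_orientations(beta):
    """Orientations in degrees (excluding 0) whose pattern equals the 0-degree pattern."""
    theta = np.deg2rad(np.arange(360))
    # Hypothetical emission pattern with `beta` symmetric axes, sampled every degree.
    pattern = 1.0 + 0.5 * np.cos(beta * theta)
    # Rotating the robot by phi degrees circularly shifts the pattern by phi samples.
    return [phi for phi in range(1, 360)
            if np.allclose(np.roll(pattern, phi), pattern)]

print(identical_orientations(4))  # [90, 180, 270]
print(identical_orientations(1))  # []
```

With one symmetric axis no other orientation reproduces the same pattern, which is why a single axis is needed for unambiguous orientation detection.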
Figure 3-4 Relations between β and the sound pattern.
Therefore, the generated sound can only be symmetrical along one axis (β =1) to avoid confusion in orientation detection. Consequently, this work proposes a method that utilizes two loudspeakers to generate the sound pattern that conforms to the constraint by:
$$J_{SP1}(n) = J(n), \qquad J_{SP2}(n) = 0.5 \times J(n) \tag{3-1}$$
where $J(n)$ is the original sound source, and $J_{SP1}(n)$ and $J_{SP2}(n)$ are the sounds emitted by SP1 and SP2, respectively. The distance between the two loudspeakers is set to 0.2 m.
Figure 3-5 depicts the simulation of the generated sound pattern of the proposed system based on the sound propagation theories in [71] when the robot’s orientation is 0°, where the sound power is measured at 1 m away from the SP1 with the same height. The solid lines in the circle depict the relative sound power in dB. As shown in Fig. 3-5, the generated sound pattern is symmetric along only one axis and is suitable for robot’s orientation detection.
Figure 3-5 Simulation of generated sound pattern.
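A rough free-field approximation of such a pattern can be computed as below. This is a sketch under stated assumptions, not the simulation of [71]: the two loudspeakers are treated as ideal point sources with gains 1 and 0.5, SP2 is assumed to sit 0.2 m from SP1 along the 0° direction, and the 1000 Hz tone and 343 m/s speed of sound are illustrative choices.

```python
import numpy as np

def two_speaker_power_db(theta_deg, freq_hz=1000.0, spacing=0.2, radius=1.0, c=343.0):
    """Relative sound power (dB) on a circle around SP1 for two free-field point sources."""
    theta = np.deg2rad(np.asarray(theta_deg, dtype=float))
    k = 2.0 * np.pi * freq_hz / c                      # wavenumber
    # Observation points at `radius` from SP1 (origin); SP2 is assumed at (spacing, 0).
    px, py = radius * np.cos(theta), radius * np.sin(theta)
    r1 = np.full_like(theta, radius)
    r2 = np.hypot(px - spacing, py)
    # Assumed gains: SP1 emits the source signal, SP2 emits it at half amplitude.
    pressure = np.exp(-1j * k * r1) / r1 + 0.5 * np.exp(-1j * k * r2) / r2
    power = np.abs(pressure) ** 2
    return 10.0 * np.log10(power / power.max())

angles = np.arange(0, 360, 5)
pattern = two_speaker_power_db(angles)
mirror_along_axis = two_speaker_power_db(-angles)        # reflect about the speaker axis
mirror_across_axis = two_speaker_power_db(180 - angles)  # reflect about the perpendicular axis
print(np.allclose(pattern, mirror_along_axis), np.allclose(pattern, mirror_across_axis))
```

Under these assumptions the pattern is mirror-symmetric about the line through the two loudspeakers only, which corresponds to the single-axis requirement stated above.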
3.4 Robot Localization Model (RLM) and Robot Orientation Model (ROM)
3.4.1 A Description of the Proposed RLM and ROM
To establish both RLMs and ROMs, the RLODA needs to construct models for the sound fields at different locations and orientations. $P_{Sx}(n_f, \omega_b)$ and $M_{Sx}(n_f, \omega_b)$ denote the phase difference and magnitude ratio, respectively, at frame $n_f$ and frequency $\omega_b$, $b \in \{1, \ldots, B\}$, for constructing the RLM ($S = L$) or the ROM ($S = O$). The GMMs of the phase difference and the magnitude ratio are defined as weighted sums of $N_1$ and $N_2$ Gaussian component densities, respectively.
Notably, the mixture weights must satisfy the constraints:
$$\sum_{i=1}^{N_1} \rho_{SP,i} = 1, \qquad \sum_{i=1}^{N_2} \rho_{SM,i} = 1$$
$$\lambda_{SP} = \{\rho_{SP}, \mu_{SP}, \Sigma_{SP}\} \quad \text{and} \quad \lambda_{SM} = \{\rho_{SM}, \mu_{SM}, \Sigma_{SM}\} \tag{3-5}$$
where
$\rho_{SP} = [\rho_{SP,1} \cdots \rho_{SP,N_1}]$ denotes the phase difference mixture weight vector with dimensions $1 \times N_1$.
$\rho_{SM} = [\rho_{SM,1} \cdots \rho_{SM,N_2}]$ denotes the magnitude ratio mixture weight vector with dimensions $1 \times N_2$.
$\mu_{SP} = [\mu_{SP,1} \cdots \mu_{SP,N_1}]$ denotes the phase difference mean matrix with dimensions $B \times N_1$.
$\mu_{SM} = [\mu_{SM,1} \cdots \mu_{SM,N_2}]$ denotes the magnitude ratio mean matrix with dimensions $B \times N_2$.
$\Sigma_{SP} = [\Sigma_{SP,1} \cdots \Sigma_{SP,N_1}]$ denotes the phase difference covariance matrix with dimensions $B \times BN_1$.
$\Sigma_{SM} = [\Sigma_{SM,1} \cdots \Sigma_{SM,N_2}]$ denotes the magnitude ratio covariance matrix with dimensions $B \times BN_2$.
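To make the parameter shapes concrete, the following sketch evaluates the log density of one such GMM. It assumes diagonal covariances, so each $B \times B$ block is stored only as its $B$ diagonal variances in a $B \times N$ array; the function name and this storage layout are illustrative choices.

```python
import numpy as np

def diag_gmm_log_density(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM.

    x: (F, B) feature frames; weights: (N,); means and variances: (B, N).
    Returns an array of F log-density values, one per frame.
    """
    x = np.asarray(x, dtype=float)[:, :, None]                    # (F, B, 1)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances)             # (B, N)
    log_comp = (log_norm - 0.5 * (x - means) ** 2 / variances).sum(axis=1)  # (F, N)
    log_weighted = np.log(weights) + log_comp
    peak = log_weighted.max(axis=1, keepdims=True)                # log-sum-exp for stability
    total = peak + np.log(np.exp(log_weighted - peak).sum(axis=1, keepdims=True))
    return total[:, 0]
```

Working in the log domain avoids underflow when many frames or frequency bins are combined.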
The parameters $\lambda_{SP}$ and $\lambda_{SM}$ in (3-5) can be estimated by the iterative EM algorithm, which guarantees a monotonic increase in the model’s log-likelihood value.
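Expectation step: compute the posterior probabilities $G(i \mid P_{Sx}(n_f), \lambda_{SP})$ and $G(i \mid M_{Sx}(n_f), \lambda_{SM})$ of the $i$-th mixture for each frame $n_f$.
Maximization step:
(i). Estimate the mixture weights.
(ii). Estimate the mean vectors.
(iii). Estimate the variances:
$$\sigma_{SP,i}^2(\omega_b) = \frac{\sum_{n_f=1}^{N_F} G(i \mid P_{Sx}(n_f), \lambda_{SP})\, P_{Sx}^2(n_f, \omega_b)}{\sum_{n_f=1}^{N_F} G(i \mid P_{Sx}(n_f), \lambda_{SP})} - \mu_{SP,i}^2(\omega_b) \tag{3-12}$$
$$\sigma_{SM,i}^2(\omega_b) = \frac{\sum_{n_f=1}^{N_F} G(i \mid M_{Sx}(n_f), \lambda_{SM})\, M_{Sx}^2(n_f, \omega_b)}{\sum_{n_f=1}^{N_F} G(i \mid M_{Sx}(n_f), \lambda_{SM})} - \mu_{SM,i}^2(\omega_b) \tag{3-13}$$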
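One EM iteration of this kind can be sketched as follows. The weight and mean updates are the standard ones for a diagonal-covariance GMM and are assumed here rather than taken from the thesis; the variance update follows the form of (3-12) and (3-13). The variance floor is an added safeguard, and a component with zero occupancy would need separate handling.

```python
import numpy as np

def em_step(x, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a diagonal-covariance GMM.

    x: (F, B) frames; weights: (N,); means and variances: (B, N).
    """
    diff = x[:, :, None] - means                                   # (F, B, N)
    log_comp = (-0.5 * np.log(2.0 * np.pi * variances)
                - 0.5 * diff ** 2 / variances).sum(axis=1)         # (F, N)
    # Expectation step: posterior probability of each mixture for each frame.
    log_post = np.log(weights) + log_comp
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Maximization step: weights, means, then variances.
    occupancy = post.sum(axis=0)                                   # (N,)
    new_weights = occupancy / x.shape[0]
    new_means = (x.T @ post) / occupancy                           # (B, N)
    second_moment = ((x ** 2).T @ post) / occupancy
    new_variances = np.maximum(second_moment - new_means ** 2, var_floor)
    return new_weights, new_means, new_variances
```

Calling `em_step` repeatedly, starting from an initial guess such as K-means centroids, refines the parameters until the log-likelihood stops improving.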
An accelerated K-means algorithm proposed in [55] is again utilized to reduce the computational power requirement.
The proposed RLM and ROM at location l and orientation o are defined as the linear combination of the phase difference GMM and the magnitude ratio GMM at location l and orientation o:
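$$F_{RLM}(l) = \alpha_{LP}\, G(P_{Lx}(n_f) \mid \lambda_{LP}(l)) + \alpha_{LM}\, G(M_{Lx}(n_f) \mid \lambda_{LM}(l)) \tag{3-14}$$
$$F_{ROM}(o) = \alpha_{OP}\, G(P_{Ox}(n_f) \mid \lambda_{OP}(o)) + \alpha_{OM}\, G(M_{Ox}(n_f) \mid \lambda_{OM}(o)) \tag{3-15}$$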
where $\alpha_{LP}$, $\alpha_{OP}$, $\alpha_{LM}$, and $\alpha_{OM}$ represent the weighting factors. The values of $\alpha_{SP}$ and $\alpha_{SM}$ can be chosen based on the sum of the correlation values among trained locations of the phase difference GMM and the magnitude ratio GMM. The GMM with the higher correlation sum is assigned a lower weight, since its ability to discriminate is considered lower under this circumstance, and vice versa. Under [...]
In addition, [...]
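The weighted combination in (3-14) and (3-15) can be illustrated for a single frame as below. This is a sketch under assumptions: diagonal covariances, equal default weights of 0.5, and a per-frame comparison across trained locations are all illustrative choices, and the helper names are hypothetical.

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """Diagonal-covariance GMM density for one frame x of shape (B,)."""
    expo = -0.5 * ((x[:, None] - means) ** 2 / variances).sum(axis=0)
    norm = np.prod(2.0 * np.pi * variances, axis=0) ** -0.5
    return float(np.sum(weights * norm * np.exp(expo)))

def combined_score(phase, mag, phase_gmm, mag_gmm, alpha_p, alpha_m):
    """Weighted sum of the phase difference and magnitude ratio GMM densities."""
    return alpha_p * gmm_density(phase, *phase_gmm) + alpha_m * gmm_density(mag, *mag_gmm)

def best_location(phase, mag, models, alpha_p=0.5, alpha_m=0.5):
    """Index of the trained location whose pair of GMMs scores the frame highest."""
    scores = [combined_score(phase, mag, pg, mg, alpha_p, alpha_m) for pg, mg in models]
    return int(np.argmax(scores))
```

Here each entry of `models` is a pair of `(weights, means, variances)` tuples, one for the phase difference GMM and one for the magnitude ratio GMM of a trained location.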
3.4.2 Location and Orientation Detection
The location and orientation are determined by finding the maximum a posteriori location probability and a posteriori orientation probability for a given observation sequence:
[...]
[...] difference and magnitude ratio computed from the testing sequences, denoted as [...] can be selected as 1/O since the probability in each location and orientation is equally likely for a blind search. Moreover, because the probability densities $p(P_{SY})$ and $p(M_{SY})$ are the same for all location models, the detection rule can be recast as:
[...]
3.5 Summary
A robot’s location and orientation detection method based on sound field features utilizing two microphones is proposed in this chapter. The proposed method treats
phase difference and magnitude ratio distributions between the microphones as sound field features. Based on this idea, RLMs and ROMs are introduced for the robot’s location and orientation detection. The presented system architecture contains an RLODA that can adapt to environmental noises. Moreover, with the pre-recorded database, the non-ideal issues of non-line-of-sight conditions and microphone mismatch can be addressed. The related experimental results are shown in Chapter 4.
Chapter 4
Experimental Results
4.1 Experimental Results of the Proposed GMBRDM
4.1.1 The Experimental Environment
The experiment is performed in a laboratory filled with common furniture and equipment. Fig. 4-1 shows the layout of the environment. The laboratory area is 10.5 m × 7.1 m and the room height is 3 m. The recording equipment comprises two B&K 4935 array microphones, a B&K 2694 conditioning amplifier, and an Azova DAQP-16 analog-to-digital converter.
Figure 4-1 The layout of the experimental environment.
The microphones are mounted in the ears of a dummy head, as depicted in Fig.
4-2. The distance between the dummy head’s ears is 0.16 m. Fig. 4-1 illustrates the location of the dummy head. The ears of the dummy head are placed 1 m above the floor.
Figure 4-2 The dummy head adopted in the experiment.
The sound source is a recording of a female speaker reading a book in Mandarin. (Here, we assume that the distribution of a1 of the speech signal is stationary during the training and testing procedures.) The sound source is played by a loudspeaker. Received signals are sampled at 8000 Hz, and the STFT window is 512 samples. For each experiment, the sound source is played at each tested location to obtain the training sequence used to establish the GMBRDM. The training sequence length, $N_F$, is set to 400 frames and the testing sequence length, $N_V$, is set to 100 frames, with a shift of 80 samples between consecutive frames. Hence, 4 seconds of data are utilized for training and 1 second of data for testing. Six significant frequencies of the sound source are selected in this experiment; therefore, each Gaussian model has six dimensions, $B = 6$. For each location, testing is performed 100 times to obtain the correct rate.
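The stated durations follow from the frame count, frame shift, and sampling rate. The short sketch below reproduces the arithmetic; the function name is illustrative, and the "exact" figure additionally accounts for the length of the final 512-sample window.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SHIFT = 80   # samples between consecutive frames
WINDOW = 512       # STFT window length in samples

def sequence_duration(num_frames, shift=FRAME_SHIFT, window=WINDOW, rate=SAMPLE_RATE_HZ):
    """Return (nominal, exact) durations in seconds for a run of overlapping frames."""
    nominal = num_frames * shift / rate
    exact = ((num_frames - 1) * shift + window) / rate
    return nominal, exact

print(sequence_duration(400))  # (4.0, 4.054)
print(sequence_duration(100))  # (1.0, 1.054)
```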
4.1.2 The Experimental Results
The first experiment tests the ability of azimuth localization. In this experiment, distances between the sound source and ears are fixed at 1 m, 1.2 m, 1.4 m, 1.6 m, 1.8 m, and 2 m. For each distance, the azimuth of sound source moves from -60°, -30°, 0°, 30°, to 60° to test the average correct rate of azimuth localization. The elevation of sound source is set the same as that of the ears (1 m). Different mixture numbers are utilized. Table 4-1 shows the average correct rate of azimuth localization at each distance.
Table 4-1 Average correct rates of azimuth localization at each distance
Mixture number   Distance (m)
                 1.0    1.2    1.4    1.6    1.8    2.0
1                97 %   83 %   40 %   67 %   70 %   67 %
5                98 %   84 %   59 %   81 %   85 %   72 %
10               99 %   89 %   81 %   87 %   88 %   73 %
15               99 %   91 %   83 %   87 %   89 %   83 %
20               99 %   88 %   83 %   86 %   89 %   85 %
25               99 %   91 %   88 %   87 %   89 %   91 %
As shown in Table 4-1, when the distance between the sound source and the ears is 1 m, meaning that the sound source is close to the dummy head, the performance with a mixture number of 1 is roughly the same as that with high mixture numbers. When the sound source is close to the dummy head, the influence of direct path propagation is much more significant than that of reverberation. Consequently, the BRDPs are influenced less by the reflections and can be modeled using a single Gaussian distribution.
The benefit of adopting multiple mixtures is apparent at a long distance, such as 2 m, where the correct rate increases with the mixture number.
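The trend can be read directly from the table. The snippet below copies the values of Table 4-1 and computes, for each distance, the gain in correct rate when going from 1 to 25 mixtures; it is an illustrative calculation only, not part of the original analysis.

```python
import numpy as np

distances_m = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
mixtures = [1, 5, 10, 15, 20, 25]
# Correct rates (%) from Table 4-1; rows follow `mixtures`, columns follow `distances_m`.
rates = np.array([
    [97, 83, 40, 67, 70, 67],
    [98, 84, 59, 81, 85, 72],
    [99, 89, 81, 87, 88, 73],
    [99, 91, 83, 87, 89, 83],
    [99, 88, 83, 86, 89, 85],
    [99, 91, 88, 87, 89, 91],
])
gain = rates[-1] - rates[0]  # improvement from 1 to 25 mixtures at each distance
for d, g in zip(distances_m, gain):
    print(f"{d:.1f} m: +{g} percentage points")
```

The gain is only 2 percentage points at 1 m but 24 points at 2 m, in line with the observation about direct-path dominance at short range.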
The second experiment tests the capability of the proposed GMBRDM for distance localization. In this experiment, the azimuth is fixed at -60°, -30°, 0°, 30°, and 60°. At each azimuth, the distance between the sound source and ears changes from 1 m, 1.2 m, 1.4 m, 1.6 m, 1.8 m, to 2 m to acquire average correct rates. The sound source height is adjusted to 1 m. Table 4-2 shows the average correct rates for distance localization at each azimuth.
Table 4-2 Average correct rates of distance localization at each azimuth
Mixture number   Azimuth
                 -60°   -30°   0°     30°    60°
1                49 %   31 %   48 %   43 %   61 %
5                40 %   47 %   65 %   55 %   64 %
10               76 %   68 %   73 %   58 %   69 %
15               80 %   76 %   73 %   67 %   72 %
20               79 %   73 %   73 %   70 %   74 %
25               86 %   82 %   73 %   73 %   78 %
Because the geometry of the sound source and the ears meets the far-field criterion, the IPDs of the direct path at the same azimuth and different distances are theoretically approximately identical. The ILDs of the direct paths differ only slightly between distant locations. Thus, modeling these BRDPs using a single Gaussian component can lose important details caused by reflections and result in poor localization. As listed in Table 4-2, the average correct rates when only one mixture is employed are clearly lower than those with a high mixture number. This finding arises because the proposed GMBRDM can represent the details of the BRDPs, which yields superior modeling results.
The third experiment tests the elevation localization performance of the proposed GMBRDM. In this experiment, distance between the sound source and ears is 2 m and the azimuth is fixed at -60°, -30°, 0°, 30°, and 60°. At each azimuth, the elevation of the sound source changes from 1 m, 1.25 m, to 1.5 m to acquire average correct rates.
Table 4-3 lists experimental results. Experimental data show that GMBRDM with a high mixture number can properly model the BRDPs at different elevations.
Table 4-3 Average correct rates of elevation localization at each azimuth
Mixture number   Azimuth
                 -60°   -30°   0°     30°    60°
1                59 %   55 %   33 %   46 %   59 %
5                83 %   90 %   60 %   80 %   63 %
10               82 %   93 %   55 %   83 %   74 %
15               88 %   94 %   55 %   88 %   74 %
20               89 %   93 %   67 %   86 %   81 %
25               92 %   98 %   84 %   93 %   88 %
4.2 Experimental Results of the Proposed Robot’s Localization and Orientation Detection Method
4.2.1 The Experimental Environment
Figure 4-3 shows the experimental platform and the proposed RLODA. In Fig.
4-3 (a), the distance between two loudspeakers is 0.2 m. The distance between the two microphones of the RLODA is chosen as 0.07 m, as shown in Fig. 4-3 (b).
Figure 4-3 The experimental platform and the proposed RLODA. (a) The experimental platform. (b) The proposed RLODA.
The experiment was performed in an office room filled with furniture, which is 11.4 m in length, 4.73 m in width, and 2.8 m in height. Two off-the-shelf, non-calibrated microphones are utilized on the RLODA in this experiment, and the RLODA is implemented on a PC with a stereo recording sound card. The sampling rate is 8000 Hz, and the A/D resolution is 16 bits. The pre-recording is performed every 0.1 m within the region in which the robot is allowed to travel. For orientation detection, the robot is rotated in 30° steps to obtain 12 orientations over 360°.
Figure 4-4 depicts the experimental environment and the location of the RLODA.
Note that there is a partition room in the office. Therefore, the robot is entirely in a non-line-of-sight condition when it is in the partition room. The robot’s moving trajectories are also shown in Fig. 4-4 with dotted lines numbered from 1 to 8 in sequence.
The sound source utilized in this experiment mimics the sound of a dog barking.
The spectrogram of the sound source is illustrated in Fig. 4-5. The lengths of the training sequence and the testing sequence were set to 300 and 30 frames, respectively. In other words, three seconds of input data were used for training and 0.3 seconds for testing. The major noise in this experiment is speech noise, and the minor noises are electrical noises such as air conditioner noise and computer fan noise, chosen to simulate a general indoor environment.
Figure 4-4 Experimental environment.
Figure 4-5 The waveform and spectrogram of barking signal.
4.2.2 The Experimental Results
Table 4-4 lists the average SNRs of all trajectories and the average SNRs of each trajectory pair. Figure 4-6 shows the location detection results along the robot’s