Indoor Environment - Comparison of NLMS and H∞ Adaptation Criteria

Chapter 6 Experimental Results

6.2 Comparison of NLMS and H∞ Adaptation Criteria

6.2.2 Indoor Environment

The indoor environment is arranged as shown in Fig. 6-8, and the ASR parameters are listed in Table 6-6. The FDABB parameters are the same as those in Table 6-4, and the soft penalty is 2. Figure 6-18 presents the ASR rates of SPFDBB and FDABB using the NLMS and H∞ adaptation criteria in an indoor environment. The H∞ adaptation criterion clearly outperforms the NLMS adaptation criterion in this case. Notably, although FDABB can adjust the frame number L, it does not always perform better than SPFDBB. For example, under the H∞ adaptation criterion, the ASR rates of SPFDBB are better than those of FDABB in C3 and C6.

Figure 6-18 ASR rates of SPFDBB and FDABB using NLMS and H∞ adaptation criteria in an indoor environment

6.2.3 Vehicular Environment

The experimental conditions are described in Section 6.1.3. Figure 6-19 shows the ASR rates of SPFDBB and FDABB using the NLMS and H∞ adaptation criteria in a vehicular environment. The H∞ adaptation criterion again outperforms the NLMS adaptation criterion in the vehicular environment.

Figure 6-19 ASR rates of SPFDBB and FDABB using NLMS and H∞ adaptation criteria in a vehicular environment

6.3 Reference-signal-based Speaker’s Location Detection

6.3.1. Vehicular Environment

The experiment is performed in a mini-van vehicle with six separated seats [121].

Figure 6-20 presents the locations of the seats. During the experiment, the speakers at these locations move around slightly to mimic real usage scenarios. A uniform linear array of six microphones with 0.05 m spacing is mounted in front of location No. 2. The experiment is performed in various noisy environments, in which the environmental noise changes with the vehicle speed. Table 6-12 lists the SNR ranges at various speeds, corresponding to the six locations. Table 6-13 presents the frequency bands that correspond to the pairs of microphones.

Figure 6-20 Seat number and microphone array position

Table 6-12 The SNR Ranges at Various Speeds

| Speed | SNR range (dB) |
|---|---|
| 0 km/h | 10.8204 ~ 17.2664 |
| 20 km/h | 4.1762 ~ 10.6222 |
| 40 km/h | -4.5320 ~ 1.9140 |
| 60 km/h | -6.2526 ~ 0.1934 |
| 80 km/h | -8.4709 ~ -2.0249 |
| 100 km/h | -13.0531 ~ -6.6071 |

Table 6-13 The Frequency Bands Corresponding to the Microphone Pairs

| Frequency band | Microphone pairs | Number of microphone pairs | Range of frequency band |
|---|---|---|---|
| Band 1 (b=1) | (1,6) | J₁=1 | 0 ≤ f ≤ 680 Hz |
| Band 2 (b=2) | (1,5); (2,6) | J₂=2 | 680 Hz < f ≤ 850 Hz |
| Band 3 (b=3) | (1,4); (2,5); (3,6) | J₃=3 | 850 Hz < f ≤ 1100 Hz |
| Band 4 (b=4) | (1,3); (2,4); (3,5); (4,6) | J₄=4 | 1100 Hz < f ≤ 1700 Hz |
| Band 5 (b=5) | (1,2); (2,3); (3,4); (4,5); (5,6) | J₅=5 | 1700 Hz < f ≤ 3400 Hz |
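The band edges in Table 6-13 appear consistent with the half-wavelength rule f ≤ c/(2d), where d is the distance between the two microphones of a pair. The sketch below reproduces the pair grouping and the upper band limits under that reading. The sound speed of 340 m/s and the half-wavelength rule itself are assumptions inferred from the numbers, not statements taken from the text, and the function names are illustrative only. Under these assumptions the limit for a separation of three spacings is about 1133 Hz, which the table lists as 1100 Hz.

```python
from itertools import combinations

SOUND_SPEED = 340.0  # m/s, assumed value (not stated in the text)
SPACING = 0.05       # m, microphone spacing given in the text
NUM_MICS = 6         # uniform linear array of six microphones


def pairs_by_separation(num_mics=NUM_MICS):
    """Group microphone pairs (i, j), i < j, by index separation j - i."""
    groups = {}
    for i, j in combinations(range(1, num_mics + 1), 2):
        groups.setdefault(j - i, []).append((i, j))
    return groups


def alias_free_limit(separation, spacing=SPACING, c=SOUND_SPEED):
    """Upper frequency c / (2 d) for a pair whose distance is d = separation * spacing."""
    return c / (2.0 * spacing * separation)


if __name__ == "__main__":
    # Widest pairs first, matching the order of bands 1 to 5 in Table 6-13.
    for sep, pairs in sorted(pairs_by_separation().items(), reverse=True):
        print(f"separation {sep}: {len(pairs)} pair(s) {pairs}, "
              f"limit {alias_free_limit(sep):.0f} Hz")
```

Grouping by separation rather than listing pairs by hand keeps the sketch valid for other array sizes, although only the six-microphone case is covered by the experiment.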

6.3.1.1 MUSIC Algorithm - Single Speaker’s Location Detection

A wideband incoherent MUSIC algorithm [10] with arithmetic mean is implemented and the results are compared with those of the proposed approach. Ten major

Outliers are removed from the estimated angles by utilizing the method provided in [122]. Moreover, the angle errors needed for outlier rejection are derived from the estimated angles and the real angles. Locations No. 2, 4, and 6 have the same DOA to the microphone array. Therefore, only locations No. 1, 2, 3, and 5 are considered for online testing. A frequently used classification method, the K-nearest-neighbor (KNN) classification rule [123], is used to construct a flexible boundary that improves the detection accuracy in the presence of slight movement of the source, microphone mismatch, transient response, and environmental noise. The estimated angles following outlier rejection are used as reference data in online location detection to illustrate further the performance of the MUSIC algorithm in the car cabin. Suppose that the $l$-th location contains $\beta_l$ estimated angles and that $\beta = \sum_l \beta_l$ is the size of the reference data set. Assume that the $l$-th location contains $K_l$ points among the $K$ nearest neighbors of a new estimate $\hat{r}$ derived from MUSIC with outlier rejection. The a posteriori probability is then given as

$$p(l \mid \hat{r}) = \frac{K_l}{K} \qquad (6\text{-}3)$$

To minimize the probability of a false classification of $\hat{r}$, the estimated location, denoted as $\hat{l}_{\mathrm{MUSIC}}$, is decided by using the following equation:

$$\hat{l}_{\mathrm{MUSIC}} = \arg\max_{l = 1, 2, 3, 5} \; p(l \mid \hat{r}) \qquad (6\text{-}4)$$

Notably, the new estimate will not be classified if it is an outlier. The parameters are set to $\beta_l = 200$ for $l \in \{1, 2, 3, 5\}$, $\beta = 800$, and $K = 30$, and the number of trials is 100.
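The decision rule of Eqs. (6-3) and (6-4) can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiment: the function names are hypothetical, the reference angles in the demo are synthetic, the true DOAs in `centers` are invented for the demo, and the outlier rejection of [122] is assumed to have been applied before these functions are called.

```python
import numpy as np


def knn_posterior(reference_angles, reference_labels, r_hat, k=30):
    """Return {location l: K_l / K} over the k reference angles nearest to r_hat."""
    reference_angles = np.asarray(reference_angles, dtype=float)
    reference_labels = np.asarray(reference_labels)
    # Indices of the k reference angles closest to the new estimate.
    nearest = np.argsort(np.abs(reference_angles - r_hat), kind="stable")[:k]
    labels, counts = np.unique(reference_labels[nearest], return_counts=True)
    return {int(l): float(c) / k for l, c in zip(labels, counts)}


def classify(reference_angles, reference_labels, r_hat, k=30):
    """Pick the location with the largest a posteriori probability, as in Eq. (6-4)."""
    posterior = knn_posterior(reference_angles, reference_labels, r_hat, k)
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical DOAs (degrees) for locations 1, 2, 3, and 5; 200 reference
    # angles per location, 800 in total, K = 30 as in the text.
    centers = {1: -40.0, 2: 0.0, 3: 35.0, 5: 60.0}
    angles = np.concatenate([rng.normal(mu, 5.0, 200) for mu in centers.values()])
    labels = np.repeat(list(centers), 200)
    print(classify(angles, labels, r_hat=33.0, k=30))
```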

Table 6-14 presents the correct rates after KNN classification with outlier rejection. The correct rates at locations No. 3 and 5 are too low to be useful. In summary, these experimental results demonstrate that the MUSIC algorithm is not sufficiently reliable in a vehicular environment, even when a classification method is applied and outliers are rejected to cope with the uncertainties.

Table 6-14 Correct Rate of MUSIC Method Utilizing KNN with Outlier Rejection

The correct rates at various speeds (km/h)

| Location | Speed 0 km/h | Speed 20 km/h | Speed 40 km/h | Speed 60 km/h | Speed 80 km/h | Speed 100 km/h |
|---|---|---|---|---|---|---|
| 1 | 94 % | 85 % | 74 % | 79 % | 84 % | 91 % |
| 2 | 93 % | 90 % | 92 % | 89 % | 81 % | 89 % |
| 3 | 60 % | 44 % | 63 % | 70 % | 36 % | 52 % |
| 4 | × | × | × | × | × | × |
| 5 | 59 % | 46 % | 17 % | 26 % | 78 % | 22 % |
| 6 | × | × | × | × | × | × |

6.3.1.2 Proposed Single Speaker’s Location Detection Approach

The proposed approach is applied under the same experimental conditions as in Section 6.3.1.1 to detect the speaker’s location. The second initialization approach mentioned in Section 5.3.2 is utilized to initialize the mean values. The covariance update may lead to numerical difficulties when the covariance matrices become nearly singular. Consequently, the practical solution is to limit the minimum variance σ_min². In this experiment, the value of σ_min² is set to 0.02. The lengths of the training sequence T and the testing sequence Q are set to 200 and 50, respectively; in other words, two seconds of input data are used for training and half a second for testing. The number of GMM mixtures is varied from one to ten. Figure 6-21 plots the correct rate versus the mixture number at 100 km/h. As

could not yield a satisfactory experimental performance. The correct rates are 100% at all locations when the mixture number is ten. This finding justifies the assumption that a GMM is suitable for this application. Although the performance improves as the mixture number increases, the improvement is not significant once the mixture number exceeds five. Table 6-15 lists the experimental results with a mixture number of five. Clearly, the proposed method outperforms the MUSIC algorithm. Even locations No. 4 and 6, which share the DOA of location No. 2, are distinguished with high accuracy. Figure 6-22 shows the histograms of phase differences at locations No. 2, 4, and 6 between the third and the sixth microphones at a frequency of 0.9375 kHz and between the fourth and the sixth microphones at a frequency of 1.5 kHz, i.e., in the third and fourth frequency bands. The speed that corresponds to this figure is 100 km/h. Although these locations have the same angle to the microphone array, their phase difference distributions are quite different, as indicated by several research reports [113-114]. Additionally, the proposed method combines five frequency bands, each of which contains different phase difference distributions.

As a result, the proposed method is able to distinguish all of the locations by exploiting their implicit diversities. Moreover, under low SNR conditions, the proposed approach still yields a high correct rate and is robust against in-vehicle noise.
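The roles of the variance floor σ_min², the training length T, and the testing length Q can be illustrated with a small sketch. It rests on several assumptions that the text does not confirm: diagonal covariances, random-sample initialization of the means (the dissertation uses the approach of Section 5.3.2 instead), and a maximum-likelihood decision over the Q test frames. All function names are hypothetical, and the data in the demo are synthetic rather than measured phase differences.

```python
import numpy as np

SIGMA2_MIN = 0.02  # variance floor, value taken from the text


def _component_log_prob(x, weights, means, variances):
    """Log of weight * Gaussian density for every frame and mixture (T x M)."""
    diff = x[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.log(2.0 * np.pi * variances)[None]
                        + diff ** 2 / variances[None]).sum(axis=2)
    return np.log(weights)[None, :] + log_gauss


def fit_diag_gmm(x, n_mix, n_iter=50, var_floor=SIGMA2_MIN, seed=0):
    """Fit a diagonal-covariance GMM to x (T x D) by EM with a variance floor."""
    rng = np.random.default_rng(seed)
    t = x.shape[0]
    means = x[rng.choice(t, n_mix, replace=False)]  # assumed initialization
    variances = np.tile(np.maximum(x.var(axis=0), var_floor), (n_mix, 1))
    weights = np.full(n_mix, 1.0 / n_mix)
    for _ in range(n_iter):
        # E-step: responsibilities of each mixture for each frame.
        log_p = _component_log_prob(x, weights, means, variances)
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: update weights, means, and floored variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / t
        means = (resp.T @ x) / nk[:, None]
        second_moment = (resp.T @ (x ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances


def sequence_log_likelihood(x, gmm):
    """Total log-likelihood of a test sequence x (Q x D) under one GMM."""
    return np.logaddexp.reduce(_component_log_prob(x, *gmm), axis=1).sum()


def detect_location(x_test, gmms):
    """Return the key of the location model with the highest likelihood."""
    scores = {loc: sequence_log_likelihood(x_test, g) for loc, g in gmms.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic "locations" with the same mean but different distributions.
    train = {
        "A": np.concatenate([rng.normal(-1, 0.2, 100), rng.normal(1, 0.2, 100)])[:, None],
        "B": rng.normal(0, 0.3, 200)[:, None],
    }
    gmms = {loc: fit_diag_gmm(x, n_mix=2) for loc, x in train.items()}  # T = 200
    test = rng.normal(0, 0.3, 50)[:, None]                               # Q = 50
    print(detect_location(test, gmms))
```

The two synthetic locations share the same mean, which loosely mirrors the observation above that locations with the same DOA can still be separated through the shape of their phase difference distributions.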

Table 6-15 Experimental Result of the Proposed Method with a Mixture Number of Five

The correct rates at various speeds (km/h)

| Location | Speed 0 km/h | Speed 20 km/h | Speed 40 km/h | Speed 60 km/h | Speed 80 km/h | Speed 100 km/h |
|---|---|---|---|---|---|---|
| 1 | 99 % | 100 % | 99 % | 99 % | 99 % | 98 % |
| 2 | 99 % | 100 % | 99 % | 99 % | 97 % | 98 % |
| 3 | 100 % | 100 % | 99 % | 100 % | 100 % | 100 % |
| 4 | 99 % | 99 % | 99 % | 98 % | 98 % | 98 % |
| 5 | 100 % | 100 % | 100 % | 100 % | 100 % | 100 % |
| 6 | 99 % | 99 % | 98 % | 99 % | 98 % | 97 % |

Figure 6-21 Correct rate versus the mixture number at 100 km/h: (a) locations No. 1 to 3; (b) locations No. 4 to 6

Figure 6-22 The histograms of phase differences at locations No. 2, 4, and 6 between the third and the sixth microphones at a frequency of 0.9375 kHz and between the fourth and the sixth microphones at a frequency of 1.5 kHz, i.e., in the third and fourth frequency bands (speed = 100 km/h): (a) location No. 2 (0.9375 kHz); (b) location No. 4 (0.9375 kHz); (c) location No. 6 (0.9375 kHz); (d) location No. 2 (1.5 kHz); (e) location No. 4 (1.5 kHz); (f) location No. 6 (1.5 kHz)

6.3.1.3 Proposed Multiple Speakers’ Locations Detection Approach

Figure 6-23 shows the locations of the six in-car loudspeakers and the locations tested in the experiment. The first six locations correspond to modeled locations, the radio broadcast is emitted from the six in-car loudspeakers, and locations No. 7, 8, and 9 correspond to unmodeled locations. The total length of the training phase difference sequence T is set to 300 (3-second duration). The values of Q_Lo, Q_UP, α, β, and γ are set to 10, 35, 0.3, 0.4, and 0.3, respectively.

Table 6-16 lists the SNR ranges at various speeds. Six mixture numbers are tested for the GMM: 1, 3, 5, 7, 9, and 11. Localization detection is run for 300 trials for each mixture number at each speed. For the single-speaker condition, Fig. 6-24 plots the average correct rates versus the mixture number. It indicates that a single Gaussian distribution, M = 1, could not yield a satisfactory performance, and that increasing the mixture number improves the performance.

Figure 6-23 Location numbers of the seats

Table 6-16 SNR Ranges at Various Speeds

SNR ranges (dB)

| Speed (km/h) | Multiple speakers at locations No. 1 to 6 | Radio broadcasting | Single speaker at location No. 7 | Single speaker at location No. 8 | Single speaker at location No. 9 |
|---|---|---|---|---|---|
| Speed = 0 km/h | 10.81 – 18.15 dB | 13.10 dB | 14.96 dB | 13.18 dB | 17.31 dB |
| Speed = 20 km/h | 5.62 – 12.96 dB | 7.20 dB | 10.15 dB | 9.37 dB | 11.50 dB |
| Speed = 40 km/h | 0.19 – 7.54 dB | 2.18 dB | 4.53 dB | 2.76 dB | 6.89 dB |
| Speed = 60 km/h | -0.54 – 6.81 dB | 1.75 dB | 3.81 dB | 2.03 dB | 5.16 dB |
| Speed = 80 km/h | -5.32 – 2.02 dB | -3.04 dB | -0.98 dB | -2.76 dB | 1.37 dB |
| Speed = 100 km/h | -7.28 – 0.07 dB | -5.99 dB | -2.93 dB | -4.71 dB | -0.58 dB |

Figure 6-24 Average correct rates versus the mixture numbers: (a) locations No. 1 to 3; (b) locations No. 4 to 6

With two speakers talking, fifteen combinations are possible, such as locations No. 1 and 2, or locations No. 1 and 3. Three, four, and five speakers talking yield 20, 15, and 6 possible combinations, respectively. Table 6-17 lists the average error rates under these conditions with a mixture number of 11. Here, an error is defined as a detection result that does not match the location of any of the active speakers. For example, if the speech signals come from locations No. 2 and 3, then an error occurs when the detection result is neither 2 nor 3. Table 6-18 lists the average error rates for the radio broadcast and for speech signals coming from locations No. 7, 8, and 9 with a mixture number of 11. In that table, an error is defined as a detection result that points to one of the modeled locations. The experimental results indicate that the proposed method can successfully deal with multiple speakers and unmodeled speech sources.
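The combination counts and the error definition used for Table 6-17 can be checked with a short sketch. The function names are illustrative, and the trial list in the demo is invented; it is not experimental data.

```python
from itertools import combinations

MODELED_LOCATIONS = (1, 2, 3, 4, 5, 6)


def speaker_combinations(n_speakers, locations=MODELED_LOCATIONS):
    """All sets of n_speakers simultaneously active modeled locations."""
    return list(combinations(locations, n_speakers))


def is_error(detected, active_locations):
    """An error is a detection that matches none of the active speakers."""
    return detected not in active_locations


def average_error_rate(trials):
    """trials: iterable of (detected_location, active_locations); returns percent."""
    trials = list(trials)
    errors = sum(is_error(detected, active) for detected, active in trials)
    return 100.0 * errors / len(trials)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "speakers:", len(speaker_combinations(n)), "combinations")
    # Invented trials with speakers at locations No. 2 and 3.
    demo = [(2, (2, 3)), (3, (2, 3)), (1, (2, 3)), (3, (2, 3))]
    print("error rate:", average_error_rate(demo), "%")
```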

Table 6-17 Average Error Rates at Various Speeds under Multiple Speakers’ Conditions

Average error rates (%)

| Speaker number | Speed 0 km/h | Speed 20 km/h | Speed 40 km/h | Speed 60 km/h | Speed 80 km/h | Speed 100 km/h |
|---|---|---|---|---|---|---|
| 2 | 0.67 % | 1.11 % | 0.44 % | 0.67 % | 1.56 % | 1.78 % |
| 3 | 0.50 % | 1.00 % | 0.67 % | 0.50 % | 1.17 % | 1.83 % |
| 4 | 0.89 % | 0.89 % | 0.66 % | 0.44 % | 1.11 % | 1.56 % |
| 5 | 0.11 % | 0.05 % | 0 % | 0 % | 0.05 % | 0.11 % |

Table 6-18 Average Error Rates of Unmodeled Locations at Various Speeds

Average error rates (%)

| Speed (km/h) | Radio broadcasting | Single speaker at location No. 7 | Single speaker at location No. 8 | Single speaker at location No. 9 |
|---|---|---|---|---|
| Speed = 0 km/h | 0.22 % | 0 % | 0.06 % | 0.22 % |
| Speed = 20 km/h | 0.28 % | 0 % | 0.17 % | 0 % |
| Speed = 40 km/h | 0 % | 0 % | 0 % | 0 % |
| Speed = 60 km/h | 0.06 % | 0 % | 0 % | 0.33 % |
| Speed = 80 km/h | 0.28 % | 0.33 % | 0.33 % | 0.33 % |
| Speed = 100 km/h | 0.33 % | 0 % | 0.39 % | 0.67 % |

6.3.2. Indoor Environment

The dimensions of the experimental room and the arrangement of the microphone array are the same as those in Section 6.1.2.

6.3.2.1 Proposed Single Speaker’s Location Detection Approach

Figure 6-25 presents the real configuration. The four speech signals are located at different angles, 0°, 30°, and −60°, with various distances to the array. Noises are

noise No. 1 have the same DOA to the microphone array. All of the speech signals and noises are played by loudspeakers during the experiment. The interference signals in this experiment are mutually uncorrelated white Gaussian noises. The distances between the microphone array and the noise sources are all 1.5 m. There are twelve experimental conditions in total, denoted C1 to C12, as shown in Table 6-19.

Figure 6-25 Configuration of microphone array, noises and speech sources in noisy environment

Table 6-19 Twelve Kinds of Experimental Conditions

| Condition | Sources | SNR (dB) |
|---|---|---|
| C1 | Speech No. 1 and noise No. 1 | 15.46 dB |
| C2 | Speech No. 2 and noise No. 1 | 16.45 dB |
| C3 | Speech No. 3 and noise No. 1 | 13.93 dB |
| C4 | Speech No. 4 and noise No. 1 | 15.67 dB |
| C5 | Speech No. 1, noises No. 1 and 2 | 10.08 dB |
| C6 | Speech No. 2, noises No. 1 and 2 | 11.07 dB |
| C7 | Speech No. 3, noises No. 1 and 2 | 8.55 dB |
| C8 | Speech No. 4, noises No. 1 and 2 | 10.28 dB |
| C9 | Speech No. 1, noises No. 1, 2, and 3 | 5.95 dB |
| C10 | Speech No. 2, noises No. 1, 2, and 3 | 6.93 dB |
| C11 | Speech No. 3, noises No. 1, 2, and 3 | 4.42 dB |
| C12 | Speech No. 4, noises No. 1, 2, and 3 | 6.15 dB |

The lengths of the training sequence T and the testing sequence Q are set to 300 and 50, respectively; in other words, three seconds of input data are used for training and half a second for testing. Six mixture numbers are tested for the GMM: 1, 3, 5, 7, 9, and 11. Localization detection is run for 250 trials for each mixture number under each condition. Figure 6-26 plots the correct rates versus the mixture number under the various conditions. Although speech No. 4 and noise No. 1 come from the same direction, speech No. 4 can still be distinguished in C4, C8, and C12 with a higher mixture number. Generally, the correct rates in the indoor environment are lower than those in the vehicular environment when the mixture number is low, such as 1, 3, or 5. This suggests that the phase difference distributions of different locations are more distinguishable in the vehicular environment.

Figure 6-26 Correct rates versus the different mixture numbers: (a) conditions one to four; (b) conditions five to eight

6.3.2.2 Proposed Multiple Speakers’ Locations Detection Approach

Figure 6-27 presents the real configuration. Three unmodeled speech signals, speeches No. 5, 6, and 7, are added in this experiment; they can be regarded as undesired speech signals or interference signals. The total length of the training phase difference sequence T is set to 300 (3-second duration). The values of Q_Lo, Q_UP, α, β, and γ are set to 10, 35, 0.3, 0.4, and 0.3, respectively.

Table 6-20 lists the SNR ranges in three different noisy environments. Six mixture numbers are again tested for the GMM: 1, 3, 5, 7, 9, and 11. Localization detection is run for 250 trials for each mixture number. For the single-speaker condition, Fig. 6-28 plots the average correct rates versus the mixture number.

Figure 6-27 Configuration of microphone array, noises and speech sources in noisy environment

Table 6-20 SNR Ranges at Three Different Noisy Environments

SNR ranges (dB)

| Noisy environment | Multiple speakers at speeches No. 1 to 4 | Single speech signal at speech No. 5 | Single speech signal at speech No. 6 | Single speech signal at speech No. 7 |
|---|---|---|---|---|
| Noise No. 1 | 13.93 – 26.9 dB | 14.97 dB | 18.17 dB | 16.00 dB |
| Noises No. 1 and 2 | 8.55 – 21.54 dB | 9.57 dB | 12.78 dB | 10.62 dB |
| Noises No. 1, 2, and 3 | 4.42 – 15.66 dB | 3.68 dB | 6.88 dB | 4.71 dB |

With two speakers talking, six combinations are possible, such as speeches No. 1 and 2, or speeches No. 1 and 3. Three speakers talking yield four possible combinations. Table 6-21 lists the average error rates under these conditions, and Table 6-22 lists the average error rates for speech signals coming from speeches No. 5, 6, and 7 with a mixture number of 11. Notably, speeches No. 1, 3, and 6, and likewise speeches No. 2 and 5, have the same DOA to the microphone array. The experimental results indicate that the proposed method can also successfully deal with multiple speakers and unmodeled speech signals.

Table 6-21 Average Error Rates at Three Noisy Environments under Multiple Speakers’ Conditions

Average error rates (%)

| Speaker number | Noise No. 1 | Noises No. 1 and 2 | Noises No. 1, 2, and 3 |
|---|---|---|---|
| 2 | 2.93 % | 1.33 % | 0.87 % |
| 3 | 0.1 % | 0.1 % | 0 % |

Table 6-22 Average Error Rates of Unmodeled Locations at Three Noisy Environments

Average error rates (%)

| Noisy environment | Single speech signal at speech No. 5 | Single speech signal at speech No. 6 | Single speech signal at speech No. 7 |
|---|---|---|---|
| Noise No. 1 | 0.8 % | 0 % | 0 % |
| Noises No. 1 and 2 | 0 % | 0 % | 0.2 % |
| Noises No. 1, 2, and 3 | 0.3 % | 0.4 % | 0.2 % |

6.4 Summary

This chapter evaluates the proposed SPFDBB, FDABB, and speaker’s location detection approaches through simulations and real experiments. Section 6.1 shows that the proposed SPFDBB and FDABB not only outperform the reference-signal-based time-domain adaptive beamformer and several well-known beamformers, but also reduce the computational effort. Section 6.2 simulates the single-channel and multiple-channel cases and performs vehicular and indoor experiments to show the robustness of the H∞ adaptation criterion. Moreover, Section 6.3 applies the proposed speaker’s location detection approach in noisy vehicular and indoor environments to demonstrate its high detection accuracy and its robustness to unmodeled or unexpected speech sources.

Chapter 7 Conclusions and Future Research

7.1. Conclusions

This dissertation presents reference-signal-based methods of sound source localization and speech purification using a microphone array. Specifically, frequency-domain adaptive beamformers, namely SPFDBB and FDABB, are proposed to cope with the computational issues of real-time operation. Under this architecture, the proposed approaches can be applied to both near-field and far-field environments and overcome the microphone mismatch problem.

Beyond the advantages mentioned above, SPFDBB and FDABB can minimize the channel effects, the desired signal cancellation, and the resolution effect due to the array’s position. FDABB and SPFDBB not only reduce the computational effort, but also deal with the problem of inaccurate channel representation; that is, the convolution relation between the channel and the speech source in the time domain cannot be modeled accurately as a multiplication in the frequency domain with a finite window size. According to the computational effort analysis in Chapters 4, 5, and 6, FDABB and SPFDBB require less computational effort than the reference-signal-based time-domain adaptive beamformer. Additionally, FDABB utilizes an index named CBVI to adjust the frame number L automatically, so it is more suitable than SPFDBB for applications with a short training data length or varying channel dynamics.
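The inaccurate channel representation mentioned above can be made concrete with a small numerical sketch. It compares the spectrum of one received frame with the product of the channel spectrum and the spectrum of the corresponding source frame, and reports the relative mismatch. This is only an illustration of the finite-window effect: the function name, the exponentially decaying channel, and the white-noise source are assumptions for the demo and do not come from the dissertation.

```python
import numpy as np


def frame_mismatch(signal, channel, start, n_fft):
    """Relative error of the per-frame product model Y(k) ~ H(k) S(k)."""
    received = np.convolve(signal, channel)       # true time-domain channel output
    s_frame = signal[start:start + n_fft]         # source frame of n_fft samples
    y_frame = received[start:start + n_fft]       # received frame at the same position
    predicted = np.fft.fft(channel, n_fft) * np.fft.fft(s_frame)
    actual = np.fft.fft(y_frame)
    return np.linalg.norm(actual - predicted) / np.linalg.norm(actual)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(4096)            # white-noise stand-in for speech
    channel = np.exp(-np.arange(32) / 8.0)        # assumed 32-tap decaying channel
    for n_fft in (64, 256, 2048):
        err = frame_mismatch(source, channel, start=512, n_fft=n_fft)
        print(f"window {n_fft:4d}: relative mismatch {err:.3f}")
```

The mismatch comes from the tail of the previous frame, which the frame-wise product ignores, and from the circular wrap-around that the product adds. Both involve roughly one channel length of samples, so the relative error shrinks as the window grows but does not vanish for a multi-tap channel.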

FDABB and SPFDBB attempt to simultaneously suppress the noise signals and recover the channel dynamics. However, according to Eq. (5-1), finite-length filter coefficient vectors are not sufficient to perform perfect equalization in general environments, which leads to a modeling error. To reduce the effect of the modeling error, this dissertation further studies the robustness of the H∞ adaptation criterion and applies it to the proposed FDABB and SPFDBB.

To overcome the non-line-of-sight problem in the sound source localization field, this dissertation proposes an approach that utilizes a GMM to model the distributions of the phase differences among the microphones caused by the complex characteristics of room acoustics and microphone mismatch. According to the experimental results in Chapter 6, the scheme performs well not only in non-line-of-sight cases, but also when the speakers are aligned toward the microphone array but at different distances from it.

However, an unmodeled speech signal, one that is not emitted from any of the modeled locations, degrades the detection performance. Therefore, this dissertation further proposes a multiple speakers’ location detection approach that provides accurate localization of multiple speakers and robustness to unmodeled sound source locations.

7.2. Future Research

To improve the current speech enhancement system, this dissertation proposes two

with the speech recognizer, and the second one is to combine the multiple speakers’ location detection approach with the proposed SPFDBB or FDABB.

Currently, the proposed reference-signal-based frequency-domain beamformers operate in two independent phases: speech purification and then recognition, as shown in Figs. 1-5, 3-1, and 4-2. The proposed beamformers, which are designed to reduce speech distortion and suppress noise, are independent of the recognition system and assume that improving the quality of the speech waveform will result in better recognition performance. Although the proposed beamformers overcome many practical issues, they still cannot compete with a close-talking microphone in terms of ASR rates. Generally, a speech recognizer is a statistical pattern classifier that operates on a sequence of features derived from the waveform. To increase the recognition accuracy in distant-talking environments, the architecture that connects the proposed beamformers to the speech recognizer, as shown in Fig. 7-1, is worth further study in the future. This architecture enables the beamformer to use the data transmitted from the recognizer and ensures that the beamformer enhances those signal components that are important for ASR. In other words, it prevents the designed filters from placing undue emphasis on unimportant components.

For example, Seltzer et al. [124-125] proposed a likelihood-maximizing beamformer (LIMABEAM) that integrates the speech recognition system into the filter design process. They proved that incorporating the statistical models of the recognizer into the array processing stage can improve the ASR rates. The goal of the LIMABEAM is not
