
The first term of Eq. (5-14) represents the negative maximum probability variation of the trained model when the length of the training phase difference sequence is Q. As the value of this term increases, the corresponding selection of Q yields a more robust result under the trained GM location model. The second term of Eq. (5-14) is the sum of the probability differences between the location l and the other locations; a larger value means that the corresponding selection of Q gives a higher discrimination level between the location l and the other trained GM locations. Finally, a high discrimination level between the location l and other unmodeled locations can be achieved if the third term of Eq. (5-14) is large. Figure 5-6 shows the GM location model training procedure with the total location number L.


Figure 5-6 Location model training procedure with testing sequence length and thresholds estimation

5.6 Multiple Speakers’ Locations Detection Criterion

The location is detected as

[Detection rule: equation not recoverable from the extracted text; the surviving symbols are X and L.]

If the probability densities at all locations are equally likely, then $p(\lambda(\omega, b, l))$ could be chosen as $1/L$. The probability [symbol not recoverable; a term in $(\omega, b, l)$] is the same for all location models, and the detection rule can then be rewritten as

[Simplified detection rule: equation not recoverable from the extracted text.]

If the value of [expression not recoverable] [comparison not recoverable] the corresponding threshold, then the frames may contain speech components that come simultaneously from multiple modeled locations or from unmodeled locations.

5.7 Summary

This chapter proposes reference-signal-based methods for detecting a single speaker’s location and multiple speakers’ locations. The GM location models are constructed from the location-dependent features, namely the phase differences. The proposed methods can

scattering, and coherence problems. The proposed methods are found to work even under non-line-of-sight conditions and when speakers are in the same direction but at different distances from the microphone array.

Additionally, the proposed threshold adaptation approach computes a suitable length of testing sequence and a threshold for each modeled location. Experimental results in Chapter 6 show that the speaker’s location detection approach with these two adapted parameters performs well on detecting multiple speakers’ locations and reducing the average error rates caused by the unmodeled locations at various SNRs.

Chapter 6 Experimental Results

This chapter provides simulation and practical environmental results to assess the capability of the reference-signal-based adaptive beamformers and the speaker’s location detection approaches proposed in this dissertation. In these experiments, the sampling frequency is set to 8 kHz and the amplified microphone signals are digitized by 16-bit A/D converters. The processed frame window for the STFT contains 256 zero-padding samples and 32 ms of speech (256 samples at 8 kHz), totaling 512 samples. Figure 6-1 illustrates the processed frame window and the overlapping condition. The pre-recorded speech signals are acquired by placing a loudspeaker at the speech location, and the reference signal is obtained from the original speech source emitted from the loudspeaker.

Figure 6-1 Processed frame window and overlapping condition
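The frame construction described above can be sketched as follows. This is only an illustration under stated assumptions: the 80-sample shift and the Hamming window are taken from Table 6-4, the window is assumed to be applied to the 256 speech samples before zero padding, and the zeros are assumed to be appended after the speech samples. The function name `stft_frames` is hypothetical and does not come from the dissertation.

```python
import numpy as np

FS = 8000          # sampling frequency in Hz (from the text)
SPEECH_LEN = 256   # 32 ms of speech at 8 kHz
FFT_LEN = 512      # 256 speech samples + 256 zero-padding samples
SHIFT = 80         # frame shift in samples (assumed, from Table 6-4)

def stft_frames(signal, speech_len=SPEECH_LEN, fft_len=FFT_LEN, shift=SHIFT):
    """Return the one-sided STFT of `signal`, one row per frame."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < speech_len:
        return np.empty((0, fft_len // 2 + 1), dtype=complex)
    window = np.hamming(speech_len)
    n_frames = 1 + (len(signal) - speech_len) // shift
    frames = np.zeros((n_frames, fft_len))  # trailing half stays zero (padding)
    for i in range(n_frames):
        start = i * shift
        # Window the speech samples; placement before the zeros is an assumption.
        frames[i, :speech_len] = signal[start:start + speech_len] * window
    return np.fft.rfft(frames, axis=1)

# One second of synthetic noise stands in for a microphone signal.
x = np.random.default_rng(0).standard_normal(FS)
spec = stft_frames(x)
print(spec.shape)
```

With a 512-point transform, each frame yields 257 one-sided frequency bins spaced 8000/512 = 15.625 Hz apart.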

The remainder of this chapter is organized as follows. Section 6.1 utilizes the ASR rates and the two frequency-domain performance indexes introduced in Chapter 3 to show the advantages of the proposed reference-signal-based frequency-domain beamformers, SPFDBB and FDABB. Section 6.2 compares the robustness of the NLMS and H∞ adaptation criteria in the time domain through simulation results.

Section 6.2 also utilizes the ASR rates in both vehicular and indoor environments to demonstrate the advantages of the H∞ adaptation criterion. Section 6.3 discusses the experimental results on location detection performance in the single-speaker and multiple-speaker cases, as well as the cases of radio broadcasting and speech from unmodeled locations. Finally, conclusions are drawn in Section 6.4.

6.1 Adaptive Beamformers Using NLMS Adaptation Criterion

6.1.1 Simulation Results

In this simulation, a speech source and two noise sources (a white noise and a music signal) are considered, and the linear array contains six microphones. The speech source comes from 0° relative to the linear array, and the white noise and the music signal come from -30° and 60°, respectively. Figure 6-2 illustrates the arrangement of the microphone array and the sources. The value of γ is 10⁻⁶ and a step size λ of 0.4 is selected. Two simulations are shown: the first compares the performance among different parameters, and the second observes the adaptation performance of FDABB under a sudden change of the noise channel, in which the white noise moves from -30° to -60°. Moreover, the two simulations are executed in three environments specified by different channel response durations: 1024, 2048, and 3072 taps.

Figure 6-2 Arrangement of microphone array, noises and speech source in simulation experiments

In the first simulation, the locations of the speech source and the noises are fixed over the entire training data length. The soft penalty parameter μ has three options: 0, 2, and 4. The frame number L in a block varies from 10 to 20 and from 20 to 30, corresponding to different soft penalty parameters and channel response durations. Two frequency-domain performance indexes, NSR and SDR, at the most significant frequency, 410 Hz, are shown in Tables 6-1, 6-2, and 6-3. The values shown in these tables are computed by averaging the last 120 frames. The notation ADL indicates that the value of L is adjusted according to the CBVI with a lower threshold of 0.02, an upper threshold of 1.2, and an initial frame number of 10. In other words, if the CBVI is smaller than 0.02, the value of L is increased; conversely, if the CBVI is larger than 1.2, the value of L is reset to the initial frame number. Additionally, Table 6-4 summarizes the related parameters of FDABB. Figure 6-3 depicts the NSR and the SDR from C6 to C9 with channel response duration 1024 shown in Table 6-2.
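The ADL rule described above can be sketched in a few lines. This is a minimal illustration, not the FDABB implementation: the increment of 10 is taken from Table 6-4, L is assumed to stay unchanged when the CBVI lies between the two thresholds, and no upper limit on L is imposed because the text does not state one. The function names are hypothetical.

```python
def adjust_block_length(cbvi, current_l, initial_l=10, increment=10,
                        lower=0.02, upper=1.2):
    """Return the next frame number L for one CBVI observation."""
    if cbvi < lower:
        # CBVI below the lower threshold: use a longer block.
        return current_l + increment
    if cbvi > upper:
        # CBVI above the upper threshold: reset to the initial frame number.
        return initial_l
    # Between the thresholds: keep L unchanged (assumption).
    return current_l

def track_block_length(cbvi_sequence, initial_l=10):
    """Apply the rule to a sequence of CBVI values and record L after each one."""
    current_l = initial_l
    history = []
    for cbvi in cbvi_sequence:
        current_l = adjust_block_length(cbvi, current_l, initial_l=initial_l)
        history.append(current_l)
    return history

# Synthetic CBVI values chosen only to exercise all three branches.
print(track_block_length([0.5, 0.01, 0.01, 0.3, 1.5]))
```

In this synthetic run, L grows from 10 to 30 while the CBVI stays below 0.02 and returns to 10 once the CBVI exceeds 1.2, which mirrors the reset behaviour described for a sudden change of the noise channel.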

Figure 6-3 shows that the measurement index in the condition with L=1 varies more heavily than those in the conditions with L=10, L=20, and L=30; that is, the performance of the NSR and the SDR cannot be guaranteed even when the algorithm is

channel response duration grows, but the proposed beamformer with a larger value of L would have smaller performance decay and have better convergence performance.

The SDRs of SPFDBB with L=10, L=20, and L=30 in the condition of μ=2 have decreased by about 1.84 dB to 4.52 dB as compared with those in the condition of μ=0. Although the NSR increases at the same time as the SDR falls, the SDR decrease is more important for ASR applications when the NSR is very low, especially when a larger value of L is chosen.

Table 6-1 The First Simulation Experiment: Soft Penalty Parameter is 0

| Condition | L | NSR (dB), 1024 taps | SDR (dB), 1024 taps | NSR (dB), 2048 taps | SDR (dB), 2048 taps | NSR (dB), 3072 taps | SDR (dB), 3072 taps |
|---|---|---|---|---|---|---|---|
| C1 | 1 | -73.88 | -46.67 | -60.92 | -42.97 | -57.55 | -16.97 |
| C2 | 10 | -102.82 | -47.98 | -91.53 | -46.33 | -90.60 | -45.37 |
| C3 | 20 | -111.00 | -48.85 | -98.45 | -47.28 | -97.12 | -46.08 |
| C4 | 30 | -122.92 | -50.39 | -112.57 | -49.60 | -105.32 | -48.69 |
| C5 | ADL | -122.40 | -50.10 | -113.42 | -49.87 | -106.32 | -48.50 |

Table 6-2 The First Simulation Experiment: Soft Penalty Parameter is 2

| Condition | L | NSR (dB), 1024 taps | SDR (dB), 1024 taps | NSR (dB), 2048 taps | SDR (dB), 2048 taps | NSR (dB), 3072 taps | SDR (dB), 3072 taps |
|---|---|---|---|---|---|---|---|
| C6 | 1 | -65.03 | -44.20 | -45.77 | -42.35 | -46.96 | -14.50 |
| C7 | 10 | -97.97 | -49.82 | -89.67 | -49.80 | -92.87 | -47.59 |
| C8 | 20 | -110.32 | -51.16 | -100.32 | -50.62 | -96.25 | -48.96 |
| C9 | 30 | -120.92 | -52.80 | -109.27 | -52.24 | -105.24 | -52.13 |
| C10 | ADL | -125.34 | -52.31 | -110.27 | -52.21 | -105.07 | -52.11 |

Table 6-3 The First Simulation Experiment: Soft Penalty Parameter is 4

| Condition | L | NSR (dB), 1024 taps | SDR (dB), 1024 taps | NSR (dB), 2048 taps | SDR (dB), 2048 taps | NSR (dB), 3072 taps | SDR (dB), 3072 taps |
|---|---|---|---|---|---|---|---|
| C11 | 1 | -46.08 | -29.40 | -42.91 | -27.39 | -37.27 | 3.06 |
| C12 | 10 | -93.07 | -50.06 | -85.35 | -49.85 | -91.03 | -48.08 |
| C13 | 20 | -109.65 | -52.54 | -95.39 | -52.01 | -96.00 | -50.37 |
| C14 | 30 | -120.56 | -54.02 | -102.04 | -53.87 | -105.18 | -53.21 |
| C15 | ADL | -121.71 | -53.97 | -102.62 | -53.92 | -105.59 | -53.59 |

Table 6-4 Parameters of the FDABB

| Parameter | Value |
|---|---|
| Length of STFT | 512 samples |
| Length of input data in a frame | 256 samples |
| Shift of STFT | 80 samples |
| Window function | Hamming |
| Initial block value | 10 |
| Block value increment | 10 |
| Thresholds of CBVI | 0.02 and 1.2 |

(a) NSR

(b) SDR

Figure 6-3 NSR and SDR from C6 to C9 with channel response duration 1024. The dash-dot line represents C6 (L=1), the dotted line represents C7 (L=10), the solid line represents C8 (L=20), and the dashed line represents C9 (L=30)

In the first simulation, FDABB adjusts the frame number twice from 10 to 30; first

Figure 6-5 shows the NSR and the SDR from C7 to C10 with channel response duration 1024 shown in Table 6-2. Since the initial frame number of FDABB is 10, the SDR and the NSR of FDABB are equivalent to those of C7 (L=10, the dotted line) in the first 261 samples. Obviously, the FDABB could not only perform well in a shorter adaptation process but could also obtain a good convergence result. Since SPFDBB adopts the soft penalty, it emphasizes SDR improvement over NSR improvement. Consequently, the SDR of L=30 is better than the SDR of L=10 after frame 300, and the convergence period of the SDR is shorter than that of the NSR.

Figure 6-4 CBVI in the first simulation experiment

(a) NSR

(b) SDR

Figure 6-5 NSR and SDR from C7 to C10 with channel response duration 1024. The dash-dot line represents C10 (ADL), the dotted line represents C7 (L=10), the solid line represents C8 (L=20), and the dashed line represents C9 (L=30)

In the second simulation, the location of the white noise changes from -30° to -60° during the training data sequence. As shown in Figs. 6-6 and 6-7, the CBVI and the NSR both exhibit a large jump at frame 601 in response to the noise channel variation. Since the impulse response of the speech source is fixed, the SDR varies only slightly. After this sudden change is detected, FDABB resets the value of L to the initial frame number to perform advanced adaptation of the noise channel, and then changes the frame number at frame 771 and frame 851 to maintain convergence.

(a) NSR

(b) SDR

Figure 6-7 NSR and SDR in the second simulation experiment. The dash-dot line represents C10 (ADL), the dotted line represents C7 (L=10), the solid line represents C8 (L=20), and the dashed line represents C9 (L=30)

Table 6-5 shows the ratios of the number of real multiplications required by FDABB and SPFDBB to that required by the reference-signal-based time-domain adaptive beamformer. As these data indicate, a significant saving of computing power can be achieved.

Table 6-5 Real Multiplication Requirement Ratio

| Multiplication requirement ratio | Adaptation phase | Lower beamformer phase |
|---|---|---|
| FDABB with μ=2 in the first simulation case | 1 : 8.57 | 1 : 20.72 |
| FDABB with μ=2 in the second simulation case | 1 : 8.48 | 1 : 20.72 |
| SPFDBB with L=10 | 1 : 10.69 | 1 : 20.72 |
| SPFDBB with L=20 | 1 : 11.04 | 1 : 20.72 |
| SPFDBB with L=30 | 1 : 11.17 | 1 : 20.72 |

6.1.2 Indoor Environment

A uniform linear array of six microphones spaced 0.07 m apart is constructed for this experiment. The array is mounted on an easel that is one meter high and two meters from the nearest wall. The environment is a 20 m x 15 m x 4 m room full of office furniture to simulate a practical environment, and its reverberation time at 1000 Hz is around 0.52 second. The interference signals in this experiment are mutually uncorrelated white noises. The speech signal comes from 0° or 30° at a distance of 1.5 m, and the configuration is shown in Fig. 6-8.

Figure 6-8 Arrangement of microphone array, noises and speech source in a noisy environment

This experiment utilizes the ASR rates to measure the performances of FDABB, SPFDBB, the reference-signal-based time-domain adaptive beamformer, DS beamformer, GSC, robust adaptive beamformer, and minimum variance (MV) beamformer under a fixed speech source and different numbers of interference signals. To measure the ASR rate, 500 pairs of vehicle identification numbers pronounced in Chinese are used. The HTK software package [120] is adopted as the speech recognizer.

penalty is set at a constant value of 2. The value of γ is 10⁻⁶. Table 6-6 shows the ASR system parameters. The number of filter taps of the reference-signal-based time-domain adaptive beamformer is chosen as 2560.

Table 6-6 Parameters of the ASR

| Parameter | Value |
|---|---|
| Recognition kernel | HTK ver. 3.0 |
| Model | HMM |
| Feature vector | 12th-order MFCC + 12th-order ΔMFCC |
| Training data set | 1001 clean pairs of the vehicle identification numbers |
| Recognition task | 500 pairs of the vehicle identification numbers |

Figures 6-9 and 6-10 present the ASR rates for two different speech source locations, and the notations used in the two figures are listed in Table 6-7. Figure 6-9 shows that the ASR rate decreases as the number of interference sources increases (see Table 6-7). Because the DS beamformer, GSC, robust adaptive beamformer, and MV beamformer do not take the calibration problem into consideration, their improved ASR rates for the speech source at 30° are lower than those at 0°. For example, these traditional beamformers perform better in C1-C3 than in C4-C6. On the other hand, the proposed methods shown in Fig. 6-10, namely SPFDBB with L=20 and L=30, FDABB with μ=2, and the reference-signal-based time-domain adaptive beamformer, can overcome this effect. For example, the improved ASR rate of SPFDBB with L=20 in C4, 32.23%, is better than that in C1, 29.57%. The experimental results in Figs. 6-9 and 6-10 also show that the value of L could affect the recognition rate. In this experiment, the FDABB with μ=2 has the best performance in all conditions, and the SPFDBB with L=30 performs better than the reference-signal-based time-domain adaptive beamformer.

Figure 6-9 ASR rates of different kinds of beamformer outputs versus different experiment conditions

Figure 6-10 ASR rates of SPFDBB and FDABB versus different experiment conditions

Table 6-7 Meaning of Notations in Figs 6-9 and 6-10

C1 Speech source in 0° and noise source in -60°

C2 Speech source in 0° and noise sources in 60° and -30°

C3 Speech source in 0° and noise sources in 60°, -30°, and -60°

C4 Speech source in 30° and noise source in 60°

C5 Speech source in 30° and noise sources in 60° and -30°

6.1.3 Vehicular Environment

The experiment is performed in the passenger seat of a minivan instead of the driver’s seat for driving safety. A uniform linear array of six uncalibrated microphones with 0.05 m spacing is mounted in front of the passenger seat. The distance between the microphone array and the speaker in the passenger seat is about 0.62 m. During the experiment, all windows are closed to prevent the microphones from saturating, and the cabin temperature is set to 24ºC using the in-car air conditioner. Off-the-shelf, low-cost, non-calibrated microphones are used for the array. The performance of the proposed approaches is evaluated by ASR rates, with the parameters shown in Table 6-6, under ten conditions (C1-C10 of Table 6-8). Table 6-8 shows the average SNRs in the ten conditions. A music piece containing vocals is played repeatedly from six built-in loudspeakers when the in-car audio system is turned on. This experiment utilizes the ASR rates shown in Fig. 6-11 to measure the performances of FDABB, SPFDBB, and the reference-signal-based time-domain adaptive beamformer. In the vehicular environment, the FDABB does not always outperform the SPFDBB with L=30. However, the FDABB with the soft penalty parameter of 2 and the SPFDBB with L=30 outperform the reference-signal-based time-domain adaptive beamformer.

Table 6-8 Ten Experimental Conditions and Isolated Average SNRs

| Condition number | Speed | Power of in-car audio system | Average SNR (dB) |
|---|---|---|---|
| C1 | 20 km/h | Off | 4.20 |
| C2 | 40 km/h | Off | 2.84 |
| C3 | 60 km/h | Off | 2.72 |
| C4 | 80 km/h | Off | -1.90 |
| C5 | 100 km/h | Off | -3.04 |
| C6 | 20 km/h | On | -0.08 |
| C7 | 40 km/h | On | -2.19 |
| C8 | 60 km/h | On | -2.28 |
| C9 | 80 km/h | On | -4.75 |
| C10 | 100 km/h | On | -5.40 |

Figure 6-11 ASR rates of reference-signal-based time-domain beamformer, SPFDBB and FDABB

6.2 Comparison of NLMS and H∞ Adaptation Criteria

6.2.1 Simulation Results

6.2.1.1 Single-channel Case

The four time-domain performance indexes, namely the filtered output error e_f(n) in Eq. (4-31), the reference signal estimation error e_r(n) in Eq. (4-32), the filter coefficient estimation error ratio in Eq. (4-33), and the filtered output error ratio in Eq. (4-34), are utilized in this case. The simulation signal is constructed through the linear model shown in Eq. (4-2). To reflect the modeling error, the unknown disturbance e(n) is constructed as:

$$e(n) = \rho\left[\left(\hat{x}^{T}(n)\,q\right)^{3} + v(n)\right] \qquad (6\text{-}1)$$

where {v(n)} is a white noise sequence and ρ is a scalar that can produce various SNR simulation cases. Therefore, the linear model can be rewritten as,

$$r(n) = \hat{x}^{T}(n)\,q + \rho\left[\left(\hat{x}^{T}(n)\,q\right)^{3} + v(n)\right] \qquad (6\text{-}2)$$

The tap number of the single channel, P, is chosen to be 10 and 20, μ₀ is set to 1, δ is set to 0.001, and there are a total of seven SNRs, denoted C1 to C7, as shown in Table 6-9. Figure 6-12 illustrates the filtered output error ratio obtained with the NLMS adaptation criterion at C7. The filtered output error ratio of the H∞ adaptation criterion at C7 is presented in Fig. 6-13. Obviously, the filtered output error ratio derived via the NLMS adaptation criterion never exceeds one, which agrees with the fact that the NLMS adaptation criterion guarantees that the energy of the filtered output error never exceeds the energy of the disturbance. Furthermore, the filtered output error, coefficient vector estimation error, and reference signal estimation error versus the seven conditions, shown in Figs. 6-14, 6-15, and 6-16, are derived by averaging the last one thousand runs of the total 30000 adaptation runs. Although the NLMS adaptation criterion is robust to the disturbance, the filtered output error and coefficient vector estimation error of the H∞ adaptation criterion still outperform those of the NLMS adaptation criterion in this simulation. Notably, because the basic concept of both the NLMS and H∞ adaptation criteria is to minimize the energy of the reference signal estimation error e_r(n), the reference signal estimation errors of the two adaptation criteria are similar.
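The single-channel simulation can be illustrated with a short sketch. It is only a generic regularized NLMS loop, not the dissertation's code, and it rests on several assumptions: the disturbance in Eq. (6-1) is read as ρ[(x̂ᵀ(n)q)³ + v(n)], μ₀ and δ are taken to be the NLMS step size and regularization constant, the input is white Gaussian noise, and the true coefficient vector q is drawn at random. The function name `simulate_nlms` and the value ρ = 0.1 are hypothetical.

```python
import numpy as np

def simulate_nlms(P=10, rho=0.1, mu0=1.0, delta=1e-3, n_runs=30000, seed=0):
    """Run NLMS on r(n) = x^T q + rho * ((x^T q)^3 + v(n)); return q, q_hat, errors."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(P) / np.sqrt(P)   # hypothetical true coefficients
    s = rng.standard_normal(n_runs + P - 1)   # hypothetical white input signal
    q_hat = np.zeros(P)
    coef_err = np.empty(n_runs)
    for n in range(n_runs):
        x = s[n:n + P][::-1]                  # tap-delay regressor of length P
        clean = x @ q
        # Reference signal with cubic modeling error plus white noise (assumed form).
        r = clean + rho * (clean ** 3 + rng.standard_normal())
        e_r = r - x @ q_hat                   # reference signal estimation error
        q_hat = q_hat + mu0 * x * e_r / (delta + x @ x)
        coef_err[n] = np.sum((q - q_hat) ** 2)
    return q, q_hat, coef_err

q, q_hat, coef_err = simulate_nlms()
# Average the last one thousand runs, as in the text, and report in dB.
print(10 * np.log10(np.mean(coef_err[-1000:])))
```

Setting `rho=0.0` removes the disturbance, in which case the estimate converges to q; a larger ρ lowers the SNR in the way the seven conditions of Table 6-9 do.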

Table 6-9 Seven Kinds of SNRs

| Condition | C1 | C2 | C3 | C4 | C5 | C6 | C7 |
|---|---|---|---|---|---|---|---|
| Average SNR (dB) | 5.53 | 1.99 | -0.57 | -2.47 | -4.09 | -5.42 | -6.56 |

Figure 6-12 Filtered output error ratios of NLMS adaptation criterion with the tap number of 10 and 20

Figure 6-13 Filtered output error ratios of H∞ adaptation criterion with the tap number of 10 and 20

Figure 6-15 Coefficient vector estimation error versus seven conditions

Figure 6-16 Reference signal estimation error versus seven conditions

Although the H∞ adaptation criterion can guarantee that Eq. (4-6) holds when the Riccati recursion $P^{-1}(n) + \hat{x}(n)\hat{x}^{T}(n) - \gamma_q^{-2} I$ is larger than 0, how to find the most suitable value of $\gamma_q$ is still an issue. According to Eq. (4-6), a smaller value of $\gamma_q^{2}$ can guarantee a smaller upper bound of Eq. (4-6) under the same disturbances. It means that a smaller value of $\gamma_q^{2}$ limits the maximum filter coefficient estimation error ratio to a smaller value. However, a smaller value of $\gamma_q^{2}$ does not always achieve a better performance. In this simulation, three different values of $\gamma_q$ are compared at C2, with the SNR of 1.99 dB, to show that the selection of $\gamma_q$ could affect the estimation performance. P is set to 10. The values of $\gamma_q^{-2}$ in the three cases are selected as $\mathrm{eig}(P^{-1}(n) + \hat{x}(n)\hat{x}^{T}(n)) \times 10^{-3}$, $\mathrm{eig}(P^{-1}(n) + \hat{x}(n)\hat{x}^{T}(n)) \times 10^{-4}$, and $\mathrm{eig}(P^{-1}(n) + \hat{x}(n)\hat{x}^{T}(n))$ times a power of ten whose exponent is not recoverable from the extracted text, respectively. Notably, the three values of $\gamma_q$ satisfy $P^{-1}(n) + \hat{x}(n)\hat{x}^{T}(n) - \gamma_q^{-2} I > 0$, so the upper bound of Eq. (4-6) exists. Table 6-10 lists the corresponding minimum values of $\gamma_q^{2}$, filter output errors, and coefficient estimation errors in the three cases, obtained by averaging the last one thousand runs. Figure 6-17 illustrates the filter coefficient estimation error ratios in the three cases. Clearly, the filter coefficient estimation error ratios in the three cases do not exceed the minimum value of $\gamma_q^{2}$. Although case one has the smallest filter coefficient estimation error ratio and the fastest convergence rate, it does not converge to a smaller coefficient estimation error than case two.
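The selection of γ_q and the positivity check can be sketched numerically. This is a hedged illustration rather than the dissertation's procedure: it assumes that eig(·) refers to the smallest eigenvalue of P⁻¹(n) + x̂(n)x̂ᵀ(n), that P⁻¹(n) is symmetric, and that "larger than 0" means positive definite. The function name `choose_gamma_q` and the test matrices are hypothetical.

```python
import numpy as np

def choose_gamma_q(P_inv, x_hat, scale=1e-3):
    """Pick gamma_q^{-2} = scale * min eig(P_inv + x x^T); return (gamma_q^2, is_pd)."""
    M = P_inv + np.outer(x_hat, x_hat)
    lam_min = np.linalg.eigvalsh(M).min()   # smallest eigenvalue (assumed meaning of eig)
    gamma_inv_sq = scale * lam_min
    # Positivity condition: P^{-1} + x x^T - gamma_q^{-2} I must be positive definite.
    residual = M - gamma_inv_sq * np.eye(M.shape[0])
    is_pd = bool(np.linalg.eigvalsh(residual).min() > 0)
    return 1.0 / gamma_inv_sq, is_pd

# Toy example: P^{-1} = 2I and x_hat = [1, 1, 1], so the smallest eigenvalue is 2.
gamma_sq, ok = choose_gamma_q(2.0 * np.eye(3), np.ones(3), scale=1e-3)
print(gamma_sq, ok)
```

Under these assumptions, any scale below one keeps the condition satisfied, and a smaller scale gives a larger γ_q², which is consistent with the ordering of the minimum γ_q² values reported for the three cases.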

Table 6-10 Experimental Results in Three Different Selection Cases of γ_q²

| | Case One | Case Two | Case Three |
|---|---|---|---|
| Minimum value of γ_q² | 629.08 | 2837.90 | 14220 |
| Filtered output error (dB) | -53.48 | -55.31 | -51.78 |
| Coefficient estimation error (dB) | -12.27 | -16.98 | -11.37 |

6.2.1.2 Multiple-channel Case

To simulate a noisy environment, a speech source and a white noise are each passed through individual channels to the six microphones, and the channel response duration is set to 30 taps. The tap number of the estimated coefficient vector is chosen as 10, 20, or 30 taps, which simulates the conditions under which the filter order is lower than or equal to the channel response duration. The time-domain SDR and NSR defined in Eqs. (4-29) and (4-30) are adopted as performance measurements in this section. Table 6-11 lists the values of the SDR and the NSR versus the number of taps. Notably, the total number of adaptation runs is 30000, and the values of the time-domain SDR and NSR are derived by averaging the last one thousand runs.

Clearly, the H∞ adaptation criterion outperforms the NLMS adaptation criterion. In Table 6-11, the SDR and NSR performance degrades as the filter order decreases, especially for the H∞ adaptation criterion. Although increasing the filter order enlarges the degrees of freedom, the improvement provided by the NLMS adaptation criterion is insignificant, since the approach continues to suffer from the modeling error problem. In contrast, the H∞ adaptation criterion is robust to modeling error, and thus can use the same increase in degrees of freedom to provide larger performance improvements.

Table 6-11 SDR and NSR at the SNR of -5.16 dB

| Adaptation criterion | SDR (dB), P = 10 | NSR (dB), P = 10 | SDR (dB), P = 20 | NSR (dB), P = 20 | SDR (dB), P = 30 | NSR (dB), P = 30 |
|---|---|---|---|---|---|---|
| NLMS | -52.91 | -58.79 | -54.50 | -60.21 | -55.49 | -61.21 |
| H∞ | -67.50 | -85.62 | -75.90 | -116.51 | -79.09 | -119.73 |
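The effect of the filter order can be read directly from Table 6-11 with a few subtractions. The sketch below only re-computes differences of the tabulated values, assuming, as the surrounding discussion implies, that more negative SDR and NSR values indicate better performance. The dictionary layout and the function name `gain_db` are hypothetical.

```python
# Values copied from Table 6-11, ordered as P = 10, 20, 30 (all in dB).
table_6_11 = {
    "NLMS": {"SDR": [-52.91, -54.50, -55.49], "NSR": [-58.79, -60.21, -61.21]},
    "H-infinity": {"SDR": [-67.50, -75.90, -79.09], "NSR": [-85.62, -116.51, -119.73]},
}

def gain_db(values):
    """Improvement in dB when going from P = 10 to P = 30 (positive means better)."""
    return round(values[0] - values[-1], 2)

for criterion, metrics in table_6_11.items():
    print(criterion, {name: gain_db(vals) for name, vals in metrics.items()})
```

Going from P = 10 to P = 30 improves the NLMS results by 2.58 dB (SDR) and 2.42 dB (NSR), whereas the H∞ results improve by 11.59 dB (SDR) and 34.11 dB (NSR).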

6.2.2 Indoor Environment

The indoor environment is arranged as in Fig. 6-8, and the ASR parameters are shown in Table 6-6. The FDABB parameters are the same as those in Table 6-4 and the soft penalty is 2. Figure 6-18 presents the ASR rates of SPFDBB and FDABB using NLMS
