
Chapter 4 H∞ Adaptation Criterion

4.4. Time-domain Performance Indexes

In this section, six time-domain performance indexes are defined. The first two, SDR and NSR, are used instead of SNR to evaluate the performance of the NLMS and H∞ adaptation criteria, because a higher SNR does not necessarily correspond to a higher ASR rate, and these two indexes directly separate the two main issues: the inverse issue and the noise suppression issue.


Both indexes are computed over signals of length V. SDR represents the degree of source distortion caused by the channel effect and noises, and NSR represents the degree of noise reduction.
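As a rough illustration, the sketch below computes the two indexes under standard power-ratio definitions of distortion and noise suppression; the function names, and the assumption that the signals are time-aligned NumPy arrays of length V, are illustrative rather than taken from the dissertation.

```python
import numpy as np

def sdr_db(clean: np.ndarray, processed: np.ndarray) -> float:
    """Source-to-distortion ratio over V samples; lower distortion -> higher SDR."""
    distortion = processed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(distortion ** 2))

def nsr_db(noise_in: np.ndarray, noise_out: np.ndarray) -> float:
    """Noise-suppression ratio: input noise power vs. residual noise power."""
    return 10.0 * np.log10(np.sum(noise_in ** 2) / np.sum(noise_out ** 2))
```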

To observe the different characteristics of the NLMS and H∞ adaptation criteria, four performance indexes, named the filtered output error e_f(n), the reference signal estimation error e_r(n), the filter coefficient estimation error ratio, and the filtered output error ratio, are defined in Eqs. (4-31), (4-32), (4-33), and (4-34), respectively.

where q̃(n) is the coefficient vector estimation error defined in Eq. (4-3).

4.5. Summary

In this chapter, the H∞ adaptation criterion is investigated to enhance robustness to the modeling error caused by a window size that is inadequate to capture the acoustic channel dynamics. This chapter utilizes the H∞ adaptation criterion to replace the NLMS in the reference-signal-based time-domain adaptive beamformer. However, owing to the intensive computational effort required in the time domain, SPFDBB and FDABB using the H∞ adaptation criterion are proposed to significantly reduce the computational effort, as analyzed in Section 4.3.3. As shown in Chapter 6, the H∞ adaptation criterion outperforms the NLMS with the same filter order in terms of SDR, NSR, and ASR results.

Chapter 5

Reference-signal-based Speaker’s Location Detection

5.1 Introduction

A speech enhancement system using a microphone array usually requires the capability of estimating the speaker's location. With the information of the desired speaker's location, the speech enhancement system can suppress interference signals and noises from other locations. For example, in vehicle applications, a driver may wish to exert particular authority in manipulating the in-car electronic systems through spoken language. Consequently, if the driver's location is known, a better receiving beam can be formed with the microphone array to suppress the environmental noises and enhance the driver's speech signal.

In a highly reflective or scattering environment, conventional delay estimation methods such as GCC-based (TDOA-based) algorithms [21-23] or previous works [24-25] do not yield satisfactory results. Although Brandstein et al. [101] proposed Tukey's biweight to redefine the weighting function and deal with the reflection effect, the approach is not suitable for a noisy environment. To overcome this limitation, Nikias et al. [102] adopted the alpha-stable distribution, instead of a single Gaussian model, to model the ambient noise and obtain robust speaker's location detection. In recent years, several works have introduced probability-based methods to eliminate the measurement errors caused by uncertainties, such as those associated with reverberation or low-energy segments. Histogram-based TDOA estimators, such as time histograms [20] and weighted time-frequency histograms [26-27], have been proposed to reduce direction-of-arrival root-mean-square errors; the algorithm in [27] performs especially well under low-SNR conditions. Moreover, Potamitis et al. [103] proposed the probabilistic data association (PDA) technique with the interacting multiple model (IMM) estimator to conquer these measurement errors. Ward et al. [7] developed a particle filter beamformer (a steered-beamformer-based location approach) in which the weights and particles are updated using a likelihood function to solve the reverberation problem. Although these statistics-based methods [7], [20], [26-27], [103] can further improve the estimation accuracy, they cannot distinguish among locations using a single linear microphone array under a totally non-line-of-sight condition, which is common in vehicular environments.

Another approach (the spectral-estimation-based location approach), proposed by Balan et al. [8], explores the eigenstructure of the correlation matrix of the microphone array by separating the speech signals and noise signals into two orthogonal subspaces. The DOA is then estimated by projecting the steering vectors onto the noise subspace.

MUSIC [9-10] combined with spatial smoothing [11], [104] is one of the most popular methods for eliminating the coherence problem. However, as the experiments in Chapter 6 indicate, its robustness remains poor in a vehicular environment when the SNR is low. Furthermore, the near-field effect [105-107] should also be considered in real-environment applications.

In some environments, especially vehicular ones, the line-of-sight condition may not be available because, for example, barriers may exist between the speaker and the microphone array. Therefore, when a single linear array is employed, the aforementioned methods cannot distinguish speakers under non-line-of-sight conditions, and multiple microphone arrays must be considered [108-109]. Further, the microphone mismatch problem often arises when steered-beamformer-based, GCC-based, or spectral-estimation-based algorithms are used, since these methods require the microphones to be calibrated in advance. Several sound source localization works [110-111] also mention the importance of calibration and the influence of the microphone mismatch problem. However, accurate calibration is not easy to obtain, since the characteristics of the microphones vary with the sound source direction.

The relationship between a sound source and a receiver (microphone) in a complicated enclosure is almost impossible to characterize with finite-length data in real-time applications (such as frame-based calculations). According to the investigation of room acoustics [112], the number of eigen-frequencies below the upper frequency limit fs/2 can be obtained by the following equation:

N ≈ (4π/3) · B · (fs / 2ν)³   (5-1)

where fs denotes the sampling frequency, ν represents the sound velocity (ν ≈ 340 m/s), and B is the geometrical volume. This equation indicates that the number of poles is very large at high frequencies, and that the transient response occurs in almost any processing duration when the input signal is a speech signal. For example, the number of poles is about 96435 when the sampling frequency is 8 kHz and the volume is 14.1385 m³. Hence, the non-stationary characteristics of speech signals make the phase differences between the signals received by two elements of a single linear microphone array from a fixed sound source vary among data sets. Moreover, the stochastic nature of the phase difference is more prominent when the sound source moves slightly and environmental noises are present. Consequently, the method proposed in this dissertation neither explicitly utilizes the information of the direct path from the sound source to the microphones to detect the speaker's location, nor attempts to suppress the effects of reverberations, interference signals, and noises. Instead, it utilizes the sound field features obtained when the speaker is at different locations in an indoor environment. In other words, this dissertation proposes the use of the distributions of phase differences, rather than their actual values, to locate the speaker, because the phase difference distributions vary among locations and can be distinguished by pattern matching methods. Previous studies [113-114] also showed that common acoustic measures vary significantly with small spatial displacements of the sound source or the microphone.
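As a quick numeric check of Eq. (5-1), the following snippet reproduces the example above (fs = 8 kHz, B = 14.1385 m³), assuming ν = 340 m/s:

```python
import math

def eigenfrequency_count(fs_hz: float, volume_m3: float, c: float = 340.0) -> float:
    """Number of room eigen-frequencies below fs/2, per Eq. (5-1)."""
    return (4.0 * math.pi / 3.0) * volume_m3 * (fs_hz / (2.0 * c)) ** 3

print(round(eigenfrequency_count(8000.0, 14.1385)))  # -> 96435, matching the text
```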

The experimental results in Chapter 6 indicate that the GMM [115] is very suitable for modeling these distributions. Furthermore, the model training uses the distributions of phase differences among microphones as a location-dependent but content- and speaker-independent sound field feature. In this case, the geometry of the microphone array should be considered, to cope with the aliasing problem and to maximize the phase difference of each frequency band, so that the speaker's location can be detected accurately.

Consequently, the microphone array can be decoupled into several pairs with various spacings, and a location detector integrates the overall probability information from the different frequency bands to detect the speaker's location.

The remainder of this chapter is organized as follows. The next section introduces the overall system architecture and data flow. Section 5.3 presents the design of the location model and the model parameter estimation approach. Section 5.4 presents the proposed reference-signal-based single speaker's location detection criterion. Section 5.5 discusses an approach to find each location's testing sequence length and threshold.

Section 5.6 describes the proposed reference-signal-based multiple speakers’ locations detection criterion using the information of testing sequence length and threshold of each location. Conclusions are made in Section 5.7.

5.2 System Architecture

5.2.1 System Architecture

Figure 5-1 illustrates the overall system architecture. A voice activity detector (VAD) divides the system operation into two stages, the silent stage and the speech stage. Before the proposed system is trained online, a set of pre-recorded speech signals is acquired, as described in Chapter 2, to obtain a priori information between the speakers and the microphone array. The pre-recorded speech database can represent the acoustical characteristic of each location. After the pre-recorded speech signals are collected, the system switches automatically between the silent and speech stages according to the VAD result.

The first stage is called the silent stage, in which the speakers are silent. In this stage, environmental noises without speech are recorded online. The system combines the online-recorded environmental noises, N1(ω), …, NM(ω), with the pre-recorded speech database, S1(ω), …, SM(ω), to construct the training signals, X̂1(ω), …, X̂M(ω). After that, the GM location models are derived via the location model training procedure described in Section 5.3 or 5.5. Since the environmental noise changes over time, the GM location models, which contain the characteristics of the environmental noise, are updated in this stage to ensure detection accuracy and robustness. The second stage is the speech stage, in which the parameters of the GM location models derived from the first stage are duplicated into the location detector to detect the speaker's location.
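A minimal sketch of the silent-stage training-signal construction follows; the additive combination in the STFT domain and the array shapes are assumptions, since the combination rule is not spelled out here.

```python
import numpy as np

def build_training_signals(S: np.ndarray, N: np.ndarray) -> np.ndarray:
    """Combine pre-recorded speech spectra S_m(w) with online-recorded noise
    spectra N_m(w) into training signals X^_m(w).
    S, N: (M, num_frames, num_bins) complex STFT arrays."""
    assert S.shape == N.shape  # one spectrum per microphone, frame, and bin
    return S + N               # assumed additive noise model
```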


Figure 5-1 Proposed reference-signal-based speaker’s location detection system architecture

5.2.2 Frequency Band Divisions based on a Uniform Linear Microphone Array

The phase difference of the received signal becomes more significant as the distance between microphones increases. However, the aliasing problem occurs when this distance exceeds half the wavelength of the received signal. Therefore, the distance between pairs of microphones should be chosen based on the selected frequency band, to obtain clear phase difference data that enhance the accuracy of location detection while preventing aliasing.

Figure 5-2 illustrates a uniform microphone array with M microphones and spacing d. According to the geometry, the training frequency range is divided into (M − 1) bands, listed in Table 5-1, where m denotes the m-th microphone, b represents the band number, ν denotes the sound velocity, and Jb is the number of microphone pairs in band b. The phase differences measured by the microphone pairs at each frequency component ω (belonging to a specific band b) are utilized to generate a GM location model of dimension Jb.

Figure 5-2 Microphone array geometry

Table 5-1 Relationship of Frequency Bands to the Microphone Pairs

Band 1 (b = 1): microphone pairs (m, m + (M−1)) with m = 1; number of pairs Jb = J1 = 1; frequency range 0 ≤ ω ≤ ν / (2(M−1)d)

Band 2 (b = 2): microphone pairs (m, m + (M−2)) with 1 ≤ m ≤ 2; number of pairs Jb = J2 = 2; frequency range ν / (2(M−1)d) < ω ≤ ν / (2(M−2)d)

⋮

Band M−1 (b = M−1): microphone pairs (m, m + 1) with 1 ≤ m ≤ M−1; number of pairs Jb = J(M−1) = M−1; frequency range ν / (4d) < ω ≤ ν / (2d)
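The band layout of Table 5-1 can be generated mechanically. The sketch below assumes frequencies in Hz and takes the upper edge of each band to be the spatial anti-aliasing limit ν/(2(M−b)d) for pairs separated by (M−b)d; the class and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Band:
    b: int            # band index
    pair_offset: int  # pairs are (m, m + pair_offset), spacing = offset * d
    num_pairs: int    # J_b = b
    f_low_hz: float   # exclusive lower edge
    f_high_hz: float  # inclusive upper edge (anti-aliasing limit)

def frequency_bands(num_mics: int, spacing_m: float, c: float = 340.0) -> list:
    """Split the training range into (M - 1) bands as in Table 5-1."""
    bands, f_low = [], 0.0
    for b in range(1, num_mics):        # bands 1 .. M-1
        offset = num_mics - b           # pair spacing (M - b) * d
        f_high = c / (2.0 * offset * spacing_m)
        bands.append(Band(b, offset, b, f_low, f_high))
        f_low = f_high
    return bands

for band in frequency_bands(num_mics=8, spacing_m=0.05):
    print(band)
```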

5.3 Location Model Description and Parameters Estimation

5.3.1 GM Location Model Description

If the GM location model at location l is represented by the parameter set λ(l) = {λ(ω, 1, l), …, λ(ω, M−1, l)}, then a group of L GM location models can be represented by the parameters {λ(1), …, λ(L)}. A Gaussian mixture density in band b at location l can be denoted as a weighted sum of N Gaussian component densities:

G(P_X̂(ω, b, l) | λ(ω, b, l)) = Σ_{i=1}^{N} ρ_i(ω, b, l) g_i(P_X̂(ω, b, l))   (5-2)

where P_X̂(ω, b, l) = [P_X̂(ω, 1, l) … P_X̂(ω, Jb, l)]^T is a Jb-dimensional training phase difference vector derived from the training signals X̂1(ω), …, X̂M(ω), and ρ_i(ω, b, l) is the i-th mixture weight. Each component of the training phase difference vector can be obtained as follows:

P_X̂(ω, m, l) = phase(X̂_{m+(M−b)}(ω, b, l)) − phase(X̂_m(ω, b, l)), with 1 ≤ m ≤ b   (5-3)

where X̂_m(ω, b, l) denotes the constructed training signal of the m-th microphone in band b at location l. The GM location model parameter in band b at location l, λ(ω, b, l), is constructed from the mixture weight vector, the mean matrix, and the covariance matrices of the N Gaussian component densities:

λ(ω, b, l) = {ρ(ω, b, l), μ(ω, b, l), Σ(ω, b, l)}   (5-4)

where ρ(ω, b, l), μ(ω, b, l), and Σ(ω, b, l) collect, respectively, the mixture weights, mean vectors, and covariance matrices of the N Gaussian components in band b at location l. The i-th corresponding entries are the mixture weight ρ_i(ω, b, l), the mean vector μ_i(ω, b, l) = [μ_i(ω, 1, l) … μ_i(ω, Jb, l)]^T, and the covariance matrix Σ_i(ω, b, l). Notably, the mixture weights must satisfy the constraint Σ_{i=1}^{N} ρ_i(ω, b, l) = 1. Although the phase differences of the microphone pairs may not be statistically independent of each other, GMMs with diagonal covariance matrices have been observed to be capable of modeling the correlations within the data when the mixture number is increased [117].
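To make Eqs. (5-2)–(5-4) concrete, the sketch below extracts phase-difference features for one band and evaluates the diagonal-covariance mixture density; the array shapes, 0-based indexing, and helper names are illustrative rather than the dissertation's implementation.

```python
import numpy as np

def phase_difference_features(X: np.ndarray, band: int) -> np.ndarray:
    """Eq. (5-3): in band b, pair microphone m with m + (M - b), 1 <= m <= b.
    X: (M, num_bins) complex spectra of one frame; returns (b, num_bins)."""
    offset = X.shape[0] - band
    return np.stack([np.angle(X[m + offset]) - np.angle(X[m]) for m in range(band)])

def gmm_log_density(p, weights, means, variances) -> float:
    """Log of the mixture density in Eq. (5-2) for one J_b-dim vector p,
    with diagonal covariances: weights (N,), means/variances (N, J_b)."""
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum((p - means) ** 2 / variances, axis=1))
    return float(np.logaddexp.reduce(log_comp))  # stable log-sum-exp
```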

5.3.2 Parameters Estimation via EM Algorithm

The purpose is to determine the L GM location models, {λ(1), …, λ(L)}, from the measured phase differences between each microphone pair in band b. Several techniques are available for estimating λ(l), of which the most popular is the EM algorithm [115], which estimates the parameters with an iterative scheme that maximizes the log-likelihood function. The EM algorithm guarantees a monotonic increase in the model's log-likelihood value, and its iteration equations, arranged to correspond to the frequency band selection, are as follows:

Expectation step: compute the a posteriori probability of the i-th component for each training phase difference vector P_X̂(ω, b, l, t), t = 1, …, Tb:

Pr(i | P_X̂(ω, b, l, t)) = ρ_i(ω, b, l) g_i(P_X̂(ω, b, l, t)) / Σ_{k=1}^{N} ρ_k(ω, b, l) g_k(P_X̂(ω, b, l, t))

Maximization step:

(i). Estimate the mixture weights: ρ_i(ω, b, l) = (1/Tb) Σ_{t=1}^{Tb} Pr(i | P_X̂(ω, b, l, t))

(ii). Estimate the mean vector: μ_i(ω, b, l) = Σ_{t=1}^{Tb} Pr(i | P_X̂(ω, b, l, t)) P_X̂(ω, b, l, t) / Σ_{t=1}^{Tb} Pr(i | P_X̂(ω, b, l, t))

(iii). Estimate the variances: σ_i²(ω, b, l) = Σ_{t=1}^{Tb} Pr(i | P_X̂(ω, b, l, t)) (P_X̂(ω, b, l, t) − μ_i(ω, b, l))² / Σ_{t=1}^{Tb} Pr(i | P_X̂(ω, b, l, t))

However, the EM algorithm is only guaranteed to find a local maximum of the log-likelihood; a different choice of the initial model λ0(ω, b, l) leads to different local-maximum models. This work considers two methods for finding the initial model. The first utilizes accelerated K-means clustering: K-means [118] is by far the most widely used clustering method, and Elkan [119] proposed an accelerated K-means algorithm that exploits the triangle inequality to decrease the computational effort significantly; Elkan's method is also suitable for finding a good initial model that lowers the number of EM iterations. The second method separates the phase difference range, [−π, π], into N segments to obtain a fixed initial mean model, since the phase difference range is small enough; the initial mean model is then {−π + 2π/N, −π + 4π/N, …, π}. The two initialization approaches yield slightly different location detection performance, and neither is consistently the best.
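Both initializations can be sketched with scikit-learn's GaussianMixture, whose EM loop plays the role of the update equations above; the hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_location_model(phase_diffs: np.ndarray, n_components: int,
                         means_init=None) -> GaussianMixture:
    """Fit a diagonal-covariance GMM to (T, J_b) phase-difference vectors.
    means_init=None -> k-means initialization (first method)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          init_params="kmeans", means_init=means_init,
                          max_iter=100)
    return gmm.fit(phase_diffs)

def fixed_initial_means(n_components: int, dim: int) -> np.ndarray:
    """Second method: split [-pi, pi] into N segments, one fixed mean each."""
    centers = -np.pi + 2.0 * np.pi * np.arange(1, n_components + 1) / n_components
    return np.tile(centers[:, None], (1, dim))
```

Passing means_init=fixed_initial_means(N, J_b) selects the fixed-segment initialization instead of k-means.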

Figure 5-3 Location model training procedure with the total location number L

5.4 Single Speaker’s Location Detection Criterion

The location is determined by finding the GM location model that has the maximum a posteriori probability for a given observation sequence P_X:

l̂ = arg max_{1≤l≤L} p(λ(l) | P_X) = arg max_{1≤l≤L} p(P_X | λ(l)) p(λ(l)) / p(P_X)

Since the locations can be assumed equally likely a priori, so that p(λ(l)) = 1/L, and p(P_X) is the same for all location models, the detection rule can be rewritten as:

l̂ = arg max_{1≤l≤L} p(P_X | λ(l))
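Under the equal-prior assumption, the rule reduces to an argmax over accumulated per-frame log-likelihoods; a minimal sketch (fed, for instance, by a per-frame scorer such as the gmm_log_density sketch above):

```python
import numpy as np

def detect_single_location(log_likelihoods: np.ndarray) -> int:
    """log_likelihoods: (L, T) per-location, per-frame log p(P_X | lambda(l)).
    Returns the index of the maximum-a-posteriori location."""
    return int(np.argmax(log_likelihoods.sum(axis=1)))
```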

5.5 Testing Sequence Lengths and Thresholds Estimation

First, the work in Section 5.4 assumed that the speech signals are emitted from one of the previously modeled locations. Consequently, an unmodeled speech signal that is not emitted from one of the modeled locations, such as radio broadcasting from the in-car audio system or a speaker's voice from an unmodeled location, degrades the performance: the unmodeled speech signal can trigger the VAD, resulting in an incorrect detection of the speaker's location. Therefore, a method that can prevent such detection errors without modifying the VAD approach is necessary. Second, the work in Section 5.4 cannot detect multiple speakers' locations. If the speech signals from various modeled locations are mixed together, the derived phase difference distribution becomes an unmodeled distribution, leading to a detection error. Figure 5-4 shows an example of the phase difference distribution from two simultaneously speaking passengers at locations No. 1 and 2; it is not similar to the distribution from location No. 1 or No. 2 alone, and thus may lead to a detection error.

Figure 5-4 Histograms of the phase differences between the third and the sixth microphones at a frequency of 0.9375 kHz: (a) location No. 1; (b) location No. 2; (c) locations No. 1 and 2 simultaneously.

Moreover, Section 5.4 does not discuss how to find a suitable testing sequence length, which can significantly affect the location detection performance. This section proposes a new threshold-based location detection approach that utilizes the training signals and the trained GM location model parameters to determine the testing sequence length, and then obtains a threshold on the a posteriori probability for each location, resolving the two issues mentioned above.

Since conversational speech contains many short pauses, Potamitis et al. [103] locate multiple speakers by detecting the direction of each individual speaker whenever a frame originates from a single speaker. For example, Fig. 5-5 shows a two-person conversation condition. Based on this concept, this dissertation proposes a threshold-based location detection approach to determine whether a speech frame originates from a single speaker or from simultaneously active speakers. This approach identifies the frames in which probably only one speaker is talking, and returns a valid location detection result. Moreover, because each location has specific acoustical characteristics, a per-location threshold can indicate whether a frame represents radio broadcasting or speech signals coming from unmodeled or modeled locations.

Figure 5-5 A two-person conversation condition

The testing sequence lengths and thresholds can be derived using the estimated parameters of the L GM location models. The most suitable testing sequence length at location l is denoted as Q(l), the threshold at location l is denoted as Thd(l), and the possible searching range of the testing sequence length is set to [Q_Lo, Q_Up]. T is the total length of the training phase difference sequence. Q(l) is selected according to the following criterion:

Q(l) = arg max_{Q_Lo ≤ Q ≤ Q_Up} { C(Q) }, C(Q) = α C1(Q) + β C2(Q) + γ C3(Q) with α + β + γ = 1   (5-14)

where C1(Q), C2(Q), and C3(Q) are the three criterion terms discussed below, and the threshold Thd(l) is set to the corresponding probability lower bound when the length of the training phase difference sequence is Q(l); both are derived from Eqs. (5-15)–(5-19).

The first term of Eq. (5-14) represents the negative maximum probability variation of the trained model when the length of the training phase difference sequence is Q. As the value of this term increases, the corresponding selection of Q yields a more robust result under the trained GM location model. The second term of Eq. (5-14) is the sum of the probability differences of location l versus the other locations; a larger value means the corresponding selection of Q has a higher discrimination level between location l and the other trained GM locations. Finally, a high discrimination level between location l and unmodeled locations can be achieved if the third term of Eq. (5-14) is large. Figure 5-6 shows the location model training procedure, extended with the testing sequence length and threshold estimation, for the total location number L.
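A sketch of the selection in Eq. (5-14); the three term sequences C1, C2, and C3 are assumed to have been evaluated beforehand for each candidate length Q (their exact definitions in Eqs. (5-15)–(5-19) are not reproduced above):

```python
import numpy as np

def select_sequence_length(c1, c2, c3, q_lo: int,
                           alpha: float = 1/3, beta: float = 1/3,
                           gamma: float = 1/3) -> int:
    """Eq. (5-14): Q(l) = argmax alpha*C1(Q) + beta*C2(Q) + gamma*C3(Q),
    with alpha + beta + gamma = 1; index 0 corresponds to Q = q_lo."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    scores = (alpha * np.asarray(c1) + beta * np.asarray(c2)
              + gamma * np.asarray(c3))
    return q_lo + int(np.argmax(scores))
```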

Figure 5-6 Location model training procedure with testing sequence length and thresholds estimation

5.6 Multiple Speakers’ Locations Detection Criterion

The location is detected as

l̂ = arg max_{1≤l≤L} p(P_X(ω, b, l) | λ(ω, b, l)) p(λ(ω, b, l)) / p(P_X(ω, b, l))

where the testing phase difference vectors P_X(ω, b, l) are derived from the received signals X1(ω), …, XM(ω). If the probability densities at all locations are equally likely, then p(λ(ω, b, l)) can be chosen as 1/L. The probability p(P_X(ω, b, l)) is the same for all location models, and the detection rule can then be rewritten as

l̂ = arg max_{1≤l≤L} G(P_X(ω, b, l) | λ(ω, b, l))

If the value of the accumulated mixture density G(P_X(ω, b, l̂) | λ(ω, b, l̂)) over the testing sequence of length Q(l̂) is lower than the corresponding threshold Thd(l̂), then the frames may contain speech components that come simultaneously from multiple modeled locations or from unmodeled locations.
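A sketch of the threshold-based rule: accumulate each location model's log-likelihood over its own testing-sequence length Q(l) and reject the decision when the best score falls below that location's threshold, reading a rejection as simultaneous or unmodeled speakers. The array layout is an assumption.

```python
import numpy as np

def detect_with_threshold(frame_scores: np.ndarray, q_len, thresholds):
    """frame_scores: (L, T) per-frame log-likelihoods; q_len, thresholds:
    length-L sequences of Q(l) and Thd(l). Returns the detected location
    index, or None when no model passes its threshold."""
    totals = np.array([frame_scores[l, -int(q_len[l]):].sum()
                       for l in range(frame_scores.shape[0])])
    best = int(np.argmax(totals))
    return best if totals[best] >= thresholds[best] else None
```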

5.7 Summary

This chapter proposes reference-signal-based single speaker's location and multiple speakers' locations detection methods. The GM location models are constructed from location-dependent features, namely the phase differences. The proposed methods can cope with the reflection, scattering, and coherence problems, and are found to work even under non-line-of-sight conditions and when speakers are in the same direction but at different distances from the microphone array.

Additionally, the proposed threshold adaptation approach computes a suitable testing sequence length and a threshold for each modeled location. Experimental results in Chapter 6 show that the speaker's location detection approach with these two adapted parameters performs well in detecting multiple speakers' locations and in reducing the average error rates caused by unmodeled locations at various SNRs.

Chapter 6

Experimental Results

This chapter provides simulation and practical environmental results to assess the capability of the reference-signal-based adaptive beamformers and the speaker's location detection approaches proposed in this dissertation. In these experiments, the sampling frequency is set to 8 kHz and the amplified microphone signals are digitized by 16-bit A/D converters. The processed frame window for the STFT contains 256 zero-padding samples and 32 ms of speech signal (256 samples at 8 kHz), totaling 512 samples. Figure 6-1 illustrates the processed frame window and the overlapping condition. The pre-recorded speech signals are acquired by placing a loudspeaker at the speech location, and the reference signal is obtained from the original speech source emitted from the loudspeaker.
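A sketch of the described framing: each 256-sample frame (32 ms at 8 kHz) is zero-padded to 512 samples before the FFT. The hop size is an assumption, since the exact overlap is only shown in Figure 6-1.

```python
import numpy as np

def stft_frames(x: np.ndarray, frame_len: int = 256, fft_len: int = 512,
                hop: int = 128) -> np.ndarray:
    """Zero-pad each 32 ms frame to 512 samples and transform, as in the setup."""
    spectra = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = np.zeros(fft_len)
        frame[:frame_len] = x[start:start + frame_len]
        spectra.append(np.fft.rfft(frame))
    return np.array(spectra)
```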

Figure 6-1 Processed frame window and overlapping condition

The remainder of this chapter is organized as follows. Section 6.1 utilizes the ASR rates and the two frequency-domain performance indexes introduced in Chapter 3 to show the advantages of the proposed reference-signal-based frequency-domain beamformers, SPFDBB and FDABB. Section 6.2 compares the robustness of the NLMS and H∞ adaptation criteria in the time domain through the simulation results.

Section 6.2 also utilizes the ASR rates in both vehicular and indoor environments to