
Frequency-domain Performance Indexes

From Eq. (3-1), the filter coefficient vector should equalize the acoustic channel response. Two frequency-domain performance indices, SDR and NSR, are defined to measure these effects: SDR represents the level of source distortion produced by inexact filter coefficient vector estimation, and NSR represents the level of noise reduction achieved.

Assuming that the true channel response is W(ω) = [W1(ω) … WM(ω)]^T, then the source distortion can be represented as:

Es(ω) = (Q̂^H(ω)W(ω) − 1)S(ω) (3-19)

Thus, the optimal filter coefficient vector should satisfy two equality equations:

Q^H(ω)W(ω) = 1 and Q^H(ω)N(ω) = 0 (3-20)

Since the denominators of Eqs. (3-21) and (3-22) are initial reference values, SDR and NSR represent the performance improvement of a given iteration relative to the initial filter coefficient vector. Smaller values of SDR and NSR indicate less source distortion and better noise suppression, respectively.
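As a concrete illustration, both indices can be computed as energy ratios against the initial-iteration reference; the following numpy sketch assumes the exact ratio form and the signal names, which are not given explicitly in this chapter:

```python
import numpy as np

def sdr(s_hat_k, s_hat_0, s):
    """Source distortion at iteration k relative to the initial
    filter estimate (smaller value = less distortion)."""
    return np.sum(np.abs(s_hat_k - s) ** 2) / np.sum(np.abs(s_hat_0 - s) ** 2)

def nsr(n_hat_k, n_hat_0):
    """Residual noise at iteration k relative to the initial
    filter estimate (smaller value = more suppression)."""
    return np.sum(np.abs(n_hat_k) ** 2) / np.sum(np.abs(n_hat_0) ** 2)
```

Because the denominators are fixed initial reference values, both indices start at 1 and decrease as the filter estimate improves.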

3.6 Summary

This chapter presents two novel reference-signal-based frequency-domain beamformers, FDABB and SPFDBB, to overcome problems such as calibration, near-field or far-field conditions, resolution, and desired-signal cancellation. Compared with the reference-signal-based time-domain adaptive beamformer, these approaches not only reduce the computational effort significantly in ASR-based applications but also improve performance in noisy environments. Moreover, FDABB, which can automatically adjust the frame number to different environments, is particularly suitable for practical applications.

Chapter 4

H∞ Adaptation Criterion

4.1. Introduction

The NLMS adaptation criterion used in the preceding chapter does not make any assumption about the pre-recorded signals and the disturbance, unlike exact least-squares algorithms such as recursive least squares (RLS). The NLMS adaptation criterion recursively updates the filter coefficient vector along the direction of the instantaneous gradient of the squared error. Therefore, the NLMS adaptation criterion is more robust to disturbance variation than the RLS algorithm. For example, it has been observed that NLMS has better tracking capabilities than the RLS algorithm in the presence of non-stationary inputs [88]. However, the performance of NLMS depends upon the properties of the modeling errors, which may lead to a large coefficient vector estimation error. Consequently, it is necessary to design a robust adaptive algorithm that guarantees that if the disturbance energy is small, the coefficient vector estimation error will be small as well (in energy).

There has been increasing interest in the mini-max estimation method [89-97], called the H∞ algorithm, which is more robust and less sensitive to model uncertainties and parameter variations than H2 adaptation criteria (such as the Kalman filter). This is because no a priori knowledge of the disturbance statistics is required in the H∞ algorithm, which means that the H∞ algorithm can accommodate all conceivable disturbances of finite energy. Moreover, the estimation criterion of the H∞ algorithm is to minimize the worst possible effects of the disturbances (modeling errors and additive noises) on the signal estimation error. In fact, the NLMS adaptation criterion is the central a posteriori H∞-optimal filter [94]. However, that H∞-optimal filter minimizes the worst possible effects of the disturbances on the filtered output error, whereas the goal of the reference-signal-based beamformer is to estimate the filter coefficient vector itself rather than to minimize the filtered output error. In this case, the criterion of the H∞-optimal filter has to be modified to address the problem of filter coefficient vector estimation (e.g., minimizing the coefficient vector estimation error).

The remainder of this chapter is organized as follows. Section 4.2 describes the definition of the H∞-norm and the recursive solution of the time-domain adaptive beamformer that utilizes the H∞ adaptation criterion. Section 4.3 applies the H∞ adaptation criterion to the two proposed frequency-domain adaptive beamformers, SPFDBB and FDABB, and analyzes the computational effort of the two frequency-domain beamformers and of the time-domain beamformer using the H∞ adaptation criterion. Section 4.4 defines time-domain performance indices for the experiments in Chapter 6 to measure the robustness of the H∞ adaptation criterion. Finally, a conclusion is given in Section 4.5.

4.2. Time-Domain Adaptive Beamformer Using H∞ Adaptation Criterion

4.2.1 Definition of the H∞-norm

The H∞-norm defines the worst-case response of a system. If Z denotes a transfer operator that maps an input causal sequence {ui} to an output causal sequence {yi}, as shown in Fig. 4-1, the H∞-norm of Z is defined as

||Z||∞ = sup_{u≠0} ||y||_2 / ||u||_2 (4-1)

where the notation ||·||_2 denotes the 2-norm. Obviously, the H∞-norm can be regarded as the maximum energy gain from the input u to the output y.

Figure 4-1 Transfer operator Z from input {ui}, i = 1…n, to output {yi}, i = 1…n
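For a stable LTI system, this maximum energy gain equals the peak magnitude of the frequency response, which gives a quick numerical check of the definition in Eq. (4-1); the filter below is an arbitrary example, not one from this work:

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])              # example FIR impulse response
# H-infinity norm of an LTI system = peak magnitude response
hinf = np.abs(np.fft.fft(h, 4096)).max()

# The energy gain of any finite-energy input never exceeds the H-infinity norm.
rng = np.random.default_rng(0)
gains = []
for _ in range(200):
    u = rng.standard_normal(256)
    y = np.convolve(h, u)
    gains.append(np.linalg.norm(y) / np.linalg.norm(u))
```

Here every observed gain stays below `hinf`, and random inputs concentrated near the peak frequency approach it from below.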

4.2.2 Formulation of Time-Domain Adaptive Beamformer

Figure 4-2 shows the overall architecture of the time-domain adaptive beamformer using the H∞ adaptation criterion. The data flow and architecture are almost the same as those introduced in Chapter 2, except that the H∞ adaptation criterion is used. In the silent stage (VAD = 0), the filter coefficient vectors are adapted through the H∞ adaptation criterion. In the speech stage (VAD = 1), the computed filter coefficient vectors are applied to the lower beamformer to suppress the interference signals and noises and derive the purified speech signal.

Based on the system architecture shown in Fig. 4-2, the formulation of the speech enhancement system can be expressed as the following linear model:

r(n) = x̂^T(n)q + e(n) (4-2)

where r(n) is the reference signal and x̂(n) = [x̂1(n) … x̂M(n)]^T is an MP×1 training signal vector. x̂i(n) = [x̂i(n) … x̂i(n−P+1)] is a 1×P training signal vector, and each component in the silent stage is constructed from the linear combination of the pre-recorded speech signals and the online recorded interference signals or noises as x̂i(n) = si(n) + ni(n). M denotes the number of microphones and P denotes the number of filter taps. Additionally, q = [q11 … q1P … qM1 … qMP]^T is the MP×1 unknown filter coefficient vector in the time domain that we intend to estimate, and e(n) is the unknown disturbance, which may also include modeling error.

Figure 4-2 System Architecture of the time-domain adaptive beamformer using H∞ adaptation criterion

The problem of the proposed speech enhancement system is to find an estimation strategy, using all the information available from time 1 to time n, such that the H∞-norm from the disturbances to the coefficient vector estimation error, defined as

q̃(n) = q − q̂(n) (4-3)

is minimized.

Figure 4-3 Transfer operator from disturbances to coefficient vector estimation error

To apply the H∞ adaptation criterion, the linear model in Eq. (4-2) is transformed into an equivalent state-space form:

q(n+1) = q(n)
r(n) = x̂^T(n)q(n) + e(n) (4-4)

To find the optimal H∞ estimation, the criterion in the sense of H∞-based filtering is:

min_Ψ sup_{q,e} Σ_{i=1}^{n} ||q̃(i)||_2^2 / ( μ0^-1 ||q − q̂(0)||_2^2 + Σ_{i=1}^{n} |e(i)|^2 ) (4-5)

where ||·||_2^2 denotes the square of the 2-norm. However, a closed-form solution of Eq. (4-5) is unavailable for general cases. Therefore, it is common in the literature to relax the minimization condition and settle for a suboptimal solution. Given a scalar γq > 0, find an H∞ suboptimal estimation strategy q̂(n) = Ψ(r(1), … r(n); x̂(1), … x̂(n)) that achieves ||T(Ψ)||∞ < γq, where T(Ψ) denotes the transfer operator from the disturbances to the coefficient vector estimation error. In other words, the suboptimal solution is to find a strategy that achieves

sup_{q,e} Σ_{i=1}^{n} ||q̃(i)||_2^2 / ( μ0^-1 ||q − q̂(0)||_2^2 + Σ_{i=1}^{n} |e(i)|^2 ) < γq^2 (4-6)

4.2.3 Solution of Suboptimal H∞ Adaptation Criterion

Consider a time-variant state-space model of the form

q(n+1) = F(n)q(n) + G(n)w(n)
r(n) = H(n)q(n) + e(n) (4-7)

The H∞ formulation estimates some arbitrary linear combination of the states, say

d(n) = B(n)q(n) (4-8)

where B(n) ∈ C^{N×MP}. Let d̂(n) = Ψ(r(1), … r(n)) denote the estimation of d(n) given observations {r(i)}, i = 1…n. Define the estimation error as

d̃(n) = d(n) − d̂(n) (4-9)

and require the worst-case energy gain from the disturbances to d̃ to be less than γ^2:

sup Σ_i ||d̃(i)||_2^2 / ( μ0^-1 ||q(0) − q̂(0)||_2^2 + Σ_i ||w(i)||_2^2 + Σ_i |e(i)|^2 ) < γ^2 (4-10)

In this case, the suboptimal solution [96] is recursively computed as

q̂(n) = q̂(n−1) + K(n)(r(n) − H(n)q̂(n−1)) (4-11)
K(n) = P(n)H^H(n)[I + H(n)P(n)H^H(n)]^-1 (4-12)
P(n+1) = F(n)[P^-1(n) + H^H(n)H(n) − γ^-2 B^H(n)B(n)]^-1 F^H(n) + G(n)G^H(n) (4-13)
q̂(0) = 0, P(1) = μ0 I (4-14)

and the following alternative condition can be used to guarantee the existence of Eq. (4-10):

P^-1(n) + H^H(n)H(n) − γ^-2 B^H(n)B(n) > 0 (4-15)

4.2.4 Solution of Time-domain Adaptive Beamformer

Let’s apply Eqs. (4-11), (4-12), (4-13), and (4-14) to the state-space model Eq. (4-4), where F(n) = I, G(n) = 0, H(n) = x̂^T(n), and B(n) = I. Thus the solution of q̂(n) can be found by the following iterative equations:

q̂(n) = q̂(n−1) + K(n)(r(n) − x̂^T(n)q̂(n−1)) (4-16)
K(n) = P(n)x̂(n)[1 + x̂^T(n)P(n)x̂(n)]^-1 (4-17)
P(n+1) = [P^-1(n) + x̂(n)x̂^T(n) − γq^-2 I]^-1 (4-18)
q̂(0) = 0, P(1) = μ0 I (4-19)

To ensure the existence of Eq. (4-6), γq should be chosen such that γq^-2 = δ·eig(P^-1(n) + x̂(n)x̂^T(n)), where eig(z) denotes the minimum eigenvalue of z. δ is a positive constant lower than one to ensure that Eq. (4-18) is positive definite.
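With F(n) = I, G(n) = 0, H(n) = x̂^T(n), and B(n) = I, one iteration of the suboptimal H∞ recursion can be sketched as below. The exact update form follows the standard suboptimal H∞ filter and is an assumption here; function and variable names are illustrative, and δ < 1 enforces the positive-definiteness condition:

```python
import numpy as np

def hinf_step(q_hat, P, x, r, delta=0.2):
    """One suboptimal H-infinity iteration (a sketch of the
    time-domain recursion, Eqs. (4-16)-(4-19))."""
    # gain and coefficient update
    k = P @ x / (1.0 + x @ P @ x)
    q_hat = q_hat + k * (r - x @ q_hat)
    # choose gamma^-2 from the minimum eigenvalue so the Riccati-type
    # update below stays positive definite (delta < 1)
    A = np.linalg.inv(P) + np.outer(x, x)
    gamma_inv2 = delta * np.linalg.eigvalsh(A).min()
    P = np.linalg.inv(A - gamma_inv2 * np.eye(len(x)))
    return q_hat, P
```

Compared with an RLS-type update, the subtracted γ^-2 term keeps P inflated, so the gain does not decay to zero, which is what preserves robust tracking.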

The adaptation of the filter coefficient vector is performed in the silent stage. When the system switches to the speech stage, the adaptation stops and the filter coefficient vector is passed to the lower beamformer. The purified speech signal can be calculated by

yb(n) = q̂^T(n)x(n) (4-20)

where x(n) = [x1(n) … xM(n)]^T is the MP×1 noisy speech signal vector acquired by the microphone array, and xi(n) = [xi(n) … xi(n−P+1)].

4.3. SPFDBB, FDABB and Computational Effort Analysis

4.3.1 SPFDBB Using H∞ Adaptation Criterion

Figure 4-4 shows the overall architecture of the proposed speech enhancement using the H∞ adaptation criterion in the frequency domain. For ASR applications, the purified spectrum should be computed directly to save computational effort, since most speech recognition algorithms operate in the frequency domain. In this case, the filter coefficient vectors can be updated on a block of data. Hence, the problem is transformed into the frequency domain by using the STFT. In conjunction with spectrum-based ASR, the window size of the STFT has to equal that of the ASR in order to obtain a more accurate result. However, the window size may be too small to capture the acoustic channel response. For this reason, Chapter 3 proposed an approach called SPFDBB, which takes the frame average over several frames as a block to improve the approximation of the channel response. The number of frames in a block is denoted as the frame number L. In this chapter, the H∞ adaptation criterion is adopted to further improve the performance.


Figure 4-4 System Architecture of SPFDBB and FDABB using H∞ adaptation criterion

The strategy of the SPFDBB using the H∞ adaptation criterion can be formulated as:

R(ω,k) = X̂^H(ω,k)Q(ω) + E(ω,k) (4-21)

X̂(ω,k) = S(ω,k) + N(ω,k) (4-22)

where N(ω,k), S(ω,k), and R(ω,k) represent the frequency-domain online recorded environmental noise vector, the pre-recorded speech signal vector, and the reference signal, respectively.

Let’s apply Eqs. (4-11), (4-12), (4-13), and (4-14) to the SPFDBB, where F = I, G = 0, H(ω,k) = X̂^H(ω,k), and B = I. The state-space model can be represented by

Q(ω,k+1) = Q(ω,k)
R(ω,k) = X̂^H(ω,k)Q(ω,k) + E(ω,k) (4-23)

Q̂ can be approximated by the iteration:

Q̂(ω,k) = Q̂(ω,k−1) + K(ω,k)[R(ω,k) − X̂^H(ω,k)Q̂(ω,k−1)] (4-24)
K(ω,k) = P1(ω,k)X̂(ω,k)[I + X̂^H(ω,k)P1(ω,k)X̂(ω,k)]^-1 (4-25)
P1(ω,k+1) = [P1^-1(ω,k) + X̂(ω,k)X̂^H(ω,k) − γQ^-2 I]^-1 (4-26)
Q̂(ω,1) = 0 and P1(ω,1) = μ0 I (4-27)

The value of γQ^2 during the iteration is chosen from

γQ^-2 = δ·eig(P1^-1(ω,k) + H^H(ω,k)H(ω,k))

where eig(z) denotes the minimum eigenvalue of z. δ is a positive constant lower than one to ensure that Eq. (4-26) is positive definite. Consequently, the purified speech signal at the kth block can be obtained by the following equation:

Ŷ(ω,k) = Q̂^H(ω,k)X(ω,k) (4-28)

where Ŷ(ω,k) = [Ŷ(ω,k) … Ŷ(ω,k+L−1)] is the purified result and X(ω,k) is the M×L online recorded noisy speech signal matrix. The step k is chosen as k = 0, L, 2L, 3L, … to perform the adaptation process every L frames.
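Per frequency bin, one block iteration of this update can be written directly in matrix form. The sketch below assumes the observation matrix is the conjugate-transposed training spectra (H(ω,k) = X̂^H(ω,k)) and mirrors the standard suboptimal H∞ recursion; all names are illustrative:

```python
import numpy as np

def spfdbb_step(Q_hat, P1, X_hat, R, delta=0.2):
    """One block iteration for a single frequency bin (a sketch of
    Eqs. (4-24)-(4-27)). X_hat: M x L training spectra for M
    microphones and L frames; R: length-L reference spectra."""
    M, L = X_hat.shape
    H = X_hat.conj().T                                  # L x M observation matrix
    # gain and coefficient update
    K = P1 @ H.conj().T @ np.linalg.inv(np.eye(L) + H @ P1 @ H.conj().T)
    Q_hat = Q_hat + K @ (R - H @ Q_hat)
    # gamma chosen from the minimum eigenvalue (delta < 1 keeps the
    # inverse below positive definite)
    A = np.linalg.inv(P1) + H.conj().T @ H
    gamma_inv2 = delta * np.linalg.eigvalsh(A).min()
    P1 = np.linalg.inv(A - gamma_inv2 * np.eye(M))
    return Q_hat, P1
```

Because each block carries L frames, one call performs the equivalent of L time-domain samples' worth of adaptation for that bin, which is the source of the computational savings discussed in Section 4.3.3.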

4.3.2 FDABB Using H∞ Adaptation Criterion

The required length of the training data would be too large, and the SPFDBB could not respond to changes in the room acoustics, when a large value of L is chosen. Therefore, the method called FDABB was proposed in Chapter 3 to further enhance SPFDBB through the index CBVI, allowing the frame number to be adapted online. In other words, CBVI, defined in Eq. (3-18), is the basis for adjusting the frame number. Figure 4-5 summarizes the proposed FDABB algorithm using the H∞ adaptation criterion.

Figure 4-5 FDABB using H∞ adaptation criterion

4.3.3 Computational Effort Analysis

This section analyzes the computational efforts of the time-domain and the two frequency-domain adaptive beamformers using the H∞ adaptation criterion in two different phases: the coefficient adaptation phase and the lower beamformer phase. In the coefficient adaptation phase, the desired speaker is silent and the coefficients are updated with the iteration equations (4-16) to (4-19) for the time-domain adaptive beamformer, with Eqs. (4-24) to (4-27) for SPFDBB, and with Eqs. (4-24) to (4-27) and (3-18) for FDABB. The computational efforts are calculated for one second of input data in each phase and are shown in Table 4-1. The meanings of the parameters are defined in Section 3.4.2. Notably, the computational effort of a matrix inversion was given in [98]. On the other hand, the eigenvalues must be found to determine the value of γq or γQ. However, the computation of eigenvalues is complex, and the computational effort varies with the precision required. To ensure steady performance, a general method based on the Householder method and the shifted QR algorithm is considered; the computational effort associated with finding the eigenvalues is given in [99-100].

Table 4-1 Real Multiplication Requirement in One Second Input Data

Multiplication Requirement
                           Adaptation Phase    Lower Beamformer Phase
Time-domain beamformer                          fs·MP

4.4. Time-domain Performance Indexes

In this section, six time-domain performance indexes are defined. The first two, SDR and NSR, are defined instead of SNR to evaluate the performance of the NLMS and H∞ adaptation criteria, because a lower SNR may not correspond to a higher ASR rate, and these two indexes directly separate the two main issues: the inverse issue and the noise suppression issue. SDR is defined as:

SDR = Σ_{n=1}^{V} (ŝ(n) − s(n))^2 / Σ_{n=1}^{V} s(n)^2 (4-29)

and NSR is defined as:

NSR = Σ_{n=1}^{V} n̂(n)^2 / Σ_{n=1}^{V} n(n)^2 (4-30)

where V denotes the length of the signals in both equations, s(n) and n(n) are the source and noise components at the input, and ŝ(n) and n̂(n) are their components in the beamformer output. SDR represents the degree of source distortion caused by the channel effect and noises. Moreover, NSR is the degree of noise reduction.

To observe the different characteristics of the NLMS and H∞ adaptation criteria, four performance indexes, named the filtered output error ef(n), the reference signal estimation error er(n), the filter coefficient estimation error ratio, and the filtered output error ratio, are defined in Eqs. (4-31), (4-32), (4-33), and (4-34), respectively:

ef(n) = x̂^T(n)q̃(n) (4-31)
er(n) = r(n) − x̂^T(n)q̂(n) (4-32)
filter coefficient estimation error ratio = Σ_{i=1}^{n} ||q̃(i)||_2^2 / ( μ0^-1 ||q̃(0)||_2^2 + Σ_{i=1}^{n} |e(i)|^2 ) (4-33)
filtered output error ratio = Σ_{i=1}^{n} |ef(i)|^2 / ( μ0^-1 ||q̃(0)||_2^2 + Σ_{i=1}^{n} |e(i)|^2 ) (4-34)

where q̃(n) is the coefficient vector estimation error defined in Eq. (4-3).
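The first two indexes are straightforward to compute per sample, assuming the usual definitions ef(n) = x̂^T(n)q̃(n) and er(n) = r(n) − x̂^T(n)q̂(n); variable names in this sketch are illustrative:

```python
import numpy as np

def output_errors(q, q_hat, x, r):
    """Filtered output error e_f and reference-signal estimation
    error e_r for one sample (a sketch of the first two indexes)."""
    q_tilde = q - q_hat            # coefficient estimation error, Eq. (4-3)
    e_f = x @ q_tilde              # error of the filtered output
    e_r = r - x @ q_hat            # error against the reference signal
    return e_f, e_r
```

With r(n) = x̂^T(n)q + e(n), the two errors differ exactly by the disturbance: er(n) = ef(n) + e(n), so comparing them isolates how much of the residual is due to coefficient misestimation.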

4.5. Summary

In this chapter, the H∞ adaptation criterion is investigated to enhance robustness to the modeling error caused by a window size that is inadequate to capture the acoustic channel dynamics. This chapter utilizes the H∞ adaptation criterion to replace the NLMS in the reference-signal-based time-domain adaptive beamformer. However, owing to the intensive computational effort required in the time domain, SPFDBB and FDABB using the H∞ adaptation criterion are proposed to significantly reduce the computational effort, which is analyzed in Section 4.3.3. As shown in Chapter 6, the H∞ adaptation criterion outperforms the NLMS with the same filter order in terms of SDR, NSR, and ASR results.

Chapter 5

Reference-signal-based Speaker’s Location Detection

5.1 Introduction

A speech enhancement system using a microphone array usually requires a speaker's-location estimation capability. With the information of the desired speaker's location, the speech enhancement system can suppress interference signals and noises from other locations. For example, in vehicular applications, a driver may wish to exert particular authority in manipulating the in-car electronic systems through spoken language. Consequently, if the driver's location is known, a better receiving beam can be formed using the microphone array to suppress the environmental noises and enhance the driver's speech signal.

In a highly reflective or scattering environment, conventional delay estimation methods such as GCC-based (TDOA-based) algorithms [21-23] or previous works [24-25] do not yield satisfactory results. Although Brandstein et al. [101] proposed Tukey's biweight to redefine the weighting function to deal with the reflection effect, it is not suitable for a noisy environment. To overcome this limitation, Nikias et al. [102] adopted the alpha-stable distribution, instead of a single Gaussian model, to model ambient noise and obtain robust speaker's-location detection. In recent years, several works have introduced probability-based methods to eliminate the measurement errors caused by uncertainties, such as those associated with reverberation or low-energy segments. Histogram-based TDOA estimators such as time histograms [20] and weighted time-frequency histograms [26-27] have been proposed to reduce direction-of-arrival root-mean-square errors. The algorithm in [27] performs well especially under low SNR conditions. Moreover, Potamitis et al. [103] proposed the probabilistic data association (PDA) technique with the interacting multiple model (IMM) estimator to conquer these measurement errors. Ward et al. [7] developed a particle filter beamforming (steered-beamformer-based location approach) in which the weights and particles can be updated using a likelihood function to solve the reverberation problem. Although these statistical methods [7], [20], [26-27], and [103] can further improve the estimation accuracy, they cannot distinguish the locations using a single linear microphone array under a totally non-line-of-sight condition, which is common in vehicular environments.

Another approach (spectral-estimation-based location approach), proposed by Balan et al. [8], explores the eigenstructure of the correlation matrix of the microphone array by separating speech signals and noise signals into two orthogonal subspaces. The DOA is then estimated by projecting the steering vectors onto the noise subspace.

MUSIC [9-10] combined with spatial smoothing [11], [104] is one of the most popular methods for eliminating the coherence problem. However, as the experiment in Chapter 6 indicates, its robustness is still poor in a vehicular environment when the SNR is low. Furthermore, the near-field effect [105-107] should also be considered in real-environment applications.

In some environments, especially vehicular environments, the line-of-sight condition may not be available because, for example, barriers may exist between the speaker and the microphone array. Therefore, when a single linear array is employed, the aforementioned methods cannot distinguish speakers under non-line-of-sight conditions. Hence, multiple microphone arrays must be considered [108-109]. Further, the microphone mismatch problem often arises when steered-beamformer-based, GCC-based, or spectral-estimation-based algorithms are used, since these methods require the microphones to be calibrated in advance. Several sound source localization works [110-111] also mentioned the importance of calibration and the influence of the microphone mismatch problem. However, accurate calibration is not easy to obtain since the characteristics of microphones vary with the sound source direction.

The relationship between a sound source and a receiver (microphone) in a complicated enclosure is almost impossible to characterize with finite-length data in real-time applications (such as frame-based calculations). According to the investigation of room acoustics [112], the number of eigen-frequencies with an upper limit of fs/2 can be obtained by the following equation:

N = (4π/3)·B·(fs/2ν)^3 (5-1)

where fs denotes the sampling frequency, ν represents the sound velocity (ν ≈ 340 m/s), and B is the geometrical volume. This equation indicates that the number of poles is too high when the frequency is high, and that the transient response

occurs in almost any processing duration when the input signal is a speech signal. For example, the number of poles is about 96435 when the sampling frequency is 8 kHz and the volume is 14.1385 m³. Hence, the non-stationary characteristics of speech signals make the phase differences between the signals received by two elements of a single linear microphone array from a fixed sound source vary among data sets. Moreover, the stochastic nature of the phase difference is more prominent when the sound source moves slightly and environmental noises are present. Consequently, the method proposed in this dissertation does not explicitly utilize the information of the direct path from the sound source to the microphones to detect the speaker's location, nor does it attempt to suppress the effects of reverberations, interference signals, and noises. Instead, the proposed method utilizes the sound field features obtained when the speaker is at different locations in an indoor environment. In other words, this dissertation proposes the use of the distributions of phase differences, rather than their actual values, to locate the speaker, because the phase difference distributions vary among locations and can be distinguished by pattern matching methods. Previous research [113-114] also showed that common acoustic measures vary significantly with small spatial displacements of the sound source or the microphone.
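Eq. (5-1) is easy to verify against the example in the text (a sketch; ν = 340 m/s as stated above):

```python
import math

def eigenfrequency_count(fs, volume, v=340.0):
    """Number of room eigen-frequencies below fs/2, Eq. (5-1)."""
    return (4.0 / 3.0) * math.pi * volume * (fs / (2.0 * v)) ** 3

# The dissertation's example: fs = 8 kHz, B = 14.1385 m^3
n_poles = eigenfrequency_count(8000.0, 14.1385)   # about 96435
```

The cubic growth in fs/2ν is why a moderate room already yields tens of thousands of poles at an 8 kHz sampling rate, making explicit channel characterization impractical.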

The experimental results in Chapter 6 indicate that the GMM [115] is very suitable for modeling these distributions. Furthermore, the model training uses the distributions of phase differences among microphones as a location-dependent but content- and speaker-independent sound field feature. In this case, the geometry of the microphone array should be considered to cope with the aliasing problem and to maximize the phase difference of each frequency band so that the speaker's location can be detected accurately.

Consequently, the microphone array can be decoupled into several pairs with various spacings, and the detector integrates the overall probability information from different frequency bands to detect the speaker's location.
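The sound field feature itself is simple to extract: per frame and per frequency bin, the wrapped phase difference between one microphone pair. The numpy sketch below illustrates this; the window and FFT sizes are illustrative, and the GMM modeling stage is omitted:

```python
import numpy as np

def phase_difference_features(x1, x2, n_fft=256, hop=128):
    """Frame-by-frame inter-microphone phase differences: the
    location-dependent, content- and speaker-independent feature
    whose distribution is then modeled by a GMM."""
    win = np.hanning(n_fft)
    feats = []
    for start in range(0, len(x1) - n_fft + 1, hop):
        X1 = np.fft.rfft(win * x1[start:start + n_fft])
        X2 = np.fft.rfft(win * x2[start:start + n_fft])
        feats.append(np.angle(X1 * np.conj(X2)))  # wrapped to (-pi, pi]
    return np.array(feats)                        # frames x (n_fft//2 + 1)
```

Collecting these vectors over many frames yields, per location, the empirical distribution that the location models are trained on; at test time the same features are scored against each location's model.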

The remainder of this chapter is organized as follows. The next section introduces the overall system architecture and data flow. Section 5.3 presents the design of the location model and the model parameter estimation approach. Section 5.4 presents the proposed reference-signal-based single-speaker location detection criterion. Section 5.5 discusses an approach to find each location's testing sequence length and threshold. Section 5.6 describes the proposed reference-signal-based multiple-speaker location detection criterion using the testing sequence length and threshold of each location. Conclusions are drawn in Section 5.7.

5.2 System Architecture

5.2.1 System Architecture

Figure 5-1 illustrates the overall system architecture. A voice activity detector divides the system into two stages, the silent stage and the speech stage. Before the proposed system is trained online, a set of pre-recorded speech signals can be acquired as described in Chapter 2 to obtain a priori information between speakers and the
