

2.5 Summary

This chapter proposes a two-stage beamformer that suppresses multiple competing speech signals and stationary noise while extracting the desired speech, based on the TFR information and the H∞ filter. The virtual sound source concept, which transforms the multiple competing speech signals from a MIMO to a SIMO acoustic system, is presented to simplify the complicated acoustic system. The performance of the individual noise cancellation blocks is analyzed, and the advantages of the H∞ filter and the proposed system architecture are demonstrated.

Chapter 3

Robust Adaptive Beamformer Using the Second-Order Extended H∞ Filter

3.1 Introduction

Most of the early robust adaptive beamforming methods are rather ad hoc in that the choice of parameters or structural modifications is not directly related to the uncertainty of the steering vector [11]. Recently, more rigorous approaches were proposed to cope with unknown mismatches via worst-case optimization [38], [39]. Unlike the earlier methods, they make explicit use of the uncertainty set of the steering vector. The work in [38] obtains the beamformer weight by minimizing the output interference-plus-noise power while maintaining a distortionless response for the worst-case steering vector mismatch. The robust MVDR problem in [38] was formulated as a second-order cone program and solved in polynomial time via the interior point method. A number of extensions of the robust MVDR beamformer of [38] have been considered [40]-[43]. However, the main shortcoming of these extensions is that they do not have a computationally efficient online implementation. To overcome this problem, El-Keyi et al. [44] developed a new algorithm for the robust MVDR beamformer of [38], based on the constrained SOE Kalman filter, that can be implemented online.

The SOE Kalman filter assumes that the dynamics of the signal-generating processes are known, as are the statistical properties of the noise signals (i.e., uncorrelated, zero-mean Gaussian with known covariance) [48]. However, these assumptions limit the performance, since the complex acoustic dynamics are difficult to model and the uncorrelated zero-mean Gaussian noise assumption is quite stringent considering the variety of environmental interferences. To relax these assumptions, this paper proposes the SOE H∞ filter for the MVDR beamformer of [38], which requires no prior knowledge of the noise statistics other than bounded energy. Several studies on the linear and nonlinear H∞ filter or the mixed Kalman/H∞ filter have been presented [45]-[49], [52]-[69]. Despite these efforts to extend the H∞ filter to different domains for robustness, no work has yet considered a second-order extended H∞ filter for adaptive beamforming, analogous to the SOE Kalman filter.

In this chapter, the SOE H∞ filter under the robust MVDR beamformer setting [38] is derived based on the game theory approach [69]. In the SOE H∞ filter, the state estimator and the disturbance signals (initial condition error, process noise and measurement noise) have conflicting objectives, i.e., to minimize and to maximize the estimation error, respectively. The estimation criterion in the SOE H∞ filter design is to minimize the worst possible effects of the disturbance signals on the signal estimation errors without a priori knowledge of the disturbances. This criterion makes the SOE H∞ filter more suitable for speech enhancement in cases of unknown noise statistics, steering vector uncertainty and modeling error of the beamformer weight. To derive the SOE H∞ filter, the second-order Taylor series expansion is used to approximate the nonlinear function. However, the quadratic terms that appear in the series expansion are too complex to make the solution tractable. They are therefore approximated by the sample covariance matrix of the estimation error, which effectively simplifies the problem.

The remainder of this chapter is organized as follows. The speech enhancement problem and the necessary background on the MVDR beamformer and the robust MVDR beamformer of [38] are presented in Section 3.2. In Section 3.3, the SOE Kalman filter implementation of the robust MVDR beamformer of [38] is briefly reviewed, and the proposed robust MVDR beamformer based on the SOE H∞ filter is introduced in Section 3.4. Section 3.5 presents the SOE H∞ filter solution of a general nonlinear discrete-time system; the detailed derivation is given in Appendices I-IV. Finally, a summary is given in Section 3.6.

3.2 Problem Formulation

Consider the same acoustic environment as in Section 2.2.1. The received signal of the m-th microphone in the frequency domain can be written as

$$X_m(k,\omega) = \sum_{p=1}^{P} A_{mp}(\omega)\, S_p(k,\omega) + N_m(k,\omega) \tag{3-1}$$

The MVDR beamformer output at frame k and frequency ω is given by

$$Y_{MV}(k,\omega) = w_{MV}^{H}(\omega)\, X(k,\omega) \tag{3-2}$$

where $X(k,\omega) = [X_1(k,\omega) \;\cdots\; X_M(k,\omega)]^{T}$ and $w_{MV}(\omega) \in \mathbb{C}^{M \times 1}$ is the MVDR beamformer weight vector. The well-known MVDR beamformer minimizes the output power of the interference signals plus stationary noise while maintaining a distortionless response to the desired signal. The frequency-domain MVDR problem is given by

$$\min_{w_{MV}} \; w_{MV}^{H}(\omega)\, R_{xx}(\omega)\, w_{MV}(\omega) \quad \text{subject to} \quad w_{MV}^{H}(\omega)\, \tilde{A}(\omega) = 1 \tag{3-3}$$

where

$$R_{xx}(\omega) = E\{ X(k,\omega)\, X^{H}(k,\omega) \} \tag{3-4}$$

is the $M \times M$ correlation matrix and $\tilde{A}(\omega) \in \mathbb{C}^{M \times 1}$ is the presumed steering vector. The solution of the MVDR problem is given by [70]

$$w_{MV}(\omega) = \frac{R_{xx}^{-1}(\omega)\, \tilde{A}(\omega)}{\tilde{A}^{H}(\omega)\, R_{xx}^{-1}(\omega)\, \tilde{A}(\omega)} \tag{3-5}$$

In practice, the correlation matrix is unavailable and is usually approximated by

$$\hat{R}_{xx}(\omega) = \frac{1}{K} \sum_{k=1}^{K} X(k,\omega)\, X^{H}(k,\omega) \tag{3-6}$$

where K is the number of available frames. The sample correlation matrix is used in (3-5) in place of the true correlation matrix, and the resulting solution is commonly referred to as the sample matrix inversion (SMI) algorithm [70]. If the desired signal is present during training, the performance of the SMI algorithm degrades dramatically [38]. Another disadvantage of the SMI algorithm is that it does not provide sufficient robustness against a mismatch between the presumed steering vector $\tilde{A}(\omega)$ and the actual steering vector $A(\omega) = [A_{11}(\omega) \;\cdots\; A_{M1}(\omega)]^{T}$.
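The SMI procedure described above can be illustrated with a minimal sketch at a single frequency bin. It uses only the Python standard library, builds the sample correlation matrix in the form of (3-6), and computes the weights in the form of (3-5). The four-microphone scene, the interferer direction, the noise levels and all function names are assumptions made for illustration and are not taken from the experiments in this work.

```python
import cmath
import random


def solve(a, b):
    """Solve the complex linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):  # back substitution
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def sample_correlation(frames):
    """Sample correlation matrix in the form of (3-6): (1/K) * sum_k X(k) X(k)^H."""
    m, k_total = len(frames[0]), len(frames)
    return [[sum(x[i] * x[j].conjugate() for x in frames) / k_total
             for j in range(m)] for i in range(m)]


def mvdr_weights(r_xx, steering):
    """MVDR weights in the form of (3-5): R^{-1} A / (A^H R^{-1} A)."""
    r_inv_a = solve(r_xx, steering)
    denom = sum(a.conjugate() * v for a, v in zip(steering, r_inv_a))
    return [v / denom for v in r_inv_a]


def inner(w, x):
    """Inner product w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


def output_power(w, r_xx):
    """Real part of the quadratic form w^H R w."""
    r_w = [sum(r_xx[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]
    return inner(w, r_w).real


# Hypothetical scene: 4 microphones, one strong interferer, weak sensor noise,
# and no desired signal during training.
random.seed(0)
M, K = 4, 200
steering = [1 + 0j] * M                                  # presumed steering vector (assumed)
interferer = [cmath.exp(-1.3j * m) for m in range(M)]    # made-up interference direction
frames = []
for _ in range(K):
    s = complex(random.gauss(0, 3), random.gauss(0, 3))
    frames.append([s * interferer[m] + complex(random.gauss(0, 0.1), random.gauss(0, 0.1))
                   for m in range(M)])

r_hat = sample_correlation(frames)
w_smi = mvdr_weights(r_hat, steering)
w_dsb = [a / M for a in steering]                        # delay-and-sum weights, same constraint
```

In this sketch `inner(w_smi, steering)` equals one up to rounding, which is the distortionless constraint of (3-3), and `output_power(w_smi, r_hat)` does not exceed the output power of the delay-and-sum weights, since both satisfy the same constraint and the SMI weights minimize the quadratic form.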

In a practical environment, there may exist unknown mismatches between $\tilde{A}(\omega)$ and $A(\omega)$ due to reverberation, microphone mismatch, array configuration mismatch, etc. The norm of the steering vector distortion can be bounded by some known constant $\varepsilon > 0$. Therefore, the actual steering vector belongs to the set

$$\Lambda(\omega) \equiv \{\, C(\omega) \mid C(\omega) = \tilde{A}(\omega) + e(\omega),\ \|e(\omega)\| \le \varepsilon \,\} \tag{3-7}$$

The robust MVDR beamformer in [38] minimizes the output power of the beamformer while maintaining a distortionless response, not only toward the steering vector $\tilde{A}(\omega)$ but also toward all the vectors that belong to $\Lambda(\omega)$. Based on this uncertainty description, Vorobyov et al. [38] formulated the robust MVDR beamformer problem as

$$\min_{w_{MV}} \; w_{MV}^{H}(\omega)\, \hat{R}_{xx}(\omega)\, w_{MV}(\omega) \quad \text{subject to} \quad |w_{MV}^{H}(\omega)\, C(\omega)| \ge 1 \ \text{ for all } C(\omega) \in \Lambda(\omega) \tag{3-8}$$

The semi-infinite nonconvex constraint in (3-8) was reformulated as a single constraint that corresponds to the worst case [38]:

$$\min_{w_{MV}} \; w_{MV}^{H}(\omega)\, \hat{R}_{xx}(\omega)\, w_{MV}(\omega) \quad \text{subject to} \quad \min_{C(\omega) \in \Lambda(\omega)} |w_{MV}^{H}(\omega)\, C(\omega)| \ge 1 \tag{3-9}$$

It can be proven that the inequality constraint in (3-9) is equivalent to an equality constraint [38]. Therefore, the problem in (3-9) can be rewritten as

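$$\min_{w_{MV}} \; w_{MV}^{H}(\omega)\, \hat{R}_{xx}(\omega)\, w_{MV}(\omega) \quad \text{subject to} \quad \left| w_{MV}^{H}(\omega)\, \tilde{A}(\omega) - 1 \right|^{2} = \varepsilon^{2}\, w_{MV}^{H}(\omega)\, w_{MV}(\omega) \tag{3-10}$$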
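The worst-case reformulation rests on the fact that, when $|w^{H}\tilde{A}| > \varepsilon\|w\|$, the smallest response over the uncertainty set is $|w^{H}\tilde{A}| - \varepsilon\|w\|$, attained by a mismatch aligned against $w$. The sketch below checks this numerically for one arbitrary weight vector. The array size, the value of ε, the steering phases and the helper names are all assumptions chosen for illustration.

```python
import cmath
import math
import random


def inner(w, c):
    """Inner product w^H c."""
    return sum(wi.conjugate() * ci for wi, ci in zip(w, c))


def norm(v):
    """Euclidean norm of a complex vector."""
    return math.sqrt(sum(abs(x) ** 2 for x in v))


def worst_case_mismatch(w, steering, eps):
    """Mismatch e with ||e|| = eps minimizing |w^H (A + e)|, assuming |w^H A| > eps * ||w||."""
    phase = cmath.exp(1j * cmath.phase(inner(w, steering)))
    return [-(eps / norm(w)) * phase * wi for wi in w]


def random_mismatch(m, eps):
    """Random vector inside the ball ||e|| <= eps."""
    e = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(m)]
    scale = eps * random.random() / norm(e)
    return [scale * ei for ei in e]


def response(w, steering, e):
    """Magnitude of the beamformer response toward the perturbed steering vector."""
    return abs(inner(w, [a + ei for a, ei in zip(steering, e)]))


random.seed(1)
M, eps = 4, 0.3                                        # made-up array size and bound
steering = [cmath.exp(-0.7j * m) for m in range(M)]    # made-up presumed steering vector
w = [a / M + complex(random.gauss(0, 0.05), random.gauss(0, 0.05)) for a in steering]

bound = abs(inner(w, steering)) - eps * norm(w)        # claimed worst-case response
attained = response(w, steering, worst_case_mismatch(w, steering, eps))
sampled_min = min(response(w, steering, random_mismatch(M, eps)) for _ in range(2000))
```

Here `attained` matches `bound` up to rounding, and no randomly sampled mismatch inside the ball produces a smaller response. Requiring this worst-case response to equal one and squaring leads to a quadratic equality constraint of the kind used in (3-10).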

The problem in (3-10) has been solved in [38] using second-order cone (SOC) programming. Moreover, several extensions of the robust MVDR beamformer have been considered. For example, Newton-type iterative methods were proposed for this problem and its modifications [39], [40]. Reformulating (3-10) into a state-space observer form facilitates the application of the SOE Kalman filter [44]. In the following, we briefly review the SOE Kalman filter solution and present a new approach based on the SOE H∞ filter.

3.3 Robust MVDR beamformer based on the Second-Order Extended Kalman Filter

For convenience of analysis, the mean square error (MSE) between the zero signal and the beamformer output is introduced as

where E(·) denotes the expectation operation. The constraint in (3-10) can be rewritten as

Therefore, the robust MVDR beamformer problem can be formulated as

The constrained minimization problem of (3-14) is written in the state space model below.

State equation:

Measurement equation:

2 measurement equation is then,

and R~

respectively.

The SOE Kalman filter expands the nonlinear function using the second-order Taylor series and finds the optimal estimate $\hat{w}_{MV}(k,\omega)$ to minimize the estimation error defined below

$$E\big[\, w_{MV}(k,\omega) - \hat{w}_{MV}(k,\omega) \,\big] = 0 \tag{3-19}$$

To present the SOE Kalman filter solution, we start by evaluating the Jacobian $G_w(k,\omega)$ of $g(w_{MV}(k,\omega))$ and the Hessian matrices $G_{ww}^{(1)}(\omega)$ and $G_{ww}^{(2)}(\omega)$ of its components as

where I is the identity matrix. For the state space model (3-15) and (3-16), the SOE Kalman filter solution is given by [48]

where the predicted measurement is obtained by

and the filter gain and predicted weight error covariance matrix are given by

$$\tilde{K}(k,\omega) = \tilde{P}(k,\omega)\, G_w^{H}(k,\omega) \left( G_w(k,\omega)\, \tilde{P}(k,\omega)\, G_w^{H}(k,\omega) + \tilde{R} \right)^{-1} \tag{3-25}$$

$$\tilde{P}(k,\omega) = \tilde{P}^{+}(k-1,\omega) + \tilde{Q} \tag{3-26}$$

$$\tilde{P}^{+}(k,\omega) = \left( I - \tilde{K}(k,\omega)\, G_w(k,\omega) \right) \tilde{P}(k,\omega) \tag{3-27}$$

where $\tilde{K}(k,\omega)$ is the Kalman gain, $\tilde{P}(k,\omega)$ is the a priori error covariance matrix and $\tilde{P}^{+}(k,\omega)$ is the a posteriori error covariance matrix. After some algebraic manipulation [48], the Kalman gain can be rewritten as (3-28), and the covariance matrices $\tilde{P}(k,\omega)$ and $\tilde{P}^{+}(k,\omega)$ can be combined as (3-29):

$$\tilde{K}(k,\omega) = \tilde{P}(k,\omega) \left( I + G_w^{H}(k,\omega)\, \tilde{R}^{-1} G_w(k,\omega)\, \tilde{P}(k,\omega) \right)^{-1} G_w^{H}(k,\omega)\, \tilde{R}^{-1} \tag{3-28}$$

$$\tilde{P}(k+1,\omega) = \tilde{P}(k,\omega) \left( I + G_w^{H}(k,\omega)\, \tilde{R}^{-1} G_w(k,\omega)\, \tilde{P}(k,\omega) \right)^{-1} + \tilde{Q} \tag{3-29}$$
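The equivalence between the two gain forms, and between the two-step and one-step covariance recursions, can be checked numerically. The following sketch uses small random complex matrices in place of the actual Jacobian and covariance matrices. The dimensions, the matrix values and the helper functions are assumptions for illustration only.

```python
import random


def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b, sign=1):
    """Elementwise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def herm(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))] for i in range(len(a[0]))]


def eye(n):
    """Complex identity matrix."""
    return [[complex(i == j) for j in range(n)] for i in range(n)]


def inv(a):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + unit for row, unit in zip(a, eye(n))]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def rand_matrix(rows, cols):
    """Random complex Gaussian matrix."""
    return [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(cols)]
            for _ in range(rows)]


def max_abs_diff(a, b):
    """Largest elementwise magnitude difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(2)
n, m = 3, 2                                      # 3 weights, 2 measurement components (assumed)
b = rand_matrix(n, n)
p = add(matmul(b, herm(b)), eye(n))              # Hermitian positive definite stand-in for P~(k)
g = rand_matrix(m, n)                            # stand-in for the Jacobian G_w(k)
r = [[0.5 + 0j, 0j], [0j, 2.0 + 0j]]             # stand-in for R~
q = [[0.01 * v for v in row] for row in eye(n)]  # stand-in for Q~

gh = herm(g)
# Gain in the form of (3-25): P G^H (G P G^H + R)^{-1}
k_25 = matmul(matmul(p, gh), inv(add(matmul(matmul(g, p), gh), r)))
# Gain in the form of (3-28): P (I + G^H R^{-1} G P)^{-1} G^H R^{-1}
core = inv(add(eye(n), matmul(matmul(matmul(gh, inv(r)), g), p)))
k_28 = matmul(matmul(matmul(p, core), gh), inv(r))
# (3-27) followed by (3-26): P(k+1) = (I - K G) P + Q
p_two_step = add(matmul(add(eye(n), matmul(k_25, g), sign=-1), p), q)
# One-step form of (3-29): P(k+1) = P (I + G^H R^{-1} G P)^{-1} + Q
p_29 = add(matmul(p, core), q)
```

Both `max_abs_diff(k_25, k_28)` and `max_abs_diff(p_two_step, p_29)` come out at rounding level, which is consistent with (3-28) and (3-29) being algebraic rewrites of (3-25)-(3-27) rather than new approximations.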

3.4 Robust MVDR beamformer based on the Second-Order Extended H∞ Filter

In contrast to the SOE Kalman filter, which minimizes the expected value of the estimation error variance, another strategy is to minimize the worst possible effects of the disturbances on the signal estimation errors. This essentially minimizes the infinity norm of the input-output relation. In this case, no assumptions on the noise statistics (such as (3-18)) are necessary other than the boundedness of the noise energy. Consider the state space model (3-15) and (3-16), and the estimation of some arbitrary linear combination of $w_{MV}(k,\omega)$, i.e.,

$$z(k,\omega) = C\, w_{MV}(k,\omega) \tag{3-30}$$

where C is a user-defined matrix. The estimate of z(k,ω) is denoted by zˆ(k,ω) and

wMV . The performance index J can be defined as:

( )

by the user based on the specific problem. To simplify the analysis, we assume the weighting matrices Q(k,ω) , R(k,ω) and S(k,ω) are independent of frame and frequency. Hence, equation (3-31) can be reformulated as

( )

satisfy

$$\sup J < \gamma \tag{3-33}$$

where sup denotes the supremum. The formulation of (3-33) shows that the SOE H∞ optimal estimators guarantee the smallest estimation error energy over all possible disturbances ($w_{MV}(0,\omega) - \hat{w}_{MV}(0,\omega)$, $v_s(k,\omega)$ and $v_m(k,\omega)$) of finite energy. They are over-conservative but behave more robustly under disturbance variations. The SOE H∞ filter can be interpreted as a minmax problem where the estimator strategy

) equation (3-34) can be rewritten as:

( )

Considering a second-order approximation of the nonlinearity in (3-35), the solution of (3-35) leads to the SOE H∞ filter. The solution of the SOE H∞ filter for a class of discrete-time nonlinear systems is briefly explained in Section 3.5 and derived in Appendices I-IV. By substituting the corresponding matrices into (3-61)-(3-65), the solution of the SOE H∞ filter for the state space model (3-15) and (3-16) is given as

where 0<η≤1 and the predicted measurement is obtained by

{ }

⎢⎣

= Η +

) , ( ) ( 5

. 0 )) , ˆ ( (

) , ˆ ( ) , ) (

,

ˆ ( (2)

2 ω ω ω

ω ω ω

k tr

k g

k k k

ww MV

MV

h w G P

w

y X (3-41)

Comparing with the SOE Kalman filter solution, we can observe the following.

1. The structures of the matrices $\tilde{K}(k,\omega)$ and $\tilde{P}(k+1,\omega)$ ((3-28) and (3-29)) in the SOE Kalman filter are similar to the structures of $K(k,\omega)$ and $P(k+1,\omega)$ ((3-37) and (3-38)) in the SOE H∞ filter. If the covariance matrices $\tilde{P}(0,\omega)$, $\tilde{Q}$ and $\tilde{R}$ of the SOE Kalman filter are the same as the weighting matrices $P(0,\omega)$, $Q$ and $R$ of the SOE H∞ filter, then $\tilde{K}(k,\omega)$ and $\tilde{P}(k+1,\omega)$ have the same structures as $K(k,\omega)$ and $P(k+1,\omega)$, respectively, when $\gamma \to \infty$.

2. The second-order terms of the Taylor series in the SOE H∞ filter and the SOE Kalman filter are both approximated by the sample covariance matrix of the state estimation error. However, unlike the error covariance matrix $\tilde{P}(k,\omega)$ or $\tilde{P}^{+}(k,\omega)$ in the SOE Kalman filter, the matrix $P(k,\omega)$ in the SOE H∞ filter does not represent the estimation error covariance matrix. Therefore, equations (3-39) and (3-40) are utilized to approximate the estimation error covariance matrix.
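The limiting behavior in the first observation can be illustrated with a scalar sketch. Because (3-37) is not reproduced here, the H∞ gain below is an assumption: it is the commonly cited textbook form of the linear discrete-time H∞ gain, written with a performance bound γ, and is not claimed to be identical to the SOE H∞ gain of this chapter. The numerical values are made up.

```python
def kalman_gain(p, h, r):
    """Scalar version of the gain form (3-28): P (1 + H R^{-1} H P)^{-1} H R^{-1}."""
    return p / (1.0 + h * h * p / r) * h / r


def hinf_gain(p, h, r, s, gamma):
    """Assumed scalar textbook H-infinity gain: P (1 - S P / gamma + H R^{-1} H P)^{-1} H R^{-1}."""
    return p / (1.0 - s * p / gamma + h * h * p / r) * h / r


p, h, r, s = 2.0, 1.5, 0.5, 1.0            # made-up scalar values for P, H, R and S
gammas = [10.0, 100.0, 1e3, 1e6, 1e9]      # increasingly loose performance bounds
gaps = [abs(hinf_gain(p, h, r, s, g) - kalman_gain(p, h, r)) for g in gammas]
```

In this sketch the gap between the two gains shrinks monotonically as γ grows, so the H∞ gain approaches the Kalman gain in the limit. For finite γ the extra term makes the H∞ gain larger, which reflects the more conservative weighting of the measurement.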

3.5 The Second-Order Extended H∞ Filter

This section provides the SOE H∞ filter solution of a general nonlinear discrete-time system shown in (3-42). Although the state space model (3-15) and (3-16) is not exactly the same as (A-1), the SOE H∞ filter solution of (3-42) can, like the SOE Kalman filter solution [48], be easily applied to (3-15) and (3-16).

Consider a nonlinear discrete-time system

$f(\cdot)$ and $h(\cdot)$ are vectors of smooth nonlinear functions that are twice differentiable with respect to $x_a(t)$. The second-order Taylor series expansion of

quadratic term in (3-43) can be written as

where tr[·] is the trace operation. Assume that the matrix $(x_a(t) - \hat{x}_a(t))(x_a(t) - \hat{x}_a(t))^{T}$ can be obtained from the expected values of the past data, i.e., it becomes independent of the current state $x_a(t)$. Denote this matrix as $P_a$, and assume that its value can be estimated. Hence, we have

Later, $P_a$ is approximated by the sample covariance matrix of the estimation error. The goal is to estimate a linear combination of $x_a(t)$ using the observation, i.e.,

function can be defined as:

( )

positive definite matrices chosen by the user based on the specific problem. For the SOE H∞ filter, a performance bound γ is selected and $\hat{z}_a(t)$ is computed to satisfy

$$\sup J < \gamma \tag{3-48}$$

where sup denotes the supremum. The SOE H∞ filter can be interpreted as a minmax problem where the estimator strategy $\hat{z}_a(t)$ plays against the exogenous inputs $w_a(t)$,

be rewritten as:

( )

Therefore, J in (3-50) is written as

This derivation can be separated into three steps. First, a stationary point of J with respect to $x_a(0)$ and $w_a(t)$ is found in Appendix I. Second, a stationary point of J with respect to $\hat{x}_a(t)$ and $y_a(t)$ is found in Appendix II based on the results from Appendix I. Finally, according to Appendix I and Appendix II, the SOE H∞ filter solution of the nonlinear discrete-time system in (3-42) is given in Appendix III.
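The trace form of the quadratic Taylor term used in this section can be illustrated with a scalar quadratic function, for which the second-order expansion is exact. The function, the matrices and the evaluation points below are made up for illustration and stand in for a single component of a nonlinear function such as h(·).

```python
def quad(a, b, x):
    """h(x) = x^T A x + b^T x, a scalar quadratic stand-in for one component of h(.)."""
    n = len(x)
    return (sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))
            + sum(bi * xi for bi, xi in zip(b, x)))


def gradient(a, b, x):
    """Gradient of h at x: (A + A^T) x + b."""
    n = len(x)
    return [sum((a[i][j] + a[j][i]) * x[j] for j in range(n)) + b[i] for i in range(n)]


def hessian(a):
    """Hessian of h: A + A^T (constant for a quadratic)."""
    n = len(a)
    return [[a[i][j] + a[j][i] for j in range(n)] for i in range(n)]


def half_trace(hess, p):
    """Quadratic Taylor term in trace form: 0.5 * tr[Hessian * P]."""
    n = len(p)
    return 0.5 * sum(hess[i][j] * p[j][i] for i in range(n) for j in range(n))


def outer(d):
    """Outer product d d^T of the estimation error."""
    return [[di * dj for dj in d] for di in d]


a = [[2.0, 1.0], [0.0, 3.0]]        # made-up quadratic coefficients
b = [1.0, -1.0]
x_hat = [0.5, -0.2]                 # current estimate
x_true = [0.8, 0.1]                 # true state
d = [xt - xh for xt, xh in zip(x_true, x_hat)]

first_order = quad(a, b, x_hat) + sum(gi * di for gi, di in zip(gradient(a, b, x_hat), d))
second_order = first_order + half_trace(hessian(a), outer(d))
exact = quad(a, b, x_true)
```

Here `second_order` reproduces `exact`, while `first_order` misses it by exactly the trace term. In a filter the true error `d` is unknown, which is why the outer product is replaced by an estimated matrix such as the sample covariance of the estimation error.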

3.5.1. The Second-Order Extended H∞ Filter Solution

Theorem 1: Consider the minmax problem in (3-50) and use the second-order Taylor series described in (3-43)-(3-45) to approximate the nonlinear function in (3-42). The stationary point of J with respect to xa(0) and wa(t) is given by:

[Proof]: Please see Appendix I.

Theorem 2: Given the values of xa(0) and wa(t) described in Theorem 1, the

[Proof]: Please see Appendix II.

Theorem 3: According to Theorem 1 and Theorem 2, the SOE H∞ filter solution for the state space model (3-42) can be given by

3.6 Summary

The SOE H∞ filter-based robust MVDR beamformer for the acoustic environment has been proposed, and the detailed derivation of the SOE H∞ filter has also been given in this chapter. The proposed beamformer has been compared with the SOE Kalman filter-based robust MVDR beamformer. In the derivation of the SOE H∞ filter, the second-order Taylor series expansion is used to approximate the nonlinear function, and the second-order term is approximated by the sample covariance matrix of the estimation error. The SOE H∞ filter provides a rigorous method for dealing with systems that have model uncertainty.

Chapter 4

Experimental Results

This chapter presents experimental results in simulated and practical environments to assess the capability of the proposed TFR-based adaptive beamformer, the SOE H∞ filter and the robust MVDR beamformer based on the SOE H∞ filter. The experimental results for the TFR-based adaptive beamformer are shown in Section 4.1, and those for the SOE H∞ filter and the SOE H∞ filter-based robust MVDR beamformer are shown in Section 4.2 and Section 4.3, respectively.

4.1. Experimental Results of the Proposed Transfer Function Ratio-based Adaptive Beamformer

This section provides the experimental results of the proposed TFR-based adaptive beamformer. The proposed beamformer was tested both in a real room environment and in a car environment. In addition, it was evaluated with an automatic speech recognition (ASR) system to reflect practical applications.

Three speech enhancement algorithms, DSB [1], the reference-signal-based adaptive beamformer (RAB) implemented in the frequency domain [34] and the dual-source transfer-function generalized sidelobe canceller (DTF-GSC) [32], are adopted for comparison with the proposed algorithm. The performance criterion of the RAB algorithm can be written as

$$\min_{G} \; \big[ D(k,\omega) - G^{H}(k,\omega)\, \hat{X}(k,\omega) \big] \big[ D(k,\omega) - G^{H}(k,\omega)\, \hat{X}(k,\omega) \big]^{*} \tag{4-1}$$

where $\hat{X}(k,\omega)$ is the vector containing the linear combination of the present microphone received signal and the pre-recorded signal $A_{m1}(\omega)\tilde{S}_1(k,\omega)$. $\tilde{S}_1(k,\omega)$ is the representative speech signal at the position of the desired speech, and $A_{m1}(\omega)\tilde{S}_1(k,\omega)$ are the pre-recorded speech signals, which can be recorded when the environment is quiet. $D(k,\omega)$ is the reference signal, set to $A_{11}(\omega)\tilde{S}_1(k,\omega)$, and the adaptive weight $G(k,\omega)$ can be trained using the NLMS algorithm when the desired speech signal is inactive.
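A minimal sketch of one way to train such a weight with a complex NLMS update at a single frequency bin is given below. It minimizes the squared magnitude of the error between the reference and the filtered input, which is the form of the RAB criterion. The noise-free reference, the random input vectors, the regularization constant `delta` and the function name are assumptions for illustration. The step size of 0.1 and the initial weight 0.1+0.1i follow the experimental settings of this section.

```python
import random


def nlms_step(g, x_hat, d, mu=0.1, delta=1e-8):
    """One complex NLMS update for the error D - G^H X_hat at a single frequency bin."""
    err = d - sum(gi.conjugate() * xi for gi, xi in zip(g, x_hat))
    power = sum(abs(xi) ** 2 for xi in x_hat) + delta    # regularized input power
    g_new = [gi + mu * xi * err.conjugate() / power for gi, xi in zip(g, x_hat)]
    return g_new, err


random.seed(3)
M = 4                                                    # number of channels (assumed)
g_true = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(M)]  # made-up target
g = [0.1 + 0.1j] * M                                     # initial weight 0.1+0.1i
for _ in range(2000):
    x_hat = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(M)]
    d = sum(gt.conjugate() * xi for gt, xi in zip(g_true, x_hat))  # noise-free reference
    g, err = nlms_step(g, x_hat, d)

weight_error = max(abs(gi - gt) for gi, gt in zip(g, g_true))
```

Each update shrinks the error on the current frame by roughly a factor of (1 - mu), so in this noise-free toy case `weight_error` becomes negligible after the 2000 iterations. With real recordings the weight instead settles around the minimizer of the criterion.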

The DTF-GSC algorithm comprises three building blocks. The first is the FB, designed to block one competing speech while maintaining the desired speech signal. The second is the BM, which blocks both the desired speech and one competing speech. The FB and BM are designed with the TFRs of the desired speech and the competing speech. Finally, the residual noise from the BM is cancelled by the adaptive filter using the NLMS algorithm. Notably, in this experiment, the TFRs for the desired speech in the DTF-GSC algorithm are the same as those in the proposed algorithm.

In the RAB, DTF-GSC and proposed algorithms, we assume that a perfect desired-speech detection system exists, allowing the adaptive noise cancellation system to adapt its weights during inactive periods of the desired speech. The STFT size is 1024, with a shift of 320 samples and 64 zero-padding samples. In the RAB and DTF-GSC algorithms, the step size of the NLMS algorithm is set to 0.1, and the initial values of the adaptive weights of the RAB, DTF-GSC and proposed algorithms are all set to 0.1+0.1i. The TFRs for the DTF-GSC and proposed algorithms are estimated using 20 frames. For the proposed beamformer, the parameters r, β and γ in (2-12), (2-20) and (2-26) are set to 1, 10 and 2, respectively. The adaptation number $\tilde{k}$ is set to 10. The weighting matrices $P(0,\omega)$ and $S$ in (2-29) are both set to identity matrices, and $R$ in (2-29) is set to $\mathrm{diag}(1, 10^{-9})$.

Four objective performance indices are used to measure the waveform property directly. The first is the segmental signal-to-interference-plus-noise ratio (segSINR), defined as

$$\mathrm{segSINR\,(dB)} = \frac{1}{K} \sum_{k=0}^{K-1} 10 \log_{10} \left( \frac{\sum_{t=kL_s+1}^{kL_s+L_s} x_{1,s}^{2}(t)}{\sum_{t=kL_s+1}^{kL_s+L_s} \big( x_{1,s}(t) - g_y\, y(t) \big)^{2}} \right) \tag{4-2}$$

where $L_s$ is the frame length and k is the index of the frames in which the desired speech signal is active. Note that $x_{1,s}(t)$ is the desired signal component recorded by the first microphone, $g_y$ is the gain factor and $y(t)$ is the output of the algorithm. The second is the average SINR (avgSINR), defined as

$$\mathrm{avgSINR} = \frac{\sum_{t \in T_s} x^{2}(t) - \sum_{t \in T_n} x^{2}(t)}{\sum_{t \in T_n} x^{2}(t)} \tag{4-3}$$

where $T_s$ and $T_n$ denote the time periods in which only the desired speech is active and only the interference-plus-noise signals are active, respectively. The first quality measure places more emphasis on speech distortion than the second. The third quality measure is the segmental noise level (segNL),

$$\mathrm{segNL\,(dB)} = \frac{1}{K} \sum_{k=1}^{K} 10 \log_{10} \left( \sum_{i=1}^{I} g_y^{2} \cdot y^{2}(i + k \cdot I) \right) \tag{4-4}$$

where $y(t)$ is the algorithm output when only $s_2(t), \ldots, s_P(t)$ and $n_m(t)$ are active, I is the frame length and k is the frame index. A lower segNL indicates better noise suppression. The fourth quality measure is the log spectral distortion (LSD), defined as

$$\mathrm{LSD} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{W} \sum_{\omega=1}^{W} \Big( 20 \log_{10} |A_{11}(\omega)\, S_1(k,\omega)| - 20 \log_{10} |Y(k,\omega)| \Big)^{2} \tag{4-5}$$

where $Y(k,\omega)$ is the STFT of the algorithm output. LSD measures the speech distortion in the frequency domain. Note that a lower LSD corresponds to better performance.
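As a rough illustration of how the two segmental measures could be computed from time-domain signals, the sketch below implements a segSINR and a segNL of the general form described above. The zero-based framing from the first sample, the small constant `eps` that guards the logarithm, the function names and the test signals are assumptions and are not taken from the experimental code.

```python
import math


def seg_sinr_db(x_ref, y, frame_len, gain=1.0, eps=1e-12):
    """Mean over frames of 10*log10(sum x^2 / sum (x - gain*y)^2), as in segSINR."""
    n_frames = min(len(x_ref), len(y)) // frame_len
    total = 0.0
    for k in range(n_frames):
        seg = range(k * frame_len, (k + 1) * frame_len)
        num = sum(x_ref[t] ** 2 for t in seg)
        den = sum((x_ref[t] - gain * y[t]) ** 2 for t in seg)
        total += 10.0 * math.log10((num + eps) / (den + eps))
    return total / n_frames


def seg_nl_db(y_noise, frame_len, gain=1.0, eps=1e-12):
    """Mean over frames of 10*log10(sum (gain*y)^2), as in segNL (noise-only output)."""
    n_frames = len(y_noise) // frame_len
    total = 0.0
    for k in range(n_frames):
        seg = range(k * frame_len, (k + 1) * frame_len)
        total += 10.0 * math.log10(sum((gain * y_noise[t]) ** 2 for t in seg) + eps)
    return total / n_frames


# Toy signals: the output equals the reference plus a 10 % amplitude error.
x_ref = [math.sin(0.3 * t + 0.1) for t in range(400)]
y_out = [1.1 * v for v in x_ref]
sinr = seg_sinr_db(x_ref, y_out, frame_len=80)

# Toy noise-only output with constant amplitude 0.5.
noise_level = seg_nl_db([0.5] * 300, frame_len=100)
```

With a 10 % amplitude error, the error energy in every frame is one hundredth of the reference energy, so `sinr` evaluates to about 20 dB. The constant-amplitude noise gives a frame energy of 25, so `noise_level` is about 10·log10(25) dB.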

4.1.1. Real Room Environment

For the real room environment, the room dimensions are 10 m × 6 m × 3.6 m and the reverberation time at 1000 Hz is 0.52 s. A uniform linear microphone array of eight un-calibrated microphones separated by 0.05 m was constructed for this experiment. The amplified microphone signals were sampled at 8 kHz with 16-bit resolution. The microphone array was placed on a table at a distance of 2 m from the wall; a picture of the microphone array in the real room is shown in Fig. 4-1. The arrangement of the microphone array and sound sources is shown in Fig. 4-2. The desired speech signal at 0° consists of sentences from the TCC-300 database [71] spoken by 150 males and 150 females. Interference signals 2, 3 and 4 are speech signals spoken by three females, and interference signal 1 is a speech signal spoken by a male. Five conditions, denoted C1 to C5, are listed in Table 4-1.

The experimental results are shown in Fig. 4-3, where Mic#1 represents the contaminated speech recorded by the first microphone. The average input SINR ranges from 0 dB to -7 dB. As can be seen, the best performance is obtained by the proposed algorithm and the DSB performs worst. Since the DSB aligns only the direct-path signal, it does not take reflections into account and places no nulls directly in the interference signal directions.

Figure 4-1 Microphone array in real room

Figure 4-2 Configuration of microphones, desired speech, white noise and interference signals

Table 4-1 Five experimental conditions

Condition   Desired speech location   Stationary noise location   Interference speech location(s)
C1          0°                        -30°                        none
C2          0°                        -30°                        one of (30°, 45°, 60°, -60°)
C3          0°                        -30°                        two of (30°, 45°, 60°, -60°)
C4          0°                        -30°                        three of (30°, 45°, 60°, -60°)
C5          0°                        -30°                        30°, 45°, 60° and -60°

Figure 4-3 Experimental results in real room environment (a) segSINR results (b) avgSINR results (c) segNL results (d) LSD results. Each panel plots the measure (segSINR in dB, avgSINR in dB, segNL in dB, LSD) against conditions C1 to C5 for the Proposed, DTF-GSC, RAB, DSB and Mic#1 signals.

For the RAB algorithm, the finite impulse response coefficients $G(k,\omega)$ are trained to achieve two objectives simultaneously during the periods in which the desired speech is inactive: to suppress the interference and stationary noise signals, and to adjust the distorted desired speech of each microphone, $A_{m1}(\omega)\tilde{S}_1(k,\omega)$, to the same channel effect, $A_{11}(\omega)\tilde{S}_1(k,\omega)$. However, the finite number of taps and the NLMS adaptive algorithm are unlikely to achieve these two objectives fully at the same time, especially for complex channel dynamics (e.g., when competing speeches are present). This is unlike the DTF-GSC algorithm and the proposed algorithm, which separate these two objectives. The DTF-GSC algorithm and the proposed algorithm first suppress competing speech and adjust the desired speech channel effect using TFR techniques, and then minimize the residual noise with multi-channel adaptive

