
This chapter presents the reference-signal-based system architecture, which implicitly contains the information of the channel effect and the microphones’ characteristics. This architecture implicitly captures the acoustic behavior from the desired location to the microphone array and reduces the effort of directly performing microphone calibration and channel inversion. Furthermore, it can be applied to both near-field and far-field situations, which offers a significant advantage in speaker localization and beamformer algorithms. Extensions of this idea to further improve the ASR rates are described in the following chapters. Moreover, a novel speaker location detection algorithm based on the reference-signal-based architecture is also proposed. In addition, Chapter 6 compares the performance of the reference-signal-based time-domain adaptive beamformer with other well-known non-reference-signal-based beamformers to show the effectiveness of the proposed method.

Chapter 3

Reference-signal-based Frequency-domain Adaptive Beamformer

3.1 Introduction

The required computational effort can be large when a long FIR filter is used, e.g., 256 to 512 taps in the time-domain adaptive beamformer introduced in the previous chapter. For the subsequent ASR operation, additional effort is required to compute the discrete Fourier transform (DFT). One way to reduce the computational complexity is to compute the beamformer directly in the frequency domain, because ideally the long FIR filter can be replaced by a simple multiplication at each frequency bin (e.g., the FIR filter with dimension MP×1 is represented by a filter coefficient vector of dimension M×1 in the frequency domain, where M is the number of microphones and P denotes the number of FIR taps). Moreover, the purified speech signal produced by a frequency-domain beamformer can be sent directly to the ASR. As explained later in this chapter and in Chapter 6, the saving in computational effort is quite significant.

In a reference-signal-based beamformer, coefficient adjustment has two objectives: to minimize the interference signals and noises, and to equalize the channel effect (e.g., room acoustics). Channel equalization is important for ASR since channel distortion may greatly reduce the recognition rate.

By formulating the same problem in the frequency domain, channel distortion can be emphasized using a priori information. In this chapter, a penalty function is incorporated into the performance index to calculate the filter coefficient vectors. This proposed algorithm is called SPFDBB.

A real-time frequency-domain beamformer must apply the short-time Fourier transform (STFT). However, the window size of the STFT is fixed by the training data settings of the ASR. For an environment with a longer impulse response, the convolution between the channel and the speech source in the time domain cannot be modeled accurately as a multiplication in the frequency domain with a finite window size. Therefore, the finite window may not provide enough information for the coefficient adjustment, and it violates the assumption in the NLMS adaptation criterion that the filter coefficient vector and the error signal are independent of the input data. In this case, SPFDBB takes the frame average over several frames as a block to improve the approximation of the linear model shown in Eq. (2-1). In other words, a block of windowed data is used simultaneously to calculate the filter coefficient vectors in the SPFDBB algorithm. The number of frames in a block is denoted as the frame number L. Intuitively, a large frame number enhances the accuracy of the filter coefficient estimation. However, if the room acoustics change suddenly, the channel response is difficult to track quickly when a large frame number is used in the updating process. Furthermore, the

of L is chosen. Therefore, SPFDBB is further enhanced by allowing the frame number to be adapted on-line. A novel index called changing block values index (CBVI) is defined as the basis for adjusting the frame number. The overall algorithm is called FDABB.

The remainder of this chapter is organized as follows. Section 3.2 describes the system architecture and the corresponding dataflow. Section 3.3 presents SPFDBB, a reference-signal-based frequency-domain adaptive beamformer that utilizes the NLMS adaptation criterion. Section 3.4 introduces the other proposed method, FDABB, and analyzes the computational effort of SPFDBB, FDABB and the reference-signal-based time-domain beamformer. Two frequency-domain performance indexes, the source distortion ratio (SDR) and the noise suppression ratio (NSR), are defined in Section 3.5. Finally, conclusions are given in Section 3.6.

3.2 System Architecture

Figure 3-1 shows the overall system architecture. The pre-recorded speech signals, $S_1(\omega,k), \ldots, S_M(\omega,k)$, and the reference signal, $R(\omega,k)$, can be recorded in the same way as described in Chapter 2 when the environment is quiet. After the pre-recorded speech signals and the reference signal are acquired, the overall system automatically switches between the silent and speech stages based on the VAD result.

If the VAD result equals zero, which means that no speech signal is contained in the received signals $\{x_1(n) \,\cdots\, x_M(n)\}$, the system is switched to the silent stage, in which the adaptation of FDABB or SPFDBB is turned on. The filter coefficient vectors of FDABB or SPFDBB are adjusted through the NLMS adaptation criterion in this stage.

Notably, SPFDBB is a part of FDABB and can be executed separately.

On the other hand, if the received sound signal is detected as containing a speech signal, the system is switched to the second stage, called the speech stage. In this stage, the filter coefficient vectors obtained in the silent stage are applied to the lower beamformer to suppress the interference signals and noises and to enhance the speech signal. Finally, the purified speech signal $\hat{Y}(\omega,k)$ is sent directly to the ASR.
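The switching between the two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `adapt` and `beamform` are hypothetical callables standing in for the NLMS coefficient update and the lower beamformer, and the VAD decisions are simply supplied as a sequence of 0/1 flags.

```python
from typing import Callable, List, Sequence


def run_two_stage(frames: Sequence[list], vad_flags: Sequence[int],
                  adapt: Callable[[list], None],
                  beamform: Callable[[list], complex]) -> List[complex]:
    """Route each frame to the silent stage or the speech stage.

    adapt and beamform are hypothetical placeholders (assumptions),
    not functions defined in the source.
    """
    outputs = []
    for frame, vad in zip(frames, vad_flags):
        if vad == 0:
            # Silent stage: no speech detected, so coefficients are adapted.
            adapt(frame)
        else:
            # Speech stage: the stored coefficients are applied to the frame.
            outputs.append(beamform(frame))
    return outputs
```

Only frames flagged as speech produce an output; frames flagged as silent are used exclusively for adaptation.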

Figure 3-1 Overall system structure

3.3 SPFDBB Using NLMS Adaptation Criterion

The linear model in Eq. (2-1) is transformed to the frequency domain by padding the short-time Fourier transform of the error signal with zeros to make it twice as long as the window length. The error signal at frequency ω and frame k is written as:

$$e(\omega,k) = R(\omega,k) - \mathbf{Q}^{H}(\omega,k)\,\hat{\mathbf{X}}(\omega,k) \tag{3-1}$$

where $R(\omega,k)$ is the reference signal in the frequency domain, $\mathbf{Q}(\omega,k)$ denotes the filter coefficient vector we intend to find, and $\hat{\mathbf{X}}(\omega,k) = [\hat{X}_1(\omega,k) \,\cdots\, \hat{X}_M(\omega,k)]^{T}$ is the training signal vector at frequency ω and frame k. The optimal set of filter coefficient vectors can be found using the formula:

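$$\min_{\mathbf{Q}} \left[R(\omega,k) - \mathbf{Q}^{H}(\omega,k)\,\hat{\mathbf{X}}(\omega,k)\right]\left[R(\omega,k) - \mathbf{Q}^{H}(\omega,k)\,\hat{\mathbf{X}}(\omega,k)\right]^{*} \tag{3-2}$$

where the superscript ∗ denotes the complex conjugate.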

The normalized LMS solution of Eq. (3-2) is given by:

)

Consequently, the purified output signal can be obtained by the following equation:

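$$\hat{Y}(\omega,k) = \mathbf{Q}^{H}(\omega,k)\,\mathbf{X}(\omega,k) \tag{3-4}$$

where the received signal vector $\mathbf{X}(\omega,k) = [X_1(\omega,k) \,\cdots\, X_M(\omega,k)]^{T}$ contains speech source, interference and noise. From Eq. (3-1), the filter coefficient vector equalizes the acoustic channel dynamics and also creates a null space for the interference and noise.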
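The NLMS update and the purified output described above can be sketched for a single frequency bin as follows. This is an illustrative sketch only: it assumes the error convention e = R − Q^H X̂, and the step size `step` and regularizer `eps` are hypothetical parameters whose names and values are not taken from the source.

```python
from typing import List


def nlms_bin_update(q: List[complex], x_hat: List[complex], r: complex,
                    step: float = 0.5, eps: float = 1e-8) -> List[complex]:
    """One normalized LMS step for one frequency bin.

    q     : current filter coefficient vector Q (length M)
    x_hat : training signal vector X_hat (length M)
    r     : reference signal R
    step, eps : assumed step size and regularizer (not from the source)
    """
    # Error between the reference and the filter output: e = R - Q^H X_hat
    e = r - sum(qi.conjugate() * xi for qi, xi in zip(q, x_hat))
    # Input power used for normalization
    power = sum(abs(xi) ** 2 for xi in x_hat) + eps
    # Q <- Q + step * X_hat * conj(e) / power
    return [qi + step * xi * e.conjugate() / power for qi, xi in zip(q, x_hat)]


def beamform_bin(q: List[complex], x: List[complex]) -> complex:
    """Purified output Y_hat = Q^H X for one frequency bin."""
    return sum(qi.conjugate() * xi for qi, xi in zip(q, x))
```

With this convention, a step of 1 and a negligible regularizer drive the a posteriori error on the current frame to zero, which is the usual normalized-LMS property; smaller steps trade convergence speed for robustness to noise.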

As mentioned above, the filter coefficient vector $\mathbf{Q}(\omega,k)$ equalizes the channel response and rejects interference signals and noises. To weight these two objectives differently, a soft penalty function is added to the performance index as

)

Then, the iterative equation utilizing the NLMS adaptation criterion can be shown as:

If the soft penalty is set to infinity, the system focuses only on minimizing the channel distortion. On the other hand, the system returns to the formulation in Eq. (3-2) when the soft penalty is set to zero.

The problem with Eq. (3-7) is that the window size has to equal that used in the ASR to ensure calculation accuracy. However, the window size may be too small for cases where the acoustic channel response is long (e.g., a long reverberation path). Because the perturbation caused by the channel model error is highly correlated with the reference signal, rather than behaving like uncorrelated noise, the updating process will not converge to a fixed channel response, as shown in Fig 6.3, and two sequential frames are highly correlated. Taking the frame average over several frames (denoted as L) allows the channel response to be approximated, since information of the channel response which is not contained in one frame could be regarded as an external noise in the next

[

( , ) ( , ) ( , ) ( , ) ( , ) ( , ) ( , ) ( , )

]

The performance indexes given by Eqs. (3-2) and (3-5) are two special cases of Eq. (3-8), with $\Lambda_1 = 1$, $\Lambda_2 = \Lambda_3 = -1$, $\Lambda_4 = 1$, $L = 1$ and with $\Lambda_1 = 1+\mu$, $\Lambda_2 = \Lambda_3 = -1$, $\Lambda_4 = 1$, $L = 1$, respectively. Because of the long reverberation path and the fixed window size, the error signal between different frames should be taken into consideration. Therefore, the soft penalty approach can be applied to several windows (frames) by choosing

⎥⎥

where I is an identity matrix with dimension L×L. Significantly, when the channel response is longer than the window size, the reference signal in previous windows is added to the following windows of the received signal. A good estimation should collect the information of several frames to eliminate this correlation effect. The choice of $\Lambda_1$ considers the cross-term as a factor to minimize this correlation effect.

Using the performance index as Eq. (3-8), the SPFDBB can be summarized as:

{ }

is the training signal matrix with dimension M ×L. The kth error signal vector can be denoted as:

Y is the purified signal vector of the SPFDBB,

and

where Eq. (3-12) means the sum of autocorrelations of the training signal at the kth block.

The purified speech signal at the kth block can be represented as:

) every L frames. In this way, the correlation between two sequential blocks of data may be lower than when the next adaptation is performed by overlapping (L−1) frames (e.g., k = 0, 1, 2, 3, …). Moreover, the approach in which the step k is chosen as k = 0, 1, 2, 3, … significantly increases the computational effort and the memory consumption as the value of L increases, so the goal of carrying out the beamformer in the frequency domain could not be achieved. As a result, the proposed SPFDBB (e.g., k = 0, L, 2L, 3L, …) is an efficient approach that keeps the computational effort in the frequency domain low.
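The difference between the two block schedules can be made concrete by counting how many adaptation steps each one performs. The sketch below is illustrative only; the function name `block_starts` and the example values (100 frames, L = 10) are assumptions introduced here, not values from the source.

```python
from typing import List


def block_starts(num_frames: int, L: int, overlap: bool) -> List[int]:
    """Frame indices k at which an L-frame block adaptation is performed."""
    # Overlapping L-1 frames advances one frame at a time (k = 0, 1, 2, ...);
    # disjoint blocks advance L frames at a time (k = 0, L, 2L, ...).
    hop = 1 if overlap else L
    return list(range(0, num_frames - L + 1, hop))


disjoint = block_starts(100, 10, overlap=False)
sliding = block_starts(100, 10, overlap=True)
```

For the same data, the overlapping schedule performs roughly L times as many block updates as the disjoint one, and each update still processes L frames, which is the extra cost the text refers to.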

3.4 FDABB and Computational Effort Analysis

3.4.1 FDABB Using NLMS Adaptation Criterion

The number of frames L in the SPFDBB greatly influences the performance, as shown in Chapter 6. However, a larger value of L requires more training data. To cope with the window size problem, an index-based algorithm is proposed to adjust the value of L automatically. Using Eq. (3-1), the error signal can be separated as:

(

( , ) ( , )

)

Taking a frame average over L frames

{ }

If no correlation exists between the speech source and the interference signals or noises, the first and second terms of Eq. (3-16) can be taken as zero. Consequently, this equation can be rewritten as:

$$E\{\varepsilon_x(\omega,k)\,\mathbf{N}^{H}(\omega,k)\}_L = -\mathbf{Q}^{H}(\omega,k)\,E\{\mathbf{N}(\omega,k)\,\mathbf{N}^{H}(\omega,k)\}_L \tag{3-17}$$

The optimal coefficient vectors should lie in the null space of $E\{\mathbf{N}(\omega,k)\,\mathbf{N}^{H}(\omega,k)\}_L$, in which case the norm of Eq. (3-17) is zero. A large norm of Eq. (3-17) indicates a strong negative or positive correlation between the error signal and the interference signals or noises. In other words, a large norm means that the interference signals or noises affect the error significantly and that convergence has not been achieved with the frame number L. If the norm of Eq. (3-17) is small, the present value of L offers little further help in finding better coefficients. Consequently, the value of L is increased to improve the performance of the algorithm. Conversely, if the room acoustics change suddenly, the present norm of Eq. (3-17) becomes much larger than the previous one, and the value of L should be reset to its initial value to handle this sudden change. The CBVI at frequency ω and block i is defined in Eq. (3-18). The parameters in Eq. (3-18) are chosen as $\alpha_0 = 3$, $\alpha_1 = 2$, and $\alpha_2 = 1$ to form a high-pass filter that increases the sensitivity to temporal variations of the room acoustics and to the convergence. Figure 3-2 summarizes the proposed FDABB algorithm.

{ } { }

Figure 3-2 FDABB using NLMS adaptation criterion
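The quantity monitored by FDABB, the norm of the block-averaged cross-correlation in Eq. (3-17), can be computed as in the sketch below. The second function is a hypothetical stand-in for the frame-number adjustment: it does not reproduce the CBVI of Eq. (3-18), and its parameters `L_init`, `L_max` and `jump_ratio` are assumptions introduced purely to illustrate the "grow L, reset on a sudden jump" behavior described in the text.

```python
import math
from typing import List


def cross_corr_norm(errors: List[complex], noises: List[List[complex]]) -> float:
    """2-norm of the average of eps(k) * N^H(k) over the L frames of a block.

    errors : L error samples for one frequency bin
    noises : L interference/noise vectors N (each of length M) for that bin
    """
    L = len(errors)
    M = len(noises[0])
    avg = [sum(errors[k] * noises[k][m].conjugate() for k in range(L)) / L
           for m in range(M)]
    return math.sqrt(sum(abs(a) ** 2 for a in avg))


def next_frame_number(norm_now: float, norm_prev: float, L: int,
                      L_init: int = 2, L_max: int = 32,
                      jump_ratio: float = 4.0) -> int:
    """Hypothetical rule (not Eq. (3-18)): reset L on a sudden jump, else grow it."""
    if norm_prev > 0 and norm_now > jump_ratio * norm_prev:
        return L_init            # sudden change in room acoustics
    return min(L + 1, L_max)     # otherwise use a longer block next time
```

A practical controller would apply such a rule per frequency bin or to an aggregate over bins; that design choice is left open here.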

3.4.2 Computational Effort Analysis

This section analyzes the computational effort of the reference-signal-based time-domain adaptive beamformer, SPFDBB, and FDABB from two different viewpoints: the coefficients adaptation phase and the lower beamformer phase. In the coefficients adaptation phase, the speaker is silent and the coefficients are updated with the iteration equation (2-2) for the reference-signal-based time-domain adaptive beamformer, with Eq. (3-9) for SPFDBB, and with Eqs. (3-9) and (3-18) for FDABB.

The computational effort is based on one second of input data for each phase, and is shown in Table 3-1. The sampling rate is denoted as $f_s$, meaning that the considered input data contains $f_s$ samples. The length of the STFT is represented as $B_l$; the length of input data in a frame is represented as $B_i$, and the shift size of the STFT is represented as $B_s$. The number of filter taps of the reference-signal-based time-domain adaptive beamformer is assumed to be P, and the dimension of the filter coefficient vector in the time domain is given by MP×1. The function I(z) takes the integer part of z only.

As the value of L increases, the computational efforts of SPFDBB and FDABB decrease. The computational effort of the STFT is estimated using the radix-2 decimation-in-frequency FFT [87]. Notably, the STFT of the M microphone signals is already necessary for the VAD and the speaker location detection algorithm. Therefore, when the overall system architecture is considered, the computational load of the STFT could reasonably be omitted from the real multiplication requirement.

Furthermore, SPFDBB and FDABB could be implemented in parallel to decrease the beamformer load. Chapter 6 lists the computational effort of the proposed algorithms in the two simulations.
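A rough feel for the STFT cost can be obtained from the textbook operation count of a radix-2 FFT. The sketch below is an estimate under stated assumptions and does not reproduce the entries of Table 3-1: it assumes (Bl/2)·log2(Bl) complex multiplications per FFT, four real multiplications per complex multiplication, and I((fs − Bi)/Bs) + 1 frames in one second of data.

```python
def stft_real_mults_per_second(fs: int, Bl: int, Bi: int, Bs: int, M: int) -> int:
    """Approximate real multiplications for the STFT of M channels over one second.

    fs : sampling rate (samples per second)
    Bl : STFT length (must be a power of two for a radix-2 FFT)
    Bi : number of input samples per frame
    Bs : STFT shift size
    M  : number of microphones
    """
    if Bl < 2 or Bl & (Bl - 1):
        raise ValueError("Bl must be a power of two for a radix-2 FFT")
    frames = (fs - Bi) // Bs + 1          # assumed frame count in fs samples
    stages = Bl.bit_length() - 1          # log2(Bl)
    complex_mults = (Bl // 2) * stages    # textbook radix-2 count per FFT
    return M * frames * complex_mults * 4 # 4 real mults per complex mult
```

The count is an upper-bound style estimate, since it includes trivial twiddle factors that an optimized FFT would skip.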

Table 3-1 Real Multiplication Requirement in One Second Input Data

Multiplication Requirement

Adaptation Phase Lower Beamformer Phase Time-domain adaptive

3.5 Frequency-domain Performance Indexes

From Eq. (3-1), the filter coefficient vector should equalize the acoustic channel dynamics and create a null space for the interference and noise. Two frequency-domain performance indices, SDR and NSR, are defined for these effects. SDR measures the reduction in source distortion caused by inexact filter coefficient vector estimation, and NSR measures the improvement in noise reduction.

Assuming that the true channel response is $\mathbf{W}(\omega) = [W_1(\omega) \,\cdots\, W_M(\omega)]^{T}$, the source distortion can be represented as:

)

Thus, the optimal filter coefficient vector should satisfy two equality equations:

1

Since the denominators of Eqs. (3-21) and (3-22) are initial reference values, SDR or NSR represents the performance enhancement derived by comparing certain iteration with the initial filter coefficient vector. Smaller values of SDR and NSR indicate smaller source distortion and higher noise suppression performance.

3.6 Summary

This chapter presents two novel reference-signal-based frequency-domain beamformers, FDABB and SPFDBB, to overcome problems such as calibration, near-field versus far-field cases, resolution and desired signal cancellation. These approaches not only reduce the computational effort significantly in ASR-based applications compared with the reference-signal-based time-domain adaptive beamformer, but also improve performance in a noisy environment. Moreover, FDABB, which can automatically adjust the frame number to different environments, is particularly suitable for practical applications.

Chapter 4

H∞ Adaptation Criterion

4.1. Introduction

The NLMS adaptation criterion used in the preceding chapter does not make any assumption about the pre-recorded signals and the disturbance, unlike exact least-squares algorithms such as recursive least squares (RLS). The solution of the NLMS adaptation criterion recursively updates the filter coefficient vector along the direction of the instantaneous gradient of the squared error. Therefore, the NLMS adaptation criterion is more robust to disturbance variation than the RLS algorithm. For example, it has been observed that the NLMS has better tracking capabilities than the RLS algorithm in the presence of non-stationary inputs [88]. However, the performance of the NLMS depends upon the properties of the modeling errors, which may lead to a large coefficient vector estimation error. Consequently, it is necessary to design a robust adaptive algorithm that guarantees that if the disturbance energy is small, the coefficient vector estimation error will be small as well (in energy).

There has been increasing interest in the mini-max estimation method [89-97] called the H∞ algorithm, which is more robust and less sensitive to model uncertainties and parameter variations than the H2 adaptation criterion (such as the Kalman filter). This is because no a priori knowledge of the disturbance statistics is required in the H∞ algorithm, which means that the H∞ algorithm can accommodate all conceivable disturbances that have finite energy. Moreover, the estimation criterion of the H∞ algorithm is to minimize the worst possible effects of the disturbances (modeling errors and additive noises) on the signal estimation error. Actually, the NLMS adaptation criterion is the central a posteriori H∞-optimal filter [94]. However, that H∞-optimal filter minimizes the worst possible effects of the disturbances on the filtered output error, whereas the goal of the reference-signal-based beamformer is to estimate the filter coefficient vector itself rather than to minimize the filtered output error. In this case, the criterion of the H∞-optimal filter has to be modified to address the problem of filter coefficient vector estimation (e.g., minimizing the coefficient vector estimation error).

The remainder of this chapter is organized as follows. Section 4.2 describes the definition of the H∞-norm and the recursive solution of the time-domain adaptive beamformer that utilizes the H∞ adaptation criterion. Section 4.3 applies the H∞ adaptation criterion to the two proposed frequency-domain adaptive beamformers, SPFDBB and FDABB, and analyzes the computational effort of the two frequency-domain beamformers and the time-domain beamformer using the H∞ adaptation criterion. Section 4.4 defines time-domain performance indices for the experiments in Chapter 6 to measure the robustness of the H∞ adaptation criterion. Finally, a conclusion is given in Section 4.5.

4.2. Time-Domain Adaptive Beamformer Using H∞ Adaptation Criterion

4.2.1 Definition of H∞-norm

The H∞-norm defines the worst-case response of a system. If Z denotes a transfer operator that maps a causal input sequence $\{u_i\}$ to a causal output sequence $\{y_i\}$, as shown in Fig. 4-1, the H∞-norm of Z is defined as

$$\|Z\|_\infty = \sup_{u \neq 0} \frac{\|y\|_2}{\|u\|_2} \tag{4-1}$$

where the notation $\|\cdot\|_2$ denotes the 2-norm. The H∞-norm can thus be regarded as the maximum energy gain from the input u to the output y.
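The "maximum energy gain" reading of Eq. (4-1) can be illustrated numerically. The sketch below assumes a finite horizon, so that the operator Z can be written as a real matrix acting on the stacked input sequence; in that case the supremum equals the largest singular value of Z, which is estimated here by power iteration on Z^T Z. The function names are hypothetical, and the iteration assumes that the all-ones start vector is not orthogonal to the dominant direction.

```python
import math
from typing import List


def energy_gain(Z: List[List[float]], u: List[float]) -> float:
    """Ratio ||Z u||_2 / ||u||_2 for one non-zero input sequence u."""
    y = [sum(zij * uj for zij, uj in zip(row, u)) for row in Z]
    return math.sqrt(sum(v * v for v in y)) / math.sqrt(sum(v * v for v in u))


def worst_case_gain(Z: List[List[float]], iters: int = 200) -> float:
    """Estimate sup over u of ||Z u|| / ||u|| by power iteration on Z^T Z."""
    rows, cols = len(Z), len(Z[0])
    u = [1.0] * cols
    for _ in range(iters):
        y = [sum(zij * uj for zij, uj in zip(row, u)) for row in Z]      # y = Z u
        u = [sum(Z[i][j] * y[i] for i in range(rows)) for j in range(cols)]  # Z^T y
        norm = math.sqrt(sum(v * v for v in u))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space of Z")
        u = [v / norm for v in u]
    return energy_gain(Z, u)
```

Any particular input gives a gain no larger than the worst-case value, which is exactly the bound that an H∞ criterion seeks to keep small.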

Figure 4-1 Transfer operator Z from input $\{u_i\}$ to output $\{y_i\}$

4.2.2 Formulation of Time-Domain Adaptive Beamformer

Figure 4-2 shows the overall architecture of the time-domain adaptive beamformer using the H∞ adaptation criterion. The data flow and architecture are almost the same as those introduced in Chapter 2, except that the H∞ adaptation criterion is used. In the silent stage (VAD = 0), the filter coefficient vectors are adapted through the H∞ adaptation criterion. In the speech stage (VAD = 1), the computed filter coefficient vectors are applied to the lower beamformer to suppress the interference signals and noises and to derive the purified speech signal.

Based on the system architecture shown in Fig. 4-2, the formulation of speech enhancement system can be expressed as the following linear model:

$$r(n) = \hat{\mathbf{x}}^{T}(n)\,\mathbf{q} + e(n) \tag{4-2}$$

where $r(n)$ is the reference signal and $\hat{\mathbf{x}}(n) = [\hat{\mathbf{x}}_1(n) \,\cdots\, \hat{\mathbf{x}}_M(n)]^{T}$ is an MP×1 training signal vector. $\hat{\mathbf{x}}_i(n) = [\hat{x}_i(n) \,\cdots\, \hat{x}_i(n-P+1)]$ is a 1×P training signal vector, and each component in the silent stage is constructed from the linear combination of the pre-recorded speech signals and the online recorded interference signals or noises as $\hat{x}_i(n) = s_i(n) + n_i(n)$. M denotes the number of microphones and P denotes the number of filter taps. Additionally, $\mathbf{q} = [q_{11} \,\cdots\, q_{1P} \,\cdots\, q_{M1} \,\cdots\, q_{MP}]^{T}$ is the MP×1 unknown filter coefficient vector in the time domain that we intend to estimate. $e(n)$ is the unknown disturbance, which may also include modeling error.
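The stacking of the M microphone channels into the MP×1 training vector of Eq. (4-2) can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes real-valued time-domain samples stored as `x_hat[i][t]` for microphone i at time t, and it requires n ≥ P − 1 so that all delayed samples exist.

```python
from typing import List


def stack_training_vector(x_hat: List[List[float]], n: int, P: int) -> List[float]:
    """Build [x_1(n) ... x_1(n-P+1) ... x_M(n) ... x_M(n-P+1)] (length M*P)."""
    if n < P - 1:
        raise ValueError("need n >= P - 1 so that every delayed sample exists")
    return [x_hat[i][n - p] for i in range(len(x_hat)) for p in range(P)]


def model_output(x_vec: List[float], q: List[float]) -> float:
    """Disturbance-free part of Eq. (4-2): the inner product x_hat^T(n) q."""
    return sum(x * c for x, c in zip(x_vec, q))
```

The coefficient ordering in `q` must match the stacking order, with the P taps of microphone 1 first, followed by those of microphone 2, and so on.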

Figure 4-2 System architecture of the time-domain adaptive beamformer using H∞ adaptation criterion

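The problem for the proposed speech enhancement system is to find an estimation strategy that uses all the information available from time 1 to time n such that the H∞-norm from the disturbances to the coefficient vector estimation error is minimized, where the estimation error is defined as:

$$\tilde{\mathbf{q}}(n) = \mathbf{q} - \hat{\mathbf{q}}(n) \tag{4-3}$$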

Figure 4-3 Transfer operator from disturbances to coefficient vector estimation error

To apply the H∞ adaptation criterion, the linear model in Eq. (4-2) is transformed into an equivalent state-space form:

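$$\begin{cases} \mathbf{q}(n+1) = \mathbf{q}(n) \\ r(n) = \hat{\mathbf{x}}^{T}(n)\,\mathbf{q}(n) + e(n) \end{cases} \tag{4-4}$$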

To find the optimal H∞ estimation, the criterion in the sense of H∞-based filtering is:

( ) ∑

where $\|\cdot\|^{2}$ denotes the square of the 2-norm. However, a closed-form solution of Eq. (4-5) is unavailable for general cases. Therefore, it is common in the literature to relax the minimization condition and settle for a suboptimal solution. Given a scalar $\gamma_q > 0$, find an H∞ suboptimal estimation strategy $\hat{\mathbf{q}}(n) = \Psi\big(r(1), \ldots, r(n);\, \hat{\mathbf{x}}(1), \ldots, \hat{\mathbf{x}}(n)\big)$ that achieves $\|T(\Psi)\|_\infty < \gamma_q$. In other words,