
Chapter 1 Introduction

1.5 Dissertation Organization

This chapter provides a brief introduction to general microphone-array-based speech enhancement systems, including overviews of DOA algorithms and beamformers. It also briefly discusses the three main components of the proposed reference-signal-based speech enhancement system. Chapter 2 introduces the reference-signal-based time-domain adaptive beamformer using the NLMS adaptation criterion. Chapter 3 presents SPFDBB and FDABB using the NLMS adaptation criterion and analyzes the computational efforts of the two proposed frequency-domain beamformers and the time-domain beamformer. Chapter 4 studies the robustness of the H∞ adaptation criterion. Chapter 5 presents the proposed reference-signal-based speaker-location detection approach for detecting single- and multiple-speaker locations.

Chapter 6 shows the simulation results as well as the experimental results in a real environment.

Chapter 7 gives some concluding remarks and avenues for future research.

Chapter 2

Reference-signal-based Time-domain Adaptive Beamformer

2.1 Introduction

Speech enhancement systems are becoming increasingly important, especially with the development of automatic speech recognition (ASR) applications. Although various solutions have been proposed to reduce cancellation of the desired signal, particularly desired speech, in noisy environments, the recognition rate is still not satisfactory. Earlier approaches, such as the delay-and-sum (DS) beamformer [82], the Frost beamformer [45], and the generalized sidelobe canceller (GSC) [46], perform well only in ideal cases, where the microphones are mutually matched and the environment is a free space. The causes of performance degradation include array steering vector mismodeling due to imperfect array calibration [47] and channel effects (e.g., the near-field/far-field problem [83], environment heterogeneity [84], and source local scattering [85]). To manage these limitations, linearly constrained minimum variance (LCMV)-based techniques [33], [57] have been developed to reduce uncertainty in the sound signal's look direction. However, these approaches remain limited in scenarios with look-direction mismatch. Cox [58] employed the white noise gain constraint to overcome the problem of arbitrary steering vector mismatch; unfortunately, no clear guidelines are available for choosing its parameters. Hoshuyama et al. [59] proposed two robust constraints on blocking matrix design. Gannot et al. [86] proposed a new channel estimation method for the standard GSC architecture in the frequency domain, but loud noises, and in particular circuit noise, heavily degrade its channel estimation accuracy. Vorobyov et al. [60] proposed an approach based on optimizing the worst-case performance to overcome an unknown steering vector mismatch. However, the worst case is defined as a small random perturbation, which may not be suitable for general cases.

Dahl et al. [61] proposed the reference-signal-based time-domain adaptive beamformer using the NLMS adaptation criterion to perform indirect microphone calibration and to minimize the speech distortion due to the channel effect, using pre-recorded speech signals and a reference signal. Moreover, Yermeche et al. [62] and Low et al. [63] also utilized the reference signal to estimate the source correlation matrix and the calibration correlation vector. The methods in [61-63] are fundamentally the same, except that the works in [62-63] do not require a VAD. However, a VAD is useful for finding the speaker's location and for enabling the speech purification system and the speech recognizer. Therefore, the VAD-free property of [62-63] offers no additional benefit in ASR applications.

The following section describes the system architecture of the reference-signal-based time-domain adaptive beamformer and the corresponding dataflow. It also presents how to derive the pre-recorded and reference signals. Finally, conclusions are given in Section 2.3.

2.2 System Architecture

The architecture of the reference-signal-based time-domain adaptive beamformer is shown in Fig. 2-1. A speech signal passing through multiple acoustic channels is monitored at spatially separated sensors (a microphone array). Two kinds of signals, the pre-recorded speech signals, $\{s_1(n), \ldots, s_M(n)\}$, and the reference signal, $r(n)$, are necessary before executing the reference-signal-based time-domain adaptive beamformer. Let M denote the number of microphones. A set of pre-recorded speech signals is collected by placing a loudspeaker or a person in the desired position, and by letting the loudspeaker emit, or the person speak, a short sentence while the environment is quiet. Therefore, the pre-recorded speech signals provide a priori information between the speaker and the microphone array. Additionally, the reference signal is acquired from the original speech source emitted by the loudspeaker, or by using another microphone located near the person to record the speech. In practice, the loudspeaker or the person should move slightly around the desired position to obtain an effective recording. For example, Fig. 2-2 illustrates a vehicular environment with a headset microphone and a microphone array. The person is right at the desired location and speaks several sentences while the environment is quiet. The sentences are simultaneously recorded by the headset microphone and the microphone array. The speech signals collected by the headset microphone and the microphone array are called the reference signal and the pre-recorded speech signals, respectively. Notably, the user does not need the headset microphone during online applications.

After collecting the pre-recorded speech signals and the reference signal, the complete procedure of the reference-signal-based time-domain beamformer is divided into two stages, the silent stage and the speech stage, based on the result of the VAD algorithm.
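As a sketch, the VAD-driven stage switching can be outlined as follows. The energy-based VAD and its threshold are placeholder assumptions for illustration; the dissertation does not specify the VAD internals:

```python
import numpy as np

def vad(frame, threshold=1e-3):
    """Placeholder energy-based VAD: returns 0 for silence, 1 for speech."""
    return 1 if np.mean(np.square(frame)) > threshold else 0

def select_stage(frame):
    """Route one received frame to the silent stage (adapt the filter
    coefficients) or the speech stage (run the lower beamformer)."""
    return "silent" if vad(frame) == 0 else "speech"
```

In a real deployment, the VAD decision would gate which of the two processing paths in Fig. 2-1 consumes the current frame.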


Figure 2-1 System architecture of the reference-signal-based time-domain adaptive beamformer

Figure 2-2 Installation of the array and headset microphone inside a vehicle

In other words, the VAD result decides whether the system switches to the silent stage or the speech stage. First, if the VAD result equals zero, meaning that the received signals, $\{x_1(n), \ldots, x_M(n)\}$, contain no speech (i.e., the received signals are purely environmental noises, denoted as $\{n_1(n), \ldots, n_M(n)\}$), then the system is switched to the first stage: the silent stage. Given that the environmental noises are additive, the received signal while the speaker is talking in a noisy environment can be expressed as a linear combination of the pre-recorded speech signals and the environmental noises. Therefore, in this stage, the system combines the online-recorded environmental noises, $\{n_1(n), \ldots, n_M(n)\}$, with the pre-recorded speech database, $\{s_1(n), \ldots, s_M(n)\}$, to construct the training signals, $\{\hat{x}_1(n), \ldots, \hat{x}_M(n)\}$, and performs the NLMS adaptation criterion to derive the filter coefficient vectors. Notably, the filter coefficient vectors are updated via the reference signal and the training signals, thus implicitly solving the calibration problem.
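The construction of the training signals in the silent stage amounts to adding the live noise recording to the pre-recorded speech, channel by channel. A minimal sketch (the (M, N) array layout is an illustrative assumption):

```python
import numpy as np

def build_training_signals(s, n):
    """Form the training signals x_hat_i(n) = s_i(n) + n_i(n) per microphone.
    s: pre-recorded speech, shape (M, N); n: online-recorded noise, (M, N)."""
    if s.shape != n.shape:
        raise ValueError("pre-recorded speech and noise recordings must align")
    return s + n
```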

Secondly, if the received sound signal is detected as containing speech, then the system is switched to the second stage, called the speech stage. In this stage, the filter coefficient vectors obtained in the first stage are applied to the lower beamformer to suppress noises and enhance the speech signal. Finally, the single-channel purified speech signal $\hat{y}(n)$ is transformed to the frequency domain and then sent to the automatic speech recognizer. Because the variation between the pre-recorded speech signals and the reference signal contains useful information about the channel dynamics, electronic equipment uncertainties, and microphone characteristics, the method potentially outperforms other uncalibrated algorithms in real applications. Figure 2-3 presents the flowchart of the reference-signal-based time-domain adaptive beamformer.

Figure 2-3 Flowchart of the reference-signal-based time-domain adaptive beamformer

While the speaker in the desired location is silent, the formulation of the reference-signal-based time-domain beamformer can be expressed as the following linear model:

$r(n) = \boldsymbol{q}^T \hat{\boldsymbol{x}}(n) + e(n) = \boldsymbol{q}^T \left( \boldsymbol{s}(n) + \boldsymbol{n}(n) \right) + e(n)$   (2-1)

where the superscript T denotes the transpose operation and $e(n)$ is the error signal in the time domain. Notice that italic fonts represent scalars, bold italic fonts represent vectors, and bold upright fonts represent matrices in this dissertation. Let $\boldsymbol{q} = [\boldsymbol{q}_1^T \cdots \boldsymbol{q}_M^T]^T$ denote the $MP \times 1$ filter coefficient vector of the beamformer that we intend to estimate.

The corresponding vectors of the signals defined above are $\hat{\boldsymbol{x}}_i(n) = [\hat{x}_i(n)\ \hat{x}_i(n-1)\ \cdots\ \hat{x}_i(n-P+1)]^T$, $i = 1, \ldots, M$, stacked as $\hat{\boldsymbol{x}}(n) = [\hat{\boldsymbol{x}}_1^T(n) \cdots \hat{\boldsymbol{x}}_M^T(n)]^T$.

The well-known normalized LMS solution, obtained by minimizing the power of the error signal, is represented in Eq. (2-2):

$\boldsymbol{q}(n+1) = \boldsymbol{q}(n) + \mu_0\, e(n)\, \hat{\boldsymbol{x}}(n) \,/\, \left( \gamma + \hat{\boldsymbol{x}}^T(n)\hat{\boldsymbol{x}}(n) \right)$   (2-2)

where $\mu_0$ is the step size and $\gamma$ is a small constant included to ensure that the update term does not become excessively large when $\hat{\boldsymbol{x}}^T(n)\hat{\boldsymbol{x}}(n)$ temporarily becomes small. The purified signal can be calculated by

$\hat{y}(n) = \boldsymbol{q}^T \boldsymbol{x}(n)$

where $\boldsymbol{x}(n) = [\boldsymbol{x}_1^T(n) \cdots \boldsymbol{x}_M^T(n)]^T$ is the signal vector acquired by the microphone array, and $\boldsymbol{x}_i(n) = [x_i(n)\ x_i(n-1)\ \cdots\ x_i(n-P+1)]^T$, $i = 1, \ldots, M$.
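The two stages can be sketched numerically as follows: the silent stage repeats the NLMS iteration of Eq. (2-2) over the training signals, and the speech stage applies the frozen coefficients. The step size, constant names, and array shapes here are illustrative assumptions:

```python
import numpy as np

def stack_taps(X, n, P):
    """Stacked MP-by-1 vector x(n): the P most recent samples per microphone."""
    return np.concatenate([X[m, n - P + 1:n + 1][::-1] for m in range(X.shape[0])])

def nlms_adapt(q, x_vec, r_n, mu=0.5, gamma=1e-6):
    """One iteration in the form of Eq. (2-2): update q from the training
    vector x_vec and one reference sample r_n; returns (q, error)."""
    e = r_n - q @ x_vec
    return q + mu * e * x_vec / (gamma + x_vec @ x_vec), e

def beamform(X, q, P):
    """Speech stage: purified output y_hat(n) = q^T x(n) for each sample n."""
    return np.array([q @ stack_taps(X, n, P) for n in range(P - 1, X.shape[1])])
```

In the silent stage `nlms_adapt` is called once per sample of the training signals; in the speech stage the resulting `q` is applied unchanged by `beamform`.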

2.3 Summary

This chapter presents the reference-signal-based system architecture, which implicitly contains the information of the channel effect and the microphone characteristics. This architecture implicitly captures the acoustic behavior from the desired location to the microphone array and reduces the effort of directly performing microphone calibration and channel inversion. Furthermore, it can be applied in both near-field and far-field situations, which offers a significant advantage in speaker localization and beamforming algorithms. Extensions of this idea to further improve ASR rates are described in the following chapters. Moreover, a novel speaker-location detection algorithm based on the reference-signal-based architecture is also proposed. In addition, Chapter 6 compares the performance of the reference-signal-based time-domain adaptive beamformer with other well-known non-reference-signal-based beamformers to show the effectiveness of the proposed method.

Chapter 3

Reference-signal-based

Frequency-domain Adaptive Beamformer

3.1 Introduction

The required computational effort can be large when a long FIR filter is applied, e.g., 256 to 512 taps in the time-domain adaptive beamformer introduced in the previous chapter. For the subsequent ASR operation, an additional effort to compute the discrete Fourier transform (DFT) is required. One possible way to reduce the computational complexity is to compute the beamformer directly in the frequency domain, because ideally the long FIR filter can be replaced by a simple multiplication at each frequency bin (e.g., the FIR filter coefficient vector of dimension $MP \times 1$ is replaced by an $M \times 1$ filter coefficient vector at each frequency bin, where M is the number of microphones and P denotes the number of FIR taps). Moreover, the purified speech signal after a frequency-domain beamformer can be sent directly to the ASR. As explained later in this chapter and in Chapter 6, the saving in computational effort is quite significant.
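The rough per-output multiply counts of the two formulations can be compared directly; FFT overhead is deliberately ignored in this back-of-the-envelope sketch:

```python
def time_domain_macs_per_sample(M, P):
    """MP-tap time-domain beamformer: M*P multiply-accumulates per sample."""
    return M * P

def freq_domain_mults_per_frame(M, bins):
    """Frequency-domain beamformer: M complex multiplies per frequency bin,
    applied once per frame rather than once per sample."""
    return M * bins

# Example: M = 8 microphones, P = 512 taps gives 4096 MACs for every output
# sample in the time domain, versus 8 complex multiplies per bin per frame.
```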

In a reference-signal-based beamformer, the coefficient adjustment has two objectives: to minimize the interference signals and noises, and to equalize the channel effect (e.g., room acoustics). Channel equalization is important for ASR, since channel distortion may greatly reduce the recognition rate.

By formulating the same problem in the frequency domain, channel distortion can be emphasized using a priori information. In this chapter, a penalty function is incorporated into the performance index to calculate the filter coefficient vectors. This proposed algorithm is called SPFDBB.

A real-time frequency-domain beamformer must apply the short-time Fourier transform (STFT). However, the window size of the STFT is fixed by the training-data settings of the ASR. For an environment with a long impulse response, the time-domain convolution between the channel and the speech source cannot be modeled accurately as a multiplication in the frequency domain with a finite window size. Therefore, the finite window size may not provide enough information for the coefficient adjustment, and it violates the assumption in the NLMS adaptation criterion that the filter coefficient vector and the error signal are independent of the input data. In this case, SPFDBB takes the frame average over several frames as a block to improve the approximation of the linear model shown in Eq. (2-1). In other words, a block of windowed data is adopted simultaneously to calculate the filter coefficient vectors in the SPFDBB algorithm. The number of frames in a block is denoted as the frame number L. Intuitively, a large frame number can enhance the accuracy of the filter coefficient estimation. However, if the room acoustics change suddenly, the channel response is difficult to adjust quickly when a large frame number is used in the updating process. Furthermore, the tracking ability degrades when a large value of L is chosen. Therefore, SPFDBB is further enhanced by allowing the frame number to be adapted on-line. A novel index called the changing block values index (CBVI) is defined as the basis for adjusting the frame number. The overall algorithm is called FDABB.

The remainder of this chapter is organized as follows. Section 3.2 describes the system architecture and the corresponding dataflow. Section 3.3 presents SPFDBB, one of the reference-signal-based frequency-domain adaptive beamformers, which utilizes the NLMS adaptation criterion. Section 3.4 introduces the other proposed method, FDABB, and analyzes the computational efforts of SPFDBB, FDABB, and the reference-signal-based time-domain beamformer. Two frequency-domain performance indexes, the source distortion ratio (SDR) and the noise suppression ratio (NSR), are defined in Section 3.5. Finally, conclusions are given in Section 3.6.

3.2 System Architecture

Figure 3-1 shows the overall system architecture. The pre-recorded speech signals, $S_1(\omega,k), \ldots, S_M(\omega,k)$, and the reference signal, $R(\omega,k)$, can be recorded in the same way described in Chapter 2 when the environment is quiet. After acquiring the pre-recorded speech signals and the reference signal, the overall system automatically switches between the silent and speech stages based on the VAD result.

If the VAD result equals zero, meaning that no speech signal is contained in the received signals, $\{x_1(n), \ldots, x_M(n)\}$, then the system is switched to the silent stage, in which the adaptation of FDABB or SPFDBB is turned on. The filter coefficient vectors of FDABB or SPFDBB are adjusted through the NLMS adaptation criterion in this stage.

Notably, SPFDBB is a part of FDABB and can be executed separately.

On the other hand, if the received sound signal is detected as containing speech, then the system is switched to the second stage, called the speech stage. In this stage, the filter coefficient vectors obtained in the silent stage are applied to the lower beamformer to suppress the interference signals and noises and to enhance the speech signal. Finally, the purified speech signal $\hat{Y}(\omega,k)$ is sent directly to the ASR.

Figure 3-1 Overall system structure

3.3 SPFDBB Using NLMS Adaptation Criterion

The linear model in Eq. (2-1) is transformed to the frequency domain by zero-padding each windowed segment to twice the window length before taking the short-time Fourier transform. The error signal at frequency $\omega$ and frame $k$ is written as:

$\varepsilon(\omega,k) = R(\omega,k) - \boldsymbol{Q}^H(\omega,k)\,\hat{\boldsymbol{X}}(\omega,k)$   (3-1)

where $R(\omega,k)$ is the reference signal in the frequency domain, $\boldsymbol{Q}(\omega,k)$ denotes the filter coefficient vector we intend to find, and $\hat{\boldsymbol{X}}(\omega,k) = [\hat{X}_1(\omega,k) \cdots \hat{X}_M(\omega,k)]^T$ is the training signal vector at frequency $\omega$ and frame $k$. The optimal set of filter coefficient vectors can be found using the formula:

$\boldsymbol{Q}_{\mathrm{opt}}(\omega) = \arg\min_{\boldsymbol{Q}} E\left\{ \left[ R(\omega,k) - \boldsymbol{Q}^H \hat{\boldsymbol{X}}(\omega,k) \right] \left[ R(\omega,k) - \boldsymbol{Q}^H \hat{\boldsymbol{X}}(\omega,k) \right]^* \right\}$   (3-2)

where the superscript $*$ denotes the complex conjugate.

The normalized LMS solution of Eq. (3-2) is given by:

$\boldsymbol{Q}(\omega,k+1) = \boldsymbol{Q}(\omega,k) + \mu_0\, \varepsilon^*(\omega,k)\, \hat{\boldsymbol{X}}(\omega,k) \,/\, \left( \gamma + \hat{\boldsymbol{X}}^H(\omega,k)\hat{\boldsymbol{X}}(\omega,k) \right)$   (3-3)
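A per-bin sketch of this complex NLMS update follows; the step size and the convergence scenario in the usage note are illustrative assumptions:

```python
import numpy as np

def nlms_update_bin(Q, X_hat, R, mu=0.5, gamma=1e-6):
    """One NLMS step at a single frequency bin.
    Q, X_hat: complex vectors of length M; R: complex reference value."""
    err = R - np.conj(Q) @ X_hat                 # error as in Eq. (3-1)
    Q = Q + mu * np.conj(err) * X_hat / (gamma + np.real(np.conj(X_hat) @ X_hat))
    return Q, err
```

Iterating this step over frames of training data at one bin drives Q toward the vector that reproduces the reference coefficient from the training vector.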

Consequently, the purified output signal can be obtained by the following equation:

$\hat{Y}(\omega,k) = \boldsymbol{Q}^H(\omega,k)\,\boldsymbol{X}(\omega,k)$   (3-4)

where the received signal vector $\boldsymbol{X}(\omega,k)$ contains the speech source, interference, and noise. From Eq. (3-1), the filter coefficient vector equalizes the acoustic channel dynamics and also creates a null space for the interference and noise.

As mentioned above, the filter coefficient vector $\boldsymbol{Q}(\omega,k)$ equalizes the channel response and rejects interference signals and noises. To weight these two objectives differently, a soft penalty is added to the performance index as

$J(\omega,k) = E\left\{ \left| R(\omega,k) - \boldsymbol{Q}^H \hat{\boldsymbol{X}}(\omega,k) \right|^2 \right\} + \mu\, E\left\{ \left| R(\omega,k) - \boldsymbol{Q}^H \boldsymbol{S}(\omega,k) \right|^2 \right\}$   (3-5)

where $\boldsymbol{S}(\omega,k) = [S_1(\omega,k) \cdots S_M(\omega,k)]^T$ is the pre-recorded speech signal vector and $\mu$ is the soft penalty. The iterative equation utilizing the NLMS adaptation criterion then follows as in Eq. (3-3) with the gradient of Eq. (3-5). If the soft penalty is set to infinity, the system focuses only on minimizing the channel distortion. On the other hand, the system returns to the formulation in Eq. (3-2) when the soft penalty is set to zero.
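One way the soft penalty can enter the update is by mixing the gradients of the noise-rejection term and a channel-equalization term. The index form, the normalization, and all names below are assumptions for illustration, not the dissertation's exact Eqs. (3-6)-(3-7):

```python
import numpy as np

def soft_penalty_step(Q, X_hat, S, R, mu_0=0.3, mu=1.0, gamma=1e-6):
    """One NLMS-style step on the assumed index
    |R - Q^H X_hat|^2 + mu * |R - Q^H S|^2 at a single frequency bin.
    X_hat: training vector (speech + noise); S: pre-recorded speech vector."""
    e_full = R - np.conj(Q) @ X_hat      # rejection-plus-equalization error
    e_chan = R - np.conj(Q) @ S          # pure channel-equalization error
    grad = np.conj(e_full) * X_hat + mu * np.conj(e_chan) * S
    norm = gamma + np.real(np.conj(X_hat) @ X_hat) + mu * np.real(np.conj(S) @ S)
    return Q + mu_0 * grad / norm
```

Setting `mu` large makes the update track the clean pre-recorded speech (channel equalization), while `mu = 0` recovers the plain update.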

The problem with Eq. (3-7) is that the window size has to equal that used in the ASR to ensure calculation accuracy. However, the window size may be too small for cases where the acoustic channel response is long (e.g., a long reverberation path). Because the perturbation caused by the channel-model error is highly correlated with the reference signal, unlike uncorrelated noise, the updating process will not converge to a fixed channel response, as shown in Fig. 6-3, and two sequential frames are highly correlated. Taking the frame average over several frames (denoted as L) allows the channel response to be approximated, since the information of the channel response that is not contained in one frame can be regarded as an external noise in the next frame. The L frames of training and reference signals are collected as the block data

$\hat{\mathbf{X}}(\omega,k) = \left[ \hat{\boldsymbol{X}}(\omega,kL)\ \hat{\boldsymbol{X}}(\omega,kL+1)\ \cdots\ \hat{\boldsymbol{X}}(\omega,kL+L-1) \right], \quad \boldsymbol{R}(\omega,k) = \left[ R(\omega,kL)\ R(\omega,kL+1)\ \cdots\ R(\omega,kL+L-1) \right]^T$

The performance indexes in Eqs. (3-2) and (3-5) are two special cases of Eq. (3-8), with $\Lambda_1 = 1$, $\Lambda_2 = \Lambda_3 = -1$, $\Lambda_4 = 1$, $L = 1$ and with $\Lambda_1 = 1+\mu$, $\Lambda_2 = \Lambda_3 = -1$, $\Lambda_4 = 1$, $L = 1$, respectively. Because of the long reverberation path and the fixed window size, the error signals of different frames should be taken into consideration. Therefore, the soft-penalty approach can be applied over several windows (frames) by choosing the weighting matrices $\Lambda_1, \ldots, \Lambda_4$ as $L \times L$ matrices built from the identity matrix $\mathbf{I}$ of dimension $L \times L$. Significantly, when the channel response is longer than the window size, the reference signal of previous windows leaks into the following windows of the received signal. A good estimate should collect the information of several frames to eliminate this correlation effect. The choice of $\Lambda_1$ includes the cross-terms as a factor to minimize this correlation effect.

Using the performance index of Eq. (3-8), the SPFDBB update can be summarized as in Eqs. (3-9)-(3-14), where $\hat{\mathbf{X}}(\omega,k)$ is the training signal matrix with dimension $M \times L$, the $k$th error signal vector is

$\boldsymbol{\varepsilon}(\omega,k) = \boldsymbol{R}(\omega,k) - \hat{\mathbf{X}}^H(\omega,k)\,\boldsymbol{Q}(\omega,k)$

$\hat{\boldsymbol{Y}}(\omega,k)$ is the purified signal vector of the SPFDBB, and Eq. (3-12) is the sum of the autocorrelations of the training signal at the $k$th block. The purified speech signal at the $k$th block can be represented as

$\hat{\boldsymbol{Y}}(\omega,k) = \mathbf{X}^H(\omega,k)\,\boldsymbol{Q}(\omega,k)$

with $\mathbf{X}(\omega,k)$ the received signal block, and the adaptation is performed every L frames, i.e., $k = 0, L, 2L, 3L, \ldots$. In this way, the correlation between two sequential blocks is lower than that obtained by overlapping $(L-1)$ frames for the next adaptation process (i.e., $k = 0, 1, 2, 3, \ldots$). Moreover, the overlapped approach, in which k advances by one frame, significantly increases the computational effort and memory consumption as L increases, so the goal of carrying out the beamformer efficiently in the frequency domain could not be achieved. As a result, the proposed SPFDBB ($k = 0, L, 2L, 3L, \ldots$) is an efficient approach with low computational effort in the frequency domain.
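The difference between the proposed non-overlapping block stepping and the overlapped alternative can be sketched as:

```python
def adaptation_starts(num_frames, L, overlapped=False):
    """Frame indices at which a new length-L block adaptation begins:
    k = 0, L, 2L, ... for the proposed SPFDBB stepping, versus
    k = 0, 1, 2, ... for the costly overlapped variant."""
    step = 1 if overlapped else L
    return list(range(0, num_frames - L + 1, step))
```

With 12 frames and L = 4, the proposed stepping launches 3 adaptation runs instead of 9, which is the source of the computational saving noted above.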

3.4 FDABB and Computational Effort Analysis

3.4.1 FDABB Using NLMS Adaptation Criterion

The number of frames L in SPFDBB greatly influences the performance, as shown in Chapter 6. However, a larger value of L requires more training data. To cope with this window-size problem, an index-based algorithm is proposed to adjust the value of L automatically. Using Eq. (3-1), the error signal can be separated as:

$\varepsilon(\omega,k) = R(\omega,k) - \boldsymbol{Q}^H(\omega,k)\left( \boldsymbol{S}(\omega,k) + \boldsymbol{N}(\omega,k) \right)$   (3-15)

where $\boldsymbol{N}(\omega,k)$ collects the interference signals and noises. Taking the frame average $E_L\{\cdot\}$ over L frames of $\varepsilon(\omega,k)\boldsymbol{N}^H(\omega,k)$ gives

$E_L\left\{ \varepsilon(\omega,k)\boldsymbol{N}^H(\omega,k) \right\} = E_L\left\{ R(\omega,k)\boldsymbol{N}^H(\omega,k) \right\} - \boldsymbol{Q}^H(\omega,k)\,E_L\left\{ \boldsymbol{S}(\omega,k)\boldsymbol{N}^H(\omega,k) \right\} - \boldsymbol{Q}^H(\omega,k)\,E_L\left\{ \boldsymbol{N}(\omega,k)\boldsymbol{N}^H(\omega,k) \right\}$   (3-16)

If no correlation exists between the speech source and the interference signals or noises, the first and second terms of Eq. (3-16) are zero. Consequently, the equation can be rewritten as:

$E_L\left\{ \varepsilon(\omega,k)\boldsymbol{N}^H(\omega,k) \right\} = -\boldsymbol{Q}^H(\omega,k)\,E_L\left\{ \boldsymbol{N}(\omega,k)\boldsymbol{N}^H(\omega,k) \right\}$   (3-17)

The optimal coefficient vector should lie in the null space of $E_L\{\boldsymbol{N}(\omega,k)\boldsymbol{N}^H(\omega,k)\}$, and the norm of Eq. (3-17) should then be zero. A large norm of Eq. (3-17) indicates a strong negative or positive correlation between the error signal and the interference signals or noises. In other words, a large norm of Eq. (3-17) means that the interference signals or noises affect the error significantly and that convergence has not been achieved with the frame number L. If the norm of Eq. (3-17) is small, the present value of L contributes little toward finding better coefficients.

Consequently, the value of L is increased to improve the performance of the algorithm. Conversely, if the room acoustics vary temporarily, the present norm of Eq. (3-17) becomes much larger than the previous one. The value of L should then be reset to its initial value to handle this sudden change. The CBVI at frequency $\omega$ and block $i$ is defined in Eq. (3-18). The parameters in Eq. (3-18) are chosen as $\alpha_0 = 3$, $\alpha_1 = 2$, and $\alpha_2 = 1$ to form a high-pass filter, increasing the sensitivity to temporal variations of the room acoustics and to the convergence behavior. Figure 3-2 summarizes the proposed FDABB algorithm.

Figure 3-2 FDABB using NLMS adaptation criterion
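The control logic for L can be sketched as below. This heuristic mirrors the description (grow L while the norm of Eq. (3-17) shrinks, reset it on a sudden jump), but the exact CBVI filter of Eq. (3-18) and all parameter names here are assumptions:

```python
def update_frame_number(L, norm_now, norm_prev, L_init=1, L_max=32,
                        jump_ratio=2.0):
    """Heuristic frame-number control in the spirit of FDABB, driven by the
    current and previous norms of the residual correlation in Eq. (3-17)."""
    if norm_prev > 0.0 and norm_now > jump_ratio * norm_prev:
        return L_init              # sudden room-acoustic change: reset L
    if norm_now < norm_prev:
        return min(2 * L, L_max)   # still converging: enlarge the block
    return L                       # otherwise keep the current block size
```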

3.4.2 Computational Effort Analysis

This section analyzes the computational effort of the reference-signal-based time-domain adaptive beamformer, SPFDBB, and FDABB from two different viewpoints: the coefficient adaptation phase and the lower beamformer phase. In the coefficient adaptation phase, the speaker is silent and the coefficients are updated with the iteration equation (2-2) for the reference-signal-based time-domain adaptive beamformer.
