
In the formulation above, three parameters are involved. The first, the parameter with subscript $Q$, controls the error covariance of the Process Equation. The second controls the error covariance of the Measurement Equation. The third, the parameter with subscript $v$, controls the error proportion between the upper line and the lower line of the Measurement Equation.

Process Equation:

Measurement Equation:

(3.3)

If the process-noise parameter (subscript $Q$) is large, the filter adapts faster to variations in the environment. From (3.2), it can be observed that if this parameter is large, the change between $w(k,\omega)$ and $w(k-1,\omega)$ is allowed to be larger. In the case where the environment is a linear time-invariant (LTI) system, there is no need to adapt to variations in the system. Therefore, the best choice of this parameter is zero, which is obtained by setting $Q$ to zero.

The tradeoff parameter (subscript $v$) controls the tradeoff between noise reduction and dereverberation. A large value leads to strong noise reduction and little dereverberation, while a small value leads to strong dereverberation and little noise reduction. If the parameter is small, the error variation in the lower line of (3.4) is relatively small compared with the upper line, which leads to closer tracking in the lower line and looser tracking in the upper line, achieving strong dereverberation and weak noise reduction. If the parameter is large, the error variation in the upper line of (3.4) is relatively small compared with the lower line, which leads to closer tracking in the upper line and looser tracking in the lower line, achieving strong noise reduction and weak dereverberation.

An extreme choice in either direction will decrease the signal quality, since too much distortion and too much noise both degrade the signal. The optimal choice should be related to the signal-to-noise ratio (SNR), since the parameter can be treated as a lever that distributes the total filtering effort between signal dereverberation and noise reduction. If the noise level is small relative to the signal, i.e. the SNR is high, more effort should be placed on dereverberation. If the noise level is large relative to the signal, i.e. the SNR is low, more effort should be placed on noise reduction. Experiments on this tradeoff will be presented in Section 4.

3.5 Voice Activity Detection under Proposed Formulation

As mentioned before in Section 3.3, the vector $X(k,\omega)$ is the data recorded when the desired signal is inactive, or the desired signal cancellation phenomenon will occur. Thus, a voice activity detector is required. A feasible option is to incorporate another algorithm that detects voice activity or signal activity. However, some parameters produced during the filtering procedure can be utilized to implement a voice activity detector. The procedure for such an implementation is presented in this section.

Starting again from the formulation:

Process Equation:

Measurement Equation:

(3.2)

The vector $V(k,\omega)$ is the Measurement Error. By observing the value of the Measurement Error, the voice activity detector can be implemented. In (3.2), the upper line can be regarded as suppressing noise while the lower line can be regarded as preserving the desired signal. If $X(k,\omega)$ is purely noise, it will be minimized by both the upper line and the lower line of (3.2). The Measurement Error with such $X(k,\omega)$ is small and has low variance. If $X(k,\omega)$ contains desired signal, it tends to be preserved by the lower line but also tends to be minimized by the upper line, which constitutes a dilemma. The filtering result is that the first element of the Measurement Error, corresponding to the error in the upper line, is large, which means such $X(k,\omega)$ cannot be minimized by the upper line and leaves a large residual error.

In summary, the Measurement Error of noise reduction is employed as a feature to detect voice activity under this algorithm. It can be considered as a data rejection procedure before filtering [8]. If the Measurement Error is larger than the threshold, the current frame is regarded as voice active and the parameter update with respect to the current frame is abandoned. If the Measurement Error is smaller than the threshold, the current frame is regarded as voice inactive and the parameter update with respect to the current frame is kept. The flow chart of the voice activity detection procedure is shown in Fig. 2.
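The data rejection rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: `gated_update`, `step_fn` and `toy_step` are hypothetical names, and the toy scalar filter only stands in for the Kalman filter so that the gating logic can be run.

```python
def gated_update(state, frame, threshold, step_fn):
    """Keep or reject a filter update based on the Measurement Error.

    step_fn(state, frame) is a hypothetical pure function returning
    (candidate_state, measurement_error); it must not modify `state`.
    """
    candidate, error = step_fn(state, frame)
    speech_active = error > threshold
    # Large error: frame treated as voice activity, update abandoned.
    # Small error: frame treated as voice inactivity, update kept.
    kept_state = state if speech_active else candidate
    return kept_state, speech_active, error


def toy_step(state, frame):
    """Toy stand-in for one filter step (assumption, scalar tracker)."""
    error = abs(frame - state)
    candidate = 0.9 * state + 0.1 * frame
    return candidate, error
```

Calling `gated_update(0.0, 0.1, 1.0, toy_step)` keeps the update, while `gated_update(0.0, 5.0, 1.0, toy_step)` returns the original state unchanged.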

Fig. 2 Flow Chart of Voice Activity Detection Procedure

It has to be noted that the Measurement Error is not discriminative enough if the tradeoff parameter (subscript $v$) is ill-chosen. Since the critical error term is the Measurement Error on noise reduction, the parameter should be chosen large enough to devote effort to noise reduction. However, the best value should consider both noise reduction and dereverberation, so it should not be chosen extremely large. To overcome this dilemma, two Kalman filters should be executed: one with a large value, which performs noise reduction and detects signal activity, and another with a medium value, which computes the optimal weight $w(k,\omega)$ to achieve the best tradeoff between noise reduction and dereverberation.

3.6 Threshold Decision and SINR Estimation

In Section 3.5, the threshold that discriminates voice activity was left undetermined. Section 3.6.1 presents the procedure that determines the threshold. Section 3.6.2 shows how the result of the detection procedure can be reused to estimate the current SINR and to help choose the best tradeoff parameter (subscript $v$), which was left undetermined in Section 3.4.

3.6.1 Gaussian Mixture Model and EM Algorithm

The Gaussian Mixture Model (GMM) is incorporated to guide the data classification [9]. The distribution of the Measurement Error when the signal is inactive is modeled as a Gaussian distribution, and the distribution of the Measurement Error when the signal is active is modeled as another Gaussian distribution, as in Fig. 3. This model is described by the following equations. Let $x_k$ denote the first element of the Measurement Error at time $k$, and let $z \in \{0,1\}$ be the speech/nonspeech label, where 0 denotes nonspeech and 1 denotes speech. According to Bayes' Rule, it can be written that

$$p(x_k \mid \theta) = \sum_{z} p(x_k, z \mid \theta) = \sum_{z} p(x_k \mid z, \theta)\, p(z) \qquad (3.21)$$

where $p(z)$ is the prior probability of speech/nonspeech, which is equal to the weight coefficient $w_z$ ($w_0 + w_1 = 1$), and $p(x_k \mid z, \theta)$ represents the likelihood of $x_k$ given the speech/nonspeech model. $\theta = \{\mu_z, \sigma_z^2, w_z \mid z = 0, 1\}$ is the parameter set of the GMM.

Fig. 3 Schematic illustration of error distribution: (a) Distribution of noisy speech; (b) Distributions of speech and nonspeech (this figure is modified from [9])

Given a sequence of the first elements of the Measurement Error, the parameter set $\theta$ is estimated by maximizing the probability density function (PDF) of that sequence.

From the GMM, the PDFs of both speech and nonspeech can be obtained, namely $p(x \mid z=1, \theta)\,p(z=1)$ and $p(x \mid z=0, \theta)\,p(z=0)$. These two PDFs are shown in Fig. 3(b). From them, the optimal threshold $\eta$ that minimizes the classification error can be obtained. The threshold satisfies

$$p(\eta \mid z=0, \theta)\,p(z=0) = p(\eta \mid z=1, \theta)\,p(z=1) \qquad (3.24)$$

After taking the logarithm of both Gaussian terms, Eq. (3.24) is a quadratic equation with one unknown, $\eta$. The threshold is the root located between the two means, namely $\mu_1 > \eta > \mu_0$. Samples with error less than $\eta$ are classified as nonspeech, and otherwise as speech. The shaded area in Fig. 3(b) denotes the classification error.
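The threshold search can be illustrated with a short Python sketch. It assumes the threshold is the point where the two weighted Gaussian densities are equal, takes logarithms to obtain a quadratic, and returns the root between the two means. The function names and argument order are hypothetical, and the sketch assumes the speech mean `mu1` is strictly larger than the nonspeech mean `mu0`.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_threshold(w0, mu0, var0, w1, mu1, var1):
    """Root of w0*N(x; mu0, var0) = w1*N(x; mu1, var1) between the means."""
    # Coefficients of a*x^2 + b*x + c = 0 after taking logarithms.
    a = 1.0 / (2.0 * var1) - 1.0 / (2.0 * var0)
    b = mu0 / var0 - mu1 / var1
    c = (math.log(w0 / w1) + 0.5 * math.log(var1 / var0)
         - mu0 ** 2 / (2.0 * var0) + mu1 ** 2 / (2.0 * var1))
    if abs(a) < 1e-12:
        # Equal variances: the equation degenerates to a linear one.
        roots = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            raise ValueError("the two weighted densities do not intersect")
        s = math.sqrt(disc)
        roots = [(-b + s) / (2.0 * a), (-b - s) / (2.0 * a)]
    for r in roots:
        if mu0 < r < mu1:
            return r
    raise ValueError("no root lies between the two means")
```

With equal weights and equal variances the threshold falls at the midpoint of the means, for example `gmm_threshold(0.5, 0.0, 1.0, 0.5, 2.0, 1.0)` gives 1.0 up to rounding.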

The crucial issue of the above model is to estimate the parameter set $\theta$. The estimation consists of an initialization and a sequential updating process. The initial GMM is first established by the EM algorithm, and then incrementally updated with incoming data. The parameter set at time $k$ is denoted as $\theta_k = \{\mu_{k,z}, \sigma_{k,z}^2, w_{k,z} \mid z = 0, 1\}$. $\theta_0$ is the initial parameter set estimated from the first $M$ samples by the EM algorithm.

According to [9], the typical EM re-estimation formulas are (3.25) to (3.28), which compute the posterior probability $p(z \mid x_j, \theta)$, the new weights, the new means and the new variances, respectively. After each iteration, $\theta$ is replaced by the re-estimated set $\theta'$. The iteration continues until the EM algorithm converges. The final $\theta'$ is the initial parameter set $\theta_0$ required for GMM initialization, and the threshold can be obtained by solving (3.24).
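As an illustration of the initialization step, the following Python sketch runs the textbook EM re-estimation for a two-component one-dimensional GMM. It is a sketch under assumptions rather than the procedure of [9]: the median split used for initialization, the function names, and the stopping tolerance are all hypothetical, and the constraints discussed later in this section are left out.

```python
import numpy as np


def gauss(x, mean, var):
    """1-D Gaussian density, broadcast over arrays."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_two_gaussians(x, n_iter=100, tol=1e-8):
    """Fit weights, means and variances of a 2-component 1-D GMM."""
    x = np.asarray(x, dtype=float)
    # Hypothetical initialization: split the samples at the median.
    split = np.median(x)
    low, high = x[x <= split], x[x > split]
    w = np.array([low.size, high.size], dtype=float) / x.size
    mu = np.array([low.mean(), high.mean()])
    var = np.array([low.var(), high.var()]) + 1e-12
    prev_loglik = -np.inf
    for _ in range(n_iter):
        # E-step: joint density of each sample with each component.
        joint = w[:, None] * gauss(x[None, :], mu[:, None], var[:, None])
        total = joint.sum(axis=0)
        loglik = np.log(total).sum()
        if loglik - prev_loglik < tol:
            break  # likelihood no longer increasing
        prev_loglik = loglik
        post = joint / total  # posterior p(z | x_j)
        # M-step: standard re-estimation of weights, means, variances.
        nz = post.sum(axis=1)
        w = nz / x.size
        mu = (post @ x) / nz
        var = (post * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / nz
    return w, mu, var
```

Component 0 starts from the lower half of the samples, so it plays the role of the nonspeech model when the error is smaller for nonspeech frames.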

According to [9], the GMM is assumed to vary slowly with time, i.e. $\theta_k \approx \theta_{k-1}$ at time $k$. Accordingly, the relationship between consecutive parameter sets is approximated by the zero-order moment, together with a user-defined parameter that determines the adaptation speed. The adaptation formulas (3.29) to (3.31) update the weight coefficients, the means and the variances, respectively, with this user-defined parameter acting as a forgetting factor. Besides, the constraints (3.32) to (3.34) are required during the adaptation process.

Constraint (3.32) is based on the observation that the mean of the Measurement Error during speech is always larger than during nonspeech. A lower bound for $\mu_{k,1}$ is therefore implemented by adding a gap to $\mu_{k,0}$ and choosing the larger of the two values. Constraint (3.33) is based on the observation that the variance of the Measurement Error during speech is never smaller than the variance during nonspeech. Constraint (3.34) prevents the prior probability of speech from becoming 0, which would stop all further adaptation. The minimum prior probability is also a parameter to be chosen.
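The three constraints can be sketched as a small Python helper. The exact forms of (3.32) to (3.34) are assumed here from their verbal description, and the names `gap` and `w_min` as well as the renormalization of the nonspeech weight are hypothetical choices made for the illustration.

```python
def apply_constraints(weights, means, variances, gap, w_min):
    """Apply the mean, variance and weight constraints to a 2-component GMM.

    Index 0 is the nonspeech component and index 1 the speech component.
    """
    w0, w1 = weights
    mu0, mu1 = means
    var0, var1 = variances
    # Mean constraint: the speech mean stays at least `gap` above nonspeech.
    mu1 = max(mu1, mu0 + gap)
    # Variance constraint: the speech variance is not smaller than nonspeech.
    var1 = max(var1, var0)
    # Weight constraint: keep a minimum prior probability for speech.
    w1 = max(w1, w_min)
    w0 = 1.0 - w1  # assumption: weights are renormalized to sum to one
    return (w0, w1), (mu0, mu1), (var0, var1)
```

The helper would be called after each re-estimation or adaptation step, so that a long stretch of nonspeech cannot collapse the speech component.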

After the GMM is built, the threshold can be determined following EM initialization and adaptation. The EM algorithm is listed in Fig. 4 and the complete VAD decision procedure is listed in Fig. 5.

Initialize GMM by using unsupervised clustering
while GMM likelihood is increasing
    if the weight constraint on $w_{k,1}$ and $w_{k,0}$ is triggered, break
    end
    Calculate $p(z \mid x_j, \theta)$ for all $z$ and $x_j$ with (3.25)
    Calculate new weights with (3.26)
    Calculate new means with (3.27)
    Constrain means with (3.32)
    Calculate new variances with (3.28)
    Constrain variances with (3.33)
end

Fig. 4 EM algorithm with constraints (revised from [9])

for the first M frames
    Calculate the Measurement Error
    Establish a GMM by EM with constraints
    Determine the threshold from GMM using (3.24)
    Classify M frames as speech/nonspeech
    Discriminate speech/nonspeech by hangover scheme
end
for new frame at time k+1
    Calculate the Measurement Error
    Calculate $p(z \mid x_j, \theta)$ with (3.25)
    Update the weight coefficients with (3.29)
    Constrain the weight coefficients with (3.34)
    Update the means with (3.30)
    Constrain the means with (3.32)
    Update the variances with (3.31)
    Constrain the variances with (3.33)
    Determine the threshold from GMM using (3.24)
    Determine $x_{k+1}$ as speech/nonspeech
end

Fig. 5 The process of VAD decision (revised from [9])

3.6.2 SINR Estimation

In Section 3.4, the best tradeoff parameter (subscript $v$), which determines the tradeoff between noise reduction and dereverberation, was left undetermined. It was mentioned that it should be related to the current SINR, since the parameter balances the effort between reducing noise and enhancing the signal, while the SINR is the ratio of signal power to noise power. From the result of Section 3.6.1, the two Gaussian models represent the Measurement Error of the signal part and of the noise part, which can also be related to the SINR. The means of the Gaussian models for signal and noise can be regarded as two indices describing the signal power and the noise power after adaptive filtering. Therefore, the difference between the means of the two Gaussian models can be interpreted as an index describing the current SINR. Fig. 6 shows the relationship from the Mean Difference in VAD to the best estimate of the tradeoff parameter.

Fig. 6 The relationship from Mean Difference in VAD to the best estimation of the tradeoff parameter (subscript $v$)

In Fig. 6, three blocks are used to determine the best estimate of the tradeoff parameter. The first block calculates the Mean Difference from the current GMM, which is trivial once the Gaussian Mixture Model has been built.

The second block estimates the current SINR from the current Mean Difference. Although the conceptual relationship can be imagined, there is no concrete equation that describes it. To solve this problem, the relationship can be pre-trained. The curve can be found by mixing a signal clip and a noise clip recorded in the testing scenario at various amplitudes, which yields clips with different SINRs. For those clips, the Measurement Error is computed with the Kalman filter under perfect VAD. After this computation and after building the GMMs, the Mean Difference corresponding to each testing clip is obtained. Finally, collecting the pairs of SINR and Mean Difference gives the trained relationship. An example showing the result of a series of training runs is given in Fig. 7. The relationship from SINR to Mean Difference can then be looked up inversely to obtain the current SINR from a given Mean Difference.

Fig. 7 An example of trained relationship from SINR to Mean Difference
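The inverse lookup can be sketched with linear interpolation over the trained pairs. This is a minimal sketch assuming the trained curve is strictly monotonic, so that each Mean Difference maps to a single SINR. The function name and the sample values in the usage note are hypothetical, and values outside the trained range are clamped to the nearest endpoint by `numpy.interp`.

```python
import numpy as np


def sinr_from_mean_difference(mean_diff, trained_sinr_db, trained_mean_diff):
    """Inverse lookup: estimate SINR (dB) from a Mean Difference value."""
    sinr = np.asarray(trained_sinr_db, dtype=float)
    diff = np.asarray(trained_mean_diff, dtype=float)
    # numpy.interp needs increasing x values, so sort by Mean Difference.
    order = np.argsort(diff)
    diff_sorted, sinr_sorted = diff[order], sinr[order]
    if np.any(np.diff(diff_sorted) <= 0.0):
        raise ValueError("trained Mean Difference values must be distinct")
    return float(np.interp(mean_diff, diff_sorted, sinr_sorted))
```

For a hypothetical trained curve with SINRs `[-5, 0, 5, 10]` dB and Mean Differences `[0.1, 0.2, 0.4, 0.8]`, a Mean Difference of 0.3 falls halfway between the second and third points and is mapped to about 2.5 dB.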

The third block estimates the best tradeoff parameter (subscript $v$) corresponding to the current SINR. This relationship can also be trained. The pre-training procedure varies the parameter from 0.01 to 100 with a multiplicative step of $10^{0.2}$ for each sample clip of a different SINR and finds the best output. The "best output" can be measured by some combination of objective indices such as the output SINR or the log spectral distortion (LSD). An example in which the best output is obtained by minimizing the LSD over various parameter values and various SINRs is presented in Fig. 8. Note that a small LSD means less distortion and higher signal quality.

Fig. 8 SINR vs. the tradeoff parameter (subscript $v$) giving the best LSD

With the Gaussian Mixture Model and the two pre-trained blocks, the best tradeoff parameter under the trained scenario can be found.
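The parameter sweep used in pre-training can be sketched as follows. The grid runs from 0.01 to 100 with a multiplicative step of $10^{0.2}$, which gives 21 candidate values. The names `tradeoff_grid`, `best_tradeoff` and `cost_fn` are hypothetical, and `cost_fn` stands in for whatever objective index (for example the LSD of the filtered clip) is chosen.

```python
import numpy as np


def tradeoff_grid():
    """Candidate values 10^-2, 10^-1.8, ..., 10^2 (21 points)."""
    exponents = np.arange(21) * 0.2 - 2.0
    return 10.0 ** exponents


def best_tradeoff(cost_fn):
    """Return the grid value that minimizes the given cost function."""
    grid = tradeoff_grid()
    costs = np.array([cost_fn(value) for value in grid])
    return float(grid[np.argmin(costs)])
```

Repeating `best_tradeoff` for clips of different SINRs gives one best value per SINR, which is the kind of curve described for Fig. 8.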

3.7 Overall System Architecture

Combining the beamforming technique proposed in Section 3.3, the voice activity detection in Section 3.5 and the parameter determination in Section 3.6, the overall system architecture is presented in this section.

The flow chart in Fig. 9 elaborates the overall system architecture. The main processing is separated into two Kalman filters, labeled Kalman filter 1 and Kalman filter 2 in Fig. 9. Kalman filter 1 operates as the voice activity detector, thus its tradeoff parameter (subscript $v$) should be chosen large enough to place appropriate effort on noise reduction. With a large value, the Measurement Error is discriminative enough to separate the signal part from the noise part. Kalman filter 2 serves as the beamformer, so its tradeoff parameter should be chosen to balance noise reduction and dereverberation.

To start with, new speech samples in the time domain are collected in frames with a fixed overlap with the previous frame and transformed to the frequency domain after zero padding and Hanning windowing. Before the new frame is fed to Kalman filter 1, the old parameters of Kalman filter 1 are saved, in case the Measurement Error later shows that Kalman filter 1 should not adapt to the new frame because it contains desired signal. After the current parameters are saved, Kalman filter 1 adapts to the new frame and calculates the Measurement Error with respect to it. The Measurement Error is compared with the threshold to determine whether the new frame is desired-signal active or inactive.

If the new frame is determined to be desired-signal active, it is weighted and summed using the weights given by Kalman filter 2. As mentioned before, Kalman filter 2 serves as the beamformer, filtering out undesired noise while keeping the desired signal undistorted. After the filtered result is produced, the parameters of Kalman filter 1 are restored to their values before adapting to the new frame, since the new frame contains desired signal and Kalman filter 1 should not adapt to it.

If the new frame is determined to be desired-signal inactive, it is fed to Kalman filter 2 so that the filter adapts to the noise contained in the new frame. During this adaptation phase, the parameters of Kalman filter 2 are updated.

Fig. 9 The Flowchart of Overall System (blocks: New input; Save Current Parameters of Kalman Filter 1; Run Kalman Filter 1 w.r.t. New Input; Calculate New Measurement Error; decision "ME > threshold?"; Y: Mark as desired signal active, Output Filtered Result by Weights from Kalman Filter 2, Load Parameters of Kalman Filter 1; N: Mark as desired signal inactive, Update Parameters of Kalman Filter 2; then Update GMM by new ME and calculate new threshold; Find the best tradeoff parameter by inverse lookup; Update the tradeoff parameter of Kalman Filter 2)

After the voice activity is determined, the new Measurement Error is used to update the GMM and to calculate the new threshold. The Mean Difference of the two Gaussian models is then used to look up the current SINR and the best tradeoff parameter (subscript $v$) for Kalman filter 2.

To sum up, the overall algorithm contains two Kalman filters that handle voice activity detection and beamforming, respectively. The two Kalman filters differ in their tradeoff parameter and therefore serve different functions and scenarios. The GMM is incorporated to help detect voice activity and to separate signal and noise into two groups, which provides the information needed to retrieve the best tradeoff parameter corresponding to the current SINR.
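The per-frame control flow described in this section can be summarized in a Python sketch. All object interfaces here (`kf1.params`, `kf1.adapt`, `kf2.apply_weights`, `kf2.adapt`, `kf2.tradeoff`, `gmm.threshold`, `gmm.update`, `gmm.mean_difference`, `lookup_tradeoff`) are hypothetical placeholders for the two Kalman filters, the GMM and the two pre-trained lookups. Following the text, an output is produced only for frames marked as desired-signal active.

```python
import copy


def process_frame(frame, kf1, kf2, gmm, lookup_tradeoff):
    """One pass of the two-filter architecture for a single frame."""
    # Save Kalman filter 1 parameters before it adapts to the new frame.
    saved_params = copy.deepcopy(kf1.params)
    # Kalman filter 1 adapts and reports the Measurement Error.
    error = kf1.adapt(frame)
    active = error > gmm.threshold
    output = None
    if active:
        # Desired signal present: filter with Kalman filter 2 weights,
        # then undo the adaptation of Kalman filter 1.
        output = kf2.apply_weights(frame)
        kf1.params = saved_params
    else:
        # Noise only: let Kalman filter 2 adapt to the noise.
        kf2.adapt(frame)
    # Update the GMM (and its threshold) with the new Measurement Error.
    gmm.update(error)
    # Mean Difference -> SINR -> best tradeoff value for Kalman filter 2.
    kf2.tradeoff = lookup_tradeoff(gmm.mean_difference())
    return output, active
```

Keeping the save and restore inside one function makes it explicit that Kalman filter 1 never retains an update computed from a frame that was classified as containing desired signal.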

Chapter 4. Experiment Results

4.1 Introduction of the Experiment Condition

In the experimental results presented below, the original sound samples were recorded in a Ford Fiesta car by a microphone array placed at the sun visor of the driver's seat. The desired male speech was played by a Head and Torso Simulator (HATS) from Brüel & Kjær on the driver's seat. The speech data was extracted from a listening comprehension test by an English learning center, thus giving high SNR. The interfering female speech was played by the same HATS on the front passenger seat. It was also extracted from an English listening comprehension test. The noise was recorded while the car was driving on the road at around 50 km/h. More specifications of the experiment are presented in Table 1. Photos of the recording environment are shown in Fig. 10 and Fig. 11. Fig. 12 and Fig. 13 are the time-frequency plots of the known clips played and the signal clips recorded, both of which are used as reference signals in this experiment.

Parameter                  Value
Microphone number          4
Microphone displacement    7 cm
Sampling rate              8000 Hz
FFT size                   512 samples
Shift number               160 samples
Zero padding               32 samples

Table 1 Parameters in experiment
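The framing implied by Table 1 can be sketched in Python. The sketch assumes that each frame holds 512 - 32 = 480 signal samples, that the Hanning window is applied to those 480 samples, and that 32 zeros are appended before the 512-point FFT. These details, and the name `stft_frames`, are assumptions made for the illustration and are not stated in the table.

```python
import numpy as np

NFFT = 512                   # FFT size from Table 1
HOP = 160                    # shift number from Table 1
ZERO_PAD = 32                # zero padding from Table 1
FRAME_LEN = NFFT - ZERO_PAD  # assumed number of signal samples per frame


def stft_frames(signal):
    """Split a signal into windowed, zero-padded frames and take the FFT."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < FRAME_LEN:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - FRAME_LEN) // HOP
    window = np.hanning(FRAME_LEN)
    spectra = np.empty((n_frames, NFFT // 2 + 1), dtype=complex)
    for i in range(n_frames):
        frame = signal[i * HOP:i * HOP + FRAME_LEN] * window
        # rfft with n=NFFT appends zeros up to the FFT size.
        spectra[i] = np.fft.rfft(frame, n=NFFT)
    return spectra
```

Under these assumptions, one second of audio at 8000 Hz yields 48 frames of 257 frequency bins each, with 320 samples of overlap between consecutive frames.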

The sound data is recorded by a digital microphone array, which uses digital microphones to receive the signal and collects 16-bit array data on an Altera FPGA development board. The received data is made available to an embedded NetBurner network module through shared memory. Finally, the array data is transferred to a PC or laptop through a Local Area Network (LAN).

Fig. 10 The photo for the microphone array at the sun shield of the driver’s seat.

Fig. 11 The photo for the HATS at the driver’s seat

Fig. 12 The time-frequency plot for original speech

Fig. 13 The time-frequency plot for recorded speech

4.2 Experiments on Performance of Noise Reduction and Its Tradeoff with Dereverberation

In this section, the tradeoff between noise reduction and dereverberation is demonstrated. The experimental environment is as described in Section 4.1. Three speech enhancement algorithms, namely MVDR, MVDR with Kalman filter solution and DSB (Delay and Sum Beamformer), are implemented for comparison with the proposed algorithm. In this section, perfect voice activity detection is assumed for MVDR, MVDR with Kalman filter and the proposed algorithm to avoid the sample matrix inverse (SMI) problem [10]. For the MVDR filter, the forgetting factor of the sample covariance matrix is 0.99. In the proposed beamformer, the tradeoff parameter (subscript $v$) ranges from 0.001 to 1000 with a multiplicative step of 10.

Two objective performance indices are used to measure the waveform property. The first is the average SINR (avgSINR), computed over $T_s$ and $T_n$, which denote the periods in time when only the desired speech is active and when only the interference-plus-noise signals are active, respectively. The second quality measure is the log spectral distortion (LSD), computed between the short-time Fourier transform (STFT) of the original sound played by the HATS and the STFT of the beamformer output. LSD measures the speech distortion in the frequency domain. Note that a lower LSD level corresponds to better performance.
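For illustration, one commonly used form of the log spectral distortion is sketched below: the root-mean-square difference, in dB, between the two power spectra in each frame, averaged over frames. This is an assumed definition that may differ in detail from the one used in this work, and the function name and the small constant `eps` are hypothetical.

```python
import numpy as np


def log_spectral_distortion(ref_stft, out_stft, eps=1e-12):
    """Mean over frames of the RMS log-power-spectrum difference in dB.

    Both inputs are arrays of shape (frames, frequency bins).
    """
    ref_pow = np.abs(np.asarray(ref_stft)) ** 2 + eps
    out_pow = np.abs(np.asarray(out_stft)) ** 2 + eps
    diff_db = 10.0 * np.log10(ref_pow / out_pow)
    per_frame = np.sqrt(np.mean(diff_db ** 2, axis=1))
    return float(np.mean(per_frame))
```

With this definition, identical spectra give an LSD of 0 dB, and an output whose amplitude is ten times the reference in every bin gives an LSD of about 20 dB.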

In Fig. 14(a), Fig. 15(a) and Fig. 16(a), the effect of the tradeoff parameter (subscript $v$) on SINR is as expected. A higher value gives a higher noise reduction level and thus better performance. In contrast, a small value gives a low noise reduction level and thus a poor LSD result, since noise and distortion both worsen the LSD. Since perfect voice activity detection is assumed, other methods such as MVDR and MVDR with Kalman filter both perform well. With perfect voice activity detection, the MVDR
