In the formulation above, three parameters control the behavior of the filter. Q controls the error covariance of the Process Equation. A second parameter controls the error covariance of the Measurement Equation. v controls the error proportion between the upper line and the lower line of the Measurement Equation.
Process Equation:
Measurement Equation:
(3.3)
If Q is large, the filter adapts to variations in the environment faster. By (3.2), it can be observed that if Q is large, the change between w(k,w) and w(k-1,w) is allowed to be large. In the case that the environment is a Linear Time-Invariant (LTI) system, there is no need to adapt to variations in the system. Therefore, the best choice of Q is zero.
The parameter v controls the tradeoff between noise reduction and dereverberation. A large v leads to strong noise reduction and little dereverberation, while a small v leads to strong dereverberation and little noise reduction. If v is small, the error variation in the lower line of (3.4) is relatively small compared with the upper line, which leads to closer tracking in the lower line and looser tracking in the upper line, achieving strong dereverberation and weak noise reduction. If v is large, the error variation in the upper line of (3.4) is relatively small compared with the lower line, which leads to closer tracking in the upper line and looser tracking in the lower line, achieving strong noise reduction and weak dereverberation.
An extreme choice of v in either case will decrease the signal quality, since too much distortion and too much noise both degrade the quality of the signal. The optimal choice of v should be related to the signal-to-noise ratio (SNR), since v can be treated as a lever that distributes the total effort of filtering between signal dereverberation and noise reduction. If the noise level is small relative to the signal, i.e. the SNR is high, more effort should be placed on signal dereverberation; if the noise level is large relative to the signal, i.e. the SNR is low, more effort should be placed on noise reduction. Experiments on this tradeoff will be presented in Section 4.
3.5 Voice Activity Detection under Proposed Formulation
As mentioned before in Section 3.3, the vector X(k,w) is the data recorded when the desired signal is inactive; otherwise the desired signal cancelation phenomenon will occur. Thus, a voice activity detector is required. A feasible option is to incorporate another algorithm that detects voice activity or signal activity. However, some parameters available during the filtering procedure can be utilized to implement a voice activity detector. The procedure for such an implementation will be presented in this section.
Starting again from the formulation:
Process Equation:
Measurement Equation:
(3.2)
The vector V(k,w) is the Measurement Error. By observing the value of the Measurement Error, the voice activity detector can be implemented. In (3.2), the upper line can be regarded as suppressing noise while the lower line can be regarded as preserving the desired signal. If X(k,w) is purely noise, it will be minimized by both the upper line and the lower line of (3.2). The Measurement Error with such X(k,w) is small and has low variance. If X(k,w) contains the desired signal, it is prone to be preserved by the lower line but also prone to be minimized by the upper line, which constitutes a dilemma. The filtering result is that the first element of the Measurement Error, corresponding to the error in the upper line, is large, which means such X(k,w) cannot be minimized by the upper line and leads to a large residual error.
In summary, the Measurement Error of noise reduction is employed as a feature to detect voice activity under this algorithm. It can be considered as a data rejection procedure before filtering [8]. If the Measurement Error is larger than the threshold, the current frame is regarded as voice active and the parameter update is abandoned for the current frame. If the Measurement Error is smaller than the threshold, the current frame is regarded as voice inactive and the parameter update is kept for the current frame. The flow chart of the voice activity detection procedure is shown in Fig. 2.
Fig. 2 Flow Chart of Voice Activity Detection Procedure
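The data rejection rule described above can be sketched in a few lines. This is only a minimal illustration: the function name `gate_frames` is hypothetical, and taking the magnitude of a possibly complex Measurement Error before comparing it with the threshold is an assumption, not something stated in the text.

```python
from typing import List, Tuple


def gate_frames(errors: List[complex], threshold: float) -> List[Tuple[bool, bool]]:
    """Return (voice_active, allow_update) for each frame's Measurement Error."""
    decisions = []
    for err in errors:
        # Assumption: compare the magnitude of the error with the threshold.
        voice_active = abs(err) > threshold
        # Data rejection: the parameter update is skipped for active frames.
        decisions.append((voice_active, not voice_active))
    return decisions


if __name__ == "__main__":
    print(gate_frames([0.1, 2.0, 0.3 + 0.4j], threshold=1.0))
```

A frame whose error equals the threshold is treated as inactive here; the text does not specify that boundary case.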
It has to be noted that the Measurement Error is not discriminative enough if v is ill-chosen. Since the critical error term is the Measurement Error on noise reduction, v should be chosen large enough to devote effort to noise reduction. However, the best v should consider both noise reduction and dereverberation, so the appropriate v should not be chosen extremely large. To overcome this dilemma, two Kalman filters should be executed: one with a large v that performs noise reduction and detects signal activity, and another with a medium v that computes the optimal weight w(k,w) to achieve the best tradeoff between noise reduction and dereverberation.
3.6 Threshold Decision and SINR Estimation
In Section 3.5, the threshold that discriminates the voice activity is not determined. In Section 3.6.1, the procedure that determines the threshold will be presented. In Section 3.6.2, the result of the detection procedure is further reused to estimate the current SINR and to help choose the best v, which is left undetermined in Section 3.4.
3.6.1 Gaussian Mixture Model and EM Algorithm
The Gaussian Mixture Model (GMM) is incorporated to guide the data classification [9]. The distribution of the Measurement Error when the signal is inactive is modeled as a Gaussian distribution, and the distribution of the Measurement Error when the signal is active is modeled as another Gaussian distribution, as in Fig. 3. This model is described by the following equations. Let $x_k$ denote the first element of the Measurement Error at time k. z is the speech/nonspeech label, $z \in \{0,1\}$, where 0 denotes nonspeech and 1 denotes speech. According to Bayes’ Rule, it can be written that
$p(x_k \mid \lambda) = \sum_{z} p(x_k, z \mid \lambda) = \sum_{z} p(x_k \mid z, \lambda)\, p(z)$ (3.21)
where p(z) is the prior probability of speech/nonspeech, which is actually equal to the weight coefficient $w_z$ ($w_0 + w_1 = 1$). $p(x_k \mid z, \lambda)$ represents the likelihood of $x_k$ given the speech/nonspeech model. $\lambda = \{\mu_z, \sigma_z^2, w_z \mid z = 0, 1\}$ is the parameter set of the GMM.
Fig. 3 Schematic illustration of error distribution: (a) Distribution of noisy speech; (b) Distributions of speech and nonspeech (This Figure is modified from [9])
Let $\mathbf{x} = \{x_1, \ldots, x_M\}$ be a sequence of the first element of the Measurement Error. The parameter set $\lambda$ is estimated by maximizing the probability density function (PDF) of this sequence under the GMM.
From the GMM, both of the PDFs of speech and nonspeech can be obtained, namely $p(x \mid z=1, \lambda)\,p(z=1)$ and $p(x \mid z=0, \lambda)\,p(z=0)$. These two PDFs are shown in Fig. 3(b). From the two PDFs, the optimal threshold $\theta$ can be obtained to minimize the classification error. The threshold satisfies
$p(\theta \mid z=1, \lambda)\, p(z=1) = p(\theta \mid z=0, \lambda)\, p(z=0)$ (3.24)
Eq. (3.24) is a quadratic equation with one unknown $\theta$. The threshold is the one of its roots located between the two means, namely $\mu_1 > \theta > \mu_0$. The samples with error less than $\theta$ are determined as nonspeech, and otherwise as speech. The shadow in Fig. 3(b) denotes the classification error.
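To make the quadratic nature of the threshold condition concrete, the sketch below takes logarithms of the two weighted Gaussian PDFs, collects the coefficients of the resulting quadratic, and returns the root lying between the two means. The function names and the argument order are hypothetical; the Gaussians are parameterized by their variances.

```python
import math


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Gaussian density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_threshold(w0, mu0, var0, w1, mu1, var1):
    """Root between mu0 and mu1 of w1*N(t; mu1, var1) = w0*N(t; mu0, var0)."""
    if not mu0 < mu1:
        raise ValueError("expected mu0 < mu1 (nonspeech mean below speech mean)")
    # Coefficients of a*t^2 + b*t + c = 0 obtained after taking logarithms.
    a = 1.0 / (2.0 * var0) - 1.0 / (2.0 * var1)
    b = mu1 / var1 - mu0 / var0
    c = (mu0 ** 2 / (2.0 * var0) - mu1 ** 2 / (2.0 * var1)
         + math.log((w1 * math.sqrt(var0)) / (w0 * math.sqrt(var1))))
    if abs(a) < 1e-12:
        roots = [-c / b]  # equal variances: the equation becomes linear
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            raise ValueError("the two weighted PDFs do not intersect")
        sq = math.sqrt(disc)
        roots = [(-b + sq) / (2.0 * a), (-b - sq) / (2.0 * a)]
    for root in roots:
        if mu0 < root < mu1:
            return root
    raise ValueError("no root lies between the two means")


if __name__ == "__main__":
    print(gmm_threshold(0.5, 0.0, 1.0, 0.5, 4.0, 4.0))
```

When no root falls between the means (for example with a very small speech weight), the sketch raises an error rather than guessing a fallback; how the original system handles that case is not stated.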
The crucial issue of the above model is to estimate the parameter set $\lambda$. The estimation consists of an initialization and a sequential updating process. The initial GMM is first established by the EM algorithm, and then incrementally updated with incoming data. The parameter set at time k is denoted as $\lambda_k = \{\mu_{k,z}, \sigma_{k,z}^2, w_{k,z} \mid z = 0, 1\}$. $\lambda_0$ is the initial parameter set estimated from the first M samples by the EM algorithm.
According to [9], the typical EM re-estimation formulas (3.25)-(3.28) compute, in turn, the posterior probability $p(z \mid x_j, \lambda)$, the new weights, the new means and the new variances. After each iteration, $\lambda$ is replaced by the re-estimated set $\lambda'$. This iteration continues until the EM algorithm converges. The final $\lambda'$ is the initial parameter set $\lambda_0$ required for GMM initialization, and the initial threshold can be obtained by solving (3.24).
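A minimal sketch of the EM initialization for a two-component, one-dimensional GMM is given below. It uses the textbook re-estimation steps (posterior, weights, means, variances) and is not claimed to match the exact formulas of [9]. The median-split initialization, the function names and the stopping tolerance are assumptions made for illustration, and the constraints discussed later in this section are omitted here.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Gaussian density, broadcast over arrays."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def em_gmm_1d(x, n_iter=200, tol=1e-8):
    """Fit weights, means, variances of a 2-component 1-D GMM (index 0 = lower mean)."""
    x = np.asarray(x, dtype=float)
    # Assumption: crude initialization by splitting the data at the median.
    med = np.median(x)
    low, high = x[x <= med], x[x > med]
    means = np.array([low.mean(), high.mean()])
    variances = np.array([low.var() + 1e-6, high.var() + 1e-6])
    weights = np.array([low.size, high.size], dtype=float) / x.size
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior p(z | x_j) for every sample, shape (2, M).
        joint = weights[:, None] * gaussian_pdf(x[None, :], means[:, None], variances[:, None])
        total = joint.sum(axis=0)
        post = joint / total
        log_likelihood = np.log(total).sum()
        # M-step: new weights, means and variances.
        occupancy = post.sum(axis=1)
        weights = occupancy / x.size
        means = (post @ x) / occupancy
        variances = (post * (x[None, :] - means[:, None]) ** 2).sum(axis=1) / occupancy + 1e-12
        # Stop once the likelihood no longer increases noticeably.
        if log_likelihood - prev_ll < tol:
            break
        prev_ll = log_likelihood
    return weights, means, variances


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(6.0, 1.5, 500)])
    print(em_gmm_1d(data))
```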
According to [9], the GMM is assumed to vary slowly with time, i.e. $\lambda_k \approx \lambda_{k-1}$ at time k. Accordingly, the update is approximated by the zero-order moment, with a user-defined parameter that determines the adaption speed. Therefore, the adaption formulas (3.29)-(3.31) update the weight coefficients, the means and the variances in turn, where the user-defined parameter stands for the forgetting factor. Besides, some constraints are required during the adaption process as follows.
$\mu_{k,1} = \max(\mu_{k,1},\ \mu_{k,0} + \delta)$ (3.32)
$\sigma_{k,1}^2 = \max(\sigma_{k,1}^2,\ \sigma_{k,0}^2)$ (3.33)
$w_{k,1} = \max(w_{k,1},\ \varepsilon), \quad w_{k,0} = 1 - w_{k,1}$ (3.34)
The reason for constraint (3.32) is the observation that the mean of the Measurement Error for speech is always larger than that for nonspeech; thus a lower bound for $\mu_{k,1}$ is implemented by adding a gap $\delta$ to $\mu_{k,0}$ and choosing the larger one. The reason for constraint (3.33) is the observation that the variance of the Measurement Error for speech is never smaller than the variance of the Measurement Error for nonspeech. The reason for constraint (3.34) is to prevent the prior probability of speech from becoming 0, which would stop any adaption afterwards; the minimum weight $\varepsilon$ is also a parameter to be chosen.
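The sequential adaption with its three constraints can be sketched as follows. The recursive update used here is a generic exponential-forgetting form and is an assumption: it is not claimed to reproduce formulas (3.29)-(3.31). The names `Gmm2`, `adapt`, `alpha`, `gap` and `min_w1` are hypothetical, as are their default values.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gmm2:
    w: List[float]    # weights [nonspeech, speech]
    mu: List[float]   # means
    var: List[float]  # variances


def normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def adapt(gmm: Gmm2, x: float, alpha: float = 0.99,
          gap: float = 0.1, min_w1: float = 0.05) -> Gmm2:
    """One adaption step for a new Measurement Error sample x."""
    # Posterior of each class under the current parameters.
    joint = [gmm.w[z] * normal_pdf(x, gmm.mu[z], gmm.var[z]) for z in (0, 1)]
    total = max(sum(joint), 1e-300)
    post = [j / total for j in joint]
    w, mu, var = [], [], []
    for z in (0, 1):
        # Assumed exponential-forgetting updates with forgetting factor alpha.
        w_new = max(alpha * gmm.w[z] + (1.0 - alpha) * post[z], 1e-12)
        mu_new = (alpha * gmm.w[z] * gmm.mu[z] + (1.0 - alpha) * post[z] * x) / w_new
        var_new = (alpha * gmm.w[z] * gmm.var[z]
                   + (1.0 - alpha) * post[z] * (x - mu_new) ** 2) / w_new
        w.append(w_new)
        mu.append(mu_new)
        var.append(var_new)
    # Keep the speech prior away from zero so adaption never stalls.
    if w[1] < min_w1:
        w = [1.0 - min_w1, min_w1]
    # Speech mean stays at least `gap` above the nonspeech mean.
    mu[1] = max(mu[1], mu[0] + gap)
    # Speech variance is never smaller than the nonspeech variance.
    var[1] = max(var[1], var[0])
    return Gmm2(w, mu, var)


if __name__ == "__main__":
    print(adapt(Gmm2([0.9, 0.1], [0.0, 1.0], [1.0, 1.0]), x=0.0))
```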
After building the GMM, the threshold can be determined after EM initialization and adaption. The process of the EM algorithm is given in Fig. 4 and the total procedure of the VAD decision is given in Fig. 5.
Initialize GMM by using unsupervised clustering
while GMM likelihood is increasing
    if w_{k,1} < ε: w_{k,1} = ε, w_{k,0} = 1 - ε, break end
    Calculate p(z | x_j, λ) for all z and x_j with (3.25)
    Calculate new weights with (3.26)
    Calculate new means with (3.27)
    Constrain means with (3.32)
    Calculate new variances with (3.28)
    Constrain variances with (3.33)
end
Fig. 4 EM algorithm with constraints (revised from [9])
for the first M frames
    Calculate the Measurement Error
    Establish a GMM by EM with constraints
    Determine the threshold from GMM using (3.24)
    Classify M frames as speech/nonspeech
    Discriminate speech/nonspeech by hangover scheme
end
for new frame at time k+1
    Calculate the Measurement Error
    Calculate p(z | x_j, λ) with (3.25)
    Update the weight coefficients with (3.29)
    Constrain the weight coefficient with (3.34)
    Update the means with (3.30)
    Constrain the means with (3.32)
    Update the variances with (3.31)
    Constrain the variances with (3.33)
    Determine the threshold from GMM using (3.24)
    Determine x_{k+1} as speech/nonspeech
end
Fig. 5 The process of VAD decision (revised from [9])
3.6.2 SINR Estimation
In Section 3.4, the best v that determines the tradeoff between noise reduction and dereverberation is left undetermined. It is mentioned that it should be related to the current SINR, since v leverages the effort to reduce noise and enhance the signal while the SINR stands for the ratio of signal power to noise power. From the result of Section 3.6.1, the two Gaussian models stand for the Measurement Error of the signal part and the noise part, which can also be related to the SINR. The means of the Gaussian models for signal and noise can be regarded as two indices describing the signal power and the noise power after adaptive filtering. Therefore, the mean difference of the two Gaussian models can be interpreted as an index describing the current SINR. Fig. 6 shows the relationship from the Mean Difference in VAD to the best estimation of v.
Fig. 6 The relationship from Mean Difference in VAD to the best estimation of v
In Fig. 6, there are three blocks used to determine the best estimation of v. The first block calculates the mean difference from the current GMM, which is trivial after building the Gaussian Mixture Models.
The second block estimates the current SINR from the current Mean Difference. Although the conceptual relationship can be imagined, there is no concrete equation describing the relationship between them. To solve that problem, the relationship can be pre-trained. The curve can be found by mixing a signal clip and a noise clip recorded in the testing scenario with various amplitudes to obtain clips with different SINRs. For those clips, the Measurement Error is computed with the Kalman filter under perfect VAD. After this computation and after building the GMM models, the Mean Difference corresponding to each testing clip can be found. Finally, by arranging the correspondence from SINR to Mean Difference, the relationship is trained. An example showing the result of a series of training is in Fig. 7. The relationship from SINR to Mean Difference can then be inverted to look up the current SINR given the Mean Difference.
Fig. 7 An example of trained relationship from SINR to Mean Difference
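The inverse look-up from Mean Difference to SINR can be sketched with linear interpolation over the trained points. The calibration values below are placeholders invented purely to make the example runnable; they are not measured data and do not correspond to Fig. 7. The sketch also assumes that the trained relationship is monotonic, which the inverse look-up requires.

```python
import numpy as np

# Placeholder calibration points (NOT measured data): SINR in dB and the
# Mean Difference obtained for clips mixed at that SINR.
TRAINED_SINR_DB = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
TRAINED_MEAN_DIFF = np.array([0.2, 0.5, 0.9, 1.4, 2.0])


def estimate_sinr_db(mean_diff: float) -> float:
    """Invert the trained SINR -> Mean Difference curve by interpolation."""
    # np.interp needs increasing x-coordinates, so sort by Mean Difference.
    order = np.argsort(TRAINED_MEAN_DIFF)
    return float(np.interp(mean_diff, TRAINED_MEAN_DIFF[order], TRAINED_SINR_DB[order]))


if __name__ == "__main__":
    print(estimate_sinr_db(0.7))
```

Values outside the trained range are clamped to the nearest end of the curve, which is the default behavior of `np.interp`.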
The third block estimates the best v corresponding to the current SINR. This relationship can also be pre-trained. The pre-training procedure varies v from 0.01 to 100 in multiplicative steps of 10^0.2 for each sample clip of different SINR and finds the best output. The “best output” can be measured by some combination of objective indices such as output SINR or log spectral distortion (LSD). An example of the v giving the best output by minimizing the LSD over various v and various SINR is presented in Fig. 8. Note that a small LSD stands for less distortion and higher signal quality.
Fig. 8 SINR vs. the v giving Best LSD
With the Gaussian Mixture Models and the two pre-trained blocks, the best v under the trained scenario can be found.
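The sweep over v used in pre-training can be sketched as a logarithmic grid search. The grid below follows the range and step stated in the text (0.01 to 100 with a multiplicative step of 10^0.2, which gives 21 values). The scoring function is a hypothetical caller-supplied stand-in for running the beamformer on a clip and measuring, for example, its LSD.

```python
import math
from typing import Callable, Optional

import numpy as np


def v_grid() -> np.ndarray:
    """Candidate values 0.01, 0.01*10^0.2, ..., 100 (21 points)."""
    return np.logspace(-2.0, 2.0, 21)


def best_v(score_fn: Callable[[float], float], grid: Optional[np.ndarray] = None) -> float:
    """Return the grid value with the smallest score (lower is better, e.g. LSD)."""
    grid = v_grid() if grid is None else np.asarray(grid, dtype=float)
    scores = np.array([score_fn(float(v)) for v in grid])
    return float(grid[int(np.argmin(scores))])


if __name__ == "__main__":
    # Toy score with its minimum at v = 10^0.4, used only to exercise the search.
    toy_score = lambda v: (math.log10(v) - 0.4) ** 2
    print(best_v(toy_score))
```

Repeating the search for clips at several SINRs yields the SINR-to-best-v table that the third block looks up at run time.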
3.7 Overall System Architecture
Combining the beamforming technique proposed in Section 3.3, the voice activity detection in Section 3.5 and the parameter determination in Section 3.6, the overall system architecture is presented in this section.
The flow chart in Fig. 9 elaborates the overall system architecture. The main processing can be separated into two Kalman filters, written as Kalman filter 1 and Kalman filter 2 in Fig. 9. Kalman filter 1 operates as the voice activity detector, thus its v should be chosen large enough to place appropriate effort on noise reduction. With a large v, the Measurement Error will be discriminative enough to separate the signal part and the noise part. Kalman filter 2 serves as the beamformer, so its v should be chosen appropriately to balance the tradeoff between noise reduction and dereverberation.
To start with, new speech samples in the time domain are collected in frames with a fixed overlap with the previous frame and transformed to the frequency domain after zero padding and Hanning windowing. Before feeding the new frame to Kalman filter 1, the old parameters of Kalman filter 1 are saved, in case the Measurement Error later shows that Kalman filter 1 should not adapt to the new frame because it contains the desired signal. After saving its current parameters, Kalman filter 1 tries to adapt itself to the new frame and calculates the Measurement Error with respect to the new frame. The Measurement Error is compared with the threshold and used to determine whether the desired signal is active or inactive in the new frame.
If the new frame is determined as desired signal active, it should be weighted and summed by the weights given by Kalman filter 2. As mentioned before, Kalman filter 2 serves as the beamformer, filtering out undesired noise while keeping the desired signal undistorted. After giving the filtered result, the parameters of Kalman filter 1 should be restored to the values saved before adapting to the new frame, since the new frame contains the desired signal and should not be adapted to by Kalman filter 1.
If the new frame is determined as desired signal inactive, it should be fed to Kalman filter 2 to adapt to the noise contained in the new frame. During the adaption phase, the parameters are updated at the same time.
Fig. 9 The Flowchart of Overall System (blocks: New input; Save current parameters of Kalman filter 1; Run Kalman filter 1 w.r.t. new input; Calculate new Measurement Error; If ME > threshold, Y: mark as desired signal active, output filtered result by weights from Kalman filter 2, load parameters of Kalman filter 1; N: mark as desired signal inactive, update parameters of Kalman filter 2; Update GMM by new ME and calculate new threshold; Find the best v by inverse look up; Update v of Kalman filter 2)
After determining the voice activity, the new Measurement Error is used to update the GMM and calculate the new threshold. The Mean Difference of the two Gaussian models can be used to look up the current SINR and the best v for Kalman filter 2.
To sum up, the overall algorithm contains two Kalman filters to handle the two issues of voice activity detection and beamforming respectively. The two Kalman filters differ in their crucial parameter v and thus serve different functions and scenarios. The GMM is incorporated to help detect voice activity and separate the signal and noise into two groups, which gives more information for retrieving the best v corresponding to the current SINR.
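The per-frame control flow of this section (save, adapt, compare, then either output and restore or update the beamformer) can be sketched as below. `ToyFilter` is a hypothetical scalar stand-in used only so the example runs; it is not a Kalman filter and does not use v. The GMM update, threshold update and v look-up that follow each frame are left out. The text describes an output only for the active branch, so the sketch returns `None` for inactive frames.

```python
import copy


class ToyFilter:
    """Hypothetical stand-in for a Kalman filter: tracks a running level only."""

    def __init__(self, v: float):
        self.v = v        # tradeoff parameter (unused by this toy model)
        self.state = 0.0  # stands in for the filter parameters

    def adapt(self, frame: float) -> float:
        """Update the state and return the magnitude of the pre-update error."""
        error = frame - self.state
        self.state += 0.5 * error
        return abs(error)

    def filter(self, frame: float) -> float:
        """Apply the current state without adapting."""
        return frame - self.state


def process_frame(frame, kf1, kf2, threshold):
    """Return (active, output, measurement_error) for one new frame."""
    saved = copy.deepcopy(kf1.state)      # save parameters of Kalman filter 1
    measurement_error = kf1.adapt(frame)  # let Kalman filter 1 try to adapt
    active = measurement_error > threshold
    output = None
    if active:
        output = kf2.filter(frame)        # output using weights of Kalman filter 2
        kf1.state = saved                 # undo adaption to the desired signal
    else:
        kf2.adapt(frame)                  # adapt Kalman filter 2 to the noise
    return active, output, measurement_error


if __name__ == "__main__":
    kf1, kf2 = ToyFilter(v=100.0), ToyFilter(v=1.0)
    for sample in (0.2, 5.0):
        print(process_frame(sample, kf1, kf2, threshold=1.0))
```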
Chapter 4. EXPERIMENT RESULTS
4.1 Introduction of the Experiment Condition
In the experimental results presented afterward, the original sound samples are recorded in a Ford Fiesta car by a microphone array placed at the sun shield of the driver’s seat. The desired male speech is played by the Head and Torso Simulator (HATS) by Brüel & Kjær on the driver’s seat. The speech data is extracted from a listening comprehension test by an English learning center, thus giving high SNR. The interfering female speech is played by the same HATS on the copilot’s seat. It is also extracted from an English listening comprehension test. The noise is recorded while the car is driving on the road at a speed of around 50 km/h. More specifications of the experiment are presented in Table 1. The photos illustrating the recording environment are shown in Fig. 10 and Fig. 11. Fig. 12 and Fig. 13 are the time-frequency plots of the known clips played and the signal clips recorded, both of which are used as reference signals in this experiment.
Parameter                  Value
Microphone Number          4
Microphone Displacement    7 cm
Sampling rate              8000 Hz
FFT size                   512 samples
Shift number               160 samples
Zero padding               32 samples
Table 1 Parameters in experiment
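The framing described in Section 3.7 (overlapping frames, Hanning windowing, zero padding, then transformation to the frequency domain) can be sketched with the values of Table 1. The frame length of 480 samples is an assumption, inferred as the FFT size minus the zero padding; the function name `stft_frames` is hypothetical.

```python
import numpy as np

FFT_SIZE = 512                    # samples (Table 1)
HOP = 160                         # shift number in samples (Table 1)
ZERO_PAD = 32                     # samples (Table 1)
FRAME_LEN = FFT_SIZE - ZERO_PAD   # assumption: 480 samples of data per frame


def stft_frames(signal) -> np.ndarray:
    """Return one-sided spectra, one row per frame, for a single channel."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < FRAME_LEN:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - FRAME_LEN) // HOP
    window = np.hanning(FRAME_LEN)
    spectra = np.empty((n_frames, FFT_SIZE // 2 + 1), dtype=complex)
    for i in range(n_frames):
        frame = signal[i * HOP : i * HOP + FRAME_LEN] * window
        # rfft zero-pads the windowed frame up to FFT_SIZE samples.
        spectra[i] = np.fft.rfft(frame, n=FFT_SIZE)
    return spectra


if __name__ == "__main__":
    one_second = np.zeros(8000)  # 1 s at the 8000 Hz sampling rate
    print(stft_frames(one_second).shape)
```

For a four-microphone array the same function would be applied to each channel separately.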
The sound data is recorded by a digital microphone array, which uses digital microphones to receive the signal and collects 16-bit array data on an Altera FPGA development board. The received data is made visible to the embedded network hardware NetBurner through shared memory. Finally, the array data is transferred to a PC or laptop through a Local Area Network (LAN).
Fig. 10 The photo for the microphone array at the sun shield of the driver’s seat.
Fig. 11 The photo for the HATS at the driver’s seat
Fig. 12 The time-frequency plot for original speech
Fig. 13 The time-frequency plot for recorded speech
4.2 Experiments on Performance of Noise Reduction and Its Tradeoff with Dereverberation
In this section, the tradeoff between noise reduction and dereverberation is exhibited. The experiment environment is as described in Section 4.1. Three speech enhancement algorithms, MVDR, MVDR with Kalman filter solution, and DSB (Delay and Sum Beamformer), are implemented for comparison with the proposed algorithm. In this section, perfect voice activity detection is assumed for MVDR, MVDR with Kalman filter and the proposed algorithm to avoid the sample matrix inversion (SMI) problem [10]. For the MVDR filter, the forgetting factor of the sample covariance matrix is 0.99. In the proposed beamformer, the parameter v ranges from 0.001 to 1000, increasing by a ratio of 10.
Two objective performance indices are used to measure the waveform property. The first is the average SINR (avgSINR), computed over $T_s$ and $T_n$, which denote the periods in time when only the desired speech is active and when only the interference-plus-noise signals are active, respectively. The second quality measure is the log spectral distortion (LSD), computed between the Short-time Fourier transform (STFT) of the original sound played by HATS and the STFT of the beamformer output. LSD measures the speech distortion in the frequency domain. Note that a lower LSD level corresponds to better performance.
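As an illustration of how such a distortion index can be computed, the sketch below uses one common form of LSD: the root-mean-square difference of the log power spectra over frequency, averaged over frames. This form is an assumption and may differ from the exact definition used in this work; the function name and the `eps` guard are hypothetical.

```python
import numpy as np


def log_spectral_distortion(ref_stft, out_stft, eps: float = 1e-12) -> float:
    """LSD in dB between two STFTs of shape (frames, frequency bins)."""
    ref_db = 10.0 * np.log10(np.abs(ref_stft) ** 2 + eps)
    out_db = 10.0 * np.log10(np.abs(out_stft) ** 2 + eps)
    # RMS difference over frequency for each frame, then average over frames.
    per_frame = np.sqrt(np.mean((ref_db - out_db) ** 2, axis=1))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    reference = np.ones((4, 8), dtype=complex)
    print(log_spectral_distortion(reference, reference))
    print(log_spectral_distortion(reference, 10.0 * reference))
```

Identical spectra give an LSD of zero, and a uniform amplitude gain of 10 gives about 20 dB, which matches the intuition that a lower value means less distortion.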
In Fig. 14(a), Fig. 15(a) and Fig. 16(a), the effect of v on SINR is as expected. A higher v gives a higher noise reduction level and thus better performance. In contrast, a small v gives a low noise reduction level and thus a bad result in LSD, since noise and distortion both worsen the LSD. Since perfect voice activity detection is assumed, other methods like MVDR and MVDR with Kalman filter both perform well. With perfect voice activity detection, the MVDR