Discussion & Future Work

2 FREQUENCY BAND ANALYSIS FOR VOICE ACTIVITY DETECTION

2.5 Discussion & Future Work

Chapter 2 exploits the banded lines (also called formant traces) that appear on the voice-active spectrogram as a robust feature. Through frequency band analysis, this inherent characteristic is captured by an entropy measure defined in the subband domain, the BSE. Some subbands, however, can be contaminated by noise, which introduces errors into the BSE value. An on-line SSE strategy is therefore used to automatically extract the useful subbands, so that the BSE parameter combined with the SSE strategy can correctly detect the boundaries of voice activity. In addition to the BSE parameter, a second parameter, RLF (the ratio of low-band energy to full-band energy), is presented to compensate for the limitation of BSE in characterizing unvoiced speech. Compared with other parameters, the presented parameter performs reliably even when the noise level varies. Experimental results show that the proposed VAD, which comprises the two feature parameters BSE and RLF together with adaptive thresholding (used both for the VAD decision and as an indicator for executing subband extraction), is superior to other VADs, especially at low SNRs and under variable noise levels.
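As an illustration of the RLF parameter, the following minimal sketch computes the ratio of low-band energy to full-band energy for a single frame. The 1000 Hz cutoff, the function names, and the naive DFT are assumptions made for this example only; the text above does not specify where the low band ends or how the spectrum is computed.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for a short frame)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def rlf(frame, sample_rate, cutoff_hz=1000.0):
    """Ratio of low-band energy to full-band energy for one frame.

    cutoff_hz is a hypothetical low-band limit chosen for illustration.
    """
    spec = power_spectrum(frame)
    n = len(frame)
    full = sum(spec)
    if full == 0.0:
        return 0.0  # silent frame: define the ratio as zero
    low = sum(p for k, p in enumerate(spec)
              if k * sample_rate / n <= cutoff_hz)
    return low / full
```

With a frame whose energy lies below the cutoff the ratio approaches 1, and with a frame whose energy lies above it the ratio approaches 0.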

However, the subband extraction either discards or preserves each subband, and this hard decision is not good enough for the VAD result: although some subbands are contaminated by noise, they still carry useful information. In future work, each subband should be given its own weight, with the weights summing to 1 and a harmful subband receiving a lower weight than a useful one. A further limitation is that the SSE strategy assumes that subbands with larger power are harmful, whereas for a clean signal this power is speech power and the band is useful. Such false discarding of subbands may reduce the detection accuracy: in this case, the rate of correct detection can fall by about 3%, and the rate of false detection can similarly rise by about 1.7%. According to this finding, an improved subband extraction is required for clean signals.
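One possible reading of the proposed soft weighting is sketched below. It is only an assumption about how such weights could be formed: each weight is taken inversely proportional to a per-subband noise power estimate and the weights are normalized to sum to 1, so that a more contaminated subband contributes less. The function names and the way the weights enter the entropy are hypothetical and are not the BSE definition of this chapter.

```python
import math


def subband_weights(noise_power, eps=1e-12):
    """Weights that sum to 1; noisier subbands receive smaller weights.

    Assumption: weight is inversely proportional to estimated noise power.
    """
    inverse = [1.0 / (p + eps) for p in noise_power]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_band_entropy(band_power, weights):
    """Entropy of the weighted, normalized subband power distribution."""
    weighted = [w * p for w, p in zip(weights, band_power)]
    total = sum(weighted)
    if total <= 0.0:
        return 0.0
    probs = [v / total for v in weighted]
    return -sum(q * math.log(q) for q in probs if q > 0.0)
```

Unlike a hard discard, a contaminated subband keeps a small nonzero weight here, so whatever speech information it carries still influences the measure.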

In addition, the number of decomposed subbands should no longer be fixed, because in practice it varies with time. A rough solution is to apply peak/valley detection to the spectrum of each frame in order to locate the time-varying banded lines.
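The peak/valley idea can be sketched as follows, under the assumption that the frame spectrum has already been smoothed so that small ripples do not produce spurious extrema. The function names and the choice of valleys as band boundaries are illustrative, not part of the proposed method.

```python
def peaks_and_valleys(spectrum):
    """Return indices of local maxima and local minima of a frame spectrum."""
    peaks, valleys = [], []
    for k in range(1, len(spectrum) - 1):
        left, mid, right = spectrum[k - 1], spectrum[k], spectrum[k + 1]
        if mid > left and mid >= right:
            peaks.append(k)
        elif mid < left and mid <= right:
            valleys.append(k)
    return peaks, valleys


def band_edges(spectrum):
    """Split the spectrum at its valleys into (start, end) bin ranges."""
    _, valleys = peaks_and_valleys(spectrum)
    edges = [0] + valleys + [len(spectrum)]
    return list(zip(edges[:-1], edges[1:]))
```

Because the valleys move from frame to frame, the number and width of the resulting bands vary with time instead of being fixed in advance.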

TABLE 2-I

COMPARISON BETWEEN THE PROPOSED ENTROPY-BASED VAD AND ATF-BASED VAD [7] UNDER VARIOUS NOISE CONDITIONS

Per-condition columns: noise type; SNR (dB); probability of correct detection, P_cS (%), for the proposed entropy-based VAD and for the ATF-based VAD [7].

P_cS: total probability of correct detection. P_fS: total probability of false detection.

VAD type                      P_cS (%)    P_fS (%)
Proposed entropy-based VAD    89.2        3.5
ATF-based VAD [7]             78.6        9.4

Fig. 2-1 The inherent characteristic of banded lines (formant traces) appears only on the voice-active spectrogram: (a) Waveform of a mixed signal composed, in turn, of vehicle noise, multi-talker babble noise, factory noise, a speech signal, and white noise. (b) Spectrogram of the corresponding mixed signal.

Fig. 2-2 Illustration of the banded lines existing in various types of noises.

Fig. 2-3 Nature of banded lines on voice-active spectrogram: (a) A signal waveform of Mandarin digit “eight”. (b) The continuous, banded lines only appearing on the corresponding voice-active spectrogram. (c) Spectrum magnitude of voice activity. (d) Spectrum magnitude of voice-absent frame.

Fig. 2-4 Power distributions of all 32 uniform subbands with the same entropy (logarithmic BSE=21.9601): (a) During voice-active frame. (b) During voice-absent frame.

Fig. 2-5 Illustration of characterizing speech signals by using entropy-based feature parameter: (a) Waveform of a Mandarin digit “eight”. (b) The corresponding spectrogram. (c) Contour of SE proposed by J. L. Shen et al. [30]. (d) Contour of the proposed BSE.

Fig. 2-6 Different types of noises focusing on different frequency subbands.

Fig. 2-7 Relation between the number of useful subbands and NMinBE parameter.

Fig. 2-8 Illustration of the effectiveness of the NMinBE parameter when applied to the BSE parameter: (a) Waveform of the Mandarin digit “eight” at an SNR of -5 dB with an increasing level of factory noise. (b) The corresponding spectrogram. (c) The contour of the BSE measure. (d) The NMinBE parameter. (e) The number of useful subbands varying with time. (f) The contour of the BSE parameter obtained by manually selecting useful bands according to Fig. 2-7.

Fig. 2-9 An adaptive threshold method for VAD decision: (a) Waveform of an utterance of the digit “one”. (b) Detection of speech segments together with the logarithmic BSE value and the speech threshold T_s.

Fig. 2-10 Flowchart of SSE strategy for automatically extracting the useful subbands.

Fig. 2-11 Block diagram of the proposed entropy-based VAD algorithm.

Fig. 2-12 Result of RLF measure tested in recorded speech sentence /start/. (a) Waveform of a noisy speech sentence. (b) The envelope of RLF measure.

Fig. 2-13 Measurement of the two BSE and RLF parameters.

Fig. 2-14 Comparison between different feature parameters for the VAD algorithm, tested on an utterance with musical background noise inside a car: (a) Waveform of an utterance in Chinese: “Guo Li Chiao Tung Da Xue (National Chiao Tung University)”. (b) The corresponding spectrogram. (c) Contour of spectral energy. (d) Contour of ZCR. (e) Contour of ATF. (f) Contour of BSE.

CHAPTER 3 A SINGLE CHANNEL NOISE SPECTRUM ESTIMATION WITH RAPID ADAPTATION IN VARIABLE-LEVEL OF NOISY ENVIRONMENTS

Chapter 3 presents a single-channel noise estimation algorithm that uses only the power spectrum of noisy speech. The proposed method tracks the noise spectrum quickly, even when the noise level suddenly increases. Estimating the noise spectrum requires explicit speech/silence detection, so the entropy-based VAD described in Chapter 2 continuously classifies each frame as voice-active or voice-absent. The noise spectrum estimate is then updated with a constant smoothing factor in voice-absent frames and with a time-frequency dependent smoothing factor in voice-active frames. The time-frequency dependent smoothing factor is chosen as a sigmoid function of the voice-active probability in each frequency bin, and the voice-active probability is determined from the ratio of the noisy speech power spectrum to its local minimum. To speed up minimum tracking, a fast method for tracking the minimum of the noisy speech power spectrum is presented. In addition, to allow detection with the entropy-based VAD under colored noise conditions, we propose to subtract the estimated noise spectrum of the previous frame from the current spectrum.
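The update rule described above can be sketched as a first-order recursion per frequency bin. Everything numeric here is an assumption made for illustration: the constant factor 0.9, the sigmoid slope and threshold, the lower bound 0.7 on the smoothing factor, and the direct use of the power-to-minimum ratio as a stand-in for the voice-active probability. The exact form used by the proposed algorithm is given in Section 3.2, not here.

```python
import math


def smoothing_factor(noisy_power, local_min,
                     slope=2.0, ratio_threshold=3.0, floor=0.7):
    """Sigmoid-shaped smoothing factor in [floor, 1) for one frequency bin.

    A large ratio of noisy power to its local minimum suggests speech, so the
    factor approaches 1 and the noise estimate is barely updated.
    """
    ratio = noisy_power / max(local_min, 1e-12)
    sigmoid = 1.0 / (1.0 + math.exp(-slope * (ratio - ratio_threshold)))
    return floor + (1.0 - floor) * sigmoid


def update_noise(noise_prev, noisy_power, local_min, voice_active,
                 alpha_absent=0.9):
    """Recursive noise update: constant factor in voice-absent frames,
    bin-dependent factor in voice-active frames."""
    updated = []
    for n_prev, y, y_min in zip(noise_prev, noisy_power, local_min):
        alpha = smoothing_factor(y, y_min) if voice_active else alpha_absent
        updated.append(alpha * n_prev + (1.0 - alpha) * y)
    return updated
```

In a voice-absent frame every bin moves toward the current spectrum at the same rate, while in a voice-active frame only the bins whose power stays close to the local minimum are allowed to adapt.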

3.1 Introduction

Most existing single-channel noise estimation methods are slow to adapt to increasing noise levels, which leads to perceptually annoying residual noise and speech distortion in speech enhancement. Noise estimation usually relies on explicit detection of speech pauses. This detection can be very difficult when the background noise varies, so the background noise is commonly assumed to be relatively stationary between speech pauses.

Martin [36] proposed a noise spectrum estimation method based on minimum statistics (MS), in which the noise spectrum is estimated by tracking the minimum of the noisy speech power spectrum over a particular window. Cohen et al. [37] extended MS into the minima-controlled recursive averaging (MCRA) method, which estimates the noise power spectrum using a smoothing parameter defined by the voice-active probability in each frequency bin. Speech presence is decided by comparing this probability with a specific threshold, and the result determines whether the noise estimate is updated. However, this decision clearly depends on the energy level and performs poorly when the noise level is higher than the speech level. In addition, the noise estimate does not adapt quickly to a rapid change in noise level. More recently, Lin et al. [38] developed an adaptive noise estimation method that is easy to implement; its smoothing parameter is chosen as a sigmoid function of the a posteriori SNR. However, the a posteriori SNR is sensitive to a variable noise level. These algorithms contain no explicit VAD, and their performance depends on the energy level, so the noise estimate does not adapt quickly when the noise level changes rapidly. To overcome this problem, a noise spectrum estimation method with rapid adaptation to variable-level noisy environments is required.
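The slow adaptation of window-based minimum tracking can be seen in a minimal sketch. This is a simplified illustration of the windowed-minimum idea only; it omits the smoothing and bias compensation of the MS method in [36], and the class name and window length are assumptions.

```python
from collections import deque


class WindowedMinimum:
    """Track the minimum of the last `window` power values of one bin."""

    def __init__(self, window):
        self.history = deque(maxlen=window)  # oldest values drop out automatically

    def update(self, power):
        self.history.append(power)
        return min(self.history)


# Usage: the noise floor jumps from about 3 to about 6, but the tracked
# minimum follows only after the old values have left the window.
tracker = WindowedMinimum(window=3)
tracked = [tracker.update(p) for p in [5, 3, 4, 6, 7, 8]]
```

The tracked minimum cannot rise until every lower value has left the window, which is why a longer window gives a more reliable minimum but a slower response to an increasing noise level.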

This chapter proposes a method that tracks the noise spectrum quickly, even when the noise level suddenly increases. Estimating the noise spectrum requires explicit speech/silence detection, so the entropy-based VAD described in Chapter 2 continuously classifies each frame as voice-active or voice-absent. The noise spectrum estimate is then updated with a constant smoothing factor in voice-absent frames and with a time-frequency dependent smoothing factor in voice-active frames. The time-frequency dependent smoothing factor is chosen as a sigmoid function of the voice-active probability in each frequency bin, and the voice-active probability is determined from the ratio of the noisy speech power spectrum to its local minimum. To speed up minimum tracking, an efficient method extended from [52] is presented for tracking the minimum of the noisy speech power spectrum. Furthermore, to allow the entropy-based VAD to make decisions under colored noise conditions, we suggest subtracting the estimated noise spectrum of the previous frame from the current spectrum. After the subtraction, the resulting spectrum in voice-absent frames is similar to that of white noise. To keep the BSE and RLF values in the next frame reliable, the subtractive-type method of Berouti et al. [39] is used to significantly reduce the annoying “musical noise” introduced by subtracting the estimated noise spectrum from the noisy speech spectrum.
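A subtractive-type rule in the style of Berouti et al. [39] combines over-subtraction with a spectral floor. The sketch below uses fixed values for the over-subtraction factor and the floor as assumptions for illustration; in practice the over-subtraction factor is usually adapted to the SNR of the frame, and the function name is hypothetical.

```python
def spectral_subtract(noisy_power, noise_power,
                      over_subtraction=4.0, floor=0.01):
    """Power spectral subtraction with over-subtraction and a spectral floor.

    Each bin becomes noisy - over_subtraction * noise, but never less than
    floor * noise, which masks the isolated peaks heard as musical noise.
    """
    cleaned = []
    for y, n in zip(noisy_power, noise_power):
        difference = y - over_subtraction * n
        cleaned.append(difference if difference > floor * n else floor * n)
    return cleaned
```

For example, with a noise estimate of 1.0 in both bins, a noisy power of 10.0 is reduced to 6.0, whereas a noisy power of 1.0 falls below the floor and is set to 0.01.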

This chapter is organized as follows. Section 3.2 details the configuration of the proposed noise estimation algorithm for rapid adaptation to variable noise levels and describes how the entropy-based VAD of Chapter 2 is modified and incorporated into the noise estimation algorithm. Section 3.3 evaluates the proposed algorithm under variable noise levels in comparison with other methods. Finally, Section 3.4 summarizes the conclusions.

From the document: A Study of Robust Speech Detection Systems Based on Frequency Band and Wavelet Analysis (pages 54-75)
