1 INTRODUCTION
1.6 Organization of This Dissertation
The dissertation is organized as follows. Chapter 2 derives the two measures, BSE and RLF. In addition, a subband self-extraction (SSE) strategy is presented that automatically selects the useful information in some subbands to complement the BSE measure. Chapter 3 then describes a noise spectrum estimator with a rapid adaptation algorithm for highly non-stationary noise. The above-mentioned VAD is modified and employed in the noise spectrum estimator as an indicator for updating the noise estimate. Next, Chapter 4 presents an alternative VAD approach based on wavelet analysis. The Teager energy and mean-delta (MD) operators are applied, respectively, to the auto-correlation function (ACF) of each subband to form a new parameter, SMDSACF (sum of mean delta of subband auto-correlation function). Finally, Chapter 5 summarizes our conclusions and outlines some future developments.
CHAPTER 2
FREQUENCY BAND ANALYSIS FOR VOICE ACTIVITY DETECTION USING A MEASURE OF BANDED SPECTRAL ENTROPY
The formant frequency representation is a highly efficient, compact representation of the time-varying characteristics of speech, especially for voiced sounds. In addition, the arrangement of spectral magnitudes around the formant frequencies is important for characterizing speech signals. The so-called Banded Spectral Entropy (BSE), a measure of spectral entropy defined in the subband domain, is therefore presented for extracting voice activity. Noise, however, can concentrate in some subbands and contaminate the useful information, which leads to erroneous voice activity decisions with BSE. To compensate for this, a method of automatically selecting subbands, referred to as subband self-extraction (SSE), is presented. Besides, the ratio of low-band energy to full-band energy (RLF) is presented to discriminate unvoiced sounds from background noise, since the entropy-based measure alone is limited in detecting unvoiced sounds.
2.1 Introduction
A feature parameter that can sufficiently characterize speech signals and remain robust in highly noisy environments is required. So far, most existing algorithms are based on short-time or spectral energy, zero-crossing rate (ZCR), and duration parameters [25]-[27]. All of these parameters, however, are rather sensitive to noise and cannot fully specify the characteristics of a speech signal. For example, the energy-based parameter and the ZCR are not sufficient to distinguish speech from noise at low SNRs. In particular, the ZCR is very sensitive to various types of noise.
Several other parameters have also been proposed, including linear prediction coefficients (LPCs), cepstral coefficients, and pitch [9], [11], [14]. Although these parameters are quite effective in expressing the characteristics of speech signals, the performance of VAD using such parameters remains poor in adverse environments.
The reliability of the LPCs has been observed to depend strongly on the noise in adverse environments. Pitch information can help to detect speech; even so, extracting the correct pitch in noisy environments is difficult. Additionally, some algorithms cannot be implemented for practical applications due to their high computational complexity, even though they perform well [28]. Among such approaches, Junqua et al. [29] proposed a time-frequency (TF) parameter to detect speech, which assumes that the frequency information in the range 250-3500 Hz is less contaminated by noise. The TF parameter is composed of both the frequency energy in fixed frequency bands and the time energy. Motivated by the observation that the frequency energies of various types of noise are concentrated in different frequency bands, Wu et al. [7] used a multi-band technique to analyze noisy speech signals and proposed an adaptive band-selection (ABS) method that cancels noise effectively by selecting useful bands. They also proposed an adaptive time-frequency (ATF) parameter extended from the TF parameter.
Although the ATF-based algorithm outperforms several algorithms commonly used for detecting voice activity in the presence of various types of noise, it cannot be reliably implemented in practical environments, because the selection of useful bands depends on information from the entire recorded signal. Additionally, the ATF parameter is also an energy-based parameter and is therefore less reliable in the presence of non-stationary noise or a changing noise level. J. L. Shen et al. [30] first used an entropy-based parameter to detect speech signals. Their study indicated that the spectral entropy of a speech segment differs significantly from that of a noise segment. The spectral entropy, in fact, relies on the variance of the spectral magnitude to distinguish a speech signal from a noise signal, but this variance depends strongly on the noisy environment. L. S. Huang [31] integrated both the time energy and the spectral entropy to form a new feature parameter (EE-feature), since the spectral entropy failed under multi-talker babble and background music, whereas the energy performed well because of its additive property: the energy of the sum of speech and noise always exceeds the energy of the noise. Although the EE-feature parameter proposed by L. S. Huang improved endpoint detection under babble noise, it is unreliable when the noise level greatly exceeds the speech level.
The banded lines on a voice spectrogram, which result from the formant frequencies, are a highly efficient, compact representation of the time-varying characteristics of speech, especially for voiced sounds. Since the locations of the banded lines reveal that high power concentrates in some frequency bands, band decomposition is used to locate the formant frequency components, which appear as peaks, while the non-formant frequency components appear as valleys. A subband bandwidth of approximately 125 Hz has been found sufficient to display the location of the formant power [8]. A measure of entropy is defined in the subband domain and is referred to as the banded spectrum entropy (BSE) parameter. In fact, the arrangement of spectral magnitudes around the formant frequencies is another important factor for characterizing speech signals. Therefore, a set of weighting factors among the subbands is also incorporated into the BSE measure to discriminate the magnitude arrangement of speech signals from that of noise.
According to the experimental results of Wu et al. [7], some subbands contaminated by noise provide harmful information that leads to erroneous voice activity detection (VAD) decisions with BSE. Therefore, an automatic band-selection method that works on-line, called subband self-extraction (SSE), is derived as a refined version of the adaptive band-selection (ABS) method proposed by Wu et al. To compensate for the limitation of the BSE measure in modeling unvoiced sounds, the ratio of low-band energy to full-band energy (RLF) is presented to discriminate unvoiced sounds from background noise.
This chapter is organized as follows. Section 2.2 introduces the theory of entropy and the motivation for using an entropy measure to describe the nature of the banded lines on a voice spectrogram. The proposed feature parameters are then described in turn. In Section 2.3, we derive the so-called subband self-extraction (SSE) method, which is extended from ABS and can adaptively select useful bands on-line. The procedure for implementing the proposed entropy-based VAD algorithm, based on the measures of BSE and RLF and the SSE strategy, is then outlined. Section 2.4 discusses the performance of the proposed VAD algorithm under various noise conditions and compares it with that of the ATF-based algorithm. Finally, Section 2.5 summarizes the findings and discusses possible directions for future work.
2.2 The Robust Feature Parameters
This section introduces the theory of entropy and explains the motivation for using entropy to detect speech. In addition, the robust feature parameters are described in detail.
2.2.1 Motivation
Fig. 2-1 displays the waveform of a mixed signal comprising vehicle noise, multi-talker babble noise, factory noise, speech, and white noise, together with the corresponding spectrogram. As shown in Fig. 2-1(b), the voice-active spectrogram is dominated by the inherent banded lines (also called formant traces). This characteristic is sufficient to discriminate the speech signal from background noise. Fig. 2-2 displays the spectrograms of clean speech and of noisy speech with four kinds of noise at 0 dB. In this figure, the banded lines on the voice spectrogram are seen to persist under various types of additive noise. Thus, the formant frequency representation is a highly efficient, compact representation of the time-varying characteristics of speech. The following discussion shows how the banded lines can be used to detect voice activity by means of an entropy measure.
Entropy, first used in information theory by C. Shannon [32], is regarded as the amount of information that must be provided about a random signal $x$ in order to specify it uniquely. It measures the degree of organization (uncertainty) of the signal and is defined by

$$H(x) = \sum_{k} P(x_k) \cdot \log\left[\frac{1}{P(x_k)}\right], \qquad (2\text{-}1)$$

where $x = \{x_k\}_{0 \le k \le N-1}$ and $P(x_k)$ is the probability of $x_k$.

How can the definition of entropy be used to characterize a speech signal? The waveform of the Mandarin digit “eight” uttered by a native speaker and the corresponding spectrogram are shown in Fig. 2-3(a)-(b), respectively. Since the pitch varies continuously within a speech segment during speech production, the banded lines on the voice-active spectrogram are also continuous. When such a clear set of banded lines exists in some frequency bands for a long enough time, voice activity can be declared with considerable certainty [33]. Fig. 2-3(c)-(d) show the spectrum magnitude of voice activity, obtained by the short-time Fourier transform (STFT) over the solid-line region in Fig. 2-3(a), and that of voice inactivity, obtained by the STFT over the dashed-line region in Fig. 2-3(a), respectively. Inspecting the difference between the spectrum magnitude of voice activity and that of voice inactivity, we observe that the variance (uncertainty) of the spectral magnitude during voice activity indeed exceeds that during voice inactivity. Consequently, a measure of spectral entropy can be used to discriminate the spectral difference between voice activity and voice inactivity, even if the noise level is greater than the speech level.
2.2.2 A Measure of Conventional Spectral Entropy (SE)
J. L. Shen et al. [30] first used a measure of entropy for detecting speech segments under adverse conditions. The measure was defined in the spectral domain and named spectral entropy (SE). Their experimental results revealed that the SE during voice activity differs from that during voice inactivity. The derivation of the SE parameter is described as follows.
The STFT of a given time frame $s(n, l)$, where $l$ denotes the $l$th frame, is computed with a Hamming window $W(n)$; $N$ is the total number of frequency bins in the STFT for each frame. The spectral energy of each frame is then obtained from the STFT.
Then, the probability associated with each spectral energy component, $P(i, l)$, can be obtained by normalizing the spectral energy.
Following normalization, the corresponding SE, $H(l)$, is defined as follows.
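As an illustration only, the steps described above (windowing, STFT, spectral energy, normalization, entropy) can be sketched in Python using a common formulation of spectral entropy. The function name `spectral_entropy`, the 256-point FFT size, and the small constant guarding against division by zero are assumptions made for this sketch and are not taken from [30].

```python
import numpy as np

def spectral_entropy(frame, n_fft=256):
    """Spectral entropy of one frame (common formulation, assumed here)."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))      # Hamming window W(n)
    spectrum = np.fft.rfft(windowed, n=n_fft)      # transform of the windowed frame
    energy = np.abs(spectrum[: n_fft // 2]) ** 2   # spectral energy of each bin
    prob = energy / (energy.sum() + 1e-12)         # normalize to probabilities P(i, l)
    prob = prob[prob > 0]                          # treat 0 * log(1/0) as 0
    return float(np.sum(prob * np.log(1.0 / prob)))

# Example frames: a pure tone and white noise (both synthetic).
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 0.1 * np.arange(256))
noise = rng.standard_normal(256)
print(spectral_entropy(tone), spectral_entropy(noise))
```

The entropy is computed with the natural logarithm, following the form of (2-1); a different base only rescales the value.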
2.2.3 A Measure of The Proposed Banded Spectrum Entropy (BSE)
In fact, the SE measure performs well in white or quasi-white noise but fails in colored noise. The magnitude associated with each frequency bin is easily contaminated by noise, which degrades VAD performance at very low SNRs. Frequency band analysis is therefore incorporated into the SE measure to improve its robustness.
Next, we decompose the input signal into 32 uniform subbands. The subband energy, $E_b(m, l)$, is obtained by summing the spectral energy over the frequency bins of the $m$th subband, where $N_b$ is the total number of decomposed subbands in each frame. The limits of the summation denote the boundaries of each subband. For example, if $m = 1$, the first subband covers $k = 0$ to $3$.
Consequently, we modify (2-4) as (2-7), shown below:
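As an illustration only (not a reproduction of the original equations), the band decomposition and the resulting unweighted banded entropy can be sketched in Python. The names `subband_energies` and `banded_entropy` are hypothetical, and the choice of 128 frequency bins is an assumption inferred from the statement that the first of 32 subbands covers k = 0 to 3.

```python
import numpy as np

def subband_energies(bin_energy, n_subbands=32):
    """Sum per-bin spectral energy into uniform subbands.

    With 128 bins and 32 subbands, each subband holds 4 bins, so the
    first subband (m = 1) covers k = 0..3 as in the text.
    """
    bin_energy = np.asarray(bin_energy, dtype=float)
    bins_per_band = len(bin_energy) // n_subbands
    usable = bin_energy[: bins_per_band * n_subbands]
    return usable.reshape(n_subbands, bins_per_band).sum(axis=1)

def banded_entropy(band_energy):
    """Entropy of the normalized subband energies (unweighted form)."""
    band_energy = np.asarray(band_energy, dtype=float)
    prob = band_energy / (band_energy.sum() + 1e-12)  # subband probabilities
    prob = prob[prob > 0]                             # treat 0 * log(1/0) as 0
    return float(np.sum(prob * np.log(1.0 / prob)))

# Example with hypothetical per-bin energies for one frame.
bins = np.arange(128, dtype=float)
bands = subband_energies(bins)
print(bands[:3], banded_entropy(bands))
```

Summing bins before normalizing means that noise affecting a single bin has less influence on the probabilities than in the per-bin SE measure.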
In fact, the arrangement of spectral magnitudes around the formant frequencies is another important factor for characterizing speech signals, so the order of the subband powers must be considered when describing the magnitude arrangement.
The well-known measure of entropy, however, cannot indicate the distribution (spatial information) of a data sequence. Fig. 2-4 shows the power distribution over all 32 decomposed subbands during voice-active and voice-absent frames. It illustrates that the classical spectral entropy cannot discriminate between speech and non-speech segments on the basis of the distribution of subband energy. The distributions of subband energy during a speech segment and a non-speech segment are shown in Fig. 2-4(a) and Fig. 2-4(b), respectively. Measuring these two distributions with spectral entropy yields the same logarithmic BSE value (H=21.9601). The main cause is that the measure fails to describe the nature of the banded lines on the voice-active spectrogram.
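The limitation noted above follows from the fact that entropy depends only on the set of probability values, not on their order along the frequency axis. A minimal Python sketch with hypothetical subband energies (the values below are invented for illustration and are not the data of Fig. 2-4) makes this concrete:

```python
import numpy as np

def entropy(energies):
    """Entropy of normalized energies, in the form of (2-1)."""
    prob = np.asarray(energies, dtype=float)
    prob = prob / prob.sum()
    prob = prob[prob > 0]
    return float(np.sum(prob * np.log(1.0 / prob)))

# Hypothetical subband energies with peaks grouped into formant-like bands.
structured = np.array([1, 1, 9, 9, 9, 1, 1, 1, 8, 8, 1, 1], dtype=float)

# The same values rearranged along the frequency axis.
rng = np.random.default_rng(1)
shuffled = rng.permutation(structured)

# The two entropies agree up to floating-point rounding.
print(np.isclose(entropy(structured), entropy(shuffled)))
```

Any measure intended to capture the arrangement of the peaks therefore needs additional position-dependent information, which is the role of the weighting factors introduced next.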
To solve this problem, a set of weighting factors $W(m,l)$ among the subbands is incorporated into the BSE measure to discriminate the magnitude arrangement of spectral energy among all 32 subbands; it is regarded as a normalized factor. If $W(m,l)$ on the $m$th subband is large, this implies that a banded line may be located around the subbands from the $(m-1)$th subband to the $(m+1)$th subband.
Conversely, if $W(m,l)$ is low, the banded line is not located on these subbands. By using the set of weighting factors $W(m,l)$, the magnitude arrangement of the spectral response can express the nature of the banded lines on the voice-active spectrogram, and (2-8) is modified as (2-11) as follows:
Fig. 2-5 clearly indicates that the proposed BSE parameter, which uses band decomposition and a set of weighting factors $W(m,l)$, characterizes speech signals more effectively than other entropy-based parameters such as the SE parameter proposed by J. L. Shen et al. [30].
In fact, the frequency energies of different types of noise are concentrated in different frequency bands [7], as shown in Fig. 2-6. This observation demonstrates that the subbands with larger noise energy contaminate the useful frequency information more than the other bands do. The bands with larger noise energy are hereafter regarded as harmful subbands and must be discarded to yield more accurate frequency information. Although the BSE remains a good feature parameter, the detection sometimes fails at very low SNRs, especially when harmful bands are involved. How to discard the harmful subbands, or equivalently preserve the useful subbands, thus becomes an important task. The number of harmful subbands (or useful subbands) is related to the background noise level [7]. To estimate the background noise level easily, we extend the MiMSB parameter [34] to adaptively choose the one subband with minimum energy as a rough estimate of the varying noise level. Regardless of the changing noise level, a normalized minimum band energy (NMinBE) parameter is proposed to estimate the background noise level and thereby decide the number of useful subbands precisely. The NMinBE parameter is determined as follows:
where the $\min\{\cdot\}$ operator selects the minimum band energy among all 32 subband energies for a given frame, and $\log[\cdot]$ is the logarithmic operation. The number of useful subbands required to yield reliable information is denoted $N_{ub}(l)$. Fig. 2-7 displays the relation between $N_{ub}(l)$ and $NMinBE(l)$. [...] level (corresponding to a low SNR). According to (2-13), for the $l$th frame the first $(32 - N_{ub}(l))$ frequency bands with larger energies are adaptively selected and removed as noise components. Finally, the BSE measure with the strategy of extracting the useful subbands is given as follows:
The variable noise level and the varying statistics of noise generally result in a varying $N_{ub}(l)$ over an entire signal. Using the relationship between $N_{ub}(l)$ and the noise level described in (2-13), we can evaluate the effect of the useful-subband extraction strategy on the BSE measure. Fig. 2-8(a) and 2-8(b) plot the waveform of the Mandarin digit “eight” spoken in factory noise of increasing level and the corresponding spectrogram, respectively. Voice activity is not easily detected in this adverse environment because the BSE measure involves some harmful subbands, as shown in Fig. 2-8(c). Fig. 2-8(d) shows that the proposed NMinBE parameter reflects the variation of the noise level.
According to the relation in Fig. 2-7, the number of useful subbands can be determined as shown in Fig. 2-8(e). Fig. 2-8(f) shows that the BSE measure with a manual extraction of useful subbands greatly improves voice activity detection, especially under a variable noise level. Consequently, how to extract the useful subbands automatically over time is crucial and is discussed in the next section.
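The band-discarding step described in this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the mapping from NMinBE to the number of useful subbands in (2-13) is not reproduced, so the count `n_useful` is simply supplied by the caller, and the function names are hypothetical. The entropy is computed in the unweighted form, without the weighting factors W(m, l).

```python
import numpy as np

def select_useful_subbands(band_energy, n_useful):
    """Boolean mask keeping the n_useful subbands with the smallest energy.

    The (len(band_energy) - n_useful) subbands with the largest energy
    are treated as harmful and masked out, following the text.
    """
    band_energy = np.asarray(band_energy, dtype=float)
    order = np.argsort(band_energy, kind="stable")  # ascending energy
    mask = np.zeros(band_energy.shape, dtype=bool)
    mask[order[:n_useful]] = True
    return mask

def entropy_over_useful(band_energy, mask):
    """Unweighted entropy computed over the retained subbands only."""
    kept = np.asarray(band_energy, dtype=float)[mask]
    prob = kept / (kept.sum() + 1e-12)
    prob = prob[prob > 0]
    return float(np.sum(prob * np.log(1.0 / prob)))

# Example: four hypothetical subband energies, keep the two quietest.
energies = np.array([5.0, 1.0, 9.0, 2.0])
mask = select_useful_subbands(energies, n_useful=2)
print(mask, entropy_over_useful(energies, mask))
```

Because the mask is recomputed for every frame, the set of retained subbands can follow changes in the noise spectrum over time.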
2.2.4 The Ratio of Low-band Energy to Full-band Energy (RLF)
Unlike voiced sounds, unvoiced sounds do not have any formant frequency components, and an unvoiced sound is difficult to discriminate from background noise by measuring the energy level alone. The majority of unvoiced sounds, however, display strong spectral concentration in the higher frequency range, whereas background noise displays a uniform spectral distribution. It is therefore possible to distinguish speech activity from background noise by examining the distribution of energy along frequency. The low-band energy, $E_{low}(l)$, measured below 1000 Hz, is computed as follows:

$$E_{low}(l) = 10 \times \log \sum_{\omega=0}^{\pi/4} \hat{X}_{energy}(\omega, l), \qquad (2\text{-}15)$$

where $\hat{X}_{energy}$ is obtained by ignoring the harmful subbands.
Similarly, the full-band energy, $E_{full}(l)$, measured over the entire frequency bandwidth (0~4000 Hz), is given by the same expression with the summation taken over the entire band, from $\omega = 0$ to $\pi$.
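A minimal Python sketch of the low-band and full-band energies follows. Several details are assumptions made for illustration: the 8 kHz sampling rate (inferred from the 0~4000 Hz full band), the base-10 logarithm, the function name `low_full_band_energy`, and the form of the ratio, which is taken here as the quotient of the linear energies because the exact definition of RLF is not given in this passage.

```python
import numpy as np

def low_full_band_energy(bin_energy, sample_rate=8000.0,
                         low_cutoff_hz=1000.0, eps=1e-12):
    """Return (E_low in dB, E_full in dB, linear low/full energy ratio).

    bin_energy holds the spectral energy of one frame for bins that
    span 0 .. sample_rate / 2, with harmful subbands already zeroed.
    """
    bin_energy = np.asarray(bin_energy, dtype=float)
    n_bins = len(bin_energy)
    freqs = np.arange(n_bins) * (sample_rate / 2.0) / n_bins  # bin frequencies in Hz
    low_sum = bin_energy[freqs < low_cutoff_hz].sum()
    full_sum = bin_energy.sum()
    e_low = 10.0 * np.log10(low_sum + eps)
    e_full = 10.0 * np.log10(full_sum + eps)
    ratio = low_sum / (full_sum + eps)  # assumed form of the ratio
    return e_low, e_full, ratio

# Example: a flat spectrum puts one quarter of the energy below 1000 Hz.
e_low, e_full, ratio = low_full_band_energy(np.ones(128))
print(e_low, e_full, ratio)
```

For a frame dominated by high-frequency energy, as described for unvoiced sounds, the ratio falls below the value obtained for a flat spectrum.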
2.3 The Proposed Entropy-based VAD Algorithm
According to [26], the required characteristics of an ideal voice activity detector are reliability, robustness, accuracy, adaptation, simplicity, and real-time processing.
Although some existing VAD algorithms are extremely accurate, they depend on complicated computation and are not reliable in real applications. For example, Wang et al. [28] proposed a robust algorithm based on wavelet analysis, but it operates off-line. E. Nemer et al. [20] used a higher-order-statistics (HOS) parameter to detect speech, but the calculation of this parameter requires too much computing time. Wu et al. [7] suggested an adaptive band-selection (ABS) method, which selects useful bands automatically as a form of noise cancellation so that the ATF parameter performs well. However, the ABS method requires all the information from the entire recorded signal. Although those algorithms are inappropriate for practical implementation, some of their ideas are adopted herein. The ABS method proposed by Wu et al. [7] is effective for noise cancellation. It was used to preserve the useful bands (or discard the harmful bands) for each frame, but the band selection depends on the entire recorded signal. The drawbacks of ABS are thus as follows:
-- First, the band-selection decision cannot be made immediately. Since the method is an off-line strategy, the decision must be made by analyzing the entire recorded signal.
-- Second, in practice the indexes of the harmful bands vary with time over the recorded signal. However, Wu et al. assumed that the indexes of the harmful bands were fixed. This assumption does not hold.
Therefore, a means of detecting when the indexes of the harmful bands change over time is required. Referring again to Fig. 2-8(f), we observe that the selected subbands are not contaminated by noise: the corresponding BSE value during voice absence is small and varies smoothly and only slightly with time, compared with Fig. 2-8(c). Conversely, in Fig. 2-8(c), the selected bands are contaminated by background factory noise, and the entropy value during voice absence is large and varies violently. In conclusion, the entropy value is quite large and varies violently whenever the considered subbands include harmful subbands, whereas it is small and varies very smoothly if the considered subbands do not include harmful subbands. This finding provides a hint about how to detect when the indexes of the harmful bands change over time.
2.3.1 An Adaptive Threshold Method
In order to extract the harmful subbands automatically, an adaptive thresholding