ACCEPTANCE LETTER - Emotion Detection System Based on Audio-Visual Signal Fusion

Dear ChingShun Lin,

Congratulations! It is my great pleasure to inform you that your paper, Fast Ensemble Empirical Mode Decomposition for Speech-Like Signal Analysis Using Shaped Noise Addition, has been accepted by ICIS2011: The 4th International Conference on Interaction Sciences, which will be held on August 16-18, 2011 in Busan, Republic of Korea.

This year, we have accepted a large number of various papers from more than 25 countries and it wasn't easy work for us to select the most innovative and well-written papers among them.

With the assigned ID and PW, you can check or revise your personal and paper information through the ICIS2011 web system: http://www.aicit.org/ICIS

Mission and Aims

ICIS will trigger off the scientific community to propose topics that should be tackled from research perspective and let the community explain how to best use their tools for practical and theoretical problems of Interaction Sciences. Many various contributions are foreseen from prospective authors. This includes use-cases of theoretical tools and methods to solve practical problems. Such contributions should be as usable as possible by practitioners in the related field. We also expect research results from practitioners that have identified a problem that could be solved by tools from network sciences. One of the missions of ICIS is to make the scientific community aware of the importance of the issues in Interaction Sciences and to suggest means by which the problem may be solved by the scientific community. The contributions should stimulate interaction between theoreticians and practitioners and also have high potential impact in either field.

We aim to bring together and share brilliant ideas, accumulated knowledge, and unique experiences for the mutual benefit of researchers and practitioners, and to explore possible directions for the development of multidisciplinary and hybrid/convergent research in the areas of Interaction Sciences.

Once again, congratulations on your paper acceptance.

I look forward to seeing you at the conference soon.

Prof. Franz I. S. Ko, Ph.D.

General Chair, ICIS2011.

Honorary Director General, IBC, Cambridge, UK.

Vice-President, The World Congress of Arts, Sciences and Communications, Cambridge, UK.

Address: 707, Seokjang-dong, Gyeongju-si, Gyeongbuk, 780-741, Korea(Rep. of) Registration Number: 505-10-96301

TEL: +82-70-7730-2833

Fast Ensemble Empirical Mode Decomposition for Speech-Like Signal Analysis Using Shaped Noise Addition

ChingShun Lin, JyngSiang Wang, and ZongChao Cheng

Department of Electronic Engineering, National Taiwan University of Science and Technology

43, Section 4, Keelung Rd., Taipei, Taiwan

[email protected], [email protected], and [email protected]

Abstract—Empirical mode decomposition (EMD) is a useful approach for processing nonlinear and non-stationary signals. However, its shortcomings include mode mixing and end effects, which usually appear in the decomposed bands. Although a noise-assisted data analysis (NADA) method called ensemble empirical mode decomposition (EEMD) has been proposed to circumvent this problem, alleviating the mode mixing in this way inevitably requires a long computation. In this paper, we use shaped noise instead of white noise as the disturbance so that EEMD converges faster. The signal-spectrum-dependent noise (SSDN) effectively randomizes the targeted signal in the time domain while avoiding superfluous calculation around the corresponding energy-free frequencies. The experimental results also show that both pink noise and brown noise outperform white noise in terms of computation for the EEMD of speech-like signals.

Keywords-Ensemble empirical mode decomposition; Intrinsic mode function; Noise-assisted data analysis; Signal-spectrum-dependent noises.

I. INTRODUCTION

Signal decomposition in the time-frequency domain has been a very useful approach in speech processing applications, such as speaker identification, language recognition, pitch estimation, speech coding, and pathological voice analysis. Although many methods of decomposing speech signals have been studied for decades, accurate and reliable processing is still a challenging task owing to the diversity of speech characteristics. Most speech signal processing approaches are based on the assumption that the speech signal is stationary over a short time, without considering its quasi-periodicity. Although the frequency-domain representation often provides useful information, it does not indicate how the spectrum evolves over time. In this regard, the time-frequency representation (TFR) is appreciated for its reliable expression of speech and its robust set of transformations [1]. TFR is a set of transforms that map a 1-D time-domain signal into a 2-D representation of energy versus time and frequency. Several transformations are available for the time-frequency representation, such as the widely used short-time Fourier transform (STFT) and wavelet transform (WT) [2], [3]. An inherent drawback of STFT is the trade-off between time and frequency resolutions. The wavelet transform, on the other hand, is similar to STFT in that it also provides a time-frequency map of the signal being analyzed. However, limited by the size of the wavelet basis, the downside of its uniform resolution is uniformly poor resolution.

Moreover, an important limitation of wavelet analysis is its non-adaptive nature. Once the wavelet basis is selected, it must be used to analyze the entire signal. Although both STFT and WT offer finite time-frequency resolution under the restriction of the uncertainty principle, the underlying mathematical ideas are the same for all these representations [4].

The empirical mode decomposition (EMD) used in speech analysis, on the other hand, is an approach that does not rely on a logarithmic scale or an optimal kernel. These properties make it suitable for speech signals, which are, in general, locally stationary but globally nonstationary. It was proposed as an adaptive time-frequency signal analysis approach, which has attracted extensive attention and has been applied successfully in many fields such as tide analysis, earthquake prediction, distortion detection, structural testing, fault diagnosis, and cardiac arrhythmia measurement. In addition, EMD is well known for revealing instantaneous changes of frequency or time in nonlinear and nonstationary signals, so that the features distributed in the time-frequency domain can be accurately detected [5].

As useful as EMD has proved to be, several annoying difficulties remain unresolved. One of the major drawbacks of the original EMD is mode mixing, defined as a single intrinsic mode function (IMF) either containing signals of widely disparate scales, or a signal of a similar scale residing in different IMF components. Mode mixing is a consequence of signal intermittence that can not only cause serious aliasing in the time-frequency distribution, but also make the physical meaning of individual IMFs ambiguous [6], [7]. To alleviate this drawback, Wu and Huang proposed the ensemble empirical mode decomposition (EEMD), which defines the true IMFs as the average of an ensemble of trials [8].

After a certain amount of white noise is added to the targeted signal, the signal in the band has a uniformly distributed reference scale, which forces the EEMD to exhaust all possible solutions in the sifting processes for minimizing mode mixing [8]. The final presentation of the EEMD is still an energy-frequency-time distribution, designated as the Hilbert spectrum. Unlike the Fourier and wavelet transforms, EEMD has no a priori defined basis, and therefore it is capable of processing nonlinear and nonstationary signals successfully. However, EEMD usually takes a long time to obtain consistent intrinsic mode functions (IMFs), especially for signals with abrupt changes. In this work, we show how the introduction of shaped noises may accelerate the EEMD processing for speech-like signal analysis.

Figure 1. Flowchart of improved ensemble empirical mode decomposition for speech-like signal analysis using added shaped noises. [Flowchart blocks recoverable from the extraction: input of the target signal s(t) and threshold ε; noise generator; completion of the k-th trial, yielding M IMFs and the residue r_M(t).]

The organization of this paper is as follows. In the next section, we review the processes of the original EMD and then explain the principle of ensemble empirical mode decomposition. In Section III, properties of shaped noises for EEMD are explored, including colored noises and signal-spectrum-dependent noises. Experimental results are then presented in Section IV to illustrate the performance of the proposed system. Finally, conclusions and future research are provided in Section V. The flowchart outlining the fast ensemble empirical mode decomposition for speech-like signal analysis using added shaped noises is illustrated in Fig. 1.

II. ENSEMBLE EMPIRICAL MODE DECOMPOSITION

The ensemble empirical mode decomposition (EEMD) is proposed to improve the decomposition results in the EMD method. The procedures for both algorithms are described as follows.

A. Original EMD

Empirical mode decomposition (EMD) is one of the effective approaches for processing nonstationary signals.

The components resulting from EMD, called intrinsic mode functions (IMFs), are characterized by two properties:

1) The number of extrema and the number of zero crossings should differ by no more than one.

2) The local average defined by the average of the maximum and minimum envelopes is zero, i.e., both envelopes are locally symmetric around the envelope mean.
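The first property can be checked numerically. The following is a minimal sketch, assuming NumPy and a uniformly sampled signal; the helper names (`count_extrema`, `count_zero_crossings`, `satisfies_count_condition`) are hypothetical and not part of the paper. It covers only the counting condition, not the zero-local-mean condition.

```python
import numpy as np

def count_extrema(x):
    """Count interior local maxima and minima via sign changes of the first difference."""
    d = np.diff(x)
    return int(np.sum(d[:-1] * d[1:] < 0))

def count_zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return int(np.sum(x[:-1] * x[1:] < 0))

def satisfies_count_condition(x):
    """IMF property 1: extrema and zero-crossing counts differ by at most one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1

t = np.arange(1000) / 1000.0
tone = np.sin(2 * np.pi * 5 * t + 0.3)          # zero-mean oscillation
print(satisfies_count_condition(tone))           # oscillation around zero
print(satisfies_count_condition(tone + 2.0))     # same oscillation riding on an offset
```

A tone riding on a constant offset has the same extrema but no zero crossings, which is why such a component is not accepted as an IMF until the local mean has been sifted out.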

Based on these rules, a signal may be decomposed into a number of IMFs. Consider a real and stable sequence s(t), which may be divided into fine-scale details and a residue. The IMFs of the signal s(t) are found by iterating the following sifting processes:

1) Initialize r_0(t) ← s(t), and i ← 1.

2) Find the i-th IMF.

a) Initialize g_{i,j}(t) ← r_{i−1}(t), and the number of sifts j ← 0.

b) Find the local maxima and minima of g_{i,j}(t).

c) Estimate the maximum envelope u_{i,j}(t) of g_{i,j}(t) by passing a cubic spline through the local maxima. Similarly, find the minimum envelope l_{i,j}(t) with the local minima.

d) Compute an approximation to the local average:

m_{i,j}(t) ← 0.5 (u_{i,j}(t) + l_{i,j}(t)).

e) Extract the detail g_{i,j+1}(t) ← g_{i,j}(t) − m_{i,j}(t).

f) Check whether the stopping criterion, defined as the standard deviation computed from two consecutive results in the sifting process, is smaller than a given value ε [5]:

$$ SD = \sum_{t=0}^{T} \frac{|g_{i,j}(t) - g_{i,j+1}(t)|^2}{g_{i,j}^2(t)} \tag{1} $$

where T denotes the length of the original signal. If SD ≥ ε, let j ← j + 1 and return to step b). If not, let the i-th IMF x_i(t) ← g_{i,j+1}(t).

3) Update r_i(t) ← r_{i−1}(t) − x_i(t).

4) If the residue r_i(t) has more than one extremum, let i ← i + 1 and repeat step 2); otherwise stop, with M = i IMFs and the residue r_M(t), which has at most one extremum or is constant.
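The sifting procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy only, so the cubic-spline envelopes of step c) are swapped for linear interpolation (`np.interp`); the stopping test uses an energy-normalized ratio, a common variant of the standard-deviation criterion; and the function names and the defaults `eps=0.2`, `max_sifts=50`, and `max_imfs=10` are hypothetical.

```python
import numpy as np

def local_extrema(g):
    """Indices of interior local maxima and minima of a 1-D array."""
    d = np.diff(g)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    return maxima, minima

def sift_one_imf(r, eps=0.2, max_sifts=50):
    """Steps 2a) to 2f): sift r until the stopping ratio drops below eps."""
    g = r.copy()
    n = np.arange(len(r))
    for _ in range(max_sifts):
        maxima, minima = local_extrema(g)
        if len(maxima) < 2 or len(minima) < 2:
            break                                    # too few extrema to build envelopes
        upper = np.interp(n, maxima, g[maxima])      # linear stand-in for the cubic spline
        lower = np.interp(n, minima, g[minima])
        g_next = g - 0.5 * (upper + lower)           # subtract the local average
        sd = np.sum((g - g_next) ** 2) / (np.sum(g ** 2) + 1e-12)
        g = g_next
        if sd < eps:
            break
    return g

def emd(s, max_imfs=10, eps=0.2):
    """Steps 1) to 4): peel off IMFs until the residue has at most one extremum."""
    r = np.asarray(s, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(r)
        if len(maxima) + len(minima) <= 1:
            break
        x = sift_one_imf(r, eps)
        imfs.append(x)
        r = r - x                                    # update the residue
    return np.array(imfs), r

t = np.arange(1000) / 1000.0
s = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imfs, residue = emd(s)
print(len(imfs), np.max(np.abs(imfs.sum(axis=0) + residue - s)))
```

Because every IMF is subtracted from the running residue, the IMFs plus the final residue add back to the input up to floating-point error, whichever envelope interpolator is used.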

The goal of sifting is to repeatedly subtract the large-scale features of the signal from the fine-scale ones. Finally, the signal s(t) is represented as a sum of IMFs and the residue:

$$ s(t) = \sum_{i=1}^{M} x_i(t) + r_M(t) \tag{2} $$

The sifting processes remove the low-frequency information in each step of EMD until the highest frequency component remains. Theoretically, adding all the IMFs together with the residue can reconstruct the signal without signal distortion.

Figure 2. Spectrogram of the word: /aa-jh-eh-k-t-w-ih-dh/ at 16kHz sampling rate and 16-bit depth resolution. [Axis ticks recovered from the extraction: 0.05 to 0.45 in steps of 0.05.]

Cubic spline interpolation is commonly used to approximate the upper and lower envelopes in the EMD; however, it usually fails to model boundaries, especially those with abrupt changes. Although one of the approaches to circumvent this problem is to use a longer signal and omit the ends of the processed signal, this approach does not work for a short signal [9]. After obtaining the IMFs, we apply the Hilbert transform to each IMF x_i(t):

$$ y_i(t) = \frac{1}{\pi} P \int_{-\infty}^{\infty} \frac{x_i(\tau)}{t - \tau} \, d\tau \tag{3} $$

where P denotes the Cauchy principal value of the integral.

The analytic signal of x_i(t) can then be defined as [10]:

$$ z_i(t) = x_i(t) + j y_i(t) = a_i(t) e^{j \phi_i(t)} \tag{4} $$

where $a_i(t) = \sqrt{x_i(t)^2 + y_i(t)^2}$ and $\phi_i(t) = \tan^{-1}(y_i(t) / x_i(t))$ are the instantaneous amplitude and the instantaneous phase at time t, respectively. Then the instantaneous frequency may be derived from:

$$ f_i(t) = \frac{\omega_i(t)}{2\pi} = \frac{\dot{\phi}_i(t)}{2\pi} \tag{5} $$

and the original signal s(t) being analyzed can be represented as:

$$ s(t) = \mathrm{Re} \sum_{i=1}^{M} a_i(t) \, e^{j \int \omega_i(t) \, dt} \tag{6} $$

Notice that the residue r_M(t) is excluded because it is either monotonic or constant. Since both the instantaneous amplitude and the instantaneous frequency are functions of time, the time-frequency representation of the Hilbert amplitude spectrum may be expressed as:

$$ H(\omega, t) = \mathrm{Re} \sum_{i=1}^{M} a_i(\omega_i, t) \, e^{j \int \omega_i(t) \, dt} \tag{7} $$

Figure 3. Time-frequency representation of Hilbert amplitude spectrum of the word: /aa-jh-eh-k-t-w-ih-dh/ (16kHz; 16-bit).

Based on the Hilbert-Huang spectrum, the marginal spectrum h(ω) can be formulated as:

$$ h(\omega) = \int_{T} H(\omega, t) \, dt \tag{8} $$

where T denotes the length of original signal. The marginal spectrum h(ω) provides a measurement of total amplitude (or energy) at every frequency.
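The chain from Hilbert transform to marginal spectrum can be sketched numerically. The following is a minimal illustration assuming NumPy and a uniformly sampled IMF; the function names, the FFT-based discrete Hilbert transform, the four-quadrant `np.angle` in place of the plain arctangent, the finite-difference phase derivative, and the frequency binning (`n_bins=50` over 0 to fs/2) are all assumptions of this sketch rather than details given in the paper.

```python
import numpy as np

def analytic_signal(x):
    """z(t) = x(t) + j*y(t), with y the discrete Hilbert transform of x (FFT method)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def instantaneous_attributes(x, fs):
    """Instantaneous amplitude a(t) and frequency f(t) in Hz."""
    z = analytic_signal(x)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    freq = np.gradient(phase) * fs / (2.0 * np.pi)   # finite-difference derivative
    return amplitude, freq

def hilbert_spectrum(imfs, fs, n_bins=50):
    """Accumulate each IMF's amplitude into a (frequency bin, time) grid over 0..fs/2."""
    imfs = np.atleast_2d(imfs)
    n_samples = imfs.shape[1]
    H = np.zeros((n_bins, n_samples))
    edges = np.linspace(0.0, fs / 2.0, n_bins + 1)
    cols = np.arange(n_samples)
    for x in imfs:
        a, f = instantaneous_attributes(x, fs)
        rows = np.clip(np.digitize(f, edges) - 1, 0, n_bins - 1)
        np.add.at(H, (rows, cols), a)
    return H, edges

def marginal_spectrum(H, fs):
    """Time integral of H approximated by a Riemann sum with step 1/fs."""
    return H.sum(axis=1) / fs

fs = 1000.0
t = np.arange(1000) / fs
tone = np.cos(2 * np.pi * 55 * t)                 # stand-in for a single IMF
H, edges = hilbert_spectrum(tone, fs)
h = marginal_spectrum(H, fs)
peak = int(np.argmax(h))
print(edges[peak], edges[peak + 1])               # frequency bin that holds the tone
```

For a unit-amplitude tone lasting one second, all of the amplitude lands in the single bin containing 55 Hz, so the marginal spectrum in that bin is close to 1 and zero elsewhere.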

B. Ensemble EMD

When the signal is intermittent, the dyadic property is often compromised in the original EMD. Adding white noise to the targeted signal may provide a uniformly distributed reference scale, which enables EMD to repair the mode mixing. Moreover, the EEMD method can separate the natural scales of signals clearly without selecting any a priori subjective criterion [8]. Since the IMFs corresponding to different noise series are uncorrelated with each other, the noise in each trial may be canceled out in the ensemble mean given sufficient trials. With these properties of the EMD, the ensemble empirical mode decomposition is developed as follows [8]:

1) Add a white noise series to the targeted signal.

2) Decompose the signal with added white noise into IMFs.

3) Repeat step 1 and step 2 with different white noise series.

4) Obtain the ensemble means of corresponding IMFs of the decompositions as the final result.

Figure 4. Convergence rates for speech-like signal decomposition using different shaped noises (λ = 15). [Plot: signal to error ratio (dB) versus number of trials (0 to 1000) at a signal to added noise ratio of 15; curves for white, pink, brown, and SSD noise.]

The iteration is terminated when the number of trials in the ensemble reaches a given bound N:

x_i(t) = 1 … (9)

where … the noise-added signal, σ is the standard deviation of the added noise, and r_k(t) is the residual after extracting the first k IMF components [7], [8]. Because the added noise differs from trial to trial, it can almost be removed by the ensemble mean over all trials. The ensemble number N should be as large as possible for a reliable result.
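The four EEMD steps can be sketched as a loop over noise-perturbed copies of the signal. This is a minimal illustration assuming NumPy. To stay self-contained it uses a hypothetical two-band moving-average split (`two_band_split`) as a stand-in for a real EMD routine, and it assumes every trial returns the same number of components, which a real EMD does not guarantee. The names and the defaults `n_trials=200` and `sigma=0.2` are illustrative.

```python
import numpy as np

def two_band_split(x, width=11):
    """Hypothetical stand-in for EMD: detail (x minus moving average) plus the average."""
    kernel = np.ones(width) / width
    smooth = np.convolve(x, kernel, mode="same")
    return np.array([x - smooth, smooth])

def eemd(s, decompose, n_trials=200, sigma=0.2, seed=0):
    """Average the components obtained from independently perturbed copies of s."""
    rng = np.random.default_rng(seed)
    scale = sigma * np.std(s)                    # noise amplitude relative to the signal
    total = None
    for _ in range(n_trials):
        noisy = s + scale * rng.standard_normal(len(s))      # step 1: add white noise
        comps = decompose(noisy)                              # step 2: decompose
        total = comps if total is None else total + comps     # step 3: repeat and accumulate
    return total / n_trials                                   # step 4: ensemble mean

t = np.arange(1000) / 1000.0
s = np.sin(2 * np.pi * 5 * t)
mean_components = eemd(s, two_band_split)
residual_noise = np.std(mean_components.sum(axis=0) - s)
print(residual_noise, 0.2 * np.std(s) / np.sqrt(200))
```

With this linear stand-in, the noise left in the ensemble mean is simply the average of the added noise series, so its standard deviation shrinks roughly as the added noise level divided by the square root of the number of trials.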

III. SHAPED NOISES

The EMD is a dyadic filter bank for any white (or fractional Gaussian) noise-only series [11]. Recent studies of the statistical properties of white noise showed that the EMD is an effective self-adaptive dyadic filter bank when applied to white noise [12]. In general, if the length of the time series is T, we can get ⌊log₂(T)⌋ − 1 IMFs. The ensemble number N and the noise amplitude σ are the two parameters that need to be assigned in the EEMD approach.

The rule suggested in [8] is:

$$ \sigma_N = \frac{\sigma}{\sqrt{N}} \tag{10} $$

where σ_N is the final standard deviation of error, defined as the difference between the targeted signal and the corresponding IMFs. If the noise amplitude σ is too small, it may not introduce any disturbance for signal diversity in EEMD.

Figure 5. Convergence rates for speech-like signal decomposition using different shaped noises (λ = 20). [Plot: signal to error ratio (dB) versus number of trials (0 to 1000) at a signal to added noise ratio of 20; curves for white, pink, brown, and SSD noise.]

The rule of thumb for the noise amplitude is about one fifth of the standard deviation of the amplitude of the targeted signal [8]. In addition, by increasing the number of ensemble members, the effect of the added white noise can always be reduced to a negligibly small level. In general, an ensemble number of a few hundred will lead to a consistent result. As the improved EEMD flowchart in Fig. 1 shows, the EMD procedure is obtained as a special case by setting N = 1 and σ = 0.
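As a worked illustration of these rules, consider hypothetical values: a series of T = 8000 samples, a noise amplitude of σ = 0.2 (one fifth, in units of the signal's standard deviation), the ensemble size N = 1000 used in the experiments of this paper, and an illustrative target error of σ_N = 0.01. The numbers below are arithmetic on those assumptions only, not results from the paper.

```latex
\begin{align*}
\text{IMF count: } & \lfloor \log_2 8000 \rfloor - 1 = 12 - 1 = 11 \\
\sigma_N & = \frac{\sigma}{\sqrt{N}} = \frac{0.2}{\sqrt{1000}} \approx 0.0063 \\
N & = \left( \frac{\sigma}{\sigma_N} \right)^2 = \left( \frac{0.2}{0.01} \right)^2 = 400
\end{align*}
```

The last line inverts the rule: halving the target error requires four times as many trials, which is the computational cost that the shaped noises are meant to reduce.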

A. Colored noises

Power-law noise is defined as a signal with components at all frequencies whose power spectral density per unit of bandwidth is proportional to 1/f^α [13]. For white noise, the spectral density is flat over the whole frequency range (α = 0), whereas pink noise has α = 1. Compared with pink noise, brown noise has an even stronger shift in energy towards the lower end of the spectrum (α = 2). If the signal is mainly composed of high-frequency components, the noise amplitude may be relatively small. On the other hand, it may be increased if the signal is dominated by low-frequency components. Therefore, for a signal whose energy spectrum decreases with increasing frequency, pink noise or brown noise would be a suitable added noise in the EEMD application because of the spectral similarity between the noise and the signal.
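One common way to generate such noise is to shape the spectrum of white Gaussian noise. The sketch below assumes NumPy and is not necessarily the generator used in the paper. The function names are hypothetical, the DC term is dropped, and the output is normalized to unit standard deviation.

```python
import numpy as np

def power_law_noise(n, alpha, seed=0):
    """Noise with PSD proportional to 1/f**alpha (0 white, 1 pink, 2 brown)."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)
    shaping = np.ones_like(f)
    shaping[1:] = f[1:] ** (-alpha / 2.0)    # amplitude f^(-alpha/2) gives power f^(-alpha)
    shaping[0] = 0.0                          # drop the DC term
    x = np.fft.irfft(spectrum * shaping, n)
    return x / np.std(x)                      # unit standard deviation

def spectral_slope(x):
    """Least-squares slope of log power versus log frequency."""
    f = np.fft.rfftfreq(len(x))[1:]
    p = np.abs(np.fft.rfft(x))[1:] ** 2
    return np.polyfit(np.log(f), np.log(p), 1)[0]

for name, alpha in (("white", 0.0), ("pink", 1.0), ("brown", 2.0)):
    noise = power_law_noise(2 ** 14, alpha)
    print(name, round(float(spectral_slope(noise)), 2))   # slope should be near -alpha
```

The fitted log-log slope gives a quick check that a generated series has the intended color before it is used as the added noise.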

B. Signal-spectrum-dependent noises (SSDN)

EEMD fully uses the statistical characteristics of white noise to perturb the target signal in the neighborhood of its true solution and then cancels out the white noise via ensemble averaging. Obviously, a small-amplitude noise leads to a small decomposition error. However, if the added noise is too small, the targeted signal will not be perturbed to a different state for the ensemble calculation. As the amplitude of the added noise increases, the ensemble number must be increased as well to ensure a precise decomposition, which results in a long computation. To increase the convergence rate for the speech-like signal without involving redundant noise components, we define the shaped noise of the power spectrum S(f) as:

$$ \rho_k(t) = \lambda \cdot \mathrm{Real}\left( \mathrm{IFFT}\left( S^{1/2}(f) \, e^{i 2\pi \varphi(t)} \right) \right) \tag{11} $$

where λ is the signal to added noise ratio, and φ(t) is a series uniformly distributed on the interval [0, 1]. The ratio of the signal-spectrum-dependent noise is determined on the basis of the established criterion for added noise in EEMD. After the shaped noise is added, the signal will have enough extrema for alleviating mode mixing without wasting computation around the energy-free frequencies. All the corresponding IMFs are obtained on the principle that the average of statistically uncorrelated random sequences is equal to zero, and therefore the effect of the added noises on the targeted signal is eliminated.

Figure 6. Convergence rates for speech-like signal decomposition using different shaped noises (λ = 25). [Plot: signal to error ratio (dB) versus number of trials (0 to 1000) at a signal to added noise ratio of 25; curves for white, pink, brown, and SSD noise.]
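A direct reading of Eq. (11) can be sketched with NumPy as follows. Several points are assumptions of this sketch rather than statements from the paper: S(f) is taken as the squared magnitude of the signal's FFT, one independent random phase is drawn per frequency bin, and `lam` is applied as a plain scale factor because the excerpt does not specify how λ maps to a ratio in dB. The function name `ssd_noise` is hypothetical.

```python
import numpy as np

def ssd_noise(s, lam=1.0, rng=None):
    """lam * Re(IFFT(sqrt(S(f)) * exp(i*2*pi*phi))), with phi uniform on [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.abs(np.fft.fft(s)) ** 2                # power spectrum of the target signal
    phi = rng.uniform(0.0, 1.0, size=len(s))      # one random phase per frequency bin
    shaped = np.sqrt(S) * np.exp(1j * 2 * np.pi * phi)
    return lam * np.real(np.fft.ifft(shaped))

n = 256
t = np.arange(n)
s = np.sin(2 * np.pi * 10 * t / n)                # energy only at bins 10 and n - 10
noise = ssd_noise(s, lam=1.0, rng=np.random.default_rng(1))
P = np.abs(np.fft.fft(noise)) ** 2
in_band = (P[10] + P[n - 10]) / P.sum()
print(in_band)                                     # fraction of noise energy in the signal's bins
```

Because the magnitude of each bin is inherited from the signal, the noise carries essentially no energy where the signal has none, which is the property the text relies on to avoid wasted computation around the energy-free frequencies.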

IV. EXPERIMENTAL RESULTS

The physically meaningful decomposition result of the EMD is not the one obtained without noise; instead, it is taken to be the ensemble mean of a large number of trials on the noise-added signal [8]. According to the dependency between the ensemble number and the noise amplitude indicated in Eq. 10, if one of the parameters is fixed, the other is easy to compute. Even so, the proper settings for the ensemble number and the amplitude of the added white noise are still not well defined. In other words, there is no specific principle to guide the choice of the white noise amplitude [8], not to mention non-Gaussian noise addition. As a result, we have to try different noise levels to select a relatively suitable one. In this work, the ensemble number N is set to 1000 for a fair comparison of EEMD convergence rates.

Figure 7. Convergence rates for speech-like signal decomposition using different shaped noises (λ = 30). [Plot: signal to error ratio (dB) versus number of trials (0 to 1000) at a signal to added noise ratio of 30; curves for white, pink, brown, and SSD noise.]

In order to verify how the introduction of shaped noises may facilitate the speech-like signal decomposition, we consider a male voice series of length T extracted from the TIMIT database (see Fig. 2 for the spectrogram and Fig. 3 for its counterpart generated by EEMD). If the number of trials is N and every single trial contains M IMFs, the result of the k-th trial may be represented in matrix form:

X_k = …

To find out the degree of convergence at a specific trial k, we average the element-by-element products of the N-th IMFs and treat the result as the criterion:

R_s(N) = 1 …

A similar formula is also applied to the other X_k to calculate the average squared errors between the final and the k-th results:

R_e(k) = 1 …

Finally, the signal to error ratio may be expressed as:

$$ \mathrm{SER}(k) = 10 \log_{10} \left( \frac{R_s(N)}{R_e(k)} \right), \quad 1 \le k < N \tag{15} $$
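The signal to error ratio can be sketched from the verbal descriptions above. The exact normalizations of R_s and R_e are not recoverable from this excerpt, so the sketch assumes both are plain averages over all M × T matrix elements: R_s as the mean of the element-wise squares of the final result, and R_e as the mean squared difference between the final and the k-th results. It assumes NumPy, and the function name and toy matrices are hypothetical.

```python
import numpy as np

def signal_to_error_ratio(X_final, X_k):
    """SER in dB between the final (N-trial) result and the k-th result."""
    X_final = np.asarray(X_final, dtype=float)
    X_k = np.asarray(X_k, dtype=float)
    r_s = np.mean(X_final * X_final)        # assumed R_s: average element-by-element product
    r_e = np.mean((X_final - X_k) ** 2)     # assumed R_e: average squared error
    return 10.0 * np.log10(r_s / r_e)

X_final = np.ones((2, 4))                   # hypothetical M = 2 IMFs, T = 4 samples
X_k = 0.9 * np.ones((2, 4))                 # an intermediate result 10% below the final one
ser = signal_to_error_ratio(X_final, X_k)
print(ser)                                   # approximately 20 dB
```

Under these assumptions any common normalization factor cancels in the ratio, so the SER depends only on how far the k-th result is from the final one relative to the energy of the final result.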

With the inclusion of shaped noises in the signal decomposition, the sifting process was repeated until 10 IMFs were
