
Chapter 1 Introduction

1.2 Overview

The remainder of this thesis is organized as follows. In Chapter 2, the front-end techniques of the speech recognition system will be introduced, including the feature extraction methods utilized in this thesis, such as LPC, MFCC, and PLP. Chapter 3 will present the concept of the Hidden Markov Model and its training and recognition procedures. The experimental results and a comparison of the different features will then be shown in Chapter 4. The conclusions will be given in the last chapter.

Chapter 2

Front-End Techniques of Speech Recognition System

In modern speech recognition systems, the front-end techniques mainly include converting the analog signal to a digital form, extracting important signal characteristics such as energy or frequency response, and augmenting these characteristics with perceptual knowledge of human speech production and hearing. The purpose of front-end processing is to transform a speech waveform into a sequence of parameter blocks and to produce a compact and meaningful representation of the speech signal. Besides, front-end techniques can also remove redundancies from the speech and thereby reduce the computational complexity and storage required in the training and recognition steps; thus, recognition performance improves with effective front-end techniques.

Regardless of which kind of parameters is extracted later, there are four simple pre-processing steps, namely constant bias removing, pre-emphasis, frame blocking, and windowing, which are applied prior to performing feature extraction. These steps are stated in the following four sections. In addition, three common feature extraction methods, Linear Prediction Coding (LPC) [2], Mel Frequency Cepstral Coefficient (MFCC) [3], and Perceptual Linear Predictive (PLP) Analysis [4], are described in the last section of this chapter.

2.1 Constant Bias Removing

The speech waveform probably has a nonzero mean, denoted as the DC bias, due to the environment, the recording equipment, or the analog-to-digital conversion. In order to get better feature vectors, it is necessary to estimate the DC bias and then remove it. The DC bias value is estimated by

DC_{bias} = \frac{1}{N} \sum_{k=1}^{N} s(k)    (2-1)

where s(k) is the speech signal possessing N samples. The signal after removing the DC bias, denoted by s′(k), is then given by

s'(k) = s(k) - DC_{bias}, \quad 1 \le k \le N    (2-2)

After the process of constant bias removing, the pre-emphasis filter described in the next section is applied to the speech signal s′(k).
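As a minimal illustration of (2-1) and (2-2), the constant bias removal can be sketched in a few lines of NumPy; the function name and the synthetic test signal are illustrative only and not part of the system described in this thesis.

```python
import numpy as np

def remove_dc_bias(s: np.ndarray) -> np.ndarray:
    """Estimate the DC bias as in (2-1) and subtract it as in (2-2)."""
    dc_bias = np.sum(s) / len(s)      # (1/N) * sum of s(k), k = 1..N
    return s - dc_bias

# Example: a 1 kHz tone with an artificial DC offset of 0.1
fs = 16000
t = np.arange(fs) / fs
s = 0.5 * np.sin(2 * np.pi * 1000 * t) + 0.1
print(np.mean(remove_dc_bias(s)))     # approximately 0
```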

2.2 Pre-emphasis

The purpose of pre-emphasis is to eliminate the effect of the glottis during sound production and to compensate for the high-frequency part suppressed by the speech production system. Typically, pre-emphasis is implemented with a high-pass filter of the form

P(z) = 1 - \mu z^{-1}, \quad 0.9 \le \mu \le 1.0    (2-3)

which increases the relative energy of the high-frequency spectrum and introduces a zero at z = µ. In order to cancel a pole near z = 1 due to the glottal effect, the value of µ is usually greater than 0.9, and it is set to µ = 0.97 in this thesis. The pole and zero of the filter P(z) = 1 − 0.97z⁻¹ are at 0 and 0.97 respectively. Furthermore, the frequency responses of the pre-emphasis filter with µ = 0.9, 0.97, and 1 are given in Fig.2-1.

The filter is intended to boost the signal spectrum by approximately 20 dB per decade [5].
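A corresponding sketch of the pre-emphasis filter of (2-3) follows, assuming µ = 0.97 as used in this thesis; it is a plain first-order FIR difference, not a library call.

```python
import numpy as np

def pre_emphasis(s: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply P(z) = 1 - mu*z^(-1), i.e. y(k) = s(k) - mu*s(k-1)."""
    return np.append(s[0], s[1:] - mu * s[:-1])
```

The first sample is passed through unchanged since it has no predecessor.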

Fig.2-2 shows a comparison of the speech signal before and after pre-emphasis.

Fig.2-1 Frequency response of the pre-emphasis filter

Fig.2-2 Speech signal (a) before pre-emphasis and (b) after pre-emphasis

2.3 Frame Blocking

The objective of frame blocking is to decompose the speech signal into a series of overlapping frames. In general, the speech signal changes rapidly in the time domain; nevertheless, the spectrum changes slowly with time from the viewpoint of the frequency domain. Hence, it can be assumed that the spectrum of the speech signal is stationary over a short interval, and it is therefore more reasonable to perform spectrum analysis after blocking the speech signal into frames. Two parameters should be considered, namely the frame duration and the frame period, as shown in Fig.2-3.

I. Frame duration

The frame duration is the length of time (in seconds), usually ranging between 10 ms and 30 ms, over which a set of parameters is valid. If the sampling frequency of the waveform is 16 kHz and the frame duration is 25 ms, there are 16 kHz × 25 ms = 400 samples in one frame. It is noted that the total number of samples in a frame is called the frame size.

II. Frame period

As shown in Fig.2-3, the frame period is purposely chosen shorter than the frame duration so that the characteristics do not change too rapidly between two successive frames. In other words, successive frames overlap by a length equal to the difference between the frame duration and the frame period.

Fig.2-3 Frame blocking
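The frame blocking described above can be sketched as follows; the 25 ms duration and 10 ms period are common choices consistent with the ranges given in this section, and the helper name is illustrative.

```python
import numpy as np

def frame_blocking(s: np.ndarray, fs: int,
                   frame_duration: float = 0.025,               # 25 ms
                   frame_period: float = 0.010) -> np.ndarray:  # 10 ms
    """Decompose the signal into overlapping frames, shape (n_frames, frame_size)."""
    frame_size = int(fs * frame_duration)   # e.g. 400 samples at 16 kHz
    frame_step = int(fs * frame_period)     # start-to-start distance between frames
    n_frames = 1 + max(0, (len(s) - frame_size) // frame_step)
    return np.stack([s[i * frame_step : i * frame_step + frame_size]
                     for i in range(n_frames)])
```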

2.4 Windowing

After frame blocking, windowing is applied to each frame by multiplying it by a Hamming window, shown in Fig.2-4 for N = 64, to minimize spectral distortion and discontinuities. Let the Hamming window be given as

w(n) = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right), \quad 0 \le n \le N-1    (2-4)

where N is the window size, chosen the same as the frame size. The result of applying the window to the m-th frame s_m(n) is then obtained as

s_{mw}(n) = s_m(n)\, w(n), \quad 0 \le n \le N-1    (2-5)

Fig.2-5 shows an example of the time-domain signals and frequency responses of two successive frames, frame m and frame m+1, of the speech signal before and after multiplying by a Hamming window. From this figure, the spectrum of s_mw(n) is smoother than that of s_m(n). It is also noted that there is little variation between two consecutive frames in their frequency responses.
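A sketch of the windowing step of (2-4) and (2-5) follows; NumPy's np.hamming implements exactly the window formula above, so either form may be used.

```python
import numpy as np

def hamming_window(N: int) -> np.ndarray:
    """Hamming window of (2-4); identical to np.hamming(N)."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# (2-5): apply the window to every frame produced by frame blocking.
# frames has shape (n_frames, N); broadcasting multiplies sample-by-sample:
# windowed = frames * hamming_window(frames.shape[1])
```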

Fig.2-4 Hamming window (a) in the time domain and (b) its frequency response

Fig.2-5 Successive frames before and after windowing

2.5 Feature Extraction Methods

Feature extraction is the major part of the front-end processing for a speech recognition system. Its purpose is to convert the speech waveform into a series of feature vectors for further analysis and processing. Up to now, several feasible features have been developed and applied to speech recognition, such as Linear Prediction Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Perceptual Linear Predictive (PLP) Analysis. The following sections present these techniques.

2.5.1 Linear Prediction Coding (LPC)

Over the past years, Linear Prediction Coding (LPC), also known as auto-regressive (AR) modeling, has been regarded as one of the most effective techniques for speech analysis. The basic principle of LPC is that the vocal tract transfer function can be modeled by an all-pole filter as

H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}    (2-6)

where S(z) is the speech signal, U(z) is the normalized excitation, G is the gain of the excitation, and p is the number of poles (or the order of the LPC). As for the coefficients {a1, a2, …, ap}, they are determined by the vocal tract characteristics of the sound being produced. It is noted that the vocal tract is a non-uniform acoustic tube which extends from the glottis to the lips and varies in shape as a function of time. Supposing that the characteristics of the vocal tract change slowly with time, the {ak} are assumed to be constant over a short time. The speech signal s(n) can then be viewed as the output of the all-pole filter H(z) excited by an acoustic source, either an impulse train with period P for voiced sounds or random noise with a flat spectrum for unvoiced sounds, as shown in Fig.2-6.

From (2-6), the relation between the speech signal s(n) and the scaled excitation Gu(n) can be rewritten as

s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n)    (2-7)

which shows that s(n) is a linear combination of the past p speech samples. In general, the prediction value of the speech signal s(n) is defined as

\tilde{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (2-8)

and then the prediction error e(n) can be found as

e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)    (2-9)

which is clearly equal to the scaled excitation Gu(n) from (2-7). In other words, the prediction error reflects the effect caused by the scaled excitation Gu(n).

The main use of LPC is to determine the coefficients {a1, a2, …, ap} that minimize the square of the prediction error. From (2-9), the mean-square error, called the short-term prediction error, is defined as

E_n = \sum_m e_n^2(m) = \sum_m \left[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right]^2    (2-10)

Fig.2-6 Speech production model based on the LPC model

where N is the number of samples in a frame. It is commented that the short-term prediction error is equal to G² and the notation s_n(m) is defined as

s_n(m) = \begin{cases} s(n+m)\, w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases}    (2-11)

which means s_n(m) is zero outside the window w(m). It can be seen that in the range m = 0 to m = p − 1, or in the range m = N to m = N − 1 + p, the windowed signal s_n(m) is predicted as ŝ_n(m) from the previous p samples, some of which are equal to zero since s_n(m) is zero for m < 0 or m > N − 1. Therefore, the prediction error e_n(m) is sometimes large at the beginning (m = 0 to m = p − 1) or the end (m = N to m = N − 1 + p) of the section (m = 0 to m = N − 1 + p).

The minimum of the prediction error can be obtained by differentiating E_n with respect to each a_k and setting the result to zero, that is,

\frac{\partial E_n}{\partial a_i} = 0, \quad i = 1, 2, \ldots, p

which yields

\sum_{k=1}^{p} a_k \sum_m s_n(m-i)\, s_n(m-k) = \sum_m s_n(m-i)\, s_n(m), \quad 1 \le i \le p    (2-16)

where the terms \sum_m s_n(m-i)\, s_n(m) and \sum_m s_n(m-i)\, s_n(m-k) can be expressed by the autocorrelation functions r_n(i) and r_n(i-k) respectively. The autocorrelation function is defined as

r_n(i) = \sum_{m=0}^{N-1-i} s_n(m)\, s_n(m+i)

where r_n(i-k) is equal to r_n(k-i), since the autocorrelation function is even. Hence, it is equivalent to use r_n(|i-k|) as the replacement in (2-16). Rewriting (2-16) with the autocorrelation functions r_n(i) and r_n(|i-k|), we obtain

\sum_{k=1}^{p} a_k\, r_n(|i-k|) = r_n(i), \quad 1 \le i \le p

whose matrix form is expressed as

\begin{bmatrix} r_n(0) & r_n(1) & \cdots & r_n(p-1) \\ r_n(1) & r_n(0) & \cdots & r_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_n(p-1) & r_n(p-2) & \cdots & r_n(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_n(1) \\ r_n(2) \\ \vdots \\ r_n(p) \end{bmatrix}

which is in the form of Rx = r, where R is a Toeplitz matrix, that is, a matrix with constant entries along each diagonal.

The Levinson-Durbin recursion is an efficient algorithm for dealing with this kind of equation, in which the matrix R is a Toeplitz matrix and, furthermore, symmetric. Hence the Levinson-Durbin recursion is employed to solve the matrix equation above, and the recursion can be divided into three steps, as follows.

Step 1. Initialization

E^{(0)} = r_n(0)

Step 2. Recursion: for i = 1, 2, …, p {

k(i) = \frac{ r_n(i) - \sum_{j=1}^{i-1} a(j, i-1)\, r_n(i-j) }{ E^{(i-1)} }

a(i, i) = k(i)

a(j, i) = a(j, i-1) - k(i)\, a(i-j, i-1), \quad 1 \le j \le i-1

E^{(i)} = \left( 1 - k(i)^2 \right) E^{(i-1)}

}

Step 3. Final solution: for j = 1 to p

a_j = a(j, p)

where the a_j for j = 1, 2, …, p are the resulting LPC coefficients, and the coefficients k(i) are called the reflection coefficients (also known as PARCOR coefficients), whose values are bounded between −1 and 1. In general, r_n(i) is replaced by a normalized form as

r_{n\_normalized}(i) = \frac{r_n(i)}{r_n(0)}    (2-18)

which will result in identical LPC coefficients, but the recursion will be more robust to problems of arithmetic precision.
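The recursion above translates directly into code. The following is a minimal sketch, assuming the normalized autocorrelation of (2-18) as input; array indices start at 0 in Python, so a[j] stores a(j, i) of the current iteration, and the function names are illustrative.

```python
import numpy as np

def autocorrelation(sn: np.ndarray, p: int) -> np.ndarray:
    """r_n(i) for i = 0..p, normalized by r_n(0) as in (2-18)."""
    N = len(sn)
    r = np.array([np.dot(sn[:N - i], sn[i:]) for i in range(p + 1)])
    return r / r[0]

def levinson_durbin(r: np.ndarray, p: int):
    """Solve the Toeplitz normal equations for the LPC coefficients a_1..a_p."""
    a = np.zeros(p + 1)                   # a[j] holds a(j, i) at iteration i
    E = r[0]                              # Step 1: E(0) = r_n(0)
    for i in range(1, p + 1):             # Step 2: recursion
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E   # reflection coefficient k(i)
        a_prev = a.copy()
        a[i] = k                          # a(i, i) = k(i)
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]    # a(j,i) = a(j,i-1) - k(i) a(i-j,i-1)
        E *= (1.0 - k * k)                # E(i) = (1 - k(i)^2) E(i-1)
    return a[1:], E                       # Step 3: a_j = a(j, p); E is the residual error
```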

Another problem of LPC is deciding the order p. As p increases, more detailed properties of the speech spectrum are preserved and the prediction error becomes relatively lower; however, it should be noticed that when p grows beyond some value, irrelevant details become involved. Therefore, a guideline for choosing the order p is given as

p = \begin{cases} F_s + 4, & \text{unvoiced} \\ F_s + 5, & \text{voiced} \end{cases}    (2-19)

where F_s is the sampling frequency of the speech in kHz [6]. For example, if the speech signal is sampled at 8 kHz, then the order p can be chosen as 12 or 13. Another rule of thumb is to use one complex pole pair per kHz plus 2-4 poles [7]; hence p is often chosen as 10 for a sampling frequency of 8 kHz.

Historically, LPC was the first feature used directly in the feature extraction process of automatic speech recognition systems. LPC is widely used because it is fast and simple, and the feature vectors can be computed efficiently with the Levinson-Durbin recursion. It is noted that unvoiced speech has a higher prediction error than voiced speech, since the LPC model is more accurate for voiced speech. However, LPC analysis approximates the power distribution equally well at all frequencies of the analysis band, which is inconsistent with human hearing, because the spectral resolution of hearing decreases with frequency beyond 800 Hz and hearing is also more sensitive in the middle frequency range of the audible spectrum [11].

In order to make the LPC more robust, cepstral processing, which is a kind of homomorphic transformation, is employed to separate the source e(n) from the all-pole filter h(n). It is commented that the homomorphic transformation x̂(n) = D(x(n)) is a transformation that converts a convolution

x(n) = e(n) \ast h(n)    (2-20)

into a sum

\hat{x}(n) = \hat{e}(n) + \hat{h}(n)    (2-21)

which is usually used for processing signals that have been combined by convolution.

It is assumed that a value N can be found such that the cepstrum of the filter satisfies ĥ(n) ≈ 0 for n ≥ N and the cepstrum of the excitation satisfies ê(n) ≈ 0 for n < N. The lifter ("l-i-f-ter" reverses "f-i-l-ter") l(n) is used for approximately recovering ê(n) and ĥ(n) from x̂(n). Fig.2-7 shows how to recover h(n) with l(n) given by

l(n) = \begin{cases} 1, & n < N \\ 0, & n \ge N \end{cases}    (2-22)

where the operator D is usually implemented with logarithmic arithmetic and D⁻¹ with the inverse Z-transform. In a similar way, a complementary lifter given by

l(n) = \begin{cases} 0, & n < N \\ 1, & n \ge N \end{cases}    (2-23)

is utilized for recovering the signal e(n) from x(n).
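A minimal sketch of the homomorphic filtering of Fig.2-7 follows. For simplicity it uses the real cepstrum (logarithm of the magnitude spectrum) rather than the complex cepstrum, so the phase is discarded; the cutoff N_lifter and the function name are illustrative.

```python
import numpy as np

def homomorphic_separation(x: np.ndarray, N_lifter: int):
    """Split a frame into low-quefrency (envelope h) and high-quefrency
    (excitation e) cepstral parts using the lifters of (2-22) and (2-23)."""
    X = np.fft.fft(x)
    x_hat = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real  # real cepstrum, D[x]
    l = (np.arange(len(x)) < N_lifter).astype(float)     # lifter of (2-22)
    h_hat = x_hat * l            # keeps n < N: smooth vocal-tract part
    e_hat = x_hat * (1.0 - l)    # keeps n >= N: excitation part, as in (2-23)
    return h_hat, e_hat
```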

In general, the complex cepstrum can be obtained directly from the LPC coefficients by the recursion

\hat{h}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, \hat{h}(k)\, a_{n-k}, \quad 1 \le n \le p

\hat{h}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n}\, \hat{h}(k)\, a_{n-k}, \quad n > p

where ĥ(n) gives the desired LPC-derived cepstrum coefficients c(n). It is noted that, while there is a finite number of LPC coefficients, the number of cepstrum coefficients is infinite. Empirically, a number of cepstrum coefficients approximately equal to 1.5p is sufficient.
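The LPC-to-cepstrum recursion above can be sketched as follows, assuming the sign convention A(z) = 1 − Σ a_k z⁻ᵏ of (2-6); the default of 1.5p output coefficients follows the empirical guideline just mentioned.

```python
import numpy as np

def lpc_to_cepstrum(a: np.ndarray, n_ceps: int = None) -> np.ndarray:
    """Cepstrum from LPC coefficients a_1..a_p:
    c(n) = a_n + sum_{k=1}^{n-1} (k/n) c(k) a_(n-k), with a_n = 0 for n > p."""
    p = len(a)
    if n_ceps is None:
        n_ceps = int(1.5 * p)            # empirical number of coefficients
    c = np.zeros(n_ceps + 1)             # c[0] unused in this sketch
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```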

Fig.2-7 Homomorphic filtering

2.5.2 Mel-Frequency Cepstral Coefficients (MFCC)

Mel-Frequency Cepstral Coefficients (MFCC) are the most widely used features in state-of-the-art speech recognition systems. The conception of MFCC is to use a nonlinear frequency scale that approximates the behavior of the auditory system. The scheme of MFCC processing is shown in Fig.2-8, and each step will be described below.

After the pre-processing steps discussed above (constant bias removing, pre-emphasis, frame blocking, and windowing) are applied to the speech signal, the Discrete Fourier Transform (DFT) is performed to obtain the spectrum, where the DFT is expressed as

S_t(i) = \sum_{k=0}^{N-1} s_w(k)\, e^{-j 2\pi i k / N}, \quad 0 \le i \le N-1

where N is the size of the DFT, chosen the same as the window size. The Fast Fourier Transform (FFT) is often adopted as a substitute for the DFT for more efficient computation. The Mel filter banks will be defined after a short introduction to the Mel scale.

Fig.2-8 Scheme of obtaining Mel-Frequency Cepstral Coefficients

The Mel scale, obtained by Stevens and Volkman [8][9], is a perceptual scale motivated by the nonlinear properties of human hearing; it attempts to mimic the human ear in terms of the manner in which frequencies are sensed and resolved. In the experiment, the reference frequency was selected as 1 kHz and equated with 1000 mels, where a mel is defined as a psychoacoustic unit measuring the perceived pitch of a tone [10]. The subjects were asked to change the frequency until the pitch they perceived was twice the reference, 10 times, half, 1/10, and so on. For instance, if the pitch perceived at an actual frequency of 3.5 kHz is twice the reference, that frequency is mapped to twice 1000 mels, that is, 2000 mels. The formulation of the Mel scale is approximated by

B(f) = 2595 \log_{10}\!\left( 1 + \frac{f}{700} \right)    (2-26)

where B(f) is the function mapping the actual frequency to the Mel frequency, shown in Fig.2-9; the Mel scale is almost linear below 1 kHz and logarithmic above. The Mel filter bank is then designed by placing M triangular filters non-uniformly along the frequency axis to simulate the band-pass filtering of human ears, and the m-th triangular filter is expressed as

H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}    (2-27)

where the boundary points f(m), m = 0, 1, …, M+1, in the above equation can be calculated by

f(m) = \left( \frac{N}{F_s} \right) B^{-1}\!\left( B(f_l) + m\, \frac{B(f_h) - B(f_l)}{M+1} \right)    (2-28)

where N is the size of the DFT, f_l and f_h are the lowest and highest frequencies of the filter bank, F_s is the sampling frequency of the speech signal, and the function B(f) maps the actual frequency to the Mel frequency as given in (2-26). The function B⁻¹(b) is the inverse of B(f), given by

B^{-1}(b) = 700\left( 10^{\,b/2595} - 1 \right)    (2-29)

where b is the Mel frequency. It is noted that the boundary points f(m) are uniformly spaced in the Mel scale. By replacing B and B⁻¹ in (2-28) with (2-26) and (2-29), the equation can be rewritten as

f(m) = \left( \frac{N}{F_s} \right) \cdot 700 \left( 10^{\left( B(f_l) + m \frac{B(f_h) - B(f_l)}{M+1} \right) / 2595} - 1 \right)    (2-30)

which can be used directly in programming. In general, M is equal to 20 for speech sampled at 8 kHz and 24 for 16 kHz. The Mel filter banks for 8 kHz (M = 20) and 16 kHz (M = 24) are shown in Fig.2-10(a) and Fig.2-10(b) respectively. The region of the spectrum below 1 kHz is covered by more filters, since this region contains more information about the vocal tract, such as the first formant. The nonlinear filter bank is employed to achieve both frequency and time resolution: the narrow band-pass filters at low frequencies enable harmonics to be detected, while the broader band-pass filters at high frequencies allow higher temporal resolution of bursts.
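For instance, the boundary points of (2-30) can be computed as in the sketch below; the rounding to the nearest FFT bin is an implementation detail not fixed by the equations above, and the function names are illustrative.

```python
import numpy as np

def mel(f):                      # B(f) of (2-26)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(b):                  # B^-1(b) of (2-29)
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)

def filter_boundaries(fs, n_fft, M, f_low=0.0, f_high=None):
    """Boundary points f(m), m = 0..M+1, uniformly spaced on the Mel scale (2-28)."""
    if f_high is None:
        f_high = fs / 2.0                       # Nyquist frequency
    mel_points = np.linspace(mel(f_low), mel(f_high), M + 2)
    return np.floor((n_fft / fs) * mel_inv(mel_points)).astype(int)

# Example: 20 triangular filters for 8 kHz speech with a 256-point FFT
print(filter_boundaries(8000, 256, 20))
```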

The Mel spectrum is derived by multiplying each FFT magnitude coefficient by the corresponding filter gain as

X_t(k) = |S_t(k)|\, H_m(k), \quad 0 \le k \le N-1    (2-31)

and the results are accumulated and the logarithm is taken as

Y_t(m) = \ln\!\left( \sum_{k=0}^{N-1} |S_t(k)|\, H_m(k) \right), \quad m = 1, 2, \ldots, M    (2-32)

which is robust to noise and spectral estimation errors. The reason for using the magnitude of S_t(k) is that phase information is of little use in speech recognition. The logarithm is utilized to reduce the component amplitudes at every frequency and to perform a dynamic compression, making the feature extraction less sensitive to variations in dynamics, where the dynamics means the magnitude of the sound. Besides, the logarithm helps separate the excitation from the filter that represents the vocal tract.

Since the log-magnitude spectrum Y_t(m) is real and symmetric, the inverse Discrete Fourier Transform (IDFT) reduces to the Discrete Cosine Transform (DCT), which is applied to derive the Mel Frequency Cepstral Coefficients c_t(i) as

c_t(i) = \sum_{m=1}^{M} Y_t(m) \cos\!\left( \frac{\pi i (m - 0.5)}{M} \right), \quad i = 1, 2, \ldots, L    (2-33)

where L is the number of desired cepstrum coefficients and L ≤ M. It is noted that the cepstrum is defined in the quefrency domain. The DCT successfully separates the excitation and the vocal tract: the low quefrencies (lower-order cepstral coefficients) represent the slowly changing envelope of the vocal tract, while the high quefrencies (higher-order cepstral coefficients) represent the periodic excitation. In general, 12 MFCCs (L = 12) and an energy term are adopted, where the energy term is computed as the log of the frame energy,

e_t = \log \sum_{k=1}^{N} s^2(k)    (2-34)

These coefficients are often referred to as the absolute MFCCs. The first- and second-order derivatives of these absolute coefficients are then given by

\Delta c_t(i) = \frac{ \sum_{p=1}^{P} p \left( c_{t+p}(i) - c_{t-p}(i) \right) }{ 2 \sum_{p=1}^{P} p^2 }    (2-35)

\Delta^2 c_t(i) = \frac{ \sum_{p=1}^{P} p \left( \Delta c_{t+p}(i) - \Delta c_{t-p}(i) \right) }{ 2 \sum_{p=1}^{P} p^2 }    (2-36)

which are useful to cancel the channel effect of the speech. In addition, the derivative operation captures the dynamic evolution of the speech signal, that is, the temporal information of the feature vector c_t(i). If the value of P is too small, the dynamic evolution may not be caught; if the value of P is too large, the derivatives have less meaning since the two frames may describe different acoustic phenomena. In practice, the dimension of the MFCC feature vector is often chosen as 39, including 12 MFCCs ({c(i)}, i = 1, 2, …, 12), the energy term (e_t), their first-order derivatives (∆{c(i)}, ∆e_t), and their second-order derivatives (∆²{c(i)}, ∆²e_t).
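A sketch of the final MFCC steps follows: the DCT of (2-33) applied to the log Mel spectrum, and the regression-based derivatives of (2-35) and (2-36). The edge-padding of the first and last frames is one common convention, not the only one, and the function names are illustrative.

```python
import numpy as np

def mfcc_from_log_mel(Y: np.ndarray, L: int = 12) -> np.ndarray:
    """DCT of (2-33): c(i) = sum_{m=1..M} Y(m) cos(pi*i*(m-0.5)/M), i = 1..L."""
    M = len(Y)
    m = np.arange(1, M + 1)
    return np.array([np.sum(Y * np.cos(np.pi * i * (m - 0.5) / M))
                     for i in range(1, L + 1)])

def deltas(C: np.ndarray, P: int = 2) -> np.ndarray:
    """First-order derivatives of (2-35); apply twice for the second order (2-36).
    C has shape (T, dim): one feature vector per frame."""
    T = len(C)
    padded = np.pad(C, ((P, P), (0, 0)), mode='edge')    # repeat edge frames
    denom = 2.0 * sum(p * p for p in range(1, P + 1))
    return np.array([sum(p * (padded[t + P + p] - padded[t + P - p])
                         for p in range(1, P + 1)) / denom
                     for t in range(T)])
```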

Fig.2-9 Frequency warping according to the Mel scale: (a) linear frequency scale and (b) logarithmic frequency scale

Fig.2-10 The Mel filter banks for (a) Fs = 8 kHz and (b) Fs = 16 kHz

2.5.3 Perceptual Linear Predictive (PLP) Analysis

The Perceptual Linear Predictive (PLP) analysis was first presented and examined by Hermansky in 1990 [4] for analyzing speech. This technique combines several engineering approximations of the psychophysics of human hearing processes, including critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law. As a result, PLP analysis is more consistent with human hearing. In addition, PLP analysis is beneficial for speaker-independent speech recognition due to its computational efficiency and its low-dimensional representation of speech. The block diagram of the PLP method is shown in Fig.2-11, and each step will be described below [12].

Step I. Spectral analysis

The Fast Fourier Transform (FFT) is first applied to transform the windowed speech segment (s_w(k), for k = 1, 2, …, N) into the frequency domain. The short-term power spectrum is expressed as

P(\omega) = \left[ \operatorname{Re}\!\left( S_t(\omega) \right) \right]^2 + \left[ \operatorname{Im}\!\left( S_t(\omega) \right) \right]^2    (2-37)

where the real and imaginary components of the short-term speech spectrum are squared and added. An example in Fig.2-12 shows a short-term speech signal and its power spectrum P(ω).

Fig.2-11 Scheme of obtaining Perceptual Linear Predictive coefficients (pre-processing, FFT, critical-band analysis, equal-loudness pre-emphasis, intensity-loudness conversion, IDFT, and autoregressive modeling of the all-pole model)

Step II. Critical-band analysis

The power spectrum P(ω) is then warped along the frequency axis ω into the Bark scale frequency Ω as

\Omega(\omega) = 6 \ln\!\left\{ \frac{\omega}{1200\pi} + \left[ \left( \frac{\omega}{1200\pi} \right)^2 + 1 \right]^{0.5} \right\}    (2-38)

where ω is the angular frequency in rad/sec; the mapping is shown in Fig.2-13. The resulting power spectrum P(Ω) is then convolved with the simulated critical-band masking curve Ψ(Ω) to obtain the critical-band power spectrum Θ(Ω_i) as

\Theta(\Omega_i) = \sum_{\Omega = -1.3}^{2.5} P(\Omega + \Omega_i)\, \Psi(\Omega), \quad i = 1, 2, \ldots, M    (2-39)

where M is the number of Bark filter banks and the critical-band masking curve Ψ(Ω), shown in Fig.2-14, is given by

Fig.2-12 Short-term speech signal (a) in the time domain and (b) its power spectrum

\Psi(\Omega) = \begin{cases} 0, & \Omega < -1.3 \\ 10^{\,2.5(\Omega + 0.5)}, & -1.3 \le \Omega \le -0.5 \\ 1, & -0.5 < \Omega < 0.5 \\ 10^{\,-1.0(\Omega - 0.5)}, & 0.5 \le \Omega \le 2.5 \\ 0, & \Omega > 2.5 \end{cases}    (2-40)

where Ω is the Bark frequency just mentioned in (2-38). This step is similar to the Mel filter bank processing of MFCC, with the Mel filter banks replaced by the critical-band masking curves.
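A sketch of the critical-band analysis of (2-38) to (2-40) follows; sampling the masking curve on a 0.1-Bark grid is an illustrative choice, not prescribed by the equations, and the function names are assumptions of this sketch.

```python
import numpy as np

def bark(omega):
    """Bark warping of (2-38); omega is angular frequency in rad/s."""
    x = omega / (1200.0 * np.pi)
    return 6.0 * np.log(x + np.sqrt(x * x + 1.0))

def masking_curve(Om):
    """Piecewise critical-band masking curve Psi(Omega) of (2-40)."""
    Om = np.asarray(Om, dtype=float)
    psi = np.zeros_like(Om)
    lo = (Om >= -1.3) & (Om <= -0.5)
    mid = (Om > -0.5) & (Om < 0.5)
    hi = (Om >= 0.5) & (Om <= 2.5)
    psi[lo] = 10.0 ** (2.5 * (Om[lo] + 0.5))
    psi[mid] = 1.0
    psi[hi] = 10.0 ** (-1.0 * (Om[hi] - 0.5))
    return psi

# Example: evaluate Psi on a 0.1-Bark grid covering its support
grid = np.arange(-1.3, 2.5 + 1e-9, 0.1)
print(masking_curve(grid).round(3))
```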
