
Chapter 2 Front-End Techniques of Speech Recognition System

2.5 Feature Extraction Methods

2.5.1 Linear Prediction Coding (LPC)

Over the past decades, Linear Prediction Coding (LPC), also known as auto-regressive (AR) modeling, has been regarded as one of the most effective techniques for speech analysis. The basic principle of LPC states that the vocal tract transfer function can be modeled by an all-pole filter as

\[ H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \]    (2-6)

where S(z) is the speech signal, U(z) is the normalized excitation, G is the gain of the excitation, and p is the number of poles (i.e., the order of the LPC model). The coefficients {a1, a2, …, ap} are determined by the vocal tract characteristics of the sound being produced. It is noted that the vocal tract is a non-uniform acoustic tube extending from the glottis to the lips, whose shape varies as a function of time. Assuming that the vocal tract characteristics change slowly with time, the {ak} can be treated as constant over a short interval. The speech signal s(n) can then be viewed as the output of the all-pole filter H(z), which is excited by an acoustic source, either an impulse train with period P for voiced sounds or random noise with a flat spectrum for unvoiced sounds,

as shown in Fig. 2-6, where the glottal excitation U(z) (periodic impulses with period P for voiced sounds, or random noise for unvoiced sounds) is scaled by the gain G and passed through the vocal tract model H(z) = 1/A(z) to produce the speech S(z).
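The source–filter picture above can be sketched numerically. The following is a minimal illustration, where the pitch period, gain, and filter coefficients are arbitrary choices for demonstration, not values from the text:

```python
import numpy as np

P = 80                    # pitch period in samples (assumed; ~100 Hz at 8 kHz)
n_samples = 1600

u_voiced = np.zeros(n_samples)
u_voiced[::P] = 1.0       # periodic impulse train (voiced excitation)
u_unvoiced = np.random.default_rng(0).standard_normal(n_samples)  # flat-spectrum noise

G = 1.0
a = np.array([1.2, -0.8])  # illustrative {a_k}; chosen so poles lie inside the unit circle

def all_pole(u, a, G):
    """s(n) = G u(n) + sum_k a_k s(n-k), i.e. S(z) = H(z) G U(z) with H(z) = 1/A(z)."""
    p = len(a)
    s = np.zeros(len(u))
    for n in range(len(u)):
        hist = sum(a[k] * s[n - 1 - k] for k in range(min(p, n)))
        s[n] = G * u[n] + hist
    return s

s_voiced = all_pole(u_voiced, a, G)      # synthetic "voiced" output
s_unvoiced = all_pole(u_unvoiced, a, G)  # synthetic "unvoiced" output
```

The loop implements the difference equation directly; in practice a library IIR filter routine would be used instead.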

From (2-6), the relation between the speech signal s(n) and the scaled excitation Gu(n) can be rewritten in the time domain as

\[ s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n) \]    (2-7)

which shows that s(n) is a linear combination of the past p speech samples plus the scaled excitation. In general, the predicted value of the speech signal s(n) is defined as

\[ \tilde{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k) \]    (2-8)

and then the prediction error e(n) could be found as

\[ e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k) \]    (2-9)

which is clearly equal to the scaled excitation Gu(n) from (2-7). In other words, the prediction error reflects the effect caused by the scaled excitation Gu(n).

The main task of LPC analysis is to determine the coefficients {a1, a2, …, ap} that minimize the squared prediction error. From (2-9), the mean-square error, called the short-term prediction error, is defined as

\[ E_n = \sum_{m} e_n(m)^2 = \sum_{m} \left( s_n(m) - \sum_{k=1}^{p} a_k\, s_n(m-k) \right)^2 \]

where N is the number of samples in a frame.

Fig. 2-6 Speech production model estimated based on the LPC model

It is commented that the short-term prediction error is equal to G² and the notation s_n(m) is defined as

\[ s_n(m) = \begin{cases} s(n+m)\,w(m), & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases} \]

which means s_n(m) is zero outside the window w(m). Note that in the ranges m = 0 to m = p − 1 and m = N to m = N − 1 + p, the windowed samples s_n(m) are predicted from the previous p samples, some of which are zero since s_n(m) = 0 for m < 0 or m > N − 1. Therefore, the prediction error e_n(m) is sometimes large at the beginning (m = 0 to m = p − 1) or the end (m = N to m = N − 1 + p) of the section (m = 0 to m = N − 1 + p).
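The zero-extension of the windowed frame, and the resulting error terms at the frame edges, can be illustrated as follows. The window choice, frame position, and placeholder coefficients {a_k} are assumptions made only for this sketch:

```python
import numpy as np

N, p = 240, 10                          # frame length and LPC order
rng = np.random.default_rng(1)
s = rng.standard_normal(1000)           # stand-in "speech" signal

n0 = 400                                # frame start index n
w = np.hamming(N)                       # window w(m) (Hamming assumed)
s_n = s[n0:n0 + N] * w                  # s_n(m) = s(n + m) w(m)

# s_n(m) is zero outside 0 <= m <= N-1: pad with p zeros on both sides
s_ext = np.concatenate((np.zeros(p), s_n, np.zeros(p)))

a = np.full(p, 0.05)                    # placeholder coefficients, illustration only
# e_n(m) = s_n(m) - sum_{k=1..p} a_k s_n(m-k), for m = 0 .. N-1+p
e = np.array([s_ext[p + m] - a @ s_ext[m:p + m][::-1]
              for m in range(N + p)])
E_n = np.sum(e ** 2)                    # short-term prediction error
```

The first p and last p terms of e use partially zero-valued "history", which is why the error tends to be larger at the edges of the section.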

The minimum of the prediction error can be obtained by differentiating E_n with respect to each a_k and setting the result to zero,

\[ \frac{\partial E_n}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p, \]

which yields

\[ \sum_{m} s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} a_k \sum_{m} s_n(m-i)\, s_n(m-k), \qquad 1 \le i \le p, \]    (2-16)

where the left- and right-hand sides can be expressed by the autocorrelation terms r_n(i) and r_n(i−k), respectively. The autocorrelation function is defined as

\[ r_n(i) = \sum_{m=0}^{N-1-i} s_n(m)\, s_n(m+i), \]

where r_n(i−k) is equal to r_n(k−i) because the autocorrelation function is even. Hence, it is equivalent to use r_n(|i−k|) in (2-16). Replacing the terms in (2-16) with the autocorrelation values r_n(i) and r_n(|i−k|), we obtain

\[ \sum_{k=1}^{p} a_k\, r_n(|i-k|) = r_n(i), \qquad 1 \le i \le p, \]

whose matrix form is

\[ \begin{pmatrix} r_n(0) & r_n(1) & \cdots & r_n(p-1) \\ r_n(1) & r_n(0) & \cdots & r_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_n(p-1) & r_n(p-2) & \cdots & r_n(0) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = \begin{pmatrix} r_n(1) \\ r_n(2) \\ \vdots \\ r_n(p) \end{pmatrix} \]

which is in the form Ra = r, where R is a Toeplitz matrix, i.e., a matrix with constant entries along each diagonal.
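Under these definitions, the autocorrelation values and the Toeplitz system Ra = r can be formed and solved directly. A small numpy sketch follows; the AR(2) test signal and its coefficient values are invented purely for the demonstration:

```python
import numpy as np

def autocorr(s_n, p):
    """r_n(i) = sum_m s_n(m) s_n(m+i), for i = 0 .. p."""
    N = len(s_n)
    return np.array([np.dot(s_n[:N - i], s_n[i:]) for i in range(p + 1)])

def lpc_autocorrelation(s_n, p):
    """Solve R a = r with R[i,k] = r_n(|i-k|) (Toeplitz) and r[i] = r_n(i+1)."""
    r = autocorr(s_n, p)
    R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Sanity check on a synthetic AR(2) signal with known coefficients
rng = np.random.default_rng(0)
x = np.zeros(4000)
for m in range(2, len(x)):
    x[m] = 1.2 * x[m - 1] - 0.8 * x[m - 2] + rng.standard_normal()
a_hat = lpc_autocorrelation(x, p=2)   # should be close to [1.2, -0.8]
```

Direct matrix solution costs O(p³); the Levinson–Durbin recursion described next exploits the Toeplitz structure to reduce this.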

The Levinson-Durbin recursion is an efficient algorithm for this kind of equation, in which the matrix R is Toeplitz and, moreover, symmetric. Hence the Levinson-Durbin recursion is employed to solve the matrix equation above, and the recursion can be divided into three steps, as

Step 1. Initialization

\[ E^{(0)} = r_n(0) \]

Step 2. Recursion, for i = 1, 2, …, p

\[ k(i) = \frac{1}{E^{(i-1)}} \left( r_n(i) - \sum_{j=1}^{i-1} a(j, i-1)\, r_n(i-j) \right) \]

\[ a(i, i) = k(i) \]

\[ a(j, i) = a(j, i-1) - k(i)\, a(i-j, i-1), \qquad 1 \le j \le i-1 \]

\[ E^{(i)} = \left( 1 - k(i)^2 \right) E^{(i-1)} \]

Step 3. Final Solution, for j = 1 to p

\[ a_j = a(j, p) \]

where the a_j, j = 1, 2, …, p, are the resulting LPC coefficients, and the coefficients k(i) are called reflection (PARCOR) coefficients, whose values are bounded between −1 and 1. In general, r_n(i) is replaced by a normalized form,

\[ r_{n\_normalized}(i) = \frac{r_n(i)}{r_n(0)} \]    (2-18)

which results in identical LPC coefficients but makes the recursion more robust to problems with arithmetic precision.
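The three steps above translate almost line-for-line into code. A minimal sketch in numpy, followed by a sanity check that normalizing r_n by r_n(0) leaves the solution unchanged:

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion for sum_k a_k r(|i-k|) = r(i).
    Returns LPC coefficients a_1..a_p, reflection coefficients k(1)..k(p),
    and the final prediction error E^(p)."""
    a = np.zeros(p + 1)          # a[j] holds a(j, i); index 0 unused
    k = np.zeros(p + 1)
    E = r[0]                     # Step 1: E^(0) = r_n(0)
    for i in range(1, p + 1):    # Step 2: recursion
        k[i] = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E
        a_prev = a.copy()
        a[i] = k[i]                                  # a(i, i) = k(i)
        for j in range(1, i):
            a[j] = a_prev[j] - k[i] * a_prev[i - j]  # a(j,i) = a(j,i-1) - k(i) a(i-j,i-1)
        E = (1.0 - k[i] ** 2) * E                    # E^(i) = (1 - k(i)^2) E^(i-1)
    return a[1:], k[1:], E       # Step 3: a_j = a(j, p)

# Autocorrelation of a small toy sequence, then solve for p = 3
s = np.arange(1.0, 9.0)
r = np.array([np.dot(s[:len(s) - i], s[i:]) for i in range(4)])
a1, k1, _ = levinson_durbin(r, 3)
a2, k2, _ = levinson_durbin(r / r[0], 3)   # normalized form, same coefficients
```

Each iteration costs O(i), so the whole recursion is O(p²) instead of the O(p³) of a general linear solver.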

Another problem of LPC is deciding the order p. As p increases, more detailed properties of the speech spectrum are preserved and the prediction error decreases accordingly, but it should be noted that beyond some value of p, irrelevant spectral details begin to be modeled. Therefore, the guideline for choosing the order p is given as

\[ p = \begin{cases} F_s + (4 \text{ or } 5), & \text{voiced} \\ F_s, & \text{unvoiced} \end{cases} \]    (2-19)

where Fs is the sampling frequency of the speech in kHz [6]. For example, if the speech signal is sampled at 8 kHz, the order p can be chosen as 8–13. Another rule of thumb is to use one complex pole per kHz plus 2–4 poles [7]; hence p is often chosen as 10 for a sampling frequency of 8 kHz.

Historically, LPC was first used directly in the feature extraction process of automatic speech recognition systems. LPC is widely used because it is fast and simple, and it is efficient to compute the feature vectors via the Levinson-Durbin recursion. It is noted that unvoiced speech yields a higher prediction error than voiced speech, since the LPC model is more accurate for voiced speech. However, LPC analysis approximates the power distribution equally well at all frequencies of the analysis band, which is inconsistent with human hearing: the spectral resolution of the ear decreases with frequency beyond 800 Hz, and hearing is most sensitive in the middle frequency range of the audible spectrum [11].

In order to make LPC more robust, cepstral processing, which is a kind of homomorphic transformation, is then employed to separate the source e(n) from the all-pole filter h(n). It is commented that a homomorphic transformation \( \hat{x}(n) = D\big( x(n) \big) \) is a transformation that converts a convolution

\[ x(n) = e(n) * h(n) \]    (2-20)

into a sum

\[ \hat{x}(n) = \hat{e}(n) + \hat{h}(n) \]    (2-21)

which is usually used for processing signals that have been combined by convolution.

It is assumed that a value N can be found such that the cepstrum of the filter satisfies \( \hat{h}(n) \approx 0 \) for \( n \ge N \) and that of the excitation satisfies \( \hat{e}(n) \approx 0 \) for \( n < N \). The lifter ("lifter" is the word "filter" with its first syllable reversed) l(n) is used to approximately recover \( \hat{e}(n) \) and \( \hat{h}(n) \) from \( \hat{x}(n) \). Fig. 2-7 shows how to recover h(n) with l(n) given by

\[ l(n) = \begin{cases} 1, & n < N \\ 0, & n \ge N \end{cases} \]    (2-22)

and the operator D is usually implemented with logarithmic arithmetic, while D⁻¹ uses the inverse Z-transform. In a similar way, a complementary lifter l(n) is given by

\[ l(n) = \begin{cases} 0, & n < N \\ 1, & n \ge N \end{cases} \]

which is utilized for recovering the signal e(n) from x(n).
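A common practical instance of this scheme uses the real cepstrum, computed with the FFT in place of the Z-transform. In the sketch below the signal and the cutoff N are arbitrary, and the phase is discarded, so it only approximates the complex-cepstrum picture in the text:

```python
import numpy as np

def real_cepstrum(x):
    """D[.]: DFT -> log magnitude -> inverse DFT (phase discarded)."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12)).real

x = np.random.default_rng(0).standard_normal(512)   # stand-in signal x(n)
c = real_cepstrum(x)                                # x_hat(n) = e_hat(n) + h_hat(n)

N = 20                                  # assumed cutoff between filter and excitation
l_h = np.zeros_like(c); l_h[:N] = 1.0   # lifter of (2-22): 1 for n < N, else 0
l_e = 1.0 - l_h                         # complementary lifter for e(n)

h_hat = c * l_h                         # approximate filter cepstrum h_hat(n)
e_hat = c * l_e                         # approximate excitation cepstrum e_hat(n)
```

Because the two lifters sum to one at every quefrency, the liftered components add back up to the full cepstrum exactly.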

In general, the complex cepstrum can be obtained directly from the LPC coefficients by the recursion

\[ \hat{h}(n) = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, \hat{h}(k)\, a_{n-k}, \qquad 1 \le n \le p, \]

\[ \hat{h}(n) = \sum_{k=n-p}^{n-1} \frac{k}{n}\, \hat{h}(k)\, a_{n-k}, \qquad n > p, \]

where \( \hat{h}(n) \) gives the desired LPC-derived cepstrum coefficients c(n). It is noted that, while there is a finite number of LPC coefficients, the number of cepstrum coefficients is infinite.

Empirically, a number of cepstrum coefficients approximately equal to 1.5p is sufficient.
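The recursion from LPC coefficients to cepstrum coefficients can be sketched as follows; the coefficient values and the 1.5p truncation used here are purely illustrative:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LPC-derived cepstrum c(1)..c(n_ceps) from coefficients a_1..a_p.
    c(n) = a_n + sum_{k=1}^{n-1} (k/n) c(k) a_{n-k}   for 1 <= n <= p,
    c(n) =       sum_{k=n-p}^{n-1} (k/n) c(k) a_{n-k} for n > p."""
    p = len(a)
    c = np.zeros(n_ceps + 1)            # c[1..n_ceps]; index 0 unused
    for n in range(1, n_ceps + 1):
        val = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            val += (k / n) * c[k] * a[n - 1 - k]   # a_{n-k} is a[n-k-1]
        c[n] = val
    return c[1:]

p = 10
a = np.linspace(0.5, 0.05, p)           # made-up LPC coefficients
c = lpc_to_cepstrum(a, n_ceps=int(1.5 * p))   # keep ~1.5p cepstrum coefficients
```

Note that c(1) = a_1 and c(2) = a_2 + (1/2) c(1) a_1, matching the recursion term by term.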

Fig. 2-7 Homomorphic filtering
