Chapter 2 Front-End Techniques of Speech Recognition System
2.5 Feature Extraction Methods
2.5.3 Perceptual Linear Predictive (PLP) Analysis
Perceptual Linear Predictive (PLP) analysis was first presented and examined by Hermansky in 1990 [4] for analyzing speech. This technique combines several engineering approximations of the psychophysics of human hearing, including critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law. As a result, PLP analysis is more consistent with human hearing. In addition, PLP analysis is beneficial for speaker-independent speech recognition because of its computational efficiency and because it yields a low-dimensional representation of
Fig.2-10 The Mel filter banks (a) Fs = 8 kHz and (b) Fs = 16 kHz
speech. The block diagram of the PLP method is shown in Fig.2-11, and each step is described below [12].
Step I. Spectral analysis
The fast Fourier transform (FFT) is first applied to the windowed speech segment (sw(k), for k = 1, 2, …, N) to transform it into the frequency domain. The short-term power spectrum is expressed as
P(ω) = [Re(St(ω))]² + [Im(St(ω))]²   (2-37)
where the real and imaginary components of the short-term speech spectrum are squared and added. Fig.2-12 shows an example of a short-term speech signal and its power spectrum P(ω).
Fig.2-11 Scheme of obtaining Perceptual Linear Predictive coefficients (pre-processing of the speech {s(k)} into {sw(k)}, FFT, critical-band analysis, equal-loudness pre-emphasis, intensity-loudness conversion, IDFT, and autoregressive modeling to obtain the all-pole model)
Step II. Critical-band analysis
The power spectrum P(ω) is then warped along the frequency axis ω into the Bark scale frequency Ω as
Ω(ω) = 6 ln{ ω/(1200π) + [ (ω/(1200π))² + 1 ]^(1/2) }   (2-38)
where ω is the angular frequency in rad/sec; this warping is shown in Fig.2-13. The resulting power spectrum P(Ω) is then convolved with the simulated critical-band masking curve Ψ(Ω) to obtain the critical-band power spectrum Θ(Ωi) as
Θ(Ωi) = Σ_{Ω=Ωi−1.3}^{Ωi+2.5} P(Ω) Ψ(Ω − Ωi),  i = 1, 2, …, M   (2-39)
where M is the number of Bark filter banks and the critical-band masking curve Ψ(Ω), shown in Fig.2-14, is given by
Fig.2-12 Short-term speech signal (a) in time domain and (b) power spectrum

Ψ(Ω) = 0                 for Ω < −1.3
       10^{2.5(Ω+0.5)}   for −1.3 ≤ Ω ≤ −0.5
       1                 for −0.5 < Ω < 0.5
       10^{−1.0(Ω−0.5)}  for 0.5 ≤ Ω ≤ 2.5
       0                 for Ω > 2.5          (2-40)
where Ω is the Bark frequency defined in (2-38). This step is similar to the Mel filter bank processing of MFCC, with the Mel filter banks replaced by analogous trapezoidal Bark filter banks. The step between two banks is constant on the Bark scale, and the interval is chosen so that the filter banks cover the whole analysis band. For example, 21 Bark filter banks, which cover from 0 Bark to 19.7 Bark in 0.985-Bark steps, are employed for analyzing a speech signal with 16 kHz sampling frequency, shown in Fig.2-15. It is noted that 8 kHz maps to 19.687 Bark and the steps are usually chosen to be approximately 1 Bark. Fig.2-16 is the power spectrum after applying the Bark filter banks (M = 21) to the speech signal in Fig.2-12. The Bark filter banks and the Mel filter banks both allocate more filters to the lower frequencies, where hearing is more sensitive. Sometimes, the Bark filter banks are replaced with the Mel filter banks.
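The Bark warping of (2-38) can be sketched numerically as follows (a minimal Python sketch; the function name and the use of NumPy are our own, not part of the PLP literature):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Warp linear frequency (Hz) to the Bark scale via Eq. (2-38).

    With omega the angular frequency in rad/s, the formula is
    6 * ln( omega/(1200*pi) + sqrt((omega/(1200*pi))**2 + 1) ),
    i.e. 6 * asinh(omega / (1200*pi)).
    """
    omega = 2.0 * np.pi * np.asarray(f_hz, dtype=float)
    x = omega / (1200.0 * np.pi)
    return 6.0 * np.log(x + np.sqrt(x * x + 1.0))

# 8 kHz (the Nyquist frequency at Fs = 16 kHz) maps to roughly 19.7 Bark,
# consistent with the 21-filter example in the text.
print(hz_to_bark(8000.0))
```

Center frequencies of the trapezoidal filters would then be placed at roughly 1-Bark steps on this warped axis.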
Fig.2-13 Frequency Warping according to the Bark scale
Fig.2-14 Critical-band curve
Fig.2-15 The Bark filter banks (a) in Bark scale (b) in angular frequency scale
Step III. Equal-loudness pre-emphasis
To compensate for the unequal sensitivity of human hearing at different frequencies, the sampled power spectrum Θ(Ωi) obtained in (2-39) is then pre-emphasized by the simulated equal-loudness curve E(ω), expressed as
Ξ(Ωi) = E(ωi) · Θ(Ωi),  i = 1, 2, …, M   (2-41)
where the function E(ω) is given by
E(ω) = [ (ω² + 56.8×10⁶) ω⁴ ] / [ (ω² + 6.3×10⁶)² (ω² + 0.38×10⁹) (ω⁶ + 9.58×10²⁶) ]   (2-42)
where E(ω) acts as a high-pass weighting. Then the values of the first and last samples are set equal to the values of their nearest neighbors, so that Ξ(Ωi) begins and ends with two equal-valued samples. Fig.2-17 shows the power spectrum after equal-loudness pre-emphasis. As can be seen there, the higher-frequency part of Fig.2-16 has been well compensated.
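The weighting can be sketched directly from the reconstructed Eq. (2-42); only the relative weighting across frequency matters, since E(ω) multiplies the critical-band spectrum sample by sample (a hedged sketch; the function name is our own):

```python
import numpy as np

def equal_loudness(omega):
    """Equal-loudness weighting E(omega) of Eq. (2-42), omega in rad/s.

    Absolute scale is immaterial: E(omega) is used as a relative
    frequency weighting applied to the critical-band spectrum.
    """
    w2 = omega ** 2
    num = (w2 + 56.8e6) * omega ** 4
    den = (w2 + 6.3e6) ** 2 * (w2 + 0.38e9) * (omega ** 6 + 9.58e26)
    return num / den

# The weighting rises with frequency over the speech band, boosting the
# less audible high-frequency components relative to the low ones.
for f in (100.0, 1000.0, 3000.0):
    print(f, equal_loudness(2 * np.pi * f))
```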
Fig.2-16 Critical-band power spectrum
Step IV. Intensity-loudness power law
Because of the nonlinear relation between the intensity of a sound and its perceived loudness, spectral compression is then applied using the power law of hearing, given by
Φ(Ωi) = Ξ(Ωi)^0.33,  i = 1, 2, …, M   (2-43)

where a cubic-root compression of the critical-band energies is applied. This step reduces the spectral-amplitude variation of the critical-band spectrum. It is noted that the logarithm is adopted at the corresponding step of MFCC.
Fig.2-17 Equal loudness pre-emphasis
Fig.2-18 Intensity-loudness power law
Step V. Autoregressive modeling
The autocorrelation coefficients rs(n) are not computed in the time domain through (2-18) but are obtained as the inverse Fourier transform (IDFT) of the compressed spectrum Φ(Ωi). The IDFT is a better choice than the FFT here, since only a few autocorrelation values are needed. If the order of the all-pole model is p, only the first p+1 autocorrelation values are used to solve the Yule-Walker equations.
Then the standard Levinson-Durbin recursion is employed to compute the PLP coefficients.
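Step V can be sketched as follows; the recursion solves the Yule-Walker equations from the first p+1 autocorrelation lags (a minimal sketch; the function name and the AR(1) test signal are our own):

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: solve the Yule-Walker equations for an
    order-p all-pole model from autocorrelation values r[0..p].
    Returns the predictor polynomial a (with a[0] = 1) and the final
    prediction error."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# In PLP, r would come from the IDFT of the compressed spectrum, e.g.
#   r = np.fft.irfft(phi)[:p + 1]
# Here we use the autocorrelation of an AR(1) process, r(n) = 0.5**n:
a, err = levinson_durbin(np.array([1.0, 0.5, 0.25, 0.125]), 2)
print(a)  # the model recovers s(n) ~ 0.5 s(n-1), so a ~ [1, -0.5, 0]
```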
Chapter 3
Speech Modeling and Recognition
During the past several years, the Hidden Markov Model (HMM) [20][21][22] has become the most powerful and popular speech model used in ASR because of its ability to characterize the speech signal in a mathematically tractable way and its better performance compared to other methods. The assumption of the HMM is that the data samples can be well characterized as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise and well-defined framework.
3.1 Introduction
In a typical HMM-based ASR system, the HMM follows the feature extraction. The input of the HMM is the discrete-time sequence of feature vectors, such as MFCCs, LPCs, etc. These feature vectors are customarily called observations, since they represent the information observable from the incoming speech utterance. The observation sequence O = {o1, o2, …, oT} is the set of observations from time 1 to time T, where the time t is the frame index.
A Hidden Markov Model can be used to represent a word ("one", "two", "three", etc.), a syllable ("grand", "fa", "ther", etc.), a phone (/b/, /o/, /i/, etc.), and so forth. The Hidden Markov Model is essentially structured by a state sequence q = {q1, q2, …, qT}, where qt ∈ {S1, S2, …, SN}, N is the total number of states, and each state is generally associated with a multidimensional probability distribution. The states of an HMM can be viewed as collections of similar acoustical phenomena in an utterance. The total number of states N should be chosen carefully to represent these phenomena. In general, different numbers of states would lead to different recognition results [12].
For a particular state, an observation can be generated according to the associated probability distribution. This means that there is not a one-to-one correspondence between the observation and the state, and the state sequence cannot be determined uniquely from a given observation sequence. It is noticed that only the observation is visible, not the state. In other words, the model possesses hidden states and is therefore named the "Hidden" Markov Model.
3.2 Hidden Markov Model
Formally speaking, a Hidden Markov Model is defined as Λ = (A, B, π), which includes the initial state distribution π, the state-transition probability distribution A, and the observation probability distribution B. Each element will be illustrated as follows.

I. Initial state distribution π

The initial state distribution is defined as π = {πi}, in which
πi = P(q1 = Si),  1 ≤ i ≤ N   (3-1)
where πi is the probability that the initial state q1 of the state sequence q = {q1, q2, …, qT} is Si. Thus, the summation of the probabilities of all possible initial states is equal to 1, given as
π1 + π2 + ⋯ + πN = 1   (3-2)
II. State-transition probability distribution A
The state-transition probability distribution A of an N-state HMM can be expressed as {aij} or in the form of an N×N square matrix

A = [aij],  1 ≤ i, j ≤ N   (3-3)

with constant probability

aij = P(qt+1 = Sj | qt = Si),  1 ≤ i, j ≤ N   (3-4)

representing the transition probability from state i at time t to state j at time t+1.
Briefly, the transitions among the states are governed by a set of probabilities aij, called the transition probabilities, which are assumed not to change with time. It is noticed that the summation of all the probabilities from a particular state at time t to itself and the other states at time t+1 should be equal to 1, i.e. the summation of all the entries in the i-th row is equal to 1, given as

Σ_{j=1}^{N} aij = 1,  1 ≤ i ≤ N   (3-5)

The probability of a state sequence q being generated by the HMM is

P(q | A, π) = π_{q1} a_{q1 q2} a_{q2 q3} ⋯ a_{qT−1 qT}   (3-6)
For example, the transition probability matrix of a three-state HMM can be expressed in the form

A = [ a11 a12 a13 ]
    [ a21 a22 a23 ]
    [ a31 a32 a33 ]   (3-7)
for arbitrary time t. Fig.3-1 shows all the possible paths, labeled with transition probabilities between states, from time 1 to T. The structure without any constraint imposed on state transitions is called an ergodic HMM. It is easy to find that the number of all possible paths, N^{2(T−1)} (in this case N = 3), would greatly increase as time increases. A left-to-right HMM (namely the Bakis model) with the elements of the state-transition probability matrix
aij = 0,  for j < i   (3-9)
is adopted in general cases to simplify the model and reduce the computation time.
The main concept of a left-to-right HMM is that the speech signal varies with time from left to right; that is, the acoustic phenomena change sequentially, and the first state must be S1. There are two general types of left-to-right HMM, shown in Fig.3-2.
Fig.3-1 Three-state HMM
By using a three-state HMM as an example, the transition probability matrix A with the left-to-right and one-skip constraint, shown in Fig.3-3, can be expressed as

A = [ a11 a12 a13 ]
    [ 0   a22 a23 ]
    [ 0   0   a33 ]   (3-10)

Fig.3-4 shows all possible paths between states of a three-state left-to-right HMM from time 1 to time T.
If no skip is allowed, the transition probability matrix A can be expressed as

A = [ a11 a12 0   ]
    [ 0   a22 a23 ]
    [ 0   0   a33 ]   (3-11)

where the element a13 of the one-skip matrix is replaced by zero. Similarly, Fig.3-5 shows all possible paths between states of a no-skip three-state HMM from time 1 to time T.
Fig.3-2 Four-state left-to-right HMM with (a) one skip and (b) no skip
Fig.3-3 Typical left-to-right HMM with three states (a)
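To make the left-to-right constraint concrete, a row-stochastic transition matrix obeying (3-9) can be constructed as follows (a minimal NumPy sketch; the function name and the random initialization of the non-zero entries are our own choices):

```python
import numpy as np

def left_to_right_A(n_states, max_skip=1, seed=0):
    """Build a random row-stochastic transition matrix with the
    left-to-right constraint a_ij = 0 for j < i and transitions limited
    to at most max_skip states ahead (max_skip=1 gives the one-skip
    model, max_skip=0 the no-skip model)."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        hi = min(i + max_skip + 1, n_states - 1)
        w = rng.random(hi - i + 1)
        A[i, i:hi + 1] = w / w.sum()   # normalize the i-th row to sum to 1
    return A

A = left_to_right_A(3, max_skip=1)           # one-skip three-state model
assert np.allclose(A.sum(axis=1), 1.0)       # each row sums to 1 (Eq. 3-5)
assert np.allclose(np.tril(A, -1), 0.0)      # no backward transitions (Eq. 3-9)
```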
III. Observation probability distribution B
Since the state sequence q is not observable, each observation ot can be envisioned as being produced with the system in state qt . Assume that the production of ot in each possible state Si is stochastic, where i=1, 2,…, N, and is characterized by a set of observation probability functions B = {bj(ot)} where
bj(ot) = P(ot | qt = Sj),  j = 1, 2, …, N   (3-12)
Fig.3-4 Three-state left-to-right HMM with one skip
Fig.3-5 Three-state left-to-right HMM with no skip
which describes the probability of the observation ot being produced in state j. If the distribution of the observations is continuous, a finite mixture of Gaussian distributions, that is, a weighted sum of M Gaussian densities, is used, expressed as
bj(ot) = Σ_{m=1}^{M} wjm N(ot; µjm, Σjm),  j = 1, 2, …, N   (3-13)

where µjm and Σjm indicate the mean vector and the covariance matrix of the m-th mixture component in state Sj, and N(·; µ, Σ) denotes a multivariate Gaussian density. Since the elements of the observation vector are assumed to be independent of each other, the covariance matrix can be reduced to a diagonal form Σjm as
Σjm = diag( σjm²(1), σjm²(2), …, σjm²(L) )   (3-14)

or simplified as an L-dimensional vector

Σjm = [ σjm²(1)  σjm²(2)  ⋯  σjm²(L) ]   (3-15)
where L is the dimension of the observation ot. The mean vector can be expressed as
µjm = [ µjm(1)  µjm(2)  ⋯  µjm(L) ]   (3-16)
Then, the observation probability function bj(ot) can be written as

bj(ot) = Σ_{m=1}^{M} wjm Π_{l=1}^{L} [ 1 / √(2π σjm²(l)) ] exp{ −[ot(l) − µjm(l)]² / (2 σjm²(l)) }   (3-17)
As for the weighting coefficients wjm, they must satisfy

Σ_{m=1}^{M} wjm = 1,  j = 1, 2, …, N   (3-18)

where each wjm is a non-negative value.
Fig.3-6 shows that the probabilities of the observations sequence O ={o1, o2, o3, o4 } generated by state sequence q = {q1, q2, q3, q4} are bq1(o1), bq2(o2), bq3(o3), bq4(o4), respectively.
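The diagonal-covariance mixture density above can be sketched in code. The following computes log bj(ot), using the log-sum-exp trick for numerical stability; the stabilization step is standard implementation practice rather than part of the text, and the function name is our own:

```python
import numpy as np

def log_gmm_prob(o, weights, means, variances):
    """log b_j(o): log-probability of observation vector o (length L)
    under a diagonal-covariance Gaussian mixture with M components.
    weights: (M,), means: (M, L), variances: (M, L)."""
    o = np.asarray(o, dtype=float)
    # per-mixture diagonal Gaussian log-densities
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    maha = np.sum((o - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) - 0.5 * (log_det + maha)
    # log-sum-exp over the M components
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# single zero-mean, unit-variance Gaussian in 1-D evaluated at 0:
# the result should equal -0.5 * ln(2*pi)
lp = log_gmm_prob([0.0], np.array([1.0]), np.zeros((1, 1)), np.ones((1, 1)))
print(lp)
```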
3.3 Training Procedure
Given an HMM Λ = (A, B, π) and a set of observations O = {o1, o2, …, oT}, the purpose of training the HMM is to adjust the model parameters so that the likelihood P(O|Λ) is locally maximized by an iterative procedure. The modified k-means algorithm [19] and the Viterbi algorithm are employed in the process of obtaining the initial HMMs, and the Baum-Welch algorithm (also called the forward-backward algorithm) is performed to train the HMMs. Before applying the training algorithm, some preparation of the corpus and the HMMs is required:

I. A set of speech data and their associated transcriptions should be prepared, and the speech data must be transformed into a series of feature vectors (LPC, RC, LPCC, MFCC, PLP, etc.).
Fig.3-6 Scheme of the probability of the observations: the state sequence q1, q2, …, qT generates o1, o2, …, oT with observation probabilities bq1(o1), …, bqT(oT) and transition probabilities aq1q2, …, aqT−1qT
II. The number of states and the number of mixtures in an HMM must be determined according to the degree of variation in the unit. In general, 3~5 states and 6~8 states are used for representing an English phone and a Mandarin Chinese phone, respectively.
It is noted that the features are the observations of the HMM, and these observations and the transcriptions are then utilized to train the HMMs.
The training procedure can be divided into two cases depending on whether the sub-word-level segment information (the boundary information, i.e. manually labeled boundaries) is available. If the segment information is available, as in Fig.3-7(a), the estimation of the HMM parameters is easier and more precise; otherwise, training without segment information costs more computation time to re-align the boundaries and re-estimate the HMM, and the resulting HMM often does not perform as well as one trained with well-segmented data. The transcription and boundary information should be saved in text files, in the forms shown in Fig.3-7(b)(c).
It is noted that if the speech does not have segment information, it is still necessary to obtain the transcription and save it before training. The block diagram of the training procedure is shown in Fig.3-8. The main difference between training the HMM with boundary information and without it lies in the process of creating the initialized HMM. The following section is therefore divided into two parts to present the details of creating the initialized HMM.
Fig.3-7 (a) Speech labeled with the boundary and transcription saved as a text file (b) with boundary information, e.g.

0 60 sil
60 360 yi
360 370 sp
370 600 ling
600 610 sp
620 1050 wu
1050 1150 sil

and (c) without boundary information, e.g.

sil yi sp ling sp wu sil

Fig.3-8 Training procedure of the HMM
I. Boundary information is available
The procedure of creating the initialized HMMs is shown in Fig.3-9 and Fig.3-10.
The modified k-means algorithm and the Viterbi algorithm are utilized in the training iterations. On the first iteration, the training data of a specific model are uniformly divided into N segments, where N is the number of states of the HMM, and the successive segments are associated with successive states. Then, the HMM parameters πi and aij can be estimated from this uniform segmentation.
3.3.1 Modified k-means algorithm

For a continuous-density HMM with M Gaussian mixtures per state, the modified k-means algorithm [13][14] is used to cluster the observations O into a set of M clusters, which are associated with the mixtures in a state, shown in Fig.3-9. Let the i-th cluster of an m-cluster set at the k-th iteration be denoted as ω_{m,k}^i, where i = 1, 2, …, m and k = 1, 2, …, kmax, with kmax being the maximum allowable iteration count, m the number of clusters in the current iteration, and Y(ω) the representative pattern (centroid) of cluster ω. The modified k-means algorithm is given by
(i) Set m = 1, k = 1, and i = 1; ω_{1,1}^1 = O and compute the mean Y(O) of the entire training set O.

(ii) Classify the vectors by the minimum-distance principle. Accumulate the total intracluster distance for each cluster ω_{m,k}^i, denoted as ∆_k^i. If none of the following conditions is met, set k = k+1 and go back to (ii):

a. ω_{m,k+1}^i = ω_{m,k}^i for all i = 1, 2, …, m.
b. k reaches the preset maximum allowable number of iterations.
c. The change in the total accumulated distance is below the preset threshold ∆th.

(iii) Record the mean and the covariance of the m clusters. If m has reached the number of mixtures M, stop; else, go to (iv).

(iv) Split the mean of the cluster that has the largest intracluster distance, set m = m+1, reset k, and go to (ii).
From the modified k-means, the observations are clustered into M groups, where M is the number of mixtures in a state. The parameters can be estimated by

µjm = mean of the observations classified in cluster m in state j
Σjm = covariance matrix of the observations classified in cluster m in state j
wjm = (number of observations classified in cluster m in state j) / (number of observations in state j)

and all the HMM parameters are then updated.
Fig.3-9 The block diagram of creating the initialized HMM
Fig.3-10 Modified k-means
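Steps (i)~(iv) above can be sketched as follows. The split perturbation (a small fraction of the per-dimension standard deviation) and the plain squared-Euclidean distance are our own choices for illustration; the text does not fix them:

```python
import numpy as np

def modified_kmeans(obs, n_mix, max_iter=20, tol=1e-4):
    """Cluster the observations of one state into n_mix groups by
    repeatedly splitting the worst cluster and re-running k-means:
    a sketch of the modified k-means of Section 3.3.1."""
    centers = obs.mean(axis=0, keepdims=True)      # step (i): one cluster
    while True:
        prev = np.inf
        for _ in range(max_iter):                  # step (ii): classify
            d = ((obs[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            total = d[np.arange(len(obs)), labels].sum()
            for i in range(len(centers)):          # update centroids
                pts = obs[labels == i]
                if len(pts):
                    centers[i] = pts.mean(axis=0)
            if prev - total < tol:                 # condition (c)
                break
            prev = total
        if len(centers) == n_mix:                  # step (iii): enough clusters
            break
        # step (iv): split the cluster with the largest intracluster distance
        intra = np.array([d[labels == i, i].sum() for i in range(len(centers))])
        worst = intra.argmax()
        eps = 1e-3 * (obs.std(axis=0) + 1e-12)
        centers = np.vstack([centers, centers[worst] + eps])
        centers[worst] -= eps
    return centers, labels

# two well-separated 1-D groups should yield centers near -5 and +5
obs = np.vstack([np.full((50, 1), -5.0), np.full((50, 1), 5.0)])
centers, labels = modified_kmeans(obs, 2)
print(sorted(centers.ravel().tolist()))
```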
3.3.2 Viterbi Search

Except for the first estimation of the HMM, the uniform segmentation is replaced by Viterbi alignment, viz. the Viterbi search, which is applied to find the optimal state sequence q = {q1, q2, …, qT} when the model Λ and the observation sequence O = {o1, o2, …, oT} are given. By the Viterbi alignment, each observation is re-aligned to a state so that the new state sequence q maximizes the probability of generating the observation sequence O.
By taking the logarithm of the model parameters, the Viterbi algorithm [14] can be implemented with only N²T additions and without any multiplications. Define δt(i) as the highest probability along a single path at time t, expressed as

δt(i) = max_{q1,q2,…,qt−1} P(q1, q2, …, qt−1, qt = Si, o1, o2, …, ot | Λ)   (3-24)
and by induction we can obtain
δt+1(j) = [ max_{1≤i≤N} δt(i) aij ] · bj(ot+1)   (3-25)
which is shown in Fig.3-11.
Fig.3-11 Maximization of the probability of generating the observation sequence
The Viterbi algorithm is expressed as follows:

(i) Preprocessing:

π̃i = ln(πi),  ãij = ln(aij),  b̃j(ot) = ln(bj(ot)),  1 ≤ i, j ≤ N, 1 ≤ t ≤ T   (3-26)~(3-28)

(ii) Initialization:

δ̃1(i) = π̃i + b̃i(o1),  ψ1(i) = 0,  1 ≤ i ≤ N   (3-29), (3-30)

(iii) Recursion:

δ̃t(j) = max_{1≤i≤N} [ δ̃t−1(i) + ãij ] + b̃j(ot),  ψt(j) = argmax_{1≤i≤N} [ δ̃t−1(i) + ãij ],  2 ≤ t ≤ T, 1 ≤ j ≤ N   (3-31), (3-32)

(iv) Termination:

P̃* = max_{1≤i≤N} [ δ̃T(i) ],  qT* = argmax_{1≤i≤N} [ δ̃T(i) ]   (3-33), (3-34)

(v) Backtracking:

qt* = ψt+1(qt+1*),  t = T−1, T−2, …, 1   (3-35)
From the above, the state sequence q which maximizes P̃* implies an alignment of observations with states. The above procedures (Viterbi alignment, modified k-means, and parameter estimation) are applied until P̃* converges. After obtaining the initialized HMM, the Baum-Welch algorithm and the Viterbi search are then applied to get the first estimation of the HMM. Finally, the Baum-Welch algorithm is performed repeatedly to re-estimate the HMMs simultaneously. The Baum-Welch algorithm will be introduced later.
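The log-domain recursion of the Viterbi search can be sketched as follows (a minimal NumPy sketch; the array layout, function signature, and the toy two-state example are our own):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Log-domain Viterbi search.

    log_pi: (N,) log initial probabilities
    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) log observation probabilities, log_B[t, j] = log b_j(o_t)
    Returns the best state sequence and its log probability."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]                 # initialization
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):                     # recursion: additions only
        scores = delta[:, None] + log_A       # scores[i, j] = delta(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]              # termination
    for t in range(T - 1, 0, -1):             # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

# toy example: observations favor state 0 twice, then state 1
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.6, 0.4], [0.4, 0.6]]))
log_B = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
path, lp = viterbi(log_pi, log_A, log_B)
print(path)  # best alignment: [0, 0, 1]
```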
II. Boundary information is not available
In this case, all the HMMs are initialized to be identical, and the means and variances of all states are set equal to the global mean and variance. As for the initial state distribution π and the state-transition probability distribution A, there is no information from which to compute these parameters; hence, π and A are set arbitrarily. From the above process, the initialized HMMs are generated.
Afterwards, the processes for re-estimating the HMMs resemble those used when boundary information is available, that is, the Baum-Welch algorithm. After re-estimation by the Baum-Welch algorithm, the Viterbi search is also needed to re-align the boundaries of the sub-words. This step is different from the training procedure that already has boundary information. The next section introduces the Baum-Welch algorithm employed in the HMM training process.
3.3.3 Baum-Welch re-estimation

The Baum-Welch algorithm, also known as the forward-backward algorithm, is the core of HMM training. Consider the forward variable αt(i) defined as
αt(i) = P(o1, o2, …, ot, qt = Si | Λ)   (3-36)
which is the probability of being in state Si at time t, having generated the partial observation sequence o1, o2, …, ot, given the model Λ, shown in Fig.3-12. The forward variable is obtained inductively as follows.

Step I. Initialization:
α1(i) = πi bi(o1),  1 ≤ i ≤ N   (3-37)
Step II. Induction:
αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1),  1 ≤ t ≤ T−1, 1 ≤ j ≤ N   (3-38)

In a similar way, the backward variable is defined as
βt(i) = P(ot+1, ot+2, …, oT | qt = Si, Λ)   (3-39)
which represents the probability of the observation sequence from t+1 to the end, given state Si at time t and the model Λ, shown in Fig.3-12. The backward variable is obtained inductively by
Step I. Initialization:
βT(i) = 1,  1 ≤ i ≤ N   (3-40)
Step II. Induction:
βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j),  t = T−1, T−2, …, 1, 1 ≤ i ≤ N   (3-41)

Fig.3-12 Forward variable and backward variable
In addition, three more variables should be defined: ξt(i, j), the posterior probability γt(i), and the mixture posterior γt(j, k). The variable ξt(i, j), the probability of being in state i at time t and in state j at time t+1 given the observations and the model, is

ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O|Λ)   (3-42)

the posterior probability γt(i) is expressed as

γt(i) = αt(i) βt(i) / Σ_{j=1}^{N} αt(j) βt(j)   (3-43)

and γt(j, k) represents the probability of being in state j at time t with the k-th mixture component accounting for ot.
The HMM parameters can be re-estimated by using the variables mentioned above as

π̄i = γ1(i)
āij = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i)
w̄jk = Σ_{t=1}^{T} γt(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γt(j, k)
µ̄jk = Σ_{t=1}^{T} γt(j, k) ot / Σ_{t=1}^{T} γt(j, k)
Σ̄jk = Σ_{t=1}^{T} γt(j, k) (ot − µ̄jk)(ot − µ̄jk)ᵀ / Σ_{t=1}^{T} γt(j, k)

where Σ̄jk is the covariance matrix of the observations at state j and mixture k.
From the statistical viewpoint of estimating the HMM by the Expectation-Maximization (EM) algorithm, the equations for estimating the parameters are the same as the equations derived from the Baum-Welch algorithm.
Besides, it has been shown that the likelihood function converges to a critical point after iterations; because of the complexity of the likelihood function, the Baum-Welch algorithm leads only to a local maximum.
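The forward and backward recursions (3-36)~(3-41) can be sketched together as follows. The per-frame scaling factors c_t are a standard implementation device against numerical underflow that the text does not discuss; with this scaling, the posteriors γt(i) of (3-43) come out directly as the product of the scaled variables:

```python
import numpy as np

def forward_backward(pi, A, B):
    """Scaled forward-backward pass.

    pi: (N,) initial probabilities; A: (N, N) transitions;
    B:  (T, N) observation likelihoods, B[t, j] = b_j(o_t).
    Returns the state posteriors gamma[t, i] and log P(O|Lambda)."""
    T, N = B.shape
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
    alpha[0] = pi * B[0]                         # initialization (3-37)
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                        # forward induction (3-38)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[T - 1] = 1.0                            # initialization (3-40)
    for t in range(T - 2, -1, -1):               # backward induction (3-41)
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                         # posteriors; rows sum to 1
    return gamma, np.log(c).sum()                # log P(O|Lambda) from scaling

# trivial 1-state check: P(O|Lambda) = 0.5 * 0.5 = 0.25
g, lp = forward_backward(np.array([1.0]), np.array([[1.0]]),
                         np.array([[0.5], [0.5]]))
print(np.exp(lp))  # 0.25
```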
3.4 Recognition Procedure
Given the HMMs and the observation sequence O ={o1, o2,…, oT }, the recognition stage is to compute the probability P(O|Λ) by using an efficient method,