Chapter 2 Front-End Techniques of Speech Recognition System
2.5 Feature Extraction Methods
2.5.3 Perceptual Linear Predictive (PLP) Analysis
Perceptual Linear Predictive (PLP) analysis was first presented and examined by Hermansky in 1990 [4] for analyzing speech. This technique combines several engineering approximations of the psychophysics of human hearing, including critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law. As a result, PLP analysis is more consistent with human hearing. In addition, PLP analysis is beneficial for speaker-independent speech recognition because of its computational efficiency and because it yields a low-dimensional representation of
Fig.2-10 The Mel filter banks (a) Fs = 8 kHz and (b) Fs = 16 kHz
speech. The block diagram of the PLP method is shown in Fig.2-11, and each step is described below [12].
Step I. Spectral analysis
The fast Fourier transform (FFT) is first applied to the windowed speech segment (sw(k), for k = 1, 2, …, N) to transform it into the frequency domain. The short-term power spectrum is expressed as
P(ω) = [Re(St(ω))]² + [Im(St(ω))]²   (2-37)
where the real and imaginary components of the short-term speech spectrum are squared and added. Fig.2-12 shows an example of a short-term speech signal and its power spectrum P(ω).
Fig.2-11 Scheme of obtaining Perceptual Linear Predictive coefficients (pre-processing of the speech {s(k)} into {sw(k)}, FFT, critical-band analysis, equal-loudness pre-emphasis, intensity-loudness conversion, IDFT, and autoregressive modeling to obtain the all-pole model)
Step II. Critical-band analysis
The power spectrum P(ω) is then warped along the frequency axis ω into the Bark scale frequency Ω as
Ω(ω) = 6 ln{ ω/(1200π) + [ (ω/(1200π))² + 1 ]^(1/2) }   (2-38)
where ω is the angular frequency in rad/sec; this warping is shown in Fig.2-13. The resulting power spectrum P(Ω) is then convolved with the simulated critical-band masking curve Ψ(Ω) to obtain the critical-band power spectrum Θ(Ωi) as
Θ(Ωi) = Σ_{Ω=Ωi−1.3}^{Ωi+2.5} P(Ω) Ψ(Ω − Ωi),  i = 1, 2, …, M   (2-39)
where M is the number of Bark filter banks and the critical-band masking curve Ψ(Ω), shown in Fig.2-14, is given by
Fig.2-12 Short-term speech signal (a) in time domain and (b) power spectrum

Ψ(Ω) = 0                 for Ω < −1.3
       10^{2.5(Ω+0.5)}   for −1.3 ≤ Ω ≤ −0.5
       1                 for −0.5 < Ω < 0.5
       10^{−1.0(Ω−0.5)}  for 0.5 ≤ Ω ≤ 2.5
       0                 for Ω > 2.5          (2-40)
where Ω is the Bark frequency defined in (2-38). This step is similar to the Mel filter bank processing of MFCC, with the Mel filter banks replaced by analogous trapezoidal Bark filter banks. The step between two banks is constant on the Bark scale, and the interval is chosen so that the filter banks cover the whole analysis band. For example, 21 Bark filter banks, which cover from 0 Bark to 19.7 Bark in 0.985-Bark steps, are employed for analyzing a speech signal with 16 kHz sampling frequency, shown in Fig.2-15. It is noted that 8 kHz maps to 19.687 Bark and the steps are usually chosen to be approximately 1 Bark. Fig.2-16 is the power spectrum after applying the Bark filter banks (M = 21) to the speech signal in Fig.2-12. The Bark filter banks and the Mel filter banks both allocate more filters to the lower frequencies, where hearing is more sensitive. Sometimes, the Bark filter banks are replaced with the Mel filter banks.
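The Bark warping of (2-38) can be sketched numerically as follows (a minimal Python sketch; the function name and the use of NumPy are our own, not part of the PLP literature):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Warp linear frequency (Hz) to the Bark scale via Eq. (2-38).

    With omega the angular frequency in rad/s, the formula is
    6 * ln( omega/(1200*pi) + sqrt((omega/(1200*pi))**2 + 1) ),
    i.e. 6 * asinh(omega / (1200*pi)).
    """
    omega = 2.0 * np.pi * np.asarray(f_hz, dtype=float)
    x = omega / (1200.0 * np.pi)
    return 6.0 * np.log(x + np.sqrt(x * x + 1.0))

# 8 kHz (the Nyquist frequency at Fs = 16 kHz) maps to roughly 19.7 Bark,
# consistent with the 21-filter example in the text.
print(hz_to_bark(8000.0))
```

Center frequencies of the trapezoidal filters would then be placed at roughly 1-Bark steps on this warped axis.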
Fig.2-13 Frequency Warping according to the Bark scale
Fig.2-14 Critical-band curve
Fig.2-15 The Bark filter banks (a) in Bark scale (b) in angular frequency scale
Step III. Equal-loudness pre-emphasis
To compensate for the unequal sensitivity of human hearing at different frequencies, the sampled power spectrum Θ(Ωi) obtained in (2-39) is then pre-emphasized by the simulated equal-loudness curve E(ω), expressed as
Ξ(Ωi) = E(ωi) · Θ(Ωi),  i = 1, 2, …, M   (2-41)
where the function E(ω) is given by
E(ω) = [ (ω² + 56.8×10⁶) ω⁴ ] / [ (ω² + 6.3×10⁶)² (ω² + 0.38×10⁹) (ω⁶ + 9.58×10²⁶) ]   (2-42)
where E(ω) acts as a high-pass weighting. Then the values of the first and last samples are set equal to the values of their nearest neighbors, so that Ξ(Ωi) begins and ends with two equal-valued samples. Fig.2-17 shows the power spectrum after equal-loudness pre-emphasis. As can be seen there, the higher-frequency part of Fig.2-16 has been well compensated.
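The weighting can be sketched directly from the reconstructed Eq. (2-42); only the relative weighting across frequency matters, since E(ω) multiplies the critical-band spectrum sample by sample (a hedged sketch; the function name is our own):

```python
import numpy as np

def equal_loudness(omega):
    """Equal-loudness weighting E(omega) of Eq. (2-42), omega in rad/s.

    Absolute scale is immaterial: E(omega) is used as a relative
    frequency weighting applied to the critical-band spectrum.
    """
    w2 = omega ** 2
    num = (w2 + 56.8e6) * omega ** 4
    den = (w2 + 6.3e6) ** 2 * (w2 + 0.38e9) * (omega ** 6 + 9.58e26)
    return num / den

# The weighting rises with frequency over the speech band, boosting the
# less audible high-frequency components relative to the low ones.
for f in (100.0, 1000.0, 3000.0):
    print(f, equal_loudness(2 * np.pi * f))
```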
Fig.2-16 Critical-band power spectrum
Step IV. Intensity-loudness power law
Because of the nonlinear relation between the intensity of a sound and its perceived loudness, spectral compression is then applied using the power law of hearing, given by
Φ(Ωi) = Ξ(Ωi)^0.33,  i = 1, 2, …, M   (2-43)

where a cubic-root compression of the critical-band energies is applied. This step reduces the spectral-amplitude variation of the critical-band spectrum. It is noted that the logarithm is adopted at the corresponding step of MFCC.
Fig.2-17 Equal loudness pre-emphasis
Fig.2-18 Intensity-loudness power law
Step V. Autoregressive modeling
The autocorrelation coefficients rs(n) are not computed in the time domain through (2-18) but are obtained as the inverse Fourier transform (IDFT) of the compressed spectrum Φ(Ωi). The IDFT is a better choice than the FFT here, since only a few autocorrelation values are needed. If the order of the all-pole model is p, only the first p+1 autocorrelation values are used to solve the Yule-Walker equations.
Then the standard Levinson-Durbin recursion is employed to compute the PLP coefficients.
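Step V can be sketched as follows; the recursion solves the Yule-Walker equations from the first p+1 autocorrelation lags (a minimal sketch; the function name and the AR(1) test signal are our own):

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion: solve the Yule-Walker equations for an
    order-p all-pole model from autocorrelation values r[0..p].
    Returns the predictor polynomial a (with a[0] = 1) and the final
    prediction error."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# In PLP, r would come from the IDFT of the compressed spectrum, e.g.
#   r = np.fft.irfft(phi)[:p + 1]
# Here we use the autocorrelation of an AR(1) process, r(n) = 0.5**n:
a, err = levinson_durbin(np.array([1.0, 0.5, 0.25, 0.125]), 2)
print(a)  # the model recovers s(n) ~ 0.5 s(n-1), so a ~ [1, -0.5, 0]
```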
Chapter 3
Speech Modeling and Recognition
During the past several years, the Hidden Markov Model (HMM) [20][21][22] has become the most powerful and popular speech model used in ASR because of its ability to characterize the speech signal in a mathematically tractable way and its better performance compared to other methods. The assumption of the HMM is that the data samples can be well characterized as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise and well-defined framework.
3.1 Introduction
In a typical HMM-based ASR system, the HMM follows the feature extraction. The input of the HMM is the discrete-time sequence of feature vectors, such as MFCCs, LPCs, etc. These feature vectors are customarily called observations, since they represent the information observable from the incoming speech utterance. The observation sequence O = {o1, o2, …, oT} is the set of observations from time 1 to time T, where the time t is the frame index.
A Hidden Markov Model can be used to represent a word ("one", "two", "three", etc.), a syllable ("grand", "fa", "ther", etc.), a phone (/b/, /o/, /i/, etc.), and so forth. The Hidden Markov Model is essentially structured by a state sequence q = {q1, q2, …, qT}, where qt ∈ {S1, S2, …, SN}, N is the total number of states, and each state is generally associated with a multidimensional probability distribution. The states of an HMM can be viewed as collections of similar acoustical phenomena in an utterance. The total number of states N should be chosen carefully to represent these phenomena. In general, different numbers of states would lead to different recognition results [12].
For a particular state, an observation can be generated according to the associated probability distribution. This means that there is not a one-to-one correspondence between the observation and the state, and the state sequence cannot be determined uniquely from a given observation sequence. It is noticed that only the observation is visible, not the state. In other words, the model possesses hidden states and is therefore named the "Hidden" Markov Model.
3.2 Hidden Markov Model
Formally speaking, a Hidden Markov Model is defined as Λ = (A, B, π), which includes the initial state distribution π, the state-transition probability distribution A, and the observation probability distribution B. Each element will be illustrated as follows.

I. Initial state distribution π

The initial state distribution is defined as π = {πi}, in which
πi = P(q1 = Si),  1 ≤ i ≤ N   (3-1)
where πi is the probability that the initial state q1 of the state sequence q = {q1, q2, …, qT} is Si. Thus, the summation of the probabilities of all possible initial states is equal to 1, given as
π1 + π2 + ⋯ + πN = 1   (3-2)
II. State-transition probability distribution A
The state-transition probability distribution A of an N-state HMM can be expressed as {aij} or in the form of an N×N square matrix

A = [aij],  1 ≤ i, j ≤ N   (3-3)

with constant probability

aij = P(qt+1 = Sj | qt = Si),  1 ≤ i, j ≤ N   (3-4)

representing the transition probability from state i at time t to state j at time t+1.
Briefly, the transitions among the states are governed by a set of probabilities aij, called the transition probabilities, which are assumed not to change with time. It is noticed that the summation of all the probabilities from a particular state at time t to itself and the other states at time t+1 should be equal to 1, i.e. the summation of all the entries in the i-th row is equal to 1, given as

Σ_{j=1}^{N} aij = 1,  1 ≤ i ≤ N   (3-5)

The probability of a state sequence q being generated by the HMM is

P(q | A, π) = π_{q1} a_{q1 q2} a_{q2 q3} ⋯ a_{qT−1 qT}   (3-6)
For example, the transition probability matrix of a three-state HMM can be expressed in the form

A = [ a11 a12 a13 ]
    [ a21 a22 a23 ]
    [ a31 a32 a33 ]   (3-7)
for arbitrary time t. Fig.3-1 shows all the possible paths, labeled with transition probabilities between states, from time 1 to T. The structure without any constraint imposed on state transitions is called an ergodic HMM. It is easy to find that the number of all possible paths, N^{2(T−1)} (in this case N = 3), would greatly increase as time increases. A left-to-right HMM (namely the Bakis model) with the elements of the state-transition probability matrix
aij = 0,  for j < i   (3-9)
is adopted in general cases to simplify the model and reduce the computation time.
The main concept of a left-to-right HMM is that the speech signal varies with time from left to right; that is, the acoustic phenomena change sequentially, and the first state must be S1. There are two general types of left-to-right HMM, shown in Fig.3-2.
Fig.3-1 Three-state HMM
By using a three-state HMM as an example, the transition probability matrix A with the left-to-right and one-skip constraint, shown in Fig.3-3, can be expressed as

A = [ a11 a12 a13 ]
    [ 0   a22 a23 ]
    [ 0   0   a33 ]   (3-10)

Fig.3-4 shows all possible paths between states of a three-state left-to-right HMM from time 1 to time T.
If no skip is allowed, the transition probability matrix A can be expressed as

A = [ a11 a12 0   ]
    [ 0   a22 a23 ]
    [ 0   0   a33 ]   (3-11)

where the element a13 of the one-skip matrix is replaced by zero. Similarly, Fig.3-5 shows all possible paths between states of a no-skip three-state HMM from time 1 to time T.
Fig.3-2 Four-state left-to-right HMM with (a) one skip and (b) no skip
Fig.3-3 Typical left-to-right HMM with three states (a)
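To make the left-to-right constraint concrete, a row-stochastic transition matrix obeying (3-9) can be constructed as follows (a minimal NumPy sketch; the function name and the random initialization of the non-zero entries are our own choices):

```python
import numpy as np

def left_to_right_A(n_states, max_skip=1, seed=0):
    """Build a random row-stochastic transition matrix with the
    left-to-right constraint a_ij = 0 for j < i and transitions limited
    to at most max_skip states ahead (max_skip=1 gives the one-skip
    model, max_skip=0 the no-skip model)."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        hi = min(i + max_skip + 1, n_states - 1)
        w = rng.random(hi - i + 1)
        A[i, i:hi + 1] = w / w.sum()   # normalize the i-th row to sum to 1
    return A

A = left_to_right_A(3, max_skip=1)           # one-skip three-state model
assert np.allclose(A.sum(axis=1), 1.0)       # each row sums to 1 (Eq. 3-5)
assert np.allclose(np.tril(A, -1), 0.0)      # no backward transitions (Eq. 3-9)
```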
III. Observation probability distribution B
Since the state sequence q is not observable, each observation ot can be envisioned as being produced with the system in state qt . Assume that the production of ot in each possible state Si is stochastic, where i=1, 2,…, N, and is characterized by a set of observation probability functions B = {bj(ot)} where
bj(ot) = P(ot | qt = Sj),  j = 1, 2, …, N   (3-12)
Fig.3-4 Three-state left-to-right HMM with one skip
Fig.3-5 Three-state left-to-right HMM with no skip
which describes the probability of the observation ot being produced in state j. If the distribution of the observations is continuous, a finite mixture of Gaussian distributions, that is, a weighted sum of M Gaussian densities, is used, expressed as
bj(ot) = Σ_{m=1}^{M} wjm N(ot; µjm, Σjm),  j = 1, 2, …, N   (3-13)

where µjm and Σjm indicate the mean vector and the covariance matrix of the m-th mixture component in state Sj, and N(·; µ, Σ) denotes a multivariate Gaussian density. Since the elements of the observation vector are assumed to be independent of each other, the covariance matrix can be reduced to a diagonal form Σjm as
Σjm = diag( σjm²(1), σjm²(2), …, σjm²(L) )   (3-14)

or simplified as an L-dimensional vector

Σjm = [ σjm²(1)  σjm²(2)  ⋯  σjm²(L) ]   (3-15)
where L is the dimension of the observation ot. The mean vector can be expressed as
µjm = [ µjm(1)  µjm(2)  ⋯  µjm(L) ]   (3-16)
Then, the observation probability function bj(ot) can be written as

bj(ot) = Σ_{m=1}^{M} wjm Π_{l=1}^{L} [ 1 / √(2π σjm²(l)) ] exp{ −[ot(l) − µjm(l)]² / (2 σjm²(l)) }   (3-17)
As for the weighting coefficients wjm, they must satisfy

Σ_{m=1}^{M} wjm = 1,  j = 1, 2, …, N   (3-18)

where each wjm is a non-negative value.
Fig.3-6 shows that the probabilities of the observations sequence O ={o1, o2, o3, o4 } generated by state sequence q = {q1, q2, q3, q4} are bq1(o1), bq2(o2), bq3(o3), bq4(o4), respectively.
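The diagonal-covariance mixture density above can be sketched in code. The following computes log bj(ot), using the log-sum-exp trick for numerical stability; the stabilization step is standard implementation practice rather than part of the text, and the function name is our own:

```python
import numpy as np

def log_gmm_prob(o, weights, means, variances):
    """log b_j(o): log-probability of observation vector o (length L)
    under a diagonal-covariance Gaussian mixture with M components.
    weights: (M,), means: (M, L), variances: (M, L)."""
    o = np.asarray(o, dtype=float)
    # per-mixture diagonal Gaussian log-densities
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    maha = np.sum((o - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) - 0.5 * (log_det + maha)
    # log-sum-exp over the M components
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# single zero-mean, unit-variance Gaussian in 1-D evaluated at 0:
# the result should equal -0.5 * ln(2*pi)
lp = log_gmm_prob([0.0], np.array([1.0]), np.zeros((1, 1)), np.ones((1, 1)))
print(lp)
```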
3.3 Training Procedure
Given an HMM Λ = (A, B, π) and a set of observations O = {o1, o2, …, oT}, the purpose of training the HMM is to adjust the model parameters so that the likelihood P(O|Λ) is locally maximized by an iterative procedure. The modified k-means algorithm [19] and the Viterbi algorithm are employed in the process of obtaining the initial HMMs, and the Baum-Welch algorithm (also called the forward-backward algorithm) is performed to train the HMMs. Before applying the training algorithm, some preparation of the corpus and the HMMs is required:

I. A set of speech data and their associated transcriptions should be prepared, and the speech data must be transformed into a series of feature vectors (LPC, RC, LPCC, MFCC, PLP, etc.).
Fig.3-6 Scheme of the probability of the observations: the state sequence q1, q2, …, qT generates o1, o2, …, oT with observation probabilities bq1(o1), …, bqT(oT) and transition probabilities aq1q2, …, aqT−1qT
II. The number of states and the number of mixtures in an HMM must be determined according to the degree of variation in the unit. In general, 3~5 states and 6~8 states are used for representing an English phone and a Mandarin Chinese phone, respectively.
It is noted that the features are the observations of the HMM, and these observations and the transcriptions are then utilized to train the HMMs.
The training procedure can be divided into two cases depending on whether the sub-word-level segment information (the boundary information, i.e. manually labeled boundaries) is available. If the segment information is available, as in Fig.3-7(a), the estimation of the HMM parameters is easier and more precise; otherwise, training without segment information costs more computation time to re-align the boundaries and re-estimate the HMM, and the resulting HMM often does not perform as well as one trained with well-segmented data. The transcription and boundary information should be saved in text files, in the forms shown in Fig.3-7(b)(c).
It is noted that if the speech does not have segment information, it is still necessary to obtain the transcription and save it before training. The block diagram of the training procedure is shown in Fig.3-8. The main difference between training the HMM with boundary information and without it lies in the process of creating the initialized HMM. The following section is therefore divided into two parts to present the details of creating the initialized HMM.
Fig.3-7 (a) Speech labeled with the boundary and transcription saved as a text file (b) with boundary information, e.g.

0 60 sil
60 360 yi
360 370 sp
370 600 ling
600 610 sp
620 1050 wu
1050 1150 sil

and (c) without boundary information, e.g.

sil yi sp ling sp wu sil

Fig.3-8 Training procedure of the HMM
I. Boundary information is available
The procedure of creating the initialized HMMs is shown in Fig.3-9 and Fig.3-10.
The modified k-means algorithm and the Viterbi algorithm are utilized in the training iterations. On the first iteration, the training data of a specific model are uniformly divided into N segments, where N is the number of states of the HMM, and the successive segments are associated with successive states. Then, the HMM parameters πi and aij can be estimated from this uniform segmentation.
3.3.1 Modified k-means algorithm

For a continuous-density HMM with M Gaussian mixtures per state, the modified k-means algorithm [13][14] is used to cluster the observations O into a set of M clusters, which are associated with the mixtures in a state, shown in Fig.3-9. Let the i-th cluster of an m-cluster set at the k-th iteration be denoted as ω_{m,k}^i, where i = 1, 2, …, m and k = 1, 2, …, kmax, with kmax being the maximum allowable iteration count, m the number of clusters in the current iteration, and Y(ω) the representative pattern (centroid) of cluster ω. The modified k-means algorithm is given by
(i) Set m = 1, k = 1, and i = 1; ω_{1,1}^1 = O and compute the mean Y(O) of the entire training set O.

(ii) Classify the vectors by the minimum-distance principle. Accumulate the total intracluster distance for each cluster ω_{m,k}^i, denoted as ∆_k^i. If none of the following conditions is met, set k = k+1 and go back to (ii):

a. ω_{m,k+1}^i = ω_{m,k}^i for all i = 1, 2, …, m.
b. k reaches the preset maximum allowable number of iterations.
c. The change in the total accumulated distance is below the preset threshold ∆th.

(iii) Record the mean and the covariance of the m clusters. If m has reached the number of mixtures M, stop; else, go to (iv).

(iv) Split the mean of the cluster that has the largest intracluster distance, set m = m+1, reset k, and go to (ii).
From the modified k-means, the observations are clustered into M groups, where M is the number of mixtures in a state. The parameters can be estimated by

µjm = mean of the observations classified in cluster m in state j
Σjm = covariance matrix of the observations classified in cluster m in state j
wjm = (number of observations classified in cluster m in state j) / (number of observations in state j)

and all the HMM parameters are then updated.
Fig.3-9 The block diagram of creating the initialized HMM
Fig.3-10 Modified k-means
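Steps (i)~(iv) above can be sketched as follows. The split perturbation (a small fraction of the per-dimension standard deviation) and the plain squared-Euclidean distance are our own choices for illustration; the text does not fix them:

```python
import numpy as np

def modified_kmeans(obs, n_mix, max_iter=20, tol=1e-4):
    """Cluster the observations of one state into n_mix groups by
    repeatedly splitting the worst cluster and re-running k-means:
    a sketch of the modified k-means of Section 3.3.1."""
    centers = obs.mean(axis=0, keepdims=True)      # step (i): one cluster
    while True:
        prev = np.inf
        for _ in range(max_iter):                  # step (ii): classify
            d = ((obs[:, None, :] - centers[None]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            total = d[np.arange(len(obs)), labels].sum()
            for i in range(len(centers)):          # update centroids
                pts = obs[labels == i]
                if len(pts):
                    centers[i] = pts.mean(axis=0)
            if prev - total < tol:                 # condition (c)
                break
            prev = total
        if len(centers) == n_mix:                  # step (iii): enough clusters
            break
        # step (iv): split the cluster with the largest intracluster distance
        intra = np.array([d[labels == i, i].sum() for i in range(len(centers))])
        worst = intra.argmax()
        eps = 1e-3 * (obs.std(axis=0) + 1e-12)
        centers = np.vstack([centers, centers[worst] + eps])
        centers[worst] -= eps
    return centers, labels

# two well-separated 1-D groups should yield centers near -5 and +5
obs = np.vstack([np.full((50, 1), -5.0), np.full((50, 1), 5.0)])
centers, labels = modified_kmeans(obs, 2)
print(sorted(centers.ravel().tolist()))
```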
3.3.2 Viterbi Search

Except for the first estimation of the HMM, the uniform segmentation is replaced by Viterbi alignment, viz. the Viterbi search, which is applied to find the optimal state sequence q = {q1, q2, …, qT} when the model Λ and the observation sequence O = {o1, o2, …, oT} are given. By the Viterbi alignment, each observation is re-aligned to a state so that the new state sequence q maximizes the probability of generating the observation sequence O.
By taking the logarithm of the model parameters, the Viterbi algorithm [14] can be implemented with only N²T additions and without any multiplications. Define δt(i) as the highest probability along a single path at time t, expressed as

δt(i) = max_{q1,q2,…,qt−1} P(q1, q2, …, qt−1, qt = Si, o1, o2, …, ot | Λ)   (3-24)
and by induction we can obtain
δt+1(j) = [ max_{1≤i≤N} δt(i) aij ] · bj(ot+1)   (3-25)
which is shown in Fig.3-11.
Fig.3-11 Maximization of the probability of generating the observation sequence
The Viterbi algorithm is expressed as follows:

(i) Preprocessing:

π̃i = ln(πi),  ãij = ln(aij),  b̃j(ot) = ln(bj(ot)),  1 ≤ i, j ≤ N, 1 ≤ t ≤ T   (3-26)~(3-28)

(ii) Initialization:

δ̃1(i) = π̃i + b̃i(o1),  ψ1(i) = 0,  1 ≤ i ≤ N   (3-29), (3-30)

(iii) Recursion:

δ̃t(j) = max_{1≤i≤N} [ δ̃t−1(i) + ãij ] + b̃j(ot),  ψt(j) = argmax_{1≤i≤N} [ δ̃t−1(i) + ãij ],  2 ≤ t ≤ T, 1 ≤ j ≤ N   (3-31), (3-32)

(iv) Termination:

P̃* = max_{1≤i≤N} [ δ̃T(i) ],  qT* = argmax_{1≤i≤N} [ δ̃T(i) ]   (3-33), (3-34)

(v) Backtracking:

qt* = ψt+1(qt+1*),  t = T−1, T−2, …, 1   (3-35)
From the above, the state sequence q which maximizes P̃* implies an alignment of observations with states. The above procedures (Viterbi alignment, modified k-means, and parameter estimation) are applied until P̃* converges. After obtaining the initialized HMM, the Baum-Welch algorithm and the Viterbi search are then applied to get the first estimation of the HMM. Finally, the Baum-Welch algorithm is performed repeatedly to re-estimate the HMMs simultaneously. The Baum-Welch algorithm will be introduced later.
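The log-domain recursion of the Viterbi search can be sketched as follows (a minimal NumPy sketch; the array layout, function signature, and the toy two-state example are our own):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Log-domain Viterbi search.

    log_pi: (N,) log initial probabilities
    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) log observation probabilities, log_B[t, j] = log b_j(o_t)
    Returns the best state sequence and its log probability."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]                 # initialization
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):                     # recursion: additions only
        scores = delta[:, None] + log_A       # scores[i, j] = delta(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]              # termination
    for t in range(T - 1, 0, -1):             # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

# toy example: observations favor state 0 twice, then state 1
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.6, 0.4], [0.4, 0.6]]))
log_B = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
path, lp = viterbi(log_pi, log_A, log_B)
print(path)  # best alignment: [0, 0, 1]
```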
II. Boundary information is not available
In this case, all the HMMs are initialized to be identical, and the means and variances of all states are set equal to the global mean and variance. As for the initial state distribution π and the state-transition probability distribution A, there is no information from which to compute these parameters; hence, π and A are set arbitrarily. From the above process, the initialized HMMs are generated.
Afterwards, the processes for re-estimating the HMMs resemble those used when boundary information is available, that is, the Baum-Welch algorithm. After re-estimation by the Baum-Welch algorithm, the Viterbi search is also needed to re-align the boundaries of the sub-words. This step is different from the training procedure that already has boundary information. The next section introduces the Baum-Welch algorithm employed in the HMM training process.
3.3.3 Baum-Welch re-estimation

The Baum-Welch algorithm, also known as the forward-backward algorithm, is the core of HMM training. Consider the forward variable αt(i) defined as
αt(i) = P(o1, o2, …, ot, qt = Si | Λ)   (3-36)
which is the probability of being in state Si at time t, having generated the partial observation sequence o1, o2, …, ot, given the model Λ, shown in Fig.3-12. The forward variable is obtained inductively as follows.

Step I. Initialization:
α1(i) = πi bi(o1),  1 ≤ i ≤ N   (3-37)
Step II. Induction:
αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1),  1 ≤ t ≤ T−1, 1 ≤ j ≤ N   (3-38)

In a similar way, the backward variable is defined as
βt(i) = P(ot+1, ot+2, …, oT | qt = Si, Λ)   (3-39)
which represents the probability of the observation sequence from t+1 to the end, given state Si at time t and the model Λ, shown in Fig.3-12. The backward variable is obtained inductively by
Step I. Initialization:
βT(i) = 1,  1 ≤ i ≤ N   (3-40)
Step II. Induction:
βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j),  t = T−1, T−2, …, 1, 1 ≤ i ≤ N   (3-41)

Fig.3-12 Forward variable and backward variable
In addition, three more variables should be defined: ξt(i, j), the posterior probability γt(i), and the mixture posterior γt(j, k). The variable ξt(i, j), the probability of being in state i at time t and in state j at time t+1 given the observations and the model, is

ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O|Λ)   (3-42)

the posterior probability γt(i) is expressed as

γt(i) = αt(i) βt(i) / Σ_{j=1}^{N} αt(j) βt(j)   (3-43)

and γt(j, k) represents the probability of being in state j at time t with the k-th mixture component accounting for ot.
The HMM parameters can be re-estimated by using the variables mentioned above as

π̄i = γ1(i)
āij = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i)
w̄jk = Σ_{t=1}^{T} γt(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γt(j, k)
µ̄jk = Σ_{t=1}^{T} γt(j, k) ot / Σ_{t=1}^{T} γt(j, k)
Σ̄jk = Σ_{t=1}^{T} γt(j, k) (ot − µ̄jk)(ot − µ̄jk)ᵀ / Σ_{t=1}^{T} γt(j, k)

where Σ̄jk is the covariance matrix of the observations at state j and mixture k.
From the statistical viewpoint of estimating the HMM by the Expectation-Maximization (EM) algorithm, the equations for estimating the parameters are the same as the equations derived from the Baum-Welch algorithm.
Besides, it has been shown that the likelihood function converges to a critical point after iterations; because of the complexity of the likelihood function, the Baum-Welch algorithm leads only to a local maximum.
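The forward and backward recursions (3-36)~(3-41) can be sketched together as follows. The per-frame scaling factors c_t are a standard implementation device against numerical underflow that the text does not discuss; with this scaling, the posteriors γt(i) of (3-43) come out directly as the product of the scaled variables:

```python
import numpy as np

def forward_backward(pi, A, B):
    """Scaled forward-backward pass.

    pi: (N,) initial probabilities; A: (N, N) transitions;
    B:  (T, N) observation likelihoods, B[t, j] = b_j(o_t).
    Returns the state posteriors gamma[t, i] and log P(O|Lambda)."""
    T, N = B.shape
    alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
    alpha[0] = pi * B[0]                         # initialization (3-37)
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                        # forward induction (3-38)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[T - 1] = 1.0                            # initialization (3-40)
    for t in range(T - 2, -1, -1):               # backward induction (3-41)
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                         # posteriors; rows sum to 1
    return gamma, np.log(c).sum()                # log P(O|Lambda) from scaling

# trivial 1-state check: P(O|Lambda) = 0.5 * 0.5 = 0.25
g, lp = forward_backward(np.array([1.0]), np.array([[1.0]]),
                         np.array([[0.5], [0.5]]))
print(np.exp(lp))  # 0.25
```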
3.4 Recognition Procedure
Given the HMMs and the observation sequence O ={o1, o2,…, oT }, the recognition stage is to compute the probability P(O|Λ) by using an efficient method,