Research and Development of Control, Sensing, and Information Processing Technologies for Intelligent Vehicles (II) --- Subproject II: Development of In-Car Speech and Audio Information Processing Technology and of Multimedia Audio/Video Control Technology for Virtual Reality


National Science Council, Executive Yuan — Research Project Final Report

Research and Development of Control, Sensing, and Information Processing Technologies for Intelligent Vehicles (1/3)

Subproject II: In-Car Speech and Audio Information Processing for Smart Car System

Project type: □ Individual project ■ Integrated project

Project number: NSC90-2213-E-009-096

Project period: August 1, 2001 to July 31, 2002

Principal investigator: Prof. 林進燈

Co-principal investigators:

This report includes the following required attachments:

□ Report on overseas travel or study
□ Report on travel or study in mainland China
■ Report on attendance at an international academic conference, with a copy of the presented paper
□ Foreign research report for an international cooperative research project

Host institution: Institute of Electrical and Control Engineering, National Chiao Tung University

Date: July 31, 2002 (ROC 91/07/31)


National Science Council, Executive Yuan — Research Project Final Report

Subproject II: In-Car Speech and Audio Information Processing for Smart Car System

Project number: NSC90-2213-E-009-096

Project period: August 1, 2001 to July 31, 2002

Principal investigator: Prof. 林進燈, Institute of Electrical and Control Engineering, National Chiao Tung University

Abstract

A new speech recognition technique for the environment inside intelligent vehicles is proposed for continuous, speaker-independent recognition of spoken Mandarin digits. A popular tool for this problem is the HMM-based one-stage algorithm, but two problems prevent the conventional method from practical use on our target task. One is the lack of a proper mechanism for selecting acoustic models that are robust for speaker-independent recognition. The other is whether the acoustic model captures intersyllable co-articulatory effects. In this report, we adopt the principal component analysis (PCA) technique to address both problems. First, a generalized common-vector (GCV) approach is developed, based on eigenanalysis of the covariance matrix, to extract a feature that is invariant over different speakers as well as over acoustical environment effects and phase or temporal differences. The GCV scheme is then integrated into the conventional HMM to form a new GCV-based HMM, called GCVHMM, which is well suited to speaker-independent recognition. For the second problem, context-dependent modeling is performed to account for the co-articulatory effects of neighboring phones. This is important because the co-articulatory effect in continuous speech is significantly stronger than in isolated utterances. However, modeling the variations of sounds and pronunciations generates numerous context-dependent models, and if the parameters of those models are all distinct, the total number of model parameters becomes very large. To solve this, the decision-tree state-tying technique is used to reduce the number of parameters and hence the computational complexity.

1. Introduction

Automatic speech recognition (ASR) is useful as a form of input, especially when the user's hands or eyes are busy. It also allows people with handicaps such as blindness or palsy to use computers. In the environment inside an intelligent vehicle in particular, automatic speech recognition provides a helpful and friendly man-machine interface for the driver. Because of these potential applications, we develop a speaker-independent automatic speech recognition system for Mandarin digits.

In recent years, most automatic speech recognition technologies have been based on hidden Markov models (HMMs) and have used connected-word pattern matching to achieve continuous speech recognition. Many methods exist for solving the connected-word pattern-matching problem; one well-known method is the one-stage algorithm. Continuous speech recognition based on the one-stage algorithm faces two problems: how to build a reference model that characterizes the acoustic features of the speech signal, and whether the acoustic model captures intersyllable co-articulatory effects.

Because of the first problem, the one-stage algorithm is sensitive to the reference patterns, so the choice of reference patterns is important. A well-known and widely used statistical method for characterizing the spectral properties of the frames of a speech pattern is the HMM approach: the better the HMM models the acoustic signal, the better the one-stage algorithm performs. One of the most important issues in a speaker-independent (SI) speech recognition system is the estimation of speech models that are robust over different speakers. The statistical speech model for each phone unit should be estimated to cover the spectral variations in the speech signal caused by inter-speaker differences. In this report, we propose a new HMM framework, the generalized common-vector-based HMM (GCVHMM), as a reference model for speaker-independent automatic speech recognition.

For the second problem, context-dependent modeling is performed to account for the co-articulatory effects of neighboring phones. This is important because the co-articulatory effect in continuous speech is significantly stronger than in isolated utterances. However, modeling the variations of sounds and pronunciations generates numerous context-dependent models, and if the parameters of those models are all distinct, the total number of model parameters becomes very large. To solve this, the decision-tree state-tying technique is used to reduce the number of parameters and hence the computational complexity.

2. HMM

The HMM is a stochastic model that uses probabilistic functions of Markov chains to model random processes. The effectiveness of this model class lies in its ability to deal with the non-stationarity that often appears in observed data sequences, so HMMs are usually a good model for non-stationary processes such as sequences of speech observation vectors.

2.1 Elements of an HMM

An HMM can be characterized by the set of parameters A, δ, and B. The parameters that represent an HMM are listed below.

1. N, the number of states in the model. The states of an HMM are hidden, although they often have some physical significance attached.

2. M, the number of mixtures per state in the output probability distribution, a continuous probability density function (pdf) modeled as a Gaussian mixture.

3. The state transition probability distribution A = [a_{i,j}], where

$$a_{i,j} = P(\theta_t = j \mid \theta_{t-1} = i), \quad 1 \le i, j \le N.$$

4. The observation probability distribution in state j, B = {b_j(o_t)}, where

$$b_j(o_t) = P(o_t \mid \theta_t = j), \quad 1 \le j \le N.$$

5. The initial state distribution δ = {δ_i}, where

$$\delta_i = P(\theta_0 = i), \quad 1 \le i \le N.$$

It can be seen from the above discussion that a complete specification of an HMM requires two model parameters (N and M) and three probability measures: A, B, and the initial state distribution δ. For convenience, we denote the complete parameter set of the model by

$$\Omega = (A, \delta, B).$$

2.2 Three Basic Issues for HMMs

For an HMM to be used in real-world applications, three basic problems must be solved:

Issue 1: Given the observation sequence O and a model Ω, how do we efficiently compute P(O|Ω), the probability of the observation sequence given the model?

Issue 2: Given the observation sequence O and the model Ω, how do we choose a corresponding state sequence $\hat{\Theta}$ that is optimal in some meaningful sense?

Issue 3: How do we adjust the model parameters Ω = (A, δ, B) to maximize P(O|Ω)?
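Issue 1 is classically solved by the forward algorithm, which computes P(O|Ω) by dynamic programming instead of summing over the exponentially many state sequences. The report does not give an implementation; the following is a minimal NumPy sketch under our own assumptions (the function name, array layout, and the convention from Section 3.2 that state θ_0 emits no observation are ours).

```python
import numpy as np

def forward_likelihood(delta, A, B):
    """Forward algorithm: compute P(O | Omega) for an N-state HMM.

    delta : (N,)   initial state distribution, delta[i] = P(theta_0 = i)
    A     : (N, N) transition matrix, A[i, j] = a_{i,j}
    B     : (T, N) precomputed emission likelihoods, B[t, j] = b_j(o_{t+1})
    """
    alpha = delta.copy()            # alpha_0(i) = delta_i; theta_0 emits nothing here
    for t in range(B.shape[0]):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_{i,j}] * b_j(o_t)
        alpha = (alpha @ A) * B[t]
    return alpha.sum()              # P(O | Omega) = sum_j alpha_T(j)
```

In practice the recursion is carried out in the log domain or with per-frame scaling to avoid numerical underflow on long utterances.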

3. GCVHMM

The statistical speech models for each phone unit of a speaker-independent (SI) recognition system should be estimated to cover the spectral variations in speech signals caused by inter-speaker differences. Gülmezoğlu et al. proposed a common vector approach (CVA) for SI isolated-word recognition. In CVA, a common vector representing the common properties of one specific spoken word is obtained by estimating a common subspace. However, CVA requires the impractical assumption that the training data form a set of linearly independent vectors.

In this section, we generalize the CVA to relax this constraint and propose a new extension of the HMM called the generalized common-vector-based HMM (GCVHMM). The GCVHMM involves two phases: extraction of robust features and estimation of the HMM. In the first phase, a generalized CVA is developed, based on eigenanalysis of the covariance matrix, to extract an invariant feature called the generalized common vector (GCV). To relax the linear-independence assumption of the original CVA, we divide the eigenvalues of the covariance matrix into two sets such that all eigenvalues of the first set are greater than those of the second set. The common vector is obtained by projecting feature vectors onto the subspace spanned by the eigenvectors whose eigenvalues lie in the second set, as sketched below. In the second phase, the GCVs are used to estimate the continuous observation densities of the HMM, forming the so-called GCVHMM. In the GCVHMM, in addition to the original elements of a traditional HMM, a new element, the GCV transformation matrix, is added to extract the GCV from speech feature vectors. Finally, a re-estimation algorithm based on the Baum-Welch method is derived to estimate all parameters of the GCVHMM.
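To illustrate the first phase concretely, the sketch below computes a GCV transformation by eigendecomposition of the pooled covariance matrix and projects features onto the small-eigenvalue subspace. It is a minimal illustration under our own assumptions; the function name and the NumPy interface are not from the report.

```python
import numpy as np

def gcv_transform(features, d_s):
    """Extract a generalized common vector (GCV) transformation.

    features : (n, D) feature vectors pooled over speakers
    d_s      : retained GCV dimension, D_s = D - D_g

    The eigenvalues of the covariance matrix are split into a large set
    (speaker / environment variation) and a small set; projecting onto
    the eigenvectors of the small set yields the invariant common vector.
    """
    cov = np.cov(features, rowvar=False)      # (D, D) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    V = eigvecs[:, :d_s].T                    # (D_s, D) GCV transformation matrix
    gcvs = features @ V.T                     # y = V o for every feature vector o
    return V, gcvs
```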


In this report, an N-state, left-to-right continuous observation density HMM, denoted Ω, is considered. The initial probability of state i is denoted by δ_i = P(θ_0 = i), 1 ≤ i ≤ N, and the transition probability from state i to state j by a_{i,j} = P(θ_t = j | θ_{t-1} = i), 1 ≤ i, j ≤ N. Denote $\delta = \{\delta_i\}_{i=1}^N$ and $A = \{a_{i,j}\}_{i,j=1}^N$. To calculate the observation density b_i(o_t) in state i for observation o_t, the generalized common vector of o_t is first extracted using the GCV transformation matrix. Then b_i(o_t) = P(o_t | θ_t = i), 1 ≤ i ≤ N, assumed to be a mixture of Gaussians, is given by

$$b_i(o_t) = \sum_{k=1}^{M} c_{i,k}\, b_{i,k}(o_t), \quad 1 \le i \le N,$$

where M is the number of mixtures, c_{i,k} is the probability of mixture k in state i, and b_{i,k}(o_t) is the Gaussian distribution

$$b_{i,k}(o_t) = \frac{1}{(2\pi)^{D_s/2}\,|\Lambda_{i,k}|^{1/2}} \exp\!\Big(-\tfrac{1}{2}\,(y_{t,i,k}-\eta_{i,k})^T \Lambda_{i,k}^{-1} (y_{t,i,k}-\eta_{i,k})\Big),$$

where D_s = D − D_g is the dimension of the GCV y_{t,i,k} extracted from o_t, y_{t,i,k} is the GCV of o_t for mixture k in state i, and Λ_{i,k} and η_{i,k} are the covariance matrix and mean vector of mixture k in state i, respectively. Λ_{i,k} is assumed to be diagonal, i.e.,

$$\Lambda_{i,k} = \begin{bmatrix} \sigma_{i,k,1} & 0 & \cdots & 0 \\ 0 & \sigma_{i,k,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{i,k,D_s} \end{bmatrix}, \qquad |\Lambda_{i,k}^{-1}| = \prod_{l=1}^{D_s} \sigma_{i,k,l}^{-1}.$$

The GCV y_{t,i,k} of o_t for mixture k in state i is defined as y_{t,i,k} = V_{i,k} o_t, where $V_{i,k} = [v_{i,k,1}, v_{i,k,2}, \ldots, v_{i,k,D_s}]^T$ is the GCV transformation matrix for mixture k in state i. For convenience in the following derivation, we also define

$$\eta_{i,k} = V_{i,k}\,\mu_{i,k},$$

so that we can write

$$z_{t,i,k} = y_{t,i,k} - \eta_{i,k} = V_{i,k}\,(o_t - \mu_{i,k}).$$

Denote $B = \{b_i\}_{i=1}^N$ and Ω = {δ, A, B}.
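As a concrete reading of the density above, the following sketch evaluates log b_{i,k}(o_t) for a single mixture component: it forms z = V_{i,k}(o_t − μ_{i,k}) and applies the diagonal Gaussian in the D_s-dimensional GCV space. The function name and interface are illustrative assumptions only.

```python
import numpy as np

def gcv_log_density(o_t, V, mu, var):
    """log b_{i,k}(o_t) for one GCVHMM mixture component.

    o_t : (D,)     observation vector
    V   : (D_s, D) GCV transformation matrix V_{i,k}
    mu  : (D,)     mean vector mu_{i,k} in the original feature space
    var : (D_s,)   diagonal entries sigma_{i,k,l} of Lambda_{i,k}
    """
    z = V @ (o_t - mu)                        # z = y - eta = V (o_t - mu)
    d_s = V.shape[0]
    return -0.5 * (d_s * np.log(2.0 * np.pi)  # normalization constant
                   + np.sum(np.log(var))      # log |Lambda_{i,k}|
                   + np.sum(z * z / var))     # Mahalanobis term z^T Lambda^{-1} z
```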

3.2 Re-estimation algorithm for the parameters of GCVHMM

For an observation sequence O = (o_1, o_2, ..., o_T), unobserved state sequence Θ = (θ_0, θ_1, θ_2, ..., θ_T), and unobserved mixture component sequence K = (k_1, k_2, ..., k_T), the joint probability density P(O, Θ, K | Ω) is defined as

$$P(O, \Theta, K \mid \Omega) = \delta_{\theta_0} \prod_{t=1}^{T} a_{\theta_{t-1},\theta_t}\, c_{\theta_t,k_t}\, b_{\theta_t,k_t}(o_t),$$

where T is the number of observations in O. It follows that the likelihood of O given Ω has the form

$$P(O \mid \Omega) = \sum_{\Theta} \sum_{K} P(O, \Theta, K \mid \Omega),$$

where the summations are over all possible state sequences and mixture component sequences.

Given an observation sequence O, the objective is to maximize P(O|Ω) over all parameters in Ω. It is, however, difficult to solve this problem by directly maximizing P(O|Ω) over Ω. In the following, we use the EM algorithm to estimate the parameters of the HMM. The EM algorithm is a two-step iterative procedure. In the first step, called the expectation step (E step), we compute the auxiliary function

$$Q(\Omega, \Omega') = \sum_{\Theta} \sum_{K} P(O, \Theta, K \mid \Omega)\, \log P(O, \Theta, K \mid \Omega').$$

In the second step, called the maximization step (M step), we find the value of Ω' that maximizes Q(Ω, Ω'), i.e.,

$$\Omega' = \arg\max\, Q(\Omega, \Omega').$$

It has been shown that if Q(Ω, Ω') ≥ Q(Ω, Ω), then P(O|Ω') ≥ P(O|Ω). Therefore, iteratively applying the E and M steps guarantees a monotonic increase in the likelihood. The iterations continue until the increase in the likelihood falls below some predetermined threshold.

From the decomposition

$$\log P(O, \Theta, K \mid \Omega') = \log \delta'_{\theta_0} + \sum_{t=1}^{T} \log a'_{\theta_{t-1},\theta_t} + \sum_{t=1}^{T} \log c'_{\theta_t,k_t} + \sum_{t=1}^{T} \log b'_{\theta_t,k_t}(o_t),$$

it is straightforward to show that Q(Ω, Ω') can be decomposed into a sum of four auxiliary functions:

$$Q(\Omega, \Omega') = Q_\delta(\Omega, \delta') + \sum_{i=1}^{N} Q_{a_i}\big(\Omega, \{a'_{i,j}\}_{j=1}^N\big) + \sum_{j=1}^{N} Q_{c_j}\big(\Omega, \{c'_{j,k}\}_{k=1}^M\big) + \sum_{j=1}^{N} \sum_{k=1}^{M} Q_b(\Omega, b'_{j,k}),$$

where

$$Q_\delta(\Omega, \delta') = \sum_{i=1}^{N} P(O, \theta_0 = i \mid \Omega)\, \log \delta'_i,$$

$$Q_{a_i}\big(\Omega, \{a'_{i,j}\}_{j=1}^N\big) = \sum_{j=1}^{N} \sum_{t=1}^{T} P(O, \theta_{t-1} = i, \theta_t = j \mid \Omega)\, \log a'_{i,j},$$

$$Q_{c_j}\big(\Omega, \{c'_{j,k}\}_{k=1}^M\big) = \sum_{k=1}^{M} \sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)\, \log c'_{j,k},$$

$$Q_b(\Omega, b'_{j,k}) = \sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)\, \log b'_{j,k}(o_t).$$

This implies that the four sets of parameters can be maximized independently. The maximization results for the first three auxiliary functions are

$$\delta_i = \frac{P(O, \theta_0 = i \mid \Omega)}{P(O \mid \Omega)},$$

$$a_{i,j} = \frac{\sum_{t=1}^{T} P(O, \theta_{t-1} = i, \theta_t = j \mid \Omega)}{\sum_{t=1}^{T} P(O, \theta_{t-1} = i \mid \Omega)},$$

$$c_{j,k} = \frac{\sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)}{\sum_{t=1}^{T} P(O, \theta_t = j \mid \Omega)}.$$
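In implementation terms, these three maximizers reduce to ratios of accumulated posteriors. The sketch below assumes the posteriors have already been computed (e.g., by the forward-backward procedure); the array names and layout are our own assumptions.

```python
import numpy as np

def reestimate_discrete_params(xi, gamma, gamma_mix):
    """Closed-form maximizers of the first three auxiliary functions.

    xi        : (T, N, N)  xi[t-1, i, j]        ~ P(O, theta_{t-1}=i, theta_t=j | Omega)
    gamma     : (T+1, N)   gamma[t, i]          ~ P(O, theta_t=i | Omega), t = 0..T
    gamma_mix : (T, N, M)  gamma_mix[t-1, j, k] ~ P(O, theta_t=j, k_t=k | Omega)
    """
    delta = gamma[0] / gamma[0].sum()                           # delta_i
    A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]           # a_{i,j}
    c = gamma_mix.sum(axis=0) / gamma[1:].sum(axis=0)[:, None]  # c_{j,k}
    return delta, A, c
```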

Substituting the decomposition

$$\log b'_{\theta_t,k_t}(o_t) = -\frac{D_s}{2}\log(2\pi) - \frac{1}{2}\log|\Lambda'_{\theta_t,k_t}| - \frac{1}{2}\, z'^{\,T}_{t,\theta_t,k_t}\, \Lambda'^{-1}_{\theta_t,k_t}\, z'_{t,\theta_t,k_t},$$

where $z'_{t,\theta_t,k_t} = y'_{t,\theta_t,k_t} - \eta'_{\theta_t,k_t} = V'_{\theta_t,k_t}(o_t - \mu'_{\theta_t,k_t})$, for $\log b'_{\theta_t,k_t}(o_t)$, and differentiating with respect to $\mu'_{j,k}$ and $\sigma'^{-1}_{j,k,l}$, we obtain

$$\mu_{j,k} = \frac{\sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)\, o_t}{\sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)},$$

$$\sigma_{j,k,l} = \frac{\sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)\, (z'_{t,j,k,l})^2}{\sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)},$$

where $z'_{t,j,k,l}$ is the l-th element of $z'_{t,j,k}$. To obtain the solution for $v'_{j,k,l}$, the l-th row of $V'_{j,k}$, the constraints

$$\|v'_{j,k,l}\| = 1, \quad 1 \le l \le D_s,$$

are enforced by adding the Lagrange terms $\sum_{l=1}^{D_s} \frac{\rho_{j,k,l}}{2}\,(v'^{\,T}_{j,k,l} v'_{j,k,l} - 1)$ to $Q_b(\Omega, b'_{j,k})$. Setting

$$\frac{\partial Q_b(\Omega, b'_{j,k})}{\partial v'_{j,k,l}} = 0,$$

we obtain the eigenvalue problem $R_{j,k}\, v_{j,k,l} = \varepsilon_{j,k,l}\, v_{j,k,l}$, where

$$R_{j,k} = \sum_{t=1}^{T} P(O, \theta_t = j, k_t = k \mid \Omega)\, (o_t - \mu_{j,k})(o_t - \mu_{j,k})^T.$$

R_{j,k} characterizes the variations of mixture k in state j, so it plays the same role as the covariance matrix in the GCV extraction described earlier in this section. Thus, the eigenvectors of R_{j,k} corresponding to the D_s smallest eigenvalues are selected to constitute the GCV transformation matrix for mixture k in state j.
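Putting the Gaussian updates and the eigenvalue problem together, one M-step update for a single mixture (j, k) can be sketched as follows; the posterior weights are assumed given, and the names and interface are illustrative assumptions.

```python
import numpy as np

def reestimate_mixture(obs, gamma, d_s):
    """M-step update of mu_{j,k}, sigma_{j,k,l}, and V_{j,k} for one mixture.

    obs   : (T, D) observation vectors o_t
    gamma : (T,)   gamma[t] ~ P(O, theta_t=j, k_t=k | Omega)
    d_s   : retained GCV dimension D_s
    """
    w = gamma / gamma.sum()
    mu = w @ obs                              # posterior-weighted mean mu_{j,k}
    diff = obs - mu
    # R_{j,k} = sum_t gamma_t (o_t - mu)(o_t - mu)^T
    R = (gamma[:, None] * diff).T @ diff
    eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
    V = eigvecs[:, :d_s].T                    # D_s eigenvectors with smallest eigenvalues
    z = diff @ V.T                            # z_t = V (o_t - mu)
    var = w @ (z * z)                         # sigma_{j,k,l}: weighted second moments
    return mu, var, V
```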

4. A Hybrid Decision Tree

To overcome the limitations of conventional acoustic models, we introduced the generalized common vector approach into the conventional HMM in Section 3; the resulting model is better at speaker-independent recognition because it can extract invariant features that are common across speakers. Beyond the acoustic parameters themselves, most of the remaining variation in practice is due to consistent contextual effects, so we now focus on context-based information. Since the co-articulatory effect in continuous speech is significantly stronger than in isolated utterances, it is important to model context-dependent "subword" units. Here, a "subword" unit is a Mandarin digit, each of which corresponds to exactly one syllable.

The most important reason for using decision-tree state tying is that the total number of parameters across all context-dependent models is prohibitively large, and the computational cost of training them all would be intolerable. One way to reduce the total number of model parameters is to reduce the number of parameters in each model. Parameter tying, i.e., using continuous HMMs with tied parameters, reduces the parameter count while maintaining model accuracy and is widely used in ASR systems.

After incorporating the common vector features of Section 3 into the decision-tree state-tying structure, the decision tree algorithm is modified as follows (a sketch of the question-selection step follows the figure below):

1. Manually design a (small) set of questions about the left-context digit syllable.

2. For each center Mandarin digit syllable p:

- Estimate all left-context digit syllable GCVHMMs.

- For each Markov state k in the model topology, cluster the k-th output distributions of all left-context digit syllables using a binary tree:

a) Put all the training data for the k-th state of all left-context digit syllables into the root node.

b) Classify the training data by each question in the question set. Use the clustered data to generate the Gaussian distribution of common features by the GCVHMM method introduced in Section 3. Then compute the likelihood of the parent node from Equation (4.15):

$$L = -\frac{1}{2} \sum_{i} \sum_{t \in \Upsilon} \Big( D_S\,(1 + \ln 2\pi) + \ln |\Lambda_{i,k}| \Big),$$

where D_S and Λ_{i,k} represent the dimension and the covariance matrix of the common vector obtained by the GCVHMM method.

c) Split the node by each question in the question set. Training data from the left-context digit syllables that answer yes to the question go to the yes-child node; those that answer no go to the no-child node. Calculate the likelihood of each of the two child nodes, and then compute the likelihood increase produced by each question in the question set.

d) Find the best question in the question set, i.e., the one giving the largest likelihood increase over the newly created children.

e) Repeat from step b) until some stop-growing criterion is met.

[Figure: Decision tree for the third state of the models "* - 0" (digit 0 with each left context, e.g., "4 - 0", "5 - 0", "9 - 0", "silence - 0"). Questions 1-3 split the left-context digit syllables into yes/no child nodes; each of the five states of the GCVHMM of digit syllable "0" has its own tree.]
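The node-splitting loop of steps b)-d) amounts to evaluating, for every question, the likelihood gain of splitting the pooled frames into yes/no children. Below is a minimal sketch using a single-Gaussian, diagonal-covariance node likelihood in the spirit of Equation (4.15); the function names, interfaces, and the question representation are our own assumptions.

```python
import numpy as np

def node_log_likelihood(frames):
    """Approximate log likelihood of a node: a single diagonal-covariance
    Gaussian fitted to the pooled frames (cf. Equation (4.15))."""
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8   # diagonal covariance, guarded against zeros
    return -0.5 * n * (d * (1 + np.log(2 * np.pi)) + np.sum(np.log(var)))

def best_question(node_frames, questions):
    """Pick the context question giving the largest likelihood increase.

    node_frames : (n, D) training frames pooled in the parent node
    questions   : list of boolean masks; mask[i] is True when frame i's
                  left-context syllable answers "yes" to that question
    """
    parent_ll = node_log_likelihood(node_frames)
    best_q, best_gain = None, 0.0     # only positive gains are accepted
    for q, mask in enumerate(questions):
        yes, no = node_frames[mask], node_frames[~mask]
        if len(yes) == 0 or len(no) == 0:
            continue                  # question does not actually split the node
        gain = node_log_likelihood(yes) + node_log_likelihood(no) - parent_ll
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain
```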

5. Mandarin Digit Recognition Experiments

The speech data used in our experiments are continuous Mandarin digit strings. The speech database contains 20 speakers (10 male and 10 female), each uttering every Mandarin digit 10 times. Recordings were sampled at 8 kHz and stored as 16-bit integers.

5.1 Balanced Corpora

Recognition rate (%) per digit model:

Digit   | Decision Tree State Tying Based on GCVHMMs | GCVHMMs | HMM
0       | 91.667  | 83.333 | 58.333
1       | 61.111  | 27.778 | 0.000
2       | 61.538  | 53.846 | 76.923
3       | 100.000 | 89.474 | 100.000
4       | 83.333  | 50.000 | 50.000
5       | 100.000 | 94.118 | 100.000
6       | 60.000  | 30.000 | 0.000
7       | 100.000 | 46.154 | 7.692
8       | 86.667  | 40.000 | 60.000
9       | 100.000 | 69.231 | 23.077
Average | 84.432  | 58.393 | 47.603

5.2 Unbalanced Corpora

Recognition rate (%) per digit model:

Digit   | Decision Tree State Tying Based on GCVHMMs | GCVHMMs | HMM
0       | 75.610  | 65.854  | 53.659
1       | 66.522  | 76.087  | 28.261
2       | 72.727  | 54.545  | 54.545
3       | 84.091  | 56.818  | 90.909
4       | 93.182  | 100.000 | 88.636
5       | 81.818  | 97.727  | 95.455
6       | 75.556  | 51.111  | 80.000
7       | 95.455  | 100.000 | 95.455
8       | 93.182  | 77.273  | 90.909
9       | 90.909  | 95.455  | 95.455
Average | 82.9052 | 77.487  | 76.419

5.3 Balanced Tree

Recognition rate (%) per digit model:

Digit   | Balanced Decision Tree State Tying Based on GCVHMMs | Unbalanced Decision Tree State Tying Based on GCVHMMs
0       | 75.610  | 51.220
1       | 66.522  | 34.783
2       | 72.727  | 63.636
3       | 84.091  | 68.182
4       | 93.182  | 86.364
5       | 81.818  | 77.273
6       | 75.556  | 64.444
7       | 95.455  | 88.636
8       | 93.182  | 77.273
9       | 90.909  | 90.909
Average | 82.9052 | 70.272

6. Conclusion

To account for the contextual effects of continuous speech, which play an important role in Mandarin, we combined decision-tree state tying with the GCVHMM. A balanced corpus is one in which the numbers of female and male speakers in the database are equal. On the balanced corpora, replacing the GCVHMM with decision-tree state tying based on the GCVHMM yields a 26.039% improvement. With an unbalanced database, however, decision-tree state tying based on the GCVHMM yields only a 5.4% improvement. To avoid leaving the major part of the models behind in an unbalanced tree, we modify the tree into a balanced tree; the results then show a 12.6332% improvement from balanced decision-tree state tying based on the GCVHMM. This technique serves as a helpful and friendly man-machine interface in the environment inside intelligent vehicles.

References
