Fast and accurate recognition of very-large-vocabulary continuous Mandarin speech for Chinese language with improved segmental probability modeling


Jia-lin Shen¹ and Lin-shan Lee¹,²

1. Dept. of Electrical Engineering, National Taiwan University 2. Information Science, Academia Sinica

Taipei, Taiwan, R.O.C. [email protected]

Abstract

This paper presents fast and accurate recognition of very-large-vocabulary continuous Mandarin speech using an improved Segmental Probability Model (SPM) approach. Several special techniques are developed to make extensive use of acoustic and linguistic knowledge and thereby further improve the recognition performance. Preliminary simulation results show that the final achievable rate for base syllable recognition with the improved Segmental Probability Modeling is as high as 91.62%, which corresponds to an 18.48% error rate reduction at more than 3 times the speed of the well-studied sub-syllable-based CHMM. In addition, a tone recognizer and a word-based Chinese language model are included, and the achieved recognition accuracy for the finally decoded Chinese characters is 92.10%.

1 Introduction

The Chinese language is not alphabetic, and input of Chinese characters into computers is still difficult. Although there is an almost unlimited number of words in the Chinese language, a nice characteristic of the language is that each word is composed of one to several characters, all of which are monosyllabic, and the total number of phonologically allowed Mandarin syllables is only 1345. Also, Mandarin Chinese is a tonal language: there are 4 lexical tones and 1 neutral tone. These 1345 Mandarin tonal syllables reduce to 408 base syllables when the tones are disregarded. Since the tones can be recognized separately, primarily using pitch contour information, fast and accurate recognition of the 408 Mandarin base syllables becomes the key problem for Mandarin speech recognition with very large vocabulary. Hidden Markov modeling (HMM) of sub-syllabic units has been found very useful in this problem [1], but here a different approach called Segmental Probability Modeling (SPM), which appropriately utilizes the monosyllabic nature of Mandarin speech, is investigated in detail and improved performance is obtained.

SPM was first proposed for the recognition of isolated Mandarin base syllables [2]. This model is very similar to the continuous hidden Markov model (CHMM) except that the state transition probabilities are deleted and the states equally segment the syllable. In order to extend the applications of SPM to continuous speech recognition, the concatenated syllable matching (CSM) algorithm was previously developed [3], which has the following form:

where $T[u]$ is the accumulated score at a point $u$, and $S_i(u, v)$ is the score obtained when the SPM for syllable $i$ is matched with the utterance section $(u, v)$.
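The recursion itself is not reproduced in the text above. As a minimal sketch, assuming it is the dynamic-programming form that eq. (7) reduces to when the syllable-filter term is dropped (the best accumulated score at an ending point is the maximum, over start points and syllables, of the accumulated score at the start point plus the syllable section score), CSM can be written as a best-path search. The function name `csm_decode`, the toy scoring function and the candidate boundary list are hypothetical; in a real system the score would come from the SPM likelihoods.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def csm_decode(
    n_frames: int,
    boundaries: Sequence[int],
    syllables: Sequence[str],
    score: Callable[[str, int, int], float],
) -> Tuple[float, List[Tuple[str, int, int]]]:
    """Best concatenation of syllable models over candidate boundaries.

    boundaries: frame indices where a syllable may start or end
                (assumed to contain 0 and n_frames).
    score(i, u, v): hypothetical stand-in for S_i(u, v), a log score of
                syllable model i on the section from u to v.
    """
    points = sorted(set(boundaries))
    best: Dict[int, float] = {0: 0.0}          # T[u], accumulated score at u
    back: Dict[int, Tuple[int, str]] = {}      # ending point -> (start, syllable)
    for v in points[1:]:
        for u in points:
            if u >= v or u not in best:
                continue
            for i in syllables:
                cand = best[u] + score(i, u, v)
                if v not in best or cand > best[v]:
                    best[v] = cand
                    back[v] = (u, i)
    # Trace the best path back from the end of the utterance.
    path: List[Tuple[str, int, int]] = []
    v = n_frames
    while v != 0:
        u, i = back[v]
        path.append((i, u, v))
        v = u
    path.reverse()
    return best[n_frames], path

# Toy scores: the "true" segmentation scores 0, anything else is penalised.
TRUTH = {("ba", 0, 10), ("ma", 10, 25)}

def toy_score(i: str, u: int, v: int) -> float:
    return 0.0 if (i, u, v) in TRUTH else -float(v - u)

total, path = csm_decode(25, [0, 5, 10, 18, 25], ["ba", "ma"], toy_score)
print(total, path)
```

Restricting the start and end points to a short list of candidate boundaries is what keeps the search cheap; letting every frame be a boundary corresponds to the full search mode discussed in the experiments.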

In the present research, several special techniques are developed to further improve SPM-based continuous Mandarin speech recognition with very large vocabulary. First, a modified SPM (MSPM) is proposed for better modeling of the intra-syllabic and inter-syllabic acoustics and coarticulation. Secondly, the dips in the fundamental quefrency (or first cepstrum coefficient) contour are found very useful in the detection of syllable boundaries in the CSM algorithm. Thirdly, non-uniform alignment (NUA) and segmental weighting (SW) processes are developed to further improve the recognition accuracy. Finally, a syllable filter based on linguistic knowledge is applied to eliminate some impossible syllable candidates by considering the linguistically admissible transitions between syllables. Preliminary experimental results show that these techniques improve the recognition accuracy step by step and that the final achievable recognition rate can be as high as 91.62%, which corresponds to an 18.48% error rate reduction at more than 3 times the speed of the well-studied sub-syllable-based CHMM. Moreover, by integrating tone recognition and linguistic processing, a recognition accuracy of 92.10% is achieved for the finally decoded Chinese characters.

This paper is organized as follows. In Section 2, the proposed MSPM is discussed. The use of dips in the fundamental quefrency contour for the CSM algorithm is then evaluated in Section 3. The NUA and SW processes for improving the MSPM's are discussed in Section 4. In Section 5, the syllable filter integrating linguistic knowledge is presented. The tone recognition and linguistic processing are described in Section 6. The experimental results are presented and analyzed in Section 7. Section 8 finally gives the concluding remarks.


2 Modified SPM (MSPM)

Conventionally each Mandarin syllable is decomposed into an INITIAL/FINAL format. Here INITIAL means the initial consonant of a syllable and FINAL means the vowel (or diphthong) part but in[...] 22 context independent (CI) INITIAL's and 41 context independent (CI) FINAL's. The 22 CI INITIAL's can be further expanded to 113 context dependent (CD) [...] following FINAL's. It has been found that these 113 CD INITIAL's and 41 CI FINAL's give very good results for Mandarin speech recognition [1]. A segment sharing concept for SPM has also been proposed before [4] and found very useful in the present problem: the first few segments of the SPM's for syllables having the same CD INITIAL share similar characteristics and can thus share the same model segments, and the same holds for the remaining segments of the models for syllables having the same CI FINAL. Furthermore, in order to include a "transition segment" modeling the transition from the FINAL of a syllable to the INITIAL of the next syllable, a total of 41×(22+1)=943 transition segments would be needed if all syllable transitions were considered. Instead, all the possible ending phonemes of FINAL's are here classified into 12 categories as shown in Table 1(a), while the INITIAL's are classified into 7 or 11 classes as shown in Table 1(b) and (c). In this way, the number of transition segments is reduced to 12×(7+1)=96 or 12×(11+1)=144 respectively. These segment-shared SPM's plus the transition segments constitute the modified SPM (MSPM) proposed in this paper for better modeling of the intra-syllabic and inter-syllabic acoustics and coarticulation, as shown in Fig. 1. The first few segments model the 113 CD INITIAL's, the following several segments model the 41 CI FINAL's, and the last segment is the transition segment.
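The segment counts quoted above follow one pattern: the number of FINAL-side classes times the number of INITIAL-side classes plus one. The short check below reproduces the three figures. The helper name is hypothetical, and the reading of the "+1" as one extra following context (for instance no INITIAL following) is an assumption, since the text gives only the arithmetic.

```python
def n_transition_segments(n_final_classes: int, n_initial_classes: int) -> int:
    """Number of FINAL-to-INITIAL transition segments.

    The "+1" is copied from the counts in the text; it is assumed here to
    stand for one extra following context besides the INITIAL classes.
    """
    return n_final_classes * (n_initial_classes + 1)

print(n_transition_segments(41, 22))   # all CI FINAL's x all CI INITIAL's
print(n_transition_segments(12, 7))    # 12 ending categories, 7 INITIAL classes
print(n_transition_segments(12, 11))   # 12 ending categories, 11 INITIAL classes
```

Classifying both sides of the transition shrinks the model inventory by roughly an order of magnitude, which matters when the continuous training data is limited.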

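Table 1. Classification of (a) the ending phonemes of FINAL's into 12 categories and (b), (c) the INITIAL's into 7 and 11 classes. [Table body not recoverable; surviving entries: "... vowels: eh", "back vowels: a, o, u".]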

In the CSM algorithm, the dips in the energy contour are first used to predict the possible syllable beginning frames in an utterance [1][3]. However, syllable boundary detection in continuous speech using energy dips is usually unstable and coarse because of the co-articulation effect in continuous speech. In this paper, the dips in the fundamental quefrency contour are found very useful in the detection of syllable boundaries and can be used in place of the energy dips. This is because of the INITIAL/FINAL characteristics of Mandarin Chinese, as discussed below.

Suppose an all-pole filter $H(z)$ of order $P$ is used to represent the system transfer function of speech, which can be written as

$$H(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}} = \frac{1}{\prod_{k=1}^{P} \left(1 - z_k z^{-1}\right)} \qquad (2)$$

where $\{a_k\}$, $k = 1, \ldots, P$, are the linear prediction coefficients while $\{z_k\}$, $k = 1, \ldots, P$, are the poles corresponding to the modes of the linear system of speech. As a consequence, the relationship between the cepstral coefficients $\{c_k\}$ and the poles $\{z_k\}$ can be derived [5]:

$$c_k = \frac{1}{k} \sum_{i=1}^{P} z_i^{k}, \qquad k = 1, \ldots, P \qquad (3)$$

In particular,

$$c_1 = a_1 = z_1 + z_2 + \cdots + z_P \qquad (4)$$

Because $\{a_k\}$ are all real, the poles $\{z_k\}$ are either real or complex conjugate pairs. Therefore, $c_1$ reduces to the sum of the projections onto the real axis of all poles in the Z-domain. Note that when the poles lie in the right half plane of the Z-domain, a higher value of $c_1$ is obtained, while poles in the left half plane imply a lower value of $c_1$. Since the spectral peaks of FINAL's are usually located in the low-frequency part while those of INITIAL's are usually located in the high-frequency part, falling gaps in the fundamental quefrency contour occur in the transition from FINAL to INITIAL. In addition, since the INITIAL's occupy a much shorter part of the whole syllable, so that the major part of an INITIAL is co-articulated with the following FINAL, the fundamental quefrency contour rises immediately from INITIAL to FINAL. In other words, there exist fundamental quefrency dips at the inter-syllable transition boundaries. The dips in the fundamental quefrency contour are believed to be more accurate and stable than those in the energy contour due to the INITIAL/FINAL characteristics of Mandarin speech.
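The argument that the first cepstral coefficient equals the sum of the poles, and is therefore high for low-frequency resonances and low for high-frequency ones, can be checked numerically. The sketch below builds one conjugate pole pair per sound class, expands the denominator polynomial, and compares the first predictor coefficient with the pole sum. The sampling rate, pole radius and resonance frequencies are hypothetical values chosen for illustration, and the plain local-minimum rule in `find_dips` is an assumption, since the text does not state the dip criterion used.

```python
import cmath
import math

def conjugate_pair(radius, freq_hz, fs_hz):
    """One resonance as a complex-conjugate pole pair inside the unit circle."""
    z = cmath.rect(radius, 2.0 * math.pi * freq_hz / fs_hz)
    return [z, z.conjugate()]

def lpc_from_poles(poles):
    """Expand prod_k (1 - z_k z^-1) = 1 - sum_k a_k z^-k; return [a_1, ..., a_P]."""
    coeffs = [1 + 0j]                      # coefficients of z^0, z^-1, ...
    for z in poles:
        nxt = coeffs + [0j]
        for n in range(1, len(nxt)):
            nxt[n] -= z * coeffs[n - 1]
        coeffs = nxt
    return [-c.real for c in coeffs[1:]]

def cepstrum_from_poles(poles, k):
    """c_k = (1/k) * sum_i z_i^k for an all-pole model."""
    return sum(z ** k for z in poles).real / k

def find_dips(contour):
    """Indices of local minima of a contour (assumed dip criterion)."""
    return [t for t in range(1, len(contour) - 1)
            if contour[t] < contour[t - 1] and contour[t] <= contour[t + 1]]

FS = 10000.0                                     # hypothetical sampling rate (Hz)
final_like = conjugate_pair(0.9, 500.0, FS)      # low-frequency spectral peak
initial_like = conjugate_pair(0.9, 4000.0, FS)   # high-frequency spectral peak

c1_final = cepstrum_from_poles(final_like, 1)
c1_initial = cepstrum_from_poles(initial_like, 1)
a1_final = lpc_from_poles(final_like)[0]
print(c1_final, a1_final, c1_initial)

# A toy c_1 contour over frames; dips mark candidate syllable boundaries.
toy_contour = [1.5, 1.6, 0.2, -1.0, 0.8, 1.5, 1.4, -0.5, 1.2]
print(find_dips(toy_contour))
```

With these values the FINAL-like pair lies in the right half plane and gives a positive first coefficient, while the INITIAL-like pair lies in the left half plane and gives a negative one, which is the sign change the dip detector relies on.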

4 Non-uniform Alignment and Segmental Weighting

In SPM's, the stochastic state transition behavior in HMM's is replaced by a deterministic process, i.e., uniform segmentation. However, the INITIAL part and the transition segment of a syllable are usually much shorter than the FINAL part, yet very important for the recognition of Mandarin syllables. Therefore, a non-uniform alignment (NUA) process is developed in place of uniform segmentation to divide each syllable section of an utterance into segments using a nonlinear function. This nonlinear function is designed such that the INITIAL and transition parts occupy a smaller portion of the whole syllable, and has the form:

$$g(n) = \cdots, \qquad n = 1, \ldots, N \qquad (5)$$

where $g(n)$ is the ending point of segment $n$, $N$ is the total number of segments, and $L$ is the length of this syllable section. Also, $\alpha(n)$ is a non-negative monotonically decreasing function except that $\alpha(N-1) = -\alpha(1)$. Clearly, $\alpha(N)$ equals zero such that $g(N) = L$.

Then, a nonlinear segmental weighting (SW) function is used to emphasize the likelihood scores of the most discriminative parts, i.e. the INITIAL and transition parts. This segmental weighting function is composed of $N$ elements, i.e. $\{w_1, w_2, \ldots, w_N\}$, where each $w_j$ is a constant positive value. As a consequence, the syllable section score $S_i(u, v)$ for syllable $i$ in eq. (1), integrating the NUA and SW processes, can be expressed as:

$$S_i(u, v) = K \sum_{j=1}^{N} w_j \, d_j\bigl(u + g(j-1),\; u + g(j);\; \lambda_{i,j}\bigr) \qquad (6)$$

where $K$ is a [...] factor, $w_j$ is the weighting factor for segment $j$, and $d_j(a, b; \lambda_{i,j})$ represents the segmental probability of segment $j$ over the section $(a, b)$ when matched with the model $\lambda_{i,j}$.
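The interplay of NUA and SW in eq. (6) can be sketched in a few lines. Because the exact form of eq. (5) is not available here, the alignment function below is a hypothetical stand-in, g(n) = L (n/N - alpha(n)), chosen only because it satisfies the stated properties (alpha(N) = 0 gives g(N) = L, and alpha(N-1) = -alpha(1) shortens the last segment as well as the first). The alpha values, the weights and the toy segment score are illustrative assumptions as well.

```python
from typing import Callable, List, Sequence

def nua_endpoints(length: int, alpha: Sequence[float]) -> List[int]:
    """Hypothetical NUA: g(n) = L * (n/N - alpha(n)) for n = 1..N.

    alpha holds alpha(1)..alpha(N); alpha(N) must be 0 so that g(N) = L.
    """
    n_seg = len(alpha)
    return [round(length * ((n + 1) / n_seg - alpha[n])) for n in range(n_seg)]

def weighted_section_score(
    u: int,
    length: int,
    alpha: Sequence[float],
    weights: Sequence[float],
    seg_score: Callable[[int, int, int], float],
    k: float = 1.0,
) -> float:
    """Eq. (6): K * sum_j w_j * d_j(u + g(j-1), u + g(j)), with g(0) = 0."""
    g = [0] + nua_endpoints(length, alpha)
    return k * sum(
        weights[j] * seg_score(j, u + g[j], u + g[j + 1])
        for j in range(len(weights))
    )

ALPHA = [0.08, 0.04, 0.0, -0.08, 0.0]   # alpha(N-1) = -alpha(1), alpha(N) = 0
WEIGHTS = [2, 1, 1, 1, 2]               # emphasise INITIAL and transition parts

def toy_seg_score(j: int, a: int, b: int) -> float:
    """Stand-in for d_j: a log score proportional to the segment length."""
    return -float(b - a)

print(nua_endpoints(50, [0.0] * 5))     # uniform segmentation for comparison
print(nua_endpoints(50, ALPHA))         # first and last segments become shorter
print(weighted_section_score(0, 50, ALPHA, WEIGHTS, toy_seg_score))
```

Setting all alpha values to zero and all weights to one recovers the original uniformly segmented SPM score, so the two refinements can be switched on independently, as is done in the experiments.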

5 Syllable Filter

In order to integrate some linguistic knowledge to further improve the recognition performance, a syllable filter is finally applied to eliminate some illegal syllable candidates such that a more accurate syllable path can be obtained. In addition, the syllable filter can increase the number of correct candidates to improve the linguistic processing. Here a syllable bigram is derived to describe the linguistically admissible transitions from one syllable to another. The syllable bigram used here is trained from a large Chinese text corpus consisting of a total of 4.2M characters (2.7M words) collected from daily newspapers. Syllable pairs with higher probability in the syllable bigram thus imply a stronger linguistic connection between them. Therefore, eq. (1) can be replaced by the following form:

$$T_k[y-1] = \max^{k}_{x,\,i,\,l} \bigl\{ T_l[x-1] + S_i(x,\, y-1) + \eta\, P_{h_{x-1}(l),\, i} \bigr\} \qquad (7)$$

where $\max^{k}$ means the $k$-th highest score among the $408 \times m$ accumulated probabilities, $h_{x-1}(l)$ is the top-$l$ syllable candidate at frame point $x-1$, and $P_{h_{x-1}(l),\, i}$ is the transition probability from syllable $h_{x-1}(l)$ to $i$. Also, $\eta$ is a weighting factor to emphasize the syllable filter probabilities. In other words, for each possible ending point, the top $m$ candidates must be calculated. Clearly, a recognition process integrating this syllable transition information is time-consuming. Instead, the syllable filter can be applied in post-processing, in which a comparable improvement in recognition accuracy can be achieved at much higher recognition speed.
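A post-processing syllable filter of this kind can be sketched as a bigram rescoring pass over per-position syllable candidates. The text does not describe how the post-processing is organised, so everything below is an assumption for illustration: the relative-frequency bigram with a probability floor, the Viterbi-style rescoring, the weight `eta`, and the toy corpus and scores.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

def train_bigram(sentences: Sequence[Sequence[str]],
                 floor: float = 1e-6) -> Callable[[str, str], float]:
    """Relative-frequency syllable bigram with a floor for unseen pairs."""
    pair_count: Counter = Counter()
    first_count: Counter = Counter()
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            pair_count[(a, b)] += 1
            first_count[a] += 1

    def prob(a: str, b: str) -> float:
        if first_count[a] == 0:
            return floor
        return max(pair_count[(a, b)] / first_count[a], floor)

    return prob

def rescore(lattice: List[List[Tuple[str, float]]],
            prob: Callable[[str, str], float],
            eta: float) -> List[str]:
    """Best syllable path under acoustic score + eta * log bigram probability.

    lattice[t] lists (syllable, acoustic log score) candidates at position t.
    """
    best = {syl: (score, [syl]) for syl, score in lattice[0]}
    for column in lattice[1:]:
        nxt = {}
        for syl, score in column:
            nxt[syl] = max(
                (prev_score + score + eta * math.log(prob(prev, syl)), path + [syl])
                for prev, (prev_score, path) in best.items()
            )
        best = nxt
    return max(best.values())[1]

# Toy corpus and lattice: acoustically "dai wang" wins, linguistically "tai wan".
bigram = train_bigram([["tai", "wan"], ["tai", "bei"], ["tai", "wan"]])
toy_lattice = [[("tai", -1.0), ("dai", -0.9)], [("wan", -1.0), ("wang", -0.9)]]
print(rescore(toy_lattice, bigram, eta=0.0))
print(rescore(toy_lattice, bigram, eta=1.0))
```

Because the rescoring runs once over a small candidate lattice instead of inside the frame-level search, its cost is negligible next to the acoustic matching.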


6 Tone Recognition and Linguistic Processing

From eq. (1), for each syllable section in an utterance, not only is the base syllable recognized but the tone can also be recognized in the same phase [1]. Therefore, the final output of the acoustic processing is the tonal syllable recognition result. Here the CHMM is used as the tone recognizer, with a total of 5 models, one for each tone. Combining the base syllable and tone recognition results, a tonal syllable lattice with 10 candidates is first constructed for the linguistic processing. Then the tonal syllable lattice is transformed into a word lattice via a lexical access process. Finally, the word-based Chinese language model trained from the Chinese text corpus mentioned previously is used to find the most likely characters, words and sentences.
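The construction of one column of the tonal syllable lattice can be sketched as follows. How the base syllable and tone scores are combined is not stated in the text, so adding log scores, the `allowed` set (standing in for the 1345 phonologically allowed tonal syllables) and all numbers below are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple

def tonal_candidates(
    base_scores: Dict[str, float],
    tone_scores: Dict[int, float],
    allowed: Set[Tuple[str, int]],
    n_best: int = 10,
) -> List[Tuple[str, int]]:
    """Top-n tonal syllables for one syllable section.

    Scores are log scores and are simply added (an assumption); pairs that
    are not phonologically allowed are skipped.
    """
    combos = [
        (b_score + t_score, (base, tone))
        for base, b_score in base_scores.items()
        for tone, t_score in tone_scores.items()
        if (base, tone) in allowed
    ]
    return [cand for _, cand in heapq.nlargest(n_best, combos)]

toy_base = {"ma": -1.0, "na": -1.5}
toy_tone = {1: -0.2, 3: -0.5, 4: -2.0}
toy_allowed = {("ma", 1), ("ma", 3), ("ma", 4), ("na", 3), ("na", 4)}
print(tonal_candidates(toy_base, toy_tone, toy_allowed, n_best=3))
```

Each section's candidate list then feeds the lexical access step, which looks up words whose tonal syllable sequences can be traced through consecutive columns.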

7 Experiments and Discussion

The speech database used here for the speaker-dependent task was produced by two male and two female speakers. Each speaker produced 3 sets of all the 1345 isolated Mandarin tonal syllables and 2 continuous utterances for each of 352 phonetically balanced sentences (with a total of 2701 syllables covering all the 1345 Mandarin syllables). Also, 3 paragraphs randomly selected from daily newspapers, covering economics, politics and society news respectively, were produced in continuous mode; they comprise a total of 106 sentences or 1215 syllables. For base syllable recognition, cepstral coefficients of order 14 and the corresponding 14 delta cepstral coefficients are derived from the LPC coefficients and used as feature parameters. For tone recognition, the pitch and energy together with their first- and second-order delta coefficients form a feature vector of dimension 6.

In the following experiments, the 3 sets of 1345 isolated syllables are used to train the initial models, the 2 × 352 phonetically balanced continuous sentences are used to re-estimate the continuous model parameters, and the remaining 3 articles are used for testing. The recognition rates are evaluated as the percentages of correctly recognized syllables minus the insertion and deletion rates. Moreover, the results reported here are averages over the four speakers.

As shown in Table 2, the experiment using syllable-based SPM and the CSM algorithm provides an accuracy of only 61.27%, but this can be significantly increased to 73.91% with the segment sharing concept. Also, the recognition speed is improved by exactly 4 times as compared to the syllable-based SPM. Furthermore, the modified SPM (MSPM) with the transition segment representing the inter-syllable transitions is evaluated in experiments 3 and 4. The experiments with 7 and 11 classes of INITIAL's are tested respectively, where the error rates are further reduced by 10.43% and 7.82% in comparison with the segment-shared SPM, with slightly increased recognition time. The above experimental results are obtained using the energy dips in the CSM algorithm. However, when the full search mode is applied, i.e., every frame is a possible syllable beginning point, the recognition rate can be immediately increased from 76.63% to 83.70%, at more than 12 times the recognition time, as also listed in Table 2. When the energy dips are replaced by the fundamental quefrency dips, the recognition complexity is almost unchanged but the accuracy is significantly improved to 83.59%. Then, the proposed non-uniform alignment (NUA), the segmental weighting (SW) function and the syllable filter are added to the segment-shared SPM using the CSM algorithm. The recognition rates improve step by step, finally reaching as high as 89.53%, and the time needed to recognize a syllable is 0.38 sec in the last row of Table 2.
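The "error rate reduction" figures quoted in this section are relative reductions of the error rate (100% minus the recognition rate), not differences in accuracy. A small helper, with a hypothetical name, makes the arithmetic explicit and reproduces the 10.43% figure from the two accuracies given above for the segment-shared SPM (73.91%) and the MSPM with 7 INITIAL classes (76.63%).

```python
def error_rate_reduction(acc_before: float, acc_after: float) -> float:
    """Relative error rate reduction in percent, from accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

# Segment-shared SPM (73.91%) versus MSPM with 7 INITIAL classes (76.63%).
print(round(error_rate_reduction(73.91, 76.63), 2))
```

The same formula underlies the 18.48% reduction reported against the sub-syllable-based CHMM.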

As a comparison, three types of CHMM's are also tested in Table 3. First, the sub-syllable-based CHMM with exactly the same 113 CD INITIAL's and 41 CI FINAL's as basic units is tested in experiment 10 [1]. The syllable duration limitation is then considered and the syllable filter is finally added in experiments 11 and 12. It is noteworthy that, comparing experiment 9 with experiment 12, an 18.48% error rate reduction is obtained at more than 3 times the recognition speed using the proposed improved SPM techniques.

Finally, we combine a CHMM-based tone recognizer and a word-based Chinese language model with the base syllable recognition to obtain the output characters. As shown in Table 4, the recognition rate for tones is 86.67%, and the achieved top 1 and top 10 recognition rates for tonal syllables are 81.10% and 98.97% respectively. The final character accuracy, with 10 syllable candidates included in the tonal syllable lattice, is as high as 92.10%.

8 Conclusion

In this paper, we applied an improved Segmental Probability Model (SPM) to continuous Mandarin speech recognition with very large vocabulary and achieved very good performance in both accuracy and speed. Several special techniques are developed to further improve the recognition performance step by step by making use of acoustic and linguistic knowledge. The final achievable rate is as high as 91.62%, which corresponds to an 18.48% error rate reduction at more than 3 times the speed of the well-studied sub-syllable-based CHMM. Adding a CHMM-based tone recognizer and a word-based Chinese language model, the achieved recognition accuracy for the finally decoded Chinese characters is 92.10%.

Table 2 (surviving fragment; rows following an entry ending in "quefrency (7 classes)")

experiments     rate     ins.    dels.    time
7. plus NUA     85.69    0.0     0.16     0.35
8. plus SW      86.27    0.0     0.16     0.35

References

[1] Hsin-min Wang, Jia-lin Shen and Lin-shan Lee, "Complete Recognition of Continuous Mandarin Speech for Chinese Language with Very Large Vocabulary but Limited Training Data", ICASSP, pp. 61-64, 1995.

[2] Lin-shan Lee, et al., "Golden Mandarin (II) - An Improved Single-Chip Real-time Mandarin Dictation Machine for Chinese Language with Very Large Vocabulary", ICASSP, pp. 503-506, 1993.

[3] Jia-lin Shen, Hsin-min Wang and Lin-shan Lee, "An Initial Study on A Segmental Probability Model Approach to Large-Vocabulary Continuous Mandarin Speech Recognition", ICASSP, pp. 133-136, 1994.

[4] Jia-lin Shen, Hsin-min Wang, Renyuan Lyu and Lin-shan Lee, "Incremental Speaker Adaptation Using Phonetically Balanced Training Sentences for Mandarin Syllable Recognition Based on Segmental Probability Models", ICSLP, pp. 443-446, 1994.

[5] M. Schroeder, "Direct (nonrecursive) Relationship Between Cepstrum and Linear Predictor Coefficients", IEEE ASSP, pp. 297-311, Apr. 1994.

Table 3 (surviving fragment)

experiments          rate       ins.    dels.    time
11. plus duration    84.2[?]    0.08    0.82     1.15

Columns of Tables 2 and 3: experiments, rate, ins., dels., time. The time (sec/syllable) needed is measured on a Sun SPARC20.