AUTOMATIC DERIVATION OF A PHONEME SET WITH TONE INFORMATION FOR CHINESE SPEECH RECOGNITION BASED ON MUTUAL INFORMATION CRITERION
Jin-Song Zhang, Xin-Hui Hu, and Satoshi Nakamura
ATR Spoken Language Communication Research Laboratories
2-2-2 Kansai Science City, Kyoto 619-0288 Japan
{jinsong.zhang, xinhui.hu, satoshi.nakamura}@atr.co.jp
ABSTRACT
An appropriate approach to modeling tone information is helpful for building a Chinese large vocabulary continuous speech recognition system. We propose to derive an efficient phoneme set of tone-dependent sub-word units for a recognition system by iteratively merging a pair of tone-dependent units according to the principle of minimal loss of mutual information (MI). The mutual information is measured between the word tokens and their phoneme transcriptions in a training text corpus, based on the system's lexicon and language model. The approach keeps the discriminative tonal (and phoneme) contrasts that are most helpful for disambiguating words that become homophones when tones are missing, and merges those tonal (and phoneme) contrasts that are not important for word disambiguation in the recognition task. This enables a flexible selection of the phoneme set according to a balance between the amount of MI and the number of phonemes. We applied the method to the traditional phoneme set of Initials/Finals and derived several phoneme sets with different numbers of units. Speech recognition experiments using the derived sets showed their effectiveness.
1. INTRODUCTION
Chinese is a tonal language in which each syllable is associated with a pitch tone. There are four basic tones and one neutral tone. The same syllable with different tones has different lexical meanings. How to model tone information when building a Chinese large vocabulary continuous speech recognition (LVCSR) system has been an interesting and important topic. Among the various approaches, the one using tone-dependent sub-word units has the advantage of frame-synchronous consistency with the decoding strategy of state-of-the-art LVCSR systems, and has been widely adopted [1, 2, 4]. One common problem of these approaches is that the size of the phoneme set of the LVCSR system increases significantly once tone dependencies are introduced. For example, in the widely used traditional Chinese phoneme set of Initials/Finals (IFs), the number of non-tonal IFs is 59, and that of tone-dependent ones is more than 200. As context-dependent tri-phone HMMs are usually used in LVCSR systems, their number explodes from tens of thousands to millions when tone dependency is used, making it very challenging to train the tri-phone HMMs robustly. Also, the complexity of the phoneme hypothesis lattice increases significantly, making decoding computationally heavier.
Previous studies [1, 2, 4] dealt with this problem by hand-crafting a small phoneme set containing tone-dependent phonemes, such as tonemes [1], tonal main vowels [4], and segmental tones [2]. Although these approaches showed performance improvements in recognition experiments, they still increase the phoneme set several-fold because the non-tonal units are fully expanded into tone-dependent ones. However, a full expansion of tone dependencies may be unnecessary. On the one hand, speakers tend to reduce some tones from their lexical forms in daily speech [5] when the reductions do not hinder speech communication. On the other hand, the lexicon and language model (e.g., n-gram) information in an LVCSR system is usually very effective at disambiguating most of the words that become homophones when tone information is missing [6], as evidenced by the fact that incorporating a tonal phoneme set several times larger has led to only slight recognition improvements [1, 4].
Viewing the full expansion of tone dependencies as unnecessary, we propose to incorporate only those tone dependencies that are necessary for disambiguating word confusions in an LVCSR system. Unlike several previous studies on disambiguating word confusions, which are based on the acoustic confusions of phonemes [7, 8, 9, 10], our method focuses on the disambiguation power of the lexicon and language model. In other words, a tone dependency is not incorporated when the lexicon and language model can disambiguate the homophone words that result from the lack of that tone.
The approach is realized by compacting the redundancy of an initial full tone-dependent unit set according to the principle of minimal loss of mutual information. The mutual information is measured between the word tokens and their phoneme transcriptions in a training text corpus. A greedy search merges two units at a time so as to minimize the corresponding loss of mutual information. The final phoneme set can be chosen flexibly according to a balance between the number of units and the amount of information. Preliminary speech recognition experiments have been carried out to verify the effectiveness of the derived phoneme sets.
142440469X/06/$20.00 ©2006 IEEE ICASSP 2006
2. CHINESE PHONOLOGY
A Chinese word is composed of one to several characters, and each character is pronounced as a monosyllable with a pitch tone. There are about 1,300 phonetically distinct tonal syllables, and about 410 base syllables when pitch tones are discarded. Traditional Chinese phonology divides the syllables into demi-syllabic units: 21 Initials and 37 Finals, plus four basic lexical tones (Tones 1-4) and a neutral tone (Tone 0) (Table 1).
Initials  b, p, m, f, d, t, n, l, z, c, s, zh, ch, sh, r, j, q, x, g, k, h, null initial
Finals    a, ao, ai, an, ang, o, ou, ong, e, ei, en, eng, er, ia, iao, ie, i1, i2, i3, iu, in, ian, ing, iang, iong, u, ua, uo, ui, uai, un, uan, uang, v, ve, vn, van
Tone      0, 1, 2, 3, 4
Table 1. Pinyin symbols for Initials, Finals and lexical tones of Chinese syllables.
In the tone-dependent phoneme approach, all the Finals are expanded into tone-dependent ones, such as a0, a1, a2, a3, a4, etc. Although not all combinations of Finals and tones exist, the tone-dependent phoneme set still contains more than 200 units. When isolated monosyllabic words are considered, tone contrasts may play an important role in discriminating the words. For example, the words ma1 (mother), ma2 (hemp), ma3 (horse) and ma4 (scold) are differentiated only by their tones when spoken in isolation. However, in sentences they occur with very different context words. In other words, the lexicon and language model (n-gram) have the power to disambiguate the four words even if the tone information is ignored. We therefore regard the original tone-dependent IF set as containing many redundancies for a recognition system, given a lexicon and language model.
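The effect of discarding tones on a lexicon can be illustrated with a minimal sketch. The toy lexicon below is hypothetical: only the four "ma" entries come from the text, and the function names are illustrative rather than part of the described system.

```python
import re
from collections import defaultdict

# Toy lexicon: word gloss -> tonal syllable sequence (pinyin with a tone digit).
# Only the four "ma" entries come from the text; the others are illustrative.
TOY_LEXICON = {
    "mother": ["ma1"],
    "hemp": ["ma2"],
    "horse": ["ma3"],
    "scold": ["ma4"],
    "book": ["shu1"],
    "tree": ["shu4"],
    "person": ["ren2"],
}

def strip_tones(syllables):
    """Remove the trailing tone digit (0-4) from every syllable."""
    return tuple(re.sub(r"[0-4]$", "", s) for s in syllables)

def homophone_classes(lexicon, key=tuple):
    """Group words whose pronunciations are identical under `key`."""
    classes = defaultdict(list)
    for word, syllables in lexicon.items():
        classes[key(syllables)].append(word)
    # Keep only pronunciations shared by more than one word.
    return {pron: words for pron, words in classes.items() if len(words) > 1}

tonal = homophone_classes(TOY_LEXICON)                  # tones kept
toneless = homophone_classes(TOY_LEXICON, strip_tones)  # tones discarded
print(len(tonal), "ambiguous pronunciations with tones")
print(len(toneless), "ambiguous pronunciations without tones")
```

In this toy case the ambiguity appears only once the tones are dropped; whether it matters for recognition depends on how well the context words separate the members of each class.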
3. THEORY FOR PHONEME SET OPTIMIZATION
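[Figure 1: block diagram. The word corpus W is converted by lexicon-based conversion (lexicons in Φ1 and Φ2) into the phoneme transcriptions F1,2, which are decoded by word lattice scoring (lexicon, LM) into W1,2.]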
Fig. 1. Illustration of the problem formalization.
We formalize the phoneme set optimization problem as an information coding/decoding problem, as illustrated in Figure 1, where W stands for a word-based text corpus, Φ1 and Φ2 for two different phoneme sets, F1,2 for the phoneme transcriptions of W based on the Φ1,2 lexicons respectively, and W1,2 for the words decoded from F1,2 based on the same language model and the respective lexicons.
When a coding method Φi is lossless, the decoded words Wi should satisfy W = Wi. However, when the coding is not uniquely decodable, the better coding Φ∗ should be the one

\[ \Phi^{*} = \arg\max_{i} I(W, F_i), \quad i = 1, 2. \]

The mutual information I(W, Fi) can be calculated as

\begin{align}
I(W, F_i) &= H(W) - H(W \mid F_i) \tag{1} \\
          &= \log P(W \mid F_i) - \log P(W) \tag{2} \\
          &= \log \frac{P(F_i \mid W)}{\sum_{\text{all } j} P(F_i \mid W_j)\, P(W_j)} \tag{3}
\end{align}

Lines (2) and (3) are to be read as averages over the word tokens and transcriptions of the corpus. P(W) and P(Fi|W) represent two main components of current speech recognition systems, i.e., language modeling and probabilistic pronunciation variation modeling.
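As a rough illustration of the MI formula above, the following sketch computes I(W, F) for a tiny vocabulary. It rests on two simplifying assumptions that are not taken from the paper: the language model is replaced by unigram word probabilities, and the lexicon is deterministic, so P(F|W) is 1 for the word's own pronunciation and 0 otherwise. All probabilities are made up.

```python
import math
from collections import defaultdict

def mutual_information(word_probs, lexicon):
    """I(W; F) in bits for a deterministic lexicon (one pronunciation per word).

    word_probs: dict word -> P(W), summing to 1 (a unigram stand-in for the LM).
    lexicon:    dict word -> pronunciation (any hashable value).
    """
    # Denominator of the last MI line: P(F) = sum_j P(F | W_j) P(W_j),
    # where P(F | W_j) = 1 only if F is the pronunciation of W_j.
    pron_probs = defaultdict(float)
    for word, p in word_probs.items():
        pron_probs[lexicon[word]] += p
    # Average of log [P(F | W) / P(F)] over the words, with P(F | W) = 1.
    return sum(p * math.log2(1.0 / pron_probs[lexicon[w]])
               for w, p in word_probs.items() if p > 0)

probs = {"mother": 0.4, "horse": 0.4, "book": 0.2}   # hypothetical values
tonal = {"mother": "m a1", "horse": "m a3", "book": "sh u1"}
toneless = {"mother": "m a", "horse": "m a", "book": "sh u"}

mi_tonal = mutual_information(probs, tonal)        # lossless coding: equals H(W)
mi_toneless = mutual_information(probs, toneless)  # two words share a transcription
print(mi_tonal, mi_toneless)
```

With a lossless coding the MI equals the word entropy H(W); any merge that makes two words share a transcription can only lower it, which is the quantity the optimization tries to keep as high as possible.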
4. MINIMUM MI LOSS BASED PHONEME SET REDUCTION
We have designed a greedy approach that compacts the redundancies of an initial phoneme set by iteratively merging the pair of phonemes whose merge leads to the least loss of MI. Figure 2 shows the flow chart of the method.
• Initialization: the following resources are prepared.
– Initial phoneme set Φ0: contains the full set of tone-dependent sub-word units.
– Lexicon: the lexicon of the speech recognition task, represented in the initial phoneme set.
– Text corpus: a corpus representative of the speech recognition task.
– Language model: the language model of the speech recognition system.
• Optimization procedure:
1. MI calculations: for each possible merge of two phonemes Ψi: A + B → A, the MI reduction ∆MI(Ψi) is calculated using the lexicon that would result from the merge.
2. Merge decision: among all the possible merges, the one Ψ∗ that has the smallest MI reduction is selected as the effective merge of this iteration:
\[ \Psi^{*} = \arg\min_{\text{all } i} \Delta MI(\Psi_i) \]
3. Renew the lexicon and phoneme set based on the effective merge Ψ∗.
4. Check whether the stop criterion is satisfied. If not, go back to step 1 and repeat steps 1-3. If so, stop the optimization and output the phoneme merging rules and the new lexicon.
To avoid a computationally heavy exhaustive search through all possible phoneme merges, we limited the search to a constrained space of possible merges, which can be defined according to phonetic knowledge about the acoustic similarities between pairs of phonemes.
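The procedure can be sketched in a few lines under simplifying assumptions: a unigram stand-in for the language model, a deterministic lexicon, and a stop criterion expressed as a target number of units (the text does not specify the stop criterion). The names `greedy_reduce` and `apply_merge`, the toy lexicon and the probabilities are all hypothetical.

```python
import math
from collections import defaultdict

def mutual_information(word_probs, lexicon):
    """I(W; F) in bits, assuming one pronunciation per word."""
    pron_probs = defaultdict(float)
    for word, p in word_probs.items():
        pron_probs[lexicon[word]] += p
    return sum(p * math.log2(1.0 / pron_probs[lexicon[w]])
               for w, p in word_probs.items() if p > 0)

def apply_merge(lexicon, merge):
    """Rewrite every pronunciation under the merge A + B -> A."""
    keep, drop = merge
    return {w: tuple(keep if u == drop else u for u in pron)
            for w, pron in lexicon.items()}

def greedy_reduce(word_probs, lexicon, candidates, target_size):
    """Apply one merge per iteration, always the one losing the least MI."""
    lexicon = dict(lexicon)
    units = {u for pron in lexicon.values() for u in pron}
    rules = []
    while len(units) > target_size:          # assumed stop criterion
        # Pairs that mention an already dropped unit are simply skipped here.
        live = [m for m in candidates if m[0] in units and m[1] in units]
        if not live:
            break
        current = mutual_information(word_probs, lexicon)

        def loss(m):
            # Step 1: MI reduction under the tentatively merged lexicon.
            return current - mutual_information(word_probs, apply_merge(lexicon, m))

        best = min(live, key=loss)           # step 2: smallest MI reduction
        lexicon = apply_merge(lexicon, best) # step 3: renew lexicon and unit set
        units.discard(best[1])
        rules.append(best)
    return rules, lexicon

word_probs = {"mother": 0.4, "horse": 0.1, "book": 0.25, "tree": 0.25}
lexicon = {"mother": ("m", "a1"), "horse": ("m", "a3"),
           "book": ("sh", "u1"), "tree": ("sh", "u4")}
candidates = [("a1", "a3"), ("u1", "u4")]    # constrained merge space
rules, reduced = greedy_reduce(word_probs, lexicon, candidates, target_size=5)
print(rules)
```

In this toy setting the a1/a3 merge is chosen first because the two words it confuses have very unequal probabilities, so less information is lost than by confusing two equally likely words.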
[Figure 2: flow chart. Inputs: lexicon and phoneme set, defined phoneme merges Ψi = A + B → A, ASR task text corpus, language model. Loop: MI computations → Ψ∗ = arg min ∆MI(Ψi) → renew lexicon and merge definitions → End? (N: repeat; Y: output the new lexicon and the phoneme merging rules).]
Fig. 2. Illustration of the minimum MI loss based phoneme set reduction.
[Figure 3: two line plots. Upper panel: "MI reduction with number of phoneme merges for training data". Lower panel: "MI reduction with number of phoneme merges for test set". Horizontal axis: phoneme reduction (ticks 1 to 151). Vertical axis: MI (ticks from 5.50 to 5.53 for the training data, from 5.44 to 5.49 for the test set). The points T0 to T4 are marked on each curve.]
Fig. 3. Illustration of the MI variations with the iterative merges of phonemes. Each step from left to right indicates one more merge of two units. The upper panel is for the training text corpus, and the lower one for the test corpus. The point "T4" stands for the initial 206-phoneme set, whereas "T0" stands for the non-tonal IF set.
5. EXPERIMENTS AND RESULTS
5.1. Phoneme Set Design Experiments
The text corpus (CBTEC) we used is the Chinese version of the Basic Travel Conversation Text (BTEC) of ATR. It contains about 200,000 sentences with about one million words. The lexicon has about 17,000 entries, and the language model is a 2-gram model trained on CBTEC. The initial phoneme set has 206 units, including all the tone-dependent units. The initially defined set of candidate phoneme merges has 433 entries, designed according to the phonetic similarities of the tonal phonemes. There is one constraint: two different Finals can be merged only when neither of them has tone dependencies.
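One possible way to enumerate a constrained merge space is sketched below. It is an assumption-laden illustration, not the 433-entry list used in the experiments: units are written as (base, tone) pairs with tone None once the Final no longer carries a tone dependency, and the set of phonetically similar bases is hypothetical.

```python
from itertools import combinations

def candidate_merges(units, similar_bases):
    """List the merges allowed under the constraint on Finals.

    units:         list of (base, tone) pairs; tone is None for a toneless Final.
    similar_bases: set of frozensets naming phonetically similar base Finals
                   (hypothetical stand-in for phonetic knowledge).
    """
    allowed = []
    for u, v in combinations(units, 2):
        same_base = u[0] == v[0]
        if same_base and u[1] != v[1]:
            # Tone merge within one Final, e.g. a1 + a3 -> a1.
            allowed.append((u, v))
        elif (not same_base and u[1] is None and v[1] is None
              and frozenset((u[0], v[0])) in similar_bases):
            # Different Finals: only when neither has a tone dependency.
            allowed.append((u, v))
    return allowed

units = [("a", 1), ("a", 3), ("an", None), ("ang", None), ("o", 2)]
similar = {frozenset(("an", "ang"))}
merges = candidate_merges(units, similar)
print(merges)
```

Writing units as explicit pairs avoids parsing tone digits out of unit names, which would be ambiguous for Finals such as i1, i2 and i3 in Table 1.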
Figure 3 illustrates the MIs of the different phoneme sets obtained by merging units. The figure clearly shows that:
• The MI gaps between the points T0 and T4 indicate that some information is lost when the non-tonal IF set is used as the phoneme set of the recognition system.
• There are flat stretches in the MI curves after T4 for both the training and test data, indicating that a significant number of phoneme merges, including tone merges, lead to no information loss. Hence, these are the redundancies in the initial full tone-dependent unit set, given the lexicon and language model.
• The method offers a flexible way to select a phoneme set for building the speech recognition system according to a balance between the number of units and the loss of MI.
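The balance between the number of units and the MI loss can be turned into a simple selection rule, for instance "take the smallest set whose MI loss stays within a budget". The sketch below assumes such a rule; the rule itself, the function name and the trajectory values are invented for illustration and are not the measurements of Figure 3.

```python
def smallest_set_within_loss(trajectory, max_loss):
    """Pick the smallest phoneme set whose MI loss stays within `max_loss`.

    trajectory: list of (num_units, mi) pairs, the first being the initial set.
    """
    initial_mi = trajectory[0][1]
    acceptable = [(n, mi) for n, mi in trajectory if initial_mi - mi <= max_loss]
    return min(acceptable)   # tuples compare by num_units first

# Invented trajectory: flat at first, then dropping as useful contrasts are merged.
toy_trajectory = [(10, 3.00), (9, 3.00), (8, 2.99), (7, 2.90), (6, 2.60)]
choice = smallest_set_within_loss(toy_trajectory, max_loss=0.02)
print(choice)
```

A tighter budget keeps more units; a budget of zero would stop at the end of the flat stretch, where merges are still free of information loss.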
Set   Units   Initials   Finals   Tri-phones
T0       59         21       37      107,441
T1       50         18       31       70,128
T2       59         18       40      114,945
T3       80         20       59      292,651
T4      206         21      184    3,022,775
Table 2. Number of units in the selected sets and the number of their corresponding logical HMMs.
We selected five different unit sets to build our speech recognition systems. T0 is the conventional non-tonal IF set with 59 units; T1 has 50 units and shows an MI similar to that of T0; T2 has the same number of units as T0 but shows a better MI; T3 has 80 units and shows only a slight MI loss relative to the initial phoneme set T4, which is the full tone-dependent set of 206 units. Table 2 lists the numbers of Initials, Finals and logical tri-phones in the five ASR systems.
5.2. Speech Recognition Experiments
The training speech data for the acoustic models is the Beijing part of ATR Accented Speech (ATRAS). It contains more than 40,000 utterances with a total duration of 43 hours from 96 native Beijing speakers, balanced between male and female. The test speech data is a subset of the CBTEC Putonghua test data. It contains 510 utterances by 5 male and 5 female speakers, each speaker uttering different sentences. The reason for using a slightly accent-mismatched training database is that the ATRAS database has manually annotated phonetic labels, including tone labels that differ from their lexical forms.
We used the HTK toolkit [11] to build the speech recognition systems for testing the five phoneme sets. The feature vector has 39 dimensions, consisting of standard MFCC features and log power together with their first- and second-order derivatives. Cepstral mean subtraction is done at the sentence level. We developed tri-phone HMMs with phonetic-decision-tree based state tying for the different phoneme sets. Each HMM has 3 left-to-right states, the total number of tied states is similar across the systems at about 2,000, and each state has 20 Gaussian mixture components.
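The 39-dimensional feature layout can be illustrated with a small sketch. It assumes 13 static coefficients per frame (12 MFCCs plus log power, which is consistent with 39 = 3 × 13 but not stated in the text), a regression window of 2 frames for the derivatives, and mean subtraction applied to all static columns; the actual front end was HTK, and this pure-Python code only mimics the idea.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-sentence mean of every coefficient."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def deltas(frames, window=2):
    """Regression-based derivatives; edge frames are replicated."""
    n, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = sum(k * (frames[min(t + k, n - 1)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, window + 1))
            row.append(acc / denom)
        out.append(row)
    return out

def build_features(static):
    """Stack normalized static, first- and second-order derivative coefficients."""
    normalized = cepstral_mean_subtraction(static)
    d1 = deltas(normalized)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(normalized, d1, d2)]

# Toy "sentence": 5 frames of 13 static coefficients forming a linear ramp.
static = [[float(t)] * 13 for t in range(5)]
features = build_features(static)
print(len(features), len(features[0]))
```

On the linear ramp the first-order derivative of the middle frame is exactly the slope, which is a quick sanity check for the regression formula.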
The speech recognition experiments used the same lexicon and language model as the phoneme set optimization procedure. The perplexity of the test set is about 40 under the 2-gram language model. The recognition performance is shown in Fig. 4 in terms of Chinese character accuracy.
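Character accuracy is commonly computed as (N − S − D − I) / N, where N is the number of reference characters and S, D and I are the substitutions, deletions and insertions of an alignment. The sketch below assumes this usual definition (the text does not spell out its formula) and uses the minimum edit distance for S + D + I; the token sequences are toy data.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def character_accuracy(pairs):
    """Accuracy in percent over (reference, hypothesis) character sequences."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * (total - errors) / total

# Each list element stands for one Chinese character (toy data).
reference = ["wo", "xiang", "qu", "bei", "jing"]
hypothesis = ["wo", "xiang", "qu", "nan", "jing"]
accuracy = character_accuracy([(reference, hypothesis)])
print(accuracy)
```

Because insertions are counted as errors, this accuracy can become negative for very poor hypotheses, unlike a plain percent-correct measure.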
[Figure 4: chart titled "Best ASR Performances of Different Phoneme Sets". Horizontal axis: phoneme sets (T0, T1, T2, T3, T4). Vertical axis: character accuracy in % (ticks from 90.5 to 92.5).]
Fig. 4. Character accuracies of speech recognition using different phoneme sets.
[Figure 5: chart titled "Relation between MI difference and recognition performance". Horizontal axis: MI groups (Top 20 sent., Other, Total). Vertical axis: character accuracy in % (ticks from 75 to 95). Series: T0 and T3.]
Fig. 5. Illustration of the relationship between MI differ- ences and the recognition performances for T0 and T3. Top 20 represents the 20 sentences with the maximum MI im- provements when using T3 instead of T0.
The results showed that:
• Almost all the derived unit sets (T1 to T4) performed better than or similarly to the non-tonal set T0, indicating that the derived phoneme sets are efficient for the recognition task.
• Although T1 has 9 fewer phonemes than T0, it achieved performance similar to that of T0, indicating that this phoneme set is more efficient.
• T3 achieved the highest performance, possibly due to a better balance between the number of units and the amount of MI than the other sets.
• To take a close look at the relationship between the MI differences and the recognition performance of T0 and T3, we separated the 510 test sentences into two groups: one containing the 20 sentences with the largest MI improvements, and one containing all the remaining sentences. The first group showed more significant recognition improvements than the other, as shown in Fig. 5, indicating that a positive MI difference is correlated with recognition improvement.
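The grouping used in the last point can be sketched as follows. The record fields (`mi_gain`, `chars`, `err_base`, `err_new`) and all numbers are hypothetical; the sketch only shows how per-group accuracies of two systems could be compared once per-sentence MI gains and error counts are available.

```python
def group_accuracies(records, top_n):
    """Compare two systems on the top-`top_n` MI-gain sentences and on the rest.

    records: list of dicts with the per-sentence MI gain, the number of
             reference characters and the error counts of both systems.
    Returns {group: (accuracy_base, accuracy_new)} in percent.
    """
    ranked = sorted(records, key=lambda r: r["mi_gain"], reverse=True)
    groups = {"top": ranked[:top_n], "other": ranked[top_n:], "total": ranked}

    def accuracy(rows, key):
        chars = sum(r["chars"] for r in rows)
        return 100.0 * (chars - sum(r[key] for r in rows)) / chars

    return {name: (accuracy(rows, "err_base"), accuracy(rows, "err_new"))
            for name, rows in groups.items()}

# Invented per-sentence records, not the experimental data.
toy_records = [
    {"mi_gain": 0.1, "chars": 10, "err_base": 1, "err_new": 1},
    {"mi_gain": 0.9, "chars": 10, "err_base": 4, "err_new": 1},
    {"mi_gain": 0.0, "chars": 20, "err_base": 2, "err_new": 2},
]
summary = group_accuracies(toy_records, top_n=1)
print(summary)
```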
6. CONCLUSION
We presented a novel method for deriving a compact and efficient tone-dependent phoneme set for building a Chinese LVCSR system using an MI-based criterion. The preliminary experimental results showed the effectiveness of the method. Future work will incorporate acoustic confusability measurements into the criterion.
Acknowledgement
This research was supported in part by the National Institute of Information and Communications Technology.
7. REFERENCES
[1] C.-J. Chen et al., "New methods in continuous Mandarin recognition", Proc. of Eurospeech 1997, Vol. 3, pp. 1543-1546.
[2] Ch. Huang et al., "Segmental Tonal Modeling For Phone Set Design In Mandarin LVCSR", Proc. of ICASSP 2004, Vol. 1, pp. 901-904.
[3] F. Seide and N. Wang, "Phonetic modeling in the Philips Chinese continuous-speech recognition system", Proc. of Int. Symp. on Chinese Spoken Language Processing, 1998.
[4] C.-J. Chen et al., "Recognize tone languages using pitch information on the main vowel of each syllable", Proc. of ICASSP, 2001.
[5] Y. Xu, "Production and perception of coarticulated tones", J.A.S.A, 4, pp. 2240-2253, 1994.
[6] J.-S. Zhang et al., "Is tone recognition necessary for Chinese speech recognition?", Proc. of ASJ, Sep. 2002, pp. 5-6.
[7] M. Bacchiani and M. Ostendorf, "Using automatically-derived acoustic sub-word units in large vocabulary speech recognition", Proc. of ICSLP, 1998.
[8] D. B. Roe and M. D. Riley, "Prediction of word confusabilities for speech recognition", Proc. of ICSLP, pp. 227-230, 1994.
[9] A. Simons, "Predictive assessment for speaker independent isolated word recognisers", Proc. of Eurospeech, pp. 1465-1467, 1995.
[10] D. Torre et al., "Automatic alternative transcription generation and vocabulary selection for flexible word recognizers", Proc. of ICASSP, pp. 1463-1466, 1997.
[11] S. Young et al., "HTK Speech Recognition Toolkit ver. 3.2", Cambridge Univ.