
A Novel Concept to Improve the Redistribution Process for Language Models

Huang, Feng-Long
Department of Computer Science and Information Engineering, National United University, MiaoLi, 360, Taiwan
flhuang@nuu.edu.tw

Abstract
In this paper a new concept, based on non-uniform redistribution of probability to novel events, is proposed to improve smoothing methods for language models. A smoothing method basically consists of two processes: 1) discounting and 2) redistributing. Instead of the uniform probability assignment to each unseen event used by most smoothing methods, we propose a new technique to improve the redistribution process: referring to the probabilistic behavior of all seen events, the redistribution to novel events in our method is non-uniform. The proposed technique is applied to the well-known and frequently used Good-Turing smoothing method. Empirical results are demonstrated and analyzed for two n-gram models. The improvement is obvious and effective, especially at higher unseen event rates.

Keywords: Language model, Smoothing method, Good-Turing, Cross entropy, Non-uniform redistribution.

1. Introduction

Statistical Language Models
In many domains of natural language processing (NLP), such as speech recognition [1], grammar parsing [4], document retrieval [17] and machine translation [2], statistical language models (LMs) [6], [10] play an important role. LMs can be exploited, for instance, to decide the correct target word sequence w. As shown in Fig. 1 for a speech recognition system, P(W) is the conditional probability of a word sequence W given the speech data S, where W = w_1 w_2 ... w_m is a candidate word sequence and m is the number of words in W. The predicted sequence w is the candidate W that maximizes the combined acoustic-model and language-model score.

[Fig. 1: LMs in a speech recognition system. Features extracted from the speech S are scored by the acoustic model (word match) and the language model (sentence match) to select the output word sequence w.]

A language model is regarded as a probability distribution over events or token sequences (texts) that models how often each sequence occurs as a sentence. The chain rule is used to decompose the probability prediction:

P(w_1^m) = P(w_1 w_2 ... w_m) = P(w_1) P(w_2 | w_1^1) P(w_3 | w_1^2) ... P(w_m | w_1^{m-1}) = P(w_1) ∏_{i=2}^{m} P(w_i | w_1^{i-1}),   (1)

where w_1^m denotes the word sequence with m words.

n-gram Model
Because real-world training corpora are finite, and to reduce the parameter space of word features, the probability of a given word is approximated using only its n-1 preceding words. The probability model for various n can be written as:

P(w_1^m) ≅ P(w_1) ∏_{i=2}^{m} P(w_i | w_{i-n+1}^{i-1}),   (2)

where w_{i-n+1}^{i-1} denotes the history of n-1 words for word w_i.
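As an informal illustration of Eqs. (1) and (2) (not part of the original paper), the short Python sketch below factors a sentence probability with the bigram approximation (n = 2) using relative-frequency estimates; the toy corpus and the helper names bigram_prob and sentence_prob are hypothetical.

    from collections import Counter
    from itertools import chain

    # Toy corpus; in practice the counts come from a large training corpus.
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

    unigram_counts = Counter(chain.from_iterable(corpus))
    bigram_counts = Counter((s[i - 1], s[i]) for s in corpus for i in range(1, len(s)))
    total_words = sum(unigram_counts.values())

    def bigram_prob(word, prev):
        """Relative-frequency estimate of P(word | prev) = C(prev word) / C(prev)."""
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    def sentence_prob(sentence):
        """Chain rule with bigram history: P(w1) * prod_i P(wi | w_{i-1}), as in Eqs. (1)-(2)."""
        p = unigram_counts[sentence[0]] / total_words
        for i in range(1, len(sentence)):
            p *= bigram_prob(sentence[i], sentence[i - 1])
        return p

    print(sentence_prob(["the", "dog", "ran"]))  # 0.0: the bigram ("dog", "ran") is unseen

The zero result for an unseen bigram is exactly the zero-count problem that motivates the smoothing discussion below.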

In many applications, the models for n = 1, 2 and 3 are called unigram, bigram and trigram models [1], [8] and [16], respectively.

In Eq. (2), the probability of each event or token can be obtained by training the model (for clarity, the bigram model is illustrated here). The probability of a word bigram is written as:

P(w_i | w_{i-1}) = C(w_{i-1} w_i) / ∑_w C(w_{i-1} w),   (3)

where C(.) is the count of the corresponding word or word pair in the training corpus. The probability P of Eq. (3) is the relative frequency, and this method is called maximum likelihood estimation (MLE).

Data Sparsity Issue in Language Models
As shown in Eq. (3), C(.) of a novel word, which does not occur in the corpus, may be zero because the training data are limited while language is unbounded; it is always hard to collect sufficient data. The fundamental issue of MLE is that the probability of an unseen event is exactly zero, the so-called zero-count problem. Obviously, a zero count leads to a zero probability P(.) in Eqs. (2) and (3); this is a data sparsity issue. Predicting zero probability for an event is unreliable and infeasible for most applications, especially for language models. Smoothing techniques [3], [4], [11] and [19] are therefore essential and are employed by language models to overcome the zero-count issue described above. There are many smoothing methods, such as Add-1, Good-Turing [6], deleted interpolation [7], Katz [13], etc., and several works discuss smoothing methods in detail [3], [4], [12], [14], [15], [16] and [18].

2. Smoothing Processes in LMs
The basic idea of smoothing is to adjust the total probability of seen events and leave some probability mass (the so-called escape probability, Pesc) for unseen events. Smoothing algorithms can be viewed as discounting some counts of seen events in order to obtain the escape probability Pesc, which is then assigned to the zero-count unseen events. The adjustment of the smoothed probability for all possible events involves two processes, discounting and redistributing:

Discounting Process
The probabilities of all seen and unseen events must sum to 1 (unity). The first operation of a smoothing method is the discounting process, which discounts the probability of all seen events. The adjustment can be divided into two types: static and dynamic. Static smoothing methods, which include most smoothing methods, discount the probability based on the frequency of events in the training corpus. Dynamic smoothing methods, e.g., cache-based language models, discount the probability based on the frequency of seen events in both the cache and the training corpus.

Redistributing Process
In this operation, the escape probability Pesc obtained from all seen events is redistributed to all unseen events. Pesc is usually shared by all unseen events; that is, the escape probability is redistributed uniformly, each unseen event receiving Pesc/U, where U is the number of unseen events of the language. This is the redistribution process of most well-known smoothing methods, such as Add-1, absolute discounting, Good-Turing, deleted interpolation, back-off and Witten-Bell: the escape probability Pesc (the probability mass reserved for unseen events) is uniformly shared by all unseen events. This is a possible factor that affects the performance of a smoothing algorithm.
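The following minimal sketch (not the paper's implementation) illustrates the two processes on seen counts: an absolute-discounting style step that subtracts an assumed discount d from each seen count, and the uniform redistribution of the freed mass Pesc as Pesc/U; the function name smooth_uniform, the discount value and the toy counts are assumptions.

    from collections import Counter

    def smooth_uniform(counts, vocab_size, d=0.5):
        """Discount each seen count by d, then share the freed probability mass
        (the escape probability Pesc) uniformly over the U unseen event types."""
        total = sum(counts.values())
        seen = {e: (c - d) / total for e, c in counts.items()}   # discounting process
        p_esc = d * len(counts) / total                          # escape probability Pesc
        unseen = vocab_size - len(counts)                        # U unseen event types
        p_unseen = p_esc / unseen if unseen else 0.0             # uniform share Pesc / U
        return seen, p_unseen

    # Toy example: 3 seen event types out of a vocabulary of 10 possible events.
    counts = Counter({"a": 5, "b": 3, "c": 2})
    seen, p_unseen = smooth_uniform(counts, vocab_size=10)
    print(sum(seen.values()) + p_unseen * 7)   # ~1.0: probabilities still sum to unity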
Few previous papers discuss how to redistribute the escape probability Pesc, and how a different redistribution can improve the smoothing methods used for language models.

3. Improving the Smoothing Process

Interval Behavior of Seen Event Counts
As described in the previous section, a smoothing method has two main processes: discounting and redistributing. Within the redistribution process, the escape probability Pesc (the probability mass for unseen events) is shared uniformly by all unseen events in most smoothing methods, such as Add-one, deleted interpolation and Witten-Bell methods A and C; in other words, each unseen event obtains the same smoothed probability Pesc/U.

However, observing the behavior of seen events shows that each event has its own probability, depending on its frequency in the training corpus, and that the probability distribution differs considerably from event to event. It is therefore unreasonable to assign the same probability to every incoming unseen event.

Based on empirical results, we can obtain the frequency interval (offset) between two successive new events for two models: Chinese word unigrams and Chinese character bigrams. The source training data contain 100M (10^8) Chinese characters; the sentences are segmented into words, yielding 65M (65*10^6) words with an average length of 1.45 Chinese characters per word. The resource files are randomly selected to obtain the offset diagrams; more than 100 training runs are performed and the final curve is obtained by averaging. The regression curves Y1 and Y2 for the Chinese word unigram and character bigram models are:

Y1 = 1*10^-10 x^3 - 4*10^-6 x^2 + 0.0307 x - 39.825,
Y2 = -1*10^-16 x^4 + 2*10^-11 x^3 - 6*10^-7 x^2 + 0.0058 x - 3.7502,

where x denotes the data size and Y the offset. The idea behind redistributing the escape probability Pesc is to ask how many tokens must be read in before the next new event occurs, i.e., the count interval between two successive distinct events, which varies with the training data size N. Basically, the larger N is, the larger the interval. At the beginning of the training phase new events appear after short count intervals; that is, new events occur rapidly at small N and slowly at large N. The empirical regression curves capture the general trend of the original intervals, which increases gradually: the curves vary with N, being flatter at the beginning and steeper toward the end. The regression curve is employed to calculate the smoothed probability, as described next.

Redistributing Process for Unseen Events
As described above, the regression curves obtained from seen events can be used to characterize the intervals of unseen events and, from them, to estimate the probability assigned to the next incoming unseen event. Note that the probabilities of all seen and unseen events must sum to unity, the basic statistical condition. Suppose the interval is y_i for training data N_i; the distribution over all unseen events is then:

d_i = (1/y_i) / ∑_{j=1}^{U} (1/y_j),   (4)

where y_i denotes the interval at location i and U denotes the number of unseen event types. 1/y_i can be regarded as the derivative at y_i and serves as the (unnormalized) probability of an unseen event. The smoothed probability assigned to an unseen event U_i is:

p_i = Pesc * d_i.   (5)

Referring to Eqs. (4) and (5), the total smoothed probability of all unseen events is Pesc, and the probabilities of all seen and unseen events sum to unity.
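A minimal Python sketch of the non-uniform redistribution of Eqs. (4) and (5), assuming the interval values y_i for the U unseen events have already been read off the regression curve (the numeric values and the function name below are illustrative only):

    def redistribute_nonuniform(p_esc, intervals):
        """Share the escape probability Pesc among U unseen events in proportion
        to 1/y_i (Eq. 4); shorter predicted intervals receive more mass (Eq. 5)."""
        inv = [1.0 / y for y in intervals]        # 1/y_i as an unnormalized weight
        norm = sum(inv)                           # sum over j of 1/y_j
        return [p_esc * w / norm for w in inv]    # p_i = Pesc * d_i

    # Hypothetical intervals y_i from the regression curve for four unseen events.
    p_unseen = redistribute_nonuniform(p_esc=0.15, intervals=[20.0, 35.0, 60.0, 110.0])
    print(p_unseen, sum(p_unseen))                # individual shares; the total equals Pesc

By construction the shares sum to Pesc, so the unity condition on seen plus unseen probability is preserved.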
4. Evaluation
The proposed method is evaluated on the well-known and popular Good-Turing smoothing technique. A cut-off value on word counts is usually used to improve the technique; based on the empirical results, the best cut-off value can be obtained for various training data sizes N.

Basic Idea of the Good-Turing Method
The Good-Turing method is a well-known and effective smoothing technique, first described by I. J. Good, who credited A. M. Turing, in 1953 [7]; it originated in their work on deciphering the German Enigma code during World War II. Some previous works are [4] and [9].

The notation n_c denotes the number of n-grams that occur exactly c times in the corpus. For example, n_0 is the number of n-grams with zero count and n_1 is the number of n-grams that occur exactly once in the training data. In Good-Turing smoothing, the adjusted count c* is expressed in terms of n_c, n_{c+1} and c as:

c* = (c + 1) n_{c+1} / n_c.   (6)
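To make Eq. (6) concrete, here is a short, hedged sketch (not the author's code) that computes Good-Turing adjusted counts from a count-of-counts table and applies the adjustment only below a cut-off, anticipating the cut-off discussion in the next subsection; the cut-off value, the helper name good_turing_counts and the toy counts are illustrative.

    from collections import Counter

    def good_turing_counts(counts, cutoff=5):
        """Adjusted counts c* = (c + 1) * n_{c+1} / n_c (Eq. 6).
        Events with c >= cutoff, or with an empty n_{c+1} bucket, keep their raw count."""
        n = Counter(counts.values())            # n_c: number of event types seen exactly c times
        adjusted = {}
        for event, c in counts.items():
            if c < cutoff and n[c + 1] > 0:
                adjusted[event] = (c + 1) * n[c + 1] / n[c]
            else:
                adjusted[event] = float(c)      # fall back to the raw count
        return adjusted

    # Toy bigram counts: n_1 = 3, n_2 = 2, n_3 = 1, so a count of 1 becomes 2 * 2 / 3.
    counts = Counter({"ab": 1, "bc": 1, "cd": 1, "de": 2, "ef": 2, "fg": 3})
    print(good_turing_counts(counts))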

Best Cut-off Value in the Good-Turing Smoothing Method
Most previous works on smoothing methods discuss the situation in which the number of possible event types B is much larger than the training data size N (N/B << 1), such as word trigram models for English text or character trigram models for Chinese. However, a different situation must be considered in certain applications. For instance, the number of event types B for the Chinese character bigram model is close to 1.69*10^8, while the training data size N is in general less than 1*10^8; in such a case the ratio N/B is smaller than 1 but not vanishingly small. The cut-off value c_o for the event count is used to improve Good-Turing smoothing, as shown in the previous section. The best c_o for various training data sizes N should be analyzed to obtain a better improvement.

Cross Entropy
In this subsection we introduce cross entropy, which is widely used to evaluate and compare probabilistic models. For a testing data set T containing a set of events e_1, e_2, ..., e_m, the probability of the testing set, P(T), is:

P(T) = ∏_{i=1}^{m} P(e_i),   (7)

where m denotes the number of events in the testing set T and P(e_i) denotes the probability assigned to event e_i by the n-gram language model. When the actual probability distribution p that generated the data is unknown, the cross entropy CE of a model M of p can be used:

CE(p, M) = lim_{n→∞} -(1/n) ∑_{W∈L} p(w_1 w_2 w_3 ... w_n) log M(w_1 w_2 w_3 ... w_n).   (8)

Data Sets and Empirical Models
In the following experiments, two text sources are collected: news texts and the Academia Sinica Balanced Corpus (ASBC), containing 100M and 10M Chinese characters, respectively. We construct two language models to evaluate the cross entropy (CE) of the technique for improving the smoothing process: 1) a word unigram model (the average word length is 1.45 characters), and 2) a Chinese character bigram model. The cross entropy is calculated for various data sizes N in our experiments.

Comparing uniform with non-uniform redistribution of probability to unseen events, Fig. 2 and Fig. 3 display the cross entropy (CE), unseen event rates and improvements for different cut-offs c_o on various N for the word unigram and character bigram models, respectively. The best cut-off c_o can be found at each N for both models. For the word unigram model, the best CE improvement reaches nearly 1.8% at N = 0.5M, and the effectiveness decreases as N grows, as shown in Fig. 2. For the bigram model, the best CE improvement reaches nearly 14.3% at N = 1M, and the effectiveness likewise decreases for larger N, as shown in Fig. 3. Both models reach lower CE when the cut-off and the non-uniform redistribution technique are exploited, and the improvement is larger especially at higher unseen event rates. This means that a smaller training data set N can reach the same performance as a larger one when the proposed technique is used to improve the smoothing process.
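As an informal companion to Eqs. (7) and (8) (not taken from the paper, whose exact evaluation protocol may differ), the sketch below approximates the cross entropy of a test set as the average of -log2 P(e_i) over its events, the usual empirical estimate when the true distribution p is unknown; the probability table and the function name cross_entropy are hypothetical.

    import math

    def cross_entropy(test_events, model_prob):
        """Empirical cross entropy: average negative log2 model probability per event."""
        total = 0.0
        for e in test_events:
            p = model_prob(e)
            if p <= 0.0:
                return float("inf")   # a zero probability makes CE diverge, hence smoothing
            total += math.log2(p)
        return -total / len(test_events)

    # Hypothetical smoothed bigram probabilities for a tiny test set.
    probs = {("the", "cat"): 0.10, ("cat", "sat"): 0.05, ("sat", "down"): 0.02}
    print(cross_entropy(list(probs), lambda e: probs.get(e, 1e-6)))   # bits per event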
5. Conclusions
In this paper we have proposed an effective technique, based on non-uniform redistribution of probability to novel events, to improve the redistribution process of smoothing methods for language models. Smoothing is used to resolve the zero-count problem of traditional language models, and the cut-off c_o on event counts is used to handle the zero-n_c issue of Good-Turing smoothing.

Based on the probabilistic behavior of seen events, the redistribution process exploited by our technique is non-uniform. The improvements discussed in the paper are apparent and effective for the Good-Turing smoothing method.

Empirical results are demonstrated and analyzed for the two language models discussed in the paper, Chinese word unigrams and Chinese character bigrams, to evaluate the proposed technique; the cross entropy is reduced for both models. Both models reach lower CE when various cut-offs c_o on different training data N and the non-uniform redistribution probability are used. The two methods together improve performance especially at higher unseen event rates; in other words, the CE improves most for applications with small training data N. The best CE improvement reaches 1.8% and 14.3% for the word unigram and character bigram models, respectively.

References
[1] Brown P. F., Della Pietra V. J., deSouza P. V., Lai J. C., and Mercer R. L., 1992, Class-Based n-gram Models of Natural Language, Computational Linguistics, Vol. 18, pp. 467-479.
[2] Brown P. F., Cocke J., Della Pietra S. A., Della Pietra V. J., Jelinek F., Lafferty J. D., Mercer R. L., and Roossin P. S., 1990, A Statistical Approach to Machine Translation, Computational Linguistics, Vol. 16, No. 2, pp. 79-85.
[3] Chen S. F. and Goodman J., 1999, An Empirical Study of Smoothing Techniques for Language Modeling, Computer Speech and Language, Vol. 13, pp. 359-394.
[4] Church K. W., 1988, A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, Proceedings of the 2nd Conference on Applied Natural Language Processing, pp. 136-143.
[5] Church K. W. and Gale W. A., 1991, A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating Probabilities of English Bigrams, Computer Speech and Language, Vol. 5, pp. 19-54.
[6] Essen U. and Steinbiss V., 1992, Cooccurrence Smoothing for Stochastic Language Modeling, IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 161-164.
[7] Good I. J., 1953, The Population Frequencies of Species and the Estimation of Population Parameters, Biometrika, Vol. 40, pp. 237-264.
[8] Jelinek F., 1997, Statistical Methods for Speech Recognition, MIT Press.
[9] Jelinek F. and Mercer R. L., 1980, Interpolated Estimation of Markov Source Parameters from Sparse Data, Proceedings of the Workshop on Pattern Recognition in Practice, North-Holland, Amsterdam, The Netherlands, pp. 381-397.
[10] Gao J., Nie J.-Y., Wu G., and Cao G., 2004, Dependence Language Model for Information Retrieval, Proceedings of SIGIR '04, Sheffield, UK.
[11] Jurafsky D. and Martin J. H., 2000, Speech and Language Processing, Prentice Hall.
[12] Juang B. H. and Lo S. H., 1994, On the Bias of the Good-Turing Estimate of Probabilities, IEEE Trans. on Signal Processing, Vol. 42, No. 2, pp. 496-498.
[13] Katz S. M., March 1987, Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-35, pp. 400-401.
[14] Kneser R. and Ney H., 1995, Improved Backing-Off for M-gram Language Modeling, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 181-184.
[15] Nádas A., 1985, On Turing's Formula for Word Probabilities, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-33, pp. 1414-1416.
[16] Ney H. and Essen U., 1991, On Smoothing Techniques for Bigram-Based Natural Language Modeling, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 825-828.
[17] Kurland O. and Lee L., 2004, Corpus Structure, Language Models, and Ad Hoc Information Retrieval, Proceedings of SIGIR '04, Sheffield, UK.
[18] Chen S. F. and Rosenfeld R., January 2000, A Survey of Smoothing Techniques for ME Models, IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 1, pp. 37-50.
[19] Witten I. H. and Bell T. C., 1991, The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression, IEEE Transactions on Information Theory, Vol. 37, No. 4, pp. 1085-1094.

[Figure 2: Cross entropy, unseen event rates and CE improvements for different cut-offs c_o on various training data N for the word unigram model. Left panel: cross entropy (CE) for the word unigram model with uniform redistribution, for cut-offs k = 5, 8, 10, 15, 20, 25 and the full model; right panel: unseen event rate and CE improvement using the regression curve.]

[Figure 3: Cross entropy, unseen event rates and CE improvements for different cut-offs c_o on various training data N for the character bigram model. Left panel: cross entropy (CE) with different count cut-offs based on uniform redistribution; right panel: unseen event rate and CE improvement based on the regression curve.]

