
5.2 General Discussion

5.2.3 Implication to Latent Variable Modeling

Part of our effort is devoted to extending the idea of information preservation to latent variable modeling. In one special case, where the dependence relation between observed and latent variables appears to be deterministic, we find that conventional inference algorithms, such as expectation-maximization (EM) or Markov chain Monte Carlo (MCMC), may fail to work. These methods exploit the correlation between observed and latent variables to find the best latent model, whereas in the deterministic case this correlation may no longer be useful in guiding the search.

To deal with this problem, we develop an inference method, called utility-bias trade-off, based on the iterative approximation techniques that we have used in various information preservation problems. The proposed framework complements the conventional approaches. Specifically, it relies on two quantities, utility and bias, to guide the search for the best latent model. Since this result is incomplete and does not entirely fit into the big picture of this thesis, we have moved the details out of the main text. Interested readers are referred to Appendix A.3.

5.3 Concluding Remarks

The foremost contribution of this thesis is the development of information preservation. This concept provides a unified way of modeling different optimization strategies. The proposed principle can be applied to classic learning problems in probabilistic modeling, such as regression and cluster analysis, and to two natural language applications, namely unsupervised word segmentation and static index pruning. The latter applications demonstrate that our method is suitable for complicated cases where other mathematical principles do not fit.

Our approach provides a common ground for relating various optimization principles, such as maximum and minimum entropy methods. In our framework, the optimization process is directed toward approximating a reference hypothesis, an essential concept that may have been implicit in conventional methods.

Making this concept explicit improves our understanding of how the entropy-based optimization criteria work. It also resolves the apparent incompatibility between entropy maximization and minimization, since from the viewpoint of information preservation, the two principles differ only in the target to approximate.

Our experimental study of unsupervised word segmentation and static index pruning has produced new methodologies for these problems. For unsupervised word segmentation, our regularization approach has significantly boosted the segmentation accuracy of an ordinary compression method, and achieved performance comparable to several state-of-the-art methods in terms of efficiency and effectiveness. For static index pruning, our approach suggests a new way of prioritizing index entries. The proposed information-based measure has achieved state-of-the-art performance in this task, and it has done so more efficiently than the other methods.

Many interesting issues have been uncovered and remain open in this thesis work.

In the cluster analysis problem, our approach leads to a new regularization method that has not been discussed before; estimation of the normalization factor of the error distribution is also of theoretical value. Similarly, a numerical approximation specifically tailored for information preservation would have a large impact on most of the problems that we have discussed in this thesis. We also expect to see more applications of information preservation to maximum cross-entropy problems, since the former may serve as an economical approximation to the latter. We believe that these directions will lead to fruitful results.

There are some related natural language problems to which we look forward to applying information preservation, such as text summarization (as a sentence pruning problem) and named-entity recognition (as a specialized segmentation problem). Success in these essential tasks will broaden the impact of this approach. Moreover, we expect this thesis work to stimulate new conversations and studies on the mathematical principles that underpin probabilistic methods. A deeper understanding of these principles may produce new applications or new methodologies for probabilistic modeling, and eventually lead to breakthroughs in natural language processing.


Appendix A

Supplementary Details

A.1 Representation System and Entropy

We define a representation system for a set of concepts. A representation system is defined as a 3-tuple R = (C, A, g), where C denotes the set of concepts, A denotes the alphabet, and g denotes the ruleset. The definitions for the three components are given below.

• Concepts: Let C be the set of concepts, in which each concept refers to a semantic unit that we use to describe concrete ideas. There is no direct connection between a concept and any language construct in written or spoken form. We further assume that any form of higher-level knowledge can be expressed as a sequence of concepts.

• Alphabet: We also need to define a symbol system to carry information, because exchange of knowledge takes place in a concrete form, e.g., as a piece of text or speech. This set of symbols is called the alphabet, which is denoted as A. The alphabet has to be finite but not necessarily closed. Any concept, in order to be understood, has to be represented as a non-empty sequence of symbols in A. Usually we call such a sequence a word.

• Ruleset: We assume that a set of rules g exists that maps each concept to a word. In fact, g acts as an injective function from C to A⁺; we choose the ruleset definition just to keep this notion flexible. The definition of g is shared among all language users. To communicate ideas, therefore, one employs g to express the concepts in her mind as a passage (i.e., a sequence of words) and passes it on; the recipient then recovers the sequence of concepts by consulting g to interpret the passage. Note that the interpretation of a passage may not be unique when more than one concept can map to the same word.

For any set of concepts C, one can form a trivial representation system R0 by letting the alphabet be of the same size as C and the ruleset be an identity mapping. In mathematical terms, R0 = (C, C, g0) where g0(c) = c for all c ∈ C. In other words, this representation system corresponds to a language system in which there are as many symbols as concepts.

On one hand, this representation system is efficient, because it takes only one symbol to represent any concept. On the other hand, this system is also very verbose, because its alphabet is unbearably large.
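As a rough illustration of the definitions above, the following Python sketch models a representation system R = (C, A, g) and the trivial system R0. The class name, the method names, and the choice to store g as a dictionary from concepts to tuples of symbols are assumptions made for this example only; they are not part of the formal definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentationSystem:
    """Hypothetical model of R = (C, A, g)."""
    concepts: frozenset   # C, the set of concepts
    alphabet: frozenset   # A, the finite set of symbols
    rules: dict           # g, mapping each concept to a word (tuple of symbols)

    def __post_init__(self):
        if set(self.rules) != set(self.concepts):
            raise ValueError("g must be defined on every concept in C")
        words = list(self.rules.values())
        for w in words:
            # A word is a non-empty sequence of symbols drawn from A.
            if len(w) == 0 or not set(w) <= self.alphabet:
                raise ValueError("each word must be a non-empty sequence over A")
        # g is treated as injective: no two concepts share a word.
        if len(set(words)) != len(words):
            raise ValueError("g must be injective")

    def express(self, concept_seq):
        """Express a sequence of concepts as a passage (a sequence of words)."""
        return [self.rules[c] for c in concept_seq]


def trivial_system(concepts):
    """Build R0 = (C, C, g0), where g0 maps each concept to itself."""
    cs = frozenset(concepts)
    return RepresentationSystem(cs, cs, {c: (c,) for c in cs})


r0 = trivial_system({"dog", "run"})
passage = r0.express(["dog", "run"])   # every word has length one
```

In the trivial system every word consists of a single symbol, and the alphabet grows with the number of concepts.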

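Recall that the empirical entropy for X with respect to a sequence of N observations, in which each token x ∈ χ occurs n_x times, is defined as:

$$\tilde{H}(X) = -\sum_{x \in \chi} \frac{n_x}{N} \log \frac{n_x}{N}$$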

Lemma 1. Let X be a random variable over a set of tokens χ. Suppose that we observe a sequence of N tokens drawn from this distribution, and let n_x be the number of occurrences of any x ∈ χ in the sequence. For any z ∈ χ, if the sequence is sufficiently long that n_z/N approaches 0, then the empirical entropy for X decreases when we replace each occurrence of z in the sequence with two tokens a and b, where a, b ∈ χ.

Proof. Let x denote the original sequence and x′ the new sequence after the replacement. We write the difference between the empirical entropy for x′ and that for x as follows.

We show that the last two terms converge to 0 less rapidly than the first two by multiplying both sides by N + n_z and letting n_z/N go to 0. As a result, every term on the right-hand side vanishes except the last two; the equation now reads:

$$\lim_{n_z/N \to 0} (N + n_z)\left(\tilde{H}(X)^{x'} - \tilde{H}(X)^{x}\right) =$$

The difference is obviously less than 0.
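The quantities compared in Lemma 1 can be computed directly for any concrete sequence. The sketch below is only a numerical companion to the lemma: it takes empirical entropy in its usual plug-in form, uses base-2 logarithms (an assumption, since the base is not fixed here), and the function names are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(seq):
    """Plug-in entropy -sum_x (n_x/N) * log2(n_x/N) of a token sequence."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def replace_token(seq, z, a, b):
    """Return a new sequence in which every occurrence of z becomes a, b."""
    out = []
    for tok in seq:
        if tok == z:
            out.extend((a, b))
        else:
            out.append(tok)
    return out


x = list("aabbzab")
x_new = replace_token(x, "z", "a", "b")
difference = empirical_entropy(x_new) - empirical_entropy(x)
```

Evaluating `difference` on a long sequence in which z is rare gives a concrete instance of the entropy gap that the proof analyzes.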

A.2 Rewrite-Update Procedure

The proposed algorithm relies on an efficient implementation of Steps 2a and 2b to achieve satisfactory performance. To explore this issue, we make the simplifying assumption that every translation rule considered is of the form:

w → xy,

where w ∈ W and x, y ∈ C. Note that this assumption is made for ease of discussion; it is possible to adapt the aforementioned algorithm to a more general setting.

In the following paragraphs, we motivate the need for developing a rewrite-update procedure to further enhance the performance:

• In order to solve the optimization problem in Step 2b, we need to iterate through all the possible bigram sequences, gather the required statistical quantities, and compute the objective value for each sequence. Specifically, to compute entropy we need access to unigram and bigram frequencies; these quantities, however, do not stay constant across iterations.

• To alter the sequence in Step 2b, we need fast access to the positions at which the proposal xy occurs. A usual solution is to employ an indexing structure that keeps track of the set of positions at which a specific bigram occurs in the text stream. In this case, we need something more than a static indexing structure, since in each iteration new tokens (and, accordingly, new bigrams) are introduced into the sequence.

It is immediately clear that the major challenge resides in data management. A static data store does not fit this scenario, since small changes are applied to the sequence in every iteration, and reflecting those changes back to the data becomes the key to computational efficiency. At first glance, it may seem that a two-pass scan through the sequence is inevitable. We notice, however, that the number of changes to the statistical quantities is linear in the number of occurrences of the subsequence xy. In other words, only a limited number of tokens and bigrams are affected by the change we introduce in Step 2b.

Consider the following snippet of the sequence. Let a denote the token that precedes x and b the token that follows y. The original sequence is:

. . . a x y b . . .

In Step 2b, we introduce a new token z to replace this occurrence of bigram xy. The resulting sequence becomes:

. . . a z b . . .

We can then divide the updates needed to reflect this replacement into the following four classes.

1. Decrease the unigram frequencies for x and y by 1, respectively. Remove the corresponding positions in the posting lists for x and y.

2. Decrease the bigram frequencies for ax, xy, and yb by 1, respectively. Revise the corresponding posting lists for these bigrams as well.

3. Increase the unigram frequency for z by 1 and add this position to the posting list for z.
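The local nature of these updates can be sketched in Python as follows. This is a simplified illustration under several assumptions. It maintains only the unigram and bigram frequency tables and omits the posting lists. It finds occurrences of xy with a left-to-right scan instead of an index lookup. It also increments the new bigrams az and zb, which is our guess at the remaining class of updates and is not taken from the list above. All function names are hypothetical.

```python
from collections import Counter


def count_ngrams(seq):
    """Count unigrams and adjacent bigrams from scratch."""
    return Counter(seq), Counter(zip(seq, seq[1:]))


def update_counts(uni, bi, a, x, y, b, z):
    """Update the tables in place for one rewrite  a x y b  ->  a z b.

    a or b is None when the occurrence touches a sequence boundary.
    """
    # Class 1: x and y each lose one occurrence.
    uni[x] -= 1
    uni[y] -= 1
    # Class 2: the bigrams ax, xy, and yb each lose one occurrence.
    if a is not None:
        bi[(a, x)] -= 1
    bi[(x, y)] -= 1
    if b is not None:
        bi[(y, b)] -= 1
    # Class 3: z gains one occurrence.
    uni[z] += 1
    # Assumed remaining class: the new bigrams az and zb gain one occurrence.
    if a is not None:
        bi[(a, z)] += 1
    if b is not None:
        bi[(z, b)] += 1


def rewrite(seq, x, y, z, uni, bi):
    """Replace non-overlapping occurrences of x y with z, updating counts."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == x and seq[i + 1] == y:
            a = out[-1] if out else None          # left neighbor after earlier rewrites
            b = seq[i + 2] if i + 2 < len(seq) else None
            update_counts(uni, bi, a, x, y, b, z)
            out.append(z)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


seq = list("abxyxycxy")
uni, bi = count_ngrams(seq)
new_seq = rewrite(seq, "x", "y", "z", uni, bi)
```

Each call to `update_counts` touches a constant number of table entries, so the total bookkeeping cost is linear in the number of occurrences of xy. Recounting `new_seq` from scratch with `count_ngrams` gives a simple consistency check on the incremental tables.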
