
Conclusion and Future Work


In this dissertation, we focus on concept representation for building intelligent systems. For this purpose, we define a concept as a continuation, a kind of temporary state in the human process of concept computation. We place the continuation in the context of the evolutionary language game and, based on this setting, discuss some theoretical aspects of the definition.

We also consider concept theories, which concern the world behind the mathematical models.

We derive some conclusions on three kinds of stability: input stability, test stability, and dogma stability.

We propose a concept representation scheme that combines a static frame structure with a dynamic explicitization process. The scheme features transparency and flexibility as its core advantages, and we intend it to be adoptable in many different tasks.

To demonstrate the application of our concept definition, we apply it to two problems: commonsense knowledge classification and word sense disambiguation (WSD). Commonsense knowledge classification demonstrates how to use the definition in a traditional machine learning process. We further use the definition to derive new concepts for the WSD problem: we investigate concept appropriateness and concept fitness in the relation between a concept and its context, which parallels the relation between a continuation and its context, and we use these two concepts to formulate new algorithms for learning WSD models.

In addition to the concept definition, we conduct experiments showing that texts contain commonsense knowledge; text is therefore a good source for mining evidence to test a system's level of understanding. Our experiments also show that ClueWeb09 is a good knowledge source even though it contains only a small part of the whole web.

We preprocess ClueWeb09 and produce three datasets for researchers. The first is a POS-tagged and phrase-chunked English dataset. The second is a Chinese dataset that is segmented, POS-tagged, and annotated with discourse markers. The third is the NTU PN-Gram corpus, a Chinese n-gram dataset with POS information (n in [1, 5]). We have designed a web system for general users to access the NTU PN-Gram corpus; it is designed to speed up queries over this large n-gram dataset.

In the future, we want to investigate our concept definition more deeply in many respects, such as developing the internal architecture of a continuation and grounding these structures in philosophical viewpoints. We also want to apply our concept representation scheme to more problems, and we hope this concept definition can ultimately help researchers build a truly intelligent system.


REFERENCES

Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. In Proceedings of the 18th Annual Conference on Learning Theory (pp. 32–47). Berlin: Springer.

Agirre, E., & Edmonds, P. (Eds.). (2006). Word sense disambiguation: Algorithms and Applications. Springer.

Ando, R. K. (2006). Applying alternating structure optimization to word sense disambiguation. In Proceedings of the Tenth Conference on Computational Natural Language Learning, 77–84. Association for Computational Linguistics.

Barabanov, N. E., & Prokhorov, D. V. (2002). Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, 13(2), 292–303.

Barker, C. (2004). Continuations in natural language. In Proceedings of the Fourth ACM SIGPLAN Continuations Workshop.

Bengio, Y. (2008). Neural net language models. Scholarpedia, 3(1), 3881.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.

Cankaya, H. C., & Moldovan, D. (2009). Method for extracting commonsense knowledge. In Proceedings of the Fifth International Conference on Knowledge Capture (pp. 57–64). ACM.


Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, E. R., & Mitchell, T. M. (2010). Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010).

Carpuat, M., & Wu, D. (2005). Word sense disambiguation vs. statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 387–394. Stroudsburg, PA, USA: Association for Computational Linguistics.

Carpuat, M., & Wu, D. (2007). Improving statistical machine translation using word sense disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), 61–72.

Chklovski, T. (2003). Learner: A system for acquiring commonsense knowledge by analogy. In Proceedings of the 2nd International Conference on Knowledge Capture (pp. 4–12). ACM.

Chklovski, T., & Gil, Y. (2005). An analysis of knowledge collected from volunteer contributors. In Proceedings of the 20th National Conference on Artificial Intelligence, 564–570. AAAI Press.

Chomsky, N. (1986). Knowledge of Language: Its Nature, Origin, and Use. Praeger.

Clark, P., & Harrison, P. (2009). Large-scale extraction and use of knowledge from text. In Proceedings of the Fifth International Conference on Knowledge Capture (pp. 153–160). ACM.



Dhillon, P. S., & Ungar, L. H. (2009). Transfer learning, feature selection and word sense disambiguation. In Proceedings of the ACL-IJCNLP 2009 Conference, 257–260. Stroudsburg, PA, USA: Association for Computational Linguistics.

Edmonds, P., & Cotton, S. (2001). Senseval-2: Overview. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems (pp. 1–5). Association for Computational Linguistics.

Erk, K., & McCarthy, D. (2009). Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 440–449. Association for Computational Linguistics.

Erk, K., McCarthy, D., & Gaylord, N. (2009). Investigations on word senses and word usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 10–18. Association for Computational Linguistics.

Erk, K., & Padó, S. (2008). A structured vector space model for word meaning in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 897–906. Association for Computational Linguistics.

Escudero, G., Màrquez, L., & Rigau, G. (2000). Naive Bayes and exemplar-based approaches to word sense disambiguation. Proceedings of the 14th European Conference on Artificial Intelligence, 421–425.

Etzioni, O., Banko, M., Soderland, S., & Weld, D. S. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.

Felleisen, M. (1988). The theory and practice of first-class prompts. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (pp. 180–190).

Florian, R., & Yarowsky, D. (2002). Modeling consensus: Classifier combination for word sense disambiguation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, 25–32.


Friedman, M. (1974). Explanation and scientific understanding. Journal of Philosophy, 71(1), 5–19.

Geng, X., Liu, T.-Y., Qin, T., Arnold, A., Li, H., & Shum, H.-Y. (2008). Query dependent ranking using K-nearest neighbor. In Proceedings of the 31st Annual International Conference on Research and Development in Information Retrieval (SIGIR), 115–122.

Girju, R., Badulescu, A., & Moldovan, D. (2006). Automatic discovery of part-whole relations. Computational Linguistics, 32(1), 83–135.

Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5), 447–474.

Gonzalo, J., & Verdejo, F. (2006). Automatic acquisition of lexical information and examples. In Word sense disambiguation: Algorithms and Applications. Agirre, Eneko and Edmonds, Philip (Eds.). Springer.

Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.

Hjørland, B. (2009). Concept theory. Journal of the American Society for Information Science and Technology, 60(8), 1519–1536.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, 133–142.

Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.

Jurafsky, D., & Martin, J. H. (2009a). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall.

Jurafsky, D., & Martin, J. H. (2009b). The representation of meaning. In Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.

Lee, Y. K., & Ng, H. T. (2002). An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 41–48. Association for Computational Linguistics.

Lee, Y. K., Ng, H. T., & Chia, T. K. (2004). Supervised word sense disambiguation with Support Vector Machines and multiple knowledge sources. In R. Mihalcea & P. Edmonds (Eds.), Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (pp. 137–140). Association for Computational Linguistics.

Lenat, D. B., & Guha, R. V. (1989). Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley Longman Publishing Co., Inc.

Liu, F., Yang, M., & Lin, D. (2010). Chinese Web 5-gram version 1. Linguistic Data Consortium.

Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.

Margolis, E., & Laurence, S. (2011). Concepts. In The Stanford Encyclopedia of Philosophy.

Markert, K., & Nissim, M. (2007). SemEval-2007 Task 08: Metonymy resolution at SemEval-2007. In Proceedings of the 4th International Workshop on Semantic Evaluations (pp. 36–41).

Martinez, D., de Lacalle, O. L., & Agirre, E. (2008). On the use of automatically acquired examples for all-nouns word sense disambiguation. Journal of Artificial Intelligence Research, 33(1), 79–107.

McCarthy, D., Koeling, R., Weeds, J., & Carroll, J. (2004). Finding predominant word senses in untagged text. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics.

Mihalcea, R., Chklovski, T., & Kilgarriff, A. (2004). The Senseval-3 English lexical sample task. In R. Mihalcea & P. Edmonds (Eds.), Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (pp. 25–28).

Mihalcea, R. F. (2002). Bootstrapping large sense tagged corpora. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas.

Mihalcea, R., & Moldovan, D. I. (1999). An automatic method for generating sense tagged corpora. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, 461–466.

Minsky, M. (1986). The Society of Mind. New York, NY: Simon & Schuster, Inc.

Mitchell, J., & Lapata, M. (2008). Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, 236–244.

Mueller, E. T. (2010). Commonsense Reasoning. Elsevier Science.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2), 10:1–10:69.

Navigli, R., & Lapata, M. (2010). An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4), 678–692.

Navigli, R., Litkowski, K. C., & Hargraves, O. (2007). SemEval-2007 Task 07: Coarse-grained English all-words task. In Proceedings of the 4th International Workshop on Semantic Evaluations (pp. 30–35).

Ng, H. T., & Lee, H. B. (1996). Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 40–47.

Nowak, M. A., Plotkin, J. B., & Krakauer, D. C. (1999). The evolutionary language game. Journal of Theoretical Biology, 200(2), 147–162.

Palmer, M., Fellbaum, C., Cotton, S., Delfs, L., & Dang, H. T. (2001). English tasks: All-words and verb lexical sample. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems (pp. 21–24).

Pantel, P., & Lin, D. (2002). Discovering word senses from text. In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, 613–619. ACM.

Plate, T. A. (1995). Holographic Reduced Representations. IEEE Transactions on Neural Networks, 6(3), 623–641.

Plate, T. A. (2003). Holographic Reduced Representation: Distributed Representation for Cognitive Structures. Stanford, CSLI Publications.

Plotkin, J. B., & Nowak, M. A. (2000). Language evolution and information theory. Journal of Theoretical Biology, 205, 147–159.

Priss, U. (2004). Linguistic applications of Formal Concept Analysis. In Proceedings of the


Sanderson, M. (1994). Word sense disambiguation and information retrieval. In Proceedings of the 17th Annual International Conference on Research and Development in Information Retrieval, 142–151.

Schubert, L. (2009). From generic sentences to scripts. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, Workshop: Logic and the Simulation of Interaction and Reasoning (LSIR).

Schubert, L., & Tong, M. (2003). Extracting and evaluating general world knowledge from the Brown corpus. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning - Volume 9 (pp. 7–13).

Schuemie, M. J., Kors, J. A., & Mons, B. (2005). Word sense disambiguation in the biomedical domain: An overview. Journal of Computational Biology, 12(5), 554–565.

Schwartz, H. A., & Gomez, F. (2009). Acquiring applicable common sense knowledge from the Web. In Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics (pp. 1–9).

Searle, J. (1980). Minds, brains and programs. Behavioral and Brain Sciences, 3(3), 417–457.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423.

Singh, P., Lin, T., Mueller, E. T., Lim, G., Perkins, T., & Zhu, W. L. (2002). Open Mind Common Sense: Knowledge acquisition from the general public. In On the Move to Meaningful Internet Systems (pp. 1223–1237). Springer-Verlag.

Snyder, B., & Palmer, M. (2004). The English all-words task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (pp. 41–43).

Sowa, J. F. (1984). Conceptual Structures: Information Processing in Mind and Machine. Boston: Addison-Wesley Longman Publishing Co., Inc.


Stevenson, M., & Guo, Y. (2010). Disambiguation in the biomedical domain: The role of ambiguity type. Journal of Biomedical Informatics, 43(6), 972–981.

Stevenson, M., Guo, Y., & Gaizauskas, R. (2008). Acquiring sense tagged examples using relevance feedback. In Proceedings of the 22nd International Conference on Computational Linguistics, 809–816.

Stokoe, C., Oakes, M. P., & Tait, J. (2003). Word sense disambiguation in information retrieval revisited. In Proceedings of the 26th Annual International Conference on Research and Development in Information Retrieval, 159–166.

Thater, S., Fürstenau, H., & Pinkal, M. (2010). Contextualizing semantic representations using syntactically enriched vector models. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 948–957.

Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 173–180.

Towell, G., & Voorhees, E. M. (1998). Disambiguating highly ambiguous words. Computational Linguistics, 24(1), 125–145.

Trapa, P. E., & Nowak, M. A. (2000). Nash equilibria for an evolutionary language game. Journal of Mathematical Biology, 41(2), 172–188.

Tseng, H., Chang, P., Andrew, G., Jurafsky, D., & Manning, C. (2005). A conditional random field word segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.

Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 384–394.

Turney, P. D., & Pantel, P. (2010). From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37(1), 141–188.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

Yu, C.-H., & Chen, H.-H. (2010). Commonsense knowledge mining from the Web. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 1480–1485.

Yu, C.-H., & Chen, H.-H. (2012a). Chinese web scale linguistic datasets and toolkit. In Proceedings of the 24th International Conference on Computational Linguistics, 501–508.

Yu, C.-H., & Chen, H.-H. (2012b). Detecting word ordering errors in Chinese sentences for learning Chinese as a foreign language. In Proceedings of the 24th International Conference on Computational Linguistics, 3003–3018.

Yu, C.-H., Tang, Y., & Chen, H.-H. (2012). Development of a web-scale Chinese word n-gram corpus with parts of speech information. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 320–324.

Zhong, Z., & Ng, H. T. (2012). Word sense disambiguation improves information retrieval. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 273–282.


APPENDICES

APPENDIX I. The Definition of definition

We propose that, in text, “a definition connects an object to other objects and uses criteria to rate the goodness of this connection.” We denote this viewpoint as Connection Rating. Here, we argue that this explanation of the term definition can subsume the three viewpoints in Friedman's (1974) article.

Friedman's article gives three viewpoints on scientific explanation (the quotes below are from Friedman's article):

1. D-N model: “According to the D-N model, a description of one phenomenon can explain a description of a second phenomenon only if the first description entails the second.”

2. Familiarity: “scientific explanations give us understanding of the world by relating (or reducing) unfamiliar phenomena to familiar ones.”

3. Intellectual Fashion: “the phenomenon doing the explaining must have a special epistemological status ... this status varies from scientist to scientist and from historical period to historical period. At any given time certain phenomena are regarded as somehow self-explanatory or natural.”

Because “explanation” is also a concept, we can interpret these three views using the four families of concept theories. For example, Intellectual Fashion obviously draws on concept theories from rationalism and historicism.

We can see that if the connection is restricted to having the entailment property, Connection Rating subsumes the D-N model viewpoint. If the connection is restricted to having the reduction property and the explanatory objects must be familiar ones, Connection Rating subsumes the Familiarity viewpoint. If the explanatory objects have a special epistemological status, and the explanatory objects and the rating function are time-variant and scientist-dependent, Connection Rating subsumes the Intellectual Fashion viewpoint.

Philosophers have in fact proposed many criteria for the rating function. For Karl Popper (1902–1994), the criterion for judging the goodness of a scientific theory is falsifiability. In logical positivism, the criterion is the verifiability of the explanatory objects. For Thomas Kuhn (1922–1996), the author of The Structure of Scientific Revolutions, the criteria change across scientific communities and historical periods. For scientists who follow the Galilean style, the criterion gives mathematical models higher priority than reality.

These criteria can be subsumed by the Connection Rating viewpoint, which gives a unified viewpoint of scientific explanation.


APPENDIX II. The Filtering of Noise Texts

To obtain a cleaner dataset for knowledge extraction, it is necessary to filter out as much noise as we can, especially because the ClueWeb09 dataset is composed of web pages. The filtering procedure is as follows.

1. Convert an HTML page to a sequence of Java Strings.

2. For each string, filter it out if it does not contain the knowledge we want.

We use the Jericho HTML Parser23 to convert HTML pages to pure text in RFC 3676 format.

During conversion, we write the converted strings in Java's serialization format, which preserves the order in which the texts appear in the web pages. All extracted elements of an HTML page are converted to a sequence of Java strings: for example, the content of a <P> tag is transformed into a single string, as is the content of a table cell (<TD> tag). HTML tags, script code, and other HTML elements that serve formatting purposes are removed, so the transformed HTML page becomes a sequence of text strings. A single string may be a word, a phrase, a sentence, or a paragraph, depending on the author of the web page. Not all strings are helpful for knowledge extraction, so we design a simple and fast approach to filter out useless strings.
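A minimal sketch of this conversion step is shown below. It assumes Jericho's Source and Renderer API; the class name PageToStrings and the blank-line block-splitting heuristic are illustrative choices, not the exact implementation used here.

    import net.htmlparser.jericho.Renderer;
    import net.htmlparser.jericho.Source;

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.util.ArrayList;
    import java.util.List;

    public class PageToStrings {

        // Render an HTML page to plain text and cut it into an ordered
        // list of strings, one per block-level element (<P>, <TD>, ...).
        static List<String> extractStrings(String html) {
            Source source = new Source(html);
            // Jericho's Renderer drops tags, script code, and other
            // formatting elements and emits the visible text of the page.
            Renderer renderer = source.getRenderer();
            String text = renderer.toString();

            List<String> strings = new ArrayList<>();
            // Assumption: blocks are separated by blank lines in the
            // rendered text, so each blank-line-delimited block becomes
            // one string.
            for (String block : text.split("\\n{2,}")) {
                String s = block.replace('\n', ' ').trim();
                if (!s.isEmpty()) {
                    strings.add(s);
                }
            }
            return strings;
        }

        // Persist the strings with Java serialization; writing them in
        // order preserves their order of appearance in the page.
        static void serialize(List<String> strings, String path) throws IOException {
            try (ObjectOutputStream out =
                    new ObjectOutputStream(new FileOutputStream(path))) {
                for (String s : strings) {
                    out.writeObject(s);
                }
            }
        }
    }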

We filter out many types of strings that obviously do not help knowledge extraction. These filtered strings include script code (caused by malformed HTML pages), link text for site functions, bare numbers, and named entities such as proper names, dates, and times.

23 http://jericho.htmlparser.net/docs/index.html


When processing the huge ClueWeb09 dataset, speed is the most important consideration. We investigate two possible approaches for deciding whether a string should be kept: a rule-based approach and a linear SVM classifier (Fan, Chang, Hsieh, Wang, & Lin, 2008).

The rule-based approach uses a list of if-else decisions, which is simple and fast, but it is hard to find the best filtering rules. The linear SVM classifier, on the other hand, is more theoretically sound, and it is also fast because it uses the linear decision function f(x) = w · x, where w is the vector of learned weights and x is the feature vector. The hard problem here is to construct a test set for evaluating both approaches, so we adopt a blended approach.

We first use a simple rule-based approach as a bootstrap step to construct a training set. We process 4 ClueWeb09 data files and use heuristic rules to mark invalid strings, i.e., the strings we want to filter. Each data file contains about 33,000 HTML pages, yielding 1,893,512 valid strings and 10,860,623 invalid strings. We then adopt LIBLINEAR to train a classifier with L2-regularized L2-loss support vector classification.
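As a sketch, training such a classifier might look as follows with the Java port of LIBLINEAR (de.bwaldvogel.liblinear). The feature vectors are assumed to be precomputed and normalized; the C and epsilon values are illustrative defaults, not the settings actually used.

    import de.bwaldvogel.liblinear.Feature;
    import de.bwaldvogel.liblinear.FeatureNode;
    import de.bwaldvogel.liblinear.Linear;
    import de.bwaldvogel.liblinear.Model;
    import de.bwaldvogel.liblinear.Parameter;
    import de.bwaldvogel.liblinear.Problem;
    import de.bwaldvogel.liblinear.SolverType;

    public class StringFilterTrainer {

        static final int NUM_FEATURES = 92; // total number of features, as in the text

        // Train an L2-regularized L2-loss support vector classifier.
        // features[i] holds the normalized feature values of string i;
        // labels[i] is +1 for a valid string and -1 for an invalid one.
        static Model train(double[][] features, double[] labels) {
            Problem problem = new Problem();
            problem.l = features.length; // number of training instances
            problem.n = NUM_FEATURES;    // feature dimension
            problem.y = labels;
            problem.x = new Feature[features.length][];
            for (int i = 0; i < features.length; i++) {
                Feature[] row = new Feature[NUM_FEATURES];
                for (int j = 0; j < NUM_FEATURES; j++) {
                    // LIBLINEAR feature indices are 1-based.
                    row[j] = new FeatureNode(j + 1, features[i][j]);
                }
                problem.x[i] = row;
            }
            // C = 1.0 and eps = 0.01 are illustrative defaults.
            Parameter param = new Parameter(SolverType.L2R_L2LOSS_SVC, 1.0, 0.01);
            return Linear.train(problem, param);
        }

        // Keep a string when the linear decision function f(x) = w . x
        // classifies it as valid.
        static boolean keep(Model model, Feature[] instance) {
            return Linear.predict(model, instance) > 0;
        }
    }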

The features we use include the number of sentences, the number of periods, the string length, the number of sentence markers (.!?), the ratio of the maximum word length to the string length, and the ratios of different character categories, such as digits, spaces, A-to-Z characters, and special characters used in scripts (@+-&%*/~#;,.!?|$^\"\:=`). For some scalar features in the range [0, 1], we add four flags to indicate the scalar quantization result. The total number of features is 92, and all feature values are normalized to [-1, 1]. These features can be extracted in linear time O(L), where L is the string length. The best inside-test accuracy of the linear SVM is about 97.50%.
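The single-pass extraction of a few of these features can be sketched as follows; the exact feature set and quantization thresholds are not fully specified above, so the quarter-interval flags below are an assumption.

    public class StringFeatures {

        // Compute a few of the described features in one pass over the
        // string (O(L) time): counts of periods and sentence markers,
        // and character-category ratios.
        static double[] extract(String s) {
            int len = s.length();
            int periods = 0, markers = 0, digits = 0, spaces = 0, letters = 0;
            for (int i = 0; i < len; i++) {
                char c = s.charAt(i);
                if (c == '.') periods++;
                if (c == '.' || c == '!' || c == '?') markers++;
                if (c >= '0' && c <= '9') digits++;
                if (Character.isWhitespace(c)) spaces++;
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) letters++;
            }
            double digitRatio = len == 0 ? 0.0 : (double) digits / len;
            double spaceRatio = len == 0 ? 0.0 : (double) spaces / len;
            double letterRatio = len == 0 ? 0.0 : (double) letters / len;
            // Four flags indicating which quarter of [0, 1] a scalar
            // feature falls into (the thresholds here are an assumption).
            double[] flags = quantize(digitRatio);
            return new double[] {
                len, periods, markers, digitRatio, spaceRatio, letterRatio,
                flags[0], flags[1], flags[2], flags[3]
            };
        }

        static double[] quantize(double ratio) {
            double[] flags = new double[4];
            flags[Math.min(3, (int) (ratio * 4))] = 1.0;
            return flags;
        }
    }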

The third step applies the learned model back to the training set, and we manually induce simple rules from its errors. The final rules we use for filtering are as follows.

1. The minimum string length must be larger than 30 characters. This filters out most list items.
