This paper has proposed a practical Web-based approach to organizing text segments into a hierarchical structure of topic classes. Although hierarchical clustering of text segments is inherently difficult, the Web provides an alternative way to deal with this problem. With huge numbers of Web documents indexed by real-world search engines, most text segments can obtain adequate topic-relevant contextual information. Hence, the approach is designed to be combined with the search processes of real-world search engines, extracting features for each text segment from its highly ranked search-result snippets.
Also, a clustering algorithm for generating a natural multi-way-tree cluster hierarchy has been developed. It is a hierarchical agglomerative clustering algorithm that first generates a binary-tree hierarchy and then applies a hierarchical cluster partitioning technique to produce a multi-way-tree hierarchy. The partitioning technique is based on a min-max objective function that measures the quality of clusters, combined with a heuristically acceptable preference function on the number of clusters to ensure that the produced hierarchy is natural for humans.
Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The experimental results have shown the feasibility of our approach, which we believe is useful in various Web information applications. In future work, we plan to investigate the applicability of our approach to more types of text segments. For example, polysemous text segments have not been well explored at the current stage of our study. In addition, a more sophisticated cluster naming technique is urgently needed to provide users with a more comprehensive topic hierarchy.
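As an illustration of the binary-tree stage summarized above, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes raw term counts as features, cosine similarity, and average-link merging; the snippet strings and the helper names (`cosine`, `average_link`, `agglomerate`) are hypothetical, and the subsequent min-max partitioning into a multi-way tree is not shown.

```python
from collections import Counter
from itertools import combinations
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerate(vectors):
    """Build a binary merge tree bottom-up.

    Leaves are segment names; internal nodes are (left, right) tuples.
    """
    # Each active cluster is a pair: (subtree, list of member names).
    clusters = [(name, [name]) for name in vectors]
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average-link similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(
                clusters[p[0]][1], clusters[p[1]][1], vectors),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical search-result snippets standing in for retrieved context.
snippets = {
    "jaguar car": "luxury car engine dealer car",
    "sports car": "fast car engine racing",
    "python snake": "reptile snake species habitat",
    "cobra": "venomous snake reptile",
}
vectors = {seg: Counter(text.split()) for seg, text in snippets.items()}
tree = agglomerate(vectors)
print(tree)
```

In this toy input the two car-related segments and the two snake-related segments share no terms across groups, so they only meet at the root of the binary tree. A partitioning step such as the one described above would then decide at which levels of this tree to cut.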
The authors would like to thank the associate editor and the anonymous reviewers.
Their valuable comments and suggestions greatly improved the quality of this paper.