
Conclusions and Future Work

This thesis proposes a domain-space weighting scheme that represents documents in a domain space and incrementally constructs a classifier, resolving the problems of document representation and category adaptation. The scheme consists of three major phases: the Training Phase, the Discrimination Phase, and the Tuning Phase, each of which has been implemented with a corresponding algorithm. The training algorithm incrementally extracts and weights features from each individual category, and integrates the results into a feature-domain association weighting table. The discrimination algorithm diminishes the weights of features with low discriminating power. The classifier is then constructed from the output of these two algorithms. Finally, the tuning algorithm strengthens the classifier with feedback from tuning documents, reducing its number of false positives.

When the constructed classifier is confronted with a set of newly inserted documents, some belonging to new categories and the others to categories already trained in the feature-domain association weighting table, the scheme first separates the two groups. It then applies the training algorithm of Section 4.1 to extract and weight features from documents of the new categories, and applies the tuning algorithm of Section 4.3 to extract new information from documents of the trained categories and integrate the results into the feature-domain association weighting table. The scheme can therefore handle all newly inserted documents, regardless of which categories they belong to.
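The three phases summarized above can be sketched in code. The following is a minimal, illustrative sketch only: the class name, the count-based weighting, the category-spread heuristic used for discriminating power, and the `tuning_rate` parameter are all assumptions made for exposition, not the thesis's exact formulas.

```python
from collections import defaultdict

class DomainSpaceClassifier:
    """Illustrative sketch of the three-phase scheme; the weighting and
    discrimination formulas here are stand-ins, not the thesis's own."""

    def __init__(self, tuning_rate=0.1):
        self.weights = defaultdict(dict)  # feature -> {category: weight}
        self.tuning_rate = tuning_rate    # hypothetical tuning parameter

    def train(self, documents, category):
        """Training Phase: extract and weight features from one category,
        merging them into the feature-domain association weighting table."""
        for doc in documents:
            for feature in doc.split():
                self.weights[feature][category] = (
                    self.weights[feature].get(category, 0.0) + 1.0)

    def discriminate(self):
        """Discrimination Phase: diminish the weights of features that are
        spread over many categories (a crude proxy for low discriminating
        power)."""
        for feature, cats in self.weights.items():
            spread = len(cats)
            for cat in cats:
                cats[cat] /= spread

    def classify(self, document):
        """Score the document against every category in the table and
        return the best-scoring category (or None if no feature matches)."""
        scores = defaultdict(float)
        for feature in document.split():
            for cat, w in self.weights.get(feature, {}).items():
                scores[cat] += w
        return max(scores, key=scores.get) if scores else None

    def tune(self, document, true_category):
        """Tuning Phase: on a misclassified tuning document, strengthen the
        weights of its features toward the true category."""
        if self.classify(document) != true_category:
            for feature in document.split():
                w = self.weights[feature].get(true_category, 0.0)
                self.weights[feature][true_category] = w + self.tuning_rate
```

Under this sketch, newly inserted documents of a new category would go through `train` (Section 4.1), while documents of an already-trained category would go through `tune` (Section 4.3), mirroring the separation described above.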

Experiments on the standard Reuters-21578 benchmark show that, given enough training documents, the classifier is quite effective.

With the tuning algorithm, the classifier grows stronger still. In the future, we plan to experiment on other document collections to further validate our classifier construction algorithm. We will also try other refined functions in the discrimination algorithm to improve performance. As for the tuning parameter in the tuning algorithm, we hope to develop a scheme that automatically learns an appropriate value. Finally, we intend to adapt our classifier construction algorithm to multi-label document classification.

