CHAPTER 5 CONCLUSIONS AND FUTURE WORK
5.2 FUTURE WORK
We identify two directions for future improvement of our system. The first concerns the sampling step.
Because performing SV clustering on the full data set takes too long, we reduce the data by applying FCM sampling and PCA dimension reduction to each category. This greatly shortens the clustering time, but it requires a compensating strategy: the clustering result may be worse than the one obtained by running SV clustering directly on the original raw data.
We hope that, with techniques such as SV mixtures or other methods, we can run SV clustering on the original data and use its result directly for text categorization.
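As a rough illustration of the sampling step described above, the sketch below reduces a category's data with PCA and then summarizes it with fuzzy c-means (FCM) centers, which can serve as a smaller input to SV clustering. This is not the thesis implementation; the function names, parameter values, and the use of NumPy are assumptions for illustration only.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # columns for the k largest
    return Xc @ top

def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
    """Fuzzy c-means: returns c cluster centers used as a reduced sample of X."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # fuzzy membership matrix
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))     # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers
```

For example, 200 documents in 50 dimensions could be reduced to 20 representative points in 10 dimensions before SV clustering is applied.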
The second direction is the training of the expert node classifiers.
A category such as commodity contains 53 sub-categories, which leads to poor classification performance. A more powerful method for training these classifiers could improve their classification accuracy.
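One simple baseline for an expert node with many sub-categories is a one-vs-rest scheme that trains one linear classifier per sub-category and assigns each document to the expert with the largest score. This is a hedged sketch, not the classifier used in the thesis; the perceptron-style update, NumPy usage, and all names are illustrative assumptions.

```python
import numpy as np

def train_one_vs_rest(X, y, n_classes, epochs=20, lr=0.1):
    """Train one perceptron-style linear expert per sub-category."""
    W = np.zeros((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)            # one-vs-rest targets
        for _ in range(epochs):
            for xi, ti in zip(X, t):
                if ti * (W[c] @ xi + b[c]) <= 0:   # misclassified: update
                    W[c] += lr * ti * xi
                    b[c] += lr * ti
    return W, b

def predict(W, b, X):
    """Assign each document to the expert with the largest margin."""
    return np.argmax(X @ W.T + b, axis=1)
```

With 53 sub-categories this trains 53 experts; stronger base learners (e.g., SVMs) can be substituted for the perceptron without changing the scheme.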
References
Aas K. and Eikvil L., Text categorization: A survey. Report No 941, Norwegian Computing Center, ISBN 82-539-0425-8, June, 1999.
Apte C., Damerau F., Weiss S. M., Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, pages 233-251, 1994.
Ben-Hur A., Horn D., Siegelmann H. T., and Vapnik V., A support vector clustering method. In International Conference on Pattern Recognition, 2000.
Ben-Hur A., Horn D., Siegelmann H. T., and Vapnik V., Support vector clustering. Journal of Machine Learning Research, volume 2, pages 125-137, 2001.
Blake C. L., and Merz C. J., UCI repository of machine learning databases, 1998.
Brill E., Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 1995.
Bishop C., Neural Networks for Pattern Recognition. Oxford University Press, Walton Street, Oxford OX2 6DP, 1995.
Boser B. E., Guyon I., and Vapnik V. N., A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144-152, Pittsburgh, ACM, 1992.
Brill E., Rule based part of speech tagger. version 1.14, 1994.
Hsu C.-W. and Lin C.-J., A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, volume 13(2), March, 2002.
Cortes C. and Vapnik V., Support-vector networks. Machine Learning, volume 20(3), pages 273-297, 1995.
D’Alessio S., Kershenbaum A., Murray K., Schiaffino R., Category levels in hierarchical text categorization. Proceedings of the Third Conference of Empirical Methods in Natural Language Processing EMNLP-3, 1998.
D’Alessio S., Kershenbaum A., Murray K., and Schiaffino R., The effect of using hierarchical classifiers in text categorization. In Proceedings of the 6th International Conference Recherche d’Information Assistee par Ordinateur (RIAO-00), pages 302-313, Paris, France, 2000.
Duda R. O., Hart P. E., and Stork D. G., Pattern Classification, 2nd ed. Wiley, 2000.
Horowitz E., Sahni S., and Mehta D., Fundamentals of Data Structures in C++. Computer Science Press, New York, 1995.
Frakes W. B. and Baeza-Yates R., Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
Klir G. J. and Yuan B., Fuzzy Sets and Fuzzy Logic. Prentice Hall International Editions, 1995.
Golub G. and Van Loan C., Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, 1996.
Hayes P. and Weinstein S., Construe/TIS: a system for content-based indexing of a database of news stories. In Annual Conference on Innovative Applications of AI, 1990.
Japkowicz N., Myers C., and Gluck M., A novelty detection approach to classification. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 518-523, 1995.
Joachims T., Text categorization with support vector machines: learning with many relevant features. LS-8 Report 23, 1998.
Jolliffe I. T., Principal Component Analysis. Springer-Verlag, 1986.
Koller D. and Sahami M., Hierarchically classifying documents using very few words. International Conference on Machine Learning, volume 14, Morgan Kaufmann, 1997.
Lang K., NewsWeeder: Learning to filter netnews. In International Conference on Machine Learning (ICML), 1995.
Lewis D. D., An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-50, 1992a.
Lewis D. D., Representation and learning in information retrieval. Ph.D. thesis, Computer Science Dept., Univ. of Massachusetts at Amherst, Technical Report 91-93, February, 1992b.
Lewis D. D., Reuters-21578 collection, 1996.
Manevitz L. M. and Yousef M., One-class SVMs for document classification. Journal of Machine Learning Research, volume 2, pages 139-154, 2001.
Meisel W. S., Computer-oriented approaches to pattern recognition. New York and London, 1972.
Carreira-Perpiñán M. Á., A review of dimension reduction techniques. Technical Report CS-96-09, 1997.
Moya M., Koch M., and Hostetler L., One-class classifier networks for target recognition applications. In Proceedings of the World Congress on Neural Networks, pages 797-801, Portland, OR. International Neural Network Society, INNS, 1993.
Ng H.-T., Goh W.-B., and Low K.-L., Feature selection, perceptron learning, and a usability case study. Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 27-31, pages 67-73, 1997.
Platt J. C., Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in kernel methods: support vector learning. MIT Press, 1998.
Hao P. Y. and Chiang J. H., Support vector clustering: a new geometrical grouping approach. Proceedings of the 9th Bellman Continuum International Workshop on Uncertain Systems and Soft Computing, volume 2, pages 312-317, Beijing, China, July, 2002.
Porter M. F., An algorithm for suffix stripping. Program: Automated Library and Information Systems, volume 14(3), pages 130-137, 1980.
Baeza-Yates R. and Ribeiro-Neto B., Modern Information Retrieval. Addison-Wesley, ACM Press, New York, 1999.
van Rijsbergen C. J., Information Retrieval, 2nd edition. Butterworths, London, 1979.
Ritter G. and Gallegos M., Outliers in statistical pattern recognition and an application to automatic chromosome classification. Pattern Recognition Letters, volume 18, pages 525-539, 1997.
Scholkopf B., Williamson R., Smola A., and Shawe-Taylor J., Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, and N. Tishby, editors, Unsupervised Learning, Dagstuhl-Seminar-Report 235, pages 19-20, 1999.
Scholkopf B., Platt J. C., Shawe-Taylor J., Smola A. J., and Williamson R. C., Estimating the support of a high-dimensional distribution. In Proceedings of the Annual Conference on Neural Information Processing Systems. MIT Press, 2000.
Schutze H., Hull D., and Pedersen, J., A comparison of classifiers and document representations for the routing problem. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
Sebastiani F., Machine learning in automated text categorization: a survey. Technical report IEI-B4-31-1999, Istituto di Elaborazione dell’informazione, Consiglio Nazionale delle Ricerche, Pisa, IT, 1999, Revised version, 2001.
Tax D. and Duin R., Support vector domain description. Pattern Recognition Letters, volume 20(11-13), pages 1191-1199, 1999.
Wang K., Zhou S., and He Y., Hierarchical classification of real life documents. In Proceedings of the 1st SIAM Int. Conference on Data Mining, Chicago, 2001.
Weigend A. S., Wiener E. D., and Pedersen J. O., Exploiting hierarchy in text categorization. Information Retrieval, volume 1(3) pages 193-216, 1999.
Weiss S. M., Apte C., Damerau F. J., Johnson D. E., Oles F. J., Goetz T., and Hampp T., Maximizing text-mining performance. IEEE Intelligent Systems, volume 14(4), July-Aug, 1999.
Weston J., Watkins C., Multi-class support vector machines. Technical Report CSD-TR-98-04 May 20, 1998.
Yang Y. and Wilbur J., Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, volume 47(5), pages 357-369, 1996.
Yang Y., An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997a.
Yang Y., and Pedersen J. O., A comparative study on feature selection in text categorization. Proceedings of the 14th International Conference on Machine Learning ICML97, pages 412-420, 1997b.
Huang C.-C., Design and Analysis of Computer Algorithms. Gezhi, 1989. (In Chinese)
APPENDIX A Stop-word List
A total of 306 stop words are used in this thesis.
a about above across after afterwards again against albeit all almost alone along already also although always among amongst an
and another any anyhow anyone anything anywhere are around as at b be became because become becomes becoming been before
beforehand behind being below beside besides between beyond both but by c can cannot co could d down during
e each eg eight either eleven else elsewhere enough etc even ever every everyone everything everywhere except f few five for
four former formerly from further g h had has have he hence her here hereafter hereby herein hereupon hers herself
him himself his how however i
ie if in inc indeed into is it its itself j k l last latter latterly least less ltd
m many may me meanwhile might more moreover most mostly much must my myself n namely neither never nevertheless next
nine no nobody none nor
not nothing now nowhere o of often on once one only onto or other others otherwise our ours ourselves out over own p per perhaps
q r rather s said same seem seemed seeming seems seven several she should since six so some somehow someone something sometime sometimes somewhere still
such t ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they this those though three through throughout
thru thus to today together too toward towards two twelve u under until up upon us v very via w was we well were
what whatever whatsoever when whence whenever whensoever where whereafter whereas whereat whereby wherefrom wherein whereinto whereof whereon whereto whereunto whereupon wherever wherewith whether which whichever
whichsoever while whilst whither who whoever whole whom whomever whomsoever whose whosoever why will with within without would x y year years yes yesterday yet
you your yours yourself yourselves z
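As a usage sketch, filtering documents against this list amounts to a simple set lookup after tokenization. The helper name and the stop-word subset shown below are illustrative only; a full implementation would load all 306 words listed above.

```python
import re

# A small subset of the 306 stop words listed above (illustrative, not complete).
STOP_WORDS = {
    "a", "about", "above", "after", "all", "an", "and", "are", "as", "at",
    "be", "by", "for", "from", "in", "is", "it", "of", "on", "or",
    "the", "this", "to", "was", "we", "with",
}

def remove_stop_words(text):
    """Lowercase, tokenize on letter runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `remove_stop_words("The cat sat on the mat")` keeps only the content words `cat`, `sat`, and `mat`.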
APPENDIX B PART-OF-SPEECH TAGS
A total of 37 part-of-speech tags are used.
Part-of-Speech Tag Meaning
1 CC Coordinating Conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition or subordinating conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PRP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol
25 TO To
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund or present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd person singular present
32 VBZ Verb, 3rd person singular present
33 WDT Wh-determiner
34 WP Wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB Wh-adverb
37 . Period