We identify two directions for future improvement of our system. The first concerns the sampling step.

SV clustering on the raw data takes too long, so for every category we first perform sampling and dimension reduction using FCM and PCA. This reduces the clustering time considerably, but we must then adopt another strategy to compensate, because the clustering result may be worse than the one obtained by running SV clustering on the original raw data.
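As an illustration of the sampling idea only, the following is a minimal sketch of fuzzy c-means (FCM) in plain Python. It assumes that the FCM cluster centers are used as the reduced set of points for a category; the function names `fuzzy_c_means` and `sample_category`, the fuzzifier `m = 2`, and the stopping rule are choices made for this sketch and are not taken from our implementation. The PCA step is not shown.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length float vectors."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    memberships = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([u / total for u in row])
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            weights = [row[k] ** m for row in memberships]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])
        # Membership update from the distances to the new centers.
        change = 0.0
        for i, p in enumerate(points):
            inv = [
                max(math.dist(p, c), 1e-12) ** (-2.0 / (m - 1.0))
                for c in centers
            ]
            total = sum(inv)
            new_row = [v / total for v in inv]
            change = max(
                change,
                max(abs(a - b) for a, b in zip(new_row, memberships[i])),
            )
            memberships[i] = new_row
        if change < tol:
            break
    return centers, memberships


def sample_category(points, n_samples, seed=0):
    """Assumed usage: FCM centers stand in for one category's documents."""
    centers, _ = fuzzy_c_means(points, n_samples, seed=seed)
    return centers


# Toy data: six two-dimensional points reduced to two representatives.
docs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
representatives = sample_category(docs, 2)
```

Because each center is a weighted mean of the input points, the representatives always lie inside the region spanned by the original data, which is what makes them usable as a smaller input to a subsequent clustering step.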

We hope that techniques such as SV mixture, or other methods, will allow us to run SV clustering on the original data and to use the resulting clusters for text categorization.

The second direction is the training of the expert node classifiers.

A category such as commodity contains 53 sub-categories, which results in poor classification performance. A more powerful method for training this kind of classifier may improve its classification accuracy.


APPENDIX A Stop-word List

A total of 306 stop words are used in this thesis.

a about above across after afterwards again against albeit all almost alone along already also although always among amongst an and another any anyhow anyone anything anywhere are around as at b be became because become becomes becoming been before beforehand behind being below beside besides between beyond both but by c can cannot co could d down during each eg eight either eleven else elsewhere enough etc even ever every everyone everything everywhere except f few five for four former formerly from further g h had has have he hence her here hereafter hereby herein hereupon hers herself him himself his how however i ie if in inc indeed into is it its itself j k l last latter latterly least less ltd m many may me meanwhile might more moreover most mostly much must my myself n namely neither never nevertheless next nine no nobody none nor not nothing now nowhere o of often on once one only onto or other others otherwise our ours ourselves out over own p per perhaps q r rather s said same seem seemed seeming seems seven several she should since six so some somehow someone something sometime sometimes somewhere still such t ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they this those though three through throughout thru thus to today together too toward towards two twelve u under until up upon us v v very via w was we well were what whatever whatsoever when whence whenever whensoever where whereafter whereas whereat whereby wherefrom wherein whereinto whereof whereon whereto whereunto whereupon wherever wherewith whether which whichever whichsoever while whilst whither who whoever whole whom whomever whomsoever whose whosoever why will with within without would x y year years yes yesterday yet you your yours yourself yourselves z
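The following minimal sketch shows how such a list can be applied during preprocessing. It assumes that tokens are compared in lowercase and that the list is held in a set; the function name `remove_stop_words` and the short excerpt in `STOP_WORDS` are illustrative only, and a real run would load every entry of the list above.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form appears in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


# Illustrative excerpt only; all of these words appear in the list above.
STOP_WORDS = frozenset("a about above across after and in is of the to".split())

tokens = "The price of gold is rising in London".split()
content_words = remove_stop_words(tokens, STOP_WORDS)
print(content_words)  # ['price', 'gold', 'rising', 'London']
```

Holding the list in a set keeps each membership test at constant expected cost, which matters when every token of every document is checked.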


A total of 37 part-of-speech tags are used.

No.   Part-of-Speech Tag   Meaning
1     CC                   Coordinating conjunction
2     CD                   Cardinal number
3     DT                   Determiner
4     EX                   Existential there
5     FW                   Foreign word
6     IN                   Preposition or subordinating conjunction
7     JJ                   Adjective
8     JJR                  Adjective, comparative
9     JJS                  Adjective, superlative
10    LS                   List item marker
11    MD                   Modal
12    NN                   Noun, singular or mass
13    NNS                  Noun, plural
14    NNP                  Proper noun, singular
15    NNPS                 Proper noun, plural
16    PDT                  Predeterminer
17    POS                  Possessive ending
18    PRP                  Personal pronoun
19    PRP$                 Possessive pronoun
20    RB                   Adverb
21    RBR                  Adverb, comparative
22    RBS                  Adverb, superlative
23    RP                   Particle
24    SYM                  Symbol
25    TO                   To
26    UH                   Interjection
27    VB                   Verb, base form
28    VBD                  Verb, past tense
29    VBG                  Verb, gerund or present participle
30    VBN                  Verb, past participle
31    VBP                  Verb, non-3rd person singular present
32    VBZ                  Verb, 3rd person singular present
33    WDT                  Wh-determiner
34    WP                   Wh-pronoun
35    WP$                  Possessive wh-pronoun
36    WRB                  Wh-adverb
37    .                    Period

