Hierarchically SVM Classification Based on Support Vector Clustering Method and Its Application to Document Categorization


Pei-Yi Hao a,*, Jung-Hsien Chiang b, Yi-Kun Tu b

a Department of Information Management, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, ROC

b Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan

Abstract

Automatic categorization of documents into pre-defined topic hierarchies or taxonomies is a crucial step in knowledge and content management. Standard machine learning techniques such as support vector machines and related large-margin methods have been successfully applied to this task, although they ignore inter-class relationships. Unfortunately, document categorization involves a large number of classes and a huge number of relevant features needed to distinguish between them, so the computational cost of training a single classifier for a problem of this size is prohibitive. It has also been observed that obtaining a classifier that discriminates between two groups of classes is much easier than distinguishing among all classes simultaneously. This has prompted substantial research into using hierarchical classifiers to address single multi-class problems. In this paper, we propose a novel hierarchical classification method that generalizes support vector machine learning; its classifiers are organized according to the results of the support vector clustering method and are structured in a way that mirrors the class hierarchy. Compared with a previous non-hierarchical SVM classifier and well-known document categorization systems, the proposed hierarchical SVM classification achieves higher classification accuracy on the standard Reuters corpus.

Keywords: Information retrieval; Document categorization; Hierarchical classification; Support vector machines; Support vector clustering method; Machine learning

1. Introduction

Due to the rapid growth of textual data, automatic methods for organizing such data are needed. Automatic document categorization is one of these methods: it automatically assigns documents to a set of pre-defined classes based on their textual content. Document categorization is a crucial and well-proven instrument for organizing large volumes of textual information. In most cases, statistical or machine learning techniques have proven successful in this context, since it is typically more feasible to induce categorization rules from example documents than to elicit such rules from domain experts. The wide range of methods applied to this problem includes multivariate regression models (Schütze, Hull, & Pedersen, 1995), probabilistic Bayesian models (Koller & Sahami, 1997; Lewis & Ringuette, 1994), decision trees (Lewis & Ringuette, 1994; Weiss et al., 1999), neural networks (Schütze et al., 1995; Weigend, Wiener, & Pedersen, 1999), symbolic rule learning (Apte, Damerau, & Weiss, 1994), nearest neighbor classifiers (Xie & Beni, 1991), and, more recently, boosting (Schapire, Singer, & Singhal, 1998) and support vector machines (SVMs) (Joachims, 1998). Extensive experimental comparisons (e.g., Joachims, 1998; Sebastiani, 2002; Yang & Liu, 1999) have shown that, among the methods available today, SVMs are highly competitive in classification accuracy and can therefore be considered the state of the art in document categorization.

* Corresponding author. Tel.: +886 7 3814526x6117; fax: +886 7 3831332.

E-mail addresses: [email protected] (P.-Y. Hao), jchiang@mail.ncku.edu.tw (J.-H. Chiang), [email protected] (Y.-K. Tu).

Expert Systems with Applications 33 (2007) 627–635. www.elsevier.com/locate/eswa


A potential drawback of all of the above-mentioned classification methods is that they treat the category structure as ‘flat’: the pre-defined categories are treated in isolation, and no structure defines the relationships among them (D’Alessio, Murray, Schiaffino, & Kershenbaum, 2000; Yang, 1999). Unfortunately, document categorization involves a large number of classes and a huge number of relevant features needed to distinguish between them. The computational cost of training a classifier for a problem of this size is prohibitive. It has also been observed that obtaining a classifier that discriminates between two groups of classes is much easier than obtaining one that distinguishes among all classes simultaneously. This has prompted substantial research into using hierarchical classifiers to address single multi-class problems.

The idea of hierarchical classification is that a set of small problems with fewer classes can be solved faster and more effectively than one large-scale classification problem that distinguishes among a large number of classes. Hierarchical classification allows us to address a large classification problem using a divide-and-conquer approach. It decomposes the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately and efficiently. At the root level of the category hierarchy, a document is first classified into one or more sub-categories using some flat classification method(s). The classification is then repeated on the document in each of those sub-categories until the document reaches some leaf categories or cannot be further classified into any sub-category. Several approaches of this kind have been introduced (Cai & Hofmann, 2004; D’Alessio et al., 2000; Dumais & Chen, 2000; Koller & Sahami, 1997; Larkey, 1998; McCallum, Rosenfeld, Mitchell, & Ng, 1998; Sun & Lim, 2001; Vaithyanathan, Mao, & Dom, 2000). Most of them report a large improvement in performance, and some also report gains in classification accuracy.

Hierarchical classification offers a lot of flexibility in designing the classifier system. For instance, one can replace the classifiers at the internal nodes of the generated hierarchical structure with stronger classifiers such as SVMs (Cortes & Vapnik, 1995; Vapnik, 1995). Moreover, a different feature selection method, specific to the domain of the input data, can be used at each node. Each sub-problem is smaller than the original problem, and it is sometimes possible to use a much smaller set of features (Dumais & Chen, 2000; Koller & Sahami, 1997).

In this paper, we propose a novel hierarchical classification method that generalizes support vector machine learning; its classifiers are organized according to the results of the support vector clustering method and are structured in a way that mirrors the class hierarchy. The grouping of individual classes into meta-classes is determined by the class distributions described by the support vector clustering method. The rest of this paper is structured as follows. In Sections 2 and 3 we give a brief review of hierarchical classification and the support vector clustering method, respectively. In Section 4, we illustrate the proposed hierarchical SVM classification, in which the hierarchical structure is constructed by the support vector clustering method. We present our experimental methodology, together with a variety of results supporting our approach, in Section 5, and concluding remarks are given in Section 6.

2. The hierarchical classification approach

In the hierarchical classification approach, one or more classifiers are constructed at each level of the category tree, and each classifier works as a flat classifier at that level. A document is first classified by the classifier at the root level into one or more lower-level categories. It is then further classified by the classifier(s) of the lower-level category(ies) until it reaches a final category, which may be a leaf category or an internal category. Given this general definition of a hierarchy, two basic cases can be distinguished: (i) a tree structure, where each class (except the root class) has exactly one parent class, and (ii) a directed acyclic graph structure, where a class can have more than one parent class.
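The top-down procedure described above can be sketched as follows. This is a minimal illustration rather than the method proposed in this paper: the names `CategoryNode`, `classify_top_down`, and `keyword_router` are hypothetical, the keyword rules stand in for trained node classifiers, the toy hierarchy is invented, and for simplicity each node routes a document to at most one child (a single path), whereas the general definition allows several.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A node-level "flat" classifier: maps a document to the name of one child
# category, or None when the document fits none of the children.
NodeClassifier = Callable[[str], Optional[str]]


@dataclass
class CategoryNode:
    name: str
    children: Dict[str, "CategoryNode"] = field(default_factory=dict)
    classifier: Optional[NodeClassifier] = None  # not needed at leaf nodes


def classify_top_down(root: CategoryNode, document: str) -> List[str]:
    """Return the path of category names from the root to the final category."""
    path = [root.name]
    node = root
    while node.children and node.classifier is not None:
        choice = node.classifier(document)
        if choice is None or choice not in node.children:
            break  # the document stays at this internal category
        node = node.children[choice]
        path.append(node.name)
    return path


def keyword_router(rules: Dict[str, str]) -> NodeClassifier:
    """Toy stand-in for a trained classifier: the first matching keyword wins."""
    def route(document: str) -> Optional[str]:
        text = document.lower()
        for keyword, child in rules.items():
            if keyword in text:
                return child
        return None
    return route


# Hypothetical two-level hierarchy, for illustration only.
grain = CategoryNode(
    "grain",
    {"wheat": CategoryNode("wheat"), "corn": CategoryNode("corn")},
    keyword_router({"wheat": "wheat", "corn": "corn"}),
)
money = CategoryNode("money")
root = CategoryNode(
    "root",
    {"grain": grain, "money": money},
    keyword_router({"harvest": "grain", "interest rate": "money"}),
)

print(classify_top_down(root, "Wheat harvest estimates were raised."))
print(classify_top_down(root, "The harvest was delayed by rain."))
```

The second document stops at the internal category because no child classifier accepts it, which corresponds to the case where the final category is an internal one.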

2.1. (Virtual) category tree

In the virtual category tree structure, categories are organized as a tree. Each category can belong to at most one parent category, and documents can only be assigned to the leaf categories (Dumais & Chen, 2000). The category tree structure is an extension of the virtual category tree that allows documents to be assigned to both internal and leaf categories (Wang, Zhou, & He, 2001).

A well-known example of virtual category tree classification is the Binary Hierarchical Classifier (BHC) architecture (Kumar, Ghosh, & Crawford, 2002), which addresses multi-class classification problems using a set of binary classifiers. The BHC recursively decomposes a multi-class (N-class) problem into N − 1 two-meta-class problems, resulting in N − 1 classifiers arranged as a binary tree, as shown in Fig. 1. The given set of classes is first partitioned into two disjoint meta-classes, and each meta-class thus obtained is partitioned recursively until it contains only one of the original classes. The tree thus has a number of leaf nodes equal to the number of classes in the output space. As is well known, a binary tree of n nodes has $\frac{1}{n+1}\binom{2n}{n}$ different structures. In BHC, the structure of the binary tree is constructed through a deterministic annealing process that encourages similar classes to remain in the same partition. As a direct consequence of the BHC algorithm, classes that are similar to each other in the input feature space are lumped into the same meta-class higher up in the tree. Interested readers are referred to Kumar et al. (2002) for details of the algorithm.

Fig. 1. An example of Binary Hierarchical Classifier architecture, which consists of N − 1 classifiers arranged as a binary tree.
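To give a feel for how quickly the number of candidate tree structures grows, the short sketch below evaluates the closed-form count of binary tree shapes quoted in the BHC discussion (the Catalan number) and cross-checks it against a direct recursive enumeration. The function names are hypothetical, and the sketch only counts tree shapes; it does not implement the deterministic annealing procedure of Kumar et al. (2002).

```python
from functools import lru_cache
from math import comb


def catalan(n: int) -> int:
    """Closed form: number of binary tree shapes with n nodes, C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


@lru_cache(maxsize=None)
def count_shapes(n: int) -> int:
    """Count binary tree shapes with n nodes by splitting the remaining
    n - 1 nodes between the left and right subtrees of the root."""
    if n == 0:
        return 1  # the empty tree
    return sum(count_shapes(left) * count_shapes(n - 1 - left) for left in range(n))


def num_bhc_classifiers(num_classes: int) -> int:
    """A binary tree with one leaf per class has num_classes - 1 internal nodes."""
    return num_classes - 1


for n in range(1, 8):
    # The closed form and the enumeration should agree for every n.
    print(n, catalan(n), count_shapes(n))

# For 10 classes the tree has 9 internal nodes (classifiers); the number of
# unlabeled shapes for those 9 internal nodes is catalan(9).
print(num_bhc_classifiers(10), catalan(num_bhc_classifiers(10)))
```

Even before deciding which classes go to which leaf, the number of shapes grows rapidly, which is why the tree structure is built by a guided procedure rather than by exhaustive search.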

2.2. (Virtual) directed acyclic category graph

In the virtual directed acyclic category graph structure, categories are organized as a directed acyclic graph (DAG) in which a class can have more than one parent class. As in the virtual category tree, documents can only be assigned to leaf categories. The directed acyclic category graph structure is an extension of the virtual directed acyclic category graph structure in which documents can be assigned to both internal and leaf categories. It is perhaps the most commonly used structure in popular web directory services such as the Open Directory Project and Yahoo.

Recently, a novel multi-class SVM approach called DAGSVM has been proposed (Platt, Cristianini, & Shawe-Taylor, 2000). For n classes, its training phase solves n(n − 1)/2 binary SVMs. In the testing phase, it uses a rooted binary directed acyclic graph that has n(n − 1)/2 internal nodes and n leaves, as shown in Fig. 2. Each node is a binary SVM for the ith and jth classes. Given a test sample x, the binary decision function is evaluated starting at the root node, and the sample then moves to either the left or the right child depending on the output value. We therefore follow a path until reaching a leaf node, which indicates the predicted class. Although the hierarchical structure of DAGSVM is fixed, the assignment of the binary SVMs to the internal nodes is not unique. Fig. 3 shows different assignments of the binary SVMs to the internal nodes. It is easy to see that when the number of classes is n, there are n(n − 2)! different assignments in the hierarchical structure. To our knowledge, not many studies address the issue of how to obtain a better assignment when the number of classes is huge.

3. Support vector machines for clustering method

Support vector (SV) clustering has recently been derived from the single-class support vector machine (Chiang & Hao, 2003; Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001; Tax & Duin, 1999) for estimating the underlying probability distribution. Ben-Hur et al. generalize the support vectors as the boundaries of clusters (Ben-Hur, Horn, Siegelmann, & Vapnik, 2000; Ben-Hur, Horn, Siegelmann, & Vapnik, 2001). We now illustrate the support vector clustering method, as shown in Fig. 4.

To begin, let Φ denote a nonlinear transformation that maps the original input space onto a high-dimensional feature space. Clustering may be viewed as finding

Fig. 2. An example of DAGSVM architecture, which uses a rooted binary directed acyclic graph with n(n − 1)/2 internal nodes and n leaves.

Fig. 3. Three different ways of assigning the binary SVMs to the internal nodes when the number of classes is three.
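The DAG evaluation described in Section 2.2 and depicted in Figs. 2 and 3 can be sketched with the usual list-based formulation: keep an ordered list of candidate classes, test the first against the last, and discard the loser until one class remains. This is an illustrative sketch under stated assumptions, not the authors' implementation: `dag_predict` and `make_centroid_decision` are hypothetical names, and a one-dimensional nearest-centroid rule with invented centroids stands in for each trained binary SVM.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# pairwise[(i, j)](x) returns the preferred class, either i or j, with i < j.
PairwiseDecision = Callable[[float], int]


def dag_predict(
    x: float,
    class_order: List[int],
    pairwise: Dict[Tuple[int, int], PairwiseDecision],
) -> int:
    """Walk the decision DAG induced by class_order and return the predicted class."""
    remaining = list(class_order)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        key = (min(first, last), max(first, last))
        winner = pairwise[key](x)
        if winner == first:
            remaining.pop()    # the last class is eliminated
        else:
            remaining.pop(0)   # the first class is eliminated
    return remaining[0]


def make_centroid_decision(i: int, j: int, centroids: Dict[int, float]) -> PairwiseDecision:
    """Toy stand-in for a binary SVM: choose the class with the nearer centroid."""
    def decide(x: float) -> int:
        return i if abs(x - centroids[i]) <= abs(x - centroids[j]) else j
    return decide


# Hypothetical 1-D class centroids, for illustration only.
centroids = {0: 0.0, 1: 5.0, 2: 10.0}
pairwise = {
    (i, j): make_centroid_decision(i, j, centroids)
    for i, j in combinations(sorted(centroids), 2)
}

print(len(pairwise))                          # n(n - 1)/2 binary classifiers
print(dag_predict(4.0, [0, 1, 2], pairwise))  # root node tests class 0 vs class 2
print(dag_predict(4.0, [1, 0, 2], pairwise))  # root node tests class 1 vs class 2
```

Changing `class_order` changes which binary classifier sits at the root and at each internal node, which is the assignment freedom illustrated in Fig. 3. Each prediction evaluates only n − 1 of the n(n − 1)/2 binary classifiers.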

Fig. 4. Schematic diagram of the SV clustering.


References

Aas, K., & Eikvil, L. (1999). Text categorization: A survey. Report No 941, Norwegian Computing Center. ISBN 82-539-0425-8.

Apte, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 233–251.

Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. N. (2000). A support vector clustering method. In International conference on pattern recognition (pp. 728–732).

Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. N. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.

Cai, L., & Hofmann, T. (2004). Hierarchical document categorization with support vector machines. In Proceedings of the thirteenth ACM international conference on information and knowledge management (pp. 78–87).

Chiang, J.-H., & Hao, P.-Y. (2003). A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing. IEEE Transactions on Fuzzy Systems, 11(4), 518–527.

Cortes, C., & Vapnik, V. N. (1995). Support vector network. Machine Learning, 20, 1–25.

D’Alessio, S., Murray, K., Schiaffino, R., & Kershenbaum, A. (2000). The effect of using hierarchical classifiers in text categorization. In Proceedings of the 6th international conference "Recherche d’Information Assistee par Ordinateur" (pp. 302–313). Paris, FR.

Dumais, S. T., & Chen, H. (2000). Hierarchical classification of web content. In Proceedings of 23rd international conference on research and development in information retrieval (SIGIR’00) (pp. 256–263).

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.), Proceedings of the 10th European conference on machine learning (ECML’98) (No. 1398). Springer-Verlag.

Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of 14th international conference on machine learning (ICML’97) (pp. 170–178). Nashville, TN.

Kumar, S., Ghosh, J., & Crawford, M. M. (2002). Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5(2), 210–220, Spl. Issue on Fusion of Multiple Classifiers.

Larkey, L. (1998). Some issues in the automatic classification of US patents. In Learning for text categorization. Papers from the 1998 workshop. AAAI Press (pp. 87–90).

Lewis, D. D., & Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Third annual symposium on document analysis and information retrieval (SDAIR’94) (pp. 81–93).

Luenberger, D. G. (1984). Linear and nonlinear programming. Massachusetts: Addison-Wesley Pub.

Manevitz, L. M., & Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, 139–154.

McCallum, A., Rosenfeld, R., Mitchell, T., & Ng, A. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of 15th international conference on machine learning (ICML’98) (pp. 359–367). Madison, WI.

ODP – Open Directory Project. http://dmoz.org/.

Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In Proceedings of neural information processing systems, NIPS’99 (pp. 547–553). MIT Press.

Schapire, R. E., Singer, Y., & Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proceedings of 21st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR’98) (pp. 215–223).

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.

Schütze, H., Hull, D., & Pedersen, J. O. (1995). A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR’95) (pp. 229–237).

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

Sun, A., & Lim, E.-P. (2001). Hierarchical text classification and evaluation. In Proceedings of the first IEEE international conference on data mining (pp. 521–528). California, USA, Nov 2001.

Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 11–13.

Vaithyanathan, S., Mao, J., & Dom, B. (2000). Hierarchical Bayes for text classification. In Proceedings of international workshop on text and web mining (PRICAI’00) (pp. 36–43). Melbourne, Australia.

Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Wang, K., Zhou, S., & He, Y. (2001). Hierarchical classification of real life documents. In Proceedings of the 1st SIAM international conference on data mining, Chicago.

Weigend, A. S., Wiener, E. D., & Pedersen, J. O. (1999). Exploiting hierarchy in text categorization. Information Retrieval, 1(3), 193–216.

Weiss, S. M., Apte, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, H., et al. (1999). Maximizing text-mining performance. IEEE Intelligent Systems, 14(4), 2–8.

Xie, X. L., & Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-13(8), 841–847.

Yahoo! http://www.yahoo.com.

Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2), 69–90.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 42–49).

Yang, Y., & Pedersen, J. (1997). A comparative study on feature selection in text categorization. In International conference on machine learning (ICML) (pp. 412–420).