Improving Linear Classifier for Chinese Text Categorization

Jyh-Jong Tsay
Dept. of Comp. Sci., Natl. Chung Cheng Univ.
tsay@cs.ccu.edu.tw

Jing-Doo Wang*
Dept. of Comp. Sci., Natl. Chung Cheng Univ.
jdwang@cs.ccu.edu.tw

* Lecturer, Department of Computer Science and Engineering, National Penghu Institute of Technology.


Abstract

In this paper, we increase the number of representatives for each class to compensate for a potential weakness of linear classifiers, which compute only one representative per class. To evaluate the effectiveness of our approach, we compared it with the linear classifier produced by the Rocchio algorithm and with the k-Nearest Neighbor (kNN) classifier. Experimental results show that our approach improves the linear classifier and achieves a micro-averaged accuracy similar to that of kNN, with much less classification time. Furthermore, identifying new representatives for the linear classifier lets us suggest how the structure of the classes could be reorganized.

Keywords: Information Retrieval, Linear Classifier, Text Categorization.

1 Introduction

Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers [6]. The main idea of a linear classifier is to construct a prototype vector G as the single representative of a class C from a training set of documents. To determine whether class C should be assigned to a request document X, it usually computes the cosine similarity δ between X and G; if δ is greater than a given threshold, class C is assigned to X. In this study, we assign to X only the class with the highest δ, so the behavior of the linear classifier is conceptually like determining which region a point belongs to in a two-dimensional Voronoi diagram [9]. The assumption of one representative per class restricts the hypothesis space spanned by the documents to the set of linearly separable hyperplane regions [5, 16]. However, it is very difficult to construct a set of hyperplanes that separate the classes from each other, because the shape of each class is irregular and hard to predict in the high-dimensional vector space.

In this study, we increase the number of representatives for each class to compensate for the potential weakness of linear classifiers, which compute one representative per class. First, we classify the documents in the training set with the representatives derived from the original classes. Second, we partition the documents that are classified into the same class into s partitions via a hypergraph partitioning package [4], where s is determined manually in this study. Third, we find new representatives derived from the subclasses, which consist of the misclassified documents and the correctly classified documents. Then, we select the representatives of those subclasses whose classification precision, evaluated on the validation set, is greater than a given threshold. Finally, we classify the documents in the testing set with both these new representatives and those derived from the original classes. Note that the training data are divided into a training set and a validation set to avoid the overfitting problem [8]: the training set is used to find the representatives of the original classes, and the validation set is used to choose the useful representatives of the subclasses, i.e., those whose classification precision is greater than a given threshold.

To evaluate the effectiveness of our approach, we compared it with the linear classifier produced by the Rocchio algorithm [2, 6] and with the k-Nearest Neighbor (kNN) classifier [5, 16]. Experimental results show that the micro-averaged accuracy of our approach is better than that of the linear classifier and similar to that of kNN, with much less classification time. Note that kNN is a well-known statistical approach and one of the best performers in text categorization [17]. Furthermore, we can observe the ambiguities between classes after the process of new-representative identification, and could use them to suggest a reorganization of the class structure in the future.

The remainder of this paper is organized as follows. Section 2 reviews the linear classifier. Section 3 describes our approach. Section 4 gives experimental results. Section 5 gives conclusions and discussion.

2 Linear Classifier

The linear classifier is a simple approach to classification [6]. Its main idea is to construct a feature vector as the single representative of each class (category). For each class C_i, the linear classifier computes a prototype vector G_i = (g_{i,1}, ..., g_{i,n}), where n is the dimension of the vector space and each element g_{i,j} corresponds to the weight of the jth feature of G_i. The elements of G_i are learned from positive examples and tuned by negative examples: positive examples are the documents belonging to the class, while negative examples are the documents not belonging to it. To classify a request document X, we compute the cosine similarity between X and each prototype vector G_i, and assign to X the class whose prototype vector has the highest cosine similarity with X. Cosine similarity is defined as follows:

$$\mathrm{CosSim}(X, G_i) = \frac{\sum_{j=1}^{n} x_j \cdot g_{i,j}}{\sqrt{\sum_{j=1}^{n} x_j^2}\,\sqrt{\sum_{j=1}^{n} g_{i,j}^2}}$$
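To make the decision rule concrete, here is a minimal sketch (ours, not the authors' code) of prototype-based classification with cosine similarity; the vectors are plain NumPy arrays and the names are illustrative.

```python
import numpy as np

def cos_sim(x: np.ndarray, g: np.ndarray) -> float:
    """Cosine similarity between a document vector x and a prototype g."""
    denom = np.linalg.norm(x) * np.linalg.norm(g)
    return float(x @ g / denom) if denom > 0 else 0.0

def classify(x: np.ndarray, prototypes: dict) -> str:
    """Assign the class whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda c: cos_sim(x, prototypes[c]))
```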
In this study, we use the Rocchio algorithm [6] to construct the representative G_i for class C_i. Let W be a document in the training collection, represented as a vector (w_1, w_2, ..., w_n), where w_j is the weight assigned to the jth term. To determine w_j, we use the TF-IDF weighting method [10], which has been shown to be effective in the vector space model. Let tf_j be the term frequency of the jth term in document W, and let df_j be the document frequency of the jth term in the training collection. In this study, the TF-IDF weight is defined as

$$w_j = \log_2(tf_j + 1) \cdot \log_2\!\left(\frac{|D|}{df_j}\right),$$

where D is the set of documents in the training collection and |D| is the number of documents in D. Let P and N be the sets of positive and negative examples with respect to class C_i in the training corpus, and let |P| and |N| be the numbers of examples in P and N, respectively. The prototype vector G_i is defined as follows:

$$G_i = \frac{\sum_{W \in P} W}{|P|} - \eta\,\frac{\sum_{W \in N} W}{|N|},$$

where η is the parameter that adjusts the relative impact of positive and negative examples. In this study, we choose η = 0.25 according to the experiments in [13, 11].
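A minimal sketch of the TF-IDF weighting and the Rocchio prototype as defined above, assuming documents are already represented as NumPy vectors; the helper names are ours, not the paper's.

```python
import numpy as np

def tfidf_weight(tf: int, df: int, num_docs: int) -> float:
    """w_j = log2(tf_j + 1) * log2(|D| / df_j), as defined above."""
    return np.log2(tf + 1) * np.log2(num_docs / df)

def rocchio_prototype(pos, neg, eta: float = 0.25) -> np.ndarray:
    """G_i = mean of positive vectors minus eta times mean of negatives."""
    return np.mean(pos, axis=0) - eta * np.mean(neg, axis=0)
```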

3 Our Approach

In this study, we increase the number of representatives for each class to compensate for the potential weakness of linear classifiers, which compute one representative per class. An outline of our approach is as follows.

step 1. Compute the representatives of the original classes from the documents in the training set.
step 2. Classify the documents in the training set with the representatives computed in step 1.
step 3. Identify the subclasses by partitioning the documents that are classified into the same class in step 2 into s partitions, where s is determined manually in this study.
step 4. Compute the representatives of the subclasses identified in step 3.
step 5. Classify the documents in the validation set with the representatives computed in step 4.
step 6. Select the representatives of the subclasses whose classification precision achieved in step 5 is greater than a given threshold.

Steps 1, 2 and 5 are standard processes of the linear classifier as described in Section 2. In step 3, we obtain the subclasses by partitioning the documents that are classified into the same class in step 2 into s partitions, where s is determined manually in this study. In step 4, we modify the Rocchio algorithm to compute the representatives of the subclasses. In step 6, we select the representatives of the subclasses according to the classification precision they achieve in step 5. We explain the details of steps 3, 4 and 6 in Sections 3.2, 3.3 and 3.4, respectively.

3.1 Definitions and Notations

We give definitions and notations for the identification of subclasses as follows. Let C be the set of predefined classes, and |C| the number of predefined classes. Let C_i be the set of documents in the training set that belong to the ith class, and F_j the set of documents in the training set that are classified to the jth class; |C_i| and |F_j| are the numbers of documents in C_i and F_j, respectively. Let H_{i,j} be the set of documents in C_i that are classified to F_j, that is, H_{i,j} = C_i ∩ F_j, and let h_{i,j} = |H_{i,j}|. Note that

$$C_i = \bigcup_{j=1}^{|C|} H_{i,j} \quad \text{and} \quad F_j = \bigcup_{i=1}^{|C|} H_{i,j}.$$

The confusion matrix H = (h_{i,j}), as shown in Table 1, consists of the statistics of the classified documents in the training set. We identify the subclasses by dividing F_j into s partitions F_j^1, F_j^2, ..., F_j^s. Define H_{i,j}^r = H_{i,j} ∩ F_j^r and h_{i,j}^r = |H_{i,j}^r|, as shown in Table 2.

3.2 Subclass Identification

In step 3, we isolate each subclass H_{i,j}^r to form a new representative. We describe the process of subclass identification in detail as follows (a rough code sketch follows this subsection).

step 3.1. Transfer the documents in F_j into a hypergraph such that a vertex v represents one document and a hyperedge e represents the set of documents in which term t appears.
step 3.2. Partition the vertices (documents) of the hypergraph constructed in step 3.1 into s partitions, where s is determined manually.
step 3.3. Gather the vertices (documents) that belong to C_i, are classified to F_j, and are in the rth partition as a new subclass H_{i,j}^r.
step 3.4. Compute the representatives of the subclasses constructed in step 3.3 using the formula for subclass representatives described in Section 3.3.

In step 3.1, we construct a hypergraph in which a vertex v represents one document and a hyperedge e represents the set of documents in which term t appears.
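As a rough illustration of steps 2 and 3, the following sketch (our own; the hypergraph partitioner of [4] is replaced here by a placeholder partition assignment) builds the confusion cells H_{i,j} and splits each F_j into subclasses H_{i,j}^r:

```python
from collections import defaultdict

def identify_subclasses(true_class, predicted_class, partition_of, num_parts):
    """true_class[d], predicted_class[d]: class labels of training document d;
    partition_of[d]: partition index r in 0..num_parts-1 assigned to d by the
    hypergraph partitioner (a placeholder for the package used in the paper)."""
    subclasses = defaultdict(list)  # (i, j, r) -> documents in H^r_{i,j}
    for d in true_class:
        i, j, r = true_class[d], predicted_class[d], partition_of[d]
        subclasses[(i, j, r)].append(d)
    # Keep only subclasses with at least 5 documents, as in the paper.
    return {k: docs for k, docs in subclasses.items() if len(docs) >= 5}
```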

          C1        C2        ...  Cj        ...  C|C|
C1        h_{1,1}   h_{1,2}   ...  h_{1,j}   ...  h_{1,|C|}
C2        h_{2,1}   h_{2,2}   ...  h_{2,j}   ...  h_{2,|C|}
...
Ci        h_{i,1}   h_{i,2}   ...  h_{i,j}   ...  h_{i,|C|}
...
C|C|      h_{|C|,1} h_{|C|,2} ...  h_{|C|,j} ...  h_{|C|,|C|}

Table 1: The confusion matrix H.

          F_j^1       F_j^2       ...  F_j^r       ...  F_j^s
H_{1,j}   H_{1,j}^1   H_{1,j}^2   ...  H_{1,j}^r   ...  H_{1,j}^s
H_{2,j}   H_{2,j}^1   H_{2,j}^2   ...  H_{2,j}^r   ...  H_{2,j}^s
...
H_{i,j}   H_{i,j}^1   H_{i,j}^2   ...  H_{i,j}^r   ...  H_{i,j}^s
...
H_{|C|,j} H_{|C|,j}^1 H_{|C|,j}^2 ...  H_{|C|,j}^r ...  H_{|C|,j}^s

Table 2: Partition of F_j into s partitions.

The weight of the hyperedge e is determined by the tf·idf of term t [10] and is defined as follows:

$$\mathrm{Weight}(t) = \log_2(tf + 1) \cdot \log_2\!\left(\frac{|D|}{df}\right),$$

where |D| is the number of training documents, tf is the term frequency of term t, and df is the document frequency of term t in the training collection. In step 3.2, we partition the vertices (documents) of the hypergraph into s roughly equal parts using the hypergraph partitioning package [4], such that the total weight of the hyperedges connecting vertices in different parts is minimized. Note that hypergraph partitioning is an effective and scalable clustering method; intuitively, documents that share common terms are clustered together [14]. In step 3.3, as shown in Table 2, we identify the subclass H_{i,j}^r that belongs to H_{i,j} and lies in the rth partition of F_j. In this study, we only take into consideration the subclasses with h_{i,j}^r ≥ 5.

3.3 Subclass Representative

We modify the Rocchio algorithm to construct the representatives of the subclasses identified in Section 3.2. As shown in Figure 1, the training set D consists of C_1, ..., C_{|C|}, and each class C_i consists of at most |C| · s subclasses before the representative qualification described in Section 3.4. We describe the modification of the Rocchio algorithm that constructs the representative of the subclass H_{i,j}^r in the following. As shown in Figure 1, let P be the set of documents that belong to the subclass H_{i,j}^r, and P' = C_i − H_{i,j}^r the set of documents that belong to class C_i but not to subclass H_{i,j}^r. Let N = D − C_i be the set of documents in the training set D that do not belong to class C_i. The representative G_{i,j}^r of subclass H_{i,j}^r is given as follows (a sketch appears after Section 3.4):

$$G_{i,j}^r = \alpha\,\frac{\sum_{W \in P} W}{|P|} - \beta\,\frac{\sum_{W \in P'} W}{|P'|} - \eta\,\frac{\sum_{W \in N} W}{|N|}$$

In this study, we chose α = 1, β = 0 and η = 1. Note that we chose β = 0 in the above equation because we used the representative G_{i,j}^r of the subclass H_{i,j}^r to distinguish class C_i from the other classes C_j (i ≠ j), but did not use it to distinguish H_{i,j}^r from the other subclasses derived from class C_i.

3.4 Representative Qualification

In this study, we select the new representatives whose classification precision, evaluated on the validation set, is greater than a given threshold θ. We classify the documents in the validation set with the representatives obtained in Section 3.3 and compute the precision of each representative; we then select the representatives whose classification precision on the validation set is greater than θ. The value of θ is chosen according to the micro-level accuracy θ1 achieved by the linear classifier on the testing set. In this study, we set θ > θ1 in order to achieve higher precision and better performance than the linear classifier did.
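A minimal sketch, under the same vector conventions as before, of the modified Rocchio formula with α = 1, β = 0, η = 1 and of the precision-based qualification step; the names and data layout are illustrative, not the paper's code.

```python
import numpy as np

def subclass_prototype(P, P_prime, N, alpha=1.0, beta=0.0, eta=1.0):
    """G^r_{i,j} = alpha*mean(P) - beta*mean(P') - eta*mean(N).
    The paper uses alpha = 1, beta = 0, eta = 1."""
    g = alpha * np.mean(P, axis=0) - eta * np.mean(N, axis=0)
    if beta:  # beta = 0 in the paper, so the P' term usually drops out
        g -= beta * np.mean(P_prime, axis=0)
    return g

def qualify(representatives, validation_precision, theta):
    """Keep only representatives whose validation precision exceeds theta."""
    return {r: g for r, g in representatives.items()
            if validation_precision[r] > theta}
```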
4 Experiments

4.1 Data Source

In our experiments, we used Chinese news articles from the Central News Agency (CNA). We used news articles spanning one year, from 1/1/1991 to 12/31/1991, to extract terms. News articles from the six-month period 8/1/1991 to 1/31/1992 were used as training data to train the classifiers, and the testing data consisted of news articles from the one-month period 2/1/1992 to 2/28/1992. To avoid the overfitting problem [8], the training data are partitioned into a training set and a validation set; in this study, the training set consists of two-thirds of the training data and the validation set of the remaining third. All the news articles were preclassified into 12 classes, as listed in Table 3.

4.2 Document Representation

The representation of Chinese texts consists of the following steps: term extraction, term selection and term clustering.

[Figure 1 (schematic): the training set D is divided into classes C_1, ..., C_{|C|}; within class C_i, the subclass H_{i,j} is partitioned into H_{i,j}^1, ..., H_{i,j}^s, with P = H_{i,j}^r, P' = C_i − H_{i,j}^r and N = D − C_i.]

Figure 1: The modification of the Rocchio algorithm for the subclass H_{i,j}^r.

        CNA News Group            Training Set  Validation Set  Test Set
C1      cna.politics.*                    8988            4494      1225
C2      cna.economics.*                   3846            1922       776
C3      cna.transport.*                   1200             601       279
C4      cna.edu.*                         2136            1067       379
C5      cna.l*                            1852             926       415
C6      cna.judiciary.*                   2088            1044       492
C7      cna.stock.*                       1186             593       200
C8      cna.military.*                    1212             606       261
C9      cna.argriculture.*                 997             499       238
C10     cna.religion.*                     471             236        74
C11     cna.finance.*                     1306             652       151
C12     cna.health-n-welfare.*            1158             580       305
Total                                    26440           13220      4795

Table 3: CNA news statistics (training data 1991/8 to 1992/1; test data 1992/2/1 to 2/28).

In term extraction, we adopt a scalable approach [15] to extract significant terms, which is based on String B-trees (SB-trees) [3]. In term selection, we adopt the χ² statistic [18] to select the most representative terms from the extracted terms. In term clustering, terms that are highly correlated are clustered into the same group; distributional clustering [1] can reduce the dimension of the vector space to a practical level for Chinese text categorization [13, 12]. In our experiment, we use one year of news, 1/1/1991 to 12/31/1991, to extract Chinese frequent strings (CFS) [7]; the number of significant terms extracted is 548363. We select 90000 of the extracted terms and then group them into 4800 clusters, because the choice of 90000 and 4800 achieves the best performance as indicated in [13, 11]. Therefore, each document D_i is transformed into a vector (d_{i,1}, ..., d_{i,n}), where n is 4800 and d_{i,j} is the tf·idf weight [10] of the jth term in D_i.

4.3 Performance Measures

We measure the classification accuracy at both the micro and macro levels. Three performance measures are used to evaluate each classifier: MicroAccuracy, MacroAccuracy and AccuracyVariance. Let |C| be the number of predefined classes, let |C_i| be the number of testing news articles that are preclassified to the ith class, and let N = Σ_{i=1}^{|C|} |C_i| be the total number of testing news articles. Let |H_{i,j}| be the number of testing news articles in C_i that are classified to C_j, and let Acc(i) = |H_{i,i}| / |C_i| be the classification accuracy within class C_i. MicroAccuracy is defined as

$$\mathrm{MicroAccuracy} = \frac{\sum_{i=1}^{|C|} |H_{i,i}|}{N},$$

which represents the overall average of classification accuracy. MacroAccuracy is defined as

$$\mathrm{MacroAccuracy} = \frac{\sum_{i=1}^{|C|} \mathrm{Acc}(i)}{|C|},$$

which represents the average of the classification accuracy within classes. AccuracyVariance is defined as

$$\mathrm{AccuracyVariance} = \frac{\sum_{i=1}^{|C|} (\mathrm{Acc}(i) - \mathrm{MacroAccuracy})^2}{|C|},$$

which represents the variance of the accuracy among classes. Note that we measured the classification time on a PC with a Pentium III 450 CPU and 192MB RAM.

In order to examine the biased situation in which some classifiers favor large classes over small ones, we also adopt the performance measures recall, precision and the F1 measure. Recall (R) is the percentage of the documents of a given class (category) that are classified correctly. Precision (P) is the percentage of the documents classified to a given class that are classified correctly. The F1 measure is a common way to combine recall and precision, and is defined as

$$F_1 = \frac{2RP}{R + P}.$$
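A minimal sketch of these three accuracy measures, assuming a confusion matrix whose entry [i, j] counts the testing documents of class i classified to class j; the helper is ours, for illustration only.

```python
import numpy as np

def accuracy_measures(H: np.ndarray):
    """H[i, j] = number of test documents of class i classified to class j."""
    per_class = np.diag(H) / H.sum(axis=1)        # Acc(i) for each class
    micro = np.trace(H) / H.sum()                 # MicroAccuracy
    macro = per_class.mean()                      # MacroAccuracy
    variance = ((per_class - macro) ** 2).mean()  # AccuracyVariance
    return micro, macro, variance
```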
4.4 Experimental Results

4.4.1 Improving the Linear Classifier

First, as shown in Table 4, we obtained the confusion matrix H by classifying the documents in the training set. Second, we identified the subclasses H_{i,j}^r by partitioning the documents in each F_j, i.e., the documents classified to the jth class, into s partitions, where s was determined manually; the values of s experimented with were 2, 4, 8, 16, 32, 64, 128 and 256. As shown in Table 5, we isolated 1017 subclasses before representative qualification when s = 64. Note that we only took into consideration the subclasses whose h_{i,j}^r was greater than or equal to 5.

Third, we used the representatives of the identified subclasses to classify the documents in the validation set, and performed representative qualification by selecting the representatives whose precision was greater than a threshold of 80%, a value chosen according to the MicroAccuracy of about 75% achieved by the Rocchio linear classifier in this study. Finally, we classified the news articles in the testing set with the qualified representatives together with those derived from the original classes. The comparison of the performance for different values of s is shown in Table 6. The best MicroAccuracy our approach achieved was 77.54% when s = 64, with a corresponding MacroAccuracy of 77.22% and AccuracyVariance of 75.30; the number of representatives was 580 in total, 568 derived from subclasses and 12 from the original classes. We chose the case s = 64 for further discussion.

Table 4: The confusion matrix H: the statistics of the classified news articles in the training set (rows and columns C1 to C12).

Table 5: The distribution of subclasses over class pairs (s = 64), 1017 subclasses in total.

Table 6: The comparison of different numbers of partitions s: MicroAccuracy, MacroAccuracy, AccuracyVariance, number of representatives, and classification time.

4.4.2 Overall Comparison

To evaluate the effectiveness of our approach, we compared it with the linear classifier produced by the Rocchio algorithm and with the k-Nearest Neighbor (kNN) classifier. Our approach improved on the linear classifier and achieved a MicroAccuracy similar to that of kNN, with much less classification time. Our approach also avoided the biased situation [14] in which a classifier favors large classes over small ones.

We briefly describe the kNN classifier for completeness (a minimal sketch appears at the end of this section). Given an arbitrary request document X, kNN ranks its nearest neighbors among the training documents and uses the classes of the k top-ranking neighbors to predict the classes of X. The similarity score of each neighbor document to X is used as the weight of that neighbor's class, and the sum of the class weights over the k nearest neighbors is used for class ranking [16]. Note that kNN is a well-known statistical approach and one of the best performers in text categorization [17]. We performed an experiment using different values of k, including 5, 10, 15, 20, 30, 50, 100 and 200; the best choice of k in our experiment was 50.

As shown in Table 7, the MicroAccuracy of 77.54% achieved by our approach was better than the 75.20% achieved by Rocchio and similar to the 77.62% achieved by kNN; the MacroAccuracy and AccuracyVariance of our approach were similar to those of Rocchio and better than those of kNN. Furthermore, the classification time of our approach, about 9 minutes, was much less than that of kNN, about 1 hour and 29 minutes. On the other hand, as shown in Table 8, most of the F1 values achieved by our approach were better than or equal to those of Rocchio, except for class C8. That is, our approach improved the performance of the linear classifier while avoiding the biased situation [14] that favors large classes over small ones.

4.4.3 Suggestions for Reorganizing the Class Structure

We can provide suggestions for reorganizing the structure of the classes from the representatives whose classification precision on the validation set was low: such subclasses expose the ambiguities between classes that arise from the characteristics of the linear classifier. Table 9 shows the distribution of the number of subclasses whose precision was lower than 50%. Class C1 (Politics), for example, was a confused class that was highly correlated with the other classes, because 45 representatives derived from class C1 to distinguish it from the other classes failed to pass representative qualification. Class C2 (Economics) was highly correlated with class C11 (Finance): 11 representatives were derived from class C2 to distinguish it from class C11, as shown in Table 5, but 5 of them had precision lower than 50%, as shown in Table 9. Similarly, 3 representatives were derived from C11 to distinguish it from C2, as shown in Table 5, but all of them failed to pass representative qualification, as shown in Table 9.
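A minimal sketch of the kNN scheme described above, with cosine similarity as the scoring function; it reuses the cos_sim helper from the Section 2 sketch and is illustrative rather than the authors' implementation (the default k = 50 matches the best choice in their experiments).

```python
from collections import defaultdict

def knn_classify(x, training_docs, k=50):
    """training_docs: list of (vector, class_label) pairs.
    Rank neighbors by similarity and sum similarity scores per class."""
    neighbors = sorted(training_docs, key=lambda d: cos_sim(x, d[0]),
                       reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += cos_sim(x, vec)  # similarity as class weight
    return max(scores, key=scores.get)
```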
5 Conclusions

In this paper, we have improved linear classifiers by increasing the number of representatives for each class, to compensate for the potential weakness of linear classifiers that compute one representative per class. We identify new representatives derived from the subclasses, which are isolated from the misclassified and correctly classified documents via a hypergraph partitioning package. Then, we select the representatives of the subclasses whose classification precision, evaluated on the validation set, is greater than a given threshold. Finally, we classify the documents in the testing set with the representatives consisting of these new representatives and those derived from the original classes. To evaluate the effectiveness of our approach, we have compared it with the linear classifier produced by the Rocchio algorithm and with the k-Nearest Neighbor (kNN) classifier. Experimental results show that our approach improves the linear classifier and achieves a MicroAccuracy similar to that of kNN, but takes much less classification time.

Table 7: Performance comparison of Rocchio, our approach, and kNN: MicroAccuracy, MacroAccuracy, AccuracyVariance, number of representatives, and classification time.

Table 8: Precision (%), recall (%) and F1 measure comparison for classes C1 to C12 under Rocchio, our approach, and kNN.

Table 9: The distribution of the number of representatives whose precision < 50% (rows and columns C1 to C12).

Our approach also avoids the biased situation that favors large classes over small ones. Furthermore, we can provide suggestions for reorganizing the class structure via the subclasses whose representatives achieved low precision on the validation set.

Acknowledgment

We would like to thank Dr. Lee-Feng Chien and Mr. Min-Jer Lee for their kind help in gathering the CNA news articles.

References

[1] L. Douglas Baker and Andrew Kachites McCallum. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pages 96-103, 1998.

[2] William W. Cohen and Yoram Singer. Context-sensitive learning methods for text categorization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pages 307-315, 1996.

[3] Paolo Ferragina and Roberto Grossi. The String B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236-280, 1999.

[4] George Karypis and Vipin Kumar. hMETIS: a hypergraph partitioning package. Technical report, University of Minnesota, Department of Computer Science and Engineering, 1998.

[5] Wai Lam. Using a generalized instance set for automatic text categorization. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pages 81-89, 1998.

[6] David D. Lewis, Robert E. Schapire, James P. Callan, and Ron Papka. Training algorithms for linear text classifiers. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pages 298-306, 1996.

[7] Yih-Jeng Lin, Ming-Shing Yu, Shyh-Yang Hwang, and Ming-Jer Wu. A way to extract unknown words without dictionary from Chinese corpus and its applications. In Research on Computational Linguistics Conference (ROCLING XI), pages 217-226, 1998.

[8] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[9] Ketan Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice Hall, 1994.

[10] Amitabh Kumar Singhal. Term Weighting Revisited. PhD thesis, Cornell University, 1997.

[11] Jyh-Jong Tsay and Jing-Doo Wang. Comparing classifiers for automatic Chinese text categorization. In 1999 National Computer Symposium, Taiwan, R.O.C., pages B-274 to B-281, 1999.

[12] Jyh-Jong Tsay and Jing-Doo Wang. Term selection with distributional clustering for Chinese text categorization using n-grams. In Research on Computational Linguistics Conference XII, pages 151-170, 1999.

[13] Jyh-Jong Tsay and Jing-Doo Wang. Design and evaluation of approaches for automatic Chinese text categorization. International Journal of Computational Linguistics and Chinese Language Processing (CLCLP), 5(2):43-58, August 2000.

[14] Jyh-Jong Tsay and Jing-Doo Wang. Improving automatic Chinese text categorization by error correction. In The Fifth International Workshop on Information Retrieval with Asian Languages (IRAL2000), pages 1-8, 2000.

[15] Jyh-Jong Tsay and Jing-Doo Wang. A scalable approach for Chinese term extraction. In 2000 International Computer Symposium (ICS2000), Taiwan, R.O.C., pages 246-253, 2000.

[16] Yiming Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2):67-88, 1999.

[17] Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pages 42-49, 1999.

[18] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pages 412-420, 1997.

