Document Labeling by the Constructed Classifier

Chapter 4 Incremental Mining Algorithms for Document Classifiers

4.5 Document Labeling by the Constructed Classifier

Using the constructed classifier, a document vector is easily calculated by summing the related feature vectors in the feature-domain weighting table. The larger the weight of an entry in the document vector, the more relevant the corresponding category is to the document. Thus, the classifier can assign a category label to an undefined document on the basis of its entry weights.

Given an undefined document d, the document labeling algorithm, shown in Figure 4-7, first uses the constructed classifier C to obtain the document vector Vd by summing the feature vectors, taken from the feature-domain weighting table, of the features occurring in d (Step 2 and Step 3). The algorithm then assigns a category label to d according to the entry with the maximum weight in Vd (Step 4).

Document Labeling Algorithm:

Input:

d: An undefined document.

C: The classifier constructed by the classifier construction algorithm.

Output:

l: The category label for d.

Begin

(1) Vd←0; //Vd is the document vector of d and |Vd| equals the number of categories

(2) For each feature fk in d, do

(2.1) Extract the feature vector fvk from C;

(2.2) Vd = Vd + fvk;

(3) Vd = Vd / count(fk); //count(fk) is the number of features in d

(4) Return the category label l of the maximum weight in Vd.

End

Figure 4-7: The document labeling algorithm
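The steps of Figure 4-7 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the feature-domain weighting table is assumed to be a dictionary from feature to a list of per-category weights, the names `document_vector` and `label_document` are hypothetical, and the handling of unknown features and empty documents is an assumption that the algorithm does not specify.

```python
from typing import Dict, List, Optional, Sequence


def document_vector(features: Sequence[str],
                    weighting_table: Dict[str, List[float]],
                    num_categories: int) -> List[float]:
    """Steps 1 to 3: sum the feature vectors of d, then divide by the feature count."""
    vd = [0.0] * num_categories                    # Step 1: Vd <- 0
    for fk in features:                            # Step 2: each feature occurring in d
        # Assumption: a feature missing from the table contributes a zero vector.
        fvk = weighting_table.get(fk, [0.0] * num_categories)
        vd = [v + w for v, w in zip(vd, fvk)]      # Step 2.2: Vd = Vd + fvk
    if features:                                   # Step 3: Vd = Vd / count(fk)
        vd = [v / len(features) for v in vd]
    return vd


def label_document(features: Sequence[str],
                   weighting_table: Dict[str, List[float]],
                   categories: Sequence[str]) -> Optional[str]:
    """Step 4: return the category whose entry in Vd has the maximum weight."""
    if not features:
        return None  # assumption: an empty document receives no label
    vd = document_vector(features, weighting_table, len(categories))
    best = max(range(len(vd)), key=lambda i: vd[i])
    return categories[best]


# Made-up weights for two hypothetical categories, for illustration only.
table = {"oil": [0.9, 0.1], "wheat": [0.2, 0.8], "price": [0.5, 0.5]}
cats = ["crude", "grain"]
print(label_document(["oil", "price"], table, cats))             # crude
print(label_document(["wheat", "wheat", "price"], table, cats))  # grain
```

Dividing by the feature count in Step 3 does not change which entry is largest; it only keeps the entry weights comparable across documents of different lengths.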

4.6 Experimental Results

Our experiments were implemented in Java and conducted on a personal computer with a 1.7GHz Pentium processor and 512MB of main memory running Windows 2000. We used the standard Reuters-21578 benchmark text collection (REUTERS-21578, Distribution 1.0) [58] with the “ModApte” split. This dataset comprises 12,902 documents in 118 categories, of which 9,603 documents are for training and 3,299 are for testing. The following groups of categories were used to evaluate classification accuracy:

(1) the 10 categories with the largest number of training documents (Reuters-21578(10));

(2) the 90 categories, each of which contains at least one training document and one test document (Reuters-21578(90));

(3) the 115 categories, each of which contains at least one training document (Reuters-21578(115)).

We tested four aspects of our classifier using the micro- and macro-averaging F1 evaluation functions (shown in Formula 4-1):

(1) the classification accuracy of our classifier construction algorithm compared to the algorithms shown in [23];

(2) the influence of the training document threshold φ and the discrimination threshold δ on classification accuracy;

(3) the influence of the number of tuning documents on classification accuracy;

(4) the time performance of our classifier construction algorithm compared to a batch-based mining algorithm.
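As a reminder of how the two averages differ, the sketch below computes micro- and macro-averaged F1 from per-category counts. It assumes the standard definitions (micro-averaging pools the counts over all categories, macro-averaging takes the mean of the per-category F1 values); the function names, the convention that an empty denominator gives F1 = 0, and the example counts are all illustrative assumptions.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one category
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_category: Dict[str, Counts]) -> float:
    """Pool the counts over all categories, then compute a single F1."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1(tp, fp, fn)


def macro_f1(per_category: Dict[str, Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    scores = [f1(*c) for c in per_category.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up counts: one frequent category and one rare category.
counts = {"frequent": (8, 2, 2), "rare": (1, 1, 3)}
micro = micro_f1(counts)   # 18 / 26
macro = macro_f1(counts)   # (0.8 + 1/3) / 2
```

Because macro-averaging weights every category equally, poorly learned rare categories pull it down more than they pull down the micro average, which is dominated by the frequent categories.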

In [23], Debole and Sebastiani applied six supervised term weighting functions, namely chi-square, information gain, and gain ratio, each used globally and locally (χ²(g), IG(g), GR(g), χ²(l), IG(l), and GR(l)), in the Rocchio, k-NN, and SVM classifier construction algorithms, and compared their average classification accuracy on the Reuters-21578(10), Reuters-21578(90), and Reuters-21578(115) datasets. The comparison results are shown in Table 4-5.

Table 4-5: Micro- and macro-averaging F1 values shown in [23]

Metric    Dataset             χ²(g)   IG(g)   GR(g)   χ²(l)   IG(l)   GR(l)
Micro F1  Reuters-21578(10)   0.852   0.843   0.857   0.810   0.816   0.816
Micro F1  Reuters-21578(90)   0.795   0.750   0.803   0.758   0.767   0.767
Micro F1  Reuters-21578(115)  0.793   0.747   0.800   0.756   0.765   0.765
Macro F1  Reuters-21578(10)   0.725   0.707   0.739   0.674   0.684   0.684
Macro F1  Reuters-21578(90)   0.542   0.377   0.589   0.527   0.559   0.559
Macro F1  Reuters-21578(115)  0.596   0.458   0.629   0.581   0.608   0.608

We set the discrimination threshold δ in our classifier construction algorithm to 0.5 for the Reuters-21578(10) dataset and to 0.04 for the Reuters-21578(90) and Reuters-21578(115) datasets; the number of tuning documents was set to 0. Table 4-6 shows the classification accuracy of our classifier at various training document thresholds φ. The threshold φ determines which categories in the training documents are available to our training algorithm: if the number of training documents in a category was less than the specified φ, the category was omitted from the training algorithm. For example, only the 39 categories in Reuters-21578(90) that satisfied φ = 25 were used in the training algorithm.
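The effect of φ on the set of usable categories can be sketched as a simple filter. The function name `eligible_categories` and the per-category document counts below are made up for illustration and are not the actual Reuters-21578 counts.

```python
from typing import Dict, List


def eligible_categories(training_counts: Dict[str, int], phi: int) -> List[str]:
    """Keep a category only if it has at least phi training documents.

    A category with fewer than phi training documents is omitted,
    so a category with exactly phi documents is retained.
    """
    return [cat for cat, n in training_counts.items() if n >= phi]


# Hypothetical numbers of training documents per category.
counts = {"cat_a": 120, "cat_b": 25, "cat_c": 14, "cat_d": 1}
kept_at_1 = eligible_categories(counts, 1)    # all four categories
kept_at_15 = eligible_categories(counts, 15)  # cat_a and cat_b
kept_at_25 = eligible_categories(counts, 25)  # cat_a and cat_b (25 is not less than 25)
```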

Tables 4-5 and 4-6 show that the classification accuracy of our classifier construction algorithm was always better than those in [23] on Reuters-21578(10), whereas its results on Reuters-21578(90) and Reuters-21578(115) were worse when φ was less than 15. We may therefore conclude that the classification accuracy of the classifier constructed by the domain-space weighting scheme improves when sufficient training documents are available.

Table 4-6: Micro- and macro-averaging F1 values at φ=1, φ=15 and φ=25

Metric    Dataset             φ=1     φ=15    φ=25
Micro F1  Reuters-21578(10)   0.903   0.903   0.903
Micro F1  Reuters-21578(90)   0.751   0.784   0.815
Micro F1  Reuters-21578(115)  0.737   0.784   0.815
Macro F1  Reuters-21578(10)   0.824   0.824   0.824
Macro F1  Reuters-21578(90)   0.490   0.569   0.660
Macro F1  Reuters-21578(115)  0.616   0.569   0.660

Tables 4-7 to 4-11 detail how the training document threshold φ and the discrimination threshold δ affected classification accuracy on Reuters-21578(10), Reuters-21578(90), and Reuters-21578(115). Since each category in Reuters-21578(10) contains more than 50 training documents, the influence of φ is not considered in Table 4-7. As mentioned before, the scale of δ is determined by the number of categories. Thus, the range of δ is [1/10, 1] in Table 4-7, [1/90, 1] in Tables 4-8 and 4-9, and [1/115, 1] in Tables 4-10 and 4-11.

In Tables 4-7 to 4-11, the influence of δ is not evident, even on Reuters-21578(10), perhaps because the one-normalization step of the discrimination algorithm already achieves the purpose of discrimination, so that setting δ has little further influence on classification accuracy. By contrast, φ had a decisive influence on classification accuracy: the larger the number of training documents required per category, the better the classification accuracy. Table 4-12 shows the number of remaining categories at various φ on Reuters-21578(10), Reuters-21578(90), and Reuters-21578(115). When φ was 15 or greater, the training algorithm considered the same number of categories on Reuters-21578(90) and Reuters-21578(115).

Table 4-7: Micro- and macro-averaging F1 values at various δ for Reuters-21578(10)

δ     Micro F1      Macro F1
0.9   0.902511370   0.814721475
0.8   0.901324896   0.813716994
0.7   0.903302353   0.820149529
0.6   0.903302353   0.819969831
0.5   0.902906862   0.823657403
0.4   0.898951948   0.815825122
0.3   0.901324896   0.817534791
0.2   0.895788017   0.804622957
0.1   0.898160965   0.806951786

Table 4-8: Micro-averaging F1 values at various δ and φ for Reuters-21578(90)

δ      φ=1       φ=5       φ=15      φ=25      φ=35      φ=45
0.1    0.74739   0.75360   0.78372   0.81300   0.82566   0.84547
0.08   0.74827   0.75389   0.78403   0.81269   0.82631   0.84447
0.06   0.75033   0.75478   0.78464   0.81458   0.82695   0.84681
0.04   0.75063   0.75300   0.78433   0.81521   0.82824   0.84681
0.02   0.74974   0.75271   0.78555   0.81553   0.8289    0.84681
0.01   0.74974   0.75330   0.78555   0.81584   0.8289    0.84681

Table 4-9: Macro-averaging F1 values at various δ and φ for Reuters-21578(90)

δ      φ=1       φ=5       φ=15      φ=25      φ=35      φ=45
0.1    0.46830   0.52281   0.56963   0.66335   0.67258   0.71811
0.08   0.48881   0.54619   0.57344   0.65812   0.67529   0.71390
0.06   0.48748   0.53001   0.57152   0.66360   0.67395   0.71542
0.04   0.48997   0.52214   0.56868   0.65998   0.67747   0.71738
0.02   0.48467   0.51960   0.57205   0.66281   0.67783   0.71738
0.01   0.48922   0.52176   0.57205   0.66312   0.67783   0.71738

Table 4-10: Micro-averaging F1 values at various δ and φ for Reuters-21578(115)

δ      φ=1       φ=5       φ=15      φ=25      φ=35      φ=45
0.1    0.73593   0.74885   0.78372   0.81300   0.82566   0.71811
0.08   0.73505   0.74915   0.78403   0.81269   0.82631   0.71390
0.06   0.73711   0.7506    0.78464   0.81458   0.82695   0.71542
0.04   0.73681   0.74944   0.78433   0.81521   0.82824   0.71738
0.02   0.73652   0.74855   0.78555   0.81553   0.8289    0.71738
0.01   0.73711   0.74915   0.78555   0.81553   0.8289    0.71738

Table 4-11: Macro-averaging F1 values at various δ and φ for Reuters-21578(115)

δ      φ=1       φ=5       φ=15      φ=25      φ=35      φ=45
0.1    0.62378   0.53231   0.56963   0.66335   0.67258   0.71811
0.08   0.60384   0.55127   0.57344   0.65812   0.67529   0.71390
0.06   0.60474   0.53598   0.57152   0.66360   0.67395   0.71542
0.04   0.61597   0.53130   0.56868   0.65998   0.67747   0.71738
0.02   0.61526   0.53057   0.55990   0.66312   0.67783   0.71738
0.01   0.61526   0.52903   0.57205   0.66281   0.67783   0.71738

Table 4-12: Numbers of remaining categories at various φ

Dataset             φ=1   φ=5   φ=15   φ=25   φ=35   φ=45
Reuters-21578(10)   10    10    10     10     10     10
Reuters-21578(90)   90    69    51     39     34     27
Reuters-21578(115)  115   70    51     39     34     27

[Plot for Reuters-21578(10): micro-averaging F1 (y-axis, 0.85 to 0.95) vs. number of tuning documents (x-axis, 0 to 1000)]

Figure 4-8: Micro-averaging F1 value vs. number of tuning documents for Reuters-21578(10)

The influence of the number of tuning documents on classification accuracy for Reuters-21578(10), Reuters-21578(90), and Reuters-21578(115) is shown in Figures 4-8, 4-9 and 4-10, respectively. Since the tuning documents in our experiments were selected from the test documents, the original test document dataset was divided into tuning and test sets. Experimental results showed that setting the tuning parameter ζ to 0.000005 yielded a stably increasing trend. Too small a ζ value may make each tuning adjustment so tiny that the tuning effect is insignificant, whereas too large a ζ value may cause unstable, oscillating adjustments with unpredictable tuning effects. Figures 4-8 to 4-10 show that the classification accuracy of the constructed classifier improved as the number of tuning documents increased and tended toward convergence when the number exceeded 700.

[Plot for Reuters-21578(90): micro-averaging F1 (y-axis, 0.7 to 0.85) vs. number of tuning documents (x-axis, 0 to 1000), with one curve each for φ=1, φ=15 and φ=25]

Figure 4-9: Micro-averaging F1 values vs. number of tuning documents at φ=1, φ=15 and φ=25 for Reuters-21578(90)

[Plot for Reuters-21578(115): micro-averaging F1 (y-axis, 0.7 to 0.85) vs. number of tuning documents (x-axis, 0 to 1000), with one curve each for φ=1, φ=15 and φ=25]

Figure 4-10: Micro-averaging F1 values vs. number of tuning documents at φ=1, φ=15 and φ=25 for Reuters-21578(115)

We evaluated the efficiency of our classifier construction algorithm in comparison with a batch-based classifier construction approach, excluding the tuning algorithm. When a new category is added in the i-th run, the computation time of our classifier construction algorithm has three major components: (1) the time to extract and weight features from the given category, denoted as $t_{i1}$; (2) the time to integrate the training results into the feature-domain weighting table, denoted as $t_{i2}$; and (3) the time to reduce the weights of features with lower discriminating power in the feature-domain weighting table, denoted as $t_{i3}$. Since $t_{i1} > t_{i2} \gg t_{i3}$, the total computation time in the i-th run can be simplified to $O(t_{i1} + t_{i2})$. However, when our classifier construction algorithm mimicked a batch-based approach and had to re-process all previous categories to reconstruct its classifier in each run, the total computation time for the i-th run was $O(\sum_{j=1}^{i}(t_{j1} + t_{j2}))$. Figure 4-11 shows the computation times of our classifier construction algorithm in batch mode and in incremental mode for Reuters-21578(10) as the number of considered categories increases.
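The difference between the two cost expressions can be illustrated with a small cost model. This is only a sketch under the simplification stated above (the third time component is ignored); the function names and the per-category times are hypothetical and are not measurements from the experiments.

```python
from itertools import accumulate
from typing import List, Sequence


def incremental_costs(t1: Sequence[float], t2: Sequence[float]) -> List[float]:
    """Cost of run i in incremental mode: only the new category is processed."""
    return [a + b for a, b in zip(t1, t2)]


def batch_costs(t1: Sequence[float], t2: Sequence[float]) -> List[float]:
    """Cost of run i in batch mode: categories 1..i are all re-processed."""
    return list(accumulate(incremental_costs(t1, t2)))


# Hypothetical extraction/weighting times (t1) and integration times (t2)
# for three categories added in three successive runs.
t1 = [5, 4, 6]
t2 = [2, 1, 3]
print(incremental_costs(t1, t2))  # [7, 5, 9]
print(batch_costs(t1, t2))        # [7, 12, 21]
```

In this model the incremental cost of a run depends only on the category being added, whereas the batch cost of a run grows with the number of categories already processed.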

[Plot: computation time (y-axis, 0 to 800000) over runs T1 to T10 (x-axis)]