

5.3.1 The Comparison of Discretization Schemes

We implemented the following seven discretization algorithms in Microsoft Visual C++ 6.0 to evaluate the performance of Equation 5.1. Among the seven discretization algorithms, CACC is our approach; it shares the same main framework as CAIM but uses a different discretization metric (a sketch of this shared framework follows the list below).

1. Equal Width and Equal Frequency: two typical unsupervised top-down methods;

2. CACC: our approach;

3. CAIM: the newest top-down method;

4. IEM: a famous and widely used top-down method;

5. ChiMerge: a typical bottom-up method;

6. Extended Chi2: the newest bottom-up approach.
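As noted above, CACC and CAIM share one greedy top-down framework and differ only in the metric used to score a candidate scheme. The following is a minimal sketch of that framework, assuming the CAIM-style stopping rule (keep adding boundaries while the metric improves, or while the number of intervals is still below the number of classes); the names discretizeTopDown and Criterion are illustrative, not from the thesis.

    #include <algorithm>
    #include <functional>
    #include <limits>
    #include <vector>

    // Criterion scores a complete set of cut points; plugging in the CAIM
    // or the CACC metric here yields the corresponding algorithm.
    using Criterion = std::function<double(const std::vector<double>&)>;

    std::vector<double> discretizeTopDown(const std::vector<double>& values,
                                          int numClasses,
                                          const Criterion& score) {
        // Candidate boundaries: midpoints between consecutive distinct values.
        std::vector<double> sorted(values);
        std::sort(sorted.begin(), sorted.end());
        std::vector<double> candidates;
        for (size_t i = 1; i < sorted.size(); ++i)
            if (sorted[i] != sorted[i - 1])
                candidates.push_back((sorted[i] + sorted[i - 1]) / 2.0);

        std::vector<double> cuts;  // accepted boundaries, kept sorted
        double best = -std::numeric_limits<double>::infinity();
        while (!candidates.empty()) {
            // Greedily pick the candidate whose addition maximizes the metric.
            double roundBest = -std::numeric_limits<double>::infinity();
            size_t roundIdx = 0;
            for (size_t i = 0; i < candidates.size(); ++i) {
                std::vector<double> trial(cuts);
                trial.insert(std::upper_bound(trial.begin(), trial.end(),
                                              candidates[i]),
                             candidates[i]);
                double s = score(trial);
                if (s > roundBest) { roundBest = s; roundIdx = i; }
            }
            // Stop when the metric no longer improves, unless we still have
            // fewer intervals than classes (the assumed CAIM-style rule).
            bool needMore = static_cast<int>(cuts.size()) + 1 < numClasses;
            if (roundBest <= best && !needMore) break;
            cuts.insert(std::upper_bound(cuts.begin(), cuts.end(),
                                         candidates[roundIdx]),
                        candidates[roundIdx]);
            candidates.erase(candidates.begin() + roundIdx);
            best = roundBest;
        }
        return cuts;
    }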

Among the seven discretization algorithms, Equal Width, Equal Frequency, and ChiMerge require the user to specify some discretization parameters in advance. For the ChiMerge algorithm, we set the significance level to 0.95. For the Equal Width and Equal Frequency methods, we adopted the heuristic formula used in CAIM to estimate the number of discrete intervals. All experiments were run on a PC with the Windows XP operating system, a Pentium IV 1.8 GHz CPU, and 512 MB of SDRAM.
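For concreteness, the following is a minimal sketch of the two unsupervised baselines, assuming the target number of intervals k has already been fixed (the thesis derives k with the heuristic from the CAIM paper, which is not reproduced here); the function names are illustrative.

    #include <algorithm>
    #include <vector>

    // Equal Width: k intervals of identical length over [min, max].
    std::vector<double> equalWidthCuts(const std::vector<double>& v, int k) {
        auto mm = std::minmax_element(v.begin(), v.end());
        double lo = *mm.first, hi = *mm.second;
        double width = (hi - lo) / k;
        std::vector<double> cuts;
        for (int i = 1; i < k; ++i) cuts.push_back(lo + i * width);
        return cuts;
    }

    // Equal Frequency: boundaries chosen so each interval holds about
    // n / k of the sorted values.
    std::vector<double> equalFrequencyCuts(std::vector<double> v, int k) {
        std::sort(v.begin(), v.end());
        std::vector<double> cuts;
        for (int i = 1; i < k; ++i)
            cuts.push_back(v[i * v.size() / k]);
        return cuts;
    }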

Our experimental data include thirteen UCI real datasets. Seven of them were used in CAIM and the rest were gathered from the U.C. Irvine repository [59]. The details of the thirteen UCI experimental datasets are listed in Table 5.9. The 10-fold cross-validation method was applied to all experimental datasets: the discretization was done using the training sets, and the testing sets were then discretized using the generated discretization scheme. In addition, we used C5.0 to evaluate the generated discretization schemes. C5.0 was chosen because it is conveniently available and widely used as a standard for comparison in the machine learning literature. Finally, we used the Friedman test and Holm's post-hoc tests with significance level α = 0.05 to statistically verify the hypothesis of improved performance.
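Applying a scheme learned on a training fold to the corresponding test fold amounts to mapping each continuous value to the index of its interval. A minimal sketch, using binary search over the sorted cut points (the function name is illustrative):

    #include <algorithm>
    #include <vector>

    // Returns the 0-based interval index of value under the given scheme;
    // a value equal to a cut point falls in the interval above it (a
    // convention choice).
    int intervalIndex(double value, const std::vector<double>& cuts) {
        return static_cast<int>(
            std::upper_bound(cuts.begin(), cuts.end(), value) - cuts.begin());
    }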

Table 5.9 The summary of the thirteen UCI real datasets

The comparisons of the generated discretization schemes are shown in Figure 5.1. Due to space limitations, we show for each dataset only the mean cair value, the mean execution time, and the mean number of discrete intervals. We used the Friedman test to check whether the measured mean ranks exhibited statistically significant differences; where they did, Holm's post-hoc test was used to further analyze the comparisons of all the methods against CACC. Although we also report the number of discrete intervals in this experiment, it was not our main concern. Recall from the Introduction that the general goals of a discretization algorithm should be: a) to generate a better discretization scheme (measured by the cair value in Equation 2.2); b) the generated discretization scheme should improve the accuracy and efficiency of a learning algorithm; and c) the discretization process should be as fast as possible. A discretization scheme with fewer intervals may not only lower the quality of the discretization scheme and the accuracy of a classifier, but may also increase the number of rules produced by a classifier.
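Equation 2.2 is not reproduced in this excerpt; the sketch below assumes the usual definition of cair in the class-attribute interdependence literature, namely the class-attribute mutual information divided by the joint entropy, computed from the quanta matrix q[c][d] (the count of class-c samples falling in interval d). The function name is illustrative.

    #include <cmath>
    #include <vector>

    // cair = I(C; D) / H(C, D), computed from the quanta matrix q.
    double cair(const std::vector<std::vector<int>>& q) {
        size_t C = q.size(), D = q[0].size();
        double n = 0;
        std::vector<double> rowSum(C, 0), colSum(D, 0);
        for (size_t c = 0; c < C; ++c)
            for (size_t d = 0; d < D; ++d) {
                rowSum[c] += q[c][d];
                colSum[d] += q[c][d];
                n += q[c][d];
            }
        double mi = 0, jointH = 0;  // mutual information, joint entropy
        for (size_t c = 0; c < C; ++c)
            for (size_t d = 0; d < D; ++d) {
                if (q[c][d] == 0) continue;
                double p = q[c][d] / n;  // joint probability p(c, d)
                // p * log2( p(c,d) / (p(c) * p(d)) )
                mi += p * std::log2(q[c][d] * n / (rowSum[c] * colSum[d]));
                jointH -= p * std::log2(p);
            }
        return mi / jointH;
    }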

In Figure 5.1, the top line in each diagram is the axis on which we plotted the average ranks of all the methods; a method appearing farther to the right performs better. A method whose rank lies outside the marked interval is significantly different from CACC. The comparison results in Figure 5.1(a) show that, on average, CACC reached the highest cair value among the seven discretization algorithms. This encouraging result demonstrates that the CACC criterion can indeed produce a high-quality discretization scheme. The corresponding Friedman test statistic was 58.714 (p-value < 0.0001), which was larger than the threshold 12.592. From Figure 5.1(a) we can see that the mean cair of CACC was statistically comparable to that of CAIM and significantly better than that of the other five methods. The comparison between CAIM and CACC did not reach a significant difference because all seven algorithms were compared together. If we remove the two unsupervised algorithms from this comparison, we obtain Figure 5.1(b), in which CACC performed significantly better than all four remaining methods. It is also worth noting that although we report only the mean cair here, for every one of the 228 continuous attributes in Table 5.9, the cair value of CACC was equal to or better than that of CAIM.
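The Friedman statistic quoted above follows the classical chi-square form: with N datasets and k algorithms whose mean ranks are R[j], the statistic is 12N/(k(k+1)) · (Σ R[j]² − k(k+1)²/4), compared against the chi-square threshold with k − 1 degrees of freedom (12.592 for k = 7 at α = 0.05). A minimal sketch, with an illustrative function name:

    #include <vector>

    // Classical Friedman chi-square statistic from mean ranks.
    double friedmanChi2(const std::vector<double>& meanRanks, int numDatasets) {
        int k = static_cast<int>(meanRanks.size());
        double sumSq = 0;
        for (double r : meanRanks) sumSq += r * r;
        return 12.0 * numDatasets / (k * (k + 1.0))
               * (sumSq - k * (k + 1.0) * (k + 1.0) / 4.0);
    }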

Regarding the number of discrete intervals, CAIM generated the fewest intervals on average. This result was not surprising, since CAIM usually generates a simple discretization scheme in which the number of intervals is very close to the number of classes. The corresponding Friedman test statistic was 8.192 (p-value = 0.228), which was smaller than the threshold 12.592, meaning there were no significant differences among the numbers of intervals generated by the seven algorithms. However, if we remove from this comparison the two unsupervised algorithms, in which the number of generated intervals is decided in advance, the Friedman test reaches statistical significance and we obtain Figure 5.1(c). From Figure 5.1(c), we can see that the number of intervals generated by CACC was significantly smaller than that of ChiMerge and comparable to that of CAIM, IEM, and Extended Chi2.

Finally, the two unsupervised methods were the fastest, since they do not process any class-related information. The discretization time of CACC was slightly longer than that of CAIM, but the difference did not reach statistical significance. When all seven algorithms were compared, the Holm's post-hoc test in Figure 5.1(d) showed that CACC was significantly faster than Extended Chi2, significantly slower than Equal Width and Equal Frequency, and comparable to CAIM, IEM, and ChiMerge. When we removed the two unsupervised algorithms from this comparison, we obtained a slightly different result, shown in Figure 5.1(e): CACC was significantly faster than both bottom-up approaches, Extended Chi2 and ChiMerge, and comparable to CAIM and IEM. This result corresponds to our earlier discussion that the computational complexity of bottom-up methods is usually worse than that of top-down methods. It is also worth noting that although the Extended Chi2 algorithm achieved a better discretization quality and generated fewer intervals than ChiMerge, it required more execution time because it must check the inconsistency rate of the merged scheme at every step.

Figure 5.1 The comparison of CACC against the other discretization methods with Holm's post-hoc tests (α = 0.05): (a) and (b) cair value; (c) number of intervals; (d) and (e) execution time.

To evaluate the effect of the generated discretization schemes on the performance of a classification algorithm, we used the discretized datasets to train C5.0. The testing datasets were then used to measure the accuracy, the number of rules, and the execution time. As before, the Friedman test and Holm's post-hoc tests with significance level α = 0.05 were used to check whether these comparisons reached significant differences.
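Holm's procedure, used throughout these comparisons against the control method CACC, is a step-down multiple-comparison test: the p-values of the k − 1 pairwise comparisons are sorted in ascending order, and the i-th smallest is tested at the adjusted level α/(m − i) for m hypotheses, stopping at the first failure. A minimal sketch of the rejection step, assuming the pairwise p-values have already been computed from the standard rank-based statistic (the function name is illustrative):

    #include <algorithm>
    #include <vector>

    // Returns, for each hypothesis, whether Holm's step-down procedure
    // rejects it at family-wise significance level alpha.
    std::vector<bool> holmReject(const std::vector<double>& pValues,
                                 double alpha) {
        size_t m = pValues.size();
        std::vector<size_t> order(m);
        for (size_t i = 0; i < m; ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return pValues[a] < pValues[b]; });
        std::vector<bool> reject(m, false);
        for (size_t i = 0; i < m; ++i) {
            // The i-th smallest p-value is tested at alpha / (m - i).
            if (pValues[order[i]] <= alpha / (m - i))
                reject[order[i]] = true;
            else
                break;  // once a test fails, all remaining are retained
        }
        return reject;
    }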

The visualizations of the Holm's post-hoc tests are illustrated in Figure 5.2. The comparison results in Figure 5.2(a) show that, on average, CACC reached the highest accuracy among the seven discretization algorithms. This encouraging result demonstrates that the discretization schemes generated by CACC can indeed improve classification accuracy. In Figure 5.2(a) we can see that the accuracy of CACC was significantly better than that of Equal Width, Equal Frequency, and ChiMerge, and comparable to that of CAIM, IEM, and Extended Chi2. However, when we removed the two unsupervised methods and the two bottom-up methods from this comparison, we obtained a slightly different result. The mean ranks of CACC, CAIM, and IEM were 1.2, 2.3, and 2.5, respectively. The Friedman test and Holm's post-hoc tests in Figure 5.2(b) showed that, among the three top-down approaches, the accuracy of CACC was significantly better than that of CAIM and IEM.

Regarding the number of rules generated by C5.0, CAIM performed best and CACC ranked second. The Friedman test and Holm's post-hoc tests in Figure 5.2(c) showed that C5.0 produced significantly more rules when it used the discretization schemes of ChiMerge, Equal Width, and Equal Frequency, and statistically comparable numbers of rules when it used the schemes of CACC, CAIM, IEM, and Extended Chi2. When we compared only the three top-down approaches, the Holm's post-hoc tests again showed no significant differences among them, as shown in Figure 5.2(d). Note that, as stated earlier, a discretization scheme with fewer intervals does not necessarily result in a simpler decision tree; on the contrary, it might even increase the number of rules produced. This inference is borne out in this experiment: for example, CACC generated more intervals than CAIM but resulted in fewer rules on the datasets thy, wav, and hea.

Finally, as illustrated in Figure 5.2(e), when C5.0 used the training data discretized by CACC, CAIM, IEM, and Extended Chi2, the training times were statistically comparable. C5.0 required significantly more training time when the training data were discretized by ChiMerge, Equal Width, or Equal Frequency. When we compared only the three top-down approaches, the Holm's post-hoc tests likewise showed no significant differences among CACC, CAIM, and IEM.

Figure 5.2 The comparison of C5.0 performance on CACC against C5.0 performance on the other discretization methods with Holm's post-hoc tests (α = 0.05): (a) and (b) accuracy; (c) and (d) number of rules; (e) and (f) execution time.