
CHAPTER 5 EXPERIMENT AND RESULTS

5.3 Results

We design several experiments to examine ROUSER’s performance in different situations. We use 10-fold cross-validation to evaluate the classification performance.
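The evaluation protocol can be illustrated with a minimal sketch of 10-fold cross-validation. It is not the code used in the experiments: the helper names (`k_fold_indices`, `cross_validate`, `majority_trainer`) are hypothetical, the folds are not stratified, and the majority-class learner is only a stand-in for ROUSER, JRip, or J48.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, labels, train_fn, k=10, seed=0):
    """Return the accuracy (in percent) pooled over all k held-out folds."""
    folds = k_fold_indices(len(rows), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        # Train on the other k-1 folds, then predict the held-out fold.
        predict = train_fn([rows[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct += sum(predict(rows[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / len(rows)

def majority_trainer(train_rows, train_labels):
    """Hypothetical stand-in learner: always predicts the most frequent training class."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common
```

Every instance is used for testing exactly once, so the returned percentage is computed over the whole data set.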

The results for the originally nominal data sets are summarized in Table XII.

The numbers reported in Table XII are accuracy rates in percent; the maximum values are in bold and the minimum values are underlined. As mentioned in Section 3.2, ROUSER has seven choices for searching for the attribute-value pair used to grow a rule, and the results of all seven choices are shown in Table XII. Four out of nine accuracy results of ROUSER_6 are better than or equal to those of both J48 and JRip. On two data sets, the ROUSER versions are outperformed by JRip and J48. ROUSER_1 and ROUSER_6 are the most stable of the seven versions, and their accuracy rates are comparable to those of J48 and JRip.

However, ROUSER does not perform well on the car and splice data sets. We think the cause is overfitting: unlike RIPPER, ROUSER has no optimization stage and no pruning methods. The car data set has a hierarchical structure that is easily captured by a tree, and we think this is why J48 outperforms JRip and ROUSER on it. The embedded feature selection method of ROUSER performs well on the splice data set (as shown in the experimental results later), but ROUSER itself does not. We think this might also be an overfitting problem. A deeper investigation will be part of future work.

We design an experiment to examine ROUSER’s capability to handle missing values. We choose three data sets: Kr-vs-kp, Nursery, and Tic-tac-toe, to produce artificial data sets with missing values. The missing values are distributed randomly in each attribute with the same percentage (10%, 20%, 30%), while the distributions of missing values are different between attributes. The class attribute has no missing values, and besides the class attribute, no

contradictions in each data set is shown in Table XIII.
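The construction of the artificial data sets with missing values can be sketched as follows. This is only an illustration under stated assumptions: the function name `inject_missing`, the `"?"` marker, and the use of a fixed random seed are not from the thesis, and any further constraints on which cells may be blanked are not modeled.

```python
import random

MISSING = "?"  # assumed marker for a missing value

def inject_missing(rows, class_index, rate, seed=0):
    """Return a copy of rows in which each non-class attribute has the
    fraction `rate` of its values replaced by MISSING, at positions
    drawn independently for each attribute."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    n = len(out)
    n_missing = round(rate * n)
    for col in range(len(out[0])):
        if col == class_index:
            continue  # the class attribute keeps all of its values
        for i in rng.sample(range(n), n_missing):
            out[i][col] = MISSING
    return out
```

Calling the function with ten different seeds for a given rate would give ten artificial data sets with the same percentage of missing values per attribute but different missing positions.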

Table XII. Results for original nominal data sets.

Data sets

Table XIII. Number of contradictions in artificial missing values in data sets.

Data sets total number of contradictions type1 type2 type3 type4

The results are shown in Table XIV. The numbers reported in Table XIV are accuracy rates in percent. Except for the original (0%) data sets, each accuracy rate is the average over 10 different artificial data sets with the same rate of missing values.

The results indicate that the performances of ROUSER are similar to JRip in Kr-vs-kp

Nursery data set when missing value percentage rises. We speculate that the Nursery data set with missing values has too many type 3 contradictions, which are ignored by ROUSER as mentioned in Section 3.2, or too many type 1 contradictions, which make ROUSER miscalculate the potential boundary region. To address the problem, we may adopt probability theory and assign contradicted instances to the class with the higher probability.
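The proposed remedy of assigning contradicted instances to the more probable class could look like the sketch below. It is an assumption-laden illustration, not part of ROUSER: the probability is estimated by the class frequency within each group of instances that agree on all attribute values, and the contradiction types involving missing values are not distinguished.

```python
from collections import Counter, defaultdict

def resolve_contradictions(rows, labels):
    """Relabel every group of indiscernible instances with its most frequent class."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row)].append(i)
    resolved = list(labels)
    for members in groups.values():
        counts = Counter(labels[i] for i in members)
        if len(counts) > 1:  # the group contains a contradiction
            winner = counts.most_common(1)[0][0]
            for i in members:
                resolved[i] = winner
    return resolved
```

With this preprocessing the contradicted instances stay in the training data instead of being discarded.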

Table XIV. Results for artificial missing values in data sets.

Data sets Accuracy (%)

The experimental results for the ordered rule strategy are given in Table XV. Each result is presented as two numbers: the upper number is the original accuracy with rules in ascending order, and the lower number is the difference after we switch to rules in descending order. Ascending order is clearly better than descending order only on the Audiology.standardized data set, which has 24 class labels with an imbalanced distribution. However, descending order is better on the Car data set, where the accuracy becomes comparable with that of JRip and J48. Descending order is also better than ascending order on the Splice data set, but the accuracy is still not comparable with that of JRip and J48. We observe that imbalanced multi-class data sets are sensitive to the ordered rule strategy. A deeper investigation will be part of future work.

Table XV. Results for ordered rule strategy.

Data sets Accuracy (%)

We design an experiment to show that the Chi-Square value is useful in the rule growing phase of ROUSER. In our original design, we adopt the Chi-Square value to reduce the attributes iteratively. For contrast, we replace the Chi-Square value with Information Gain, which is provided by Weka, and the results are given in Table XVI. The numbers reported in Table XVI are accuracy rates in percent. Each result is presented as two numbers: the upper number is the original accuracy with the Chi-Square value, and the lower number is the difference after the replacement. The Chi-Square version performs clearly better than the Information Gain version on the Audiology.standardized data set. On the Promoters data set, the results that were originally poor improve after the replacement, but the results that were originally good become worse. Our conclusion is that ROUSER_1 and ROUSER_6 with Chi-Square feature selection are still more stable than all the other combinations.
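For reference, the Chi-Square value of a nominal attribute with respect to the class can be computed as in the sketch below. This is the standard Pearson statistic on the attribute-by-class contingency table; how ROUSER evaluates it for a single class during rule growing is not specified here, so the function is an illustration rather than the exact computation used.

```python
from collections import Counter

def chi_square(values, labels):
    """Pearson chi-square statistic of the contingency table of values vs. labels."""
    n = len(values)
    value_counts = Counter(values)
    label_counts = Counter(labels)
    observed = Counter(zip(values, labels))
    stat = 0.0
    for v, nv in value_counts.items():
        for c, nc in label_counts.items():
            # Expected count if the attribute and the class were independent.
            expected = nv * nc / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat
```

An attribute that is independent of the class gets a value of 0, and the value grows as the attribute becomes more strongly associated with the class.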

Table XVI. Results of replacing Chi-Square value with Information Gain in ROUSER.

Data sets Accuracy (%) with Information Gain feature selection

1 2 3 4 5 6 7

The experimental results for the discretized data sets are summarized in Table XVII. The numbers reported in Table XVII are accuracy rates in percent. ROUSER does not perform as well on the discretized data sets as on the originally nominal data sets. We speculate that the reason is that discretization may assign the same value to different real numbers, which may make instances indiscernible so that ROUSER treats them as contradictions. As mentioned before, ROUSER simply discards contradictions, and hence it performs poorly on these discretized data sets.

Similarly, we can adopt probability theory and assign contradicted instances to the class with the higher probability. We can also design an embedded discretization method for ROUSER, like those in JRip and J48, to handle real-valued data directly. From Table XI we find that there are many contradictions in the Abalone_dis, Adult_dis, and Australian_dis data sets, and hence ROUSER does not perform as well as JRip and J48 on these data sets.
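The way discretization can introduce contradictions is illustrated by the following small sketch. Equal-width binning is an assumption made only for the example (the discretization method actually used is not stated in this section), and both function names are hypothetical.

```python
def equal_width_bin(x, low, high, bins):
    """Map a real number in [low, high] to a bin index 0..bins-1."""
    if x >= high:
        return bins - 1
    return int((x - low) / (high - low) * bins)

def count_contradictory_instances(rows, labels):
    """Count the instances that share all attribute values with an instance of another class."""
    classes_seen = {}
    for row, label in zip(rows, labels):
        classes_seen.setdefault(tuple(row), set()).add(label)
    return sum(1 for row in rows if len(classes_seen[tuple(row)]) > 1)

# Two distinct real values with different classes fall into the same bin.
raw = [(0.12,), (0.18,), (0.55,), (0.91,)]
labels = ["a", "b", "a", "b"]
discretized = [(equal_width_bin(x, 0.0, 1.0, 4),) for (x,) in raw]
```

In this toy data the raw values contain no contradiction, while after binning the first two instances become indiscernible and carry different classes.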

Table XVII. Results for discretized data sets.

Data sets

The execution times in milliseconds of ROUSER, JRip and J48 are shown in Table XVIII. We measure the training time on the entire data set rather than under 10-fold cross-validation. The tree-based strategy is clearly faster than the separate-and-conquer strategy, and the reason is simple. The tree-based strategy ignores the data that have been divided away. The separate-and-conquer strategy “separates” the positive data in each iteration of building a rule, but it needs all the negative data to stay in memory to do so, and hence the same negative data are processed several times.

There are two more reasons for ROUSER’s high execution time. First, DiscPow itself is not so “greedy”. To explain this, we compare it with Information Gain, which is adopted by C4.5 and RIPPER. When calculating the Information Gain for choosing an attribute, only the attribute itself and the class attribute are involved in the calculation. However, when calculating the DiscPow of an attribute, the whole decision table is involved, since we need to compare all the values between each pair of records. Second, ROUSER has no pruning methods and may build precise rules that explain only a few data records, and hence too many rules are built and time is wasted.

discovered by examining the rule set size, as shown in Table XIX, where each rule set is built

calculated several times when generating a rule, the rule set size dominates the execution time.
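The cost argument about pairwise comparison can be made concrete with the sketch below. It does not reproduce the DiscPow definition from the thesis; `sole_discerner_count` is a hypothetical stand-in that counts the pairs of records from different classes that differ only in the given attribute. It is used here solely to show why a measure of this kind has to visit every pair of records and every attribute, whereas Information Gain needs one pass over a single attribute and the class.

```python
from itertools import combinations

def sole_discerner_count(rows, labels, attr):
    """Hypothetical pairwise measure: number of record pairs with different
    classes whose only differing attribute is `attr`."""
    count = 0
    # All pairs of records are visited: quadratic in the number of records.
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] == labels[j]:
            continue
        # Every attribute of the pair is compared, not just `attr`.
        differing = [a for a in range(len(rows[i])) if rows[i][a] != rows[j][a]]
        if differing == [attr]:
            count += 1
    return count
```

Under this stand-in definition, a value of 0 means that no pair of records relies on the attribute alone to be told apart.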

We also find that JRip’s rule set is usually smaller than ROUSER’s, because JRip adopts pruning and optimization methods that reduce the rule set size. Some rule sets are extremely large while their accuracy is low, which can be considered an overfitting problem, and we think this might be why ROUSER does not perform well on the Car and Splice data sets.

Table XIX. Rule set size.

We design an experiment to show that the feature selection method embedded in ROUSER is useful. The feature selection method is in the grow function of ROUSER, which iteratively ignores an attribute with DiscPow=0 and the lowest Chi-Square value. This method selects attributes for one class, and hence we apply it to every class and take the union of the results as the final result. We name it the DiscPow_Chi method for convenience.

DiscPow_Chi is a deterministic feature selection method that returns a fixed number of selected attributes and needs no additional threshold settings, while the Information Gain method simply returns a ranking of all attributes and requires an appropriate threshold. Thus we first choose, for comparison, the CfsSubsetEval method together with the BestFirst search method provided by Weka, which is also a deterministic feature selection method. We compare the accuracy of JRip and J48 on the original data sets and on the feature-selected data sets. The results of the experiment are shown in Table XX. On the car and nursery data sets, CfsSubsetEval chooses only 1 attribute, which is obviously not able to represent the original data sets.

On the splice data set, DiscPow_Chi selects half as many attributes as CfsSubsetEval, while the accuracy rates of both JRip and J48 are nearly the same. On the house-votes-84 data set, CfsSubsetEval outperforms DiscPow_Chi by choosing fewer attributes while keeping a high accuracy rate, and on the promoters data set CfsSubsetEval outperforms DiscPow_Chi with a higher accuracy rate. Both DiscPow_Chi and CfsSubsetEval fail on the tic-tac-toe data set, but the problem of CfsSubsetEval is far more serious. To sum up, DiscPow_Chi is more likely than CfsSubsetEval to avoid accuracy loss.

We also make a comparison with the Information Gain feature selection provided by Weka, which ranks the attributes from high to low. We choose the highest-ranked attributes, as many as DiscPow_Chi chose. The results are shown in Table XXI. On the Agaricus-lepiota, Audiology.standardized and Kr-vs-kp data sets, DiscPow_Chi performs better than Information Gain feature selection, while Information Gain feature selection performs better on the Promoters data set. The other results are similar. The accuracy results show that DiscPow_Chi is no worse than Information Gain feature selection. It also has the advantage of being deterministic, which saves the work of determining the number of selected attributes. The ideas behind the two methods are very different: DiscPow_Chi iteratively removes the attributes that we do not need, while Information Gain feature selection selects what we want.
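The ranking-based baseline can be sketched as follows: compute the information gain of every attribute, sort, and keep the top k, where k would be set to the number of attributes chosen by DiscPow_Chi. The experiments use Weka's implementation; this standalone version with hypothetical function names only illustrates the idea for nominal attributes without missing values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the class minus its expected entropy after splitting on the attribute."""
    n = len(labels)
    by_value = {}
    for v, c in zip(values, labels):
        by_value.setdefault(v, []).append(c)
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

def top_k_by_information_gain(rows, labels, k):
    """Return the indices of the k attributes with the highest information gain."""
    n_attrs = len(rows[0])
    gains = [information_gain([r[a] for r in rows], labels) for a in range(n_attrs)]
    ranked = sorted(range(n_attrs), key=lambda a: gains[a], reverse=True)
    return ranked[:k]
```

The parameter k is exactly the threshold that a ranking method leaves to the user and that a deterministic method such as DiscPow_Chi does not require.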

Table XX. The comparison of DiscPow_Chi and CfsSubsetEval.

Data name Feature selection method Number of attributes selected

Table XXI. The comparison of DiscPow_Chi and Information Gain feature selection.

Data sets                       DiscPow_Chi  InfoGain  DiscPow_Chi  InfoGain
Agaricus-lepiota        5/22    100.0        99.9      100.0        99.9
Audiology.standardized  13/69   69.0         66.5      76.0         70.5
Car                     6/6     88.3         88.3      92.7         92.7

contradictions which are originally in the data, since ROUSER simply ignores contradictions.

The embedded feature selection method is deterministic and more likely to avoid accuracy loss. ROUSER has some good properties, and how to keep these properties while avoiding the shortcomings will be the focus of future work.
