CHAPTER 3 DESIGN OF THE PROPOSED METHOD
than one element from 𝑈 can be considered as iteratively removing one element at a time, and inserting more than one element into 𝑈 can be considered as iteratively inserting one element at a time. It follows that removing elements from 𝑈 causes 𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) to either hold or drop, and inserting elements into 𝑈 causes 𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) to either hold or rise. This is the monotonicity property of DiscPow.
Discernibility Power is one of the search heuristics of the proposed rule-based algorithm, ROUSER, which is introduced in the next subsection.
3.2 ROUSER
ROUSER uses the separate-and-conquer algorithm as its framework. Our contribution here is to use the proposed DiscPow as the search heuristic in the GROW function of the separate-and-conquer algorithm. The GROW function of ROUSER is shown in Figure 14. In each iteration, ROUSER removes an attribute whose DiscPow value is zero and then updates the DiscPow of every remaining attribute, repeating until no remaining attribute has a DiscPow value of zero. If multiple attributes qualify for removal, the current version of ROUSER simply removes the one that is most independent of the class passed as a parameter to the separate-and-conquer algorithm in Figure 2. We use the Chi-Squared value to measure the degree of independence. The Chi-Squared value was first used in feature selection in [9]. Feature selection combining the Chi-Squared test with rough set theory was proposed in [19].
Figure 14. The GROW function.
Once an attribute is removed in an iteration of the GROW function, we no longer need to compute its DiscPow value, because of the monotonicity property of DiscPow. When elements are removed from the rough set covered by the current rule, the DiscPow value of an attribute either stays the same or decreases. Once the DiscPow value of an attribute reaches zero, it can never increase again, and hence the attribute can be removed. The DISCPOW function is shown in Figure 15.
DISCPOW(ai, Covered):
    return the cardinality of PotBound(ai)
Figure 15. The DISCPOW function.

Pseudocode of the GROW function (Figure 14):
GROW(Rule, Covered):
    do:
        for every attribute ai:
            DiscPowi = DISCPOW(ai, Covered)
            ChiSquaredi = CHISQUARED(ai, Covered)
        among attributes with DiscPowi = 0, ignore the ai with minimum ChiSquaredi
    while there exists an ai with DiscPowi = 0
    (a, v) = CHOOSE_ATTR&VALUE()
    grow the rule with (a, v) as an antecedent
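The attribute-elimination loop of GROW can be sketched as follows. This is only an illustration under stated assumptions: the names `prune_zero_discpow`, `disc_pow`, and `chi_squared` are hypothetical, and both heuristics are passed in as callables because their definitions (PotBound and the Chi-Squared statistic) are given elsewhere. We also assume that DiscPow is re-evaluated against the attributes that are still active.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def prune_zero_discpow(
    attributes: Sequence[str],
    disc_pow: Callable[[str, List[str]], int],   # hypothetical: DiscPow of an attribute
    chi_squared: Callable[[str], float],         # hypothetical: Chi-Squared vs. the class
) -> Tuple[List[str], List[str]]:
    """Return (kept, removed) after repeatedly dropping zero-DiscPow attributes."""
    active = list(attributes)
    removed: List[str] = []
    while True:
        # Recompute DiscPow for every attribute that is still active.
        scores: Dict[str, int] = {a: disc_pow(a, active) for a in active}
        zero = [a for a in active if scores[a] == 0]
        if not zero:
            return active, removed
        # Among zero-DiscPow attributes, drop the one with minimum Chi-Squared,
        # i.e. the one most independent of the class.
        drop = min(zero, key=chi_squared)
        active.remove(drop)
        removed.append(drop)
```

Only one attribute is dropped per pass, so the remaining DiscPow values are refreshed before the next removal, which mirrors the do-while structure of the pseudocode.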
The CHOOSE_ATTR&VALUE function in the GROW function searches for an attribute-value pair, i.e. (ai, vi), that will be used to grow a rule. We use the idea of purity [9][20] as the search heuristic. Our algorithm provides three types of purity as options: PurityOverAll, PurityPotBound, and PurityHybrid. The first is the same as the original definition of purity, and the other two are proposed by us. The definitions of these purities are given below:
PurityOverAll = |pall| / (|pall| + |nall|),
where pall is the set of positive records covered by the candidate attribute-value pair, and nall is the set of negative records covered by it;
PurityPotBound = |ppb| / (|ppb| + |npb|),
where ppb is the set of positive records that lie in the potential boundary region of the candidate attribute and are covered by the candidate attribute-value pair, and npb is the corresponding set of negative records;
PurityHybrid = |ppb| / (|ppb| + |nall|),
where ppb and nall are defined as above.
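The three purity measures can be computed side by side, as in the minimal sketch below. It assumes that records are dictionaries of nominal values and that the potential boundary region of the candidate attribute is already available as a set of record indices (`pot_bound`); the function name `purities` and this representation are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Sequence, Set


def _ratio(numerator: int, denominator: int) -> float:
    # Convention assumed here: an empty denominator yields purity 0.
    return numerator / denominator if denominator else 0.0


def purities(
    records: Sequence[Dict[str, str]],
    labels: Sequence[str],
    attr: str,
    value: str,
    target: str,
    pot_bound: Set[int],
) -> Dict[str, float]:
    """Compute PurityOverAll, PurityPotBound and PurityHybrid for (attr, value)."""
    covered: List[int] = [i for i, r in enumerate(records) if r[attr] == value]
    p_all = [i for i in covered if labels[i] == target]
    n_all = [i for i in covered if labels[i] != target]
    # Restrict to records inside the potential boundary region of attr.
    p_pb = [i for i in p_all if i in pot_bound]
    n_pb = [i for i in n_all if i in pot_bound]
    return {
        "PurityOverAll": _ratio(len(p_all), len(p_all) + len(n_all)),
        "PurityPotBound": _ratio(len(p_pb), len(p_pb) + len(n_pb)),
        "PurityHybrid": _ratio(len(p_pb), len(p_pb) + len(n_all)),
    }
```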
In addition to purity, we provide weighted Information Gain as an optional search heuristic, defined as:
WInfoGain = (|p2all|/|p1all|) * ( log(|p2all|/(|p2all|+|n2all|)) - log(|p1all|/(|p1all|+|n1all|)) ),
where p1all and n1all are the sets of positive and negative records, respectively, in the original set of data records, and p2all and n2all are the sets of positive and negative records, respectively, in the chosen subset of data records. The term “log(|p1all|/(|p1all|+|n1all|))” is the information content of the original set of data records, while “log(|p2all|/(|p2all|+|n2all|))” is the information content of the chosen subset. The factor “(|p2all|/|p1all|)” is the weight of the Information Gain.
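As a worked illustration of the formula, the sketch below evaluates WInfoGain from the four record counts. The logarithm base is not stated in the text, so base 2 is an assumption here, as is the convention of returning 0 when either positive count is zero (where the logarithm would be undefined). The function name `w_info_gain` is hypothetical.

```python
import math


def w_info_gain(p1: int, n1: int, p2: int, n2: int) -> float:
    """Weighted information gain from counts.

    p1, n1: positive / negative counts in the original set.
    p2, n2: positive / negative counts in the chosen subset.
    """
    if p1 == 0 or p2 == 0:
        # Assumed convention: no positives means no gain.
        return 0.0
    before = math.log2(p1 / (p1 + n1))  # information content of the original set
    after = math.log2(p2 / (p2 + n2))   # information content of the chosen subset
    return (p2 / p1) * (after - before)
```

For example, with 4 positive and 4 negative records originally and a subset holding 2 positive and 0 negative records, the weight is 2/4 and the bracketed difference is log2(1) - log2(0.5) = 1, so the gain is 0.5.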
We also provide two selection methods. The first, “Max”, finds the attribute-value pair with the maximum heuristic value (e.g., purity) among all possible attribute-value pairs. The second, “Frequent Max”, first finds the most frequent value of each attribute and then selects the pair with the maximum heuristic value (e.g., purity) among them.
In total, our CHOOSE_ATTR&VALUE function can choose an attribute-value pair in seven different ways:
1. PurityOverAll, Max
2. PurityPotBound, Max
3. PurityHybrid, Max
4. PurityOverAll, Frequent Max
5. PurityPotBound, Frequent Max
6. PurityHybrid, Frequent Max
7. WInfoGain, Max
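The difference between “Max” and “Frequent Max” lies only in which candidate pairs are scored. The sketch below makes this explicit; the heuristic is passed in as a hypothetical `score(attribute, value)` callable, and the tie-breaking behavior (first candidate in sorted or attribute order) is an assumption, since the text does not specify one.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def choose_max(
    records: Sequence[Dict[str, str]],
    attributes: Sequence[str],
    score: Callable[[str, str], float],
) -> Pair:
    """'Max': score every attribute-value pair that occurs in the records."""
    candidates = sorted({(a, r[a]) for r in records for a in attributes})
    return max(candidates, key=lambda av: score(*av))


def choose_frequent_max(
    records: Sequence[Dict[str, str]],
    attributes: Sequence[str],
    score: Callable[[str, str], float],
) -> Pair:
    """'Frequent Max': score only the most frequent value of each attribute."""
    candidates: List[Pair] = []
    for a in attributes:
        value, _count = Counter(r[a] for r in records).most_common(1)[0]
        candidates.append((a, value))
    return max(candidates, key=lambda av: score(*av))
```

“Frequent Max” evaluates one candidate per attribute, so it scores far fewer pairs than “Max” when attributes have many values.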
ROUSER generates a set of rules for each class. As soon as a rule set is generated, it is appended to the bottom of the rule list. The BUILD_CLASSIFIER algorithm of ROUSER is shown in Figure 16. The class list is sorted in ascending order of class frequency, as in RIPPER. For an unseen case, ROUSER searches down the rule list and uses the first rule that covers the case to classify it.
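Classification with an ordered rule list can be sketched as a first-match search. In this illustration a rule is a hypothetical pair of an antecedent list and a class label, and the `default` returned when no rule covers the case is an assumption, since the text does not state what happens in that situation.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Rule = Tuple[List[Tuple[str, str]], str]  # (antecedents, class label)


def classify(
    rule_list: Sequence[Rule],
    record: Dict[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the label of the first rule whose antecedents all hold."""
    for antecedents, label in rule_list:
        if all(record.get(attr) == value for attr, value in antecedents):
            return label
    # No rule covers the record (fallback behavior is assumed).
    return default
```

Because the search stops at the first covering rule, the order in which rule sets are appended to the rule list directly affects the prediction.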
ROUSER has to decide if two records are indiscernible to determine the boundary and potential boundary regions. Consider the examples in Figure 17, where there are two records j and k. If we want to know if record j and record k are indiscernible, we have to check every attribute’s value. If each attribute has the same value in record j and k, we say that the two records are indiscernible.
Figure 17. The example for checking if two records are indiscernible.
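For records without missing values, the check described above reduces to an attribute-by-attribute comparison. The sketch below covers only this simple case; it does not model any of the four indiscernibility types for missing values, and the function name `indiscernible` is hypothetical.

```python
from typing import Dict, Sequence


def indiscernible(
    rec_j: Dict[str, str],
    rec_k: Dict[str, str],
    attributes: Sequence[str],
) -> bool:
    """True if the two records agree on every condition attribute."""
    return all(rec_j[a] == rec_k[a] for a in attributes)
```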
It is a simple task to decide if two records are indiscernible or not. However, missing values make the task complicated. We define four types of indiscernibility between two values, as shown in Table III, Table IV, Table V, and Table VI. These tables show how we treat a missing value for an attribute when we try to check if two records are indiscernible. From type 1 to type 4, the determination of indiscernibility becomes stricter. Currently, ROUSER uses type 3 to find boundary region, and it uses type 1 to find potential boundary region. Part of our study in the future is to consider other types of indiscernibility.
BUILD_CLASSIFIER( ):
    build a ClassList by ascending frequency order
    for each Class in ClassList:
        RuleSet = SEPARATE&CONQUER(Class, TrainData)
        concatenate the RuleSet to the bottom of the RuleList
    return RuleList
Figure 16. The BUILD_CLASSIFIER function.
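The control flow of BUILD_CLASSIFIER can be sketched as below. The per-class rule learner is passed in as a hypothetical callable `learn_rules_for_class`, standing in for the separate-and-conquer step; breaking frequency ties by class name is an assumption made only to keep the order deterministic.

```python
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

R = TypeVar("R")


def build_classifier(
    labels: Sequence[str],
    learn_rules_for_class: Callable[[str], List[R]],
) -> List[R]:
    """Learn one rule set per class, rarest class first, and concatenate them."""
    counts = Counter(labels)
    # Ascending class frequency; ties broken by class name (assumed).
    class_list = sorted(counts, key=lambda c: (counts[c], c))
    rule_list: List[R] = []
    for cls in class_list:
        rule_list.extend(learn_rules_for_class(cls))
    return rule_list
```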
Records with the same conditions but different decisions are considered contradictions. Based on the four types of indiscernibility, there are four types of contradictions. ROUSER simply ignores the type 3 contradictions in the training data.
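For data without missing values, contradictions can be found by grouping records on their condition values and flagging groups that carry more than one decision. The sketch below illustrates only this simple case and does not distinguish the four contradiction types; the function name `find_contradictions` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def find_contradictions(
    records: Sequence[Dict[str, str]],
    labels: Sequence[str],
    attributes: Sequence[str],
) -> List[int]:
    """Return sorted indices of records that share conditions but differ in class."""
    groups: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for i, rec in enumerate(records):
        groups[tuple(rec[a] for a in attributes)].append(i)
    contradicted: List[int] = []
    for indices in groups.values():
        # A group is contradictory if its records carry more than one decision.
        if len({labels[i] for i in indices}) > 1:
            contradicted.extend(indices)
    return sorted(contradicted)
```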
CHAPTER 4
IMPLEMENTATION OF THE PROPOSED METHOD
4.1 WEKA
Weka [8] is open-source data mining software that provides free Java code for machine learning tasks. Weka is developed and maintained by the University of Waikato in New Zealand. We use Weka 3.6.5 as our development environment.
4.1.1 Import Data
Weka accepts several data formats, including the simple Comma-Separated Values (CSV) format and the Attribute-Relation File Format (ARFF). After data is imported, it is stored in Weka-defined data structures. Each data record is stored in an Instance object, and the whole data set is stored in an Instances object, which contains multiple Instance objects. An Attribute object holds the details of an attribute, such as whether its data type is nominal or real-valued and how many values it has. An Instances object also contains multiple Attribute objects.
4.1.2 Classifier
To develop a classifier in the Weka environment, the abstract class weka.classifiers.Classifier must be extended. Its abstract method buildClassifier() must then be implemented; this method is called every time a model is built from the training data. After the model is built, one of two methods is called to classify testing data: classifyInstance() or distributionForInstance(). Both use the model built by buildClassifier() to generate the classification result for a single data record. The difference between them is that the former returns exactly one class label as the prediction, while the latter returns an array of probabilities over the class labels.
4.1.3 Cross-Validation
Weka offers several evaluation methods, and they are easy to implement. Here we introduce how to realize a cross-validation method. First an evaluator must be built by invoking weka.classifiers.Evaluation(), and then we choose the provided method crossValidateModel().
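The idea behind k-fold cross-validation can be sketched independently of Weka. The following is a generic illustration, not a reproduction of how crossValidateModel() constructs its folds; the names `k_fold_indices`, `cross_validate`, and `train_and_score` are hypothetical.

```python
import random
from typing import Callable, List


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(
    n: int,
    k: int,
    train_and_score: Callable[[List[int], List[int]], float],
    seed: int = 0,
) -> float:
    """Average the score over k train/test splits."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```

Each record is used for testing exactly once, and the reported figure is the mean score over the k folds.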
4.2 Data Structure
The data structures used in the implementation of ROUSER are partially adapted from JRip, which is provided by Weka.
4.2.1 Rough Set
To implement rough sets intuitively, we build a data structure for a rough set, as shown in Figure 18. A data set is partitioned into three blocks, namely positive, boundary, and negative, following the definition of a rough set in Section 3.4, and each block is an Instances object as described in Section 4.1.1. In the figure, blocks filled in black are empty, while white blocks are not. The structure also stores information needed later, such as DiscPow, the Chi-Squared value, and purity, for example when choosing the best attribute and value to build a rule. Weka does not provide such a data structure.
Figure 18. The data structure of a rough set.
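A three-block structure of this kind can be sketched as follows. The sketch assumes the standard rough set construction without missing values: records are grouped by their condition values, and a group goes to the positive block if all of its records belong to the target class, to the negative block if none do, and to the boundary block otherwise. The names `RoughSet` and `build_rough_set` are hypothetical, and the blocks hold record indices rather than Weka Instances objects.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class RoughSet:
    positive: List[int] = field(default_factory=list)
    boundary: List[int] = field(default_factory=list)
    negative: List[int] = field(default_factory=list)


def build_rough_set(
    records: Sequence[Dict[str, str]],
    labels: Sequence[str],
    attributes: Sequence[str],
    target: str,
) -> RoughSet:
    """Partition record indices into positive, boundary and negative blocks."""
    groups: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for i, rec in enumerate(records):
        groups[tuple(rec[a] for a in attributes)].append(i)
    rough = RoughSet()
    for indices in groups.values():
        hits = sum(labels[i] == target for i in indices)
        if hits == len(indices):
            rough.positive.extend(indices)   # certainly in the target class
        elif hits == 0:
            rough.negative.extend(indices)   # certainly outside the target class
        else:
            rough.boundary.extend(indices)   # indiscernible records disagree
    return rough
```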
4.2.2 Decision Rule
As mentioned in Section 2.1.1, a decision rule is a logic statement with the following form:
condition1 ∧ condition2 ∧…→class,
hence there can be multiple antecedents. We define a data structure named RAntd to store each condition. It also holds some necessary information, such as the DiscPow, the number of instances covered by the rule so far (from the first condition to this condition), and the number of instances explained by the rule so far, as shown in Figure 19.
This data structure is adapted from JRip in Weka; however, some of the information stored in it is different.
Figure 19. The data structure of the antecedent of a rule: RAntd.
Another structure adapted from JRip in Weka is RouserRule, which stores a rule, as shown in Figure 20. It contains two parts: the queue of antecedents and a class label. When a rule is grown, RAntd objects are generated one after another and stored in the queue in order.
Figure 20. The data structure of a rule: RouserRule.
After a rule is generated, it is stored in the rule set in the order in which it was grown, as shown in Figure 21. The rule set is a queue, which is also adapted from JRip in Weka.
Figure 21. A data structure of the rule set: m_Ruleset.
The whole data structure of the rule model built by ROUSER is shown in Figure 22:
Figure 22. The data structure of ROUSER’s rule model.
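The nesting of these structures (antecedents inside a rule, rules inside a rule set) can be sketched with plain data classes. The class names follow the text, but the field names and the `covers` helpers are hypothetical, and the actual implementation is in Java on top of Weka rather than in this form.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RAntd:
    """One condition (attribute = value) plus the bookkeeping described above."""
    attribute: str
    value: str
    disc_pow: int = 0    # DiscPow recorded for this antecedent
    covered: int = 0     # instances covered by the rule up to this condition
    explained: int = 0   # instances explained by the rule up to this condition

    def covers(self, record: Dict[str, str]) -> bool:
        return record.get(self.attribute) == self.value


@dataclass
class RouserRule:
    """A rule: an ordered queue of antecedents and a class label."""
    class_label: str
    antecedents: List[RAntd] = field(default_factory=list)

    def covers(self, record: Dict[str, str]) -> bool:
        return all(antd.covers(record) for antd in self.antecedents)


# The rule set is an ordered list of rules, kept in growing order.
RuleSet = List[RouserRule]
```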
4.3 ROUSER
Following the separate-and-conquer algorithm, ROUSER is implemented in the Weka environment. As mentioned in Section 4.1.2, the BuildClassifier() function must be implemented.
4.3.1 BuildClassifier( )
In the BuildClassifier() function shown in Figure 23, oneClassRule() is an implementation of the separate-and-conquer algorithm that builds rules for one chosen class. The oneClassRule() function is called for each class in ascending order of class frequency, since we adopt the ascending-order rule strategy here, which is also adopted by RIPPER.
Figure 23. The flow chart of BuildClassifier().
4.3.2 OneClassRule( )
The function OneClassRule() shown in Figure 24 is an implementation of the separate-and-conquer algorithm. The training data is first transformed into a rough set of the chosen class, which splits the original data into three parts, and we make the boundary region empty to speed up subsequent processing. If there are contradictory instances in the data set, they fall in the boundary region. There are many methods to handle contradictions, such as assigning the most frequent class label to the contradictory instances. We choose a simple method: deleting the instances in the boundary region. After that, we build a rule from the rough set with the grow() function. The rule is appended to the end of the rule set right after it is built, and the positive instances explained by the rule are removed from the positive region. The remaining instances in the rough set are then used to build another rule, and this repeats until all instances are explained.
Figure 24. The flow chart of OneClassRule().
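The separate-and-conquer loop described above can be sketched as follows. The sketch assumes that the boundary region has already been deleted, so only the positive and negative blocks are passed in, and that the rule learner is supplied as a hypothetical `grow` callable returning a list of (attribute, value) antecedents. The guard against a rule that explains nothing is an addition for safety and is not described in the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, str]
Rule = List[Tuple[str, str]]


def covers(rule: Rule, record: Record) -> bool:
    """A rule covers a record if every antecedent holds."""
    return all(record.get(attr) == value for attr, value in rule)


def one_class_rule(
    positive: Sequence[Record],
    negative: Sequence[Record],
    grow: Callable[[List[Record], Sequence[Record]], Rule],
) -> List[Rule]:
    """Build rules until every positive record is explained."""
    rule_set: List[Rule] = []
    remaining = list(positive)
    while remaining:
        rule = grow(remaining, negative)            # conquer: learn one rule
        if not any(covers(rule, r) for r in remaining):
            break                                   # safety guard (assumed)
        rule_set.append(rule)
        # separate: drop the positive records explained by the new rule
        remaining = [r for r in remaining if not covers(rule, r)]
    return rule_set
```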
4.3.3 grow( )
The grow() function shown in Figure 25 builds a rule that explains some of the positive instances and none of the negative instances in the rough set. At the beginning, an empty rule is created. The DiscPow and Chi-Squared values of each attribute are calculated, and the rule is extended with the antecedents built by the bestAntd() function. The longer the rule grows, the fewer negative instances it covers. The rule is complete when it covers no negative instances.
Figure 25. The flow chart of grow().
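The stopping condition of grow() can be illustrated with a simplified sketch. It deliberately omits the DiscPow and Chi-Squared based attribute elimination and uses only PurityOverAll with the “Max” method to pick each antecedent, so it is a reduced stand-in rather than the full procedure; the function signature is hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, str]


def grow(
    positive: Sequence[Record],
    negative: Sequence[Record],
    attributes: Sequence[str],
) -> List[Tuple[str, str]]:
    """Add antecedents greedily until no negative record is covered."""
    rule: List[Tuple[str, str]] = []
    pos, neg = list(positive), list(negative)
    free = list(attributes)
    while neg and free:
        best: Optional[Tuple[float, str, str]] = None
        for a in free:
            for v in sorted({r[a] for r in pos}):
                p = sum(r[a] == v for r in pos)
                n = sum(r[a] == v for r in neg)
                purity = p / (p + n)  # PurityOverAll; p >= 1 by construction
                if best is None or purity > best[0]:
                    best = (purity, a, v)
        if best is None:
            break  # no positive records left to describe
        _, a, v = best
        rule.append((a, v))
        free.remove(a)
        # Keep only the records still covered by the extended rule.
        pos = [r for r in pos if r[a] == v]
        neg = [r for r in neg if r[a] == v]
    return rule
```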
4.3.4 BestAntd( )
BestAntd() chooses the best attribute-value pair to grow the rule. It corresponds to the CHOOSE_ATTR&VALUE() function in the pseudocode of the GROW function in Figure 14.
CHAPTER 5
EXPERIMENT AND RESULTS
5.1 Environmental Setting
The experiments are run on a computer with the Windows 7 32-bit operating system. The memory is 4 GB of DDR3 SDRAM at 1333 MHz, the chipset is Intel Q67 Express, and the CPU is an Intel Core i7-2600 at 3.4 GHz. The Weka version is 3.6.5.
5.2 Data Sets
The data sets used in the experiments are all available from the UCI Machine Learning Repository [23]. The data sets that are originally nominal are shown in Table VII, and the discretized data sets that originally contain some real-valued data are shown in Table VIII. They are collected from different application domains, such as biology, gaming, politics, and marketing. The number of attributes ranges from 5 to 69, and the number of classes ranges from 2 to 24. Since the number of classes differs between data sets, we use bar charts to visualize the class distributions; for some data sets, the class distribution is imbalanced. Some data sets also have missing values on some attributes.
Data sets whose names end with the “_dis” suffix are not purely nominal originally. We perform discretization on these data sets, and the details of which attributes are discretized and how are shown in Table IX.
Table VII. Original nominal data sets.
Data name                 #instances   #attributes (including class)   Class distribution   Missing values
Agaricus-lepiota          8124         23                              (bar chart)          yes
Audiology.standardized    226          69                              (bar chart)          yes
Car                       1728         6                               (bar chart)          no
Data name    Supervised discretization    Equal-bin discretization (number of bins)    Numerical to nominal
Abalone 2,3,4,5,6,7,8 9(5)
Adults 1,5,11,12,13 3(10)
Australian 2,3,7,10,13,14 1,4,5,6,8,9,11,12,15
Balance-scale 1,2,3,4
German 2 5(10),13(10) 8,11,16,18,21
Heart 8,10 1(5),4(5),5(5) 2,3,6,7,9,11,12,13,14
We defined four types of contradictions in section 3.2, and the number of contradictions in each data set is shown in Table X and Table XI.
Table X. Number of contradictions in original nominal data sets
Data sets Number of instances Number of contradictions type 1 type2 type3 type4
Table XI. Number of contradictions in discretized data sets.
Data sets number of instances Number of contradictions type 1 type2 type3 type4
We design several experiments to examine ROUSER’s performance in different situations. We use 10-fold cross-validation to evaluate the classification performance.
The results for the originally nominal data sets are summarized in Table XII.
The numbers reported in Table XII are accuracy rates in percent; the maximum values are in bold, and the minimum values are underlined. As mentioned in Section 3.2, ROUSER has seven ways to search for the attribute-value pair used to grow a rule, and the results of all seven are shown in Table XII. Four out of nine accuracy results of ROUSER_6 are better than or the same as those of both J48 and JRip. On two data sets, the ROUSER versions are outperformed by JRip and J48. ROUSER_1 and ROUSER_6 are the most stable of the seven versions, and their accuracy rates are comparable to those of J48 and JRip.
However, ROUSER does not perform well on the Car and Splice data sets. We think this is because ROUSER has no optimization stage or pruning method (whereas RIPPER has both), so overfitting occurs. The Car data set has a hierarchical structure that is easily captured by a tree, and we think that this is why J48 outperforms JRip and ROUSER. The embedded feature selection method of ROUSER performs well on the Splice data set (as shown in the experimental results later), but ROUSER itself does not perform well on this data set. We think that this might be due to overfitting. A deeper investigation of this will be part of the future work.
We design an experiment to examine ROUSER’s capability to handle missing values. We choose three data sets: Kr-vs-kp, Nursery, and Tic-tac-toe, to produce artificial data sets with missing values. The missing values are distributed randomly in each attribute with the same percentage (10%, 20%, 30%), while the distributions of missing values are different between attributes. The class attribute has no missing values, and besides the class attribute, no
contradictions in each data set is shown in Table XIII.
Table XII. Results for original nominal data sets.
Data sets
Table XIII. Number of contradictions in artificial missing values in data sets.
Data sets total number of contradictions type1 type2 type3 type4
The results are shown in Table XIV. The numbers reported in Table XIV are accuracy rates in percentage, and except those of the original (0%) data sets, each accuracy rate is the average accuracy rate of 10 different artificial data sets with the same rate of missing value.
The results indicate that the performances of ROUSER are similar to JRip in Kr-vs-kp
Nursery data set when the missing value percentage rises. We speculate that the Nursery data set with missing values has too many type 3 contradictions, which are ignored by ROUSER as mentioned in Section 3.2, or that there are too many type 1 contradictions, which makes ROUSER miscalculate the potential boundary region. To address the problem, we may adopt probability theory and assign contradictory instances to the class with the higher probability.
Table XIV. Results for artificial missing values in data sets.
data sets Accuracy (%)
experimental results are given in Table XV. Each result is presented as two numbers: the upper number is the original accuracy with rules in ascending order, and the lower number is the difference after switching to descending order. Ascending order is clearly better than descending order only on the Audiology.standardized data set, which has 24 class labels with an imbalanced distribution. However, descending order is better on the Car data set, where the accuracy becomes comparable with that of JRip and J48. Descending order is also better than ascending order on the Splice data set, but the accuracy is still not comparable with that of JRip and J48. We observe that imbalanced multi-class data sets are sensitive to the rule-ordering strategy. A deeper investigation of this will be part of the future work.