CHAPTER 2 PRELIMINARY
2.2 The Rough Set Theory
2.2.5 Indiscernibility Matrix
National Chengchi University (國立政治大學)
Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷), a discernibility matrix 𝑀𝐷(𝐶) of 𝑨 is an 𝑛 × 𝑛 matrix whose entries are defined as follows:
𝑐𝑖𝑗 = {𝑎 ∈ 𝐶 ∶ 𝑎(𝑥𝑖) ≠ 𝑎(𝑥𝑗) ∧ 𝑑(𝑥𝑖) ≠ 𝑑(𝑥𝑗)} for 𝑖, 𝑗 = 1, 2, . . . , 𝑛,
where 𝑛 is the number of elements in 𝑈 and 𝑥𝑖, 𝑥𝑗 ∈ 𝑈.
The discernibility function 𝑓𝐷(𝐶) is defined as follows:
𝑓𝐷(𝐶) = ⋀{⋁ 𝑎 : 𝑎 ∈ 𝑐𝑖𝑗 , 1 ≤ 𝑖 < 𝑗 ≤ 𝑛, 𝑐𝑖𝑗 ≠ ∅}
The discernibility function 𝑓𝐷(𝐶) is a Boolean function written in conjunctive normal form. Its prime implicants, that is, the constituents of its minimal disjunctive normal form, correspond to the 𝐷-reducts of 𝐶.
The indiscernibility matrix of the decision table in Figure 6 (a) is given in Figure 9.
Figure 9. An example of indiscernibility matrix.
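The definitions above can be illustrated with a minimal Python sketch. The toy decision table and the function names (discernibility_matrix, discernibility_clauses, satisfies) are hypothetical and are not taken from Figure 6 or Figure 9; the sketch only shows how each matrix entry and each clause of the discernibility function could be computed.

```python
from itertools import combinations

def discernibility_matrix(rows, cond_attrs, dec_attr):
    """Return {(i, j): set of condition attributes} for every pair i < j.

    An entry is non-empty only when the two objects have different
    decisions; it then holds the attributes on which they differ.
    """
    matrix = {}
    for i, j in combinations(range(len(rows)), 2):
        if rows[i][dec_attr] != rows[j][dec_attr]:
            entry = {a for a in cond_attrs if rows[i][a] != rows[j][a]}
        else:
            entry = set()
        matrix[(i, j)] = entry
    return matrix

def discernibility_clauses(matrix):
    """Clauses of the discernibility function: one per non-empty entry."""
    return {frozenset(entry) for entry in matrix.values() if entry}

def satisfies(attr_subset, clauses):
    """True when the attribute subset intersects every clause."""
    return all(clause & attr_subset for clause in clauses)

# Hypothetical toy decision table (not the one in Figure 6).
table = [
    {"a1": 0, "a2": 0, "d": "no"},
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 1, "d": "yes"},
]
M = discernibility_matrix(table, ["a1", "a2"], "d")
clauses = discernibility_clauses(M)
```

In this toy table the subset {a2} intersects every clause while {a1} does not, so a2 alone is enough to keep objects with different decisions discernible.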
CHAPTER 3
DESIGN OF THE PROPOSED METHOD
3.1 Potential Boundary Region and Discernibility Power
One of the contributions of this thesis is a new search heuristic, named discernibility power, based on rough set theory. Before introducing discernibility power, we redefine the rough set for disambiguation and convenience.
Redefining a Rough Set
Guided by the original definition of rough set theory, we redefine a rough set. Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷), for each block X of partition 𝑈/𝐷, the rough set of X is redefined below:
The positive region of X:
𝑃𝑂𝑆𝑨(𝑋) = {𝑥 | 𝐶(𝑥) ⊆ 𝑋}.
The negative region of X:
𝑁𝐸𝐺𝑨(𝑋) = {𝑥 | 𝐶(𝑥) ∩ 𝑋 = ∅}.
The boundary region of X:
𝐵𝑂𝑈𝑁𝐷𝑨(𝑋) = 𝑋 − 𝑃𝑂𝑆𝑨(𝑋) − 𝑁𝐸𝐺𝑨(𝑋).
Notice that the positive region here is the same as the lower approximation of a rough set, but it differs from the positive region mentioned in Section 2.2.3, which is the positive region of 𝐷 with respect to 𝐶. As sketched in Figure 10, a rough set is redefined as three disjoint conventional sets: the positive region, the negative region, and the boundary region. The redefined rough set is also a partition of 𝑼. The purpose of the redefinition is to connect rough set theory with the separate-and-conquer algorithm, which iteratively grows a rule by rejecting as many negative data records as possible and accepting as many positive data records as possible. Based on the redefinition of a rough set, we introduce two concepts: the potential boundary region (PotBound) and discernibility power (DiscPow).
Figure 10. The redefined rough set.
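The three regions can be sketched in Python as follows. The function name regions and the toy table are hypothetical. As an assumption of this sketch, the boundary is taken to be every object outside the positive and negative regions, so that the three regions partition 𝑈 as stated in the text.

```python
from collections import defaultdict

def regions(rows, cond_attrs, target):
    """Split object indices into (positive, negative, boundary) for a target set X.

    Objects with identical condition values form one equivalence class C(x).
    Assumption: the boundary is everything outside POS and NEG.
    """
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in cond_attrs)].add(i)
    pos, neg, bound = set(), set(), set()
    for block in classes.values():
        if block <= target:            # C(x) is a subset of X
            pos |= block
        elif not (block & target):     # C(x) and X are disjoint
            neg |= block
        else:                          # C(x) overlaps X only partly
            bound |= block
    return pos, neg, bound

# Hypothetical toy table: objects 0 and 1 share conditions but not decisions.
rows = [
    {"a1": 0, "a2": 0, "d": "yes"},
    {"a1": 0, "a2": 0, "d": "no"},
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 1, "d": "no"},
]
X = {i for i, r in enumerate(rows) if r["d"] == "yes"}
pos, neg, bound = regions(rows, ["a1", "a2"], X)
```

Here objects 0 and 1 fall into the boundary region because they are indiscernible but belong to different decision classes.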
Potential Boundary Region
Consider the rough set of 𝑋 defined above. The potential boundary region of an attribute ai is the set of elements that become indiscernible when ai is removed. The definition of PotBound of X with respect to attribute ai is given below:
𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) = 𝐵𝑜𝑢𝑛𝑑𝑨′(𝑋) − 𝐵𝑜𝑢𝑛𝑑𝑨(𝑋), where 𝑨 = (𝑈, 𝐶 ∪ 𝐷) and 𝑨′ = (𝑈, 𝐶 ∪ 𝐷 − {𝑎𝑖}).
Here is an example of PotBound. Let the original decision table be 𝑨 = (𝑈, 𝐶 ∪ 𝐷). If the attribute a2 is removed, the new decision table becomes 𝑨′ = (𝑈, 𝐶 ∪ 𝐷 − {a2}). Then x6 and x7 become indiscernible, and the boundary region of Y expands. The expanded part of the boundary region, {x6, x7}, is the PotBound of a2.
DiscPow has the monotonicity property, which means that removing elements from a rough set, or a partition of 𝑈, will never increase the DiscPow. Below is the proof.
Figure 12. The rough set of A = (U,C∪D ).
Figure 13. The rough set of A' = (U, C∪D − {a2}).
Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷) as shown in Figure 12, the DiscPow of 𝑎𝑖 ∈ 𝐶 with respect to the rough set of 𝑋 ∈ 𝑈/𝐷 is as below:
𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) = 𝐶𝑎𝑟𝑑(𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖)), and the 𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) is given below:
𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) = 𝐵𝑜𝑢𝑛𝑑𝑨′(𝑋) − 𝐵𝑜𝑢𝑛𝑑𝑨(𝑋), where 𝑨′ = (𝑈, 𝐶 ∪ 𝐷 − {𝑎𝑖}), as shown in Figure 13.
Below are definitions for 𝑋 with respect to 𝑨′:
The positive region of 𝑋:
𝑃𝑂𝑆𝑨′(𝑋) = {𝑥 | 𝐶(𝑥) ⊆ 𝑋}
The negative region of 𝑋:
𝑁𝐸𝐺𝑨′(𝑋) = {𝑥 | 𝐶(𝑥) ∩ 𝑋 = ∅}
The boundary region of 𝑋:
𝐵𝑂𝑈𝑁𝐷𝑨′(𝑋) = 𝑋 − 𝑃𝑂𝑆𝑨′(𝑋) − 𝑁𝐸𝐺𝑨′(𝑋)
By the 𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) given above, the boundary region of 𝑋 has another definition:
𝐵𝑂𝑈𝑁𝐷𝑨′(𝑋) = 𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) + 𝐵𝑂𝑈𝑁𝐷𝑨(𝑋)
Removing more than one element from 𝑈 can be considered as iteratively removing one element at a time, and inserting more than one element into 𝑈 can be considered as iteratively inserting one element at a time. By all of the above, removing elements from 𝑈 causes 𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) to either hold or drop, and inserting elements into 𝑈 causes 𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) to either hold or rise. This is the monotonicity property of DiscPow.
Discernibility power is one of the search heuristics of the proposed rule-based algorithm, ROUSER, which is introduced in the next section.
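As an illustration, PotBound and DiscPow can be sketched in Python as below. The names boundary, pot_bound and disc_pow and the toy table are hypothetical. The sketch assumes that the boundary contains every object whose equivalence class mixes objects inside and outside 𝑋, and that there are no missing values.

```python
from collections import defaultdict

def boundary(rows, cond_attrs, dec_attr, target_class):
    """Indices of objects whose equivalence class mixes target and other objects."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in cond_attrs)].add(i)
    target = {i for i, row in enumerate(rows) if row[dec_attr] == target_class}
    bound = set()
    for block in classes.values():
        if (block & target) and not (block <= target):
            bound |= block
    return bound

def pot_bound(rows, cond_attrs, dec_attr, target_class, attr):
    """Objects that enter the boundary region once attr is removed."""
    reduced = [a for a in cond_attrs if a != attr]
    return (boundary(rows, reduced, dec_attr, target_class)
            - boundary(rows, cond_attrs, dec_attr, target_class))

def disc_pow(rows, cond_attrs, dec_attr, target_class, attr):
    """Cardinality of the potential boundary region of attr."""
    return len(pot_bound(rows, cond_attrs, dec_attr, target_class, attr))

# Hypothetical toy table: objects 0 and 1 differ only on a2.
rows = [
    {"a1": 0, "a2": 0, "d": "yes"},
    {"a1": 0, "a2": 1, "d": "no"},
    {"a1": 1, "a2": 0, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
]
attrs = ["a1", "a2"]
power_a1 = disc_pow(rows, attrs, "d", "yes", "a1")
power_a2 = disc_pow(rows, attrs, "d", "yes", "a2")

# Removing object 1 from U: the DiscPow of a2 cannot rise.
smaller = rows[:1] + rows[2:]
power_a2_smaller = disc_pow(smaller, attrs, "d", "yes", "a2")
```

In this toy table, removing a2 makes objects 0 and 1 indiscernible, so the DiscPow of a2 is 2, while a1 can be dropped without enlarging the boundary. After object 1 is removed from 𝑈, the DiscPow of a2 drops to 0, which agrees with the monotonicity property on this example.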
3.2 ROUSER
ROUSER follows the separate-and-conquer algorithm as its framework. Our contribution here is using the proposed DiscPow as the search heuristic of the GROW function in the separate-and-conquer algorithm. The GROW function of ROUSER is shown in Figure 14. In each iteration, ROUSER removes attributes whose DiscPow values are zero, and it updates the DiscPow of every attribute until the DiscPow values of all remaining attributes are non-zero. If multiple attributes need to be removed, the current version of ROUSER simply removes the one that is most independent of the class entered as a parameter to the separate-and-conquer algorithm in Figure 2. We use the Chi-Squared value to decide the degree of independence. The Chi-Squared value was first used in feature selection in [9]. Feature selection with the Chi-Squared test together with rough set theory was proposed in [19].
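The Chi-Squared value of an attribute with respect to a class can be sketched as follows. This is the standard Pearson statistic on the contingency table of attribute value against class membership; the function name chi_squared, the one-versus-rest treatment of the class, and the toy data are assumptions of this sketch, not details confirmed by the thesis.

```python
from collections import Counter

def chi_squared(rows, attr, dec_attr, target_class):
    """Pearson chi-squared statistic of attr against membership in target_class."""
    n = len(rows)
    # Assumption: the class is binarised as "target class" versus "any other class".
    labelled = [(r[attr], r[dec_attr] == target_class) for r in rows]
    value_counts = Counter(v for v, _ in labelled)
    class_counts = Counter(c for _, c in labelled)
    joint = Counter(labelled)
    stat = 0.0
    for v, nv in value_counts.items():
        for c, nc in class_counts.items():
            expected = nv * nc / n
            stat += (joint[(v, c)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: a1 determines the class, a2 is independent of it.
rows = [
    {"a1": 0, "a2": 0, "d": "yes"},
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "no"},
    {"a1": 1, "a2": 1, "d": "no"},
]
chi_a1 = chi_squared(rows, "a1", "d", "yes")
chi_a2 = chi_squared(rows, "a2", "d", "yes")
```

A smaller value indicates an attribute that is closer to being independent of the class, so among attributes with zero DiscPow the one with the minimum value (a2 in this toy example) would be dropped first.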
GROW(Rule, Covered):
    do:
        for every attribute ai:
            DiscPowi = DISCPOW(ai, Covered)
            ChiSquaredi = CHISQUARED(ai, Covered)
        Among attributes with DiscPowi = 0, ignore ai with minimum ChiSquaredi
    while exist ai with DiscPowi = 0
    (a, v) = CHOOSE_ATTR&VALUE()
    grow the rule with (a, v) as an antecedent
Figure 14. The GROW function.
Once an attribute is removed in an iteration while the GROW function is running, we no longer need to compute its DiscPow value, because of the monotonicity property of DiscPow. When elements are removed from the rough set covered by the current rule, the DiscPow value of an attribute stays the same or becomes smaller. Once the DiscPow value of an attribute is zero, it will never increase, and hence the attribute can be removed. The DISCPOW function is shown in Figure 15.
DISCPOW(ai, Covered):
    return the cardinality of PotBound(ai)
Figure 15. The DISCPOW function.
The CHOOSE_ATTR&VALUE function in the GROW function searches for an attribute-value pair, i.e., (ai, vi), that will be used to grow a rule. We use the idea of the purity value [9][20] as the search heuristic. In our algorithm we provide three types of purity as options: PurityOverAll, PurityPotBound, and PurityHybrid. The first is the same as the original definition of purity, and the others are proposed by us. The definitions of these purities are given below:
PurityOverAll = |pall|/(|pall|+|nall|),
where pall is the set of positive records covered by the candidate attribute and value, and nall is the set of negative records covered by the candidate attribute and value;
PurityPotBound = |ppb|/(|ppb|+|npb|),
where ppb is the set of positive records that are in the potential boundary region of the candidate attribute and are covered by the candidate attribute and value, and npb is the set of negative records that are in the potential boundary region of the candidate attribute and are covered by the candidate attribute and value;
PurityHybrid = |ppb|/(|ppb|+|nall|),
where ppb and nall are as defined above.
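The three purities can be computed from plain sets of record identifiers, as in the following Python sketch. The function names and the toy sets are hypothetical, and returning 0.0 when a denominator is zero is an assumption of the sketch, since the text does not specify that case.

```python
def ratio(pos, neg):
    """|pos| / (|pos| + |neg|); returns 0.0 for an empty denominator (assumption)."""
    total = len(pos) + len(neg)
    return len(pos) / total if total else 0.0

def purities(covered, positives, pot_bound):
    """Compute the three purities for one candidate attribute-value pair.

    covered:   records matched by the candidate attribute and value
    positives: records of the class currently being learned
    pot_bound: potential boundary region of the candidate attribute
    """
    p_all = covered & positives
    n_all = covered - positives
    p_pb = p_all & pot_bound
    n_pb = n_all & pot_bound
    return {
        "PurityOverAll": ratio(p_all, n_all),
        "PurityPotBound": ratio(p_pb, n_pb),
        "PurityHybrid": ratio(p_pb, n_all),
    }

# Hypothetical record identifiers.
result = purities(covered={1, 2, 3, 4}, positives={1, 2, 5}, pot_bound={1, 3})
```

In this toy case the candidate covers two positive and two negative records, of which one positive and one negative lie in the potential boundary region, giving 0.5, 0.5 and 1/3 for the three purities respectively.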
In addition to purity, we provide weighted Information Gain as an optional search heuristic, defined as:
WInfoGain = (|p2all|/|p1all|)*( log(|p2all|/(|p2all|+|n2all|)) - log(|p1all|/(|p1all|+|n1all|)) )
where p1all and n1all are the sets of positive and negative records, respectively, in the original set of data records, and p2all and n2all are the sets of positive and negative records, respectively, in the chosen subset of data records. The term “log(|p1all|/(|p1all|+|n1all|))” is the information content of the original set of data records, while “log(|p2all|/(|p2all|+|n2all|))” is the information content of the chosen subset. The factor “(|p2all|/|p1all|)” is the weight of the Information Gain.
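A small Python sketch of the weighted Information Gain is given below. It takes the four cardinalities as arguments; the function name w_info_gain, the use of base-2 logarithms, and the value 0.0 returned when no positive record is available are assumptions of the sketch, since the text does not fix them.

```python
import math

def w_info_gain(p1, n1, p2, n2):
    """Weighted Information Gain from the counts |p1all|, |n1all|, |p2all|, |n2all|.

    Assumptions: base-2 logarithm; 0.0 when either positive count is zero.
    """
    if p1 == 0 or p2 == 0:
        return 0.0
    before = math.log2(p1 / (p1 + n1))   # information content of the original set
    after = math.log2(p2 / (p2 + n2))    # information content of the chosen subset
    return (p2 / p1) * (after - before)

# Hypothetical counts: 4 positive and 4 negative records originally,
# 2 positive and 0 negative records in the chosen subset.
gain = w_info_gain(4, 4, 2, 0)
```

With these counts the chosen subset is pure, so the bracketed difference is 1 bit, and the weight 2/4 gives a gain of 0.5.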
We also provide two selection methods. The first, “Max”, finds the maximum heuristic value (e.g., purity) among all possible attribute-value pairs. The second, “Frequent Max”, first finds the most frequent value of each attribute and then finds the maximum heuristic value (e.g., purity) among these pairs.
In total, our CHOOSE_ATTR&VALUE function can choose an attribute-value pair in seven different ways:
1. PurityOverAll, Max
2. PurityPotBound, Max
3. PurityHybrid, Max
4. PurityOverAll, Frequent Max
5. PurityPotBound, Frequent Max
6. PurityHybrid, Frequent Max
7. WInfoGain, Max
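The difference between “Max” and “Frequent Max” can be sketched in Python as follows, using PurityOverAll as the heuristic. The function names, the method strings, and the toy data are hypothetical; tie-breaking is not specified in the text and is arbitrary here.

```python
from collections import Counter

def max_candidates(rows, attrs):
    """'Max': every attribute-value pair that occurs in the data."""
    return {(a, r[a]) for r in rows for a in attrs}

def frequent_max_candidates(rows, attrs):
    """'Frequent Max': only the most frequent value of each attribute."""
    return {(a, Counter(r[a] for r in rows).most_common(1)[0][0]) for a in attrs}

def purity_over_all(rows, dec_attr, target_class, pair):
    """Fraction of the records covered by the pair that belong to target_class."""
    a, v = pair
    covered = [r for r in rows if r[a] == v]
    return sum(1 for r in covered if r[dec_attr] == target_class) / len(covered)

def choose_attr_value(rows, attrs, dec_attr, target_class, method="Max"):
    """Pick the candidate pair with the highest purity under the given method."""
    if method == "Frequent Max":
        cands = frequent_max_candidates(rows, attrs)
    else:
        cands = max_candidates(rows, attrs)
    return max(sorted(cands),
               key=lambda p: purity_over_all(rows, dec_attr, target_class, p))

# Hypothetical toy data.
rows = [
    {"a1": "x", "a2": "p", "d": "yes"},
    {"a1": "x", "a2": "p", "d": "no"},
    {"a1": "x", "a2": "q", "d": "no"},
    {"a1": "y", "a2": "q", "d": "yes"},
    {"a1": "x", "a2": "q", "d": "no"},
]
best_max = choose_attr_value(rows, ["a1", "a2"], "d", "yes", "Max")
best_freq = choose_attr_value(rows, ["a1", "a2"], "d", "yes", "Frequent Max")
```

On this toy data “Max” selects the rare but pure pair (a1, y), whereas “Frequent Max” is restricted to the most frequent values x and q and therefore selects (a2, q).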
ROUSER generates a set of rules for each class. As soon as a rule set is generated, it is concatenated to the bottom of the rule list. The BUILD_CLASSIFIER algorithm of ROUSER is shown in Figure 16. The class list is sorted by ascending frequency order as RIPPER does. For an unseen case, ROUSER searches down the rule list and uses the first rule that covers the case to classify it.
ROUSER has to decide if two records are indiscernible to determine the boundary and potential boundary regions. Consider the examples in Figure 17, where there are two records j and k. If we want to know if record j and record k are indiscernible, we have to check every attribute’s value. If each attribute has the same value in record j and k, we say that the two records are indiscernible.
Figure 17. The example for checking if two records are indiscernible.
Deciding whether two records are indiscernible is a simple task. However, missing values complicate it. We define four types of indiscernibility between two values, as shown in Table III, Table IV, Table V, and Table VI. These tables show how we treat a missing value of an attribute when we check whether two records are indiscernible. From type 1 to type 4, the determination of indiscernibility becomes stricter. Currently, ROUSER uses type 3 to find the boundary region and type 1 to find the potential boundary region. Considering other types of indiscernibility is part of our future study.
BUILD_CLASSIFIER( ):
    build a ClassList by ascending frequency order
    for each Class in ClassList:
        RuleSet = SEPARATE&CONQUER(Class, TrainData)
        concatenate the RuleSet to the bottom of the RuleList
    return RuleList
Figure 16. The BUILD_CLASSIFIER function.
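The rule-list construction and the first-match classification described above can be sketched in Python as follows. The rule representation (a list of attribute-value conditions plus a class label), the function names, and the trivial learner memorise are hypothetical; memorise is only a stand-in for the separate-and-conquer step and is not ROUSER's rule learner.

```python
from collections import Counter

def build_classifier(rows, dec_attr, learn_rules):
    """Learn rules class by class, rarest class first, into one ordered rule list."""
    counts = Counter(r[dec_attr] for r in rows)
    class_list = sorted(counts, key=lambda c: (counts[c], str(c)))
    rule_list = []
    for cls in class_list:
        rule_list.extend(learn_rules(cls, rows))   # append to the bottom of the list
    return rule_list

def classify(rule_list, record, default=None):
    """Return the class of the first rule whose conditions all hold for the record."""
    for conditions, cls in rule_list:
        if all(record.get(a) == v for a, v in conditions):
            return cls
    return default

def memorise(cls, rows, dec_attr="d"):
    """Hypothetical stand-in learner: one rule per training record of the class."""
    return [([(a, v) for a, v in r.items() if a != dec_attr], cls)
            for r in rows if r[dec_attr] == cls]

rows = [
    {"a": 1, "d": "yes"},
    {"a": 2, "d": "no"},
    {"a": 3, "d": "no"},
]
rule_list = build_classifier(rows, "d", memorise)
prediction = classify(rule_list, {"a": 2})
unmatched = classify(rule_list, {"a": 9})
```

Because the class “yes” is the least frequent here, its rules appear at the top of the rule list. A record that no rule covers receives the default value, which is left as None in this sketch.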
Records with the same conditions and different decisions are considered contradictions. Based on the four types of indiscernibility, there are four types of contradictions. ROUSER simply ignores the contradictions (type 3) in the training data.
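Finding contradictions can be sketched in Python as below for the simplest case without missing values. The function name contradictions and the toy data are hypothetical, and the sketch does not implement the four indiscernibility types of Tables III to VI.

```python
from collections import defaultdict

def contradictions(rows, cond_attrs, dec_attr):
    """Indices of records that share all condition values but differ in decision.

    Assumption: no missing values, so two values match only when they are equal.
    """
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        groups[tuple(r[a] for a in cond_attrs)].append(i)
    found = set()
    for indices in groups.values():
        if len({rows[i][dec_attr] for i in indices}) > 1:
            found.update(indices)
    return found

# Hypothetical toy data: records 0 and 1 contradict each other.
rows = [
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 1, "b": 1, "d": "no"},
    {"a": 2, "b": 1, "d": "yes"},
]
conflicting = contradictions(rows, ["a", "b"], "d")
cleaned = [r for i, r in enumerate(rows) if i not in conflicting]
```

Dropping the contradicting records, as in the last line, leaves a consistent training set in which identical conditions always lead to the same decision.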
CHAPTER 4
IMPLEMENTATION OF THE PROPOSED METHOD
4.1 WEKA
Weka [8] is open-source data mining software that provides free Java code for machine learning tasks. Weka is developed and maintained by the University of Waikato in New Zealand. We use Weka 3.6.5 as our development environment.
4.1.1 Import Data
Weka accepts several data formats, including the simple Comma-Separated Values (CSV) format and the Attribute-Relation File Format (ARFF). After data is imported, it is stored in Weka-defined data structures. Each data record is stored in an Instance object, and the whole data set is stored in an Instances object, which contains multiple Instance objects. An Attribute object contains all the details about an attribute, such as whether its data type is nominal or numeric and how many values the attribute has. Multiple Attribute objects are also contained in one Instances object.
4.1.2 Classifier
To develop a classifier in the Weka environment, the abstract class weka.classifiers.Classifier must be extended. Its abstract method buildClassifier() must then be implemented; this method is called every time a model is built from the training data. After the model is built, one of two methods is called to classify testing data: classifyInstance() or distributionForInstance(). Both use the model built by buildClassifier() to generate the classification result for a single data record. The difference between them is that the former returns exactly one class label as the prediction, while the latter returns an array of probabilities, one per class label.
4.1.3 Cross-Validation
Weka offers several evaluation methods, and they are easy to use. Here we introduce how to perform cross-validation. First an evaluator is built by constructing a weka.classifiers.Evaluation object, and then its method crossValidateModel() is called.
4.2 Data Structure
The data structures used in the implementation of ROUSER are partly adapted from JRip, which is provided by Weka.
4.2.1 Rough Set
In order to implement the rough set intuitively, a data structure for a rough set is built, as in Figure 18. A data set is split into a partition of three blocks, namely positive, boundary, and negative, following the definition of a rough set in Section 3.1, and each block is an Instances object as described in Section 4.1.1. For convenience, blocks filled with black are empty, while white blocks are not empty. Some necessary information is also stored in the structure, such as DiscPow, the Chi-Squared value, and purity, for later use such as choosing the best attribute and value to build a rule. This data structure is not part of Weka.
Figure 18. The data structure of a rough set.
4.2.2 Decision Rule
As mentioned in Section 2.1.1, a decision rule is a logic statement with the following form:
condition1 ∧ condition2 ∧…→class,
hence there can be multiple antecedents. We define a data structure named RAntd to store each condition. Some necessary information is contained in the structure, such as the DiscPow, the number of instances covered by the rule so far (from the first condition to this condition), and the number of instances explained by the rule so far, as shown in Figure 19.
This data structure is adapted from JRip provided by Weka; however, some of the information stored in it is different.
Figure 19. The data structure of the antecedent of a rule: RAntd.
Another structure adapted from JRip provided by Weka is RouserRule, which stores a rule, as in Figure 20. It contains two parts: the queue of antecedents and a class label.
When a rule is grown, RAntd objects are generated one after another and stored in a queue in order.
Figure 20. The data structure of a rule: RouserRule.
After a rule is generated, it is stored in the rule set in growing order, as shown in Figure 21. The rule set is a queue. This is also adapted from JRip provided by Weka.
Figure 21. A data structure of the rule set: m_Ruleset.
The whole data structure of the rule model built by ROUSER is shown in Figure 22:
Figure 22. The data structure of ROUSER’s rule model.
4.3 ROUSER
Following the separate-and-conquer algorithm, ROUSER is implemented in the Weka environment. As mentioned in Section 4.1.2, the BuildClassifier() function must be implemented.
4.3.1 BuildClassifier( )
In the BuildClassifier() function shown in Figure 23, oneClassRule() is an implementation of the separate-and-conquer algorithm, which builds rules for one chosen class. The oneClassRule() function is called for each class in ascending order of class frequency, since we adopt the ascending-ordered rules strategy here, which is also adopted by RIPPER.
Figure 23. The flow chart of BuildClassifier().
4.3.2 OneClassRule( )
The function OneClassRule() shown in Figure 24 is an implementation of the separate-and-conquer algorithm. The training data is first transformed into a rough set of the chosen class, which splits the original data into three parts, and we make the boundary region empty to accelerate further processing. If there are contradictory instances in the data set, they will be in the boundary region. There are many methods to handle contradictions, such as assigning the most frequent class label to the contradictory instances. We choose a simple method: deleting the instances in the boundary region. After that, we build a rule from the rough set with the grow() function. The rule is appended to the end of the rule set right after it is built, and the positive instances explained by the rule are removed from the positive region. The remaining instances in the rough set are then used to build another rule, iteratively, until all instances are explained.
Figure 24. The flow chart of OneClassRule().
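The loop described above can be sketched in Python as follows. The function names mirror the text, but the bodies are hypothetical: in particular, grow() here is a greedy stand-in that picks the purest attribute-value test at each step and does not perform the DiscPow-based attribute removal of ROUSER. The sketch assumes the contradictory instances have already been deleted.

```python
def covers(conditions, row):
    """True when the record satisfies every (attribute, value) test of the rule."""
    return all(row[a] == v for a, v in conditions)

def grow(pos, neg, attrs):
    """Greedy stand-in for grow(): add tests until no negative record is covered."""
    conditions = []
    while neg:
        used = {a for a, _ in conditions}
        cands = {(a, r[a]) for r in pos for a in attrs if a not in used}
        if not cands:
            break  # no attribute left to test

        def purity(pair):
            p = sum(1 for r in pos if r[pair[0]] == pair[1])
            n = sum(1 for r in neg if r[pair[0]] == pair[1])
            return p / (p + n)

        best = max(sorted(cands), key=purity)
        conditions.append(best)
        pos = [r for r in pos if r[best[0]] == best[1]]
        neg = [r for r in neg if r[best[0]] == best[1]]
    return conditions

def one_class_rule(rows, attrs, dec_attr, cls):
    """Separate-and-conquer: learn rules until every positive record is explained."""
    pos = [r for r in rows if r[dec_attr] == cls]
    neg = [r for r in rows if r[dec_attr] != cls]
    rules = []
    while pos:
        conditions = grow(pos, neg, attrs)
        rules.append((conditions, cls))
        pos = [r for r in pos if not covers(conditions, r)]  # remove explained records
    return rules

# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": "no", "d": "play"},
    {"outlook": "sunny", "windy": "yes", "d": "stay"},
    {"outlook": "rain", "windy": "no", "d": "play"},
    {"outlook": "rain", "windy": "yes", "d": "stay"},
]
rules = one_class_rule(rows, ["outlook", "windy"], "d", "play")
```

On this toy data a single rule with the test windy = no explains both positive records and covers no negative record, so the loop stops after one iteration.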
4.3.3 grow( )
The grow() function shown in Figure 25 builds a rule that explains some of the positive instances and none of the negative instances in the rough set. At the beginning, an empty rule is built. The DiscPow and Chi-Squared value of each attribute are calculated, and the rule is extended with the antecedents built by the bestAntd() function. The longer the rule grows, the fewer negative instances it covers. The rule is complete when it covers none of the negative instances.
Figure 25. The flow chart of grow().
4.3.4 BestAntd( )
BestAntd() chooses the best attribute-value pair to grow the rule; it corresponds to the CHOOSE_ATTR&VALUE() function in the pseudocode of the GROW function in Figure 14.
CHAPTER 5
EXPERIMENT AND RESULTS
5.1 Environmental Setting
The experiments are executed on a computer running the 32-bit Windows 7 operating system. The memory is 4 GB of DDR3 SDRAM at 1333 MHz, the chipset is Intel Q67 Express, and the CPU is an Intel Core i7-2600 at 3.4 GHz. The Weka version is 3.6.5.
5.2 Data Sets
The data sets used in the experiments are all available from the UCI Machine Learning Repository [23]. The data sets that are originally nominal are shown in Table VII, and the discretized data sets, which originally contain some real-valued attributes, are shown in Table VIII. They are collected from different application domains, such as biology, gaming, politics, and marketing. The number of attributes ranges from 5 to 69, and the number of classes ranges from 2 to 24. Since the number of classes differs between data sets, we use bar charts to visualize the class distributions; for some data sets the class distribution is imbalanced. Some data sets have missing values on some attributes.
The data sets whose names end with “_dis” are not purely nominal originally. We perform discretization on these data sets; which attributes are discretized, and how, is shown in Table IX.
Table VII. Original nominal data sets.
Data name               #instances  #attributes (including class)  Class distribution  Missing values
Agaricus-lepiota        8124        23                             (bar chart)         yes
Audiology.standardized  226         69                             (bar chart)         yes
Car                     1728        6                              (bar chart)         no
Data name  Supervised discretization  Equal-bin discretization (number of bins)  Numerical to nominal
Abalone 2,3,4,5,6,7,8 9(5)
Adults 1,5,11,12,13 3(10)
Australian 2,3,7,10,13,14 1,4,5,6,8,9,11,12,15
Balance-scale 1,2,3,4
German 2 5(10),13(10) 8,11,16,18,21
Heart 8,10 1(5),4(5),5(5) 2,3,6,7,9,11,12,13,14
We defined four types of contradictions in section 3.2, and the number of contradictions in each data set is shown in Table X and Table XI.
Table X. Number of contradictions in original nominal data sets
Data sets Number of instances Number of contradictions type 1 type2 type3 type4
Table XI. Number of contradictions in discretized data sets.
Data sets number of instances Number of contradictions type 1 type2 type3 type4
We design several experiments to examine ROUSER’s performance in different situations. We use 10-fold cross-validation to evaluate the classification performance.
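The evaluation procedure can be sketched in Python as below. This is a plain deterministic k-fold split written for illustration; it is not Weka's crossValidateModel() implementation, and the function names, the majority-class learner, and the toy data are hypothetical.

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Assign record i to fold i mod k (a simple deterministic split)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(rows, dec_attr, train_fn, predict_fn, k=10):
    """Return the accuracy in percent, averaged over all held-out records."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(rows) if i not in held_out]
        model = train_fn(train)
        correct += sum(predict_fn(model, rows[i]) == rows[i][dec_attr]
                       for i in test_idx)
    return 100.0 * correct / len(rows)

# Hypothetical stand-in learner: always predict the majority class.
def train_majority(train, dec_attr="d"):
    return Counter(r[dec_attr] for r in train).most_common(1)[0][0]

def predict_majority(model, record):
    return model

# Hypothetical toy data: 20 records, every fourth one of class "b".
rows = [{"x": i, "d": "b" if i % 4 == 0 else "a"} for i in range(20)]
accuracy = cross_validate(rows, "d", train_majority, predict_majority, k=10)
```

Each record is used for testing exactly once and for training in the other nine folds, so the reported accuracy is computed over the whole data set.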
The results for the data sets that are originally nominal are summarized in Table XII.
The numbers reported in Table XII are accuracy rates in percent; the maximum values are in bold, and the minimum values are underlined. As mentioned in Section 3.2, ROUSER has seven choices for searching for the attribute-value pair to grow a rule, and the results of all seven choices are shown in Table XII. Four out of nine accuracy results of ROUSER_6 are better than or the same as those of both J48 and JRip. On two data sets, the ROUSER variants are outperformed by JRip and J48. ROUSER_1 and ROUSER_6 are the most stable versions