CHAPTER 1 INTRODUCTION
National Chengchi University (國立政治大學)
Figure 1. Cooperative data analysis.
1.5 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 gives the preliminaries, and Chapter 3 introduces the proposed classification algorithm. Chapter 4 describes the implementation of the proposed algorithm. Chapter 5 presents the experimental results, and Chapter 6 gives a case study. Chapter 7 concludes the thesis and outlines potential directions for future work.
CHAPTER 2 PRELIMINARY
2.1 Rule-Based Classification Algorithms
2.1.1 The Basics
Rule induction learns rules from the given training data, and a rule-based classification algorithm uses the learned rules to classify unseen data records. For classification, a decision rule is a logic statement of the following form:
condition1 ∧ condition2 ∧…→class
where a condition is usually an attribute-value pair, stating the value that a certain attribute must take for the condition to be satisfied.
If a training data record matches all conditions of a rule, we say that the rule covers the data record; if the rule covers a data record and classifies it to the right class, we say that the rule explains the data record. Given a rule set R, if every possible data record is covered by at least one rule, we say that the rule set is exhaustive. If no two rules in R cover the same data record, we say that the rule set is mutually exclusive. If the rule set is not mutually exclusive, a data record can be covered by several rules, which may lead to contradicting results. Generally there are two approaches to overcome this problem: ordered rules and unordered rules. Ordered rules rank the rules by a certain criterion (e.g. accuracy, coverage, description length), so only one rule is chosen to classify a data record. Unordered rules allow multiple rules to be triggered to classify a single data record through voting or weighting methods.
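The notions of covering, explaining, exhaustiveness, and mutual exclusiveness can be illustrated with a minimal Python sketch. The rule set, the attribute names (`outlook`, `windy`), and the helper names are hypothetical and serve only as an illustration; a rule is represented as a pair of a condition dictionary and a class.

```python
from collections import Counter
from itertools import product

# Hypothetical rule set: (conditions, class), conditions map attribute -> value.
RULES = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "rainy"}, "stay"),
    ({"windy": "yes"}, "stay"),
]
# Hypothetical attribute domains, used to enumerate every possible record.
DOMAINS = {"outlook": ["sunny", "rainy"], "windy": ["yes", "no"]}

def covers(rule, record):
    """A rule covers a record if the record matches all of its conditions."""
    conditions, _ = rule
    return all(record.get(a) == v for a, v in conditions.items())

def explains(rule, record, true_class):
    """A rule explains a record if it covers it and predicts the right class."""
    return covers(rule, record) and rule[1] == true_class

def all_records(domains):
    attrs = list(domains)
    return [dict(zip(attrs, vals)) for vals in product(*(domains[a] for a in attrs))]

def is_exhaustive(rules, domains):
    return all(any(covers(r, rec) for r in rules) for rec in all_records(domains))

def is_mutually_exclusive(rules, domains):
    return all(sum(covers(r, rec) for r in rules) <= 1 for rec in all_records(domains))

def classify_ordered(rules, record, default=None):
    """Ordered rules: the first covering rule in the list decides."""
    for rule in rules:
        if covers(rule, record):
            return rule[1]
    return default

def classify_unordered(rules, record, default=None):
    """Unordered rules: every covering rule votes (simple majority here)."""
    votes = Counter(rule[1] for rule in rules if covers(rule, record))
    return votes.most_common(1)[0][0] if votes else default
```

In this toy rule set the record (rainy, yes) is covered by two rules, so the set is exhaustive but not mutually exclusive, which is the situation that ordered or unordered rules have to resolve.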
RIPPER [3] is a popular rule-based classification algorithm. It has two stages: The generation stage and the optimization stage. The classification algorithm proposed in this thesis competes with it in the generation stage.
2.1.2 Separate-and-Conquer
The separate-and-conquer strategy, or sequential covering, first builds a rule that explains a part of the training data, separates them, and conquers the rest recursively until no data remains. It ensures that every data record is at least covered by one rule. Figure 2 gives the separate-and-conquer algorithm, the core of the proposed classification algorithm in this thesis.
Before the algorithm begins, one of the classes is chosen. POSITIVE chooses the data that should be classified to the chosen class, and NEGATIVE chooses the others. Every rule is empty in the beginning, and continues to grow until no negative data is covered by it.
Figure 2. The SEPARATE&CONQUER algorithm.
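The strategy described above can be sketched in Python as follows. This is a minimal illustration, not the algorithm of Figure 2 itself: the function names, the toy records, and in particular the purity-based `grow_rule` are assumptions made for the example (ROUSER's own GROW function is introduced in Chapter 3).

```python
# Hypothetical training records with attributes a1, a2 and class label d.
RECORDS = [
    {"a1": 1, "a2": 1, "d": "y"},
    {"a1": 1, "a2": 2, "d": "y"},
    {"a1": 2, "a2": 1, "d": "n"},
    {"a1": 2, "a2": 2, "d": "y"},
    {"a1": 3, "a2": 1, "d": "n"},
]

def covers(conditions, record):
    return all(record[a] == v for a, v in conditions.items())

def grow_rule(positive, negative, attributes):
    """Start from an empty rule and add conditions until no negative is covered.
    Assumption: the pair with the highest purity is chosen at each step."""
    conditions = {}
    pos, neg = list(positive), list(negative)
    while neg:
        candidates = {(a, r[a]) for r in pos for a in attributes if a not in conditions}
        if not candidates:
            break  # contradicting records: the remaining negatives cannot be excluded

        def purity(pair):
            a, v = pair
            p = sum(r[a] == v for r in pos)
            n = sum(r[a] == v for r in neg)
            return (p / (p + n), p)

        a, v = max(sorted(candidates), key=purity)
        conditions[a] = v
        pos = [r for r in pos if r[a] == v]
        neg = [r for r in neg if r[a] == v]
    return conditions

def separate_and_conquer(records, attributes, target_class, label="d"):
    positive = [r for r in records if r[label] == target_class]
    negative = [r for r in records if r[label] != target_class]
    rules = []
    while positive:
        conditions = grow_rule(positive, negative, attributes)
        rules.append((conditions, target_class))
        # "Separate": drop the positives covered by the new rule, then recurse.
        positive = [r for r in positive if not covers(conditions, r)]
    return rules

RULES_Y = separate_and_conquer(RECORDS, ["a1", "a2"], "y")
```

On the toy data the loop produces two rules for class `y`, and together they cover every positive record and no negative record.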
2.1.3 Search Heuristics
Search heuristics are used to evaluate the found hypotheses. The GROW function in the […] attribute and the corresponding value in order to grow a rule. Examples of search heuristics include Entropy and Information Gain, used in ID3 [17] and C4.5 [18].
Entropy
Entropy is the weighted average of information content of each class and originates from the ID3 decision tree learning system [7]. Given a set S, the Entropy of the set S is defined as:
$$E(S) = -\sum_{j=1}^{N} \Pr(j) \log_2 \Pr(j)$$
where N is the number of different values of an attribute in S, and 𝑃𝑟(𝑗) is the proportion of the value j in the set S.
The definition of Entropy above is suitable for decision trees. To be suitable for a rule-based classification algorithm, the Entropy can be defined as:
$$E(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
where p is the number of positive instances covered by a given rule r, and n is the number of negative instances covered by r. This definition is the binary special case of the original definition.
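As a quick numerical check of the two definitions, the following Python sketch computes the general Entropy from a list of proportions and the rule-based variant from the counts p and n. The function names are illustrative, and the convention that a zero proportion contributes nothing to the sum is an assumption of the sketch.

```python
import math

def entropy(proportions):
    """E(S) = -sum_j Pr(j) * log2 Pr(j); zero proportions are skipped."""
    return -sum(q * math.log2(q) for q in proportions if q > 0)

def rule_entropy(p, n):
    """Binary case for a rule covering p positive and n negative instances."""
    total = p + n
    return entropy([p / total, n / total])
```

A rule covering 5 positive and 5 negative instances has Entropy 1, the maximum for two classes, while a rule covering only positive instances has Entropy 0.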
Information Gain
Information Gain measures the expected reduction in Entropy caused by partitioning the instances according to an attribute [13]. The definition of Information Gain is:
$$IG(S, a) = E(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|} E(S_v)$$
where 𝑎 is the attribute, and 𝑆𝑣 is the subset of 𝑆 for which attribute 𝑎 has value 𝑣.
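The definition can be sketched in Python as below. The record layout (a list of dictionaries with a class label stored under the key `d`) and the toy data are assumptions made for illustration; the Entropy is computed over the class labels of each subset.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, label="d"):
    """IG(S, a) = E(S) - sum_v |S_v|/|S| * E(S_v)."""
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r[label])
    remainder = sum(len(s) / len(records) * entropy(s) for s in subsets.values())
    return entropy([r[label] for r in records]) - remainder

# Hypothetical data: attribute "a" separates the classes, attribute "b" does not.
TOY = [
    {"a": 1, "b": 0, "d": "y"},
    {"a": 1, "b": 0, "d": "y"},
    {"a": 2, "b": 0, "d": "n"},
    {"a": 2, "b": 0, "d": "n"},
]
```

Here `information_gain(TOY, "a")` is 1 because partitioning on `a` removes all uncertainty, while `information_gain(TOY, "b")` is 0 because `b` is constant.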
We believe that a generated rule might overfit the training data: the rule is grown so precisely to achieve high accuracy that only a few data records are explained by this strict rule.
To avoid overfitting, pruning methods were introduced to shorten the rule. In general there are two categories of pruning methods: pre-pruning and post-pruning. Pre-pruning methods stop the growing of a rule by applying stopping criteria, such as Purity, Minimum Description Length, or significance. Post-pruning methods drop some of the conditions from a grown rule if the pruned rule performs better than the original rule on some criterion. Currently the proposed ROUSER adopts no pruning method; implementing a pruning method suitable for ROUSER is part of the future work.
Rules that have gone through a pruning stage usually perform well, and experiments show that whole rule sets are significantly improved in both size and performance through global optimization, which is a post-induction optimization method applied to the whole rule set. Currently ROUSER adopts no optimization method; investigating an optimization method suitable for ROUSER is part of the future work.
2.2 The Rough Set Theory
The rough set theory was first introduced by Zdzisław I. Pawlak in 1982 as a mathematical tool to characterize imprecise knowledge [15][16]. The main difference between a rough set and a classic set is the appearance of a boundary “region” (not just a boundary) in a rough set, where the uncertain elements exist. The fuzzy set theory [22] is another tool to characterize imprecise knowledge. The main difference between a fuzzy set and a rough set is that a fuzzy set needs a predefined function to decide the “membership degree” of each element. Practically speaking, such a membership function is defined under some assumptions and on a […] uncertain elements are located in the boundary region in a rough set.
2.2.1 Information System and Decision Table
An information system 𝑨 is a pair, denoted by 𝑨 = (𝑈, 𝐶), where 𝑈 is the universe and 𝐶 is the set of attributes. When we deal with classification or clustering issues, the elements of 𝑈 can be considered as instances. For each attribute 𝑎 ∈ 𝐶, the value set is 𝑉𝑎. For each […] problem. The value of decision 𝑑 in instance 𝑥 is denoted by 𝑑(𝑥). For example, in the decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷) in Table II below, 𝑈 = {𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5, 𝑥6, 𝑥7, 𝑥8}, 𝐶 = {𝑎1, 𝑎2}, and 𝐷 = {𝑑}.
Table II. Decision table A = (U, C ∪ D)
U a1 a2 d
The indiscernibility relation is mathematically an equivalence relation, but it carries a specific meaning here. When we say that two objects are indiscernible, we mean that the two objects have exactly the same value on every attribute and hence we cannot distinguish them.
However, we still cannot say that the two objects are the same, because our knowledge (the attributes) is limited. A formal definition of the indiscernibility relation is given below.
For all instances 𝑥, 𝑦 ∈ 𝑈, 𝑥 and 𝑦 are indiscernible if and only if 𝑎(𝑥) = 𝑎(𝑦) for every 𝑎 ∈ 𝐶. Each subset 𝐶′ ⊆ 𝐶 induces a partition of 𝑈, denoted by 𝑈/𝐶′, and 𝐶′(𝑥) ∈ 𝑈/𝐶′ denotes the block of the partition containing instance 𝑥, which means 𝑥 ∈ 𝐶′(𝑥).
For each 𝑦 ∈ 𝐶′(𝑥) and each 𝑎 ∈ 𝐶′, 𝑎(𝑦) = 𝑎(𝑥), which means that instances in the same block of the partition are indiscernible. 𝐶′ thus forms an indiscernibility relation 𝐼(𝐶′), which is defined as follows:
𝑥 𝐼(𝐶′) 𝑦 if and only if 𝑎(𝑥) = 𝑎(𝑦) for every 𝑎 ∈ 𝐶′.
For example, consider the decision table in Table II above. All partitions are given below:
𝑈/𝐷 = {{𝑥1, 𝑥2, 𝑥3, 𝑥4}, {𝑥5, 𝑥6, 𝑥7, 𝑥8}},
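Computing such partitions is mechanical, as the following Python sketch shows. The attribute values in `TABLE` are hypothetical (they are chosen only so that the partition 𝑈/𝐷 above is reproduced and 𝑥4 and 𝑥5 share the same condition values); the function names are likewise illustrative.

```python
from collections import defaultdict

# Hypothetical attribute values, consistent with U/D = {{x1..x4}, {x5..x8}}.
NAMES = ("a1", "a2", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, "y"), (2, 2, "y"), (2, 1, "y"), (3, 3, "y"),
         (3, 3, "n"), (4, 4, "n"), (4, 3, "n"), (3, 4, "n")],
        start=1,
    )
}

def indiscernible(table, x, y, attributes):
    """x I(C') y iff a(x) = a(y) for every a in C'."""
    return all(table[x][a] == table[y][a] for a in attributes)

def partition(table, attributes):
    """U/C': group instances that agree on every attribute in C'."""
    blocks = defaultdict(set)
    for name, row in table.items():
        blocks[tuple(row[a] for a in attributes)].add(name)
    return list(blocks.values())
```

With these values, `partition(TABLE, ["d"])` yields the two decision blocks, and `partition(TABLE, ["a1", "a2"])` places 𝑥4 and 𝑥5 in the same block.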
The main difference between a rough set and a classic set is the appearance of a boundary “region” (not just a boundary), as shown in Figure 3 (a), (b). Consider the decision table A = (U, C ∪ D) shown in Table II, where U = {x1, x2, x3, x4, x5, x6, x7, x8} is the universe, or the training data; C = {a1, a2} is the condition, or the attribute set of the training data; and D = {d} is the decision, or the set of class labels of the training data. The rough set of d = y for this table is visualized in Figure 3 (c). Since there is no difference between the condition of x4 and that of x5, they are in the boundary region.
Figure 3. (a) Classic set. (b) Rough set. (c) Rough set for example.
We give an example to help understand a rough set. The set Y corresponding to d = y is {𝑥1, 𝑥2, 𝑥3, 𝑥4}, as shown in Figure 4 (a), where the set is mapped to a 4x4 data space […] that 𝑥6, 𝑥7, 𝑥8 do not belong to Y. Hence we can characterize the set Y by two crisp sets, {𝑥1, 𝑥2, 𝑥3} and {𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5}, the lower-approximation and upper-approximation of Y, respectively, as shown in Figure 4 (b) and (c). This example gives an intuition for a rough set: a rough set is actually a combination of several traditional sets (crisp sets).
Figure 4. (a) The space of d=y. (b) The lower-approximation of d=y. (c) The upper-approximation of d=y.
Here we give a formal definition of a rough set. Consider a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷), where 𝐷 forms a partition 𝑈/𝐷 and an indiscernibility relation 𝐼(𝐷). Each subset 𝐶′ ⊆ 𝐶 forms a partition 𝑈/𝐶′ and an indiscernibility relation 𝐼(𝐶′). When dealing with a classification problem, 𝐼(𝐷) must be approximated by 𝐼(𝐶′). For each block 𝑋 ∈ 𝑈/𝐷 of the partition, the 𝐶′-lower approximation of 𝑋 is as follows:
$$\underline{C'}(X) = \{x \in U : C'(x) \subseteq X\}$$
The 𝐶′-upper approximation of 𝑋 is as follows:
$$\overline{C'}(X) = \{x \in U : C'(x) \cap X \neq \emptyset\}$$
If $\underline{C'}(X) = \overline{C'}(X)$, we say that 𝑋 is 𝐶′-definable. The rough set theory defines the set 𝑋 by both $\underline{C'}(X)$ and $\overline{C'}(X)$. If 𝑋 is 𝐶′-definable, we say 𝑋 is crisp; otherwise 𝑋 is rough.
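The two approximations translate directly into set comprehensions. In the Python sketch below, the attribute values of `TABLE` are hypothetical, chosen so that the approximations of the set Y (d = y) come out as in the example above; `block`, `lower`, and `upper` are illustrative names.

```python
# Hypothetical attribute values; only the resulting approximations follow the text.
NAMES = ("a1", "a2", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, "y"), (2, 2, "y"), (2, 1, "y"), (3, 3, "y"),
         (3, 3, "n"), (4, 4, "n"), (4, 3, "n"), (3, 4, "n")],
        start=1,
    )
}

def block(table, x, attributes):
    """C'(x): the instances indiscernible from x on the given attributes."""
    return {y for y in table if all(table[y][a] == table[x][a] for a in attributes)}

def lower(table, attributes, X):
    """C'-lower approximation: instances whose block lies entirely inside X."""
    return {x for x in table if block(table, x, attributes) <= X}

def upper(table, attributes, X):
    """C'-upper approximation: instances whose block intersects X."""
    return {x for x in table if block(table, x, attributes) & X}

Y = {x for x, row in TABLE.items() if row["d"] == "y"}
```

Because the lower and upper approximations of Y differ (by 𝑥4 and 𝑥5), Y is rough rather than crisp with respect to {a1, a2}.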
The positive region of the partition 𝑈/𝐷 with respect to 𝐶′ is expressed as 𝑃𝑂𝑆𝐶′(𝐷), which is a union of every block’s lower-approximation of the partition 𝑈/𝐷. The definition is given below:
$$POS_{C'}(D) = \bigcup_{X \in U/D} \underline{C'}(X)$$
There are no contradicting data records in 𝑃𝑂𝑆𝐶′(𝐷). An example of a positive region is given in Figure 5.
Figure 5. (a) The data space. (b) The positive region of U/D.
The dependency degree of 𝐷 with respect to 𝐶′ is defined below:
$$\gamma_{C'}(D) = \frac{card(POS_{C'}(D))}{card(U)}$$
If $\gamma_{C'}(D) = 1$, we say that 𝑨 is consistent on 𝐶′, which means that there are no contradicting data records.
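A minimal Python sketch of the positive region and the dependency degree follows. The attribute values in `TABLE` are hypothetical (two records, x4 and x5, are made to contradict each other), and the function names are illustrative. An instance belongs to the positive region exactly when all instances in its block share one decision value.

```python
# Hypothetical decision table in which x4 and x5 contradict each other.
NAMES = ("a1", "a2", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, "y"), (2, 2, "y"), (2, 1, "y"), (3, 3, "y"),
         (3, 3, "n"), (4, 4, "n"), (4, 3, "n"), (3, 4, "n")],
        start=1,
    )
}

def block(table, x, attributes):
    return {y for y in table if all(table[y][a] == table[x][a] for a in attributes)}

def positive_region(table, attributes, label="d"):
    """Union of the lower approximations of all decision blocks."""
    return {
        x for x in table
        if len({table[y][label] for y in block(table, x, attributes)}) == 1
    }

def dependency_degree(table, attributes, label="d"):
    """gamma_{C'}(D) = card(POS_{C'}(D)) / card(U)."""
    return len(positive_region(table, attributes, label)) / len(table)
```

For this table the dependency degree on {a1, a2} is 6/8 = 0.75, so the table is not consistent on {a1, a2}.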
2.2.4 Reduct and Core
Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷), an attribute 𝑎 ∈ 𝐶 is said to be dispensable if $\gamma_{C-\{a\}}(D) = \gamma_C(D)$. A subset 𝐶′ ⊆ 𝐶 is a reduct of 𝐶 with respect to 𝐷 if $\gamma_{C'}(D) = \gamma_C(D)$ and no attribute 𝑎 ∈ 𝐶′ is dispensable in 𝐶′. There can be more than one reduct of 𝐶, and the set of reducts is denoted by $Red_C(D)$. The core of 𝐶 with respect to 𝐷 is defined as below:
$$Core_C(D) = \bigcap_{R \in Red_C(D)} R$$
Consider the new example in Figure 6, where a new attribute 𝑎3 is given, and two partitions are shown as follows:
𝑈/{𝑎1, 𝑎3} = {{𝑥1}, {𝑥2, 𝑥3}, {𝑥4}, {𝑥5}, {𝑥6, 𝑥7}, {𝑥8}}
𝑈/{𝑎2, 𝑎3} = {{𝑥2}, { 𝑥1, 𝑥3}, {𝑥4}, {𝑥5}, {𝑥6, 𝑥8}, {𝑥7}}
It is easy to understand that both {𝑎1, 𝑎3} and {𝑎2, 𝑎3} are reducts of the new decision table, and {𝑎3} = {𝑎1, 𝑎3} ∩ {𝑎2, 𝑎3} is the core. Graphs for visualization are given in Figure 7 and Figure 8.
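The reducts and the core of a small table can be found by brute force, as the Python sketch below illustrates. The attribute values in `TABLE` are hypothetical: they are chosen so that the two partitions listed above are reproduced, but they are not the values of Figure 6. The exhaustive search over attribute subsets is an illustrative choice that is only practical for a handful of attributes.

```python
from itertools import combinations

# Hypothetical values reproducing U/{a1,a3} and U/{a2,a3} as listed above.
NAMES = ("a1", "a2", "a3", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, 1, "y"), (2, 2, 1, "y"), (2, 1, 1, "y"), (3, 3, 2, "y"),
         (3, 3, 1, "n"), (4, 4, 3, "n"), (4, 3, 3, "n"), (3, 4, 3, "n")],
        start=1,
    )
}

def gamma(table, attributes, label="d"):
    """Dependency degree: fraction of instances whose block has one decision."""
    pos = 0
    for x in table:
        blk = [y for y in table if all(table[y][a] == table[x][a] for a in attributes)]
        pos += len({table[y][label] for y in blk}) == 1
    return pos / len(table)

def reducts(table, attributes):
    """Subsets that keep gamma and in which no attribute is dispensable."""
    full = gamma(table, attributes)
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            keeps = gamma(table, subset) == full
            minimal = all(
                gamma(table, [b for b in subset if b != a]) < full for a in subset
            )
            if keeps and minimal:
                found.append(set(subset))
    return found

def core(reduct_list):
    return set.intersection(*reduct_list)

REDUCTS = reducts(TABLE, ["a1", "a2", "a3"])
```

With these values the search returns the two reducts {a1, a3} and {a2, a3}, whose intersection {a3} is the core.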
Figure 6. (a) The new decision table with a3. (b) Data space of a3=1. (c) Data space of a3=2. (d) Data space of a3=3.
Figure 7. a1, a3 as the reduct.
Figure 8. a2, a3 as the reduct.
2.2.5 Discernibility Matrix
Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷), the discernibility matrix $M_D(C)$ of 𝑨 is an 𝑛 × 𝑛 matrix whose entries are defined as follows:
$$c_{ij} = \{a \in C : a(x_i) \neq a(x_j) \wedge d(x_i) \neq d(x_j)\} \quad \text{for } i, j = 1, 2, \ldots, n,$$
where 𝑛 is the number of elements in 𝑈 and $x_i, x_j \in U$.
The discernibility function $f_D(C)$ is defined as follows:
$$f_D(C) = \bigwedge \Big\{ \bigvee_{a \in c_{ij}} a : 1 \le i < j \le n,\ c_{ij} \neq \emptyset \Big\}$$
The discernibility function $f_D(C)$ is a Boolean function given in conjunctive normal form. Its prime implicants, which are the constituents of its minimal disjunctive normal form, are exactly the 𝐷-reducts of 𝐶.
A discernibility matrix of the decision table in Figure 6 (a) is given in Figure 9 below.
Figure 9. An example of a discernibility matrix.
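The entries of the matrix can be computed with a short Python sketch. The attribute values in `TABLE` are hypothetical and are not those of Figure 6 (a); only the upper triangle (i < j) is stored, keyed by instance pairs. The last helper uses the standard property that the core consists of the attributes that appear as singleton entries.

```python
# Hypothetical decision table with three condition attributes.
NAMES = ("a1", "a2", "a3", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, 1, "y"), (2, 2, 1, "y"), (2, 1, 1, "y"), (3, 3, 2, "y"),
         (3, 3, 1, "n"), (4, 4, 3, "n"), (4, 3, 3, "n"), (3, 4, 3, "n")],
        start=1,
    )
}

def discernibility_matrix(table, attributes, label="d"):
    """c_ij = attributes on which x_i and x_j differ, if their decisions differ."""
    names = list(table)
    matrix = {}
    for i, xi in enumerate(names):
        for xj in names[i + 1:]:
            matrix[(xi, xj)] = {
                a for a in attributes
                if table[xi][a] != table[xj][a] and table[xi][label] != table[xj][label]
            }
    return matrix

def core_from_matrix(matrix):
    """Attributes occurring as singleton entries form the core."""
    return {a for entry in matrix.values() if len(entry) == 1 for a in entry}

MATRIX = discernibility_matrix(TABLE, ["a1", "a2", "a3"])
```

For example, the entry for (x4, x5) is {a3}: the two instances belong to different classes and only a3 tells them apart.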
CHAPTER 3 DESIGN OF THE PROPOSED METHOD
3.1 Potential Boundary Region and Discernibility Power
One of the contributions of this thesis is a new search heuristic named discernibility power, which is based on the rough set theory. Before introducing discernibility power, we redefine the rough set for disambiguation and convenience.
Redefining a Rough Set
Guided by the original definition of rough set theory, we redefine a rough set. Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷), for each block X of partition 𝑈/𝐷, the rough set of X is redefined below:
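The positive region of X:
$$POS_{\mathbf{A}}(X) = \{x \mid C(x) \subseteq X\}$$
The negative region of X:
$$NEG_{\mathbf{A}}(X) = \{x \mid C(x) \cap X = \emptyset\}$$
The boundary region of X:
$$BOUND_{\mathbf{A}}(X) = X - POS_{\mathbf{A}}(X) - NEG_{\mathbf{A}}(X)$$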
Notice that the positive region here is the same as the lower-approximation of a rough set, but it differs from the positive region mentioned in 2.2.3, which is the positive region of 𝐷 with respect to 𝐶. As sketched in Figure 10, a rough set is redefined by three disjoint traditional sets: the positive region, the negative region, and the boundary region. The redefined rough set is also a partition of 𝑼. The purpose of the redefinition is to connect the rough set theory with the separate-and-conquer algorithm, which iteratively grows a rule by rejecting as many negative data records as possible and accepting as many positive data records as possible. Based on the redefinition of a rough set, we introduce two concepts: potential boundary region (PotBound) and discernibility power (DiscPow).
Figure 10. The redefined rough set.
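The three regions can be computed as in the Python sketch below. Two points are assumptions of the sketch: the attribute values in `TABLE` are hypothetical, and the boundary is taken as the part of 𝑈 that is in neither the positive nor the negative region, which is the reading under which the three regions partition 𝑈 and both x4 and x5 fall into the boundary, as described in the text.

```python
# Hypothetical decision table; x4 and x5 share condition values but not the class.
NAMES = ("a1", "a2", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, "y"), (2, 2, "y"), (2, 1, "y"), (3, 3, "y"),
         (3, 3, "n"), (4, 4, "n"), (4, 3, "n"), (3, 4, "n")],
        start=1,
    )
}

def block(table, x, attributes):
    return {y for y in table if all(table[y][a] == table[x][a] for a in attributes)}

def regions(table, attributes, X):
    """Return (positive, negative, boundary) regions of X."""
    pos = {x for x in table if block(table, x, attributes) <= X}
    neg = {x for x in table if not (block(table, x, attributes) & X)}
    bound = set(table) - pos - neg  # assumption: taken relative to U
    return pos, neg, bound

Y = {x for x, row in TABLE.items() if row["d"] == "y"}
POS, NEG, BOUND = regions(TABLE, ["a1", "a2"], Y)
```

In separate-and-conquer terms, the positive region holds records a rule may safely accept, the negative region holds records it should reject, and the boundary holds records that the current attributes cannot decide.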
Potential Boundary Region
Consider the rough set of 𝑋 defined above. The potential boundary region of attribute ai is the set of elements that will become indiscernible without ai. The definition of PotBound of X with respect to attribute ai is given below:
𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) = 𝐵𝑜𝑢𝑛𝑑𝑨′(𝑋) − 𝐵𝑜𝑢𝑛𝑑𝑨(𝑋), where 𝑨 = (𝑈, 𝐶 ∪ 𝐷) and 𝑨′ = (𝑈, 𝐶 ∪ 𝐷 − {𝑎𝑖}).
Here is an example of PotBound. Suppose the original decision table is 𝑨 = (𝑈, 𝐶 ∪ 𝐷). If the attribute a2 is removed, the new decision table becomes 𝑨′ = (𝑈, 𝐶 ∪ 𝐷 − {𝑎2}). Then x6 and x7 become indiscernible, and the boundary region of Y expands. The expanded part of the boundary region, {x6, x7}, is the PotBound of a2.
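PotBound, and DiscPow as its cardinality, can be sketched in Python as follows. The attribute values in `TABLE` are hypothetical and differ from the example above, and the boundary is computed relative to 𝑈 (instances whose block contains both members and non-members of 𝑋), which is an assumption of the sketch.

```python
# Hypothetical decision table with three condition attributes.
NAMES = ("a1", "a2", "a3", "d")
TABLE = {
    f"x{i}": dict(zip(NAMES, row))
    for i, row in enumerate(
        [(1, 1, 1, "y"), (2, 2, 1, "y"), (2, 1, 1, "y"), (3, 3, 2, "y"),
         (3, 3, 1, "n"), (4, 4, 3, "n"), (4, 3, 3, "n"), (3, 4, 3, "n")],
        start=1,
    )
}

def block(table, x, attributes):
    return {y for y in table if all(table[y][a] == table[x][a] for a in attributes)}

def boundary(table, attributes, X):
    """Instances whose block contains both members and non-members of X."""
    out = set()
    for x in table:
        blk = block(table, x, attributes)
        if blk & X and not blk <= X:
            out.add(x)
    return out

def pot_bound(table, attributes, X, a):
    """PotBound_A(X, a) = Bound_A'(X) - Bound_A(X), where A' drops attribute a."""
    reduced = [b for b in attributes if b != a]
    return boundary(table, reduced, X) - boundary(table, attributes, X)

def disc_pow(table, attributes, X, a):
    """DiscPow is the cardinality of PotBound."""
    return len(pot_bound(table, attributes, X, a))

C = ["a1", "a2", "a3"]
Y = {x for x, row in TABLE.items() if row["d"] == "y"}
```

In this table, dropping a3 pushes x4 and x5 into the boundary region, so a3 has DiscPow 2, whereas dropping a1 or a2 changes nothing and both have DiscPow 0.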
DiscPow has the monotonicity property, which means that removing elements from a rough set, or a partition of 𝑈, will never increase the DiscPow. Below is the proof.
Figure 12. The rough set of A = (U, C ∪ D).
Figure 13. The rough set of A′ = (U, C ∪ D − {a2}).
Given a decision table 𝑨 = (𝑈, 𝐶 ∪ 𝐷) as shown in Figure 12, the DiscPow of 𝑎𝑖 ∈ 𝐶 with respect to the rough set of 𝑋 ∈ 𝑈/𝐷 is as below:
𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) = 𝐶𝑎𝑟𝑑(𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖)), and the 𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) is given below:
𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) = 𝐵𝑜𝑢𝑛𝑑𝑨′(𝑋) − 𝐵𝑜𝑢𝑛𝑑𝑨(𝑋), where 𝑨′ = (𝑈, 𝐶 ∪ 𝐷 − {𝑎𝑖}), as shown in Figure 13.
Below are definitions for 𝑋 with respect to 𝑨′:
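The positive region of 𝑋:
$$POS_{\mathbf{A}'}(X) = \{x \mid C(x) \subseteq X\}$$
The negative region of 𝑋:
$$NEG_{\mathbf{A}'}(X) = \{x \mid C(x) \cap X = \emptyset\}$$
The boundary region of 𝑋:
$$BOUND_{\mathbf{A}'}(X) = X - POS_{\mathbf{A}'}(X) - NEG_{\mathbf{A}'}(X)$$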
By the 𝑃𝑜𝑡𝐵𝑜𝑢𝑛𝑑𝑨(𝑋, 𝑎𝑖) given above, the boundary region of 𝑋 has another definition:
$$BOUND_{\mathbf{A}'}(X) = PotBound_{\mathbf{A}}(X, a_i) + BOUND_{\mathbf{A}}(X)$$
[…] Removing more than one element from 𝑈 can be considered as iteratively removing one element at a time, and inserting more than one element into 𝑈 can be considered as iteratively inserting one element at a time. By all the above, removing elements from 𝑈 causes 𝐷𝑖𝑠𝑐𝑃𝑜𝑤𝑨(𝑋, 𝑎𝑖) to either hold or drop, and inserting elements into 𝑈 causes it to either hold or rise; this is the monotonicity property of DiscPow.
Discernibility power is one of the search heuristics of the proposed rule-based algorithm, ROUSER, which is introduced in the next subsection.
3.2 ROUSER
ROUSER uses the separate-and-conquer algorithm as its framework. Our contribution here is to use the proposed DiscPow as the search heuristic of the GROW function in the separate-and-conquer algorithm. The GROW function of ROUSER is shown in Figure 14. In each iteration ROUSER removes attributes whose DiscPow values are zero, and it updates the DiscPow of every attribute until the DiscPow values of all remaining attributes are nonzero. If multiple attributes qualify for removal, the current version of ROUSER simply removes the one that is most independent of the class passed as a parameter to the separate-and-conquer algorithm in Figure 2. We use the Chi-Squared value to measure the degree of independence. The Chi-Squared value was first used in feature selection in [9]. Feature selection with the Chi-Square test together with the rough set theory was proposed in [19].
GROW(Rule, Covered):
    do:
        for every attribute ai:
            DiscPowi = DISCPOW(ai, Covered)
            ChiSquaredi = CHISQUARED(ai, Covered)
        Among attributes with DiscPowi = 0, ignore ai with minimum ChiSquaredi
    while exist ai with DiscPowi = 0
    (a, v) = CHOOSE_ATTR&VALUE()
    grow the rule with (a, v) as an antecedent
Figure 14. The GROW function.
Once an attribute is removed in an iteration of the GROW function, we no longer need to compute its DiscPow value because of the monotonicity property of DiscPow. When elements are removed from the rough set covered by the current rule, the DiscPow value of an attribute stays the same or becomes smaller. Once the DiscPow value of an attribute is zero, it will not increase again, and hence the attribute can be removed. The DISCPOW function is shown in Figure 15.
DISCPOW(ai, Covered):
    […]
    return the cardinality of PotBound(ai)
Figure 15. The DISCPOW function.
The CHOOSE_ATTR&VALUE function in the GROW function searches for an attribute-value pair, i.e. (ai, vi), that will be used to grow a rule. We use the idea of purity [9][20] as the search heuristic. Our algorithm provides three types of purity as options: PurityOverAll, PurityPotBound, and PurityHybrid. The first is the same as the original definition of purity, and the others are proposed by us. The definitions of these purities are given below:
PurityOverAll = |p_all| / (|p_all| + |n_all|),
where p_all is the set of positive records covered by the candidate attribute and value, and n_all is the set of negative records covered by the candidate attribute and value;
PurityPotBound = |p_pb| / (|p_pb| + |n_pb|),
where p_pb is the set of positive records that are in the potential boundary region of the candidate attribute and are covered by the candidate attribute and value, and n_pb is the set of negative records that are in the potential boundary region of the candidate attribute and are covered by the candidate attribute and value;
PurityHybrid = |p_pb| / (|p_pb| + |n_all|),
where p_pb is the set of positive records that are in the potential boundary region of the candidate attribute and are covered by the candidate attribute and value, and n_all is the set of negative records covered by the candidate attribute and value.
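The three purities differ only in which record sets enter the ratio, which the following Python sketch makes explicit. Records are represented by identifiers in sets; the function name, the example sets, and the convention of returning 0 for an empty denominator are assumptions of the sketch.

```python
def purities(covered, positives, negatives, pot_bound):
    """Compute the three purity values for one candidate attribute-value pair.

    covered   -- records matched by the candidate attribute and value
    pot_bound -- potential boundary region of the candidate attribute
    """
    p_all = covered & positives
    n_all = covered & negatives
    p_pb = p_all & pot_bound
    n_pb = n_all & pot_bound

    def ratio(num, den):
        return num / den if den else 0.0  # assumption: empty denominator gives 0

    return {
        "PurityOverAll": ratio(len(p_all), len(p_all) + len(n_all)),
        "PurityPotBound": ratio(len(p_pb), len(p_pb) + len(n_pb)),
        "PurityHybrid": ratio(len(p_pb), len(p_pb) + len(n_all)),
    }

# Hypothetical record identifiers.
EXAMPLE = purities(
    covered={1, 2, 3, 4, 5},
    positives={1, 2, 3, 6},
    negatives={4, 5, 7},
    pot_bound={2, 3, 4},
)
```

In the example, 3 positive and 2 negative records are covered, of which 2 positive and 1 negative lie in the potential boundary region, giving 3/5, 2/3, and 2/4 respectively.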
In addition to purity, we provide weighted Information Gain as an option for the search heuristic, which is defined as:
WInfoGain = (|p2_all|/|p1_all|) * ( log(|p2_all|/(|p2_all|+|n2_all|)) - log(|p1_all|/(|p1_all|+|n1_all|)) )
where p1_all and n1_all are the positive and negative records, respectively, of the original set of data records, and p2_all and n2_all are the positive and negative records, respectively, of the chosen subset of data records. The term “log(|p1_all|/(|p1_all|+|n1_all|))” is the information content of the original set of data records, while “log(|p2_all|/(|p2_all|+|n2_all|))” is the information content of the chosen subset. “(|p2_all|/|p1_all|)” is the weight of the Information Gain.
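A small Python sketch of the weighted Information Gain is given below. It takes the four cardinalities as integers. The base of the logarithm is not stated in the definition; base 2 is assumed here, as is the convention of returning 0 when the chosen subset contains no positive record.

```python
import math

def w_info_gain(p1, n1, p2, n2):
    """Weighted Information Gain from record counts.

    p1, n1 -- positive / negative counts in the original set
    p2, n2 -- positive / negative counts in the chosen subset
    """
    if p2 == 0:
        return 0.0  # assumption: no positives covered means no gain
    weight = p2 / p1
    return weight * (math.log2(p2 / (p2 + n2)) - math.log2(p1 / (p1 + n1)))
```

For instance, starting from 4 positive and 4 negative records, a candidate that keeps 2 positives and no negatives scores 0.5, while a candidate that keeps all 8 records scores 0.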
We also provide two selection methods. The first is called “Max”, which finds the maximum heuristic value (e.g. purity) over all possible attribute-value pairs. The second is called “Frequent Max”, which finds the most frequent value of each attribute and then finds the maximum heuristic value (e.g. purity) among them.
In total, our CHOOSE_ATTR&VALUE function can choose an attribute-value pair in 7 different ways:
1. PurityOverAll, Max
2. PurityPotBound, Max
3. PurityHybrid, Max
4. PurityOverAll, Frequent Max
5. PurityPotBound, Frequent Max
6. PurityHybrid, Frequent Max
7. WInfoGain, Max
ROUSER generates a set of rules for each class. As soon as a rule set is generated, it is concatenated to the bottom of the rule list. The BUILD_CLASSIFIER algorithm of ROUSER is shown in Figure 16. The class list is sorted by ascending frequency order as RIPPER does. For an unseen case, ROUSER searches down the rule list and uses the first rule that covers the case to classify it.
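The construction of the rule list and its use on unseen cases can be sketched in Python as follows. This is only an outline of the behaviour described above, not the BUILD_CLASSIFIER algorithm of Figure 16: the function names, the `learn_rules` callback (standing in for the per-class rule generation), the tie-breaking between equally frequent classes, and the toy learner are all assumptions.

```python
from collections import Counter

def build_classifier(records, learn_rules, label="d"):
    """Generate rules class by class, rarest class first, and concatenate them."""
    freq = Counter(r[label] for r in records)
    rule_list = []
    for cls in sorted(freq, key=lambda c: (freq[c], str(c))):
        rule_list.extend(learn_rules(records, cls))
    return rule_list

def classify(rule_list, record, default=None):
    """Search down the rule list and use the first rule that covers the case."""
    for conditions, cls in rule_list:
        if all(record.get(a) == v for a, v in conditions.items()):
            return cls
    return default

def toy_learner(records, cls, label="d"):
    """Hypothetical stand-in: one rule per training record, keyed on a1 only."""
    return [({"a1": r["a1"]}, cls) for r in records if r[label] == cls]

TRAIN = [{"a1": 1, "d": "y"}, {"a1": 2, "d": "n"}, {"a1": 3, "d": "n"}]
RULE_LIST = build_classifier(TRAIN, toy_learner)
```

Because class `y` is the less frequent one in the toy data, its rules are placed at the top of the list and are therefore tried first.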
ROUSER has to decide whether two records are indiscernible in order to determine the boundary and potential boundary regions. Consider the example in Figure 17, where there are two records j and k. To know whether record j and record k are indiscernible, we have to check the value of every attribute. If every attribute has the same value in records j and k, we say that the two records are indiscernible.
Figure 17. The example for checking if two records are indiscernible.
It is a simple task to decide if two records are indiscernible or not. However, missing values make the task complicated. We define four types of indiscernibility between two values, as shown in Table III, Table IV, Table V, and Table VI. These tables show how we treat a missing value for an attribute when we try to check if two records are indiscernible. From type 1 to type 4, the determination of indiscernibility becomes stricter. Currently, ROUSER