Chia-Chi Liao
Department of Computer Science National Chengchi University
Taipei, Taiwan (R.O.C.) [email protected]
Kuo-Wei Hsu
Department of Computer Science National Chengchi University
Taipei, Taiwan (R.O.C.) [email protected]
Abstract—In this paper, we propose a rule-based classification algorithm named ROUSER (ROUgh SEt Rule). Researchers have proposed various classification algorithms, and practitioners have applied them to various application domains; however, most of these algorithms are designed with a focus on classification performance rather than on the interpretability or understandability of the models they build.
ROUSER is specifically designed to extract human understandable decision rules from nominal data. What distinguishes ROUSER from most, if not all, other rule-based classification algorithms is that it utilizes a rough set approach to decide an attribute-value pair for the antecedents of a rule.
Moreover, the rule generation method of ROUSER is based on the separate-and-conquer strategy, and hence it is more efficient than the indiscernibility matrix method that is widely adopted in the classification algorithms based on the rough set theory. On about half of the data sets considered in experiments, ROUSER can achieve better classification performance than do classification algorithms that are able to generate decision rules or trees.
Keywords-machine learning; classification; decision rules; rule induction; rough set; separate-and-conquer
I. INTRODUCTION
In machine learning, a classification task is to classify an unknown data record into a pre-specified category based on the values of the attributes of the data record. Before the machine is capable of performing the classification task, it needs to learn from training data in which each data record has been associated with a category. Each attribute of a given data set corresponds to a domain of continuous values, i.e., real numbers, or a domain of discrete values, i.e., nominal data.
The proposed classification algorithm is specialized to nominal data. Nominal data are common in banking. Most of a customer’s personal information, such as gender, marital status, hobbies, and hometown, is nominal data. Banks would like to have rules that utilize a customer’s personal information in the calculation of his or her credit score. In addition, biologists are familiar with nominal data. Gene data are all nominal, and biologists want to study the relationships between gene combinations and a certain disease. Furthermore, although data from sensors are usually real numbers, engineers often need to discretize them into nominal data for further processing.
To monitor a machine and check if it is stable or not, for example, engineers may want to use data of real numbers from sensors in the machine to train a classification model, or a classifier, in which the underlying classification algorithm is based on complex mathematical methods. Examples of such classification algorithms include Support Vector Machines (SVMs) [4] or Artificial Neural Networks (ANNs) [2].
Engineers may obtain high accuracy from the trained classification models but learn nothing from them. Engineers need to know the possible causes of a fault or a problem in order to perform fault diagnosis and resolve the problem, but they will have difficulty identifying the possible causes from complex mathematical expressions given by SVMs or ANNs.
The goal of the classification algorithm in this paper is to extract human understandable decision rules from nominal data.
A decision rule is a function mapping a data space (a space of data records) to a class space (a space of categories or class labels). A human understandable decision rule is helpful for domain experts, such as engineers in the above example, to learn the causes and effects from data, and it is also important for scientists who intend to acquire knowledge from data.
Researchers have proposed several classification algorithms with the ability of rule generation. Rule-based classification algorithms like RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [3] can generate rules directly, while tree-based classification algorithms like ID3 [13] and C4.5 [14]
can also generate rules after transformation. C4.5 is one of the most popular classification algorithms [16], while RIPPER represents the state-of-the-art rule-based classification algorithms [8][9]. What makes the classification algorithm proposed in this paper different from the rule-based and tree-based classification algorithms is that it decides an attribute-value pair for the antecedents of a rule according to the rough set theory.
In most classification algorithms that are based on the rough set theory, the indiscernibility matrix [10] is used to generate all possible rules in nominal data. However, the computational cost of the indiscernibility matrix is high. There exist speed-up methods, but most of them are still based on the indiscernibility matrix [1][5]. For rule generation, the classification algorithm proposed in this paper adopts the separate-and-conquer strategy [6] rather than the indiscernibility matrix.
978-1-4673-0890-8/12/$31.00 ©2012 IEEE 1 COMNETSAT 2012
Class = CHOOSE(ClassSet)
Covered = COVER(Covered, Rule)
RuleSet = RuleSet ∪ {Rule}
TrainData = TrainData \ Covered
return RuleSet
The rest of this paper is organized as follows. Section 2 gives the preliminaries, and Section 3 introduces the proposed classification algorithm. The experimental results are presented in Section 4. Section 5 concludes the paper with potential directions for future work.
II. PRELIMINARY
A. Rule-based classification algorithm
Rule learning is to learn rules from the given training data, and a rule-based classification algorithm uses the learned rules to classify unseen data records. For classification, a decision rule is a logic statement with the following form:
condition1 ∧ condition2 ∧…→class,
where a condition is usually an attribute-value pair indicating a certain value of a certain attribute that is required to trigger the condition.
If a training data record matches all conditions of the rule, we say that the rule covers the data record; if the rule covers a data record and classifies it into the right class, we say that the rule explains the data record.
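The notions of covering and explaining can be made concrete with a small sketch. The `Rule` class, its field names, and the weather-style attributes below are illustrative assumptions, not part of ROUSER itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A decision rule: condition1 AND condition2 AND ... -> label."""
    conditions: tuple  # attribute-value pairs, e.g. (("outlook", "sunny"),)
    label: str         # class assigned by the rule

    def covers(self, record: dict) -> bool:
        # The rule covers a record if every condition matches it.
        return all(record.get(attr) == value for attr, value in self.conditions)

    def explains(self, record: dict, true_label: str) -> bool:
        # The rule explains a record if it covers it and predicts its class.
        return self.covers(record) and self.label == true_label


# Hypothetical rule and record, for illustration only.
rule = Rule(conditions=(("outlook", "sunny"), ("windy", "no")), label="play")
record = {"outlook": "sunny", "windy": "no", "humidity": "high"}
```

Here the rule covers the record, and it explains the record only when the record's true class is "play".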
RIPPER [3] is a popular rule-based classification algorithm. It has two stages: the generation stage and the optimization stage. The classification algorithm proposed in this paper competes with it in the generation stage.
B. The Rough Set Theory
The rough set theory was first introduced by Zdzisław I. Pawlak in 1982 as a mathematical tool to characterize imprecise knowledge [11][12]. As shown in Figure 1 (a) and (b), the main difference between a rough set and a classic set is that a rough set has a boundary “region” (not just a boundary), where the uncertain elements exist. The fuzzy set theory [17] is another tool to characterize imprecise knowledge. The main difference between a fuzzy set and a rough set is that a fuzzy set needs a membership function to describe the degree of an element’s participation in it. Practically speaking, such a membership function is defined under some assumptions and on a case-by-case basis. In contrast, a rough set needs no membership function, since the uncertain elements are located in its boundary region.
Consider the decision table A=(U,C∪D) shown in Figure 1 (c), where U={x1,x2,x3,x4,x5,x6,x7,x8} is the universe (the training data), C={a1,a2} is the condition set (the attribute set of the training data), and D={d} is the decision (the set of class labels of the training data). A rough set of d=y is shown in Figure 1 (d). Since there is no difference between the condition of x4 and that of x5, they are in the boundary region.
For data records x1 and x2, if the values of all their attributes are the same, we say that x1 and x2 are indiscernible. The set of all data records including x that are indiscernible by the attribute set C is denoted by C(x). Below are definitions for Y corresponding to the set of d=y:
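The positive region of Y:
\[ \mathrm{POS}_A(Y) = \{x \mid C(x) \subseteq Y\}. \]
The negative region of Y:
\[ \mathrm{NEG}_A(Y) = \{x \mid C(x) \cap Y = \emptyset\}. \]
The boundary region of Y, i.e., the elements of the universe U that belong to neither the positive nor the negative region:
\[ \mathrm{BOUND}_A(Y) = U - \mathrm{POS}_A(Y) - \mathrm{NEG}_A(Y). \]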
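As a minimal sketch of these definitions, the three regions can be computed by grouping records into indiscernibility classes. The decision table below is hypothetical (the values of Figure 1 (c) are not reproduced here); it is only chosen so that x4 and x5 share the same condition values. The boundary is taken as the records that fall in neither the positive nor the negative region, which places both x4 and x5 in it.

```python
from collections import defaultdict

# Hypothetical decision table: condition attributes a1, a2 and decision d.
TABLE = {
    "x1": {"a1": "A", "a2": "P", "d": "y"},
    "x2": {"a1": "A", "a2": "P", "d": "y"},
    "x3": {"a1": "B", "a2": "Q", "d": "y"},
    "x4": {"a1": "C", "a2": "P", "d": "y"},
    "x5": {"a1": "C", "a2": "P", "d": "n"},
    "x6": {"a1": "D", "a2": "P", "d": "y"},
    "x7": {"a1": "D", "a2": "Q", "d": "n"},
    "x8": {"a1": "E", "a2": "Q", "d": "n"},
}


def indiscernibility_classes(table, attrs):
    """Group record names that share the same values on attrs, i.e. C(x)."""
    groups = defaultdict(set)
    for name, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(name)
    return list(groups.values())


def regions(table, attrs, decision, target):
    """Return (positive, negative, boundary) regions of Y = {x | d(x) = target}."""
    Y = {name for name, row in table.items() if row[decision] == target}
    pos, neg, bound = set(), set(), set()
    for block in indiscernibility_classes(table, attrs):
        if block <= Y:            # C(x) is a subset of Y
            pos |= block
        elif not (block & Y):     # C(x) and Y are disjoint
            neg |= block
        else:                     # C(x) overlaps Y only partly
            bound |= block
    return pos, neg, bound


pos, neg, bound = regions(TABLE, ["a1", "a2"], "d", "y")
```

On this table the boundary region is {x4, x5}, because the two records are indiscernible but carry different decisions.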
Figure 1. (a) Classic set. (b) Rough set. (c) Decision table A=(U,C∪D). (d) Rough set of d=y.
III. THE PROPOSED METHOD
A. Separate-and-Conquer
The separate-and-conquer strategy first builds a rule that explains a part of the training data, separates the explained records from the rest, and conquers the rest recursively until no data remain. It ensures that every data record is covered by at least one rule. Figure 2 gives the separate-and-conquer algorithm, the core of the classification algorithm proposed in this paper. Before the algorithm begins, one of the classes is chosen. POSITIVE selects the data records that should be classified into the chosen class, and NEGATIVE selects the others. Every Rule is empty in the beginning and continues to grow until it covers no negative data.
Figure 2. The SEPERATE&CONQUER algorithm.
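The overall loop can be sketched as follows. This is a simplified illustration under stated assumptions: the `grow` helper is a greedy purity-based stand-in, not the rough-set-based GROW function of ROUSER, and the function names, the `class` key, and the toy data are hypothetical. The sketch stops once no positive records remain for the chosen class.

```python
def covers(rule, record):
    """A rule is a list of (attribute, value) pairs; all must match."""
    return all(record[a] == v for a, v in rule)


def grow(rule, covered, target, label_key):
    """Stand-in for GROW: add the attribute-value pair with the highest purity."""
    used = {a for a, _ in rule}
    best, best_purity = None, -1.0
    for rec in covered:
        if rec[label_key] != target:
            continue  # candidate pairs are taken from positive records only
        for a, v in rec.items():
            if a == label_key or a in used:
                continue
            hit = [r for r in covered if r[a] == v]
            purity = sum(r[label_key] == target for r in hit) / len(hit)
            if purity > best_purity:
                best, best_purity = (a, v), purity
    return rule + [best] if best else None


def separate_and_conquer(train, target, label_key="class"):
    rules, remaining = [], list(train)
    while any(r[label_key] == target for r in remaining):
        rule, covered = [], list(remaining)
        # Grow the rule until it covers no negative record.
        while any(r[label_key] != target for r in covered):
            grown = grow(rule, covered, target, label_key)
            if grown is None:  # no attribute left to add; accept the rule as is
                break
            rule = grown
            covered = [r for r in covered if covers(rule, r)]
        rules.append(rule)
        # Separate: drop the covered records and conquer the rest.
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules


train = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
    {"outlook": "overcast", "windy": "yes", "class": "play"},
]
rules = separate_and_conquer(train, "play")
```

Each returned rule is a conjunction of attribute-value pairs for the chosen class; swapping `grow` for another search heuristic changes which pairs are selected but not the outer loop.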
GROW(Rule, Covered):
    do:
        for every attribute ai:
            DiscPowi = DISCPOW(ai, Covered)
            ChiSquaredi = CHISQUARED(ai, Covered)
        among attributes with DiscPowi = 0, ignore ai with minimum ChiSquaredi
    while there exists ai with DiscPowi = 0
    (a, v) = CHOOSE_ATTR&VALUE()
    grow the rule with (a, v) as an antecedent

DISCPOW(ai, Covered):
    return the cardinality of PotBound(ai)

CHOOSE_ATTR&VALUE():
    for each attribute ai:
        find the most frequent value vi in the positive of PotBound
    for each vi:
        calculate Pui = PURITY(R), where R is the candidate rule grown from the original rule with (ai = vi)
B. Search Heuristics
The GROW function in the separate-and-conquer algorithm given in Figure 2 searches the covered data for a suitable attribute and a corresponding value in order to grow a rule. Examples of search heuristics include entropy used in ID3 [13] and information gain used in C4.5 [14].
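For comparison with the rough set heuristic introduced next, an entropy-based heuristic can be sketched as follows. This is a generic textbook formulation of entropy and information gain, not code taken from ID3 or C4.5; the function names, the `class` key, and the toy records are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attr, label_key="class"):
    """Reduction in entropy obtained by splitting the records on attr."""
    labels = [r[label_key] for r in records]
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[label_key] for r in records if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical nominal data: attribute a separates the classes, b does not.
data = [
    {"a": "x", "b": "p", "class": "play"},
    {"a": "x", "b": "q", "class": "play"},
    {"a": "y", "b": "p", "class": "stay"},
    {"a": "y", "b": "q", "class": "stay"},
]
gain_a = information_gain(data, "a")
gain_b = information_gain(data, "b")
```

A heuristic of this kind would prefer attribute `a`, whose gain is one bit, over attribute `b`, whose gain is zero.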
C. Potential Boundary Region and Discernibility Power
Here we introduce two concepts: the potential boundary region (PotBound) and the discernibility power (DiscPow). Consider Figure 3, which is based on Figure 1. If the attribute a2 is removed, the boundary region expands. The expanded part of the boundary region, {x6, x7}, is defined as the PotBound of a1. The definition of PotBound is given below:
\[ \mathrm{PotBound}_A(Y, a_i) = \mathrm{BOUND}_{A'}(Y) - \mathrm{BOUND}_A(Y), \]
where A = (U,C∪D) and A' = (U,C∪D −{ai}).
The DiscPow of a1 is the cardinality of the PotBound of a1, which is 2 in this example. The DiscPow of an attribute ai indicates how many elements will become indiscernible without ai. DiscPow has the monotonicity property. When an element is removed from a rough set, it is removed from the positive region, the boundary region, or the negative region. No matter from which region the element is removed, the DiscPow value can never increase, and the only way to decrease the DiscPow value is to remove an element from the PotBound.
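The two concepts can be sketched in a few lines. The sketch follows the formula as written, so the PotBound of an attribute is obtained by dropping that same attribute from the condition set. The decision table is hypothetical and is only chosen so that dropping a2 enlarges the boundary region by {x6, x7}; the function names are assumptions as well.

```python
from collections import defaultdict

# Hypothetical decision table: condition attributes a1, a2 and decision d.
TABLE = {
    "x1": {"a1": "A", "a2": "P", "d": "y"},
    "x2": {"a1": "A", "a2": "P", "d": "y"},
    "x3": {"a1": "B", "a2": "Q", "d": "y"},
    "x4": {"a1": "C", "a2": "P", "d": "y"},
    "x5": {"a1": "C", "a2": "P", "d": "n"},
    "x6": {"a1": "D", "a2": "P", "d": "y"},
    "x7": {"a1": "D", "a2": "Q", "d": "n"},
    "x8": {"a1": "E", "a2": "Q", "d": "n"},
}


def boundary(table, attrs, decision, target):
    """Records whose indiscernibility class overlaps Y only partly."""
    groups = defaultdict(set)
    for name, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(name)
    Y = {n for n, row in table.items() if row[decision] == target}
    bound = set()
    for block in groups.values():
        if block & Y and not block <= Y:
            bound |= block
    return bound


def pot_bound(table, attrs, decision, target, attr):
    """Part of the boundary region that appears only after attr is dropped."""
    reduced = [a for a in attrs if a != attr]
    return boundary(table, reduced, decision, target) - boundary(
        table, attrs, decision, target
    )


def disc_pow(table, attrs, decision, target, attr):
    """Discernibility power: the cardinality of the potential boundary region."""
    return len(pot_bound(table, attrs, decision, target, attr))


expanded = pot_bound(TABLE, ["a1", "a2"], "d", "y", "a2")
power = disc_pow(TABLE, ["a1", "a2"], "d", "y", "a2")
```

In this table x6 and x7 differ only on a2, so without a2 they become indiscernible and move into the boundary region, giving a discernibility power of 2.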
Figure 3. (a) Decision table A' = (U,C∪D −{a2}). (b) The new rough set of d=y.
D. ROUSER
The GROW function of ROUSER is shown in Figure 4. In each iteration, ROUSER removes attributes whose DiscPow values are zero, and it updates the DiscPow of every remaining attribute until none of the remaining attributes has a DiscPow of zero. If multiple attributes need to be removed, the current version of ROUSER simply removes the one that is independent of the class entered as a parameter to the separate-and-conquer algorithm (Figure 2).
Once an attribute is removed in an iteration of the GROW function, we no longer need to compute its DiscPow value, because of the monotonicity property of DiscPow. When elements are removed from Covered, the DiscPow value of an attribute stays the same or becomes smaller. Once the DiscPow value of an attribute is zero, it can never increase again, and hence the attribute can be removed. The DISCPOW function is shown in Figure 5.
Figure 4. The GROW function.
Figure 5. The DISCPOW function.
The CHOOSE_ATTR&VALUE function in GROW is shown in Figure 6. This function searches for an attribute-value pair, i.e. (ai ,vi), that will be used to grow a rule. For each attribute ai we choose a value vi which is the most frequent value in the positive of PotBound of ai. Then, for all pairs, we choose the one corresponding to the highest Purity value [9][15], as defined below:
PURITY(R) = p / ( p+n ),
where R is the candidate rule, p is the cardinality of positive region covered by R, and n is the cardinality of negative region covered by R.
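The purity-based selection among candidate pairs can be sketched as follows. In this sketch p and n are counted as the positive and negative records covered by the candidate rule, which is an interpretation of the definition above; the function names, the `class` key, and the toy records are assumptions.

```python
def purity(rule, records, target, label_key="class"):
    """PURITY(R) = p / (p + n) over the records covered by the rule."""
    covered = [r for r in records if all(r[a] == v for a, v in rule)]
    if not covered:
        return 0.0  # a rule that covers nothing is given the lowest purity
    p = sum(r[label_key] == target for r in covered)
    return p / len(covered)


def choose_pair(rule, candidates, records, target):
    """Pick the candidate (attribute, value) whose grown rule has highest purity."""
    return max(candidates, key=lambda pair: purity(rule + [pair], records, target))


# Hypothetical nominal records and candidate pairs.
records = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
    {"outlook": "overcast", "windy": "yes", "class": "play"},
]
candidates = [("outlook", "sunny"), ("outlook", "overcast")]
best = choose_pair([], candidates, records, "play")
```

The pair ("outlook", "overcast") wins here because the rule grown with it covers one positive record and no negative record.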
Figure 6. The CHOOSE_ATTR&VALUE function in GROW of ROUSER.
Since the chosen value appears in the positive of PotBound, the newly generated condition of the rule can explain the data in the positive of PotBound and reject the data in the negative of PotBound.
ROUSER generates a set of rules for each class. After a rule set is generated, it is concatenated to the bottom of the rule list. The BUILD_CLASSIFIER algorithm of ROUSER is shown in Figure 7. The class list is sorted in ascending order of frequency in order to prevent the imbalanced class distribution problem. For an unseen case, ROUSER searches down the rule list and uses the first rule that covers the case to classify it.

BUILD_CLASSIFIER( ):
    build a ClassList by ascending frequency order
    for each Class in ClassList:
        RuleSet = SEPERATE&CONQUER(Class…)
        concatenate the RuleSet to the bottom of RuleList
    return RuleList

Figure 7. The BUILD_CLASSIFIER algorithm.
ROUSER ignores missing values when evaluating DiscPow. The reason is simple: a missing value cannot be chosen to grow a rule. In some situations, the PotBound of an attribute is filled entirely with missing values, which means that this attribute has no valid value in the PotBound. This attribute should be removed, since missing values provide no discerning power.
IV. EXPERIMENT AND RESULTS
In this section, we discuss experiments comparing ROUSER with ID3, J48 (an implementation of C4.5 [14]), and JRip (an implementation of RIPPER [3]), provided by WEKA [7].
The data sets used for experiments are all available from the UCI Machine Learning Repository [18], as shown in Table I. They are collected from different application domains, such as biology, gaming, politics, and marketing; the number of their attributes ranges from 6 to 69; the number of their classes ranges from 2 to 24; for some of them, the class distributions are imbalanced; and some data sets have missing values on some attributes.
TABLE I. THE DATA SETS USED FOR EXPERIMENTS
UCI data set   Data name                        #instances   #attributes   Class distribution
1              audiology                        226          69
2              car evaluation                   1728         6
3              chess                            3196         36
4              mushroom                         8124         22
5              molecular biology: splice gene
We use 10-fold cross-validation to evaluate the classification performance. The experimental results are summarized in Table II. The numbers reported in Table II are accuracy rates in percentage. The numbers are highlighted if ROUSER provides comparable or better classification performance. Some results for ID3 are not available due to its limitations. On two data sets, JRip outperforms J48 and ID3, but it is outperformed by ROUSER. The results also indicate that ROUSER performs well on data sets that have more attributes and on data sets where the class distributions are imbalanced. However, ROUSER does not perform well on data sets 2 (car evaluation) and 5 (molecular biology). We think the reason is that ROUSER, unlike RIPPER, has no optimization stage to reduce errors, so overfitting occurs. A deeper investigation of this will be part of the future work. Furthermore, ROUSER does not perform as well as we expected on data sets 1 (audiology) and 9 (voting-records). We think that ROUSER needs a more sophisticated strategy to handle missing values. An improvement on this will be part of the future work.
TABLE II. EXPERIMENT RESULT
V. CONCLUSIONS AND FUTURE WORK
A rule-based classification algorithm named ROUSER is proposed. It is designed to process nominal data and generate human understandable decision rules. ROUSER uses a rough set approach as its search heuristic, and the rule generation method of ROUSER is based on the separate-and-conquer strategy.
As a prototype without the optimization stage or the pruning stage to reduce errors, ROUSER still provides classification performance comparable (2% more or less accurate on the data sets 3, 4, 7, and 9) to or even better (at least 4% more accurate on the data set 6) than that given by the rule-based or tree-based classification algorithms considered in experiments. Since the search heuristic of ROUSER is totally different from the search heuristics (Entropy and Information Gain) used by the other three algorithms, the results imply that the proposed PotBound and DiscPow are useful. This also shows the potential of ROUSER and gives an example of future work.
For future work, we plan to conduct more experiments, develop better strategies to select attributes and handle missing values, and apply ROUSER to data sets obtained from a real-world case study.
ACKNOWLEDGMENT
The National Science Council of Taiwan (R.O.C.) supported this work under Grant NSC 100-2218-E-004-002.
The support is gratefully acknowledged. The authors would also like to thank anonymous reviewers for their valuable time.
REFERENCES
[1] J. G. Bazan, H. S. Nguyen, S. H. Nguyen, P. Synak, and J. Wróblewski, "Rough set algorithms in classification problem," in Rough Set Methods and Applications, Physica-Verlag GmbH, 2000, pp. 49-88.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] W. W. Cohen, "Fast Effective Rule Induction," Proc. 12th Int'l Conf. Machine Learning (ICML), pp. 115-123, 1995.
[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[5] J. Dai, Q. Xu, and W. Wang, "A comparative study on strategies of rule induction for incomplete data based on rough set approach," International Journal of Advancements in Computing Technology, vol. 3, pp. 176-183, 2011.
[6] J. Fürnkranz, "Separate-and-Conquer Rule Learning," Artif. Intell. Rev., vol. 13, pp. 3-54, 1999.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, pp. 10-18, 2009.
[8] J. C. Huhn and E. Hullermeier, "FR3: A Fuzzy Rule Learner for Inducing Reliable Classifiers," Fuzzy Systems, IEEE Transactions on, vol. 17, pp. 138-149, 2009.
[9] W. Jiabing, Z. Pei, W. Guihua, and W. Jia, "Classifying Categorical Data by Rule-Based Neighbors," in Data Mining (ICDM), 2011 IEEE 11th International Conference on, 2011, pp. 1248-1253.
[10] G. Pagallo and D. Haussler, "Boolean Feature Discovery in Empirical Learning," Machine Learning, vol. 5, pp. 71-99, 1990.
[11] Z. Pawlak, "Some Issues on Rough Sets," in Transactions on Rough Sets I, vol. 3100, J. Peters, A. Skowron, J. Grzymala-Busse, B. Kostek, R. Swiniarski, and M. Szczuka, Eds. Springer Berlin / Heidelberg, 2004, pp. 1-58.
[12] Z. Pawlak and A. Skowron, "Rudiments of rough sets," Information Sciences, vol. 177, no. 1, pp. 3-27, 2007.
[13] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
[15] S. M. Weiss and N. Indurkhya, "Reduced complexity rule induction," presented at the Proceedings of the 12th International Joint Conference on Artificial Intelligence - Volume 2, Sydney, New South Wales, Australia, 1991.
[16] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[17] L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338-353, 1965.
[18] A. Frank and A. Asuncion, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2010.
Hybrid Ensembles of Decision Trees and Artificial
Abstract—Ensemble learning is inspired by the human group decision making process, and it has been found beneficial in various application domains. Decision tree and artificial neural network are two popular types of classification algorithms often used to construct classic ensembles. Recently, researchers proposed to use the mixture of both types to construct hybrid ensembles. However, researchers use decision trees and artificial neural networks together in an ensemble without further discussion. The focus of this paper is on the hybrid ensemble constructed by using decision trees and artificial neural networks simultaneously. The goal of this paper is not only to show that the hybrid ensemble can achieve comparable or even better classification performance, but also to provide an explanation of why it works.
Index Terms—Machine learning, classification, neural nets
I. INTRODUCTION
For classification tasks, an ensemble trains and uses a group of member classifiers. Given a data record whose class label is unknown, an ensemble will use its member classifiers to generate individual classifications and then aggregates these