Mining decision rules on data streams in the presence of concept drifts


Cheng-Jung Tsai a,*, Chien-I. Lee b, Wei-Pang Yang c

a Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, ROC
b Department of Information and Learning Technology, National University of Tainan, Tainan, Taiwan, ROC
c Department of Information Management, National DongHwa University, Hualien, Taiwan, ROC

Abstract

In a database, the concept of an example might change over time, which is known as concept drift. When concept drift occurs, the classification model built from the old dataset is not suitable for predicting a new dataset. Therefore, the problem of concept drift has attracted a lot of attention in recent years. Although many algorithms have been proposed to solve this problem, they have not been able to provide users with a satisfactory solution to concept drift. That is, the current research about concept drift focuses only on updating the classification model. However, real-life decision makers might be very interested in the rules of concept drift. For example, doctors desire to know the root causes behind variation in the causes and development of disease. In this paper, we propose a concept drift rule mining tree, called CDR-Tree, to accurately discover the underlying rules governing concept drift. The main contributions of this paper are: (a) we address the problem of mining concept-drifting rules, which has not been considered in previously developed classification schemes; (b) we develop a method that can accurately mine rules governing concept drift; (c) we develop a method that, should classification models be required, can efficiently and accurately generate such models via a simple extraction procedure rather than constructing them anew; and (d) we propose two strategies to reduce the complexity of concept-drifting rules mined by our CDR-Tree.

Keywords: Data mining; Classification; Decision tree; Data stream; Concept drift

1. Introduction

With the rapid development and wide distribution of electronic data, extracting useful information from numerous and jumbled sources has become a major goal of many scholars. Data mining (Han & Kamber, 2001; Menzies, 2003), an important technique for extracting information from databases, has been proposed to solve this problem. Among the several functions of data mining, classification (Lee, Tsai, Wu, & Yang, 2008; Tsai, Lee, Chen, & Yang, 2007) is crucially important and has been applied successfully to several areas, including the discovery of commodity deal dependence, customer relationship management, and risk analysis. A standard classification task can be divided into two steps. In the first step, a training dataset is given. Each example in the training dataset contains a number of attributes and a target class. Attributes can be classified into continuous attributes and categorical attributes. The main difference between them is that the values of a continuous attribute are ordered, whereas those of a categorical attribute are not. With the given training dataset, a classification system generates a classifier that captures the relations between the attributes and the target class in that training dataset. The second step is to evaluate the accuracy of the generated classifier by using a testing dataset. Well-known techniques developed for classification include Bayesian classification, neural networks, genetic algorithms, and decision trees. Among them, decision trees are the most popular due to the merits of rapid construction, comparable accuracy, and readable rules (Rastogi & Shim, 1998).

Expert Systems with Applications 36 (2009) 1164–1178
www.elsevier.com/locate/eswa
0957-4174/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2007.11.034
* Corresponding author. Present address: 4F., No. 167, Sec. 1, Funong St., East District, Tainan 70175, Taiwan, ROC. Tel.: +886 6 2133111x777; fax: +886 6 3017137.
E-mail addresses: tsaicj@cis.nctu.edu.tw (C.-J. Tsai), leeci@mail.nutn.edu.tw (C.-I. Lee), wpyang@mail.ndhu.edu.tw (W.-P. Yang).

According to how the training dataset is obtained, the classification problem can also be divided into non-incremental learning and incremental learning. Incremental learning is important for applications in which the training dataset arrives in the form of a data stream (Cunningham & Nowlan, 2003; Domingos & Hulten, 2000; Jin et al., 2003). In such cases, it is not feasible to collect all training data before applying an algorithm. Incremental learning (Furnkranz & Widmer, 1994; Maloof, 2003; Maloof et al., 2002; Utgoff, 1989; Utgoff, Berkman, & Clouse, 1997) is becoming ever more important since most of the information in our lives arrives as a data stream consisting of data blocks. However, most proposed approaches to incremental learning assume that data streams follow a stationary distribution; namely, the data concept remains unchanged. In reality, the concept underlying any application data, such as disease variation, weather forecasts, consumers’ shopping habits, or virus detection, may vary as time goes by. Such a change of concept is known as concept drift (Cunningham & Nowlan, 2003; Hulten, Spencer, & Domingos, 2001; Klinkenberg, 2001; Klinkenberg et al., 1998; Kolter et al., 2003; Koychev, 2000; Lane & Brodley, 1998; Lee, Tsai, Wu, & Yang, 2007; Wang, Fan, Yu, & Han, 2003; Widmer & Kubat, 1996).

The current solutions to the problem of concept drift focus on efficiently rebuilding classifiers to accurately predict new incoming datasets. Unfortunately, they are unable to discern the main reasons why a concept drifts. Users, however, might be more interested in the rules of concept drift. For example, doctors desire to know the main causes of disease variation, scholars long for the rules of weather transition, and sellers would like to find out the reasons why consumers’ shopping habits change. The idea is simple but novel; to our knowledge, we are the first group to address the problem of mining concept-drifting rules. Note that what we address is different from the emerging pattern, which is an itemset whose support increases significantly in association rule mining (Fan & Ramamohanarao, 2006; Wang, Zhao, Dong, & Li, 2006). As claimed in Freitas (2000), classification and association rule discovery are fundamentally different mining tasks. The former can be considered a nondeterministic task, which is unavoidable given the fact that it involves prediction, while the latter can be considered a deterministic task which does not involve prediction in the same sense as the classification task does.

In this paper, we propose a concept drift rule mining tree algorithm called CDR-Tree to accurately discover the rules of concept drift. The main contributions of this paper are: (a) we address the mining of concept-drifting rules, an interesting problem that has not been discussed in the past; (b) our CDR-Tree can accurately mine the rules of concept drift; (c) if classification models are required, CDR-Tree can efficiently and accurately generate them via a simple extraction procedure instead of building them from scratch; and (d) two strategies are proposed to reduce the complexity of concept-drifting rules mined by CDR-Tree. The remainder of this paper is organized as follows: Section 2 reviews related work. In Section 3, the theory of concept drift rules is introduced and then our concept drift rule mining tree algorithm is elucidated. The performance evaluation is shown in Section 4. In the last section, our conclusions and future research directions are explained.

2. Related work

In this section, we introduce related work on traditional decision trees, incremental learning algorithms, and the techniques proposed for solving the concept drift problem.

2.1. Decision trees

A traditional decision tree (Clark & Niblett, 1989; Quinlan, 1986; Quinlan, 1993), as shown in Fig. 1, is a flow chart-like tree structure consisting of a number of Boolean functions. Within the decision tree, each internal node denotes a test on an attribute, and each branch represents an outcome of the test. Each path from the root to a leaf node forms a rule, and each leaf node is associated with a target class (or class). Before building a decision tree, a training dataset is required, as shown in Table 1 (which is used for building the decision tree in Fig. 1). Each instance in the training dataset includes a set of attribute values and a target class.

Traditional decision tree approaches consist of two phases: the building phase and the pruning phase. In the building phase, according to a splitting function such as information gain, each internal node finds a splitting attribute to partition the incoming data into corresponding child nodes, unless all the examples in this node are pure (i.e. all examples in this node have the same target class) or cannot be further partitioned. Nevertheless, many of the branches reflect anomalies in the training data due to noise or outliers. To prevent such an overfitting problem, a decision tree should prune its model to remove the least reliable branches. This generally results in faster classification and an improvement in the ability of the tree to correctly classify unseen data. To classify an unknown instance, beginning with the root node, successive internal nodes are visited until the instance reaches a leaf node. The class of this leaf node is then assigned to the instance as a prediction.
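As an illustration of the splitting function used in the building phase, the following minimal Python sketch computes information gain for two attributes on four rows taken from Table 1 (IDs 1, 3, 7, and 8). The function names `entropy` and `information_gain` and the dictionary layout of the rows are our own assumptions for illustration; they are not part of any published implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Four rows from Table 1 (IDs 1, 3, 7, 8), reduced to two attributes.
rows = [
    {"fever": "yes", "cough": "no", "diagnosis": "influenza"},
    {"fever": "yes", "cough": "yes", "diagnosis": "pneumonia"},
    {"fever": "no", "cough": "no", "diagnosis": "healthy"},
    {"fever": "no", "cough": "no", "diagnosis": "healthy"},
]

# The attribute with the largest gain would be chosen as the splitting attribute.
gains = {a: information_gain(rows, a, "diagnosis") for a in ("fever", "cough")}
best = max(gains, key=gains.get)
print(gains, best)
```

On these four rows, splitting on fever leaves one pure child node and therefore yields the larger gain, so fever would be selected first.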

2.2. Incremental learning algorithms

A traditional decision tree algorithm such as ID3 (Utgoff, 1989) falls short because it does not take incremental learning into consideration. Therefore, once new instances are obtained, the decision tree needs to be re-built by using an integrated version of the new and old datasets (Utgoff et al., 1997). Based on ID3, Schlimmer and Fisher proposed ID4 to solve this problem (Utgoff, 1989). The major difference in ID4 is that the new instances are added directly into the existing decision tree without generating a new one. ID4 keeps some information in every node and judges whether the best splitting attribute is still the same after the addition of new instances. If it is the same, the new instances continue to the next node and are evaluated again; if it is not, the sub-tree rooted at this node is discarded and a new sub-tree is then built. However, the final decision tree is also influenced by the order in which instances arrive, even if they are from the same new training dataset.

To address the problems of ID4 mentioned above, Utgoff developed a new decision tree algorithm called ID5 (Utgoff, 1989). Similar to ID4, ID5 integrates new instances into decision trees. Also, ID5 preserves the useful information in each node and determines whether the best splitting attribute can still achieve optimal performance. If the best splitting attribute changes, ID5 replaces it instead of deleting the sub-tree. The substitution made by ID5 is called pull-up. For pull-up to function properly, it is essential to have all instances stored in the decision tree. Therefore, the recording cost of ID5 is more expensive than that of ID4. Another disadvantage of ID5 is that it generates a larger decision tree and rules of greater complexity than either ID3 or ID4. In order to reduce the size of the decision tree generated by ID5, Utgoff proposed ID5R (Utgoff, 1989). The primary purpose of ID5R is to ensure that the splitting attribute in each internal node is optimal and that the decision tree after revisions is identical to the one re-built by ID3 from the same training dataset.

2.3. Concept drift

Most incremental learning algorithms assume that the data stream follows a stationary distribution. But in reality, the concept of any given instance might vary either gradually or quickly. Once the concept of the data starts drifting, the classification model constructed by using old datasets becomes unsuitable for the new one. Thus, it is imperative that the old classification model be revised or a new one be re-built. At present, the solutions to the problem of concept drift can be generally divided into three categories:

2.3.1. Window-based approaches

Window-based approaches, like WAH (Widmer & Kubat, 1996) and DNW (Klinkenberg et al., 1998), select the training dataset within fixed or dynamic window sizes to construct a classification model (Hulten et al., 2001; Lazarescu & Venkatesh, 2004; Maloof, 2003). WAH adjusts window sizes based on the accuracy of classification. When the concept has drifted due to new input instances, WAH makes the window size about 20% smaller. Also, WAH reduces the window size to avoid retaining redundant and unnecessary instances while the concept remains stable. Once the concept has drifted, but while its variation still remains smaller than a given threshold, WAH retains the original window size. If none of the above conditions is fulfilled, more instances are required to construct the classifier, and WAH combines all new and old instances into a new training dataset. Although WAH can resolve the problem of concept drift, it can only be applied to small datasets. Therefore, Klinkenberg and Renz developed DNW to remedy this deficiency. DNW constructs the predictive model from data blocks; its learning method is similar to that of WAH, but it adjusts the window sizes differently. DNW builds a classifier in each block and compares accuracy, recall, and precision between the current classifier and the previous one. After a full comparison, DNW reviews the extent of variation and adjusts the window size accordingly.

2.3.2. Weight-based approaches

According to certain factors, such as how long the data has been stored, this kind of approach assigns each training instance a distinctive weight (Kolter et al., 2003; Koychev, 2000). Based on the weights, some outdated instances will be opportunistically discarded. Thus, its learning curve is quite similar to that of window-based approaches.

Table 1
The patients’ diagnostic data

ID  Sex     Location  Fever  Cough  Diagnosis
1   Male    New York  Yes    No     Influenza
2   Female  Chicago   Yes    No     Influenza
3   Female  New York  Yes    Yes    Pneumonia
4   Male    New York  Yes    Yes    Pneumonia
5   Male    Chicago   Yes    Yes    Pneumonia
6   Male    New York  No     Yes    Influenza
7   Female  New York  No     No     Healthy
8   Male    New York  No     No     Healthy
9   Female  Chicago   No     No     Healthy

2.3.3. Ensemble classiﬁers

An ensemble classifier (Lee, Tsai, & Ku, 2006) deals with concept drift by utilizing multiple classifiers and by voting to construct a proper predictive model (Fan, 2004; Street & Kim, 2001; Wang et al., 2003). A classifier is built for each data block. Each classifier is assigned a weight according to its accuracy and the period of its construction. These weights not only influence the final voting results, but are also the main factors in deciding whether classifiers are eliminated.

3. Concept drift rule mining tree algorithm

The currently proposed solutions for concept drift problems discussed in Section 2 can effectively and precisely predict the target class of newly arriving instances, but they are incapable of informing users which rules cause concept drift. However, as stated in the introduction, users may be much more interested in the rules governing concept drift.

3.1. The rules of concept drift

Take the patients’ diagnostic data in Table 1 as the example again and assume that Table 2 is the new diagnostic dataset from the same patients. In Table 2, the drifting values are marked with both underline and boldface. An instance with the same ID in both tables means the diagnostic data belongs to the same patient. Fig. 2 is the decision tree constructed by using Table 2. Comparing Fig. 1 with Fig. 2, we find that patients ID9 and ID10 are located in leaf node A in the old decision tree and in node B in the new one. The corresponding decision rules are:

If (fever = “no”) and (cough = “no”) then (diagnosis = “healthy”)

and

If (fever = “yes”) and (work = “Shanghai”) and (cough = “yes”) then (diagnosis = “SARS”).

Comparing the rules of those two patients, we can see that someone might be infected with SARS if he develops a fever, his working location changes to Shanghai, and he develops a bad cough. Simply stated, the variations of those three attributes, fever, working location, and cough, are the primary factors influencing concept drift. The concept drift rule detected from the two patients can be written in the form:

If (fever = “no → yes”) and (location = “Chicago → Shanghai”) and (cough = “no → yes”) then (diagnosis = “healthy → SARS”).

In this example, owing to the few instances and very simple rules, users can clearly and quickly find the drifting rules between the two datasets. However, in a real application, it is very difficult for users to figure out such rules since the number of produced rules is usually very large.

3.2. Concept drift rule mining tree

In order to mine the concept drift rules mentioned in Section 3.1, here we propose our CDR-Tree algorithm. Section 3.2.1 describes the building step of the CDR-Tree. The idea is simple but novel. Without loss of generality, here we consider only the case in which there are two data blocks, T_p and T_q, in a data stream. Note that, in addition to the concept drift rules, users might also require the classification model of each data block. The CDR-Tree algorithm can provide it via a quick and simple extraction step, as will be described in Section 3.2.2. Finally, two strategies are proposed in Section 3.2.3 to reduce the complexity of concept drift rules mined by the CDR-Tree algorithm.

3.2.1. Building a CDR-Tree

Table 2
The new coming diagnostic data from the same patients

ID  Sex     Work      Fever  Cough  Diagnosis
1   Male    Chicago   No     No     Healthy
2   Female  Chicago   No     No     Healthy
3   Female  Shanghai  Yes    No     Influenza
4   Male    New York  No     No     Healthy
5   Male    New York  Yes    No     Pneumonia
6   Male    New York  No     Yes    Influenza
7   Female  New York  Yes    No     Pneumonia
8   Male    New York  Yes    Yes    Pneumonia
9   Female  Shanghai  Yes    Yes    SARS
10  Female  Shanghai  Yes    Yes    SARS

To mine concept drift rules, the CDR-Tree algorithm initially integrates new and old instances from different times into pairs; then, following the manner of traditional decision trees, a CDR-Tree is built. During the building step, information gain is used as the criterion to select the best splitting attribute in each node. In other words, the CDR-Tree regards each pair made by integrating new and old data as a single attribute value and mines the rules of concept drift through the construction of a traditional decision tree. However, since a traditional decision tree stops building when a node is pure, the generated concept-drifting rules would miss some important information. Taking Tables 1 and 2 as our example again, the integrated data of the two tables are shown in Table 3, and Fig. 3 is the corresponding CDR-Tree. As described in Section 3.1, for the patients ID9 and ID10, there is a drifting rule:

If (fever = “no → yes”) and (Work = “Chicago → Shanghai”) and (cough = “no → yes”) then (diagnosis = “healthy → SARS”).

However, a traditional decision tree will stop splitting at the node C in Fig. 4 and then produce a rule:

If (fever = “no → yes”) and (Work = “Chicago → Shanghai”) then (diagnosis = “healthy → SARS”).

It is clear that the former rule is more reliable and accurate than the latter one. To solve this problem, the CDR-Tree algorithm goes on splitting a pure node n_o in which all instances have some common attribute value a_i, but the attribute a has never been selected as a splitting attribute on the path from the node n_o to the root. The concept-drifting rules are marked with dotted lines in the CDR-Tree in Fig. 3. There are five concept drift rules as follows:

Rule a: If (fever = “no → yes”) and (work = “Chicago → Shanghai”) and (cough = “no → yes”) then (diagnosis = “healthy → SARS”).

Rule b: If (fever = “no → yes”) and (work = “New York → New York”) then (diagnosis = “healthy → pneumonia”).

Rule c: If (fever = “yes → no”) and (cough = “yes → no”) then (diagnosis = “pneumonia → healthy”).

Rule d: If (fever = “yes → no”) and (cough = “no → no”) then (diagnosis = “influenza → healthy”).

Rule e: If (fever = “yes → yes”) and (work = “New York → Shanghai”) then (diagnosis = “pneumonia → influenza”).

In the above rules, the values on the left and right side of “→” respectively represent the values in two different data blocks of a data stream. Observing carefully, we can find that the concept drift rule of the patients ID9 and ID10 mentioned in Section 3.1 is indeed mined in this CDR-Tree.

Table 3
The integrated data of Tables 1 and 2

ID  Location             Fever      Cough      Diagnosis
1   New York → Chicago   Yes → No   No → No    Influenza → Healthy
2   Chicago → Chicago    Yes → No   No → No    Influenza → Healthy
3   New York → Shanghai  Yes → Yes  Yes → No   Pneumonia → Influenza
4   New York → New York  Yes → No   Yes → No   Pneumonia → Healthy
5   Chicago → New York   Yes → Yes  Yes → No   Pneumonia → Pneumonia
6   New York → New York  No → No    Yes → Yes  Influenza → Influenza
7   New York → New York  No → Yes   No → No    Healthy → Pneumonia
8   New York → New York  No → Yes   No → Yes   Healthy → Pneumonia
9   Chicago → Shanghai   No → Yes   No → Yes   Healthy → SARS
10  Chicago → Shanghai   No → Yes   No → Yes   Healthy → SARS
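The integration of old and new instances into pairs can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `integrate_blocks`, the dictionary layout, and the ASCII separator " -> " (standing in for the arrow) are hypothetical, and the attribute is named "location" in both blocks for simplicity although Table 2 labels it "Work". The three patients used are IDs 1, 3, and 4 from Tables 1 and 2.

```python
def integrate_blocks(old_block, new_block, attributes, target):
    """Pair the old and new record of each ID into one "old -> new" record."""
    integrated = {}
    # Only IDs present in both data blocks can be paired.
    for record_id in sorted(old_block.keys() & new_block.keys()):
        old, new = old_block[record_id], new_block[record_id]
        integrated[record_id] = {
            name: f"{old[name]} -> {new[name]}" for name in (*attributes, target)
        }
    return integrated


# Records of patients 1, 3 and 4 in the old data block (Table 1).
old_block = {
    1: {"location": "New York", "fever": "Yes", "cough": "No", "diagnosis": "Influenza"},
    3: {"location": "New York", "fever": "Yes", "cough": "Yes", "diagnosis": "Pneumonia"},
    4: {"location": "New York", "fever": "Yes", "cough": "Yes", "diagnosis": "Pneumonia"},
}
# Records of the same patients in the new data block (Table 2).
new_block = {
    1: {"location": "Chicago", "fever": "No", "cough": "No", "diagnosis": "Healthy"},
    3: {"location": "Shanghai", "fever": "Yes", "cough": "No", "diagnosis": "Influenza"},
    4: {"location": "New York", "fever": "No", "cough": "No", "diagnosis": "Healthy"},
}

integrated = integrate_blocks(
    old_block, new_block, ("location", "fever", "cough"), "diagnosis"
)
for record_id, record in integrated.items():
    print(record_id, record)
```

Each paired string is then treated as a single categorical value, so an ordinary decision tree builder can be applied to the integrated records without modification.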

In order to provide users with meaningful and interesting rules of concept drift, the CDR-Tree algorithm defines a rule support RS and a rule confidence RC to filter out meaningless ones. For a leaf node n_o in the CDR-Tree, suppose this node is assigned class label c and contains N_o instances; then:

RS = N_o

and

RC = (100 N_c / N_o)%,

where N_c is the number of instances with class c in this node n_o. The default values of RS and RC are 2 and 50%, respectively. However, users can assign a larger threshold if they only want to check the notable rules. For example, by setting RS = 2 and RC = 90%, Rules c and e will be filtered out. RS and RC are shown in the form of (RS, RC) in Fig. 3, and the pseudo code of the CDR-Tree algorithm is presented in Fig. 4.
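The two measures can be illustrated with a short Python sketch. It rests on assumptions that the text does not state explicitly: the class label c of a leaf is taken to be its majority class, and a rule is kept when both RS and RC are greater than or equal to their thresholds. The function names are hypothetical, and the leaf contents below were obtained by applying the conditions of Rules a to e to Table 3 rather than being read from Fig. 3.

```python
from collections import Counter


def rule_support_confidence(leaf_labels):
    """Return (class label c, RS, RC in percent) for the instances in one leaf."""
    counts = Counter(leaf_labels)
    label, n_c = counts.most_common(1)[0]  # assumed: c is the majority class
    n_o = len(leaf_labels)                 # N_o, the number of instances in the leaf
    return label, n_o, 100.0 * n_c / n_o


def filter_rules(leaves, min_rs=2, min_rc=50.0):
    """Keep the rules whose RS and RC both reach the given thresholds."""
    kept = {}
    for name, labels in leaves.items():
        label, rs, rc = rule_support_confidence(labels)
        if rs >= min_rs and rc >= min_rc:
            kept[name] = (label, rs, rc)
    return kept


# Paired class labels of the Table 3 instances matched by each rule.
leaves = {
    "a": ["Healthy -> SARS", "Healthy -> SARS"],            # IDs 9, 10
    "b": ["Healthy -> Pneumonia", "Healthy -> Pneumonia"],  # IDs 7, 8
    "c": ["Pneumonia -> Healthy"],                          # ID 4
    "d": ["Influenza -> Healthy", "Influenza -> Healthy"],  # IDs 1, 2
    "e": ["Pneumonia -> Influenza"],                        # ID 3
}

kept = filter_rules(leaves, min_rs=2, min_rc=90.0)
print(sorted(kept))
```

Under these assumptions, Rules c and e each cover a single instance and are removed by the support threshold, which agrees with the example given in the text.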


3.2.2. Extracting the decision tree from CDR-Trees

When users require the old or new classification model, or both, in addition to the concept-drifting rules, the CDR-Tree algorithm can provide them efficiently and accurately via the following extraction steps:

Step 1. To extract the old (new) classification model, the splitting attribute values of all internal nodes and the target class in all leaf nodes of the new (old) instances are ignored.

Step 2. Check each node from the bottom up and from left to right.

Step 3. For any node n_o and its sibling node(s) n_s:

(a) If node n_o is a leaf and singleton node (i.e. it does not have any sibling node), its parent node will be removed from the CDR-Tree and node n_o will be pulled up. This situation is illustrated in Fig. 5a.

(b) If node n_o is an internal and singleton node, the parent node of n_o will be removed and the sub-tree rooted at n_o will be pulled up. This situation is illustrated in Fig. 5b.

(c) If n_s has the same splitting value as that of n_o, and n_o and n_s are all leaf nodes, the CDR-Tree will merge them into a single node n_m. The target class of n_m is assigned by a majority vote. This situation is illustrated in Fig. 5c and d.

(d) If n_s has the same splitting value as that of n_o but not all of them are leaf nodes, the CDR-Tree will pick out the internal node n_m, which contains the most instances among all internal nodes n_s. Except for the sub-tree ST_m rooted at n_m, all sibling nodes and their sub-trees are then removed from the CDR-Tree. The instances which belong to these removed leaf nodes and sub-trees are migrated into the internal node n_m and will follow the path of ST_m until they reach a leaf node, as Fig. 5e illustrates. Note that a migrant instance may stop in an internal node n_I of ST_m if there is no branch to proceed. In such a condition, the CDR-Tree will use the splitting attribute in n_I to generate a new branch and accordingly a new leaf node, as illustrated in Fig. 5f. The target class of the leaf nodes in ST_m and the newly generated leaf node(s) are then assigned by a majority vote.

Step 4. Repeat Step 2 until no more nodes can be merged.

Step 5. If there is a leaf node that is not pure, continue splitting it.
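A simplified Python sketch of the extraction procedure is given below. It covers only Step 1 and cases (a) to (c) of Step 3; case (d), which migrates instances into an internal node, and Step 5 are deliberately left out. The `Node` class, the function names, and the " -> " separator are hypothetical choices made for this illustration, and the one-level tree used as input is a toy split of Table 3 on fever, not the CDR-Tree of Fig. 3.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    attribute: Optional[str] = None                            # None marks a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # "old -> new" value -> child
    classes: Counter = field(default_factory=Counter)          # leaf: paired class -> count


def side_of(pair, side):
    """Return the old (side=0) or new (side=1) half of an "old -> new" value."""
    return pair.split(" -> ")[side]


def extract(node, side):
    """Project a CDR-Tree onto one data block (Step 1 and Steps 3a to 3c only)."""
    if node.attribute is None:
        projected = Counter()
        for pair, count in node.classes.items():
            projected[side_of(pair, side)] += count
        return Node(classes=projected)

    # Step 1: keep one half of every branch value; children are processed first (bottom up).
    groups = {}
    for pair, child in node.children.items():
        groups.setdefault(side_of(pair, side), []).append(extract(child, side))

    merged = {}
    for value, siblings in groups.items():
        if len(siblings) == 1:
            merged[value] = siblings[0]
        elif all(s.attribute is None for s in siblings):
            # Step 3(c): merge leaves; the majority class follows from the summed counts.
            merged[value] = Node(classes=sum((s.classes for s in siblings), Counter()))
        else:
            raise NotImplementedError("Step 3(d) is not covered by this sketch")

    # Steps 3(a) and 3(b): a singleton child replaces its parent.
    if len(merged) == 1:
        return next(iter(merged.values()))
    return Node(attribute=node.attribute, children=merged)


# Toy one-level CDR-Tree: Table 3 split on fever only.
root = Node(attribute="fever", children={
    "Yes -> No": Node(classes=Counter({"Influenza -> Healthy": 2, "Pneumonia -> Healthy": 1})),
    "Yes -> Yes": Node(classes=Counter({"Pneumonia -> Influenza": 1, "Pneumonia -> Pneumonia": 1})),
    "No -> No": Node(classes=Counter({"Influenza -> Influenza": 1})),
    "No -> Yes": Node(classes=Counter({"Healthy -> Pneumonia": 2, "Healthy -> SARS": 2})),
})

old_tree = extract(root, 0)
new_tree = extract(root, 1)
for value, leaf in old_tree.children.items():
    print(value, leaf.classes.most_common(1)[0][0])
```

Because only the per-leaf counts are combined, no pass over the original instances is needed in this simplified setting, which mirrors the remark that the count information kept in each node keeps extraction cheap.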

Due to the merging strategy, some leaf nodes in a CDR-Tree might not be pure. The goal of Step 5 is to solve this problem. However, this step can be omitted if users do not really need an overly detailed decision tree. Note that the CDR-Tree keeps the count information in each node during its building step; therefore, the computational cost of this extraction procedure is small. Compared to building a decision tree from the beginning, the CDR-Tree can generate the decision tree much more efficiently and quickly. Fig. 6 is the pseudo code of the CDR-Tree’s extraction procedure. Taking the CDR-Tree in Fig. 3 as an example, the extracted decision trees are shown in Fig. 7, where Fig. 7a is the old classification model for Table 1 without implementing Step 5; Fig. 7b is still the model for Table 1 but with the implementation of Step 5; and Fig. 7c and d correspond to Table 2. By comparing these results to those in Figs. 1 and 2, we find that without the implementation of Step 5, there is only 1 misclassified instance in Fig. 7a and there are 2 in Fig. 7c. When Step 5 is executed, Fig. 7b and d reach 100% accuracy, as do Figs. 1 and 2. Furthermore, Fig. 7d is identical to Fig. 2, but Fig. 7b is a little different from Fig. 1. From the example we see that although the decision tree extracted from the CDR-Tree is not proved to be identical to that built from the beginning, it can reach a comparable accuracy even without the implementation of Step 5.

3.2.3. Reducing the complexity of concept-drifting rules

Even though the main purpose of CDR-Trees is to clearly discover concept drift rules and to utilize rule support (RS) and rule confidence (RC) to filter out meaningless ones, readers might wonder whether our approach has a higher computational cost due to the fact that CDR-Trees are more complex than traditional decision trees. Nevertheless, it is not as bad as imagined. Here we discuss the corresponding computational cost under the conditions of concept stability and concept drift. First, suppose that there are no drifting concepts: the number of attributes, the number of instances, and the number of values for an attribute in the integrated dataset would all be the same as those in the old or new dataset. Therefore, the computational cost is very similar. The extra cost of CDR-Trees is the integration of new and old instances from different time points and the extraction procedure. However, such computational costs are trivial when compared to those of building a decision tree. When there are drifting instances, the number of attributes and the number of instances remain the same, but the number of values for an attribute might increase. The increase of attribute values burdens the building of a node. Suppose a dataset contains i attributes and each attribute has j kinds of values; the computational cost for building a node in a CDR-Tree would be ij times that of a traditional decision tree. The worst case occurs only when the drifting ratio (i.e. the proportion of drifting instances to all instances) is 100% and all values of an attribute in the old dataset change to different values in the new dataset. In the worst case, each attribute in the integrated dataset would contain j^2 kinds of values. However, the worst case should rarely happen since in most real datasets: (a) the drifting ratio should not be too large; (b) drifting instances should gather in some specific areas in the dimensional space of attributes, i.e., there must be some meaningful attribute values that cause concept drift. For example, age and salary may influence credit card applications but weight and height will not; IP address and the number of sending packages may be the main basis for finding a PC which sends virus packages.

Of course, many good discretization algorithms (Kurgan & Cios, 2004; Lee, Tsai, Yang, & Yang, 2007; Liu, Hussain, Tan, & Dash, 2002) can be used to preprocess the integrated dataset to speed up the construction of the CDR-Tree. In this paper, we also propose two strategies: T-strategy and V-strategy. The main goal of T-strategy is to reduce the number of concept-drifting rules, and it is adopted after a CDR-Tree is built; in contrast, V-strategy is used to simplify the training dataset to speed up the building of the CDR-Tree.

3.2.3.1. T-strategy. If the taxonomy tree of an attribute is given, we can remove some branches from the CDR-Tree and merge some produced drifting rules. For example, suppose the taxonomy tree of attribute “location” is shown in Fig. 8a: the CDR-Tree in Fig. 8b can be simplified as Fig. 8c. Similarly, the two discovered concept drift rules:

If (fever = “no → yes”) and (work = “Chicago → Shanghai”) then (diagnosis = “healthy → influenza”),

If (fever = “no → yes”) and (work = “New York → Shanghai”) then (diagnosis = “healthy → influenza”)

can be merged into only one rule:

If (fever = “no → yes”) and (work = “USA → Shanghai”) then (diagnosis = “healthy → influenza”).
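The rule merging performed by the T-strategy can be sketched in Python as follows. The taxonomy below is a hypothetical fragment with only two parent links (the full tree of Fig. 8a is not reproduced in the text), and the rule representation and function names are our own assumptions. The sketch merges the rule conditions only; combining the RS and RC values of merged rules is not shown.

```python
def generalize(value, taxonomy):
    """Replace a value by its parent in the taxonomy tree, if it has one."""
    return taxonomy.get(value, value)


def merge_rules(rules, attribute, taxonomy):
    """Generalize both halves of `attribute` and drop rules that become identical."""
    merged = []
    for conditions, outcome in rules:
        conditions = dict(conditions)  # copy, so the input rules stay unchanged
        if attribute in conditions:
            old, new = conditions[attribute]
            conditions[attribute] = (generalize(old, taxonomy), generalize(new, taxonomy))
        rule = (conditions, outcome)
        if rule not in merged:
            merged.append(rule)
    return merged


# Hypothetical fragment of a taxonomy: child value -> parent value.
taxonomy = {"Chicago": "USA", "New York": "USA"}

# Each rule is (conditions, outcome); every value is an (old, new) pair.
rules = [
    ({"fever": ("no", "yes"), "work": ("Chicago", "Shanghai")}, ("healthy", "influenza")),
    ({"fever": ("no", "yes"), "work": ("New York", "Shanghai")}, ("healthy", "influenza")),
]

merged = merge_rules(rules, "work", taxonomy)
print(merged)
```

Since "Shanghai" has no parent in this fragment, it is left unchanged, so both rules collapse into the single rule with work = “USA → Shanghai”.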

3.2.3.2. V-strategy. If two people of different weights simultaneously show an abnormal increase of K kilograms, their health is at risk. That is to say, for some attributes, we can regard the variance K as the cause of concept drift even if the original values are different. We define such an attribute as a variance attribute in the following:

Definition 1. For a continuous attribute M and two data blocks T_p and T_q, assume that the attribute values of instances i and j are M^p_i and M^p_j, respectively, in T_p, and that M^p_i and M^p_j vary into M^q_i and M^q_j, respectively, in T_q, where M^p_i ≠ M^q_i and M^q_j ≠ M^p_j. If M^p_i − M^q_i = M^p_j − M^q_j = v (v ≥ 1), and the variance v of attribute M would make the concept of both instances i and j drift from c to c', attribute M is called a variance attribute.

Fig. 7. The extracted decision trees from Fig. 3: (a) the model of Table 1 without implementing Step 5; (b) the model of Table 1 with the implementation of Step 5; (c) the model of Table 2 without implementing Step 5; and (d) the model of Table 2 with the implementation of Step 5.

A scheme governing the variance attribute “weight” is illustrated in Fig. 9a. It means an increase of 0 or 5 kg in weight has the same influence on concept drift. Similarly, the increase of 12 or 25 kg would cause the same degree of concept drift. In real datasets, there may be some variance attributes such as weight and blood pressure. With the given scheme, the CDR-Tree can reduce the number of attribute values after data integration and prevent the decision tree from being too large and complex. Fig. 9b and c are illustrations of a CDR-Tree, where Fig. 9c is the simplified version of Fig. 9b created by applying the V-strategy. Note that the V-strategy is different from the discretization algorithms mentioned above.
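The effect of such a scheme can be sketched in Python as follows. The interval boundaries and labels are hypothetical, since the actual scheme of Fig. 9a is not reproduced in the text; they are chosen only so that increases of 0 and 5 kg fall into one interval and increases of 12 and 25 kg into another, as in the example above. The function name `variance_label` is likewise an assumption.

```python
import bisect


def variance_label(old_value, new_value, boundaries, labels):
    """Map the change (new - old) of a variance attribute to the label of its interval."""
    change = new_value - old_value
    # bisect_right returns the index of the interval that contains `change`.
    return labels[bisect.bisect_right(boundaries, change)]


# Hypothetical scheme for "weight": below 0 kg, 0 to under 10 kg, 10 kg or more.
boundaries = [0, 10]
labels = ["decrease", "small increase", "large increase"]

# Different original weights, same kind of change: one attribute value each.
print(variance_label(60, 65, boundaries, labels))
print(variance_label(82, 82, boundaries, labels))
print(variance_label(70, 82, boundaries, labels))
print(variance_label(55, 80, boundaries, labels))
```

Replacing each "old → new" pair of a variance attribute by such a label reduces the number of distinct attribute values in the integrated dataset to the number of intervals in the scheme.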

4. Experimental analysis and performance evaluation

In this section, we implement the CDR-Tree algorithm in Microsoft Visual C++ 6.0 for experimental analysis and performance evaluation. The experimental environment and datasets are described in Section 4.1. In Section 4.2, we demonstrate how the accuracy of CDR-Trees is affected by different drift levels. The effectiveness of the V-strategy is evaluated in Section 4.3. We compare the accuracy of C4.5 to that of the model extracted from the CDR-Tree in Section 4.4. Finally, the comparison of execution time among the CDR-Tree, the model extracted from the CDR-Tree, and C5.0 is given in Section 4.5.

4.1. Experimental environment and datasets

All experiments in this paper were run on a 3.0 GHz Pentium IV machine with 512 MB of DDR memory, running Windows 2000 Professional. Experimental datasets are generated by IBM Data Generator (Agrawal, Ghosh, Imielinski, Iyer, & Swami, 1992). We use IBM Data Generator instead of the hyperplane synthetic data used in (Kolter & Maloof, 2003) since IBM Data Generator is a public and widely used data generator. In addition, IBM Data Generator has several well-defined classification functions and parameters that can be used to generate datasets with different characteristics, which allows us to evaluate our CDR-Tree on several kinds of datasets. A dataset generated by IBM Data Generator contains one Boolean target class and nine basic attributes: salary, commission, loan, age, zip code, h-years, h-value, e-level, and car. Among the nine attributes, zip code, e-level, and car are categorical, and the others are continuous. In our experiments, four classification functions, P3, P5, P43, and P45, are randomly selected to generate the experimental datasets.

Fig. 8. Illustrations of T-strategy: (a) the taxonomy tree of attribute "location"; (b) a CDR-Tree; and (c) the simplified CDR-Tree of (b) by using T-strategy.

Fig. 9. Illustrations of the V-strategy: (a) the scheme of the variable attribute "weight"; (b) a CDR-Tree; and (c) the simplified CDR-Tree of (b) by using the V-strategy.

In order to analyze the performance of our CDR-Tree under different drifting ratios R% (i.e., the proportion of drifting instances to all instances), we use the four functions mentioned above to generate the required experimental datasets. For each function, the noise level is set to 5% and the dataset generated by IBM Data Generator is regarded as the original (first) dataset in the data stream. We then wrote a program that amends the first dataset to generate a second, new dataset. The program works as follows. First, it randomly picks one instance S in the original dataset and randomly selects attributes $a_m$ $(0 < m < 6)$ for reference. The instances that have the same values as S in all attributes $a_m$ are then picked out. The class label and the values of $a_m$ in these picked-out instances are replaced by random values from the corresponding value domains. The main principle of our program is that concept drifts are caused by the variances of some attributes. We limit the number of reference attributes to fewer than five since drifting concepts should be caused by some, but not many, attribute values and there are only nine basic attributes in IBM Data Generator. If the number of drifting instances is less than required, the program enters the next loop to obtain more drifting instances. Conversely, if more instances than required have been picked out, instances amounting to R% are randomly chosen from them as the drifting ones. As a result, each function yields five new datasets with different drifting ratios. A total of 4 old datasets and 20 new datasets are generated in our experiments. Every dataset includes 10,000 instances, and 10-fold cross-validation is applied in all experiments. That is, each experimental dataset is divided into 10 parts, of which nine are used as training sets and the remaining one as the testing set. In the following experiments, we use D(i) to denote a dataset generated by the classification function Pi and D(i, R) to represent a dataset with an R% drifting ratio derived from D(i).
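The drift-injection procedure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the program used in the experiments. The function name `inject_drift`, the dictionary representation of instances, the `"class"` key, and the toy value domains are assumptions. The sketch also assumes that all instances picked out in one loop receive the same random replacement values, a point the description leaves open.

```python
import random


def inject_drift(dataset, domains, class_values, ratio, rng, max_ref=5):
    """Return (new_dataset, drifted_indices) for a target drifting ratio.

    dataset      -- list of dicts with one key per attribute plus "class"
    domains      -- dict mapping each attribute name to its possible values
    class_values -- possible class labels
    ratio        -- fraction of instances that should drift (R / 100)
    """
    n_target = round(len(dataset) * ratio)
    attrs = list(domains)
    candidates = {}  # instance index -> drifted copy of the instance
    while len(candidates) < n_target:
        free = [k for k in range(len(dataset)) if k not in candidates]
        seed = dataset[rng.choice(free)]  # the randomly picked instance S
        n_ref = rng.randint(1, min(max_ref, len(attrs)))  # 0 < m < 6
        ref = rng.sample(attrs, n_ref)  # reference attributes a_m
        # Assumption: one random replacement per attribute, shared by all
        # instances picked out in this loop.
        new_values = {a: rng.choice(domains[a]) for a in ref}
        new_values["class"] = rng.choice(class_values)
        for k in free:
            if all(dataset[k][a] == seed[a] for a in ref):
                candidates[k] = {**dataset[k], **new_values}
    # If more instances than required were picked out, keep a random subset.
    drifted = set(rng.sample(sorted(candidates), n_target))
    new_dataset = [candidates[k] if k in drifted else dict(inst)
                   for k, inst in enumerate(dataset)]
    return new_dataset, drifted


# Toy usage with three hypothetical categorical attributes.
rng = random.Random(0)
domains = {"age": [20, 40, 60], "car": [1, 2, 3], "loan": [0, 1]}
old = [{**{a: rng.choice(v) for a, v in domains.items()},
        "class": rng.choice([0, 1])} for _ in range(200)]
new, drifted = inject_drift(old, domains, [0, 1], 0.10, rng)
```

Because the replacement class label is drawn at random, a selected instance may end up with its original label. A stricter variant could redraw until the label differs, but the description above does not require this.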

4.2. The analysis of CDR-Tree

In this section, we use the 24 datasets mentioned in Section 4.1 to evaluate the accuracy of CDR-Trees and to analyze whether they can precisely capture the concept drift rules. First, focusing on five different drift levels, the accuracy of the CDR-Tree on the 20 integrated datasets is shown in Fig. 10. As this figure shows, the CDR-Tree maintains high accuracy on all 20 datasets. However, it is worth noting that the higher the concept-drifting ratio is, the lower the accuracy of the CDR-Tree will be. This is because a higher drifting ratio makes the CDR-Tree more complex. To further analyze whether the concept-drifting rules produced by the CDR-Tree can accurately predict the drifting instances, for each experimental dataset we select from the testing data only the instances that really have a drifting concept to calculate the accuracy. The experimental result is shown in Fig. 11. As expected, the concept drift rules mined by the CDR-Tree can accurately predict those drifting instances.
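The two accuracy measures used in this section, over all testing instances and over the drifting instances only, can be sketched as follows. The function name `accuracy`, the representation of a label as an (old class, new class) pair, and the toy predictions are assumptions made for illustration. They are not the evaluation code or the data of the experiments.

```python
def accuracy(predicted, actual, mask=None):
    """Fraction of correct predictions, optionally restricted by a mask."""
    if mask is None:
        mask = [True] * len(actual)
    selected = [(p, a) for p, a, m in zip(predicted, actual, mask) if m]
    if not selected:
        raise ValueError("no instances selected")
    return sum(p == a for p, a in selected) / len(selected)


# Hypothetical testing data: each label is an (old class, new class) pair.
actual = [("N", "N"), ("N", "Y"), ("Y", "Y"), ("Y", "N"), ("N", "N")]
predicted = [("N", "N"), ("N", "Y"), ("Y", "Y"), ("Y", "Y"), ("N", "Y")]

# In this sketch an instance counts as drifting when its class label changed.
drifting = [old != new for old, new in actual]

overall = accuracy(predicted, actual)               # all five instances
drift_only = accuracy(predicted, actual, drifting)  # the two drifting ones
```

Reporting the second measure separately matters because drifting instances are a minority at low drifting ratios, so a model could score well overall while still mispredicting most of them.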

Fig. 10. The accuracy of CDR-Trees under five different drifting ratios. [Plot: accuracy (%) against concept drift ratio (5, 10, 15, 20, and 30%) for D(3), D(43), D(5), and D(45).]

[Fig. 11 plot: accuracy (%) at drifting ratios of 5, 10, 15, 20, and 30% for D(3), D(43), D(5), and D(45).]

4.3. The comparison between CDR-Trees and V-CDR-Trees

Due to the lack of background knowledge, we are not able to produce a proper taxonomy tree for our experimental datasets to evaluate our T-strategy. However, to analyze whether the V-strategy can effectively reduce the complexity of the CDR-Tree, the 4 old datasets and 20 new ones mentioned in Section 4.1 are used again. We use V-CDR to denote a CDR-Tree with the V-strategy in this experiment. A simple variance scheme |v| = 1 is used for all continuous attributes. The experimental results are shown in Figs. 12–15. As can be seen from the four figures, the number of nodes in V-CDR-Trees is significantly smaller than that of CDR-Trees at all drift ratios. In conclusion, the V-strategy can effectively lower the complexity of the CDR-Tree.
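The complexity measure used here, the number of nodes in a tree, can be sketched as follows. The nested-dictionary tree representation, the function name `count_nodes`, and the two toy tree fragments are hypothetical. They only illustrate why replacing (old, new) value pairs by their variance tends to merge branches.

```python
def count_nodes(tree):
    """Count internal and leaf nodes of a tree given as nested dicts."""
    if not isinstance(tree, dict):  # a leaf holds a class label
        return 1
    return 1 + sum(count_nodes(child) for child in tree["children"].values())


# Hypothetical CDR-Tree fragment: one branch per distinct (old, new) pair.
cdr = {"attr": "weight", "children": {
    (60, 65): "(healthy, at risk)",
    (80, 85): "(healthy, at risk)",
    (55, 67): "(healthy, ill)",
    (90, 102): "(healthy, ill)",
}}

# The same fragment after each pair is replaced by its variance (5 or 12).
v_cdr = {"attr": "weight", "children": {
    5: "(healthy, at risk)",
    12: "(healthy, ill)",
}}

print(count_nodes(cdr), count_nodes(v_cdr))
```

In this toy case the fragment shrinks from five nodes to three because branches with equal variance and equal outcome coincide.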

In order to compare the accuracy of the mined concept drift rules between CDR-Trees and V-CDR-Trees, for each experimental dataset, the testing instances that really have a drifting concept are used to calculate accuracy. The experimental results are shown in Figs. 16–19. We can see that the accuracy of the V-CDR-Tree is only slightly lower than that of the CDR-Tree in these cases, even though we use a very simple variance scheme.

Fig. 13. The comparison of number of nodes using dataset D(43). [Plot: number of nodes of CDR and V-CDR at drifting ratios of 5, 10, 15, 20, and 30%.]

[Plots for D(3), D(5), and D(45): number of nodes of CDR and V-CDR at drifting ratios of 5, 10, 15, 20, and 30%.]

Fig. 16. The comparison of accuracy of concept drift rules using dataset D(3). [Plot: accuracy (%) of CDR and V-CDR at drifting ratios of 5, 10, 15, 20, and 30%.]

Fig. 17. The comparison of accuracy of concept drift rules using dataset D(43). [Plot: accuracy (%) of CDR and V-CDR at drifting ratios of 5, 10, 15, 20, and 30%.]

4.4. The comparison of accuracy between E-CDR-Trees and C5.0

In this experiment, we evaluate whether our approach described in Section 3.2.2 can accurately extract classification models from CDR-Trees. First, all 24 datasets are used by C5.0 to build decision trees. For the 20 CDR-Trees, the old and new classification models are extracted as E-CDR-Trees. Because of space limitations and because the results on the 24 datasets are very similar, we only show the accuracy for the datasets with a 10% drifting ratio. The results are shown in Fig. 20, from which we can see that the accuracy of E-CDR-Trees is similar to that of C5.0. This demonstrates the accuracy of the extraction strategy described in Section 3.2.2.

4.5. The comparison of execution time among CDR-Trees, E-CDR-Trees, and C5.0

The motivations behind CDR-Trees and C5.0 are inherently different: the CDR-Tree algorithm mainly aims at providing concept-drifting rules and quickly extracting the prediction model if users require it, whereas C5.0 is primarily designed to build a decision model to predict unseen data. Comparing their execution times might therefore be unfair. However, in order to give readers a clear overview of our approach, we show the comparison of execution time among a CDR-Tree, an E-CDR-Tree, and C5.0 in Fig. 21. The execution time for C5.0 on dataset D(i, R) denotes the total time to build two models, one on dataset D(i) and one on D(i, R); that for E-CDR denotes the total time to extract the old and new decision trees. Again, because of space limitations and because the results on all datasets are very similar, we only show the execution time for the datasets generated by function P43. As expected, building the CDR-Tree takes more execution time than C5.0 since its training dataset is more complicated than that used by C5.0. However, the time required to extract the decision trees from the CDR-Tree is much less than that required for C5.0. This demonstrates that, given a CDR-Tree, our extraction strategy proposed in Section 3.2.2 can obtain the classification model more efficiently than building it from scratch.

5. Conclusions and future research direction

Recently, concept drift has become a popular research issue in the field of data mining. Even though many scholars have proposed different methods, they all focus only on updating the classification model and are unable to elucidate the main causes of concept drift. However, decision makers might be more interested in the rules of concept drift. In this paper, we address this issue and propose the Concept Drift Rule Mining Tree (CDR-Tree) algorithm to solve this problem. Our CDR-Tree can not only produce drifting rules but also efficiently extract the classification model of each data block, which gives decision makers a wide range of applications. The T-strategy and V-strategy, which need the support of users in the corresponding domain, are also proposed to simplify the CDR-Tree and the mined rules. The experimental results in Section 4 show the accuracy of the CDR-Tree and the efficiency of our extraction strategies.

Fig. 18. The comparison of accuracy of concept drift rules using dataset D(5). [Plot: accuracy (%) of CDR and V-CDR at drifting ratios of 5, 10, 15, 20, and 30%.]

Fig. 19. The comparison of accuracy of concept drift rules using dataset D(45). [Plot: accuracy (%) of CDR and V-CDR at drifting ratios of 5, 10, 15, 20, and 30%.]

Fig. 20. The comparison of accuracy between E-CDR-Tree and C5.0 using four datasets with 10% drifting ratio. [Plot: accuracy (%) of E-CDR and C5.0 on D(3,10), D(43,10), D(5,10), and D(45,10).]

Fig. 21. The comparison of execution time among CDR-Tree, E-CDR-Tree, and C5.0 by using datasets D(43). [Plot: execution time (sec.) of CDR, E-CDR, and C5.0 on D(43,5), D(43,10), D(43,15), D(43,20), and D(43,30).]

There are still many issues worth further investigation. First, in this paper we only analyze the cases in which there are only two data blocks in a data stream. If more than two blocks must be analyzed, the CDR-Tree will become much larger and more complicated. Therefore, one of our future focuses is to extend the CDR-Tree so that it can process multi-block concept drift problems more efficiently. Secondly, although we propose two strategies to reduce the complexity of our CDR-Tree, for a given dataset there might be no given taxonomy tree or variance attributes. Therefore, another interesting problem is to propose a method to generate a taxonomy tree or to find variance attributes automatically. Thirdly, an existing discretization algorithm can also be used to reduce the complexity of CDR-Trees. However, since the integrated dataset used in CDR-Trees is different from a traditional dataset, another future focus is to propose a discretization algorithm that is better suited to CDR-Trees. Finally, although our extraction methods can efficiently extract the classification model for each data block from CDR-Trees, and the extracted model can reach accuracy comparable to that of a decision tree built from scratch, it would be interesting to find a way to extract an identical tree.

References

Agrawal, R., Ghosh, A., Imielinski, T., Iyer, B., & Swami, A. (1992). An interval classifier for database mining applications. In Proceedings of the 18th conference on very large databases.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261–283.

Cunningham, P., & Nowlan, N. (2003). A case-based approach to spam filtering that can track concept drift. In Proceedings of the ICCBR workshop on long-lived CBR systems.

Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth international conference on knowledge discovery and data mining (pp. 71–80). Boston.

Fan, H., & Ramamohanarao, K. (2006). Fast discovery and the generalization of strong jumping emerging patterns for building compact and accurate classifiers. IEEE Transactions on Knowledge and Data Engineering, 18(6), 721–737.

Fan, W. (2004). Systematic data selection to mine concept-drifting data streams. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 128–137).

Freitas, A. A. (2000). Understanding the crucial differences between classification and discovery of association rules. SIGKDD Explorations, 2(1), 65–69.

Furnkranz, J., & Widmer, G. (1994). Incremental reduced error pruning. In Proceedings of the 11th international conference on machine learning (pp. 70–77). San Francisco.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publisher.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. 97–106). San Francisco.

Jin, R., & Agrawal, G. (2003). Efficient decision tree construction on streaming data. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 571–576). Washington.

Klinkenberg, R. (2001). Using labeled and unlabeled data to learn drifting concepts. Workshop notes of the IJCAI-01 workshop on learning from temporal and spatial data (pp. 16–24). CA.

Klinkenberg, R., & Renz, I. (1998). Adaptive information filtering: Learning in the presence of concept drifts. Workshop notes of the ICML-98 workshop on learning for text categorization (pp. 33–40). CA.

Kolter, J. Z., & Maloof, M. A. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. In Proceedings of the third international IEEE conference on data mining (pp. 123–130). Melbourne, FL.

Koychev, I. (2000). Gradual forgetting for adaptation to concept drift. In Proceedings of ECAI 2000 workshop current issues on spatio-temporal reasoning. Germany.

Kurgan, L., & Cios, K. J. (2004). CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2), 145–153.

Lane, T., & Brodley, C. E. (1998). Approaches to online learning and concept drift for user identification in computer security. In Proceedings of the fourth international conference on knowledge discovery and data mining (pp. 259–263). New York.

Lazarescu, M., & Venkatesh, S. (2004). Using multiple windows to track concept drift. Intelligent Data Analysis Journal, 8(1), 29–59.

Lee, C. I., Tsai, C. J., Wu, T. Q., & Yang, W. P. (2008). A multi-relational classifier for imbalanced database. Expert Systems with Applications, 36(3), 2008.

Lee, C. I., Tsai, C. J., Yang, Y. R., & Yang, W. P. (2007). A top-down and greedy method for discretization of continuous attributes. In Proceedings of the fourth international conference on fuzzy systems and knowledge discovery. Haikou, China.

Lee, C. I., Tsai, C. J., Wu, J. H., & Yang, W. P. (2007). A decision tree-based approach to mining the rules of concept drift. In Proceedings of the fourth international conference on fuzzy systems and knowledge discovery. Haikou, China.

Lee, C. I., Tsai, C. J., & Ku, C. W. (2006). An evolutionary and attribute-oriented ensemble classifier. In Proceedings of the international conference on computational science and its applications (pp. 1210–1218).

Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. Journal of Data Mining and Knowledge Discovery, 6(4), 393–423.

Maloof, M. (2003). Incremental rule learning with partial instance memory for changing concepts. In Proceedings of the international joint conference on neural networks. CA.

Maloof, M. A., & Michalski, R. S. (2002). Incremental learning with partial instance memory. In Proceedings of the 13th international symposium on methodologies for intelligent systems. Lyon, France.

Menzies, T. (2003). Data mining for very busy people. In Proceedings of the international IEEE conference on data mining (pp. 22–29).

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publisher.

Rastogi, R., & Shim, K. (1998). PUBLIC: a decision tree classifier that integrates building and pruning. In Proceedings of the 24th international conference on very large databases (pp. 404–415).

Street, W., & Kim, Y. (2001). A streaming ensemble algorithm for large-scale classification. In Proceedings of the seventh international conference on knowledge discovery and data mining (pp. 377–382). NY.

Tsai, C. J., Lee, C. I., Chen, C. T., & Yang, W. P. (2007). A multivariate decision tree algorithm to mine imbalanced data. WSEAS Transactions on Information Science and Applications, 4(1), 50–58.

Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning, 4(2), 161–186.

Utgoff, P., Berkman, N., & Clouse, J. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning, 29(1), 5–44.

Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 226–235). Washington, DC.

Wang, L., Zhao, H., Dong, G., & Li, J. (2006). On the complexity of finding emerging patterns. Theoretical Computer Science, 335(1), 15–27.

Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101.