
3 Sensitive Concept Drift Probing Decision Tree Algorithm

3.3 Experiment and Analysis

3.3.3 The Comparison of Execution Time

In the comparison of execution time in Figures 3.7 to 3.9, SCRIPT, DNW, and CVFDT take similar execution time to build the first decision tree on data block B0. However, SCRIPT and CVFDT require a little more execution time in this initial step, since SCRIPT needs to calculate the CDAVs and CVFDT must record the counts in each node.

After the initial step, SCRIPT is much more efficient than DNW and CVFDT while the concept is stable, as in time steps 1 and 2. When the concept drifts, the execution time of SCRIPT is worse than that of CVFDT in most cases. However, this is because SCRIPT recognizes the drifting concepts and therefore needs more execution time to correct the original decision tree in these time steps. It is worth noting that when both SCRIPT and CVFDT detect the drifting instances and correct the decision tree, e.g., in time steps 7, 8, and 12 in Figure 3.8, SCRIPT is more efficient than CVFDT. The reason is that SCRIPT can immediately identify which sub-trees should be amended by checking the drifting CDAVs, whereas CVFDT has to check the variation of information gain node by node from the root. A similar situation can be found in time steps 6 and 7 on the ‘thy’ and ‘spambase’ datasets. DNW, which builds a new classifier in each time step, always has the worst computational cost. Its cost grows even worse as time goes by, since DNW does not recognize the concept drift and therefore mixes each new data block into the old one.


Figure 3.7 The comparison of execution time on dataset ‘satimage’.


Figure 3.8 The comparison of execution time on dataset ‘thy’.

Figure 3.9 The comparison of execution time on dataset ‘spambase’. (Figures 3.7 to 3.9 plot execution time (ms) against time step (t) for SCRIPT, DNW, and CVFDT.)

Chapter 4

Concept Drift Rule Mining Tree Algorithm

In the previous chapter, we proposed SCRIPT as a solution that sensitively and efficiently handles the concept-drifting problem on data streams. However, like most existing approaches to concept drift, SCRIPT focuses on updating the classification model so as to accurately predict newly arriving data. Users, though, might be more interested in the rules of concept drift. For example, doctors desire to know the main causes of disease variation, scholars long for the rules of weather transition, and sellers would like to find out why consumers’ shopping habits change. In this chapter, we focus on this problem. We first use an example to illustrate the problem of concept-drifting rules in Section 4.1. Then, the details of the concept drift rule mining tree (CDR-Tree) algorithm are elucidated in Section 4.2. Experimental analyses and performance evaluations are given in Section 4.3.

4.1 The Rules of Concept Drift

Here, we use a simple example to formally introduce the concept-drifting rules.

Example 4.1: Take the patients’ diagnostic data in Table 4.1 as an example and assume that Table 4.2 is the new diagnostic dataset. In Table 4.2, the drifting values are marked with both underline and boldface. Instances with the same ID in the two tables are diagnostic data belonging to the same patient. Figures 4.1 and 4.2 are the decision trees constructed from Tables 4.1 and 4.2, respectively. Comparing Figure 4.1 with Figure 4.2, we find that patients ID9 and ID10 are located in leaf node A in the old decision tree and in node B in the new one.

The corresponding decision rules are:

If (fever = ‘no’) and (cough = ‘no’) then (diagnosis = ‘healthy’) and

If (fever = ‘yes’) and (workplace = ‘S’) and (cough = ‘yes’) then (diagnosis = ‘SARS’).

Comparing the rules covering these two patients, we can see that a patient might be infected with SARS if he develops a fever, his workplace is transferred to city S, and he develops a cough.

Simply stated, the variations of the attributes ‘fever’, ‘workplace’, and ‘cough’ are the primary factors influencing this concept drift. The concept drift rule detected from the two patients can be written in the form:

If (fever = ‘no→yes’) and (workplace = ‘C→S’) and (cough = ‘no→yes’) then (diagnosis = ‘healthy→SARS’).

In this example, owing to the few instances and very simple rules, users can clearly and quickly find the concept-drifting rules between the two datasets. In a real application, however, it is very difficult for users to figure out such rules since the number of produced rules is usually very large.
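To make the derivation concrete, the following Python sketch pairs a patient’s old and new records attribute by attribute and keeps only the attributes whose values drifted. The record layout and the helper name drift_rule are our assumptions for illustration, not part of the thesis implementation, and ‘->’ stands in for ‘→’:

# A minimal sketch: derive a concept-drift rule by pairing one patient's
# old and new diagnostic records attribute by attribute.
def drift_rule(old, new, class_attr="diagnosis"):
    """Build a rule string such as
    "If (fever = 'no->yes') ... then (diagnosis = 'healthy->SARS')"."""
    conditions = []
    for attr in old:
        if attr == class_attr:
            continue
        if old[attr] != new[attr]:  # keep only attributes whose value drifted
            conditions.append(f"({attr} = '{old[attr]}->{new[attr]}')")
    head = " and ".join(conditions)
    return (f"If {head} then "
            f"({class_attr} = '{old[class_attr]}->{new[class_attr]}')")

# Patient ID9 from Tables 4.1 and 4.2:
old9 = {"sex": "female", "workplace": "C", "fever": "no", "cough": "no",
        "diagnosis": "healthy"}
new9 = {"sex": "female", "workplace": "S", "fever": "yes", "cough": "yes",
        "diagnosis": "SARS"}
print(drift_rule(old9, new9))
# If (workplace = 'C->S') and (fever = 'no->yes') and (cough = 'no->yes')
# then (diagnosis = 'healthy->SARS')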

Table 4.1 The patients’ diagnostic data

ID sex workplace fever cough diagnosis

1 male N yes no influenza

2 female C yes no influenza

3 female N yes yes pneumonia

4 male N yes yes pneumonia

5 male C yes yes pneumonia

6 male N no yes influenza

7 female N no no healthy

8 male N no no healthy

9 female C no no healthy

10 female C no no healthy

Table 4.2 The new coming diagnostic data from the same patients

ID sex workplace fever cough diagnosis

1 male C no no healthy

2 female C no no healthy

3 female S yes no influenza

4 male N no no healthy

5 male N yes no pneumonia

6 male N no yes influenza

7 female N yes no pneumonia

8 male N yes yes pneumonia

9 female S yes yes SARS

10 female S yes yes SARS

Figure 4.1 The decision tree built using Table 4.1.

Figure 4.2 The decision tree built using Table 4.2.

4.2 Concept Drift Rule Mining Tree Algorithm

In order to mine the concept-drifting rules mentioned in Section 4.1, here we propose our CDR-Tree algorithm. Section 4.2.1 is the building step of CDR-Tree. Without loss of generality, here we consider only the case that there are two data blocks: Tp and Tq in a data stream. Note that, users might also require the classification model of each data block after they check these rules of concept drift. CDR-Tree can do that via a quick and simple extraction step as is presented in Section 4.2.2.

4.2.1 Building a CDR-Tree

To mine concept-drifting rules, the CDR-Tree algorithm first integrates the new and old instances from different times into pairs; a CDR-Tree is then built in the manner of a traditional decision tree. During the building step, information gain is used as the criterion to select the best splitting attribute in each node. In other words, CDR-Tree regards each pair made by the integration of new and old data as a single attribute value and mines the rules of concept drift through the construction of a traditional decision tree. In addition, since a traditional decision tree stops splitting when a node is pure, the generated concept-drifting rules would miss some important information.
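The integration step can be sketched as follows. Assuming, as in Tables 4.1 and 4.2, that old and new instances are matched by ID, the Python below (an illustrative sketch, not our actual implementation; ‘->’ stands in for ‘→’) produces paired records in the style of Table 4.3:

# A sketch of the CDR-Tree integration step: instances with the same ID in
# the old block Tp and the new block Tq are merged into a single record
# whose attribute values are "old->new" pairs (cf. Table 4.3).
def integrate(tp, tq, id_attr="ID"):
    new_by_id = {row[id_attr]: row for row in tq}
    merged = []
    for old in tp:
        new = new_by_id[old[id_attr]]
        pair = {id_attr: old[id_attr]}
        for attr in old:
            if attr != id_attr:
                # each paired value is treated as one categorical value
                pair[attr] = f"{old[attr]}->{new[attr]}"
        merged.append(pair)
    return merged

tp = [{"ID": 1, "workplace": "N", "fever": "yes", "cough": "no",
       "diagnosis": "influenza"}]
tq = [{"ID": 1, "workplace": "C", "fever": "no", "cough": "no",
       "diagnosis": "healthy"}]
print(integrate(tp, tq))
# [{'ID': 1, 'workplace': 'N->C', 'fever': 'yes->no',
#   'cough': 'no->no', 'diagnosis': 'influenza->healthy'}]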

Example 4.2: Taking Tables 4.1 and 4.2 as our example again, the integrated data of the two tables are shown in Table 4.3, and Figure 4.3 is the corresponding CDR-Tree. As described in Example 4.1, for the patients ID9 and ID10, there is a drifting rule:

If (fever = ‘no→yes’) and (workplace = ‘C→S’) and (cough = ‘no→yes’) then (diagnosis = ‘healthy→SARS’).

However, a traditional decision tree will stop splitting at the node C in Figure 4.3 and then produce a rule:

If (fever = ‘no→yes’) and (workplace = ‘C→S’) then (diagnosis = ‘healthy→SARS’).

It is clear that the former rule is more reliable and accurate than the latter one.

Table 4.3 The integrated data of Table 4.1 and Table 4.2

ID workplace fever cough diagnosis

1 N→C yes→no no→no influenza→healthy

2 C→C yes→no no→no influenza→healthy

3 N→S yes→yes yes→no pneumonia→influenza

4 N→N yes→no yes→no pneumonia→healthy

5 C→N yes→yes yes→no pneumonia→pneumonia

6 N→N no→no yes→yes influenza→influenza

7 N→N no→yes no→no healthy→pneumonia

8 N→N no→yes no→yes healthy→pneumonia

9 C→S no→yes no→yes healthy→SARS

10 C→S no→yes no→yes healthy→SARS

Figure 4.3 The CDR-Tree built using Table 4.3.

To solve this problem, the CDR-Tree algorithm goes on splitting a pure node no in which all instances share a common value of some attribute, as long as this attribute is never selected as a splitting attribute on the path from node no to the root (a minimal sketch of this test is given after the rules below). The concept-drifting rules are marked with dotted lines in the CDR-Tree in Figure 4.3. There are five concept-drifting rules, as follows:

Rule a: If (fever = ‘no→yes’) and (workplace = ‘C→S’) and (cough = ‘no→yes’) then (diagnosis = ‘healthy→SARS’);

Rule b: If (fever = ‘no→yes’) and (workplace = ‘N→N’) then (diagnosis = ‘healthy→pneumonia’);

Rule c: If (fever = ‘yes→no’) and (cough = ‘yes→no’) then (diagnosis = ‘pneumonia→healthy’);

Rule d: If (fever = ‘yes→no’) and (cough = ‘no→no’) then (diagnosis = ‘influenza→healthy’);

Rule e: If (fever = ‘yes→yes’) and (workplace = ‘N→S’) then (diagnosis = ‘pneumonia→influenza’).

In the above rules, the values on the left and right sides of ‘→’ represent the values in the two different data blocks of a data stream. A careful look shows that the concept-drifting rule of patients ID9 and ID10 mentioned in Example 4.1 is indeed mined by this CDR-Tree.
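The continued-splitting test mentioned above can be sketched as follows; the record layout and the helper name further_split_attr are our assumptions for illustration:

# A sketch of the continued-splitting test: a pure node is split further on
# any attribute whose value is shared by every instance in the node but
# which never appears as a splitting attribute on the path to the root.
def further_split_attr(instances, used_attrs, class_attr="diagnosis"):
    for attr in instances[0]:
        if attr == class_attr or attr in used_attrs:
            continue
        if len({row[attr] for row in instances}) == 1:
            return attr          # common, not-yet-used attribute found
    return None

# Node C of Figure 4.3: both instances (ID9, ID10) agree on cough 'no->yes',
# and 'cough' was not used on the path (fever, workplace), so splitting
# continues and the rule gains the condition (cough = 'no->yes').
leaf = [{"workplace": "C->S", "fever": "no->yes", "cough": "no->yes",
         "diagnosis": "healthy->SARS"},
        {"workplace": "C->S", "fever": "no->yes", "cough": "no->yes",
         "diagnosis": "healthy->SARS"}]
print(further_split_attr(leaf, used_attrs={"fever", "workplace"}))  # cough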

In order to provide users with meaningful and interesting rules of concept drift, CDR-Tree defines a rule support RS and a rule confidence RC to filter unmeaningful ones out.

For a leaf node no in the CDR-Tree, suppose this node is assigned class label c and contains No instances; then:

RS = No and

RC = (Nc / No) × 100%,

where Nc is the number of instances with class c in node no. The default values of RS and RC are 2 and 50%, respectively. However, users can assign larger thresholds if they only want to check the notable rules. For example, by setting RS = 2 and RC = 90%, Rules c and e will be filtered out. RS and RC are shown in the form (RS, RC) in Figure 4.3. Below is the pseudo code of CDR-Tree.

CDR-Tree Algorithm

Input:
Tp, Tq: the old and the new data block;
df: default value for goal predicate;
No: the total number of instances in leaf node no;
Nc: the number of instances in leaf node no with class label c;
RS: the rule support, set to 2 by default;
RC: the rule confidence, set to 50% by default.

CDRTree(Tp, Tq, df, RS, RC)
  Integrate the instances of Tp and Tq into pairs;
  Build a decision tree on the integrated data, using information gain to select the splitting attribute of each node;
  For each pure node no
    If all instances in no share a common value of some attribute Ai that is never selected as a splitting attribute in the path from node no to the root then
      Go on splitting node no by using attribute Ai;
  Assign the class label of each leaf node by majority vote;
  For each leaf node no
    RSo = No;
    RCo = (Nc / No) × 100%;
    If RSo ≥ RS and RCo ≥ RC then
      Mark the path from this leaf node no to the root as a concept-drifting rule;
  Return the CDR-Tree;
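To make the filtering step concrete, the following Python sketch implements the RS/RC test from the pseudo code above; the leaf representation (a list of paired class labels) and the helper name is_drift_rule are our assumptions for illustration:

# A minimal sketch of the RS/RC filter: a leaf is marked as a
# concept-drifting rule only if its support and confidence reach the
# thresholds (defaults RS = 2, RC = 50%).
from collections import Counter

def is_drift_rule(leaf_labels, rs_min=2, rc_min=50.0):
    """leaf_labels: the paired 'old->new' class labels of the instances
    that fall into one leaf node."""
    n_o = len(leaf_labels)                     # RS = No
    if n_o == 0:
        return False
    _cls, n_c = Counter(leaf_labels).most_common(1)[0]
    rc = 100.0 * n_c / n_o                     # RC = (Nc / No) x 100%
    return n_o >= rs_min and rc >= rc_min

# Leaf with two 'healthy->SARS' instances (Rule a): RS = 2, RC = 100%.
print(is_drift_rule(["healthy->SARS", "healthy->SARS"]))   # True
# A singleton leaf fails the default RS threshold of 2.
print(is_drift_rule(["pneumonia->healthy"]))               # False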

4.2.2 Extracting Decision Trees from a CDR-Tree

When users require the old or the new decision tree (or both) in addition to the concept-drifting rules, the CDR-Tree algorithm can provide them efficiently and accurately via the following extraction steps:

Step 1. To extract the old (new) classification model, the splitting attribute values of all internal nodes and the class labels in all leaf nodes of the new (old) instances are ignored (see the projection sketch after Step 5).

Step 2. Check each node, from the bottom up and from left to right.

Step 3. For any node no and its sibling node(s) ns,

(a.) If node no is a leaf and a singleton node (i.e., it does not have any sibling node), its parent node will be removed from the CDR-Tree and node no will be pulled up. This situation is illustrated in Figure 4.4(a).

(b.) If node no is an internal and singleton node, the parent node of no will be removed and the sub-tree rooted at no will be pulled up. This situation is illustrated in Figure 4.4(b).

(c.) If ns has the same splitting value as no, and no and ns are all leaf nodes, CDR-Tree will merge them into a single node nm. The class label of nm is assigned by majority vote. This situation is illustrated in Figures 4.4(c) and (d).

(d.) If ns has the same splitting value as no but not all of them are leaf nodes, CDR-Tree will pick out the internal node nm that contains the most instances among all internal nodes in ns. Except for the sub-tree STm rooted at nm, all sibling nodes and their sub-trees are then removed from the CDR-Tree. The instances belonging to the removed leaf nodes and sub-trees are migrated into the internal node nm and follow the paths of STm until they reach a leaf node, as Figure 4.4(e) illustrates. Note that a migrant instance may stop at an internal node nI of STm if there is no branch for it to proceed along. In such a condition, the CDR-Tree will use the splitting attribute in nI to generate a new branch and, accordingly, a new leaf node, as illustrated in Figure 4.4(f). The target classes of the leaf nodes in STm and of the newly generated leaf node(s) are then assigned by majority vote.

Step 4. Repeat Steps 2 and 3 until no more nodes can be merged.

Step 5. If there is a leaf node that is not pure, continue splitting it.
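As a minimal sketch of the masking in Step 1 (assuming each paired value is stored as an ‘old->new’ string, as in Table 4.3; the helper name project is ours for illustration), extracting one model amounts to projecting every paired value onto its old or new component:

# Step 1 sketch: project a paired value "old->new" back to its old or new
# component when extracting a single-block decision tree.
def project(paired_value, which="old"):
    old, new = paired_value.split("->")
    return old if which == "old" else new

print(project("no->yes", which="old"))   # no  (for the old model)
print(project("no->yes", which="new"))   # yes (for the new model)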

Figure 4.4 Illustrations of the extraction strategy in CDR-Tree algorithm.

Due to the merging strategy, some leaf nodes in a CDR-Tree might not be pure. The goal of Step 5 is to solve this problem. However, this step can be omitted if users do not really need an overly detailed decision tree. Note that the CDR-Tree keeps the count information in each node during its building step; the computational cost of this extraction procedure is therefore small. Compared to building a decision tree from scratch, CDR-Tree can generate the decision tree much more efficiently. Below is the pseudo code of the CDR-Tree’s extraction procedure.

The extraction procedure of CDR-Tree

CDRTreeExtract(CDR-Tree)
  If the decision tree of the old instances is requested then
    Ignore the splitting attribute values in all internal nodes and the class labels in all leaf nodes of the new instances;
  Else
    Ignore the splitting attribute values in all internal nodes and the class labels in all leaf nodes of the old instances;
  For each set of sibling nodes that share the same splitting value
    Pick out the internal node nm with the most instances among all internal nodes;
    Remove all the sibling nodes and their sub-trees except for the sub-tree STm rooted at nm;
    Migrate the instances belonging to the removed leaf nodes and sub-trees into the internal node nm;
    For each migrant instance in STm
      If it can reach a leaf node then
        Migrate it into that leaf node;
      Else
        Migrate it into the internal node where no branch can be followed;
      End if
    For each node in the paths of STm
      If it is a leaf node and contains migrant instances then
        Assign a target class to it by majority vote;
      If it is an internal node and contains migrant instances then
        Create a new branch and corresponding leaf node(s);
        Assign a class label to the new leaf node(s) by majority vote;
  For each leaf node in the extracted tree
    If it is not pure then
      Go on splitting it;
    End if
  Return the extracted decision tree;
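As an illustration of the pull-up operations in Steps 3(a) and 3(b), the following Python sketch promotes singleton children recursively; the Node class is an assumed representation for illustration, not the data structure used in our implementation:

# A sketch of the pull-up rule: an internal node with exactly one child
# contributes no decision in the extracted tree, so its only child (leaf or
# sub-tree) is promoted into its place.
class Node:
    def __init__(self, split=None, children=None, label=None):
        self.split = split                              # splitting attribute
        self.children = children if children is not None else {}
        self.label = label                              # class label (leaves)

def pull_up_singletons(node):
    """Recursively remove internal nodes that have exactly one child."""
    for value, child in list(node.children.items()):
        node.children[value] = pull_up_singletons(child)
    if len(node.children) == 1:
        # the parent contributes no decision: promote its only child
        return next(iter(node.children.values()))
    return node

# Example: a node keeps only one branch after the other block's values are
# ignored, so its child is pulled up.
tree = Node(split="fever", children={"yes": Node(label="pneumonia")})
print(pull_up_singletons(tree).label)   # pneumonia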

Example 4.3: Taking the CDR-Tree in Figure 4.3 as an example, the extracted decision trees are shown in Figure 4.5, where Figure 4.5(a) is the old classification model for Table 4.1 without implementing Step 5; Figure 4.5(b) is the model for Table 4.1 with Step 5 implemented; and Figures 4.5(c) and (d) are the corresponding models for Table 4.2. Comparing these results with Figures 4.1 and 4.2, we find that without Step 5 there is only one misclassified instance in Figure 4.5(a) and two in Figure 4.5(c). When Step 5 is executed, Figures 4.5(b) and 4.5(d) reach 100% accuracy, as do Figures 4.1 and 4.2. Furthermore, Figure 4.5(d) is identical to Figure 4.2, while Figure 4.5(b) is a little different from Figure 4.1. From this example we see that although the decision tree extracted from a CDR-Tree is not guaranteed to be identical to one built from scratch, it reaches a comparable accuracy even without the implementation of Step 5.

Figure 4.5 The extracted decision trees from Figure 4.3: (a) the model of Table 4.1 without implementing Step 5; (b) the model of Table 4.1 with Step 5 implemented; (c) the model of Table 4.2 without implementing Step 5; (d) the model of Table 4.2 with Step 5 implemented.

4.3 Experiment and Analysis

We implement the CDR-Tree algorithm in Microsoft Visual C++ 6.0 for experimental analysis and performance evaluation. The experimental environment and datasets are described below. In Section 4.3.1, we demonstrate how the accuracy of CDR-Trees is affected by different drifting levels. We compare the accuracy of C5.0 to that of the model extracted from the CDR-Tree in Section 4.3.2. Finally, the comparison of execution time among the CDR-Tree, the model extracted from the CDR-Tree, and C5.0 is given in Section 4.3.3.

All experiments here are done on a 3.0 GHz Pentium IV machine with 512 MB of DDR memory, running Windows 2000 Professional. Due to the lack of a benchmark containing concept-drifting datasets, our experimental datasets are generated by the IBM data generator, which allows us to generate several kinds of datasets to evaluate our CDR-Tree. In our experiments, four classification functions of the generator, F3, F5, F43, and F45, are randomly selected to generate the experimental datasets.

In order to analyze the performance of our CDR-Tree under different drifting ratios R% (i.e., the proportion of drifting instances to all instances), we use the four functions mentioned above to generate the required experimental datasets. For each function, the noise level is set to 5% and the dataset generated by the IBM data generator is regarded as the original (first) dataset in the data stream. We then code a program to amend the first dataset and so generate the second dataset. Our program works as follows. First, it randomly picks one instance S in the original dataset and randomly selects attributes am (1 ≤ m ≤ 5) for reference. Instances that have the same values as S in all attributes am are picked out. The class label and the values of attributes am of these picked-out instances are then replaced by random values from the corresponding value domains. The main principle of our program is that concept drifts are caused by the variances of some attributes. We limit the number of reference attributes to at most five since drifting concepts should be caused by some, but not many, attribute values, and there are only nine basic attributes in the IBM data generator. If the number of drifting instances is less than required, the program goes on to the next loop to get more drifting instances. On the contrary, if more instances than required satisfy the condition, R% of the instances are randomly picked as drifting ones. As a result, each function generates five second datasets with different drifting ratios; a total of 4 old datasets and 20 new datasets are generated in our experiments. Every dataset includes 10000 instances and the 10-fold cross-validation method is applied to all experiments. That is, each experimental dataset is divided into 10 parts, of which nine are used as training sets and the remaining one as the testing set. In the following experiments, we use D(i) to denote a dataset generated by function Fi and D(i,R) to represent a dataset with an R% drifting ratio derived from D(i).
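The following Python sketch reconstructs this drift-injection procedure under the description above; the record layout, the value domains, the helper name inject_drift, and the uniform random replacement of values are our assumptions for illustration:

# A sketch of the drift-injection program: pick a seed instance S, pick up
# to five reference attributes am, and replace the class label and the
# am-values of every instance matching S with random values from the
# corresponding domains, until the target drifting ratio is reached.
import random

def inject_drift(dataset, domains, class_attr, ratio, max_attrs=5):
    """dataset: list of dict records; domains: attribute -> value list."""
    need = int(len(dataset) * ratio)
    drifted = set()
    candidates = [a for a in domains if a != class_attr]
    while len(drifted) < need:                 # loop until enough instances drift
        seed = random.choice(dataset)          # instance S
        attrs = random.sample(candidates, random.randint(1, max_attrs))
        for i, row in enumerate(dataset):
            if i in drifted:
                continue
            if all(row[a] == seed[a] for a in attrs):   # matches S on all am
                for a in attrs:
                    row[a] = random.choice(domains[a])  # drift the values
                row[class_attr] = random.choice(domains[class_attr])
                drifted.add(i)
                if len(drifted) == need:       # simplified stopping rule
                    break
    return dataset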

4.3.1 The Analysis of CDR-Tree

In this section, we use the 24 datasets described in Section 4.3 to evaluate the accuracy of the CDR-Tree and to analyze whether it can precisely explore the concept-drifting rules. First, focusing on the five different drift levels, the accuracy of the CDR-Tree on the 20 integrated datasets is shown in Figure 4.6. As can be seen in this figure, CDR-Tree maintains high accuracy on all 20 datasets. However, it is worth noting that the higher the concept-drifting ratio is, the lower the accuracy of the CDR-Tree will be, because a higher drifting ratio makes the CDR-Tree more complex. To further analyze whether the concept-drifting rules produced by the CDR-Tree can accurately predict the drifting instances, for each experimental dataset we select only the instances whose concept really drifts from the testing data to calculate the accuracy. The experimental result is shown in Figure 4.7.

As expected, the concept-drifting rules mined by the CDR-Tree can accurately predict those drifting instances.

Figure 4.6 The accuracy of CDR-Trees under five different drifting ratios.


Figure 4.7 The accuracy of concept-drifting rules produced by CDR-Trees.

4.3.2 The Comparison of Accuracy between E-CDR-Tree and C5.0

In this experiment, we evaluate whether our approach mentioned in Section 4.2.2 can accurately extract classification models from CDR-Trees. First, all 24 datasets are used by
