2 Background and Related Work
2.4 UCI Database and IBM Data Generator
In this dissertation, we use both real and synthetic datasets to carry out a series of experimental evaluations. The real datasets are selected from the UCI database [59],
which is a repository of several kinds of datasets. The UCI database is widely used by the machine learning community for the empirical analysis of machine learning algorithms.
For the synthetic datasets, we use the IBM data generator [1][2], an open-source tool written in C++ and designed by the IBM Almaden Research Center. The IBM data generator is a popular tool for generating artificial data to evaluate the performance of proposed algorithms. One of its advantages is that it contains many built-in functions for generating several kinds of datasets, and therefore enables researchers to carry out a series of experimental comparisons. The generator provides nine basic attributes (salary, commission, loan, age, zipcode, h-years, h-value, e-level, and car) and a target attribute. Among the nine attributes, zipcode, e-level, and car are categorical; all the others are continuous. The number of class labels can be chosen by the user and defaults to 2. The nine basic attributes are summarized in Table 2.2. In this dissertation, we modify the IBM data generator to generate datasets containing concept-drifting records.
Table 2.2 The summary of nine basic attributes in IBM data generator
Attribute Type Value domain
salary continuous 20,000 to 150,000
commission continuous 0 if salary ≥ 75,000; otherwise uniformly distributed from 10,000 to 75,000
loan continuous 0 to 500,000
h-years continuous 1 to 30
h-value continuous 0.5k × 100,000 to 1.5k × 100,000, where k ∈ {1, ..., 9} depends on zipcode
age continuous 20 to 80
car categorical 1 to 20
e-level categorical 0 to 4
zipcode categorical 1 to 9
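The value domains in Table 2.2 can be illustrated with a minimal sketch. It is not the IBM generator itself: the function name generate_record is hypothetical, every continuous attribute is assumed to be drawn uniformly from its range (the table states this only for commission), k is assumed to equal the zipcode value, and the target attribute is omitted because its labeling function is not given here.

```python
import random

def generate_record(rng):
    """Draw one record whose attributes follow the value domains of Table 2.2."""
    salary = rng.uniform(20_000, 150_000)
    # commission is 0 for high salaries, otherwise uniform on [10000, 75000]
    commission = 0.0 if salary >= 75_000 else rng.uniform(10_000, 75_000)
    zipcode = rng.randint(1, 9)
    k = zipcode  # assumption: k is taken to be the zipcode value itself
    return {
        "salary": salary,
        "commission": commission,
        "loan": rng.uniform(0, 500_000),
        "age": rng.uniform(20, 80),
        "zipcode": zipcode,
        "h-years": rng.uniform(1, 30),
        "h-value": rng.uniform(0.5 * k * 100_000, 1.5 * k * 100_000),
        "e-level": rng.randint(0, 4),
        "car": rng.randint(1, 20),
    }

rng = random.Random(0)  # fixed seed so that the sample is reproducible
records = [generate_record(rng) for _ in range(1000)]
```

A drifting stream could then be emulated by changing the labeling rule between blocks, which is the kind of modification this dissertation applies to the generator.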
Chapter 3
Sensitive Concept Drift Probing Decision Tree Algorithm
In this chapter, we first give some formal discussions of the concept-drifting problem in Section 3.1. In Section 3.2, we introduce our sensitive concept drift probing decision tree algorithm (SCRIPT). The empirical analyses of SCRIPT are presented in Section 3.3.
3.1 One-way Drift and Two-way Drift
To make the problem we address easier to follow, in this dissertation we divide the behavior of a data stream into three cases: concept stable, concept drift, and concept shift. We adapt the examples in [73] and modify the figures to illustrate the problem in Figure 3.1. Figure 3.1 represents a two-dimensional data stream divided into six successive data blocks according to the arrival time of the data. Instances arriving between ti and ti+1 form block Bi, and the separating line in each block stands for the optimum classification boundary of that block.
During the period from t0 to t2, data blocks B0 and B1 have similar data distributions; that is, the data stream is stable during this period. Thereafter, in B2, some instances drift and the optimum boundary changes. This is defined as concept drift. Finally, data blocks B4 and B5
have opposite sample distributions, and this is defined as concept shift. Since the sample distributions of the first two blocks B0 and B1 are quite close, we can use the decision tree DT0 built from B0 as the classifier for B1 to save computational and storage cost.
Meanwhile, B2 differs only slightly from the sample distribution of B1, and an efficient approach should correct the original decision tree instead of rebuilding it.
Figure 3.1 A data stream with the occurrence of concept drift.
In Section 2.2, we showed that previously proposed solutions are not sensitive enough to drifting concepts. That is, they cannot detect a change until the number of drifting instances reaches a threshold that causes an obvious difference in accuracy, information gain, or Gini index. Here we describe another concept drift problem that forces some proposed solutions, such as CVFDT and DNW, to make wrong predictions. To introduce this problem, we subdivide concept drift into one-way drift and two-way drift. Taking Figure 3.1 as the example again, some negative data in B2 drift to become positive data in B3; this is known as one-way drift. In contrast, the positive data in B4 drift to become negative in B5, and vice versa; this is known as two-way drift. We can regard two-way drift as a kind of “local” concept shift if it occurs in an internal or leaf node of a decision tree. If the variation of information gain or Gini index is used as the criterion to judge the occurrence of
concept drift, e.g. the difference of information gain adopted in CVFDT, we can detect only one-way drift, since the information gain obtained from B4 would be the same as that obtained from B5. It is worth noting that two-way drift can happen in real data. For example, a hacker may alternately use two computers with IP addresses x and y to send attack packets. When an internal node learned from the first data block classifies the packets from x as safe and those from y as attacks, the contrary result might be learned from another data block. A similar situation might be found in spam filtering, image comparison, and so on.
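The blindness of information gain to two-way drift can be checked numerically. The following is a minimal sketch with hypothetical class counts for one split in B4 and in B5. The counts are illustrative and are not taken from Figure 3.1. Swapping the two class labels in every branch leaves the gain unchanged.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(branches):
    """branches: per-branch class counts, e.g. [[positive, negative], ...]."""
    class_totals = [sum(column) for column in zip(*branches)]
    n = sum(class_totals)
    after_split = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(class_totals) - after_split

block_b4 = [[90, 10], [20, 80]]  # hypothetical [positive, negative] counts per branch
block_b5 = [[10, 90], [80, 20]]  # same branches after a two-way drift (labels swapped)

gain_b4 = information_gain(block_b4)
gain_b5 = information_gain(block_b5)
```

Here gain_b4 and gain_b5 coincide although every branch now predicts the opposite class, so a test based only on the change of information gain would report no drift.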
3.2 Sensitive Concept Drift Probing Decision Tree Algorithm
3.2.1 Class Distribution on Attribute Value
Since the proposed solutions for mining concept-drifting data streams check the occurrence of concept drift at the level of instances or attributes, they are generally not sensitive enough.
Besides, they are unable to detect the two-way drift illustrated in Figure 3.1. To solve these problems, SCRIPT probes the changes at a more detailed level, called Class Distribution on Attribute Value (CDAV), which is defined as follows.
Definition 3.1: Assume that a data block contains m target classes ck (k = 1, ..., m) and n attributes Ai (i = 1, ..., n), and that each attribute Ai has v attribute values aij (j = 1, ..., v). The distribution of the target classes ck on the attribute value aij is then defined as CDAVij (Class Distribution on Attribute Value).
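As a concrete reading of Definition 3.1, the CDAVs of a data block can be collected in one pass over its records. The sketch below is illustrative only: the function name count_cdavs, the dictionary layout, and the toy records are assumptions, and continuous attributes are assumed to have been discretized into attribute values beforehand.

```python
from collections import Counter

def count_cdavs(block, class_key="class"):
    """Return {(attribute, value): Counter({class_label: count})} for one data block."""
    cdavs = {}
    for record in block:
        label = record[class_key]
        for attribute, value in record.items():
            if attribute == class_key:
                continue  # the target attribute is not itself a CDAV key
            cdavs.setdefault((attribute, value), Counter())[label] += 1
    return cdavs

# Toy block with two attributes and two classes (hypothetical data).
block = [
    {"A1": "a11", "A2": "a21", "class": "c1"},
    {"A1": "a11", "A2": "a22", "class": "c2"},
    {"A1": "a12", "A2": "a22", "class": "c2"},
]
cdavs = count_cdavs(block)
```

Each entry of cdavs corresponds to one CDAVij, and each counter value to one count fijk.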
With Definition 3.1, we can use the chi-square (χ2) test to check whether concept drift occurs between two data blocks D and D’. The χ2 test is a statistical measure used to test the hypothesis that two discrete attributes are statistically independent. Applied to the concept-drifting problem, it tests the hypothesis that the class distribution on an attribute value is identical in the two data blocks. The formula for computing the χ2 value is
$$\chi^2 = \sum_{k=1}^{m} \frac{(f'_{ijk} - f_{ijk})^2}{f_{ijk}} \qquad (3.1)$$
where fijk represents the number of instances having attribute value aij and class ck in D, and f’ijk is that in D’. With Formula 3.1, we can then define the variance of a CDAVij in the two data blocks as follows.
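Definition 3.2: For a given significance level α, the variance CDAVD→D'(i,j) of a CDAVij between the two data blocks D and D’ is the χ2 value of Formula 3.1 computed for attribute value aij (Formula 3.2), and ε denotes the corresponding critical χ2 value.
Proposition 3.1: If CDAVD→D'(i,j) < ε for all attribute values aij, then the class distribution on all attribute values aij in the two data blocks shows no significant difference, and neither does the accuracy of the decision trees built from D and D’, respectively.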
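Proof:
Since CDAVD→D'(i,j) < ε,
we can obtain fijk ≅ f’ijk for target classes ck (k = 1, ..., m), attributes Ai (i = 1, ..., n), and attribute values aij (j = 1, ..., v).
For attribute Ai, the entropy before splitting is
$$I(A_i) = -\sum_{k=1}^{m} \frac{\sum_{j=1}^{v} f_{ijk}}{N} \log_2 \frac{\sum_{j=1}^{v} f_{ijk}}{N}$$
where \sum_{j} f_{ijk} is the summation of class ck in Table 3.1, and N denotes the total number of instances in the data block. Since fijk ≅ f’ijk and N = N’, we can obtain
$$I(A_i) \cong I'(A_i) \qquad (3.3)$$
That is, the entropy of attribute Ai before splitting is similar in data blocks D and D’.
Suppose we split all N instances into v subsets by attribute Ai. The entropy of attribute Ai
after splitting is
$$E(A_i) = \sum_{j=1}^{v} \frac{n_{ij}}{N} \left( -\sum_{k=1}^{m} \frac{f_{ijk}}{n_{ij}} \log_2 \frac{f_{ijk}}{n_{ij}} \right), \qquad n_{ij} = \sum_{k=1}^{m} f_{ijk}$$
and, again because fijk ≅ f’ijk and N = N’,
$$E(A_i) \cong E'(A_i) \qquad (3.7)$$
From Formulas (3.3) and (3.7), we can get that
$$Gain(A_i) = I(A_i) - E(A_i) \cong I'(A_i) - E'(A_i) = Gain'(A_i)$$
That is, the information gain of attribute Ai in data blocks D and D’ is similar.
As a result, the two decision trees built from blocks D and D’, respectively, will be similar. ■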
Table 3.1 The class label distribution on an attribute Ai
Class label \ value ai1 ai2 … aiv Summation
By Proposition 3.1 and Formula 3.2, we can detect any kind of concept drift between two data blocks and then build an accurate decision tree. The significance level can be set
smaller or larger according to the needs of the application. With a given significance level, we can obtain ε by looking up a χ2 table in a statistics textbook. The degree of freedom is 1 less than the number of classes. Suppose that we set the significance level α = 5% and there are three classes, so that the degree of freedom is 2. If all CDAVD→D'(i,j) are less than ε = 5.991, the class distribution on all attribute values shows no significant difference between D and D’ with 95% confidence. As a result, the information gain obtained from any attribute will show no significant difference, and the decision tree need not be rebuilt. Note that the purpose of Proposition 3.1 is to claim that a rebuilt tree will have very similar accuracy to that of the original one, rather than to guarantee that the rebuilt tree will be a copy of the original one.
Example 3.1: To illustrate our idea, a case with two datasets D and D’ is presented in Table 3.2. Each of the two sets has two attributes A1 and A2, and each attribute has three attribute values (a11, a12, a13; a21, a22, a23). Each dataset contains 500 instances in total and two classes, c1
and c2. Assuming that the significance level α = 5% (degree of freedom = 1 and ε = 3.841), we can infer the following by Formula 3.2:
Since no CDAV shows a significant difference, by Proposition 3.1, the decision trees built from D and D’, respectively, would be very similar. To verify this, we build the two decision trees and show the corresponding rules. The rules obtained from data set D are
(1) A1 = “a12” → c2;
(2) A1 = “a13” → c2;
We can find that the two decision trees have identical rules. This result agrees with Proposition 3.1.
Table 3.2 Two data sets D and D’ without the occurrence of concept drift
Dataset D D’
Corollary 3.1: By Proposition 3.1, we can infer that if the variance of a CDAV between the two data blocks D and D’ is greater than or equal to the threshold ε (i.e. CDAVD→D'(i,j) ≥ ε), then concept drift may have occurred between D and D’. As a result, the original decision tree needs to be corrected.
Example 3.2: Here, we use the two datasets in Table 3.3, which are modified from Table 3.2, to illustrate this corollary. Again assuming that the significance level α = 5% (degree of
freedom = 1 and ε = 3.841), we can infer the following by Formula 3.2:
Since CDAV13 shows a significant difference, by Corollary 3.1, we can claim that concept drift occurs and that the decision trees built from D and D’, respectively, would be different. To verify this, we again show the corresponding rules for the two trees as follows. The rules obtained from data set D are
And the rules obtained from data set D’ are (1) A1 = “a12”→ c2; data set D’; the results correspond to our Corollary.
Table 3.3 Two data sets D and D’ with the occurrence of concept drift
Dataset D D’
Attribute A1 A2 A1 A2
Attribute value a11 a12 a13 a21 a22 a23 a11 a12 a13 a21 a22 a23
Class label c1 192 41 13 18 216 12 203 34 20 23 208 16
Class label c2 33 142 79 74 122 58 42 135 66 68 118 67
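The check in Example 3.2 can be reproduced from the counts in Table 3.3. The sketch below assumes that the variance of a CDAV takes the goodness-of-fit form, i.e. the sum over classes of (f' - f)^2 / f with D as the reference. This form is an assumption, adopted here because it reproduces the conclusion that only a13 drifts. The helper name cdav_variance and the small critical-value table are illustrative.

```python
CRITICAL_005 = {1: 3.841, 2: 5.991}  # chi-square critical values at alpha = 0.05

def cdav_variance(ref, new):
    """Assumed statistic: sum over classes of (f' - f)^2 / f, with f from the reference block."""
    return sum((f_new - f_ref) ** 2 / f_ref for f_ref, f_new in zip(ref, new))

# Counts (c1, c2) per attribute value, copied from Table 3.3.
D = {"a11": (192, 33), "a12": (41, 142), "a13": (13, 79),
     "a21": (18, 74), "a22": (216, 122), "a23": (12, 58)}
D_new = {"a11": (203, 42), "a12": (34, 135), "a13": (20, 66),
         "a21": (23, 68), "a22": (208, 118), "a23": (16, 67)}

epsilon = CRITICAL_005[len(D["a11"]) - 1]  # degree of freedom = number of classes - 1
drifting = [value for value in D if cdav_variance(D[value], D_new[value]) >= epsilon]
print(drifting)  # ['a13']
```

Under this assumed form, the value for a13 is about 5.91, above ε = 3.841, while the largest of the remaining values (a11, about 3.08) stays below it.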
3.2.2 Correction Mechanism in SCRIPT
Before we introduce the correction mechanism in SCRIPT, it is worth noting that drifting instances should gather in some specific areas of the attribute space; otherwise they can be regarded as noise. Accordingly, another advantage of CDAV is that, by aggregating the drifting CDAVs, it reveals which attribute values cause concept drift before the decision tree is built. This enables SCRIPT to amend the original decision tree efficiently and immediately. For example, we can recognize that the concept drift in Example 3.2 is caused by attribute value a13. Therefore, we only need to correct the subtree rooted at a13 to correct the classification model efficiently.
Example 3.3: We use Figure 3.2 to further illustrate the idea of the correction mechanism in SCRIPT. Figure 3.2 (a) is a decision tree trained from old customers’ data to predict whether a customer will apply for a credit card. For better understanding, only the subtree rooted at attribute
“salary” is shown. A similar decision tree, except that it is trained from the new customers’ data stream, is shown in Figure 3.2 (b). By comparing the CDAVs in Figure 3.2 (a) and Figure 3.2 (b), we can find that some concepts in the new data block are significantly different from those in the old one. More importantly, these changes gather in the branches
“age < 20” and “20 ≤ age < 40”. Accordingly, the aggregated drifting CDAV is 0 ≤ age < 40, which means that people younger than 40 have changed their concept in this example. To efficiently provide a decision tree suitable for the new customers’ data block, we only need to correct the subtree rooted at 0 ≤ age < 40, as in Figure 3.2 (b).
Figure 3.2. Two data blocks with the occurrence of concept drift: (a) original data block and the corresponding sub-tree; (b) new data block and the corresponding sub-tree.
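The aggregation step of Example 3.3 can be sketched for a discretized continuous attribute. The sketch assumes that drifting attribute values are represented as half-open intervals [low, high) and that adjacent or overlapping intervals are merged. The function name aggregate_intervals is hypothetical, and the source does not specify how SCRIPT represents intervals internally.

```python
def aggregate_intervals(intervals):
    """Merge drifting half-open intervals [low, high) that touch or overlap."""
    merged = []
    for low, high in sorted(intervals):
        if merged and low <= merged[-1][1]:
            # the interval touches or overlaps the previous one: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged

drifting_age = [(20, 40), (0, 20)]  # the branches "20 <= age < 40" and "age < 20"
aggregated = aggregate_intervals(drifting_age)
```

For the two drifting branches of the example this yields the single aggregated range 0 ≤ age < 40, while ranges that do not touch, such as 0 to 20 and 40 to 60, stay separate.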
Now, we detail the correction mechanism in SCRIPT. While processing the data stream,
when the difference of a CDAV between the new data block Bt and the original data block Bt-i (t ≥ i ≥ 1) is greater than the given threshold (the significance level is set to 0.05 by default), the correction methods in SCRIPT can be divided into the following cases. Each case is illustrated in Figure 3.3. In each case of Figure 3.3, the dotted node in the left tree (the original tree) denotes the occurrence of concept drift, and the dotted subtree in the right tree (the new tree) is an alternate tree built by SCRIPT.
For each aggregated drifting CDAV in attribute Ai with value(s) aij,
a. If attribute Ai is not a splitting attribute of a node in the original decision tree, SCRIPT will use this attribute to split all leaf nodes by using data block Bt. Such a variation of CDAV indicates that an attribute with originally little information changes into an optimal splitting attribute due to concept drift. We illustrate this condition in Figure 3.3(a).
b. If attribute Ai is a splitting attribute of a node in the original decision tree and all drifting CDAVs group in one interval aij, SCRIPT will remove the subtree rooted at the attribute value aij from the original tree and use data block Bt to build the alternative tree. Such a variation means that concept drift is caused by a fixed range aij of the attribute Ai. Take Figure 3.3(b) as an example: for the attribute age, those under 20 were originally inclined not to apply for credit cards; however, with the growing spending power of students, more and more of them are applying.
c. If attribute Ai is a splitting attribute of a node in the original decision tree but the drifting CDAVs are scattered over several intervals, SCRIPT will remove the subtree rooted at this attribute Ai from the original tree and use data block Bt to build the alternative tree.
Such a variation means that concept drift is caused by the attribute Ai but within multiple ranges of the attribute. For instance, for the attribute age, people younger than 20 and people older than 40 were both originally inclined not to apply for credit cards;
however, with the change of payment types, more and more of them are applying. In this case,
attribute ‘age’, no longer the optimal splitting attribute, is replaced by attribute ‘credit rating’ according to a test result. This case is illustrated in Figure 3.3(c).
Note that the aggregated drifting CDAVs might be distributed among several attributes, and SCRIPT checks whether they lie on the same path in the original tree before the correction. If two aggregated CDAVs lie on the same path, the one located at the higher level is retained and the other is ignored.
Figure 3.3 The illustrations of the correction mechanism in SCRIPT when concept drift occurs.
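The choice among cases a, b, and c can be summarized as a small decision function. This is a sketch of the case selection only, not of the tree surgery itself. The function name correction_case and its arguments are hypothetical, and the drifting values of an attribute are assumed to have been aggregated already.

```python
def correction_case(attribute, aggregated_values, split_attributes):
    """Pick case 'a', 'b', or 'c' for one attribute with aggregated drifting CDAVs."""
    if attribute not in split_attributes:
        return "a"  # split all leaf nodes on this attribute using block Bt
    if len(aggregated_values) == 1:
        return "b"  # rebuild only the subtree rooted at that attribute value
    return "c"      # rebuild the subtree rooted at the attribute itself

split_attributes = {"salary", "age"}
case_single = correction_case("age", [(0, 20)], split_attributes)
case_scattered = correction_case("age", [(0, 20), (40, 120)], split_attributes)
case_new = correction_case("credit rating", [("low",)], split_attributes)
```

With these inputs, a single drifting range of age selects case b, two separated ranges select case c, and an attribute that was never used for splitting selects case a.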
3.2.3 The Pseudocode and Computational Complexity of SCRIPT
Here we present the pseudocode of SCRIPT and analyze its computational complexity.
Given the size of a data block N and the significance level α, SCRIPT calculates the CDAVs in data block B0 in Line 4 as the initial reference. Note that N can be set larger in a high-speed environment or smaller for real-time applications;
however, fijk must be larger than 5, which is a basic requirement of the χ2 test. Similarly, α can be set larger if the detection of concept drift is very important and smaller otherwise.
The default significance level α in SCRIPT is 0.05, since this value is widely used as the default in statistics. SCRIPT will be more sensitive to concept drift, but may require more computational cost, if we use a larger significance level α;
on the contrary, SCRIPT will be more tolerant of noisy data with a smaller α. The CDAVs of the newly arriving block Bt+1 are calculated in Line 7. The CDAVs of the two data blocks Bt and Bt+1
are compared in Lines 8 to 11. All drifting CDAVs are then aggregated in Line 13 so that the decision model can be corrected efficiently in Lines 19 to 28. The recorded information is updated in Line 29. Finally, the decision tree is output in Line 31.
SCRIPT Algorithm
3. Record all splitting attributes of DT0 in Splitatt[];
4. Count the CDAVs in B0 and record them in RCDAV[];
19. For all aggregated CDAVs belonging to attribute Ai in ACDAV[]
20. If attribute Ai is not in Splitatt[]
29. Update Splitatt[], RCDAV[], and the recorded decision tree;
30. End If
31. Output the decision tree.
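The overall control flow described for the algorithm (record reference CDAVs, compare each new block against them, correct the tree only when some CDAV drifts, then update the reference) can be sketched as follows. This is not a reconstruction of the pseudocode lines that are not shown above: the function script_stream, the callables passed to it, and the toy stand-ins are all hypothetical, and the statistic is the assumed goodness-of-fit form of the χ2 value.

```python
def script_stream(blocks, count_cdavs, variance, epsilon, build_tree, correct_tree):
    """Process data blocks in arrival order, correcting the tree only on drift."""
    tree = build_tree(blocks[0])
    reference = count_cdavs(blocks[0])          # reference CDAVs from B0
    for block in blocks[1:]:
        current = count_cdavs(block)            # CDAVs of the new block
        drifting = [key for key in current
                    if key in reference
                    and variance(reference[key], current[key]) >= epsilon]
        if drifting:
            tree = correct_tree(tree, drifting, block)
            reference = current                 # update the recorded CDAVs
    return tree

# Toy stand-ins (all hypothetical) so that the loop can run end to end.
def toy_counts(block):
    return block  # each toy block is already {attribute_value: (c1_count, c2_count)}

def toy_variance(ref, new):
    return sum((n - r) ** 2 / r for r, n in zip(ref, new))

def toy_build(block):
    return {"corrections": []}

def toy_correct(tree, drifting, block):
    return {"corrections": tree["corrections"] + [sorted(drifting)]}

blocks = [
    {"a11": (192, 33), "a13": (13, 79)},
    {"a11": (190, 35), "a13": (14, 78)},  # stable with respect to the reference
    {"a11": (203, 42), "a13": (20, 66)},  # a13 drifts with respect to the reference
]
tree = script_stream(blocks, toy_counts, toy_variance, 3.841, toy_build, toy_correct)
```

In this toy run the second block triggers no correction, and the third block triggers one correction for a13.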
Below, we compare the system cost of SCRIPT with that of two state-of-the-art window-based approaches: DNW and CVFDT. Assume that a data block has i attributes and k class labels, and that each attribute has j attribute values. Since SCRIPT records the reference CDAVs, it has a memory cost of O(ijk). SCRIPT also needs to record a decision tree and the splitting attributes in this tree; however, this memory cost is O(n), where n is the number of nodes of the recorded decision tree, and can be ignored. For CVFDT, since it has to record the counts in each node of the recorded decision tree, the memory cost is O(nijk). DNW, which is a sliding-window approach, might have the worst storage cost, since it records instances instead of counts and new data blocks might have to be mixed into old ones. If the maintained data is w times as large as a new data block, then the memory cost of DNW is O(wNijk), where N is the number of instances in the block.
In terms of computational cost, while the concept is stable, the computational cost of SCRIPT is O(ijk). CVFDT has to check the information gain of each attribute in each node of the decision tree when a new data block is given. If the tree has n nodes, the computational cost is O(nijk) [38]. For DNW, the computational cost is O(dwNijk), where d is the depth of the tree, since the tree is rebuilt from scratch.
When there is concept drift, DNW and CVFDT have a computational cost similar to that in a stable stream. Since SCRIPT directly corrects some subtrees by checking the drifting CDAVs, the computational cost of the rebuilding is O(ij) + O(n’ijk), where n’ ≤ n, O(ij) accounts for the comparison of CDAVs, and O(n’ijk) is the computational cost of rebuilding the subtrees. The system costs of SCRIPT, DNW, and CVFDT in stable and drifting data streams are summarized in Table 3.4. In summary, SCRIPT has the smallest memory requirement and computational cost when the concept is stable. When the concept drifts, SCRIPT still requires the smallest memory cost and a better or comparable computational cost.
Table 3.4 The Comparisons of system cost among SCRIPT, DNW, and CVFDT
Concept Stable Concept Drift
Algorithm Memory Cost Computational Cost Memory Cost Computational Cost
DNW O(wNijk) O(dwNijk) O(wNijk) O(dwNijk)
CVFDT O(nijk) O(nijk) O(nijk) O(nijk)
SCRIPT O(ijk) O(ijk) O(ijk) O(n’ijk)
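To give a feeling for the orders of magnitude in Table 3.4, the leading terms of the memory cost can be evaluated for one hypothetical setting. The parameter values below are illustrative assumptions, not measurements, and constant factors are ignored.

```python
def memory_cost(i, j, k, n, w, N):
    """Leading-order memory terms of Table 3.4 (constant factors ignored)."""
    return {
        "DNW": w * N * i * j * k,
        "CVFDT": n * i * j * k,
        "SCRIPT": i * j * k,
    }

# Hypothetical setting: 9 attributes, 10 values each, 2 classes,
# a 50-node tree, and a window holding 3 blocks of 1000 instances.
costs = memory_cost(i=9, j=10, k=2, n=50, w=3, N=1000)
```

For this setting the terms are 180 for SCRIPT, 9,000 for CVFDT, and 540,000 for DNW, which matches the ordering stated in the comparison.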
3.3 Experiment and Analysis
In this section, two state-of-the-art data stream mining algorithms, DNW and CVFDT, are implemented for comparison with SCRIPT. We run all experiments on a PC equipped with the Windows XP Professional operating system, a Pentium III 1 GHz CPU, and 512 MB of SDRAM. For the parameters of DNW, we follow [38] and set α = 5.0, β = 0.25, and γ = 0.50.
3.3.1 Experimental Datasets
Due to the lack of a benchmark containing concept-drifting datasets, we modified three