

3 Sensitive Concept Drift Probing Decision Tree Algorithm

3.1 One-way Drift and Two-way Drift

3.2.1 Class Distribution on Attribute Value

Since the existing solutions for mining concept-drifting data streams check for the occurrence of concept drift at the instance or attribute level, they are generally not sensitive enough.

Besides, they are also unable to detect the two-way drift illustrated in Figure 3.1. To solve these problems, SCRIPT probes changes at a finer level, called the Class Distribution on Attribute Value (CDAV), which is defined as follows.

Definition 3.1: Assume that a data block contains m target classes c_k (k = 1, ..., m) and n attributes A_i (i = 1, ..., n), and that each attribute A_i has v attribute values a_ij (j = 1, ..., v). The distribution of the target classes c_k on the attribute value a_ij is then defined as CDAV_ij (Class Distribution on Attribute Value).
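As a concrete reading of Definition 3.1, the sketch below tabulates the CDAVs of one attribute from a data block. It is only an illustration: the record layout (a list of (attribute-value dictionary, class label) pairs) and the helper name cdav_table are assumptions of this sketch, not part of SCRIPT.

```python
def cdav_table(block, attribute, classes):
    """Tabulate f_ijk for one attribute A_i: for every attribute value a_ij,
    count how many instances of each target class c_k carry that value.
    `block` is a list of (attribute_value_dict, class_label) records."""
    table = {}                                   # a_ij -> [f_ij1, ..., f_ijm]
    for values, label in block:
        row = table.setdefault(values[attribute], [0] * len(classes))
        row[classes.index(label)] += 1
    return table

# Example with a tiny, made-up block:
block = [({"A1": "a11"}, "c1"), ({"A1": "a11"}, "c2"), ({"A1": "a12"}, "c2")]
print(cdav_table(block, "A1", ["c1", "c2"]))     # {'a11': [1, 1], 'a12': [0, 1]}
```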

With Definition 3.1, we can use the chi-square (χ²) test to check whether concept drift occurs between two data blocks. The χ² test is a statistical measure used to test the hypothesis that two discrete attributes are statistically independent. Applied to the concept-drifting problem, it tests the hypothesis that the class distribution on an attribute value is identical in the two data blocks. The formula for computing the χ² value is

\chi^2_{ij} = \sum_{k=1}^{m} \frac{(f_{ijk} - f'_{ijk})^2}{f'_{ijk}},    (3.1)

where f_ijk represents the number of instances having attribute value a_ij and class c_k in D, and f'_ijk is the corresponding count in D'. With Formula 3.1, we can then define the variance of a CDAV_ij between the two data blocks as follows.

Definition 3.2: For two data blocks D and D', the variance of a CDAV_ij is defined as the χ² value of Formula 3.1:

CDAV_{D \to D'}(i, j) = \chi^2_{ij} = \sum_{k=1}^{m} \frac{(f_{ijk} - f'_{ijk})^2}{f'_{ijk}}.    (3.2)

Proposition 3.1: For a given significance level α with critical value ε, if CDAV_{D→D'}(i, j) < ε for every attribute A_i and every attribute value a_ij, then the class distribution on all attribute values a_ij in the two data blocks shows no significant difference, and neither does the accuracy of the decision trees built according to D and D', respectively.

Proof:

Since CDAV_{D→D'}(i, j) < ε for all i and j, we can obtain f_ijk ≅ f'_ijk for all target classes c_k (k = 1, ..., m), attributes A_i (i = 1, ..., n), and attribute values a_ij.

For attribute A_i, the entropy before splitting is

I(A_i) = -\sum_{k=1}^{m} \frac{N_k}{N} \log_2 \frac{N_k}{N},    (3.3)

where N_k = \sum_{j=1}^{v} f_{ijk} is the summation of class c_k shown in Table 3.1, and N denotes the total number of instances in the data block.

Since f_ijk ≅ f'_ijk and N = N', we can obtain

I(A_i) \cong I'(A_i).    (3.4)

That is, the entropy of attribute A_i before splitting is similar in data blocks D and D'.

Suppose we split all N instances into v subsets by attribute A_i. The entropy of attribute A_i after splitting is

E(A_i) = \sum_{j=1}^{v} \frac{N_{ij}}{N} \left( -\sum_{k=1}^{m} \frac{f_{ijk}}{N_{ij}} \log_2 \frac{f_{ijk}}{N_{ij}} \right),    (3.5)

where N_ij = \sum_{k=1}^{m} f_{ijk}. Since f_ijk ≅ f'_ijk and N = N', we likewise obtain

E(A_i) \cong E'(A_i).    (3.6)

From Formulas (3.4) and (3.6), we can get that

Gain(A_i) = I(A_i) - E(A_i) \cong I'(A_i) - E'(A_i) = Gain'(A_i).    (3.7)

That is, the information gain of attribute A_i in data blocks D and D' is similar.

As a result, the two decision trees built from blocks D and D', respectively, will be similar. ■

Table 3.1 The class label distribution on an attribute A_i

Class label \ Value    a_i1     a_i2     ...    a_iv     Summation
c_1                    f_i11    f_i21    ...    f_iv1    N_1
...                    ...      ...      ...    ...      ...
c_m                    f_i1m    f_i2m    ...    f_ivm    N_m
Summation              N_i1     N_i2     ...    N_iv     N
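To make the quantities used in the proof concrete, the following minimal Python sketch computes the entropy before splitting I(A_i), the entropy after splitting E(A_i), and the information gain Gain(A_i) from a Table 3.1-style count matrix. The function name and the matrix layout (rows are classes, columns are attribute values) are assumptions of this sketch rather than part of the SCRIPT paper.

```python
import math

def entropy_and_gain(counts):
    """counts[k][j]: number of instances with class c_k and attribute value a_ij
    (a Table 3.1-style matrix for one attribute A_i).
    Returns (I, E, Gain) as used in the proof of Proposition 3.1."""
    N = sum(sum(row) for row in counts)                 # total number of instances
    class_totals = [sum(row) for row in counts]         # N_k  = sum_j f_ijk
    value_totals = [sum(col) for col in zip(*counts)]   # N_ij = sum_k f_ijk

    # Entropy before splitting: I(A_i) = -sum_k (N_k/N) log2 (N_k/N)
    I = -sum((nk / N) * math.log2(nk / N) for nk in class_totals if nk > 0)

    # Entropy after splitting into v subsets by attribute value:
    # E(A_i) = sum_j (N_ij/N) * ( -sum_k (f_ijk/N_ij) log2 (f_ijk/N_ij) )
    E = 0.0
    for j, nij in enumerate(value_totals):
        if nij == 0:
            continue
        subset_entropy = -sum(
            (counts[k][j] / nij) * math.log2(counts[k][j] / nij)
            for k in range(len(counts)) if counts[k][j] > 0
        )
        E += (nij / N) * subset_entropy

    return I, E, I - E   # Gain(A_i) = I(A_i) - E(A_i)
```

Running this on two count matrices with f_ijk ≅ f'_ijk yields nearly identical gains, which is exactly the conclusion of Proposition 3.1.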

By Proposition 3.1 and Formula 3.2, we can detect any kind of concept drift between two data blocks and then build an accurate decision tree. The significance level can be set smaller or larger according to the needs of the application. With a given significance level, we can obtain ε by checking the χ² table in a statistics book. The degrees of freedom is one less than the number of classes. Suppose that we set the significance level α = 5% and there are three classes; if all CDAV_{D→D'}(i, j) are less than ε = 5.991, then the class distribution on all attributes shows no significant difference between D and D' with 95% confidence. As a result, the information gain obtained from any attribute will show no significant difference, and the decision tree does not need to be rebuilt. Note that the purpose of Proposition 3.1 is to claim that a rebuilt tree would have accuracy very similar to that of the original one, rather than to guarantee that the rebuilt tree would be a copy of the original one.
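The whole check can be sketched as follows, assuming the χ² form reconstructed in Formula 3.2 and using scipy.stats.chi2.ppf to obtain the critical value ε instead of a printed χ² table; the helper names (cdav_variance, concept_drift_check) and the data layout are illustrative assumptions, not part of the SCRIPT paper.

```python
from scipy.stats import chi2

def cdav_variance(f, f_prime):
    """Formula 3.2 (as reconstructed above):
    CDAV_{D->D'}(i, j) = sum_k (f_ijk - f'_ijk)^2 / f'_ijk,
    where f and f_prime are the class-count lists of one attribute value a_ij
    in blocks D and D'. Zero counts in D' are simply skipped in this sketch."""
    return sum((a - b) ** 2 / b for a, b in zip(f, f_prime) if b > 0)

def concept_drift_check(cdavs_D, cdavs_Dp, num_classes, alpha=0.05):
    """Compare every CDAV of block D with the matching CDAV of block D'.
    cdavs_D / cdavs_Dp map (attribute, attribute_value) -> class-count list,
    e.g. the output of a tabulation like cdav_table above.
    Returns the critical value epsilon and the CDAVs reaching it
    (Corollary 3.1: such CDAVs signal a possible concept drift)."""
    # Degrees of freedom = number of classes - 1, as stated in the text.
    epsilon = chi2.ppf(1 - alpha, df=num_classes - 1)
    drifted = {}
    for key, f in cdavs_D.items():
        f_prime = cdavs_Dp.get(key, [0] * num_classes)
        drift_value = cdav_variance(f, f_prime)
        if drift_value >= epsilon:
            drifted[key] = drift_value
    return epsilon, drifted

# chi2.ppf(0.95, df=1) ~= 3.841 and chi2.ppf(0.95, df=2) ~= 5.991, matching
# the epsilon values quoted in the text for two and three classes.
```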

Example 3.1: To clearly illustrate our idea, a case with two data sets D and D' is presented in Table 3.2. Each of the two sets has two attributes A1 and A2, and each attribute has three attribute values (a11, a12, a13; a21, a22, a23). There are 500 instances in total and two classes, c1 and c2, in each data set. Assuming the significance level α = 5% (degrees of freedom = 1 and ε = 3.841), we can infer the following by Formula 3.2:

Since all CDAVs show no significant difference, by Proposition 3.1 the decision trees built with D and D', respectively, would be very similar. To verify this, we build the two decision trees and show the corresponding rules. The rules obtained from data set D are

(1) A1 = “a12” → c2;

(2) A1 = “a13” → c2;

We can find that the two decision trees have identical rules. This result corresponds to Proposition 3.1.

Table 3.2 Two data sets D and D’ without the occurrence of concept drift

Dataset D D’

Corollary 3.1: By Proposition 3.1, we can infer that if the variance of a CDAV_ij for the two data blocks D and D' is greater than or equal to the threshold ε (i.e., CDAV_{D→D'}(i, j) ≥ ε), then concept drift may occur between D and D'. As a result, the original decision tree needs to be corrected.

Example 3.2: Here, we use the two data sets in Table 3.3, which is modified from Table 3.2, to illustrate this Corollary. Again assuming the significance level α = 5% (degrees of freedom = 1 and ε = 3.841), we can infer the following by Formula 3.2:

Since CDAV_13 shows a significant difference, by Corollary 3.1 we can claim that concept drift occurs and that the decision trees built with D and D', respectively, would be different. To verify this, we again show the corresponding rules of the two trees as follows. The rules obtained from data set D are

And the rules obtained from data set D' are (1) A1 = "a12" → c2; the rules obtained from data set D differ from those obtained from data set D', and the results correspond to our Corollary.

Table 3.3 Two data sets D and D' with the occurrence of concept drift

                               Dataset D                           Dataset D'
                        A1               A2               A1               A2
Class label      a11   a12   a13   a21   a22   a23   a11   a12   a13   a21   a22   a23
c1               192    41    13    18   216    12   203    34    20    23   208    16
c2                33   142    79    74   122    58    42   135    66    68   118    67
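As a quick, hedged check of Example 3.2, the counts of Table 3.3 can be plugged into the χ² form assumed in Formula 3.2 above; the snippet recomputes CDAV_{D→D'}(1, 3) and, for contrast, CDAV_{D→D'}(1, 1).

```python
def cdav_variance(f, f_prime):
    """sum_k (f_ijk - f'_ijk)^2 / f'_ijk  (the form assumed for Formula 3.2)."""
    return sum((a - b) ** 2 / b for a, b in zip(f, f_prime) if b > 0)

EPSILON = 3.841                       # chi-square critical value, alpha = 5%, df = 1

# Class counts [c1, c2] taken from Table 3.3.
a13_D, a13_Dp = [13, 79], [20, 66]    # CDAV_13: (13-20)^2/20 + (79-66)^2/66 ~= 5.01
a11_D, a11_Dp = [192, 33], [203, 42]  # CDAV_11: (192-203)^2/203 + (33-42)^2/42 ~= 2.52

print(cdav_variance(a13_D, a13_Dp) >= EPSILON)   # True  -> drift signalled on a13
print(cdav_variance(a11_D, a11_Dp) >= EPSILON)   # False -> no drift on a11
```

Under this assumed form, only CDAV_13 reaches the 3.841 threshold, in line with the claim of Example 3.2.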