
1 Introduction

1.4 Synopsis of This Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we briefly review related research, including the typical decision tree algorithm, the problem of concept drift in data stream mining, and previous discretization algorithms. In Chapter 3, the problem of mining concept-drifting data and some formal definitions are first introduced.

We then elucidate SCRIPT and evaluate its performance. In Chapter 4, we first use an example to introduce concept-drifting rules so that readers can understand this problem more clearly. Then, CDR-Tree is detailed and evaluated. In Chapter 5, we give some formal definitions of the problem of discretizing multi-valued and multi-labeled data. The details of our new discretization metric and OMMD, as well as the empirical evaluations of OMMD, are then presented. Finally, the conclusions of this dissertation are drawn and some open problems are described in Chapter 6.

Chapter 2

Background and Related Work

In this chapter, we provide a brief survey of the background and related work. First of all, the typical decision tree algorithm is introduced in Section 2.1. In Section 2.2, we review the methods proposed for mining concept-drifting data streams. Previous discretization algorithms are studied in Section 2.3. Finally, the experimental datasets used in this dissertation are introduced in Section 2.4.

2.1 Decision Tree

A typical decision tree [62][63] is a flow-chart-like tree structure constructed by a recursive divide-and-conquer algorithm. In a decision tree, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node has an associated target class (class label). The top-most node in a tree is called the root, and each path from the root to a leaf node represents a rule. A typical decision tree is shown in Figure 2.1. To classify an unseen example, beginning with the root node, successive internal nodes are visited until the example reaches a leaf node. The class of this leaf node is then assigned to the example as a prediction. For instance, the decision tree in Figure 2.1 will approve a golden credit card application if the applicant has a salary higher than 85000 and his/her repayment record is good.

Figure 2.1 A typical decision tree.
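As an illustration of root-to-leaf classification, the following minimal Python sketch encodes the two tests described for Figure 2.1 (the salary threshold of 85000 and the repayment record); the exact shape of the tree beyond those tests is assumed.

# A minimal sketch of classification by root-to-leaf traversal.
# The salary threshold (85000) and the repayment-record test follow the
# description of Figure 2.1; the leaf labels beyond that are assumed.

def classify(applicant):
    # Root node: test on the continuous attribute "salary".
    if applicant["salary"] > 85000:
        # Internal node: test on the categorical attribute "repayment record".
        if applicant["repayment_record"] == "good":
            return "approve golden card"
        return "reject"
    return "reject"

print(classify({"salary": 90000, "repayment_record": "good"}))  # approve golden card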

A number of decision tree algorithms, such as ID3 [71], C4.5 [62], CART [6], CHAID [6], SLIQ [54], SPRINT [65], RainForest [28], and PUBLIC [64], have been proposed over the years. Most of them are composed of two phases: the building phase and the pruning phase. In addition, before the decision tree is induced, the original dataset is usually divided into training data and testing data. In the building phase, the training data is recursively partitioned by a splitting function until all the examples in a node are pure (i.e., all examples in the node have the same class label) or cannot be further partitioned (i.e., the examples in the node contain the same attribute values but different target classes). Several well-known splitting functions, such as information gain and gain ratio [30], have been widely used in the past. After a decision tree is built, many of the branches will reflect anomalies in the training data caused by noise or outliers. To prevent such an overfitting problem [30], decision tree algorithms prune the tree to remove the least reliable branches, which generally results in faster classification and an improved ability to correctly classify unseen data. There are two common approaches to pruning a tree: pre-pruning [64] and post-pruning [55]. The pre-pruning approach halts tree building early by deciding not to further partition the subset of training data at a given node, while post-pruning removes branches from a fully grown tree with the help of testing data.
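The splitting functions mentioned above are computed from class counts alone. The following sketch shows how information gain is obtained for one candidate categorical split; it is a simplified illustration with hypothetical records, not the exact C4.5 implementation.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, target="class"):
    """Entropy of the parent node minus the weighted entropy of its children."""
    parent = [r[target] for r in records]
    gain = entropy(parent)
    for value in set(r[attribute] for r in records):
        subset = [r[target] for r in records if r[attribute] == value]
        gain -= (len(subset) / len(records)) * entropy(subset)
    return gain

# Hypothetical training records.
data = [
    {"repayment": "good", "class": "approve"},
    {"repayment": "good", "class": "approve"},
    {"repayment": "bad",  "class": "reject"},
    {"repayment": "bad",  "class": "approve"},
]
print(information_gain(data, "repayment"))  # about 0.311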

2.2 Data Stream Mining and Concept Drift

Most proposed data stream mining algorithms [27][51][66][71][72] assume that data blocks come from stationary distributions, but in reality the concept underlying the instances might vary. When the concept of the data starts drifting, the classification model constructed from old data becomes unsuitable for the new data. Thus, it is imperative to revise the old classification model or build a new one. VFDT (Very Fast Decision Tree learner) [18] was proposed to solve the scalability problem of learning from very large data streams. It starts with a single leaf and collects training examples from the data stream. When VFDT has seen enough data to know, with high confidence, which attribute is the best one to partition the data on, it turns the leaf into an internal node and continues splitting. However, like most incremental learning methods, it assumes that the data is a random sample drawn from a stationary distribution and is therefore inappropriate for mining concept-drifting data such as credit card approval and fraud detection.
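VFDT's "high confidence" test is based on the Hoeffding bound. The following Python sketch shows the standard form of that test; the gain values, confidence parameter, and example count are assumptions chosen only for illustration.

import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the observed mean of n samples
    lies within epsilon of the true mean of a variable with the given range."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Illustrative values: gains of the two best candidate attributes at a leaf.
best_gain, second_gain = 0.32, 0.25
n_seen = 800                      # examples accumulated at this leaf
epsilon = hoeffding_bound(value_range=1.0, delta=1e-6, n=n_seen)

if best_gain - second_gain > epsilon:
    print("split on the best attribute")   # the difference is statistically reliable
else:
    print("keep collecting examples")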

Window-based approaches [32][39][46][51][75] are the common solutions to the problem of concept drift on data streams. They use a fixed or sliding window [33] to select appropriate training data for different time points. CVFDT (concept-adapting very fast decision tree learner) [32], which extends VFDT, is a representative window-based approach for mining concept drift on data streams. It addresses the concept-drifting problem by maintaining only a fixed amount of data within the window. CVFDT keeps its learned tree up to date with this window by monitoring the quality of its old decisions as data move into and out of the window. In particular, whenever a new instance is read, it is added to the statistics at all the nodes in the tree that it passes through, the oldest example in the window is forgotten at every node where it previously had an effect, and the validity of all statistical tests is checked. If CVFDT detects a change, it starts growing an alternate tree in parallel, rooted at the newly invalidated node. When the alternate tree is more accurate on new data than the original, the original is replaced by the alternate tree.
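The window bookkeeping described above can be pictured with a simplified sketch that keeps class counts consistent with a fixed-size window at a single node; it illustrates the idea only and is not the actual CVFDT implementation.

from collections import Counter, deque

class WindowStats:
    """Simplified sketch of window maintenance at a single node: class counts are
    kept consistent with a fixed-size window of the most recent examples."""

    def __init__(self, window_size):
        self.window = deque()
        self.counts = Counter()
        self.window_size = window_size

    def add(self, example, label):
        # A new example updates the statistics ...
        self.window.append((example, label))
        self.counts[label] += 1
        # ... and, once the window is full, the oldest example is forgotten.
        if len(self.window) > self.window_size:
            _, old_label = self.window.popleft()
            self.counts[old_label] -= 1

stats = WindowStats(window_size=1000)
stats.add({"salary": 90000}, "approve")
print(stats.counts)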

WAH (window adjustment heuristic) [75] and DNW (determine new window size) [38][39] are also representative window-based algorithms; however, they use a sliding window. WAH takes the actual condition of the decision tree into account to dynamically adjust the window size. After new data arrive, a suspicion of concept drift reduces the window size by 20%. Conversely, when the data are very stable, one unit of the window is dropped to avoid maintaining too much unused data, and when the concept merely seems stable, the original window size is maintained. If none of the conditions mentioned above holds, more information is needed to build classifiers; as a result, old data are not dropped from the window and new data are also added to it. Although WAH can handle concept drift according to the actual conditions, it is suitable only for small databases. DNW learns from the training data block by block, which suits the data stream environment. DNW learns in a way similar to WAH; however, the two differ in their conditions and assessment methods. DNW builds a classifier for each block and compares three measures, namely accuracy, recall, and precision, of the classifier on the current block with those of the previous classifiers.

Weighting-based approaches [40][41] and ensemble classifiers [22][68] were also introduced to handle the concept-drifting problem on data streams. Weighting-based approaches assign each example a weight according to its age and its utility for the classification task. An ensemble classifier builds separate sub-classifiers and then combines their predictions to classify unseen data; a sketch of such a weighted combination is given below. The main disadvantage of an ensemble classifier is the huge system cost caused by building and maintaining all the sub-classifiers.
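The following Python sketch shows one way the two ideas can be combined: each sub-classifier votes and the votes are weighted. It is an illustrative combination scheme, not a specific published algorithm; the sub-classifiers and weights are hypothetical.

from collections import defaultdict

def weighted_vote(sub_classifiers, weights, example):
    """Combine the predictions of sub-classifiers built on different data blocks;
    each vote is weighted, e.g., by the sub-classifier's recency or its accuracy
    on the newest block (the weights below are assumed for illustration)."""
    scores = defaultdict(float)
    for clf, weight in zip(sub_classifiers, weights):
        scores[clf(example)] += weight
    return max(scores, key=scores.get)

# Hypothetical sub-classifiers trained on successive blocks of transactions.
old_clf = lambda x: "legitimate"                                      # built before the drift
new_clf = lambda x: "fraud" if x["amount"] > 5000 else "legitimate"   # built after the drift

print(weighted_vote([old_clf, new_clf], [0.3, 0.7], {"amount": 9000}))  # fraud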

When the concept is stable, the methods mentioned above incur unnecessary system cost, including the computational cost of building or rebuilding a decision tree and the storage cost of recording similar data streams. Moreover, when the concept is drifting, they are generally not sensitive enough to the concept drift problem. That is, if the proportion of drifting instances among all instances in a data block is small, these solutions cannot detect the changes until the number of drifting instances reaches a threshold that causes an obvious difference in accuracy or information gain. For some applications, such as fraudulent credit card transactions, the sensitivity of drifting-concept detection is very important. In an ensemble classifier, the fraudulent transactions might be ignored because of the predictions of old sub-classifiers. For weighting-based approaches, even if a high weight is given to the transactions in the new data block, a wrong prediction might still be made because of the influence of old transactions. Fixed window-based approaches have a problem similar to that of weighting-based approaches, and sliding window-based approaches would also disregard such changes since these drifting transactions would not cause an obvious variation in accuracy or information gain.

2.3 Discretization Techniques

In this section, we review previously proposed discretization algorithms and then detail the state-of-the-art CAIM discretization algorithm.

2.3.1 Proposed Discretization Approaches

Attributes can be divided into continuous attributes and categorical attributes. Among the proposed classification algorithms, some, such as AQ [58], CLIP [13][14], and CN2 [15], can handle only categorical attributes, and some can handle continuous attributes but perform better on categorical ones [76]. To solve this problem, many discretization algorithms have been proposed. Given a continuous attribute A, a discretization algorithm discretizes this attribute into n discrete intervals {[d_0, d_1], (d_1, d_2], ..., (d_{n-1}, d_n]}, where d_0 is the minimal value and d_n is the maximal value of attribute A. The result {[d_0, d_1], (d_1, d_2], ..., (d_{n-1}, d_n]} is called a discretization scheme D on attribute A. Discretization is usually performed prior to the learning process and can be broken into two steps. The first step is to decide the number of discrete intervals, and most discretization algorithms require the user to specify this number [11]. The second step is to find the width (also called the boundary) of each interval. A good discretization scheme should keep the interdependency between the discretized attribute and the class labels high so as to avoid changing the distribution of the original data [56][69].

The literature on discretization is vast. Liu, Hussain, and Dash [49] stated that discretization approaches have been proposed along five lines: supervised versus unsupervised, static versus dynamic, global versus local, splitting versus merging, and direct versus incremental. Below we give a brief introduction to these five lines.

1. Supervised methods take class information into consideration when discretizing attributes, while unsupervised methods do not.

2. Dynamic methods [4][77] consider the interdependence among the feature attributes and discretize continuous attributes while a classifier is being built. In contrast, static methods consider each attribute in isolation and complete the discretization prior to the learning task.

3. Global methods, which use all instances to generate the discretization scheme, are usually associated with static methods. In contrast, local methods are usually associated with dynamic approaches, in which only part of the instances is used for discretization.

4. Merging methods start with the complete list of all continuous values of the attribute as cut-points and remove some of them by merging intervals in each step. Splitting methods start with an empty list of cut-points and add new ones in each step. The computational complexity of merging methods is usually higher than that of splitting methods.

5. Direct methods, such as Equal Width and Equal Frequency [12], require users to decide the number of intervals k and then discretize the continuous attribute into k intervals simultaneously. On the other hand, incremental methods begin with a simple discretization scheme and refine it progressively, although some of them require a stopping criterion to terminate the discretization.

Take the two simplest discretization algorithms, Equal Width and Equal Frequency [12], as examples: both of them are unsupervised, static, global, splitting, and direct methods. In what follows, we review some typical discretization algorithms along the line of splitting versus merging.
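Both of these direct methods can be written in a few lines. The Python sketch below assumes a plain list of numeric values and a user-chosen number of intervals k.

def equal_width_cut_points(values, k):
    """Split the range [min, max] into k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cut_points(values, k):
    """Choose cut points so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    step = len(ordered) / k
    return [ordered[int(i * step)] for i in range(1, k)]

salaries = [20000, 25000, 40000, 61000, 75000, 90000, 120000, 150000]
print(equal_width_cut_points(salaries, 4))      # [52500.0, 85000.0, 117500.0]
print(equal_frequency_cut_points(salaries, 4))  # [40000, 75000, 120000]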

Famous splitting methods include Equal Width and Equal Frequency [12], Paterson-Niblett [60], maximum entropy [76], CADD (Class-Attribute Dependent Discretizer algorithm) [11], IEM (Information Entropy Maximization) [23], CAIM (Class-Attribute Interdependence Maximization) [44], and FCAIM (Fast Class-Attribute Interdependence Maximization) [43]. Experiments showed that FCAIM and CAIM are superior to the other splitting discretization algorithms since their discretization schemes generally maintain the highest interdependence between class labels and discretized attributes, result in the fewest generated rules, and attain the highest classification accuracy [43][44]. FCAIM is an extension of CAIM. The main framework, including the discretization criterion and the stopping criterion, as well as the time complexity, is the same for CAIM and FCAIM; the only difference lies in the initialization of the boundary points in the two algorithms. Compared to CAIM, FCAIM is faster and achieves similar C5.0 accuracy.

A common characteristic of the merging methods is the use of a significance test to check whether two adjacent intervals should be merged. ChiMerge [36] is the most typical merging algorithm. In addition to its high computational complexity, the other main drawback of ChiMerge is that users have to provide several parameters, including the significance level as well as the maximal and minimal numbers of intervals. Hence, Chi2 [50] was proposed based on ChiMerge. Chi2 improves ChiMerge by automatically calculating the value of the significance level. However, Chi2 still requires the user to provide an inconsistency rate to stop the merging procedure and does not consider the degrees of freedom, which have an important impact on the discretization scheme. Thereafter, Modified Chi2 [70] takes the degrees of freedom into account and replaces the inconsistency check in Chi2 with the quality of approximation after each step of discretization. Such a mechanism makes Modified Chi2 a completely automated method that attains better predictive accuracy than Chi2. After Modified Chi2, Extended Chi2 [69] considers that the class labels of instances often overlap in the real world. Extended Chi2 determines the predefined misclassification rate from the data itself and considers the effect of variance in two adjacent intervals. With these modifications, Extended Chi2 can handle uncertain datasets.

Experiments on these merging approaches using C5.0 show that Extended Chi2 outperforms the other bottom-up discretization algorithms since its discretization schemes can, on average, reach the highest accuracy [69].
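The significance test at the heart of these merging methods is a chi-square statistic computed over two adjacent intervals. The Python sketch below shows that computation in isolation; it is not a full ChiMerge implementation, and the counts are hypothetical.

def chi_square(interval_a, interval_b):
    """Chi-square statistic for two adjacent intervals, each given as a list of
    per-class counts; a small value suggests the intervals can be merged."""
    classes = range(len(interval_a))
    col_totals = [interval_a[i] + interval_b[i] for i in classes]
    n = sum(col_totals)
    chi2 = 0.0
    for row in (interval_a, interval_b):
        row_total = sum(row)
        for i in classes:
            expected = row_total * col_totals[i] / n
            if expected > 0:
                chi2 += (row[i] - expected) ** 2 / expected
    return chi2

# Two adjacent intervals with counts for classes (approve, reject).
print(chi_square([8, 2], [7, 3]))   # small value: similar class distributions, merge
print(chi_square([9, 1], [1, 9]))   # large value: keep the intervals separate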

2.3.2 CAIM Discretization Algorithm

CAIM is the newest splitting discretization algorithm. In comparison with other splitting discretization algorithms, experiments showed that, on average, CAIM generates a better discretization scheme. These experiments also showed that a classification algorithm which uses CAIM as a preprocessor to discretize the training data can, on average, produce the fewest rules and reach the highest classification accuracy [44].

Given the two-dimensional quanta matrix (also called a contingency table) in Table 2.1, where n_ir denotes the number of records belonging to the i-th class label that are within interval (v_{r-1}, v_r], L is the total number of class labels, N_i+ is the number of records belonging to the i-th class, N_+r is the number of records within interval (v_{r-1}, v_r], and I is the number of intervals, CAIM defines the interdependency between the class labels and the discretization scheme of a continuous attribute A as

caim = ( Σ_{r=1}^{I} max_r^2 / N_+r ) / I,    (2.1)

where max_r is the maximum value among all n_ir values within the r-th interval (i = 1, ..., L). The larger the value of caim is, the better the generated discretization scheme D will be.

Table 2.1 The quanta matrix for attribute A and discretization scheme D

label \ interval   [v_0,v_1]   ...   (v_{r-1},v_r]   ...   (v_{I-1},v_I]   summation
l_1                n_11        ...   n_1r            ...   n_1I            N_1+
:                  :                 :                     :               :
l_i                n_i1        ...   n_ir            ...   n_iI            N_i+
:                  :                 :                     :               :
l_L                n_L1        ...   n_Lr            ...   n_LI            N_L+
summation          N_+1        ...   N_+r            ...   N_+I            N

CAIM is a progressive discretization algorithm and does not require users to provide any parameter. For a continuous attribute, CAIM tests all possible cutting points and generates one cut point in each loop; the loop stops when a specific condition is met. For each possible cutting point in each loop, the corresponding caim value is computed according to Formula 2.1, and the point with the highest caim value is chosen. Since finding the discretization scheme with the globally optimal caim value would require a large computational cost, the CAIM algorithm only finds local maxima of caim and thus generates a sub-optimal discretization scheme. In the experiments, CAIM adopts the cair criterion [76], shown in Formula 2.2, to evaluate the quality of a generated scheme. The cair criterion is used in the CADD algorithm [11]. CADD has several disadvantages: for example, it needs a user-specified number of intervals and requires training for the selection of a confidence interval. Some experimental results also show that cair is not a good discretization criterion since it can suffer from the overfitting problem [44]. However, cair can effectively represent the interdependency between the target class and the discretized attributes, and is therefore used to measure the quality of a discretization scheme.
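As a concrete illustration, the Python sketch below computes the caim value of Formula 2.1 from a quanta matrix and uses it to compare two candidate cutting points. It is a simplified sketch with hypothetical counts, not the full published CAIM algorithm.

def caim_value(quanta):
    """Formula 2.1: caim = (sum over the I intervals of max_r^2 / N_+r) / I,
    where quanta[r][i] is the number of records of class i in interval r."""
    total = 0.0
    for interval in quanta:
        n_plus_r = sum(interval)
        if n_plus_r > 0:
            total += max(interval) ** 2 / n_plus_r
    return total / len(quanta)

# Quanta matrices of two candidate discretization schemes (rows are intervals,
# columns are class labels); all counts are hypothetical.
scheme_a = [[20, 5], [3, 22]]      # each interval is dominated by one class
scheme_b = [[12, 13], [11, 14]]    # both intervals mix the two classes
print(caim_value(scheme_a))        # higher caim: stronger interdependence
print(caim_value(scheme_b))        # lower caim

# One greedy step of the kind described above: among the candidate cutting
# points, keep the one whose resulting quanta matrix yields the highest caim.
candidates = {55000: scheme_a, 80000: scheme_b}
best_cut = max(candidates, key=lambda cut: caim_value(candidates[cut]))
print(best_cut)                    # 55000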

2.4 UCI Database and IBM Data Generator

In this dissertation, we use both real and synthetic datasets to carry out a series of experimental evaluations. The real experimental datasets are selected from the UCI database [59], which is a repository of several kinds of datasets. The UCI database is widely used by the machine learning community for the empirical analysis of machine learning algorithms.

For the artificial experimental datasets, we use the IBM data generator [1][2], which was designed by the IBM Almaden Research Center and is an open-source tool written in the C++ programming language. The IBM data generator is a popular tool for researchers to generate artificial data for evaluating the performance of proposed algorithms. One advantage of the IBM data generator is that it contains many built-in functions to generate several kinds of datasets, which enables researchers to carry out a series of experimental comparisons. There are nine basic attributes (salary, commission, loan, age, zipcode, h-years, h-value, e-level, and car) and a target attribute in the IBM data generator. Among the nine attributes, zipcode, e-level, and car are categorical attributes, and all the others are continuous. The number of class labels can be decided by the user and is set to 2 by default. A summary of these nine basic attributes is given in Table 2.2; a minimal sketch of generating one such record follows the table. In this dissertation, we modify the IBM data generator to generate datasets containing concept-drifting records.

Table 2.2 The summary of the nine basic attributes in the IBM data generator

Attribute    Type         Value domain
salary       continuous   20,000 to 150,000
commission   continuous   if salary ≥ 75,000 then commission = 0; else uniformly distributed from 10,000 to 75,000
loan         continuous   0 to 500,000
h-years      continuous   1 to 30
h-value      continuous   0.5k*100,000 to 1.5k*100,000, where k ∈ {1, ..., 9} depends on zipcode
age          continuous   20 to 80
car          categorical  1 to 20
e-level      categorical  0 to 4
zipcode      categorical  1 to 9
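For concreteness, the following Python sketch imitates the generation of a single record with the nine basic attributes of Table 2.2. It is a re-implementation sketch, not the IBM data generator itself, and the labeling rule at the end is an assumption made only for illustration.

import random

def generate_record():
    """One synthetic record with the nine basic attributes of Table 2.2.
    The value ranges follow the table; the class-assignment rule is assumed."""
    salary = random.uniform(20000, 150000)
    record = {
        "salary": salary,
        "commission": 0 if salary >= 75000 else random.uniform(10000, 75000),
        "loan": random.uniform(0, 500000),
        "age": random.uniform(20, 80),
        "zipcode": random.randint(1, 9),
        "h-years": random.randint(1, 30),
        "e-level": random.randint(0, 4),
        "car": random.randint(1, 20),
    }
    k = record["zipcode"]  # h-value depends on zipcode through k
    record["h-value"] = random.uniform(0.5 * k * 100000, 1.5 * k * 100000)
    # Hypothetical labeling rule producing the default two class labels.
    record["class"] = "group A" if salary + record["commission"] > 100000 else "group B"
    return record

print(generate_record())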

Chapter 3

Sensitive Concept Drift Probing Decision Tree Algorithm

In this chapter, we first formally discuss the concept-drifting problem in Section 3.1. In Section 3.2, we introduce our sensitive concept drift probing decision tree algorithm (SCRIPT). The empirical analyses of SCRIPT are presented in Section 3.3.

3.1 One-way Drift and Two-way Drift

To help readers understand the problem we will address, in this dissertation we divide concept changes into three cases: concept stable, concept drift, and concept shift. We refer to the examples in [73] and modify the figures to illustrate the problem in Figure 3.1. Figure 3.1 represents a two-dimensional data stream that is divided into six successive data blocks according to the arrival time of the data. Instances arriving between t_i and t_{i+1} form block B_i, and the separating line in each block stands for the optimal classification boundary in that block.

During time t_0 to t_1, data blocks B_0 and B_1 have similar data distributions; that is, the data stream during this period is stable. Thereafter, in B_2, some instances show concept drift and the optimal boundary changes. This is defined as concept drift. Finally, data blocks B_4 and B_5