
2 Background and Related Work

2.3 Discretization Techniques

In this subsection, we review several proposed discretization algorithms and describe the state-of-the-art CAIM discretization algorithm in detail.

2.3.1 Proposed Discretization Approaches

Attributes can be divided into continuous attributes and categorical attributes. Among the proposed classification algorithms, some, such as AQ [58], CLIP [14][13], and CN2 [15], can handle only categorical attributes, while others can handle continuous attributes but perform better on categorical ones [76]. To address this problem, many discretization algorithms have been proposed. Given a continuous attribute A, a discretization algorithm discretizes the attribute into n discrete intervals {[d_0, d_1], (d_1, d_2], …, (d_{n-1}, d_n]}, where d_0 is the minimal value and d_n is the maximal value of attribute A. The result {[d_0, d_1], (d_1, d_2], …, (d_{n-1}, d_n]} is called a discretization scheme D on attribute A. Discretization is usually performed prior to the learning process and can be broken into two steps. The first step is to decide the number of discrete intervals, and most discretization algorithms require the user to specify this number [11]. The second step is to find the width (i.e., the boundaries) of each interval. A good discretization scheme should keep the interdependency between the discretized attribute and the class labels high, so as to avoid changing the distribution of the original data [56][69].
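To make the role of a discretization scheme concrete, the following minimal sketch stores a scheme D as its sorted boundary points d_0 < d_1 < … < d_n and maps a continuous value to the index of the interval containing it; the function name assign_interval and the toy boundaries are illustrative only.

```python
from bisect import bisect_left

def assign_interval(value, boundaries):
    """Map a continuous value to its interval index under a scheme D.

    `boundaries` is the sorted list [d0, d1, ..., dn]; interval 0 is
    the closed interval [d0, d1], and interval r >= 1 is (d_r, d_{r+1}].
    """
    # Values at or below d1 fall into the first (closed) interval.
    if value <= boundaries[1]:
        return 0
    # bisect_left returns the position of the first boundary >= value,
    # so the value lies in the half-open interval that ends there.
    return bisect_left(boundaries, value) - 1

# Example scheme D = {[0, 2], (2, 5], (5, 9]} on some attribute A:
print(assign_interval(4.3, [0, 2, 5, 9]))  # -> 1, i.e. the interval (2, 5]
```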

The literature on discretization is vast. Liu, Hussain and Dash [49] stated that discretization approaches have been proposed along five lines: supervised versus unsupervised, static versus dynamic, global versus local, splitting versus merging, and direct versus incremental. Below we give a brief introduction to the five lines.

1. Supervised methods discretize attributes with the consideration of class information, while unsupervised methods do not.

2. Dynamic methods [4][77] consider the interdependence among the feature attributes and discretize continuous attributes while a classifier is being built. In contrast, static methods consider each attribute in isolation and complete the discretization prior to the learning task.

3. Global methods, which use all instances to generate the discretization scheme, are usually associated with static methods. In contrast, local methods, in which only part of the instances are used for discretization, are usually associated with dynamic approaches.

4. Merging methods start with the complete list of all continuous values of the attribute as cut-points and remove some of them by merging intervals in each step. Splitting methods start with an empty list of cut-points and add new ones in each step. The computational complexity of merging methods is usually higher than that of splitting methods.

5. Direct methods, such as Equal-Width and Equal-Frequency [12], require users to decide on the number of intervals k and then discretize the continuous attribute into k intervals at once. In contrast, incremental methods begin with a simple discretization scheme and pass through a refinement process, and some of them require a stopping criterion to terminate the discretization.

Take the two simplest discretization algorithms, Equal Width and Equal Frequency [12], as examples: both are unsupervised, static, global, splitting, and direct methods. In what follows, we review some typical discretization algorithms along the line of splitting versus merging.
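As a concrete illustration of these two direct methods, the sketch below computes boundary points for Equal Width (intervals of identical width) and Equal Frequency (intervals holding roughly the same number of records); the function names and the sample data are illustrative, and NumPy is assumed to be available.

```python
import numpy as np

def equal_width(values, k):
    """Equal-Width: split the observed range into k intervals of equal width."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return list(np.linspace(lo, hi, k + 1))          # boundary points d0 .. dk

def equal_frequency(values, k):
    """Equal-Frequency: choose boundaries so that each interval holds
    about the same number of records (k quantile-based intervals)."""
    probs = np.linspace(0, 1, k + 1)
    return list(np.quantile(np.asarray(values, dtype=float), probs))

data = [0.5, 1.2, 1.3, 2.8, 3.0, 4.4, 7.9, 8.1, 9.6, 10.0]
print(equal_width(data, 4))       # 4 equally wide intervals over [0.5, 10.0]
print(equal_frequency(data, 4))   # 4 intervals with roughly 2-3 records each
```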

Famous splitting methods include Equal Width and Equal Frequency [12], Paterson-Niblett [60], maximum entropy [76], CADD (Class-Attribute Dependent Discretizer algorithm) [11], IEM (Information Entropy Maximization) [23], CAIM (Class-Attribute Interdependence Maximization) [44], and F-CAIM (Fast Class-Attribute Interdependence Maximization) [43]. Experiments showed that F-CAIM and CAIM are superior to the other splitting discretization algorithms, since their discretization schemes generally maintain the highest interdependence between class labels and discretized attributes, result in the fewest generated rules, and attain the highest classification accuracy [43][44]. F-CAIM is an extension of CAIM. The main framework, including the discretization criterion and the stopping criterion, as well as the time complexity, is the same for CAIM and F-CAIM; the only difference lies in how the two algorithms initialize the boundary points. Compared with CAIM, F-CAIM is faster and attains similar C5.0 classification accuracy.

A common characteristic of the merging methods is the use of a significance test to check whether two adjacent intervals should be merged. ChiMerge [36] is the most typical merging algorithm. In addition to its high computational complexity, the other main drawback of ChiMerge is that users have to provide several parameters, including the significance level as well as the maximal and minimal numbers of intervals. Hence, Chi2 [50] was proposed based on ChiMerge. Chi2 improves on ChiMerge by automatically calculating the value of the significance level. However, Chi2 still requires users to provide an inconsistency rate to stop the merging procedure and does not consider the degree of freedom, which has an important impact on the discretization scheme. Thereafter, Modified Chi2 [70] takes the degree of freedom into account and replaces the inconsistency check in Chi2 with the quality of approximation after each step of discretization. This mechanism makes Modified Chi2 a completely automated method that attains better predictive accuracy than Chi2. After Modified Chi2, Extended Chi2 [69] considers that the class labels of instances often overlap in the real world. Extended Chi2 determines the predefined misclassification rate from the data itself and considers the effect of variance in two adjacent intervals. With these modifications, Extended Chi2 can handle datasets with uncertainty.
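The significance test behind ChiMerge is the standard chi-square statistic computed over the class counts of two adjacent intervals. The minimal sketch below, with the illustrative helper name chi_square, shows how that statistic would be obtained for a pair of candidate intervals, assuming each interval is summarized as a list of per-class record counts.

```python
def chi_square(row_a, row_b):
    """Chi-square statistic for two adjacent intervals, as used by ChiMerge.

    `row_a` and `row_b` hold the per-class record counts of the two
    intervals; a small statistic means their class distributions are
    similar, so the intervals are good candidates for merging.
    """
    n = sum(row_a) + sum(row_b)                         # records in both intervals
    col_totals = [a + b for a, b in zip(row_a, row_b)]  # per-class totals
    chi2 = 0.0
    for row in (row_a, row_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n        # expected count under independence
            if expected > 0:                            # classes absent from both rows contribute nothing
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Two adjacent intervals with counts for three class labels:
print(chi_square([4, 1, 0], [3, 2, 0]))  # ~0.48, low enough to suggest merging
```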

Experiments on these merging approaches using C5.0 show that Extended Chi2 outperforms the other bottom-up discretization algorithms, since its discretization schemes, on average, reach the highest accuracy [69].

2.3.2 CAIM Discretization Algorithm

CAIM is the newest splitting discretization algorithm. In comparison with other splitting discretization algorithms, experiments showed that CAIM can, on average, generate a better discretization scheme. These experiments also showed that a classification algorithm that uses CAIM as a preprocessor to discretize the training data can, on average, produce the fewest rules and reach the highest classification accuracy [44].

Given the two-dimensional quanta matrix (also called a contingency table) in Table 2.1, where n_ir denotes the number of records belonging to the i-th class label that fall within interval (v_{r-1}, v_r], L is the total number of class labels, N_i+ is the number of records belonging to the i-th class, N_+r is the number of records within interval (v_{r-1}, v_r], and I is the number of intervals, CAIM defines the interdependency between the class labels and the discretization scheme D of a continuous attribute A as

\[
\mathrm{caim} = \frac{\sum_{r=1}^{I} \max_r^{2} / N_{+r}}{I},
\tag{2.1}
\]

where max_r is the maximum value among the n_ir values in the r-th column of the quanta matrix (i = 1, …, L). The larger the value of caim, the better the generated discretization scheme D.

Table 2.1 The quanta matrix for attribute A and discretization scheme D

label \ interval | [v_0, v_1] | … | (v_{r-1}, v_r] | … | (v_{I-1}, v_I] | summation
l_1              | n_11       | … | n_1r           | … | n_1I           | N_1+
⋮                | ⋮          |   | ⋮              |   | ⋮              | ⋮
l_i              | n_i1       | … | n_ir           | … | n_iI           | N_i+
⋮                | ⋮          |   | ⋮              |   | ⋮              | ⋮
l_L              | n_L1       | … | n_Lr           | … | n_LI           | N_L+
summation        | N_+1       | … | N_+r           | … | N_+I           | N
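To connect Formula 2.1 with Table 2.1, the sketch below evaluates the caim criterion directly on a quanta matrix stored as a list of rows (one row of interval counts per class label); the helper name caim_value and the toy matrix are illustrative only.

```python
def caim_value(quanta):
    """caim criterion (Formula 2.1) of a quanta matrix.

    `quanta` is an L x I list of lists: quanta[i][r] is the number of
    records of class label i that fall into interval r.
    """
    num_intervals = len(quanta[0])                     # I
    total = 0.0
    for r in range(num_intervals):
        column = [row[r] for row in quanta]            # counts n_1r .. n_Lr
        max_r = max(column)                            # count of the dominant class
        n_plus_r = sum(column)                         # N_+r, records in interval r
        if n_plus_r > 0:
            total += max_r ** 2 / n_plus_r
    return total / num_intervals

# Three class labels and two intervals: the first interval is dominated by
# class 1 and the second by class 3, so the caim value is fairly high.
quanta = [[8, 1],
          [1, 2],
          [0, 7]]
print(caim_value(quanta))  # (8**2/9 + 7**2/10) / 2 = 6.005...
```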

CAIM is a progressive (incremental) discretization algorithm and does not require users to provide any parameter. For a continuous attribute, CAIM tests all possible cut points and generates one cut point in each loop; the loop repeats until a specific stopping condition is met. For each possible cut point in each loop, the corresponding caim value is computed according to Formula 2.1, and the cut point with the highest caim value is chosen. Since finding the discretization scheme with the globally optimal caim value would require substantial computation, the CAIM algorithm only finds a local maximum of caim and thus generates a sub-optimal discretization scheme. In the experiments, CAIM adopts the cair criterion [76], shown in Formula 2.2, to evaluate the quality of a generated scheme. The cair criterion is also used in the CADD algorithm [11]. CADD has several disadvantages: it needs a user-specified number of intervals and requires training for the selection of a confidence interval. Some experimental results also show that cair is not a good discretization criterion, since it can suffer from the overfitting problem [44]. However, cair can effectively represent the interdependency between the target class and the discretized attributes, and it is therefore used to measure the quality of a discretization scheme.
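The following condensed sketch puts the loop just described together: starting from a single interval, it repeatedly adds the candidate cut point whose tentative scheme yields the highest caim value. The function name caim_discretize and the toy data are illustrative, and the stopping rule shown (keep adding cut points while the caim value improves, or while there are still fewer intervals than class labels) is one common reading of the criterion reported for CAIM in [44].

```python
def caim_discretize(values, labels):
    """Greedy, CAIM-style search for a discretization scheme.

    Starts from one interval covering the whole range of the attribute
    and, in each loop, adds the candidate cut point whose tentative
    scheme has the highest caim value (Formula 2.1).
    """
    classes = sorted(set(labels))
    candidates = sorted(set(values))                 # every observed value is a candidate cut point
    boundaries = [candidates[0], candidates[-1]]     # initial scheme: a single interval

    def caim_of(bounds):
        # Build the quanta matrix for this tentative scheme, then apply Formula 2.1.
        counts = [[0] * (len(bounds) - 1) for _ in classes]
        for v, c in zip(values, labels):
            r = 0
            while r < len(bounds) - 2 and v > bounds[r + 1]:
                r += 1
            counts[classes.index(c)][r] += 1
        total = 0.0
        for r in range(len(bounds) - 1):
            column = [row[r] for row in counts]
            if sum(column) > 0:
                total += max(column) ** 2 / sum(column)
        return total / (len(bounds) - 1)

    best_caim = 0.0
    while True:
        trials = [(caim_of(sorted(boundaries + [cut])), cut)
                  for cut in candidates if cut not in boundaries]
        if not trials:
            break
        score, cut = max(trials)                     # best cut point in this loop
        # Accept the cut while caim keeps improving, or while the scheme
        # still has fewer intervals than class labels; otherwise stop.
        if score > best_caim or len(boundaries) - 1 < len(classes):
            boundaries = sorted(boundaries + [cut])
            best_caim = score
        else:
            break
    return boundaries

# Toy attribute whose two classes separate around 5.0:
vals = [1.0, 1.5, 2.0, 4.5, 5.5, 6.0, 7.5, 8.0]
labs = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
print(caim_discretize(vals, labs))  # -> [1.0, 4.5, 8.0]
```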