

5.2.2 Discretizing Continuous Attributes and the Computational Complexity

To the best of our knowledge, CDR-Tree is the first algorithm to mine ordered multi-valued and multi-labeled datasets. Without loss of generality, we use ordered two-valued and two-labeled data ({v1, v2}, {l1, l2}) to illustrate how OMMD discretizes continuous attributes.

First, OMMD sorts the data according to v1 and then v2. After sorting, continuous attributes are in the form {v1, v2}, ..., {v1, vp}, ..., {vq, v2}, ..., {vq, vp}, ..., where v2 < vp and v1 < vq. The dataset in Table 5.2 has been sorted according to this principle. After sorting, iterative splitting discretization using Formula 5.1 is applied to records whose v1 values are identical. In other words, the records are split into |v1| subsets, and discretization is carried out within each subset by referring to v2 and the label sets.
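The sort-then-split step can be sketched in Python; the sample records below are invented in the spirit of Table 5.2 (weights before and after, with labels h for healthy and s for sick), so the exact values are illustrative only:

```python
from itertools import groupby

# Invented sample records in the ({v1, v2}, {l1, l2}) format of Table 5.2:
# each pair of weights carries a pair of labels (h = healthy, s = sick).
records = [
    ((60, 70), ('h', 's')), ((60, 44), ('h', 's')), ((70, 80), ('h', 's')),
    ((60, 52), ('h', 'h')), ((70, 68), ('h', 'h')),
]

# Step 1: sort by v1 and then v2 (lexicographic order on the value pair).
records.sort(key=lambda r: r[0])

# Step 2: split into |v1| subsets of records sharing the same v1; iterative
# splitting discretization then runs inside each subset on v2 and the labels.
subsets = {v1: list(group)
           for v1, group in groupby(records, key=lambda r: r[0][0])}

print(sorted(subsets))   # [60, 70]
print(len(subsets[60]))  # 3
```

Note that `groupby` only groups adjacent equal keys, which is exactly why the sort must come first.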

Example 5.3: In Table 5.2, there are two subsets {R1, R2, R3, R4, R5, R6, R7} and {R8, R9, R10}. The discretization scheme after applying iterative splitting discretization is shown in Table 5.7.

Table 5.7 The discretization scheme of Table 5.2 after iterative splits

  interval    [60, 44~51]   [60, 52~69]   [60, 70~81]   [70, 68~79]   [70, 80~86]
  variance    [-16, -9]     [-8, +9]      [+10, +21]    [-2, +9]      [+10, +16]
  label set   {h, s}        {h, h}        {h, s}        {h, h}        {h, s}
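Each variance entry in Table 5.7 is obtained by subtracting the subset's fixed v1 from the endpoints of the v2 range; a quick check in Python, with the interval endpoints transcribed from the table:

```python
# (v1, v2_low, v2_high) triples transcribed from the interval row of Table 5.7.
intervals = [(60, 44, 51), (60, 52, 69), (60, 70, 81), (70, 68, 79), (70, 80, 86)]

# variance interval = (v2_low - v1, v2_high - v1)
variances = [(lo - v1, hi - v1) for v1, lo, hi in intervals]
print(variances)  # [(-16, -9), (-8, 9), (10, 21), (-2, 9), (10, 16)]
```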

Moreover, OMMD applies a merging procedure after the iterative splits. This merging procedure is inspired by the observation that real data contain variant attributes.

A variant attribute is one whose label sets can be identified by the variance of its attribute values.

The formal definition of variant attributes is given in Definition 5.3. As mentioned in Chapter 1, a good discretization algorithm should produce a concise summarization of attributes and help experts and users understand the data more easily. The merging procedure enables OMMD to produce discretization schemes that are simpler and more easily understood by users.

In real life, there are many variant attributes, such as weight, blood pressure, body temperature, and so on. For example, two people may have different body temperatures, but if both show an abnormal increase of similar magnitude, both of them are sick.

Definition 5.3: Let D be an ordered and inseparable multi-valued and multi-labeled dataset and A a continuous attribute in D. Let V = {vi | i = 1, …, p; p ≥ 1} be the domain of A and L = {lj | j = 1, …, q; q ≥ 1} be the set of all class labels; then records in D are of the form ({v′}, {l′}), where v′ ⊆ V, l′ ⊆ L and |v′| = |l′|. Suppose the records in D that have an identical label set {l′} satisfy mino ≤ vo − vo−1 ≤ maxo, where vo is the o-th value in v′ and mino and maxo respectively denote the minimal and maximal variance between vo and vo−1 among these records. If there does not exist a record R that satisfies mino ≤ vo − vo−1 ≤ maxo but whose label set is not {l′}, the attribute A is called a variant attribute.
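A rough operational reading of Definition 5.3 for two-valued records can be sketched in Python; `is_variant` and the toy records below are invented for illustration, and each label set is assumed to induce a single contiguous variance range:

```python
from collections import defaultdict

def is_variant(records):
    """records: list of ((v1, v2), labels). The attribute counts as variant
    if no record falls inside the variance range [min_o, max_o] observed for
    some label set while carrying a different label set."""
    ranges = defaultdict(lambda: [float('inf'), float('-inf')])
    for (v1, v2), labels in records:
        d = v2 - v1
        lo, hi = ranges[labels]
        ranges[labels] = [min(lo, d), max(hi, d)]
    for (v1, v2), labels in records:
        d = v2 - v1
        for other, (lo, hi) in ranges.items():
            if other != labels and lo <= d <= hi:
                return False
    return True

data = [((60, 52), ('h', 'h')), ((70, 68), ('h', 'h')),   # variances -8, -2
        ((60, 70), ('h', 's')), ((70, 80), ('h', 's'))]   # variances +10, +10
print(is_variant(data))                              # True: ranges do not overlap
print(is_variant(data + [((70, 65), ('h', 's'))]))   # False: variance -5 lies in [-8, -2]
```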

Example 5.4: Taking Table 5.7 as the example again, the discretization scheme after merging is shown in Table 5.8. The discretization scheme is output in the form [vari−1, vari] ∪ ..., which means that records whose variance lies between vari−1 and vari have the identical label set. In Table 5.8, records with an increase of 10 to 21 kilograms or a decrease of 9 to 16 kilograms in weight have the same label set {health, sick}. Similarly, records with a variance from −8 to +9 kilograms can be grouped together. Obviously, the discretization scheme in Table 5.8 is simpler and more meaningful to users than that in Table 5.7.

Table 5.8 The discretization scheme of Table 5.2 after merging

  variance    [-16, -9] ∪ [+10, +21]   [-8, +9]
  label set   {h, s}                   {h, h}
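The grouping behind Table 5.8 can be sketched by collecting the Table 5.7 intervals per label set and unioning them; the simple coalescing helper below is a stand-in for OMMD's formal merge conditions, not a transcription of them:

```python
from collections import defaultdict

# Variance intervals and label sets transcribed from Table 5.7.
scheme = [((-16, -9), ('h', 's')), ((-8, 9), ('h', 'h')), ((10, 21), ('h', 's')),
          ((-2, 9), ('h', 'h')), ((10, 16), ('h', 's'))]

grouped = defaultdict(list)
for interval, labels in scheme:
    grouped[labels].append(interval)

def coalesce(intervals):
    """Union overlapping variance intervals within one label set."""
    out = []
    for lo, hi in sorted(intervals):
        if out and lo <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out

merged = {labels: coalesce(ivs) for labels, ivs in grouped.items()}
print(merged[('h', 's')])  # [(-16, -9), (10, 21)] -- the union in Table 5.8
print(merged[('h', 'h')])  # [(-8, 9)]
```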

Below is the pseudocode of OMMD for discretizing continuous attributes in an ordered two-valued and two-labeled dataset. It is worth noting that, instead of adopting a simple greedy search as used in CAIM, OMMD uses a simulated annealing approach to reduce the probability of getting stuck in local maxima.

The complexity of OMMD for discretizing a continuous attribute is estimated as follows. The computational complexity of the sort in Line 2 is O(N log N), where N is the number of records in D. Dividing a continuous attribute into S subsets in Line 3 and the initial setups in Lines 4 and 5 require O(N). Similar to NOMMD, the expected running time of the iterative partition from Line 6 to Line 32 can be estimated as O(S·(N/S)·L³) = O(N), where S is the number of subsets. Compared to NOMMD, the extra time required by OMMD is caused by the merging procedure in Lines 35 to 42. Intuitively, this merging procedure takes O(N³) time, which is the computational complexity required by typical merging discretization algorithms such as the original ChiMerge. However, since this merging procedure works on splitting discretization schemes, it can be optimized to O(N). Consider the discretization scheme generated by the iterative partition from Line 6 to Line 32; it has two important properties.

1. The variances of the discretization intervals in each subset can be regarded as a list of ordered values (var0, var1, ..., vari), where vari is a variance value.

2. For each list, the label distributions of any two adjacent intervals are different.

With these two properties, it is easy to prove that S discretization intervals in different subsets can be merged together in one step if and only if:

1. All S intervals have an identical label distribution.

2. Intervals that do not contain the maximum variance maxS or the minimum variance minS of their subset S have the same variance interval [vari−1, vari].

3. Intervals that contain maxS or minS are of the form [vari−1, maxS] or [minS, vari].

Therefore, the computational complexity of merging intervals across different subsets in Lines 35 to 39 can be estimated as O(L·N). The merged intervals are further merged in Lines 40 to 42 if their label distributions are the same; an example is the merging of the variances [-16, -9] and [+10, +21] in Table 5.7. This step requires only a small constant time.

Since the number of class labels in most real data is a small constant, the total computational complexity of the merging procedure can be estimated as O(N). As a result, the expected computational complexity of OMMD for discretizing a continuous attribute is O(N log N).

Please note that for an inseparable continuous attribute, it is possible that only part of the values can be represented by the variance. In other words, the output discretization scheme may be in the form [vari−1, vari] ∪ [v1, v2j ~ v2j] ∪ ....

OMMD algorithm for inseparable continuous attributes

D: an ordered two-valued and two-labeled dataset. Records in D are in the format ({a1, a2}, {l1, l2});

i: the number of continuous attributes in D;

S: the number of subsets.

L: the number of class labels in each subset;

Cut[]: the set of possible cut-points which are the midpoints of adjacent values belonging to different class labels in a continuous attribute;

MMDSS: the multi-valued and multi-labeled discretization scheme of subset S;

MMDS: a multi-valued and multi-labeled discretization scheme;

OMMD (D)

1. For each continuous attribute Ai

2. Sort Ai according to a1 and then a2;

13. For each possible cut-point in Cut[]

14. Add it into MMDSS;

15. Calculate the corresponding C’ in Equation 5.1;

16. Get the cut-point cut whose C’ value is the maximum maxC’;

17. ΔC’ = maxC’ - C’;

18. If ΔC’ > 0 or I < L then

19. Remove cut from Cut[];

20. Add cut and the cut-point(s) in MMDSStemp into MMDSS;

21. C’ = maxC’;

28. Add cut into MMDSStemp;

29. Itemp = Itemp + 1;

30. Goto Line 13 with probability e^ΔC’;

31. End If

32. Update MMDS by referring to MMDSS;

33. For each discretization interval in MMDS

34. Calculate the variances;

35. For S intervals belonging to different subsets in MMDS

36. If all intervals in S subsets have the same label distribution and have identical variance [vari-1, vari] or are in the form of [vari-1, maxS] or [minS, vari] then

37. Merge these intervals;

38. Update MMDS;

39. End if

40. For variant intervals in MMDS which have the same label distribution

41. Merge these intervals;

42. Update MMDS;

43. Output MMDS for continuous attribute Ai;