CHAPTER 2 Literature Survey
2.7 k-Nearest-Neighbor Classifier
The k-nearest-neighbor classifier (k-NN) is a method for classifying objects based on the k closest training examples in the feature space. To classify an unknown object, a k-nearest-neighbor classifier searches the feature space for the k objects that are closest to the unknown object, i.e. its k “nearest neighbors”. The unknown object is then assigned to the majority class among its k nearest neighbors. This approach is computationally intensive, especially given a large training set or a high-dimensional feature space. That is why it was proposed in the early 1950s [18] but did not become popular until the 1960s. As computing power has improved, it has been widely used in the area of pattern recognition.
Since the complexity of k-NN is sensitive to the size of the training data and the dimension of the feature space, many approaches to speed up classification have been proposed over the years. Examples include reducing the number of distance evaluations actually performed, partitioning the feature space and restricting the distance computation to a specific area, and using parallel computation techniques.
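The basic classification rule described in this section can be sketched as follows. This is only a minimal brute-force illustration, assuming Euclidean distance and a small hypothetical training set; the function name knn_classify and the toy data are not part of the thesis, and none of the speed-up techniques mentioned above are applied.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Return the majority class among the k training points closest to query.

    train is a list of (feature_vector, label) pairs (hypothetical format).
    """
    # Brute force: one distance evaluation per training example.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data, used only for illustration.
TRAIN = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"),
]

print(knn_classify(TRAIN, (1.1, 1.0), k=3))
print(knn_classify(TRAIN, (5.1, 5.0), k=3))
```

Every query costs one distance evaluation per training example, which is why the cost grows with both the size of the training data and the dimension of the feature space.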
CHAPTER 3
Calculation of Attribute Similarity
The goal of this thesis is to cluster attributes so that the efficiency of classification can be improved. To achieve this goal, it is important to develop an evaluation method that measures the similarity of attributes. In this thesis, we use the dependency degree to represent the similarity between two attributes. If two attributes have a high dependency degree on each other, they can be regarded as highly similar. In this chapter, three evaluation methods for attribute similarity are proposed.
3.1 Attribute Similarity Based on Relative Dependency
As mentioned above, Han et al. developed an approach based on the relative dependency for finding approximate reducts [17]. We extend this metric to measure the similarity between any two attributes [25]. Given two attributes Ai and Aj, the relative dependency degree of Ai with regard to Aj is denoted by Dep(Ai, Aj) and is defined as:

$$\mathrm{Dep}(A_i, A_j) = \frac{|\Pi_{A_i}(U)|}{|\Pi_{A_i, A_j}(U)|},$$

where |ΠX(U)| is the number of distinct tuples obtained by projecting U onto the attributes X. The original dependency degree only considers the relative dependency between a condition attribute set and a decision attribute set. Here we extend the above formula to estimate the relative data dependency between any pair of attributes. The dependency degree is not symmetric, so the condition Dep(Ai, Aj) = Dep(Aj, Ai) does not always hold. The average of Dep(Ai, Aj) and Dep(Aj, Ai) is thus used to represent the similarity of the two attributes Ai and Aj, that is:

$$\mathrm{Sim}(A_i, A_j) = \frac{\mathrm{Dep}(A_i, A_j) + \mathrm{Dep}(A_j, A_i)}{2}.$$
3.2 Attribute Similarity Based on Majority Sets
In the last section, we proposed an evaluation method for attribute similarity based on relative dependency. This measure cannot, however, reflect the actual similarity (dependency) of attributes in some situations. The small decision system shown in Table 3.1 is used as an example to illustrate the problem.
Table 3.1: A simple decision system to evaluate the similarity.
Object Age Income Children Buying Computers
x1 Young Low No No
x2 Young Low No No
x3 Young Low No Yes
x4 Young Low No Yes
x5 Young Middle No No
x6 Young Middle No Yes
x7 Young Middle No Yes
x8 Young Middle Yes No
It can be observed from Table 3.1 that |Π{Age}(U)| = 1, |Π{Children}(U)| = 2 and |Π{Income}(U)| = 2. The relative dependency degrees are thus found as follows: Dep(Age, Children) = 0.5, Dep(Age, Income) = 0.5, Dep(Children, Age) = 1, and Dep(Income, Age) = 1. Both Sim(Age, Children) and Sim(Age, Income) are then calculated as 0.75, so the degree to which Age resembles Children equals the degree to which Age resembles Income. However, it is easy to observe from Table 3.1 that Age is more similar to Children than to Income. The main cause of this phenomenon is the projection operation, which only counts how many distinct values exist instead of analyzing how the values are arranged over the objects. Below, another measure is proposed for evaluating the similarity of attributes more precisely.
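The values quoted above can be checked with a short script. The sketch below assumes that Dep(Ai, Aj) is the ratio |ΠAi(U)| / |ΠAi,Aj(U)|, which is the reading that reproduces the numbers in the text; the names ROWS, COL, projection_size, dep and sim are illustrative only.

```python
# Table 3.1 as (Age, Income, Children) tuples for objects x1..x8.
ROWS = [
    ("Young", "Low", "No"), ("Young", "Low", "No"),
    ("Young", "Low", "No"), ("Young", "Low", "No"),
    ("Young", "Middle", "No"), ("Young", "Middle", "No"),
    ("Young", "Middle", "No"), ("Young", "Middle", "Yes"),
]
COL = {"Age": 0, "Income": 1, "Children": 2}


def projection_size(attrs):
    """Number of distinct value combinations when projecting onto attrs."""
    return len({tuple(row[COL[a]] for a in attrs) for row in ROWS})


def dep(a_i, a_j):
    """Assumed form of the relative dependency degree Dep(a_i, a_j)."""
    return projection_size([a_i]) / projection_size([a_i, a_j])


def sim(a_i, a_j):
    """Average of the two directional dependency degrees."""
    return (dep(a_i, a_j) + dep(a_j, a_i)) / 2


print(dep("Age", "Children"), dep("Children", "Age"))  # 0.5 1.0
print(sim("Age", "Children"), sim("Age", "Income"))    # 0.75 0.75
```

Both similarities come out identical because the projection only counts distinct value combinations, which is exactly the limitation discussed above.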
Let Ai denote the i-th attribute in an information or decision system, Vi denote the set of attribute values for attribute Ai, and vit be the t-th possible value of Ai, vit ∈ Vi. Also let X(vit) represent the set of objects whose values for attribute Ai are vit, and X(vit ∧ vjs) represent the set of objects whose values for the two attributes Ai and Aj are vit and vjs, i ≠ j. Besides, an attribute Aj is said to functionally depend on another attribute Ai (i.e., Ai → Aj) if for any vit, all the objects in X(vit) have the same value of attribute Aj. Even if an attribute does not functionally depend on another attribute, their dependency degree may still be analyzed. Here, the dependency degree is calculated based on the idea of keeping the most objects (or removing the fewest objects) to make the property of functional dependency valid. According to this idea, the majority set is defined below. Let MajAj(vit) be the set of values of attribute Aj that occur most frequently among the objects in X(vit).
Take the calculation of the dependency degree for Children depending on Income in Table 3.1 as an example to illustrate the above idea. Consider the attribute value No for Children. Since X(Income = Middle ∧ Children = No) = {x5, x6, x7} and X(Income = Low ∧ Children = No) = {x1, x2, x3, x4}, MajIncome(Children = No) = {Income = Low}.
Note that MajAj(vit) is a set, instead of a value. It may include more than one value. For example, since X(Income = Low ∧ Age = Young) = {x1, x2, x3, x4} and X(Income = Middle ∧ Age = Young) = {x5, x6, x7, x8}, the majority set MajIncome(Age = Young) is {Low, Middle}. Based on the majority measure, an evaluation of the dependency degree between two attributes is defined as follows:
Take the attributes Children and Income as an example again. Since the majority set MajIncome(Children = No) is {Income = Low} and MajIncome(Children = Yes) is {Income = Middle}, the dependency degree DepM(Income, Children) is calculated as follows:

$$\mathrm{Dep}_M(\mathit{Income}, \mathit{Children}) = \frac{|X(\mathit{Income} = \mathit{Low} \wedge \mathit{Children} = \mathit{No})| + |X(\mathit{Income} = \mathit{Middle} \wedge \mathit{Children} = \mathit{Yes})|}{|U|} = \frac{4 + 1}{8} = 0.625.$$
Similarly, the dependency degree is not symmetric, so the condition DepM(Aj, Ai) = DepM(Ai, Aj) does not always hold. The average value of DepM(Ai, Aj) and DepM(Aj, Ai) is thus used to represent the similarity of the two attributes Ai and Aj. Therefore, the proposed similarity measure, denoted SimM(Ai, Aj), for a pair of attributes Ai and Aj is defined as follows:

$$\mathrm{Sim}_M(A_i, A_j) = \frac{\mathrm{Dep}_M(A_i, A_j) + \mathrm{Dep}_M(A_j, A_i)}{2}.$$

For the above example, the two dependency degrees between Income and Children are calculated as 0.875 and 0.625. They are not equal to each other. The similarity SimM(Income, Children) for the two attributes Income and Children is calculated as (0.875+0.625)/2, which is 0.75.
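The majority-based measure can be illustrated on Table 3.1 as follows. This is a sketch under stated assumptions: the dependency is computed from the verbal description (for each value of the conditioning attribute, keep only the objects carrying a most frequent value of the other attribute, then divide by the number of objects), and the argument order of dep_major(target, condition) is a choice made here for illustration. All function names are hypothetical.

```python
from collections import Counter, defaultdict

# Table 3.1 as (Age, Income, Children) tuples for objects x1..x8.
ROWS = [
    ("Young", "Low", "No"), ("Young", "Low", "No"),
    ("Young", "Low", "No"), ("Young", "Low", "No"),
    ("Young", "Middle", "No"), ("Young", "Middle", "No"),
    ("Young", "Middle", "No"), ("Young", "Middle", "Yes"),
]
COL = {"Age": 0, "Income": 1, "Children": 2}


def value_counts(target, condition):
    """For each value of condition, count the co-occurring target values."""
    counts = defaultdict(Counter)
    for row in ROWS:
        counts[row[COL[condition]]][row[COL[target]]] += 1
    return counts


def majority_set(target, condition, value):
    """Target values that occur most often among objects with condition = value."""
    counter = value_counts(target, condition)[value]
    top = max(counter.values())
    return {v for v, n in counter.items() if n == top}


def dep_major(target, condition):
    """Fraction of objects kept when only majority objects remain."""
    counts = value_counts(target, condition)
    kept = sum(max(c.values()) for c in counts.values())
    return kept / len(ROWS)


def sim_major(a, b):
    """Average of the two directional majority-based dependencies."""
    return (dep_major(a, b) + dep_major(b, a)) / 2


print(majority_set("Income", "Children", "No"))  # {'Low'}
print(dep_major("Income", "Children"), dep_major("Children", "Income"))  # 0.625 0.875
print(sim_major("Income", "Children"))  # 0.75
```

Under the same assumptions, sim_major("Age", "Children") evaluates to 0.9375 while sim_major("Age", "Income") evaluates to 0.75, so this measure separates the two pairs that the projection-based measure could not distinguish.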
3.3 Generalized Attribute Similarity Based on Majority Sets
In general, attributes can be divided into three types: categorical, binary and ordinal. For example, the attributes Sex and Country in Table 3.2 are categorical; Married is binary; English and Income are ordinal.
Table 3.2: A simple information system to evaluate the dependency.
Object  Sex     Married  English    Country    Income
x1      Male    Yes      Excellent  Singapore  Very High
x2      Male    No       Excellent  Singapore  Very High
x3      Male    No       Excellent  Singapore  High
x4      Male    No       Good       Singapore  High
x5      Female  Yes      Average    Singapore  Middle
x6      Female  Yes      Good       Taiwan     Middle
x7      Female  No       Good       Taiwan     High
x8      Male    No       Excellent  Taiwan     High
x9      Male    Yes      Average    Taiwan     Middle
x10     Male    Yes      Poor       Taiwan     Middle
For a binary or categorical attribute, two attribute values are regarded as absolutely different if they are not the same. For an ordinal attribute, however, the difference between two values should be evaluated according to their order. Take the attribute English in Table 3.2 as an example. The difference between Excellent and Good is smaller than that between Excellent and Poor. In the last section, we proposed a measure of attribute similarity based on the majority set. To simplify the problem, that method assumes the difference between any two attribute values is either 0 or 1 (i.e. the same or absolutely different) no matter what the attribute type is. This is not desirable when ordinal attributes appear, because the results of discretization may greatly affect the evaluated similarity. In this section, we thus propose another evaluation measure for attribute similarity, which is more general than the last method. The advantages of this method are: (1) the difference between two attribute values is not so rigid, and (2) the attribute similarity is less sensitive to the error from discretization.
Before the dependency is calculated, the values of each ordinal attribute should be mapped to ranks (orders). For example, the information system in Table 3.2 is transformed into Table 3.3.
Table 3.3: The transformed table of Table 3.2.
Object  Sex     English  Country    Income
x1      Male    4        Singapore  3
x2      Male    4        Singapore  3
x3      Male    4        Singapore  2
x4      Male    3        Singapore  2
x5      Female  2        Singapore  1
x6      Female  3        Taiwan     1
x7      Female  3        Taiwan     2
x8      Male    4        Taiwan     2
x9      Male    2        Taiwan     1
x10     Male    1        Taiwan     1
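The rank mapping that turns the ordinal columns of Table 3.2 into those of Table 3.3 can be sketched as below. The helper to_ranks and the list names are hypothetical; the orderings (Poor < Average < Good < Excellent and Middle < High < Very High) are the ones implied by the two tables.

```python
def to_ranks(values, order):
    """Map ordinal labels to 1-based ranks; order lists the labels lowest first."""
    rank = {label: i for i, label in enumerate(order, start=1)}
    return [rank[v] for v in values]


# English and Income columns of Table 3.2 for objects x1..x10.
ENGLISH = ["Excellent", "Excellent", "Excellent", "Good", "Average",
           "Good", "Good", "Excellent", "Average", "Poor"]
INCOME = ["Very High", "Very High", "High", "High", "Middle",
          "Middle", "High", "High", "Middle", "Middle"]

english_ranks = to_ranks(ENGLISH, ["Poor", "Average", "Good", "Excellent"])
income_ranks = to_ranks(INCOME, ["Middle", "High", "Very High"])

print(english_ranks)  # [4, 4, 4, 3, 2, 3, 3, 4, 2, 1]
print(income_ranks)   # [3, 3, 2, 2, 1, 1, 2, 2, 1, 1]
```

Categorical and binary columns such as Sex and Country are left untouched, since only ordinal values carry an order that a rank can express.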
In Table 3.3, the set of values for attribute English includes Poor, Average, Good and Excellent. These values are thus mapped to {1, 2, 3, 4}. Similarly, the set of values for Income includes Middle, High and Very High, which are then mapped to {1, 2, 3}. An evaluation of the dependency degree between two attributes is then defined as follows:
where |U| is the total number of objects, λ is the coefficient (fraction) for counting the objects which are not major ones, 0 ≤ λ < 1, and Δ(vis, vim) represents the difference between two distinct values of attribute Ai. Note that the computation of Δ(vis, vim) depends on the type of attribute Ai and is divided into the following cases:
(1) Attribute Ai is binary or categorical. In this case, Δ(vis, vim) = 0 if the values vis …
The coefficient λ, on the other hand, is regarded as a weight of the objects except the major objects. The smaller λ is, the less important those objects are. Actually, this formula is the general form of the major dependency. That is, the inner summation is eliminated if λ is set as 0. Basically, the dependency DepM+(Ai, Aj) involves four cases: ordinal to ordinal, ordinal to categorical (binary), categorical (binary) to categorical (binary), and categorical (binary) to ordinal. These four cases are shown in Table 3.4.
For Cases 1 and 2, since attribute Ai is ordinal, the coefficient λ should be given in advance. For Cases 3 and 4, since attribute Ai is categorical or binary, the coefficient λ should be set as 0 to eliminate the inner summation. Below, some examples are given to illustrate the cases.
Take the dependency DepM+(Income, Country) (i.e. Case 2) as an example. Since there are two possible values of Country, both the conditions Country = Singapore and Country = Taiwan should be considered. Assume λ is set at 0.5. The dependency DepM+(Income, Country) is calculated by the following steps.
(1) Since MajIncome(Country = Singapore) is {3, 2}, the set of objects in X(Country = Singapore) is analyzed to determine which value can get a higher dependency degree.
For Income = 3:
For Income = 2:
The larger one (=3.5), which is “if Country = Singapore then Income = 2”, can get the higher dependency evaluation.
(2) Similarly, consider another condition that Country = Taiwan as follows:
For the dependency DepM+(Country, Income), there are three conditions which should be considered due to the three values {1, 2, 3} of Income. In addition, since Country is a categorical attribute, the coefficient λ should be set at 0. The dependency DepM+(Country, Income) can be calculated in the same way.
In summary, the computation of the dependency DepM+(Ai, Aj) can be divided into two situations. The first one is that the pre-attribute, Ai, is ordinal (i.e. Cases 1 and 2); the other one is that the pre-attribute is binary or categorical (i.e. Cases 3 and 4). We ignore the effect that the post-attribute (Aj) is ordinal.
The dependency degree is not symmetric, so the condition DepM+(Aj, Ai) = DepM+(Ai, Aj) does not always hold. The average value of DepM+(Ai, Aj) and DepM+(Aj, Ai) is thus used to represent the similarity of the two attributes Ai and Aj. Therefore, the proposed similarity measure, denoted SimM+(Ai, Aj), for a pair of attributes Ai and Aj is defined as follows:

$$\mathrm{Sim}_{M^{+}}(A_i, A_j) = \frac{1}{2}\left(\mathrm{Dep}_{M^{+}}(A_j, A_i) + \mathrm{Dep}_{M^{+}}(A_i, A_j)\right).$$
CHAPTER 4
Attribute Clustering with Pre-Defined Cluster Numbers
In this chapter, an attribute clustering method based on k-medoids is proposed to partition the attributes into k clusters according to the dependency between each pair of attributes. It also uses a better search strategy to find centers (representative attributes) in a dense region, instead of the random selection used in k-medoids. After the attributes are partitioned into k clusters, each cluster can be represented by its representative attribute. The whole feature space can thus be greatly reduced.
4.1 The Basic Concepts of the Proposed Algorithm
For most clustering approaches, the distance between two objects is usually adopted as a measure of their dissimilarity, which is then used for deciding whether the objects belong to the same cluster or not. In this thesis, the attributes, instead of the objects, are to be clustered. The conventional distance measures such as Euclidean distance or Manhattan distance are thus not suitable, since the attributes may have different data formats, which are hard to compare. For example, assume there are two attributes, one of which is age and the other is gender. It is hard to compare the two attributes via a traditional distance measure. Below, a measure based on the concept of relative data dependency is used for this purpose. It was proposed by Han et al. [17] and can be thought of as a kind of similarity degree.
Given two attributes Ai and Aj, the distance (dissimilarity) measure for the pair of attributes is taken as the reciprocal of the average of their two relative dependency degrees, that is, d(Ai, Aj) = 1/Avg(Dep(Ai, Aj), Dep(Aj, Ai)). We can also use the other two evaluation methods for attribute similarity, and take the reciprocal as the distance, too.
Take the distance between the two attributes Age and Children in Table 4.1 as an example. Since |ΠAge(U)| = 3, |ΠChildren(U)| = 2 and |ΠAge, Children(U)| = 5, the relative dependency degrees Dep(Age, Children) and Dep(Children, Age) are 0.6 and 0.4, respectively. The distance d(Age, Children) is thus 1/Avg(0.6, 0.4), which is 2.
Table 4.1: A simple decision system.
Object Age Income Children Buying Computers
x1 Young Low No No
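The distance computation in the Age and Children example can be reproduced from the three projection sizes alone. The sketch below assumes the reciprocal-of-average form used in that example; the function names are illustrative.

```python
def dep_from_counts(n_i, n_ij):
    """Relative dependency from projection sizes |Pi_Ai(U)| and |Pi_Ai,Aj(U)|."""
    return n_i / n_ij


def distance(dep_ij, dep_ji):
    """Reciprocal of the average of the two directional dependency degrees."""
    return 1.0 / ((dep_ij + dep_ji) / 2.0)


# Projection sizes quoted in the text: |Age| = 3, |Children| = 2, |Age, Children| = 5.
dep_age_children = dep_from_counts(3, 5)   # 0.6
dep_children_age = dep_from_counts(2, 5)   # 0.4
d_age_children = distance(dep_age_children, dep_children_age)
print(round(d_age_children, 2))  # 2.0
```

Because each dependency degree is at most 1, a distance computed this way is never smaller than 1, and it equals 1 only when both attributes fully determine each other.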
In this section, an attribute clustering algorithm called Most Neighbors First (MNF) is proposed to cluster the attributes into a fixed number of groups. Assume the number k of desired clusters is known. Some preprocessing steps such as removal of inconsistent or incomplete tuples and discretization of numerical data are first done. After that, the proposed MNF attribute clustering algorithm is used to partition the feature space into k clusters and output the k representative attributes of the clusters.
The proposed clustering algorithm MNF is based on the k-medoids approach. Unlike the k-means approach, the proposed algorithm always updates the centers by some existing objects. Besides, it uses a better search strategy to find centers in a dense region, instead of random selection in k-medoids.
The proposed algorithm MNF consists of two major phases: (1) reassigning the attributes to the clusters and (2) updating the centers of the clusters. In the first phase, the proposed distance measure is used to find the nearest center of each attribute. The attribute is then assigned to the cluster with that center. In the second phase, each cluster Ci uses a searching radius ri to find the neighbors of the attributes in Ci. The attribute with the most neighbors in a cluster is then chosen as the new center. The proposed algorithm is described in detail below.
The MNF attribute clustering algorithm:
Input: An information system I = (U, A∪{d}) and the number k of desired clusters.
Output: k appropriate attribute clusters with their representative attributes.
Step 1 : Randomly select k attributes {A1c, A2c, … , Akc} as the initial representative attributes (centers) in the k clusters, where Atc stands for the representative attribute (center) of the t-th cluster Ct, Atc ∈ A. Denote Ac = {A1c, A2c, … , Akc} ⊆ A as the initial representative attribute set.
Step 2 : For each non-representative attribute Ai ∈ A - Ac, compute the expanded relative dependency (distance) d(Ai, Atc) between attribute Ai and each representative attribute Atc as:

$$d(A_i, A_t^c) = \frac{1}{\mathrm{Avg}\left(\mathrm{Dep}(A_i, A_t^c), \mathrm{Dep}(A_t^c, A_i)\right)}.$$
Step 3 : Allocate all non-center attributes to their nearest centers according to the distances found in step 2. Collect a center attribute with its allocated attributes as a cluster.
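Step 4 : For each cluster Ct, calculate the distances between any two different attributes within Ct.
Step 5 : For each cluster Ct, calculate the searching radius rt as the average distance of all attribute pairs in the cluster, which is

$$r_t = \frac{\sum_{i<j} d(A_{t,i}, A_{t,j})}{n_t (n_t - 1)/2},$$

where nt is the number of attributes in Ct.
Step 6 : For each attribute At,i in cluster Ct, find the set of attributes (called Near(At,i)) with their distances from At,i within rt. That is:

$$\mathrm{Near}(A_{t,i}) = \{A_{t,j} \mid A_{t,j} \in C_t \text{ and } d(A_{t,i}, A_{t,j}) \le r_t\}.$$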
Step 7 : For each cluster Ct, find the attribute At,l with the most attributes in its Near set. Set At,l as the new center Atc of Ct.
Step 8 : Repeat Steps 2 to 7 until the clusters have converged.
Step 9 : Output the final clusters and their centers as the representative attributes.
After Step 9, k clusters of attributes are formed and k representative attributes for the feature space are found.
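The nine steps above can be sketched in a few lines. This is not the C++ implementation used in the experiments; it is a minimal illustration that assumes a caller-supplied symmetric distance function, excludes an attribute from its own Near set, and breaks ties in favor of the current center. The names mnf, pick_center, POS and toy_dist are hypothetical, and the toy distance merely stands in for the dependency-based distance.

```python
import random
from itertools import combinations


def pick_center(members, dist):
    """Steps 4-7: return the member with the most neighbors within the radius."""
    if len(members) < 2:
        return members[0]
    pairs = list(combinations(members, 2))
    radius = sum(dist(a, b) for a, b in pairs) / len(pairs)  # Step 5

    def near_size(a):  # size of the Near set of a (Step 6)
        return sum(1 for b in members if b != a and dist(a, b) <= radius)

    # members[0] is the current center, so ties keep the current center.
    return max(members, key=near_size)


def mnf(attributes, dist, k, max_iter=100, seed=0):
    """Return a dict mapping each center to the list of attributes in its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(attributes, k)                      # Step 1
    clusters = {}
    for _ in range(max_iter):
        clusters = {c: [c] for c in centers}
        for a in attributes:                                 # Steps 2-3
            if a not in clusters:
                nearest = min(centers, key=lambda c: dist(a, c))
                clusters[nearest].append(a)
        new_centers = [pick_center(m, dist) for m in clusters.values()]
        if new_centers == centers:                           # Step 8
            break
        centers = new_centers
    return clusters                                          # Step 9


# Hypothetical toy distance: attributes placed on a line in two groups.
POS = {"a": 0, "b": 1, "c": 2, "x": 10, "y": 11, "z": 12}


def toy_dist(p, q):
    return abs(POS[p] - POS[q])


result = mnf(list(POS), toy_dist, k=2)
print(result)
```

On this toy input the two groups are recovered, each represented by its middle attribute, which has the most neighbors within the average pairwise distance of its cluster.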
4.3 An Example
In this section, a simple example is given to show how the proposed algorithm can be used to cluster the attributes. Table 4.2 shows the scores of eight students. There are eight condition attributes A = {PR, CA, DM, C++, JAVA, DB, DS, AL}, respectively standing for the eight subjects: Probability, Calculus, Discrete Mathematics, C++, JAVA, Database, Data Structure and Algorithms. The values of the condition attributes are {A, B, C, D}, which stand for the grade levels of a subject. There is one decision attribute {ST}, which stands for {Study for Master Degree} and has two possible classes {Yes, No}. In this example, the number of clusters is set at 2 (i.e. k = 2). For the set of data, the proposed algorithm proceeds as follows.
Table 4.2: An example for attribute clustering.
Step 1: k attributes are randomly selected as the initial centers of the clusters. In this example, k is set at 2. Assume that the two attributes DM and DS are selected as the initial centers of the two clusters C1 and C2, respectively.
Step 2: The distances (dissimilarities) between each non-center attribute and each center are calculated. Take the distance between PR and DM as an example. Since |ΠPR| = 3, |ΠDM| = 3 and |ΠPR, DM| = 5, the relative dependency degree Dep(PR, DM) is calculated as 0.6 and Dep(DM, PR) is 0.6 as well. The distance between the two attributes is thus calculated as:

$$d(\mathit{PR}, \mathit{DM}) = \frac{1}{\mathrm{Avg}(0.6, 0.6)} = 1.67.$$
All the distances between non-center attributes and representative centers are shown in Table 4.3.
Table 4.3: The distances between non-center attributes and representative centers.
Cluster C1                  Cluster C2
Attribute pair   Distance   Attribute pair   Distance
d(PR, DM)        1.67       d(PR, DS)        2.33
d(CA, DM)        1.67       d(CA, DS)        2.27
d(C++, DM)       2          d(C++, DS)       1
d(JAVA, DM)      1.67       d(JAVA, DS)      1.25
d(DB, DM)        2          d(DB, DS)        1.25
d(AL, DM)        1.33       d(AL, DS)        2
Step 3: All non-center attributes are allocated to their nearest centers. Thus, cluster C1 contains {PR, CA, AL, DM} and cluster C2 contains {C++, JAVA, DB, DS}.
Step 4: The distances between any two different attributes in the same clusters are calculated. The results are shown in Table 4.4.
Table 4.4: The distances between any two attributes within the same clusters.
Within cluster C1           Within cluster C2
Attribute pair   Distance   Attribute pair   Distance
d(PR, DM)        1.67       d(C++, DS)       1
d(PR, AL)        1.33       d(C++, DB)       1.25
d(CA, AL)        1.67       d(JAVA, DB)      2
d(PR, CA)        1.67       d(C++, JAVA)     1.67
d(CA, DM)        1.67       d(JAVA, DS)      1.25
d(AL, DM)        1.33       d(DB, DS)        1.25
Step 5: The searching radius of each cluster is calculated. Take the cluster C1 as an example. It includes 4 attributes {PR, CA, AL, DM}. The distances between each pair of attributes in C1 are {1.67, 1.67, 1.33, 1.67, 1.67, 1.33}. The radius r1 is then calculated as:
$$r_1 = \frac{1.33 + 1.67 + 1.67 + 1.33 + 1.67 + 1.67}{6} = 1.56.$$
Step 6: The Near set of each attribute in a cluster is calculated. Take the attribute PR in cluster C1 as an example. Its distances from the other three attributes CA, AL and DM in the same cluster are 1.67, 1.33 and 1.67, respectively. Near(PR) thus includes only the attribute AL, since only AL is within the radius r1 (1.56) found in Step 5. Similarly, the Near sets of the other three attributes in the cluster C1 are found as follows:
Near(CA) =φ,
Near(AL) = {PR, DM}, and
Near(DM) = {AL}.
Step 7: Since the attribute AL has the most attributes in its Near set for the cluster C1, AL then replaces the attribute DM as the new center of C1. Similarly, the original center DS for C2 has the most attributes in its Near set. DS is thus still the center of C2.
Step 8: Steps 2 to 7 are repeated until the two clusters no longer change. The final clusters can thus be found as follows:
C1= {PR, CA, AL, DM}, with the center AL.
C2 = {C++, JAVA, DB, DS}, with the center DS.
Step 9: The final clusters and their centers as the representative attributes are then output.
The attributes in the same cluster can thus be considered to possess similar characteristics in classification and can be used as alternative attributes of the representative one.
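Steps 5 to 7 of the example can be re-computed directly from the distances listed in Table 4.4. The sketch below is only a check of the arithmetic; the dictionary D and the helper names are illustrative, and an attribute is not counted as its own neighbor, as in the Near sets listed above.

```python
from itertools import combinations

# Pairwise distances within each cluster, copied from Table 4.4.
D = {
    frozenset(("PR", "DM")): 1.67, frozenset(("PR", "AL")): 1.33,
    frozenset(("CA", "AL")): 1.67, frozenset(("PR", "CA")): 1.67,
    frozenset(("CA", "DM")): 1.67, frozenset(("AL", "DM")): 1.33,
    frozenset(("C++", "DS")): 1.0, frozenset(("C++", "DB")): 1.25,
    frozenset(("JAVA", "DB")): 2.0, frozenset(("C++", "JAVA")): 1.67,
    frozenset(("JAVA", "DS")): 1.25, frozenset(("DB", "DS")): 1.25,
}

# Current center listed first in each cluster.
C1 = ["DM", "PR", "CA", "AL"]
C2 = ["DS", "C++", "JAVA", "DB"]


def d(a, b):
    return D[frozenset((a, b))]


def radius(members):
    """Step 5: average distance over all attribute pairs in the cluster."""
    pairs = list(combinations(members, 2))
    return sum(d(a, b) for a, b in pairs) / len(pairs)


def near(a, members):
    """Step 6: the other attributes of the cluster within the radius of a."""
    r = radius(members)
    return {b for b in members if b != a and d(a, b) <= r}


def new_center(members):
    """Step 7: the attribute with the largest Near set (ties keep the first)."""
    return max(members, key=lambda a: len(near(a, members)))


print(round(radius(C1), 2))             # 1.56
print(near("PR", C1), near("CA", C1))   # {'AL'} set()
print(new_center(C1), new_center(C2))   # AL DS
```

The script reproduces the radius r1 = 1.56, the four Near sets of cluster C1, and the new centers AL and DS reported in Steps 5 to 7.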
$$\frac{1}{k}\sum_{i=1}^{k}\left(\frac{1}{|C_i|}\sum_{A_j \in C_i \setminus A_i} d(A_j, A_i)\right).$$
4.4 Experimental Results
In this section, the implementation of the proposed algorithms for clustering attributes with multiple cluster numbers is described. Note that the similarity measure based on majority sets is used to compute the similarity between two attributes. The real-world dataset, Wisconsin Breast Cancer Databases (wdbc), was used to verify our approach. The characteristics of the dataset are shown in Table 4.5. The experiments were implemented in C++ on a personal computer with an AMD Athlon 64 X2 Dual Core 3800+ processor (2.01 GHz) and 1 GB RAM.
Table 4.5: The characteristics of the dataset, wdbc.
Num. of instances 569
Num. of Features 30