Thesis Organization - Feature Selection Using Attribute Clustering and Its Applications

CHAPTER 1 Introduction

1.3 Thesis Organization

The remaining parts of this thesis are organized as follows. Related research, including feature selection, reducts, relative dependency, clustering, case-based reasoning and the k-nearest-neighbor classifier, is reviewed in Chapter 2. The proposed evaluation methods for attribute similarity are described in Chapter 3. Two attribute clustering algorithms are proposed in Chapters 4 and 5. An algorithm incorporating case-based reasoning and attribute clustering is given in Chapter 6. An algorithm incorporating the k-nearest-neighbor classifier and attribute clustering is given in Chapter 7. Conclusions and future work are given in Chapter 8.

CHAPTER 2 Literature Survey

In this chapter, some important concepts related to this thesis are briefly reviewed. They include feature selection, reducts, relative dependency, k-means and k-medoids, some clustering approaches with unknown cluster numbers, case-based reasoning and the k-nearest-neighbor classifier.

2.1 Feature Selection

In machine learning, a large number of features (attributes) will significantly slow down the learning process. The existence of redundant and irrelevant features may also cause a classifier to over-fit the training data. As mentioned above, feature selection is the process of removing irrelevant and redundant features, and it is popular in pattern recognition, machine learning and data mining. The approaches for feature selection can be categorized into two models, the wrapper model and the filter model. The wrapper model evaluates a set of selected features by the predictive accuracy of the learning algorithm adopted. Approaches based on this model thus depend on the learning algorithm and are computationally expensive when the number of features is large. The filter model, on the contrary, separates feature selection from classifier learning, so that it is independent of the learning algorithm.

Generally speaking, the features selected by the wrapper model usually achieve higher classification accuracy than those selected by the filter model. A comparison of the two models is shown in Table 2.1.

Table 2.1: A comparison of the wrapper and the filter models.

According to feature evaluation, approaches based on feature selection can also be divided into two kinds, individual (feature) evaluation and subset evaluation [41]. The individual evaluation strategy weights (ranks) individual features according to their degrees of relevance to the decision attribute. That is, attributes are selected based on their individual importance in distinguishing the instances into different classes [5][13]. A subset of features is then found according to the ranking list. The subset evaluation strategy, on the other hand, considers a subset of attributes at the same time. It is usually composed of the following two major steps: generating candidate feature subsets and evaluating the goodness of the subsets.

The conventional framework of the subset evaluation strategy is shown in Figure 2.1 [41].

Figure 2.1: The conventional framework of the subset evaluation strategy.

Some approaches for evaluating the goodness of an individual feature or a subset of features have been proposed. For example, information gain was a popular ranking method used for individual attributes. It selected attributes with high differentiability to construct a decision tree. The relief approach was another individual-attribute evaluation method proposed by Kira and Rendell [21]. Its principle was that if for an attribute, the instances from different classes had different values and the ones from the same class had the same values, then the attribute was a good one for classification and should be given a high rank.

The relief approach worked only on data sets with two classes (i.e. there were two possible values for the decision attribute). It proceeded in the following three main steps: (1) picking a training instance x randomly, (2) finding the nearest neighbor xP of x from the other class and updating the weight (rank) of each attribute by comparing the attribute values of x and xP, and (3) finding the nearest neighbor xS of x from the same class and updating the weight of each attribute by comparing the attribute values of x and xS. Kononenko then extended the approach to multi-class data sets [24]. Hall proposed an approach called correlation-based feature selection, which was the first one to evaluate subsets of attributes rather than individual attributes [15]. In that approach, a heuristic evaluation equation that simultaneously considered feature-class and feature-feature correlations was proposed to take both relevance and redundancy into account. Besides, several approaches evaluated subsets of attributes by using “class consistency” [1][26]. They searched for subsets of attributes that could divide the data into groups in which a single class strongly dominated. Next, the concept of reducts, which offers another viewpoint on feature selection, is introduced.
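The three relief steps described earlier in this section can be sketched in Python as follows. This is only a minimal illustration for nominal attributes and two classes: the function name `relief_weights`, the 0/1 difference function, and the update rule (add the difference to the nearest other-class neighbor, subtract the difference to the nearest same-class neighbor, each scaled by the number of samples) are assumptions in the commonly cited form of the method, not details taken from this thesis.

```python
import random

def relief_weights(instances, labels, n_samples, seed=0):
    """Relief-style attribute weights for a two-class, nominal data set (sketch)."""
    rng = random.Random(seed)
    n_attrs = len(instances[0])
    weights = [0.0] * n_attrs

    def diff(a, x, y):
        # 0 if the two instances agree on attribute a, 1 otherwise
        return 0 if x[a] == y[a] else 1

    def distance(x, y):
        return sum(diff(a, x, y) for a in range(n_attrs))

    for _ in range(n_samples):
        # Step (1): pick a training instance x at random.
        i = rng.randrange(len(instances))
        x = instances[i]
        others = [j for j in range(len(instances)) if j != i]
        misses = [j for j in others if labels[j] != labels[i]]
        hits = [j for j in others if labels[j] == labels[i]]
        # Step (2): nearest neighbor xP from the other class.
        x_p = instances[min(misses, key=lambda j: distance(x, instances[j]))]
        # Step (3): nearest neighbor xS from the same class.
        x_s = instances[min(hits, key=lambda j: distance(x, instances[j]))]
        for a in range(n_attrs):
            weights[a] += diff(a, x, x_p) / n_samples  # differing across classes is good
            weights[a] -= diff(a, x, x_s) / n_samples  # differing within a class is bad
    return weights

# Toy data (hypothetical): attribute 0 determines the class, attribute 1 is noise.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
classes = ["no", "no", "yes", "yes"]
w = relief_weights(data, classes, n_samples=20)
```

On this toy data the class-determining attribute ends with a higher weight than the noisy one, which matches the ranking principle stated for relief.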

2.2 Reducts

Let I = (U, A) be an information system, where U = {x1, x2, …, xN} is a finite non-empty set of objects and A is a finite non-empty set of attributes called condition attributes [23]. A decision system is an information system of the form I = (U, A ∪ {d}), where d is a special attribute called the decision attribute and d ∉ A [23]. For any object xi ∈ U, its value for a condition attribute a ∈ A is denoted by fa(xi). The indiscernibility relation for a subset of attributes B is defined as:

\[ IND(B) = \{ (x_i, x_j) \in U \times U \mid \forall a \in B,\ f_a(x_i) = f_a(x_j) \}, \]

where B is any subset of the condition attribute set A (i.e. B ⊆ A) [38]. If the indiscernibility relations from both A and B are the same (i.e. IND(B) = IND(A)), then B is called a reduct of A. That means the attributes used in the information system can be reduced to B, with the original indiscernibility information still kept. Furthermore, if an attribute subset B satisfies the following condition, then B is called a minimal reduct of A:

\[ IND(B) = IND(A) \quad \text{and} \quad \forall B' \subset B,\ IND(B') \neq IND(A). \]

Take the simple information system in Table 2.2 as an example. In Table 2.2, the attribute set A consists of three attributes {Age, Income, Children} and the object set U consists of five objects {x1, x2, x3, x4, x5}. Since IND({Age, Children}) = IND(A) = {(x1, x1), (x2, x2), (x3, x3), (x4, x4), (x5, x5)}, the attribute subset {Age, Children} is a reduct of the information system. Besides, since neither IND({Age}) nor IND({Children}) equals IND(A), the attribute subset {Age, Children} is a minimal reduct.

Table 2.2: A simple information system.

Object Age Income Children

x1 Young Low No

x2 Middle Middle Yes

x3 Senior High Yes

x4 Young Low Yes

x5 Senior Middle No
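The reduct test on Table 2.2 can be checked mechanically. The sketch below is an illustration only: the helper names `partition`, `is_reduct` and `is_minimal_reduct` are hypothetical, and the indiscernibility relation is represented by its equivalence classes, which is equivalent to comparing the sets of indiscernible pairs.

```python
from itertools import combinations

# Rows of Table 2.2.
TABLE_2_2 = {
    "x1": {"Age": "Young", "Income": "Low", "Children": "No"},
    "x2": {"Age": "Middle", "Income": "Middle", "Children": "Yes"},
    "x3": {"Age": "Senior", "Income": "High", "Children": "Yes"},
    "x4": {"Age": "Young", "Income": "Low", "Children": "Yes"},
    "x5": {"Age": "Senior", "Income": "Middle", "Children": "No"},
}
ALL_ATTRS = ["Age", "Income", "Children"]

def partition(table, attrs):
    """Equivalence classes of IND(attrs): objects with equal values on attrs."""
    groups = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, set()).add(obj)
    return {frozenset(g) for g in groups.values()}

def is_reduct(table, attrs, all_attrs):
    # B is a reduct when IND(B) = IND(A).
    return partition(table, attrs) == partition(table, all_attrs)

def is_minimal_reduct(table, attrs, all_attrs):
    # A reduct is minimal when no proper non-empty subset is itself a reduct.
    if not is_reduct(table, attrs, all_attrs):
        return False
    return not any(
        is_reduct(table, list(sub), all_attrs)
        for r in range(1, len(attrs))
        for sub in combinations(attrs, r)
    )

age_children_is_minimal = is_minimal_reduct(TABLE_2_2, ["Age", "Children"], ALL_ATTRS)
```

The exhaustive subset check is exponential in the number of attributes, so it is only suitable for small examples such as this one.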

When a decision system, instead of an information system, is considered, the definition of a reduct B (B ⊆ A) can be modified as follows [39]:

\[ \forall x_i, x_j \in U,\quad f_B(x_i) = f_B(x_j) \Rightarrow f_d(x_i) = f_d(x_j), \]

where fB(xi) denotes the attribute values of xi for the attribute set B and fd(xi) denotes the value of xi for the decision attribute d. Similarly, if no proper subset of B can satisfy the above condition, B is called a minimal reduct in the decision system. Take the simple decision system shown in Table 2.3 as an example. It is modified from Table 2.2. A decision attribute, Buying Computers, is added to the original information system (Table 2.2) to form a decision system. In this example, the attribute subset {Age, Income} is not a reduct since the two objects x1 and x4 have the same values for the two attributes but belong to different classes. On the contrary, the attribute subset {Age, Children} is a reduct for the decision system. Furthermore, it is a minimal reduct since neither {Age} nor {Children} is a reduct.

Table 2.3: A simple decision system.

Object Age Income Children Buying Computers

x1 Young Low No No

x2 Middle Middle Yes No

x3 Senior High Yes Yes

x4 Young Low Yes Yes

x5 Senior Middle No No
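The reasoning used in the Table 2.3 example (two objects that agree on the chosen attributes must not belong to different classes) can be sketched as a simple consistency check. The function name `is_consistent` and the tuple layout are assumptions made for this illustration.

```python
# Rows of Table 2.3: (Age, Income, Children, Buying Computers).
TABLE_2_3 = [
    ("Young", "Low", "No", "No"),       # x1
    ("Middle", "Middle", "Yes", "No"),  # x2
    ("Senior", "High", "Yes", "Yes"),   # x3
    ("Young", "Low", "Yes", "Yes"),     # x4
    ("Senior", "Middle", "No", "No"),   # x5
]
COLUMNS = {"Age": 0, "Income": 1, "Children": 2}

def is_consistent(rows, attrs):
    """True if objects with equal values on attrs always share the same decision."""
    seen = {}
    for row in rows:
        key = tuple(row[COLUMNS[a]] for a in attrs)
        decision = row[-1]
        # setdefault stores the first decision seen for this key and returns it
        if seen.setdefault(key, decision) != decision:
            return False
    return True

age_income_ok = is_consistent(TABLE_2_3, ["Age", "Income"])      # x1 and x4 clash
age_children_ok = is_consistent(TABLE_2_3, ["Age", "Children"])
```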

Finding minimal reducts has been proven to be an NP-hard problem. Li et al. proposed the concept of “approximate” reducts [25] to speed up the search process. An approximate reduct allows a reasonable degree of tolerance, but can greatly reduce the computational complexity. Next, the concept of relative dependency is introduced.

2.3 Relative Dependency

Han et al. [17] and Li et al. [25] developed an approach based on relative dependency to find approximate reducts. Relative dependency is motivated by the “projection” operation, which is very important in relational algebra and can be easily executed by SQL or other query languages. Given an attribute subset B ⊆ A and a decision attribute d, the projection of the object set U on B is denoted by ΠB(U) and can be computed by the following two steps: removing the attributes in the difference set (A − B) and merging all the remaining rows which are indiscernible [25]. Thus, among the tuples with the same attribute values for B, only one is kept and the others are removed. For example, the projection of the data in Table 2.2 on the attribute {Age} is shown below:

Π{Age} (U ) = {x1, x2, x3}.

In this example, x4 and x5 are removed since they have the same values of the attribute Age as x1 and x3, respectively. Similarly, the projections on the attribute Children and on the attribute subset {Age, Children} are shown below:

Π{Children} (U ) = {x1, x2}, and

Π{Age, Children} (U ) = {x1, x2, x3, x4, x5}.

Han et al. thus defined the relative dependency degree \(\delta_B^D\) of the attribute subset B with regard to the set of decision attributes D as follows:

\[ \delta_B^D = \frac{|\Pi_B(U)|}{|\Pi_{B \cup D}(U)|}, \]

where |ΠB(U)| and |ΠB∪D(U)| are the numbers of tuples after the projection operations are performed on U according to B and B ∪ D, respectively. Take the decision system shown in Table 2.3 as an example. |Π{Age}(U)| = |{x1, x2, x3}| = 3 and |Π{Age, Buying Computers}(U)| = |{x1, x2, x3, x4, x5}| = 5. The relative dependency degree of {Age} with regard to {Buying Computers} is thus 3/5, which is 0.6.
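The projection sizes and the resulting ratio can be reproduced with a few lines of Python. This is a sketch under the assumption that the degree is the number of distinct tuples on B divided by the number of distinct tuples on B together with D, as the 3/5 example indicates; `projection_size` and `relative_dependency` are hypothetical helper names.

```python
# Rows of the decision system with the Buying Computers attribute (objects x1..x5).
ROWS = [
    {"Age": "Young", "Income": "Low", "Children": "No", "Buying Computers": "No"},
    {"Age": "Middle", "Income": "Middle", "Children": "Yes", "Buying Computers": "No"},
    {"Age": "Senior", "Income": "High", "Children": "Yes", "Buying Computers": "Yes"},
    {"Age": "Young", "Income": "Low", "Children": "Yes", "Buying Computers": "Yes"},
    {"Age": "Senior", "Income": "Middle", "Children": "No", "Buying Computers": "No"},
]

def projection_size(rows, attrs):
    """Number of distinct tuples left after projecting the rows onto attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def relative_dependency(rows, b_attrs, d_attrs):
    """Ratio of the projection size on B to the projection size on B and D."""
    return projection_size(rows, b_attrs) / projection_size(rows, b_attrs + d_attrs)

degree = relative_dependency(ROWS, ["Age"], ["Buying Computers"])  # 3 / 5
```

Building a set of tuples plays the role of the "merge indiscernible rows" step, much like `SELECT DISTINCT` in SQL.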

2.4 The k-means and the k-medoids Clustering Approaches

The k-means and the k-medoids approaches are two well-known partitioning (or clustering) strategies. They are widely used to cluster data when the number of clusters is given in advance. The k-means clustering approach [27][18] consists of two major steps: (1) reassigning objects to clusters and (2) updating the centers of clusters. The first step calculates the distances between each object and the k centers and reassigns the object to the group with the nearest center. The second step then calculates the new means of the k groups just updated and uses them as the new centers. These two steps are iteratively executed until the clusters no longer change.
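The two k-means steps can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance, caller-supplied initial centers, and a hypothetical `max_iter` safety bound; none of these choices is prescribed by the text above.

```python
def k_means(points, initial_centers, max_iter=100):
    """Plain k-means on tuples of numbers (sketch)."""
    centers = list(initial_centers)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    assignment = None
    for _ in range(max_iter):
        # Step (1): reassign every object to the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the clusters no longer change
        assignment = new_assignment
        # Step (2): use the mean of each updated cluster as its new center.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

# Hypothetical toy data with two well-separated groups.
pts = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, final_centers = k_means(pts, [(1.0, 1.0), (8.0, 8.0)])
```

Note that the returned centers are means and therefore need not coincide with any object in the data.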

The k-medoids approach [20] adopts a quite different way of finding the centers of clusters. Assume k centers have been found. The k-medoids approach selects another object at random and replaces one of the original centers with the new object if better clustering results can be obtained. The absolute-error criterion [18] shown below is used to decide whether the replacement is better or not:

\[ E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|, \]

where E is the sum of the absolute errors for all the objects in the data set, p is an object in cluster Cj, oj is the current center of Cj, and the absolute value |p − oj| is the distance between the two objects p and oj. For each randomly selected object oj’, one of the original k centers, say oj, is tentatively replaced with it and the new sum E’ of absolute errors is calculated. E’ is then compared with the previous E. If E’ is less than E, then oj’ is more suitable as a center than oj, and oj’ replaces oj as a new cluster center; otherwise, the replacement is aborted. The same procedure is repeated until the cluster centers no longer change.
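The random replacement procedure can be sketched for one-dimensional data as below. Several points are assumptions made for illustration: each object is charged its distance to the nearest center (which is how objects are assigned to clusters), the center to be replaced is chosen at random, and a fixed number of trials (`n_trials`, a hypothetical parameter) stands in for the "until the centers no longer change" stopping rule.

```python
import random

def absolute_error(points, medoids):
    """Sum over all objects of the distance to the nearest center."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def k_medoids(points, initial_medoids, n_trials=2000, seed=0):
    rng = random.Random(seed)
    medoids = list(initial_medoids)
    error = absolute_error(points, medoids)
    for _ in range(n_trials):
        candidates = [p for p in points if p not in medoids]
        if not candidates:
            break
        new = rng.choice(candidates)        # randomly selected object
        j = rng.randrange(len(medoids))     # center tentatively replaced
        trial = medoids[:j] + [new] + medoids[j + 1:]
        trial_error = absolute_error(points, trial)
        if trial_error < error:             # keep the swap only if it is better
            medoids, error = trial, trial_error
        # otherwise the replacement is aborted
    return medoids, error

# Hypothetical toy data: two groups on a line, with a poor initial choice of centers.
data = [1, 2, 3, 10, 11, 12]
centers, final_error = k_medoids(data, [1, 2])
```

Unlike k-means, every returned center is one of the original objects, which is the property the attribute clustering in this thesis relies on.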

The complexity of the k-medoids approach is in general higher than that of the k-means approach, but the former guarantees that all the cluster centers obtained are objects themselves. This feature is important to the attribute clustering proposed here, since not only must the attributes be clustered but the representative attribute of each cluster must also be found.

On the contrary, the k-means approach may use non-object points as cluster centers. Note that both the k-means and the k-medoids approaches are mainly designed to cluster objects, not attributes. As mentioned above, the goal of this thesis is to cluster attributes. An attribute clustering method based on k-medoids is thus proposed to achieve this purpose. It also uses a better search strategy that finds centers in dense regions, instead of the random selection used in k-medoids. Besides, a method to measure the distances (dissimilarities) among attributes is also needed.

2.5 Clustering with Unknown Cluster Numbers

k-means and k-medoids both require the number of clusters to be given in advance. In real situations, however, it is sometimes hard to know the desired number of clusters beforehand. In the past decades, many clustering approaches for this problem have been proposed. These approaches can be divided into two main classes. The first one designs a procedure to determine the cluster number first [6][35]; another procedure is then designed to form the clusters according to the cluster number obtained. This kind of approach may, however, overestimate or underestimate the cluster number, so that the final clustering results do not completely meet users’ requirements. On the contrary, the second one develops new algorithms that perform clustering without deciding the desired cluster number in advance.

For example, some approaches based on evolutionary computation have been proposed. Sarkar et al. proposed a clustering approach for unknown cluster numbers based on the evolutionary programming technique [34]; Cole proposed the Genetic Clustering Algorithm (GCA) to partition objects into an adequate number of clusters [10].

Besides, some algorithms based on Artificial Neural Networks (ANN) have been proposed as well. For example, Xu et al. proposed a modified competitive learning algorithm called Rival Penalized Competitive Learning (RPCL) [40]. In that algorithm, not only were the weights of the winner node modified, but the weights of the second winner (the rival) were also taken into account. RPCL can automatically select an appropriate cluster number, but its performance is quite sensitive to the “de-learning” rate of the rival. To avoid this problem, Cheung proposed the k*-means clustering algorithm, in which the idea of a rival is replaced with an adjustment mechanism [9].

A particular viewpoint on clustering originates from the corrupted clique problem in graph theory. A clique is a graph in which each vertex is connected to all the other vertices; a clique graph is a graph in which each connected component is a clique. The corrupted clique problem is stated as follows: given a graph G, find the smallest number of edge additions and removals that will transform G into a clique graph. For clustering, the vertices in a graph represent the objects to be clustered, and two vertices are connected by an edge if their similarity is greater than or equal to a given threshold.

Ben-Dor et al. proposed an algorithm named the Cluster Affinity Search Technique (CAST) to perform clustering for gene expression patterns [3]. It first computed the similarity between each pair of objects and then used the similarity to partition all objects into an appropriate number of clusters under a given affinity threshold γ. It used two major operations: ADD and REMOVE. The ADD operation added highly similar objects to a cluster (i.e. similarity being greater than or equal to γ); the REMOVE operation removed less similar objects from a cluster (i.e. similarity being less than γ). With these two operations, the objects were processed one by one until all of them had been assigned to clusters. The desired clusters could thus be constructed. The average similarity of the pairs of objects in the same cluster can be guaranteed to be above γ. This is the major difference between CAST and other clustering algorithms.
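The interplay of the ADD and REMOVE operations can be illustrated with a simplified CAST-style sketch. It is not the published algorithm: it assumes that an object's affinity is its average similarity to the current cluster members, that each new cluster is seeded with the first unassigned object, and that a hypothetical `max_rounds` bound guards against oscillation.

```python
def cast(sim, gamma, max_rounds=100):
    """Simplified CAST-style clustering on a symmetric similarity matrix (sketch)."""
    def affinity(u, cluster):
        # average similarity of object u to the current cluster members
        return sum(sim[u][v] for v in cluster) / len(cluster)

    unassigned = list(range(len(sim)))
    clusters = []
    while unassigned:
        cluster = [unassigned.pop(0)]  # assumption: seed with the first unassigned object
        for _ in range(max_rounds):
            changed = False
            # ADD: move in the most similar outside object while its affinity >= gamma.
            while unassigned:
                best = max(unassigned, key=lambda u: affinity(u, cluster))
                if affinity(best, cluster) < gamma:
                    break
                unassigned.remove(best)
                cluster.append(best)
                changed = True
            # REMOVE: move out the least similar member while its affinity < gamma.
            while len(cluster) > 1:
                worst = min(cluster, key=lambda u: affinity(u, cluster))
                if affinity(worst, cluster) >= gamma:
                    break
                cluster.remove(worst)
                unassigned.append(worst)
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical similarity matrix: objects 0 and 1 are alike, as are objects 2 and 3.
S = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
result = cast(S, gamma=0.5)
```

The number of clusters is not supplied by the caller; it emerges from the threshold γ.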

2.6 Case-Based Reasoning

Case-Based Reasoning (CBR) is the process of solving new problems based on the solutions of similar past problems. The major tasks of CBR can be divided into five phases. When a new problem arrives, the situation of this problem is identified in the Case Representation phase. After that, the important features of the new case are extracted as its indexes in the Indexing phase. These indexes are then passed to the Matching phase for retrieving similar cases in the case base. The Adaptation phase then adapts the solutions of similar cases by adaptation rules to fit the new problem. After the final solution of the new case is confirmed by users, it is stored in the case base via the Storage phase.

The success of a CBR system mainly depends on effective and efficient retrieval of similar cases for a new problem. Indexing and matching are thus very important to CBR [36]. Indexing usually uses some features of cases for identification, and matching uses a pre-defined matching function for case retrieval. Each feature is given a weight to represent its importance. Based on weighted sums of features matched, similar cases in a case base can then be retrieved [8][14].
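Retrieval by a weighted sum of matched features can be sketched as follows. The case base, feature names and weights are all hypothetical, and exact equality is assumed as the per-feature matching test; real CBR systems may use graded similarity functions instead.

```python
def match_score(new_case, stored_features, weights):
    """Weighted sum over the features on which the two cases agree."""
    return sum(
        weight
        for feature, weight in weights.items()
        if feature in new_case
        and feature in stored_features
        and new_case[feature] == stored_features[feature]
    )

def retrieve(new_case, case_base, weights, top_n=1):
    """Return the top_n stored cases ranked by descending match score."""
    ranked = sorted(
        case_base,
        key=lambda case: match_score(new_case, case["features"], weights),
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical help-desk case base.
WEIGHTS = {"symptom": 0.5, "device": 0.3, "os": 0.2}
CASE_BASE = [
    {"features": {"symptom": "no boot", "device": "laptop", "os": "linux"},
     "solution": "replace disk"},
    {"features": {"symptom": "no sound", "device": "laptop", "os": "windows"},
     "solution": "reinstall driver"},
]
query = {"symptom": "no boot", "device": "desktop", "os": "linux"}
best_case = retrieve(query, CASE_BASE, WEIGHTS)[0]
```

Every stored case is scored against the query, so the cost of retrieval grows with both the size of the case base and the number of features compared.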

Several useful approaches have been proposed to retrieve similar cases. Two methods for assigning the weights of features were proposed by Cercone et al. [8] and Shin et al. [36]. Gupta pointed out that the weights of features may differ between a new case and prior cases [14]. The performance of retrieving similar cases has, however, seldom been discussed. Retrieving similar cases requires much computation time when the matching function becomes complex or when the number of cases in the case base grows large. Retrieving similar cases efficiently thus becomes an important issue in large-scale CBR.

2.7 k-Nearest-Neighbor Classifier

The k-nearest-neighbor classifier (k-NN) is a method for classifying objects based on the k closest training examples in the feature space. To classify an unknown object, a k-nearest-neighbor classifier searches the feature space for the k objects that are closest to the unknown object, i.e. its k “nearest neighbors”. The unknown object is then classified into the majority class of its k nearest neighbors. This approach is computationally intensive, especially given a large training set or a high-dimensional feature space. That is why it was proposed in the early 1950s [18] but did not become popular until the 1960s. As computing power has improved, it has been widely used in the area of pattern recognition.
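The majority vote among the k closest training examples can be sketched in a few lines. The sketch assumes squared Euclidean distance and a brute-force scan over the whole training set; the function name `knn_classify` and the toy data are hypothetical.

```python
from collections import Counter

def knn_classify(query, training, k):
    """Classify query by majority vote among its k nearest training examples.

    training is a list of (feature_vector, label) pairs.
    """
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Brute force: one distance evaluation per training example.
    nearest = sorted(training, key=lambda item: dist2(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

TRAINING = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"),
]
label_near_a = knn_classify((2, 2), TRAINING, k=3)
label_near_b = knn_classify((8, 9), TRAINING, k=3)
```

Each query costs one distance computation per training example and per feature, which is why both the training-set size and the dimensionality drive the running time.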

Since the complexity of k-NN is sensitive to the size of the training data and the dimension of the feature space, many approaches to speed up classification have been proposed over the years. Examples include reducing the number of distance evaluations actually performed, partitioning the feature space and restricting the distance computation to specific areas, and using parallel computation.

CHAPTER 3 Calculation of Attribute Similarity

The goal of this thesis is to cluster attributes such that the efficiency of classification can be improved. To achieve this goal, it is important to develop an evaluation method that can measure the similarity of attributes. In this thesis, we use the dependency degree to represent the similarity between two attributes. If two attributes have a high dependency degree on each other, they can be regarded as highly similar. In this chapter, three evaluation methods for attribute similarity are proposed.

3.1 Attribute Similarity Based on Relative Dependency

As mentioned above, Han et al. developed an approach based on relative dependency for finding approximate reducts [17]. We extend this metric to measure the similarity between any two attributes [25]. Given two attributes Ai and Aj, the relative dependency degree of Ai with regard to Aj is denoted by Dep(Ai, Aj) and is defined as:

\[ Dep(A_i, A_j) = \frac{|\Pi_{A_i}(U)|}{|\Pi_{\{A_i, A_j\}}(U)|}. \]

The original relative dependency degree only considers the relative dependency between a condition attribute set and a decision attribute set. Here we extend the above formula to estimate the relative data dependency between any pair of attributes. The dependency degree is not symmetric, so that the condition Dep(Ai, Aj) = Dep(Aj, Ai) is not always valid. The average of Dep(Ai, Aj) and Dep(Aj, Ai) is thus used to represent the similarity of the two attributes Ai and Aj, that is:

\[ Sim(A_i, A_j) = \frac{Dep(A_i, A_j) + Dep(A_j, A_i)}{2}. \]
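The pairwise measure can be tried out with a short Python sketch. It assumes that Dep(Ai, Aj) is the number of distinct values of Ai divided by the number of distinct value pairs of Ai and Aj, which is consistent with the values this chapter reports for Table 3.1 in Section 3.2; the helper names `projection_size`, `dep` and `sim` are hypothetical.

```python
# Condition attributes of the eight objects in Table 3.1 (Section 3.2).
ROWS_3_1 = (
    [{"Age": "Young", "Income": "Low", "Children": "No"}] * 4       # x1..x4
    + [{"Age": "Young", "Income": "Middle", "Children": "No"}] * 3  # x5..x7
    + [{"Age": "Young", "Income": "Middle", "Children": "Yes"}]     # x8
)

def projection_size(rows, attrs):
    """Number of distinct tuples left after projecting the rows onto attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def dep(rows, a_i, a_j):
    """Relative dependency degree of attribute a_i with regard to a_j."""
    return projection_size(rows, [a_i]) / projection_size(rows, [a_i, a_j])

def sim(rows, a_i, a_j):
    """Symmetric similarity: the average of the two dependency degrees."""
    return (dep(rows, a_i, a_j) + dep(rows, a_j, a_i)) / 2

dep_age_children = dep(ROWS_3_1, "Age", "Children")   # 1 / 2
dep_children_age = dep(ROWS_3_1, "Children", "Age")   # 2 / 2
sim_age_children = sim(ROWS_3_1, "Age", "Children")
sim_age_income = sim(ROWS_3_1, "Age", "Income")
```

The two dependency degrees differ for the same pair of attributes, which is why the average is taken to obtain a symmetric similarity.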

3.2 Attribute Similarity Based on Majority Sets

In the last section, we proposed an evaluation method for attribute similarity based on relative dependency. This measure cannot, however, reflect the actual similarity (dependency) of attributes in some situations. The small decision system shown in Table 3.1 is used as an example to illustrate the problem.

Table 3.1: A simple decision system to evaluate the similarity.

Object Age Income Children Buying Computers

x1 Young Low No No

x2 Young Low No No

x3 Young Low No Yes

x4 Young Low No Yes

x5 Young Middle No No

x6 Young Middle No Yes

x7 Young Middle No Yes

x8 Young Middle Yes No

It can be observed from Table 3.1 that |Π{Age}(U)| = 1, |Π{Children}(U)| = 2 and |Π{Income}(U)| = 2. The relative dependency degrees are thus found as follows: Dep(Age, Children) = 0.5, Dep(Age, Income) = 0.5, Dep(Children, Age) = 1, and Dep(Income, Age) = 1. Both Sim(Age, Children) and Sim(Age, Income) can then be calculated as 0.75, such that the degree to which the attribute Age resembles the attribute Children is equal to the degree to which Age resembles Income. However, it is easy to observe from Table 3.1 that Age is more similar to
