COMPOSITION PENALTY - Feature Clustering and Selection with Composite Attributes

The more attributes a composite attribute contains, the higher the classification accuracy it tends to achieve. The purpose of feature selection, however, is to classify with fewer attributes. Some penalty should therefore be given to composite attributes, and a composite attribute with more attributes should receive a larger penalty.

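In this thesis, we use the following penalty function for a composite attribute A with x attributes:

$$\mathit{penalty}(A) = \left(\frac{x-1}{N-1}\right)^{1+\gamma},$$

where N is the total number of attributes and γ is a parameter that shapes the penalty curve. For example, the plot of the penalty function with γ = 0.5 is shown in Figure 3.2.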

The penalty function has two properties. Firstly, the penalty lies between 0 and 1.

Secondly, the sum of the penalty values of two composite attributes with x1 and x2 individual attributes, respectively (x1 + x2 ≤ N), is smaller than the penalty of a single composite attribute with x1 + x2 attributes. This is because a composite attribute with more attributes is less desirable. A single attribute can be thought of as a composite attribute with only one attribute. In this case, the penalty is zero, which means no penalty. Conversely, if a composite attribute includes all N attributes, the penalty is 1.
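The two properties can be checked numerically with a minimal Python sketch. It assumes the penalty has the form ((x − 1)/(N − 1))^(1+γ), which is the form used in the worked fitness example of Section 4.3; the function name `composition_penalty` is hypothetical.

```python
def composition_penalty(x: int, n: int, gamma: float) -> float:
    """Penalty of a composite attribute made of x out of n attributes.

    Assumed form: ((x - 1) / (n - 1)) ** (1 + gamma).
    """
    if n < 2 or not 1 <= x <= n:
        raise ValueError("need n >= 2 and 1 <= x <= n")
    return ((x - 1) / (n - 1)) ** (1 + gamma)


if __name__ == "__main__":
    n, gamma = 8, 0.5
    # Property 1: the penalty runs from 0 (single attribute) to 1 (all attributes).
    print(composition_penalty(1, n, gamma), composition_penalty(n, n, gamma))
    # Property 2: two smaller composites cost less than one merged composite.
    print(composition_penalty(2, n, gamma) + composition_penalty(3, n, gamma)
          < composition_penalty(5, n, gamma))
```

Because the exponent 1 + γ is larger than 1 for γ > 0, the curve is convex, which is what makes merging two composites more expensive than keeping them apart.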

Figure 3.2: The plot of the penalty function with N = 8 and γ = 0.5

CHAPTER 4 GA-Based Clustering for Composite Attributes

In this chapter, a GA-based feature clustering algorithm for composite attributes is proposed. In this algorithm, we encode a possible composition and clustering result into a chromosome and use a GA to derive the best one. The fitness of each chromosome is evaluated by the average accuracy of the possible attribute substitutions in the clusters, the cluster balance, and the total composition penalty of each possible feature subset.

4.1 Chromosome Representation

Each gene in a chromosome represents the status of an attribute and is divided into two parts, the composition part and the cluster part. The composition part indicates with which other attributes an attribute is combined into a composite attribute, and the cluster part denotes the cluster in which the attribute is located. For convenience of discussion, positive integers are used to encode the composition part and lowercase English letters to encode the cluster part. Assume there are n attributes, A1 to An, to be processed. In a chromosome, two attributes Ai and Aj with the same integer in the composition part compose a composite attribute, and those with the same lowercase letter are located in the same cluster.

Since the attributes composing a composite attribute have to be in the same cluster, the constraint that the letters should be the same for the genes with the same integers needs to be obeyed. An example is given below to illustrate it.

Example 4.1: Assume there are seven attributes {A1, A2, …, A7} to be divided into three clusters. Thus, K = 3 and N = 7. Suppose there is a chromosome as shown in Table 4.1. According to its coding representation, A6 and A7 belong to cluster a, A1 and A4 belong to cluster b, and A2, A3 and A5 belong to cluster c. In addition, A3 and A5 compose a composite attribute, so they have to be in the same cluster.

Table 4.1: An example of chromosome representation

Attribute     A1  A2  A3  A4  A5  A6  A7
Composition   1   2   3   4   3   5   6
Cluster       b   c   c   b   c   a   a
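The encoding can be made concrete with a short Python sketch that decodes the chromosome of Table 4.1 into clusters of single and composite attributes. Representing a gene as a `(composition, cluster)` tuple and the helper name `decode` are assumptions made for illustration only.

```python
from collections import defaultdict

# Chromosome of Table 4.1: one (composition, cluster) gene per attribute A1..A7.
chromosome = [(1, "b"), (2, "c"), (3, "c"), (4, "b"), (3, "c"), (5, "a"), (6, "a")]


def decode(chromosome):
    """Map each cluster letter to its attribute groups.

    Each group is a tuple of 1-based attribute indices that share one
    composition number; a tuple of length 1 is a single attribute.
    """
    groups = defaultdict(list)  # composition number -> attribute indices
    cluster_of = {}             # composition number -> cluster letter
    for index, (comp, cluster) in enumerate(chromosome, start=1):
        # Enforce the constraint: one composite attribute, one cluster.
        if cluster_of.setdefault(comp, cluster) != cluster:
            raise ValueError(f"composite {comp} spans more than one cluster")
        groups[comp].append(index)
    clusters = defaultdict(list)
    for comp, members in groups.items():
        clusters[cluster_of[comp]].append(tuple(members))
    return dict(clusters)


if __name__ == "__main__":
    for letter, attributes in sorted(decode(chromosome).items()):
        print(letter, attributes)
```

For Table 4.1 the group `(3, 5)` appears in cluster c, matching the composite attribute formed by A3 and A5.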

4.2 Initial Population

Before the genetic algorithm begins, a set of P individuals is randomly generated to form the initial population. The composition part of a gene is probabilistically set to a number between 1 and N, where N is the number of attributes.

The probability of generating a larger composite attribute is lower than that of generating a smaller one. For example, composite attributes with two attributes appear in the initial population with a higher probability than those with three attributes. Likewise, single attributes have a higher probability than composite attributes.

The cluster part of a gene is randomly set to a number between 1 and K, where K is the number of clusters. The probability distribution resembles a normal distribution that peaks at N/K, the total number of attributes divided by the number of clusters.

4.3 Fitness and Selection

In order to develop a good result of attribute clustering from an initial population, the proposed algorithm selects parent chromosomes with high fitness values for mating. A good evaluation (fitness) function is thus needed to achieve the purpose.

The proposed fitness function consists of three factors: cluster accuracy, cluster balance and the total composition penalty. They are described as follows.

The cluster accuracy is used to evaluate the accuracy of a possible clustering result on the given training data. Since one purpose of the proposed attribute-clustering approach is to reduce the number of attributes used in approximate reasoning, a reasonable criterion for a clustering result is the average accuracy over all attribute combinations, each of which is composed of one single or composite attribute from each cluster. Take the chromosome in Figure 3.1 as an example. There are three clusters, with attributes A2 and A3 belonging to the first cluster, A4 and (A1, A6) to the second cluster, and A5 and A7 to the third cluster. Note that (A1, A6) is a composite attribute. The number of all possible attribute combinations from the chromosome is then 2 × 2 × 2 = 8.

Therefore, for K clusters, one single or composite attribute is selected from each cluster, and the K selected attributes together form a possible feature subset. Consider all the possible combinations from a chromosome. If all the combinations have high accuracy, then any single or composite attribute in each cluster can be chosen to form a feature subset, which means the clustering result is good for classification. In this case, if an attribute value of a new data item is missing or unavailable due to cost, the attribute can easily be replaced with another single or composite attribute in the same cluster. If the clustering result is good, the resulting attribute subset can be expected to classify with high accuracy as well. The average accuracy of all the possible attribute combinations in a chromosome thus provides a reasonable measure of its goodness.
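The enumeration of feature subsets described above amounts to a Cartesian product over the clusters. The Python sketch below uses the three clusters of the example (A2 and A3; A4 and (A1, A6); A5 and A7). The `accuracy` callable is a hypothetical stand-in for training and evaluating a classifier on the selected attributes; it is not specified in the text.

```python
from itertools import product

# Clusters from the example: tuples of 1-based attribute indices,
# where (1, 6) is the composite attribute (A1, A6).
clusters = [
    [(2,), (3,)],
    [(4,), (1, 6)],
    [(5,), (7,)],
]


def feature_subsets(clusters):
    """All ways of picking one single or composite attribute per cluster."""
    return [list(choice) for choice in product(*clusters)]


def average_accuracy(clusters, accuracy):
    """Mean of accuracy(subset) over every possible feature subset."""
    subsets = feature_subsets(clusters)
    return sum(accuracy(subset) for subset in subsets) / len(subsets)


if __name__ == "__main__":
    print(len(feature_subsets(clusters)))  # 2 * 2 * 2 combinations
    # Dummy accuracy function, used only to show the call shape.
    print(average_accuracy(clusters, lambda subset: 0.5))
```

The number of subsets grows as the product of the cluster sizes, so the cost of evaluating a chromosome depends strongly on how the attributes are distributed over the clusters.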

Each combination is formed from single or composite attributes. If a combination contains composite attributes, the number of attributes it uses is larger than K. The purpose of feature selection, however, is to classify with fewer attributes, so some penalty should be given to composite attributes. Every composite attribute has a penalty, as defined in Chapter 3. The penalty values are summed into a total_penalty value, which serves as another evaluation criterion.

Another evaluation criterion for the goodness of a clustering result is the cluster balance. When we divide the attributes into K clusters, we hope to distribute them as evenly as possible. If a clustering result is unbalanced, a new object with missing values may not be classifiable, since no alternative attributes are available in the single-attribute clusters. A simple example is given below. Assume there are two chromosomes, with their results shown in Figure 4.1 and Figure 4.2, respectively. The one in Figure 4.1 is more balanced than that in Figure 4.2, although the latter may have a better accuracy than the former.

Figure 4.1: A more-balanced clustering result

Figure 4.2: A less-balanced clustering result

In Figure 4.2, a new object with a missing value in either of the two attributes A3 and A6 cannot be classified, since no alternative attributes can be used for substitution.

The factor of cluster balance is thus important and designed here to avoid the above unbalanced situation. Formally, the factor of the cluster balance for a chromosome C is defined as follows:

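$$\mathit{balance}(C) = -\sum_{i=1}^{|Cluster(C)|} \frac{|cluster_i|}{CM} \log_2 \frac{|cluster_i|}{CM},$$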

where |cluster_i| represents the number of single or composite attributes in the i-th cluster, and CM is the total number of composite and single attributes, which may be less than the total number of attributes N. The measure of cluster balance is based on the principle of entropy: the more balanced a clustering result is, the larger its value. For example, we may compare Figure 4.1 and Figure 4.2, in both of which CM is 6. The numbers of single or composite attributes in the three clusters in Figure 4.1 are all 2, and those in Figure 4.2 are 1, 1 and 4, respectively. According to the above formula, the cluster balance is 1.585 for Figure 4.1 and 1.252 for Figure 4.2. Therefore, the clustering result in Figure 4.1 is more balanced.
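The two balance values quoted above can be reproduced with a few lines of Python. The sketch assumes the balance is the base-2 entropy of the cluster-size proportions |cluster_i| / CM, which is consistent with the entropy principle mentioned in the text and with both reported numbers; the function name `cluster_balance` is hypothetical.

```python
import math


def cluster_balance(cluster_sizes):
    """Base-2 entropy of the cluster-size proportions.

    cluster_sizes holds the number of single or composite attributes in
    each cluster; their sum plays the role of CM.
    """
    cm = sum(cluster_sizes)
    # Empty clusters contribute nothing and are skipped to avoid log2(0).
    return -sum((s / cm) * math.log2(s / cm) for s in cluster_sizes if s > 0)


if __name__ == "__main__":
    print(round(cluster_balance([2, 2, 2]), 3))  # Figure 4.1
    print(round(cluster_balance([1, 1, 4]), 3))  # Figure 4.2
```

The maximum value for K non-empty clusters is log2(K), reached when all clusters have the same size.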

According to the above three criteria (accuracy, cluster balance, and composition penalty), a fitness function can be designed to evaluate the goodness of chromosomes as follows:

where S is the set of all possible feature subsets from the clusters represented by the chromosome C, |S| is its cardinality, accuracy_i is the classification accuracy of the i-th feature subset in S for a set of training examples, total_penalty_i is the total penalty of the composite attributes in the i-th feature subset, balance(C) is the cluster balance of the chromosome C, α and β are two parameters to adjust the relative importance of the three criteria, K is the given cluster number, and |Cluster(C)| is the actual cluster number of the chromosome C.

Ideally, all the attributes should be divided into K clusters for a given K. In some chromosomes, however, the clustering result may contain fewer than K clusters, so a clustering penalty is applied to discourage this. The denominator K-|Cluster(C)|+1 in the above formula can be regarded as the penalty for this situation. If all the attributes in a chromosome C are divided into K clusters, then K-|Cluster(C)|+1 equals 1, meaning no penalty on the fitness evaluation. Conversely, if the actual cluster number of a chromosome C is less than K, then K-|Cluster(C)|+1 is at least 2 and the fitness value becomes smaller than it would be with K clusters.

Take the clustering results in Figure 4.2 as an example to illustrate the above fitness evaluation. From Figure 4.2, the following four feasible feature subsets can be obtained: {3, (1, 4), 6}, {3, 2, 6}, {3, 5, 6} and {3, 7, 6}. If γ = 0.05, the total penalty of the feature subset {3, (1, 4), 6} is calculated as [(1-1)/(7-1)]^1.05 + [(2-1)/(7-1)]^1.05 + [(1-1)/(7-1)]^1.05, which is 0 + 0.152 + 0 = 0.152. The total penalty for the other combinations can be found similarly; it is 0 for each of them, since they contain only single attributes. The accuracy of each feature subset is then evaluated on the given training set. Finally, the cluster balance of the clustering result is calculated according to the formula above. For example, suppose the accuracy values of the above feature subsets for the training data in Table 3.1 are 0.9, 0.8, 0.6 and 0.5, respectively, and the cluster balance is 1.252. The fitness of the chromosome representing the clustering results in Figure 4.2 can then be obtained.

Assume both α and β are 1. The value of |Cluster(C)| is 3, and the fitness of the clustering results in Figure 4.2 is 0.834.
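The reported value of 0.834 can be reproduced with the Python sketch below. It rests on an assumption: for α = β = 1, the fitness is taken to be the mean of accuracy_i × (1 − total_penalty_i) over the feature subsets, multiplied by balance(C) and divided by K − |Cluster(C)| + 1. This form matches the numbers of the example, but how α and β enter the formula cannot be recovered from the text, so both are left out here. The function name `fitness_alpha_beta_one` is hypothetical.

```python
def fitness_alpha_beta_one(accuracies, total_penalties, balance, k, actual_clusters):
    """Assumed fitness for alpha = beta = 1 (see the lead-in for caveats).

    accuracies[i] and total_penalties[i] belong to the i-th feature subset.
    """
    if len(accuracies) != len(total_penalties) or not accuracies:
        raise ValueError("need one penalty per accuracy, and at least one subset")
    mean_term = sum(
        acc * (1 - pen) for acc, pen in zip(accuracies, total_penalties)
    ) / len(accuracies)
    # The denominator is 1 when all K clusters are used, and grows otherwise.
    return mean_term * balance / (k - actual_clusters + 1)


if __name__ == "__main__":
    value = fitness_alpha_beta_one(
        accuracies=[0.9, 0.8, 0.6, 0.5],
        total_penalties=[0.152, 0.0, 0.0, 0.0],
        balance=1.252,
        k=3,
        actual_clusters=3,
    )
    print(round(value, 3))
```

With the same inputs but only two actual clusters, the denominator becomes 2 and the fitness is halved, which illustrates the clustering penalty described earlier.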

The fitness values of the chromosomes are used for selecting individuals for crossover, mutation and reproduction. An individual with a higher fitness value is selected with a higher probability. Different probabilities control different mixed ratios from the parents. Each parent is selected by the roulette wheel selection strategy for crossover with a crossover rate Pc.
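Roulette wheel selection picks an individual with probability proportional to its fitness. A minimal Python sketch follows; the function name `roulette_select` and the optional `rng` argument are illustrative choices, and the fitness values are assumed to be non-negative.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Return one individual, chosen with probability fitness / total."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    pick = rng.random() * total  # uniform in [0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


if __name__ == "__main__":
    rng = random.Random(42)
    population = ["C1", "C2", "C3"]
    fitnesses = [0.834, 0.5, 0.1]
    parents = [roulette_select(population, fitnesses, rng) for _ in range(2)]
    print(parents)
```

The standard library call `random.choices(population, weights=fitnesses)` performs the same weighted draw; the explicit loop is shown only to make the wheel visible.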

4.4 Genetic Operators

Genetic operators are very important to the success of specific GA applications. Two important genetic operators in GAs are crossover and mutation. The crossover operator is introduced first. It selects pairs of individuals from the current population for crossover. In this thesis, uniform crossover is used in the proposed genetic attribute-clustering approach. For the uniform crossover operator, a mask sequence is randomly generated for each pair to decide which genes will be exchanged. A mask sequence is composed of binary bits, with ‘1’ indicating that the corresponding attributes of the pair of chromosomes exchange their gene values and ‘0’ indicating no exchange. The following three kinds of crossover are executed.

1. Only the composition part exchanges according to the mask sequence. The cluster part will remain unchanged no matter what the mask sequence is.

2. Only the cluster part exchanges according to the mask sequence. The composition part will remain unchanged no matter what the mask sequence is.

3. Both the composition part and the cluster part exchange according to the mask sequence.

Since each of the three kinds of crossover yields two offspring, six children are generated from a pair of chromosomes. The best two among them are then chosen to compete with those from other pairs of chromosomes in the reproduction procedure for survival into the next generation. Below, an example is given to show the process of the crossover operation. Assume there are two chromosomes C1 and C2 as shown in Table 4.2.

Table 4.2: Two chromosomes for crossover

Attribute        A1  A2  A3  A4  A5  A6  A7
C1 composition   1   4   2   5   1   6   7
C1 cluster       a   b   b   c   a   a   c
C2 composition   2   3   7   4   5   4   3
C2 cluster       a   a   b   c   b   c   a

Also assume the mask sequence for the pair of chromosomes is randomly generated as shown in Table 4.3.

Table 4.3: The mask sequence in the example

Attribute   A1  A2  A3  A4  A5  A6  A7
Mask        0   1   0   1   0   0   1

The three crossover operations are then executed on C1 and C2 according to the mask sequence. For the first kind of crossover operation, only the composition part exchanges according to the mask sequence. The results are shown in Figure 4.3.

Figure 4.3: The results after the first kind of crossover

In Figure 4.3, A2, A4 and A7 in C1 and C2 exchange the composition part of their gene values. Since the attributes A2 and A7, which form the composite attribute with composition number 3, belong to different clusters, their clusters have to be adjusted to be the same. In this example, the cluster of A7 in O1 is adjusted from c to b. The cluster adjustment process will be introduced later. The results are shown in Table 4.4. The clusters of A2 and A7 thus become identical. Similarly, the clusters of A5, A6 and A7 in O2 are changed to c, a and b, respectively.

Table 4.4: The results after the adjustment for the first kind of crossover

For the second kind of crossover operation, only the cluster part exchanges according to the mask sequence. The results are shown in Figure 4.4.


Figure 4.4: The results after the second kind of crossover

Again, the adjustment process has to be done to make the composite attributes consistent in their cluster numbers. The results after the adjustment process are shown in Table 4.5.

Table 4.5: The results after the adjustment for the second kind of crossover operation

Finally, for the third kind of crossover operation, both the composition and the cluster parts exchange according to the mask sequence. The results are shown in Figure 4.5.

Figure 4.5: The results after the third kind of crossover

Again, the adjustment process has to be done to make the composite attributes consistent in their cluster numbers. The results after the adjustment process are shown in Table 4.6.

Table 4.6: The results after the adjustment for the third kind of crossover operation

Attribute        A1  A2  A3  A4  A5  A6  A7
O5 composition   1   3   2   4   1   6   3
O5 cluster       a   a   b   c   a   a   a
O6 composition   2   4   7   5   5   4   7
O6 cluster       a   b   b   c   c   b   b

Six children are thus generated from the pair of chromosomes C1 and C2. Then the fitness values of the six children are evaluated and the best two of them are kept for competition.
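The three kinds of uniform crossover can be sketched in Python as follows, using the chromosomes of Table 4.2 and the mask of Table 4.3. Genes are represented as `(composition, cluster)` tuples and the function names are hypothetical. The sketch returns the six children before any cluster adjustment, so some of them may still violate the same-cluster constraint for composite attributes.

```python
def uniform_crossover(parent1, parent2, mask, swap_composition, swap_cluster):
    """Swap the chosen gene parts at every position where the mask bit is 1."""
    child1, child2 = [], []
    for (comp1, clus1), (comp2, clus2), bit in zip(parent1, parent2, mask):
        if bit and swap_composition:
            comp1, comp2 = comp2, comp1
        if bit and swap_cluster:
            clus1, clus2 = clus2, clus1
        child1.append((comp1, clus1))
        child2.append((comp2, clus2))
    return child1, child2


def six_children(parent1, parent2, mask):
    """Apply the three kinds of crossover; each kind yields two children."""
    kinds = [(True, False), (False, True), (True, True)]
    children = []
    for swap_composition, swap_cluster in kinds:
        children.extend(
            uniform_crossover(parent1, parent2, mask, swap_composition, swap_cluster)
        )
    return children


if __name__ == "__main__":
    c1 = list(zip([1, 4, 2, 5, 1, 6, 7], "abbcaac"))
    c2 = list(zip([2, 3, 7, 4, 5, 4, 3], "aabcbca"))
    mask = [0, 1, 0, 1, 0, 0, 1]
    for number, child in enumerate(six_children(c1, c2, mask), start=1):
        comps = " ".join(str(comp) for comp, _ in child)
        clusters = " ".join(cluster for _, cluster in child)
        print(f"O{number}: {comps} | {clusters}")
```

In this run the fifth child already satisfies the constraint and equals O5 in Table 4.6, while the first child still has A2 and A7 (composition number 3) in different clusters and needs adjustment.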

The mutation operation is executed after the crossover is done. It is performed on single chromosomes, instead of pairs of chromosomes as in crossover. The one-point operation is the most commonly used among the mutation operations. It selects a gene in a chromosome for mutation according to a low mutation probability Pm. In the proposed representation, each gene includes two parts, the composition part and the cluster part. The one-point mutation operation is therefore modified to select one part of a gene in a chromosome for mutation. If the composition part of a gene is selected, it is randomly reset to a number between 1 and N, where N is the number of attributes. If the cluster part of a gene is selected, it is randomly reset to a number between 1 and K, where K is the number of clusters. An appropriate cluster adjustment may also be needed after the mutation operation. An example is given in Figure 4.6 to illustrate the mutation process, in which A3 is selected to mutate in the composition part (from 2 to 3) and A6 is selected to mutate in the cluster part (from a to b).

Before mutation:

Attribute        A1  A2  A3  A4  A5  A6  A7
O5 composition   1   3   2   4   1   6   3
O5 cluster       a   a   b   c   a   a   a

After mutation:

Attribute        A1  A2  A3  A4  A5  A6  A7
O5 composition   1   3   3   4   1   6   3
O5 cluster       a   a   b   c   a   b   a

Figure 4.6: The results after the mutation

After the mutation, the resulting chromosome has to be adjusted to make the composite attributes consistent in their cluster numbers. In this example, the cluster of A3 is changed from b to a so that the composite attribute is consistent with the clusters of its other components. The results after the adjustment process are shown in Table 4.7.

Table 4.7: The results after the adjustment

Attribute        A1  A2  A3  A4  A5  A6  A7
O5 composition   1   3   3   4   1   6   3
O5 cluster       a   a   a   c   a   b   a
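The modified mutation operator can be sketched in Python as below. Two details are assumptions, since the text does not fix them: each gene is tested independently against Pm, and a selected gene mutates its composition part or its cluster part with equal probability. The function name `mutate` is hypothetical, and the result may still need cluster adjustment afterwards.

```python
import random


def mutate(chromosome, n_attributes, cluster_labels, pm, rng=random):
    """Return a mutated copy of a list of (composition, cluster) genes."""
    mutated = []
    for comp, cluster in chromosome:
        if rng.random() < pm:
            if rng.random() < 0.5:
                # Composition part: reset to a number between 1 and N.
                comp = rng.randint(1, n_attributes)
            else:
                # Cluster part: reset to one of the K cluster labels.
                cluster = rng.choice(cluster_labels)
        mutated.append((comp, cluster))
    return mutated


if __name__ == "__main__":
    o5 = list(zip([1, 3, 2, 4, 1, 6, 3], "aabcaaa"))
    print(mutate(o5, n_attributes=7, cluster_labels="abc", pm=0.1,
                 rng=random.Random(7)))
```

Returning a new list rather than mutating in place keeps the parent chromosome intact for the reproduction step.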

After crossover and mutation, a number of offspring chromosomes have been generated. Roulette wheel selection is then adopted to form the population of the next generation. The above process is repeated until some termination criterion is satisfied. The criteria may include the number of generations, the execution time, or the convergence of the solutions obtained.

4.5 Cluster Adjustment

As mentioned above, the cluster adjustment process may need to be done after the crossover and mutation operations. It is mainly used to avoid the situation in which the components of a composite attribute belong to different clusters, which would cause inconsistency. The process works as follows. If the cluster parts of the genes belonging to a certain composite attribute are not the same (that is, they have different lowercase English letters), the process is activated. It first finds the cluster with the maximum occurrence number among the components of the composite attribute, and then resets the other clusters in the composite attribute to it. For example, assume a composite attribute includes three attributes, A1, A4 and A7, with their cluster parts being c, a and c, respectively. The cluster part of A4, which is originally a, will be reset to c.

If more than one cluster within a composite attribute has the same maximum occurrence number, then one of these clusters is chosen at random. For the above example, assume the three attributes A1, A4 and A7 have the cluster parts a, b and c, respectively. A cluster is then randomly chosen from the set {a, b, c}. Assume cluster b is chosen. The cluster parts of the other two attributes, A1 and A7 (originally a and c), are adjusted to b. After the adjustment process, the components of a composite attribute are guaranteed to be located in the same cluster. A simple example is given below to illustrate the idea.

Example 4.2: In Figure 4.7, A2, A3 and A7 form a composite attribute, but their cluster parts are not the same. The cluster adjustment process is therefore executed. The cluster parts of A2 and A3 are ‘a’, and the cluster part of A7 is ‘b’. The cluster with the maximum occurrence number is thus ‘a’, and we reset ‘b’ to ‘a’. In Figure 4.8, A1 and A5 form a composite attribute, but their cluster parts are not the same. The cluster part of A1 is ‘a’, and that of A5 is ‘c’. The clusters ‘a’ and ‘c’ are tied for the maximum occurrence number. Therefore, we randomly choose one of them and reset the other. In Figure 4.8, the cluster part of A1 is reset from ‘a’ to ‘c’.

Before adjustment:

Attribute        A1  A2  A3  A4  A5  A6  A7
O5 composition   1   3   3   4   1   6   3
O5 cluster       a   a   a   c   a   b   b

After adjustment:

Attribute        A1  A2  A3  A4  A5  A6  A7
O5 composition   1   3   3   4   1   6   3
O5 cluster       a   a   a   c   a   b   a

Figure 4.7: An example for the adjustment process

Before adjustment:

Attribute     A1  A2  A3  A4  A5  A6  A7
Composition   1   3   3   4   1   6   3
Cluster       a   a   a   c   c   b   a

After adjustment:

Attribute     A1  A2  A3  A4  A5  A6  A7
Composition   1   3   3   4   1   6   3
Cluster       c   a   a   c   c   b   a

Figure 4.8: An example in the adjustment process with a tie
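The adjustment process of this section can be sketched in Python as a majority vote per composite attribute with random tie-breaking. Genes are represented as `(composition, cluster)` tuples, and the function name `adjust_clusters` is hypothetical.

```python
import random
from collections import Counter, defaultdict


def adjust_clusters(chromosome, rng=random):
    """Put every component of a composite attribute into a single cluster.

    The most frequent cluster among the components wins; ties are broken
    at random.
    """
    members = defaultdict(list)  # composition number -> gene positions
    for position, (comp, _) in enumerate(chromosome):
        members[comp].append(position)

    adjusted = list(chromosome)
    for comp, positions in members.items():
        counts = Counter(chromosome[p][1] for p in positions)
        if len(counts) == 1:
            continue  # already consistent
        top = max(counts.values())
        candidates = sorted(c for c, n in counts.items() if n == top)
        winner = rng.choice(candidates)
        for p in positions:
            adjusted[p] = (comp, winner)
    return adjusted


if __name__ == "__main__":
    # Figure 4.7: A2, A3, A7 share composition number 3 with clusters a, a, b.
    before = list(zip([1, 3, 3, 4, 1, 6, 3], "aaacabb"))
    print("".join(cluster for _, cluster in adjust_clusters(before)))
    # Figure 4.8: A1 and A5 share composition number 1 with clusters a and c.
    tie = list(zip([1, 3, 3, 4, 1, 6, 3], "aaaccba"))
    print("".join(cluster for _, cluster in adjust_clusters(tie)))
```

In the tie case the output depends on the random draw: A1 and A5 end up both in cluster a or both in cluster c, and Figure 4.8 shows the second outcome.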

4.6 The Proposed Algorithm

According to the above description, the proposed GA-based algorithm for composite-attribute clustering is described below.

The proposed algorithm for clustering composite attributes:

INPUT: A training dataset with N attributes and a cluster number K.

OUTPUT: An appropriate K clusters of single or composite attributes.

STEP 1: Generate an initial population of P individuals (chromosomes) randomly, with each being a feasible attribute clustering result.

STEP 2: Calculate the fitness value of each chromosome C by the following sub-steps.

STEP 2.1: Determine all the possible feature subsets from the clustering results represented by the chromosome C.

STEP 2.2: Determine the accuracy of each possible feature subset from the given training examples.

STEP 2.3: Calculate the total penalty of each possible feature subset according to the component numbers of the composite attributes in the feature subset.
