
CHAPTER 5 Attribute Clustering with Unknown Cluster Numbers

5.3 An example

In this section, a simple example is given to show how the proposed algorithm can be used to cluster the attributes. Table 5.1 is a decision system in which there are five condition attributes A = {SE, IN, TR, CH, AG} standing for Sex, Income, Transport, Children and Age respectively. The values of these attributes are {Male, Female}, {High, Middle, Low}, {Car, Bus, Train}, {Yes, No} and {Young, Middle, Senior}. Besides, there is a decision attribute PE standing for Keeping the Pet, and its possible values are {Yes, No}. Assume the similarity threshold parameter γ is set at 0.65. For the set of data, the proposed algorithm proceeds as follows.

Table 5.1: An example for attribute clustering.

Step 1: The majority similarity measure of each pair of attributes is calculated. Take the similarity between CH and AG as an example. Since X(AG=Young∧CH =Yes) =2 and

Similarly, the dependency degree DepM(CH, AG) for CH to depend on AG is calculated as 0.75. The majority similarity (SimM(CH, AG)) of the two attributes CH and AG is thus calculated as (0.625+0.75)/2, which is 0.688. The majority similarity values for the other pairs can be found in the same way. The resulting majority similarity values for all the pairs of attributes are shown in Table 5.2.
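The averaging in Step 1 can be checked with a few lines of Python. This is only a sketch: the function name majority_similarity is hypothetical, and the two dependency degrees are the values quoted above (0.625 and 0.75) rather than values recomputed from Table 5.1.

```python
def majority_similarity(dep_a_on_b: float, dep_b_on_a: float) -> float:
    """Average the two directional dependency degrees of an attribute pair."""
    return (dep_a_on_b + dep_b_on_a) / 2


# Dependency degrees for the pair (CH, AG) as quoted in the text.
sim_ch_ag = majority_similarity(0.625, 0.75)
print(sim_ch_ag)  # 0.6875, reported as 0.688 after rounding
```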

Table 5.2: The resulting majority similarity values for all pairs of attributes.

      SE     IN     TR     CH     AG
SE    1      0.65   0.6    0.5    0.62
IN    0.65   1      0.72   0.7    0.5
TR    0.6    0.72   1      0.68   0.72
CH    0.5    0.7    0.68   1      0.625
AG    0.62   0.5    0.72   0.625  1

Step 2: Initially set C = φ, A^c = φ, Copen = φ, A^U = A, k = 1 and a(Ai) = 0 for 1 ≦ i ≦ |A|.

Step 3: Since Copen is empty now, an attribute is randomly selected from A^U and put into Copen. Assume SE is selected. The attribute SE is then moved from A^U to Copen. Thus, Copen = {SE} and A^U = {IN, TR, CH, AG}.

Step 4 : Since SE has been added to Copen, the affinity of each attribute in Copen∪A^U except for SE is updated. For example, the affinity of IN is calculated as:

a(IN) = a(IN) + SimM(IN, SE) = 0 + 0.65 = 0.65.

The affinity of each attribute is calculated and shown in Table 5.3.

Table 5.3: The affinity of each attribute.

SE IN TR CH AG

Affinity 0 0.65 0.6 0.5 0.62
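The update in Step 4 can be sketched as follows. The similarity values come from Table 5.2; the helper name add_to_open_cluster and the dictionary layout are illustrative assumptions, not part of the proposed algorithm's specification.

```python
ATTRS = ["SE", "IN", "TR", "CH", "AG"]
ROWS = [
    [1.0, 0.65, 0.6, 0.5, 0.62],
    [0.65, 1.0, 0.72, 0.7, 0.5],
    [0.6, 0.72, 1.0, 0.68, 0.72],
    [0.5, 0.7, 0.68, 1.0, 0.625],
    [0.62, 0.5, 0.72, 0.625, 1.0],
]
# SIM[x][y] is the majority similarity of attributes x and y (Table 5.2).
SIM = {a: dict(zip(ATTRS, row)) for a, row in zip(ATTRS, ROWS)}


def add_to_open_cluster(attr, affinity):
    """Return the affinities after `attr` joins C_open.

    Every attribute except `attr` itself gains its similarity to `attr`.
    """
    return {
        other: value + (SIM[other][attr] if other != attr else 0.0)
        for other, value in affinity.items()
    }


affinity = {a: 0.0 for a in ATTRS}          # Step 2: all affinities start at 0
affinity = add_to_open_cluster("SE", affinity)
print(affinity)
```

Under these assumptions the printed values agree with Table 5.3: a(SE) stays 0 while the other attributes receive their similarity to SE.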

Step 5: Steps 3 and 4 are repeated until the maximum affinity of the attributes in A^U is less than γ|Copen|. Since the attribute IN has the maximum affinity in A^U (= {IN, TR, CH, AG}) and the similarity threshold parameter γ is set at 0.65, a(IN) ≧ γ|Copen| holds (i.e., 0.65 ≧ 0.65×1). Steps 3 and 4 are thus executed again. Copen becomes {SE, IN} and A^U becomes {TR, CH, AG}. The affinity of each attribute except for IN is then updated in Step 4. These two steps are repeated until the maximum affinity of the attributes in A^U is less than the value γ|Copen|. The entire process mentioned above is listed in Table 5.4. In this example, the three attributes {SE, IN, TR} are added to Copen, such that Copen = {SE, IN, TR} and A^U = {CH, AG}.

Table 5.4: The entire process of Steps 3 to 4.

#Iteration Operation a(SE) a(IN) a(TR) a(CH) a(AG)

0 Initialization 0 0 0 0 0
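The repetition of Steps 3 and 4 described in Step 5 can be sketched as a loop. The function name grow and the small tolerance eps (added only to guard the comparison 0.65 ≧ 0.65×1 against floating-point noise) are assumptions of this sketch; the similarity values are those of Table 5.2 and the starting state is that of Table 5.3.

```python
ATTRS = ["SE", "IN", "TR", "CH", "AG"]
ROWS = [
    [1.0, 0.65, 0.6, 0.5, 0.62],
    [0.65, 1.0, 0.72, 0.7, 0.5],
    [0.6, 0.72, 1.0, 0.68, 0.72],
    [0.5, 0.7, 0.68, 1.0, 0.625],
    [0.62, 0.5, 0.72, 0.625, 1.0],
]
SIM = {a: dict(zip(ATTRS, row)) for a, row in zip(ATTRS, ROWS)}


def grow(open_cluster, unassigned, affinity, gamma, eps=1e-9):
    """Move attributes from A^U to C_open while max affinity >= gamma * |C_open|."""
    while unassigned:
        best = max(unassigned, key=lambda a: affinity[a])
        if affinity[best] < gamma * len(open_cluster) - eps:
            break  # stopping condition of Step 5
        unassigned.remove(best)
        open_cluster.append(best)
        for other in affinity:  # Step 4: update everyone except `best`
            if other != best:
                affinity[other] += SIM[other][best]


# State after SE has been selected as the first member (Table 5.3).
open_cluster = ["SE"]
unassigned = ["IN", "TR", "CH", "AG"]
affinity = {"SE": 0.0, "IN": 0.65, "TR": 0.6, "CH": 0.5, "AG": 0.62}

grow(open_cluster, unassigned, affinity, gamma=0.65)
print(open_cluster, unassigned)
```

With γ = 0.65 the loop adds IN and then TR and stops with CH and AG left in A^U, which is the outcome stated in Step 5.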

Step 6: The attribute with the minimum affinity in Copen is first found. In this case, SE is selected. Since the affinity of SE is less than γ(|Copen| − 1) (i.e., 1.25 < 0.65×(3−1)), SE is removed from Copen and added into A^U. After the step, Copen = {IN, TR} and A^U = {SE, CH, AG}.

Step 7: If an attribute Aj is removed from Copen, the affinity of each attribute except for Aj has to be updated. For example, since SE has just been removed from Copen, the new affinity of IN is calculated by subtracting SimM(IN, SE) from a(IN), such that a(IN) = 0.72.

The updated affinity of each attribute after this step is listed in Table 5.5.

Table 5.5: The updated affinity after SE is removed from Copen.

SE IN TR CH AG

Affinity 1.25 0.72 0.72 1.38 1.22
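Steps 6 and 7 can be sketched in the same style. The helper shrink_once and the tolerance eps are assumptions of this sketch. The starting affinities are recomputed from Table 5.2 as the sum of each attribute's similarities to the members of Copen = {SE, IN, TR}, which is what the repeated Step 4 updates amount to.

```python
ATTRS = ["SE", "IN", "TR", "CH", "AG"]
ROWS = [
    [1.0, 0.65, 0.6, 0.5, 0.62],
    [0.65, 1.0, 0.72, 0.7, 0.5],
    [0.6, 0.72, 1.0, 0.68, 0.72],
    [0.5, 0.7, 0.68, 1.0, 0.625],
    [0.62, 0.5, 0.72, 0.625, 1.0],
]
SIM = {a: dict(zip(ATTRS, row)) for a, row in zip(ATTRS, ROWS)}


def shrink_once(open_cluster, unassigned, affinity, gamma, eps=1e-9):
    """Remove the weakest member if its affinity < gamma * (|C_open| - 1)."""
    worst = min(open_cluster, key=lambda a: affinity[a])
    if affinity[worst] >= gamma * (len(open_cluster) - 1) - eps:
        return None  # nothing has to be removed
    open_cluster.remove(worst)
    unassigned.append(worst)
    for other in affinity:  # Step 7: update everyone except `worst`
        if other != worst:
            affinity[other] -= SIM[other][worst]
    return worst


open_cluster = ["SE", "IN", "TR"]
unassigned = ["CH", "AG"]
# Affinity of x = sum of similarities between x and the other members of C_open.
affinity = {a: sum(SIM[a][m] for m in open_cluster if m != a) for a in ATTRS}

removed = shrink_once(open_cluster, unassigned, affinity, gamma=0.65)
print(removed, {a: round(v, 2) for a, v in affinity.items()})
```

Under these assumptions SE is removed and the rounded affinities agree with Table 5.5.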

Step 8: Steps 3 to 7 are repeated until the attributes in Copen have converged. In the example, since the attribute CH in A^U has the maximum affinity value and a(CH) > γ|Copen| (i.e., 1.38 > 0.65 × 2), CH is thus removed from A^U and added to Copen in Step 3. The affinity of each attribute except for CH is then updated in Step 4, with the results shown in Table 5.6.

Table 5.6: The updated affinity after adding CH to Copen.

SE IN TR CH AG

Affinity 1.75 1.47 1.41 1.38 1.908

After that, the attribute with the maximum affinity in A^U is AG. Its affinity value is, however, less than the value γ|Copen| (i.e., a(AG) < 0.65 × 3). Step 6 is then executed. The attribute CH in Copen has the minimum affinity, and its affinity is larger than the value γ(|Copen| − 1) (= 0.65 × 2). That is, a(CH) > γ(|Copen| − 1). No attribute needs to be removed from Copen. Copen has thus converged and Step 9 is executed.

Step 9: Since Copen has converged, Copen is then a cluster. In this case, C1 = {IN, TR, CH} and C = {C1}.

Step 10: The attribute with the largest affinity in Copen is selected as the representative attribute. In this example, since a(IN) > a(TR) > a(CH), the representative attribute is IN. Thus, A^c = {IN}.

Step 11: Reset Copen as φ, a(SE) = 0 and a(AG) = 0, and set k = 1 + 1 = 2 to generate the next cluster.

Step 12: Steps 3 to 11 are repeated until the set A^U is empty. The final clusters can thus be found as follows:

C1 = {IN, TR, CH}, with the representative attribute IN.

C2 = {SE}, with the representative attribute SE.

C3 = {AG}, with the representative attribute AG.
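The whole walk-through can be replayed with a compact sketch. It is not a definitive implementation of the proposed algorithm: the function cluster_attributes, the tolerance eps, the iteration guard max_rounds and the rule of taking the first attribute of A^U (instead of a random one) when Copen is empty are all assumptions made here so that the run is deterministic. The similarity values are those of Table 5.2.

```python
ATTRS = ["SE", "IN", "TR", "CH", "AG"]
ROWS = [
    [1.0, 0.65, 0.6, 0.5, 0.62],
    [0.65, 1.0, 0.72, 0.7, 0.5],
    [0.6, 0.72, 1.0, 0.68, 0.72],
    [0.5, 0.7, 0.68, 1.0, 0.625],
    [0.62, 0.5, 0.72, 0.625, 1.0],
]
SIM = {a: dict(zip(ATTRS, row)) for a, row in zip(ATTRS, ROWS)}


def cluster_attributes(attrs, sim, gamma, eps=1e-9, max_rounds=1000):
    """Return a list of (cluster, representative) pairs."""
    unassigned = list(attrs)
    clusters = []
    while unassigned:                                # Step 12
        affinity = {a: 0.0 for a in unassigned}      # Step 11: reset affinities
        open_cluster = []
        for _ in range(max_rounds):                  # Step 8: repeat until stable
            changed = False
            while unassigned:                        # Steps 3 to 5: grow C_open
                best = max(unassigned, key=affinity.get)
                if affinity[best] < gamma * len(open_cluster) - eps:
                    break
                unassigned.remove(best)
                open_cluster.append(best)
                for other in affinity:
                    if other != best:
                        affinity[other] += sim[other][best]
                changed = True
            worst = min(open_cluster, key=affinity.get)   # Steps 6 and 7
            if affinity[worst] < gamma * (len(open_cluster) - 1) - eps:
                open_cluster.remove(worst)
                unassigned.append(worst)
                unassigned.sort(key=attrs.index)     # keep the original order
                for other in affinity:
                    if other != worst:
                        affinity[other] -= sim[other][worst]
                changed = True
            if not changed:
                break
        representative = max(open_cluster, key=affinity.get)  # Step 10
        clusters.append((open_cluster, representative))       # Step 9
    return clusters


clusters = cluster_attributes(ATTRS, SIM, gamma=0.65)
for members, representative in clusters:
    print(members, "representative:", representative)
```

With the Table 5.2 values and γ = 0.65, this sketch yields the same three clusters and representatives as listed above. Intermediate affinities may differ slightly from Table 5.6 because the sketch works only from the rounded entries of Table 5.2.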

CHAPTER 6 Case-Based Reasoning with Attribute Clustering

Case-based reasoning (CBR) is the process of solving new problems based on the solutions of similar past problems. The success of a CBR system mainly depends on effective and efficient retrieval of similar cases for a new problem. Indexing and matching are thus two very important issues in CBR [36]. Indexing usually uses some features from cases for identification, and matching usually uses a pre-defined matching function for case retrieval.

In this chapter, a case-based reasoning approach with attribute clustering is proposed. Three possible situations are discussed, and the corresponding solutions are designed.

6.1 The Proposed Algorithm

Assume that some preprocessing steps for case-based reasoning have been done. They include case selection, discretization of numerical data, grouping the attributes into adequate cluster numbers, and finding the representative attribute in each cluster. These representative attributes are then used to compare an object (new case) with other cases stored in the case base.

The proposed algorithm can handle the following situations. The first one is that all the representative attributes appear in the case base and the values of the representative attributes for the object to be classified are given explicitly. The second one is that some values of the representative attributes for the object to be classified are unknown. The third one is that some representative attributes do not appear in the case base. For the second and the third situations, the inference process based on the representative attributes cannot be accomplished. Attribute replacement is then used to achieve approximate inference. The proposed algorithm for case-based reasoning is described in detail below.

The algorithm for case-based reasoning with attribute clustering:

Input: A new object x’ with m attribute values, a similarity matrix for the m attributes, g attribute clusters C1 to Cg with their representative attributes A^c = {A^c_1, A^c_2, …, A^c_g} and a case base with k cases.

Output: The case xi which is the most similar to the object x’.

Step 0 : Check whether any missing value exists in object x’ and whether any representative attribute doesn’t appear in the case base. According to the check results, three situations occur and each has its corresponding steps.

Situation 1: All the representative attributes appear in the case base and the values of the representative attributes for the object to be classified are given explicitly.

Step 1.1: Calculate the dissimilarity between object x’ and each case xi as follows:

according to the j-th representative attribute A^c_j and g is the number of representative attributes. Here, the calculation of δ_j(x’, x_i) depends on the type of attribute A^c_j and is divided into the following two cases.

(1) Attribute A^c_j is binary or categorical. In this case, δ_j(x’, x_i) = 0 if the values of A^c_j for object x’ and case xi are the same, and δ_j(x’, x_i) = 1 otherwise.

(2) Attribute A^c_j is ordinal. In this case, the values of the attribute are first mapped into a number list from 1 to M_j, where M_j is the number of values in A^c_j (this can also be done in the preprocessing steps). Let r

Step 1.2: Find the case xi with the minimum dissimilarity to object x’, and output the class of xi as the most possible class to x’.

Situation 2: All the representative attributes appear in the case base, and some values of the representative attributes for the object x’ to be classified are unknown.

Step 2.1: Initially set Indexset = A^c, where Indexset is the set of attributes to perform classification for object x’.

Step 2.2: Replace attribute A^c_j in Indexset with its most similar attribute A^*_j if A^c_j(x’) is unknown, where A^c_j(x’) denotes the value of attribute A^c_j for object x’ and A^*_j denotes the attribute which is the most similar to A^c_j in attribute cluster Cj and whose value A^*_j(x’) is explicit (without missing value). If all the attributes in the attribute cluster Cj are missing for the object x’, find the most similar and existing replacement attribute from the other clusters. This step is iteratively executed for all the attributes with missing values in object x’.

Step 2.3: Calculate the dissimilarity between object x’ and each case xi as follows:

| |

and is the same as that in Situation 1.

Step 2.4: Find the case xi with the minimum dissimilarity to object x’, and output the class of xi as the most possible class to x’.

Situation 3: At least one representative attribute does not appear in the case base.

Step 3.1: Initially set Indexset = A^c.

Step 3.2: Replace attribute A^c_j in Indexset with its most similar attribute A^*_j if A^c_j does not appear in the case base or in the object, where A^*_j is the attribute most similar to A^c_j in attribute cluster Cj and existing in the case base and in the object. If all the attributes in the attribute cluster Cj are missing for the case base and the object, find the most similar and existing replacement attribute from the other clusters. This step is iteratively executed for all the missing representative attributes in the case base and in the object.

Step 3.3: Calculate the dissimilarity between object x’ and each case xi as follows:

| |

according to the j-th attribute in Indexset.

Step 3.4: Find the case xi with the minimum dissimilarity to object x’, and output the class of xi as the most possible class to x’.

6.2 Examples

In this section, three simple examples are given to show how the proposed algorithm can be applied to case-based reasoning. Assume the attribute clustering has been done in advance, with the three attribute clusters formed as shown in Table 6.1, where the gray fields represent the representative attributes. Thus, C1 = {Sex, Income}, C2 = {English, Country} and C3 = {Married, Children, Age}. The similarity between any two attributes is shown in Figure 6.1, where the distance between two attributes in the figure represents how similar they are.

Table 6.1: Three constructed attribute clusters.

C1        C2        C3
Sex       English   Married
Income    Country   Age
                    Children

(Representative attributes: Sex, English and Married.)


Figure 6.1: The concept of attribute clusters.

In Example 1, a case base with five cases is used to perform the reasoning for an object whose attribute values are all explicitly given. Besides, all the representative attributes exist in the case base. In Example 2, an object with missing values of the representative attributes is to be classified. The attribute replacement technique is thus used to choose other attributes to classify it. In Example 3, some representative attributes do not appear in the case base. The attribute replacement technique is also used for approximate reasoning. Example 1 is first introduced below.

Example 1: Given a case base with five cases as shown in Table 6.2, where {Sex, Income, English, Country, Married, Children, Age} is a set of condition attributes and Credit is a decision attribute. The attributes Sex and Country are categorical; Children and Married are binary; Income, English and Age are ordinal and their attribute values have been mapped to a number sequence.

Table 6.2: The case base in Example 1.

Case Sex English Country Income Age Married Children Credit

x1 Male 3 Singapore 4 2 Yes No Good

x2 Female 3 Taiwan 3 1 No No Average

x3 Male 2 Japan 2 3 Yes Yes Good

x4 Male 1 Taiwan 1 1 Yes No Poor

x5 Female 2 Japan 2 3 Yes Yes Good

Assume object xa is a new case and needs to be classified. Its condition attributes are shown in Table 6.3.

Table 6.3: An object xa to be classified in Example 1.

Sex English Country Income Age Married Children

Male 4 Singapore 3 2 Yes No

The process of case reasoning for the proposed approach is illustrated as follows.

Step 0: Since all the representative attributes appear in the case base and the values of the three representative attributes Sex, English and Married for object xa are explicitly given, xa can be classified by the steps of Situation 1.

Step 1.1: The dissimilarity between object xa and each case is computed. For example, the dissimilarity d(xa, x1) is computed in the following substeps:

(1) For the first attribute Sex, which is categorical, δ_1(x_a, x_1) = 0 since xa and x1 are both male. Similarly, for the third attribute Married, which is binary, it can also be computed that δ_3(x_a, x_1) = 0.

(2) For the second attribute English, which is ordinal, the normalized attribute values for the object and for the case are first computed. Thus:

The dissimilarity between xa and other cases can be computed in the same way. The results are shown in Table 6.4.

Table 6.4: The dissimilarity between xa and each case in Example 1.

d(xa, x1)   0.11
d(xa, x2)   0.77
d(xa, x3)   0.22
d(xa, x4)   0.66
d(xa, x5)   0.55

Step 1.2: The case with the minimum dissimilarity to object xa is chosen as the most matching case in the case base. In the example, case x1 is found. Since Credit(x1) is Good, the most possible credit of xa is also evaluated as Good.
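The retrieval in Situation 1 can be sketched as follows. Two points are assumptions of this sketch rather than statements of the algorithm: an ordinal value r is normalized as (r − 1)/(M − 1) with M = 4 levels for English, and the overall dissimilarity is the mean of the per-attribute values δ_j over the representative attributes. Under these assumptions d(xa, x1) evaluates to about 0.11, as in Table 6.4; the other entries are not guaranteed to match the table.

```python
REPRESENTATIVES = ["Sex", "English", "Married"]
ORDINAL_LEVELS = {"English": 4}  # assumed number of ordinal levels M

CASES = {
    "x1": {"Sex": "Male", "English": 3, "Married": "Yes", "Credit": "Good"},
    "x2": {"Sex": "Female", "English": 3, "Married": "No", "Credit": "Average"},
    "x3": {"Sex": "Male", "English": 2, "Married": "Yes", "Credit": "Good"},
    "x4": {"Sex": "Male", "English": 1, "Married": "Yes", "Credit": "Poor"},
    "x5": {"Sex": "Female", "English": 2, "Married": "Yes", "Credit": "Good"},
}
x_a = {"Sex": "Male", "English": 4, "Married": "Yes"}


def delta(attr, a, b):
    """Per-attribute dissimilarity: 0/1 mismatch, or normalized ordinal gap."""
    if attr in ORDINAL_LEVELS:
        m = ORDINAL_LEVELS[attr]
        return abs((a - 1) / (m - 1) - (b - 1) / (m - 1))
    return 0.0 if a == b else 1.0


def dissimilarity(obj, case, index_set):
    """Assumed aggregate: mean of delta over the attributes in index_set."""
    return sum(delta(a, obj[a], case[a]) for a in index_set) / len(index_set)


d = {name: dissimilarity(x_a, case, REPRESENTATIVES) for name, case in CASES.items()}
nearest = min(d, key=d.get)
print(round(d["x1"], 2), nearest, CASES[nearest]["Credit"])
```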

Example 2: Another object xb with missing values is used to illustrate how to classify it by attribute replacement. The attribute values of xb are shown in Table 6.5, where the values of the two attributes, English and Married, are missing. The proposed algorithm proceeds as follows.

Table 6.5: An object xb to be classified in Example 2.

Sex English Country Income Age Married Children

Male **** Singapore 3 2 **** No

Step 0: Since the values English(xb) and Married(xb) are unknown, xb can be classified by the steps of Situation 2.

Step 2.1: Initially set Indexset = A^c = {Sex, English, Married}.

Step 2.2: The attributes in Indexset are replaced if their values for xb are unknown. In this example, since the values for English and Married are unknown in xb, other appropriate attributes are found to replace them. Since the attribute Country is the most similar to English in cluster C2 and the value Country(xb) is explicit, English is replaced with Country as a new member of Indexset. Similarly, Married is replaced with Children. Thus, Indexset = {Sex, Country, Children}.

Step 2.3: The dissimilarity between xb and each case is calculated according to the attributes in Indexset. The results are shown in Table 6.6.

Table 6.6: The dissimilarity between xb and each case in Example 2.

d(xb, x1)   0
d(xb, x2)   0.6
d(xb, x3)   0.6
d(xb, x4)   0.3
d(xb, x5)   1

Step 2.4: Since the case x1 has the minimum dissimilarity to xb, the credit of xb is thus assigned to the credit of x1, which is Good.
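The replacement in Situation 2 can be sketched as follows. The similarity values in SIM are hypothetical (Figure 6.1 conveys the similarities only graphically); they are chosen so that Children is closer to Married than Age is, as the example states. The fallback to other clusters is omitted, and the mean of 0/1 mismatches is an assumed form of the dissimilarity.

```python
CLUSTERS = {
    "Sex": ["Sex", "Income"],
    "English": ["English", "Country"],
    "Married": ["Married", "Children", "Age"],
}
# Hypothetical similarity values between a representative and its cluster mates.
SIM = {
    frozenset(("Sex", "Income")): 0.7,
    frozenset(("English", "Country")): 0.8,
    frozenset(("Married", "Children")): 0.8,
    frozenset(("Married", "Age")): 0.7,
}
CASES = {
    "x1": {"Sex": "Male", "Country": "Singapore", "Children": "No"},
    "x2": {"Sex": "Female", "Country": "Taiwan", "Children": "No"},
    "x3": {"Sex": "Male", "Country": "Japan", "Children": "Yes"},
    "x4": {"Sex": "Male", "Country": "Taiwan", "Children": "No"},
    "x5": {"Sex": "Female", "Country": "Japan", "Children": "Yes"},
}
x_b = {"Sex": "Male", "English": None, "Country": "Singapore", "Income": 3,
       "Age": 2, "Married": None, "Children": "No"}


def replace_missing(representatives, obj):
    """Swap each representative whose value is unknown for its most similar
    cluster mate with a known value (cross-cluster fallback not sketched)."""
    index_set = []
    for rep in representatives:
        if obj.get(rep) is not None:
            index_set.append(rep)
            continue
        known = [a for a in CLUSTERS[rep] if obj.get(a) is not None]
        index_set.append(max(known, key=lambda a: SIM[frozenset((rep, a))]))
    return index_set


index_set = replace_missing(["Sex", "English", "Married"], x_b)
# All three selected attributes are categorical or binary: 0/1 mismatch.
d = {name: sum(x_b[a] != case[a] for a in index_set) / len(index_set)
     for name, case in CASES.items()}
print(index_set, {k: round(v, 2) for k, v in d.items()})
```

Under these assumptions the index set is {Sex, Country, Children} and x1 is again the closest case; the computed values (0, 2/3, 2/3, 1/3, 1) agree with Table 6.6 up to rounding.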

The next example considers the situation in which some representative attributes do not appear in the case base. As in Example 2, the attribute replacement technique is used to solve it.

Example 3: Given another case base shown in Table 6.7, the credit of xa can be derived by the following steps.

Table 6.7: Five cases in Example 3.

Case English Country Income Age Children Credit

x1 3 Singapore 4 2 No Good

x2 3 Taiwan 3 1 No Average

x3 2 Japan 2 3 Yes Good

x4 1 Singapore 1 1 No Poor

x5 2 Japan 2 3 Yes Good

Step 0: Since the representative attributes, Sex and Married, do not appear in the case base, xa can be classified by the steps for Situation 3.

Step 3.1: Initially set Indexset = A^c = {Sex, English, Married}.

Step 3.2: The attributes in Indexset are replaced if they do not appear in the case base. In this example, since the two attributes, Sex and Married, do not appear in the case base, other appropriate attributes are thus found to replace them. For the attribute Sex, Income is most similar to it in cluster C1 and Income(xa) is explicit. Sex is thus replaced with Income to be an attribute in Indexset. Similarly, Married is replaced with Children. Thus, new Indexset = {Income, English, Children}.

Step 3.3: The dissimilarity between xa and each case is calculated according to the attributes in Indexset. The results are shown in Table 6.8.

Table 6.8: The dissimilarity between xa and each case in Example 3.

d(xa, x1)   0.27
d(xa, x2)   0.13
d(xa, x3)   0.67
d(xa, x4)   0.53
d(xa, x5)   0.67

Step 3.4: Since the case x2 has the minimum dissimilarity to xa, the credit of xa is thus predicted as Average (= Credit(x2)).
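The construction of Indexset in Situation 3 differs from Situation 2 only in the availability test: a replacement must exist both as a column of the case base and as an explicit value of the object. A minimal sketch follows, with hypothetical similarity values and a hypothetical helper name build_index_set; the cross-cluster fallback is again left out.

```python
CLUSTERS = {
    "Sex": ["Sex", "Income"],
    "English": ["English", "Country"],
    "Married": ["Married", "Children", "Age"],
}
# Hypothetical similarity values; Figure 6.1 gives them only graphically.
SIM = {
    frozenset(("Sex", "Income")): 0.7,
    frozenset(("English", "Country")): 0.8,
    frozenset(("Married", "Children")): 0.8,
    frozenset(("Married", "Age")): 0.7,
}
CASE_BASE_COLUMNS = ["English", "Country", "Income", "Age", "Children"]  # Table 6.7
x_a = {"Sex": "Male", "English": 4, "Country": "Singapore", "Income": 3,
       "Age": 2, "Married": "Yes", "Children": "No"}


def build_index_set(representatives, case_columns, obj):
    """Keep usable representatives; replace the others by the most similar
    attribute of the same cluster present in both the case base and the object."""
    usable = {a for a in case_columns if obj.get(a) is not None}
    index_set = []
    for rep in representatives:
        if rep in usable:
            index_set.append(rep)
            continue
        candidates = [a for a in CLUSTERS[rep] if a in usable]
        # max() raises ValueError if the cluster offers no usable candidate.
        index_set.append(max(candidates, key=lambda a: SIM[frozenset((rep, a))]))
    return index_set


index_set = build_index_set(["Sex", "English", "Married"], CASE_BASE_COLUMNS, x_a)
print(index_set)
```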

CHAPTER 7 The k-Nearest-Neighbors Classifier with Attribute Clustering

The k-Nearest-Neighbors (k-NN) classifier [18] is an easy and popular classification tool. It is, however, time-consuming when the amount of training data is huge and the dimension of the feature space is high. In this chapter, we propose an approach which integrates attribute clustering and the k-NN classifier to reduce the execution time of classification. Unlike the conventional k-NN approach, only a part of the attributes is used to search for the k closest training objects. Besides, if a test object has some missing values for the selected attributes, the attributes grouped in the same clusters can also be used to achieve approximate inference.

7.1 The Proposed Algorithm

In this section, we propose an approach which integrates the proposed attribute clustering and the k-NN classifier. Assume that the condition attributes have been partitioned into g clusters C1 to Cg, the set of representative attributes A^c = {A^c_1, A^c_2, …, A^c_g} has been found, and the similarity matrix for all attributes has been computed. Therefore, only the representative attributes are considered to search for the k closest objects among the training set. The clustered attributes are also useful when some test objects have missing values for the representative attributes. The proposed algorithm is described in detail below.

The proposed k-nearest-neighbors classifier with attribute clustering:

Input: A test object x’ with m attribute values, g attribute clusters C1 to Cg, the set of representative attributes A^c = {A^c_1, A^c_2, …, A^c_g}, the similarity matrix for the attributes and a set of training objects U = {x1, x2, …, xN} with L classes.

Output: The probabilities that object x’ belongs to the classes.

Step 1: Initially set Indexset = A^c, where the variable Indexset is a set of attributes to perform classification with object x’; also set θi = 1, where θi is the similarity coefficient of the j–th attribute of the i-th training object due to attribute replacement.

Step 2: Replace attribute A^c_j in Indexset with its most similar attribute A^*_j if A^c_j(x’) is unknown, where A^c_j(x’) is the value of attribute A^c_j for object x’, A^*_j is the attribute most similar to A^c_j in the attribute cluster Cj and A^*_j(x’) is explicit (without missing value). If all the attributes in the attribute cluster Cj are missing for the object x’, find the most similar and existing replacement attribute from the other clusters. This step is iteratively executed for all the attributes with missing values in object x’.

Step 3: For each training object xi, if its value of the j-th attribute Aj in Indexset is missing, replace attribute Aj with its most similar attribute A^*_j in the attribute cluster Cj which has its values in both x’ and xi. If all the attributes in the attribute cluster Cj cannot meet the condition, find the most similar and existing replacement attribute from the other clusters. Update the similarity coefficient θi as:

θ_i = θ_i × Sim(A^*_j, A_j).

This step is iteratively executed for all the attributes with missing values in object xi. Use the modified attribute set for the evaluation of the dissimilarity between xi and x’ in Step 4.

Step 4: Calculate the dissimilarity between object x’ and each training object xi as follows:

according to the j-th attribute Aj generated in Step 3; otherwise, go to Step 5. Here, the calculation of δ_j(x’, x_i) depends on the type of attribute A^c_j and is divided into the representative attribute A^c_j for an object x is then calculated as follows:

Step 5: Search for the k closest objects to the test object x’ according to the dissimilarity in Step 4 and put them into the set S.
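Two of the steps above lend themselves to a short sketch: the multiplicative update of the similarity coefficient θi in Step 3 and the search for the k closest objects in Step 5. The helper names, the similarity values and the dissimilarity values below are hypothetical illustrations (the dissimilarities are borrowed from Table 6.4 purely as sample numbers); how θi enters the dissimilarity of Step 4 is not sketched here.

```python
import heapq


def update_coefficient(theta, substitutions, sim):
    """Step 3: multiply theta by Sim(A*_j, A_j) for every replaced attribute.

    `substitutions` is a list of (original, replacement) attribute pairs.
    """
    for original, replacement in substitutions:
        theta *= sim[frozenset((original, replacement))]
    return theta


def k_closest(dissimilarity, k):
    """Step 5: names of the k training objects with the smallest dissimilarity."""
    return heapq.nsmallest(k, dissimilarity, key=dissimilarity.get)


# Hypothetical similarities between replaced attributes and their substitutes.
sim = {
    frozenset(("English", "Country")): 0.8,
    frozenset(("Married", "Children")): 0.9,
}
theta_i = update_coefficient(
    1.0, [("English", "Country"), ("Married", "Children")], sim
)

# Sample dissimilarities between x' and five training objects.
d = {"x1": 0.11, "x2": 0.77, "x3": 0.22, "x4": 0.66, "x5": 0.55}
S = k_closest(d, k=3)
print(round(theta_i, 2), S)
```

A heap-based selection avoids sorting all N training objects when k is small, which fits the chapter's aim of reducing classification time.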
