
Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values

Hung-Leng Chen, Kun-Ta Chuang and Ming-Syan Chen

Department of Electrical Engineering

National Taiwan University

Taipei, Taiwan, ROC

E-mail: {kidd,doug}@arbor.ee.ntu.edu.tw, mschen@cc.ee.ntu.edu.tw

Abstract

Sampling has been recognized as an important technique for improving the efficiency of clustering. However, when sampling is applied, the points that are not sampled do not receive cluster labels. Although there is a straightforward approach in the numerical domain, the problem of how to allocate those unlabeled data points into proper clusters remains a challenging issue in the categorical domain. In this paper, a mechanism named MAximal Resemblance Data Labeling (abbreviated as MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster based on a novel categorical clustering representative, namely the Node Importance Representative (abbreviated as NIR), which represents clusters by the importance of attribute values. MARDL has two advantages: (1) MARDL exhibits high execution efficiency; (2) after each unlabeled data point is allocated to the proper cluster, MARDL preserves the clustering characteristics, i.e., high intra-cluster similarity and low inter-cluster similarity. MARDL is empirically validated on real and synthetic data sets, and is shown to be not only more efficient than prior methods but also to attain results of better quality.

Keywords: data mining, categorical clustering, data labeling.

1 Introduction

The clustering problem has been deemed an important issue in data mining, statistical pattern recognition, machine learning, and information retrieval because of its use in a wide range of applications [11]. Given a set of data points, the goal of clustering is to partition those data points into several groups of similar points according to a predefined similarity measurement [2]. However, finding the optimal clustering result has been proved to be an NP-hard problem [12]. As the size of data grows at a rapid pace, clustering a very large database inevitably involves a very time-consuming process.

To improve efficiency, sampling is usually used to scale down the size of the database [13]. In particular, sampling has been employed to speed up clustering algorithms in [3][14]. A typical way to utilize sampling in clustering is to randomly choose a small set from the original database, and then execute the clustering algorithm on the small sampled set. The clustering result, which is expected to be similar to that obtained from the original database, can hence be obtained efficiently.

However, the problem of how to allocate the unclustered data into appropriate clusters has not been fully explored in previous works. This can be explained by the fact that in the numerical domain there is a common solution: measure the similarity between an unclustered data point and a cluster by the distance between the unclustered data point and the centroid of that cluster [11], and allocate each unclustered data point to the cluster with the minimal distance. Previous works usually deal with this issue by this straightforward method. However, much of the data in existing databases is categorical. In the categorical domain, the above procedure is infeasible because the centroid of a cluster is difficult to define. Without loss of generality, the goal of clustering is to allocate every data point into an appropriate cluster; a partial clustering result obtained from the sampled database is usually not what the user really wants. Therefore, in the categorical domain, the problem of how to allocate the unclustered data remains a challenging issue.
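To make the contrast concrete, the straightforward numerical-domain allocation described above can be sketched as follows. This is a minimal illustration with our own naming, not part of MARDL:

```python
import numpy as np

def label_by_nearest_centroid(points, centroids):
    """Assign each unclustered point to the cluster with the nearest centroid.

    points: (m, d) array of unclustered numerical points.
    centroids: (k, d) array of cluster centroids.
    Returns an (m,) array of cluster indices.
    """
    # Pairwise Euclidean distances between every point and every centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

It is exactly this centroid that has no natural counterpart for categorical attributes, which motivates the NIR representative introduced below.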

As a result, we propose in this paper a mechanism, named MAximal Resemblance Data Labeling (abbreviated as MARDL), to allocate each categorical unclustered data point into the corresponding proper cluster. The allocating process is referred to as data labeling: giving each unclustered data point a cluster label. The unclustered data points are also called unlabeled data points.

Figure 1. The framework of clustering a very large categorical database with sampling and MARDL.

Figure 1 shows the entire framework for clustering a very large database based on sampling and MARDL. In particular, MARDL is independent of clustering algorithms, and any categorical clustering algorithm can in fact be utilized in this framework. In MARDL, the unlabeled data points are allocated into clusters via two phases, namely the Cluster Analysis phase and the Data Labeling phase. The work done in each phase is described below.

Cluster Analysis Phase: In the cluster analysis phase, a cluster representative is generated to characterize the clustering result. However, in the categorical domain, there is no common way to decide on a cluster representative. Hence, a categorical cluster representative, named "Node Importance Representative" (abbreviated as NIR), is devised in this paper. NIR represents clusters by their attribute values, and the importance of an attribute value is measured by the following two concepts: (1) an attribute value is important in a cluster when the frequency of the attribute value is high in this cluster; (2) an attribute value is important in a cluster if the attribute value appears prevalently in this cluster rather than in other clusters. NIR identifies the significant components of the cluster by the important attribute values. Moreover, based on these two concepts for measuring the importance of attribute values, NIR considers both the intra-cluster similarity and the inter-cluster similarity to represent the cluster.

Data Labeling Phase: In the data labeling phase, each unlabeled data point is given the label of an appropriate cluster according to NIR. By referring to the vector-space model [1], the similarity between the unlabeled data point and the cluster is designed analogously to the similarity between a query string and a document. According to this similarity measurement, MARDL allocates each unlabeled data point into the cluster which possesses the maximal resemblance.

There are two advantages of MARDL: (1) high efficiency. MARDL is linear with respect to the data size; it is efficient in essence and able to preserve the benefit of sampling when clustering a very large database. (2) Retaining cluster characteristics. MARDL gives each unlabeled data point a cluster label based on the partial clustering result obtained by clustering the sampled data set. Since NIR considers the importance of attribute values, MARDL preserves the clustering characteristics: high intra-cluster similarity and low inter-cluster similarity.

This paper is organized as follows. In Section 2, we review several related works. Section 3 formulates the problem and presents the concepts of NIR and MARDL. Section 4 reports our performance study on real and synthetic data sets. The paper concludes with Section 5.

2 Related Works

A survey on clustering techniques can be found in [2]. Here, we focus on reviewing the techniques of cluster representative and data labeling on categorical data, which are most related to our work.

A cluster representative is used to summarize and characterize the clustering result [11]. Since the cluster representative is not well discussed in the categorical domain, we review several categorical clustering algorithms and explain the spirit of the cluster representative in each algorithm.

In k-modes [9], a cluster is represented by its "mode", which is composed of the most frequent attribute value in each attribute domain in this cluster. Suppose that there are t attributes in the dataset. Only t attribute values, each of which is the most frequent attribute value in one attribute, will be selected to represent the cluster. Although this cluster representative is simple, using only one attribute value per attribute domain to represent a cluster is questionable. For example, suppose that there is a cluster which contains 51% male and 49% female in the attribute gender. Using only male to represent this cluster loses the information from female, which covers almost half of this cluster.

In the algorithm ROCK [7], clusters are represented by several representative points. This representative does not provide a summary of the cluster, and thus cannot be used efficiently for post-processing. For example, in data labeling, the similarity between unclustered data points and clusters needs to be measured. It is time-consuming to measure the similarity between unclustered data points and each representative point, especially when a large number of representative points is needed for better representability.

In the algorithm CACTUS [5], clusters are represented by attribute values. The basic idea behind CACTUS is to calculate the co-occurrence of attribute-value pairs; the cluster is then composed of the attribute values with high co-occurrence. However, this representative does not measure the importance of the attribute values. A cluster is represented only by several attribute values, and each attribute value has equal representability in the cluster.

In this paper, we present NIR, which is based on the idea of representing clusters by the importance of their attribute values, because the summarization and characteristic information of a cluster can be obtained from the attribute values. Utilizing this summarization and characteristic information to execute data labeling is more efficient than utilizing representative points.

Furthermore, data labeling is used to allocate an unlabeled data point into the corresponding appropriate cluster. The technique of data labeling has been studied in CURE [6]. However, CURE is a specialized numerical clustering algorithm for finding non-spherical clusters; a specific data labeling procedure is defined to assign each unlabeled data point to the cluster which contains the representative point closest to the unlabeled data point. In addition, ROCK [7], a categorical clustering algorithm, also utilizes data labeling to speed up the entire clustering procedure. The data labeling method in ROCK is independent of the proposed clustering algorithm, and is performed as follows. First, a fraction of points is obtained to represent each cluster. Then, each unlabeled data point is assigned to the cluster such that the data point has the maximum number of neighbors in the fraction of points from the cluster. Two data points are said to be neighbors of each other if their Jaccard coefficient [10] is larger than or equal to a user-defined threshold. However, the threshold in ROCK data labeling is difficult for users to determine. Moreover, it is time-consuming to compute the neighbor relationship between an unclustered data point and all representative points.

In this paper, we present MARDL, which analyzes the attribute values in each cluster by NIR, and offers a unique data labeling measurement without the need for user-specified parameters. The model of MARDL and the detailed techniques will be presented in the next section.

3 Model of MARDL

In this section, we introduce our MARDL mechanism. The problem and several notations are defined in Section 3.1. In Section 3.2, we introduce a novel categorical cluster representative named NIR. Then, the MARDL techniques are presented in Section 3.3. Section 3.4 shows the implementation issues and the complexity of MARDL.

3.1 Problem Formulation

The problem of allocating unlabeled data points into appropriate clusters is formulated as follows. Suppose that a prior clustering result $C = \{c_1, c_2, \ldots, c_n\}$ is given, where $c_i$, $1 \le i \le n$, is the $i$-th cluster. The cluster $c_i$, with a cluster label $L_i$, is composed of $m_i$ data points, i.e., $c_i = \{p_{(i,1)}, p_{(i,2)}, \ldots, p_{(i,m_i)}\}$, where each data point is a vector of $t$ attribute values, i.e., $p_{(i,j)} = (p^1_{(i,j)}, p^2_{(i,j)}, \ldots, p^t_{(i,j)})$. Let $A = \{A_1, A_2, \ldots, A_t\}$, where $A_k$ is the $k$-th categorical attribute, $1 \le k \le t$. In addition, an unlabeled data set $U = \{p_{(U,1)}, \ldots, p_{(U,m)}\}$ is given, where $p_{(U,j)}$ is the $j$-th data point in $U$. Without loss of generality, $U$ contains the same attribute set $A$. Based on the foregoing, the objective of MARDL can be stated as "to decide the most appropriate cluster label $L_i$ for each data point in $U$."

  Cluster c1:      Cluster c2:      Cluster c3:      Unlabeled set U:
  A1 A2 A3         A1 A2 A3         A1 A2 A3         A1 A2 A3
  a  m  c          c  f  a          c  m  c          a  m  c
  b  m  b          c  m  a          c  f  b          c  m  a
  c  f  c          c  f  a          b  m  b          b  f  b
  a  m  a          a  f  b          b  m  c          a  f  c
  a  m  c          b  m  a          a  f  a          ...

Figure 2. An example dataset with three clusters and several unlabeled data points.

Figure 2 shows an example of this problem. There are three clusters $c_1$, $c_2$, and $c_3$, and the attribute set $A$ has three attributes, $A_1$, $A_2$, and $A_3$. The task of data labeling is to give each unlabeled data point in $U$ the most appropriate cluster label, i.e., one of $L_1$, $L_2$, or $L_3$.

For ease of presentation, we first define the node as follows.

DEFINITION 1 (Node): A node, $d$, is defined as an attribute name together with an attribute value.

The term node, defined to represent an attribute value in this paper, avoids the ambiguity which might be caused by identical attribute values. If two different attributes share the same attribute value, e.g., an age in the range 50~59 and a weight in the range 50~59, the attribute value 50~59 is confusing once separated from its attribute name. The nodes [age=50~59] and [weight=50~59] avoid this ambiguity. Note that if the attribute name and the attribute value are both the same in nodes $d_1$ and $d_2$, then $d_1$ and $d_2$ are said to be equal. For example, in cluster $c_1$ of Figure 2, [A1=a] and [A2=m] are nodes.
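As a small illustration of Definition 1 (the helper name is ours), a data point can be decomposed into its nodes by pairing each attribute name with its value, which keeps [age=50~59] and [weight=50~59] distinguishable:

```python
def decompose(point, attribute_names):
    """Decompose a data point (a tuple of t attribute values) into t nodes."""
    return [(name, value) for name, value in zip(attribute_names, point)]

# The first data point of cluster c1 in Figure 2:
print(decompose(("a", "m", "c"), ("A1", "A2", "A3")))
# [('A1', 'a'), ('A2', 'm'), ('A3', 'c')]
```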

3.2 Node Importance Representative

We next describe the novel categorical cluster representative named NIR (standing for Node Importance Representative). The basic idea behind NIR is to represent a cluster as the distribution of the nodes defined in Definition 1. Moreover, in order to measure the representability of each node in a cluster, the importance of a node is evaluated based on the following two concepts: (1) a node is important in a cluster when the frequency of the node is high in this cluster; (2) a node is important in a cluster if the node appears prevalently in this cluster rather than in other clusters. The first concept characterizes the importance of the node within the cluster. The rationale for adopting the second concept can be explained by Figure 3, where the distribution of one attribute over two clusters is given. Node b is the most frequent node in cluster 1. However, among all the data points which contain node b, only around 40% belong to cluster 1. In contrast, although node c is less frequent than node b in cluster 1, node c occurs mostly in cluster 1. Considering only the first concept would make the importance of a node high simply because the node is frequent in the database; the representability of the node in this cluster would likely be overestimated because the other clusters also contain this node with high frequency. Consequently, both concepts should be employed to evaluate the importance of a node.

Figure 3. An example of an attribute's distribution over two clusters, where each bar corresponds to a node.

Note that the criteria of a good clustering are high intra-cluster similarity, where the sum of distances between objects in the same cluster is minimized, and low inter-cluster similarity, where the distances between different clusters are maximized. Suppose that there is a node with high frequency in a cluster. This means that most of the data points in the cluster contain this node, and the intra-cluster similarity will be high. Hence, the first concept considers the distribution of the node within the cluster, which can be deemed the intra-cluster similarity. In addition, suppose that a node occurs in one cluster and does not appear in other clusters. This means that most of the data points which contain this node occur only in this cluster, and the distances between different clusters will be large. Hence, the second concept considers the distribution of the node between clusters, which can be deemed the inter-cluster similarity. Therefore, NIR represents a cluster by nodes and the importance of nodes, which considers both the intra-cluster similarity and the inter-cluster similarity.

Figure 4. The concept of NIR to represent a cluster.

As shown in Figure 4, the cluster is represented by NIR. The ellipses on the right side of Figure 4 illustrate the nodes in the cluster, and the importance of each node is indicated by the size of its ellipse. After the process of cluster analysis, a cluster of data points is represented by NIR. To achieve this, the theory of the NIR technique is presented below.

According to Definition 1, each data point can be decomposed into a set of nodes. Note that the number of nodes in that set is $t$ because each data point consists of $t$ attributes. For example, in Figure 2, the data point $p_{(1,1)}$ in the first row of the cluster $c_1$ can be decomposed into the set {[A1=a], [A2=m], [A3=c]}, which contains three nodes.

Based on the foregoing, cluster $c_i$ can be represented by nodes. Each data point in the cluster $c_i$ is first decomposed into nodes, and then the frequency of each node in the cluster is calculated. A node decomposed from a data point may be equal to a node decomposed from previous data points; in such cases, the frequency of this node is increased by one. After all the data points in the cluster $c_i$ are decomposed into nodes, suppose that $c_i$ contains $r$ distinct nodes, each node $d_k$ which occurs in the cluster $c_i$ is abbreviated by $d_{ik}$, and the frequency of node $d_{ik}$ is $|d_{ik}|$. Then, the node importance and NIR are defined as follows.

DEFINITION 2 (Node Importance and NIR): The node importance of the node $d_{ik}$ is calculated by the following equations:

$$w(c_i, d_{ik}) = f(d_{ik}) \cdot \frac{|d_{ik}|}{\sum_{x=1}^{r} |d_{ix}|} \qquad (1)$$

$$f(d_{ik}) = 1 - \frac{1}{\log n} \left( -\sum_{y=1}^{n} p(d_{yk}) \log p(d_{yk}) \right), \quad \text{where } p(d_{yk}) = \frac{|d_{yk}|}{\sum_{z=1}^{n} |d_{zk}|} \qquad (2)$$

and the NIR of cluster $c_i$ can be represented as a table of the pairs $(d_{ik}, w(c_i, d_{ik}))$ for all the nodes, i.e., $d_{i1}, d_{i2}, \ldots, d_{ir}$, in the cluster $c_i$.

$w(c_i, d_{ik})$ represents the importance of node $d_{ik}$ in cluster $c_i$, scaled by the weighting function $f(d_{ik})$. Following the two concepts of node importance, the probability of $d_{ik}$ in $c_i$ is calculated from the frequency of $d_{ik}$ in the cluster $c_i$, and the weighting function is designed to measure the distribution of the node between clusters based on information theory [16]. Entropy is the measurement of information and uncertainty of a random variable. Formally, if $X$ is a random variable, $S(X)$ is the set of values which $X$ can take, and $p(x)$ is the probability function of $X$, then the entropy $E(X)$ is defined as shown in Eq. (3).

$$E(X) = -\sum_{x \in S(X)} p(x) \log(p(x)) \qquad (3)$$

The entropy $E(X)$ attains its maximum when the random variable $X$ has the uniform distribution, which means that $X$ possesses maximal uncertainty, i.e., minimum information, when we obtain a value of $X$. The weighting function $f(d_{ik})$ measures the entropy of the node across clusters.

Suppose that there is a node which occurs uniformly in all clusters. Then this node, which carries maximal uncertainty, provides the fewest clustering characteristics and should therefore have a small weight. Note that Eq. (2) normalizes the entropy of the node to the range from zero to one by dividing by $\log n$, because the entropy of a node originally ranges from zero to $\log n$. After normalization, Eq. (2) subtracts the normalized entropy from one so that a node with large entropy obtains a small weight. Eq. (1) multiplies the probability of $d_{ik}$ in $c_i$ by the weight of the node $d_k$ to obtain the importance of the node $d_{ik}$ in cluster $c_i$.

Example 1: Consider the data set in Figure 2. Cluster $c_1$ contains eight nodes ([A1=a], [A1=b], [A1=c], [A2=m], [A2=f], etc.). The node [A1=a] occurs three times in $c_1$ ($|d_{1,[A1=a]}| = 3$), once in $c_2$, and once in $c_3$. The weight of the node [A1=a] is $f(d_{[A1=a]}) = 1 - \frac{1}{\log 3}\left(-\frac{3}{5}\log\frac{3}{5} - \frac{1}{5}\log\frac{1}{5} - \frac{1}{5}\log\frac{1}{5}\right) = 0.135$. The importance of node [A1=a] in cluster $c_1$ is $w(c_1, [A1=a]) = 0.135 \times \frac{3}{15} = 0.027$. Note that in the cluster $c_1$, node [A3=c] also occurs three times. However, this node does not occur in $c_2$ but occurs twice in $c_3$. Therefore, in cluster $c_1$, the node [A3=c] is more significant than node [A1=a]. In terms of node importance, $w(c_1, [A3=c]) = f(d_{[A3=c]}) \times \frac{3}{15} = 0.387 \times \frac{3}{15} = 0.077 > w(c_1, [A1=a]) = 0.027$. Although these two nodes both occur three times in cluster $c_1$, node [A3=c] provides more information about cluster $c_1$ than node [A1=a].

Finally, the NIR of cluster $c_i$ can be represented as a table of the pairs $(d_{ik}, w(c_i, d_{ik}))$ for all the nodes in the cluster $c_i$. The table in Figure 5 shows the NIR of the three clusters in Figure 2.

  Cluster c1:          Cluster c2:          Cluster c3:
  node     w(c1,.)     node     w(c2,.)     node     w(c3,.)
  [A1=a]   0.027       [A1=a]   0.009       [A1=a]   0.009
  [A1=b]   0.004       [A1=b]   0.004       [A1=b]   0.007
  [A1=c]   0.005       [A1=c]   0.016       [A1=c]   0.011
  [A2=m]   0.009       [A2=m]   0.005       [A2=m]   0.007
  [A2=f]   0.005       [A2=f]   0.016       [A2=f]   0.011
  [A3=a]   0.014       [A3=a]   0.056       [A3=a]   0.014
  [A3=b]   0.004       [A3=b]   0.004       [A3=b]   0.007
  [A3=c]   0.077                            [A3=c]   0.052

Figure 5. The NIR tables of clusters c1, c2, and c3 in Figure 2.
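The NIR construction can be sketched in Python as follows. This is a minimal sketch under our own helper names (the paper prescribes only Eqs. (1)-(2) and the hash-table implementation discussed in Section 3.4); natural logarithms are used, which is immaterial since Eq. (2) is normalized:

```python
import math
from collections import Counter

def build_nir(clusters, attribute_names):
    """clusters: a list of clusters, each a list of data-point tuples.
    Returns one {node: importance} table per cluster (Definition 2)."""
    n = len(clusters)
    # Frequency |d_ik| of every node in every cluster (Definition 1).
    freqs = [Counter((a, v) for point in cluster
                     for a, v in zip(attribute_names, point))
             for cluster in clusters]
    totals = [sum(f.values()) for f in freqs]  # sum_x |d_ix| per cluster

    # Weight f(d_k): one minus the node's normalized entropy across clusters, Eq. (2).
    weights = {}
    for node in set().union(*freqs):
        counts = [f[node] for f in freqs]
        total = sum(counts)
        entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
        weights[node] = 1.0 - entropy / math.log(n)

    # Importance w(c_i, d_ik) = f(d_ik) * |d_ik| / sum_x |d_ix|, Eq. (1).
    return [{node: weights[node] * count / totals[i] for node, count in f.items()}
            for i, f in enumerate(freqs)]
```

Running build_nir on the three clusters of Figure 2 reproduces the values of Example 1 and Figure 5, e.g., 0.027 for [A1=a] and 0.077 for [A3=c] in cluster c1.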

3.3 Maximal Resemblance Data Labeling

The goal of MARDL, MAximal Resemblance Data Labeling, is to decide the most appropriate cluster label $L_i$ for an unlabeled data point. Specifically, suppose that an unlabeled data point $p_{(U,j)}$ is given. MARDL computes the similarity $S(c_i, p_{(U,j)})$ between $p_{(U,j)}$ and each cluster $c_i$, $1 \le i \le n$, and finds the cluster which attains $\max_i S(c_i, p_{(U,j)})$. The similarity between $p_{(U,j)}$ and $c_i$ can be obtained in light of the concept of calculating the similarity between a query string and a document in the vector-space model mentioned before [1]. The cluster represented by NIR can be mapped to a node vector, which is similar to the term vector used in the vector-space model to describe a document. Moreover, the unlabeled data point can be seen as a query string which consists of nodes. As a result, in MARDL, the similarity between $p_{(U,j)}$ and $c_i$ can be deemed the similarity between a query string and a document. In view of the above, this similarity, referred to as resemblance in this paper, is defined below.

DEFINITION 3 (Resemblance and Maximal Resemblance): Given an unlabeled data point $p_{(U,j)}$ and the NIR table of cluster $c_i$, the resemblance is defined by the following equation:

$$R(p_{(U,j)}, c_i) = \sum_{x=1}^{t} w(c_i, d_{ix}) \qquad (4)$$

where each $d_{ix}$ is an entry in the NIR table of cluster $c_i$ corresponding to a node of $p_{(U,j)}$.

The value of the resemblance $R(p_{(U,j)}, c_i)$ can be obtained directly by summing up, in the NIR table of the cluster $c_i$, the importances of the nodes decomposed from the unlabeled data point $p_{(U,j)}$. This sum of node importances captures how similar the unlabeled data point is to the cluster based on the nodes it contains. When an unlabeled data point contains nodes which are more important in cluster $c_i$ than in cluster $c_{i'}$, $R(p_{(U,j)}, c_i)$ will be larger than $R(p_{(U,j)}, c_{i'})$.

Finally, an unlabeled data point $p_{(U,j)}$ is labeled to the cluster which obtains the maximal resemblance. The decision function is defined by Eq. (5):

$$\text{Label} = \arg\max_{c_i} R(p_{(U,j)}, c_i) \qquad (5)$$

Since we measure the similarity between the unlabeled data point $p_{(U,j)}$ and the cluster $c_i$ as $R(p_{(U,j)}, c_i)$, the cluster with the maximal resemblance is the most appropriate cluster for the unlabeled data point.

Example 2: Consider the example shown in Figure 2. The first row of the unlabeled data set, $p_{(U,1)}$, can be decomposed into three nodes: {[A1=a], [A2=m], [A3=c]}. The resemblance of data point $p_{(U,1)}$ and cluster $c_1$, $R(p_{(U,1)}, c_1)$, is calculated as follows:

$$R(p_{(U,1)}, c_1) = w(c_1, [A1=a]) + w(c_1, [A2=m]) + w(c_1, [A3=c]) = 0.027 + 0.009 + 0.077 = 0.113$$

$R(p_{(U,1)}, c_2)$ and $R(p_{(U,1)}, c_3)$ can be computed analogously, with the NIR tables of $c_2$ and $c_3$ providing the node importances in $c_2$ and $c_3$. After looking up the NIR tables shown in Figure 5, $R(p_{(U,1)}, c_2) = 0.014$ and $R(p_{(U,1)}, c_3) = 0.068$. According to Eq. (5), the first unlabeled data point $p_{(U,1)}$ is allocated to cluster $c_1$ because cluster $c_1$ obtains the maximal resemblance.
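The data labeling phase thus reduces to table lookups and an arg-max. A minimal sketch, again under our own names and reusing build_nir from the sketch above:

```python
def resemblance(point, nir_table, attribute_names):
    """R(p, c_i): sum the cluster's importances over the point's nodes, Eq. (4).
    Nodes absent from the cluster's NIR table contribute zero."""
    return sum(nir_table.get((a, v), 0.0)
               for a, v in zip(attribute_names, point))

def label(point, nir_tables, attribute_names):
    """Allocate the point to the cluster of maximal resemblance, Eq. (5)."""
    scores = [resemblance(point, table, attribute_names) for table in nir_tables]
    return max(range(len(scores)), key=scores.__getitem__)
```

For the first unlabeled point of Figure 2, ("a", "m", "c"), the scores against c1, c2, and c3 are 0.113, 0.014, and 0.068, so the point is labeled with cluster c1, matching Example 2.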

3.4 Implementation and Complexity of MARDL

The algorithm MARDL is outlined below; MARDL is divided into two phases, the cluster analysis phase and the data labeling phase.

Algorithm MARDL: MARDL(C, U) // clustering result C, unclustered data set U

Procedure main(): The main procedure of MARDL
1. NIR hash table NTable = ClusterAnalysis(C);
2. DataLabeling(NTable, U);

Procedure ClusterAnalysis(C): analyze the input clustering result and return the NIR hash table
1. while has next tuple in C {
2.   read in data point p(i,j) from C;
3.   divide p(i,j) into nodes;
4.   update node frequencies in cluster c_i;
5. }
6. for each node d_i1 to d_ir
7.   compute weight f(d_ix);
8. for each cluster c_1 to c_n {
9.   for each node d_i1 to d_ir {
10.    calculate node importance w(c_i, d_ix);
11.    add (d_ix, w(c_i, d_ix)) into the NIR table NTable;
12.  }
13. }
14. return NTable;

Procedure DataLabeling(NTable, U): give each unclustered data point a cluster label
1. while has next tuple in U {
2.   read in data point p(U,j) from U;
3.   divide p(U,j) into nodes;
4.   for each cluster c_1 to c_n
5.     calculate resemblance R(p(U,j), c_i);
6.   find the cluster c_m with the maximal resemblance;
7.   give label c_m to p(U,j);
8. }

The next two paragraphs present several design issues in these two phases.

The main purpose of the cluster analysis phase is to represent the prior clustering result with NIR. NIR represents a cluster by a table which contains all the pairs of a node and its node importance. For better execution efficiency, a hash table can be applied to this representative table [15]. In a well-designed implementation of hash tables, all of these operations have a time complexity of O(1). Since node names are never repeated, the node is suitable as a hash key for efficient execution.

The main purpose of the data labeling phase is to decide the most appropriate cluster label for each unlabeled data point. Each unlabeled data point is labeled with, and thereby classified to, the cluster which attains the maximal resemblance. The resemblance value for a specific cluster is computed efficiently as the sum of the node importances, obtained by looking up the NIR hash table $t$ times. After all the resemblance values are computed and recorded, the maximal resemblance value is found, and the unlabeled data point is labeled to the cluster which obtains the maximal resemblance value. Note that after the data labeling phase is executed, a labeled data point merely obtains a cluster label and is not actually added to the cluster. Therefore, the NIR table is not modified in the data labeling phase. This is because the MARDL framework does not cluster data, but rather applies the original clustering characteristics to the incoming unlabeled data points.

Time Complexity of the Cluster Analysis Phase: The time complexity of the cluster analysis phase is $O(t \cdot |C|)$ because the main procedure in this phase is to decompose each data point in $C$ into $t$ nodes, where $|C|$ is the number of data points in $C$. $t \cdot |C|$ is also the number of nodes in the worst case. In fact, the number of distinct nodes bounds the execution time because the hash-table efficiency depends on the size of the hash table. In practice, the number of distinct nodes is much less than $t \cdot |C|$ because nodes usually occur repeatedly.

Time Complexity of the Data Labeling Phase: The time complexity of the data labeling phase is $O(t \cdot n \cdot |U|)$, where $|U|$ is the number of data points in $U$. This is because each unlabeled data point is divided into $t$ nodes, and the resemblance value of each unlabeled data point has to be calculated for each of the $n$ clusters to find the maximal resemblance. As a consequence, the overall complexity of MARDL is $O(t \cdot |C|) + O(t \cdot n \cdot |U|)$.

4 Experimental Results

In this section, we demonstrate the scalability and accuracy of MARDL. In Section 4.1, the test environment and the data sets used in this study are described. Section 4.2 presents the evaluation of the efficiency and scalability of MARDL, and Section 4.3 presents the evaluation of its accuracy.

4.1 Test Environment and Data Sets

The experiments are conducted on a PC with an Intel Pentium 4 2.0 GHz processor and 512 MB of memory running the Windows 2000 Server operating system. In all experiments, random sampling is used for data sampling, and the EM clustering algorithm [4] is chosen to cluster the sampled data set. We compare MARDL with the ROCK data labeling phase [7] in both the scalability and accuracy evaluations.

Synthetic data sets: The synthetic data sets are used for the scalability evaluation. We generate synthetic data sets by varying the number of data points from 10K to 100K and the dimensionality in the range of 10 to 50. Each dimension has 20 attribute values in all synthetic data sets.

Real data sets: The real data sets are used for the accuracy evaluation. We employ the following three real data sets.

Mushroom data set: The Mushroom data set is obtained from the UCI Repository [8]. Each data point describes the physical characteristics of a single mushroom, together with a poisonous-or-edible field (which is not used for clustering). All of the twenty-two attributes are categorical, and the set contains 8,124 data points in total (4,208 edible mushrooms and 3,916 poisonous ones).

Primate splice-junction gene sequences (DNA) data set: The DNA data set is also obtained from the UCI Repository [8]. Each data point is a window of 60 DNA sequence elements centered on a candidate splice-junction position, together with an intron/exon/neither field (which is not used for clustering). All of the sixty attributes are categorical, and the data set contains 3,190 data points (768 intron, 767 exon, and 1,655 neither).

University student data set: Each data point in this data set describes the information of a freshman at the university. All of the nine attributes are categorical, and the set contains 17,773 data points.

4.2 Evaluation on Efficiency and Scalability

Figure 6 shows the scalability of MARDL with the data size. This study fixes the dimensionality to 10 and the number of clusters to 5, and varies the data size from 10K to 100K. The line named EM in Figure 6 is the result of clustering the entire data set, and the execution times of the MARDL and ROCK lines comprise both the time of clustering the sampled data set and the time of labeling the remaining data points. Note that the time axis is on a log scale. It can be seen that MARDL is linear with respect to the data size, and MARDL saves significant execution time compared to running EM clustering on the entire data set. Moreover, we compare different sample sizes: Figure 6(a) uses a 1% sample size, and Figure 6(b) samples 5% of the entire database. The execution time of MARDL does not increase when the sample size increases. In contrast, the execution time of ROCK data labeling increases 5 times when the sample size increases from 1% to 5%. This is because MARDL is linear with respect to the sizes of the sampled data set and the unlabeled data set. This feature is important because, when the clustering result is not good at a low sampling rate, increasing the sampling rate is the typical way to enhance the clustering quality. Consequently, MARDL ensures efficient execution whatever sample size is chosen.

Figure 6. Execution time comparison among MARDL, ROCK, and EM: scalability with data size.

Figure 7. Execution time comparison between MARDL and ROCK: scalability with data dimensionality.

Figure 7 shows the scalability of MARDL with the data dimensionality. We fix the data size to 30K and the number of clusters to 5, and vary the number of dimensions from 10 to 50. MARDL is also linear with respect to the data dimensionality. In addition, the sampling rate has little influence on the execution time of MARDL, showing the robustness of MARDL.

4.3 Evaluation on Accuracy

In this evaluation, the accuracy of MARDL is compared against the result of clustering the entire database. First, we perform EM clustering on the entire database to obtain the reference clustering result. Then, the framework presented in this paper is adopted; the only difference within this evaluation is whether MARDL or the ROCK data labeling phase is used to label the unlabeled data points. We compute the accuracy of the labeling result, where the accuracy is calculated as the number of correct allocations in the unlabeled data set divided by the total size of the unlabeled data set.

                 1%             5%             10%
                 MARDL  ROCK    MARDL  ROCK    MARDL  ROCK
  Mushroom  n=2  0.84   0.80    0.93   0.92    0.94   0.93
            n=5  0.76   0.75    0.93   0.93    0.93   0.93
  School    n=3  0.78   0.79    0.93   0.93    0.95   0.95
            n=5  0.58   0.57    0.68   0.68    0.72   0.73
  DNA       n=3  0.76   0.72    0.88   0.87    0.91   0.90
            n=5  0.64   0.66    0.86   0.82    0.90   0.89

Table 1. The accuracy comparison of MARDL and ROCK data labeling at 1%, 5%, and 10% sample sizes.

The results of the accuracy comparison are shown in Table 1. The three real data sets are clustered with the EM algorithm under different numbers of clusters n and different sample sizes. Each value in Table 1 is the average of fifty experiments, and the parameter in ROCK data labeling is adjusted to its best accuracy result. The study shows that the quality of MARDL and that of ROCK data labeling are very close to each other. In addition, the accuracy is mostly more than 85% even when we sample just 5% of the entire database to perform clustering, indicating the merit of MARDL.
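As a sketch of this accuracy measure (the names are ours, and it assumes the correspondence between the clusters of the full-database result and those of the sampled result has already been established):

```python
def labeling_accuracy(assigned_labels, reference_labels):
    """Fraction of unlabeled points whose assigned cluster label matches the
    label obtained by clustering the entire database."""
    matches = sum(a == r for a, r in zip(assigned_labels, reference_labels))
    return matches / len(reference_labels)
```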

5 Conclusions

In this paper, we proposed MARDL to allocate each unlabeled data point into the appropriate cluster when a sampling technique is utilized to cluster a very large categorical database. In addition, we developed a categorical cluster representative technique, named NIR, to represent the clusters obtained from the sampled data set by the distribution of their nodes. The experimental evaluation validates our claim that MARDL is of linear time complexity with respect to the data size, and that MARDL preserves the clustering characteristics of high intra-cluster similarity and low inter-cluster similarity. It is shown that MARDL is significantly more efficient than prior works while attaining results of high quality.

Acknowledgements

The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contract NSC93-2752-E-002-006-PAE.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

[2] P. Berkhin. Survey of Clustering Data Mining Techniques. Technical report, Accrue Software, 2002.

[3] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling Clustering Algorithms to Large Databases. In Proc. of Knowledge Discovery and Data Mining, 1998.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 1977.

[5] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. In Proc. of ACM SIGKDD, 1999.

[6] S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proc. of the ACM SIGMOD Conf., 1998.

[7] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In Proc. of the 15th ICDE, 1999.

[8] S. Hettich, C. L. Blake, and C. J. Merz. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/mlearn/mlrepository.html, 1998.

[9] Z. Huang. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 1998.

[10] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[11] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.

[12] M. R. Garey, D. S. Johnson, and H. S. Witsenhausen. The Complexity of the Generalized Lloyd-Max Problem. IEEE Trans. Inf. Theory, 1982.

[13] N. Mishra, D. Oblinger, and L. Pitt. Sublinear Time Approximate Clustering. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms, 2001.

[14] R. T. Ng and J. Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 2002.

[15] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 2003.

[16] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 1948.
