Chapter 2.  Background

2.2.  Clustering Algorithms

Clustering can be thought of as a kind of classification method. When data items share similar properties, clustering methods can be used to explore the data and to group similar items together under certain criteria. A clustering example is illustrated in Figure 2-8. Clustering is well studied in the literature, and many different algorithms have been proposed. This section discusses some commonly used ones.

Figure 2-8 A clustering example: (a) data points, (b) clustering result

2.2.1. K-MEANS CLUSTERING

K-means is a simple and fast clustering algorithm, originally proposed in [8]. Its main idea is to iteratively minimize the variance within each cluster. First, k centroids are initialized to represent the cluster centers. Then each data point is assigned to the cluster whose centroid is nearest to it. Finally, the mean of each cluster is computed and becomes that cluster's new centroid. The process is repeated until the positions of the centroids converge. The detailed steps of the k-means algorithm are as follows:

1. In the data space, choose k points as the initial centroids of clusters.

2. Assign each data point to the cluster whose centroid is closest to it.

3. Recalculate the k centroids by averaging the data points assigned to each cluster.

4. Repeat Step 2 and Step 3 until the centroids no longer move appreciably. The resulting assignment is the final clustering.
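The four steps above can be sketched in Python with NumPy. This is a minimal illustration, not the implementation from [8]. The function name `kmeans`, its parameters (`init`, `max_iter`, `tol`, `seed`), the use of Euclidean distance, and the default of drawing the initial centroids from the data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, init=None, max_iter=100, tol=1e-6, seed=0):
    """Basic k-means following Steps 1-4 (illustrative helper, not from the source)."""
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids. By default, pick k distinct data
    # points at random (an assumption; the text does not fix a strategy).
    if init is None:
        centroids = x[rng.choice(len(x), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float).copy()

    def assign(c):
        # Step 2: index of the nearest centroid for every point (Euclidean).
        dists = np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop when the centroids are almost fixed.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, assign(centroids)
```

A call such as `kmeans(data, k=2)` returns the final centroids together with one cluster index per data point. Passing `init` fixes the starting centroids, which makes runs reproducible.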

The advantages of the k-means method are its simplicity, ease of implementation, and low computational cost. However, it has several disadvantages. In particular, it is very sensitive to the choice of the initial centroids: each iteration reduces the intra-cluster variance, but the procedure is only guaranteed to reach a local minimum, not the global one. Whether the global minimum is reached depends on an appropriate selection of the initial centroids. An example of k-means clustering is shown in Figure 2-9.
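The sensitivity to initialization can be seen on a tiny constructed data set. In the sketch below, four points sit at the corners of a 4-by-1 rectangle, and the same iteration is started from two different pairs of centroids. The helper `lloyd`, the data, and the starting centroids are hypothetical and chosen only for illustration. The quantity compared is the sum of squared distances from each point to its centroid.

```python
import numpy as np

def lloyd(points, centroids, n_iter=50):
    """Run k-means iterations from given initial centroids (illustrative helper)."""
    x = np.asarray(points, dtype=float)
    c = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        d = np.linalg.norm(x[:, None] - c[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each non-empty cluster's centroid as the mean of its points.
        for j in range(len(c)):
            if np.any(labels == j):
                c[j] = x[labels == j].mean(axis=0)
    d = np.linalg.norm(x[:, None] - c[None], axis=2)
    labels = d.argmin(axis=1)
    sse = float((d.min(axis=1) ** 2).sum())  # sum of squared distances
    return labels, sse

corners = [[0, 0], [4, 0], [0, 1], [4, 1]]
labels_good, sse_good = lloyd(corners, [[0, 0.5], [4, 0.5]])  # left/right start
labels_bad, sse_bad = lloyd(corners, [[2, 0], [2, 1]])        # top/bottom start
print(sse_good, sse_bad)  # prints: 1.0 16.0
```

Both runs are stable under further iterations, yet the second one stops at a much larger sum of squared distances. It is a local minimum that the algorithm cannot leave by itself.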

Figure 2-9 An example of the process of K-means clustering: (a) centroid initialization, (b)-(d) iterations (centroid recalculation)

2.2.2. FUZZY C-MEANS CLUSTERING

The fuzzy c-means clustering technique [9] is similar to k-means, but it allows a data point to belong to more than one cluster; this is why it is called fuzzy. Figure 2-10 illustrates the difference between k-means clustering and fuzzy c-means clustering.

Here we consider 1-D data points and two clusters (red and green). With k-means clustering, each data point belongs to exactly one cluster, as shown in Figure 2-10 (a). With fuzzy c-means clustering, however, each data point can belong to more than one cluster, with a different degree of membership in each, as shown in Figure 2-10 (b).

Figure 2-10 Comparison of (a) k-means clustering and (b) fuzzy c-means clustering

The objectives of fuzzy c-means clustering and k-means clustering are the same: both seek the clusters that minimize the cluster variances. Like k-means, fuzzy c-means needs an initial condition and then iteratively updates the cluster centers. The difference is that fuzzy c-means directly initializes the degrees of membership of the data points in each cluster and updates them in each iteration. The detailed fuzzy c-means algorithm is described as follows:

1. Initialize u_ij, the degree of membership of data point x_i in cluster j, to a number between 0 and 1.
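As a sketch of how the membership degrees and cluster centers can be updated in turn, the following Python code uses the commonly cited fuzzy c-means update rules. These rules, the fuzzifier `m = 2`, the row-normalized random initialization, and the stopping tolerance are assumptions of this example and are not taken from the text above. The function name `fuzzy_c_means` is likewise hypothetical.

```python
import numpy as np

def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means sketch using the commonly cited update rules (assumed)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(points, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat 1-D input as n points with one feature
    # Initialize u[i, j], the degree of point i in cluster j, in [0, 1];
    # each row is normalized so that a point's degrees sum to 1 (assumption).
    u = rng.random((len(x), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Cluster centers: means of the points weighted by u ** m.
        w = u ** m
        centers = (w.T @ x) / w.sum(axis=0)[:, None]
        # Distances from every point to every center (floored to avoid 0).
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)
        # New degrees: closer centers receive larger membership.
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        change = np.abs(new_u - u).max()
        u = new_u
        if change < tol:  # degrees have almost stopped changing
            break
    return centers, u
```

For 1-D data such as `[0, 0.1, 0.2, 5, 5.1, 5.2]` with `c=2`, each row of the returned `u` gives one point's degrees of membership in the two clusters. Taking the larger degree in each row recovers a hard assignment comparable to k-means.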

Although fuzzy c-means clustering requires more computation than k-means clustering, it can usually find a better solution. However, it shares some of the problems of k-means clustering. For example, it can only find a local minimum, and the clustering result is also sensitive to the initialization of the membership degrees.

2.2.3. HIERARCHICAL CLUSTERING

Unlike k-means clustering and fuzzy c-means clustering, the hierarchical clustering algorithm [10] does not require the number of clusters to be set in advance. Whereas k-means and fuzzy c-means partition the data, this method is based on merging: it initially treats each data point as its own cluster and then gradually merges clusters until a proper set of clusters is reached. Figure 2-11 illustrates a simple merging process, in which each creature is a data point and the six creatures are gradually grouped into clusters.

Figure 2-11 An example of the merging process

The detailed steps of the hierarchical clustering algorithm are as follows:

1. Treat each data point as a cluster. Define the distance between each pair of clusters.

2. Find the pair of clusters with the smallest distance.

3. Merge the closest pair of clusters into a new cluster. The number of clusters decreases by one.

4. Repeat Step 2 and Step 3 until the number of clusters is reduced to the desired value.
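The merging steps above can be sketched as a naive Python routine. The text does not say how the distance between two clusters is defined, so this example assumes single linkage (the distance between the two closest members) with Euclidean distance between points. The function name `agglomerative` and its parameter `n_clusters` are hypothetical.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Naive bottom-up clustering following Steps 1-4 (illustrative helper)."""
    x = np.asarray(points, dtype=float)  # shape (n_points, n_features)
    # Step 1: every point starts as its own cluster (a list of point indices).
    clusters = [[i] for i in range(len(x))]
    pairwise = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

    def linkage(a, b):
        # Single linkage (assumption): distance between the closest members.
        return pairwise[np.ix_(a, b)].min()

    while len(clusters) > n_clusters:
        # Step 2: find the pair of clusters with the smallest distance.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Step 3: merge the pair; the number of clusters decreases by one.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
        # Step 4: the loop repeats until n_clusters clusters remain.
    return clusters
```

The routine contains no random choices, so repeated runs on the same data give the same clusters. Each pass scans every pair of clusters, which makes this naive version costly for large data sets.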

Generally, the hierarchical clustering method adapts better to the characteristics of the data. It does not require the number of clusters to be assigned in advance, and it always reaches the same result for the same data. However, it has a major problem: its high computational cost, with a complexity of at least O(n^2). Moreover, because it works by merging, it cannot undo a merge that has already been made.

(Figure 2-11 data points: Cat, Lion, Dog, Wolf, Human, Orangutan)
