Numerous data analysis techniques, such as regression and principal component analysis (PCA), have high time or space complexity and are thus impractical for large datasets [6, 29].
Therefore, instead of applying such techniques directly to the entire dataset, researchers adopt cluster analysis and apply these techniques to each cluster, which consists of only a portion of the original data. Depending on the type of cluster analysis, the number of clusters, and the accuracy with which the clusters represent the data, the results can be comparable with those that would have been obtained by using all data. Cluster analysis techniques have recently been applied to microarray data, image analysis, and marketing science [13, 31].
Cluster analysis [11] is a core issue in data mining with innumerable applications spanning many fields. In order to mathematically identify clusters in a data set, it is usually necessary to first define a measure of similarity or proximity which will establish a rule for assigning patterns to a particular cluster. The measure of similarity is usually data dependent.
Clustering aims to optimize a cost function that is defined over all possible groupings.
Moreover, the cost function depends on the manner in which the data are decomposed and has little meaning for a single item [20]. In this technique, the collected data are divided into clusters so that the behavior patterns of the system are shown effectively. In other words, patterns in the same group are similar in some sense and patterns in different groups are dissimilar in the same sense [4, 5]. In terms of analysis of variance (ANOVA), the within-group variance is low and the between-group variance is high. Here, “variance” means the sample variance among all possible linear combinations of observations [8]. We will apply this property to the proposed methods, in which PCA is employed as the data analysis technique for image coding. In this study, we adopt the K-means algorithm [10, 28, 39] proposed by MacQueen (1967) to minimize the sum of the squared distances from each data point to its cluster center.
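As an illustration of the cost that K-means minimizes, the following is a minimal sketch of the standard Lloyd iteration, not the implementation used in this thesis. It assumes squared Euclidean distance, random initialization from the data points, and NumPy; the function name `kmeans` and its parameters are chosen here for illustration only.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's iteration on the rows of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and cost: sum of squared distances to the own center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    cost = d2[np.arange(len(X)), labels].sum()
    return labels, centers, cost
```

The returned `cost` is the within-cluster sum of squared distances, which does not increase from one iteration to the next; the result still depends on the initial centers.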
The K-means algorithm is a popular clustering method because it can group huge datasets efficiently. In addition, the subtractive clustering method is a fast clustering algorithm designed for high-dimensional datasets with a moderate number of data points, because its computational cost grows linearly with the data dimension and quadratically with the number of data points. Subtractive clustering is an extension of the grid-based mountain clustering method proposed by Yager and Filev [40].
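The linear-in-dimension, quadratic-in-points cost can be seen in a minimal sketch of the commonly used potential-based formulation of subtractive clustering. This is a simplified illustration under stated assumptions: the number of centers is fixed in advance instead of being decided by acceptance and rejection thresholds, the radii `ra` and `rb` are hypothetical defaults, and the function name is chosen here for illustration.

```python
import numpy as np

def subtractive_centers(X, n_centers, ra=1.0, rb=None):
    """Pick cluster centers among the data points by density potential."""
    X = np.asarray(X, dtype=float)
    rb = 1.5 * ra if rb is None else rb  # assumed default ratio
    # Pairwise squared distances: n^2 pairs, each costing O(d).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Potential of each point: high when many points lie within radius ra.
    potential = np.exp(-4.0 * d2 / ra ** 2).sum(axis=1)
    chosen = []
    for _ in range(n_centers):
        c = int(potential.argmax())
        chosen.append(c)
        # Subtract the influence of the new center from all potentials.
        potential = potential - potential[c] * np.exp(-4.0 * d2[:, c] / rb ** 2)
    return X[chosen]
```

Because every candidate center is a data point, no grid over the input space is needed, which is what keeps the cost from growing exponentially with the dimension.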
Data reduction techniques aim to represent data efficiently [14, 15, 33]. One example is the Karhunen–Loeve Transform (KLT), in which a higher-dimensional input space is mapped to a lower-dimensional feature space through a linear transformation [19]. As an alternative approach to feature extraction in the n-dimensional space, PCA finds the m (m<n) basis components such that the projection onto the corresponding subspace has the largest variance [32]. In a similar fashion, PCA computes the covariance matrix of the input data after removing the mean. After solving for the eigenvalues of the covariance matrix, PCA extracts the eigenvectors corresponding to the largest eigenvalues [7, 16]. Dimension reduction is achieved by using the eigenvectors with the most significant eigenvalues, which form an orthogonal basis for a low-dimensional subspace. Every vector in the original space can be approximated by a corresponding vector in the subspace [9, 22]. Dimensionality reduction is frequently used as a pre-processing step in data mining. Selecting a smaller number of features plays a significant role in applications involving hundreds or thousands of features. Besides relevant features, there might be detrimental features, indifferent features, and redundant (dependent) ones. Removing these features not only makes the learning task easier by reducing the computational burden but also often improves the performance of the classifier [4, 5]. Such data reduction is applied to images to achieve image compression. In this work, we apply PCA separately to each cluster, which consists of some specified image blocks, to reconstruct the original (or input) image [34, 38].
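The covariance-based procedure described above can be sketched as follows. This is a minimal illustration with NumPy, not the coding scheme of this thesis; the function names `pca_fit` and `pca_reconstruct` are assumptions made for the example.

```python
import numpy as np

def pca_fit(X, m):
    """Return the mean, the top-m eigenvectors (as columns), and their eigenvalues."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                           # zero-mean data
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]   # indices of the m largest
    return mean, eigvecs[:, order], eigvals[order]

def pca_reconstruct(X, mean, W):
    """Project onto the subspace spanned by W and map back to the input space."""
    Y = (np.asarray(X, dtype=float) - mean) @ W  # m coefficients per vector
    return Y @ W.T + mean
```

For image coding, each row of `X` would be a vectorized image block, and only the `m` coefficients per block, the basis `W`, and the mean need to be stored.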
However, this approach is not feasible when the covariance matrix becomes too large to be evaluated. To overcome this problem, algorithms based on neural networks have been proposed. Neural principal component analysis was first proposed by Oja (1982) [23-26], who used a single neuron to extract the first principal component from the input. Oja's rule can be viewed as a modified version of the Hebbian (1949) rule, which is among the simplest forms of unsupervised learning. To extract more than one principal component, Sanger (1989) [30] proposed the generalized Hebbian algorithm (GHA), which extracts a specified number of principal components. In this thesis, we partition the training set into several clusters using the K-means and subtractive clustering methods, and then apply GHA to each of the clusters to achieve image compression.
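A minimal sketch of the GHA update (Sanger's rule) is given below to show how the principal components can be learned without forming the covariance matrix. The learning rate, number of epochs, and initialization scale are assumptions chosen for illustration and would need tuning for real image data.

```python
import numpy as np

def gha(X, m, lr=0.01, epochs=50, seed=0):
    """Estimate the first m principal directions (rows of W) with Sanger's rule."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # the rule assumes zero-mean input
    W = rng.normal(scale=0.1, size=(m, X.shape[1]))
    for _ in range(epochs):
        for x in Xc[rng.permutation(len(Xc))]:   # one sample at a time
            y = W @ x                            # outputs of the m neurons
            # Hebbian term minus a lower-triangular decorrelation term;
            # with m = 1 this reduces to Oja's single-neuron rule.
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```

Each update touches only the m x n weight matrix, so memory use does not depend on the size of the covariance matrix; the price is a stochastic estimate whose accuracy depends on the learning rate.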
The genetic algorithm (GA) [17, 18, 21], originally developed by Holland over the course of the 1960s and 1970s, is based on a biological analogy. In the selective breeding of plants or animals, for example, offspring are produced as a combination of the parent chromosomes according to certain characteristics that are determined at the genetic level. When the fitness landscape (or cost surface) of the problem is unclear or riddled with a large number of local optima, the GA usually has good searching capability because the candidate solutions are less likely to become stuck at local optima [27]. Some GA-based clustering algorithms, such as stochastic clustering algorithms based on GA, Simple GA (SGA), Hybrid Niching GA (HNGA), and multi-objective GA, are mentioned in [1]. In the latter study, these methods are considered able to find only compact, hyperspherical, equisized, and convex clusters like those detected by the K-means algorithm [2]. If clusters of different geometric shapes are present in the same dataset, the above methods will not be able to find all of them perfectly [3]. This thesis provides a preliminary study in this direction. The GA has been successfully applied to many fields of science and engineering [12].
In this thesis, we partition the dataset into several clusters, and the number of principal components used in PCA can vary from cluster to cluster. Conventionally, PCA is performed on the entire dataset obtained from an image to achieve image compression. To improve the quality of the reconstructed image, we use K-means to partition the dataset and then apply PCA independently to each cluster. In this method, different numbers of principal components are allowed, and we use GA to find the optimal number of principal components for each cluster.
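Cluster-wise PCA with a different number of components per cluster can be sketched as follows. This is an illustrative sketch only: the partition `labels` is assumed to be given (for example, by K-means), every cluster is assumed to be non-empty, and the function names are hypothetical.

```python
import numpy as np

def fit_cluster_pca(X, labels, n_components):
    """Fit one PCA model per cluster; n_components[j] is the count for cluster j."""
    X = np.asarray(X, dtype=float)
    models = []
    for j, m in enumerate(n_components):
        Xj = X[labels == j]
        mean = Xj.mean(axis=0)
        # Rows of Vt are the principal directions of the centered cluster.
        _, _, Vt = np.linalg.svd(Xj - mean, full_matrices=False)
        models.append((mean, Vt[:m].T))
    return models

def reconstruct(X, labels, models):
    """Code and decode every vector with the model of its own cluster."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j, (mean, W) in enumerate(models):
        idx = labels == j
        out[idx] = (X[idx] - mean) @ W @ W.T + mean
    return out
```

A cluster whose blocks lie close to a low-dimensional subspace can be coded with few components, which leaves more components for clusters with richer structure.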
In the proposed iterative clustering method for PCA image coding, we use the partition produced by K-means as the initial clustering. In each repartition iteration, every member of the dataset is moved to the cluster that yields the smallest coding error after the member joins it. The procedure stops when a pre-specified number of iterations is reached.
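The repartition loop can be sketched as below. This is a simplified illustration and not the exact procedure of this thesis: each vector is scored against the current PCA model of every cluster, without refitting the model as if the vector had already joined, and all models are refitted once per iteration. The function names and the handling of empty clusters are assumptions.

```python
import numpy as np

def fit_models(X, labels, n_components):
    """One (mean, basis) pair per cluster; None for an empty cluster."""
    models = []
    for j, m in enumerate(n_components):
        Xj = X[labels == j]
        if len(Xj) == 0:
            models.append(None)
            continue
        mean = Xj.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xj - mean, full_matrices=False)
        models.append((mean, Vt[:m].T))
    return models

def block_errors(X, models):
    """Squared coding error of every vector under every cluster model."""
    err = np.full((len(X), len(models)), np.inf)
    for j, model in enumerate(models):
        if model is None:
            continue
        mean, W = model
        Xc = X - mean
        residual = Xc - Xc @ W @ W.T
        err[:, j] = (residual ** 2).sum(axis=1)
    return err

def repartition(X, labels, n_components, n_iter=10):
    """Alternate between fitting cluster PCA models and reassigning vectors."""
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):  # fixed, pre-specified number of iterations
        models = fit_models(X, labels, n_components)
        labels = block_errors(X, models).argmin(axis=1)
    return labels, fit_models(X, labels, n_components)
```

Under this simplification, neither the reassignment step nor the refitting step can increase the total squared coding error, so the error is non-increasing over the iterations.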
This method improves the homogeneity of each cluster by increasing the within-group correlation corresponding to the PCA image coding. If the total number of variables to store is approximately the same, the proposed algorithm removes the redundant variables of clusters with simple structures and increases the number of principal components to improve the reconstructed quality of clusters with complex structures. Therefore, the algorithm effectively increases the image quality and improves the visual effect.
In another proposed repartition algorithm, we partition the dataset into numerous clusters, in which the number of principal components used in PCA can vary. In this work, we use GA as a framework with three phases, namely, GA operation, repartition clustering, and clustering PCA for image coding. In repartition clustering, the clustering and the number of principal components for each cluster are determined progressively [35-37]. The method can improve the homogeneity of each cluster by increasing the within-group correlation corresponding to PCA image coding. Under the condition that the total number of variables to store is roughly the same, the proposed algorithm removes redundant variables in clusters with simple structures and increases the number of principal components to improve the reconstructed quality of certain clusters with complex structures. Experimental results show that the proposed method can effectively increase image quality and improve the visual effect.
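To illustrate how a GA could allocate principal components among clusters under a fixed storage budget, a minimal sketch follows. It is not the three-phase framework of this thesis. The storage model (m coefficients per vector plus m basis vectors of length d), the operators (tournament selection, uniform crossover, plus or minus one mutation, elitism), and all function names and parameter values are assumptions made for the example.

```python
import numpy as np

def cluster_spectra(X, labels, k):
    """Squared singular values of each centered cluster, largest first."""
    spectra, sizes = [], []
    for j in range(k):
        Xj = X[labels == j]
        s = np.linalg.svd(Xj - Xj.mean(axis=0), compute_uv=False)
        spectra.append(s ** 2)
        sizes.append(len(Xj))
    return spectra, np.array(sizes)

def coding_error(m, spectra):
    # Keeping m[j] components in cluster j discards the remaining s^2 terms.
    return float(sum(sp[int(mj):].sum() for mj, sp in zip(m, spectra)))

def storage(m, sizes, d):
    # ASSUMED cost model: m coefficients per vector plus m basis vectors.
    return int(np.sum(np.asarray(m) * (sizes + d)))

def ga_allocate(spectra, sizes, d, budget, m_init, pop=30, gens=60, seed=0):
    """Search for per-cluster component counts that fit within the budget."""
    rng = np.random.default_rng(seed)
    k = len(spectra)
    max_m = np.array([len(sp) for sp in spectra])

    def fitness(m):  # lower is better; infeasible allocations are rejected
        return np.inf if storage(m, sizes, d) > budget else coding_error(m, spectra)

    def tournament(population, scores):
        a, b = rng.integers(0, len(population), size=2)
        return population[a] if scores[a] <= scores[b] else population[b]

    # Seed the population with a feasible allocation plus random ones.
    population = [np.array(m_init)]
    while len(population) < pop:
        population.append(np.minimum(rng.integers(0, max_m.max() + 1, size=k), max_m))
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        children = [population[int(scores.argmin())].copy()]  # elitism
        while len(children) < pop:
            p1, p2 = tournament(population, scores), tournament(population, scores)
            child = np.where(rng.random(k) < 0.5, p1, p2)      # uniform crossover
            step = rng.integers(-1, 2, size=k) * (rng.random(k) < 0.3)  # mutation
            children.append(np.clip(child + step, 0, max_m))
        population = children
    scores = np.array([fitness(m) for m in population])
    return population[int(scores.argmin())]
```

Because the best individual is always carried over and the initial allocation `m_init` is assumed to be feasible, the returned allocation is never worse than `m_init` under this fitness; whether it is globally optimal is not guaranteed.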
This thesis is organized as follows. In Chapter 2, we introduce traditional PCA image coding, in which no clustering technique is employed. Non-adaptive clustering PCA for image coding based on K-means and on the subtractive clustering algorithm is discussed in Chapters 3 and 4, respectively. Here, “non-adaptive” means that the number of principal components is pre-specified, whereas “adaptive” means that it is determined by the algorithm. Chapters 5 and 6 describe two adaptive clustering PCA methods, which correspond to the iterative and the repartition grouping algorithms, respectively. The conclusions and future work are stated in Chapter 7.