The goal of data clustering is to group data samples into clusters such that samples from the same cluster are more similar to each other than to samples from different clusters. It is also known as unsupervised learning, numerical taxonomy, typological analysis, and vector quantization [1, 2]. Among the various clustering approaches, probabilistic model-based clustering is derived from statistical learning, and it has been successfully applied to many tasks, for example, speaker recognition [3, 4], speech recognition [5], handwriting recognition [6], image segmentation [7], and clustering of microarray expression data [8, 9].
In model-based clustering, data samples are grouped by learning a finite mixture model (usually a Gaussian mixture model, i.e., GMM), in which each mixture component represents a cluster. There are two major learning methods for model-based clustering:
the mixture likelihood approach, where the likelihood of each data sample is a mixture of all the component likelihoods of the data sample; and the classification likelihood approach, where the likelihood of each data sample is generated by its winning component only [10, 11, 12, 13, 14, 15, 16, 17, 18]. In both approaches, when the globally optimal estimate of the model parameters cannot be obtained analytically, iterative learning algorithms that only guarantee locally optimal solutions are usually employed. The expectation-maximization (EM) algorithm for mixture likelihood learning [19, 20, 21, 22] and the classification EM (CEM) algorithm for classification likelihood learning [17] are two such algorithms and have been the dominant approaches to this task. Conventional EM- or CEM-based clustering has three critical aspects, namely the initialization of model parameters, the model complexity (in this thesis, the number of mixture components of a mixture model), and the topology-preserving ability. These aspects are discussed as follows.
• For model initialization: The learning performance of EM and CEM is very sensitive to the initial values of the model parameters. To address this issue, the authors in [17] proposed a simulated annealing implementation of CEM, which reduces the dependence on the initial position by means of random perturbations. Rather than relying on the randomization of simulated annealing, the authors in [23] proposed a deterministic annealing EM (DAEM) algorithm that tackles the initialization issue via a deterministic annealing (DA) process. DAEM was originally derived as a DA variant of EM; however, as shown in Section 2.4.3, it is also a DA variant of CEM.
Although DAEM has been reported to achieve good performance, there is still no guarantee that it finds the globally optimal model parameters because, like EM, it is an iterative, single-token search scheme. In order to search the parameter space along multiple paths, the authors in [24, 25] and [26, 27] proposed multi-thread search strategies for EM and DAEM, respectively, in which the search paths are adapted during the search according to the eigen-decomposition of the Hessian matrix of the target log-likelihood function. As another kind of multiple-path search, a GA-EM algorithm was proposed in [28], where EM learning is integrated into a genetic search procedure. It is less sensitive to the initialization because of the stochastic search nature of genetic algorithms (GA). Some heuristic learning algorithms have also been proposed. For example, an initialization approach for EM based on subsampling was presented in [29]. The authors in [30] proposed an SMEM algorithm that finds appropriate initial conditions for EM learning by using split and merge operations. Similarly, Young and Woodland [31] proposed a component-splitting approach to learning a GMM, which iteratively splits the mean vector of the Gaussian component with the largest weight into two new ones and then performs EM to update all the Gaussian components. Another method based on component splitting is presented in [32]. In [33], the authors suggested a simple scheme: first perform several short runs of EM (by early stopping), then select the best model among the results and use it as the initial model for a long run of standard EM learning. A common and simple alternative is to apply K-means clustering or hierarchical agglomerative clustering (HAC) to locate the initial mean vectors of the Gaussian components for EM learning [1, 34, 35].
Note that all the approaches mentioned above are performed with a given target number of mixture components.
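As an illustration of the initialization issue, the short-run strategy attributed to [33] above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in [33]: it handles only one-dimensional Gaussian mixtures, draws the initial means from the data at random, and all function names, the variance floor, and the default run lengths are hypothetical choices made for illustration.

```python
import math
import random

def component_densities(x, weights, means, variances):
    """Weighted Gaussian densities w_g * N(x | mu_g, var_g) for one sample."""
    return [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]

def log_likelihood(data, weights, means, variances):
    """Mixture log-likelihood: sum over samples of log sum_g w_g N(x | mu_g, var_g)."""
    return sum(math.log(sum(component_densities(x, weights, means, variances)))
               for x in data)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: posterior probability (responsibility) of each component per sample.
    resp = []
    for x in data:
        dens = component_densities(x, weights, means, variances)
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: re-estimate weights, means, and variances from the responsibilities.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for g in range(len(weights)):
        n_g = sum(r[g] for r in resp)
        mu = sum(r[g] * x for r, x in zip(resp, data)) / n_g
        var = sum(r[g] * (x - mu) ** 2 for r, x in zip(resp, data)) / n_g
        new_w.append(n_g / n)
        new_m.append(mu)
        new_v.append(max(var, 1e-6))  # assumed variance floor to avoid degeneracy
    return new_w, new_m, new_v

def run_em(data, params, n_iter):
    """Run a fixed number of EM iterations from the given (weights, means, variances)."""
    for _ in range(n_iter):
        params = em_step(data, *params)
    return params

def random_init(data, n_components, rng):
    """Assumed initialization: equal weights, means drawn from the data, global variance."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return ([1.0 / n_components] * n_components,
            rng.sample(data, n_components),
            [var] * n_components)

def short_runs_then_long(data, n_components, n_starts=5, short_iter=3,
                         long_iter=50, seed=0):
    """Several short EM runs, then one long run started from the best short run."""
    rng = random.Random(seed)
    candidates = [run_em(data, random_init(data, n_components, rng), short_iter)
                  for _ in range(n_starts)]
    scores = [log_likelihood(data, *c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    return run_em(data, best, long_iter), scores
```

Because each EM iteration does not decrease the mixture log-likelihood (as long as the variance floor is not triggered), the long run can only improve on the best short run; it still converges to a local optimum, which is the limitation discussed above.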
• For model complexity: Assessment of the number of mixture components (data clusters) is an important issue in model-based clustering. The mixture model would over-fit the data if it contains too many mixture components; in contrast, it would not be flexible enough to describe the structure of the data if the number of components is too small. Various approaches have been proposed to address this issue.
In [36], Furman and Lindsay developed two hypothesis test procedures based on moment estimators to assess the number of components of a GMM. As another hypothesis-test-based approach, McLachlan and Khan estimated the number of components by a likelihood ratio test, in which a re-sampling process is applied to assess the null hypothesis [37]. In [38], the authors estimated the mixture complexity by comparing an information-theoretic nonparametric estimator with the best parametric fit of a given complexity. As another information-theoretic approach, a maximum-entropy-based approach with a modified EM algorithm was proposed in [39] to assess the model complexity of a GMM. Moreover, model selection criteria, also known as penalized-likelihood criteria, have been proposed to assess the model complexity; for example, Akaike's Information Criterion (AIC) [40], the Bayesian Information Criterion (BIC) [41, 10, 42], Integrated Completed Likelihood (ICL) [43], Approximate Weight of Evidence (AWE) [18], Minimum Description Length (MDL) [44] (which is formally identical to BIC), and Minimum Message Length (MML) [45]. BIC is derived on the basis of the mixture likelihood, ICL and AWE are derived on the basis of the classification likelihood, and AIC, MDL, and MML are information-theoretic criteria. A common way to apply these criteria to assess model complexity is to first define an upper bound, $G_{\max}$, on the number of components of the candidate models, and then choose the candidate with the best score under the employed criterion as the best model. For example, when using BIC, the best component number is
$$\hat{G} = \arg\max_{G} \left\{ 2 \log p(X; \hat{\Theta}_G) - \mathrm{Penalty}(G) \;\middle|\; G = 1, 2, \ldots, G_{\max} \right\}, \tag{1.1}$$
where $\hat{\Theta}_G$ is the maximum likelihood estimate of the parameters of the mixture model with $G$ components, and $\mathrm{Penalty}(G)$ is a monotonically increasing function of $G$ that penalizes a more complex model more heavily [10]. However, there are two potential drawbacks with this model-selection-based approach. First, the upper bound of the component number, $G_{\max}$, must be defined beforehand. On the one hand, if $G_{\max}$ is too large (much larger than the best component number determined by the model selection criterion), the learning process will waste a lot of computation time. On the other hand, if $G_{\max}$ is too small, the selected model may not be flexible enough to describe the structure of the data. Second, $\hat{\Theta}_G$ is usually obtained by EM, whose performance is highly dependent on the model initialization.
Rather than assessing the model complexity by incrementally adding mixture components during the learning process, as discussed above, the variational Bayesian framework in [46, 47] automatically determines the number of components by starting with a large number of components and then suppressing the unwanted ones.
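The criterion-based selection in Eq. (1.1) can be sketched as below. This is only an illustrative sketch: it assumes the usual BIC penalty, namely the number of free parameters times the logarithm of the sample size, and a GMM with full covariance matrices; the function names are hypothetical, and the log-likelihood values in the usage example are made-up numbers, not results from any experiment.

```python
import math

def gmm_free_parameters(n_components, dim):
    """Free parameters of a GMM with full covariances (assumed model family):
    (G - 1) mixing weights, G*d mean entries, G*d*(d+1)/2 covariance entries."""
    return ((n_components - 1)
            + n_components * dim
            + n_components * dim * (dim + 1) // 2)

def select_by_bic(max_loglik, n_samples, dim):
    """Pick the component number maximizing 2*log-likelihood - Penalty(G).

    max_loglik maps each candidate G (1..G_max) to the maximized log-likelihood
    of the G-component model, e.g. as obtained by EM.
    """
    scores = {
        g: 2.0 * ll - gmm_free_parameters(g, dim) * math.log(n_samples)
        for g, ll in max_loglik.items()
    }
    best_g = max(scores, key=scores.get)
    return best_g, scores

if __name__ == "__main__":
    # Made-up log-likelihoods for G = 1, 2, 3 on 200 one-dimensional samples.
    hypothetical_loglik = {1: -520.0, 2: -410.0, 3: -408.5}
    best_g, scores = select_by_bic(hypothetical_loglik, n_samples=200, dim=1)
    print(best_g, scores)
```

In this made-up example, moving from two to three components raises the log-likelihood only slightly, so the penalty term dominates and the two-component model is selected. Both drawbacks noted above remain visible in the sketch: the candidate range must be fixed in advance, and every entry of `max_loglik` depends on how well EM was initialized.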
• For topology-preserving ability: Conventional model-based clustering cannot preserve the topological relationships among data samples and clusters after the
clustering procedure. To overcome this shortcoming, the clustering task can be performed by using Kohonen's self-organizing map (SOM) [48, 49]. The SOM, rather than being a supervised neural network model for pattern recognition, is an unsupervised model for data clustering and visualization. After SOM's clustering procedure, the topological relationships among data samples and clusters can be preserved (or visualized) on the network, which is usually a two-dimensional lattice. Kohonen's sequential and batch SOM learning algorithms have proved successful in many practical applications [48, 49]. However, they also suffer from some shortcomings, such as the lack of an objective (cost) function, of a general proof of convergence, and of a probabilistic framework [50]. Some related works that have addressed these issues are as follows. In [51, 52], the behavior of Kohonen's sequential learning algorithm was studied in terms of energy functions; on this basis, Cheng [53] proposed an energy function for SOM whose parameters can be learned by a K-means type algorithm.
Luttrell [54, 55] proposed a noisy vector quantization model called the topographic vector quantizer (TVQ), whose training process coincides with the learning process of SOM. The cost function of TVQ represents the topographic distortion between the input data and the output code vectors in terms of Euclidean distance. Graepel et al. [56, 57] derived a soft topographic vector quantization (STVQ) algorithm by applying a deterministic annealing process to the optimization of TVQ's cost function. Based on the topographic distortion concept, Heskes [58] applied a different DA implementation from that of STVQ, and obtained an algorithm identical to STVQ when the quantization error is expressed in terms of Euclidean distance.
In [59], Chow and Wu proposed an on-line algorithm for STVQ; later, motivated by STVQ, they proposed a data visualization method that integrates SOM and multi-dimensional scaling [60]. Based on the Bayesian analysis of SOMs in [61], Anouar et al. [62] proposed a probabilistic formalism for SOM, where the parameters are learned by a K-means type algorithm. To help users select the correct model complexity for SOM by probabilistic assessment, Lampinen and Kostiainen [63] developed a generative model in which the SOM is trained by Kohonen's algorithm. Meanwhile, Van Hulle [64] developed a kernel-based topographic formation in which the parameters are adjusted to maximize the joint entropy of the kernel outputs. He subsequently developed a new algorithm with heteroscedastic Gaussian mixtures that allows for a unified account of vector quantization, log-likelihood, and Kullback-Leibler divergence [65]. Another probabilistic formulation was proposed in [66], whereby a normalized neighborhood function of SOM is used as the posterior distribution in the E-step of the EM algorithm for a mixture model to enforce the self-organization of the mixture components. Sum et al. [67] interpreted Kohonen's sequential learning algorithm in terms of maximizing the local correlations (coupling energies) between neurons and their neighborhoods for the given input data.
They then proposed an energy function for SOM that reveals the correlations, and a gradient ascent learning algorithm for the energy function.
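To make the topology-preserving idea concrete, a generic form of Kohonen's sequential learning rule can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the exact procedure of [48, 49]: the lattice is one-dimensional, the data are scalars, the neighborhood function is Gaussian, and the linear decay schedules, default values, and function names are hypothetical choices.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the lattice unit whose weight is closest to the sample x."""
    return min(range(len(weights)), key=lambda j: (x - weights[j]) ** 2)

def train_som_1d(data, n_units=10, n_epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Sequential SOM learning on a 1-D lattice with scalar inputs."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    # Assumed initialization: weights drawn uniformly within the data range.
    weights = [rng.uniform(lo, hi) for _ in range(n_units)]
    total_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # assumed shrinking radius
            winner = best_matching_unit(weights, x)
            for j in range(n_units):
                # Units close to the winner on the lattice are pulled more strongly.
                h = math.exp(-((j - winner) ** 2) / (2.0 * sigma ** 2))
                weights[j] += lr * h * (x - weights[j])
            step += 1
    return weights
```

The neighborhood term `h` is what couples adjacent lattice units and yields the topological ordering; with `h` equal to one for the winner and zero elsewhere, the update reduces to an on-line K-means-style rule. The update is stated directly rather than derived from a cost function, which reflects the lack of an objective function noted above.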