Model-Based Clustering by Probabilistic Self-Organizing Maps

(1)

Model-Based Clustering by Probabilistic

Self-Organizing Maps

Shih-Sian Cheng, Hsin-Chia Fu, Member, IEEE, and Hsin-Min Wang, Senior Member, IEEE

Abstract—In this paper, we consider the learning process of a

probabilistic self-organizing map (PbSOM) as a model-based data clustering procedure that preserves the topological relationships between data clusters in a neural network. Based on this concept, we develop a coupling-likelihood mixture model for the PbSOM that extends the reference vectors in Kohonen’s self-organizing map (SOM) to multivariate Gaussian distributions. We also derive three expectation–maximization (EM)-type algorithms, called the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM) based on the maximum-likelihood crite-rion. SOCEM is derived by using the classification EM (CEM) algorithm to maximize the classification likelihood; SOEM is derived by using the EM algorithm to maximize the mixture like-lihood; and SODAEM is a deterministic annealing (DA) variant of SOCEM and SOEM. Moreover, by shrinking the neighbor-hood size, SOCEM and SOEM can be interpreted, respectively, as DA variants of the CEM and EM algorithms for Gaussian model-based clustering. The experimental results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to that of the deterministic annealing EM (DAEM) approach, while maintaining the topology-preserving property.

Index Terms—Classification expectation–maximization (CEM)

algorithm, deterministic annealing expectation–maximization (DAEM) algorithm, expectation–maximization (EM) algo-rithm, model-based clustering, probabilistic self-organizing map (PbSOM), self-organizing map (SOM).

I. INTRODUCTION

I

N model-based clustering, data samples are grouped by learning a mixture model (usually a Gaussian mixture model) in which each mixture component represents a group or cluster. There are two major learning methods for model-based clustering: the mixture-likelihood approach, where the likeli-hood of each data sample is a mixture of all the component likelihoods of the data sample, and the classification-likelihood

approach, where the likelihood of each data sample is generated

by its winning component only [1]–[9]. In both approaches, when the globally optimal estimation of the model parameters

Manuscript received November 01, 2007; revised November 30, 2008; ac-cepted January 02, 2009. First published March 31, 2009; current version pub-lished May 01, 2009. This work was supported in part by the National Science Council of Taiwan under Grants 2628-E-001-024-MY3 and NSC96-3113-H-001-012.

S.-S. Cheng is with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan and also with the Institute of Information Science, Academia Sinica, Taipei, Taiwan (e-mail: [email protected]). H.-C. Fu is with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan (e-mail: [email protected]).

H.-M. Wang is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNN.2009.2013708

cannot be obtained analytically, iterative learning algorithms that only guarantee obtaining locally optimal solutions are usu-ally employed. The expectation–maximization (EM) algorithm for mixture-likelihood learning [10], [11] and the classification EM (CEM) algorithm for classification-likelihood learning [8] are two such algorithms. However, a critical aspect of the EM and CEM algorithms is that their learning performance is very sensitive to the initial conditions of the model’s param-eters. To address this issue, Ueda and Nakano [12] proposed a deterministic annealing EM (DAEM) algorithm that tackles the initialization issue via a deterministic annealing process, which performs robust optimization based on an analogy to the cooling of a system in statistical physics. Some heuristic-like learning algorithms have also been proposed. For example, in [13], the authors propose an algorithm that finds the appropriate initial conditions for EM learning by using split and merge operations. Another method, proposed in [14], overcomes the initialization issue of EM by iteratively splitting the mixture components using the Bayesian information criterion as the splitting validity measure.

In addition to the initialization issue of the learning al-gorithms, conventional model-based clustering suffers from another limitation in that it cannot preserve the topological relationships among clusters after the clustering procedure. To overcome this shortcoming, the clustering task can be performed by using Kohonen’s self-organizing map (SOM) [15], [16], a well-known neural network model for data clus-tering and visualization. After the clusclus-tering procedure, the topological relationships among data clusters can be preserved (or visualized) on the network, which is usually a 2-D lattice. Kohonen’s sequential and batch SOM learning algorithms have proved successful in many practical applications [15], [16]. However, they also suffer from some shortcomings, such as the lack of an objective (cost) function, a general proof of convergence, and a probabilistic framework [17]. The following are some related works that have addressed these issues. In [18] and [19], the behavior of Kohonen’s sequential learning algorithm was studied in terms of energy functions, based on which, Cheng [20] proposed an energy function for SOM whose parameters can be learned by a -means-type algo-rithm. Luttrell [21], [22] proposed a noisy vector quantization model called the topographic vector quantizer (TVQ), whose training process coincides with the learning process of SOM. The cost function of TVQ represents the topographic distortion between the input data and the output code vectors in terms of Euclidean distance. Graepel et al. [23], [24] derived a soft topographic vector quantization (STVQ) algorithm by applying a deterministic annealing process to the optimization of TVQ’s cost function. Based on the topographic distortion concept, 1045-9227/$25.00 © 2009 IEEE

(2)

Heskes [25] applied a different DA implementation from that of STVQ, and obtained an algorithm identical to STVQ when the quantization error is expressed in terms of Euclidean distance. In [26], Chow and Wu proposed an online algorithm for STVQ; later, motivated by STVQ, they proposed a data visualization method that integrates SOM and multidimensional scaling [27]. Based on the Bayesian analysis of SOMs in [28], Anouar et

al. [29] proposed a probabilistic formalism for SOM, where

the parameters are learned by a -means-type algorithm. To help users select the correct model complexity for SOM by probabilistic assessment, Lampinen and Kostiainen [30] developed a generative model in which the SOM is trained by Kohonen’s algorithm. Meanwhile, Van Hulle [31] developed a kernel-based topographic formation in which the parameters are adjusted to maximize the joint entropy of the kernel outputs. He subsequently developed a new algorithm with heteroscedastic Gaussian mixtures that allows for a unified account of vector quantization, log likelihood, and Kullback–Leibler divergence [32]. Another probabilistic formulation is proposed in [33], whereby a normalized neighborhood function of SOM is used as the posterior distribution in the E-step of the EM algorithm for a mixture model to enforce the self-organizing of the mixture components. Sum et al. [34] interpreted Kohonen’s sequential learning algorithm in terms of maximizing the local correlations (coupling energies) between neurons and their neighborhoods for the given input data. They then proposed an energy function for SOM that reveals the correlations, and a gradient-ascent learning algorithm for the energy function.

In Kohonen’s SOM architecture, neurons in the network as-sociate with reference vectors in the data space. This contrasts with a SOM whose neurons associate with reference models that present probability distributions, such as the isotropic Gaus-sians used in [33] and the heteroscedastic GausGaus-sians used in [29] and [32]. In this paper, we call the latter a probabilistic SOM (PbSOM). Motivated by the coupling energy concept in Sum et al.’s work [34], we develop a coupling-likelihood mix-ture model for the PbSOM that uses multivariate Gaussian dis-tributions as the reference models. In the proposed model, local coupling energies between neurons and their neighborhoods are expressed in terms of probabilistic likelihoods; and each mix-ture component expresses the local coupling-likelihood between one neuron and its neighborhood. Based on this model, we de-velop CEM, EM, and DAEM algorithms for learning PbSOMs, namely, the SOCEM, SOEM, and SODAEM algorithms, re-spectively. Because they inherit the properties of the CEM and EM algorithms, the proposed algorithms are characterized by re-liable convergence, low cost per iteration, economy of storage, and ease of programming. From our experiments on the or-ganizing property, we observe that SOEM is less sensitive to the initialization of the parameters than SOCEM when using a small-fixed neighborhood, while SODAEM overcomes the ini-tialization problem of SOCEM and SOEM through an annealing process. Furthermore, we show that SOCEM and SOEM can be interpreted, respectively, as deterministic annealing variants of the CEM and EM algorithms for Gaussian model-based clus-tering, where the neighborhood shrinking is interpreted as an annealing process. We conducted experiments on data sets from the University of California at Irvine (UCI) Machine Learning

Database Repository [35]. The experiment results show that the proposed PbSOM learning algorithms achieve comparable data clustering performance to the DAEM algorithm, while main-taining the topology-preserving property.

The remainder of this paper is organized as follows. In Section II, we review the EM, CEM, and DAEM algorithms for model-based clustering. Then, the proposed coupling-likeli-hood mixture model, and the SOCEM, SOEM, and SODAEM algorithms are described in Section III. The experimental results are detailed in Section IV. The differences and relations between the proposed algorithms and other ones are discussed in Section V. We then present our conclusions in Section VI.

II. THEEM, CEM,ANDDAEM ALGORITHMS FOR

MODEL-BASEDCLUSTERING

A. The Mixture-Likelihood Approach and EM Algorithm

In the mixture-likelihood approach for model-based clustering, it is assumed that the given data set

is generated by a set of indepen-dently and identically distributed (i.i.d.) random vectors from a mixture model

(1) where is the mixing weight of the mixture component

, subject to for ;

; and denotes the parameter set of .

The maximum-likelihood estimate of the parameter set of the mixture model

can be obtained by maximizing the following log-likelihood function:

(2) This is usually achieved by using the EM algorithm [10], [11]. After learning the mixture model, we derive a partition of ,

, by assigning each to the

mix-ture component that has the largest posterior probability for ,

i.e., if .

1) The EM Algorithm for Mixture Models: If the

maximum-likelihood estimation of the parameters cannot be accomplished analytically, the EM algorithm is normally used as an alternative approach when the given data is incomplete or contains hidden information.

In the case of the mixture model, suppose that denotes the current estimate of the parameter set, and is the hidden variable that indicates the mixture component from which the observation is generated. The E-step of the EM algorithm then computes the following so-called auxiliary function:

(3)

where

(4) and

(5)

denotes the posterior probability of the th mixture component for with the given . Then, in the following M-step, the

that satisfies

(6) is chosen as the new estimate of the parameter set. By itera-tively creating the auxiliary function in (3) and performing the subsequent maximization step, the EM algorithm guarantees to converge to a local maximum of the log-likelihood function in (2).

When cannot be maximized analytically,

the M-step is modified to find some such that

. This type of algorithm, called generalized EM (GEM), is also guaranteed to converge to a local maximum [10], [11].

B. The Classification-Likelihood Approach and CEM Algorithm

In the classification-likelihood approach for model-based clustering [6]–[8], instead of maximizing the log-likelihood function of the mixture model in (2), the objective is to find

the partition of and the model

parameters that maximize

(7) or

(8)

The relation between and is

(9) where denotes the number of samples in . If all the

mix-ture components are equally weighted,

be-comes a constant, such that and are equivalent.

1) The CEM Algorithm for Mixture Models: Celeux and

Govaert [8] proposed the classification EM (CEM) algorithm for estimating the parameter set and partition . Like the EM algorithm, the CEM algorithm is also an iterative learning approach. In each iteration, CEM inserts a classification step (C-step) between the E-step and M-step of the EM algorithm. In the E-step, the posterior probability of each mixture component

is calculated for each data sample. In the C-step, to obtain the partition of the data samples, each sample is assigned to the mixture component that yields the largest posterior probability for that sample. In the M-step, the maximization

process is applied to individually for .

For example, if a multivariate Gaussian is used as the mixture component, the reestimated mean vector and covariance matrix are the mean vector and the covariance matrix of the data samples in , respectively, while the reestimated mixture weight is . From a practical point of view, CEM is a -means-type algorithm that represents the prototypes with probability distributions [8].

C. The DAEM Algorithm

In the DAEM algorithm for learning a mixture model [12], the objective is to minimize the following system energy function during the annealing process:

(10)

where corresponds to the temperature that controls the an-nealing process. The auxiliary function in this case is

(11)

where

(12)

is the posterior probability derived by using the maximum en-tropy principle.

Ueda and Nakano [12] showed that can be

itera-tively minimized by iteraitera-tively minimizing . When

using DAEM to learn a mixture model, is initialized with a small value (less than 1) such that the energy function itself is simple enough to be optimized. Then, the value of is gradu-ally increased to 1. During the learning process, the parameters learned in the current learning phase are used as the initial pa-rameters of the next phase.1_{In the case of} _,

and are the negatives of the log-likelihood function in (2) and the -function in (3), respectively, thus, minimizing is equivalent to maximizing the log-likelihood func-tion.

According to [12], (10) can be rewritten as

(13) where

(14)

1_Each_{value corresponds to a learning phase. The algorithm proceeds to the}

(4)

and

(15)

is the entropy of the posterior distribution. When , the

rational function approximates to a zero-one

func-tion; thus, the entropy term In this case,

is equivalent to the negative of the objective function for CEM in (8). Therefore, DAEM can be viewed as a DA variant of CEM.

III. THEEM-TYPEALGORITHMS FORLEARNINGPBSOMS

A. Formulation of the Coupling-Likelihood Mixture Model

In this paper, we define a PbSOM as a SOM that consists of

neurons in a network with a

neighbor-hood function that defines the strength of lateral interaction

between two neurons and , for ; and

each neuron associates with a reference model that repre-sents some probability distribution in the data space.

Sum et al. [34] interpreted Kohonen’s sequential SOM learning algorithm in terms of maximizing the local corre-lations (coupling energies) between the neurons and their neighborhoods with the given input data. Given a data sample , the local coupling energy be-tween and its neighborhood is defined as

(16)

where denotes the response of neuron to , which

is modeled by an isotropic Gaussian density. Then, the coupling energy over the network for is defined as

(17)

and the energy function to be maximized is

(18)

In (16), the term can be considered as the

neighborhood response of , where the conjunction between the neuron responses is implemented using the summing operation.

In this study, we express the neuron response as a multivariate Gaussian distribution as follows:

(19)

for , and formulate the neighborhood response

of as

(20) where the conjunction between the neuron responses in the neighborhood of is implemented using the multiplicative operation. Then, for a given , we define the local coupling energy between and its neighborhood as the following coupling likelihood:

(21) where is the set of reference models, and denotes the given neighborhood function.2 _{Then, we define the coupling}

likeli-hood of over the network as the following (unnormalized) mixture likelihood:

(22)

where for is fixed at . Note that,

theoretically, the mixture weights can be learned automatically. When maximizing the local coupling likelihood

for each neuron , , the topological order

between neuron and its neighborhood for the given data sample is learned in the learning process; therefore, we use equal mixture weights in the mixture model to take account of the topological order learning induced by the neurons faithfully (with equal prior importance). In fact, this is important for learning an ordered map. From our experimental analysis, if the mixture weights are updated in the learning process, the learning of topological order is frequently dominated by some particular mixture components, which makes it difficult to ob-tain an ordered map. For details, one can refer to the Appendix. Comparing the network structure of the proposed coupling-likelihood mixture model in (22) with that of the Gaussian mix-ture model (GMM), as shown in Fig. 1, the proposed model inserts a coupling-likelihood layer between the Gaussian-like-lihood layer and the mixture-likeGaussian-like-lihood layer to take account of the coupling between the neurons and their neighborhoods. When the neighborhood size is reduced to zero (i.e., ), the coupling-likelihood mixture model becomes a GMM with equal mixture weights.

Note that other probability distributions are possible for in the formulation of the coupling-likelihood mixture model, although we use the multivariate Gaussian distribution in this paper.

2_{From (21), it is obvious that, in our formulation, the coupling between}_{r and}

its neighboring neurons is considered jointly, whereas Sum et al.’s formulation considers it in a pairwise manner, as shown in (16). Note that we use the term “coupling likelihood” instead of “coupling energy” for two reasons: 1) (21) is a coupling of Gaussian likelihoods; and 2) using “coupling-likelihood” can help describe the link between our proposed approaches and model-based clustering.

(5)

Fig. 1. (a) Network structure of a Gaussian mixture model, and (b) the proposed coupling-likelihood mixture model. Here,r (x ; ) denotes the multivariate Gaussian distribution described in (19).

B. The SOCEM Algorithm

The self-organizing process of PbSOM can be described as a model-based data clustering procedure that preserves the spa-tial relationships between the clusters in a network. Based on the classification-likelihood criterion for data clustering [8], the computation of the coupling likelihood of a data sample is re-stricted to its winning neuron. Thus, the goal is to estimate the

partition of , , and the set of reference

models , so as to maximize the accumulated classification log likelihood over all the data samples as follows:

(23)

As for is fixed at , the objective

function can be rewritten as

(24) Similar to the derivation of the CEM algorithm for model-based clustering in [8], the CEM algorithm for the proposed PbSOM, i.e., the SOCEM algorithm, is derived as follows.

— E-step: Given the current reference model set , com-pute the posterior probability of each mixture component

of for each as follows:

(25)

for , and .

— C-step: Assign each to the cluster whose corresponding mixture component has the largest posterior probability for

, i.e., if .

— M-step: After the C-step, the partition of (i.e., ) is formed, and the objective function defined in (24) becomes

(26)

Similar to the derivation of the M-step of the EM algo-rithm for learning a Gaussian mixture model [10], we can obtain the reestimation formulas for the mean vectors and covariance matrices by substituting (19) into (26), taking the derivative of with respect to individual parameters, and then setting it to zero. The reestimation formulas are as follows:

(27)

(28)

for . When the neighborhood size is

(6)

TABLE I

DAEM ALGORITHM FORLEARNINGGMMSWITHEQUALMIXTUREWEIGHTS ANDSOCEM ALGORITHM

CEM algorithm for learning GMMs with equal mixture weights.

1) SOCEM—A DA Variant of CEM for GMM: Similar to

Ko-honen’s sequential or batch algorithm, the SOCEM algorithm is applied in two stages. First, it is applied to a large neighbor-hood to form an ordered map near the center of the data samples. Then, the reference models are adapted to fit the distribution of the data samples by gradually shrinking the neighborhood.

Without loss of generality, we suppose the neighborhood function is the widely adopted (unnormalized) Gaussian kernel (29) where is the Euclidean distance between two neurons and in the SOM network. Initially, SOCEM is applied with a large value, which is reduced after the algorithm converges. Then, we use the new value and the learned parameters as the initial condition of the next learning phase. This process is repeated until the value of is reduced to the predefined min-imum value . The above shrinking of the neighborhood (reduction of the value) can be interpreted as an annealing process, where a large value corresponds to a high temper-ature. Table I lists the learning rules of the DAEM algorithm for learning GMMs with equal mixture weights [12] and the SOCEM algorithm. To facilitate the interpretation, we rewrite the objective function and reestimation formulae of SOCEM in (24) and (27)-(28), respectively, with the new variable , which denotes the index of the winning neuron of . For sim-plicity, we only list the reestimation formulas of the mean vec-tors of the Gaussian components.

By analyzing these two algorithms carefully, one may view as a kind of posterior probability of for in the network domain. More precisely, is initially projected into in the network domain; then, is applied to (29) as an observation of the Gaussian kernel centered at to ob-tain the value of . In both the DAEM and SOCEM algo-rithms, when the temperature ( or ) is high, the posterior distribution becomes almost uniform; hence, all the reference models will be moved to locations near the center of the data samples in this learning phase. By gradually reducing the tem-perature, the influence of each becomes more localized, and the reference models gradually spread out to fit the distribution of the data samples. When the temperature approaches zero, the probabilistic assignment strategy for the data samples becomes

the winners-take-all strategy, and the objective functions and learning rules of DAEM and SOCEM are equivalent to those of CEM. The major difference between DAEM and SOCEM seems to be that the posterior distribution in SOCEM is con-strained by the network topology, but DAEM does not have this property.

To visualize the transition of the objective function, we show a simulation on a simple one-dimension, two-component Gaussian mixture problem in Fig. 2.3_{The training data contains}

200 observations drawn from

(30)

where the Gaussian means are , and the

Gaussian variances are .4_{The PbSOM network}

structure is a 1 2 lattice in . The two reference models are

and , where . The

objective function in (23) is calculated with different setups for to form the log-likelihood surface. From Fig. 2, we ob-serve that a larger for yields a simpler objective function for optimization. The log-likelihood surface is symmetric along because of the symmetric lattice structure and equal weighting of the reference models. For the case of , the log-likelihood value is close to the global maximum of the sur-face when both and are close to the center of the data (2.39 in this case). With the reduction in the value of , the location of

for the global maximum moves toward and

.

2) Relation to Kohonen’s Batch Algorithm: There are two

differences between the SOCEM algorithm and Kohonen’s batch algorithm. First, SOCEM considers the neighborhood information when selecting the winning neuron, but Ko-honen’s algorithm does not. Second, SOCEM extends the reference vectors in Kohonen’s algorithm with multivariate

Gaussians. In other words, if we set in SOCEM to

, instead of the setting in (25),

3_{Visualization of how deterministic annealing EM/CEM works for function}

optimization is illustrated in detail in [12].

4_{The data is generated using the function gmmsamp.m in Netlab software}

(7)

Fig. 2. SOCEM’s objective function becomes more complex with the reduction of neighborhood size ( in h ). (a) = 0:6. (b) = 0:4. (c) = 0:3. (d) = 0 (i.e., h = ).

we obtain a probabilistic variant of Kohonen’s batch algorithm (denoted as KohonenGaussian), where Kohonen’s winner se-lection strategy is applied and the reference vectors are replaced with multivariate Gaussians. Thus, we may view Kohonen-Gaussian as an approximate implementation of SOCEM that optimizes SOCEM’s objective function. Moreover, if we set the covariance matrices in KohonenGaussian to be diagonal with small, identical variances, KohonenGaussian is equivalent to Kohonen’s batch algorithm. Therefore, we can interpret the neighborhood shrinking of Kohonen’s algorithms as a deterministic annealing process, and thereby explain why they need to start with a large neighborhood size.

Recently, Zhong and Ghosh [3] interpreted the neighborhood size of the SOM algorithms that apply Kohonen’s winner selec-tion strategy as a temperature parameter in a deterministic an-nealing process. However, their interpretations were not based on the optimization of an objective function, which is the essen-tial part of DA-based optimization. In contrast, in SOCEM, the neighborhood shrinking leads to the transition of the objective function from a simpler one to a more complex one, as illus-trated in Fig. 2.

3) Computational Cost: It is clear from Table I that the

com-putational cost of DAEM is , where , , and are

the numbers of reference models, data samples, and learning

iterations, respectively. Compared to DAEM, SOCEM needs additional multiplication and addition operations for winner selection in each iteration, while KohonenGaussian needs additional multiplications and additions.

C. The SOEM Algorithm

As is obvious from (23), in the formulation of the objective function of the SOCEM algorithm, only the local coupling like-lihoods associated with the winning neurons are considered. Al-ternatively, we can compute the coupling likelihood of using the mixture likelihood defined in (22) and apply the EM algo-rithm to maximize the objective log-likelihood function

(31)

The steps of the EM algorithm for the proposed PbSOM, i.e., the SOEM algorithm, are as follows.

— E-step: With the mixture model in (22), we form the aux-iliary function as

(8)

where is the same as (25). Since , (32) can be rewritten as

(33)

As for is fixed at , by

substi-tuting (21) into (33), the auxiliary function can be rewritten as

(34)

— M-step: By replacing the response in (34) with

the multivariate Gaussian density in (19) and setting the derivative of with respect to individual mean vectors and covariance matrices to zero, we obtain the following reestimation formulas:

(35)

(36)

for . When the neighborhood size is

re-duced to zero (i.e., ), SOEM reduces to the EM

algorithm for learning GMMs with equal mixture weights. There are two major differences between the SOCEM and SOEM algorithms. First, they learn maps based on the classifi-cation-likelihood criterion and the mixture-likelihood criterion, respectively. Second, SOEM adapts the reference models in a more global way than SOCEM. To explain this perspective, we can consider the learning of SOCEM and SOEM in the sense of sequential learning. As illustrated in Fig. 3, in the SOCEM algo-rithm [cf., (27) and (28)], each data sample only contributes to the adaptation of the winning reference model and its neigh-borhood (i.e., only contributes to the learning of the topolog-ical order between the winning reference model and its neigh-borhood). However, in the SOEM algorithm [cf.. (35) and (36)], each data sample contributes proportionally to the adaptation of each reference model and its neighborhood according to the

posterior probabilities for .

1) SOEM—A DA Variant of EM for GMM: As with the

SOCEM algorithm, we can apply SOEM to a large neighbor-hood and obtain different map configurations by gradually

reducing the neighborhood size. The term in (35)

Fig. 3. For each data samplex , the adaptation of the reference models in SOCEM is restricted to the winning reference model and its neighborhood. However, in SOEM, the winner is relaxed to the weighted winners by the pos-terior probabilities , for k = 1; 2; . . . ; G. Each data sample x contributes proportionally to the adaptation of each reference model and its neighborhood according to the posterior probabilities. (a) SOCEM. (b) SOEM.

and (36) can be considered as a kind of posterior probability of the reference model for , which is also constrained by the neighborhood function. With a large value

in [see (29)], for , will be

nearly a uniform distribution due to the small variation in the

values of for , and the small variation in the

values of for , for each case of . Hence, all

the reference models will be moved to locations near the center of the data samples. When the neighborhood size is reduced

to zero (i.e., ), the SOEM algorithm becomes the

EM algorithm for learning GMMs with equal mixture weights. As with the annealing interpretation of SOCEM, SOEM can be viewed as a topology-constrained deterministic annealing variant of the EM algorithm for learning GMMs with equal mixture weights.5

2) Computational Cost: Comparing (34)–(36) to (26)–(28),

we can see that, in each learning iteration, SOEM and SOCEM have a similar computational cost in the E-step, but the former needs additional multiplication and addition operations for updating the model parameters in the M-step.

5_{SOEM yielded a similar result on the one-dimension, two-component}

Gaussian mixture problem in Fig. 2; however, we do not present it here to avoid redundancy.

(9)

D. The SODAEM Algorithm

Similar to the derivation of the DAEM algorithm for learning GMMs [12], we developed a DAEM algorithm for the proposed PbSOM, called the SODAEM algorithm. With the mixture like-lihood defined in (22), DAEM first derives the posterior den-sity in the E-step using the principle of maximum entropy. Fol-lowing the derivation of the posterior probability in [12] with the current model’s parameter set , we obtain the posterior probability of the th mixture component for as follows:

(37)

Then, the auxiliary function to be minimized is

(38) and the reestimation formulas for the mean vectors and covari-ance matrices are

(39)

(40)

for .

Note that the reestimation formulas for SODAEM are the same as those for SOEM, except that is replaced by . corresponds to the temperature that controls the annealing process, in which a high temperature is applied initially. Then, the system is cooled down by gradually reducing the

temper-ature. When , the SODAEM algorithm becomes the

SOEM algorithm; however, when , it is equivalent

to the SOCEM algorithm. In other words, SODAEM can be viewed as a deterministic annealing variant of SOEM and SOCEM.

By considering certain cases and approximations of SO-DAEM, SOEM, and SOCEM, we summarize the family of EM-based approaches for Gaussian model-based clus-tering discussed in this section in Fig. 4. Both EM under the mixture-likelihood criterion and CEM under the classifica-tion-likelihood criterion are widely used model-based data clustering methods. SOEM (SOCEM) can be applied instead of EM (CEM) in model-based clustering if we want to preserve the spatial relationships between the resulting data clusters on a network. Since SODAEM is a DA variant of SOEM and

Fig. 4. Family of Gaussian model-based clustering algorithms derived from the SODAEM, SOEM, and SOCEM algorithms. = 1 if k = l; otherwise, = 0.

SOCEM, it can be applied in model-based data clustering under both mixture-likelihood and classification-likelihood criteria.

1) Computational Cost: Comparing (39)-(40) to (35)-(36),

we can see that SODAEM and SOEM have similar computa-tional costs in each learning iteration.

IV. EXPERIMENTALRESULTS

A. Experiments on the Organizing Property

Data Set Description: We conducted experiments on two

types of data: a synthetic data set and a real-world data set. The synthetic data set consisted of 500 points uniformly distributed in a unit square. For the real-world data set, we used the training set of class “0” in the “Pen-Based Recognition of Handwritten Digits” database (denoted as PenRecDigits_C0) in the UCI Ma-chine Learning Database Repository [35]. The data set con-sists of 802 16-dimensional vectors. To demonstrate the map-learning process, we used the first two dimensions of the fea-ture vectors as data for simulations. As a preprocessing step, we scaled down each element of the vectors in PenRecDigits_C0 to 1/100 of its original value to avoid numerical traps.

Experimental Setup: In the experiments, an 8 8 equally spaced square lattice in a unit square was used as the structure of the SOM network. For the neighborhood function, we used the Gaussian kernel in (29).

We evaluated SOCEM, SOEM, SODAEM, and Kohonen-Gaussian (Kohonen’s batch algorithm that uses Kohonen-Gaussian ref-erence models) in 20 independent random initialization trials and two setups for in . For each trial, data samples were randomly selected from the data set as the initial mean

vec-tors, , of the reference models, which were

mul-tivariate Gaussians with full covariance matrices. The initial co-variance matrix was set as , where

, for . To avoid the singularity problem,

we applied the variance limiting step to the covariance matrices during the learning process. If the value of any element of the covariance matrix was less than 0.001, it was set at 0.001.

(10)

Fig. 5. Map-learning process obtained by running the SOCEM algorithm on the synthetic data. (a) and (b) Simulation 1: When SOCEM is run with the random initialization in (a) and = 0:15, it converges to the unordered map in (b). (a) and (c)-(f) Simulation 2: SOCEM starts with = 0:6 and the random initialization in (a). Then, the value of is reduced to 0.15 in 0.15 decrements. (a) Random initialization. (b) = 0:15 with random initialization. (c) = 0:6 with random initialization. (d) = 0:45. (e) = 0:3. (f) = 0:15.

1) Results Using the Synthetic Data: We first demonstrate

the map-learning processes of SOCEM, SOEM, and SODAEM using one of the 20 random initializations by showing the con-figurations of the Gaussian means on the maps, and then sum-marize the overall results of all the initializations.

Simulations Using SOCEM: Fig. 5 shows two simulations

using the SOCEM algorithm. In the first simulation, SOCEM is run with the random initialization in Fig. 5(a) and a fixed of 0.15 in . As shown in Fig. 5(b), the algorithm’s learning con-verges to an unordered map. In the second simulation, SOCEM starts with the same random initialization as that in Fig. 5(a), but with a larger of 0.6. When it converges at the current value, is reduced by 0.15. Then, the algorithm is applied again with the new value and the reference models obtained in the previous phase. This process continues until SOCEM converges

at . Fig. 5(c)–(f) depicts the maps obtained when

and , respectively. We can explain the second simulation in terms of annealing (cf., Section III-B1). When using SOCEM, we start with a larger value (a higher temperature) so that the objective function is simple enough to be optimized. Then, we obtain the target map configuration by gradually reducing the value of (the temperature). Though the reduction in produces a more complex objective function for optimization, SOCEM can still learn well because the reference models obtained at the larger value provide a sound initializa-tion for the next learning phase at the smaller value.

Simulations Using SOEM: We conducted two similar

sim-ulations using the SOEM algorithm. In the first simulation, SOEM was run with the random initialization in Fig. 6(a) [the same as that in Fig. 5(a)] and a fixed of 0.15. As shown in Fig. 6(b), the learning of SOEM converged to an unordered

map. In the second simulation, SOEM started with the random initialization in Fig. 6(a) and a larger of 0.6. Then, the value of was gradually reduced to 0.15 in 0.15 decrements. Fig. 6(c)–(f) depicts the maps obtained when SOEM

con-verges at and , respectively. Similar to

SOCEM, we can interpret the reduction of in SOEM as an annealing process (cf., Section III-C1), which overcomes the initialization issue. Comparing Fig. 6(c)-(d) to Fig. 5(c)-(d), we observe that the map obtained by SOEM is more concentrated than that obtained by SOCEM for the same value. This may be because SOEM learns the map in a more global manner than SOCEM, as noted in Section III-C. In other words, each data sample contributes to all the neurons in a more global manner in SOEM than in SOCEM.

Simulations Using SODAEM: Fig. 7 depicts the simulations

using the SODAEM algorithm with the same random initial-ization as that in Figs. 5(a) and 6(a). The value of is also fixed at 0.15, and the initial value of is set to 0.16. When SO-DAEM converges at a value, it is applied again with

and the reference models obtained in the previous phase.

We stop the learning process at . In our

experi-ence, it is appropriate to set the maximum value of within

the range 10–20 for practical applications. When ,

the temperature is high enough to ensure a smooth objective function. Therefore, according to the parameter update rules of SODAEM, the reference models form a compact ordered map via lateral interactions near the center of the data samples, even though the neighborhood size is small ( in this case).

When and , SODAEM is almost equivalent

to SOEM and SOCEM, respectively. In these two cases, SO-DAEM converges to the ordered maps in Fig. 7(f) and (i),

(11)

re-Fig. 6. Map-learning process obtained by running the SOEM algorithm on the synthetic data. (a) and (b) Simulation 1: When SOEM is run with the random initialization in (a) and = 0:15, it converges to the unordered map in (b). (a) and (c)-(f) Simulation 2: SOEM starts with = 0:6 and the random initialization in (a). Then, the value of is reduced to 0.15 in 0.15 decrements. (a) Random initialization. (b) = 0:15 with random initialization. (c) = 0:6 with random initialization. (d) = 0:45. (e) = 0:3. (f) = 0:15.

TABLE II

RESULTS OFSIMULATIONSUSINGKOHONENGAUSSIAN, SOCEM, SOEM,

ANDSODAEMIN20 INDEPENDENTRANDOMINITIALIZATIONTRIALS ON THESYNTHETICDATA. THEALGORITHMSWERERUNWITHTWOSETUPS FORINh . WHEN = 0:15, KOHONENGAUSSIANSUCCEEDED IN

CONVERGING TO ANORDEREDMAP INONERANDOMINITIALIZATION

CASE(S:1), BUTFAILED IN THEREMAININGCASES(F:19)

spectively. However, as shown in Figs. 5(a)-(b) and 6(a)-(b), SOCEM and SOEM do not converge to an ordered map when , which demonstrates that the annealing process of SODAEM overcomes the initialization problem of SOCEM and

SOEM when . Note that SODAEM may not be able to

obtain any ordered map during the annealing process if the value of is too small to form an ordered map at a small value.

Discussion: The experiment results obtained by the three

proposed algorithms and KohonenGaussian for the 20 random initializations are summarized in Table II. Several conclusions can be drawn from the results. First, SOEM often converges to

an ordered map even at a small, fixed value ( in

the experiments), but KohonenGaussian and SOCEM seldom do so. This may be because SOEM learns the map in a more global way, as noted in Section III-C; hence, it is less sensitive

to the initialization of the parameters when is small. The re-sults for KohonenGaussian and SOCEM are similar. This may be because they only differ in the winner selection strategy. Second, the initialization issue of KohonenGaussian, SOCEM, and SOEM can be overcome by using a larger value (0.6 in the experiments) initially, and then gradually reducing the value to the target value (0.15 in the experiments). The re-duction of can be interpreted as an annealing process (cf., Sections III-B1, III-B2, and III-C1). Third, the experiment re-sults show that SODAEM overcomes the initialization issue of SOCEM and SOEM at a small value (0.15 in the experiments) using the annealing process, which is controlled by the temper-ature parameter .

2) Results Using PenRecDigits_C0: We also conducted

experiments on real-world data using the setups for the neighborhood function described in Section IV-A1. Table III summarizes the results obtained by the four PbSOM learning algorithms. From the results, we can draw the same conclusions as those made for the experiment results on the synthetic data. Figs. 8–10 demonstrate, respectively, the map-learning pro-cesses of SOCEM, SOEM, and SODAEM using one of the 20 random initializations. Comparing Figs. 8–10, we observe that these three algorithms obtain rather different results. SOCEM and SOEM usually obtain different maps because they learn the maps based on different clustering criteria (classification like-lihood versus mixture likelike-lihood). SODAEM and SOEM (or SOCEM) usually obtain different results because SODAEM’s annealing is achieved by increasing the value, while SOEM’s (or SOCEM’s) annealing is achieved by decreasing the value. Comparing Figs. 9(f) and 10(f), although SODAEM becomes equivalent to SOEM when the value of is increased to 1.04,

(12)

Fig. 7. Map-learning process obtained by running the SODAEM algorithm on the synthetic data. The value of is fixed at 0.15, while value of is initialized at 0.16 and increased in multiples of 1.6 up to 17.592. (a) Random initialization. (b) = 0:15; = 0:16. (c) = 0:15; = 0:256. (d) = 0:15; = 0:409. (e) = 0:15; = 0:655. (f) = 0:15; = 1:04. (g) = 0:15; = 2:68. (h) = 0:15; = 6:871. (i) = 0:15; = 17:592.

their search paths on the objective function surface are different because they have rather different seed models [Fig. 10(e) versus Fig. 9(e)]. Therefore, they converge to different local maxima of the objective function and obtain different maps. Likewise, although SODAEM becomes equivalent to SOCEM when the value of is increased to 17.592, they converge to different local maxima of the objective function and obtain different maps [Fig. 10(i) versus Fig. 8(f)].

B. Experiments to Evaluate the Performance of Data Clustering and Visualization

Data Set Description: In this section, we evaluate the

per-formance of data clustering and visualization of the proposed algorithms on two data sets from the UCI Machine Learning Database Repository [35]: the test set of the “image segmen-tation” database (denoted as ImgSeg), which consists of 2100 19-dimensional feature vectors, and the Ecoli data set (denoted as Ecoli), which consists of 336 8-dimensional feature vectors. Here, we used the full vector, rather than only two dimensions, in the experiments. As a preprocessing step, we scaled down

each element of the data vectors in ImgSeg to 1/100 of its orig-inal value to avoid numerical traps.

Experimental Setup: To avoid the singularity problem that

often occurs when using CEM or EM to learn full covariance GMMs, we used diagonal covariance Gaussians in the experi-ments. We also applied the variance limiting step, in which the minimum value for a variance was set at 0.01.

For the PbSOM learning algorithms, we used five configu-rations for the network structure; they are 3 3, 4 4, 5 5, 6 6, and 7 7 lattices equally spaced in a unit square. We used the Gaussian kernel in (29) as the neighborhood function.

To avoid ambiguity, when the DAEM and SODAEM algorithms are applied in data clustering based on the clas-sification-likelihood criterion, they are denoted as DAEM_C and SODAEM_C; and they are denoted as DAEM_M and SODAEM_M when applied in data clustering based on the mixture-likelihood criterion.

All the algorithms discussed here were run with random initializations generated in the same way described in Section IV-A.

(13)

Fig. 8. Map-learning process obtained by running the SOCEM algorithm on PenRecDigits_C0. (a) and (b) Simulation 1: When SOCEM is run with the random initialization in (a) and = 0:15, it converges to the unordered map in (b). (a) and (c)-(f) Simulation 2: SOCEM starts with = 0:6 and the random initialization in (a). Then, the value of is reduced to 0.15 in 0.15 decrements. (a) Random initialization. (b) = 0:15 with random initialization. (c) = 0:6 with random initialization. (d) = 0:45. (e) = 0:3. (f) = 0:15.

Fig. 9. Map-learning process obtained by running the SOEM algorithm on PenRecDigits_C0. (a) and (b) Simulation 1: When SOEM is run with the random initialization in (a) and = 0:15, it converges to the unordered map in (b). (a) and (c)-(f) Simulation 2: SOEM starts with = 0:6 and the random initialization in (a). Then, the value of is reduced to 0.15 in 0.15 decrements. (a) Random initialization. (b) = 0:15 with random initialization. (c) = 0:6 with random initialization. (d) = 0:45. (e) = 0:3. (f) = 0:15.

1) Experiments on ImgSeg by Using SOCEM and SO-DAEM_C: First, we evaluated the data clustering performance

of KohonenGaussian, SOCEM, and SODAEM_C in terms of the classification log likelihood defined in (7). The performance

was compared with that of CEM and DAEM_C. The setting for each algorithm was as follows.

• DAEM_C: The value of was set at 0.2 initially, and

(14)

Fig. 10. Map-learning process obtained by running the SODAEM algorithm on PenRecDigits_C0. The value of is fixed at 0.15, while value of is initialized at 0.16 and increased in multiples of 1.6 up to 17.592. (a) Random initialization. (b) = 0:15; = 0:16. (c) = 0:15; = 0:256. (d) = 0:15; = 0:409. (e) = 0:15; = 0:655. (f) = 0:15; = 1:04. (g) = 0:15; = 2:68. (h) = 0:15; = 6:871. (i) = 0:15; = 17:592.

• SOCEM: The value of in was set at 0.7 initially, and reduced to 0 (i.e., ) in 0.02 decrements.

• SODAEM_C: Both the values of and in were set

at 0.2 initially. To perform data clustering using the classi-fication-likelihood criterion, the value of was increased

to 10 by the formula first; then, the value

of was reduced to 0 in 0.02 decrements.

• KohonenGaussian: The value of in was set at 0.7 initially, and reduced to 0 in 0.02 decrements every 30 learning iterations.6

We ran the algorithms except CEM with 20 independent trials using 9, 16, 25, 36, and 49 Gaussian components. To conduct a fair comparison of CEM and the proposed approaches, we ran CEM many trials until the accumulated execution time was close to that of one SOCEM trial. The mean and standard de-viations (error bars) of the classification log-likelihood values over the trials for each algorithm and the best results of CEM (denoted as CEM-best) are shown in Fig. 11. Note that, in the figure, we slightly separate the results associated with a specific Gaussian component number in order to distinguish between

6_{In our implementation for SOCEM, SOEM, and SODAEM, the phase}

tran-sition occurs when the likelihood increase is below a threshold or the number of learning iterations exceeds 30 in the current phase. However, KohonenGaus-sian does not have the convergence property; thus, we ran 30 iterations for each phase of the algorithm.

Fig. 11. Data clustering performance of CEM, DAEM_C, SOCEM, SO-DAEM_C, and KohonenGaussian on ImgSeg in terms of the classification log likelihood.

them. From the figure, we observe that the clustering perfor-mance of SOCEM, SODAEM_C, and KohonenGaussian is close to that of DAEM_C. Moreover, they obtain larger and

(15)

Fig. 12. Data visualization for ImgSeg by running KohonenGaussian (b), SOCEM (c) and (d), and SODAEM_C (e) and (f) with the random initialization in (a). The network structure is a 72 7 equally spaced square lattice in a unit square. (a) Random initialization. (b) KohonenGaussian ( = 0). (c) SOCEM ( = 0:06). (d) SOCEM( = 0). (e) SODAEM_C ( = 10; = 0:14). (f) SODAEM_C ( = 10; = 0).

more stable classification log likelihoods than CEM. These results are rational since SOCEM is a topology-constrained DA variant of the CEM algorithm, and SODAEM_C is an annealing variant of SOCEM with the settings for and here.

Next, we evaluated the data visualization ability of Koho-nenGaussian, SOCEM, and SODAEM_C. To visualize the data clusters on the network, each data sample was assigned to its winning reference model, and then randomly plotted within

the neuron that associates to the reference model [36]. Here, the winner selection strategy for SODAEM_C was the same as that of SOCEM (i.e., the C-step of SOCEM). Fig. 12 shows the projections of the data samples on 7 7 lattices obtained by different algorithms. The ImgSeg data set comprises seven classes, namely, brickface: B, sky: S, foliage: F, cement: C, window: W, path: P, and grass: G; each class consists of 300 data samples. Fig. 12(a) depicts the initial mapping of the data

(16)

TABLE III

RESULTS OFSIMULATIONSUSINGKOHONENGAUSSIAN, SOCEM, SOEM,

ANDSODAEMIN20 INDEPENDENTRANDOMINITIALIZATIONTRIALS ON

PENRECDIGITS_C0. THEALGORITHMSWERERUNWITHTWOSETUPS FOR INh . WHEN = 0:15, KOHONENGAUSSIANSUCCEEDED INCONVERGING TO ANORDEREDMAP INONERANDOMINITIALIZATIONCASE(S:1), BUT

FAILED IN THEREMAININGCASES(F:19)

Fig. 13. Learning a Gaussian mixture model by applying EM, DAEM_M, SOEM, and SODAEM_M to ImgSeg.

obtained with a random initialization for the reference models. As we can see from the figure, the data clusters are randomly projected to the neurons (lattice nodes) and the network does not preserve the topological (spatial) relationships among the clusters. Fig. 12(b)–(f) shows the results of the three PbSOM learning algorithms obtained with the random initialization in Fig. 12(a). We see that they can preserve the topological rela-tionships among the data clusters on the network. Moreover, it seems that the data samples of classes “S,” “G,” “P,” and “C” are more distinguishable and well grouped on the network than those of the other classes. In particular, from Fig. 12(b)–(d), we see that only class “S” is separated from the other classes with empty nodes; thus, we may infer that the separability between “S” and the other classes is higher than that between the remaining classes.

For SOCEM, as shown in Fig. 12(c) and (d), the network

con-tains less empty nodes at than at . This may

be because in the former case the lateral interactions have van-ished, and thus the reference models are adapted to better fit the data distribution than the latter case. Comparing Fig. 12(b) to Fig. 12(d), we see that the data projection results of Kohonen-Gaussian and SOCEM are rather different although they obtain

similar classification log likelihoods in Fig. 11. However, we can draw similar observations from the two figures. For example, the data samples of class “S” are more close to those of class “C” and “P” than those of class “G.” Fig. 12(e) and (f) shows the results obtained by SODAEM_C. We see that the result in Fig. 12(f) is rather different from that in Fig. 12(d) although

SO-DAEM_C has become equivalent to SOCEM when .

This may be because these two approaches search on the ob-jective function surface along different paths and converge to different local maxima, as the explanation for the difference of Figs. 9(f) and 10(f) shows in Section IV-A2.

2) Experiments on ImgSeg by Using SOEM and SO-DAEM_M: First, we evaluated the performance of SOEM

and SODAEM_M in learning a Gaussian mixture model with equal mixture weights. The objective function was the log mixture-likelihood function in (2) with equal mixture weights. We compared the performance with that of EM and DAEM_M. The setting for each algorithm was as follows.

• DAEM_M: The value of was set at 0.2 initially, and

in-creased to 1 by the formula .

• SOEM: The value of in was set at 0.6 initially, and reduced to 0 (i.e., ) in 0.02 decrements.

• SODAEM_M: Both the values of and in were set at 0.2 initially. To perform data clustering using the mixture-likelihood criterion, the value of was increased to 1 by

the formula first; then, the value of was

reduced to 0 in 0.02 decrements.

We ran DAEM_M, SOEM, and SODAEM_M with 20 inde-pendent random initialization trials. Similar to the experiments on CEM, we ran EM many trials until the accumulated exe-cution time was close to that of one SOEM trial. The mean and standard deviations (error bars) of the log mixtulikeli-hood values over the trials for each algorithm and the best re-sults of EM (denoted as EM-best) are shown in Fig. 13. From the figure, it is clear that DAEM_M, SOEM, and SODAEM_M achieve similar performance. Moreover, they obtain larger and more stable log mixture-likelihoods than EM. The results are rational since SOEM is a topology-constrained DA variant of the EM algorithm, and SODAEM_M is an annealing variant of SOEM with the settings for and here.

Next, we evaluated the data visualization ability of SOEM and SODAEM_M. We ran these two algorithms with a 7 7 lat-tice and the initial reference models used in Section IV-B1 for evaluating SOCEM; therefore, the initial projection of the data was the same as that shown in Fig. 12(a). When clustering the data samples, each sample was assigned to its winning reference model using SOCEM’s winner selection strategy. From Fig. 14, we observe that these two algorithms can preserve topological relationships among data clusters (samples). Similar to the re-sults revealed by Fig. 12, data samples of classes “S,” “G,” “P,” and “C” are more distinguishable than those of the other classes. Comparing Fig. 14(b) to Fig. 12(b) and (d), it is clear that SOEM produces less empty nodes than KohonenGaussian and SOCEM when the value of is reduced to zero. It may be ex-plained as follows. For KohonenGaussian and SOCEM, in the

case of , they become the CEM ( -means-type)

algo-rithm where each data sample only adapts its winner. However,

(17)

Fig. 14. Data visualization for ImgSeg by running SOEM (a) and (b) and SODAEM_M (c) and (d) with the random initialization in Fig. 12(a). The network struc-ture is a 72 7 equally spaced square lattice in a unit square. (a) SOEM ( = 0:06). (b) SOEM ( = 0). (c) SODAEM_M ( = 1; = 0:14). (d) SODAEM_M ( = 1; = 0).

data sample adapts all the reference models according to their posterior probabilities; thus, the models are more adapted to fit the data than the models of the other two algorithms.

3) Experiments on Ecoli: We conducted experiments on

Ecoli using the algorithms applied to ImgSeg in Sections IV-B1 and IV-B2. Fig. 15(a) and (b) shows the data clustering per-formance of each algorithm in terms of the classification log likelihood and the log mixture likelihood, respectively. Similar to the results on ImgSeg, the PbSOM learning algorithms also achieve decent data clustering performance on Ecoli.

In Fig. 16, for each algorithm, we show the result at the value that the class separability can be best visualized on the network. The Ecoli data set comprises eight classes, namely, cp: C, im: I, pp: P, imU: U, om: O, omL: M, imL: L, and imS: S. The numbers of data samples are 143, 77, 52, 35, 20, 5, 2, and 2, respectively. From the figure, we can see that topological relationships among data clusters are preserved well and data classes can be roughly separated on the network.

V. RELATION TOOTHERALGORITHMS

In this section, we explore the differences and relations be-tween the proposed algorithms and other related algorithms.

A. For SOCEM

In [37], Ambroise and Govaert proposed a topology pre-serving EM (TPEM) algorithm that introduces topological constraints in the CEM algorithm. If Kohonen’s winner se-lection strategy is applied, SOCEM is equivalent to TPEM whose mixture weights are equally fixed. In SOCEM, the co-variance matrix of a Gaussian component can have different parameterizations for different geometric interpretations [1].

When for (where is a small positive

constant and denotes the identity matrix), the clusters are spherical and of equal volume. In this case, the SOCEM algo-rithm is equivalent to the TVQ algoalgo-rithm in [22], which was developed for noisy vector quantization. It is also equivalent to the batch SOM learning algorithm described in [20], which employs an energy function in the learning phase of a SOM. However, SOCEM was developed from a different perspec-tive. We consider the learning of a PbSOM as a model-based clustering process. By this perspective, a coupling-likelihood mixture model is developed first, and an objective function is then formulated based on the classification-likelihood criterion. Moreover, the connection between the coupling-likelihood mixture model and the Gaussian mixture model helps interpret

(18)

Fig. 15. Data clustering performance on Ecoli in terms of (a) the classification log likelihood and (b) the log mixture likelihood.

SOCEM as a topology-constrained DA variant of the CEM algorithm for GMM.

B. For SOEM and SODAEM

In SODAEM, when for , SODAEM

is equivalent to the STVQ algorithm [23], which learns the pa-rameters by maximizing their density function predicted by the maximum entropy principle. In STVQ, the inverse temperature is the Lagrange multiplier introduced for the constrained op-timization induced by the maximum entropy principle. Heskes [25] extends TVQ’s cost function to an expected quantization error. Then, an objective function is obtained by weighting the quantization error with the inverse temperature and pulsing it to an entropy term that introduces the annealing process. With the resulting objective function, Heskes obtained an algorithm identical to STVQ. The implementations for deterministic an-nealing in STVQ and Heskes’ algorithm can also be found in [38] and [39], where the DA is applied for vector quantization.

SODAEM differs from Graepel et al.’s STVQ and Heskes’ algorithm in the following ways. First, the deterministic an-nealing processes are implemented differently. SODAEM is a DAEM algorithm developed to learn the mixture models with a deterministic annealing process, which is implemented based on predicting the posterior distribution in the E-step using the

maximum entropy principle. Second, the case of was

not well addressed in Graepel et al.’s and Heskes’ papers. This may be because their original goal was to develop a DA learning for TVQ. When is fixed at 1, however, SODAEM becomes the SOEM algorithm. Moreover, the connection between the proposed coupling-likelihood mixture model and the Gaussian mixture model helps interpret SOEM as a topology-constrained DA variant of the EM algorithm for GMM.

VI. CONCLUSION

Considering the learning of a PbSOM as a model-based clus-tering process, we develop a coupling-likelihood mixture model

for PbSOM, and derive three EM-type learning algorithms, namely, the SOCEM, SOEM, and SODAEM algorithms, for learning the model (PbSOM). The proposed algorithms improve Kohonen’s learning algorithms by including a cost function, an EM-based convergence property, and a proba-bilistic framework.

In addition, the proposed algorithms provide some insights into the choice of neighborhood size that would ensure topo-graphic ordering. From the experiment results, we observe that the learning performance of SOCEM is very sensitive to the initial setting of the reference models when the neighborhood is small. Conversely, it is not sensitive to the initial condition when the neighborhood is sufficiently large. To deal with the initialization problem, we first run SOCEM with a large neigh-borhood, and then gradually reduce the neighborhood size until the learning converges to the desired map. When using a small neighborhood, SOEM is less sensitive to the initialization than SOCEM. However, to learn an ordered map, SOEM still needs to start with a large neighborhood. In both SOCEM and SOEM, the neighborhood shrinking can be interpreted as an annealing process that overcomes the initialization issue. Alternatively, we can apply SODAEM, which is a deterministic annealing variant of SOCEM and SOEM, to learn a map. In our experiments, SODAEM overcomes the initialization issue of SOCEM and SOEM via the annealing process controlled by the temperature parameter. Moreover, through the comparison of SOCEM and Kohonen’s batch algorithm, we can also apply the DA interpre-tation of neighborhood shrinking to Kohonen’s algorithms to explain why they need to start with a large neighborhood size.

We have also shown that the SOCEM and SOEM algorithms can be interpreted, respectively, as topology-constrained de-terministic annealing variants of the CEM and EM algorithms for Gaussian model-based clustering. The experimental results show that our proposed PbSOM learning algorithms achieve an effective data clustering performance, while maintaining the topology-preserving property.

(19)

Fig. 16. Data visualization for Ecoli by running (b) KohonenGaussian, (c) SOCEM, (d) SOEM, (e) SODAEM_C, and (f) SODAEM_M with the random ini-tialization in (a). The network structure is a 72 7 equally spaced square lattice in a unit square. (a) Random initialization. (b) KohonenGaussian ( = 0:06). (c) SOCEM( = 0:06). (d) SOEM ( = 0:08). (e) SODAEM_C ( = 10; = 0:2). (f) SODAEM_M ( = 1; = 0:2).

APPENDIX

Theoretically, the mixture weights of the coupling-likelihood mixture model in (22) can be learned automatically. Following the derivations of the SOCEM, SOEM, and SODAEM algo-rithms in Sections III-B–III-D, the learning rules for the mixture weights are derived as follows.

• Posterior distribution:

— for SOCEM and SOEM

(20)

Fig. 17. Map-learning process obtained by running the SOCEM algorithm on the synthetic data with an ordered initialization in (a). (a)–(e) Simulation 1: The mixture weights are initialized at 1/16, and updated in the learning process; the algorithm starts with the initialization in (a) and converges to the unordered map in (e). (a) and (f) Simulation 2: SOCEM is performed with equal mixture weights throughout the learning process; the algorithm starts with the initialization in (a) and converges to the map in (f). The network structure is a 42 4 square lattice; the value of is set at 0.4. (a) Initialization. (b) Weights are updated, iter = 5. (c) Weights are updated,iter = 10. (d) Weights are updated, iter = 16. (e) Weights are updated, iter = 18. (f) Fixed equal weights.

TABLE IV

THEMIXTUREWEIGHTSLEARNED BYSOCEM WITH THEINITIALIZATION INFIG. 17(A). THEMIXTUREWEIGHTSAREINITIALIZED AT1/16

— for SODAEM (42) • Reestimation formulas: — for SOCEM (43) — for SOEM (44) — for SODAEM (45) The mean vectors and covariance matrices in SOCEM, SOEM, and SODAEM algorithms are updated using (27)-(28), (35)-(36), and (39)-(40), respectively, where and are computed by (41) and (42), respectively.

However, in our experience, if the mixture weights are learned in the three algorithms, the learning of topological order is frequently dominated by some particular mixture components, which makes it difficult to obtain an ordered map. As an example, we applied SOCEM to the synthetic data set, which consisted of 500 points uniformly distributed in a unit square. The network structure was a 4 4 equally spaced square lattice in a unit square. All the mixture weights

(21)

were set at 1/16 initially. The value of in the neighborhood function [i.e., (29)] was set at 0.4. The results are shown in Fig. 17(a)–(e). From the figures, we observe that the map shrinks to near a line after the algorithm converges (with 18 iterations). This phenomenon can be verified by inspecting the values of mixture weights during the learning process. As shown in Table IV, after the algorithm converges, most of the mixture weights become zero and the learning only maximizes the local coupling likelihoods of neurons 4 and 13, whose mixture weights are 0.504 and 0.496, respectively. In contrast, as shown in Fig. 17(f), if the mixture weights are equally fixed at 1/16 throughout the learning process, SOCEM converges to an ordered map. For SOEM and SODAEM, we obtained the similar results.

REFERENCES

[1] C. Fraley and A. E. Raftery, “How many clusters? Which clustering method? Answers via model-based cluster analysis,” Comput. J., vol. 41, pp. 578–588, 1998.

[2] C. Fraley and A. E. Raftery, “Model-based clustering, discriminant analysis, and density estimation,” J. Amer. Statist. Assoc., vol. 97, no. 458, pp. 611–631, 2002.

[3] S. Zhong and J. Ghosh, “A unified framework for model-based clus-tering,” J. Mach. Learn. Res., vol. 4, no. 6, pp. 1001–1037, 2003. [4] C. Fraley and A. E. Raftery, “Bayesian regularization for normal

mix-ture estimation and model-based clustering,” J. Classification, vol. 24, no. 2, pp. 155–181, 2007.

[5] M. S. Oh and A. E. Raftery, “Model-based clustering with dissimilari-ties: A Bayesian approach,” J. Comput. Graph. Statist., vol. 16, no. 3, pp. 559–585, 2007.

[6] M. J. Symons, “Clustering criteria and multivariate normal mixture,”

Biometrics, vol. 37, pp. 35–43, 1981.

[7] S. Ganesalingam, “Classification and mixture approach to clustering via maximum likelihood,” Appl. Statist., vol. 38, no. 3, pp. 455–466, 1989.

[8] G. Celeux and G. Govaert, “A classification EM algorithm for clus-tering and two stochastic versions,” Comput. Statist. Data Anal., vol. 14, no. 3, pp. 315–332, 1992.

[9] J. D. Banfield and A. E. Raftery, “Model-based Gaussian and non-Gaussian clustering,” Biometrics, vol. 49, no. 3, pp. 803–821, 1993. [10] J. A. Bilmes, “A gentle tutorial of the EM algorithm and its

applica-tion to parameter estimaapplica-tion for Gaussian mixture and hidden Markov models,” Int. Comput. Sci. Inst., Berkeley, CA, Tech. Rep. TR-97-021, 1998.

[11] G. J. McLachlan and T. Krishnan, The EM Algorithm and

Exten-sions. New York: Wiley, 1997.

[12] N. Ueda and R. Nakano, “Deterministic annealing EM algorithm,”

Neural Netw., vol. 11, no. 2, pp. 271–282, 1998.

[13] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton, “SMEM al-gorithm for mixture models,” Neural Comput., vol. 12, no. 9, pp. 2109–2128, 2000.

[14] S. S. Cheng, H. M. Wang, and H. C. Fu, “A model-selection-based self-splitting Gaussian mixture learning with application to speaker identification,” EURASIP J. Appl. Signal Process., vol. 2004, no. 17, pp. 2626–2639, 2004.

[15] T. Kohonen, Self-Organizing Maps. New York: Springer-Verlag, 2001.

[16] T. Kohonen, “The self-organizing maps,” Neurocomputing, vol. 21, pp. 1–6, 1998.

[17] C. Bishop, M. Svensén, and C. Williams, “The generative topographic mapping,” Neural Comput., vol. 10, no. 1, pp. 215–234, 1998. [18] V. V. Tolat, “An analysis of Kohonen’s self-organizing maps using a

system of energy functions,” Biol. Cybern., vol. 64, no. 2, pp. 155–164, 1990.

[19] E. Erwin, K. Obermayer, and K. Schulten, “Self-organizing maps: Or-dering, convergence properties and energy functions,” Biol. Cybern., vol. 67, no. 1, pp. 47–55, 1992.

[20] Y. Cheng, “Convergence and ordering of Kohonen’s batch map,”

Neural Comput., vol. 9, no. 8, pp. 1667–1676, 1997.

[21] S. P. Luttrell, “Self-organization: A derivation from fist principles of a class of learning algorithm,” in Proc. IEEE Int. Joint Conf. Neural

Netw., 1989, pp. II-495–II-498.

[22] S. P. Luttrell, “Code vector density in topographic mappings: Scalar case,” IEEE Trans. Neural Netw., vol. 2, no. 4, pp. 427–436, Jul. 1991. [23] T. Graepel, M. Burger, and K. Obermayer, “Phase transitions in stochastic self-organization maps,” Phys. Rev. E, Stat. Phys. Plasmas

Fluids Relat. Interdiscip. Top., vol. 56, no. 4, pp. 3876–3890, 1997.

[24] T. Graepel, M. Burger, and K. Obermayer, “Self-organizing maps: Generalizations and new optimization techniques,” Neurocomputing, vol. 21, pp. 173–190, 1998.

[25] T. Heskes, “Self-organizing maps, vector quantization, and mixture modeling,” IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1299–1305, Nov. 2001.

[26] T. W. S. Chow and S. Wu, “An online cellular probabilistic self-or-ganizing map for static and dynamical data sets,” IEEE Trans. Circuit

Syst. I, Reg. Papers, vol. 51, no. 4, pp. 732–747, Apr. 2004.

[27] S. Wu and T. W. S. Chow, “PRSOM: A new visualization method by hybridizing multidimensional scaling and self-organizing map,” IEEE

Trans. Neural Netw., vol. 16, no. 6, pp. 1362–1380, Nov. 2005.

[28] S. P. Luttrell, “A Bayesian analysis of self-organizing maps,” Neural

Comput., vol. 6, no. 5, pp. 767–794, 1994.

[29] F. Anouar, F. Badran, and S. Thiria, “Probabilistic self-organizing map and radial basis function networks,” Neurocomputing, vol. 20, pp. 93–96, 1998.

[30] J. Lampinen and T. Kostiainen, “Generative probability density model in the self-organizing map,” in Self-Organizing Neural Networks:

Re-cent Advances and Applications, U. Seiffert and L. Jain, Eds. Berlin, Germany: Physica Verlag, 2002, pp. 75–94.

[31] M. M. Van Hulle, “Joint entropy maximization in kernel-based topo-graphic maps,” Neural Comput., vol. 14, no. 8, pp. 1887–1906, 2002. [32] M. M. Van Hulle, “Maximum likelihood topographic map formation,”

Neural Comput., vol. 17, no. 3, pp. 503–513, 2005.

[33] J. J. Verbeek, N. Vlassis, and B. J. A. Kröse, “Self-organizing mixture models,” Neurocomputing, vol. 63, pp. 99–123, 2005.

[34] J. Sum, C. S. Leung, L. W. Chan, and L. Xu, “Yet another algorithm which can generate topography map,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 1204–1207, Sep. 1997.

[35] Univ. California Irvine, UCI Machine Learning Repository, Irvine, CA [Online]. Available: http://archive.ics.uci.edu/ml/

[36] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical

Learning. New York: Springer-Verlag, 2001.

[37] C. Ambroise and G. Govaert, “Constrained clustering and Kohonen self-organizing maps,” J. Classification, vol. 13, no. 2, pp. 299–313, 1996.

[38] K. Rose, E. Gurewitz, and G. C. Fox, “Vector quantization by determin-istic annealing,” IEEE Trans. Inf. Theory, vol. 38, no. 4, pp. 1249–1257, Jul. 1992.

[39] K. Rose, “Deterministic annealing for clustering, compression, classi-fication, regression, and related optimization problems,” Proc. IEEE, vol. 86, no. 11, pp. 2210–2239, Nov. 1998.

Shih-Sian Cheng received the B.S. degree in

mathematics from National Kaohsiung Normal University, Kaohsiung, Taiwan, in 1999 and the M.S. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2002. He is currently working towards the Ph.D. degree at the Department of Computer Science, National Chiao Tung University.

In 2002, he joined the Spoken Language Group, Chinese Information Processing Laboratory, In-stitute of Information Science, Academia Sinica, Taipei, Taiwan, as a Research Assistant. His research interests include machine learning, pattern recognition, speech processing, and neural networks.