The self-organizing process of PbSOM can be described as a model-based data clustering procedure that preserves the spatial relationships between the data samples and clusters in a network. Based on the classification likelihood criterion for data clustering [17], the computation of the coupling-likelihood of a data sample is restricted to its winning neuron. Thus, the goal is to estimate the partition of X , ˆP = { ˆP1, ˆP2, · · · , ˆPG}, and the set of reference models, ˆΘ, so as to maximize the accumulated classification log-likelihood over all the data samples as follows:
Cs(P, Θ; X , h) =
XG k=1
X
xi∈Pk
log(ws(k)ps(xi|k; Θ, h))
=
XG
k=1
X
xi∈Pk
log(ws(k) exp(
XG
l=1
hkllog rl(xi; θl))). (4.7)
As ws(k) for k = 1, 2, · · · , G is fixed at 1/G, the objective function can be rewritten as
Cs(P, Θ; X , h) =
XG k=1
X
xi∈Pk
XG l=1
hkllog rl(xi; θl) + Const. (4.8)
Similar to the derivation of the classification EM (CEM) algorithm for model-based clus-tering in [17], the CEM algorithm for the proposed PbSOM, i.e., the SOCEM algorithm, is derived as follows.
E-step: Given the current reference model set, Θ(t), compute the posterior probability
of each mixture component of ps(xi; Θ(t), h) for each xi as follows:
γk|i(t) = ps(k|xi; Θ(t), h)
= ps(xi, k; Θ(t), h) ps(xi; Θ(t), h)
= exp(PGl=1hkllog rl(xi; θ(t)l ))
PG
j=1exp(PGl=1hjllog rl(xi; θ(t)l )), (4.9) for k = 1, 2, · · · , G, and i = 1, 2, · · · , N .
C-step: Assign each xi to the cluster whose corresponding mixture component has the largest posterior probability for xi, i.e., xi ∈ ˆPj(t) if j = arg maxkγk|i(t).
M-step: After the C-step, the partition of X (i.e., ˆP(t)) is formed, and the objective function Cs defined in Eq. (4.8) becomes
Cs(Θ; ˆP(t), X , h) =
XG
l=1
XG
k=1
X
xi∈ ˆPk(t)
hkllog rl(xi; θl) + Const. (4.10)
Similar to the derivation of the M-step of the EM algorithm for learning a Gaussian mixture model [20], we can obtain the re-estimation formulae for the mean vectors and covariance matrices by taking the derivative of Cs with respect to individual parameters, and then setting it to zero. The re-estimation formulae are as follows:
µ(t+1)l =
PG
k=1
P
xi∈ ˆPk(t)hklxi
PG
k=1| ˆPk(t)|hkl , (4.11)
Σ(t+1)l =
PG
k=1
P
xi∈ ˆPk(t)hkl(xi− µ(t+1)l )(xi− µ(t+1)l )T
PG
k=1| ˆPk(t)|hkl
(4.12) for l = 1, 2, · · · , G. When the neighborhood size is reduced to zero (i.e., hkl=δkl), SOCEM reduces to the CEM algorithm for learning GMMs with equal mixture weights, as in Eqs.
(2.25)-(2.26).
4.2.1 SOCEM - a DA variant of CEM for GMM
Similar to Kohonen’s sequential or batch algorithm, the SOCEM algorithm is applied in two stages. First, it is applied to a large neighborhood to form an ordered map near the center of the data samples. Then, the reference models are adapted to fit the distribution of the data samples by gradually shrinking the neighborhood.
Without loss of generality, we suppose the neighborhood function is the widely adopted (unnormalized) Gaussian kernel in Eq. (2.4). As shown in Algorithm 3, initially, SOCEM
is applied with a large σ value, which is reduced after the algorithm converges. Then, we use the new σ value and the learned parameters as the initial condition of the next learning phase. This process is repeated until the value of σ is reduced to the pre-defined minimum value σmin. The above shrinking of the neighborhood (reduction of the σ value) can be interpreted as an annealing process, where a large σ value corresponds to a high temperature. Table 4.1 lists the learning rules of the DAEM algorithm for learning GMMs with equal mixture weights [23] and the SOCEM algorithm. To facilitate the interpretation, we rewrite the objective function and re-estimation formulae of SOCEM in Eq. (4.8) and Eqs. (4.11)-(4.12), respectively, with the new variable wini, which denotes the index of the winning neuron of xi. For simplicity, we only list the re-estimation formulae of the mean vectors of the Gaussian components.
By analyzing these two algorithms carefully, one may view hwin(t)
i las a kind of posterior probability of θ(t)l for xi in the network domain. More precisely, xi is initially projected into rwin(t)
i in the network domain; then, rwin(t)
i is applied to Eq. (2.4) as an observation of the Gaussian kernel centered at rl to obtain the value of hwin(t)
i l. In both the DAEM and SOCEM algorithms, when the temperature (1/β or σ) is high, the posterior distribution becomes almost uniform; hence, all the reference models will be moved to locations near the center of the data samples in this learning phase. By gradually reducing the tempera-ture, the influence of each xi becomes more localized, and the reference models gradually spread out to fit the distribution of the data samples. When the temperature approaches zero, the probabilistic assignment strategy for the data samples becomes the winner-take-all strategy, and the objective functions and learning rules of DAEM and SOCEM are equivalent to those of CEM. The major difference between DAEM and SOCEM seems to be that the posterior distribution in SOCEM is constrained by the network topology, but DAEM does not have this property.
To visualize the transition of the objective function, we show a simulation on a simple one-dimension, two-component Gaussian mixture problem in Figure 4.22. The training data contains 200 observations drawn from
p(x; {m1, v1}, {m2, v2}) = 0.3 v1√
2π exp(−(x − m1)2
2v12 ) + 0.7 v2√
2πexp(−(x − m2)2
2v22 ), (4.13) where the Gaussian means are (m1,m2)=(-5,5); and the Gaussian variances are (v12,v22)=(1, 1). The PbSOM network structure is a 1 × 2 lattice in [0,1]. The two reference models are θ1 = {µ1, Σ1} and θ2 = {µ2, Σ2}, where Σ1 = Σ2 = 1. The objective function in Eq. (4.7) is calculated with different setups for (µ1,µ2) to form the log-likelihood surface.
From Figure 4.2, we observe that a larger σ for hkl yields a simpler objective function for optimization. The log-likelihood surface is symmetric along µ1=µ2 because of the
2Visualization of how deterministic annealing EM/CEM works for function optimization is illustrated in detail in [23].
Algorithm 3 The SOCEM algorithm with a shrinking neighborhood size (σ) Require: X = {x1, x2, · · · , xN}: the input data set;
σini: the initial σ value for hkl in Eq. (2.4);
ε: the decreasing step for σ;
σmin: the target σ value;
Θ(0) = {θ(0)1 , θ(0)2 , · · · , θ(0)G }: the initial reference models, where θ(0)l = {µ(0)l ,Σ(0)l } are the initial mean vector and covariance matrix of the lth Gaussian component
Ensure: ˆΘ = { ˆθ1, ˆθ2, · · · , ˆθG}: the estimated parameter set, where ˆθl = {ˆµl, ˆΣl} are the estimated mean vector and covariance matrix of the lth Gaussian component Begin
1. ˆΘ←Θ(0); σ ← σini;
2. create the lookup table for hkl; 3. //CEM:
repeat
E-step: for i ← 1, 2, · · · , N and k ← 1, 2, · · · , G, compute γk|i(t) in Eq. (4.9) using ˆΘ;
C-step: assign xi to ˆPj(t) if j = arg maxkγk|i(t);
M-step: for l ← 1, 2, · · · , G, update ˆµl and ˆΣl with Eqs. (4.11)-(4.12);
until the convergence condition is met 4. if (σ = σmin)
goto End;
σ ← σ − ε;
if (σ < σmin) σ ← σmin; goto 2.;
End
symmetric lattice structure and equal weighting of the reference models. For the case of σ = 0.6, the log-likelihood value is close to the global maximum of the surface when both µ1 and µ2 are close to the center of the data (2.39 in this case). With the reduction in the value of σ, the location of (µ1,µ2) for the global maximum moves toward (m1,m2) and (m2,m1).
4.2.2 Relation to Kohonen’s batch algorithm
There are two differences between the SOCEM algorithm and Kohonen’s batch algorithm.
First, SOCEM considers the neighborhood information when selecting the winning neu-ron, but Kohonen’s algorithm does not. Second, SOCEM extends the reference vectors in Kohonen’s algorithm with multivariate Gaussians. In other words, if we set γk|i(t) in
Table 4.1: The DAEM algorithm for learning GMMs with equal mixture weights and the SOCEM algorithm.
Algorithm DAEM SOCEM
Objective function Fβ(Θ; X ) in Eq. (2.32) PN
i=1
PG
l=1hwinillog rl(xi; θl) + Const where p(xi, l; Θ) = G1rl(xi; θl)
Posterior distribution f (l|xi; Θ(t)) = PGrl(xi;θ(t)l )β
j=1rj(xi;θ(t)j )β hwin(t)
i l= exp(−
krwin(t) i
−rlk2 2σ2 ) l = 1, 2, · · · , G l = 1, 2, · · · , G
Temperature 1/β σ
Re-estimation formulae µ(t+1)l = PN
i=1f (l|xi;Θ(t))xi
PN
i=1f (l|xi;Θ(t)) µ(t+1)l = PN
i=1h
win(t) i lxi
PN
i=1h
win(t) i l
l = 1, 2, · · · , G l = 1, 2, · · · , G
SOCEM to rk(xi;θ(t)k )
PG
j=1rj(xi;θ(t)j ), instead of the setting in Eq. (4.9), we obtain a probabilistic variant of Kohonen’s batch algorithm (denoted as KohonenGaussian), where Kohonen’s winner selection strategy is applied and the reference vectors are replaced with multivari-ate Gaussians. Thus, we may view KohonenGaussian as an approximmultivari-ate implementation of SOCEM that optimizes SOCEM’s objective function. Moreover, if we set the covariance matrices in KohonenGaussian to be diagonal with small, identical variances, Kohonen-Gaussian is equivalent to Kohonen’s batch algorithm. Therefore, we can interpret the neighborhood shrinking of Kohonen’s algorithms as a deterministic annealing process, and thereby explain why they need to start with a large neighborhood size.
Recently, Zhong and Ghosh [12] interpreted the neighborhood size of the SOM algo-rithms that apply Kohonen’s winner selection strategy as a temperature parameter in a deterministic annealing process. However, their interpretations were not based on the optimization of an objective function, which is the essential part of DA-based optimiza-tion. In contrast, in SOCEM, the neighborhood shrinking leads to the transition of the objective function from a simpler one to a more complex one, as illustrated in Figure 4.2.
4.2.3 Computational cost
It is clear from Table 4.1 that the computational cost of DAEM is O(GNM ), where G, N, and M are the numbers of reference models, data samples, and learning iterations, respectively. Compared to DAEM, SOCEM needs additional O(G2N) multiplication and addition operations for winner selection in each iteration, while KohonenGaussian needs additional O(GN) multiplications and additions.