The SOCEM algorithm for learning PbSOMs - A Study of Probabilistic Model-Based Clustering and Its Applications (機率式模型分群法之研究與其應用)

The self-organizing process of PbSOM can be described as a model-based data clustering procedure that preserves the spatial relationships between the data samples and clusters in a network. Based on the classification likelihood criterion for data clustering [17], the computation of the coupling-likelihood of a data sample is restricted to its winning neuron. Thus, the goal is to estimate the partition of $X$, $\hat{P} = \{\hat{P}_1, \hat{P}_2, \cdots, \hat{P}_G\}$, and the set of reference models, $\hat{\Theta}$, so as to maximize the accumulated classification log-likelihood over all the data samples as follows:

$$
\begin{aligned}
C_s(P, \Theta; X, h) &= \sum_{k=1}^{G} \sum_{x_i \in P_k} \log\bigl(w_s(k)\, p_s(x_i \mid k; \Theta, h)\bigr) \\
&= \sum_{k=1}^{G} \sum_{x_i \in P_k} \log\Bigl(w_s(k) \exp\Bigl(\sum_{l=1}^{G} h_{kl} \log r_l(x_i; \theta_l)\Bigr)\Bigr).
\end{aligned}
\tag{4.7}
$$

As w_s(k) for k = 1, 2, · · · , G is fixed at 1/G, the objective function can be rewritten as

$$
C_s(P, \Theta; X, h) = \sum_{k=1}^{G} \sum_{x_i \in P_k} \sum_{l=1}^{G} h_{kl} \log r_l(x_i; \theta_l) + \mathrm{Const}. \tag{4.8}
$$
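As an illustration of Eq. (4.8), the following minimal Python sketch evaluates the objective (without the constant) for a given partition. It assumes one-dimensional Gaussian reference models for brevity; the function names `log_gauss` and `classification_objective`, and the representation of the partition as a list of winner indices, are illustrative choices and not part of the original text.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classification_objective(data, labels, means, variances, h):
    """Eq. (4.8) without the constant term.

    labels[i] is the index k of the cluster P_k that contains data[i];
    h[k][l] is the neighborhood value h_kl (assumed to be precomputed).
    """
    G = len(means)
    total = 0.0
    for x, k in zip(data, labels):
        # Inner sum over l: h_kl * log r_l(x_i; theta_l)
        total += sum(h[k][l] * log_gauss(x, means[l], variances[l])
                     for l in range(G))
    return total
```

With `h` set to the identity matrix, the value reduces to the ordinary classification log-likelihood of each sample under its own cluster, up to the constant.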

Similar to the derivation of the classification EM (CEM) algorithm for model-based clustering in [17], the CEM algorithm for the proposed PbSOM, i.e., the SOCEM algorithm, is derived as follows.

E-step: Given the current reference model set, $\Theta^{(t)}$, compute the posterior probability of each mixture component of $p_s(x_i; \Theta^{(t)}, h)$ for each $x_i$ as follows:

$$
\gamma_{k|i}^{(t)} = p_s(k \mid x_i; \Theta^{(t)}, h)
= \frac{p_s(x_i, k; \Theta^{(t)}, h)}{p_s(x_i; \Theta^{(t)}, h)}
= \frac{\exp\bigl(\sum_{l=1}^{G} h_{kl} \log r_l(x_i; \theta_l^{(t)})\bigr)}{\sum_{j=1}^{G} \exp\bigl(\sum_{l=1}^{G} h_{jl} \log r_l(x_i; \theta_l^{(t)})\bigr)}, \tag{4.9}
$$

for $k = 1, 2, \cdots, G$, and $i = 1, 2, \cdots, N$.

C-step: Assign each $x_i$ to the cluster whose corresponding mixture component has the largest posterior probability for $x_i$, i.e., $x_i \in \hat{P}_j^{(t)}$ if $j = \arg\max_k \gamma_{k|i}^{(t)}$.
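The E-step and C-step can be sketched in Python as follows. This is a minimal illustration under the assumption of one-dimensional Gaussian reference models; the names `e_step` and `c_step` are hypothetical. The exponentials in Eq. (4.9) are evaluated after subtracting the largest exponent, which leaves the ratio unchanged and avoids numerical underflow.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def e_step(x, means, variances, h):
    """Posterior gamma_{k|i} of Eq. (4.9) for a single sample x."""
    G = len(means)
    log_r = [log_gauss(x, means[l], variances[l]) for l in range(G)]
    # Exponent of neuron k: sum_l h_kl * log r_l(x)
    scores = [sum(h[k][l] * log_r[l] for l in range(G)) for k in range(G)]
    top = max(scores)  # shift by the maximum before exponentiating
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]

def c_step(gamma):
    """Index of the winning neuron, i.e. arg max_k gamma_{k|i}."""
    return max(range(len(gamma)), key=gamma.__getitem__)
```

Because the denominator of Eq. (4.9) is shared by all k, the winner can equally be found from the unnormalized exponents; the posterior itself is only needed if its values are to be inspected.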

M-step: After the C-step, the partition of $X$ (i.e., $\hat{P}^{(t)}$) is formed, and the objective function $C_s$ defined in Eq. (4.8) becomes

$$
C_s(\Theta; \hat{P}^{(t)}, X, h) = \sum_{l=1}^{G} \sum_{k=1}^{G} \sum_{x_i \in \hat{P}_k^{(t)}} h_{kl} \log r_l(x_i; \theta_l) + \mathrm{Const}. \tag{4.10}
$$

Similar to the derivation of the M-step of the EM algorithm for learning a Gaussian mixture model [20], we can obtain the re-estimation formulae for the mean vectors and covariance matrices by taking the derivative of C_s with respect to individual parameters, and then setting it to zero. The re-estimation formulae are as follows:

$$
\mu_l^{(t+1)} = \frac{\sum_{k=1}^{G} \sum_{x_i \in \hat{P}_k^{(t)}} h_{kl}\, x_i}{\sum_{k=1}^{G} |\hat{P}_k^{(t)}|\, h_{kl}}, \tag{4.11}
$$

$$
\Sigma_l^{(t+1)} = \frac{\sum_{k=1}^{G} \sum_{x_i \in \hat{P}_k^{(t)}} h_{kl}\, (x_i - \mu_l^{(t+1)})(x_i - \mu_l^{(t+1)})^T}{\sum_{k=1}^{G} |\hat{P}_k^{(t)}|\, h_{kl}}, \tag{4.12}
$$

for $l = 1, 2, \cdots, G$. When the neighborhood size is reduced to zero (i.e., $h_{kl} = \delta_{kl}$), SOCEM reduces to the CEM algorithm for learning GMMs with equal mixture weights, as in Eqs. (2.25)-(2.26).
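The re-estimation formulae in Eqs. (4.11)-(4.12) can be sketched in Python as below. The sketch assumes scalar data, so the covariance matrix becomes a variance; the function name `m_step` and the encoding of the partition as a list of winner indices are illustrative. It uses the fact that the denominator, the sum over k of $|\hat{P}_k^{(t)}| h_{kl}$, equals the sum over all samples of the neighborhood value between each sample's winner and neuron l.

```python
def m_step(data, labels, h):
    """Update means and variances following Eqs. (4.11)-(4.12), 1-D case.

    labels[i] is the winning neuron of data[i]; h[k][l] is h_kl.
    """
    G = len(h)
    means, variances = [], []
    for l in range(G):
        # Weight of sample i for model l is h between its winner and l.
        weights = [h[k][l] for k in labels]
        # norm is zero only if no sample has a winner k with h_kl > 0.
        norm = sum(weights)
        mu = sum(w * x for w, x in zip(weights, data)) / norm
        var = sum(w * (x - mu) ** 2 for w, x in zip(weights, data)) / norm
        means.append(mu)
        variances.append(var)
    return means, variances
```

With `h` equal to the identity matrix, each model is fitted only to the samples of its own cluster, which matches the reduction to CEM stated above; with all entries of `h` equal, every model receives the global mean and variance.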

4.2.1 SOCEM - a DA variant of CEM for GMM

Similar to Kohonen’s sequential or batch algorithm, the SOCEM algorithm is applied in two stages. First, it is applied to a large neighborhood to form an ordered map near the center of the data samples. Then, the reference models are adapted to fit the distribution of the data samples by gradually shrinking the neighborhood.

Without loss of generality, we suppose the neighborhood function is the widely adopted (unnormalized) Gaussian kernel in Eq. (2.4). As shown in Algorithm 3, initially, SOCEM is applied with a large σ value, which is reduced after the algorithm converges. Then, we use the new σ value and the learned parameters as the initial condition of the next learning phase. This process is repeated until the value of σ is reduced to the pre-defined minimum value σ_min. The above shrinking of the neighborhood (reduction of the σ value) can be interpreted as an annealing process, where a large σ value corresponds to a high temperature. Table 4.1 lists the learning rules of the DAEM algorithm for learning GMMs with equal mixture weights [23] and the SOCEM algorithm. To facilitate the interpretation, we rewrite the objective function and re-estimation formulae of SOCEM in Eq. (4.8) and Eqs. (4.11)-(4.12), respectively, with the new variable win_i, which denotes the index of the winning neuron of x_i. For simplicity, we only list the re-estimation formulae of the mean vectors of the Gaussian components.

By analyzing these two algorithms carefully, one may view $h_{\mathrm{win}_i^{(t)} l}$ as a kind of posterior probability of $\theta_l^{(t)}$ for $x_i$ in the network domain. More precisely, $x_i$ is initially projected into $\mathbf{r}_{\mathrm{win}_i^{(t)}}$ in the network domain; then, $\mathbf{r}_{\mathrm{win}_i^{(t)}}$ is applied to Eq. (2.4) as an observation of the Gaussian kernel centered at $\mathbf{r}_l$ to obtain the value of $h_{\mathrm{win}_i^{(t)} l}$. In both the DAEM and SOCEM algorithms, when the temperature (1/β or σ) is high, the posterior distribution becomes almost uniform; hence, all the reference models will be moved to locations near the center of the data samples in this learning phase. By gradually reducing the temperature, the influence of each x_i becomes more localized, and the reference models gradually spread out to fit the distribution of the data samples. When the temperature approaches zero, the probabilistic assignment strategy for the data samples becomes the winner-take-all strategy, and the objective functions and learning rules of DAEM and SOCEM are equivalent to those of CEM. The major difference between DAEM and SOCEM seems to be that the posterior distribution in SOCEM is constrained by the network topology, but DAEM does not have this property.
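The role of σ as a temperature can be illustrated with a short Python sketch of the unnormalized Gaussian neighborhood kernel, written here in the form that appears in Table 4.1. The function name `neighborhood` and the tuple encoding of the lattice coordinates are assumptions made for this illustration only.

```python
import math

def neighborhood(coords, sigma):
    """Lookup table h[k][l] = exp(-||r_k - r_l||^2 / (2 sigma^2)).

    coords[k] is the lattice coordinate of neuron k, given as a tuple.
    """
    return [[math.exp(-math.dist(rk, rl) ** 2 / (2.0 * sigma ** 2))
             for rl in coords]
            for rk in coords]

# A 1 x 2 lattice in [0, 1], as in the simulation described in this section.
coords = [(0.0,), (1.0,)]
hot = neighborhood(coords, 10.0)   # large sigma: rows are almost uniform
cold = neighborhood(coords, 0.1)   # small sigma: h is close to delta_kl
```

At high temperature every row of the table is nearly constant, so all reference models receive almost the same weighted data; at low temperature the table approaches the identity and only the winner is updated.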

To visualize the transition of the objective function, we show a simulation on a simple one-dimensional, two-component Gaussian mixture problem in Figure 4.2 (see footnote 2). The training data contains 200 observations drawn from

$$
p(x; \{m_1, v_1\}, \{m_2, v_2\}) = \frac{0.3}{v_1 \sqrt{2\pi}} \exp\Bigl(-\frac{(x - m_1)^2}{2 v_1^2}\Bigr) + \frac{0.7}{v_2 \sqrt{2\pi}} \exp\Bigl(-\frac{(x - m_2)^2}{2 v_2^2}\Bigr), \tag{4.13}
$$

where the Gaussian means are $(m_1, m_2) = (-5, 5)$, and the Gaussian variances are $(v_1^2, v_2^2) = (1, 1)$. The PbSOM network structure is a 1 × 2 lattice in [0,1]. The two reference models are $\theta_1 = \{\mu_1, \Sigma_1\}$ and $\theta_2 = \{\mu_2, \Sigma_2\}$, where $\Sigma_1 = \Sigma_2 = 1$. The objective function in Eq. (4.7) is calculated with different setups for $(\mu_1, \mu_2)$ to form the log-likelihood surface.
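A setup of this kind can be approximated with the following Python sketch, which draws 200 samples from the mixture in Eq. (4.13) and evaluates Eq. (4.7) at a given (µ1, µ2). Several details are assumptions rather than facts from the text: the random seed, the lattice coordinates 0 and 1 for the two neurons, the kernel form taken from Table 4.1, and the choice to assign each sample to the neuron that maximizes its coupling likelihood when evaluating the objective.

```python
import math
import random

def log_gauss(x, mean, var=1.0):
    """Log-density of a 1-D Gaussian (variance fixed at 1 by default)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def objective(data, mu, sigma, coords=(0.0, 1.0)):
    """Eq. (4.7) with w_s(k) = 1/G and each sample placed in its best cluster."""
    G = len(mu)
    h = [[math.exp(-(rk - rl) ** 2 / (2.0 * sigma ** 2)) for rl in coords]
         for rk in coords]
    total = 0.0
    for x in data:
        log_r = [log_gauss(x, m) for m in mu]
        # Coupling log-likelihood of the winning neuron for this sample.
        best = max(sum(h[k][l] * log_r[l] for l in range(G)) for k in range(G))
        total += math.log(1.0 / G) + best
    return total

rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
data = [rng.gauss(-5.0, 1.0) if rng.random() < 0.3 else rng.gauss(5.0, 1.0)
        for _ in range(200)]
```

Evaluating `objective(data, (a, b), sigma)` over a grid of (a, b) values for several values of `sigma` gives surfaces of the kind discussed here; exact numbers will differ from the figure because the samples differ.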

From Figure 4.2, we observe that a larger σ for h_kl yields a simpler objective function for optimization. The log-likelihood surface is symmetric along µ₁ = µ₂ because of the symmetric lattice structure and equal weighting of the reference models. For the case of σ = 0.6, the log-likelihood value is close to the global maximum of the surface when both µ₁ and µ₂ are close to the center of the data (2.39 in this case). With the reduction in the value of σ, the location of (µ₁, µ₂) for the global maximum moves toward (m₁, m₂) and (m₂, m₁).

Footnote 2: Visualization of how deterministic annealing EM/CEM works for function optimization is illustrated in detail in [23].

Algorithm 3 The SOCEM algorithm with a shrinking neighborhood size (σ)

Require:
    $X = \{x_1, x_2, \cdots, x_N\}$: the input data set;
    $\sigma_{ini}$: the initial σ value for $h_{kl}$ in Eq. (2.4);
    $\epsilon$: the decreasing step for σ;
    $\sigma_{min}$: the target σ value;
    $\Theta^{(0)} = \{\theta_1^{(0)}, \theta_2^{(0)}, \cdots, \theta_G^{(0)}\}$: the initial reference models, where $\theta_l^{(0)} = \{\mu_l^{(0)}, \Sigma_l^{(0)}\}$ are the initial mean vector and covariance matrix of the lth Gaussian component
Ensure:
    $\hat{\Theta} = \{\hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_G\}$: the estimated parameter set, where $\hat{\theta}_l = \{\hat{\mu}_l, \hat{\Sigma}_l\}$ are the estimated mean vector and covariance matrix of the lth Gaussian component
Begin
1. $\hat{\Theta} \leftarrow \Theta^{(0)}$; $\sigma \leftarrow \sigma_{ini}$;
2. create the lookup table for $h_{kl}$;
3. // CEM:
   repeat
       E-step: for $i \leftarrow 1, 2, \cdots, N$ and $k \leftarrow 1, 2, \cdots, G$, compute $\gamma_{k|i}^{(t)}$ in Eq. (4.9) using $\hat{\Theta}$;
       C-step: assign $x_i$ to $\hat{P}_j^{(t)}$ if $j = \arg\max_k \gamma_{k|i}^{(t)}$;
       M-step: for $l \leftarrow 1, 2, \cdots, G$, update $\hat{\mu}_l$ and $\hat{\Sigma}_l$ with Eqs. (4.11)-(4.12);
   until the convergence condition is met
4. if ($\sigma = \sigma_{min}$) goto End;
   $\sigma \leftarrow \sigma - \epsilon$;
   if ($\sigma < \sigma_{min}$) $\sigma \leftarrow \sigma_{min}$;
   goto 2.;
End
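The control flow of Algorithm 3 can be sketched in Python as follows. This is a simplified illustration, not the original implementation: it assumes scalar data and one-dimensional lattice coordinates, takes "partition unchanged" as the convergence condition, and adds an iteration cap and a variance floor, none of which are specified in the text. The function name `socem` and its parameters are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def socem(data, coords, means, variances, sigma_ini, eps, sigma_min,
          max_iter=100, var_floor=1e-6):
    """Run CEM phases with a shrinking neighborhood size (eps must be > 0)."""
    G = len(means)
    means, variances = list(means), list(variances)
    sigma = sigma_ini
    while True:
        # Step 2: lookup table for h_kl at the current sigma.
        h = [[math.exp(-(rk - rl) ** 2 / (2.0 * sigma ** 2)) for rl in coords]
             for rk in coords]
        labels = None
        for _ in range(max_iter):  # Step 3: CEM at a fixed sigma
            # E- and C-steps: arg max of gamma_{k|i} equals the arg max of
            # sum_l h_kl log r_l(x_i), so normalization is skipped here.
            new_labels = []
            for x in data:
                log_r = [log_gauss(x, means[l], variances[l]) for l in range(G)]
                scores = [sum(h[k][l] * log_r[l] for l in range(G))
                          for k in range(G)]
                new_labels.append(max(range(G), key=scores.__getitem__))
            if new_labels == labels:  # assumed convergence test
                break
            labels = new_labels
            # M-step: Eqs. (4.11)-(4.12) in the scalar case.
            for l in range(G):
                w = [h[k][l] for k in labels]
                norm = sum(w)
                if norm == 0.0:  # no sample contributes; keep old values
                    continue
                means[l] = sum(wi * x for wi, x in zip(w, data)) / norm
                var = sum(wi * (x - means[l]) ** 2
                          for wi, x in zip(w, data)) / norm
                variances[l] = max(var_floor, var)
        if sigma <= sigma_min:  # Step 4
            return means, variances
        sigma = max(sigma - eps, sigma_min)
```

A call such as `socem(data, (0.0, 1.0), [-1.0, 1.0], [1.0, 1.0], sigma_ini=1.0, eps=0.3, sigma_min=0.1)` runs four phases with σ close to 1.0, 0.7, 0.4, and 0.1, each phase starting from the parameters learned in the previous one.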

4.2.2 Relation to Kohonen’s batch algorithm

Table 4.1: The DAEM algorithm for learning GMMs with equal mixture weights and the SOCEM algorithm.

| Algorithm | DAEM | SOCEM |
|---|---|---|
| Objective function | $F_\beta(\Theta; X)$ in Eq. (2.32), where $p(x_i, l; \Theta) = \frac{1}{G} r_l(x_i; \theta_l)$ | $\sum_{i=1}^{N} \sum_{l=1}^{G} h_{\mathrm{win}_i l} \log r_l(x_i; \theta_l) + \mathrm{Const}$ |
| Posterior distribution | $f(l \mid x_i; \Theta^{(t)}) = \dfrac{r_l(x_i; \theta_l^{(t)})^{\beta}}{\sum_{j=1}^{G} r_j(x_i; \theta_j^{(t)})^{\beta}}$, $l = 1, 2, \cdots, G$ | $h_{\mathrm{win}_i^{(t)} l} = \exp\Bigl(-\dfrac{\lVert \mathbf{r}_{\mathrm{win}_i^{(t)}} - \mathbf{r}_l \rVert^2}{2\sigma^2}\Bigr)$, $l = 1, 2, \cdots, G$ |
| Temperature | $1/\beta$ | $\sigma$ |
| Re-estimation formulae | $\mu_l^{(t+1)} = \dfrac{\sum_{i=1}^{N} f(l \mid x_i; \Theta^{(t)})\, x_i}{\sum_{i=1}^{N} f(l \mid x_i; \Theta^{(t)})}$, $l = 1, 2, \cdots, G$ | $\mu_l^{(t+1)} = \dfrac{\sum_{i=1}^{N} h_{\mathrm{win}_i^{(t)} l}\, x_i}{\sum_{i=1}^{N} h_{\mathrm{win}_i^{(t)} l}}$, $l = 1, 2, \cdots, G$ |

There are two differences between the SOCEM algorithm and Kohonen’s batch algorithm. First, SOCEM considers the neighborhood information when selecting the winning neuron, but Kohonen’s algorithm does not. Second, SOCEM extends the reference vectors in Kohonen’s algorithm with multivariate Gaussians. In other words, if we set $\gamma_{k|i}^{(t)}$ in SOCEM to $\dfrac{r_k(x_i; \theta_k^{(t)})}{\sum_{j=1}^{G} r_j(x_i; \theta_j^{(t)})}$, instead of the setting in Eq. (4.9), we obtain a probabilistic variant of Kohonen’s batch algorithm (denoted as KohonenGaussian), where Kohonen’s winner selection strategy is applied and the reference vectors are replaced with multivariate Gaussians. Thus, we may view KohonenGaussian as an approximate implementation of SOCEM that optimizes SOCEM’s objective function. Moreover, if we set the covariance matrices in KohonenGaussian to be diagonal with small, identical variances, KohonenGaussian is equivalent to Kohonen’s batch algorithm. Therefore, we can interpret the neighborhood shrinking of Kohonen’s algorithms as a deterministic annealing process, and thereby explain why they need to start with a large neighborhood size.
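The difference between the two winner-selection rules can be seen in a small Python sketch. It assumes one-dimensional Gaussian reference models, a three-neuron lattice with coordinates 0, 1, 2, and the unnormalized Gaussian kernel in the form shown in Table 4.1; the function names and the numerical values are illustrative only.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gaussian_kernel(coords, sigma):
    """h[k][l] = exp(-(r_k - r_l)^2 / (2 sigma^2)) for 1-D lattice coordinates."""
    return [[math.exp(-(rk - rl) ** 2 / (2.0 * sigma ** 2)) for rl in coords]
            for rk in coords]

def kohonen_winner(x, means, variances):
    """KohonenGaussian rule: pick the model with the largest r_k(x)."""
    log_r = [log_gauss(x, m, v) for m, v in zip(means, variances)]
    return max(range(len(log_r)), key=log_r.__getitem__)

def socem_winner(x, means, variances, h):
    """SOCEM rule: pick k maximizing sum_l h_kl log r_l(x), as in Eq. (4.9)."""
    G = len(means)
    log_r = [log_gauss(x, means[l], variances[l]) for l in range(G)]
    scores = [sum(h[k][l] * log_r[l] for l in range(G)) for k in range(G)]
    return max(range(G), key=scores.__getitem__)

means, variances = [1.1, 1.0, 2.5], [1.0, 1.0, 1.0]
h = gaussian_kernel([0.0, 1.0, 2.0], sigma=1.0)
k_win = kohonen_winner(0.0, means, variances)
s_win = socem_winner(0.0, means, variances, h)
```

In this constructed example the two rules choose different neurons for x = 0, because the SOCEM score of a neuron also depends on how well its lattice neighbors explain the sample. When `h` is the identity matrix the two rules coincide.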

Recently, Zhong and Ghosh [12] interpreted the neighborhood size of the SOM algorithms that apply Kohonen’s winner selection strategy as a temperature parameter in a deterministic annealing process. However, their interpretations were not based on the optimization of an objective function, which is the essential part of DA-based optimization. In contrast, in SOCEM, the neighborhood shrinking leads to the transition of the objective function from a simpler one to a more complex one, as illustrated in Figure 4.2.

4.2.3 Computational cost

It is clear from Table 4.1 that the computational cost of DAEM is O(GNM), where G, N, and M are the numbers of reference models, data samples, and learning iterations, respectively. Compared to DAEM, SOCEM needs additional O(G²N) multiplication and addition operations for winner selection in each iteration, while KohonenGaussian needs additional O(GN) multiplications and additions.
