Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party
websites are prohibited.
In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information
regarding Elsevier’s archiving and manuscript policies are encouraged to visit:
http://www.elsevier.com/authorsrights
Learning local factor analysis versus mixture of factor analyzers with automatic model selection
Lei Shi
a, Zhi-Yong Liu
b, Shikui Tu
a, Lei Xu
a,naDepartment of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
bThe State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
a r t i c l e i n f o
Article history:
Received 3 June 2013 Received in revised form 29 August 2013
Accepted 15 September 2013 Available online 31 March 2014 Keywords:
Automatic model selection Mixture of factor analyzers Local factor analysis Variational Bayes Bayesian Ying-Yang Dirichlet–Normal–Gamma
a b s t r a c t
Considering Factor Analysis (FA) for each component of Gaussian Mixture Model (GMM), clustering and local dimensionality reduction can be addressed simultaneously by Mixture of Factor Analyzers (MFA) and Local Factor Analysis (LFA), which correspond to two FA parameterizations, respectively. This paper investigates the performance of Variational Bayes (VB) and Bayesian Ying-Yang (BYY) harmony learning on MFA/LFA for the problem of automatically determining the component number and the local hidden dimensionalities (i.e., the number of factors of FA in each component). Similar to the existing VB learning algorithm on MFA, we develop an alternative VB algorithm on LFA with a similar conjugate Dirichlet–
Normal–Gamma (DNG) prior on all parameters of LFA. Also, the corresponding BYY algorithms are developed for MFA and LFA. A wide range of synthetic experiments shows that LFA is superior to MFA in model selection under either VB or BYY, while BYY outperforms VB reliably on both MFA and LFA. These empiricalfindings are consistently observed from real applications on not only face and handwritten digit images clustering, but also unsupervised image segmentation.
& 2014 Elsevier B.V. All rights reserved.
1. Introduction
Mixture models[1,2], such as Gaussian Mixture Model (GMM) [3,4], have been widely used in many applications. By exploiting the Factor Analysis (FA) [5] in each Gaussian component, the correlated high dimensional data can be represented by fewer latent factors without requiringOðd2Þ parameters for each Gaus- sian covariance matrix, where d is the dimensionality of the data.
The mixture model can be regarded as a constrained GMM, and has been studied under the name of Mixture of Factor Analyzers (MFA) [2,6] or Local Factor Analysis (LFA)[7,8]in the literature.
MFA and LFA separately employ two parameterizations of FA, shortly called as FA-a that takes the form of a free factor loading matrix and an identity covariance matrix for the latent factors, and FA-b that constrains the factor loading matrix to be a rectangular orthogonal matrix, and allows a diagonal covariance matrix for the latent variables, respectively in[9].
Learning MFA/LFA includes parameter learning for estimating all the unknown parameters and model selection for determining the component number k and the hidden dimensionalities fhigki ¼ 1. Parameter learning is usually implemented under the maximum
likelihood principle by an Expectation–Maximization (EM) algo- rithm [1,10,11]. A conventional model selection approach is featured by a two-stage implementation. Thefirst stage conducts parameter learning for each kAM to get a set of candidate models, where k ¼ fk; fhigg for MFA/LFA. The second stage selects the best candidate by a model selection criterion, e.g., Akaike's Information Criterion (AIC)[12]. However, this two-stage imple- mentation suffers from a huge computation because it requires parameter learning for each kAM. Moreover, a larger k often implies more unknown parameters, and then parameter estima- tion becomes less reliable so that the criterion evaluation reduces its accuracy (see Section 2.1 in[13]for a detailed discussion).
To reduce the computation, an Incremental Mixture of Factor Analyzers (IMoFA) algorithm was proposed on MFA in[14] with the validation likelihood as the criterion to judge whether to split a component, or add a hidden dimension, or terminate. Although such an incremental procedure can save the costs to some extent, it usually leads to a suboptimal solution[13,15].
Another road is referred to as automatic model selection, which starts from a large enough k, and has an intrinsic force to drive extra structures diminished, and thus automatically determines k during parameter learning. An early effort is Rival Penalized Competitive Learning (RPCL) on GMM [16,17]. Two Bayesian related approaches can be implemented with a nature of auto- matic model selection. One is Bayesian Ying-Yang (BYY) learning, Contents lists available atScienceDirect
journal homepage:www.elsevier.com/locate/neucom
Neurocomputing
http://dx.doi.org/10.1016/j.neucom.2013.09.061 0925-2312/& 2014 Elsevier B.V. All rights reserved.
nCorresponding author.
E-mail address:[email protected](L. Xu).
proposed in[18]and systematically developed in the past decade and a half [13,15,19,20], which provides a general statistical learning framework that can handle both parameter learning and model selection under a best harmony principle. BYY is capable of automatic model selection even without imposing any priors on the parameters, and its performance can be further improved with appropriate priors incorporated according to a general guideline. The other is Variational Bayes (VB) [6,21]. It tackles the difficulty in computing the marginal likelihood with a lower bound by means of variational method, and an EM-like algorithm is employed to optimize this lower bound. The model selection of VB is realized by incorporating an appropriate prior distributions on the parameters.
Recently, a comparative study[4]was delivered on automatic model selection by BYY, VB and MML (Minimum Message Length) for GMM with priors over the parameters. Also in[9], FA-b shows better model selection performance than FA-a under BYY and VB, although FA-a and FA-b have equivalent likelihood functions.
This paper is motivated for an empirical investigation on the automatic model selection performances of BYY and VB, based on MFA and LFA, which actually correspond to Mixture of FA-a and Mixture of FA-b, respectively. There exists a VB algorithm[6]for MFA with a Dirichlet prior on the mixing weights, Normal priors on the columns of the factor-loading matrix, and Gamma priors on precision parameters. Following[4], we consider a full prior on all parameters and adopt a Normal prior over the mean vector in each component of MFA. For short, DNG is referred to the above Dirichlet, Normal, Gamma priors. By slightly modifying the one in [6], we obtain a VB learning algorithm with the DNG prior, shortly denoted as VB-MFA. Also, a similar conjugate DNG prior is considered on the parameters of LFA.
Moreover, we develop three automatic model selection algo- rithms, namely the VB algorithm on LFA, or VB-LFA for short, and the BYY algorithms on MFA and LFA, shortly denoted as BYY-MFA and BYY-LFA respectively. With the conjugate property of the priors, the BYY harmony measure is computed by directly integrating out the parameters with respect to the Yang posteriors, instead of using Taylor approximations as in [9]. The handled marginal density of observed variable in each component is tackled by a lower-bound approximation with the help of additional variables, leading to products of multiple Student's T-distributions.
The performances of automatic model selection are extensively compared on a wide range of randomly simulated data, via controlling the hardness of tasks by varying the dimension of data, the number of samples, the true number of components, and the overlap degree of components. The simulated results show the following empirical findings. First, LFA gets better performance than MFA under either VB or BYY, which echoes the advantages of FA-b over FA-a observed in[9]. Second, BYY outperforms VB on both MFA and LFA, and in most cases BYY-LFA performs the best.
Also, we apply these algorithms to not only clustering face and handwritten digit images, but also unsupervised image segmenta- tion on real world images. The results are consistent with the observations from simulated experiments.
The main contribution of this paper can be summarized in two- fold. First, three algorithms, i.e, the algorithm of VB based LFA with Dirichlet–Normal–Gamma (DNG) prior, denoted by VB-LFA, the algorithm of BYY based LFA with DNG prior, denoted by BYY-LFA, and the algorithm of BYY based MFA with DNG prior, denoted by BYY-MFA are derived in detail. Second, based on the algorithms, we empirically compared by extensive experiments the two types of clustering of factor analysis models, i.e., LFA and MFA, as well as two types of automatic model selection strategies, i.e., VB and BYY.
The remainder of this paper is organized as follows.Section 2 introduces MFA/LFA and their DNG priors. We introduce the automatic model selection algorithms with the DNG priors by
BYY inSection 3, and by VB inSection 4. Experimental compar- isons are conducted via a wide range of synthetic datasets and real applications inSection 5. Finally, concluding remarks are made in Section 6.
2. Models and priors 2.1. Model parameterizations
In a mixture model, the distribution qðxjΘÞ of a d-dimensional observed random variable x is a mixture of several local distribu- tions qðxji; θÞ, with each named as a component:
qðxjΘÞ ¼ ∑k
i ¼ 1αiqðxji; θiÞ with Θ ¼ fαig [ fθig; ð1Þ
where k is the component number, fαig are mixing weights with
∑ki ¼ 1αi¼ 1 and each αiZ0, and θi denotes parameters of the ith component. Here and throughout this paper, qðÞ is referred to as a generative distribution, likelihood or prior, while pðÞ is referred to as a posterior distribution.
If each component is a Gaussian distribution, i.e., qðxji; ΘÞ ¼ Gðxjμi; ΣxjiÞ with mean μi and covariance matrix Σxji, qðxjΘÞ by Eq.(1)becomes the widely used Gaussian Mixture Model. For a full matrix Σxji, there are 0:5dðdþ1Þ free parameters to be estimated, whose accuracy is difficult be guaranteed for a small sample size. One way for tackling this problem is to impose certain constraints onΣxji with a Factor Analysis model, i.e.,
qðxjy; i; θiÞ ¼ GðxjAiy þμi; ΨiÞ; qðyji; θiÞ ¼ Gðyj0; ΣyjiÞ;
qðxji; θiÞ ¼Z
qðxjy; i; θiÞqðyji; θiÞ dy ¼ Gðxjμi; AiΣyjiATiþΨiÞ; ð2Þ where we introduce a hidden factor y in an hi-dimensional subspace with hiod, and constrain Ψi to be diagonal. FA actually factorizesΣxjito beΣxji¼ AiΣyjiATiþΨiwith fewer free parameters.
To reduce the indeterminacies of the FA by Eq.(2), two parameter- izations of FA are typically used, called as Mixture of Factor Analyzers (MFA)[2,6]and Local Factor Analysis (LFA)[7,8]respectively, with their corresponding mixture models by Eq.(1)summarized inTable 1. The two FA parameterizations have equivalent likelihood functions by Eq.(2), and thus they have the same model selection performance in a two-stage implementation with AIC or BIC[22]. However, it was found that they result in different model selection performances under BYY [23], and a recent study[9]provided systematic empiricalfindings on how parameterizations affect model selection performance under not only BYY but also VB. Moreover, the differences of two parameteriza- tions on model selection performance have been further analytically investigated in Section 2.2 of [20]. In this paper, we proceed to investigate the automatic model selection performances of MFA/LFA under BYY and VB.
Moreover, when each diagonal covariance Ψi in Table 1 is constrained to be spherical, i.e., Ψi¼ ψiId, MFA and LFA will degenerate to Mixture of PCA[11]and Local PCA[8], respectively.
Table 1
MFA v.s. LFA: similarity and difference. MFA and LFA are actually mixtures of FA-a and FA-b in[9], respectively.
Model: MFA (mixture of FA-a) LFA (mixture of FA-b)
Parametersθi: fAi; μi; Ψig fUi; Λi; μi; Ψig Same: Ψiis d d diagonal Ψiis d d diagonal
Different: Aiis general d hi Uiis orthogonal, i.e., UTiUi¼ Ihi, Λiis diagonal,Λi¼ diag½λ1; …; λhi qðyji; θiÞ: Gðyj0; IhiÞ Gðyj0; ΛiÞ
qðxjy; i; θiÞ: GðxjAiy þμi; ΨiÞ GðxjUiy þμi; ΨiÞ qðxji; θiÞ: Gðxjμi; AiATiþΨiÞ Gðxjμi; UiΛiUTiþΨiÞ
Additionally, MFA/LFA can be reformulated by introducing a binary variable set Z ¼ fztgNt ¼ 1 corresponding to the i.i.d. samples XN¼ fxtgNt ¼ 1. In each zt, we have each ith element zitAf0; 1g and
∑ki ¼ 1zit¼ 1, and zit¼ 1 iff xtis generated from the ith component.
The generative process of an observation xtis thus interpreted as three steps: (1) sample zt from a Multinomial distribution described byα ¼ ½α1; …; αkT, i.e., zt MultinomialðαÞ; (2) generate hidden variable y from a Gaussian subspace, i.e., y ∏iqðyji; θiÞzit; (3) generate xt from a Gaussian conditional on y, i.e., xt
∏iqðxtjy; i; θiÞzit. Therefore, we have the following joint probability:
qðXN; Y; ZjΘÞ ¼ ∏N
t ¼ 1 ∏k
i ¼ 1½αiqðxtjy; i; θiÞqðyji; θiÞzit: ð3Þ Given a set of observations XN, supposing that the component number k and the local hidden dimensionalities fhig are given, one widely used method for parameter estimation is the maximum- likelihood learning, which can be effectively implemented by the well-known Expectation–Maximization (EM) algorithm[1]. Model selection on LFA/MFA is to appropriately determine both the component number k and the local hidden dimensionalities fhigki ¼ 1, or shortly to determine the tuple k ¼ fk; fhigki ¼ 1g, on which the maximum likelihood principle fails to give a good guide[13].
2.2. The conjugate Dirichlet–Normal–Gamma priors
Considering the parameters with appropriate prior distribu- tions can provide helpful learning regularization and also improve model selection performance [13]. Such empirical studies have been conducted on GMM with either the improper Jeffreys prior or the conjugate Dirichlet–Normal–Wishart prior under BYY, VB and MML [4], and also on FA with Normal–Gamma prior under BYY and VB [9]. This section considers similar prior distributions on parameters Θ ¼ α [ fθigki ¼ 1 of MFA or LFA inTable 1, where the Dirichelet distribution Dðαjλ; ξÞ takes the form,
Dðαjλ; ξÞ ¼ ΓðξÞ
∏ki ¼ 1ΓðξλiÞ ∏k
i ¼ 1αξλi i 1
!
; ð4Þ
with the constraintsλ ¼ ½λ1; …; λkT; ∑iλi¼ 1; 8λiZ0:
For MFA, we consider a Dirichlet prior on the mixing weights α ¼ ½α1; …; αkT, a Normal prior on each component's mean vector μi, an independent Gamma prior on the diagonal elements of
φi¼ Ψi 1, whereφðjÞi is the jth diagonal element inφi. Moreover, a hierarchical Normal–Gamma prior is assigned on each jth column AðinjÞ of Ai, where AðinjÞ a priori comes from a zero-mean Normal distribution with a covariance Id=ςij, and ςij further follows a Gamma prior. We useΓðja; bÞ to denote the Gamma distribution with a shape parameter a40 and an inverse scale parameter b40. The whole qðΘÞ is shortly denoted as DNG with details given in the left ofTable 2. A DNG prior was considered on MFA under VB in[6]without qðμiÞ. Based on the observations in[4]that a full prior helps to improve model selection performance, in this paper we consider the full DNG prior with qðμiÞ.
For LFA, we consider a similar DNG prior onΘ in the right of Table 2, with the parts different from MFA highlighted by gray color. Therein, each orthogonal matrix Ui is considered without any prior, instead of adopting the qðUiÞ used in[9]because it is irrelevant to Uiand thus not helpful for automatic model selection.
The differences in qðΘÞ actually come from the parameterizations.
Therefore, the qðΘÞ for LFA is also called DNG prior in this paper without ambiguity. For both MFA and LFA, the DNG prior is conjugate[6,11]to the generative process described in Eq.(3).
In the sequel, we may use short notations aφi ¼ ½aφi1; …; aφidT, bφi ¼ ½bφi1; …; bφidT, aςi ¼ ½aςi1; …; aςihiT, bςi¼ ½bςi1; …; bςihiT, aνi¼ ½aνi1; …;
aνih
iT and bνi¼ ½bνi1; …; bνihiT for expression convenience.
3. BYY algorithms for learning LFA versus MFA 3.1. Bayesian Ying-Yang (BYY) harmony learning
Firstly proposed in [18] and systematically developed over a decade and a half[13], Bayesian Ying-Yang (BYY) harmony learn- ing theory is a general statistical learning framework that can handle both parameter learning and model selection under a best harmony principle, which provides a favorable new mechanism for model selection.
The BYY harmony learning is featured by seeking the best harmony between the Ying- Yang pair in a BYY system. The BYY system consists of Yang machine and Ying machine, respectively corresponding to two types of decompositions, namely Yang pðRjXÞpðXÞ and Ying qðXjRÞqðRÞ, where the observed data X is regarded as generated from its inner representation R ¼ fY; Θg that consists of latent variables Y and parameters Θ, supported by a hyper-parameter setΞ. The harmony measure is mathematically expressed as follows[13,15,19]:
Hðpjjq; ΞÞ ¼Z
pðRjXÞpðXÞ½qðXjRÞqðRÞdX dR: ð5Þ Maximizing Hðpjjq; ΞÞ leads to not only a best matching between the Ying-Yang pair, but also a compact model with a least complexity. Different from VB model selection that bases on an appropriate prior qðΘjΞÞ (seeSection 4for the details of the prior in VB), BYY harmony learning by Eq. (5) bases on qðRÞ ¼ qðYjΘÞqðΘjΞÞ to make model selection, where qðYjΘÞ plays a role that is not only equally important to qðΘjΞÞ but also easy computing, and qðΘjΞÞ is still handled in a way similar to VB.
Moreover, maximizing Hðpjjq; ΞÞ is implemented with the help of the general two-stage iterative procedure shown by Fig. 6 in[19]
(also see Eqs.(6)and(7)in[8]). Thefirst stage estimates Ξ (usually via estimating Θ) by an optimization of continuous variables, while the second stage involves a discrete optimization on one or several integers that index candidate models. Here, we only consider thefirst stage where automatic model selection actually performs, though the second stage may be also considered to further improve the model selection performance with much more computing costs.
Table 2
The Dirichlet–Normal–Gamma priors on MFA and LFA with hyper-parameters Ξ ¼ fλ; ξ; βg [ fΞigki ¼ 1. Priors on MFA/LFA are described in the two big columns respectively, whose differences are highlighted with gray color. The “distr.”
columns indicate the distribution types for clarity, with“D” for Dirichlet, “N” for Normal, and“G” for Gamma, respectively.
MFA LFA
Θ ¼ α [ fθigi,θi¼ fAi; μi; Ψig Θ ¼ α [ fθigi,θi¼ fUi; Λi; μi; Ψig φi¼ Ψi 1; AðnjÞi : jth column vector of Ai νi¼ Λi 1,φi¼ Ψi 1
Ξi¼ fmi; faφij; bφijg
j; faςij; bςijg
jg Ξi¼ fmi; faφij; bφijg
j; faνij; bνijg
jg
prior distr. prior distr.
qðΘÞ ¼ qðαÞ ∏k
i ¼ 1qðθiÞ DNG qðΘÞ ¼ qðαÞ ∏k
i ¼ 1qðθiÞ DNG
qðαÞ ¼ Dðαjλ; ξÞ D qðαÞ ¼ Dðαjλ; ξÞ D
qðθiÞ ¼ qðμiÞqðφiÞqðAiÞ NG qðθiÞ ¼ qðμiÞqðφiÞqðνiÞ NG qðμiÞ ¼ Gðμijmi; Id=βÞ N qðμiÞ ¼ Gðμijmi; Id=βÞ N qðφiÞ ¼ ∏d
j ¼ 1
ΓðφðjÞiaφij; bφijÞ G qðφiÞ ¼ ∏d
j ¼ 1
ΓðφðjÞijaφij; bφijÞ G qðAijςiÞ ¼ ∏hi
j ¼ 1
GðAðnjÞi j0; Id=ςðjÞi Þ N qðνiÞ ¼ ∏hi
j ¼ 1ΓðνðjÞi jaνij; bνijÞ G qðςiÞ ¼ ∏hi
j ¼ 1
ΓðςðjÞi jaςij; bςijÞ G
BYY harmony learning leads to improved model selection via either or both of improved model selection criteria and algorithms with automatic model selection. Such a merit can be intuitively understood as follows[13]:
HðpjjqÞ ¼ Z
pðXÞln qðXÞ dX ¼ HðpjjpÞKLðpjjqÞ: ð6Þ Thus, besides the Kullback–Leibler divergence, a system entropy term HðpjjpÞ is also incorporated into the BYY objective function.
By contrast, VB tries to maximize the marginal likelihood by minimizing only the Kullback–Leibler divergence. It is the term HðpjjpÞ that makes the BYY harmony learning possess automatic model selection ability, even there is no prior on the parameter.
More specifically, to estimate some mixture model such as the MFA, with the help of HðpjjpÞ the BYY harmony learning takes the following E-step which compared with the conventional EM algorithm has an extra termΔ as follows[13]:
pj;t¼ pðjjtÞþΔj;t: ð7Þ
By such a regularization termΔj;t (see also(A.9)inAppendix A), the updating on the jth component shares somewhat a similar updating to the rival penalized competitive learning (RPCL) [16,17], and thus realizes automatic model selection.
On MFA and LFA with the DNG priors in a conjugate manner, there is still no detailed automatic model selection algorithm available for implementing BYY harmony learning, and thus this section targets at developing such algorithms.
3.2. BYY algorithm for learning LFA with DNG prior
For LFA, we consider the Ying machine as qðX; RÞ ¼ qðX; Y;
ZjΘÞqðΘÞ, with R ¼ fY; Z; Θg, qðX; Y; ZjΘÞ given in Eq.(3)and qðΘÞ given in the right ofTable 2. In the Yang machine, we consider pðXÞ as the empirical distribution, i.e., pðXÞ ¼δðXXNÞ, and the Yang- pathway pðRjXÞ as
pðRjXNÞ ¼ pðΘjZ; Y; XNÞpðYjZ; XNÞpðZjXNÞ;
pðΘjY; Z; XNÞ ¼ pðαjZ; XNÞ ∏k
i ¼ 1
pðμijZ; XNÞ ∏hi
j ¼ 1
pðνðjÞi jY; Z; XNÞ
"
∏d
j ¼ 1
pðφðjÞi jY; Z; XNÞ
#
;
pðYjZ; XNÞ ¼ ∏k
i ¼ 1 ∏N
t ¼ 1
pðyji; xtÞzit;
pðZjXNÞ ¼ ∏k
i ¼ 1 ∏N
t ¼ 1
pðijxtÞzit: ð8Þ
Particularly, in accordance with the variety preservation prin- ciple (see Section 4.2 in [13]), the details of pðΘjY; Z; XNÞ are further designed as the following posteriors in the DNG form by utilizing the conjugate property:
pðαjZ; XNÞ ¼ Dðαjλn; ξþNÞ;
pðμijZ; XNÞ ¼ G μijmni; βIdþ ∑N
t ¼ 1
zit
diagðaφi⊘bφiÞ
1!
; pðνðjÞi jY; Z; XNÞ ¼ Γ νðjÞi jaνijþ1
2 ∑N
t ¼ 1
zit; bnijν
; pðφðjÞi jY; Z; XNÞ ¼ Γ φðjÞi jaφijþ1
2 ∑N
t ¼ 1
zit; bnijφ
; ð9Þ
where fλn; mni; bnijν; bnijφg are free hyper-parameters to be optimized.
Therein and throughout this paper, we use symbols“⊘” and “”
to denote the Hadamard (element-by-element) division and pro- duct respectively.
The pðijxtÞ in Eq.(8)is constructed as the Bayesian posterior of qðxt; iÞ, i.e., pðijxtÞpqðxt; iÞ, where the qðxt; iÞ is computed by qðxt; iÞ ¼Z
qðxt; ijΘÞqðΘÞ dΘ ¼ λiqðxtjiÞ; ð10Þ qðxtjiÞ ¼Z
qðxtji; θiÞqðθiÞ dθi¼Z
qðxtjyt; i; θiÞqðytji; θiÞqðθiÞ dytdθi; ð11Þ withθi¼ fμi; νi; φig. However, it is difficult to directly compute the above integral over fθi; ytg for an analytical qðxtjiÞ. Therefore, qðxtjiÞ is sequentially approximated by lower-bounds according to Jen- sen's inequality, i.e.,
qðxtjiÞZZ
~qðxtji; θiÞqðθiÞ dθiZZ
~qðxtji; νi; φiÞqðνiÞqðφiÞ dνidφi; ð12Þ which leads to the marginal qðxtjiÞ as a product of several Student's T-distributions in Eq.(A.1)inAppendix A, where
~qðxtji; θiÞ ¼ exp Z
Gðytj~yit; ~ΣyiÞ lnqðxtjyt; θiÞqðytjθiÞ Gðytj~yit; ~ΣyiÞ dy
( )
; ð13Þ
~qðxtji; νi; φiÞ ¼ exp Z
Gðμij~μxit; ~ΣμixÞ ln~qðxtji; θiÞqðμiÞ Gðμij~μxit; ~ΣμixÞdμi
( )
; ð14Þ
and f~yit; ~Σyi; ~μxit; ~Σμixg are assistant variables which can be updated by maximizing the above lower bounds with the details referred to Appendix A.
Similarly for pðyji; xtÞ, we can also obtain a product of multiple Student's T-distributions, which however makes the subsequent integrals in the harmony measure difficult. We further approx- imate it by the following Gaussian according to the property of Student's T-distribution[24]:
pðyjxt; iÞ GðyjWiðxt~μyitÞ; ΠiÞ;
Wi¼ ΠiUTiDi; Di¼ diag½ðaφi þ121dÞ⊘ðbφiþ12 diagð ~ΣμiyÞÞ;
Πi¼ ½UTiDiUiþdiagðaνi⊘bνiÞ 1; ð15Þ where~μyitand ~Σμiycan be updated for a better approximation given the parameters. The details are referred toAppendix A.
Putting Eq. (8) into Eq. (5), the harmony measure on LFA becomes
HLFAðfUig; ΞÞ ¼Z
∑
Z
pðΘjZ; Y; XNÞpðYjZ; XNÞpðZjXNÞ
ln½qðXNjY; Z; ΘÞqðYjZ; ΘÞqðZjΘÞqðΘjΞÞ dY dΘ: ð16Þ By further substituting the details of Eqs.(9)–(15)into Eq.(16), we obtain the following lower-bound:
HLFAðfUig; ΞÞZHLFAðfUig; Ξ; ΞnÞ; ð17Þ where the detailed expression of HLFAðfUig; Ξ; ΞnÞ is given by Eq.(A.5)in Appendix A. The best harmony principle is approxi- mated by maximizing HLFAðfUig; Ξ; ΞnÞ with respect to the prior hyper-parametersΞ ¼ fλ; ξ; fmig; β; faνij; bνijg; faφij; bφijgg and the poster- ior hyper-parametersΞn¼ fλn; fmnig; fbnijνg; fbnijφgg. The derived algo- rithm is sketched inTable 3and shortly denoted as BYY-LFA, with details referred toAppendix A.
The above checking on whether to discard dimension j in component i is actually observing whether the jth diagonal element ofΛitends to zero and thus the corresponding dimension of y can be discarded. Following Section 2.2 of[20], e.g., its Eq.
(36), this checking can be further improved by the nature of the co-dim matrix pair of each FA model.
3.3. BYY algorithm for learning MFA with DNG prior
Learning on MFA can be made in a similar way. The main differences come from the parameterizations on the factor loading matrix and the factor covariance matrix, and their corresponding prior distributions, as shown inTables 1and2.
The Ying machine is represented by replacing the counterparts of LFA with the ones of MFA, i.e., getting qðX; Y; ZjΘÞ in Eq.(3)by the left column ofTable 1and qðΘÞ by the left part ofTable 2. In Yang machine, we still consider the empirical distribution pðXÞ ¼δðXXNÞ, but the Yang-pathway pðRjXÞ is factorized in a form slightly different from Eq.(8):
pðRjXNÞ ¼ pðΘjZ; XNÞpðYjZ; XN; fAigÞpðZjXNÞ;
pðΘjZ; XNÞ ¼ pðαjZ; XNÞ ∏k
i ¼ 1
pðμijZ; XNÞ ∏d
j ¼ 1
pðφðjÞi jZ; XNÞ (
∏hi
j ¼ 1
½pðAðinjÞjZ; XNÞpðςðjÞi jZ; XNÞ
)
;
pðYjZ; XN; fAigÞ ¼ ∏k
i ¼ 1 ∏N
t ¼ 1
pðyji; xt; AiÞzit; ð18Þ where pðZjXNÞ is expressed in the same form as the one in Eq.(8).
We proceed to the detailed expressions of pðΘjZ; XNÞ according to the conjugate property of the DNG priors on MFA. Particularly, Eq.(8)is modified accordingly by replacing pðνðjÞi jY; Z; XNÞ with the following equations:
pðAðinjÞjZ; XNÞ ¼ G
AðinjÞjAðinjÞ;
aςij=bςijIdþ ∑N
t ¼ 1
zit
diagðaφi⊘bφiÞ
1
;
pðςðjÞi jZ; XNÞ ¼ Γ ςðjÞi jaςijþd 2; bnijς
; ð19Þ
where fAðinjÞ; bnijς; bnijφg are free hyper-parameters to be determined.
Similar to Eq. (10), the pðijxtÞ is constructed by the Bayesian posterior, i.e., pðijxtÞpqðxt; iÞ. The qðxtjiÞ for MFA can also be approximated via Eqs.(12)–(14), where ~qðxtji; νi; φiÞ is substituted by ~qðxtji; Ai; φiÞ, and then Eq. (12) is computed by integrating
~qðxtji; Ai; φiÞqðζiÞqðφiÞ with respect to Ai; ζi; φi, leading to a different qðxtjiÞ which is also a product of multiple Students T-distributions.
Moreover, the pðyji; xt; fAigÞ is approximated by Eq.(15)with slight modifications: replacing Ui and diagðaνi⊘bνiÞ with Ai and Ihi
respectively.
Putting the above specifications of Ying-Yang machine into Eq. (5), we obtain a lower-bound HMFAðΞ; ΞnÞ analogous to HLFAðfUig; Ξ; ΞnÞ, and also a corresponding BYY-MFA algorithm to
maximize HMFAðΞ; ΞnÞ via modifying the counterparts of BYY-LFA according to Eqs. (18) and(19). Readers are referred to [25]for details.
It should be noted that the pðθijY; Z; XNÞ is considered to have a conjugate Normal–Gamma form, whereas it was approximated by a 2nd order Taylor expansion in [9] (see Eqs. (13)–(17) and Table B1 in [9] for details), although the prior qðθiÞ on each component's parametersθi inTable 2is the same as those in[9].
Therefore, when the mixture model in Eq.(1)degenerates to only one Gaussian component represented by FA in [9], i.e., k ¼1, the BYY-LFA and BYY-MFA here actually provide alternative BYY learning algorithms for FA-a and FA-b, being different from the algorithms derived in[9].
4. VB algorithms for learning LFA versus MFA
With proper incorporation of prior knowledge on model para- meters, Bayesian model selection is implemented via the max- imum marginal likelihood, which is obtained by integrating out the latent variables Y and the parameters Θ, i.e., qðXNÞ ¼ RqðXN; YjΘÞqðΘÞ dY dΘ. However, the involved integration is usually very difficult. Variational Bayesian [6,21] tackles this difficulty via constructing a tractable lower bound for the log marginal likelihood by means of variational methods, and an EM- like algorithm is employed to optimize this lower bound. More precisely, the lower bound is given as follows:
JVBðΞn; ΞÞ ¼Z
pðΘ; YjΞnÞ lnqðXN; YjΘÞqðΘjΞÞ pðΘ; YjΞnÞ dY dΘ
¼ ln qðXNjΞÞZ
pðΘ; YjΞnÞ ln pðΘ; YjΞnÞ
pðΘ; YjXN; ΞÞdY dΘ; ð20Þ where Y represents all hidden variables, e.g., fY; Zg in Eq.(3)for MFA/LFA, qðΘjΞÞ is a prior on parameters Θ with hyper-parameters Ξ, and pðΘ; YjΞnÞ is a variational posterior with hyper-parameters Ξn to approximate the exact Bayesian posterior pðΘ; YjXN; ΞÞp qðXN; YjΘÞqðΘjΞÞ. It follows from Eq. (20) that the lower bound JVBðΞn; ΞÞ is tight to ln qðXNjΞÞ when pðΘ; YjΞnÞ ¼ pðΘ; YjXN; ΞÞ. For computational convenience, qðΘjΞÞ is usually chosen to be con- jugate priors, and pðΘ; YÞ is usually assumed to be afactorized form pðΘ; YÞ ¼ pðYÞ∏ipðΘiÞ with Θ ¼ ⋃iΘi, so that maximizing JVBmakes the variational posterior pðΘÞ to be conjugate, i.e., in the same form as the corresponding prior.
There is already a VB learning algorithm on MFA proposed in [6], where the prior onθi is qðθiÞ ¼ qðφiÞqðAiÞ, without considering the prior onμi. In this paper, the DNG prior inTable 2considers a Gaussian distribution qðμiÞ for μi. Then, the corresponding VB Table 3
BYY algorithm on LFA with DNG prior (BYY-LFA).
1 Initialization: Randomly initialize the model with large enough number k of components and hidden dimensionalities fhig; set τ ¼ 0 and the harmony measure JBYYðτÞ ¼ 1;
2 repeat 3 4 5 6 7 8 9 10 11 12
Yangstep : Update assistant variables f ~yit; ~Σyig by Eq: ðA:3Þ; f ~μ; ~Σμixg by Eq: ðA:2Þ; and f ~μyit; ~Σμiyg by Eq: ðA:4Þ: Calculate pðijxtÞ by Eq: ðA:1Þ; eitandϵitby Eq: ðA:6Þ:
Yingstep : With ∇UiHLFAby Eq: ðA:8Þ; update each Uivia Unewi ¼ Uoldi þη½∇UiHLFAUoldi ð∇UiHLFAÞTUoldi :
UpdateΞn¼ fλni; mni; bnνi ; bnφi g in a gradient way by Eq: ðA:7Þ:
Hstep : In a gradient way; update fλ; ξ; βg [ fmig by Eq: ðA:10Þ; faνi; bνig by Eq: ðA:11Þ; and faφi; bφig by Eq: ðA:12Þ:
for i ¼ 1; …; k do
ifλni orλi-0 then discard component i; let k ¼ k1;
for j ¼ 1; …; hido ifaν bnνij
ijþ12∑Nt ¼ 1pðijxtÞorbaνijν
ij-0 then discard dimension j in component i; let hi¼ hi1;
66 66 66 4
if another 5 runs pass then letτ ¼ τþ1; calculate JBYYðτÞ ¼ HLFAby Eq: ðA:5Þ;
13 until JBYYðτÞJBYYðτ1ÞoϵJBYYðτ1Þ; with ϵ ¼ 10 5;
algorithm (shortly denoted as VB-MFA) can be obtained through slightly modifying the one in[6]by replacing μi with mni when updating other parameters, and additionally update the following variational posterior forμi:
pðμiÞ ¼ Gðμijmni; ΣniμÞpEp½ln½qðXN; Y; ZjΘÞqðΘÞ; ð21Þ where Ep½ denotes expectation with respect to the current estimate of all variational posteriors pðYjZÞpðZÞpðΘÞ except pðμiÞ.
There is no VB algorithm available for LFA with DNG prior, yet there is a VB algorithm presented in[9]on FA-b, a degenerated case of LFA with k ¼1. Following[6], the joint posterior is assumed to be approximately factorized as the following variational poster- ior:
pðY; Z; ΘjXNÞ pðYjZÞpðZÞpðΘÞ;
pðΘÞ ¼ pðαÞ ∏k
i ¼ 1
pðμiÞ ∏hi
j ¼ 1
pðνðjÞi Þ ∏d
j ¼ 1
pðφðjÞi Þ
" #
; ð22Þ
and then the variational lower bound becomes JLFAVB ¼
Z
pðYjZÞpðZÞpðΘÞ ln qðX; Y; ZjΘÞqðΘÞ pðYjZÞpðZÞpðΘÞ
dY dZ dX dΘ: ð23Þ
Its maximization leads to the following conjugate form:
pðZÞ ¼ ∏k
i ¼ 1 ∏N
t ¼ 1
pzitit; pðYjZÞ ¼ ∏k
i ¼ 1 ∏N
t ¼ 1
Gðyjynit; ΣniyÞzit; pðαÞ ¼ Dðαjλn; ξnÞ; pðμiÞ ¼ Gðμijmni; ΣniμÞ;
pðνðjÞi Þ ¼ ΓðνðjÞi janijν; bnijνÞ; pðφðjÞi Þ ¼ ΓðφðjÞi janijφ; bnijφÞ: ð24Þ An EM-like algorithm is given inAppendix B, for maximizing JVBLFA
with respect to the variational parameters ffpit; ynitgt; Σniygki ¼ 1 that describe the posterior distribution on Y [ Z, and the variational hyper-parametersΞn¼ fλn; ξn; fmni; Σniμgi; fanijν; bnijνg
i;j; fanijφ; bnijφg
i;jg that describe the posterior onΘ.
5. Experimental results
Since the values of the hyper-parameters Ξ are usually unknown and it is not good to assign one for all data by guess, the algorithms are implemented to adjust Ξ under their own learning principles. Moreover, it has been demonstrated in[4]that the performance of VB will be improved when the fmig ofTable 2 are constrained to be the same, i.e., 8 i; mi¼ m, with the number of free hyper-parameters reduced, and thus VB is here implemented with mi¼ m too. Still, no constraints are imposed on fmig for BYY algorithms.
5.1. Comparisons on four series of simulations
Each dataset is generated according to MFA or LFA in Tables 1 and 2. For each component i, we set the hidden dimensionality hni ¼ 5, and the mixing weight αi¼ 1=kn, where kn denotes the true component number. The remaining para- meters are randomly generated according to the Normal–Gamma
distributions given inTable 2, with aςij¼ bςij¼ 3, aνij¼ 10, bνij¼ 200, and aφij¼ bφij¼ 10, 8i; j.
To cover a wide range of experimental conditions, we vary the values of the sample size N, the observed data dimensionality d, the true component number kn, and the overlap degree β of Gaussian components, where increasing β indicates that the degree of separation of the components changes from large to small. Generally speaking, it becomes more difficult in model selection as N decreases, d increases, knincreases, andβ increases.
We consider four series of experiments specified in Table 4.
Starting from a same point in the 4-dimensional factor space, each series varies one factor of ðN; d; kn; βÞ while fixing the remaining three. For each specific setting, 500 datasets are generated independently, and all algorithms are initialized with a same component number 25 and a same hidden dimensionality 9. We compare the resulted k [ fhigki ¼ 1 with the true kn[ fhnigki ¼ 1n with each hni ¼ 5. The model selection accuracies are reported inFig. 1 as percentages of correctly obtaining k ¼ knand hi¼ 5ð8iÞ out of 500 independent runs.
Fig. 1 shows that all algorithms perform well at the starting case, and then decline as the experimental environment deterio- rates. Particularly, we observe:
1. LFA is shown to be superior to MFA in model selection, under either BYY or VB. This observation is consistent to the empirical findings on FA in[9]. Specifically, the superiority of LFA is less obvious in the cases of small N, or large kn; β, because both LFA and MFA get bad performance on these extreme cases. The reason may be due to the fact that y in MFA processes an identity covariance matrix, which can be taken as a special case of that of LFA. Thus, LFA is more flexible than MFA to accommodate different types of data. Moreover, the LFA is better than MFA in terms of providing one additional room for model selection via estimatingΛ. Compared with adding priors, Gðyj0; ΛÞ is more reliable and easier to be estimated from data.
2. On either LFA or MFA, BYY greatly outperforms VB for all the cases except the one kn¼ 7 in series 3. Although VB-LFA benefits from the superiority of using LFA instead of MFA, BYY-MFA is still more robust than VB-LFA against the deteriora- tion of the environment, while BYY-LFA is the best in general.
Regarding the time-cost, due to space limit we are not able to present the detailed results. It was observed that the BYY in general involves a heavier computational load than VB because it involves more gradient calculations. However, the complexities of both algorithms are comparable to each other (around OðN3Þ), since the main computation of both of them arises from the matrix multiplication.
5.2. Face and handwritten digit images clustering
We test the clustering performances of the four algorithms on three real datasets: the ORL1 face image database, the USPS2 handwritten digits, and the MNIST3 handwritten digits. The ORL contains with 10 grayscale images for each 40 human subjects, and all images are of a size 64 64. We project them to 65 dimensions by PCA so that 90% energy in the covariance is reserved. The USPS and MNIST digit databases both contain grayscale images of“0”
through“9”. In USPS, each image is of a size 16 16 and thus 256- dimensional, and there are 1100 images for each digit. In MNIST, each image is of a size 28 28 and thus 784-dimensional, and Table 4
Four series of experiments for MFA/LFA model selection.
Starting case ðN; d; kn; βÞ ¼ ð300; 10; 3; 0:1Þ
Series 1 Vary NAf300; 290; 280; …; 100g and fix d; kn; β Series 2 Vary dAf10; 12; 14; …; 30g and fix N; kn; β Series 3 Vary knAf3; 4; 5; …; 15g and fix N; d; β Series 4 VaryβAf0:1; 0:2; …; 1:5g and fix N; d; kn
1Downloaded from“http://www.cl.cam.ac.uk/research/dtg/attarchive/facedata base.html”.
2Downloaded from“http://cs.nyu.edu/ roweis/data.html”.
3Downloaded from“yann.lecun.com/exdb/mnist/”.
each digit has 12,000 images. To study model selection on a small sample size, we randomly pick 1200 images for each digit in MNIST.
After MFA/LFA learning in an unsupervised way, all samples are partitioned into different clusters with each component being a cluster. Since we do not know the true component number in these real world databases, we use two metrics to evaluate the clustering performance: (1) Rand index (RI) [26]; (2) normalized mutual information (NMI) [27]. Both metrics compare the clus- tered partition with a ground truth partition, which is formed by letting each class (i.e., each person in ORL and each digit in USPS/
MNIST) be a cluster. Both RI and NMI take value in ½0; 1 and equal to 1 when two partitions are identical, and a higher value means greater similarity between the obtained clusters and the ground truth[28]. It should be also noted that, since an appropriate model selection helps improve the generalization ability, a high RI/NMI score is related to but does not necessarily come from an appro- priately determined component number.
After 10 independent runs for each algorithm, the average RI and NMI scores by using each algorithm are reported inTable 5. As shown, on both MFA and LFA, BYY outperforms VB in terms of both RI and NMI scores. Moreover, LFA provides better results than MFA under BYY or VB. Out of all algorithms, BYY-LFA performs the best.
5.3. Unsupervised image segmentation
We also apply the algorithms to unsupervised image segmen- tation on the Berkeley segmentation database of real world images4. Based on the fact that any feature of a single pixel may
not be sufficient for segmentation, we choose the VZ features proposed in [29,30]to take into account the neighborhood and texture information. A VZ feature for each pixel is constructed by vectorizing the color information of all pixels in a w w-sized window centered at the pixel. Here, we set w ¼7 to construct 147-dimensional feature vectors from the LAB color space. Usually, the VZ features are further projected to 8 dimensions by PCA, e.g., in[4,29,30], when the dimensionality is too high. Since MFA/LFA can simultaneously perform clustering and local dimensionality reduction, we use the 147 features directly without PCA prepro- cessing as PCA may result in some information loss.
For every image, the trained MFA/LFA model assigns each image pixel to the cluster (represented by a component) of maximum posterior probability. Since the true component number is unknown, we evaluate the resulted segmentations by the Probabilistic Rand (PR) index [31], which takes values between 0 and 1. A higher PR score indicates a better segmentation with a Fig. 1. The correct selection rates by MFA/LFA automatic model selection algorithms in the four series of simulations.
Table 5
Clustering performance by MFA/LFA algorithms on ORL, USPS and MNIST databases.
Index per algorithm ORL USPS MNIST
Rand index (RI)
BYY-MFA 0.90 0.87 0.87
BYY-LFA 0.92 0.91 0.90
VB-MFA 0.86 0.84 0.86
VB-LFA 0.89 0.88 0.89
Normalized mutual information (NMI)
BYY-MFA 0.91 0.88 0.89
BYY-LFA 0.93 0.91 0.91
VB-MFA 0.89 0.87 0.85
VB-LFA 0.91 0.88 0.88
4http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/.
higher percentage of pixel pairs in the segmentation having the same labels as in the ground truth segmentation. A good model selection performance is closely related to, but not necessarily implies, a high PR score. We directly compare the segmentation performance of the four algorithms without any post-processings
such as region merging and graph cut [32], although these techniques may further improve the segmentations.
Table 6gives the average PR scores by 5 runs on all of the 100 testing images of the Berkeley image segmentation database.
Shown inFig. 2are two examples chosen from the database. On Fig. 2. Two image segmentation examples from Berkeley segmentation database by MFA/LFA automatic model selection algorithms. The segments are illustrated by the highlighted green boundaries. (For interpretation of the references to color in thisfigure caption, the reader is referred to the web version of this paper.)
Table 6
Average PR scores of 5 runs on the 100 testing images of Berkeley image segmentation database by MFA/LFA algorithms on the 147-dimensional features. The“ GMM” results are obtained by GMM automatic model selection algorithms on the 8-dimensional VZ features in[4].
GMM from[4] MFA/LFA
VB-DNW BYY-DNW VB-MFA VB-LFA BYY-MFA BYY-LFA
0.803 0.851 0.819 0.845 0.864 0.878