In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information

(1)

Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party

websites are prohibited.

In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information

regarding Elsevier’s archiving and manuscript policies are encouraged to visit:

http://www.elsevier.com/authorsrights

(2)

Learning local factor analysis versus mixture of factor analyzers with automatic model selection

Lei Shi

^a

, Zhi-Yong Liu

^b

, Shikui Tu

^a

, Lei Xu

^a,n

aDepartment of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

bThe State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China

a r t i c l e i n f o

Article history:

Received 3 June 2013 Received in revised form 29 August 2013

Accepted 15 September 2013 Available online 31 March 2014 Keywords:

Automatic model selection Mixture of factor analyzers Local factor analysis Variational Bayes Bayesian Ying-Yang Dirichlet–Normal–Gamma

a b s t r a c t

Considering Factor Analysis (FA) for each component of Gaussian Mixture Model (GMM), clustering and local dimensionality reduction can be addressed simultaneously by Mixture of Factor Analyzers (MFA) and Local Factor Analysis (LFA), which correspond to two FA parameterizations, respectively. This paper investigates the performance of Variational Bayes (VB) and Bayesian Ying-Yang (BYY) harmony learning on MFA/LFA for the problem of automatically determining the component number and the local hidden dimensionalities (i.e., the number of factors of FA in each component). Similar to the existing VB learning algorithm on MFA, we develop an alternative VB algorithm on LFA with a similar conjugate Dirichlet–

Normal–Gamma (DNG) prior on all parameters of LFA. Also, the corresponding BYY algorithms are developed for MFA and LFA. A wide range of synthetic experiments shows that LFA is superior to MFA in model selection under either VB or BYY, while BYY outperforms VB reliably on both MFA and LFA. These empiricalﬁndings are consistently observed from real applications on not only face and handwritten digit images clustering, but also unsupervised image segmentation.

1. Introduction

Mixture models[1,2], such as Gaussian Mixture Model (GMM) [3,4], have been widely used in many applications. By exploiting the Factor Analysis (FA) [5] in each Gaussian component, the correlated high dimensional data can be represented by fewer latent factors without requiringOðd²Þ parameters for each Gaus- sian covariance matrix, where d is the dimensionality of the data.

The mixture model can be regarded as a constrained GMM, and has been studied under the name of Mixture of Factor Analyzers (MFA) [2,6] or Local Factor Analysis (LFA)[7,8]in the literature.

MFA and LFA separately employ two parameterizations of FA, shortly called as FA-a that takes the form of a free factor loading matrix and an identity covariance matrix for the latent factors, and FA-b that constrains the factor loading matrix to be a rectangular orthogonal matrix, and allows a diagonal covariance matrix for the latent variables, respectively in[9].

Learning MFA/LFA includes parameter learning for estimating all the unknown parameters and model selection for determining the component number k and the hidden dimensionalities fhig^k_{i ¼ 1}. Parameter learning is usually implemented under the maximum

likelihood principle by an Expectation–Maximization (EM) algorithm [1,10,11]. A conventional model selection approach is featured by a two-stage implementation. Theﬁrst stage conducts parameter learning for each kAM to get a set of candidate models, where k ¼ fk; fhigg for MFA/LFA. The second stage selects the best candidate by a model selection criterion, e.g., Akaike's Information Criterion (AIC)[12]. However, this two-stage implementation suffers from a huge computation because it requires parameter learning for each kAM. Moreover, a larger k often implies more unknown parameters, and then parameter estimation becomes less reliable so that the criterion evaluation reduces its accuracy (see Section 2.1 in[13]for a detailed discussion).

To reduce the computation, an Incremental Mixture of Factor Analyzers (IMoFA) algorithm was proposed on MFA in[14] with the validation likelihood as the criterion to judge whether to split a component, or add a hidden dimension, or terminate. Although such an incremental procedure can save the costs to some extent, it usually leads to a suboptimal solution[13,15].

Another road is referred to as automatic model selection, which starts from a large enough k, and has an intrinsic force to drive extra structures diminished, and thus automatically determines k during parameter learning. An early effort is Rival Penalized Competitive Learning (RPCL) on GMM [16,17]. Two Bayesian related approaches can be implemented with a nature of automatic model selection. One is Bayesian Ying-Yang (BYY) learning, Contents lists available atScienceDirect

journal homepage:www.elsevier.com/locate/neucom

Neurocomputing

nCorresponding author.

E-mail address:[email protected](L. Xu).

(3)

proposed in[18]and systematically developed in the past decade and a half [13,15,19,20], which provides a general statistical learning framework that can handle both parameter learning and model selection under a best harmony principle. BYY is capable of automatic model selection even without imposing any priors on the parameters, and its performance can be further improved with appropriate priors incorporated according to a general guideline. The other is Variational Bayes (VB) [6,21]. It tackles the difﬁculty in computing the marginal likelihood with a lower bound by means of variational method, and an EM-like algorithm is employed to optimize this lower bound. The model selection of VB is realized by incorporating an appropriate prior distributions on the parameters.

Recently, a comparative study[4]was delivered on automatic model selection by BYY, VB and MML (Minimum Message Length) for GMM with priors over the parameters. Also in[9], FA-b shows better model selection performance than FA-a under BYY and VB, although FA-a and FA-b have equivalent likelihood functions.

This paper is motivated for an empirical investigation on the automatic model selection performances of BYY and VB, based on MFA and LFA, which actually correspond to Mixture of FA-a and Mixture of FA-b, respectively. There exists a VB algorithm[6]for MFA with a Dirichlet prior on the mixing weights, Normal priors on the columns of the factor-loading matrix, and Gamma priors on precision parameters. Following[4], we consider a full prior on all parameters and adopt a Normal prior over the mean vector in each component of MFA. For short, DNG is referred to the above Dirichlet, Normal, Gamma priors. By slightly modifying the one in [6], we obtain a VB learning algorithm with the DNG prior, shortly denoted as VB-MFA. Also, a similar conjugate DNG prior is considered on the parameters of LFA.

Moreover, we develop three automatic model selection algorithms, namely the VB algorithm on LFA, or VB-LFA for short, and the BYY algorithms on MFA and LFA, shortly denoted as BYY-MFA and BYY-LFA respectively. With the conjugate property of the priors, the BYY harmony measure is computed by directly integrating out the parameters with respect to the Yang posteriors, instead of using Taylor approximations as in [9]. The handled marginal density of observed variable in each component is tackled by a lower-bound approximation with the help of additional variables, leading to products of multiple Student's T-distributions.

The performances of automatic model selection are extensively compared on a wide range of randomly simulated data, via controlling the hardness of tasks by varying the dimension of data, the number of samples, the true number of components, and the overlap degree of components. The simulated results show the following empirical ﬁndings. First, LFA gets better performance than MFA under either VB or BYY, which echoes the advantages of FA-b over FA-a observed in[9]. Second, BYY outperforms VB on both MFA and LFA, and in most cases BYY-LFA performs the best.

Also, we apply these algorithms to not only clustering face and handwritten digit images, but also unsupervised image segmentation on real world images. The results are consistent with the observations from simulated experiments.

The main contribution of this paper can be summarized in two- fold. First, three algorithms, i.e, the algorithm of VB based LFA with Dirichlet–Normal–Gamma (DNG) prior, denoted by VB-LFA, the algorithm of BYY based LFA with DNG prior, denoted by BYY-LFA, and the algorithm of BYY based MFA with DNG prior, denoted by BYY-MFA are derived in detail. Second, based on the algorithms, we empirically compared by extensive experiments the two types of clustering of factor analysis models, i.e., LFA and MFA, as well as two types of automatic model selection strategies, i.e., VB and BYY.

The remainder of this paper is organized as follows.Section 2 introduces MFA/LFA and their DNG priors. We introduce the automatic model selection algorithms with the DNG priors by

BYY inSection 3, and by VB inSection 4. Experimental comparisons are conducted via a wide range of synthetic datasets and real applications inSection 5. Finally, concluding remarks are made in Section 6.

2. Models and priors 2.1. Model parameterizations

In a mixture model, the distribution qðxjΘÞ of a d-dimensional observed random variable x is a mixture of several local distributions qðxji; θÞ, with each named as a component:

qðxjΘÞ ¼ ∑^k

i ¼ 1αiqðxji; θiÞ with Θ ¼ fαig [ fθig; ð1Þ

where k is the component number, fαig are mixing weights with

∑^ki ¼ 1αi¼ 1 and each αiZ0, and θi denotes parameters of the ith component. Here and throughout this paper, qðÞ is referred to as a generative distribution, likelihood or prior, while pðÞ is referred to as a posterior distribution.

If each component is a Gaussian distribution, i.e., qðxji; ΘÞ ¼ Gðxjμi; ΣxjiÞ with mean μi and covariance matrix Σxji, qðxjΘÞ by Eq.(1)becomes the widely used Gaussian Mixture Model. For a full matrix Σxji, there are 0:5dðdþ1Þ free parameters to be estimated, whose accuracy is difﬁcult be guaranteed for a small sample size. One way for tackling this problem is to impose certain constraints onΣxji with a Factor Analysis model, i.e.,

qðxjy; i; θiÞ ¼ GðxjAiy þμi; ΨiÞ; qðyji; θiÞ ¼ Gðyj0; ΣyjiÞ;

qðxji; θiÞ ¼Z

qðxjy; i; θiÞqðyji; θiÞ dy ¼ Gðxjμi; AiΣyjiA^T_iþΨiÞ; ð2Þ where we introduce a hidden factor y in an hi-dimensional subspace with hiod, and constrain Ψi to be diagonal. FA actually factorizesΣxjito beΣxji¼ AiΣyjiA^T_iþΨiwith fewer free parameters.

To reduce the indeterminacies of the FA by Eq.(2), two parameterizations of FA are typically used, called as Mixture of Factor Analyzers (MFA)[2,6]and Local Factor Analysis (LFA)[7,8]respectively, with their corresponding mixture models by Eq.(1)summarized inTable 1. The two FA parameterizations have equivalent likelihood functions by Eq.(2), and thus they have the same model selection performance in a two-stage implementation with AIC or BIC[22]. However, it was found that they result in different model selection performances under BYY [23], and a recent study[9]provided systematic empiricalﬁndings on how parameterizations affect model selection performance under not only BYY but also VB. Moreover, the differences of two parameterizations on model selection performance have been further analytically investigated in Section 2.2 of [20]. In this paper, we proceed to investigate the automatic model selection performances of MFA/LFA under BYY and VB.

Moreover, when each diagonal covariance Ψi in Table 1 is constrained to be spherical, i.e., Ψi¼ ψiId, MFA and LFA will degenerate to Mixture of PCA[11]and Local PCA[8], respectively.

Table 1

MFA v.s. LFA: similarity and difference. MFA and LFA are actually mixtures of FA-a and FA-b in[9], respectively.

Model: MFA (mixture of FA-a) LFA (mixture of FA-b)

Parametersθi: fA_i; μi; Ψig fU_i; Λi; μi; Ψig Same: Ψiis d d diagonal Ψiis d d diagonal

Different: Aiis general d hi Uiis orthogonal, i.e., U^T_iUi¼ I_h_i, Λiis diagonal,Λi¼ diag½λ1; …; λhi qðyji; θiÞ: Gðyj0; IhiÞ Gðyj0; ΛiÞ

qðxjy; i; θiÞ: GðxjAiy þμi; ΨiÞ GðxjUiy þμi; ΨiÞ qðxji; θiÞ: Gðxjμi; AiA^T_iþΨiÞ Gðxjμi; UiΛiU^T_iþΨiÞ

(4)

Additionally, MFA/LFA can be reformulated by introducing a binary variable set Z ¼ fztg^N_{t ¼ 1} corresponding to the i.i.d. samples XN¼ fxtg^N_{t ¼ 1}. In each zt, we have each ith element zitAf0; 1g and

∑^ki ¼ 1zit¼ 1, and zit¼ 1 iff xtis generated from the ith component.

The generative process of an observation xtis thus interpreted as three steps: (1) sample zt from a Multinomial distribution described byα ¼ ½α1; …; αk^T, i.e., zt MultinomialðαÞ; (2) generate hidden variable y from a Gaussian subspace, i.e., y ∏iqðyji; θiÞ^z^it; (3) generate xt from a Gaussian conditional on y, i.e., xt

∏iqðxtjy; i; θiÞ^z^it. Therefore, we have the following joint probability:

qðXN; Y; ZjΘÞ ¼ ∏^N

t ¼ 1 ∏^k

i ¼ 1½αiqðxtjy; i; θiÞqðyji; θiÞ^z^it: ð3Þ Given a set of observations XN, supposing that the component number k and the local hidden dimensionalities fhig are given, one widely used method for parameter estimation is the maximum- likelihood learning, which can be effectively implemented by the well-known Expectation–Maximization (EM) algorithm[1]. Model selection on LFA/MFA is to appropriately determine both the component number k and the local hidden dimensionalities fhig^k_{i ¼ 1}, or shortly to determine the tuple k ¼ fk; fhig^k_{i ¼ 1}g, on which the maximum likelihood principle fails to give a good guide[13].

2.2. The conjugate Dirichlet–Normal–Gamma priors

Considering the parameters with appropriate prior distributions can provide helpful learning regularization and also improve model selection performance [13]. Such empirical studies have been conducted on GMM with either the improper Jeffreys prior or the conjugate Dirichlet–Normal–Wishart prior under BYY, VB and MML [4], and also on FA with Normal–Gamma prior under BYY and VB [9]. This section considers similar prior distributions on parameters Θ ¼ α [ fθig^k_{i ¼ 1} of MFA or LFA inTable 1, where the Dirichelet distribution Dðαjλ; ξÞ takes the form,

Dðαjλ; ξÞ ¼ ΓðξÞ

∏^ki ¼ 1ΓðξλiÞ ∏^k

i ¼ 1α^ξλi ⁱ¹

!

; ð4Þ

with the constraintsλ ¼ ½λ1; …; λk^T; ∑iλi¼ 1; 8λiZ0:

For MFA, we consider a Dirichlet prior on the mixing weights α ¼ ½α1; …; αk^T, a Normal prior on each component's mean vector μi, an independent Gamma prior on the diagonal elements of

φi¼ Ψi¹, whereφ^ðjÞi is the jth diagonal element inφi. Moreover, a hierarchical Normal–Gamma prior is assigned on each jth column A^ð_iⁿ^jÞ of Ai, where A^ð_iⁿ^jÞ a priori comes from a zero-mean Normal distribution with a covariance Id=ςij, and ςij further follows a Gamma prior. We useΓðja; bÞ to denote the Gamma distribution with a shape parameter a40 and an inverse scale parameter b40. The whole qðΘÞ is shortly denoted as DNG with details given in the left ofTable 2. A DNG prior was considered on MFA under VB in[6]without qðμiÞ. Based on the observations in[4]that a full prior helps to improve model selection performance, in this paper we consider the full DNG prior with qðμiÞ.

For LFA, we consider a similar DNG prior onΘ in the right of Table 2, with the parts different from MFA highlighted by gray color. Therein, each orthogonal matrix Ui is considered without any prior, instead of adopting the qðUiÞ used in[9]because it is irrelevant to U_iand thus not helpful for automatic model selection.

The differences in qðΘÞ actually come from the parameterizations.

Therefore, the qðΘÞ for LFA is also called DNG prior in this paper without ambiguity. For both MFA and LFA, the DNG prior is conjugate[6,11]to the generative process described in Eq.(3).

In the sequel, we may use short notations a^φ_i ¼ ½a^φ_i1; …; a^φid^T, b^φ_i ¼ ½b^φ_i1; …; b^φid^T, a^ς_i ¼ ½a^ς_i1; …; a^ςihi^T, b^ς_i¼ ½b^ς_i1; …; b^ςih_i^T, a^ν_i¼ ½a^ν_i1; …;

a^ν_ih

i^T and b^ν_i¼ ½b^ν_i1; …; b^νihi^T for expression convenience.

3. BYY algorithms for learning LFA versus MFA 3.1. Bayesian Ying-Yang (BYY) harmony learning

Firstly proposed in [18] and systematically developed over a decade and a half[13], Bayesian Ying-Yang (BYY) harmony learning theory is a general statistical learning framework that can handle both parameter learning and model selection under a best harmony principle, which provides a favorable new mechanism for model selection.

The BYY harmony learning is featured by seeking the best harmony between the Ying- Yang pair in a BYY system. The BYY system consists of Yang machine and Ying machine, respectively corresponding to two types of decompositions, namely Yang pðRjXÞpðXÞ and Ying qðXjRÞqðRÞ, where the observed data X is regarded as generated from its inner representation R ¼ fY; Θg that consists of latent variables Y and parameters Θ, supported by a hyper-parameter setΞ. The harmony measure is mathematically expressed as follows[13,15,19]:

Hðpjjq; ΞÞ ¼Z

pðRjXÞpðXÞ½qðXjRÞqðRÞdX dR: ð5Þ Maximizing Hðpjjq; ΞÞ leads to not only a best matching between the Ying-Yang pair, but also a compact model with a least complexity. Different from VB model selection that bases on an appropriate prior qðΘjΞÞ (seeSection 4for the details of the prior in VB), BYY harmony learning by Eq. (5) bases on qðRÞ ¼ qðYjΘÞqðΘjΞÞ to make model selection, where qðYjΘÞ plays a role that is not only equally important to qðΘjΞÞ but also easy computing, and qðΘjΞÞ is still handled in a way similar to VB.

Moreover, maximizing Hðpjjq; ΞÞ is implemented with the help of the general two-stage iterative procedure shown by Fig. 6 in[19]

(also see Eqs.(6)and(7)in[8]). Theﬁrst stage estimates Ξ (usually via estimating Θ) by an optimization of continuous variables, while the second stage involves a discrete optimization on one or several integers that index candidate models. Here, we only consider theﬁrst stage where automatic model selection actually performs, though the second stage may be also considered to further improve the model selection performance with much more computing costs.

Table 2

The Dirichlet–Normal–Gamma priors on MFA and LFA with hyper-parameters Ξ ¼ fλ; ξ; βg [ fΞig^k_{i ¼ 1}. Priors on MFA/LFA are described in the two big columns respectively, whose differences are highlighted with gray color. The “distr.”

columns indicate the distribution types for clarity, with“D” for Dirichlet, “N” for Normal, and“G” for Gamma, respectively.

MFA LFA

Θ ¼ α [ fθig_i,θi¼ fA_i; μi; Ψig Θ ¼ α [ fθig_i,θi¼ fU_i; Λi; μi; Ψig φ_i¼ Ψ_i¹; A^ðnjÞ_i : jth column vector of Ai νi¼ Λi¹,φi¼ Ψi¹

Ξi¼ fm_i; fa^φ_ij; b^φ_ijg

j; fa^ς_ij; b^ς_ijg

jg Ξi¼ fm_i; fa^φ_ij; b^φ_ijg

j; fa^ν_ij; b^νijg

jg

prior distr. prior distr.

qðΘÞ ¼ qðαÞ ∏^k

i ¼ 1qðθiÞ DNG qðΘÞ ¼ qðαÞ ∏^k

i ¼ 1qðθiÞ DNG

qðαÞ ¼ Dðαjλ; ξÞ D qðαÞ ¼ Dðαjλ; ξÞ D

qðθiÞ ¼ qðμiÞqðφiÞqðA_iÞ NG qðθiÞ ¼ qðμiÞqðφiÞqðνiÞ NG qðμiÞ ¼ Gðμijmi; Id=βÞ N qðμiÞ ¼ Gðμijmi; Id=βÞ N qðφiÞ ¼ ∏^d

j ¼ 1

Γðφ^ðjÞ_ia^φ_ij; b^φ_ijÞ G qðφiÞ ¼ ∏^d

j ¼ 1

Γðφ^ðjÞ_ija^φ_ij; b^φ_ijÞ G qðAijςiÞ ¼ ∏^hⁱ

j ¼ 1

GðA^ðnjÞ_i j0; Id=ς^ðjÞ_i Þ N qðνiÞ ¼ ∏^hⁱ

j ¼ 1Γðν^ðjÞ_i ja^ν_ij; b^νijÞ G qðςiÞ ¼ ∏^hⁱ

j ¼ 1

Γðς^ðjÞ_i ja^ς_ij; b^ς_ijÞ G

(5)

BYY harmony learning leads to improved model selection via either or both of improved model selection criteria and algorithms with automatic model selection. Such a merit can be intuitively understood as follows[13]:

HðpjjqÞ ¼ Z

pðXÞln qðXÞ dX ¼ HðpjjpÞKLðpjjqÞ: ð6Þ Thus, besides the Kullback–Leibler divergence, a system entropy term HðpjjpÞ is also incorporated into the BYY objective function.

By contrast, VB tries to maximize the marginal likelihood by minimizing only the Kullback–Leibler divergence. It is the term HðpjjpÞ that makes the BYY harmony learning possess automatic model selection ability, even there is no prior on the parameter.

More speciﬁcally, to estimate some mixture model such as the MFA, with the help of HðpjjpÞ the BYY harmony learning takes the following E-step which compared with the conventional EM algorithm has an extra termΔ as follows[13]:

p_j;t¼ pðjjtÞþΔj;t: ð7Þ

By such a regularization termΔ^j;t (see also(A.9)inAppendix A), the updating on the jth component shares somewhat a similar updating to the rival penalized competitive learning (RPCL) [16,17], and thus realizes automatic model selection.

On MFA and LFA with the DNG priors in a conjugate manner, there is still no detailed automatic model selection algorithm available for implementing BYY harmony learning, and thus this section targets at developing such algorithms.

3.2. BYY algorithm for learning LFA with DNG prior

For LFA, we consider the Ying machine as qðX; RÞ ¼ qðX; Y;

ZjΘÞqðΘÞ, with R ¼ fY; Z; Θg, qðX; Y; ZjΘÞ given in Eq.(3)and qðΘÞ given in the right ofTable 2. In the Yang machine, we consider pðXÞ as the empirical distribution, i.e., pðXÞ ¼δðXX^NÞ, and the Yang- pathway pðRjXÞ as

pðRjXNÞ ¼ pðΘjZ; Y; X^NÞpðYjZ; X^NÞpðZjXNÞ;

pðΘjY; Z; XNÞ ¼ pðαjZ; XNÞ ∏^k

i ¼ 1

pðμijZ; XNÞ ∏^hⁱ

j ¼ 1

pðν^ðjÞi jY; Z; XNÞ

"

∏^d

j ¼ 1

pðφ^ðjÞi jY; Z; XNÞ

#

;

pðYjZ; X^NÞ ¼ ∏^k

i ¼ 1 ∏^N

t ¼ 1

pðyji; x^tÞ^z^it;

pðZjXNÞ ¼ ∏^k

i ¼ 1 ∏^N

t ¼ 1

pðijxtÞ^z^it: ð8Þ

Particularly, in accordance with the variety preservation principle (see Section 4.2 in [13]), the details of pðΘjY; Z; XNÞ are further designed as the following posteriors in the DNG form by utilizing the conjugate property:

pðαjZ; XNÞ ¼ Dðαjλⁿ; ξþNÞ;

pðμijZ; XNÞ ¼ G μijmⁿ_i; βIdþ ∑^N

t ¼ 1

z_it

diagða^φ_i⊘b^φiÞ

¹!

; pðν^ðjÞi jY; Z; X^NÞ ¼ Γ ν^ðjÞi ja^ν_ijþ1

2 ∑^N

t ¼ 1

zit; bⁿij^ν

; pðφ^ðjÞi jY; Z; X^NÞ ¼ Γ φ^ðjÞi ja^φ_ijþ1

2 ∑^N

t ¼ 1

z_it; bⁿij^φ

; ð9Þ

where fλⁿ; mⁿi; bⁿij^ν; bⁿij^φg are free hyper-parameters to be optimized.

Therein and throughout this paper, we use symbols“⊘” and “”

to denote the Hadamard (element-by-element) division and product respectively.

The pðijxtÞ in Eq.(8)is constructed as the Bayesian posterior of qðxt; iÞ, i.e., pðijx^tÞpqðx^t; iÞ, where the qðx^t; iÞ is computed by qðxt; iÞ ¼Z

qðxt; ijΘÞqðΘÞ dΘ ¼ λiqðxtjiÞ; ð10Þ qðxtjiÞ ¼Z

qðxtji; θiÞqðθiÞ dθi¼Z

qðxtjy_t; i; θiÞqðy_tji; θiÞqðθiÞ dy_tdθi; ð11Þ withθi¼ fμi; νi; φig. However, it is difﬁcult to directly compute the above integral over fθi; ytg for an analytical qðxtjiÞ. Therefore, qðxtjiÞ is sequentially approximated by lower-bounds according to Jen- sen's inequality, i.e.,

qðxtjiÞZZ

~qðxtji; θiÞqðθiÞ dθiZZ

~qðxtji; νi; φiÞqðνiÞqðφiÞ dνidφi; ð12Þ which leads to the marginal qðxtjiÞ as a product of several Student's T-distributions in Eq.(A.1)inAppendix A, where

~qðxtji; θiÞ ¼ exp Z

Gðy_tj~y_it; ~Σ^yiÞ lnqðxtjyt; θiÞqðytjθiÞ Gðy_tj~yit; ~Σ^yiÞ dy

( )

; ð13Þ

~qðxtji; νi; φiÞ ¼ exp Z

Gðμij~μ^xit; ~Σ^μi^xÞ ln~qðxtji; θiÞqðμiÞ Gðμij~μ^xit; ~Σ^μi^xÞdμi

( )

; ð14Þ

and f~yit; ~Σ^yi; ~μ^xit; ~Σ^μi^xg are assistant variables which can be updated by maximizing the above lower bounds with the details referred to Appendix A.

Similarly for pðyji; xtÞ, we can also obtain a product of multiple Student's T-distributions, which however makes the subsequent integrals in the harmony measure difﬁcult. We further approximate it by the following Gaussian according to the property of Student's T-distribution[24]:

pðyjxt; iÞ GðyjWiðxt~μ^y_itÞ; ΠiÞ;

W_i¼ ΠiU^T_iD_i; Di¼ diag½ða^φ_i þ¹₂1_dÞ⊘ðb^φiþ¹₂ diagð ~Σ^μi^yÞÞ;

Πi¼ ½U^T_iDiUiþdiagða^ν_i⊘b^νiÞ¹; ð15Þ where~μ^yitand ~Σ^μi^ycan be updated for a better approximation given the parameters. The details are referred toAppendix A.

Putting Eq. (8) into Eq. (5), the harmony measure on LFA becomes

H^LFAðfUig; ΞÞ ¼Z

∑

Z

pðΘjZ; Y; XNÞpðYjZ; XNÞpðZjXNÞ

ln½qðXNjY; Z; ΘÞqðYjZ; ΘÞqðZjΘÞqðΘjΞÞ dY dΘ: ð16Þ By further substituting the details of Eqs.(9)–(15)into Eq.(16), we obtain the following lower-bound:

H^LFAðfUig; ΞÞZH^LFAðfUig; Ξ; ΞⁿÞ; ð17Þ where the detailed expression of H^LFAðfUig; Ξ; ΞⁿÞ is given by Eq.(A.5)in Appendix A. The best harmony principle is approximated by maximizing H^LFAðfUig; Ξ; ΞⁿÞ with respect to the prior hyper-parametersΞ ¼ fλ; ξ; fmig; β; fa^νij; b^νijg; fa^φij; b^φijgg and the posterior hyper-parametersΞⁿ¼ fλⁿ; fmⁿig; fbⁿij^νg; fbⁿij^φgg. The derived algorithm is sketched inTable 3and shortly denoted as BYY-LFA, with details referred toAppendix A.

The above checking on whether to discard dimension j in component i is actually observing whether the jth diagonal element ofΛitends to zero and thus the corresponding dimension of y can be discarded. Following Section 2.2 of[20], e.g., its Eq.

(36), this checking can be further improved by the nature of the co-dim matrix pair of each FA model.

(6)

3.3. BYY algorithm for learning MFA with DNG prior

Learning on MFA can be made in a similar way. The main differences come from the parameterizations on the factor loading matrix and the factor covariance matrix, and their corresponding prior distributions, as shown inTables 1and2.

The Ying machine is represented by replacing the counterparts of LFA with the ones of MFA, i.e., getting qðX; Y; ZjΘÞ in Eq.(3)by the left column ofTable 1and qðΘÞ by the left part ofTable 2. In Yang machine, we still consider the empirical distribution pðXÞ ¼δðXXNÞ, but the Yang-pathway pðRjXÞ is factorized in a form slightly different from Eq.(8):

pðRjXNÞ ¼ pðΘjZ; XNÞpðYjZ; XN; fAigÞpðZjXNÞ;

pðΘjZ; XNÞ ¼ pðαjZ; XNÞ ∏^k

i ¼ 1

pðμijZ; XNÞ ∏^d

j ¼ 1

pðφ^ðjÞi jZ; XNÞ (

∏^hⁱ

j ¼ 1

½pðA^ð_iⁿ^jÞjZ; XNÞpðς^ðjÞi jZ; XNÞ

)

;

pðYjZ; XN; fAigÞ ¼ ∏^k

i ¼ 1 ∏^N

t ¼ 1

pðyji; x^t; AiÞ^z^it; ð18Þ where pðZjXNÞ is expressed in the same form as the one in Eq.(8).

We proceed to the detailed expressions of pðΘjZ; XNÞ according to the conjugate property of the DNG priors on MFA. Particularly, Eq.(8)is modiﬁed accordingly by replacing pðν^ðjÞi jY; Z; X^NÞ with the following equations:

pðA^ð_iⁿ^jÞjZ; X^NÞ ¼ G

A^ð_iⁿ^jÞjA^ð_iⁿ^jÞ;

a^ς_ij=b^ςijIdþ ∑^N

t ¼ 1

zit

diagða^φ_i⊘b^φiÞ

¹

;

pðς^ðjÞi jZ; XNÞ ¼ Γ ς^ðjÞi ja^ς_ijþd 2; bⁿij^ς

; ð19Þ

where fA^ð_iⁿ^jÞ; bⁿij^ς; bⁿij^φg are free hyper-parameters to be determined.

Similar to Eq. (10), the pðijxtÞ is constructed by the Bayesian posterior, i.e., pðijxtÞpqðx^t; iÞ. The qðx^tjiÞ for MFA can also be approximated via Eqs.(12)–(14), where ~qðxtji; νi; φiÞ is substituted by ~qðxtji; Ai; φiÞ, and then Eq. (12) is computed by integrating

~qðxtji; Ai; φiÞqðζiÞqðφiÞ with respect to Ai; ζi; φi, leading to a different qðxtjiÞ which is also a product of multiple Students T-distributions.

Moreover, the pðyji; xt; fAigÞ is approximated by Eq.(15)with slight modiﬁcations: replacing Ui and diagða^ν_i⊘b^νiÞ with Ai and Ih_i

respectively.

Putting the above speciﬁcations of Ying-Yang machine into Eq. (5), we obtain a lower-bound H^MFAðΞ; ΞⁿÞ analogous to H^LFAðfUig; Ξ; ΞⁿÞ, and also a corresponding BYY-MFA algorithm to

maximize H^MFAðΞ; ΞⁿÞ via modifying the counterparts of BYY-LFA according to Eqs. (18) and(19). Readers are referred to [25]for details.

It should be noted that the pðθijY; Z; X^NÞ is considered to have a conjugate Normal–Gamma form, whereas it was approximated by a 2nd order Taylor expansion in [9] (see Eqs. (13)–(17) and Table B1 in [9] for details), although the prior qðθiÞ on each component's parametersθi inTable 2is the same as those in[9].

Therefore, when the mixture model in Eq.(1)degenerates to only one Gaussian component represented by FA in [9], i.e., k ¼1, the BYY-LFA and BYY-MFA here actually provide alternative BYY learning algorithms for FA-a and FA-b, being different from the algorithms derived in[9].

4. VB algorithms for learning LFA versus MFA

With proper incorporation of prior knowledge on model parameters, Bayesian model selection is implemented via the maximum marginal likelihood, which is obtained by integrating out the latent variables Y and the parameters Θ, i.e., qðXNÞ ¼ RqðXN; YjΘÞqðΘÞ dY dΘ. However, the involved integration is usually very difﬁcult. Variational Bayesian [6,21] tackles this difﬁculty via constructing a tractable lower bound for the log marginal likelihood by means of variational methods, and an EM- like algorithm is employed to optimize this lower bound. More precisely, the lower bound is given as follows:

J_VBðΞⁿ; ΞÞ ¼Z

pðΘ; YjΞⁿÞ lnqðXN; YjΘÞqðΘjΞÞ pðΘ; YjΞⁿÞ dY dΘ

¼ ln qðXNjΞÞZ

pðΘ; YjΞⁿÞ ln pðΘ; YjΞⁿÞ

pðΘ; YjXN; ΞÞdY dΘ; ð20Þ where Y represents all hidden variables, e.g., fY; Zg in Eq.(3)for MFA/LFA, qðΘjΞÞ is a prior on parameters Θ with hyper-parameters Ξ, and pðΘ; YjΞⁿÞ is a variational posterior with hyper-parameters Ξⁿ to approximate the exact Bayesian posterior pðΘ; YjXN; ΞÞp qðXN; YjΘÞqðΘjΞÞ. It follows from Eq. (20) that the lower bound J_VBðΞⁿ; ΞÞ is tight to ln qðXNjΞÞ when pðΘ; YjΞⁿÞ ¼ pðΘ; YjXN; ΞÞ. For computational convenience, qðΘjΞÞ is usually chosen to be conjugate priors, and pðΘ; YÞ is usually assumed to be afactorized form pðΘ; YÞ ¼ pðYÞ∏ipðΘiÞ with Θ ¼ ⋃iΘi, so that maximizing JVBmakes the variational posterior pðΘÞ to be conjugate, i.e., in the same form as the corresponding prior.

There is already a VB learning algorithm on MFA proposed in [6], where the prior onθi is qðθiÞ ¼ qðφiÞqðAiÞ, without considering the prior onμi. In this paper, the DNG prior inTable 2considers a Gaussian distribution qðμiÞ for μi. Then, the corresponding VB Table 3

BYY algorithm on LFA with DNG prior (BYY-LFA).

1 Initialization: Randomly initialize the model with large enough number k of components and hidden dimensionalities fhig; set τ ¼ 0 and the harmony measure J_BYYðτÞ ¼ 1;

2 repeat 3 4 5 6 7 8 9 10 11 12

Yangstep : Update assistant variables f ~y_it; ~Σ^y_ig by Eq: ðA:3Þ; f ~μ; ~Σ^μ_i^xg by Eq: ðA:2Þ; and f ~μ^y_it; ~Σ^μ_i^yg by Eq: ðA:4Þ: Calculate pðijxtÞ by Eq: ðA:1Þ; eitandϵitby Eq: ðA:6Þ:

Yingstep : With ∇UiH^LFAby Eq: ðA:8Þ; update each Uivia U^new_i ¼ Uôld_i þη½∇UiH^LFAUôld_i ð∇UiH^LFAÞ^TUôld_i :

UpdateΞⁿ¼ fλⁿi; mⁿ_i; b^nνi ; b^nφ_i g in a gradient way by Eq: ðA:7Þ:

Hstep : In a gradient way; update fλ; ξ; βg [ fmig by Eq: ðA:10Þ; fa^ν_i; b^ν_ig by Eq: ðA:11Þ; and fa^φ_i; b^φ_ig by Eq: ðA:12Þ:

for i ¼ 1; …; k do

ifλⁿi orλi-0 then discard component i; let k ¼ k1;

for j ¼ 1; …; hido if_aν ^b^nν^ij

ijþ¹₂∑^Nt ¼ 1pðijxtÞor^b_a^ν^ijν

ij-0 then discard dimension j in component i; let hi¼ hi1;

66 66 66 4

if another 5 runs pass then letτ ¼ τþ1; calculate JBYYðτÞ ¼ H^LFAby Eq: ðA:5Þ;

13 until J_BYYðτÞJBYYðτ1ÞoϵJBYYðτ1Þ; with ϵ ¼ 10⁵;

(7)

algorithm (shortly denoted as VB-MFA) can be obtained through slightly modifying the one in[6]by replacing μi with mⁿ_i when updating other parameters, and additionally update the following variational posterior forμi:

pðμiÞ ¼ Gðμijmⁿ_i; Σⁿi^μÞpEp½ln½qðXN; Y; ZjΘÞqðΘÞ; ð21Þ where Ep½ denotes expectation with respect to the current estimate of all variational posteriors pðYjZÞpðZÞpðΘÞ except pðμiÞ.

There is no VB algorithm available for LFA with DNG prior, yet there is a VB algorithm presented in[9]on FA-b, a degenerated case of LFA with k ¼1. Following[6], the joint posterior is assumed to be approximately factorized as the following variational posterior:

pðY; Z; ΘjX^NÞ pðYjZÞpðZÞpðΘÞ;

pðΘÞ ¼ pðαÞ ∏^k

i ¼ 1

pðμiÞ ∏^hⁱ

j ¼ 1

pðν^ðjÞi Þ ∏^d

j ¼ 1

pðφ^ðjÞi Þ

" #

; ð22Þ

and then the variational lower bound becomes J^LFA_VB ¼

Z

pðYjZÞpðZÞpðΘÞ ln qðX; Y; ZjΘÞqðΘÞ pðYjZÞpðZÞpðΘÞ

dY dZ dX dΘ: ð23Þ

Its maximization leads to the following conjugate form:

pðZÞ ¼ ∏^k

i ¼ 1 ∏^N

t ¼ 1

p^z_it^it; pðYjZÞ ¼ ∏^k

i ¼ 1 ∏^N

t ¼ 1

Gðyjyⁿ_it; Σⁿi^yÞ^z^it; pðαÞ ¼ Dðαjλⁿ; ξⁿÞ; pðμiÞ ¼ Gðμijmⁿ_i; Σⁿi^μÞ;

pðν^ðjÞi Þ ¼ Γðν^ðjÞi jaⁿ_ij^ν; bⁿij^νÞ; pðφ^ðjÞi Þ ¼ Γðφ^ðjÞi jaⁿ_ij^φ; bⁿij^φÞ: ð24Þ An EM-like algorithm is given inAppendix B, for maximizing JVBLFA

with respect to the variational parameters ffpit; yⁿitg_t; Σⁿi^yg^k_{i ¼ 1} that describe the posterior distribution on Y [ Z, and the variational hyper-parametersΞⁿ¼ fλⁿ; ξⁿ; fmⁿi; Σⁿi^μg_i; faⁿij^ν; bⁿij^νg

i;j; faⁿij^φ; bⁿij^φg

i;jg that describe the posterior onΘ.

5. Experimental results

Since the values of the hyper-parameters Ξ are usually unknown and it is not good to assign one for all data by guess, the algorithms are implemented to adjust Ξ under their own learning principles. Moreover, it has been demonstrated in[4]that the performance of VB will be improved when the fmig ofTable 2 are constrained to be the same, i.e., 8 i; mi¼ m, with the number of free hyper-parameters reduced, and thus VB is here implemented with mi¼ m too. Still, no constraints are imposed on fmig for BYY algorithms.

5.1. Comparisons on four series of simulations

Each dataset is generated according to MFA or LFA in Tables 1 and 2. For each component i, we set the hidden dimensionality hⁿ_i ¼ 5, and the mixing weight αi¼ 1=kⁿ, where kⁿ denotes the true component number. The remaining parameters are randomly generated according to the Normal–Gamma

distributions given inTable 2, with a^ς_ij¼ b^ς_ij¼ 3, a^ν_ij¼ 10, b^ν_ij¼ 200, and a^φ_ij¼ b^φ_ij¼ 10, 8i; j.

To cover a wide range of experimental conditions, we vary the values of the sample size N, the observed data dimensionality d, the true component number kⁿ, and the overlap degree β of Gaussian components, where increasing β indicates that the degree of separation of the components changes from large to small. Generally speaking, it becomes more difﬁcult in model selection as N decreases, d increases, kⁿincreases, andβ increases.

We consider four series of experiments speciﬁed in Table 4.

Starting from a same point in the 4-dimensional factor space, each series varies one factor of ðN; d; kⁿ; βÞ while ﬁxing the remaining three. For each speciﬁc setting, 500 datasets are generated independently, and all algorithms are initialized with a same component number 25 and a same hidden dimensionality 9. We compare the resulted k [ fh_ig^k_{i ¼ 1} with the true kⁿ[ fhⁿ_ig^k_{i ¼ 1}ⁿ with each hⁿ_i ¼ 5. The model selection accuracies are reported inFig. 1 as percentages of correctly obtaining k ¼ kⁿand hi¼ 5ð8iÞ out of 500 independent runs.

Fig. 1 shows that all algorithms perform well at the starting case, and then decline as the experimental environment deterio- rates. Particularly, we observe:

1. LFA is shown to be superior to MFA in model selection, under either BYY or VB. This observation is consistent to the empirical findings on FA in[9]. Specifically, the superiority of LFA is less obvious in the cases of small N, or large kⁿ; β, because both LFA and MFA get bad performance on these extreme cases. The reason may be due to the fact that y in MFA processes an identity covariance matrix, which can be taken as a special case of that of LFA. Thus, LFA is more flexible than MFA to accommodate different types of data. Moreover, the LFA is better than MFA in terms of providing one additional room for model selection via estimatingΛ. Compared with adding priors, Gðyj0; ΛÞ is more reliable and easier to be estimated from data.

2. On either LFA or MFA, BYY greatly outperforms VB for all the cases except the one kⁿ¼ 7 in series 3. Although VB-LFA beneﬁts from the superiority of using LFA instead of MFA, BYY-MFA is still more robust than VB-LFA against the deteriora- tion of the environment, while BYY-LFA is the best in general.

Regarding the time-cost, due to space limit we are not able to present the detailed results. It was observed that the BYY in general involves a heavier computational load than VB because it involves more gradient calculations. However, the complexities of both algorithms are comparable to each other (around OðN³Þ), since the main computation of both of them arises from the matrix multiplication.

5.2. Face and handwritten digit images clustering

We test the clustering performances of the four algorithms on three real datasets: the ORL¹ face image database, the USPS² handwritten digits, and the MNIST³ handwritten digits. The ORL contains with 10 grayscale images for each 40 human subjects, and all images are of a size 64 64. We project them to 65 dimensions by PCA so that 90% energy in the covariance is reserved. The USPS and MNIST digit databases both contain grayscale images of“0”

through“9”. In USPS, each image is of a size 16 16 and thus 256- dimensional, and there are 1100 images for each digit. In MNIST, each image is of a size 28 28 and thus 784-dimensional, and Table 4

Four series of experiments for MFA/LFA model selection.

Starting case ðN; d; kⁿ; βÞ ¼ ð300; 10; 3; 0:1Þ

Series 1 Vary NAf300; 290; 280; …; 100g and fix d; kⁿ; β Series 2 Vary dAf10; 12; 14; …; 30g and fix N; kⁿ; β Series 3 Vary kⁿAf3; 4; 5; …; 15g and fix N; d; β Series 4 VaryβAf0:1; 0:2; …; 1:5g and fix N; d; kⁿ

1Downloaded from“http://www.cl.cam.ac.uk/research/dtg/attarchive/facedata base.html”.

2Downloaded from“http://cs.nyu.edu/ roweis/data.html”.

3Downloaded from“yann.lecun.com/exdb/mnist/”.

(8)

each digit has 12,000 images. To study model selection on a small sample size, we randomly pick 1200 images for each digit in MNIST.

After MFA/LFA learning in an unsupervised way, all samples are partitioned into different clusters with each component being a cluster. Since we do not know the true component number in these real world databases, we use two metrics to evaluate the clustering performance: (1) Rand index (RI) [26]; (2) normalized mutual information (NMI) [27]. Both metrics compare the clus- tered partition with a ground truth partition, which is formed by letting each class (i.e., each person in ORL and each digit in USPS/

MNIST) be a cluster. Both RI and NMI take value in ½0; 1 and equal to 1 when two partitions are identical, and a higher value means greater similarity between the obtained clusters and the ground truth[28]. It should be also noted that, since an appropriate model selection helps improve the generalization ability, a high RI/NMI score is related to but does not necessarily come from an appropriately determined component number.

After 10 independent runs for each algorithm, the average RI and NMI scores by using each algorithm are reported inTable 5. As shown, on both MFA and LFA, BYY outperforms VB in terms of both RI and NMI scores. Moreover, LFA provides better results than MFA under BYY or VB. Out of all algorithms, BYY-LFA performs the best.

5.3. Unsupervised image segmentation

We also apply the algorithms to unsupervised image segmentation on the Berkeley segmentation database of real world images⁴. Based on the fact that any feature of a single pixel may

not be sufﬁcient for segmentation, we choose the VZ features proposed in [29,30]to take into account the neighborhood and texture information. A VZ feature for each pixel is constructed by vectorizing the color information of all pixels in a w w-sized window centered at the pixel. Here, we set w ¼7 to construct 147-dimensional feature vectors from the LAB color space. Usually, the VZ features are further projected to 8 dimensions by PCA, e.g., in[4,29,30], when the dimensionality is too high. Since MFA/LFA can simultaneously perform clustering and local dimensionality reduction, we use the 147 features directly without PCA prepro- cessing as PCA may result in some information loss.

For every image, the trained MFA/LFA model assigns each image pixel to the cluster (represented by a component) of maximum posterior probability. Since the true component number is unknown, we evaluate the resulted segmentations by the Probabilistic Rand (PR) index [31], which takes values between 0 and 1. A higher PR score indicates a better segmentation with a Fig. 1. The correct selection rates by MFA/LFA automatic model selection algorithms in the four series of simulations.

Table 5

Clustering performance by MFA/LFA algorithms on ORL, USPS and MNIST databases.

Index per algorithm ORL USPS MNIST

Rand index (RI)

BYY-MFA 0.90 0.87 0.87

BYY-LFA 0.92 0.91 0.90

VB-MFA 0.86 0.84 0.86

VB-LFA 0.89 0.88 0.89

Normalized mutual information (NMI)

BYY-MFA 0.91 0.88 0.89

BYY-LFA 0.93 0.91 0.91

VB-MFA 0.89 0.87 0.85

VB-LFA 0.91 0.88 0.88

4http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/.

(9)

higher percentage of pixel pairs in the segmentation having the same labels as in the ground truth segmentation. A good model selection performance is closely related to, but not necessarily implies, a high PR score. We directly compare the segmentation performance of the four algorithms without any post-processings

such as region merging and graph cut [32], although these techniques may further improve the segmentations.

Table 6gives the average PR scores by 5 runs on all of the 100 testing images of the Berkeley image segmentation database.

Shown inFig. 2are two examples chosen from the database. On Fig. 2. Two image segmentation examples from Berkeley segmentation database by MFA/LFA automatic model selection algorithms. The segments are illustrated by the highlighted green boundaries. (For interpretation of the references to color in thisﬁgure caption, the reader is referred to the web version of this paper.)

Table 6

Average PR scores of 5 runs on the 100 testing images of Berkeley image segmentation database by MFA/LFA algorithms on the 147-dimensional features. The“ GMM” results are obtained by GMM automatic model selection algorithms on the 8-dimensional VZ features in[4].

GMM from[4] MFA/LFA

VB-DNW BYY-DNW VB-MFA VB-LFA BYY-MFA BYY-LFA

0.803 0.851 0.819 0.845 0.864 0.878