FA − parameterizations affect model selection

How parameterizations affect model selection performance has rarely been studied, because traditional model selection criteria, such as Akaike's Information Criterion (AIC), Schwarz's Bayesian Information Criterion (BIC), and the difference of negative log-likelihood (DNLL), perform identically on different parameterizations that have equivalent likelihood functions. For factor analysis (FA), in addition to the traditional model (denoted by FA-a), it was previously found that there is another parameterization (denoted by FA-b), and that Bayesian Ying-Yang (BYY) harmony learning yields different model selection performances on FA-a and FA-b. This chapter investigates a family of FA parameterizations that have equivalent likelihood functions, where each member (denoted by FA-r) is indexed by an integer r, with FA-a at one end (r = 0) and FA-b at the other end, where r reaches its upper bound. In addition to comparing BYY learning with AIC, BIC, and DNLL, we also implement variational Bayes (VB). Several empirical findings have been obtained via extensive experiments.

CHAPTER 3. FA PARAMETERIZATIONS AFFECT MODEL SELECTION

3.1 Parameterization Issue in Model Selection

Model selection is traditionally implemented in two stages. The first stage enumerates a set of candidate models via an index $k$ that represents the complexity of the corresponding model, and estimates the parameter $\hat\theta_k$ that maximizes the likelihood $L(\theta_k)$. The second stage selects the best complexity $k$ according to one of the typical criteria, such as AIC [1] and BIC [86], which are given in Eq.(1.14) in the format

$$J(k) = L(\hat\theta_k) + C(k). \tag{3.1}$$

Two candidate models with different parameterizations have the same model selection performance if they share equivalent likelihoods and the same complexity term $C(k)$. Consequently, how parameterizations affect model selection performance has been ignored or seldom studied.
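The second stage can be sketched in a few lines. The sketch below assumes the standard forms AIC = −2 ln L + 2d and BIC = −2 ln L + d ln N; Eq.(1.14) is not reproduced in this chapter, so its sign convention may differ. The function name, the inputs and the log-likelihood values are hypothetical. The parameter counts use $d_m = nm + 1 - m(m-1)/2$ with n = 15.

```python
import math

def select_complexity(max_loglik, n_free, n_samples, criterion="bic"):
    """Stage II: pick the candidate k that minimizes -2 ln L + penalty.

    max_loglik[k] -- maximized log-likelihood from Stage I for candidate k
    n_free[k]     -- number of free parameters of candidate k
    """
    scores = {}
    for k, loglik in max_loglik.items():
        if criterion == "aic":
            penalty = 2.0 * n_free[k]
        elif criterion == "bic":
            penalty = n_free[k] * math.log(n_samples)
        else:
            raise ValueError("unknown criterion: " + criterion)
        scores[k] = -2.0 * loglik + penalty
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical Stage I output for three candidates (made-up log-likelihoods).
loglik = {1: -520.0, 2: -480.0, 3: -478.5}
d_free = {1: 16, 2: 30, 3: 43}
best_aic, _ = select_complexity(loglik, d_free, n_samples=100, criterion="aic")
best_bic, _ = select_complexity(loglik, d_free, n_samples=100, criterion="bic")
```

Because the score depends only on the maximized likelihood and the parameter count, two parameterizations that agree on both necessarily select the same k.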

In the literature, the parameterization issue of a statistical model has been studied within the Bayesian paradigm with respect to the performance of numerical inference techniques rather than model selection. Reparameterization techniques include parameter transformation towards posterior normality and orthogonality [46, 56], as well as data augmentation (adding latent variables) and parameter expansion (adding new parameters) to improve computational accuracy and efficiency, for example by fitting the data more accurately or by speeding up the Gibbs sampler for the posterior [37]. More recently, FA was overparameterized in [39] to obtain a fast Gibbs sampler for the posterior distribution, where the factor loading matrix has a lower triangular structure and the covariance matrix of the latent factors is diagonal.

As given by Eq.(1.29), (1.30) and (1.31) in Chapter 1, there exist two parameterizations of FA, i.e., FA-a and FA-b. The details are briefly summarized in Tab. 3.1. FA-a and FA-b are equivalent under maximum-likelihood (ML) learning because the corresponding two likelihood functions by Eq.(1.31) are equivalent, and thus they get the same model selection performance under the criterion of eq.(3.1). However, it was found that Bayesian Ying-Yang (BYY) harmony learning gets different model selection performances on FA-a and FA-b [119, 121, 48, 88].

The following sections continue the above study to further examine how parameterizations affect model selection performance, based on FA-a and FA-b, under not only BYY but also VB, as well as AIC [1] and BIC/MDL [86, 83] given in Eq.(1.14), and the logarithm of the likelihood ratio, i.e., the difference of the negative log-likelihood (DNLL) given by Eq.(2.7), which performs model selection by capturing the decrement of the negative log-likelihood as the candidate hidden dimensionality increases by one.

In the following, we consider FA with $\Sigma_e = \sigma_e^2 I_n$, which makes FA equivalent to Principal Component Analysis (PCA) [5, 97] under the ML principle. Without loss of generality, we also assume zero mean, i.e., $a_0 = 0$.

3.2 FA-r: ML-equivalent Parameterizations of FA

Since FA-a has more free parameters than FA-b, the two differ under AIC or BIC by eq.(1.14) if the number of free parameters is directly used as $d_m$. In practice [5, 97, 121], the extra degrees of freedom in FA-a are subtracted, i.e., $d_m = nm + 1 - \frac{m(m-1)}{2}$, which equals the number of free parameters in FA-b. Thus, we get the same estimate $\hat m$ under AIC or BIC by eq.(1.14).

Although FA-a and FA-b are equivalent in model selection under AIC or BIC, they have been pointed out to be different under the Bayesian Ying-Yang (BYY) learning in [118, 119]. This motivates us to further investigate how the forms of parameterizations affect model selection performance. For a systematic study, we present a new family of ML-equivalent FA parameterizations varying from FA-a to FA-b as follows.

The difference between FA-a and FA-b mainly comes from how the complexity of the hidden variable $y$ is encoded. Following this observation, we construct the following FA model:

$$x = V_r y + \mu + e, \quad V_r = [U_r, A_{m-r}], \quad y \sim G(y \mid 0, \Sigma^r_y), \quad \Sigma^r_y = \mathrm{diag}[\nu_1^{-1},\dots,\nu_r^{-1},1,\dots,1], \tag{3.2}$$

where $\Sigma^r_y$ is the covariance matrix of $y$ with $m-r$ constant 1s on the diagonal. Moreover, we have $U_r \in R^{n\times r}$, $U_r^T U_r = I_r$, $A_{m-r} \in R^{n\times(m-r)}$, and $m$ is the initial value of the hidden dimensionality. The integer $r$ denotes the number of free parameters in $\Sigma^r_y$, with $0 \le r \le m$. We denote this type of parameterization as FA-r, where the noise covariance is the same as in FA-a and FA-b. For any $r \in [0,m]$, FA-r is ML-equivalent to FA-a, and $r$ indicates to what extent FA-r is similar to FA-b. In particular, FA-r becomes FA-a when r = 0, and becomes FA-b when r = m.

| | TYPE-A | TYPE-B |
|---|---|---|
| Parameters | FA-a: $\Theta^a_m = \{A, \mu, \Sigma_e\}$ | FA-b: $\Theta^b_m = \{U, \mu, \Lambda, \Sigma_e\}$ |
| $E[ye^T]$ | 0 (y and e uncorrelated) | 0 (y and e uncorrelated) |
| $q(y \mid \Theta)$ | $G(y \mid 0, I_m)$ | $G(y \mid 0, \Lambda)$, $\Lambda = \mathrm{diag}[\lambda_1,\dots,\lambda_m]$ |
| $A$ | any full column rank matrix | $A = U$, $U^T U = I_m$ |

Table 3.1: Two probabilistic parameterizations of FA, namely FA-a and FA-b, are given in the second and third columns respectively, where $E[\cdot]$ denotes the expectation, $G(\cdot \mid \mu, \Sigma)$ denotes a Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, and $\mathrm{diag}[\lambda_1,\dots,\lambda_m]$ is a diagonal matrix with $\lambda_1,\dots,\lambda_m$ as its diagonal elements. $I_m$ is an $m \times m$ identity matrix. $\Sigma_e$ is a diagonal positive definite matrix.
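One way to see the ML-equivalence claimed for FA-r is to compare the marginal covariance of $x$ under eq.(3.2) with that under FA-a. The following is a short sketch rather than the derivation used in the thesis; the symbol $\tilde A$ is introduced here only for illustration.

```latex
\begin{aligned}
\operatorname{Cov}(x) &= V_r \Sigma_y^r V_r^T + \Sigma_e
  = U_r\,\mathrm{diag}[\nu_1^{-1},\dots,\nu_r^{-1}]\,U_r^T + A_{m-r}A_{m-r}^T + \Sigma_e \\
 &= \tilde A \tilde A^T + \Sigma_e,
 \qquad \tilde A = \big[\,U_r\,\mathrm{diag}[\nu_1^{-1/2},\dots,\nu_r^{-1/2}],\; A_{m-r}\,\big].
\end{aligned}
```

Since $x$ is Gaussian with the same mean in both cases, FA-r with $(U_r, \nu, A_{m-r})$ assigns the same likelihood as FA-a with loading matrix $\tilde A$, provided that $\tilde A$ has full column rank.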

Stage I: Enumerate each candidate model scale $m \in \mathcal{M}$:

(a.1: VBE) $p^{(\tau+1)}(Y) = \arg\max_{p(Y)} \mathcal{F}(p^{(\tau)}(\Theta), p(Y), m, \Xi^{(\tau)})$,

(a.2: VBM) $p^{(\tau+1)}(\Theta) = \arg\max_{p(\Theta)} \mathcal{F}(p(\Theta), p^{(\tau+1)}(Y), m, \Xi^{(\tau)})$,

(b) $\Xi^{(\tau+1)} = \arg\max_{\Xi} \mathcal{F}(p^{(\tau+1)}(\Theta), p^{(\tau+1)}(Y), m, \Xi)$.

Stage II: Model selection: $\hat m = \arg\max_{m \in \mathcal{M}} \mathcal{F}(p^{(\tau^o)}(\Theta), p^{(\tau^o)}(Y), m, \Xi^{(\tau^o)})$.

Table 3.2: The two-stage procedure of VB learning for a model selection problem consists of repeating a VBEM algorithm to maximize $\mathcal{F}$ and a discrete maximization to select an appropriate model scale, where $\tau$ is the iteration indicator and $\tau^o$ denotes the number of iterations used to reach convergence (i.e., the objective function value changes only slightly). For a general derivation of VBEM, see Theorem 2.1 in [10].

3.3 Variational Bayes on FA-r

There have been efforts on VB learning for FA-a [13, 38, 78], in which the adopted priors on the parameters of FA-a are listed in the left column of Tab.3.3. For FA-b, we derived a VB learning algorithm in [100] with the priors given in the right column of Tab.3.3. We directly use the existing VB learning algorithms for FA-a, extend them to FA-b with certain modifications, and further extend them into a VB learning algorithm for FA-r, by maximizing the variational lower bound $\mathcal{F}$ given in eq.(1.17) through the two-stage procedure in Tab. 3.2. The derivation details are given in Appendix A.1.

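The algorithm aims at maximizing the following $\mathcal{F}$, which results from putting the details of eq.(3.2) and Tab.3.1 and 3.3 into the variational lower bound $\mathcal{F}$ by eq.(1.17):

$$\mathcal{F} = \int \Big\{ \mathcal{F}_1 + \ln \frac{\prod_{i=1}^{r}\Gamma(\nu_i \mid a^\nu_i, b^\nu_i)\,\prod_{k=1}^{m-r}\big(G(a_k \mid 0, \alpha_k^{-1} I_n)\,\Gamma(\alpha_k \mid a^\alpha_k, b^\alpha_k)\big)\cdot q(U_r)\,\Gamma(\phi \mid a^\phi, b^\phi)}{p_A\, p_U\, p_\nu\, p_\alpha\, p_\phi} \Big\}\, p_\Theta\, p_Y\, d\Theta\, dY, \tag{3.3}$$

where the variational posteriors are $p_Y = p(Y)$ and $p_\Theta = p(\Theta) = p_A p_U p_\alpha p_\nu p_\phi$, and

$$\mathcal{F}_1 = \int \Big\{ \sum_{t=1}^{N} \big[ \ln G(x_t \mid V_r y_t, \phi^{-1} I_n) + \ln G(y_t \mid 0, \mathrm{diag}[\nu_r^{-1}, I_{m-r}]) \big] - \ln p(Y) \Big\}\, p_Y\, dY, \tag{3.4}$$

and $V_r = [U_r, A_{m-r}]$, $q(U_r) = 2^{-r}\prod_i \Gamma((n-i+1)/2)\,\pi^{-(n-i+1)/2}$, $\nu_r = [\nu_1,\dots,\nu_r]$, $A_{m-r} = [a_1,\dots,a_{m-r}]$. For simplicity, we omit the subscripts $r$ and $m-r$ in the rest of this chapter.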

Moreover, $\mathcal{F}$ by eq.(3.3) contains a part that is a function of $\Xi_r$, that is,

$$\mathcal{F} = \mathcal{F}_h(\Xi_r) + \text{others}. \tag{3.5}$$

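| priors for FA-a | priors for FA-b |
|---|---|
| $\Xi_a = \{a^\alpha, b^\alpha, a^\phi, b^\phi\}$ | $\Xi_b = \{a^\nu, b^\nu, a^\phi, b^\phi\}$ |
| $\boldsymbol{\phi} = \phi I_n = \Sigma_e^{-1}$ | $\boldsymbol{\phi} = \phi I_n = \Sigma_e^{-1}$, $\nu = \Lambda^{-1}$ |
| $a_i$: the $i$-th column vector of $A$ | $U^T U = I_m$, $U$ is on the Stiefel manifold |
| $q(A \mid \alpha) = \prod_{i=1}^{m} G(a_i \mid 0, \alpha_i^{-1} I_n)$ | $q(U) = 2^{-m}\prod_i \Gamma((n-i+1)/2)\,\pi^{-(n-i+1)/2}$ |
| $q(\alpha \mid a^\alpha, b^\alpha) = \prod_{i=1}^{m} \Gamma(\alpha_i \mid a^\alpha_i, b^\alpha_i)$ | $q(\nu \mid a^\nu, b^\nu) = \prod_{i=1}^{m} \Gamma(\nu_i \mid a^\nu_i, b^\nu_i)$ |
| $q(\phi \mid a^\phi, b^\phi) = \Gamma(\phi \mid a^\phi, b^\phi)$ | $q(\phi \mid a^\phi, b^\phi) = \Gamma(\phi \mid a^\phi, b^\phi)$ |

Table 3.3: The prior distributions in the left column for FA-a have been used in [13, 38, 78]. The priors in the right column for FA-b have been used in [100]. $\Gamma(z \mid a, b) = b^a z^{a-1} e^{-bz}/\Gamma(a)$ is the Gamma density with shape parameter $a$ and inverse scale parameter $b$, where $\Gamma(a)$ is the Gamma function. $\Xi_a$ and $\Xi_b$ denote the hyperparameters.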
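The Gamma density in the caption of Tab.3.3 is usually evaluated in log form. The following is a minimal sketch of that evaluation using only the Python standard library; the function name is hypothetical and is not taken from the thesis.

```python
import math

def gamma_logpdf(z, a, b):
    """ln Gamma(z | a, b) with shape a and inverse scale (rate) b."""
    if z <= 0 or a <= 0 or b <= 0:
        raise ValueError("z, a and b must be positive")
    return a * math.log(b) + (a - 1.0) * math.log(z) - b * z - math.lgamma(a)

# With a = 1 the density reduces to the exponential density b * exp(-b z).
value = math.exp(gamma_logpdf(0.5, 1.0, 2.0))
reference = 2.0 * math.exp(-2.0 * 0.5)
```

Working with the log density avoids overflow in $\Gamma(a)$ for large shape parameters, and it is the form that enters the log-prior terms of the learning objectives.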

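The $\tau$-th iteration of Stage I(a):

(a.1) Update $p_Y^{(\tau)} = \prod_{t=1}^{N} G(y_t \mid \mu^{(t)}_{y|x}, \Sigma_{y|x})$, based on $p_A^{(\tau-1)}, p_\alpha^{(\tau-1)}, p_\nu^{(\tau-1)}, p_\phi^{(\tau-1)}, \Xi_r^{(\tau-1)}$.

(a.2)

• Update $p_A^{(\tau)} = \prod_{j=1}^{n} G(a_j \mid \mu_{A,j}, \Sigma_{A,j})$, based on $p_Y^{(\tau)}, p_\alpha^{(\tau-1)}, p_\nu^{(\tau-1)}, p_\phi^{(\tau-1)}, \Xi_r^{(\tau-1)}$.

• Update $p_U^{(\tau)} \approx \delta(U - U^*_S)$, $U^*_S = U^*_E \big(U^{*T}_E U^*_E\big)^{-1/2}$, $U^*_E = \Big(\sum_t x_t (\mu^{(t)}_{y|x})^T\Big)\big(E[y_t y_t^T]\big)^{-1}$.

• Update $p_\alpha^{(\tau)} = \prod_{k=1}^{m-r} \Gamma(\alpha_k \mid \hat a^\alpha_k, \hat b^\alpha_k)$, based on $p_A^{(\tau)}, p_\alpha^{(\tau)}, p_\nu^{(\tau-1)}, p_\phi^{(\tau-1)}, \Xi_r^{(\tau-1)}$.

• Update $p_\nu^{(\tau)} = \prod_{i=1}^{r} \Gamma(\nu_i \mid \hat a^\nu_i, \hat b^\nu_i)$, based on $p_A^{(\tau)}, p_\alpha^{(\tau)}, p_\nu^{(\tau)}, p_\phi^{(\tau-1)}, \Xi_r^{(\tau-1)}$.

• Update $p_\phi^{(\tau)} = \Gamma(\phi \mid \hat a^\phi, \hat b^\phi)$, based on $p_A^{(\tau)}, p_\alpha^{(\tau)}, p_\nu^{(\tau)}, p_\phi^{(\tau)}, \Xi_r^{(\tau-1)}$.

The $\tau$-th iteration of Stage I(b):

Update the hyperparameters $\Xi_r$ by a gradient method, $\Xi_r^{new} = \Xi_r^{old} + \eta \cdot \frac{\partial \mathcal{F}_h(\Xi_r)}{\partial \Xi_r}$.

Table 3.4: An outline of the VB algorithm on FA-r, with details in Appendix A.1.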
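The point estimate of $U$ in Tab.3.4 maps an unconstrained matrix $U^*_E$ onto one with orthonormal columns via $U^*_S = U^*_E (U^{*T}_E U^*_E)^{-1/2}$. A minimal NumPy sketch of this single step is given below; it assumes $U^*_E$ has full column rank, and the function name and the random input are illustrative only.

```python
import numpy as np

def project_to_stiefel(U_E):
    """Return U_S = U_E (U_E^T U_E)^{-1/2}, which has orthonormal columns."""
    gram = U_E.T @ U_E                      # r x r, symmetric positive definite
    eigval, eigvec = np.linalg.eigh(gram)   # gram = Q diag(eigval) Q^T
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return U_E @ inv_sqrt

rng = np.random.default_rng(0)
U_E = rng.standard_normal((15, 5))          # hypothetical unconstrained estimate
U_S = project_to_stiefel(U_E)
orthonormality_error = np.abs(U_S.T @ U_S - np.eye(5)).max()
```

The inverse square root is computed through a symmetric eigendecomposition, which is a common choice for small $r \times r$ Gram matrices; the algorithm in Appendix A.1 may compute it differently.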

As listed under Stage I(b) in Tab.3.2, we also maximize $\mathcal{F}$ with respect to the hyperparameters $\Xi_r = \{a^\alpha_k, b^\alpha_k, a^\nu_i, b^\nu_i, a^\phi, b^\phi\}$, which is implemented in the detailed algorithm given in Appendix A with the help of the gradient of $\mathcal{F}_h(\Xi_r)$ with respect to the hyperparameters $\Xi_r$. It follows from eq.(1.18) that such an update of $\Xi_r$ tends to minimize the KL term and thus makes the variational lower bound $\mathcal{F}$ further approach $\ln q(X_N \mid m)$.

Leaving the computational details of the VB algorithm for FA-r in Appendix A.1, we outline the major updates in Tab.3.4 together with the following remarks:

• When r = 0, the VB algorithm on FA-r is equivalent to the one on FA-a [13], where the variational posteriors $p_\nu$ and $p_U$ disappear because $U$ and $\nu$ are empty for r = 0.

• When r = m, the VB algorithm on FA-r becomes the one on FA-b [100], where $p_U$ and $p_\nu$ take over from $p_A$ and $p_\alpha$, with $U$ and $\nu$ taking the place of $A$. As the experiments later show, $p_\phi$ is empirically observed to have different impacts on model selection in FA-a and FA-b, although the two corresponding variational posteriors have similar forms and are computed from the same Gamma prior.

• When 0 < r < m, the VB algorithms on FA-r are variants in addition to those on FA-a and FA-b. On the one hand, if we consider no priors over the parameters $\Theta^r_m$ in FA-r, then the bound $\mathcal{F}$ degenerates to $\mathcal{F}_1$ by eq.(3.4). Maximizing $\mathcal{F}_1$ leads to an Expectation-Maximization (EM) algorithm for FA-r. On the other hand, maximizing $\mathcal{F}_1$ dominates the maximization of $\mathcal{F}$ for a large $r$, especially when the sample size $N$ or the dimensionality $n$ is very large, because we use a point estimate for $p_U$ (see a.2 in Tab.3.4), and thus the contribution of updating $U$ at Stage I of Tab.3.2 to maximizing $\mathcal{F}$ actually comes through maximizing $\mathcal{F}_1$. Denoting the number of free parameters in $\varphi$ by $d(\varphi)$, we have $d(U) = nr - 0.5r(r+1)$ and $d(\Theta^r_m) = nm - 0.5r(r-1) + 1$. It follows that $d(U)/d(\Theta^r_m) \approx r/m$ for a large $n$, and $r/m \approx 1$ for an $r$ close to $m$, so that for a large $n$ the learning on $U$ plays the main role in maximizing $\mathcal{F}$. This degeneracy makes the VB algorithm of maximizing $\mathcal{F}$ fall back towards the EM algorithm, deteriorating the model selection performance and also reducing the performance differences of FA-r for different $r$. Still, maximizing $\mathcal{F}$ (for r > 0) yields better performance than the algorithm in [13, 78] for FA-a, as will be shown later. Moreover, a further improvement in model selection by $\mathcal{F}$ is possible by finding a better prior on $U$.
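The parameter counts quoted in the last remark can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names; it assembles $d(\Theta^r_m)$ from its parts ($U_r$, $\nu_r$, $A_{m-r}$ and the scalar noise term) and evaluates the ratio $d(U)/d(\Theta^r_m)$ at r = m for a small and a large n.

```python
def free_params_U(n, r):
    """d(U) = n r - r (r + 1) / 2 for an n x r matrix with orthonormal columns."""
    return n * r - r * (r + 1) // 2

def free_params_far(n, m, r):
    """d(Theta_m^r) = n m - r (r - 1) / 2 + 1 for FA-r."""
    return n * m - r * (r - 1) // 2 + 1

def free_params_by_parts(n, m, r):
    """Same count assembled from U_r, nu_r, A_{m-r} and the scalar noise term."""
    return free_params_U(n, r) + r + n * (m - r) + 1

n, m = 15, 5
counts = {r: free_params_far(n, m, r) for r in range(m + 1)}
ratio_small_n = free_params_U(15, 5) / free_params_far(15, 5, 5)
ratio_large_n = free_params_U(1500, 5) / free_params_far(1500, 5, 5)
```

For n = 15 and m = 5 the count at r = m equals $nm + 1 - m(m-1)/2 = 66$, the value of $d_m$ used for FA-b in Sect. 3.2, and the ratio at r = m moves towards 1 as n grows.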

3.4 Bayesian Ying-Yang Harmony Learning on FA-r

In implementation, we maximize $H(p\|q)$ by a two-stage procedure as shown in Tab.3.5, which is briefly summarized from Eqs.(1.21), (1.22) and (1.23).

Stage I: Enumerate candidate models by $m$ and, for each candidate, iterate the following (a) and (b) until convergence:

(a) $\Theta^{(\tau)} = \arg\max/\mathrm{incr}_{\Theta}\, H(p\|q, \Theta, m, \Xi^{(\tau-1)})$

(b) $\Xi^{(\tau)} = \arg\max/\mathrm{incr}_{\Xi}\, \big[ H(p\|q, \Theta^{(\tau)}, m, \Xi) + \tfrac{1}{2} d_m(\Xi) + H_b(m, \Xi) \big]$

Stage II: Select the best $\hat m$:

$\hat m = \arg\min_m \big[ -H(p\|q, \Theta^{(\tau^*)}, m, \Xi^{(\tau^*)}) + \tfrac{1}{2} n_f(\Theta_m) - H_b(m, \Xi) \big]$, where $\tau^*$ is the value of the iteration indicator $\tau$ when Stage I has converged.

Table 3.5: The general two-stage iterative BYY harmony learning procedure, restated from Fig.6(a) in [126] (also see Eqs.(6)&(7) in [128] and Fig.5(b) in [130]), where $n_f(\Theta)$ is the number of free parameters in $\Theta$, and $d_m(\Xi)$ is given in eq.(1.27). The “incr” means “to increase”.

Moreover, BYY harmony learning is also featured by its favorable nature that model selection is made automatically during the implementation of merely Stage I. E.g., for FA-b in Tab.3.1, implementing either Stage I(a) alone or both Stage I(a) and I(b) will drive some $\lambda_j$ to zero when the $j$-th dimension of $y_t$ is extra. Thus, automatic model selection can be made by checking $\lambda_j \to 0$ and discarding the $j$-th dimension.

This chapter mainly focuses on a detailed comparative study with the VB learning in Tab.3.2 by the conventional two-stage procedure, without making automatic model selection via checking $\lambda_j \to 0$. We also provide a simple comparative investigation of the automatic model selection performances of BYY and VB. For further details about automatic model selection, readers are referred to Sect.2.1 and Sect.3.2 in [130], and to Sect.2.2 in [131] for further improvements via exploring a co-dimensional matrix pair nature (where an improved model selection criterion is also given, e.g., by eq.(29) in [131]).
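The automatic model selection step, discarding the dimensions whose $\lambda_j$ has been driven to zero, can be illustrated as follows. This is a sketch under stated assumptions: the relative tolerance `tol` and the function name are hypothetical, and the source does not specify how "$\lambda_j \to 0$" is tested numerically.

```python
def prune_hidden_dimensions(lambdas, tol=1e-3):
    """Keep the hidden dimensions whose variance lambda_j has not collapsed.

    lambdas -- learned diagonal of Lambda after Stage I (one entry per dimension)
    tol     -- hypothetical relative threshold used to decide "lambda_j -> 0"
    """
    largest = max(lambdas)
    kept = [j for j, lam in enumerate(lambdas) if lam > tol * largest]
    return kept, len(kept)

# Made-up values: two of the six dimensions have been driven towards zero.
learned = [1.9, 1.4, 1.1, 0.8, 2e-6, 5e-7]
kept_indices, m_hat = prune_hidden_dimensions(learned)
```

The number of surviving dimensions then serves as the estimate $\hat m$, without enumerating candidate values of $m$ as in the two-stage procedure.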

Figure 3.1: BYY system in the general form and specific structures for FA

Specifically, we consider the FA-r model by eq.(3.2) with independent and identically distributed (i.i.d.) samples in $X_N = \{x_t\}_{t=1}^{N}$, from which we have

$$q(X \mid Y, \Theta) = \prod_t q(x_t \mid y_t, \Theta), \quad q(Y \mid \Theta) = \prod_t q(y_t \mid \Theta), \quad p(Y \mid X, \Theta) = \prod_t p(y_t \mid x_t, \Theta). \tag{3.6}$$

In the sequel, we develop the learning procedure in Tab.3.5 into a gradient based BYY learning algorithm on FA-r. Leaving the computational details of this algorithm in Appendix A.2, here we introduce its key points in an outline in Tab.3.6.

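At the $\tau$-th step of the implementation, putting eq.(1.27) into eq.(1.24) and obtaining $\Theta_r^{(\tau)}$ of the FA-r model by eq.(3.2), we have

$$H(p\|q) \approx H(p\|q, \Theta, m, \Xi) + \tfrac{1}{2} d_m(\Xi_r) + H_b, \tag{3.7}$$

$$d_m(\Xi_r) = -n_f(\Theta_r) + \Delta_{\Theta_r}^T\, \Omega(\Theta_r^{(\tau)}, \Xi_r)\, \Delta_{\Theta_r}, \quad \Delta_{\Theta_r} = \Theta_r^{(\tau-1)} - \Theta_r^{(\tau)}, \tag{3.8}$$

from which we get Stage I(b) in Tab.3.6 for updating the hyperparameters $\Xi$ at the $\tau$-th step.

Objective: maximize the harmony functional $H(p\|q) \approx H_1 + d_r(\tilde W) + \ln q(\Theta^a \mid \Xi) + H_b + \tfrac{1}{2} d_m(\Xi)$, with

$H_1 = -\frac{N(n+m)}{2}\ln(2\pi) - \frac{Nm}{2} - \frac{1}{2}\mathrm{Tr}\big[S_N (V \cdot \mathrm{diag}[\nu^{-1}, I_{m-r}] \cdot V^T + \phi^{-1} I_n)^{-1}\big] + \frac{N}{2}\ln(|\nu_r| \cdot |\phi I_n|)$.

The last four terms are given by eq.(3.11), eq.(3.12), eq.(3.13), and eq.(3.7).

The $\tau$-th iteration of Stage I(a): Gradient method to update the parameters

$\theta^{(\tau)} = \theta^{(\tau-1)} + \eta \cdot \partial\theta$, $\partial\theta = \partial^{H}_\theta = \frac{\partial H(p\|q)}{\partial\theta}\big|_{\theta=\theta^{\tau}}$, $\forall\theta \in \{U, A, \nu, \phi\}$, where $\eta$ is a step size.

$\partial\theta = \partial^{H_1}_\theta + \partial^{d_r}_\theta + \partial^{q}_\theta + \partial^{H_b}_\theta + \partial^{d_m}_\theta$, from the 5 terms of $H(p\|q)$.

The $\tau$-th iteration of Stage I(b): Gradient method to update the hyperparameters

Hessian matrix $\Omega(\Theta^{(\tau)}, \Xi)$ (approximated as block-diagonal);

$\xi^{(\tau)} = \xi^{(\tau-1)} + \eta \cdot \frac{\partial H(p\|q)}{\partial\xi}\big|_{\xi^{(\tau-1)}, \theta^{(\tau)}}$, $\forall\xi \in \{a^\alpha_k, b^\alpha_k, a^\nu_i, b^\nu_i, a^\phi\}$, where $\eta$ is a step size.

Table 3.6: A sketch of the gradient implementation of the BYY learning algorithm on FA-r. For all computational details, see Appendix A.2.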

Further putting eq.(3.6) and the priors given in Tab.3.3 into eq.(1.24) and eq.(1.26), we have

$$H(p\|q, \Theta, m, \Xi) = \sum_t \ln G(x_t \mid 0, \Sigma_x) - \frac{N}{2}\ln\big[(2\pi e)^m |\Sigma_{y|x}|\big] + d_r(\tilde W) + \ln q(\Theta^a \mid \Xi), \tag{3.9}$$

$$\Sigma_x = V \Sigma^r_y V^T + \phi^{-1} I_n, \quad \Sigma^r_y = \mathrm{diag}[\nu^{-1}, I_{m-r}], \quad \Sigma_{y|x} = \big[(\Sigma^r_y)^{-1} + V^T(\phi I_n)V\big]^{-1}, \tag{3.10}$$

$$d_r(\tilde W) = -\tfrac{1}{2}\mathrm{Tr}(\Delta_W^T\, \Sigma_{y|x}^{-1}\, \Delta_W\, S_N), \quad S_N = \sum_t x_t x_t^T, \quad \Delta_W = \tilde W - W, \quad W = \Sigma^r_y V^T \Sigma_x^{-1}. \tag{3.11}$$

Again, the above $H(p\|q, \Theta, m, \Xi)$ shares a format similar to eq.(1.14). The term $d_r(\tilde W)$ vanishes when the algorithm converges, and plays a regularization role during learning that helps to avoid getting stuck at local optima. The previous studies of BYY learning for FA-a in [48] and for FA-b in [88] did not consider the prior term $\ln q(\Theta^a \mid \Xi)$, except for a preliminary study made in [100]. In contrast, a role similar to conventional Bayesian regularization is taken by the (log) prior term in eq.(3.9), with the following details:

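$$\begin{aligned}
\ln q(\Theta^a \mid \Xi) &= \ln\Big\{ q(U) \cdot \prod_{i=1}^{r} \Gamma(\nu_i \mid a^\nu_i, b^\nu_i) \cdot \Gamma(\phi \mid a^\phi, b^\phi) \Big\} \\
&= -r\ln 2 + \sum_{i=1}^{r}\Big[ \ln\Gamma\Big(\frac{n-i+1}{2}\Big) - \frac{n-i+1}{2}\ln\pi \Big] \\
&\quad + \sum_{i=1}^{r}\big\{ (a^\nu_i - 1)\ln\nu_i - b^\nu_i \nu_i + a^\nu_i \ln b^\nu_i - \ln\Gamma(a^\nu_i) \big\} \\
&\quad + (a^\phi - 1)\ln\phi - b^\phi \phi + a^\phi \ln b^\phi - \ln\Gamma(a^\phi),
\end{aligned} \tag{3.12}$$

$$\begin{aligned}
H_b(m, \Xi) &= \int p(\alpha \mid A, \phi, X_N)\, \ln[q(A \mid \alpha)\, q(\alpha)]\, d\alpha \\
&= \sum_{k=1}^{m-r}\Big\{ (\hat a^\alpha_k - 1)\big[\psi(\hat a^\alpha_k) - \ln\hat b^\alpha_k\big] - \hat a^\alpha_k + a^\alpha_k \ln b^\alpha_k - \ln\Gamma(a^\alpha_k) \Big\} - \frac{n(m-r)}{2}\ln(2\pi),
\end{aligned} \tag{3.13}$$

where $p(\alpha \mid A, \phi, X_N) = \prod_{k=1}^{m-r} \Gamma(\alpha_k \mid \hat a^\alpha_k, \hat b^\alpha_k)$ with $\hat a^\alpha_k = a^\alpha_k + \frac{n}{2}$ and $\hat b^\alpha_k = b^\alpha_k + \frac{a_k^T a_k}{2}$, $I_n$ denotes the $n \times n$ identity matrix, and $\psi(\cdot)$ is the digamma function.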
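The term $H_1$ listed in Tab.3.6 should coincide with the first two terms of eq.(3.9), since $|\Sigma_x| \cdot |\Sigma_{y|x}| = \phi^{-n} |\Sigma^r_y|$. The NumPy sketch below checks this numerically for random parameters. It assumes that the entropy term in eq.(3.9) carries the factor $N/2$, which is what makes the two expressions agree; the sizes and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r, N = 6, 3, 2, 40                    # small hypothetical sizes

# Random FA-r parameters: V = [U_r, A_{m-r}], precisions nu and phi.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = rng.standard_normal((n, m - r))
V = np.hstack([U, A])
nu = rng.uniform(0.5, 2.0, size=r)
phi = 1.7
X = rng.standard_normal((N, n))             # arbitrary data; identity holds for any X
S_N = X.T @ X                               # S_N = sum_t x_t x_t^T

Sigma_y = np.diag(np.concatenate([1.0 / nu, np.ones(m - r)]))
Sigma_x = V @ Sigma_y @ V.T + np.eye(n) / phi
Sigma_yx = np.linalg.inv(np.linalg.inv(Sigma_y) + phi * V.T @ V)

# First two terms of eq.(3.9), with the N/2 factor assumed on the entropy term.
_, logdet_x = np.linalg.slogdet(Sigma_x)
_, logdet_yx = np.linalg.slogdet(Sigma_yx)
trace_term = np.trace(S_N @ np.linalg.inv(Sigma_x))
sum_log_gauss = -0.5 * N * n * np.log(2 * np.pi) - 0.5 * N * logdet_x - 0.5 * trace_term
entropy_term = -0.5 * N * (m * np.log(2 * np.pi * np.e) + logdet_yx)
lhs = sum_log_gauss + entropy_term

# H_1 as written in Tab.3.6; ln(|nu_r| |phi I_n|) = sum(ln nu_i) + n ln phi.
H1 = (-0.5 * N * (n + m) * np.log(2 * np.pi) - 0.5 * N * m
      - 0.5 * trace_term
      + 0.5 * N * (np.sum(np.log(nu)) + n * np.log(phi)))
```

The two quantities `lhs` and `H1` agree up to floating-point error, which is a useful sanity check before differentiating $H_1$ with respect to $U$, $A$, $\nu$ and $\phi$.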

Putting the above obtained H(pq,Θ,m,Ξ) into Tab.3.6, we can derive the detailed equations for gradients and Hessian matrices (with respect to each part of unknown parameters), from which we obtained the BYY learning algorithm for FA-r given in Appendix A.2 together with the following remarks:

• When r = 0, Tab.3.6 implements BYY harmony learning on FA-a, where the terms ln|ν|, lnq(U), and lnq(ν) in H(pq) disappear.

• When r = m, Tab.3.6 implements BYY harmony learning on FA-b, where the term $H_b$ given in eq.(3.13) disappears, and maximizing the term $\ln|\nu|$ pushes $1/\nu_i \to 0$ if the $i$-th hidden dimension is extra.

• When 0 < r < m, Tab.3.6 provides variants of BYY learning algorithms on FA between FA-a and FA-b.

Last but not least, the algorithm in Tab.3.6 is derived by analytically removing the integral over $y$ and then making gradient based updates. Alternatively, maximizing the harmony functional can also be implemented by a Ying-Yang alternation procedure (see e.g., Figure 8 of [130]), in which the Yang step finds the peak value $y^*$ and removes the integral over $y$ around this $y^*$, while the Ying step updates all the unknown parameters. Readers are referred to Sect. 4.3 in [130] for more details.

3.5 Empirical Analysis

3.5.1 Three levels of investigations

This empirical analysis has the following purposes:

• Examining whether FA-b is better than FA-a for making model selection, via BYY, VB, AIC, BIC, and DNLL;

• Examining the joint effects of two parameterizations and the role of priors on the performances of model selection;

• Comparing the performances of BYY, VB, AIC, BIC, and DNLL.

Towards these purposes, we conduct investigations at three different levels, as shown in Tab.3.7. The criteria AIC, BIC, and DNLL make no distinction between FA-b and FA-a in terms of model selection. Without a prior $q(\Theta^a \mid \Xi)$ (i.e., Level 1 in Tab.3.7), VB degenerates to ML and thus also makes no distinction between FA-b and FA-a. In this case, only BYY is capable of model selection, and it has different performances on FA-a and FA-b. To enable VB to make model selection, we take a prior $q(\Theta^a \mid \Xi)$ into consideration to compare the performances of both BYY and VB. Since $q(\Theta^a \mid \Xi)$ depends on the hyperparameters $\Xi$, it is natural to consider the cases with $\Xi$ fixed (i.e., Level 2 in Tab.3.7) and the cases with $\Xi$ optimized (i.e., Level 3 in Tab.3.7) via maximizing the variational lower bound $\mathcal{F}$ by VB and $H(p\|q)$ by BYY.

| | Level 1: $\ln q(\Theta \mid \Xi) = 0$ | Level 2: $q(\Theta \mid \Xi)$ with $\Xi$ fixed | Level 3: $q(\Theta \mid \Xi)$ with $\Xi$ optimized |
|---|---|---|---|
| VB in Tab.3.2 | Update $p_Y$ and $\Theta = \arg\max_\Theta \mathcal{F}_1$ instead of $\{p_A, p_U, p_\alpha, p_\nu, p_\phi\}$ | without Stage I(b) | all the steps |
| BYY in Tab.3.5 | Fix $\partial^{q}_\theta = \partial^{H_b}_\theta = \partial^{d_m}_\theta = 0$ and do not update $\Xi_r = \{a^\alpha_k, b^\alpha_k, a^\nu_i, b^\nu_i, a^\phi, b^\phi\}$ | without Stage I(b) | all the steps |

Table 3.7: Three levels of investigations

For simplicity and clarity, we use the notations VB(r,l) and BYY(r,l) to indicate the two-stage procedure by VB and BYY, respectively, for different values of r for FA-r and for different levels l, e.g., VB(r,1), VB(r,2), VB(r,3) versus BYY(r,1), BYY(r,2), BYY(r,3). Also, on FA-a and FA-b we have VB(a,1), VB(b,1) (i.e., VB(0,1), VB(m,1)) versus BYY(a,1), BYY(b,1) (i.e., BYY(0,1), BYY(m,1)).

We adopt the empirical analysis method presented in [101] to evaluate the three levels of implementations of VB and BYY for FA-a and FA-b, with the performances of AIC, BIC and DNLL included for comparison.

| features $f$ | candidate values $V(f)$ |
|---|---|
| sample size $N$ | $V(N)$: 25, 50, 75, 100, 200, 400, 800 |
| SNR: $\gamma_o = \lambda_{m^*}/\sigma_e^2 + 1$ | $V(\gamma_o)$: 1.2, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16 |
| dim: $\{n, m^*\}$ | $V(n, m^*)$: {15,5}, {30,10} |

Table 3.8: The candidate values of each feature are listed, where $V(f)$ is the set of the candidate values of the feature $f$. All possible combinations constitute the settings $\mathcal{S}(N, \gamma_o, n, m^*)$ used in the empirical analysis. We set $\lambda_1 = \dots = \lambda_{m^*} = 1$. For two-phase procedures, we set the candidate set of hidden dimensionalities as $\mathcal{M} = \{1,\dots,9\}$ or $\{1,\dots,15\}$ for $m^* = 5, 10$ respectively, unless otherwise specified.

The simulated data sets are randomly generated according to FA-b (or FA-a) in Tab.3.1. A setting $\mathcal{S}(N, \gamma_o, n, m^*)$ for FA-b is determined by choosing values from the candidate sets of the sample size $N$, the signal-to-noise ratio (SNR) $\gamma_o$, the dimensionality of the observed variable $n = \dim(x)$, and the dimensionality of the latent variable $m^* = \dim(y)$, where the SNR is defined as the ratio of the $m^*$-th largest eigenvalue of the population covariance matrix $U \Lambda U^T + \sigma_e^2 I_n$ to the noise variance $\sigma_e^2$, i.e., $\gamma_o = (\lambda_{m^*} + \sigma_e^2)/\sigma_e^2$.

Listed in Tab.3.8 are the choices of $\mathcal{S}(N, \gamma_o, n, m^*)$ considered in this chapter. For example, $\mathcal{S}(50, 3.0, 15, 5)$ means that a training data set $X_N = \{x_t\}_{t=1}^{N}$ is randomly generated according to FA-b with $N = 50$, $\gamma_o = 3.0$, $n = 15$ and $m^* = 5$.
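The grid of settings in Tab.3.8 and the noise variance implied by each SNR can be enumerated as follows. This is an illustrative sketch: the names are hypothetical, and it assumes $\lambda_{m^*} = 1$ as stated in the caption of Tab.3.8, so that $\sigma_e^2 = 1/(\gamma_o - 1)$.

```python
from itertools import product

SAMPLE_SIZES = [25, 50, 75, 100, 200, 400, 800]
SNRS = [1.2, 1.5, 2, 2.5, 3, 3.5, 4, 8, 16]
DIMS = [(15, 5), (30, 10)]                  # (n, m*)

def noise_variance(snr, lambda_m=1.0):
    """Solve gamma_o = (lambda_m + sigma_e^2) / sigma_e^2 for sigma_e^2."""
    if snr <= 1.0:
        raise ValueError("the SNR as defined here must exceed 1")
    return lambda_m / (snr - 1.0)

settings = [
    {"N": N, "snr": snr, "n": n, "m_true": m_true, "sigma2_e": noise_variance(snr)}
    for N, snr, (n, m_true) in product(SAMPLE_SIZES, SNRS, DIMS)
]
```

For instance, $\gamma_o = 3.0$ corresponds to $\sigma_e^2 = 0.5$, and the full grid contains 7 × 9 × 2 = 126 settings.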

3.5.2 FA-a vs FA-b: performances of BYY, VB, AIC, BIC, and DNLL

Each of BYY, VB, AIC, BIC, and DNLL is implemented for $10^3$ trials on each of the settings $\mathcal{S}(:,:,15,5) = \{\mathcal{S}(N, \gamma_o, 15, 5) : \forall N \in V(N), \gamma_o \in V(\gamma_o)\}$ with different sample sizes and SNRs chosen from Tab.3.8. The model selection accuracies are reported in Fig.3.2-3.4 through the contour maps suggested in [101] for illustrating the joint effect of $N$ and $\gamma_o$ on the performance. Readers are referred to [101] for the characteristics (e.g., a three-region partition phenomenon) of the contour maps for describing model selection accuracies, as well as a systematic comparison of BYY(b,1) with several classical criteria and recently developed model selection methods.

Here we summarize our observations on Fig.3.2-3.4 as follows:

1) As shown by Fig.3.3, VB performs better on FA-b than on FA-a. VB(a,1) and VB(b,1) actually implement the maximum likelihood principle, which is not good¹ for model selection under a finite sample size. For a relatively small $N$, FA-b is obviously superior to FA-a under VB. As $N$ grows large, the difference between VB(b,2) and VB(a,2) becomes less obvious, because a large $N$ brings $\mathcal{F}$ (for r = m) in eq.(3.3) close to $\mathcal{F}_1$, and thus VB(b,2) approaches VB(b,1). Analogously, VB(a,2) approaches VB(a,1). This tendency towards maximum likelihood gradually reduces the gain obtained from using FA-b in place of FA-a.

Due to the approximation $p_U^{(\tau)} \approx \delta(U - U^*_S)$ in Tab.3.4, VB(b,3) with optimized hyperparameters becomes even closer to maximum likelihood than VB(b,2) does, while VB(a,3) does not become inferior to VB(a,2), with the help of the variational posterior over the loading matrix $A$ [10, 78]. As a

¹ Notice that the figures of VB(a,1) and VB(b,1) are “blank” (i.e., zero rates of successful selection). VB(a,1) and VB(b,1) are not capable of model selection for a finite sample size, because they both implement the maximum likelihood $L(\hat\Theta_m)$, which increases as $m$ grows. Therefore, the estimate $\hat m = \arg\max_{m \in \mathcal{M}} L(\hat\Theta_m)$ tends towards overestimation. An alternative criterion is DNLL, given by eq.(2.7) and shown in Fig.3.2(c), which finds the maximum increment in the likelihood function. In the figures of VB(a,2) and VB(a,3), the accuracies are zero when SNR < 1.5 and N ≤ 800. As $N$ grows further, the rates will increase.

[Figure 3.2 panels: “{n=15,m*=5}: S-Selection by [AIC]”, “{n=15,m*=5}: S-Selection by [BIC]”, “{n=15,m*=5}: S-Selection by [DNLL]”. One axis is “Sample Size N (adjusted)”; the other carries the SNR values 1.2, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 8, 16.]

Figure 3.2: The successful-selection (S-selection) rates on $\mathcal{S}(:,:,15,5)$ are presented in terms of contour maps. The axes are adjusted by equally spacing the elements in $V(N)$ and $V(\gamma_o)$. A red asterisk (∗) at the coordinate $(N, \gamma_o)$ indicates that the corresponding criterion gets the highest successful selection rate on $\mathcal{S}(N, \gamma_o, 15, 5)$ among AIC, BIC, DNLL and all implementations of VB and BYY. Briefly speaking, the contour lines close to the bottom-left corner mean being robust to the
