Main Contribution of the Thesis - Learning Non-gaussian Factor Analysis with Different Structures

In this dissertation, we investigate NFA with the different structures of Eq.(1.5) under the BYY learning theory, from two-stage model selection to automatic model selection. Several novel algorithms have been developed, and comparisons have been made with other related model selection methods. The scope of this thesis is briefly summarized in Fig. 1.2.

We begin by proposing an empirical analysis tool for systematically comparing the relative strengths and weaknesses of model selection methods, based on the problem of determining the number of factors in FA, which is a degenerate case of NFA, in box (b) of Fig. 1.2. Specifically,

• we examine the joint effect of sample size N and signal-to-noise ratio (SNR) by varying both, rather than the effect of either one with the other fixed, as is usually done in the literature. The indifference curves, defined by the contour lines of model selection accuracies, visually reveal that the relative advantages of the methods show up most clearly within a region of moderate N and SNR. Moreover, the importance of studying this region is also confirmed by an alternative reference criterion that maximizes the testing likelihood.

CHAPTER 1. INTRODUCTION 19

Figure 1.2: The roadmap of the thesis

• we also provide a theoretical comparison among AIC, CAIC, HQC and BIC by building a partial order of their relative underestimation tendencies. The order is shown to be AIC, HQC, BIC, and CAIC, from the smallest underestimation probability to the largest.
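As a rough intuition for this order (not the thesis's own derivation), one can compare the penalty each criterion charges per free parameter. The sketch below assumes the textbook forms on the -2 log-likelihood scale: 2 for AIC, 2 ln ln N for HQC, ln N for BIC, and ln N + 1 for CAIC. The function names are hypothetical. A larger penalty favors smaller models, which is in line with a stronger underestimation tendency.

```python
import math

def penalty_per_parameter(n):
    """Penalty per free parameter on the -2*log-likelihood scale.

    Assumes the textbook definitions of the four criteria; the thesis
    may use a different but equivalent scaling.
    """
    return {
        "AIC": 2.0,
        "HQC": 2.0 * math.log(math.log(n)),
        "BIC": math.log(n),
        "CAIC": math.log(n) + 1.0,
    }

def order_by_penalty(n):
    """Criterion names sorted from the lightest to the heaviest penalty."""
    penalties = penalty_per_parameter(n)
    return sorted(penalties, key=penalties.get)

# Print the ordering for a few sample sizes.
for n in (20, 100, 1000):
    print(n, order_by_penalty(n))
```

Under these assumed forms, HQC exceeds AIC only when ln ln N > 1, that is, for N larger than about 15, so the ordering AIC, HQC, BIC, CAIC holds for all but very small samples.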

We further examine how parameterizations affect model selection performance, based on FA-a and FA-b. We combine FA-a and FA-b into a family of FA parameterizations that have equivalent likelihood functions. Each instance in this family is characterized by an integer r and thus denoted by FA-r for short, with FA-a at one end (r = 0) and FA-b at the other end, where r reaches its upper bound m. Between the two ends, FA-r is a mixture of an FA-b with r hidden factors and an FA-a with m−r hidden factors, with r indicating the number of free parameters in the diagonal covariance matrix of the hidden variables. With a Bayesian formulation of FA-r, alternative VB algorithms are derived, and the BYY algorithms on FA are extended to be equipped with priors on the parameters.

Several empirical findings have been obtained via extensive experiments.

• First, both BYY and VB perform better on FA-b than on FA-a. Specifically, both BYY and VB reach their best performance on one parameterization, FA-m^∗, with m^∗ being the correct number of hidden factors. This provides a correct calibration, though m^∗ is unknown. On one hand, the performance on FA-r drops sharply as r decreases from m^∗ towards FA-a, which means that the contribution of FA-a is negative. On the other hand, the performance on FA-r decreases only slightly and slowly as r increases towards FA-b. Moreover, we make a comparison on FA-b with its initial dimension set at r and find a performance similar to that on FA-b. Therefore, FA-b is considerably and reliably superior to FA-a.

• Second, both BYY and VB outperform AIC, BIC, and DNLL, while BYY further outperforms VB, especially on FA-b. Moreover, when FA-a is replaced by FA-b, the gain obtained by BYY is clearly higher than that obtained by VB, while VB in turn gains where AIC, BIC, and DNLL gain nothing, especially for a finite sample size.

• Third, we also provide a systematic investigation of how each part of the priors contributes to the model selection performance. We find that although the performance of either VB or BYY can be improved with the help of appropriate priors, BYY does not depend heavily on the presence of the priors whereas VB does. Moreover, optimizing the hyper-parameters of the priors by BYY further improves the performance, while using VB for this purpose actually degrades it.

To explore latent binary structures of data, we consider BFA in box (c) of Fig. 1.2, from the perspective of three levels of inverse problems [129], i.e., inverse inference from observation to inner representation, parameter learning, and model selection. Maximizing the BYY harmony functional turns the first level into a Binary Quadratic Programming (BQP) problem. We consider four BQP methods. One is the exact BQP solver by enumeration (denoted as enum for short). The other three are approximate methods, i.e., the greedy method in [69], the cdual method derived from the canonical duality theory [29], and the round method, which relaxes the binary y to a continuous one and rounds the optimal solution back to a binary one [121]. Their BQP optimization performances are ranked as: round < cdual < greedy < enum.
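To make the first-level inverse problem concrete, the following minimal sketch contrasts exact enumeration with a simple bit-flip local search. It assumes a simplified objective, the squared error ||x - A y||^2 over binary y, and ignores any prior or noise-weighting terms of the BYY harmony functional. The function names and toy data are hypothetical, and the local search is a generic stand-in, not necessarily the greedy method of [69].

```python
from itertools import product

def residual(A, x, y):
    """Squared error ||x - A y||^2 for a binary code y (A is a list of rows)."""
    total = 0.0
    for row, xi in zip(A, x):
        pred = sum(a * yj for a, yj in zip(row, y))
        total += (xi - pred) ** 2
    return total

def solve_enum(A, x):
    """Exact solver: evaluate all 2^m binary vectors and keep the best."""
    m = len(A[0])
    return min(product((0, 1), repeat=m), key=lambda y: residual(A, x, y))

def solve_greedy_flip(A, x):
    """Local search: from all zeros, apply the single bit flip that lowers
    the cost the most, and stop when no flip improves it."""
    m = len(A[0])
    y = [0] * m
    best = residual(A, x, y)
    while True:
        trials = []
        for j in range(m):
            y[j] ^= 1                      # try flipping bit j
            trials.append((residual(A, x, y), j))
            y[j] ^= 1                      # undo the trial flip
        cost, j = min(trials)
        if cost >= best:
            return tuple(y)
        y[j] ^= 1                          # commit the best flip
        best = cost

# Toy data (assumed): x is a noisy copy of the first column of A.
A = [[1, 0], [0, 1], [1, 1]]
x = [1.1, -0.1, 0.9]
print(solve_enum(A, x), solve_greedy_flip(A, x))
```

Enumeration costs 2^m objective evaluations per sample, which is why approximate solvers matter once m grows. The local search can stop in a local optimum, so its answer need not match the enumerated one in general.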

• Extensive experiments show that cdual and round are fast and more effective in discarding extra factors, and lead to much better model selection performance than greedy and enum. Thus, some amount of error in BQP actually provides a helpful learning regularization, with gains in both computational efficiency and model selection performance.

• Moreover, automatic model selection is adopted to save computation relative to the two-phase implementation, by starting from a large enough m and then discarding redundant binary factors during parameter learning. Under BYY, we incorporate prior distributions over the parameters into BFA learning, which play a role similar to Bayesian regularization. With the help of priors, enum and greedy improve in automatic model selection, but are still inferior to cdual and round when the latter are also aided by a priori distributions.

• Finally, we provide a comparison of the performance of automatic model selection between BYY and VB, with BIC in the two-phase implementation as a reference. Such comparisons have been made on factor analysis in [102] and on the Gaussian mixture model in [89], but not yet on BFA. Noting that BFA is a typical independent component analysis (ICA) problem in which the signal sources are binary, we simplify the VB-ICA algorithm [23, 22] accordingly to obtain a VB algorithm on BFA. Empirical analysis shows that BYY is the best for most configurations, while BIC is more robust than VB. VB is good only when the training sample size N is large and the noise is small, and it declines drastically as N decreases and the noise increases. Moreover, when applied to the problem of blind binary image separation, the results again show that BYY outperforms VB.

Moreover, when BFA is used for modeling a binary data matrix X, it becomes the BMF model in box (d) of Fig. 1.2. The BMF by Eq.(1.8) factorizes X as a product of two low-rank binary matrices and equivalently performs a bi-clustering task. However, most existing BMF algorithms require a given rank for the latent matrices. To tackle this problem, we propose a probabilistic BMF model under the BYY learning framework. We also develop a novel learning algorithm called BYY-BMF that can automatically determine the rank m during the BMF learning. We prove that the proposed algorithm converges after only one iteration for data with non-overlapping clusters. In addition, we prove that our method can infer the exact number of clusters under appropriate initializations. Moreover, the algorithm is extended with two variants for overlapping cases.
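The link between a binary factorization and bi-clustering can be illustrated with a small sketch. It is not the BYY-BMF algorithm and does not reproduce Eq.(1.8). It assumes a Boolean product of a row-membership matrix A and a column-membership matrix B, with non-overlapping clusters, and all names and data are hypothetical.

```python
def binary_product(A, B):
    """X[i][j] = 1 if row i and column j share at least one cluster, else 0.

    A is n x m (row memberships) and B is d x m (column memberships).
    """
    return [
        [int(any(a and b for a, b in zip(a_row, b_row))) for b_row in B]
        for a_row in A
    ]

def count_row_patterns(X):
    """Number of distinct nonzero rows of X."""
    return len({tuple(row) for row in X if any(row)})

# Toy data (assumed): 3 rows and 4 columns grouped into m = 2 clusters.
A = [[1, 0], [1, 0], [0, 1]]
B = [[1, 0], [0, 1], [0, 1], [0, 1]]
X = binary_product(A, B)
print(X, count_row_patterns(X))
```

In this noise-free, non-overlapping setting each cluster produces one distinct row pattern, so the rank m can be read off the data directly. With noise or overlapping clusters this shortcut no longer applies, which is where a learning procedure that selects m becomes necessary.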

Experiments show the effectiveness and efficiency of our algorithm. Furthermore, BMF is applied in bioinformatics to detect protein complexes, by clustering the proteins that share similar interactions through factorizing the binary adjacency matrix of a PPI network. BYY-BMF's clustering results do not depend on any parameters or thresholds, unlike the Markov Cluster Algorithm (MCL), which relies on a so-called inflation parameter. On synthetic PPI networks, the predictions evaluated against the known annotated complexes indicate that BYY-BMF is more robust than MCL in most cases. On real PPI networks from the MIPS [71] and DIP [112] databases, BYY-BMF obtains better-balanced prediction accuracies than MCL and a spectral analysis method, while MCL has its own advantages, e.g., good separation values.

Finally, we consider NFA in a general semi-blind learning framework (box (e) of Fig. 1.2) with applications in transcriptional regulatory network analysis and exome sequencing data analysis.

• We modify NCA [64] to model gene transcriptional regulation by NFA [124]. The previous NFA algorithm [123, 124] is extended here as sparse BYY-NFA by considering either or both of a priori connectivity and an a priori sparse distribution q(A) over A. Therefore, the a priori knowledge about the connection topology of the TF-gene regulatory network required by NCA is not necessary for our NFA algorithm. With the incorporated sparsity penalty on the mixing matrix of control strengths, the extra entries are automatically pushed to zero if there is not enough evidence for the existence of the corresponding TF-gene connections. A simulation study demonstrates the effectiveness of sparse BYY-NFA in recovering the hidden dynamics of TF regulatory signals, and in estimating the connectivity topology and control strengths. The sparse BYY-NFA can be applied not only to detect cyclic patterns of transcription factor activities from the yeast cell cycle data [91] and activations of the involved TF regulatory signals during the E. coli carbon source transition from glucose to acetate [55], but also to shut off unreliable or unnecessary TF-gene connections.

• The sparse BYY-NFA can be further modified by Eq.(1.7) to get a sparse BYY-BFA algorithm, which directly models the switching patterns of latent TF activities, e.g., whether or not a TF is activated. The identification of bimodal activity is useful for identifying the biological variation of TFs whose regulatory dynamics lie tightly around two discrete levels, which are usually corrupted by noise. When applied to the yeast cell cycle data [91] and the E. coli carbon source transition data [55], the binary TF activities reconstructed by the sparse BYY-BFA are consistent with the ups and downs of the continuous ones obtained by NCA.

• We apply the semi-blind NFA learning to the problem of identifying disease-associated single nucleotide polymorphisms (SNPs) from exome sequencing data, for which the methods of conventional genome-wide association studies (GWAS) do not work properly because one usually needs to distinguish true rare variants from sequencing errors in exome sequencing data. Here, a novel method is presented for exome sequencing analysis. First, the information of all SNPs in one exon/gene is encoded by a multidimensional vector y in Eq.(1.1). Then an NFA classifier optimized on a training set {(y,x)} is used in prediction, and significant exons/genes are selected according to the p-values of Fisher's exact test on the confusion tables of the prediction results. The results on a real data set from an exome sequencing project show that the selected significant genes are consistent in part with published results, and some of them are further verified by experiments to be new significant genes associated with the disease. Therefore, our algorithm is a promising tool for exome sequencing data analysis.
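The gene-selection step rests on Fisher's exact test applied to a 2x2 confusion table. The sketch below is a generic two-sided version written with the standard library only. It assumes the table layout [[a, b], [c, d]] with true labels as rows and predicted labels as columns, and the function name is hypothetical. It is meant as an illustration, not as the thesis's implementation.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)   # smallest feasible top-left count
    hi = min(row1, col1)       # largest feasible top-left count
    p_obs = prob(a)
    # A small relative tolerance keeps ties with the observed table.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Assumed toy confusion table: 3 + 3 correct and 1 + 1 wrong predictions.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

A small p-value indicates that the classifier's predictions for that exon/gene are associated with the true labels more strongly than chance would explain, which is the basis for ranking exons/genes by significance.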

There are some other important related topics that have not been covered by the thesis. For classification or clustering analysis, mixture models are usually involved by introducing a label variable $\ell$, i.e., $q(x) = \sum_{\ell=1}^{k} \pi_\ell\, q(x|\ell)$, and then a local NFA model is considered accordingly when each component $q(x|\ell)$ is formulated by NFA [122]. Actually, NFA can be regarded as the following constrained Gaussian mixture model (GMM),

\[
q(x) = \sum_{j} \int q(x|y,j)\, q(y|j)\, q(j)\, dy = \sum_{j} \alpha_j\, G(x \mid A\mu_j + a_0,\ A\Lambda_j A^T + \Sigma_e), \tag{1.33}
\]

where the summation is taken over $\{j = [j_1,\ldots,j_m] \mid 1 \le j_r \le k_r;\ 1 \le r \le m\}$, and $\alpha_j = \prod_{r=1}^{m} \alpha_{j_r}$, $\mu_j = [\mu_{j_1},\ldots,\mu_{j_m}]$, $\Lambda_j = \mathrm{diag}[\lambda_{j_1},\ldots,\lambda_{j_m}]$. However, these topics are out of the scope of this thesis. Readers are referred to [90] for a systematic study of automatic model selection on GMM and for extensions of the efforts on FA in this thesis to mixture models.
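The index set of the constrained GMM in Eq.(1.33) can be enumerated explicitly. The sketch below assumes that factor r has k_r mixing weights stored in a list, uses 0-based indices instead of the 1-based ones in the text, and builds the joint weight of each multi-index j as the product of the per-factor weights. The function name and the toy weights are hypothetical.

```python
from itertools import product
from math import prod

def joint_weights(alphas):
    """Map each multi-index j = (j_1, ..., j_m) to alpha_j.

    alphas[r] lists the mixing weights of factor r, so the joint model
    has len(alphas[0]) * ... * len(alphas[m-1]) Gaussian components.
    """
    index_ranges = [range(len(a)) for a in alphas]
    return {
        j: prod(alphas[r][j_r] for r, j_r in enumerate(j))
        for j in product(*index_ranges)
    }

# Toy weights (assumed): m = 2 factors with k_1 = 2 and k_2 = 3 components.
alphas = [[0.5, 0.5], [0.2, 0.3, 0.5]]
weights = joint_weights(alphas)
print(len(weights), sum(weights.values()))
```

The number of components grows as the product of the k_r, while the number of free mixing weights grows only as their sum. This is the sense in which the mixture is constrained.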

In the document: Learning Non-gaussian Factor Analysis with Different Structures: Comparative Investigations on Model Selection and Applications (pages 37-44)