Learning Non-gaussian Factor Analysis with Different Structures: Comparative Investigations on Model Selection and Applications


TU, Shikui

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy in Computer Science and Engineering

The Chinese University of Hong Kong
August 2012

Thesis/Assessment Committee
Professor JIA Jiaya (Chair)
Professor XU Lei (Thesis Supervisor)
Professor LEUNG Kwong Sak (Committee Member)
Professor KWOK James Tin Yau (External Examiner)

Abstract

Mining the underlying structure from high-dimensional observations is of critical importance in machine learning, pattern recognition and bioinformatics. In this thesis, we investigate, empirically and theoretically, non-Gaussian Factor Analysis (NFA) models with different underlying structures. We focus on the problem of determining the number of latent factors of NFA, from two-stage model selection to automatic model selection, with real applications in pattern recognition and bioinformatics.

We start with a degenerate case of NFA, the conventional Factor Analysis (FA) with latent Gaussian factors. Many model selection methods have been proposed and used for FA, and it is important to examine their relative strengths and weaknesses. We develop an empirical analysis tool to facilitate a systematic comparison of the model selection performance of classical criteria (e.g., Akaike's information criterion, AIC), recently developed methods (e.g., Kritchman & Nadler's hypothesis tests), and Bayesian Ying-Yang (BYY) harmony learning. We also prove a theoretical relative order of the underestimation tendency of four classical criteria.

Then, we investigate how parameterizations affect model selection performance, an issue that has been ignored or seldom studied because traditional model selection criteria, like AIC, perform equivalently on different parameterizations that have equivalent likelihood functions. Focusing on two typical parameterizations of FA, we find through extensive experiments on synthetic and real data that one is better than the other under both Variational Bayes (VB) and BYY. Moreover, we present a family of FA parameterizations that have equivalent likelihood functions, each characterized by an integer r, with the two known parameterizations at the two ends as r varies from zero to its upper bound. Investigations on this FA family not only confirm the significant difference between the two parameterizations in terms of model selection performance, but also provide insights into what makes a better parameterization. With a Bayesian treatment of the new FA family, alternative VB algorithms on FA are derived, and BYY algorithms on FA are extended to incorporate prior distributions on the parameters. A systematic comparison shows that BYY generally outperforms VB under various scenarios, including varying simulation configurations, incrementally adding priors to parameters, and automatic model selection.

To describe binary latent features, we proceed to binary factor analysis (BFA), which considers Bernoulli factors. First, we introduce a canonical dual approach to a difficult Binary Quadratic Programming (BQP) problem that is a computational bottleneck in BFA learning. Although it is not an exact BQP solver, it improves the learning speed and model selection accuracy. This indicates that some amount of error in solving the BQP, a problem nested in the hierarchy of the whole learning process, brings gains in both computational efficiency and model selection performance. The results also imply that optimization is important in learning, but learning is not just a simple optimization. Second, we develop BFA algorithms under VB and BYY that incorporate Bayesian priors on the parameters to improve automatic model selection performance, and show in a systematic comparison that BYY is superior to VB. Third, for binary observations, we propose a Bayesian Binary Matrix Factorization (BMF) algorithm under the BYY framework. The performance of the BMF algorithm is guaranteed with theoretical proofs and verified by experiments. We apply it to discovering protein complexes from protein-protein interaction (PPI) networks, an important problem in bioinformatics, where it outperforms other related methods.

Furthermore, we investigate NFA under a semi-blind learning framework. In practice, there are many scenarios in which either or both of the system and the input are partially known. Here, we modify Network Component Analysis (NCA) to model gene transcriptional regulation in systems biology by NFA. The previous hard-cut NFA algorithm is extended to a sparse BYY-NFA by considering either or both of a priori connectivity and an a priori sparsity constraint. Therefore, the a priori knowledge about the connection topology of the TF-gene regulatory network required by NCA is not necessary for our NFA algorithm. The sparse BYY-NFA can be further modified to obtain a sparse BYY-BFA algorithm, which directly models the switching patterns of latent transcription factor (TF) activities in gene regulation, e.g., whether or not a TF is activated. Mining switching patterns provides insights into the regulation mechanisms of many biological processes.

Finally, the semi-blind NFA learning is applied to identify single nucleotide polymorphisms (SNPs) that are significantly associated with a disease or a complex trait from exome sequencing data. By encoding each exon/gene (which may contain multiple SNPs) as a vector, an NFA classifier, obtained in a supervised way on a training set, is used for prediction on a testing set. The genes are selected according to the p-values of Fisher's exact test on the confusion tables collected from the prediction results. The genes selected on a real data set from an exome sequencing project on psoriasis are consistent in part with published results, and some of them are probably novel susceptibility genes of the disease according to the validation results.

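Abstract (Chinese version)

Learning Non-Gaussian Factor Analysis with Different Structures: Comparative Investigations on Model Selection and Applications

TU Shikui

Submitted to The Chinese University of Hong Kong in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

Mining the hidden structure of high-dimensional data is an important problem in machine learning, pattern recognition, bioinformatics and related fields. This thesis studies, in practice and in theory, non-Gaussian Factor Analysis (NFA) models with different hidden structures. It focuses on the model selection problem of determining the number of latent factors, from both the two-stage and the automatic perspectives, and on practical applications in pattern recognition and bioinformatics.

When each factor follows a single Gaussian, NFA degenerates to the conventional Factor Analysis (FA) model. We develop a tool for systematically comparing the performance of model selection methods, and use it to compare classical model selection criteria (e.g., AIC), recent statistical tests based on random matrix theory, and Bayesian Ying-Yang (BYY) harmony learning. We also give a theoretical result, applicable to small sample sizes, on the relative order of the tendency of four classical criteria to underestimate the number of factors.

Based on the conventional FA model, we further study how the parameterization affects the performance of model selection methods. This is an important issue that has been ignored or seldom studied, because parameterizations with equivalent likelihood functions perform identically under traditional criteria such as AIC. However, through extensive results on simulated and real data, we find that of two commonly used FA parameterizations with equivalent likelihood functions, one is more favorable for model selection under both Variational Bayes (VB) and the BYY framework. The two parameterizations are then extended, as the two ends, into a family of parameterizations with equivalent likelihood functions. The experimental results reveal more reliably how a gradual change of parameterization affects model selection. They also show that introducing prior distributions on the parameters can improve model selection accuracy, and the corresponding new learning algorithms are given. A systematic comparison shows that, for both the two-stage and the automatic approaches, BYY learning achieves better model selection performance than VB, and gains more from the favorable parameterization.

Binary Factor Analysis (BFA) is also an NFA model; it uses Bernoulli factors to explain the hidden structure. First, we introduce a canonical dual approach to a computationally expensive Binary Quadratic Programming (BQP) problem encountered in BFA learning. Although it cannot find the exact global optimum of the BQP, it improves both the speed of the overall learning algorithm and the accuracy of automatic model selection. This indicates that a locally nested sub-optimization problem need not be solved too precisely, and that an inexact solution can even benefit the overall learning algorithm. Then, introducing prior distributions further improves model selection performance, and systematic experiments confirm that BYY learning is superior to VB. Next, we develop a binary matrix factorization algorithm for binary data. Its performance is guaranteed by theoretical results, and in practice it detects protein functional complexes in large-scale protein interaction networks with better performance than other related algorithms.

Further, we study NFA algorithms under a semi-blind framework and their applications in systems biology. NFA is used to model gene transcriptional regulation, with a sparsity constraint imposed on the connectivity matrix. This gives a method that effectively estimates the regulatory signals of transcription factors without requiring, as Network Component Analysis (NCA) does, the topology of the transcription factor-gene regulatory network to be given in advance. In particular, with BFA the binary features in the regulatory signals can be captured directly. Such switch-like patterns play an important role in the regulatory mechanisms of many biological processes.

Finally, based on the semi-blind NFA learning algorithm, we propose a method for analyzing exome sequencing data that effectively finds disease-associated susceptibility genes. It offers a possible direction for addressing the failure of traditional genome-wide association study (GWAS) methods on low-frequency, high-noise exome sequencing data. Preliminary results on a large-scale exome sequencing data set of 1457 samples show that our method both confirms many genes already considered to be disease-related and finds new susceptibility genes whose significance is replicated in validation. Related expression profile data further show that the newly found genes are significantly up- or down-regulated between disease and control.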

Acknowledgement

I am deeply indebted to my supervisor, Professor Lei Xu, who guided me onto the research road with great effort throughout my graduate study. I could not have made these academic achievements without his rigorous supervision. I still keep in mind his words, on the day of my first visit to him, about how to be an excellent scholar. From Prof. Xu, I learned much in many ways. In particular, his serious and rigorous academic spirit encouraged me a lot, and I came to understand the importance of the basic concepts and the physical meaning in the research process.

I thank the members of my thesis committee for taking the time to read and comment on my thesis. I thank Prof. Runsheng Chen from the Institute of Biophysics, CAS, Prof. Xihong Wu and Prof. Dingsheng Luo from Peking University, Prof. Zhiyong Liu from the Institute of Automation, CAS, and also Dr. Kun Zhang. I also thank my research colleagues, Mr. Lei Shi, Mr. Ke Sun, Dr. Xiaopeng Zhu, Mr. Zaihu Pang, Dr. Liqing Tian, Dr. Beibei Chen and Mr. Xiaowei Chen, for helpful discussions. I thank my fellow classmates at CUHK, Dr. Hao Ma, Mr. Jilin Le, Ms. Fang Xu, Ms. Tu Zhou, Mr. Jianing Chen and Dr. Yubin Zhang, for the wonderful life at CUHK, and Mr. Weijie Wu, Mr. Ping Li and Mr. Chen Huang for their kind help. Sincere thanks also go to all the members of the CSE department and all the members of Prof. Runsheng Chen's Lab.

Last but not least, I thank my parents, my other family members, and my friends for their warm support during my Ph.D. studies.

Contents

Abstract
Acknowledgement
1 Introduction
  1.1 Background
    1.1.1 Motivations
    1.1.2 Independent Factor Analysis (IFA)
    1.1.3 Learning Methods
  1.2 Related Work
    1.2.1 Learning Gaussian FA
    1.2.2 Learning NFA
    1.2.3 Learning Semi-blind NFA
  1.3 Main Contribution of the Thesis
  1.4 Thesis Organization
  1.5 Publication List
2 FA - comparative analysis
  2.1 Determining the factor number
  2.2 Model Selection Methods
    2.2.1 Two-Stage Procedure and Classical Model Selection Criteria
    2.2.2 Kritchman & Nadler's Hypothesis Test (KN)
    2.2.3 Minimax Rank Estimation (MM)
    2.2.4 Minka's Criterion (MK) for PCA
    2.2.5 Bayesian Ying-Yang (BYY) Harmony Learning
  2.3 Empirical Analysis
    2.3.1 A New Tool for Empirical Comparison
    2.3.2 Investigation On Model Selection Performance
  2.4 A Theoretic Underestimation Partial Order
    2.4.1 Events of Estimating the Hidden Dimensionality
    2.4.2 The Structural Property of the Criterion Function
    2.4.3 Experimental Justification
  2.5 Concluding Remarks
3 FA - parameterizations affect model selection
  3.1 Parameterization Issue in Model Selection
  3.2 FA-r: ML-equivalent Parameterizations of FA
  3.3 Variational Bayes on FA-r
  3.4 Bayesian Ying-Yang Harmony Learning on FA-r
  3.5 Empirical Analysis
    3.5.1 Three levels of investigations
    3.5.2 FA-a vs FA-b: performances of BYY, VB, AIC, BIC, and DNLL
    3.5.3 FA-r: performances of VB versus BYY
    3.5.4 FA-a vs FA-b: automatic model selection performance of BYY and VB
    3.5.5 Classification Performance on Real World Data Sets
  3.6 Concluding remarks
4 BFA - learning versus optimization
  4.1 Binary Factor Analysis
  4.2 BYY Harmony Learning on BFA
  4.3 Empirical Analysis
    4.3.1 BIC and Variational Bayes (VB) on BFA
    4.3.2 Error in solving BQP affects model selection
    4.3.3 Priors over parameters affect model selection
    4.3.4 Comparisons among BYY, VB, and BIC
    4.3.5 Applications in recovering binary images
  4.4 Concluding Remarks
5 BMF - for PPI network analysis
  5.1 The problem of protein complex prediction
  5.2 A novel binary matrix factorization (BMF) algorithm
  5.3 Experimental Results
    5.3.1 Other methods in comparison
    5.3.2 Data sets
    5.3.3 Evaluation criteria
    5.3.4 On altered graphs by randomly adding and deleting edges
    5.3.5 On real PPI data sets
    5.3.6 On gene expression data for biclustering
  5.4 A Theoretical Analysis on BYY-BMF
    5.4.1 Main results
    5.4.2 Experimental justification
    5.4.3 Proofs
  5.5 Concluding Remarks
6 Semi-blind NFA: algorithms and applications
  6.1 Determining transcription factor activity
    6.1.1 A brief review on NCA
    6.1.2 Sparse NFA
    6.1.3 Sparse BFA
    6.1.4 On Yeast cell-cycle data
    6.1.5 On E. coli carbon source transition data
  6.2 Concluding Remarks
7 Applications on Exome Sequencing Data Analysis
  7.1 From GWAS to Exome Sequencing
  7.2 Encoding An Exon/Gene
  7.3 An NFA Classifier
  7.4 Results
    7.4.1 Simulation
    7.4.2 On a real exome sequencing data set: AHMUe
  7.5 Concluding Remarks
8 Conclusion and Future Work
A Derivations of the learning algorithms on FA-r
  A.1 The VB learning algorithm on FA-r
  A.2 The BYY learning algorithm on FA-r
Bibliography

List of Figures

1.1 IFA with different structures
1.2 The roadmap of the thesis
2.1 Demonstration of the figures.
2.2 Adjusted contour maps of Successful-selection rates of AIC and BYY over the three scenarios.
2.3 Testing log-likelihood (TLL) ratios and selected $\hat{m}$(TLL).
2.4 Adjusted contour maps of Successful-selection rates of BIC and CAIC across the three scenarios.
2.5 Adjusted contour maps of Successful-selection rates of HQC and MK across the three scenarios.
2.6 Adjusted contour maps of Successful-selection rates of KN and MM across the three scenarios.
2.7 Averaged curves of Successful rates by averaging the contour maps along one axis and then projecting along the other.
2.8 Demo of the structural property of the criterion function
2.9 The theoretical relative underestimation tendency (U-tendency)
2.10 Examples of contour maps with adjusted axes
2.11 The underestimation contours of AIC, BIC, HQC and CAIC at the same level (30%, 50% or 70%) for scenario I&II
2.12 The underestimation contours of AIC, BIC, HQC and CAIC at the same level (30%, 50% or 70%) for scenario III
2.13 Example successful-selection contour curves of AIC, BIC, HQC and CAIC at the same level (50%) for the four scenarios
2.14 Results on real world data sets
3.1 BYY system in the general form and specific structures for FA
3.2 The successful-selection rates on S(:, :, 15, 5) in terms of contour maps
3.3 The successful-selection rates of VB on the same synthetic data
3.4 The successful-selection rates of BYY are obtained on the same synthetic data
3.5 The model selection accuracies of VB learning on FA-r
3.6 The experimental results on FA-r with different parts of priors under BYY
3.7 The model selection accuracies of FA-r under VB for r = 0, 1, 3, 5, 7, 9
3.8 The model selection accuracies of VB on FA-r for different r ∈ {1, ..., 9} on experimental settings (N, γo)
3.9 A comparison of model selection performances of VB and BYY learning on FA-r at r = m* = 5.
3.10 The values of VB's variational lower bound F and BYY's harmony functional H
3.11 The automatic model selection accuracies of VB and BYY on FA-a and FA-b.
4.1 Percentage of correct solutions by the three approximate BQP methods
4.2 Learning without dimension deduction on a synthetic data set
4.3 Automatic model selection accuracies of BYY-BFA without prior
4.4 Automatic model selection accuracies of BYY-BFA with prior
4.5 Comparison of model selection performances by BYY embedded with cdual or round
4.6 Comparison of model selection performances of BYY, VB and BIC
4.7 Results of recovering binary images
5.1 Prediction evaluations of BMF and MCL on altered graphs.
5.2 Prediction evaluations of BMF and MCL on altered graphs (continued).
5.3 Prediction accuracies of BMF, MCL and SC on real world PPI networks.
5.4 Matching scores of BYY-BMF versus other biclustering algorithms.
5.5 Results on synthetic data
5.6 The results of BYY-BMF on X1
6.1 The generation distributions
6.2 The true activities (in blue) and the estimated ones (in red) of the latent factors
6.3 The true binary activities (in blue, covered by the red points) is correctly reconstructed by the ones (in red) by BYY-BFA
6.4 The estimated TFA profile by NCA
6.5 The estimated TFA profile by BYY-NFA(n+c)
6.6 The estimated TFA profile by BYY-NFA(s+c)
6.7 The estimated TFA profile by BYY-NFA(s+f)
6.8 The estimated TFA profile by BYY-BFA(n+c)
6.9 The estimated TFA profile by BYY-BFA(s+c)
6.10 The estimated TFA profile by BYY-BFA(s+f)
6.11 The estimated TFA profile on E.Coli data by NCA
6.12 The estimated TFA profile on E.Coli data by BYY-NFA(s+c)
6.13 The estimated TFA profile on E.Coli data by BYY-BFA(s+c)
7.1 Detecting contributed SNPs and interactions between SNPs from the simulated data.
7.2 The lengths of exons/genes in AHMUe and the number of SNPs contained in each exon/gene.
7.3 The distribution of top 100 significant genes ranked by the p-values from the testing set.
7.4 An example of the enriched pathways.
7.5 The corresponding gene expression profiles of four genes in GSE13355.

List of Tables

2.1 BYY harmony learning on FA
2.2 Configurations of the three scenarios
2.3 Configurations of the four scenarios
2.4 The empirical values of the approximation difference and the relative difference
2.5 The means and the standard deviations of the approximation difference
3.1 Two probabilistic parameterizations of FA
3.2 The two-stage procedure of VB learning
3.3 Prior distributions for FA
3.4 An outline of the VB algorithm on FA-r
3.5 The general two-stage iterative BYY harmony learning procedure
3.6 A sketch of the gradient implementation of BYY learning algorithm on FA-r
3.7 Three levels of investigations
3.8 The candidate values of each feature
3.9 Configurations of the training and testing sets of real data
3.10 The classification accuracies (%) in the form of "average±standard deviation"
4.1 Algorithms for solving the BQP in Eq.(4.9).
5.1 The Sketched BYY-BMF algorithm
5.2 Evaluations of different clustering algorithms on the test graph
6.1 The BYY-NFA algorithm
6.2 Implementation of algorithms
6.3 Implementation of the algorithms
6.4 Yeast cell cycle data points in combined dataset time points (min)
6.5 Reconstruction mean square errors (MSE) on yeast cell-cycle data
6.6 Confusion matrices of the reconstructed connectivity by sparse BYY-NFA/BFA on yeast cell-cycle data against the known connectivity from available experiments.
6.7 Sensitivity of TFA profiles to randomly inserting connections in connectivity from ChIP-chip data
7.1 Encoding the exon sequence
7.2 Encoding the genotypes of a SNP
7.3 Description of the exome sequencing data set: AHMUe
7.4 Significant genes according the validation p-values (< 10^-3) by at least two of the three methods
7.5 Test of significance of the four genes differentially expressed between normal skin and lesional skin.

Chapter 1

Introduction

1.1 Background

1.1.1 Motivations

In practice, observed data in fields such as signal processing, pattern recognition and bioinformatics are usually high dimensional, strongly dependent, and very noisy. One way to mine such complex data is to decompose them into independent components. Here, independence mainly refers to statistical independence, or equivalently information-theoretic independence. Such decompositions allow us to uncover the structure and the patterns in the data, to represent and visualize the data in lower-dimensional intrinsic coordinates with the noise greatly reduced, and to select and extract features for later processes such as clustering and classification. We consider the linear model, which explains the data in terms of mutually independent latent sources that pass through a linear transform and are corrupted by Gaussian noise. By considering different component distributions, the linear model becomes a powerful tool in various applications, as will be demonstrated in this thesis.

1.1.2 Independent Factor Analysis (IFA)

We consider the general Independent Factor Analysis (IFA) model, which explains the correlated observation $x$ in terms of $m$ latent variables $y_1, \dots, y_m$ that are mutually statistically independent. The $y_1, \dots, y_m$ are referred to as factors. Specifically, the $n$-dimensional $x$ is assumed to be generated by a linear combination of the factors plus a Gaussian noise $e$ that is independent of the factors, i.e.,

\[ x = Ay + a_0 + e, \tag{1.1} \]

where $A$ is an $n \times m$ mixing matrix, and

\[ q(y) = \prod_{j=1}^{m} q(y_j), \tag{1.2} \]

\[ q(x|y) = G(x \mid Ay + a_0, \Sigma_e), \tag{1.3} \]

and $\Sigma_e$ is a diagonal covariance matrix, and $G(z \mid \mu, \Sigma)$ denotes a Gaussian probability density with mean $\mu$ and covariance $\Sigma$. Here and throughout this thesis, $p(\cdot)$ and $q(\cdot)$ denote probability distributions. Usually, IFA is investigated by considering specific structures of $q(y_j)$ in Eq.(1.2). Next, we follow Fig. 1.1 to provide a brief introduction to some typical structures studied in the literature.

Gaussian Factor Analysis (FA)

The classic Factor Analysis (FA) [5] is a type of IFA in the sense of independence in second-order statistics, obtained by taking each $q(y_j)$ in Eq.(1.2) to be a zero-mean Gaussian (box 2 of Fig. 1.1). Usually, it is further assumed that $E[yy^T] = I$, since uncorrelated components remain uncorrelated after any scaling transform: $Ay = A'y'$, $A' = AD^{-1}$, $y' = Dy$, for any invertible diagonal matrix $D$, where $E[\cdot]$ denotes the expectation operator. Moreover, there still exists a rotation indeterminacy, i.e.,

\[ E[yy^T] = E[y'(y')^T] = I, \qquad A\psi\psi^T A^T = AA^T, \]

for any $y' = \psi y$ with $\psi\psi^T = I$. See Eq.(5)-(6) of [121] for more details.

Figure 1.1: IFA with different structures for the independent factors and A. [Boxes in the figure: 1, Independent Factor Analysis (IFA); 2, Factor Analysis (FA), Gaussian factors; 3, Non-Gaussian Factor Analysis (NFA); 4, NFA by Gaussian mixture; 5, Binary FA (BFA); 6, NFA by Gram-Charlier expansion; 7, Binary Matrix Factorization (BMF), X = AY + E with X, A, Y binary matrices; 8, Network Component Analysis (NCA), partial knowledge on A.]

Non-Gaussian Factor Analysis (NFA)

The rotation indeterminacy in FA can be removed when $y_1, \dots, y_m$ come from independent non-Gaussian densities (see Eq.(47) in [124]), e.g.,

\[ q(y_j) = G(y_j \mid 0, 1)\left[ 1 + \frac{\rho_j}{6} H_3(y_j) + \frac{\kappa_j^2 - 3}{24} H_4(y_j) \right], \tag{1.4} \]

or

\[ q(y_j) = \sum_{\ell=1}^{k_j} \alpha_{j\ell}\, G(y_j \mid \mu_{j\ell}, \lambda_{j\ell}), \qquad 0 \le \alpha_{j\ell} \le 1, \qquad \sum_{\ell=1}^{k_j} \alpha_{j\ell} = 1, \tag{1.5} \]

where $H_n(y_j)$ is the $n$th-order Chebyshev-Hermite polynomial. Eq.(1.4) is the Gram-Charlier expansion with unknown parameters $\{\rho_j, \kappa_j^2\}$, while Eq.(1.5) is a Gaussian mixture with parameters $\{\alpha_{j\ell}, \mu_{j\ell}, \lambda_{j\ell}\}$. The scaling indeterminacy can also be removed by imposing the unit variance constraint $E[y_j^2] = 1$ with $E[y_j] = 0$ (see Eq.(125) in [123]).
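The rotation indeterminacy of Gaussian FA, and the way non-Gaussian factors expose a rotation, can be checked numerically. The following is only an illustrative sketch and not an algorithm from this thesis: the sizes `n` and `m`, the random seed, the helper names `model_covariance`, `random_rotation` and `excess_kurtosis`, the pairing $A' = A\psi^T$ with $y' = \psi y$, and the choice of unit-variance uniform factors as the non-Gaussian example are all assumptions made for the demonstration.

```python
import numpy as np

def model_covariance(A, sigma_e):
    """Covariance of x = A y + a0 + e when E[y y^T] = I and e ~ G(0, diag(sigma_e))."""
    return A @ A.T + np.diag(sigma_e)

def random_rotation(m, rng):
    """Draw an orthogonal m x m matrix (psi psi^T = I) from the QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

def excess_kurtosis(Y):
    """Empirical excess kurtosis of each row of Y (zero for a Gaussian)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    var = (Yc ** 2).mean(axis=1)
    return (Yc ** 4).mean(axis=1) / var ** 2 - 3.0

rng = np.random.default_rng(0)
n, m, N = 6, 3, 100_000                    # hypothetical sizes
A = rng.standard_normal((n, m))            # mixing matrix
sigma_e = rng.uniform(0.1, 0.5, size=n)    # diagonal of Sigma_e
psi = random_rotation(m, rng)

# Rotated pair (assumed pairing): y' = psi y and A' = A psi^T, so that A' y' = A y.
A_rot = A @ psi.T

# Gaussian FA only sees second-order statistics, which the rotation leaves unchanged.
same_covariance = np.allclose(model_covariance(A, sigma_e),
                              model_covariance(A_rot, sigma_e))

# Non-Gaussian factors: zero-mean, unit-variance uniform sources (excess kurtosis -1.2).
Y_uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(m, N))
kurt_before = excess_kurtosis(Y_uniform)        # close to -1.2 for every factor
kurt_after = excess_kurtosis(psi @ Y_uniform)   # mixing pulls the values toward 0

print("covariance unchanged by rotation:", same_covariance)
print("excess kurtosis before rotation:", np.round(kurt_before, 2))
print("excess kurtosis after rotation: ", np.round(kurt_after, 2))
```

The covariance check passes for any orthogonal `psi`, which is why second-order statistics alone cannot identify the rotation. The kurtosis of the rotated factors differs from that of the original ones, so higher-order statistics of non-Gaussian factors do carry information about the rotation.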

(30) CHAPTER 1. INTRODUCTION. 4. To address the difference from FA with Gaussian q(y j ), the model with nonGaussian q(y j ) is called non-Gaussian Factor Analysis (NFA) (box 3 of Fig. 1.1), which was studied within the Bayesian Ying-Yang (BYY) learning system [116, 121, 123]. The Eq.(1.4) and Eq.(1.5) induce two specific types of NFA, as indicated in box 4 and box 6 of Fig. 1.1, respectively. Besides, the NFA by Eq.(1.5) was also investigated in [7] under the name of IFA, which is actually a special case of the general IFA by Eq.(1.1)-(1.3). The use of the term IFA in [7] is inaccurate, because the concept underlying IFA should preferably correspond to the general formulation by Eq.(1.1)-(1.3). The above Eq.(1.4) and Eq.(1.5) are continuous non-Gaussian densities. When each factor y j is discrete, e.g., binary, the Bernoulli distribution is considered instead: y. q(y j ) = β j j (1 − β j )1−y j ,. 0 ≤ β j ≤ 1, y j ∈ {0, 1},. (1.6). which removes the scaling indeterminacy because the range of y j is fixed at {0, 1}. The IFA model with Eq.(1.6) is another type of NFA, and in the literature (see e.g., [121, 3]) it was called Binary Factor Analysis (BFA) which is placed in box 5 of Fig. 1.1. Particularly, the Bernoulli distribution by Eq.(1.6) can be equivalently written in the form of Eq.(1.5) with k j = 2, μ j1 = 0, μ j2 = 1 and λ j1 = λ j2 = 0, for each 1 ≤ j ≤ m, q(y j ) = α j1 δ(y j ) + α j2 δ(y j − 1),. (1.7). where δ(z) is the Dirac delta function. Therefore, BFA is equivalently a special implementation of NFA in box 4 of Fig. 1.1, when the observations are traced to independent Bernoulli information sources. From the perspective of matrix factorization, the Eq.(1.1) is rewritten as follows: X = AY + E,. (1.8).

(31) CHAPTER 1. INTRODUCTION. 5. where X = {x}, Y = {y}, and E = {e}. The bias vector a0 is assumed to be zero + E with Y = {y + μ}. or represented by Aμ from which it follows that X = AY BFA constraints Y to be binary. If the data matrix X is binary and A is constrained to be binary, then BFA implements a Binary matrix Factorization (BMF) model (see box 5 of Fig. 1.1), which seeks a factorization of X as a product of two low-rank binary matrices A, Y. Network Component Analysis (NCA) We have considered different structure constraints on q(y j ) for Eq.(1.1), given only X = {x}, which falls into the area of unsupervised learning. If Y = {y} is also given, the Eq.(1.1) becomes linear regression model which is a typical example in supervised learning. As discussed in Section 1 of [131], there exist many scenarios of knowing partially either or both of the system (i.e., A and the property of E = {e}) and the input, and these cases can be unified in a semi-blind learning framework. One example is the network component analysis (NCA) model [64] (indicated in box 8 of Fig. 1.1) for transcriptional regulation network analysis in system biology. Specifically, NCA takes the same factorization form as Eq.(1.8). Instead of imposing structural constraints on Y, NCA fixes some entries of A at zeros according to a priori knowledge in practice, and requires A and its resultant matrix by removing a column together with the rows corresponding to non-zero entries of the column to have full-column rank, and requires Y to have full-row rank. It should be noted that the IFA model with Eq.(1.1)-(1.3) actually assumes the elements of the matrix Y in Eq.(1.8) are independently identically distributed or shortly i.i.d. Structural constraints can be made on A for NCA, or on both A and Y to be binary for BFA, to be non-negative for non-negative matrix factorization (NMF) [61]. Moreover, the matrix factorization form by Eq.(1.8) can go beyond to.

have certain cross-column dependence over the columns of $Y$. For more discussion, see Figure 1 of [131], which takes the perspective of a stochastic bilinear matrix system.

1.1.3 Learning Methods

For a probabilistic model, e.g., IFA with $q(x|\Theta) = \int q(x|y) q(y)\, dy$, that describes a set of observations $\{x\}$, the learning task consists of three levels of inverse problems, i.e., inverse inference from the observation $x$ to the inner representation $y$, parameter learning of $\Theta$, and model selection for, e.g., $\dim(y)$, which are usually nested in this order in a hierarchy of the learning process [129].

Maximum Likelihood (ML) Learning

Given an i.i.d. sample set $X_N = \{x_t\}_{t=1}^N$, parameter learning is to estimate the parameters $\Theta$ so that the model $q(x|\Theta)$ fits the data well. Usually, parameters are estimated under the Maximum Likelihood (ML) principle:

$$\hat{\Theta}^{ML} = \arg\max_{\Theta} \ln q(X_N|\Theta), \quad q(X_N|\Theta) = \prod_t q(x_t|\Theta), \tag{1.9}$$

which can be implemented by an Expectation-Maximization (EM) algorithm [84, 97]. The EM algorithm treats the data set as incomplete, with missing data $y_t$ associated with each $x_t$, which is generated by Eq.(1.1). Consequently, the log-likelihood of the complete data $\{x_t, y_t\}_{t=1}^N$ is computed as follows:

$$L(X_N, Y_N|\Theta) = \sum_{t=1}^{N} \ln[q(x_t|y_t) q(y_t)]. \tag{1.10}$$

Then, the EM algorithm is implemented by iterating the following two steps:

E-Step: Compute the expectation of Eq.(1.10) given $X_N$ and the current parameter estimate $\Theta^*$,

$$Q(\Theta, \Theta^*) = \sum_{t=1}^{N} \int q(y_t|x_t, \Theta^*) \ln[q(x_t|y_t) q(y_t)]\, dy_t, \tag{1.11}$$

where $q(y_t|x_t, \Theta^*)$ is calculated by the Bayes formula:

$$q(y_t|x_t, \Theta^*) = \frac{q(x_t|y_t) q(y_t)}{q(x_t|\Theta^*)}. \tag{1.12}$$

M-Step: Update the parameters by $\Theta^{new} = \arg\max_{\Theta} Q(\Theta, \Theta^*)$.

Notice that the task of Bayesian inference of $y_t$ is nested in EM by Eq.(1.12).

Two-stage Procedure for Model Selection

If the latent dimensionality of $y$, i.e., $m = \dim(y)$, is unknown and has to be determined, then we encounter a model selection problem. A traditional approach is a two-stage procedure, i.e., parameter learning is repeated on a set $M$ of candidate latent dimensionalities, among which one is selected by a model selection criterion. Existing classical criteria include Akaike's Information Criterion (AIC) [1], Bozdogan's Consistent Akaike's Information Criterion (CAIC) [14], the Hannan-Quinn information criterion (HQC) [42], Schwarz's Bayesian Information Criterion (BIC) [86], and Minimum Description Length (MDL) [83] (which stems from another viewpoint but coincides with BIC when it is simplified to an analytically computable criterion). They trade off the likelihood-based goodness of fit against model complexity, subject to noise and uncertainty in a finite number of observations. The two-stage procedure is formulated as follows:

Stage I: Compute $\hat{\Theta}_m = \hat{\Theta}(X_N, m)$ for each candidate $m \in M$. Usually, $\hat{\Theta}_m$ is given by the ML estimator $\hat{\Theta}_m^{ML}$ in Eq.(1.9).

Stage II: Estimate the true hidden dimensionality $m^*$ by minimizing a model selection criterion $J_{Cri}$, e.g.,

$$\hat{m} = \arg\min_{m \in M} J_{Cri}(X_N, \hat{\Theta}_m), \tag{1.13}$$

$$J_{Cri}(X_N, \hat{\Theta}_m) = -\ln q(X_N|\hat{\Theta}_m) + (\rho_N d_m)/2, \tag{1.14}$$

$$\rho_N = \begin{cases} 2, & \text{for AIC [1]}, \\ \ln N, & \text{for BIC/MDL [86, 83]}, \\ \ln N + 1, & \text{for CAIC [14]}, \\ 2 \ln(\ln N), & \text{for HQC [42]}, \end{cases} \tag{1.15}$$

where $d_m$ in Eq.(1.14) is the number of free parameters of the model $q(x|\Theta)$.

Automatic model selection

The conventional two-stage procedure has been successfully used to tackle model selection problems. However, this two-stage implementation suffers from a heavy computational cost because it requires parameter learning for each $m \in M$. Moreover, a larger $m$ often implies more unknown parameters, so parameter estimation becomes less reliable and the criterion evaluation loses accuracy (see Section 2.1 in [130] for a detailed discussion). One line of research that reduces the computational cost is referred to as automatic model selection. An early effort is RPCL (rival penalized competitive learning) [114, 115], which can automatically determine the cluster number in competitive learning by balancing a participating mechanism and a leaving mechanism [127]. As systematically discussed in Section 2.1 in [130], automatic model selection is associated with two features. First, there is an indicator $\psi(\theta_{SR})$ on a subset $\theta_{SR}$ of scale representative (SR) parameters. For example, in FA the variance $\lambda_j$ of $q(y_j)$ is a type of SR parameter, and the factor $y_j$ becomes extra and is thus discarded if its

corresponding $\psi(\lambda_j) = \lambda_j$ gets

$$\lambda_j = 0. \tag{1.16}$$

Second, there is an intrinsic mechanism to drive $\psi(\theta_{SR}) \to 0$, as $\theta_{SR}$ tends to a specific value, if the corresponding structure is redundant.

For a sample set $X_N$ of limited size, the ML principle is not good for model selection. The criterion by Eq.(1.14) is not suitable for automatic model selection either, because the last term in Eq.(1.14) does not depend on $\Theta$, and thus parameter estimation based on $J_{Cri}(X_N, \Theta_m)$ degenerates back to ML learning. Variational Bayes (VB) [54, 8, 10] was proposed to be capable of automatic model selection with the help of prior distributions over the parameters, while Bayesian Ying-Yang (BYY) [116, 130] harmony learning was proposed with not only improved model selection criteria but also improved automatic model selection.

Variational Bayes (VB)

The Bayesian approach has been extensively used in many scientific areas. One important and difficult problem is computing the marginal likelihood of a given training data set $X_N$, which involves a high dimensional integral over all parameters. Variational Bayes (VB) [54, 8, 10] tackles the integral by means of variational methods, approximating the log marginal likelihood $\ln q(X_N|m, \Xi)$ with a lower bound:

$$F(p(\Theta), p(Y), m, \Xi) = \int p(\Theta) p(Y) \ln \frac{q(X_N, Y|\Theta) q(\Theta|m, \Xi)}{p(\Theta) p(Y)}\, d\Theta\, dY \tag{1.17}$$

$$= \ln q(X_N|m, \Xi) - KL(p(\Theta) p(Y) \,\|\, q(\Theta, Y|X_N, m, \Xi)), \tag{1.18}$$

where $Y$ represents hidden variables, $q(\Theta|m, \Xi)$ is a given prior over the parameters $\Theta$, $KL(p\|q) = \int p \ln(p/q) \ge 0$ is the KL-divergence, and $q(\Theta, Y|X_N, m, \Xi) \propto q(X_N, Y|\Theta) q(\Theta|m, \Xi)$. The lower bound $F$ is a functional of the model scale $m$, the prior's hyperparameters $\Xi$, and the variational posterior $p(\Theta) p(Y)$, which is usually

assumed to be further factorized as $\prod_i p(\theta_i) \prod_t p(y_t)$ with $\Theta = \{\theta_i\}$ and $Y = \{y_t\}$, in order to obtain computable variational posteriors. The tightness of the bound depends on the KL divergence between the computed variational posterior and the exact Bayesian posterior. The optimized $F$ approaches the maximum log marginal likelihood, which encodes a preference for simpler, more constrained models through assigning higher probability to the data set. It is straightforward to show that the optimum form for each component of the variational posterior distribution is [54, 10]:

$$p(\vartheta_i) \propto \exp\Big\{ E\Big[ \ln[q(X_N, Y|\Theta) q(\Theta|m, \Xi)] \,\Big|\, \prod_{j \ne i} p(\vartheta_j) \Big] \Big\}, \tag{1.19}$$

where $\vartheta_i \in \Theta \cup Y$, and $E_p[\cdot]$ or $E[\cdot|p]$ denotes expectation with respect to $p$. Then, maximizing $F$ is implemented via the following VBEM algorithm until the convergence of $F$:

VBE-Step: Compute $p(y_t)$ for each $y_t \in Y$ by Eq.(1.19), given the current estimate of $p(\theta_i)$, $\forall \theta_i \in \Theta$;

VBM-Step: Compute $p(\theta_i)$ for each $\theta_i \in \Theta$ by Eq.(1.19), given the current estimate of $p(y_t)$ and $p(\theta_j)$, $\forall j \ne i$.

It should be noted that the VBEM algorithm degenerates to the standard EM if no prior distributions $q(\Theta)$ are considered on the parameters $\Theta$. For IFA, the factor number $m$ can be determined in a two-stage procedure similar to Eq.(1.14), i.e., $\hat{m} = \arg\max_{m \in M} F(\hat{p}(\Theta), \hat{p}(Y), m, \Xi)$, where $\hat{p}(\Theta)$ and $\hat{p}(Y)$ are computed by the VBEM algorithm, and $\Xi$ can be given or further tuned to maximize $F$. Moreover, a hidden factor $y_j$ can be automatically discarded during VB learning if the norm of the corresponding $j$-th column of the mixing matrix $A$ is pushed towards zero with the help of an Automatic Relevance Determination (ARD) prior [77].
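The two-stage procedure of Eq.(1.13)-(1.15), which the VB-based selection above mirrors with $F$ in place of the criterion, can be illustrated with a minimal sketch. To keep it short, the sketch assumes FA with isotropic noise (probabilistic PCA), for which the maximized log-likelihood has a known closed form in the eigenvalues of the sample covariance; this is a simplification, not the general FA of this chapter. All function names and the synthetic data settings are hypothetical.

```python
import numpy as np

def ppca_max_loglik(eigvals, n_samples, m):
    """Maximized log-likelihood of FA with isotropic noise (PPCA) with m factors.

    eigvals: eigenvalues of the ML sample covariance, sorted in descending order.
    """
    d = eigvals.size
    sigma2 = eigvals[m:].mean()  # ML noise variance: mean of the discarded eigenvalues
    return -0.5 * n_samples * (
        d * np.log(2.0 * np.pi)
        + np.log(eigvals[:m]).sum()
        + (d - m) * np.log(sigma2)
        + d
    )

def n_free_params(d, m):
    # Loading matrix up to rotation, one noise variance, and the mean vector.
    return d * m - m * (m - 1) // 2 + 1 + d

def rho(name, n):
    # Penalty weights rho_N of Eq.(1.15).
    return {
        "AIC": 2.0,
        "BIC": np.log(n),
        "CAIC": np.log(n) + 1.0,
        "HQC": 2.0 * np.log(np.log(n)),
    }[name]

def two_stage_select(X, candidates, criterion):
    """Stage I: fit every candidate m. Stage II: minimize J_Cri of Eq.(1.14)."""
    n, d = X.shape
    S = np.cov(X, rowvar=False, bias=True)
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    scores = {
        m: -ppca_max_loglik(eigvals, n, m)
        + 0.5 * rho(criterion, n) * n_free_params(d, m)
        for m in candidates
    }
    return min(scores, key=scores.get)

# Synthetic data (assumed settings): 3 true factors, 10 observed dimensions.
rng = np.random.default_rng(0)
d, m_true, n = 10, 3, 2000
A = rng.normal(size=(d, m_true))
Y = rng.normal(size=(n, m_true))
X = Y @ A.T + np.sqrt(0.1) * rng.normal(size=(n, d))

selected = {c: two_stage_select(X, range(1, d), c)
            for c in ("AIC", "BIC", "CAIC", "HQC")}
print(selected)
```

Note that every candidate $m$ is fitted before any comparison is made, which is the computational burden that motivates automatic model selection.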

Bayesian Ying-Yang (BYY) Harmony Learning

First proposed in [116] and systematically developed over a decade and a half, Bayesian Ying-Yang (BYY) harmony learning theory is a general statistical learning framework that can handle both parameter learning and model selection under a best harmony principle. The BYY harmony learning on typical structures leads to new model selection criteria, new techniques for implementing regularization, and a class of algorithms that implement automatic model selection during parameter learning. In the sequel, we introduce the fundamentals of BYY. Readers are referred to [130] for a recent systematic introduction.

Mathematically, the best harmony principle is to maximize the following harmony functional:

$$H(p\|q) = \int p(R|X) p(X) \ln[q(X|R) q(R)]\, dX\, dR = \int p(\Theta|X) H(p\|q, \Theta)\, d\Theta,$$

$$H(p\|q, \Theta) = \int p(Y|X, \Theta) p(X) \ln[q(X|Y, \Theta) q(Y|\Theta)]\, dY\, dX + \ln q(\Theta|\Xi), \tag{1.20}$$

where the observation data $X$ are generated from the inner representation $R = \{Y, \Theta\}$, the parameter set $\Theta$ represents the underlying regularities of $X$, and $Y$ is the inner representation of $X$ accordingly. The two types of Bayesian decompositions, i.e., $p(R|X) p(X)$ and $q(X|R) q(R)$, are called the Yang machine and the Ying machine respectively, and together they form a BYY system.

An important property of maximizing $H(p\|q)$ is that it leads to not only a best matching between the Ying-Yang pair, but also a compact model with least complexity. This ability can be observed and investigated from several perspectives, see Sect. 4.1 in [130]; here we introduce only one of them due to space limits. On one hand, maximizing $H(p\|q)$ forces the Ying machine $q(X|R) q(R)$ to match the Yang machine $p(R|X) p(X)$. Due to a finite sample size and practical constraints imposed on the Ying-Yang structures, a perfect equality $q(X|R) q(R) = p(R|X) p(X)$ may not be reached exactly, but it is approached as closely as possible. At this equality, $H(p\|q)$ becomes the negative entropy that describes the complexity of the system. Further maximizing it will decrease the system complexity, which provides a model selection ability.

Similar to Eq.(1.14), maximizing $H(p\|q)$ can be implemented by the following two-stage procedure:

Stage I: Enumerate candidate models by $m$, and for each candidate iterate the following (a) and (b) until convergence:

$$\text{(a)} \quad \Theta^{(\tau)} = \arg\max_{\Theta} H(p\|q, \Theta, m, \Xi^{(\tau-1)}), \tag{1.21}$$

$$\text{(b)} \quad \Xi^{(\tau)} = \arg\max_{\Xi} \Big\{ H(p\|q, \Theta^{(\tau)}, m, \Xi) + \frac{1}{2} d_m(\Xi) + H_b(m, \Xi) \Big\}, \tag{1.22}$$

where $\tau$ is the iteration number.

Stage II: Select the best $\hat{m}$ by

$$\hat{m} = \arg\min_{m} \Big\{ -H(p\|q, \Theta^{(\tau^*)}, m, \Xi^{(\tau^*)}) + \frac{1}{2} n_f(\Theta_m) - H_b(m, \Xi) \Big\}, \tag{1.23}$$

where $\tau^*$ is the value of the iteration indicator $\tau$ when Stage I has converged.

Next, we outline the derivation of the above two-stage procedure, with details referred to Sec. 4.3 in [130]. Putting the empirical density $p(X) = \delta(X - X_N)$ with $X_N = \{x_t\}_{t=1}^N$ into Eq.(1.20) and splitting $\Theta = \Theta_a \cup \Theta_b$ such that $\Theta_a \cap \Theta_b = \emptyset$, we have

$$H(p\|q) = H_b(m, \Xi) + \int p(\Theta|X_N, \Xi) H(p\|q, \Theta, m, \Xi)\, d\Theta, \tag{1.24}$$

$$H_b(m, \Xi) = \int p(\Theta_b|X_N, \Xi) \ln q(\Theta_b|\Xi)\, d\Theta_b, \tag{1.25}$$

$$H(p\|q, \Theta, m, \Xi) = \int p(Y|X_N, \Theta) \ln[q(X_N|Y, \Theta) q(Y|\Theta)]\, dY + \ln q(\Theta_a|\Xi), \tag{1.26}$$

where the integral for $H_b(m, \Xi)$ can be solved analytically and $\Theta_b$ could be an empty subset. The second term in Eq.(1.24) is handled by the so-called apex approximation, resulting in

$$\int p(\Theta|X_N, \Xi) H(p\|q, \Theta, m, \Xi)\, d\Theta \approx H(p\|q, \Theta^*, m, \Xi) + \frac{1}{2} d_m(\Xi), \tag{1.27}$$

$$d_m(\Xi) = -n_f(\Theta) + (\Theta_X - \Theta^*)^T \Omega(\Theta^*, \Xi) (\Theta_X - \Theta^*), \quad \Theta^* = \arg\max_{\Theta} H(p\|q, \Theta, m, \Xi), \tag{1.28}$$

where $\Omega(\Theta^*, \Xi) = \nabla^2_{\Theta\Theta^T} H(p\|q, \Theta, m, \Xi)$ is the Hessian matrix evaluated at $\Theta^*$, and $\Theta_X$ is the mean of $p(\Theta|X_N, \Xi)$. For simplicity, we adopt $\Theta_X = \Theta^{(\tau-1)}$ and thus $\Theta^* = \Theta^{(\tau)}$. It follows that $\Theta^{(\tau)} - \Theta^{(\tau-1)}$ vanishes when the iteration converges, so that $d_m(\Xi)$ reduces to $-n_f(\Theta)$ at convergence. Therefore, we get Stage I(a) directly from Eq.(1.28). Moreover, putting Eq.(1.27) into Eq.(1.24), we may update the hyperparameters $\Xi$ by Stage I(b) and select $\hat{m}$ by Stage II.

The BYY harmony learning is also characterized by the favorable property that model selection is made automatically during the implementation of merely Stage I. For example, for FA the implementation of either Stage I(a) or both Stage I(a) and I(b) will drive the variance $\lambda_j$ of $p(y_j)$ to zero when the $j$-th dimension of $y$ is extra. Thus, automatic model selection can be made by checking $\lambda_j \to 0$ and discarding the $j$-th dimension. For further details about automatic model selection, see Sect. 2.1 and Sect. 3.2 in [130], and see Sect. 2.2 in [131] for further improvements via exploring a co-dimensional matrix pair nature (where an improved model selection criterion is additionally given by, e.g., Eq.(29) in [131]).

The above is a brief introduction to BYY. Readers are referred to a summary of nine aspects of the novelty and favorable properties of BYY best harmony learning, made at the end of Sect. 4.1 in [130], to the roadmap shown in Fig. A2 of [130], and to a systematic outline of the 13 topics about best harmony learning in Sect. 7 of [132].
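The discarding step of automatic model selection described above (checking $\lambda_j \to 0$) can be sketched as follows. This is only the pruning step under stated assumptions, not the BYY learning algorithm itself: the function name `prune_factors`, the tolerance `tol`, and the example values are hypothetical, and the loading matrix is written as `U` with one column per factor.

```python
import numpy as np

def prune_factors(U, lam, tol=1e-6):
    """Drop the factors whose variance lam[j] has been driven to (near) zero.

    U   : (d, m) loading matrix, one column per hidden factor
    lam : (m,) variances of the hidden factors
    tol : assumed numerical threshold standing in for "lambda_j -> 0"
    """
    keep = lam > tol
    return U[:, keep], lam[keep]

# Example: the third factor has collapsed during learning.
U = np.eye(5)[:, :4]
lam = np.array([2.3, 1.1, 1e-9, 0.7])
U_kept, lam_kept = prune_factors(U, lam)

# The covariance contributed by the factors is unchanged up to the tolerance.
cov_before = U @ np.diag(lam) @ U.T
cov_after = U_kept @ np.diag(lam_kept) @ U_kept.T
print(U_kept.shape, lam_kept, np.allclose(cov_before, cov_after, atol=1e-8))
```

Because a factor with zero variance contributes nothing to the covariance of $x$, removing it changes only the model scale $m$, not the fit.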

1.2 Related Work

1.2.1 Learning Gaussian FA

FA is a widely used linear technique of dimensionality reduction, which plays an important role in feature selection and extraction for pattern recognition problems [103, 53, 106]. Parameter learning on FA is usually implemented by the EM algorithm [5]. As revisited in [97], the maximum likelihood solution of FA with an isotropic noise covariance matrix extracts the principal components of the observed data as given by PCA [53]. Determining the number $m$ of hidden factors has been extensively investigated as a problem equivalent to detecting the number of source signals, tackled by the two-stage procedure with AIC and MDL, in many signal processing problems such as sensor array processing, the retrieval of the poles of a system response, and so on (see e.g., [107, 85]). Recent efforts include not only further developed Bayesian model selection methods [13, 70], but also rank estimation algorithms [59, 80] based on the results of sample covariance asymptotics stemming from random matrix theory [9, 52, 79]. It is important to examine the relative strengths and weaknesses of these model selection methods, but such a systematic comparison is still lacking.

Usually, FA is parameterized by a free factor loading matrix $A$ and a unit variance for each hidden factor to remove the scaling indeterminacy. Specifically, Eq.(1.2) becomes

$$q(y) = G(y|0, I_m), \tag{1.29}$$

where $I_m$ denotes the identity matrix of size $m \times m$. We call this parameterization FA-a for short; it has been widely used in various studies, e.g., in [13, 38, 78]. Still, FA-a has a rotation indeterminacy [121, 97].
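The rotation indeterminacy of FA-a can be checked numerically with a minimal sketch: for any orthogonal matrix $R$, the loading matrices $A$ and $AR$ give the same $AA^T$ and hence the same Gaussian likelihood. The dimensions and the random seed below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 3

A = rng.normal(size=(d, m))                      # a free loading matrix (FA-a)
R, _ = np.linalg.qr(rng.normal(size=(m, m)))     # a random m x m orthogonal matrix
A_rot = A @ R                                    # rotated loading matrix

same_cov = np.allclose(A @ A.T, A_rot @ A_rot.T)  # identical A A^T
different_loadings = not np.allclose(A, A_rot)    # yet the loadings differ
print(same_cov, different_loadings)
```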

In Item 9.4 of [118], an alternative FA parameterization was proposed and implemented by the BYY harmony learning. This parameterization constrains the factor loading matrix to be a rectangular orthogonal matrix, and takes the diagonal covariance matrix of the latent variables as free parameters, i.e.,

$$q(y) = G(y|0, \Lambda), \tag{1.30}$$

where $\Lambda$ is a diagonal covariance matrix, and to avoid ambiguity, the orthogonal mixing matrix in Eq.(1.1) is rewritten as $U$ with $U^T U = I_m$. Here, we denote this parameterization as FA-b for short. The scaling indeterminacy and the rotation indeterminacy have been removed in FA-b, because the orthogonality of $U$ cannot be kept after a scaling, and the covariance matrix of $q(y)$ may not be diagonal after a rotation transform on $y$.

Then, the likelihood function of FA is given by

$$q(x|\Theta) = \int q(x|y) q(y)\, dy = G(x|a_0, \Sigma_x), \tag{1.31}$$

where

$$\Sigma_x = \begin{cases} A A^T + \Sigma_e, & \text{for FA-a}, \\ U \Lambda U^T + \Sigma_e, & \text{for FA-b}. \end{cases} \tag{1.32}$$

It can be observed from Eq.(1.31) and Eq.(1.32) that the likelihood functions of FA-a and FA-b are equivalent, when there is no constraint that hinders $A A^T + \Sigma_e = U \Lambda U^T + \Sigma_e$ from reaching any positive definite or positive semi-definite matrix. The issue of how parameterizations affect model selection performance has been ignored or seldom studied, because the classical model selection criteria by Eq.(1.14) perform equivalently on FA-a and FA-b due to their equivalent likelihood functions by Eq.(1.31) and Eq.(1.32). However, it was found that BYY gets different model selection performances on FA-a and FA-b [119, 121]. More research is needed to find out which one is better under BYY or other model selection methods like VB.
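The likelihood equivalence of FA-a and FA-b in Eq.(1.32) can be illustrated with a short sketch that maps an FA-a loading matrix to an FA-b pair through the singular value decomposition, one possible construction chosen here for illustration. The dimensions, seed, and noise values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 6, 3

A = rng.normal(size=(d, m))                           # FA-a: free loading matrix
U, s, _ = np.linalg.svd(A, full_matrices=False)       # A = U diag(s) V^T, U is d x m
Lam = np.diag(s ** 2)                                 # FA-b: diagonal factor covariance

Sigma_e = np.diag(rng.uniform(0.1, 0.5, size=d))      # diagonal noise covariance

cov_a = A @ A.T + Sigma_e                             # Sigma_x for FA-a
cov_b = U @ Lam @ U.T + Sigma_e                       # Sigma_x for FA-b
print(np.allclose(cov_a, cov_b), np.allclose(U.T @ U, np.eye(m)))
```

Since both parameterizations yield the same $\Sigma_x$, any criterion that depends only on the maximized likelihood and the number of free parameters cannot distinguish them.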

1.2.2 Learning NFA

Continuous NFA

It becomes non-trivial to compute Eq.(1.11) and Eq.(1.12) for NFA due to the complexity of the non-Gaussian $q(y)$. NFA relaxes the impractical noise-free assumption of independent component analysis (ICA) [50], and related efforts have been made in, e.g., [49] with a rough maximum-likelihood (ML) approximation algorithm under the name of noisy ICA. The EM algorithm that exactly implements ML on the NFA in box 4 of Fig. 1.1 was first proposed in [72] through a trick of equivalently exchanging the product of summations into a summation of products, and the same algorithm was also published in [7]. For a large factor number $m$, an approximation was made with the help of variational methods for tractable computation of Eq.(1.11) and Eq.(1.12) [7]. It should be noted that the above EM algorithms for NFA in box 4 of Fig. 1.1 required the factor number $m$ and the Gaussian component numbers $\{k_j\}$ to be known in advance, i.e., the model selection issue of determining $m$ and $\{k_j\}$ was not investigated. Not only new model selection criteria but also automatic model selection algorithms were proposed for NFA in both box 4 and box 6 of Fig. 1.1 under the BYY harmony learning. For details, see Sect. 5 of [123] and Sect. IV(C) in [124]. Based on the NFA by box 4 of Fig. 1.1, BYY was found to be preferable to AIC, BIC, etc. in a preliminary comparison in [4]. Moreover, BYY may further improve its model selection performance by considering prior distributions on the parameters.

BFA

For the discovery of non-Gaussian binary factors from continuous data, BFA is used as a special case of NFA in box 5 of Fig. 1.1 [121, 4, 95, 68]. Research on BFA has also been focused on the analysis of binary data, such as social research

questionnaires, market basket data and so on, with the aid of Boolean algebra [57]. In the EM algorithm for BFA (see e.g., equations (6)-(9) in [4] for details), the corresponding Eq.(1.12) involves the combinatorial complexity of evaluating $q(x|y) q(y)$ for each $y \in \{0, 1\}^{m \times 1}$. Since Eq.(1.12) is nested in the EM iterations, the EM algorithm is computationally expensive for a large $m$ due to this combinatorial complexity. Under BYY [121], the Bayesian inference of $y$ can be solved analytically if the columns of $A$ are restricted to be orthogonal to each other [93]; otherwise it becomes an NP-hard problem of Binary Quadratic Programming (BQP), which may need proper approximations. A study in [3] investigated not only the BYY criterion but also the BYY automatic model selection performance on BFA, in comparison with AIC, BIC, and so on. However, it was only a preliminary work with rough implementations of BYY. The BYY model selection performance may be further improved by considering more efficient methods for BQP, or by considering prior distributions over the parameters. Also, a comparison between BYY and VB for their model selection performances in BFA learning is still lacking.

BMF

BMF factorizes a binary data matrix as a product of two low-rank binary matrices. Usually, BMF is optimized according to the square error function [63, 137]. By interpreting the obtained binary matrices as cluster memberships, BMF can be used for biclustering on dyadic data, the domains of which have two finite sets of objects and whose observations are collected on dyads. Most existing matrix factorization algorithms proceed with a given low rank for the learned matrices [63, 137]. By further restricting $A$ and $X$ to be binary in BFA, BMF can be presented as a probabilistic model. Equivalently, the low rank becomes the number of hidden factors, and then the low rank can be determined

by the model selection methods.

1.2.3 Learning Semi-blind NFA

As discussed in Section 1 of [131], there exist many scenarios in which either or both of the system (i.e., $A$ and the property of $E$) and the input are partially known, and these cases can be unified in a semi-blind learning framework. One example is the NCA model [64] in the form of Eq.(1.8) with certain structural constraints on $A$, where the result of the decomposition is computed by minimizing the square error criterion, as indicated in box 8 of Fig. 1.1. According to Eq.(1.1) and Eq.(1.8), NFA can be used to generalize NCA [64] so that the structural constraints are easily incorporated as prior distributions within the Bayesian paradigm. Moreover, sparse learning and model selection on NFA can be realized simultaneously under the BYY framework.

1.3 Main Contribution of the Thesis

In this dissertation, we investigate the NFA with different structures of Eq.(1.5) under the BYY learning theory, from two-stage model selection to automatic model selection. Several novel algorithms have been developed, and comparisons have been made with other related model selection methods. The scope of this thesis is briefly summarized in Fig. 1.2.

We begin our work by proposing an empirical analysis tool for systematic comparisons of the relative strengths and weaknesses of model selection methods, based on the problem of determining the number of factors in FA, which is a degenerate case of NFA in box (b) of Fig. 1.2. Specifically,

• we examine the joint effect of sample size N and signal-noise ratio (SNR) rather than merely the effect of either of SNR and N with the other fixed as

usually made in the literature, by varying both SNR and N. The indifference curves, defined by the contour lines of model selection accuracies, visually reveal that the relative advantages of the methods show up most clearly within a region of moderate N and SNR. Moreover, the importance of studying this region is also confirmed by an alternative reference criterion based on maximizing the testing likelihood.

• we also provide a theoretical comparison among AIC, CAIC, HQC and BIC, by building up a partial order of the relative underestimation tendency. The order is shown to be AIC, HQC, BIC, and CAIC, indicating the underestimation probabilities from small to large.

[Figure 1.2: The roadmap of the thesis. Model selection: dim(y) = ? Boxes: (a) Non-Gaussian Factor Analysis (NFA); (b) Factor Analysis (FA); (c) Binary FA (BFA); (d) Binary Matrix Factorization (BMF), X = AY + E with X, A, Y binary; (e) Semi-blind NFA, with partial knowledge on A, x, y. Chapter 2: comparison of model selection methods. Chapter 3: parameterization issue (FA-a vs FA-b, FA-r), prior distributions over Θ. Chapter 4: BQP solvers affect model selection, prior distributions over Θ. Chapter 5: a novel BMF algorithm with automatic model selection, biclustering, protein complex detection. Chapter 6: sparse NFA/BFA for modeling gene regulation. Chapter 7: exome sequencing data analysis.]

We further examine how parameterizations affect model selection performance, based on FA-a and FA-b. We combine FA-a and FA-b into a family of FA parameterizations that have equivalent likelihood functions. Each instance in this family is characterized by an integer r and thus denoted by FA-r for short, with FA-a as one end where r = 0 and FA-b as the other end where r reaches its upper bound m. Between

the two ends, FA-r is a mixture of an FA-b based on r hidden factors and an FA-a based on m − r hidden factors, with r indicating the number of free parameters in the diagonal covariance matrix of the hidden variables. With a Bayesian formulation of FA-r, alternative VB algorithms are derived, and BYY algorithms on FA are also extended to be equipped with priors on the parameters. Several empirical findings have been obtained via extensive experiments.

• First, both BYY and VB perform better on FA-b than on FA-a. Specifically, both BYY and VB reach their best performances on one parameterization FA-m∗, with m∗ being the correct number of hidden factors. This provides a correct calibration, though this m∗ is unknown. On one hand, the performances on FA-r drop sharply as r reduces from m∗ towards FA-a, which means that the contribution of FA-a is negative. On the other hand, the performance of FA-r reduces slightly and slowly as r increases towards FA-b. Moreover, we make a comparison on FA-b with its initial dimension set at r and find a performance similar to that on FA-b. Therefore, FA-b is considerably and reliably superior to FA-a.

• Second, both BYY and VB outperform AIC, BIC, and DNLL, while BYY further outperforms VB, especially on FA-b. Moreover, with FA-a replaced by FA-b, the gain obtained by BYY is clearly higher than the one by VB, while VB still gains something whereas AIC, BIC, and DNLL gain nothing, especially for a finite sample size.

• Third, we also provide a systematic investigation of how each part of the priors contributes to the model selection performance, and find that though the performance of either VB or BYY can be improved with the help of appropriate priors, BYY does not highly depend on the presence of the priors whereas VB does. Moreover, optimizing the hyper-parameters of priors by

BYY further improves the performances, while using VB for this purpose actually deteriorates the performances.

To explore latent binary structures of data, we consider BFA in box (c) of Fig. 1.2, from the perspective of three levels of inverse problems [129], i.e., inverse inference from observation to inner representation, parameter learning, and model selection. Maximizing the BYY harmony functional turns the first level into a Binary Quadratic Programming (BQP) problem. We consider four BQP methods. One is the exact BQP solver by enumeration (denoted as enum for short). The other three are approximate methods, i.e., the greedy method in [69], the cdual method derived from the canonical duality theory [29], and the round method, which relaxes the binary y to a continuous one and rounds the optimal solution back to a binary one [121]. Their BQP optimization performances are ranked as: round < cdual < greedy < enum.

• Extensive experiments show that cdual and round are fast and more effective in discarding extra factors, and lead to much better model selection performances than greedy and enum. Thus, some amount of error in BQP actually provides a helpful learning regularization, with gains in both computational efficiency and model selection performance.

• Moreover, automatic model selection is adopted to save computation compared with the two-stage implementation, by starting from a large enough m and then discarding redundant binary factors during parameter learning. Under BYY, we incorporate prior distributions over the parameters into BFA learning, which play a role similar to Bayesian regularization. With the help of priors, enum and greedy improve in automatic model selection, but are still inferior to cdual and round when the latter are also aided by a priori distributions.

• Finally, we provide a comparison on the performance of automatic model

selection between BYY and VB, as well as BIC in the two-stage implementation as a reference. Such comparisons have been made on factor analysis in [102] and on the Gaussian mixture model in [89], but not yet on BFA. Notice that BFA is a typical problem of independent component analysis (ICA) when the signal sources are binary, and we accordingly simplify the VB-ICA algorithm [23, 22] to obtain a VB algorithm on BFA. Empirical analysis shows that BYY is the best for most configurations, while BIC is more robust than VB. VB is good only when the training sample size N is large and the noise is small, and declines drastically when N reduces and the noise increases. Moreover, applied to the problem of blind binary image separation, the results again show that BYY outperforms VB.

Moreover, when BFA is used for modeling a binary data matrix X, it becomes the BMF model in box (d) of Fig. 1.2. The BMF by Eq.(1.8) factorizes X as a product of two low-rank binary matrices and equivalently performs a bi-clustering task. However, most existing BMF algorithms require a given low rank for the latent matrices. To tackle this problem, we propose a probabilistic BMF model under the BYY learning framework. We also develop a novel learning algorithm called BYY-BMF that can automatically determine the low rank m during the BMF learning. We prove that the proposed algorithm converges after only one iteration for data with non-overlapping clusters. In addition, we prove that our method can infer the exact number of clusters under appropriate initializations. Moreover, the algorithm is extended with two variants for overlapping cases. Experiments show the effectiveness and efficiency of our algorithm. Furthermore, BMF is applied in bioinformatics to detect protein complexes by clustering the proteins which share similar interactions through factorizing the binary adjacency matrix of a PPI network.
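To make the link between factorizing a binary adjacency matrix and clustering concrete, here is a toy sketch. It is not the BYY-BMF algorithm: the membership matrix is given rather than learned, the clusters are assumed non-overlapping, noise is ignored, and self-interactions are assumed to be recorded on the diagonal. All variable names are hypothetical.

```python
import numpy as np

# Binary membership matrix: 5 proteins, 2 non-overlapping complexes.
A = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
Y = A.T                      # second binary factor, here simply the transpose

X = A @ Y                    # X = AY as in Eq.(1.8) with E = 0
labels = A.argmax(axis=1)    # cluster label of each protein read off from A

print(X)
print(labels)
```

In this idealized case the product is itself a binary block-diagonal adjacency matrix, so the rank of the factorization equals the number of complexes.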
BYY-BMF's clustering results do not depend on any parameters or thresholds, unlike the Markov Cluster Algorithm (MCL), which relies

on a so-called inflation parameter. On synthetic PPI networks, the predictions evaluated by the known annotated complexes indicate that BYY-BMF is more robust than MCL in most cases. On real PPI networks from the MIPS [71] and DIP [112] databases, BYY-BMF obtains better balanced prediction accuracies than MCL and a spectral analysis method, while MCL has its own advantages, e.g., good separation values.

Finally, we consider NFA in a general semi-blind learning framework (box (e) of Fig. 1.2) with applications in transcriptional regulatory network analysis and exome sequencing data analysis.

• We modify NCA [64] to model gene transcriptional regulation by NFA [124]. The previous NFA algorithm [123, 124] is extended here as sparse BYY-NFA by considering either or both of a priori connectivity and an a priori sparse distribution q(A) over A. Therefore, the a priori knowledge about the connection topology of the TF-gene regulatory network required by NCA is not necessary for our NFA algorithm. With the incorporated sparsity penalty on the mixing matrix of control strengths, the extra entries are automatically pushed to zero if there is not enough evidence for the existence of the corresponding TF-gene connections. A simulation study demonstrates the effectiveness of sparse BYY-NFA in recovering the hidden dynamics of TF regulatory signals, and in estimating the connectivity topology and control strengths. The sparse BYY-NFA can be applied not only to detect cyclic patterns of transcription factor activities from the yeast cell cycle data [91] and activations of the involved TF regulatory signals during the E. coli carbon source transition from glucose to acetate [55], but also to shut off unreliable or unnecessary TF-gene connections.

• The sparse BYY-NFA can be further modified by Eq.(1.7) to get a sparse BYY-BFA algorithm, which directly models the switching patterns of latent

TF activities, e.g., whether or not a TF is activated. The identification of bimodal activity is useful for identifying the biological variation of TFs whose regulatory dynamics are tightly concentrated around two discrete levels, which are usually corrupted by noise. When applied to the yeast cell cycle data [91] and the E. coli carbon source transition data [55], the binary TF activities reconstructed by the sparse BYY-BFA are consistent with the ups and downs of the continuous ones by NCA.

• We apply the semi-blind NFA learning to the problem of identifying disease associated single nucleotide polymorphisms (SNPs) from exome sequencing data, for which the methods of conventional genome wide association study (GWAS) do not work properly because one usually needs to distinguish true rare variants from sequencing errors in exome sequencing data. Here, a novel method is presented for exome sequencing analysis: first, the information of all SNPs in one exon/gene is encoded by a multidimensional vector y in Eq.(1.1); then an NFA classifier optimized on a training set {(y, x)} is used in prediction; finally, significant exons/genes are selected according to the p-values of Fisher's exact test on the confusion tables of the prediction results. The results on a real data set from an exome sequencing project show that the selected significant genes are consistent in part with published results, and some of them are further verified by experiments to be new significant genes associated with the disease. Therefore, our algorithm is a promising tool for exome sequencing data analysis.

There are some other important related topics that have not been covered by the thesis. For classification or clustering analysis, mixture models are usually involved by introducing a label variable $\ell$, i.e., $q(x) = \sum_{\ell=1}^{k} \pi_\ell\, q(x|\ell)$, and then a local NFA model is considered accordingly when each component $q(x|\ell)$ is formulated by NFA [122].
Actually, NFA can be regarded as the following

(60) constrained Gaussian mixture model (GMM),

$$q(x) = \sum_{j} \int q(x|y, j)\, q(y|j)\, q(j)\, dy = \sum_{j} \alpha_j\, G(x \mid A\mu_j + a_0,\ A\Lambda_j A^T + \Sigma_e), \qquad (1.33)$$

where the summation is taken over $\{j = [j_1, \ldots, j_m] \mid 1 \le j_r \le k_r;\ 1 \le r \le m\}$, and $\alpha_j = \prod_{r=1}^{m} \alpha_{j_r}$, $\mu_j = [\mu_{j_1}, \ldots, \mu_{j_m}]$, $\Lambda_j = \mathrm{diag}[\lambda_{j_1}, \ldots, \lambda_{j_m}]$. However, these topics are out of the scope of this thesis. Readers are referred to [90] for a systematic study of automatic model selection on GMM and for extensions of the efforts on FA in this thesis to mixture models.

1.4 Thesis Organization

The remainder of this thesis is organized according to Fig. 1.2 as follows.

Chapter 2 In this chapter, we provide a comparative investigation of the model selection performances of several classical criteria and recently developed methods, based on the problem of determining the number of factors in the Factor Analysis (FA) model, or equivalently the problem of detecting the number of signals in the signal processing literature. A new empirical analysis tool is presented, based on the contour lines of model selection accuracies under varied signal-to-noise ratio (SNR) and training sample size N. We also provide a theoretical comparison among AIC, HQC, BIC and CAIC by building up a partial order of the relative underestimation tendency.

Chapter 3 In this chapter, we investigate how parameterizations affect model selection performance, an issue that has been ignored or seldom studied in the literature. Based on two maximum-likelihood (ML) equivalent parameterizations of FA, i.e., FA-a and FA-b, we present and explore a family of ML-equivalent FA parameterizations which contains FA-a and FA-b as two

(61) ends. FA is implemented under BYY and VB, as well as AIC and BIC. Several empirical findings have been obtained via extensive experiments.

Chapter 4 In this chapter, we first empirically compare the performances of four BQP methods adopted for inferring a binary code for each observation, an inverse problem nested in BFA learning under BYY. The results suggest that some amount of error in BQP optimization is not always a bad thing but instead provides a helpful regularization for the learning process with automatic model selection. Then, we further improve the performance of BYY on BFA by imposing priors over parameters, and conduct a systematic comparison between BYY and VB for automatic model selection, with BIC in the two-stage implementation as a reference.

Chapter 5 In this chapter, we propose a novel BMF algorithm under the Bayesian Ying-Yang (BYY) harmony learning, and apply it in bioinformatics to detect protein complexes by clustering the proteins which share similar interactions through factorizing the binary adjacency matrix of a PPI network. Also, we prove some theoretical results on the convergence and performance of the proposed BMF algorithm.

Chapter 6 In this chapter, we propose a sparse NFA algorithm for a transcriptional network analysis problem in systems biology, by extending the NCA framework to be applicable when the TF-gene network topology is unreliable or unknown. Moreover, sparse NFA can be implemented in its special case of sparse BFA, which is used to directly model the switch-like patterns of the underlying regulatory dynamics of TFs.

Chapter 7 In this chapter, we present a novel method for exome sequencing data analysis based on the semi-blind NFA learning. Unlike the conventional GWAS to evaluate the significance of individual SNP associated with a dis-