Conclusion - Imbalanced Learning Based on Cluster Ensemble Techniques Applied to Predicting the Pathogenicity of Non-coding Variants

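This study developed CE-SMURF, a machine learning framework built on cluster-ensemble sampling and the Hyper-ensemble method, and applied it to predicting pathogenic variants in non-coding regions. While tuning the sampling parameters of CE-SMURF to obtain an optimized model, we found that using CE-Under alone gave the best performance. This does not by itself rule out a role for CE-SMOTE: on other training sets, combining CE-SMOTE with CE-Under may perform better. Among the existing methods that rely on sampling techniques, CE-SMURF achieved higher scores on both the ROC and the PRC metrics, which indicates that cluster-ensemble sampling and Hyper-ensemble effectively mitigate the limitations that general machine learning algorithms face when learning from imbalanced datasets. Cluster-ensemble sampling balances the numbers of positive and negative samples while limiting the distortion of the data's characteristics, and Hyper-ensemble averages the outputs of multiple Random Forest classifiers to obtain better predictions than a single Random Forest classifier.

CE-SMURF is also less sensitive to the degree of imbalance in the training set: as the imbalance increases, training performance declines only slightly. Notably, although training performance trends downward, prediction performance on the test set rises substantially. In addition, this study found that removing potentially erroneous pathogenic variants from the databases and training on higher-confidence pathogenic variants increases the separation between positive and negative samples, which in turn improves prediction accuracy.

In future work, newer sampling techniques or ensemble methods could be used to improve parts of the CE-SMURF framework. As more clinical data are released, the numbers of both positive and negative samples will inevitably grow, and carefully selecting higher-confidence samples should then yield better performance in predicting pathogenic variants.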
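The partition-and-average idea behind a hyper-ensemble can be illustrated with a minimal sketch. This is not the CE-SMURF implementation: plain random undersampling stands in for CE-Under, no oversampling step (CE-SMOTE) is included, and a toy nearest-centroid scorer stands in for the Random Forest base learner so the example needs only the Python standard library. All names (`CentroidScorer`, `train_hyper_ensemble`, `n_parts`, `ratio`) and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidScorer:
    """Toy base learner; an assumption standing in for a Random Forest."""

    def fit(self, pos, neg):
        self.c_pos = centroid(pos)
        self.c_neg = centroid(neg)
        return self

    def score(self, x):
        # Score in [0, 1]; higher means closer to the positive centroid.
        d_pos = sq_dist(x, self.c_pos)
        d_neg = sq_dist(x, self.c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total


def train_hyper_ensemble(pos, neg, n_parts=5, ratio=2, seed=0):
    """Train one base learner per partition of the majority (negative) class."""
    rng = random.Random(seed)
    shuffled = neg[:]
    rng.shuffle(shuffled)
    # Split the negatives into n_parts disjoint partitions.
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    models = []
    for part in parts:
        # Random undersampling (stand-in for CE-Under): keep at most
        # ratio * len(pos) negatives from this partition.
        k = min(len(part), ratio * len(pos))
        sampled = rng.sample(part, k)
        # Every base learner sees all positives plus its own negatives.
        models.append(CentroidScorer().fit(pos, sampled))
    return models


def predict(models, x):
    """Hyper-ensemble prediction: average the base learners' scores."""
    return mean(m.score(x) for m in models)


# Synthetic, heavily imbalanced data (1:100), for illustration only.
data_rng = random.Random(1)
pos = [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(20)]
neg = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(2000)]

models = train_hyper_ensemble(pos, neg)
print(predict(models, [2.0, 2.0]), predict(models, [0.0, 0.0]))
```

Because each base learner is trained on a different, roughly balanced subset, no single model is swamped by the majority class, and averaging the scores reduces the variance introduced by discarding negatives in any one partition.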

Appendix 1. Detailed features within each category

TF binding

JUND, SP1, FOSL2, HNF4A, EP300, FOXA2, TCF12, TBP, HDAC2, HEY1, FOXA1, HNF4G, GATA1, SIN3A, GTF2F1, MYC, TCF7L2, CHD2, TAF1, STAT1, BCLAF1, MAX, CEBPB, MXI1, BATF, RDBP, BCL3, E2F4, POU2F2, SLC22A2, HMGN3, PAX5, YY1, NFKB1, NR3C1, USF1, STAT3, GATA2, TFAP2C, BHLHE40, TAL1, HSF1, TFAP2A, ELF1, GTF2B, USF2, FOS, CCNT2, E2F6, IRF4, CTCF, E2F1, ZEB1, STAT2, REST, SREBF2, MEF2A, SMARCB1, EGR1, RXRA, SPI1, ELK4, EBF1, PBX3, RFX5, BRCA1, SMC3, SMARCA4, SREBF1, NR2C2, TRIM28, TAF7, NFYA, RAD21, SRF, ZBTB7A, IRF1, SIRT6, NFE2, ZNF263, THAP1, CTBP2, MEF2_complex, GTF3C2, ATF3, BCL11A, BDP1, BRF1, BRF2, CTCFL, ERALPHAA, ESRRA, ETS1, ERALPHAA, FAM48A, FOSL1, GABPA, GATA3, HDAC8, IRF3, JUN, JUNB, KAT2A, MAFF, MAFK, NANOG, NFYB, NR4A1, NRF1, POU5F1, PPARGC1A, PRDM1, SETDB1, SIX5, SMARCC1, SMARCC2, SP2, SUZ12, WRNIP1, XRCC4, ZBTB33, ZNF143, ZNF274, ZZZ3, bound motif, pwm

Histone modifications

H3K4me3, H3K4me2, H3K9ac, H2AFZ, H3K4me1, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H3K9me3, H3K9me1, H4K20me1

Open chromatin

DNase, FAIRE, dnase_fps

RNA polymerase binding

POLR2A, POLR2A_elongating, POLR3A

CpG islands

cpg_island

Genome segmentation

TSS, TRAN, ENH, WEAK_ENH, CTCF_REG, TSS_FLANK, REP

Human variation

avg_daf, avg_het

Genic context

EXON, INTRON, CDS, UTR’5, UTR’3, DONOR, ACCEPTOR, START, STOP, tss_dist, ss_dist, GC, in_cpg

Sequence context

seq_A, seq_C, seq_G, seq_T, repeat
