
6.1 Conclusion

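This study developed a new method that classifies proteins effectively whenever dipeptide features are informative for the classification task. The model, called a scoring card, is built simply by counting dipeptide occurrences. The scoring card assigns a weight to each of the 400 dipeptides; a protein is scored with these weights and then classified according to its score.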
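The scoring-card idea can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The weighting rule (mean dipeptide composition in soluble proteins minus that in inclusion-body proteins), the threshold of 0.0, and all function names are assumptions made only for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs of standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the relative frequency of each of the 400 dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {dp: (c / total if total else 0.0) for dp, c in counts.items()}


def build_scoring_card(soluble, insoluble):
    """Assumed rule: weight = mean composition (soluble) - mean composition (insoluble)."""
    def mean_composition(seqs):
        comps = [dipeptide_composition(s) for s in seqs]
        return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}

    pos = mean_composition(soluble)
    neg = mean_composition(insoluble)
    return {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}


def score(seq, card):
    """Score a protein as the card-weighted sum of its dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(card[dp] * comp[dp] for dp in DIPEPTIDES)


def classify(seq, card, threshold=0.0):
    """Assumed decision rule: scores above the threshold are called soluble."""
    return "soluble" if score(seq, card) > threshold else "inclusion body"
```

In practice the threshold would be chosen on training data rather than fixed at zero; the toy sequences used to exercise this sketch are not real proteins.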

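Expressing proteins in an Escherichia coli expression system is a very common and practical technique, but its main difficulty lies in the state of the expressed protein. If the protein is soluble, its structure and function are correct and it can be used in subsequent experiments. If instead the protein forms an inclusion body, it lacks correct biological function and cannot be used in subsequent experiments. Biologists therefore want to know in advance what state a target protein will be in once it is expressed.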

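Many experimental approaches change the expression conditions to obtain soluble protein wherever possible. A common one is to fuse an additional amino acid sequence, such as a GST tag, so that a protein that would otherwise form an inclusion body becomes soluble. The dataset in this study therefore contains both untagged proteins and proteins fused to different tags, six kinds of tag in total, and the goal is to predict the state of the expressed protein from its amino acid sequence.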

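In the course of the study, dipeptides were found to be an important feature for this classification problem, and this finding was used to develop a simple, fast, and easily analyzed method for predicting protein solubility. The method, the scoring card, was then improved further by adding a powerful optimization step. Genetic algorithms are already widely used as optimization methods in many fields, but the optimizer applied to the scoring card here is not an ordinary simple GA: it also incorporates orthogonal arrays, so that the optimization converges quickly, which shortens the running time while improving the fitness value. As a result, the dipeptide scoring card tuned by the IGA performs considerably better than the unoptimized scoring card.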
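To make the role of the orthogonal array concrete, the following Python sketch shows one common form of orthogonal-array crossover. It is a generic illustration, not the IGA used in this study. The two-level array construction, the equal-length segmentation of the chromosome, the maximized `fitness` callable, and all names are assumptions. Two parents are cut into segments (factors), the array selects a small, balanced set of parent combinations to evaluate, and the child takes each segment from the parent with the better main effect.

```python
def two_level_orthogonal_array(k):
    """Build L_{2^k}(2^(2^k - 1)): entry (r, c) is the parity of the bits in r & c."""
    n = 2 ** k
    return [[bin(r & c).count("1") % 2 for c in range(1, n)] for r in range(n)]


def oa_crossover(parent_a, parent_b, fitness, k=3):
    """Return one child assembled from the better parent segment per factor.

    fitness is assumed to be maximized; the chromosome should have at
    least 2^k - 1 genes so that every segment is non-empty.
    """
    oa = two_level_orthogonal_array(k)
    n_factors = len(oa[0])
    length = len(parent_a)
    # Cut points that split the chromosome into n_factors contiguous segments.
    cuts = [i * length // n_factors for i in range(n_factors + 1)]
    parents = (parent_a, parent_b)

    def assemble(levels):
        child = []
        for f, level in enumerate(levels):
            child.extend(parents[level][cuts[f]:cuts[f + 1]])
        return child

    # Evaluate only the 2^k combinations listed in the array,
    # instead of all 2^n_factors possible combinations.
    results = [fitness(assemble(row)) for row in oa]

    best_levels = []
    for f in range(n_factors):
        effect = [0.0, 0.0]  # summed fitness for level 0 and level 1 of factor f
        for row, y in zip(oa, results):
            effect[row[f]] += y
        best_levels.append(0 if effect[0] >= effect[1] else 1)
    return assemble(best_levels)
```

With `k=3` the array has 8 rows and 7 factors, so 8 fitness evaluations stand in for the 128 possible segment combinations; when the fitness is roughly additive across segments, the assembled child tends to combine the better segments of both parents.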

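In this study, the IGA-scoring card outperformed SVM, a machine learning method commonly used for classification problems, achieving higher accuracy. Moreover, compared with black-box methods such as SVM, the scoring card is simpler and more transparent: users can interpret and analyze results intuitively from the score values in the scoring card.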

6.2 Future Work

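The new classification method developed in this study is effective for the classification problem addressed here and can subsequently be applied to other problems. The dipeptide counts of the protein classes differ greatly from one classification problem to another, but as long as dipeptides are informative for the classification, the dipeptide scoring card can be used.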

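Although the properties of individual amino acids have been studied extensively and amino acids have been grouped by many different properties, such groupings are very rare for dipeptides. Many publications report that dipeptides are quite effective for certain classification problems, but they do not go on to analyze why. The scoring card optimized by the IGA shows which dipeptides have greater influence, so the dipeptides with the greatest influence on a given classification problem can be singled out for further analysis.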
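As a small illustration of reading influence off a scoring card, the sketch below ranks dipeptides by the absolute value of their score. Measuring influence by absolute score is an assumption, as are the function name and the representation of the card as a plain dictionary that maps each dipeptide to its score.

```python
def top_dipeptides(card, n=5):
    """Return the n (dipeptide, score) pairs with the largest absolute score."""
    ranked = sorted(card.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]
```

Keeping the sign of each score in the output shows whether a highly ranked dipeptide pushes a protein toward the soluble class or the inclusion-body class.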

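In the future, we also hope to optimize the IGA-scoring card method further and to determine which classification problems it suits, so that more complex machine learning methods are not needed for them.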

