[Figure 18: Steps involved in the computation of GDI. Flowchart panels: normalization of the expression values of each gene across samples to the range 0 (min) to 1 (max); computation of the mean and standard deviation of each gene in each class (μ_ij and σ_ij for gene i in class j, classes 1 to k); sorting of the mean values; computation of GDI for dominant genes; computation of GDI for dormant genes; finding a list of dominant/dormant genes for each class by sorting genes according to the GDI values.]

1.4 Train classifier(s), C, using XTR considering all or part of the genes in SG.

1.5 Evaluate classifier(s), C, on the test set XTS.

2. Classifier evaluation: Summarize performance of the classifiers over the 100 outer level trials.

In our investigation, we have used m = 5 in Step 1.2.2 and Step 1.3. In Step 1.4 we have used six kinds of classifiers for comparison (three of them are used in [8]): the Nearest Mean Classifier (NMC), the Nearest Neighbor Classifier (NNC), and four kinds of Support Vector Machine (SVM) classifiers. The adopted SVM classifiers are the one-versus-one SVM with linear kernel (OVO.SVM-L), the one-versus-one SVM with Gaussian kernel (also called the Radial Basis Function kernel; OVO.SVM-R), the one-versus-all SVM with linear kernel (OVA.SVM-L), and the one-versus-all SVM with Gaussian kernel (OVA.SVM-R). Note that, of the SVM variants, only OVA.SVM-L was used in [8]. We have implemented the NMC and NNC classifiers ourselves; for the application of SVMs to multi-class problems, we have used the e1071 library of R (http://www.r-project.org), which is based on LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). For SVMs, the training data are further randomly split into two equal parts (training and validation) to determine the optimal hyper-parameters. The optimal hyper-parameters are then used to design SVM classifiers with the training data, and their performance is evaluated on the test data. For C (the regularization constant), we use four choices {1, 10, 100, 1000}, and for the spread of the Gaussian kernel, γ, we consider eight choices {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}.
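The hyper-parameter search described above can be sketched as follows. This is only an illustration in Python, not the R/e1071 code used in this work: the function names, the seed, and the `validation_accuracy` callback (which would train an SVM on one half and score it on the other) are hypothetical. For a linear kernel only C matters, so the γ loop would be dropped.

```python
import random
from itertools import product

# Candidate values quoted in the text.
C_GRID = [1, 10, 100, 1000]
GAMMA_GRID = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]


def split_in_half(samples, seed=0):
    """Randomly split the training samples into two equal parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def select_hyperparameters(samples, validation_accuracy, seed=0):
    """Return the (C, gamma) pair with the best validation accuracy.

    validation_accuracy(fit_part, val_part, C, gamma) is a hypothetical
    callback supplied by the caller; it is not part of any library.
    """
    fit_part, val_part = split_in_half(samples, seed)
    return max(
        product(C_GRID, GAMMA_GRID),
        key=lambda cg: validation_accuracy(fit_part, val_part, cg[0], cg[1]),
    )


def toy_accuracy(fit_part, val_part, c, gamma):
    """Stand-in scorer for demonstration only; it peaks at C=10, gamma=0.1."""
    return -abs(c - 10) - abs(gamma - 0.1)


best = select_hyperparameters(list(range(10)), toy_accuracy)
```

The selected pair would then be used to retrain the classifier on the full training set before it is evaluated on the test set.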

Gene dominant and dormant indices (GDI)

As we mentioned in Background, our main contribution is to develop a gene evaluation index, called the "Gene Dominant/Dormant Index (GDI)", to select significant genes for multicategory classification problems. The GDI concept is similar in spirit to the Signal-to-Noise ratio (SNR), broadly adopted for gene selection in two-class problems [2], but the GDI can be applied to multicategory problems. Moreover, GDI further helps to identify dominant and dormant genes as defined next.

Dominant Gene

A gene that is over-expressed in only one of the classes and under-expressed in the remaining classes. Thus a dominant gene is defined with respect to a set of diseases/classes and it has a very strong class specific signature.

Dormant Gene

A gene that is under-expressed in only one of the classes but over-expressed in the remaining classes. Thus a dormant gene is also defined with respect to a set of diseases/classes and it also has a strong class specific signature.

From the above definitions, it is clear that dominant genes, if any, will be good biomarkers because such genes are expected to play active roles in the disease. Finding a dominant gene may not be a difficult task, particularly for a given set of cancers, because usually some genes will be highly expressed for a particular type of cancer. Dormant genes, however, may not always be available in a given set of diseases, as the requirements for dormant genes are harder to satisfy. Both dormant and dominant genes will have high discriminating power. Moreover, one can design a diagnostic system using the dominant genes and then authenticate its decisions using the information available in the dormant genes. This can lead to more reliable diagnostic systems. In the simulation results we demonstrate that, for several multiclass problems, prediction based on dominant or dormant genes selected by the GDI criterion is more accurate than prediction based on two existing gene selection methods for multiple classes, SVM-RFE [8] and MMC-RFE [8]. For ease of understanding, Fig. 18 depicts the steps involved in the computation of GDI, which are explained next.

Normalization

The expression values of each gene are normalized to the range from 0 to 1 across samples. This step preserves the richness in the original expression values for each gene among the samples and helps us to easily visualize the distribution of expression values for the dominant or dormant genes.
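A minimal sketch of this normalization step, assuming plain min-max scaling of one gene's values across samples. The function name, the example values, and the handling of a constant gene are illustrative assumptions and are not specified in the text.

```python
def min_max_normalize(values):
    """Scale one gene's expression values across samples to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a gene that is constant across samples is mapped to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical expression values of one gene across five samples.
gene = [2.0, 4.0, 6.0, 10.0, 3.0]
normalized = min_max_normalize(gene)
```

Because the scaling is monotone, the ordering of the samples for each gene is unchanged; only the range is made comparable across genes.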

Computation of mean and standard deviation

For each gene, the mean and standard deviation of the gene expression values in each class are calculated. Let the mean and standard deviation for gene i in class j be μ_ij and σ_ij, respectively.

Sorting of the mean values

For notational simplicity, to explain the computation of the GDI for gene i, we ignore the index i. We sort μ_j; j = 1, ..., k in descending order. Let the sorted mean values be μ_j(s); j = 1, ..., k. Suppose μ_1(s) is the mean for class m. This means that the gene under consideration is most highly expressed in class m. Similarly, if μ_2(s) corresponds to class r, then, once class m is excluded, the gene has its highest average expression level in class r. Thus, if the gene under consideration has a distinct class specific signature, then μ_1(s) and μ_2(s) must be well separated; if that is not so, then this gene cannot be a dominant gene. Note that, to reach this conclusion, we do not need to look at the mean values corresponding to the other classes. We can do so because we have sorted the class means in descending order.

Computation of GDI for dominant genes

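Now we define the GDI_Dom for the gene under consideration as:

$$\mathrm{GDI}_{Dom} = \frac{\mu_{1(s)} - \mu_{2(s)}}{\sigma_{1(s)} + \sigma_{2(s)}} \qquad (1)$$

where σ_1(s) and σ_2(s) are the standard deviations of the two classes with the highest means.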

As discussed above, the index in Equation 1 can be computed for each gene and then the GDI_Dom values can be sorted in descending order. A higher value of GDI_Dom indicates that the gene is significantly over-expressed in the m-th class compared to the r-th class, and hence even more strongly over-expressed compared to the remaining classes. Thus, it is a dominant gene for class m, i.e., the class represented by 1(s).

Dominant genes, if any exist, will appear at the top of the sorted list. A set of genes can then be selected from this sorted list for further processing. Note that, for a two-class problem, although we do not use the absolute value in the numerator, because of the sorting, Equation 1 is exactly the same as Golub's SNR index [2]. In other words, GDI_Dom can be viewed as a true generalization of Golub's SNR to multiclass problems.
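The computation of GDI_Dom for a single gene can be sketched as follows, under two assumptions not fixed by the text: the population standard deviation (`statistics.pstdev`) is used, and a zero denominator is mapped to infinity. The function name and the example data are illustrative.

```python
from statistics import mean, pstdev


def gdi_dom(values, labels):
    """Return (GDI_Dom, dominant class) for one gene.

    values: normalized expression values of the gene, one per sample.
    labels: class label of each sample.
    """
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)
    # One (mean, std, class) triple per class, sorted by mean, descending.
    stats = sorted(
        ((mean(v), pstdev(v), c) for c, v in by_class.items()),
        key=lambda t: t[0],
        reverse=True,
    )
    (mu1, sd1, top_class), (mu2, sd2, _) = stats[0], stats[1]
    spread = sd1 + sd2
    if spread == 0:
        # Assumption: handling of a zero denominator is not given in the text.
        return (float("inf") if mu1 > mu2 else 0.0), top_class
    return (mu1 - mu2) / spread, top_class


# Hypothetical gene that is highly expressed only in class "A".
values = [0.9, 1.0, 0.8, 0.2, 0.1, 0.3, 0.0, 0.1, 0.2]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
score, dominant_class = gdi_dom(values, labels)
```

Only the two classes with the highest means enter the score; the third class in this example affects neither the numerator nor the denominator.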

Computation of GDI for dormant genes

However, the GDI_Dom in Equation 1 will not be able to find the dormant genes, if any. In order to find the dormant genes we can proceed as follows. If the gene under consideration is a dormant one, then it will be unexpressed for one class but at least moderately expressed for all of the remaining classes. In this case, (μ_k-1(s) - μ_k(s)) should be considerably high, where μ_k(s) is the last value in the sorted sequence; in other words, it is the mean expression level for the class in which the gene under consideration is least expressed. Thus, we define the GDI_Dor for identifying a dormant gene as

$$\mathrm{GDI}_{Dor} = \frac{\mu_{k-1(s)} - \mu_{k(s)}}{\sigma_{k-1(s)} + \sigma_{k(s)}} \qquad (2)$$

Note that Equation 1 uses the class mean values and standard deviations of the top two classes in the sorted list, while Equation 2 uses the class means and standard deviations corresponding to the last two values in the sorted list. Consequently, if GDI_Dor is significantly high for a gene, then this gene is a dormant gene for the class represented by k(s).

It is easy to see that for a two-class problem, GDI_Dor reduces to the SNR of [2]. Thus both GDI_Dom and GDI_Dor can be viewed as generalizations of SNR. We can combine Equations 1 and 2 and write them in a convenient manner as in Equation 3.

$$\mathrm{GDI}_{x} = \frac{\mu_{p(s)} - \mu_{q(s)}}{\sigma_{p(s)} + \sigma_{q(s)}}, \quad x \in \{Dom, Dor\} \qquad (3)$$

In Equation 3, when x = Dom, p and q correspond to the top two classes, respectively, in the sorted list, and when x = Dor, p and q correspond to the last two classes in the sorted list, respectively.
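A compact sketch of the combined index follows, in which a single function switches between the top two and the last two entries of the sorted list. As before, the use of the population standard deviation and the treatment of a zero denominator are assumptions, and all names and data are illustrative.

```python
from statistics import mean, pstdev


def gdi(values, labels, kind="Dom"):
    """Return (GDI_x, associated class) for one gene, with x = kind.

    kind="Dom": p and q are the top two classes of the sorted means.
    kind="Dor": p and q are the last two classes of the sorted means.
    """
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)
    stats = sorted(
        ((mean(v), pstdev(v), c) for c, v in by_class.items()),
        key=lambda t: t[0],
        reverse=True,
    )
    p, q = (stats[0], stats[1]) if kind == "Dom" else (stats[-2], stats[-1])
    spread = p[1] + q[1]
    if spread > 0:
        score = (p[0] - q[0]) / spread
    else:
        # Assumption: handling of a zero denominator is not given in the text.
        score = float("inf") if p[0] > q[0] else 0.0
    # A dominant gene is tied to class p (= 1(s)); a dormant one to q (= k(s)).
    return score, (p[2] if kind == "Dom" else q[2])


# Hypothetical gene that is expressed in classes "A" and "B" but not in "C".
values = [0.8, 0.9, 1.0, 0.7, 0.8, 0.9, 0.0, 0.1, 0.2]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
dom_score, dom_class = gdi(values, labels, "Dom")
dor_score, dor_class = gdi(values, labels, "Dor")

# With only two classes, both indices coincide, as stated in the text.
two_class_dom = gdi(values[:6], labels[:6], "Dom")[0]
two_class_dor = gdi(values[:6], labels[:6], "Dor")[0]
```

For this example the dormant index is much larger than the dominant one, which marks the gene as dormant for class "C" rather than dominant for class "A".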

We want to emphasize that a dominant gene is dominant for a class with respect to the given set of classes/groups under consideration. For example, given the SRBCT group, a gene may be dominant for the Neuroblastoma class, implying that this gene is highly expressed for the Neuroblastoma cases but unexpressed for the other three types of childhood cancers. Now if we augment the set of four childhood cancers by one more type, then this particular gene may not remain dominant with respect to the group of five childhood cancers. The same holds for dormant genes.

Finding a list of dominant/dormant genes for each class

After calculating the GDI_Dom values of all genes, a list of dominant genes for each class can be obtained as follows.

For each gene, the GDI_Dom is associated with the class represented by 1(s); in other words, it is associated with the class corresponding to the top element in the sorted list. In this way, every gene is associated with a class and a value of dominancy as expressed by GDI_Dom. We can now sort the genes associated with a particular class according to the GDI_Dom values. In this way we get a sorted list for each class. We can now select useful genes for a class from the top of the list. Clearly, when selecting the dominant genes, the higher the GDI_Dom, the more dominant the gene is. A similar procedure can be applied to generate a list of dormant genes for each class using the GDI_Dor values.
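The construction of the per-class lists can be sketched as below. The input format (a mapping from gene name to its associated class and GDI_Dom value), the function name, and the gene and class names are hypothetical choices made for illustration.

```python
from collections import defaultdict


def rank_genes_per_class(gene_scores):
    """Group genes by their associated class and sort each group by GDI.

    gene_scores maps a gene name to (associated class, GDI value).
    Returns a dict: class -> list of (gene, GDI), highest GDI first.
    """
    per_class = defaultdict(list)
    for gene, (cls, score) in gene_scores.items():
        per_class[cls].append((gene, score))
    return {
        cls: sorted(items, key=lambda gs: gs[1], reverse=True)
        for cls, items in per_class.items()
    }


# Hypothetical genes, classes, and GDI_Dom values.
gene_scores = {
    "g1": ("class_1", 3.2),
    "g2": ("class_2", 1.1),
    "g3": ("class_1", 4.5),
    "g4": ("class_2", 2.0),
}
ranked = rank_genes_per_class(gene_scores)
top_m = {cls: genes[:1] for cls, genes in ranked.items()}  # here m = 1
```

The same function applies unchanged to dormant genes if the GDI_Dor values and their associated classes are supplied instead.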

Gene selection strategy

If we use several dominant (or dormant, or both kinds of) genes from each class, ranked according to the GDI values, to design diagnostic systems, we expect to get sufficient discriminating power for all classes in multi-class discrimination problems. But since in each resampling experiment we may get a different set of dominant (dormant) genes for a class, it would be better to aggregate the output of several resampling experiments. Different strategies are possible for this. Next we propose one such strategy:

Frequency-based method

The gene selection scheme is displayed in Algorithm Gene Selection. It proceeds as follows. In each of the 100 trials, we select the top m (= 5) dominant (dormant) genes for each class to compute the frequency with which each such gene appears as a candidate gene for a class. A good dominant (dormant) gene is likely to appear more frequently.

In order to find the set of interesting (marker) genes for each class we select the top five most frequently occurring genes. However, some classes may have more than five genes with strong class specific signatures. If that happens, we should include those genes also, if our goal is to find the set of interesting (marker) genes and not just to design a classifier. Hence, in addition to the top five genes, if there are other genes with a frequency of appearance of 50 or more (in 100 trials), we also consider those genes important. In this manner we find a set of genes that may be biologically interesting. But not all of these genes may be necessary for designing a classifier, because for a k-class discrimination, even a set of fewer than k good genes may be adequate. Tables 1, 2, 3, 4 are generated by this scheme.

Algorithm Gene Selection

1. Repeat 100 times.

1.1 Partition the data set X into XTR and XTS such that |XTR| = (p/q)|X| and XTS = X - XTR, with p < q; here we use p = 2 and q = 3, so that |XTR| = (2/3)|X|.

1.2 Use XTR to compute GDIs for each gene.

1.3 Find the set of best m dominant and m dormant genes for each class.

1.4 Note the frequency of the selected genes.

2. Generate the set of dominant (dormant) genes with the m most frequently occurring dominant (dormant) genes from each class.
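The aggregation over trials in Algorithm Gene Selection, together with the additional rule of keeping genes that appear in 50 or more of the 100 trials, can be sketched as follows. The per-trial input format, the function name, and the gene and class names are hypothetical; the per-trial selection of the top m genes is assumed to have been done already.

```python
from collections import Counter, defaultdict


def frequency_select(trials, m=5, min_count=50):
    """Aggregate per-trial gene lists into a final gene set for each class.

    trials: one dict per trial, mapping class -> genes selected in that trial.
    Keeps the m most frequent genes of each class, plus any further gene
    that was selected in at least min_count trials.
    """
    counts = defaultdict(Counter)
    for trial in trials:
        for cls, genes in trial.items():
            counts[cls].update(genes)
    selected = {}
    for cls, counter in counts.items():
        ranked = counter.most_common()  # (gene, frequency), most frequent first
        top = [gene for gene, _ in ranked[:m]]
        extra = [gene for gene, count in ranked[m:] if count >= min_count]
        selected[cls] = top + extra
    return selected


# Hypothetical outcome of 100 trials for a single class: gene "a" is always
# selected, gene "b" in 60 trials, and gene "c" in the remaining 40.
trials = [{"class_1": ["a", "b" if t < 60 else "c"]} for t in range(100)]
selected = frequency_select(trials, m=1, min_count=50)
```

With m = 1 in this toy example, gene "a" is kept as the most frequent gene and gene "b" is added because it reaches the frequency threshold, while gene "c" is dropped.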

Permutation test to assess statistical significance of GDI indices

To assess the statistical significance of the GDI indices associated with the identified dominant and dormant genes, a permutation test has been performed. The procedure followed is summarized below. Both unadjusted p-values and q-values adjusted for multiple comparisons are computed. Let G be the total number of genes and S be the total number of sample points.

(1) Given an expression matrix D (x_gs is the expression intensity of gene g in sample unit s; 1 ≤ g ≤ G, 1 ≤ s ≤ S) with class labels (y_s, 1 ≤ s ≤ S), we compute the gene dominant index GDI_Dom, denoted m_g, and the gene dormant index GDI_Dor, denoted r_g, for each gene g.

(2) Randomly permute the class labels y_s B times. In the b-th permutation (1 ≤ b ≤ B), compute the new GDI_Dom and the new GDI_Dor for gene g using the expression matrix D and the permuted labels.

(3) The p-value of the observed dominant GDI, m_g, for gene g is

where I(·) is an indicator function that takes the value one when true and zero otherwise. Similarly, the p-value of the observed dormant GDI, r_g, is

(4) To account for the multiple tests being performed on the G genes, q-values of the observed m_g and r_g are calculated as
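As an illustration of steps (1) to (3) only, the sketch below estimates an unadjusted permutation p-value for the GDI_Dom of a single gene. It assumes the common empirical estimate (number of permuted values at least as large as the observed one, plus one, divided by B + 1), which may differ from the exact estimator used in this work; the q-value adjustment of step (4) is not covered. The population standard deviation, the zero-denominator rule, and all names and data are likewise assumptions.

```python
import random
from statistics import mean, pstdev


def gdi_dom(values, labels):
    """GDI_Dom of one gene: gap of the top two class means over their stds."""
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)
    stats = sorted(((mean(v), pstdev(v)) for v in by_class.values()), reverse=True)
    (mu1, sd1), (mu2, sd2) = stats[0], stats[1]
    spread = sd1 + sd2
    if spread == 0:
        # Assumption: handling of a zero denominator is not given in the text.
        return float("inf") if mu1 > mu2 else 0.0
    return (mu1 - mu2) / spread


def permutation_p_value(values, labels, n_perm=200, seed=0):
    """Empirical p-value of the observed GDI_Dom under label permutation."""
    rng = random.Random(seed)
    observed = gdi_dom(values, labels)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # permute class labels, keep expression values
        if gdi_dom(values, permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical gene that is highly expressed only in class "A".
values = [0.9, 1.0, 0.8, 0.2, 0.1, 0.3, 0.0, 0.1, 0.2]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
p_value = permutation_p_value(values, labels, n_perm=200, seed=1)
```

The same loop applies to GDI_Dor by replacing the index function; the class sizes are preserved because only the order of the labels is shuffled.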

Authors' contributions

All authors contributed significantly to the investigation.

YST, CTL, IFC, and NRP together formulated the new indices. YST and IFC implemented the algorithms and conducted the experiments. GCT designed and carried out the statistical experiment. IFC and NRP led and coordinated the investigation. CTL, IFC, and NRP wrote the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

The work is supported in part by the National Science Council, Taiwan, under Contract No. NSC 97-2221-E-010-011, in part by the Yen Tjing Ling Medical Foundation, Taiwan, under Contract No. CI-97-7, and in part by the "Aiming for the Top University Plan (ATU)" of the National Chiao-Tung University and the Ministry of Education (MOE), Taiwan, under Contract No. 97W806.

References

1. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 1998, 95:14863-14868.

2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.

3. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci 1999, 96:2907-2912.

4. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97:77-87.

5. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.

6. Wang Y, Makedon FS, Ford JC, Pearlman J: HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 2005, 21:1530-1537.

7. Kim KJ, Cho SB: Ensemble classifiers based on correlation analysis for DNA microarray classification. Neurocomputing 2006, 70:187-199.

8. Niijima S, Kuhara S: Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE. BMC Bioinformatics 2006, 7:543.

9. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46:389-422.

10. Pal NR, Aguan K, Sharma A, Amari SI: Discovering biomarkers from gene expression data for predicting cancer subgroups using neural networks and relational fuzzy clustering. BMC Bioinformatics 2007, 8:5.

11. Pavlidis P, Noble WS: Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001, 2:Research0042.

12. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al.: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci 2001, 98:15149-15154.

13. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7:673-679.

14. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: Translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet 2002, 30:41-47.

15. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, et al.: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415:436-442.

16. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, et al.: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci 2001, 98:13790-13795.

17. Morris JS, Yin G, Baggerly K, Wu C, Zhang L: Pooling information across different studies and oligonucleotide chip types to identify prognostic genes for lung cancer. In Methods of Microarray Data Analysis IV. Edited by: Shoemaker JS, Lin SM. New York: Springer; 2005:51-66.

18. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4:177-183.

19. Hong H, Tong W, Perkins R, Fang H, Xie Q, Shi L: Multiclass decision forest – a novel pattern recognition method for multiclass classification in microarray data analysis. DNA Cell Biol 2004, 23:685-694.

20. Pasic S, Vujic D, Djuricic S, Jevtic D, Grujic B: Burkitt lymphoma-induced ileocolic intussusception in Wiskott-Aldrich syndrome. J Pediatr Hematol Oncol 2006, 28:48-49.

21. Filipovich AH, Mathur A, Kamat D, Shapiro RS: Primary immunodeficiencies: genetic risk factors for lymphoma. Cancer Res 1992:5465s-5467s.

22. Sullivan KE, Mullen CA, Blaese RM, Winkelstein JA: A multiinstitutional survey of the Wiskott Aldrich syndrome. J Pediatr 1994, 125:876-885.

23. Ochs HD: The Wiskott-Aldrich syndrome. Clin Rev Allergy Immunol 2001, 20:61-86.

24. Palenzuela G, Bernard F, Gardiner Q, Mondain M: Malignant B cell non-Hodgkin's lymphoma of the larynx in children with Wiskott Aldrich syndrome. Int J Pediatr Otorhinolaryngol 2003, 67:989-993.

25. Tse W, Meshinchi S, Alonzo TA, Stirewalt DL, Gerbing RB, Woods WG, Appelbaum FR, Radich JP: Elevated expression of the AF1q gene, an MLL fusion partner, is an independent adverse prognostic factor in pediatric acute myeloid leukemia. Blood 2004, 104:3058-3063.

26. Li DQ, Hou YF, Wu J, Chen Y, Lu JS, Di GH, Ou ZL, Shen ZZ, Ding J, Shao ZM: Gene expression profile analysis of an isogenic tumor metastasis model reveals a functional role for oncogene AF1Q in breast cancer metastasis. Eur J Cancer 2006, 42:3274-3286.

27. Eswarakumar VP, Lax I, Schlessinger J: Cellular signaling by fibroblast growth factor receptors. Cytokine Growth Factor Rev 2005, 16:139-149.

28. Qian ZR, Sano T, Asa SL, Yamada S, Horiguchi H, Tashiro T, Li CC, Hirokawa M, Kovacs K, Ezzat S: Cytoplasmic expression of fibroblast growth factor receptor-4 in human pituitary adenomas: relation to tumor type, size, proliferation, and invasiveness. J Clin Endocrinol Metab 2004, 89:1904-1911.

29. Wang J, Stockton DW, Ittmann M: The fibroblast growth factor receptor-4 arg388 allele is associated with prostate cancer initiation and progression. Clin Cancer Res 2004, 10:6169-6178.

30. Ezzat S, Huang P, Dackiw A, Asa SL: Dual inhibition of RET and FGFR4 restrains medullary thyroid cancer cell growth. Clin Cancer Res 2005, 11:1336-1341.

31. Nakamura N, Iijima T, Mase K, Furuya S, Kano J, Morishita Y, Noguchi M: Phenotypic differences of proliferating fibroblasts in the stroma of lung adenocarcinoma and normal bronchus tissue. Cancer Sci 2004, 95:226-232.

32. Rimokh R, Gadoux M, Bertheas MF, Berger F, Garoscio M, Deléage G, Germain D, Magaud JP: FVT-1, a novel human transcription unit affected by variant translocation t(2;18)(p11;q21) of follicular lymphoma. Blood 1993, 81:136-142.

33. Fiucci G, Ravid D, Reich R, Liscovitch M: Caveolin-1 inhibits anchorage-independent growth, anoikis and invasiveness in MCF-7 human breast cancer cells. Oncogene 2002, 21:2365-2375.

34. Engelman JA, Wykoff CC, Yasuhara S, Song KS, Okamoto T, Lisanti MP: Recombinant expression of caveolin-1 in oncogenically transformed cells abrogates anchorage-independent growth. J Biol Chem 1997, 272:16374-16381.

35. Lee SW, Reimer CL, Oh P, Campbell DB, Schnitzer JE: Tumor cell growth inhibition by caveolin re-expression in human breast cancer cells. Oncogene 1998, 16:1391-1397.

36. Hurlstone AF, Reid G, Reeves JR, Fraser J, Strathdee G, Rahilly M, Parkinson EK, Black DM: Analysis of the caveolin-1 gene at human
