Data Analysis - A Gene Selection Method Using Gene Clustering to Reduce Redundancy in Microarray Data

We apply the proposed gene selection scheme to three popular microarray datasets in the literature: the leukemia data (Golub et al., 1999), the colon cancer data (Alon et al., 1999), and the breast cancer data (Hedenfalk et al., 2001). We combine each of the three ranking methods described earlier with RFE (Guyon et al., 2002). For convenience, we call these combinations Schemes Ⅰ, Ⅱ, and Ⅲ, respectively. Furthermore, we compare each scheme with its corresponding ranking method and with RFE.

Because the colon cancer data and the breast cancer data have no separate test set, leave-one-out cross validation (LOOCV) is adopted to evaluate the performance of the methods in this study.

More specifically, each subject in turn is removed from the original dataset, a classifier is trained on the remaining subjects, and the classifier is then tested on the removed subject. The three publicly available datasets have been processed in many different ways by analysts, including experimental design, normalization, outlier elimination, etc. Most of this preprocessing was beyond our control, especially the removal of variation between chips (subjects).
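The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names `loocv_error_rate` and `nearest_centroid` are hypothetical, and the nearest-centroid rule is only a simple stand-in for the classifiers actually evaluated here.

```python
def loocv_error_rate(X, y, fit_predict):
    """Leave-one-out error rate.

    X: list of subjects, each a list of gene expression values.
    y: list of class labels, one per subject.
    fit_predict(X_train, y_train, x_test) -> predicted label.
    """
    errors = 0
    for i in range(len(X)):
        # Remove subject i, train on the rest, test on subject i.
        X_train = [x for j, x in enumerate(X) if j != i]
        y_train = [t for j, t in enumerate(y) if j != i]
        if fit_predict(X_train, y_train, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


def nearest_centroid(X_train, y_train, x_test):
    """Placeholder classifier (an assumption, not the SVM of the study)."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda lab: sqdist(centroids[lab], x_test))
```

Any gene selection step has to be repeated inside the loop, on the training subjects only, so that the removed subject never influences which genes are chosen.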

Rather than transforming the data further to enforce consistency, we merely standardize each gene of the training set to have mean 0 and standard deviation 1 across subjects, so that the genes are comparable with one another.
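A minimal sketch of this per-gene standardization is given below. The function name `standardize_genes` is hypothetical, and two details are assumptions not stated in the text: the sample standard deviation (divisor n - 1) is used, and a gene with zero variance is mapped to 0.

```python
import statistics


def standardize_genes(X_train):
    """Standardize each gene (column) across the training subjects (rows).

    Returns the standardized matrix together with the per-gene means and
    standard deviations computed from the training set.
    """
    columns = list(zip(*X_train))
    means = [statistics.mean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stds = [statistics.stdev(col) for col in columns]
    Z = [
        # Assumption: constant genes (std == 0) are set to 0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in X_train
    ]
    return Z, means, stds
```

Returning the means and standard deviations makes it possible to apply the same transformation to a held-out subject, which is one reasonable way to treat the removed subject in LOOCV.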

We filter out 90% of the genes and cluster the remaining genes into 1 to 30 clusters. In addition, we use the cumulative frequency plot of the values of each ranking criterion as an aid in judging whether clustering only 10% of the genes is adequate.
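The filtering step and the cumulative frequency check can be illustrated as follows. This is a sketch under stated assumptions: the names `top_fraction_indices` and `cumulative_frequency` are hypothetical, larger ranking values are taken to mean more informative genes, and the number of genes kept is rounded up. The clustering step itself is not shown.

```python
import math


def top_fraction_indices(scores, keep_fraction=0.10):
    """Indices of the genes with the largest ranking values.

    Assumption: the count is rounded up; the small offset guards against
    floating-point noise such as 0.1 * 30 == 3.0000000000000004.
    """
    n_keep = math.ceil(keep_fraction * len(scores) - 1e-9)
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return sorted(order[:n_keep])


def cumulative_frequency(scores, threshold):
    """Fraction of genes whose ranking value is at most `threshold`."""
    return sum(s <= threshold for s in scores) / len(scores)
```

Evaluating `cumulative_frequency` over a grid of thresholds gives the curve of a cumulative frequency plot; if the curve is already near 0.9 at a small ranking value, discarding the lowest 90% of genes loses little.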

4.1 Leukemia Data

The gene expression levels of the leukemia data (Golub et al., 1999) were produced by Affymetrix high-density oligonucleotide microarrays. The data contain two subsets: a training set used to select genes and build the classifier, and an independent test set used to assess the performance of the classifier. The training set consists of 38 bone marrow subjects (27 ALL (acute lymphoblastic leukemia), 11 AML (acute myeloid leukemia)) obtained from acute leukemia patients at the time of diagnosis. The test set has 34 leukemia subjects (20 ALL, 14 AML), including 24 bone marrow and 10 peripheral blood subjects, and its data come from different reference laboratories that used different subject preparation protocols. Each dataset contains 7,129 genes. The problem of interest is to distinguish between the two types of leukemia, ALL and AML. We pool the training set and the test set (72 subjects in total) and perform LOOCV on the pooled data. The following are some results:

● Figure 4 displays the leave-one-out error rates of RFE and the three ranking methods. RFE is clearly better than all three ranking methods when the number of selected genes is small, and the three ranking methods give similar results.

● Figures 5-7 give, respectively, the cumulative frequency plots of the three different ranking values of all subjects. Filtering out 90% of the genes seems plausible because those 90% of genes have relatively small ranking values.

● After clustering, we select about 10% of the genes from each cluster to form a set of about 70 candidate genes, and SVM-RFE is applied to this set. Figures 8-10 show the leave-one-out error rates of schemes Ⅰ-Ⅲ (solid line), respectively, for 1-50 selected genes only, with one subplot for each number of clusters. In addition to the results of our gene selection scheme, each subplot also shows the leave-one-out error rates of RFE (light solid line) and of the corresponding ranking method (dashed line).

We note that in all three schemes the error rate is always largest when the number of genes is reduced to 1. Nevertheless, our schemes do perform better than the three ranking methods. Among the three schemes, scheme Ⅱ performs best and is almost as good as RFE.
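The construction of the candidate set, taking a fixed share of genes from every cluster before SVM-RFE is run, can be sketched as below. The name `select_from_clusters` is hypothetical, and two choices are assumptions rather than details given in this section: within each cluster the genes with the largest ranking values are taken, and at least one gene is kept per cluster.

```python
import math
from collections import defaultdict


def select_from_clusters(cluster_labels, scores, fraction=0.10):
    """Pick the top-ranked `fraction` of genes inside every cluster.

    cluster_labels[g] is the cluster of gene g; scores[g] is its ranking value.
    Returns the sorted indices of the candidate genes.
    """
    members = defaultdict(list)
    for gene, label in enumerate(cluster_labels):
        members[label].append(gene)

    selected = []
    for label in sorted(members):
        genes = sorted(members[label], key=lambda g: scores[g], reverse=True)
        # Assumption: round up, and keep at least one gene per cluster.
        n_take = max(1, math.ceil(fraction * len(genes) - 1e-9))
        selected.extend(genes[:n_take])
    return sorted(selected)
```

With roughly 710 filtered genes and `fraction=0.10`, this yields a candidate set of about 70 genes, matching the size quoted for the leukemia data. Because every cluster contributes, highly correlated genes from one cluster cannot crowd out the others.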

4.2 Colon Cancer Data

The colon cancer data (Alon et al., 1999) were also produced by Affymetrix oligonucleotide arrays. After pre-processing, the data set contains the expression of the 2,000 genes with highest minimal intensity across the 62 tissues. The 62 tissues include 22 normal and 40 colon cancer tissues.

● Figure 11 displays the leave-one-out error rates of RFE and the three ranking methods. Although all curves are fairly flat around the value 0.2, it can still be seen that all three ranking methods perform better than RFE.

● Figures 12-14 give, respectively, the cumulative frequency plots of the three different ranking values of all subjects. These ranking values are clearly smaller than those of the leukemia data. Filtering out 90% of the genes also seems acceptable.

However, in order to avoid leaving out informative genes, we take the 300 top-ranked genes for clustering in the three schemes.

● After that, we select 20% of the genes from each cluster so that the gene set contains about 60 genes. The leave-one-out error rates of schemes Ⅰ-Ⅲ (solid line) on this subset for different numbers of clusters are plotted in Figures 15-17, respectively. Each subplot also shows the results of RFE (light solid line) and of the corresponding ranking method (dashed line).

For all three schemes and for each number of clusters, the curve of our method is still flat but slightly higher than those of the other two methods. This is probably because the ranking methods themselves perform better than RFE on this dataset. It is also notable that all curves are fairly flat in the number of selected genes; that is, increasing the number of selected genes does not yield better results.

4.3 Breast Cancer Data

The breast cancer data (Hedenfalk et al., 2001) were produced by a cDNA microarray technique, which differs from the Affymetrix oligonucleotide microarrays. This technique is

in this dataset. Each tissue belongs to one of three breast cancer groups: BRCA1, BRCA2, and Sporadic. There are 7 BRCA1, 8 BRCA2, and 7 Sporadic tissues. We treat BRCA1 as one class and pool BRCA2 and Sporadic into the other class.
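The pooling of the three groups into two classes amounts to a simple relabeling, illustrated here. The function name `pool_classes` and the 1/0 label coding are hypothetical choices made for the example; only the group counts come from the text.

```python
from collections import Counter


def pool_classes(groups, positive="BRCA1"):
    """Label the `positive` group as 1 and every other group as 0."""
    return [1 if g == positive else 0 for g in groups]


# Group sizes as reported above: 7 BRCA1, 8 BRCA2, 7 Sporadic.
groups = ["BRCA1"] * 7 + ["BRCA2"] * 8 + ["Sporadic"] * 7
labels = pool_classes(groups)
counts = Counter(labels)
print(counts[1], counts[0])  # prints: 7 15
```

The resulting two-class problem is unbalanced, with 7 tissues in one class and 15 in the other, out of 22 tissues.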

● Figure 18 displays the leave-one-out error rates of RFE and the three ranking methods. We observe that RFE is better than the ranking methods, especially when the number of selected genes is less than 20.

● Figures 19-21 give, respectively, the cumulative frequency plots of the three different ranking values of all subjects. Filtering out 90% of genes is still acceptable.

● After clustering, we select 20% of the genes from each cluster so that the gene set contains about 60 genes, and SVM-RFE is applied to this gene set. Figures 22-24 show the leave-one-out error rates of schemes Ⅰ-Ⅲ (solid line), respectively.

Each subplot also shows the results of RFE (light solid line) and of the corresponding ranking method (dashed line).

For all three schemes, our method clearly has a smaller error rate than the corresponding ranking method when the number of genes is less than 20. However, our three schemes perform poorly when the number of genes is larger than 20. Scheme Ⅲ (Figure 24) performs slightly better than the others (Figures 22-23).

We repeat the experiment but select around 100 candidate genes from the clusters. SVM-RFE is applied to this new subset, and the results are shown in Figures 25-27. These results are better than those of the preceding case. When the number of genes is larger than 20, scheme Ⅲ (Figure 27) performs better than the three ranking methods. When the number of selected genes is between 15 and 20, our schemes always perform better than RFE.
