
4.2 Experimental Results of iGEC

4.2.2 Experiment 1-Comparison between iGEC and the Vinterbo’s Fuzzy

For comparison, we conducted two evaluations of Vinterbo’s method using different numbers of pre-selected genes. One uses 200 pre-selected genes (V200), the same number as in [13]. The other uses 15 genes (V15), the same number as the proposed method. Table 4.2 shows the statistical results (mean and standard deviation) of iGEC and Vinterbo’s classifier in terms of training accuracy, test accuracy, number of rules, number of genes, and rule number per class. The results of Vinterbo’s classifier were obtained by running the same program provided by Vinterbo

Table 4.1. The eight data sets from [12].

No. | Data Set | Description | # of classes | # of samples | # of genes | Np | Reference
1 | brain tumor1 | 5 human brain tumor types | 5 | 90 | 5920 | 1185 | [65]
2 | brain tumor2 | 4 malignant glioma types | 4 | 50 | 10367 | 951 | [66]
3 | DLBCL | Diffuse large B-cell lymphomas and follicular lymphomas | 2 | 77 | 5469 | 483 | [67]
4 | leukemia1 | Acute myelogenous leukemia (AML), acute lymphoblastic leukemia (ALL) B-cell, and ALL T-cell | 3 | 72 | 5327 | 717 | [68]
5 | leukemia2 | AML, ALL, and mixed-lineage leukemia (MLL) | 3 | 72 | 11225 | 717 | [69]
6 | lung cancer | 4 lung cancer types and normal tissues | 5 | 203 | 12600 | 1185 | [70]
7 | prostate tumor | Prostate tumor and normal tissue | 2 | 102 | 10509 | 483 | [71]
8 | SRBCT | Small, round blue cell tumors of childhood | 4 | 83 | 2308 | 951 | [72]

Figure 4.4. The box plots of the statistical results. (a) training accuracy, (b) test accuracy, (c) number of rules, and (d) number of used genes.

et al. [13]. The same data with the same partitions are used for iGEC, V200, and V15. Figure 4.4 presents the experimental results using box plots. Figures 4.5(a) and 4.5(b) show three-dimensional scatter plots of test accuracy, rule number, and gene number for the lung cancer and SRBCT data sets, respectively.

From Table 4.2, we can observe that iGEC performs better than Vinterbo’s classifier using 200 candidate genes (V200) on all five measures: TrCR (97.1% vs. 81.5%), TeCR (87.9% vs. 81.2%), Nr (3.9 vs. 4.9), Nf (5.0 vs. 7.2), and Nr/C (1.1 vs. 1.4).

Note that V200 is better than V15 on average but uses more candidate genes and more computation time. Moreover, the V200 classifiers compare favorably with those of a logistic regression model, which is one of the frequently used classification methods in the biomedical domain [13].
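As a quick arithmetic check, the Mean rows for the two accuracy measures can be recomputed from the per-data-set entries of Table 4.2. The Python sketch below is illustrative only: the dictionary `TABLE_4_2` and the helper `column_means` are names introduced here, and the iGEC entries use only the reported means, not the ± spreads. The recomputed values should agree with the Mean rows up to rounding.

```python
# Values transcribed from Table 4.2, one entry per data set in table order:
# brain tumor1, brain tumor2, DLBCL, leukemia1, leukemia2, lung cancer,
# prostate tumor, SRBCT. For iGEC only the reported means are used.
TABLE_4_2 = {
    "iGEC": {
        "TrCR": [92.4, 97.0, 98.5, 99.7, 98.7, 92.7, 97.9, 99.8],
        "TeCR": [88.7, 72.4, 91.2, 94.0, 85.3, 88.0, 90.9, 92.3],
    },
    "V200": {
        "TrCR": [80.85, 60.00, 85.91, 90.15, 81.97, 85.35, 81.50, 86.36],
        "TeCR": [81.25, 60.00, 85.00, 92.00, 76.67, 84.44, 82.00, 88.33],
    },
    "V15": {
        "TrCR": [78.66, 66.60, 84.65, 87.61, 74.70, 81.57, 84.46, 78.44],
        "TeCR": [85.00, 63.33, 78.33, 84.00, 71.67, 82.78, 84.00, 71.67],
    },
}


def column_means(table):
    """Return {method: {measure: mean over the eight data sets}}."""
    return {
        method: {measure: sum(values) / len(values)
                 for measure, values in columns.items()}
        for method, columns in table.items()
    }


means = column_means(TABLE_4_2)
for method, columns in means.items():
    print(f"{method}: TrCR = {columns['TrCR']:.2f}%, TeCR = {columns['TeCR']:.2f}%")
```

The same helper could be extended to Nr, Nf, and Nr/C by adding those columns to the dictionary.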

Figure 4.6 shows an example of iGEC on the leukemia1 data set, where 90% of the samples are used for training and the rest for testing. The classifier has four fuzzy rules using three genes, L05148, U46499, and U05259, with TrCR = 100% and TeCR = 100%. The fuzzy rules

Table 4.2. The statistical results of iGEC and Vinterbo’s classifier on training accuracy (TrCR), test accuracy (TeCR), number of rules (Nr), number of genes (Nf), and rule number per class (Nr/C).

Data Set | Method | TrCR (%) | TeCR (%) | Nr | Nf | Nr/C
brain tumor1 | iGEC | 92.4 ± 0.5 | 88.7 ± 4.0 | 5.0 ± 0.2 | 5.9 ± 0.4 | 1.00
brain tumor1 | V200 | 80.85 | 81.25 | 6.50 | 8.60 | 1.30
brain tumor1 | V15 | 78.66 | 85.00 | 6.00 | 9.20 | 1.20
brain tumor2 | iGEC | 97.0 ± 0.5 | 72.4 ± 9.9 | 4.4 ± 0.2 | 5.5 ± 0.3 | 1.11
brain tumor2 | V200 | 60.00 | 60.00 | 4.00 | 8.30 | 1.00
brain tumor2 | V15 | 66.60 | 63.33 | 5.10 | 6.70 | 1.27
DLBCL | iGEC | 98.5 ± 0.3 | 91.2 ± 2.6 | 2.5 ± 0.1 | 3.7 ± 0.3 | 1.28
DLBCL | V200 | 85.91 | 85.00 | 2.60 | 3.80 | 1.30
DLBCL | V15 | 84.65 | 78.33 | 7.00 | 6.90 | 3.50
leukemia1 | iGEC | 99.7 ± 0.1 | 94.0 ± 2.5 | 3.5 ± 0.1 | 4.1 ± 0.2 | 1.18
leukemia1 | V200 | 90.15 | 92.00 | 5.30 | 7.30 | 1.76
leukemia1 | V15 | 87.61 | 84.00 | 4.90 | 8.10 | 1.63
leukemia2 | iGEC | 98.7 ± 0.2 | 85.3 ± 4.4 | 3.3 ± 0.1 | 4.3 ± 0.3 | 1.12
leukemia2 | V200 | 81.97 | 76.67 | 4.30 | 5.50 | 1.43
leukemia2 | V15 | 74.70 | 71.67 | 3.50 | 4.10 | 1.16
lung cancer | iGEC | 92.7 ± 0.7 | 88.0 ± 1.8 | 5.5 ± 0.3 | 6.9 ± 0.3 | 1.10
lung cancer | V200 | 85.35 | 84.44 | 7.80 | 14.50 | 1.56
lung cancer | V15 | 81.57 | 82.78 | 8.30 | 8.90 | 1.66
prostate tumor | iGEC | 97.9 ± 0.2 | 90.9 ± 2.5 | 2.4 ± 0.1 | 4.1 ± 0.3 | 1.21
prostate tumor | V200 | 81.50 | 82.00 | 3.00 | 3.30 | 1.50
prostate tumor | V15 | 84.46 | 84.00 | 2.90 | 5.10 | 1.45
SRBCT | iGEC | 99.8 ± 0.1 | 92.3 ± 2.7 | 4.3 ± 0.2 | 4.8 ± 0.4 | 1.08
SRBCT | V200 | 86.36 | 88.33 | 5.80 | 6.20 | 1.45
SRBCT | V15 | 78.44 | 71.67 | 5.10 | 10.20 | 1.27
Mean | iGEC | 97.1 | 87.9 | 3.9 | 5.0 | 1.1
Mean | V200 | 81.5 | 81.2 | 4.9 | 7.2 | 1.4
Mean | V15 | 79.6 | 77.6 | 5.4 | 7.4 | 1.6

Figure 4.5. The 3D scatter plots. (a) lung cancer, (b) SRBCT.

Figure 4.6. Fuzzy rules of the data set leukemia1 using 90% samples for training and the rest for test. The training and test accuracies are both 100%. (Columns of the figure: Gene L05148, Gene U46499, Gene U05259, CL, CF.)

are linguistically interpretable as follows:

R1: If L05148 is not up-regulated and U05259 is not down-regulated, then Class “ALL B-Cell” with CF = 0.243;

R2: If L05148 is ALL and U46499 is neutral or up-regulated, then Class “ALL B-Cell” with CF = 0.682;

R3: If L05148 is not down-regulated, U46499 is ALL and U05259 is ALL, then Class “ALL T-Cell” with CF = 0.710;

R4: If L05148 is ALL, U46499 is ALL and U05259 is ALL, then Class “AML” with CF = 0.722.

Here, the membership functions of gene U46499 in R1 and gene U05259 in R2 are “don’t care”, which reduces the rule length. The compact rule base makes it easy to interpret the classification model obtained from gene expression data, and the fuzzy rules can be examined by biomedical researchers. Due to the natural clustering property of gene expression data, each of the classes “ALL T-Cell” and “AML” has one fuzzy rule corresponding to one fuzzy region, while the class “ALL B-Cell” has two overlapping fuzzy regions. Furthermore, the distribution of the samples of each class in the feature space can be read from the corresponding membership functions. The fuzzy rule base determines the class of unknown samples using Eq. 4.3.

To further examine whether these three genes, L05148, U46499, and U05259, make sense

Table 4.3. Selected genes for the leukemia1 data set example. For each gene we counted the number of articles that were retrieved by a PubMed query consisting of the gene name and the string “leukemia”.

Gene | Description | # of References
M11722 | Human terminal transferase mRNA | 154
L05148 | Human protein tyrosine kinase related mRNA sequence | 26
M63138 | Human cathepsin D | 24
M31523 | Human transcription factor (E2A) mRNA | 17
U05259 | Human MB-1 gene, complete cds | 12
U46499 | Homo sapiens microsomal glutathione transferase (MGST1) gene, 3’ sequence | 10
M27891 | Human cystatin C gene | 5
U16954 | Human (AF1q) mRNA | 3

as a group and how they are related biologically, we performed average linkage (average distance, UPGMA) clustering based on squared Euclidean distances using EPCLUST [73].

Figure 4.7 shows the clustering result. From Figure 4.7, we can observe that most of the samples belonging to the same class are grouped together. From thousands of genes, the proposed method can identify a few relevant genes that yield accurate classification.

Furthermore, the biological findings are interpretable from the obtained compact fuzzy rule base. Therefore, iGEC is beneficial to microarray data analysis and to the development of inexpensive diagnostic tests.
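The average linkage (UPGMA) clustering on squared Euclidean distances mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the EPCLUST implementation: the function names `squared_euclidean` and `average_linkage` are introduced here, and the six three-dimensional points are made-up toy values standing in for samples described by three selected genes.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage(points, n_clusters):
    """Agglomerative clustering with average linkage (UPGMA).

    The distance between two clusters is the mean of all pairwise squared
    Euclidean distances between their members. Merging stops when
    n_clusters clusters remain. Returns clusters as sorted index lists.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(squared_euclidean(points[i], points[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy data (hypothetical): two well-separated groups of three samples each.
toy_samples = [
    (0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (0.0, 0.3, 0.1),
    (2.0, 2.1, 1.9), (2.2, 1.9, 2.0), (1.9, 2.0, 2.1),
]
groups = average_linkage(toy_samples, n_clusters=2)
print(groups)
```

This naive version recomputes every pairwise average at each merge, which is adequate for a data set of 72 samples but would be slow for large inputs.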

Besides the leukemia1 classifier using the gene set {L05148, U46499, U05259} shown in Figure 4.6, other sets of three genes also yield classifiers with 100% training and test accuracies: {L05148, M63138, U05259}, {M11722, L05148, U46499}, {M31523, U16954, U46499}, and {U16954, M27891, U05259}. This situation arises because microarray data have a large number of genes but a very small number of samples. iGEC can thus provide important knowledge to biological scientists.

Table 4.3 describes the genes selected from the leukemia1 data set of 72 samples.

For each gene, we counted the number of articles that were retrieved by a PubMed query containing the gene name and the key string “leukemia”. By combining the gene sets of more solutions, most of the genes highly related to leukemia can be obtained.

Because fuzzy partitions such as grid partition, tree partition, and scatter partition have different merits, they cannot be directly compared using a specific measure [37].

Figure 4.7. The clustering result of 72 samples in data set leukemia1 using the three selected genes by the clustering algorithm EPCLUST [73].

However, iGEC uses on average 1.1 fuzzy regions to describe the sample distribution of each class. Besides the above-mentioned advantages of easy interpretation and economical experiments, the proposed fuzzy rule-based method using a scatter partition of the feature space can enclose all possible occurrences of samples in the same class with one or a few hyperbox-type fuzzy regions. In other words, the fuzzy regions of a scatter partition can represent one class more independently than those of a grid partition. Therefore, iGEC can reject an unknown sample if it belongs to no fuzzy region, that is, if no fuzzy rule is fired.
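The rejection behaviour of hyperbox-type fuzzy regions can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from this section: trapezoidal membership functions, the minimum operator for combining genes, a score equal to firing strength times CF, and single-winner inference. Eq. 4.3 is not reproduced here, so this should be read as a generic fuzzy rule-based classifier rather than the iGEC inference itself. The two rules, their bounds, and the class names are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with a < b <= c < d.

    Returns 0 outside (a, d), 1 on [b, c], and is linear in between.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def classify(sample, rules):
    """Return the label of the best-scoring rule, or None if no rule fires.

    Assumed inference: firing strength is the minimum membership over all
    features; score = firing strength * CF; the highest score wins.
    """
    best_label, best_score = None, 0.0
    for rule in rules:
        firing = min(trapezoid(x, *mf) for x, mf in zip(sample, rule["region"]))
        score = firing * rule["cf"]
        if score > best_score:
            best_label, best_score = rule["label"], score
    return best_label


# Hypothetical rules: one hyperbox-type fuzzy region per class over two
# normalized features, each described by trapezoid parameters (a, b, c, d).
RULES = [
    {"label": "class A", "cf": 0.7,
     "region": [(0.0, 0.1, 0.4, 0.5), (0.0, 0.1, 0.4, 0.5)]},
    {"label": "class B", "cf": 0.7,
     "region": [(0.5, 0.6, 0.9, 1.0), (0.5, 0.6, 0.9, 1.0)]},
]

print(classify((0.2, 0.3), RULES))  # inside the region of class A
print(classify((0.7, 0.8), RULES))  # inside the region of class B
print(classify((0.2, 0.8), RULES))  # outside both regions: rejected (None)
```

With a grid partition the whole feature space is covered by cells, so a sample always falls into some cell; with a scatter partition a sample lying outside every region fires no rule and can be reported as unknown.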

4.2.3 Experiment 2-Comparison between iGEC and
