Semi-supervised Learning - Data Driven Geometry for Learning

Step 1. Choose a particular cancer type (its target labeled subjects together with all unlabeled subjects) and cluster the genes into groups.

Step 2. Classify all labeled and unlabeled subjects by each gene-subgroup.

Find a particular gene-subgroup that can classify the target cancer type. Repeat the procedure for all cancer types. These procedures yield the first dual relationship between the gene-subgroups and cancer subtypes. The cancer subtypes here may contain some unlabeled subjects within the cluster.

Step 3. Classify the genes again using a particular cancer subtype together with the unlabeled subjects that fall in the same cluster in Step 2; this yields the second gene-subgroups. Then classify all subjects with these new gene-subgroups to obtain the second dual relationship.

Step 4. Using the second dual relationship, calculate

\[ \cos\theta_{ii'} = \frac{V_i \cdot V_{i'}}{\|V_i\|\,\|V_{i'}\|}, \qquad i, i' = 1, \ldots, n. \]

Here $V_i$ is the data vector of an unlabeled subject and $V_{i'}$ is the data vector of a target labeled subject.

Step 5. Plot the density function of $\cos\theta_{ii'}$ for each cancer subtype. The unlabeled subject is assigned to the subtype whose density function has the largest mode.

By the method above, we can obtain clusters of the unlabeled data and labeled data. We will not lose any information from the unlabeled data. By repeating the re-clustering procedure, we can confirm that the unlabeled subjects have been correctly classified.
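Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names (`cosine`, `kde`, `density_mode`, `classify`), the Gaussian kernel, the bandwidth of 0.05, the evaluation grid, and the toy vectors are all hypothetical. "Largest density mode" is read here as the subtype whose density peaks at the largest cosine value.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def kde(samples, x, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at x."""
    coef = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return coef * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def density_mode(samples, bandwidth=0.05, grid_size=401):
    """Grid point in [-1, 1] at which the estimated density is highest."""
    grid = [-1.0 + 2.0 * k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=lambda x: kde(samples, x, bandwidth))

def classify(unlabeled, labeled_by_class, bandwidth=0.05):
    """Assumed reading of Step 5: pick the class whose cosine density
    has its mode at the largest cosine value."""
    modes = {}
    for label, subjects in labeled_by_class.items():
        cosines = [cosine(unlabeled, v) for v in subjects]
        modes[label] = density_mode(cosines, bandwidth)
    return max(modes, key=modes.get), modes

# Hypothetical toy data: two classes, three labeled subjects each.
labeled = {
    "A": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.1]],
    "B": [[0.0, 1.0, 0.9], [0.1, 0.9, 1.0], [0.0, 1.1, 1.0]],
}
predicted, modes = classify([1.0, 0.15, 0.05], labeled)
print(predicted, modes)
```

In the toy data the unlabeled vector points in nearly the same direction as the class "A" vectors, so its cosines with that class concentrate near 1 and the sketch assigns it to "A".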

2.2 Datasets

We applied our learning algorithm to several datasets. The first dataset is from [7]. It contains 20 pulmonary carcinoid (COID), 17 normal lung (NL), and 21 squamous cell lung carcinoma (SQ) cases. The second dataset was obtained from [18] and contains 83 subjects with 2308 genes and 4 different cancer types: 29 cases of Ewing sarcoma (EWS), 11 cases of Burkitt lymphoma (BL), 18 cases of neuroblastoma (NB), and 25 cases of rhabdomyosarcoma (RMS). The third gene expression dataset comes from the breast cancer microarray study by [16]. The data include information about breast cancer mutations in the BRCA1 and BRCA2 genes. Here, we have 22 patients: 7 with BRCA1 mutations, 8 with BRCA2 mutations, and 7 with other types. The fourth gene expression dataset comes from [15]. The data contain a total of 31 malignant pleural mesothelioma (MPM) samples and 150 adenocarcinoma (ADCA) samples.

Data Driven Geometry for Learning 399

Table 1. Data description

Data        Number of labels   Number of subjects in each label   Dimensions
Chen        3                  20 COID, 17 NL, 21 SQ              58×1543
Khan        4                  29 EWS, 11 BL, 18 NB, 25 RMS       83×2308
Hedenfalk   3                  7 BRCA1, 8 BRCA2, 7 others         22×3226
Gordon      2                  31 MPM, 150 ADCA                   181×1626

Table 2. Data description in semi-supervised setting

Data        Number of unlabeled subjects   Number of subjects in each label
Chen        15                             15 COID, 12 NL, 16 SQ
Khan        20                             23 EWS, 8 BL, 12 NB, 20 RMS
Hedenfalk   6                              5 BRCA1, 6 BRCA2, 5 others
Gordon      20                             21 MPM, 140 ADCA

Table 3. Accuracy rates for different examples - semi-supervised learning

Data set    Accuracy
Chen        15/15
Khan        1/20
Hedenfalk   4/4
Gordon      20/20

3 Results

To perform semi-supervised learning, we treated some of the subjects as unlabeled. For the Chen dataset, we took the last 5 subjects in each group as unlabeled. For the Khan dataset, the unlabeled data are the same as those mentioned in [18]. Since the sample size of the Hedenfalk dataset is not large, we unlabeled only the last 2 subjects in BRCA1 and the last 2 subjects in BRCA2. For the Gordon dataset, we unlabeled 10 subjects in each group. The numbers of labeled and unlabeled subjects can be found in Table 2, and the predicted results in Table 3. However, we could not find a distinct dual relationship for the second dataset (Khan).
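The hold-out scheme described above (treating the last k subjects of each group as unlabeled) can be sketched as follows. This is only an illustration: the helper name `hold_out_last` and the integer subject identifiers are hypothetical, and the group sizes are the Chen and Gordon counts from Table 1.

```python
def hold_out_last(groups, k):
    """Split each group into labeled subjects and the last k unlabeled ones.

    groups: dict mapping a label to an ordered list of subject identifiers.
    Returns (labeled, unlabeled), where labeled keeps the per-label lists
    and unlabeled is a flat list of the held-out subject identifiers.
    """
    labeled, unlabeled = {}, []
    for label, subjects in groups.items():
        cut = max(len(subjects) - k, 0)  # index where the held-out tail starts
        labeled[label] = subjects[:cut]
        unlabeled.extend(subjects[cut:])
    return labeled, unlabeled

# Hypothetical subject identifiers; only the group sizes come from Table 1.
chen = {"COID": list(range(20)), "NL": list(range(17)), "SQ": list(range(21))}
gordon = {"MPM": list(range(31)), "ADCA": list(range(150))}

chen_labeled, chen_unlabeled = hold_out_last(chen, k=5)
gordon_labeled, gordon_unlabeled = hold_out_last(gordon, k=10)

print({label: len(s) for label, s in chen_labeled.items()}, len(chen_unlabeled))
print({label: len(s) for label, s in gordon_labeled.items()}, len(gordon_unlabeled))
```

With these group sizes, the sketch reproduces the Chen and Gordon rows of Table 2.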

4 Discussion

In the present study, we have proposed a semi-supervised data-driven learning

400 E.P. Chou

efficiently classified most of the datasets with their dual relationships. In addition, we incorporated unlabeled data into the learning rule to prevent misclassification and the loss of some important information.

A large collection of covariate dimensions must have many hidden patterns embedded in it to be discovered. A model-based learning algorithm might capture only the aspects allowed by the assumed models. We made use of computational approaches to uncover the hidden inter-dependence patterns embedded within the collection of covariate dimensions. However, we could not find the dual relationships for one dataset, as demonstrated in the previous sections. For that dataset, we could not predict precisely. The reason is that the distance function used was not appropriate for a description of the geometry of this particular dataset. We believe that measuring the similarity or distance between two data nodes plays an important role in capturing the data geometry. However, choosing a correct distance measure is difficult. With high dimensionality, it is impossible to make assumptions about data distributions or to get a priori knowledge of the data. Therefore, it is even more difficult to measure the similarity between the data. Different datasets may require different methods for measuring similarity between the nodes. A suitable choice of similarity measure will improve the results of clustering algorithms.

Another limitation is that we have to decide the smoothing bandwidth for the kernel density curves. A different smoothing bandwidth or kernel may lead to different results. Therefore, we cannot make exact decisions. Besides, when the number of genes is very large, a great deal of computing time may be required.
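The bandwidth sensitivity mentioned above can be seen in a small sketch. Everything here is an assumption made for illustration: the Gaussian kernel, the two bandwidth values, the grid, and the five cosine values, which were chosen so that a tight pair of values competes with a looser group of three.

```python
import math

def kde(samples, x, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at x."""
    coef = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return coef * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def density_mode(samples, bandwidth, grid_size=401):
    """Grid point in [-1, 1] at which the estimated density is highest."""
    grid = [-1.0 + 2.0 * k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=lambda x: kde(samples, x, bandwidth))

# Hypothetical cosine values: a loose group around 0.4 and a tight pair at 0.9.
cosines = [0.30, 0.40, 0.50, 0.90, 0.90]

narrow_mode = density_mode(cosines, bandwidth=0.02)  # resolves the tight pair
wide_mode = density_mode(cosines, bandwidth=0.20)    # smooths the group of three together

print(narrow_mode, wide_mode)
```

With the narrow bandwidth the mode sits at the tight pair near 0.9, while the wide bandwidth merges the three lower values into a single peak below 0.6, so a decision based on the mode location can change with the bandwidth.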

By using the inner product as our decision rule, we know that, when two subjects are similar, the angle between the two vectors will be close to 0 and cos θ will be close to 1. The use of cos θ makes our decision rule easy and intuitive.

The proposed method performed well: it classified every unlabeled subject correctly in three of the four datasets (Table 3). In addition, it can solve the classification problem when we have outliers in the dual relationship.

The contributions of our study are that the learning rules can specify gene-drug interactions or gene-disease relations in bioinformatics and can identify the clinical status of patients, leading them to early treatment. The application of this rule is not limited to microarray data. We can apply our learning rule to any large dataset and find the dual relationship to shrink the dataset's size. For example, the learning rules can also be applied to human behavior research focusing on understanding people's opinions and their interactions.

Traditional clustering methods assume that the data are independently and identically distributed. This assumption is unrealistic in real data, especially in high dimensional data. With high dimensionality, it is impossible to make assumptions about data distributions and difficult to measure the similarity between the data. We believe that measuring the similarity between the data nodes is an important way of exploring the data geometry in clustering. Also, clustering is a way to improve dimensionality reduction, and similarity research is a prerequisite for non-linear dimensionality reduction. The relationships among clustering, similarity and dimensionality reduction should be considered in future


References

1. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96(12), 6745–6750 (1999)

2. Bagirov, A.M., Ferguson, B., Ivkovic, S., Saunders, G., Yearwood, J.: New algorithms for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics 19(14), 1800–1807 (2003)

3. Basford, K.E., McLachlan, G.J., Rathnayake, S.I.: On the classification of microarray gene-expression data. Briefings Bioinf. 14(4), 402–410 (2013)

4. Ben-Dor, A., Bruhn, L., Laboratories, A., Friedman, N., Schummer, M., Nachman, I., Washington, U., Washington, U., Yakhini, Z.: Tissue classification with gene expression profiles. J. Comput. Biol. 7, 559–584 (2000)

5. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. J. Comput. Biol. 6(3–4), 281–297 (1999)

6. Bicciato, S., Luchini, A., Di Bello, C.: PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinf. 19(5), 571–578 (2003)

7. Chen, C.P., Fushing, H., Atwill, R., Koehl, P.: biDCG: a new method for discovering global features of DNA microarray data via an iterative re-clustering procedure. PLoS ONE 9(7), e102445 (2014)

8. Chen, L., Yang, J., Li, J., Wang, X.: Multinomial regression with elastic net penalty and its grouping effect in gene selection. Abstr. Appl. Anal. 2014, 1–7 (2014)

9. Dreiseitl, S., Ohno-Machado, L.: Logistic regression and artificial neural network classification models: a methodology review. J. Biomed. Inf. 35(5–6), 352–359 (2002)

10. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. PNAS 95(25), 14863–14868 (1998)

11. Fushing, H., McAssey, M.P.: Time, temperature, and data cloud geometry. Phys. Rev. E 82(6), 061110 (2010)

12. Fushing, H., Wang, H., Vanderwaal, K., McCowan, B., Koehl, P.: Multi-scale clustering by building a robust and self correcting ultrametric topology on data points. PLoS ONE 8(2), e56259 (2013)

13. Getz, G., Levine, E., Domany, E.: Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA 97(22), 12079–12084 (2000)

14. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)

15. Gordon, G.J., Jensen, R.V., Hsiao, L.L., Gullans, S.R., Blumenstock, J.E., Ramaswamy, S., Richards, W.G., Sugarbaker, D.J., Bueno, R.: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62(17), 4963–4967 (2002)

16. Hedenfalk, I.A., Ringnér, M., Trent, J.M., Borg, A.: Gene expression in inherited breast cancer. Adv. Cancer Res. 84, 1–34 (2002)

17. Huynh-Thu, V.A., Saeys, Y., Wehenkel, L., Geurts, P.: Statistical interpretation of machine learning-based feature importance scores for biomarker discovery. Bioinformatics 28(13), 1766–1774 (2012)


18. Khan, J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)

19. Liao, J., Chin, K.V.: Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23(15), 1945–1951 (2007)

20. Mahmoud, A.M., Maher, B.A., El-Horbaty, E.S.M., Salem, A.B.M.: Analysis of machine learning techniques for gene selection and classification of microarray data. In: The 6th International Conference on Information Technology (2013)

21. Nguyen, D.V., Rocke, D.M.: Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18(9), 1216–1226 (2002)

22. Saber, H.B., Elloumi, M., Nadif, M.: Clustering Algorithms of Microarray Data. In: Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data, pp. 557–568 (2013)

23. Shevade, S.K., Keerthi, S.S.: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17), 2246–2253 (2003)

24. Thalamuthu, A., Mukhopadhyay, I., Zheng, X., Tseng, G.C.: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22(19), 2405–2412 (2006)

25. Wasson, J.H., Sox, H.C., Neff, R.K., Goldman, L.: Clinical prediction rules. Applications and methodological standards. New Engl. J. Med. 313(13), 793–799 (1985). PMID: 3897864

26. Zhou, X., Liu, K.Y., Wong, S.T.: Cancer classification and prediction using logistic regression with Bayesian gene selection. J. Biomed. Inform. 37(4), 249–259 (2004)
