

Ministry of Science and Technology (MOST) Funded Research Project Final Report

Final Report

Data Driven Geometry for Learning (數據驅動的幾何學習)

Project type: Individual project
Project number: MOST 103-2118-M-004-006-
Project period: October 1, 2014 to July 31, 2015
Host institution: Department of Statistics, National Chengchi University
Principal investigator: 周珮婷 (Elizabeth P. Chou)
Project staff: Part-time assistants (master's students): 楊俊隆, 章珅鎝
Report attachments: Report on attending an international conference and the published paper
Handling: 1. Public disclosure: this project involves patents or other intellectual property rights and may be made publicly searchable after 2 years. 2. Has this research produced findings that seriously harm the public interest: No. 3. Is this report recommended as a reference for government policy: No.

Date: October 5, 2015



National Science Council, Executive Yuan: Research Project Final Report

Data Driven Geometry for Learning

Project number: MOST 103-2118-M-004-006-    Project period: October 1, 2014 to July 31, 2015

Principal investigator: 周珮婷 (Elizabeth P. Chou)

一、 Abstract

High dimensional covariate information provides a detailed description of any individuals involved in a machine learning and classification problem. The inter-dependence patterns among these covariate vectors may be unknown to researchers. This fact is not well recognized in classic and modern machine learning literature; most model-based popular algorithms are implemented using some version of the dimension-reduction approach or even impose a built-in complexity penalty. This is a defensive attitude toward the high dimensionality. In contrast, an accommodating attitude can exploit such potential inter-dependence patterns embedded within the high dimensionality. In this research project, we implement this latter attitude throughout by first computing the similarity between data nodes and then discovering pattern information in the form of Ultrametric tree geometry among almost all the covariate dimensions involved. We then make use of these patterns to build supervised and semi-supervised learning algorithms. The computations for such discovery are primarily based on the new clustering technique, Data Cloud Geometry (DCG), a non-supervised learning algorithm. Our data-driven learning approach is focused on the central issue of how to adaptively evolve a simple empirical distance into an effective one in order to facilitate an efficient global feature-matrix for learning purposes.

Keywords: distance, Data Cloud Geometry, machine learning


二、 Research Objectives

Powered by information technology advances in this internet era, machine learning has become ubiquitous in scientific research and real-world business as an effective way of gaining insightful information and knowledge. It has been popularized by many free software packages made available on the websites of institutes and individuals, and by commercial products available on the market. By and large, machine learning algorithms work to a reasonable extent when the number of covariate dimensions is low or moderate. And the majority of such algorithms are model-based, so they intrinsically can only accommodate low or moderate dimensionality. When facing high dimensionality, regularization is one popular approach. By building a penalty function for excessive model complexity into the algorithm, these approaches indeed take a defensive attitude toward high dimensionality. Though they have had many successes, they are still far from being universal. A potential reason for this lack of universality is the inability to accommodate hidden inter-dependence patterns embedded within the collection of covariate dimensions. This viewpoint is likely pertinent when we look at "learning" from a systemic perspective. By learning, we mean to see and capture insightful information pertaining to the particular system of interest. If a covariate dimension stands for a facet of the system, then a large collection of covariate dimensions must have many hidden patterns embedded in it to be discovered. A model-based learning algorithm might capture the aspects allowed by the assumed models. However, the view framed by such models can be rather limited, if not entirely improper.

Beyond the above systemic perspective, one key observation is potentially important: under the high dimensionality, typical real world data sets are "too small" to sustain smooth manifolds or distribution functions. For instance, a set of billions of 100-dimensional binary data points is just a drop in the sea of the binary space {0, 1}^100, which has a cardinality of order 10^30.
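For a quick check of this order of magnitude:

    \[
      \bigl|\{0,1\}^{100}\bigr| \;=\; 2^{100} \;=\; 10^{\,100\log_{10} 2} \;\approx\; 10^{30.1} \;\approx\; 1.27 \times 10^{30}.
    \]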

The immediate implication of the smallness of data sizes is that it becomes unrealistic to build learning algorithms based on required smoothness of manifolds or distributions. Specifically, there are three implications. First, this lack of distributional structure implies that classic linear discriminant analysis might not be well supported, because the estimation of the variance-covariance matrix is rather unstable. Second, the lack of a smooth manifold implies that support vector machines based on various kernel modules could miss or even contradict the true data structures. Third, the joint absence of distributional and manifold structural information could leave model-based algorithms, such as Lassos and related methodologies, far removed from the data's structural features.

After recognizing the absence of smooth manifold and distribution structure in relatively small data sets, it is clearly essential to extract authentic data structure in a data-driven fashion. Ideally, if such computed structures can be coherently embedded into a visible geometry, then the development of learning algorithms would be realistic and right to the point of solving the real issues at hand.


This is the direction pursued in this research. We attempt to take the accommodating attitude toward the high covariate dimensionality, and to make use of computational approaches to uncover the hidden inter-dependence patterns embedded within the collection of covariate dimensions. The computed pattern information would be used as the foundation for constructing learning algorithms. Thus, the theme of "machine learning" here is data-driven discovery in a computational and experimental enterprise, in contrast to heavy-handed statistical modeling endeavors. This data-driven discovery theme is detailed as follows.

Consider n subjects indexed by i = 1, ..., n, where each subject is encoded with a class-category number and is equipped with P-dimensional covariate information. Let an n × P matrix collectively record all covariate information available. Here we assume that an empirical distance among the n row vectors, and another empirical distance for the P column vectors, are available. By using either one of the empirical distances, we calculate a symmetric distance matrix. Then we apply the Data Cloud Geometry computational algorithm, developed in [1] and [2], to build an Ultrametric tree geometry T_S on the subject space of the n P-dimensional row vectors, and another Ultrametric tree geometry T_C on the covariate space of the P n-dimensional column vectors.
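For concreteness, a small sketch of this setup with made-up data (the array names are illustrative, and the DCG tree-building step of [1] and [2] is not reproduced here):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.normal(size=(51, 1905))      # a hypothetical n x P covariate matrix

    # Symmetric empirical distance matrix among the n subject (row) vectors.
    D_subjects = squareform(pdist(X, metric="euclidean"))        # n x n

    # Symmetric empirical distance matrix among the P covariate (column) vectors.
    D_covariates = squareform(pdist(X.T, metric="euclidean"))    # P x P

    # Either matrix would then be handed to the DCG algorithm to grow the
    # ultrametric trees T_S (subjects) and T_C (covariates).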

The measuring of similarity or distance for two data nodes plays an important role in capturing the data geometry. Clustering is a method of grouping data nodes into a number of clusters based on the similarity between them [3]. Therefore, within-class members are similar, and between-class members are dissimilar [4]. Finding a suitable measure of similarity between two data nodes has been an issue in data clustering. Exploring data geometry is an important way to describe the similarity between the data in clustering. However, choosing a correct distance measure is difficult. With high dimensionality, it is impossible to make assumptions about data distributions or to get a priori knowledge of the data. Therefore, it is even more difficult to measure the similarity between the data. Different datasets may require different methods for measuring similarity between the nodes. A suitable selection of measuring similarity will improve the results of clustering algorithms [5, 6].

Some widely used distance measures are the Hamming distance, Euclidean distance, Manhattan distance, and Pearson's correlation. Each method has its strengths and limitations [5, 6, 7, 8]. For example, Pearson's correlation shows the linear relationship between two data nodes. However, outliers will affect the correlation computation. The Hamming distance is inherently robust to noise [9]. However, it forces an equal emphasis on all nodes of a binary descriptor. Thus, choosing an appropriate similarity measurement for data is a difficult task.
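For a concrete feel for how these measures differ, a toy comparison on two made-up vectors (not data from this study):

    import numpy as np
    from scipy.spatial.distance import cityblock, euclidean, hamming

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.5, 1.0, 3.5, 4.0, 9.0])       # the last entry acts as an outlier

    print(euclidean(x, y))                         # Euclidean distance
    print(cityblock(x, y))                         # Manhattan distance
    print(1.0 - np.corrcoef(x, y)[0, 1])           # Pearson correlation turned into a dissimilarity
    print(hamming(x > 1.25, y > 1.25))             # Hamming distance on binarized copies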

The first goal of the study is to identify an appropriate similarity measurement and to determine the importance of the distance function. In this research, we tackle the fundamental issue in machine learning: how to iteratively modify the empirical distance in order to achieve better efficiency in building a learning algorithm. The key idea is motivated by the fact that the measures are modified according to the corresponding tree structures in an iterative fashion. We illustrate our development throughout via 3 real datasets.

The second goal of the study is to establish a learning rule for classification. As a final remark, under the Ultrametric tree structure, it becomes very clear how to extend a supervised learning algorithm to a semi-supervised learning algorithm. The essential component in this extension is to include all covariate information when constructing the DCG tree geometry. It is important because the geometry pertaining to a subset of covariate data might be significantly different from the geometry pertaining to the whole. The DCG tree is better based on all involved covariate information of labeled and unlabeled subjects.

三、 Research Methods

We first introduce a two-layered distance measurement for clustering high dimensional data based on Data Cloud Geometry (DCG). Through the DCG clustering method, we update the distance matrix with the clustering results to better capture the data geometry. Later, we extend the two-layered distance idea to a multi-layered distance.

We believe that the choice of the best distance function is a key to obtaining an accurate classification. We have to explore the data geometry with a suitable distance function to describe the similarity of the data nodes. However, there is no unique distance function for all datasets. We must check each distance function to get a suitable one for each dataset. If we don't use a proper distance measure to describe the similarity, then it will be impossible to get a correct classification.

Suppose that we have an n × p dataset X, where n is the number of subjects, n = n1 + n2 + n3 + ... + nq (in total, we have q groups of subjects), and p is the dimension of the dataset. The first-layered distance matrix, d0, is the traditional distance matrix. The second-layered distance matrix, d1, is the average distance calculated from incorporating the known label information:

d1 = d[i, i'] = n1 {mean(X1[, i]) − mean(X1[, i'])}^2 + n2 {mean(X2[, i]) − mean(X2[, i'])}^2 + ... + nq {mean(Xq[, i]) − mean(Xq[, i'])}^2,    i, i' = 1, ..., p

For example, by using the Euclidean distance:

d0 = √( Σ_j (X_ij − X_i'j)^2 ),    i, i' = 1, ..., n
d1 = √( Σ_g N_g (X̄_gi − X̄_gi')^2 ),    i, i' = 1, ..., n

d0 + d1 is called the "two-layered distance". We apply d0 + d1 as the distance in the DCG method to get the results for the variables' clusterings. With the clustering results, we update the distance for the subjects and apply it to the DCG method again to get another classification result. With several iterations, we may capture the geometry more accurately.
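A minimal sketch of the two-layered distance between variables, assuming a data matrix X whose rows are subjects and a NumPy array of known group labels, and following the d1 formula given above (the function name is illustrative, and the DCG clustering step itself is not shown):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def two_layered_variable_distance(X, labels):
        """Return d0 + d1 between the columns (variables) of X.

        d0 is the ordinary Euclidean distance between columns; d1 weights the
        squared differences of within-group column means by the group sizes n_g,
        following the formula above. `labels` holds one group label per subject.
        """
        d0 = squareform(pdist(X.T, metric="euclidean"))            # p x p
        d1 = np.zeros_like(d0)
        for g in np.unique(labels):
            in_g = labels == g
            m = X[in_g].mean(axis=0)                               # mean(X_g[, i]) for every variable i
            d1 += in_g.sum() * (m[:, None] - m[None, :]) ** 2      # n_g {mean(X_g[,i]) - mean(X_g[,i'])}^2
        return d0 + d1

The resulting matrix is what would be handed to DCG to cluster the variables; swapping the roles of rows and columns gives the analogous two-layered distance between subjects.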

If we can get an acceptable classification result when applying only d0 as the distance function in the DCG method, then this distance function is suitable for describing this particular data geometry. We should incorporate the labeled information to get an average distance and use both the original distance and the average distance, d0 + d1, to improve the classification results.

We will apply the two-layered distance to establish new learning rules and to check whether it can improve the current classification results. The idea of the new learning rule involves first clustering the subjects by the variables and then using the subjects' clustering information to get a two-layered distance. With this two-layered distance, we can cluster the variables. Finally, we can use the variables' clustering results to get another two-layered distance to classify the subjects. Additional iterations to build a multi-layered distance may be needed in order to obtain accurate classification results in supervised learning and semi-supervised learning.
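The alternation just described can be outlined as follows. Since the DCG algorithm of [1, 2] is not available as an off-the-shelf package, ordinary average-linkage hierarchical clustering is used below purely as a stand-in for the DCG step; all names (two_layered, cluster_rows, multi_layered) are illustrative, not part of any published code:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist, squareform

    def two_layered(M, col_groups):
        """d0 + d1 between the rows of M, with the columns of M grouped by col_groups."""
        d0 = squareform(pdist(M, metric="euclidean"))
        d1 = np.zeros_like(d0)
        for g in np.unique(col_groups):
            in_g = col_groups == g
            m = M[:, in_g].mean(axis=1)            # per-row mean over the columns in group g
            d1 += in_g.sum() * (m[:, None] - m[None, :]) ** 2
        return d0 + d1

    def cluster_rows(D, k):
        """Stand-in for DCG: average-linkage clustering of a precomputed distance matrix."""
        Z = linkage(squareform(D, checks=False), method="average")
        return fcluster(Z, t=k, criterion="maxclust")

    def multi_layered(X, subject_labels, k_subjects, k_variables, n_rounds=2):
        # Known subject labels -> two-layered distance between variables -> variable clusters.
        var_clusters = cluster_rows(two_layered(X.T, subject_labels), k_variables)
        for _ in range(n_rounds):
            # Variable clusters -> two-layered distance between subjects -> subject clusters.
            subj_clusters = cluster_rows(two_layered(X, var_clusters), k_subjects)
            # Refreshed subject clusters -> re-cluster the variables.
            var_clusters = cluster_rows(two_layered(X.T, subj_clusters), k_variables)
        return subj_clusters, var_clusters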

Data collection

We apply the method to three datasets. The first dataset comes from Dr. Lin at the Institute of Environmental Health, National Taiwan University. She applied nuclear magnetic resonance (NMR)-based metabolomics to characterize the metabolic effects on some tissues from rats treated with 3 pesticides (2 insecticides and 1 herbicide). The original idea was to cluster the metabolic effects of the 3 pesticides (the hypothesis is that a similar toxic mechanism will cause similar metabolic responses). Our data total 51 cases with 1905 variables: 18 cases (6 control, 6 low-dose, and 6 high-dose) were treated with dicofol, 14 cases (7 low-dose and 7 high-dose) with ethion, 13 cases (6 low-dose and 7 high-dose) with bifenox, and 6 cases formed the control group.

Another dataset, from [10], was also tested. Via immunohistochemistry (IHC), the initial 49 tumors with the 3883 gene expressions were classified by estrogen receptor status, ER+ or ER-, at the time of diagnosis and later again via a protein immunoblotting assay for ER to check the IHC results. The second analysis considers the clinically important issue of the metastatic spreading of the tumor. The determination of the extent of lymph node involvement, LN+ or LN-, in primary breast cancer is also addressed in the study in [10]. The final collection of tumors consisted of 13 ER+ LN+ tumors, 12 ER- LN+ tumors, 12 ER+ LN- tumors, and 12 ER- LN- tumors. Further information about the microarray data can be found in [10].

The third dataset was obtained from [11], containing 83 subjects with 2308 genes and 4 different cancer types: 29 cases of Ewing sarcoma (EWS), 11 cases of Burkitt lymphoma (BL), 18 cases of neuroblastoma (NB), and 25 cases of rhabdomyosarcoma (RMS).


四、 Results

First, we must find an accurate distance function to describe the data geometry for each dataset. The distance function should capture the geometry more precisely than do the other functions. Therefore, we use the DCG tree to check the classification results. In the DCG tree, subjects within the same group should be in the same or nearby clusters in the classification result. Here, we use colors to represent the groups in the DCG tree figures. In addition, we consider only three different distance functions in this study.

Figure 1(a) shows that similar colors tend to merge into one cluster. In other words, the Euclidean distance works better than the other distance functions in describing the geometry for the Lin dataset. Therefore, we choose the Euclidean distance as our first-layered distance.

Then, we obtain a two-layered distance function, d0 + d1, with d0 = √( Σ_j (X_ij − X_i'j)^2 ) and d1 = √( Σ_g N_g (X̄_gi − X̄_gi')^2 ), i, i' = 1, ..., 51. For the West dataset, we find that it is more appropriate to use the Euclidean distance without the square root as our distance function in describing the geometry for the West dataset, as shown in Figure 2(b). Therefore, d0 = Σ_j (X_ij − X_i'j)^2 and d1 = Σ_g N_g (X̄_gi − X̄_gi')^2, i, i' = 1, ..., 49. Figure 3(c) shows that for the Khan dataset, using Spearman's rank correlation is more appropriate. Therefore, d0 = |ρ|, and d1 = Σ_g N_g ...


Figure 1: A comparison of DCG trees with different distance functions for the Lin dataset: (a) with Euclidean distance; (b) with Euclidean distance but without square root; (c) with Spearman’s rank correlation.


Figure 2: A comparison of DCG trees with different distance functions for the West dataset: (a) with Euclidean distance; (b) with Euclidean distance but without square root; (c) with Spearman’s rank correlation.


Figure 3: A comparison of DCG trees with different distance functions for the Khan dataset: (a) with Euclidean distance; (b) with Euclidean distance but without square root; (c) with Spearman’s rank correlation.

Figures 4(a), 5(a), and 6(a) show the results from the DCG approach with traditional distance functions. Figures 4(b), 5(b), and 6(b) show the results of the two-layered distance functions. Clearly, with two-layered distance, the classification results are improved.


Figure 4: A comparison of DCG trees with 1- and 2-layered distances for the Lin dataset: (a) with d0; (b) with d0 + d1.


Figure 5: A comparison of DCG trees with 1- and 2-layered distances for the West dataset: (a) with d0; (b) with d0 + d1.

Figure 6: A comparison of DCG trees with 1- and 2-layered distances for the Khan dataset: (a) with d0; (b) with d0 + d1.

五、 Discussion

In the present study, we propose a two-layered distance by incorporating the group/labeled information. We find that the choice of the best distance function is a key to obtaining an accurate classification. We have to explore the data geometry with a suitable distance function to describe the similarity of the data nodes. However, there is no unique distance function for all datasets. We must check each distance function to get a suitable one for each dataset. If we don't use a proper distance measure to describe the similarity, then it will be impossible to get a correct classification.

We find that, if we can get an acceptable classification result when applying only d0 as the distance function in the DCG method, then this distance function is suitable for describing this particular data geometry. We should incorporate the labeled information to get an average distance and use both the original distance and the average distance, d0 + d1, to improve the classification results.


Evaluation of the performance of a distance function requires a comparison of its results with those of other functions. Future work will involve summarizing the classification performance of the use of different distance functions on a particular dataset. Meanwhile, the classification results of the same distance function on different kinds of datasets should also be summarized. These experimental results can help researchers in selecting suitable distance functions in future classification studies.

In the future, we will apply this two-layered distance to our former learning rules to check whether it can improve the current classification results. We will also use this two-layered distance idea to develop a new learning rule. The idea of the new learning rule involves first clustering the subjects by the variables and then using the subjects' clustering information to get a two-layered distance. With this two-layered distance, we can cluster the variables. Finally, we can use the variables' clustering results to get another two-layered distance to classify the subjects. Additional iterations may be needed in order to obtain accurate classification results in supervised learning and semi-supervised learning.

Traditional clustering methods assume that the data are independently and identically distributed. This assumption is unrealistic in real data, especially in high dimensional data. With high dimensionality, it is impossible to make assumptions about data distributions and difficult to measure the similarity between the data. We believe that measuring the similarity between the data nodes is an important way of exploring the data geometry in clustering. Also, clustering is a way to improve dimensionality reduction, and similarity research is a pre-requisite for non-linear dimensionality reduction. The relationships among clustering, similarity and dimensionality reduction should be considered in future research.


References

[1] H. Fushing, H. Wang, K. Vanderwaal, B. McCowan, and P. Koehl, “Multi-scale clustering by building a robust and self correcting ultrametric topology on data points,” PLoS ONE, vol. 8, p. e56259, Jan. 2013.

[2] H. Fushing and M. P. McAssey, “Time, temperature, and data cloud geometry,” Physical Review E, vol. 82, p. 061110, Dec. 2010.

[3] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, pp. 264–323, Sept. 1999.

[4] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall College Div, March 1988.

[5] V. Kumar, J. Chhabra, and D. Kumar, “Impact of distance measures on the performance of clustering algorithms,” in Intelligent Computing, Networking, and Informatics (D. P. Mohapatra and S. Patnaik, eds.), vol. 243 of Advances in Intelligent Systems and Computing, pp. 183–190, Springer India, 2014.

[6] R. Gentleman, B. Ding, S. Dudoit, and J. Ibrahim, “Distance measures in dna microarray data analysis,” in Bioinformatics and Computational Biology Solutions Using R and Bioconductor (R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, eds.), Statistics for Biology and Health, pp. 189–208, Springer New York, 2005.

[7] R. C. Baishya, R. Sarmah, D. K. Bhattacharyya, and M. A. Dutta, “A similarity measure for clustering gene expression data,” in Applied Algorithms, pp. 245–256, Springer, 2014.

[8] Z. Xu, “Distance, similarity, correlation, entropy measures and clustering algorithms for hesitant fuzzy information,” in Hesitant Fuzzy Sets Theory, vol. 314 of Studies in Fuzziness and Soft Computing, pp. 165–279, Springer International Publishing, 2014.

[9] O. Pele and M. Werman, "Robust real-time pattern matching using bayesian sequential hypothesis testing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, pp. 1427–1443, Aug. 2008.

[10] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, J. R. Marks, and J. R. Nevins, "Predicting the clinical status of human breast cancer by using gene expression profiles," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, pp. 11462–11467, Sept. 2001.

[11] J. Khan, J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer, "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, pp. 673–679, June 2001.


MOST Funded Project: Report on Attending an International Academic Conference

Date: October 1, 2015

一、 Conference participation

The conference was one of the three tracks of The Frontiers in Intelligent Data and Signal Analysis. Papers were submitted in January and the review results were announced in mid-March; the acceptance rate was about one third. Forty papers were included in the conference proceedings published by Springer. Topics included classification, clustering, association rule and pattern mining, and specific data mining methods for different multimedia data types such as image mining, text mining, video mining and web mining. I presented my research on the afternoon of the first day of the conference.

Project number: MOST 103-2118-M-004-006-
Project title: Data Driven Geometry for Learning (數據驅動的幾何學習)
Traveler: 周珮婷 (Elizabeth P. Chou)
Institution and position: Assistant Professor, Department of Statistics, National Chengchi University
Conference dates: July 20 to July 23, 2015
Conference location: Hamburg, Germany
Conference name: International Conference on Machine Learning and Data Mining (機器學習與資料採礦國際會議)
Presentation title: Data Driven Geometry for Learning (數據驅動的幾何學習)
Attachment 5 (附件五)


二、 Reflections on attending the conference

Most attendees were scholars from computer science and information fields, so I was able to learn about research done in other areas, receive different opinions, and exchange ideas. Through the proceedings published by the conference, one can also study in greater depth the research presented by the participating scholars. The only pity is that, perhaps because of the location and the weather, the actual number of participants was lower than the organizers had expected, so the presentation order and meeting times were rearranged on short notice on the day of the conference.

三、 Full text or abstract of the presented paper

Attached at the end of this report.

四、 Suggestions

I hope that MOST can provide more funding in the future, so that there will be more opportunities to attend conferences, broaden one's horizons, meet scholars in various fields, and exchange research ideas.

五、 Materials brought back (names and contents)

The conference's reusable tote bag, some brochures and advertisements, the conference proceedings, and the conference program and handbook.

六、 Other


Data Driven Geometry for Learning

Elizabeth P. Chou

Department of Statistics, National Chengchi University, Taipei, Taiwan eptchou@nccu.edu.tw

Abstract. High dimensional covariate information provides a detailed description of any individuals involved in a machine learning and classification problem. The inter-dependence patterns among these covariate vectors may be unknown to researchers. This fact is not well recognized in classic and modern machine learning literature; most model-based popular algorithms are implemented using some version of the dimension-reduction approach or even impose a built-in complexity penalty. This is a defensive attitude toward the high dimensionality. In contrast, an accommodating attitude can exploit such potential inter-dependence patterns embedded within the high dimensionality. In this research, we implement this latter attitude throughout by first computing the similarity between data nodes and then discovering pattern information in the form of Ultrametric tree geometry among almost all the covariate dimensions involved. We illustrate with real Microarray datasets, where we demonstrate that such dual-relationships are indeed class specific, each precisely representing the discovery of a biomarker. The whole collection of computed biomarkers constitutes a global feature-matrix, which is then shown to give rise to a very effective learning algorithm.

Keywords: Microarray · Semi-supervised learning · Data cloud geometry · biDCG

1 Introduction

Under the high dimensionality, it becomes unrealistic to build learning algorithms based on required smoothness of manifolds or distributions for typical real world datasets. After recognizing this fact, it is clearly essential to extract authentic data structure in a data-driven fashion. Ideally, if such computed structures can be coherently embedded into a visible geometry, then the development of learning algorithms would be realistic and right to the point of solving the real issues at hand.

Microarrays are examples of high dimensional datasets. Microarrays provide a means of measuring thousands of gene expression levels simultaneously. Clustering genes with similar expression patterns into a group can help biologists obtain more information about gene functioning [5,10]. In addition, clustering subjects into groups by their gene expression patterns can help medical diagnosis. Classification has been discussed extensively in this setting because it can help researchers investigate medical data in a more efficient way. Therefore, many methods for classifying microarray data have been developed and reviewed by researchers [17,20,22,24].

Many studies have shown that the logistic regression approach is a fast and standardizable method for data classification [9,25]. Regardless of its extensive use, it might not be appropriate for dealing with gene expression data [19,23,26]. Since most of the microarray data are in a large p small n setting, a subset of the genes is selected through some methods and the regression prediction is performed with these genes. However, it is difficult to determine the size of the gene subset that will be chosen. If too few genes are included, the prediction error may be large. If too many genes are used, the model may be overestimated and either fail to converge or yield an unstable result. It is difficult to find a reliable method for both selecting the genes and performing logistic regression. Although logistic regression can be extended to a multi-class classification problem, a suitable method for multi-class classification with gene expression is needed [2,6,8,21].

Multicollinearity may be another problem in regression analysis on gene expression data. Since gene expression is highly correlated to the expression of other genes, the classification line that we obtain to separate the data may be unstable. Another problem may be sparseness. The regression model may not reach convergence under these conditions. When the sample size is too small, logistic regression may not provide enough power for performing the prediction. Cross-validation is a measure for checking the performance of a predicted model. However, in such high dimensional microarray data, it may not be efficient and may yield a range of predicted results.

Two-way clustering was introduced to microarray clustering decades ago. Researchers tried to narrow down the numbers of genes and of subjects and found features for a small subset of genes and a small subset of subjects [1,13]. The two-way method overcomes the problems identified above and also decreases the noise from irrelevant data. Feature selections can improve the quality of the classification and clustering techniques in machine learning. Chen et al. [7] developed an innovative iterative re-clustering procedure, biDCG, through a DCG clustering method [12] to construct a global feature matrix of dual relationships between multiple gene-subgroups and cancer subtypes.

In this research, we attempt to take the accommodating attitude toward the high covariate dimensionality, and to make use of computational approaches to uncover the hidden inter-dependence patterns embedded within the collection of covariate dimensions. The essential component is to include all covariate information when constructing the DCG tree geometry. It is important because the geometry pertaining to a subset of covariate data might be significantly different from the geometry pertaining to the whole. The DCG tree is better based on all involved covariate information of labeled and unlabeled subjects. The computed pattern information would be used as the foundation for constructing learning algorithms. Thus, the theme of "machine learning" here is data-driven discovery in a computational and experimental enterprise, in contrast to heavy-handed statistical modeling endeavors. This data-driven discovery theme is detailed as follows.

Consider n subjects indexed by i = 1, ..., n, where each subject is encoded with a class-category number and is equipped with p-dimensional covariate information. Let an n × p matrix collectively record all covariate information available. Here we assume that an empirical distance among the n row vectors, and another empirical distance for the p column vectors, are available. By using either one of the empirical distances, we calculate a symmetric distance matrix. Then we apply the Data Cloud Geometry (DCG) computational algorithm, developed by Fushing and McAssey [11,12], to build an Ultrametric tree geometry T_S on the subject space of the n p-dimensional row vectors, and another Ultrametric tree geometry T_C on the covariate space of the p n-dimensional column vectors.

In our learning approach, we try to make simultaneous use of computed pattern information in the Ultrametric tree geometries T_S and T_C. The key idea was motivated by the interesting block patterns seen by coupling these two DCG tree geometries on the n × p covariate matrix. The coupling is meant to permute the rows and columns according to the two rooted trees in such a fashion that subject-nodes and covariate-nodes sharing the core clusters are placed next to each other, while nodes belonging to different and farther apart branches are placed farther apart. This is the explicit reason why a geometry is needed in both subject and covariate spaces. Such a block pattern indicates that each cluster of subjects has a tight and close interacting relationship with a corresponding cluster of covariate dimensions. This block-based interacting relationship has been discovered and explicitly computed in [7], and termed a "dual relationship" between a target subject cluster and a target covariate cluster. Functionally speaking, this dual relationship describes the following fact: By restricting focus to a target subject cluster, the target covariate cluster can be exclusively brought out on the DCG tree as a standing branch. That is, this target covariate cluster is an entity distinct from the rest of the covariate dimensions with respect to the target subject cluster. Vice versa, by focusing only on the target covariate cluster, the target subject cluster can be brought out in the corresponding DCG tree.
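A small sketch of what coupling the two tree geometries amounts to in practice, using made-up cluster labels in place of the leaf orders of the two DCG trees:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 200))                     # toy n x p covariate matrix

    # Hypothetical cluster memberships, standing in for the leaf orders of T_S and T_C.
    subject_clusters = rng.integers(0, 3, size=60)
    covariate_clusters = rng.integers(0, 4, size=200)

    # Permute rows and columns so that members of the same cluster sit next to
    # each other; dual relationships then show up as high- or low-valued blocks.
    row_order = np.argsort(subject_clusters)
    col_order = np.argsort(covariate_clusters)
    X_blocked = X[np.ix_(row_order, col_order)]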

Several real cancer-gene examples are analyzed here. Each cancer type turns out to be one target subject cluster. And interestingly, a cancer type has somehow formed more than one dual relationship with distinct target covariate (gene) clusters. If an identified dual relationship constitutes the discovery of a biomarker, then multiple dual relationships mean multiple biomarkers for the one cancer type. Further, the collection of dual relationships would constitute a global-feature matrix of biomarkers. A biomarker for a cancer type not only has the capability to identify such a cancer type, but at the same time it provides negative information to other cancer types that have no dual relationships with the biomarker. Therefore, a collection of dual-relation-based blocks discovered on the covariate matrix would form a global feature identification for all involved cancer types. An effective learning algorithm is constructed in this paper.


2 Method

2.1 Semi-supervised Learning

Step 1. Choosing a particular cancer type (which includes target labeled subjects and all unlabeled subjects) to cluster genes into groups.

Step 2. Classifying whole labeled and unlabeled subjects by each gene-subgroup. Finding a particular gene-subgroup that can classify the target cancer type. Repeating the procedure for all the cancer types. These procedures yield the first dual relationship between the gene-subgroups and cancer subtypes. The cancer subtypes here may contain some unlabeled subjects within the cluster.

Step 3. Classifying genes again by a particular cancer subtype and the unknown ones that are in the same cluster as in step 2 yields the second gene-subgroups. Then, with these new gene-subgroups, classifying all subjects will yield the second dual-relationship.

Step 4. The calculation of

cos θ_ii' = ( V_i · V_i' ) / ( ‖V_i‖ ‖V_i'‖ ),    i, i' = 1, ..., n

is performed using the 2nd dual relationship. Here V_i is a vector for the unlabeled subject's data and V_i' is a vector for the other target labeled subject's data.

Step 5. Plotting the density function of cos θ_ii' for each cancer subtype determines the classification, with the subtype whose density function has the largest mode being chosen.

By the method above, we can obtain clusters of the unlabeled data and labeled data. We will not lose any information from the unlabeled data. By repeating the re-clustering procedure, we can confirm that the unlabeled subjects have been correctly classified.
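A compact sketch of Steps 4 and 5 for one unlabeled subject, assuming the columns have already been restricted to the gene-subgroup of the second dual relationship. The function name and the Gaussian kernel density estimate are illustrative choices, and "largest density mode" is read here as the subtype whose estimated density of cos θ peaks at the largest value:

    import numpy as np
    from scipy.stats import gaussian_kde

    def predict_subtype(v_unlabeled, labeled_blocks):
        """labeled_blocks maps a subtype name to a matrix whose rows are labeled
        subjects, restricted to the gene-subgroup of the 2nd dual relationship."""
        grid = np.linspace(-1.0, 1.0, 401)
        best_subtype, best_mode = None, -np.inf
        for subtype, V in labeled_blocks.items():
            # Step 4: cos(theta) between the unlabeled subject and each labeled subject.
            cos = V @ v_unlabeled / (np.linalg.norm(V, axis=1) * np.linalg.norm(v_unlabeled))
            # Step 5: location of the mode of the estimated density of the cos(theta) values.
            density = gaussian_kde(cos)(grid)
            mode = grid[np.argmax(density)]
            if mode > best_mode:
                best_subtype, best_mode = subtype, mode
        return best_subtype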

2.2 Datasets

We applied our learning algorithm to several datasets. The first dataset is the one from [7]. The dataset contains 20 pulmonary carcinoids (COID), 17 normal lung (NL), and 21 squamous cell lung carcinomas (SQ) cases. The second dataset was obtained from [18], containing 83 subjects with 2308 genes and 4 different cancer types: 29 cases of Ewing sarcoma (EWS), 11 cases of Burkitt lymphoma (BL), 18 cases of neuroblastoma (NB), and 25 cases of rhabdomyosarcoma (RMS). The third gene expression dataset comes from the breast cancer microarray study by [16]. The data includes information about breast cancer mutation in the BRCA1 and the BRCA2 genes. Here, we have 22 patients, 7 with BRCA1 mutations, 8 with BRCA2 mutations, and 7 with other types. The fourth gene expression dataset comes from [15]. The data contains a total of 181 subjects with 1626 genes: 31 cases of malignant pleural mesothelioma (MPM) and 150 cases of adenocarcinoma (ADCA) (see Table 1).


Table 1. Data description

Data        Number of labels   Subjects in each label            Dimensions
Chen        3                  20 COID, 17 NL, 21 SQ             58 × 1543
Khan        4                  29 EWS, 11 BL, 18 NB, 25 RMS      83 × 2308
Hedenfalk   3                  7 BRCA1, 8 BRCA2, 7 others        22 × 3226
Gordon      2                  31 MPM, 150 ADCA                  181 × 1626

Table 2. Data description in semi-supervised setting

Data        Unlabeled subjects   Labeled subjects in each label
Chen        15                   15 COID, 12 NL, 16 SQ
Khan        20                   23 EWS, 8 BL, 12 NB, 20 RMS
Hedenfalk   6                    5 BRCA1, 6 BRCA2, 5 others
Gordon      20                   21 MPM, 140 ADCA

Table 3. Accuracy rates for different examples - semi-supervised learning

Data set     Accuracy
Chen         15/15
Khan         1/20
Hedenfalk    4/4
Gordon       20/20

3 Results

We made some of the subjects unlabeled to perform semi-supervised learning. For the Chen dataset, we took the last 5 subjects in each group as unlabeled. For the Khan dataset, the unlabeled data are the same as those mentioned in [18]. Since the sample size of the Hedenfalk dataset is not large, we unlabeled only the last 2 subjects in BRCA1 and the last 2 subjects in BRCA2. We unlabeled 10 subjects in each group for the Gordon dataset. The numbers of labeled and unlabeled subjects can be found in Table 2. The predicted results can be found in Table 3. However, we could not find the distinct dual-relationship for the second dataset.

4 Discussion


The proposed learning rule efficiently classified most of the datasets with their dual relationships. In addition, we incorporated unlabeled data into the learning rule to prevent misclassification and the loss of some important information.

A large collection of covariate dimensions must have many hidden patterns embedded in it to be discovered. The model-based learning algorithm might capture the aspects allowed by the assumed models. We made use of computational approaches to uncover the hidden inter-dependence patterns embedded within the collection of covariate dimensions. However, we could not find the dual relationships for one dataset, as demonstrated in the previous sections. For that dataset, we could not predict precisely. The reason is that the distance function used was not appropriate for a description of the geometry of this particular dataset. We believe that the measuring of similarity or distance for two data nodes plays an important role in capturing the data geometry. However, choosing a correct distance measure is difficult. With high dimensionality, it is impossible to make assumptions about data distributions or to get a priori knowledge of the data. Therefore, it is even more difficult to measure the similarity between the data. Different datasets may require different methods for measuring similarity between the nodes. A suitable selection of measuring similarity will improve the results of clustering algorithms.

Another limitation is that we have to decide the smoothing bandwidth for the kernel density curves. A different smoothing bandwidth or kernel may lead to different results. Therefore, we cannot make exact decisions. Besides, when the number of genes is very large, a great deal of computing time may be required.
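The point about the bandwidth can be seen on a toy sample (made-up numbers, unrelated to the datasets above): the same data can place the estimated density mode in different locations under different bandwidth factors.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(1)
    sample = np.concatenate([rng.normal(0.50, 0.08, 20), rng.normal(0.90, 0.01, 8)])
    grid = np.linspace(0.0, 1.0, 501)

    for bw in (0.05, 0.5):
        mode = grid[np.argmax(gaussian_kde(sample, bw_method=bw)(grid))]
        print(f"bandwidth factor {bw}: estimated mode at {mode:.2f}")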

By using the inner product as our decision rule, we know that, when two subjects are similar, the angle between the two vectors will be close to 0 and cos θ will be close to 1. The use of cos θ makes our decision rule easy and intuitive. The performance of the proposed method is excellent. In addition, it can solve the classification problem when we have outliers in the dual relationship.

The contributions of our studies are that the learning rules can specify gene-drug interactions or gene-disease relations in bioinformatics and can identify the clinical status of patients, leading them to early treatment. The application of this rule is not limited to microarray data. We can apply our rule of learning processes to any large dataset and find the dual-relationship to shrink the dataset’s size. For example, the learning rules can also be applied to human behavior research focusing on understanding people’s opinions and their interactions.

Traditional clustering methods assume that the data are independently and identically distributed. This assumption is unrealistic in real data, especially in high dimensional data. With high dimensionality, it is impossible to make assumptions about data distributions and difficult to measure the similarity between the data. We believe that measuring the similarity between the data nodes is an important way of exploring the data geometry in clustering. Also, clustering is a way to improve dimensionality reduction, and similarity research is a pre-requisite for non-linear dimensionality reduction. The relationships among clustering, similarity and dimensionality reduction should be considered in future research.


References

1. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96(12), 6745–6750 (1999)

2. Bagirov, A.M., Ferguson, B., Ivkovic, S., Saunders, G., Yearwood, J.: New algorithms for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics 19(14), 1800–1807 (2003)

3. Basford, K.E., McLachlan, G.J., Rathnayake, S.I.: On the classification of microarray gene-expression data. Briefings Bioinf. 14(4), 402–410 (2013)

4. Ben-Dor, A., Bruhn, L., Laboratories, A., Friedman, N., Schummer, M., Nachman, I., Washington, U., Washington, U., Yakhini, Z.: Tissue classification with gene expression profiles. J. Comput. Biol. 7, 559–584 (2000)

5. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. J. Comput. Biol. 6(3–4), 281–297 (1999)

6. Bicciato, S., Luchini, A., Di Bello, C.: PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinf. 19(5), 571–578 (2003)

7. Chen, C.P., Fushing, H., Atwill, R., Koehl, P.: biDCG: a new method for discovering global features of DNA microarray data via an iterative re-clustering procedure. PloS One 9(7), 102445 (2014)

8. Chen, L., Yang, J., Li, J., Wang, X.: Multinomial regression with elastic net penalty and its grouping effect in gene selection. Abstr. Appl. Anal. 2014, 1–7 (2014)

9. Dreiseitl, S., Ohno-Machado, L.: Logistic regression and artificial neural network classification models: a methodology review. J. Biomed. Inf. 35(5–6), 352–359 (2002)

10. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. PNAS 95(25), 14863–14868 (1998)

11. Fushing, H., McAssey, M.P.: Time, temperature, and data cloud geometry. Phys. Rev. E 82(6), 061110 (2010)

12. Fushing, H., Wang, H., Vanderwaal, K., McCowan, B., Koehl, P.: Multi-scale clustering by building a robust and self correcting ultrametric topology on data points. PLoS ONE 8(2), e56259 (2013)

13. Getz, G., Levine, E., Domany, E.: Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA 97(22), 12079–12084 (2000)

14. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)

15. Gordon, G.J., Jensen, R.V., Hsiao, L.L., Gullans, S.R., Blumenstock, J.E., Ramaswamy, S., Richards, W.G., Sugarbaker, D.J., Bueno, R.: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62(17), 4963–4967 (2002)

16. Hedenfalk, I.A., Ringnér, M., Trent, J.M., Borg, A.: Gene expression in inherited breast cancer. Adv. Cancer Res. 84, 1–34 (2002)

17. Huynh-Thu, V.A., Saeys, Y., Wehenkel, L., Geurts, P.: Statistical interpretation of machine learning-based feature importance scores for biomarker discovery. Bioinformatics 28(13), 1766–1774 (2012)


18. Khan, J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)

19. Liao, J., Chin, K.V.: Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics 23(15), 1945–1951 (2007)

20. Mahmoud, A.M., Maher, B.A., El-Horbaty, E.S.M., Salem, A.B.M.: Analysis of machine learning techniques for gene selection and classification of microarray data. In: The 6th International Conference on Information Technology (2013)

21. Nguyen, D.V., Rocke, D.M.: Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 18(9), 1216–1226 (2002)

22. Saber, H.B., Elloumi, M., Nadif, M.: Clustering algorithms of microarray data. In: Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data, pp. 557–568 (2013)

23. Shevade, S.K., Keerthi, S.S.: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17), 2246–2253 (2003)

24. Thalamuthu, A., Mukhopadhyay, I., Zheng, X., Tseng, G.C.: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22(19), 2405–2412 (2006)

25. Wasson, J.H., Sox, H.C., Neff, R.K., Goldman, L.: Clinical prediction rules. Applications and methodological standards. New Engl. J. Med. 313(13), 793–799 (1985). PMID: 3897864

26. Zhou, X., Liu, K.Y., Wong, S.T.: Cancer classification and prediction using logistic regression with bayesian gene selection. J. Biomed. Inform. 37(4), 249–259 (2004)


MOST Funded Project: Data Sheet on Promotion of Derived R&D Results

Date: 2015/09/21

Project title: Data Driven Geometry for Learning (數據驅動的幾何學習)
Principal investigator: 周珮婷 (Elizabeth P. Chou)
Project number: 103-2118-M-004-006-
Discipline: Other applied statistics

No R&D results available for promotion.


Summary of Research Outputs for the FY 2014 (ROC 103) Research Project

Principal investigator: 周珮婷 (Elizabeth P. Chou)    Project number: 103-2118-M-004-006-    Project title: Data Driven Geometry for Learning (數據驅動的幾何學習)

Quantified outputs (actually achieved / expected total; project contribution 100% throughout):

Domestic
  Publications: journal papers 0 / 0; research or technical reports 0 / 0; conference papers 0 / 0; books 0 / 0
  Patents: applications filed 0 / 0; granted 0 / 0
  Technology transfer: cases 0 / 0; royalties (NT$ thousand) 0 / 0
  Project personnel (domestic): master's students 2 / 2; doctoral students 0 / 0; postdoctoral researchers 0 / 0; full-time assistants 0 / 0

International
  Publications: journal papers 0 / 0; research or technical reports 0 / 0; conference papers 1 / 1; books 0 / 0
  Patents: applications filed 0 / 0; granted 0 / 0
  Technology transfer: cases 0 / 0; royalties (NT$ thousand) 0 / 0
  Project personnel (foreign): master's students 0 / 0; doctoral students 0 / 0; postdoctoral researchers 0 / 0; full-time assistants 0 / 0

Notes (qualitative remarks, e.g., results shared among several projects or featured as a journal cover story): none.


Other outcomes (outcomes that cannot be expressed quantitatively, such as organizing academic activities, receiving awards, important international collaborations, the international impact of the research results, and other concrete benefits to industrial technology development): none.

Additional quantified items: assessment tools (qualitative and quantitative) 0; courses/modules 0; computer and network systems or tools 0; teaching materials 0; organized activities/competitions 0; seminars/workshops 0; newsletters/websites 0; number of participants (audience) reached by the promotion of project results 0.


MOST Funded Research Project Final Report Self-Evaluation Form

Please give an overall assessment of how well the research content matches the original proposal, the extent to which the expected goals were achieved, the academic or applied value of the research results (briefly describe the significance, value, and impact of the results, or the possibility of further development), whether the results are suitable for publication in academic journals or for patent applications, and the main findings or other related value.

1. Overall assessment of how well the research content matches the original proposal and the extent to which the expected goals were achieved:

■ Goals achieved
□ Goals not achieved (please explain, within 100 words)
□ Experiment failed
□ Experiment interrupted
□ Other reasons

Explanation:

2. Status of the research results with respect to publication in academic journals or patent applications:

Paper: □ Published  □ Unpublished manuscript  ■ In preparation  □ None
Patent: □ Granted  □ Under application  ■ None
Technology transfer: □ Transferred  □ Under negotiation  ■ None
Other: (within 100 words)

3. Please evaluate the academic or applied value of the research results in terms of academic achievement, technical innovation, and social impact (briefly describe the significance, value, and impact of the results, or the possibility of further development) (within 500 words):

數據

