
The goal of cross-laboratory DGE analysis is to determine whether a model built from the dataset of one laboratory can classify samples from the dataset of another. Moreover, we can combine the datasets from several laboratories into one large dataset and use it to classify yet another dataset. However, with high-throughput technologies there are many differences among the datasets published by different laboratories, such as environments, machines, sample races, and many other experimental conditions [23, 12]. The raw data from different laboratories may therefore follow different distributions. If we combine these datasets directly, the results will be strongly affected, so the datasets have to be normalized to the same distribution before further analysis [19].

Not only cross-laboratory but also cross-platform differences may have a strong influence. Several studies show that even for the same samples, the measurements from different platforms are poorly correlated [2, 18]. Therefore, much research has been conducted to reduce the bias resulting from cross-laboratory or cross-platform differences. Many approaches have been proposed to correct this bias in microarray technology, such as log transform [5], mean scaling, and rank-based normalization [52]. For RNA-seq technology, few studies have discussed the normalization problem across laboratories or platforms [16, 35].

Many studies have shown that rank-based normalization is effective in raising prediction accuracy and makes it more stable than using expression values alone. Expression values may be biased because the scale of each gene may vary among different experimental environments. Ranking the genes within a sample instead of using their expression values is much better for eliminating systematic biases and improving prediction accuracy [52].

There are several variants of rank-based normalization. The first, the basic type of rank-based normalization which we use in our study, simply replaces the expression value of each gene with its rank within the sample [44]. The second, median rank, calculates the median of each gene across the samples and uses the rank of those medians as the normalized value [47]. Another rank-based normalization is quantile normalization [3]: the value of each rank is measured by averaging the expression values at that rank across samples, and the expression value of each gene is then replaced by the value of its rank. Some studies show that simple rank-based normalization performs better than the quantile normalization method [44, 31], so we choose simple rank-based normalization, which replaces the expression value of each gene with its rank within the sample. In this study, we use both RPKM (FPKM) values and rank levels for feature selection to observe the improvement gained by applying rank-based normalization.
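As an illustration, the following is a minimal sketch of the simple rank-based normalization described above, written in Python with NumPy and SciPy. The matrix name expr and its orientation (genes in rows, samples in columns) are assumptions made for the example, not part of the original pipeline.

import numpy as np
from scipy.stats import rankdata

def rank_normalize(expr):
    """Replace each gene's expression value with its rank within the sample.

    expr: 2-D array with genes in rows and samples in columns (assumed layout).
    Returns an array of the same shape holding within-sample ranks.
    """
    ranked = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        # rankdata assigns rank 1 to the smallest value; ties get average ranks.
        ranked[:, j] = rankdata(expr[:, j])
    return ranked

# Example: 4 genes x 2 samples of RPKM-like values.
expr = np.array([[5.0, 0.2],
                 [1.0, 9.0],
                 [0.0, 3.5],
                 [7.5, 3.5]])
print(rank_normalize(expr))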

Chapter 3

Feature selection and classification

Finding relevant genes among tens of thousands of genes is an important and difficult task in differential gene expression analysis. We apply an embedded feature selection method, Random Forest [4], to select a list of relevant genes from the training set. Then, we evaluate the gene list by measuring the classification accuracy on the testing set with the well-known classifier, the Support Vector Machine (SVM) [7]. We introduce the details of these two techniques in the following sections.
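As a rough illustration of this two-step procedure, the sketch below uses scikit-learn's RandomForestClassifier and SVC as stand-ins for the R 'randomForest' package and the SVM used in this study. The variable names, kernel, and other parameters are assumptions; note also that feature_importances_ is impurity-based by default, which differs from the permutation-based importance described in Section 3.1.4.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def select_and_classify(X_train, y_train, X_test, y_test, n_genes=250):
    # Step 1: rank genes by Random Forest feature importance on the training set.
    rf = RandomForestClassifier(n_estimators=1000, random_state=0)
    rf.fit(X_train, y_train)
    top = np.argsort(rf.feature_importances_)[::-1][:n_genes]

    # Step 2: evaluate the selected gene list with an SVM on the testing set.
    svm = SVC(kernel="linear")
    svm.fit(X_train[:, top], y_train)
    y_pred = svm.predict(X_test[:, top])
    return top, accuracy_score(y_test, y_pred)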

3.1 Feature selection by Random Forest

Random Forest, first proposed by Breiman [4], is an embedded feature selection method that interacts with the classifier. We apply Random Forest for feature selection on the training set to obtain a ranked list of genes, sorted from the most relevant to the least relevant for classification. We use the R package ‘randomForest’ [20] for this step. In this section, we first introduce the decision tree that Random Forest uses and the ensemble of decision trees. Finally, we introduce the whole procedure of Random Forest.

3.1.1 Building decision tree

A decision tree is a predictive model which can be used for classification or regression. Here we use classification trees. Figure 3.1 shows the structure of a classification tree.


Figure 3.1: An example of a decision tree.

Assume that a sample A has a feature vector X = (x_1, x_2, ..., x_p), which is a p-dimensional vector. We want to predict the class Y ∈ {1, −1} of A from the feature vector X. A classification tree is a binary tree in which each internal node represents a test on A. At each internal node, we apply the test and use the outcome (yes or no) to decide which way A goes: if the test returns yes, A goes to the left branch; otherwise, A goes to the right branch. Finally, A reaches a leaf node, where we make a prediction of the class Y.

Constructing a classification or regression tree is based on a greedy algorithm. The classification tree is constructed top-down, starting from the root node. To choose the test at each internal node, we calculate the value of:

|S| · H(S) − |S_t| · H(S_t) − |S_f| · H(S_f),    (3.1)

where S denotes the set of samples that reach the node, and S_t and S_f denote the subsets of S for which the test is true and false, respectively. The function H is the Shannon entropy:

H(S) = −∑_{i=1}^{Y} p(c_i) · log_2 p(c_i),    (3.2)

where Y is the number of classes and p(c_i) is the proportion of samples in S belonging to class c_i. The feature which maximizes Eq. 3.1 is chosen for the internal node and removed from the feature vector. The tree is constructed recursively until all the features have been assigned to internal nodes.
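To make the split criterion concrete, the following is a minimal sketch (Python with NumPy) of computing the entropy of Eq. 3.2 and the gain of Eq. 3.1 for one candidate test. The function names and the boolean representation of a test's outcomes are illustrative assumptions, not part of the original implementation.

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector (Eq. 3.2).
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_score(labels, test_outcomes):
    # Value of Eq. 3.1: |S|*H(S) - |S_t|*H(S_t) - |S_f|*H(S_f),
    # where test_outcomes is a boolean array (True = test is satisfied).
    S_t = labels[test_outcomes]
    S_f = labels[~test_outcomes]
    return (len(labels) * entropy(labels)
            - len(S_t) * entropy(S_t)
            - len(S_f) * entropy(S_f))

# Example: a test such as "x_1 > 1.3" that separates the two classes fairly well.
y = np.array([1, 1, 1, -1, -1, -1])
outcome = np.array([True, True, False, False, False, False])
print(split_score(y, outcome))   # larger values indicate better splits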

3.1.2 Ensemble of trees

An ensemble method aggregates the predictions of several trees and usually improves on the performance of a single tree. The goal of an ensemble method is to use diversified models to reduce the variance. Random Forest is an ensemble of N decision trees {T_1(X), T_2(X), ..., T_N(X)}, where X = (x_1, x_2, ..., x_p) is a p-dimensional vector of features. The ensemble outputs {Ŷ_1 = T_1(X), ..., Ŷ_N = T_N(X)}, where Ŷ_i (i = 1, ..., N) is the prediction of tree T_i. The outputs of all the trees are aggregated to produce one final prediction, which for a classification problem is the majority vote of the trees.
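As a small illustration, the aggregation step can be sketched as follows (Python/NumPy); the array of per-tree predictions is an assumed input, produced by whatever trees were trained.

import numpy as np

def majority_vote(tree_predictions):
    # tree_predictions: array of shape (N,) holding each tree's predicted class
    # label for one sample, e.g. values in {1, -1}. Returns the most frequent label.
    labels, counts = np.unique(tree_predictions, return_counts=True)
    return labels[np.argmax(counts)]

print(majority_vote(np.array([1, -1, 1, 1, -1])))   # -> 1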

3.1.3 Training procedure

Given a training set of size n, D = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}, where X_i is the feature vector and Y_i is the class label of sample i, the training procedure of Random Forest is as follows:

1. From the training data of n samples, randomly sample N subsets of size n with replacement (bootstrap samples).

2. For each subset, build a decision tree with the following rule: at each internal node, choose the gene that splits the subset best.

3. Repeat the above steps until all N trees are constructed. In our study, we set N to 1000. A sketch of this procedure is given below.
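The sketch below is a minimal version of this training loop, using scikit-learn's DecisionTreeClassifier as the single-tree learner; the function and variable names are illustrative assumptions. It mirrors the simplified description above, so the per-node random gene subsampling performed by the full Random Forest algorithm is not shown.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=1000, random_state=0):
    # X: (n_samples, n_genes) matrix; y: class labels.
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of the n training samples (with replacement).
        idx = rng.integers(0, n, size=n)
        # Step 2: grow a tree; entropy-based splits mirror Eqs. 3.1-3.2.
        tree = DecisionTreeClassifier(criterion="entropy")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees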

3.1.4 Measuring feature importance

Breiman proposed a procedure to compute the feature importance. Consider the out-of-bag samples S_o of the i-th tree, i.e., the training samples that were not used in the construction of that tree, and let p_i be the prediction accuracy of the i-th tree on S_o. Randomly permute the values of gene j in S_o to obtain S_oj, and let p_ij be the prediction accuracy of the i-th tree on S_oj after the permutation. The importance s_j of gene j is obtained by averaging p_i − p_ij over all the trees. Finally, we sort the importance list and take the top 250 genes for classification.
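The following is a minimal sketch of this permutation-based importance. It assumes sklearn-style fitted trees and a list oob_indices recording which training samples were out-of-bag for each tree (bookkeeping that the earlier training sketch would need to keep); the names and the final top-250 selection line are illustrative, not the study's exact implementation.

import numpy as np

def permutation_importance(trees, oob_indices, X, y):
    # trees: fitted classifiers; oob_indices[i]: indices of the samples that were
    # out-of-bag for tree i. Returns one importance score per gene (column of X).
    n_genes = X.shape[1]
    scores = np.zeros(n_genes)
    rng = np.random.default_rng(0)
    for tree, oob in zip(trees, oob_indices):
        X_oob, y_oob = X[oob], y[oob]
        p_i = np.mean(tree.predict(X_oob) == y_oob)        # accuracy on S_o
        for j in range(n_genes):
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute gene j
            p_ij = np.mean(tree.predict(X_perm) == y_oob)  # accuracy on S_oj
            scores[j] += p_i - p_ij
    scores /= len(trees)                                   # average over all trees
    return scores

# The genes are then ranked by score and the top 250 are kept, e.g.:
# top_genes = np.argsort(scores)[::-1][:250]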
