

4.2.3 Gene selection in training data by LA

We implemented the LA hub gene selection procedure in the training data set. First, we carried out the modified imputation. In the second step, a total of 19,546,878 LA scores, one for every possible gene pair with respect to the imputed values, were computed. The scatter plot of ratios versus cutoffs obtained in step 3 is given as Figure 4.5. Based on this plot, we chose the cutoff to be 0.326. We then noticed that the gene SRP54 appeared 4 times among the 11 pairs with LA scores greater than the cutoff. Next, we permuted the imputed survival times for 1,000 runs and computed the average number of genes that appeared at least 4 times in gene pairs with LA scores greater than 0.326. The average number was only 0.032

CHAPTER 4. GENE SELECTION 35

Figure 4.5: The scatter plot of ratios of the expected number to the observed number versus cutoffs of the absolute value of LA scores in the training data.

in 1,000 permutation runs. Therefore, we selected these LA pairs as the candidate genes for the subsequent analysis. The LA hub gene SRP54, its 4 paired genes, and their LA scores with respect to the imputed survival value are given in Table 4.3.

Table 4.3: The LA hub gene SRP54 and its LA paired genes

Symbols   Full names                                               LA scores
SRP54     signal recognition particle 54kDa
SART3     squamous cell carcinoma antigen recognized by T cells 3  0.3810
NR2C1     nuclear receptor subfamily 2, group C, member 1          0.3659
CROP      cisplatin resistance-associated overexpressed protein    0.3269
PAWR      PRKC, apoptosis, WT1, regulator                          0.3268
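The screening just described can be sketched in a few lines. This is a generic sketch, not the thesis's actual implementation: it assumes the standard normal-score form of the LA statistic (standardize the two expression vectors, normal-score the conditioning values, and average the triple product), and all function names are illustrative. The pairwise double loop is written for clarity; over all ~19.5 million pairs it would need to be vectorized.

```python
import numpy as np
from scipy.stats import rankdata, norm

def la_score(x, y, z):
    """LA score of the pair (x, y) with respect to conditioning values z:
    standardize x and y, replace z by its normal scores, and average
    the elementwise triple product."""
    n = len(z)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    z = norm.ppf(rankdata(z) / (n + 1))   # normal-score transform
    return float(np.mean(x * y * z))

def selected_hub_counts(expr, z, cutoff):
    """For each gene (column of expr), count the pairs it appears in
    whose absolute LA score exceeds the cutoff."""
    p = expr.shape[1]
    counts = np.zeros(p, dtype=int)
    for i in range(p):
        for j in range(i + 1, p):
            if abs(la_score(expr[:, i], expr[:, j], z)) > cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts

def expected_hubs(expr, z, cutoff, min_pairs=4, runs=1000, seed=0):
    """Average number of genes appearing in at least min_pairs high-|LA|
    pairs when the conditioning values are randomly permuted."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(runs):
        counts = selected_hub_counts(expr, rng.permutation(z), cutoff)
        total += int((counts >= min_pairs).sum())
    return total / runs
```

Comparing the observed count of such hub genes (here, 1: SRP54) with the permutation average (0.032) is what justifies treating the selected pairs as signal rather than chance.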

Chapter 5

Signature construction

5.1 Methodology of modified sliced inverse regression for censored data

In the gene signature construction part, we first applied the modified sliced inverse regression for censored data by Li et al. (1999) to reduce dimensions. After finding the effective dimension reduction (e.d.r.) space, we could project the gene expression profiles onto it and fit a further survival model if necessary.

Sliced inverse regression by Li (1991) [15] was originally introduced for dimension reduction. Assume the p-dimensional regressor X and the response Y satisfy the dimension reduction model

Y = g(β1'X, β2'X, ..., βk'X, ϵ),     (5.1)

and the linear design condition: for any b in R^p,

E(b'X | β1'X, β2'X, ..., βk'X) = c0 + c1β1'X + c2β2'X + ... + ckβk'X,     (5.2)

for some constants c0, c1, ..., ck, the effective dimension reduction (e.d.r.) space, B = span(β1, β2, ..., βk), could be estimated by the eigenvalue decomposition of ΣE[X|Y] with respect to ΣX, where ΣE[X|Y] = cov(E[X | Y]) and ΣX = cov(X). The function g and the distribution of ϵ need not be specified for estimating the e.d.r. space. The key observation, stated as Theorem 3.1 in Li (1991) [15], is that under conditions (5.1) and (5.2) the centered inverse regression curve E[X | Y] − E[X] is contained in the linear subspace spanned by ΣXβ1, ΣXβ2, ..., ΣXβk. Therefore, we could estimate the e.d.r. space by estimating the inverse regression curve. To determine how many e.d.r. directions should be selected, Li (1991) also proposed a large-sample chi-squared test for the significance of the estimated e.d.r. directions, which are called the SIR directions. In practice, the implementation of the sliced inverse regression method is summarized in the following steps:

1. Slice the range of the response Y into several slices;

2. Compute the sample mean of X within each slice;

3. Compute the estimated between-slice covariance matrix ˆΣE[X|Y] and the sample covariance matrix ˆΣX;

4. Conduct an eigenvalue decomposition of ˆΣE[X|Y] with respect to ˆΣX;

5. Apply the large-sample chi-squared test to select the significant leading eigenvectors to be the SIR directions.
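The basic procedure can be sketched compactly. This is a generic illustration of Li's (1991) method, not the thesis's code: it uses equal-count slices and solves the generalized eigenproblem with SciPy; the chi-squared test for the number of directions is omitted.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, n_slices=10):
    """Minimal sliced inverse regression: slice the response, average
    the centered regressors within each slice, and eigendecompose the
    between-slice covariance with respect to the sample covariance."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # slice y into (roughly) equal-count slices and form slice means
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Xc[idx].mean(axis=0)
        # weighted between-slice covariance of the slice means
        M += (len(idx) / n) * np.outer(m, m)
    # generalized eigenproblem  M b = lambda Cov(X) b
    evals, evecs = eigh(M, np.cov(Xc, rowvar=False))
    return evals[::-1], evecs[:, ::-1]   # largest eigenvalue first
```

On simulated data with a single-index response, the leading eigenvector of this decomposition aligns with the true direction β, which is exactly the content of Theorem 3.1.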

In survival data analysis, because of data censoring, applying the original sliced inverse regression by directly slicing the observed time Y may cause estimation bias. Li et al. (1999) [16] studied the effects on the original sliced inverse regression of two data censoring conditions. Under the independent censoring condition

C is independent of X and Y,     (5.3)

we can show that independent censoring does not affect the sliced inverse regression based on directly slicing the observed time Y. Under the conditional independent censoring condition, a more general condition,

Conditional on X, C is independent of Y,     (5.4)

directly slicing the observed time Y does cause estimation bias. Li et al. (1999) modified the original sliced inverse regression for this situation. As described above, an important step in sliced inverse regression is estimating the inverse regression curve E[X | Y]. For Y ∈ [yi, yi+1), the inverse regression curve can be expressed by

E[X | Y ∈ [yi, yi+1)] = E[X 1(Y ∈ [yi, yi+1))] / P(Y ∈ [yi, yi+1)),

where Y denotes the failure time. One can observe that

E[X 1(Y ≥ yi)] = E[X S(yi | X)] = E[X ω(yi, X) 1(Ỹ ≥ yi)],

where Ỹ is the observed time and the last equation holds under the condition (5.4). The weight function is defined by ω(t, X) = E[1(Y ≥ t) | X] / E[1(Ỹ ≥ t) | X] = S(t | X) / S̃(t | X), where S is the survival function of the failure time and S̃ is the survival function of the observed time. Then we have

E[X 1(Y ≥ yi+1)] = E[X ω(yi+1, X) 1(Ỹ ≥ yi+1)]

with a similar argument. Therefore, the inverse regression curve can be estimated by replacing the expectations with their first sample moments and plugging in a kernel estimate ˆω(·, ·) for the weight function ω(·, ·). The proofs of consistency and root-n convergence under some regularity conditions were given in Li et al. (1999) [16].
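The survival-function ratio defining the weight can be made explicit. Writing Ỹ = min(Y, C) for the observed time and S_C(t | X) = P(C ≥ t | X) for the censoring survival function (S_C is notation introduced here for illustration), the conditional independent censoring condition (5.4) gives

```latex
\tilde S(t \mid X) = P\bigl(\min(Y, C) \ge t \mid X\bigr)
                   = P(Y \ge t \mid X)\, P(C \ge t \mid X)
                   = S(t \mid X)\, S_C(t \mid X),
\qquad\text{hence}\qquad
\omega(t, X) = \frac{S(t \mid X)}{\tilde S(t \mid X)} = \frac{1}{S_C(t \mid X)}.
```

So the weight inflates each at-risk subject by the inverse of its probability of remaining uncensored, which is what restores the uncensored expectation E[X 1(Y ≥ t)] from the censored data.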

Since kernel estimation performs well only in the low-dimensional case, they also proposed an initial dimension reduction step, called double slicing, before applying the modified sliced inverse regression for censored data.

We assume that the censoring time C also satisfies the dimension reduction model

C = h(θ1'X, θ2'X, ..., θl'X, ϵ).

By applying the original sliced inverse regression, slicing the observed time Y separately for δ = 1 and δ = 0, the space spanned by the β's and θ's, called the joint e.d.r. space, can be estimated. Then we can replace X by its projection onto the estimated joint e.d.r. space for a low-dimensional kernel estimation of the weight function ˆω(·, ·). The modified sliced inverse regression procedure can

be summarized as the following steps:

1. Double-slice the survival time and censoring time and apply the original sliced inverse regression;

2. Apply the large-sample chi-squared test to select the first few significant joint SIR directions;

3. Project the regressors onto the space spanned by the joint SIR directions to estimate the conditional survival function and the weight function by kernel estimation;

4. Compute the estimated conditional expectation in each slice by plugging in the estimated weight function and the first sample moments;

5. Compute the estimated between-slice covariance matrix ˆΣE[X|Y] and the sample covariance matrix ˆΣX;

6. Conduct an eigenvalue decomposition of ˆΣE[X|Y] with respect to ˆΣX;

7. Apply the large-sample chi-squared test to select the significant leading eigenvectors to be the survival-time SIR directions.
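The weighted estimation in steps 4 through 6 can be sketched as follows. This is a deliberately simplified illustration, not the procedure of Li et al. (1999): the double slicing and kernel smoothing of steps 1 through 3 are replaced by an unconditional Kaplan-Meier estimate of the censoring survival, which is valid only under the independent censoring condition (5.3), and the chi-squared test of step 7 is omitted. All function names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def km_censoring_survival(t, delta):
    """Left-continuous Kaplan-Meier estimate of the censoring survival
    G(t-), evaluated at each subject's own observed time; censored
    observations (delta == 0) play the role of the 'events'."""
    n = len(t)
    g_at = np.empty(n)
    surv, at_risk = 1.0, n
    for i in np.argsort(t):
        g_at[i] = surv                # G just before t_i
        if delta[i] == 0:             # a censoring "event"
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return g_at

def weighted_censored_sir(X, t, delta, n_slices=5):
    """Inverse-censoring-weighted SIR on the uncensored observations:
    weighted slice means stand in for the conditional expectations of
    step 4, followed by the between-slice covariance (step 5) and the
    generalized eigendecomposition (step 6)."""
    Sigma = np.cov(X, rowvar=False)
    # weight 1/G(t-), clipped to keep late event times from exploding
    w = 1.0 / np.clip(km_censoring_survival(t, delta), 0.05, None)
    ev = delta == 1
    Xe, te, we = X[ev], t[ev], w[ev]
    mu = np.average(Xe, axis=0, weights=we)
    M = np.zeros((X.shape[1],) * 2)
    for idx in np.array_split(np.argsort(te), n_slices):
        m = np.average(Xe[idx], axis=0, weights=we[idx]) - mu
        M += (we[idx].sum() / we.sum()) * np.outer(m, m)
    evals, evecs = eigh(M, Sigma)
    return evals[::-1], evecs[:, ::-1]   # largest eigenvalue first
```

Even in this simplified form, the leading eigenvector tracks the survival-time e.d.r. direction on simulated censored data, which is the behaviour the full kernel-weighted procedure achieves under the weaker condition (5.4).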

5.2 Signature construction in the training data
