4.2 Liquid association
4.2.3 Gene selection in training data by LA
We implemented the LA hub gene selection procedure on the training data set. First, we carried out the modified imputation. In the second step, we computed a total of 19,546,878 LA scores, one for each possible gene pair, with respect to the imputed values. The scatter plot of ratios versus cutoffs in step 3 is given as Figure 4.5. Based on this scatter plot we chose the cutoff to be 0.326. We then noticed that the gene SRP54 appeared 4 times among the 11 pairs with absolute LA score greater than the cutoff. Next, we permuted the imputed survival times for 1,000 runs and computed the average number of genes that appeared at least 4 times in gene pairs with absolute LA score greater than 0.326. The average number was only 0.032
Figure 4.5: The scatter plot of the ratio of the expected number to the observed number versus cutoffs on the absolute value of LA scores in the training data.
in 1,000 permutation runs. Therefore, we selected these LA pairs as the candidate genes for the subsequent analysis. The LA hub gene SRP54, its 4 paired genes, and the LA scores with respect to the imputed survival values are given in Table 4.3.
Table 4.3: The LA hub gene SRP54 and its LA paired genes
Symbols   Full names                                                LA scores
SRP54     signal recognition particle 54kDa
SART3     squamous cell carcinoma antigen recognized by T cells 3   0.3810
NR2C1     nuclear receptor subfamily 2, group C, member 1           0.3659
CROP      cisplatin resistance-associated overexpressed protein     0.3269
PAWR      PRKC, apoptosis, WT1, regulator                           0.3268
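The pair-screening and permutation steps above can be sketched numerically. This is a toy illustration, not the thesis code: the function names (`normal_score`, `la_scores`, `hub_count`), the toy data sizes, and the use of the product-moment form of the LA score (normal-score transforms followed by a three-way product mean, as in Li's liquid association) are all assumptions; scipy is required for the normal-score transform.

```python
import numpy as np
from scipy.stats import rankdata, norm

def normal_score(x):
    # Rank-based normal-score transform commonly used for liquid association.
    n = len(x)
    return norm.ppf(rankdata(x) / (n + 1))

def la_scores(expr, z):
    """LA score of every gene pair with respect to z:
    LA(i, j | z) = mean(g_i * g_j * z) after normal-score transforms."""
    g = np.apply_along_axis(normal_score, 1, expr)   # genes x samples
    zs = normal_score(z)
    n = expr.shape[1]
    # (g * zs) @ g.T / n computes mean(g_i * g_j * zs) for all pairs at once.
    return (g * zs) @ g.T / n

def hub_count(scores, cutoff, min_pairs=4):
    """Number of genes appearing in >= min_pairs pairs with |LA| > cutoff."""
    mask = np.abs(np.triu(scores, k=1)) > cutoff
    per_gene = mask.sum(0) + mask.sum(1)             # appearances per gene
    return int((per_gene >= min_pairs).sum())

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 40))   # toy: 50 genes x 40 samples
surv = rng.normal(size=40)         # toy stand-in for imputed survival values

obs = hub_count(la_scores(expr, surv), cutoff=0.326)
# Permutation reference: average hub count over permuted survival values.
perm = np.mean([hub_count(la_scores(expr, rng.permutation(surv)), 0.326)
                for _ in range(100)])
```

A gene counts as a hub when its observed appearance count clearly exceeds the permutation average, as with SRP54 (4 observed versus 0.032 expected) above.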
Chapter 5
Signature construction
5.1 Methodology of modified sliced inverse regression for censored data
In the gene signature construction part, we first applied the modified sliced inverse regression for censored data by Li et al. (1999) to reduce dimensions. After finding the effective dimension reduction (e.d.r.) space, we could project the gene expression profiles onto it and fit a further survival model if necessary.
Sliced inverse regression by Li (1991) [15] was originally introduced for dimension reduction. Assuming the p-dimensional regressor X and the response Y◦ satisfied the dimension reduction model

Y◦ = g(β1′X, β2′X, ..., βk′X, ϵ)   (5.1)

and the linear design condition, for any b in Rp,

E(b′X | β1′X, β2′X, ..., βk′X) = c0 + c1β1′X + · · · + ckβk′X,   (5.2)
for some constants c0, c1, ..., ck, the effective dimension reduction (e.d.r.) space, B = span(β1, β2, ..., βk), could be estimated by the eigenvalue decomposition of ΣE[X|Y◦] with respect to ΣX, where ΣE[X|Y◦] = cov(E[X | Y◦]) and ΣX = cov(X). The function g and the distribution of ϵ did not need to be specified for estimating the e.d.r. space. The key observation, stated as Theorem 3.1 in Li (1991) [15], was that under conditions (5.1) and (5.2) the centered inverse regression curve E[X | Y◦] − E[X] is contained in the linear subspace spanned by ΣXβ1, ΣXβ2, ..., ΣXβk. Therefore, we could estimate the e.d.r. space by estimating the inverse regression curve. To determine how many e.d.r. directions should be selected, Li (1991) also proposed a large-sample chi-squared test for the significance of the estimated e.d.r. directions, which are called the SIR directions. In practice, the implementation of the sliced inverse regression method is summarized in the following steps:

1. Slice the range of the response Y◦ into several slices;
2. Compute the sample mean of X within each slice;
3. Compute the estimated between-slice covariance matrix Σ̂E[X|Y◦] and the sample covariance matrix Σ̂X;
4. Conduct an eigenvalue decomposition of Σ̂E[X|Y◦] with respect to Σ̂X;
5. Apply the large-sample chi-squared test to select the significant leading eigenvectors to be the SIR directions.
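The slicing and eigendecomposition steps can be sketched in a few lines. This is a minimal illustration assuming equal-size slices and a fixed toy model, with the chi-squared test for the number of directions omitted; all names are hypothetical:

```python
import numpy as np

def sir(X, y, n_slices=10):
    """Basic sliced inverse regression (Li, 1991), minimal sketch.
    Returns eigenvalues/eigenvectors of Cov(E[X|Y]) w.r.t. Cov(X)."""
    n, p = X.shape
    Xc = X - X.mean(0)
    # Slice the range of y into n_slices slices of roughly equal size.
    slices = np.array_split(np.argsort(y), n_slices)
    # Weighted between-slice covariance of the slice means of X.
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(0)
        M += len(idx) / n * np.outer(m, m)
    # Eigendecomposition of M with respect to Cov(X) via Sigma^{-1} M.
    Sigma = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, M))
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], evecs.real[:, order]

# Toy model y = g(beta'X) + noise: the leading SIR direction
# should align with beta up to sign.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = np.exp(X @ beta) + 0.1 * rng.normal(size=500)
evals, evecs = sir(X, y)
b1 = evecs[:, 0] / np.linalg.norm(evecs[:, 0])
```

The generalized eigenproblem is solved here by forming Σ̂X⁻¹M directly, which is adequate for a sketch; a numerically careful implementation would standardize X first.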
In survival data analysis, due to data censoring, applying the original sliced inverse regression by directly slicing the observed time Y may cause estimation bias. Li et al. (1999) [16] studied the effects on the original sliced inverse regression of two data censoring conditions. Under the independent censoring condition
C is independent of X and Y◦,   (5.3)
we can show that independent censoring does not affect the sliced inverse regression obtained by directly slicing the observed time Y. Under the conditional independent censoring condition, a more general condition,
Conditional on X, C is independent of Y◦,   (5.4)
directly slicing the observed time Y does cause estimation bias. Li et al. (1999) modified the original sliced inverse regression for this situation. As described above, an important step in sliced inverse regression is estimating the inverse regression curve E[X | Y◦]. For Y◦ ∈ [y◦_i, y◦_{i+1}), the inverse regression curve can be expressed by

E[X | Y◦ ∈ [y◦_i, y◦_{i+1})] = ( E[X1(Y◦ ≥ y◦_i)] − E[X1(Y◦ ≥ y◦_{i+1})] ) / P(y◦_i ≤ Y◦ < y◦_{i+1}),

where the y◦_i's are the slicing points of the failure time. One can observe that

E[X1(Y◦ ≥ y◦_i)] = E[ X E[1(Y◦ ≥ y◦_i) | X] ] = E[X S◦(y◦_i | X)],

where the last equation holds under condition (5.4). A weight function is defined by

ω(t, t′, X) = E[1(Y◦ ≥ t) | X] / E[1(Y◦ ≥ t′) | X] = S◦(t | X) / S◦(t′ | X).

Under the conditional independent censoring assumption, we can show that S◦(t | X)/S◦(t′ | X) = S(t | X)/S(t′ | X), where S◦ is the survival function of the failure time and S is the survival function of the observed time. Then we have

E[X1(Y◦ ≥ y◦_{i+1})] = E[X S◦(y◦_{i+1} | X)]

with a similar argument. Therefore, the inverse regression curve can be estimated by replacing the expectations with first sample moments and replacing the weight function ω(·, ·, ·) with its kernel estimate ω̂(·, ·, ·). The proof of the consistency and the root-n rate of convergence under some regularity conditions were given in Li et al. (1999) [16].
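To make the reweighting idea concrete, here is a simplified numerical sketch under the stronger independent censoring condition (5.3), where the censoring survival function can be estimated by an unconditional Kaplan-Meier curve; the kernel estimate ω̂ of Li et al. (1999) would replace the unconditional `Sc` below under condition (5.4). All names and this simplification are assumptions, not the thesis implementation:

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier estimate; returns a step function S(t).
    events[i] = 1 if the i-th time is an observed event (no ties assumed)."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(np.where(d == 1, 1.0 - 1.0 / at_risk, 1.0))
    def S(x):
        k = np.searchsorted(t, x, side="right")
        return 1.0 if k == 0 else float(surv[k - 1])
    return S

def weighted_slice_mean(X, Y, delta, t0, t1):
    """Estimate E[X | Y0 in [t0, t1)] from censored data (Y, delta) by
    reweighting observed-time indicators with the censoring survival."""
    Sc = km_survival(Y, 1 - delta)       # delta = 0 marks a censoring event
    def m(t):                            # estimates E[X 1(Y0 >= t)]
        return (X * (Y >= t)[:, None]).mean(0) / Sc(t)
    def p(t):                            # estimates P(Y0 >= t)
        return float((Y >= t).mean()) / Sc(t)
    return (m(t0) - m(t1)) / (p(t0) - p(t1))
```

The division by `Sc(t)` works because E[1(Y ≥ t)] = P(Y◦ ≥ t)·S_C(t) when the censoring time is independent of both X and Y◦.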
Since the kernel estimation performs well only in the low-dimensional case, they also proposed an initial dimension reduction step, called double slicing, before applying the modified sliced inverse regression for censored data.
We assumed that the censoring time C also satisfied the dimension reduction model
C = h(θ1′X, θ2′X, ..., θl′X, ϵ′).
By applying the original sliced inverse regression, the space spanned by the β's and θ's, called the joint e.d.r. space, can be estimated by slicing the observed time Y for δ = 1 and δ = 0 separately. Then we can replace X by its projection onto the estimated joint e.d.r. space for a low-dimensional kernel estimation of the weight function ω̂(·, ·, ·). The modified sliced inverse regression procedure can be summarized in the following steps:
1. Double-slice the survival time and the censoring time and apply the original sliced inverse regression;
2. Apply the large-sample chi-squared test to select the first few significant joint SIR directions;
3. Project the regressors onto the space spanned by the joint SIR directions to estimate the conditional survival function and the weight function by kernel estimation;
4. Compute the estimated conditional expectation in each slice by plugging in the estimated weight function and the first sample moments;
5. Compute the estimated between-slice covariance matrix Σ̂E[X|Y◦] and the sample covariance matrix Σ̂X;
6. Conduct an eigenvalue decomposition of Σ̂E[X|Y◦] with respect to Σ̂X;
7. Apply the large-sample chi-squared test to select the significant leading eigenvectors to be the survival-time SIR directions.
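Steps 1-2 above, the double-slice stage, can be sketched as follows. This is one plausible reading, assuming the censored and uncensored between-slice covariances are pooled with weights proportional to group size, and with a fixed number of directions k in place of the chi-squared test; all names are hypothetical:

```python
import numpy as np

def slice_moment_matrix(X, y, n_slices):
    """Weighted between-slice covariance of the slice means of X, slicing on y."""
    n, p = X.shape
    Xc = X - X.mean(0)
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Xc[idx].mean(0)
        M += len(idx) / n * np.outer(m, m)
    return M

def double_slice_directions(X, Y, delta, n_slices=5, k=2):
    """Slice the observed time separately for uncensored (delta=1) and
    censored (delta=0) cases, pool the between-slice covariances, and take
    the k leading generalized eigenvectors as joint SIR directions."""
    M = np.zeros((X.shape[1], X.shape[1]))
    for d in (0, 1):
        grp = delta == d
        M += grp.mean() * slice_moment_matrix(X[grp], Y[grp], n_slices)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, M))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:k]]

# Toy censored data (hypothetical): project X onto the joint directions.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
Y = np.minimum(rng.exponential(size=300), rng.exponential(size=300))
delta = rng.integers(0, 2, size=300)
B = double_slice_directions(X, Y, delta)
X_proj = X @ B   # low-dimensional input for the kernel step (step 3)
```

The projected regressors `X_proj` then feed the kernel estimation of the conditional survival and weight functions in steps 3-4.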