4.2 Liquid association
4.2.3 Gene selection in training data by LA
We implemented the LA hub gene selection procedure on the training data set. First, we carried out the modified imputation. In the second step, we computed a total of 19,546,878 LA scores, one for each possible gene pair, with respect to the imputed values. The scatter plot of ratios versus cutoffs in step 3 is given as Figure 4.5. Based on this scatter plot we chose the cutoff to be 0.326. We then noticed that the gene SRP54 appeared 4 times among the 11 pairs with absolute LA score greater than the cutoff. Next, we permuted the imputed survival times for 1,000 runs and computed the average number of genes that appeared at least 4 times in gene pairs with absolute LA score greater than 0.326. The average number was only 0.032
Figure 4.5: The scatter plot of the ratio of the expected number to the observed number versus cutoffs on the absolute value of LA scores in the training data.
in 1,000 permutation runs. Therefore, we selected these LA pairs as the candidate genes for the subsequent analysis. The LA hub gene SRP54, its 4 paired genes, and the LA scores with respect to the imputed survival values are given in Table 4.3.
Table 4.3: The LA hub gene SRP54 and its LA paired genes
Symbols   Full names                                                LA scores
SRP54     signal recognition particle 54kDa
SART3     squamous cell carcinoma antigen recognized by T cells 3   0.3810
NR2C1     nuclear receptor subfamily 2, group C, member 1           0.3659
CROP      cisplatin resistance-associated overexpressed protein     0.3269
PAWR      PRKC, apoptosis, WT1, regulator                           0.3268
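The pair-screening and permutation steps above can be sketched numerically. This is a toy illustration, not the thesis code: the function names (`normal_score`, `la_scores`, `hub_count`), the toy data sizes, and the use of the product-moment form of the LA score (normal-score transforms followed by a three-way product mean, as in Li's liquid association) are all assumptions; scipy is required for the normal-score transform.

```python
import numpy as np
from scipy.stats import rankdata, norm

def normal_score(x):
    # Rank-based normal-score transform commonly used for liquid association.
    n = len(x)
    return norm.ppf(rankdata(x) / (n + 1))

def la_scores(expr, z):
    """LA score of every gene pair with respect to z:
    LA(i, j | z) = mean(g_i * g_j * z) after normal-score transforms."""
    g = np.apply_along_axis(normal_score, 1, expr)   # genes x samples
    zs = normal_score(z)
    n = expr.shape[1]
    # (g * zs) @ g.T / n computes mean(g_i * g_j * zs) for all pairs at once.
    return (g * zs) @ g.T / n

def hub_count(scores, cutoff, min_pairs=4):
    """Number of genes appearing in >= min_pairs pairs with |LA| > cutoff."""
    mask = np.abs(np.triu(scores, k=1)) > cutoff
    per_gene = mask.sum(0) + mask.sum(1)             # appearances per gene
    return int((per_gene >= min_pairs).sum())

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 40))   # toy: 50 genes x 40 samples
surv = rng.normal(size=40)         # toy stand-in for imputed survival values

obs = hub_count(la_scores(expr, surv), cutoff=0.326)
# Permutation reference: average hub count over permuted survival values.
perm = np.mean([hub_count(la_scores(expr, rng.permutation(surv)), 0.326)
                for _ in range(100)])
```

A gene counts as a hub when its observed appearance count clearly exceeds the permutation average, as with SRP54 (4 observed versus 0.032 expected) above.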
Chapter 5
Signature construction
5.1 Methodology of modified sliced inverse regression for censored data
In the gene signature construction part, we first applied the modified sliced inverse regression for censored data by Li et al. (1999) to reduce dimensions. After finding the effective dimension reduction (e.d.r.) space, we could project the gene expression profiles onto it and fit a further survival model if necessary.
Sliced inverse regression by Li (1991) [15] was originally introduced for dimension reduction. Assuming the p-dimensional regressor X and the response Y◦ satisfied the dimension reduction model

Y◦ = g(β1′X, β2′X, ..., βk′X, ϵ)   (5.1)

and the linear design condition, for any b in Rp,

E(b′X | β1′X, β2′X, ..., βk′X) = c0 + c1β1′X + · · · + ckβk′X,   (5.2)
for some constants c0, c1, ..., ck, the effective dimension reduction (e.d.r.) space, B = span(β1, β2, ..., βk), could be estimated by the eigenvalue decomposition of ΣE[X|Y◦] with respect to ΣX, where ΣE[X|Y◦] = cov(E[X | Y◦]) and ΣX = cov(X). The function g and the distribution of ϵ did not need to be specified for estimating the e.d.r. space. The key observation, stated as Theorem 3.1 in Li (1991) [15], was that under conditions (5.1) and (5.2) the centered inverse regression curve E[X | Y◦] − E[X] is contained in the linear subspace spanned by ΣXβ1, ΣXβ2, ..., ΣXβk. Therefore, we could estimate the e.d.r. space by estimating the inverse regression curve. To determine how many e.d.r. directions should be selected, Li (1991) also proposed a large-sample chi-squared test for the significance of the estimated e.d.r. directions, which are called the SIR directions. In practice, the implementation of the sliced inverse regression method is summarized in the following steps:

1. Slice the range of the response Y◦ into several slices;
2. Compute the sample mean of X within each slice;
3. Compute the estimated between-slice covariance matrix Σ̂E[X|Y◦] and the sample covariance matrix Σ̂X;
4. Conduct an eigenvalue decomposition of Σ̂E[X|Y◦] with respect to Σ̂X;
5. Apply the large-sample chi-squared test to select the significant leading eigenvectors to be the SIR directions.
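The slicing and eigendecomposition steps can be sketched in a few lines. This is a minimal illustration assuming equal-size slices and a fixed toy model, with the chi-squared test for the number of directions omitted; all names are hypothetical:

```python
import numpy as np

def sir(X, y, n_slices=10):
    """Basic sliced inverse regression (Li, 1991), minimal sketch.
    Returns eigenvalues/eigenvectors of Cov(E[X|Y]) w.r.t. Cov(X)."""
    n, p = X.shape
    Xc = X - X.mean(0)
    # Slice the range of y into n_slices slices of roughly equal size.
    slices = np.array_split(np.argsort(y), n_slices)
    # Weighted between-slice covariance of the slice means of X.
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(0)
        M += len(idx) / n * np.outer(m, m)
    # Eigendecomposition of M with respect to Cov(X) via Sigma^{-1} M.
    Sigma = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, M))
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], evecs.real[:, order]

# Toy model y = g(beta'X) + noise: the leading SIR direction
# should align with beta up to sign.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = np.exp(X @ beta) + 0.1 * rng.normal(size=500)
evals, evecs = sir(X, y)
b1 = evecs[:, 0] / np.linalg.norm(evecs[:, 0])
```

The generalized eigenproblem is solved here by forming Σ̂X⁻¹M directly, which is adequate for a sketch; a numerically careful implementation would standardize X first.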
In survival data analysis, due to data censoring, applying the original sliced inverse regression by directly slicing the observed time Y may cause estimation bias. Li et al. (1999) [16] studied the effects on the original sliced inverse regression of two data censoring conditions. Under the independent censoring condition
C is independent of X and Y◦,   (5.3)
we can show that independent censoring does not affect the sliced inverse regression obtained by directly slicing the observed time Y. Under the conditional independent censoring condition, a more general condition,
Conditional on X, C is independent of Y◦,   (5.4)
directly slicing the observed time Y does cause estimation bias. Li et al. (1999) modified the original sliced inverse regression for this situation. As described above, an important step in sliced inverse regression is estimating the inverse regression curve E[X | Y◦]. For Y◦ ∈ [y◦_i, y◦_{i+1}), the inverse regression curve can be expressed by

E[X | Y◦ ∈ [y◦_i, y◦_{i+1})] = ( E[X1(Y◦ ≥ y◦_i)] − E[X1(Y◦ ≥ y◦_{i+1})] ) / P(y◦_i ≤ Y◦ < y◦_{i+1}),

where the y◦_i's are the slicing points of the failure time. One can observe that

E[X1(Y◦ ≥ y◦_i)] = E[ X E[1(Y◦ ≥ y◦_i) | X] ] = E[X S◦(y◦_i | X)],

where the last equation holds under condition (5.4). A weight function is defined by

ω(t, t′, X) = E[1(Y◦ ≥ t) | X] / E[1(Y◦ ≥ t′) | X] = S◦(t | X) / S◦(t′ | X).

Under the conditional independent censoring assumption, we can show that S◦(t | X)/S◦(t′ | X) = S(t | X)/S(t′ | X), where S◦ is the survival function of the failure time and S is the survival function of the observed time. Then we have

E[X1(Y◦ ≥ y◦_{i+1})] = E[X S◦(y◦_{i+1} | X)]

with a similar argument. Therefore, the inverse regression curve can be estimated by replacing the expectations with first sample moments and replacing the weight function ω(·, ·, ·) with its kernel estimate ω̂(·, ·, ·). The proof of the consistency and the root-n rate of convergence under some regularity conditions were given in Li et al. (1999) [16].
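To make the reweighting idea concrete, here is a simplified numerical sketch under the stronger independent censoring condition (5.3), where the censoring survival function can be estimated by an unconditional Kaplan-Meier curve; the kernel estimate ω̂ of Li et al. (1999) would replace the unconditional `Sc` below under condition (5.4). All names and this simplification are assumptions, not the thesis implementation:

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier estimate; returns a step function S(t).
    events[i] = 1 if the i-th time is an observed event (no ties assumed)."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(np.where(d == 1, 1.0 - 1.0 / at_risk, 1.0))
    def S(x):
        k = np.searchsorted(t, x, side="right")
        return 1.0 if k == 0 else float(surv[k - 1])
    return S

def weighted_slice_mean(X, Y, delta, t0, t1):
    """Estimate E[X | Y0 in [t0, t1)] from censored data (Y, delta) by
    reweighting observed-time indicators with the censoring survival."""
    Sc = km_survival(Y, 1 - delta)       # delta = 0 marks a censoring event
    def m(t):                            # estimates E[X 1(Y0 >= t)]
        return (X * (Y >= t)[:, None]).mean(0) / Sc(t)
    def p(t):                            # estimates P(Y0 >= t)
        return float((Y >= t).mean()) / Sc(t)
    return (m(t0) - m(t1)) / (p(t0) - p(t1))
```

The division by `Sc(t)` works because E[1(Y ≥ t)] = P(Y◦ ≥ t)·S_C(t) when the censoring time is independent of both X and Y◦.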
Since the kernel estimation performs well only in the low-dimensional case, they also proposed an initial dimension reduction step, called double slicing, before applying the modified sliced inverse regression for censored data.
We assumed that the censoring time C also satisfied the dimension reduction model
C = h(θ1′X, θ2′X, ..., θl′X, ϵ′).
By applying the original sliced inverse regression, the space spanned by the β's and θ's, called the joint e.d.r. space, can be estimated by slicing the observed time Y for δ = 1 and δ = 0 separately. Then we can replace X by its projection onto the estimated joint e.d.r. space for a low-dimensional kernel estimation of the weight function ω̂(·, ·, ·). The modified sliced inverse regression procedure can be summarized in the following steps:
1. Double-slice the survival time and the censoring time and apply the original sliced inverse regression;
2. Apply the large-sample chi-squared test to select the first few significant joint SIR directions;
3. Project the regressors onto the space spanned by the joint SIR directions to estimate the conditional survival function and the weight function by kernel estimation;
4. Compute the estimated conditional expectation in each slice by plugging in the estimated weight function and the first sample moments;
5. Compute the estimated between-slice covariance matrix Σ̂E[X|Y◦] and the sample covariance matrix Σ̂X;
6. Conduct an eigenvalue decomposition of Σ̂E[X|Y◦] with respect to Σ̂X;
7. Apply the large-sample chi-squared test to select the significant leading eigenvectors to be the survival-time SIR directions.
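Steps 1-2 above, the double-slice stage, can be sketched as follows. This is one plausible reading, assuming the censored and uncensored between-slice covariances are pooled with weights proportional to group size, and with a fixed number of directions k in place of the chi-squared test; all names are hypothetical:

```python
import numpy as np

def slice_moment_matrix(X, y, n_slices):
    """Weighted between-slice covariance of the slice means of X, slicing on y."""
    n, p = X.shape
    Xc = X - X.mean(0)
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Xc[idx].mean(0)
        M += len(idx) / n * np.outer(m, m)
    return M

def double_slice_directions(X, Y, delta, n_slices=5, k=2):
    """Slice the observed time separately for uncensored (delta=1) and
    censored (delta=0) cases, pool the between-slice covariances, and take
    the k leading generalized eigenvectors as joint SIR directions."""
    M = np.zeros((X.shape[1], X.shape[1]))
    for d in (0, 1):
        grp = delta == d
        M += grp.mean() * slice_moment_matrix(X[grp], Y[grp], n_slices)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, M))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:k]]

# Toy censored data (hypothetical): project X onto the joint directions.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
Y = np.minimum(rng.exponential(size=300), rng.exponential(size=300))
delta = rng.integers(0, 2, size=300)
B = double_slice_directions(X, Y, delta)
X_proj = X @ B   # low-dimensional input for the kernel step (step 3)
```

The projected regressors `X_proj` then feed the kernel estimation of the conditional survival and weight functions in steps 3-4.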