4.1.1 Imputation of survival time with right censoring

Here we used the normal quantile transformed correlation coefficient described in the previous chapter to measure the correlation between the survival time $y$ and each gene expression profile $x_g$, where $g = 1, 2, \ldots, G$. Because of censoring, we could not directly use the correlation coefficient between the observed time and the gene expression $x_g$. Let $\delta = 1\{Y \le C\}$ be the indicator of each patient's status, where $C$ was the censoring time, so that $\delta = 1$ for an observed event and $\delta = 0$ for a right-censored observation. To reduce the effect of right censoring, an imputation $\hat{Y}$ was needed for the observations with $\delta = 0$.

Suppose that $Y$ was the true survival time with survival function $S(y) = P(Y > y)$ and density function $f(y)$. From elementary statistics, the conditional mean $E(Y \mid Y > y)$ minimized the 2-norm imputation error loss $l_2(\hat{Y}) = E[(\hat{Y} - Y)^2 \mid Y > y]$. However, Wu et al. (2008) pointed out the limitation of estimating the conditional mean given $Y > y$ by the Kaplan-Meier estimate when, in practice, the last observation was censored. Therefore, we did not adopt the conditional mean. If the 1-norm imputation error loss $l_1(\hat{Y}) = E[|\hat{Y} - Y| \mid Y > y]$ was used instead, the minimizer was the conditional median given $Y > y$. We could therefore impute the censored data by the conditional median and evaluate the normal quantile transformed correlation coefficient, $\mathrm{corr}(N(x_g), N(\hat{y}))$, to estimate the correlation between the survival time $Y$ and each gene expression profile $X_g$.
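The normal quantile transformed correlation coefficient can be sketched as follows. This is a minimal illustration, not the thesis's own code: the exact definition of $N(\cdot)$ is given in an earlier chapter that is not reproduced here, so the score formula $\Phi^{-1}(\mathrm{rank}/(n+1))$ and the function names `rank_normal_scores` and `nqt_corr` are assumptions.

```python
from statistics import NormalDist


def rank_normal_scores(x):
    """Normal quantile transformation based on ranks.

    Assumption: score_i = Phi^{-1}(rank_i / (n + 1)), with average ranks
    for ties. The thesis may use a slightly different rank-to-score rule.
    """
    n = len(x)
    # Average 1-based rank: (# strictly smaller) + (# equal, incl. self, + 1) / 2
    ranks = [sum(v < u for v in x) + (sum(v == u for v in x) + 1) / 2 for u in x]
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(r / (n + 1)) for r in ranks]


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def nqt_corr(x, y):
    """corr(N(x), N(y)): Pearson correlation of the normal scores."""
    return pearson(rank_normal_scores(x), rank_normal_scores(y))
```

Because only ranks enter the scores, applying an increasing transformation to either argument leaves `nqt_corr` unchanged.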

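First, we noted that the normal quantile transformation $N(\cdot)$ depended only on the ranks of the variables. Therefore, it was invariant under any monotone increasing transformation, that is, $N((h(x_1), h(x_2), \ldots, h(x_n))) = N((x_1, x_2, \ldots, x_n))$ for any monotone increasing function $h$. Second, the conditional median given that $Y > y_i$ was the value $\hat{y}_i$ satisfying $S(\hat{y}_i) = \frac{1}{2} S(y_i)$, and $1 - S(\cdot)$ was a monotone increasing function, so that we have

$$N(\hat{y}) = N\big((1 - \tilde{S}_1, 1 - \tilde{S}_2, \ldots, 1 - \tilde{S}_n)\big), \qquad \tilde{S}_i = \begin{cases} S(y_i), & \text{if } \delta_i = 1, \\ \frac{1}{2} S(y_i), & \text{if } \delta_i = 0. \end{cases}$$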

Therefore, instead of estimating the conditional median itself, we could evaluate $N(\hat{y})$ by plugging in the estimated survival function $\hat{S}(\cdot)$. Wu et al. (2008) proposed a nonparametric imputation procedure that used the Kaplan-Meier estimate of the survival function. The procedure is summarized in the following steps:

Imputation - Kaplan-Meier based

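1. Calculate $\hat{S}_i$, the Kaplan-Meier estimate of the survival probability at the observed time $y_i$;

2. Impute the survival probability by the predicted conditional median

$$\tilde{S}_i = \begin{cases} \hat{S}_i, & \text{if } \delta_i = 1, \\ \frac{1}{2} \hat{S}_i, & \text{if } \delta_i = 0; \end{cases}$$

3. Calculate the percentile $\hat{p}_i = 1 - \tilde{S}_i$;

4. Calculate the imputed $N(\hat{y})$ by performing the normal quantile transformation on $\hat{p}_i$.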
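The four steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation of Wu et al. (2008): the function names are hypothetical, the Kaplan-Meier estimate is evaluated at each observed time including the drop at that time, and the rank-to-score rule $\Phi^{-1}(\mathrm{rank}/(n+1))$ is assumed. The factor $\frac{1}{2}$ comes from the fact that the conditional median $m$ given $Y > y$ solves $S(m) = S(y)/2$.

```python
from statistics import NormalDist


def rank_normal_scores(x):
    """Normal scores from average ranks (assumed rule: Phi^{-1}(rank/(n+1)))."""
    n = len(x)
    ranks = [sum(v < u for v in x) + (sum(v == u for v in x) + 1) / 2 for u in x]
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(r / (n + 1)) for r in ranks]


def kaplan_meier_at_obs(times, events):
    """Kaplan-Meier estimate S_hat(y_i) at each observed time y_i.

    events[i] is 1 for an observed event and 0 for a censored time.
    """
    surv_at = {}
    s = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        s *= 1.0 - deaths / at_risk
        surv_at[t] = s
    return [surv_at[t] for t in times]


def km_impute_scores(times, events):
    """Kaplan-Meier based imputation, returning the imputed N(y_hat)."""
    s_hat = kaplan_meier_at_obs(times, events)                 # step 1
    s_tilde = [s if d == 1 else 0.5 * s                        # step 2
               for s, d in zip(s_hat, events)]
    p_hat = [1.0 - s for s in s_tilde]                         # step 3
    return rank_normal_scores(p_hat)                           # step 4


# Toy data: the second patient is censored at time 2.
scores = km_impute_scores([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```

In the toy data the censored patient is pushed later in the ranking than the raw censoring time alone would suggest, which is the intended effect of the imputation.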

The implementation of this nonparametric imputation procedure is easy, since we only have to calculate the Kaplan-Meier estimate of the survival probability. However, the question remains how to improve the imputation when extra information is available. The TNM tumor stage is strongly related to the survival time of NSCLC patients, so it is not suitable to impute the same survival time for patients in different stages who were censored at the same time. This motivated us to modify the imputation procedure with this extra information.

Modified imputation - Cox proportional hazard model based

Previous studies indicated that the survival of NSCLC patients in different TNM tumor stages differed significantly, which motivated us to modify the imputation procedure. Since the original imputation procedure followed directly from the Kaplan-Meier estimate of the survival probability, one natural idea was to modify the survival probability estimation by incorporating the TNM tumor stage $Z$. We assumed that the conditional survival function given $Z = z$ satisfied the Cox proportional hazard model.

The Cox proportional hazard model is one of the best-known survival regression models. It assumes that the hazard function given $Z = z$ is proportional to a baseline hazard function and that the logarithm of the ratio depends linearly on the regressors,

$$\lambda(y \mid Z = z) = \lambda_0(y)\, e^{\gamma z}.$$

Here we let $Z$ be a four-level factor indicating whether the patient's TNM tumor stage was IA, IB, II or III/IV. Then the relationship between the conditional survival function given $Z = z$ and the baseline survival function can be expressed as

$$S(y \mid Z = z) = [S_0(y)]^{e^{\gamma z}}. \qquad (4.1)$$

To check the assumption of the Cox proportional hazard model, we first drew the log-log Kaplan-Meier curves of the different stages for the training data set. From equation (4.1), the log-log Kaplan-Meier curves should be parallel if the assumption held. Figure 4.1 showed no strong evidence of non-parallelism for our data. Second, we drew the observed versus expected plot for the training data set (Figure 4.2), and it also showed no strong evidence against the assumption.

[Figure 4.1: The log-log Kaplan-Meier curves of different stages for the training data set. Plot title: Log-Log Survival curve.]
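To see why the log-log curves should be parallel under the model, take logarithms of equation (4.1) twice. The short derivation below uses only the symbols already defined in the text.

```latex
\begin{align*}
S(y \mid Z = z) &= [S_0(y)]^{e^{\gamma z}} \\
-\log S(y \mid Z = z) &= e^{\gamma z}\,\bigl(-\log S_0(y)\bigr) \\
\log\bigl(-\log S(y \mid Z = z)\bigr) &= \gamma z + \log\bigl(-\log S_0(y)\bigr)
\end{align*}
```

The last line shows that the log-log survival curves of two stages differ only by a constant that does not depend on $y$, so they are vertical shifts of one another.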

[Figure 4.2: Observed Kaplan-Meier plot and Cox proportional hazard model. Plot title: Observed and Expected Survival Curve.]

Cox coefficients of (Stage IB, II, III/IV) = (0.50, 1.05, 1.81). The hazard ratios of (Stage IB, II, III/IV) = (1.64, 2.85, 6.12).
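As a quick arithmetic check, a hazard ratio in the Cox model is the exponential of the corresponding coefficient. The sketch below recomputes the ratios from the rounded coefficients quoted above. It assumes that stage IA is the reference level, which the text implies but does not state, and the variable names are illustrative.

```python
import math

# Rounded values quoted with Figure 4.2; stage IA is assumed to be the reference.
cox_coefficients = {"IB": 0.50, "II": 1.05, "III/IV": 1.81}
reported_hazard_ratios = {"IB": 1.64, "II": 2.85, "III/IV": 6.12}

# Hazard ratio relative to the reference stage: exp(coefficient).
hazard_ratios = {stage: math.exp(coef) for stage, coef in cox_coefficients.items()}

for stage, ratio in hazard_ratios.items():
    print(f"Stage {stage}: exp(coef) = {ratio:.3f}, reported = {reported_hazard_ratios[stage]}")
```

The recomputed values agree with the reported ones to within about 0.01, which is what one would expect if the reported ratios were computed from unrounded coefficients.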

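Therefore, we assumed that $Y \mid Z = z$ satisfied the Cox proportional hazard model and imputed the censored time by $\hat{y}_i = \mathrm{median}(Y \mid Y > y_i, Z = z_i)$. As before, only the ranks of the imputed values mattered, so that

$$N(\hat{y}) = N\big((1 - S_0(\hat{y}_1), \ldots, 1 - S_0(\hat{y}_n))\big),$$

where $S_0(\cdot)$ was the baseline survival function, which was also a monotone function. Furthermore, by equation (4.1), we had $S_0(\hat{y}_i) = S_0(y_i)$ if $\delta_i = 1$ and

$$S_0(\hat{y}_i) = \left(\tfrac{1}{2}\right)^{1/\exp(\gamma z_i)} S_0(y_i) \quad \text{if } \delta_i = 0.$$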

From the equations above, we observed that the modified imputation method gave different weights to the censored survival probabilities of patients in different stages. In practice, the Cox coefficients $\gamma$ could be estimated by maximizing the partial likelihood, and the baseline survival function could be estimated by the Nelson-Aalen estimate or the Breslow estimate. The original procedure could then be implemented with the survival probability estimate and the imputation weights replaced. The modified imputation procedure is summarized in the following steps:

1. Estimate the Cox coefficients $\gamma$ for each TNM tumor stage and the baseline survival probability $\hat{S}_{0i}$;

2. Impute the survival probability by the predicted conditional median

$$\tilde{S}_{0i} = \begin{cases} \hat{S}_{0i}, & \text{if } \delta_i = 1, \\ \left(\frac{1}{2}\right)^{1/\exp(\gamma z_i)} \hat{S}_{0i}, & \text{if } \delta_i = 0; \end{cases}$$

3. Calculate the percentile $\tilde{p}_i = 1 - \tilde{S}_{0i}$;

4. Calculate the imputed $N(\hat{y})$ by performing the normal quantile transformation on $\tilde{p}_i$.
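A minimal sketch of the modified procedure follows. Several points are assumptions rather than the thesis's implementation: the Cox coefficients are taken as already estimated and passed in as a per-patient linear predictor `lin_pred[i]` $= \gamma z_i$ (fitting the partial likelihood is out of scope here), the baseline survival is estimated with the Breslow estimator, the rank-to-score rule is $\Phi^{-1}(\mathrm{rank}/(n+1))$, and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def rank_normal_scores(x):
    """Normal scores from average ranks (assumed rule: Phi^{-1}(rank/(n+1)))."""
    n = len(x)
    ranks = [sum(v < u for v in x) + (sum(v == u for v in x) + 1) / 2 for u in x]
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(r / (n + 1)) for r in ranks]


def breslow_baseline_survival(times, events, lin_pred):
    """Breslow estimate of the baseline survival S0 at each observed time.

    The cumulative baseline hazard adds, at each event time, the number of
    events divided by the sum of exp(lin_pred) over patients still at risk.
    """
    risk = [math.exp(lp) for lp in lin_pred]
    cum_hazard = 0.0
    surv_at = {}
    for t in sorted(set(times)):
        deaths = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        at_risk_weight = sum(r for u, r in zip(times, risk) if u >= t)
        cum_hazard += deaths / at_risk_weight
        surv_at[t] = math.exp(-cum_hazard)
    return [surv_at[t] for t in times]


def cox_impute_scores(times, events, lin_pred):
    """Modified imputation, returning the imputed N(y_hat)."""
    s0_hat = breslow_baseline_survival(times, events, lin_pred)      # step 1
    s0_tilde = [s if d == 1 else (0.5 ** math.exp(-lp)) * s          # step 2
                for s, d, lp in zip(s0_hat, events, lin_pred)]
    p_tilde = [1.0 - s for s in s0_tilde]                            # step 3
    return rank_normal_scores(p_tilde)                               # step 4


# Toy data: patients 2 and 3 are censored at the same time, but patient 3
# has a larger linear predictor (a more advanced stage).
scores = cox_impute_scores([1.0, 2.0, 2.0, 3.0], [1, 0, 0, 1], [0.0, 0.0, 1.81, 0.0])
```

In the toy data the two patients censored at the same time receive different scores: the patient with the larger linear predictor is imputed a shorter survival, which is the behavior the modification is meant to produce.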


4.1.2 A simulation comparison between two imputation methods

To demonstrate the improvement from our modification, we carried out a simulation study. First, we randomly generated 256 survival times from a Cox proportional hazard model with a four-level factor regressor. The levels of the factor were generated uniformly at random. The baseline survival distribution was exponential with rate parameter 1, and the Cox coefficients $\gamma$ for the four levels were set to (0, 2, 4, 8). Another 256 censoring times were randomly generated from an exponential distribution with rate parameter 3. The observed time was the minimum of the survival time and the censoring time. The average censoring rate was 0.5099.
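One run of the data-generating step described above might be sketched as follows. This is a hedged illustration and not the original simulation code: the function name, the seed handling, and the reading of "rate parameter" as the usual exponential rate are assumptions, so summary figures from this sketch (for example the censoring rate) need not match those reported in the text.

```python
import math
import random


def simulate_censored_sample(n=256, gammas=(0.0, 2.0, 4.0, 8.0),
                             baseline_rate=1.0, censor_rate=3.0, seed=0):
    """Generate one right-censored sample from an exponential Cox model.

    With an exponential baseline, the hazard for level k is
    baseline_rate * exp(gammas[k]), so survival times are exponential
    with that rate.
    """
    rng = random.Random(seed)
    levels, true_times, obs_times, events = [], [], [], []
    for _ in range(n):
        level = rng.randrange(len(gammas))           # uniform factor level
        hazard = baseline_rate * math.exp(gammas[level])
        y = rng.expovariate(hazard)                  # true survival time
        c = rng.expovariate(censor_rate)             # censoring time
        levels.append(level)
        true_times.append(y)
        obs_times.append(min(y, c))                  # observed time
        events.append(1 if y <= c else 0)            # delta = 1{Y <= C}
    return levels, true_times, obs_times, events


levels, true_times, obs_times, events = simulate_censored_sample()
censoring_rate = 1.0 - sum(events) / len(events)
```

Keeping the true survival times alongside the observed ones is what allows the imputed values to be compared with the truth in each run.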

To assess the performance of the imputation methods, we measured closeness by the normal quantile transformed correlation coefficient between the true survival times and the imputed values, $\mathrm{corr}(N(y), N(\hat{y}))$. Over 1,000 simulation runs, we applied both imputation methods and recorded the correlation coefficients in each run. The average correlation coefficient was 0.8684 for the Kaplan-Meier based imputation and 0.9096 for the modified imputation. Moreover, the Kaplan-Meier based imputation had a greater correlation coefficient than the modified imputation in only three runs. We concluded that the modified imputation method performed better when the Cox proportional hazard model assumption held. The results of our simulation are given in Table 4.1.


Table 4.1: Simulation comparison between two imputation methods

Estimated Cox model coefficients                                Average   S.D.
  Cox model coefficient $\hat{\gamma}_2$ ($\gamma_2 = 2$)       2.1096    0.6260
  Cox model coefficient $\hat{\gamma}_3$ ($\gamma_3 = 4$)       4.1355    0.6264
  Cox model coefficient $\hat{\gamma}_4$ ($\gamma_4 = 8$)       8.2389    0.7948

Normal quantile transformed correlation coefficient
between true survival time and imputed value                    Average   S.D.
  No imputation (observed time)                                 0.7786    0.0312
  KM based imputation                                           0.8683    0.0205
  Modified imputation                                           0.9096    0.0146

Censor rate                                                     0.5099    0.0308
