Imputation of survival time with right censoring


4.1.1 Imputation of survival time with right censoring

Here we used the normal quantile transformed correlation coefficient described in the previous chapter to measure the correlation between the survival time $y^\circ$ and each gene expression profile $x_g$, where $g = 1, 2, \ldots, G^*$. Because of censoring, we could not directly use the correlation coefficient between the observed time $y$ and the gene expression $x_g$. Let $\delta = 1_{\{Y^\circ \le C\}}$ be the indicator of each patient's status, where $C$ is the censoring time, so that $\delta = 1$ for an observed event and $\delta = 0$ for a right-censored observation. To reduce the effect of right censoring, an imputed value $\hat{Y}^\circ$ was needed whenever $\delta = 0$.

Suppose that $Y^\circ$ was the true survival time with survival function $S^\circ(y^\circ) = P(Y^\circ > y^\circ)$ and density function $f(y^\circ)$. It is a standard result that the conditional mean, $E(Y^\circ \mid Y^\circ > y)$, minimizes the 2-norm imputation error loss, $l_2(\hat{Y}^\circ) = E[(\hat{Y}^\circ - Y^\circ)^2 \mid Y^\circ > y]$. However, Wu et al. (2008) pointed out that estimating this conditional mean by the Kaplan-Meier estimate is limited in practice when the last observation is censored. Therefore, we did not adopt the conditional mean. If the 1-norm imputation error loss $l_1(\hat{Y}^\circ) = E[|\hat{Y}^\circ - Y^\circ| \mid Y^\circ > y]$ was used instead, we could impute the censored data by the conditional median given that $Y^\circ > y$ and evaluate the normal quantile transformed correlation coefficient, $\mathrm{corr}(N(x_g), N(\hat{y}^\circ))$, to estimate the correlation between the survival time $Y^\circ$ and each gene expression profile $X_g$.

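First, we noted that the normal quantile transformation $N(\cdot)$ depends only on the ranks of the variables. Therefore, it is invariant under any monotone increasing transformation, that is, $N((h(x_1), h(x_2), \ldots, h(x_n))') = N((x_1, x_2, \ldots, x_n)')$ for any monotone increasing function $h$. Second, the conditional median given that $Y^\circ > y$ can be expressed through the survival function, $S^\circ(\hat{y}^\circ) = \frac{1}{2} S^\circ(y)$, and $1 - S^\circ(\cdot)$ is a monotone increasing function, so that we have

$$N(\hat{y}^\circ) = N(1 - S^\circ(\hat{y}^\circ)).$$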
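The link between the conditional median and the survival function can be sketched in one line. The derivation below assumes a continuous survival function and writes $m$ for the conditional median (a symbol introduced here only for illustration):

```latex
\begin{align*}
P(Y^\circ > m \mid Y^\circ > y) &= \frac{S^\circ(m)}{S^\circ(y)} = \frac{1}{2}, \qquad m \ge y, \\
\Longrightarrow \quad S^\circ(\hat{y}^\circ) &= \tfrac{1}{2}\, S^\circ(y) \qquad \text{with } \hat{y}^\circ = m .
\end{align*}
```

For an uncensored observation no imputation is needed, so $\hat{y}^\circ = y$ and $S^\circ(\hat{y}^\circ) = S^\circ(y)$.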

Therefore, instead of estimating the conditional median itself, we could evaluate $N(\hat{y}^\circ)$ by plugging in an estimated survival function $\hat{S}^\circ(\cdot)$. Wu et al. (2008) proposed a nonparametric imputation procedure that uses the Kaplan-Meier estimate of the survival function. The procedure is summarized in the following steps:

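Imputation - Kaplan-Meier based

1. Calculate $\hat{S}_i^\circ$, the Kaplan-Meier estimate of the survival probability;

2. Impute the survival probability by the predicted conditional median

$$\tilde{S}_i^\circ = \begin{cases} \hat{S}_i^\circ, & \text{if } \delta_i = 1 \\ \frac{1}{2}\hat{S}_i^\circ, & \text{if } \delta_i = 0; \end{cases}$$

3. Calculate the percentile $\hat{p}_i = 1 - \tilde{S}_i^\circ$;

4. Calculate the imputed $N(\hat{y}^\circ)$ by performing the normal quantile transformation on $\hat{p}_i$.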
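The Kaplan-Meier based steps can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. The function names are hypothetical. The normal quantile transformation is defined in a previous chapter that is not reproduced here, so the sketch assumes the common rank-based form $\Phi^{-1}((r_i - 0.5)/n)$ with average ranks for ties. It also assumes that the Kaplan-Meier estimate is evaluated at each patient's own observed time.

```python
from statistics import NormalDist


def kaplan_meier_at_observed(times, events):
    """Kaplan-Meier estimate S_hat(y_i), evaluated at each subject's observed time."""
    surv_at = {}
    s = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for y in times if y >= t)
        deaths = sum(1 for y, d in zip(times, events) if y == t and d == 1)
        s *= 1.0 - deaths / at_risk
        surv_at[t] = s
    return [surv_at[y] for y in times]


def normal_quantile_transform(values):
    """Rank-based normal scores (assumed form): Phi^{-1}((rank - 0.5) / n)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values and give them the average rank.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]


def km_impute_scores(times, events):
    """Steps 1-4: KM estimate, halve it for censored cases, 1 - S, normal scores."""
    s_hat = kaplan_meier_at_observed(times, events)
    s_tilde = [s if d == 1 else 0.5 * s for s, d in zip(s_hat, events)]
    p_hat = [1.0 - s for s in s_tilde]
    return normal_quantile_transform(p_hat)


if __name__ == "__main__":
    # events: 1 = death observed, 0 = right censored
    print(km_impute_scores([2, 3, 5, 7, 11], [1, 0, 1, 0, 1]))
```

In this toy input the patient censored at time 3 receives a larger score than the patient who died at time 5, because halving the survival probability pushes the imputed time past that event.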

The implementation of this nonparametric imputation procedure is easy, since we only have to calculate the Kaplan-Meier estimate of the survival probability. However, the question remains how to improve the imputation when extra information is available. The TNM tumor stage is strongly related to the survival time of NSCLC patients, so it is not suitable to impute the same survival time for patients in different stages who are censored at the same time. This motivated us to modify the imputation procedure to use this extra information.

Modified imputation - Cox proportional hazard model based

Previous studies indicated that the survival of NSCLC patients in different TNM tumor stages differed significantly, which motivated us to modify the imputation procedure. Since the original imputation procedure follows directly from the Kaplan-Meier estimate of the survival probability, one natural idea was to modify the survival probability estimate by incorporating the TNM tumor stage $Z$. We assumed that the conditional survival function given $Z = z$ satisfied the Cox proportional hazard model.

The Cox proportional hazard model is one of the best-known regression models for survival data. It assumes that the hazard function given $Z = z$ is proportional to a baseline hazard function and that the logarithm of the ratio depends linearly on the regressors,

$$\lambda^\circ(y^\circ \mid Z = z) = \lambda_0^\circ(y^\circ)\, e^{\gamma z}.$$

Here we let $Z$ be a four-level factor indicating whether the patient's TNM tumor stage is IA, IB, II or III/IV. Then the relationship between the conditional survival function given $Z = z$ and the baseline survival function can be expressed as

$$S^\circ(y^\circ \mid Z = z) = [S_0^\circ(y^\circ)]^{e^{\gamma z}}. \qquad (4.1)$$

To check the assumption of the Cox proportional hazard model, we first drew the log-log Kaplan-Meier curves of the different stages for the training data set. From equation (4.1), the log-log Kaplan-Meier curves should be parallel if the assumption holds. Figure 4.1 showed no strong evidence of non-parallelism for our data. Second, we drew the observed versus expected plot for the training data set, which also showed no strong evidence against the assumption.

Figure 4.1: The log-log Kaplan-Meier curves of different stages for the training data set (plot title: Log-Log Survival curve).
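The step from the hazard model to equation (4.1), and the reason the log-log curves should be parallel, can be sketched as follows. The derivation assumes a continuous survival time and writes $\Lambda_0^\circ(y) = \int_0^y \lambda_0^\circ(u)\,du$ for the cumulative baseline hazard (a symbol introduced here only for illustration):

```latex
\begin{align*}
S^\circ(y \mid Z = z)
  &= \exp\!\Big(-\int_0^{y} \lambda_0^\circ(u)\, e^{\gamma z}\, du\Big)
   = \exp\!\big(-e^{\gamma z}\, \Lambda_0^\circ(y)\big)
   = \big[S_0^\circ(y)\big]^{e^{\gamma z}}, \\
\log\!\big(-\log S^\circ(y \mid Z = z)\big)
  &= \gamma z + \log\!\big(-\log S_0^\circ(y)\big).
\end{align*}
```

On the log-log scale, the curves for different stages therefore differ only by the vertical shift $\gamma z$, which does not depend on $y$.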

Figure 4.2: Observed Kaplan-Meier plot and Cox proportional hazard model (plot title: Observed and Expected Survival Curve). Cox coefficients of (Stage IB, II, III/IV) = (0.50, 1.05, 1.81). The hazard ratios of (Stage IB, II, III/IV) = (1.64, 2.85, 6.12).

Therefore, we assumed that $Y^\circ \mid Z = z$ satisfied the Cox proportional hazard model and imputed the censored time by $\hat{y}^\circ = \mathrm{median}(Y^\circ \mid Y^\circ > y, Z = z)$, so that

$$N(\hat{y}^\circ) = N(1 - S_0^\circ(\hat{y}^\circ)),$$

where $S_0^\circ(\cdot)$ was the baseline survival function, which is also a monotone function. Furthermore, we had $S_0^\circ(\hat{y}_i^\circ) = S_0^\circ(y_i)$ if $\delta_i = 1$ and

$$S_0^\circ(\hat{y}_i^\circ) = \left(\tfrac{1}{2}\right)^{1/\exp(\gamma z_i)} S_0^\circ(y_i) \quad \text{if } \delta_i = 0.$$

From the equations above, we observed that the modified imputation method gave different weights to the censored survival probabilities of patients in different stages. In practice, the Cox coefficients $\gamma$ could be estimated by maximizing the partial likelihood, and the baseline survival function could be estimated by the Nelson-Aalen estimate or the Breslow estimate. The original procedure could then be implemented with the survival probability estimate and the imputation weights replaced. The modified imputation procedure is summarized in the following steps:

1. Estimate the Cox coefficients $\gamma$ for each TNM tumor stage and the baseline survival probability $\hat{S}_{0i}^\circ$;

2. Impute the survival probability by the predicted conditional median

$$\tilde{S}_{0i}^\circ = \begin{cases} \hat{S}_{0i}^\circ, & \text{if } \delta_i = 1 \\ \left(\frac{1}{2}\right)^{1/\exp(\gamma z_i)} \hat{S}_{0i}^\circ, & \text{if } \delta_i = 0; \end{cases}$$

3. Calculate the percentile $\tilde{p}_i = 1 - \tilde{S}_{0i}^\circ$;

4. Calculate the imputed $N(\hat{y}^\circ)$ by performing the normal quantile transformation on $\tilde{p}_i$.
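The modified steps can be sketched in Python in the same spirit. This is an illustrative sketch under stated assumptions, not the author's code. It assumes that the Cox coefficients have already been estimated elsewhere (for example by maximizing the partial likelihood) and are passed in as the linear predictor `lin_pred[i]` $= \gamma z_i$. It uses a Breslow-type estimate of the baseline survival and the same assumed rank-based normal scores $\Phi^{-1}((r_i - 0.5)/n)$. All function and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def breslow_baseline_survival(times, events, lin_pred):
    """Breslow-type baseline survival S0_hat(y_i) = exp(-H0_hat(y_i))."""
    risk = [math.exp(lp) for lp in lin_pred]
    cum_hazard = 0.0
    h0_at = {}
    for t in sorted(set(times)):
        deaths = sum(1 for y, d in zip(times, events) if y == t and d == 1)
        # Sum of exp(gamma * z) over the subjects still at risk at time t.
        at_risk = sum(r for y, r in zip(times, risk) if y >= t)
        cum_hazard += deaths / at_risk
        h0_at[t] = cum_hazard
    return [math.exp(-h0_at[y]) for y in times]


def impute_baseline_survival(s0_hat, events, lin_pred):
    """Step 2: multiply censored cases by (1/2)^(1 / exp(gamma * z_i))."""
    return [
        s if d == 1 else (0.5 ** math.exp(-lp)) * s
        for s, d, lp in zip(s0_hat, events, lin_pred)
    ]


def normal_quantile_transform(values):
    """Rank-based normal scores (assumed form), average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]


def modified_impute_scores(times, events, lin_pred):
    """Steps 1-4 of the modified procedure, given the fitted linear predictors."""
    s0_hat = breslow_baseline_survival(times, events, lin_pred)
    s0_tilde = impute_baseline_survival(s0_hat, events, lin_pred)
    p_tilde = [1.0 - s for s in s0_tilde]
    return normal_quantile_transform(p_tilde)


if __name__ == "__main__":
    # Two patients censored at time 2; the second has a higher-risk stage.
    print(modified_impute_scores([1, 2, 2, 3], [1, 0, 0, 1],
                                 [0.0, 0.0, math.log(4.0), 0.0]))
```

In the toy input, the two patients censored at the same time receive different scores: the higher-risk patient is imputed an earlier survival time, which is the behavior the modification is meant to produce.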


4.1.2 A simulation comparison between two imputation methods

To demonstrate the improvement from our modification, we did a simulation study. First, we randomly generated 256 survival times from the Cox proportional hazard model with a four-level factor regressor. The levels of the factor were generated uniformly at random. The baseline survival distribution was exponential with rate parameter 1, and the Cox coefficients $\gamma$ for the four levels were set to (0, 2, 4, 8). Another 256 censoring times were randomly generated from the exponential distribution with rate parameter 3. The observed time was the minimum of the survival time and the censoring time. The average censoring rate was 0.5099.
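One way to generate data of this kind is sketched below. It relies on the fact that, with an exponential baseline of rate 1, the Cox model gives a constant hazard $e^{\gamma_z}$ for level $z$, so the survival time is exponential with that rate. The function name, the seed and the 0-based level coding are illustrative assumptions. The exact parametrization used in the study is not stated in the text, so the censoring rate produced by this sketch need not match the reported value.

```python
import math
import random


def simulate_cox_exponential(n=256, gammas=(0.0, 2.0, 4.0, 8.0),
                             censor_lambda=3.0, seed=0):
    """Draw (levels, true_times, observed_times, events) under a Cox model
    with exponential baseline hazard 1 and independent exponential censoring."""
    rng = random.Random(seed)
    levels, true_times, observed_times, events = [], [], [], []
    for _ in range(n):
        z = rng.randrange(len(gammas))        # factor level, uniform at random
        hazard = math.exp(gammas[z])          # baseline rate 1 times exp(gamma_z)
        t = rng.expovariate(hazard)           # true survival time
        c = rng.expovariate(censor_lambda)    # censoring time
        levels.append(z)
        true_times.append(t)
        observed_times.append(min(t, c))
        events.append(1 if t <= c else 0)     # 1 = event observed, 0 = censored
    return levels, true_times, observed_times, events


if __name__ == "__main__":
    levels, true_t, obs_t, events = simulate_cox_exponential(seed=1)
    print("fraction censored:", 1 - sum(events) / len(events))
```

The true survival times are returned alongside the observed ones so that imputed scores can later be compared with the scores of the true times.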

To assess the performance of the imputation methods, we measured closeness by the normal quantile transformed correlation coefficient between the true survival times and the imputed values, $\mathrm{corr}(N(y^\circ), N(\hat{y}^\circ))$. For 1,000 simulation runs, we implemented both imputation methods and recorded the correlation coefficient in each run. The average correlation coefficient of the Kaplan-Meier based imputation was 0.8684, and that of the modified imputation was 0.9096. Moreover, in only three runs did the Kaplan-Meier based imputation have a larger correlation coefficient than the modified imputation. We concluded that the modified imputation method performed better when the Cox proportional hazard model assumption held. The results of our simulation are given in Table 4.1.

Table 4.1: Simulation comparison between two imputation methods

Estimated Cox model coefficients                         Average   S.D.
Cox model coefficient $\hat{\gamma}_2$ ($\gamma_2 = 2$)   2.1096    0.6260
Cox model coefficient $\hat{\gamma}_3$ ($\gamma_3 = 4$)   4.1355    0.6264
Cox model coefficient $\hat{\gamma}_4$ ($\gamma_4 = 8$)   8.2389    0.7948

Normal quantile transformed correlation coefficient
between true survival time and imputed value             Average   S.D.
No imputation (observed time)                            0.7786    0.0312
KM based imputation                                      0.8683    0.0205
Modified imputation                                      0.9096    0.0146

Censor rate                                              0.5099    0.0308

Source: 生物晶片資料分析與肺腺癌存活之預測 (Microarray Data Analysis and Prediction of Lung Adenocarcinoma Survival), pp. 29-36