• 沒有找到結果。

使用有效性指標選取基於EM半參數混合風險的模型

N/A
N/A
Protected

Academic year: 2021

Share "使用有效性指標選取基於EM半參數混合風險的模型"

Copied!
88
0
0

加載中.... (立即查看全文)

全文

(1)國立臺灣師範大學數學系碩士班碩士論文. 指導教授:張少同 博士. Selection of EM-based Semi-Parametric Mixture Hazard Models Using Validity Indices 使用有效性指標選取基於 EM 半參數混合風險的模型. 研究生:張怡雯. 中華民國 一零七 年 七 月.

(2) 摘要 Cox 比例風險模型(Cox proportional hazards model)是一種經常在存活分析中 使用的迴歸模型,此模型探討生存時間的分布和自變量的關係,可以應用在醫學、 健康照護等領域。當模型中隱含著潛在變數(latent variables)時,利用混合迴歸模 型(mixture regression model)分析這些變數的影響是一種合適的方法。 在使用混合模型時,選擇適當的模型組件個數是的一個重要議題,雖然有效 性指標(validity indices)是選擇模型的方法中重要的一環,但是目前很少學者利用 有效性指標選擇混合迴歸模型的模型組件個數。在這篇論文中,我們參考現有其 它模型的指標,利用後驗概率(posterior probabilities)和殘差(residuals)發展出新的 指標,且做一系列模擬來驗證新指標的有效性。 Cox 比例風險模型包含基準風險函數(baseline hazard function)及比例迴歸模 型(proportional regression model)兩個部分,估計基準風險函數一直是個富有挑戰 性的議題,有的學者假設基準風險函數服從特定的時間分配,有的假設為分段常 數函數(piecewise constant)。在這篇論文中,我們利用內核方法(kernel estimator) 來估計基準風險函數,並發展 EM 演算法來估計混合迴歸模型的參數。 模擬結果顯示,估計基準風險函數時,利用內核方法表現的結果優於分段常 數函數,因為內核方法將曲線估計得更為平滑,改善分段常數函數僵硬結構的缺 點。此外,根據新指標選擇正確模型個數的高比例,推測新指標在選擇模型組件 個數的表現上是有效的。. 關鍵字:混合迴歸模型、Cox 比例風險模型、EM 演算法、內核方法、有效性指 標。. i.

(3) Abstract The Cox proportional hazards model is commonly used in survival analysis for describing the relationship between the survival time and covariates. The model is applied in many fields such as medicine, health care, and so on. In cases that some latent variables are involved in the model, mixture regression models are more suitable for analyzing the effects of these variables. The determination of the number of model components is an important issue when using the mixture models. Although validity indices are a vital branch of model selection, however, they are less used for deciding the number of components in mixture regression models. In this thesis, we propose some new indices based on the posterior probabilities and residuals by referring to the existing methods. The effectiveness of the proposed new indices has been verified through extensive simulations. The Cox proportional hazard model consists of two parts: the baseline hazard function and the proportional regression model. The estimation of baseline hazard function is known to be a challenging issue. Some researchers assumed that the baseline hazard function follow a specific lifetime distribution and some others assumed it is piecewise constant. In this thesis, the baseline hazard function is estimated by kernel estimator and the mixture regression model is estimated by using the expectation and maximization (EM) algorithm. In estimating the baseline hazard function, the simulation results show that the estimated model with the kernel estimator is better suited for the data set than the piecewise constant model because the fitted curve is smoother and the kernel estimator improves the stiff structure of the piecewise constant estimator as well. Moreover, the effectiveness of the new indices in selecting the number of components is verified through experiments that a high precision of number selection of components using the new indices.. Key words. Mixture regression model, Cox proportional hazards model, EM-algorithm, Kernel estimator, Validity indices ii.

(4) 致謝 本論文能夠順利的完成,首先,衷心的感謝我的指導教授 張少同老師在這 段時間循序漸進的指導,一開始,老師給予大量的論文與期刊,讓我在閱讀這些 著作的過程中擴展與精練知識,並尋找感興趣的研究方向,找到研究方向後便開 始學習撰寫程式碼,剛開始接觸程式語言時,遇到許多困難與瓶頸,老師總是耐 心的傾聽與糾正我的錯誤,並示範如何撰寫正確且有效率的程式碼,經過兩年密 集的討論,我在撰寫程式這方面走的辛苦卻也收穫滿囊,老師擁有淵博的專業知 識,總是能針對我的研究方法提出更佳的方案,如此精益求精的態度讓我相當尊 敬,在此,謹向老師表示崇高的敬意和誠摯的謝意。 接下來,感謝呂翠珊老師擔任我的校內口試委員,呂翠珊老師於我碩一那年 開出存活分析之課程,修習這門課程激發我對存活分析的興趣,並奠定我在這方 面的知識基礎。在口試的過程中,感謝老師提供更多的模擬例子以及對於分析真 實資料的建議,使本論文可以更臻完善。 再者,感謝李孟峰老師於百忙之中擔任我的校外口試委員,感謝老師耐心的 指正本論文中數學式子的錯誤寫法,並指出在模擬這個章節中漏掉的資訊,對於 我所使用的無母數估計方法,老師提供了生動的說詞來展現此方法的優點,能獲 得老師的肯定讓我深感榮幸。 另外,感謝我的研究夥伴何莉維一路上的陪伴,妳擁有認真盡責的處事原則 與奮發向上的學習態度,從妳身上我學習到很多寶貴的知識,感謝妳對本論文英 文撰寫的指點,這段時間我們相互扶持與鼓勵,相信我們都有所成長。 最後,感謝我最敬愛的爸爸和媽媽,在我求學的階段給予我精神與經濟上最 大的支援,沒有你們的支持與期望,就沒有今日的我,養育之恩,無以回報,你 們能永保健康快樂是我最大的心願。我的哥哥和妹妹專精於撰寫電腦程式,感謝 你們這段時間的協助與鼓勵,我將這份喜悅與你們分享。 兩年半的碩士班生涯即將劃上一個句號,然而對於我的人生卻只是一個逗號, 我將踏上下一個嶄新的旅程,朝著堅定志向,乘風破浪,勇往邁進! 張怡雯 謹識於 國立臺灣師範大學 數學所 統計組 中華民國一零七年七月 iii.

(5) Contents 摘要. …………………………………………………………………… ⅰ. Abstract ……………………………………………………………… ⅱ ………………………………………………………………… ⅲ 致謝 Contents. …………………………………………………………… ⅳ. List of table …………………………………………………………… ⅴ List of figure ………………………………………………………… ⅶ Chapter Ⅰ. Introduction …………………………………………… 1 Chapter Ⅱ. Model ………………………………………………… 5 ………………… 5. 2.1. Observed and latent variables in survival analysis. 2.2. Cox proportional hazards model. 2.3. Likelihood function under survival model with censored data. 2.4. Semi-parametric mixture model. 2.5. Complete-data log-likelihood function under the mixture model. ………………………………… 6 ……… 8. ………………………………… 11 …… 13. ………………………………………… 15 Chapter Ⅲ. Estimation 3.1 EM algorithm ………………………………………………… 15 …………………………… 18. 3.2. Estimation of mixing probabilities p. 3.3. Estimation of the baseline hazard function h0 (t ) 3.3.1 Piecewise constant estimator 3.3.2 Kernel estimator. ………………… 19. ……………………………… 19. ………………………………………… 21. 3.4. Estimation of regression coefficients . 3.5. Algorithm and convergence. ………………………… 24. …………………………………… 26. Chapter Ⅳ. Validity indices ……………………………………… 29 ………………………… 29 4.1 Guidelines for mixture model selection 4.2. Validity indices involving only posterior probabilities………………… 30. 4.3. Validity indices involving posterior probabilities and data characteristics 33. Chapter Ⅴ. Simulation. …………………………………………… 37. 5.1. Compare two methods of estimating the baseline hazard function. 5.2. Select appropriate number of model components. … 38. ………………… 50. A practical example ………………………………… 67 Chapter Ⅶ. Conclusion ………………………………………… 75 References …………………………………………………………… 77 Chapter Ⅵ.. iv.

(6) List of table 3-1.. Different types of kernel function. ………………………………………. 21. 5-1.. The information including sample size, number of risk types, number of covariates, degree of covariates, baseline hazard function under six cases. 35. 5-1-1. The estimated parameters, CR, MsSAE/n and MsSSE/n from the Cox mixture model with baseline hazard function estimated by method 1(a) and method 2(b). ………………………………………………………………. 37 5-1-2. The mean and variance of estimators from each simulation, bias, MSE, ARB , MSE , CR , MsSAE and MsSSE from the Cox mixture model with. baseline hazard function estimated by method 1 (a) and method 2 (b). …. 40 5-2-1. The estimated parameters, CR, MsSAE/n and MsSSE/n from the Cox mixture model with baseline hazard function estimated by method 1(a) and method 2(b). ………………………………………………………………. 42 5-2-2. The mean and variance of estimators from each simulation, bias, MSE, ARB , MSE , CR , MsSAE and MsSSE from the Cox mixture model with. baseline hazard function estimated by method 1 (a) and method 2 (b). …. 43 5-3-1. The estimated parameters, CR, MsSAE/n and MsSSE/n from the Cox mixture model with baseline hazard function estimated by method 1(a) and method 2(b). ………………………………………………………………. 45 5-3-2. The mean and variance of estimators from each simulation, bias, MSE, ARB , MSE , CR , MsSAE and MsSSE from the Cox mixture model with. baseline hazard function estimated by method 1 (a) and method 2 (b). …. 47 5-4-1. Each index calculated by the observed data and the estimators of the Cox mixture model with different number of risk types. ………………………. 48 5-4-2. The estimated parameters from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). ……………………………………. 50 5-4-3. The correct rate of choosing right number of risk types by each index. …. 50 5-5-1. Each index calculated by the observed data and the estimators of the Cox mixture model with different number of risk types. ………………………. 51 v.

(7) 5-5-2. The estimated parameters from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). ……………………………………. 51 5-5-3. The correct rate of choosing right number of risk types by each index. …. 52 5-6-1. Each index calculated by the observed data and the estimators of the Cox mixture model with different number of risk types. ………………………. 53 5-6-2. The estimated parameters from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). ……………………………………. 53 5-6-3. The correct rate of choosing right number of risk types by each index. …. 55 5-7-1. Each index calculated by the observed data and the estimators of the Cox mixture model with different number of risk types. ………………………. 56 5-7-2. The estimated parameters from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). ……………………………………. 56 5-7-3. The correct rate of choosing right number of risk types by each index. …. 57 5-8-1. Each index calculated by the observed data and the estimators of the Cox mixture model with different number of risk types. ………………………. 59 5-8-2. The estimated parameters from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). ……………………………………. 60 5-8-3. The correct rate of choosing right number of risk types by each index. …. 60 5-9-1. Each index calculated by the observed data and the estimators of the Cox mixture model with different number of risk types. ………………………. 61 5-9-2. The estimated parameters from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). ……………………………………. 62 6-1.. The correct rate of choosing right number of risk types by each index. …. 63 Each index calculated by the observed data and the estimators of an EMbased semi-parametric mixture hazard model with different number of risk types.. 6-2.. ……………………………………………………………………. 66. The estimates (with standard errors) of mixing probabilities p and regression coefficients  from an EM-based semi-parametric mixture hazard model with the number of risk types is 3. …………………………. 67 vi.

(8) List of figure 3-1.. Different types of kernel function. ………………………………………. 21. 3-2.. The process of the iteration. ………………………………………………. 26. 5-1.. The scatter plot of the observed data under six cases. ……………………. 35. 5-1-1. The scatter plot of the observed data and the expectation line according to each type of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). ……………………. 38 5-1-2. The real baseline hazard function according to each type of risk and the estimated baseline hazard rate at the survival time for each individual matched into these two types of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). …. 39 5-1-3. The real cumulative baseline hazard function according to each type of risk and the estimated cumulative baseline hazard rate at the survival time for each individual matched into these two types of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). ………………………………………………………………………. 39 5-2-1. The scatter plot of the observed data and the expectation line according to each type of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). ……………………. 42 5-2-2. The real baseline hazard function according to each type of risk and the estimated baseline hazard rate at the survival time for each individual matched into these two types of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). …. 42 5-2-3. The real cumulative baseline hazard function according to each type of risk and the estimated cumulative baseline hazard rate at the survival time for each individual matched into these two types of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). ………………………………………………………………………. 43 5-3-1. The scatter plot of the observed data and the expectation line according to vii.

(9) each type of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). ……………………. 45 5-3-2. The real baseline hazard function according to each type of risk and the estimated baseline hazard rate at the survival time for each individual matched into these three types of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). …. 46 5-3-3. The real cumulative baseline hazard function according to each type of risk and the estimated cumulative baseline hazard rate at the survival time for each individual matched into these three types of risk from the Cox mixture model with baseline hazard function estimated by method 1 (a) and method 2 (b). ………………………………………………………………………. 46 5-4-1. The scatter plot of the observed data and the expectation line according to each type of risk from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). Note: risk(.) represent the type of risk for estimation. …………………………………………………………………. 49 5-5-1. The scatter plot of the observed data and the expectation line according to each type of risk from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). Note: risk(.) represent the type of risk for estimation. …………………………………………………………………. 52 5-6-1. The scatter plot of the observed data and the expectation line according to each type of risk from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). Note: risk(.) represent the type of risk for estimation. …………………………………………………………………. 54 5-7-1. The scatter plot of the observed data and the expectation line according to each type of risk from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). Note: risk(.) represent the type of risk for estimation. …………………………………………………………………. 57 5-8-1. The scatter plot of the observed data and the expectation plane according to each type of risk from the Cox mixture model with the number of risk viii.

(10) types is 1 (a), 2(b), 3(c) and 4(d). Note: risk(.) represent the type of risk for estimation. …………………………………………………………………. 59 5-9-1. The scatter plot of the observed data and the expectation plane according to each type of risk from the Cox mixture model with the number of risk types is 1 (a), 2(b), 3(c) and 4(d). Note: risk(.) represent the type of risk for estimation. …………………………………………………………………. 62 6-1.. The estimated cumulative baseline hazard rate at the survival time for each individual matched into the type of risk from an EM-based semiparametric mixture hazard model with the number of risk types is 2 (a), 3(b) and 4(c). Note: risk(.) represent the type of risk for estimation; b represents the bandwidth of kernel estimator. ……………………………. 66. 6-2 (a).The estimated cumulative hazard rate of death due to the 2nd type of risk for young patients (AG=0) with weight index > 99 (WT=0), normal PF (PF=0), no history of CDV (HX=0), haemoglobin > 12 g/100 (HG=0), small size of primary lesion (SZ=0), and high-grade tumors (SG=1). ……. 68 6-2 (b).The estimated cumulative hazard rate of death due to the 2nd type of risk for young patients (AG=0) with weight index > 99 (WT=0), normal PF (PF=0), no history of CDV (HX=0), haemoglobin > 12 g/100 (HG=0), small size of primary lesion (SZ=0), and allocated high-dosage DES (RX=1). ……………………………………………………………………. 68 6-3 (a).The estimated cumulative hazard rate of death due to the 1st type of risk for young patients (AG=0) with weight index > 99 (WT=0), normal PF (PF=0), with history of CDV (HX=1), haemoglobin > 12 g/100 (HG=0), small size of primary lesion (SZ=0), and low-grade tumors (SG=0). ……. 69 6-3 (b).The estimated cumulative hazard rate of death due to the 1st type of risk for young patients (AG=0) with weight index > 99 (WT=0), normal PF (PF=0), haemoglobin > 12 g/100 (HG=0), small size of primary lesion (SZ=0), low-grade tumors (SG=0) and allocated high-dosage DES (RX=1). 69. ix.

(11) Chapter Ⅰ.. Introduction. Survival analysis is a branch of statistics for analyzing time-to-event data, such as survival rate in medicine and failure in mechanical systems. Survival analysis attempts to analyze the proportion of a population which will survive past a certain time. Cox proportional hazards model, introduced by Cox [1], is a regression model that is often considered in survival analysis. It relates the relationship between the distribution of survival times and covariates. However, there may be some latent or implicit variables in the model, such as competing risks. The traditional approach, say Prentice et al. [2], based on statistical models for the observable failure/censored time and covariates is to represent the competing-risk failure rates by the cause-specific hazard functions via Cox proportional hazards assumption. See also Kalbfleisch et al. [3] and Benichou et al. [4]. Some examples that apply this approach to analyze competing risks data can be found in Gaynor et al. [5] and Fusaro et al. [6]. This thesis is mainly inspired by the paper written by McLachlan et al. [7], who proposed an EM-based Cox proportional hazards mixture model to analyze the competing risks data. Their analysis is built on the fact that the number of competing risks are already known. In this thesis, we also discuss an EM-based Cox proportional hazards mixture model. The difference is that we assume we don’t have the full information about the number of risk types. Therefore, the major research topic of this thesis is model selection. Choosing the appropriate number of model components, such as the types of risks in survival analysis, is an important issue in estimating mixture models. "All models are wrong", mentioned by George Box (1976), is a common aphorism in statistics. Why do we need model selection? The purpose of model selection may be to replace poorly performing models, to save computer operating time, or to find a model that can better. 1.

(12) explain the relationship between the input and the output data. At present, there are some methods for model selection, such as F tests for nested models, stepwise procedure for picking a good model, AIC and BIC for judging the quality of a model. Validity indices are a vital branch of model selection. As mentioned above, AIC and BIC are currently widely used validity indices. But few scholars use validity indices to select the number of model components under mixture regression model. Tang et al. [8] and Weina et al. [9] introduce some cluster validity indices to find an optimal cluster number in fuzzy clustering algorithm. Bedalli and Ninka [10] implement some cluster validity indices for fuzzy cluster analysis. In fuzzy clustering algorithm, memberships play an important role. Similarly, under the EM-based Cox proportional hazards mixture model, the posterior probabilities are closely related to the role of memberships. In this thesis, we refer and modify the idea of other indices from existing models. Some of the ideas come from fuzzy clustering algorithm. Consider the principle of selecting traditional regression models, we use the posterior probabilities and residuals to develop new indices. We do a series of simulations to verify the effectiveness of the new indices. Cox proportional hazards model consists of two parts: the baseline hazard function and the proportional regression model. The estimation of baseline hazard function has always been a challenging issue. Bender et al. [11] assume that the baseline hazard function follows a specific lifetime distribution. But obviously, this assumption is too restrictive. The common lifetime distributions may fail to adequately explain data for which the failure rate is not monotonic increasing or decreasing. Alternatively, there are some scholars use nonparametric estimator which is a flexible method to estimate the baseline hazard function. McLachlan et al. [7] treat each observed survival time as a. 2.

(13) cut-point and assume the baseline hazard function to be piecewise constants. But the disadvantage of the piecewise constant function is that the estimated curve will not be so smooth. Agathe et al. [12] proposed kernel estimator to estimate the baseline hazard function under non-mixture Cox proportional hazards model. In this thesis, we extend the idea of kernel estimator to the mixture model and develop the EM algorithm to estimate the parameters of the mixture regression model. The estimate for baseline hazard function is nonparametric and the estimate for proportional regression model is parametric. Thus, our method is a semi-parametric mixture model approach. The remainder of this thesis is organized as follows. In chapter Ⅱ, first, we briefly review the non-mixture Cox model. Then we describe the mixture Cox model and the complete-data log-likelihood function under this model with latent variables. In chapter Ⅲ, we review the EM algorithm first. Then describe the estimation of mixing probabilities, the baseline hazard function and regression coefficients under Cox proportional mixture hazards model. For estimating the baseline hazard function, we propose the kernel estimator. In chapter Ⅳ, we review some validity indices in fuzzy cluster algorithm. Then we propose some new indices for selecting the number of model components in mixture regression models. In chapter V, several simulations are carried out to compare the performance of piecewise constant estimator and kernel estimator. The effectiveness of indices is assessed according to the proportion of correct number of model components that is selected. In chapter VI, we analyze a practical data set of prostate cancer patients treated with different dosages of the drug diethylstilbestrol. Finally, we state conclusions in chapter Ⅶ.. 3.

(14) 4.

(15) Chapter Ⅱ.. Model. 2.1 Observed and latent variables in survival analysis Survival analysis is a study to analyze the distribution of lifetime or failure time. The response variable measures the time from initial event to terminate event. For example, the time from birth to death, the time from the start of treatment to the time of relapse. This time-to-event data will appear in different fields. For example, the survival rate in medicine, mortality in public health, life table in epidemiology, the unemployment rate in economics, and so on. Suppose there is a sample of individuals with sample size n from a specific population whose failure times are F1 , F2 ,..., Fn . Consider right censored data, there are two results for tracking. If the observation occurred research interest event, then we can observe the failure time. On the other hand, if the observation didn’t occur research interest event, then we don’t have the opportunity to observe the failure time. Denote C1 , C2 ,..., Cn be the potential censored times. If the observation didn’t occur research. interest event from the beginning to the end of the study, then we denote the censored time by study terminate time. Else if the observation lost to follow-up, then we denote the censored time by the time he/she left the study. Note that, if we have observed the failure time, we won’t have the opportunity to observe the censored time and vice versa. Therefore, the observed survival time is the minimum of the failure time and the censored time. Here we define  be the zero-one indicator function to indicate whether or not the observation is censored. Therefore, we have the observed data (t j ,  j ) for j  1,..., n , where. t j  min( Fj , C j ) ,. 1 if Fj  C j 0 if Fj  C j.  j  I ( Fj  C j )  . 5. (observed the failure time) (observed the censored time) ..

(16) Suppose we also have observed some covariates that will influence the failure time. The covariates may be gender, age, the dosage of certain medications, and so on. Denote X j  ( x j1 , x j 2 ,..., x j d )T for j  1,..., n be a d-dimensional vector of covariates for j-th. individual and the superscript T denotes vector transpose. All in all, we have the observed data (t j ,  j , X j ) for j  1,..., n . Actually, there may be some factors that we have not observed, but it will affect the failure time. We call it latent variables. For example, animals have a large number of death, and the cause of death may be illness, viral infection, food poisoning, and so on. But we have not observe this information. Another example, in the medical records, lots of people have been recorded for death because of cardiovascular disease. However, we know that there are many risk factors for cardiovascular disease. It may be diabetes, high blood pressure, high cholesterol, smoking, genetic, and so on. In the above two examples, although we don’t know that each individual failed or may fail due to which type of risk, we are still interested in exploring these competing risks. Therefore, we need to find an effective method to classify these individuals into appropriate classification of risks. At present, some scholars have proposed how to analyze data with competing risks. This thesis is mainly inspired by the paper written by McLachlan et al. [7], who proposed a mixture model to analyze the competing risks data. In next section, we will introduce the non-mixture model first. Then in section 2.4, we will present the mixture model to analyze the data with latent variables.. 2.2 Cox proportional hazards model In this section, we introduce the non-mixture model. The main data we have observed are survival times and covariates. We want to find a model that can. 6.

(17) characterize the relationship between these two. As a result, we consider the Cox proportional hazards model. The model is introduced as follows. The Cox proportional hazards model, introduced by Cox (1972), is a regression model often considered in survival analysis to relate the distribution of survival times and covariates. The hazard function of t is defined by. h(t X j ,  )  h0 (t ) exp( X jT  ) ,. (2.1). T where X j  ( x j1 , x j 2 ,..., x j d ) for j  1,..., n be a vector of covariates for the j-th. individual,   (1 , 2 ,..., d )T is the vector of regression coefficients and h0 (t ) is the baseline hazard function. The regression equation can be expressed as hazard ratio (HR),.  h(t X j ,  )  log  HR(X j )   log    X jT  , h ( t )   0 where HR(X) indicates that the risk ratio of the event will occur if the X value is given. Note that the baseline hazard function is a function of survival times, and is not related to covariates. The relationship between covariates and survival times can be explained through the estimates of parameters  . We know that the hazard function of t is the probability that an event occurs within a very short period of time in the condition of surviving over time t. That is,. P(t  T  t  t T  t ) . t 0 t. h(t )  lim. Writing down the conditional probability, we obtain. P( t  T  t  t ) P( t  T  t  t ) 1 f (t )  lim   , t 0 t 0 t  P(T  t ) t P(T  t ) S (t ). h(t )  lim. (2.2). where f (t ) is the probability density function and S (t ) is the survival function. Taking logarithm for the survival function, then adding the negative sign and taking the first derivate with respect to variable t, we obtain an equation of the hazard function.. 7.

(18) d 1 d 1 d f (t ) (2.2)  S (t )    1  F (t )    h(t ) .   ln S (t )   dt S (t ) dt S (t ) dt S (t ). (2.3). From (2.3), we can see the relationship between the survival function and the hazard function. The survival function of t can be derived as t S (t )  exp    h(u )du  .  0 . Define H 0 (t )   h0 (u ) du be the cumulative baseline hazard function. The survival t. 0. function of t under Cox model can be written as. S (t X j ,  )  exp  H 0 (t ) exp( X jT  )  .. (2.4). Moreover, from (2.2), the probability density function of t under Cox model is. f (t X j ,  )  h(t X j ,  )  S (t X j ,  )  h0 (t ) exp  X j T   H 0 (t ) exp( X j T  )  . (2.5). 2.3 Likelihood function under survival model with censored data A likelihood function is a function of the parameters of a statistical model, given specific observed data. Likelihood functions play an important role in frequentist inference, especially for estimating parameters. A likelihood L( ) shown as follows is the probability for the occurrence of a sample configuration x1 , x2 ,..., xn given that the probability density f ( x  ) with parameter  is known. n. L( )   f ( x j  ) j 1. In the survival model with right censored data, the likelihood is different from the general wording. Daowen [13] has shown the following theorem. Theorem : Let f F and S F be the probability density function and the survival function of failure times, and  be the parameter characterizing the distribution of F. Then the likelihood function of  under survival model with censored data is proportional to 8.

(19) n. j. 1 j. L( )    f F (t j  )   SF (t j  ) . ,. j 1. where for j  1,..., n , t j  min( Fj , C j ) ,  j  I ( Fj  C j ) and Fj , C j are the failure time and the censored time for j-th individual. Proof: Actually, failure time F and censored time C have different distributions and, of course, have different parameters characterizing their distributions. Note that these two distributions are independent. We already have f F and S F be the probability density function and the survival function of failure time. Now, suppose fC and SC be the probability density function and the survival function of censored time. The data have two results for tracking, so there are two cases for calculating the probability density function. The first case is that we have observed the failure time. The observed data are.  =1 and T  min( F , C )  F . Then the probability of these individuals having an event at time t can be expressed as follows,. P(t  T  t  t ,   1)  P(t  F  t  t , C  F )  P(t  F  t  t , C  T ) . Note that t is a fixed number, so the formula above is similar to. P(t  F  t  t , C  T )  P (t  F  t  t , C  t ) . Then because the distribution of failure time and the distribution of censored time are independent. The joint probability is equal to the multiplication of these two individual probability. Therefore, we have P(t  F  t  t , C  t )  P(t  F  t  t )  P(C  t )  f F ( )t  SC (t ),   [t , t  t ) .. Comprehending the above calculation, we obtain P(t  T  t  t ,   1)  f F ( )t  SC (t ),   [t , t  t ) .. Therefore, the probability density function of failure time is 9.

(20) f ( )t  SC (t ) P(t  T  t  t ,   1)  lim F  f F (t ) SC (t ) . t  0 t  0 t t. f (t ,   1)  lim. (2.6). The second case is that we have not observed the failure time. The observed data are  =0 and T  min( F , C )  C . Then the probability of these individuals having an event at time t can be expressed as follows,. P(t  T  t  t ,   0)  P(t  C  t  t , F >C )  P (t  C  t  t , F >T ) . Similar to the first case, because t is a fixed number and the distribution of failure and censored time are independent, we have P(t  C  t  t , F >T )  P(t  C  t  t , F >t )  f C ( )t  S F (t ),   [t , t  t ) .. Consequently, we can obtain P(t  T  t  t ,   0)  fC ( )t  S F (t ),   [t , t  t ) .. Therefore, the probability density function of censored time is f ( )t  S F (t ) P (t  T  t  t ,   0)  lim C  f C (t )  S F (t ) . (2.7) t  0 t  0 t t. f (t ,   0)  lim. Combining (2.6) and (2.7), the probability density function of the survival time is f (t ,  )   f F (t ) SC (t ) . .  fC (t )S F (t ). 1. .  f. (t )   S F (t )  . F. 1.  f (t ). 1. C.  SC (t )  . . We have  be the parameter characterizing the distribution of the failure time. Now, denote  be the parameter characterizing the distribution of the censored time. The likelihood function of these two parameters is n. . j. 1 j. L( ,  )    f F (t j  )   S F (t j  )  j 1.  f (t ). 1 j. C. j.  SC (t j  ) . j. .. Keep in mind that we are mainly interested in making inference on the parameter  . Therefore, the likelihood function is proportional to n. j. 1 j. L( )    f F (t j  )   SF (t j  ) . .. ■. j 1. Note that, in this thesis, the parameters in all functions are characterized the. 10.

(21) distribution of the failure time. Here we applied an intuitive idea to explain this likelihood function. If we have observed the failure time, it means that we know when the individual occurred research interest event. Therefore, the contribution of the likelihood function is the probability density function. However, if we didn’t observe the failure time, we can’t know when the individual failed. We only know that at least individual is alive up to censored time. Thus, the contribution of the likelihood function is the survival function. According to the above theorem, the likelihood function under Cox proportional hazards model can be written as follows, n. .  exp  H (t ) exp( X. L(  )   h0 (t j ) exp  X j T   H 0 (t j ) exp( X j T  )  j 1 n. j. 0. j. T j. .  ) . 1 j. j.    h0 (t j ) exp( X j T  )  exp   H 0 (t j )) exp( X j T  )  . j 1. Then we can maximize the likelihood function to estimate the parameters.. 2.4 Semi-parametric mixture model As was mentioned in the last paragraph of section 2.1, on survival analysis, there may be some factors that we have not observed but will influence the failure time. In this section, we will introduce the mixture model to analyze the data with unobserved factors or called latent variables. Suppose there are g different types of risk that will influence the failure time. Each type of risk has its own specific distribution. In this thesis, a component represents failure due to a specific type of risk. That is, f i (t ) , Si (t ) , hi (t ) are called componentprobability density function, component-survival function and component-hazard function that given failure is due to the i-th type of risk, respectively. The mixture probability density function of t is defined by 11.

(22) g. f (t )   pi  fi (t ) , i 1. where pi is the mixing probability of failure due to the i-th type of risk, and pi sum g. to one, that is. p i 1. i.  1.. Following the Cox proportional hazards model (2.1), the component-hazard functions, hi (t X j ,  i ) for i  1,..., g , are defined by. hi (t X j ,  i )  h0i (t ) exp( X jT  i ) , where i  (i1 , i 2 ,..., id )T for i  1,..., g is the vector of regression coefficients and h0i (t ) for i  1,..., g are the component-baseline hazard functions.. For the estimation of the baseline hazard function, at first, we refer to the paper written by Bender et al. [11], that assume the baseline hazard function follows some common lifetime distributions, such as Weibull distribution or Gompertz distribution. However, common lifetime distributions may fail to adequately explain data for which the failure rate is not monotonic increasing or decreasing. For example, in general, the hazard rate will be decreasing after the treatment therapy or taking some medicines. After a while, the hazard rate will tend to stabilize. And after a long time, because the effectiveness of treatment is gradually reduced, the hazard rate will be increasing. In this situation, the hazard function is more suitable assuming bathtub-shape with three phases. As a result, if we use a specific monotone baseline to estimate this data, it is not suitable. Alternatively, a flexible baseline hazard function can be adopted. Larson and Dinse [14] specified the baseline hazard function to be piecewise constant. However, there are several things to be considered. First, how many segments or constant parameters should be estimated. Second, how to determine the value of each cut point. McLachlan et al. [7] treat each observed survival time as a cut point. But we think if we use 12.

(23) piecewise constant function to estimate curve it will not be so smooth. As a result, we propose using kernel estimator to estimate the baseline hazard function. For details, see Chapter Ⅳ. In our study, we use maximum likelihood estimation to estimate mixing probabilities and regression coefficients. These estimates are parametric. And we use kernel estimator to estimate the baseline hazard function. This estimate is nonparametric. Therefore, this method is called the semi-parametric mixture hazard model approach.. 2.5 Complete-data log-likelihood function under the mixture model In order to write down the likelihood function under the mixture model, we must know the component-survival function and the component-probability density function. From (2.4), the i-th component-survival function following Cox model is. Si (t X j ,  i )  exp  H 0i (t ) exp( X j T  i )  , where H 0i (t )   h0i ( s ) ds is the component-cumulative baseline hazard function. t. 0. From (2.5), the i-th component-probability density function following Cox model is. fi (t X j ,  i )  h0i (t ) exp  X jT  i  H 0i (t ) exp( X j T  i )  . In our mixture model, the unknown parameters are p  ( p1 , p2 ,..., pg 1 )T and.   ( 1T ,  2T ,...,  g T )T where i  (i1 , i 2 ,..., id )T . Note that we only need to estimate (g-1)’s mixing probabilities of failure, because the last mixing probability is equal to one minus the sum of other (g-1)’s mixing probabilities. According to the theorem in section 2.3, the likelihood function for p and  under the mixture model is n. L( p,  )   f (t j X j ,  ) j S (t j X j ,  ) . j 1. 13. 1 j. ,.

(24) g. g. i 1. i 1. where f (t j X j ,  )   pi  fi (t j X j ,  i ) and S (t j X j ,  )   pi  Si (t j X j ,  i ) . And the log-likelihood function for p and  under the mixture model is n   g   g  l ( p,  )    j log  pi  fi (t j X j ,  i )   (1   j ) log  pi  Si (t j X j ,  i )   j 1   i 1   i 1  g. n. . .    j log  pi  fi (t j X j ,  i )   (1   j ) log  pi  Si (t j X j ,  i )  . j 1 i 1. The information we have collected is incomplete because we don’t know every individual failed or may fail due to which type of risk. Therefore, the above loglikelihood function is also incomplete. That means, one might fail due to a particular risk, but every risk is possible. Therefore, the log-likelihood function is contributed by each possible risk. That is the mixture probability density function. In order to write down the complete-data log-likelihood, an unobservable random matrix Z is introduced for each observation. The dimension for matrix Z is n by g. The entry in this matrix is zero-one indicator variables. For example, the entry in the j-th row and i-th column zij  1 or 0 according to the j-th individual would have failed from the i-th type of risk or not. That is. 1 if j - th individual would have failed from i-th type of risk zij   , 0 if j - th individual would not have failed from i-th type of risk . If the individual has failed or may fail from the i-th type of risk, then the log-likelihood it contributes is the i-th component-probability density function or i-th componentsurvival function according to the survival time is failure or censored. Therefore, the complete-data log-likelihood function under the mixture model is n. g. . . lc ( p,  )   zij  j log  pi  fi (t j X j ,  i )   (1   j ) log  pi  Si (t j X j ,  i )  . (2.8) j 1 i 1. In the next chapter, we will introduce how to estimate the parameters of the Cox proportional mixture hazards model. 14.

(25) Chapter Ⅲ.. Estimation. 3.1 EM algorithm Recall that, in the survival analysis, there may be some competing risks that we have not observed, but it will affect the failure time. In this thesis, we are interested in exploring these competing risks. That means, in our study, part of the data is unobserved. McLachlan et al. [7] proposed the EM method, which is an efficient iterative procedure to compute the maximum likelihood estimate. It is first introduced by Dempster et al. [15]. We will do a series of iteration steps to update the estimator until a convergence criterion is met. Note that in our study the observed variables are survival times T =(t1 ,t2 ,...,tn )T and. covariates. X  ( X1T , X 2T ,..., X nT )T. where X j  ( x j1 , x j 2 ,..., x j d )T .. The. unobserved variables are Z  ( Z1T , Z2T ,..., ZnT )T where Z j  ( z j1 , z j 2 ,..., z j g )T . The parameters are mixing probabilities p  ( p1 , p2 ,..., pg 1 )T and regression coefficients.   ( 1T ,  2T ,...,  g T )T where i  (i1 , i 2 ,..., id )T . Let   ( pT ,  T )T be the vector containing these unknown parameters. Next, we will briefly introduce the principle of EM method. Suppose the complete-data likelihood function for  can be written as Lc ()  g (T , X , Z ) .. The marginal distribution of observed variables can be obtained by integral out unobserved variables. That is, L( )  h(T , X  )   g (T , X , Z  ) dz . Z. Our goal is to maximum likelihood to get the estimate of parameters  given the observed variables.. ˆ  arg max L( T , X )  arg max h(T , X )  . . 15.

(26) Note that the conditional distribution of the unobserved variables given the observed variables and parameters can be derived as. k ( Z T , X , ) . g (T , X , Z  ) . h(T , X  ). (3.1). Taking logarithm for (3.1), we have log h(T , X )  log g (T , X , Z )  log k ( Z T , X , ) .. (3.2). It holds for any value of Z according to the distribution k ( Z T , X ,  ) for a chosen value of  , say  0 . Taking the expectation on both sides of (3.2) with respect to. k ( Z T , X ,  0 ) , we have EZ T , X ,0 log h(T , X  )   EZ T , X ,0 log g (T , X , Z  )   EZ T , X ,0 log k ( Z T , X ,  )  .. (3.3). Because log h(T , X  ) is not relative to Z, we have EZ T , X , 0 log h(T , X  )   log h(T , X  )  log L(  ) .. Then the equation (3.3) can we written as follows, log L( )  EZ T , X , 0  log Lc ( )   EZ T , X , 0 log k ( Z T , X ,  )  .. (3.4). In the EM algorithm, while we aim to maximize log L(  ) , only the first term on the right side of equation (3.4) need to be considered. Define Q-function as follows, Q(  0 )  EZ. T , X ,0. log Lc ( ) .. Our goal is maximize the log-likelihood function log L( ) by maximizing the Qfunction Q(  0 ) . EM method has two steps: Pick a starting value  0 and set k  0 . (1) Expectation (E-step): Calculate Q-function. Q(  ( k ) )  EZ T , X , ( k )  log Lc ( ) . (2) Maximization (M-step): Maximize Q(  ( k ) ) in  and take. ( k +1) = arg max Q( ( k ) ) . . 16.

(27) Repeat E-step and M-step until sequence convergence is achieved. On the (k+1)-th iteration of the E-step in our mixture model, we calculate Qfunction, which is the expectation of the complete-data log-likelihood (2.8) conditional on the current estimate of the parameter and the observed data. n. . g. . Q( p,  p( k ) ,  ( k ) )   zij ( k )  j log  pi  f i (t j X j ,  i )   (1   j ) log  pi  Si (t j X j ,  i )  , j 1 i 1. where p( k ) ,  ( k ) is the estimate of p,  after the kth iteration and. zij. (k ). .  E zij p ,  (k ). (k ). . . 1 j. (k )  j. ( k ) 1 j. pi ( k ) fi (t j X j ,  i ( k ) ) j Si (t j X j ,  i ( k ) ) g. p l 1. l. (k ). (3.5). fl (t j X j ,  l ) Sl (t j X j ,  l ). is the posterior probability that the j-th individual with survival time t j would have failed due to the i-th type of risk. The (k+1)-th iteration of M-step provides the updated estimate p( k +1) ,  ( k +1) that maximizes the Q function with respect to p,  . Note that Q-function can be decomposed into. Q( p,  p ,  (k ). n. (k ). g. )   zij ( k ) log pi j 1 i 1 n.   z1 j ( k )  j log f1 (t j X j , 1 )  (1   j ) log S1 (t j X j , 1 )  j 1.  n.   z gj ( k )  j log f g (t j X j ,  g )  (1   j ) log S g (t j X j ,  g )  j 1.  Q0  Q1 .  Qg .. (3.6). with respect to the unknown parameters p and 1 ,...,  g , respectively. It implies that the estimates of p and 1 ,...,  g can be updated separately by maximizing Q0 and Q1 ,. , Qg .. The M-step for estimating mixing probabilities p is shown in section 3.2. The Mstep for estimating regression coefficients 1 ,...,  g is shown in section 3.4. In the 17.

(28) section 3.3, we will introduce two non-parametric methods of estimating the baseline hazard function.. 3.2 Estimation of mixing probabilities In the decomposition of the Q-function (3.6), only Q0 is related to the parameters of mixing probabilities p  ( p1 , p2 ,..., pg 1 ) . Therefore, to obtain an updated estimate of p , we only need to maximize Q0 . On the (k+1)-th iteration of the M-step, Q0 is expressed as g. n. Q0   zij ( k ) log pi . j 1 i 1. Note that, if we directly take the first derivate of Q0 with respect to pi and set it to be zero, we will receive zij ( k ) pi 1  0 . Then we can’t calculate the estimate pi (k +1) .This problem occurs because we don’t consider the constraints for p . In mathematical optimization, the method of Lagrange multipliers is a strategy for calculating the maxima and minima of a function subject to constraints. Here, we use this method to get the estimate of p . Consider the following constraints for p , g.  p 1 i 1. i. Define a variable  called the Lagrange multiplier. Then we can write down the Lagrange function Q0 L as follows which ensure the constraints are satisfied. n. g. Q0   zij L. j 1 i 1. g. (k ). log pi   ( pi  1) i 1. The estimate pi (k +1) is obtained by setting the first derivate of Q0 L with respect to pi equal to zero. As follows, g    n g (k ) z log p   ( pi  1)   0 .   ij  i pi  j 1 i 1 i 1 . 18.

(29) Then we receive zij ( k )   pi . Adding all groups on both sides of the equations and considering the constraints for p and Z. As below, g. n. g.  zij (k )    pi  0 , i 1 j 1. i 1. g. n.  zij  n and i 1 j 1. g.  p 1. i 1. i. We get   n . Therefore, the estimate pi (k +1) is n. z j 1. pi ( k 1) . (k ). ij. n. .. (3.7). From this formula, the estimate of the mixing probability of failure due to the i-th type of risk is equal to adding posterior probabilities that individuals would have failed from the i-th type of risk and dividing the sample size n. That means if there are many high posterior probabilities that individuals would have failed from the i-th type of risk, the mixing probability of failure due to the i-th type of risk will be large. This is very intuitive.. 3.3 Estimation of the baseline hazard function In this section, we show two methods of estimating the baseline hazard function. One of these referred to the paper written by McLachlan et al. [7], who assumed that the baseline hazard function is piecewise constant, and used maximum likelihood estimation. This method is shown in section 3.1.1. The other method is the new recommended method. Agathe et al. [12] proposed kernel estimator to estimate the baseline hazard function under non-mixture Cox proportional hazard model. In section 3.1.2, we extend the idea of kernel estimator to the mixture model. This method is mainly used when we simulate and analyze data in this thesis.. 3.3.1 Piecewise constant estimator 19.

(30) In the decomposition of the Q-function (3.6), Q1 , Q2 ,..., Qg are functions of component-baseline hazard functions h01 ,..., h0g , respectively. On the (k+1)-th iteration of the M-step, the equation Qi for i  1,..., g can be written as following, n. Qi   zij ( k )  j log f i (t j X j ,  i )  (1   j ) log Si (t j X j ,  i )  j 1 n. . .   zij ( k )  j log h0i (t j )  X j T  i   exp( X j T  i ) H 0i (t j ) . j 1. To obtain the estimates of h01 ,..., h0g , we will maximize Qi with  i fixed at i (k ) . McLachlan et al. [7] assumed that both component-baseline hazard functions were taken to be piecewise constant and treated each observed survival time as a cut point. We rearrange the survival times in increasing order and denote t( j ) for j  1,..., n with t(1) < t(2) <...< t(n) . Then the i-th component-cumulative baseline hazard function. H0i (t( j ) ) can be computed as follows, j. H 0i (t( j ) )   h0i (t( m) )  (t( m)  t( m1) ) . m 1. Let Qi , equal to the equation Qi with  i fixed at i (k ) , be written as j n  (k )  T (k ) T (k )  Qi   z(ij )  (j ) log h0i (t(j ) )  X (j )  i   exp( X (j )  i ) h0i (t( m) )  (t( m)  t( m1) )  , j 1 m 1  . where z(ij ) is the posterior probability that the sorted j-th individual with survival time t(j ) would have failed from the i-th type of risk.. The estimate of h0i ( k 1) (t(j ) ) is obtained by setting the first derivate of Qi with respect to h0i (t(j ) ) equal to zero. As follows,. z(ij ) ( k ) (j ) Qi   (t( j )  t( j 1) )   z(im ) ( k ) exp( X (m )T  i ( k ) )  0 . h0i (t(j ) ) h0i (t(j ) ) m j Define Z(i )  (z(i1) ,...,z(in ) )T be the vector of posterior probabilities that the sorted individuals would have failed from the i-th type of risk. The estimate of h0i ( k 1) (t(j ) ) is. 20.

(31) h0i. ( k 1). (t(j ) ) . z(ij ) ( k ) (j ) (t( j )  t( j 1) )   z(im ). (k ). m j. exp( X (m )  i ) T. .. (k ). Furthermore, the estimate of H 0i ( k 1) (t(j ) ) is. H 0i. ( k 1). j. j. k 1. k 1. (t( j ) )   h0i (t( k ) )  (t( k )  t( k 1) )  . z(ik ) ( k ) (k ). z. m k. (k ) (im ). exp( X (m )  i ) T. .. (3.8). (k ). This estimation of cumulative baseline hazard function is also called Breslow estimator.. 3.3.2 Kernel estimator Note that, if we use piecewise constant function to estimate curve, it will not be so smooth. Hence, we apply the kernel estimator and estimate the curve more smoothly. Moreover, we can also estimate at those points that are not observed. Define N j (t )  I (t j  t   j  1) be an event counting process. It’s the number of research interest events that occurred before or equal to time t. And define Yj (t )  I (t j  t ) be at risk process. It’s the number of individuals whose survival time. is more than or equal to time t. Then the Breslow estimator (3.8) can be rewritten as. H 0i. ( k 1). n. zij ( k ) j. j 1. Sni (t X , Zi ( k ) ,  i ( k ) ). (t X , Zi ,  i )   (k ). (k ). ,. (3.9). n. T where Sni (t X , Z i ,  i )   zij exp( X j  i )Y j (t ) . j 1. The known estimator of the baseline hazard function is a kernel estimator, introduced by Ramlau [16]. Agathe et al. [12] proposed kernel estimator to estimate the non-mixture Cox proportional hazards model. In this thesis, we extend the idea of kernel estimator to the mixture model. A kernel function is a non-negative real-valued integral function K. For most applications, it is desirable to define the function to satisfy two requirements: 21.

(32) (1) Normalization:  K (u ) du  1 ; (2) Symmetry: K (u )  K (u ) for all values of u. There are several types of kernel functions are commonly used. Here we introduce some types of kernel functions as follows. Table 3-1. Different types of kernel function. K (u ) Kernel function Uniform Triangular Epanechnikov Biweight Triweight. 1 , u 1 2 K (u )  1  u , u  1. K (u ) . 3 (1  u 2 ) , u  1 4 15 K (u )  (1  u 2 ) 2 , u  1 16 35 K (u )  (1  u 2 )3 , u  1 32 K (u ) . Cosine. K (u ) . Gaussian. K (u ) . . 4. cos(. . 2. u) , u  1. 1  12 u 2 e 2. ,   u  . Figure 3-1. Different types of kernel function. The updated kernel estimator of i-th component-baseline hazard function h0i (t ) is defined by. h0i (k 1) (t X , Zi ( k ) ,  i ( k ) , b)  where K :. . 1   t u  K dH 0i (k +1) (u X , Zi ( k ) ,  i ( k ) ) ,   0 ,   b 0  b . is a kernel function as above, and b is a positive parameter called the 22.

(33) bandwidth. Derived by smoothing the increments of the Breslow estimator (3.9). The kernel estimator of h0i ( k 1) (t j ) can be written as. h0i. (k 1). (t X , Zi ,  i (k ). where Y . (k ). (k ) 1 n   t  u  zij I (Y (u)  0) , b)    K  dN j (u) ,  b j 1 0  b  Sni (u X , Zi ( k ) ,  i ( k ) ). (3.10). n 1 n Y j and Sni (u X , Zi ,  i )   zij exp( X j T  i )Y j (u ) .  n j 1 j 1. The bandwidth of the kernel estimator is a free parameter which exhibits a strong influence on the resulting estimate. If the bandwidth is too small, the curve is under smoothed since it contains too many spurious data artifacts. By contrast, if the bandwidth is too large, the curve is over smoothed since it obscures much of the underlying structure. As a result, when using the kernel estimator, the selection of the bandwidth is very important. Here we introduce cross-validation method of selecting the bandwidth. The idea comes from the work written by Horova et al.[17]. The main idea of cross-validation method is to use the mixture sum of squared. . . error of baseline hazard function MSSE h0 (  X , Z ,  , b) g. as follows,. n. MSSE h0 (  X , Z ,  , b)   zˆij  hˆ0i ( t j X , Zˆ i , ˆi , b )  h0i ( t j )  .   i 1 j 1 2. Expand the full square of the above formula, we can get g. n.  zˆ i 1 j 1. ij.  hˆ0i 2 ( t j X , Zˆ i , ˆi , b )  2hˆ0i ( t j X , Zˆ i , ˆi , b )  h0i ( t j )+h0i 2 ( t j )  .  . Split the above formula into three parts. Assume the first, second and third term are g. n. MSSE1 h0 (  X , Z ,  , b)   zˆij  hˆ0i 2 ( t j X , Zˆ i , ˆi , b ) ; i 1 j 1 g. n. MSSE2 h0 (  X , Z ,  , b)  2  zˆij  hˆ0i ( t j X , Zˆ i , ˆi , b )  h0i ( t j ) ; i 1 j 1. g. n. MSSE3 h0 (  X , Z ,  , b)   zˆij  h0i 2 ( t j ) . i 1 j 1. . . Note that the third term MSSE3 h(  X , Z ,  , b) does not depend on bandwidth b, so 23.

(34) we ignore this term. The first term can be calculated by (3.10). That is, g. n. MSSE1 h0 (  X , Z ,  , b)   zˆij   hˆ0i ( t j X , Zˆ i , ˆi , b )  ,   i 1 j 1 2. (3.11). 1 n   t  u  zˆij I (Y (u )  0) where hˆ0i (t X , Zˆ i , ˆi , b)    K  dN j (u ) .  b j 1 0  b  S ni (u X , Zˆ i , ˆi ). The unbiased estimate of the second term which is taken by Patil [18] is . g. MSSE2 h0 (  Z , b) =2 i 1. l j. 1  tl  t j K b  b.  zˆil l zˆij j ,   Y (tl ) Y (t j ). (3.12). n. where Y (t )   I (t j  t ) . This idea is based on the “leave-one-out” estimator. j 1. Then we consider the cross-validation function combining (3.11) and (3.12) as follows, . CV (b)  MSSE1 h0 (  X , Z ,  , b)  MSSE2 h0 (  Z , b) g g n 2 1  t t   zˆij   hˆ0i ( t j X , Zˆ i , ˆi , b )   2 K  l j   b  b i 1 j 1 i 1 l j.  zˆil l zˆij j .  Y ( t ) Y ( t )  l j. The selection of bandwidth using cross-validation method is defined by bCV  arg min CV (b) , bBn. where Bn be the set of acceptable bandwidths.. 3.4 Estimation of regression coefficients In the decomposition of the Q-function (3.6), Q1 , Q2 ,..., Qg are functions of regression coefficients 1 ,...,  g , respectively. On the (k+1)-th iteration of the M-step, the equation Qi for i  1,..., g can be written as follows, n. Qi   zij ( k )  j log f i (t j X j ,  i )  (1   j ) log Si (t j X j ,  i )  j 1 n. . .   zij ( k )  j log h0i (t j )  X j T  i   exp( X j T  i ) H 0i (t j ) . j 1. To obtain the estimate for 1 ,...,  g , the first step involves the calculation of h0i (k 1) (t ) and H0i (k 1) (t ) by kernel estimator given  i fixed at i (k ) . The details for kernel 24.

(35) estimator, see section 3.3.2. The second step is to calculate i (k +1) by maximizing Qi given h0i (t ) fixed at h0i (k 1) (t ) and H 0i (t ) fixed at H 0i (k 1) (t ) . Define the score vector U (  i ) for i  1,..., g as the first derivate of Qi with respect to the vector  i given H 0i (t ) fixed at H 0i (k 1) (t ) .. U ( i ) . Qi  i. n. H 0 i ( t j )=H 0 i (k 1) ( t j ).   zij ( k )  j  exp( X j T  i ) H 0i (k 1) (t j )  X j. (3.13). j 1. The estimate i (k +1) is obtained by setting the score vector (3.13) equal to zero. n. U (  i (k +1) )   zij ( k )  j  exp( X j T  i (k +1) ) H 0i ( k +1) (t j )  X j  0. (3.14). j 1. Since the estimate i (k +1) cannot be written as a closed form, Newton–Raphson method is used for estimation. In numerical analysis, Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a method for calculating better approximations to the roots of a realvalued function. Here, we use this method to get the updated estimates of 1 ,...,  g . The Hessian matrix H (  i ) for i  1,..., g is denoted as follows, H ( i ) . n U (  i )   X j zij ( k ) exp( X j T  i ) H 0i ( k +1) (t j ) X j T .  T  i j 1. (3.15). Let  i (k ,l ) be the l-th iteration of estimating i (k +1) by Newton–Raphson method and. i (k ,0)  i (k ) . According to Taylor expansion, the first-order expansion of U (i (k ,l ) ) about  i (k ,l 1) is (3.14). 0  U ( i ( k ,l ) )  U ( i ( k ,l 1) )+H (  i ( k ,l 1) )  ( i ( k ,l )   i ( k ,l 1) ) . After the operations with matrices, we obtain 1.  i ( k ,l )   i ( k ,l 1)   H ( i ( k ,l 1) ) U ( i ( k ,l 1) ) .. (3.16). We will do a series of iterations to update the estimate i (k +1) until a convergence ( k ,l ) ( k ,l 1) criterion is met. That is to do (3.15) repeatedly until  i   i. 25. 2.   , where  >0.

(36) is a small termination value and vector X  ( x1 , x2 ,..., xn ) is X. . 2. 2. is the L2 norm. Note that, the L2 norm of a.  x12  x2 2  ...  xn 2 . If this convergence criterion is. met, then we set  i (k 1)   i (k ,l ) . Therefore, using Newton–Raphson method, we can update the estimate of 1 ,...,  g .. 3.5 Algorithm and convergence Recall the symbols used in this thesis: (1) The sample size is n and the number of risks is g. (2) The vector of mixing probabilities is p  ( p1 , p2 ,..., pg 1 )T . (3) The matrix of regression coefficients is   ( 1T ,  2T ,...,  g T )T , where i  (i1 , i 2 ,..., id )T . (4) The matrix of baseline hazard rates of every survival time would have failed due to each type of risk is h0  (h01T , h02T ,..., h0g T )T , where h0i  (h0i (t1 ), h0i (t2 ),..., h0i (tn ))T . (5) The matrix of posterior probabilities is Z  ( Z1T , Z2T ,..., ZnT )T , where Z j  ( z j1 , z j 2 ,..., z j g )T . Note that, the superscript (.) appearing below represents the number of iteration. The algorithm is shown as follows, Fix n and g. Pick an initial mixing probabilities p (0) which every entry is 1 g , regression coefficients  (0) , baseline hazard rates h0(0) given a constant for each row and a termination value,  >0 ; Set the initial counter l  1 . Step 1 (E-step): Compute Z (l 1) with p (l 1) , h0(l1) and  (l 1) by (3.5); Step 2 (M-step): Update p(l ) with Z (l 1) by (3.7); Step 3 (M-step): Update h0l with Z (l 1) and  (l 1) by (3.10); 26.

(37) Step 4 (M-step): Update  (l ) with Z (l 1) , h0(l ) and  (l 1) by (3.16); Step 6: IF. p(l )  p(l 1). 2.  h0( l )  h0( l 1). 2.   (l )   ( l 1). 2.   , THEN stop;. ELSE let l  l +1 and GOTO Step2. In Step6,. . 2. is the L2 norm. Here, we use L2 norm to calculate the distance. between two matrices. If the distance between the matrix of estimated parameters in this iteration and the previous iteration is less than a small value  , then we say the iteration sequence is convergent and stop the iteration. The process of the iteration is shown in the following Figure 3-2.. Pick initial values:. E-step. M-step. Compute. Update. with. ,. ,. . Set the initial counter. with. ,. and. Update. .. .. .. with. and .. Update. with. ,. Compute. If. and. .. .. , then we get the estimations.. If. , then we let. Figure 3-2. The process of the iteration.. 27. ..

(38) 28.

(39) Chapter Ⅳ.. Validity indices. 4.1 Guidelines for mixture model selection McLachlan et al. [7] proposed an EM-based semi-parametric mixture model to analyze the data with competing risks. In their study, they assume that if we have observed the failure time, then we know the individual failed due to which type of risk. Only when we have observed the censored time, we do not have this information. Thus, we can get the information about the number of risk types from the observed failure data. Their analysis is built on the fact that the number of risk types is already known. But even if when we have observed the failure time, frequently, we can't know the individual failed due to which type of risk. That means we don’t know how many types of risk will affect the occurring of event. We can only guess from the experience that these data may be classified into some types of risk. As a result, the purpose of our study is not only to estimate the distribution of the survival time, but also to find out the most appropriate number of risk types and each individual may have failed due to which type of risk. In the mixture model, we can classify individuals to different types of risk through the posterior probabilities. Recall that zˆij denotes the posterior probability that the j-th individual would have failed due to the i-th type of risk. Accordingly, we can match individuals into the type of risk by posterior probabilities. In order to determine the number of risk types and select the appropriate model, we provide two guidelines for choosing indices. These two guidelines are shown below. (1) Extreme posterior probabilities If the posterior probability of each type of risk is very extreme, close to zero or one, then we can easily match this individual into one type of risk with largest posterior. 29.

(40) probability. For example, suppose there are two types of risk. The posterior probabilities of the observation A are 0.9 and 0.1 according to the two types of risk, respectively. The posterior probabilities of observation B are 0.5 according to both types of risk. Then we can match the observation A into the risk that the posterior probability is 0.9. But the observation B cannot be clearly matched. (2) Small difference of residuals The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest. In our model, we consider the distance between the survival time and the expectation of this survival time given the covariates and estimators. If the distance is small, the model fits the data well.. 4.2 Validity indices involving only posterior probabilities In this section, we only consider posterior probabilities to figure out the indices. We refer to the idea of fuzzy clustering algorithm. In fuzzy clustering algorithm, zˆij denotes the membership in the i-th cluster of the j-th individual. Note that our model is an EM-based mixture hazard model. In EM algorithm, zˆij denotes the posterior probability that the j-th individual with survival time t j would have failed due to the ith type of risk. The memberships in fuzzy clustering algorithm and the posterior probabilities in EM algorithm all satisfies: (1) 0  zˆij  1 for every i  1, 2,..., g and j  1, 2,..., n . g. (2).  zˆ i 1. ij.  1 for every j 1, 2,..., n .. n. (3) 0   zˆij  n for every i  1, 2,..., g . j 1. n. (4). g.  zˆ j 1 i 1. ij.  n for every i  1, 2,..., g and j  1, 2,..., n .. 30.

參考文獻

相關文件

– Each time a file is opened, the content of the directory entry of the file is moved into the table.. • File Handle (file descriptor, file control block): an index into the table

When the Hessian of each constraint function is of rank 1 (namely, outer products of some given so-called steer- ing vectors) and the phase spread of the entries of these

Cowell, The Jātaka, or Stories of the Buddha's Former Births, Book XXII, pp.

• When light is refracted into two rays each polarized with the vibration directions.. oriented at right angles to one another, and traveling at

In a nonparametric setting, we discuss identifiability of the conditional and un- conditional survival and hazard functions when the survival times are subject to dependent

In this paper, we would like to characterize non-radiating volume and surface (faulting) sources for the elastic waves in anisotropic inhomogeneous media.. Each type of the source

Cumulative emissions of CO2 largely determine global mean surface warming by the late 21 st century and beyond.. Cumulative emissions of CO2 largely determine global mean surface

Corollary 13.3. For, if C is simple and lies in D, the function f is analytic at each point interior to and on C; so we apply the Cauchy-Goursat theorem directly. On the other hand,