
Master's Thesis, Department of Mathematics, National Taiwan Normal University
Advisor: Dr. 張少同

Detecting Outliers in Mixture Hazard Models with Weibull Baselines
檢測以韋伯分布為基線之混合風險模型的離群值

Graduate student: 何莉維

August 2018

摘要

Outlier detection is an important topic in statistical analysis: it identifies events or observations that differ markedly from the rest of the data. Finding these observations in time and handling them appropriately can improve the results of the statistical analysis and lead to a reasonable interpretation of the data model. In practice, outlier detection is commonly applied to problems such as structural defects and medical studies.

In medical applications, the Cox proportional hazard model is the most widely used model in survival analysis; it is mainly used to study the relationship between survival time and covariates. Many authors have therefore proposed outlier detection methods for hazard models, but few address the mixture hazard model. The mixture hazard model has nevertheless received increasing attention in this area because, in medicine, a disease is often divided into several groups; it is therefore important to develop a method that detects outliers and estimates the model for mixture hazard models, and this thesis studies both outlier detection and model estimation for this model.

This thesis studies outlier detection for the mixture hazard model, which is widely applied in medical research, with a Weibull baseline. Shrinkage parameters are introduced into the existing likelihood function together with a penalty term, and the EM algorithm is used to estimate the shrinkage parameters and thereby detect the outliers in the data. The detected outliers are then down-weighted or deleted to adjust the model and optimize the parameter estimates.

Simulation results show that the proposed method detects the outliers in the data effectively, and that deleting the outliers usually yields better parameter estimates.

Keywords: outlier detection, mixture hazard model, Weibull distribution, penalty function, EM algorithm

Abstract

Outlier detection is an important issue in statistical analysis. It is a method for identifying data or observations that are extremely abnormal within a dataset. Detecting these observations and treating them appropriately can improve the estimation results and lead to a reasonable interpretation of the model. Outlier detection is commonly applied to structural defects, medical problems, and other types of problems.

In medical problems, the Cox proportional hazard model is the most widely used model in survival analysis. It is mainly used to explore the relationship between survival time and covariates. Although many approaches have been proposed for outlier detection in survival models, few of them consider outlier detection in the mixture hazard model. However, the analysis of mixture hazard models has become popular recently because, in medicine, diseases are often divided into several groups according to their causes. As a result, it is very important to develop a method for detecting outliers and fitting the estimates of mixture hazard models, and this thesis addresses this issue.

In this thesis, we focus on outlier detection for the mixture hazard model with Weibull baselines. We introduce shrinkage parameters into a penalized likelihood function to detect the outliers and develop an EM algorithm to estimate the shrinkage parameters. After detecting possible outliers, we refit the model parameters by either weighting or deleting the outliers. The simulation results reveal that the proposed method can detect the outliers of the mixture hazard model effectively. Additionally, the outlier-deleted method generally obtains better parameter estimates in the sense of smaller bias.

Keywords: outlier detection, mixture hazard model, Weibull distribution, penalty function, EM algorithm

致謝

My two and a half years of graduate study at National Taiwan Normal University are coming to an end, and during this time I have learned a great deal of statistical theory and programming for data analysis. Many people deserve thanks for the completion of this thesis. First, I sincerely thank my advisor, Professor 張少同, for his dedicated and professional guidance over the past two years: he patiently steered the direction of my research and taught me how to use logical thinking to solve many difficult programming problems. He also gave me great help with career planning, always pointing me in the right direction and offering much encouragement and advice, which allowed me to finish this thesis while working.

I also thank Professor 李孟峰 and Professor 呂翠珊 for serving on my oral examination committee despite their busy schedules. They not only offered many valuable comments and patiently corrected grammatical and terminological problems in the thesis, but also taught me deeper material extending from it, from which I benefited greatly and which made the thesis more complete.

I further thank my research partner 張怡雯, who gave me many suggestions and much help throughout the thesis work and patiently helped me solve many programming problems. From our initial exploration of the thesis topic, through searching the literature and discussing problems with each other, to writing the programs for the analysis, we walked this bumpy road together; it was hard, but it is also a fond memory.

Finally, I thank my dearest family and friends for their financial and moral support during my graduate study and the writing of this thesis. I dedicate this thesis to my respected teachers, dear classmates, beloved family, and all the friends who care about me, to express my boundless gratitude.

何莉維
Statistics Group, Department of Mathematics, National Taiwan Normal University
July 2018

Contents

摘要 i
Abstract ii
致謝 iii
Contents iv
List of Tables v
List of Figures vii
I. Introduction 1
II. Cox proportional hazard model with Weibull baselines 4
  2.1 Cox proportional hazard model 4
  2.2 Baseline hazard function following Weibull distribution 6
  2.3 Mixture proportional hazard model with Weibull baselines 7
III. Outlier detection on Weibull hazard model 10
  3.1 Robust Weibull mixture model via shrinkage penalization 10
  3.2 Outlier detection using (mixture) standard residual square 13
IV. Parameter estimation 16
  4.1 EM algorithm 16
  4.2 Penalized maximum likelihood estimation 20
  4.3 Data-weighted method 25
  4.4 Outlier-deleted method 29
V. Simulation 33
VI. Practical data analysis 50
VII. Conclusion 58
References 59

List of Tables

Ⅴ-1. Sample size, number of risk types, mixing probability, and number of outliers in each scenario. 34
Ⅴ-2. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalized log-likelihood function. 35
Ⅴ-3. Parameter estimates, standard errors, bias, MSE, ARB, MsSSE, and BIC of the data-weighted method and the outlier-deleted method. 37
Ⅴ-4. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalized log-likelihood function. 38
Ⅴ-5. Parameter estimates, standard errors, bias, MSE, ARB, MsSSE, and BIC of the data-weighted method and the outlier-deleted method. 39
Ⅴ-6. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalized log-likelihood function. 41
Ⅴ-7. Parameter estimates, standard errors, bias, MSE, ARB, MsSSE, and BIC of the data-weighted method and the outlier-deleted method. 42
Ⅴ-8. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalized log-likelihood function, with most of the outliers distributed in the risk type with the larger proportion. 44
Ⅴ-9. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalized log-likelihood function, with most of the outliers distributed in the risk type with the smaller proportion. 44
Ⅴ-10. Parameter estimates, standard errors, bias, MSE, ARB, MsSSE, and BIC of the data-weighted method and the outlier-deleted method, with most of the outliers distributed in the risk type with the larger proportion. 46

Ⅴ-11. Parameter estimates, standard errors, bias, MSE, ARB, MsSSE, and BIC of the data-weighted method and the outlier-deleted method, with most of the outliers distributed in the risk type with the smaller proportion. 46
Ⅴ-12. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalized log-likelihood function. 47
Ⅴ-13. Parameter estimates, standard errors, bias, MSE, ARB, MsSSE, and BIC of the data-weighted method and the outlier-deleted method with a small sample size. 48

List of Figures

Ⅴ-1. The data scatter plot with the expectation lines given the estimated parameters according to the regular method and the penalty method. 36
Ⅴ-2. The data scatter plot with the expectation lines given the estimated parameters according to the data-weighted method and the outlier-deleted method. 37
Ⅴ-3. The data scatter plot with the expectation lines given the estimated parameters according to the regular method and the penalty method. 38
Ⅴ-4. The data scatter plot with the expectation lines given the estimated parameters according to the data-weighted method and the outlier-deleted method. 40
Ⅴ-5. The data scatter plot with the expectation lines given the estimated parameters according to the regular method and the penalty method. 41
Ⅴ-6. The data scatter plot with the expectation lines given the estimated parameters according to the data-weighted method and the outlier-deleted method. 43
Ⅴ-7. The data scatter plot with the expectation lines given the estimated parameters according to the regular method and the penalty method. 48
Ⅴ-8. The data scatter plot with the expectation lines given the estimated parameters according to the data-weighted method and the outlier-deleted method. 49

I. Introduction

Statistics is developing rapidly. There are more and more methods for establishing statistical models, and the models themselves have become more complex. However, many researchers often ignore the existence of extreme values or outliers when inspecting the data before conducting an analysis. Whether in traditional descriptive data analysis, in common regression models, or in the increasingly popular survival analysis, if the existence of extreme values is neglected during the analysis, the results of the fitted models may not be correct, and subsequent interpretations or theories built on them will also be flawed. As a result, outlier detection methods have been widely developed and have attracted considerable interest in various fields.

A great deal of research has introduced outlier detection in linear regression models. For example, the least squares method (LSM) (2004) [1] and the sigma filter (1983) [2] have frequently been used to remove the outliers of linear regression. On the other hand, few papers discuss outlier detection in survival analysis models.

There are many definitions of an outlier in the literature, both mathematical and more informal, as reviewed thoroughly in Ben-Gal (2005) [3]. For example, Hawkins (1980) [4] treats an observation that deviates greatly from the other observations as an outlier, one that arouses suspicion that it was generated by a different mechanism. Andreas (1996) [5] proposed a linear method for deviation detection in large databases. Moreover, outlier detection in mixture models has also received much attention. Chun Yu (2015) [6] used a nonconvex penalized likelihood for outlier detection and robust estimation in mixture models. Not only detecting outliers but also obtaining robust inference for the model is important. Peel and McLachlan (2000) [7] proposed robust mixture modelling using the t distribution. Markatou (2000) [8]

developed a robust estimation procedure for mixture models by using a weighted likelihood for each data point. Fujisawa and Eguchi (2005) [9] introduced a robust estimation method for the normal mixture model using a modified likelihood function. Neykov (2007) [10] proposed robust fitting of mixtures using the trimmed likelihood. However, few studies address outlier detection in mixture survival analysis, even though survival analysis has become increasingly important in recent years.

In this thesis, we introduce a new method of outlier detection for the mixture Cox proportional hazard model (1972) [11]. For more details about the Cox model, see also Kalbfleisch and Prentice (1992) [12], The Statistical Analysis of Failure Time Data, and Benichou et al. (1992) [13]. Attention in this model is focused on inference concerning the effects of factors on both the probability of occurrence and the hazard rate conditional on each of the failure types. Data of this kind are so-called competing-risks data. We take the model proposed by Ahmad Mahir Razali (2013) [14] as a reference and consider mixture Weibull distributions for fitting failure times according to each of the failure types. Furthermore, the new outlier detection method we propose is a penalty method for the mixture hazard model, which detects the outliers through shrinkage parameters.

In analyzing competing-risks data, we deal with a number of components, which results in a much more complicated model and hence may cause difficulties in estimation. Alternatively, we can adopt a flexible, rich class of baseline hazards. For example, Larson and Dinse (1985) [15] specified the baseline hazard function to be piecewise constant. Gelfand et al. (2000) [16] proposed the continuous form of the summation of an arbitrary number of parametric hazards, such as the Weibull baseline. Here, we take the Weibull distribution to be the baseline hazard function in the mixture model. Estimation is

based on the EM algorithm, which maximizes the penalized likelihood on the basis of the full likelihood. Furthermore, we use either the data-weighted method or the outlier-deleted method to robustify the parameter estimation.

The remainder of this thesis is organized as follows. In Section II, we give a brief introduction to the Cox proportional hazard model and then proceed to a baseline hazard function following the Weibull distribution; we then describe the mixture hazard model with Weibull baselines. In Section III, we introduce the shrinkage parameters and impose the penalty function on the full likelihood in order to determine whether there are outliers in the dataset. Furthermore, we propose the mixture standard residual square to detect the outliers, and introduce the data-weighted method and the outlier-deleted method to robustify the estimates. In Section IV, we first review the EM algorithm and then describe the estimation of each parameter under the regular log-likelihood function, the penalized log-likelihood function, the data-weighted log-likelihood function, and the outlier-deleted method. Additionally, we introduce the indices used to compare the robustness of the estimates obtained by the data-weighted and outlier-deleted methods. In Section V, several simulation studies demonstrate how the penalized log-likelihood function identifies the outliers and how the data-weighted and outlier-deleted methods improve the parameter estimates. In Section VI, we apply the model to analyze real data on prostate cancer patients treated with different dosages of the drug diethylstilbestrol. We conclude and discuss our studies in Section VII.

II. Cox proportional hazard model with Weibull baselines

Survival analysis is commonly used to study the survival time of humans or animals suffering from a disease in an experiment. However, with the knowledge accumulated over time and improving technology, it also has many applications in the industrial and commercial world. The most commonly used regression model in survival analysis is the Cox proportional hazard model (1972) [11]. In this section, we introduce the parametric Cox proportional hazard model with a baseline hazard function following the Weibull distribution.

2.1 Cox proportional hazard model

Let $T$ be a non-negative variable representing the failure time of an individual in the population. The distribution of $T$ can be represented in the common way in terms of the density or distribution function, as well as in more specialized ways such as the hazard function. Specifically, the cumulative distribution function (c.d.f.), which gives the probability that the event has occurred by duration $t$, is defined as
$$F(t) = \Pr(T \le t),$$
with probability density function (p.d.f.) $f(t)$. In addition, the survival function, which gives the probability of being alive just before duration $t$, or more generally the probability that the event of interest has not occurred by duration $t$, can be written as
$$S(t) = \Pr(T > t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx.$$
An alternative characterization of the distribution of $T$ is given by the hazard function, or instantaneous rate of occurrence of the event, defined as
$$h(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}.$$
The numerator of this expression is the conditional probability that the event will

occur in the interval $[t, t + dt)$ given that it has not occurred before, and the denominator is the width of the interval. Note that the hazard function satisfies
$$h(t) = -\frac{d}{dt}\ln\left[1 - F(t)\right] = -\frac{d}{dt}\ln S(t).$$
Furthermore, the cumulative hazard function is defined as
$$H(t) \stackrel{\text{def}}{=} \int_0^t h(u)\,du = -\ln\left[1 - F(t)\right] = -\ln S(t), \qquad t > 0.$$
The model introduced by Cox (1972) [11] is a survival analysis regression model that describes the relation between the event incidence, as expressed by the hazard function, and a set of covariates. The Cox proportional hazards model at time $t$ among individuals with covariate $x$ is defined as
$$h(t \mid x) = h_0(t)\exp\left(x^T\beta\right),$$
where the superscript $T$ denotes vector transpose and $\beta$ is a vector of regression coefficients. $h_0(t)$ is an arbitrary, unspecified baseline hazard function; it can be interpreted as the hazard function for the population of individuals with $x = 0$. Several common lifetime distributions have been used for the baseline hazard. For example, Gordon (1990) [17] adopted the Gompertz distribution to specify the conditional survival functions in the context of estimating the 'cure' rate of breast cancer after a treatment therapy. Larson and Dinse (1985) [18] took the baseline hazard function to be piecewise constant, and McLachlan (2003) [19] also assumed it to be piecewise constant in studying an EM-based semi-parametric mixture model. In this thesis, we specify a continuous baseline hazard in the form of the Weibull distribution.

2.2 Baseline hazard function following Weibull distribution

Choosing different hazard functions yields a variety of proportional parametric models. As shown previously, there is a direct link between the survival and hazard functions, so the choice of hazard distribution determines that of the survival. Since the Weibull distribution is the theoretical basis for reliability analysis and life testing and is popular in the field of failure analysis, we adopt the Weibull distribution, which is characterized by two positive parameters, to specify the baseline hazard function. First, let $f_0(t)$ be the probability density function (p.d.f.) of the Weibull distribution,
$$f_0(t) = \lambda\gamma t^{\gamma-1}\exp\left(-\lambda t^{\gamma}\right), \tag{2.1}$$
where $\lambda > 0$ is the scale parameter and $\gamma > 0$ is the shape parameter of the distribution. The survival function of the Weibull model can be deduced from the p.d.f. (2.1) as
$$S_0(t) = \Pr(T > t) = \int_t^{\infty} f_0(u)\,du = \exp\left(-\lambda t^{\gamma}\right). \tag{2.2}$$
From (2.1) and (2.2), the hazard function of the Weibull model is then given by
$$h_0(t) = -\frac{d}{dt}\ln S_0(t) = \lambda\gamma t^{\gamma-1}.$$
From the above, adopting the Weibull distribution as the baseline hazard, the proportional hazard model can be written as
$$h(t \mid x) = h_0(t)\exp\left(x^T\beta\right) = \lambda\gamma t^{\gamma-1}\exp\left(x^T\beta\right),$$
which represents the hazard function at time $t$ among individuals with covariate $x$. The corresponding survival function and density function are
$$S(t \mid x) = \exp\left(-\lambda t^{\gamma}\exp\left(x^T\beta\right)\right), \tag{2.3}$$
$$f(t \mid x) = \lambda\gamma t^{\gamma-1}\exp\left(x^T\beta\right)\exp\left(-\lambda t^{\gamma}\exp\left(x^T\beta\right)\right). \tag{2.4}$$
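To make the formulas above concrete, the following short Python sketch (not part of the thesis; the function name, argument names, and example values are ours) evaluates the Weibull proportional-hazards quantities (2.1)-(2.4) for a single individual.

```python
# A minimal sketch of the Weibull proportional hazards model with scale lam and shape gam.
import numpy as np

def weibull_ph(t, x, beta, lam, gam):
    """Hazard h(t|x), survival S(t|x), and density f(t|x) under the Weibull PH model."""
    eta = np.exp(np.dot(x, beta))                 # exp(x' beta), the covariate effect
    hazard = lam * gam * t ** (gam - 1.0) * eta   # h(t|x) = lam * gam * t^(gam-1) * exp(x'beta)
    survival = np.exp(-lam * t ** gam * eta)      # S(t|x) = exp(-lam * t^gam * exp(x'beta))
    density = hazard * survival                   # f(t|x) = h(t|x) * S(t|x)
    return hazard, survival, density

# Example: one individual with a single (hypothetical) covariate value x = 0.5
h, S, f = weibull_ph(t=2.0, x=np.array([0.5]), beta=np.array([0.3]), lam=0.8, gam=1.5)
```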

Now we consider a sample of $n$ independent individuals, where $x_j$ is a vector of covariates associated with the $j$-th individual. In this study, $t_j$ is the minimum of the exact failure time $T_j$ and the censoring time $C_j$ of the $j$-th individual, i.e. $t_j = \min\{T_j, C_j\}$. Furthermore, $\delta_j = I(T_j \le C_j)$ is an indicator variable representing the failure status. On the basis of the observed data, the likelihood function of the data $(t_j, x_j, \delta_j)$, $j = 1, \ldots, n$, conditional on $x_1, x_2, \ldots, x_n$, is
$$L(t \mid x, \delta) = \prod_{j=1}^{n}\left\{f\left(t_j \mid x_j\right)\right\}^{\delta_j}\left\{S\left(t_j \mid x_j\right)\right\}^{1-\delta_j}. \tag{2.5}$$
Note that the full likelihood can be written as (2.5) because the censoring is non-informative [20]. Substituting the Weibull baseline (2.3)-(2.4), the log-likelihood function becomes
$$l(\lambda, \gamma, \beta) = \sum_{j=1}^{n}\left\{-\lambda t_j^{\gamma}\exp\left(x_j^T\beta\right) + \delta_j\left[\ln\lambda + \ln\gamma + (\gamma - 1)\ln t_j + x_j^T\beta\right]\right\}. \tag{2.6}$$
The estimates of the parameters can be obtained by maximum likelihood estimation (MLE); we discuss the parameter estimation in Section IV.
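As an illustration of how the censored log-likelihood (2.6) could be evaluated in practice, here is a minimal Python sketch; the function and variable names are ours and the code only mirrors the formula above.

```python
# Censored Weibull PH log-likelihood, eq. (2.6); a sketch, not the thesis implementation.
import numpy as np

def loglik_weibull_ph(t, x, delta, beta, lam, gam):
    """t: (n,) observed times; x: (n, p) covariates; delta: (n,) failure indicators."""
    xb = x @ beta                                 # x_j' beta
    cum_hazard = lam * t ** gam * np.exp(xb)      # lam * t_j^gam * exp(x_j' beta)
    log_density_part = np.log(lam) + np.log(gam) + (gam - 1.0) * np.log(t) + xb
    # Failures contribute log f(t_j|x_j); censored cases contribute log S(t_j|x_j).
    return np.sum(-cum_hazard + delta * log_density_part)
```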

2.3 Mixture proportional hazard model with Weibull baselines

Before discussing the mixture Cox proportional hazard model, it is necessary to mention competing risks. In the model above, each individual is assumed to experience a single type of event. The actual survival situation, however, is more complicated: in addition to the event of research interest, an individual may also encounter other events that are not of direct interest. Once one of these other events occurs, the individual no longer has the opportunity to experience the event of research interest; the occurrence of certain events may therefore obscure the possibility of other events of interest. Such an event is a so-called competing-risk event, and the probability that these types of events occur is called the competing risk. In recent research, Lunn and McNeil (1995) [21] proposed a method using readily available standard programs for fitting an augmented model to analyze competing-risks data.

In the mixture model structure, the $n$ independent individuals may be divided into $g$ mutually exclusive components corresponding to each type of risk. That is, there are $g$ distinct types of risks and the observed failure-time data are
$$y = \left(t_1, x_1^T, D_1, \ldots, t_n, x_n^T, D_n\right)^T, \tag{2.7}$$
where the superscript $T$ denotes vector transpose, $t_j$ is the failure or censoring time of the $j$-th individual, $x_j$ is a vector of covariates associated with the $j$-th individual, $D_j = i$ indicates that the $j$-th individual fails according to the $i$-th type of risk, and $D_j = 0$ indicates a censored observation. The mixing proportions of the $g$ distinct types of risks are assumed to be $\pi = (\pi_1, \pi_2, \ldots, \pi_g)$. Let $\Psi = \left(\lambda_1, \gamma_1, \beta_1^T, \lambda_2, \gamma_2, \beta_2^T, \ldots, \lambda_g, \gamma_g, \beta_g^T\right)^T$ be the vector containing the unknown parameters of the Weibull hazard functions. On the basis of the observed data $y$ given by (2.7), the log-likelihood function for $\Psi$ under the mixture model is given by
$$\log L(\Psi) = \sum_{j=1}^{n}\left\{\sum_{i=1}^{g} I(D_j = i)\log\left[\pi_i f_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right] + I(D_j = 0)\log S\left(t_j \mid x_j, \Psi\right)\right\},$$
where $I(A)$ is the indicator function of the event $A$. The maximum likelihood estimates of $\pi$ and $\Psi$ are obtained by the EM algorithm. Further discussion of the EM algorithm and its application to mixture models in a general context can be found in S. K. Ng and G. J. McLachlan (2003) [19].

To describe the incomplete data, we introduce an unobservable random vector $z_j$ of zero-one indicator variables for each censored observation $t_j$, where $z_j = (z_{1j}, \ldots, z_{gj})^T$ and $z_{ij} = 1$ if the $j$-th individual has failed from risk $i$ and zero otherwise $(i = 1, \ldots, g)$. Since the true failure times of the censored observations would not simplify the calculations, they are not treated as missing variables in the complete-data framework. The complete-data log-likelihood is then given by
$$l^c(\Psi) = \log L^c(\Psi) = \sum_{j=1}^{n}\left\{\sum_{i=1}^{g} I(D_j = i)\log\left[\pi_i f_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}\log\left[\pi_i S_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right]\right\}. \tag{2.8}$$
Substituting the Weibull distribution, the complete-data log-likelihood can be written as
$$l^c(\Psi) = \sum_{j=1}^{n}\left\{\sum_{i=1}^{g} I(D_j = i)\left[\ln\pi_i + \ln\lambda_i + \ln\gamma_i + (\gamma_i - 1)\ln t_j + x_j^T\beta_i - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}\left[\ln\pi_i - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\right]\right\}. \tag{2.9}$$
The parameter estimates will be obtained by using the EM algorithm, which is discussed in the following sections.

III. Outlier detection on Weibull hazard model

The MLE and the EM algorithm based on the assumed parametric model possess many desirable properties; however, they are not robust, and both parameter estimation and inference can fail badly when outliers exist. Our attention in this thesis is therefore on outlier detection and robust estimation in the mixture hazard model. We refer to the mean-shift model presented by Chun Yu and Kun Chen (2015) [6] and develop a new outlier detection and estimation approach based on a shrinkage term coupled with penalization in the likelihood.

3.1 Robust Weibull mixture model via shrinkage penalization

In this section, we introduce the proposed robust mixture modeling approach via shrinkage penalization. To focus on the main idea, we restrict our attention to the Weibull mixture hazard model. The proposed approach can be readily extended to other mixture hazard models with different distributions, such as the Exponential mixture and the Gompertz mixture.

For the observations $y$ of Section 2.3, the proposed robust mixture model with a shrinkage parameterization assumes that the observations have the density function
$$f\left(t_j\omega_j \mid \Psi\right) = \sum_{i=1}^{g} \pi_i f_i\left(t_j\omega_j \mid \Psi\right),$$
where $\Psi = \left(\lambda_1, \gamma_1, \beta_1^T, \ldots, \lambda_g, \gamma_g, \beta_g^T\right)^T$ and $\omega_j$ is the shrinkage parameter of the $j$-th observation. Perceivably, without any constraints, the inclusion of the shrinkage parameters results in an overly parameterized model. Note that, whatever the type of risk, it is possible for individuals to fail within a short period of time, so it is normal to observe small values of $t$.

In addition, the distribution of survival times is severely right-skewed and long-term survival is rare; thus, we consider a relatively large $t$ a potential outlier. The key here is that each shrinkage parameter in the vector $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$ is a number between 0 and 1: $\omega_j$ should be close to 1 when the $j$-th individual is a normal observation in at least one of the components, and only when the $j$-th observation is an outlier should $\omega_j$ be close to zero, to adjust for the outlying effect. Therefore, the estimation of the shrinkage parameters $\omega$ provides a direct way to accommodate and identify outliers. Based on this assumption on $\omega$, we propose to maximize the following complete penalized log-likelihood criterion to conduct model estimation and outlier detection,
$$pl^c(\Psi, \omega) = l^c(\Psi, \omega) - \sum_{j=1}^{n} P_{\xi}(\omega_j),$$
where
$$l^c(\Psi, \omega) = \sum_{j=1}^{n}\left\{\sum_{i=1}^{g} I(D_j = i)\log\left[\pi_i f_i\left(t_j\omega_j \mid \Psi\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}\log\left[\pi_i S_i\left(t_j\omega_j \mid \Psi\right)\right]\right\} \tag{3.1}$$
is the complete log-likelihood (2.9) with the survival times multiplied by the shrinkage parameters, and $P_{\xi}(\cdot)$ is a penalty function with a tuning parameter $\xi$ controlling the number of outliers. Since the shrinkage parameters $\omega_j$ $(j = 1, 2, \ldots, n)$ are constrained to $(0, 1]$, we define the penalty function as
$$P_{\xi}(\omega_j) = \xi\left(\omega_j - 1\right)^2, \qquad 0 < \omega_j \le 1. \tag{3.2}$$
The tuning parameter $\xi$ is specified as a function of $\bar\omega$, the mean of the $\omega_j$, as
$$\xi = \frac{k_0}{\bar\omega}, \tag{3.3}$$
where $k_0$ is a prespecified value. In practice, $k_0$ is chosen to be 3.8, and we explain how to choose this value in Section V (simulation). The reason why

we determine $\xi$ as a function of $\bar\omega$ is to strengthen the role of the penalty function. Since an outlier may lie far away from the rest of the data, we do not take a fixed tuning parameter $\xi_0$ to tune different levels of outliers here. Instead, we define $\xi$ as a fraction: a prespecified value $k_0$ over $\bar\omega$, the mean of the $\omega_j$ $(j = 1, 2, \ldots, n)$. If there are many outliers, $\bar\omega$ becomes smaller, so $k_0$ is divided by a smaller $\bar\omega$ and $\xi$ becomes larger, giving a larger degree of penalty. Conversely, if there are few outliers, the penalty is reduced because $k_0$ is divided by a larger $\bar\omega$. The parameters $(\Psi, \omega)$ can be obtained through the EM algorithm. Combining (3.1), (3.2), and (3.3), we can rewrite the complete penalized log-likelihood function as
$$pl^c(\Psi, \omega) = \sum_{j=1}^{n}\left\{\sum_{i=1}^{g} I(D_j = i)\log\left[\pi_i f_i\left(t_j\omega_j \mid \Psi\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}\log\left[\pi_i S_i\left(t_j\omega_j \mid \Psi\right)\right]\right\} - \xi\sum_{j=1}^{n}\left(\omega_j - 1\right)^2, \tag{3.4}$$
where $0 < \omega_j \le 1$ for $j = 1, 2, \ldots, n$, and $\xi = k_0/\bar\omega$.
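A minimal Python sketch of the penalty machinery in (3.2)-(3.3), assuming the suggested value $k_0 = 3.8$; the function and variable names are ours. The penalized criterion (3.4) is then the complete log-likelihood evaluated at the shrunken times $t_j\omega_j$ minus this penalty term.

```python
# Penalty term of the shrinkage parameterization; a sketch of eqs. (3.2)-(3.3).
import numpy as np

def penalty_term(omega, k0=3.8):
    """omega: (n,) shrinkage parameters in (0, 1]; returns xi and the total penalty."""
    xi = k0 / np.mean(omega)                    # xi = k0 / mean(omega), eq. (3.3)
    total = xi * np.sum((omega - 1.0) ** 2)     # sum_j xi * (omega_j - 1)^2, eq. (3.2)
    return xi, total
```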

After introducing the proposed robust mixture modeling approach via shrinkage penalization, we have to discuss when to use this model. When we collect a set of data, we do not know whether there are any outliers in it. In most applications, the maximum likelihood estimator (MLE) is obtained via the expectation-maximization (EM) algorithm to estimate the mixture model parameters. It is well known, however, that the MLE can be very sensitive to outliers in the data. Hence, we can take both the regular log-likelihood function (2.9) and the penalized log-likelihood function (3.4) into account, estimate the parameters of both models at the same time, and compare the mixture sum of standard squared residuals (MsSSE) and the Bayesian information criterion (BIC) of the two models. The mixture sum of standard squared residuals (MsSSE) is defined as
$$\mathrm{MsSSE} = \sum_{j=1}^{n}\sum_{i=1}^{g} \hat z_{ij}\,\frac{\left(t_j - \hat E_i(t_j)\right)^2}{\widehat{\mathrm{Var}}_i(t_j)}, \tag{3.6}$$
where $\hat E_i(t_j)$ and $\widehat{\mathrm{Var}}_i(t_j)$ are the expectation and the variance of the survival time of each individual corresponding to each type of risk, given the estimates obtained from the penalized log-likelihood criterion, i.e.
$$\hat E_i(t_j) = \int_0^{\infty} P\left(T > t_j \mid x_j, \hat\Psi\right) dt_j = \int_0^{\infty} \hat S_i\left(t_j \mid x_j, \hat\Psi\right) dt_j, \qquad \widehat{\mathrm{Var}}_i(t_j) = \hat E_i\left(t_j^2 \mid x_j, \hat\Psi\right) - \left[\hat E_i\left(t_j \mid x_j, \hat\Psi\right)\right]^2. \tag{3.7}$$
The Bayesian information criterion (BIC) is defined as
$$\mathrm{BIC} = -2\, l^c\left(\hat\Psi\right) + p\log(n).$$
Since the estimates from $l^c(\Psi)$ in (2.9) may be far from the true values when outliers exist, while the penalty method adjusts the estimates through $pl^c(\Psi, \omega)$ in (3.4), we can determine whether there are outliers in the dataset by comparing these indices for the two models.
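The expectation and variance in (3.7) can be computed either by numerical integration of the survival function or, for the Weibull baseline, from the closed-form Weibull moments. The following Python sketch (our own illustration, not the thesis code) takes the closed-form route and then assembles the MsSSE in (3.6); the BIC only additionally needs the fitted log-likelihood and the number of parameters.

```python
# Closed-form mean/variance of the Weibull PH survival time and the MsSSE index (3.6).
import numpy as np
from scipy.special import gamma as gamma_fn

def weibull_ph_moments(x, beta_i, lam_i, gam_i):
    """Mean and variance of T given x when S(t|x) = exp(-lam_i * t^gam_i * exp(x'beta_i))."""
    scale = (lam_i * np.exp(x @ beta_i)) ** (-1.0 / gam_i)   # Weibull scale of T | x
    mean = scale * gamma_fn(1.0 + 1.0 / gam_i)
    var = scale ** 2 * (gamma_fn(1.0 + 2.0 / gam_i) - gamma_fn(1.0 + 1.0 / gam_i) ** 2)
    return mean, var

def mssse(t, x, z_hat, params):
    """MsSSE (3.6): sum_j sum_i z_hat[j, i] * (t_j - E_i(t_j))^2 / Var_i(t_j).
    params[i] = (beta_i, lam_i, gam_i); z_hat has shape (n, g)."""
    total = 0.0
    for i, (beta_i, lam_i, gam_i) in enumerate(params):
        mean, var = weibull_ph_moments(x, beta_i, lam_i, gam_i)
        total += np.sum(z_hat[:, i] * (t - mean) ** 2 / var)
    return total

# BIC would then be  -2 * fitted_loglik + n_params * np.log(n).
```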

3.2 Outlier detection using (mixture) standard residual square

Once we have evidence that outliers are present in the dataset, it is important, for the purpose of obtaining a reliable model, to detect which observations are the outliers. In this thesis the outliers are identified using the estimates of the complete penalized log-likelihood function $pl^c(\Psi, \omega)$ in (3.4). First, we adopt the EM algorithm to obtain the penalized estimates $\hat\Psi$ and $\hat\omega$; the details of the EM algorithm are discussed in the next section. Second, we use the penalized estimates $\hat\Psi$ and $\hat\omega$ to compute the expectation $\hat E_i(t_j)$ and variance $\widehat{\mathrm{Var}}_i(t_j)$ of the survival time of each individual $j$ corresponding to risk $i$ from (3.7). Although the penalized estimates would improve the residual standard deviation estimation and thereby suppress the mis-estimation caused by the outliers, we do not take them as the final estimates. One reason is that this suppression of the mis-estimation is not strong enough; the other is that the data have already been deformed once the survival time of each individual is multiplied by the shrinkage parameter $\omega_j$ $(j = 1, 2, \ldots, n)$. Hence, we only use the penalized estimates to compute the mean $\hat E_i(t_j)$ and variance $\widehat{\mathrm{Var}}_i(t_j)$ of the data.

Next, we detect the outliers using the mixture standard squared residual of the $j$-th individual ($\mathrm{MsSE}_j$), defined as
$$\mathrm{MsSE}_j = \sum_{i=1}^{g} \hat z_{ij}\,\frac{\left(t_j - \hat E_i(t_j)\right)^2}{\widehat{\mathrm{Var}}_i(t_j)}. \tag{3.8}$$
$\mathrm{MsSE}_j$ is the squared standardized distance between the $j$-th individual and the expectation $\hat E_i(t_j)$ of the survival time given the estimates corresponding to risk $i$ from (3.7). Since an outlier is an observation with extreme abnormality in the dataset, the $\mathrm{MsSE}_j$ of an outlier will be larger than that of normal data. As a result, if the $\mathrm{MsSE}_j$ of the $j$-th individual is larger than a prescribed critical value, the $j$-th individual is classified as an outlier. Note that we assume the outliers are all failure data, i.e., we can observe to which risk each outlier belongs. The distribution of the $\mathrm{MsSE}_j$ of the $j$-th failed individual is approximately chi-square,
$$\sum_{i=1}^{g} \hat z_{ij}\,\frac{\left(t_j - \hat E_i(t_j)\right)^2}{\widehat{\mathrm{Var}}_i(t_j)} = \frac{\left(t_j - \hat E_i(t_j)\right)^2}{\widehat{\mathrm{Var}}_i(t_j)} \;\sim\; \chi^2(1), \qquad \text{where } \hat z_{ij} = 1. \tag{3.9}$$
Hence, we take this as a reference and determine the critical value as $\chi^2_{\alpha}(1)$, where $\chi^2_{\alpha}(\cdot)$ is the $\alpha$-level chi-square critical value with one degree of freedom. We suggest $\alpha = 0.01$; since we expect the percentage of outliers in a dataset should not be

too high for reliable inference, this is the acceptable maximum value. Under this criterion we must keep in mind that, even if there are no outliers in the dataset, there is still a probability of roughly one percent that normal data are mistaken for outliers. With the critical value $\chi^2_{0.01}(1)$, the set of outliers $O$ can be expressed as
$$O = \left\{t_j : \mathrm{MsSE}_j > \chi^2_{0.01}(1),\; j = 1, 2, \ldots, n\right\}, \tag{3.10}$$
where the critical value is $\chi^2_{0.01}(1) = 6.635$.

The above method shows how to detect the outliers. Having identified which observations may be outliers, we can make some adjustments for them in order to obtain better parameter estimates. One approach is to delete the outliers, estimate afresh using the reduced data, and obtain better parameter estimates. Alternatively, we can apply a weighted optimization method, i.e., we give each observation some degree of importance or non-relevance through a weight. These approaches are discussed in the next section.
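A small Python sketch of the classification rule (3.10); the function name is ours, and scipy is used only to reproduce the critical value $\chi^2_{0.01}(1) = 6.635$.

```python
# Flag the j-th failed individual as an outlier when MsSE_j exceeds the chi-square(1) cutoff.
import numpy as np
from scipy.stats import chi2

def detect_outliers(msse_j, alpha=0.01):
    """msse_j: (n,) mixture standard squared residuals from (3.8)."""
    critical = chi2.ppf(1.0 - alpha, df=1)     # = 6.635 for alpha = 0.01
    return np.where(msse_j > critical)[0]      # indices of the observations flagged as outliers
```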

IV. Parameter estimation

The expectation-maximization (EM) algorithm is a well-known iterative method for obtaining maximum likelihood estimates of parameters in statistical models that depend on unobserved latent variables. In our study, since we do not know from which type of risk a censored observation would eventually fail, we introduce a random vector $z_j$ of zero-one indicator variables for each censored observation as the unobserved latent variable, and the parameter estimates are obtained by the EM algorithm. Each EM iteration has two steps: the E-step forms the expectation of the log-likelihood evaluated using the current parameter estimates, and the M-step computes the parameters maximizing the expected log-likelihood formed in the E-step. The iteration alternates between the E-step and the M-step.

4.1 EM algorithm

Applying the EM algorithm to the aforementioned structure, on the (k+1)-th iteration the E-step calculates the Q-function, the conditional expectation of the complete-data log-likelihood given the current parameter estimate and the observed data,
$$Q\left(\Psi \mid \Psi^{(k)}\right) = \sum_{j=1}^{n}\left\{\sum_{i=1}^{g} I(D_j = i)\log\left[\pi_i f_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}^{(k)}\log\left[\pi_i S_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right]\right\},$$
where $\Psi^{(k)}$ is the estimate of $\Psi = \left(\lambda_1, \gamma_1, \beta_1^T, \ldots, \lambda_g, \gamma_g, \beta_g^T\right)^T$ after the $k$-th iteration and
$$z_{ij}^{(k)} = E\left[z_{ij} \mid y, \hat\Psi^{(k)}\right] = \frac{\pi_i^{(k)} S_i\left(t_j \mid x_j, \lambda_i^{(k)}, \gamma_i^{(k)}, \beta_i^{(k)}\right)}{\sum_{l=1}^{g} \pi_l^{(k)} S_l\left(t_j \mid x_j, \lambda_l^{(k)}, \gamma_l^{(k)}, \beta_l^{(k)}\right)}, \tag{4.1}$$
where $z_{ij}^{(k)}$ is the posterior probability that the $j$-th individual with censored survival

time $t_j$ would have failed due to risk $i$ $(i = 1, 2, \ldots, g)$. The M-step provides the updated estimate $\Psi^{(k+1)}$ that maximizes the Q-function with respect to $\Psi$. Let $\Psi$ be decomposed into $(\Psi_1, \Psi_2, \ldots, \Psi_g)$, where $\Psi_i = (\lambda_i, \gamma_i, \beta_i^T)$, $i = 1, 2, \ldots, g$. The Q-function above can be separated as
$$\begin{aligned} Q\left(\Psi \mid \Psi^{(k)}\right) &= \sum_{j=1}^{n}\sum_{i=1}^{g}\left[I(D_j = i)\log\pi_i + I(D_j = 0)\, z_{ij}^{(k)}\log\pi_i\right]\\ &\quad + \sum_{j=1}^{n}\left[I(D_j = 1)\log f_1\left(t_j \mid x_j, \Psi_1\right) + I(D_j = 0)\, z_{1j}^{(k)}\log S_1\left(t_j \mid x_j, \Psi_1\right)\right]\\ &\quad + \cdots + \sum_{j=1}^{n}\left[I(D_j = g)\log f_g\left(t_j \mid x_j, \Psi_g\right) + I(D_j = 0)\, z_{gj}^{(k)}\log S_g\left(t_j \mid x_j, \Psi_g\right)\right]\\ &\stackrel{\text{def}}{=} Q_0 + Q_1 + \cdots + Q_g, \end{aligned}$$
with $Q_0$ depending on the mixing proportions $\pi_i$ and each $Q_i$ depending on $\Psi_i$ $(i = 1, 2, \ldots, g)$. It follows that the estimates of $\pi_i$ and $\Psi_i$ can be updated separately by maximizing $Q_0$ and $Q_1, \ldots, Q_g$, respectively.

In order to compute $\pi_i^{(k+1)}$ $(i = 1, 2, \ldots, g)$, we introduce a new variable $\alpha$, called a Lagrange multiplier, and the Lagrange function
$$Q_0^L = \sum_{j=1}^{n}\sum_{i=1}^{g}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]\log\pi_i + \alpha\left(\sum_{i=1}^{g}\pi_i - 1\right).$$
Differentiating $Q_0^L$ with respect to $\pi_i$ $(i = 1, 2, \ldots, g)$ gives
$$\frac{\partial Q_0^L}{\partial \pi_i} = \frac{\sum_{j=1}^{n}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]}{\pi_i} + \alpha = 0 \;\Longrightarrow\; \pi_i = -\frac{\sum_{j=1}^{n}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]}{\alpha}.$$
Since the $\pi_i$ $(i = 1, 2, \ldots, g)$ sum to 1,
$$\sum_{i=1}^{g}\pi_i = -\frac{\sum_{i=1}^{g}\sum_{j=1}^{n}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]}{\alpha} = 1 \;\Longrightarrow\; \alpha = -n.$$

It follows that $\pi_i^{(k+1)}$ can be written as
$$\pi_i^{(k+1)} = \frac{\sum_{j=1}^{n}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]}{n}. \tag{4.2}$$
Next, we need to maximize $Q_i$ with respect to $\Psi_i$ $(i = 1, 2, \ldots, g)$. First, differentiating $Q_i$ with respect to the scale parameter $\lambda_i$, it follows that $\lambda_i^{(k+1)}$ satisfies
$$\sum_{j=1}^{n}\left\{\frac{I(D_j = i)}{\lambda_i} - \exp\left(x_j^T\beta_i^{(k)}\right) t_j^{\gamma_i^{(k)}}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]\right\} = 0,$$
and we can write $\lambda_i^{(k+1)}$ in the closed form
$$\lambda_i^{(k+1)} = \frac{\sum_{j=1}^{n} I(D_j = i)}{\sum_{j=1}^{n} \exp\left(x_j^T\beta_i^{(k)}\right) t_j^{\gamma_i^{(k)}}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]}. \tag{4.3}$$
Second, for the shape parameter $\gamma_i$ $(i = 1, 2, \ldots, g)$, differentiation of $Q_i$ does not lead to a closed-form solution, so we use Newton's method (Method of Fluxions, completed in 1671, published in 1736 after Newton's death). Let $g(\gamma_i)$ be the gradient, the first derivative of $Q_i$ with respect to $\gamma_i$,
$$g(\gamma_i) = \frac{\partial Q_i}{\partial \gamma_i} = \sum_{j=1}^{n}\left\{I(D_j = i)\left[\frac{1}{\gamma_i} + \ln t_j\right] - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\ln(t_j)\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\}.$$
Let $H(\gamma_i)$ be the Hessian of $Q_i$, the second derivative of $Q_i$ with respect to $\gamma_i$, where
$$\frac{\partial^2 Q_i}{\partial \gamma_i^2} = \sum_{j=1}^{n}\left\{-\frac{I(D_j = i)}{\gamma_i^2} - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\left[\ln(t_j)\right]^2\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\}, \qquad \frac{\partial^2 Q_i}{\partial \gamma_i \partial \gamma_k} = 0 \;\text{ for } k \ne i.$$
We thus obtain the iterative scheme
$$\gamma_i^{(k+1, n+1)} = \gamma_i^{(k, n)} - \left[H\left(\gamma_i^{(k, n)}\right)\right]^{-1} g\left(\gamma_i^{(k, n)}\right), \tag{4.4}$$

where $\gamma_i^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\gamma_i^{(k+1)}$, and the sequence $\gamma_i^{(k,1)}, \gamma_i^{(k,2)}, \ldots$ converges to a point $\gamma_i^{(k+1)}$ satisfying $g\left(\gamma_i^{(k+1)}\right) = 0$.

Third, for the vector of regression coefficients $\beta_i$ $(i = 1, 2, \ldots, g)$, differentiation of $Q_i$ again does not give a closed-form solution, so we use Newton's method once more. Let $g(\beta_i)$ be the gradient, the first derivative of $Q_i$ with respect to $\beta_i$,
$$g(\beta_i) = \frac{\partial Q_i}{\partial \beta_i} = \sum_{j=1}^{n}\left\{I(D_j = i) - \lambda_i\exp\left(x_j^T\beta_i\right) t_j^{\gamma_i}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\} x_j.$$
Let $H(\beta_i)$ be the Hessian matrix of $Q_i$, the second derivative of $Q_i$ with respect to $\beta_i$, where
$$\frac{\partial^2 Q_i}{\partial \beta_i \partial \beta_i^T} = -\sum_{j=1}^{n} \lambda_i\exp\left(x_j^T\beta_i\right) t_j^{\gamma_i}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right] x_j x_j^T, \qquad \frac{\partial^2 Q_i}{\partial \beta_i \partial \beta_k} = 0 \;\text{ for } k \ne i.$$
Likewise, we obtain the iterative scheme
$$\beta_i^{(k+1, n+1)} = \beta_i^{(k, n)} - \left[H\left(\beta_i^{(k, n)}\right)\right]^{-1} g\left(\beta_i^{(k, n)}\right), \tag{4.5}$$
where $\beta_i^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\beta_i^{(k+1)}$, and the sequence $\beta_i^{(k,1)}, \beta_i^{(k,2)}, \ldots$ converges to a point $\beta_i^{(k+1)}$ satisfying $g\left(\beta_i^{(k+1)}\right) = 0$.

The EM algorithm is summarized as follows.

Algorithm 1. EM.
Based on the initial values $\left(\pi_i^{(0)}, \lambda_i^{(0)}, \gamma_i^{(0)}, \beta_i^{(0)}\right)$, $i = 1, 2, \ldots, g$, the EM algorithm iterates between the following E-step and M-step. Set the initial counter $k = 0$.
Step 1: E-step. Calculate the posterior probabilities $z_{ij}^{(k)}$ using (4.1);
Step 2: M-step. Update the parameters $\pi_i^{(k+1)}$ with $z_{ij}^{(k)}$ by (4.2);
Step 3: M-step. Update the parameters $\lambda_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\gamma_i^{(k)}$, and $\beta_i^{(k)}$ by (4.3);

Step 4: M-step. Update the parameters $\gamma_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k)}$, and $\beta_i^{(k)}$ by (4.4);
Step 5: M-step. Update the parameters $\beta_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k+1)}$, and $\beta_i^{(k)}$ by (4.5);
Step 6: Convergence.
IF $\left\|\pi_i^{(k+1)} - \pi_i^{(k)}\right\|_2 + \left\|\lambda_i^{(k+1)} - \lambda_i^{(k)}\right\|_2 + \left\|\gamma_i^{(k+1)} - \gamma_i^{(k)}\right\|_2 + \left\|\beta_i^{(k+1)} - \beta_i^{(k)}\right\|_2 \le \varepsilon$, THEN stop; ELSE let $k = k + 1$ and RETURN to Step 1.
Here $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, n$, $\|\cdot\|_2$ is the $l_2$-norm, and $\varepsilon$ is the radius of convergence, a nonnegative real number.
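The following Python sketch illustrates the closed-form part of one iteration of Algorithm 1: the E-step (4.1) and the M-step updates (4.2) and (4.3). The Newton updates (4.4)-(4.5) for $\gamma_i$ and $\beta_i$ are omitted, and all function and variable names are ours; this is only a sketch of the formulas, not the thesis implementation.

```python
# Closed-form E-step and M-step pieces of Algorithm 1; D = 0 marks censoring, D = 1..g the risk type.
import numpy as np

def survival_i(t, x, beta_i, lam_i, gam_i):
    """S_i(t|x) for component i of the Weibull mixture hazard model."""
    return np.exp(-lam_i * t ** gam_i * np.exp(x @ beta_i))

def e_step(t, x, pi, params):
    """Posterior probabilities z_ij in (4.1); params[i] = (beta_i, lam_i, gam_i)."""
    n, g = len(t), len(pi)
    z = np.zeros((n, g))
    for i in range(g):
        z[:, i] = pi[i] * survival_i(t, x, *params[i])
    z /= z.sum(axis=1, keepdims=True)          # normalize over the g risk types
    return z                                    # only the rows with D_j = 0 are used below

def m_step_pi_lambda(t, x, D, z, params):
    """Closed-form updates (4.2) for pi_i and (4.3) for lambda_i."""
    n, g = z.shape
    new_pi, new_lam = np.zeros(g), np.zeros(g)
    for i in range(g):
        member = (D == i + 1).astype(float) + (D == 0) * z[:, i]   # I(D_j=i) + I(D_j=0) z_ij
        new_pi[i] = member.sum() / n                               # eq. (4.2)
        beta_i, lam_i, gam_i = params[i]
        denom = np.sum(np.exp(x @ beta_i) * t ** gam_i * member)
        new_lam[i] = np.sum(D == i + 1) / denom                    # eq. (4.3)
    return new_pi, new_lam
```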

(29) zij. (k ).  E  zij y,ˆ  k  ,   k     .  i  Si (t j   j ( k ) x j , i ( k ) , i ( k ) ,  i ( k ) ,  j ( k ) ) g.  l 1.  Sl (t j   j. (k ) l. (k ). , (4.6). x j , l , l ,  l ,  j ) (k ). (k ). (k ). (k ). where zij ( k ) is the posterior probability that the j-th individual with censored survival time t j would have failed due to risk i  i  1, 2,..., g  . The M-step provides the updated estimate   k 1 and   k  that maximizes the. . k  k  k  Qp -function Q p    , . . . with respect to  ,  . Let  be decomposed. . . . into    1 , 2 ,..., g , where  i  i , i ,  i T ; i  1, 2,..., g . In order to compute ˆi ( k 1)  i  1, 2,..., g  , we introduce a new variable  called a Lagrange multiplier and study the Lagrange function Qp L defined by n g Qp L    I ( D j  i ) ln  i + ln i + ln i +(i  1) ln (t j   j )+x j T  i  i (t j   j ) i exp( x j T  i )  j 1  i 1 n n 2 2 1 +I ( D j  0) zij ( k ) ln  i  i (t j   j ) i exp( x j T  i )        j  1     j  1 . . . j 1. j 1. On differentiation of Qp L with respect to  i  i  1, 2,..., g  as below, n. Qp  i. L. .  I (D j 1. j. n.  i )  I ( D j  0) zij ( k )  = 0. i. .  I (D j 1. j.  i )  I ( D j  0) zij ( k ). .  i .. Note that we know the sum of  i  i  1, 2,..., g  equal to 1, hence, g. g.  i 1. i. . . n.    I (D i 1. . j 1. j.   i )  I ( D j  0) zij ( k )   =1.   n.. . It follows that  i ( k 1) satisfies the equation n.  i ( k 1) .  I (D j 1. j.  i )  I ( D j  0) zij ( k ) n. .. (4.7). Next, we need to maximize Qp with respect to  i  i  1, 2,..., g  , and.  j  j  1, 2,..., n  . First, on differentiation of Qp with respect to the scale parameter. i  i  1, 2,..., g  , it follows that i ( k 1) satisfies the equation 21.

$$\sum_{j=1}^{n}\left\{\frac{I(D_j = i)}{\lambda_i} - \exp\left(x_j^T\beta_i^{(k)}\right)\left(t_j\omega_j^{(k)}\right)^{\gamma_i^{(k)}}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]\right\} = 0,$$
and we can write $\lambda_i^{(k+1)}$ in the closed form
$$\lambda_i^{(k+1)} = \frac{\sum_{j=1}^{n} I(D_j = i)}{\sum_{j=1}^{n} \exp\left(x_j^T\beta_i^{(k)}\right)\left(t_j\omega_j^{(k)}\right)^{\gamma_i^{(k)}}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}^{(k)}\right]}. \tag{4.8}$$
Second, for the shape parameter $\gamma_i$ $(i = 1, 2, \ldots, g)$, since no closed-form solution exists, we use Newton's method as before. Let $g(\gamma_i)$ be the gradient, the first derivative of $Q_p$ with respect to $\gamma_i$,
$$g(\gamma_i) = \frac{\partial Q_p}{\partial \gamma_i} = \sum_{j=1}^{n}\left\{I(D_j = i)\left[\frac{1}{\gamma_i} + \ln t_j + \ln\omega_j\right] - \lambda_i\left(t_j\omega_j\right)^{\gamma_i}\exp\left(x_j^T\beta_i\right)\ln\left(t_j\omega_j\right)\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\}.$$
Let $H(\gamma_i)$ be the Hessian of $Q_p$ with respect to $\gamma_i$, where
$$\frac{\partial^2 Q_p}{\partial \gamma_i^2} = \sum_{j=1}^{n}\left\{-\frac{I(D_j = i)}{\gamma_i^2} - \lambda_i\left(t_j\omega_j\right)^{\gamma_i}\exp\left(x_j^T\beta_i\right)\left[\ln\left(t_j\omega_j\right)\right]^2\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\}, \qquad \frac{\partial^2 Q_p}{\partial \gamma_i \partial \gamma_k} = 0 \;\text{ for } k \ne i.$$
Likewise, we obtain the iterative scheme
$$\gamma_i^{(k+1, n+1)} = \gamma_i^{(k, n)} - \left[H\left(\gamma_i^{(k, n)}\right)\right]^{-1} g\left(\gamma_i^{(k, n)}\right), \tag{4.9}$$
where $\gamma_i^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\gamma_i^{(k+1)}$, and the sequence $\gamma_i^{(k,1)}, \gamma_i^{(k,2)}, \ldots$ converges to a point $\gamma_i^{(k+1)}$ satisfying $g\left(\gamma_i^{(k+1)}\right) = 0$.

Third, for the vector of regression coefficients $\beta_i$ $(i = 1, 2, \ldots, g)$, since a closed-form solution is again unavailable, we use Newton's method once more. Let $g(\beta_i)$ be the gradient, the first derivative of $Q_p$ with respect to $\beta_i$,
$$g(\beta_i) = \frac{\partial Q_p}{\partial \beta_i} = \sum_{j=1}^{n}\left\{I(D_j = i) - \lambda_i\exp\left(x_j^T\beta_i\right)\left(t_j\omega_j\right)^{\gamma_i}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\} x_j.$$
Let $H(\beta_i)$ be the Hessian matrix of $Q_p$ with respect to $\beta_i$, where
$$\frac{\partial^2 Q_p}{\partial \beta_i \partial \beta_i^T} = -\sum_{j=1}^{n} \lambda_i\exp\left(x_j^T\beta_i\right)\left(t_j\omega_j\right)^{\gamma_i}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right] x_j x_j^T, \qquad \frac{\partial^2 Q_p}{\partial \beta_i \partial \beta_k} = 0 \;\text{ for } k \ne i.$$
Likewise, we obtain the iterative scheme
$$\beta_i^{(k+1, n+1)} = \beta_i^{(k, n)} - \left[H\left(\beta_i^{(k, n)}\right)\right]^{-1} g\left(\beta_i^{(k, n)}\right), \tag{4.10}$$
where $\beta_i^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\beta_i^{(k+1)}$, and the sequence $\beta_i^{(k,1)}, \beta_i^{(k,2)}, \ldots$ converges to a point $\beta_i^{(k+1)}$ satisfying $g\left(\beta_i^{(k+1)}\right) = 0$.

Next, for the shrinkage parameter $\omega_j$ $(j = 1, 2, \ldots, n)$, since it cannot be expressed in closed form either, we use Newton's method here as well. Let $g(\omega_j)$ be the gradient, the first derivative of $Q_p$ with respect to $\omega_j$,
$$g(\omega_j) = \frac{\partial Q_p}{\partial \omega_j} = \sum_{i=1}^{g}\left\{\frac{I(D_j = i)(\gamma_i - 1)}{\omega_j} - \gamma_i\lambda_i t_j^{\gamma_i}\omega_j^{\gamma_i - 1}\exp\left(x_j^T\beta_i\right)\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\} - 2\xi\left(\omega_j - 1\right).$$
Let $H(\omega_j)$ be the Hessian of $Q_p$ with respect to $\omega_j$, where
$$\frac{\partial^2 Q_p}{\partial \omega_j^2} = \sum_{i=1}^{g}\left\{-\frac{I(D_j = i)(\gamma_i - 1)}{\omega_j^2} - \gamma_i(\gamma_i - 1)\lambda_i t_j^{\gamma_i}\omega_j^{\gamma_i - 2}\exp\left(x_j^T\beta_i\right)\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\} - 2\xi, \qquad \frac{\partial^2 Q_p}{\partial \omega_j \partial \omega_k} = 0 \;\text{ for } k \ne j.$$
Likewise, we obtain the iterative scheme

$$\omega_j^{(k+1, n+1)} = \omega_j^{(k, n)} - \left[H\left(\omega_j^{(k, n)}\right)\right]^{-1} g\left(\omega_j^{(k, n)}\right), \tag{4.11}$$
where $\omega_j^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\omega_j^{(k+1)}$, and the sequence $\omega_j^{(k,1)}, \omega_j^{(k,2)}, \ldots$ converges to a point $\omega_j^{(k+1)}$ satisfying $g\left(\omega_j^{(k+1)}\right) = 0$.

The EM algorithm with the penalized log-likelihood is summarized as follows.

Algorithm 2. EM with penalized log-likelihood.
Based on the initial values $\left(\pi_i^{(0)}, \lambda_i^{(0)}, \gamma_i^{(0)}, \beta_i^{(0)}, \omega_j^{(0)}, \xi^{(0)}\right)$, $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, n$, the EM algorithm iterates between the following E-step and M-step. Set the initial counter $k = 0$.
Step 1: E-step. Calculate the posterior probabilities $z_{ij}^{(k)}$ using (4.6);
Step 2: M-step. Update the parameters $\pi_i^{(k+1)}$ with $z_{ij}^{(k)}$ by (4.7);
Step 3: M-step. Update the parameters $\lambda_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\gamma_i^{(k)}$, and $\beta_i^{(k)}$ by (4.8);
Step 4: M-step. Update the parameters $\gamma_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k)}$, and $\beta_i^{(k)}$ by (4.9);
Step 5: M-step. Update the parameters $\beta_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k+1)}$, and $\beta_i^{(k)}$ by (4.10);
Step 6: M-step. Update the parameters $\omega_j^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k+1)}$, $\beta_i^{(k+1)}$, and $\xi^{(k)}$ by (4.11);
Step 7: M-step. Calculate the parameter $\xi^{(k+1)}$ with $\omega_j^{(k+1)}$ by (3.3);
Step 8: Convergence.
IF $\left\|\pi_i^{(k+1)} - \pi_i^{(k)}\right\|_2 + \left\|\lambda_i^{(k+1)} - \lambda_i^{(k)}\right\|_2 + \left\|\gamma_i^{(k+1)} - \gamma_i^{(k)}\right\|_2 + \left\|\beta_i^{(k+1)} - \beta_i^{(k)}\right\|_2 + \left\|\omega_j^{(k+1)} - \omega_j^{(k)}\right\|_2 + \left\|\xi^{(k+1)} - \xi^{(k)}\right\|_2 \le \varepsilon$,
THEN stop; ELSE let $k = k + 1$ and RETURN to Step 1.

Here $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, n$, $\|\cdot\|_2$ is the $l_2$-norm, and $\varepsilon$ is the radius of convergence, a nonnegative real number.

4.3 Data-weighted method

As mentioned in Section 3.2, outliers are obviously different from the other observations; however, they may be abnormal phenomena or they may contain valuable information. If the data analysis finds outliers, deleting them directly is not necessarily the best method, so we may instead apply a weighted optimization method: we give each observation some degree of importance or relevance through a weight. Large weights are assigned to highly relevant individuals and smaller weights to less relevant data. Each individual is given a weight $w_j$ $(j = 1, 2, \ldots, n)$; only the outliers detected by the penalty method have weight equal to the shrinkage parameter $\omega_j$, while a normal individual has weight equal to 1. The reason we choose $\omega_j$ as the weight for outliers is that $\omega_j$ is close to zero when the $j$-th individual is classified as an outlier. The weighted complete-data log-likelihood is
$$l_w^c(\Psi, w) = \sum_{j=1}^{n} w_j\left\{\sum_{i=1}^{g} I(D_j = i)\ln\left[\pi_i f_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}\ln\left[\pi_i S_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right]\right\}, \tag{4.12}$$
where $w = (w_1, w_2, \ldots, w_n)$ and
$$w_j = \begin{cases} 1, & \text{if } t_j \text{ is a normal observation},\\ \omega_j, & \text{if } t_j \text{ is an outlier}, \end{cases} \qquad j = 1, 2, \ldots, n. \tag{4.13}$$
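A one-line illustration of the weight assignment (4.13) in Python, assuming the indices of the detected outliers and the estimated shrinkage parameters are available from the previous steps; names are ours.

```python
# Weights (4.13): 1 for normal observations, omega_j for detected outliers.
import numpy as np

def make_weights(n, outlier_idx, omega_hat):
    """outlier_idx: indices from the detection step; omega_hat: (n,) shrinkage estimates."""
    w = np.ones(n)
    w[outlier_idx] = omega_hat[outlier_idx]
    return w
```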

From the EM structure, on the (k+1)-th iteration the E-step calculates the $Q_w$-function, the conditional expectation of the weighted complete-data log-likelihood (4.12) given the current parameter estimates and the observed data,
$$Q_w\left(\Psi \mid \Psi^{(k)}, w\right) = \sum_{j=1}^{n} w_j\left\{\sum_{i=1}^{g} I(D_j = i)\log\left[\pi_i f_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}^{(k)}\log\left[\pi_i S_i\left(t_j \mid x_j, \lambda_i, \gamma_i, \beta_i\right)\right]\right\},$$
where $\Psi^{(k)}$ is the estimate of $\Psi = \left(\lambda_1, \gamma_1, \beta_1^T, \ldots, \lambda_g, \gamma_g, \beta_g^T\right)^T$ after the $k$-th iteration, $w = (w_1, w_2, \ldots, w_n)$ is the weight vector given by (4.13) or (4.14), and
$$z_{ij}^{(k)} = E\left[z_{ij} \mid y, \hat\Psi^{(k)}\right] = \frac{\pi_i^{(k)} S_i\left(t_j \mid x_j, \lambda_i^{(k)}, \gamma_i^{(k)}, \beta_i^{(k)}\right)}{\sum_{l=1}^{g} \pi_l^{(k)} S_l\left(t_j \mid x_j, \lambda_l^{(k)}, \gamma_l^{(k)}, \beta_l^{(k)}\right)}, \tag{4.15}$$
where $z_{ij}^{(k)}$ is the posterior probability that the $j$-th individual with censored survival time $t_j$ would have failed due to cause $i$ $(i = 1, 2, \ldots, g)$. The M-step provides the updated estimate $\Psi^{(k+1)}$ that maximizes the $Q_w$-function with respect to $\Psi$. Let $\Psi$ be decomposed into $(\Psi_1, \Psi_2, \ldots, \Psi_g)$, where $\Psi_i = (\lambda_i, \gamma_i, \beta_i^T)$, $i = 1, 2, \ldots, g$. As before, the $Q_w$-function can be separated as
$$\begin{aligned} Q_w\left(\Psi \mid \Psi^{(k)}, w\right) &= \sum_{j=1}^{n} w_j\sum_{i=1}^{g}\left[I(D_j = i)\log\pi_i + I(D_j = 0)\, z_{ij}^{(k)}\log\pi_i\right]\\ &\quad + \sum_{j=1}^{n} w_j\left[I(D_j = 1)\log f_1\left(t_j \mid x_j, \Psi_1\right) + I(D_j = 0)\, z_{1j}^{(k)}\log S_1\left(t_j \mid x_j, \Psi_1\right)\right]\\ &\quad + \cdots + \sum_{j=1}^{n} w_j\left[I(D_j = g)\log f_g\left(t_j \mid x_j, \Psi_g\right) + I(D_j = 0)\, z_{gj}^{(k)}\log S_g\left(t_j \mid x_j, \Psi_g\right)\right]\\ &\stackrel{\text{def}}{=} Q_{w0} + Q_{w1} + \cdots + Q_{wg}, \end{aligned}$$
with $Q_{w0}$ depending on the $\pi_i$ and each $Q_{wi}$ on $\Psi_i$ $(i = 1, 2, \ldots, g)$, so the estimates of $\pi_i$ and $\Psi_i$ can be updated separately by maximizing $Q_{w0}$ and $Q_{w1}, \ldots, Q_{wg}$, respectively. In order to compute $\hat\pi_i^{(k+1)}$ $(i = 1, 2, \ldots, g)$, we introduce a Lagrange multiplier $\alpha$ and study the Lagrange function $Q_{w0}^L$ defined by

(35) n g   g  Qw0 L   w j    I  D j  i  log  i +I  D j  0  zij  k  log  i      i  1 . j 1  i 1   i 1 . On differentiation of Qw0 L with respect to  i  i  1, 2,..., g  as below, n. Qw0   i L. w j 1. j.   I ( D j  i )  I ( D j  0) zij ( k ) . i. n.  = 0 . w j 1. j.   I ( D j  i )  I ( D j  0) zij ( k ) . .  i .. note that we know the sum of  i  i  1, 2,..., g  equal to 1, hence,  n   I ( D j  i )  I ( D j  0) zij ( k )   w     j g i 1 j 1  =1 i   g.   n.. . i 1. It follows that  i ( k 1) satisfies the equation n.  i ( k 1) . w j 1. j.   I ( D j  i)  I ( D j  0) zij ( k )  n. .. (4.16). Next, we need to maximize Qwi with respect to  i  i  1, 2,..., g  . First, on differentiation of Qwi with respect to the scale parameter i  i  1, 2,..., g  , it follows that i ( k 1) satisfies the equation k  k  I ( D j  i )  i  I ( D j  i ) exp( x j T i  k  )t j i +I ( D j  0) zij exp( x j T i  k  )t j i     0,  wj . n. i. j 1. and we can write i ( k 1) as the following closed-form expression. I ( D j  i). n. i ( k 1)   w j  j 1. exp( x j T i  k  )t j.  k  i.  I ( D j  i)+I ( D j  0) zij ( k ) . .. (4.17). Second, on differentiation of Qwi with respect to the shape parameter i.  i  1, 2,..., g  ,. since it couldn’t be expressed as a closed-form solution, we use. Newton’s method. Let g ( i ) defined as the gradient function which represent first differentiation of Qwi with respect to i , and it can be written as. 27.

Second, for the shape parameter $\gamma_i$ $(i = 1, 2, \ldots, g)$, since no closed-form solution exists, we use Newton's method. Let $g(\gamma_i)$ be the gradient, the first derivative of $Q_{wi}$ with respect to $\gamma_i$,
$$g(\gamma_i) = \frac{\partial Q_{wi}}{\partial \gamma_i} = \sum_{j=1}^{n} w_j\left\{I(D_j = i)\left[\frac{1}{\gamma_i} + \ln t_j\right] - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\ln(t_j)\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\}.$$
Let $H(\gamma_i)$ be the Hessian of $Q_{wi}$ with respect to $\gamma_i$, where
$$\frac{\partial^2 Q_{wi}}{\partial \gamma_i^2} = \sum_{j=1}^{n} w_j\left\{-\frac{I(D_j = i)}{\gamma_i^2} - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\left[\ln(t_j)\right]^2\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\}, \qquad \frac{\partial^2 Q_{wi}}{\partial \gamma_i \partial \gamma_k} = 0 \;\text{ for } k \ne i.$$
Likewise, we obtain the iterative scheme
$$\gamma_i^{(k+1, n+1)} = \gamma_i^{(k, n)} - \left[H\left(\gamma_i^{(k, n)}\right)\right]^{-1} g\left(\gamma_i^{(k, n)}\right), \tag{4.18}$$
where $\gamma_i^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\gamma_i^{(k+1)}$, and the sequence $\gamma_i^{(k,1)}, \gamma_i^{(k,2)}, \ldots$ converges to a point $\gamma_i^{(k+1)}$ satisfying $g\left(\gamma_i^{(k+1)}\right) = 0$.

Third, for the vector of regression coefficients $\beta_i$ $(i = 1, 2, \ldots, g)$, since a closed-form solution is again unavailable, we use Newton's method once more. Let $g(\beta_i)$ be the gradient, the first derivative of $Q_{wi}$ with respect to $\beta_i$,
$$g(\beta_i) = \frac{\partial Q_{wi}}{\partial \beta_i} = \sum_{j=1}^{n} w_j\left\{I(D_j = i) - \lambda_i\exp\left(x_j^T\beta_i\right) t_j^{\gamma_i}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right]\right\} x_j.$$
Let $H(\beta_i)$ be the Hessian matrix of $Q_{wi}$ with respect to $\beta_i$, where
$$\frac{\partial^2 Q_{wi}}{\partial \beta_i \partial \beta_i^T} = -\sum_{j=1}^{n} w_j\,\lambda_i\exp\left(x_j^T\beta_i\right) t_j^{\gamma_i}\left[I(D_j = i) + I(D_j = 0)\, z_{ij}\right] x_j x_j^T, \qquad \frac{\partial^2 Q_{wi}}{\partial \beta_i \partial \beta_k} = 0 \;\text{ for } k \ne i.$$
Likewise, we obtain the iterative scheme
$$\beta_i^{(k+1, n+1)} = \beta_i^{(k, n)} - \left[H\left(\beta_i^{(k, n)}\right)\right]^{-1} g\left(\beta_i^{(k, n)}\right), \tag{4.19}$$
where $\beta_i^{(k+1, n+1)}$ is the $(n+1)$-th Newton iterate for $\beta_i^{(k+1)}$, and the

sequence $\beta_i^{(k,1)}, \beta_i^{(k,2)}, \ldots$ converges to a point $\beta_i^{(k+1)}$ satisfying $g\left(\beta_i^{(k+1)}\right) = 0$.

The EM algorithm for the weighted log-likelihood is summarized as follows.

Algorithm 3. EM for weighted log-likelihood.
Based on the initial values $\left(\pi_i^{(0)}, \lambda_i^{(0)}, \gamma_i^{(0)}, \beta_i^{(0)}\right)$, $i = 1, 2, \ldots, g$, the EM algorithm iterates between the following E-step and M-step. Set the initial counter $k = 0$.
Step 1: E-step. Calculate the posterior probabilities $z_{ij}^{(k)}$ using (4.15);
Step 2: M-step. Update the parameters $\pi_i^{(k+1)}$ with $z_{ij}^{(k)}$ by (4.16);
Step 3: M-step. Update the parameters $\lambda_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\gamma_i^{(k)}$, and $\beta_i^{(k)}$ by (4.17);
Step 4: M-step. Update the parameters $\gamma_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k)}$, and $\beta_i^{(k)}$ by (4.18);
Step 5: M-step. Update the parameters $\beta_i^{(k+1)}$ with $z_{ij}^{(k)}$, $\lambda_i^{(k+1)}$, $\gamma_i^{(k+1)}$, and $\beta_i^{(k)}$ by (4.19);
Step 6: Convergence.
IF $\left\|\pi_i^{(k+1)} - \pi_i^{(k)}\right\|_2 + \left\|\lambda_i^{(k+1)} - \lambda_i^{(k)}\right\|_2 + \left\|\gamma_i^{(k+1)} - \gamma_i^{(k)}\right\|_2 + \left\|\beta_i^{(k+1)} - \beta_i^{(k)}\right\|_2 \le \varepsilon$, THEN stop; ELSE let $k = k + 1$ and RETURN to Step 1.
Here $i = 1, 2, \ldots, g$, $j = 1, 2, \ldots, n$, $\|\cdot\|_2$ is the $l_2$-norm, and $\varepsilon$ is the radius of convergence, a nonnegative real number.

4.4 Outlier-deleted method

If the outliers have an extreme effect on the parameter estimates, we may delete the outliers in order to obtain better estimates. From (3.8), we use the $\mathrm{MsSE}_j$ of the $j$-th failed individual as the criterion and take the critical value to be

$\chi^2_{0.01}(1) = 6.635$. If the $\mathrm{MsSE}_j$ of the $j$-th failed individual is larger than $\chi^2_{0.01}(1)$, we treat it as an outlier; otherwise, it is a normal observation. From this criterion we obtain the set of outliers $O$ in (3.10). The next step is to remove the outliers $O$ from the dataset. We define the information of the outliers as
$$y_{\mathrm{out}} = \left\{\left(t_j, x_j^T, D_j\right) : t_j \in O,\; j = 1, 2, \ldots, n\right\}.$$
After eliminating the outliers from the dataset $y$ in (2.7), the remaining data $y^*$ can be expressed as $y^* = y - y_{\mathrm{out}}$. On the basis of the normal data $y^*$, the normal-data log-likelihood function for $\Psi$ under the mixture model is given by
$$l_n^c(\Psi) = \sum_{j:\, t_j \in y^*}\left\{\sum_{i=1}^{g} I(D_j = i)\left[\ln\pi_i + \ln\lambda_i + \ln\gamma_i + (\gamma_i - 1)\ln t_j + x_j^T\beta_i - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\right] + \sum_{i=1}^{g} I(D_j = 0)\, z_{ij}\left[\ln\pi_i - \lambda_i t_j^{\gamma_i}\exp\left(x_j^T\beta_i\right)\right]\right\}. \tag{4.20}$$
We use EM Algorithm 1 to obtain the new estimates for this model. Theoretically, the estimates from $y^*$ will have smaller residuals than those from the original dataset. Hence, we compare the two adjustment methods above and discuss which is best. For the comparison we use the following four indices. One is the average absolute relative bias ($\overline{ARB}$),
$$\overline{ARB} = \frac{1}{P}\sum_{k=1}^{P}\left|\frac{\mathrm{bias}\left(\hat\theta_k\right)}{E\left(\hat\theta_k\right)}\right|,$$
where $E(\hat\theta_k)$ is the mean of the $k$-th estimated parameter, $\mathrm{bias}(\hat\theta_k)$ is the bias of that estimate, and $P$ is the number of parameters estimated by the model. Another is the average mean squared error ($\overline{MSE}$),
$$\overline{MSE} = \frac{1}{P}\sum_{k=1}^{P} \mathrm{MSE}\left(\hat\theta_k\right),$$
where $\mathrm{MSE}(\hat\theta_k)$ is the mean squared error of the $k$-th estimated parameter, and $P$ is the number of parameters estimated by the model.

The other is the average MsSSE ($\overline{MsSSE}$) from (3.6),

$$\overline{MsSSE}=\frac{1}{N}\sum_{l=1}^{N}\left[\sum_{j=1}^{n}\sum_{i=1}^{g}\hat z_{ij}\,\frac{\big(t_j-\hat E_i(t_j)\big)^2}{\widehat{Var}_i(t_j)}\right],$$

where $N$ is the number of replicates of the simulation. The best model is selected based on the smaller index. The results will be shown in the next section. The last one is the average Bayesian information criterion ($\overline{BIC}$) of each model,

$$BIC=-2\,l^c(\hat\psi)+P\log(n),$$

where $l^c(\hat\psi)$ is the log-likelihood of each method; that is, the $l^c(\hat\psi)$ of the weighted method is $l_w^c(\psi,w)$ from (4.12), and the $l^c(\hat\psi)$ of the deleted method is $l_n^c(\psi)$ from (4.20). The average Bayesian information criterion is then defined as

$$\overline{BIC}=\frac{1}{N}\sum_{l=1}^{N}BIC_l,$$

where $N$ is the number of replicates of the simulation.

To sum up, we first introduce the penalized log-likelihood and determine whether there are outliers by comparing its MsSSE from (3.6) with that of the regular log-likelihood. Secondly, we take $MsSE_j$ $(j=1,2,\ldots,n)$ from (3.8) as the index of each individual and compare it with $\chi^2_{0.01}(1)$ to find the outliers. In addition, we use the data-weighted method and the outlier-deleted method to adjust the parameter estimation. Last but not least, we compare the two adjustment methods to conclude which one is better. The process of the outlier detection analysis is shown in the following chart.
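As a small illustration of these last two indices (under the reconstruction of (3.6) used above, and with $\hat E_i(t_j)$, $\widehat{Var}_i(t_j)$, and $\hat z_{ij}$ assumed to be available from the fitted model), the per-replicate MsSSE and BIC could be computed as:

```python
import numpy as np

def msse(t, e_hat, var_hat, z_hat):
    """Mixture standardized sum of squared errors for one replicate.

    t       : (n,) observed times
    e_hat   : (n, g) fitted expectations E_i(t_j) for each risk i
    var_hat : (n, g) fitted variances Var_i(t_j)
    z_hat   : (n, g) estimated posterior probabilities z_ij
    """
    return np.sum(z_hat * (t[:, None] - e_hat) ** 2 / var_hat)

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: BIC = -2 l + P log n."""
    return -2.0 * loglik + n_params * np.log(n_obs)
```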

Data
  → Fit the regular log-likelihood function $l^c(\psi)$ and the penalty log-likelihood function $pl^c(\psi)$ by the EM algorithm.
  → Compare the MsSSE and BIC of $l^c(\hat\psi)$ with those of $pl^c(\hat\psi)$ to determine whether there are outliers.
  → If there are outliers in the dataset, use $MsSE_j$ to detect which individuals are outliers.
  → Data-weighted method: weighted log-likelihood $l_w^c(\psi,w)$ → EM algorithm → new estimation.
  → Outlier-deleted method: normal-data log-likelihood $l_n^c(\psi)$ → EM algorithm → new estimation.
  → Use ARB, MSE, MsSSE, and BIC to compare which method is the best.

Chart 1. The process of the outlier detection analysis.

V. Simulation

In this section we conduct a simulation study to examine how these methods work for different components of Weibull mixture hazard models with outliers. We present the results of the simulation experiments for the proposed penalty method with shrinkage parameters and compare the data-weighted method with the outlier-deleted method.

Setups

We first set the sample size n to 500, including 12 outliers. The covariate vector $x=(x_1,x_2,\ldots,x_{500})^T$ is a continuous variable generated independently from a uniform distribution $U(-4,4)$, and we consider only one covariate for each individual here. Each parameter setup will be elaborated in the following scenarios. The failure times are generated from the Weibull distribution by the inverse transformation method in the following steps. First, we write down the cumulative distribution function (c.d.f.),

$$F_i(t \mid x,\lambda_i,\gamma_i,\beta_i)=1-S_i(t \mid x,\lambda_i,\gamma_i,\beta_i)=1-\exp\!\big(-\lambda_i t^{\gamma_i}\exp(x^T\beta_i)\big).$$

Second, we set this cumulative distribution function equal to a value generated from a uniform distribution $U(0,1)$,

$$1-\exp\!\big(-\lambda_i t^{\gamma_i}\exp(x^T\beta_i)\big)=u, \quad \text{where } u\sim U(0,1).$$

Therefore, we can derive the failure time as

$$t_j=\left(\frac{-\ln(1-u)}{\lambda_i\exp(x_j\beta_i)}\right)^{1/\gamma_i}, \quad \text{for } i=1,\ldots,g \text{ and } j=1,2,\ldots,500.$$

Next, the censored times are generated from a uniform distribution $U(c_1,c_2)$, where $c_1$ and $c_2$ are constants. If the value of the j-th censored time is smaller than the j-th failure time, we take the j-th individual as a censored observation and $\delta_j$ is set equal to 0.
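As an illustration of this inverse-transform step (a sketch, not the thesis code), failure times for one risk component can be generated as follows, using for example the Scenario 1 risk-1 parameters $\lambda_1=0.5$, $\gamma_1=2$, $\beta_1=1$ described below.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_failure_times(lam, gamma, beta, x, rng):
    """Draw Weibull failure times t = (-ln(1-u) / (lam * exp(x*beta)))**(1/gamma)."""
    u = rng.uniform(0.0, 1.0, size=len(x))
    return (-np.log(1.0 - u) / (lam * np.exp(x * beta))) ** (1.0 / gamma)

# One covariate per individual, drawn from U(-4, 4); 250 is roughly half of n = 500
x = rng.uniform(-4.0, 4.0, size=250)
t = simulate_failure_times(lam=0.5, gamma=2.0, beta=1.0, x=x, rng=rng)
```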

On the other hand, a few outliers are then created by adding a random error to some of the failure times. The way we obtain the outliers can be written as

$$E_i\big(t_j \mid x_j,\lambda_i,\gamma_i,\beta_i\big)+\varepsilon, \quad i=1,\ldots,g;\ j=1,2,\ldots,n_{out}, \qquad (5.1)$$

where $\varepsilon$ is a large error term and $n_{out}$ is the number of outliers from risk $i$. We consider different proportions of outliers in each scenario. The prescribed value $k_0$ of the penalty function is chosen to be 3.8. This choice comes from the results of many simulations: we tried putting different values of $k_0$ into the model and obtained the best results when $k_0$ lies in the interval $[3, 4.6]$. Accordingly, we took the average value of this interval as $k_0$, i.e., $k_0=3.8$.

We will consider 5 scenarios in the simulation, as summarized in Table Ⅴ-1.

Table Ⅴ-1. Sample size, number of risk types, mixing probability, and count of outliers in each scenario.

Scenario   Sample size   Number of risk types   Mixing probability (π1, π2, π3)   Count of outliers (risk 1, risk 2, risk 3)
   1           500                2                    0.5, 0.5, ---                        6,  6, ---
   2           500                2                    0.5, 0.5, ---                       10,  2, ---
   3           500                3                    0.4, 0.3, 0.3                        4,  4,  4
   4           500                3                    0.5, 0.3, 0.2                        9,  2,  1
   5           200                2                    0.5, 0.5, ---                        4,  3, ---

Scenario 1.
We generate two distinct types of risks (g = 2) in the Weibull mixture model with $\pi_1=0.5$, $\lambda_1=0.5$, $\lambda_2=0.1$, $\gamma_1=2$, $\gamma_2=2$, $\beta_1=1$, and $\beta_2=-0.5$. The censored times are generated from $U(c_1,c_2)$ with $(c_1,c_2)=(4,10)$ for both types of risks in this scenario. There are 6 outliers in risk 1 and 6 outliers in risk 2.

Next, we use the regular log-likelihood function (2.9) and the penalty log-likelihood function (3.4) to fit the parameters by the EM algorithm.
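The following sketch illustrates one possible way to implement the censoring and the outlier injection just described; the constants and the exact error mechanism here are our own assumptions for illustration (the text only specifies that a large error $\varepsilon$ is added, as in (5.1)), so this is not the thesis's own construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def apply_censoring(t, c1, c2, rng):
    """Right-censor failure times with U(c1, c2) censoring times; delta = 0 means censored."""
    c = rng.uniform(c1, c2, size=len(t))
    delta = (t <= c).astype(int)          # 1 = observed failure, 0 = censored
    observed = np.minimum(t, c)
    return observed, delta

def inject_outliers(t, idx, error_scale, rng):
    """Shift a few selected times by a large random error so that they act as outliers."""
    t = t.copy()
    t[idx] = t[idx] + rng.uniform(0.5, 1.0, size=len(idx)) * error_scale
    return t

# Example usage with the Scenario 1 censoring interval (c1, c2) = (4, 10)
t = rng.weibull(2.0, size=500)                    # placeholder failure times
t = inject_outliers(t, idx=np.arange(6), error_scale=20.0, rng=rng)
observed, delta = apply_censoring(t, c1=4.0, c2=10.0, rng=rng)
```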

The parameter estimates, the MsSSE, and the BIC are shown in Table Ⅴ-2. We see that the estimated parameters of the regular method are far from the true values; in contrast, those of the penalty method are close to the true values. It can also be seen that the MsSSE and the BIC of the penalty log-likelihood function are significantly smaller than those of the regular log-likelihood function. Thus, we can reasonably suspect that there are some outliers in the dataset.

Table Ⅴ-2. Parameter estimates, MsSSE, and BIC under the regular log-likelihood function and the penalty log-likelihood function.

                        π1       λ1       λ2       γ1       γ2       β1        β2      MsSSE      BIC
Regular   TRUE        0.5000   0.5000   0.1000   2.0000   2.0000   1.0000   -0.5000
          Estimator   0.4820   0.5579   0.1921   0.8203   1.1144   0.2862   -0.2254    557.70   2346.6
Penalty   TRUE        0.5000   0.5000   0.1000   2.0000   2.0000   1.0000   -0.5000
          Estimator   0.4903   0.7104   0.1225   2.1364   2.1731   1.1110   -0.6229    121.57   1658.4

In addition, we plot the scatter of the dataset together with the lines of the expectation functions (3.7) for each type of risk in Figure Ⅴ-1. The horizontal axis represents the survival time, and the vertical axis shows the covariate of each individual. We use an asterisk and a pentagram to represent the individuals from risk 1 and risk 2, respectively. The red and blue lines describe the expectation of the survival time given the estimates from risk 1 and risk 2, respectively. The data in the scatter plot come from one simulation run. According to Fig. Ⅴ-1(a), we observe that the expectation lines have deviated from the center of the data; they are higher because of the effect of the outliers. In Fig. Ⅴ-1(b), the expectation lines are close to the center of the data since the penalty method has suppressed the influence of the outliers on the estimation.

[Figure Ⅴ-1: (a) Regular method; (b) Penalty method.]
Figure Ⅴ-1. The data scatter plot with the expectation lines given the estimated parameters under the regular method and the penalty method.

From the MsSSE and the BIC of these two models, we suspect that there are some outliers in the dataset. Hence we take $MsSE_j$ $(j=1,2,\ldots,n)$ from (3.8) as the index of each individual and compare it with $\chi^2_{0.01}(1)$ to find the outliers. After this, we apply the data-weighted method and the outlier-deleted method to obtain new estimates, following the process of outlier detection shown in Chart 1. The number of replicates is 1000 for this simulation setting, the average censoring rate is 17.8%, and the average number of outliers detected by this method is 14.062. This is close to the true number of outliers, 12; that is, the method detects almost the correct number of outliers. The estimation results of these two methods are displayed in Table Ⅴ-3. It can be seen that the means of the parameter estimates from both methods are improved over those of the regular method. Furthermore, the estimates of the outlier-deleted method are closer to the true values than those of the data-weighted method, and its ARB, MSE, MsSSE, and BIC are all smaller than those of the data-weighted method. We may conclude that using the outlier-deleting method can obtain better parameter estimates.
