二維存活資料之模式檢驗

(1)

國立交通大學

統計學研究所

碩士論文

二維存活資料之模式檢驗

Model Diagnostics

for

Archimedean Copula Models

研究生: 林建威

指導教授: 王維菁博士

(2)

二維存活資料之模式檢驗

Model Diagnostics

for

Archimedean Copula Models

研究生: 林建威

_{Student: Chien-Wei Lin}

指導教授: 王維菁博士

_{Advisor: Dr. Wei-Jing Wang}

國立交通大學

統計所研究所

碩士論文

A Thesis

Submitted to Institute of Statistics

College of Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Master

in

Statistics

June 2007

Hsinchu, Taiwan, Republic of China

中華民國九十六年六月

(3)

二維存活資料之模式檢驗

研究生: 林建威

指導教授: 王維菁博士

國立交通大學統計學研究所

摘要

在本論文中，我們針對右設限的資料提出了 Archimedean Copula(AC)模型的

模式檢驗法。我們拓展了 Shih(Biometrika,1998) 的想法，Shih 只針對 Clayton

模型作推論，而我們將之延伸到更大的集合，AC 家族。我們也針對 AC 家

族提出了新的資料生成演算法。我們提供模擬分析以佐證在有限樣本下，

我們所提出的方法之效能。

關鍵字: Archimedean Copula, Clayton 模型, Concordance 估計量, Frailty 模型,

Gumbel 模型

(4)

Model Diagnostics for Archimedean Copula Models

Student: Chien-Wei Lin

Advisor: Dr. Wei-Jing Wang

Institute of Statistics

National Chiao Tung University

ABSTRACT

In the thesis, we propose a model diagnostic approach to selecting an

Archimedean Copula (AC) model based on right censored data. The proposed

method extends the idea of Shih (Biometrika, 1998), who considered the

Clayton model, to a larger class of models, namely the AC family. We also

propose a new algorithm for generating a model from the AC family. Simulation

results are provided to examine finite-sample performances of the proposed

method.

Keywords: Archimedean Copula, Clayton model, Concordance estimator, Frailty

(5)

誌

謝

首先，我要先謝謝王維菁老師在這一年半對我的指導，讓我對於研究該有的態度與精神有所體認，以及教導我在論文的內容陳述上必需的技巧，才使我能夠順利的完成我的碩士論文。雖然在剛開始很不能適應技巧的改變，不斷地修改簡報內容時，實在是受到不小的打擊，但在老師耐心地循循善誘之下，終於有所成果。老師，謝謝您。我還要謝謝江村剛志學長，回想起在剛開始與學長一起做研究時，我非常地笨拙且不得要領，但學長不厭其煩一五一十地教導我基礎觀念，分享他的經驗技巧，幫助我在之後的研究當中，理論推導得以如魚得水。且由於過程以日文對談，我的日文也得以突飛猛進，對學長的感恩實在不是一言兩語說得完。學長在不久的將來就要回日本繼續精進，在這邊我也祝福學長能夠有更進一步的突破，且期許將來有再會的一日。還有所上的教授，謝謝您們，在這兩年的學習歷程中，教授們傳授了非常多難得的觀念以及思維，實在是讓我受益良多，回想起在碩一上陳鄰安老師所開的數統，老師經常掛在嘴邊的一句話，統計，是一門哲學，這句話實在是呼應了這兩年的一切。統計，是門科學，更是門哲學。再來，我要謝謝一起共度這兩年的好夥伴們：謝謝永在、阿淳、俊睿、柯董、阿 Q、益銘、侑侑、小米、小Ｂ陪我一起打籃球，讓我再度踏上球場。謝謝小米、俊睿、益銘、益通經常與我分享許多的經驗與想法。謝謝與我一起打 CS 的好戰友們，陪我度過許多快樂的時光。謝謝平時與我互相討論作業、切磋的好同學們，讓我的思維能夠更加清楚。謝謝燒肉團的成員們，陪我一起大啖美食與聊天。謝謝天天陪我一起共進午晚餐的好同學們，總是配合我的任性到多多吃飯。感謝眾美女們，不嫌棄與我這個大宅男聊天，我會加油的！感謝所有的同學容忍我占了許多電腦室的電腦跑模擬，沒有你們的包容，就沒有我的論文。再一次地謝謝所有的同學，由於個人文筆的拙劣，心中的感觸以及感恩的心情實在難以一一陳述，還請大家見諒了。有你們陪伴的這兩年，我非常地開心，在這裡祝福大家將來都能夠順心如意，不論在事業或學業上，都能夠有更進一步的突破。珍重再見，期望將來有再會的一日。最後，我要謝謝最親愛的家人，若不是家人的苦心栽培，以及在我煩惱迷惘的時候，

(6)

有家人的即時點醒、支持，肯定沒有今日的我。謝謝您們。

在此，將本篇論文獻給我所有的好朋友、師長以及家人，謝謝你們。

林建威謹誌於國立交通大學統計研究所中華民國九十六年六月

(7)

Chapter 1: Introduction

……….……….1

1.1 Motivation

...1

1.2 Outline of the thesis

……….……….…………..2

Chapter 2: Review of Archimedean Copula models

……….3

Chapter 3: Review Methods of Model Checking

………..……..6

3.1 Model checking for a parametric distribution

………...……...………6

3.2 Model selection for Copula models

………..…………..………6

Chapter 4:

The proposed method for model checking

…………...……..9

4.1: Analysis based on Complete Data

………..……….10

4.2: Analysis based on Right Censored Data

………...…….13

4.3: Model checking

……….…...………16

Chapter 5: Data Generation Algorithms

………....………19

5.1 Frailty Approach

……….………...…....19

5.1.1 Theoretical Background……….…………..19

5.1.2 Generation Algorithm………..20

5.2 Conditional Distribution Approach

……….………….…….21

5.2.1 Theoretical Background……….………..21

5.2.2 Generation Algorithm………..22

5.3: The Proposed Data Generation Method

………....……….23

5.4: Comparisons of the Three Approaches

……….…….…….25

Chapter 6: Numerical Analysis

……….………….……..28

(8)

Appendix

………...……..………...40

Reference

………46

(9)

Chapter 1: Introduction

1.1 Motivation

In the literature of survival analysis, there has been substantial research on investigating the association among several lifetime variables. Copula models are the most common modeling choice because they possess nice properties that are suitable for describing lifetime variables.

Specifically copula models form a class of bivariate distributions whose marginals are uniform on the unit interval (Genest and MacKay, 1986). Usually one can write

(

,

)

Pr

(

,

)

C u v = U <u V <v ,

where

(

U V,

)

are uniform

(

0,1 variables marginally but correlated with the joint

)

distributionfunction C

( )

.,. : 0,1

[ ]

2 →

[ ]

0,1 . Let

(

X Y,

)

be a pair of continuous failure times. In applications of lifetime data analysis, the copula structure is usually imposed on the survival function such that

(

)

{

}

Pr X >x Y, > y =C Pr(X >x), Pr(Y >y) (1.1) Models in the copula family allow for separate investigation on the dependence structure and the marginal distributions. The former is often the main interest and hence is handled parametrically (i.e. the form of C_α

(

u v,

)

is given). The latter is of less interest and hence dealt with nonparametrically. There has been a trend to derive general properties for a class of models rather than only a single member. The Archimedean copula (AC) family, which is a sub-class of the copula family, is attractive due to its nice analytical properties. For an AC model, the bivariate copula function C_α

(

u v,

)

can be further simplified as

(

)

1

{

( )

}

,

C_α u v

φ

_α−

φ

_α u

φ

_α v

= + for u v, ∈

[ ]

0,1 , (1.2) where

φ

_α

( )

⋅ : 0,1

[ ] [

→ 0,∞

]

is a univariate function which has two continuous derivatives

(10)

satisfying

φ

_α

( )

1 =0,

( )

t

( )

t 0 t α α

φ

′ =∂ < ∂ and

( )

2 2 0 t t t α α

φ

′′ =∂ >

∂ . AC models have the

nice feature that the bivariate relationship can be summarized by the univariate function

( )

α

φ

⋅ .

Many authors have considered semi-parametric inference of the copula parameter

α

, which measures the level of association, without specifying the marginal distributions. Note that

α

is related to Kendall’s

τ

, a rank correlation measure, such that

( )

(

)

(

)

1 1

0 0

4 C_α u v C, _α du dv, 1

τ α

=

∫ ∫

− , (1.3)

where

τ

is defined as the difference of concordance and discordance probabilities for two independent pairs of

(

X Y,

)

. These semi-parametric inference procedures require specification of C_α

( )

.,. or

φ

_α

( )

⋅ .

A practical and important question is how can we select an appropriate model to fit the data? There have been some works on model selection including the papers by Genest and Rivest (1993), Shih (1998) and Wang and Wells (2000), just to name a few. In the thesis, we extend the approach of Shih (1998), who considered only testing the Clayton model, to general Archimedean copula models.

1.2 Outline of the thesis

In Chapter 2, we review AC models and their properties. In Chapter 3, we review literature on model diagnostics, including general methodology and results developed for selecting a particular copula model. The proposed method is presented in Chapter 4. In Chapter 5, we review existing data generation algorithms and propose a new approach. Simulation analysis is presented in Chapter 6 and concluding remarks are given in Chapter 7.

(11)

Chapter 2: Review of Archimedean Copula models

Under Archimedean Copula family, the relationship between association parameter

α

and Kendall’s

τ

can be expressed as follows:

( )

(

)

(

)

1 1 0 0 4 C_α u v C, _α du dv, 1

τ α

=

∫ ∫

−

(

)

=4E C_ _α u v, _−1

( )

1 0 4 v dv 1. v α α

φ

= + ′

∫

(2.1)

If the form of the function

φ

( )

⋅ is specified, then we can estimate the association parameter

α

semi-parametrically by the above equation.

In multivariate survival analysis, we usually use Kendall’s

τ

to measure the dependence between random variables. Moreover, we have another dependence measurement called the local odds ratio which is related to the conditional version of Kendall’s

τ

in Oakes (1989). Local odds ratio has been used to measure the pointwise dependence. From Oakes (1989), for an Archimedean Copula model we know that the local odds ratio depends on t=

(

x y,

)

only through some function of S x y

(

,

)

, that is,

θ

*

(

x y,

)

=

θ

{

S x y

(

,

)

}

, where

(

)

*

,

x y

θ

is the local odds ratio function defined as

(

)

(

)

(

)

(

)

(

)

* Pr , Pr , , Pr , Pr , X x Y y X x Y y x y X x Y y X x Y y

θ

= = = ⋅ ≥ ≥ = ≥ ⋅ ≥ = (2.2) For an AC model, we have

θ

( )

v = −v

φ

′′

( )

v

φ

′

( )

v . The paper by Frees and Valdez (1998) provides a nice review of copula models.

(12)

Example 1: Clayton model

The generating function can be written as φ_α

( )

v

(

v−α 1

)

α

= − ,

α

∈

(

0,∞

)

. The joint survival function can be written as

(

)

{

( )

}

1 Pr X x Y, y S_x x S_y y 1 α α α − − −   > > = _ _ +_ _ − . It follows that 2 α τ α = + .

A special property of the Clayton model reflects in its local odds ratio which can be expressed as

θ

*

(

x y,

)

=

α

+1. Notice that

θ

{

S x y

(

,

)

}

does not depend on

(

x y,

)

.

Example 2: Gumbel model

The generating function can be written as

φ

_α

( )

v = −

{

log

( )

v

}

α+1,

α

∈

[

0,∞

)

. The joint survival function can be written as

(

)

(

( )

)

(

( )

)

1 1 1 1 Pr X x Y, y exp lnSx x lnSy y α α+ + α+    _ _  > > = − −__ + − __      . It follows that 1 α τ α =

+ . And the local odds ratio, can be expressed as

(

)

(

)

* , 1 log , x y S x y

α

θ

= − . Compared with the Clayton model, however

θ

*

(

x y,

)

depends

on the joint survival function at

(

x y,

)

.

In Figure 2.1, the curves of

θ

( )

v for three AC models with the same Kendall’s 0.75

τ = are plotted. Note that the relationships between

α

and

τ

between two models are different. Under the same value of Kendall’s

τ

, the corresponding values of

α

are different. Note that in the Clayton model, we have 2 6

1

τ α

τ

= =

− . In the Gumbel model, 3 1 τ α τ = =

− . For the Frank model, by setting τ =0.75, we know that α =7.741e-07 which can be obtained by numerically solving

( )

{

1

}

( )

1 4 D log 1 log

(13)

where

( )

{

₍

₎

}

1 ₀ 1

t

D

α

=

∫

α t

α

e − dt.

It is worthy to mention that the local odds ratio,

θ

( )

v , plays an important role for our proposed method.

(14)

Chapter 3: Review Methods of Model Checking

In §3.1, we briefly describe methods of model checking for a parametric distribution. In

§3.2, we review useful results for selecting an appropriate Archimedean Copula model.

3.1 Model checking for a parametric distribution

Suppose

(

X Y,

)

follows a parametric model with the distribution function

(

,

)

Pr

(

,

)

F_θ x y = X ≤x Y≤ y . Usually one can compare F_θˆ

(

x y,

)

with its empirical

estimator,

(

)

(

)

1 , , n i i i I X x Y y F x y n = ≤ ≤ =

∑

.

The comparison can be made based on the plots of two curves or some distance measures. For example: one may use the Q-Q plot to check whether the two quantiles are about the same. Alternatively one can set up for a formal hypothesis and test it using statistics such as the K-S test or Chi-squared test, both of which measure the “distance” between the two functions.

3.2 Model selection for Copula models

If the marginal distributions of X and Y were known, the approach mentioned above can be easily applied. Specifically let Ui =S1

( )

Xi and Vi =S Y2

( )

i , then we have

(

U Vi, i

)

(

i=1,...,n

)

, and can compare Cαˆ

(

u v,

)

and its empirical estimator

(

)

(

)

1 , , n i i i I U u V v C u v n = < < =

∑

.

However, the parametric forms of the marginal distributions are usually not specified, we have to give up this method.

Genest and Rivest (1993) derived useful properties of AC models which have been applied for model selection. Specifically they define the distribution Copula

( )

(

,

)

(15)

( )

v

( )

v

( )

v

λ

=

φ

′ for 0< ≤ . Therefore we can compare the difference between a v 1 nonparametric estimator of Pr

(

F X Y

(

,

)

≤v

)

=Pr

(

V ≤v

)

=K v

( )

and if model-based representation, v−

λ

_α

( )

v , based on a selected distance measure. If the difference is small, we can say that the imposed model is appropriate for the data.

We briefly illustrate how to perform the above ideas based on complete data

(

X Y1, 1

)

,...,

(

X Yn, n

)

. The purpose is to identify the form of φα. First of all, we need

observations V~_i =F(X_i,Y_i) in order to estimate K v

( )

nonparametrically. Since the form of F(.,.) is unknown, Genest and Rivest (1993) proposed the following “pseudo” observations:

(

)

{

}

(

)

# , : , 1 , 1, , i j j j i j i V = X Y X < X Y <Y n− i= … n,

which are proxies of V~_i =F(X_i,Y_i). The procedure of model selection is stated below. 1. Obtain the nonparametric estimate of K v

( )

:

( )

(

)

1 1 n n i i K v I V v n = =

∑

≤ .

2. Construct a semi-parametric estimate of K v

( )

. We may have several candidates of models indexed by

φ

_α( )j for j=1,...,J. For each candidate we need to estimate the value of

α

. Note that one may estimate Kendall’s

τ

and use the relationship between

τ

and

α

to estimate

α

. Based on ( )

_{( )}

( )

_{( )}

/ j j j v K v v v v α α α

φ

= − ∂ ∂ ,

we can estimate K_α( )j

( )

v for j=1,...,J.

3. Then we can compare the distance between K_n(v) and K_α(j)(v) for j =1,...,J. The most-fitted model is the one which gives the smallest distance between the two curves. The above procedure is not applicable when there is censoring. Wang and Wells (2000)

(16)

(17)

Chapter 4:

The proposed method for model checking

In this chapter, we present our proposal for model checking. The idea was motivated by the paper of Shih (1998) who proposed to test the Clayton model by comparing the difference between weighted and unweighted concordance estimators of the association parameter

α

. When the model assumption is correct, which is the condition of the null hypothesis, both estimators converge to the true parameter value. On the other hand, when the model assumption is false, the two estimators will converge to different values. Here we extend Shih’s idea to verify whether the model follows a particular Archimedean Copula model. We will use the Gumbel model as an example of AC models.

Define ∆ =ij I

(

Xi −Xj

)(

Yi−Yj

)

>0 , where

(

X Yi, i

)

and

(

X Yj, j

)

are two

independent replications of

(

X Y,

)

. This indicator variable denotes whether the pairs

(

X Yi, i

)

and

(

X Yj, j

)

are concordant or discordant. The conditional expectation of ∆ ij

contains the information about the level of association. Specifically it follows that

(

)

Pr ∆_ij=1X_ij =x Y, _ij = y

(

)

(

)

(

)

₍

₎

(

)

₍

₎

2 Pr , Pr , 2 Pr , Pr , Pr , Pr , i i j j i i j j i i j j X x Y y X x Y y X x Y y X x Y y X x Y y X x Y y ⋅ = = ⋅ ≥ ≥ =   ⋅_ = = ⋅ ≥ ≥ + = ≥ ⋅ ≥ = _

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

Pr , Pr , Pr , Pr , Pr , Pr , 1 Pr , Pr , X x Y y X x Y y X x Y y X x Y y X x Y y X x Y y X x Y y X x Y y = = ⋅ ≥ ≥ = ≥ ⋅ ≥ = = = = ⋅ ≥ ≥ + = ≥ ⋅ ≥ =

(

)

(

)

(

)

(

)

, , , 1 S x y S x y α α

θ

= + (4.1)

where X_ij =min

(

X X_i, _j

)

and Y_ij =min

(

Y Y_i, _j

)

and

θ

_α

(

S x y

(

,

)

is the local odds ratio defined in equation (2.2). Recall that for the Clayton model,

θ

_α

(

S x y

(

,

)

=

α

+1 and the Gumbel model,

(18)

(

)

(

)

(

)

, 1 log , S x y S x y α

α

θ

= − .

Now we will illustrate how to utilize equation (4.1) to construct different forms of estimating functions of

α

. In § 4.1, external censoring is ignored temporarily to simply the presentation. In § 4.2, the proposed methods are modified to handle censored data.

4.1: Analysis based on Complete Data

For complete data, we observe

{

(

X Yi, i

)(

i=1,...,n

)

}

which is a random sample of

(

X Y,

)

. Based on the moment condition of equation (4.1), one can construct the following estimating function of the association parameter

α

:

(

)

(

)

(

)

0 , , ij ij| ij, ij , i j U

α

S x y E X Y <   =

∑

_∆ − ∆ _ where

(

)

(

)

(

)

(

)

(

)

, | , Pr 1 , , 1 ij ij ij ij ij ij ij ij ij ij S X Y E X Y X Y S X Y α α

θ

∆ = ∆ = = + . For the Gumbel model, the above estimating function becomes

(

)

(

)

(

)

(

)

0 1 log , , , 2 log , ij ij ij i j ij ij S X Y U S x y S X Y

α

<   −     = _∆ − _ −    

∑

(

)

(

)

log , . 2 log , ij ij ij i j _ij _ij S X Y S X Y

α

<  ₋    = ∆ − ⋅ −    

∑

(4.2)

Here notice that, we use the conditional probability rather than the unconditional one. The latter is used in Shih’s paper (1998) since both are equivalent under the Clayton model.

Note that S x y

(

,

)

=Pr

(

X >x Y, >y

)

is a nuisance function. For complete data,

(

,

)

S x y can be estimated by the empirical estimator

(

)

(

)

1 , ˆ _, _. n i i i I X x Y y S x y n = > > =

∑

(19)

To obtain an estimator of

α

, ˆ

α

, we solve U0

(

α

,S x yˆ

(

,

)

= . Since an explicit solution is 0

not available, we suggest to solve the equation by numerical methods, say the Newton-Raphson method, which often requires computing the derivative of U₀

(

α

,S x y

(

,

)

with respect to

α

. For the Gumbel model, the derivative equals

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 2 , log , . 2 log , 1 , ij ij ij ij i j i j ij ij ij ij d S X Y S X Y d S X Y S X Y α α

θ

α

θ

< <      ₋     ₌     _ _ − + _ _         

∑

The function in (4.2) can be viewed as an unweighted version. That is, ∆ is treated _ij equally for each combination of i and j . However since different combinations of ∆ are _ij associated with different values of (X Y_i, )_i and (X Y_j, _j), it is reasonable to suspect that such additional information may be utilized in the estimation procedure. Clayton (1978) proposed a conditional likelihood function for the Clayton family. We can modify his method for AC models. The resulting log-likelihood function is given by

( )

(

)

(

)

(

)

(

)

(

)

, ₁ L log 1 log , 1 , 1 , ii ii _ij ij i _ii _ii _ii i j _ij _ij _ij S X Y _R R S X Y R S X Y α α α θ α θ < θ     −     = + − ∆  _{− +}   _{− +}     

∑

where R_ij =R X Y

(

_ij, _ij

)

and

(

)

(

)

1 , , n i i i R x y I X x Y y =

=

∑

≥ ≥ . Now we derive the score function for an Archimedean copula model:

( )

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

ln , 1 , 1 . 1 , 1 , ii ii ij ij ij ii i i j ii ii ii ij ij ij d S X Y d S X Y L R d d R S X Y R S X Y α α α α θ θ α α _θ α < _θ α ∆ − ∂ − = ⋅ + ⋅ ∂  _{− +}   _{− +}     

∑

For the Gumbel model, the score function becomes

( )

(

)

(

)

(

)

(

)

(

)

1 1 . log , log , log , ij ii i i j _ij _ij _ij ii ii ii ii ii L R R S X Y R S X Y S X Y

α

_α

α

< − ∆ ∂ − − = + ∂   _ ₋ _    ₋ _ ₋ _    

∑

(20)

technique in Oakes (1986). That is, if j∈R_ii, then R_ii =R_ij, ∆ = and _ij 1 1 ii ij ii j R∈ ∆ =R −

∑

. It follows that

( )

L

( )

S

α

∂ = ∂

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

ln , 1 , 1 , 1 , ii ij ii ii ij ij ij j R i i j ii ii ii ij ij ij d S X Y d S X Y d d R S X Y R S X Y α α α α θ θ α α θ θ ∈ < ∆ ∆ − = ⋅ + ⋅  _{− +}   _{− +}     

∑

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

ln , 1 , 1 , 1 , ii ii ii ij ij ij ij j R i i j ii ii ii ij ij ij d S X Y d S X Y d d R S X Y R S X Y α α α α θ θ α α θ θ ∈ < ∆ − ∆ = ⋅ + ⋅  _{− +}   _{− +}     

∑∑

∑

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

ln , 1 , 1 , 1 , ij ij ij ij ij ij i j ij ij ij ij ij ij d S X Y d S X Y d d R S X Y R S X Y α α α α

θ

α

θ

< ∆ − ∆ = ⋅ + ⋅  _{− +}   _{− +}     

∑

(

)

(

)

(

)

₍

(

)

₎

(

)

(

)

(

)

(

)

(

)

, 1 ln , , . , 1 1 , ij ij ij ij ij ij ij i j _ij _ij ij ij ij S X Y d S X Y S X Y d _{S X Y} R S X Y α α α α α

θ

α

_θ

θ

<   + _ _ = ⋅ ∆ −   ₋ ₊  _ ₊ _    

∑

(4.3) For the Gumbel model for illustration, the equation equals

( )

(

)

(

)

(

)

(

)

(

)

log , 2 log , 2 log , . log , log , ij ij ij ij ij ij ij i j _ij _ij _ij _ij _ij S X Y S X Y S X Y S R S X Y S X Y

α

<  ₋  ∆ −  − − _ _ = − ⋅  ₋   ₋     

∑

If we treat the estimating function in (4.2) as an unweighted version, the weight function in equation (4.3) is

(

)

(

)

(

)

(

)

(

)

(

)

, 1 ln , , 1 , ij ij ij ij ij ij ij S X Y d S X Y d R S X Y α α α

θ

α

θ

+ ⋅  ₋ ₊   

which accounts for the effects of the data and model properties.

Next, one can derive the Fisher information function, that is, the derivative of minus score function. It follows that

(21)

( )

S

( )

I

α

∂ = − ∂

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

, 1 ln , , , 1 1 , ij ij ij ij ij ij ij i j _ij _ij ij ij ij S X Y d S X Y S X Y d _{S X Y} R S X Y α α α α α

θ

α

<

_θ

α

_θ

  + ∂   = − ⋅ ∆ −  ∂  ₋ ₊  _ ₊ _    

∑

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 , , 1 1 , 1 , ij ij ij ij ij i j _ij _ij ij ij ij d S X Y d _{S X Y} S X Y R S X Y α α α α

θ

α

_θ

θ

<        _ ₊    _ _ = − _ − ∆ _⋅  _{− +} _    

∑

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 2 2 2 , 1 . 1 , _, _, ij ij ij ij ij ij ij ij ij ij d S X Y d R S X Y _d _{S X Y} _{S X Y} d α α α α

θ

α

θ

_θ

α

         ∆    − −  _ _ _ _  − +  _ _ _ _    _ _    _ _    

For the Gumbel model, the Fisher information function becomes,

( )

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 log , 1 log , log ,

log , log , . log , log , ij ij ij ij ij ij ij ij ij ij ij ij ij i j _ij _ij _ij _ij _ij S X Y S X Y S X Y R S X Y S X Y I R S X Y S X Y

α

<  ₋  _{− ∆} ₋ _∆     −    ₋   ₋   _ _ _ _   = −  ₋   ₋     

∑

Again the nuisance function S x y

(

,

)

can be estimated separately and then the estimator is plugged into the score function and Fisher information function. The solution ˆ

α

_w requires using numerical methods such as the Newton-Raphson method approach.

4.2: Analysis based on Right Censored Data

Now we incorporate external situation censoring in the analysis. Let

(

A B_i, _i

)

be the bivariate censoring variables. One only observes

(

X Y_i, ,_i

δ δ

₁_i, ₂_i

)

_{, where} _min

₍

_,

₎

i i i

X = X A ,

(

)

min ,

i i i

Y = Y B ,

δ

_1i =I X

(

_i ≤A_i

)

and

δ

_2i =I Y

(

_i ≤B_i

)

. In presence of right censoring, we know the order of X_i and X_j if and only if X_ij ≤min( ,A A_i _j). Similarly the order of Y_i

(22)

and Yj can be known if and only if Yij ≤min(B Bi, j) . Define Zij =I X

(

ij ≤ A Yij, ij ≤Bij

)

_, where Aij =min( ,A Ai j) and Bij =min(B Bi, j) _{. The value of} ij

∆ is observed if and only if 1

ij

Z = . We will modify the estimating procedures that only include comparable pairs (i.e. those with Z_ij = ) in the analysis. 1

The unweighted estimating function of the association parameter

α

can be modified as

(

)

(

)

(

)

0 , , ij ij | ij, ij ij, i j U

α

S x y E X Y Z <   =

∑

_∆ − ∆ _⋅ where

(

)

(

)

(

)

(

)

(

)

, | , Pr 1 , . , 1 ij ij ij ij ij ij ij ij ij ij S X Y E X Y X Y S X Y α α

θ

∆ = ∆ = = + For the Gumbel model, the above function then becomes

(

)

(

)

(

)

(

)

0 1 log , , , 2 log , ij ij ij ij i j ij ij S X Y U S x y Z S X Y

α

<   −     = _∆ − _⋅ −    

∑

(

)

(

)

log , . 2 log , ij ij ij ij i j _ij _ij S X Y Z S X Y

α

<  ₋    = ∆ − ⋅ ⋅ −    

∑

For estimating the function S x y( , ) for right censored data, there exist several nonparametric estimators. Here we adopt the Dabrowska estimator (1988) which is the most well-known one among its competitors. The formula of ˆ( , )S x y is given by

(

)

(

) (

)

(

)

0 0 ˆ _, ˆ _{, 0} ˆ _0, ₁ _, _, u x v y S x y S x S y L u v < ≤ < ≤ =

_∏

_ − ∆ ∆ _ where

(

)

(

)

(

)

(

)

(

)

{

}

{

(

)

}

10 01 11 10 01 ˆ _, ˆ _, ˆ _, ˆ _, ˆ ˆ 1 , 1 , u v u v u v L u v u v u v Λ ∆ − Λ − ∆ − Λ ∆ ∆ ∆ ∆ = − Λ ∆ − − Λ − ∆ ,

(23)

(

)

(

)

(

)

1 2 1 11 1 , , 1, 1 ˆ _, , n i i i i i n i i i I X u Y v u v I X u Y v

δ

= = = = = = Λ ∆ ∆ = ≥ ≥

∑

,

(

)

(

)

(

)

1 1 10 1 , , 1 ˆ _, , n i i i i n i i i I X u Y v u v I X u Y v

δ

= = = ≥ = Λ ∆ − = ≥ ≥

∑

,

(

)

(

)

(

)

2 1 01 1 , , 1 ˆ _, , n i i i i n i i i I X u Y v u v I X u Y v

δ

= = ≥ = = Λ − ∆ = ≥ ≥

∑

,

and S xˆ

(

, 0

)

and Sˆ 0,

(

y

)

are the usual Kaplan-Meier estimates, i.e.,

(

)

(

)

(

)

1 1 , 1 ˆ _{, 0} ₁ n i i n u x i i I X u S x I X u δ = ≤ =   = =     =  −   _≥     

∑

∏

∑

,

(

)

(

)

(

)

1 1 , 1 ˆ 0, 1 n i i n u y i i I Y u S y I Y u δ = ≤ =   = =     =  −   _≥     

∑

∏

∑

That is, the marginals of S x yˆ

(

,

)

are given by the univariate Kaplan-Meier estimates.

Numerical algorithms for solving ˆα often involve calculating the derivative of

(

)

(

)

0 , ,

U

α

S x y . For Gumbel’s model, this terms equals

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 2 , log , . 2 log , 1 , ij ij ij ij ij ij i j i j ij ij ij ij d S X Y S X Y d _Z _Z S X Y S X Y α α θ α α θ < <      ₋      ⋅ = _ _⋅   _ _ − + _ _         

∑

The weighted versions of the score function and Fisher information for censored data are given by

( )

(

)

(

)

₍

(

)

₎

(

)

(

)

(

)

(

)

(

)

, 1 ln , , . , 1 1 , ij ij ij ij ij ij ij ij i j _ij _ij ij ij ij S X Y d S X Y S X Y S Z d _{S X Y} R S X Y α α α α α θ θ θ α α _θ θ <   + _ _ = ⋅ ∆ − ⋅  ₋ ₊  _ ₊ _    

∑

(24)

For the Gumbel model, it equals

( )

(

)

(

)

(

)

(

)

(

)

log , 2 log , 2 log , . log , log , ij ij ij ij ij ij ij ij i j _ij _ij _ij _ij _ij S X Y S X Y S X Y S Z R S X Y S X Y α α α α α α <  ₋  ∆ −  − − _ _ = − ⋅ ⋅  ₋   ₋     

∑

And

( )

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 , , 1 1 , 1 , ij ij ij ij ij i j _ij _ij ij ij ij d S X Y d _{S X Y} I S X Y R S X Y α α α α θ α _θ α θ θ <        _ ₊    _ _ = − _ − ∆ _⋅  _{− +} _    

∑

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 2 2 2 , 1 . 1 , , , ij ij ij ij ij ij ij ij ij ij ij d S X Y d _Z R S X Y d S X Y S X Y d α α α α θ α θ θ θ α          ∆    − − ⋅ _ _ _ _  − +  _ _ _ _    _ _    _ _    

For the Gumbel model, it equals

( )

(

)

(

)

(

)

(

)

(

)

(

)

(

)

(

)

2 log , 1 log , log ,

log , log , . log , log , ij ij ij ij ij ij ij ij ij ij ij ij ij ij i j _ij _ij _ij _ij _ij S X Y S X Y S X Y R S X Y S X Y I Z R S X Y S X Y α α α α α α <  ₋  _{− ∆} ₋ _∆  _ _  −    ₋   ₋   _ _ _ _   = − ⋅  ₋   ₋     

∑

The solution ˆα_w requires using numerical methods such as the Newton-Raphson method approach.

4.3: Model checking

By defining

γ

ˆ=log

α

ˆ and ˆγ_w =logαˆ_w, Shih (1998) proves that when the Clayton model is correct,

(

)

1

2 _ˆ _ˆ

w

n γ −γ converges to a normal distribution with mean zero. For complete data, the variance is

( )

w

( )

2

( )

W

η

=V

η

+V

η

− H

η

, where

η

=1

α

and

(25)

( )

(

)

(

)

( )

(

)

(

)

(

) (

)

2 4 3 2 2 ₂ 2 4 2 1 17 27 14 2 8 2 1 1 3 1 3 1 V

η

L

η

η η

η

 ₊ ₊ ₊ ₊ ₊    = −  + + +     is the variance of 1 2 _ˆ n

γ

− , where

( )

(

)

(

)(

)

1 0 3 , 3 1 2 k i k i L k k

η

∞ = = Γ + = Γ + + +

∏

∑

and

( )

₂

(

)

4 1 1 1 2 6 5 1 1 2 2 2 w V

η

=

η

+

η

+ −

η

+ 

ψ

′ +

η

−

ψ

′ +

η

       is the variance of 1 2 _ˆ w n

γ

−

, where

ψ

′

( )

c =

∑

(

n+c

)

−2 is the trigamma function, and

( )

(

)

( )(

)(

)

3 2 2 2 40 49 21 3 4 4 1 2 1 , 2 1 H

η

J

η η

η

+ + + = − − + + + + where

( )

(

)(

) (

)

0 ! 2 . 2 2 k k J k k k

η

∞ = Γ = + + Γ +

∑

For right censored data, the asymptotic variance becomes

( )

w

( )

2

( )

W

η

=V

η

+V

η

− H

η

_, where

( )

(

)

(

)

2

(

) (

)

(

)

2 ₃ 12 13 2 2 3 3 23 23 2 * 1 4 2 , , , , , 1 i i i V

η

α

N N C u v C u v C u v f x y dxdy

α

δ

= + = +

∫∫

∏

( )

(

) (

)

(

) (

)

(

)

3 12 13 2 2 3 3 23 23 2 1 2 2 3 3 , , , 1 , , , , w i i i N N C u v C u v C u v V f x y dxdy S u v S u v

η

δ

= =

∫∫

∏

( )

(

)

(

)

(

) (

)

(

)

(

)

3 12 13 2 2 3 3 23 23 * 1 2 2 2 2 , , , , , 1 , i i i N N C u v C u v C u v H f x y dxdy S u v

α

η

α

δδ

= + = +

∫∫

∏

with

δ

=Pr an observation is uncensored in both components

(

)

, N₁_l = ∆₁_l

(

α

+2

) (

−

α

+1

)

,

(

)

*

Pr Z_ij 1

δ

= = , u_l =min

(

x x₁, _l

)

, v_l =min

(

y y₁, _l

)

, l=2,3 , and u₂₃ =max

(

u u₂, ₃

)

,

(

)

23 max 2, 3

(26)

One can estimating the variance for complete data W

( )

η

by W

( )

η

ˆ_w . The censored version W

( )

η

_{can be estimated by}

(

)

(

)

(

)

(

)

* * ˆ ˆ 2 2 2 2 1 1 1 ˆ ˆ ˆ _, ˆ _ˆ ₁ ˆ ˆ _ˆ ₁ ˆ w w ij ik ij ik i j k _ij _w _ik _w W Z Z N N n R n R n

α

δ

α

δ

α

δ

≠ ≠  ₊   ₊      =  −   −  +  +       

∑

where

(

) (

)

ˆ _ˆ ₂ _ˆ ₁ ij ij w w N = ∆

α

+ −

α

+ ,

(

1 2

)

1 1, 1 ˆ n i i i I n

δ

= = = =

∑

, ˆ* 2 ij i j n Z

δ

<   =    

∑

.

For complete data, the null hypothesis is rejected when

( )

1 2 ˆ ˆ ˆ w w W n

γ

η

−     is greater than Z₁₋_α₂

with significance level equals to

α

, where Z_p is the p-th percentile of the standard normal

distribution. For censored data, the test statistic is changed to ˆ _{1 2}ˆ ˆ w W n

γ

−

γ

    .

For our proposal which can be extended to the whole AC family, we need to know the (asymptotic) distribution of ˆ

γ

_w− . Asymptotic normality should be correct based on the

γ

ˆ central limit theorem and the delta method. Formal derivations will be future work. However the proposed method involves the complicated plugged-in estimator, analytic estimation of the variance term will be impossible. Note that even there exists no analytic form for the variance of Dabrowska’s estimator. Hence we suggest to use the Jackknife algorithm to estimate the variance of the proposed test statistic, denoted by

σ

ˆ_Jackknife2 . Our hypothesis testing is rejected

if ˆ ˆ ˆ w Jackknife

γ

σ

−

(27)

Chapter 5: Data Generation Algorithms

In this chapter, we discuss two existing algorithms for generating an AC model and then propose a new data generation algorithm.

5.1 Frailty Approach

5.1.1 Theoretical Background

Suppose that there are p lifetime variables, X X₁, ₂,...,X_p which are correlated. Oakes (1989) might be the first one who used the idea of frailty to construct multivariate distributions. He assumes that the dependence among these variables can be fully explained by a latent variable γ , called “frailty”. That is, given the value of γ , these variables are independent such that one can write

(

1 1

)

(

)

1 ,..., | | p p p j j j S X x X x γ S X x γ = > > = Π > .

If the failure times represent the lifetimes of family members, γ represents the shared genetic/environmental factor. Furthermore, γ affects each of T_j via a proportional hazard model such that

(

|

)

( )

j j

h x Z =h x ⋅

γ

(or equivalently S_j

(

x Z|

)

=B_j

( )

x γ),

where Bj

( )

x =exp

(

−

∫

_oxb t dtj

( )

)

. Since γ is a (positive) random variable, the unconditional

joint survival function can be expressed as

(

1 1

)

( )

1 ,..., p p p j j j S X x X x E B x γ γ ₌      > > = _Π _       . (5.1) Notice that the Laplace transform of γ is defined as

( )

(

t

)

tx

( )

L t =E_γ e−γ =

∫

e dF− _γ x ,

where F_γ is the distribution function of γ .

(28)

family and the AC family introduced earlier. That is the inverse of the generating function

( )

1 α

φ

−

⋅ for an AC model is actually the Laplace transform of γ . To see this, we can view

( )

L t as the moment generating function evaluated at − . If we know the form of t L t

( )

, we derive its distribution. Moreover,

(

1 1,..., p p

)

{

1

( )

1 p

( )

p

}

S X >x X >x =E_γ _B x B x _γ =Eγ

{

expγ ⋅

(

lnB x1

( )

1 +lnB2

( )

x2 ++Bp

( )

xp

)



}

=L−lnB x1

( )

1 −lnB2

( )

x2 −−lnBp

( )

xp . Since

( )

1

( )

ln ln i i i i i i i i i i S x =E_γ _B x γ_=L_− B x _⇒ − B x =L− _S x _, we can obtain

(

)

{

1

( )

1

( )

}

1 1,..., p p 1 1 p p S X >x X >x =L L− _S x _++L− _S x _ _.

Since the Laplace forms have well-defined inverses, thus from the above equation we can find that the inverse function of L , 1

L− , acts as the generator of Archimedean Copula. That is, the frailty family can be treated as a subclass of the Archimedean copula family with the generator being the inverse function of the Laplace transform for the latent variable γ . The explanation by using the frailty variable to explain the cause of dependence is intuitive for many applications.

5.1.2 Generation Algorithm

Here we consider the bivariate case with p= and, to unify the notations, we let 2

1

X = X and Y =X₂. Based on the construction of the frailty model, a random replication of

(

X Y,

) (

= X X1, 2

)

can be generated as follows.

1. Generate a positive random variable γ following a given distribution. Then derive the form of its Laplace transform denoted as L .

(29)

2. Independent of γ , generate

(

U U1, 2

)

which are independent uniform

(

0,1

)

random variables. Recall that based on the relationship U_k =S_k

(

X_k|γ

)

=B_k

(

X_k

)

γ , we have

γ

1 lnUk lnBk

(

Xk

)

−

− ⋅ = − for k =1, 2.

3. After specifying the forms of Sk

( ) (

. k =1, 2

)

, we need to find

(

)

1 1 ln k k k X S− L γ− U  = _ − ⋅ _{ .}

For the Gumbel model, it corresponds to the case that γ following positive stable distribution with Laplace transform

( )

1 1 exp L t ₌ _₋tα+ _  , where L−1

( )

t log

( )

t α+1 φ

( )

t = −_ _ = .

5.2 Conditional Distribution Approach

5.2.1 Theoretical Background

The idea was proposed by Lee (1993). Given the marginal distribution of X₁ and if the conditional distribution of X₂|X₁ is specified, then X₂ can be generated. In general, X_k

can be generated given that the form of X_k|X X₁, ₂,...,X_k₋₁ is specified. The algorithm can be performed successively for k=2,...,p.

Now we apply the above idea to the family of Archimedean copula construction of the form:

(

1, 2,..., p

)

(

1

( )

1 , 2

( )

2 ,..., p

( )

p

)

F x x x =C F x F x F x 1

{

_{( )}

}

1 1 k k F x F x φ− φ φ = _ _++ _ _ _. The joint distribution function is given by

(

1

)

(

1

)

1 , , ,..., ... k k k k f x x F x x x x ∂ = ∂ ∂ …

( )

{

}

1 1 1 1 ... k k k k F x F x x x φ φ φ − ∂ = _ _+ + _ _ ∂ ∂

(30)

( )

1 ( )

{

( )

}

( )

1 1 ₁ k k k k _i i i i i F x F x F x F x

φ

−

φ

= ′ ′ = _ _++ _ _ ⋅ Π _ _ _,

where the superscript notation (k) means the k-th derivative. Then, the conditional density function of X_k given X X₁, ₂,...,X_k₋₁ is

(

)

(

)

(

)

1 1 1 1 1 , , | ,..., , , k k k k f x x f x x x f x x − − = … …

( )

{

( )

}

( )

{

( )

(

)

}

( )

1 1 1 ₁ 1 1 1 1 1 1 1 1 k k k k _i i i i i k k k k i i i i i F x F x F x F x F x F x F x F x φ φ φ φ φ φ φ φ − = − − − − − = ′ ′ + + ⋅ Π             = ′ ′ + + ⋅ Π            

( )

{

( )

}

( )

{

( )

(

)

}

( )

1 1 1 1 1 1 1 1 1 k k k k k k k k k k F x F x F x F x F x F x φ φ φ φ φ φ φ − − − − − + +         _′ ′ = ⋅ _ _ + +         .

Then, the conditional cumulative density function of X_k given X X₁, ₂,...,X_k₋₁ is

(

| 1,..., 1

)

(

| 1,..., 1

)

k x k k k F x x x ₋ f x x x ₋ dx −∞ =

∫

( )

{

( )

}

( )

{

( )

(

)

}

( )

1 1 1 1 1 1 1 1 1 k k x k k k k k k F x F x F x F x dx F x F x φ φ φ φ φ φ φ − − −∞ − − − + +         _′ ′ = ⋅ _ _ + +        

∫

( )

{

( )

}

( )

{

( )

(

)

}

1 1 1 1 1 1 1 1 1 1 k x k k k k k x F x F x F x F x φ φ φ φ φ φ − − − − − − =−∞ + +         = + +        

( )

{

( )

}

( )

{

( )

(

)

}

1 1 1 1 1 1 1 1 1 1 k k k k k k F x F x F x F x φ φ φ φ φ φ − − − − − − + +         = + +        

( )

{

( )

}

( )

(

)

1 1 1 1 1 1 k k k k k k a F x a

φ

− − − − − − + _ _ = , (5.2) where a_k−₁ =

φ

F x₁

( )

₁ ++

φ

F_k−₁

(

x_k−₁

)

. 5.2.2 Generation Algorithm

Consider the bivariate case with

(

X Y,

) (

= X X1, 2

)

. The algorithm can be described as

following:

二維存活資料之模式檢驗

國 立 交 通 大 學

統計學研究所

碩士論文