Introduction - A Literature Review of Quantile Regression Analysis for Censored Data

1.1 Different types of regression models

Consider a response variable $Y$ and a covariate vector $Z$. Regression analysis models the behavior of $Y$ as a function of $Z$. The following linear model is the most popular form:

$$Y = \alpha + Z^T\beta + \varepsilon. \tag{1.1}$$

The major interest is the estimation of $\beta$, which measures the effect of $Z$ on $Y$. To interpret $\beta$ and to develop valid inference methods, we need to impose additional assumptions on the distribution of $\varepsilon$. For example, if the errors $\varepsilon$ are independently and identically distributed (iid) with mean zero, then $\alpha$ equals $E(Y \mid Z = 0)$ and each component of $\beta$ measures the change in $E(Y)$ when the corresponding covariate increases by one unit.

However, in survival analysis the mean is often not a useful descriptive measure because it is not robust. Other types of regression models are therefore preferred for analyzing survival data. The Cox proportional hazards (PH) model is the most popular one. Denote by $T > 0$ the lifetime variable of interest. The PH model can be written in its standard form as

$$\lambda(t \mid Z) = \lambda_0(t)\exp(Z^T\beta), \tag{1.2}$$

where $\lambda_0(t)$ is the hazard for the baseline group with $Z = 0$. A natural link for applying model (1.1) in survival analysis is to write $Y = \log(T)$. Hence we have the accelerated failure time (AFT) model such that

$$Y = \log(T) = Z^T\beta + \varepsilon. \tag{1.3}$$

Notice that models (1.1) and (1.3) differ in whether an intercept term is included on the right-hand side. The difference arises because, in survival analysis, the assumption $E(\varepsilon) = 0$ is dropped and the distribution of $\varepsilon$ is left unspecified. The resulting inference methods are rank-invariant, which makes the intercept parameter non-identifiable.

The above discussion implies that the distributional form of $\varepsilon$ plays a key role in regression analysis based on linear models. In this thesis, we focus on quantile regression models. Define the $\tau$th quantile of $Y$ as

$$Q_Y(\tau) = \inf\{y : \Pr(Y \le y) \ge \tau\},$$

where $0 < \tau < 1$. Figure 1.1 shows the location of $Q_Y(\tau)$ for a continuous random variable $Y$. Quantile regression models state that

$$Q_Y(\tau \mid Z) = \alpha(\tau) + Z^T\beta(\tau). \tag{1.4}$$
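The definition $Q_Y(\tau) = \inf\{y : \Pr(Y \le y) \ge \tau\}$ can be made concrete for a finite sample by replacing the distribution function with the empirical one. The following is a minimal sketch under that assumption; the function name `empirical_quantile` is hypothetical and is not taken from the thesis.

```python
import math

def empirical_quantile(sample, tau):
    """Return inf{y : F_n(y) >= tau}, where F_n is the empirical CDF of `sample`."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    ordered = sorted(float(v) for v in sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")
    # F_n at the k-th order statistic equals k / n (for distinct values),
    # so the smallest k with k / n >= tau is ceil(n * tau).
    k = math.ceil(n * tau)
    return ordered[k - 1]

if __name__ == "__main__":
    data = [3, 1, 4, 2]
    for tau in (0.25, 0.5, 0.75):
        print(tau, empirical_quantile(data, tau))
```

Because the infimum is attained at an observed value, this estimator always returns a data point and does not interpolate between neighbours.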

Note that $\alpha(\tau)$ is the $\tau$th quantile of $Y$ for the baseline group with $Z = 0$. It is important to note that model (1.4) is equivalent to the linear model in (1.1) with the assumption that $\Pr(\varepsilon \le 0) = \tau$. To see this, we have

$$\Pr(Y \le Q_Y(\tau \mid Z)) = \Pr(\alpha(\tau) + Z^T\beta(\tau) + \varepsilon \le Q_Y(\tau \mid Z)) = \Pr(\varepsilon \le 0) = \tau.$$

We draw a plot based on the special two-sample case in which $Q_Y(\tau \mid Z = 0) = \alpha(\tau)$ and $Q_Y(\tau \mid Z = 1) = \alpha(\tau) + \beta(\tau)$. In Figure 1.2, we see that the two samples differ more at larger quantiles. Note that when the two curves cross, the direction of the covariate effect is reversed.
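The two-sample case can be illustrated numerically. The sketch below assumes, purely for illustration, that $Y \mid Z = 0 \sim N(0, 1)$ and $Y \mid Z = 1 \sim N(1, 2^2)$; these distributions are hypothetical and are not the ones behind Figure 1.2. It computes $\alpha(\tau)$ as the baseline quantile and $\beta(\tau)$ as the difference between the two group quantiles.

```python
from statistics import NormalDist

# Hypothetical two-sample setting (an assumption for this sketch):
# Y | Z = 0 ~ N(0, 1) and Y | Z = 1 ~ N(1, 2^2).
baseline = NormalDist(mu=0.0, sigma=1.0)
treated = NormalDist(mu=1.0, sigma=2.0)

def alpha(tau):
    """Quantile of the baseline group: Q_Y(tau | Z = 0)."""
    return baseline.inv_cdf(tau)

def beta(tau):
    """Quantile difference: Q_Y(tau | Z = 1) - Q_Y(tau | Z = 0)."""
    return treated.inv_cdf(tau) - baseline.inv_cdf(tau)

if __name__ == "__main__":
    for tau in (0.10, 0.25, 0.50, 0.75, 0.90):
        print(f"tau={tau:.2f}  alpha={alpha(tau):+.3f}  beta={beta(tau):+.3f}")
```

In this assumed setting $\beta(\tau)$ grows with $\tau$, so the groups differ more at larger quantiles, and $\beta(\tau)$ changes sign at a low quantile, which is the crossing situation in which the covariate effect is reversed.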

1.2 Examples of quantile regression models

The first three examples were originally described in the book by Koenker. In these examples, quantile regression models provide better explanations of real-world phenomena.

Example 1: Salaries and Experiences in Academia

The American Statistical Association conducted a salary survey in 1995 of 370 full professors of statistics from 99 departments in U.S. colleges and universities. The response is the professor’s salary and the covariate is experience (years in rank). A scatterplot without any model assumption (not presented here) shows that (a) salary increases with experience; and (b) the growth rate with experience differs across quantile groups of salary. Specifically, the first quartile (25th percentile) of the salary distribution has a 7.3% growth rate per year of tenure, whereas the median (50th percentile) and the third quartile (75th percentile) grow at 14% and 13%, respectively. It seems that full professors in American academia are treated differently. A middle-paid professor has the potential to attain a leadership position and hence has the highest rate of salary growth. A highly paid, well-established professor sees salary increase at about the same rate. A relatively low-paid professor is probably one who has become less productive after promotion to full professor and hence has the lowest rate of salary increase.

Example 2: Score of Course Evaluation and Size of Class

A university conducted course evaluations for 1482 courses over the period 1980-94. The response is the mean course evaluation questionnaire (CEQ) score and class size is the main covariate. The observations are classified into three categories (high, middle, and low quantiles) based on their CEQ scores. It is found that larger classes tend to receive lower CEQ scores, but the effect of class size is stronger at the lower quantile than at the upper quantile. In other words, highly rated courses come in a range of class sizes, whereas poorly rated courses are strongly associated with large classes.

Example 3: Infant Birth Weight and Mother’s Background

The sample contains the birth weight and other covariate information for 198,377 infants. Comparing boys and girls, the difference is larger than 100 grams at the upper quantile but smaller than 100 grams at the lower quantile. (Note that boys tend to be heavier.) Comparing married and unmarried mothers, the difference becomes more pronounced at the lower quantile. (Note that the child of a married mother tends to be heavier.) Comparing white and black mothers, the difference is about 330 grams at the lower quantile but 175 grams at the upper quantile. (Note that the child of a white mother tends to be heavier.) The analysis implies that the difference between racial (white/black) or social (married/unmarried) groups is more pronounced for babies in the lower quantile range.

Example 4: Mortality Risk for Dialysis Patients with Restless Legs Syndrome

Peng and Huang (2008) analyzed a cohort of 191 renal dialysis patients from 26 dialysis facilities serving the 23-county area surrounding Atlanta, GA. Restless legs syndrome (RLS) is the main covariate and is classified into two levels, mild RLS symptoms and severe RLS symptoms. The interest is in the effect of RLS symptoms on mortality risk. Comparing the mild and severe RLS groups, the difference is about 1.5 years at the first quartile (25th percentile of survival time), while there is no obvious difference at the third quartile (75th percentile of survival time). In other words, there is a strong association between RLS symptoms and mortality risk for short-term survivors. This phenomenon cannot be detected by the AFT model.

Example 5: Mortality Rate for Medflies of Different Genders

Koenker and Geling (2001) studied the relationship between mortality rate and gender in medflies, using data from a study originally conducted by Carey, Liedo, Orozco and Vaupel (1992). According to survival time, medflies are classified into three categories: the lower (before 20 days), middle (20-60 days), and upper (after 60 days) quantiles. It was found that males have a lower mortality rate than females in the lower quantile, a higher mortality rate than females in the middle quantile, and no obvious difference in the upper quantile.

1.3 Comparison with other regression models

Based on the PH model in (1.2), we obtain

$$Q_T(\tau \mid Z) = \Lambda_0^{-1}\left(-\log(1-\tau)\exp\{-Z^T\beta\}\right),$$

which depends on the form of the baseline cumulative hazard $\Lambda_0(\cdot)$. Furthermore, $Q_T(\tau \mid Z)$ is monotone in $\log(1-\tau)$ for all $Z$. In particular, a covariate moves all quantiles in the same direction, so the quantile curves of two groups cannot cross. This property restricts the application of PH models because they cannot handle heterogeneous data. Based on the AFT model

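$$\log T_i = Z_i^T\beta + \varepsilon_i.$$

Usually, it is assumed that the $\varepsilon_i$ $(i = 1, \ldots, n)$ have the same distribution, which is independent of $Z$. The covariates then shift only the location of $\log T$ and change every quantile by the same amount, so this model cannot explain heterogeneous data well.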
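The restrictions of the two models can be checked numerically. The sketch below assumes the PH model has the standard form $\lambda(t \mid Z) = \lambda_0(t)\exp(Z^T\beta)$, a Gompertz-type baseline $\Lambda_0(t) = e^t - 1$, a standard normal error for the AFT model, a single binary covariate, and $\beta = 0.7$. All of these choices, and the function names, are hypothetical and serve only to illustrate the formulas.

```python
import math
from statistics import NormalDist

def ph_quantile(tau, z, beta, inv_cum_hazard):
    """Q_T(tau | z) when the hazard is lambda_0(t) * exp(z * beta)."""
    return inv_cum_hazard(-math.log(1.0 - tau) * math.exp(-z * beta))

def aft_log_quantile(tau, z, beta, error_quantile):
    """Q_{log T}(tau | z) when log T = z * beta + eps, with eps independent of z."""
    return z * beta + error_quantile(tau)

# Assumed baseline: Lambda_0(t) = exp(t) - 1, hence Lambda_0^{-1}(u) = log(1 + u).
inv_gompertz = math.log1p
# Assumed AFT error distribution: standard normal.
std_normal_quantile = NormalDist().inv_cdf

beta = 0.7
taus = (0.10, 0.25, 0.50, 0.75, 0.90)

# Group difference (Z = 1 minus Z = 0) at each quantile level.
ph_diff = [ph_quantile(t, 1, beta, inv_gompertz) - ph_quantile(t, 0, beta, inv_gompertz)
           for t in taus]
aft_diff = [aft_log_quantile(t, 1, beta, std_normal_quantile)
            - aft_log_quantile(t, 0, beta, std_normal_quantile)
            for t in taus]

if __name__ == "__main__":
    for t, d_ph, d_aft in zip(taus, ph_diff, aft_diff):
        print(f"tau={t:.2f}  PH diff={d_ph:+.4f}  AFT diff (log scale)={d_aft:+.4f}")
```

Under these assumptions the PH differences vary in size but keep the same sign at every $\tau$, and the AFT differences on the log scale are the same at every $\tau$. Neither pattern can reproduce a covariate effect that changes direction across quantiles, which the quantile regression model (1.4) allows through $\beta(\tau)$.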
