Cutoff Point Estimation

National Chiao Tung University

Institute of Statistics

Master's Thesis

Student: Zhi-Fei Hou

Advisor: Dr. Lin-An Chen

June 2009

A Thesis

Submitted to the Institute of Statistics, College of Science,

National Chiao Tung University, in Partial Fulfillment of the Requirements

for the Degree of Master in Statistics

June 2009, Hsinchu, Taiwan

Cutoff Point Estimation

Student: Zhi-Fei Hou    Advisor: Dr. Lin-An Chen

Institute of Statistics

National Chiao Tung University

Abstract

The cutoff point plays a vital role in constructing the outlier sum or outlier mean used in gene influential expression analysis. We consider the estimation of the cutoff point in this paper. Asymptotic distributions of two sample cutoff point estimators are developed, one based on empirical quantiles and one based on the symmetric quantiles of Chen and Chiang (1996). Comparisons of asymptotic variance and of power for detecting outliers show that the sample cutoff point based on symmetric quantiles is very competitive with the one based on empirical quantiles.

Key words: cutoff point; empirical quantile; power comparison; symmetric quantile.

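Acknowledgements

During my two years in graduate school, I have been truly grateful for the guidance and care of the professors of the institute. Every teacher was kind and approachable, like a parent, so that I felt no pressure in their company and could get through this period smoothly. I learned many techniques of statistical analysis and a number of statistical software packages, gained a deeper understanding of statistics, and leave the school with a wealth of knowledge.

I am also very grateful to my advisor, Professor Lin-An Chen. At first I was afraid of writing a thesis and unsure how to begin. Under his step-by-step guidance I overcame that fear and worked hard to complete it. I learned a great deal from him, not only about research but also about how to conduct oneself, which he taught us generously. I am truly happy that Professor Chen was my advisor.

I am also fortunate to have met a group of good classmates in graduate school. We often discussed coursework and assignments together, and went out to relax after major exams, which was a welcome break from academic pressure. They comforted me when I was upset, laughed with me when I was happy, and encouraged me when I was discouraged, which made my graduate life rich and colorful.

Finally, I thank my family, whose constant encouragement gave me the drive to keep moving forward and to complete my graduate studies.

Zhi-Fei Hou

Institute of Statistics, National Chiao Tung University

June 2009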


1. Introduction

DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al. (2002), Alizadeh et al. (2000), Ohki et al. (2005), Sorlie et al. (2003)). Recently, microarray analysis has been extended in disease classification to identifying outlier genes that are over-expressed in only a small number of disease samples (see, for example, Tibshirani and Hastie (2007) and Tomlins et al. (2005)). For this goal, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because of the large number of gene expressions and the limited number of subjects available.

Among statistical approaches proposed to identify genes for which only a subset of the samples has high expression, Tibshirani and Hastie (2007) and Wu (2007) suggested the use of an outlier sum, which sums all the gene expression values in the disease group that are greater than the sum of the 75th percentile and the interquartile range of the same gene. They also showed in simulation that the statistical test based on this outlier sum is noticeably more powerful. The distribution theory of an outlier mean, modified from the outlier sum, has been studied by Chen, Chen and Chan (2008).

Basically, an outlier is an observation that lies a substantial distance from the mass of the data in a random sample from a population, and an outlier sum or outlier mean uses the cutoff point $F^{-1}(0.5) + k\,\mathrm{IQR}$ for some $k > 0$ to detect the upper-tail outliers, where $\mathrm{IQR} = F^{-1}(0.75) - F^{-1}(0.25)$ is the interquartile range. The cutoff point must be estimated from the observations when the distribution function $F$ is unknown. Since the cutoff point plays a vital role in the detection of outliers, there are two concerns about estimation of the population cutoff point. First, from the point of view of estimating an unknown parameter, we are concerned with the estimator's efficiency in estimating the population cutoff point. Second, for its role in detecting influential genes, we are concerned with the power of an estimator in detecting outliers when a distributional shift occurs. We take a step toward tackling these two concerns, which helps in advancing the study of the outlier mean for gene expression analysis.

The empirical quantile has long been the popular choice whenever estimation of a population quantile is needed in constructing location and scale estimators. It is desirable to see whether there is a competitive alternative cutoff point estimator obtained through another estimator of the population quantile. This is the first step in addressing the two concerns. In order to improve the efficiency of a location estimator, the trimmed mean, Kim (1992) developed the metrically trimmed mean for a location model which, through comparison of asymptotic variances, was shown to be more efficient than the ordinary trimmed mean. Later, Chen and Chiang (1996) defined the symmetric quantile and used it to propose the symmetric trimmed mean as an extension of Kim's trimmed mean to the linear regression model. They observed that this symmetric trimmed mean with small trimming percentages can have asymptotic variances very close to the Cramér-Rao lower bounds when the regression errors follow heavy-tailed distributions.

To address our concerns, one interesting question is whether the efficiency of the symmetric trimmed mean carries over to other quantile-based proposals. This is the topic that we investigate in this paper.

2. Symmetric and Classical Cutoff Points

In gene expression analysis, there are many genes to be considered, and for each gene there are two groups of subjects, one normal (healthy) group and one cancer (disease) group. For a given gene, we assume that $n$ and $m$ expression variables are available for the two groups, respectively, as follows:

$$\begin{array}{cc} \text{Normal group} & \text{Cancer group} \\ X_1, \ldots, X_n & Y_1, \ldots, Y_m \end{array} \qquad (2.1)$$

The test statistic seen in the literature for detecting cancer genes is constructed from an outlier sum of the form

$$\sum_{i=1}^{m} Y_i\, I(Y_i > \hat{C})$$

when cancer genes are over-expressed and of the form

$$\sum_{i=1}^{m} Y_i\, I(Y_i < \hat{C})$$

when cancer genes are down-expressed, where $\hat{C}$ is an estimator of a cutoff point $C$ that differs between the over-expressed and down-expressed cases. We restrict attention to the cutoff point for over-expressed cancer genes. In Wu (2007) and Chen, Chen and Chan (2008), the cutoff point is $C = F_x^{-1}(0.75) + \mathrm{IQR}$ with $\mathrm{IQR} = F_x^{-1}(0.75) - F_x^{-1}(0.25)$, the interquartile range, constructed from the distribution function $F_x$ of the random variable $X$, while Tibshirani and Hastie (2007) considered a cutoff point constructed from the combined distribution of the random variables $X$ and $Y$.

Given a population cutoff point, the efficiency of an outlier sum or outlier mean depends heavily on the quality of the estimator of the unknown cutoff point. We raise this estimation question and consider two types of cutoff point estimators for comparison of asymptotic variances and powers.

Let us denote the $(1-\alpha)$-central range by $\mathrm{CR} = F_x^{-1}(1-\frac{\alpha}{2}) - F_x^{-1}(\frac{\alpha}{2})$, the range of the central $(1-\alpha)$ quantile interval $(F_x^{-1}(\frac{\alpha}{2}),\, F_x^{-1}(1-\frac{\alpha}{2}))$. Two formulations of population cutoff points are popularly used for identification of outlier observations:

$$C_a(1-\alpha) = F_x^{-1}(1-\tfrac{\alpha}{2}) + \mathrm{CR} = 2F_x^{-1}(1-\tfrac{\alpha}{2}) - F_x^{-1}(\tfrac{\alpha}{2})$$

and

$$C_b(1-\alpha) = F_x^{-1}(1-\tfrac{\alpha}{2}) + 1.5\,\mathrm{CR} = 2.5F_x^{-1}(1-\tfrac{\alpha}{2}) - 1.5F_x^{-1}(\tfrac{\alpha}{2}).$$

When $1-\alpha = 0.5$, the $0.5$-central range is the interquartile range IQR. We call $C_a(1-\alpha)$ the type I cutoff point and $C_b(1-\alpha)$ the type II cutoff point. For estimation of the cutoff points, we assume that a random sample $X_1, \ldots, X_n$ as in (2.1) is drawn from the distribution $F_x$, and we need to specify an estimator of $C_a(1-\alpha)$ or $C_b(1-\alpha)$.
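To make the two formulations concrete, the following minimal Python sketch evaluates the type I and type II population cutoff points for a known distribution. It assumes a standard normal $F_x$ purely for illustration; the function name `population_cutoffs` is hypothetical and not part of the original text.

```python
from statistics import NormalDist

def population_cutoffs(quantile, alpha):
    """Return (C_a, C_b) for a quantile function F^{-1} and a level alpha."""
    lo = quantile(alpha / 2)          # F^{-1}(alpha/2)
    hi = quantile(1 - alpha / 2)      # F^{-1}(1 - alpha/2)
    central_range = hi - lo           # (1 - alpha)-central range CR
    c_a = hi + central_range          # type I:  2*hi - lo
    c_b = hi + 1.5 * central_range    # type II: 2.5*hi - 1.5*lo
    return c_a, c_b

# Illustrative assumption: F_x is the standard normal distribution.
std_normal = NormalDist(0.0, 1.0)
c_a, c_b = population_cutoffs(std_normal.inv_cdf, alpha=0.5)
print(round(c_a, 3), round(c_b, 3))
```

With $1-\alpha = 0.5$ the central range is the IQR, so the type II value is the familiar upper quartile plus 1.5 IQR rule.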

3. Cutoff Point Estimators Based on Empirical Quantiles and Symmetric Quantiles

Classically the population quantile function $F_x^{-1}$ is estimated by the empirical quantile $F_n^{-1}$. We call

$$\hat{C}_a(1-\alpha) = 2F_n^{-1}(1-\tfrac{\alpha}{2}) - F_n^{-1}(\tfrac{\alpha}{2})$$

the empirical quantile based type I cutoff point estimator and

$$\hat{C}_b(1-\alpha) = 2.5F_n^{-1}(1-\tfrac{\alpha}{2}) - 1.5F_n^{-1}(\tfrac{\alpha}{2})$$

the empirical quantile based type II cutoff point estimator.

Besides the two empirical quantile based cutoff point estimators, we also propose alternative ones constructed from the symmetric quantile of Chen and Chiang (1996). The so-called symmetric quantile is formulated from a folded distribution function. Let $\mu_x$ be a constant, known or unknown. The folded cumulative distribution function about $\mu_x$ for the random variable $X$ is defined as

$$F_s(a) = P(|X - \mu_x| \le a), \quad a \ge 0.$$

Then the $(1-\alpha)$ symmetric quantile pair defined by Chen and Chiang (1996) is

$$(F_s^-(1-\alpha),\, F_s^+(1-\alpha)) = (\mu_x - F_s^{-1}(1-\alpha),\, \mu_x + F_s^{-1}(1-\alpha)),$$

where $F_s^{-1}(1-\alpha) = \inf\{a : F_s(a) \ge 1-\alpha\}$. If $F_x$ is continuous, the $(1-\alpha)$ symmetric quantile pair satisfies $1-\alpha = P(F_s^-(1-\alpha) \le X \le F_s^+(1-\alpha))$.

If we further assume that $F_x$ is symmetric about $\mu_x$, it can be seen that

$$F_s^-(1-\alpha) = F_x^{-1}(\tfrac{\alpha}{2}) \quad \text{and} \quad F_s^+(1-\alpha) = F_x^{-1}(1-\tfrac{\alpha}{2}), \qquad (3.1)$$

that is, the classical quantile pair and the symmetric one are identical.

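Two symmetric type cutoff points are analogously defined as

$$C_{sa}(1-\alpha) = F_s^+(1-\alpha) + (F_s^+(1-\alpha) - F_s^-(1-\alpha)) = 2F_s^+(1-\alpha) - F_s^-(1-\alpha) = \mu_x + 3F_s^{-1}(1-\alpha)$$

and

$$C_{sb}(1-\alpha) = F_s^+(1-\alpha) + 1.5(F_s^+(1-\alpha) - F_s^-(1-\alpha)) = 2.5F_s^+(1-\alpha) - 1.5F_s^-(1-\alpha) = \mu_x + 4F_s^{-1}(1-\alpha).$$

Then, if $F_x$ is continuous and symmetric, we have

$$C_{sa}(1-\alpha) = C_a(1-\alpha) \quad \text{and} \quad C_{sb}(1-\alpha) = C_b(1-\alpha).$$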

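Let $\hat{\mu}_x$ be an estimate of $\mu_x$. We define the sample type $(1-\alpha)$ symmetric quantile pair as

$$(F_{sn}^-(1-\alpha),\, F_{sn}^+(1-\alpha)) = (\hat{\mu}_x - F_{sn}^{-1}(1-\alpha),\, \hat{\mu}_x + F_{sn}^{-1}(1-\alpha)), \qquad (3.2)$$

where $F_{sn}(a) = \frac{1}{n}\sum_{i=1}^{n} I(|X_i - \hat{\mu}_x| \le a)$ is the sample type folded cumulative distribution function and $F_{sn}^{-1}(1-\alpha) = \inf\{a : F_{sn}(a) \ge 1-\alpha\}$. The sample type symmetric cutoff points are as follows:

$$\hat{C}_{sa}(1-\alpha) = 2F_{sn}^+(1-\alpha) - F_{sn}^-(1-\alpha) = \hat{\mu}_x + 3F_{sn}^{-1}(1-\alpha),$$

$$\hat{C}_{sb}(1-\alpha) = 2.5F_{sn}^+(1-\alpha) - 1.5F_{sn}^-(1-\alpha) = \hat{\mu}_x + 4F_{sn}^{-1}(1-\alpha).$$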

The equality (3.1) does not hold when the underlying distribution $F_x$ is not symmetric, so that there is then no fair criterion for comparing the two types of sample cutoff points. Hence, we take $F_x$ to be symmetric and compare the precision of the two types of estimators through their asymptotic variances.

Before doing so, we give a simple example to describe the construction of these two cutoff point estimates and to show why the symmetric type cutoff point estimate is worth introducing for outlier detection.

Example 1.

Suppose that we have a set of 10 observations that are ordered as

$$-5,\ -3,\ -2,\ -1,\ -0.5,\ 0.5,\ 1,\ 3,\ 50,\ 100. \qquad (3.3)$$

We want to construct the $\alpha = 0.2$ empirical and symmetric type I cutoff point estimates for identification of outliers. With $F_n^{-1}(0.1) = -5$ and $F_n^{-1}(0.9) = 50$, the $\alpha = 0.2$ empirical type I cutoff point estimate is

$$\hat{C}_a(0.8) = 2F_n^{-1}(0.9) - F_n^{-1}(0.1) = 2 \cdot 50 - (-5) = 105.$$

For the construction of the symmetric cutoff point estimate, we choose the sample median as the estimate of $\mu_x$. That is,

$$\hat{\mu}_x = F_n^{-1}(0.5) = \inf\Big\{a : \frac{1}{10}\sum_{i=1}^{10} I(x_i \le a) \ge 0.5\Big\} = -0.5.$$

Denote the residuals by $e_i = x_i - \hat{\mu}_x$, $i = 1, \ldots, 10$. The residuals are

$$-4.5,\ -2.5,\ -1.5,\ -0.5,\ 0,\ 1,\ 1.5,\ 3.5,\ 50.5,\ 100.5.$$

The sample type folded cumulative distribution function is

$$F_{sn}(a) = \frac{1}{10}\sum_{i=1}^{10} I(|e_i| \le a).$$

For example, $F_{sn}(0) = \frac{1}{10}$ and $F_{sn}(1) = \frac{1}{10}[I(|-0.5| \le 1) + I(|0| \le 1) + I(|1| \le 1)] = \frac{3}{10}$. Then we have

$$F_{sn}^{-1}(0.8) = \inf\Big\{a : \frac{1}{10}\sum_{i=1}^{10} I(|e_i| \le a) \ge 0.8\Big\} = 4.5.$$

The symmetric type I cutoff point estimate is therefore

$$\hat{C}_{sa}(0.8) = 2(\hat{\mu}_x + F_{sn}^{-1}(0.8)) - (\hat{\mu}_x - F_{sn}^{-1}(0.8)) = 2(-0.5 + 4.5) - (-0.5 - 4.5) = 13.$$

Observations beyond the cutoff point estimate are classified as outliers. The empirical type I cutoff point estimate is $\hat{C}_a(0.8) = 105$, so no observation is classified as an outlier. On the other hand, the symmetric cutoff point estimate is $\hat{C}_{sa}(0.8) = 13$, which classifies the observations 50 and 100 as outliers. For the data in (3.3), the cutoff point estimate based on symmetric quantiles is quite satisfactory.
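The computations of Example 1 can be reproduced with a short Python sketch. It assumes the empirical quantile is the generalized inverse $\inf\{a : F_n(a) \ge p\}$ and uses the sample median as the centre, as in the example. The function names and the parameter `k` ($k = 1$ for type I, $k = 1.5$ for type II) are illustrative and not taken from the original.

```python
import math

def empirical_quantile(sample, p):
    """inf{a : F_n(a) >= p}, i.e. the ceil(n*p)-th order statistic."""
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * p - 1e-12))  # small guard against float noise
    return ordered[k - 1]

def empirical_cutoff(sample, alpha, k=1.0):
    """Upper empirical quantile plus k times the (1 - alpha)-central range."""
    lo = empirical_quantile(sample, alpha / 2)
    hi = empirical_quantile(sample, 1 - alpha / 2)
    return hi + k * (hi - lo)

def symmetric_cutoff(sample, alpha, k=1.0):
    """Same construction, built from the sample symmetric quantile pair."""
    center = empirical_quantile(sample, 0.5)                # sample median
    abs_resid = [abs(x - center) for x in sample]           # folded residuals |e_i|
    half_width = empirical_quantile(abs_resid, 1 - alpha)   # F_sn^{-1}(1 - alpha)
    lower, upper = center - half_width, center + half_width
    return upper + k * (upper - lower)

data = [-5, -3, -2, -1, -0.5, 0.5, 1, 3, 50, 100]   # observations (3.3)
c_emp = empirical_cutoff(data, alpha=0.2)
c_sym = symmetric_cutoff(data, alpha=0.2)
outliers_emp = [x for x in data if x > c_emp]
outliers_sym = [x for x in data if x > c_sym]
print(c_emp, outliers_emp)
print(c_sym, outliers_sym)
```

Because the half-width is computed from absolute residuals, the two extreme observations 50 and 100 occupy only the top two folded ranks and do not enter the 80% symmetric quantile, whereas the empirical 0.9 quantile is itself the outlying value 50.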


4. Efficiency Comparisons for Cutoff Point Estimators

Two properties of the two cutoff point estimators are of interest. First, we want the asymptotic distributions of these two nonparametric cutoff point estimators, together with a comparison of their asymptotic variances in estimating the same population cutoff point. Second, we want to study the powers of the two cutoff points in their role of identifying outliers. We study the first question in this section.

The following theorem gives the asymptotic distributions of the two types of empirical cutoff point estimators.

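Theorem 4.1.

(a) $n^{1/2}(\hat{C}_a(1-\alpha) - C_a(1-\alpha))$ is asymptotically normal $N(0, \sigma^2_{empa})$, where

$$\sigma^2_{empa} = \frac{\alpha}{2}\left[\left(\frac{\alpha}{f_x(F_x^{-1}(1-\frac{\alpha}{2}))} + \frac{\alpha-2}{2f_x(F_x^{-1}(\frac{\alpha}{2}))}\right)^2 + \left(\frac{2-\alpha}{f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{\alpha}{2f_x(F_x^{-1}(\frac{\alpha}{2}))}\right)^2\right] + (1-\alpha)\left(\frac{\alpha}{f_x(F_x^{-1}(1-\frac{\alpha}{2}))} + \frac{\alpha}{2f_x(F_x^{-1}(\frac{\alpha}{2}))}\right)^2.$$

(b) $n^{1/2}(\hat{C}_b(1-\alpha) - C_b(1-\alpha))$ is asymptotically normal $N(0, \sigma^2_{empb})$, where

$$\sigma^2_{empb} = \frac{\alpha}{2}\left[\left(\frac{5\alpha}{4f_x(F_x^{-1}(1-\frac{\alpha}{2}))} + \frac{3(\alpha-2)}{4f_x(F_x^{-1}(\frac{\alpha}{2}))}\right)^2 + \left(\frac{5(2-\alpha)}{4f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{3\alpha}{4f_x(F_x^{-1}(\frac{\alpha}{2}))}\right)^2\right] + (1-\alpha)\left(\frac{5\alpha}{4f_x(F_x^{-1}(1-\frac{\alpha}{2}))} + \frac{3\alpha}{4f_x(F_x^{-1}(\frac{\alpha}{2}))}\right)^2.$$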

To study the asymptotic distribution of the symmetric type cutoff points, we restrict attention to the following location model,

$$X_i = \mu_x + \epsilon_i, \quad i = 1, \ldots, n, \qquad (4.1)$$

where the $\epsilon_i$ are independent and identically distributed (iid) random variables having distribution function $G_x$ with zero mean, variance $\sigma_x^2$ and probability density function $g_x$. For convenience of comparison, we also assume that $G_x$ is continuous and symmetric about zero.

We take $\mu_x$ to be the median parameter and let $\hat{\mu}_x$ be the sample median,

$$\hat{\mu}_x = \arg\inf_{\mu \in R} \sum_{i=1}^{n} |X_i - \mu|.$$

The asymptotic distributions of the two symmetric cutoff points are stated in the following theorem.

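Theorem 4.2.

Assuming that the distribution function $G_x$ is symmetric about zero, we have the following asymptotic properties.

(a) $n^{1/2}(\hat{C}_{sa}(1-\alpha) - C_{sa}(1-\alpha))$ is asymptotically normal $N(0, \sigma^2_{syma})$, where

$$\sigma^2_{syma} = \frac{\alpha}{2}\left[\left(-\frac{1}{2g_x(0)} + \frac{3(1-\alpha)}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2 + \left(\frac{1}{2g_x(0)} + \frac{3(1-\alpha)}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2\right] + \frac{1}{2}(1-\alpha)\left[\left(\frac{1}{2g_x(0)} + \frac{3\alpha}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2 + \left(\frac{1}{2g_x(0)} - \frac{3\alpha}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2\right].$$

(b) $n^{1/2}(\hat{C}_{sb}(1-\alpha) - C_{sb}(1-\alpha))$ is asymptotically normal $N(0, \sigma^2_{symb})$, where

$$\sigma^2_{symb} = \frac{\alpha}{2}\left[\left(-\frac{1}{2g_x(0)} + \frac{2(1-\alpha)}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2 + \left(\frac{1}{2g_x(0)} + \frac{2(1-\alpha)}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2\right] + \frac{1}{2}(1-\alpha)\left[\left(\frac{1}{2g_x(0)} + \frac{2\alpha}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2 + \left(\frac{1}{2g_x(0)} - \frac{2\alpha}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\right)^2\right].$$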

With the asymptotic distributions of the two types of cutoff point estimators developed, we may consider several distributions for the error variable and compute the asymptotic variances to compare the efficiencies of the estimators in estimating the unknown cutoff points. The distributions considered here include the standard normal distribution $N(0,1)$, the t-distribution $t(r)$ where $r$ is the degrees of freedom, the Cauchy distribution (Cauchy($s$), $s > 0$) with pdf

$$g_x(\epsilon) = \frac{1}{\pi}\,\frac{s}{\epsilon^2 + s^2}, \quad \epsilon \in R,$$

and the Laplace distribution (Lap($b$)) with pdf

$$g_x(\epsilon) = \frac{1}{2b}\, e^{-|\epsilon|/b}, \quad \epsilon \in R.$$

We display the computed asymptotic variances in Table 1.

Table 1. Asymptotic variances for two quantile-based cutoff point estimators

Distribution   α      σ²_syma      σ²_empa      σ²_symb      σ²_empb
N(0,1)         0.05   32.85        34.94        57.19        59.28
               0.15   15.88        16.18        27.02        27.32
               0.25   11.52        11.43        19.26        19.17
               0.35   9.274        9.020        15.26        15.01
               0.45   7.761        7.441        12.57        12.25
t(1)           0.05   27838        31091        49488        52741
               0.15   955.8        1077         1697         1819
               0.25   196.6        222.9        347.6        373.9
               0.35   70.25        79.37        122.9        132.0
               0.45   33.36        37.13        57.39        61.16
Cauchy(3)      0.05   250543.8     279822.4     445394.0     474672.5
               0.15   8602.33      9701.71      15275.7      16375.1
               0.25   1769.50      2006.15      3128.51      3365.17
               0.35   632.25       714.33       1106.74      1188.81
               0.45   300.25       334.22       516.51       550.48
Lap(1)         0.05   172.0        191.0        305.0        324.0
               0.15   52.00        57.66        91.66        97.33
               0.25   28.00        31.00        49.00        52.00
               0.35   17.71        19.57        30.71        32.57
               0.45   12.00        13.22        20.55        21.77

Note: the symmetric type cutoff points have smaller asymptotic variance than the empirical quantile based cutoff points in every case except N(0,1) with α = 0.25, 0.35 and 0.45.
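Entries of Table 1 can be checked numerically. The sketch below is a re-derivation under stated assumptions rather than the author's code: it writes the empirical estimator as $w_{hi}F_n^{-1}(1-\frac{\alpha}{2}) - w_{lo}F_n^{-1}(\frac{\alpha}{2})$ with weights $(2, 1)$ for type I and $(2.5, 1.5)$ for type II, and the symmetric estimator as $\hat{\mu}_x + cF_{sn}^{-1}(1-\alpha)$ with $c = 3$ or $4$, which reproduces the variance expressions of Theorems 4.1 and 4.2. All function names are illustrative.

```python
import math
from statistics import NormalDist

def var_empirical(pdf, quantile, alpha, w_hi, w_lo):
    """Asymptotic variance of w_hi*F_n^{-1}(1-alpha/2) - w_lo*F_n^{-1}(alpha/2)."""
    a = w_hi / pdf(quantile(1 - alpha / 2))
    b = w_lo / pdf(quantile(alpha / 2))
    low = -a * alpha / 2 + b * (1 - alpha / 2)   # weight when X_i is below the lower quantile
    mid = -a * alpha / 2 - b * alpha / 2         # weight when X_i is in the central interval
    high = a * (1 - alpha / 2) - b * alpha / 2   # weight when X_i is above the upper quantile
    return (alpha / 2) * (low ** 2 + high ** 2) + (1 - alpha) * mid ** 2

def var_symmetric(pdf, quantile, alpha, c):
    """Asymptotic variance of median + c * F_sn^{-1}(1-alpha), errors symmetric about 0."""
    u = 1 / (2 * pdf(0.0))
    v = c / (2 * pdf(quantile(1 - alpha / 2)))
    tails = (-u + (1 - alpha) * v) ** 2 + (u + (1 - alpha) * v) ** 2
    centre = (u + alpha * v) ** 2 + (u - alpha * v) ** 2
    return (alpha / 2) * tails + ((1 - alpha) / 2) * centre

def laplace_pdf(x, b=1.0):
    return math.exp(-abs(x) / b) / (2 * b)

def laplace_quantile(p, b=1.0):
    return b * math.log(2 * p) if p < 0.5 else -b * math.log(2 * (1 - p))

normal = NormalDist()
rows = {
    "Lap(1)": (laplace_pdf, laplace_quantile),
    "N(0,1)": (normal.pdf, normal.inv_cdf),
}
table = {}
for name, (pdf, quantile) in rows.items():
    table[name] = (
        var_symmetric(pdf, quantile, 0.05, c=3),                # sigma^2_syma
        var_empirical(pdf, quantile, 0.05, w_hi=2, w_lo=1),     # sigma^2_empa
        var_symmetric(pdf, quantile, 0.05, c=4),                # sigma^2_symb
        var_empirical(pdf, quantile, 0.05, w_hi=2.5, w_lo=1.5), # sigma^2_empb
    )
    print(name, [round(v, 2) for v in table[name]])
```

The values obtained for $\alpha = 0.05$ can be compared with the first N(0,1) and Lap(1) rows of Table 1. Passing a different density and quantile function extends the check to other symmetric error distributions.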

We may draw several conclusions from the results in Table 1:

1. In a few cases (normal distribution with $\alpha = 0.25$, $0.35$ and $0.45$), it is relatively more efficient to estimate the cutoff point by the empirical quantiles. This indicates that, when we want to estimate the population cutoff point and know that the underlying distribution is normal, the version estimated by empirical quantiles is appropriate. We note, however, that the differences between the two versions are small.

2. For the distributions other than the normal, the estimate based on symmetric quantiles is uniformly more efficient than the one based on empirical quantiles. In an overall comparison, we may say that the cutoff point estimate based on symmetric quantiles is a robust one.

3. In this setting of nonparametric estimation, we may expect that any cutoff point estimator based on symmetric quantiles is a robust one.

In gene influential analysis, a common situation is that only a few observations lie beyond the main trend of the model. Hence, it is desirable to study the large sample properties of cutoff estimators for such distributions. We consider the following contaminated normal distribution

$$F_x = (1-\delta)H + \delta N(h, 1), \qquad (4.2)$$

which ensures that a large proportion of the observations are drawn from the same distribution $H$ and a small proportion $\delta$ of the observations are outliers.

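We compute the efficiencies of the two cutoff point estimates, defined as follows:

$$\mathrm{eff}_{sym} = \frac{\min\{\sigma^2_{syma},\, \sigma^2_{empa}\}}{\sigma^2_{syma}}, \qquad \mathrm{eff}_{emp} = \frac{\min\{\sigma^2_{syma},\, \sigma^2_{empa}\}}{\sigma^2_{empa}}.$$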
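As a small illustration of these two ratios, the Python sketch below evaluates them for a given pair of variances. The inputs are taken from Table 1 only to have concrete numbers; the function name `efficiencies` is hypothetical.

```python
def efficiencies(var_sym, var_emp):
    """Return (eff_sym, eff_emp): each variance relative to the smaller of the two."""
    best = min(var_sym, var_emp)
    return best / var_sym, best / var_emp

# Inputs from Table 1 (type I cutoff point).
eff_normal = efficiencies(11.52, 11.43)     # N(0,1), alpha = 0.25
eff_laplace = efficiencies(172.0, 191.0)    # Lap(1), alpha = 0.05
print([round(e, 3) for e in eff_normal])
print([round(e, 3) for e in eff_laplace])
```

By construction the better estimator always has efficiency one, and the other value states how far the competing estimator falls short of it.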

Simulation results are displayed in Table 2.

Table 2. Efficiencies (eff_sym / eff_emp) of type I cutoff point estimates by symmetric quantile and empirical quantile (h = 1)

H        δ      α = 0.05     α = 0.15     α = 0.25     α = 0.35     α = 0.45
N(0,1)   0.05   1 / 0.955    1 / 0.989    0.987 / 1    0.971 / 1    0.959 / 1
         0.1    1 / 0.963    1 / 0.994    0.984 / 1    0.970 / 1    0.959 / 1
t(1)     0.05   1 / 0.895    1 / 0.883    1 / 0.857    1 / 0.880    1 / 0.906
         0.1    1 / 0.895    1 / 0.874    1 / 0.835    1 / 0.878    1 / 0.915
Lap(1)   0.05   1 / 0.879    1 / 0.889    1 / 0.906    1 / 0.917    1 / 0.922
         0.1    1 / 0.859    1 / 0.878    1 / 0.909    1 / 0.928    1 / 0.935

Several comments may be drawn from the results in Table 2:

1. Although the symmetric type cutoff point estimate has efficiencies smaller than one when $H$ is normal and $\alpha \ge 0.25$, these efficiencies are all greater than 0.95.

2. The empirical type cutoff point estimate has efficiencies smaller than one for all distributions other than the normal, and they can be as small as 0.835. In the comparison for this contaminated distribution, the symmetric type cutoff point estimate is again a robust proposal.

3. Our study of cutoff point estimation is primarily motivated by gene influential analysis, which often faces a few extreme outliers in the treatment group, that is, a type of contaminated distribution. This comparison shows that the symmetric type cutoff point estimate is an appropriate choice for outlier detection in this analysis.

We also consider the two estimators of the type II cutoff point and display their efficiencies in Table 3.

Table 3. Efficiencies (eff_sym / eff_emp) of type II cutoff point estimates by symmetric quantile and empirical quantile (h = 1)

H        δ      α = 0.05     α = 0.15     α = 0.25     α = 0.35     α = 0.45
N(0,1)   0.05   1 / 0.988    0.996 / 1    0.985 / 1    0.977 / 1    0.971 / 1
         0.1    0.998 / 1    0.986 / 1    0.978 / 1    0.972 / 1    0.968 / 1
t(1)     0.05   1 / 0.938    1 / 0.927    1 / 0.890    1 / 0.917    1 / 0.946
         0.1    1 / 0.938    1 / 0.914    1 / 0.854    1 / 0.906    1 / 0.952
Lap(1)   0.05   1 / 0.911    1 / 0.920    1 / 0.942    1 / 0.955    1 / 0.962
         0.1    1 / 0.881    1 / 0.899    1 / 0.939    1 / 0.964    1 / 0.977

From the results shown in Table 3, the efficiencies of the two quantile methods for the type II cutoff point are quite similar to those shown in Table 2 for the type I cutoff point.

5. Power Comparisons for Cutoff Point Estimators

With the asymptotic distributions of the two sample cutoff point estimates available, it is desirable to study the ability of a cutoff point to detect an observation drawn from an alternative distribution. Suppose that we have the following sample location model

$$X_i = \mu_x + \epsilon_i, \quad i = 1, \ldots, n,$$

and a random variable $Y$ drawn from an alternative distribution. Let $C$ be a cutoff point and $\hat{C}$ be an estimator of $C$ constructed from the sample of the variable $X$. The power of detecting the outlier $Y$ is

$$\beta = P\{Y > \hat{C}\}. \qquad (5.1)$$

Suppose that $\sqrt{n}(\hat{C} - C)$ converges in distribution to a normal distribution $N(0, \sigma_c^2)$. An approximate power is

$$\beta = P\{Y > \hat{C}\} = P\{\sqrt{n}\,Y - \sqrt{n}(\hat{C} - C) > \sqrt{n}\,C\} \approx P\{\sqrt{n}\,Y - N_c > \sqrt{n}\,C\},$$

where $N_c$ is a random variable with distribution $N(0, \sigma_c^2)$. If the distribution of the combination $\sqrt{n}\,Y - N_c$ is available, the approximate power can then be computed. It is then interesting to compare the powers of cutoff points estimated from the empirical quantile and from the symmetric quantile. We compute the powers of the symmetric cutoff point and the empirical cutoff point, denoted respectively by $\beta_{sym}$ and $\beta_{emp}$. We note that $\beta_{sym}$ and $\beta_{emp}$ are very close in value in all cases. However, it is still worthwhile to compare their sizes.
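The approximation has a closed form in one simple case. The Python sketch below assumes, for illustration only, that $Y \sim N(\theta, 1)$ is independent of $N_c$, so that $Y - N_c/\sqrt{n} \sim N(\theta, 1 + \sigma_c^2/n)$. The shift symbol `theta`, the function name `approx_power`, and the choice $\alpha = 0.25$ with the Table 1 variances are assumptions made here, not settings reported in the original.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(theta, cutoff, sigma2_c, n):
    """P{Y - N_c/sqrt(n) > C} for Y ~ N(theta, 1) independent of N_c ~ N(0, sigma2_c)."""
    spread = (1.0 + sigma2_c / n) ** 0.5   # standard deviation of Y - N_c / sqrt(n)
    return 1.0 - STD_NORMAL.cdf((cutoff - theta) / spread)

alpha, n = 0.25, 30
# Type I population cutoff for N(0,1): 2*q - (-q) = 3*q with q = F^{-1}(1 - alpha/2).
cutoff = 3.0 * STD_NORMAL.inv_cdf(1 - alpha / 2)
var_sym, var_emp = 11.52, 11.43            # Table 1, N(0,1), alpha = 0.25

for theta in (1.0, 5.0):
    p_sym = approx_power(theta, cutoff, var_sym, n)
    p_emp = approx_power(theta, cutoff, var_emp, n)
    print(theta, round(p_sym, 4), round(p_emp, 4))
```

Under this assumption the ordering of the two powers depends only on which estimator has the larger asymptotic variance and on whether the shift lies below or above the cutoff: a larger variance gives more power for shifts below the cutoff and less power for shifts above it.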

In the following two tables, we display the comparisons of $\beta_{sym}$ and $\beta_{emp}$ for the type I and type II symmetric and empirical cutoff point estimators when the sample is drawn from several distributions of interest.

Table 4. Power comparison of type I symmetric and empirical cutoff point estimators (n = 30)

Setting                          β_sym > β_emp            β_sym < β_emp

X ~ t(r1), Y ~ t(r2) + θ
r1 = r2 = 1      α = 0.2         θ = 3, ..., 10           θ = 0.5, 1, 2
                 α = 0.5         θ = 3, ..., 10           θ = 0.5, 1, 2
r1 = 3, r2 = 1   α = 0.2         θ = 5, ..., 10           θ = 0.5, 1, 2, 3, 4
                 α = 0.5         θ = 3, 4, ..., 10        θ = 0.5, 1, 2

X ~ N(0,1), Y ~ N(0,1) + θ
                 α = 0.1         θ = 5, 6, 7, 8, 9, 10    θ = 0.5, 1, 2, 3, 4
                 α = 0.2         θ = 4, 5, ..., 10        θ = 0.5, 1, 2, 3
                 α = 0.5         θ = 0.5, 1, 2            θ = 3, 4, 5, 6, 7

X ~ N(0,1), Y ~ (1-δ)N(0,1) + δN(θ,1)
δ = 0.05         α = 0.5         θ = 0.5, 1, ..., 10      none
                 α = 0.2         none                     θ = 0.5, 1, ..., 10
δ = 0.1          α = 0.5         θ = 0.5, 1, ..., 10      none
                 α = 0.2         none                     θ = 0.5, 1, ..., 10
δ = 0.2          α = 0.5         θ = 0.5, 1, ..., 10      none
                 α = 0.2         none                     θ = 0.5, 1, ..., 10

Table 5. Power comparison of type II symmetric and empirical cutoff point estimators (n = 30)

Setting                          β_sym > β_emp            β_sym < β_emp

X ~ t(r1), Y ~ t(r2) + θ
r1 = 3, r2 = 1   α = 0.2         θ = 7, ..., 10           θ = 0.5, 1, 2, 3, ..., 6
                 α = 0.5         θ = 4, ..., 10           θ = 0.5, 1, 2, 3
r1 = 3, r2 = 3   α = 0.2         θ = 7, ..., 10           θ = 0.5, 1, 2, ..., 6
                 α = 0.5         θ = 4, ..., 10           θ = 0.5, 1, 2, 3

X ~ N(0,1), Y ~ (1-δ)N(0,1) + δN(θ,1)
δ = 0.05         α = 0.5         θ = 0.5, 1, ..., 10      none
                 α = 0.2         none                     θ = 0.5, 1, ..., 10
δ = 0.1          α = 0.5         θ = 0.5, 1, ..., 10      none
                 α = 0.2         none                     θ = 0.5, 1, ..., 10
δ = 0.2          α = 0.5         θ = 0.5, 1, ..., 10      none
                 α = 0.2         none                     θ = 0.5, 1, ..., 10

From the results shown in Tables 4 and 5, the two cutoff point estimators are quite competitive. Some cases favor the symmetric type estimator and some favor the empirical one. However, from the robustness point of view, we prefer the symmetric type cutoff point estimator, since its estimation is more reliable, with the smaller asymptotic variances shown in Section 4.

It is desirable to study the power performance of these two cutoff point estimators when the outliers shift in both location and scale. We further consider the following contaminated normal distribution

$$(1-\delta)N(0,1) + \delta N(\theta, \sigma^2).$$

Table 6. Power comparison of types I and II symmetric and empirical cutoff point estimators (n = 30)

Setting                          β_sym > β_emp            β_sym < β_emp

X ~ N(0,1), Y ~ (1-δ)N(0,1) + δN(θ,σ²)
δ = 0.1, σ = 2    α = 0.5        θ = 0.5, ..., 10         none
                  α = 0.2        none                     θ = 0.5, ..., 10
δ = 0.1, σ = 5    α = 0.5        θ = 0.5, ..., 10         none
                  α = 0.2        none                     θ = 0.5, ..., 10
δ = 0.1, σ = 10   α = 0.5        θ = 0.5, ..., 10         none
                  α = 0.2        none                     θ = 0.5, ..., 10

6. Appendix

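Proof of Theorem 4.1.

From Ruppert and Carroll (1980), we have a representation of the empirical quantile $F_n^{-1}(\alpha)$ as

$$n^{1/2}(F_n^{-1}(\alpha) - F_x^{-1}(\alpha)) = f_x^{-1}(F_x^{-1}(\alpha))\, n^{-1/2}\sum_{i=1}^{n}(\alpha - I(X_i \le F_x^{-1}(\alpha))) + o_p(1). \qquad (6.1)$$

Since the empirical quantile based cutoff point estimates $\hat{C}_a(1-\alpha)$ and $\hat{C}_b(1-\alpha)$ are both linear functions of the empirical quantiles $F_n^{-1}(1-\frac{\alpha}{2})$ and $F_n^{-1}(\frac{\alpha}{2})$, careful arrangement of the corresponding representations of these two empirical quantiles leads to the following representations.

(a) A large sample representation for the cutoff point estimator $\hat{C}_a(1-\alpha)$ is as follows:

$$n^{1/2}(\hat{C}_a(1-\alpha) - C_a(1-\alpha)) = n^{-1/2}\sum_{i=1}^{n}\Big\{\Big[-\frac{\alpha}{f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{\alpha-2}{2f_x(F_x^{-1}(\frac{\alpha}{2}))}\Big] I(X_i \le F_x^{-1}(\tfrac{\alpha}{2})) + \Big[-\frac{\alpha}{f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{\alpha}{2f_x(F_x^{-1}(\frac{\alpha}{2}))}\Big] I(F_x^{-1}(\tfrac{\alpha}{2}) \le X_i \le F_x^{-1}(1-\tfrac{\alpha}{2})) + \Big[\frac{2-\alpha}{f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{\alpha}{2f_x(F_x^{-1}(\frac{\alpha}{2}))}\Big] I(X_i \ge F_x^{-1}(1-\tfrac{\alpha}{2}))\Big\} + o_p(1).$$

(b) A large sample representation for the cutoff point estimator $\hat{C}_b(1-\alpha)$ is as follows:

$$n^{1/2}(\hat{C}_b(1-\alpha) - C_b(1-\alpha)) = n^{-1/2}\sum_{i=1}^{n}\Big\{\Big[-\frac{5\alpha}{4f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{3(\alpha-2)}{4f_x(F_x^{-1}(\frac{\alpha}{2}))}\Big] I(X_i \le F_x^{-1}(\tfrac{\alpha}{2})) + \Big[-\frac{5\alpha}{4f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{3\alpha}{4f_x(F_x^{-1}(\frac{\alpha}{2}))}\Big] I(F_x^{-1}(\tfrac{\alpha}{2}) \le X_i \le F_x^{-1}(1-\tfrac{\alpha}{2})) + \Big[\frac{5(2-\alpha)}{4f_x(F_x^{-1}(1-\frac{\alpha}{2}))} - \frac{3\alpha}{4f_x(F_x^{-1}(\frac{\alpha}{2}))}\Big] I(X_i \ge F_x^{-1}(1-\tfrac{\alpha}{2}))\Big\} + o_p(1).$$

The theorem follows from the central limit theorem.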
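The last step can be spelled out for part (a). The following LaTeX sketch uses the shorthand $q_{lo}$, $q_{hi}$ and $c_k$, $p_k$, which are introduced here for brevity and do not appear in the original. It records that each summand takes one of three values, that the summand has mean zero, and that its variance equals $\sigma^2_{empa}$; part (b) is analogous.

```latex
% Shorthand (not in the original): q_lo = F_x^{-1}(\alpha/2), q_hi = F_x^{-1}(1-\alpha/2)
\begin{align*}
c_1 &= -\frac{\alpha}{f_x(q_{hi})} - \frac{\alpha-2}{2 f_x(q_{lo})}, &
p_1 &= P(X_i \le q_{lo}) = \tfrac{\alpha}{2},\\
c_2 &= -\frac{\alpha}{f_x(q_{hi})} - \frac{\alpha}{2 f_x(q_{lo})}, &
p_2 &= P(q_{lo} < X_i \le q_{hi}) = 1-\alpha,\\
c_3 &= \frac{2-\alpha}{f_x(q_{hi})} - \frac{\alpha}{2 f_x(q_{lo})}, &
p_3 &= P(X_i > q_{hi}) = \tfrac{\alpha}{2},\\[4pt]
\sum_k p_k c_k &= \frac{\alpha}{2}\,(c_1 + c_3) + (1-\alpha)\, c_2 = 0,\\
\sigma^2_{empa} &= \sum_k p_k c_k^2 = \frac{\alpha}{2}\,(c_1^2 + c_3^2) + (1-\alpha)\, c_2^2 .
\end{align*}
```

The mean is zero because $c_1 + c_3 = (1-\alpha)\big(\frac{2}{f_x(q_{hi})} + \frac{1}{f_x(q_{lo})}\big)$ while $c_2 = -\frac{\alpha}{2}\big(\frac{2}{f_x(q_{hi})} + \frac{1}{f_x(q_{lo})}\big)$.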

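Proof of Theorem 4.2.

Again, from Ruppert and Carroll (1980), we have a representation for the sample median as

$$n^{1/2}(\hat{\mu}_x - \mu_x) = n^{-1/2}\,\frac{1}{g_x(0)}\sum_{i=1}^{n}(0.5 - I(\epsilon_i \le 0)) + o_p(1). \qquad (6.2)$$

On the other hand, a Bahadur representation for $F_{sn}^{-1}(1-\alpha)$ developed by Chen and Chiang (1996) is

$$n^{1/2}\big(F_{sn}^{-1}(1-\alpha) - (F_x^{-1}(1-\tfrac{\alpha}{2}) - \mu_x)\big) = \frac{1}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\, n^{-1/2}\sum_{i=1}^{n}\big\{1 - \alpha - I(G_x^{-1}(\tfrac{\alpha}{2}) \le \epsilon_i \le G_x^{-1}(1-\tfrac{\alpha}{2}))\big\} + o_p(1). \qquad (6.3)$$

With, again, careful arrangement of the representations (6.2) and (6.3), we can derive representations of the symmetric type cutoff point estimates $\hat{C}_{sa}(1-\alpha)$ and $\hat{C}_{sb}(1-\alpha)$.

A large sample representation for the cutoff point estimator $\hat{C}_{sa}(1-\alpha)$ is as follows:

$$n^{1/2}(\hat{C}_{sa}(1-\alpha) - C_{sa}(1-\alpha)) = n^{-1/2}\sum_{i=1}^{n}\Big\{\Big[-\frac{1}{2g_x(0)} + \frac{3(1-\alpha)}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(\epsilon_i \le G_x^{-1}(\tfrac{\alpha}{2})) + \Big[-\frac{1}{2g_x(0)} - \frac{3\alpha}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(G_x^{-1}(\tfrac{\alpha}{2}) \le \epsilon_i \le 0) + \Big[\frac{1}{2g_x(0)} - \frac{3\alpha}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(0 \le \epsilon_i \le G_x^{-1}(1-\tfrac{\alpha}{2})) + \Big[\frac{1}{2g_x(0)} + \frac{3(1-\alpha)}{2g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(\epsilon_i \ge G_x^{-1}(1-\tfrac{\alpha}{2}))\Big\} + o_p(1).$$

A large sample representation for the cutoff point estimator $\hat{C}_{sb}(1-\alpha)$ is as follows:

$$n^{1/2}(\hat{C}_{sb}(1-\alpha) - C_{sb}(1-\alpha)) = n^{-1/2}\sum_{i=1}^{n}\Big\{\Big[-\frac{1}{2g_x(0)} + \frac{2(1-\alpha)}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(\epsilon_i \le G_x^{-1}(\tfrac{\alpha}{2})) + \Big[-\frac{1}{2g_x(0)} - \frac{2\alpha}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(G_x^{-1}(\tfrac{\alpha}{2}) \le \epsilon_i \le 0) + \Big[\frac{1}{2g_x(0)} - \frac{2\alpha}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(0 \le \epsilon_i \le G_x^{-1}(1-\tfrac{\alpha}{2})) + \Big[\frac{1}{2g_x(0)} + \frac{2(1-\alpha)}{g_x(G_x^{-1}(1-\frac{\alpha}{2}))}\Big] I(\epsilon_i \ge G_x^{-1}(1-\tfrac{\alpha}{2}))\Big\} + o_p(1).$$

The theorem follows from the central limit theorem applied to the above two representations for $\hat{C}_{sa}(1-\alpha)$ and $\hat{C}_{sb}(1-\alpha)$, respectively.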

References

Agrawal, D., Chen, T., Irby, R., et al. (2002). Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J. Natl. Cancer Inst., 94, 513-521.

Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Chen, L.-A., Chen, D.-T. and Chan, W. (2008). p value for outlier sum in differential gene expression analysis. Submitted to Biometrika for publication (in revision).

Kim, S. J. (1992). The metrically trimmed means as a robust estimator of location. Annals of Statistics, 20, 1534-1547.

Ohki, R., Yamamoto, K., Ueno, S., et al. (2005). Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int. J. Cardiol., 102, 233-238.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

Sorlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418-8423.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.
