
National Chiao Tung University

Institute of Statistics

Master's Thesis

離群值比例之基因分析

Outlier Proportion Based Gene Expression Analysis

Student: Ying-Chieh Tiao (刁瀅潔)

Advisor: Dr. Lin-An Chen (陳鄰安)

A Thesis Submitted to the Institute of Statistics, College of Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Statistics

June 2010

Hsinchu, Taiwan, Republic of China

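Outlier Proportion Based Gene Expression Analysis

Student: Ying-Chieh Tiao Advisor: Dr. Lin-An Chen

Institute of Statistics, National Chiao Tung University (Master's Program)

Abstract (Chinese)

Finding influential genes by detecting outliers in samples from diseased subjects has become a new and important approach to gene analysis. The outlier sum or outlier mean can detect whether the central tendency of the outlier data has changed, but it cannot detect changes in other characteristics such as skewness. We therefore wish to provide a statistical test that is easy to carry out and has high power, as an alternative method for gene analysis. We propose the notion of the outlier proportion and develop a statistical test based on its asymptotic distribution. We also compare the power of the outlier proportion with that of the outlier mean. To avoid the difficulty of estimating the density function at tail quantile points, which leads to low power, we adopt the empirical quantile as the cutoff point.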


Outlier Proportion Based Gene Expression Analysis

Student: Ying-Chieh Tiao Advisor: Dr. Lin-An Chen Institute of Statistics

National Chiao Tung University

Abstract

Discovering influential genes through the detection of outliers in samples of disease group subjects is a new and important approach to gene expression analysis. The outlier sum or outlier mean technique can detect a shift in the central tendency of the outlier data, but not in other characteristics such as spread. It is desirable to provide a test that is easy to implement and efficient in power as an alternative tool for gene expression analysis. We propose the concept of the outlier proportion and develop a test based on the asymptotic distribution of this statistic. We further compare its power with that of the outlier mean. The variance of the outlier proportion involves densities at tail quantiles, which are estimated inefficiently. To avoid this, we also consider using the empirical quantile as the cutoff point, which gives an alternative outlier proportion based test that performs satisfactorily in terms of power.

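Acknowledgments

My eighteen years as a student, from childhood until now, are about to play their final movement. This also means that I am formally leaving school life and will spread my wings in the workplace.

First, I sincerely thank my thesis advisor, Professor Lin-An Chen. He always guided me with great care and patiently resolved my doubts. As the ancients said, "A teacher is one who transmits the Way, imparts knowledge, and resolves doubts," and this was borne out in him again and again. He also showed concern for my daily life from time to time, which warmed me deeply and for which I am profoundly grateful. Thanks to his teaching and dedication, I gained the most precious rewards in both my coursework and my thesis, and I feel truly fortunate to have met such a wonderful advisor. I therefore say with all sincerity: "Thank you, Professor!" I also thank the three members of my oral defense committee for their comments and suggestions, which made this thesis richer and more complete.

Next, I thank all my classmates in the class of 97 at the NCTU Institute of Statistics for growing together with me over these two years, and Ms. Kuo for working so hard to take care of every aspect of our graduate life. I also thank the friends, seniors, and juniors who have always been by my side, especially my dearest "seven sisters" from university. Your encouragement and company gave me the greatest confidence, and I believe our friendship will never change.

Finally, I thank my family. You have always been my best support, giving me unlimited encouragement and care so that I could pursue my dreams without worry. Thank you so much.

I dedicate this thesis to my family, friends, teachers, and everyone who has helped and accompanied me. I offer you my most sincere gratitude and share this achievement and joy with you.

Ying-Chieh Tiao
Institute of Statistics, National Chiao Tung University
June 2010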

Contents

Chinese Abstract
Abstract
Acknowledgments
1. Introduction
2. Outlier Proportion
3. A Test Based on Asymptotic Distribution of Sample Outlier Proportion
4. An Outlier Proportion Test With Empirical Quantile as Cutoff Point
5. Simulation Study
6. Appendix


1. Introduction

DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al., 2002; Alizadeh et al., 2000; Ohki et al., 2005; Sorlie et al., 2003). Among the existing techniques for detecting differential genes, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because the number of gene expressions is large and the number of available subjects is limited. Several statistical approaches have been proposed to identify genes for which only a subset of the samples has high expression. Among them, Tomlins et al. (2005) observed that there is a small number of outliers in samples of differential genes and introduced a method called cancer outlier profile analysis, which identifies outlier profiles by a statistic based on the median and the median absolute deviation of a gene expression profile. Following this observation, a sequence of approaches concentrated on detecting differential genes based on outlier samples. Tibshirani and Hastie (2007) and Wu (2007) suggested using an outlier sum, the sum of all gene expression values in the disease group that are greater than a specified cutoff point. The common disadvantage of these techniques is that the distribution theory of the proposed statistics has not been developed, so a distribution-based p-value cannot be applied. Recently, Chen, Chen and Chan (2010) considered the outlier mean (the average of the values entering the outlier sum) and developed its large sample theory, which allows a distribution-based p-value to be formulated. Specifically, they carried out a parametric study under the normal distribution and performed simulation studies and data analysis for gene expression.

According to Tomlins et al. (2005), it is desirable to verify whether the variables for disease group subjects and normal group subjects are identically distributed on the region exceeding a cutoff point. The outlier mean approach of Chen, Chen and Chan (2010) can detect whether the means on this region differ. Summarizing the outlier data by its sum or mean may be efficient when the central tendencies of the two distributions on the exceedance region are significantly different. However, detecting a shift in the mean alone is not enough when the shift occurs in a characteristic other than the central tendency. Other characteristics of the outlier data therefore need to be measured as an alternative for detecting influential genes. In this paper, we consider the proportion of outlier data, called the outlier proportion, to detect influential genes. Interestingly, this study shows that the outlier proportion technique is very simple to compute and is also much more efficient than the outlier mean test in detecting influential genes.

In Section 2, we introduce the concept of the population outlier proportion and study its adequacy for detecting a distributional shift. In Section 3, we study the large sample property of the sample outlier proportion and compare the power of the tests based on the outlier mean and the outlier proportion. In Section 4, we propose an alternative outlier proportion based test that avoids estimating densities at extreme quantiles when constructing the test statistic.


2. Outlier Proportion

In a study that consists of $n_1$ subjects in the normal control group and $n_2$ subjects in the disease group, suppose that there are $m$ genes to be investigated. Their gene expressions can be represented as $X_{ij}$, $i = 1, 2, \ldots, n_1$, $j = 1, \ldots, m$, for the normal control group and $Y_{ij}$, $i = 1, 2, \ldots, n_2$, $j = 1, 2, \ldots, m$, for the disease group.

For theoretical development, let us fix a gene and drop the index $j$. Let $X$ and $Y$ be expression variables with observations $X_i$, $i = 1, \ldots, n_1$, for the group of normal subjects and $Y_i$, $i = 1, \ldots, n_2$, for the group of disease subjects, with distribution functions $F_X$ and $F_Y$, respectively.

An important observation by Tomlins et al. (2005), from a study of prostate cancer, is that outlier genes are over-expressed only in a small number of disease samples. Defining a cutoff point $\hat\eta$ determined from the data of the variable $X$, Tibshirani and Hastie (2007) and Wu (2007) considered the sum of the $Y_i$'s that exceed the cutoff point $\hat\eta$, given by $\sum_{i=1}^{n_2} Y_i I(Y_i \ge \hat\eta)$, as a test statistic for detecting whether the disease group distribution differs from the normal group distribution. Later, Chen, Chen and Chan (2010) developed the asymptotic distribution of its average, called the outlier mean,

\[
L_Y = \Big(\sum_{i=1}^{n_2} I(Y_i \ge \hat\eta)\Big)^{-1} \sum_{i=1}^{n_2} Y_i I(Y_i \ge \hat\eta),
\]

for constructing a distribution-based p-value. Let $\eta$ be the population counterpart of the sample cutoff point $\hat\eta$. The idea behind the outlier mean approach is to use a test based on $L_Y$ to verify whether its corresponding population outlier mean $\ell_Y = E(Y \mid Y \ge \eta)$ differs from the population outlier mean obtained when $F_Y = F_X$, namely $\ell_X = E(X \mid X \ge \eta)$.

We consider here a test based on the sample outlier proportion, a tail probability estimator, defined as

\[
\hat\pi_Y = n_2^{-1} \sum_{i=1}^{n_2} I(Y_i \ge \hat\eta). \tag{2.1}
\]

The idea behind this sample percentage is to verify whether its corresponding population outlier proportion

\[
\pi_Y = P\{Y \ge \eta\}
\]

differs from the population outlier proportion obtained when $F_Y = F_X$, namely $\pi_X = P\{X \ge \eta\}$.

To verify that this consideration is appropriate, we take a population cutoff point of the form $\eta = 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$ and make a numerical comparison of the two outlier proportions. We consider the following settings:

Normal: $X \sim N(0,1)$ and $Y \sim N(\theta, 1)$

Mixed normal: $X \sim N(0,1)$, $Y \sim 0.9N(0,1) + 0.1N(\theta, \sigma^2)$.

Population outlier proportions for the variables $X$ and $Y$ under the above settings are displayed in Table 1 for the specified $\alpha$'s and $\theta$'s.

Table 1. Population outlier proportions ($\sigma = 1$)

Normal: $F_X = N(0,1)$, $F_Y = N(\theta, 1)$

alpha    pi_X        pi_Y (theta=1)   pi_Y (theta=3)   pi_Y (theta=5)
0.01     1.48E-12    1.12E-9          3.19E-7          0.0239
0.05     4.01E-7     4.16E-5          0.0016           0.5260
0.1      6.03E-5     0.0022           0.0325           0.8760
0.2      0.0057      0.0636           0.2998           0.9933
0.25     0.0215      0.1530           0.4906           0.9985
0.35     0.1238      0.4380           0.8006           0.9999
0.45     0.3530      0.7333           0.9477           0.9999

Mixed normal

alpha    pi_Y (theta=1)   pi_Y (theta=3)   pi_Y (theta=5)
0.01     1.13E-10         3.19E-8          0.0023
0.05     4.52E-6          1.67E-4          0.0526
0.1      2.76E-4          0.0033           0.0876
0.2      0.0115           0.0351           0.1045
0.25     0.0346           0.0684           0.1192
0.35     0.1552           0.1915           0.2114
0.45     0.3911           0.4125           0.4177

Conceptually, the bigger the difference $\pi_Y - \pi_X$, the easier it is to establish a test for detecting a distributional shift. From Table 1, we expect that larger $\alpha$'s make detection by the outlier proportion more powerful. We will evaluate this point in the subsequent sections.
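The population outlier proportions for the normal settings can be computed directly from the cutoff $2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$. The following is a minimal sketch, assuming $X \sim N(0,1)$ and using only the Python standard library; the function names `cutoff` and `outlier_proportion` and the `weight` argument (the mixing weight of the shifted component) are illustrative choices, not part of the thesis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def cutoff(alpha):
    """Population cutoff 2*F_X^{-1}(1-alpha) - F_X^{-1}(alpha) for X ~ N(0,1)."""
    return 2.0 * STD_NORMAL.inv_cdf(1.0 - alpha) - STD_NORMAL.inv_cdf(alpha)


def outlier_proportion(alpha, theta=0.0, sigma=1.0, weight=1.0):
    """Tail probability P(Y >= cutoff) for
    Y ~ (1 - weight) * N(0,1) + weight * N(theta, sigma^2).

    weight=1.0 gives the pure location/scale model; weight=0.1 gives the
    10% contaminated mixture. theta=0, sigma=1 returns the proportion for X.
    """
    eta = cutoff(alpha)
    tail_base = 1.0 - STD_NORMAL.cdf(eta)
    tail_shifted = 1.0 - NormalDist(theta, sigma).cdf(eta)
    return (1.0 - weight) * tail_base + weight * tail_shifted


if __name__ == "__main__":
    for alpha in (0.1, 0.25, 0.45):
        print(alpha,
              round(outlier_proportion(alpha), 4),
              round(outlier_proportion(alpha, theta=1.0), 4),
              round(outlier_proportion(alpha, theta=1.0, weight=0.1), 4))
```

For $\alpha = 0.25$ the sketch gives approximately 0.0215 for $X$ and 0.1530 for $Y$ with $\theta = 1$, in line with the corresponding Table 1 entries.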

3. A Test Based on Asymptotic Distribution of Sample Outlier Proportion

The sample outlier proportion is defined by

\[
\hat\pi_Y = \frac{1}{n_2} \sum_{i=1}^{n_2} I(Y_i \ge \hat\eta),
\]

where the cutoff point estimator is $\hat\eta = 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)$ and $\hat F_X^{-1}(\alpha)$ is the $\alpha$th empirical quantile based on the sample $X_i$, $i = 1, \ldots, n_1$.

To construct a distribution-based test statistic from this outlier proportion, we state its asymptotic distribution in the following theorem, whose proof is given in the Appendix.

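Theorem 3.1. Suppose that assumptions (A2) and (A3) in the Appendix are true. Then $n_2^{1/2}(\hat\pi_Y - \pi_Y)$ converges in distribution to $N(0, \sigma_\pi^2)$, where

\[
\sigma_\pi^2 = \alpha\big(\alpha b_1 - (1-\alpha) b_2\big)^2 + (1-2\alpha)\,\alpha^2 (b_1 + b_2)^2 + \alpha\big(-(1-\alpha) b_1 + \alpha b_2\big)^2 + \pi_Y(1-\pi_Y).
\]

Here we let

\[
b_1 = 2\gamma_{xy}^{1/2} f_Y(\eta)\, f_X^{-1}\big(F_X^{-1}(1-\alpha)\big), \qquad b_2 = \gamma_{xy}^{1/2} f_Y(\eta)\, f_X^{-1}\big(F_X^{-1}(\alpha)\big),
\]

where $\gamma_{xy}$ is the limit of $n_2/n_1$ defined in the Appendix.

This theorem indicates that, under $H_0: F_X = F_Y$,

\[
P_{H_0}\Big\{ \sqrt{n_2}\, \frac{\hat\pi_Y - \pi_X}{\sigma_\pi} \le z \Big\} \to \int_{-\infty}^{z} \phi(t)\, dt
\]

for $z \in R$, where $\phi$ represents the probability density function of $N(0,1)$. Given estimates $\hat\sigma_\pi$ and $\hat\pi_X$, the test based on the sample outlier proportion rejects $H_0$ if

\[
n_2^{1/2}\, \frac{\hat\pi_Y - \hat\pi_X}{\hat\sigma_\pi} \ge z_{\alpha_0}, \tag{3.1}
\]

where $z_{\alpha_0}$ is the upper $\alpha_0$ point of $N(0,1)$ for significance level $\alpha_0$.

The test examines whether the outlier proportion for disease group subjects differs from that for normal group subjects. As a nonparametric approach, this test statistic requires estimating the densities $f_X$ and $f_Y$ at certain points.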
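When the two distributions are fully specified, the asymptotic variance in Theorem 3.1 can be evaluated numerically. The sketch below is an illustration under assumptions: it treats the variance as the variance of a three-valued influence function (taking a different constant on each of the three regions cut out by the $\alpha$th and $(1-\alpha)$th quantiles of $X$) plus the binomial term for $Y$, it uses normal distributions through `statistics.NormalDist`, and the name `asymptotic_variance` and the argument `ratio` (the limit of $n_2/n_1$) are hypothetical.

```python
from statistics import NormalDist


def asymptotic_variance(alpha, ratio, dist_x, dist_y):
    """Asymptotic variance of sqrt(n2) * (sample outlier proportion - population value).

    alpha  : tail percentage defining the cutoff 2*q(1-alpha) - q(alpha)
    ratio  : assumed limit of n2 / n1
    dist_x : NormalDist for the normal group (assumed fully known here)
    dist_y : NormalDist for the disease group (assumed fully known here)
    """
    q_low = dist_x.inv_cdf(alpha)
    q_high = dist_x.inv_cdf(1.0 - alpha)
    eta = 2.0 * q_high - q_low
    pi_y = 1.0 - dist_y.cdf(eta)

    # b1 and b2 use reciprocals of the X density at the two quantiles.
    root = ratio ** 0.5
    b1 = 2.0 * root * dist_y.pdf(eta) / dist_x.pdf(q_high)
    b2 = root * dist_y.pdf(eta) / dist_x.pdf(q_low)

    # Values of the influence function on the three regions of X.
    low = alpha * b1 - (1.0 - alpha) * b2
    mid = alpha * (b1 + b2)
    high = -(1.0 - alpha) * b1 + alpha * b2

    var_from_cutoff = alpha * low ** 2 + (1.0 - 2.0 * alpha) * mid ** 2 + alpha * high ** 2
    return var_from_cutoff + pi_y * (1.0 - pi_y)


if __name__ == "__main__":
    x = NormalDist(0.0, 1.0)
    print(asymptotic_variance(0.25, 1.0, x, NormalDist(1.0, 1.0)))
```

With `ratio` set to 0 the cutoff is effectively known, and the expression reduces to the binomial variance of the tail indicator, which is a convenient sanity check.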

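Having this nonparametric test based on the sample outlier proportion, it is desirable to examine its power when there is a distributional shift in the disease group distribution. An approximate power at significance level $\alpha_0$ may be derived as follows:

\[
\begin{aligned}
p_p &= P_{F_Y}\Big\{ \sqrt{n_2}\, \frac{\hat\pi_Y - \hat\pi_X}{\hat\sigma_\pi} \ge z_{\alpha_0} \Big\} \\
&= P_{F_Y}\Big\{ \sqrt{n_2}\, \frac{\hat\pi_Y - \pi_Y}{\sigma_\pi} \ge \sqrt{n_2}\, \frac{z_{\alpha_0}\hat\sigma_\pi/\sqrt{n_2} + \hat\pi_X - \pi_Y}{\sigma_\pi} \Big\} \\
&\approx P\Big\{ Z \ge z_{\alpha_0} + \frac{\sqrt{n_2}\,(\pi_X - \pi_Y)}{\sigma_\pi} \Big\}.
\end{aligned} \tag{3.2}
\]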

Consider the following distributional settings:

Normal: $X \sim N(0,1)$, $Y \sim N(\theta, 1)$

Laplace distribution: $X \sim \mathrm{Laplace}(0,1)$, $Y \sim \mathrm{Laplace}(\theta, 1)$

t distribution: $X \sim t(5)$, $Y \sim t(5) + \theta$

We display the powers $p_m$, for the outlier mean based test, and $p_p$, for the outlier proportion based test, in Table 2.

Table 2. Approximate powers of outlier mean and outlier proportion

Distribution   alpha   Test   theta=1   theta=2   theta=3
Normal         0.45    p_m    0.523     0.999     1
                       p_p    0.908     1         1
               0.35    p_m    0.144     0.844     1
                       p_p    0.537     1         1
               0.25    p_m    0.063     0.294     0.863
                       p_p    0.262     0.992     0.999
               0.15    p_m    0.052     0.111     0.151
                       p_p    0.122     0.537     0.579
Laplace        0.45    p_m    0.289     0.993     1
                       p_p    0.979     1         1
               0.35    p_m    0.050     0.414     0.999
                       p_p    0.390     1         1
               0.25    p_m    0.050     0.255     0.490
                       p_p    0.219     0.798     0.999
               0.15    p_m    0.050     0.05      0.050
                       p_p    0.123     0.441     0.260
t              0.45    p_m    0.412     0.994     1
                       p_p    0.898     1         1
               0.35    p_m    0.077     0.418     0.999
                       p_p    0.543     1         1
               0.25    p_m    0.043     0.052     0.518
                       p_p    0.332     0.995     0.998
               0.15    p_m    0.046     0.027     0.016
                       p_p    0.203     0.828     0.687

Surprisingly, the outlier proportion test performs much better than the outlier mean test under these three location shifts.
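The final approximation in the power derivation is a single normal tail probability, so it is easy to evaluate once the two population outlier proportions and the asymptotic standard deviation are available. The sketch below assumes these three quantities are supplied by the user; the function name `approx_power` and its parameter names are illustrative, and the default level 0.05 is an assumption.

```python
from statistics import NormalDist


def approx_power(pi_x, pi_y, sigma, n2, level=0.05):
    """Approximate power P{Z >= z + sqrt(n2) * (pi_x - pi_y) / sigma}.

    pi_x, pi_y : population outlier proportions of the normal and disease groups
    sigma      : asymptotic standard deviation of the sample outlier proportion
    n2         : number of disease group subjects
    level      : significance level; z is the upper `level` point of N(0,1)
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - level)
    threshold = z + (n2 ** 0.5) * (pi_x - pi_y) / sigma
    return 1.0 - std_normal.cdf(threshold)


if __name__ == "__main__":
    # Hypothetical inputs: no shift gives the nominal level, a shift raises the power.
    print(approx_power(0.02, 0.02, 0.3, 50))
    print(approx_power(0.02, 0.15, 0.3, 50))
```

When the two proportions coincide the expression returns the nominal level, and it increases as the proportion for the disease group grows, which matches the role of the difference between the two proportions in the derivation.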

According to Tomlins et al. (2005), it is desirable to verify the power of the outlier proportion when there is only a small percentage of outliers in the data of $Y$. For this, we consider the following distributional setting:

$X \sim \mathrm{Laplace}(0,1)$, $Y \sim 0.9\,\mathrm{Laplace}(0,1) + 0.1\,\mathrm{Laplace}(\theta, \sigma)$

Table 3. Approximate powers of outlier mean and outlier proportion for Laplace mixture ($\sigma = 3$)

         theta=3          theta=5          theta=10
alpha    p_m     p_p      p_m     p_p      p_m     p_p
0.45     0.184   0.707    0.253   0.734    0.368   0.756
0.35     0.189   0.846    0.276   0.875    0.420   0.896
0.25     0.180   0.941    0.299   0.965    0.550   0.978
0.15     0.150   0.975    0.255   0.990    0.742   0.996
0.05     0.105   0.987    0.130   0.992    0.485   0.999

This computation shows that the outlier proportion is still satisfactory in this mixed distribution case. This further supports the use of the outlier proportion in gene expression analysis.

4. An Outlier Proportion Test With Empirical Quantile as Cutoff Point

We have observed that the outlier proportion test may have satisfactory power when consistent estimators $\hat\pi_X$ and $\hat\sigma_\pi$ are available to construct the test in (3.1). However, $\hat\sigma_\pi$ involves estimating the densities $f_Y$ and $f_X$ at certain points, and estimating a density function at tail quantile points is extremely difficult in practice. Without an alternative that avoids this density estimation, the outlier proportion based test will not be practically powerful in detecting influential genes unless $n_1$ and $n_2$, the numbers of normal group subjects and disease group subjects, are very large.

In this section, we choose the cutoff point $\hat\eta = \hat F_X^{-1}(\beta)$ for some $\beta \in (0,1)$. To avoid confusion, we denote the outlier proportion as

\[
\hat\pi_Y^{\beta} = \frac{1}{n_2} \sum_{i=1}^{n_2} I\big(Y_i \ge \hat F_X^{-1}(\beta)\big)
\]

for estimating $\pi_Y^{\beta} = P\big(Y \ge F_X^{-1}(\beta)\big)$. We first study the differences between the two population outlier proportions under the following distributional setting:

$X \sim \mathrm{Laplace}(0,1)$, $Y \sim 0.9\,\mathrm{Laplace}(0,1) + 0.1\,\mathrm{Laplace}(\theta, \sigma)$.

Table 4

beta    sigma    pi_X     pi_Y (theta=3)   pi_Y (theta=5)   pi_Y (theta=10)
0.9     3        0.203    0.751            0.872            0.975
        5        0.203    0.671            0.779            0.918
        10       0.203    0.594            0.668            0.798
0.95    3        0.193    0.747            0.870            0.975
        5        0.193    0.668            0.777            0.918
        10       0.193    0.592            0.666            0.797

The differences between the two population proportions are quite pronounced when the quantile percentage is 0.9 or 0.95. This shows that using a quantile as the cutoff point for detecting outliers is quite satisfactory.

A large sample theory for this quantile based outlier proportion is stated below.

Theorem 4.1. Suppose that assumptions (A2) and (A3) in the Appendix are true. Then $n_2^{1/2}(\hat\pi_Y^{\beta} - \pi_Y^{\beta})$ converges in distribution to $N(0, \sigma_Y^2)$, where

\[
\sigma_Y^2 = \beta(1-\beta)\,\gamma_{xy}\, f_Y^2\big(F_X^{-1}(\beta)\big)\, f_X^{-2}\big(F_X^{-1}(\beta)\big) + \pi_Y^{\beta}(1 - \pi_Y^{\beta}).
\]

To construct a test statistic based on the above theorem, we still face the problem of estimating $\sigma_Y^2$, which involves the density values $f_Y(F_X^{-1}(\beta))$ and $f_X(F_X^{-1}(\beta))$ and is difficult unless the sample is huge. However, under $H_0$ we may replace $f_Y$ by $f_X$, and then $\sigma_Y^2$ reduces to

\[
\sigma_X^2 = \beta(1-\beta)\,\gamma_{xy} + \pi_Y^{\beta}(1 - \pi_Y^{\beta}).
\]

In this setting, we need only the estimates $\hat\pi_Y^{\beta}$ and $\hat\pi_X^{\beta}$ to build the outlier proportion based test, which rejects $H_0$ if

\[
\sqrt{n_2}\, \frac{\hat\pi_Y^{\beta} - \hat\pi_X^{\beta}}{\hat\sigma_X} \ge z_{\alpha_0}. \tag{4.1}
\]
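The quantile based test statistic needs no density estimate, only counts above an empirical quantile of the normal group. The following is a minimal sketch under stated assumptions: the empirical quantile is taken as the order statistic of rank ceil(n1 * beta), which is one common definition and not necessarily the one used in the thesis; the variance estimate plugs the sample ratio n2/n1 and the sample proportion of the disease group into the null variance formula; the names `empirical_quantile` and `outlier_proportion_statistic` are hypothetical.

```python
import math


def empirical_quantile(sample, beta):
    """Order statistic of rank ceil(n * beta); an assumed quantile definition."""
    ordered = sorted(sample)
    rank = max(1, math.ceil(len(ordered) * beta))
    return ordered[rank - 1]


def outlier_proportion_statistic(x, y, beta):
    """Return (statistic, pi_x_hat, pi_y_hat) for the quantile-cutoff test.

    x    : expressions of one gene in the normal group
    y    : expressions of the same gene in the disease group
    beta : quantile level in (0, 1) used for the cutoff
    """
    n1, n2 = len(x), len(y)
    cut = empirical_quantile(x, beta)
    pi_x_hat = sum(value >= cut for value in x) / n1
    pi_y_hat = sum(value >= cut for value in y) / n2
    # Null variance estimate: beta(1-beta) * n2/n1 + pi_y_hat * (1 - pi_y_hat).
    variance = beta * (1.0 - beta) * (n2 / n1) + pi_y_hat * (1.0 - pi_y_hat)
    statistic = math.sqrt(n2) * (pi_y_hat - pi_x_hat) / math.sqrt(variance)
    return statistic, pi_x_hat, pi_y_hat


if __name__ == "__main__":
    normal_group = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    disease_group = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    print(outlier_proportion_statistic(normal_group, disease_group, 0.5))
```

The statistic would then be compared with a critical value; for 0 < beta < 1 the variance estimate is strictly positive, so the division is always defined.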

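An approximate power for the outlier proportion test based on this quantile cutoff point at significance level $\alpha_0$ may be derived as follows:

\[
\begin{aligned}
P_{F_Y}\Big\{ \sqrt{n_2}\, \frac{\hat\pi_Y^{\beta} - \hat\pi_X^{\beta}}{\hat\sigma_X} \ge z_{\alpha_0} \Big\}
&= P_{F_Y}\Big\{ \sqrt{n_2}\, \frac{\hat\pi_Y^{\beta} - \pi_Y^{\beta}}{\sigma_Y} \ge \sqrt{n_2}\, \frac{z_{\alpha_0}\hat\sigma_X/\sqrt{n_2} + \hat\pi_X^{\beta} - \pi_Y^{\beta}}{\sigma_Y} \Big\} \\
&\approx P\Big\{ Z \ge z_{\alpha_0}\frac{\sigma_X}{\sigma_Y} + \frac{\sqrt{n_2}\,(\pi_X^{\beta} - \pi_Y^{\beta})}{\sigma_Y} \Big\},
\end{aligned} \tag{4.2}
\]

where $\pi_X^{\beta} = P\big(X \ge F_X^{-1}(\beta)\big)$.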

It is of interest to compare the powers of the outlier mean and the outlier proportion when both use the quantile cutoff point. First, we consider the following two location shift models:

Case 1: $X \sim N(0,1)$ and $Y \sim N(\theta, 1)$

Case 2: $X \sim \mathrm{Laplace}(0,1)$ and $Y \sim \mathrm{Laplace}(\theta, 1)$

We display the resulting powers in the following table.

Table 5. Approximate powers of outlier mean and outlier proportion

Case     beta    Power   theta=1   theta=2   theta=4
Case 1   0.9     p_m     0.180     0.858     1.0
                 p_p     0.687     0.987     1.0
         0.95    p_m     0.122     0.558     1.0
                 p_p     0.407     0.961     1.0
Case 2   0.9     p_m     0.050     0.192     1.0
                 p_p     0.389     0.771     1.0
         0.95    p_m     0.050     0.090     1.0
                 p_p     0.235     0.581     1.0

In these location shift models, the outlier proportion is again at least as powerful as the outlier mean. This further indicates the appropriateness of applying the outlier proportion in gene expression analysis.

Following the observation of Tomlins et al. (2005), it is of interest to further compare the powers when there is only a small percentage of outliers in the distribution of $Y$. We evaluate the approximate power for the following two mixed distributions:

Case A: $X \sim \mathrm{Laplace}(0,1)$, $Y \sim 0.7\,\mathrm{Laplace}(0,1) + 0.3\,N(\theta, 1)$

Case B: $X \sim t(5)$, $Y \sim 0.7\,t(5) + 0.3\,\mathrm{Laplace}(\theta, 1)$

The results are listed in Table 6.

Table 6. Approximate powers of outlier mean and outlier proportion

Case     beta    Power   theta=2   theta=3   theta=4
Case A   0.85    p_m     0.107     0.553     0.986
                 p_p     0.634     0.809     0.839
         0.9     p_m     0.086     0.252     0.504
                 p_p     0.565     0.815     0.878
         0.95    p_m     0.125     0.156     0.237
                 p_p     0.424     0.690     0.881
Case B   0.85    p_m     0.335     0.926     0.999
                 p_p     0.637     0.774     0.818
         0.9     p_m     0.185     0.640     0.987
                 p_p     0.623     0.805     0.858
         0.95    p_m     0.177     0.205     0.458
                 p_p     0.499     0.779     0.880

The approximate powers in Table 6 indicate that the outlier proportion remains a good choice in these distributional settings, although the outlier mean attains higher power in some configurations with $\beta = 0.85$ or $0.9$ and larger $\theta$. Let us further consider one more distributional setting:

Mixed t: $X \sim t(10)$, $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \theta)$

Table 7. Approximate powers of outlier mean and outlier proportion for some mixed distributions

beta    Power   theta=2   theta=4   theta=6
0.9     p_m     0.879     0.895     0.905
        p_p     0.873     0.953     0.960
0.95    p_m     0.873     0.892     0.903
        p_p     0.900     0.957     0.970

Both methods have high power in this distributional setting. The outlier proportion based test has the higher power in all but one configuration.

5. Simulation Study

Suppose that we now have estimates $\hat\pi_X^{\beta}$ and $\hat\sigma_X$ for $\pi_X^{\beta}$ and $\sigma_X$, respectively. A test based on the quantile based outlier proportion is stated in (4.1). Let

\[
\hat\pi_X^{\beta} = \frac{1}{n_1} \sum_{i=1}^{n_1} I\big(X_i \ge \hat F_X^{-1}(\beta)\big), \qquad \hat\gamma_{xy} = \frac{n_2}{n_1}, \qquad \hat\sigma_X^2 = \beta(1-\beta)\,\hat\gamma_{xy} + \hat\pi_Y^{\beta}(1 - \hat\pi_Y^{\beta}).
\]

The question is whether this is, in practice, a level $\alpha_0$ test.

Theoretically, the critical point $z_{\alpha_0}$ is 1.645 when the significance level is 0.05. We conduct $m = 100000$ replications to compute the simulated probability

\[
p_p = \frac{1}{m} \sum_{j=1}^{m} I\Big( n_2^{1/2}\, \frac{\hat\pi_Y^{\beta} - \hat\pi_X^{\beta}}{\hat\sigma_X} \ge \ell \Big). \tag{5.1}
\]

When we set $\ell = 1.645$, (5.1) estimates the probability of a type I error. The results for several distributions and various sample sizes are displayed in the following table.

Table 8

sample size    N(0,1)    t(10)     Laplace(0,1)
n = 30         0.1156    0.1178    0.1174
n = 50         0.1328    0.1327    0.1341
n = 100        0.1133    0.1125    0.1134
n = 200        0.1258    0.1238    0.1243
n = 500        0.1197    0.1211    0.1198
n = 1000       0.1285    0.1273    0.1264
n = 10000      0.1203    0.1213    0.1205
n = 100000     0.1199    0.1201    0.1198
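The simulated rejection probability can be approximated with a short Monte Carlo loop. The sketch below is an illustration only and is not expected to reproduce the tabulated values: it assumes equal group sizes n, standard normal data for both groups, the order statistic of rank ceil(n * beta) as the empirical quantile, and a user-chosen beta, none of which are specified in the text. The function name `simulated_rejection_rate` and its parameters are hypothetical.

```python
import math
import random


def simulated_rejection_rate(n, beta, crit, reps, seed=0):
    """Fraction of replications in which the quantile-cutoff statistic is >= crit.

    Both groups are drawn from N(0,1), so the result estimates a type I error
    rate when crit is the nominal critical value.
    """
    rng = random.Random(seed)
    rank = max(1, math.ceil(n * beta))
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        cut = sorted(x)[rank - 1]
        pi_x_hat = sum(value >= cut for value in x) / n
        pi_y_hat = sum(value >= cut for value in y) / n
        # Null variance estimate with sample ratio n2 / n1 = 1.
        variance = beta * (1.0 - beta) + pi_y_hat * (1.0 - pi_y_hat)
        statistic = math.sqrt(n) * (pi_y_hat - pi_x_hat) / math.sqrt(variance)
        if statistic >= crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(simulated_rejection_rate(n=30, beta=0.9, crit=1.645, reps=2000, seed=1))
```

The same loop can be rerun over a grid of `crit` values to search for a constant that brings the simulated rate close to 0.05, and then with a shifted distribution for the second group to simulate power.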

Unfortunately, (4.1) is not in practice a level 0.05 test. For each distribution, we therefore choose a constant $\ell$ such that (5.1) is approximately equal to 0.05, and then simulate the power of (5.1) under the following Case I and Case II distributions:

Case I: $X \sim N(0,1)$ and $Y \sim 0.9\,N(0,1) + 0.1\,(\chi^2(10) + \theta)$

Case II: $X \sim t(10)$ and $Y \sim 0.9\,t(10) + 0.1\,(\chi^2(10) + \theta)$.

The results are displayed in Table 9 and Table 10.

Table 9

beta    Test (constant c)    H0        theta=2   theta=4   theta=6
0.5     p_m (c = 2.16)       0.0527    0.9109    0.9303    0.9419
        p_p (c = 2.38)       0.0516    0.9526    0.9671    0.9782
0.55    p_m (c = 2.23)       0.0501    0.9167    0.9332    0.9443
        p_p (c = 2.44)       0.0504    0.9569    0.9685    0.9868
0.6     p_m (c = 2.28)       0.0504    0.9192    0.9355    0.9443
        p_p (c = 2.51)       0.0508    0.9739    0.9828    0.9983
0.65    p_m (c = 2.37)       0.0523    0.9227    0.9394    0.9474
        p_p (c = 2.62)       0.0513    0.9647    0.9761    0.9802
0.7     p_m (c = 2.48)       0.0513    0.9227    0.9387    0.9469
        p_p (c = 2.7)        0.0496    0.9716    0.9826    0.9956
0.75    p_m (c = 2.74)       0.0511    0.9225    0.9388    0.9493
        p_p (c = 2.78)       0.0505    0.9623    0.9764    0.9890
0.8     p_m (c = 2.96)       0.0526    0.9243    0.9388    0.9486
        p_p (c = 2.83)       0.0510    0.9674    0.9891    0.9912
0.85    p_m (c = 3.8)        0.0508    0.9169    0.9332    0.942
        p_p (c = 2.95)       0.0513    0.9598    0.9864    0.9946
0.9     p_m (c = 4.81)       0.051     0.9034    0.926     0.9368
        p_p (c = 3.19)       0.0497    0.9580    0.9681    0.9767
0.95    p_m (c = 20.8)       0.0502    0.6608    0.7208    0.7659
        p_p (c = 3.58)       0.0506    0.8774    0.9105    0.9423

Table 10

beta    Test (constant c)    H0        theta=2   theta=4   theta=6
0.5     p_m (c = 2.35)       0.0497    0.8881    0.9119    0.9281
        p_p (c = 2.18)       0.0508    0.9636    0.9870    0.9958
0.55    p_m (c = 2.42)       0.0501    0.8932    0.9166    0.9304
        p_p (c = 2.29)       0.0506    0.9626    0.9863    0.9961
0.6     p_m (c = 2.47)       0.0508    0.8918    0.9159    0.9336
        p_p (c = 2.35)       0.0498    0.9540    0.9847    0.9953
0.65    p_m (c = 2.65)       0.0492    0.8925    0.9167    0.9316
        p_p (c = 2.42)       0.0509    0.9391    0.9794    0.9937
0.7     p_m (c = 2.75)       0.051     0.8956    0.917     0.9344
        p_p (c = 2.5)        0.0495    0.9501    0.9836    0.9951
0.75    p_m (c = 3.05)       0.05      0.8924    0.9168    0.9308
        p_p (c = 2.57)       0.0510    0.9207    0.9693    0.9900
0.8     p_m (c = 3.36)       0.0494    0.8847    0.9109    0.9288
        p_p (c = 2.73)       0.0497    0.9413    0.9786    0.9925
0.85    p_m (c = 4.25)       0.0503    0.868     0.9001    0.9185
        p_p (c = 2.98)       0.0502    0.9164    0.9485    0.9799
0.9     p_m (c = 5.45)       0.0505    0.8366    0.8775    0.9019
        p_p (c = 3.21)       0.0509    0.8936    0.9167    0.9549
0.95    p_m (c = 23)         0.0502    0.5262    0.588     0.6364
        p_p (c = 3.45)       0.0503    0.7492    0.8406    0.9041

The outlier mean and outlier proportion techniques are both powerful in these distributional settings. More interestingly, the outlier proportion is the more efficient method in this comparison.

6. Appendix

Three assumptions for the asymptotic representation of the sample outlier proportion test are as follows.

(A1) The limit $\gamma_{xy} = \lim_{n_1, n_2 \to \infty} n_2/n_1$ exists.

(A2) The probability density function $f_X$ is bounded away from zero in neighborhoods of $F_X^{-1}(\alpha)$ for $\alpha \in (0,1)$ and of the population cutoff point $\eta$.

(A3) The probability density function $f_Y$ is bounded away from zero in a neighborhood of the population cutoff point $\eta$.

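Proof of Theorem 3.1.

From the expression of $\hat\pi_Y$ in (2.1), we have

\[
n_2^{1/2}(\hat\pi_Y - \pi_Y) = n_2^{-1/2} \sum_{i=1}^{n_2} \big[ I(Y_i \ge \eta + n_1^{-1/2} T_n) - I(Y_i \ge \eta) \big] + n_2^{-1/2} \sum_{i=1}^{n_2} \big( I(Y_i \ge \eta) - \pi_Y \big), \tag{6.1}
\]

where

\[
T_n = n_1^{1/2}(\hat\eta - \eta) = n_1^{1/2} \big[ \big(2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big) - \big(2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big) \big].
\]

With assumption (A3), the key step in this proof is that

\[
n_2^{-1/2} \sum_{i=1}^{n_2} \big[ I(Y_i \ge \eta + n_1^{-1/2} T_n) - I(Y_i \ge \eta) \big] = -\gamma_{xy}^{1/2} f_Y(\eta)\, T_n + o_p(1), \tag{6.2}
\]

which may be seen in Ruppert and Carroll (1980) and Chen and Chiang (1996). With the following representation of the empirical quantile,

\[
\sqrt{n_1}\big(\hat F_X^{-1}(\alpha) - F_X^{-1}(\alpha)\big) = f_X^{-1}\big(F_X^{-1}(\alpha)\big)\, n_1^{-1/2} \sum_{i=1}^{n_1} \big[ \alpha - I\big(X_i \le F_X^{-1}(\alpha)\big) \big] + o_p(1) \tag{6.3}
\]

(see, for example, Ruppert and Carroll (1980)), a Bahadur representation of the outlier proportion follows from (6.1)-(6.3) as

\[
\begin{aligned}
n_2^{1/2}(\hat\pi_Y - \pi_Y) = {} & n_1^{-1/2} \sum_{i=1}^{n_1} \Big[ \big(\alpha b_1 - (1-\alpha) b_2\big)\, I\big(X_i \le F_X^{-1}(\alpha)\big) \\
& + \alpha (b_1 + b_2)\, I\big(F_X^{-1}(\alpha) < X_i \le F_X^{-1}(1-\alpha)\big) \\
& + \big(-(1-\alpha) b_1 + \alpha b_2\big)\, I\big(X_i > F_X^{-1}(1-\alpha)\big) \Big] \\
& + n_2^{-1/2} \sum_{i=1}^{n_2} \big[ I(Y_i \ge \eta) - \pi_Y \big] + o_p(1).
\end{aligned}
\]

The three indicator terms have probabilities $\alpha$, $1-2\alpha$ and $\alpha$ and a weighted mean of zero, so the asymptotic distribution in Theorem 3.1 follows from the Central Limit Theorem.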

References

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Chen, L.-A., Chen, Dung-Tsa and Chan, Wenyaw (2010). The p Value for the Outlier Sum in Differential Gene Expression Analysis. Biometrika, 97, 246-253.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.
