
Concept of Average Run Length for Coverage Interval & p values for Gene Expression Analysis


National Chiao Tung University

Institute of Statistics

Master's Thesis

Concept of Average Run Length for Coverage Interval

& p values for Gene Expression Analysis

Graduate student: Yu-Ting Tseng (曾鈺婷)

Advisor: Dr. Lin-An Chen (陳鄰安)


A Thesis

Submitted to Institute of Statistics College of Science

National Chiao Tung University In Partial Fulfillment of the Requirements

For the Degree of Master

In

Statistics

June 2008

Hsinchu, Taiwan, Republic of China

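Student: Yu-Ting Tseng (曾鈺婷)  Advisor: Prof. Lin-An Chen (陳鄰安)

Institute of Statistics, National Chiao Tung University

Chinese Abstract

Topic 1:

A coverage interval is used to check whether a person is healthy. We want to evaluate the power of the coverage interval, that is, whether a future observation is classified correctly and how long it takes before an observation is misclassified. To this end, we study the power and the average run length as tools for evaluating a coverage interval, and apply both to several distributions.

Key words: average run length; coverage interval; hypothesis testing; power; reference interval.

Topic 2:

The outlier sum was proposed by Tibshirani and Hastie (2007) and Wu (2007) for detecting differential genes in cancer studies, where one or several disease groups show unusually high gene expression in a subset of their samples. We propose a new definition of the outlier sum that allows us to develop its asymptotic distribution theory and to formulate its p value. This p value can be computed under parametric or nonparametric distributions, and we further derive its formula under the normality assumption. To study this p value, we carry out simulations and a real data analysis. The outlier sum not only lets us compute p values for genes but is also flexible enough to handle gene variables with various distributional structures.

Key words: gene analysis; outlier sum; p value.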

Student: Yu-Ting Tseng  Advisor: Dr. Lin-An Chen

Institute of Statistics, National Chiao Tung University

Abstract

Topic 1:

One use of a coverage interval is to monitor whether an individual should be classified as healthy. It is then desirable to evaluate the coverage interval by its power, that is, by whether a future observation is classified correctly and how often it could be misclassified. To this end, we study the power and apply the concept of average run length to evaluate the coverage interval. Several distributions are examined for these two tasks.

Key words: Average run length; coverage interval; hypothesis testing; power; reference interval.

Topic 2:

The outlier sum has been proposed in Tibshirani and Hastie (2007) and Wu (2007) for the detection of differential genes in cancer studies where one or several disease groups show unusually high gene expression in a subset of their samples. A new outlier sum is proposed that allows us to develop its asymptotic distribution theory for formulating a p value. Since it is a function of some distributional parameters, this p value may be computed parametrically or nonparametrically. We further formulate this p value parametrically when a normal distribution is assumed for the gene variables. To investigate this p value, we perform a simulation and conduct a real data analysis, which indicates that this outlier sum not only allows us to compute p values for genes but is also flexible in handling various distributional structures for gene variables.

Key words: Gene expression analysis; outlier sum; p value.

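Acknowledgements

From my undergraduate years through graduate school, so many years at NCTU have passed in the blink of an eye, and now I am graduating again. Looking back on my two years in the master's program, the time went by quickly but was very full: on the one hand with coursework and thesis research, and on the other with getting to know many impressive friends who strike the right balance between study and play and are truly worth learning from.

First of all, I thank my advisor, Professor Lin-An Chen (陳鄰安). He always explained each concept patiently and clearly; even when he was busy, he tirelessly discussed the thesis with me and was willing to spend time working through small details. He was also a good teacher in life who shared much of his philosophy, and a good companion for discussing baseball games. The year of doing research with him is a pleasant and unforgettable memory. I also thank Professors 江永進, 彭南夫 and 賴怡璇 for their guidance and valuable suggestions on this thesis.

Next, I thank my family for always supporting me. My short temper often flares up over small frustrations, yet I could still feel your care and understanding; I truly want to say sorry and thank you. Thanks also to my friends of many years, who often listened to my complaints, calmed me down and gave me much encouragement.

During the two years of the master's program I made many new friends: first the students one year ahead of me, who often guided me in coursework and looked after me in daily life, and then my classmates, for their constant help and company. Our homework discussions, the surprise birthday celebrations and the unforgettable graduation trip will all be treasured memories.

I dedicate this thesis to my teachers, family, friends and classmates, with my sincerest gratitude.

Yu-Ting Tseng (曾鈺婷)
Institute of Statistics, National Chiao Tung University
June 2008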

Contents

Chinese Abstract
Abstract
Acknowledgements
Contents

Topic 1: Concept of Average Run Length for Coverage Interval
1. Introduction
2. Specifications for Evaluating the Coverage Interval
3. A Study for Normal Distribution
4. Coverage Intervals for Gamma and Exponential Distributions

Topic 2: p values for Gene Expression Analysis
5. Introduction
6. General Formulation for Outlier Means
7. Formulation of p Value with Normal Samples
8. Simulation and Data Analysis

Topic 1: Concept of Average Run Length for Coverage Interval

Abstract

One use of a coverage interval is to monitor whether an individual should be classified as healthy. It is then desirable to evaluate the coverage interval by its power, that is, by whether a future observation is classified correctly and how often it could be misclassified. To this end, we study the power and apply the concept of average run length to evaluate the coverage interval. Several distributions are examined for these two tasks.

Key words: Average run length; coverage interval; hypothesis testing; power; reference interval.

1. Introduction

The coverage interval, in accordance with the recommendation of the Guide to the Expression of Uncertainty in Measurement for measuring uncertainty, refers to population-based measurement values obtained from a well-defined group of reference individuals. It is an interval with two confidence limits that covers the measurement values in the population in some probabilistic sense. Laboratory test results are commonly compared to a coverage interval, called a reference interval in clinical chemistry, before caregivers make physiological assessments, medical diagnoses, or management decisions. An individual who is being screened for some disorder on the basis of a relevant measurement is suspected to be abnormal if the measurement value lies outside the coverage interval.

The coverage interval can be estimated either parametrically or non-parametrically. The parametric method classically assumes that the underlying distribution of the measurement variable is normal, whereas, recently, Chen, Huang and Chen (2007) proposed a technique for constructing coverage intervals for asymmetric distributions. The non-parametric approach, on the other hand, estimates the quantiles (percentiles) directly; the most popular technique for estimating the unknown quantiles is the empirical quantile.
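The non-parametric route can be sketched in a few lines of Python. This is only an illustration: the function names `empirical_quantile` and `nonparametric_interval` are hypothetical, the linear-interpolation rule is one of several common quantile conventions (the text does not fix one), and the reference values are made up.

```python
def empirical_quantile(values, gamma):
    """Empirical gamma-quantile using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * gamma
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def nonparametric_interval(values, alpha=0.05):
    """Central interval between the alpha/2 and 1 - alpha/2 empirical quantiles."""
    return (empirical_quantile(values, alpha / 2),
            empirical_quantile(values, 1 - alpha / 2))


# Made-up reference measurements 0, 1, ..., 100 (assumption for illustration only).
reference_values = list(range(101))
low, high = nonparametric_interval(reference_values)
```

With a real reference group, `reference_values` would hold the measurements of the healthy individuals, and a new measurement outside `(low, high)` would be flagged for follow-up.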

Basically, the coverage interval is used to assay whether measurement units meet defined criteria. In radiation protection, it provides a range of maximum acceptable uncertainty in a dose measured under workplace conditions. In clinical chemistry, it serves as a reference standard for measurements such as head circumference, length and the mid-arm circumference/head circumference ratio in the evaluation of exclusively breastfed infants, and it provides some guidance in the interpretation of patient results. When the measurement values do not meet the defined criteria (falling in the coverage interval), the units may be suspected to be unsafe or unhealthy and require further investigation. These concerns are all statistical hypothesis testing problems.

However, although the coverage interval is used as an acceptance region for a hypothesis, little is known about the statistical properties of the test based on it.

We say that a manufacturing process is in statistical control if the process distribution of the quality characteristic is constant over time; if it changes over time, the process is said to be statistically out of control. A control chart is the most popular technique for monitoring the process, and the most popular way to evaluate its risk is the average run length (ARL), the average number of sample points that must be plotted before a point indicates an out-of-control condition. For a control chart, the ARL is

$$\mathrm{ARL} = \frac{1}{\alpha} \tag{1.1}$$

where $\alpha$ is the probability that a single sample point exceeds the control limits.

Coverage intervals in clinical chemistry are used for mass screening, to confirm a diagnosis and to monitor a patient's disease status. A diagnostic is a test or procedure that helps detect, confirm, document or exclude a disease. An individual is considered normal if his or her test result falls within a pre-specified coverage interval. Once a disease is suspected because a test result falls outside the coverage interval, further intensive tests may be performed to increase or decrease the diagnostic certainty of the diagnosis.
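The control-chart ARL is the mean of a geometric distribution: if each point signals independently with the same probability, the number of points up to and including the first signal has mean equal to the reciprocal of that probability. The following Python sketch checks this by simulation. The function names `arl` and `simulate_run_length`, the seed and the signal probability 0.05 are assumptions chosen for illustration.

```python
import random


def arl(signal_prob):
    """Average run length when each point signals independently with signal_prob."""
    if not 0.0 < signal_prob <= 1.0:
        raise ValueError("signal_prob must lie in (0, 1]")
    return 1.0 / signal_prob


def simulate_run_length(signal_prob, rng):
    """Number of points examined up to and including the first signal."""
    length = 1
    while rng.random() >= signal_prob:
        length += 1
    return length


rng = random.Random(2008)  # arbitrary fixed seed for reproducibility
runs = [simulate_run_length(0.05, rng) for _ in range(20000)]
mean_run = sum(runs) / len(runs)  # should be close to arl(0.05)
```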

How can we measure the effectiveness of a coverage interval in its role in diagnosis in clinical practice? This is important for reducing both the risk of classifying a diseased patient as non-diseased and the risk of classifying a healthy person as diseased. One way is to transfer the concept of ARL from quality control to measurement science. Suppose that there is a sequence of physically healthy individuals. How many of them, on average, will be examined before a disorder is claimed, and is that number tolerable for the laboratory? Can we design a coverage interval that is more effective in detecting an individual with a disorder?

2. Specifications for Evaluating the Coverage Interval

The International Federation of Clinical Chemists (IFCC) standard coverage interval for a measurement variable with distribution function $F$ is an estimate of the central interfractile interval

$$C(1-\alpha) = \left[F^{-1}\!\left(\tfrac{\alpha}{2}\right),\; F^{-1}\!\left(1-\tfrac{\alpha}{2}\right)\right] \tag{2.1}$$

(usually with $\alpha = 0.05$), where $F^{-1}(\gamma)$ is the $\gamma$th fractile of the measurement variable. The parametric method generally assumes that the underlying distribution of the measurement variable is normal. If it is not normal, the classical technique is to apply a known transformation to normality, set the normal limits, and then transform back to obtain the required interval.

We consider the parametric coverage interval where the underlying distribution is known, so that no transformation to approximate normality is needed. Suppose that the parameter value for healthy people is $\theta_0$. Then the true coverage interval is

$$\left[F_{\theta_0}^{-1}\!\left(\tfrac{\alpha}{2}\right),\; F_{\theta_0}^{-1}\!\left(1-\tfrac{\alpha}{2}\right)\right]. \tag{2.2}$$

However, the parameter value $\theta_0$ for the distribution of healthy people is usually unknown, so an estimate is required. All approaches to establishing coverage intervals require large groups of individuals (e.g., a minimum of 120 individuals in the IFCC recommendation). When an appropriate estimate $\hat\theta$ for $\theta$ computed from the measurement values is available, the coverage interval based on the central interfractile interval is

$$\hat C(1-\alpha) = \left[F_{\hat\theta}^{-1}\!\left(\tfrac{\alpha}{2}\right),\; F_{\hat\theta}^{-1}\!\left(1-\tfrac{\alpha}{2}\right)\right]. \tag{2.3}$$

Our interest is: once we have an established coverage interval $\hat C(1-\alpha)$, how does it perform in the diagnosis of disease? Using a coverage interval in diagnosis amounts to testing the following hypotheses:

$$H_0: \text{The individual is healthy} \quad \text{vs.} \quad H_1: \text{The individual is unhealthy}. \tag{2.4}$$

The test is then set as follows:

Accept $H_0$ when the measurement value falls in $\hat C(1-\alpha)$, and reject $H_0$ when the measurement value falls outside $\hat C(1-\alpha)$. (2.5)

An individual is suspected to be abnormal when $H_0$ is rejected. Two errors may occur in a diagnosis based on a coverage interval:

Type I error: The individual is healthy but is claimed to be unhealthy.
Type II error: The individual is unhealthy but is claimed to be healthy.

Our interest in the diagnosis of disease through the estimated coverage interval includes the following:

(a) A $100(1-\alpha)\%$ coverage interval is expected to classify a healthy person as healthy with probability $1-\alpha$. How does the sample coverage interval perform in this respect?

(b) On the other hand, a coverage interval is expected to classify a diseased person as diseased with large probability. How does the sample coverage interval perform in this case? The test procedure is based on the coverage interval.

3. A Study for Normal Distribution

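Let $X_1, \ldots, X_n$ be a random sample drawn from the normal distribution $N(\mu_0, \sigma_0^2)$, where $\mu_0$ and $\sigma_0^2$ are assumed to be unknown. The true $100(1-\alpha)\%$ coverage interval is

$$\left(\mu_0 - z_{1-\alpha/2}\,\sigma_0,\; \mu_0 + z_{1-\alpha/2}\,\sigma_0\right), \tag{3.1}$$

which is also unknown. Hence, it is estimated by the $100(1-\alpha)\%$ normal coverage interval

$$\left(\bar X - z_{1-\alpha/2}\,S,\; \bar X + z_{1-\alpha/2}\,S\right). \tag{3.2}$$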
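As a small illustration of the estimated normal coverage interval and the accept/reject rule of Section 2, the sketch below uses only Python's standard library. The names `normal_coverage_interval` and `accept_h0` are hypothetical, and the eight "healthy" measurements are invented for the example (far fewer than the 120 individuals mentioned in the IFCC recommendation).

```python
from statistics import NormalDist, fmean, stdev


def normal_coverage_interval(sample, alpha=0.05):
    """Sample interval (mean - z*S, mean + z*S) with z the 1 - alpha/2 normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center, spread = fmean(sample), stdev(sample)
    return center - z * spread, center + z * spread


def accept_h0(value, interval):
    """True when the measurement falls inside the interval (individual deemed healthy)."""
    low, high = interval
    return low <= value <= high


# Invented measurements from a healthy reference group (illustration only).
healthy = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
interval = normal_coverage_interval(healthy)
```

Here `accept_h0(5.1, interval)` corresponds to accepting $H_0$ and `accept_h0(5.5, interval)` to rejecting it.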

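Now, suppose that $X_0$ is the characteristic variable of interest for diagnosis based on the coverage interval estimate (3.2). If $X_0$ is in healthy condition, the probability of type I error is derived as follows:

$$\begin{aligned}
P(\text{Type I error}) &= P_{\mu_0,\sigma_0}\!\left(X_0 \notin (\bar X - z_{1-\alpha/2}S,\; \bar X + z_{1-\alpha/2}S)\right) \\
&= 1 - P_{\mu_0,\sigma_0}\!\left(\bar X - z_{1-\alpha/2}S \le X_0 \le \bar X + z_{1-\alpha/2}S\right) \\
&= 1 - P_{\mu_0,\sigma_0}\!\left(-z_{1-\alpha/2} \le \frac{X_0 - \bar X}{S} \le z_{1-\alpha/2}\right) \\
&= 1 - P_{\mu_0,\sigma_0}\!\left(-\frac{z_{1-\alpha/2}}{\sqrt{1+\frac{1}{n}}} \le \frac{X_0 - \bar X}{\sqrt{1+\frac{1}{n}}\,S} \le \frac{z_{1-\alpha/2}}{\sqrt{1+\frac{1}{n}}}\right) \\
&= 1 - P\!\left(-\frac{z_{1-\alpha/2}}{\sqrt{1+\frac{1}{n}}} \le t(n-1) \le \frac{z_{1-\alpha/2}}{\sqrt{1+\frac{1}{n}}}\right),
\end{aligned}$$

where we use the fact that, under $H_0$, $\dfrac{X_0 - \bar X}{\sqrt{1+\frac{1}{n}}\,S} \sim t(n-1)$. Next, suppose that $X_0$ is in unhealthy condition, and let $\mu$ and $\sigma^2$ be the true mean and variance of $X_0$. To derive the probability of type II error, we first derive the desired test statistic. $X_0 - \bar X$ has the normal distribution $N(\mu - \mu_0,\; \sigma^2 + \sigma_0^2/n)$, $(n-1)S^2/\sigma_0^2$ has the chi-square distribution $\chi^2(n-1)$, and these two quantities are independent. We then have

$$T = \frac{X_0 - \bar X}{\sqrt{\left(\frac{\sigma}{\sigma_0}\right)^2 + \frac{1}{n}}\; S} \sim t_{n-1}\!\left(\frac{\mu - \mu_0}{\sqrt{\sigma^2 + \sigma_0^2/n}}\right),$$

where $t_k(a)$ represents the noncentral t distribution with $k$ degrees of freedom and noncentrality parameter $a$. The type II error is derived as follows:

$$\begin{aligned}
P(\text{Type II error}) &= P\!\left(X_0 \in (\bar X - z_{1-\alpha/2}S,\; \bar X + z_{1-\alpha/2}S)\right) \\
&= P\!\left(-\frac{z_{1-\alpha/2}}{\sqrt{\left(\frac{\sigma}{\sigma_0}\right)^2 + \frac{1}{n}}} \le \frac{X_0 - \bar X}{\sqrt{\left(\frac{\sigma}{\sigma_0}\right)^2 + \frac{1}{n}}\; S} \le \frac{z_{1-\alpha/2}}{\sqrt{\left(\frac{\sigma}{\sigma_0}\right)^2 + \frac{1}{n}}}\right) \\
&= P\!\left(-\frac{z_{1-\alpha/2}}{\sqrt{\left(\frac{\sigma}{\sigma_0}\right)^2 + \frac{1}{n}}} \le t_{n-1}\!\left(\frac{\mu - \mu_0}{\sqrt{\sigma^2 + \sigma_0^2/n}}\right) \le \frac{z_{1-\alpha/2}}{\sqrt{\left(\frac{\sigma}{\sigma_0}\right)^2 + \frac{1}{n}}}\right).
\end{aligned}$$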

Let $\delta_1 = \dfrac{\sigma}{\sigma_0}$ and $\delta_2 = \dfrac{\mu - \mu_0}{\sqrt{\sigma^2 + \sigma_0^2/n}}$. We will evaluate this probability for some values of $\delta_1$ and $\delta_2$. In this design, we have

$$\beta = P(\text{Type II error}) = P\!\left(-\frac{z_{1-\alpha/2}}{\sqrt{\delta_1^2 + \frac{1}{n}}} \le t_{n-1}(\delta_2) \le \frac{z_{1-\alpha/2}}{\sqrt{\delta_1^2 + \frac{1}{n}}}\right) \tag{3.3}$$

and the power is $1-\beta$. When $\delta_1 = 1$ and $\delta_2 = 0$, the power equals the probability of type I error. When this is not the case, we expect the power to be large when the deviation is big.
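The power of the estimated normal interval can also be checked by simulation without a noncentral t routine. The sketch below is a Monte Carlo stand-in for the exact noncentral t calculation, not the computation used for the tables: it draws the sample mean and the sample standard deviation directly from their sampling distributions (using the independence stated in the derivation), draws a new observation, and counts how often that observation falls outside the interval. The healthy population is taken as $N(0, 1)$ without loss of generality; the function name `simulate_power`, the seed and the number of replicates are assumptions.

```python
import math
import random
from statistics import NormalDist


def simulate_power(n, mu, sigma, alpha=0.05, reps=50000, seed=1):
    """Monte Carlo estimate of P(X0 outside (Xbar - z*S, Xbar + z*S)).

    The reference sample of size n comes from N(0, 1); X0 comes from N(mu, sigma^2).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    outside = 0
    for _ in range(reps):
        xbar = rng.gauss(0.0, 1.0 / math.sqrt(n))
        # (n - 1) * S^2 is chi-square(n - 1), i.e. Gamma(shape=(n - 1)/2, scale=2).
        s = math.sqrt(rng.gammavariate((n - 1) / 2, 2.0) / (n - 1))
        x0 = rng.gauss(mu, sigma)
        if not (xbar - z * s <= x0 <= xbar + z * s):
            outside += 1
    return outside / reps


# Healthy case (type I error) and a mean shift giving a unit noncentrality parameter
# with unchanged standard deviation.
type_one = simulate_power(n=20, mu=0.0, sigma=1.0)
shifted = simulate_power(n=20, mu=math.sqrt(1 + 1 / 20), sigma=1.0)
arl_healthy = 1 / type_one
```

For $n = 20$ the simulated type I error should land near 0.07 rather than the nominal 0.05, because the interval uses estimated limits.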

Any sequence of sample points that leads to a disorder signal is called a run. The number of individuals examined during a run is called the "run length." Clearly, the run length is very important in evaluating how well a coverage interval performs. Because the run length can vary from run to run, it is statistically more interesting to evaluate the average run length (ARL), defined as

$$\mathrm{ARL} = \frac{1}{1-\beta}. \tag{3.4}$$

If the coverage interval is monitoring a sequence of healthy people, a perfect interval would never generate a signal of disorder; thus, the ARL would be infinitely large. If the coverage interval is monitoring a sequence of unhealthy people, a perfect interval would quickly generate a signal of disorder; thus, a coverage interval with an ARL of 1 would be desired. However, statistically this is not possible.

We would like to see a high ARL when the coverage interval is applied to a group of healthy people and a low ARL when it is applied to a group of unhealthy people. In statistical terms, we expect a high ARL when the parameters of the underlying distribution are on target and a low ARL when the parameters shift to an unsatisfactory level.

Definition 3.1. The average run length (ARL) represents the length of time that consecutive diagnoses must run, on average, before a coverage interval indicates a disorder.

Table 1. Powers for the normal distribution $N(\mu, \sigma^2)$ (two-sided)

$(\delta_1, \delta_2)$   (1,0)     (1,1)     (2,0)     (1,2)     (2,1)     (2,2)
n = 20       0.07098   0.20102   0.34234   0.54309   0.54165   0.84932
n = 30       0.06369   0.19075   0.33717   0.53437   0.53833   0.84873
n = 50       0.05806   0.18250   0.33310   0.52718   0.53571   0.84827
n = 100      0.05398   0.17630   0.33008   0.52165   0.53376   0.84792
n = 500      0.05079   0.17132   0.32769   0.51714   0.53222   0.84764

Table 2. ARL for the normal distribution $N(\mu, \sigma^2)$ (two-sided)

$(\delta_1, \delta_2)$   (1,0)     (1,1)     (2,0)     (1,2)     (2,1)     (2,2)
n = 20       14.0885   4.9745    2.9211    1.8413    1.8462    1.1774
n = 30       15.7011   5.2423    2.9658    1.8713    1.8576    1.1782
n = 50       17.2236   5.4793    3.0021    1.8969    1.8667    1.1789
n = 100      18.5254   5.6721    3.0295    1.9170    1.8735    1.1794
n = 500      19.6889   5.8370    3.0517    1.9337    1.8789    1.1797

We draw several comments from the above two tables:

(a) When $H_0$ is true, the ARL is expected to be 20; that is, on average, one in every 20 healthy people is classified as unhealthy. However, none of the results equals 20, and the ARL can be as small as about 14 for sample size $n = 20$. The ARL increases with the sample size $n$ and approaches 20 as $n$ goes to infinity.

(b) When the parameters move away from the null values, the power increases and the ARL decreases. This meets the expectation for the use of a coverage interval in monitoring an individual's health.

No other approach has studied the ARL in this setting, so we cannot compare this approach with others.

We may also consider a one-sided coverage interval $(-\infty,\; \mu_0 + z_{1-\alpha}\sigma_0)$, whose estimate is

$$(-\infty,\; \bar X + z_{1-\alpha}S).$$

The probability of type II error of this coverage interval estimate can be shown to be

$$\beta = P(\text{Type II error}) = P\!\left(-\infty < t_{n-1}(\delta_2) \le \frac{z_{1-\alpha}}{\sqrt{\delta_1^2 + \frac{1}{n}}}\right)$$

and the power is $1-\beta$. We display the power and ARL results in Tables 3 and 4.

Table 3. Powers for the normal distribution $N(\mu, \sigma^2)$ (one-sided)

$(\delta_1, \delta_2)$   n = 20    n = 30    n = 50    n = 100   n = 500
(1, -1)      0.00613   0.00540   0.00485   0.00446   0.00415
(1, 1)       0.28592   0.27725   0.27022   0.26489   0.26059
(1, -2)      0.00025   0.00020   0.00017   0.00015   0.00013
(1, 2)       0.65648   0.65076   0.64605   0.64244   0.63950
(2, -1)      0.03662   0.03579   0.03514   0.03466   0.03428
(2, 1)       0.57603   0.57415   0.57267   0.57156   0.57068
(2, -2)      0.00269   0.00258   0.00250   0.00244   0.00239
(2, 2)       0.81616   0.88124   0.88095   0.88073   0.88056

Table 4. ARL for the normal distribution $N(\mu, \sigma^2)$ (one-sided)

$(\delta_1, \delta_2)$   n = 20    n = 30    n = 50    n = 100   n = 500
(1, -1)      162.97    185.09    206.09    224.20    240.40
(1, 1)       3.4974    3.6068    3.7006    3.7751    3.8374
(1, -2)      3913.6    4784.8    5676.4    6494.5    7264.3
(1, 2)       1.5233    1.5367    1.5479    1.5566    1.5637
(2, -1)      27.307    27.939    28.454    28.843    29.164
(2, 1)       1.7360    1.7417    1.7462    1.7496    1.7523
(2, -2)      371.39    386.786   399.583   409.474   417.573
(2, 2)       1.2252    1.1348    1.1351    1.1354    1.1356

4. Coverage Intervals for Gamma and Exponential Distributions

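Consider the Gamma distribution $\Gamma(k, \theta)$ with pdf of the form

$$f(x) = \frac{1}{\Gamma(k)\,\theta^{k}}\, x^{k-1} e^{-x/\theta}, \quad x > 0.$$

The $\gamma$th quantile of this distribution is $F_\theta^{-1}(\gamma) = \frac{\theta}{2}\chi^2_{2k}(\gamma)$. The one-sided $1-\alpha$ coverage interval is $C(1-\alpha) = \left(0,\; \frac{\theta}{2}\chi^2_{2k}(1-\alpha)\right)$. With the mle $\hat\theta = \frac{\sum_{i=1}^{n} x_i}{nk}$, a sample coverage interval is

$$\hat C(1-\alpha) = \left(0,\; \frac{\sum_{i=1}^{n} x_i}{2nk}\,\chi^2_{2k}(1-\alpha)\right).$$

Suppose that the true coverage interval is $C(1-\alpha) = \left(0,\; \frac{\theta_0}{2}\chi^2_{2k}(1-\alpha)\right)$. The power function is a function of the parameter $\theta$:

$$\begin{aligned}
\pi(\theta) &= P\!\left(X > \frac{\sum_{i=1}^{n} X_i}{2nk}\,\chi^2_{2k}(1-\alpha)\right) \\
&= P\!\left(\frac{(2X/\theta)/2k}{\left(2\sum_{i=1}^{n} X_i/\theta_0\right)/2nk} > \frac{\theta_0}{2k\,\theta}\,\chi^2_{2k}(1-\alpha)\right) \\
&= P\!\left(F(2k, 2nk) > \frac{\theta_0}{2k\,\theta}\,\chi^2_{2k}(1-\alpha)\right).
\end{aligned}$$

We list the power and ARL results for this Gamma distribution in Tables 5 and 6.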
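For integer shape $k$ the one-sided power can be evaluated exactly with elementary functions, which gives a way to cross-check the tabulated values. The sketch below rests on two standard facts: the chi-square distribution with $2k$ degrees of freedom has an Erlang-type closed-form CDF, and an $F(2k, 2nk)$ probability can be rewritten as a Beta$(k, nk)$ probability, which for integer parameters is a binomial sum. The function names are hypothetical, and the reference sample size $n = 30$ is an assumption: the tables do not state it, but it appears to reproduce the $k = 1$ and $k = 2$ entries of Table 5.

```python
import math


def chi2_cdf_even(x, k):
    """CDF of the chi-square distribution with 2k degrees of freedom (k a positive integer)."""
    half = x / 2.0
    return 1.0 - math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))


def chi2_quantile_even(prob, k):
    """Quantile of chi-square(2k), found by bisection on the closed-form CDF."""
    low, high = 0.0, 1.0
    while chi2_cdf_even(high, k) < prob:
        high *= 2.0
    for _ in range(200):
        mid = (low + high) / 2.0
        if chi2_cdf_even(mid, k) < prob:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def gamma_one_sided_power(k, ratio, n, alpha=0.05):
    """P(F(2k, 2nk) > chi2_{2k}(1 - alpha) / (2k * ratio)), with ratio = theta / theta0."""
    threshold = chi2_quantile_even(1.0 - alpha, k) / (2.0 * k * ratio)
    # F = n * B / (1 - B) with B ~ Beta(k, nk), so F > threshold iff B > u.
    u = threshold / (n + threshold)
    trials = k + n * k - 1
    # For integer parameters, P(B > u) = P(Binomial(trials, u) <= k - 1).
    return sum(math.comb(trials, j) * u ** j * (1.0 - u) ** (trials - j)
               for j in range(k))


power_k1 = gamma_one_sided_power(k=1, ratio=1.0, n=30)
power_k2 = gamma_one_sided_power(k=2, ratio=1.0, n=30)
arl_k1 = 1.0 / power_k1
```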

Table 5. Powers for the Gamma distribution $\Gamma(k, \theta)$ (one-sided)

$\theta/\theta_0$   0.5       1         5         20
k = 1     0.00424   0.05753   0.55253   0.86121
k = 2     0.00137   0.05613   0.75445   0.97566
k = 3     0.00056   0.05550   0.86526   0.99577
k = 4     0.00025   0.05513   0.92660   0.99928
k = 5     0.00012   0.05487   0.96034   0.99987
k = 6     0.00006   0.05468   0.97873   0.99997
k = 7     0.00003   0.05454   0.98867   0.99999
k = 8     0.00001   0.05442   0.99400   0.99999
k = 9     0.00001   0.05433   0.99684   0.99999
k = 10    0.00000   0.05425   0.99834   0.99999
k = 12    0.00000   0.05412   0.99955   1.00000
k = 15    0.00000   0.05397   0.99993   1.00000
k = 20    0.00000   0.05381   0.99999   1.00000

Table 6. ARL for the Gamma distribution $\Gamma(k, \theta)$ (one-sided)

$\theta/\theta_0$   0.5        1        5        20
k = 1     235.69     17.381   1.8098   1.1612
k = 2     727.76     17.813   1.3255   1.0249
k = 3     1780.6     18.015   1.1557   1.0042
k = 4     3904.8     18.138   1.0792   1.0007
k = 5     8012.0     18.222   1.0413   1.0001
k = 6     15702      18.285   1.0217   1.0000
k = 7     29748      18.333   1.0115   1.0000
k = 8     54888      18.373   1.0060   1.0000
k = 9     99132      18.405   1.0032   1.0000
k = 10    175905     18.433   1.0017   1.0000
k = 12    530669     18.477   1.0004   1.0000
k = 15    2562981    18.526   1.0001   1.0000
k = 20    30419787   18.581   1.0000   1.0000

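For the two-sided coverage interval $\frac{\theta}{2}\left(\chi^2_{2k}\!\left(\tfrac{\alpha}{2}\right),\; \chi^2_{2k}\!\left(1-\tfrac{\alpha}{2}\right)\right)$, the estimate is

$$\hat C(1-\alpha) = \frac{\sum_{i=1}^{n} X_i}{2nk}\left(\chi^2_{2k}\!\left(\tfrac{\alpha}{2}\right),\; \chi^2_{2k}\!\left(1-\tfrac{\alpha}{2}\right)\right).$$

The power of this coverage interval estimate is

$$\pi(\theta) = 1 - P\!\left(\frac{\theta_0}{2k\,\theta}\,\chi^2_{2k}\!\left(\tfrac{\alpha}{2}\right) \le F(2k, 2nk) \le \frac{\theta_0}{2k\,\theta}\,\chi^2_{2k}\!\left(1-\tfrac{\alpha}{2}\right)\right).$$

Some of the power and ARL results for this two-sided case are listed in Tables 7 and 8.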

Table 7. Powers for the Gamma distribution $\Gamma(k, \theta)$ (two-sided)

$\theta/\theta_0$   0.5       1         5         20
k = 1     0.05069   0.05582   0.48751   0.83330
k = 2     0.08650   0.05489   0.69533   0.96742
k = 3     0.12999   0.05455   0.82175   0.99384
k = 4     0.17816   0.05437   0.89733   0.99887
k = 5     0.22926   0.05427   0.94165   0.99979
k = 6     0.28194   0.05420   0.96722   0.99996
k = 7     0.33506   0.05415   0.98176   0.99999
k = 8     0.38771   0.05411   0.98994   0.99999
k = 9     0.43913   0.05408   0.99449   0.99999
k = 10    0.48874   0.05406   0.99701   0.99999
k = 12    0.58083   0.05402   0.99913   1.00000
k = 15    0.69790   0.05398   0.99986   1.00000
k = 20    0.83608   0.05395   0.99999   1.00000

Table 8. ARL for the Gamma distribution $\Gamma(k, \theta)$ (two-sided)

$\theta/\theta_0$   0.5      1        5        20
k = 1     19.724   17.913   3.2461   1.4376
k = 2     11.559   18.218   2.2077   1.1215
k = 3     7.6925   18.330   1.2169   1.0062
k = 4     5.6128   18.389   1.1144   1.0011
k = 5     4.3618   18.425   1.0620   1.0002
k = 6     3.5468   18.449   1.0339   1.0000
k = 7     2.9845   18.466   1.0186   1.0000
k = 8     2.5792   18.479   1.0102   1.0000
k = 9     2.2772   18.489   1.0055   1.0000
k = 10    2.0461   18.497   1.0030   1.0000
k = 12    1.7217   18.510   1.0009   1.0000
k = 15    1.4329   18.522   1.0001   1.0000
k = 20    1.1960   18.534   1.0000   1.0000

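Let $X_1, \ldots, X_n$ be a random sample drawn from the exponential distribution with probability density function

$$f(x) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x > 0.$$

The distribution function is $F(x) = 1 - e^{-x/\theta}$. Hence, the population quantile function is $F^{-1}(\gamma) = -\theta\ln(1-\gamma)$, so a $100(1-\alpha)\%$ population coverage interval is

$$\left(-\theta\ln\!\left(1-\tfrac{\alpha}{2}\right),\; -\theta\ln\!\left(\tfrac{\alpha}{2}\right)\right).$$

An appropriate estimate of $\theta$ is $\bar X$, and a sample $100(1-\alpha)\%$ coverage interval is

$$\left(-\bar X\ln\!\left(1-\tfrac{\alpha}{2}\right),\; -\bar X\ln\!\left(\tfrac{\alpha}{2}\right)\right).$$

Suppose that the parameter for healthy people is $\theta_0$. The type I error probability is derived as follows:

$$\begin{aligned}
P(\text{Type I error}) &= P_{\theta_0}\!\left(X_0 \notin \left(-\bar X\ln\!\left(1-\tfrac{\alpha}{2}\right),\; -\bar X\ln\!\left(\tfrac{\alpha}{2}\right)\right)\right) \\
&= 1 - P_{\theta_0}\!\left(-\frac{\sum_{i=1}^{n} X_i\,\ln\!\left(1-\frac{\alpha}{2}\right)}{n} \le X_0 \le -\frac{\sum_{i=1}^{n} X_i\,\ln\!\left(\frac{\alpha}{2}\right)}{n}\right) \\
&= 1 - P_{\theta_0}\!\left(-\frac{\ln\!\left(1-\frac{\alpha}{2}\right)}{n} \le \frac{X_0}{\sum_{i=1}^{n} X_i} \le -\frac{\ln\!\left(\frac{\alpha}{2}\right)}{n}\right) \\
&= 1 - P\!\left(-\ln\!\left(1-\tfrac{\alpha}{2}\right) \le F(2, 2n) \le -\ln\!\left(\tfrac{\alpha}{2}\right)\right),
\end{aligned}$$

where we use the fact that

$$\frac{n X_0}{\sum_{i=1}^{n} X_i} = \frac{(2X_0/\theta_0)/2}{\left(2\sum_{i=1}^{n} X_i/\theta_0\right)/2n} \sim F(2, 2n).$$

The probability of type II error when the true parameter is $\theta$ is

$$\beta = P(\text{Type II error}) = P\!\left(-\frac{1}{\delta}\ln\!\left(1-\tfrac{\alpha}{2}\right) \le F(2, 2n) \le -\frac{1}{\delta}\ln\!\left(\tfrac{\alpha}{2}\right)\right),$$

where $\delta = \theta/\theta_0$. We take the $(1-\alpha) = 0.95$ coverage interval as an example and list the results in Tables 9 and 10.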
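The two-sided exponential case needs no special functions, because the $F(2, 2n)$ distribution has the closed-form CDF $1 - (1 + x/n)^{-n}$ (a standard fact: it is the ratio of an exponential variable to an independent scaled gamma variable). The sketch below evaluates the type II error probability of this section and the corresponding ARL; the function names are hypothetical, and `delta` denotes the ratio of the true parameter to the healthy-population parameter.

```python
import math


def f_2_2n_cdf(x, n):
    """CDF of the F(2, 2n) distribution: 1 - (1 + x/n)^(-n)."""
    return 1.0 - (1.0 + x / n) ** (-n)


def exponential_two_sided_power(delta, n, alpha=0.05):
    """Power 1 - beta of the two-sided sample coverage interval."""
    lower = -math.log(1.0 - alpha / 2.0) / delta
    upper = -math.log(alpha / 2.0) / delta
    beta = f_2_2n_cdf(upper, n) - f_2_2n_cdf(lower, n)
    return 1.0 - beta


power = exponential_two_sided_power(delta=1.0, n=30)
average_run_length = 1.0 / power
```

Looping `delta` over 0.2, 0.5, ..., 3 and `n` over 5, 20, 30, 50 gives a grid that can be compared with the tabulated powers and ARLs.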

Table 9. Powers for the exponential distribution $\mathrm{Exp}(\theta)$ (two-sided) (Assume $\theta_0 = \theta/\delta$)

$\delta$        n = 5     n = 20    n = 30    n = 50
0.2      0.11795   0.11855   0.11867   0.11876
0.5      0.05988   0.05118   0.05069   0.05037
0.8      0.06916   0.04690   0.04484   0.04328
1        0.08803   0.05884   0.05582   0.05345
1.5      0.15203   0.11506   0.11080   0.10738
2        0.22060   0.18388   0.17954   0.17602
2.5      0.28451   0.25090   0.24690   0.24366
3        0.34147   0.31161   0.30806   0.30518

Table 10. ARL for the exponential distribution $\mathrm{Exp}(\theta)$ (two-sided) (Assume $\theta_0 = \theta/\delta$)

$\delta$        n = 5     n = 20    n = 30    n = 50
0.2      8.4777    8.4349    8.4267    8.4201
0.5      16.697    19.536    19.724    19.850
0.8      14.459    21.320    22.296    23.100
1        11.358    16.993    17.913    18.706
1.5      6.5775    8.6910    9.0245    9.3119
2        4.5329    5.4382    5.5697    5.6809
2.5      3.5147    3.9856    4.0501    4.1040
3        2.9285    3.2091    3.2461    3.2767

We now consider the one-sided coverage interval $(0,\; -\theta\ln(\alpha))$, which is estimated by $(0,\; -\bar X\ln(\alpha))$. The probability of type II error is

$$\beta = P(\text{Type II error}) = P\!\left(0 < F(2, 2n) \le -\frac{1}{\delta}\ln(\alpha)\right).$$

Again taking $1-\alpha = 0.95$, we list the power and ARL in Tables 11 and 12.

Table 11. Powers for the exponential distribution $\mathrm{Exp}(\theta)$ (one-sided) (Assume $\theta_0 = \theta/\delta$)

$\delta$        n = 5     n = 20    n = 30    n = 50
0.2      0.00098   0.00001   0.00000   0.00000
0.5      0.01947   0.00529   0.00424   0.00318
0.8      0.06111   0.03230   0.02934   0.02702
1        0.09562   0.06172   0.05753   0.05450
1.5      0.18631   0.14902   0.14464   0.14109
2        0.26977   0.23588   0.21848   0.22858
2.5      0.34157   0.31230   0.30882   0.30600
3        0.40235   0.37740   0.37444   0.37204

Table 12. ARL for the exponential distribution $\mathrm{Exp}(\theta)$ (one-sided) (Assume $\theta_0 = \theta/\delta$)

$\delta$        n = 5     n = 20    n = 30    n = 50
0.2      1018.5    71428     188679    500000
0.5      51.336    188.80    235.69    286.80
0.8      16.363    30.954    34.081    37.005
1        10.457    16.201    17.381    18.346
1.5      5.3673    6.7101    6.7136    7.0873
2        3.7068    4.2394    4.5770    4.3748
2.5      2.9276    3.2020    3.2381    3.2679
3        2.4854    2.6497    2.6706    2.6878

Topic 2: p Value of an Outlier Sum in Differential Gene Expression Analysis

Abstract

The outlier sum has been proposed in Tibshirani and Hastie (2007) and Wu (2007) for the detection of differential genes in cancer studies where one or several disease groups show unusually high gene expression in a subset of their samples. A new outlier sum is proposed that allows us to develop its asymptotic distribution theory for formulating a p value. Since it is a function of some distributional parameters, this p value may be computed parametrically or nonparametrically. We further formulate this p value parametrically when a normal distribution is assumed for the gene variables. To investigate this p value, we perform a simulation and conduct a real data analysis, which indicates that this outlier sum not only allows us to compute p values for genes but is also flexible in handling various distributional structures for gene variables.

Key words: Gene expression analysis; outlier sum; p value.

5. Introduction

Microarray technology, which probes thousands of genes simultaneously, has been successfully used in medical research to classify different diseases (see, for example, Agrawal et al. (2002), Alizadeh et al. (2000), Ohki et al. (2005) and Sorlie et al. (2003)). For example, two molecular subtypes of breast cancer (two distinct gene expression patterns), the luminal A and basal-like subtypes, have been reported to have different clinical outcomes (see Sorlie et al. (2003)). Another example is diffuse large B-cell lymphoma (DLBCL): patients with one particular molecular pattern, germinal centre B-like DLBCL, had significantly better overall survival than those with another molecular pattern, activated B-like DLBCL (see Alizadeh et al. (2000)). Furthermore, microarray analysis has been advanced to identify outlier genes that are over-expressed only in a small number of disease samples (see Beer et al. (2002), Tibshirani and Hastie (2007) and Tomlins et al. (2005)), such as recurrent chromosomal rearrangements (one type of chromosomal mutation), which are common in lymphoma and leukemia but rare in other cancers. Standard statistical methods for two-group comparisons (e.g., t-tests) are limited in their ability to identify such genes when distinguishing tumor from normal samples.

Several statistical approaches have been proposed to address the problem of finding genes where only a subset of the samples has high expression. Among them, Tomlins et al. (2005) introduced a method called cancer outlier profile analysis (COPA). Later, Tibshirani and Hastie (2007) introduced a sum of the values in the cancer group, called the outlier sum, and showed in simulations of p values that the outlier sum technique is noticeably better than COPA. An alternative statistic similar to the outlier sum was proposed by Wu (2007). Basically, these methods pool, in various ways, an outlier score, which is a standardized score centered at the median and scaled by the median absolute deviation. A larger outlier score indicates an outlier gene. The outlier sum statistics are very promising for detecting genes where only a subset of the samples has high expression. Unfortunately, without a distribution theory for the outlier sum statistic, its power (see the simulations in Tibshirani and Hastie (2007)) in gene expression analysis relies on knowing the number of genes whose samples have high expression. This is usually not the case in practice, and then there is no natural cut-off point for deciding the number of influential genes.

We propose non-standardized outlier sum statistics and develop a technique for computing p values for genes. One interesting result is that this [...] of outlier genes and non-outlier genes. So, this does not require that there be only one outlier gene. The studies of gene expression detection such as the t test, Tibshirani and Hastie (2007) and Wu (2007) all assume that the underlying distributions for all genes are normal. Hence, under this distribution, we further derive a simpler formula for p values and perform simulations to evaluate its ability to detect outlier genes. The formula developed in this paper makes the study of p values straightforward for other parametric distributions and for nonparametric techniques; however, we do not pursue this further.

6. General Formulation for Outlier Means

Suppose that there are $m$ genes of concern and that for each gene there are two groups of subjects, one normal (healthy) group and one cancer (disease) group. We assume that $n_1$ and $n_2$ expression variables are available for the two groups respectively, arranged as follows:

            Normal group                    Cancer group
Gene 1      $X_{11}, \ldots, X_{1n_1}$      $Y_{11}, \ldots, Y_{1n_2}$
Gene 2      $X_{21}, \ldots, X_{2n_1}$      $Y_{21}, \ldots, Y_{2n_2}$
...         ...                             ...
Gene m      $X_{m1}, \ldots, X_{mn_1}$      $Y_{m1}, \ldots, Y_{mn_2}$          (6.1)

The outlier sums for gene expression in the literature implicitly define three parameters:

$H_1$: Centering parameter for measuring the distance of observations in the Y group
$H_2$: Threshold for identifying observations from the Y group as outliers
$H_3$: Scale parameter for standardizing an outlier sum

Let $H_{1j}, H_{2j}, H_{3j}$ represent, respectively, the above three parameters for gene $j$, and assume that appropriate estimators $\hat H_{1j}, \hat H_{2j}, \hat H_{3j}$, based on the variables in gene $j$, are available for estimating these parameters.

The outlier sum statistic for gene $j$ defined by Tibshirani and Hastie (2007) and Wu (2007) may be represented in the general form

$$W_j = \sum_{i=1}^{n_2} \frac{Y_{ji} - \hat H_{1j}}{\hat H_{3j}}\, I\!\left(Y_{ji} > \hat H_{2j}\right) \tag{6.2}$$

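where $\hat H_{1j}$, $\hat H_{2j}$ and $\hat H_{3j}$ are estimates of $H_{1j}$, $H_{2j}$ and $H_{3j}$ respectively.

Let $F_{xj}$ and $F_{yj}$, respectively, be the distribution functions from which $\{X_{ji}, i = 1, \ldots, n_1\}$ and $\{Y_{ji}, i = 1, \ldots, n_2\}$ are drawn. We denote

$\hat F_{xj}^{-1}(\gamma)$: the $\gamma$th percentile of the set $\{X_{ji}, i = 1, \ldots, n_1\}$

$\hat L_j^{-1}(\gamma)$: the $\gamma$th percentile of the set $\{X_{ji}, i = 1, \ldots, n_1,\; Y_{ji}, i = 1, \ldots, n_2\}$

$\mathrm{med}_{xj} = \hat F_{xj}^{-1}(0.5)$, $\mathrm{med}_{yj} = \hat F_{yj}^{-1}(0.5)$, $\mathrm{med}_j = \hat L_j^{-1}(0.5)$

$\mathrm{IQR}_{xj} = \hat F_{xj}^{-1}(0.75) - \hat F_{xj}^{-1}(0.25)$, $\mathrm{IQR}_j = \hat L_j^{-1}(0.75) - \hat L_j^{-1}(0.25)$

$\mathrm{mad}_{xj} = 1.4826 \cdot \mathrm{median}\{|Y_{ji} - \mathrm{med}_{xj}|,\; i = 1, \ldots, n_2\}$

where the constant 1.4826 is chosen so that $\mathrm{mad}_{xj}$ is approximately equal to the standard deviation under normality.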

To compare the two approaches to outlier sums by Tibshirani and Hastie (2007) and Wu (2007), we express their formulations in a table. This representation allows us to generate alternative outlier sums by choosing the estimates $\hat H_{1j}$, $\hat H_{2j}$ and $\hat H_{3j}$ in different ways, for example with robustness or efficiency in mind.

Table 14. Comparison of parameter estimates for the outlier sums method and the outlier robust t method

Parameter estimate   Tibshirani and Hastie                     Wu
------------------   ---------------------                     --
$\hat H_{1j}$        $\mathrm{med}_j$                          $\mathrm{med}_{xj}$
$\hat H_{2j}$        $\hat L_j^{-1}(0.75) + \mathrm{IQR}_j$    $\hat F_{xj}^{-1}(0.75) + \mathrm{IQR}_{xj}$
$\hat H_{3j}$        $\mathrm{mad}_{xj}$                       $\mathrm{median}\{|X_{ji} - \mathrm{med}_{xj}|_{i=1}^{n_1},\; |Y_{ji} - \mathrm{med}_{yj}|_{i=1}^{n_2}\}$
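The general form of the standardized outlier sum, with the two sets of estimates compared in Table 14, can be sketched as follows. This is an illustration rather than either author's reference implementation: the function names, the linear-interpolation percentile rule and the toy expression values are assumptions, and the scale estimate in the Tibshirani and Hastie branch follows the definition of the median absolute deviation exactly as written in this section.

```python
from statistics import median


def percentile(values, gamma):
    """Empirical quantile with linear interpolation (one of several conventions)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * gamma
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def outlier_sum(x, y, scheme):
    """Standardized outlier sum for one gene; x = normal group, y = cancer group."""
    med_x, med_y = median(x), median(y)
    if scheme == "tibshirani-hastie":
        pooled = list(x) + list(y)
        q25, q75 = percentile(pooled, 0.25), percentile(pooled, 0.75)
        h1 = median(pooled)                      # centering: pooled median
        h2 = q75 + (q75 - q25)                   # threshold: pooled Q3 + IQR
        # Scale: deviations of the y values from med_x, as the text defines it.
        h3 = 1.4826 * median(abs(v - med_x) for v in y)
    elif scheme == "wu":
        q25, q75 = percentile(x, 0.25), percentile(x, 0.75)
        h1 = med_x                               # centering: normal-group median
        h2 = q75 + (q75 - q25)                   # threshold: normal-group Q3 + IQR
        h3 = median([abs(v - med_x) for v in x] + [abs(v - med_y) for v in y])
    else:
        raise ValueError("unknown scheme")
    return sum((v - h1) / h3 for v in y if v > h2)


normal_group = [1, 2, 3, 4, 5]    # made-up expression values
cancer_group = [1, 2, 3, 4, 20]   # one clear outlier
w_th = outlier_sum(normal_group, cancer_group, "tibshirani-hastie")
w_wu = outlier_sum(normal_group, cancer_group, "wu")
```

In this toy data only the value 20 exceeds either threshold, so both statistics reduce to one standardized term and differ only through the scale estimate.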

When gene expression values $x_{ji}, i = 1, \ldots, n_1$, and $y_{ji}, i = 1, \ldots, n_2$, are available, we can evaluate the values $w_j$ of the outlier sum statistics $W_j$ of (6.2). The technique applied in Tibshirani and Hastie (2007) for gene expression analysis computes the p values as

$$p_{jw} = \frac{1}{m}\sum_{j' \neq j} \cdots \tag{6.3}$$

The genes with smaller p values are suspected to be significant genes.

It is desirable to define p values in a probabilistic sense. Suppose that we have a statistic $t(Z)$, where $Z$ is a random sample from a distribution involving a parameter $\theta$, and consider the null hypothesis $H_0: \theta = \theta_0$. The classical significance test defines the p value as

$$p_t = P_{\theta_0}\{t(Z) \text{ at least as extreme as the observed } t(z)\} \tag{6.4}$$

where $z$ is the realization of the random sample $Z$. Extending this concept, an appropriate p value for gene expression based on outlier sums takes the form

$$p_j = P_{F_{xj}}\{W_j \ge w_j\}, \quad j = 1, \ldots, m \tag{6.5}$$

where the statistic $W_j$ involves the distributions $F_{xj}$ and $F_{yj}$, since it is a function of $\{X_{ji}\}$ and $\{Y_{ji}\}$, but we take $F_{xj} = F_{yj}$ in (6.5).
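A probability-based p value of this kind can be approximated by simulation when no distribution theory is at hand. The sketch below is a parametric Monte Carlo stand-in, not the asymptotic formula developed in this thesis: it fits a normal distribution to the normal group, simulates both groups from that common distribution, and reports the share of simulated statistics at least as large as the observed one. The statistic used is a non-standardized outlier sum with Wu's threshold; all function names, the seed, the number of replicates and the toy data are assumptions.

```python
import random
from statistics import fmean, stdev


def percentile(values, gamma):
    """Empirical quantile with linear interpolation (one of several conventions)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * gamma
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def outlier_sum_stat(x, y):
    """Sum of the y values above the threshold 2*Q3 - Q1 computed from x."""
    threshold = 2 * percentile(x, 0.75) - percentile(x, 0.25)
    return sum(v for v in y if v > threshold)


def monte_carlo_p_value(x, y, reps=500, seed=0):
    """Estimate P(W >= w) when both groups share one normal distribution fitted to x."""
    rng = random.Random(seed)
    mu, sigma = fmean(x), stdev(x)
    observed = outlier_sum_stat(x, y)
    hits = 0
    for _ in range(reps):
        x_sim = [rng.gauss(mu, sigma) for _ in x]
        y_sim = [rng.gauss(mu, sigma) for _ in y]
        if outlier_sum_stat(x_sim, y_sim) >= observed:
            hits += 1
    return (hits + 1) / (reps + 1)  # add-one correction keeps the estimate above zero


normal_group = [i / 10 - 1 for i in range(21)]   # made-up values from -1.0 to 1.0
p_outlier = monte_carlo_p_value(normal_group, [0.1, -0.3, 0.5, 6.0, 7.0])
p_null = monte_carlo_p_value(normal_group, [0.1, -0.3, 0.5, 0.2, -0.6])
```

The first cancer group contains two extreme values and should receive a very small p value, while the second contains none and should receive a p value close to one.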

We consider a non-centered and non-scaled outlier sum statistic in the fol-lowing and use it to introduce a test statistic that does involve centering and scaling estimates.

Definition 6.1. The outlier sum statistic for the $j$th gene is

$$\tilde\varrho_j = \sum_{i=1}^{n_2} Y_{ji}\, I(Y_{ji} > \hat H_j). \qquad (6.6)$$

The aim of this paper is to develop p values for the outlier sum statistics $\tilde\varrho_j$, $j = 1, \ldots, m$.
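The outlier sum of Definition 6.1 is simply the sum of the second-group observations that strictly exceed a threshold. A minimal Python sketch follows; the function name `outlier_sum` is hypothetical, and the threshold is passed in as a number so that any of the threshold choices discussed in the text can be plugged in.

```python
def outlier_sum(y, threshold):
    """Sum of the values in y that strictly exceed the threshold, as in (6.6)."""
    return sum(value for value in y if value > threshold)
```

For example, with `y = [0.2, 1.5, 3.0, 4.5]` and a threshold of 2.0, only 3.0 and 4.5 contribute, so the outlier sum is 7.5. When no observation exceeds the threshold the statistic is 0.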

7. Formulation of p Value with Normal Samples

From now on, for simplicity, we drop the index $j$. The threshold suggested by Wu (2007) is

$$\hat H_a = \hat F_x^{-1}(0.75) + \mathrm{IQR}_x = 2\hat F_x^{-1}(0.75) - \hat F_x^{-1}(0.25).$$

For later comparison, we suggest a flexible type of threshold,

$$\hat H_b = \hat F_x^{-1}(0.5) + 1.5k\,\bigl(\hat F_x^{-1}(0.75) - \hat F_x^{-1}(0.25)\bigr).$$

We further denote the outlier mean $\varrho$ by $\varrho_a$ when its threshold is $\hat H = \hat H_a$ and by $\varrho_b$ when $\hat H = \hat H_b$.

We have some notes on the design of the threshold $\hat H_b$:

(a) Suppose that the underlying distribution $F_x$ is normal. Then $\hat H_a$ and $\hat H_b$ with $k = 1$ are both estimates of $\mu_x + 3\sigma_x z_{0.75}$. Hence, $\hat H_b$ with $k = 1$ is asymptotically equivalent to $\hat H_a$.

(b) A small $k$ makes the outlier sum able to detect any positive outliers in the second group. The larger the outliers, the higher the efficiency. However, many genes may then be identified as outlier genes, since their p values all indicate a significant difference.

(c) A larger $k$ can only detect a larger shift in distribution and will probably not be able to detect a smaller shift.

(d) We will see later that when $k = 1$ the p values $p_a$ and $p_b$ are identical.
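Note (a) can be checked numerically at the population level. The sketch below evaluates both thresholds with exact normal quantiles in place of the empirical ones. It assumes that the flexible threshold has the form median plus 1.5k times the interquartile range, which is how the normal-theory calculation later in this section expands it. The function name `thresholds_normal` is hypothetical.

```python
from statistics import NormalDist


def thresholds_normal(mu, sigma, k):
    """Population versions of the two thresholds for a N(mu, sigma^2) reference group.

    h_a: upper quartile plus IQR (the threshold attributed to Wu, 2007).
    h_b: median plus 1.5 * k * IQR (assumed form of the flexible threshold).
    """
    dist = NormalDist(mu, sigma)
    q25, q50, q75 = (dist.inv_cdf(p) for p in (0.25, 0.50, 0.75))
    h_a = 2 * q75 - q25
    h_b = q50 + 1.5 * k * (q75 - q25)
    return h_a, h_b
```

With `k = 1` both values reduce to `mu + 3 * z_0.75 * sigma`, where `z_0.75` is about 0.6745; larger `k` moves only `h_b`, to `mu + 3 * k * z_0.75 * sigma`.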

We now assume that $\{X_i\}$ and $\{Y_i\}$ are two random samples from the normal distributions $N(\mu_x, \sigma_x^2)$ and $N(\mu_y, \sigma_y^2)$, respectively. With $\phi$ denoting the probability density function of the standard normal distribution $N(0, 1)$, we further let $\phi_{\mu,\sigma}$ be the probability density function of the normal distribution $N(\mu, \sigma^2)$.

Under the normality assumption, $F_x^{-1}(\alpha) = \mu_x + z_\alpha\sigma_x$ implies that $F_x^{-1}(0.5) + 1.5k\,(F_x^{-1}(0.75) - F_x^{-1}(0.25)) = \mu_x + 3k z_{0.75}\sigma_x$. Hence, the outlier sum may be reformulated as

$$\tilde\varrho = \sum_{i=1}^{n_2} Y_i\, I(Y_i > \hat\mu_x + 3k z_{0.75}\hat\sigma_x),$$

which requires only the estimators $\hat\mu_x$ and $\hat\sigma_x$. Furthermore, the p value is evaluated under the assumption that $H_0$ is true. Hence, we may let $\mu_x = \mu_y$, $\sigma_x = \sigma_y$ and,

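Section 4 are as follows:

$$\tilde\varrho = \sum_{i=1}^{n_2} y_i\, I(y_i > \hat\mu_x + 3k z_{0.75}\hat\sigma_x),$$

$$\beta = \int_{3kz_{0.75}}^{\infty} \phi(z)\,dz, \quad \text{a known constant},$$

$$\mu = \mu_x + \frac{\sigma_x}{\beta}\int_{3kz_{0.75}}^{\infty} z\,\phi(z)\,dz,$$

$$b_1 = \frac{1}{\beta}\, 3k z_{0.75}\,\sigma_x\,\phi(3k z_{0.75})\,\sqrt{h}\,\phi^{-1}(0),$$

$$b_2 = b_3 = 1.5k\,\frac{b_1}{\phi^{-1}(0)}\,\phi^{-1}(z_{0.75}),$$

$$v = \frac{\sigma_x^2}{\beta^2}\left[\int_{3kz_{0.75}}^{\infty} z^2\phi(z)\,dz - \Bigl(\int_{3kz_{0.75}}^{\infty} z\,\phi(z)\,dz\Bigr)^2\right],$$

where $b_1$, $b_2$, $b_3$ and $v$ are used to formulate $\sigma^2 = \sigma^2(b_1, b_2, b_3, v)$, with

$$\sigma^2(b_1, b_2, b_3, v) = 0.25\,\bigl[\,0.75\,(0.5b_1 + 0.25b_2 - 0.75b_3)^2 + (0.5b_1 + 0.25b_2 + 0.25b_3)^2 + (-0.5b_1 + 0.25b_2 + 0.25b_3)^2 + (-0.5b_1 - 0.75b_2 + 0.25b_3)^2\bigr] + v.$$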

From the formulations stated earlier, we need only specify estimators of $h$, $\mu_x$ and $\sigma_x$.

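Theorem 7.1. Suppose that $\{X_i\}$ and $\{Y_i\}$ are, respectively, random samples from the distributions $N(\mu_x, \sigma_x^2)$ and $N(\mu_y, \sigma_y^2)$. Then, under $H_0: \mu_x = \mu_y,\ \sigma_x^2 = \sigma_y^2$,

$$W = W(X_i, Y_i) = \sqrt{n_2}\left(\frac{\tilde\varrho - n_2\beta\mu}{\sqrt{\sigma^2}\, n_2\beta}\right) \qquad (7.1)$$

converges in distribution to the standard normal distribution.

We then apply an estimator of $W$ of (7.1) as the test statistic.

Definition 7.2. Suppose that we have appropriate estimators of $\mu$ and $\sigma$. Then we define the test statistic as

$$\tilde W = \tilde W(X_i, Y_i) = \sqrt{n_2}\left(\frac{\tilde\varrho - n_2\hat\beta\hat\mu}{\sqrt{\hat\sigma^2}\, n_2\hat\beta}\right). \qquad (7.2)$$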

Definition 7.3. Suppose that the outlier mean $\varrho_j$ has the asymptotic property of (6.2) and that there are $\hat\beta_j$, $\hat\mu_j$ and $\hat\sigma_j$, estimates, respectively, of $\beta_j$, $\mu_j$ and $\sigma_j$, based on the observations $x_{ji}$. We define the p value for gene $j$ as

$$p_j = \int_{\sqrt{n_2}\left(\frac{\tilde\varrho_j - \hat\beta_j n_2\hat\mu_j}{\sqrt{\hat\sigma_j^2}\, n_2\hat\beta_j}\right)}^{\infty} \phi(z)\,dz, \quad j = 1, \ldots, m. \qquad (7.3)$$

We have two notes on the p values so specified:

(a) The estimates $\hat\beta_j$, $\hat\mu_j$ and $\hat\sigma_j$ are designed to be computed from the data $x_{ji}$, since the p value measures how significant the observed $\tilde\varrho_j$ is when the $y_{ji}$ are drawn from the same distribution as the $x_{ji}$.

(b) Suppose that the $p_j$ are available for all $j$. The genes whose $p_j$ are relatively small are then suspected to be influential, and those with relatively large $p_j$ are not. This resolves the difficulty of the ordinal p values proposed in the literature for outlier sum statistics, which cannot determine a finite set of influential genes when the true number of influential genes is unknown.

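Let $\hat h = n_2/n_1$, $\hat\mu_x = \bar x = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i$ and $\hat\sigma_x^2 = s_x^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}(x_i - \bar x)^2$. The elements needed to compute the observed value of the test statistic

$$\tilde W(X_i, Y_i) = \sqrt{n_2}\left(\frac{\tilde\varrho - n_2\beta\hat\mu}{\sqrt{\hat\sigma^2}\, n_2\beta}\right)$$

are the following:

$$\tilde\varrho = \sum_{i=1}^{n_2} y_i\, I(y_i > \bar x + 3k z_{0.75} s_x),$$

$$\beta = \int_{3kz_{0.75}}^{\infty} \phi(z)\,dz, \quad \text{a known constant},$$

$$\hat\mu = \bar x + \frac{s_x}{\beta}\int_{3kz_{0.75}}^{\infty} z\,\phi(z)\,dz,$$

$$\hat b_1 = \frac{1}{\beta}\, 3k z_{0.75}\, s_x\,\phi(3k z_{0.75})\,\sqrt{\hat h}\,\phi^{-1}(0),$$

$$\hat b_2 = \hat b_3 = 1.5k\,\frac{\hat b_1}{\phi^{-1}(0)}\,\phi^{-1}(z_{0.75}),$$

$$\hat v = \frac{s_x^2}{\beta^2}\left[\int_{3kz_{0.75}}^{\infty} z^2\phi(z)\,dz - \Bigl(\int_{3kz_{0.75}}^{\infty} z\,\phi(z)\,dz\Bigr)^2\right]. \qquad (7.4)$$

Then the asymptotic variance $\sigma^2$ is estimated as

$$\hat\sigma^2 = \sigma^2(\hat b_1, \hat b_2, \hat b_3, \hat v) \qquad (7.5)$$

and the p value of (6.4) is

$$p = \int_{\sqrt{n_2}\left(\frac{\tilde\varrho - n_2\beta\hat\mu}{\sqrt{\hat\sigma^2}\, n_2\beta}\right)}^{\infty} \phi(z)\,dz. \qquad (7.6)$$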

The p value of (7.6) uses only $\bar x$ and $s_x$ to estimate $\mu_x$ and $\sigma_x$ in formulating $\hat\mu$ and $\hat\sigma$. The computation of the p value under the normality assumption is therefore very simple. If $G_x$ and $G_y$ are known but not normal, this procedure for establishing the p value may be derived analogously.

8. Simulation and Data Analysis

It is desirable to evaluate the ability of the outlier sum to detect significant genes through the p values of genes. We restrict this evaluation to the case where the underlying distributions are normal, as is generally assumed in the approaches of Tibshirani and Hastie (2007) and Wu (2007). Under the normality assumption, the outlier sum statistic may be formulated as

$$\tilde\varrho_b = \sum_{i=1}^{n_2} Y_i\, I(Y_i > \bar X + 3k z_{0.75} S_x) \qquad (8.1)$$

where $\bar X$ and $S_x$ are, respectively, the sample mean and the sample standard deviation based on the sample from the normal group. This outlier sum is equivalent to the proposal of Wu (2007) when $k = 1$. It is then of interest to study the choice of the constant $k$ for detecting significant genes through simulation and data analysis.
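The statistic in (8.1) needs only the sample mean and sample standard deviation of the normal group. A minimal Python sketch, with the hypothetical function name `outlier_sum_normal`:

```python
from statistics import NormalDist, mean, stdev

Z75 = NormalDist().inv_cdf(0.75)  # upper quartile of N(0, 1), about 0.6745


def outlier_sum_normal(x, y, k):
    """Outlier sum of (8.1): sum of y values above mean(x) + 3*k*z_0.75*sd(x).

    x : expression values of the normal group (needs at least two values)
    y : expression values of the disease group
    k : constant controlling how extreme the threshold is
    """
    threshold = mean(x) + 3 * k * Z75 * stdev(x)
    return sum(value for value in y if value > threshold)
```

For `x = [-1, 0, 1]` the sample mean is 0 and the sample standard deviation is 1, so the threshold is about 2.02 for `k = 1` and about 4.05 for `k = 2`. With `y = [0.5, 2.0, 2.5, 3.0]` the outlier sum is 5.5 for `k = 1` and 0 for `k = 2`.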

We conduct two simulations. First, the classical t test has been criticized because it occasionally declares hundreds of genes influential when ten thousand genes are investigated. Hence, we generate $n_1 = 20$ and $n_2 = 20$ observations from $N(0, 1)$ and conduct 1 million replications of this data generation to compute the p values of (7.6). Setting the significance level $\alpha = 0.001, 0.01, 0.05$ and $k = 1, 2, 3$, we count the replications with p values smaller than the corresponding specified significance level. The results are displayed in Table 15.

Table 15. Numbers in 1 million replications with p values smaller than $\alpha$

$\alpha$    k = 1    k = 2    k = 3
--------    -----    -----    -----
0.05        57808      460        5
0.01        25231       86        2
0.001        9632       23        1

We draw two conclusions from the results in Table 15:

(a) Consider $k = 1$. If $\alpha = 0.05$, more than 50 thousand of the 1 million replications are claimed influential. So, if there are 10 thousand genes in total, about 500 or more genes will be identified as influential. Similarly, $\alpha = 0.01$ and $\alpha = 0.001$ lead to, respectively, 200 and 90 or more genes being identified as influential. This shows that the outlier sum with $k = 1$, which is equivalent to that of Wu (2007), still suffers from declaring too many influential genes.

(b) Consider $k = 2$. The results show that when the number of genes is about 10 thousand, only a very small number of genes will be identified as influential. On the other hand, with $k = 3$ almost no gene will be identified as influential. Hence, based on this simulation, $k = 2$ or 3 is an appropriate constant for constructing the outlier sum.
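The scaling used in these conclusions is plain proportional arithmetic: a count out of 1 million null replications is multiplied by 10,000 / 1,000,000 to give the expected number of falsely flagged genes among 10 thousand. The short Python sketch below applies it to the Table 15 counts; the names `TABLE_15` and `expected_false_positives` are hypothetical.

```python
# Counts from Table 15: TABLE_15[alpha][k] = replications with p < alpha.
TABLE_15 = {
    0.05: {1: 57808, 2: 460, 3: 5},
    0.01: {1: 25231, 2: 86, 3: 2},
    0.001: {1: 9632, 2: 23, 3: 1},
}


def expected_false_positives(count, replications=1_000_000, genes=10_000):
    """Scale a count out of `replications` null draws to a panel of `genes` genes."""
    return count * genes / replications


for alpha, by_k in TABLE_15.items():
    scaled = {k: expected_false_positives(c) for k, c in by_k.items()}
    print(alpha, scaled)
```

For `k = 1` this gives roughly 578, 252 and 96 genes at the 0.05, 0.01 and 0.001 levels, while for `k = 2` the corresponding figures are below 5.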

We first consider a simulation to evaluate the efficiency of the p value approach for the outlier sum in detecting outlier genes. Let $(s, h)$ be a fixed index for gene data generation. We generate $n_1 = 20$ and $n_2 = 20$ observations from $N(0, 1)$. However, we add $h$ units to $s$ of the samples in the second group of $n_2$ observations. We then compute the p value of (7.6).

For the next simulation, we consider that there are influential genes and examine the efficiency of the p value approach for detecting them. Again, we generate $n_1 = 20$ and $n_2 = 20$ observations from $N(0, 1)$, and we add $h$ units to $s$ of the samples in the second group of $n_2$ observations. This process is repeated 10 thousand times and we compute the average p value. We perform this simulation for several values of $s$ and $h$ and display the averaged p values in Tables 16 and 17.
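The data-generating step of this simulation can be sketched in Python as follows. The function name `generate_gene` and the choice to shift the first `s` observations of the second group are assumptions; the text only says that `h` units are added to `s` of the samples, and because the observations are independent and identically distributed the choice of which ones to shift does not affect the distribution.

```python
import random


def generate_gene(s, h, n1=20, n2=20, rng=None):
    """Simulate one gene: both groups are N(0, 1), then h is added to s values of y.

    Returns (x, y), the normal-group and disease-group samples.
    """
    if rng is None:
        rng = random.Random()
    x = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n2)]
    for i in range(s):  # assumption: shift the first s observations
        y[i] += h
    return x, y
```

Calling `generate_gene(0, 0)` reproduces the null case, and passing a seeded `random.Random` instance as `rng` makes a run reproducible.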

Table 16. Average p values of outlier sum

(s, h)     k = 1       k = 2       k = 3
-------    --------    --------    --------
(0, 0)     0.4726      0.4972      0.5
(2, 2)     0.2441      0.4642      0.4985
(2, 4)     0.0198      0.2018      0.4475
(2, 6)     0.00075     0.0162      0.2103
(2, 8)     1.75E-05    0.00056     0.0313
(4, 2)     0.1271      0.4354      0.4973
(4, 4)     0.00038     0.1052      0.4160
(4, 6)     2.88E-08    0.0013      0.1293
(4, 8)     2.89E-13    5.76E-07    0.0070
(4, 10)    6.19E-18    3.72E-12    2.77E-05
(6, 2)     0.0694      0.4145      0.4960
(6, 4)     1.74E-05    0.0672      0.3891
(6, 6)     5.37E-13    0.00027     0.0948
(6, 8)     5.34E-24    1.95E-11    0.0029
(6, 10)    2.03E-36    3.25E-21    3.74E-06

Table 17. Average p values of outlier sum

(s, h)     k = 4     k = 5     k = 6
-------    ------    ------    ------
(0, 0)     0.5       0.5       0.5
(2, 2)     0.4999    0.5       0.5
(2, 4)     0.4958    0.4997    0.4998
(2, 6)     0.4280    0.4890    0.4986
(2, 8)     0.2170    0.4084    0.4798
(4, 2)     0.4999    0.5       0.5
(4, 4)     0.4911    0.4992    0.4998
(4, 6)     0.3901    0.4821    0.4978
(4, 8)     0.1463    0.3712    0.4695
(4, 10)    0.0179    0.1657    0.3518
(6, 2)     0.4999    0.5       0.5
(6, 4)     0.4886    0.4990    0.4999
(6, 6)     0.3633    0.4766    0.4970
(6, 8)     0.1152    0.3482    0.4614
(6, 10)    0.0106    0.1322    0.3291

We draw several conclusions from Tables 16 and 17:

(a) Consider the case $(s, h) = (0, 0)$. It is reassuring that the outlier sums for all values of $k$ have average p values above 0.4, which indicates no statistical significance for genes that are in fact non-influential.

(b) Consider $k = 1$ and $(s, h) \neq (0, 0)$. Apart from a few cases, the average p ... genes. Is $k = 1$ appropriate for constructing the outlier sum? We should recall that $k = 1$ may occasionally generate too many influential genes, as we have seen in Table 15. So, it is good at detecting influential genes but would produce a non-negligible type I error.

(c) Consider $k = 2$. The simulation results for $(s, h) = (0, 0)$ in Table 16 show that it would produce only a negligible type I error. For $(s, h) \neq (0, 0)$, when $h$ is far enough from 0, the outlier sum performs very well. Considering the balance between the two types of error, $k = 2$ seems to be an appropriate choice for the outlier sum.

(d) From the results for $k > 2$, the outlier sum does not seem efficient at detecting influential genes in any of the situations $(s, h) \neq (0, 0)$.

We now consider an application of the p value of the outlier sum to real gene data. The breast cancer microarray data reported by Huang et al. (2003) contained the expression levels of 12625 genes from 37 (or 52) breast tumor samples. Each sample had a binary outcome describing the status of lymph node involvement in breast cancer (breast cancer recurrence). Among them, 19 samples had no positive nodes (or 34 samples had no cancer recurrence and 18 samples had breast cancer recurrence). The gene expressions were obtained from the Affymetrix human U95a chip. We pre-processed the data using RMA (Irizarry et al. (2003)).

We first compute the p values of (7.6) for various values of $k$, and in the following table we display the numbers $no_{<0.001}$ of genes that are classified as significant because their p values are less than 0.001.

Table 18. Numbers of genes with p values smaller than 0.001

k          $no_{<0.001}$    k        $no_{<0.001}$
-------    -------------    -----    -------------
k = 1      5583             k = 4    35
k = 1.5    2407             k = 5    8
k = 2      922              k = 6    5
k = 3      158

We have several comments on the results in Table 18:

(a) We have seen that $\hat H_a$ is the proposal of Wu (2007) and $\hat H_b$ with $k = 1$ is ... to be normal. The number of significant genes when $k = 1$ for $\hat H_b$ is 5583. This huge number shows that this gene data set is definitely not appropriate for analysis by the outlier sum proposals introduced so far. In the other cases with $k \leq 3$, the numbers of genes claimed to be significant are still too large for further investigation.

(b) When $k$ is as large as 4, the number of significant genes is down to 35, and it goes further down to 8 when $k = 5$. This shows that gene data may need an outlier sum with a more extreme threshold to narrow the potential group of genes for further study.

In the following table, we select the cases $k = 5$ and 6 and list the corresponding gene numbers with significant p values, together with their outlier sum (OS) values for reference.

Table 19. Gene numbers with their outlier sums associated with p value

k = 5                       k = 6
Gene number    OS           Gene number    OS
-----------    --------     -----------    --------
4029           27.88125     4029           27.88125
4028           31.40937     4028           31.40937
10210          16.62765     10210          16.62765
3758           7.615114     3758           7.615114
8972           6.014273     8972           6.014273
10987          5.93685
10019          10.82669
198            10.14491

Detecting significant genes through the p values of the outlier sum solves the difficulty of the classical outlier sum technique, which is not able to detect significant genes when their number is unknown. But how should the constant $k$ for the outlier sum of (8.1) be chosen? We propose to list the numbers of significant genes for various values of $k$ and to select the $k$ that yields a moderately small group of significant genes.
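The proposed selection rule can be sketched in Python: count the significant genes for each candidate `k`, then pick a `k` whose count is small enough to follow up. The function names and the `max_genes` budget are hypothetical; the text does not fix a numerical meaning for "moderately small", so that cut-off is left to the analyst.

```python
def count_significant(p_values_by_k, alpha=0.001):
    """Map each k to the number of genes whose p value is below alpha."""
    return {
        k: sum(p < alpha for p in p_values)
        for k, p_values in p_values_by_k.items()
    }


def smallest_k_within(counts, max_genes):
    """Smallest k whose count of significant genes is at most max_genes.

    `max_genes` is an analyst-chosen budget (an assumption, not from the text).
    Returns None when no candidate k meets the budget.
    """
    eligible = [k for k, n in counts.items() if n <= max_genes]
    return min(eligible) if eligible else None
```

Applied to the counts reported in Table 18, a budget of 50 genes selects `k = 4` (35 genes) and a budget of 10 genes selects `k = 5` (8 genes).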

References

2004 Guide to the Expression of Uncertainty in Measurement, Supplement 1: Numerical Methods for the Propagation of Distributions. Draft of JCGM document, p. 38.

Chen, L.-A., Huang, J.-Y. and Chen, H.-C. (2007). Parametric coverage interval. Metrologia, 44, L7-L9.

Agrawal, D., Chen, T., Irby, R., et al. (2002). Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J. Natl. Cancer Inst., 94, 513-521.

Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Beer, D. G., Kardia, S. L., Huang, C. C., et al. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816-824.

Chen, L.-A. and Chiang, Y.-C. (1996). Symmetric quantiles and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

Huang, E., Cheng, S. H., Dressman, H., et al. (2003). Gene expression predictors of breast cancer outcomes. Lancet, 361, 1590-1596.

Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U. and Speed, T. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 2, 249-64.

Ohki, R., Yamamoto, K., Ueno, S., et al. (2005). Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int. J. Cardiol., 102, 233-238.

Sorlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418-8423.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

Wu, B. (2007). Cancer outlier differential gene expression detection.
