Nonparametric Outlier Mean for Gene Expression Analysis

Student: Ya-Fang You

Advisor: Dr. Lin-An Chen

National Chiao Tung University

Institute of Statistics

Master's Thesis

A Thesis

Submitted to Institute of Statistics

College of Science

National Chiao Tung University

In Partial Fulfillment of the Requirements

For the Degree of

Master

In

Statistics

June 2009

Hsinchu, Taiwan, Republic of China


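Nonparametric Outlier Mean for Gene Expression Analysis

Student: Ya-Fang You

Advisor: Dr. Lin-An Chen

Master's Program, Institute of Statistics, National Chiao Tung University

Abstract (Chinese)

The outlier mean has reasonable power when it is used to test a shift of the whole distribution. When only part of the distribution shifts, however, the variance of the outlier mean is inflated and the power drops sharply, and such partial shifts are seen frequently in cancer studies. Classical statistical methods draw inferences from good data, whereas the outlier mean draws inferences from outliers, so the two are conceptually very different. We study the nonparametric outlier mean from two viewpoints. First, we derive the asymptotic distribution of the outlier mean in order to establish a level α test and compute the p value. Second, with regard to the rule for identifying outliers, we examine the relation between power and asymptotic variance.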


Nonparametric Outlier Mean for Gene Expression Analysis

Student: Ya-Fang You

Advisor: Dr. Lin-An Chen

Institute of Statistics

National Chiao Tung University

ABSTRACT

The outlier mean has reasonable power when the distribution undergoes a location shift. However, its power is remarkably reduced when the distribution is shifted on only a small fraction of observations, because of large asymptotic variances, and this happens frequently in cancer studies. We study the nonparametric outlier mean (outlier sum) in two aspects. First, we develop the asymptotic distribution needed to establish a level α test or to compute a p value. Second, the concept of using outliers for statistical inference may be treated differently from classical statistical inference, which constructs rules based on good data. We study the relation between the powers and the asymptotic variances of outlier means, aiming to draw principles for choosing outlier-based inference techniques.


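Acknowledgments

Two years as a master's student, and eighteen years as a student, are about to come to a close here.

I sincerely thank my advisor, Professor Lin-An Chen. It is thanks to his careful and patient guidance, and his tireless help in resolving my questions, that this thesis could be completed smoothly. I thank the members of my oral examination committee, Professor 黃冠華, Professor 蔡明田 and Professor 吳柏林, whose corrections and suggestions made the thesis more complete.

I thank the classmates and friends around me. Growing together with you has been wonderful: there was someone to share with when I was feeling low, and someone to discuss with when problems came up. Because of you I was never fighting alone, and I am glad to have you.

Finally, I thank my family, who have always been with me. Your support let me pursue my studies without worry, and let me know that a warm harbor is always ready to welcome me for a rest. Thank you, my dearest family.

I dedicate this thesis to my family, friends and teachers, with my most sincere gratitude. Sharing these results and this joy with you is my greatest happiness.

Ya-Fang, Institute of Statistics, National Chiao Tung University, June 2009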


Abstract (Chinese)

Abstract

Acknowledgments

1 Introduction

2 Two Tests Based on Asymptotic Distribution of the Outlier Mean

3 Outlier Mean Based Hypothesis Testings

4 Comparison of Outlier Coverages and Asymptotic Variances of Outlier Mean

5 Power Studies with Tests Based on Outlier Mean

6 Appendix

References


1 Introduction

DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al. (2002); Alizadeh et al. (2000); Ohki et al. (2005); Sorlie et al. (2003)). For example, Sorlie et al. used gene expression to classify malignant breast tumors into five molecular subtypes (one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup) (Sorlie et al. (2003)). Alizadeh et al. reported that patients with germinal center B-like diffuse large B-cell lymphoma had a significantly better chance of overall survival than those with another molecular pattern, activated B-like diffuse large B-cell lymphoma (Alizadeh et al. (2000)). Recently, microarray analysis has been extended to disease classification by identifying outlier genes that are over-expressed only in a small number of disease samples (see, for example, Tibshirani and Hastie (2007); Tomlins et al. (2005)). For this goal, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because of the large number of gene expressions and the limited number of subjects available.

Several statistical approaches have been proposed to identify genes for which only a subset of the samples has high expression. Among them, Tomlins et al. (2005) introduced a method called cancer outlier profile analysis, which identifies outlier profiles by a statistic based on the median and the median absolute deviation of a gene expression profile. Tibshirani and Hastie (2007) suggested an outlier sum that adds all the gene expression values in the disease group that are greater than the sum of the 75th percentile and the interquartile range of the same gene. They also showed in simulation that the statistical test based on this outlier sum is noticeably more powerful than cancer outlier profile analysis. An alternative outlier sum-like statistic, called the outlier robust t-statistic, was proposed by Wu (2007). Recently Chen, Chen and Chan (2008) proposed a new version of the outlier sum and its corresponding outlier mean and developed its large sample theory, which allows the p value to be formulated from the asymptotic distribution. Specifically, they considered the parametric case by specifying the normal distribution and performed simulation studies and data analysis for gene expression analysis.

Although the large sample distribution of an outlier mean has been provided in Chen, Chen and Chan (2008), the nonparametric study of the outlier mean is still very restricted, so its application in gene expression analysis remains limited. Specifically, an outlier mean can be used to test a relation between the distributions of normal group subjects and disease group subjects, where this relation may be the identity of the two distributions or a weaker relation such as the identity of only the two population outlier means. This matters because different assumptions lead to different tests, and tests for different hypotheses involve different scale estimates that may produce significant differences in power. An advanced study of the nonparametric outlier mean is therefore desirable, so that practitioners have a principle for choosing an outlier mean test statistic that is appropriate in terms of power. This is the aim of this paper.

We define an outlier mean whose cutoff point takes a specific form from a general class, and develop its asymptotic representation and distribution. We also develop an asymptotic distribution for this outlier mean when the distributions of normal group subjects and disease group subjects are identical. This allows us to consider tests for the hypothesis of equal distributions and the hypothesis of equal population outlier means. We evaluate the power of these two tests and obtain several interesting results. 1. If there is a distributional shift in location only, then the test for the hypothesis of equal population outlier means is relatively more powerful than the other one. On the other hand, if there is a shift in both location and scale, the two tests are very competitive. This is an important message for users when the pattern of distributional shift can be observed from the data. 2. The popularly used cutoff point with percentage α = 0.25 is quite unsatisfactory in the nonparametric power study for gene expression analysis, while the percentages α = 0.35 and 0.45 for constructing the cutoff point are satisfactory.

In Section 2, we first introduce an outlier mean whose cutoff point takes a specific form from a general class, and develop its asymptotic representation and distribution. In Section 3 we then develop the asymptotic distribution of this outlier mean under the assumption that the distribution of the disease group subjects and the distribution of the normal group subjects are identical. This allows us to introduce several hypotheses defined on the parameters of the asymptotic distribution, and a test for each hypothesis may be determined by estimating the parameters used in that hypothesis. In Section 4, we compare the asymptotic variances of this outlier mean under several distributions for the normal group variable and the disease group variable. This provides a guide for choosing a hypothesis to test when the underlying distributions of the two groups belong to these specific types. In Section 5, we make a power comparison of these tests. Finally, the proofs of the theorems are given in Section 6.


2 Two Tests Based on Asymptotic Distribution of the Outlier Mean

Let $X$ and $Y$ be the expression variables for the group of normal subjects and the group of disease subjects, respectively, with distribution functions $F_X$ and $F_Y$. Extending Tibshirani and Hastie (2007), Wu (2007) and Chen, Chen and Chan (2008), a general type of cutoff point used in gene expression analysis to detect outliers may be formulated as $\sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)$, $0 < \alpha_j < 1$, $j = 1, \ldots, k$. We now define population type outlier means.

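Definition 2.1. If $\sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j) > F_X^{-1}(0.5)$, we call

$$\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{1}{P\{Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\}}\, E\Big[Y\, I\Big(Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\Big)\Big]$$

a population outlier mean with positive outliers. On the other hand, if $\sum_{j=1}^{k} c_j F_X^{-1}(\gamma_j) < F_X^{-1}(0.5)$, we call

$$\lambda^{n}_{X,Y}(\gamma_1, \gamma_2, \ldots, \gamma_k) = \frac{1}{P\{Y \le \sum_{j=1}^{k} c_j F_X^{-1}(\gamma_j)\}}\, E\Big[Y\, I\Big(Y \le \sum_{j=1}^{k} c_j F_X^{-1}(\gamma_j)\Big)\Big]$$

a population outlier mean with negative outliers.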

In the literature, the outlier sum of Wu (2007) and the outlier mean of Chen, Chen and Chan (2008) are of this type; their corresponding coefficients are listed in Table 1.

Table 1. Coefficients for some outlier means

| Outlier mean | $\{\alpha_1, \alpha_2, \alpha_3\}$ | $\{c_1, c_2, c_3\}$ |
|---|---|---|
| Wu (2007) | $\{0.25, 0.75, 0.75\}$ | $\{-1, 1, 1\}$ |
| Chen, Chen and Chan (2008) | $\{0.25, 0.5, 0.75\}$ | $\{-\kappa, 1, \kappa\}$, where $\kappa > 0$ |

An invariance property is desirable for any statistical functional, and not every population outlier mean introduced above is of interest in this respect. Suppose that a random variable $X$ has quantile function $F_X^{-1}(\alpha)$. It is known that this quantile function has the following property:

$$F_{aX+b}^{-1}(\alpha) = \begin{cases} aF_X^{-1}(\alpha) + b & \text{if } a > 0, \\ aF_X^{-1}(1-\alpha) + b & \text{if } a < 0. \end{cases}$$

We now give a condition under which a population outlier mean satisfies the desired invariance properties.


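Theorem 2.2. Suppose that $c_j$, $j = 1, \ldots, k$, satisfy $\sum_{j=1}^{k} c_j = 1$. Then the population outlier mean with positive outliers has the following properties:

$$\lambda^{p}_{aX+b,\,aY+b}(\alpha_1, \alpha_2, \ldots, \alpha_k) = \begin{cases} a\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k) + b & \text{if } a > 0, \\ a\lambda^{n}_{X,Y}(1-\alpha_1, 1-\alpha_2, \ldots, 1-\alpha_k) + b & \text{if } a < 0. \end{cases}$$

On the other hand, the population outlier mean with negative outliers has the following properties:

$$\lambda^{n}_{aX+b,\,aY+b}(\gamma_1, \gamma_2, \ldots, \gamma_k) = \begin{cases} a\lambda^{n}_{X,Y}(\gamma_1, \gamma_2, \ldots, \gamma_k) + b & \text{if } a > 0, \\ a\lambda^{p}_{X,Y}(1-\gamma_1, 1-\gamma_2, \ldots, 1-\gamma_k) + b & \text{if } a < 0. \end{cases}$$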

If the outlier means $\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $\lambda^{n}_{X,Y}(\gamma_1, \gamma_2, \ldots, \gamma_k)$ are formulated with $\sum_{j=1}^{k} c_j \ne 1$, we may see from the proof of Theorem 2.2 (see Section 6) that they are no longer equivariant in the way the quantile function is.
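The role of the condition on the coefficients can be seen by working out the cutoff point of the transformed variable for $a > 0$. The short derivation below is only an illustrative sketch that uses the quantile property stated above; it is not part of the formal proof in Section 6.

```latex
\begin{align*}
\sum_{j=1}^{k} c_j F_{aX+b}^{-1}(\alpha_j)
  &= \sum_{j=1}^{k} c_j \bigl( a F_X^{-1}(\alpha_j) + b \bigr) \\
  &= a \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j) + b \sum_{j=1}^{k} c_j .
\end{align*}
```

The event that $aY + b$ exceeds this transformed cutoff coincides with the event $\{Y \ge \sum_{j} c_j F_X^{-1}(\alpha_j)\}$ for every $b$ only when $b \sum_{j} c_j = b$, that is, when $\sum_{j} c_j = 1$. Both coefficient sets in Table 1 sum to one, as do the coefficients $2$ and $-1$ of the cutoff point $2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$.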

We suggest the population cutoff point of the form $2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$. Let $\hat F_X^{-1}$ be the empirical quantile function for estimating the population quantile function $F_X^{-1}$. The sample outlier mean can be expressed as

$$\hat\lambda = \frac{\sum_{i=1}^{n_2} Y_i\, I\big(Y_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}{\sum_{i=1}^{n_2} I\big(Y_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}. \tag{2.1}$$

Implicitly, this sample outlier mean estimates the following population outlier mean:

$$\mu_\lambda = \frac{E\big[Y\, I\big(Y \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big)\big]}{P\big\{Y \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big\}}.$$
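As an illustration of (2.1), the following minimal Python sketch computes the sample outlier mean from a normal group sample and a disease group sample. The function names are hypothetical, and the empirical quantile is assumed to be the left-continuous inverse of the empirical distribution function (the $\lceil n\alpha \rceil$-th order statistic); other quantile conventions would give slightly different cutoff points.

```python
import math

def empirical_quantile(sample, prob):
    """Smallest order statistic x with F_hat(x) >= prob (assumed convention)."""
    ordered = sorted(sample)
    # Rank of the ceil(n * prob)-th order statistic, clamped to 1..n.
    rank = max(1, math.ceil(len(ordered) * prob))
    return ordered[min(rank, len(ordered)) - 1]

def outlier_cutoff(normal_sample, alpha):
    """Cutoff 2 * F_hat^{-1}(1 - alpha) - F_hat^{-1}(alpha) from the normal group."""
    upper = empirical_quantile(normal_sample, 1 - alpha)
    lower = empirical_quantile(normal_sample, alpha)
    return 2 * upper - lower

def sample_outlier_mean(normal_sample, disease_sample, alpha):
    """Mean of the disease observations at or above the cutoff, as in (2.1).

    Returns None when no disease observation reaches the cutoff, since the
    ratio in (2.1) is then undefined.
    """
    cutoff = outlier_cutoff(normal_sample, alpha)
    outliers = [y for y in disease_sample if y >= cutoff]
    if not outliers:
        return None
    return sum(outliers) / len(outliers)

# Toy data, for illustration only.
x = list(range(1, 21))        # normal group: 1, 2, ..., 20
y = [10, 20, 26, 30, 40]      # disease group
print(outlier_cutoff(x, 0.25))          # 2 * 15 - 5 = 25
print(sample_outlier_mean(x, y, 0.25))  # mean of 26, 30, 40 = 32.0
```

Returning `None` for an empty outlier set is a design choice of this sketch; the text does not specify how that case should be handled.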

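To establish a large sample theory based p value, we consider the following location models,

$$X_i = \mu_X + \epsilon_i, \quad i = 1, \ldots, n_1, \qquad Y_i = \mu_Y + \delta_i, \quad i = 1, \ldots, n_2, \tag{2.2}$$

where the $\epsilon_i$'s and $\delta_i$'s are finite sequences of independent and identically distributed random variables with distribution functions $F_\epsilon$ and $F_\delta$ and probability density functions $f_\epsilon$ and $f_\delta$, respectively. In addition, $E(\epsilon_i) = E(\delta_i) = 0$, $Var(\epsilon_i) = \sigma_X^2$ and $Var(\delta_i) = \sigma_Y^2$. With this setup, $F_X(x) = F_\epsilon(x - \mu_X)$ and $F_Y(y) = F_\delta(y - \mu_Y)$. In terms of the error distributions in (2.2), the population outlier mean is

$$\mu_\lambda = \mu_Y + \frac{\int_\eta^\infty \delta f_\delta(\delta)\,d\delta}{\beta},$$

where, matching the definition of $\mu_\lambda$ above, $\eta = 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha) - \mu_Y$ is the cutoff point on the scale of the error $\delta$ and $\beta = P\{\delta \ge \eta\}$.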


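Theorem 2.3. Suppose that assumptions (A2), (A3) and (A4) in the Appendix are true.

(a) A Bahadur representation of the outlier mean is

$$\begin{aligned}
\sqrt{n_2}(\hat\lambda - \mu_\lambda) ={}& ((1-\alpha)b_1 - \alpha b_2)\, n_1^{-1/2} \sum_{i=1}^{n_1} I\big(\epsilon_i \le F_\epsilon^{-1}(\alpha)\big) \\
& - \alpha(b_1 + b_2)\, n_1^{-1/2} \sum_{i=1}^{n_1} I\big(F_\epsilon^{-1}(\alpha) \le \epsilon_i \le F_\epsilon^{-1}(1-\alpha)\big) \\
& + (-\alpha b_1 + (1-\alpha)b_2)\, n_1^{-1/2} \sum_{i=1}^{n_1} I\big(\epsilon_i \ge F_\epsilon^{-1}(1-\alpha)\big) \\
& + \frac{1}{\beta}\, n_2^{-1/2} \sum_{i=1}^{n_2} \Big\{\delta_i I(\delta_i \ge \eta) - \int_\eta^\infty \delta f_\delta(\delta)\,d\delta\Big\} + o_p(1),
\end{aligned}$$

where

$$b_1 = -\frac{1}{\beta}\,\eta f_\delta(\eta)\sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(\alpha)\big), \qquad b_2 = -\frac{2}{\beta}\,\eta f_\delta(\eta)\sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(1-\alpha)\big).$$

(b) $\sqrt{n_2}(\hat\lambda - \mu_\lambda)$ converges in distribution to $N(0, \sigma_\lambda^2)$, where

$$\sigma_\lambda^2 = \sigma^2(b_1, b_2, v) = \alpha(1-\alpha)\big((1-\alpha)b_1 - \alpha b_2\big)^2 + 2(1-2\alpha)\alpha^3(b_1 + b_2)^2 + \alpha(1-\alpha)\big(\alpha b_1 - (1-\alpha)b_2\big)^2 + v$$

and

$$v = \frac{1}{\beta^2}\Big[\int_\eta^\infty \delta^2 f_\delta(\delta)\,d\delta - \Big(\int_\eta^\infty \delta f_\delta(\delta)\,d\delta\Big)^2\Big].$$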


3 Outlier Mean Based Hypothesis Testings

The basic idea behind the use of the outlier mean or outlier sum in gene expression analysis is to see whether the disease group subjects and the normal group subjects are similar in some sense. Asymptotic normality of the outlier mean allows us to develop tests for hypotheses dealing with all combinations of the asymptotic mean $\mu_\lambda$ and the asymptotic standard deviation $\sigma_\lambda$. However, these tests cannot be introduced without knowing the asymptotic properties of the outlier mean when the distributions for the two groups of subjects are assumed to be identical, that is,

$$H_0 : F_Y = F_X. \tag{3.1}$$

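Under $H_0$, model (2.2) may be reformulated as the following model,

$$X_i = \mu_X + \epsilon_i, \quad i = 1, \ldots, n_1 + n_2, \tag{3.2}$$

where $X_i$, $i = 1, \ldots, n_1$, belong to the normal group, $X_i$, $i = n_1 + 1, \ldots, n_1 + n_2$, belong to the disease group, and the $\epsilon_i$'s are independent and identically distributed random variables with the distribution defined above. Hence, when $H_0$ is true, the sample outlier mean of (2.1) may be reformulated as

$$\hat\lambda = \frac{\sum_{i=n_1+1}^{n_1+n_2} X_i\, I\big(X_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}{\sum_{i=n_1+1}^{n_1+n_2} I\big(X_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)} \tag{3.3}$$

where the quantile estimates $\hat F_X^{-1}(\alpha)$ and $\hat F_X^{-1}(1-\alpha)$ are constructed from the sample $X_1, \ldots, X_{n_1}$. The outlier mean of (3.3) estimates the following parameter:

$$\mu_{\lambda X} = \frac{E\big[X\, I\big(X \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big)\big]}{P\big\{X \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big\}}$$

which, in terms of the error distribution, is

$$\mu_{\lambda X} = \mu_X + \frac{\int_{\eta_X}^\infty \epsilon f_\epsilon(\epsilon)\,d\epsilon}{\beta_X},$$

where $\beta_X = P\{\epsilon \ge \eta_X\}$ with $\eta_X = 2F_\epsilon^{-1}(1-\alpha) - F_\epsilon^{-1}(\alpha)$.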

The following theorem states the asymptotic property for the outlier mean when the observations are drawn from model (3.2).

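Theorem 3.1. When $H_0$ is true, $\sqrt{n_2}(\hat\lambda - \mu_{\lambda X})$ converges in distribution to a normal random variable with distribution $N(0, \sigma^2_{\lambda X})$, where

$$\begin{aligned}
\sigma^2_{\lambda X} &= \sigma^2(b_{1X}, b_{2X}, v_X) \\
&= \alpha(1-\alpha)\big((1-\alpha)b_{1X} - \alpha b_{2X}\big)^2 + 2(1-2\alpha)\alpha^3(b_{1X} + b_{2X})^2 + \alpha(1-\alpha)\big(\alpha b_{1X} - (1-\alpha)b_{2X}\big)^2 + v_X
\end{aligned}$$

and we denote

$$b_{1X} = -\frac{1}{\beta_X}\,\eta_X f_\epsilon(\eta_X)\sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(\alpha)\big), \qquad b_{2X} = -\frac{2}{\beta_X}\,\eta_X f_\epsilon(\eta_X)\sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(1-\alpha)\big),$$

$$v_X = \frac{1}{\beta_X^2}\Big[\int_{\eta_X}^\infty \epsilon^2 f_\epsilon(\epsilon)\,d\epsilon - \Big(\int_{\eta_X}^\infty \epsilon f_\epsilon(\epsilon)\,d\epsilon\Big)^2\Big].$$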
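To make the variance formula of Theorem 3.1 concrete, the following Python sketch evaluates $\sigma^2_{\lambda X}$ for standard normal errors. Two things are assumptions of this sketch rather than statements of the text: the errors are taken to be $N(0, 1)$, and the limiting sample size ratio is set to $h = 1$. The function names are hypothetical. The integrals in $v_X$ use the standard normal identities $\int_\eta^\infty \epsilon\phi(\epsilon)\,d\epsilon = \phi(\eta)$ and $\int_\eta^\infty \epsilon^2\phi(\epsilon)\,d\epsilon = \eta\phi(\eta) + 1 - \Phi(\eta)$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed error distribution F_eps = N(0, 1)

def sigma2(b1, b2, v, alpha):
    """The function sigma^2(b1, b2, v) from Theorem 2.3(b)."""
    return (alpha * (1 - alpha) * ((1 - alpha) * b1 - alpha * b2) ** 2
            + 2 * (1 - 2 * alpha) * alpha ** 3 * (b1 + b2) ** 2
            + alpha * (1 - alpha) * (alpha * b1 - (1 - alpha) * b2) ** 2
            + v)

def null_asymptotic_variance(alpha, h=1.0):
    """sigma^2_{lambda X} under H0 with N(0, 1) errors; h = 1 is an assumption."""
    q_low = STD_NORMAL.inv_cdf(alpha)
    q_high = STD_NORMAL.inv_cdf(1 - alpha)
    eta = 2 * q_high - q_low                 # eta_X
    beta = 1 - STD_NORMAL.cdf(eta)           # beta_X = P{eps >= eta_X}
    pdf_eta = STD_NORMAL.pdf(eta)
    # f^{-1}(.) in the theorem denotes the reciprocal of the density.
    b1 = -(1 / beta) * eta * pdf_eta * sqrt(h) / STD_NORMAL.pdf(q_low)
    b2 = -(2 / beta) * eta * pdf_eta * sqrt(h) / STD_NORMAL.pdf(q_high)
    first_moment = pdf_eta                   # int_eta^inf e * phi(e) de
    second_moment = eta * pdf_eta + beta     # int_eta^inf e^2 * phi(e) de
    v = (second_moment - first_moment ** 2) / beta ** 2
    return sigma2(b1, b2, v, alpha)

for a in (0.45, 0.35, 0.25):
    print(a, round(null_asymptotic_variance(a), 3))
```

Under these assumptions the computed values come out close to 3.15, 32.4 and 369.4 for α = 0.45, 0.35 and 0.25.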

Theorem 3.1 indicates that when $H_0$ is true, $\sqrt{n_2}\,(\hat\lambda - \mu_{\lambda X})/\sigma_{\lambda X}$ converges to the standard normal distribution, and the distribution parameters involved are $\mu_{\lambda X}$ and $\sigma_{\lambda X}$. Combining Theorems 2.3 and 3.1, we have three choices for constructing test functions:

$$\sqrt{n_2}\Big(\frac{\hat\lambda - \mu_{\lambda X}}{\sigma_{\lambda X}}\Big), \quad \sqrt{n_2}\Big(\frac{\hat\lambda - \mu_{\lambda X}}{\sigma_{\lambda}}\Big), \quad \text{and} \quad \sqrt{n_2}\Big(\frac{\hat\lambda - \mu_{\lambda}}{\sigma_{\lambda X}}\Big). \tag{3.4}$$

The first function tests a hypothesis involving both the asymptotic mean and the asymptotic standard deviation, and the others involve only one of these two parameters. Once appropriate estimates of the unknown parameters are available, test statistics are obtained.

Not all of these test functions are of interest in gene expression analysis, since Tomlins et al. (2005) observed that when outliers occur in disease samples, they are either only over-expressed or only down-expressed. Hence a test function that does not consider a location shift is not practical in gene expression analysis. The following procedures are designed for the first two test functions:

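(I) Hypothesis for equality of distributions: $H_{\mu,\sigma} : \mu_\lambda = \mu_{\lambda X},\ \sigma_\lambda^2 = \sigma^2_{\lambda X}$

(a) The rule for testing $H_{\mu,\sigma}$ is: reject $H_{\mu,\sigma}$ if

$$\sqrt{n_2}\Big(\frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda X}}\Big) \ge z_{\alpha^*} \tag{3.5}$$

where $\hat\mu_{\lambda X}$ and $\hat\sigma_{\lambda X}$ are, respectively, estimators of the parameters $\mu_{\lambda X}$ and $\sigma_{\lambda X}$.

(b) An approximate p value based on the observations $x_i$'s and $y_i$'s is defined as

$$p = \int_{\sqrt{n_2}\,(\hat\lambda - \hat\mu_{\lambda X})/\hat\sigma_{\lambda X}}^{\infty} \phi(z)\,dz.$$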

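(II) Hypothesis for equality of population outlier means: $H_\mu : \mu_\lambda = \mu_{\lambda X}$

(a) The rule for testing $H_\mu$ is: reject $H_\mu$ if

$$\sqrt{n_2}\Big(\frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda}}\Big) \ge z_{\alpha^*} \tag{3.6}$$

where $\hat\sigma_\lambda$ is an estimator of the parameter $\sigma_\lambda$ when $Y$ has distribution $F_Y$.

(b) An approximate p value based on the observations $x_i$'s and $y_i$'s is defined as

$$p = \int_{\sqrt{n_2}\,(\hat\lambda - \hat\mu_{\lambda X})/\hat\sigma_{\lambda}}^{\infty} \phi(z)\,dz.$$

The choice between the tests now relies on (i) power performance and (ii) the choice of parameter estimates, which are studied in the subsequent sections.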
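The rejection rules (3.5) and (3.6) and their p values share one computation and differ only in which scale estimate is supplied. A minimal Python sketch follows, assuming the estimates $\hat\lambda$, $\hat\mu_{\lambda X}$ and the chosen scale estimate have already been obtained; the function names and the default level 0.05 are illustrative choices, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def standardized_statistic(lambda_hat, mu_hat, sigma_hat, n2):
    """sqrt(n2) * (lambda_hat - mu_hat) / sigma_hat, as in (3.5) and (3.6)."""
    return sqrt(n2) * (lambda_hat - mu_hat) / sigma_hat

def approximate_p_value(lambda_hat, mu_hat, sigma_hat, n2):
    """Upper tail area of the standard normal beyond the statistic."""
    z = standardized_statistic(lambda_hat, mu_hat, sigma_hat, n2)
    return 1.0 - STD_NORMAL.cdf(z)

def reject(lambda_hat, mu_hat, sigma_hat, n2, level=0.05):
    """Reject when the statistic reaches the upper critical point z_{alpha*}."""
    z = standardized_statistic(lambda_hat, mu_hat, sigma_hat, n2)
    return z >= STD_NORMAL.inv_cdf(1.0 - level)

# Hypothetical estimates, for illustration only.
print(approximate_p_value(2.0, 1.0, 1.0, 25))  # statistic is 5, p is tiny
print(reject(2.0, 1.0, 1.0, 25))               # True
```

Passing $\hat\sigma_{\lambda X}$ as `sigma_hat` gives the test of $H_{\mu,\sigma}$, and passing $\hat\sigma_\lambda$ gives the test of $H_\mu$.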


4 Comparison of Outlier Coverages and Asymptotic Variances of Outlier Mean

Tests (3.5) and (3.6) use the same critical point $z_{\alpha^*}$ and the same estimate $\hat\mu_{\lambda X}$ for the expectation of the outlier variable. When we choose between hypotheses $H_{\mu,\sigma}$ and $H_\mu$, the right choice is the one with the smaller asymptotic variance ($\sigma^2_{\lambda X}$ or $\sigma^2_\lambda$). We will see that the size of this asymptotic variance is related to the outlier coverage $\beta$. We compute the outlier coverage probabilities $\beta_X$ and $\beta$ and the asymptotic variances $\sigma^2_{\lambda X}$ and $\sigma^2_\lambda$ under the following distribution setting:

$$F_X = N(0, 1) \quad \text{and} \quad F_Y = N(\theta, 1). \tag{4.1}$$

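Table 2. Coverage probabilities and asymptotic variances when there is a distributional shift

| θ | α | $\beta_X$ | $\beta$ | $\sigma^2_{\lambda X}$ | $\sigma^2_\lambda$ |
|---|---|---|---|---|---|
| 1 | 0.45 | 0.3531 | 0.7333 | 3.1485 | 1.6203 |
| 1 | 0.35 | 0.1238 | 0.4380 | 32.379 | 5.4481 |
| 1 | 0.25 | 0.0215 | 0.1530 | 369.41 | 48.11 |
| 1 | 0.15 | 0.0009 | 0.0174 | 13068.76 | 877.44 |
| 1 | 0.05 | 4.0e-7 | 4.2e-5 | 6.5e+7 | 6.5e+5 |
| 3 | 0.45 | 0.3531 | 0.9956 | 3.1485 | 1.0123 |
| 3 | 0.35 | 0.1238 | 0.9674 | 32.379 | 1.2644 |
| 3 | 0.25 | 0.0215 | 0.8356 | 369.41 | 3.2525 |
| 3 | 0.15 | 0.0009 | 0.4565 | 13068.76 | 18.63 |
| 3 | 0.05 | 4.0e-7 | 0.0265 | 6.5e+7 | 1416.67 |
| 10 | 0.45 | 0.3531 | 1 | 3.1485 | 1 |
| 10 | 0.35 | 0.1238 | 1 | 32.379 | 1 |
| 10 | 0.25 | 0.0215 | 1 | 369.41 | 1 |
| 10 | 0.15 | 0.0009 | 1 | 13068.76 | 1 |
| 10 | 0.05 | 4.0e-7 | 1 | 6.5e+7 | 1 |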

We draw several comments from Table 2:

1. $\beta_X < \beta$ for all cases of θ and α. This indicates that the outlier interval $[2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha), \infty)$ covers more of the probability mass of $Y$ than of $X$. The difference can be huge. For example, when θ = 10, $\beta_X$ is at most 0.3531 while $\beta$ is 1 or nearly 1, indicating that the outlier interval contains almost the whole probable range of $Y$.

2. The differences in coverage probabilities strongly affect the asymptotic variances: $\sigma^2_{\lambda X} > \sigma^2_\lambda$ for all cases of θ and α, and the asymptotic variance under hypothesis $H_{\mu,\sigma}$ can be hundreds or thousands of times that under hypothesis $H_\mu$.

3. When θ = 10, the asymptotic variances under the hypothesis concerning the population outlier mean are all nearly 1. This indicates that the asymptotic variance under this hypothesis is then the variance of the random variable $Y$.
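The coverage probabilities in Table 2 can be checked directly from the normal distribution function. The Python sketch below is a minimal illustration under setting (4.1); the function names are hypothetical.

```python
from statistics import NormalDist

F_X = NormalDist(0, 1)  # distribution of the normal group under (4.1)

def cutoff(alpha):
    """Population cutoff 2 * F_X^{-1}(1 - alpha) - F_X^{-1}(alpha)."""
    return 2 * F_X.inv_cdf(1 - alpha) - F_X.inv_cdf(alpha)

def coverage(dist, alpha):
    """Probability that a variable with distribution `dist` reaches the cutoff."""
    return 1 - dist.cdf(cutoff(alpha))

for theta in (1, 3, 10):
    F_Y = NormalDist(theta, 1)
    for alpha in (0.45, 0.35, 0.25, 0.15, 0.05):
        beta_x = coverage(F_X, alpha)
        beta = coverage(F_Y, alpha)
        print(f"theta={theta} alpha={alpha} beta_X={beta_x:.4g} beta={beta:.4g}")
```

For example, with α = 0.45 the cutoff is about 0.377, which gives $\beta_X \approx 0.3531$ and, for θ = 1, $\beta \approx 0.7333$.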

We may be more interested in the comparison under the following contaminated alternative:

$$F_X = N(0, 1) \quad \text{and} \quad F_Y = (1-\gamma)N(0, 1) + \gamma N(\theta, 1) \tag{4.2}$$

where θ > 0. This alternative assumes that $Y$ follows a location model with positive mean γθ and a contaminated error variable. Table 3 reports the resulting coverage probabilities and asymptotic variances.

Table 3. Coverage probabilities and asymptotic variances when there is a small proportion (γ = 0.1) of distributional shift

| θ | α | $\beta_X$ | $\beta$ | $\sigma^2_{\lambda X}$ | $\sigma^2_\lambda$ |
|---|---|---|---|---|---|
| 1 | 0.45 | 0.3531 | 0.3911 | 3.1485 | 3.0160 |
| 1 | 0.35 | 0.1238 | 0.1552 | 32.379 | 25.0101 |
| 1 | 0.25 | 0.0215 | 0.0346 | 369.41 | 239.5184 |
| 1 | 0.15 | 0.0009 | 0.0025 | 13068.76 | 5095.878 |
| 1 | 0.05 | 4.0e-7 | 4.5e-6 | 6.5e+7 | 5.9e+6 |
| 3 | 0.45 | 0.3531 | 0.4173 | 3.1485 | 5.9860 |
| 3 | 0.35 | 0.1238 | 0.2082 | 32.379 | 27.118 |
| 3 | 0.25 | 0.0215 | 0.1029 | 369.41 | 97.7885 |
| 3 | 0.15 | 0.0009 | 0.0465 | 13068.76 | 359.8992 |
| 3 | 0.05 | 4.0e-7 | 0.0003 | 6.5e+7 | 12821.53 |
| 10 | 0.45 | 0.3531 | 0.4178 | 3.1485 | 50.6809 |
| 10 | 0.35 | 0.1238 | 0.2115 | 32.379 | 201.8219 |
| 10 | 0.25 | 0.0215 | 0.1194 | 369.41 | 640.0708 |
| 10 | 0.15 | 0.0009 | 0.1008 | 13068.76 | 895.2527 |
| 10 | 0.05 | 4.0e-7 | 0.1000 | 6.5e+7 | 909.9943 |

We have several comments on the results in Table 3:

1. Setting $F_Y$ as the contaminated normal distribution of (4.2) means that the response variable for a disease gene has a large proportion of observations from the distribution $F_X$, with a small part of the observations shifted to the right. The variance of the contaminated distribution is $1 + \gamma(1-\gamma)\theta^2$. Both the contamination and the variance enlargement affect the coverage probability $\beta$, which is smaller than in Table 2. As a result, the asymptotic variance $\sigma^2_\lambda$ of the outlier mean is larger than in Table 2.

2. For mild shifts (θ = 1 or 3), the asymptotic variance $\sigma^2_\lambda$ for the test of hypothesis $H_\mu$ is smaller than $\sigma^2_{\lambda X}$ for the test of hypothesis $H_{\mu,\sigma}$ in every case except (θ, α) = (3, 0.45). When there is a significant shift (θ = 10), $\sigma^2_{\lambda X}$ is smaller than $\sigma^2_\lambda$ for α ∈ {0.25, 0.35, 0.45}.
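The coverage column of Table 3 and the variance formula in comment 1 can both be reproduced from the mixture in (4.2). The Python sketch below is a minimal illustration with hypothetical function names; the default γ = 0.1 follows the table.

```python
from statistics import NormalDist

BASE = NormalDist(0, 1)  # F_X, and the uncontaminated component of F_Y

def mixture_coverage(theta, alpha, gamma=0.1):
    """beta = P{Y >= cutoff} for Y ~ (1 - gamma) N(0, 1) + gamma N(theta, 1)."""
    c = 2 * BASE.inv_cdf(1 - alpha) - BASE.inv_cdf(alpha)
    clean_part = 1 - BASE.cdf(c)
    shifted_part = 1 - NormalDist(theta, 1).cdf(c)
    return (1 - gamma) * clean_part + gamma * shifted_part

def mixture_variance(theta, gamma=0.1):
    """Variance of the contaminated distribution: 1 + gamma (1 - gamma) theta^2."""
    return 1 + gamma * (1 - gamma) * theta ** 2

print(round(mixture_coverage(1, 0.45), 4))
print(round(mixture_coverage(10, 0.05), 4))
print(round(mixture_variance(10), 4))
```

As the cutoff grows (small α) with θ = 10, the coverage approaches γ itself, because essentially only the shifted component reaches the cutoff. This matches the value 0.1000 in the last row of Table 3.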

We now consider the case in which the random variable $Y$ has a mixed distribution with a shift not only in the mean but also in the variance:

$$F_X = N(0, 1) \quad \text{and} \quad F_Y = 0.9N(0, 1) + 0.1N(\theta, \sigma^2)$$

where θ > 0. For each cutoff percentage α and true values θ and σ, we compute the asymptotic variances and display the comparison in Table 4.

Table 4. Comparison of asymptotic variances when there is a small proportion (γ = 0.1) of distributional shift

| σ | θ | α with $\sigma^2_\lambda < \sigma^2_{\lambda X}$ | α with $\sigma^2_\lambda > \sigma^2_{\lambda X}$ |
|---|---|---|---|
| 1 | 1 | 0.45, 0.35, 0.25, 0.15, 0.05 | none |
| 1 | 3 | 0.35, 0.25, 0.15, 0.05 | 0.45 |
| 1 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 3 | 1 | 0.25, 0.15, 0.05 | 0.45, 0.35 |
| 3 | 3 | 0.25, 0.15, 0.05 | 0.45, 0.35 |
| 3 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 5 | 1 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 5 | 3 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 5 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 10 | 1 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 10 | 3 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 10 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |

In this case, where both the mean and the variance of the contaminating component are shifted, $\sigma^2_\lambda < \sigma^2_{\lambda X}$ for the smaller percentages α ∈ {0.05, 0.15}, and $\sigma^2_\lambda > \sigma^2_{\lambda X}$ for most of the larger percentages α ∈ {0.25, 0.35, 0.45}. This provides a guide for choosing the hypothesis to test once the percentage α has been decided.


5 Power Studies with Tests Based on Outlier Mean

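Consider the power function for testing the equal distributions hypothesis $H_{\mu,\sigma}$. Letting $\mu_{\lambda Y}$ and $\sigma_{\lambda Y}$ denote, respectively, the parameters $\mu_\lambda$ and $\sigma_\lambda$ when $Y \sim F_Y$ is true, an approximate power at significance level $\alpha^*$ based on test (3.5) may be derived as follows:

$$\begin{aligned}
\ell_{H_{\mu,\sigma}} &= P_{F_Y}\Big\{\sqrt{n_2}\Big(\frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda X}}\Big) \ge z_{\alpha^*}\Big\} \\
&= P_{F_Y}\Big\{\sqrt{n_2}\Big(\frac{\hat\lambda - \mu_{\lambda Y}}{\sigma_{\lambda Y}}\Big) \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda X} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\} \\
&\approx P\Big\{Z \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda X} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\}.
\end{aligned} \tag{5.1}$$

This is the power function when we test the hypothesis of equal distributions. On the other hand, the power function for testing the equal outlier means hypothesis $H_\mu$ at significance level $\alpha^*$ may be derived as follows:

$$\begin{aligned}
\ell_{H_{\mu}} &= P_{F_Y}\Big\{\sqrt{n_2}\Big(\frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda Y}}\Big) \ge z_{\alpha^*}\Big\} \\
&= P_{F_Y}\Big\{\sqrt{n_2}\Big(\frac{\hat\lambda - \mu_{\lambda Y}}{\sigma_{\lambda Y}}\Big) \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda Y} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\} \\
&\approx P\Big\{Z \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda Y} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\}.
\end{aligned} \tag{5.2}$$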

From (5.1) and (5.2), the performance of these two tests relies on the following elements:

$n_2$: the larger the sample size for the disease gene, the larger the powers, because $\hat\mu_{\lambda X} < \mu_{\lambda Y}$ when there are outliers in $Y$.

$\sigma^2_{\lambda X}$: the larger this asymptotic variance, the smaller the power for testing hypothesis $H_{\mu,\sigma}$.

$\sigma^2_{\lambda Y}$: the larger this asymptotic variance, the smaller the power for testing hypothesis $H_\mu$.

We also note that when the cutoff point percentage α decreases, the asymptotic variances $\sigma^2_{\lambda X}$ and $\sigma^2_{\lambda Y}$ of the outlier mean both increase.
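The approximations (5.1) and (5.2) can be evaluated numerically once the parameters are known. The Python sketch below is an illustration under stated assumptions: the estimates $\hat\mu_{\lambda X}$, $\hat\sigma_{\lambda X}$ and $\hat\sigma_{\lambda Y}$ are replaced by their population values, the significance level is taken as $\alpha^* = 0.05$, and the example uses the location shift setting (4.1) with the asymptotic variances read from Table 2. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def approx_power(n2, mu_x, mu_y, sigma_test, sigma_y, level=0.05):
    """P{Z >= (z * sigma_test + sqrt(n2) * (mu_x - mu_y)) / sigma_y}.

    sigma_test is sigma_{lambda X} for (5.1) and sigma_{lambda Y} for (5.2).
    """
    z_crit = Z.inv_cdf(1 - level)
    threshold = (z_crit * sigma_test + sqrt(n2) * (mu_x - mu_y)) / sigma_y
    return 1 - Z.cdf(threshold)

def normal_outlier_mean(theta, alpha):
    """Population outlier mean of Y ~ N(theta, 1), cutoff built from X ~ N(0, 1)."""
    c = 2 * Z.inv_cdf(1 - alpha) - Z.inv_cdf(alpha)
    eta = c - theta
    # For standard normal errors, the integral of d * phi(d) over [eta, inf) is phi(eta).
    return theta + Z.pdf(eta) / (1 - Z.cdf(eta))

# Example: theta = 1, alpha = 0.45, n2 = 30, variances taken from Table 2.
mu_x = normal_outlier_mean(0, 0.45)
mu_y = normal_outlier_mean(1, 0.45)
sigma_x, sigma_y = sqrt(3.1485), sqrt(1.6203)
print(round(approx_power(30, mu_x, mu_y, sigma_x, sigma_y), 3))  # test (3.5)
print(round(approx_power(30, mu_x, mu_y, sigma_y, sigma_y), 3))  # test (3.6)
```

Under these assumptions the two values come out near 0.28 and 0.52. For the test of $H_\mu$ the threshold reduces to $z_{\alpha^*} + \sqrt{n_2}(\mu_{\lambda X} - \mu_{\lambda Y})/\sigma_{\lambda Y}$, which shows directly why a smaller $\sigma_{\lambda Y}$ and a larger $n_2$ both raise the power.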

We now consider the distributional shift design of (4.1) and compute the approximate powers for testing hypotheses $H_{\mu,\sigma}$ and $H_\mu$. The results are displayed in Table 5.

Table 5. Approximate powers $(\ell_{H_{\mu,\sigma}}, \ell_{H_\mu})$ of the outlier mean when there is a distributional shift

| $n_2$ | α | θ = 1 | θ = 3 | θ = 5 | θ = 10 |
|---|---|---|---|---|---|
| 30 | 0.45 | (0.2775, 0.5230) | (1, 1) | (1, 1) | (1, 1) |
| 30 | 0.35 | (3.0e-4, 0.1441) | (0.0826, 1) | (1, 1) | (1, 1) |
| 30 | 0.25 | (4.5e-6, 0.0634) | (0, 0.8634) | (0, 1) | (1, 1) |
| 30 | 0.15 | (1.2e-10, 0.0517) | (0, 0.1513) | (0, 1) | (0, 1) |
| 30 | 0.05 | (0, 0.0500) | (0, 0.0530) | (0, 0.1544) | (0, 1) |
| 50 | 0.45 | (0.4622, 0.7099) | (1, 1) | (1, 1) | (1, 1) |
| 50 | 0.35 | (5.6e-4, 0.1861) | (0.7358, 1) | (1, 1) | (1, 1) |
| 50 | 0.25 | (5.3e-6, 0.0679) | (0, 0.9708) | (0, 1) | (1, 1) |
| 50 | 0.15 | (1.3e-10, 0.0521) | (0, 0.1971) | (0, 1) | (0, 1) |
| 50 | 0.05 | (0, 0.0500) | (0, 0.0538) | (0, 0.2018) | (0, 1) |

We draw the following comments from the results in Table 5:

1. The tests of hypotheses $H_{\mu,\sigma}$ and $H_\mu$ have small powers for mild shifts (θ = 1, 3) unless we choose a large proportion α (α = 0.35 or 0.45). If there is a significant shift in location (θ = 10), both tests are satisfactory in most cases. The percentage α = 0.25 is the one popularly recommended in the literature (see Hoaglin et al. (1983)).

2. A comparison of the approximate powers of the two tests shows that the test of hypothesis $H_\mu$ seems to be the right choice. The test of hypothesis $H_{\mu,\sigma}$ gives unsatisfactory powers except in cases of strong distributional shift, such as θ = 5 or 10, with the percentage α chosen as large as 0.35 or 0.45.

3. Both tests are affected by the sample size: a larger sample generally yields larger power for either test.

Next, we consider the case where $Y$ follows the contaminated normal distribution of (4.2):

$$F_X = N(0, 1) \quad \text{and} \quad F_Y = 0.9N(0, 1) + 0.1N(\theta, 1)$$

where θ > 0. The computed approximate powers for testing hypotheses $H_{\mu,\sigma}$ and $H_\mu$ are displayed in Table 6.

Table 6. Approximate powers $(\ell_{H_{\mu,\sigma}}, \ell_{H_\mu})$ of the outlier mean when there is a small fraction of distributional shift

| $n_2$ | α | θ = 1 | θ = 3 | θ = 5 | θ = 10 |
|---|---|---|---|---|---|
| 30 | 0.45 | (0.0740, 0.0791) | (0.4420, 0.2750) | (0.7307, 0.4072) | (0.8921, 0.5011) |
| 30 | 0.35 | (0.0363, 0.0584) | (0.1353, 0.1713) | (0.4635, 0.3136) | (0.8060, 0.4512) |
| 30 | 0.25 | (0.0217, 0.0525) | (0.0026, 0.1077) | (0.0669, 0.2326) | (0.5517, 0.3951) |
| 30 | 0.15 | (0.0043, 0.0505) | (0.0000, 0.0658) | (0.0000, 0.1444) | (1.9e-7, 0.3285) |
| 30 | 0.05 | (2.1e-8, 0.0500) | (0.0000, 0.0510) | (0.0000, 0.0638) | (0.0000, 0.2238) |
| 50 | 0.45 | (0.0840, 0.0897) | (0.5631, 0.3847) | (0.8474, 0.5698) | (0.9570, 0.6852) |
| 50 | 0.35 | (0.0382, 0.0611) | (0.1843, 0.2277) | (0.5970, 0.4411) | (0.9043, 0.6256) |
| 50 | 0.25 | (0.0221, 0.0532) | (0.0038, 0.1311) | (0.1087, 0.3213) | (0.7023, 0.5537) |
| 50 | 0.15 | (0.0043, 0.0506) | (0.0000, 0.0711) | (0.0000, 0.1866) | (1.1e-6, 0.4623) |
| 50 | 0.05 | (2.1e-8, 0.0500) | (0.0000, 0.0512) | (0.0000, 0.0684) | (0.0000, 0.3079) |

We draw the following comments from the results in Table 6:

1. The contaminated distribution $F_Y$ reduces the powers of the two tests by enlarging the asymptotic variances $\sigma^2_{\lambda X}$ and $\sigma^2_\lambda$ of the outlier mean, owing to the contamination and the increased variance of $F_Y$.

2. If we set the cutoff point percentage α to 0.35 or more, the test of hypothesis $H_{\mu,\sigma}$ seems to be the right choice. On the other hand, if we set α smaller than 0.25, the test of hypothesis $H_\mu$ seems to be the right choice. For α = 0.25, the test of hypothesis $H_\mu$ is better unless the location parameter θ in (4.2) is large (θ = 10 in Table 6).

Table 7. Approximate powers $(\ell_{H_{\mu,\sigma}}, \ell_{H_\mu})$ of the outlier mean when there is a small fraction of distributional shift ($n_2 = 50$)

| γ | α | θ = 5 | θ = 10 | θ = 20 |
|---|---|---|---|---|
| 0.05 | 0.45 | (0.5902, 0.3196) | (0.8241, 0.4207) | (0.9008, 0.4606) |
| 0.05 | 0.35 | (0.3782, 0.2445) | (0.7326, 0.3746) | (0.8706, 0.4383) |
| 0.05 | 0.25 | (0.1323, 0.1961) | (0.5828, 0.3339) | (0.8200, 0.4130) |
| 0.05 | 0.15 | (2.4e-15, 0.1301) | (0.0005, 0.2816) | (0.1986, 0.3824) |
| 0.05 | 0.05 | (0.0000, 0.0632) | (0.0000, 0.1955) | (0.0000, 0.3301) |
| 0.20 | 0.45 | (0.9818, 0.8759) | (0.9977, 0.9385) | (0.9992, 0.9550) |
| 0.20 | 0.35 | (0.8295, 0.7529) | (0.9881, 0.9064) | (0.9981, 0.9455) |
| 0.20 | 0.25 | (0.0614, 0.5591) | (0.8336, 0.8492) | (0.9879, 0.9289) |
| 0.20 | 0.15 | (0.0000, 0.3026) | (8.6e-13, 0.7516) | (0.0376, 0.9011) |
| 0.20 | 0.05 | (0.0000, 0.0758) | (0.0000, 0.5274) | (0.0000, 0.8367) |
| 0.30 | 0.45 | (0.9985, 0.9785) | (1.0000, 0.9939) | (1.0000, 0.9965) |
| 0.30 | 0.35 | (0.9338, 0.9227) | (0.9990, 0.9871) | (0.9999, 0.9952) |
| 0.30 | 0.25 | (0.0302, 0.7601) | (0.9101, 0.9687) | (0.9986, 0.9924) |
| 0.30 | 0.15 | (0.0000, 0.4305) | (0.0000, 0.9187) | (0.0102, 0.9859) |
| 0.30 | 0.05 | (0.0000, 0.0825) | (0.0000, 0.7246) | (0.0000, 0.9635) |
| 0.50 | 0.45 | (1.0000, 0.9968) | (1.0000, 1.0000) | (1.0000, 1.0000) |
| 0.50 | 0.35 | (0.9952, 0.9111) | (1.0000, 1.0000) | (1.0000, 1.0000) |
| 0.50 | 0.25 | (0.0038, 0.4660) | (0.9836, 0.9999) | (1.0000, 1.0000) |
| 0.50 | 0.15 | (0.0000, 0.1104) | (0.0000, 0.9986) | (0.0002, 1.0000) |
| 0.50 | 0.05 | (0.0000, 0.0525) | (0.0000, 0.9616) | (0.0000, 0.9998) |

We have several comments on the results in Table 7:

1. Although more contamination (larger γ) enlarges the variance of the response variable $Y$, it also makes the existence of outliers easier to detect, so the powers of the two tests increase. The powers for γ = 0.5 are very close to those for the location shift in Table 5.

2. Even with large contamination (γ = 0.5), the test of hypothesis $H_{\mu,\sigma}$ with low α's (α = 0.05 and 0.15) is still very poor in power.

3. Combining the discussions for the results in Tables 5-7, the test for

Besides the cases of normal or mixed normal distributions, we may consider the cases in which $X$ and $Y$ are drawn from the following two settings:

Case 1: FX = Laplace(0, 1) and FY = Laplace(θ, 1)

Case 2: FX = t(5) and FY = t(5) + θ.

Table 8. Approximate powers $(\ell_{H_{\mu,\sigma}}, \ell_{H_\mu})$ for hypotheses $H_{\mu,\sigma}$ and $H_\mu$ when $X$ and $Y$ have Laplace or t distributions ($n_2 = 30$)

8.(a) $Y \sim Laplace(\theta, 1)$

| α | θ = 1 | θ = 3 | θ = 5 | θ = 10 |
|---|---|---|---|---|
| 0.45 | (0.0193, 0.2899) | (1.0000, 1.0000) | (1, 1) | (1, 1) |
| 0.35 | (5.2e-7, 0.0500) | (0.0134, 0.9997) | (1, 1) | (1, 1) |
| 0.25 | (7.1e-7, 0.0500) | (0, 0.4908) | (0, 1) | (1, 1) |
| 0.15 | (6.7e-4, 0.0500) | (0, 0.0500) | (0, 0.8694) | (0, 1) |

8.(b) $Y \sim t(5) + \theta$

| α | θ = 1 | θ = 3 | θ = 5 | θ = 10 |
|---|---|---|---|---|
| 0.45 | (0.0465, 0.4128) | (1, 1) | (1, 1) | (1, 1) |
| 0.35 | (8.2e-8, 0.0777) | (0.0002, 0.9999) | (1, 1) | (1, 1) |
| 0.25 | (4.0e-6, 0.0433) | (0, 0.5184) | (0, 1) | (0.9838, 1) |
| 0.15 | (0.0001, 0.0460) | (0, 0.0167) | (0, 0.8794) | (0, 1) |

From the displayed results, both tests are quite satisfactory when there are significant location shifts. However, the test of hypothesis $H_\mu$ is uniformly better than that of hypothesis $H_{\mu,\sigma}$. The test of hypothesis $H_\mu$ is very satisfactory even for small percentages α when the location shift is as large as 5 or more.

We now consider the case in which the random variable $Y$ has a mixed distribution with a shift not only in the mean but also in the variance:

$$F_X = N(0, 1) \quad \text{and} \quad F_Y = 0.9N(0, 1) + 0.1N(\theta, \sigma^2)$$

where θ > 0. For sample size $n_2 = 30$, cutoff percentage α and true values θ and σ, we compute the approximate powers $\ell_{H_{\mu,\sigma}}$ and $\ell_{H_\mu}$. With α = 0.05, 0.15, 0.25, 0.35, 0.45 and θ = 1, 3, 10, we display a comparison of the two approximate powers in Table 9.

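Table 9. Comparison of approximate powers

| σ | θ | α with $\ell_{H_\mu} > \ell_{H_{\mu,\sigma}}$ | α with $\ell_{H_\mu} < \ell_{H_{\mu,\sigma}}$ |
|---|---|---|---|
| 1 | 1 | 0.45, 0.35, 0.25, 0.15, 0.05 | none |
| 1 | 3 | 0.35, 0.25, 0.15, 0.05 | 0.45 |
| 1 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 3 | 1 | 0.25, 0.15, 0.05 | 0.45, 0.35 |
| 3 | 3 | 0.25, 0.15, 0.05 | 0.45, 0.35 |
| 3 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 5 | 1 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 5 | 3 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 5 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 10 | 1 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 10 | 3 | 0.15, 0.05 | 0.45, 0.35, 0.25 |
| 10 | 10 | 0.15, 0.05 | 0.45, 0.35, 0.25 |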

In general, testing hypothesis $H_{\mu,\sigma}$ is more powerful than testing hypothesis $H_\mu$ when we choose the percentage α as large as 0.35 or 0.45, and testing hypothesis $H_\mu$ is more powerful when α is as small as 0.05 or 0.15.

6 Appendix

To investigate Theorem 3.2, let’s establish a more general theory for outlier mean Π. The following assumptions are needed.

(A1) The limit h = limn1,n2→∞

n2

n1 exists.

(A2) Suppose that there is constant C such that

√

n₁( ˆC − C) = Op(1) where

C depends generally on distribution of X.

(A3) Probability density function fδ of distribution Fδis bounded away from

zero in a neighborhood of quantity C − µY.

(A4) Probability density function f is bounded away from zero in the

neigh-borhood of F−1(α) for α ∈ (0, 1). Proof of Theorem 2.2 Proof. If a > 0, λp_{aX+b,aY +b}(α1, α2, . . . , αk) =E[(aY + b)I(aY + b ≥ Pk j=1cjF −1 aX+b(αj))] P {aY + b ≥Pk j=1cjFaX+b−1 (αj)} =E[(aY + b)I(aY + b ≥ Pk j=1cj(aF −1 X (αj) + b))] P {aY + b ≥Pk j=1cj(aF −1 X (αj) + b)} =E[(aY + b)I(Y ≥ Pk j=1cjFX−1(αj))] P {Y ≥Pk j=1cjF −1 X (αj)} =aE[Y I(Y ≥ Pk j=1cjFX−1(αj))] P {Y ≥Pk j=1cjF −1 X (αj)} + b =aλp_X,Y(α1, α2, . . . , αk) + b.

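If $a < 0$,

$$
\begin{aligned}
\lambda^{p}_{aX+b,\,aY+b}(\alpha_1, \alpha_2, \ldots, \alpha_k)
&= \frac{E\big[(aY+b)\, I\big(aY+b \ge \sum_{j=1}^{k} c_j (a F^{-1}_{X}(1-\alpha_j) + b)\big)\big]}{P\big\{aY+b \ge \sum_{j=1}^{k} c_j (a F^{-1}_{X}(1-\alpha_j) + b)\big\}} \\
&= \frac{E\big[(aY+b)\, I\big(Y \le \sum_{j=1}^{k} c_j F^{-1}_{X}(1-\alpha_j)\big)\big]}{P\big\{Y \le \sum_{j=1}^{k} c_j F^{-1}_{X}(1-\alpha_j)\big\}} \\
&= a\, \frac{E\big[Y\, I\big(Y \le \sum_{j=1}^{k} c_j F^{-1}_{X}(1-\alpha_j)\big)\big]}{P\big\{Y \le \sum_{j=1}^{k} c_j F^{-1}_{X}(1-\alpha_j)\big\}} + b \\
&= a\, \lambda^{n}_{X,Y}(1-\alpha_1, 1-\alpha_2, \ldots, 1-\alpha_k) + b.
\end{aligned}
$$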

The transformation property of the outlier mean with negative outliers can be proved similarly, so its proof is omitted.
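
The a > 0 case of Theorem 2.2 can be checked numerically on a sample analogue. The sketch below is an illustration, not part of the proof: the function names, the toy data, and the inverse-empirical-CDF quantile rule are assumptions introduced here. Note that moving from the threshold Σ c_j (a F_X^{-1}(α_j) + b) to the event on Y alone appears to require Σ c_j = 1, so the example takes c = (2, −1), matching the threshold 2F_X^{-1}(1 − α) − F_X^{-1}(α) used in the proof of Theorem 2.3.

```python
import math


def quantile(xs, p):
    """Inverse empirical CDF: the k-th order statistic with k = ceil(p * n)."""
    s = sorted(xs)
    return s[max(1, math.ceil(p * len(s))) - 1]


def positive_outlier_mean(xs, ys, alphas, coefs):
    """Sample analogue of E[Y 1{Y >= c}] / P{Y >= c}.

    The cutoff is c = sum_j coefs[j] * Q_X(alphas[j]). Raises
    ZeroDivisionError if no Y value reaches the cutoff.
    """
    cut = sum(c * quantile(xs, a) for c, a in zip(coefs, alphas))
    tail = [y for y in ys if y >= cut]
    return sum(tail) / len(tail)


# Toy data (hypothetical); integer values keep the arithmetic exact.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 4, 9, 10, 12, 14]
alphas, coefs = (0.75, 0.25), (2, -1)  # coefficients sum to 1
a, b = 2, 3  # affine map with a > 0

lam = positive_outlier_mean(xs, ys, alphas, coefs)
lam_t = positive_outlier_mean(
    [a * x + b for x in xs], [a * y + b for y in ys], alphas, coefs
)
print(lam, lam_t, a * lam + b)
```

With these values the cutoff is 10 before the transformation and 23 = 2 · 10 + 3 after it, so the same three observations are selected in both cases and the transformed outlier mean equals a times the original plus b.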

Proof of Theorem 2.3

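Proof. Let $C = 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$ and $\hat{C} = 2\hat{F}_X^{-1}(1-\alpha) - \hat{F}_X^{-1}(\alpha)$. From model (2.2) and the expression of $\hat{\lambda}_X$ in (2.1), we have

$$
\hat{\lambda} = \mu_Y + \frac{\sum_{i=1}^{n_2} \delta_i\, I(\delta_i > C - \mu_Y + n_1^{-1/2} T)}{\sum_{i=1}^{n_2} I(Y_i > \hat{C})},
$$

where $T = \sqrt{n_1}(\hat{C} - C)$. This implies that

$$
\sqrt{n_2}(\hat{\lambda} - \mu_Y) = \frac{n_2^{-1/2} \sum_{i=1}^{n_2} \delta_i\, I(\delta_i > C - \mu_Y + n_1^{-1/2} T)}{n_2^{-1} \sum_{i=1}^{n_2} I(Y_i > \hat{C})}. \tag{6.1}
$$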

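With assumption (A4), the key step in this proof is that

$$
\begin{aligned}
& n_2^{-1/2} \sum_{i=1}^{n_2} \delta_i \big[ I(\delta_i > C - \mu_Y + n_1^{-1/2} T) - I(\delta_i > C - \mu_Y) \big] \\
&\quad = - n_2^{-1/2} \sum_{i=1}^{n_2} \delta_i \big[ I(\delta_i \le C - \mu_Y + n_1^{-1/2} T) - I(\delta_i \le C - \mu_Y) \big] \\
&\quad = - (C - \mu_Y)\, g_y(C - \mu_Y) \sqrt{h}\, T + o_p(1),
\end{aligned}
\tag{6.2}
$$

which may be seen in Ruppert and Carroll (1980) and Chen and Chiang (1996).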

The result then follows from assumption (A1), equations (6.1) and (6.2), and the following representation of the empirical quantile:

$$
\sqrt{n_1}\big(\hat{F}^{-1}(\alpha) - F^{-1}(\alpha)\big) = f^{-1}\big(F^{-1}(\alpha)\big)\, n_1^{-1/2} \sum_{i=1}^{n_1} \big[\alpha - I(X_i \le F^{-1}(\alpha))\big] + o_p(1); \tag{6.3}
$$

see, for example, Ruppert and Carroll (1980). The asymptotic distribution in (b) of Theorem 2.3 follows from the Central Limit Theorem.
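
Representation (6.3) can be illustrated numerically. The sketch below assumes that F is the standard normal distribution and uses the inverse-empirical-CDF sample quantile. The function name, the seed, and the sample size are choices made here for illustration and are not taken from the thesis. For large n the two sides should differ only by a small remainder.

```python
import math
import random
from statistics import NormalDist


def bahadur_terms(n, alpha, seed):
    """Return (left, right) sides of the quantile representation for N(0, 1) data.

    left  = sqrt(n) * (sample quantile - true quantile)
    right = n^(-1/2) * sum_i (alpha - 1{X_i <= q}) / f(q), with q the true quantile
    """
    rng = random.Random(seed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    q = std_normal.inv_cdf(alpha)  # true alpha-quantile F^{-1}(alpha)
    xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    q_hat = xs[max(1, math.ceil(alpha * n)) - 1]  # empirical quantile
    left = math.sqrt(n) * (q_hat - q)
    below = sum(1 for x in xs if x <= q)  # number of X_i at or below q
    right = (alpha * n - below) / (math.sqrt(n) * std_normal.pdf(q))
    return left, right


if __name__ == "__main__":
    left, right = bahadur_terms(n=100_000, alpha=0.9, seed=1)
    print(left, right, left - right)
```

Both terms have an asymptotic standard deviation of sqrt(α(1 − α)) / f(F^{-1}(α)), so a difference that is small relative to that scale is consistent with the o_p(1) remainder.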

The proof of Theorem 3.1 is identical to that of Theorem 2.3 with

References

1. Agrawal, D., Chen, T., Irby, R., et al. (2002). Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J. Natl. Cancer Inst., 94, 513-521.

2. Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

3. Chen, L.-A., Chen, D.-T. and Chan, W. (2008). The p value for the outlier sum in differential gene expression analysis. Submitted to Biometrika (in revision).

4. Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.

5. Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1983). Understanding Robust and Exploratory Data Analysis, Wiley: New York.

6. Ohki, R., Yamamoto, K., Ueno, S., et al. (2005). Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int. J. Cardiol., 102, 233-238.

7. Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.

8. Sorlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418-8423.

9. Tibshirani, R. and Hastie, T. (2007). Outlier sums differential gene expression analysis. Biostatistics, 8, 2-8.

10. Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.

11. Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.
