National Chiao Tung University
Institute of Statistics
Master's Thesis
Nonparametric Outlier Mean for Gene Expression Analysis
Student: Ya-Fang You
Advisor: Dr. Lin-An Chen
A Thesis
Submitted to Institute of Statistics
College of Science
National Chiao Tung University
In Partial Fulfillment of the Requirements
For the Degree of
Master
In
Statistics
June 2009
Hsinchu, Taiwan, Republic of China
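Nonparametric Outlier Mean for Gene Expression Analysis
Student: Ya-Fang You
Advisor: Dr. Lin-An Chen
Master's Program, Institute of Statistics, National Chiao Tung University
Chinese Abstract
The outlier mean has good power when testing a shift of the whole distribution. However, when only part of the distribution shifts, the variance of the outlier mean is inflated and the power drops sharply, and such partial shifts are frequently seen in cancer research. Traditional statistical methods use good data for statistical inference, whereas the outlier mean uses outliers for statistical inference, so the two are conceptually very different. We study the nonparametric outlier mean from two viewpoints. First, we derive the asymptotic distribution of the outlier mean in order to establish a level α test and compute the p value. Second, with regard to the rule for determining outliers, we infer the relation between power and asymptotic variance.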
Nonparametric Outlier Mean for Gene Expression Analysis
Student: Ya-Fang You
Advisor: Dr. Lin-An Chen
Institute of Statistics
National Chiao Tung University
ABSTRACT
The outlier mean has reasonable power when the whole distribution undergoes a location shift. However, its power is remarkably reduced when the distribution is shifted on only a small fraction of observations, because the asymptotic variances become large, and this situation occurs frequently in cancer studies. We study the nonparametric outlier mean (outlier sum) in two aspects. First, we develop its asymptotic distribution, which allows us to establish a level α test or compute a p value. Second, the concept of using outliers for statistical inference may be treated differently from classical statistical inference, which constructs rules based on good data. We study the relation between the powers and the asymptotic variances of outlier means, aiming at drawing principles for choosing outlier-based inference techniques.
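Acknowledgments
Two years of master's study and eighteen years as a student are about to come to a close here.
I sincerely thank my advisor, Professor Lin-An Chen. Thanks to his careful and patient guidance and his tireless help in resolving my questions, this thesis could be completed smoothly. I thank the oral examination committee members, Professors 黃冠華, 蔡明田 and 吳柏林, whose corrections and suggestions made the thesis more complete.
I thank the classmates and friends around me. Growing together with you has been wonderful: you shared my low moments and discussed problems with me when they arose. Because of you I was never fighting alone, and I am glad to have you.
Finally, I thank my family, who have always been with me. Your support let me pursue my studies without worries and let me know that a warm harbor is always ready to welcome me for a rest. Thank you, my dearest family.
I dedicate this thesis to my family, friends and teachers with my most sincere gratitude. Sharing these results and this joy with you is my greatest happiness.
Ya-Fang, Institute of Statistics, National Chiao Tung University, June 2009
Contents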
Chinese Abstract
Abstract
Acknowledgments
1 Introduction
2 Two Tests Based on Asymptotic Distribution of the Outlier Mean
3 Outlier Mean Based Hypothesis Testing
4 Comparison of Outlier Coverages and Asymptotic Variances of Outlier Mean
5 Power Studies with Tests Based on Outlier Mean
6 Appendix
References
1 Introduction
DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al. (2002); Alizadeh et al. (2000); Ohki et al. (2005); Sorlie et al. (2003)). For example, Sorlie et al. (2003) used gene expression to classify malignant breast tumors into five molecular subtypes (one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup). Alizadeh et al. (2000) reported that patients with germinal center B-like diffuse large B-cell lymphoma had a significantly better chance of overall survival than those with another molecular pattern, activated B-like diffuse large B-cell lymphoma. Recently, microarray analysis for disease classification has been advanced by identifying outlier genes that are over-expressed in only a small number of disease samples (see, for example, Tibshirani and Hastie (2007); Tomlins et al. (2005)). For this goal, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because of the large number of gene expressions and the limited number of subjects available.
Several statistical approaches have been proposed to identify genes for which only a subset of the samples has high expression. Among them, Tomlins et al. (2005) introduced a method called cancer outlier profile analysis, which identifies outlier profiles by a statistic based on the median and the median absolute deviation of a gene expression profile. Tibshirani and Hastie (2007) suggested an outlier sum that adds up all the gene expression values in the disease group that are greater than the sum of the 75th percentile and the interquartile range for the same gene. They also showed in simulations that the statistical test based on this outlier sum is noticeably more powerful than cancer outlier profile analysis. An alternative outlier sum-like statistic, called the outlier robust t-statistic, has been proposed by Wu (2007). Recently, Chen, Chen and Chan (2008) proposed a new version of the outlier sum and its corresponding outlier mean, and developed a large sample theory that allows the p value to be formulated from the asymptotic distribution. Specifically, they considered a parametric study by specifying the normal distribution, and performed simulation studies and data analysis for gene expression analysis.
Although the large sample distribution of an outlier mean has been provided in Chen, Chen and Chan (2008), the nonparametric study of the outlier mean is still very restricted, so its application in gene expression analysis remains limited. Specifically, an outlier mean can be used to test a relation between the distributions of normal group subjects and disease group subjects, where this relation may be the identity of the two distributions or a weaker relation such as the identity of only the two population outlier means. This is vital because different assumptions lead to different tests, and tests for different hypotheses involve different scale estimates, which may produce significant differences in power performance. An advanced study of the nonparametric outlier mean is therefore desirable, so that practitioners have a principle for choosing an outlier mean test statistic that is appropriate in terms of power performance. This is the aim of this paper.
We define an outlier mean with a cutoff point of a specific form from a general class and develop its asymptotic representation and distribution. We also develop an asymptotic distribution for this outlier mean when the distributions of normal group subjects and disease group subjects are identical. This allows us to consider testing the hypothesis of equal distributions and the hypothesis of equal population outlier means. We evaluate the power performance of these two tests and obtain several interesting results. 1. If there is a distributional shift in location only, then the test for the hypothesis of equal population outlier means is relatively more powerful than the other one. On the other hand, if there is a shift in both location and scale, the two tests are very competitive. This provides an important message for users when the pattern of distributional shift can be observed from the data. 2. The popularly used cutoff point with percentage α = 0.25 is quite unsatisfactory in the nonparametric power study for gene expression analysis, while the percentages α = 0.35 or 0.45 give satisfactory cutoff points.
In Section 2, we first introduce an outlier mean with a cutoff point of a specific form from a general class and develop its asymptotic representation and distribution. In Section 3, we then develop the asymptotic distribution of this outlier mean under the assumption that the distribution of disease group subjects and the distribution of normal group subjects are identical. This allows us to introduce several hypotheses defined on the parameters involved in the asymptotic distribution, and a test for each hypothesis may be determined through estimation of the parameters used in that hypothesis. In Section 4, we perform an asymptotic variance comparison for this outlier mean under several distributions for the normal group variable and the disease group variable. This provides a guide for users to determine which hypothesis to test when the underlying distributions of the two groups belong to these specific types. In Section 5, we make a power comparison for these tests. Finally, the proofs of the theorems are given in Section 6.
2 Two Tests Based on Asymptotic Distribution of the Outlier Mean
Let $X$ and $Y$ be expression variables for the group of normal subjects and the group of disease subjects, respectively, with distribution functions $F_X$ and $F_Y$. Extending Tibshirani and Hastie (2007), Wu (2007) and Chen, Chen and Chan (2008), a general type of cutoff point used in gene expression analysis to detect outliers may be formulated as $\sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)$, $0 < \alpha_j < 1$, $j = 1, \ldots, k$. We now define population type outlier means.
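Definition 2.1. If $\sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j) > F_X^{-1}(0.5)$, we call
$$\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{E\big[Y\, I\big(Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\big)\big]}{P\big\{Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\big\}}$$
a population outlier mean with positive outliers. On the other hand, if $\sum_{j=1}^{k} c_j F_X^{-1}(\gamma_j) < F_X^{-1}(0.5)$, we call
$$\lambda^{n}_{X,Y}(\gamma_1, \gamma_2, \ldots, \gamma_k) = \frac{E\big[Y\, I\big(Y \le \sum_{j=1}^{k} c_j F_X^{-1}(\gamma_j)\big)\big]}{P\big\{Y \le \sum_{j=1}^{k} c_j F_X^{-1}(\gamma_j)\big\}}$$
a population outlier mean with negative outliers.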
In the literature, the outlier sum of Wu (2007) and the outlier mean of Chen, Chen and Chan (2008) are of this type. Their corresponding coefficients are listed in Table 1.
Table 1. Coefficients for some outlier means
Outlier mean                  {α1, α2, α3}          {c1, c2, c3}
Wu (2007)                     {0.25, 0.75, 0.75}    {−1, 1, 1}
Chen, Chen and Chan (2008)    {0.25, 0.5, 0.75}     {−κ, 1, κ}, where κ > 0
An invariance property is desirable for any statistical functional, and not every population outlier mean introduced above is of interest in this respect. Suppose that a random variable $X$ has quantile function $F_X^{-1}(\alpha)$. It is known that the quantile function has the following property:
$$F^{-1}_{aX+b}(\alpha) = \begin{cases} aF_X^{-1}(\alpha) + b & \text{if } a > 0, \\ aF_X^{-1}(1-\alpha) + b & \text{if } a < 0. \end{cases}$$
The following theorem gives a condition under which a population outlier mean satisfies the desired invariance properties.
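Theorem 2.2. Suppose that $c_j$, $j = 1, \ldots, k$, satisfy $\sum_{j=1}^{k} c_j = 1$. Then the population outlier mean with positive outliers has the following properties:
$$\lambda^{p}_{aX+b,\,aY+b}(\alpha_1, \alpha_2, \ldots, \alpha_k) = \begin{cases} a\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k) + b & \text{if } a > 0, \\ a\lambda^{n}_{X,Y}(1-\alpha_1, 1-\alpha_2, \ldots, 1-\alpha_k) + b & \text{if } a < 0. \end{cases}$$
On the other hand, the population outlier mean with negative outliers has the following properties:
$$\lambda^{n}_{aX+b,\,aY+b}(\gamma_1, \gamma_2, \ldots, \gamma_k) = \begin{cases} a\lambda^{n}_{X,Y}(\gamma_1, \gamma_2, \ldots, \gamma_k) + b & \text{if } a > 0, \\ a\lambda^{p}_{X,Y}(1-\gamma_1, 1-\gamma_2, \ldots, 1-\gamma_k) + b & \text{if } a < 0. \end{cases}$$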
If the outlier means $\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $\lambda^{n}_{X,Y}(\gamma_1, \gamma_2, \ldots, \gamma_k)$ are formulated with $\sum_{j=1}^{k} c_j \ne 1$, we may see from the proof of Theorem 2.2 (see Section 6) that they no longer share this equivariance property of the quantile function.
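As an illustration of Theorem 2.2, the following minimal sketch checks the property for $a > 0$ on a toy data set, using a sample analogue of the outlier mean. The helper names, the toy data and the order-statistic quantile convention are assumptions made for this example and are not taken from the thesis.

```python
import math


def empirical_quantile(sample, p):
    """Order statistic x_(ceil(n*p)); an assumed empirical quantile convention."""
    ordered = sorted(sample)
    return ordered[max(1, math.ceil(len(ordered) * p)) - 1]


def positive_outlier_mean(x, y, probs, coefs):
    """Mean of the y values at or above the cutoff sum_j c_j * Q_x(p_j)."""
    cutoff = sum(c * empirical_quantile(x, p) for p, c in zip(probs, coefs))
    tail = [v for v in y if v >= cutoff]
    return sum(tail) / len(tail) if tail else float("nan")


x = list(range(1, 11))        # toy normal group
y = [5, 12, 13, 20, 30]       # toy disease group
a, b = 2, 3                   # affine map with a > 0
ax = [a * v + b for v in x]
ay = [a * v + b for v in y]

# Coefficients summing to 1 (cutoff 2*Q(0.75) - Q(0.25)): the property holds.
probs, coefs = (0.25, 0.75), (-1, 2)
base = positive_outlier_mean(x, y, probs, coefs)
shifted = positive_outlier_mean(ax, ay, probs, coefs)
equivariant = math.isclose(shifted, a * base + b)

# Coefficients summing to 2 (cutoff Q(0.25) + Q(0.75)): the property fails here.
coefs2 = (1, 1)
base2 = positive_outlier_mean(x, y, probs, coefs2)
shifted2 = positive_outlier_mean(ax, ay, probs, coefs2)
breaks = not math.isclose(shifted2, a * base2 + b)

print(equivariant, breaks)
```

With coefficients summing to 1, the cutoff moves exactly as the data do, so the same observations are selected before and after the transformation. With coefficients summing to 2, the cutoff picks up an extra shift of $b$ and a different set of observations is selected.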
We suggest a population cutoff point of the form $2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$. Let $\hat F_X^{-1}$ be the empirical quantile function for estimating the population quantile function $F_X^{-1}$. The sample outlier mean can be expressed as
$$\hat\lambda = \frac{\sum_{i=1}^{n_2} Y_i\, I\big(Y_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}{\sum_{i=1}^{n_2} I\big(Y_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}. \tag{2.1}$$
Implicitly, this sample outlier mean estimates the following population outlier mean:
$$\mu_\lambda = \frac{E\big[Y\, I\big(Y \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big)\big]}{P\big\{Y \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big\}}.$$
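The sample outlier mean (2.1) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names and the toy data are hypothetical, and the empirical quantile is taken to be the order statistic $x_{(\lceil n\alpha \rceil)}$, which is one common convention that the text does not specify.

```python
import math


def empirical_quantile(sample, p):
    """Return the order statistic x_(ceil(n*p)); an assumed quantile convention."""
    ordered = sorted(sample)
    k = max(1, math.ceil(len(ordered) * p))  # 1-based rank
    return ordered[k - 1]


def sample_outlier_mean(x, y, alpha):
    """Sample outlier mean (2.1): average of the y values at or above the cutoff."""
    cutoff = 2 * empirical_quantile(x, 1 - alpha) - empirical_quantile(x, alpha)
    outliers = [v for v in y if v >= cutoff]
    if not outliers:  # the ratio in (2.1) is undefined when no y reaches the cutoff
        return float("nan")
    return sum(outliers) / len(outliers)


x = list(range(1, 11))      # toy normal group: 1, 2, ..., 10
y = [5, 12, 13, 20, 30]     # toy disease group
lam_hat = sample_outlier_mean(x, y, alpha=0.25)  # cutoff = 2*8 - 3 = 13
print(lam_hat)
```

Returning NaN when no disease observation reaches the cutoff is a design choice of this sketch. It keeps the undefined 0/0 case visible to the caller rather than silently reporting a number.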
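To establish a p value based on large sample theory, we consider the following location models,
$$X_i = \mu_X + \epsilon_i,\ i = 1, \ldots, n_1, \qquad Y_i = \mu_Y + \delta_i,\ i = 1, \ldots, n_2, \tag{2.2}$$
where the $\epsilon_i$'s and $\delta_i$'s are sequences of independent and identically distributed random variables with distribution functions $F_\epsilon$ and $F_\delta$ and probability density functions $f_\epsilon$ and $f_\delta$, respectively. In addition, $E(\epsilon_i) = E(\delta_i) = 0$, $Var(\epsilon_i) = \sigma_X^2$ and $Var(\delta_i) = \sigma_Y^2$. With this setup, $F_X(x) = F_\epsilon(x - \mu_X)$ and $F_Y(y) = F_\delta(y - \mu_Y)$. In terms of the error distributions in (2.2), the population outlier mean is
$$\mu_\lambda = \mu_Y + \frac{\int_{\eta}^{\infty} \delta f_\delta(\delta)\, d\delta}{\beta},$$
where $\eta = 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha) - \mu_Y$ is the cutoff point on the scale of the error $\delta$ and $\beta = P\{\delta \ge \eta\}$ is the outlier coverage probability.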
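Theorem 2.3. Suppose that assumptions (A2), (A3) and (A4) in the Appendix are true.
(a) A Bahadur representation of the outlier mean is
$$\begin{aligned} \sqrt{n_2}(\hat\lambda - \mu_\lambda) ={}& \big((1-\alpha)b_1 - \alpha b_2\big)\, n_1^{-1/2} \sum_{i=1}^{n_1} I\big(\epsilon_i \le F_\epsilon^{-1}(\alpha)\big) \\ & - \alpha(b_1 + b_2)\, n_1^{-1/2} \sum_{i=1}^{n_1} I\big(F_\epsilon^{-1}(\alpha) \le \epsilon_i \le F_\epsilon^{-1}(1-\alpha)\big) \\ & + \big(-\alpha b_1 + (1-\alpha) b_2\big)\, n_1^{-1/2} \sum_{i=1}^{n_1} I\big(\epsilon_i \ge F_\epsilon^{-1}(1-\alpha)\big) \\ & + \frac{1}{\beta}\, n_2^{-1/2} \sum_{i=1}^{n_2} \Big\{\delta_i I(\delta_i \ge \eta) - \int_{\eta}^{\infty} \delta f_\delta(\delta)\, d\delta\Big\} + o_p(1), \end{aligned}$$
where
$$b_1 = -\frac{1}{\beta}\, \eta f_\delta(\eta) \sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(\alpha)\big), \qquad b_2 = -\frac{2}{\beta}\, \eta f_\delta(\eta) \sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(1-\alpha)\big).$$
(b) $\sqrt{n_2}(\hat\lambda - \mu_\lambda)$ converges in distribution to $N(0, \sigma_\lambda^2)$, where
$$\begin{aligned} \sigma_\lambda^2 = \sigma^2(b_1, b_2, v) ={}& \alpha(1-\alpha)\big((1-\alpha)b_1 - \alpha b_2\big)^2 + 2(1-2\alpha)\alpha^3 (b_1 + b_2)^2 \\ & + \alpha(1-\alpha)\big(\alpha b_1 - (1-\alpha) b_2\big)^2 + v \end{aligned}$$
with
$$v = \frac{1}{\beta^2} \Big[\int_{\eta}^{\infty} \delta^2 f_\delta(\delta)\, d\delta - \Big(\int_{\eta}^{\infty} \delta f_\delta(\delta)\, d\delta\Big)^2\Big].$$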
3 Outlier Mean Based Hypothesis Testing
The basic idea behind the use of the outlier mean or outlier sum in gene expression analysis is to see whether the disease group subjects and the normal group subjects are similar in some sense. Asymptotic normality of the outlier mean allows us to develop tests for hypotheses dealing with all combinations of the asymptotic mean $\mu_\lambda$ and the asymptotic standard deviation $\sigma_\lambda$. However, these tests cannot be introduced without knowing the asymptotic properties of the outlier mean when the distributions of the two groups of subjects are assumed to be identical, that is,
$$H_0 : F_Y = F_X. \tag{3.1}$$
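Under $H_0$, model (2.2) may be reformulated as the following model,
$$X_i = \mu_X + \epsilon_i, \quad i = 1, \ldots, n_1 + n_2, \tag{3.2}$$
where $X_i$, $i = 1, \ldots, n_1$, belong to the normal group, $X_i$, $i = n_1 + 1, \ldots, n_1 + n_2$, belong to the disease group, and the $\epsilon_i$'s are independent and identically distributed random variables with distribution $F_\epsilon$ as defined above. Hence, when $H_0$ is true, the sample outlier mean of (2.1) may be reformulated as
$$\hat\lambda = \frac{\sum_{i=n_1+1}^{n_1+n_2} X_i\, I\big(X_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}{\sum_{i=n_1+1}^{n_1+n_2} I\big(X_i \ge 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)\big)}, \tag{3.3}$$
where the quantile estimates $\hat F_X^{-1}(\alpha)$ and $\hat F_X^{-1}(1-\alpha)$ are constructed from the sample $X_1, \ldots, X_{n_1}$. The outlier mean of (3.3) estimates the following parameter:
$$\mu_{\lambda X} = \frac{E\big[X\, I\big(X \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big)\big]}{P\big\{X \ge 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)\big\}},$$
which, in terms of the error distribution, is
$$\mu_{\lambda X} = \mu_X + \frac{\int_{\eta_X}^{\infty} \epsilon f_\epsilon(\epsilon)\, d\epsilon}{\beta_X},$$
where $\beta_X = P\{\epsilon \ge \eta_X\}$ with $\eta_X = 2F_\epsilon^{-1}(1-\alpha) - F_\epsilon^{-1}(\alpha)$.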
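The following theorem states the asymptotic property of the outlier mean when the observations are drawn from model (3.2).
Theorem 3.1. When $H_0$ is true, $\sqrt{n_2}(\hat\lambda - \mu_{\lambda X})$ converges in distribution to a normal random variable with distribution $N(0, \sigma^2_{\lambda X})$, where
$$\begin{aligned} \sigma^2_{\lambda X} = \sigma^2(b_{1X}, b_{2X}, v_X) ={}& \alpha(1-\alpha)\big((1-\alpha)b_{1X} - \alpha b_{2X}\big)^2 + 2(1-2\alpha)\alpha^3 (b_{1X} + b_{2X})^2 \\ & + \alpha(1-\alpha)\big(\alpha b_{1X} - (1-\alpha) b_{2X}\big)^2 + v_X \end{aligned}$$
and we denote
$$b_{1X} = -\frac{1}{\beta_X}\, \eta_X f_\epsilon(\eta_X) \sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(\alpha)\big), \qquad b_{2X} = -\frac{2}{\beta_X}\, \eta_X f_\epsilon(\eta_X) \sqrt{h}\, f_\epsilon^{-1}\big(F_\epsilon^{-1}(1-\alpha)\big),$$
$$v_X = \frac{1}{\beta_X^2} \Big[\int_{\eta_X}^{\infty} \epsilon^2 f_\epsilon(\epsilon)\, d\epsilon - \Big(\int_{\eta_X}^{\infty} \epsilon f_\epsilon(\epsilon)\, d\epsilon\Big)^2\Big].$$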
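To make the variance formula concrete, the sketch below evaluates $\sigma^2_{\lambda X} = \sigma^2(b_{1X}, b_{2X}, v_X)$ for standard normal errors. It relies on three assumptions that are not stated explicitly in the text: $f_\epsilon^{-1}(\cdot)$ is read as the reciprocal $1/f_\epsilon(\cdot)$, the limit $h$ is set to 1 (equal group sizes), and the tail integrals use the standard normal closed forms $\int_\eta^\infty e\,\phi(e)\,de = \phi(\eta)$ and $\int_\eta^\infty e^2\phi(e)\,de = \eta\phi(\eta) + 1 - \Phi(\eta)$. Under these assumptions the values agree with the $\sigma^2_{\lambda X}$ column of Table 2 (about 3.1485 for α = 0.45 and 32.379 for α = 0.35).

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal errors, i.e. F_X = N(0, 1)


def sigma2(b1, b2, v, alpha):
    """The function sigma^2(b1, b2, v) of Theorem 2.3(b)."""
    return (alpha * (1 - alpha) * ((1 - alpha) * b1 - alpha * b2) ** 2
            + 2 * (1 - 2 * alpha) * alpha ** 3 * (b1 + b2) ** 2
            + alpha * (1 - alpha) * (alpha * b1 - (1 - alpha) * b2) ** 2
            + v)


def null_variance_normal(alpha, h=1.0):
    """sigma^2_{lambda X} for N(0, 1) errors; h = lim n2/n1 (h = 1 is an assumption)."""
    q_lo, q_hi = STD.inv_cdf(alpha), STD.inv_cdf(1 - alpha)
    eta = 2 * q_hi - q_lo              # eta_X, the cutoff on the error scale
    beta = 1 - STD.cdf(eta)            # beta_X = P(eps >= eta_X)
    phi_eta = STD.pdf(eta)
    # f^{-1}(.) is interpreted as the reciprocal of the density (assumption)
    b1 = -(1 / beta) * eta * phi_eta * sqrt(h) / STD.pdf(q_lo)
    b2 = -(2 / beta) * eta * phi_eta * sqrt(h) / STD.pdf(q_hi)
    first_moment = phi_eta                   # int_eta^inf e * phi(e) de
    second_moment = eta * phi_eta + beta     # int_eta^inf e^2 * phi(e) de
    v = (second_moment - first_moment ** 2) / beta ** 2
    return sigma2(b1, b2, v, alpha)


for alpha in (0.45, 0.35):
    print(alpha, round(null_variance_normal(alpha), 4))
```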
Theorem 3.1 indicates that when $H_0$ is true, $\sqrt{n_2}\,(\hat\lambda - \mu_{\lambda X})/\sigma_{\lambda X}$ converges to the standard normal distribution, and the distribution parameters involved when $H_0$ is true are $\mu_{\lambda X}$ and $\sigma_{\lambda X}$. Combining Theorems 2.3 and 3.1, we have three choices for constructing test functions:
$$\sqrt{n_2}\, \frac{\hat\lambda - \mu_{\lambda X}}{\sigma_{\lambda X}}, \qquad \sqrt{n_2}\, \frac{\hat\lambda - \mu_{\lambda X}}{\sigma_{\lambda}}, \qquad \text{and} \qquad \sqrt{n_2}\, \frac{\hat\lambda - \mu_{\lambda}}{\sigma_{\lambda X}}. \tag{3.4}$$
The first function tests a hypothesis involving both the asymptotic mean and the standard deviation, while the others consider only one of these two parameters. Once appropriate estimates of the unknown parameters are available, test statistics are obtained.
Not all test functions are of interest in gene expression analysis, since Tomlins et al. (2005) observed that when outliers occur in disease samples, they are either only over-expressed or only down-expressed. Hence, a test function that does not consider a location shift is not practical in gene expression analysis. The following procedures are designed for the first two test functions:
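(I) Hypothesis for equality of distributions: $H_{\mu,\sigma} : \mu_\lambda = \mu_{\lambda X},\ \sigma^2_\lambda = \sigma^2_{\lambda X}$
(a) The rule for testing $H_{\mu,\sigma}$ is: reject $H_{\mu,\sigma}$ if
$$\sqrt{n_2}\, \frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda X}} \ge z_{\alpha^*}, \tag{3.5}$$
where $\hat\mu_{\lambda X}$ and $\hat\sigma_{\lambda X}$ are, respectively, estimators of the parameters $\mu_{\lambda X}$ and $\sigma_{\lambda X}$.
(b) An approximate p value based on the observations $x_i$'s and $y_i$'s is defined as
$$p = \int_{\sqrt{n_2}\,(\hat\lambda - \hat\mu_{\lambda X})/\hat\sigma_{\lambda X}}^{\infty} \phi(z)\, dz.$$
(II) Hypothesis for equality of population outlier means: $H_{\mu} : \mu_\lambda = \mu_{\lambda X}$
(a) The rule for testing $H_\mu$ is: reject $H_\mu$ if
$$\sqrt{n_2}\, \frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda}} \ge z_{\alpha^*}, \tag{3.6}$$
where $\hat\sigma_\lambda$ is an estimator of the parameter $\sigma_\lambda$ when $Y$ has distribution $F_Y$.
(b) An approximate p value based on the observations $x_i$'s and $y_i$'s is defined as
$$p = \int_{\sqrt{n_2}\,(\hat\lambda - \hat\mu_{\lambda X})/\hat\sigma_{\lambda}}^{\infty} \phi(z)\, dz.$$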
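Both procedures share the same computation once the estimates are in hand: form the standardized statistic, compare it with $z_{\alpha^*}$, and take the upper normal tail as the p value. The following minimal sketch shows this step. The function names and the numerical inputs are hypothetical, and the scale estimate passed in is $\hat\sigma_{\lambda X}$ for rule (3.5) or $\hat\sigma_\lambda$ for rule (3.6).

```python
from math import sqrt
from statistics import NormalDist


def outlier_mean_p_value(lam_hat, mu_hat, sigma_hat, n2):
    """Return the statistic sqrt(n2)*(lam_hat - mu_hat)/sigma_hat and its upper-tail p value."""
    stat = sqrt(n2) * (lam_hat - mu_hat) / sigma_hat
    return stat, 1 - NormalDist().cdf(stat)


def reject(stat, level=0.05):
    """Rules (3.5) and (3.6): reject when the statistic reaches z_{alpha*}."""
    return stat >= NormalDist().inv_cdf(1 - level)


# Hypothetical estimates, for illustration only
stat, p = outlier_mean_p_value(lam_hat=2.0, mu_hat=1.0, sigma_hat=5.0, n2=25)
print(stat, round(p, 4), reject(stat))
```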
The choice between the tests now relies on (i) power performance and (ii) the choice of parameter estimates, which are studied in subsequent sections.
4 Comparison of Outlier Coverages and Asymptotic Variances of Outlier Mean
Tests (3.5) and (3.6) use the same critical point $z_{\alpha^*}$ and the same estimate $\hat\mu_{\lambda X}$ for the outlier variable's expectation. When choosing between hypotheses $H_{\mu,\sigma}$ and $H_\mu$, the right choice is the one with the smaller asymptotic variance ($\sigma^2_{\lambda X}$ or $\sigma^2_\lambda$). We will see that the size of this asymptotic variance is related to the outlier coverage $\beta$. We compute the outlier coverage probabilities $\beta_X$ and $\beta$ and the asymptotic variances $\sigma^2_{\lambda X}$ and $\sigma^2_\lambda$ under the following distribution setting:
$$F_X = N(0, 1) \quad \text{and} \quad F_Y = N(\theta, 1). \tag{4.1}$$
Table 2. Coverage probabilities and asymptotic variances when there is a distributional shift
θ     α      β_X       β        σ²_λX       σ²_λ
1     0.45   0.3531    0.7333   3.1485      1.6203
      0.35   0.1238    0.4380   32.379      5.4481
      0.25   0.0215    0.1530   369.41      48.11
      0.15   0.0009    0.0174   13068.76    877.44
      0.05   4.0e-7    4.2e-5   6.5e+7      6.5e+5
3     0.45   0.3531    0.9956   3.1485      1.0123
      0.35   0.1238    0.9674   32.379      1.2644
      0.25   0.0215    0.8356   369.41      3.2525
      0.15   0.0009    0.4565   13068.76    18.63
      0.05   4.0e-7    0.0265   6.5e+7      1416.67
10    0.45   0.3531    1        3.1485      1
      0.35   0.1238    1        32.379      1
      0.25   0.0215    1        369.41      1
      0.15   0.0009    1        13068.76    1
      0.05   4.0e-7    1        6.5e+7      1
We draw several comments from Table 2:
1. We see that $\beta_X < \beta$ for all cases of θ and α. This indicates that the outlier interval $[2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha), \infty)$ covers more of the probability mass of $Y$ than of $X$. The difference can be huge. For example, when θ = 10, the $\beta_X$'s range from 0.3531 down to 4.0e-7, while β is 1 or nearly 1, indicating that the outlier interval contains almost the whole probable space of the variable $Y$.
2. The differences in coverage probabilities strongly affect the asymptotic variances: $\sigma^2_{\lambda X} > \sigma^2_\lambda$ for all cases of θ and α, and the asymptotic variance under hypothesis $H_{\mu,\sigma}$ can be hundreds or thousands of times that under hypothesis $H_\mu$.
3. When θ = 10, the asymptotic variances under the hypothesis concerning the population outlier mean are nearly equal to 1. This indicates that the asymptotic variance under this hypothesis is the variance of the random variable $Y$.
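The coverage probabilities in Table 2 follow directly from the normal distribution function, since the cutoff is $2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$ with $F_X = N(0,1)$. The following minimal sketch, with hypothetical function names, computes $\beta_X$ (θ = 0) and β (θ > 0) under setting (4.1).

```python
from statistics import NormalDist

STD = NormalDist()  # F_X = N(0, 1)


def cutoff(alpha):
    """Population cutoff 2*F_X^{-1}(1 - alpha) - F_X^{-1}(alpha) for F_X = N(0, 1)."""
    return 2 * STD.inv_cdf(1 - alpha) - STD.inv_cdf(alpha)


def coverage(alpha, theta=0.0):
    """P(Y >= cutoff) for Y ~ N(theta, 1); theta = 0 gives beta_X."""
    return 1 - NormalDist(mu=theta, sigma=1.0).cdf(cutoff(alpha))


for alpha in (0.45, 0.35, 0.25):
    print(alpha, round(coverage(alpha), 4), round(coverage(alpha, theta=1.0), 4))
```

For α = 0.45 this gives $\beta_X \approx 0.3531$ and, with θ = 1, $\beta \approx 0.7333$, matching the first row of Table 2.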
We may be more interested in the comparison under the following contaminated alternative:
$$F_X = N(0, 1) \quad \text{and} \quad F_Y = (1-\gamma)N(0, 1) + \gamma N(\theta, 1), \tag{4.2}$$
where θ > 0. This alternative hypothesis assumes that $Y$ follows a location model with positive mean γθ and a contaminated error variable. Table 3 displays the resulting coverage probabilities and asymptotic variances.
Table 3. Coverage probabilities and asymptotic variances when there is a small proportion (γ = 0.1) of distributional shift
θ     α      β_X       β        σ²_λX       σ²_λ
1     0.45   0.3531    0.3911   3.1485      3.0160
      0.35   0.1238    0.1552   32.379      25.0101
      0.25   0.0215    0.0346   369.41      239.5184
      0.15   0.0009    0.0025   13068.76    5095.878
      0.05   4.0e-7    4.5e-6   6.5e+7      5.9e+6
3     0.45   0.3531    0.4173   3.1485      5.9860
      0.35   0.1238    0.2082   32.379      27.118
      0.25   0.0215    0.1029   369.41      97.7885
      0.15   0.0009    0.0465   13068.76    359.8992
      0.05   4.0e-7    0.0003   6.5e+7      12821.53
10    0.45   0.3531    0.4178   3.1485      50.6809
      0.35   0.1238    0.2115   32.379      201.8219
      0.25   0.0215    0.1194   369.41      640.0708
      0.15   0.0009    0.1008   13068.76    895.2527
      0.05   4.0e-7    0.1000   6.5e+7      909.9943
We have several comments for interpreting the results in Table 3:
1. Setting $F_Y$ as the contaminated normal distribution of (4.2) indicates that the response variable for a disease gene has a large proportion of observations from the distribution $F_X$ but a small part of the observations shifted to the right. The variance of the contaminated distribution is $1 + \gamma(1-\gamma)\theta^2$. Both the contamination and the variance enlargement affect the coverage probability β, which is smaller than in Table 2. This results in an outlier mean asymptotic variance $\sigma^2_\lambda$ that is larger than in Table 2.
2. For mild shifts (θ = 1 or 3), the test for hypothesis $H_\mu$ has asymptotic variances $\sigma^2_\lambda$ smaller than those for hypothesis $H_{\mu,\sigma}$ in all cases except (θ, α) = (3, 0.45). When there is a significant shift θ = 10, the $\sigma^2_{\lambda X}$'s are smaller than the $\sigma^2_\lambda$'s for α ∈ {0.25, 0.35, 0.45}.
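Under the contaminated alternative (4.2), the coverage probability is a mixture of the coverages of the two components, $\beta = (1-\gamma)\beta_X + \gamma\,P\{N(\theta,1) \ge \text{cutoff}\}$, and the variance of $Y$ is $1 + \gamma(1-\gamma)\theta^2$. The sketch below evaluates both quantities. The function names are hypothetical and the code only illustrates the formulas stated in the text.

```python
from statistics import NormalDist

STD = NormalDist()  # F_X = N(0, 1)


def mixture_coverage(alpha, theta, gamma):
    """beta = P(Y >= cutoff) for Y ~ (1 - gamma) N(0, 1) + gamma N(theta, 1)."""
    c = 2 * STD.inv_cdf(1 - alpha) - STD.inv_cdf(alpha)  # cutoff built from F_X
    beta_x = 1 - STD.cdf(c)                        # coverage of the N(0, 1) part
    beta_shift = 1 - NormalDist(mu=theta).cdf(c)   # coverage of the shifted part
    return (1 - gamma) * beta_x + gamma * beta_shift


def mixture_variance(theta, gamma):
    """Variance of the contaminated normal distribution in (4.2)."""
    return 1 + gamma * (1 - gamma) * theta ** 2


print(round(mixture_coverage(0.45, 1.0, 0.1), 4))   # Table 3 reports 0.3911
print(round(mixture_coverage(0.05, 10.0, 0.1), 4))  # Table 3 reports 0.1000
print(mixture_variance(3.0, 0.1))
```

For a very large shift such as θ = 10, the shifted component lies almost entirely above the cutoff, so β stays close to γ = 0.1 however small α is, which is the pattern seen in the last rows of Table 3.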
We now consider the case where the random variable $Y$ has a mixed distribution with a shift not only in the mean but also in the variance, as follows:
$$F_X = N(0, 1) \quad \text{and} \quad F_Y = 0.9N(0, 1) + 0.1N(\theta, \sigma^2),$$
where θ > 0. For cutting percentage α and true values θ and σ, we compute the asymptotic variances and display the comparison in Table 4.
Table 4. Comparison of asymptotic variances when there is a small proportion (γ = 0.1) of distributional shift
σ     θ     α with σ²_λ < σ²_λX             α with σ²_λ > σ²_λX
1     1     0.45, 0.35, 0.25, 0.15, 0.05    none
      3     0.35, 0.25, 0.15, 0.05          0.45
      10    0.15, 0.05                      0.45, 0.35, 0.25
3     1     0.25, 0.15, 0.05                0.45, 0.35
      3     0.25, 0.15, 0.05                0.45, 0.35
      10    0.15, 0.05                      0.45, 0.35, 0.25
5     1     0.15, 0.05                      0.45, 0.35, 0.25
      3     0.15, 0.05                      0.45, 0.35, 0.25
      10    0.15, 0.05                      0.45, 0.35, 0.25
10    1     0.15, 0.05                      0.45, 0.35, 0.25
      3     0.15, 0.05                      0.45, 0.35, 0.25
      10    0.15, 0.05                      0.45, 0.35, 0.25
In this case, where both the mean and the variance of the contaminating component are shifted, $\sigma^2_\lambda < \sigma^2_{\lambda X}$ for most of the smaller α ∈ {0.05, 0.15} and $\sigma^2_\lambda > \sigma^2_{\lambda X}$ for most of the larger α ∈ {0.25, 0.35, 0.45}. This provides a guide for choosing the hypothesis to test once the percentage α has been decided.
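5 Power Studies with Tests Based on Outlier Mean
Consider the power function for testing the equal distributions hypothesis $H_{\mu,\sigma}$. Letting $\mu_{\lambda Y}$ and $\sigma_{\lambda Y}$ denote, respectively, the parameters $\mu_\lambda$ and $\sigma_\lambda$ when $Y \sim F_Y$ is true, an approximate power at significance level $\alpha^*$ based on test (3.5) may be derived as follows:
$$\begin{aligned} \ell_{H_{\mu,\sigma}} &= P_{F_Y}\Big\{\sqrt{n_2}\, \frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda X}} \ge z_{\alpha^*}\Big\} \\ &= P_{F_Y}\Big\{\sqrt{n_2}\, \frac{\hat\lambda - \mu_{\lambda Y}}{\sigma_{\lambda Y}} \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda X} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\} \\ &\approx P\Big\{Z \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda X} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\}. \end{aligned} \tag{5.1}$$
This is the power function when we test the hypothesis of equal distributions. On the other hand, the power function for testing the equal outlier means hypothesis $H_\mu$ at significance level $\alpha^*$ may be derived as follows:
$$\begin{aligned} \ell_{H_{\mu}} &= P_{F_Y}\Big\{\sqrt{n_2}\, \frac{\hat\lambda - \hat\mu_{\lambda X}}{\hat\sigma_{\lambda Y}} \ge z_{\alpha^*}\Big\} \\ &= P_{F_Y}\Big\{\sqrt{n_2}\, \frac{\hat\lambda - \mu_{\lambda Y}}{\sigma_{\lambda Y}} \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda Y} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\} \\ &\approx P\Big\{Z \ge \frac{z_{\alpha^*}\hat\sigma_{\lambda Y} + \sqrt{n_2}(\hat\mu_{\lambda X} - \mu_{\lambda Y})}{\sigma_{\lambda Y}}\Big\}. \end{aligned} \tag{5.2}$$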
From (5.1) and (5.2), the performance of these two tests relies on several elements, described as follows:
$n_2$: the larger the sample size for the disease gene, the larger the powers. This is due to the fact that $\hat\mu_{\lambda X} < \mu_{\lambda Y}$ when there are outliers in $Y$.
$\sigma^2_{\lambda X}$: the larger this asymptotic variance, the smaller the power for testing hypothesis $H_{\mu,\sigma}$.
$\sigma^2_{\lambda Y}$: the larger this asymptotic variance, the smaller the power for testing hypothesis $H_\mu$.
We also note that when the cutoff point percentage α decreases, the outlier mean asymptotic variances $\sigma^2_{\lambda X}$ and $\sigma^2_{\lambda Y}$ both increase.
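The approximations (5.1) and (5.2) are easy to evaluate once the parameters are known. The sketch below does so for the normal shift design (4.1) with α = 0.45, θ = 1 and $n_2 = 30$, under several assumptions: the estimates $\hat\mu_{\lambda X}$, $\hat\sigma_{\lambda X}$ and $\hat\sigma_{\lambda Y}$ are replaced by their population values, the level is taken as $\alpha^* = 0.05$ (not stated explicitly in the text), the asymptotic variances 3.1485 and 1.6203 are read from Table 2, and the outlier means use the standard normal identity $\int_\eta^\infty z\phi(z)\,dz = \phi(\eta)$. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def tail_mean_normal(alpha, theta):
    """Population outlier mean for Y ~ N(theta, 1) with cutoff built from F_X = N(0, 1)."""
    c = 2 * STD.inv_cdf(1 - alpha) - STD.inv_cdf(alpha)
    eta = c - theta                        # cutoff on the error scale
    beta = 1 - STD.cdf(eta)                # outlier coverage
    return theta + STD.pdf(eta) / beta     # uses int_eta^inf z*phi(z) dz = phi(eta)


def approx_power(n2, mu_x, mu_y, sd_in_test, sd_y, level=0.05):
    """Right-hand side of (5.1)/(5.2); sd_in_test is the scale used by the test."""
    z = STD.inv_cdf(1 - level)
    threshold = (z * sd_in_test + sqrt(n2) * (mu_x - mu_y)) / sd_y
    return 1 - STD.cdf(threshold)


alpha, theta, n2 = 0.45, 1.0, 30
mu_x = tail_mean_normal(alpha, 0.0)        # mu_{lambda X}
mu_y = tail_mean_normal(alpha, theta)      # mu_{lambda Y}
var_x, var_y = 3.1485, 1.6203              # asymptotic variances taken from Table 2

power_mu_sigma = approx_power(n2, mu_x, mu_y, sqrt(var_x), sqrt(var_y))  # (5.1)
power_mu = approx_power(n2, mu_x, mu_y, sqrt(var_y), sqrt(var_y))        # (5.2)
print(round(power_mu_sigma, 4), round(power_mu, 4))
```

Under these assumptions the two values come out close to the pair (0.2775, 0.5230) reported in Table 5 for this configuration. The only difference between the two calls is the scale multiplying $z_{\alpha^*}$, which is how a larger $\sigma_{\lambda X}$ lowers the power of the test for $H_{\mu,\sigma}$.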
We now consider the design of distributional shift of (4.1) and compute the approximate powers for testing hypotheses $H_{\mu,\sigma}$ and $H_\mu$. The results are displayed in Table 5.
Table 5. Approximate powers ($\ell_{H_{\mu,\sigma}}$, $\ell_{H_\mu}$) of outlier mean when there is a distributional shift
n2    α      θ = 1                θ = 3               θ = 5          θ = 10
30    0.45   (0.2775, 0.5230)     (1, 1)              (1, 1)         (1, 1)
      0.35   (3.0e-4, 0.1441)     (0.0826, 1)         (1, 1)         (1, 1)
      0.25   (4.5e-6, 0.0634)     (0, 0.8634)         (0, 1)         (1, 1)
      0.15   (1.2e-10, 0.0517)    (0, 0.1513)         (0, 1)         (0, 1)
      0.05   (0, 0.0500)          (0, 0.0530)         (0, 0.1544)    (0, 1)
50    0.45   (0.4622, 0.7099)     (1, 1)              (1, 1)         (1, 1)
      0.35   (5.6e-4, 0.1861)     (0.7358, 1)         (1, 1)         (1, 1)
      0.25   (5.3e-6, 0.0679)     (0, 0.9708)         (0, 1)         (1, 1)
      0.15   (1.3e-10, 0.0521)    (0, 0.1971)         (0, 1)         (0, 1)
      0.05   (0, 0.0500)          (0, 0.0538)         (0, 0.2018)    (0, 1)
We draw the following comments from the results in Table 5:
1. The tests for hypotheses $H_{\mu,\sigma}$ and $H_\mu$ have small powers for mild shifts θ = 1, 3 unless we choose large proportions α (α = 0.35 and 0.45). If there is a significant shift in location (θ = 10), most of these tests are satisfactory. The percentage α = 0.25 is the one popularly recommended in the literature (see Hoaglin et al. (1983)).
2. A comparison of the approximate powers of the two tests shows that the test for hypothesis $H_\mu$ seems to be the right choice. The test for hypothesis $H_{\mu,\sigma}$ gives unsatisfactory powers except in cases of strong distributional shift, such as θ = 5 or 10, with the percentage α chosen as large as 0.35 or 0.45.
3. Sample size affects both tests. Basically, a larger sample size generates larger power for either test.
Next, we consider the case where $Y$ has the contaminated normal distribution of (4.2), that is,
$$F_X = N(0, 1) \quad \text{and} \quad F_Y = 0.9N(0, 1) + 0.1N(\theta, 1),$$
where θ > 0. The computed approximate powers for testing hypotheses $H_{\mu,\sigma}$ and $H_\mu$ are displayed in Table 6.
Table 6. Approximate powers ($\ell_{H_{\mu,\sigma}}$, $\ell_{H_\mu}$) of outlier mean when there is a small fraction of distributional shift
n2    α      θ = 1               θ = 3               θ = 5               θ = 10
30    0.45   (0.0740, 0.0791)    (0.4420, 0.2750)    (0.7307, 0.4072)    (0.8921, 0.5011)
      0.35   (0.0363, 0.0584)    (0.1353, 0.1713)    (0.4635, 0.3136)    (0.8060, 0.4512)
      0.25   (0.0217, 0.0525)    (0.0026, 0.1077)    (0.0669, 0.2326)    (0.5517, 0.3951)
      0.15   (0.0043, 0.0505)    (0.0000, 0.0658)    (0.0000, 0.1444)    (1.9e-7, 0.3285)
      0.05   (2.1e-8, 0.0500)    (0.0000, 0.0510)    (0.0000, 0.0638)    (0.0000, 0.2238)
50    0.45   (0.0840, 0.0897)    (0.5631, 0.3847)    (0.8474, 0.5698)    (0.9570, 0.6852)
      0.35   (0.0382, 0.0611)    (0.1843, 0.2277)    (0.5970, 0.4411)    (0.9043, 0.6256)
      0.25   (0.0221, 0.0532)    (0.0038, 0.1311)    (0.1087, 0.3213)    (0.7023, 0.5537)
      0.15   (0.0043, 0.0506)    (0.0000, 0.0711)    (0.0000, 0.1866)    (1.1e-6, 0.4623)
      0.05   (2.1e-8, 0.0500)    (0.0000, 0.0512)    (0.0000, 0.0684)    (0.0000, 0.3079)
We draw the following comments from the results in Table 6:
1. Basically, the contaminated distribution $F_Y$ reduces the powers of the two tests, because the contamination and the increased variance of the distribution $F_Y$ enlarge the outlier mean asymptotic variances $\sigma^2_{\lambda X}$ and $\sigma^2_\lambda$.
2. If we specify the cutoff point percentage α to be 0.35 or more, the test for hypothesis $H_{\mu,\sigma}$ seems to be the right choice. On the other hand, if we specify the cutoff point percentage α to be smaller than 0.25, the test for hypothesis $H_\mu$ seems to be the right choice. For α = 0.25, the test for hypothesis $H_\mu$ is better unless the location parameter θ in (4.2) is as large as 10.
Table 7. Approximate powers ($\ell_{H_{\mu,\sigma}}$, $\ell_{H_\mu}$) of outlier mean when there is a small fraction of distributional shift ($n_2 = 50$)
γ       α      θ = 5                θ = 10               θ = 20
0.05    0.45   (0.5902, 0.3196)     (0.8241, 0.4207)     (0.9008, 0.4606)
        0.35   (0.3782, 0.2445)     (0.7326, 0.3746)     (0.8706, 0.4383)
        0.25   (0.1323, 0.1961)     (0.5828, 0.3339)     (0.8200, 0.4130)
        0.15   (2.4e-15, 0.1301)    (0.0005, 0.2816)     (0.1986, 0.3824)
        0.05   (0.0000, 0.0632)     (0.0000, 0.1955)     (0.0000, 0.3301)
0.20    0.45   (0.9818, 0.8759)     (0.9977, 0.9385)     (0.9992, 0.9550)
        0.35   (0.8295, 0.7529)     (0.9881, 0.9064)     (0.9981, 0.9455)
        0.25   (0.0614, 0.5591)     (0.8336, 0.8492)     (0.9879, 0.9289)
        0.15   (0.0000, 0.3026)     (8.6e-13, 0.7516)    (0.0376, 0.9011)
        0.05   (0.0000, 0.0758)     (0.0000, 0.5274)     (0.0000, 0.8367)
0.30    0.45   (0.9985, 0.9785)     (1.0000, 0.9939)     (1.0000, 0.9965)
        0.35   (0.9338, 0.9227)     (0.9990, 0.9871)     (0.9999, 0.9952)
        0.25   (0.0302, 0.7601)     (0.9101, 0.9687)     (0.9986, 0.9924)
        0.15   (0.0000, 0.4305)     (0.0000, 0.9187)     (0.0102, 0.9859)
        0.05   (0.0000, 0.0825)     (0.0000, 0.7246)     (0.0000, 0.9635)
0.50    0.45   (1.0000, 0.9968)     (1.0000, 1.0000)     (1.0000, 1.0000)
        0.35   (0.9952, 0.9111)     (1.0000, 1.0000)     (1.0000, 1.0000)
        0.25   (0.0038, 0.4660)     (0.9836, 0.9999)     (1.0000, 1.0000)
        0.15   (0.0000, 0.1104)     (0.0000, 0.9986)     (0.0002, 1.0000)
        0.05   (0.0000, 0.0525)     (0.0000, 0.9616)     (0.0000, 0.9998)
We have several comments on the results in Table 7:
1. Although a larger contamination proportion (γ) enlarges the variance of the response variable $Y$, it makes the existence of outliers easier to detect, so the powers of the two tests increase. The powers for γ = 0.5 are very close to the performance under the location shift in Table 5.
2. Even with large contamination (γ = 0.5), the test for hypothesis $H_{\mu,\sigma}$ with low α's (α = 0.05 and 0.15) is still very poor in power performance.
3. Combining the discussions for the results in Tables 5-7, the test for
Besides the cases of normal or mixed normal distributions, we may consider the cases where $X$ and $Y$ are drawn from the following two settings:
Case 1: $F_X = \text{Laplace}(0, 1)$ and $F_Y = \text{Laplace}(\theta, 1)$
Case 2: $F_X = t(5)$ and $F_Y = t(5) + \theta$.
Table 8. Approximate powers ($\ell_{H_{\mu,\sigma}}$, $\ell_{H_\mu}$) for hypotheses $H_{\mu,\sigma}$ and $H_\mu$ when $X$ and $Y$ have Laplace or t distributions ($n_2 = 30$)
8.(a) Y ∼ Laplace(θ, 1)
α      θ = 1               θ = 3               θ = 5          θ = 10
0.45   (0.0193, 0.2899)    (1.0000, 1.0000)    (1, 1)         (1, 1)
0.35   (5.2e-7, 0.0500)    (0.0134, 0.9997)    (1, 1)         (1, 1)
0.25   (7.1e-7, 0.0500)    (0, 0.4908)         (0, 1)         (1, 1)
0.15   (6.7e-4, 0.0500)    (0, 0.0500)         (0, 0.8694)    (0, 1)
8.(b) Y ∼ t(5) + θ
α      θ = 1               θ = 3               θ = 5          θ = 10
0.45   (0.0465, 0.4128)    (1, 1)              (1, 1)         (1, 1)
0.35   (8.2e-8, 0.0777)    (0.0002, 0.9999)    (1, 1)         (1, 1)
0.25   (4.0e-6, 0.0433)    (0, 0.5184)         (0, 1)         (0.9838, 1)
0.15   (0.0001, 0.0460)    (0, 0.0167)         (0, 0.8794)    (0, 1)
From the displayed results, both tests appear quite satisfactory when there are significant location shifts. However, the test for hypothesis $H_\mu$ is uniformly better than that for hypothesis $H_{\mu,\sigma}$. The test for hypothesis $H_\mu$ is very satisfactory even for small percentages α when the location shift is as large as 5 or more.
We now consider the case where the random variable $Y$ has a mixed distribution with a shift not only in the mean but also in the variance, as follows:
$$F_X = N(0, 1) \quad \text{and} \quad F_Y = 0.9N(0, 1) + 0.1N(\theta, \sigma^2),$$
where θ > 0. For sample size $n_2 = 30$, cutting percentage α and true values θ and σ, we compute the approximate powers $\ell_{H_{\mu,\sigma}}$ and $\ell_{H_\mu}$. With α = 0.05, 0.15, 0.25, 0.35, 0.45 and θ = 1, 3, 10, we display a comparison of the two approximate powers in the following table.
Table 9. Comparison of approximate powers
σ     θ     α with ℓ_Hµ > ℓ_Hµ,σ            α with ℓ_Hµ < ℓ_Hµ,σ
1     1     0.45, 0.35, 0.25, 0.15, 0.05    none
      3     0.35, 0.25, 0.15, 0.05          0.45
      10    0.15, 0.05                      0.45, 0.35, 0.25
3     1     0.25, 0.15, 0.05                0.45, 0.35
      3     0.25, 0.15, 0.05                0.45, 0.35
      10    0.15, 0.05                      0.45, 0.35, 0.25
5     1     0.15, 0.05                      0.45, 0.35, 0.25
      3     0.15, 0.05                      0.45, 0.35, 0.25
      10    0.15, 0.05                      0.45, 0.35, 0.25
10    1     0.15, 0.05                      0.45, 0.35, 0.25
      3     0.15, 0.05                      0.45, 0.35, 0.25
      10    0.15, 0.05                      0.45, 0.35, 0.25
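In general, the test for hypothesis $H_{\mu,\sigma}$ is more powerful than the test for hypothesis $H_\mu$ when we choose the percentage α as large as 0.35 or 0.45, and the test for hypothesis $H_\mu$ is more powerful when we choose the percentage α as small as 0.05 or 0.15.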
6 Appendix
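To prove Theorem 2.3, we establish a more general theory for the outlier mean $\hat\lambda$. The following assumptions are needed.
(A1) The limit $h = \lim_{n_1, n_2 \to \infty} n_2 / n_1$ exists.
(A2) There is a constant $C$ such that $\sqrt{n_1}(\hat C - C) = O_p(1)$, where $C$ depends in general on the distribution of $X$.
(A3) The probability density function $f_\delta$ of the distribution $F_\delta$ is bounded away from zero in a neighborhood of the quantity $C - \mu_Y$.
(A4) The probability density function $f_\epsilon$ is bounded away from zero in a neighborhood of $F_\epsilon^{-1}(\alpha)$ for α ∈ (0, 1).
Proof of Theorem 2.2
Proof. If $a > 0$,
$$\begin{aligned} \lambda^{p}_{aX+b,\,aY+b}(\alpha_1, \alpha_2, \ldots, \alpha_k) &= \frac{E\big[(aY+b)\, I\big(aY+b \ge \sum_{j=1}^{k} c_j F^{-1}_{aX+b}(\alpha_j)\big)\big]}{P\big\{aY+b \ge \sum_{j=1}^{k} c_j F^{-1}_{aX+b}(\alpha_j)\big\}} \\ &= \frac{E\big[(aY+b)\, I\big(aY+b \ge \sum_{j=1}^{k} c_j (aF_X^{-1}(\alpha_j) + b)\big)\big]}{P\big\{aY+b \ge \sum_{j=1}^{k} c_j (aF_X^{-1}(\alpha_j) + b)\big\}} \\ &= \frac{E\big[(aY+b)\, I\big(Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\big)\big]}{P\big\{Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\big\}} \\ &= a\, \frac{E\big[Y\, I\big(Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\big)\big]}{P\big\{Y \ge \sum_{j=1}^{k} c_j F_X^{-1}(\alpha_j)\big\}} + b \\ &= a\lambda^{p}_{X,Y}(\alpha_1, \alpha_2, \ldots, \alpha_k) + b. \end{aligned}$$
If $a < 0$,
$$\begin{aligned} \lambda^{p}_{aX+b,\,aY+b}(\alpha_1, \alpha_2, \ldots, \alpha_k) &= \frac{E\big[(aY+b)\, I\big(aY+b \ge \sum_{j=1}^{k} c_j (aF_X^{-1}(1-\alpha_j) + b)\big)\big]}{P\big\{aY+b \ge \sum_{j=1}^{k} c_j (aF_X^{-1}(1-\alpha_j) + b)\big\}} \\ &= \frac{E\big[(aY+b)\, I\big(Y \le \sum_{j=1}^{k} c_j F_X^{-1}(1-\alpha_j)\big)\big]}{P\big\{Y \le \sum_{j=1}^{k} c_j F_X^{-1}(1-\alpha_j)\big\}} \\ &= a\, \frac{E\big[Y\, I\big(Y \le \sum_{j=1}^{k} c_j F_X^{-1}(1-\alpha_j)\big)\big]}{P\big\{Y \le \sum_{j=1}^{k} c_j F_X^{-1}(1-\alpha_j)\big\}} + b \\ &= a\lambda^{n}_{X,Y}(1-\alpha_1, 1-\alpha_2, \ldots, 1-\alpha_k) + b. \end{aligned}$$
The proof of the transformation property for the outlier mean with negative outliers is similar and is skipped.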
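Proof of Theorem 2.3
Proof. Let $C = 2F_X^{-1}(1-\alpha) - F_X^{-1}(\alpha)$ and $\hat C = 2\hat F_X^{-1}(1-\alpha) - \hat F_X^{-1}(\alpha)$. From model (2.2) and the expression of $\hat\lambda$ in (2.1), we have
$$\hat\lambda = \mu_Y + \frac{\sum_{i=1}^{n_2} \delta_i\, I\big(\delta_i > C - \mu_Y + n_1^{-1/2} T\big)}{\sum_{i=1}^{n_2} I(Y_i > \hat C)}, \quad \text{where } T = \sqrt{n_1}(\hat C - C).$$
This implies that
$$\sqrt{n_2}(\hat\lambda - \mu_Y) = \frac{n_2^{-1/2} \sum_{i=1}^{n_2} \delta_i\, I\big(\delta_i > C - \mu_Y + n_1^{-1/2} T\big)}{n_2^{-1} \sum_{i=1}^{n_2} I(Y_i > \hat C)}. \tag{6.1}$$
With assumption (A4), the key in this proof is that
$$\begin{aligned} & n_2^{-1/2} \sum_{i=1}^{n_2} \delta_i \big[I\big(\delta_i > C - \mu_Y + n_1^{-1/2} T\big) - I(\delta_i > C - \mu_Y)\big] \\ &\quad = -n_2^{-1/2} \sum_{i=1}^{n_2} \delta_i \big[I\big(\delta_i \le C - \mu_Y + n_1^{-1/2} T\big) - I(\delta_i \le C - \mu_Y)\big] \\ &\quad = -(C - \mu_Y) f_\delta(C - \mu_Y) \sqrt{h}\, T + o_p(1), \end{aligned} \tag{6.2}$$
which may be seen in Ruppert and Carroll (1980) and Chen and Chiang (1996).
Part (a) of Theorem 2.3 then follows from assumption (A1), equations (6.1) and (6.2), and the following representation of the empirical quantile (see, for example, Ruppert and Carroll (1980)):
$$\sqrt{n_1}\big(\hat F_\epsilon^{-1}(\alpha) - F_\epsilon^{-1}(\alpha)\big) = f_\epsilon^{-1}\big(F_\epsilon^{-1}(\alpha)\big)\, n_1^{-1/2} \sum_{i=1}^{n_1} \big[\alpha - I\big(\epsilon_i \le F_\epsilon^{-1}(\alpha)\big)\big] + o_p(1). \tag{6.3}$$
The asymptotic distribution in (b) of Theorem 2.3 follows from the Central Limit Theorem.
The proof of Theorem 3.1 is identical to that of Theorem 2.3 with $F_Y = F_X$.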
References
1. Agrawal, D., Chen, T., Irby, R., et al. (2002). Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J. Natl. Cancer Inst., 94, 513-521.
2. Alizadeh, A. A., Eisen, M. B., Davis, R. E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.
3. Chen, L.-A., Chen, D.-T. and Chan, W. (2008). The p value for the outlier sum in differential gene expression analysis. Submitted to Biometrika (in revision).
4. Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics, 7, 171-185.
5. Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1983). Understanding Robust and Exploratory Data Analysis, Wiley: New York.
6. Ohki, R., Yamamoto, K., Ueno, S., et al. (2005). Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int. J. Cardiol., 102, 233-238.
7. Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75, 828-838.
8. Sorlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418-8423.
9. Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics, 8, 2-8.
10. Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science, 310, 644-648.
11. Wu, B. (2007). Cancer outlier differential gene expression detection. Biostatistics, 8, 566-575.