Robust Regression Estimators in Gene Expression Analysis

National Chiao Tung University

Institute of Statistics

Master's Thesis

Student: Yu-Hua Chang

Advisor: Professor Lin-An Chen

A Thesis Submitted to the Institute of Statistics, College of Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Statistics

June 2013

Hsinchu, Taiwan, Republic of China

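Abstract (translated from the Chinese)

For gene expression analysis, discovering influential genes by detecting outliers in samples from the disease group is a very new and important approach. Unfortunately, the outlier least squares estimator developed in the literature for regression models has an influence function that cannot bound the effect of the independent variables. For the linear regression model, we use the asymptotic distributions of the Mallows-type bounded-influence outlier least squares estimator and outlier regression quantile to obtain statistical techniques whose influence functions are bounded in the space of independent variables. Monte Carlo comparisons of mean squared errors show that the bounded-influence estimators are more efficient than the unbounded-influence ones when gross errors occur in the independent-variable space.

Keywords: gene expression analysis, influence function, least squares estimator, linear regression, regression quantile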

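Acknowledgments

Time flies. My two years of master's study are drawing to a close, and with them my life as a student. Having spent both my undergraduate and graduate years in Hsinchu, I will surely miss the people and things I encountered here, and the unforgettable wind as well. First, I sincerely thank my thesis advisor, Professor Lin-An Chen. He has his own effective way of teaching: under his guidance I could always absorb new knowledge in a well-structured way, and he patiently explained the details. I am very fortunate to have been his student in both my undergraduate and graduate studies. I also thank the three members of my oral defense committee, Professors 許文郁, 蕭金福 and 彭南夫, for their suggestions and comments on the parts of my work that needed strengthening. Next, I thank my classmates of the class of ROC year 100 at the Institute of Statistics, National Chiao Tung University. Over these two years of graduate life we got along very well, learning from each other and growing together, as if we had known each other for far longer than two years. Finally, I thank my family: whenever I was tired, there was always a warm home waiting for me, caring for me and looking after me. Thank you, my beloved family. This thesis is dedicated to my family, friends, teachers and everyone who has helped me.

Yu-Hua Chang, Institute of Statistics, National Chiao Tung University, May 2013 (ROC year 102)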

Contents

Abstract (Chinese)
Acknowledgments
Contents
Abstract
Introduction
Mallows Type Bounded Influence Outlier Least Squares Estimator
Monte Carlo Study
Mallows Type Outlier Regression Quantile
Appendix
References

Robust Regression Estimators in Gene Expression Analysis

Abstract

Discovering influential genes through the detection of outliers in samples from disease-group subjects is a new and important approach in gene expression analysis. The outlier least squares estimator for regression models has been studied in the literature; unfortunately, its influence function cannot limit the effect of the independent variables. We present the asymptotic distributions of the Mallows-type bounded-influence outlier least squares estimator and outlier regression quantile for linear regression models, which yield statistical techniques with influence functions bounded in the space of independent variables. Monte Carlo simulations comparing mean squared errors show that the bounded-influence estimators are more efficient than the unbounded-influence ones when gross errors occur in the independent-variable space.

Key words: gene expression analysis; influence function; least squares estimation; linear regression; regression quantile.

1. Introduction

Among the existing techniques for detecting influential genes, common statistical methods for two-group comparisons, such as the t-test, are not appropriate because of the large number of genes and the limited number of subjects available. Tomlins et al. (2005) observed in a study of prostate cancer that influential genes are overexpressed in a small number of disease samples. The problem of constructing statistical procedures based on outlier samples has attracted considerable recent attention. Tibshirani and Hastie (2007) and Wu (2007) suggested using an outlier sum, the sum of all the gene expression values in the disease group that are greater than a specified cutoff point, and Chen, Chen and Chan (2010) considered the distributional theory of the outlier mean. These methods show the desired efficiency for outlier-based tests in the detection of influential genes.


Variation in gene expression may also be driven by one or more predictor variables (independent variables) such as age, cell line type or genotype information (see Jin, Si et al. (2006), Huang and Pan (2003), Rambow, Piton et al. (2008), Muller, Chiou and Leng (2008), Vinciotti and Yu (2009) and Zapala and Schork (2006)). Lai et al. (2013) considered gene expressions for normal-group subjects following the regression model

$$y_{ai} = x_{ai}'\beta_a + \epsilon_i, \quad i = 1, \dots, n_1 \qquad (1.1)$$

and those for disease group subjects with regression model

$$y_{bi} = x_{bi}'\beta_b + \delta_i, \quad i = 1, \dots, n_2. \qquad (1.2)$$

They proposed the outlier least squares estimator (LSE) for influential gene detection and showed that the outlier LSE has an asymptotic representation with influence function of the form

$$M_a x_a \psi_a(\epsilon) + M_b x_b \psi_b(\delta) \qquad (1.3)$$

where $M_a$ and $M_b$ are fixed matrices, $\psi_a$ is a bounded function and $\psi_b$ measures the tail mean of the disease-group error variable $\delta$. This influence function is not bounded in the independent-variable space. Therefore, one can conjecture that in small samples the outlier LSE will not be able to handle outliers in the X space. For a general discussion of influence analysis, see Cook and Weisberg (1982).

In the literature, consideration has been given to the development of estimators of regression parameters that limit the effects of both the error variable and the independent variables. Among them, approaches that simultaneously bound the influence of the design points and the residuals for the linear regression model include Krasker and Welsch (1982) and Krasker (1985). On the other hand, the Mallows-type bounded-influence trimmed mean, which also bounds the influence of the design points and the residuals, was developed by De Jongh and De Wet (1985) and, for the linear regression model, by De Jongh, De Wet and Welsh (1988). Giltinan, Carroll and Ruppert (1986) found these two approaches competitive, in that neither is preferable to the other. They also note that the Mallows-type estimators should theoretically give more stable inference than the Krasker-Welsch approach. This desirable property has been further studied by Chen, Thompson and Chuang (2000). Since bounded-influence estimation has not been studied for outlier-based estimators, our aim is to study the Mallows-type outlier least squares estimator (LSE) and outlier regression quantile for regression gene expression data sets. The asymptotic theory for the outlier LSE in the linear regression model is given in Section 2, and a simulation study for it is given in Section 3. We introduce the statistical theory and a simulation study for the outlier regression quantile in Section 4. Finally, the proofs of the theorems are given in Section 5.

2. Mallows Type Bounded Influence Outlier Least Squares Estimator

For ease of exposition, let us fix one of the thousands of genes for examination. Suppose that there are $n_1$ subjects in the normal control group and $n_2$ subjects in the disease group. We assume that the expressions of this gene for normal-group subjects follow the regression model

$$y_{ai} = x_{ai}'\beta_a + \epsilon_i, \quad i = 1, \dots, n_1 \qquad (2.1)$$

where $x_{ai}$ is a $p$-vector with 1 as its first element and the $\epsilon_i$'s are independent and identically distributed (iid) error variables with distribution function $F_\epsilon$, and that those for disease-group subjects follow the regression model

$$y_{bi} = x_{bi}'\beta_b + \delta_i, \quad i = 1, \dots, n_2 \qquad (2.2)$$

where $x_{bi}$ is a $p$-vector with 1 as its first element and the $\delta_i$'s are iid error variables with distribution function $F_\delta$. Motivated by Tomlins et al. (2005), we need to construct a cutoff from model (2.1) to identify outlier observations in model (2.2) and to develop a statistic based on these outliers as the basis for statistical inference.

For defining the cutoff, we let the sample Mallows-type bounded-influence version of the regression quantile of Koenker and Bassett (1978) be a vector $\hat\beta_{BIa}(\alpha)$ that solves

$$\min_{b \in R^p} \sum_{i=1}^{n_1} w_{ai}\,(y_{ai} - x_{ai}'b)\,(\alpha - I(y_{ai} \le x_{ai}'b))$$

where $w_{ai}$, $i = 1, \dots, n_1$, are weights. The Mallows-type bounded-influence outlier LSE (De Jongh, De Wet and Welsh (1988)) is defined as

$$\hat\beta_{BIbout} = (X_b' W_b A_{BI} X_b)^{-1} X_b' W_b A_{BI}\, y_b \qquad (2.3)$$

where the trimming matrix is $A_{BI} = \mathrm{diag}\{a_i = I(y_{bi} \ge x_{bi}'\hat\beta_{BIa}(\alpha)),\ i = 1, \dots, n_2\}$ and $W_b$ is a diagonal matrix of weights $w_{bi}$.
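The two-step construction above (a weighted regression quantile from the normal group as cutoff, then weighted least squares on the disease-group observations above that cutoff) can be sketched for simple linear regression as follows. This is a minimal illustration, not the thesis code: the function names are hypothetical, the model is restricted to one covariate plus intercept, and the regression quantile is found by brute-force enumeration of lines through two data points (a minimizer of the weighted check loss can always be found among such lines), which is only practical for small samples.

```python
from itertools import combinations


def check_loss(u, alpha):
    """Koenker-Bassett check function: u * (alpha - I(u <= 0))."""
    return u * (alpha - (1.0 if u <= 0 else 0.0))


def weighted_regression_quantile(x, y, w, alpha):
    """Brute-force weighted alpha-regression quantile for y = b0 + b1 * x.

    Enumerates every line through two data points and keeps the one with
    the smallest weighted check loss (hypothetical helper, small n only).
    """
    best, best_loss = None, float("inf")
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid fit
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = sum(wk * check_loss(yk - b0 - b1 * xk, alpha)
                   for xk, yk, wk in zip(x, y, w))
        if loss < best_loss:
            best, best_loss = (b0, b1), loss
    return best


def outlier_lse(x, y, w, cutoff):
    """Weighted least squares on observations with y >= cutoff line, as in (2.3)."""
    c0, c1 = cutoff
    kept = [(xk, yk, wk) for xk, yk, wk in zip(x, y, w) if yk >= c0 + c1 * xk]
    sw = sum(wk for _, _, wk in kept)
    sx = sum(wk * xk for xk, _, wk in kept)
    sy = sum(wk * yk for _, yk, wk in kept)
    sxx = sum(wk * xk * xk for xk, _, wk in kept)
    sxy = sum(wk * xk * yk for xk, yk, wk in kept)
    det = sw * sxx - sx * sx
    if det == 0:
        raise ValueError("need at least two outliers with distinct x values")
    b1 = (sw * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / sw
    return b0, b1
```

Setting all weights to 1 in both steps gives the unbounded-influence version; supplying Mallows weights downweights high-leverage points in both the cutoff and the least squares step.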

We denote $\pi_{bout} = P(\delta \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})$. For the class of Mallows-type bounded-influence outlier LSEs, we assume that the following assumptions hold.

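Assumption 1: $\lim_{n_1, n_2 \to \infty} n_2 / n_1 = \ell_{ba}$ and $n_2^{-1} \sum_{i=1}^{n_2} x_{bij}^4 = O(1)$, where $x_{bij}$ is the $j$th element of the vector $x_{bi}$.

Assumption 2: $\lim_{n_1 \to \infty} n_1^{-1} X_a'X_a = Q_a$, $\lim_{n_2 \to \infty} n_2^{-1} X_b'X_b = Q_b$, $\lim_{n_1 \to \infty} n_1^{-1} X_a'W_aX_a = Q_{aw}$, $\lim_{n_1 \to \infty} n_1^{-1} X_a'W_a^2X_a = Q_{aww}$, $\lim_{n_2 \to \infty} n_2^{-1} X_b'W_bX_b = Q_{bw}$ and $\lim_{n_2 \to \infty} n_2^{-1} X_b'W_b^2X_b = Q_{bww}$, where $Q_a$, $Q_b$, $Q_{aw}$, $Q_{aww}$, $Q_{bw}$ and $Q_{bww}$ are $p \times p$ positive definite matrices.

Assumption 3: $\beta_{a1} = \beta_{b1}$, where we denote $\beta_b = (\beta_{b0}, \beta_{b1}')'$ and $\beta_a = (\beta_{a0}, \beta_{a1}')'$, with $\beta_{b0}$ and $\beta_{a0}$ the intercept parameters and $\beta_{b1}$ and $\beta_{a1}$ the vectors of slope parameters.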

We denote the outlier proportion $\pi_{bout} = P(y_b \ge x'\beta_a(\alpha))$. Under Assumption 3, we see that $\pi_{bout} = P(\delta \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})$. We also denote by $f_\epsilon$ and $f_\delta$ the density functions of $F_\epsilon$ and $F_\delta$, respectively. For the rest of this paper, we assume that Assumptions 1-4 are true, where Assumption 4 is listed in the Appendix. These assumptions are similar to the standard ones for linear regression models as given in Ruppert and Carroll (1980) and Portnoy and Koenker (1989).
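As a small numerical illustration of the outlier proportion under Assumption 3, the sketch below evaluates $P(\delta \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})$. The function name is hypothetical, and both error distributions default to the standard normal purely for illustration; any `statistics.NormalDist` instances can be passed instead.

```python
from statistics import NormalDist


def outlier_proportion(alpha, beta_a0, beta_b0,
                       f_eps=NormalDist(), f_delta=NormalDist()):
    """P(delta >= F_eps^{-1}(alpha) + beta_a0 - beta_b0) for normal errors."""
    cutoff = f_eps.inv_cdf(alpha) + beta_a0 - beta_b0
    return 1.0 - f_delta.cdf(cutoff)
```

With identical error distributions and equal intercepts, the proportion equals $1 - \alpha$; a larger disease-group intercept raises it, which is what an outlier-based test exploits.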

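Theorem 2.1. (a) The Mallows-type bounded-influence outlier LSE has the following representation

$$
\begin{aligned}
n_2^{1/2}(\hat\beta_{BIbout} - \beta_{bout}) ={}& -\pi_{bout}^{-1}\,\ell_{ba}^{1/2}\,(F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})\, f_\delta(F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})\, f_\epsilon^{-1}(F_\epsilon^{-1}(\alpha))\, Q_{aw}^{-1}\, n_1^{-1/2} \sum_{i=1}^{n_1} w_{ai} x_{ai}\,(\alpha - I(\epsilon_i \le F_\epsilon^{-1}(\alpha))) \\
&+ \pi_{bout}^{-1}\, Q_{bw}^{-1}\, n_2^{-1/2} \sum_{i=1}^{n_2} w_{bi} x_{bi}\,[\delta_i I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0}) - E(\delta I(\delta \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0}))] + o_p(1)
\end{aligned}
$$

where $\beta_{bout} = \beta_b + \mu_{out} e$, $e$ is the $p$-vector $(1, 0, \dots, 0)'$ and $\mu_{out} = E(\delta \mid \delta \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})$.

(b) $n_2^{1/2}(\hat\beta_{BIbout} - \beta_{bout})$ converges in distribution to a normal random vector with distribution

$$N_p(0,\ \sigma_c^2\, Q_{aw}^{-1} Q_{aww} Q_{aw}^{-1} + \sigma_{out}^2\, Q_{bw}^{-1} Q_{bww} Q_{bw}^{-1})$$

where

$$\sigma_{out}^2 = \mathrm{var}(\pi_{bout}^{-1}\,\delta\, I(\delta \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})) = \pi_{bout}^{-2} \int_{F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0}}^{\infty} \delta^2\, dF_\delta(\delta) - \mu_{out}^2$$

and

$$\sigma_c^2 = \alpha(1 - \alpha)\,\pi_{bout}^{-2}\,\ell_{ba}\,[(F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})\, f_\delta(F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})\, f_\epsilon^{-1}(F_\epsilon^{-1}(\alpha))]^2.$$

The unbounded-influence outlier LSE of Lai et al. (2013) equals $\hat\beta_{bout} = \hat\beta_{BIbout}$ with $w_{ai} = w_{bi} = 1$ for all $i$.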
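To make the quantities $\mu_{out}$ and $\sigma_{out}^2$ in Theorem 2.1 concrete, the sketch below evaluates them in closed form under the illustrative assumption that $\delta$ is standard normal (the theorem itself does not require normality). For a standard normal variable and cutoff $c$, $E(\delta I(\delta \ge c)) = \phi(c)$ and $E(\delta^2 I(\delta \ge c)) = c\,\phi(c) + 1 - \Phi(c)$. The function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def outlier_moments(cutoff):
    """Return (pi_out, mu_out, sigma2_out) for standard normal delta.

    cutoff plays the role of F_eps^{-1}(alpha) + beta_a0 - beta_b0.
    """
    pi_out = 1.0 - STD_NORMAL.cdf(cutoff)       # P(delta >= cutoff)
    phi = STD_NORMAL.pdf(cutoff)
    mu_out = phi / pi_out                        # E(delta | delta >= cutoff)
    second = cutoff * phi + pi_out               # E(delta^2 I(delta >= cutoff))
    sigma2_out = second / pi_out ** 2 - mu_out ** 2
    return pi_out, mu_out, sigma2_out
```

At cutoff 0 this gives $\pi_{bout} = 0.5$, $\mu_{out} = \sqrt{2/\pi}$ and $\sigma_{out}^2 = 2 - 2/\pi$.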

3. Monte Carlo Study

We now compare the efficiencies of the unbounded-influence and the bounded-influence outlier LSEs through a Monte Carlo study. The purpose of the Monte Carlo study is to evaluate the small-sample behavior of these two outlier LSEs. Their performance in the presence of outliers and leverage points is of particular interest.

Denote the $n$ observations of the $(j-1)$-th independent variable by $x_{1j}, \dots, x_{nj}$ for $j = 2, 3, \dots, p$. Order the $n$ observations as $x_{(1)j} \le \dots \le x_{(n)j}$ and define $r_{1j}, \dots, r_{nj}$ as the ranks of $x_{1j}, \dots, x_{nj}$. Let $L = [n\lambda] + 1$ and $U = n + 1 - L$, where $\lambda$ is a trimming proportion, following De Jongh, De Wet and Welsh (1988). The weights associated with the $(j-1)$-th independent variable are now defined as

$$
w_{ij} =
\begin{cases}
1 & \text{if } L \le r_{ij} \le U \\
(x_{(L)j} - x_{(U)j}) / D_{ij} & \text{if } r_{ij} < L \\
(x_{(U)j} - x_{(L)j}) / D_{ij} & \text{if } r_{ij} > U
\end{cases}
$$

where $D_{ij} = 2x_{ij} - x_{(U)j} - x_{(L)j}$, $i = 1, \dots, n$. The Mallows weights are now defined as $w_i = \prod_{j=2}^{p} w_{ij}$. See Denby and Larsen (1977) and De Jongh, De Wet and Welsh (1988) for these settings in regression parameter estimation, which also perform well in our quantile study.

With sample sizes $n = 50, 100$, the simple linear regression models (1.1) and (1.2) are considered. The error distributions are the standard normal $N(0,1)$ and the contaminated normal distribution

$$CN(\theta) = (1 - \rho)\,N(0,1) + \rho\,N(\theta, 1)$$

with $\rho = 0.1$.
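The rank-based Mallows weights defined above can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name and the `trim` argument (standing in for the trimming proportion in $L = [n \cdot \text{trim}] + 1$) are hypothetical, `columns` holds one list per non-constant regressor, and a point whose denominator $D_{ij}$ is zero (possible only with ties at the order statistics) keeps weight 1.

```python
def mallows_weights(columns, trim):
    """Product over regressors of the rank-based weights w_ij.

    columns: list of lists, one per non-constant regressor, each of length n.
    trim:    trimming proportion used in L = [n * trim] + 1 (assumed < 0.5).
    """
    n = len(columns[0])
    lower = int(n * trim) + 1            # L, a 1-based rank
    upper = n + 1 - lower                # U
    weights = [1.0] * n
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        rank = [0] * n
        for r, i in enumerate(order, start=1):
            rank[i] = r
        x_low = col[order[lower - 1]]    # x_(L)j
        x_up = col[order[upper - 1]]     # x_(U)j
        for i in range(n):
            d = 2 * col[i] - x_up - x_low
            if d == 0:
                continue                 # degenerate tie: leave weight at 1
            if rank[i] < lower:
                weights[i] *= (x_low - x_up) / d
            elif rank[i] > upper:
                weights[i] *= (x_up - x_low) / d
    return weights
```

Points whose ranks lie between $L$ and $U$ keep full weight, while a point far outside the central range of a regressor receives a weight that shrinks toward zero as its distance grows.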

The sample of independent variables is considered under the following designs:

D1: $x_{ij}$, $i = 1, \dots, n$, are i.i.d. $N(0,1)$ for $j = 2, \dots, p$.

D2: As D1, but one point is moved out 5 units in X space.

D3: As D1, but two points are moved out 5 units in X space.

D4: As D1, but one point is moved out 10 units in X space.

D5: As D1, but two points are moved out 10 units in X space.
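A minimal sketch of how designs D1 to D5 could be generated for the simple linear regression case is given below. The source does not say which points are moved or in which direction, so this sketch assumes, for illustration only, that the first one or two points are shifted by adding 5 or 10 units; the function name is hypothetical.

```python
import random

# (number of moved points, shift in X space) for each design
DESIGNS = {
    "D1": (0, 0.0),
    "D2": (1, 5.0),
    "D3": (2, 5.0),
    "D4": (1, 10.0),
    "D5": (2, 10.0),
}


def make_design(n, design, rng):
    """Draw n i.i.d. N(0,1) covariate values, then shift the first few points."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    n_moved, shift = DESIGNS[design]
    for i in range(n_moved):
        x[i] += shift   # assumption: "moved out" means adding the shift
    return x
```

Using the same seed for two designs, for example `make_design(50, "D1", random.Random(1))` and `make_design(50, "D3", random.Random(1))`, yields samples that differ only in the moved points, which isolates the effect of leverage.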

Design D1 generates ideal observations $x_{ij}$, and we expect the unbounded-influence outlier LSE to be more efficient than the bounded one whatever the distribution of the error variable. On the other hand, influential observations $x_{ij}$ occur in designs D2-D5, where we expect the bounded-influence outlier LSE to be more efficient than the unbounded one; it is interesting, however, to see how much more efficient it is.

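Table 1. The efficiencies of the unbounded-influence outlier LSE and the bounded-influence outlier LSE ($n = 50$)

| Design | $\theta$ | $\alpha=0.6$: $Eff_b$ | $\alpha=0.6$: $Eff_{BIb}$ | $\alpha=0.7$: $Eff_b$ | $\alpha=0.7$: $Eff_{BIb}$ | $\alpha=0.8$: $Eff_b$ | $\alpha=0.8$: $Eff_{BIb}$ | $\alpha=0.9$: $Eff_b$ | $\alpha=0.9$: $Eff_{BIb}$ |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 0.5 | 100 | 84 | 100 | 85 | 100 | 87 | 100 | 91 |
| D1 | 1 | 100 | 84 | 100 | 85 | 100 | 87 | 100 | 90 |
| D1 | 1.5 | 100 | 84 | 100 | 85 | 100 | 86 | 100 | 89 |
| D1 | 2 | 100 | 84 | 100 | 85 | 100 | 86 | 100 | 88 |
| D1 | 2.5 | 100 | 84 | 100 | 84 | 100 | 85 | 100 | 88 |
| D2 | 0.5 | 19 | 100 | 24 | 100 | 31 | 100 | 48 | 100 |
| D2 | 1 | 19 | 100 | 23 | 100 | 30 | 100 | 45 | 100 |
| D2 | 1.5 | 19 | 100 | 23 | 100 | 29 | 100 | 42 | 100 |
| D2 | 2 | 21 | 100 | 24 | 100 | 30 | 100 | 41 | 100 |
| D2 | 2.5 | 23 | 100 | 26 | 100 | 31 | 100 | 41 | 100 |
| D3 | 0.5 | 30 | 100 | 37 | 100 | 47 | 100 | 64 | 100 |
| D3 | 1 | 29 | 100 | 35 | 100 | 45 | 100 | 61 | 100 |
| D3 | 1.5 | 29 | 100 | 35 | 100 | 43 | 100 | 58 | 100 |
| D3 | 2 | 29 | 100 | 35 | 100 | 43 | 100 | 56 | 100 |
| D3 | 2.5 | 30 | 100 | 36 | 100 | 43 | 100 | 55 | 100 |
| D4 | 0.5 | 63 | 100 | 70 | 100 | 77 | 100 | 85 | 100 |
| D4 | 1 | 62 | 100 | 68 | 100 | 76 | 100 | 84 | 100 |
| D4 | 1.5 | 61 | 100 | 67 | 100 | 74 | 100 | 83 | 100 |
| D4 | 2 | 61 | 100 | 67 | 100 | 73 | 100 | 81 | 100 |
| D4 | 2.5 | 61 | 100 | 67 | 100 | 73 | 100 | 81 | 100 |
| D5 | 0.5 | 78 | 100 | 82 | 100 | 87 | 100 | 90 | 100 |
| D5 | 1 | 77 | 100 | 82 | 100 | 86 | 100 | 90 | 100 |
| D5 | 1.5 | 77 | 100 | 81 | 100 | 85 | 100 | 89 | 100 |
| D5 | 2 | 77 | 100 | 81 | 100 | 85 | 100 | 89 | 100 |
| D5 | 2.5 | 77 | 100 | 81 | 100 | 85 | 100 | 89 | 100 |

A total of 10000 replications were performed. Table 1 presents the Monte Carlo results in the form of efficiencies relative to the better of the unbounded-influence outlier LSE and the bounded-influence outlier LSE; that is, the efficiency is equal to the average mean squared error of the better one, times 100, divided by the average mean squared error of the outlier LSE in question:

$$Eff_{BIb} = 100 \times \frac{\min\{MSE_b, MSE_{BIb}\}}{MSE_{BIb}} \quad \text{and} \quad Eff_b = 100 \times \frac{\min\{MSE_b, MSE_{BIb}\}}{MSE_b}$$

where $MSE_b$ is the average of the MSEs of the unbounded-influence outlier LSE and $MSE_{BIb}$ is the average of the MSEs of the bounded-influence outlier LSE. In Tables 1 and 2, gross errors appear only in the disease-group data ($x_{bi}$).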
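The efficiency measure described above is simple enough to state as code. The sketch below takes two average mean squared errors and returns the pair of efficiencies on the 0 to 100 scale used in the tables; the function name is hypothetical.

```python
def efficiencies(mse_unbounded, mse_bounded):
    """Return (Eff_b, Eff_BIb): 100 * best average MSE / own average MSE."""
    best = min(mse_unbounded, mse_bounded)
    return 100.0 * best / mse_unbounded, 100.0 * best / mse_bounded
```

By construction, the estimator with the smaller average MSE always scores 100, so each pair of table entries contains at least one 100.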

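Table 2. The efficiencies of the unbounded-influence outlier LSE and the bounded-influence outlier LSE ($n_a = n_b = 100$)

| Design | $\theta$ | $\alpha=0.6$: $Eff_b$ | $\alpha=0.6$: $Eff_{BIb}$ | $\alpha=0.7$: $Eff_b$ | $\alpha=0.7$: $Eff_{BIb}$ | $\alpha=0.8$: $Eff_b$ | $\alpha=0.8$: $Eff_{BIb}$ | $\alpha=0.9$: $Eff_b$ | $\alpha=0.9$: $Eff_{BIb}$ |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 0.5 | 100 | 83 | 100 | 84 | 100 | 84 | 100 | 88 |
| D1 | 1 | 100 | 83 | 100 | 84 | 100 | 84 | 100 | 87 |
| D1 | 1.5 | 100 | 82 | 100 | 83 | 100 | 84 | 100 | 86 |
| D1 | 2 | 100 | 82 | 100 | 83 | 100 | 84 | 100 | 86 |
| D1 | 2.5 | 100 | 83 | 100 | 83 | 100 | 84 | 100 | 85 |
| D2 | 0.5 | 47 | 100 | 55 | 100 | 64 | 100 | 77 | 100 |
| D2 | 1 | 47 | 100 | 54 | 100 | 63 | 100 | 76 | 100 |
| D2 | 1.5 | 47 | 100 | 53 | 100 | 61 | 100 | 73 | 100 |
| D2 | 2 | 47 | 100 | 53 | 100 | 62 | 100 | 72 | 100 |
| D2 | 2.5 | 49 | 100 | 55 | 100 | 62 | 100 | 72 | 100 |
| D3 | 0.5 | 65 | 100 | 71 | 100 | 78 | 100 | 86 | 100 |
| D3 | 1 | 64 | 100 | 70 | 100 | 77 | 100 | 85 | 100 |
| D3 | 1.5 | 64 | 100 | 70 | 100 | 76 | 100 | 84 | 100 |
| D3 | 2 | 64 | 100 | 70 | 100 | 76 | 100 | 83 | 100 |
| D3 | 2.5 | 65 | 100 | 70 | 100 | 76 | 100 | 83 | 100 |
| D4 | 0.5 | 86 | 100 | 88 | 100 | 90 | 100 | 92 | 100 |
| D4 | 1 | 86 | 100 | 88 | 100 | 89 | 100 | 91 | 100 |
| D4 | 1.5 | 86 | 100 | 88 | 100 | 89 | 100 | 91 | 100 |
| D4 | 2 | 86 | 100 | 88 | 100 | 89 | 100 | 91 | 100 |
| D4 | 2.5 | 86 | 100 | 88 | 100 | 89 | 100 | 91 | 100 |
| D5 | 0.5 | 92 | 100 | 93 | 100 | 93 | 100 | 93 | 100 |
| D5 | 1 | 92 | 100 | 92 | 100 | 93 | 100 | 93 | 100 |
| D5 | 1.5 | 92 | 100 | 92 | 100 | 93 | 100 | 94 | 100 |
| D5 | 2 | 92 | 100 | 93 | 100 | 93 | 100 | 94 | 100 |
| D5 | 2.5 | 92 | 100 | 93 | 100 | 93 | 100 | 94 | 100 |

Several conclusions can be drawn from the simulated results: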

(a). In design D1, the regression matrices are well-behaved and the error variables have distributions with moderate to very heavy tails. The results are as expected; that is, the unbounded-influence outlier LSE is more efficient than the Mallows-type bounded-influence outlier LSE. However, the efficiency of the Mallows-type bounded-influence outlier LSE is quite robust, in that its efficiencies are all at least 84 in Table 1 and at least 82 in Table 2 in this ideal design of the regression matrices.

(b). In designs D2-D5, the error variables follow exactly the same distributions as in design D1, but gross errors are introduced in the regression matrices. The Mallows-type bounded-influence outlier LSEs perform much better than the unbounded-influence outlier LSEs. For design D2, the unbounded-influence outlier LSE is very poor, with efficiency as low as 19 in Table 1 and 47 in Table 2.

Next, we consider a simulation in which the response variables in model (2.1) for the control group and in model (2.2) for the disease group are simultaneously subjected to gross errors, under designs D1 to D5, to evaluate the efficiencies of the Mallows-type outlier estimators.

The results also show that the Mallows-type bounded-influence outlier LSE is much better than the unbounded-influence one when gross errors exist in x-space.

4. Mallows Type Outlier Regression Quantile

The regression quantile, introduced by Koenker and Bassett (1978), plays the role of the order statistics for the linear regression model and is useful in constructing a broad class of L-estimators (Koenker and Zhao (1994) and Portnoy and Koenker (1989)) as different measures of central tendency and statistical dispersion, as well as measures of other distributional characteristics. A regression outlier $\gamma$-quantile $\beta_{bq}(\gamma)$ models the relationship between the covariates and the variable $y_b$ with $\gamma = P(y_b \le x_b'\beta_{bq}(\gamma) \mid y_b \ge x_b'\beta_a(\alpha))$, and can be written in the form

$$\beta_{bq}(\gamma) = \beta_b + F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma))\, e.$$

We define the sample Mallows-type bounded-influence regression outlier $\gamma$-quantile as

$$\hat\beta_{BIbq}(\gamma) = \arg\min_{b \in R^p} \sum_{i=1}^{n_2} w_{bi}\,(y_{bi} - x_{bi}'b)\,[\gamma - I(y_{bi} \le x_{bi}'b)]\; I(y_{bi} \ge x_{bi}'\hat\beta_{BIa}(\alpha)).$$
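The closed form of $\beta_{bq}(\gamma)$ given above says that the outlier $\gamma$-quantile shifts the intercept by $F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma))$. The sketch below computes this shift and lets one check numerically that the conditional probability of falling below it, given that the error exceeds the cutoff, equals $\gamma$. Normal error distributions and the function name are assumptions made for illustration.

```python
from statistics import NormalDist


def outlier_quantile_shift(gamma, alpha, beta_a0, beta_b0,
                           f_eps=NormalDist(), f_delta=NormalDist()):
    """Return (shift, cutoff, pi_out) for the population outlier gamma-quantile.

    cutoff = F_eps^{-1}(alpha) + beta_a0 - beta_b0
    pi_out = P(delta >= cutoff)
    shift  = F_delta^{-1}(1 - pi_out * (1 - gamma))
    """
    cutoff = f_eps.inv_cdf(alpha) + beta_a0 - beta_b0
    pi_out = 1.0 - f_delta.cdf(cutoff)
    shift = f_delta.inv_cdf(1.0 - pi_out * (1.0 - gamma))
    return shift, cutoff, pi_out
```

The identity behind the formula is $P(\delta \le q \mid \delta \ge c) = (F_\delta(q) - F_\delta(c)) / \pi_{bout}$; setting this equal to $\gamma$ and using $F_\delta(c) = 1 - \pi_{bout}$ gives $F_\delta(q) = 1 - \pi_{bout}(1 - \gamma)$.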

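Table 3. The efficiencies of the outlier LSE and the bounded-influence outlier LSE ($n = 30$; columns index the quantile level)

| Design | $\theta$ | 0.6: $Eff_{oq}$ | 0.6: $Eff_{boq}$ | 0.7: $Eff_{oq}$ | 0.7: $Eff_{boq}$ | 0.8: $Eff_{oq}$ | 0.8: $Eff_{boq}$ | 0.9: $Eff_{oq}$ | 0.9: $Eff_{boq}$ |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 0.5 | 100 | 83 | 100 | 83 | 100 | 83 | 100 | 85 |
| D1 | 1 | 100 | 82 | 100 | 83 | 100 | 84 | 100 | 85 |
| D1 | 1.5 | 100 | 83 | 100 | 82 | 100 | 83 | 100 | 84 |
| D1 | 2 | 100 | 83 | 100 | 83 | 100 | 83 | 100 | 84 |
| D1 | 2.5 | 100 | 83 | 100 | 83 | 100 | 83 | 100 | 84 |
| D2 | 0.5 | 47 | 100 | 53 | 100 | 61 | 100 | 69 | 100 |
| D2 | 1 | 47 | 100 | 53 | 100 | 60 | 100 | 68 | 100 |
| D2 | 1.5 | 48 | 100 | 53 | 100 | 59 | 100 | 66 | 100 |
| D2 | 2 | 50 | 100 | 54 | 100 | 59 | 100 | 65 | 100 |
| D2 | 2.5 | 53 | 100 | 57 | 100 | 61 | 100 | 65 | 100 |
| D3 | 0.5 | 57 | 100 | 62 | 100 | 67 | 100 | 75 | 100 |
| D3 | 1 | 56 | 100 | 61 | 100 | 67 | 100 | 74 | 100 |
| D3 | 1.5 | 57 | 100 | 60 | 100 | 66 | 100 | 74 | 100 |
| D3 | 2 | 58 | 100 | 61 | 100 | 66 | 100 | 73 | 100 |
| D3 | 2.5 | 59 | 100 | 62 | 100 | 66 | 100 | 73 | 100 |
| D4 | 0.5 | 51 | 100 | 60 | 100 | 66 | 100 | 71 | 100 |
| D4 | 1 | 49 | 100 | 59 | 100 | 65 | 100 | 69 | 100 |
| D4 | 1.5 | 49 | 100 | 58 | 100 | 63 | 100 | 69 | 100 |
| D4 | 2 | 49 | 100 | 57 | 100 | 62 | 100 | 67 | 100 |
| D4 | 2.5 | 50 | 100 | 57 | 100 | 61 | 100 | 67 | 100 |
| D5 | 0.5 | 64 | 100 | 68 | 100 | 72 | 100 | 75 | 100 |
| D5 | 1 | 63 | 100 | 67 | 100 | 71 | 100 | 75 | 100 |
| D5 | 1.5 | 63 | 100 | 66 | 100 | 70 | 100 | 74 | 100 |
| D5 | 2 | 62 | 100 | 66 | 100 | 70 | 100 | 75 | 100 |
| D5 | 2.5 | 64 | 100 | 66 | 100 | 69 | 100 | 74 | 100 |

The following theorem gives the asymptotic representation of $\hat\beta_{BIbq}(\gamma)$.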

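Theorem 4.1.

(a) A Bahadur representation for the bounded-influence outlier regression quantile is

$$
\begin{aligned}
n_2^{1/2}(\hat\beta_{BIbq}(\gamma) - \beta_{bq}(\gamma)) ={}& f_\delta^{-1}(F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)))\, f_\delta(\beta_{a0} - \beta_{b0} + F_\epsilon^{-1}(\alpha))\, f_\epsilon^{-1}(F_\epsilon^{-1}(\alpha))\, \ell_{ba}^{1/2}\, Q_{aw}^{-1}\, n_1^{-1/2} \sum_{i=1}^{n_1} w_{ai} x_{ai}\,[\alpha - I(\epsilon_i \le F_\epsilon^{-1}(\alpha))] \\
&+ f_\delta^{-1}(F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)))\, Q_{bw}^{-1}\, n_2^{-1/2} \sum_{i=1}^{n_2} w_{bi} x_{bi}\,[\gamma - I(\delta_i \le F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)))]\; I(\delta_i \ge \beta_{a0} - \beta_{b0} + F_\epsilon^{-1}(\alpha)) + o_p(1).
\end{aligned}
$$

(b) $n_2^{1/2}(\hat\beta_{BIbq}(\gamma) - \beta_{bq}(\gamma))$ converges to a normal distribution with mean $0_p$ and covariance matrix

$$\sigma_q^2\, Q_{aw}^{-1} Q_{aww} Q_{aw}^{-1} + \sigma_{out}^2\, Q_{bw}^{-1} Q_{bww} Q_{bw}^{-1}$$

where

$$\sigma_q^2 = \alpha(1 - \alpha)\,\ell_{ba}\,\big(f_\delta^{-1}(F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)))\, f_\delta(\beta_{a0} - \beta_{b0} + F_\epsilon^{-1}(\alpha))\, f_\epsilon^{-1}(F_\epsilon^{-1}(\alpha))\big)^2$$

and

$$\sigma_{out}^2 = \gamma\,\pi_{bout}\,(1 - \gamma)\,\big(f_\delta^{-1}(F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)))\big)^2.$$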
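The factor $\gamma\,\pi_{bout}(1 - \gamma)$ in $\sigma_{out}^2$ of Theorem 4.1 is the variance of the score $[\gamma - I(\delta \le q)]\,I(\delta \ge c)$, where $c$ is the cutoff and $q = F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma))$. The sketch below verifies this by direct calculation, assuming a standard normal $\delta$ for illustration; the function names are hypothetical.

```python
from statistics import NormalDist


def score_moments(gamma, cutoff, dist=NormalDist()):
    """Mean and second moment of [gamma - I(delta <= q)] * I(delta >= cutoff)."""
    pi_out = 1.0 - dist.cdf(cutoff)
    q = dist.inv_cdf(1.0 - pi_out * (1.0 - gamma))
    p_mid = dist.cdf(q) - dist.cdf(cutoff)   # P(cutoff <= delta <= q): score gamma - 1
    p_top = 1.0 - dist.cdf(q)                # P(delta > q): score gamma
    mean = (gamma - 1.0) * p_mid + gamma * p_top
    second = (gamma - 1.0) ** 2 * p_mid + gamma ** 2 * p_top
    return mean, second, pi_out, q


def sigma2_out_quantile(gamma, cutoff, dist=NormalDist()):
    """sigma_out^2 of Theorem 4.1: score variance divided by f_delta(q)^2."""
    mean, second, _, q = score_moments(gamma, cutoff, dist)
    return (second - mean ** 2) / dist.pdf(q) ** 2
```

The score has mean zero because $P(c \le \delta \le q) = \gamma\,\pi_{bout}$ and $P(\delta > q) = (1 - \gamma)\,\pi_{bout}$, so its variance reduces to $\gamma(1 - \gamma)\,\pi_{bout}$.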

Let $\hat\beta_{bq}(\gamma)$ be the unbounded-influence outlier regression $\gamma$-quantile of Lai et al. (2013). We perform a simulation study with 1000 replications. Let $MSE_{BIbq}$ and $MSE_{bq}$ be the average MSEs of $\hat\beta_{BIbq}(\gamma)$ and $\hat\beta_{bq}(\gamma)$, respectively. We define the efficiencies of the unbounded-influence and bounded-influence regression quantiles as

$$Eff_{bq} = 100 \times \frac{\min\{MSE_{bq}, MSE_{BIbq}\}}{MSE_{bq}} \quad \text{and} \quad Eff_{BIbq} = 100 \times \frac{\min\{MSE_{bq}, MSE_{BIbq}\}}{MSE_{BIbq}}.$$

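Table 4. The efficiencies of the unbounded-influence outlier quantile and the bounded-influence outlier quantile (one quantile level fixed at 0.8; columns index the other quantile level)

| Design | $\theta$ | 0.6: $Eff_{bq}$ | 0.6: $Eff_{BIbq}$ | 0.7: $Eff_{bq}$ | 0.7: $Eff_{BIbq}$ | 0.8: $Eff_{bq}$ | 0.8: $Eff_{BIbq}$ | 0.9: $Eff_{bq}$ | 0.9: $Eff_{BIbq}$ |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 0.5 | 100 | 90 | 100 | 88 | 100 | 89 | 100 | 96 |
| D1 | 1 | 100 | 90 | 100 | 88 | 100 | 89 | 100 | 97 |
| D1 | 1.5 | 100 | 90 | 100 | 88 | 100 | 89 | 100 | 96 |
| D1 | 2 | 100 | 91 | 100 | 88 | 100 | 89 | 100 | 96 |
| D1 | 2.5 | 100 | 91 | 100 | 89 | 100 | 89 | 100 | 96 |
| D2 | 0.5 | 77 | 100 | 39 | 100 | 19 | 100 | 61 | 100 |
| D2 | 1 | 79 | 100 | 42 | 100 | 20 | 100 | 58 | 100 |
| D2 | 1.5 | 80 | 100 | 45 | 100 | 24 | 100 | 57 | 100 |
| D2 | 2 | 82 | 100 | 50 | 100 | 31 | 100 | 61 | 100 |
| D2 | 2.5 | 85 | 100 | 58 | 100 | 43 | 100 | 65 | 100 |
| D3 | 0.5 | 44 | 100 | 22 | 100 | 28 | 100 | 84 | 100 |
| D3 | 1 | 48 | 100 | 23 | 100 | 28 | 100 | 80 | 100 |
| D3 | 1.5 | 53 | 100 | 28 | 100 | 30 | 100 | 77 | 100 |
| D3 | 2 | 60 | 100 | 38 | 100 | 38 | 100 | 77 | 100 |
| D3 | 2.5 | 70 | 100 | 52 | 100 | 48 | 100 | 78 | 100 |
| D4 | 0.5 | 20 | 100 | 11 | 100 | 12 | 100 | 48 | 100 |
| D4 | 1 | 23 | 100 | 13 | 100 | 12 | 100 | 45 | 100 |
| D4 | 1.5 | 27 | 100 | 16 | 100 | 15 | 100 | 43 | 100 |
| D4 | 2 | 35 | 100 | 22 | 100 | 19 | 100 | 46 | 100 |
| D4 | 2.5 | 46 | 100 | 33 | 100 | 29 | 100 | 50 | 100 |
| D5 | 0.5 | 20 | 100 | 13 | 100 | 23 | 100 | 88 | 100 |
| D5 | 1 | 23 | 100 | 14 | 100 | 21 | 100 | 84 | 100 |
| D5 | 1.5 | 27 | 100 | 17 | 100 | 23 | 100 | 80 | 100 |
| D5 | 2 | 35 | 100 | 23 | 100 | 29 | 100 | 79 | 100 |
| D5 | 2.5 | 46 | 100 | 35 | 100 | 39 | 100 | 89 | 100 |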

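Table 5. The efficiencies of the unbounded-influence outlier quantile and the bounded-influence outlier quantile (one quantile level fixed at 0.9; columns index the other quantile level)

| Design | $\theta$ | 0.6: $Eff_{bq}$ | 0.6: $Eff_{BIbq}$ | 0.7: $Eff_{bq}$ | 0.7: $Eff_{BIbq}$ | 0.8: $Eff_{bq}$ | 0.8: $Eff_{BIbq}$ | 0.9: $Eff_{bq}$ | 0.9: $Eff_{BIbq}$ |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 0.5 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 97 |
| D1 | 1 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 97 |
| D1 | 1.5 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 96 |
| D1 | 2 | 100 | 95 | 100 | 92 | 100 | 91 | 100 | 97 |
| D1 | 2.5 | 100 | 95 | 100 | 93 | 100 | 91 | 100 | 96 |
| D2 | 0.5 | 91 | 100 | 68 | 100 | 34 | 100 | 65 | 100 |
| D2 | 1 | 92 | 92 | 72 | 100 | 38 | 100 | 62 | 100 |
| D2 | 1.5 | 100 | 92 | 83 | 100 | 60 | 100 | 67 | 100 |
| D2 | 2 | 94 | 100 | 83 | 100 | 60 | 100 | 67 | 100 |
| D2 | 2.5 | 100 | 92 | 89 | 89 | 71 | 100 | 71 | 100 |
| D3 | 0.5 | 75 | 100 | 53 | 100 | 42 | 100 | 85 | 100 |
| D3 | 1 | 78 | 100 | 58 | 100 | 43 | 100 | 81 | 100 |
| D3 | 1.5 | 83 | 100 | 68 | 100 | 51 | 100 | 79 | 100 |
| D3 | 2 | 89 | 100 | 80 | 100 | 63 | 100 | 79 | 100 |
| D3 | 2.5 | 100 | 90 | 89 | 89 | 73 | 100 | 81 | 100 |
| D4 | 0.5 | 48 | 100 | 34 | 100 | 22 | 100 | 51 | 100 |
| D4 | 1 | 52 | 100 | 39 | 100 | 25 | 100 | 47 | 100 |
| D4 | 1.5 | 61 | 100 | 48 | 100 | 31 | 100 | 46 | 100 |
| D4 | 2 | 73 | 100 | 62 | 100 | 43 | 100 | 50 | 100 |
| D4 | 2.5 | 84 | 100 | 77 | 100 | 56 | 100 | 55 | 100 |
| D5 | 0.5 | 48 | 100 | 35 | 100 | 32 | 100 | 89 | 100 |
| D5 | 1 | 52 | 100 | 39 | 100 | 32 | 100 | 86 | 100 |
| D5 | 1.5 | 61 | 100 | 49 | 100 | 39 | 100 | 83 | 100 |
| D5 | 2 | 73 | 100 | 63 | 100 | 50 | 100 | 83 | 100 |
| D5 | 2.5 | 85 | 100 | 78 | 100 | 64 | 100 | 84 | 100 |

Several conclusions can be drawn from the simulated results: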

(a). In design D1, the regression matrices are well-behaved and the error variables have distributions with moderate to very heavy tails. The results are as expected; that is, the unbounded-influence outlier regression quantile is more efficient than the Mallows-type bounded-influence outlier regression quantile. However, the efficiency of the Mallows-type bounded-influence outlier regression quantile is quite robust, in that its efficiencies are all at least 88 in Table 4 and at least 91 in Table 5 in this ideal design of the regression matrices.

(b). In designs D2-D5, the error variables follow exactly the same distributions as in design D1, but gross errors are introduced in the regression matrices. The Mallows-type bounded-influence outlier regression quantiles perform much better than the unbounded-influence outlier regression quantiles. For design D4, the unbounded-influence outlier regression quantile is very poor, with efficiency as low as 11 in Table 4 and 22 in Table 5.

Table 6. The efficiencies of the outlier quantile and the bounded-influence outlier quantile (one quantile level fixed at 0.8, $n = 50$; columns index the other quantile level)

| Design | $\theta$ | 0.6: $Eff_{oq}$ | 0.6: $Eff_{boq}$ | 0.7: $Eff_{oq}$ | 0.7: $Eff_{boq}$ | 0.8: $Eff_{oq}$ | 0.8: $Eff_{boq}$ | 0.9: $Eff_{oq}$ | 0.9: $Eff_{boq}$ |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 0.5 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 97 |
| D1 | 1 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 96 |
| D1 | 1.5 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 95 |
| D1 | 2 | 100 | 95 | 100 | 92 | 100 | 91 | 100 | 95 |
| D1 | 2.5 | 100 | 94 | 100 | 92 | 100 | 91 | 100 | 94 |
| D2 | 0.5 | 72 | 100 | 53 | 100 | 59 | 100 | 86 | 100 |
| D2 | 1 | 75 | 100 | 55 | 100 | 59 | 100 | 86 | 100 |
| D2 | 1.5 | 77 | 100 | 60 | 100 | 62 | 100 | 85 | 100 |
| D2 | 2 | 80 | 100 | 67 | 100 | 68 | 100 | 85 | 100 |
| D2 | 2.5 | 84 | 100 | 75 | 100 | 76 | 100 | 85 | 100 |
| D3 | 0.5 | 61 | 100 | 59 | 100 | 77 | 100 | 91 | 100 |
| D3 | 1 | 63 | 100 | 59 | 100 | 75 | 100 | 90 | 100 |
| D3 | 1.5 | 68 | 100 | 62 | 100 | 74 | 100 | 90 | 100 |
| D3 | 2 | 75 | 100 | 67 | 100 | 74 | 100 | 89 | 100 |
| D3 | 2.5 | 81 | 100 | 74 | 100 | 77 | 100 | 88 | 100 |
| D4 | 0.5 | 40 | 100 | 31 | 100 | 48 | 100 | 81 | 100 |
| D4 | 1 | 44 | 100 | 33 | 100 | 48 | 100 | 79 | 100 |
| D4 | 1.5 | 49 | 100 | 37 | 100 | 50 | 100 | 79 | 100 |
| D4 | 2 | 57 | 100 | 47 | 100 | 56 | 100 | 78 | 100 |
| D4 | 2.5 | 66 | 100 | 59 | 100 | 64 | 100 | 79 | 100 |
| D5 | 0.5 | 42 | 100 | 50 | 100 | 76 | 100 | 88 | 100 |
| D5 | 1 | 45 | 100 | 48 | 100 | 73 | 100 | 88 | 100 |
| D5 | 1.5 | 50 | 100 | 50 | 100 | 72 | 100 | 87 | 100 |
| D5 | 2 | 58 | 100 | 56 | 100 | 72 | 100 | 85 | 100 |
| D5 | 2.5 | 68 | 100 | 65 | 100 | 76 | 100 | 84 | 100 |

Next, we consider a simulation in which the response variables in model (2.1) for the control group and in model (2.2) for the disease group are simultaneously subjected to gross errors, under designs D1 to D5, to evaluate the efficiencies of the Mallows-type outlier quantile estimators.

The results also show that the Mallows-type bounded-influence outlier regression quantile is much better than the unbounded-influence one when gross errors exist in x-space.

5. Appendix

One more assumption is required for the proofs of the theorems in this paper.

Assumption 4: The probability density functions $f_\epsilon$ and $f_\delta$ are bounded away from zero in neighborhoods of $F_\epsilon^{-1}(\alpha)$ and $F_\delta^{-1}(\alpha)$, respectively, for $\alpha \in (0,1)$.

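Proof of Theorem 2.1.

From the expression for $\hat\beta_{BIbout}$ in (2.3) and model (2.2), we have

$$
\begin{aligned}
n_2^{1/2}(\hat\beta_{BIbout} - \beta_{bout}) ={}& n_2^{1/2}\Big(\sum_{i=1}^{n_2} w_{bi} x_{bi} x_{bi}'\, I(y_{bi} \ge x_{bi}'\hat\beta_{BIa}(\alpha))\Big)^{-1} \Big\{ \sum_{i=1}^{n_2} w_{bi} x_{bi} \delta_i\,[I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0} + n_1^{-1/2} x_{bi}'T_a) - I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})] \\
&+ \sum_{i=1}^{n_2} w_{bi} x_{bi} \delta_i\, I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0}) \Big\} + o_p(1) \qquad (5.1)
\end{aligned}
$$

where $T_a = n_1^{1/2}(\hat\beta_{BIa}(\alpha) - \beta_a(\alpha))$.

With Assumption 4 and Jureckova and Sen's (1987) extension of Billingsley's Theorem (see also Koul (1992)), the first term on the right-hand side of (5.1) may be expressed as

$$
\begin{aligned}
& n_2^{-1/2} \sum_{i=1}^{n_2} w_{bi} x_{bi} \delta_i\,[I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0} + n_1^{-1/2} x_{bi}'T_n) - I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})] \\
&\quad = -(F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})\,\ell_{ba}^{1/2}\, f_\delta(F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0})\, Q_{bw} T_n + o_p(1) \qquad (5.2)
\end{aligned}
$$

for any sequence $T_n$ with $T_n = O_p(1)$.

We know from Chen, Thompson and Chuang (2000) that

$$n_1^{1/2}(\hat\beta_{BIa}(\alpha) - \beta_a(\alpha)) = Q_{aw}^{-1}\, f_\epsilon^{-1}(F_\epsilon^{-1}(\alpha))\, n_1^{-1/2} \sum_{i=1}^{n_1} w_{ai} x_{ai}\,(\alpha - I(\epsilon_i \le F_\epsilon^{-1}(\alpha))) + o_p(1). \qquad (5.3)$$

By the same rationale, we can derive

$$n_2^{-1} \sum_{i=1}^{n_2} w_{bi} x_{bi} x_{bi}'\, I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0} + n_1^{-1/2} x_{bi}'T_a) = n_2^{-1} \sum_{i=1}^{n_2} w_{bi} x_{bi} x_{bi}'\, I(\delta_i \ge F_\epsilon^{-1}(\alpha) + \beta_{a0} - \beta_{b0}) + o_p(1)$$

for any sequence $T_a = O_p(1)$. This indicates

$$n_2^{-1} \sum_{i=1}^{n_2} w_{bi} x_{bi} x_{bi}'\, I(y_{bi} \ge x_{bi}'\hat\beta_{BIa}(\alpha)) = \pi_{bout}\, Q_{bw} + o_p(1). \qquad (5.4)$$

Letting $T_a = T_n$ and combining the results in (5.1)-(5.4), result (a) of the theorem follows.

The asymptotic normality in (b) is a direct consequence of the representation and the central limit theorem.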

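Proof of Theorem 4.1.

Let

$$U(t_1, t_2) = n_2^{-1/2} \sum_{i=1}^{n_2} w_{bi} x_{bi}\, I(\delta_i \le F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)) + n_2^{-1/2} x_{bi}'t_2)\; I(\delta_i \ge \beta_{a0} - \beta_{b0} + F_\epsilon^{-1}(\alpha) + n_1^{-1/2} x_{bi}'t_1).$$

From Jureckova and Sen's (1987) extension of Billingsley's Theorem (see also Koul (1992)), we have

$$U(T_1, T_2) - U(0, 0) = Q_{bw}\, f_\delta(F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)))\, T_2 - Q_{bw}\, f_\delta(\beta_{a0} - \beta_{b0} + F_\epsilon^{-1}(\alpha))\,\ell_{ba}^{1/2}\, T_1 + o_p(1) \qquad (5.5)$$

for any sequences $T_1 = O_p(1)$ and $T_2 = O_p(1)$. Following the proof of Lemma 3.3 of Chen and Chiang (1996) (see also Ruppert and Carroll (1980)), it can be seen that

$$
\begin{aligned}
& U(n_1^{1/2}(\hat\beta_{BIa}(\alpha) - \beta_a(\alpha)),\ n_2^{1/2}(\hat\beta_{BIbq}(\gamma) - \beta_{bq}(\gamma))) \\
&\quad = n_2^{-1/2} \sum_{i=1}^{n_2} w_{bi} x_{bi}\,[\gamma - I(y_{bi} \le x_{bi}'\hat\beta_{bq}(\gamma))]\; I(y_{bi} \ge x_{bi}'\hat\beta_a(\alpha)) = o_p(1). \qquad (5.6)
\end{aligned}
$$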

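Also, using the method of Jureckova (1977, Lemma 5.2) and (5.5), one can show that for $\varepsilon > 0$ there exist $\eta$, $k$ and $N_0$ such that

$$
P\Big\{ \inf_{\|t_2\| \ge k} n_2^{-1/2} \Big\| \sum_{i=1}^{n_2} w_{bi} x_{bi}\,[\gamma - I(\delta_i \le F_\delta^{-1}(1 - \pi_{bout}(1 - \gamma)) + n_2^{-1/2} x_{bi}'t_2)]\; I(\delta_i \ge \beta_{a0} - \beta_{b0} + F_\epsilon^{-1}(\alpha) + n_1^{-1/2} x_{bi}'T_3) \Big\| \le \eta \Big\} \le \varepsilon \qquad (5.7)
$$

for $n_2 \ge N_0$, where $T_3$ is any sequence of random vectors with $T_3 = O_p(1)$. Then the weak consistency of $\hat\beta_{BIbq}(\gamma)$ can be obtained from its root-$n$ consistency, given by

$$n_2^{1/2}(\hat\beta_{BIbq}(\gamma) - \beta_{bq}(\gamma)) = O_p(1),$$

which follows from (5.6) and (5.7). Result (a) of Theorem 4.1 follows from (5.5) and (5.7) by setting $T_1 = n_1^{1/2}(\hat\beta_{BIa}(\alpha) - \beta_a(\alpha))$ and $T_2 = n_2^{1/2}(\hat\beta_{BIbq}(\gamma) - \beta_{bq}(\gamma))$.

REFERENCES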

Chen, L.-A., Thompson, P. and Chuang, H.-C. (2000). Mallows type bounded influence regression quantile for linear regression model and simultaneous equations model. Sankhya Ser. B 62, 217-232.

Chen, L.-A., Chen, D.-T. and Chan, W. (2010). The p value for the outlier sum in differential gene expression analysis. Biometrika 97, 246-253.

Chen, L.-A. and Chiang, Y. C. (1996). Symmetric type quantile and trimmed means for location and linear regression model. Journal of Nonparametric Statistics 7, 171-185.

Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall, New York.

De Jongh, P. J. and De Wet, T. (1985). Trimmed mean and bounded influence estimators for the parameters of the AR(1) process. Communications in Statistics - Theory and Methods 14, 1361-1375.

De Jongh, P. J., De Wet, T. and Welsh, A. H. (1988). Mallows-type bounded-influence-regression trimmed means. Journal of the American Statistical Association 83, 805-810.

Giltinan, D. M., Carroll, R. J. and Ruppert, D. (1986). Some new estimation methods for weighted regression when there are possible outliers. Technometrics 28, 219-230.

Koenker, R. and Bassett, G. J. (1978). Regression quantiles. Econometrica 46, 33-50.

Koenker, R. W. and Portnoy, S. (1987). L-estimation for linear model. Journal of the American Statistical Association 82, 851-857.

Krasker, W. S. (1985). Two stage bounded-influence estimators for simultaneous equations models. Journal of Business and Economic Statistics 4, 432-444.

Krasker, W. S. and Welsch, R. E. (1982). Efficient bounded influence regression estimation. Journal of the American Statistical Association 77, 595-604.

Lai, Y.-H., Chen, H.-C., Chen, L.-A. and Chen, D.-T. (2013). Statistical inferences based on outliers for gene expression analysis. Unpublished paper.

Ruppert, D. and Carroll, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association 75, 828-838.

Tibshirani, R. and Hastie, T. (2007). Outlier sums for differential gene expression analysis. Biostatistics 8, 2-8.

Tomlins, S. A., Rhodes, D. R., Perner, S., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer.
