Multivariate Reference Regions

Student: Yu-wei Chen
Advisor: Dr. Lin-an Chen

Institute of Statistics, National Chiao Tung University
Master's Thesis

A Thesis Submitted to the Institute of Statistics, College of Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Statistics

June 2010

Hsinchu, Taiwan, Republic of China

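Abstract

We introduce a very general concept of multivariate reference regions. Given several multivariate probability distributions, we provide population multivariate reference regions of different types. The estimation and testing methods of traditional statistical inference can therefore be applied to the prediction of this unknown region. Under the multivariate normal setting, we provide detailed methods of estimation and testing, together with simulation analysis and discussion.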


Summary

A general concept of population-type multivariate reference region is introduced. This provides flexible applications of multivariate reference regions. Given a population multivariate reference region, its estimation and hypothesis testing are important topics in statistical inference for this unknown region. We present several examples of population multivariate reference regions. For the multivariate normal case, techniques and criteria for estimation and hypothesis testing are presented and evaluated.

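Acknowledgements

Over these two years of my master's studies, I sincerely thank my advisor, Professor Lin-an Chen, for guidance and teaching that not only allowed me to complete my degree smoothly but also taught me a great deal about how to conduct myself and deal with others. While I was working on the thesis, the professor taught me how to find problems, how to think about them, and how to solve them. I believe that, having learned through this process, I will be able to handle whatever setbacks and problems I meet on the road ahead. I also thank the professors of the institute for their teaching during these two years; thanks to their careful and conscientious instruction, I learned many techniques of statistical analysis. As for life in graduate school, I am very grateful to my teammates in the Statistics Cup, 于憶, 士傑, 洋德, 為翔 and 瀚宇, to 祥福, with whom I studied computing, and to everyone in my class; without you my master's years would not have been so extraordinary and fulfilling. Finally, I thank my parents and my girlfriend, who supported me whenever I felt lost or ran into difficulty. Without your encouragement I would not be who I am today. Thank you.

Yu-wei Chen
Institute of Statistics, National Chiao Tung University
June of the 100th year of the Republic of China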

Contents

Abstract
Summary
Acknowledgements
1. Introduction
2. Multivariate Reference Regions for Independent Transformation Available Distribution
3. Reference Region Transformed from Multivariate Rectangles
4. Estimators and Statistical Properties for Reference Region
5. Hypothesis Testing of Reference Region


Key words: estimation; hypothesis testing; reference interval; multivariate reference region.

1. Introduction

The determination of intervals to provide reference limits is fundamentally important in clinical chemistry. The reference interval in laboratory chemistry refers to population-based reference values obtained from a well-defined group of reference individuals. It is an interval with two confidence limits that covers the measurement values in the population in some probabilistic sense. The reference interval tells the physician whether the patient's value is expected in a healthy or a diseased individual, or whether further testing is warranted. For a review of reference intervals, see Horn and Pesce (2003) and Huang, Chen and Welsh (2010).

Most medical decisions require consideration of several co-existing pieces of information, and because these pieces, such as blood constituents, are often correlated, multivariate reference regions are more useful than conventional univariate reference intervals for interpreting clinical laboratory results. There is the uncomfortable statistical fact that when many clinical tests are run on a blood sample from a healthy person, there is a high probability that at least one result will lie outside its reference interval. For example, with 20 independent tests, each with a 95% reference interval, that probability is $1 - 0.95^{20} \approx 0.64$. This indicates that a multidimensional point of correlated observations is likely to lie within the individual's multivariate reference region, even when one or more of the observations lie outside their separate reference intervals for the individual (see Schoen and Brooks (1970) and Harris et al. (1982)).

Although multivariate reference regions are very important in the practice of clinical chemistry and laboratory medicine, they have received only limited attention in the literature and in applications. The major reason is the lack of a natural ordering for multivariate data. For the same reason, the existing proposals for multivariate reference regions are more or less ad hoc, and most of them have no parametrized versions, so their applications are extremely limited (see Chen and Welsh (2002)). This leads to an unfortunate result. Laboratories that cannot perform their own detailed reference region (interval) studies may need to validate reference regions published elsewhere for their own populations. However, validation of a reference region or interval is generally done through statistical inference techniques such as confidence intervals or hypothesis testing, and these are not available when a multivariate reference region is not a sample realization of a population-type multivariate reference region. This paper aims to introduce general but systematic and concise techniques for constructing probabilistic population multivariate reference regions that allow us to establish statistical inferences, such as estimation and hypothesis testing, for this unknown region.

In Section 2, we introduce general concepts of the population multivariate reference region. Examples of this population region for the multivariate normal distribution and a beta-related multivariate distribution are introduced. In Section 3, a technique for multivariate reference regions that may be transformed from a multivariate rectangle is introduced and studied. In Section 4, we introduce two criteria for the estimation of an unknown multivariate reference region. Simulation results for the criteria of area and mean square error (MSE) are presented. In Section 5, we present a technique that can test a hypothesis on location parameters and scale parameters simultaneously, as a tool for validating a multivariate reference region for a laboratory's population.

2. Multivariate Reference Regions for Independent Transformation Available Distribution

Let $Y$ be a random vector of $p$ variables with joint probability density function (pdf) $f_\theta(y)$, where $\theta$ is a parameter vector in $\Theta$. We denote the sample space of the random vector $Y$ by $\Gamma_y$.

Definition 2.1. A $\theta$-dependent subset $C_y(\theta)$ of the space $\Gamma_y$ is called the $\gamma$ reference region if it satisfies

$$P_\theta(Y \in C_y(\theta)) = \gamma \quad \text{for } \theta \in \Theta.$$

The interest is in how to develop a reference region $C_y(\theta)$ for a distribution of $Y$. The difficulty in constructing a reference region for $Y$ is that the elements of $Y$ are generally correlated.

Definition 2.2. We say that the distribution of a random vector $Y$ is independence-transformable if there is an invertible function $Z = G(Y)$ such that the elements $Z_1, \dots, Z_p$ of $Z$ are independent.

Let us denote the sample space of the vector $Z$ by $\Gamma_z$. In the following example, we present two independence-transformable distributions.

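Example 1. (a) Suppose that $Y$ has the multivariate normal distribution $N_p(\mu, \Sigma)$, where $\Sigma$ is a positive definite matrix. We know that $Z = (\Sigma^{-1/2})'(Y - \mu)$ is a $p$-vector of i.i.d. random variables with the standard normal distribution $N(0, 1)$. Hence $Y$ is independence-transformable, where the sample spaces of $Y$ and $Z$ are, respectively, $\Gamma_y = \Gamma_z = R^p$.

(b) Suppose that the bivariate random vector $(Y_1, Y_2)'$ has the joint pdf

$$f_{Y_1 Y_2}(y_1, y_2) = \frac{\Gamma(\alpha + \beta + \lambda)}{\Gamma(\alpha)\Gamma(\beta)\Gamma(\lambda)}\, y_2^{\alpha - 1}(1 - y_2)^{\beta - 1}\left(\frac{y_1}{y_2}\right)^{\alpha + \beta - 1}\left(1 - \frac{y_1}{y_2}\right)^{\lambda - 1}\frac{1}{y_2}, \quad 0 < y_1 < y_2 < 1.$$

By letting $(Z_1, Z_2)' = (Y_2, Y_1/Y_2)'$, we see that $Z_1$ and $Z_2$ are independent random variables with distributions $\mathrm{beta}(\alpha, \beta)$ and $\mathrm{beta}(\alpha + \beta, \lambda)$, respectively. Hence $Y$ is independence-transformable, where the sample space of $Z$ is $\Gamma_z = (0, 1) \times (0, 1)$ and the sample space of $Y$ is $\Gamma_y = \{(y_1, y_2)' : 0 < y_1 < y_2 < 1\}$.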
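The transformation in part (a) can be illustrated numerically. The following is a minimal sketch for the bivariate case, assuming the Cholesky factor $L$ (with $\Sigma = LL'$) is used in place of the symmetric root $\Sigma^{1/2}$; either choice gives independent standard normal elements. The function names and parameter values are illustrative assumptions and do not come from the text.

```python
import math
import random

def cholesky_2x2(sigma):
    """Return (l11, l21, l22) of the lower-triangular L with L L' = sigma."""
    (s11, s12), (_, s22) = sigma
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return l11, l21, l22

def to_independent(y, mu, sigma):
    """z = L^{-1} (y - mu), computed by forward substitution."""
    l11, l21, l22 = cholesky_2x2(sigma)
    z1 = (y[0] - mu[0]) / l11
    z2 = (y[1] - mu[1] - l21 * z1) / l22
    return z1, z2

def from_independent(z, mu, sigma):
    """Inverse map y = mu + L z."""
    l11, l21, l22 = cholesky_2x2(sigma)
    return mu[0] + l11 * z[0], mu[1] + l21 * z[0] + l22 * z[1]

def sample_correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    return s12 / math.sqrt(s11 * s22)

# Illustrative parameters (assumed, not taken from the source).
mu = (1.0, -2.0)
sigma = ((1.0, 0.5), (0.5, 1.0))

rng = random.Random(0)
ys = [from_independent((rng.gauss(0, 1), rng.gauss(0, 1)), mu, sigma)
      for _ in range(20000)]
zs = [to_independent(y, mu, sigma) for y in ys]

corr_y = sample_correlation(ys)  # correlated elements of Y
corr_z = sample_correlation(zs)  # near zero after the transformation
```

The elements of the simulated $Y$ are correlated, while the transformed elements are approximately uncorrelated, which for the normal case is what independence-transformability requires.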

We now consider that, for a $p$-vector $Y$, we have a transformed vector $Z = G(Y)$ that includes independent and parameter-free elements $Z_1, \dots, Z_p$. Then the reference region $C_y(\theta)$ may be constructed from a $Z$-based reference region through inversion.

Definition 2.3. Suppose that there is a $Z$-based $\gamma$ reference region $C_z$, a subset of $\Gamma_z$. We define the $\gamma$ reference region for the distribution of $Y$ as

$$C_y(\theta) = G^{-1}(C_z) = \{y \in \Gamma_y : G(y) \in C_z\} \tag{2.1}$$

where $G^{-1}$ is the inverse function of $G$.

Two approaches are available for the construction of a $Z$-based reference region $C_z$. First, in some situations, we can introduce $C_z$ through a univariate mapping on $Z$. Second, since $Z$ has independent elements, we may construct $C_z$ as a product of element-wise reference intervals. We introduce the first approach here; the second is the subject of Section 3.

Definition 2.4. If there is a univariate mapping $Q_z = q(Z)$ whose distribution is such that a $\gamma$ coverage interval of $Q_z$, denoted by $C_q$, is available, then we have

$$C_z = \{z : q(z) \in C_q\}.$$

In the following example, we present two methods of constructing the univariate mapping when $Y$ has a multivariate normal distribution.

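Example 2. Again, let $Y$ have the multivariate normal distribution $N_p(\mu, \Sigma)$, and let the independence transformation be $Z = (\Sigma^{-1/2})'(Y - \mu)$. Let the elements of the vector $Z$ be $Z_1, \dots, Z_p$.

(a) We then have one univariate mapping, $(Y - \mu)'\Sigma^{-1}(Y - \mu) = \sum_{i=1}^{p} Z_i^2$, which has the chi-square distribution $\chi^2(p)$. One popular way to construct a $\gamma$ reference region for $Z$ is based on this chi-square transform:

$$C_z = \{z \in R^p : z'z \le \chi^2_\gamma\}$$

where $\chi^2_\gamma$ is the $\gamma$ quantile point of the chi-square distribution $\chi^2(p)$, and we use the coverage interval $C_q = (0, \chi^2_\gamma]$ for the chi-square variable $Q = \sum_{i=1}^{p} Z_i^2$. Through inversion, we have

$$C_y(\theta) = \{y = \mu + \Sigma^{1/2}z : z'z \le \chi^2_\gamma\} = \{y : (y - \mu)'\Sigma^{-1}(y - \mu) \le \chi^2_\gamma\}. \tag{2.2}$$

(b) We can consider the univariate mapping $\frac{1}{\sqrt{p}} 1_p' Z = \frac{1}{\sqrt{p}} 1_p' (\Sigma^{-1/2})'(Y - \mu) \sim N(0, 1)$. Since $\left(-\Phi^{-1}\left(\frac{1+\gamma}{2}\right), \Phi^{-1}\left(\frac{1+\gamma}{2}\right)\right)$ covers the standard normal random variable with probability $\gamma$, an alternative $\gamma$ reference region for $Z$ is

$$C_z = \left\{z \in R^p : \frac{1}{\sqrt{p}} 1_p' z \in \left(-\Phi^{-1}\left(\frac{1+\gamma}{2}\right), \Phi^{-1}\left(\frac{1+\gamma}{2}\right)\right)\right\}. \tag{2.3}$$

Through inversion, we have

$$C_y(\theta) = \left\{y = \mu + \Sigma^{1/2}z : \frac{1}{\sqrt{p}} 1_p' z \in \left(-\Phi^{-1}\left(\frac{1+\gamma}{2}\right), \Phi^{-1}\left(\frac{1+\gamma}{2}\right)\right)\right\}.$$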
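Both constructions of Example 2 can be checked by simulation. The sketch below is restricted to $p = 2$, where the $\gamma$ quantile of $\chi^2(2)$ has the closed form $-2\log(1-\gamma)$, so only the Python standard library is needed. The names `in_region_22` and `in_region_23`, the parameter values, and the use of a Cholesky factor to generate the data are assumptions made for illustration; with a Cholesky factor the strip (2.3) is expressed in a differently rotated $z$, but its coverage is unchanged.

```python
import math
import random
from statistics import NormalDist

def mahalanobis_sq(y, mu, sigma):
    """(y - mu)' Sigma^{-1} (y - mu) using the explicit 2x2 inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    d1, d2 = y[0] - mu[0], y[1] - mu[1]
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det

def in_region_22(y, mu, sigma, gamma):
    """Membership in the elliptical region (2.2) for p = 2."""
    # chi-square(2) gamma-quantile in closed form
    return mahalanobis_sq(y, mu, sigma) <= -2.0 * math.log(1.0 - gamma)

def in_region_23(z, gamma):
    """Membership of the transformed point z in the strip (2.3) for p = 2."""
    half_width = NormalDist().inv_cdf((1.0 + gamma) / 2.0)
    return abs((z[0] + z[1]) / math.sqrt(2.0)) < half_width

def draw_pair(mu, sigma, rng):
    """Draw (z, y) with z standard normal and y = mu + L z, Sigma = L L'."""
    (a, b), (_, d) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    z = (rng.gauss(0, 1), rng.gauss(0, 1))
    y = (mu[0] + l11 * z[0], mu[1] + l21 * z[0] + l22 * z[1])
    return z, y

# Illustrative parameters (assumed).
mu, sigma, gamma = (0.0, 0.0), ((1.0, 0.3), (0.3, 1.0)), 0.9

rng = random.Random(1)
draws = [draw_pair(mu, sigma, rng) for _ in range(20000)]
coverage_22 = sum(in_region_22(y, mu, sigma, gamma) for _, y in draws) / len(draws)
coverage_23 = sum(in_region_23(z, gamma) for z, _ in draws) / len(draws)
```

Both empirical coverages should be close to the nominal $\gamma$, although the two regions have very different shapes: (2.2) is a bounded ellipse, whereas (2.3) is an unbounded strip.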

3. Reference Region Transformed from Multivariate Rectangles

In this section, we start by constructing coverage intervals for the independent transformed variables $Z_1, \dots, Z_p$ and then invert the product of these element-wise coverage intervals.

Definition 3.1. Let $C_1, \dots, C_p$ be, respectively, $\gamma^{1/p}$ coverage intervals for the independent variables $Z_1, \dots, Z_p$. With the product $C_z = C_1 \times C_2 \times \cdots \times C_p$, we may define the $\gamma$ reference region for the distribution of $Y$ as $C_y(\theta) = G^{-1}(C_z)$.
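The reason for giving each element-wise interval the level $\gamma^{1/p}$ can be written out in one line. The display below is a short worked step that uses only the invertibility of $G$ and the independence of $Z_1, \dots, Z_p$; the source does not state it explicitly.

```latex
P_\theta\{Y \in C_y(\theta)\}
  = P\{G(Y) \in C_z\}
  = \prod_{j=1}^{p} P\{Z_j \in C_j\}
  = \left(\gamma^{1/p}\right)^{p}
  = \gamma .
```

For instance, with $p = 2$ and $\gamma = 0.9$, each element-wise interval needs coverage $\sqrt{0.9} \approx 0.9487$.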

The following example gives the reference region obtained from the multivariate rectangle.

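Example 3. (a) We continue with the settings for the multivariate normal vector $Y$ and the transformation $Z$. Letting $\gamma^* = \gamma^{1/p}$, we choose the quantile $z_{(1+\gamma^*)/2}$. Let $C_z$ be the product of $p$ $\gamma^{1/p}$-coverage intervals, respectively, for $Z_1, \dots, Z_p$. Then the reference region for $Y$ is

$$C_y(\theta) = \{y : y = \mu + \Sigma^{1/2}z,\ z \in C_z\}.$$

Generally we choose the shortest element-wise coverage intervals, which lead to the product

$$C_z = \left\{(z_1, z_2, \dots, z_p)' : -\Phi^{-1}\left(\frac{1+\gamma^*}{2}\right) \le z_j \le \Phi^{-1}\left(\frac{1+\gamma^*}{2}\right),\ j = 1, \dots, p\right\},$$

while a general product-type reference region is

$$C_z = \left\{(z_1, z_2, \dots, z_p)' : z_{j1} \le z_j \le z_{j2},\ j = 1, \dots, p, \text{ with } P(z_{j1} \le Z_j \le z_{j2}) = \gamma^{1/p}\right\}.$$

(b) We next consider the beta distribution case. With $\gamma^* = \gamma^{1/2}$, a product of coverage intervals is

$$C_z = \left\{(z_1, z_2)' : z_1 \in \left(F_{Z_1}^{-1}\left(\frac{1-\gamma^*}{2}\right), F_{Z_1}^{-1}\left(\frac{1+\gamma^*}{2}\right)\right),\ z_2 \in \left(F_{Z_2}^{-1}\left(\frac{1-\gamma^*}{2}\right), F_{Z_2}^{-1}\left(\frac{1+\gamma^*}{2}\right)\right)\right\}.$$

The reference region for $Y$ then is

$$C_y(\theta) = \{(y_1, y_2)' = (z_1 z_2, z_1)' : (z_1, z_2)' \in C_z\}.$$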
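The rectangle-based region of part (a) can be sketched for the bivariate normal case as follows. This is an illustration under stated assumptions: $p = 2$, a Cholesky factor standing in for $\Sigma^{1/2}$ (the resulting parallelogram in $y$-space depends on the choice of root, but its coverage does not), and hypothetical function names and parameter values.

```python
import math
import random
from statistics import NormalDist

def rectangle_half_width(gamma, p):
    """Phi^{-1}((1 + gamma^{1/p}) / 2): half-width of each shortest interval."""
    gamma_star = gamma ** (1.0 / p)
    return NormalDist().inv_cdf((1.0 + gamma_star) / 2.0)

def cholesky_2x2(sigma):
    """Return (l11, l21, l22) of the lower-triangular L with L L' = sigma."""
    (a, b), (_, d) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    return l11, l21, math.sqrt(d - l21 * l21)

def in_rectangle_region(y, mu, sigma, gamma):
    """Check whether G(y) falls in the product C_z = [-c, c] x [-c, c]."""
    l11, l21, l22 = cholesky_2x2(sigma)
    z1 = (y[0] - mu[0]) / l11
    z2 = (y[1] - mu[1] - l21 * z1) / l22
    c = rectangle_half_width(gamma, 2)
    return abs(z1) <= c and abs(z2) <= c

# Illustrative parameters (assumed).
mu, sigma, gamma = (0.0, 0.0), ((1.0, 0.3), (0.3, 1.0)), 0.9

rng = random.Random(2)
l11, l21, l22 = cholesky_2x2(sigma)
n_draws = 20000
hits = 0
for _ in range(n_draws):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    y = (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)
    hits += in_rectangle_region(y, mu, sigma, gamma)
coverage = hits / n_draws  # should be near gamma
```

Each coordinate interval has coverage $\gamma^{1/2}$, so the joint coverage of the product is close to $\gamma$ in the simulation.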

4. Estimators and Statistical Properties for Reference Region

Now, suppose that we have a random sample $Y_1, \dots, Y_n$ from the distribution $f_\theta(y)$. We wish to introduce concepts and methods of statistical inference for the case in which the reference region is unknown. In our setting, the reference regions are unknown because the distribution parameters are unknown. Hence, statistical inference for an unknown reference region may be reduced to inference for unknown distribution parameters. We start with point estimation.

Definition 4.1. Let $T$ be a random region in $\Gamma_y$ constructed from the random sample $Y_1, \dots, Y_n$. We define its expectation as $E(T) = \dots$ and its probability limit as $\operatorname{Plim}(T) = \{\operatorname{Plim}(t_n) : t_n \in T\}$ if all $\operatorname{Plim}(t_n)$ exist, where $\operatorname{Plim}(t_n) = a$ if $t_n$ converges to $a$ in probability. We then say that an estimator $\hat{C}_y(\theta)$ is an unbiased estimator of $C_y(\theta)$ if $E(\hat{C}_y(\theta)) = C_y(\theta)$, and that it is consistent for $C_y(\theta)$ if $\operatorname{Plim}(\hat{C}_y(\theta)) = C_y(\theta)$.

An estimator of the reference region $C_y(\theta)$ may be obtained by replacing $\theta$ with $\hat{\theta}$ when an estimator $\hat{\theta}$ is available.

Definition 4.2. Let $\hat{\theta}$ be an estimator of the parameter $\theta$. We let the estimator of the reference region be $\hat{C}_y(\theta) = C_y(\hat{\theta})$. Then $\hat{C}_y(\theta)$ is a maximum likelihood estimator (mle) of $C_y(\theta)$ when $\hat{\theta}$ is the mle of $\theta$.

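Example 4. Now, suppose that the random sample $Y_1, \dots, Y_n$ is drawn from the normal distribution $N_p(\mu, \Sigma)$. We know that $\bar{Y}$ and $\hat{\Sigma} = S_y = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})(Y_i - \bar{Y})'$ are, respectively, the mle's of $\mu$ and $\Sigma$.

In a straightforward way, the estimators of the reference regions $C_y(\theta)$ of (2.2) and (2.3) are, respectively,

$$\hat{C}_y(\theta) = \{y = \bar{Y} + S_y^{1/2}z : z'z \le \chi^2_\gamma\}$$

and

$$\hat{C}_y(\theta) = \left\{y = \bar{Y} + S_y^{1/2}z : \frac{1}{\sqrt{p}} 1_p' z \in \left(-\Phi^{-1}\left(\frac{1+\gamma}{2}\right), \Phi^{-1}\left(\frac{1+\gamma}{2}\right)\right)\right\}.$$

These two estimated reference regions are consistent, respectively, for $C_y(\theta)$ of (2.2) and (2.3), since $\bar{Y}$ and $S_y$ are, respectively, consistent for $\mu$ and $\Sigma$.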
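The plug-in construction can be sketched for the elliptical region (2.2) with $p = 2$. The code computes the mle's $\bar{Y}$ and $S_y$ (divisor $n$) and tests whether a point falls in the estimated region, written in the equivalent form $(y - \bar{Y})'S_y^{-1}(y - \bar{Y}) \le \chi^2_\gamma$. The function names and the toy sample are assumptions made for illustration.

```python
import math

def normal_mle(sample):
    """Return (ybar, S) where S is the mle covariance (divisor n, not n - 1)."""
    n = len(sample)
    m1 = sum(y[0] for y in sample) / n
    m2 = sum(y[1] for y in sample) / n
    s11 = sum((y[0] - m1) ** 2 for y in sample) / n
    s22 = sum((y[1] - m2) ** 2 for y in sample) / n
    s12 = sum((y[0] - m1) * (y[1] - m2) for y in sample) / n
    return (m1, m2), ((s11, s12), (s12, s22))

def in_estimated_region(y, sample, gamma):
    """Plug-in version of (2.2) for p = 2."""
    (m1, m2), ((s11, s12), (_, s22)) = normal_mle(sample)
    det = s11 * s22 - s12 * s12
    d1, d2 = y[0] - m1, y[1] - m2
    q = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    # chi-square(2) gamma-quantile in closed form
    return q <= -2.0 * math.log(1.0 - gamma)

# Toy sample chosen so that the estimates can be verified by hand.
sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
ybar, s_y = normal_mle(sample)
```

For this toy sample the estimates are $\bar{Y} = (1, 1)'$ and $S_y = I_2$, so the estimated region is a disc centred at $(1, 1)'$.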

We here consider some other criteria for evaluating an estimator of the reference region. Let us denote the area of the true reference region by $A_C$. The simulation draws samples from the bivariate normal distribution $N_2(0_2, \Sigma)$ with

$$\Sigma = \begin{pmatrix} \sigma_y^2 & \sigma_{12} \\ \sigma_{21} & \sigma_y^2 \end{pmatrix}.$$

Let $A^i_{\hat{C}}$ represent the area of the estimate $\hat{C}_y(\theta)$ in the $i$th of $m$ replications. We define the average area as

$$A_{\hat{C}} = \frac{1}{m}\sum_{i=1}^{m} A^i_{\hat{C}}$$

and the square root of the mean square error (MSE) as

$$SMSE_{\hat{A}} = \left(\frac{1}{m}\sum_{i=1}^{m}\left(A^i_{\hat{C}} - A_C\right)^2\right)^{1/2}.$$

The simulated results for the average area $A_{\hat{C}}$ and the square root of the mean square error $SMSE_{\hat{A}}$, together with the true area $A_C$, are displayed in Table 1.

Table 1. Comparison of the areas of the estimated and true reference regions: $A_C$ and $A_{\hat{C}}$ ($SMSE_{\hat{A}}$)

$\sigma_y^2$   $\sigma_{12}$   $A_C$    n = 30         n = 50         n = 70         n = 100
1              0.3             10.32    9.834 (2.22)   10.00 (1.91)   10.07 (1.49)   10.10 (1.27)
1              0.5             9.372    8.949 (2.07)   9.042 (1.68)   9.137 (1.37)   9.249 (1.10)
1              -0.3            10.32    9.746 (2.23)   10.06 (1.75)   10.16 (1.52)   10.16 (1.30)
1              -0.5            9.372    8.935 (2.40)   9.026 (1.67)   9.189 (1.29)   9.202 (1.20)
0.3            0.09            3.09     2.910 (0.69)   2.986 (0.53)   3.013 (0.46)   3.059 (0.38)
0.3            0.15            2.811    2.690 (0.61)   2.730 (0.48)   2.750 (0.40)   2.773 (0.36)
0.3            -0.09           3.097    2.917 (0.68)   2.991 (0.54)   3.057 (0.44)   3.044 (0.37)
0.3            -0.15           2.811    2.649 (0.62)   2.712 (0.47)   2.755 (0.41)   2.755 (0.34)

Several comments may be drawn from the results in Table 1:

(a) In terms of area, the estimated region underestimates the true reference region in all the designed cases, although the differences are not large.

(b) As expected, the variation shown in the MSE is larger when the variance $\sigma_y^2$ is larger.
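The area criterion can be sketched in code. The source does not state which region or coverage level was simulated, so the sketch assumes the rectangle-based region of Example 3(a) with element-wise coverage $\gamma^* = 0.9$, whose image in $y$-space is a parallelogram of area $(2 z_{0.95})^2 \sqrt{\det \Sigma}$. Under that assumption the true area for $\sigma_y^2 = 1$, $\sigma_{12} = 0.3$ evaluates to about 10.32, in agreement with the corresponding $A_C$ entry of Table 1. The function names, the seed, and the reduced replication count are assumptions, and the simulated averages need not match the tabulated ones.

```python
import math
import random
from statistics import NormalDist

Z_STAR = NormalDist().inv_cdf(0.95)  # assumed: gamma* = 0.9 per coordinate

def region_area(sigma):
    """Area of {mu + Sigma^{1/2} z : |z_j| <= Z_STAR}, a parallelogram."""
    (a, b), (c, d) = sigma
    return (2.0 * Z_STAR) ** 2 * math.sqrt(a * d - b * c)

def draw_sample(n, sigma, rng):
    """Draw n observations from N_2(0, sigma) via a Cholesky factor."""
    (a, b), (_, d) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((l11 * z1, l21 * z1 + l22 * z2))
    return out

def mle_covariance(sample):
    """Covariance mle S_y with divisor n."""
    n = len(sample)
    m1 = sum(y[0] for y in sample) / n
    m2 = sum(y[1] for y in sample) / n
    s11 = sum((y[0] - m1) ** 2 for y in sample) / n
    s22 = sum((y[1] - m2) ** 2 for y in sample) / n
    s12 = sum((y[0] - m1) * (y[1] - m2) for y in sample) / n
    return ((s11, s12), (s12, s22))

def simulate(n, sigma, m, seed=0):
    """Return (true area, average estimated area, root MSE of the area)."""
    rng = random.Random(seed)
    true_area = region_area(sigma)
    areas = [region_area(mle_covariance(draw_sample(n, sigma, rng)))
             for _ in range(m)]
    avg = sum(areas) / m
    smse = math.sqrt(sum((x - true_area) ** 2 for x in areas) / m)
    return true_area, avg, smse

true_area, avg_area, smse = simulate(n=30, sigma=((1.0, 0.3), (0.3, 1.0)), m=2000)
```

In this sketch the average estimated area falls below the true area, which matches the direction of comment (a).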

Comparison of areas and MSEs between the estimated and the true region is not sufficient to evaluate the efficiency of an estimator of an unknown reference region. Further study is needed to see whether the estimated and true regions really overlap closely. For this purpose, we define the non-overlapping area between the two as

$$ANL = \text{Non-overlapping Area} = A_C + A_{\hat{C}} - 2A_{C_y(\theta) \cap \hat{C}_y(\theta)}$$

and the following MSE

$$MSE = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{\text{Non-overlapping Area}}{2(\text{Length}(C(\theta)) + \text{Width}(C(\theta)))}\right)^2 = \frac{1}{m}\sum_{j=1}^{m}\left(\frac{A_C + A_{\hat{C}} - 2A_{C_y(\theta) \cap \hat{C}_y(\theta)}}{2(\text{Length}(C(\theta)) + \text{Width}(C(\theta)))}\right)^2$$

where we choose this denominator term to make this MSE dimension-free. The simulated results of ANL and MSE are displayed in Table 2.
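The quantity $A_C + A_{\hat{C}} - 2A_{C_y(\theta) \cap \hat{C}_y(\theta)}$ is the area of the symmetric difference of the two regions, so it can be approximated without computing the intersection explicitly. The sketch below counts the cells of a midpoint grid that lie in exactly one region. The grid method, the function name `non_overlapping_area`, and the two unit squares used as a check are assumptions made for illustration; the source does not say how the areas were computed.

```python
def non_overlapping_area(in_a, in_b, box, steps=400):
    """Approximate the symmetric-difference area of two regions.

    in_a, in_b: membership functions taking a point (x, y).
    box: (x_lo, x_hi, y_lo, y_hi), a rectangle containing both regions.
    """
    x_lo, x_hi, y_lo, y_hi = box
    dx = (x_hi - x_lo) / steps
    dy = (y_hi - y_lo) / steps
    count = 0
    for i in range(steps):
        x = x_lo + (i + 0.5) * dx
        for j in range(steps):
            y = y_lo + (j + 0.5) * dy
            # a cell contributes when its midpoint is in exactly one region
            if in_a((x, y)) != in_b((x, y)):
                count += 1
    return count * dx * dy

# Check on two overlapping unit squares shifted by 0.5 along the x-axis:
# each has area 1 and the overlap has area 0.5, so the result should be 1.
def square_a(pt):
    return 0.0 <= pt[0] <= 1.0 and 0.0 <= pt[1] <= 1.0

def square_b(pt):
    return 0.5 <= pt[0] <= 1.5 and 0.0 <= pt[1] <= 1.0

anl = non_overlapping_area(square_a, square_b, box=(0.0, 1.5, 0.0, 1.0), steps=300)
```

In a simulation, `in_a` and `in_b` would be the membership functions of the true and estimated reference regions, and the bounding box would need to be wide enough to contain both.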

Table 2. Efficiencies of estimation of the reference region through area difference and MSE

$\sigma_y^2$   $\sigma_{12}$   Criterion   n = 30   n = 50   n = 70   n = 100
0.3            0.09            ANL         1.074    0.869    0.745    0.641
                               MSE         0.149    0.120    0.103    0.088
0.3            0.15            ANL         0.987    0.794    0.669    0.583
                               MSE         0.137    0.110    0.092    0.081
0.3            -0.09           ANL         1.065    0.868    0.746    0.649
                               MSE         0.147    0.120    0.103    0.090
0.3            -0.15           ANL         0.981    0.795    0.677    0.591
                               MSE         0.136    0.110    0.094    0.082
1              0.3             ANL         4.000    3.068    2.552    2.099
                               MSE         0.304    0.233    0.194    0.159
1              0.5             ANL         3.275    2.625    2.237    1.972
                               MSE         0.248    0.199    0.170    0.149
1              -0.3            ANL         3.622    2.885    2.480    2.109
                               MSE         0.275    0.219    0.188    0.160
1              -0.5            ANL         3.289    2.621    2.237    1.930
                               MSE         0.250    0.199    0.170    0.146

We have two comments on the simulated results:

(a) The non-overlapping area ANL decreases, and hence the efficiency of the point estimator increases, when the sample size $n$ rises or the variance $\sigma_y^2$ decreases.

(b) The dimension-free MSE shows that the estimation of the true reference region is satisfactory.

5. Testing for Hypothesis of Reference Region

The establishment of a reference region requires careful planning, control, and documentation of each aspect of the study. Thus, the resulting reference regions are well characterized in terms of the variation attributable to pre-analytical and analytical factors. With this in mind, establishing a laboratory's own reference region (interval) is difficult because of the costs and labor involved. Even large laboratories are finding it increasingly difficult to conduct these comprehensive studies cost-effectively. Therefore, laboratories are becoming more reliant on manufacturers to establish scientifically sound reference regions that can be verified using simpler, less labor-intensive, and lower-cost approaches. One important approach requiring less effort for the establishment of reference regions (intervals) is validation through hypothesis testing, to verify whether an established reference region suits the laboratory's specific population. This task can be done statistically only when the unknown reference region is a function of distributional parameters.

We require the $\gamma$ reference region $C_y(\theta)$ for the laboratory's population to depend on an unknown parameter $\theta$ and to fulfill

$$\gamma = P_\theta(Y \in C_y(\theta)) \quad \text{for } \theta \in \Theta.$$

When the reference region is unknown, we know only that it belongs to the space

$$\{C_y(\theta) : \theta \in \Theta\}. \tag{5.1}$$

We assume that $C_y(\theta_1) \ne C_y(\theta_2)$ if $\theta_1 \ne \theta_2$. Any set $D \subset \Gamma_y$ is a $\gamma$ reference region if there exists $\theta_0 \in \Theta$ such that $D = C_y(\theta_0)$. Hence, testing a hypothesis on the reference region such as

$$H_0 : C_y(\theta) = D$$

is equivalent to testing the hypothesis on the unknown parameter

$$H_0 : \theta = \theta_0. \tag{5.2}$$

Suppose that the random vector $Y$ has the normal distribution $N_p(\mu, \Sigma)$. Then testing a hypothesis on any type of reference region is equivalent to testing the following hypothesis on the distribution parameters:

$$H_0 : \mu = \mu_0 \text{ and } \Sigma = \Sigma_0. \tag{5.3}$$

In the literature, there are approaches for testing a hypothesis about the mean vector, $H_0 : \mu = \mu_0$, and approaches for testing a hypothesis about the covariance matrix, $H_0 : \Sigma = \Sigma_0$. Approaches that test the mean vector and the covariance matrix simultaneously are rare. The hypothesis about the reference region reduces to the hypothesis in (5.3), which requires a new test.

With the normality assumption, we have

$$(Y - \mu_0)'\Sigma_0^{-1}(Y - \mu_0) \sim \chi^2(p)$$

when $H_0$ is true. Suppose that we further have a random sample $Y_1, \dots, Y_n$ from $N_p(\mu, \Sigma)$. Then

$$Q = \sum_{i=1}^{n}(Y_i - \mu_0)'\Sigma_0^{-1}(Y_i - \mu_0) \sim \chi^2(np)$$

when $H_0$ is true. A rule for testing $H_0$ is

reject $H_0$ if $Q \ge \chi^2_\alpha(np)$,

where $\chi^2_\alpha(np)$ is the $(1 - \alpha)$th quantile of the chi-square distribution $\chi^2(np)$.
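The test rule can be sketched for $p = 2$ using only the standard library. Because $np = 2n$ is then even, the chi-square survival function has the closed form $P(\chi^2_{2k} > x) = e^{-x/2}\sum_{j=0}^{k-1}(x/2)^j/j!$, and rejecting when $Q \ge \chi^2_\alpha(np)$ is equivalent to rejecting when this tail probability is at most $\alpha$. The function names are hypothetical, and the p-value formulation is an implementation convenience assumed here rather than the source's stated procedure.

```python
import math

def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df, from the closed-form Poisson sum.

    Adequate for moderate x; exp(-x/2) underflows to 0 for very large x.
    """
    if df % 2 != 0:
        raise ValueError("the closed form used here requires even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return min(total, 1.0)

def q_statistic(sample, mu0, sigma0):
    """Q = sum_i (Y_i - mu0)' Sigma0^{-1} (Y_i - mu0) for bivariate data."""
    (a, b), (c, d) = sigma0
    det = a * d - b * c
    q = 0.0
    for y1, y2 in sample:
        d1, d2 = y1 - mu0[0], y2 - mu0[1]
        q += (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det
    return q

def reject_h0(sample, mu0, sigma0, alpha):
    """Reject when Q >= chi2_alpha(np), i.e. when the tail probability <= alpha."""
    p_value = chi2_sf_even(q_statistic(sample, mu0, sigma0), 2 * len(sample))
    return p_value <= alpha

# Example with made-up data far from the hypothesized mean.
mu0 = (0.0, 0.0)
sigma0 = ((1.0, 0.3), (0.3, 1.0))
far_sample = [(3.0, 3.0), (2.5, 3.5), (3.2, 2.9), (2.8, 3.1)]
decision = reject_h0(far_sample, mu0, sigma0, alpha=0.05)
```

A laboratory would pass its own measurements as `sample` and the published parameters as `mu0` and `sigma0`; a rejection suggests that the published reference region does not suit its population.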

We consider the hypothesis (5.3) by generating data from the distribution

$$N_2\left(\mu_0 + r\begin{pmatrix} 1 \\ 1 \end{pmatrix},\ \Sigma_0 + \delta I_2\right)$$

where $(r, \delta) = (0, 0)$ corresponds to the distribution under $H_0$. With $m = 10000$ replications, we perform a simulation to verify the power performance of this chi-square test, defined as

$$\frac{1}{m}\sum_{j=1}^{m} I\left(Q_j \ge \chi^2_\alpha(np)\right)$$

where $Q_j$ is the observed value of the statistic $Q$ from the $j$th sample. Setting $\mu_0 = (0, 0)'$ and $\Sigma_0 = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 1 \end{pmatrix}$, we display the simulated results in Tables 3 and 4, respectively, for significance levels $\alpha = 0.05$ and $0.1$.

Table 3. Power performance ($\alpha = 0.05$)

$(r, \delta)$    n = 30    n = 50    n = 70    n = 100
(0, 0)           0.0466    0.0517    0.0500    0.0503
(0.5, 0)         0.2707    0.3640    0.4549    0.5645
(1, 0)           0.9461    0.9980    1         1
(0, 0.5)         0.7888    0.9287    0.9756    0.9971
(0, 1)           0.9886    0.9996    1         1
(0.25, 0.25)     0.4860    0.6471    0.7684    0.8774
(0.5, 0.5)       0.9253    0.9864    0.9982    1

Table 4. Power performance ($\alpha = 0.1$)

$(r, \delta)$    n = 30    n = 50    n = 70    n = 100
(0, 0)           0.1016    0.0940    0.0962    0.1017
(0.5, 0)         0.3875    0.5041    0.5959    0.6890
(1, 0)           0.9744    0.9980    1         1
(0, 0.5)         0.8706    0.9594    0.9889    0.9986
(0, 1)           0.9948    1         1         1
(0.25, 0.25)     0.6149    0.7668    0.8581    0.9305
(0.5, 0.5)       0.9613    0.9955    0.9999    1

We have several comments drawn from the simulated power results in Tables 3 and 4:

(a) The results for $(r, \delta) = (0, 0)$ are all close to the nominal values of $\alpha$, confirming that this is a level-$\alpha$ test.

(b) A larger sample size does raise the power.

(c) The power in response to a shift in scale is stronger than that in response to a shift in location.
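The power study can be sketched as a small simulation with $\mu_0 = (0, 0)'$ and $\Sigma_0$ as specified in the text. The sketch uses far fewer replications than the source's $m = 10000$ so that it runs quickly, and it reads the pair of shift parameters as a location shift $r$ and a scale shift $\delta$; that reading, the function names, and the seed are assumptions. The estimates are therefore only expected to be roughly comparable with the tabulated ones.

```python
import math
import random

def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df (closed-form Poisson sum)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return min(total, 1.0)

def empirical_power(r, delta, n, alpha, m, seed=0):
    """Rejection rate of the Q test under N_2(mu0 + r*1, Sigma0 + delta*I)."""
    rng = random.Random(seed)
    # Hypothesized parameters: mu0 = (0, 0)', Sigma0 = [[1, 0.3], [0.3, 1]].
    a0, b0, d0 = 1.0, 0.3, 1.0
    det0 = a0 * d0 - b0 * b0
    # Cholesky factor of the data-generating covariance Sigma0 + delta * I.
    l11 = math.sqrt(a0 + delta)
    l21 = b0 / l11
    l22 = math.sqrt(d0 + delta - l21 * l21)
    rejections = 0
    for _ in range(m):
        q = 0.0
        for _ in range(n):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            y1 = r + l11 * z1              # deviation from mu0 = 0
            y2 = r + l21 * z1 + l22 * z2
            q += (d0 * y1 * y1 - 2.0 * b0 * y1 * y2 + a0 * y2 * y2) / det0
        if chi2_sf_even(q, 2 * n) <= alpha:
            rejections += 1
    return rejections / m

size = empirical_power(0.0, 0.0, n=30, alpha=0.05, m=2000)
power_location = empirical_power(0.5, 0.0, n=30, alpha=0.05, m=2000)
power_scale = empirical_power(0.0, 0.5, n=30, alpha=0.05, m=2000)
```

With these settings the estimated size should sit near 0.05, and the power against the scale shift should exceed the power against the location shift, in line with comments (a) and (c).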

References

Chen, L.-A. and Welsh, A. H. (2002). Distribution-function-based bivariate quantiles. Journal of Multivariate Analysis, 83, 208-231.

Harris, E. K., Yasaka, T., Horton, M. R. and Shakarji, G. (1982). Comparing multivariate and univariate subject-specific reference regions for blood constituents in healthy persons. Clinical Chemistry, 28, 422-426.

Horn, P. S. and Pesce, A. J. (2003). Reference intervals: an update. Clinica Chimica Acta, 334, 5-23.

Huang, J.-Y., Chen, L.-A. and Welsh, A. H. (2010). A note on reference limits. IMS Collections, Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in Honor of Professor Jana Jurečková, 7, 84-94.

Schoen, I. and Brooks, S. (1970). Judgement based on 95% confidence limits. American Journal of Clinical Pathology, 53, 190-193.
