Multiple deletion diagnostics in beta regression models


DOI 10.1007/s00180-012-0370-9

ORIGINAL PAPER


Li-Chu Chien

Received: 9 October 2011 / Accepted: 8 September 2012 / Published online: 23 October 2012 © Springer-Verlag Berlin Heidelberg 2012

Abstract We consider the problem of identifying multiple outliers in a general class of beta regression models proposed by Ferrari and Cribari-Neto (J Appl Stat 31:799–815, 2004). The currently available single-case deletion diagnostic measures, e.g., the standardized weighted residual (SWR), the Cook-like distance (LD), etc., often fail to identify multiple outlying observations, because they suffer from the well-known problems of masking and swamping effects. In this article, we develop group deletion diagnostic measures, such as generalized SWR, generalized LD, generalized DFFITS and generalized DFBETAS, and suggest a simple procedure for identifying multiple outliers using these. The performance of the proposed methods is investigated through simulation studies and two practical examples.

Keywords Beta regression · Multiple outliers · Generalized SWR · Generalized LD · Generalized DFFITS · Generalized DFBETAS

1 Introduction

Many fields of study involve data in the form of percentages, rates or proportions that are measured continuously in the open interval (0, 1). For example, one may be interested in modeling the proportion of income spent on food as a function of the level of income and the number of persons in the household. The beta distribution is a flexible and useful tool for modeling data on the standard unit interval (0, 1), since the beta density can display quite different shapes depending on the values of the parameters that index the distribution; see, for example, Kieschnick and McCullough (2003).

L.-C. Chien
Institute of Statistics, National Chiao Tung University, Hsinchu, Taiwan
e-mail: [email protected]


Ferrari and Cribari-Neto (2004) proposed a class of beta regression models that is derived from generalized linear models (GLMs), except that the response variable does not follow a linear exponential family distribution. Under generalized beta linear models (GBLMs), they developed maximum likelihood inference, including point and interval estimation and hypothesis tests, and thereby provided complete inference tools for the new class of models. The tools are freely available in the R package betareg; see Cribari-Neto and Zeileis (2010) for details. Hence, beta regression techniques are convenient to apply in practical problems.

Identifying observations that may affect the results of a regression analysis is a fundamental step in regression model building processes. In general, observations that lie well outside the majority of the data are termed outliers, in the sense that outliers come from a different probability distribution or from a different deterministic model than the mass of the data. The existence of outliers always distorts the outcome and accuracy of regression results, and hence, outliers must be detected in the regression analysis processes.

In GBLMs, to identify the outlying observations that depart from the postulated model of the bulk of the data, some authors have provided guidelines for diagnostic analysis. For example, Ferrari and Cribari-Neto (2004) provided some diagnostic measures to identify atypical observations and to detect model misspecification. Espinheira et al. (2008a) proposed two new beta residuals and numerically compared their behavior to those originally suggested by Ferrari and Cribari-Neto (2004). The results indicate a preference for one of the new residuals, more specifically the residual that accounts for the different leverages of the observations. On the other hand, Espinheira et al. (2008b) developed the Cook-like distance (LD) to measure the effects of influential observations on regression parameter estimates of GBLMs.

These currently available outlier measures and influence diagnostics seem to be effective only when the data contain a single outlier. If the data contain more than one outlying observation, these existing methods may become ineffective, due to the problems of masking and swamping effects. Hence, multiple outlier detection methods that are free from these problems are proposed in this article.

This article unfolds as follows. Section 2 contains a concise review of the GBLMs proposed by Ferrari and Cribari-Neto (2004). In the next section, we briefly introduce some of the current diagnostic tools, e.g., the standardized weighted residual (SWR), LD, etc. We also define the GBLM versions of the influence measures DFFITS and DFBETAS in this section. In the section after, we introduce SWR, LD, DFFITS and DFBETAS based on group deletion techniques and suggest an easy procedure for detecting multiple outliers using these group deletion diagnostics. Sections 5 and 6 illustrate applications of these newly proposed deletion diagnostic methods in simulated and real data examples, respectively. Finally, in Sect. 7, some conclusions about these proposed diagnostic measures are set out.

2 Beta regression model

The probability density function for a single response variable Y in a GBLM (Ferrari and Cribari-Neto 2004) is

$$ f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \quad 0 < y < 1, \tag{1} $$

where 0 < μ < 1, φ > 0 and Γ(·) is the gamma function. The mean and variance of Y are E(Y) = μ and Var(Y) = V(μ)/(1 + φ), respectively, where V(μ) = μ(1 − μ) is a variance function. Note that φ can be viewed as a precision parameter, in the sense that, for a fixed mean μ, the variance of Y decreases as φ increases. Here our concern is with the mean parameter μ, and φ may be viewed as a nuisance parameter.
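As a quick numerical check of density (1), the following Python sketch evaluates it on the log scale using only the standard library. The function names `beta_density` and `beta_variance` are illustrative assumptions and are not part of the betareg package.

```python
import math


def beta_density(y: float, mu: float, phi: float) -> float:
    """Beta density (1) in the mean/precision parameterization."""
    if not (0.0 < y < 1.0 and 0.0 < mu < 1.0 and phi > 0.0):
        raise ValueError("need 0 < y < 1, 0 < mu < 1 and phi > 0")
    a, b = mu * phi, (1.0 - mu) * phi  # usual beta shape parameters
    # Work on the log scale so that large phi does not overflow the gamma function.
    log_f = (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
             + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y))
    return math.exp(log_f)


def beta_variance(mu: float, phi: float) -> float:
    """Var(Y) = V(mu) / (1 + phi) with V(mu) = mu (1 - mu)."""
    return mu * (1.0 - mu) / (1.0 + phi)
```

For a fixed mean, `beta_variance(0.3, 30.0)` is smaller than `beta_variance(0.3, 5.0)`, which is the sense in which φ acts as a precision parameter.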

Now consider observations $y_1, \ldots, y_n$, which are regarded as realizations of independent random variables $Y_1, \ldots, Y_n$, where each $y_i$, $i = 1, \ldots, n$, follows the density (1) with mean $\mu_i$ and unknown precision $\phi$. The mean of $Y_i$ involves explanatory variables through a link function $g(\cdot)$, so that $g(\mu_i) = \eta_i$, where $\eta_i = x_i^T\beta = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}$. Here $\eta_i$ is a linear predictor, $x_i = (x_{i,1}, \ldots, x_{i,p})^T$ is a p-vector of explanatory variables, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a p-vector of unknown regression parameters, and $g(\cdot)$ is a strictly monotonic and twice differentiable link function that maps the interval (0, 1) to $\mathbb{R}$. The parameters $\beta$ and $\phi$ can be estimated by maximum likelihood methods that can easily be implemented through the R package betareg. For details, see Cribari-Neto and Zeileis (2010).

3 Measures of influence based on single-case deletion diagnostics

In this section, we succinctly review the currently available single-case deletion diagnostic methods. We also define the GBLM versions of the influence measures DFFITS and DFBETAS according to the classical versions of DFFITS and DFBETAS established under the normal regression settings and discussed in Subsection 2.1 of Belsley et al. (1980).

3.1 Assessing the influence of case deletion on residuals

This subsection focuses on the single-case outlier identification tools, which use residuals to detect atypical observations and check model adequacy. We review the weighted residual (WR) and SWR proposed by Espinheira et al. (2008a).

WR and SWR. Espinheira et al. (2008a) defined WR for point i as

$$ r_i^* = \frac{y_i^* - \hat\mu_i^*}{\sqrt{\hat\phi\,\hat v_i}} \tag{2} $$

where $y_i^* = \log(y_i/(1 - y_i))$, $\mu_i^* = \psi(\mu_i\phi) - \psi((1 - \mu_i)\phi)$, $v_i = \psi'(\mu_i\phi) + \psi'((1 - \mu_i)\phi)$, and $\psi(\cdot)$ and $\psi'(\cdot)$ are the digamma and trigamma functions, respectively. Here the symbol "ˆ" is used for quantities that are evaluated at parameter values of the maximum likelihood solution based on all observations. Then the standardized version, SWR, for point i is defined by

$$ r_i^{ww} = \frac{r_i^*}{\sqrt{(1 - h_i)/\hat\phi}} = \frac{y_i^* - \hat\mu_i^*}{\sqrt{\hat v_i (1 - h_i)}} \tag{3} $$

where $h_i$ is the ith diagonal element of $H = \hat W^{1/2} X (X^T \hat W X)^{-1} X^T \hat W^{1/2}$ and $\hat W^{1/2}$ is a symmetric square root of $\hat W$. Here $X = (x_1, \ldots, x_n)^T$ and $W = \phi G V G$ with diagonal matrices $G = \mathrm{Diag}(1/g'(\mu_1), \ldots, 1/g'(\mu_n))$ and $V = \mathrm{Diag}(v_1, \ldots, v_n)$.

An excessively large or small value of WR_i or SWR_i indicates that point i has an unusual residual.
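Residuals (2) and (3) can be sketched in Python as follows, assuming that the fitted mean, the precision estimate and the leverage h_i are already available from a fit (for instance from betareg). The helper names are hypothetical, and the digamma and trigamma functions are approximated here by the standard recurrence plus an asymptotic series rather than taken from a library.

```python
import math


def digamma(x: float) -> float:
    """psi(x) for x > 0: shift x upward, then use the asymptotic expansion."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def trigamma(x: float) -> float:
    """psi'(x) for x > 0: shift x upward, then use the asymptotic expansion."""
    result = 0.0
    while x < 10.0:
        result += 1.0 / (x * x)    # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0
                            - inv2 * (1.0 / 42.0 - inv2 / 30.0))))


def weighted_residuals(y: float, mu_hat: float, phi_hat: float, h: float):
    """Return (WR, SWR) of equations (2) and (3) for one observation."""
    y_star = math.log(y / (1.0 - y))
    mu_star = digamma(mu_hat * phi_hat) - digamma((1.0 - mu_hat) * phi_hat)
    v = trigamma(mu_hat * phi_hat) + trigamma((1.0 - mu_hat) * phi_hat)
    wr = (y_star - mu_star) / math.sqrt(phi_hat * v)
    swr = (y_star - mu_star) / math.sqrt(v * (1.0 - h))
    return wr, swr
```

Passing the leverage in as an argument keeps the sketch free of matrix algebra; in practice h_i would come from the hat matrix H defined above.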

3.2 Assessing the influence of case deletion on regression parameter estimates

This subsection is concerned with the single-case influence identification tools, which measure the influence of deleting one observation on the regression parameter estimates or on the fitted values. We review LD proposed by Espinheira et al. (2008b) and propose the GBLM versions of the influence measures DFFITS and DFBETAS.

LD. Espinheira et al. (2008b) defined LD for point i as

$$ \mathrm{LD}_i = (\hat\beta - \hat\beta^{(-i)})^T\, \hat\phi\, X^T \hat W X\, (\hat\beta - \hat\beta^{(-i)}) $$

where $\hat\beta^{(-i)}$ is the maximum likelihood estimator (MLE) of $\beta$ without the ith observation. From the approximate relation

$$ \hat\beta^{(-i)} \approx \hat\beta - \frac{(X^T \hat W X)^{-1} x_i \hat w_i^{1/2}}{1 - h_i}\, r_i^* \tag{4} $$

where $\hat w_i^{1/2}$ is the ith diagonal entry of $\hat W^{1/2}$, it follows that

$$ \mathrm{LD}_i \approx \left(\frac{r_i^*}{1 - h_i}\right)^2 \hat\phi\, h_i = \left(r_i^{ww}\right)^2 \frac{h_i}{1 - h_i}. \tag{5} $$

A large LD_i indicates that point i has a large impact on $\hat\beta$.

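DFFITS. We define DFFITS for point i as

$$ \mathrm{DFFITS}_i = \frac{\hat w_i^{1/2} x_i^T (\hat\beta - \hat\beta^{(-i)})}{\sqrt{h_i/\hat\phi^{(-i)}}} $$

which, in view of (4), is expressed by

$$ \mathrm{DFFITS}_i \approx \frac{r_i^*}{\sqrt{1 - h_i}} \sqrt{\frac{h_i\, \hat\phi^{(-i)}}{1 - h_i}} = r_i^{ww} \sqrt{\frac{\hat\phi^{(-i)}}{\hat\phi}} \sqrt{\frac{h_i}{1 - h_i}} \tag{6} $$

where $\hat\phi^{(-i)}$ is the MLE of $\phi$ with observation i omitted. Using (6), we obtain, from (5), that $\mathrm{LD}_i = (\mathrm{DFFITS}_i)^2\, \hat\phi/\hat\phi^{(-i)}$.

An extreme DFFITS_i implies that observation i is influential on the weighted fit, $\hat w_i^{1/2} x_i^T \hat\beta$.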

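DFBETAS. We define DFBETAS for point i for $\beta_j$ as

$$ \mathrm{DFBETAS}_{ij} = \frac{\hat\beta_j - \hat\beta_j^{(-i)}}{\sqrt{(X^T \hat W X)^{-1}_{jj}/\hat\phi^{(-i)}}} $$

where $(X^T \hat W X)^{-1}_{jj}$ is the jth diagonal element of $(X^T \hat W X)^{-1}$, the inverse of $X^T \hat W X$. Let $c_{ji}$ be the jth component of the p-vector $(X^T \hat W X)^{-1} x_i \hat w_i^{1/2}$. We then have

$$ \mathrm{DFBETAS}_{ij} \approx \frac{r_i^* \sqrt{\hat\phi^{(-i)}}}{1 - h_i}\, \frac{c_{ji}}{\sqrt{\sum_{k=1}^n c_{jk}^2}} = r_i^{ww} \sqrt{\frac{\hat\phi^{(-i)}}{\hat\phi (1 - h_i)}}\, \frac{c_{ji}}{\sqrt{\sum_{k=1}^n c_{jk}^2}}. \tag{7} $$

A large absolute value of DFBETAS_ij shows that observation i is influential on $\hat\beta_j$.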
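The approximations (5), (6) and (7) only need the SWR, the leverage and the two precision estimates, so they can be sketched without refitting the model. The following Python functions are a minimal illustration under that assumption; the names and argument order are hypothetical, and `c_j` stands for the list of values c_j1, ..., c_jn.

```python
import math
from typing import Sequence


def cook_like_distance(swr: float, h: float) -> float:
    """Approximation (5): LD_i = (r_i^ww)^2 * h_i / (1 - h_i)."""
    return swr ** 2 * h / (1.0 - h)


def dffits(swr: float, h: float, phi: float, phi_del: float) -> float:
    """Approximation (6); phi_del is the precision MLE without case i."""
    return swr * math.sqrt(phi_del / phi) * math.sqrt(h / (1.0 - h))


def dfbetas(swr: float, h: float, phi: float, phi_del: float,
            c_ji: float, c_j: Sequence[float]) -> float:
    """Approximation (7) for coefficient j and case i."""
    scale = math.sqrt(sum(c * c for c in c_j))  # sqrt of (X'WX)^{-1}_{jj}
    return swr * math.sqrt(phi_del / (phi * (1.0 - h))) * c_ji / scale
```

The relation LD_i = (DFFITS_i)^2 φ̂/φ̂^(−i) stated in the text can be confirmed numerically with these two functions.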

4 Measures of influence based on group deletion diagnostics

In this section, we introduce the group deletion diagnostic measures and suggest a diagnostic procedure for detecting multiple outlying observations using these. We assume that d observations among a set of n observations are unusual observations and are omitted before the fitting of the model. Let [R] index the set of the (n − d) observations that remain in the analysis after deleting the set of d unusual observations indexed by D.

Let $y^* = (y_1^*, \ldots, y_n^*)^T$, $\mu = (\mu_1, \ldots, \mu_n)^T$ and $\mu^* = (\mu_1^*, \ldots, \mu_n^*)^T$ be n × 1 vectors. Without loss of generality, we assume that the observations to be deleted are the last d components of $y^*$ and the last d rows of $X$, so that $y^{*T} = (y^{*T}_{[R]}, y^{*T}_D)$ and $X^T = (X^T_{[R]}, X^T_D)$. Let $\hat\beta_{[R]}$, $\hat\phi_{[R]}$, $\hat\mu_{[R]}$, $\hat\mu^*_{[R]}$, $\hat V_{[R]}$, $\hat G_{[R]}$ and $\hat W_{[R]}$ be the MLEs, respectively, of $\beta$, $\phi$, $\mu$, $\mu^*$, $V$, $G$ and $W$ without the d observations in the deletion set D. Let $\hat W^{-1}_{[R]}$ be the inverse of $\hat W_{[R]}$. Then write

$$ z_{[R]} = X \hat\beta_{[R]} + \hat W^{-1}_{[R]} \hat G_{[R]} \left(y^* - \hat\mu^*_{[R]}\right) \tag{8} $$

and partition $\hat W_{[R]}$ as

$$ \hat W_{[R]} = \hat\phi_{[R]} \hat G_{[R]} \hat V_{[R]} \hat G_{[R]} = \begin{bmatrix} \hat W_{[R][R]} & 0 \\ 0^T & \hat W_{D[R]} \end{bmatrix} $$

with diagonal matrices $\hat W_{[R][R]}$ and $\hat W_{D[R]}$ being related to observations from the [R] and D sets, respectively, in which 0 is a zero matrix of order (n − d) × d. When a group of observations D is omitted, we define the leverage measure for case i as

$$ h_{i[R]} = \hat w^{1/2}_{i[R]}\, x_i^T \left(X_{[R]}^T \hat W_{[R][R]} X_{[R]}\right)^{-1} x_i\, \hat w^{1/2}_{i[R]} $$

where $\hat w^{1/2}_{i[R]}$ is the ith diagonal element of $\hat W^{1/2}_{[R]}$, a symmetric square root of $\hat W_{[R]}$.

4.1 Identifying multiple outliers using group-deleted versions of WR and SWR

The generalized WR (GWR) and the generalized SWR (GSWR), based on group deletion techniques, are derived from the single-case outlier detection measures WR and SWR.

GWR. For point i from the [R] set, WR is defined, in view of (2), as

$$ r^*_{i[R]} = \frac{y_i^* - \hat\mu^*_{i[R]}}{\sqrt{\hat\phi_{[R]}\, \hat v_{i[R]}}} \tag{9} $$

where $\hat\mu^*_{i[R]} = \psi(\hat\mu_{i[R]} \hat\phi_{[R]}) - \psi((1 - \hat\mu_{i[R]}) \hat\phi_{[R]})$ and $\hat v_{i[R]}$ is the ith diagonal element of $\hat V_{[R]}$. Using (8), $r^*_{i[R]}$ is expressed in the alternative form

$$ r^*_{i[R]} = \hat w^{1/2}_{i[R]} \left(z_{i[R]} - x_i^T \hat\beta_{[R]}\right) \tag{10} $$

which is called the internal WR (IWR), because the point i is from the [R] set.

In view of (10), WR for point i from the D set is defined as $r^*_{i[R+i]} = \hat w^{1/2}_{i[R]} (z_{i[R]} - x_i^T \hat\beta_{[R+i]})$, where the subscript "[R + i]" is used for quantities that are evaluated at parameter values of the maximum likelihood solution based on the set [R + i], which consists of (n − d + 1) observations: the point i from the D set and the others from the [R] set. Using

$$ \hat\beta_{[R+i]} \approx \hat\beta_{[R]} + \frac{\left(X_{[R]}^T \hat W_{[R][R]} X_{[R]}\right)^{-1} x_i\, \hat w^{1/2}_{i[R]}}{1 + h_{i[R]}}\, r^*_{i[R]}, \tag{11} $$

$r^*_{i[R+i]}$ is equivalently expressed in the form

$$ r^*_{i[R+i]} = \frac{r^*_{i[R]}}{1 + h_{i[R]}} \tag{12} $$

which is called the external WR (EWR), because the point i is from the D set. Then GWR_i for any data point is defined by (9) for i in [R] and by (12) for i in D.

GSWR. For point i from the [R] set, the internal SWR (ISWR) is defined, using (3), as

$$ Ir^{ww}_{i[R]} = \frac{r^*_{i[R]}}{\sqrt{(1 - h_{i[R]})/\hat\phi_{[R]}}}. \tag{13} $$

On the other hand, for point i from the D set, the external SWR (ESWR) is defined, using (13), as $Er^{ww}_{i[R+i]} = r^*_{i[R+i]} / \sqrt{(1 - h_{i[R+i]})/\hat\phi_{[R+i]}}$. By writing

$$ h_{i[R+i]} = \hat w^{1/2}_{i[R]}\, x_i^T \left(X_{[R]}^T \hat W_{[R][R]} X_{[R]} + x_i \hat w_{i[R]} x_i^T\right)^{-1} x_i\, \hat w^{1/2}_{i[R]} = \frac{h_{i[R]}}{1 + h_{i[R]}} \tag{14} $$

and using (12), ESWR is seen to be equal to

$$ Er^{ww}_{i[R+i]} = \frac{r^*_{i[R]}}{\sqrt{(1 + h_{i[R]})/\hat\phi_{[R+i]}}}. \tag{15} $$

Then GSWR_i for any data point is defined by (13) for i in [R] and by (15) for i in D.
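The two branches of GSWR can be written as one small Python function. This is only a sketch under the assumption that the internal weighted residual, the leverage relative to [R] and the two precision estimates have already been computed; the function and argument names are hypothetical.

```python
import math


def leverage_after_adding(h_R: float) -> float:
    """Equation (14): leverage of a D-set point once it joins the fit."""
    return h_R / (1.0 + h_R)


def gswr(r_star_R: float, h_R: float, phi_R: float,
         phi_R_plus_i: float, in_R: bool) -> float:
    """GSWR_i: equation (13) for i in [R], equation (15) for i in D.

    r_star_R      weighted residual r*_{i[R]} from (9) or (10)
    h_R           leverage h_{i[R]} relative to the retained set
    phi_R         precision MLE based on [R]
    phi_R_plus_i  precision MLE based on [R + i] (used only when i is in D)
    """
    if in_R:
        return r_star_R / math.sqrt((1.0 - h_R) / phi_R)
    return r_star_R / math.sqrt((1.0 + h_R) / phi_R_plus_i)
```

For a point in D, the same value is obtained by first shrinking the residual with (12) and then standardizing with the updated leverage from (14), which is a convenient consistency check.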

In fact, group deletion residuals for both the [R] set and the D set, which are measured on a similar scale, have received a great deal of attention in the literature in recent years. For example, similar residuals were derived by Hadi and Simonoff (1993) for detecting multiple outliers in linear models and, later, a similar diagnostic technique was sketched by Atkinson (1994). In addition, this type of residual was introduced, under linear regression and logistic regression, by Imon (2005) and Imon and Hadi (2008).

4.2 Identifying multiple outliers using group-deleted versions of LD, DFFITS and DFBETAS

The generalized LD (GLD), the generalized DFFITS (GDFFITS) and the generalized DFBETAS (GDFBETAS), based on group deletion techniques, are derived from the single-case influence detection measures LD, DFFITS and DFBETAS.

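GLD. For point i from the [R] set, the internal LD is computed by

$$ \mathrm{ILD}_i = \left(\hat\beta_{[R]} - \hat\beta^{(-i)}_{[R]}\right)^T \hat\phi_{[R]}\, X_{[R]}^T \hat W_{[R][R]} X_{[R]} \left(\hat\beta_{[R]} - \hat\beta^{(-i)}_{[R]}\right) $$

whereas, for point i from the D set, the external LD is computed by

$$ \mathrm{ELD}_i = \left(\hat\beta_{[R+i]} - \hat\beta_{[R]}\right)^T \hat\phi_{[R+i]}\, X_{[R+i]}^T \hat W_{[R+i][R]} X_{[R+i]} \left(\hat\beta_{[R+i]} - \hat\beta_{[R]}\right) $$

where $\hat W_{[R+i][R]}$ is an (n − d + 1) × (n − d + 1) diagonal matrix, in which the first (n − d) diagonal elements relate to observations from the [R] set and the last diagonal element relates to the point i from the D set. GLD_i is obtained by combining ILD_i and ELD_i.

Through (11) and using $X_{[R+i]}^T \hat W_{[R+i][R]} X_{[R+i]} = X_{[R]}^T \hat W_{[R][R]} X_{[R]} + x_i \hat w_{i[R]} x_i^T$ and

$$ \hat\beta_{[R]} \approx \hat\beta^{(-i)}_{[R]} + \frac{\left(X_{[R]}^T \hat W_{[R][R]} X_{[R]}\right)^{-1} x_i\, \hat w^{1/2}_{i[R]}}{1 - h_{i[R]}}\, r^*_{i[R]}, \tag{16} $$

GLD_i is expressed in terms of $Ir^{ww}_{i[R]}$ and $h_{i[R]}$ as

$$ \mathrm{GLD}_i = \begin{cases} \left(Ir^{ww}_{i[R]}\right)^2 \dfrac{h_{i[R]}}{1 - h_{i[R]}} & \text{if } i \in [R] \\[2ex] \left(Ir^{ww}_{i[R]}\right)^2 \dfrac{(1 - h_{i[R]})\, h_{i[R]}}{1 + h_{i[R]}}\, \dfrac{\hat\phi_{[R+i]}}{\hat\phi_{[R]}} & \text{if } i \notin [R]. \end{cases} $$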

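GDFFITS. The internal DFFITS (IDFFITS), for point i from the [R] set, is defined by

$$ \mathrm{IDFFITS}_i = \frac{\hat w^{1/2}_{i[R]}\, x_i^T \left(\hat\beta_{[R]} - \hat\beta^{(-i)}_{[R]}\right)}{\sqrt{h_{i[R]}/\hat\phi^{(-i)}_{[R]}}} $$

whereas the external DFFITS (EDFFITS), for point i from the D set, is defined by

$$ \mathrm{EDFFITS}_i = \frac{\hat w^{1/2}_{i[R]}\, x_i^T \left(\hat\beta_{[R+i]} - \hat\beta_{[R]}\right)}{\sqrt{h_{i[R+i]}/\hat\phi_{[R]}}}. $$

Together these IDFFITS_i and EDFFITS_i give GDFFITS_i for point i in the entire data set. Using (16) in conjunction with (11) and (14), it turns out that GDFFITS_i is displayed in terms of $Ir^{ww}_{i[R]}$ and $h_{i[R]}$ as

$$ \mathrm{GDFFITS}_i = \begin{cases} Ir^{ww}_{i[R]}\, \dfrac{\sqrt{\hat\phi^{(-i)}_{[R]}\, h_{i[R]}}}{\sqrt{\hat\phi_{[R]} (1 - h_{i[R]})}} & \text{if } i \in [R] \\[2ex] Ir^{ww}_{i[R]}\, \dfrac{\sqrt{(1 - h_{i[R]})\, h_{i[R]}}}{\sqrt{1 + h_{i[R]}}} & \text{if } i \notin [R]. \end{cases} $$

After some algebraic manipulation, it is observed that

$$ \mathrm{GLD}_i = \begin{cases} (\mathrm{GDFFITS}_i)^2\, \dfrac{\hat\phi_{[R]}}{\hat\phi^{(-i)}_{[R]}} & \text{if } i \in [R] \\[2ex] (\mathrm{GDFFITS}_i)^2\, \dfrac{\hat\phi_{[R+i]}}{\hat\phi_{[R]}} & \text{if } i \notin [R]. \end{cases} $$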
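The closed forms for GLD_i and GDFFITS_i, and the relation between them, can be checked with a short Python sketch. It assumes the internal standardized residual Ir^ww, the leverage relative to [R] and the relevant precision estimates are given; all names are illustrative rather than taken from any package.

```python
import math


def gld(iswr: float, h_R: float, phi_R: float,
        phi_R_plus_i: float, in_R: bool) -> float:
    """GLD_i in terms of Ir^ww_{i[R]} and h_{i[R]}."""
    if in_R:
        return iswr ** 2 * h_R / (1.0 - h_R)
    return iswr ** 2 * (1.0 - h_R) * h_R / (1.0 + h_R) * phi_R_plus_i / phi_R


def gdffits(iswr: float, h_R: float, phi_R: float,
            phi_R_del: float, in_R: bool) -> float:
    """GDFFITS_i; phi_R_del is the precision MLE from [R] without case i
    and is used only when i belongs to [R]."""
    if in_R:
        return iswr * math.sqrt(phi_R_del * h_R / (phi_R * (1.0 - h_R)))
    return iswr * math.sqrt((1.0 - h_R) * h_R / (1.0 + h_R))
```

With these definitions, `gld` equals `gdffits ** 2` times φ̂_[R]/φ̂_[R]^(−i) for a retained point and times φ̂_[R+i]/φ̂_[R] for a deleted point, matching the last display above.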

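GDFBETAS. For point i from the [R] set, we define the internal DFBETAS (IDFBETAS), for $\beta_j$, as

$$ \mathrm{IDFBETAS}_{ij} = \frac{\hat\beta_{j[R]} - \hat\beta^{(-i)}_{j[R]}}{\sqrt{\left(X_{[R]}^T \hat W_{[R][R]} X_{[R]}\right)^{-1}_{jj} / \hat\phi^{(-i)}_{[R]}}}. $$

For point i from the D set, we define the external DFBETAS (EDFBETAS), for $\beta_j$, as

$$ \mathrm{EDFBETAS}_{ij} = \frac{\hat\beta_{j[R+i]} - \hat\beta_{j[R]}}{\sqrt{\left(X_{[R+i]}^T \hat W_{[R+i][R]} X_{[R+i]}\right)^{-1}_{jj} / \hat\phi_{[R]}}}. $$

Collecting these IDFBETAS_ij and EDFBETAS_ij, for any point i in the data set, GDFBETAS, for $\beta_j$, is exhibited in terms of $Ir^{ww}_{i[R]}$ and $h_{i[R]}$ as

$$ \mathrm{GDFBETAS}_{ij} = \begin{cases} Ir^{ww}_{i[R]}\, \sqrt{\dfrac{\hat\phi^{(-i)}_{[R]}}{\hat\phi_{[R]} (1 - h_{i[R]})}}\; \dfrac{c_{ji[R]}}{\sqrt{\sum_{k \in R} c^2_{jk[R]}}} & \text{if } i \in [R] \\[3ex] Ir^{ww}_{i[R]}\, \dfrac{\sqrt{1 - h_{i[R]}}}{1 + h_{i[R]}}\; \dfrac{c_{ji[R]}}{\sqrt{\sum_{k \in R} c^2_{jk[R]} - c^2_{ji[R]}/(1 + h_{i[R]})}} & \text{if } i \notin [R] \end{cases} $$

where $c_{ji[R]}$ is the jth component of the p-vector $(X_{[R]}^T \hat W_{[R][R]} X_{[R]})^{-1} x_i\, \hat w^{1/2}_{i[R]}$.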

Similar diagnostic measures were derived by Imon (2005), who proposed the group deletion versions of LD and DFFITS for the identification of multiple influential observations in linear regression. Nurunnabi et al. (2010) developed the generalized DFFITS in logistic regression for the same purpose.

4.3 The diagnostic algorithm

In this subsection, we introduce a test procedure for the detection of multiple outliers through each of the group deletion diagnostic measures GSWR_i, GLD_i, GDFFITS_i and GDFBETAS_ij. The main idea of the proposed method is to first form a basic subset of one fourth of the data which is possibly free from potential outliers and then employ the group deletion diagnostics, based on the basic subset, in identifying outlying observations. The detailed diagnostic algorithm is illustrated with GSWR_i as follows.

Step 0 Fit the regression model to the full data and compute $r_i^{ww}$ for i = 1, . . . , n.

Step 1 Arrange the n observations in ascending order according to the absolute values $|r_i^{ww}|$, i = 1, . . . , n. Then the first {n/4} observations form the original [R] set, where {n/4} is the integer part of n/4 and represents the initial size of the [R] set. If the initial subset is not of full rank, increase the initial subset by as many observations as needed for the initial subset to become full rank (the observations are added according to their ranked order).

Step 2 Fit a regression model to the current [R] set and compute GSWR_i.

Step 3 Arrange the observations in increasing order of |GSWR_i| and let GSWR_(s+1) be the (s + 1)th order statistic of |GSWR_i|, where s is the current size of the [R] set.

(a) If GSWR_(s+1) > 2c(s) with $c(s) = (\hat\phi_{[R]}/\hat\phi)\, \sum_{i=1}^n |\mathrm{GSWR}_i| \big/ \sum_{i=1}^n |\mathrm{SWR}_i|$, then declare all members satisfying |GSWR_i| > 2c(s) as outliers, and stop.

(b) Otherwise, the current [R] set is replaced by the first (s + 1) ordered observations. If s + 1 = n, then go to Step 4; otherwise return to Step 2.

Step 4 GSWR_i is recalculated with all data points included in the [R] set. Then declare all observations satisfying |GSWR_i| > 2 as outliers and stop.
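Steps 0 to 4 can be sketched as a short Python loop. This is a minimal illustration, not the author's implementation: the model fitting is hidden behind a hypothetical callable `score_fn` that takes the indices of the current [R] set and returns the GSWR values for all n points together with c(s), and `min_size` is a crude stand-in for the full-rank enlargement in Step 1.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[List[int]], Tuple[List[float], float]]


def forward_outlier_search(full_scores: Sequence[float], score_fn: ScoreFn,
                           cutoff: float = 2.0, min_size: int = 1) -> List[int]:
    """Return the indices declared outlying by Steps 0-4.

    full_scores  SWR values from the full-data fit (Step 0)
    score_fn     hypothetical helper: refits on a subset and returns
                 (GSWR values for all points, adjustment c(s))
    """
    n = len(full_scores)
    # Step 1: the {n/4} points with the smallest |SWR| form the initial [R].
    order = sorted(range(n), key=lambda i: abs(full_scores[i]))
    subset = order[:max(n // 4, min_size)]
    while True:
        scores, c_s = score_fn(subset)          # Step 2
        s = len(subset)
        if s == n:                              # Step 4: everything is in [R]
            return sorted(i for i in range(n) if abs(scores[i]) > cutoff)
        order = sorted(range(n), key=lambda i: abs(scores[i]))
        if abs(scores[order[s]]) > cutoff * c_s:    # Step 3(a)
            return sorted(i for i in range(n)
                          if abs(scores[i]) > cutoff * c_s)
        subset = order[:s + 1]                  # Step 3(b)
```

Because the subset grows by exactly one observation per pass, the loop stops after at most n − {n/4} + 1 refits. The same skeleton would serve GLD, GDFFITS and GDFBETAS by changing `score_fn` and the cut-off.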

Here we suggest using the cut-off values of 2 and 2c(s) for |SWR_i| and |GSWR_i|, respectively, because SWR_i and GSWR_i are standardized quantities. Here c(s) plays the role of an adjustment for the difference between the sample size, n, and the current size of the [R] set, s. In practice, the criteria for |SWR_i| and |GSWR_i| can be flexibly defined by other cut-off values, e.g., 2.5 and 2.5c(s), 3 and 3c(s), etc.

Note that if there are a few outliers involved in the initial basic subset of size {n/4}, these outliers are still revealed as outlying, because these observations should be gradually excluded from the subsets with sizes {n/4} + 1, {n/4} + 2, . . . , respectively.

The diagnostic algorithms for GLD_i, GDFFITS_i and GDFBETAS_ij are similar, except for the cut-off values, and hence they are omitted. Their respective suggested cut-off values are as follows.

Under the assumption that the weighted matrix $\hat W^{1/2} X$ is of full rank, the average of $h_i$ is p/n, and then, with $h_i$ replaced by this average, LD_i in (5) becomes

$$ \mathrm{LD}_i = \left(r_i^{ww}\right)^2 \frac{p}{n - p}. $$

Hence we suggest using the cut-off values of 4p/n and 4c²(s)p/s for LD_i and GLD_i, respectively. Under the same assumption, DFFITS_i reduces from (6) to

$$ \mathrm{DFFITS}_i = r_i^{ww} \sqrt{\frac{\hat\phi^{(-i)}}{\hat\phi}} \sqrt{\frac{p}{n - p}}. $$

Hence we suggest using the cut-off values of 2√(p/n) and 2c(s)√(p/s) for |DFFITS_i| and |GDFFITS_i|, respectively.

On the other hand, in the special case of location, i.e., in the special case where $\hat W^{1/2} X$ is an n × 1 vector of ones, DFBETAS_ij reduces from (7) to

$$ \mathrm{DFBETAS}_i = \frac{y_i^* - \hat\mu_i^*}{\sqrt{\hat v_i}} \sqrt{\frac{\hat\phi^{(-i)}}{\hat\phi}}\, \frac{\sqrt{n}}{n - 1} $$

where the quantity $(y_i^* - \mu_i^*)/\sqrt{v_i}$ is a standardized version of $y_i^*$, since the expectation and variance of the random variable $Y_i^* = \log(Y_i/(1 - Y_i))$ are $E(Y_i^*) = \mu_i^*$ and $Var(Y_i^*) = v_i$, respectively. Thus, the cut-off values of DFBETAS_ij and GDFBETAS_ij are suggested as 2/√n and 2c(s)/√s, respectively.
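The suggested cut-off values can be collected in one helper. The sketch below assumes a common multiplier k (2 by default, 2.5 or 3 as alternatives mentioned in the text) and takes the adjustment c(s) as a given number; the function name and the dictionary layout are illustrative choices only.

```python
import math
from typing import Dict, Optional, Tuple


def suggested_cutoffs(n: int, p: int, s: Optional[int] = None,
                      c_s: float = 1.0, k: float = 2.0
                      ) -> Tuple[Dict[str, float], Optional[Dict[str, float]]]:
    """Cut-offs for the single-case measures and, if s is given, for the
    group deletion measures based on a retained set of size s."""
    single = {
        "SWR": k,
        "LD": k * k * p / n,
        "DFFITS": k * math.sqrt(p / n),
        "DFBETAS": k / math.sqrt(n),
    }
    if s is None:
        return single, None
    group = {
        "GSWR": k * c_s,
        "GLD": (k * c_s) ** 2 * p / s,
        "GDFFITS": k * c_s * math.sqrt(p / s),
        "GDFBETAS": k * c_s / math.sqrt(s),
    }
    return single, group
```

For example, with n = 50 and p = 3 the single-case cut-offs are 0.24 for LD, about 0.49 for |DFFITS| and about 0.28 for |DFBETAS|.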

Here it should be pointed out that the proposed diagnostic method resembles the forward search (FS) algorithm that was introduced by Hadi and Simonoff (1993) and Atkinson (1994) for identifying multiple outliers in linear models and in multivariate data, respectively. The proposed method and the FS algorithm have two similar stages. The first stage attempts to form a basic subset that is presumably free from potential outliers. The second stage uses an appropriate diagnostic measure, such as the adjusted residual or Cook distance, to examine how extreme the potential outliers are relative to the basic subset. The possible outliers are then declared as outliers if they are greatly inconsistent with the majority of the data. Such an FS diagnostic technique was also applied to GLMs. For details, see Atkinson and Riani (2000, Chapter 6). A recent survey of theoretical developments in work on the FS diagnostic was given in Atkinson et al. (2010), who tried to obtain suitable envelopes for the outlier tests as the size of the current [R] set grows. The modified FS diagnostic procedure has not been applied to beta regression models.

5 Simulations

In this section, simulation studies are carried out to investigate the finite sample performance of the proposed group deletion diagnostic measures. We consider the two beta models

Model 1: logit(μ_i) = log(μ_i/(1 − μ_i)) = β1 + β2 x_{i,2} + β3 x_{i,3}

and

Model 2: probit(μ_i) = β1 + β2 x_{i,2} + β3 x_{i,3}

with the logit and probit link functions, respectively. To show the different performance characteristics of GSWR_i, GLD_i, GDFFITS_i and GDFBETAS_ij, numerical studies are performed for sample sizes n = 50 and 150, with the first 0.2n observations generated as outlying observations that have unusual residuals, the next 0.1n observations generated as outlying data points that are influential on β2, and the subsequent 0.1n observations generated as outlying data points that have an influence on β3. To further contrast the performances of the single-case and group deletion diagnostics, we consider two different outlier scenarios under each model. These scenarios are chosen because they are situations in which groups of outliers are influential for regression analyses but are not easily identified by the single-case deletion diagnostics.

Let logit⁻¹(·) and probit⁻¹(·) be the inverses of logit(·) and probit(·), respectively. Now consider Model 1 with n = 50 and φ = 30. Let β1 = β2 = β3 = −0.01 and η_i = −0.01 − 0.01x_{i,2} − 0.01x_{i,3}. Then the usual observations y_21, . . . , y_50 are independently generated from the beta distribution, Beta(μ_i φ, (1 − μ_i)φ), with μ_i = logit⁻¹(η_i), in which x_{i,2} and x_{i,3} are independently drawn from the uniform distribution Uniform(0.85, 1.15).

Under the outlier scenario (A), the outlying observations y_1, . . . , y_10 that have extremely small residuals are independently generated as the usual data points, except that their means are set to μ_i = logit⁻¹(η_i) − 0.35. The outlying data points y_11, . . . , y_15 that are influential on β2 are independently drawn from the beta distribution with μ_i = logit⁻¹(−0.01 + 0.85x_{i,2} − 0.01x_{i,3}), where x_{i,2} and x_{i,3} are independently drawn from Uniform(1.3, 1.4) and Uniform(0.85, 1.15), respectively. The outlying data points y_16, . . . , y_20 that have an influence on β3 are independently generated by the same process, except that their means are set to μ_i = logit⁻¹(−0.01 − 0.01x_{i,2} + 0.85x_{i,3}), with x_{i,2} and x_{i,3} drawn from Uniform(0.85, 1.15) and Uniform(1.3, 1.4), respectively.

On the other hand, under the outlier scenario (B), the outliers y_1, . . . , y_20 are generated as under the outlier scenario (A), except that their means are reset to μ_i = logit⁻¹(η_i) + 0.41, μ_i = logit⁻¹(−0.01 − 0.6x_{i,2} − 0.01x_{i,3}) and μ_i = logit⁻¹(−0.01 − 0.01x_{i,2} − 0.7x_{i,3}) for observations i = 1, . . . , 10, i = 11, . . . , 15 and i = 16, . . . , 20, respectively. Data for n = 150 are generated by the same steps.
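One data set for Model 1 under scenario (A) could be generated along the following lines. This is a sketch using only Python's standard library; the function name, the tuple layout and the use of `n // 5` and `n // 10` for the group sizes 0.2n and 0.1n are assumptions made for illustration.

```python
import math
import random
from typing import List, Optional, Tuple


def inv_logit(eta: float) -> float:
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_scenario_a(n: int = 50, phi: float = 30.0,
                        seed: Optional[int] = None
                        ) -> List[Tuple[float, float, float]]:
    """Return n rows (y, x2, x3) for Model 1 under outlier scenario (A)."""
    rng = random.Random(seed)
    n_res, n_inf = n // 5, n // 10      # 0.2n residual outliers, 0.1n per slope
    rows = []
    for i in range(n):
        x2 = rng.uniform(0.85, 1.15)
        x3 = rng.uniform(0.85, 1.15)
        eta = -0.01 - 0.01 * x2 - 0.01 * x3
        if i < n_res:                   # unusual residuals: mean shifted down
            mu = inv_logit(eta) - 0.35
        elif i < n_res + n_inf:         # influential on beta_2
            x2 = rng.uniform(1.3, 1.4)
            mu = inv_logit(-0.01 + 0.85 * x2 - 0.01 * x3)
        elif i < n_res + 2 * n_inf:     # influential on beta_3
            x3 = rng.uniform(1.3, 1.4)
            mu = inv_logit(-0.01 - 0.01 * x2 + 0.85 * x3)
        else:                           # usual observations
            mu = inv_logit(eta)
        y = rng.betavariate(mu * phi, (1.0 - mu) * phi)
        rows.append((y, x2, x3))
    return rows
```

Scenario (B) would follow by changing the three outlier means as described above.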

Next consider Model 2 with n = 50 and φ = 30. The usual observations


where x_{i,2} and x_{i,3} are independently drawn from Uniform(0.85, 1.15). Under the outlier scenario (C), the outlying observations y_1, . . . , y_10 that have extremely large residuals are independently generated as the usual data points, except that their means are set to μ_i = probit⁻¹(η_i) + 0.38. The outlying data points y_11, . . . , y_15 and y_16, . . . , y_20 that respectively have an effect on β2 and β3 are independently drawn from the beta distribution with μ_i = probit⁻¹(−0.01 − 0.7x_{i,2} − 0.01x_{i,3}) and μ_i = probit⁻¹(−0.01 − 0.01x_{i,2} − 0.7x_{i,3}), respectively, where x_{i,2} and x_{i,3} are generated as in Model 1. Under the outlier scenario (D), the outliers y_1, . . . , y_20 are generated as under the outlier scenario (C), except that their means are reset to μ_i = probit⁻¹(η_i) − 0.4, μ_i = probit⁻¹(−0.01 + 0.4x_{i,2} − 0.01x_{i,3}) and μ_i = probit⁻¹(−0.01 − 0.01x_{i,2} + 0.5x_{i,3}), respectively. Data for n = 150 are generated in the same way.

Fig. 1 Scatter plots of the averages of the 2000 μi against the 2000 (yi − μi) for each outlier scenario for sample size n = 50. Triangle indexes the outlying observations that have unusual residuals; square indexes the outlying data points that are influential on β2; plus indexes the outlying data points that have an influence

Two thousand simulation runs are carried out for each case, with the x_{i,2}'s and x_{i,3}'s being regenerated after every 50 simulations. Figure 1 shows scatter plots of the averages of the 2000 μi against the 2000 (yi − μi) for each outlier scenario for sample size n = 50. Because the results for sample sizes n = 50 and 150 are similar, we omit the results for n = 150 to save space. From Fig. 1, it is evident that the mean (yi − μi) values of the outlying observations that have unusual residuals are greater or less than those of the usual observations and of the outlying data points that are influential on β2 or β3. By contrast, the mean μi values of the outlying data points that have an influence on β2 or β3 are larger or smaller than those of the usual data points and of the outlying observations that have extreme residuals. This indicates that the former have extremely unusual residuals but do not have significant impacts on μi, whereas the latter have influences on μi but do not have extraordinarily unusual residuals.

Tables 1, 2, 3 and 4 display the averages of the proportions of observations i = 1, . . . , 0.2n, observations i = 0.2n + 1, . . . , 0.3n, and observations i = 0.3n + 1, . . . , 0.4n claimed by the single-case and group deletion diagnostics as outlying observations in the two thousand simulated data sets. Also displayed are the averages of the proportions of observations i = 1, . . . , n identified as outliers in the 2000 replications in which no observations are generated as outlying observations in the simulated data sets. In addition, Tables 1, 2, 3 and 4 also exhibit the results based on different diagnostic criteria, for comparison with the results from the suggested cut-off values.

From Tables 1, 2, 3 and 4, it is clear that, in the cases where no observations are generated as outliers in the simulated data sets and the diagnostic criteria are based on the suggested cut-off values, the averages of the proportions of observations i = 1, . . . , n detected by the single-case and group deletion diagnostics as unusual observations in the 2000 replications gradually approach 0.05 as the sample size n increases. This implies that, on the basis of the suggested cut-off values, the probability that the usual observations are misdiagnosed as unusual observations by the single-case and group deletion diagnostics is around 0.05. This, in turn, means that, in the cases where no observations are generated as outliers in the simulated data sets, the results from the single-case and group deletion diagnostics, based on the suggested cut-off values, are reasonable.

On the other hand, from Tables 1, 2, 3 and 4, it is clear that, in the cases with observations generated as outliers in the simulated data sets, the group deletion diagnostics are more effective than the single-case deletion diagnostics in picking up the outliers from the data sets. For example, in Model 2 with n = 150 under the outlier scenario (C), when the cut-off values of DFBETAS_i2 and GDFBETAS_i2 are taken to be 2/√n and 2c(s)/√s, respectively, the average of the proportions of observations i = 31, . . . , 45 highlighted as outlying data points in the 2000 simulations by GDFBETAS_i2 is 0.9438, larger than the corresponding result 0.3661 from DFBETAS_i2. Similar results are also obtained by comparing the results from GDFBETAS_i3 and DFBETAS_i3, or from GLD_i and LD_i, etc.

From Tables 1, 2, 3 and 4, it is also noted that SWR_i and GSWR_i identify outlying observations that have extremely unusual residuals as atypical observations, whereas the outlying data points that have influences on β2 or β3 are not flagged by SWR_i and

Table 1 Model 1: logit(μi) = −0.01 − 0.01xi,2 − 0.01xi,3, i = 1, . . . , 50

                                Data without  Data with outlying observations (a)
                                outliers
Diagnostic    Cut-off value     Obs. 1–50     Obs. 1–10  Obs. 11–15  Obs. 16–20  Obs. 21–50

Scenario (A)
SWRi          2                 0.0518        0.3205     0.0116      0.0109      0.0008
GSWRi         2c(s)             0.0424        0.3434     0.0037      0.0037      0.0002
LDi           4p/n              0.0658        0.1027     0.1601      0.1562      0.0033
GLDi          4c²(s)p/s         0.0488        0.0431     0.2786      0.2854      0.0005
DFFITSi       2√(p/n)           0.0714        0.1319     0.1616      0.1587      0.0004
GDFFITSi      2c(s)√(p/s)       0.0623        0.1033     0.4049      0.3972      0.0011
DFBETASi2     2/√n              0.0810        0.0898     0.3769      0.0134      0.0093
GDFBETASi2    2c(s)/√s          0.0787        0.1202     0.8435      0.0061      0.0019
DFBETASi3     2/√n              0.0816        0.0855     0.0205      0.3829      0.0126
GDFBETASi3    2c(s)/√s          0.0800        0.1100     0.0136      0.8257      0.0024
SWRi          2.5               0.0142        0.1235     0.0015      0.0008      0.0000
GSWRi         2.5c(s)           0.0131        0.1228     0.0012      0.0006      0.0000
LDi           6.25p/n           0.0259        0.0203     0.0728      0.0639      0.0002
GLDi          6.25c²(s)p/s      0.0225        0.0142     0.1096      0.1054      0.0001
DFFITSi       2.5√(p/n)         0.0307        0.0337     0.0767      0.0672      0.0006
GDFFITSi      2.5c(s)√(p/s)     0.0288        0.0309     0.2733      0.2791      0.0001
DFBETASi2     2.5/√n            0.0425        0.0344     0.2570      0.0044      0.0020
GDFBETASi2    2.5c(s)/√s        0.0423        0.0458     0.7400      0.0018      0.0003
DFBETASi3     2.5/√n            0.0446        0.0319     0.0065      0.2606      0.0035
GDFBETASi3    2.5c(s)/√s        0.0443        0.0447     0.0039      0.7312      0.0009

Scenario (B)
SWRi          2                 0.0518        0.3861     0.0020      0.0016      0.0002
GSWRi         2c(s)             0.0424        0.4485     0.0004      0.0004      0.0000
LDi           4p/n              0.0658        0.1426     0.0742      0.0819      0.0014
GLDi          4c²(s)p/s         0.0488        0.1004     0.1055      0.1483      0.0005
DFFITSi       2√(p/n)           0.0714        0.1727     0.0755      0.0845      0.0017
GDFFITSi      2c(s)√(p/s)       0.0623        0.1876     0.1569      0.2074      0.0004
DFBETASi2     2/√n              0.0810        0.1213     0.2585      0.0045      0.0056
GDFBETASi2    2c(s)/√s          0.0787        0.1602     0.6769      0.0019      0.0013
DFBETASi3     2/√n              0.0816        0.1131     0.0069      0.2886      0.0081
GDFBETASi3    2c(s)/√s          0.0800        0.1470     0.0045      0.7397      0.0014
SWRi          2.5               0.0142        0.1609     0.0001      0.0000      0.0000
GSWRi         2.5c(s)           0.0131        0.1726     0.0001      0.0000      0.0000
LDi           6.25p/n           0.0259        0.0351     0.0254      0.0243      0.0002
GLDi          6.25c²(s)p/s      0.0225        0.0317     0.0322      0.0350      0.0002
DFFITSi       2.5√(p/n)         0.0307        0.0519     0.0266      0.0260      0.0003
GDFFITSi      2.5c(s)√(p/s)     0.0288        0.0531     0.0744      0.1054      0.0002
DFBETASi2     2.5/√n            0.0425        0.0537     0.1464      0.0008      0.0009
GDFBETASi2    2.5c(s)/√s        0.0423        0.0683     0.5066      0.0003      0.0002
DFBETASi3     2.5/√n            0.0446        0.0461     0.0023      0.1681      0.0019
GDFBETASi3    2.5c(s)/√s        0.0443        0.0625     0.0011      0.5812      0.0006

Diagnostic results larger than 0.1 based on the single-case and group deletion diagnostics are highlighted in bold and underlined, respectively
(a) Observations 1–10 have unusual residuals; observations 11–15 and 16–20 respectively have influences on β2 and β3; observations 21–50 are usual data points

residuals of yi − μi, as shown in Fig. 1, in which the mean(yi − μi) values of the latter are closer to those of the usual observations than the values of the former are. On the other hand, because the latter were generated to have an impact on the regression parameter estimates, they are more easily identified as unusual data points by DFBETASi and GDFBETASi.

In addition, Tables 1, 2, 3 and 4 show that, in general, the average proportions of outlying data points that influence β2 or β3 and are detected as unusual are higher than those of the outlying observations that have unusual residuals. In other words, the latter are harder to identify than the former, because the latter are not outlying enough. Comparing the results from Model 2 under outlier scenarios (C) and (D) shows that the latter become easier to identify as their residuals become more extreme.

Overall, the group deletion diagnostics provide more valid inferences in regression diagnostic analyses of data sets with multiple outliers.
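The entries in Tables 1, 2, 3 and 4 are averages of detection proportions within groups of cases. A minimal sketch of how such a summary could be computed is given here. It assumes the simulation output is stored as one list of 0/1 flags per replication (1 if a diagnostic exceeded its cut-off for that case); the function name, the storage layout and the toy data are illustrative assumptions, not the author's code.

```python
import random

def detection_proportions(flags, groups):
    """Average, over replications, of the proportion of flagged cases per group.

    flags  : list of replications; each replication is a list of 0/1 (or bool)
             values, one per case, marking cases that exceeded the cut-off.
    groups : dict mapping a label to a (first, last) pair of 1-based case numbers.
    """
    out = {}
    for label, (first, last) in groups.items():
        size = last - first + 1
        # Proportion flagged within the group, computed per replication.
        per_rep = [sum(row[first - 1:last]) / size for row in flags]
        out[label] = sum(per_rep) / len(per_rep)
    return out

# Toy data only: 200 replications of n = 50 cases with arbitrary flag rates.
random.seed(0)
toy_flags = [
    [int(random.random() < (0.30 if i < 10 else 0.02)) for i in range(50)]
    for _ in range(200)
]
groups = {"1-10": (1, 10), "11-15": (11, 15), "16-20": (16, 20), "21-50": (21, 50)}
print(detection_proportions(toy_flags, groups))
```

Grouping by case range mirrors the column layout of the tables, so one call per diagnostic and cut-off would fill one table row.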

6 Examples

Two practical applications are presented in this section to illustrate the usefulness of the proposed group deletion diagnostic measures.

6.1 Example 1: Reading accuracy data

Table 2 Model 1: logit(μi) = −0.01 − 0.01xi,2 − 0.01xi,3, i = 1, ..., 150

Diagnostic  Cut-off value   1–150   1–30    31–45   46–60   61–150
Scenario (A)
SWRi        2               0.0481  0.3096  0.0088  0.0087  0.0006
GSWRi       2c(s)           0.0382  0.3290  0.0021  0.0018  0.0001
LDi         4p/n            0.0547  0.1040  0.1173  0.1198  0.0013
GLDi        4c²(s)p/s       0.0377  0.0512  0.1680  0.1618  0.0002
DFFITSi     2√(p/n)         0.0566  0.1120  0.1183  0.1206  0.0015
GDFFITSi    2c(s)√(p/s)     0.0470  0.0765  0.3752  0.3575  0.0003
DFBETASi2   2/√n            0.0697  0.0858  0.3299  0.0096  0.0058
GDFBETASi2  2c(s)/√s        0.0628  0.0978  0.8951  0.0028  0.0003
DFBETASi3   2/√n            0.0688  0.0805  0.0103  0.3301  0.0062
GDFBETASi3  2c(s)/√s        0.0622  0.0902  0.0021  0.8973  0.0003
SWRi        2.5             0.0136  0.1226  0.0009  0.0006  0.0000
GSWRi       2.5c(s)         0.0121  0.1235  0.0005  0.0004  0.0000
LDi         6.25p/n         0.0199  0.0228  0.0460  0.0442  0.0001
GLDi        6.25c²(s)p/s    0.0161  0.0169  0.0500  0.0471  0.0000
DFFITSi     2.5√(p/n)       0.0213  0.0269  0.0470  0.0455  0.0001
GDFFITSi    2.5c(s)√(p/s)   0.0191  0.0215  0.1666  0.1497  0.0001
DFBETASi2   2.5/√n          0.0345  0.0307  0.2064  0.0023  0.0008
GDFBETASi2  2.5c(s)/√s      0.0320  0.0342  0.7709  0.0007  0.0001
DFBETASi3   2.5/√n          0.0343  0.0290  0.0025  0.2067  0.0008
GDFBETASi3  2.5c(s)/√s      0.0319  0.0317  0.0005  0.7838  0.0001
Scenario (B)
SWRi        2               0.0481  0.3715  0.0009  0.0012  0.0001
GSWRi       2c(s)           0.0382  0.4288  0.0000  0.0001  0.0000
LDi         4p/n            0.0547  0.1429  0.0421  0.0503  0.0004
GLDi        4c²(s)p/s       0.0377  0.1127  0.0315  0.0039  0.0001
DFFITSi     2√(p/n)         0.0566  0.1519  0.0426  0.0507  0.0005
GDFFITSi    2c(s)√(p/s)     0.0470  0.1605  0.0591  0.1038  0.0001
DFBETASi2   2/√n            0.0697  0.1176  0.1987  0.0031  0.0028
GDFBETASi2  2c(s)/√s        0.0628  0.1383  0.6972  0.0008  0.0002
DFBETASi3   2/√n            0.0688  0.1060  0.0024  0.2222  0.0033
GDFBETASi3  2c(s)/√s        0.0622  0.1241  0.0006  0.7654  0.0004
SWRi        2.5             0.0136  0.1588  0.0000  0.0001  0.0000
GSWRi       2.5c(s)         0.0121  0.1724  0.0000  0.0000  0.0000
LDi         6.25p/n         0.0199  0.0369  0.0100  0.0128  0.0000
GLDi        6.25c²(s)p/s    0.0161  0.0338  0.0092  0.0114  0.0001
DFFITSi     2.5√(p/n)       0.0213  0.0424  0.0104  0.0131  0.0000
GDFFITSi    2.5c(s)√(p/s)   0.0191  0.0428  0.0133  0.0247  0.0000
DFBETASi2   2.5/√n          0.0345  0.0487  0.0955  0.0005  0.0002
GDFBETASi2  2.5c(s)/√s      0.0320  0.0541  0.4334  0.0004  0.0000
DFBETASi3   2.5/√n          0.0343  0.0430  0.0003  0.1122  0.0003
GDFBETASi3  2.5c(s)/√s      0.0319  0.0471  0.0001  0.5277  0.0001

The first application uses the data analyzed by Espinheira et al. (2008a) from 44 children (19 children with dyslexia and 25 controls) recruited from primary schools in the Australian Capital Territory. The ages of the children ranged from eight years five months to twelve years three months. The response (y) gives the scores on a test of reading accuracy, and the explanatory variables represent dyslexia versus non-dyslexia status (x2), non-verbal IQ scores converted to z-scores (x3) and an interaction variable (x4). The variable x2 is coded as 1 when the child is dyslexic and otherwise it is coded as −1. The non-dyslexic readers’ mean accuracy score is 0.900 whereas the mean for readers who have dyslexia is 0.606. The overall mean score is 0.773. Following the suggestions in Espinheira et al. (2008a), we analyze the data set using the GBLM with the logit link function as

logit(μi) = β1 + β2 xi,2 + β3 xi,3 + β4 xi,4,   i = 1, ..., 44.
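To make the mean structure concrete, the following sketch evaluates μi from the linear predictor by inverting the logit link. The coefficient values are hypothetical and serve only to exercise the function, and taking the interaction variable as the product x4 = x2 · x3 is an assumption, since the text only calls it "an interaction variable".

```python
import math

def inverse_logit(eta):
    """Map a linear predictor to a mean in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def mean_reading_accuracy(beta, dyslexic, iq_z):
    """Mean response under logit(mu) = b1 + b2*x2 + b3*x3 + b4*x4."""
    b1, b2, b3, b4 = beta
    x2 = 1.0 if dyslexic else -1.0  # coding stated in the text
    x3 = iq_z                       # non-verbal IQ as a z-score
    x4 = x2 * x3                    # ASSUMPTION: interaction is the product x2 * x3
    eta = b1 + b2 * x2 + b3 * x3 + b4 * x4
    return inverse_logit(eta)

# Hypothetical coefficients, not estimates from the paper.
beta_demo = (1.0, -0.7, 0.4, -0.5)
print(mean_reading_accuracy(beta_demo, dyslexic=True, iq_z=0.0))
print(mean_reading_accuracy(beta_demo, dyslexic=False, iq_z=0.0))
```

Because the inverse logit always returns a value strictly between 0 and 1, the fitted means stay inside the support of the beta distribution whatever the coefficients are.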

Table 5 displays the diagnostic results based on the single-case and group deletion diagnostics. Some interesting findings emerge. First, GSWRi suggests observations 8, 9, 15 and 22 as outlying observations with unusual residuals, whereas SWRi highlights only observation 8 among them. Next, comparing LDi and GLDi, LDi detects observations 6 and 8 as the outlying data points that have an impact on the regression parameter estimates β1, β2, β3 and β4, while GLDi identifies observations 8 and 15 as the atypical observations that are influential on β1, β2, β3 and β4. The joint deletion of observations 6 and 8 yields relative changes in β1, β2, β3 and β4 of −1.773, −2.761, 32.77 and 24.36 %, whereas the relative changes in β1, β2, β3 and β4 are −11.32, −16.03, 117.5 and 86.82 % with observations 8 and 15 deleted. This implies that observations 8 and 15 jointly have a stronger influence on the regression parameter estimates than observations 6 and 8 do.
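The relative changes quoted in this section compare parameter estimates from the full data with estimates obtained after a group of cases is deleted. A small helper illustrating one plausible form of that calculation is sketched here. The definition used, 100 × (estimate without the group − full-data estimate) / full-data estimate, is an assumption about the sign convention, and the numbers in the example are hypothetical rather than taken from the paper.

```python
def relative_change_percent(beta_full, beta_deleted):
    """Percentage change in each coefficient after deleting a group of cases.

    ASSUMPTION: relative change = 100 * (estimate without the group
    - full-data estimate) / full-data estimate.
    """
    return [100.0 * (bd - bf) / bf for bf, bd in zip(beta_full, beta_deleted)]

# Hypothetical estimates, used only to show the call.
full = [1.20, -0.70, 0.40]
without_group = [1.08, -0.77, 0.60]
print([round(c, 2) for c in relative_change_percent(full, without_group)])
```

In practice the two coefficient vectors would come from refitting the beta regression with and without the flagged cases, for example with the R package betareg mentioned in the introduction.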

Similarly, comparing DFFITSi and GDFFITSi, GDFFITSi detects observations 5, 8, 9, 15 and 22 as atypical points on the weighted fits. Among these, observations 5, 9, 15 and 22 are missed by DFFITSi. Jointly removing observations 5, 8, 9, 15 and 22 from the data results in sizeable relative changes of −28.10, −39.86, 269.1 and 199.0 % in β1, β2, β3 and β4.

Table 3 Model 2: (μi) = −0.01 − 0.01xi,2 − 0.01xi,3, i = 1, ..., 50

Diagnostic  Cut-off value   1–50    1–10    11–15   16–20   21–50
Scenario (C)
SWRi        2               0.0517  0.3082  0.0193  0.0182  0.0003
GSWRi       2c(s)           0.0423  0.3242  0.0087  0.0082  0.0001
LDi         4p/n            0.0657  0.0828  0.1917  0.1907  0.0030
GLDi        4c²(s)p/s       0.0489  0.0231  0.3460  0.3614  0.0003
DFFITSi     2√(p/n)         0.0714  0.1488  0.1099  0.1040  0.0037
GDFFITSi    2c(s)√(p/s)     0.0624  0.1440  0.2729  0.2739  0.0010
DFBETASi2   2/√n            0.0425  0.0761  0.4132  0.0189  0.0091
GDFBETASi2  2c(s)/√s        0.0424  0.1017  0.8840  0.0079  0.0015
DFBETASi3   2/√n            0.0815  0.0723  0.0295  0.4245  0.0122
GDFBETASi3  2c(s)/√s        0.0799  0.0981  0.0173  0.8841  0.0016
SWRi        2.5             0.0142  0.1084  0.0028  0.0017  0.0000
GSWRi       2.5c(s)         0.0132  0.1051  0.0025  0.0014  0.0000
LDi         6.25p/n         0.0258  0.0136  0.0908  0.0856  0.0005
GLDi        6.25c²(s)p/s    0.0225  0.0080  0.1467  0.1434  0.0002
DFFITSi     2.5√(p/n)       0.0307  0.0403  0.0451  0.0384  0.0006
GDFFITSi    2.5c(s)√(p/s)   0.0287  0.0393  0.1636  0.1530  0.0003
DFBETASi2   2.5/√n          0.0811  0.0260  0.2940  0.0064  0.0019
GDFBETASi2  2.5c(s)/√s      0.0787  0.0358  0.8121  0.0013  0.0002
DFBETASi3   2.5/√n          0.0446  0.0248  0.0109  0.3007  0.0035
GDFBETASi3  2.5c(s)/√s      0.0442  0.0369  0.0049  0.8163  0.0005
Scenario (D)
SWRi        2               0.0517  0.3891  0.0024  0.0025  0.0001
GSWRi       2c(s)           0.0423  0.4513  0.0002  0.0002  0.0000
LDi         4p/n            0.0657  0.1448  0.0843  0.1009  0.0018
GLDi        4c²(s)p/s       0.0489  0.0923  0.1229  0.2048  0.0004
DFFITSi     2√(p/n)         0.0714  0.1933  0.0553  0.0601  0.0020
GDFFITSi    2c(s)√(p/s)     0.0624  0.2295  0.0746  0.1109  0.0005
DFBETASi2   2/√n            0.0425  0.1192  0.2803  0.0071  0.0056
GDFBETASi2  2c(s)/√s        0.0424  0.1537  0.7037  0.0036  0.0014
DFBETASi3   2/√n            0.0815  0.1069  0.0083  0.3224  0.0087
GDFBETASi3  2c(s)/√s        0.0799  0.1431  0.0045  0.7903  0.0013
SWRi        2.5             0.0142  0.1704  0.0001  0.0001  0.0000
GSWRi       2.5c(s)         0.0132  0.1820  0.0001  0.0000  0.0000
LDi         6.25p/n         0.0258  0.0368  0.0290  0.0314  0.0001
GLDi        6.25c²(s)p/s    0.0225  0.0317  0.0386  0.0535  0.0001
DFFITSi     2.5√(p/n)       0.0307  0.0619  0.0168  0.0155  0.0001
GDFFITSi    2.5c(s)√(p/s)   0.0287  0.0661  0.0323  0.0457  0.0001
DFBETASi2   2.5/√n          0.0811  0.0533  0.1638  0.0015  0.0008
GDFBETASi2  2.5c(s)/√s      0.0787  0.0663  0.5440  0.0005  0.0001
DFBETASi3   2.5/√n          0.0446  0.0425  0.0027  0.1996  0.0023
GDFBETASi3  2.5c(s)/√s      0.0442  0.0578  0.0010  0.6485  0.0005

On the other hand, DFBETASi3 and DFBETASi4 detect observations 6, 8 and 15 as having a large impact on β3 and β4, while GDFBETASi3 and GDFBETASi4 spot observations 5, 8, 9, 15 and 22 as influential on the results for β3 and β4. The exclusion of observations 5, 8, 9, 15 and 22 causes changes of 269.1 and 199.0 % in β3 and β4, whereas the relative changes in β3 and β4 are 86.31 and 63.88 % with observations 6, 8 and 15 omitted together. This indicates that observations 5, 8, 9, 15 and 22 have a relatively larger effect on β3 and β4 than observations 6, 8 and 15. In addition, in this example we do not consider the results of GDFBETASi2 and DFBETASi2, because x2 is a binary variable whose value equals −1 or 1, so we are not particularly interested in which observations are influential on β2.

Evidently, in the reading accuracy data, the group deletion diagnostics detect multiple outlying observations that are influential on the results of the regression analysis, while these unusual observations are missed by the single-case deletion diagnostics.

6.2 Example 2: Stress, depression, and anxiety

The second example uses data collected from a sample of 166 nonclinical women in Townsville, Queensland, Australia. The data were analyzed by Smithson and Verkuilen (2006) using beta regression. The response variable (y) represents the scores on a test of anxiety symptoms, and the explanatory variable (x2) gives the corresponding scores on the test of stress symptoms. We follow Smithson and Verkuilen (2006) in considering the model

logit(μi) = β1 + β2 xi,2,   i = 1, ..., 166.

Table 6 presents the diagnostic results based on the single-case and group deletion diagnostics. Careful inspection of Table 6 reveals the following findings.

Table 4 Model 2: (μi) = −0.01 − 0.01xi,2 − 0.01xi,3, i = 1, ..., 150

Diagnostic  Cut-off value   1–150   1–30    31–45   46–60   61–150
Scenario (C)
SWRi        2               0.0481  0.2960  0.0147  0.0146  0.0002
GSWRi       2c(s)           0.0381  0.3111  0.0038  0.0049  0.0000
LDi         4p/n            0.0547  0.0843  0.1448  0.1459  0.0011
GLDi        4c²(s)p/s       0.0377  0.0277  0.2434  0.2172  0.0000
DFFITSi     2√(p/n)         0.0566  0.1267  0.0716  0.0723  0.0012
GDFFITSi    2c(s)√(p/s)     0.0470  0.1153  0.1806  0.1648  0.0004
DFBETASi2   2/√n            0.0700  0.0731  0.3661  0.0160  0.0050
GDFBETASi2  2c(s)/√s        0.0629  0.0807  0.9438  0.0023  0.0001
DFBETASi3   2/√n            0.0689  0.0677  0.0159  0.3636  0.0054
GDFBETASi3  2c(s)/√s        0.0623  0.0744  0.0027  0.9466  0.0001
SWRi        2.5             0.0135  0.1088  0.0019  0.0022  0.0000
GSWRi       2.5c(s)         0.0120  0.1085  0.0011  0.0010  0.0000
LDi         6.25p/n         0.0200  0.0154  0.0606  0.0608  0.0000
GLDi        6.25c²(s)p/s    0.0161  0.0098  0.0717  0.0675  0.0000
DFFITSi     2.5√(p/n)       0.0213  0.0310  0.0222  0.0212  0.0001
GDFFITSi    2.5c(s)√(p/s)   0.0191  0.0297  0.0458  0.0473  0.0000
DFBETASi2   2.5/√n          0.0345  0.0230  0.2409  0.0041  0.0005
GDFBETASi2  2.5c(s)/√s      0.0321  0.0252  0.8718  0.0008  0.0001
DFBETASi3   2.5/√n          0.0343  0.0220  0.0043  0.2384  0.0005
GDFBETASi3  2.5c(s)/√s      0.0319  0.0231  0.0005  0.8763  0.0000
Scenario (D)
SWRi        2               0.0481  0.3723  0.0010  0.0022  0.0001
GSWRi       2c(s)           0.0381  0.4271  0.0000  0.0002  0.0000
LDi         4p/n            0.0547  0.1450  0.0487  0.0657  0.0006
GLDi        4c²(s)p/s       0.0377  0.1092  0.0355  0.0650  0.0001
DFFITSi     2√(p/n)         0.0566  0.1698  0.0259  0.0316  0.0007
GDFFITSi    2c(s)√(p/s)     0.0470  0.1940  0.0208  0.0353  0.0002
DFBETASi2   2/√n            0.0700  0.1156  0.2196  0.0044  0.0029
GDFBETASi2  2c(s)/√s        0.0629  0.1370  0.7267  0.0001  0.0002
DFBETASi3   2/√n            0.0689  0.1013  0.0030  0.2653  0.0036
GDFBETASi3  2c(s)/√s        0.0623  0.1196  0.0006  0.8432  0.0002
SWRi        2.5             0.0135  0.1678  0.0000  0.0001  0.0000
GSWRi       2.5c(s)         0.0120  0.1826  0.0000  0.0001  0.0000
LDi         6.25p/n         0.0200  0.0398  0.0118  0.0167  0.0000
GLDi        6.25c²(s)p/s    0.0161  0.0355  0.0110  0.0154  0.0000
DFFITSi     2.5√(p/n)       0.0213  0.0514  0.0050  0.0006  0.0000
GDFFITSi    2.5c(s)√(p/s)   0.0191  0.0539  0.0044  0.0078  0.0000
DFBETASi2   2.5/√n          0.0345  0.0479  0.1091  0.0008  0.0003
GDFBETASi2  2.5c(s)/√s      0.0321  0.0530  0.4839  0.0004  0.0000
DFBETASi3   2.5/√n          0.0343  0.0408  0.0004  0.1401  0.0003
GDFBETASi3  2.5c(s)/√s      0.0319  0.0447  0.0002  0.6547  0.0000

^a Observations 1–30 have unusual residuals; observations 31–45 and 46–60 influence β2 and β3, respectively; observations 61–150 are usual data points

Table 5 The single-case and group deletion diagnostics for example 1

Diagnostic  Cut-off value   Case i claimed as an outlier
SWRi        2               8
SWRi        2.5             No observations^a
SWRi        3               No observations^a
GSWRi       2c(s)           8, 9, 15, 22
GSWRi       2.5c(s)         8
GSWRi       3c(s)           No observations^a
LDi         4p/n            6, 8
LDi         6.25p/n         8
LDi         9p/n            8
GLDi        4c²(s)p/s       8, 15
GLDi        6.25c²(s)p/s    8
GLDi        9c²(s)p/s       8
DFFITSi     2√(p/n)         6, 8
DFFITSi     2.5√(p/n)       8
DFFITSi     3√(p/n)         8
GDFFITSi    2c(s)√(p/s)     5, 8, 9, 15, 22
GDFFITSi    2.5c(s)√(p/s)   8, 15, 22
GDFFITSi    3c(s)√(p/s)     8
DFBETASi3   2/√n            6, 8, 15
DFBETASi3   2.5/√n          6, 8
DFBETASi3   3/√n            8
GDFBETASi3  2c(s)/√s        5, 8, 9, 15, 22
GDFBETASi3  2.5c(s)/√s      5, 8, 9, 15, 22
GDFBETASi3  3c(s)/√s        8, 9, 15, 22
DFBETASi4   2/√n            6, 8, 15
DFBETASi4   2.5/√n          6, 8
DFBETASi4   3/√n            8
GDFBETASi4  2c(s)/√s        5, 8, 9, 15, 22
GDFBETASi4  2.5c(s)/√s      5, 8, 9, 15, 22
GDFBETASi4  3c(s)/√s        8, 9, 15, 22

^a No observations are claimed as outliers
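The single-case cut-off values in the table follow simple formulas in the sample size n and the parameter count p. The sketch below evaluates them for a multiplier k of 2, 2.5 or 3. It reads the DFFITS cut-off as k√(p/n), which is consistent with the LD cut-off k²p/n being its square. The value p = 4 in the example call is an illustrative assumption, since what p counts is defined earlier in the paper. The group cut-offs are left out because they depend on c(s) and s, which are also defined earlier.

```python
import math

def single_case_cutoffs(n, p, k=2.0):
    """Cut-offs for the single-case diagnostics at multiplier k.

    n is the sample size; p is the parameter count used in the paper's
    cut-offs (its exact definition is given earlier in the paper).
    """
    return {
        "SWR": k,
        "LD": k ** 2 * p / n,
        "DFFITS": k * math.sqrt(p / n),
        "DFBETAS": k / math.sqrt(n),
    }

# Illustrative call: n = 44 as in Example 1; p = 4 is an assumed value.
for k in (2.0, 2.5, 3.0):
    print(k, single_case_cutoffs(n=44, p=4, k=k))
```

Raising k from 2 to 3 tightens every criterion at once, which is why fewer cases are claimed as outliers at the larger cut-offs.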

The diagnostic results based on SWRi and GSWRi are similar. Observations 10, 89, 116 and 136 are flagged by both SWRi and GSWRi as outlying observations with extremely large or small residuals.

Similarly, GLDi and LDi both suggest observations 10, 55, 77, 89, 116, 125, 132, 151, 152 and 164 as the outlying data points that are influential on the regression parameter estimates β1 and β2. It is also interesting to note that, on the basis of the cut-off values 9c²(s)p/s and 9p/n, GLDi suggests cases 89, 116, 151 and 152 as having an influence on β1 and β2, while LDi identifies cases 55, 89, 116 and 152 as influential on the results for β1 and β2. The joint deletion of observations 89, 116, 151 and 152 causes relative changes in β1 and β2 of 5.21 and 13.80 %, whereas the relative changes in β1 and β2 are 3.47 and 7.74 % with observations 55, 89, 116 and 152 eliminated. This shows that GLDi is more effective than LDi in pointing out the outlying data points that jointly have larger impacts on β1 and β2.

Table 6 The single-case and group deletion diagnostics for example 2

Diagnostic  Cut-off value   Case i claimed as an outlier
SWRi        2               10, 89, 116, 136
SWRi        2.5             10, 89, 116
SWRi        3               89
GSWRi       2c(s)           10, 89, 116, 136
GSWRi       2.5c(s)         10, 89, 116
GSWRi       3c(s)           89
LDi         4p/n            10, 55, 77, 89, 116, 125, 132, 151, 152, 164
LDi         6.25p/n         55, 77, 89, 116, 125, 132, 151, 152, 164
LDi         9p/n            55, 89, 116, 152
GLDi        4c²(s)p/s       10, 55, 77, 89, 116, 125, 132, 151, 152, 164
GLDi        6.25c²(s)p/s    55, 77, 89, 116, 125, 132, 152, 164
GLDi        9c²(s)p/s       89, 116, 151, 152
DFFITSi     2√(p/n)         10, 55, 77, 89, 116, 125, 132, 151, 152, 164
DFFITSi     2.5√(p/n)       55, 77, 89, 116, 125, 132, 151, 152, 164
DFFITSi     3√(p/n)         55, 89, 116, 152
GDFFITSi    2c(s)√(p/s)     10, 51, 55, 77, 89, 116, 125, 132, 133, 151, 152, 164
GDFFITSi    2.5c(s)√(p/s)   10, 55, 77, 89, 116, 125, 132, 151, 152, 164
GDFFITSi    3c(s)√(p/s)     55, 77, 89, 116, 125, 132, 152, 164
DFBETASi2   2/√n            55, 77, 89, 116, 125, 132, 151, 152, 164
DFBETASi2   2.5/√n          55, 77, 89, 116, 125, 132, 151, 152, 164
DFBETASi2   3/√n            55, 77, 89, 116, 125, 132, 151, 152, 164
GDFBETASi2  2c(s)/√s        10, 51, 55, 77, 89, 116, 117, 125, 132, 133, 151, 152, 164
GDFBETASi2  2.5c(s)/√s      51, 55, 77, 89, 116, 125, 132, 151, 152, 164
GDFBETASi2  3c(s)/√s        55, 77, 89, 116, 125, 132, 151, 152, 164

From the results of DFFITSi and GDFFITSi, DFFITSi misses observations 51 and 133, which GDFFITSi discovers as atypical points on the weighted fits. The joint deletion of observations 10, 55, 77, 89, 116, 125, 132, 151, 152 and 164 leads to relative changes in β1 and β2 of 2.60 and 2.00 %, whereas the joint deletion of observations 10, 51, 55, 77, 89, 116, 125, 132, 133, 151, 152 and 164 leads to relative changes in β1 and β2 of 2.76 and −1.93 %. Observations 51 and 133, which are missed by DFFITSi, are evidently influential, because additionally deleting them moves the relative change in β2 from 2.00 % (positive) to −1.93 % (negative).

From the results of DFBETASi2 and GDFBETASi2, GDFBETASi2 detects observations 10, 51, 55, 77, 89, 116, 117, 125, 132, 133, 151, 152 and 164 as having a large impact on β2, while DFBETASi2 suggests observations 55, 77, 89, 116, 125, 132, 151, 152 and 164 as influential on the results for β2. When observations 10, 51, 55, 77, 89, 116, 117, 125, 132, 133, 151, 152 and 164 are simultaneously discarded, β2 increases by 0.83 %, whereas β2 decreases by 1.51 % when observations 55, 77, 89, 116, 125, 132, 151, 152 and 164 are jointly deleted. This indicates that the outlying data points, observations 10, 51, 117 and 133, are influential on β2 but are not found by DFBETASi2.

Clearly, in this illustration, the group deletion diagnostics flag, for further consideration, multiple outlying observations that jointly affect the regression outcome, while these potential outliers are masked by the single-case deletion diagnostics.

7 Conclusions

In this article, we provide a diagnostic approach for identifying multiple outliers in GBLMs. We suggest the group deletion diagnostic measures GSWRi, GLDi, GDFFITSi and GDFBETASi, and propose a test procedure for detecting multiple outlying observations using them. Simulation studies and the analysis of two practical examples show that our proposed methods can assist the analyst in detecting multiple outlying observations in GBLMs.

Finally, we note that the GBLMs with the constant precision parameter were recently improved by Simas et al. (2010), who allowed a regression structure for the precision parameter. In future work we will extend the group deletion diagnostic techniques to beta regression models with regressors for the precision parameter.

Acknowledgments The author is deeply indebted to the associate editor and two referees for their helpful comments and suggestions, which substantially improved the present version of the paper.

References

Atkinson AC (1994) Fast very robust methods for the detection of multiple outliers. J Am Stat Assoc 89:1329–1339

Atkinson AC, Riani M (2000) Robust diagnostic regression analysis. Springer, New York

Atkinson AC, Riani M, Cerioli A (2010) The forward search: theory and data analysis (with discussion). J Korean Stat Soc 39:117–134

Belsley DA, Kuh E, Welsch RE (1980) Regression diagnostics: identifying influential data and sources of collinearity. Wiley, New York

Cribari-Neto F, Zeileis A (2010) Beta regression in R. J Stat Softw 34:1–24

Espinheira PL, Ferrari SLP, Cribari-Neto F (2008a) On beta regression residuals. J Appl Stat 35:407–419

Espinheira PL, Ferrari SLP, Cribari-Neto F (2008b) Influence diagnostics in beta regression. Comput Stat Data Anal 52:4417–4431

Ferrari SLP, Cribari-Neto F (2004) Beta regression for modeling rates and proportions. J Appl Stat 31:799–815

Hadi AS, Simonoff JS (1993) Procedures for the identification of multiple outliers in linear models. J Am Stat Assoc 88:1264–1272

Imon AHMR (2005) Identifying multiple influential observations in linear regression. J Appl Stat 32:929–946

Imon AHMR, Hadi AS (2008) Identification of multiple outliers in logistic regression. Commun Stat Theory Methods 37:1697–1709

Kieschnick R, McCullough BD (2003) Regression analysis of variates observed on (0, 1): percentage, proportions, fractions. Stat Model 3:193–213

Nurunnabi AAM, Imon AHMR, Nasser M (2010) Identification of multiple influential observations in logistic regression. J Appl Stat 37:1605–1624

Simas AB, Barreto-Souza W, Rocha AV (2010) Improved estimators for a general class of beta regression models. Comput Stat Data Anal 54:348–366

Smithson M, Verkuilen J (2006) A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychol Methods 11:54–71