Estimation and prediction in linear mixed models with skew-normal random effects for longitudinal data

(1)

Published online 20 August 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/sim.3026

Estimation and prediction in linear mixed models with

skew-normal random effects for longitudinal data

Tsung I. Lin

1,∗,†

and Jack C. Lee

2

1_{Department of Applied Mathematics}_{, National Chung Hsing University, Taichung 402, Taiwan} 2_{Graduate Institute of Finance}_{, National Chiao Tung University, Hsinchu 30050, Taiwan}

SUMMARY

This paper extends the classical linear mixed model by considering a multivariate skew-normal assumption for the distribution of random effects. We present an efficient hybrid ECME-NR algorithm for the computation of maximum-likelihood estimates of parameters. A score test statistic for testing the existence of skewness preference among random effects is developed. The technique for the prediction of future responses under this model is also investigated. The methodology is illustrated through an application to Framingham cholesterol data and a simulation study. Copyrightq 2007 John Wiley & Sons, Ltd. KEY WORDS: ECME algorithm; maximum-likelihood estimation; prediction; random effects; SNLMM

1. INTRODUCTION

The normal linear mixed model (NLMM), originally introduced by Laird and Ware [1], has become the most frequently used analytic tool for longitudinal data analyses with continuous repeated measures. The specification of both random effects and the error term is automatically taken to be normal for mathematical convenience. Unfortunately, such normality assumptions are too restrictive and suffer from the lack of robustness against departure from the normality assump-tion. Meanwhile, it may yield invalid statistical inferences when the underlying normalities are violated.

To reduce unrealistic normality assumptions in NLMM, various authors concentrated on more flexible approaches using a broader family of distributions. Pinheiro et al.[2] proposed a (multi-variate) t linear mixed model and showed that it would perform well in the occurrence of outliers.

∗_{Correspondence to: Tsung I. Lin, Department of Applied Mathematics, National Chung Hsing University, Taichung}

402, Taiwan.

†_{E-mail: [email protected]}

Contract/grant sponsor: National Science Council of Taiwan; contract/grant number: NSC95-2118-M-005-001-MY2

(2)

Verberk and Lesaffre [3] introduced a heterogeneous linear mixed model with random effects distributed according to finite normal mixtures. They provided a simple goodness-of-fit test for the investigation of heterogeneity in random effects.

An alternative way of enhancing the justification of the normality assumption in the presence of strong skewness is via the Box–Cox transformation; see e.g.[4, 5]. Although such transformation-based methodology may yield reasonable empirical results, the achievement of joint normality is rarely satisfied and the transformed variables are difficult to interpret. From a practical viewpoint, one needs to seek an appropriate theoretical model from a robustification of normal theory in regulating skewness.

The multivariate skew-normal (SN) distribution was introduced by Azzalini and Dalla Valle[6], and some further attractive features as well as applications are given in Azzalini and Capitaino [7]. For this class of distributions, a number of extensions and alternative formulations have been proposed during the last decade. Sahu et al.[8] defined a new class of multivariate SN distributions and stated that it is very convenient to implement within a Bayesian hierarchical framework. Arellano-Valle and Genton[9] studied the family of fundamental skew normal (FUSN) distributions obtained by a convolution scheme that generalizes[8] and leads to a fairly general procedure to obtain the multivariate SN densities starting from symmetric ones. For more discussions of FUSN and other proposals of the original SN distribution such as the closed SN of[10] and the hierarchial SN of[11] along with their connections and related properties, see [12, 13].

Recently, Arellano-Valle et al. [14] proposed a skew-normal linear mixed model (SNLMM) based on the multivariate SN distribution introduced by Azzalini and Dalla Valle[6] and Azzalini and Capitaino[7]. They assumed that both random effects and within-subject errors are distributed as SN for the sake of completeness. To the best of the authors’ experiences, however, in this setting the SN distributional assumption for within-subject errors is hard to justify even when the number of observations is sufficiently large. Moreover, it is irrational to assume that the skewness parameters for within-subject errors are identical for the parsimony of model building. As can be seen in the illustrative example of [14], such representation departs from the true realities for the fitted residuals and in turn leads to less significance for the estimated skewness parameter.

In this paper, we aim at developing additional tools for a simplified version of SNLMM, more directly linked to the work of [14], in which merely the random effects are assumed to follow multivariate SN distributions. A key feature of this model is that it can be formulated as a flexible normal-truncated normal hierarchy, which is useful for random number generation and for theoretical derivations. In view of the computational perspective, we present a hybrid of EM-type and gradient-based methods alternative to [14], which is used for accelerating the calculation of maximum-likelihood (ML) estimates with the standard errors as a by-product at convergence.

In Section 2, we describe the model, define the notations and derive a few distributional prop-erties under the complete data framework. In Section 3, we discuss how to compute ML esti-mates of model parameters in an efficient manner. In Section 4, a score test statistic is obtained for testing the skewness preference among random effects. Such a test can serve as a prelim-inary check before fitting a complex SNLMM. Inferences for the estimation of random effects and the prediction of future values are discussed in Section 5. We illustrate these results us-ing the famous Framus-ingham cholesterol data in Section 6. One simulation study is presented in Section 7. Some concluding remarks are given in Section 8 and technical derivations are given in the Appendices.

(3)

2. MODEL AND NOTATION

A p-dimensional random vector Y is said to have a multivariate SN distribution with a p× 1 location vectorl ∈ Rp, a p× p positive-definite dispersion matrix R and a p × 1 skewness vector

k ∈ Rp, say Y∼ SNp(l, R, k), if its probability density function (pdf) is

f(Y) = 2_p(Y | l, R)(kTR−1/2(Y − l)) (1) where_p(·|l, R) is the pdf of Np(l, R) and (·) is the cumulative distribution function of N(0, 1).

Note that ifk = 0, the pdf of Y in (1) corresponds to Np(l, R) density. Some essential distributional

properties with regard to (1) are referred to[6, 7, 15].

We consider a generalization of NLMM in which the random effects are assumed to follow multivariate SN distributions within the family defined in (1). The model can be written as

Yi= Xib + Zibi + ei, bi∼ SNq(0, 2C, k)

ei∼ Nni(0,

2_C

i), bi ⊥ ei (i = 1, . . . , N)

(2)

where the subscript i is the subject index, Yi is an ni-dimensional random response vector, Xi and Zi are known full-rank matrices of dimensions ni× p and ni× q, respectively, b is a p × 1 vector

of unknown fixed effects describing the population mean, bi is a q× 1 vector of unobservable

random effects and ei is an ni× 1 vector of residual errors assumed to be independent of bi.

Moreover, C is a q × q unstructured symmetric positive-definite matrix, Ci is an ni× ni scaled

matrix, which is a function of a small set of autocorrelation parameters q = (1, . . . , g) and

depends on i only through its dimension ni, and k = (1, . . . , q) is a q × 1 vector of skewness

parameters. In what follows, we reparameterize C = F2 for ease of computation and theoretical derivation, where F is the square root ofC, i.e. C1/2, containing q(q + 1)/2 distinct elements.

According to Proposition 1 of[14], model (2) can be hierarchically represented as

Yi | i ∼ Nni(Xib + dii, 2 Wi) i ∼ TN(0, 2; (0, ∞)) (3) where Wi= ZiF(Iq− ddT)FTZTi + Ci= Ki− didTi Ki= ZiCZTi + Ci, di= ZiFd, d = d(k) = k 1+q_j₌₁2_j

and TN(, 2; (a, b)) denotes the truncated normal distribution for N(, 2) lying within a trun-cated interval(a, b).

It follows from (3) that the density of Yi is

f(Yi) = 2ni(Yi|Xib, 2 Ki) ⎛ ⎝dTiW−1i (Yi− Xib) 1+ dT_iWi−1di ⎞ ⎠ (4)

(4)

From (4), it is straightforward to observe that Yi are independent and marginally distributed as Yi∼ SNni(Xib, 2Ki, ai), where ai= (1 + dT_iW−1_i di)−1/2K 1/2 i Wi−1di. Consequently, simple algebra yields i | Yi∼ TN(i, 2 i; (0, ∞)) where i= (1 + d T iW−1i di)−1dTiWi−1(Yi − Xib), 2i= 2_{(1 + d}T iW−1i di)−1 (5)

Note that, in particular,

E(i | Yi) = _i+ ii and E( 2 i | Yi) = 2_i + 2_i+ i_ii (6) where i=( i) ( i) and _i=i i =dTiWi−1(Yi − Xib) 1+ dT_iW−1i di (7)

Denoting by Y=(Y1, . . . , YN) the observed data, the log-likelihood function of h=(b, 2, F, q, k) is (h|Y) = −n 2log( 2_{) −}1 2 N i=1 log|Ki| − 1 22 N i=1 eT_iK−1i ei+ N i=1 log( _i) (8) where n=_iN₌₁ni denotes the total number of observations, Ki= Ki(F, q) and ei= Yi − Xib.

Explicit expressions for the score vector s_h and the observed information matrix I_hh are derived in Appendix A.

3. MAXIMUM-LIKELIHOOD ESTIMATION

Since the ML estimators of parameters in model (2) cannot be written in closed forms, the popular Newton–Raphson (NR) algorithm can be applied to calculate or approximate ML estimates iteratively. Starting from an initial point ˆh(0), the NR procedure proceeds according to

ˆh(k+1)= ˆh(k)+ ˆI−1(k)_hh ˆs(k)_h (9) where ˆs(k)_h and ˆI(k)_hh are the score vector and the observed information matrix evaluated at ˆh(k), respectively. Other related iterative algorithms such as steepest ascent, Davidon–Fletcher–Powell, conjugate gradient, quasi-Newton and Marquardt–Levenberg, see e.g. [16], which are gradient methods such as the NR algorithm, can lead to equivalent calculations.

An oft-voiced complaint of the NR algorithm is that it may not converge unless good starting values are used. The EM algorithm[17], which takes advantage of being insensitive to the starting values as a powerful computational tool that requires the construction of unobserved data, has been well developed and has become a broadly applicable approach to the iterative computation of ML estimates. One of the major reasons for the popularity of the EM algorithm is that the

(5)

M-step involves only complete data ML estimation, which is often computationally simple. More-over, the EM algorithm is stable and straightforward to implement since the iterations converge monotonically and no second derivatives are required.

When the M-step of EM turns out to be analytically intractable, it can be replaced with a sequence of conditional maximization (CM) steps. Such modification is referred to as the ECM algorithm [18]. The ECME algorithm [19], a faster extension of EM and ECM, is obtained by maximizing the constrained Q-function (the expected complete data function) with some CM steps that maximizes the corresponding constrained actual likelihood function, called the CML steps. However, computation to the solution of the CML step may involve a high-dimensional search problem for which the likelihood is not guaranteed to increase. In practice, a modified NR maximization procedure such as the step-halving method can be used to guarantee that each step increases the likelihood.

Lets = (1, . . . , N) be the latent vector, and let ˆh

(k)

= (ˆb(k), ˆ2(k)_{, ˆx}(k)_{), where ˆx}(k)_{= (vech} ( ˆF(k)), ˆq(k), ˆk(k)), denote the estimates of h at the kth iteration. Given the hierarchical representation

(3), the complete data log likelihood is

c(h|Y, s) = − n+ N 2 log( 2_{) −}1 2 N i=1 log|Wi| − 1 22 N i=1 {(Yi − Xib − dii)TW_i−1(Yi − Xib − dii) + 2i} (10)

Given the current estimateh(k), the E-step calculates Q(h|h(k)) = E(c(h|Y, s)|Y, h(k)). From

(10), calculation of Q(h|h(k)) requires the expressions of ˆ(k)_i = E(i|Y, h(k)) and ˆu(k)_i = E(2_i|Y, h(k)), which are readily evaluated according to (6). That is,

ˆ(k)_i = ˆ(k)_i + ˆ_i(k)ˆ(k)_i and ˆu(k)_i = ˆ2(k) i + ˆ 2(k) i + ˆ (k) i ˆ(k)i ˆ (k) i (11)

where ˆ(k)_i and ˆ(k)_i in (11) are_i andi in (5) evaluated ath = ˆh

(k)

, and ˆ(k)_i = (ˆ (k)_i )/(ˆ (k)_i ) with ˆ (k)_i = ˆ(k)

i /ˆ

(k)

i . The CM steps then conditionally maximize Q(h|ˆh

(k)

) with respect to h,

obtaining a new estimate ˆh(k+1). To update ˆb(k) and ˆ2(k), the closed-form maximizers for ˆb(k+1)

and ˆ2(k+1)are given by ˆb(k+1)i = _N i=1 XT_i ˆW−1(k)_i Xi −1_N i=1 XT_i ˆW−1(k)_i (Yi− ˆd(k)i ˆ(k)i ) ˆ2(k+1)₌ 1 n+ N _N i=1 ê_i(k+1)TW−1(k)_i ê(k+1)_i − 2 N i=1 î(k)ˆd (k)T i ˆW −1(k) i ê(k+1)i + N i=1 û(k)_i (1 + ˆd(k)_i TˆW−1(k)i ˆd(k)i ) whereê(k+1)_i = Yi− Xiˆb (k+1) i , ˆd(k)i = ZiˆF(k)ˆd (k) and ˆW(k)i = ZiˆF2(k)ZTi + Ci(ˆq(k)) − ˆd(k)i ˆd(k) T i .

(6)

Note that the maximization of Q(h|ˆh(k)) with respect to x = (F, q, k) does not have analytical solutions. We then perform a CML step by maximizing the constrained actual likelihood function with respect tox using the following one-cycle NR procedure:

ˆx(k+1)= ˆx(k)+ r(k)ˆI−1(k)_xx ˆs(k)_x

where ˆs(k)_x and ˆI(k)_xx are evaluated by using the current estimates, and the scaling number r(k), 0<r(k)1, is chosen at each iteration such that the actual likelihood function is non-decreasing in k.

The iterations are repeated until a suitable convergence rule is satisfied, e.g.ˆh(k+1)− ˆh(k) is sufficiently small. Unfortunately, there are instances in which this method is notoriously slow to converge. To accelerate the convergence, a faster hybrid ECME-NR algorithm is performed by running a moderate number of ECME iterations from a poor starting value until arriving in a near neighborhood of the ML estimates, and then switching to the NR algorithm (9) to speed up the convergence. Under some regularity conditions, the asymptotic variance–covariance estimates can be approximated by plugging the ML estimate ˆh = (ˆb, ˆ2, ˆF, ˆq, ˆk) into the inverse of the observed

information matrix.

4. A SCORE TEST FOR THE SKEWNESS PARAMETER

To assess whether the skewness parameterk is significantly different from zero, we use the score

test for testing the null hypothesis H0 : k = 0 against the alternative H1 : k = 0. The principal

advantage of using the score test in comparison with other competitive testing procedures such as the likelihood ratio or Wald tests is that it does not require the more complex model to be fitted. To obtain the score test statistic, we first calculate ML estimates under the null hypothesis, denoted by ˆh0, and then evaluate the score vector and observed information ath = ˆh0, denoted by[sh]_ˆh₀

and[I_hh]_ˆh

0, respectively. Note that

s_h_ˆh

0 has all components equal to 0 except for the derivative

with respect tok.

Letf = (b, 2, vech(F), q), so that h = (f, k). The score vector and observed information matrix

can be reexpressed as s_h= (sT_f, sT_k)T and I_hh= I_ff I_fk I_kf I_kk

respectively. The score test statistic Usis

Us= [sh]T_ˆh₀[Ihh]−1_ˆh 0[sh]ˆh0= [sk] T ˆh0[Ikk·f] −1 ˆh0[sk]ˆh0 (12)

where s_k= (_iN₌₁i* i/*1, . . . ,iN=1i* i/*q)Tand Ikk·f= Ikk− IkfI−1_ff Ifk. Explicit

expres-sions for s_kand I_kk·f are shown in Appendix A.

Under H0, Us is asymptotically a chi-squared random variable with q degrees of freedom. A

disadvantage of the score test statistic is that it tends to give too few significant results since its asymptotic distribution is approached more slowly than that of the likelihood ratio test statistic in small samples. This is due to the fact that the score test is a first-order approximation to the likelihood ratio test. For a practical perspective, we suggest that one performs the likelihood

(7)

ratio test to yield more accurate results when the score test gives a weak indication that the null hypothesis is inappropriate, say at the 10 per cent significance level.

5. INFERENCE FOR RANDOM EFFECTS AND PREDICTION

We consider an empirical Bayes inference for the random effects that is useful for evaluating subject-specific quantities of interest such as individual intercepts and slopes. From model (2), it implies that Yi | bi∼ Nni(Xib + Zibi,

2_C

i) and bi∼ SNq(0, 2C, k). The conditional density of bi given Yi is f(bi|Yi) = f(Yi|bi) f (bi) f(Yi|bi) f (bi) dbi = q(lbi|Yi, Rbi|Yi)(k T bbi) wherelbi|Yi= CZ T iK−1i (Yi − Xib), Rbi|Yi= 2_(C−1_{+ Z}T iC−1i Zi)−1 andkb= −1C−1/2k.

After some algebraic manipulations, the minimum mean-squared error (MSE) estimator of bi

obtained by the conditional mean of bi given Yi is given by

In practice, the empirical Bayes estimator of bi, ˆbi, can be obtained by substituting the ML

estimate ˆh into (13). As a consequence, it leads to ˆbi= ˆbi(ˆh). We mention in passing that the

detailed expression for the error covariance matrix of ˆbi(h) is somewhat complicated and hence is

not shown here. Furthermore, we are interested in the prediction of yi, a future × 1 vector of

mea-surements of Yi, given the observed measurements Y=(YT_(i), YT_i)T, where Y(i)=(YT₁, . . . , YT_i₋₁, YT_i₊₁, . . . , YT_N)T.

Let xi and zi denote q× m1and q× m2matrices of prediction regressors corresponding to yi.

Consequently, we assume that Yi yi ∼ SNni+(X∗ib, Xi, a∗i) where X∗_i = (XT_i, xT_i)T, Z∗_i = (ZT_i, zT_i)T,Xi= 2(Z∗iCZ∗Ti + C∗i) and ai∗= (1 + d∗Ti W∗−1i d∗i)−1/2 K∗1/2_i W∗−1_i d∗_i. Moreover, we note that d∗_i = Z∗_iFd, K∗i = Zi∗CZ∗Ti + C∗i, Wi∗= Ki∗− d∗id∗Ti and C∗_i = C∗_i(q) is an (ni + ) × (ni + ) expanded autocorrelation matrix. Let ti= X−1/2a∗_i and let ti, C∗_i andXi be partitioned conformably with Y∗_i = (YT_i, yT_i)T. Then,

ti = (t(1)Ti , t(2)Ti )T, C∗i = C11 C12 C21 C22 Xi = X11 X12 X21 X22 = ZiCZTi + C11 ZiCzTi + C12 ziCZiT+ C21 ziCzTi + C22 (14)

(8)

Here the indices for the partitioned matrices in (14) are omitted for notational simplicity. Also, we note that C11= Ci, C12= CT₂₁,X11= Ki andX12= XT21. Making use of the fact that

f(yi|Yi, h) ∝

f(yi|Yi, i, h) f (Yi|i, h) f (i|h) di

the conditional distribution of yi given Yi is

f(yi|Yi, h) = (l_2·1, X22·1)(t T i(Y∗i − X∗ib)) (˜tT i(Yi − Xib)) (15) where l2·1= xib + X21X−1₁₁(Yi − Xib), X22·1= X22− X21X−1₁₁X12 ˜ti = t (1) i + X−111X12t(2)i 1+ t(2)T_i X22·1t(2)_i

The minimum MSE predictor of yi is the conditional expectation of yi given Yi, i.e.

ˆyi(h) = E(yi|Yi, h) = l_2·1+ (˜t T i(Yi − Xib)) (˜tT i(Yi− Xib)) X22·1t(2)_i 1+ t(2)T_i X22·1t(2)i (16)

Meanwhile, the MSE covariance matrix of predictor (16) is given by

E[(ˆyi(h) − yi)(ˆyi(h) − yi)T]=E[cov(yi|Yi)]

where cov(yi|Yi) = X22·1− ˜tT i(Yi − Xib) + (˜t T i(Yi − Xib)) (˜tT i(Yi − Xib)) ×(˜tTi(Yi − Xib)) (˜tT i(Yi− Xib)) X22·1t(2)i ti(2)TX22·1 1+ t(2)T_i X22·1t(2)i (17)

Notice that expression (17) does not account for errors due to the estimation of unknown parameters and hence it tends to underestimate the true error covariance matrix. The prediction of

yi can be obtained by substituting the ML estimate ˆh into (16), leading to ˆyi= ˆyi(ˆh). Proofs of

(16) and (17) are sketched in Appendix B.

6. AN ILLUSTRATIVE EXAMPLE

The Framingham heart study has examined the role of serum cholesterol as a risk factor for the evolution of cardiovascular disease. Zhang and Davidian[20] proposed a semi-parametric approach to analyze a subset of the Framingham cholesterol data, which consist of gender, baseline age and

(9)

0 2 6 8 10 1.5 2.0 2.5 3.0 3.5 4.0 Years Cholesterol levels Male Female 4

Figure 1. Trajectories of cholesterol levels for 133 participants. The thicker solid line indicates the mean profile in the study.

cholesterol levels measured at the beginning of the study and then every 2 years over 10 years, for 200 randomly selected participants. Arellano-Valle et al.[14] analyzed the same data set by fitting a linear mixed-effects model to it with both random effects and within-subject errors following SN distributions. Both of them come to the same conclusion that the distribution of subject-specific intercepts is non-normal.

In this section, we revisit the Framingham cholesterol data with the aim of providing additional inferences for the use of SNLMM that are not considered in[14]. To simplify the illustration, we select 133 participants (60 men and 73 women) whose trajectories of cholesterol levels as well as covariates of interest are completely observed at the duration of follow-up time. Figure 1 displays the cholesterol profiles for the 133 participants and suggests that the underlying trend seems to be linear with time and increases with a slowly moving speed.

Assuming a linear growth model with subject-specific random intercepts and slopes, we fit a linear mixed model to the data as specified by[20]

Yi j= 0+ 1sexi+ 2agei+ 3ti j+ b0i+ b1iti j + i j, i = 1, . . . , 133, j = 1, . . . , 6 (18)

whereb = (0, 1, 2, 3)T are the fixed effects of explanatory variables and the random effects bi= (b0i, b1i)Tare assumed to have a bivariate SN distribution SN2(0, 2C, k) with k = (1, 2)T.

Moreover, the error terms ei= (i 1, . . . , i 6)T are assumed identically and independently N6(0,

2_I

(10)

Table I. ML estimation results for model (19) with the associated standard errors in parentheses, where F is the square root ofC such that C = F2.

Number of ˆ

b ˆF ˆ (ˆh) parameters AIC BIC

1.5740 1.6869 0.7068 0.2079 −100.89 8 217.78 240.90 (0.1713) (0.1253) (0.1238) (0.0104) −0.0338 0.9403 (0.0619) (0.1936) 0.01860 (0.0040) 0.2787 (0.0274)

over time. For numerical stability as mentioned in[20], Yi j are the cholesterol level divided by 100

at the j th time point for the i th participant, sexi is a gender indicator(0 = female, 1 = male), agei

is the i th participant’s age at baseline and ti j is taken as(time − 5)/10, with time measured in

years from the baseline.

To test for the existence of skewness preference for random effects, we start by fitting an ordinary NLMM as follows:

Yi= Xib + Zibi + ei, bi∼ N2(0, 2C)

ei∼ N6(0, 2I6), bi⊥ei, i = 1, . . . , 133

(19)

The resulting ML estimates, together with the maximized log-likelihood value (ˆh), and two penalized likelihood information criteria, AIC and BIC values (AIC =−2(ˆh)+2m; BIC = −2(ˆh)+

m log(N)), where m is the number of parameters, corresponding to the fitted NLMM (19) are listed

in Table I. The score test statistic for testing evidence of skewness among random effects gave a value of 12.17, which is highly significant compared with the 2₂ distribution. The score test suggests that the distribution of random effects is probably skewed. Figure 2 depicts histograms and corresponding normal quantile plots of the empirical Bayes estimates of bi, ˆbi= ˆCZT_i(ZiˆCZT_i + I6)−1(Yi− Xiˆb), and shows that the subject-specific intercepts are positively skewed. Conversely,

there are no apparent non-normal patterns for subject-specific slopes. We thus consider an alternate model

Yi= Xib + Zibi+ ei, bi∼ SN2(0, 2C, k)

ei∼ N6(0, 2I6), bi⊥ei, i = 1, . . . , 133

(20)

The summarized ML results corresponding to the fitted SNLMM (20) are shown in Table II. All estimates are significant (compared with their standard errors) with the exception of2. The results

reveal that the joint distribution of random effects may, to some extent, depart from normality. Accordingly, the fitted model (20) produces smaller AIC and BIC values than those of model (19), indicating that model (20) for characterizing a more plausible assumption on the random effects is our preferred choice.

After fitting model (19) to the Framingham cholesterol data, the individual random effects, ˆbi= (bi 1, bi 2), i = 1, . . . , 133, can be estimated by using (13) as stated in Section 5. Figure 3

(11)

-0.5 0.0 0.5 1.0 0 5 10 15 20 25 30

Empirical Bayes estimates of subject specific intercepts

Frequency

Quantiles of Standard Normal

-2 -1 0 1 2 -0.5 0.0 0.5 1.0 (a) -0.4 -0.2 0.0 0.2 0.4 0 10 20 30 40

Empirical Bayes estimates of subject specific slopes

Frequency

Quantiles of Standard Normal

-2 -1 0 1 2 -0.4 -0.2 0.0 0.2 0.4 (b)

Figure 2. Histogram and normal Q–Q plots of empirical Bayes estimates of: (a) subject-specific intercepts and (b) subject-specific slopes.

Table II. ML estimation results for model (20) with the associated standard errors in parentheses, where F is the square root ofC such that C = F2.

Number of ˆ

b ˆF ˆ kˆ (ˆh) parameters AIC BIC

1.2898 2.6799 0.2752 0.2077 4.7052 −94.18 10 208.36 237.26 (0.1731) (0.1977) (0.0883) (0.0103) (1.1702) −0.0382 0.9398 0.0000 (0.0614) (0.1569) (0.5173) 0.0151 (0.0036) 0.2341 (0.0273)

presents a scatter plot of estimated random effects overlaid on a set of contour lines, together with summary histograms of the marginal densities. A visual inspection of Figure 3 signifies that there is considerable asymmetry among the estimated random effects. Moreover, the contour level curves appear to follow somewhat satisfactorily the trend of a scattergram, reflecting the adequacy of using a bivariate SN distribution for the random effects.

(12)

random intercepts random slopes 0 0.3 −0.3 −0.3 0 0.3 0.6 0.9 1.2 1.5 1.8 0.6

Figure 3. Scatter plot of estimated random effects overlaid on a set of contour lines, together with summary histograms of the marginal densities.

7. SIMULATIONS

The prediction problem for longitudinal data is of great importance in a number of practical applications. As pointed out by Rao[21] and Lee [22], the predictive accuracy of future observations can be taken as an alternative measure of ‘fitness’ of the data. In this section, we present a simulation study in which attention is focused on comparing the predictive abilities of the NLMM and SNLMM models. In the simulations, we generate 500 Monte Carlo data sets from the model:

Yi j= 0+ ti j1+ wi2+ bi+ i j i= 1 . . . , N, j = 1, . . . , 6 (21)

where ti j= j, wi= 1 if iN/2 and is zero otherwise, 1= 2, 2= 1, i j∼ N(0, 22) and 0+ bi is

taken to have the Gamma(2, 1) distribution with density f (x) = x exp(−x) for yielding a highly right skewed distribution. For each generated data set, model (21) will be fitted twice under the model assumptions outlined in Section 2, with the density of bi represented by a SN distribution

and a normal distribution, respectively.

We use the pseudo-cross-validation approach to assess their predictive performances. This ap-proach of comparing forecasts with the corresponding actual values is in the spirit of cross validation of Stone[23] and Geisser [24]. The technique proceeds as follows: (i) drop out the last three mea-surements yi 4, yi 5, yi 6 on the i th participant; (ii) compute ML estimates using the remaining data

as the sample; (iii) prediction of yi= (yi 4, yi 5, yi 6), denoted by ˆyi= ( ˆyi 4, ˆyi 5, ˆyi 6), is made by

using formula (16). For each of the 500 generated Monte Carlo data sets, the procedure is repeated across subjects i= 1, . . . , N.

To evaluate prediction accuracies, the quantity of the empirical mean-squared forecast error

MSFE= 1 3N N i=1 (ˆyi− yi)T(ˆyi− yi)

is calculated as a discrepancy measure. On the basis of this measure, the best predictor corresponds to having the smallest MSFE from a given set of alternatives.

We consider the cases with sample sizes N= 30, 50, 100, 150 and comparisons of both predic-tors based on 500 replications are summarized in Table III. As expected, the numerical results indicated that the SNLMM predictor performs encouragingly well in all cases. In all 2000

(13)

Table III. Comparison of prediction accuracy of NLMM and SNLMM predictors in terms of mean and standard deviation of 500 simulated MSFEs.

N= 30 N= 50 N= 100 N= 150

Predictor Mean SD Mean SD Mean SD Mean SD

NLMM 1.2719 0.2010 1.2920 0.1593 1.2974 0.1116 1.3016 0.0876 SNLMM 1.2184 0.1936 1.2359 0.1516 1.2380 0.1063 1.2413 0.0826 0 2 4 6 8 10 12 0 10 20 30 40 N=50 % improvement (b) 0 2 4 6 8 10 12 0 10 20 30 40 N=30 % improvement (a) 0 2 4 6 8 10 12 0 10 20 30 40 N=100 % improvement (c) 0 2 4 6 8 10 12 0 10 20 30 40 N=150 % improvement (d)

Figure 4. Percentage improvement of the SNLMM predictor over the NLMM predictor for 500 simulated MSFEs: (a) N= 30; (b) N = 50; (c) N = 100; and (d) N = 150.

generated simulation data, the SNLMM predictors yield better forecasts (with lower MSFEs) than the classical NLMM predictors. Judging from this, it is important to emphasize that model specification with regard to random effects is noteworthy in prediction. Figure 4 displays the per-centage improvement of the SNLMM predictor over the NLMM predictor. It is readily seen that the gains in predictive accuracies obtained from the SNLMM model appear striking.

8. CONCLUSIONS

We have proposed an approach to the analysis of linear mixed models using a multivariate SN distribution for the random effects, which will allow practitioners to fit longitudinal data in

(14)

a wide variety of considerations. We have described a normal-truncated normal hierarchy for the SNLMM model and presented a hybrid ECME-NR algorithm for dealing with ML estimation in a flexible complete data framework. Moreover, the score test as well as prediction procedures considered in this paper are easy to implement once the ML estimates are obtained. Numerical results illustrated in Section 6 indicate that the SNLMM model for the Framingham cholesterol data is evidently more adequate than the conventional NLMM model. The simulation study shows that an appropriate specification of random effects can enhance predictive abilities.

There are a number of possible extensions of the current work. A natural generalization is to robustify the skew-normal distribution using a broader family such as the skew t [25–27] and the skew-elliptical[8, 28] distribution. Another worthwhile task is to develop workable Markov chain Monte Carlo algorithms in a Bayesian framework of hierarchical SNLMM models. Finally, one referee pointed out that it is also of interest to compare SNLMM and t linear mixed models [2, 29, 30] with respect to their robust inferences.

APPENDIX A: THE SCORE FUNCTION AND EMPIRICAL INFORMATION MATRIX

The score vector s_h is the vector of the first derivatives of (8) with respect toh = (b, , x), where x = (vech(F), q, d) with vech(F) being the distinct elements of F. Expressions for the elements

of s_h are s_b=*(h|Y) *b = 1 2 N i=1 XT_iK−1i ei− 1 N i=1 ini s=*(h|Y) * = − n + 1 3 N i=1 eT_iK−1i ei − 1 N i=1 i i [sx]r=*(h|Y) *r = − 1 2 N i=1 tr(K−1_i ˙Kir) + 1 22 N i=1 eT_iK−1_i ˙KirK−1_i ei+ N i=1 i˙ ir

wherei and i are as in (7),

ni= X T iW−1i di 1+ dT_iW−1_i di ˙Kir=*Ki *r and ˙ _ir= * i *r Note that ˙Kir= ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ Zi *F * fab FZT_i + ZiF *F * fab ZT_i if r= fab (a = 1, . . . , q; b = a, a + 1, . . . , q) *Ci *a if r= a (a = 1, . . . , g) 0 if r= b (b = 1, . . . , q) and ˙ ir= (˙dT irWi−1− dTiW−1i ˙WirW−1i )ei 1+ dT_iW−1_i di − i(2˙dTirWi−1di− dTiW−1i ˙WirW−1i di) 2(1 + dT_iW_i−1di)

(15)

where ˙dir= *di *r = ⎧ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎩ Zi *F * fab d if r= fab 0 if r= a ZiF*d *b if r= b

and ˙Wir= *Wi/*r= ˙Kir− ˙dirdTi − di˙dTir. Here,*F/* fabis clearly a binary indicator matrix and

the elements of*Ci/*a and*d/*b can be obtained straightforwardly.

The observed information I_hh, obtained by the negative of the second derivative of (8), has the following components: I_bb = 1 2 N i=1 (XT iKi−1Xi + i( i + i)ninTi) I_b = 2 3 N i=1 XT_iK−1_i ei − 1 2 N i=1 i(1 − i i− 2i)ni [Ibx]r = 1 2 N i=1 XT_iK−1i ˙KirK−1i ei+ 1 N i=1 (˙irni + i˙nir) I = −n 2 + 3 4 N i=1 eT_iK−1i ei+ 1 2 N i=1 i i( 2i + i i − 2) [Ix]r = 1 3 N i=1 e_iTK_i−1˙KirK−1_i ei + 1 N i=1 (˙ir i+ i˙ ir) [Ixx]r s = 1 2 N i=1 [tr(K−1i ¨Kir s) − tr(K−1i ˙Ki sK−1i ˙Kir)] + 1 2 N i=1 e_iTK_i−1˙KirK−1_i ˙Ki sK−1_i ei − 1 22 N i=1 eT_iK−1i ¨Kir sK−1i ei − N i=1 (˙ir˙ ir+ i¨ ir s) where ˙ir = *i *r= − i( i+ i)˙ ir ˙nir = *ni *r= − XT_i(W−1_i ˙WirW−1_i di− W_i−1˙dir) 1+ dT_iW−1_i di − ni(2˙dTirW−1i di− dTiW−1i ˙WirW−1i di) 2(1 + dT_iW−1_i di) ¨ ir s = *2 i *r*s = Aei 1+ dT_iW−1_i di −(˙dTirW−1i − dTiW−1i ˙WirWi−1)ei(2˙dTi sW−1i di − dTiW−1i ˙Wi sWi−1di) 2(1 + dT_iW−1i di)3/2

(16)

− ˙ i s(2˙dTirW−1i di − dTiW−1i ˙WirW−1i di) 2(1 + dT_iW−1_i di) − iB 2(1 + dT_iW−1_i di) − i(2˙dTirW−1i di− dTiW−1i ˙WirWi−1di)(2˙di sTWi−1di− dTiW−1i ˙Wi sW−1i di) 2(1 + dT_iW−1_i di)2 with A= ¨dT_{ir s}W−1i − ˙dirTW−1i ˙Wi sW−1i − ˙dTi sW−1i ˙WirW−1i + dTiWi−1˙Wi sW−1i ˙WirW−1i − dT iW−1i ¨Wir sW−1i + dTiW−1i ˙WirW−1i ˙Wi sWi−1 B= 2(¨dT_{ir s}W−1_i di − ˙dirTW−1i ˙Wi sW−1i di + ˙dTirW−1i ˙di s− ˙dTi sW−1i ˙WirW−1i di + dT iW−1i ˙Wi sW−1i ˙WirW−1i di) − d T iW−1i ¨Wir sW−1i di ¨Wir s= ¨Kir s − ¨dir sdTi − ˙dir˙dTi s− ˙di s˙dTir− di¨dir sT ¨Kir s= *Ki *r*s= ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ *2 Ci *ab if r= a and s= b 0 o.w. and ¨dir s= * 2 di *r*s = ⎧ ⎪ ⎨ ⎪ ⎩ Zi *F * fab *d *c ifr= fab and s= c 0 o.w.

APPENDIX B: PROOFS OF (16) AND (17)

By applying Proposition 4 of[6], the moment generating function of the conditional density in (15) is given by Myi|Yi(t) = exp tTl_2·1+1₂tTX22·1t (˜tT i(Yi− Xib)) (t(1)Ti (Yi − Xib) + t (2)T i (yi − xib)) ×(2)−/2|X22·1|−1/2exp{−1₂(yi− l2·1− X22·1t)TX−1_22·1(yi− l2·1− X22·1t)}dyi =exp tTl2·1+12tTX22·1t (˜tT i(Yi− Xib)) ⎛ ⎝˜tT i(Yi− Xib) + t(2)T_i X22·1t 1+ t(2)T_i X22·1t(2)_i ⎞ ⎠ (t ∈ R₎

(17)

Differentiating Myi|Yi(t) with respect to t we get M_y i|Yi(t) = (l2·1+ X22·1t) exp(tTl2·1+12tTX22·1t) (˜tT i(Yi − Xib)) ⎛ ⎝˜tT i(Yi − Xib) + t (2)T i X22·1t 1+ t(2)T_i X22·1t(2)_i ⎞ ⎠ +exp tTl_2·1+1₂tTX22·1t (˜tT i(Yi − Xib)) ⎛ ⎝˜tT i(Yi − Xib) + t(2)T_i X22·1t 1+ t(2)T_i X22·1t(2)i ⎞ ⎠ × X22·1t (2) i 1+ t(2)T_i X22·1t(2)_i Substituting t= 0 into M_y

i|Yi(t), we complete the proof of (16). It therefore suffices to verify

that E(yiyTi|Yi, h) = * *tTM yi|Yi(t) t=0 = X22·1+ l2·1lT2·1+ (˜t T i(Yi − Xib)) (˜tT i(Yi − Xib)) ⎡ ⎣ X22·1t(2)i lT2·1 1+ t(2)T_i X22·1t(2)_i + l2·1t (2)T i X22·1 1+ t(2)T_i X22·1t(2)i − ˜tT i(Yi− Xib) X22·1t(2)_i t(2)T_i X22·1 1+ t(2)T_i X22·1t(2)_i ⎤ ⎦

Using the fact that cov(yi|Yi, h) = E(yiyiT|Yi, h) − E(yi|Yi, h)E(yi|Yi, h)T, the result (17) is

then proved.

ACKNOWLEDGEMENTS

The authors would like to thank Prof. Arellano-Valle for providing the Framingham cholesterol data at an early stage of this work. We are also grateful to the editor and two anonymous referees for their valuable comments and suggestions, which led to substantial improvements in the presentation of this work. This research was partly supported by the National Science Council of Taiwan (grant no. NSC95-2118-M-005-001-MY2).

REFERENCES

1. Laird NM, Ware JH. Random effects models for longitudinal data. Biometrics 1982; 38:963–974.

2. Pinheiro JC, Liu CH, Wu YN. Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. Journal of Computational and Graphical Statistics 2001; 10:249–276.

3. Verberk G, Lesaffre E. A linear mixed-effects model with heterogeneity in the random-effects population. Journal

of the American Statistical Association 1996; 91:217–221.

4. Lin TI, Lee JC. On modelling data from degradation sample paths over time. Australian and New Zealand

Journal of Statistics 2003; 45:257–270.

5. Lee JC, Lin TI, Lee KJ, Hsu YL. Bayesian analysis of Box–Cox transformed linear mixed models with ARMA( p, q) dependence. Journal of Statistical Planning and Inference 2005; 133:435–451.

(18)

6. Azzalini A, Dalla Valle A. The multivariate skew-normal distribution. Biometrika 1996; 83:715–726.

7. Azzalini A, Capitaino A. Statistical applications of the multivariate skew-normal distribution. Journal of the

Royal Statistical Society, Series B 1999; 61:579–602.

8. Sahu SK, Dey DK, Branco MD. A new class of multivariate skew distributions with application to Bayesian regression models. Canadian Journal of Statistics 2003; 31:129–150.

9. Arellano-Valle RB, Genton MG. On fundamental skew distributions. Journal of Multivariate Analysis 2005; 96:93–116.

10. Gonz´alez-Farias G, Dom´ınguez-Molina A, Gupta AK. Additive properties of skew normal random vectors. Journal

of Statistical Planning and Inference 2004; 126:521–534.

11. Liseo B, Loperfido N. A Bayesian interpretation of the multivariate skew-normal distribution. Statistics and

Probability Letters 2003; 61:395–401.

12. Azzalini A. The skew-normal distribution and related multivariate families (with discussion). Scandinavian Journal

of Statistics 2005; 32:159–200.

13. Arellano-Valle RB, Azzalini A. On the unification of families of skew-normal distributions. Scandinavian Journal

of Statistics 2006; 33:561–574.

14. Arellano-Valle RB, Bolfarine H, Lachos VH. Skew-normal linear mixed models. Journal of Data Science 2005; 3:415–438.

15. Genton MG. Skew-elliptical Distributions and Their Applications. Chapman & Hall: New York, 2004. 16. Thisted RA. Elements of Statistical Computing: Numerical Computation. Chapman & Hall: New York, 1988. 17. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with

discussion). Journal of the Royal Statistical Society, Series B 1977; 39:1–38.

18. Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 1993; 80:267–278.

19. Liu CH, Rubin DB. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence.

Biometrika 1994; 81:633–648.

20. Zhang D, Davidian M. Linear mixed models with flexible distributions of random effects for longitudinal data.

Biometrics 2001; 57:795–802.

21. Rao CR. Prediction of future observations in growth curve models. Statistical Science 1987; 2:434–471. 22. Lee JC. Prediction and estimation of growth curve with special covariance structures. Journal of the American

Statistical Association 1988; 83:432–440.

23. Stone M. Cross-validatory choice and assessment of statistical prediction (with discussion). Journal of the Royal

Statistical Society, Series B 1974; 36:111–147.

24. Geisser S. The predictive sample reuse method with applications. Journal of the American Statistical Association 1975; 70:320–328.

25. Jones MC, Faddy MJ. A skew extension of the t-distribution, with applications. Journal of the Royal Statistical

Society, Series B 2003; 65:159–174.

26. Azzalini A, Capitaino A. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society, Series B 2003; 65:367–389.

27. Lin TI, Lee JC, Hsieh WJ. Robust mixture modeling using the skew t distribution. Statistics and Computing 2007; 17:81–92.

28. Branco MD, Dey DK. A general class of multivariate skew elliptical distributions. Journal of Multivariate

Analysis 2001; 79:99–113.

29. Lin TI, Lee JC. A robust approach to t linear mixed models applied to multiple sclerosis data. Statistics in

Medicine 2006; 25:1397–1412.

30. Lin TI, Lee JC. Bayesian analysis of hierarchical linear mixed modeling using the multivariate t distribution.