The Method of Instrumental Variables
for Generalized Linear Models
Fu-Chang Hu, Shu-Hui Lai, Tien-Lung Tsai, and
Wen-Yi Shau
1Division of Biostatistics
Graduate Institute of Epidemiology
College of Public Health
National Taiwan University Hospital
Taipei, Taiwan 100
R.O.C.
December 31, 2002
1Correspondence: Fu-Chang Hu, Division of Biostatistics, Graduate Institute of Epidemiology,
College of Public Health, National Taiwan University, 1 Jen-Ai Road, Section 1, Room 1541, Taipei, Taiwan 100, R.O.C., Phone: +886 2 / 2394-2050 or +886 2 / 2312-3456 × 8349, Fax: +886 2 /
2351-1955, Email: [email protected]. This project was financially supported by the National Science Council of the Republic of China (NSC 89-2118-M-002-008 and NSC 90-2118-M-002-010).
The method of instrumental variables (IVs) has been developed historically and used extensively in econometrics for estimating the structural parameters in statisti-cal models where some regressors are correlated with the model’s error. Thus, as discussed by Bowden and Turkington (1984, pp. 3-10), the IV estimation method has four important applications: (1) the error-in-variable problem, (2) the self-selection problem, (3) the simultaneous equations model, and (4) the time series problem. In this study, we generalize the current IV method for linear and nonlinear equations to allow the structural equations to be in a generalized linear model (GLM) form so that the response variables of those equations can be different kinds of measurements such as continuous, binary, and count. However, when the structural equation is of a GLM form, some additional difficulties arise for the IV method because the link function of a GLM makes the mean of the response variable nonlinear in regression coefficients and the variance function of the response variable in a GLM usually de-pends on its mean. Thus, in general, the IV estimator for a GLM equation derived by minimizing a quadratic form like the one defined for a linear or nonlinear equation (see, e.g., Amemiya 1985, p. 246, Eq. (8.1.2)) would be inconsistent. However, by applying the theory of estimating functions (Godambe and Kale 1991, Sec. 1.5, pp. 9-10, Desmond 1991, p. 140, and McCullagh 1991, Sec. 11.3, esp., Eq. (3.2), pp. 269-271), we develop a general IV methodology for GLMs, which has an applicability especially in the measurement error problem and the simultaneous-equations bias of GLMs. The statistical properties of our IV estimator are examined. The simulation studies show promising results when it is applied to estimate the structural parame-ters in our generalized factor analysis and generalized path analysis (or generalized simultaneous equations model).
Keywords:
Error-in-variable problem, Measurement error problem, Factor analysis, Simultaneous equations model, Structural equation model, Nonlinear model, Discrete response.
Contents
1
Review
1
1.1 Generalized Linear Models (GLMs) . . . 1
1.1.1 Specification . . . 1
1.1.2 Examples . . . 1
1.1.3 Estimation. . . 3
1.1.4 Inference . . . 3
1.2 Instrumental Variables (IVs) . . . 4
1.2.1 Definition . . . 4
1.2.2 Applications. . . 6
2
Problem: IV for GLMs?
7
2.1 Problem . . . 72.2 Solution . . . 7
3
Estimation: The IRIV Algorithm
8
3.1 Rationale . . . 83.2 Remarks . . . 12
4
Example: The GFA Model
13
4.1 Motivation . . . 13 4.2 Specification . . . 14 4.3 Estimation . . . 15 4.4 Simulation . . . 17 4.4.1 Setup . . . 17 4.4.2 Result . . . 195
References
20
6
Tables
24
1
Review
1.1
Generalized Linear Models (GLMs)
1.1.1 Specification
1.1.3 Estimation
As shown by Dobson (1990, Chap. 4 and Appendix B, pp. 36-46 and 145-146) and McCullagh and Nelder (1989, Sec. 2.5, pp. 40-43), the maximum likelihood estimators (MLEs) of the unknown regression coefficients β in a GLM can be obtained by a
unified iterative reweighted least squares (IRLS) algorithm. 1.1.4 Inference
As discussed in Dobson (1990, Chap. 5, pp. 49-66), the key to the statistical inference for GLMs is the sampling distribution of the score vectorU (β), which is asymptotically
multivariate normal based on the central limit theorem (CLT). Then, the inference on the regression coefficientsβ is based on the first-order Taylor expansion of U (β)for
b
1.2
Instrumental Variables (IVs)
1.2.1 Definition
The method of instrumental variables (IVs) has been developed historically and used extensively in econometrics for estimating the structural parameters in statistical mod-els where the regressor is correlated with the error term. For easy computation, let the ”convergence in probability” (i.e., p
−→) be denoted by ”plim” hereinafter.
1.
The IV Estimators for a Linear Equation
:
A linear equation is specified as
Y = Xβ +
where Y is an n × 1 vector of response variables, X is an n × p matrix of covariates,
β is a p × 1 vector of parameters, and is an n × 1 vector of errors. Suppose that
plim 1 nX T 6= 0 so that the
2.
The IV Estimators for a Nonlinear Equation
:
Next, a nonlinear equation can be specified as
Y = f (X; β) + .
One way to generalize the IV estimators from linear equations to nonlinear ones is to replace(Y − Xβ)by Y − f (X; β) in the minimization of the corresponding quadratic
form. Thus, given an instrument Z for X, a consistent IV estimator βeIV is the value
of β that minimizes the following quadratic form S2(β | W) =
1.2.2 Applications
As discussed by Bowden and Turkington (1984, pp. 3-10), the IV estimation method has four important applications:
1. The error-in-variable problem. √
2. The self-selection problem.
3. The simultaneous equations model. √
4. The time series problem.
2
Problem: IV for GLMs?
In this study, we generalize the current IV method for linear and nonlinear equations to allow the structural equations to be in a generalized linear model (GLM) form so that the response variables of such equations can be various kinds of measurements such as continuous, binary, or counts.
2.1
Problem
However, when the structural equation is of a GLM form, some additional difficulties arise for the IV method because
1. the link function of a GLM makes it ”nonlinear” and
2. the variance function of the response variable in a GLM usually depends on its mean.
Thus, in general, the IV estimator for a GLM equation derived by minimizing a quadratic form like the one defined for a linear or nonlinear equation (see, e.g., Amemiya 1985, p. 246, Eq. (8.1.2)) would be inconsistent.
2.2
Solution
Yet, by applying
1. the minimization approach of the IV method and
2. the theory of estimating functions for GLMs (Godambe and Kale 1991, Sec. 1.5, pp. 9-10 and McCullagh 1991, p. 269, Eq. (3.2)),
we develop a general IV for GLMs toward a general IV methodology, which has an applicability especially in the error-in-variable problem and the simultaneous equations of GLMs.
3
Estimation: The IRIV Algorithm
By modifying the iterative reweighted least squares (IRLS) algorithm, we develop the iterative reweighted instrumental variables (IRIV) algorithm for GLMs.
Then, the score function is
U (β) ≡ XTg0(µ)−1A(ATV A)−1AT(y − µ(β)).
Notice that the ”-1” can not be directly applied to the matrixes AT, V, and A inside
the ”( )” since A is not a square matrix.
It can be shown that the mean ofU is E(U ) = 0
and the variance of U, called the information, is
V ar(U ) ≡ E(U2) = − E(U0)
where
U0 ≡ ∂U
∂β
and the elements of the information matrix I are
I ≡ V ar(U (β)) = XTg0(µ)−1A(ATV A)−1ATg0(µ)−1X
On the other hand,
− E(U0) = − E XTg0(µ)−1A(ATV A)−1AT ∂ ∂β AT(y − µ(β)) − E ∂ ∂β XTg0(µ)−1A(ATV A)−1 AT(y − µ(β)) = XTg0(µ)−1A(ATV A)−1AT∂µ ∂β = XTg0(µ)−1A(ATV A)−1ATg0(µ)−1X = XTW?X where W? = g0(µ)−1A(ATV A)−1ATg0(µ)−1.
Thus, we have shown
If the Newton-Raphson method is used to obtain the MLEsβ of β, then at the mth iteration, b β(m) = βb(m−1)− ∂2
Then,
I(m−1)βb(m−1)+ U(m−1) = XTW?(m−1)Z(m−1).
3.2
Remarks
We find that ”the IRLS algorithm can be used to linearize GLMs” and suggest to use it as a tool in developing the estimation method for more complicated statistical models with responses of a mixed type. Three remarks are made in this regard.
1. At each iteration indexed by(m), the IRLS algorithm takes a special
transforma-tion on the original response variable Yi to obtain a pseudo-response variable
Zi, no matter what type of Yi is, to modify the property of the original response
variable such that the original GLM
g(µi) = xTi β
becomes a linear regression model
E h Zi(m) i = xTi β (m)
until the convergence of βb. In the linear regression case, the IRLS algorithm
reduces to the usual weighted least squares (WLS) procedure.
2. Then, instead of writing specific computing programs for a variety of GLMs according to their likelihood functions to obtain MLEs, one can just apply one single unified IRLS algorithm for all GLMs. See, for example, thePROC GENMOD
in the SAS for Windows and theglm function in the S-PLUS for Windows.
3. Most importantly, the derived linear regression model for the pseudo-response variable Zi at the last iteration of the IRLS algorithm provides an ”equivalent”
linear regression model for the original GLM in the sense that they have the same values of the regression coefficients.
4
Example: The GFA Model
4.2
Specification
A GFA model is of the following form
gX1(µX1) = λ11F1+ λ12F2+ · · · + λ1mFm
4.3
Estimation
We introduce the EM-CFA, IV1, and IV2 methods for estimating the factor loadings of a GFA model in the following simulation, among which the IV2 estimator is obtained through the IRIV algorithm.
1. The EM-CFA Estimator
2. The Instrumental Variable #1 (IV1) Estimator:
Notice that this method is the same as that of Carroll and Stefanski (1994). See Carroll and Stefanski (1994) and Carroll, Ruppert, and Stefanski (1995) for the technical details.
3. The Instrumental Variable #2 (IV2) Estimator:
• Step 0: Fit the GLM for the response variable Y2 on the covariate Y1 using
the IRLS algorithm to obtain the initial estimate of the factor loadings λ2
which is the GLM’s slope coefficient, and fit the GLM for the response variable Y3 on the covariateY1 using the IRLS algorithm to obtain the initial
estimate of the factor loadings λ3 which is the GLM’s slope coefficient,
which are the naive estimates of ΛY = [1, λ2, λ3]T.
• Step 1: Obtain the predicted values of Y1|2 and Y1|3,Yb1|2 andYb1|3, as the IV
using the ordinary linear model (LM).
• Step 2: Treat the pseudo-response variables Zi’s derived from the
indica-tor variables Yi’s in the IRLS algorithm of GLM as observed continuous
indicator variables to obtain the updated estimates of the factor loadings
ΛY = [1, λ2, λ3]T using the IRLS algorithm through the following formula,
and iterate Zi and ΛY until ΛY converge:
ΛY2 = [λ20, λ2] T = h Y1Tg 0 (µ)−1Yb1|3( bY1|3TV bY1|3)−1Yb1|3Tg 0 (µ)−1Y1 i−1 × Y1Tg 0 (µ)−1Yb1|3( bY1|3TV bY1|3)−1Yb1|3Tg 0 (µ)−1Z2, where V = V ar(Y2) ΛY3 = [λ30, λ3] T = h Y1Tg 0 (µ)−1Yb1|2( bY1|2TV bY1|2)−1Yb1|2Tg 0 (µ)−1Y1 i−1 × Y1Tg 0 (µ)−1Yb1|2( bY1|2TV bY1|2)−1Yb1|2Tg 0 (µ)−1Z3, where V = V ar(Y3).
4. The Naive Estimator:
For the purpose of comparison, we also compute the naive estimates ofeλ2 and
eλ3 by directly fitting the GLMs with the proxy of the latent variable F to the data,
4.4
Simulation
4.4.1 Setup
Recall that the previously specified one-factor three-indicator GFA model is
Y1 = α1+ 1 · F + δY1
logit(µ2) = α2+ λ2F
log(µ3) = α3+ λ3F
where µ2 and µ3 are the means of the indicator variables Y2 and Y3, the commonly
used ”logit” and ”log” link functions are chosen forµ2iandµ3i,δY1 is the measurement error of the indicator variable Y1, and the latent variable F is a continuous variable
measured by three indicator variables Y1, Y2, and Y3 of a mixed type — continuous,
In the following simulations, the sample size (n) is 1000 and the number of
repetition (m) is 500. We follow the following steps to generate the simulated data.
1. Firstly, we set the true parameter values. (a) For simplicity, we set all the intercepts t
4.4.2 Result
The simulation results are listed in Table 4. We find the following interesting results. 1. The naive estimator always seriously underestimates the true values of (λ2, λ3).
2. General speaking, the EM-CFA, IV1, and IV2 estimators perform well for an ”identity” or a ”log” link function.
3. For logistic link,
(a) EM-CFA estimator performs well except when the true values of λ2 = 1.25:
|bλ2− λ2| = |1.4022 − 1.25| = 0.1522.
(b) IV1 estimator performs well except when the true values of λ2 = 1.0 and
1.25:
• λ2 = 1.0: |1.2453 − 1.0| = 0.2453
• λ2 = 1.25: |2.2607 − 1.25| = 1.0107
(c) IV2 estimator performs well except when the true values of λ2 = 0.25:
|bλ2− λ2| = |0.1882 − 0.25| = 0.0618.
Among them, EM-CFA estimator always has a larger variance.
4. When the values of λ2 and λ3 are different such as (0.5, 1.0)and (1.0, 0.5), the
performance of IV1 has a larger bias.
5. In all these values of (λ2, λ3), IV2 always has less bias and smaller variance
except when (λ2, λ3) = (0.25, 0.25),
• λ2 = 0.25: |0.1882 − 0.25| = 0.0618
5
References
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York, NY: John Wiley & Sons.
Bickel, P. J. and Doksum, K. A. (1977). Mathematical Statistics: Basic Ideas and Se-lected Topics. Oakland, CA: Holden-Day.
Cox, D. R. and Snell, E. J. (1981). Applied Statistics: Principles and Examples. Lon-don: Chapman and Hall.
Dobson, A. J. (1990). An Introduction to Generalized Linear Models. London: Chap-man & Hall.
Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modelling Based on Gener-alized Linear Models. New York, NY: Springer-Verlag.
Firth, D. (1991). Generalized Linear Models. In: D. V. Hinkley, N. Reid, and E. J. roduc-d(20)Tj/F2 112.58l.51l.51lExamplesG1, and E.In-.
Anderson, T. W. (1956). Statistical inference in factor analysis. The Third Berkeley Symposium in Mathematical Statistics and Probability, 5, pp. 111-150.
Bartholomew, D. J. (1984). The foundations of factor analysis. Biometrika, 71, pp. 221-232.
Bartholomew, D. J. (1987). Latent Variable Models and Factor Analysis. London (UK): Charles Griffin & Company.
Basilevsky, A. (1983). Applied Matrix Algebra in the Statistical Sciences. New York, NY: North-Holland.
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications. New York, NY: John Wiley & Sons.
Bock, R. D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, pp. 443-459. Bock, R. D. and Lieberman, M (1970). Fitting a response model for n dichotomously
score items. Psychometrika, 35, pp. 179-197.
Bollen, K. A. (1989). Structural Equations with Latent Variables. New York, NY: John Wiley & Sons.
Bowden, R. J. and Turkington, D. A. (1984). Instrumental Variables. New York, NY: Cambridge University Press.
Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). Measurement Error in Nonlin-ear Models. London (UK): Chapman & Hall.
Carroll, R. J. and Stefanski, L. A. (1994). Measurement error, instrumental variables and corrections for attenuation with applications to meta-analysis. Statistics in Medicine, 13, pp. 1265-1282.
Cassella, G. and Berger, R. L. (1990). Statistical Inference. Belmont, CA: Duxbury Press. Chambers, J. M. and Hastie, T. J. (1993). Statistical Models in S. London (UK):
Chap-man & Hall.
Cox, D. R. and Hinkly, D. V. etc. (1991). Statistical Theory and Modelling. London (UK): Chapman & Hall.
Dobson, A. J. (1990). An Introduction to Generalized Linear Models. London (UK): Chapman & Hall.
Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modeling Based on General-ized Linear Models. New York, NY: Springer-Verlag.
Godambe, V. P. (1991). Estimating Functions. New York, NY: Oxford University Press. Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied
Sta-tistics. Belmont, CA: Duxbury Press.
J¨oreskog, K. G. (1993). Testing Structural Equation Models. In: K. A. Bollen and J. S. Long (Eds), Testing Structural Equation Models. Newbury Park, CA: Sage, pp. 294-316.
Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, 4th Ed. London (UK): Prentice-Hall International.
Hatcher, L. (1994). A Step-by-Step Approach to Using the SAS System for Factor Analysis and Structural Equation Modeling. Cary, NC: SAS Institute.
Lawley, D. N. and Maxwell, A. E. (1971). Factor Analysis as a Statistical Method. New York, NY: American Elsevier Publishing Company.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd Ed. London (UK): Chapman & Hall.
Mooijaart, A. (1983). Two kinds of factor analysis for ordered categorical variables. Multivariate Behavioral Research, 18, pp. 423-441.
Mooijaart, A. (1985). Factor analysis for non-normal variables. Psychometrika, 50, pp. 323-342.
Mulaik, S. A. (1972). The Foundations of Factor Analysis. New York, NY: Mcgraw-Hall.
Mulaik, S. A. (1986). Factor analysis and Psychometrika: Major developments. Psy-chometrika, 51, pp. 23-33.
Mulaik, S. A. and McDonald, R. P. (1978). The effect of additional variables on factor indeterminacy in models with a single common factor. Psychometrika, 43, pp. 177-192.
Noble, B. and Daniel, J. W. (1977). Applied Linear Algebra, 4th Ed. Englewood Cliffs, NJ: Prentice-Hall.
Rawlings, J. O., Pantula, S. G., and Dickey, D. A. (1998). Applied Regression Analy-sis: A Research Tool, 2nd Ed. New York, NY: Springer-Verlag.
Rubin, D. B. and Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psy-chometrika, 47, pp. 69-76.
Rubin, D. B. and Thayer, D. T. (1983). More on EM for ML factor analysis. Psychome-trika, 48, pp. 253-257.
Schumacker, R. E. and Lomax, R. G. (1996). A Beginner’s Guide to Structural Equa-tion Modeling. Mahwah, NJ: Lawrence Erlbaum Associates.
White, H. (1984). Asymptotic Theory for Econometricians. Orlando, FL: Academic Press.
Table 4: The Simulation Results for a One-Factor
Three-Indicator GFA Model
V ar(δY1) = 0.25. Sample size(n) = 1000. Number of repetitions (m) = 500.
Estimators λ2 λ3 True Value 0.25 0.25 EM-CFA Mean 0.25092649693187 0.25077280393814 SD (0.27188917804651) (0.25779815451420) IV1 Mean 0.23917803559980 0.24175813023387 SD (0.25016339007073) (0.26395594629856) IV2 Mean 0.18820484324542 0.18296473232450 SD (0.18962223003058) (0.18430516904135) Naive Mean 0.20385240382896 0.19896037286348 SD (0.05616941128867) (0.02829132854139) True Value 0.5 0.5 EM-CFA Mean 0.506262456399002 0.52789336850906 SD (0.169496397921747) (0.164492467426508) IV1 Mean 0.479948009028105 0.49864687709039 SD (0.143898300631222) (0.140166490686705) IV2 Mean 0.491471572991417 0.528938137102815 SD (0.14695309863652) (0.14027973639975) Naive Mean 0.394203707107843 0.39879624899906 SD (0.060496465011727) (0.026884175697182) True Value 0.75 0.75 EM-CFA Mean 0.770805776916108 0.803566453545486 SD (0.148548512532787) (0.112550796094448) IV1 Mean 0.755992332797742 0.755170694969821 SD (0.125516560084339) (0.101351033826447) IV2 Mean 0.741523132695775 0.75203472185432 SD (0.122236971276649) (0.10145359485326) Naive Mean 0.584053624312387 0.598897441281704 SD (0.061932729997401) (0.028905630245148) True Value 1.0 1.0 EM-CFA Mean 1.06302978540374 1.0371487334856 SD (0.185307670785744) (0.0905529213065184) IV1 Mean 1.24528805940761 1.00881819975644 SD (0.235194514154233) (0.0949600132097685) IV2 Mean 0.977987102582031 1.01534135675285 SD (0.139790973079505) (0.08842560485614) Naive Mean 0.765614819012559 0.799832361822084 SD (0.071004841008492) (0.0335064195148907)
Estimators True Value 1.25 1.25 EM-CFA Mean 1.40220449509103 1.16619704621377 SD (0.22217889066854) (0.14573630432616) IV1 Mean 2.26072152841269 1.24919145135057 SD (0.59010017172323) (0.09365187995617) IV2 Mean 1.23995950680057 1.27412203789911 SD (0.18319664780851) (0.13824740378273) Naive Mean 0.93681694772132 0.98987065901224 SD (0.07050462839213) (0.04518235185069) True Value 0.5 1.0 EM-CFA Mean 0.51501650180935 1.01830285913052 SD (0.12144561744652) (0.1272447814200) IV1 Mean 0.57788827699316 1.00381246264769 SD (0.16056702818523) (0.16801078221817) IV2 Mean 0.49950397735152 1.06309367337369 SD (0.11158406617320) (0.26484241789319) Naive Mean 0.39673543565992 0.79998715259196 SD (0.06377698891446) (0.03481135326788) True Value 1.0 0.5 EM-CFA Mean 1.02781523803823 0.525066319497778 SD (0.232544586692135) (0.0894179464868579) IV1 Mean 0.881558911810525 0.502297498691286 SD