
7 Non-experimental Evaluations

7.1 The Problem of Causal Inference in Non-experimental Evaluations

Without invoking the very nonexperimental methods they seek to avoid, social experiments cannot address many questions of interest to researchers and policymakers. Even if they could, such data are generally not available. As a result, analysts must rely on “observational” or non-experimental methods to address the problem of selection bias resulting from non-random participation of individuals in employment and training programs.

In an experimental evaluation, information from the control group is used to fill in missing counterfactual data for the treatments. As we have seen, under the assumptions specified in Section 5, an experiment is most successful in generating certain counterfactual means. In a non-experimental evaluation, analysts must replace these missing data with data on non-participants along with assumptions different from those invoked when using the method of social experiments.

To illustrate this point and to highlight an important distinction between experimental and nonexperimental solutions to the evaluation problem, consider Figure 7.1. It presents a model of potential outcomes in which each outcome takes on one of two possible values. For training participants, $Y_1$ equals one if the individual is employed after completing training and equals zero otherwise. For non-participants, $Y_0$ is defined similarly. As before, $D = 1$ for persons who select into training (but who may be excluded in an experimental evaluation) and $D = 0$ otherwise. When program evaluators have access to experimental data, they observe both $Y_1$ and $Y_0$ (but never both at the same time for the same person) for persons who select into training. That is, they observe the row and column totals for the $D = 1$ table, but not the proportion of persons for whom $D = 1$ who are in each individual cell.

For example, the experimental controls enable the analyst to estimate the proportion of the persons selecting into training ($D = 1$) who would not have been employed in the absence of training, denoted $P_{0\cdot}^{1}$, but not the proportion of persons selecting into training who would not have been employed either with or without training, denoted $P_{00}^{1}$. In order to estimate this proportion, we require another assumption, such as that training did not cause anyone to be non-employed who otherwise would have been employed. This “monotonicity” assumption (training can only make people better off), first invoked in Heckman and Smith (1993), allows us to set $P_{10}^{1} = 0$. In that case we can fill in the remaining elements of the table using the row and column totals. The proportion of trainees whose employment status changes as a result of training is now given by $P_{01}^{1}$. When the monotonicity assumption is imposed on the data from experimental evaluations of training, $P_{01}^{1}$ is typically relatively small (see, e.g., Heckman and Smith, 1993): training causes a relatively small proportion of trainees to switch from the non-employment state to the employment state.
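The way monotonicity pins down the individual cells can be sketched numerically. The following is a minimal illustration, not part of the original analysis: the function name and the two employment rates are hypothetical, and cells are indexed as (value of Y0, value of Y1) for persons with D = 1.

```python
def fill_cells_under_monotonicity(p_y0_employed, p_y1_employed):
    """Fill the 2x2 table of (Y0, Y1) for D = 1 from its two marginals.

    p_y0_employed: Pr(Y0 = 1 | D = 1), estimated from experimental controls.
    p_y1_employed: Pr(Y1 = 1 | D = 1), estimated from the treatment group.
    Monotonicity (training never destroys employment) sets cell (1, 0) to zero.
    """
    if p_y1_employed < p_y0_employed:
        raise ValueError("monotonicity requires Pr(Y1 = 1) >= Pr(Y0 = 1)")
    p10 = 0.0                          # employed without, not employed with training
    p11 = p_y0_employed - p10          # employed in both states
    p01 = p_y1_employed - p11          # switched into employment by training
    p00 = 1.0 - p11 - p01 - p10        # not employed in either state
    return {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}


# Hypothetical marginals, chosen only for illustration.
cells = fill_cells_under_monotonicity(p_y0_employed=0.60, p_y1_employed=0.65)
```

Without the restriction on cell (1, 0), the same two marginals are consistent with a range of values for every cell, which is why the assumption is needed.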

Analysts who have access only to nonexperimental data observe only the column totals in the $D = 1$ table and the row totals in the $D = 0$ table. In addition, the proportion of people who take training is known. This can be determined from an experiment that randomizes eligibility but not from an experiment that randomizes among those who apply and are accepted into the program. The remaining elements of both tables, including the other row and column totals, are unknown. The task in observational studies is to find a set of conditioning variables and to impose an appropriate set of assumptions so that the row totals in the $D = 0$ table can be used to estimate the missing row totals in the $D = 1$ table. Regardless of the conditioning variables used or assumptions imposed, there always exists a set of minimal assumptions necessary to identify the impact of training that cannot be tested with the data. The same is true for the analysis of experimental data; the assumptions of no randomization bias or the unimportance of sample attrition cannot be tested with the data typically generated from experimental evaluations. Both experimental and nonexperimental approaches require assumptions that cannot be tested without collecting data specifically designed to test the assumptions of the model.

7.2 Constructing a Comparison Group

All evaluations are based on comparisons between treated and untreated persons. The comparisons may be constructed using the same persons in the treated and untreated states as in the before-after estimator. More commonly, different persons are compared.

The evaluation literature makes an artificial distinction between the task of creating a comparison group and the task of selecting an econometric estimator to apply to that comparison group. In truth, all estimators define an appropriate comparison group and the choice of a comparison group affects the properties of an estimator. The act of constructing or selecting a valid estimator entails assumptions about the samples on which it should be applied.

This simple point is usually overlooked in the empirical literature on program evaluation. It is common to observe analysts first constructing a comparison group on the intuitive principle of making the comparison group “comparable” in some way or other to the treatment group, and then debating the choice of an estimator as if all estimators defined for random samples of the population can be applied to a comparison group so constructed. Many econometric estimators are only valid for random samples of the population. When nonrandom samples are generated, the estimators are sometimes no longer valid and have to be modified to account for the impact of the sampling rule used to generate the comparison samples.

The most common instance of this point arises in oversampling participants compared to nonparticipants. Program records are often abundant for participants; comparison samples often have to be collected at considerable cost. The ratio of program records to comparison group records is usually much larger than one. Simply pooling the two samples misrepresents the population proportion of persons taking training. In order to use the many conventional econometric methods that assume random sampling on such data, the samples have to be reweighted. (See the discussion in Heckman and Robb, 1985a, 1986a.) A special class of “control function” estimators that we define below does not have to be reweighted. However, instrumental variables estimators have to be reweighted in this case.
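The reweighting step can be illustrated with a small sketch. It assumes that the population participation rate is known from an outside source, which is an assumption of this example rather than something the text supplies; the function names and numbers are hypothetical.

```python
def choice_based_weights(d, population_share):
    """Weights that make a pooled, choice-based sample mimic a random sample.

    d: list of 0/1 participation indicators in the pooled sample.
    population_share: Pr(D = 1) in the population (assumed known here).
    """
    sample_share = sum(d) / len(d)
    w_participant = population_share / sample_share
    w_nonparticipant = (1.0 - population_share) / (1.0 - sample_share)
    return [w_participant if di == 1 else w_nonparticipant for di in d]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Participants are 80 percent of the pooled sample but 5 percent of the population.
d = [1] * 80 + [0] * 20
weights = choice_based_weights(d, population_share=0.05)
participation_rate = weighted_mean(d, weights)
```

The unweighted mean of `d` is 0.80, the oversampled share; the weighted mean recovers the assumed population share.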

Different classes of estimators exhibit different degrees of sensitivity to departures from random sampling in constructing comparison groups.

A second example is contamination bias, which we discuss in detail in Section 7.7. Many comparison groups include persons who have actually participated in the program but who have not been recorded as having done so. Again, estimators suitable to random samples without such measurement error on treatment status have to be modified for contaminated samples (Heckman and Robb, 1985a; Imbens and Lancaster, 1996).

A third example concerns the widespread practice of “matching” treatment and comparison group members on dimensions such as pre-program earnings. The literature often distinguishes between “screening” on characteristics and matching. Screening usually refers to the application of certain broad rules (e.g., income below a certain level) to select observations from a source sample into a comparison sample; matching refers to alignment of trainees and comparison group members over narrower intervals. Both are a form of matching as we define it below and the distinction between them is of no practical value.

More serious are the consequences of this type of matching on the performance of econometric estimators. Matching on variables that are stochastically dependent on the errors of the model sometimes alters the stochastic structure of the errors. Econometric estimators that are valid for random samples are often no longer valid when applied to the samples generated by matching procedures.

To illustrate the foregoing point, consider the common-coefficient autoregressive estimator introduced into the econometric evaluation literature in Heckman and Wolpin (1976).

Using decision rule (6.3) and assuming that agents make their decisions in an environment of perfect certainty and that enrollment into the program only occurs in period k,

(7.1a) $Y_t = \beta + \alpha D + U_t$ for $t > k$;

(7.1b) $Y_t = \beta + U_t$ for $t \le k$;

(7.1c) $U_t = \rho U_{t-1} + \varepsilon_t$;

where $\varepsilon_t$ is an independently and identically distributed error with mean zero. In terms of the model of potential outcomes introduced in Section 3, $Y_t = D Y_{1t} + (1 - D) Y_{0t}$ and $Y_{1t} - Y_{0t} = \alpha$, the parameter of interest. The model is in the form of equation (3.10) with an autoregressive error. The assumptions about the error terms are typically invoked about random samples of the population. Selection bias in this model arises because of the covariance between $D$ and $U_t$. In a model with perfect capital markets, only if $\rho = 0$ would there be no selection bias.⁴⁵

If we have access to panel data, we can use two post-program observations to estimate $\alpha$.⁴⁶ Write

$Y_{t-1} = \beta + \alpha D + U_{t-1}$, where $t - 1 > k$, so that

$U_{t-1} = Y_{t-1} - \beta - \alpha D$.

Substituting into (7.1c) and collecting terms, we may rewrite (7.1a) as

(7.2) $Y_t = \beta(1 - \rho) + \alpha(1 - \rho) D + \rho Y_{t-1} + \varepsilon_t$.

Under decision rule (6.3), $D$ is orthogonal to $\varepsilon_t$ even though agents are making their participation decisions under perfect certainty. Least squares applied to (7.2) identifies $\rho$, and hence $\alpha$ and $\beta$. This estimator can be applied to training programs or schooling. Its great advantage is that it can be implemented using only post-program outcome measures provided $\rho \neq 1$. Properties of this estimator are presented in Section 7.6.
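A small simulation can make the mechanics of (7.2) concrete. This is a sketch under stated assumptions, not the procedure of any cited study: decision rule (6.3) is not reproduced here, so the code substitutes a hypothetical rule in which persons with a negative period-$k$ error enroll, and the parameter values, sample size, seed, and function names are all illustrative.

```python
import random


def ols(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * z for x, z in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def simulate(n, alpha, beta, rho, seed):
    """Two post-program outcomes per person with AR(1) errors."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        u_k = rng.gauss(0.0, 1.0)                  # error in enrollment period k
        d = 1 if u_k < 0.0 else 0                  # hypothetical selection rule
        u_prev = rho * u_k + rng.gauss(0.0, 1.0)   # U_{t-1}, with t - 1 = k + 1
        u_t = rho * u_prev + rng.gauss(0.0, 1.0)   # U_t
        X.append([1.0, float(d), beta + alpha * d + u_prev])  # 1, D, Y_{t-1}
        y.append(beta + alpha * d + u_t)                      # Y_t
    return X, y


X, y = simulate(n=20000, alpha=1.0, beta=2.0, rho=0.5, seed=0)

# Regress Y_t on a constant, D and Y_{t-1}, then undo the (1 - rho) scaling.
b_const, b_d, rho_hat = ols(X, y)
alpha_hat = b_d / (1.0 - rho_hat)
beta_hat = b_const / (1.0 - rho_hat)

# Naive post-program difference in means, contaminated by selection on U.
treated = [yt for row, yt in zip(X, y) if row[1] == 1.0]
untreated = [yt for row, yt in zip(X, y) if row[1] == 0.0]
naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)
```

In this design the naive difference in means is pulled below the true $\alpha$ because enrollees have low errors that persist through $\rho$, while the regression on the lagged outcome absorbs that persistence. Dividing by $1 - \rho$ is what fails when $\rho = 1$.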

Another way to identify $\alpha$ is to use instrumental variables or classic selection bias estimators, which we describe in detail below. Assuming random sampling, both of these estimators identify $\alpha$.

Suppose, however, that we first “match” on pre-training earnings, $Y_{t'}$, $t' < k$, in order to construct a comparison sample of nonparticipants. Consider a simple screening rule: select observations into the sample if $Y_{t'} < \ell$. This rule is widely used in constructing comparison samples. How are the error structure (7.1c) and the properties of the three estimators just discussed affected by the application of such screening rules? The autoregressive estimator just presented using post-program observations is unaffected by these sampling rules. It continues to identify $\alpha$ and $\rho$. This is immediately seen because $E(\varepsilon_t \mid D = 1, Y_{t-1}, Y_{t'} < \ell) = 0$, since $\varepsilon_t$ is independent of $Y_{t'}$, $t' < k$, and $Y_k$.

However, matching affects the distribution of the errors. This makes a sample selection model based on a distributional assumption appropriate to a random sample inappropriate when applied to a matched sample. In this case, two selection rules generate the outcomes; classical selection estimators that only account for agent self-selection do not account for the selection bias induced by the analysts’ matching procedure. Instrumental variables methods appropriate to random samples in general become inconsistent when applied to matched samples for reasons exposited in Section 7.7.

45 However, this result crucially depends on the perfect capital market assumption, as we noted in Section 6.4.

46 As noted in Heckman and Robb (1985a, 1986a) and below, this estimator can also be applied to repeated cross section data.

Another strategy for defining a comparison group is to use program applicants who drop out of the application and enrollment process before receiving training. Such comparison groups include persons who applied and were rejected from the program, those who were admitted but never showed up for training (“no-shows”), or early program dropouts. (No-shows are used in, e.g., Cooley, et al., 1979, LaLonde, 1984, and Bell, et al., 1995. Heckman, Ichimura and Todd, 1997, consider this comparison group among others.) In samples based on no-shows, two decision rules (whether or not to apply to the program and whether or not to stay in the program if accepted) determine which nonparticipants end up in the comparison group sample. The properties of econometric estimators have to be examined to see if they are robust to such sample selection rules. Analytically, this is the same problem as arose in the construction of matched samples, except that in this case the decision rules of agents govern the construction of samples. Estimators valid for samples generated by one decision rule need not be valid for another.

A brief summary of the screening and matching criteria used in several major evaluations is presented in the last row of Table 7.1. Table 7.2, based on Barnow (1987), presents a more exhaustive list of characteristics used to match and control for differences in evaluations of the U.S. CETA program, the immediate predecessor of JTPA. Combining matching and different nonexperimental evaluation methods that break down when applied to matched samples constitutes an important source of variability across these studies, one that has more to do with the properties of the estimators selected than with the properties of the programs being studied.

In the literature, the act of specifying a comparison group and then making conditional mean comparisons between participants and comparisons is equivalent to defining a matching estimator. The matching estimator may be embellished by further adjustments as we note below. A different comparison group might be specified for each treatment observation. The potential sample from which the comparison group is taken includes all persons who do not take treatment. Further restrictions on this universe define different matching rules.

7.3 Econometric Evaluation Estimators

All evaluation estimators are based on the three basic estimation principles introduced in Section 4. They entail making some comparison of treated individuals with the untreated. The comparison may be between treated and untreated persons at a point in time as in the cross-section estimator; it may be between the same persons in the treated and untreated states as in the before-after estimator; or it may be a hybrid of the two principles as in the difference-in-differences estimator. In this section, we extend these basic estimators to allow for conditioning variables and to exploit knowledge of the serial correlation properties of error terms.

The estimators within each class differ in the way they adjust, condition or transform the data in order to construct the counterfactual $E(Y_{0t} \mid X, D = 1)$. Throughout the rest of this section, we consider how the various estimators construct the counterfactual and what assumptions they make about individual decision processes that determine program participation. We motivate this discussion using the simple decision and outcome models of Section 6.3. The first class of estimators that we consider are cross-section estimators based on matching methods. These estimators are frequently used in studies by consulting firms because they are relatively easy to explain to their clients. A disadvantage of this approach is that it requires strong underlying assumptions about the selection process into training. Although the method is usually applied in a cross-sectional setting, matching can be generalized to apply to panel settings as in Heckman, Ichimura and Todd (1997, 1998).

The second class of cross-section methods we consider are selection bias correction methods developed in Heckman (1976, 1979) or Heckman and Robb (1985a, 1986a). This approach is often used in studies of European training programs. It too can be extended to apply to panel data, but is most frequently applied in a cross-sectional setting.

Program evaluations by academic labor economists in the United States have relied almost exclusively on a third class of estimators: longitudinal methods that extend the before-after and difference-in-differences estimators. An implicit belief shared by the authors of these studies is that longitudinal methods are more robust than cross-section selection bias correction methods, which are sometimes dismissed as being “functional form dependent.”

However, we demonstrate below that as currently utilized in the applied evaluation literature, longitudinal estimators depend on functional form assumptions. Moreover, longitudinal estimators are often much less robust to choice-based sampling and other matching and screening procedures used to produce comparison samples in the empirical literature than are cross-section sample selection estimators. In the remainder of this section, we discuss the identifying assumptions that underlie the main methods used in evaluation research, and sketch out how they are implemented to produce practical estimators.

We remind the reader that throughout this chapter, we use $X$ variables that are not determined by $D$. Letting $X$ be the vector of conditioning variables and $Y^P$ a vector of potential outcomes, we write $Y_t^P = (Y_{0t}, Y_{1t})$, $Y^P = (Y_1^P, \ldots, Y_T^P)$, and $X = (X_1, \ldots, X_T)$.

We define the admissible $X$ on which we condition to define parameters as those $X$ that satisfy

(7.A.1) $f(X \mid D, Y^P) = f(X \mid Y^P)$,

where $f(X \mid D, Y^P)$ is the density of $X$ given $D$ and $Y^P$, and $f(X \mid Y^P)$ is the density of $X$ given $Y^P$.⁴⁷ This assumption says that given the potential outcomes in both states, the actual occurrence of $D$ provides no more information on $X$ (“$D$ does not cause $X$”).

We maintain this assumption in order to avoid masking the effects of $D$ on outcomes by conditioning on variables that are determined by $D$. Other definitions are possible but we maintain this one to make our analysis interpretable and to avoid certain technical problems in making forecasts with our parameters. Heckman (1998a) presents a more extensive discussion of this condition and relates it to definitions of causality and exogeneity in the econometric time series literature.

7.4 Identification Assumptions for Cross-Section Estimators

When participation in training is voluntary, and evaluators have access to cross-sectional data, they can construct the distribution of outcomes for participants, $F(Y_1 \mid X, D = 1)$, and for nonparticipants, $F(Y_0 \mid X, D = 0)$. They use $F(Y_0 \mid X, D = 0)$ to approximate $F(Y_0 \mid X, D = 1)$, which runs the risk of selection bias. When using this approximation, the bias in estimating $E(Y_1 - Y_0 \mid X, D = 1)$ is given by

(7.3) $B(X) = E(Y_0 \mid X, D = 1) - E(Y_0 \mid X, D = 0)$.

Many schemes have been proposed to circumvent this bias. We begin by considering the intuitively-appealing method of matching.

7.4.1 The Method of Matching

The method of matching assumes that analysts have access to a set of conditioning variables, $X$, such that, within each “stratum” defined by $X$, the counterfactual outcome distribution of the participants is the same as the observed outcome distribution of the nonparticipants.⁴⁸ The statistical matching literature assumes access to a set of $X$ variables such that

(7.4) $(Y_0, Y_1) \perp\!\!\!\perp D \mid X$,

where “$\perp\!\!\!\perp$” denotes independence and $X$ denotes variables on which conditioning is conducted. As a consequence of (7.4), the distributions of outcomes satisfy $F(Y_0 \mid D = 1, X) = F(Y_0 \mid D = 0, X) = F(Y_0 \mid X)$ and $F(Y_1 \mid D = 1, X) = F(Y_1 \mid D = 0, X) = F(Y_1 \mid X)$.

47 Heckman and Borjas (1980) develop this noncausality condition.

48 The first published instance of the use of this method of which we are aware is Fechner (1860).

The method appeals to the intuitive principle that it is possible to “adjust away” differences between participants and nonparticipants using the available regressors.

If assumption (7.4) is valid we can use nonparticipants to measure what participants would have earned had they not participated, provided we condition on the variables X. To ensure that this assumption has empirical content, it also is necessary to assume that there are participants and non-participants for each X for which we seek to make a comparison.

More formally, this means that

(7.5) $0 < \Pr(D = 1 \mid X) < 1$

over the set of $X$ values where we seek to make a comparison. To satisfy this condition, at least in large samples, there must be both participants and nonparticipants for each $X$. In a finite sample of any size, we replace this condition by the empirical probability.⁴⁹ This condition ensures that the distributions in (7.4) are defined for all $X$ that satisfy it. As we demonstrate below in Section 8.2, this assumption has important practical consequences for training evaluations. Failure to satisfy this condition appears to be one major reason why matching methods produce biased estimates of the impact of training in the NJS study.

The treatment parameter $E(Y_1 - Y_0 \mid X, D = 1)$ cannot be identified for values of $X$ where (7.5) is violated.
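For discrete conditioning variables, the empirical version of condition (7.5) amounts to checking that each cell contains both participants and nonparticipants. The following sketch shows one way to do that; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def common_support(x, d):
    """Return the X values observed among both participants and nonparticipants.

    x: list of discrete (hashable, sortable) covariate values.
    d: list of 0/1 participation indicators, aligned with x.
    """
    counts = defaultdict(lambda: [0, 0])   # value -> [n nonparticipants, n participants]
    for xi, di in zip(x, d):
        counts[xi][di] += 1
    return sorted(v for v, (n0, n1) in counts.items() if n0 > 0 and n1 > 0)


# The "high" cell contains participants only, so no comparison is possible there.
x = ["low", "low", "mid", "mid", "high", "high"]
d = [1, 0, 1, 0, 1, 1]
support = common_support(x, d)
```

Treated persons whose $X$ falls outside the returned set have no comparison observations, so any impact estimate formed by matching applies only to the remaining part of the participant population.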

Under assumptions (7.4) and (7.5), matching produces a comparison group that resembles an experimental control group in one key respect: conditional on $X$, the distribution of the counterfactual outcome, $Y_0$, for the participants is the same as the observed distribution of $Y_0$ for the comparison group. In particular, as long as the means exist, assumptions (7.4) and (7.5) imply that

(7.6a) $E(Y_0 \mid X, D = 1) = E(Y_0 \mid X, D = 0)$,

and that

(7.6b) $E(Y_1 \mid X, D = 1) = E(Y_1 \mid X, D = 0)$.

Therefore, for each point in $X$, the bias $B(X) = 0$. However, this assumption does not imply no selection bias, i.e., that $E(U_0 \mid X, D = 1) = 0$. Instead, like experiments, matching balances the bias:

(7.7) $E(U_0 \mid X, D = 1) = E(U_0 \mid X, D = 0) = E(U_0 \mid X)$.

In an ideal experiment, we obtain a comparison group via randomization among persons for whom $D = 1$. Matching emulates an experiment by replacing randomization with conditioning on a set of $X$ variables. Conditional on those values, persons randomly select into the program. There are no selective differences in $Y_0$ outcomes between participants and nonparticipants given $X$. Randomization at the stage where persons enter the program also may be thought of as a form of conditioning (Heckman, 1996). It operates conditional on $D = 1$. Under the conditions that justify it, randomization generates a control group for each $X$ in the participant population. Similarly, under assumption (7.4), matching generates a comparison group, but only for those $X$ values that satisfy (7.5), which in practice is often a much smaller set of values than would be the case with randomization.

49 The support of $X$ consists of those values of $X$ with positive density. Assumptions (7.4) and (7.5) are called “strong ignorability” by Rosenbaum and Rubin (1983).

In Section 8.2 below, we draw on the work of Heckman, Ichimura, Smith and Todd (1998) and demonstrate that the reduction in the set of $X$ for which the parameter of interest is defined can be substantial. Further, because the impact parameter may depend on $X$, the parameter estimated by an experimental evaluation and the parameter estimated by matching may be different.

When the Rosenbaum-Rubin assumptions (7.4) and (7.5) are invoked, it is possible to construct both the “treatment on the treated” parameter $E(Y_1 - Y_0 \mid X, D = 1)$ and the effect of “nontreatment on the nontreated” $E(Y_0 - Y_1 \mid X, D = 0)$. Only assumption (7.6a) is required if we are interested in the mean effect of treatment on the treated. It permits agents to select into the program on the basis of $U_1$ but not $U_0$. Assuming that $E(U_0 \mid X) = 0$, it implicitly defines the parameter “treatment on the treated” in an asymmetric way:

$E(Y_1 - Y_0 \mid X, D = 1) = \mu_1(X) - \mu_0(X) + E(U_1 \mid X, D = 1)$

because $E(U_0 \mid X, D = 1) = E(U_0 \mid X) = 0$. This parameter no longer equals the effect of treatment on a randomly selected person as it would if (7.4) held. Assumption (7.6b) allows us to identify the mean effect of nontreatment on the nontreated.

Using representation (3.1a) and (3.1b), (7.4) and (7.5) imply that $E(U_0 \mid X, D = 1) = E(U_0 \mid X, D = 0) = E(U_0 \mid X) = 0$ and $E(U_1 \mid X, D = 1) = E(U_1 \mid X, D = 0) = E(U_1 \mid X) = 0$. Thus conditioning on $X$, the two parameters “treatment on the treated” and “the effect of randomly assigning a person with characteristics $X$ to the program” are the same.⁵⁰ From an economic standpoint, assumption (7.4) rules out selection into the program on the basis of unobservables $(U_0, U_1)$ that may be partially known to people taking training but are unknown to the observing economist. In terms of the random coefficient model of Section 3, it rules out correlation between $D$ and the difference in unobserved components, $(U_1 - U_0)$. It defines an implicit economic model that assumes that agents do not enter the program on the basis of gains unobserved by analysts. Thus it is a method congenial with the assumption that $\alpha$ in (6.3) is a common coefficient, or that if $\alpha$ varies among persons with identical $X$, then participation in the program is not based on this variation. In the context of that model, the “cost of participation” or any of the variables generating participation, but not outcomes, are valid conditioning variables.

50 This is also true if $Y_1 = g_1(X) + U_1$ and $Y_0 = g_0(X) + U_0$ and $E(U_1 \mid X) \neq 0$ and $E(U_0 \mid X) \neq 0$. In that case, $E(Y_1 - Y_0 \mid X, D = 1) = g_1(X) - g_0(X) + E(U_1 - U_0 \mid X)$, so that the two parameters are the same.

Thus, if the costs of participation are distributed independently of all other variables and if $Y_{0k}$ is independent of $Y_{0t}$, then conditioning on $c$ or $Y_{0k}$ will satisfy the conditions required to justify the matching estimator. However, as we explained in Section 6.3.1, if we condition on both cost of participation and $Y_{0k}$, we violate condition (7.5). Matching breaks down if there is too much information and other methods must be used to evaluate the program.⁵¹

To operationalize the method of matching, assume two samples: “t” for treatment and “c” for comparison group. Unless otherwise noted, observations are statistically independent. Simple matching methods are based on the following idea: for each person $i$ in the treatment group, we find some group of “comparable” persons. The same individual may be in both groups if that person is treated at one time and untreated at another. We denote outcomes in the treatment group by $Y_i^t$ and we match these to the outcomes of a subsample of persons in the comparison group to estimate a treatment effect. In principle, we can use a different subsample as a comparison group for each person.

In practice, we can construct matches on the basis of a neighborhood $C(X_i)$, where $X_i$ is a vector of characteristics for person $i$. Neighbors to treated person $i$ are persons in the comparison sample whose characteristics are in neighborhood $C(X_i)$. Suppose that there are $N_c$ persons in the comparison sample and $N_t$ in the treatment sample. Thus the persons in the comparison sample who are neighbors to $i$ are persons $j$ for whom $X_j \in C(X_i)$, i.e., the set of persons $A_i = \{ j \mid X_j \in C(X_i) \}$. Let $W(i, j)$ be the weight placed on observation $j$ in forming a comparison with observation $i$, and further assume that the weights sum to one, $\sum_{j=1}^{N_c} W(i, j) = 1$, and that $0 \le W(i, j) \le 1$. Then we form a weighted comparison group mean for person $i$, given by

(7.8a) $\bar{Y}_i^c = \sum_{j=1}^{N_c} W(i, j) Y_j^c$,

and the estimated treatment effect for person $i$ is $Y_i - \bar{Y}_i^c$.

51 The regression discontinuity design estimator discussed in Section 4.6 can be applied here as a limit form of the matching estimator that identifies $E(Y_1 - Y_0 \mid X, D = 1)$ at one point.

Heckman, Ichimura and Todd (1997) survey a variety of alternative matching schemes proposed in the literature. Here we briefly introduce two widely-used methods. The nearest-neighbor matching estimator defines $A_i$ such that only one $j$ is selected so that it is closest to $X_i$ in some metric:

$A_i = \{ j \mid \min_{j \in \{1, \ldots, N_c\}} \| X_i - X_j \| \}$,

where “$\| \cdot \|$” is a metric measuring distance in the $X$ characteristics space. The Mahalanobis metric is one widely used metric for implementing the nearest-neighbor matching estimator. The metric used to define neighborhoods for $i$ is

$\| \cdot \| = (X_i - X_j)' \Sigma_c^{-1} (X_i - X_j)$,

where $\Sigma_c$ is the covariance matrix in the comparison sample. The weighting scheme for the nearest-neighbor matching estimator is

$W(i, j) = 1$ if $j \in A_i$, and $W(i, j) = 0$ otherwise.

A version of nearest-neighbor matching, called “caliper” matching (Cochran and Rubin, 1973), makes matches to person $i$ only if

$\| X_i - X_j \| < \varepsilon$,

where $\varepsilon$ is a pre-specified tolerance. Otherwise person $i$ is bypassed and no match is made to him or her.
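A minimal sketch of nearest-neighbor matching with an optional caliper follows. To stay short it assumes a single scalar characteristic and uses the absolute difference as the distance; in one dimension the Mahalanobis metric is a rescaled squared difference, so it selects the same neighbor, although the caliper would then be expressed in different units. The function name and data are hypothetical.

```python
def nearest_neighbor_matches(x_treated, x_comparison, caliper=None):
    """Match each treated unit to its closest comparison unit.

    Returns, for each treated unit, the index of the selected comparison
    unit, or None when the closest unit is not strictly within the caliper.
    Comparison units may be reused for several treated units.
    """
    matches = []
    for xi in x_treated:
        j_best = min(range(len(x_comparison)),
                     key=lambda j: abs(xi - x_comparison[j]))
        distance = abs(xi - x_comparison[j_best])
        if caliper is not None and distance >= caliper:
            matches.append(None)       # person i is bypassed
        else:
            matches.append(j_best)     # W(i, j_best) = 1, all other weights 0
    return matches


x_t = [1.0, 2.0, 9.0]
x_c = [0.8, 2.5, 4.0]
matches = nearest_neighbor_matches(x_t, x_c, caliper=1.0)
unrestricted = nearest_neighbor_matches(x_t, x_c)
```

Without a caliper every treated person receives a match, however distant; with one, the third treated person is dropped, which changes the population to which the estimate refers.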

Kernel matching uses the entire comparison sample, so that $A_i = \{1, \ldots, N_c\}$, and sets

$W(i, j) = \dfrac{K(X_j - X_i)}{\sum_{j=1}^{N_c} K(X_j - X_i)}$,

where $K$ is a kernel. In practice, kernels are typically a standard distribution function such as that for the normal. Kernel matching is a smooth method that reuses and weights the comparison group sample observations differently for each person $i$ in the treatment group with a different $X_i$. Kernel matching can be defined pointwise at each sample point $X_i$ or for broader intervals.

The impact of treatment on the treated is estimated by forming the mean difference across the $i$:

(7.8b) $m = \dfrac{1}{N_t} \sum_{i=1}^{N_t} (Y_i^t - \bar{Y}_i^c) = \dfrac{1}{N_t} \sum_{i=1}^{N_t} \left( Y_i^t - \sum_{j=1}^{N_c} W(i, j) Y_j^c \right)$.

We can define this mean for various subsets of the treatment sample defined in various ways.

More efficient estimators weight the observations accounting for the variance (Heckman, Ichimura, and Todd, 1997, 1998; Heckman, 1998a).
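The matching estimate of the mean impact with kernel weights can be sketched as follows. The example assumes a scalar characteristic and a normal-density kernel; the bandwidth argument is an addition of this sketch (the text does not specify one), and the function names and data are hypothetical.

```python
import math


def normal_kernel(u, bandwidth):
    """Normal-density kernel, up to a constant that cancels in the weights."""
    return math.exp(-0.5 * (u / bandwidth) ** 2)


def kernel_matching_effect(x_t, y_t, x_c, y_c, bandwidth):
    """Mean over treated units of Y_i^t minus a kernel-weighted comparison mean."""
    gaps = []
    for xi, yi in zip(x_t, y_t):
        k = [normal_kernel(xj - xi, bandwidth) for xj in x_c]
        total = sum(k)
        weights = [kj / total for kj in k]                    # W(i, j), sums to one
        y_bar_c = sum(w * yj for w, yj in zip(weights, y_c))  # comparison mean for i
        gaps.append(yi - y_bar_c)                             # effect for person i
    return sum(gaps) / len(gaps)                              # mean difference m


# Comparison outcomes follow 2 * X; the treated person at X = 2 has outcome 7.
x_c = [0.0, 1.0, 2.0, 3.0, 4.0]
y_c = [0.0, 2.0, 4.0, 6.0, 8.0]
m = kernel_matching_effect([2.0], [7.0], x_c, y_c, bandwidth=1.0)
```

Because the comparison points are symmetric around the treated person's $X$ in this toy case, the weighted comparison mean equals 4 and the estimate is 3. Near the edge of the comparison sample the weights are no longer symmetric, and the local average can be a poor stand-in for the counterfactual.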

Regression-adjusted matching, proposed by Rubin (1979) and clarified in Heckman, Ichimura and Todd (1997, 1998), uses regression-adjusted $Y_i$, denoted by $A(Y_i) = Y_i - X_i \beta$, in place of $Y_i$ in the preceding calculations. (See the cited papers for the econometric details of the procedure.) Regression-adjusted matching methods were widely used in the controversial CETA evaluations conducted in the early 1980s, which we discuss below.

The essence of the idea justifying matching is that conditioning on $X$ eliminates selection bias. Like social experiments, the method requires no functional form assumptions for outcome equations. If, however, a functional form assumption is maintained, as in the econometric procedure proposed by Barnow, et al. (1980), it is possible to implement the matching assumption using regression analysis. Suppose that $Y_0$ is linearly related to observables $X$ and an unobservable $U_0$, so that $E(Y_0 \mid X, D = 0) = X\beta + E(U_0 \mid X, D = 0)$, and $E(U_0 \mid X, D = 0) = E(U_0 \mid X)$ is linear in $X$. Under these assumptions, controlling for $X$ via linear regression allows one to identify $E(Y_0 \mid X, D = 1)$ from the data on nonparticipants. Such functional form assumptions are not strictly required to implement the method of matching. Moreover, in practice, users of the method of Barnow, et al. (1980) do not impose the common support condition (7.5) for the distribution of $X$ when generating estimates of the training effect. The distribution of $X$ may be very different in the trainee ($D = 1$) and comparison group ($D = 0$) samples, so that comparability is only achieved by imposing linearity in the parameters and extrapolating over different regions.
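The regression version of the matching assumption can be sketched with a single regressor. This is an illustration of the general idea of fitting the $Y_0$ equation on nonparticipants and predicting for participants, not a reproduction of the Barnow, et al. (1980) procedure; the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Simple least squares of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def regression_counterfactual_effect(x_t, y_t, x_c, y_c):
    """Mean gap between trainee outcomes and their predicted Y0.

    The Y0 equation is fitted on nonparticipants only and then evaluated
    at the trainees' X, wherever those X happen to lie.
    """
    intercept, slope = fit_line(x_c, y_c)
    gaps = [yi - (intercept + slope * xi) for xi, yi in zip(x_t, y_t)]
    return sum(gaps) / len(gaps)


# Nonparticipants have X in [1, 4]; trainees have X of 8 and 10.
x_c = [1.0, 2.0, 3.0, 4.0]
y_c = [3.0, 5.0, 7.0, 9.0]
x_t = [8.0, 10.0]
y_t = [19.0, 23.0]
effect = regression_counterfactual_effect(x_t, y_t, x_c, y_c)
```

In this toy case no trainee has an $X$ inside the range of the comparison sample, so the estimate rests entirely on the fitted line holding outside that range. That is the extrapolation the text warns about when condition (7.5) is not imposed.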

One advantage of the method of Barnow, et al. (1980) is that it uses data parsimoniously. If the $X$ are high dimensional, the number of observations in each cell when matching can get very small. Another solution to this problem that reduces the dimension of the matching problem without imposing arbitrary linearity assumptions is based on the probability of participation or the “propensity score,” $P(X) = \Pr(D = 1 \mid X)$. Rosenbaum and Rubin (1983) demonstrate that if assumptions (7.4) and (7.5) hold, then

(7.11) $(Y_1, Y_0) \perp\!\!\!\perp D \mid P(X)$ for $X \in \chi_c$,

for some set $\chi_c$ where it is assumed that (7.5) holds in the set. Conditioning on $P(X)$ rather than on $X$ produces conditional independence. Condition (7.11) has the important implication that to construct the desired counterfactual conditional mean $E(Y_0 \mid P(X), D = 1)$, we require only that

(7.12) $B(P(X)) = E(Y_0 \mid P(X), D = 1) - E(Y_0 \mid P(X), D = 0) = 0.$

We also could invoke (7.12) in place of (7.11) to define the conditions required to justify matching to estimate mean impacts. Conditioning on $P(X)$ sets $B(P(X)) = 0$ and reduces the dimension of the matching problem down to matching on the scalar $P(X)$. The analysis of Rosenbaum and Rubin (1983) assumes that $P(X)$ is known rather than estimated.
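To make the dimension reduction concrete, here is a minimal sketch of nearest-neighbor matching on a known scalar score. Everything in it is hypothetical: the score function `propensity`, the two-component $X$, the outcome rule, and the assumed impact of 2. The outcome without training is constructed to depend on $X$ only through $P(X)$, so condition (7.12) holds by design. This is not the kernel matching estimator studied in the literature cited here.

```python
def propensity(x):
    """Hypothetical known P(X) for X = (age_group, schooling); not an estimated model."""
    age_group, schooling = x
    return 0.2 + 0.1 * age_group + 0.2 * schooling

def outcome_without_training(x):
    """Hypothetical Y0 that depends on X only through P(X), so (7.12) holds by construction."""
    return 10.0 * propensity(x)

def matched_impact(treated, comparisons):
    """Mean of (trainee outcome - outcome of the comparison person with the closest P(X))."""
    gaps = []
    for x_t, y_t in treated:
        p_t = propensity(x_t)
        _, y_match = min(comparisons, key=lambda rec: abs(propensity(rec[0]) - p_t))
        gaps.append(y_t - y_match)
    return sum(gaps) / len(gaps)

treated_x = [(2, 0), (1, 1), (2, 1)]                    # scores 0.4, 0.5, 0.6
comparison_x = [(0, 0), (1, 0), (0, 1), (3, 0), (4, 0)]  # scores 0.2, 0.3, 0.4, 0.5, 0.6

treated = [(x, outcome_without_training(x) + 2.0) for x in treated_x]   # observed Y1
comparisons = [(x, outcome_without_training(x)) for x in comparison_x]  # observed Y0

att = matched_impact(treated, comparisons)
naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in comparisons) / len(comparisons))
```

Each trainee is matched to a comparison person with a different $X$ but the same score, which is the sense in which the scalar $P(X)$ replaces the full vector $X$. Here `att` is 2 up to rounding, while `naive` is 3 because the two groups have different score distributions.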

Heckman, Ichimura and Todd (1998) present the asymptotic distribution theory for the kernel matching estimator in the cases in which P (X) is known and in which it is estimated both parametrically and nonparametrically. They also answer the question, “If P (X) were known would we match on it or on X?” Using the variance of the estimated average impacts as the choice criterion, the answer is “it depends”.

A major advantage of the method of randomized trials over the method of matching is that randomization works for any choice of $X$. In the method of matching, there is the same uncertainty about which $X$ to use as there is in the specification of conventional econometric models. Even if one set of $X$ values satisfies condition (7.11) or (7.12), an augmented or reduced version of this set may not. Heckman, Ichimura and Todd (1997) discuss tests that can help determine the appropriate choice of $X$ variables. Any convincing application of the method of matching requires a demonstration that an adequate model for $P(X)$ has been selected. Heckman, Ichimura, Smith and Todd (1998) discuss this problem in depth. In the statistics literature, there is no discussion of the choice of $X$. Implicitly, the advice given there is to use all available regressors. One general rule, already noted in the introduction to this section, is to include in $X$ only variables that are not caused by $D$ given the unobservables. Intuitively, conditioning on variables caused by $D$ masks the true effect of $D$ on outcomes.

The method of matching is sometimes used to estimate $E(Y_1 - Y_0 \mid X, D = 1)$ at points of $X = x$. More commonly, an averaged version of this parameter is estimated over a set $S(X)$:

(7.13) $E(Y_1 - Y_0 \mid D = 1) = \dfrac{\int_{S(X)} E(Y_1 - Y_0 \mid D = 1, X)\, dF(X \mid D = 1)}{\int_{S(X)} dF(X \mid D = 1)}.$
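With discrete $X$, the averaged parameter in (7.13) is a trainee-weighted mean of the pointwise impacts over the cells inside $S(X)$. The sketch below uses invented cell labels, impacts, and trainee counts purely for illustration.

```python
# Hypothetical cells: label -> (pointwise impact E(Y1 - Y0 | D = 1, X = x), number of trainees at x)
cells = {"x1": (1.0, 10), "x2": (2.0, 30), "x3": (4.0, 60), "x4": (8.0, 20)}
support = {"x1", "x2", "x3"}  # hypothetical set S(X); cell x4 lies outside it

def averaged_impact(cells, support):
    """Discrete analogue of (7.13): weight pointwise impacts by the D = 1 distribution of X within S(X)."""
    numerator = sum(impact * n for x, (impact, n) in cells.items() if x in support)
    denominator = sum(n for x, (_, n) in cells.items() if x in support)
    return numerator / denominator

print(averaged_impact(cells, support))  # (1*10 + 2*30 + 4*60) / 100 = 3.1
```

Changing `support` changes the parameter being estimated, which is why the choice of $S(X)$ matters when comparing estimates across studies.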

The distinction between the average parameter and the pointwise parameter is an important one. Although their behavioral motivations and identifying assumptions differ, both the matching estimator and the classical selection estimator can identify (7.13). We now turn to consider the classical selection estimator.

7.4.2 Index Sufficient Methods and the Classical Econometric Selection Model

The most troubling feature of the method of matching is the assumption that selection into a program does not occur on the basis of unobservable (to the economist) gains from the program ($U_1$ if (7.6) is assumed; $U_1 - U_0$ if (7.4) is assumed). Depending on the quality of the data at the analyst’s disposal, it may or may not be attractive to assume that the analyst knows as much as the people being studied. The method of matching is not robust to violations of this assumption.

The traditional econometric approach to the selection problem adopts a more conservative approach and allows for selection on unobservables. As currently formulated, it assumes an additively separable model relating outcomes to regressors and additive errors, but does not require the strong behavioral assumptions that justify matching. Thus it trades a behavioral assumption for an additive separability assumption. It allows for selection into the program on the basis of unobserved components of outcomes. This approach is in the spirit of much econometric work that builds models to estimate a variety of counterfactual states, rather than just the single counterfactual state required to estimate the mean impact of treatment on the treated, which is the parameter of interest in most evaluations based on the methods of matching or random assignment.

In the simplest econometric approach, two functions are postulated: $Y_1 = g_1(X, U_1)$ and $Y_0 = g_0(X, U_0)$, where $U_0$ and $U_1$ are unobservables. A selection equation is specified to determine which outcome is observed. Separability between $X$ and $(U_0, U_1)$ is assumed, so that

$Y_1 = g_1(X) + U_1$ and $Y_0 = g_0(X) + U_0,$

where for simplicity we assume that $E(U_1 \mid X) = E(U_0 \mid X) = 0$ so that $g_1(X) = \mu_1(X)$ and $g_0(X) = \mu_0(X)$. These exogeneity assumptions are not strictly required but for simplicity we maintain them.⁵² This assumption defines functions called “structural equations” that do not depend on unobserved variables. In this notation, the treatment on the treated parameter is

$E(Y_1 - Y_0 \mid X, D = 1) = g_1(X) - g_0(X) + E(U_1 - U_0 \mid X, D = 1),$

which combines “structure” and “error” in a somewhat unusual way.

Much applied econometric work is devoted to eliminating the mean effect of unobservables on estimates of functions like $g_0$ and $g_1$. However, as previously noted, the mean difference in unobservables is an essential component of the definition of the parameter of interest in evaluating social programs. In the conventional framework, the selection bias that arises from using a nonexperimental comparison group is given by

$B(X) = E(U_0 \mid X, D = 1) - E(U_0 \mid X, D = 0).$

In the standard evaluation problem, the goal is to set $B(X) = 0$, not to eliminate dependence between $(U_0, U_1)$ and $X$ and $D$.

52 Thus we could instead postulate instruments $Z$ such that $E(U_1 \mid X) \neq 0$ and $E(U_0 \mid X) \neq 0$ but

The conventional econometric approach for addressing selection bias partitions the observed variables $X$ into two not necessarily disjoint sets $(Q, Z)$ corresponding to those variables in the outcome equations and those variables in the participation equation, and then postulates exclusion restrictions. It assumes that certain variables appear in $Z$ but not in $Q$. The conventional approach further restricts the model so that the bias $B(X)$ only depends on $Z$ through a scalar index. Recall that such exclusion restrictions are not required to justify matching as an estimator.⁵³

The latent index model of program participation introduced in Section 6 motivates the characterization of selection bias as a function of a scalar index. In that model, we defined the index $IN = H(Z) - V$, where $H(Z)$ is the mean difference in utilities or discounted earnings between the training and non-training states, and $V$ is assumed to be independent of $Z$. The training indicator, $D$, then equals one when $IN > 0$ and equals zero otherwise, resulting in $\Pr(D = 1 \mid Z) = F_V(H(Z))$. The conventional econometric selection model further assumes that the dependence of $D$ and the unobservables $U_0$ and $U_1$ arises only through $V$ and that $Q$ and $Z$ are independent of $U_0$ and $U_1$. These assumptions imply the following:

$E(U_0 \mid Z, Q, D = 1) = E(U_0 \mid V < H(Z)),$

$E(U_0 \mid Z, Q, D = 0) = E(U_0 \mid V \geq H(Z)),$

$E(U_1 \mid Z, Q, D = 1) = E(U_1 \mid V < H(Z))$ and $E(U_1 \mid Z, Q, D = 0) = E(U_1 \mid V \geq H(Z)).$

We could just as well postulate this representation as the starting point for our analysis of the selection estimator. Both the bias $B(Z)$ and the mean gain of the unobservables, $E(U_1 - U_0 \mid Z, Q, D = 1)$, depend on $Z$ only through the index $H(Z)$. When $F_V$ is assumed to be strictly monotonic almost everywhere, we may write $H(Z) = F_V^{-1}(\Pr(D = 1 \mid Z))$ and the bias and mean gain terms depend on $Z$ solely through the probability of participation $P$. The bias is now given by

(7.14) $B(P(Z)) = E(U_0 \mid P(Z), D = 1) - E(U_0 \mid P(Z), D = 0).$

This is the “index sufficient” representation where $P(Z)$, or equivalently $H(Z)$, is the index.⁵⁴ An important question in the program evaluation literature is whether the selection bias can be characterized solely as a function of $P(Z)$ for different sets of $Z$, or if a more general conditioning set $(Q, Z)$ is required to characterize this bias. In terms of the behavioral model of program participation and program outcomes presented in Section 6.2, the cost of participation, $c$, may play the role of $V$ assuming that it is independent of other variables. $Y_{0k}$ also could play that role provided that we condition on observed variables in forming the probability, and that the residual from this conditioning is independent of all the explanatory variables in the model.

53 Heckman, Ichimura and Todd (1997, 1998) extend the theory of matching to consider separable models with exclusion restrictions and discuss the efficiency gains from using such restrictions. Exclusion restrictions are natural in the context of panel data models where the variables in the outcome equation are measured in periods after the decision to participate in the program is made.

54 See Heckman (1980) for the first derivation of this representation or Heckman and Robb (1985a, 1986a). Multiple decision rules for admission into a program require a multiple index model (Heckman and Robb, 1985a).

Conventional econometric selection models (e.g., Amemiya, 1985) assume that the latent variables $V$, $U_0$, $U_1$ are symmetrically distributed around zero. The assumption of symmetry for $U_0$ and $V$ implies that the bias $B(P(Z))$ is symmetric around $P(Z)$ equal to one-half.

As shown by Figure 7.2, in the normal selection model, if $P(Z)$ is symmetrically distributed around one-half, the average bias over symmetric intervals around that value is zero even though the pointwise bias is nonzero. If the values of $P(Z)$ for a sample of trainees were symmetrically distributed around one-half, the pointwise bias would be non-zero and the assumption justifying matching would not hold. Nonetheless, the selection bias would still average out to zero over any symmetric intervals of $P(Z)$ constructed around $P(Z) = 1/2$.

Hence, the classical selection model justifies matching as a consistent estimator of (7.13) when it is defined over intervals of $P(Z)$ where the bias cancels out, even though it would not justify matching for $E(Y_1 - Y_0 \mid X, D = 1)$ defined pointwise for any points except those where the bias is zero.

To estimate the mean effect of treatment on the treated in the classic econometric selection model, we form the following regression based on equation (3.3):

(7.15) $E(Y \mid Q, D, P(Z)) = E(Y_1 D + Y_0 (1 - D) \mid Q, P(Z), D)$

$= g_0(Q) + D\,(g_1(Q) - g_0(Q)) + D\,E(U_1 \mid Q, P(Z), D = 1) + (1 - D)\,E(U_0 \mid Q, P(Z), D = 0).$

The conditional means of the error terms $E(U_1 \mid Q, P(Z), D = 1)$ and $E(U_0 \mid Q, P(Z), D = 0)$ are called control functions (Heckman and Robb, 1985a, 1986a). Under the assumptions that $U_0$, $U_1$ are statistically independent of $Q$ and $Z$, these functions may be written as

$E(U_0 \mid Q, P(Z), D = 0) = K_0(P(Z)),$

$E(U_1 \mid Q, P(Z), D = 1) = K_1(P(Z)).$

Specific distributional assumptions about $(U_0, V)$ and $(U_1, V)$ produce specific functional forms for $K_0$ and $K_1$. Heckman and MaCurdy (1986) present a catalogue of parametric models including the normal sample selection model of Heckman (1976, 1979).
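As one illustration of how a distributional assumption pins down the control functions, the sketch below writes out $K_0$ and $K_1$ for a normal selection model. The assumptions are mine for the purpose of the example: $V$ is standard normal, $(U_j, V)$ are jointly normal with covariance `cov_uj_v`, and $D = 1$ when $V < H(Z)$, so that $P = \Phi(H)$. Under those assumptions $K_1(P) = -\sigma_{1V}\,\phi(\Phi^{-1}(P))/P$ and $K_0(P) = \sigma_{0V}\,\phi(\Phi^{-1}(P))/(1 - P)$. The function names are hypothetical, and `truncated_mean_below` is only a numerical check of the truncated-normal mean.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution for V (an assumption of this sketch)

def k1(p, cov_u1_v):
    """K1(P) = E(U1 | V < H) when P = Phi(H) and Cov(U1, V) = cov_u1_v."""
    h = STD_NORMAL.inv_cdf(p)
    return -cov_u1_v * STD_NORMAL.pdf(h) / p

def k0(p, cov_u0_v):
    """K0(P) = E(U0 | V >= H) when P = Phi(H) and Cov(U0, V) = cov_u0_v."""
    h = STD_NORMAL.inv_cdf(p)
    return cov_u0_v * STD_NORMAL.pdf(h) / (1.0 - p)

def truncated_mean_below(h, steps=20_000, lower=-10.0):
    """Midpoint-rule approximation to E(V | V < h) for standard normal V."""
    width = (h - lower) / steps
    total = 0.0
    for i in range(steps):
        v = lower + (i + 0.5) * width
        total += v * STD_NORMAL.pdf(v)
    return total * width / STD_NORMAL.cdf(h)

p = 0.3
h = STD_NORMAL.inv_cdf(p)
closed_form = k1(p, 1.0)               # with unit covariance, K1 is E(V | V < h)
numerical = truncated_mean_below(h)    # should agree with the closed form
```

Both functions depend on $Z$ only through $P(Z)$, which is the index sufficiency property in this parametric case.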

Under these conditions, equation (7.15) is really just two sample selection bias equations applied to nonparticipants and participants respectively:

(7.16a) $E(Y_0 \mid Q, P(Z), D = 0) = g_0(Q) + K_0(P(Z))$

(7.16b) $E(Y_1 \mid Q, P(Z), D = 1) = g_1(Q) + K_1(P(Z)).$

The most common form of the model writes $g_0(Q) = Q\beta_0$ and $g_1(Q) = Q\beta_1$ but this is not strictly required. We can use the $D = 1$ and $D = 0$ samples to recover the parameters of the model.

Assuming that there is at least one exclusion restriction (a variable in $Z$ not in $Q$), and that $K_0(P(Z))$ and $K_1(P(Z))$ are not perfectly collinear with $Q$, we can identify $g_0(Q)$ and $g_1(Q)$ up to intercepts for any $K_0$ and $K_1$ functions. The intercepts are not determined. Any intercept in $g_0(Q)$ can be allocated to $K_0$ and vice versa; the same remark applies to the allocation of intercepts between $g_1(Q)$ and $K_1$. To identify the intercepts, it is necessary to have some $Z$ values, say $Z_0$, such that $K_0(P(Z_0)) = 0$ and some $Z$ values, say $Z_1$, such that $K_1(P(Z_1)) = 0$. Using such values, one can identify the unique intercepts for $g_0$ and $g_1$, respectively (Heckman, 1990).⁵⁵ Another way to determine the intercepts is to assume specific functional forms for $K_0$ and $K_1$ that exclude intercept terms as in the conventional normal selection bias model.

Many nonparametric and semiparametric selection bias strategies have been proposed that do not impose functional form assumptions on $K_0$ and $K_1$. All of these strategies require that we identify the intercepts on sets $Z_0$ and $Z_1$, respectively. See the comprehensive surveys by Heckman (1990), Powell (1994), and Honoré and Kyriazidou (1997). Andrews and Schafgans (1996) extend a method proposed in Heckman (1990) to identify the intercepts.

55 This type of identification on limit sets is sometimes called “identification at infinity” because for some models the values of $Z_0$ and $Z_1$ that set $K_0$ and $K_1$ to zero are $\pm\infty$.

With $g_0$ and $g_1$ in hand, we can estimate

$E(Y_1 - Y_0 \mid Q) = g_1(Q) - g_0(Q).$

To form $E(Y_1 - Y_0 \mid Q, P(Z), D = 1)$ observe that from the preceding analysis we know $g_0(Q)$, $g_1(Q)$ and

$E(U_1 \mid Q, P(Z), D = 1) = K_1(P(Z)).$

We do not directly estimate $E(U_0 \mid Q, P(Z), D = 1)$. However, under our assumptions about the (mean) independence of $U_0$ and $(Q, Z)$, we can write

$0 = E(U_0 \mid Q, P(Z), D = 1)\,P(Z) + E(U_0 \mid Q, P(Z), D = 0)\,(1 - P(Z)).$

Because we know both the second term in this expression and $P(Z)$, we can form

$E(U_0 \mid Q, P(Z), D = 1) = -K_0(P(Z))\,\dfrac{1 - P(Z)}{P(Z)}.$

Thus we can construct

$E(Y_1 - Y_0 \mid Q, P(Z), D = 1) = g_1(Q) - g_0(Q) + K_1(P(Z)) + K_0(P(Z)) \left[ \dfrac{1 - P(Z)}{P(Z)} \right].$⁵⁶

To estimate $E(Y_1 - Y_0 \mid Q, D = 1)$ we simply integrate out (average out) $P(Z)$ against the density of $P(Z)$ conditional on $D = 1$ and $Q$, which can be estimated. Thus, by making separability, exclusion and intercept identification assumptions, we can identify the parameter of interest. (See Heckman, Ichimura, Smith and Todd (1998) for details.)
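The construction above can be sketched as a short calculation. The control functions `k1_example` and `k0_example` below are invented for illustration (they are not derived from any particular error distribution), as are the values of $g_1(Q)$, $g_0(Q)$ and the trainee scores. The code backs out $E(U_0 \mid Q, P(Z), D = 1)$ from $K_0$ through the iterated-expectations identity and then averages over the scores observed among trainees.

```python
def u0_mean_for_trainees(k0, p):
    """E(U0 | Q, P(Z) = p, D = 1) implied by 0 = E(U0 | D = 1) p + K0(p) (1 - p)."""
    return -k0(p) * (1.0 - p) / p

def tt_given_p(g1_q, g0_q, k1, k0, p):
    """E(Y1 - Y0 | Q, P(Z) = p, D = 1) = g1 - g0 + K1(p) + K0(p) (1 - p) / p."""
    return g1_q - g0_q + k1(p) - u0_mean_for_trainees(k0, p)

def tt_averaged(g1_q, g0_q, k1, k0, trainee_scores):
    """Average the pointwise parameter over the empirical distribution of P(Z) among trainees."""
    values = [tt_given_p(g1_q, g0_q, k1, k0, p) for p in trainee_scores]
    return sum(values) / len(values)

# Hypothetical control functions: K1 vanishes as P -> 1 and K0 vanishes as P -> 0.
def k1_example(p):
    return 0.5 * (1.0 - p)

def k0_example(p):
    return -0.4 * p

trainee_scores = [0.2, 0.5, 0.8]  # hypothetical P(Z) values for trainees with a given Q
tt = tt_averaged(12.0, 10.0, k1_example, k0_example, trainee_scores)
print(tt)  # 2 + 0.1 * mean(1 - p), which is 2.05 up to rounding
```

In an application $K_0$, $K_1$, $g_0$ and $g_1$ would be estimated from (7.16a) and (7.16b), and the averaging would use the estimated density of $P(Z)$ given $D = 1$ and $Q$.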

The control function method parameterizes the bias function B(P (Z)) in terms of K1(P ) and K0(P ) and estimates these functions along with the other parameters of the model.

The dependence induced between $U_0$ and $D$ operating through the $V$ is called “selection on unobservables.” The dependence induced between $U_0$ and $D$ operating through dependence between $Z$ and $U_0$ is termed “selection on observables” (Heckman and Robb, 1985a, 1986a). In this context, the method of matching assumes selection on observables, because conditioning on $Z$ controls the dependence between $D$ and $U_0$, producing a counterpart to (7.6a) for the residuals: $E(U_0 \mid Z, D = 1) = E(U_0 \mid Z, D = 0)$. When selection is on unobservables, it is impossible to condition on $Z$ and eliminate the selection bias. We next turn to the method of instrumental variables which, like matching, assumes that selection only occurs on the observables.

7.4.3 The Method of Instrumental Variables

The method of instrumental variables applied to estimate $E(Y_1 - Y_0 \mid X, D = 1)$ is a variant of the method of matching. It augments the $X$ variables in matching with instruments $Z$ so that

(7.17a) $E(U_1 - U_0 \mid X, Z, D = 1) = E(U_1 - U_0 \mid X, D = 1),$

(7.17b) $E(U_0 \mid X, Z) = E(U_0 \mid X),$

and that

(7.17c) $\Pr(D = 1 \mid X, Z)$

depends in a nontrivial way on both $X$ and $Z$. In particular, there must be at least two values of $Z$, say $Z'$ and $Z''$, such that for any $X$ where we seek to identify the parameter of interest, $\Pr(D = 1 \mid X, Z') \neq \Pr(D = 1 \mid X, Z'')$. We assume that $(X, Z)$ satisfies the noncausality condition (7.A.1) replacing $X$ in that condition with $(X, Z)$.

Condition (7.17a) rules out any dependence between $U_1 - U_0$ and $Z$ given $X$ and $D$. It is implied by the condition

$\Pr(D = 1 \mid X, Z, U_1 - U_0) = \Pr(D = 1 \mid X, Z).$

56 Björklund and Moffitt (1987) construct $E(Y_1 - Y_0 \mid X, D = 1)$ in exactly this way for a normal selection

The second condition (7.17b) says that U0 may depend on X but not on Z. This is not a standard IV condition but it is analogous to the balance of bias condition in matching.

Applying these conditions, we use the law of iterated expectations to write

$E(Y \mid X, Z') = g_0(X) + [g_1(X) - g_0(X) + E(U_1 \mid X, D = 1) - E(U_0 \mid X, D = 1)]\,\Pr(D = 1 \mid X, Z') + E(U_0 \mid X).$

We can express $E(Y \mid X, Z'')$ similarly for the same $X$, but a different $Z = Z''$. By subtracting $E(Y \mid X, Z'')$ from $E(Y \mid X, Z')$, we can form the following expression:

(7.18) $\dfrac{E(Y \mid X, Z') - E(Y \mid X, Z'')}{\Pr(D = 1 \mid X, Z') - \Pr(D = 1 \mid X, Z'')} = g_1(X) - g_0(X) + E(U_1 - U_0 \mid X, D = 1) = E(Y_1 - Y_0 \mid X, D = 1).$

Condition (7.17a) ensures that when we further condition on $Z$, it does not affect the conditioning of $U_1 - U_0$ on $D = 1$ and $X$. Condition (7.17c) assures us that the denominator of the expression is not zero.
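A small numerical check of (7.18) may help. All numbers below are hypothetical values at a single $X$: $g_0 = 10$, $g_1 = 12$, a mean unobserved gain among trainees of 0.5 that does not vary with $Z$ (condition (7.17a)), and $E(U_0 \mid X) = 1.5$, which is deliberately nonzero to show that it differences out. The function names are invented for this sketch.

```python
def mean_outcome(g0, g1, gain_trainees, e_u0, p):
    """E(Y | X, Z) = g0 + [g1 - g0 + E(U1 - U0 | X, D = 1)] Pr(D = 1 | X, Z) + E(U0 | X)."""
    return g0 + (g1 - g0 + gain_trainees) * p + e_u0

def iv_ratio(ey_a, ey_b, p_a, p_b):
    """Ratio in (7.18); undefined when the instrument does not shift participation."""
    if p_a == p_b:
        raise ValueError("condition (7.17c) fails: participation probabilities are equal")
    return (ey_a - ey_b) / (p_a - p_b)

g0, g1 = 10.0, 12.0
gain_trainees = 0.5       # E(U1 - U0 | X, D = 1), the same at both values of Z
e_u0 = 1.5                # E(U0 | X), not required to be zero
p_prime, p_double_prime = 0.6, 0.3

ey_prime = mean_outcome(g0, g1, gain_trainees, e_u0, p_prime)
ey_double_prime = mean_outcome(g0, g1, gain_trainees, e_u0, p_double_prime)
treatment_on_treated = iv_ratio(ey_prime, ey_double_prime, p_prime, p_double_prime)
```

The ratio equals $g_1 - g_0 + E(U_1 - U_0 \mid X, D = 1) = 2.5$ up to rounding, and the nonzero $E(U_0 \mid X)$ plays no role because it is common to both values of $Z$.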

Observe that if we assume that $E(U_0 \mid X) = 0$ and $E(U_1 \mid X) = 0$ (so $g_0(X) = \mu_0(X)$ and $g_1(X) = \mu_1(X)$),⁵⁷ and if we assume that

(7.19) $(U_0, U_1) \perp\!\!\!\perp D \mid X, Z,$

then IV also identifies

$E(Y_1 - Y_0 \mid X) = g_1(X) - g_0(X) = \mu_1(X) - \mu_0(X),$

the effect of treatment on a randomly chosen person with characteristics $X$. Under these assumptions, matching and IV are now indistinguishable except that IV augments the original $X$ variables by $Z$.⁵⁸

If individuals select into the program on the basis of the gain in unobservables, $U_1 - U_0$, or on the basis of variables that are (stochastically) dependent on the gain in unobservables, the conditions required for IV estimators to consistently estimate $E(Y_1 - Y_0 \mid X, D = 1)$ are not satisfied (Heckman and Robb, 1985a, 1986a,b; Heckman, 1997) unless $U_1 = U_0$ or $U_1 - U_0$ is unknown or not acted on at the time program participation decisions are made.

If the instrument $Z$ is correlated with the gain in unobservables, and if individuals base their participation decisions at least in part on that gain, then the instrument is correlated with the error in the outcome equation. For the parameter of interest, treatment on the treated, failure of (7.17a) produces:

$E(Y_1 - Y_0 \mid X, Z, D = 1) = (g_1(X) - g_0(X)) + E(U_1 - U_0 \mid X, Z, D = 1).$

57 If $E(U_0 \mid X) = 0$, then (7.17b) is the more familiar IV condition $E(U_0 \mid X, Z) = E(U_0 \mid X) = 0$.

58 Observe that even if $E(U_0 \mid X) \neq 0$ and $E(U_1 \mid X) \neq 0$, under conditions (7.17a) to (7.17c), IV identifies $E(Y_1 - Y_0 \mid X) = g_1(X) - g_0(X) + E(U_1 - U_0 \mid X)$.

Because the instrument enters the second term on the right hand side, it is not a valid instrument. The outcome equation may be written as

$Y = g_0(X) + D\,E(Y_1 - Y_0 \mid X, Z, D = 1) + \{U_0 + D\,[(U_1 - U_0) - E(U_1 - U_0 \mid X, Z, D = 1)]\}.$

The term in braces is the unobservable when the parameter of interest is the impact of training on the trained. For $Z$ to be a valid instrument, it must be mean independent of this error term. But if the gain in unobservables determines participation, then $Z$ conditional on $D = 1$ is related to the gain and the expectation of the error term conditional on $Z$ is certainly not equal to zero. The implication of this result is that when the response to training varies among individuals, and the parameter of interest is the impact of treatment on the treated, the method of instrumental variables requires a strong behavioral assumption about how persons make their decisions about program participation.

To make this point more concretely, consider an example in which program evaluators use the distance between a person’s residence and the training center as an instrument. They assume that the distance to the training center affects outcomes only through the participation indicator in the earnings equation. The problem that arises in the heterogeneous response framework is that we would expect persons who live far away from the training center to participate in training only when their expected gain from training is relatively large, that is, large enough to offset their higher cost of participation. By contrast, persons closer to the training center, who therefore face a lower cost of participation, will have smaller average expected gains from training. As a result, if an individual participates in training, their post-training earnings also depend on how far away they live from a training center. Therefore the instrument, distance, is correlated with the unobserved component of the gain from training for those who take training ($D = 1$) even if it is not for a random sample of persons in the population. Put differently, knowing how far trainees live from a training center tells us something about their expected earnings even conditional on their training status, which means that distance from the training center is not a valid instrument in this case.⁵⁹
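The distance example can be worked through with a small enumerated population. The numbers are hypothetical: four equally likely gains, a participation cost of 0.25 near the center and 1.25 far from it, half the population at each distance, and $E(Y_0) = 10$ unrelated to distance. A person participates when the gain exceeds the cost, so participation is based on the gain, which is exactly the case in which (7.17a) fails.

```python
gains = [0.5, 1.0, 1.5, 2.0]            # equally likely gains Y1 - Y0 (hypothetical)
cost = {"near": 0.25, "far": 1.25}      # participation cost rises with distance (hypothetical)
Y0_MEAN = 10.0                          # E(Y0), assumed unrelated to distance and to gains

def summarize(distance):
    """Return (participation rate, mean gain of participants, mean observed outcome) at a distance."""
    takers = [g for g in gains if g > cost[distance]]
    share = len(takers) / len(gains)
    mean_gain = sum(takers) / len(takers)
    mean_y = Y0_MEAN + share * mean_gain
    return share, mean_gain, mean_y

p_near, gain_near, y_near = summarize("near")
p_far, gain_far, y_far = summarize("far")

# IV ratio using distance as the instrument, as in (7.18).
iv_estimate = (y_near - y_far) / (p_near - p_far)

# Treatment on the treated, with half the population at each distance.
tt = (p_near * gain_near + p_far * gain_far) / (p_near + p_far)

print(gain_near, gain_far)  # 1.25 and 1.75: trainees who live far away have larger gains
print(iv_estimate, tt)      # 0.75 versus about 1.417
```

In this sketch the mean gain among trainees depends on distance, so the IV ratio does not recover the impact of treatment on the treated. It equals the mean gain of the persons whose participation changes with distance (gains 0.5 and 1.0).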

59 Notice that this is an alternative interpretation that explains the “discount rate bias” recently discussed by Card (1995). Instrumenting by distance to a school or a training center may raise the estimated return to schooling or training if responses to schooling or training are heterogeneous and persons act on this heterogeneity in enrolling in schooling or training programs.

7.4.4 The Instrumental Variable Estimator as a Matching Estimator

Heckman (1998c) shows how most evaluation estimators, including IV estimators, can be interpreted as matching estimators using the weighting framework of equations (7.8a) and (7.8b). To see the basic idea, consider the simple random coefficient model

$Y = \beta(X) + \alpha D + U.$

We define $\beta$ and $\alpha$ as functions of $X$ where $E(U \mid X, D) \neq 0$. Assume a valid instrument $Z$ that satisfies conditions (7.17a) through (7.17c). Then

$E(Y \mid X, Z) = \beta(X) + E(\alpha \mid X, D = 1)\,E(D \mid X, Z) + E(U \mid X, Z).$

Now we can express the outcome equation as follows:

$Y = \beta(X) + E(\alpha \mid X, D = 1)\,E(D \mid X, Z) + U + [\alpha - E(\alpha \mid X, D = 1)]\,[E(D \mid X, Z) + W] + E(\alpha \mid X, D = 1)\,W,$

where $D = E(D \mid X, Z) + W$ and where, under our assumptions, the error terms have mean zero conditional on $X$ and $Z$.⁶⁰ If we have a valid instrument, then $E(U \mid X, Z) = E(U \mid X)$ and $E(\alpha \mid X, Z, D = 1) = E(\alpha \mid X, D = 1)$. To identify $E(\alpha \mid X, D = 1)$ we may form pairwise comparisons between person $i$ and anyone else, provided that the matched partner for $i$, say $i'$, has the same $X$ but a different $Z = Z'$, where

$E(D \mid X, Z) \neq E(D \mid X, Z').$

If this condition is satisfied, we may match a suitable $i'$ to form the pairwise estimate of the gains as follows:

$\dfrac{Y_i - Y_{i'}}{E(D_i \mid X, Z_i) - E(D_{i'} \mid X, Z_{i'})}.$

Therefore,

$E\left[ \dfrac{Y_i - Y_{i'}}{E(D_i \mid X, Z_i) - E(D_{i'} \mid X, Z_{i'})} \right] = E(\alpha \mid X, D = 1).$

Accordingly, we can write our estimate of $E(\alpha \mid X, D = 1)$ as a weighted average of contrasts:

(7.20) $\hat{\alpha} = \sum_{i,i'} \left[ \dfrac{Y_i - Y_{i'}}{E(D_i \mid X, Z_i) - E(D_{i'} \mid X, Z_{i'})} \right] W(i, i')$

for $i, i'$ such that $E(D_i \mid X, Z_i) \neq E(D_{i'} \mid X, Z_{i'})$, and where the weights are given by

$W(i, i') = \dfrac{(E(D_i \mid X, Z_i) - E(D_{i'} \mid X, Z_{i'}))^2}{\sum_{i,i'} (E(D_i \mid X, Z_i) - E(D_{i'} \mid X, Z_{i'}))^2}.$

60 As we have stressed repeatedly, all we need is that the error terms depend only on $X$ in order to recover $E(\alpha(X) \mid X, D = 1)$.

Formally, we set

$\dfrac{Y_i - Y_{i'}}{E(D_i \mid X, Z_i) - E(D_{i'} \mid X, Z_{i'})} = 0$

for $i, i'$ where $E(D_i \mid X, Z_i) = E(D_{i'} \mid X, Z_{i'})$, and we get the same result summed over all $i, i'$ since for these cases $W(i, i') = 0$.

Equation (7.20) reveals that propensity score matching estimates $E(\alpha \mid X, D = 1)$ by taking a weighted average of all $i, i'$ contrasts for values of $(X, Z)$ with distinct probability values. Instrumental variable estimation is just a weighted average of contrasts of conditional means constructed in terms of propensity scores. Observe that this method only requires (7.17b) and not that $E(U \mid X, Z) = 0$. Thus, like matching and randomized trials, the IV method does not eliminate conventional econometric exogeneity bias; it just balances the bias.
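A direct transcription of (7.20) shows how the weights work. The sketch assumes the participation probabilities $E(D_i \mid X, Z_i)$ are known and that all persons share the same $X$; the data and function names are hypothetical. Because the weights are proportional to squared gaps in the probabilities, the estimator reduces algebraically to the least-squares slope of $Y$ on the probability, which the second function computes for comparison.

```python
def pairwise_iv(y, p):
    """Estimator (7.20): pairwise contrasts weighted by squared gaps in E(D | X, Z)."""
    n = len(y)
    # Pairs with equal probabilities carry zero weight, so they are skipped.
    pairs = [(i, k) for i in range(n) for k in range(n) if p[i] != p[k]]
    total_sq_gap = sum((p[i] - p[k]) ** 2 for i, k in pairs)
    estimate = 0.0
    for i, k in pairs:
        gap = p[i] - p[k]
        weight = gap ** 2 / total_sq_gap
        estimate += weight * (y[i] - y[k]) / gap
    return estimate

def slope_on_score(y, p):
    """Least-squares slope of y on p, shown only for comparison."""
    n = len(y)
    p_bar = sum(p) / n
    y_bar = sum(y) / n
    cov = sum((pi - p_bar) * (yi - y_bar) for pi, yi in zip(p, y))
    var = sum((pi - p_bar) ** 2 for pi in p)
    return cov / var

# Hypothetical persons with a common X: known E(D | X, Z_i) and observed outcomes.
p_known = [0.2, 0.4, 0.4, 0.7, 0.9]
y_observed = [10.5, 11.1, 10.7, 11.6, 12.3]
alpha_hat = pairwise_iv(y_observed, p_known)

# Noise-free check: if Y = 10 + 2.5 * E(D | X, Z) exactly, the estimator returns 2.5.
y_exact = [10.0 + 2.5 * pi for pi in p_known]
alpha_exact = pairwise_iv(y_exact, p_known)
```

The squared-gap weights give the most influence to pairs whose participation probabilities differ the most, which are also the pairs whose contrasts are least sensitive to noise in the denominator.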

7.4.5 IV Estimators and The Local Average Treatment Effect

Imbens and Angrist (1994) reinterpret the output of IV equation (7.18) as the effect of treatment on those who change state in response to a change in $Z$. It is a discrete approximation to the marginal treatment effect (3.14) previously discussed in Section 3.4 and defined as the effect of a marginal change of a policy on those induced to change state as a consequence of the policy. Keeping the conditioning on $X$ implicit, their parameter is $E(Y_1 - Y_0 \mid D(z) = 1, D(z') = 0)$ where $D(z)$ is the conditional random variable $D$ given $Z = z$, and where $z'$ is distinct from $z$, so $z \neq z'$. This conditions on people who switch from “0” to “1” as a consequence of the change in $Z$. This parameter is termed “LATE” for Local Average Treatment Effect.

The LATE parameter has several nonstandard features. It is defined by variation in an instrumental variable that is external to the outcome equation. Unlike the instrumental variables discussed in the preceding section, in LATE, different instruments define different parameters. In the traditional IV literature, $Z$ is used to identify the effect of $X$ on outcomes. In LATE, variation in $Z$ defines the parameter and no distinction between $X$ and $Z$ is made. When the instruments are indicator variables that denote different policy regimes, or when the instruments are different levels of intensity of a policy within a given regime (i.e., the level of $\varphi$ in terms of the analysis of Section 3.4), LATE identifies the response to policy changes for those who change their program participation status in response to the policy change. When the instruments refer to personal or neighborhood characteristics used to predict an endogenous variable, say schooling in an earnings equation, LATE has a less clear cut interpretation and its relevance for policy analysis is questionable.


The measured variation in $Z$ among people could be due to their choices of $Z$. If distance to the nearest school is the instrument, LATE estimates the effect of variation in the distance to school on the earnings gain of persons who are induced to change their schooling status as a consequence of the different commuting costs due to distance that vary within a specified range. If a personal characteristic is used as an instrument, for example, family income, the parameter defines the marginal change in the outcome with respect to the variation in family income among those who would have changed their state in response to the sample variation in family income.

To define the LATE parameter more precisely, let $D(z)$ be the conditional random variable $D$ given $Z = z$. (Recall that conditioning on $X$ is kept implicit in this section.) Since $D(z)$ is defined conditional on a particular realization of $Z = z$, it is independent of $Z$.⁶¹ Imbens and Angrist (1994) assume that

(7.IA.1) $(Y_0, Y_1, D(z))$ are independent of $Z$ and $\Pr(D = 1 \mid Z = z)$ is a nontrivial function of $Z$,

where these random variables are understood to be defined conditional on $X$.

As a consequence of this assumption, for a given person (with fixed $Y_1$, $Y_0$), and recalling that for $Z = z$, $Y = Y_0 (1 - D(z)) + Y_1 D(z)$, we may write

(7.21) $E(Y \mid Z = z) - E(Y \mid Z = z') = E[D(z) Y_1 + (1 - D(z)) Y_0 \mid Z = z] - E[D(z') Y_1 + (1 - D(z')) Y_0 \mid Z = z'] = E((D(z) - D(z'))(Y_1 - Y_0)).$

The final step follows from assumption (7.IA.1) and depends crucially on the conditional independence of $Y_1$, $Y_0$ and $D(z)$ from $Z$.

In the Imbens-Angrist thought experiment, all of the random variables in the expression are defined for the same person. Thus for different values of $Z = z$, $Y_1$ and $Y_0$ do not change and $\{D(z)\}$ for $z$ in the support of $Z$ is a collection of not necessarily independent random variables produced by changing $Z$ and either not changing any other random variable or changing them only in the way specified in assumption (7.IA.2) below. In terms of the index model of discrete choice theory with index function $H(Z, V)$, which may be a net profit or net utility function, we have

(7.22) $D = 1(H(z, V) \geq 0)$

and $V$ is a random variable. In the Imbens and Angrist (1994) thought experiment, $V$ stays fixed while $z$ is varied.
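The thought experiment can be mimicked on a small enumerated population. The index $H(z, V) = z - V$, the five persons, and the two instrument values below are all hypothetical choices made for this sketch. Each person keeps the same $(V, Y_0, Y_1)$ while $z$ is moved, and the code checks identity (7.21) and computes the mean gain of those who switch into participation.

```python
def participates(z, v):
    """Index rule (7.22) with the hypothetical index H(z, V) = z - V."""
    return 1 if z - v >= 0 else 0

# Hypothetical persons (V, Y0, Y1); these values stay fixed while z is varied.
persons = [(-0.5, 8.0, 11.0), (-0.1, 6.0, 10.0), (0.3, 9.0, 10.0),
           (0.8, 10.0, 12.0), (1.5, 7.0, 7.5)]
z_high, z_low = 1.0, 0.0

def mean_observed_outcome(z):
    """E(Y | Z = z) when (Y0, Y1, D(z)) are independent of Z, so the same population applies at each z."""
    outcomes = [y1 if participates(z, v) else y0 for v, y0, y1 in persons]
    return sum(outcomes) / len(outcomes)

# Left-hand side of (7.21): difference in mean observed outcomes.
lhs = mean_observed_outcome(z_high) - mean_observed_outcome(z_low)

# Right-hand side of (7.21): E[(D(z) - D(z')) (Y1 - Y0)].
rhs = sum((participates(z_high, v) - participates(z_low, v)) * (y1 - y0)
          for v, y0, y1 in persons) / len(persons)

# Share of persons with D(z) - D(z') = 1 and the implied mean gain for them.
switch_share = sum(1 for v, _, _ in persons
                   if participates(z_high, v) - participates(z_low, v) == 1) / len(persons)
late = lhs / switch_share
```

With this index nobody moves out of participation when $z$ rises, so the ratio `late` is the mean gain of the two switchers (gains 1 and 2), that is, 1.5 up to rounding. The persons who always or never participate contribute nothing to the difference in means.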

From equation (7.21) it follows that

(7.23) $E(Y \mid Z = z) - E(Y \mid Z = z')$

$= E(Y_1 - Y_0 \mid D(z) - D(z') = 1)\,\Pr(D(z) - D(z') = 1)$

61 For two random variables $(J, K)$ let $f$ be the density (or frequency). Then $f(J, K) = f(J \mid K) f(K)$ so $J$ given $K$ is statistically independent of $K$ although $f(J \mid K)$ may be functionally dependent on $K$.