8 Econometric Practice
8.2 Characterizing Selection Bias
We next draw on the work of Heckman, Ichimura, Smith and Todd (1996, 1998), and Heckman, Ichimura and Todd (1997), and demonstrate the value of better data in conducting evaluations of active labor market policies. Placing people in the same local labor market and administering them the same survey instrument makes an enormous difference to the quality of an evaluation. So does comparing comparable people. We also summarize the best available evidence on the validity of the widely-used practice of using “no-shows” as a comparison group.
81 Bound, Brown, Duncan and Rodgers (1994) provide evidence of recall effects in labor market survey data and Sudman and Bradburn (1982) discuss the general issue of recall in surveys.
The mean selection bias in using nonparticipants to approximate participant outcomes conditional on X is given by
(8.1) $B(X) = E(Y_0 \mid X, D = 1) - E(Y_0 \mid X, D = 0).$
Selective differences in uncontrolled variables (variables on which the analyst cannot condition) produce selection bias. Such differences may arise from self-selection decisions by the agents being studied or from uncontrolled differences between treatments and controls due to the inadequacy of the available data. We argue that much of the bias reported by LaLonde (1986) in his influential study of the effectiveness of econometric estimators arises from the second source: the inadequacy of the data. In ordinary non-experimental evaluations, B is unknown. This produces the evaluation problem. Using data from a social experiment conducted under the conditions specified in Section 5, it is possible to estimate the first term on the right hand side of (8.1). Using a nonexperimental comparison group it is possible to estimate the second term.
The conventional measure of selection bias, B, used by LaLonde (1986), Ashenfelter (1978) and Heckman and Hotz (1989) is the mean difference between the earnings of controls and the earnings of comparison group members:
$B = E(Y_0 \mid D = 1) - E(Y_0 \mid D = 0).$
This is the coefficient on D in a regression of $Y_0$ on D in a pooled sample of comparison group members and controls, $Y_0 = \alpha_0 + BD + \tau$, where $E(\tau \mid D) = 0$. It does not condition on X.
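The equivalence between the raw difference in means and the regression coefficient on D can be checked numerically. The following is a minimal sketch on simulated data; the sample sizes, means and variable names are illustrative assumptions, not values from any of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated earnings: D = 1 for experimental controls,
# D = 0 for comparison group members (all numbers are made up).
n1, n0 = 400, 600
y0_controls = rng.normal(loc=10.0, scale=2.0, size=n1)
y0_comparison = rng.normal(loc=12.0, scale=2.0, size=n0)

y0 = np.concatenate([y0_controls, y0_comparison])
d = np.concatenate([np.ones(n1), np.zeros(n0)])

# Conventional bias measure: raw difference in mean Y0.
b_means = y0_controls.mean() - y0_comparison.mean()

# The same quantity as the OLS coefficient on D in Y0 = alpha_0 + B*D + tau.
design = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(design, y0, rcond=None)
alpha0_hat, b_ols = coef

print(f"difference in means:  {b_means:.6f}")
print(f"OLS coefficient on D: {b_ols:.6f}")
```

With a single binary regressor and an intercept, the two numbers agree up to floating-point error, and the intercept equals the comparison group mean.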
Heckman, Ichimura, Smith and Todd (1996, 1998) estimate the bias term B(X) using nonparametric methods. With their estimated bias, they test the identifying assumptions that justify matching, the classical econometric selection bias estimator and a nonparametric version of difference-in-differences. They show that it is possible to decompose the conventional measure of bias, B, which does not condition on X, into three components.
The first component of B results from the fact that for certain values of X among participants there may be no comparison group members, and vice versa. In formal terms, the supports (regions of X where the density function is not zero) of X in the participant and comparison groups may not completely overlap. The second component results from differences in the distribution of X between participants and comparison group members within the region of common support, i.e., for those values of X common to the two groups. The third component represents selection on unobservables as defined in Section 7. This decomposition is helpful for understanding the sources of selection bias as it is conventionally measured.
To reduce the set of conditioning variables, X, down to manageable size, Heckman, Ichimura, Smith and Todd (1996, 1998) condition on the probability of program participation, P(X), rather than directly on X. This is always possible, because we may write the outcome in the absence of training for the experimental controls as follows:
$Y_0 = E(Y_0 \mid P(X), D = 1) + V_1,$
where $E(V_1 \mid P(X), D = 1) = 0$. The corresponding expression for the comparison group members is given by
$Y_0 = E(Y_0 \mid P(X), D = 0) + V_0,$
where $E(V_0 \mid P(X), D = 0) = 0$. The residuals average out to zero within participant (D = 1) and nonparticipant (D = 0) samples.$^{82}$
Using these methods, this bias B can be decomposed into three components:
(8.2) $B = E(Y_0 \mid D = 1) - E(Y_0 \mid D = 0) = B_1 + B_2 + B_3.$ $^{83}$
To help define $B_1$, we first define $S_P$ as the common support, the set of P(X) values common to the D = 1 and D = 0 samples. In addition, let $S_{1P}$ denote the set of P(X) values found in the D = 1 sample and $S_{0P}$ the set found in the D = 0 sample. The first bias term is given by
$B_1 = \int_{S_{1P} \setminus S_P} E(Y_0 \mid P(X), D = 1)\, dF(P(X) \mid D = 1) - \int_{S_{0P} \setminus S_P} E(Y_0 \mid P(X), D = 0)\, dF(P(X) \mid D = 0),$
where $S_{1P} \setminus S_P$ is the subset of $S_{1P}$ not in $S_P$, i.e., the set of P(X) values present in the D = 1 sample but not in the D = 0 sample. The set $S_{0P} \setminus S_P$ is defined comparably for the D = 0 group. The second bias term arises from the different densities of P(X) in the D = 1 and D = 0 samples:
$B_2 = \int_{S_P} E(Y_0 \mid P(X), D = 0)\, [dF(P(X) \mid D = 1) - dF(P(X) \mid D = 0)].$
The third bias term is the contribution of selection bias rigorously defined:
$B_3 = P_X \bar{B}_{S_P},$
where
$\bar{B}_{S_P} = \frac{\int_{S_P} B(P(X))\, dF(P(X) \mid D = 1)}{\int_{S_P} dF(P(X) \mid D = 1)}$
is the average selection bias defined over the common support set, $S_P$, and $B(P(X)) = E(U_0 \mid P(X), D = 1) - E(U_0 \mid P(X), D = 0)$ is the selection bias at each point.
82 This is a valid decomposition whether or not matching is a valid evaluation estimator.
83 This decomposition was first published in Heckman, Ichimura, Smith and Todd (1996).
The first term on the right hand side of (8.2) is the difference between the mean earnings of the controls and the comparison group members in the region outside the common support, that is, for those values of P(X) that appear only among controls or only among comparison group members. This is the bias that arises from comparing noncomparable people: persons in D = 1 who have no counterpart in D = 0 and vice versa. The second term gives the bias due to the different densities of P(X) in the control and comparison groups over the region in which the densities of P(X) for the two groups overlap. This is the bias that arises from weighting comparable people incomparably.
Finally, the third term, or the “true” selection bias, is the weighted (by the distribution of P(X) for controls) average difference between the earnings of controls and comparisons who have the same P(X). If matching is an effective evaluation method, the third term, $B_3$, representing selection on unobservables, should be zero or close to it. Recall from the discussion in Section 7.3 that under the assumptions that justify matching, B(P(X)) = 0 for all P(X). We can interpret estimates of this term as a measure of the extent to which matching does not balance the bias between treatment and comparison group members.
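As a check that the three terms add up to the conventional measure, the pieces can be summed explicitly. This sketch rests on two assumptions that the text does not state: that $P_X$ denotes $\int_{S_P} dF(P(X) \mid D = 1)$, the share of the D = 1 distribution of P(X) lying in the common support, and that the pointwise bias is read as the analogue of (8.1) with P(X) in place of X, that is, as a difference in conditional means of $Y_0$.

```latex
\begin{align*}
B - B_1
  &= \int_{S_P} E(Y_0 \mid P(X), D = 1)\, dF(P(X) \mid D = 1)
   - \int_{S_P} E(Y_0 \mid P(X), D = 0)\, dF(P(X) \mid D = 0) \\
  &= \underbrace{\int_{S_P} E(Y_0 \mid P(X), D = 0)\,
       [dF(P(X) \mid D = 1) - dF(P(X) \mid D = 0)]}_{B_2} \\
  &\quad + \underbrace{\int_{S_P}
       [E(Y_0 \mid P(X), D = 1) - E(Y_0 \mid P(X), D = 0)]\,
       dF(P(X) \mid D = 1)}_{B_3 = P_X \bar{B}_{S_P}}
\end{align*}
```

The first line uses the fact that each unconditional mean splits into an integral over the common support and an integral over the remainder of its own support; the second adds and subtracts $\int_{S_P} E(Y_0 \mid P(X), D = 0)\, dF(P(X) \mid D = 1)$.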
Heckman, Ichimura, Smith and Todd (1996, 1998) estimate the components of selection bias using the experimental controls from the NJS and a sample of eligible non-participants (ENPs) from the same sites as well as using other, more traditional comparison groups of the sort discussed in Section 8.1.$^{84}$ Figure 8.1A plots the densities of P(X) for adult male controls and ENPs. Figure 8.1B plots the densities of P(X) for adult female controls and ENPs. In both groups, for a substantial range of P(X) values in the control sample, there are few or no corresponding comparison group members. Among the adult males, nearly one half of the controls’ P(X) values are outside the region of overlapping support.
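To make the three components concrete, the sketch below computes a discrete analogue of the decomposition in (8.2) on made-up data. It assumes that P(X) has been binned into a small number of cells, so that integrals become sums over cells; the function name `decompose_bias`, the cell labels and the outcome values are all hypothetical, and this is not the nonparametric estimator used in the cited studies.

```python
import numpy as np

def decompose_bias(p1, y1, p0, y0):
    """Discrete analogue of B = B1 + B2 + B3 over binned P(X) cells.

    p1, y1: cell labels and Y0 outcomes for the D = 1 sample (controls).
    p0, y0: cell labels and Y0 outcomes for the D = 0 sample (comparisons).
    """
    p1, y1, p0, y0 = (np.asarray(a) for a in (p1, y1, p0, y0))
    s1, s0 = set(p1.tolist()), set(p0.tolist())
    common = s1 & s0  # common support S_P

    def cell_mean(p, y, v):
        return y[p == v].mean()   # E(Y0 | P = v, D)

    def cell_share(p, v):
        return np.mean(p == v)    # Pr(P = v | D)

    # B1: contributions from cells outside the common support.
    b1 = (sum(cell_mean(p1, y1, v) * cell_share(p1, v) for v in sorted(s1 - common))
          - sum(cell_mean(p0, y0, v) * cell_share(p0, v) for v in sorted(s0 - common)))
    # B2: same conditional mean, different weights, inside the common support.
    b2 = sum(cell_mean(p0, y0, v) * (cell_share(p1, v) - cell_share(p0, v))
             for v in sorted(common))
    # B3: pointwise mean differences, weighted by the D = 1 distribution.
    b3 = sum((cell_mean(p1, y1, v) - cell_mean(p0, y0, v)) * cell_share(p1, v)
             for v in sorted(common))
    return float(b1), float(b2), float(b3)

# Made-up example: cell 4 appears only among controls, cell 1 only among comparisons.
p1 = [2, 2, 3, 3, 4, 4]
y1 = [4.0, 6.0, 7.0, 9.0, 10.0, 12.0]
p0 = [1, 1, 2, 2, 2, 3]
y0 = [3.0, 5.0, 6.0, 6.0, 9.0, 10.0]

b1, b2, b3 = decompose_bias(p1, y1, p0, y0)
raw_bias = float(np.mean(y1) - np.mean(y0))
print("B1, B2, B3:", b1, b2, b3)
print("sum of components:", b1 + b2 + b3, "raw bias B:", raw_bias)
```

In this toy example the raw bias is positive even though the pointwise differences on the common support are negative, which illustrates how the support and weighting terms can mask the sign and size of the third component.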
Table 8.1 presents estimates of the decomposition in (8.2) for adult males and females in the NJS. As shown by the second row of Table 8.1, differences in the support of P(X) are an important source of bias. This source of bias is of at least the same order of magnitude as the conventional measure of selection bias presented in the first row of the table. The third row of Table 8.1 indicates that differences in the distributions of P(X) between control and comparison group members in the region of common support are an important source of bias. Finally, the fourth, fifth and sixth rows of the table show that for both groups the selection bias term, $B_3$, is relatively small compared to the other components of B, the bias as conventionally measured. However, $B_3$ is still quite large compared to the estimated program impact. This result indicates that matching on P(X) mitigates but does not eliminate selection bias in the NJS data. Selection on unobservables is substantial relative to the experimentally estimated impact of treatment even using the rich data available in the NJS. It is likely to be even more important in cruder data sets, as we document below.
84 Heckman, Ichimura, Smith and Todd (1996, 1998) estimate the $E(Y_0 \mid P(X), D = 0)$ terms using a local linear regression of the outcome $Y_0$ on P(X). The estimates of P(X) are obtained from logit models of participation in the JTPA program, but estimates using a nonparametric P(X) are very similar.
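The local linear regression mentioned in footnote 84 can be sketched as a kernel-weighted least squares fit of the outcome on P(X) around each evaluation point. The version below is a minimal illustration only: the Gaussian kernel, the bandwidth, the function name `local_linear` and the simulated data are assumptions, and the cited studies' kernel and bandwidth choices may differ.

```python
import numpy as np

def local_linear(p_eval, p, y, bandwidth):
    """Local linear estimate of E(y | P = p_eval) using a Gaussian kernel."""
    p_eval = np.atleast_1d(np.asarray(p_eval, dtype=float))
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = np.empty_like(p_eval)
    for i, point in enumerate(p_eval):
        u = p - point                                # regressor centered at the evaluation point
        sw = np.sqrt(np.exp(-0.5 * (u / bandwidth) ** 2))  # square roots of kernel weights
        design = np.column_stack([np.ones_like(u), u])
        # Weighted least squares via rescaling rows by sqrt(weight).
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]                          # intercept = fitted value at the point
    return fitted

# Hypothetical comparison-group data: estimated P(X) values and noisy outcomes.
rng = np.random.default_rng(1)
p_hat = rng.uniform(0.05, 0.95, size=300)
y0_obs = 5.0 + 4.0 * p_hat ** 2 + rng.normal(scale=0.5, size=300)

grid = np.array([0.2, 0.5, 0.8])
print(local_linear(grid, p_hat, y0_obs, bandwidth=0.1))

# Sanity check on noiseless linear data, which a local linear fit reproduces exactly.
p_lin = np.linspace(0.05, 0.95, 50)
exact = local_linear([0.3, 0.7], p_lin, 2.0 + 3.0 * p_lin, bandwidth=0.1)
```

Fitting a line rather than a constant in each neighborhood is what distinguishes this from a simple kernel average; the intercept of the centered fit is the estimate at the evaluation point.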
Eliminating selection bias in most non-experimental evaluations may be even more difficult than is suggested by Table 8.1. The NJS eligible nonparticipant comparison group was constructed specifically for the purpose of conducting a high quality non-experimental evaluation of JTPA. These data contain many more demographic and baseline characteristics than are commonly available to program evaluators. Further, the comparison group members reside in the same labor market as the trainees, are administered the same survey instruments, and are all eligible for JTPA. The encouraging news from the analyses of Heckman, Ichimura, Smith and Todd (1998) and Heckman, Ichimura and Todd (1997) is that less expensive comparison groups that contain limited labor force status histories but still place comparison group members in the same local labor markets as participants and administer the same surveys to both groups should do just as well as the richer data.
Table 8.2 presents the decomposition when no-shows are used as a comparison group.
In the context of the NJS, no-shows are persons randomly assigned to the experimental treatment group who never enroll in JTPA and do not receive JTPA services (these are the dropouts of Section 5). In the absence of an experiment, no-shows are usually persons who enroll in a program but drop out prior to service receipt. Cooley, et al. (1979) and Bell, et al. (1995) advocate the use of no-shows as a comparison group. On a priori grounds, no-shows are not necessarily an attractive comparison group. Selective differences in unobservables between participants and no-shows will make the latter a poor comparison group if selection on unobservables (conditional on applying to and being accepted into the program) is an important component of bias. Yet, at the same time, no-shows are an attractive comparison group because they are located in the same labor market and administered the same questionnaire as participants.
The first two columns of Table 8.2 present the decomposition in (8.2) constructed using the experimental controls and the no-shows from the NJS. Figures 8.2A and 8.2B present the densities of P(X) for the same groups. There is much more overlap in the supports of the no-show and control groups than there is in the comparison and control groups. Moreover, the shapes of the distributions of P(X) are closer for no-shows and control group members than they are for comparisons and controls. (Compare Figures 8.1A and 8.1B with 8.2A and 8.2B respectively.)
The evidence on no-shows is mixed. The raw measure of bias B is small for both males and females. In addition, the support and density weighting problems are much smaller than those reported in Table 8.1, though part of this difference results from the smaller set of X’s available in the NJS data to construct P(X) for the no-shows. However, as shown in the final row of Table 8.2, the selection bias for the no-shows remains sizeable when measured as a percentage of the treatment impact.
The biases obtained for the NJS no-shows or the ENP comparison group are much smaller than the biases that result from comparing the NJS controls to a comparison group constructed from a general survey data set. The last two columns of Table 8.2 present the bias decompositions based on a comparison group of persons eligible for JTPA drawn from the U.S. Survey of Income and Program Participation (SIPP). The SIPP is a national survey data set of the type widely used in evaluating active labor market policies. SIPP data are rich enough to determine program eligibility. The comparison group constructed from it is not drawn from the same local labor markets as the NJS control group due to sample size and confidentiality limitations. Moreover, the earnings measure in the SIPP differs substantially from that used for the NJS controls due to differences in the respective survey instruments (Smith, 1997a,b).
A comparison of the first rows of Tables 8.1 and 8.2 shows that for the SIPP eligible comparison group, the raw bias, B, is actually smaller for adult males than with the ENP comparison group. The raw bias is about the same magnitude for adult females using the two comparison groups, although of a different sign. However, $B_3$, selection bias rigorously defined, is much larger for the SIPP eligible comparison group than for the NJS eligible non-participant comparison group in Table 8.1. This indicates that mismatch of labor markets and questionnaires between participants and comparison group members is a major source of selection bias.
Heckman, Ichimura, Smith and Todd (1998) examine these issues in greater depth. In particular, using the NJS data on controls and ENPs, they match controls at two sites with ENPs at the two remaining sites. This comparison shows the effect of putting comparison group members in different local labor markets while holding constant the survey instrument used to measure earnings in the two groups. They find that mismatching the local labor markets creates a substantial bias on the order of 30-40 percent of the estimated treatment effect.$^{85}$ Overall, comparing the fifth rows of Tables 8.1 and 8.2 suggests that putting participants and comparison group members in the same labor markets and giving them the same questionnaire eliminates a substantial amount (around fifty percent) of selection bias, rigorously defined.
85 Friedlander and Robins (1995) report similar findings regarding the importance of drawing participants and nonparticipants from the same local labor markets.
Heckman, Ichimura, Smith and Todd (1998) also report that a substantial bias results from using only those observations that fall into the common support of P(X), $S_P$, for the control and comparison group samples to estimate the impact of treatment. Estimating the experimental treatment effect on the common support rather than on the full support of P(X) among the controls increases the estimate by 50 percent. Put differently, the experimental impact estimate is higher for persons whose P(X) lies in the common support.
The failure of the common support condition due to an absence of comparison group members comparable to participants in terms of X (or P(X)) is a major source of bias in conducting nonexperimental evaluations. This motivates one of our major recommendations presented in Section 11: that nonexperimental comparison groups should be designed so that they have the same set of X or P(X) values present among program participants.
An important advantage of an experimental control group in program evaluations is that randomization ensures that the support of treatment and control observed characteristics is the same, up to sampling variation. The results just discussed indicate that non-experimental methods may be able to mitigate major sources of selection bias that arise in the region of common support. Simple principles of using the same questionnaire, locating participants and comparison group members in the same labor markets, comparing comparable people and weighting comparison group members appropriately go a long way toward reducing the conventional measure of selection bias. However, because a significant source of the bias in non-experimental studies is the failure to find a comparison group for which the support of the observed characteristics largely overlaps that of the participants, such studies can only provide a partial description of the impact of treatment. Estimates obtained only over the region of common support may be a poor guide to the impact for all participants. We suspect that this source of bias is substantial for other programs besides the JTPA program where it has been studied.
Heckman, Ichimura, Smith and Todd (1998) use the estimated B(P(X)) functions to test among competing identifying assumptions for alternative evaluation estimators using the NJS data. Using a variety of X, they reach the following main conclusions.
(I) They reject the assumption AM: $B(P(X)) = 0$ for all X, which justifies matching.
(II) They do not reject the assumption ASS: $B(X) = B(P(X))$, which says that the bias can be written as a function of P(X) and which justifies the index sufficient classical sample selection model. However, since the support of P(X) is limited, the method cannot recover $E(Y_1 - Y_0 \mid X, D = 1)$ in their data because of the inability to identify the intercepts in the model. They decisively reject the normal sample selection model in their data.
(III) They do not reject the assumption ADD: $B_t(P(X)) - B_{t'}(P(X)) = 0$ for $t > k > t'$, which justifies the nonparametric difference-in-differences estimator introduced in Heckman, Ichimura and Todd (1997) and Heckman, Ichimura, Smith and Todd (1998). This estimator does not require the full support conditions required in the sample selection estimator, although if they are not satisfied, the treatment effect is defined only over a subset of the support of P(X).
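The role of assumption ADD can be seen in a short derivation. The time-subscripted outcomes $Y_{1t}$ and $Y_{0t}$, and the reading of $k$ as the period of program participation, are notational assumptions introduced here for illustration; as in (8.1), the bias $B_t$ is written as a difference in conditional means of the no-treatment outcome at time $t$.

```latex
\begin{align*}
&E(Y_{1t} - Y_{0t'} \mid P(X), D = 1) - E(Y_{0t} - Y_{0t'} \mid P(X), D = 0) \\
&\quad = E(Y_{1t} - Y_{0t} \mid P(X), D = 1) \\
&\qquad + [E(Y_{0t} \mid P(X), D = 1) - E(Y_{0t} \mid P(X), D = 0)] \\
&\qquad - [E(Y_{0t'} \mid P(X), D = 1) - E(Y_{0t'} \mid P(X), D = 0)] \\
&\quad = E(Y_{1t} - Y_{0t} \mid P(X), D = 1) + B_t(P(X)) - B_{t'}(P(X)) \\
&\quad = E(Y_{1t} - Y_{0t} \mid P(X), D = 1) \quad \text{under ADD.}
\end{align*}
```

Under this reading, the estimator does not need the bias to be zero in any period, only that it be the same before and after the program at each value of P(X), so that differencing removes it.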
Finally, even though the assumptions justifying matching are rejected, matching, nonparametric difference-in-differences, and sample selection models do about equally well for the average of $E(Y_1 - Y_0 \mid X, D = 1)$ over the support where it can be defined, although matching is somewhat inferior to the other two estimators. Their analysis demonstrates that over intervals where the bias balances out, fundamentally different estimators based on different identifying assumptions can identify the same parameter.
Heckman, Ichimura, Smith and Todd (1998) emphasize the importance of using semiparametric and nonparametric versions of all three estimators (matching, classical sample selection and difference-in-differences). When they use conventional parametric versions of these estimators, they estimate substantial biases.
The evidence presented in this subsection has major implications for the correct interpretation of LaLonde’s (1986) influential examination of the effectiveness of nonexperimental evaluation strategies for training programs. As noted in Table 7.1B, LaLonde’s nonexperimental comparison groups were constructed from various noncomparable data sources. The comparison groups were located in different labor markets from program participants and had their earnings measured in different ways than the participants. His measure of selection bias, B, combines the three factors disentangled in the analyses of Heckman, Ichimura, Smith and Todd (1996, 1998) just summarized.$^{86}$ In addition, like most of the studies summarized in Tables 7.1A and 7.1B, he lacked information on recent preprogram labor force status dynamics, which, as noted in Section 6.4, are an important predictor of participation in training. A major conclusion of the analysis of Heckman, Ichimura, Smith and Todd (1998) is that a substantial portion of the bias and sensitivity reported by LaLonde is due to his failure to compare comparable people and to weight them appropriately. Further, mismatches of labor markets and questionnaires are also likely important sources of the selection bias measured in LaLonde’s study. Overall, the available evidence indicates that simple parametric econometric models applied to bad data do not eliminate selection bias.
Instead, better data, including a rich array of X variables for use in constructing P(X), and more appropriate comparison groups, go a long way toward eliminating the sensitivity problems raised in LaLonde’s (1986) study.
86 Some of LaLonde’s (1986) measures of B are based on a linear regression model that “partials out” X, in the sense that linear regression conditions on X. Heckman and Todd (1994) present the appropriate decomposition for this case. When estimated using the NJS controls and eligible nonparticipants, the same qualitative conclusions emerge about the importance of various components of bias.