Virtually all estimation methods can be readily adjusted to account for choice-based sampling (i.e., oversampling of trainees relative to comparison group members) or measurement error in training status among the comparison group (some of the comparison group members have taken training). Some methods require no modification at all.
The data available for analyzing the impact of training on earnings are often nonrandom samples. Frequently they consist of pooled data from two sources: (a) a sample of trainees selected from program records and (b) a sample of nontrainees selected from some national sample. Typically, such samples overrepresent trainees relative to their proportion in the population. This creates the problem of choice-based sampling, first analyzed in a more general form by Rao (1965, 1986) and applied by Manski and Lerman (1977) and Manski and McFadden (1981).
A second problem, contamination bias, arises when the training status of certain individuals is recorded with error. Many control samples such as the U.S. Current Population Survey or the U.S. Social Security Work History data do not reveal whether or not persons have received training. These sampling situations combine the following types of data:
(A) outcomes, observable characteristics and participation status for a sample of trainees (D = 1);
(B) outcomes, observable characteristics and participation status for a sample of non-trainees (D = 0);
(C) outcomes and observable characteristics for a national comparison sample of the population (e.g., CPS or Social Security records) where the training status of persons is not known.
If type (A) and (B) data are combined and the sample proportion of trainees does not converge to the population proportion of trainees, the combined sample is a choice-based sample. If type (A) and (C) data are combined with or without type (B) data, there is contamination bias because the training status of some persons is not known.
We can modify most procedures developed in the context of random sampling to consistently estimate $E(\alpha \mid X, D = 1)$ using choice-based samples or contaminated comparison groups. In some cases, a consistent estimator of the population proportion of trainees is required. We illustrate these claims by showing how to modify the instrumental variables estimator to address both sampling schemes. We briefly consider several other methods as well. Heckman and Robb (1985a, 1986b) give explicit case-by-case treatment of these issues for a variety of estimators including all of the panel data estimators considered in this paper.
7.7.1 The IV Estimator and Choice-Based Sampling
If condition (7.17b) is strengthened to read
$$E(U_0 \mid X, Z, D) = E(U_0 \mid X) \quad \text{for } D = 0, 1, \tag{7.17b$'$}$$
the IV estimator is consistent for $E(\alpha \mid X, D = 1)$ in choice-based samples. The important point to notice is that identification condition (7.17b) is written for the population. By contrast, (7.17b′) is written for the subsets of the population defined by $D = 1$ and $D = 0$.
If we reformulate the IV condition to apply to the D = 0 and D = 1 subpopulations, it does not matter how we reweight the subpopulations to form samples – the orthogonality conditions apply to any combinations of them.
To see how to form consistent estimators under the assumptions of Section 7.4.3, let $D^*$ be the event that “a trainee is observed in a choice-based sample.” In a sample generated by choice-based sampling, the probability of participation is $\Pr(D^* = 1) = P^* \neq P = \Pr(D = 1)$, where $P$ is the probability of participation in the case of random sampling.
Now in the choice-based sample, let $U_0^*$ be the random variable $U_0$ generated from choice-based sampling, so that
$$E(U_0^* \mid X, Z) = E(U_0 \mid X, Z, D^* = 1)P^* + E(U_0 \mid X, Z, D^* = 0)(1 - P^*).$$
If (7.17b′) applies, then we can write
$$E(U_0^* \mid X, Z) = E(U_0 \mid X, Z)P^* + E(U_0 \mid X, Z)(1 - P^*) = E(U_0 \mid X, Z).$$
Provided $P$ is known, it is possible to reweight the data to secure consistent IV estimators for $E(\alpha \mid X, D = 1)$ under the assumptions of Section 7.4.3. Simply multiply both dependent and independent observations by the weight
$$\omega = D\,\frac{P}{P^*} + (1 - D)\left(\frac{1 - P}{1 - P^*}\right)$$
and apply IV to the transformed data. This weighting ensures that (7.17b) applies to the reweighted data. The IV method applied to the reweighted samples consistently estimates the parameters of interest provided that other identifying assumptions are maintained (see Heckman and Robb, 1985a, 1986a).
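The reweighting step can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the data-generating process, the parameter values, and the function names `choice_based_weights`, `iv`, and `simulate` are all assumptions made for illustration. It assumes a common effect $\alpha$, a single instrument $Z$ that is independent of $U_0$ in the population, and a known population share $P$ (here read off the simulated population). The choice-based sample keeps every trainee but only a fraction of nontrainees, so $P^* \neq P$.

```python
import numpy as np

def choice_based_weights(d, p_pop):
    """omega = D*P/P* + (1-D)*(1-P)/(1-P*), with P* the sample share of trainees."""
    p_star = d.mean()
    return d * p_pop / p_star + (1 - d) * (1 - p_pop) / (1 - p_star)

def iv(y, regressors, instruments, w=None):
    """Just-identified IV. If w is given, the dependent variable and the
    regressors are multiplied by w; the instruments are left unweighted."""
    if w is None:
        w = np.ones_like(y)
    a = instruments.T @ (regressors * w[:, None])
    b = instruments.T @ (y * w)
    return np.linalg.solve(a, b)

def simulate(n=200_000, alpha=2.0, keep_nontrainee=0.2, seed=0):
    """Hypothetical population with endogenous D, then a choice-based subsample."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                       # instrument
    u0 = rng.normal(size=n)                      # outcome error, independent of z
    v = rng.normal(size=n)
    d = (-0.5 + 1.5 * z + u0 + v > 0).astype(float)  # participation depends on u0
    y = 1.0 + alpha * d + u0
    p_pop = d.mean()                             # P, assumed known to the analyst
    # Keep all trainees, but only a fraction of nontrainees (oversampling D = 1).
    keep = (d == 1) | (rng.random(n) < keep_nontrainee)
    return y[keep], d[keep], z[keep], p_pop

y, d, z, p_pop = simulate()
const = np.ones_like(y)
R = np.column_stack([const, d])   # regressors: constant and training dummy
W = np.column_stack([const, z])   # instruments: constant and z
omega = choice_based_weights(d, p_pop)

beta_unweighted = iv(y, R, W)
beta_weighted = iv(y, R, W, omega)
print("unweighted IV estimate of alpha:", beta_unweighted[1])
print("reweighted IV estimate of alpha:", beta_weighted[1])
```

In this design the population condition (7.17b) holds but its conditional-on-$D$ strengthening does not, so the unweighted estimate is printed only for comparison. By construction, the weights restore the population share of trainees: the $\omega$-weighted mean of $D$ in the sample equals $P$.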
7.7.2 The IV Estimator and Contamination Bias
For data of type (C), $D$ is not observed. Applying the IV estimator to pooled samples of type (A) and (C) data assuming that all observations in the type (C) data have $D = 0$ produces an inconsistent estimator if the type (C) data include some trainees. However, with a minimal amount of additional information, it is possible to identify the parameter of interest in this case.
In terms of the IV equations (7.18) or (7.20), it is possible to generate $E(Y \mid X, Z)$ from the type (C) sample. The type (A) data yield the sample joint distribution of $(Y, X, Z)$ given $D = 1$ and in particular the joint distribution $f(X, Z \mid D = 1)$. Since we know
$$f(X, Z) = f(X, Z \mid D = 1)P + f(X, Z \mid D = 0)(1 - P),$$
we can solve for $f(X, Z \mid D = 0)$ if we know $P$. From Bayes’ rule, we can write (denoting “$f$” as the density)
$$\Pr(D = 1 \mid X, Z) = \frac{f(X, Z, D = 1)}{f(X, Z)} = \frac{f(X, Z \mid D = 1)\,P}{f(X, Z)}.$$
The two densities can be constructed from the information in the type (C) and type (A) samples. Thus with knowledge of $P$, it is possible to estimate $\Pr(D = 1 \mid X, Z)$ for each person and hence to construct the IV estimator for contaminated samples. One can think of this procedure as a data imputation exercise. See Heckman and Robb (1985a, 1986a), Imbens and Lancaster (1996) and Heckman (1998a) for the econometric details.
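The imputation step can be sketched numerically. The code below is an illustrative assumption rather than the procedure in the cited papers: it uses a discrete instrument $Z$ with four cells (so densities are cell frequencies), no $X$, a known $P$, and a common effect $\alpha$ with $E(U_0 \mid Z) = 0$, under which $E(Y \mid Z) = \beta + \alpha \Pr(D = 1 \mid Z)$. The function names and the simulated design are hypothetical. Training status is never used for the type (C) sample.

```python
import numpy as np

def impute_participation_prob(z_a, z_c, p_pop, n_cells):
    """Bayes' rule by cell: Pr(D=1 | Z=z) = f(z | D=1) * P / f(z).
    f(z | D=1) comes from the trainee sample (A), f(z) from the
    comparison sample (C) whose training status is unknown."""
    f_z_given_d1 = np.bincount(z_a, minlength=n_cells) / len(z_a)
    f_z = np.bincount(z_c, minlength=n_cells) / len(z_c)
    return f_z_given_d1 * p_pop / f_z

def simulate(n=400_000, alpha=2.0, seed=0):
    """Hypothetical population split into a type (C) and a type (A) sample."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 4, size=n)               # discrete instrument, 4 cells
    u0 = rng.normal(size=n)
    v = rng.normal(size=n)
    d = (-1.5 + 0.8 * z + u0 + v > 0).astype(int)
    y = 1.0 + alpha * d + u0
    in_c = np.arange(n) < n // 2                 # (C): random half, D discarded
    in_a = (~in_c) & (d == 1)                    # (A): trainees from the other half
    p_true = np.array([d[z == k].mean() for k in range(4)])  # for checking only
    return (y[in_c], z[in_c]), z[in_a], d.mean(), p_true

(y_c, z_c), z_a, p_pop, p_true = simulate()
p_hat = impute_participation_prob(z_a, z_c, p_pop, n_cells=4)

# E(Y | Z) from sample (C), then a line through the four cell means:
# the slope estimates alpha under the common-effect assumption above.
y_bar = np.array([y_c[z_c == k].mean() for k in range(4)])
alpha_hat, beta_hat = np.polyfit(p_hat, y_bar, 1)

print("imputed Pr(D=1|Z):", np.round(p_hat, 3))
print("true    Pr(D=1|Z):", np.round(p_true, 3))
print("estimate of alpha:", alpha_hat)
```

With continuous $(X, Z)$ the cell frequencies would have to be replaced by density estimates, which this sketch does not attempt.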
7.7.3 Repeated Cross-Section Methods with Unknown Training Status and Choice-Based Sampling
The repeated cross-section estimators discussed in Section 7.6.2 are inconsistent when applied to choice-based samples unless additional conditions are assumed.⁷⁴ For most of the repeated cross-section estimators, it is necessary to know the identity of the trainees to weight the sample back to the proportion of trainees that would be produced by a random sample to obtain consistent estimators. Hence, the class of estimators that does not require knowledge of individual training status is not robust to choice-based sampling.
A subset of the estimators that we have examined are robust to choice-based sampling.
Any estimator that is constructed conditional on $D$ has the property of being robust to choice-based sampling. (Recall our discussion of instrumental variables estimators where the condition (7.17b) was modified to hold conditionally on $D$.) A control function estimator constructs
$$E(U_{1t} \mid X, Z, D) \tag{7.34a}$$
and
$$E(U_{0t} \mid X, Z, D) \tag{7.34b}$$
and works with the purged residuals
$$U_{1t} - E(U_{1t} \mid X, Z, D) \quad \text{and} \quad U_{0t} - E(U_{0t} \mid X, Z, D)$$
from the original model. Then the parameters of (7.34a) and (7.34b) are estimated along with the remaining parameters of the model. Identification conditions for control function models are given in Heckman and Robb (1985a).⁷⁵ The selection bias terms $K_0(P(Z))$ and $K_1(P(Z))$ in equations (7.16a) and (7.16b) are examples of control functions with the inverse Mills’ ratio as the leading example used in empirical work. Likewise, the autoregressive estimator of Heckman and Wolpin (1976) discussed in Section 7.6.1 is a control function estimator where
$$K_t = \rho^{t-t'} U_{t'} \quad \text{for } t > t' > k$$
and where $Y_{t'} - \beta_{t'} - \alpha D_t = U_{t'}$. The higher-order autoregression schemes discussed in Heckman and Robb (1985a, p. 223) are also control functions. They discuss additional control functions based on factor models and optimal forecasting schemes.
⁷⁴ This is not always true. For example, when the environment is time homogeneous, $(Y_t - Y_{t'})/P$ remains a consistent estimator of $E(\alpha \mid X, D = 1)$ in choice-based samples as long as the same proportion of trainees is sampled in periods $t'$ and $t$.
⁷⁵ They present conditions under which it is possible to identify the control functions apart from the parameters of the model. See also Heckman (1990).
The basic principle of the control function is that of constructing conditional means of the errors in each regime ($D = 0, 1$) and estimating these conditional means and the other parameters of the model. As long as the control function is defined to be conditional on $D$, the control function estimator is robust to choice-based sampling.
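The robustness claim can be checked in a simulation of the autoregressive control function $K_t = \rho^{t-t'} U_{t'}$. What follows is a hedged sketch under assumptions that are not in the text: a first-order autoregressive error, a common effect $\alpha$, two post-program periods with $t - t' = 1$, enrollment that depends on the period-$k$ error, and OLS as the estimation method. Substituting $U_{t'} = Y_{t'} - \beta_{t'} - \alpha D$ into the period-$t$ equation gives $Y_t = (\beta_t - \rho\beta_{t'}) + \rho Y_{t'} + \alpha(1 - \rho)D + \varepsilon_t$, so $\alpha$ can be recovered from the two slope coefficients. The function names and parameter values are hypothetical.

```python
import numpy as np

def ar_control_function_alpha(y_t, y_tp, d):
    """Regress Y_t on (1, Y_t', D). The coefficient on Y_t' estimates
    r = rho^(t-t') and the coefficient on D estimates alpha*(1 - r)."""
    X = np.column_stack([np.ones_like(y_t), y_tp, d])
    coef, *_ = np.linalg.lstsq(X, y_t, rcond=None)
    r, b_d = coef[1], coef[2]
    return b_d / (1.0 - r)

def simulate(n=200_000, alpha=2.0, rho=0.6, keep_nontrainee=0.1, seed=0):
    """Hypothetical AR(1) earnings errors with selection on the period-k error,
    followed by a choice-based subsample that oversamples trainees."""
    rng = np.random.default_rng(seed)
    u_k = rng.normal(size=n)                          # error in enrollment period k
    d = (u_k + rng.normal(size=n) < -0.5).astype(float)  # low earners enroll
    u_tp = rho * u_k + rng.normal(size=n)             # period t' = k + 1
    u_t = rho * u_tp + rng.normal(size=n)             # period t  = k + 2
    y_tp = 1.0 + alpha * d + u_tp
    y_t = 1.5 + alpha * d + u_t
    keep = (d == 1) | (rng.random(n) < keep_nontrainee)
    return y_t[keep], y_tp[keep], d[keep]

y_t, y_tp, d = simulate()
alpha_cf = ar_control_function_alpha(y_t, y_tp, d)
alpha_naive = y_t[d == 1].mean() - y_t[d == 0].mean()  # ignores selection

print("control function estimate of alpha:", alpha_cf)
print("naive mean difference:             ", alpha_naive)
```

No reweighting is applied here: because the regression error is mean independent of $Y_{t'}$ within each $D$ group, changing the mix of trainees and nontrainees in the sample leaves the estimator consistent in this design.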