The Counterfactual Framework and Propensity Score Matching

As long as we have no experimental design and only observational studies are available to examine the effect of cram schooling, what can be done to handle selection bias caused by omitted variables and to explore possible heterogeneous causal effects? The method used in this research is propensity score matching, which uses observed family, school, and individual characteristics to match students who undertake cram schooling with those who do not and then calculates the average difference in outcomes between the two groups. Under proper assumptions, this matching method can estimate separately the average causal effect for those who participate in cram schooling and for those who do not, as well as an overall average effect for the two groups combined. The matching method was developed within the framework of counterfactual causal inference. To explain how it works, it is useful to understand the counterfactual framework of causality with a binary cause, which presupposes two potential outcomes for all members of the population.⁴

4 The discussion of the counterfactual model and propensity score matching in this paper follows the discussion presented in Morgan and Winship (2007). See also Winship and Morgan (1999), Winship and Sobel (2004), and Morgan and Harding (2006).

Let participation in cram schooling be a random variable, D, that is equal to 1 if the student participates in cram schooling and equal to 0 if not. In the language of experimental design, those who undertake cram schooling can be described as receiving the treatment, and those who do not are in the control group.

Since every student may or may not undertake cram schooling, every student has two potential outcomes, denoted Y¹ and Y⁰. Y¹ is the potential outcome for the individual student participating in cram schooling (D = 1) and Y⁰ is the potential outcome for the individual student with no cram schooling (D = 0). The causal effect of the treatment for every student is defined as δ = Y¹ – Y⁰, that is, the difference between the outcome if the student undertakes cram schooling and the outcome if the same student does not. This individual-level causal effect is not observable, since in reality we can observe only one of the two outcomes for each student. If we are willing to make certain assumptions about the joint distribution of Y¹, Y⁰, and D, then we can identify the average treatment effect on the whole population (ATE):
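The observability problem described here can be written compactly as a worked equation. The symbol Y for the observed outcome is an assumption introduced only for this illustration; it is not used elsewhere in this section.

```latex
Y = D\,Y^{1} + (1 - D)\,Y^{0}
```

For a student with D = 1 the observed outcome equals Y¹ and Y⁰ remains counterfactual; for a student with D = 0 the reverse holds, so δ = Y¹ – Y⁰ can never be computed for any single student.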

δ^ATE = E[Y¹ – Y⁰] = E[Y¹] – E[Y⁰]. (1)

The ATE is a commonly investigated causal effect; in an observational study it is estimated from a random sample of the population of interest. In an observational study, the researcher faces the situation that the part of the population taking the treatment may have quite different characteristics from the part that did not take the treatment. If the proportion of the population taking the treatment is π, then Equation (1) becomes

δ^ATE = {πE[Y¹| D = 1] + (1–π)E[Y¹| D = 0]} – {πE[Y⁰| D = 1] + (1–π)E[Y⁰| D = 0]}. (2)

Equation (2) can be rearranged and expressed as

δnaive = E[Y¹| D = 1] – E[Y⁰| D = 0]
= δ^ATE
+ {E[Y⁰| D = 1] – E[Y⁰| D = 0]}
+ (1–π){E[δ| D = 1] – E[δ| D = 0]}. (3)

Equation (3) shows that the naïve estimator, δnaive, an estimator often used in observational studies, is the sum of the true average treatment effect, δ^ATE, and two potential sources of bias. The first potential source of bias, E[Y⁰| D = 1] – E[Y⁰| D = 0], is a baseline bias. This bias exists because, in the absence of treatment, the average outcome of those in the treatment group would not be the same as that of those in the no-treatment group. The second source of bias, (1–π){E[δ| D = 1] – E[δ| D = 0]}, is a differential treatment effect bias. This bias exists because the expected treatment effect differs between those in the treatment group and those in the no-treatment group.
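The decomposition of the naïve estimator can be checked numerically. The sketch below uses made-up test scores for five hypothetical students and assumes, purely for illustration, that both potential outcomes are known; in real data only one of them is observed. All variable names are illustrative.

```python
from statistics import mean

# Hypothetical (Y1, Y0, D) triples; both potential outcomes are "known" here
# only because the numbers are invented for the illustration.
students = [
    (80, 70, 1),
    (75, 70, 1),
    (90, 75, 1),
    (60, 58, 0),
    (65, 60, 0),
]

treated = [s for s in students if s[2] == 1]
untreated = [s for s in students if s[2] == 0]
pi = len(treated) / len(students)  # proportion taking the treatment

# True average treatment effect over everyone
ate = mean(y1 - y0 for y1, y0, _ in students)

# Naive estimator: observed treated mean minus observed untreated mean
naive = mean(y1 for y1, _, _ in treated) - mean(y0 for _, y0, _ in untreated)

# Baseline bias: difference in untreated outcomes between the two groups
baseline_bias = mean(y0 for _, y0, _ in treated) - mean(y0 for _, y0, _ in untreated)

# Differential treatment effect bias
effect_gap = (mean(y1 - y0 for y1, y0, _ in treated)
              - mean(y1 - y0 for y1, y0, _ in untreated))
differential_bias = (1 - pi) * effect_gap

print(naive, ate + baseline_bias + differential_bias)
```

In this toy sample the two printed numbers agree up to floating-point rounding, and most of the gap between the naïve estimate and the ATE comes from the baseline term.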

Equation (2) also shows that δ^ATE is a weighted combination of two conditional average treatment effects: the average treatment effect on the treated (ATT) and the average treatment effect on the untreated (ATU). In the present case, ATT is the average treatment effect for those who typically undertake cram schooling for math and ATU is the average treatment effect for those who typically do not participate in cram schooling. ATT and ATU are defined respectively as

δ^ATT = E[Y¹ – Y⁰| D = 1] = E[Y¹| D = 1] – E[Y⁰| D = 1], (5)

δ^ATU = E[Y¹ – Y⁰| D = 0] = E[Y¹| D = 0] – E[Y⁰| D = 0]. (6)
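The weighted-combination claim follows from regrouping the four terms of Equation (2) by treatment status rather than by potential outcome. A short worked version, using the same symbols as the text:

```latex
\begin{aligned}
\delta^{ATE}
  &= \pi\,\{E[Y^{1}\mid D=1] - E[Y^{0}\mid D=1]\}
   + (1-\pi)\,\{E[Y^{1}\mid D=0] - E[Y^{0}\mid D=0]\} \\
  &= \pi\,\delta^{ATT} + (1-\pi)\,\delta^{ATU}
\end{aligned}
```

The weights are the population shares of the treated and the untreated, so the ATE lies between the ATT and the ATU.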

To obtain an unbiased and consistent estimate of ATU, one needs to assume that E[Y¹| D = 1] = E[Y¹| D = 0], i.e., that the expected outcome under treatment is the same for those in the treatment group and those in the no-treatment group. To obtain an unbiased and consistent estimate of ATT, one needs to assume that E[Y⁰| D = 1] equals E[Y⁰| D = 0], i.e., that there is no baseline difference between the treated group and the untreated group. In general, the assumption needed for unbiased and consistent estimation of ATT is less demanding than that for ATU. In the present research, the condition needed for estimating ATT is that those who participate in cram schooling would, on average, do no better and no worse without cram schooling than those who actually have no cram schooling. ATT is a commonly examined quantity of interest because, if there is no treatment effect on the treated, it is reasonable to assume that the same treatment would not benefit the untreated. In the present research, only if cram schooling has a significant impact on those undertaking it would it be necessary to consider the possible effect of extra learning outside of school on those who do not undertake cram schooling.

In an observational study, the assumption of no baseline difference, or of the same treatment effect for both the treated and the untreated group, is very unlikely to hold. However, if one is willing to assume that Y⁰ and Y¹ are independent of D conditional on a set of observable covariates X, then E[Y⁰| D = 1, X = x] = E[Y⁰| D = 0, X = x] and E[Y¹| D = 1, X = x] = E[Y¹| D = 0, X = x]. If these conditional independence assumptions are satisfied, then one can stratify the sample into subgroups defined by these covariates in order to estimate E[Y⁰| D = 1] and E[Y¹| D = 0]. Within each subgroup, the researcher selects a case from the no-treatment group to match a case from the treated group on the observed characteristics X and calculates the difference in the observed outcomes.
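The stratification idea can be sketched in a few lines. The example below assumes a single discrete covariate x and invented outcome values; the function name `stratified_att` and the data are hypothetical. It compares group means within each stratum rather than pairing individual cases, and it is not the estimator used in this research.

```python
from collections import defaultdict
from statistics import mean


def stratified_att(rows):
    """Estimate ATT from (x, d, y) rows by stratifying on a discrete covariate x.

    Within each stratum the untreated mean stands in for the missing Y0 of the
    treated; strata are then weighted by their number of treated cases.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, d, y in rows:
        strata[x][d].append(y)

    weighted_sum, n_treated = 0.0, 0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            continue  # no comparison is possible in this stratum
        diff = mean(groups[1]) - mean(groups[0])
        weighted_sum += diff * len(groups[1])
        n_treated += len(groups[1])

    if n_treated == 0:
        raise ValueError("no stratum contains both treated and untreated cases")
    return weighted_sum / n_treated


# Invented data: x is a stratum label, d the treatment indicator, y the outcome.
rows = [
    (0, 1, 70), (0, 0, 60), (0, 0, 62),
    (1, 1, 85), (1, 1, 87), (1, 0, 80),
]
att = stratified_att(rows)
print(att)
```

Strata with only treated or only untreated cases are skipped, which is the same lack-of-overlap problem that becomes severe once many covariates are used.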

In practice, satisfying the conditional independence assumptions requires a considerable number of observable covariates, and with a limited sample size the conventional method of stratifying and matching on all of these variables becomes impossible to implement. Paul R. Rosenbaum and Donald B. Rubin, in a set of papers, developed the method of propensity score matching and solved a variety of practical problems (see Morgan and Harding 2006). The method stipulates that the systematic differences between those who take the treatment and those who do not can be captured completely by a set of observed selection variables S, and that this conditional independence implies independence conditional on a specific function of S, called the propensity score. The propensity score is a one-dimensional summary measure of the probability of being treated and can be denoted Pr(D = 1|S). This probability lies between 0 and 1. The method removes the need to match individuals on the whole set of conditioning variables: it is enough to match the treated and the untreated on their propensity scores. One potential drawback of the method is that suitable matches may not exist for all treatment cases. Indeed, if the probability of being treated equals 1, it is not possible to find a counterfactual in the no-treatment group. In short, one can only estimate the treatment effect for matched cases.
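The point about treated cases without suitable matches can be illustrated with a simple overlap check. The sketch below uses one common convention, keeping only treated cases whose score falls inside the range where both groups are observed; this min/max rule, the function name, and the scores are assumptions for illustration, not a description of the procedure used in this research.

```python
def on_common_support(treated_scores, untreated_scores):
    """Flag treated cases whose propensity score lies in the region of overlap.

    The overlap region runs from the larger of the two group minima to the
    smaller of the two group maxima.
    """
    lower = max(min(treated_scores), min(untreated_scores))
    upper = min(max(treated_scores), max(untreated_scores))
    return [lower <= p <= upper for p in treated_scores]


# Invented propensity scores
treated_scores = [0.35, 0.60, 0.97]
untreated_scores = [0.10, 0.40, 0.70]
flags = on_common_support(treated_scores, untreated_scores)
print(flags)
```

The treated case with a score of 0.97 has no untreated case with a comparable score, so any estimate would apply only to the remaining treated cases.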

The actual procedure of propensity score matching is fairly straightforward (see Caliendo and Kopeinig 2008). First of all, the propensity score for each case in the sample is estimated with all covariates by a logit or probit regression model. The matching of propensity scores between the treated and untreated cases can be performed by one of the many matching algorithms available for several statistical packages. In general, these matching algorithms can be grouped into four types: exact matching, nearest-neighbor matching, interval matching, and kernel matching.
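The first step, estimating the propensity score with a logit model, can be sketched without any statistical package. The example below assumes a single covariate s and invented data, and fits the model by Newton-Raphson; the function names are hypothetical, and real applications would use many covariates and an established routine.

```python
import math


def propensity(a, b, s_value):
    """Logit propensity score Pr(D = 1 | S = s_value) for intercept a and slope b."""
    return 1.0 / (1.0 + math.exp(-(a + b * s_value)))


def fit_logit(s, d, iterations=25):
    """Fit a one-covariate logit model by Newton-Raphson; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [propensity(a, b, si) for si in s]
        # Gradient of the log-likelihood with respect to a and b
        g0 = sum(di - pi for di, pi in zip(d, p))
        g1 = sum((di - pi) * si for di, pi, si in zip(d, p, s))
        # Entries of the negative Hessian (a symmetric 2x2 matrix)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * si for wi, si in zip(w, s))
        h11 = sum(wi * si * si for wi, si in zip(w, s))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system by Cramer's rule
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Invented data: treatment becomes more likely as s increases.
s = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
d = [0, 0, 0, 1,  0, 0, 1, 1,  0, 1, 1, 1,  0, 1, 1, 1]
a, b = fit_logit(s, d)
scores = [propensity(a, b, si) for si in s]
print(a, b)
```

Each case then carries a single number, its estimated score, on which treated and untreated cases can be compared regardless of how many covariates entered the model.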

Morgan and Harding (2006: 34) suggested that nearest-neighbor caliper matching with replacement, interval matching, and kernel matching should be preferred to nearest-neighbor matching without replacement. They also recommended matching on both the propensity score and the Mahalanobis metric to achieve balance. After matching is performed, it is important to assess the matching quality. One simple way to do so is to perform a two-sample t-test for each covariate to check whether its sample mean differs significantly between the treatment group and the matched control group. After the matching quality is verified, one can estimate the specific average treatment effect of interest.
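A balance check of this kind can be sketched as follows. The example computes an unequal-variance (Welch) two-sample t statistic for one covariate; the choice of the Welch variant, the function name, and the covariate values are assumptions for illustration, and matching weights are ignored for simplicity.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(treated_x, control_x):
    """Two-sample t statistic for a difference in covariate means (Welch form)."""
    standard_error = sqrt(
        variance(treated_x) / len(treated_x)
        + variance(control_x) / len(control_x)
    )
    return (mean(treated_x) - mean(control_x)) / standard_error


# Invented covariate values for the treated and the matched control group
treated_x = [1, 2, 3, 4, 5]
control_x = [2, 3, 4, 5, 6]
t_stat = welch_t(treated_x, control_x)
print(t_stat)
```

A t statistic small in absolute value gives no evidence of a remaining difference in means on that covariate; the check would be repeated for every covariate used in the matching.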

For the present research, the propensity score matching is performed by “psmatch2,” a Stata 10 routine developed by Edwin Leuven and Barbara Sianesi (2003). The matching algorithm chosen is kernel matching with the Epanechnikov kernel, and the matching is on both the propensity score and the Mahalanobis metric to achieve balance.
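To make the idea of kernel matching concrete, the sketch below computes an ATT estimate with Epanechnikov kernel weights on the propensity score alone. It is not psmatch2 and does not reproduce its options or the additional Mahalanobis matching; the function names, the bandwidth of 0.1, and the data are all assumptions for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: positive weight only for |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kernel_att(treated, controls, bandwidth):
    """Kernel-matching ATT from lists of (propensity_score, outcome) pairs.

    Each treated case is compared with a weighted average of control outcomes,
    with weights that shrink as the propensity scores move apart.
    """
    effects = []
    for p_i, y_i in treated:
        weights = [epanechnikov((p_i - p_j) / bandwidth) for p_j, _ in controls]
        total = sum(weights)
        if total == 0.0:
            continue  # no control within the bandwidth: case is dropped
        counterfactual = sum(
            w * y_j for w, (_, y_j) in zip(weights, controls)
        ) / total
        effects.append(y_i - counterfactual)
    if not effects:
        raise ValueError("no treated case has a control within the bandwidth")
    return sum(effects) / len(effects)


# Invented (propensity score, outcome) pairs
treated = [(0.5, 10.0), (0.3, 8.0)]
controls = [(0.5, 6.0), (0.3, 5.0), (0.9, 100.0)]
att = kernel_att(treated, controls, bandwidth=0.1)
print(att)
```

The control with a score of 0.9 lies outside the bandwidth of both treated cases and therefore receives zero weight, so its extreme outcome does not affect the estimate.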
