Estimation Method - doi:10.6342/NTU201801706

doi:10.6342/NTU201801706

4.2 Estimation Method

There are numerous methods to estimate treatment effect at the threshold according to previous studies. They can be briefly summarized into the following two categories.

(A) Dimension Reduction

To estimate τ_FRD, one may consider compressing the information from multidimensional vector X into one dimension by taking a norm of it. For example, Reardon and Robinson (2012) consider `2

doi:10.6342/NTU201801706

In other words, if the sample falls into the treatment area, we attach a positive distance; otherwise, a negative distance is used. In this way, we may estimate a one-dimensional RDD model with outcome still being y_i, but running variable d_iwith threshold 0. Generally, one may consider other norms such as `1norm or maximum norm (`_∞norm). In such cases, the running variable di= (2zi−1)kXik should be introduced, where k · k is the norm one wishes to use. This method can be similarly generalized to higher dimensional case.

On the other hand, Wong, Steiner and Cook (2013) consider taking d_i= min{X_1i, X_2i}, which they named “centering approach”. In the same paper, Wong, Steiner and Cook also introduce “univariate method”, in which one simply focus on a single running variable, neglecting the other. Specifically, one discard observations with X1i< 0, or observations in the second and the thrid quadrant. The remaining observations then are decided to be treated or not solely according to X₂. Therefore, one dimensional RDD estimation strategies addressed in the previous section can be adopted with running variable being X2. Similarly, one can consider only the observations in the first and the second quadrant; in other words, one neglects those with X_2i< 0.

Nevertheless, we have to stress that by using dimension reduction method, we may falsely simplify the relation between X_1i and X_2i. For instance, if one wish to use `₂ norm to compress the running vector Xi, then one basically has to (implicitly) assume equal treatment effect, equal marginal effect of (X1i, X2i), as well as equal treatment probability for sample points with the same kXik2 in treatment group and control group respectively. Otherwise, misspecification would lead to a biased estimate.

(B) Local polynomial fitting

Instead of compressing existing information, Imbens and Zajonc (2011) consider directly fitting polynomials near the threshold. Specifically, they estimate the following regression models.

minα,β choose, and H a chosen bandwidth around the threshold according to the data (we will discuss the selection procedure later). The estimated coefficientα is then the desired τb _FRD. Graphically, similar to one dimensional case, they fit two different planes to treatment group and control group respectively.

doi:10.6342/NTU201801706

As this method uses the whole sample in estimation, we shall refer to it as union method (Lo, 2017) in the rest of this paper. However, as suggested before, since union method implicitly assumes homogeneous effect of running variables among the control group by locally fitting a single polynomial.

If heterogeneity among the control group exists, then as shown in Lo (2017), Hsu et al. (2018), and our simulation later, union method would produce a biased estimate.

To correctly estimate τFRDwhen heterogeneity is present, Lo and Hsu, Kuan and Lo note that since the bias of union method originates from neglecting the difference of sample points in the second, third, and fourth quadrants, one can simply tackle two quadrants, instead of four, at a time. Specifically, following their idea of dealing with sharp RDD, we may genearalize their method to fuzzy design by implementing the following three instrumental least square regression:

min

One can observe that in (4.2), (4.3), (4.4), we regress with sample points in the first and the second, the first and the third, and finally the first and the fourth quadrants, respectively. Each of the regression gives an unbiased estimation of τFRD (namelyαbi), and hence the unbiasedness of [τFRD.

Lo (2017) and Hsu et al. (2018) name the above estimation procedure average method. In this procedure, one uses all observations near the threshold to estimate τFRD. However, as three instrumental regressions are required, there is considerable computational burden. As a modification, Lo again addresses that as our main goal is to estimate the average treatment effect at the threshold (0, 0), one may consider only the observations in the first and the third quadrant since the two sets intersect at (0, 0). Following his advice, we can do only regression (4.3), and take [τ_FRD =αb₂. This simplified procedure is named intersection method in Lo’s work. As illustrated in figure 2, the only difference between union and intersection method is the scope where sample points are used in estimation.

It has been shown in Lo (2017) and Hsu et al. (2018) that for sharp RDD, intersection method and average method perform better over union method (and other dimension reduction methods) if heterogeneity among control group exists, in the sense that the former gives a unbiased estimate, while the latter does not. As for the comparison of the former two methods, although intersection method

doi:10.6342/NTU201801706

(a) Union Method (b) Intersection Method

Figure 2: Difference in sample points used in both methods. Only the observations in the shaded area are taken into the regression. Note that here we adopt an `1 bandwidth.

makes less computation burben, it neglects the observations from the second and the fourth quadrants.

This information loss leads to a higher standard error and mean square error (MSE) of the estimate compared to that acquired by average method.

Nevertheless, Lo (2017) and Hsu et al. (2018) only consider sharp design with local linear fitting.

As we will further elaborate in the simulation section, for fuzzy RDD, there are actually two sources of bias of union method. The first one, also appearing in sharp design, is heterogeneity of effects of running variables among the control group. To be specific, the (average) marginal effect of X_1i and X2ifor sample points in different quadrants in the control group may not be equal. Union method fits only one polynomial to the whole control group, lacking the flexibility to tackle heterogeneity, hence its poor performance. On the other hand, intersection and average method fits each quadrant separately, allowing a wider class of setting. In particular, the case when heterogeneity of marginal effects is absent is also tractable by the latter two methods.

The other one, which only can be detected in fuzzy design, is the difference of treatment probability in the control group. Even if X1iand X2ihave the same marginal effect on observations in differnt quadrants in the control group, different probabilities of exposure to treatment still contribute to heterogeneity. If the difference of probability of getting treated among quadrants in the control group is large, then the bias originated from misspecification would be amplified. It is also noteworthy that such difference would disappear in sharp design simply because all subjects in the control group are prohibited from being treated by definition.

doi:10.6342/NTU201801706

On the other hand, just as one dimensional case, one may also consider fitting higher order polyno-mial locally instead of fitting linear ones. More specifically, say, if we want to fit quadratic polynopolyno-mials, then we simply replace the linear loss functions in (4.1) to (4.4) with the following quadratic loss func-tion:

in other words, we include quadratic terms in regressions. As one may have observed, mere proceeding from linear fitting to quadratic fitting dramatically raises the number of coefficients to be estimated.

When there is only one running variable, raising the order of fitted polynomial simply means including higher order term of that running variable. For higher dimensional case, however, not only do we have to include higher order terms of the running variables, but also have to consider the interaction terms, which leads to a much more complex fitting procedure.

As for determining the order of fitting polynomial, aforementioned criterions such as AIC, defined in (3.7), can also be applied here. That is,

AIC = N ln(SSR/N ) + 2p. (3.7)

Specifically, the number of parameters in the model p = (d + 1)(d + 2) for the case where two running variables are present and the degree of fitted polynomial equals d.⁵ Eventually, the fitted model with the least AIC would be prefered.

In the simulation section, we shall go beyond Lo (2017) and Hsu et al. (2018), exploring whether the predominance of average method and intersection method persists in fuzzy RDD. On the other hand, we will also extend the estimation procedure by fitting quadratic polynomials instead of the linear ones. Before we move on to simulation, however, we shall conclude this section by briefly discussing the bandwidth selection procedures. Unfortunately, there is little research in choosing bandwidth for RDD in multidimensional case. Therefore we have to resort to dimension reduction methods. We call the adopted method separation method. We simply perform two one-dimensional bandwidth selection procedures, one for X1i, the other for X2i. In each process, we choose bandwidth with respect to one running variable, pretending that the other one does not exist. After getting h1, h2for X1iand X_2i, we keep only the sample points with |X_1i| < h₁ and |X_2i| < h₂. This process gives a rectangular bandwidth.

In fact, originally we have tried another method named norm method. Similar to dimension

5More generally, if there are n running variables and the degree of fitted polynomial is d, then p = 2Cn^d+n

doi:10.6342/NTU201801706

reduction method for RDD, we first standardize each argument in the covariate vector by dividing every component by its standard error. In other words, we introduce the standardized covariate vector Xsi= (X1i/s1, X2i/s2). By doing this, the bandwidth selection procedure would not be disturbed by the scale of X_1iand X_2i. Next we compress Xs_iby its norm, attaching a positive value if Xs_ifalls into the treatment area, negative value otherwise. More specifically, we introduce dsi = (2zi− 1)kXsik.

Note that since standard error is positive, Xi falls into the treatment area if and only if Xsi does.

Finally, we choose the bandwidth h using one dimensional approach (in our work, CCT), and take only the observations such that dsi≤ h into regressions. However, we have tried `1, `2, and `_∞ norm, and none of them gives credible estimate. We therefore mainly choose bandwidth according to separation method in the previous paragraph.

5 Simulation

5.1 Setup

To allow for different treatment probability as well as heterogeneous effect of Xi in the control groups, we consider the following four DGPs:

DGP 1: y_i = 5 + X_1i+ X_2i+ v_1i(5 + X_1i+ 0.5X_2i) + v_2i(5 + 2X_1i+ 2X_2i) + v_3i(5 + 3X_1i+ 4X_2i) + _i DGP 2: y_i = 5 + 5w_i+ X_1i+ w_iX_1i+ X_2i+ 0.5w_iX_2i+ _i

DGP 3: yi = 5 + 5wi+ X1i+ wiX1i+ 0.3w1iX1i+ X2i+ 0.5wiX2i+ 0.3w2iX2i+ i. DGP 4: yi = 5 + 5wi+ X1i+ wiX1i+ 0.3w1iX1i+ X2i+ 0.5wiX2i+ 0.3w2iX2i

+ X_1i² + 3X_2i² + wiX_1i² + wiX_2i² + i.

We generate 500 samples with size 5000 for each DGP.

Xi= (X1i, X2i) are i.i.d. multivariate normal vectors following the distribution:

(X_1i, X_2i)

where ϕ is the quantile function of standard normal variables. Note that instead of setting the mean of X1i, X2i to be 0 (symmetric around the threshold), we adopt an asymmetric design, with respectively 40% of observations passing the threshold for each running variable. We do this because typically, subjects are not symmetrically distributed around the threshold; most of the time, observations in need of treatment is minority (compared to the whole population). Moreover, we follow Lo (2017)

doi:10.6342/NTU201801706

and consider (s1, s2) being (1, 1), (3, 3), (10, 10), (3, 1), (10, 1), (10, 3). On the other hand, i are i.i.d.

standard normal random variables.

In DGP 1, we try to explore whether different marginal effect of X1i and X2i in the control group really makes a difference. The three variables v1i, v2i, v3iare defined as follows:

v_1i=1(u_i ≤ Φ(a)) · 1(X_1i< 0 and X_2i< 0) v_2i=1(u_i ≤ Φ(a)) · 1(X1iX_2i< 0)

v3i=1(ui ≤ Φ(b)) · 1(X1i≥ 0 and X2i≥ 0),

where u_i^iid^∼ N (0, 1) and Φ(.) is the cumulative distribution function of standard normal variable. By v1i, v2i, v3i, we set the treatment probability in the control group and the treatment group to be a and b respectively; however, the treatment effect in the third quadrant is 5 + X1i+ 0.5X2i, while the effect is 5 + 2X_1i+ 2X_2i in the second and the fourth quadrant. In the simulation, we have considered (a, b) = (0, 0.7), (0.15, 0.85), (0.3, 1), (0, 0.9), (0.05, 0.95), (0.1, 1). Moreover, in this DGP, we define the binary variable wi indicating whether the subject receives treatment or not to be wi= v1i+ v2i+ v3i. By DGP 2, we inspect the impact of different treatment probability in the control group. The treatment effect is the same for all subjects, which is 5 + X1i+ 0.5X2i. However, wi is defined as follows:

wi=1(ui≤ Φ(a)) · 1(X1i< 0 and X2i< 0) + 1(ui≤ Φ(b)) · 1(X1i≥ 0 and X2i< 0) +1(ui≤ Φ(b)) · 1(X1i< 0 and X2i≥ 0) + 1(ui≤ Φ(c)) · 1(X1i≥ 0 and X2i≥ 0),

where again ui

iid

∼ N (0, 1). In this way, the treatment probability in the third quadrant, a, would be different from that in the second and the fourth quadrant, b, thus introducing heterogeneity in the control group.

For DGP 3 and DGP 4, we examine the cases where both forces of heterogeneity in the control group are present. The three variables wi, w1i, w2iare of key importance in inducing different treatment effect and different treatment probability among the control group. They are defined as follows:

w_i=1{u_i≤ Φ(a)} · 1{X_1i< 0 or X_2i< 0} + 1{u_i≤ Φ(b)} · 1{X_1i≥ 0 and X_2i≥ 0}

w_1i=1{u_i≤ Φ(a)} · 1{X1i< 0} + 1{u_i≤ Φ(b)} · 1{X1i≥ 0}

w2i=1{ui≤ Φ(a)} · 1{X2i< 0} + 1{ui≤ Φ(b)} · 1{X2i≥ 0},

where u_i^iid^∼ N (0, 1) and Φ(.) is the cumulative distribution function of standard normal variable. By

doi:10.6342/NTU201801706

wi we have created different treatment probability between observations in the first quadrant (namely a) and the others (b). In the simulation, we consider (a, b) to be (0.15,0.85) or (0.05,0.95) to investigate whether the scale of the difference in treatment probability plays a role in estimation. The assignment rule is further complexified by w_1i and w_2i, which introduce difference around the y-axis and x-axis, respectively. In short, the distribution of (wi, w1i, w2i) in each quadrant can be summarized in figure 3, which leads to the treatment effect assignment schedule of DGP 3 in figure 4⁶. Note that the only difference of DGP 3 and DGP 4 is the presence of quadratic terms. for both DGPs, the desired local treatment effect at the threshold (0, 0) equals 5 (the coefficient of wi).

For all the four DGPs, the desired local treatment effect at the threshold (0, 0) equals 5. For the first two DGPs, we have used linear polynomials to do fitting, while for the latter two DGPs, we have considered locally fitting linear and quadratic polynomials, respectively. We also do a quadratic fitting to DGP 3 to observe the effect of redundant regressors. For bandwidth selection we follow CCT. As aforementioned, since CCT is originally designed for one-dimensional RDD, we have used separation method to adopt it in two-dimensional case. After the bandwidth is chosen, we conduct an instrumental regression with triangular kernel, which is κ(x) = (1 − |x|)1(|x| ≤ 1).

Figure 3: Distribution of (Wi, W1i, W2i) in the four quadrants for DGP 3 and DGP 4. The proportion in the parenthesis is the probability for each result.

6For brevity, corresponding figure for DGP 4 is postponed to the appendix.

doi:10.6342/NTU201801706

Figure 4: Distribution of treatment effect assignment in the four quadrants for DGP 3. The proportion in the parenthesis is the probability for each result. Note there is an underlying trend of 5 + X1i+ X2i

for all the observations.

5.2 Results

For each choice of (s1, s2) and (a, b, c), we report the mean, the standard error and the mean square error (MSE) of estimate of union, intersection, and average method⁷. Moreover, we report the number of observations left in the bandwidth (out of total size 5000).

As the main result, we observe that intersection and average method provide a fairly unbiased estimate, whereas union method is less accurate. The bias is further exaggerated as (s₁, s₂) become larger. On the other hand, the standard deviation of estimates by intersection method and average method estimation is slightly bigger than that of union method. This happens due to less remained sample points in intersection method and the more complex estimation process for average method.

An inspection into the result of DGP 1 shows that different marginal effects of X1i and X2i do contribute to the bias in union method. The parameter controlling the scale of heterogeneity in DGP 1 is a. When a is small, even if the treatment effect is largely different for subjects lying in different quadrants, the effect of heterogeneity remains insignificant since not many observations in the control group actually receive treatment. As one may observe from table 2 to table 7, union method produces the largest bias when (a, b) = (0.3, 1) (Table 4). On the other hand, in the extreme case a = 0 (Table 2 and 5), actually there is no heterogeneity in the control group since nobody gets treated. In this case, union method still produces a favorable result.

On the other hand, we observe that the overall standard deviation gets larger as b − a shrinks.

This happens since b − a, the difference in treatment probability, is the denominator of τ_FRD. One may imagine that as b − a approachs 0, the estimate would be highly unstable. Numerically, this

7Other dimension reduction methods have been shown to produce a biased result even for sharp RDD in Lo; we therefore focus on the comparison of these three methods.

doi:10.6342/NTU201801706

corresponds to the case where the instrumental variable is weak. From a differenct perspective, b − a is the proportion of compliers at (0, 0). Small b − a means there are less compliers, hence the difficulty of estimating the local treatment effect for them.

DGP 2 shows that the impact of different treatment probability. We observe that even if the treatment effect is the same among the control group, difference in treatment probability does affect the accuracy of union method. Here the parameter controlling the heterogeneity is b − a. Holding a, c − b fixed, one can see that the bias of union method rises as b − a gets larger as in Table 8 to 10; on the contrary, intersection and average method produces a much more accurate estimate.

Interestingly, when (a, b, c) = (0.15, 0.75, 0.85), not only union method, but also average method performs poorly (Table 11). This happens since c − b is small. Recall that in average method, we do three local IV regressions respectively. However, as the difference between the first and the second quadrant, as well as the difference between the first and the fourth quadrant is not prominent, the estimator coming from regressions concerning those quadrants (namely αb₁ and αb₃ in (4.2) and (4.4)) would be highly unstable, thus the poor performance of average method. In contrast, since the difference of treatment probability between observations in the first and the third quadrant, c − a, is large enough, intersection method still produces a favorable result in this case.

Out of 5000 sample points, around 20% to 42% remain in the bandwidth. The proportion is highly related to the order of fitted polynomial. For instance, when we try to fit data generated from DGP 3 with quadratic polynomials, there are approximately 1.6 times observations left compared to when we fit with linear polynomials. The number of left observations is relatively stable. On the other hand, we have also documented the mean and standard error of the standardized chosen bandwidth (that is, the chosen bandwidth h_j divided by the standard deviation of the corresponding running variable X_ji, or Xji/sj). The result depends basically on the order of polynomial used to fit. If one uses higher order polynomial, then the standardized chosen bandwidth would be larger. It is also worth noting that if s₁ > s₂, then the standardized chosen bandwidth for X_1i (h₁/s₁) is smaller than h₂/s₂ on average.

For brevity, we report the result of chosen bandwidths for DGP 3, (a, b) = (0.15, 0.85), with linear fitting and quadratic fitting respectively.

DGP 3 and DGP 4 gives the result when both forces of heterogeneity are present. The overall trend for DGP 3 and DGP 4 remains the same when b − a rises from 0.7 to 0.9. Moreover, the bias of union method becomes even larger as b − a grows up. This is due to the fact that b − a characterize the difference in treatment probability in the control group. If b − a gets larger, then the difference in treatment effect and probability inflates, amplifing the heterogeneity in the control group. This again establishs the fact that intersection and average method can tackle the heterogeneity in the control group, hence its much better performance.

doi:10.6342/NTU201801706

For DGP 3, we have tried both linear and quadratic fitting. One can observe that bias of union method slightly reduces (but the estimate itself is still biased) when one uses quadratic fitting. On the other hand, the estimate of intersection and average method remains more accurate, but has a higher standard deviation, which is a result of redundant explanatory variables, namely the quadratic terms.

Lastly, for the comparison of average method and intersection method, one can observe that in most cases, the standard deviation of estimates from intersection method is larger than that from average method. This may be a consequence of neglecting sample points in the second and the fourth quadrants in intersection method.

6 Conclusion

In this thesis, we review the basic concept and assumptions needed for multidimensional fuzzy

在文檔中多維模糊斷點迴歸設計下的新估計方法 (頁 21-41)