New Estimation Method for Multidimensional Fuzzy Regression Discontinuity Design

(多維模糊斷點迴歸設計下的新估計方法)

Master Thesis

Department of Economics, College of Social Sciences

National Taiwan University

Chen Lin (林真)

Advisor: Chung-Ming Kuan, Ph.D. (管中閔)

July 2018


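Acknowledgements

For the successful completion of this thesis, I must first thank my advisor, Professor Chung-Ming Kuan, who gave me a great deal of valuable advice while I was exploring research directions. I still remember the many midnights spent discussing ideas back and forth with him over messaging software. Whether I was trying to deepen my technical knowledge or, as a first-time thesis writer, struggling with rules I did not know, he guided and instructed me with tireless patience, and helped me realize that I too was capable of completing an entire piece of research.

I am also very grateful to the members of my oral defense committee: Professor 江淳芳, Professor 許育進 and Professor 楊子霆. Besides listening patiently to my presentation, they pointed out many parts of this thesis and of my oral delivery that needed correction and strengthening, bringing the work closer to perfection.

In addition, I would like to thank Professor 陳釗而, Professor 瀧本太郎 of Kyushu University, Japan, and their advisees, for allowing me to present an early version of this thesis at the joint workshop convened mainly by the two of them. Preparing for that presentation made me examine the logic and structure of the whole thesis carefully.

As Professor Kuan told me when he agreed to advise me, doing research and writing a thesis is arduous and lonely work. Yet whenever I doubted myself in the still of the night, the support of my teachers, family, colleagues and friends was always on my mind. Every piece of advice, every comment, and even every word of encouragement together form the foundation of this work. Without everyone's unconditional support, this thesis could never have come into being.

Chen Lin, Graduate Institute of Economics, National Taiwan University, July 18, 2018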


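Abstract (Chinese version)

When whether a policy or treatment is implemented depends on whether subjects pass certain specific criteria, researchers can use regression discontinuity design (RDD) to obtain an unbiased estimate of the local average treatment effect (LATE). In this thesis, we review the concepts and assumptions of multidimensional fuzzy RDD, in which treatment depends on multiple criteria and subjects do not necessarily comply with the assignment to receive or not receive treatment. The first contribution of this thesis is to generalize the ideas in Lo (2017) and Hsu, Kuan and Lo (2018) and to point out that traditional estimation methods do not account for potential heterogeneity in the data and may therefore yield biased estimates. In addition, we identify two potential sources of heterogeneity: different marginal effects of the running variables (assignment variables) and different probabilities of receiving treatment. We accordingly propose the average method and the intersection method for multidimensional fuzzy RDD, which overcome heterogeneity in the data. In simulations, we find that the proposed methods indeed estimate the treatment effect more accurately than traditional estimators, showing that our methods can be applied in more general settings.

Keywords: heterogeneity, local average treatment effect (LATE), local polynomial regression, regression discontinuity design, two-stage least squares (2SLS).

JEL classification: C21, C26, C90.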


Abstract

Regression discontinuity design (RDD) is a simple yet rigorous setting that allows researchers to estimate the local average treatment effect without bias, particularly when treatment is determined by whether subjects pass certain pre-specified thresholds. In this thesis, we review the basic concepts and assumptions of multidimensional fuzzy RDD, in which there are multiple thresholds and not all subjects are required to follow the assignment rule. As our first contribution, we generalize the idea in Lo (2017) and Hsu, Kuan and Lo (2018), pointing out that traditional estimation methods fail to take potential heterogeneity in the dataset into account and hence may produce biased estimates. In addition, we identify two potential sources of heterogeneity: different marginal effects of the running variables and different treatment probabilities. With this in mind, we propose the average method and the intersection method for multidimensional fuzzy RDD, both of which overcome potential heterogeneity in the dataset. In the simulation study, we find that our methods do produce more accurate estimates than traditional methods, showing that they can accommodate much more general settings.

Keywords: Heterogeneity, Local average treatment effect, Local polynomial regression, Regression discontinuity design (RDD), Two-stage least squares estimation (2SLS)

JEL classification: C21, C26, C90


doi:10.6342/NTU201801706

List of Figures

1 An illustration of one dimensional sharp RDD

2 Difference in sample points used in both methods

3 Distribution of (W_i, W_1i, W_2i) in the four quadrants for DGP 3 and DGP 4

4 Distribution of treatment effect assignment in the four quadrants for DGP 3

5 Distribution of treatment assignment in the four quadrants for DGP 4


List of Tables

1 Classification of individuals

2 Simulation Result: DGP 1 with local linear fitting and (a, b) = (0, 0.7)

3 Simulation Result: DGP 1 with local linear fitting and (a, b) = (0.15, 0.85)

4 Simulation Result: DGP 1 with local linear fitting and (a, b) = (0.3, 1)

5 Simulation Result: DGP 1 with local linear fitting and (a, b) = (0, 0.9)

7 Simulation Result: DGP 1 with local linear fitting and (a, b) = (0.1, 1)

8 Simulation Result: DGP 2 with local linear fitting and (a, b, c) = (0.15, 0.65, 1)

9 Simulation Result: DGP 2 with local linear fitting and (a, b, c) = (0.15, 0.5, 0.85)

10 Simulation Result: DGP 2 with local linear fitting and (a, b, c) = (0.15, 0.75, 1)

11 Simulation Result: DGP 2 with local linear fitting and (a, b, c) = (0.15, 0.75, 0.85)

14 Simulation Result: DGP 3 with local quadratic fitting and (a, b) = (0.15, 0.85)

18 Statistics of standardized chosen bandwidth: DGP 3 with local linear fitting and (a, b) = (0.15, 0.85)

19 Statistics of standardized chosen bandwidth: DGP 3 with local quadratic fitting and (a, b) = (0.15, 0.85)


1 Introduction

After a new treatment has been introduced, it is common for policy makers or researchers to query whether the treatment is indeed effective or not; if possible, one may even wish to estimate the effect quantitatively. In fact, besides econometricians, many researchers in fields including medicine, pedagogy, and even politics have dug into treatment effect estimation. Among numerous designs aiming at this problem, regression discontinuity design (RDD) has gained much attention recently.

Originally proposed by Thistlethwaite and Campbell (1960), RDD is a convenient way to estimate the local average treatment effect when the assignment rule is known to the researchers. More specifically, in an RDD, whether a subject is eligible for treatment depends on whether it passes some pre-specified threshold. For example, students may be required to take additional math classes if they fail an exam; patients with, say, high blood pressure or cholesterol may be diagnosed with a certain disease and call for certain kinds of treatment. Despite its potential, RDD did not begin to receive wide attention until four decades after its introduction, when Hahn, Todd, and Van der Klaauw (2001) formalized the setting in the language of the Rubin causal model (Rubin, 1974)¹. In fact, RDD can be adopted in various fields including sociology (Hahn, Todd and Van der Klaauw, 1999), politics (Eggers, Fowler, Hainmueller, Hall, and Snyder, 2015) and epidemiology (Bor, Moscoe, Mutevedzi, Newell and Bärnighausen, 2014).

The main advantage of RDD is its simplicity. The basic idea is to exploit the continuity of the outcome variable of interest. One simply fits polynomials to the sample points just above and just below the threshold separately, and then attributes the difference in intercepts at the cutoff to the local treatment effect at the threshold for those who follow the rule. In addition, when randomized experiments are not available, RDD may serve as an alternative, especially in medical studies. As mentioned in Moscoe, Bor and Bärnighausen (2015), if a treatment has been universally accepted as indispensable in medical care, it would be hard to conduct randomized experiments. In this case, researchers can use data from medical records for inference. Even when a randomized experiment is feasible, using previously collected data saves time and cost.

Still another merit of RDD is its flexibility. Although treatment is assigned according to whether the observations pass a certain cutoff, we do not require all subjects to follow the rule. Specifically, if the assignment rule is strictly enforced, one faces a sharp RDD; otherwise one has to resort to a fuzzy RDD, which in fact includes the former as a special case. Therefore, we shall focus on fuzzy RDD, in which one may use whether the subject passes the cutoff as an instrument for whether it actually receives treatment, and thereby obtain an unbiased estimate of the desired local average treatment effect for those who comply with the rule at the cutoff.

¹ For a more detailed review of the history of RDD in the second half of the twentieth century, see Cook (2008).

In the pioneering work of Thistlethwaite and Campbell (1960), as well as in many studies following its methodology, treatment depends on a single standard. In reality, however, there are also scenarios where multiple standards are present; see, for example, Jacob and Lefgren (2004). Many works have hence considered RDD with multiple assignment variables and thresholds, or multidimensional RDD; see Imbens and Zajonc (2011) or Wong, Steiner and Cook (2013). When there is only one standard, it is natural to separate the observations according to whether they pass the threshold; if there are, say, two standards, however, one cannot be sure whether those passing no, one, or two standards have similar characteristics. Naïvely putting heterogeneous subjects in the same group would lead to a biased estimate. Regrettably, as indicated in Lo (2017) and Hsu, Kuan and Lo (2018), most existing estimation methods fail to take heterogeneity in the dataset into consideration and thus produce biased estimates. In contrast, Lo (2017) and Hsu et al. (2018) use the information contained in the observations carefully, proposing the intersection method and the average method for sharp RDD, both of which are flexible enough to tackle heterogeneity in the dataset.

This thesis aims to go beyond Lo (2017) and Hsu et al. (2018) by generalizing their idea from sharp RDD to fuzzy RDD. We show that heterogeneity in the dataset may result from different marginal effects of the running variables and/or different treatment probabilities, the latter of which can arise only in fuzzy RDD. To be specific, suppose students are required to attend additional classes if they fail their reading exam, their math exam, or both, and we are interested in the effect of those additional classes on these students' scores on another exam two months later. It may seem intuitive to place those who fail one subject and those who fail both in the same group, since both are more likely to attend additional classes. However, there is no guarantee that the average marginal effect of the original reading (or math) score on performance two months later is the same for both kinds of students. Worse, if the additional classes are not compulsory, students failing one subject may have a different attendance rate from those failing two subjects, which creates heterogeneity among students who are eligible for treatment and potential bias in traditional estimation methods.

To overcome this problem, we modify and generalize the idea in Lo (2017) and Hsu et al. (2018) to accommodate fuzzy RDD. Taking the case of two standards as an example, we treat observations passing no, one and two standards differently. By comparing subjects with different numbers of passed standards separately and averaging the results, we obtain an unbiased estimate of the local average treatment effect. We shall refer to this procedure as the average method. Alternatively, to lessen the computational burden, we can drop observations which pass exactly one cutoff and compare the remaining subjects to get another unbiased estimate. The latter procedure, which we name the intersection method, is easier to compute but in most cases suffers from a larger standard deviation due to information loss.

According to our simulation studies, the two aforementioned sources of heterogeneity do result in less accurate estimates when traditional estimation techniques are used. In contrast, the intersection method and the average method still deliver unbiased estimates even when the dataset is heterogeneous. Moreover, as Lo (2017) and Hsu et al. (2018) mainly consider local linear fitting, we explore whether local quadratic fitting can generate a more desirable estimate. We find that quadratic fitting also gives unbiased estimates, meaning that one can use higher-order polynomials to allow for more flexibility. Hopefully, this work can contribute to the emerging research on multidimensional RDD, calling attention to scrutinizing observations carefully and using the data at hand more wisely in order to accommodate more general settings.

The rest of this paper is arranged as follows: in section 2, we shall introduce the basic ideas and aims in RDD. In section 3 and section 4, we will formalize one dimensional RDD and multidimensional RDD respectively, presenting assumptions needed, as well as identifying the desired local average treatment effect at the threshold and introducing estimation methods. We will verify our argument by simulation in section 5. Finally, we will conclude our discussion in section 6.

2 Regression Discontinuity Design

Treatment effect estimation has always been of central importance not only in social science, but also in many other fields such as medicine and pedagogy. Doctors, for example, may be curious about whether a new medical treatment is effective. Teachers may want to know whether a new teaching program really helps students' test performance. To estimate a treatment effect, we would ideally like to conduct a randomized experiment, in which individuals are randomly assigned to be treated or not; a comparison between those treated (the treatment group) and those not treated (the control group) then gives the desired treatment effect. In reality, however, randomized experiments are not always feasible, or are simply too costly. For example, in medical applications, random assignment may leave some patients ineligible for treatment, which is sometimes controversial.

When randomized experiments cannot be done, researchers turn to quasi-experimental designs, in which subjects are assigned to treatment not at random, but at the researchers' discretion. For instance, to evaluate whether a new teaching program is effective, researchers may implement it in one class as the treatment group and take another class as the control group. Unlike in randomized experiments, students are not always divided into classes randomly. In other words, there may be other underlying factors affecting the outcome. For example, students in the treatment group may already have been doing better on tests than those in the control group. In such a case, even if students in the treatment group achieve better scores on exams, it may be hard to attribute their better performance to the new program.

Generally speaking, in a quasi-experimental design, if there exist underlying factors that affect the outcome differently across groups, that is, if selection bias exists, the two groups are not comparable, which leads to a biased result. Numerous methods have been proposed to overcome selection bias according to the treatment assignment rule or features of the data, such as difference in differences and propensity score matching, to name just a few. If the treatment assignment rule consists of thresholds on a set of observable factors, then regression discontinuity design (RDD) may serve as an alternative method to correctly identify the treatment effect.

RDD was first introduced by Thistlethwaite and Campbell (1960) to evaluate a scholarship program. In an RDD, whether an individual is eligible for treatment is (partly) determined by pre-specified thresholds. Individuals have a higher tendency to receive treatment if they pass these thresholds². We define the dimension of an RDD by the number of thresholds. If treatment depends on a single measure, the RDD is one-dimensional; otherwise it is a multidimensional RDD. In this work, we shall focus on the latter, more general case. In fact, as we will explain further, multidimensional RDD closely resembles its one-dimensional counterpart.

In reality, however, unless the assignment is enforced by law or regulation, there is no guarantee that every subject will follow the assignment rule. Depending on whether the treatment assignment rule is perfectly followed, RDD can be categorized into two types. The first is sharp RDD, in which all individuals passing the threshold receive treatment, while those failing to pass do not. The other is fuzzy RDD, in which some individuals do not follow the assignment rule. In most empirical studies we do not observe perfect compliance with the assignment rule, and hence face a fuzzy RDD. Moreover, as sharp RDD can be seen as a special case of fuzzy RDD, we put our emphasis on fuzzy RDD.

The main idea of RDD is that observations just around the threshold are nearly the same except for their treatment status. In other words, the two groups are comparable and free of selection bias. Therefore, under a proper continuity assumption, we may estimate the outcome just below and just above the threshold, and then attribute the difference between the two estimates to the treatment. In this way, we may analyze the average treatment effect at the threshold.

However, as mentioned before, in fuzzy RDD we allow some subjects not to comply with the assignment rule. We may group individuals according to whether they follow the rule. An individual who follows the rule, that is, who receives treatment if he passes the threshold and does not receive treatment if he does not pass, is called a complier. In contrast, an individual who does not get treated if he passes the threshold and gets treated if he does not pass is called a defier. An individual who always receives treatment whether or not he passes the threshold is called an always-taker; the last kind of individual, who never receives treatment even if he passes the threshold, is called a never-taker. We summarize the four kinds of individuals in Table 1.

² Conceptually, our discussion may also be applied to the opposite scenario, in which individuals receive treatment if they fail to pass the thresholds. For convenience and coherence, we will assume the former case.

Table 1: Classification of individuals.

                Passes the threshold    Does not pass the threshold
Complier        O                       X
Always-taker    O                       O
Never-taker     X                       X
Defier          X                       O

a O means treated, while X means not treated.

b The treatment assignment rule is assumed to give the treatment when the individual passes the threshold.

As always-takers and never-takers exhibit the same behavior whether or not they pass the threshold, we cannot identify the treatment effect for them. Conceptually, we cannot observe, or even approximate, their outcome under the opposite treatment status, since the data contain no such information. On the other hand, defiers, who deliberately act against the assignment rule, are empirically very rare. Therefore, we impose the condition that there are no defiers (the no-defier condition). What we are trying to estimate, as a result, is the average treatment effect for the compliers. To sum up the above discussion, we aim to estimate the local average treatment effect for the compliers at the threshold.

In the following section, we formalize RDD and illustrate how to estimate the desired parameter. We start with the case where only one standard is present (one-dimensional RDD), and then move on to the more complex cases.

3 1-dim Fuzzy RDD

3.1 Problem Formulation

In (fuzzy) RDD, treatment assignment depends on a pre-specified criterion. A sample point has a higher probability of being treated if its observable covariate x (the running variable, or assignment variable) exceeds a known threshold value. Such a threshold may be determined by regulations or by a rule of thumb. Without loss of generality, we set this threshold value to 0 in this paper unless otherwise specified.

Researchers also observe the outcome variable of interest y, and whether or not the sample point actually receives treatment, recorded in a binary variable w. For instance, Almond, Doyle, Kowalski and Williams (2011) quantitatively estimate the effect of intensive care on very-low-birth-weight newborns. In this case, the outcome variable y is the one-year mortality of the newborn (or medical expenses), and the running variable x is the weight of the newborn, with the known very-low-birth-weight threshold of 1,500 grams.

In the language of the Rubin causal model (Rubin, 1974), we may write the data generating process as follows:

$$y_i = y_i(1) w_i + y_i(0)(1 - w_i) = y_i(0) + w_i \bigl(y_i(1) - y_i(0)\bigr), \tag{3.1}$$

where $y_i(1)$ and $y_i(0)$ give the outcome with and without treatment, respectively, and $w_i$ is the indicator of treatment status. Since the treatment status $w_i$ is (at least partially) determined by the covariate $x_i$, we introduce another variable indicating whether $x_i$ exceeds the threshold value, namely $z_i = \mathbf{1}(x_i \ge 0)$.³ If $z_i = 1$, we say that the individual is assigned to the treatment group; otherwise, it is assigned to the control group. Then $w_i$ is determined through the following mechanism:

$$w_i = w_i(1) z_i + w_i(0)(1 - z_i) = w_i(0) + z_i \bigl(w_i(1) - w_i(0)\bigr), \tag{3.2}$$

where $w_i(1)$ and $w_i(0)$ are binary variables indicating the treatment status when the subject is assigned to the treatment group and the control group, respectively.

As mentioned before, in a fuzzy RDD we do not require all individuals to follow the treatment assignment rule. There may be some sample points in the treatment group that do not receive treatment, and there may also be observations in the control group that do receive treatment. What we really want to find is the local average treatment effect for those who truly follow the assignment rule, that is, the compliers. According to whether they follow the assignment, the samples can be divided into the four groups in Table 1. Using the notation of the last paragraph, these groups can be defined as follows:

Definition 3.1 (Classification of individuals).

Each observation falls into exactly one of the following four groups, depending on whether it follows the assignment rule.

1. Complier: $w_i(0) = 0$, $w_i(1) = 1$
2. Always-taker: $w_i(0) = 1$, $w_i(1) = 1$
3. Never-taker: $w_i(0) = 0$, $w_i(1) = 0$
4. Defier: $w_i(0) = 1$, $w_i(1) = 0$

³ Note that we have assumed that subjects have a higher tendency to get treated if they pass the threshold. If one assumes the opposite case, then one should define $z_i = \mathbf{1}(x_i \le 0)$.

In other words, compliers are those who follow the assignment rule perfectly, while always-takers and never-takers respectively always receive or never receive treatment, whichever group they fall in. For identification, we impose the no-defier assumption:

Assumption 3.1 (No-defier Assumption). $w_i(\cdot)$ is a non-decreasing function.
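As a minimal sketch of Definition 3.1, equations (3.1) and (3.2), and Assumption 3.1, the following Python snippet maps a pair of potential treatment statuses to a compliance type and checks the no-defier condition. The function names and the small sample are hypothetical and serve only as an illustration; they are not part of the thesis.

```python
from collections import Counter

# Compliance types keyed by (w_i(0), w_i(1)), following Definition 3.1.
TYPES = {
    (0, 1): "complier",
    (1, 1): "always-taker",
    (0, 0): "never-taker",
    (1, 0): "defier",
}


def classify(w0: int, w1: int) -> str:
    """Return the compliance type for potential treatment statuses (w0, w1)."""
    return TYPES[(w0, w1)]


def satisfies_no_defier(pairs) -> bool:
    """Assumption 3.1: w_i(.) is non-decreasing, i.e. w_i(0) <= w_i(1) for all i."""
    return all(w0 <= w1 for w0, w1 in pairs)


def observed_treatment(w0: int, w1: int, z: int) -> int:
    """Equation (3.2): w_i = w_i(1) z_i + w_i(0) (1 - z_i)."""
    return w1 * z + w0 * (1 - z)


def observed_outcome(y0: float, y1: float, w: int) -> float:
    """Equation (3.1): y_i = y_i(1) w_i + y_i(0) (1 - w_i)."""
    return y1 * w + y0 * (1 - w)


# Hypothetical potential treatment statuses (illustrative values only).
sample = [(0, 1), (0, 1), (1, 1), (0, 0), (0, 1)]
counts = Counter(classify(w0, w1) for w0, w1 in sample)
```

In real data only one of $w_i(0)$ and $w_i(1)$ is observed for each subject, so this classification is a conceptual device rather than something that can be computed from a sample.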

In fuzzy RDD, we allow the presence of always-takers and never-takers. All we need is that the treatment probabilities of the two groups differ, at least for observations near the threshold, that is, those whose covariate $x$ satisfies $|x| < \varepsilon$, where $\varepsilon$ is a small positive number. Specifically, we make the following assumption:

Assumption 3.2 (Different treatment probabilities).

$$0 \le \lim_{\varepsilon \to 0} E(w_i \mid z_i = 0, |x_i| = \varepsilon) < \lim_{\varepsilon \to 0} E(w_i \mid z_i = 1, |x_i| = \varepsilon) \le 1.$$

That is, the probability of receiving treatment just above the threshold is strictly higher than that just below the threshold. The covariate $x$ has a partial, but not necessarily full, impact on receiving treatment. It is also worth noting that, given Assumption 3.1, Assumption 3.2 is equivalent to the existence of compliers.

In the special case that E(wi|z_i = 1) = 1 and E(wi|z_i = 0) = 0, whether or not an individual receives treatment totally depends on which group it lies in. In other words, every observation follows the assignment rule perfectly, or more simply, all observations are compliers. In this case, fuzzy RDD reduces to sharp RDD. In many empirical studies, if a treatment is compulsory by law, there should be nearly no ambiguity of receiving the treatment or not, therefore a sharp RDD can be applied.

3.2 Identification

Intuitively, sample points with running variable just below and just above the threshold should be nearly the same except for their treatment status. To make the treatment group and the control group comparable, no factor other than the treatment should affect the outcome discontinuously at the threshold. This condition can be characterized by the following assumption:

Assumption 3.3 (Continuity Assumption).

$E(y_i(1) \mid x_i)$, $E(y_i(0) \mid x_i)$, $E(w_i(1) \mid x_i)$ and $E(w_i(0) \mid x_i)$ are continuous at $x_i = 0$.

In other words, by the continuity assumption on $y_i(\cdot)$, any discontinuity observed around the threshold can be ascribed to the treatment. With this assumption, even if we have no sample points exactly at the threshold, we can use sample points whose covariate is near the threshold to estimate the outcome at the cutoff. The assumption holds if subjects do not have perfect control over $x$, and hence over the group they fall in. Physiological measurements such as blood pressure or heart rate are examples; test scores, in most cases, cannot be fully controlled by subjects either. On the other hand, the continuity assumption on $w_i(\cdot)$ ensures that the proportions of compliers, always-takers and never-takers do not change abruptly at the threshold.

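However, in fuzzy RDD not all subjects are compliers. To correctly identify the treatment effect for the compliers at the cutoff, we have to adjust $\lim_{\varepsilon \to 0} E(y_i \mid z_i = 1, |x_i| = \varepsilon) - \lim_{\varepsilon \to 0} E(y_i \mid z_i = 0, |x_i| = \varepsilon)$ by dividing it by the proportion of compliers, which can be estimated by the difference between the proportions of samples treated just above and just below the threshold, $E(w_i \mid z_i = 1, |x_i| = \varepsilon) - E(w_i \mid z_i = 0, |x_i| = \varepsilon)$, where $\varepsilon$ is a small positive number. We may summarize the above argument in the following theorem from Hahn et al. (2001):

Theorem 3.1 (Identification). The local average treatment effect of the compliers at the threshold ($\tau_{FRD}$) can be identified as follows:

$$\tau_{FRD} = \frac{\lim_{\varepsilon \to 0} E(y_i \mid z_i = 1, |x_i| = \varepsilon) - \lim_{\varepsilon \to 0} E(y_i \mid z_i = 0, |x_i| = \varepsilon)}{\lim_{\varepsilon \to 0} E(w_i \mid z_i = 1, |x_i| = \varepsilon) - \lim_{\varepsilon \to 0} E(w_i \mid z_i = 0, |x_i| = \varepsilon)}. \tag{3.3}$$

Proof. First observe that

$$\begin{aligned}
E(y_i \mid z_i = 1, |x_i| = \varepsilon) &= E\bigl(y_i(1) w_i + y_i(0)(1 - w_i) \mid z_i = 1, |x_i| = \varepsilon\bigr) \\
&= E\bigl(y_i(1) w_i(1) + y_i(0)(1 - w_i(1)) \mid z_i = 1, |x_i| = \varepsilon\bigr).
\end{aligned}$$

Therefore, by the continuity assumption,

$$\lim_{\varepsilon \to 0} E(y_i \mid z_i = 1, |x_i| = \varepsilon) = E\bigl(y_i(1) w_i(1) + y_i(0)(1 - w_i(1)) \mid x_i = 0\bigr).$$

Similarly,

$$\lim_{\varepsilon \to 0} E(y_i \mid z_i = 0, |x_i| = \varepsilon) = E\bigl(y_i(1) w_i(0) + y_i(0)(1 - w_i(0)) \mid x_i = 0\bigr).$$

Hence, the numerator of $\tau_{FRD}$ can be simplified as

$$E\bigl((y_i(1) - y_i(0))(w_i(1) - w_i(0)) \mid x_i = 0\bigr) = E\bigl(y_i(1) - y_i(0) \mid x_i = 0, \text{Complier}\bigr)\, P(\text{Complier at } x_i = 0).$$

On the other hand, by the no-defier assumption,

$$\begin{aligned}
\lim_{\varepsilon \to 0} E(w_i \mid z_i = 1, |x_i| = \varepsilon) &= \lim_{\varepsilon \to 0} E(w_i(1) \mid z_i = 1, |x_i| = \varepsilon) \\
&= E(w_i(1) \mid x_i = 0) = P(\text{Always-taker at } x_i = 0) + P(\text{Complier at } x_i = 0).
\end{aligned}$$

Similarly,

$$\lim_{\varepsilon \to 0} E(w_i \mid z_i = 0, |x_i| = \varepsilon) = E(w_i(0) \mid x_i = 0) = P(\text{Always-taker at } x_i = 0).$$

This shows that the denominator of $\tau_{FRD}$ is $P(\text{Complier at } x_i = 0)$.

Finally, dividing the numerator by the denominator gives

$$\tau_{FRD} = E\bigl(y_i(1) - y_i(0) \mid x_i = 0, \text{Complier}\bigr),$$

that is, the local average treatment effect for the compliers at the threshold.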
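The identification argument can be checked with a small worked example. The sketch below assumes a hypothetical population at the cutoff made up of compliers, always-takers and never-takers; the shares and mean potential outcomes are invented for illustration. It computes the numerator and denominator of (3.3) from these population quantities and compares the ratio with the complier effect.

```python
# Hypothetical population at the cutoff x = 0 (all numbers are illustrative).
# For each type: population share, potential treatment statuses (w0, w1),
# and mean potential outcomes without (y0) and with (y1) treatment.
population = {
    "complier":     {"share": 0.5, "w0": 0, "w1": 1, "y0": 1.0, "y1": 3.0},
    "always-taker": {"share": 0.2, "w0": 1, "w1": 1, "y0": 2.0, "y1": 2.5},
    "never-taker":  {"share": 0.3, "w0": 0, "w1": 0, "y0": 0.5, "y1": 4.0},
}


def mean_treatment(z: int) -> float:
    """Share of subjects treated when assigned to group z (0 or 1)."""
    key = "w1" if z == 1 else "w0"
    return sum(t["share"] * t[key] for t in population.values())


def mean_outcome(z: int) -> float:
    """Mean observed outcome when assigned to group z (0 or 1)."""
    total = 0.0
    for t in population.values():
        treated = t["w1"] if z == 1 else t["w0"]
        total += t["share"] * (t["y1"] if treated == 1 else t["y0"])
    return total


numerator = mean_outcome(1) - mean_outcome(0)        # jump in the outcome
denominator = mean_treatment(1) - mean_treatment(0)  # jump in treatment probability
tau_frd = numerator / denominator

complier_effect = population["complier"]["y1"] - population["complier"]["y0"]
```

In this example the raw jump in the outcome understates the complier effect, because only half of the population changes treatment status at the cutoff; dividing by the jump in treatment probability recovers the complier effect, as Theorem 3.1 states.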

3.3 Estimation Strategy

Further inspection of equation (3.3) shows that $\tau_{FRD}$ consists of four components: two, $\lim_{\varepsilon \to 0} E(y_i \mid z_i = 1, |x_i| = \varepsilon)$ and $\lim_{\varepsilon \to 0} E(y_i \mid z_i = 0, |x_i| = \varepsilon)$, in the numerator, and the other two, $\lim_{\varepsilon \to 0} E(w_i \mid z_i = 1, |x_i| = \varepsilon)$ and $\lim_{\varepsilon \to 0} E(w_i \mid z_i = 0, |x_i| = \varepsilon)$, in the denominator. To estimate $\tau_{FRD}$, it is straightforward and tempting to estimate each component in the formula separately. As mentioned in Hahn, Todd, and Van der Klaauw (2001), we can estimate $\lim_{\varepsilon \to 0} E(y_i \mid z_i = 1, |x_i| = \varepsilon)$ by the following (local) linear regression:

$$\min_{\beta_0, \beta_1} \sum_{i:\, 0 \le x_i \le h} (y_i - \beta_0 - \beta_1 x_i)^2 \, \kappa\!\left(\frac{x_i}{h}\right),$$

where $h$ is a chosen bandwidth, and then taking the estimate of $\beta_0$ as the estimate of the limit. Here $\kappa(\cdot)$ is a kernel function which the econometrician can freely choose. Note that we have to choose a bandwidth $h$, that is, keep only the sample points around the cutoff, since we want to estimate the average outcome and the proportion of treatment receivers for individuals near the threshold. The other components in (3.3) can be estimated by similar methods.
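The one-sided local linear fit described above can be sketched in a few lines of NumPy. The triangular kernel, the function names and the noiseless demonstration data are assumptions made for this illustration; the thesis leaves the kernel choice to the researcher.

```python
import numpy as np


def triangular_kernel(u):
    """kappa(u) = max(0, 1 - |u|); one common kernel choice (an assumption here)."""
    return np.clip(1.0 - np.abs(u), 0.0, None)


def one_sided_intercept(x, y, h, side="right"):
    """Kernel-weighted linear fit on one side of the cutoff 0; returns beta_0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if side == "right":
        mask = (x >= 0) & (x <= h)
    else:
        mask = (x < 0) & (x >= -h)
    xs, ys = x[mask], y[mask]
    weights = triangular_kernel(xs / h)
    design = np.column_stack([np.ones_like(xs), xs])
    # Weighted least squares via square-root weighting of rows.
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(design * sw[:, None], ys * sw, rcond=None)
    return beta[0]


# Noiseless demonstration data: a linear trend with a jump of 3 at the cutoff.
x_demo = np.linspace(-1.0, 1.0, 401)
y_demo = 1.0 + 2.0 * x_demo + 3.0 * (x_demo >= 0)
right_limit = one_sided_intercept(x_demo, y_demo, h=0.5, side="right")
left_limit = one_sided_intercept(x_demo, y_demo, h=0.5, side="left")
jump = right_limit - left_limit
```

Applying the same function to the treatment indicator instead of the outcome gives the two components of the denominator, so four such fits yield the component-wise estimate of the ratio.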

However, as we carry out four separate estimations, the resulting estimate of $\tau_{FRD}$ would suffer from a large amount of sampling variation. To overcome this difficulty, first observe that since $z_i$ is binary, the denominator and numerator of (3.3) can be estimated by $\gamma_1$ and $\delta_1$, respectively, in the following two regression models:

$$w_i = \gamma_0 + \gamma_1 z_i + \gamma_2 x_i + \gamma_3 x_i z_i + \xi_i, \tag{3.4}$$

$$y_i = \delta_0 + \delta_1 z_i + \delta_2 x_i + \delta_3 x_i z_i + \nu_i, \tag{3.5}$$

where $\xi_i$ and $\nu_i$ are error terms. In (3.4) and (3.5), we only consider sample points with $|x_i| < h$ to capture the local effect around the threshold more precisely. Also note that by including $x_i z_i$ we allow the average marginal effect of $x_i$ to differ between subjects above and below the threshold. Hahn et al. (2001) first noticed the numerical equivalence between $\tau_{FRD} = \delta_1 / \gamma_1$ and the estimate from the following two-stage least squares (2SLS) procedure, with the first stage being (3.4) and the second stage being

$$y_i = \beta_0 + \alpha w_i + \beta_1 x_i + \beta_2 x_i z_i + \epsilon_i. \tag{3.6}$$

After inserting the fitted $w_i$ from (3.4) into (3.6), one has

$$\begin{aligned}
y_i &= \beta_0 + \alpha w_i + \beta_1 x_i + \beta_2 x_i z_i + \epsilon_i \\
&= \beta_0 + \alpha(\gamma_0 + \gamma_1 z_i + \gamma_2 x_i + \gamma_3 x_i z_i) + \beta_1 x_i + \beta_2 x_i z_i + \epsilon_i.
\end{aligned}$$


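Finally, comparing coefficients with (3.5) gives $\delta_1 = \alpha \gamma_1$, or $\alpha = \delta_1 / \gamma_1 = \tau_{FRD}$. To sum up the above discussion, we run the following (local) instrumental variable regression:

$$\min_{\alpha, \beta_0, \beta_1, \beta_2} \sum_{i:\, -h \le x_i \le h} (y_i - \alpha w_i - \beta_0 - \beta_1 x_i - \beta_2 x_i z_i)^2 \, \kappa\!\left(\frac{x_i}{h}\right), \tag{3.7}$$

with $z_i$ being the instrument for $w_i$. The estimated coefficient of $w_i$ ($\alpha$) is then equivalent to the 2SLS estimator above, which gives the desired $\tau_{FRD}$.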
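The numerical equivalence between the ratio $\delta_1 / \gamma_1$ and the 2SLS coefficient can be illustrated with a short simulation. The sketch below is written under stated assumptions: the data generating process (60% compliers, 20% always-takers, 20% never-takers, a treatment effect of 2), the triangular kernel and all function names are hypothetical and are not the DGPs used in the simulation study of the thesis.

```python
import numpy as np


def wls(design, target, weights):
    """Weighted least squares via square-root weighting of rows."""
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], target * sw, rcond=None)
    return coef


def fuzzy_rd(x, w, y, h):
    """Return (delta_1 / gamma_1, alpha) from (3.4), (3.5) and the second stage (3.6)."""
    x, w, y = (np.asarray(a, dtype=float) for a in (x, w, y))
    keep = np.abs(x) < h
    x, w, y = x[keep], w[keep], y[keep]
    z = (x >= 0).astype(float)
    k = np.clip(1.0 - np.abs(x / h), 0.0, None)  # triangular kernel (assumed)
    exog = np.column_stack([np.ones_like(x), z, x, x * z])
    gamma = wls(exog, w, k)  # first stage (3.4)
    delta = wls(exog, y, k)  # reduced form (3.5)
    ratio = delta[1] / gamma[1]
    w_hat = exog @ gamma  # fitted treatment from the first stage
    second = np.column_stack([np.ones_like(x), w_hat, x, x * z])
    alpha = wls(second, y, k)[1]  # second stage (3.6)
    return ratio, alpha


# Hypothetical simulated data (illustrative DGP, not taken from the thesis).
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1.0, 1.0, n)
z = (x >= 0).astype(float)
u = rng.uniform(size=n)
# u < 0.6: complier (w = z); 0.6 <= u < 0.8: always-taker; otherwise never-taker.
w = np.where(u < 0.6, z, np.where(u < 0.8, 1.0, 0.0))
y = 1.0 + 0.5 * x + 2.0 * w + rng.normal(scale=0.5, size=n)

ratio, alpha = fuzzy_rd(x, w, y, h=0.5)
```

The two numbers agree up to floating-point error because the second-stage regressors span the same column space as the regressors in (3.5) whenever the estimated $\gamma_1$ is nonzero. The manual second stage is used here only to show the point estimate; its naive standard errors would not be valid, which is one reason to rely on a dedicated 2SLS routine in practice.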

It is also worth noting that in sharp RDD, (3.7) reduces to an ordinary (kernel-weighted) linear regression, since in this particular case the instrumental variable $z_i$ coincides with $w_i$ itself. One may also explain this result by examining the 2SLS procedure. In the first stage (3.4), as $w_i = z_i$, the best fit is $(\gamma_0, \gamma_1, \gamma_2, \gamma_3) = (0, 1, 0, 0)$, and hence the first stage is in fact trivial. Graphically, (3.7) amounts to fitting two different lines to the observations above and below the threshold respectively, which we illustrate in Figure 1 below.

Figure 1: An illustration of one dimensional sharp RDD. We use two different lines to fit data below and above the threshold respectively.

By using 2SLS, which is available in most statistical software, the standard error of the estimate of $\tau_{FRD}$ can be easily obtained. However, one should be aware that the standard error would still be high if the denominator of the estimate, namely $\lim_{\varepsilon \to 0} E(w_i \mid z_i = 1, |x_i| = \varepsilon) - \lim_{\varepsilon \to 0} E(w_i \mid z_i = 0, |x_i| = \varepsilon)$, is small. Numerically, this corresponds to the case of weak instruments, or a small $\gamma_1$ in the first stage (3.4). Intuitively, a small difference between the treated proportions in the treatment group and the control group means that the two groups are only slightly different, making it harder to estimate the treatment effect.


If one wishes, one may also use a higher-order polynomial to fit the data. However, one should always be aware that a higher-order polynomial does not guarantee a better estimate, as the estimated coefficients and model may be highly unstable due to overfitting. As Gelman and Imbens (2017) pointed out, the estimate is easily affected by the degree of a globally fitted polynomial. In contrast, a locally fitted polynomial provides a relatively more stable estimate. Empirically, one may use the Akaike information criterion (AIC) to determine the order of the fitted polynomial; namely,

AIC = N ln(SSR/N ) + 2p, (3.8)

where N is the number of sample points used, SSR stands for residual sum of squares, and p is the number of estimated parameters in the model. Particularly, in one-dimensional fuzzy RDD, p = 2(d + 1), where d is the degree of the fitting polynomial.
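As a small illustration of formula (3.8), the sketch below computes the AIC for several polynomial degrees and selects the degree with the smallest value. The residual sums of squares are hypothetical numbers chosen only to show the mechanics, and the helper names are not from the thesis.

```python
import math


def aic(ssr: float, n: int, p: int) -> float:
    """AIC = N ln(SSR / N) + 2p, as in equation (3.8)."""
    return n * math.log(ssr / n) + 2 * p


def n_params_fuzzy_rdd(degree: int) -> int:
    """Number of estimated parameters p = 2(d + 1) in one-dimensional fuzzy RDD."""
    return 2 * (degree + 1)


# Hypothetical residual sums of squares for degrees 1, 2, 3 with N = 200 points.
n_points = 200
ssr_by_degree = {1: 210.0, 2: 204.0, 3: 203.5}

aic_by_degree = {
    d: aic(ssr, n_points, n_params_fuzzy_rdd(d)) for d, ssr in ssr_by_degree.items()
}
best_degree = min(aic_by_degree, key=aic_by_degree.get)
```

With these illustrative numbers, the drop in SSR from degree 1 to degree 2 outweighs the penalty of two extra parameters, while the further drop from degree 2 to degree 3 does not.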

Since our main goal is to estimate the local treatment effect at the threshold, observations far away from the threshold may mess our estimation up. Therefore, it is better to choose a bandwidth h around the threshold and keep only the sample points in this band. As for existing bandwidth selection procedure, Imbens and Lemieux (2008) considers cross-validation. On the other hand, Imbens and Kalyanaraman (2012) proposes another method to directly estimate the optimal bandwidth. Basically, they tried to choose a bandwidth which minimizes the asymptotic mean squared error of the estimates.

Calonico, Cattaneo and Titiunik (2014, CCT) argue that most bandwidth selection methods actually produce a biased estimate, and they propose a robust method to correct such bias. In this work, we mainly follow CCT to choose the bandwidth.

4 Multidimensional Fuzzy RDD

In the former framework, we allow only a single running variable x. That is, whether or not an individual receives treatment depends solely on a single factor. In reality, however, there may be cases involving more factors. For example, a patient is diagnosed with hypertension if his systolic pressure or diastolic pressure exceeds 140 or 90 mm-Hg, respectively. In Jacob and Lefgren (2004), the authors investigate a policy implemented in Chicago starting from 1996, in which students are required to attend summer school if their math or reading test score is below a certain cutoff.

Although the methods introduced later in this paper can be easily generalized to an arbitrary number of factors, for simplicity, we shall discuss the case where only two factors are present.

4.1 Problem Formulation and Assumptions

Recall DGP (3.1) and (3.2). That is,

$$y_i = y_i(1)\,w_i + y_i(0)\,(1 - w_i) = y_i(0) + w_i\,\bigl(y_i(1) - y_i(0)\bigr),$$

$$w_i = w_i(1)\,z_i + w_i(0)\,(1 - z_i) = w_i(0) + z_i\,\bigl(w_i(1) - w_i(0)\bigr).$$

This time, however, the treatment status $w_i$ is related to two (or more) factors, collected in the vector of running variables $X$. Individuals may have a higher probability of receiving treatment if $X_1$ and $X_2$ both exceed certain cutoffs (and-rule) or if either one of them passes some threshold (or-rule). More specifically, we set the thresholds for both $X_1$ and $X_2$ to 0 in this work unless otherwise specified. For an and-rule, we define $z_i$, the binary variable indicating whether the subject falls in the treatment group or not, as $z_i = 1(X_{1i} > 0)\,1(X_{2i} > 0) = \min\{1(X_{1i} > 0), 1(X_{2i} > 0)\}$. For an or-rule, we define $z_i = \max\{1(X_{1i} > 0), 1(X_{2i} > 0)\}$. When $z_i = 1$, we say the sample lies in the treatment area; otherwise it lies in the control area.

In fact, the two kinds of rule do not differ much when the dimension of the covariate vector $X$ equals two, as the negation of an and-rule leads to an or-rule. In other words, if $(z_i, X_{1i}, X_{2i})$ follows an or-rule, then $(1 - z_i, -X_{1i}, -X_{2i})$ follows an and-rule in the sense that $1 - z_i = 1 - \max\{1(X_{1i} > 0), 1(X_{2i} > 0)\} = \min\{1(X_{1i} \le 0), 1(X_{2i} \le 0)\} = \min\{1(-X_{1i} \ge 0), 1(-X_{2i} \ge 0)\}$. Therefore, in this work, unless otherwise specified, we assume that the assignment follows an and-rule. That is, the observation has a higher tendency to be treated if both $X_1$ and $X_2$ are greater than 0, that is, if $X$ lies in the first quadrant of the coordinate plane with $X_1$ and $X_2$ being the two axes.
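The two assignment rules and the negation argument can be checked numerically. The following is a small sketch with hypothetical helper names (`z_and`, `z_or`, `z_and_weak`); the last helper uses weak inequalities, matching the final expression in the chain of equalities above.

```python
import numpy as np


def z_and(X: np.ndarray) -> np.ndarray:
    """And-rule: z_i = min{1(X_1i > 0), 1(X_2i > 0)}."""
    return (X > 0).all(axis=1).astype(int)


def z_or(X: np.ndarray) -> np.ndarray:
    """Or-rule: z_i = max{1(X_1i > 0), 1(X_2i > 0)}."""
    return (X > 0).any(axis=1).astype(int)


def z_and_weak(X: np.ndarray) -> np.ndarray:
    """And-rule with weak inequalities: min{1(X_1i >= 0), 1(X_2i >= 0)}."""
    return (X >= 0).all(axis=1).astype(int)


# A few points, including some on the axes and at the threshold itself.
X = np.array([[1.0, 2.0], [1.0, -1.0], [-1.0, 0.0], [0.0, 0.0], [-2.0, -3.0]])

lhs = 1 - z_or(X)       # negated or-rule assignment
rhs = z_and_weak(-X)    # and-rule applied to the reflected running variables
same = bool((lhs == rhs).all())
```

Since the running variables are continuous, the difference between strict and weak inequalities concerns only a set of probability zero.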

For convenience, we say the observation lies in the first quadrant (of the plane spanned by $X_{1i}$ and $X_{2i}$) if the covariate vector $X$ of observation i satisfies $X_{1i} > 0$ and $X_{2i} > 0$. Observations lying in the second, the third, and the fourth quadrants are defined analogously. One can observe that if we adopt an and-rule, then the treatment group is composed of subjects in the first quadrant, while the control group comprises subjects from all other quadrants⁴.
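For later reference, the quadrant labels can be computed as in the sketch below. The function name `quadrant` is hypothetical, and labelling points that lie exactly on an axis with 0 is an assumption made here for simplicity, since such points occur with probability zero for continuous running variables.

```python
import numpy as np


def quadrant(X: np.ndarray) -> np.ndarray:
    """Return the quadrant (1 to 4) of each row of X, or 0 for points on an axis."""
    x1, x2 = X[:, 0], X[:, 1]
    q = np.zeros(len(X), dtype=int)
    q[(x1 > 0) & (x2 > 0)] = 1
    q[(x1 < 0) & (x2 > 0)] = 2
    q[(x1 < 0) & (x2 < 0)] = 3
    q[(x1 > 0) & (x2 < 0)] = 4
    return q


X = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [0.0, 1.0]])
q = quadrant(X)
z = (q == 1).astype(int)  # and-rule: only the first quadrant is the treatment area
```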

To estimate the local treatment effect for the compliers at the threshold, or at (0,0), we again need the following assumptions:

Assumption 4.1 (No-defier Assumption). $w_i(\cdot)$ is a non-decreasing function.

Assumption 4.2 (Different treatment probabilities).

$$0 \le \lim_{\epsilon \to 0} E(w_i \mid z_i = 0, |X_i| = \epsilon) < \lim_{\epsilon \to 0} E(w_i \mid z_i = 1, |X_i| = \epsilon) \le 1.$$

4 If we consider an or-rule, then the control group is composed of subjects in the third quadrant, while subjects from all the other quadrants lie in the treatment group.


Assumption 4.3 (Continuity Assumption). $E(y_i(1) \mid X_i)$, $E(y_i(0) \mid X_i)$, $E(w_i(1) \mid X_i)$ and $E(w_i(0) \mid X_i)$ are continuous at $X_i = (0, 0)$.

As one can see, these assumptions are simply analogues of their counterparts in the one-dimensional case. In fact, the identification formula of the treatment effect stays the same in the higher-dimensional case, as summarized in the following theorem:

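Theorem 4.1 (Identification). The local average treatment effect of the compliers at the threshold can be identified as follows:

$$\tau_{FRD} = \frac{\lim_{\epsilon \to 0} E(y_i \mid z_i = 1, |X_i| = \epsilon) - \lim_{\epsilon \to 0} E(y_i \mid z_i = 0, |X_i| = \epsilon)}{\lim_{\epsilon \to 0} E(w_i \mid z_i = 1, |X_i| = \epsilon) - \lim_{\epsilon \to 0} E(w_i \mid z_i = 0, |X_i| = \epsilon)}.$$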

Recall that the proof of the one-dimensional case does not use any property that holds only in one dimension. Thus the proof of this theorem is exactly the same as that of Theorem 3.1.
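A crude sample analogue of the identification formula replaces each limit by a group mean. The sketch below assumes that the arrays already contain only observations in a small neighbourhood of the threshold, and it ignores the slope adjustments that the local polynomial methods discussed later provide. The function name `wald_ratio` and the toy data are hypothetical.

```python
import numpy as np


def wald_ratio(y, w, z) -> float:
    """Ratio of the jump in mean outcome to the jump in mean treatment status."""
    y, w, z = (np.asarray(a, dtype=float) for a in (y, w, z))
    jump_y = y[z == 1].mean() - y[z == 0].mean()
    jump_w = w[z == 1].mean() - w[z == 0].mean()
    return jump_y / jump_w


# Toy data with a constant treatment effect of 5 and imperfect compliance.
z = np.array([1, 1, 1, 0, 0, 0])
w = np.array([1, 1, 0, 0, 0, 1])
y = 5.0 + 5.0 * w
tau_hat = wald_ratio(y, w, z)
```

In this toy example the jump in the outcome is 5/3 and the jump in treatment probability is 1/3, so the ratio recovers the effect of 5.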

In one-dimensional RDD, it is natural to separate the observations into two groups according to whether their running variable passes the threshold or not. In the two-dimensional case, following the and-rule, we label the sample points in the second, third, and fourth quadrants as the control group. Implicitly, we have assumed that samples in the control group have the same probability of receiving treatment, as well as the same marginal effects of $X_1$ and $X_2$, across the three quadrants. Empirically, however, this may not be the case. For example, although only those whose systolic and diastolic blood pressure are both over 140/90 mm-Hg are diagnosed with hypertension, individuals who exceed only one of the two standards may have a higher tendency to receive treatment than those who are healthy. Both kinds of individuals are categorized into the control group by definition, but obviously there is heterogeneity between them. As we shall see in the simulation, naively neglecting the heterogeneity in the control group often leads to a biased estimate.

4.2 Estimation Method

There are numerous methods to estimate treatment effect at the threshold according to previous studies. They can be briefly summarized into the following two categories.

(A) Dimension Reduction

To estimate $\tau_{FRD}$, one may consider compressing the information in the multidimensional vector $X$ into one dimension by taking a norm of it. For example, Reardon and Robinson (2012) consider the $\ell_2$ norm and introduce the following variable:

$$d_i = z_i \sqrt{X_{1i}^2 + X_{2i}^2} - (1 - z_i) \sqrt{X_{1i}^2 + X_{2i}^2} = (2z_i - 1) \sqrt{X_{1i}^2 + X_{2i}^2}.$$

In other words, if the sample falls into the treatment area, we attach a positive distance; otherwise, a negative distance is used. In this way, we may estimate a one-dimensional RDD model with the outcome still being $y_i$, but with running variable $d_i$ and threshold 0. More generally, one may consider other norms such as the $\ell_1$ norm or the maximum norm ($\ell_\infty$ norm). In such cases, the running variable $d_i = (2z_i - 1)\|X_i\|$ is introduced, where $\|\cdot\|$ is the norm one wishes to use. This method can be similarly generalized to higher-dimensional cases.
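The signed-norm running variable can be sketched as follows for the three norms mentioned. The function name `signed_distance` is hypothetical, and the and-rule with threshold 0 is assumed for $z_i$.

```python
import numpy as np


def signed_distance(X: np.ndarray, z: np.ndarray, ord=2) -> np.ndarray:
    """d_i = (2 z_i - 1) * ||X_i||, with ord in {1, 2, np.inf}."""
    return (2 * z - 1) * np.linalg.norm(X, ord=ord, axis=1)


X = np.array([[3.0, 4.0], [-3.0, 4.0], [-1.0, -2.0]])
z = (X > 0).all(axis=1).astype(int)  # and-rule assignment

d_l2 = signed_distance(X, z, ord=2)
d_l1 = signed_distance(X, z, ord=1)
d_max = signed_distance(X, z, ord=np.inf)
```

Only the first point lies in the treatment area, so it is the only one that receives a positive distance under each norm.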

On the other hand, Wong, Steiner and Cook (2013) consider taking $d_i = \min\{X_{1i}, X_{2i}\}$, which they name the “centering approach”. In the same paper, Wong, Steiner and Cook also introduce the “univariate method”, in which one simply focuses on a single running variable and neglects the other. Specifically, one discards observations with $X_{1i} < 0$, that is, observations in the second and the third quadrants. Whether the remaining observations fall in the treatment area is then determined solely by $X_2$. Therefore, the one-dimensional RDD estimation strategies addressed in the previous section can be adopted with $X_2$ as the running variable. Similarly, one can consider only the observations in the first and the second quadrants; in other words, one neglects those with $X_{2i} < 0$.
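Both reductions are straightforward to express in code. The sketch below uses made-up data and hypothetical variable names; it only illustrates which observations and which running variable each approach retains.

```python
import numpy as np

X = np.array([[2.0, 1.0], [2.0, -1.0], [-1.0, 3.0], [-2.0, -0.5]])

# Centering approach: under the and-rule, z_i = 1 exactly when min{X_1i, X_2i} > 0,
# so the minimum can serve as a single running variable with threshold 0.
d_center = X.min(axis=1)
z = (d_center > 0).astype(int)

# Univariate method: discard observations with X_1i < 0 and use X_2 alone.
keep = X[:, 0] >= 0
running = X[keep, 1]
z_kept = (running > 0).astype(int)
```

Among the retained observations, the assignment implied by $X_2$ alone coincides with the original and-rule assignment.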

Nevertheless, we have to stress that by using a dimension reduction method, we may falsely simplify the relation between $X_{1i}$ and $X_{2i}$. For instance, if one wishes to use the $\ell_2$ norm to compress the running vector $X_i$, then one basically has to (implicitly) assume equal treatment effect, equal marginal effects of $(X_{1i}, X_{2i})$, as well as equal treatment probability for sample points with the same $\|X_i\|_2$ in the treatment group and the control group, respectively. Otherwise, misspecification would lead to a biased estimate.

(B) Local polynomial fitting

Instead of compressing existing information, Imbens and Zajonc (2011) consider directly fitting polynomials near the threshold. Specifically, they estimate the following regression models.

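$$\min_{\alpha, \beta} \sum_{i:\, X_i \in H} \bigl(y_i - \alpha w_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \beta_3 z_i X_{1i} - \beta_4 z_i X_{2i}\bigr)^2 \, \kappa\!\left(\frac{X_{1i}}{h_1}\right) \kappa\!\left(\frac{X_{2i}}{h_2}\right), \tag{4.1}$$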

with $z_i$ being the instrument of $w_i$, $\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)$ for brevity, $\kappa(\cdot)$ a freely chosen kernel function, and $H$ a bandwidth around the threshold chosen according to the data (we will discuss the selection procedure later). The estimated coefficient $\hat{\alpha}$ is then the desired $\hat{\tau}_{FRD}$. Graphically, similar to the one-dimensional case, they fit two different planes to the treatment group and the control group, respectively.
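The estimator in (4.1) can be sketched as a kernel-weighted, just-identified instrumental variable regression. The code below is an illustration under stated assumptions, not the implementation used in this work: the function names, the triangular product kernel, the fixed bandwidths, and the simulated data with a homogeneous control group are all chosen here for demonstration.

```python
import numpy as np


def tri_kernel(u):
    """Triangular kernel (1 - |u|) 1(|u| <= 1)."""
    return np.clip(1.0 - np.abs(u), 0.0, None)


def union_estimate(y, w, X, h1, h2):
    """Weighted IV estimate of alpha in (4.1), instrumenting w_i with z_i."""
    x1, x2 = X[:, 0], X[:, 1]
    z = ((x1 > 0) & (x2 > 0)).astype(float)            # and-rule assignment
    k = tri_kernel(x1 / h1) * tri_kernel(x2 / h2)      # product kernel weights
    keep = k > 0                                       # rectangular band
    exog = np.column_stack([np.ones_like(x1), x1, x2, z * x1, z * x2])
    R = np.column_stack([w, exog])[keep]               # regressors
    Zm = np.column_stack([z, exog])[keep]              # instruments
    kw = k[keep]
    # Just-identified weighted IV: solve (Z' K R) theta = Z' K y.
    A = Zm.T @ (kw[:, None] * R)
    b = Zm.T @ (kw * y[keep])
    return np.linalg.solve(A, b)[0]                    # coefficient on w


# Illustrative data: treatment effect 5, no heterogeneity in the control group.
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 2))
z = (X > 0).all(axis=1)
w = (rng.uniform(size=n) < np.where(z, 0.9, 0.1)).astype(float)
y = 5 + 5 * w + X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)
tau_hat = union_estimate(y, w, X, h1=1.0, h2=1.0)
```

Because the model is just identified, the weighted 2SLS estimator reduces to the single linear solve shown, which keeps the sketch short.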


As this method uses the whole sample in estimation, we shall refer to it as the union method (Lo, 2017) in the rest of this paper. However, as suggested before, the union method implicitly assumes a homogeneous effect of the running variables within the control group, since it locally fits a single polynomial to the whole control group. If heterogeneity among the control group exists, then, as shown in Lo (2017), Hsu et al. (2018), and our simulation later, the union method produces a biased estimate.

To correctly estimate $\tau_{FRD}$ when heterogeneity is present, Lo (2017) and Hsu, Kuan and Lo (2018) note that since the bias of the union method originates from neglecting the differences among sample points in the second, third, and fourth quadrants, one can simply tackle two quadrants, instead of four, at a time. Specifically, following their idea for sharp RDD, we may generalize their method to the fuzzy design by implementing the following three instrumental least squares regressions:

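$$\min_{\alpha_1, \beta} \sum_{i:\, X_i \in H,\ X_{2i} > 0} \bigl(y_i - \alpha_1 w_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \beta_3 z_i X_{1i} - \beta_4 z_i X_{2i}\bigr)^2 \, \kappa\!\left(\frac{X_{1i}}{h_1}\right) \kappa\!\left(\frac{X_{2i}}{h_2}\right), \tag{4.2}$$

$$\min_{\alpha_2, \beta} \sum_{i:\, X_i \in H,\ X_{1i} X_{2i} > 0} \bigl(y_i - \alpha_2 w_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \beta_3 z_i X_{1i} - \beta_4 z_i X_{2i}\bigr)^2 \, \kappa\!\left(\frac{X_{1i}}{h_1}\right) \kappa\!\left(\frac{X_{2i}}{h_2}\right), \tag{4.3}$$

$$\min_{\alpha_3, \beta} \sum_{i:\, X_i \in H,\ X_{1i} > 0} \bigl(y_i - \alpha_3 w_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \beta_3 z_i X_{1i} - \beta_4 z_i X_{2i}\bigr)^2 \, \kappa\!\left(\frac{X_{1i}}{h_1}\right) \kappa\!\left(\frac{X_{2i}}{h_2}\right), \tag{4.4}$$

with $z_i$ being the instrument of $w_i$ in all three regressions. Finally, we take $\hat{\tau}_{FRD} = (\hat{\alpha}_1 + \hat{\alpha}_2 + \hat{\alpha}_3)/3$.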

One can observe that in (4.2), (4.3), and (4.4), we regress with sample points in the first and the second, the first and the third, and the first and the fourth quadrants, respectively. Each regression gives an unbiased estimate of $\tau_{FRD}$ (namely $\hat{\alpha}_j$, $j = 1, 2, 3$), and hence $\hat{\tau}_{FRD}$ is unbiased as well.

Lo (2017) and Hsu et al. (2018) name the above estimation procedure the average method. In this procedure, one uses all observations near the threshold to estimate $\tau_{FRD}$. However, as three instrumental regressions are required, the computational burden is considerable. As a modification, Lo further notes that since our main goal is to estimate the average treatment effect at the threshold (0, 0), one may consider only the observations in the first and the third quadrants, since the two sets meet at (0, 0). Following his advice, we can run only regression (4.3) and take $\hat{\tau}_{FRD} = \hat{\alpha}_2$. This simplified procedure is named the intersection method in Lo's work. As illustrated in Figure 2, the only difference between the union and the intersection method is the region from which sample points are used in estimation.
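The average and intersection procedures differ from the full-sample regression only in the subsample passed to each regression. The sketch below is a simplified illustration under assumptions made here: the helper names, the triangular product kernel, the fixed bandwidths, and the simulated data (in which the third quadrant has different slopes) are hypothetical and are not taken from the simulation design of this work.

```python
import numpy as np


def local_iv(y, w, X, h1, h2, subset):
    """Kernel-weighted, just-identified IV on the rows where `subset` is True."""
    x1, x2 = X[:, 0], X[:, 1]
    z = ((x1 > 0) & (x2 > 0)).astype(float)
    k = np.clip(1 - np.abs(x1 / h1), 0, None) * np.clip(1 - np.abs(x2 / h2), 0, None)
    keep = subset & (k > 0)
    exog = np.column_stack([np.ones_like(x1), x1, x2, z * x1, z * x2])
    R = np.column_stack([w, exog])[keep]      # regressors
    Zm = np.column_stack([z, exog])[keep]     # instruments
    kw = k[keep]
    A = Zm.T @ (kw[:, None] * R)
    b = Zm.T @ (kw * y[keep])
    return np.linalg.solve(A, b)[0]           # coefficient on w


def average_and_intersection(y, w, X, h1, h2):
    """Return (average estimate, intersection estimate)."""
    x1, x2 = X[:, 0], X[:, 1]
    a1 = local_iv(y, w, X, h1, h2, x2 > 0)        # (4.2): quadrants 1 and 2
    a2 = local_iv(y, w, X, h1, h2, x1 * x2 > 0)   # (4.3): quadrants 1 and 3
    a3 = local_iv(y, w, X, h1, h2, x1 > 0)        # (4.4): quadrants 1 and 4
    return (a1 + a2 + a3) / 3, a2


# Illustrative data: effect 5 at the threshold, steeper slopes in the third quadrant.
rng = np.random.default_rng(0)
n = 40000
X = rng.normal(size=(n, 2))
z = (X > 0).all(axis=1)
q3 = (X[:, 0] < 0) & (X[:, 1] < 0)
w = (rng.uniform(size=n) < np.where(z, 0.9, 0.1)).astype(float)
y = 5 + 5 * w + X[:, 0] + X[:, 1] + 2 * q3 * (X[:, 0] + X[:, 1]) + 0.5 * rng.normal(size=n)
tau_avg, tau_int = average_and_intersection(y, w, X, h1=1.0, h2=1.0)
```

Each pairwise regression compares the first quadrant with a single control quadrant, so one plane per side suffices even though the control quadrants differ from one another.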

It has been shown in Lo (2017) and Hsu et al. (2018) that for sharp RDD, the intersection method and the average method perform better than the union method (and the dimension reduction methods) if heterogeneity among the control group exists, in the sense that the former give an unbiased estimate, while the latter does not. As for the comparison of the former two methods, although the intersection method imposes less computational burden, it neglects the observations from the second and the fourth quadrants. This information loss leads to a higher standard error and mean squared error (MSE) of the estimate compared to that obtained by the average method.

Figure 2: Difference in sample points used in the two methods: (a) Union Method, (b) Intersection Method. Only the observations in the shaded area are taken into the regression. Note that here we adopt an $\ell_1$ bandwidth.

Nevertheless, Lo (2017) and Hsu et al. (2018) only consider the sharp design with local linear fitting. As we will further elaborate in the simulation section, for fuzzy RDD, there are actually two sources of bias in the union method. The first one, which also appears in the sharp design, is heterogeneity in the effects of the running variables among the control group. To be specific, the (average) marginal effects of $X_{1i}$ and $X_{2i}$ for sample points in different quadrants of the control group may not be equal. The union method fits only one polynomial to the whole control group and thus lacks the flexibility to handle heterogeneity, hence its poor performance. On the other hand, the intersection and average methods fit each quadrant separately, allowing for a wider class of settings. In particular, the case where heterogeneity of marginal effects is absent is also tractable by the latter two methods.

The other one, which can only be detected in the fuzzy design, is the difference in treatment probability within the control group. Even if $X_{1i}$ and $X_{2i}$ have the same marginal effects on observations in different quadrants of the control group, different probabilities of exposure to treatment still contribute to heterogeneity. If the difference in the probability of being treated among quadrants in the control group is large, then the bias originating from misspecification is amplified. It is also noteworthy that such a difference disappears in the sharp design simply because all subjects in the control group are prohibited from being treated by definition.


On the other hand, just as in the one-dimensional case, one may also consider locally fitting higher-order polynomials instead of linear ones. More specifically, if we want to fit quadratic polynomials, then we simply replace the linear loss functions in (4.1) to (4.4) with the following quadratic loss function:

$$\begin{aligned} \bigl(& y_i - \alpha w_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \beta_3 z_i X_{1i} - \beta_4 z_i X_{2i} \\ & - \beta_5 X_{1i}^2 - \beta_6 X_{1i} X_{2i} - \beta_7 X_{2i}^2 - \beta_8 z_i X_{1i}^2 - \beta_9 z_i X_{1i} X_{2i} - \beta_{10} z_i X_{2i}^2 \bigr)^2 \, \kappa\!\left(\frac{X_{1i}}{h_1}\right) \kappa\!\left(\frac{X_{2i}}{h_2}\right); \end{aligned}$$

in other words, we include quadratic terms in the regressions. As one may have observed, merely moving from linear fitting to quadratic fitting dramatically raises the number of coefficients to be estimated. When there is only one running variable, raising the order of the fitted polynomial simply means including higher-order terms of that running variable. In the higher-dimensional case, however, we not only have to include higher-order terms of the running variables but also have to consider the interaction terms, which leads to a much more complex fitting procedure.

As for determining the order of the fitted polynomial, the aforementioned criteria such as AIC, defined in (3.8), can also be applied here. That is,

AIC = N ln(SSR/N) + 2p. (3.8)

Specifically, the number of parameters in the model is p = (d + 1)(d + 2) when two running variables are present and the degree of the fitted polynomial equals d.⁵ Eventually, the fitted model with the smallest AIC is preferred.
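The parameter count can be verified with a short calculation. The sketch below reads the general count given in the footnote as $p = 2\binom{d+n}{n}$, which is an interpretation adopted here; the function name `n_params` is hypothetical.

```python
from math import comb


def n_params(d: int, n: int = 2) -> int:
    """Parameters for a degree-d fit with n running variables: p = 2 * C(d + n, n)."""
    return 2 * comb(d + n, n)


p_linear = n_params(1)        # alpha and beta_0, ..., beta_4
p_quadratic = n_params(2)     # alpha and beta_0, ..., beta_10
p_one_dim = n_params(3, n=1)  # reduces to 2(d + 1) with one running variable
```

With two running variables the count doubles from 6 to 12 when moving from linear to quadratic fitting, which matches the number of coefficients in the corresponding loss functions.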

In the simulation section, we shall go beyond Lo (2017) and Hsu et al. (2018) and explore whether the advantage of the average method and the intersection method persists in fuzzy RDD. We will also extend the estimation procedure by fitting quadratic polynomials instead of linear ones. Before we move on to the simulation, however, we conclude this section by briefly discussing the bandwidth selection procedures. Unfortunately, there is little research on choosing the bandwidth for RDD in the multidimensional case. Therefore we have to resort to dimension reduction methods. We call the adopted method the separation method. We simply perform two one-dimensional bandwidth selection procedures, one for $X_{1i}$ and the other for $X_{2i}$. In each procedure, we choose the bandwidth with respect to one running variable, pretending that the other one does not exist. After obtaining $h_1$ and $h_2$ for $X_{1i}$ and $X_{2i}$, we keep only the sample points with $|X_{1i}| < h_1$ and $|X_{2i}| < h_2$. This process gives a rectangular bandwidth.
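The final filtering step of the separation method amounts to a rectangular mask. In the sketch below, the bandwidths `h1` and `h2` are assumed to have been obtained beforehand from a one-dimensional selector; the selection step itself is not implemented here, and the function name `rectangular_band` is hypothetical.

```python
import numpy as np


def rectangular_band(X: np.ndarray, h1: float, h2: float) -> np.ndarray:
    """Boolean mask of observations with |X_1i| < h1 and |X_2i| < h2."""
    return (np.abs(X[:, 0]) < h1) & (np.abs(X[:, 1]) < h2)


X = np.array([[0.5, 0.2], [1.5, 0.1], [-0.4, -0.9], [0.3, 1.0]])
mask = rectangular_band(X, h1=1.0, h2=1.0)
X_in_band = X[mask]
```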

In fact, we originally tried another method, which we name the norm method. Similar to the dimension reduction method for RDD, we first standardize each argument in the covariate vector by dividing every component by its standard deviation. In other words, we introduce the standardized covariate vector $X^s_i = (X_{1i}/s_1, X_{2i}/s_2)$. By doing this, the bandwidth selection procedure is not disturbed by the scales of $X_{1i}$ and $X_{2i}$. Next we compress $X^s_i$ by its norm, attaching a positive value if $X^s_i$ falls into the treatment area and a negative value otherwise. More specifically, we introduce $d^s_i = (2z_i - 1)\|X^s_i\|$. Note that since the standard deviation is positive, $X_i$ falls into the treatment area if and only if $X^s_i$ does. Finally, we choose the bandwidth $h$ using a one-dimensional approach (in our work, CCT), and take only the observations with $|d^s_i| \le h$ into the regressions. However, we have tried the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, and none of them gives a credible estimate. We therefore mainly choose the bandwidth according to the separation method described in the previous paragraph.

5 More generally, if there are n running variables and the degree of the fitted polynomial is d, then $p = 2\binom{d+n}{n}$.

5 Simulation

5.1 Setup

To allow for different treatment probability as well as heterogeneous effect of Xi in the control groups, we consider the following four DGPs:

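DGP 1: $y_i = 5 + X_{1i} + X_{2i} + v_{1i}(5 + X_{1i} + 0.5X_{2i}) + v_{2i}(5 + 2X_{1i} + 2X_{2i}) + v_{3i}(5 + 3X_{1i} + 4X_{2i}) + \epsilon_i$.

DGP 2: $y_i = 5 + 5w_i + X_{1i} + w_i X_{1i} + X_{2i} + 0.5 w_i X_{2i} + \epsilon_i$.

DGP 3: $y_i = 5 + 5w_i + X_{1i} + w_i X_{1i} + 0.3 w_{1i} X_{1i} + X_{2i} + 0.5 w_i X_{2i} + 0.3 w_{2i} X_{2i} + \epsilon_i$.

DGP 4: $y_i = 5 + 5w_i + X_{1i} + w_i X_{1i} + 0.3 w_{1i} X_{1i} + X_{2i} + 0.5 w_i X_{2i} + 0.3 w_{2i} X_{2i} + X_{1i}^2 + 3X_{2i}^2 + w_i X_{1i}^2 + w_i X_{2i}^2 + \epsilon_i$.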

We generate 500 samples with size 5000 for each DGP.

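$X_i = (X_{1i}, X_{2i})$ are i.i.d. multivariate normal vectors following the distribution:

$$(X_{1i}, X_{2i}) \overset{iid}{\sim} N\!\left( \begin{pmatrix} \phi(0.4) \cdot s_1 \\ \phi(0.4) \cdot s_2 \end{pmatrix}, \begin{pmatrix} s_1^2 & 0 \\ 0 & s_2^2 \end{pmatrix} \right),$$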

where $\phi$ is the quantile function of the standard normal distribution. Note that instead of setting the means of $X_{1i}$ and $X_{2i}$ to 0 (symmetric around the threshold), we adopt an asymmetric design, with 40% of observations passing the threshold for each running variable. We do this because subjects are typically not symmetrically distributed around the threshold; most of the time, observations in need of treatment are a minority (compared to the whole population). Moreover, we follow Lo (2017) and consider $(s_1, s_2)$ being (1, 1), (3, 3), (10, 10), (3, 1), (10, 1), (10, 3). Finally, $\epsilon_i$ are i.i.d. standard normal random variables.
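The design of the running variables can be reproduced as in the following sketch, which draws one sample of size 5000 and checks the share of observations passing each threshold. The function name `draw_running_variables` and the seed are arbitrary choices made here.

```python
import numpy as np
from statistics import NormalDist


def draw_running_variables(n, s1, s2, rng):
    """Draw independent normal X_1, X_2 with means q * s_j and standard deviations s_j."""
    q = NormalDist().inv_cdf(0.4)  # standard normal quantile at 0.4, about -0.2533
    mean = np.array([q * s1, q * s2])
    sd = np.array([s1, s2])
    return mean + sd * rng.standard_normal((n, 2))


rng = np.random.default_rng(0)
X = draw_running_variables(5000, s1=3.0, s2=1.0, rng=rng)
share_pass = (X > 0).mean(axis=0)  # should be close to 0.4 for both variables
```

Because the mean is shifted in proportion to the standard deviation, the share passing the threshold does not depend on $(s_1, s_2)$.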

In DGP 1, we explore whether different marginal effects of $X_{1i}$ and $X_{2i}$ in the control group really make a difference. The three variables $v_{1i}$, $v_{2i}$, $v_{3i}$ are defined as follows:

$$v_{1i} = 1(u_i \le \Phi(a)) \cdot 1(X_{1i} < 0 \text{ and } X_{2i} < 0),$$

$$v_{2i} = 1(u_i \le \Phi(a)) \cdot 1(X_{1i} X_{2i} < 0),$$

$$v_{3i} = 1(u_i \le \Phi(b)) \cdot 1(X_{1i} \ge 0 \text{ and } X_{2i} \ge 0),$$

where $u_i \overset{iid}{\sim} N(0, 1)$ and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Through $v_{1i}$, $v_{2i}$, $v_{3i}$, we set the treatment probability in the control group and the treatment group to be a and b, respectively; however, the treatment effect in the third quadrant is $5 + X_{1i} + 0.5X_{2i}$, while the effect is $5 + 2X_{1i} + 2X_{2i}$ in the second and the fourth quadrants. In the simulation, we consider (a, b) = (0, 0.7), (0.15, 0.85), (0.3, 1), (0, 0.9), (0.05, 0.95), (0.1, 1). Moreover, in this DGP, we define the binary variable $w_i$ indicating whether the subject receives treatment or not as $w_i = v_{1i} + v_{2i} + v_{3i}$.

With DGP 2, we inspect the impact of different treatment probabilities in the control group. The treatment effect is the same for all subjects, namely $5 + X_{1i} + 0.5X_{2i}$. However, $w_i$ is defined as follows:

$$\begin{aligned} w_i = {} & 1(u_i \le \Phi(a)) \cdot 1(X_{1i} < 0 \text{ and } X_{2i} < 0) + 1(u_i \le \Phi(b)) \cdot 1(X_{1i} \ge 0 \text{ and } X_{2i} < 0) \\ & + 1(u_i \le \Phi(b)) \cdot 1(X_{1i} < 0 \text{ and } X_{2i} \ge 0) + 1(u_i \le \Phi(c)) \cdot 1(X_{1i} \ge 0 \text{ and } X_{2i} \ge 0), \end{aligned}$$

where again $u_i \overset{iid}{\sim} N(0, 1)$. In this way, the treatment probability in the third quadrant, a, differs from that in the second and the fourth quadrants, b, thus introducing heterogeneity in the control group.
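The quadrant-specific treatment probabilities of DGP 2 can be illustrated with the sketch below. It assumes, following the description in the text, that the intended treatment probabilities are a in the third quadrant, b in the second and fourth quadrants, and c in the first quadrant, and it draws uniform variates to achieve exactly these probabilities rather than reproducing the indicator formula literally. The function name and the values of (a, b, c) are hypothetical.

```python
import numpy as np


def draw_treatment_dgp2(X, a, b, c, rng):
    """Bernoulli treatment status with probability a (Q3), b (Q2 and Q4), c (Q1)."""
    x1, x2 = X[:, 0], X[:, 1]
    in_q3 = (x1 < 0) & (x2 < 0)
    in_q1 = (x1 >= 0) & (x2 >= 0)
    prob = np.where(in_q3, a, np.where(in_q1, c, b))
    return (rng.uniform(size=len(X)) < prob).astype(int)


rng = np.random.default_rng(1)
X = rng.normal(size=(200000, 2))
w = draw_treatment_dgp2(X, a=0.05, b=0.25, c=0.9, rng=rng)

q1 = (X[:, 0] >= 0) & (X[:, 1] >= 0)
q3 = (X[:, 0] < 0) & (X[:, 1] < 0)
rate_q1, rate_q3, rate_q24 = w[q1].mean(), w[q3].mean(), w[~q1 & ~q3].mean()
```

The three empirical treatment rates should be close to c, a, and b, respectively, which makes the heterogeneity within the control group directly visible.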

For DGP 3 and DGP 4, we examine the cases where both sources of heterogeneity in the control group are present. The three variables $w_i$, $w_{1i}$, $w_{2i}$ are of key importance in inducing different treatment effects and different treatment probabilities among the control group. They are defined as follows:

$$w_i = 1\{u_i \le \Phi(a)\} \cdot 1\{X_{1i} < 0 \text{ or } X_{2i} < 0\} + 1\{u_i \le \Phi(b)\} \cdot 1\{X_{1i} \ge 0 \text{ and } X_{2i} \ge 0\},$$

$$w_{1i} = 1\{u_i \le \Phi(a)\} \cdot 1\{X_{1i} < 0\} + 1\{u_i \le \Phi(b)\} \cdot 1\{X_{1i} \ge 0\},$$

$$w_{2i} = 1\{u_i \le \Phi(a)\} \cdot 1\{X_{2i} < 0\} + 1\{u_i \le \Phi(b)\} \cdot 1\{X_{2i} \ge 0\},$$

where $u_i \overset{iid}{\sim} N(0, 1)$ and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. By $w_i$ we create different treatment probabilities between observations in the first quadrant (namely b) and the others (a). In the simulation, we consider (a, b) to be (0.15, 0.85) or (0.05, 0.95) to investigate whether the size of the difference in treatment probability plays a role in estimation. The assignment rule is further complicated by $w_{1i}$ and $w_{2i}$, which introduce differences across the y-axis and the x-axis, respectively. In short, the distribution of $(w_i, w_{1i}, w_{2i})$ in each quadrant is summarized in Figure 3, which leads to the treatment effect assignment schedule of DGP 3 in Figure 4⁶. Note that the only difference between DGP 3 and DGP 4 is the presence of quadratic terms. For both DGPs, the desired local treatment effect at the threshold (0, 0) equals 5 (the coefficient of $w_i$).

For all four DGPs, the desired local treatment effect at the threshold (0, 0) equals 5. For the first two DGPs, we use linear polynomials for fitting, while for the latter two DGPs, we locally fit linear and quadratic polynomials, respectively. We also fit a quadratic polynomial to DGP 3 to observe the effect of redundant regressors. For bandwidth selection we follow CCT. As mentioned above, since CCT is originally designed for one-dimensional RDD, we use the separation method to adapt it to the two-dimensional case. After the bandwidth is chosen, we conduct an instrumental regression with the triangular kernel, $\kappa(x) = (1 - |x|)\,1(|x| \le 1)$.

Figure 3: Distribution of $(w_i, w_{1i}, w_{2i})$ in the four quadrants for DGP 3 and DGP 4. The proportion in parentheses is the probability of each outcome.

6 For brevity, the corresponding figure for DGP 4 is deferred to the appendix.