 Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: Re: st: Polynomial Fitting and RD Design

 From Austin Nichols To statalist@hsphsun2.harvard.edu Subject Re: Re: st: Polynomial Fitting and RD Design Date Fri, 2 Sep 2011 14:50:13 -0400

```Ariel Linden, DrPH <ariel.linden@gmail.com> et al.:
A couple of clarifications are in order. I agree that global
polynomials are generically a bad idea for the RD design, but
"considering a fractional polynomial model instead," one should be
comparing to the standard approach, not a polynomial.  I.e. Maarten's
example graph does not provide evidence for a fractional polynomial
model over local linear regression. Any quick simulation (as
below--increase to 10,000 reps to look at rejection rates or RMSE for
mfp, which look really bad) will probably convince one that local
linear regression is to be preferred to fractional polynomials, at
least as implemented in the example given. I would also take issue
with the assertion that "the optimal bandwidth is an automated process
that remove[s] the burden from the researcher from testing many
alternatives" and the -rd- package on SSC automates showing the
dependence of estimates on bandwidth for that reason (option -bdep-).

clear all
prog rdmfp, rclass
drawnorm z e, n(1000) clear
g t=z>0
g y=t/2+t*z^2-z+e
rd y z
ret scalar rd=_b[lwald]
ret scalar rdse=_se[lwald]
orthpoly z, gen(z*) degree(4)
forvalues i = 1/4 {
gen z`i'l=(1-t)*z`i'
}
forvalues i = 1/4 {
gen z`i'r=t*z`i'
}
reg y z?? t
ret scalar quartic=_b[t]
ret scalar quarticse=_se[t]
gen zl = (1-t)*z
gen zr = t*z
mfp, df(8): reg y zl zr t
ret scalar  mfp=_b[t]
ret scalar  mfpse=_se[t]
eret clear
end
set seed 1
simul, r(1000):rdmfp
su rd mfp quartic, d
tw kdensity rd||kdensity quartic||kdensity mfp, xli(.5)
tw kdensity rd||kdensity mfp, xli(.5) name(woq)

On Fri, Sep 2, 2011 at 10:19 AM, Ariel Linden, DrPH
<ariel.linden@gmail.com> wrote:
> I'd like to add a little more verbiage to Austin's points:
>
> Perhaps one of the most important developments in the RD design in the last
> decade, is to focus attention at the cutoff. In other words, we expect that
> if individuals cannot manipulate their assignment score, then individuals
> close to the cutoff on either side, should be comparable, and thus, the
> design is "as good as randomized".
>
> In order for this to be valid, the primary issue is determining the size of
> the neighborhood around the cutoff (where the model will be performed). As
> Austin points out, this is a function of both kernel (usually the triangle
> kernel is used in the RD context) and bandwidth (a lot of choices available
> for this, but the optimal bandwidth is an automated process that remove the
> burden from the researcher from testing many alternatives).
>
> So why is this important? Well, as we see from this thread, when one uses
> all the data (not restricted to a local neighborhood around the cutoff),
> then model fit away from the cutoff comes into play. This leads to the use
> of polynomials (and all the problems we see with that). Conversely, using a
> local neighborhood allows us to focus on the individuals who are most
> similar.
>
> As a complete aside, Maarten, I love the code you wrote... :-)
>
> Ariel
>
> Date: Thu, 1 Sep 2011 07:43:44 -0400
> From: Austin Nichols <austinnichols@gmail.com>
> Subject: Re: st: Polynomial Fitting and RD Design
>
> Maarten--
> Note that the standard for this design is local linear regression with
> a triangle (AKA edge) kernel, as implemented in -rd- on SSC.  But the
>
> On Thu, Sep 1, 2011 at 5:24 AM, Maarten Buis <maartenlbuis@gmail.com> wrote:
>> On Thu, Sep 1, 2011 at 10:31 AM, Nick Cox wrote:
>>> Sure, but that still leaves the non-numeric issues. I guess the main
>>> issue is just reproducing behaviour with smooth curves, but what
>>> arguments justify any kind of quartic here?
>>
>> No disagreement with you on that point. Actually I think that such
>> high degree polynomial is rather dangerous for this purpose as these
>> curves tend to move rather wildly away from the data at the extreme
>> ends of the curve, and in these models the break is such an extreme
>> end. As a consequence the break dummy may just capture this misfit to
>> the data rather than a real break. Patrick may want to consider a
>> fractional polynomial model instead. Below is an example on how to
>> estimate both models, the graph shows that the quartic curve does show
>> that wild behavior at the break, and the fractional polynomial model
>> shows that that is due to overfitting the curve as in this case two
>> linear curves will do just fine.
>>
>> *--------------- begin example -----------------
>> sysuse uslifeexp, clear
>> drop if year == 1918 // Spanish flu pandemic
>> gen cyear = year - 1950 // center at break
>>
>> // 4th degree polynomial
>> orthpoly cyear , gen(oyear*) degree(4)
>> gen D = cyear > 0 if year < .
>> forvalues i = 1/4 {
>>        gen oyear`i'l = (1-D)*oyear`i'
>> }
>> forvalues i = 1/4 {
>>        gen oyear`i'r = D*oyear`i'
>> }
>>
>> // fit model
>> reg le oyear?? D
>>
>> // predict outcome
>> predict pol
>>
>> // fractional polynomial
>> gen cyearl = (1-D)*cyear
>> gen cyearr = D*cyear
>>
>> // fit model
>> mfp, df(8) : reg le cyearl cyearr D
>>
>> // predict outcome
>> predict mfp
>>
>> // Graph the models
>> twoway line le pol mfp year,            ///
>>       xline(1950)                      ///
>>       lstyle(solid solid solid)        ///
>>       lcolor(black red blue)           ///
>>       legend(order( 1 "data"           ///
>>                     2 "quartic"        ///
>>                     3 "fractional"     ///
>>                       "polynomial"  ))
>> *---------------- end example ------------------
>>  (For more on examples I sent to the Statalist see:
>> http://www.maartenbuis.nl/example_faq )
>>
>> Hope this helps,
>> Maarten

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```