st: Polynomial Fitting and RD Design

From   "Patrick Button"
Subject   st: Polynomial Fitting and RD Design
Date   Wed, 31 Aug 2011 18:54:52 -0700

Hello Stata users,

I've been getting some unexpected Stata output when fitting polynomials
using a pretty simple OLS regression.

I am replicating a regression discontinuity design paper (Lee, Moretti and
Butler 2004), working from the authors' posted code and data (I am using
enricoall2.dta).

I need to run a regression that fits a 4th-degree polynomial separately on
each side of the cutoff, that is, for values of the running variable, x,
below 0.5 and at or above 0.5. The regression also includes a dummy
variable equal to 1 if x >= 0.5 and 0 otherwise. If there is a
discontinuity at 0.5, then this is picked up in the coefficient on that
dummy variable.
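
For reference, the pooled specification described above can be written out as a single equation. This is only a sketch of what the paragraph describes; the symbols tau, a_k and b_k are labels chosen here and do not come from the post or the paper.

```latex
% Pooled model: one quartic in x on each side of the cutoff, plus the dummy D
y_i = \beta_0 + \tau D_i
      + \sum_{k=1}^{4} a_k (1 - D_i)\, x_i^k
      + \sum_{k=1}^{4} b_k D_i\, x_i^k
      + \varepsilon_i,
\qquad D_i = \mathbf{1}[x_i \ge 0.5]

% Difference between the right-hand and left-hand fitted values at x = 0.5
\text{jump at } 0.5 \;=\; \tau + \sum_{k=1}^{4} (b_k - a_k)\,(0.5)^k
```

Written this way, the coefficient on D by itself is the gap between the two fitted curves at x = 0. It equals the jump at 0.5 only if the polynomial terms are built from x - 0.5 instead of x, because then every power term vanishes at the cutoff.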

In this case the running variable is the vote share that the Democratic
candidate got in U.S. House of Representatives elections, computed from
the Democratic and Republican votes only. So x < 0.5 means a Republican
won, and x >= 0.5 means a Democrat won.

I would like to pool the data instead of running a separate regression for
each side. This is one of the recommended methods in the RD literature.
For some reason this method does not appear in the authors' code so I need
to do it myself.

I'm running and setting up the regression as follows:

gen x = demvoteshare

* D = 1 if the Democrat won (x >= 0.5), 0 otherwise; missing x stays missing
gen D = (x >= 0.5) if !missing(x)

*Left Side Polynomial
gen xa = (1-D)*x
gen x2a = (1-D)*x^2
gen x3a = (1-D)*x^3
gen x4a = (1-D)*x^4

*Right Side Polynomial
gen xb = D*x
gen x2b = D*x^2
gen x3b = D*x^3
gen x4b = D*x^4

regress realincome D xa x2a x3a x4a xb x2b x3b x4b
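
One thing that may be worth checking is whether the running variable is measured from the cutoff. The following is a minimal sketch of the same pooled regression with the vote share centered at 0.5, assuming enricoall2.dta is already in memory. The names xc, Dc, l1-l4 and r1-r4 are hypothetical, chosen only so they do not clash with the variables created above.

```stata
* Sketch only: assumes enricoall2.dta is already in memory, as in the post.
* xc, Dc, l1-l4 and r1-r4 are hypothetical names used for illustration.
gen xc = demvoteshare - 0.5            // running variable measured from the cutoff
gen Dc = (xc >= 0) if !missing(xc)     // 1 = Democrat won, 0 = Republican won

* Left-side polynomial (x below 0.5)
gen l1 = (1-Dc)*xc
gen l2 = (1-Dc)*xc^2
gen l3 = (1-Dc)*xc^3
gen l4 = (1-Dc)*xc^4

* Right-side polynomial (x at or above 0.5)
gen r1 = Dc*xc
gen r2 = Dc*xc^2
gen r3 = Dc*xc^3
gen r4 = Dc*xc^4

* With xc = 0 at the cutoff, all polynomial terms are zero there, so the
* coefficient on Dc is the difference between the two fitted curves at 0.5.
regress realincome Dc l1 l2 l3 l4 r1 r2 r3 r4
```

The fitted values should be the same as in the uncentered regression, since both span the same two quartics. Only the meaning of the dummy coefficient changes.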


Based on what the authors of the paper got, graphical analysis, and logic,
there should be no jump in realincome at 0.5. There is no reason why
income should be suddenly much different between districts that Democrats
just barely won and districts that they just barely lost. If it were, this
would invalidate the regression discontinuity design. So the coefficient
on D should be statistically insignificant. However, I get the following
results:

  realincome |      Coef.   Std. Err.      t    P>|t|   95% CI lower bound
           D |   497414.5   94802.12     5.25   0.000       311589
          xa |   34396.25   27783.67     1.24   0.216    -20063.66
         x2a |  -22571.61   234577.9    -0.10   0.923    -482377.5
         x3a |  -429659.3   655505.3    -0.66   0.512     -1714542
         x4a |   667813.9   598416.4     1.12   0.264    -505166.7
          xb |   -2805647   534665.3    -5.25   0.000     -3853667
         x2b |    5828381    1112850     5.24   0.000      3647038
         x3b |   -5281210    1012800    -5.21   0.000     -7266441
         x4b |    1754682   339914.5     5.16   0.000      1088402
       _cons |   31536.64   501.1422    62.93   0.000     30554.33
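
As a rough check on these numbers, the gap between the two fitted polynomials at x = 0.5 can be computed by hand from the reported coefficients. Because x is not centered, that gap is the coefficient on D plus the difference between the right-side and left-side polynomial terms evaluated at 0.5. The sketch below is plain arithmetic on the rounded coefficients in the table; the scalar name jump is hypothetical.

```stata
* Sketch only: arithmetic on the rounded coefficients reported above.
* "jump" is a hypothetical scalar name. Run from a do-file (uses /// continuation).
scalar jump = 497414.5                          /// coefficient on D
    + 0.5    * (-2805647 - 34396.25)            /// (xb  - xa)  * 0.5
    + 0.25   * (5828381 - (-22571.61))          /// (x2b - x2a) * 0.5^2
    + 0.125  * (-5281210 - (-429659.3))         /// (x3b - x3a) * 0.5^3
    + 0.0625 * (1754682 - 667813.9)             //  (x4b - x4a) * 0.5^4
display "Implied gap in fitted realincome at x = 0.5: " jump
```

This comes to roughly 1,616, which is small next to the constant of about 31,537, whereas the coefficient on D alone (497,414.5) is the gap between the two curves extrapolated to x = 0. Stata's lincom command, run right after the regress command with the same weights on D, xa-x4a and xb-x4b, would report this combination together with a standard error and confidence interval.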

I have no idea why D is statistically significant, and why only the
polynomial on the right side is statistically significant. This is not
just a problem with this regression. I get messed up results for every
regression I run that has a 4th degree polynomial on each side of 0.5.

However, I do not get weird results like this when I use just one 4th-degree
polynomial (a single polynomial over the entire range of x) with the D dummy.
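
The single-polynomial regression is not shown in the post, so the following is only a guess at its general form, assuming enricoall2.dta is in memory. The names xs and Ds are hypothetical. One point in its favor is that a single quartic is continuous at 0.5, so the dummy coefficient measures the jump at 0.5 whether or not x is centered.

```stata
* Sketch only: assumes enricoall2.dta is in memory. xs and Ds are hypothetical names.
gen xs = demvoteshare
gen Ds = (xs >= 0.5) if !missing(xs)   // 1 = Democrat won, 0 = Republican won

* One quartic in xs shared by both sides of the cutoff, plus the win dummy.
* c.xs#c.xs is factor-variable notation for xs squared, and so on.
regress realincome Ds c.xs c.xs#c.xs c.xs#c.xs#c.xs c.xs#c.xs#c.xs#c.xs
```

Factor-variable notation avoids generating the power terms by hand; generating xs^2, xs^3 and xs^4 explicitly would give the same regression.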

Does anyone know what I am doing wrong? I have no idea, but I have a
feeling that I'm missing something obvious.

Thank you very much for your time and consideration.

Patrick Button
Ph.D. Student
Department of Economics
University of California, Irvine
