Re: st: Regression Discontinuity (RD) Designs, sharp discontinuity: basic question about implementation with "rd"


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Regression Discontinuity (RD) Designs, sharp discontinuity: basic question about implementation with "rd"
Date   Tue, 11 Oct 2011 21:53:23 -0400

Stefano Lombardi <lombardi_stefano@fastwebnet.it>:
Still not clear about which variables are real individual data and
which are means you have calculated. If nonedur is the individual
outcome variable and tenure is the assignment variable, and treatment
jumps discontinuously from zero to one at tenure=1094 (~36 mos), then
you should be using the two-variable syntax as elaborated in the Stata
Journal article linked from -rd-'s help file.  I find it hard to
believe that severance pay jumps from zero to one, though.  Perhaps
you simply don't have data on actual severance pay amounts?  The
binned scatterplot you seem to want can be constructed separately from
the rest, and added in:

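* mean outcome within each month-of-tenure bin, kept on one obs per bin
egen m=mean(nonedur), by(ten_cat)
bys ten_cat: replace m=. if _n>1
* assignment variable in days, centered at the cutoff
g Z=tenure-1094
* compute the bandwidth on the whole sample; rd saves it in e(w)
qui rd nonedur Z, mbw(100)
loc w=e(w)
* graph options (the macro definition must stay on one line)
loc o "graph mbw(100) sco(ms(i) xli(0)||sc m ten_cat) line(xti(Tenure relative to cutoff))"
* re-estimate and graph within the window of interest, reusing that bandwidth
rd nonedur Z if inrange(tenure,950,1150), bwidth(`w') `o'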

The above is similar to
http://www.stata.com/statalist/archive/2010-11/msg00131.html
except I allow rd to calculate the IK optimal bandwidth using the whole sample
before "zooming in" on your desired window.
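
On the R^2 = 1 you report: in a sharp design that is what one would expect, since treatment is then a deterministic function of the assignment variable. A minimal sketch of that check with tenure in days follows, assuming sevpay takes two values as you describe; Z0, D and ZD are hypothetical names introduced here, not variables in your data.

```stata
* Sketch only: sevpay, tenure and the 1094-day cutoff come from the thread;
* Z0, D and ZD are hypothetical names used for illustration.
g Z0 = tenure - 1094
* guard against missing tenure, since missing values count as >= 0 in Stata
g byte D = Z0 >= 0 if !missing(Z0)
g ZD = Z0*D
* treatment on the centered assignment variable, the dummy, and their interaction
regress sevpay Z0 D ZD
* cross-tabulate treatment against the cutoff dummy
tab sevpay D, missing
```

If the tabulation shows every observation with D==0 at one value of sevpay and every observation with D==1 at the other, the two-variable syntax is the appropriate one.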

On Tue, Oct 11, 2011 at 8:13 PM, Stefano Lombardi
<lombardi_stefano@fastwebnet.it> wrote:
> Dear Austin,
>
> Thank you very much for the reply. Here is some additional
> information about the dataset.
>
> About the forcing variable:
>    "ten_cat" is measured in months (12 - 58). The last 5 categories are full
> of missing values.
>    alternatively, "tenure" is the same variable measured in days. I would
> like to use this one, choosing the correct bandwidth.
>
> Just to give a rough idea of the data, here is the table of the
> frequencies of "ten_cat":
>
> . tabdisp ten_cat, cell(freq cumfreq)
>
> ----------------------------------
> job       |
> tenure    |
> categorie |
> s         |       freq     cumfreq
> ----------+-----------------------
>       13 |      14296       14296
>       14 |      13989       28285
>       15 |      13564       41849
>       16 |      12595       54444
>       17 |      11629       66073
>       18 |      11269       77342
>       19 |       9735       87077
>       20 |       9441       96518
>       21 |       8897      105415
>       22 |       8426      113841
>       23 |       7735      121576
>       24 |       7407      128983
>       25 |       5672      134655
>       26 |       5451      140106
>       27 |       5486      145592
>       28 |       5224      150816
>       29 |       5041      155857
>       30 |       4631      160488
>       31 |       4516      165004
>       32 |       4277      169281
>       33 |       4049      173330
>       34 |       4059      177389
>       35 |       4190      181579
>       36 |       3601      185180
>       37 |       2938      188118
>       38 |       2937      191055
>       39 |       3006      194061
>       40 |       2790      196851
>       41 |       2680      199531
>       42 |       2609      202140
>       43 |       2417      204557
>       44 |       2414      206971
>       45 |       2257      209228
>       46 |       2221      211449
>       47 |       2300      213749
>       48 |       1725      215474
>       49 |       1682      217156
>       50 |       1809      218965
>       51 |       1730      220695
>       52 |       1602      222297
>       53 |       1579      223876
>       54 |       1464      225340
>       55 |       1486      226826
>       56 |       1458      228284
>       57 |       1384      229668
>       58 |       1375      231043
> ----------------------------------
>
> Severance pay takes two possible values: people are treated at tenure = 1094
> (days) or at ten_cat = 36 (months). What I expect is that after the cut-off
> the mean of nonemployment duration (y_bar, in days) rises.
> Notice however that severance pay is generally delivered within one month of
> job termination, but I have no information about the exact moment at which
> the sum of money is paid.
>
> Since I have the forcing variable both in months and in days, I have plotted
> the following graphs:
> - y_bar VS tenure: the scatterplot is quite dispersed around the threshold,
> but there is a clearly evident decreasing trend before the cut-off, then an
> increasing trend starting from the right of the cut-off. With a straight
> line fitted to the left and another to the right of the cut-off,
> the average treatment effect is about 9.5 days.
> - y_bar VS ten_cat: there is a clear jump between 36 and 37 (y_bar is
> 148 and 161, respectively). After the jump the observations stay steadily
> higher than the ones to the left of the cut-off.
>
> The regression you told me to run (using either ten_cat or tenure) gives
> R^2 = 1, with the dummy explaining the entire variation in
> severance payment.
>
> Using Z in days and running rd nonedur Z, bdep the problem seems to be
> overcome (I don't know why, anyway)! I get:
>
> Two variables specified; treatment is
> assumed to jump from zero to one at Z=0.
>
>  Assignment variable Z is Z
>  Treatment variable X_T unspecified
>  Outcome variable y is nonedur
>
> Estimating for bandwidth 14.14255035704279
> Estimating for bandwidth 7.071275178521395
> Estimating for bandwidth 28.28510071408558
> ------------------------------------------------------------------------------
>     nonedur |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>       lwald |   30.76441     8.9709     3.43   0.001     13.18177    48.34705
>     lwald50 |   34.90172   14.25218     2.45   0.014     6.967965    62.83548
>    lwald200 |   23.17764   6.553702     3.54   0.000     10.33262    36.02265
> ------------------------------------------------------------------------------
>
> With bandwidths 7.1 and 14 the estimated effect is not precise; I would go for
> the third one. However, since I have many observations close to the cut-off,
> I could probably also restrict the window of observations considered
> through the "n(real)" option. Is that sensible?
>
> Also, if I plot the graph through the option "gr" it is not informative: all
> the observations are plotted (basically the entire graph is completely full
> of dots) and not the means of nonedur. Also, the X-axis range is the entire
> forcing variable range, but I just want a "zoom" near the cut-off (let's
> say, between 950 and 1150). I probably have to work with "scopt", but how
> exactly?
>
> Thank you very much!!
>
> Stefano
>
> On 11/10/2011 19:43, Austin Nichols wrote:
>>
>> Stefano Lombardi<lombardi_stefano@fastwebnet.it>:
>> Apparently there is a problem in your data; if you give us information
>> about the actual data, maybe we can diagnose it.
>> Is ten_cat measured in days, so that it takes on a larger number of
>> discrete values, many of which are close to the threshold, or does it
>> take on a small number of discrete values?
>> Does sevpay take on one of two possible values, or is it more continuous?
>> What happens when you regress sevpay on z=(ten_cat-36) and a dummy for
>> z>=0 (ten_cat>=36), and their interaction?
>> What happens when you type
>> g z=ten_cat-36
>> rd nonedur z, bdep
>> ?
>> The bandwidth calculations assume the data far from the cutoff have
>> NOT "already been manually eliminated" as you have done, so you may
>> want to clarify how you want to estimate the optimal bandwidth.
>>
>> On Tue, Oct 11, 2011 at 1:12 PM, Stefano Lombardi
>> <lombardi_stefano@fastwebnet.it>  wrote:
>>>
>>> Hi Ariel,
>>>
>>> thank you very much for your interest. You got the correct interpretation
>>> for X and the cut-off as well.
>>>
>>> With respect to the treatment ("severance payment"), I wrote a bit
>>> confusingly. The "job tenure" variable is sharply discontinuous at month 36,
>>> in the sense that if a person is laid off after having worked for 13 or 14
>>> or ... 35 months in the same place, he is not going to receive any sort of
>>> lump-sum payment. Otherwise, if one works for 36 months or more and is laid
>>> off, then the employer is obliged to immediately pay him a fixed amount of
>>> money (three months of salary of the job just lost).
>>>
>>> Hence, every person in my dataset has been laid off, but only some will
>>> receive the lump-sum severance payment (with probability 1 after 36 months
>>> of job tenure). The thing that can probably cause some confusion is that I
>>> am not considering any unemployment benefit (which starts at a certain point
>>> and then continues to be received over time), but a "one-time" payment.
>>> Also, we are interested in knowing whether this kind of treatment affects
>>> the duration of unemployment (the "nonemployment" duration, which goes from
>>> the layoff to the start of the new job).
>>>
>>> You are completely right: job position could be a very important issue. But
>>> the dataset is quite homogeneous from this point of view. In any case, in
>>> the hypothesis-checking part of the work I have graphically considered
>>> whether there is a "jump" in this variable at the threshold. So you are
>>> right, but I can still check whether there is a violation of the continuity
>>> assumption at the threshold, and actually (at least from a graphical point
>>> of view) there is no evidence of that.
>>>
>>> The same reasoning holds for the previous job's salary level. Since the
>>> severance payment equals three months of salary from the last job, the size
>>> of the payment is not the same for everyone who receives it. But again, the
>>> previous salary range is not very wide. There are indeed some extreme cases
>>> in both directions, but from a graphical point of view the "previous salary"
>>> variable passes quite smoothly through the cut-off.
>>>
>>> One main concern could be that employers fire more people "just to the
>>> left" of the 36-month cut-off (in order to avoid the compulsory payment).
>>> But this is not the case: the number of layoffs (vs the previous job tenure)
>>> does not change much at the threshold. For people more familiar with the
>>> labor economics framework, my dataset is quite comparable with the one in
>>> David Card's work of 2007. Of course a certain dose of criticism is always
>>> necessary, but I consider that a very good work, and I wanted to start from
>>> that one.
>>>
>>> Actually, none of the other variables that could cause problems seems to
>>> be discontinuous at the threshold. Hence I would have liked to proceed with
>>> the "rd" command, but I really cannot understand what the syntax/input
>>> problem is.
>>>
>>> Basically, on the y axis I want the mean nonemployment duration (in days),
>>> while on the X axis I want the job tenure in months. Hence I computed the
>>> mean of y conditional on X. I did this through:
>>>
>>> egen cond_mean_y = mean(nonedur), by(ten_cat)
>>>
>>> Now I have, for each job tenure month between 13 and 52, the corresponding
>>> mean of the nonemployment duration (and I can easily make the plot). But
>>> then why does "rd" not return the same? Where did I go wrong?
>>>
>>> I believe that "rd" should "automatically" do it by (1) including "job
>>> tenure" in days, and (2) choosing the correct bandwidth. The first thing
>>> that I tried was to include the forcing variable as continuous, but I
>>> couldn't manage to get the graph I described in the paragraph above.
>>>
>>> And apart from the graph itself, I am clearly making some kind of error
>>> somewhere in the "rd" command, since I receive the error which I reported
>>> in the last post. It is also clear that the error is due to my ignorance,
>>> but how can I solve this problem?
>>>
>>> Thank you very much,
>>>
>>> Stefano
>>>
>>> I clearly have to make Stata consider just the points near the cut-off in
>>> order to estimate the jump. However, even without making that explicit, I
>>> think that Stata should do it by itself. About the bandwidth, if I am not
>>> wrong, Stata chooses the optimal one and also tries two others.
>>>
>>> I do not understand
>>>
>>> d if I have to insert the average
>>>
>>>
>>> On 11/10/2011 17:09, Ariel Linden, DrPH wrote:
>>>>
>>>> Hi Stefano,
>>>>
>>>> I am a bit confused by your variables. If I understand correctly, your X
>>>> variable is previous job tenure, which ranges from 0-52 months, and your
>>>> cutoff is 36. However, your "treatment" is whether a person gets severance,
>>>> which, I am assuming, can be at any point along the X variable continuum?
>>>>
>>>> In the RD design, the cutoff is the treatment assignment, so to make it
>>>> work, you'd have to have everyone at or above 36 months receive severance
>>>> and everyone below 36 months not receive severance. I am not sure that is
>>>> what you have done here?
>>>>
>>>> I am not an economist (I don’t even play one on television), but I am not
>>>> sold on the premise that length of previous tenure is associated with the
>>>> outcome variable (unless it is mediated vis-à-vis the severance). I also
>>>> assume that the size of the severance will be associated with the Y
>>>> variable, and may or may not have a strong independent association with the
>>>> X variable (the recent CEO of HP just got fired after a year on the job and
>>>> got a multi-million dollar severance). Thus, the type of position (or
>>>> perhaps salary level of previous job) will moderate the relationship.
>>>>
>>>> Therefore, I am not sure you have the right variables, or the right
>>>> modeling approach here. Perhaps you should consider switching to a
>>>> mediation (controlling for moderators) approach, or perhaps a time series
>>>> approach with two or three variables, (a) length of previous job tenure,
>>>> (b) length of time unemployed thereafter, (c) relative size of severance?
>>>>
>>>> I hope this helps
>>>>
>>>> Ariel
>>>>
>>>>
>>>>
>>>>
>>>> Date: Mon, 10 Oct 2011 21:15:37 +0200
>>>> From: Stefano Lombardi<lombardi_stefano@fastwebnet.it>
>>>> Subject: st: Regression Discontinuity (RD) Designs, sharp discontinuity:
>>>> basic question about implementation with "rd"
>>>>
>>>> Hello everybody,
>>>>
>>>> I have a big problem in computing a sharp regression discontinuity
>>>> design via the "rd" command. I have read a number of papers about the
>>>> underlying theory, but I cannot carry out even a very basic RD design.
>>>> Unfortunately I found very little information on Statalist and on the
>>>> whole Internet as well. Could you please give me a hand? Every comment
>>>> would be tremendously helpful. Here is my (labor economics) setting:
>>>>
>>>> "tenure_cat":    discrete forcing variable, Z = last job tenure (in
>>>> months = 13, 14, ..., 52)
>>>> "severance":     treatment, X_T = lump-sum severance payment
>>>> "nonendur":     outcome, y = non-employment duration (days between the
>>>> layoff and the start of the new job)
>>>> The cut-off is at Z_0 = 36 months (after three years of job tenure, a
>>>> person who is laid off is going to receive a severance payment with
>>>> probability 1).
>>>> Does the severance payment cause a variation in the job search?
>>>>
>>>> I also have "mean_nonedur" = "nonedur" mean conditioned on "tenure_cat"
>>>> (basically the mean of y for each month between 13 to 52)
>>>>
>>>> My aim is to set up an RD design with the mean nonemployment duration in
>>>> days against Z in months. My first best would be to estimate the outcome
>>>> gap through a second- or higher-order polynomial. All the data "far" from
>>>> the cut-off have already been manually eliminated, hence I simply need
>>>> to run the RD design with all the available data.
>>>>
>>>>
>>>> As very first step, I simply tried to run the following command:
>>>>
>>>> . rd nonedur sevpay ten_cat, z0(36)
>>>> Three variables specified; jump in treatment
>>>> at Z=36 will be estimated. Local Wald Estimate
>>>> is the ratio of jump in outcome to jump in treatment.
>>>>
>>>>   Assignment variable Z is ten_cat
>>>>   Treatment variable X_T is sevpay
>>>>   Outcome variable y is nonedur
>>>>
>>>> Estimating for bandwidth 9.826534218815946
>>>> A predicted value of treatment at cutoff lies outside feasible range;
>>>> switching to local mean smoothing for treatment discontinuity.
>>>> score variables for model __00000P contain missing values
>>>> r(322);
>>>>
>>>> It is probably nonsense, but I also tried to run the same command with
>>>> "mean_nonedur" instead of "nonedur", with the same result from Stata.
>>>>
>>>> Could you give me any suggestion about this issue? Is there something
>>>> related to the bandwidth choice?
>>>>
>>>> Thank you very much,
>>>>
>>>> Stefano Lombardi
