# Re: st: Sample selection and endogeneity (or, combining heckman and ivreg)

From: Austin Nichols
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Sample selection and endogeneity (or, combining heckman and ivreg)
Date: Thu, 6 Aug 2009 12:21:23 -0400

For GLM and GMM, you can read the Stata 11 manual entries for -gmm-
and -glm- and refs cited therein, or Cameron and Trivedi (two books
available from Stata's bookstore).

You can run a simulation for some independent normally distributed X
variables and get one result, then run for some data that looks like
yours and get a totally different result, so it makes sense to use
data that looks like yours (same covariance structure)--it's easiest
to just start with your data, and modify it as needed.  The
modifications would be: you specify the errors and the coefficients,
so you know the true relationship between X and y, then you try to
estimate it.
The simulation comes in because you specify distributions for error
terms and then you draw all the error terms needed 100 times, or
(better) 10000 times, to assess the distribution of estimated coefs
around true coefs, and rejection rates.
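
A minimal sketch of that recipe when you do start from an existing
dataset. Everything specific here is an assumption made for
illustration: Stata's bundled auto.dta stands in for "your data", the
program name simreal is made up, and the true coefficients (0.5 on mpg,
-0.002 on weight) and the standard normal error are arbitrary choices.
The X variables keep their real covariance structure; only the error
and the outcome are redrawn on each replication.

```stata
clear all
* simreal (hypothetical name): one replication on real X, simulated y
program define simreal, rclass
    * stand-in for your own data; replace with -use yourdata, clear-
    sysuse auto, clear
    * you choose the error distribution ...
    generate double e = rnormal()
    * ... and the true coefficients, so the truth is known
    generate double ysim = 1 + 0.5*mpg - 0.002*weight + e
    regress ysim mpg weight
    return scalar b_mpg = _b[mpg]
    * test the true value; rejections here are false rejections
    test mpg = 0.5
    return scalar rej_mpg = (r(p) < .05)
end
set seed 12345
simulate b_mpg=r(b_mpg) rej_mpg=r(rej_mpg), reps(1000) nodots: simreal
* mean of b_mpg vs. 0.5 gauges bias; mean of rej_mpg is the rejection rate
summarize b_mpg rej_mpg
```

Because OLS is correctly specified in this toy setup, the mean of
rej_mpg should sit near the nominal 5%; an estimator that is biased or
has bad standard errors would show up as a rate well above that.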

For example (note I don't have your data, so I start by making data up
with -drawnorm-):

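clear all
* pheck: one simulation draw; corr() is the correlation between u and v
prog pheck, rclass
    syntax [, Corr(real .1) ]
    * correlated errors: u enters y1, v enters y2 and the selection rule
    matrix C = (1, `corr' \ `corr' , 1)
    drawnorm u v, n(2400) corr(C) clear
    * 60 clusters of 40 obs each; mv is the cluster mean of v
    g long i=mod(_n-1,60)+1
    egen mv=mean(v), by(i)
    forv i=2/5 {
        g x`i'=rnormal()
    }
    * x1 is endogenous because it depends on mv
    g x1=mv+x2+rnormal()
    * y1 is an endogenous dummy
    g y1=(-x1/5-x3/5+u>0)
    * true coefs on y1, x1, x4 and x5 are all 0.2
    g y2star=(y1/5+x1/5+x4/5+x5/5+v)
    * y2 is observed only when s==1
    g s=(v+x1/5+x2/5+x3/5>0)
    g y2=y2star if s
    * (1) OLS on the selected sample
    reg y2 y1 x1 x4 x5, cluster(i)
    foreach v of varlist y1 x1 x4 x5 {
        return scalar rb_`v'=_b[`v']
        return scalar rs_`v'=_se[`v']
    }
    test x1=.2
    return scalar rrej_x1=(r(p)<.05)
    * (2) ad hoc method: probit for y1, then -heckman- with the
    *     generalized residual (inverse Mills ratio terms) added
    probit y1 x1 x3
    predict double xbeta1, xb
    * p is the predicted Pr(y1=1), used as an instrument in (3)
    predict p
    gen double im=normalden(xbeta1)/normal(xbeta1) if y1==1
    replace im=-normalden(xbeta1)/(1-normal(xbeta1)) if y1==0
    heckman y2 y1 x1 x4 x5 im, sel(x1 x2 im) cluster(i) iterate(1000)
    * keep -heckman- results only when it converged
    if e(cmd) == "heckman" {
        if e(converged) == 1 {
            foreach v of varlist y1 x1 x4 x5 {
                return scalar hb_`v'=_b[`v']
                return scalar hs_`v'=_se[`v']
            }
            test x1=.2
            return scalar hrej_x1=(r(p)<.05)
        }
    }
    * (3) two-step GMM IV with -ivreg2- (from SSC)
    ivreg2 y2 (y1 x1=p x2 x3) x4 x5, gmm2s cluster(i)
    foreach v of varlist y1 x1 x4 x5 {
        return scalar ib_`v'=_b[`v']
        return scalar is_`v'=_se[`v']
    }
    test x1=.2
    return scalar irej_x1=(r(p)<.05)
    eret clear
end
set seed 1
pheck
simulate, reps(100): pheck
* densities of the estimated x1 coef, with a line at the true value 0.2
tw kdensity ib_x1 || kdensity hb_x1 || kdensity rb_x1, xli(.2)
su *b_x1 *rej* *b_y1, sep(3)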

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       ib_x1 |       100    .1538109    .0398201     .08462   .2944242
       hb_x1 |        58    .4007295    .0309691   .3406418    .499454
       rb_x1 |       100    .0452054    .0146647   .0097073   .1013072
-------------+--------------------------------------------------------
     irej_x1 |       100         .43    .4975699          0          1
     hrej_x1 |        58           1           0          1          1
     rrej_x1 |       100           1           0          1          1
-------------+--------------------------------------------------------
       ib_y1 |       100    1.686989    .4401385   .9865659   3.472792
       hb_y1 |        58    2.718954    .3095008   2.231136   3.557839
       rb_y1 |       100    .2975349    .0352731   .2353329   .3744754

In this example, IV gets closest to the true coef on x1 of 0.2 (mean
estimate 0.154) but overrejects by a huge margin, rejecting the true
value in 43% of replications at a nominal 5% level (IV typically has a
fraction of the OLS bias in finite samples).  Both OLS and the ad hoc
method using -heckman- do a terrible job and reject the true value in
every replication (and -heckman- converges inside 1000 iterations in
only 58 of the 100 cases, so the code takes forever to run).
OLS looks better than IV and the ad hoc method for the coef on y1
(mean estimate 0.30 against a true value of 0.2), but none of the
methods performs adequately.

For your case, I would forget about the selection problem, and run
some panel data model with instruments.  If you want to take a "more
correct" GMM approach and stack equations for the count of number of
bonds issued in a period and equations for spreads (or log-spreads) on
those bonds, you will need to find a coauthor, I suspect.  But the
-gmm- command in Stata 11 will help, probably.
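
As a rough illustration of what "some panel data model with
instruments" can look like, here is a minimal sketch using -xtivreg-
with fixed effects.  The nlswork dataset (fetched over the internet by
-webuse-) is only a stand-in, and the choice of tenure as the endogenous
regressor with union and south as instruments is an assumption borrowed
for illustration; none of it comes from the bond data discussed in this
thread.

```stata
* Sketch only: nlswork is a stand-in panel, not the bond data
webuse nlswork, clear
* declare the panel structure (person id and year)
xtset idcode year
generate double age2 = age^2
* fixed-effects IV: tenure is treated as endogenous and
* instrumented by union and south (illustrative choice)
xtivreg ln_wage age age2 not_smsa (tenure = union south), fe
```

With your own data you would replace the outcome with the spread (or
log-spread), put the endogenous regressors and their instruments inside
the parentheses, and check instrument strength before trusting the
estimates.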

On Thu, Aug 6, 2009 at 2:30 AM, kokootchke<kokootchke@hotmail.com> wrote:
> Austin, thank you very much for your response. I agree that not having
> a reference would weaken my results and this is why I'm trying to see
> if someone in this Stata group can point in the right direction. I have
> thought about the simulations as well and I'm contemplating doing that,
> but I've never done this before and would like some pointers as to
> where I should start. Would you have any suggestions or do you have a
> reference that could help in that regard? Also, what do you mean by "for samples that look like yours"?

> This is a very good point. I have also never used GLM/GMM in this context before, so could you please be more specific regarding what I need to know or where I should look in order to consider this option and try to implement it?