
# st: Endogeneity and Panel Data : treatreg, ivregress or .. ? Any suggestion would be really appreciated !

From: John Litfiba

To: statalist@hsphsun2.harvard.edu

Subject: st: Endogeneity and Panel Data : treatreg, ivregress or .. ? Any suggestion would be really appreciated !

Date: Sun, 13 Nov 2011 15:41:46 +0100

```
Dear Stata List,
Dear Mark Schaffer (I guess ;-) )

I have an econometric question related to endogenous variables and
panel data, and I believe that it can be interesting for anyone who
uses longitudinal data.

Here's the context :

I have a panel dataset of individuals who, at any time t, could
endogenously choose the value of a variable E (for endogenous). E is
not ordered and could take a few values (in my case, 6 possible
choices).

I am particularly interested in the effect of one of these choices on
a fully continuous outcome variable Y.

That is, at any time and for any individual I would like to estimate

Yit=a+bXit+cZit+eit

where, for example, Z is a binary variable that equals 1 if
individual i chooses E="the value of interest" at time t, and zero
otherwise. Variables in X are assumed to be exogenous.
I believe I have a good instrument for Z, along with other demographic
control variables, and therefore I guess I have basically two
choices in order to take into account the panel nature of my dataset:

1) using ivreg2 with the option cluster(id) and correcting for the
endogenous part with (Z= instrument + age + location of birth).
However Z is a dummy variable... I know this should not be a problem
but...
2) using treatreg with the option vce(bootstrap, cluster(id)
reps(400)) and modeling the choice of E=2 (that is Z=1) with treat(Z=
instrument + age + location of birth)
3) I tried to use xtivreg2 with fixed effects, but location of birth
is time invariant (and I believe very important in order to understand
Z) so it cannot be estimated.

Is my approach correct ? Do you perhaps have other ways to tackle
this multiple-choice endogeneity problem ?

Moreover, in the context of panel data, do I always need to use
clustering on id in order to have correct standard errors ?
My dataset is large, but I have much more time variation than
clusters. About 200 000 individuals and 10 million observations for
the whole dataset.
The period where the instrument is available reduces the dataset
considerably : 1 million observations and about 20 000 individuals.
An important remark : the panel is NOT balanced. So individuals could
come in and out of the dataset during the 10 year period covered by my
dataset. Some thus have very few observations, and some have hundreds
of rows.