# Re: st: IV with missing values

From: "Stas Kolenikov"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: IV with missing values
Date: Tue, 22 Jul 2008 09:50:22 -0500

I am not sure you will see any efficiency gains in trying to predict
y2 for the rest of the sample and plugging it back into the second stage
regression, even if there were a way to get the standard errors right.
With some extraordinary stretch of imagination (such as assuming
multivariate normality of everything), you could get a maximum
likelihood estimate of the joint covariance matrix of x, y1, y2 and z
using, say, the EM algorithm, and then form the estimate of b from that
matrix, getting the standard errors by the delta method. This might
even work for non-normal data provided you are able to estimate
variability in that covariance matrix consistently. But as I said, I
would imagine the efficiency gains will hardly justify the trouble.
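
To make the step from the covariance matrix to b concrete, here is a sketch for the just-identified case in the question (one instrument z, one exogenous regressor x). It assumes Cov(x, u) = Cov(z, u) = 0 and writes sigma with subscripts for the elements of the joint covariance matrix; that notation is introduced here for illustration and is not in the original thread. Taking covariances of y1 = ax + by2 + u with x and with z gives two linear equations in a and b, which can be solved for b:

$$
\begin{aligned}
\sigma_{x y_1} &= a\,\sigma_{xx} + b\,\sigma_{x y_2},\\
\sigma_{z y_1} &= a\,\sigma_{zx} + b\,\sigma_{z y_2},
\end{aligned}
\qquad\Longrightarrow\qquad
b = \frac{\sigma_{xx}\,\sigma_{z y_1} - \sigma_{zx}\,\sigma_{x y_1}}
         {\sigma_{xx}\,\sigma_{z y_2} - \sigma_{zx}\,\sigma_{x y_2}} .
$$

Since b is a smooth function of the covariance elements wherever the denominator is nonzero, the delta method applies once an estimate of the sampling variance of those elements is available.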

As Maarten suggested, you could run a version of an imputation procedure,
imputing the missing values of y2 by a regression on z and x plus an
error term with a distribution similar to that of the residuals from this
regression. I would be more convinced by a bootstrap approach where
you would take bootstrap samples from the original data, run your
regression of y2 on x and z, predict y2 for the remaining
observations, and plug this into the second stage regression. (Check
first, though, whether a similar procedure on the complete data only
produces something resembling the proper standard errors.)
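
The bootstrap idea can be sketched roughly as follows. This is an illustration in plain Python on simulated data, not the poster's code and not a Stata recipe: the helper names (`ols`, `two_step`, `bootstrap_se`, `simulate`), the data-generating process and the true values a = 1, b = 2 are all made up for the example. It also assumes y2 is missing completely at random, and it uses first-stage fitted values for every observation in the second stage, as described in the original question.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[p] = A[p], A[c]
        for row in range(k):
            if row != c:
                f = A[row][c] / A[c][c]
                A[row] = [u - f * v for u, v in zip(A[row], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def two_step(rows):
    """rows are (y1, x, z, y2) tuples, y2 is None where missing. Returns b-hat."""
    obs = [r for r in rows if r[3] is not None]
    # First stage on complete cases only: y2 on a constant, x and z
    g = ols([[1.0, x, z] for _, x, z, _ in obs], [r[3] for r in obs])
    # Fitted y2 for every observation, whether y2 was observed or not
    y2hat = [g[0] + g[1] * x + g[2] * z for _, x, z, _ in rows]
    # Second stage on the full sample: y1 on a constant, x and fitted y2
    coef = ols([[1.0, r[1], h] for r, h in zip(rows, y2hat)],
               [r[0] for r in rows])
    return coef[2]

def bootstrap_se(rows, reps=200, seed=1):
    """Resample whole observations and redo BOTH stages in each replicate."""
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(reps):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append(two_step(sample))
    mean = sum(draws) / reps
    return (sum((d - mean) ** 2 for d in draws) / (reps - 1)) ** 0.5

def simulate(n=500, a=1.0, b=2.0, seed=0):
    """Hypothetical data: y2 endogenous, z a valid instrument, 30% of y2 missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x, z, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        v = 0.8 * u + 0.6 * rng.gauss(0, 1)   # correlated with u: endogeneity
        y2 = 0.5 * x + 1.0 * z + v
        y1 = a * x + b * y2 + u
        rows.append((y1, x, z, None if rng.random() < 0.3 else y2))
    return rows

rows = simulate()
b_hat = two_step(rows)
se = bootstrap_se(rows)
print("b-hat:", round(b_hat, 3), "bootstrap SE:", round(se, 3))
```

Because the first stage is re-estimated inside every replicate, the bootstrap spread reflects the uncertainty in the predicted y2 as well as in the second stage. To follow the check suggested above, one could pass only the complete cases to `bootstrap_se` and compare the result with the standard error that ivreg reports for the same subsample.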

If you suspect that y2 is informatively missing (rather than missing
at random... I hope you are familiar with those concepts), then things
will probably get quite a bit more complicated. There might be some
work on missing data with instrumental variables estimators, but the
direction modern econometrics tends to lean toward is partial
identification, where some extreme counterfactuals are proposed for the
missing data, and estimation and inference are aimed at an interval of
parameter values rather than at a point estimate as in classical statistics.

On Tue, Jul 22, 2008 at 7:52 AM, sara borelli <saraborelli77@yahoo.it> wrote:
> Dear All,
>
> I am estimating the following regression:
>
> y1= ax + by2 + u
> where y2 is endogenous and I am using some variable z as identifying instrument
>
> y1, x, z are observed for the whole sample, but y2 is missing for 30% of observations.
> If I use ivreg, Stata estimates the model only on the non-missing observations. But I need to estimate the model on the whole sample.
> Therefore I explicitly performed the two steps separately, predicting y2 in the first stage for the whole sample and inserting it into the second stage. But I know the standard errors may be biased. Does anyone know a way to estimate this correctly?
>
> Thank you for any help
> Sara Borelli
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name