If the equation of interest is the outcome equation of the selection
model, it isn't clear that you need to estimate the selection
equation explicitly, i.e., (1a) and (1b). In other words, you're talking
in terms of estimating a system of equations, but you may only need to
worry about the outcome equation. If that's the case, then, for
example, the excluded instruments you were thinking of using in (1a) and
(1b) to instrument for the endogenous regressors in the selection
equation could be used directly as the exclusion restrictions in the
Heckman-type estimation in (2b). Or, put another way, the probit first
stage in the selection estimation can be a reduced form estimation with
just the exogenous regressors.

That gives you proper identification, and reduces the number of
equations in the system from 4 to 3, but does not solve Jennifer's
problem of incorporating the complex sample structure into the
estimation procedure. If you can write this down as a system, with
scores/estimating equations/moment conditions implied by it, then the
problem of design-based estimation can be solved through
linearization/sandwich estimator of variance, but I don't think this
has ever been programmed, nor do I think it would be easy to program in
the first place. What Jennifer might think of instead is to use resampling
methods of variance estimation with -svy brr- and/or -svy jackknife-,
which, however, require a lot of tuning of the weights and such.

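To make the resampling idea concrete, here is a minimal sketch of a delete-one-PSU jackknife for a weighted mean. It is not what -svy jackknife- does internally in every detail: it assumes a single stratum, and the data layout (tuples of PSU id, weight, outcome) and the function names are made up for illustration. For the real problem the estimator would be the full selection model rather than a mean.

```python
def weighted_mean(rows):
    """Weighted mean of y; rows is a list of (psu, weight, y) tuples."""
    total_weight = sum(w for _, w, _ in rows)
    return sum(w * y for _, w, y in rows) / total_weight


def jackknife_variance(rows, estimator=weighted_mean):
    """Delete-one-PSU jackknife, assuming a single stratum (hypothetical helper).

    Returns (full-sample estimate, jackknife variance estimate).
    """
    psus = sorted({psu for psu, _, _ in rows})
    n = len(psus)
    if n < 2:
        raise ValueError("need at least two PSUs to jackknife")

    full = estimator(rows)
    rescale = n / (n - 1)  # remaining PSUs stand in for the dropped one

    sum_sq = 0.0
    for dropped in psus:
        # Drop a whole PSU at a time, never individual households.
        replicate = [(p, w * rescale, y) for p, w, y in rows if p != dropped]
        sum_sq += (estimator(replicate) - full) ** 2

    return full, (n - 1) / n * sum_sq
```

The weight rescaling is the simplest case of the "tuning of the weights" mentioned above; with strata, poststratification, or nonresponse adjustments the replicate weights need more work than this.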
With that said... the problem looks pretty hopeless at the moment.
What you can do as an ad-hoc plug-in rule is to run your program in
the most expanded form that allows for survey estimation, take design
effects from, say, -svy, deff: heckman- if that works at all (I've no
idea!), and use those DEFFs to inflate the standard errors and adjust
the tests in your final model. That model will give proper point
estimates but will understate the standard errors because it ignores
the complex survey design. You would also need to make sure that the
standard errors have been corrected for the extra estimation step in
the IV first stage of the Heckman second stage... and that may not be
trivial by itself.

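As a rough illustration of the plug-in rule: a design effect is the ratio of the design-based variance to the variance under simple random sampling, so a naive standard error is multiplied by sqrt(DEFF). The sketch below assumes you already have a DEFF for the coefficient from some survey-aware run; the function names and the numbers in the usage lines are hypothetical, not results from any model.

```python
import math


def deff_adjusted_se(se_naive, deff):
    """Inflate a naive standard error by sqrt(DEFF) (ad-hoc plug-in rule)."""
    if se_naive < 0 or deff <= 0:
        raise ValueError("se_naive must be >= 0 and deff must be > 0")
    return se_naive * math.sqrt(deff)


def deff_adjusted_z(coef, se_naive, deff):
    """z statistic recomputed with the DEFF-inflated standard error."""
    return coef / deff_adjusted_se(se_naive, deff)


# Made-up numbers: coefficient 0.5, naive SE 0.1, DEFF 4 borrowed from
# a survey-aware fit of a simpler model.
se = deff_adjusted_se(0.1, 4.0)
z = deff_adjusted_z(0.5, 0.1, 4.0)
print(se, z)
```

This only rescales what the final model reports; it does nothing about the extra first-stage estimation noted above, so treat the result as an approximation.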
Finally, note that the clustering due to complex survey design may
need to be taken at an earlier stage of sampling than households. Your
data provider should have included the description of the sample
design, and you need to start correcting for clustering at
the level of the primary sampling units (PSUs), which may be something
like a county or a postal code, with households
nested within those PSUs.
HTH.
--
Stas Kolenikov
http://stas.kolenikov.name