Why are observations that are noninformative about the dependent variable,
but are known to be selected, excluded by heckman from the estimation sample?
Determining the sample for a Heckman model
Vince Wiggins, StataCorp
William Gould, StataCorp
I have a dataset with 9,962 observations on labor market outcomes of Russian
women, and I want to estimate an earnings function correcting for selection
into the labor market. I have full data (no missing values) for all my RHS
variables in both earnings and participation functions. Of the 9,962
observations, 6,691 women work and 3,271 do not. Log earnings data are
obviously missing for all the women who don’t work and are also
missing for 1,109 of the women who do work.
Here is my problem: Heckman (two-step with the LHS participation variable
identified) drops these 1,109 observations from the participation equation
even though no variables, LHS or RHS, are missing. My take on things is that
the first stage should be estimated for all feasible observations regardless
of whether these are available for the earnings equations, but
heckman doesn’t like the inconsistency of participation equal to 1 with
earnings equal to missing, and it does not use these observations. Am I missing
something on the statistical front, or is this indeed a bug?
First, let’s restate the problem:
Consider data on
- Subsample A: women who DO NOT work
- Subsample B: women who DO work and whose earnings are observed
- Subsample C: women who DO work and whose earnings are NOT observed
The heckman command uses A+B to estimate the Heckman model, ignoring the C
observations. In detail, heckman uses A+B to estimate the participation
part of the model and then uses B to estimate the earnings equation.
The questioner would like heckman to use A+B+C to estimate the
participation part of the Heckman model and then continue to use B to
estimate the earnings equation.
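The sample sizes implied by the question can be worked out directly. The short sketch below uses only the counts quoted by the questioner; the variable names are illustrative and do not come from heckman itself.

```python
# Counts quoted in the question; subsample labels follow the restatement above.
total = 9962
nonworkers = 3271                 # subsample A
workers = 6691                    # subsamples B and C together
workers_missing_earnings = 1109   # subsample C

a = nonworkers
c = workers_missing_earnings
b = workers - c                   # subsample B: workers with observed earnings

used_for_participation = a + b    # what heckman uses for the participation part
used_for_earnings = b             # what heckman uses for the earnings equation
requested_for_participation = a + b + c  # what the questioner would like

print("A, B, C:", a, b, c)
print("participation sample used:", used_for_participation)
print("earnings sample used:", used_for_earnings)
print("participation sample requested:", requested_for_participation)
```

So the disagreement concerns only the participation part: 8,853 observations are used where the questioner expected 9,962, while the earnings equation uses the 5,582 B observations either way.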
It is not a bug that Stata does not use C in the estimation.
On efficiency grounds, one could argue that C should be included, but only
if the observations in C are in C because earnings are missing at random,
which is not the case in this dataset.
Let’s start by laying out the likelihood for subsamples A, B, and C,
where we now assume that C is
- Subsample C: women who DO work and whose earnings are NOT observed, with the missingness occurring at random
We can think of the Heckman likelihood function as follows:
    b1 = coefficients on selection equation
    b2 = coefficients on regression equation
    max L_samp = Product L_i
    L_i = L(DOES NOT work | b1)              if A observation
        = L(DOES work and y=yhat | b1, b2)   if B observation
        = L(DOES work | b1)                  if C observation
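To make the three cases concrete, here is a minimal sketch of the per-observation log-likelihood contributions. It assumes the standard Heckman setup with bivariate normal errors, which the text above does not spell out: `zg` stands for the selection index (the selection covariates times b1), `xb` for the regression index (the regression covariates times b2), and `sigma` and `rho` are the residual standard deviation and the error correlation. The function and argument names are hypothetical, and this is not heckman's own code.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_lik_contribution(kind, zg, xb=None, y=None, sigma=1.0, rho=0.0):
    """Log-likelihood contribution of one observation of type A, B, or C.

    Assumes the standard Heckman selection model with bivariate normal errors.
    """
    if kind == "A":
        # Does not work: depends on b1 only (through zg).
        return math.log(norm_cdf(-zg))
    if kind == "C":
        # Works, earnings missing at random: depends on b1 only.
        return math.log(norm_cdf(zg))
    if kind == "B":
        # Works and earnings observed: depends on b1 and b2.
        e = (y - xb) / sigma
        arg = (zg + rho * e) / math.sqrt(1.0 - rho * rho)
        log_density = -0.5 * e * e - math.log(sigma * math.sqrt(2.0 * math.pi))
        return math.log(norm_cdf(arg)) + log_density
    raise ValueError("kind must be 'A', 'B', or 'C'")

# The A and C probabilities are complements of each other.
p_a = math.exp(log_lik_contribution("A", 0.3))
p_c = math.exp(log_lik_contribution("C", 0.3))
print(p_a + p_c)
```

A C observation contributes a term involving b1 alone, so under this setup it could add information about the selection equation only.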
The C observations are missing at random among the workers and so can be ignored.
What is the effect of ignoring them? Doing so reduces the information used to
estimate b1 and, indirectly, b2 if the estimates of b1 and b2 are correlated.
Is that important? It would be if C observations commonly occurred in real
data. More important, we would rarely consider C to be a random sample of
the workers in B and C combined.
This case, however, includes a fourth subsample,
- Subsample D: women who DO work and whose earnings are known to be small (or zero)
If the earnings are known to be zero, then the likelihood contribution is
L(DOES work and yhat=0 | b1, b2)
However, D observations are no different from B observations. The wage is
observed; it is just zero. heckman already can estimate such models.
The earnings were missing because the questioner posed the wage equation in
terms of ln(wage) and ln(0) is minus infinity. That is a specification issue.
On the other hand, if the D sample is made up of persons whose earnings are
known only to be small, then we are going to need more modeling to write
down the likelihood contribution of D, and then we will have implemented a
new model, an extension beyond the Heckman model.
Here D observations are nothing like B or C observations and, beyond that,
they cannot be ignored without introducing bias. You cannot ignore D as we
have ignored C because D does not occur at random.
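The difference between dropping observations at random and dropping them because earnings are small can be seen in a deliberately simplified simulation. It is only a sketch: the log earnings are simulated from an assumed normal distribution, the 20% dropout rate is arbitrary, and it compares plain sample means among workers rather than Heckman estimates.

```python
import random
import statistics

random.seed(12345)

# Hypothetical log earnings for 20,000 workers (assumed normal, mean 2, sd 1).
log_earn = [random.gauss(2.0, 1.0) for _ in range(20000)]

# C-type missingness: each worker's earnings go missing with probability 0.2,
# independently of the earnings themselves.
kept_random = [v for v in log_earn if random.random() > 0.2]

# D-type missingness: the lowest 20% of earnings go missing.
cutoff = sorted(log_earn)[len(log_earn) // 5]
kept_nonrandom = [v for v in log_earn if v >= cutoff]

mean_all = statistics.mean(log_earn)
mean_random = statistics.mean(kept_random)
mean_nonrandom = statistics.mean(kept_nonrandom)

print("all workers:           ", round(mean_all, 3))
print("after random dropout:  ", round(mean_random, 3))
print("after low-wage dropout:", round(mean_nonrandom, 3))
```

Random dropout leaves the mean essentially unchanged, whereas dropping the low earners shifts it upward. This is the simplest version of the reason D observations cannot be ignored in the way C observations can.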