Title: Determining the sample for a Heckman model

Authors: Vince Wiggins, StataCorp; William Gould, StataCorp

I have a dataset with 9,962 observations on labor market outcomes of Russian women, and I want to estimate an earnings function correcting for selection into the labor market. I have full data (no missing values) for all my RHS variables in both earnings and participation functions. Of the 9,962 observations, 6,691 women work and 3,271 do not. Log earnings data are obviously missing for all the women who don’t work and are also missing for 1,109 of the women who do work.

Here is my problem: Heckman (two-step with the LHS participation variable
identified) drops these 1,109 observations from the participation equation
even though no variables, LHS or RHS, are missing. My take on things is that
the first stage should be estimated for all feasible observations regardless
of whether these are available for the earnings equation, but
**heckman** doesn’t like the inconsistency of participation equal to 1
with earnings missing and does not use these observations. Am I missing
something from a statistical front, or is this indeed a bug?

First, let’s restate the problem:

Consider data on

- Subsample A: women who **DO NOT** work
- Subsample B: women who **DO** work and whose earnings are observed
- Subsample C: women who **DO** work and whose earnings are **NOT** observed

Stata’s **heckman**
command uses A+B to estimate the Heckman model, ignoring the C observations.
The details are that **heckman** uses A+B to estimate the participation
part of the model and then uses B to estimate the earnings equation.

The questioner would like **heckman** to use A+B+C to estimate the
participation part of the Heckman model and then continue to use B to
estimate the earnings equation.
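
To make the bookkeeping concrete, the counts given in the question can be split into the three subsamples. The sketch below is plain Python, not Stata, and the function name `classify` and the variable names are hypothetical. It only mirrors the sample definitions above and says nothing about how **heckman** is implemented internally.

```python
def classify(works, earnings_observed):
    """Return the subsample label (A, B, or C) for one observation."""
    if not works:
        return "A"                      # does not work
    return "B" if earnings_observed else "C"

# Counts reported in the question.
n_total = 9962
n_work = 6691
n_nowork = 3271
n_work_missing = 1109                   # workers with missing log earnings

n_A = n_nowork
n_C = n_work_missing
n_B = n_work - n_work_missing           # workers with observed earnings

participation_sample = n_A + n_B        # used by heckman for participation
earnings_sample = n_B                   # used by heckman for earnings
requested_sample = n_A + n_B + n_C      # what the questioner asks for

print(n_A, n_B, n_C)
print(participation_sample, earnings_sample, requested_sample)
```

Under these counts, the questioner's proposal would add the 1,109 C observations to the participation part only; the earnings equation would use B either way.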

It is not a bug that Stata does not use C in the estimation.

On efficiency grounds, one could argue that C should be included, but only if the observations in C are in C because earnings are missing at random, which is not the case in this dataset.

Let’s start by laying out the likelihood for subsamples A, B, and C, where C is

- Subsample C: women who do work and whose earnings are **NOT** observed at random

We can think of the Heckman likelihood function as follows:

    b1 = coefficients on selection equation
    b2 = coefficients on regression equation

     max    L_samp = Product L_i
    b1,b2

    L_i = L(DOES NOT work | b1)              if A observation
        = L(DOES work and y=yhat | b1, b2)   if B observation
        = L(DOES work | b1)                  if C observation
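
As an illustration of the three contributions, here is a minimal Python sketch that assumes the usual Heckman setup with bivariate normal errors. The names `loglik_contribution`, `zg` (the selection index built from b1), `xb` (the regression index built from b2), `sigma`, and `rho` are assumptions introduced for this example; the text above does not spell out the functional form, and this is not Stata's code.

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_logpdf(x):
    """Log of the standard normal density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def loglik_contribution(kind, zg, xb=None, y=None, sigma=1.0, rho=0.0):
    """Log-likelihood contribution of one observation.

    kind  : "A", "B", or "C"
    zg    : selection index (depends on b1 only)
    xb    : regression index (depends on b2), needed for B only
    y     : observed outcome, needed for B only
    sigma : s.d. of the regression error (assumed ancillary parameter)
    rho   : correlation between the two errors (assumed ancillary parameter)
    """
    if kind == "A":                     # does not work
        return math.log(norm_cdf(-zg))
    if kind == "C":                     # works, earnings missing at random
        return math.log(norm_cdf(zg))
    if kind == "B":                     # works, earnings observed
        e = (y - xb) / sigma
        arg = (zg + rho * e) / math.sqrt(1.0 - rho * rho)
        return math.log(norm_cdf(arg)) + norm_logpdf(e) - math.log(sigma)
    raise ValueError("kind must be 'A', 'B', or 'C'")
```

In this sketch the A and C contributions involve only the selection index, so C observations carry information about b1 but none directly about b2.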

The C observations are missing at random among the workers and so can be ignored. What is the effect of ignoring them? Doing so reduces the information used to estimate b1 and, indirectly, b2 if b1 and b2 are correlated.

Is that important? It would be if C observations commonly occur in real data. What’s more important is that we would rarely consider C to be a random sample from B and C.

This case, however, includes a fourth subsample,

- Subsample D: women who **DO** work and whose earnings are known to be small (or zero)

If the earnings are known to be zero, then the likelihood contribution is

L(DOES work and yhat=0 | b1, b2)

However, D observations are no different from B observations. The wage is
observed; it is just zero. **heckman** already can estimate such models.
The earnings were missing because the questioner posed the wage equation in
terms of ln(wage) and ln(0) is minus infinity. That is a specification
problem.
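
The specification point can be seen in a few lines: taking logs turns a recorded zero into a missing value, so an observed wage ends up looking unobserved. The Python sketch below uses a hypothetical helper `log_or_missing` and made-up earnings values, with `None` standing in for a missing value.

```python
import math

def log_or_missing(wage):
    """Return ln(wage), or None (missing) when the log is undefined."""
    if wage is None or wage <= 0:
        return None
    return math.log(wage)

# Made-up values: 0.0 is an observed zero wage, None is an unreported wage.
wages = [1200.0, 0.0, None, 850.0]
log_wages = [log_or_missing(w) for w in wages]

observed_in_levels = sum(w is not None for w in wages)
observed_in_logs = sum(lw is not None for lw in log_wages)

print(observed_in_levels, observed_in_logs)
```

The observation with a zero wage is observed in levels but drops out once the equation is written in logs, which is why it is a matter of specification rather than of missing data.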

On the other hand, if the D sample is made up of persons whose earnings are known only to be small, then we are going to need more modeling to write down the likelihood contribution of D, and then we will have implemented a new model, an extension beyond the Heckman model.

Here D observations are nothing like B or C observations and, beyond that, they cannot be ignored without introducing bias. You cannot ignore D as we have ignored C because D does not occur at random.