Why are observations that are noninformative about the dependent variable, but are known to be selected, excluded by heckman from the estimation sample?

Title   Determining the sample for a Heckman model
Authors Vince Wiggins, StataCorp
        William Gould, StataCorp


I have a dataset with 9,962 observations on labor market outcomes of Russian women, and I want to estimate an earnings function correcting for selection into the labor market. I have full data (no missing values) for all my RHS variables in both earnings and participation functions. Of the 9,962 observations, 6,691 women work and 3,271 do not. Log earnings data are obviously missing for all the women who don’t work and are also missing for 1,109 of the women who do work.

Here is my problem: heckman (two-step with the LHS participation variable identified) drops these 1,109 observations from the participation equation even though no variables, LHS or RHS, are missing. My take on things is that the first stage should be estimated for all feasible observations regardless of whether these are available for the earnings equation, but heckman doesn’t like the inconsistency of participation equal to 1 with earnings equal to missing and does not use these observations. Am I missing something on the statistical front, or is this indeed a bug?

First, let’s restate the problem:

Consider data on

  • Subsample A: women who DO NOT work
  • Subsample B: women who DO work and whose earnings are observed
  • Subsample C: women who DO work and whose earnings are NOT observed

Stata’s heckman command uses A+B to estimate the Heckman model, ignoring the C observations. The details are that heckman uses A+B to estimate the participation part of the model and then uses B to estimate the earnings equation.

The questioner would like heckman to use A+B+C to estimate the participation part of the Heckman model and then continue to use B to estimate the earnings equation.
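To make the bookkeeping concrete, the counts from the question can be split into the three subsamples. What follows is a minimal Python sketch of that arithmetic, not Stata code and not a description of how heckman is implemented. The helper name `subsample` is hypothetical; only the counts 9,962, 6,691, 3,271, and 1,109 come from the question.

```python
# Illustrative bookkeeping only; counts are taken from the question above.
n_total = 9962
n_work = 6691          # women who work (B + C)
n_nowork = 3271        # women who do not work (A)
n_work_missing = 1109  # workers whose log earnings are missing (C)

def subsample(works, earnings_observed):
    """Label one observation as A, B, or C (hypothetical helper)."""
    if not works:
        return "A"
    return "B" if earnings_observed else "C"

n_A = n_nowork
n_C = n_work_missing
n_B = n_work - n_work_missing

used_by_heckman = n_A + n_B             # participation part: A+B
used_in_earnings = n_B                  # earnings equation: B only
wanted_by_questioner = n_A + n_B + n_C  # participation part: A+B+C

print("A, B, C:", n_A, n_B, n_C)
print("A+B:", used_by_heckman, " A+B+C:", wanted_by_questioner)
```

Under these counts, the A+B sample and the A+B+C sample differ by exactly the 1,109 C observations, while the earnings equation uses B in both cases.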


It is not a bug that Stata does not use C in the estimation.

On efficiency grounds, one could argue that C should be included, but only if the observations in C are in C because earnings are missing at random, which is not the case in this dataset.

Let’s start by laying out the likelihood for subsamples A, B, and C, where C is

  • Subsample C: women who do work and whose earnings are NOT observed at random

We can think of the Heckman likelihood function as follows:

         b1 = coefficients on selection equation
         b2 = coefficients on regression equation

         max   L_samp = Product L_i

         L_i =  L(DOES NOT work | b1)              if A observation
             =  L(DOES work and y=yhat | b1, b2)   if B observation
             =  L(DOES work | b1)                  if C observation

The C observations are missing at random among the workers and so can be ignored. What is the effect of ignoring them? It reduces the information used to estimate b1 and, indirectly, b2 if the estimates of b1 and b2 are correlated.
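The three likelihood contributions can be written out as a short Python sketch. It assumes the usual textbook form of the selection model, with bivariate normal errors and the selection-error variance normalized to 1. The names `zg` (the selection index built from b1), `resid` (earnings minus the regression index built from b2), `sigma`, and `rho` are introduced here for illustration and do not appear in the notation above. It is meant to show what each subsample contributes, not to reproduce heckman's own code.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_logpdf(x):
    """Log of the standard normal density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def loglik_i(kind, zg, resid=None, sigma=1.0, rho=0.0):
    """Log-likelihood contribution of one observation.

    kind  : "A", "B", or "C"
    zg    : selection index (depends on b1 only)
    resid : observed earnings minus regression index (B only; depends on b2)
    sigma, rho : assumed error s.d. and error correlation
    """
    if kind == "A":   # L(DOES NOT work | b1)
        return math.log(norm_cdf(-zg))
    if kind == "C":   # L(DOES work | b1)
        return math.log(norm_cdf(zg))
    if kind == "B":   # L(DOES work and earnings observed | b1, b2)
        e = resid / sigma
        arg = (zg + rho * e) / math.sqrt(1.0 - rho * rho)
        return math.log(norm_cdf(arg)) + norm_logpdf(e) - math.log(sigma)
    raise ValueError("kind must be 'A', 'B', or 'C'")

def sample_loglik(obs, use_C=True, **params):
    """Sum contributions; use_C=False ignores C, as in the A+B sample."""
    return sum(loglik_i(kind, zg, resid, **params)
               for (kind, zg, resid) in obs
               if use_C or kind != "C")
```

In this sketch a C observation enters only through the selection index, so dropping C removes terms that involve b1 alone. That matches the point above: the loss is information about b1 and, only indirectly, about b2.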

Is that important? It would be if C observations commonly occur in real data. What’s more important is that we would rarely consider C to be a random sample from B and C.

This case, however, includes a fourth subsample,

  • Subsample D: women who DO work and whose earnings are known to be small (or zero)

If the earnings are known to be zero, then the likelihood contribution is

         L(DOES work and yhat=0 | b1, b2)

However, D observations are then no different from B observations. The wage is observed; it is just zero. heckman can already estimate such models. The earnings were missing because the questioner posed the wage equation in terms of ln(wage), and ln(0) is minus infinity. That is a specification problem.
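The way a known zero wage turns into a missing dependent variable can be seen in a small Python sketch. The helper `ln_or_missing` and the wage values are made up for illustration, and `None` stands in for a missing value.

```python
import math

def ln_or_missing(wage):
    """Return ln(wage), or None (standing in for missing) when undefined.

    A zero wage has no finite log, so it is mapped to missing here.
    """
    if wage is None or wage <= 0:
        return None
    return math.log(wage)

# Made-up wages: the second is observed and equal to zero,
# the last is truly unobserved.
wages = [1200.0, 0.0, 850.0, None]
log_wages = [ln_or_missing(w) for w in wages]

# After the transformation, the known zero and the unobserved wage
# look the same, although only the latter was ever unobserved.
for w, lw in zip(wages, log_wages):
    print(w, "->", lw)
```

The missing value in the second row is produced by the choice of ln(wage) as the dependent variable, not by the data, which is why this is a specification problem rather than a sample-selection problem.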

On the other hand, if the D sample is made up of persons whose earnings are known only to be small, then we are going to need more modeling to write down the likelihood contribution of D, and then we will have implemented a new model, an extension beyond the Heckman model.

Here D observations are nothing like B or C observations and, beyond that, they cannot be ignored without introducing bias. You cannot ignore D as we have ignored C because D does not occur at random.
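The difference between dropping observations at random and dropping them because earnings are small can be illustrated with a deliberately simplified simulation. This Python sketch looks only at the mean of log earnings among workers rather than fitting a Heckman model. The sample size, the normal distribution, and the 20% drop rate are all assumptions chosen for illustration.

```python
import random

random.seed(12345)  # fixed seed so the run is repeatable

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Made-up log earnings for 20,000 workers (assumed normal, mean 2, s.d. 1).
log_earn = [random.gauss(2.0, 1.0) for _ in range(20000)]
full_mean = mean(log_earn)

# C-like missingness: drop about 20% of workers at random.
kept_random = [y for y in log_earn if random.random() >= 0.20]

# D-like missingness: drop the 20% of workers with the smallest earnings.
cutoff = sorted(log_earn)[len(log_earn) // 5]
kept_nonrandom = [y for y in log_earn if y >= cutoff]

mean_random = mean(kept_random)
mean_nonrandom = mean(kept_nonrandom)

print("all workers:         ", round(full_mean, 3))
print("dropped at random:   ", round(mean_random, 3))
print("dropped low earners: ", round(mean_nonrandom, 3))
```

In this toy setting, random dropping leaves the mean essentially unchanged and only costs precision, whereas dropping the low earners shifts the mean upward. The same logic is why C can be ignored but D cannot.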