Stata | FAQ: Determining the sample for a Heckman model

Why are observations that are noninformative about the dependent variable, but are known to be selected, excluded by heckman from the estimation sample?

Title		Determining the sample for a Heckman model
Authors		Vince Wiggins, StataCorp; William Gould, StataCorp

Question:

I have a dataset with 9,962 observations on labor market outcomes of Russian women, and I want to estimate an earnings function correcting for selection into the labor market. I have full data (no missing values) for all my RHS variables in both earnings and participation functions. Of the 9,962 observations, 6,691 women work and 3,271 do not. Log earnings data are obviously missing for all the women who don’t work and are also missing for 1,109 of the women who do work.

Here is my problem: heckman (two-step with the LHS participation variable identified) drops these 1,109 observations from the participation equation even though no variables, LHS or RHS, are missing. My take on things is that the first stage should be estimated for all feasible observations, regardless of whether they are available for the earnings equation, but heckman doesn’t like the inconsistency of participation equal to 1 with earnings equal to missing and does not use these observations. Am I missing something on the statistical front, or is this indeed a bug?

First, let’s restate the problem:

Consider data on

Subsample A: women who DO NOT work
Subsample B: women who DO work and whose earnings are observed
Subsample C: women who DO work and whose earnings are NOT observed

Stata’s heckman command uses A+B to estimate the Heckman model, ignoring the C observations. The details are that heckman uses A+B to estimate the participation part of the model and then uses B to estimate the earnings equation.

The questioner would like heckman to use A+B+C to estimate the participation part of the Heckman model and then continue to use B to estimate the earnings equation.
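As a quick check on the sample sizes involved, the counts quoted in the question can be split into the three subsamples. This is only an arithmetic sketch in Python; the variable names are made up for illustration and are not part of heckman.

```python
# Counts taken from the question above.
total = 9962
workers = 6691
nonworkers = 3271
workers_missing_earnings = 1109

# Subsample sizes.
n_a = nonworkers                           # A: do not work
n_c = workers_missing_earnings             # C: work, earnings not observed
n_b = workers - workers_missing_earnings   # B: work, earnings observed

assert n_a + n_b + n_c == total

# Estimation samples under the two rules described above.
heckman_selection_n = n_a + n_b            # participation part as heckman estimates it
proposed_selection_n = n_a + n_b + n_c     # participation part as the questioner proposes
outcome_n = n_b                            # earnings equation under either rule

print("A, B, C:", n_a, n_b, n_c)
print("participation sample (heckman):", heckman_selection_n)
print("participation sample (proposed):", proposed_selection_n)
print("earnings sample:", outcome_n)
```

Under either rule, the earnings equation is fit on the same B observations; the two rules differ only in whether the C observations contribute to the participation part.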

Answer:

It is not a bug that Stata does not use C in the estimation.

On efficiency grounds, one could argue that C should be included, but only if the observations in C are in C because earnings are missing at random, which is not the case in this dataset.

Let’s start by laying out the likelihood for subsamples A, B, and C, where C is

Subsample C: women who DO work and whose earnings are missing at random

We can think of the Heckman likelihood function as follows:

         b1 = coefficients on selection equation
         b2 = coefficients on regression equation

         max   L_samp = Product L_i
        b1,b2

         L_i =  L(DOES NOT work | b1)              if A observation
             =  L(DOES work and y=yhat | b1, b2)   if B observation
             =  L(DOES work | b1)                  if C observation

The C observations are missing at random among the workers and so can be ignored. What is the effect of ignoring them? It reduces the information used to estimate b1 and, indirectly, b2 if b1 and b2 are correlated.

Is that important? It would be if C observations commonly occurred in real data. More important is that we would rarely be willing to consider C a random sample of the workers, that is, of B and C together.
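The three likelihood contributions for A, B, and C can be sketched numerically. The Python sketch below assumes the usual bivariate-normal form of the Heckman model, which the text does not spell out: zg stands for the selection index (the selection covariates times b1), xb for the regression index (the regression covariates times b2), sigma for the standard deviation of the regression error, and rho for the correlation between the two errors. All of these names are assumptions for illustration, and this is not how heckman itself is implemented.

```python
import math

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal distribution function, via erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def loglik_a(zg):
    """A observation: log P(does not work)."""
    return math.log(norm_cdf(-zg))

def loglik_b(zg, y, xb, sigma, rho):
    """B observation: log of [density of y] times P(works | y)."""
    e = (y - xb) / sigma                              # standardized residual
    cond = (zg + rho * e) / math.sqrt(1.0 - rho * rho)
    return math.log(norm_pdf(e)) - math.log(sigma) + math.log(norm_cdf(cond))

def loglik_c(zg):
    """C observation: log P(works); the earnings value contributes nothing."""
    return math.log(norm_cdf(zg))

def integrate_b_over_y(zg, xb, sigma, rho, steps=4000, width=8.0):
    """Trapezoid-rule integral of the B contribution over all earnings values."""
    lo = xb - width * sigma
    h = 2.0 * width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(loglik_b(zg, lo + i * h, xb, sigma, rho))
    return total * h

# Hypothetical parameter values, chosen only for illustration.
zg, xb, sigma, rho = 0.3, 2.0, 0.7, 0.5
marginal = integrate_b_over_y(zg, xb, sigma, rho)
print("B contribution integrated over y:", marginal)
print("C contribution:                  ", math.exp(loglik_c(zg)))
```

Integrating the B contribution over every possible earnings value reproduces the C contribution. That is the sense in which a C observation carries information about b1 only, and why dropping C observations that are missing at random costs information without changing what is being estimated.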

This case, however, includes a fourth subsample,

Subsample D: women who DO work and whose earnings are known to be small (or zero)

If the earnings are known to be zero, then the likelihood contribution is

         L(DOES work and yhat=0 | b1, b2)

However, D observations are no different from B observations. The wage is observed; it is just zero. heckman can already estimate such models. The earnings were missing because the questioner posed the wage equation in terms of ln(wage), and ln(0) is minus infinity. That is a specification problem.
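The way a zero wage turns into a missing dependent variable can be illustrated with a small Python sketch. It assumes that the log of a zero wage is stored as missing, which is how Stata's ln() treats an argument of zero; the helper names and the four records are hypothetical.

```python
import math

def ln_or_missing(x):
    """Hypothetical helper: return None (missing) where the log is undefined."""
    if x is None or x <= 0:
        return None
    return math.log(x)

def subsample(works, outcome):
    """Classify one record by whether the dependent variable is usable."""
    if not works:
        return "A"
    return "B" if outcome is not None else "excluded"

# Hypothetical records: (works, wage). None means the wage was never recorded.
records = [
    (0, None),      # does not work
    (1, 1500.0),    # works, positive wage observed
    (1, None),      # works, wage not recorded
    (1, 0.0),       # works, wage observed and equal to zero
]

in_logs = [subsample(w, ln_or_missing(wage)) for w, wage in records]
in_levels = [subsample(w, wage) for w, wage in records]

print("dependent variable ln(wage):", in_logs)
print("dependent variable wage:    ", in_levels)
```

With ln(wage) as the dependent variable, the worker whose wage is zero is indistinguishable from the worker whose wage was never recorded. With the wage in levels, the zero-wage worker is an ordinary B observation.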

On the other hand, if the D sample is made up of persons whose earnings are known only to be small, then we are going to need more modeling to write down the likelihood contribution of D, and then we will have implemented a new model, an extension beyond the Heckman model.

Here D observations are nothing like B or C observations, and, beyond that, they cannot be ignored without introducing bias. You cannot ignore D as we ignored C because D does not occur at random.
