Title | Mills’ ratios and censoring direction in the Heckman selection model
Author | Vince Wiggins, StataCorp
Date | May 1999; minor revisions July 2011
Someone asked about what Heckman called the “inverse of Mills’ ratio” (IMR) and its relation to Heckman’s two-step method for estimating selection models.
The definition of the IMR tends to be somewhat inconsistent. In fact, the current manual entry for heckman uses the more intuitive “nonselection hazard” instead of “inverse Mills”, primarily because the latter has so many variations in the literature.
Hazard has the unequivocal definition H(x) = f(x) / (1 − F(x)), and Mills’ is usually just taken to be 1 / H(x), so the inverse Mills’ is just the hazard. Many papers, however, take liberties with this definition of the Mills’ and are not always clear about why.
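As a concrete illustration of these two definitions, here is a minimal Python sketch for the standard normal case. The function names (pdf, cdf, hazard, mills) are illustrative choices made for this example; they do not come from Stata or from the papers discussed here.

```python
import math

def pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal distribution function F(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def hazard(x):
    """H(x) = f(x) / (1 - F(x)); 1 - F(x) is computed directly with erfc."""
    survival = 0.5 * math.erfc(x / math.sqrt(2.0))
    return pdf(x) / survival

def mills(x):
    """Mills' ratio as usually defined: 1 / H(x)."""
    return 1.0 / hazard(x)

for x in (-1.0, 0.0, 1.5):
    print(f"x = {x:5.2f}  hazard = {hazard(x):.6f}  Mills = {mills(x):.6f}")
```

Under these definitions the inverse of mills(x) is hazard(x) by construction, which is the sense in which the "inverse Mills' ratio" is just the hazard.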
The questioner states
As I understand it, the inverse Mills’ ratio (IMR) computed by Stata’s heckman command, and used in the second-stage regression, is lambda=f(x)/F(x), where f(x) is the pdf and F(x) is the CDF (see [R] heckman).
What I do not understand is exactly how this fits in with the definitions of the IMR found in the literature. For example (sorry for any unclear notation):
They might also have added that
Another user noted that
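The key is to remember two basic facts about the standard normal pdf (f) and CDF (F), both of which follow from the symmetry of the distribution about zero:
f(−x) = f(x)
F(−x) = 1 − F(x)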
Using these two facts and some algebraic manipulation, you can show that all of the different formulas for the IMR are equivalent.
This observation is at the heart of one problem and shows why (5), (4-1), (3), (2), and (1-1) can, in the right context, be used interchangeably.
This first area of confusion results because some authors choose to model selection (e.g., Heckman), whereas others choose to model nonselection (e.g., Maddala).
If you model nonselection, the natural choice for the nonselection hazard is f(Zg) / (1 − F(Zg)), where typically Zg = z1*g1 + z2*g2 + ... from the nonselection model. If you model selection, the natural choice for nonselection hazard is f(Zg) / F(Zg), where Zg = z1*g1 + z2*g2 + ... from the selection model. These are the same number because the Gaussian is symmetric and the index Zg from the nonselection model is the negative of the index Zg from the selection model.
In both cases, authors are computing the nonselection hazard; they are just beginning in the first case with a model of nonselection—so you get the standard form for the hazard—and in the second case with a model of selection—so the nonselection hazard has a different computation that arrives at the same value.
Authors also tend to write the nonselection hazard in whatever form is convenient. Heckman, for example, models selection and uses f(−Zg) / (1 − F(−Zg)) which is equal to f(Zg) / F(Zg).
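The equivalence of these forms is easy to check numerically. The sketch below is only an illustration: the three function names are hypothetical labels for the three ways of writing the nonselection hazard mentioned above, and the index values are arbitrary.

```python
import math

def pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal distribution function F(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def from_selection_model(zg):
    """Nonselection hazard written with the selection index: f(Zg) / F(Zg)."""
    return pdf(zg) / cdf(zg)

def from_nonselection_model(index):
    """Standard hazard form applied to a nonselection index: f / (1 - F)."""
    return pdf(index) / (1.0 - cdf(index))

def heckman_form(zg):
    """The form f(-Zg) / (1 - F(-Zg)), written with the selection index."""
    return pdf(-zg) / (1.0 - cdf(-zg))

for zg in (-2.0, -0.3, 0.0, 0.7, 2.5):
    a = from_selection_model(zg)
    b = from_nonselection_model(-zg)  # nonselection index is minus the selection index
    c = heckman_form(zg)
    print(f"Zg = {zg:5.2f}  {a:.10f}  {b:.10f}  {c:.10f}")
```

All three columns agree up to floating-point rounding, which is just f(−x) = f(x) and F(−x) = 1 − F(x) at work.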
A second question is a bit trickier.
Maybe there is something obvious I am missing here, but I’m still missing it. Stata’s calculation of the IMR appears to assume x < a (“truncation from above” in Greene’s terminology, I think). But in many—perhaps most—cases in econometrics, I suspect the truncation is the other way around. Two that come to mind are the classic female labor supply question and the model I am working on, which is trying to explain the determinants of children’s school performance, taking into consideration the selectivity aspect in a country where a large proportion of children do not go to school.
This is not an issue for the Heckman estimator because the direction of the truncation is normalized out in the specification. The shortest story I can think of to show this is somewhat involved.
Let’s be clear about how the two-step estimator works. We have a regression equation of interest
y1 = Xb + e1
We do not, however, always observe y1; instead, we have a selection equation that determines whether y1 is observed.
y2 = Zg + e2
Also, y2 itself is not observed. We know only that y1 is observed when y2 > 0, i.e., when e2 > −Zg.
The use of 0 as the cutoff for selection is a necessary normalization without which the model is not identified.
With this in hand, we can write the expectation of y1 conditional on y1 being observed; that is, y1 conditional on y2 > 0 or equivalently e2 > −Zg.
So the conditional expectation of y1 is
E(y1 | e2 > −Zg) = Xb + E(e1 | e2 > −Zg)
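and from the moments of a censored bivariate Gaussian, where rho is the correlation between e1 and e2, S1 is the standard deviation of e1, and the variance of e2 is normalized to 1, this is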
E(y1 | e2 > −Zg) = Xb + rho*S1*f(Zg)/F(Zg)
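The selection term in this expression can be checked by simulation. The following is a rough Monte Carlo sketch, not a proof: the values of Zg, rho, and S1, the number of draws, and the seed are all arbitrary assumptions chosen for illustration, and e2 is given unit variance as in the probit normalization.

```python
import math
import random

def pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal distribution function F(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def simulated_mean_e1(zg, rho, s1, draws=200_000, seed=1):
    """Average of e1 over draws that satisfy the selection rule e2 > -Zg."""
    rng = random.Random(seed)
    total, kept = 0.0, 0
    for _ in range(draws):
        e2 = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        # e1 has standard deviation s1 and correlation rho with e2
        e1 = s1 * (rho * e2 + math.sqrt(1.0 - rho * rho) * u)
        if e2 > -zg:
            total += e1
            kept += 1
    return total / kept

zg, rho, s1 = 0.4, 0.6, 2.0  # illustrative values only
theory = rho * s1 * pdf(zg) / cdf(zg)
simulated = simulated_mean_e1(zg, rho, s1)
print(f"rho*S1*f(Zg)/F(Zg) = {theory:.4f}")
print(f"simulated mean     = {simulated:.4f}")
```

The two numbers should agree up to simulation noise, which shrinks as the number of draws grows.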
Heckman’s insight was to formulate this conditional expectation and then to obtain an estimate of Zg from a probit estimation on whether y1 is observed. Including f(Zg)/F(Zg) as a regressor, y1 = Xb + rho*S1 * f(Zg)/F(Zg), then yields consistent estimates of b and rho*S1.
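To make the two steps concrete, here is a small self-contained sketch on simulated data. It is an illustration under stated assumptions, not Stata's heckman implementation: the data-generating values (b = (1, 2), g = (0.5, 1), rho = 0.6, S1 = 1), the sample size, the seed, and all helper names (solve, ols, probit) are made up for the example, and no standard errors are computed.

```python
import math
import random

def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def solve(A, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(rows, y):
    """Least squares via the normal equations (fine for a few regressors)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def probit(rows, d, iters=25):
    """Probit maximum likelihood by Newton-Raphson, starting from zero."""
    k = len(rows[0])
    g = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, di in zip(rows, d):
            q = sum(gi * ri for gi, ri in zip(g, r))
            w = pdf(q) / cdf(q) if di else -pdf(q) / cdf(-q)
            h = w * (w + q)  # minus the second derivative of the log likelihood
            for i in range(k):
                score[i] += w * r[i]
                for j in range(k):
                    info[i][j] += h * r[i] * r[j]
        step = solve(info, score)
        g = [gi + si for gi, si in zip(g, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return g

# Simulated data; every value below is an assumption for illustration.
random.seed(12345)
n, rho, s1 = 20000, 0.6, 1.0
b_true, g_true = [1.0, 2.0], [0.5, 1.0]
Z, D, X, Y = [], [], [], []
for _ in range(n):
    x, z = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    e2 = random.gauss(0.0, 1.0)
    e1 = s1 * (rho * e2 + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0))
    Z.append([1.0, z])
    D.append(g_true[0] + g_true[1] * z + e2 > 0)  # y1 is observed when y2 > 0
    X.append([1.0, x])
    Y.append(b_true[0] + b_true[1] * x + e1)

# Step 1: probit on the selection indicator gives an estimate of g, hence of Zg.
g_hat = probit(Z, D)

# Step 2: regress observed y1 on X and the nonselection hazard f(Zg)/F(Zg).
sel = [i for i in range(n) if D[i]]
y_sel = [Y[i] for i in sel]
zg_hat = [g_hat[0] * Z[i][0] + g_hat[1] * Z[i][1] for i in sel]
naive = ols([X[i] for i in sel], y_sel)
two_step = ols([X[i] + [pdf(q) / cdf(q)] for i, q in zip(sel, zg_hat)], y_sel)

print("probit g:           ", [round(v, 3) for v in g_hat])
print("naive OLS b:        ", [round(v, 3) for v in naive])
print("two-step b, rho*S1: ", [round(v, 3) for v in two_step])
```

In this setup the naive regression on the selected sample should overstate the intercept, while the regression that includes the hazard term should recover b and a coefficient near rho*S1 = 0.6. The selection equation uses a variable z that does not appear in the regression equation; that exclusion is a design choice for the simulation, made so the hazard term is not close to collinear with X.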
Heckman’s f(Zg)/F(Zg) corresponds to Greene’s expression (I’m going to change Greene’s notation slightly to match the Heckman model).
f(a − Zg) / (1 − F(a − Zg)) if truncation is y2 > a
because, as seen above, we have estimated a selection model and need the nonselection hazard; with a = 0, this expression is f(−Zg) / (1 − F(−Zg)), which equals f(Zg) / F(Zg).
Finally, we are ready to answer the question about models where the expected censoring is y2 < a rather than y2 > a. This corresponds to the expression Greene quotes as
−f(a − Zg) / F(a − Zg) if truncation is y2 < a
and is the required component of the formula for the conditional expectation of e1 when y2 < a.
Recall that the Heckman model normalized a to be 0—since a could not be identified separately from the parameter vector g. That means we really have the selection rule y2 < 0 for the case that concerns the questioner and y2 > 0 for the standard formulation of the Heckman model. We also know that E[e1] = 0 from the assumption of the regression model on the unconditional value of y2.
When centered at [0,0], the bivariate normal is mirror symmetric about the origin. Thus we can’t tell the difference between a model with selection rule y2 > 0 and a positive value for rho and a model with a selection rule y2 < 0 and a negative value of rho.
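One way to see this symmetry numerically is to compare the selection term under the rule y2 < 0 with the term from the standard rule y2 > 0 after flipping the signs of both the index Zg and rho (that is, after relabeling y2 as −y2). The sketch below assumes that relabeling; the function names and the values of rho and S1 are illustrative only.

```python
import math

def pdf(x):
    """Standard normal density f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal distribution function F(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def term_select_above(zg, rho, s1):
    """E(e1 | y2 > 0) with y2 = Zg + e2: rho*S1*f(Zg)/F(Zg)."""
    return rho * s1 * pdf(zg) / cdf(zg)

def term_select_below(zg, rho, s1):
    """E(e1 | y2 < 0): rho*S1 times -f(a - Zg)/F(a - Zg) evaluated at a = 0."""
    return rho * s1 * (-pdf(-zg) / cdf(-zg))

rho, s1 = 0.5, 1.5  # illustrative values only
for zg in (-1.2, 0.0, 0.8, 2.0):
    below = term_select_below(zg, rho, s1)
    relabeled = term_select_above(-zg, -rho, s1)  # flip the index and rho
    print(f"Zg = {zg:5.2f}  y2<0 rule: {below:.8f}  relabeled y2>0 rule: {relabeled:.8f}")
```

The two columns match for every index value, so the regression sees the same correction term under either description; only the signs attached to the selection index and to rho change.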
The most important thing to know is that we will get the same estimates of b from either specification. The data see to it that the direction of the censoring is accounted for. We would require prior information to differentiate the two models. The data alone cannot distinguish them; they can identify only the direction of the censoring. One would have to specify the form of the censoring to distinguish the two models, and then only the estimate of rho would differ.