Why are there so many formulas for the inverse of Mills’ ratio?
What if I have censoring from above/below in my Heckman selection model?
Mills’ ratios and censoring direction in the Heckman selection model
Vince Wiggins, StataCorp
May 1999; minor revisions July 2011
Someone asked about what Heckman called the “inverse of Mills’
ratio” (IMR) and its relation to Heckman’s two-step method for
estimating selection models.
The definition of the IMR tends to be somewhat inconsistent. In fact, the
current help file for heckman uses
the more intuitive “nonselection hazard” instead of
“inverse Mills”, primarily because the latter has so many
variations in the literature.
Hazard has the unequivocal definition H(x) = f(x) / (1 − F(x)), and
Mills’ is usually just taken to be 1 / H(x), so the inverse
Mills’ is just the hazard. Many papers, however, take liberties with
this definition of the Mills’ and are not always clear about why.
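As a small illustration of these two definitions, here is a minimal sketch for the standard normal. The function names `hazard` and `mills_ratio` are my own labels, not Stata functions, and the sketch assumes the "usual" convention stated above, in which Mills' ratio is 1 / H(x).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def hazard(x):
    """H(x) = f(x) / (1 - F(x)) for the standard normal."""
    return STD_NORMAL.pdf(x) / (1.0 - STD_NORMAL.cdf(x))


def mills_ratio(x):
    """Mills' ratio under the convention 1 / H(x) = (1 - F(x)) / f(x)."""
    return (1.0 - STD_NORMAL.cdf(x)) / STD_NORMAL.pdf(x)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.5):
        # Under this convention the inverse Mills' ratio is the hazard itself.
        print(x, hazard(x), 1.0 / mills_ratio(x))
```

The two printed columns should agree up to floating-point rounding, which is all that "the inverse Mills' is just the hazard" claims.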
The questioner states
As I understand it, the inverse Mills’ ratio (IMR) computed by
Stata’s heckman command, and used in the
second-stage regression, is lambda=f(x)/F(x), where f(x) is the pdf and
F(x) is the CDF (see [R] heckman).
What I do not understand is exactly how this fits in with the definitions
of the IMR found in the literature. For example (sorry for any unclear
notation):
1. Greene, Econometric Analysis, 7th Edition, p. 836
(Note: my A = his alpha; a is the truncation point)
(1-1) lambda(A) = f(A) / (1 − F(A)) if truncation is x > a
(1-2) lambda(A) = −f(A) / F(A) if truncation is x < a
2. Heckman, 1979, p. 156
lambda = f(Zi) / (1 − F(Zi))
where Zi = − X2iB2 / (S22)^0.5
3. LimDep 8.0 Manual, p. E23.4 (also by W. Greene)
lambda = f(A'w) / F(A'w)
4. LimDep 8.0 Manual, p. E23.7
(4-1) lambda = f(A) / F(A) if z = 1
(4-2) lambda = −f(A) / (1 − F(A)) if z = 0
They might also have added that
5. Maddala, 1983, Limited-Dependent and Qualitative Variables in
Econometrics, p. 231, adds his voice for
lambda = f(Z) / (1 − F(Z))
Another user noted that
The key is to remember some basic facts about the standard normal pdf (f)
and CDF (F).
- 1 − F(A) = F(−A)
- f(A) = f(−A)
Using these two facts and some algebraic manipulation, you can show that
all of the different formulas for the IMR are equivalent.
This observation is at the heart of one problem and shows why (5), (4-1),
(3), (2), and (1-1) can, in the right context, be used interchangeably.
This first area of confusion results because some authors choose to model
selection (e.g., Heckman), whereas others choose to model nonselection.
If you model nonselection, the natural choice for the nonselection hazard is
f(Zg) / (1 − F(Zg)), where typically Zg = z1*g1 + z2*g2 + ... from the
nonselection model. If you model selection, the natural choice for
nonselection hazard is f(Zg) / F(Zg), where Zg = z1*g1 + z2*g2 + ... from
the selection model. These are the same number because the Gaussian is
symmetric.
In both cases, authors are computing the nonselection hazard; they are just
beginning in the first case with a model of nonselection—so you get
the standard form for the hazard—and in the second case with a model
of selection—so the nonselection hazard has a different computation
that arrives at the same value.
Authors also tend to write the nonselection hazard in whatever form is
convenient. Heckman, for example, models selection and uses f(−Zg) /
(1 − F(−Zg)) which is equal to f(Zg) / F(Zg).
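The equivalence of the two forms can be checked numerically. The sketch below assumes that a selection index Zg corresponds to a nonselection index of −Zg (the same latent variable with its sign reversed). The function names are illustrative only.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal


def hazard_from_nonselection_index(index):
    """Standard hazard form f(.) / (1 - F(.)), used when nonselection is modeled."""
    return N.pdf(index) / (1.0 - N.cdf(index))


def hazard_from_selection_index(index):
    """Form f(.) / F(.), used when selection is modeled."""
    return N.pdf(index) / N.cdf(index)


def max_gap(selection_indexes):
    """Largest absolute difference between the two forms over the given indexes."""
    return max(
        abs(hazard_from_selection_index(zg) - hazard_from_nonselection_index(-zg))
        for zg in selection_indexes
    )


if __name__ == "__main__":
    # The gap should be zero up to floating-point rounding.
    print(max_gap([-2.0, -0.3, 0.0, 0.7, 2.5]))
```

The check relies only on the two facts quoted earlier, 1 − F(A) = F(−A) and f(A) = f(−A).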
A second question is a bit trickier.
Maybe there is something obvious I am missing here, but I’m still
missing it. Stata’s calculation of the IMR appears to assume x <
a (“truncation from above” in Greene’s terminology, I
think). But in many—perhaps most—cases in econometrics, I
suspect the truncation is the other way around. Two that come to mind are
the classic female labor supply question and the model I am working on,
which is trying to explain the determinants of children’s school
performance, taking into consideration the selectivity aspect in a country
where a large proportion of children do not go to school.
This is not an issue for the Heckman estimator because the
direction of the truncation is normalized out in the specification. The
shortest story I can think of to show this is somewhat involved.
Let’s be clear about how the two-step estimator works. We have a
regression equation of interest
y1 = Xb + e1
where Xb = x1*b1 + x2*b2 + ...
We do not, however, always observe y1; instead, we have a selection equation
that determines whether y1 is observed.
y2 = Zg + e2
where Zg = z1*g1 + z2*g2 + ...
e1, e2 ~ N(0, 0, S1, 1, rho) (bivariate Gaussian)
where S2 = 1 is the same normalization used to identify a probit model.
Also, y2 is not observed. We know that y1 is observed only when y2 > 0,
i.e., when e2 > −Zg.
The use of 0 as the cutoff for selection is a necessary normalization
without which the model is not identified.
With this in hand, we can write the expectation of y1 conditional on y1
being observed; that is, y1 conditional on y2 > 0 or, equivalently, e2 > −Zg.
So the conditional expectation of y1 is
E(y1 | e2 > −Zg) = Xb + E(e1 | e2 > −Zg)
and from the moments of a censored bivariate Gaussian this is
E(y1 | e2 > −Zg) = Xb + rho*S1*f(Zg)/F(Zg)
Heckman’s insight was to formulate this conditional expectation and
then to obtain Zg from a probit estimation on whether y1 is observed. This
yields consistent estimates of b and rho*S1 when f(Zg)/F(Zg) is included
as a regressor: y1 = Xb + rho*S1 * f(Zg)/F(Zg).
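The conditional-expectation formula can be checked with a small simulation. This is only a sketch under stated assumptions: S1 is treated as the standard deviation of e1, the index Zg is held fixed at a single value, and the values of `zg`, `rho`, and `s1` in the usage lines are arbitrary choices for illustration. It does not perform the probit or regression steps.

```python
import random
from statistics import NormalDist, fmean


def simulated_conditional_mean(zg, rho, s1, draws=200_000, seed=12345):
    """Monte Carlo estimate of E(e1 | e2 > -zg) for a bivariate Gaussian."""
    rng = random.Random(seed)
    kept = []
    for _ in range(draws):
        e2 = rng.gauss(0.0, 1.0)  # selection error, variance normalized to 1
        u = rng.gauss(0.0, 1.0)   # independent shock
        # e1 has standard deviation s1 and correlation rho with e2.
        e1 = s1 * (rho * e2 + (1.0 - rho ** 2) ** 0.5 * u)
        if e2 > -zg:              # y2 = zg + e2 > 0, so y1 is observed
            kept.append(e1)
    return fmean(kept)


def closed_form_conditional_mean(zg, rho, s1):
    """rho * S1 * f(Zg) / F(Zg), the selection-correction term."""
    n = NormalDist()
    return rho * s1 * n.pdf(zg) / n.cdf(zg)


if __name__ == "__main__":
    zg, rho, s1 = 0.5, 0.6, 2.0
    print(simulated_conditional_mean(zg, rho, s1))
    print(closed_form_conditional_mean(zg, rho, s1))
```

The two printed numbers should be close but not identical, because the first is a simulation estimate.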
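Heckman’s f(Zg)/F(Zg) corresponds to Greene’s expression
(I’m going to change Greene’s notation slightly to match the
notation used here)
f(a − Zg) / (1 − F(a − Zg)) if truncation is y2 > a
because, as seen above, we have estimated a selection model and need the
nonselection hazard. With a = 0, this expression is f(−Zg) / (1 − F(−Zg)),
which equals f(Zg) / F(Zg) by the symmetry of the Gaussian.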
Finally, we are ready to answer the question about models where the expected
censoring is y2 < a rather than y2 > a. This corresponds to the
expression Greene quotes as
−f(a − Zg) / F(a − Zg) if truncation is y2 < a
and is the required component of the formula for the conditional expectation
of e1 when y2 < a.
Recall that the Heckman model normalized a to be 0—since a could not
be identified separately from the parameter vector g. That means we really
have the selection rule y2 < 0 for the case that concerns the questioner
and y2 > 0 for the standard formulation of the Heckman model. We also
know that E[e1] = 0 unconditionally, that is, without conditioning on y2,
from the assumptions of the regression model.
When centered at [0,0], the bivariate normal is mirror symmetric about the
origin. Thus we can’t tell the difference between a model with
selection rule y2 > 0 and a positive value for rho and a model with a
selection rule y2 < 0 and a negative value of rho.
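A quick numerical check of this symmetry, using Greene's two truncation expressions with a = 0. One assumption is made explicit here: redefining y2 as −y2 reverses the sign of the index Zg as well as the sign of rho, so the sketch compares rule y2 > 0 at (Zg, rho) with rule y2 < 0 at (−Zg, −rho). S1 is again treated as a standard deviation, and the function names are illustrative.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal


def correction_select_above(zg, rho, s1):
    """E(e1 | y2 > 0) with y2 = zg + e2: rho*s1 * f(-zg) / (1 - F(-zg))."""
    return rho * s1 * N.pdf(-zg) / (1.0 - N.cdf(-zg))


def correction_select_below(zg, rho, s1):
    """E(e1 | y2 < 0) with y2 = zg + e2: -rho*s1 * f(-zg) / F(-zg)."""
    return -rho * s1 * N.pdf(-zg) / N.cdf(-zg)


def max_gap(cases, s1=1.5):
    """Largest difference between rule y2 > 0 at (zg, rho) and y2 < 0 at (-zg, -rho)."""
    return max(
        abs(correction_select_above(zg, rho, s1)
            - correction_select_below(-zg, -rho, s1))
        for zg, rho in cases
    )


if __name__ == "__main__":
    # The gap should be zero up to floating-point rounding.
    print(max_gap([(-1.0, 0.3), (0.0, -0.5), (0.8, 0.9), (2.0, -0.2)]))
```

Because the correction term is the same in both parameterizations, the regression of y1 on X and that term gives the same b either way.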
The most important thing to know is we will get the same estimates of b
from either specification. The data see to it that the direction of the
censoring is accounted for. We would require prior information to
differentiate the two models. The data alone cannot distinguish them; they
can identify only the direction of the censoring. One would have to specify
the form of the censoring to distinguish the two models and then only the
estimate of rho would differ.
- Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in
Econometrics. Cambridge: Cambridge University Press.
- Greene, W. H. 2002. LIMDEP Version 8.0 Econometric Modeling Guide,
Volume 2. Plainview, NY: Econometric Software.
- Greene, W. H. 2011. Econometric Analysis. 7th ed. Upper Saddle River,
NJ: Prentice
- Heckman, J. 1979. Sample selection bias as a specification error.
Econometrica 47: