
What is the difference between “endogeneity” and “sample selection bias”?

Title   Endogeneity versus sample selection bias
Author Daniel Millimet, Southern Methodist University


Many individuals have posted questions using sample selection bias and endogeneity interchangeably or incorrectly. I do not intend to single out one individual, but consider the case of the effect of trade union membership on workers’ wages. Using a dummy variable to pick up this effect in a pooled sample of union and nonunion workers is inappropriate because workers may self-select into unions, so union membership may not be random.

One approach I have read is to use a probit model to estimate the probability of being in a union (1 being union worker and 0 being nonunion worker). Then from the probit equation, obtain predicted probabilities of being a union worker for the entire sample of union and nonunion workers. Then use these predicted probabilities in place of a union dummy variable to estimate the effect of being in a union. This approach is supposed to control for sample selection bias.

I am trying to relate this procedure with the standard Heckman’s two-stage procedure that uses the inverse Mills’ ratio. Any help will be much appreciated.


Sample selection bias and endogeneity bias are two distinct concepts that call for distinct solutions. In general, sample selection bias refers to problems where the dependent variable is observed only for a restricted, nonrandom sample. Using the example above, one observes an individual’s union wage only if the individual has joined a union. Conversely, one observes an individual’s nonunion wage only if the individual does not belong to a union. Endogeneity refers to the fact that an independent variable included in the model is potentially a choice variable, correlated with unobservables relegated to the error term. In this case the dependent variable is observed for all observations in the data. Here union status may be endogenous if the decision to join or not join a union is correlated with unobservables that affect wages. For instance, if less able workers are more likely to join a union and, because of their lower ability, would receive lower wages ceteris paribus, then failure to control for this correlation will yield an estimated union effect on wages that is biased downward.
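The downward bias in the last example can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the data-generating process, the parameter values, and the function names are hypothetical and are not taken from any real wage data. With a single dummy regressor, the OLS coefficient equals the difference in group means, so no regression library is needed.

```python
import random
from statistics import mean


def simulate_union_wages(n=20000, true_effect=0.2, seed=12345):
    """Simulate wages where less able workers are more likely to unionize.

    Hypothetical data-generating process, for illustration only.
    """
    rng = random.Random(seed)
    union, wage, ability = [], [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)  # unobserved ability, part of the wage error term
        # Lower ability raises the propensity to join a union.
        u = 1 if (-a + rng.gauss(0.0, 1.0)) > 0 else 0
        w = 1.5 + true_effect * u + a + rng.gauss(0.0, 0.5)
        union.append(u)
        wage.append(w)
        ability.append(a)
    return union, wage, ability


def difference_in_means(y, d):
    """OLS slope of y on a constant and a dummy d (difference in group means)."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control)


union, wage, ability = simulate_union_wages()

# Naive estimate: ability is left in the error term and is correlated with union.
naive = difference_in_means(wage, union)

# Infeasible benchmark: subtract the (normally unobserved) ability from the wage.
oracle = difference_in_means([w - a for w, a in zip(wage, ability)], union)

print(f"true effect 0.2, naive estimate {naive:.3f}, oracle estimate {oracle:.3f}")
```

In this setup the wage is observed for every worker, so nothing is lost to sample selection. The naive estimate falls below the true effect only because union status is correlated with the omitted ability term.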

The problem with unions and wages, and a host of other problems, can be treated either as a sample selection problem or as an endogeneity problem. The “appropriate” model depends on how one believes unions affect wages.

Model I. Endogeneity

If one believes union status has merely an intercept effect on wages (i.e., results in a parallel shift up or down for various wage profiles), then the appropriate model includes union status as a right-hand-side variable and pools the entire sample of union and nonunion workers. Because the entire sample is used, there are no sample-selection issues. (There may be a sample selection issue to the extent that wages are observed only for employed workers; typically this is a cause for concern only in estimating wage equations for females.) One can then estimate a typical wage regression equation via OLS. If one believes union status is endogenous because workers self-select into union and nonunion jobs, then one should instrument for union status. One can use either two-step methods, as outlined in the question above, or the Stata command etregress. Upon fitting the model, the union status coefficient answers the following question: “Conditional on the Xs, what is the average effect on wages of belonging to a union?” Under this estimation technique, the betas (the coefficients on the Xs) are restricted to be the same for union and nonunion workers. For example, the return to education is restricted to be the same regardless of whether one is in a union.
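In symbols, Model I can be sketched as the system below. The notation is an assumption introduced here for illustration and does not appear in the original answer: w_i is the wage, U_i the union dummy, X_i the wage covariates (including a constant), Z_i the covariates in the union probit, and rho the correlation between the two error terms.

```latex
\begin{aligned}
w_i &= X_i\beta + \delta U_i + \varepsilon_i, \\
U_i &= \mathbf{1}\{Z_i\gamma + u_i > 0\}, \\
\operatorname{Corr}(\varepsilon_i, u_i) &= \rho .
\end{aligned}
```

Union status is endogenous when rho is nonzero, in which case OLS on the wage equation gives a biased estimate of delta. The single beta shared by both groups is the pooling restriction: union status shifts the intercept by delta but leaves the slopes unchanged.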

Model II. Sample Selection

If one believes that union status has not only an intercept effect but also a slope effect (i.e., the betas differ according to union status as well), then a sample selection model is called for. To proceed, split the sample into union and nonunion workers and then estimate a wage equation for each subsample. If union status is the only potentially endogenous variable in the model, the two separate wage equations may be estimated via OLS after accounting for the fact that each sample is a nonrandom sample of all workers. This is accomplished via Heckman’s selection correction model (using either ML estimation, or two-step estimation where in the first stage a probit model is used to predict the probability of union status and in the second stage the inverse Mills’ ratio [IMR] is included as a regressor). In this type of model, the union effect does not show up as a dummy variable but rather in the fact that the constant term and betas may differ between the union and nonunion samples. The difference in the constants yields the difference in average wages between a union worker and a nonunion worker who both have X=0. The difference in the betas tells one how the returns to different observable attributes vary by union status. Essentially, this model allows a full set of interaction terms between union status and the Xs. A Chow test could be used to test whether the betas differ by union status. If they do not, Model I is more efficient. This type of model is also known as an endogenous switching regime model.
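Model II can be written out as follows. The notation is again an illustrative assumption rather than the author's own: the subscripts 1 and 0 denote the union and nonunion regimes, phi and Phi are the standard normal density and distribution function, and sigma_{ju} is the covariance between the regime-j wage error and the probit error, assuming the errors are jointly normal.

```latex
\begin{aligned}
U_i &= \mathbf{1}\{Z_i\gamma + u_i > 0\}, \qquad u_i \sim N(0,1), \\
w_{1i} &= X_i\beta_1 + \varepsilon_{1i} \quad \text{observed only if } U_i = 1, \\
w_{0i} &= X_i\beta_0 + \varepsilon_{0i} \quad \text{observed only if } U_i = 0, \\
E[w_{1i} \mid X_i, Z_i, U_i = 1] &= X_i\beta_1 + \sigma_{1u}\,\frac{\phi(Z_i\gamma)}{\Phi(Z_i\gamma)}, \\
E[w_{0i} \mid X_i, Z_i, U_i = 0] &= X_i\beta_0 - \sigma_{0u}\,\frac{\phi(Z_i\gamma)}{1 - \Phi(Z_i\gamma)} .
\end{aligned}
```

The ratio terms in the last two lines are the IMR regressors added in the second stage, one for each subsample. Setting beta_1 and beta_0 equal except for the constant gives back the pooled specification of Model I.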

Other references: Main and Reilly (1993) estimate a sample-selection model similar to Model II, where they split the sample depending on the size of the firm where the individual works. Thus their first stage involves estimating an ordered probit for three classes of firm size (small, medium, or large), and their second stage involves estimating three wage equations, each including the appropriate IMR term. Millimet (2000, SMU working paper) estimates the effect of household size on schooling using a similar modeling technique. Maddala (1983) also gives a good introduction to these issues.

Model III. Endogeneity and sample selection

One may also confront both types of biases in the same model. For example, say one wants to estimate the effect of union status on wages for women only. One may choose to include union status as a right-hand-side variable (Model I) or to split up the sample (Model II). If one opts for Model I, one still has to confront the fact that wages for women are only selectively observed: they are observed only for those women choosing to participate in the labor force. To fit this model, one would start by estimating a probit model explaining the decision of women to work or not. One would then generate the IMR and include the IMR and the union dummy in a second-stage wage regression, where one would instrument for union status if it were thought to be endogenous. Finally, if Model II were desired, then one would be confronted with a double-selection model. I believe one would estimate a probit for labor force participation first. Upon generating the IMR term, this would be included in a second probit equation explaining union status. The appropriate IMR term from this equation would then be included in the two final wage equations. (This topic is covered in Amemiya 1985.)
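The Model I variant of this procedure can be sketched as below. The symbols are assumptions made for illustration: P_i indicates labor force participation, Q_i collects the variables in the participation probit, and lambda-hat is the IMR computed from the fitted probit. This is only an outline of the estimating equation, not a full statement of the assumptions needed for consistency.

```latex
\begin{aligned}
P_i &= \mathbf{1}\{Q_i\alpha + v_i > 0\}, \qquad v_i \sim N(0,1), \\
\hat\lambda_i &= \frac{\phi(Q_i\hat\alpha)}{\Phi(Q_i\hat\alpha)}, \\
w_i &= X_i\beta + \delta U_i + \theta\hat\lambda_i + \eta_i \quad \text{for } P_i = 1 .
\end{aligned}
```

The last equation is estimated on working women only, with U_i instrumented if union status is thought to be endogenous.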


As in any model, one must be aware of where identification comes from. It is well known that instrumental variables estimation requires a variable that is correlated with the endogenous variable, is uncorrelated with the error term, and does not affect the outcome of interest conditional on the included regressors. Identification in sample selection models is often not as well grounded. Because the IMR is a nonlinear function of the variables included in the first-stage probit model (call these Z), the second-stage equation is identified by this nonlinearity even if Z=X. However, the nonlinearity of the IMR arises from the assumption of normality in the probit model. Since most researchers do not test or justify the normality assumption, it is highly questionable whether this assumption should be used as the sole source of identification. Thus, it is advisable, in my opinion, to have a variable in Z that is not also included in X. This step makes the source of identification clear (and debatable). For the double-selection model discussed above in Model III, two exclusion restrictions would be needed (one for the labor force probit, one for the union probit).
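To make the role of normality concrete, the IMR for the selected sample is the standard normal density divided by the standard normal distribution function, both evaluated at the probit index. The sketch below tabulates it over a few index values; the function name and the grid of values are illustrative assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_mills_ratio(index):
    """Return phi(index) / Phi(index) for a probit index such as Z*gamma."""
    return _STD_NORMAL.pdf(index) / _STD_NORMAL.cdf(index)


# Tabulate the IMR over a hypothetical range of first-stage probit indices.
for index in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"index {index:5.1f}   IMR {inverse_mills_ratio(index):.4f}")
```

The IMR is a smooth, decreasing function of the index, and its curvature comes entirely from the normal density and distribution function. When Z=X, that curvature is the only thing separating the IMR from a linear combination of the Xs, which is why an exclusion restriction is the safer source of identification.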


Amemiya, T. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press.
Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Main, B. and B. Reilly. 1993. The employer size-wage gap: evidence for Britain. Economica 60: 125–142.




