**[R] heckman** -- Heckman selection model

__Syntax__

Basic syntax

**heckman** *depvar* [*indepvars*]**,** __sel__**ect(***varlist_s***)** [__two__**step**]

or

**heckman** *depvar* [*indepvars*]**,** __sel__**ect(***depvar_s* **=** *varlist_s***)** [__two__**step**]

Full syntax for maximum likelihood estimates only

**heckman** *depvar* [*indepvars*] [*if*] [*in*] [*weight*]**,** __sel__**ect(**[*depvar_s* **=**]
*varlist_s* [**,** __nocons__**tant** __off__**set(***varname_o***)**]**)**
[*heckman_ml_options*]

Full syntax for Heckman's two-step consistent estimates only

**heckman** *depvar* [*indepvars*] [*if*] [*in*]**,** __two__**step** __sel__**ect(**[*depvar_s* **=**]
*varlist_s* [**,** __nocons__**tant**]**)** [*heckman_ts_options*]

*heckman_ml_options* Description
-------------------------------------------------------------------------
Model
__ml__**e** use maximum likelihood estimator; the
default
* __sel__**ect()** specify selection equation: dependent and
independent variables; whether to have
constant term and offset variable
__nocons__**tant** suppress constant term
__off__**set(***varname***)** include *varname* in model with coefficient
constrained to 1
__const__**raints(***constraints***)** apply specified linear constraints
__col__**linear** keep collinear variables

SE/Robust
**vce(***vcetype***)** *vcetype* may be **oim**, __r__**obust**, __cl__**uster**
*clustvar*, **opg**, __boot__**strap**, or __jack__**knife**

Reporting
__l__**evel(***#***)** set confidence level; default is **level(95)**
__fir__**st** report first-step probit estimates
**lrmodel** perform the likelihood-ratio model test
instead of the default Wald test
__ns__**hazard(***newvar***)** generate nonselection hazard variable
__m__**ills(***newvar***)** synonym for **nshazard()**
__nocnsr__**eport** do not display constraints
*display_options* control columns and column formats, row
spacing, line width, display of omitted
variables and base and empty cells, and
factor-variable labeling

Maximization
*maximize_options* control the maximization process; seldom
used

__coefl__**egend** display legend instead of statistics
-------------------------------------------------------------------------
* **select()** is required. The full specification is
__sel__**ect(**[*depvar_s* **=**] *varlist_s* [**,** __nocons__**tant** __off__**set(***varname_o***)**]**)**.

*heckman_ts_options* Description
-------------------------------------------------------------------------
Model
* __two__**step** produce two-step consistent estimate
* __sel__**ect()** specify selection equation: dependent and
independent variables; whether to have
constant term
__nocons__**tant** suppress constant term
__rhos__**igma** truncate rho to [-1,1] with consistent
sigma
__rhot__**runc** truncate rho to [-1,1]
__rhol__**imited** truncate rho in limited cases
__rhof__**orce** do not truncate rho

SE
**vce(***vcetype***)** *vcetype* may be **conventional**, __boot__**strap**, or
__jack__**knife**

Reporting
__l__**evel(***#***)** set confidence level; default is **level(95)**
__fir__**st** report first-step probit estimates
__ns__**hazard(***newvar***)** generate nonselection hazard variable
__m__**ills(***newvar***)** synonym for **nshazard()**
*display_options* control columns and column formats, row
spacing, line width, display of omitted
variables and base and empty cells, and
factor-variable labeling

__coefl__**egend** display legend instead of statistics
-------------------------------------------------------------------------
* **twostep** and **select()** are required. The full specification is
__sel__**ect(**[*depvar_s* **=**] *varlist_s* [**,** __nocons__**tant**]**)**.

*indepvars* and *varlist_s* may contain factor variables; see fvvarlist.
*depvar*, *indepvars*, *varlist_s*, and *depvar_s* may contain time-series
operators; see tsvarlist.
**bayes**, **bootstrap**, **by**, **fp**, **jackknife**, **rolling**, **statsby**, and **svy** are
allowed; see prefix. For more details, see **[BAYES] bayes: heckman**.
Weights are not allowed with the **bootstrap** prefix.
**twostep**, **vce()**, **first**, **lrmodel**, and weights are not allowed with the **svy**
prefix.
**pweight**s, **fweight**s, and **iweight**s are allowed with maximum likelihood
estimation; see weight. No weights are allowed if **twostep** is
specified.
**coeflegend** does not appear in the dialog box.
See **[R] heckman postestimation** for features available after estimation.

__Menu__

**Statistics > Sample-selection models > Heckman selection model**

__Description__

**heckman** fits regression models with selection by using either Heckman's
two-step consistent estimator or full maximum likelihood.

__Options for Heckman selection model (ML)__

+-------+
----+ Model +------------------------------------------------------------

**mle** requests that the maximum likelihood estimator be used. This is the
default.

**select(**[*depvar_s* **=**] *varlist_s* [**,** **noconstant** **offset(***varname_o***)**]**)** specifies
the variables and options for the selection equation. It is an
integral part of specifying a Heckman model and is required. The
selection equation should contain at least one variable that is not
in the outcome equation.

If *depvar_s* is specified, it should be coded as 0 or 1, with 0
indicating an observation not selected and 1 indicating a selected
observation. If *depvar_s* is not specified, observations for which
*depvar* is not missing are assumed selected, and those for which
*depvar* is missing are assumed not selected.

**noconstant** suppresses the selection constant term (intercept).

**offset(***varname_o***)** specifies that selection offset *varname_o* be
included in the model with the coefficient constrained to be 1.

**noconstant**, **offset(***varname***)**, **constraints(***constraints***)**, **collinear**; see **[R]**
**estimation options**.

+-----------+
----+ SE/Robust +--------------------------------------------------------

**vce(***vcetype***)** specifies the type of standard error reported, which
includes types that are derived from asymptotic theory (**oim**, **opg**),
that are robust to some kinds of misspecification (**robust**), that
allow for intragroup correlation (**cluster** *clustvar*), and that use
bootstrap or jackknife methods (**bootstrap**, **jackknife**); see **[R]**
*vce_option*.

+-----------+
----+ Reporting +--------------------------------------------------------

**level(***#***)**; see **[R] estimation options**.

**first** specifies that the first-step probit estimates of the selection
equation be displayed before estimation.

**lrmodel**; see **[R] estimation options**.

**nshazard(***newvar***)** and **mills(***newvar***)** are synonyms; either will create a new
variable containing the nonselection hazard -- what Heckman (1979)
referred to as the inverse of the Mills ratio -- from the selection
equation. The nonselection hazard is computed from the estimated
parameters of the selection equation.

**nocnsreport**; see **[R] estimation options**.

*display_options*: **noci**, __nopv__**alues**, __noomit__**ted**, **vsquish**, __noempty__**cells**,
__base__**levels**, __allbase__**levels**, __nofvlab__**el**, **fvwrap(***#***)**, **fvwrapon(***style***)**,
**cformat(%***fmt***)**, **pformat(%***fmt***)**, **sformat(%***fmt***)**, and **nolstretch**; see **[R]**
**estimation options**.

+--------------+
----+ Maximization +-----------------------------------------------------

*maximize_options*: __dif__**ficult**, __tech__**nique(***algorithm_spec***)**, __iter__**ate(***#***)**,
[__no__]__lo__**g**, __tr__**ace**, __grad__**ient**, **showstep**, __hess__**ian**, __showtol__**erance**,
__tol__**erance(***#***)**, __ltol__**erance(***#***)**, __nrtol__**erance(***#***)**, __nonrtol__**erance**, and
**from(***init_specs***)**; see **[R] maximize**. These options are seldom used.

Setting the optimization type to **technique(bhhh)** resets the default
*vcetype* to **vce(opg)**.
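
As a minimal sketch of the note above, the following fits the ML model with the BHHH algorithm and then displays the stored variance type and technique. It assumes the **womenwk** dataset used in the Examples section; no particular output is asserted here.

```stata
* Sketch: fit by ML using BHHH and inspect which VCE was used
webuse womenwk, clear
heckman wage educ age, select(married children educ age) technique(bhhh)

* e(vce) and e(technique) are macros listed under Stored results
display "vce       = `e(vce)'"
display "technique = `e(technique)'"
```

Specifying **vce()** explicitly alongside **technique(bhhh)** overrides the changed default.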

The following option is available with **heckman** but is not shown in the
dialog box:

**coeflegend**; see **[R] estimation options**.

__Options for Heckman selection model (two-step)__

+-------+
----+ Model +------------------------------------------------------------

**twostep** specifies that Heckman's (1979) two-step consistent estimates of
the parameters, standard errors, and covariance matrix be produced.

**select(**[*depvar_s* **=**] *varlist_s* [**,** **noconstant**]**)** specifies the variables and
options for the selection equation. It is an integral part of
specifying a Heckman model and is required. The selection equation
should contain at least one variable that is not in the outcome
equation.

If *depvar_s* is specified, it should be coded as 0 or 1, with 0
indicating an observation not selected and 1 indicating a selected
observation. If *depvar_s* is not specified, observations for which
*depvar* is not missing are assumed selected, and those for which
*depvar* is missing are assumed not selected.

**noconstant** suppresses the selection constant term (intercept).

**noconstant**; see **[R] estimation options**.

**rhosigma**, **rhotrunc**, **rholimited**, and **rhoforce** are rarely used options to
specify how the two-step estimator (option **twostep**) handles unusual
cases in which the two-step estimate of rho is outside the admissible
range for a correlation, [-1,1]. When rho is outside this range, the
two-step estimate of the coefficient variance-covariance matrix may
not be positive definite and thus may be unusable for testing. The
default is **rhosigma**.

**rhosigma** specifies that rho be truncated, as with the **rhotrunc**
option, and that the estimate of sigma be made consistent with
rho_hat, the truncated estimate of rho. So, sigma_hat = B_m *
rho_hat; see *Methods and formulas* in **[R] heckman** for the definition
of B_m. Both the truncated rho and the new estimate of sigma_hat are
used in all computations to estimate the two-step covariance matrix.

**rhotrunc** specifies that rho be truncated to lie in the range [-1,1].
If the two-step estimate is less than -1, rho is set to -1; if the
two-step estimate is greater than 1, rho is set to 1. This truncated
value of rho is used in all computations to estimate the two-step
covariance matrix.

**rholimited** specifies that rho be truncated only in computing the
diagonal matrix D as it enters V_twostep and Q; see *Methods and*
*formulas* in **[R] heckman**. In all other computations, the untruncated
estimate of rho is used.

**rhoforce** specifies that the two-step estimate of rho be retained,
even if it is outside the admissible range for a correlation. This
option may, in rare cases, lead to a non-positive-definite covariance
matrix.

These options have no effect when estimation is by maximum
likelihood, the default. They also have no effect when the two-step
estimate of rho is in the range [-1,1].
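
A short sketch of how one might check which rho-handling rule was applied after a two-step fit. It assumes the **womenwk** dataset from the Examples section and reads only results listed under Stored results (**e(rho)** and **e(rhometh)**).

```stata
* Sketch: two-step fit with an explicit rho-handling option
webuse womenwk, clear
heckman wage educ age, select(married children educ age) twostep rhotrunc

* Reported estimate of rho and the rule recorded for handling it
display "rho    = " e(rho)
display "method = `e(rhometh)'"
```

If the two-step estimate of rho already lies in [-1,1], the chosen option leaves the results unchanged, as noted above.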

+----+
----+ SE +---------------------------------------------------------------

**vce(***vcetype***)** specifies the type of standard error reported, which
includes types that are derived from asymptotic theory (**conventional**)
and that use bootstrap or jackknife methods (**bootstrap**, **jackknife**);
see **[R]** *vce_option*.

**vce(conventional)**, the default, uses the two-step variance estimator
derived by Heckman.

+-----------+
----+ Reporting +--------------------------------------------------------

**level(***#***)**; see **[R] estimation options**.

**first** specifies that the first-step probit estimates of the selection
equation be displayed before estimation.

**nshazard(***newvar***)** and **mills(***newvar***)** are synonyms; either will create a new
variable containing the nonselection hazard -- what Heckman (1979)
referred to as the inverse of the Mills ratio -- from the selection
equation. The nonselection hazard is computed from the estimated
parameters of the selection equation.
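
To illustrate what the nonselection hazard contains, the sketch below compares the variable created by **mills()** with a hand computation from a separate probit fit of the selection equation. The variable names **imr_heck**, **imr_hand**, **wageseen**, and **zg** are made up for this illustration, and the comparison assumes the two-step first stage is an ordinary probit on the same sample.

```stata
* Sketch: nonselection hazard from heckman versus a hand computation
webuse womenwk, clear
heckman wage educ age, select(married children educ age) twostep mills(imr_heck)

* Selection indicator: 1 if wage is observed, 0 otherwise
generate byte wageseen = (wage < .)

* Probit for selection, then the linear index Zg
probit wageseen married children educ age
predict double zg, xb

* Inverse Mills ratio: normal density over normal CDF at Zg
generate double imr_hand = normalden(zg) / normal(zg)

* The two variables should agree up to numerical tolerance
summarize imr_heck imr_hand
```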

*display_options*: **noci**, __nopv__**alues**, __noomit__**ted**, **vsquish**, __noempty__**cells**,
__base__**levels**, __allbase__**levels**, __nofvlab__**el**, **fvwrap(***#***)**, **fvwrapon(***style***)**,
**cformat(%***fmt***)**, **pformat(%***fmt***)**, **sformat(%***fmt***)**, and **nolstretch**; see **[R]**
**estimation options**.

The following option is available with **heckman** but is not shown in the
dialog box:

**coeflegend**; see **[R] estimation options**.

__Remarks__

**heckman** estimates all the parameters in the model:

(regression equation: y is *depvar*, x is *varlist*)
y = xb + u_1

(selection equation: Z is *varlist_s*)
y observed if Zg + u_2 > 0

where:
u_1 ~ N(0, sigma)
u_2 ~ N(0, 1)
corr(u_1, u_2) = rho

In the syntax for **heckman**, *depvar* and *varlist* are the dependent variable
and regressors for the underlying regression model (y = xb), and
*varlist_s* are the variables (Z) thought to determine whether *depvar* is
observed or unobserved (selected or not selected). By default, **heckman**
assumes that missing values (see missing) of *depvar* imply that the
dependent variable is unobserved (not selected). With some datasets, it
is more convenient to specify a binary variable (*depvar_s*) that
identifies the observations for which the dependent variable is
observed/selected (*depvar_s*!=0) or not observed (*depvar_s*=0); **heckman**
will accommodate either type of data.
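
The role of the selection equation can be seen from the standard expression for the expected outcome among selected observations. This is a sketch that assumes sigma is the standard deviation of u_1; the symbols \phi and \Phi (standard normal density and distribution function) and \lambda (the inverse Mills ratio) are notation introduced here for illustration.

```latex
E(y \mid y \text{ observed}) = xb + \rho\,\sigma\,\lambda(Zg),
\qquad
\lambda(a) = \frac{\phi(a)}{\Phi(a)}
```

When rho = 0, the second term vanishes and ordinary regression on the selected sample is consistent for b. Otherwise, omitting the term biases the regression estimates, which is the problem that both estimators address.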

__Examples__

Setup
**. webuse womenwk**

Obtain full ML estimates
**. heckman wage educ age, select(married children educ age)**

Obtain Heckman's two-step consistent estimates
**. heckman wage educ age, select(married children educ age) twostep**

Define and use each equation separately
**. global wage_eqn wage educ age**
**. global seleqn married children age**
**. heckman $wage_eqn, select($seleqn)**

Use a variable to identify selection
**. generate wageseen = (wage < .)**
**. heckman wage educ age, select(wageseen = married children educ age)**

Specify robust variance
**. heckman wage educ age, select(married children educ age) vce(robust)**

Specify clustering on **county**
**. heckman $wage_eqn, select($seleqn) vce(cluster county)**

Report first-step probit estimates
**. heckman wage educ age, select(married children educ age) first**

Create **mymills** containing nonselection hazard
**. heckman $wage_eqn, select($seleqn) mills(mymills)**

No constant in model
**. heckman wage educ age, noconstant select(married children educ age)**

No constant in selection equation
**. heckman wage educ age, select(married children educ age, noconstant)**

__Stored results__

**heckman** (maximum likelihood) stores the following in **e()**:

Scalars
**e(N)** number of observations
**e(N_selected)** number of selected observations
**e(N_nonselected)** number of nonselected observations
**e(k)** number of parameters
**e(k_eq)** number of equations in **e(b)**
**e(k_eq_model)** number of equations in overall model test
**e(k_aux)** number of auxiliary parameters
**e(k_dv)** number of dependent variables
**e(df_m)** model degrees of freedom
**e(ll)** log likelihood
**e(ll_0)** log likelihood, constant-only model
**e(N_clust)** number of clusters
**e(lambda)** lambda
**e(selambda)** standard error of lambda
**e(sigma)** sigma
**e(chi2)** chi-squared
**e(chi2_c)** chi-squared for comparison test
**e(p)** p-value for model test
**e(p_c)** p-value for comparison test
**e(rho)** rho
**e(rank)** rank of **e(V)**
**e(rank0)** rank of **e(V)** for constant-only model
**e(ic)** number of iterations
**e(rc)** return code
**e(converged)** **1** if converged, **0** otherwise

Macros
**e(cmd)** **heckman**
**e(cmdline)** command as typed
**e(depvar)** names of dependent variables
**e(wtype)** weight type
**e(wexp)** weight expression
**e(title)** title in estimation output
**e(title2)** secondary title in estimation output
**e(clustvar)** name of cluster variable
**e(offset1)** offset for regression equation
**e(offset2)** offset for selection equation
**e(mills)** variable containing nonselection hazard (inverse of
Mills's ratio)
**e(chi2type)** **Wald** or **LR**; type of model chi-squared test
**e(chi2_ct)** **Wald** or **LR**; type of model chi-squared test
corresponding to **e(chi2_c)**
**e(vce)** *vcetype* specified in **vce()**
**e(vcetype)** title used to label Std. Err.
**e(opt)** type of optimization
**e(which)** **max** or **min**; whether optimizer is to perform
maximization or minimization
**e(method)** **ml**
**e(ml_method)** type of **ml** method
**e(user)** name of likelihood-evaluator program
**e(technique)** maximization technique
**e(properties)** **b V**
**e(predict)** program used to implement **predict**
**e(marginsok)** predictions allowed by **margins**
**e(marginsnotok)** predictions disallowed by **margins**
**e(asbalanced)** factor variables **fvset** as **asbalanced**
**e(asobserved)** factor variables **fvset** as **asobserved**

Matrices
**e(b)** coefficient vector
**e(Cns)** constraints matrix
**e(ilog)** iteration log (up to 20 iterations)
**e(gradient)** gradient vector
**e(V)** variance-covariance matrix of the estimators
**e(V_modelbased)** model-based variance

Functions
**e(sample)** marks estimation sample
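
As a brief sketch of how these results can be used, the following fits the ML model on the **womenwk** dataset from the Examples section and displays a few of the stored scalars. The last comparison relies on lambda being reported as the product of rho and sigma.

```stata
* Sketch: inspect results stored by the ML fit
webuse womenwk, clear
heckman wage educ age, select(married children educ age)

display "rho       = " e(rho)
display "sigma     = " e(sigma)
display "lambda    = " e(lambda)

* lambda is rho*sigma, so this should match e(lambda)
display "rho*sigma = " e(rho)*e(sigma)

display "selected: " e(N_selected) "  nonselected: " e(N_nonselected)
```

Typing **ereturn list** after estimation shows everything currently stored in **e()**.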

**heckman** (two-step) stores the following in **e()**:

Scalars
**e(N)** number of observations
**e(N_selected)** number of selected observations
**e(N_nonselected)** number of nonselected observations
**e(df_m)** model degrees of freedom
**e(lambda)** lambda
**e(selambda)** standard error of lambda
**e(sigma)** sigma
**e(chi2)** chi-squared
**e(p)** p-value for model test
**e(rho)** rho
**e(rank)** rank of **e(V)**

Macros
**e(cmd)** **heckman**
**e(cmdline)** command as typed
**e(depvar)** names of dependent variables
**e(title)** title in estimation output
**e(title2)** secondary title in estimation output
**e(mills)** variable containing nonselection hazard (inverse of
Mills's ratio)
**e(chi2type)** **Wald** or **LR**; type of model chi-squared test
**e(vce)** *vcetype* specified in **vce()**
**e(rhometh)** **rhosigma**, **rhotrunc**, **rholimited**, or **rhoforce**
**e(method)** **twostep**
**e(properties)** **b V**
**e(predict)** program used to implement **predict**
**e(marginsok)** predictions allowed by **margins**
**e(marginsnotok)** predictions disallowed by **margins**
**e(asbalanced)** factor variables **fvset** as **asbalanced**
**e(asobserved)** factor variables **fvset** as **asobserved**

Matrices
**e(b)** coefficient vector
**e(V)** variance-covariance matrix of the estimators

Functions
**e(sample)** marks estimation sample

__Reference__

Heckman, J. 1979. Sample selection bias as a specification error.
*Econometrica* 47: 153--161.