## Stata 15 help for heckman

```
[R] heckman -- Heckman selection model

Syntax

Basic syntax

heckman depvar [indepvars], select(varlist_s) [twostep]

or

heckman depvar [indepvars], select(depvar_s = varlist_s) [twostep]

Full syntax for maximum likelihood estimates only

heckman depvar [indepvars] [if] [in] [weight], select([depvar_s =]
        varlist_s [, noconstant offset(varname_o)])
        [heckman_ml_options]

Full syntax for Heckman's two-step consistent estimates only

heckman depvar [indepvars] [if] [in], twostep select([depvar_s =]
        varlist_s [, noconstant]) [heckman_ts_options]

heckman_ml_options            Description
-------------------------------------------------------------------------
Model
  mle                         use maximum likelihood estimator; the
                                default
* select()                    specify selection equation: dependent and
                                independent variables; whether to have
                                constant term and offset variable
  noconstant                  suppress constant term
  offset(varname)             include varname in model with coefficient
                                constrained to 1
  constraints(constraints)    apply specified linear constraints
  collinear                   keep collinear variables

SE/Robust
  vce(vcetype)                vcetype may be oim, robust, cluster
                                clustvar, opg, bootstrap, or jackknife

Reporting
  level(#)                    set confidence level; default is level(95)
  first                       report first-step probit estimates
  lrmodel                     perform the likelihood-ratio model test
                                instead of the default Wald test
  nshazard(newvar)            generate nonselection hazard variable
  mills(newvar)               synonym for nshazard()
  nocnsreport                 do not display constraints
  display_options             control columns and column formats, row
                                spacing, line width, display of omitted
                                variables and base and empty cells, and
                                factor-variable labeling

Maximization
  maximize_options            control the maximization process; seldom
                                used

  coeflegend                  display legend instead of statistics
-------------------------------------------------------------------------
* select() is required. The full specification is
select([depvar_s =] varlist_s [, noconstant offset(varname_o)]).

heckman_ts_options            Description
-------------------------------------------------------------------------
Model
* twostep                     produce two-step consistent estimate
* select()                    specify selection equation: dependent and
                                independent variables; whether to have
                                constant term
  noconstant                  suppress constant term
  rhosigma                    truncate rho to [-1,1] with consistent
                                sigma
  rhotrunc                    truncate rho to [-1,1]
  rholimited                  truncate rho in limited cases
  rhoforce                    do not truncate rho

SE
  vce(vcetype)                vcetype may be conventional, bootstrap, or
                                jackknife

Reporting
  level(#)                    set confidence level; default is level(95)
  first                       report first-step probit estimates
  nshazard(newvar)            generate nonselection hazard variable
  mills(newvar)               synonym for nshazard()
  display_options             control columns and column formats, row
                                spacing, line width, display of omitted
                                variables and base and empty cells, and
                                factor-variable labeling

  coeflegend                  display legend instead of statistics
-------------------------------------------------------------------------
* twostep and select() are required. The full specification is
select([depvar_s =] varlist_s [, noconstant]).

indepvars and varlist_s may contain factor variables; see fvvarlist.
depvar, indepvars, varlist_s, and depvar_s may contain time-series
operators; see tsvarlist.
bayes, bootstrap, by, fp, jackknife, rolling, statsby, and svy are
allowed; see prefix.  For more details, see [BAYES] bayes: heckman.
Weights are not allowed with the bootstrap prefix.
twostep, vce(), first, lrmodel, and weights are not allowed with the svy
prefix.
pweights, fweights, and iweights are allowed with maximum likelihood
estimation; see weight.  No weights are allowed if twostep is
specified.
coeflegend does not appear in the dialog box.
See [R] heckman postestimation for features available after estimation.

Menu

Statistics > Sample-selection models > Heckman selection model

Description

heckman fits regression models with selection by using either Heckman's
two-step consistent estimator or full maximum likelihood.

Options for Heckman selection model (ML)

+-------+
----+ Model +------------------------------------------------------------

mle requests that the maximum likelihood estimator be used.  This is the
default.

select([depvar_s =] varlist_s [, noconstant offset(varname_o)]) specifies
the variables and options for the selection equation.  It is an
integral part of specifying a Heckman model and is required.  The
selection equation should contain at least one variable that is not
in the outcome equation.

If depvar_s is specified, it should be coded as 0 or 1, with 0
indicating an observation not selected and 1 indicating a selected
observation.  If depvar_s is not specified, observations for which
depvar is not missing are assumed selected, and those for which
depvar is missing are assumed not selected.

noconstant suppresses the selection constant term (intercept).

offset(varname_o) specifies that selection offset varname_o be
included in the model with the coefficient constrained to be 1.

noconstant, offset(varname), constraints(constraints), collinear; see [R]
estimation options.

+-----------+
----+ SE/Robust +--------------------------------------------------------

vce(vcetype) specifies the type of standard error reported, which
includes types that are derived from asymptotic theory (oim, opg),
that are robust to some kinds of misspecification (robust), that
allow for intragroup correlation (cluster clustvar), and that use
bootstrap or jackknife methods (bootstrap, jackknife); see [R]
vce_option.

+-----------+
----+ Reporting +--------------------------------------------------------

level(#); see [R] estimation options.

first specifies that the first-step probit estimates of the selection
equation be displayed before estimation.

lrmodel; see [R] estimation options.

nshazard(newvar) and mills(newvar) are synonyms; either will create a new
variable containing the nonselection hazard -- what Heckman (1979)
referred to as the inverse of the Mills ratio -- from the selection
equation.  The nonselection hazard is computed from the estimated
parameters of the selection equation.

nocnsreport; see [R] estimation options.

display_options:  noci, nopvalues, noomitted, vsquish, noemptycells,
baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style),
cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R]
estimation options.

+--------------+
----+ Maximization +-----------------------------------------------------

maximize_options: difficult, technique(algorithm_spec), iterate(#),
[no]log, trace, gradient, showstep, hessian, showtolerance,
tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and
from(init_specs); see [R] maximize.  These options are seldom used.

Setting the optimization type to technique(bhhh) resets the default
vcetype to vce(opg).

The following option is available with heckman but is not shown in the
dialog box:

coeflegend; see [R] estimation options.

Options for Heckman selection model (two-step)

+-------+
----+ Model +------------------------------------------------------------

twostep specifies that Heckman's (1979) two-step efficient estimates of
the parameters, standard errors, and covariance matrix be produced.

select([depvar_s =] varlist_s [, noconstant]) specifies the variables and
options for the selection equation.  It is an integral part of
specifying a Heckman model and is required.  The selection equation
should contain at least one variable that is not in the outcome
equation.

If depvar_s is specified, it should be coded as 0 or 1, with 0
indicating an observation not selected and 1 indicating a selected
observation.  If depvar_s is not specified, observations for which
depvar is not missing are assumed selected, and those for which
depvar is missing are assumed not selected.

noconstant suppresses the selection constant term (intercept).

noconstant; see [R] estimation options.

rhosigma, rhotrunc, rholimited, and rhoforce are rarely used options to
specify how the two-step estimator (option twostep) handles unusual
cases in which the two-step estimate of rho is outside the admissible
range for a correlation, [-1,1].  When rho is outside this range, the
two-step estimate of the coefficient variance-covariance matrix may
not be positive definite and thus may be unusable for testing.  The
default is rhosigma.

rhosigma specifies that rho be truncated, as with the rhotrunc
option, and that the estimate of sigma be made consistent with
rho_hat, the truncated estimate of rho.  So, sigma_hat = B_m *
rho_hat; see Methods and formulas in [R] heckman for the definition
of B_m.  Both the truncated rho and the new estimate of sigma_hat are
used in all computations to estimate the two-step covariance matrix.

rhotrunc specifies that rho be truncated to lie in the range [-1,1].
If the two-step estimate is less than -1, rho is set to -1; if the
two-step estimate is greater than 1, rho is set to 1.  This truncated
value of rho is used in all computations to estimate the two-step
covariance matrix.

rholimited specifies that rho be truncated only in computing the
diagonal matrix D as it enters V_twostep and Q; see Methods and
formulas in [R] heckman.  In all other computations, the untruncated
estimate of rho is used.

rhoforce specifies that the two-step estimate of rho be retained,
even if it is outside the admissible range for a correlation.  This
option may, in rare cases, lead to a non-positive-definite covariance
matrix.

These options have no effect when estimation is by maximum
likelihood, the default.  They also have no effect when the two-step
estimate of rho is in the range [-1,1].

+----+
----+ SE +---------------------------------------------------------------

vce(vcetype) specifies the type of standard error reported, which
includes types that are derived from asymptotic theory (conventional)
and that use bootstrap or jackknife methods (bootstrap, jackknife);
see [R] vce_option.

vce(conventional), the default, uses the two-step variance estimator
derived by Heckman.

+-----------+
----+ Reporting +--------------------------------------------------------

level(#); see [R] estimation options.

first specifies that the first-step probit estimates of the selection
equation be displayed before estimation.

nshazard(newvar) and mills(newvar) are synonyms; either will create a new
variable containing the nonselection hazard -- what Heckman (1979)
referred to as the inverse of the Mills ratio -- from the selection
equation.  The nonselection hazard is computed from the estimated
parameters of the selection equation.

display_options:  noci, nopvalues, noomitted, vsquish, noemptycells,
baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style),
cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R]
estimation options.

The following option is available with heckman but is not shown in the
dialog box:

coeflegend; see [R] estimation options.

Remarks

heckman estimates all the parameters in the model:

(regression equation: y is depvar, x is varlist)
y = xb + u_1

(selection equation: Z is varlist_s)
y observed if Zg + u_2 > 0

where:
u_1 ~ N(0, sigma)
u_2 ~ N(0, 1)
corr(u_1, u_2) = rho

In the syntax for heckman, depvar and varlist are the dependent variable
and regressors for the underlying regression model (y = xb), and
varlist_s are the variables (Z) thought to determine whether depvar is
observed or unobserved (selected or not selected).  By default, heckman
assumes that missing values (see missing) of depvar imply that the
dependent variable is unobserved (not selected).  With some datasets, it
is more convenient to specify a binary variable (depvar_s) that
identifies the observations for which the dependent variable is
observed/selected (depvar_s!=0) or not observed (depvar_s=0); heckman
will accommodate either type of data.

Examples

Setup
. webuse womenwk

Obtain full ML estimates
. heckman wage educ age, select(married children educ age)

Obtain Heckman's two-step consistent estimates
. heckman wage educ age, select(married children educ age) twostep

Define and use each equation separately
. global wage_eqn wage educ age
. global seleqn married children age
. heckman $wage_eqn, select($seleqn)

Use a variable to identify selection
. generate wageseen = (wage < .)
. heckman wage educ age, select(wageseen = married children educ age)

Specify robust variance
. heckman wage educ age, select(married children educ age) vce(robust)

Specify clustering on county
. heckman $wage_eqn, select($seleqn) vce(cluster county)

Report first-step probit estimates
. heckman wage educ age, select(married children educ age) first

Create mymills containing nonselection hazard
. heckman $wage_eqn, select($seleqn) mills(mymills)

No constant in model
. heckman wage educ age, noconstant select(married children educ age)

No constant in selection equation
. heckman wage educ age, select(married children educ age, noconstant)

Stored results

heckman (maximum likelihood) stores the following in e():

Scalars
e(N)                number of observations
e(N_selected)       number of selected observations
e(N_nonselected)    number of nonselected observations
e(k)                number of parameters
e(k_eq)             number of equations in e(b)
e(k_eq_model)       number of equations in overall model test
e(k_aux)            number of auxiliary parameters
e(k_dv)             number of dependent variables
e(df_m)             model degrees of freedom
e(ll)               log likelihood
e(ll_0)             log likelihood, constant-only model
e(N_clust)          number of clusters
e(lambda)           lambda
e(selambda)         standard error of lambda
e(sigma)            sigma
e(chi2)             chi-squared
e(chi2_c)           chi-squared for comparison test
e(p)                p-value for model test
e(p_c)              p-value for comparison test
e(rho)              rho
e(rank)             rank of e(V)
e(rank0)            rank of e(V) for constant-only model
e(ic)               number of iterations
e(rc)               return code
e(converged)        1 if converged, 0 otherwise

Macros
e(cmd)              heckman
e(cmdline)          command as typed
e(depvar)           names of dependent variables
e(wtype)            weight type
e(wexp)             weight expression
e(title)            title in estimation output
e(title2)           secondary title in estimation output
e(clustvar)         name of cluster variable
e(offset1)          offset for regression equation
e(offset2)          offset for selection equation
e(mills)            variable containing nonselection hazard (inverse of
                      Mills's ratio)
e(chi2type)         Wald or LR; type of model chi-squared test
e(chi2_ct)          Wald or LR; type of model chi-squared test
                      corresponding to e(chi2_c)
e(vce)              vcetype specified in vce()
e(vcetype)          title used to label Std. Err.
e(opt)              type of optimization
e(which)            max or min; whether optimizer is to perform
                      maximization or minimization
e(method)           ml
e(ml_method)        type of ml method
e(user)             name of likelihood-evaluator program
e(technique)        maximization technique
e(properties)       b V
e(predict)          program used to implement predict
e(marginsok)        predictions allowed by margins
e(marginsnotok)     predictions disallowed by margins
e(asbalanced)       factor variables fvset as asbalanced
e(asobserved)       factor variables fvset as asobserved

Matrices
e(b)                coefficient vector
e(Cns)              constraints matrix
e(ilog)             iteration log (up to 20 iterations)
e(V)                variance-covariance matrix of the estimators
e(V_modelbased)     model-based variance

Functions
e(sample)           marks estimation sample

heckman (two-step) stores the following in e():

Scalars
e(N)                number of observations
e(N_selected)       number of selected observations
e(N_nonselected)    number of nonselected observations
e(df_m)             model degrees of freedom
e(lambda)           lambda
e(selambda)         standard error of lambda
e(sigma)            sigma
e(chi2)             chi-squared
e(p)                p-value for model test
e(rho)              rho
e(rank)             rank of e(V)

Macros
e(cmd)              heckman
e(cmdline)          command as typed
e(depvar)           names of dependent variables
e(title)            title in estimation output
e(title2)           secondary title in estimation output
e(mills)            variable containing nonselection hazard (inverse of
                      Mills's ratio)
e(chi2type)         Wald or LR; type of model chi-squared test
e(vce)              vcetype specified in vce()
e(rhometh)          rhosigma, rhotrunc, rholimited, or rhoforce
e(method)           twostep
e(properties)       b V
e(predict)          program used to implement predict
e(marginsok)        predictions allowed by margins
e(marginsnotok)     predictions disallowed by margins
e(asbalanced)       factor variables fvset as asbalanced
e(asobserved)       factor variables fvset as asobserved

Matrices
e(b)                coefficient vector
e(V)                variance-covariance matrix of the estimators

Functions
e(sample)           marks estimation sample

Reference

Heckman, J. 1979.  Sample selection bias as a specification error.
Econometrica 47: 153--161.

```
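
The help text says that nshazard() and mills() store the nonselection hazard computed from the selection equation, and that twostep produces Heckman's two-step estimates. Under the model in Remarks, the expected outcome for a selected observation is xb + rho*sigma*lambda(Zg), where lambda(Zg) = phi(Zg)/Phi(Zg) is that hazard. The do-file below is a minimal sketch of the two steps done by hand, written as an illustration and not as a description of how heckman is implemented internally. It assumes the womenwk dataset from the Examples; the variable names zg_hat, imr_hand, and hk_mills are hypothetical and chosen only for this sketch.

```stata
* Sketch only: reproduce the two-step point estimates by hand.
* zg_hat, imr_hand, and hk_mills are illustrative names, not part of heckman.
webuse womenwk, clear

* Selection indicator: 1 if wage is observed, 0 if it is missing
generate byte wageseen = (wage < .)

* Step 1: probit for the selection equation, then the linear index Zg
probit wageseen married children educ age
predict double zg_hat, xb

* Nonselection hazard (inverse Mills ratio): phi(Zg) / Phi(Zg)
generate double imr_hand = normalden(zg_hat) / normal(zg_hat)

* Step 2: outcome regression on the selected observations, adding the hazard.
* The OLS standard errors here ignore that imr_hand is itself estimated,
* so they should not be used in place of those reported by heckman.
regress wage educ age imr_hand if wageseen

* Compare with the hazard variable that heckman creates through mills()
heckman wage educ age, select(married children educ age) twostep ///
    mills(hk_mills)
summarize imr_hand hk_mills
```

The coefficients from the hand-run regress step should agree closely with the outcome-equation coefficients from heckman with twostep, and the coefficient on imr_hand corresponds to lambda in the heckman output. Only the point estimates carry over; for inference, rely on the variance estimator that heckman itself reports.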