 Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: MIXLOGIT: marginal effects

From: Maarten Buis
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: MIXLOGIT: marginal effects
Date: Tue, 7 Feb 2012 10:46:23 +0100

```
On Tue, Feb 7, 2012 at 8:50 AM, Clive Nicholas wrote:
> However, both of you, IMVHO, are wrong, wrong, wrong about the linear
> probability model. There is no justification for the use of this model
> _at all_ when regressing a binary dependent variable on a set of
> regressors. Pampel's (2000) excellent introduction on logistic
> regression spent the first nine or so pages carefully explaining just
> why it is inappropriate (imposing linearity on a nonlinear
> relationship; predicting values out of range; nonadditivity; etc).
> Since when was it in vogue to advocate its usage? I'm afraid that I
> don't really understand this.
>
> Pampel FC (2000) Logistic Regression: A Primer (Sage University Papers
> Series on QASS, 07-132), Thousand Oaks, CA: Sage

There is one situation where the linear probability model is
completely unproblematic and that is when you have a completely
saturated model, i.e. when all your explanatory variables are
categorical and all interaction terms are included. In that case the
predictions of a linear probability model will exactly correspond with
the predictions of a logit model, as you can see below:

*---------------- begin example ------------------
sysuse nlsw88, clear
gen byte goodjob = occupation < 3 if occupation < .
* fit the saturated logit and linear probability model
* (regressors chosen for illustration)
logit goodjob i.race##i.collgrad
predict pr_logit
regress goodjob i.race##i.collgrad
predict pr_reg
tab pr_logit pr_reg
*----------------- end example -------------------

The residuals in a linear probability model are heteroskedastic, but
you can easily get around that by specifying the -vce(robust)- option.
If you do that, then both -logit- and -regress- will give valid
inference:

*------------ begin simulation ---------------
tempname trd
scalar `trd' = invlogit(.5)-invlogit(-.5)

di as txt    "The true risk difference is " ///
as result `trd'

program drop _all
program define sim, rclass
drop _all
set obs 500
gen x = _n < 251
gen y = runiform() < invlogit(-.5 + x)

logit y x
return scalar lor = _b[x]
return scalar lse = _se[x]

reg y x, vce(robust)
return scalar rd = _b[x]
return scalar rse = _se[x]
end

simulate lor=r(lor) lse=r(lse)  ///
rd=r(rd)   rse=r(rse), ///
reps(20000) : sim

// logit works fine (-simsum- is user-written: ssc install simsum):
simsum lor, true(1) se(lse)

// linear probability model works fine too:
simsum rd, true(`trd') se(rse)
*------------- end simulation ----------------

So in the case of a fully saturated model, it is really a matter of
whether you want your parameters expressed as differences in
probabilities or as ratios of odds.
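
For instance, reusing the nlsw88 example above (regressors are again
just for illustration; the -or- option merely reports the same logit
fit as odds ratios):

*---------------- begin example ------------------
sysuse nlsw88, clear
gen byte goodjob = occupation < 3 if occupation < .
* same saturated model, two parameterizations:
regress goodjob i.race##i.collgrad, vce(robust)  // risk differences
logit   goodjob i.race##i.collgrad, or           // odds ratios
*----------------- end example -------------------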

In models that do not include all interactions, or that add a
continuous explanatory variable, the linear probability model is more
restrictive. However, that does not bother me too much; a model is,
after all, supposed to be a simplification of reality. You obviously do
want to check that the deviations from linearity, or the predictions
outside the [0,1] interval, are not getting too much out of hand, but I
think there will be many situations where the linear probability model
is a reasonable approximation.
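
As a sketch of such a check (the variable choice here is mine, purely
for illustration), with a continuous regressor you can count how many
predictions fall outside [0,1]:

*---------------- begin example ------------------
sysuse nlsw88, clear
gen byte goodjob = occupation < 3 if occupation < .
regress goodjob i.race c.grade, vce(robust)
predict phat
count if phat < 0 | phat > 1
*----------------- end example -------------------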

Having said all that, in my own research I still almost always use a
logit rather than a linear probability model, but that is a choice not
a necessity.

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------
```