
Re: st: MIXLOGIT: marginal effects

From   Maarten Buis <>
Subject   Re: st: MIXLOGIT: marginal effects
Date   Tue, 7 Feb 2012 10:46:23 +0100

On Tue, Feb 7, 2012 at 8:50 AM, Clive Nicholas wrote:
> However, both of you, IMVHO, are wrong, wrong, wrong about the linear
> probability model. There is no justification for the use of this model
> _at all_ when regressing a binary dependent variable on a set of
> regressors. Pampel's (2000) excellent introduction on logistic
> regression spent the first nine or so pages carefully explaining just
> why it is inappropriate (imposing linearity on a nonlinear
> relationship; predicting values out of range; nonadditivity; etc).
> Since when was it in vogue to advocate its usage? I'm afraid that I
> don't really understand this.
> Pampel FC (2000) Logistic Regression: A Primer (Sage University Papers
> Series on QASS, 07-132), Thousand Oaks, CA: Sage

There is one situation where the linear probability model is
completely unproblematic, and that is when you have a fully
saturated model, i.e. when all your explanatory variables are
categorical and all interaction terms are included. In that case the
predictions of a linear probability model will correspond exactly
with the predictions of a logit model, as you can see below:

*---------------- begin example ------------------
sysuse nlsw88, clear
gen byte goodjob = occupation < 3 if occupation < .
logit goodjob i.collgrad##i.south##i.union
predict pr_logit
reg goodjob i.collgrad##i.south##i.union, vce(robust)
predict pr_reg
tab pr_*
*----------------- end example -------------------
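
For intuition, here is a small Python sketch (my own illustration, not part of the original example) of why this holds in the simplest saturated case, a single binary regressor: both OLS and maximum-likelihood logit reproduce the observed cell proportions, so their fitted probabilities coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.integers(0, 2, n)                 # one binary regressor -> model is saturated
y = (rng.random(n) < np.where(x == 1, 0.7, 0.3)).astype(float)

X = np.column_stack([np.ones(n), x])      # intercept + dummy

# Linear probability model: OLS in closed form
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
p_ols = X @ beta_ols

# Logit fitted by Newton-Raphson
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ beta)))
    grad = X.T @ (y - p)                  # score
    hess = X.T @ (X * (p * (1 - p))[:, None])  # observed information
    beta += np.linalg.solve(hess, grad)
p_logit = 1 / (1 + np.exp(-(X @ beta)))

# In a saturated model both sets of fitted values equal the cell proportions
for g in (0, 1):
    cell = y[x == g].mean()
    assert np.allclose(p_ols[x == g], cell, atol=1e-6)
    assert np.allclose(p_logit[x == g], cell, atol=1e-6)
```

The assertions pass because with a fully saturated design both estimators have enough free parameters to match every cell mean exactly.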

The residuals in a linear probability model are heteroskedastic, but
you can easily get around that by specifying the -vce(robust)- option.
If you do that, then both -logit- and -regress- will give valid
inference, as the simulation below shows:

*------------ begin simulation ---------------
tempname trd
scalar `trd' = invlogit(.5)-invlogit(-.5)

di as txt    "The true risk difference is " ///
   as result `trd'

program drop _all
program define sim, rclass
	drop _all
	set obs 500
	gen x = _n < 251
	gen y = runiform() < invlogit(-.5 + x)

	logit y x
	return scalar lor = _b[x]
	return scalar lse = _se[x]

	reg y x, vce(robust)
	return scalar rd = _b[x]
	return scalar rse = _se[x]
end

simulate lor=r(lor) lse=r(lse)  ///
         rd=r(rd)   rse=r(rse), ///
         reps(20000) : sim

// simsum is a user-written command; if needed: ssc install simsum
// logit works fine:
simsum lor, true(1) se(lse)

// linear probability model works fine too:
simsum rd, true(`trd') se(rse)
*------------- end simulation ----------------
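
For readers without -simsum- at hand, here is a rough Python analogue of the simulation above (my own sketch; with a single binary regressor the HC0 robust standard error for the slope reduces to the familiar two-sample proportion formula):

```python
import numpy as np

rng = np.random.default_rng(42)

def invlogit(z):
    return 1 / (1 + np.exp(-z))

true_rd = invlogit(0.5) - invlogit(-0.5)  # true risk difference

reps, n = 2000, 500
rd_hat = np.empty(reps)
se_hat = np.empty(reps)
for r in range(reps):
    x = (np.arange(n) < n // 2).astype(float)
    y = (rng.random(n) < invlogit(-0.5 + x)).astype(float)
    p1, p0 = y[x == 1].mean(), y[x == 0].mean()
    rd_hat[r] = p1 - p0
    # HC0 robust SE of the OLS slope = two-sample proportion SE here
    n1 = int(x.sum()); n0 = n - n1
    se_hat[r] = np.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

bias = rd_hat.mean() - true_rd            # should be near zero
se_ratio = se_hat.mean() / rd_hat.std(ddof=1)  # should be near one
```

Near-zero bias and an SE ratio close to one are the same two checks -simsum- reports for the Stata simulation.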

So in the case of a fully saturated model, it is really a matter of
whether you want your parameters in terms of differences in
probabilities or ratios of odds.
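
To make the two scales concrete, a tiny numeric example (the probabilities are hypothetical, chosen only for illustration):

```python
def rd_and_or(p1, p0):
    """Same group comparison on two scales:
    risk difference (LPM scale) vs. odds ratio (logit scale)."""
    rd = p1 - p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    return rd, odds_ratio

# hypothetical outcome probabilities in the two groups
rd, orat = rd_and_or(0.62, 0.38)
```

Both numbers summarize the same contrast; which one is more useful depends on the audience and the question.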

In models that do not include all interactions, or where you add a
continuous explanatory variable, the linear probability model is more
restrictive. However, that does not bother me too much; a model is,
after all, supposed to be a simplification of reality. You obviously do
want to check that the deviations from linearity, or the predictions
outside the [0,1] interval, are not getting too far out of hand, but I
think there will be many situations where the linear probability model
is perfectly adequate.

Having said all that, in my own research I still almost always use a
logit rather than a linear probability model, but that is a choice not
a necessity.

Hope this helps,

Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
