Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# st: perfect prediction in -logit-?

From: Stas Kolenikov
To: statalist@hsphsun2.harvard.edu
Subject: st: perfect prediction in -logit-?
Date: Thu, 5 May 2011 15:07:28 -0400

```
Dear listers,

what are good ways to diagnose perfect prediction in -logit- and
-mlogit- models in batch mode? RTFMing gives the following examples:

* Example 1: one-way causation by a dummy variable
use http://www.stata-press.com/data/r11/repair, clear
logit foreign b3.repair

* Example 2: causation by a large linear predictor
use http://www.stata-press.com/data/r11/auto, clear
drop if foreign==0 & gear_ratio > 3.1
logit foreign mpg weight gear_ratio

* Example 3: weird covariate pattern
use http://www.stata-press.com/data/r11/logitxmpl, clear
logit y x1 x2, iter(50)

* Example 4: weak identification in mlogit
use http://www.stata-press.com/data/r11/auto, clear
mlogit rep78 foreign mpg weight
tabulate rep78, generate( rep78d )
logit rep78d1 foreign mpg weight

I believe I can find information about these issues in the -e(rules)-
matrix, but since it is not really documented, even in the fine
manuals, I am guessing at its function by looking at
-matrix list e(rules)- after -logit-. For models that don't have
any issues, it is a generic 1x4 matrix with no row/colnames. For
models with problems due to a particular variable, it gives the
name(s) of the culprit variable(s), the values that lead to perfect
prediction, and the number of observations that had to be removed for
-logit- to run.

The causation by a strong predictor in Example 2 is not reflected in
-e(rules)-, however; there are no infinite coefficients or standard
errors, so the problem is really far out in the tails of the
distribution of the linear predictor, where Stata simply runs out of
digits in computing something like 1-c(epsdouble) (which happens when
the linear predictor exceeds abs(ln(c(epsdouble))), about 36, in
absolute value). The lack of convergence in Example 3 is
unfortunately not reflected in -e(rules)-, either, although in this
particular case I can at least figure out that Stata could not
estimate all of the coefficients:

assert e(rank) == e(k)

where the RHS is what Stata wanted to estimate (the number of
parameters), and the LHS is what it really could estimate (the rank of
the resulting vce).

Example 4 is more subtle. -mlogit- did not report any convergence or
perfect prediction issues, although I believe it should have. There
are only two observations with rep78==1, so I don't really see how
Stata (or any other software, apart from WinBUGS, which would simply
reproduce the prior in this case) could estimate the equation for
that outcome. As we see in the -logit- regression for that cell, the
-foreign- variable predicts the negative outcome perfectly, but I am
still at a loss as to how Stata came up with three coefficients based
on just two points. Anyway, back to -mlogit-: it reports obscenely
large standard errors on the -foreign- variable (3000 for the
rep78==1 equation; 1500 for the rep78==2 equation), and that would be
a numerical accuracy concern to me (for real fun, try this -mlogit-
with -baseoutcome(1)-). However, lacking the powerful identification
diagnostics of -logit-, it says nothing to raise a brow, either in
the output or in the -ereturn-ed values.

So back to my question. I want to detect issues like lack of
convergence in -mlogit-, and figure out if I can point out any of the
explanatory variables that I can blame, -logit-style. Is that doable?

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
```
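The checks discussed in the message can be collected into a do-file fragment for batch runs. What follows is a minimal sketch, not an answer to the question: it only gathers the stored results the post already mentions (`e(rules)`, `e(rank)`, `e(k)`) together with `e(converged)` and `e(ic)`, and adds a crude screen for very large standard errors after `-mlogit-`. The cutoff of 100 is a hypothetical, scale-dependent choice made purely for illustration, and the interpretation of `e(rules)` follows the post's own guess rather than any documented behavior.

```stata
* --- After -logit-: Example 1 from the post ---
use http://www.stata-press.com/data/r11/repair, clear
capture noisily logit foreign b3.repair
if _rc == 0 {
    * 1 if the optimizer declared convergence, 0 otherwise
    display "converged: " e(converged)

    * rank of the VCE vs. number of parameters; a shortfall means
    * some coefficients could not be estimated
    display "rank = " e(rank) ", k = " e(k)

    * per the post, row names of e(rules) name the flagged variables
    * (assumption: this matrix is not formally documented)
    matrix R = e(rules)
    local culprits : rownames R
    display "e(rules) row names: `culprits'"
}

* --- After -mlogit-: Example 4 from the post ---
use http://www.stata-press.com/data/r11/auto, clear
capture noisily mlogit rep78 foreign mpg weight
if _rc == 0 {
    display "converged: " e(converged) ", iterations: " e(ic)

    * walk the diagonal of e(V) and report large standard errors;
    * the cutoff 100 is hypothetical and depends on covariate scale
    matrix V = e(V)
    local names : colfullnames V
    local p = colsof(V)
    forvalues j = 1/`p' {
        local se = sqrt(V[`j', `j'])
        * missing values compare as larger than any number in Stata,
        * so a missing standard error is flagged here as well
        if `se' > 100 {
            local nm : word `j' of `names'
            display "large s.e. on `nm': " %9.1f `se'
        }
    }
}
```

A screen like this can only point at coefficients that look suspicious; it does not identify a perfectly predicting covariate the way `-logit-` does, which is the gap the post asks about.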