This FAQ is for Stata 8 and older versions of Stata.

Why does my mlogit take so long to converge?

Title     Convergence of maximum likelihood estimators
Author    William Gould, StataCorp
Date      November 1996
Maximum-likelihood estimators produce results by an iterative procedure. At
the beginning of iteration k, there is some coefficient vector
b_k. This coefficient vector can be combined with the model and
data to produce a log-likelihood value L_k.
The procedure then finds a b_{k+1} that produces a better
(larger) log-likelihood value, L_{k+1}. You might expect the
convergence rule to be

        L_{k+1} - L_k is small

Stata is willing to use that rule (see [R] maximize), but
by default it does not. Instead, Stata uses the rule

        || b_{k+1} - b_k || is small

That is, by default Stata uses the rule that the coefficients do not change
much. Basing the rule on the coefficients is more conservative, which is to
say, safer. It is the coefficients that interest you, the researcher, so
the idea is not to declare convergence until the coefficients settle down.
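Both rules can be adjusted through the maximize options: tolerance()
governs the coefficient-based rule and ltolerance() the
log-likelihood-based rule (see [R] maximize for the defaults and exact
definitions). A minimal sketch, with hypothetical variable names:

        . mlogit y x1 x2 x3, tolerance(1e-6) ltolerance(1e-7)

Tightening or loosening these tolerances changes when Stata declares
convergence, not where the maximum actually is.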
You may sometimes see that the
mlogit command
continues to iterate even though the log-likelihood value does not
appear to change. Here we know that, in each iteration,

        || b_{k+1} - b_k || is NOT small

even though

        L_{k+1} - L_k is small

The coefficient vector is still changing even though the log likelihood
is changing little, if at all.
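If you want to watch this happen directly, the trace option (one of the
maximize options; the variable names below are hypothetical) displays the
current coefficient vector at each iteration:

        . mlogit y x1 x2 x3, trace

With trace, you can see the coefficients continue to move even while the
reported log likelihood barely changes.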
How, you may wonder, can this happen? Mechanically, all it takes is a ridge
in the likelihood. You would like to think your likelihood function looked
something like
        |
      L |            +
        |           / \
        |          /   \
        |      ___/     \___
        +------------------------
                                 b
but it might look like
        |        ______
      L |       /      \
        |      /        \
        |  ___/          \___
        +------------------------
                                 b
or even
        |        _____________
      L |       /
        |      /
        |  ___/
        +------------------------
                                 b
The flat portions are called ridges.
You might also speculate that the cause of this nonconvergence or
slow convergence is collinearity among the explanatory variables.
Theoretically, this supposition is right, but Stata drops collinear
variables at the outset, so that is not the problem. Still, it is worth
understanding the picture. We need another dimension, so this time, pretend
that our likelihood function is a function of two parameters, b1 and b2. A
well-behaved likelihood function would look like
        |   unique                .
      L |   maximum ->   .   .     \
        |                 \         \
        |            .     \     \___      /
        |             +     \             /
        |            / \     \___        /   b2
        |           /   \               /
        |       ___/     \___          /
        +------------------------------------------------
                                                       b1
With collinearity, there is no longer a peak; instead, there is a ridge:
        |   Ridge of              .
      L |   constant         .     \
        |   height -->   .          \
        |            .     \     \___      /
        |             +     \             /
        |            / \     \___        /   b2
        |           /   \               /
        |       ___/     \___          /
        +------------------------------------------------
                                                       b1
The point is that with collinear variables, the value of (b1,b2) can change
a lot with no corresponding change in L—you just move along the ridge.
Using Stata, however, that does not happen because Stata looks for
collinearity before iterating and drops any unnecessary variables.
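To see this protection in action, you could deliberately construct a
collinear regressor (a sketch with hypothetical variables):

        . generate x3 = x1 + x2
        . mlogit y x1 x2 x3

Stata notices at the outset that x3 is a linear combination of x1 and x2,
drops it, and iterates over the remaining coefficients only.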
Now, if you go back to the two-dimensional pictures, for various
mathematical reasons we know the mlogit likelihood function does not
have an internal ridge. It can, however, have a nearly flat section that
stretches out to infinity. The flat portion is not really flat; it is
gently rising, but the rise is so gentle that it is difficult to detect.
This happens when the maximum-likelihood value for one of the coefficients
is infinity. The function reaches a maximum at infinity and, since it takes
forever to get there, the function gently rises to that maximum. For
mlogit, coefficients around 30 are nearly infinite.
If we evaluated the log-likelihood function at, say, coefficient
b2 = infinity (something we would find impossible to do on a computer but
that we could do with pencil and paper by carefully taking the limit as b2
→ infinity), we would get some number, say −1.
Now if we went to our computer and evaluated the log-likelihood function
at b2 = 30, we would get a number very close to −1. In fact, we would get
−1 exactly because of rounding. If we tried to evaluate it at b2 = 60 on our
computer, we would not get an answer because we would run into numerical
overflows. Thus, for all practical purposes, b2 = 30 is infinity for the
mlogit model on a digital computer.
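Here is a quick check of why 30 behaves like infinity, using Stata's
invlogit() function as a binary-outcome stand-in (in versions without
invlogit(), use 1/(1+exp(-30)); exact output depends on your display
format). Because exp(−30) is about 9.4e−14, a coefficient of 30 puts the
fitted probability within about one part in 10^13 of 1:

        . display exp(-30)
        . display invlogit(30)
        . display 1 - invlogit(30)

At a coefficient of 60, exp(−60) is about 8.8e−27, and invlogit(60) is
indistinguishable from 1 in double precision.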
That is what is happening here. The likelihood function is changing very
little (it’s not changing at all in the first 8 digits), but one or
more of the coefficients is marching off to infinity, which is to say, 30.
Eventually, it will get there, so stopping the iterations is premature.
What can lead to this? It is the same story as infinite coefficients in
binary logit: one-way causation. Imagine there are four outcomes in our
model; call them A, B, C, and D. Imagine we have an indicator
variable, x1, in our model, and further imagine that all the x1==1
observations are observed to have outcome A. The model then wants to
estimate Pr(outcome A given x1==1) = 1 and Pr(any other outcome given x1==1)
= 0. To do this, it must set one of the coefficients to infinity.
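One way to spot this pattern in your data (hypothetical variable names) is
to cross-tabulate the outcome against the suspect indicator:

        . tabulate y x1

If every observation in the x1==1 column falls in a single outcome row,
the corresponding coefficient is headed for infinity.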
Stata’s logit and
logistic
commands protect you from this problem by examining the data ahead of time
and dropping the offending variables and observations.
logit and logistic will report

        x1 dropped because x1~=0 predicts success perfectly;
        x1 dropped and 12 observations not used
mlogit does not protect you because there are just too many ways it
can happen; the problem is too difficult.
What you should do here is an interesting question. Basically, you must
decide whether the infinite coefficient is meaningful. If you included x1
in your model on the off chance that x1 might have an effect and there are
just a handful of x1==1 observations, I would remove x1 from the model and
reestimate.
On the other hand, if you have strong theoretical reasons to believe x1 has
an effect or, even in the absence of a theoretical justification, the
evidence is overwhelming (there were 10,000 x1==1 observations in your data
and all 10,000 of them had outcome A), then you are going to have to admit
that x1 belongs in the model and that either
  1. the effect is dominating, x1==1 does lead to outcome A with
     probability 1, and therefore the multinomial logistic model
     is not the correct way to model this, or
  2. the effect is whopping, x1==1 leads to outcome A with high
     probability (but not exactly 1), and you simply do not have
     enough data to model the x1==1 population.
In case 2, you could remove x1 from the model, exclude the x1==1
observations, and so estimate a model conditional on x1~=1.
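A minimal sketch of that second approach (hypothetical variable names):

        . mlogit y x2 x3 if x1 ~= 1

This fits the model on the x1~=1 observations only, with x1 itself
omitted because it no longer varies within the estimation sample.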
In the above, when we refer to x1==1 and x1==0, you could obviously turn it
around; it might be x1==0 that leads to outcome A. Moreover, x1 does not
need to be an indicator variable; x1 can be continuous, but that case is
more difficult to discuss. Indicator or not, it does not have to be a single
variable that causes the difficulty; a linear combination
of x1 and x2 can do the same. Returning to indicator x1,
x1 does not have to lead to just outcome A. If x1 leads to outcome A or B
and the absence of x1 leads to C or D, there can be problems.
Anyway, the way to diagnose all these problems is to let the iterations
continue until “convergence” and then examine the pattern of
coefficients. We put convergence in quotes because when you force Stata or
any software package to estimate models with infinite coefficients, you are
taxing the precision of the numerical routines and the reported results
should be interpreted cautiously. Theoretically, all the noninfinite
coefficients are estimated correctly, as are their standard errors.
Practically, because of how the routine has been stretched, that may not be
true; the other coefficients and standard errors are probably less accurate
than the package would normally produce. Final results, the results you
report, should be based on estimates with the infinities removed.
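A sketch of that diagnostic (hypothetical variable names; iterate() is one
of the maximize options):

        . mlogit y x1 x2 x3, iterate(1000)
        . matrix list e(b)

Look for coefficients with absolute values near 30 (give or take,
depending on the scale of the variable); those are the ones marching off
to infinity.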