Title | Convergence of maximum likelihood estimators | |

Author | William Gould, StataCorp |

Maximum-likelihood estimators produce results by an iterative procedure. At
the beginning of iteration k, there is some coefficient vector
b_{k}. This coefficient vector can be combined with the model and
data to produce a log-likelihood value L_{k}.

The procedure then finds a b_{{k+1}}, which produces a better
(larger) log-likelihood value, L_{{k+1}}. You might expect the rule
for covergence to be

L_{{k+1}}- L_{k}is small

Stata is willing to use that rule—see
[R] **maximize**—but
by default it does not. Instead, Stata uses the rule

|| b_{{k+1}}- b_{k}|| is small

That is, by default Stata uses the rule that the coefficients do not change much. Basing the rule on the coefficients is more conservative, which is to say, safer. It is the coefficients that interest you, the researcher, so the idea is not to declare convergence until the coefficients settle down.

You may sometimes see that the
**mlogit** command
will continue to iterate even though the log-likelihood value does not
appear to change. Here we know that, in each iteration,

|| b_{{k+1}}- b_{k}|| isNOTsmall

even though

L_{{k+1}}- L_{k}is small

The coefficient vector was changing while the log-likelihood changed little, if at all.

How, you may wonder, can this happen? Mechanically, all it takes is a ridge in the likelihood. You would like to think your likelihood function looked something like

| L | + | / \ | / \ | ___/ \___ +------------------------ b

but it might look like

| ______ L | / \ | / \ | ___/ \___ +------------------------ b

or even

| _____________ L | / | / | ___/ +------------------------ b

The flat portions are called ridges.

You might also speculate that the cause of this nonconvergence or slow-convergence is collinearity in the explanatory variables. Theoretically, this supposition is right, but Stata drops collinear variables at the outset, so that is not the problem. Still, it is worth understanding the picture. We need another dimension, so this time, pretend that our likelihood function is a function of two parameters, b1 and b2. A well-behaved likelihood function would look like

| unique . L | maximum -> . . \ | \ \ | . \ \___ / | + \ / | / \ \___ / b2 | / \ / | ___/ \___ / +------------------------------------------------ b1

With collinearity, there is no longer a peak; instead, there is a ridge:

| Ridge of . L | constant . \ | height --> . \ | . \ \___ / | + \ / | / \ \___ / b2 | / \ / | ___/ \___ / +------------------------------------------------ b1

The point is that with collinear variables, the value of (b1,b2) can change a lot with no corresponding change in L—you just move along the ridge.

Using Stata, however, that does not happen because Stata looks for collinearity before iterating and drops any unnecessary variables.

Now, if you go back to the two-dimensional pictures, for various
mathematical reasons we know the **mlogit** likelihood function does not
have an internal ridge. It can, however, have a nearly flat section that
stretches out to infinity. The flat portion is not really flat; it is
gently rising, but the rise is so gentle that it is difficult to detect.

This happens when the maximum-likelihood value for one of the coefficients
is infinity. The function reaches a maximum at infinity and, since it takes
forever to get there, the function gently rises to that maximum. For
**mlogit**, coefficients around 30 are nearly infinite.

If we evaluated the log-likelihood function for, say, coefficient b2=infinity, (something we would find impossible to do on a computer but that we could do with pencil and paper by carefully taking the limit as b2 → infinity) we will get some number, say −1.

Now if we ran off to my computer and evaluated the log-likelihood function
for b2=30, I will get a number very close to −1. In fact, we will get
−1 exactly because of rounding. If we tried to evaluate b2=60 on my
computer, we would not get an answer because we would run into numerical
overflows. Thus, for all practical purposes, b2=30 is infinity for the
**mlogit** model on a digital computer.

That is what is happening here. The likelihood function is changing very little (it’s not changing at all in the first 8 digits), but one or more of the coefficients is marching off to infinity, which is to say, 30. Eventually, it will get there, so stopping the iterations is premature.

What will lead to that? It is the same story as infinite coefficients in binary logit: one-way causation. Imagine there are four outcomes in my model—call them A, B, C, and D. Imagine we have an indicator variable, x1, in our model, and further imagine that all the x1==1 observations are observed to have outcome A. The model then wants to estimate Pr(outcome A given x1==1) = 1 and Pr(any other outcome given x1==1) = 0. To do this, it must set one of the coefficients to infinity.

Stata’s **logit** and
**logistic**
commands protect you from this problem by examining the data ahead of time
and dropping variables and observations that have this problem.
**logit** and **logistic** will report

x1 dropped because x1~=0 predicts success perfectly; x1 dropped and 12 observations not used

**mlogit** does not protect you because there are just too many ways it
can happen; the problem is too difficult.

What you should do here is an interesting question. Basically, you must decide whether the infinite coefficient is meaningful. If you included x1 in your model on the off chance that x1 might have an effect and there are just a handful of x1==1 observations, I would remove x1 from the model and reestimate.

On the other hand, if you have strong theoretical reasons to believe x1 has an effect or, even in the absence of a theoretical justification, the evidence is overwhelming (there were 10,000 x1==1 observations in your data and all 10,000 of them had outcome A), then you are going to have to admit that x1 belongs in the model and that either

- the effect is dominating, x1==1 does lead to outcome A with probability 1, and therefore the multinomial logistic model is not the correct way to model this, or
- the effect is whopping, x1==1 leads to outcome A with high probability (but not exactly 1), and you simply do not have enough data to model the x1==1 population.

In case 2, you could remove x1 from the model, exclude the x1==1 observations, and so estimate a model conditional on x1~=1.

In the above when we refer to x1==1 and x1==0, you could obviously turn it around; it might be x1==0 that leads to outcome A. Moreover, x1 does not need to be an indicator variable; x1 can be continuous, but that case is more difficult to discuss. Indicator or not, it does not have to be one variable that leads to the difficulty; there might be a linear combination of x1 and x2 that leads to the same difficulty. Returning to indicator x1, x1 does not have to lead to just outcome A. If x1 leads to outcome A or B and the absence of x1 leads to C or D, there can be problems.

Anyway, the way to diagnose all these problems is to let the problem continue until “convergence” and then examine the pattern of coefficients. We put convergence in quotes because when you force Stata or any software package to estimate models with infinite coefficients, you are taxing the precision of the numerical routines and the reported results should be interpreted cautiously. Theoretically, all the noninfinite coefficients are estimated correctly, as are their standard errors. Practically, because of how the routine has been stretched, that may not be true; the other coefficients and standard errors are probably less accurate than the package would normally produce. Final results, the results you report, should be based on estimates with the infinities removed.