Note: This FAQ is for Stata 8 and older versions of Stata.

Why does my mlogit take so long to converge?

Title   Convergence of maximum likelihood estimators
Author  William Gould, StataCorp

Maximum-likelihood estimators produce results by an iterative procedure. At the beginning of iteration k, there is some coefficient vector b_k. This coefficient vector can be combined with the model and data to produce a log-likelihood value L_k.

The procedure then finds a b_{k+1} that produces a better (larger) log-likelihood value, L_{k+1}. You might expect the rule for convergence to be

        L_{k+1} - L_k          is small

Stata is willing to use that rule—see [R] maximize—but by default it does not. Instead, Stata uses the rule

        || b_{k+1} - b_k ||    is small

That is, by default Stata uses the rule that the coefficients do not change much. Basing the rule on the coefficients is more conservative, which is to say, safer. It is the coefficients that interest you, the researcher, so the idea is not to declare convergence until the coefficients settle down.
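If you want the log-likelihood rule as well, you can ask for it; the tolerance() and ltolerance() options are documented in [R] maximize. A minimal sketch, where y, x1, and x2 stand in for your own variables:

        * default: convergence judged by the change in the coefficient vector
        mlogit y x1 x2

        * additionally require the log likelihood to settle to within 1e-7
        mlogit y x1 x2, ltolerance(1e-7)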

You may sometimes see that the mlogit command will continue to iterate even though the log-likelihood value does not appear to change. Here we know that, in each iteration,

        || b_{k+1} - b_k ||    is NOT small

even though

        L_{k+1} - L_k          is small

That is, the coefficient vector is still changing even though the log likelihood is changing little, if at all.
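You can watch this happen directly: the trace option (also documented in [R] maximize) displays the full coefficient vector at each iteration, so you can see the coefficients moving while the reported log likelihood stalls. Again, y, x1, and x2 are stand-ins for your own variables:

        * print the coefficient vector at every iteration
        mlogit y x1 x2, trace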

How, you may wonder, can this happen? Mechanically, all it takes is a ridge in the likelihood. You would like to think your likelihood function looked something like

          |
        L |           +
          |         /   \
          |       /       \
          |  ___/           \___
          +------------------------
                                 b

but it might look like

          |          ______
        L |         /      \
          |       /          \
          |  ___/              \___
          +------------------------ 
                                 b

or even

          |          _____________
        L |         /
          |       /
          |  ___/
          +------------------------ 
                                 b

The flat portions are called ridges.

You might also speculate that the cause of this nonconvergence or slow convergence is collinearity in the explanatory variables. Theoretically, this supposition is right, but Stata drops collinear variables at the outset, so that is not the problem. Still, it is worth understanding the picture. We need another dimension, so this time, pretend that our likelihood function is a function of two parameters, b1 and b2. A well-behaved likelihood function would look like

          |      unique                  .
        L |      maximum ->   .     .      \
          |                     \            \
          |                .      \             \___            /
          |            +            \                          / 
          |          /   \            \___                    /  b2
          |        /       \                                 /
          |   ___/           \___                           /
          +------------------------------------------------
                                 b1

With collinearity, there is no longer a peak; instead, there is a ridge:

          |   Ridge of                 .
        L |   constant            .      \ 
          |   height   -->    .            \
          |               .     \            \___              /
          |            +          \                           /
          |          /   \          \___                     /   b2
          |        /       \                                /
          |   ___/           \___                          /
          +------------------------------------------------
                                 b1

The point is that with collinear variables, the value of (b1,b2) can change a lot with no corresponding change in L—you just move along the ridge.

With Stata, however, that does not happen, because Stata checks for collinearity before iterating and drops any unnecessary variables.
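A minimal sketch of that protection, using an invented regressor; the exact wording of the note Stata prints varies across versions:

        * x2 is an exact linear combination of x1
        generate x2 = 2*x1

        * Stata detects the collinearity and drops one variable before iterating
        mlogit y x1 x2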

Now, if you go back to the two-dimensional pictures, for various mathematical reasons we know the mlogit likelihood function does not have an internal ridge. It can, however, have a nearly flat section that stretches out to infinity. The flat portion is not really flat; it is gently rising, but the rise is so gentle that it is difficult to detect.

This happens when the maximum-likelihood value for one of the coefficients is infinity. The function reaches a maximum at infinity and, since it takes forever to get there, the function gently rises to that maximum. For mlogit, coefficients around 30 are nearly infinite.

If we evaluated the log-likelihood function at, say, b2=infinity (something we would find impossible to do on a computer but could do with pencil and paper by carefully taking the limit as b2 → infinity), we would get some number, say −1.

Now, if we evaluated the log-likelihood function at b2=30 on a computer, we would get a number very close to −1; in fact, we would get −1 exactly, because of rounding. If we tried to evaluate it at b2=60, we would get no answer at all because we would run into numerical overflow. Thus, for all practical purposes, b2=30 is infinity for the mlogit model on a digital computer.
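You can check this arithmetic yourself with display; these are properties of double-precision arithmetic generally, not of mlogit in particular:

        * the probability at b2=30: displays as 1
        display 1/(1+exp(-30))

        * its log: about -9.4e-14, zero for all practical purposes
        display ln(1/(1+exp(-30)))

        * exp() itself overflows double precision around 710; Stata reports missing
        display exp(710)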

That is what is happening here. The likelihood function is changing very little (it’s not changing at all in the first 8 digits), but one or more of the coefficients is marching off to infinity, which is to say, 30. Eventually, it will get there, so stopping the iterations is premature.

What will lead to that? It is the same story as infinite coefficients in binary logit: one-way causation. Imagine there are four outcomes in our model, call them A, B, C, and D. Imagine we have an indicator variable, x1, in our model, and further imagine that all the x1==1 observations are observed to have outcome A. The model then wants to estimate Pr(outcome A given x1==1) = 1 and Pr(any other outcome given x1==1) = 0. To do this, it must set one of the coefficients to infinity.
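Here is a sketch that manufactures exactly this situation. Everything here is invented for illustration, and uniform() is the random-number function of the Stata 8 era (later versions use runiform()):

        clear
        set obs 1000
        set seed 12345

        * an indicator that is 1 for the first 50 observations
        generate x1 = _n <= 50

        * outcomes 1-4, think A-D
        generate y = 1 + int(4*uniform())

        * force every x1==1 observation to have outcome A
        replace y = 1 if x1==1

        * the x1 coefficients in the non-A equations now drift off
        * toward minus infinity, which is to say, around -30
        mlogit y x1, iterate(100)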

Stata’s logit and logistic commands protect you from this problem by examining the data ahead of time and dropping the offending variables and observations. logit and logistic will report

        x1~=0 predicts success perfectly;
        x1 dropped and 12 observations not used

mlogit does not protect you because there are just too many ways it can happen; the problem is too difficult.
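Continuing the sketch above, you can see the protection in the binary case; the exact note printed varies across versions:

        * success indicator for outcome A
        generate A = (y==1)

        * logit notices that x1==1 predicts A perfectly and drops x1
        logit A x1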

What you should do here is an interesting question. Basically, you must decide whether the infinite coefficient is meaningful. If you included x1 in your model on the off chance that x1 might have an effect, and there are just a handful of x1==1 observations, we would remove x1 from the model and reestimate.

On the other hand, if you have strong theoretical reasons to believe x1 has an effect or, even in the absence of a theoretical justification, the evidence is overwhelming (there were 10,000 x1==1 observations in your data and all 10,000 of them had outcome A), then you are going to have to admit that x1 belongs in the model and that either

  1. the effect is dominating, x1==1 does lead to outcome A with probability 1, and therefore the multinomial logistic model is not the correct way to model this, or
  2. the effect is whopping, x1==1 leads to outcome A with high probability (but not exactly 1), and you simply do not have enough data to model the x1==1 population.

In case 2, you could remove x1 from the model, exclude the x1==1 observations, and so estimate a model conditional on x1~=1.
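A minimal sketch of that remedy, with x2 and x3 standing in for whatever other covariates your model contains:

        * estimate the model conditional on x1~=1
        mlogit y x2 x3 if x1~=1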

In the above, when we refer to x1==1 and x1==0, you could obviously turn it around; it might be x1==0 that leads to outcome A. Moreover, x1 does not need to be an indicator variable; x1 can be continuous, but that case is more difficult to discuss. Indicator or not, it does not have to be a single variable that leads to the difficulty; a linear combination of x1 and x2 might lead to the same difficulty. Returning to indicator x1, x1 does not have to lead to just outcome A. If x1 leads to outcome A or B and the absence of x1 leads to C or D, there can be problems.

Anyway, the way to diagnose all these problems is to let the iterations continue until “convergence” and then examine the pattern of coefficients. We put convergence in quotes because, when you force Stata or any software package to estimate models with infinite coefficients, you are taxing the precision of the numerical routines, and the reported results should be interpreted cautiously. Theoretically, all the noninfinite coefficients are estimated correctly, as are their standard errors. Practically, because of how the routine has been stretched, that may not be true; the other coefficients and standard errors are probably less accurate than the package would normally produce. Final results, the results you report, should be based on estimates with the infinities removed.
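To carry out that diagnostic, a sketch with invented variable names; the iterate() option (see [R] maximize) caps the number of iterations in case you want to cut a long run short:

        * let the iterations run, then inspect the coefficient vector;
        * coefficients near +/-30 flag the offending variable
        mlogit y x1 x2, iterate(1000)
        matrix list e(b)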