
st: strange and differing results for mi vs. ice mlogit

From: M Hollis <[email protected]>
To: [email protected]
Subject: st: strange and differing results for mi vs. ice mlogit
Date: Sun, 17 Oct 2010 16:08:00 -0700 (PDT)

```
I'm exploring various options for trying to impute values of a nominal variable.
The actual situation is somewhat unusual and requires separate imputations for
several sub-samples of the data. Even with a relatively simple case, though, I'm
getting very strange and poor results using Stata's mi command, and very
different and almost as bad results using ice.

Here's the simple case. A sample with 74 complete cases and 62 cases with a
missing occupation code. The occupation variable is missing completely at
random, and the occupation variable can take on one of nine possible occupation
codes. The table below (hopefully legible without Courier font) shows that the
9 occupation codes are very unevenly distributed within the complete cases, with
62 of the 74 complete cases having the code "105". In this example I'm
predicting the occupation using the variable sex:

mi impute mlogit occ1990 = sex, add(1) noisily

Here's a summary of the resulting imputed values:

tab _1_occ1990 complete

           |       complete
_1_occ1990 |         0          1 |     Total
-----------+----------------------+----------
        15 |         0          2 |         2
        21 |         0          1 |         1
        23 |         0          1 |         1
       105 |         0         62 |        62
       175 |         0          2 |         2
       446 |         0          1 |         1
       447 |         0          3 |         3
       458 |         0          1 |         1
       459 |        62          1 |        63
-----------+----------------------+----------
     Total |        62         74 |       136

As you can see, the distribution of the imputed values (complete=0) is very
different from the complete cases (complete=1). It's not clear why mi
imputed a value of 459 for all cases. Given the uneven distribution in the
complete cases, it's not surprising that the mlogit results (which I'll paste at
the end of this email) show that the coefficient for sex is not significant for
any of the codes, and in many cases has a huge standard error. So, sex is not
the greatest of predictors in this case, but the imputed values should still
reflect some combination of the distribution of the complete cases and random
variation.  It's not random variation, though. I've run this several times and
occasionally it assigns a few other codes, but almost all of them always get
459. With other subsamples I've run it's always the last code listed that gets
the vast majority of the imputations.

The problem clearly has something to do with mi's attempt to include random
variation in the results and not the mlogit command itself because if I just run
mlogit and get predicted probabilities those predicted probabilities are very
reasonable (i.e. they reflect the high likelihood of code 105, with some
differences between men and women).

The ice approach seems to do better but is still problematic:

uvis mlogit occ1990 sex, gen(ice1)

. tab ice1 complete

   imputed |
      from |       complete
   occ1990 |         0          1 |     Total
-----------+----------------------+----------
        15 |         4          2 |         6
        21 |         0          1 |         1
        23 |         0          1 |         1
       105 |        38         62 |       100
       175 |         3          2 |         5
       446 |         0          1 |         1
       447 |        13          3 |        16
       458 |         4          1 |         5
       459 |         0          1 |         1
-----------+----------------------+----------
     Total |        62         74 |       136

I've run this several times and every time the results are better than the mi
results, but the 105 code is always a noticeably smaller proportion than in the
complete cases and one other group is quite a bit larger. I understand there's
random variation, but shouldn't that mean that sometimes the imputed cases
should have a higher proportion of code 105?

If I try to do this with a different, larger sub-sample with more possible
occupation codes, the problems are the same or worse. With mi, as before, all of
the imputed values are given the last occupation code. The ice command assigns
everyone to just a few seemingly random codes and no one to the most common
code.

I understand that I'm looking to do multiple imputation for a rather unusual
variable. The distribution of the occupations is very uneven and clearly
contributing to the problem. Yet the overall predicted probabilities with the
mlogit model are reasonable, so clearly the issue has to do with the imputation
process. The imputation involves introducing variation partly through selecting
values for the coefficients from a posterior distribution. Given the poor fit of
the model, I wouldn't be surprised by considerable variation in the imputations,
but this isn't random variation: the results are different in a consistent way
every time. I also tried the Amelia program for multiple imputation and got
issues that were similar to the ice results (under-imputation of the most common
category).

I have two thoughts on why this might be happening, but I'll be the first to
admit that my knowledge of the details of both multinomial logit and multiple
imputation is pretty rudimentary. My first observation is that the problems
seem to have something to do with the order in which codes are assigned. In the
case of mi, the last code seems to be inordinately likely to be imputed, perhaps
because that code is assigned if no others have been selected and the
predictions of the other codes are consistently too low. In contrast, ice (and
Amelia) seem to have unusually low imputations of the most common group, which
is the baseline (omitted) group. Perhaps in these programs the predicted
likelihood of the other codes is consistently too high, leaving few people to be
assigned the leftover baseline code.

My second observation is that the predicted values might be off because of the
complication of introducing random variation into the coefficients when the
overall model requires that the predicted probabilities of each of the nominal
values should add up to 1. If the mi model is sampling the coefficients
independently ignoring the interdependence, then this constraint might be
violated. Perhaps if I can sort through the mi or ice code (I'm not the best
programmer) I can get a better sense of how these predicted probabilities are
generated. I'm not sure, though, how this problem would lead to persistent
over- or under-prediction of specific probabilities, unless the asymptotic nature of
the posterior distribution means that the deviations in one direction are larger
than the deviations in the other direction, which might be amplified in cases
with values that are very unevenly distributed. As I said, though, my knowledge
of these procedures is pretty slim, so these are just wild speculations.

Any thoughts would be greatly appreciated. Details for the mlogit results are
below.

Thanks,

Matissa Hollister

Multinomial logistic regression                   Number of obs   =         74
                                                  LR chi2(8)      =      12.78
                                                  Prob > chi2     =     0.1195
Log likelihood = -50.158729                       Pseudo R2       =     0.1130

------------------------------------------------------------------------------
     occ1990 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
15           |
         sex |  -1.056146   1.443658    -0.73   0.464    -3.885663     1.77337
       _cons |  -2.772487   1.030735    -2.69   0.007    -4.792691   -.7522826
-------------+----------------------------------------------------------------
21           |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
23           |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
105          |  (base outcome)
-------------+----------------------------------------------------------------
175          |
         sex |  -18.40767    4143.09    -0.00   0.996    -8138.715      8101.9
       _cons |  -2.079367   .7499813    -2.77   0.006    -3.549304   -.6094308
-------------+----------------------------------------------------------------
446          |
         sex |  -18.40767   5859.214    -0.00   0.997    -11502.26    11465.44
       _cons |  -2.772514   1.030749    -2.69   0.007    -4.792745    -.752284
-------------+----------------------------------------------------------------
447          |
         sex |   17.92559   7642.808     0.00   0.998     -14961.7    14997.55
       _cons |  -20.65561   7642.808    -0.00   0.998    -15000.28    14958.97
-------------+----------------------------------------------------------------
458          |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
459          |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
------------------------------------------------------------------------------

```
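
The post's second observation, about adding random variation to the coefficients, can be explored with a small simulation. What follows is only a sketch under stated assumptions, not a description of what mi or ice actually do internally. It uses the `_cons` estimates and standard errors from the mlogit output above and looks only at cases with sex = 0, so that the linear predictor is just `_cons`. Treating sex as a 0/1 indicator is an inference from the constants, which reproduce the complete-case proportions at sex = 0. It further assumes each constant is drawn from a normal distribution centered at its estimate, independently across outcome equations; the real covariance matrix is not in the post. The function names `probabilities` and `draw_constants` are hypothetical helpers written for this illustration.

```python
import math
import random

# _cons estimates and standard errors copied from the mlogit output in the post.
# Outcome 105 is the base outcome, so its linear predictor is fixed at 0.
CONS = {
    15: (-2.772487, 1.030735),
    21: (-21.75423, 13237.73),
    23: (-21.75423, 13237.73),
    175: (-2.079367, 0.7499813),
    446: (-2.772514, 1.030749),
    447: (-20.65561, 7642.808),
    458: (-21.75423, 13237.73),
    459: (-21.75423, 13237.73),
}
BASE = 105


def probabilities(linear_predictors):
    """Multinomial-logit probabilities for one case (numerically stable softmax)."""
    xb = dict(linear_predictors)
    xb[BASE] = 0.0
    top = max(xb.values())  # subtract the maximum so exp() cannot overflow
    weights = {code: math.exp(value - top) for code, value in xb.items()}
    total = sum(weights.values())
    return {code: weight / total for code, weight in weights.items()}


def draw_constants(rng):
    """ASSUMPTION: each _cons ~ Normal(estimate, SE), independent across outcomes."""
    return {code: rng.gauss(est, se) for code, (est, se) in CONS.items()}


if __name__ == "__main__":
    rng = random.Random(2010)

    # Probabilities at the point estimates, for a case with sex = 0.
    at_estimates = probabilities({code: est for code, (est, _) in CONS.items()})
    print("P(105 | sex=0) at the point estimates:", round(at_estimates[BASE], 3))

    # Probabilities after perturbing the constants, repeated many times.
    n_draws = 10_000
    base_dominant = 0
    for _ in range(n_draws):
        p = probabilities(draw_constants(rng))
        # The probabilities are normalized, so they sum to 1 for any draw.
        assert abs(sum(p.values()) - 1.0) < 1e-9
        if p[BASE] > 0.5:
            base_dominant += 1
    print("share of draws where 105 keeps probability > 0.5:",
          base_dominant / n_draws)
```

Two things stand out under these assumptions. First, the predicted probabilities sum to 1 for every draw, because they are normalized after the coefficients are drawn, so the adding-up constraint itself is not violated. Second, five of the constants have standard errors in the thousands, so each drawn value is hugely positive about half the time, and a single such draw gives that outcome a probability of essentially 1. Code 105 therefore stays the most likely outcome only when all five draws come out negative, which happens in roughly 0.5^5, or about 3%, of draws. That would be consistent with the most common category being under-imputed, although whether this is what happens inside mi, ice, or Amelia is not established here.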