st: strange and differing results for mi vs. ice mlogit


From   M Hollis <m73hollis_stata@yahoo.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: strange and differing results for mi vs. ice mlogit
Date   Sun, 17 Oct 2010 16:08:00 -0700 (PDT)

I'm exploring various options for trying to impute values of a nominal variable. 
The actual situation is somewhat unusual and requires separate imputations for 
several sub-samples of the data. Even with a relatively simple case, though, I'm 
getting very strange and poor results using Stata's mi command, and very 
different and almost as bad results using ice.

Here's the simple case. A sample with 74 complete cases and 62 cases with a 
missing occupation code. The occupation variable is missing completely at 
random, and the occupation variable can take on one of nine possible occupation 
codes. The table below (hopefully legible without Courier font) shows that the 
9 occupation codes are very unevenly distributed within the complete cases, with 
62 of the 74 complete cases having the code "105". In this example I'm 
predicting the occupation using the variable sex:

mi impute mlogit occ1990 = sex, add(1) noisily

Here's a summary of the resulting imputed values:

tab _1_occ1990 complete

           |       complete
_1_occ1990 |         0          1 |     Total
-----------+----------------------+----------
        15 |         0          2 |         2 
        21 |         0          1 |         1 
        23 |         0          1 |         1 
       105 |         0         62 |        62 
       175 |         0          2 |         2 
       446 |         0          1 |         1 
       447 |         0          3 |         3 
       458 |         0          1 |         1 
       459 |        62          1 |        63 
-----------+----------------------+----------
     Total |        62         74 |       136 

As you can see, the distribution of the imputed values (complete=0) is very 
different from that of the complete cases (complete=1). It's not clear why mi 
imputed a value of 459 for all cases. Given the uneven distribution in the 
complete cases, it's not surprising that the mlogit results (which I'll paste at 
the end of this email) show that the coefficient for sex is not significant for 
any of the codes, and in many cases has a huge standard error. So, sex is not 
the greatest of predictors in this case, but the imputed values should still 
reflect some combination of the distribution of the complete cases and random 
variation. It's not random variation, though. I've run this several times and 
occasionally it assigns a few other codes, but almost all of the cases always 
get 459. With the other subsamples I've run, it's always the last code listed 
that gets the vast majority of the imputations.
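
One quick check that might help here: standard errors in the thousands, like those in the mlogit output at the end of this message, typically show up when some occupation codes occur for only one value of sex among the complete cases. The sketch below is only a diagnostic, not a fix, and it assumes the indicator variable complete used in the tables in this message.

```stata
* Cross-tabulate occupation by sex for the complete cases only.
* A cell with a zero count means that code is never observed for that sex,
* which leaves the corresponding mlogit coefficient very poorly identified.
tabulate occ1990 sex if complete == 1
```

Any code with an empty cell in this table would be a candidate for the very large coefficients and standard errors in the output.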

The problem clearly has something to do with mi's attempt to include random 
variation in the results and not the mlogit command itself because if I just run 
mlogit and get predicted probabilities those predicted probabilities are very 
reasonable (i.e. they reflect the high likelihood of code 105, with some 
differences between men and women).
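
For reference, the predicted-probability check described above can be sketched roughly as follows. The stub phat is a name chosen purely for illustration, and the sketch assumes no existing variables start with phat.

```stata
* Fit the multinomial logit; observations with missing occ1990 are dropped automatically
mlogit occ1990 sex

* Create one predicted-probability variable per occupation code
* (phat1, phat2, ... in the order of the outcome values; pr is the default statistic)
predict phat*, pr

* Mean predicted probability of each code, separately for each sex
tabstat phat*, by(sex) statistics(mean)
```

These probabilities come from the point estimates only, so they say nothing about what happens once the coefficients are drawn at random.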

The ice approach seems to do better but is still problematic:

uvis mlogit occ1990 sex, gen(ice1)

tab ice1 complete

   imputed |
      from |       complete
   occ1990 |         0          1 |     Total
-----------+----------------------+----------
        15 |         4          2 |         6 
        21 |         0          1 |         1 
        23 |         0          1 |         1 
       105 |        38         62 |       100 
       175 |         3          2 |         5 
       446 |         0          1 |         1 
       447 |        13          3 |        16 
       458 |         4          1 |         5 
       459 |         0          1 |         1 
-----------+----------------------+----------
     Total |        62         74 |       136 

I've run this several times, and every time the results are better than the mi 
results, but code 105 always makes up a noticeably smaller proportion of the 
imputed cases than of the complete cases, and one other code makes up quite a 
bit more. I understand there's random variation, but shouldn't that mean that 
the imputed cases sometimes have a higher proportion of code 105?

If I try to do this with a different, larger sub-sample with more possible 
occupation codes, the problems are the same or worse. With mi, as before, all of 
the imputed values are given the last occupation code. The ice command assigns 
everyone to just a few seemingly random codes and no one to the most common 
code. 


I understand that I'm looking to do multiple imputation for a rather unusual 
variable. The distribution of the occupations is very uneven and clearly 
contributing to the problem. Yet the overall predicted probabilities with the 
mlogit model are reasonable, so clearly the issue has to do with the imputation 
process. The imputation involves introducing variation partly through selecting 
values for the coefficients from a posterior distribution. Given the poor fit of 
the model, I wouldn't be surprised by considerable variation in the imputations, 
but this isn't random variation; the results are different in a consistent way 
every time. I also tried the Amelia program for multiple imputation and got 
issues that were similar to the ice results (under-imputation of the most common 
category).

I have two thoughts on why this might be happening, but I'll be the first to 
admit that my knowledge of the details of both multinomial logit and multiple 
imputation is pretty rudimentary. My first observation is that the problems 
seem to have something to do with the order in which codes are assigned. In the 
case of mi, the last code seems to be inordinately likely to be imputed, perhaps 
because that code is assigned if no others have been selected and the 
predictions of the other codes are consistently too low. In contrast, ice (and 
Amelia) seem to have unusually low imputations of the most common group, which 
is the baseline (omitted) group. Perhaps in these programs the predicted 
likelihood of the other codes is consistently too high, leaving few people to be 
assigned the leftover baseline code.

My second observation is that the predicted values might be off because of the 
complication of introducing random variation into the coefficients when the 
overall model requires that the predicted probabilities of each of the nominal 
values should add up to 1. If the mi model is sampling the coefficients 
independently, ignoring the interdependence, then this constraint might be 
violated. Perhaps if I can sort through the mi or ice code (I'm not the best 
programmer) I can get a better sense of how these predicted probabilities are 
generated. I'm not sure, though, how this problem would lead to persistent over- 
or under-prediction of specific probabilities, unless the asymptotic nature of 
the posterior distribution means that the deviations in one direction are larger 
than the deviations in the other direction, which might be amplified in cases 
with values that are very unevenly distributed. As I said, though, my knowledge 
of these procedures is pretty slim, so these are just wild speculations.
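
To get a rough feel for how far a single draw of the coefficients can move the predictions, one can simulate draws of one linear predictor using the estimates reported at the end of this message. This is a simplified sketch, not a description of what mi or ice actually do internally. It assumes sex is coded 0/1, so that the linear predictor for code 459 when sex is 0 is just _cons. It also treats the draw as normal with the reported standard error, and the seed and number of draws are arbitrary.

```stata
* Run in a fresh session: clear drops any data in memory
clear
set seed 12345
set obs 10000

* Simulated draws of the code-459 linear predictor for sex == 0:
* _cons = -21.75423 with reported std. err. 13237.73
generate double xb459 = rnormal(-21.75423, 13237.73)

* The base outcome (105) has linear predictor 0, so any draw with xb459 > 0
* makes code 459 more likely than code 105 under that draw
count if xb459 > 0

* Draws this far above 0 put essentially all of the probability on code 459
count if xb459 > 20
```

If a large share of the draws come out far above zero, then a single set of drawn coefficients could send nearly every imputed case with sex equal to 0 to one sparse code, even though the point estimates give that code almost no probability.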

Any thoughts would be greatly appreciated. Details for the mlogit results are 
below.

Thanks,

Matissa Hollister


Multinomial logistic regression                   Number of obs   =         74
                                                  LR chi2(8)      =      12.78
                                                  Prob > chi2     =     0.1195
Log likelihood = -50.158729                       Pseudo R2       =     0.1130

------------------------------------------------------------------------------
     occ1990 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
15           |
         sex |  -1.056146   1.443658    -0.73   0.464    -3.885663     1.77337
       _cons |  -2.772487   1.030735    -2.69   0.007    -4.792691   -.7522826
-------------+----------------------------------------------------------------
21           |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
23           |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
105          |  (base outcome)
-------------+----------------------------------------------------------------
175          |
         sex |  -18.40767    4143.09    -0.00   0.996    -8138.715      8101.9
       _cons |  -2.079367   .7499813    -2.77   0.006    -3.549304   -.6094308
-------------+----------------------------------------------------------------
446          |
         sex |  -18.40767   5859.214    -0.00   0.997    -11502.26    11465.44
       _cons |  -2.772514   1.030749    -2.69   0.007    -4.792745    -.752284
-------------+----------------------------------------------------------------
447          |
         sex |   17.92559   7642.808     0.00   0.998     -14961.7    14997.55
       _cons |  -20.65561   7642.808    -0.00   0.998    -15000.28    14958.97
-------------+----------------------------------------------------------------
458          |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
459          |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
------------------------------------------------------------------------------


      