Re: st: strange and differing results for mi vs. ice mlogit

From	Maarten Buis <[email protected]>
To	[email protected]
Subject	Re: st: strange and differing results for mi vs. ice mlogit
Date	Mon, 18 Oct 2010 09:52:09 +0100 (BST)

-ice- needs to fit a -mlogit- model to a dependent variable in which
some categories have only one or two observations. This will (almost)
always cause great difficulty, as you found out. Basically, the
information you want is just not present in the data.
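
A quick way to see the problem, as a minimal sketch: cross-tabulate the
occupation codes against sex among the complete cases. The variable names
occ1990 and sex are taken from your post; nothing else about your data is
assumed. An empty cell means -mlogit- cannot pin down a finite coefficient
for sex in that category, which is typically what lies behind coefficients
around 18 with standard errors in the thousands.

```stata
* Cell counts of occupation by sex among the complete cases.
* Empty cells mark the categories where sex perfectly predicts
* membership, so the corresponding coefficient is not identified.
tabulate occ1990 sex if !missing(occ1990)
```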

One solution is to combine the occupation codes. Given the structure
of your data, I would say you have to reduce your occupation
variable to a binary variable. Alternatively, you could reconsider
estimating separate imputation models for the different sub-samples.
I guess that neither is perfect, but the alternative is to make data
up, which is even worse.
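
The first suggestion could be sketched roughly as follows. This is only an
illustration under several assumptions: the binary split is code 105 versus
all other codes (my reading of your table, not the only possible split), the
name occ105 is made up for this example, the data are not yet -mi set-
(otherwise skip that line), and the number of imputations and the seed are
arbitrary.

```stata
* Binary indicator: 1 = code 105, 0 = any other observed code.
* Cases with a missing occupation stay missing so they can be imputed.
generate byte occ105 = (occ1990 == 105) if !missing(occ1990)

* Declare the mi style (skip if the data are already mi set) and
* register the new variable as one to be imputed.
mi set wide
mi register imputed occ105

* Impute with a logit model on sex; add(5) and rseed() are arbitrary.
mi impute logit occ105 sex, add(5) rseed(12345)

* Compare imputed (complete==0) and observed (complete==1) values
* in the first imputation; -complete- is the indicator from your post.
mi xeq 1: tabulate occ105 complete
```

If sex still perfectly predicts the binary outcome in some sub-sample, the
same problem will reappear there, so it is worth checking the two-way table
first.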

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------


--- On Mon, 18/10/10, M Hollis <[email protected]> wrote:

> From: M Hollis <[email protected]>
> Subject: st: strange and differing results for mi vs. ice mlogit
> To: [email protected]
> Date: Monday, 18 October, 2010, 0:08
> I'm exploring various options for trying to impute values of a nominal
> variable. The actual situation is somewhat unusual and requires
> separate imputations for several sub-samples of the data. Even with a
> relatively simple case, though, I'm getting very strange and poor
> results using Stata's mi command, and very different and almost as bad
> results using ice.
>
> Here's the simple case. The sample has 74 complete cases and 62 cases
> with a missing occupation code. The occupation variable is missing
> completely at random, and it can take on one of nine possible
> occupation codes. The table below (hopefully legible without Courier
> font) shows that the 9 occupation codes are very unevenly distributed
> within the complete cases, with 62 of the 74 complete cases having the
> code "105". In this example I'm predicting the occupation using the
> variable sex:
> 
> mi impute mlogit occ1990 = sex, add(1) noisily
> 
> Here's a summary of the resulting imputed values:
> 
> tab _1_occ1990 complete
> 
>            |       complete
> _1_occ1990 |         0          1 |     Total
> -----------+----------------------+----------
>         15 |         0          2 |         2
>         21 |         0          1 |         1
>         23 |         0          1 |         1
>        105 |         0         62 |        62
>        175 |         0          2 |         2
>        446 |         0          1 |         1
>        447 |         0          3 |         3
>        458 |         0          1 |         1
>        459 |        62          1 |        63
> -----------+----------------------+----------
>      Total |        62         74 |       136
> 
> As you can see, the distribution of the imputed values (complete=0) is
> very different from the complete cases (complete=1). It's not clear why
> mi imputed a value of 459 for all cases. Given the uneven distribution
> in the complete cases, it's not surprising that the mlogit results
> (which I'll paste at the end of this email) show that the coefficient
> for sex is not significant for any of the codes, and in many cases has
> a huge standard error. So, sex is not the greatest of predictors in
> this case, but the imputed values should still reflect some
> combination of the distribution of the complete cases and random
> variation. It's not random variation, though. I've run this several
> times and occasionally it assigns a few other codes, but almost all of
> the cases always get 459. With other subsamples I've run, it's always
> the last code listed that gets the vast majority of the imputations.
> 
> The problem clearly has something to do with mi's attempt to include
> random variation in the results and not the mlogit command itself,
> because if I just run mlogit and get predicted probabilities, those
> predicted probabilities are very reasonable (i.e. they reflect the
> high likelihood of code 105, with some differences between men and
> women).
> 
> The ice approach seems to do better but is still problematic:
>
> uvis mlogit occ1990 sex, gen(ice1)
>
> . tab ice1 complete
>  
>    imputed |
>       from |       complete
>    occ1990 |         0          1 |     Total
> -----------+----------------------+----------
>         15 |         4          2 |         6
>         21 |         0          1 |         1
>         23 |         0          1 |         1
>        105 |        38         62 |       100
>        175 |         3          2 |         5
>        446 |         0          1 |         1
>        447 |        13          3 |        16
>        458 |         4          1 |         5
>        459 |         0          1 |         1
> -----------+----------------------+----------
>      Total |        62         74 |       136
>
> I've run this several times and every time the results are better than
> the mi results, but the 105 code always makes up a considerably smaller
> proportion than in the complete cases and one other group makes up
> quite a bit more. I understand there's random variation, but shouldn't
> that mean that sometimes the imputed cases should have a higher
> proportion of code 105?
> 
> If I try to do this with a different, larger sub-sample with more
> possible occupation codes, the problems are the same or worse. With
> mi, as before, all of the imputed values are given the last occupation
> code. The ice command assigns everyone to just a few seemingly random
> codes and no one to the most common code.
>
> I understand that I'm looking to do multiple imputation for a rather
> unusual variable. The distribution of the occupations is very uneven
> and is clearly contributing to the problem. Yet the overall predicted
> probabilities with the mlogit model are reasonable, so clearly the
> issue has to do with the imputation process. The imputation involves
> introducing variation partly through selecting values for the
> coefficients from a posterior distribution. Given the poor fit of the
> model, I wouldn't be surprised by considerable variation in the
> imputations, but this isn't random variation: the results are
> different in a consistent way every time. I also tried the Amelia
> program for multiple imputation and got issues that were similar to
> the ice results (under-imputation of the most common category).
> 
> I have two thoughts on why this might be happening, but I'll be the
> first to admit that my knowledge of the details of both multinomial
> logit and multiple imputation is pretty rudimentary. My first
> observation is that the problems seem to have something to do with the
> order in which codes are assigned. In the case of mi, the last code
> seems to be inordinately likely to be imputed, perhaps because that
> code is assigned if no others have been selected and the predictions
> of the other codes are consistently too low. In contrast, ice (and
> Amelia) seem to have unusually low imputations of the most common
> group, which is the baseline (omitted) group. Perhaps in these
> programs the predicted likelihood of the other codes is consistently
> too high, leaving few people to be assigned the leftover baseline
> code.
> 
> My second observation is that the predicted values might be off
> because of the complication of introducing random variation into the
> coefficients when the overall model requires that the predicted
> probabilities of each of the nominal values should add up to 1. If the
> mi model is sampling the coefficients independently, ignoring the
> interdependence, then this constraint might be violated. Perhaps if I
> can sort through the mi or ice code (I'm not the best programmer) I
> can get a better sense of how these predicted probabilities are
> generated. I'm not sure, though, how this problem would lead to
> persistent over- or under-prediction of specific probabilities, unless
> the asymptotic nature of the posterior distribution means that the
> deviations in one direction are larger than the deviations in the
> other direction, which might be amplified in cases with values that
> are very unevenly distributed. As I said, though, my knowledge of
> these procedures is pretty slim, so these are just wild speculations.
> 
> Any thoughts would be greatly appreciated. Details for the mlogit
> results are below.
> 
> Thanks,
> 
> Matissa Hollister
> 
> 
> Multinomial logistic regression                   Number of obs   =         74
>                                                   LR chi2(8)      =      12.78
>                                                   Prob > chi2     =     0.1195
> Log likelihood = -50.158729                       Pseudo R2       =     0.1130
>
> ------------------------------------------------------------------------------
>      occ1990 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
> 15           |
>          sex |  -1.056146   1.443658    -0.73   0.464    -3.885663     1.77337
>        _cons |  -2.772487   1.030735    -2.69   0.007    -4.792691   -.7522826
> -------------+----------------------------------------------------------------
> 21           |
>          sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
>        _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
> -------------+----------------------------------------------------------------
> 23           |
>          sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
>        _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
> -------------+----------------------------------------------------------------
> 105          |  (base outcome)
> -------------+----------------------------------------------------------------
> 175          |
>          sex |  -18.40767    4143.09    -0.00   0.996    -8138.715      8101.9
>        _cons |  -2.079367   .7499813    -2.77   0.006    -3.549304   -.6094308
> -------------+----------------------------------------------------------------
> 446          |
>          sex |  -18.40767   5859.214    -0.00   0.997    -11502.26    11465.44
>        _cons |  -2.772514   1.030749    -2.69   0.007    -4.792745    -.752284
> -------------+----------------------------------------------------------------
> 447          |
>          sex |   17.92559   7642.808     0.00   0.998     -14961.7    14997.55
>        _cons |  -20.65561   7642.808    -0.00   0.998    -15000.28    14958.97
> -------------+----------------------------------------------------------------
> 458          |
>          sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
>        _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
> -------------+----------------------------------------------------------------
> 459          |
>          sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
>        _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
> ------------------------------------------------------------------------------
> 
> 
>       
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 


      
