Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Mary E. Mackesy-Amiti" <mmamiti@uic.edu> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: strange and differing results for mi vs. ice mlogit |
Date | Mon, 18 Oct 2010 09:57:38 -0500 |
information, add an "unknown" category to the occupation variable.

On 10/17/2010 6:08 PM, M Hollis wrote:
I'm exploring various options for imputing values of a nominal variable. The actual situation is somewhat unusual and requires separate imputations for several sub-samples of the data. Even with a relatively simple case, though, I'm getting very strange and poor results using Stata's mi command, and very different and almost as bad results using ice.

Here's the simple case: a sample with 74 complete cases and 62 cases with a missing occupation code. The occupation variable is missing completely at random and can take on one of nine possible occupation codes. The table below (hopefully legible without Courier font) shows that the nine occupation codes are very unevenly distributed within the complete cases, with 62 of the 74 complete cases having the code "105". In this example I'm predicting the occupation using the variable sex:

mi impute mlogit occ1990 = sex, add(1) noisily

Here's a summary of the resulting imputed values:

tab _1_occ1990 complete

           |       complete
_1_occ1990 |         0          1 |     Total
-----------+----------------------+----------
        15 |         0          2 |         2
        21 |         0          1 |         1
        23 |         0          1 |         1
       105 |         0         62 |        62
       175 |         0          2 |         2
       446 |         0          1 |         1
       447 |         0          3 |         3
       458 |         0          1 |         1
       459 |        62          1 |        63
-----------+----------------------+----------
     Total |        62         74 |       136

As you can see, the distribution of the imputed values (complete=0) is very different from that of the complete cases (complete=1). It's not clear why mi imputed a value of 459 for all cases. Given the uneven distribution in the complete cases, it's not surprising that the mlogit results (which I'll paste at the end of this email) show that the coefficient for sex is not significant for any of the codes, and in many cases has a huge standard error. So sex is not the greatest of predictors in this case, but the imputed values should still reflect some combination of the distribution of the complete cases and random variation. It's not random variation, though.
I've run this several times and occasionally it assigns a few other codes, but almost all of the cases always get 459. With other subsamples I've run, it's always the last code listed that gets the vast majority of the imputations. The problem clearly has something to do with mi's attempt to include random variation in the results and not with the mlogit command itself, because if I just run mlogit and get predicted probabilities, those predicted probabilities are very reasonable (i.e., they reflect the high likelihood of code 105, with some differences between men and women).

The ice approach seems to do better but is still problematic:

uvis mlogit occ1990 sex, gen(ice1)

. tab ice1 complete

   imputed |
      from |       complete
   occ1990 |         0          1 |     Total
-----------+----------------------+----------
        15 |         4          2 |         6
        21 |         0          1 |         1
        23 |         0          1 |         1
       105 |        38         62 |       100
       175 |         3          2 |         5
       446 |         0          1 |         1
       447 |        13          3 |        16
       458 |         4          1 |         5
       459 |         0          1 |         1
-----------+----------------------+----------
     Total |        62         74 |       136

I've run this several times, and every time the results are better than the mi results, but code 105 is always a considerably smaller proportion of the imputed cases than of the complete cases, and one other group is considerably larger. I understand there's random variation, but shouldn't that mean that sometimes the imputed cases have a higher proportion of code 105?

If I try to do this with a different, larger sub-sample with more possible occupation codes, the problems are the same or worse. With mi, as before, all of the imputed values are given the last occupation code. The ice command assigns everyone to just a few seemingly random codes and no one to the most common code.

I understand that I'm looking to do multiple imputation for a rather unusual variable. The distribution of the occupations is very uneven and is clearly contributing to the problem. Yet the overall predicted probabilities from the mlogit model are reasonable, so clearly the issue has to do with the imputation process.
The imputation involves introducing variation partly through selecting values for the coefficients from a posterior distribution. Given the poor fit of the model, I wouldn't be surprised by considerable variation in the imputations, but this isn't random variation; the results are different in a consistent way every time. I also tried the Amelia program for multiple imputation and got issues that were similar to the ice results (under-imputation of the most common category).

I have two thoughts on why this might be happening, but I'll be the first to admit that my knowledge of the details of both multinomial logit and multiple imputation is pretty rudimentary.

My first observation is that the problems seem to have something to do with the order in which codes are assigned. In the case of mi, the last code seems to be inordinately likely to be imputed, perhaps because that code is assigned if no others have been selected and the predictions of the other codes are consistently too low. In contrast, ice (and Amelia) seem to have unusually low imputations of the most common group, which is the baseline (omitted) group. Perhaps in these programs the predicted likelihood of the other codes is consistently too high, leaving few people to be assigned the leftover baseline code.

My second observation is that the predicted values might be off because of the complication of introducing random variation into the coefficients when the overall model requires that the predicted probabilities of the nominal values add up to 1. If the mi model is sampling the coefficients independently, ignoring the interdependence, then this constraint might be violated. Perhaps if I can sort through the mi or ice code (I'm not the best programmer) I can get a better sense of how these predicted probabilities are generated.
I'm not sure, though, how this problem would lead to persistent over- or under-prediction of specific probabilities, unless the asymptotic nature of the posterior distribution means that the deviations in one direction are larger than the deviations in the other direction, which might be amplified in cases with values that are very unevenly distributed. As I said, though, my knowledge of these procedures is pretty slim, so these are just wild speculations.

Any thoughts would be greatly appreciated. Details for the mlogit results are below.

Thanks,
Matissa Hollister

Multinomial logistic regression                   Number of obs   =         74
                                                  LR chi2(8)      =      12.78
                                                  Prob > chi2     =     0.1195
Log likelihood = -50.158729                       Pseudo R2       =     0.1130

------------------------------------------------------------------------------
     occ1990 |      Coef.  Std. Err.      z     P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
15           |
         sex |  -1.056146   1.443658    -0.73   0.464    -3.885663     1.77337
       _cons |  -2.772487   1.030735    -2.69   0.007    -4.792691   -.7522826
-------------+----------------------------------------------------------------
21           |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
23           |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
105          |  (base outcome)
-------------+----------------------------------------------------------------
175          |
         sex |  -18.40767    4143.09    -0.00   0.996    -8138.715      8101.9
       _cons |  -2.079367   .7499813    -2.77   0.006    -3.549304   -.6094308
-------------+----------------------------------------------------------------
446          |
         sex |  -18.40767   5859.214    -0.00   0.997    -11502.26    11465.44
       _cons |  -2.772514   1.030749    -2.69   0.007    -4.792745    -.752284
-------------+----------------------------------------------------------------
447          |
         sex |   17.92559   7642.808     0.00   0.998     -14961.7    14997.55
       _cons |  -20.65561   7642.808    -0.00   0.998    -15000.28    14958.97
-------------+----------------------------------------------------------------
458          |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
-------------+----------------------------------------------------------------
459          |
         sex |   17.92559   13237.73     0.00   0.999    -25927.55     25963.4
       _cons |  -21.75423   13237.73    -0.00   0.999    -25967.23    25923.72
------------------------------------------------------------------------------

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
-- Mary Ellen Mackesy-Amiti, Ph.D. Research Assistant Professor Community Outreach Intervention Projects (COIP) School of Public Health m/c 923 Division of Epidemiology and Biostatistics University of Illinois at Chicago ph. 312-355-4892 fax: 312-996-1450
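
A rough way to see how coefficient draws could produce the pattern described in the quoted message is to simulate it from the intercepts and standard errors in the mlogit output above. The sketch below is in Python rather than Stata so that it runs without Stata, and it is not the code used by mi, ice, or Amelia. It rests on two assumptions: the case being imputed has sex = 0, so only _cons enters the linear predictor, and each intercept is drawn independently from a normal distribution centered on its estimate, which ignores the covariances a real imputation routine would use. The names CONS, HUGE_SE, softmax_probs, draw_probs and point_estimate_probs are illustrative only.

```python
import math
import random

# (estimate, standard error) of _cons for each non-base outcome, copied from
# the mlogit output quoted above. Outcome 105 is the base outcome, so its
# linear predictor is fixed at 0.
CONS = {
    15: (-2.772487, 1.030735),
    21: (-21.75423, 13237.73),
    23: (-21.75423, 13237.73),
    175: (-2.079367, 0.7499813),
    446: (-2.772514, 1.030749),
    447: (-20.65561, 7642.808),
    458: (-21.75423, 13237.73),
    459: (-21.75423, 13237.73),
}
BASE = 105
# Outcomes whose intercept has a standard error in the thousands.
HUGE_SE = {21, 23, 447, 458, 459}


def softmax_probs(linpred):
    """Multinomial-logit probabilities from linear predictors (stable softmax)."""
    top = max(linpred.values())
    expd = {code: math.exp(value - top) for code, value in linpred.items()}
    total = sum(expd.values())
    return {code: value / total for code, value in expd.items()}


def point_estimate_probs():
    """Probabilities for a sex = 0 case using the estimated intercepts as-is."""
    linpred = {BASE: 0.0}
    linpred.update({code: est for code, (est, _se) in CONS.items()})
    return softmax_probs(linpred)


def draw_probs(rng):
    """Probabilities after one simulated coefficient draw.

    ASSUMPTION: independent normal draws per intercept; an actual imputation
    routine would draw all coefficients jointly using their covariance matrix.
    """
    linpred = {BASE: 0.0}
    for code, (est, se) in CONS.items():
        linpred[code] = rng.gauss(est, se)
    return softmax_probs(linpred)


if __name__ == "__main__":
    rng = random.Random(12345)
    print("P(105) at the point estimates:", round(point_estimate_probs()[BASE], 3))
    draws = [draw_probs(rng) for _ in range(1000)]
    share = sum(1 for p in draws if max(p, key=p.get) in HUGE_SE) / len(draws)
    print("share of draws where a huge-SE outcome is the most likely:", share)
```

Under these assumptions the point estimates give code 105 a probability of about 0.8, while most individual draws hand nearly all of the probability to one of the outcomes whose standard error is in the thousands. If a single draw is applied to every missing case within an imputation, that would be consistent with nearly all imputed values landing on one rare code, though it does not by itself explain why that code would always be the last one listed.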