Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Austin Nichols <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Imputation using ML for a lognormal ordered income variable
Date: Mon, 19 Nov 2012 22:26:21 -0500

Tinna Asgeirsdottir <statalist.tla@gmail.com>:
Dagum or other distributions are probably to be preferred to the
lognormal; see -dagfit-, -smgfit-, -gbgfit- on SSC for estimation of
parameters using interval-censored data, then use those parameters to
predict mean, variance, etc.

On Mon, Nov 19, 2012 at 10:48 AM, Stas Kolenikov <skolenik@gmail.com> wrote:
> My understanding is that -lognfit- works with the exact data, not the
> coarsened data that you have. As you obviously see, imputing the
> median or the mean or any specific number is plain wrong (although I
> have to admit to having done just that in the -polychoric- module...
> which was more than 10 years ago when I was stupid enough to even
> start the whole project :) ). So what I would do is:
> 1. estimate the parameters of the lognormal model via -intreg-, using
> logs of income as the cutoffs between categories. It will give you the
> mean and the variance of logs (or conditional mean if you put
> demographic covariates into your regression)
> 2. figure out the conditional distribution (truncated normal for logs
> within a given bin)
> 3. simulate from that conditional distribution, create a new variable
> 4. repeat a bunch of times, creating say 20 or 50 plausible income variables
> 5. declare this to be an -mi set wide- data set and analyze the data
> as multiply imputed
>
> To check the sensitivity at the right tail, you might want to modify
> the simulated value in 3 for the upper category to be a Pareto
> distribution that connects smoothly to the lognormal distribution. I
> also recall that Stephen Jenkins, the author of lognfit, also worked
> on other parametric income distribution specifications -- see e.g.
> http://www.citeulike.org/user/ctacmo/article/4500072.
>
> On Mon, Nov 19, 2012 at 9:34 AM, Tinna Asgeirsdottir
> <statalist.tla@gmail.com> wrote:
>> Thanks for the helpful reply Stas,
>>
>> I don't think the recommendation referred to interval regression or
>> multiple imputation.
>> I think it referred to imputing the probable
>> average or median of each category, but without the obviously false
>> assumption of a uniform distribution within each category that the
>> midpoint would suggest.
>>
>> If I do an ML fit of a lognormal distribution using the lognfit command
>> I can get the parameters of the distribution. I guess I should be able
>> to work this out by hand from there, but figured that there might be
>> an easier way.
>>
>> Best
>> Tinna
>>
>> 2012/11/17 Stas Kolenikov <skolenik@gmail.com>:
>>> Lognormal distribution will likely underestimate how heavy the top
>>> tail is (although if you are interested in Iceland, you may have a
>>> very egalitarian income distribution, so the shape of that tail may
>>> not be that terrible). Lognormal distribution is a very cute model to
>>> play with and very dangerous in real work. In my work on Russian data,
>>> changing the assumptions about the top tail moved our Gini index from
>>> 0.48 to 0.60... and that's a little bit of a difference, let's put it
>>> this way.
>>>
>>> The recommendation you have heard probably concerns -intreg-, which
>>> you can read the help on.
>>>
>>> Imputing the mean income over a group will lead to a multitude of
>>> problems due to artificially compressed variability and values that
>>> are simply too low for the top group. If you desperately need to
>>> impute, you would want to go with multiple imputations (-help mi-),
>>> although you would want to read the MI manual and a paper
>>> (http://www.citeulike.org/user/ctacmo/article/8525275) or two
>>> (http://www.jstor.org/stable/2291635) if you are not familiar with the
>>> technique. What I have done in one of my projects recently was to
>>> generate the plausible values of the variable of interest a bunch of
>>> times (say, 50... the original suggestion to use 5 imputations dates
>>> back to late 1970s...
>>> and your smartphone now has more computing power
>>> than a then-Cray supercomputer) and make Stata believe they were
>>> imputed in Stata mi wide format.
>>>
>>> --
>>> -- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
>>> -- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
>>> srbi dot com
>>> -- Opinions stated in this email are mine only, and do not reflect the
>>> position of my employer
>>>
>>>
>>> On Sat, Nov 17, 2012 at 6:12 AM, Tinna Asgeirsdottir
>>> <statalist.tla@gmail.com> wrote:
>>>> Dear Stata users,
>>>>
>>>> In my data I have income in 13 groups. The top group is open ended. I
>>>> am trying to impute sensible values and would like to use this as a
>>>> continuous variable. I am especially concerned about the top category.
>>>> It has been suggested to me that I should use Stata's ML command
>>>> instead of using each category's midpoint. I am having trouble finding
>>>> what I need on the internet. Thus I wonder if anyone can tell me how
>>>> to fit a lognormal distribution to the variable and subsequently infer
>>>> the average income in the top bracket. If you know how to do this in
>>>> general for all the categories, that is great as well, as the
>>>> distribution over the other brackets is surely not uniform. However,
>>>> I think finding a good solution for my top category is the most
>>>> important thing.
>>>>
>>>> Best regards,
>>>> Tinna
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
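
Steps 2 to 4 of Stas's recipe in the thread above (truncated normal for log income within each bracket, simulated repeatedly) can be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's actual code from the thread: it is written in Python rather than Stata, it takes the log-scale mean `mu` and standard deviation `sigma` as already estimated in step 1 (for example by -intreg-), and the function names, bracket bounds, and parameter values are all hypothetical. It also includes the closed-form lognormal mean above a cutoff, which is what the original question about the top bracket asks for if the lognormal is taken at face value; the thread's warnings about the lognormal's thin top tail still apply.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_income(lo, hi, mu, sigma, rng):
    """Draw one income from lognormal(mu, sigma) restricted to (lo, hi).

    lo and hi are bracket bounds in income units (hypothetical layout):
    lo = 0 marks the bottom bracket, hi = None the open-ended top bracket.
    """
    # Probability mass of log income below each bracket bound
    p_lo = STD_NORMAL.cdf((math.log(lo) - mu) / sigma) if lo > 0 else 0.0
    p_hi = STD_NORMAL.cdf((math.log(hi) - mu) / sigma) if hi is not None else 1.0
    # Inverse-cdf draw: uniform on (p_lo, p_hi), mapped back to the log scale
    p = p_lo + (p_hi - p_lo) * rng.random()
    # inv_cdf is undefined at exactly 0 or 1, so keep p strictly inside
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return math.exp(mu + sigma * STD_NORMAL.inv_cdf(p))


def impute(brackets, mu, sigma, m=20, seed=12345):
    """Return m lists of simulated incomes, one value per (lo, hi) bracket."""
    rng = random.Random(seed)
    return [
        [draw_income(lo, hi, mu, sigma, rng) for lo, hi in brackets]
        for _ in range(m)
    ]


def top_bracket_mean(c, mu, sigma):
    """Closed-form E[Y | Y > c] when log(Y) is normal(mu, sigma)."""
    z = (math.log(c) - mu) / sigma
    return (
        math.exp(mu + sigma ** 2 / 2)
        * STD_NORMAL.cdf(sigma - z)
        / STD_NORMAL.cdf(-z)
    )


if __name__ == "__main__":
    # Hypothetical parameters and brackets, for illustration only
    mu, sigma = 10.0, 0.7
    brackets = [(0, 10000.0), (10000.0, 20000.0), (60000.0, None)]
    draws = impute(brackets, mu, sigma, m=5)
    print(draws[0])
    print(top_bracket_mean(60000.0, mu, sigma))
```

Each inner list would become one imputed income variable; in Stata, those variables would then be declared as wide-format multiply imputed data, as in step 5. Replacing the top-bracket draw with a Pareto draw, as suggested for the sensitivity check, would only change `draw_income` for the case `hi is None`.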