
# Re: st: Imputation using ML for a lognormal ordered income variable

| From | To | Subject | Date |
|---|---|---|---|
| Austin Nichols | statalist@hsphsun2.harvard.edu | Re: st: Imputation using ML for a lognormal ordered income variable | Mon, 19 Nov 2012 22:26:21 -0500 |

```
Tinna Asgeirsdottir <statalist.tla@gmail.com>:
Dagum or other distributions are probably to be preferred to the lognormal;
see -dagfit-, -smgfit-, -gbgfit- on SSC for estimation of parameters
using interval-censored data, then use those parameters to predict
mean, variance, etc.

On Mon, Nov 19, 2012 at 10:48 AM, Stas Kolenikov <skolenik@gmail.com> wrote:
> My understanding is that -lognfit- works with the exact data, not the
> coarsened data that you have. As you obviously see, imputing the
> median or the mean or any specific number is plain wrong (although I
> have to admit to having done just that in the -polychoric- module...
> which was more than 10 years ago when I was stupid enough to even
> start the whole project :) ). So what I would do is:
> 1. estimate the parameters of the lognormal model via -intreg-, using
> logs of income as the cutoffs between categories. It will give you the
> mean and the variance of logs (or conditional mean if you put
> demographic covariates into your regression)
> 2. figure out the conditional distribution (truncated normal for logs
> within a given bin)
> 3. simulate from that conditional distribution, create a new variable
> 4. repeat a bunch of times, creating say 20 or 50 plausible income variables
> 5. declare this to be an -mi set wide- data set and analyze the data
> as multiply imputed
>
> To check the sensitivity at the right tail, you might want to modify
> the simulated value in 3 for the upper category to be a Pareto
> distribution that connects smoothly to the lognormal distribution. I
> also recall that Stephen Jenkins, the author of -lognfit-, also worked
> on other parametric income distribution specifications -- see e.g.
> http://www.citeulike.org/user/ctacmo/article/4500072.
>
> On Mon, Nov 19, 2012 at 9:34 AM, Tinna Asgeirsdottir
> <statalist.tla@gmail.com> wrote:
>> Thanks for the helpful reply Stas,
>>
>> I don't think the recommendation referred to interval regression or
>> multiple imputation. I think it referred to imputing the probable
>> average or median of each category, but without the obviously false
>> assumption of a uniform distribution within each category that the
>> midpoint would suggest.
>>
>> If I do a ML fit of a lognormal distribution using the lognfit command
>> I can get the parameters of the distribution. I guess I should be able
>> to work this out by hand from there, but figured that there might be
>> an easier way.
>>
>> Best
>> Tinna
>>
>> 2012/11/17 Stas Kolenikov <skolenik@gmail.com>:
>>> Lognormal distribution will likely underestimate how heavy the top
>>> tail is (although if you are interested in Iceland, you may have a
>>> very egalitarian income distribution, so the shape of that tail may
>>> not be that terrible). Lognormal distribution is a very cute model to
>>> play with and very dangerous in real work. In my work on Russian data,
>>> changing the assumptions about the top tail moved our Gini index from
>>> 0.48 to 0.60... and that's a little bit of a difference, let's put it
>>> this way.
>>>
>>> The recommendation you have heard probably concerns -intreg-, which
>>> you can read the help on.
>>>
>>> Imputing the mean income over a group will lead to a multitude of
>>> problems due to artificially compressed variability and values that
>>> are simply too low for the top group. If you desperately need to
>>> impute, you would want to go with multiple imputations (-help mi-),
>>> although you would want to read the MI manual and a paper
>>> (http://www.citeulike.org/user/ctacmo/article/8525275) or two
>>> (http://www.jstor.org/stable/2291635) if you are not familiar with the
>>> technique. What I have done in one of my projects recently was to
>>> generate the plausible values of the variable of interest a bunch of
>>> times (say, 50... the original suggestion to use 5 imputations dates
>>> back to the late 1970s... and your smartphone now has more computing
>>> power than a Cray supercomputer of that era) and make Stata believe
>>> they were imputed in Stata mi wide format.
>>>
>>> --
>>> -- Stas Kolenikov, PhD, PStat (SSC)  ::  http://stas.kolenikov.name
>>> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
>>> srbi dot com
>>> -- Opinions stated in this email are mine only, and do not reflect the
>>> position of my employer
>>>
>>>
>>> On Sat, Nov 17, 2012 at 6:12 AM, Tinna Asgeirsdottir
>>> <statalist.tla@gmail.com> wrote:
>>>> Dear Stata users,
>>>>
>>>> In my data I have income in 13 groups. The top group is open ended. I
>>>> am trying to impute sensible values and would like to use this as a
>>>> continuous variable. I am especially concerned about the top category.
>>>> It has been suggested to me that I should use Stata's ML command
>>>> instead of using each category's midpoint. I am having trouble finding
>>>> what I need on the internet. Thus I wonder if anyone can tell me how
>>>> to fit a lognormal distribution to the variable and subsequently infer
>>>> the average income in the top bracket. If you know how to do this in
>>>> general for all the categories, that would be great as well, since the
>>>> distribution within the other brackets is surely not uniform. However,
>>>> I think finding a good solution for my top category is the most
>>>> important thing.
>>>>
>>>> Best regards,
>>>> Tinna

```
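Steps 2 to 4 of Stas Kolenikov's recipe, and the original question about the average income in the open top bracket, can be sketched in a few lines. This is only an illustration in Python, not the thread's Stata workflow: it assumes the lognormal parameters `mu` and `sigma` (mean and standard deviation of log income) have already been estimated from the grouped data, for example with -intreg-, and the numbers used below are hypothetical. The helper names `bracket_mean` and `draw_income` are made up for this example. As noted in the thread, the lognormal may well understate the top tail, so the top-bracket results should be treated as a lower-end guess.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, provides cdf() and inv_cdf()


def _z(cut, mu, sigma):
    """Standardised log cutoff; 0 maps to -infinity, infinity stays +infinity."""
    if cut <= 0:
        return -math.inf
    if math.isinf(cut):
        return math.inf
    return (math.log(cut) - mu) / sigma


def bracket_mean(mu, sigma, lower, upper=math.inf):
    """Mean income within (lower, upper) when log(income) ~ N(mu, sigma^2).

    Uses E[X | l < X < u] = exp(mu + sigma^2/2)
        * [Phi(b - sigma) - Phi(a - sigma)] / [Phi(b) - Phi(a)],
    where a and b are the standardised log cutoffs.
    """
    a, b = _z(lower, mu, sigma), _z(upper, mu, sigma)
    cdf = STD_NORMAL.cdf
    numerator = cdf(b - sigma) - cdf(a - sigma)
    denominator = cdf(b) - cdf(a)
    return math.exp(mu + 0.5 * sigma ** 2) * numerator / denominator


def draw_income(mu, sigma, lower, upper=math.inf, rng=random):
    """One draw from the lognormal truncated to (lower, upper).

    Inverse-CDF sampling: log income is truncated normal within the bin.
    """
    p_lo = STD_NORMAL.cdf(_z(lower, mu, sigma))
    p_hi = STD_NORMAL.cdf(_z(upper, mu, sigma))
    p = p_lo + (p_hi - p_lo) * rng.random()
    # inv_cdf() needs a probability strictly between 0 and 1.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return math.exp(mu + sigma * STD_NORMAL.inv_cdf(p))


# Hypothetical inputs: stand-ins for estimates and for the top-bracket cutoff.
mu_hat, sigma_hat = 10.0, 0.8
top_cutoff = 60000.0

print("mean income in top bracket:", bracket_mean(mu_hat, sigma_hat, top_cutoff))

# 20 plausible values for one respondent in the top bracket (step 4).
rng = random.Random(12345)
plausible = [draw_income(mu_hat, sigma_hat, top_cutoff, rng=rng) for _ in range(20)]
print("first three draws:", plausible[:3])
```

Each respondent would get one draw per imputation from his or her own bracket, giving 20 or 50 completed income variables to be analysed as multiply imputed data. The sketch draws with `mu` and `sigma` held fixed; it does not propagate the estimation uncertainty in those parameters.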