


Re: st: Imputation using ML for a lognormal ordered income variable

From   Austin Nichols <>
Subject   Re: st: Imputation using ML for a lognormal ordered income variable
Date   Mon, 19 Nov 2012 22:26:21 -0500

Tinna Asgeirsdottir  <> :
Dagum or other distributions are probably to be preferred to the lognormal;
see -dagfit-, -smgfit-, -gbgfit- on SSC for estimation of parameters
using interval-censored data, then use those parameters to predict
mean, variance, etc.

On Mon, Nov 19, 2012 at 10:48 AM, Stas Kolenikov <> wrote:
> My understanding is that -lognfit- works with the exact data, not the
> coarsened data that you have. As you obviously see, imputing the
> median or the mean or any specific number is plain wrong (although I
> have to admit to having done just that in the -polychoric- module...
> which was more than 10 years ago when I was stupid enough to even
> start the whole project :) ). So what I would do is:
> 1. estimate the parameters of the lognormal model via -intreg-, using
> logs of income as the cutoffs between categories. It will give you the
> mean and the variance of logs (or conditional mean if you put
> demographic covariates into your regression)
> 2. figure out the conditional distribution (truncated normal for logs
> within a given bin)
> 3. simulate from that conditional distribution, create a new variable
> 4. repeat a bunch of times, creating say 20 or 50 plausible income variables
> 5. declare this to be an -mi set wide- data set and analyze the data
> as multiply imputed
> To check the sensitivity at the right tail, you might want to draw
> the simulated value in step 3 for the upper category from a Pareto
> distribution that connects smoothly to the lognormal distribution. I
> also recall that Stephen Jenkins, the author of lognfit, also worked
> on other parametric income distribution specifications -- see e.g.
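
A minimal sketch of steps 2 to 4 above, written in Python for illustration only. It assumes the log-scale mean `mu` and standard deviation `sigma` have already been estimated (for example with -intreg-); the numeric values, the bracket bounds, and the names `draw_income`, `brackets`, `observed` and `imputations` are hypothetical and do not come from the thread. In Stata the same inverse-cdf draw can be built from normal(), invnormal() and runiform().

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_income(lower, upper, mu, sigma, rng):
    """Draw one income from a lognormal(mu, sigma) truncated to [lower, upper].

    lower=None (or 0) means no lower bound; upper=None means an open-ended
    top bracket.
    """
    p_lo = STD_NORMAL.cdf((math.log(lower) - mu) / sigma) if lower else 0.0
    p_hi = STD_NORMAL.cdf((math.log(upper) - mu) / sigma) if upper is not None else 1.0
    # Inverse-cdf method: a uniform draw between the two bracket probabilities,
    # mapped back through the normal quantile function, is truncated normal in logs.
    u = p_lo + rng.random() * (p_hi - p_lo)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < u < 1
    return math.exp(mu + sigma * STD_NORMAL.inv_cdf(u))


# Hypothetical inputs: parameters on the log scale and three income brackets.
mu, sigma = 10.0, 0.8
brackets = [(None, 10_000), (10_000, 30_000), (30_000, None)]
observed = [0, 1, 1, 2, 0, 2]  # bracket index reported by each respondent

rng = random.Random(2012)  # fixed seed so the draws are reproducible
M = 20  # number of imputed income variables
imputations = [
    [draw_income(*brackets[b], mu, sigma, rng) for b in observed]
    for _ in range(M)
]
```

Each inner list plays the role of one plausible income variable; with covariates in the model, `mu` would be replaced by each respondent's own linear prediction.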
> On Mon, Nov 19, 2012 at 9:34 AM, Tinna Asgeirsdottir
> <> wrote:
>> Thanks for the helpful reply Stas,
>> I don't think the recommendation referred to interval regression or
>> multiple imputation. I think it referred to imputing the probable
>> average or median of each category, but without the obviously false
>> assumption of a uniform distribution within each category the midpoint
>> would suggest.
>> If I do a ML fit of a lognormal distribution using the lognfit command
>> I can get the parameters of the distribution. I guess I should be able
>> to work this out by hand from there, but figured that there might be
>> an easier way.
>> Best
>> Tinna
>> 2012/11/17 Stas Kolenikov <>:
>>> Lognormal distribution will likely underestimate how heavy the top
>>> tail is (although if you are interested in Iceland, you may have a
>>> very egalitarian income distribution, so the shape of that tail may
>>> not be that terrible). Lognormal distribution is a very cute model to
>>> play with and very dangerous in real work. In my work on Russian data,
>>> changing the assumptions about the top tail moved our Gini index from
>>> 0.48 to 0.60... and that's a little bit of a difference, let's put it
>>> this way.
>>> The recommendation you have heard probably concerns -intreg-, which
>>> you can read the help on.
>>> Imputing the mean income over a group will lead to a multitude of
>>> problems due to artificially compressed variability and values that
>>> are simply too low for the top group. If you desperately need to
>>> impute, you would want to go with multiple imputations (-help mi-),
>>> although you would want to read the MI manual and a paper
>>> or two if you are not familiar with the
>>> technique. What I have done in one of my projects recently was to
>>> generate the plausible values of the variable of interest a bunch of
>>> times (say, 50... the original suggestion to use 5 imputations dates
>>> back to the late 1970s... and your smartphone now has more computing
>>> power than a Cray supercomputer of that era) and make Stata believe
>>> they were imputed in Stata's mi wide format.
>>> --
>>> -- Stas Kolenikov, PhD, PStat (SSC)  ::
>>> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
>>> srbi dot com
>>> -- Opinions stated in this email are mine only, and do not reflect the
>>> position of my employer
>>> On Sat, Nov 17, 2012 at 6:12 AM, Tinna Asgeirsdottir
>>> <> wrote:
>>>> Dear Stata users,
>>>> In my data I have income in 13 groups. The top group is open ended. I
>>>> am trying to impute sensible values and would like to use this as a
>>>> continuous variable. I am especially concerned about the top category.
>>>> It has been suggested to me that I should use Stata's ML command
>>>> instead of using each category's midpoint. I am having trouble finding
>>>> what I need on the internet. Thus I wonder if anyone can tell me how
>>>> to fit a lognormal distribution to the variable and subsequently infer
>>>> the average income in the top bracket. If you know how to do this in
>>>> general for all the categories, that would be great as well, since the
>>>> distribution within the other brackets is surely not uniform either.
>>>> However, I think finding a good solution for my top category is the
>>>> most important thing.
>>>> Best regards,
>>>> Tinna
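
For the original question, the mean income in the open-ended top bracket has a closed form once a lognormal has been fitted. With log-scale parameters mu and sigma, cutoff c, and a = (ln c - mu)/sigma, the conditional mean is E[Y | Y > c] = exp(mu + sigma^2/2) * Phi(sigma - a) / (1 - Phi(a)), where Phi is the standard normal cdf. The sketch below, in Python for illustration, evaluates that formula; the function name `lognormal_mean_above` and the numeric values are hypothetical. As noted in the thread, the lognormal may understate the top tail, so the result may be too low.

```python
import math
from statistics import NormalDist


def lognormal_mean_above(c, mu, sigma):
    """Mean of a lognormal(mu, sigma) variable conditional on exceeding cutoff c."""
    phi = NormalDist().cdf
    a = (math.log(c) - mu) / sigma  # standardized log cutoff
    return math.exp(mu + sigma**2 / 2) * phi(sigma - a) / (1.0 - phi(a))


# Hypothetical values: log-scale parameters and the lower bound of the top bracket.
top_mean = lognormal_mean_above(30_000, mu=10.0, sigma=0.8)
```

As the cutoff shrinks toward zero, the expression reduces to the unconditional lognormal mean exp(mu + sigma^2/2), which is a quick sanity check on the formula.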
