Re: st: Imputation using ML for a lognormal ordered income variable

From   Nick Cox
Subject   Re: st: Imputation using ML for a lognormal ordered income variable
Date   Mon, 19 Nov 2012 15:57:25 +0000

-lognfit- (SSC, Stephen Jenkins) takes the data you feed it literally,
which means numerically. What it expects sounds some distance from
what Tinna has. It's not an imputation command.


On Mon, Nov 19, 2012 at 3:48 PM, Stas Kolenikov wrote:

> My understanding is that -lognfit- works with the exact data, not the
> coarsened data that you have. As you obviously see, imputing the
> median or the mean or any specific number is plain wrong (although I
> have to admit to having done just that in the -polychoric- module...
> which was more than 10 years ago when I was stupid enough to even
> start the whole project :) ). So what I would do is:
> 1. estimate the parameters of the lognormal model via -intreg-, using
> logs of income as the cutoffs between categories. It will give you the
> mean and the variance of logs (or conditional mean if you put
> demographic covariates into your regression)
> 2. figure out the conditional distribution (truncated normal for logs
> within a given bin)
> 3. simulate from that conditional distribution, create a new variable
> 4. repeat a bunch of times, creating say 20 or 50 plausible income variables
> 5. declare this to be an -mi set wide- data set and analyze the data
> as multiply imputed
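Steps 2 to 4 above can be sketched outside Stata as follows. This is only an illustration in Python, assuming the log-scale mean `mu` and standard deviation `sigma` have already been estimated (for example by -intreg-); the function names and the numbers in the usage note are hypothetical, not taken from the thread. Each draw uses the inverse-CDF method for a normal distribution truncated to the bin on the log scale.

```python
import math
import random
from statistics import NormalDist


def draw_income_in_bin(lower, upper, mu, sigma, rng):
    """Draw one income from a lognormal(mu, sigma) restricted to (lower, upper].

    lower and upper are in income units; use lower=0 for the bottom bin
    and upper=math.inf for the open-ended top bin.
    """
    z = NormalDist()  # standard normal
    # CDF values of the bin edges on the log scale
    p_lo = 0.0 if lower <= 0 else z.cdf((math.log(lower) - mu) / sigma)
    p_hi = 1.0 if math.isinf(upper) else z.cdf((math.log(upper) - mu) / sigma)
    # Inverse-CDF draw: uniform between the two CDF values, mapped back
    u = p_lo + (p_hi - p_lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf is undefined at 0 and 1
    return math.exp(mu + sigma * z.inv_cdf(u))


def impute_wide(bins, mu, sigma, m=20, seed=12345):
    """Return m lists of plausible incomes, one value per respondent.

    bins holds each respondent's (lower, upper) income bounds.
    """
    rng = random.Random(seed)
    return [[draw_income_in_bin(lo, hi, mu, sigma, rng) for lo, hi in bins]
            for _ in range(m)]
```

For instance, `impute_wide([(0, 100), (200, 300), (1000, math.inf)], 5.5, 0.7, m=20)` gives 20 plausible income vectors for three respondents. If covariates enter the regression, `mu` would differ by respondent, which this sketch does not handle.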
> To check the sensitivity at the right tail, you might want to modify
> the simulated value in 3 for the upper category to be a Pareto
> distribution that connects smoothly to the lognormal distribution. I
> also recall that Stephen Jenkins, the author of lognfit, also worked
> on other parametric income distribution specifications -- see e.g.
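One possible reading of a Pareto tail that "connects smoothly" to the lognormal is to pick the Pareto shape so that the density is continuous at the lower bound `c` of the top category, keeping the lognormal's share of people above `c`. That matching rule is an assumption made here for illustration, not something stated in the thread, and the function names are hypothetical. With z = (ln c - mu)/sigma, continuity of the density gives alpha = phi(z) / (sigma * (1 - Phi(z))).

```python
import math
import random
from statistics import NormalDist


def matched_pareto_shape(c, mu, sigma):
    """Pareto shape that makes the density continuous at threshold c."""
    nd = NormalDist()
    z = (math.log(c) - mu) / sigma
    return nd.pdf(z) / (sigma * (1.0 - nd.cdf(z)))


def draw_pareto_top(c, alpha, rng):
    """Draw one income above c from a Pareto(scale=c, shape=alpha)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return c * u ** (-1.0 / alpha)
```

Replacing the top-category draws with `draw_pareto_top` and comparing the results against the lognormal draws is one way to see how sensitive the analysis is to the tail. Note that a Pareto distribution has a finite mean only when alpha exceeds 1.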
> On Mon, Nov 19, 2012 at 9:34 AM, Tinna Asgeirsdottir wrote:
>> Thanks for the helpful reply Stas,
>> I don't think the recommendation referred to interval regression or
>> multiple imputation. I think it referred to imputing the probable
>> average or median of each category, but without the obviously false
>> assumption of a uniform distribution within each category that the
>> midpoint would imply.
>> If I do a ML fit of a lognormal distribution using the lognfit command
>> I can get the parameters of the distribution. I guess I should be able
>> to work this out by hand from there, but figured that there might be
>> an easier way.
>> Best
>> Tinna
>> 2012/11/17 Stas Kolenikov:
>>> Lognormal distribution will likely underestimate how heavy the top
>>> tail is (although if you are interested in Iceland, you may have a
>>> very egalitarian income distribution, so the shape of that tail may
>>> not be that terrible). Lognormal distribution is a very cute model to
>>> play with and very dangerous in real work. In my work on Russian data,
>>> changing the assumptions about the top tail moved our Gini index from
>>> 0.48 to 0.60... and that's a little bit of a difference, let's put it
>>> this way.
>>> The recommendation you have heard probably concerns -intreg-, which
>>> you can read the help on.
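For grouped income, the likelihood that an interval regression such as -intreg- maximizes can be written down directly: each respondent contributes the probability that log income falls inside their bin. The sketch below is a Python illustration of that likelihood for a lognormal model without covariates; the function names and the crude grid search are hypothetical stand-ins for Stata's optimizer, not a description of how -intreg- is implemented.

```python
import math
from statistics import NormalDist


def interval_loglik(mu, sigma, bins, counts):
    """Log-likelihood of grouped incomes under a lognormal(mu, sigma).

    bins   : list of (lower, upper) income bounds; upper may be math.inf
    counts : number of respondents observed in each bin
    """
    z = NormalDist()
    total = 0.0
    for (lower, upper), n in zip(bins, counts):
        p_lo = 0.0 if lower <= 0 else z.cdf((math.log(lower) - mu) / sigma)
        p_hi = 1.0 if math.isinf(upper) else z.cdf((math.log(upper) - mu) / sigma)
        # Guard against log(0) when a bin has essentially no mass
        total += n * math.log(max(p_hi - p_lo, 1e-300))
    return total


def grid_fit(bins, counts, mus, sigmas):
    """Pick the (mu, sigma) pair on a grid with the highest log-likelihood."""
    return max(((m, s) for m in mus for s in sigmas),
               key=lambda ms: interval_loglik(ms[0], ms[1], bins, counts))
```

In Stata itself the same data would be passed to -intreg- as two variables holding the lower and upper bounds on the log scale, with the upper bound missing for the open-ended top group.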
>>> Imputing the mean income over a group will lead to a multitude of
>>> problems due to artificially compressed variability and values that
>>> are simply too low for the top group. If you desperately need to
>>> impute, you would want to go with multiple imputations (-help mi-),
>>> although you would want to read the MI manual and a paper or two
>>> if you are not familiar with the
>>> technique. What I have done in one of my projects recently was to
>>> generate the plausible values of the variable of interest a bunch of
>>> times (say, 50... the original suggestion to use 5 imputations dates
>>> back to the late 1970s... and your smartphone now has more computing power
>>> than a Cray supercomputer had then) and make Stata believe they were
>>> imputed in Stata mi wide format.
>>> --
>>> -- Stas Kolenikov, PhD, PStat (SSC)  ::
>>> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
>>> srbi dot com
>>> -- Opinions stated in this email are mine only, and do not reflect the
>>> position of my employer
>>> On Sat, Nov 17, 2012 at 6:12 AM, Tinna Asgeirsdottir wrote:
>>>> Dear Stata users,
>>>> In my data I have income in 13 groups. The top group is open ended. I
>>>> am trying to impute sensible values and would like to use this as a
>>>> continuous variable. I am especially concerned about the top category.
>>>> It has been suggested to me that I should use Stata's ML command
>>>> instead of using each category's midpoint. I am having trouble finding
>>>> what I need on the internet. Thus I wonder if anyone can tell me how
>>>> to fit a lognormal distribution to the variable and subsequently infer
>>>> the average income in the top bracket. If you know how to do this in
>>>> general for all the categories, that is great as well, as the
>>>> distribution within the other brackets is surely not uniform. However,
>>>> I think finding a good solution for my top category is the most
>>>> important thing.
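For reference, the average income within a bracket under a fitted lognormal has a closed form. With z_a = (ln a - mu)/sigma and z_b = (ln b - mu)/sigma, the mean over (a, b] is exp(mu + sigma^2/2) * [Phi(z_b - sigma) - Phi(z_a - sigma)] / [Phi(z_b) - Phi(z_a)], and for the open top bracket above c it reduces to exp(mu + sigma^2/2) * Phi(sigma - z_c) / (1 - Phi(z_c)). The Python sketch below evaluates that formula; the function name and the example values are hypothetical, and `mu` and `sigma` are assumed to come from an earlier fit. As noted elsewhere in the thread, plugging in a single mean per bracket compresses variability, so this is mainly useful as a check or summary.

```python
import math
from statistics import NormalDist


def bracket_mean(lower, upper, mu, sigma):
    """Mean income within (lower, upper] under a lognormal(mu, sigma).

    Use lower=0 for the bottom bracket and upper=math.inf for the top one.
    """
    nd = NormalDist()

    def cdf_shifted(x, shift):
        # Phi((ln x - mu)/sigma - shift), with the limits handled explicitly
        if x <= 0:
            return 0.0
        if math.isinf(x):
            return 1.0
        return nd.cdf((math.log(x) - mu) / sigma - shift)

    numerator = cdf_shifted(upper, sigma) - cdf_shifted(lower, sigma)
    denominator = cdf_shifted(upper, 0.0) - cdf_shifted(lower, 0.0)
    return math.exp(mu + 0.5 * sigma ** 2) * numerator / denominator
```

For example, `bracket_mean(1000, math.inf, 5.5, 0.7)` gives the implied mean of an open top bracket starting at 1000 for those illustrative parameter values.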
