# Re: st: Imputing Mean of Top-Coded Income Category

I waited to see if you got a response to your query to the list, but I now have some time to respond.

About ten years ago I tried a simple spreadsheet program that used the Pareto formula to compute top-category means for yearly income distributions from the CPS from 1947 until about 1990. The upshot was that the Pareto estimator didn't seem to work well. I recall that I got plausible means around \$13,000 for a top category (incomes above \$10,000 until about 1955) for several years after 1950, then all of a sudden the estimator gave me a mean of \$17,000 for no apparent reason. After that year, the estimates again reverted to plausible numbers. In the end Ron Helms and I got top-category data from the IRS income tax tables and used them to compute top-category means. There was a systematic bias, but it was probably fairly similar for all years. It was a lot of trouble to collect all that IRS data, however.

Dave Jacobs

I have top-coded, continuous CPS data on earnings. I want to impute the
mean income of this group of top-coded earners, making the assumption that
the upper tail follows a Pareto distribution. I'm wondering if anyone has
suggestions about how to do this in Stata (or even just generally how to
do it).

Some notes:

1.
The standard method of doing this typically involves imputing the mean of
top-coded earners given categories of earnings, using the following
formula:

Mean income for top-coded category = X * V / (V - 1)
where:
X = top code (lower limit of the open-ended category)
V = (c - d) / (b - a)
where:
a = log of lower limit of interval preceding top-coded/open-ended category
b = log of lower limit of top-coded/open-ended category
c = log of the sum of the frequencies in the top-coded category and the
category preceding it
d = log of the frequencies in the top-coded category

The problem with using this method given continuous earnings data (like
the CPS) is that the result is highly dependent on the choice one makes
about what interval to define as the "preceding category."

2.
Another method would use the mode and median to solve the equation:

median = mode * 2^(1/V)

(using the observed median and mode of the sample to calculate V and solve
the equation above)

The problem here is that when the median is less than the mode, it gives a
value for V less than 1, such that multiplying the top code gives a mean
for the top-coded income that is LESS than the top code, much to my
consternation.
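
For anyone who wants to check the arithmetic in the two notes above, here is a
minimal Python sketch of both calculations. It is only an illustration under
the Pareto assumption, not a Stata solution: the function names and all of the
numbers in the example are hypothetical and do not come from CPS data.

```python
import math


def pareto_shape_from_categories(lower_prev, lower_top, n_prev, n_top):
    """Estimate V from the top category and the one preceding it (note 1).

    a = log of the lower limit of the preceding category
    b = log of the lower limit of the top-coded category
    c = log of the combined frequency of both categories
    d = log of the frequency in the top-coded category
    """
    a = math.log(lower_prev)
    b = math.log(lower_top)
    c = math.log(n_prev + n_top)
    d = math.log(n_top)
    return (c - d) / (b - a)


def pareto_top_mean(topcode, v):
    """Mean of the top-coded category, X * V / (V - 1).

    The Pareto mean is finite only for V > 1, so other values are rejected.
    """
    if v <= 1:
        raise ValueError("Pareto mean is finite only when V > 1")
    return topcode * v / (v - 1)


def pareto_shape_from_median_mode(median, mode):
    """Solve median = mode * 2**(1/V) for V (note 2)."""
    if median <= 0 or mode <= 0 or median == mode:
        raise ValueError("need positive, unequal median and mode")
    return math.log(2) / math.log(median / mode)


# Hypothetical example for note 1: top code 100,000, preceding interval
# starting at 50,000, with 300 and 100 cases respectively.
v_categories = pareto_shape_from_categories(50_000, 100_000, 300, 100)
top_mean = pareto_top_mean(100_000, v_categories)

# Hypothetical example for note 2 with the median below the mode.
v_median_mode = pareto_shape_from_median_mode(30_000, 40_000)
multiplier = v_median_mode / (v_median_mode - 1)
```

With these made-up inputs the first method gives V of about 2 and a top-category
mean of about 200,000. In the second example the median is below the mode, so
the solved V is negative and the multiplier V / (V - 1) falls between 0 and 1,
which is why the implied mean lands below the top code.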

Any help on this would be much appreciated!

