Re: st: Chi-square test for Categorical Data Analysis

From   David Radwin
To   statalist
Subject   Re: st: Chi-square test for Categorical Data Analysis
Date   Tue, 18 Sep 2007 16:37:27 -0700

[This email did not appear to go through the first time, and I apologize if this is the second time you are seeing it.]


Another possibility is to approximate a continuous measure of income by using the midpoint of each range as the actual value. So, for example, the first value would be $12,500, the second would be $37,500, the third would be $75,000, etc.

To estimate a value for the uppermost, open-ended category ($500,001 or more), which has no midpoint, you can fit a Pareto curve based on Pareto's Law of Income Distribution (see p. 874 of the reference below). Of course, both steps require making assumptions about the distribution of income in your sample, so proceed with caution.
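For readers who want to see where the estimate comes from, here is a sketch of the calculation as I understand it. It assumes incomes in the upper tail follow a Pareto distribution, so that the share of respondents with income of at least x is proportional to x^(-v). The symbols x_a, x_b, P_a, and P_b are my own labels, and the shares in the worked example (10% and 4%) are made up purely for illustration.

```latex
% x_a = lower limit of the category just below the open-ended one
% x_b = lower limit of the open-ended category
% P_a = share of respondents with income >= x_a (last two categories combined)
% P_b = share of respondents with income >= x_b (open-ended category only)
\[
  P(X \ge x) \propto x^{-v}
  \quad\Longrightarrow\quad
  v = \frac{\log_{10} P_a - \log_{10} P_b}{\log_{10} x_b - \log_{10} x_a}
\]
% Mean of the open-ended category (defined only when v > 1):
\[
  \hat{\mu}_{\text{open}} = x_b \, \frac{v}{v - 1}
\]
% Illustration with made-up shares: x_a = 150000, x_b = 200000,
% P_a = 0.10, P_b = 0.04
\[
  v = \frac{\log_{10}(0.10 / 0.04)}{\log_{10}(200000 / 150000)}
    \approx \frac{0.3979}{0.1249} \approx 3.19,
  \qquad
  \hat{\mu}_{\text{open}} \approx 200000 \times \frac{3.19}{2.19} \approx 291{,}500
\]
```

If v comes out at 1 or below, the Pareto mean is not finite and this estimate should not be used.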

Then you can calculate and report estimated group means for A and B and the difference between them, and you can also do the usual difference-of-means hypothesis testing.
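As a minimal sketch of that last step, assuming the midpoint variable is named income_midpt (as in the program below) and that a hypothetical 0/1 variable called group distinguishes Group A from Group B:

```stata
* group is a hypothetical 0/1 indicator for Group A vs. Group B
* Group means, standard deviations, and counts of the midpoint-coded income
tabstat income_midpt, by(group) statistics(mean sd n)

* Two-sample t test of the difference in means, not assuming equal variances
ttest income_midpt, by(group) unequal
```

Because the midpoint values are approximations, the resulting standard errors are likely to understate the true uncertainty somewhat.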

Below is a short program I wrote for a categorical income measure with 11 categories, where the 10th category is $150,000-$199,999 and the 11th category is $200,000 or more. I did my best to accurately implement the formula, but I am still a novice Stata programmer and not 100% sure it is correct. (I am certain it could be written more elegantly and parsimoniously, but that will have to wait for another day.)


* Pareto curve estimate for the open-ended top category (Parker & Fenwick 1983)
count if !mi(income)
local valid_n=r(N)
count if inlist(income,10,11)
local c_numerator=r(N)
count if income==11
local d_numerator=r(N)
scalar income_a=log10(150000) /* log of lower limit of interval before open-ended category */
scalar income_b=log10(200000) /* log of lower limit of open-ended category */
scalar income_c=log10(100*(`c_numerator'/`valid_n')) /* log of pct. in open-ended and preceding intervals */
scalar income_d=log10(100*(`d_numerator'/`valid_n')) /* log of pct. in open-ended interval only */
scalar income_v=(scalar(income_c)-scalar(income_d))/(scalar(income_b)-scalar(income_a)) /* Pareto shape parameter */
* The estimate below is only defined when income_v > 1
scalar income_p=round(200000*scalar(income_v)/(scalar(income_v)-1))
recode income (1=5000) (2=15000) (3=27500) (4=42500) (5=57500) (6=72500) (7=90000) (8=112500) (9=137500) (10=175000) (11=.), gen(income_midpt)
replace income_midpt=scalar(income_p) if income==11

Parker, R. N., & Fenwick, R. (1983). The Pareto curve and its utility for open-ended income distributions in survey research. Social Forces, 61(3), 872-885.

At 4:40 PM -0400 9/18/07, Hugh Colaco wrote:

Participants were asked about their income level and had to choose one
from below. Assume the income ranges are:-

$0 - $25,000
$25,001 - $50,000
$50,001 - $100,000
$100,001 - $150,000
$150,001 - $200,000
$200,001 - $500,000
$500,001 or more

Rather than report so many income ranges, I would now like to report
just two, based on the median of all 150 participants. So, I will have
4 groups in all (i.e. Group A below median income, Group A equal to or
above the median income, Group B below median income, Group B equal to
or above the median income).

David Radwin, Principal Analyst
Office of Student Research, University of California, Berkeley