
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Grouping income variables- RECODE COMMAND


From   Nick Cox <njcoxstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Grouping income variables- RECODE COMMAND
Date   Tue, 4 Feb 2014 11:52:58 +0000

Small suggestions about interacting with the list:

1. It's best to think that you are addressing the entire list,
especially when you change the question.

2. It should be evident on reflection that minimal references such as
Hout (2004) are not informative. Statalist members should not be
presumed to be familiar with the literature you happen to know well.
This point is made in the FAQ. (Similarly, I and no doubt some others
have no idea what "ESS" means, although I suspect that's not important
here.)

In terms of your specific question:

Note that instead of

gen missinc=0
replace missinc=1 if missing(hincome)

you can go

gen missinc = missing(hincome)

as that is what -missing()- does.

That said, I can't help much on your main question. Your dummy
variable (I recommend the terminology "indicator variable" instead; I
have heard too many stories in which "dummy" was regarded as
offensive) is collinear with something else. So, why not look at the
correlation matrix or a scatter plot matrix to identify that something
else?
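
One thing that may be worth checking, shown here as a minimal sketch on made-up data and not your dataset: Stata excludes from estimation any observation with a missing value on any variable in the model, and -lhincome- is missing exactly when -missinc- is 1. Within the estimation sample the indicator would then be constant at 0, and so collinear with the constant. The names y and lhincome0 below are hypothetical. Replacing missing log income by 0 is shown only as one common device; that is an assumption on my part, not a recommendation.

```stata
* Toy data only: y and lhincome0 are hypothetical names
clear
set seed 2014
set obs 100
gen hincome = exp(rnormal(10, 1))
replace hincome = . in 1/20
gen lhincome = log(hincome)
gen missinc = missing(hincome)
gen y = rnormal()

* Observations with missing lhincome are excluded from estimation,
* so missinc is 0 throughout the estimation sample and is omitted
regress y missinc lhincome
tabulate missinc if e(sample)

* One common device (an assumption here): set missing log income to 0
* so that those observations stay in and missinc can be estimated
gen lhincome0 = cond(missing(lhincome), 0, lhincome)
regress y missinc lhincome0
```

If -tabulate- after your own model shows only zeros for -missinc-, that would explain the note about collinearity.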

Given your crude data on incomes, assigning midpoints to categories
is, as you are aware, difficult if not dangerous. I'd recommend some
sensitivity analysis, including some check that outliers are not being
created. Using a log transformation should, however, be some help
here.
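
A sensitivity check of that kind might be sketched as follows, on toy data. The alternative values for the top category (150000 and 200000) and the names hinc_* and lhinc_* are arbitrary choices of mine for illustration; 175200 is the value you used.

```stata
* Toy data: one observation per code of hinctnt
clear
set obs 15
gen hinctnt = _n
replace hinctnt = 77 in 13
replace hinctnt = 88 in 14
replace hinctnt = 99 in 15

* Redo the midpoint recode under alternative top-category values
foreach top of numlist 150000 175200 200000 {
    recode hinctnt (1=900) (2=2700) (3=4800) (4=9000) (5=15000) ///
        (6=21000) (7=27000) (8=33000) (9=48000) (10=75000) ///
        (11=105000) (12=`top') (77 88 99 = .), gen(hinc_`top')
    gen lhinc_`top' = log(hinc_`top')
}

* Compare the versions and look for outliers on the log scale
summarize hinc_* lhinc_*
```

On the log scale the top category moves only from roughly 11.9 to 12.2 across these values, which is the sense in which logging should help. In practice you would refit the model with each version and compare the coefficients.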
Nick
njcoxstata@gmail.com


On 4 February 2014 11:29, Antonio Rodriguez Andres
<Antonio.Andres@emu.edu.tr> wrote:
> Nick
>
> In the ESS in 2006, total household income is grouped into 12 categories associated with different weekly, monthly, and annual ranges: for instance, J (less than 1,800 euros), R (1,800 to under 3,600), etc.
> tab hinctnt
>
> Household's |
>   total net |
> income, all |
>     sources |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           J |      1,348        4.07        4.07
>           R |      1,353        4.08        8.15
>           C |      1,968        5.94       14.10
>           M |      3,067        9.26       23.35
>           F |      3,000        9.06       32.41
>           S |      2,934        8.86       41.27
>           K |      2,733        8.25       49.52
>           P |      2,682        8.10       57.62
>           D |      4,432       13.38       71.00
>           H |      1,962        5.92       76.92
>           U |        608        1.84       78.76
>           N |        396        1.20       79.95
>     Refusal |      3,958       11.95       91.90
>  Don't know |      2,549        7.70       99.60
>   No answer |        134        0.40      100.00
> ------------+-----------------------------------
>       Total |     33,124      100.00
> tab hinctnt, nolabel
>
> Household's |
>   total net |
> income, all |
>     sources |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |      1,348        4.07        4.07
>           2 |      1,353        4.08        8.15
>           3 |      1,968        5.94       14.10
>           4 |      3,067        9.26       23.35
>           5 |      3,000        9.06       32.41
>           6 |      2,934        8.86       41.27
>           7 |      2,733        8.25       49.52
>           8 |      2,682        8.10       57.62
>           9 |      4,432       13.38       71.00
>          10 |      1,962        5.92       76.92
>          11 |        608        1.84       78.76
>          12 |        396        1.20       79.95
>          77 |      3,958       11.95       91.90
>          88 |      2,549        7.70       99.60
>          99 |        134        0.40      100.00
> ------------+-----------------------------------
>       Total |     33,124      100.00
> First of all, I recode the household income variable using midpoints. The problem is defining a midpoint for the open-ended top category. For that purpose, I follow Hout (2004).
> *Create income midpoints
> recode hinctnt (1=900) (2=2700) (3=4800) (4=9000) (5=15000) (6=21000) (7=27000) (8= 33000) (9=48000) (10=75000) (11=105000) (12= 175200) , gen(hincome)
> replace hincome=. if hinctnt==77 | hinctnt==88 |  hinctnt==99  // recode hinctnt 77, 88, 99 (Refusal, Don't know, No answer) as missing values
> gen lhincome=log(hincome)
> I also need to include in my regression a dummy variable for the missing values of income. I type in Stata:
> gen missinc=0
> replace missinc=1 if missing(hincome)
>
> When estimating the following model, the dummy variable for missing values of income is dropped, but it has to be in my model. Is there anything wrong with the Stata code?
> . xtmixed dprt age age2 gender married separated divorced widowed eduyrs ichldhm interaction missinc lhincome ihealth iuemp5yr iuemp12m rgdp06 [pw=dweight] || cntry: gender, mle
> note: missinc omitted because of collinearity
> (29900 missing values generated)
>
> Obtaining starting values by EM:
>
> Mixed-effects regression                        Number of obs      =      7603
> Group variable: cntry                           Number of groups   =        20
>
>                                                 Obs per group: min =       156
>                                                                avg =     380.1
>                                                                max =       698
>
>
>                                                 Wald chi2(15)      =   1601.80
> Log pseudolikelihood = -20437.471               Prob > chi2        =    0.0000
>
>                                  (Std. Err. adjusted for 20 clusters in cntry)
> ------------------------------------------------------------------------------
>              |               Robust
>         dprt |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>          age |    .076018    .026928     2.82   0.005     .0232401    .1287959
>         age2 |  -.0009855   .0002563    -3.84   0.000    -.0014879   -.0004831
>       gender |  -.3724799   .1183211    -3.15   0.002    -.6043849   -.1405749
>      married |  -.6450081   .1533621    -4.21   0.000    -.9455923   -.3444239
>    separated |   .5868276   .2951732     1.99   0.047     .0082988    1.165356
>     divorced |   .1042908   .1962848     0.53   0.595    -.2804203    .4890018
>      widowed |   1.208098   .2994334     4.03   0.000     .6212191    1.794976
>       eduyrs |  -.0146007   .0143999    -1.01   0.311     -.042824    .0136225
>      ichldhm |   .1518147   .1852086     0.82   0.412    -.2111874    .5148168
>  interaction |  -.3089155   .2124631    -1.45   0.146    -.7253355    .1075044
>      missinc |          0  (omitted)
>     lhincome |  -.6049375   .0732486    -8.26   0.000    -.7485022   -.4613728
>      ihealth |  -1.672027   .0842643   -19.84   0.000    -1.837182   -1.506872
>     iuemp5yr |   .2910945   .1074106     2.71   0.007     .0805737    .5016153
>     iuemp12m |   .3892144   .1335323     2.91   0.004     .1274958     .650933
>       rgdp06 |   6.49e-06   .0000108     0.60   0.547    -.0000146    .0000276
>        _cons |   17.25528   .9936399    17.37   0.000     15.30778    19.20278
> ----------------------------------------------------
>
> Antonio
>
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
> Sent: Sunday, February 02, 2014 10:59 AM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Grouping income variables- RECODE COMMAND
>
> Your -recode- mapped 1,...,11 to 1,...,11, which makes precisely no progress with the main problem. As I understand what you want, you need something more like
>
> recode hinctnt 1=40 2=70 3=130 ...
>
> Nick
> njcoxstata@gmail.com
>
> On 1 February 2014 19:43, Antonio Rodriguez Andres <Antonio.Andres@emu.edu.tr> wrote:
>> Nick
>>
>> You are right. But if I type the following code
>>
>> recode hinctnt (1=1 "1st interval") (2=2 "2nd interval") (3=3 "3rd
>> interval") (4=4 "4th interval") (5=5 "5th interval") (6=6 "6th
>> interval") (7=7 "7th interval") (8=8 "8th interval") (9=9 "9th
>> interval") (10=10 "10th interval") (11=11 "11th interval") (12=12
>> "12th interval") (.=.m "Missing") (77=.r "Refusal") (88=.d "Don't
>> Know") (99=.s "Not answer"), gen (ihinctnt)
>>
>> I generate a new variable ihinctnt. Then I tabulate it and compute
>> summary statistics. But these are not incomes. I should specify the
>> upper and lower limit for each interval. How can I do that?
>>
>>
>> tab ihinctnt, missing
>>
>>     RECODE of |
>>       hinctnt |
>>  (Household's |
>>     total net |
>>   income, all |
>>      sources) |      Freq.     Percent        Cum.
>> --------------+-----------------------------------
>>  1st interval |      1,663        3.87        3.87
>>  2nd interval |      1,561        3.63        7.50
>>  3rd interval |      2,262        5.26       12.76
>>  4th interval |      3,676        8.55       21.31
>>  5th interval |      3,545        8.24       29.55
>>  6th interval |      3,293        7.66       37.21
>>  7th interval |      3,010        7.00       44.21
>>  8th interval |      2,871        6.68       50.89
>>  9th interval |      4,707       10.95       61.83
>> 10th interval |      2,058        4.79       66.62
>> 11th interval |        644        1.50       68.12
>> 12th interval |        428        1.00       69.11
>>    Don't Know |      3,540        8.23       77.34
>>       Missing |      5,037       11.71       89.06
>>       Refusal |      4,525       10.52       99.58
>>    Not answer |        180        0.42      100.00
>> --------------+-----------------------------------
>>         Total |     43,000      100.00
>>
>> . summ ihinctnt
>>
>>     Variable |       Obs        Mean    Std. Dev.       Min        Max
>> -------------+--------------------------------------------------------
>>     ihinctnt |     29718    6.156504     2.75604          1         12
>>
>> . summ ihinctnt,d
>>
>>  RECODE of hinctnt (Household's total net income, all sources)
>> -------------------------------------------------------------
>>       Percentiles      Smallest
>>  1%            1              1
>>  5%            1              1
>> 10%            2              1       Obs               29718
>> 25%            4              1       Sum of Wgt.       29718
>>
>> 50%            6                      Mean           6.156504
>>                         Largest       Std. Dev.       2.75604
>> 75%            9             12
>> 90%           10             12       Variance       7.595757
>> 95%           10             12       Skewness       -.080652
>> 99%           12             12       Kurtosis       2.098037
>>
>> -----Original Message-----
>> From: owner-statalist@hsphsun2.harvard.edu
>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
>> Sent: Saturday, February 01, 2014 9:17 PM
>> To: statalist@hsphsun2.harvard.edu
>> Subject: Re: st: Grouping income variables- RECODE COMMAND
>>
>> The numeric values of -hinctnt- don't exceed 99. They are evidently numeric codes, not incomes. So, why are you surprised at your results?
>> You have to -recode- your data before you can classify them. And that means the -recode- command.
>> Nick
>> njcoxstata@gmail.com
>>
>>
>> On 1 February 2014 18:14, Antonio Rodriguez Andres <Antonio.Andres@emu.edu.tr> wrote:
>>> Here you can see the basic description of the income variable
>>>
>>> tab hinctnt
>>>
>>> Household's |
>>>   total net |
>>> income, all |
>>>     sources |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>           J |      1,663        4.38        4.38
>>>           R |      1,561        4.11        8.49
>>>           C |      2,262        5.96       14.45
>>>           M |      3,676        9.68       24.13
>>>           F |      3,545        9.34       33.47
>>>           S |      3,293        8.67       42.15
>>>           K |      3,010        7.93       50.08
>>>           P |      2,871        7.56       57.64
>>>           D |      4,707       12.40       70.04
>>>           H |      2,058        5.42       75.46
>>>           U |        644        1.70       77.15
>>>           N |        428        1.13       78.28
>>>     Refusal |      4,525       11.92       90.20
>>>  Don't know |      3,540        9.32       99.53
>>>   No answer |        180        0.47      100.00
>>> ------------+-----------------------------------
>>>       Total |     37,963      100.00
>>>
>>>
>>> sum hinctnt, d
>>>
>>>           Household's total net income, all sources
>>> -------------------------------------------------------------
>>>       Percentiles      Smallest
>>>  1%            1              1
>>>  5%            2              1
>>> 10%            3              1       Obs               37963
>>> 25%            5              1       Sum of Wgt.       37963
>>>
>>> 50%            7                      Mean           22.67271
>>>                         Largest       Std. Dev.      31.57352
>>> 75%           10             99
>>> 90%           77             99       Variance       996.8872
>>> 95%           88             99       Skewness       1.378759
>>> 99%           88             99       Kurtosis       2.984444
>>>
>>> .
>>>
>>> -----Original Message-----
>>> From: owner-statalist@hsphsun2.harvard.edu
>>> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
>>> Sent: Saturday, February 01, 2014 7:52 PM
>>> To: statalist@hsphsun2.harvard.edu
>>> Subject: Re: st: Grouping income variables- RECODE COMMAND
>>>
>>> Your code shows you using the -recode()- function, which is quite different from the -recode- command. In Stata, functions and commands are different!
>>>
>>> I think that to comment helpfully we need to see more about your
>>> -hinctnt-, for example, the results of
>>>
>>> . su hinctnt, detail
>>>
>>> Your categories are not disjoint as (e.g.) the definitions [70, 120] and [120, 230] leave ambiguous what happens with 120. Alternatively, your notation here confuses the meaning of [ ] and ( ).
>>> Nick
>>> njcoxstata@gmail.com
>>>
>>>
>>> On 1 February 2014 17:29, Antonio Rodriguez Andres <Antonio.Andres@emu.edu.tr> wrote:
>>>> Dear Stata users,
>>>>
>>>> I have to group the income variable in different intervals. In the
>>>> original dataset, the household income variable is grouped into 12
>>>> categories
>>>>
>>>> J <40
>>>> R [40,70]
>>>> C [70, 120]
>>>> M [120, 230]
>>>> F [230, 350]
>>>> S
>>>> K
>>>> P
>>>> D
>>>> H
>>>> U [1730, 2310)
>>>> N > 2310
>>>>
>>>> I want to group the J and R categories (<70 euros) and create dummy
>>>> variables for all income groups. Below is the Stata output. I used the
>>>> recode command, but it does not work.
>>>>
>>>> gen hinc_gr=recode(hinctnt, 70, 120, 230, 350, 460, 580, 690, 1150, 1730, 2310)
>>>> (13282 missing values generated)
>>>>
>>>> . tab hinc_gr
>>>>
>>>>     hinc_gr |      Freq.     Percent        Cum.
>>>> ------------+-----------------------------------
>>>>          70 |     29,718      100.00      100.00
>>>> ------------+-----------------------------------
>>>>       Total |     29,718      100.00
>>>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

