Mirko <[email protected]> writes,
> I have prepared the following example.
> Given this data:
>
>        id  region     educ       age   income
>        1        2        1        25        5
>        1        2        1        26        5
>        2        2        1        29        8
>        2        2        1        30        8
>        3        2        1        32        11
>        3        2        1        33        11
>        4        1        1        40        5
>        4        1        1        41        5
>        5        1        2        37        8
>        5        1        2        38        8
>        6        1        2        42        9
>        6        1        2        43        9
>
> I need to create the variable yref, that is the average income of a
> reference group defined by the same level of education, the same
> region and the same cohort (i.e., people who are 5 years younger and 5
> years older).
I think I have a solution.  Mirko gave answers for this small dataset
and, in two places, my solution does not give the same results as Mirko
does, but I'm hoping Mirko did the pencil-and-paper calculation wrong.
If I made a mistake, finding it should not be too difficult.
My approach is to create ref.dta containing region, educ, age, and yref.
I said to myself, if I had that dataset, then getting what Mirko wants
would be easy:
        . use master                 // Mirko's original dataset
        . sort region educ age
        . merge region educ age using ref, nokeep
So now let's think about creating ref.dta.
Every observation in Mirko's original dataset appears in 11 cohort
groups, age-5, age-4, ..., age, age+1, age+2, ..., age+5.
So let's start with the original dataset and make a dataset eleven times
bigger (each original observation appears in 11 different cohorts).  Then we
can collapse the big dataset into averages.
Here is my solution to the make-the-giant-dataset problem:
        . use master, clear
        . rename age cohort
        . drop id
        . drop in f/l
        . save sofar, emptyok replace
        . forvalues i= -5(1)5 {
        .         use master, clear
        .         drop id
        .         replace age = age + `i'
        .         rename age cohort
        .         append using sofar
        .         save sofar, replace
        . }
I'm sure sofar.dta could have been made more efficiently.
I suspect someone is going to point out that we could -expand- the data,
then -sort-, and then do some very clever -replace-.  Above is what occurred
to me, however.
Now that I have sofar.dta, I can make ref.dta from it:
        . sort region educ cohort
        . by region educ cohort: gen yref = sum(income)/_n
        . by region educ cohort: keep if _n==_N
        . drop income
        . rename cohort age
        . sort region educ age
        . save ref, replace
and now I'm read to do the last step:
        . use master
        . sort region educ age
        . merge region educ age using ref, nokeep
I have a do-file below my signature.  Here's the output it produced:
        . list
             +----------------------------------------------+
             | id   region   educ   age   income       yref |
             |----------------------------------------------|
          1. |  1        2      1    25        5        6.5 |
          2. |  1        2      1    26        5        6.5 |
          3. |  2        2      1    29        8          8 |
          4. |  2        2      1    30        8          8 |
          5. |  3        2      1    32       11        9.5 |
             |----------------------------------------------|
          6. |  3        2      1    33       11        9.5 |
          7. |  4        1      1    40        5          5 |
          8. |  4        1      1    41        5          5 |
          9. |  5        1      2    37        8   8.333333 |   <--
         10. |  5        1      2    38        8        8.5 |
             |----------------------------------------------|
         11. |  6        1      2    42        9        8.5 |
         12. |  6        1      2    43        9   8.666667 |   <--
             +----------------------------------------------+
My program differs from Mirkos expectation in two observations.
-- Bill
[email protected]
------------------------------------------------------------------------------
clear
input id region educ age income
1       2       1       25      5
1       2       1       26      5
2       2       1       29      8
2       2       1       30      8
3       2       1       32      11
3       2       1       33      11
4       1       1       40      5
4       1       1       41      5
5       1       2       37      8
5       1       2       38      8
6       1       2       42      9
6       1       2       43      9
end
sort region educ age
save master, replace
use master, clear
rename age cohort
drop id
drop in f/l
describe
save sofar, emptyok replace
forvalues i= -5(1)5 {
        use master, clear
        drop id
        replace age = age + `i'
        rename age cohort
        append using sofar
        save sofar, replace
}
sort region educ cohort
by region educ cohort: gen yref = sum(income)/_n
by region educ cohort: keep if _n==_N
drop income
list
rename cohort age
sort region educ age
save ref, replace
use master, clear
merge region educ age using ref, nokeep
sort id educ age
list
------------------------------------------------------------------------------
<end>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/