Statalist



Re: st: Creating variables describing parents' characteristics with parents' ID


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Creating variables describing parents' characteristics with parents' ID
Date   Tue, 17 Mar 2009 09:01:04 -0500

Eunsu Ju <juxx0008@umn.edu> writes, 

> I would like to generate a new variable which contains the information of
> parents, e.g. dad's education.  My data looks like below.
> 
> FID	PID	Var1	Edu	Var3	Dad's FID  Dad's PID  Dad's Edu
> 1001	10	1	3	5	.	        .	
> 1001	20	3	3	2	.		.	
> 1001	30	4	2	3	1001	        10	 3
> 1001	31	8	5	5	1001		10	 3
> 1002	1	2	4	3	.		.	 .
> 1002	10	4	2	1	1002		1	 4
> 1002	20	5	4	2	.		.	 .
> 1002	30	9	3	2	1002		10	 2
> 1002	31	6	1	4	1002		10	 2
> 1002	32	4	2	5	1002		10	 2
> 
> Note: FID = Family ID; PID = Person ID; Edu = Educational attainment
> (Values are randomly assigned, but data structure is similar to the above.)
>  
> What I want to do is to have the last (far left) column, which is not 
> included in the dataset. (I want to do this kind of works for other 
> variables, e.g. Var1 and Var3.)
> What is the best & simplist way to do this in stata? 
> 
> I think I can do this like the following.
> [...]

Eunsu Ju's plan is exactly right.  He makes step 1, 

> 1) Split the data set into two files so that one file contains Dad's FID 
>    and Dad's PID, and the other has all others.

more difficult than it needs to be and later doesn't worry about 
something that might happen (a dad having more than one child in the 
data), but it's right overall.

Here's the solution, calling Eunsu Ju's original data master.dta.  First, 
however, I want to verify something, 

        . use master

        . sort fid pid
        . by fid pid:  assert _n==1

        . save master, replace
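
A shorter way to make the same check, sketched here on the assumption that
-isid- is available in your Stata, is

        . use master, clear
        . isid fid pid, sort
        . save master, replace

-isid- stops with an error if (fid, pid) does not uniquely identify the
observations, and its -sort- option leaves the data sorted by fid and pid.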

I'm sure it's true that (fid, pid) uniquely identify the observations, but 
it never hurts to check.  I also want master.dta sorted by fid and pid.
Now let's continue, 

        . keep dads_fid dads_pid 

        . drop if dads_fid>=. | dads_pid>=. 

        . rename dads_fid fid

        . rename dads_pid pid 

        . sort fid pid

        . by fid pid: keep if _n == 1

        . save step1, replace

At this point, I've got a dataset of the dads.  My only contribution 
so far is the next-to-last line:  I worried that a dad might be 
the dad of more than one person.  We will need only one record per dad
and, in fact, that will turn out to be important, although I admit I didn't 
know that when I first wrote this paragraph.  I just followed the rule, "Don't
carry duplicate information; all that will happen is that it won't be a
duplicate when you think it is, or it will otherwise bite you later."  Now
let's get Dad's education merged into step1.dta:

        . merge fid pid using master

        . keep if _merge==1 | _merge==3

        . keep fid pid edu

        . rename fid dads_fid

        . rename pid dads_pid

        . rename edu dads_edu

        . sort dads_fid dads_pid

        . save step2, replace

At this point, we have a dataset of (dads_fid, dads_pid, dads_edu), and 
we know that (dads_fid, dads_pid) uniquely identifies the observations.

The -keep if _merge==1 | _merge==3- above could be changed to 
-keep if _merge==3-.  That would be more computer efficient, because then 
step2.dta would have fewer observations.  I'm keeping all the dads, for no
good reason, and I'm wondering right now whether I'll go back and have 
to edit this paragraph.

Okay, now we can fix master:

        . use master, clear 

        . sort dads_fid dads_pid

        . merge dads_fid dads_pid using step2  

        . keep if _merge==1 | _merge==3

We now have the desired result.  Note that in master, the same dad might 
appear more than once.  The dads in step2.dta appear only once, however, so
that same dad will be spread across the observations in master.  Perfect.
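
In Stata 11 or later, where -merge- requires you to state the match type,
the last step might be sketched as follows.  This assumes step2.dta was
built as above, with one observation per (dads_fid, dads_pid):

        . use master, clear

        . merge m:1 dads_fid dads_pid using step2, keep(master match) nogenerate

        . sort fid pid

The m:1 says that many children may share one dad, and -merge- stops with an
error if step2.dta turns out to contain the same dad more than once.  
Children whose dad's ID is missing are kept, with dads_edu missing.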


Eunsu Ju also wrote, 

> However, I think there might be easier way to do this.

Well, there is a different solution, but it's not easier.  I want to show you
this because sometimes the resulting dataset is more convenient to work 
with.  Let's start all over again.  Do the following:

        . use master, clear 
        . sort fid pid 
        . gen fprecno = _n
        . save master, replace

Now perform solution 1 but this time, rather than adding dads_edu, add
dads_fprecno to master.dta.  
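
Spelled out, that might look like the following sketch.  It repeats 
solution 1 with fprecno in place of edu, and it keeps only the dads who are
themselves in master.dta (-keep if _merge==3-), since a dad who is absent
has no observation number:

        . use master, clear
        . keep dads_fid dads_pid
        . drop if dads_fid>=. | dads_pid>=.
        . rename dads_fid fid
        . rename dads_pid pid
        . sort fid pid
        . by fid pid: keep if _n == 1

        . merge fid pid using master
        . keep if _merge==3
        . keep fid pid fprecno
        . rename fid dads_fid
        . rename pid dads_pid
        . rename fprecno dads_fprecno
        . sort dads_fid dads_pid
        . save step2, replace

        . use master, clear
        . sort dads_fid dads_pid
        . merge dads_fid dads_pid using step2
        . keep if _merge==1 | _merge==3
        . drop _merge
        . sort fid pid
        . assert fprecno == _n
        . save master, replace

The final -sort fid pid- restores the order in which fprecno was created, 
and the -assert- confirms that fprecno once again equals the observation 
number.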

Now let's get dads_edu:

        . use master, clear    // this is the one including dads_fprecno

        . sort fid pid 

        . gen dads_edu = edu[dads_fprecno]

See how this works?  With the data in (fid, pid) order, variable dads_fprecno
records the observation number of the dad, or it records missing.  Thus, 
edu[dads_fprecno] is the dad's education, or it is missing.

This is convenient because now it is so easy to pull anything from the 
dad's record and attach it to the son's or daughter's record.
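
For example, assuming the other variables are named var1 and var3 as in
Eunsu Ju's listing (the lowercase names are an assumption), the same
one-liner picks them up:

        . gen dads_var1 = var1[dads_fprecno]
        . gen dads_var3 = var3[dads_fprecno]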

Be careful, however.  For dads_fprecno to be valid, you must not add or 
delete observations once you create the variable.  The following 
would produce incorrect results:

       . drop if age<20
       . gen dads_somethingelse = somethingelse[dads_fprecno]

because now dads_fprecno no longer corresponds to observation number.
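
One way to guard against this, offered as a sketch, is to verify that
fprecno still matches the observation number before subscripting:

        . sort fid pid
        . assert fprecno == _n
        . gen dads_somethingelse = somethingelse[dads_fprecno]

The -assert- fails whenever dropping or adding observations has shifted the
position of any that remain, so it stops you before the bad variable is 
created.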

-- Bill
wgould@stata.com