
From |
wgould@stata.com (William Gould, StataCorp LP) |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Creating variables describing parents' characteristics with parents' ID |

Date |
Tue, 17 Mar 2009 09:01:04 -0500 |

Eunsu Ju <juxx0008@umn.edu> writes,

> I would like to generate a new variable which contains the information of
> parents, e.g. dad's education. My data looks like below.
>
> FID   PID  Var1  Edu  Var3  Dad's FID  Dad's PID  Dad's Edu
> 1001   10     1    3     5          .          .
> 1001   20     3    3     2          .          .
> 1001   30     4    2     3       1001         10          3
> 1001   31     8    5     5       1001         10          3
> 1002    1     2    4     3          .          .          .
> 1002   10     4    2     1       1002          1          4
> 1002   20     5    4     2          .          .          .
> 1002   30     9    3     2       1002         10          2
> 1002   31     6    1     4       1002         10          2
> 1002   32     4    2     5       1002         10          2
>
> Note: FID = Family ID; PID = Person ID; Edu = Educational attainment
> (Values are randomly assigned, but data structure is similar to the above.)
>
> What I want to do is to have the last (far left) column, which is not
> included in the dataset. (I want to do this kind of works for other
> variables, e.g. Var1 and Var3.)
> What is the best & simplist way to do this in stata?
>
> I think I can do this like the following.
> [...]

Eunsu Ju's plan is exactly right. He makes step 1,

> 1) Split the data set into two files so that one file contains Dad's FID
>    and Dad's PID, and the other has all others.

more difficult than it needs to be and later doesn't worry about
something that may not happen, but it's right overall.

Here's the solution, calling Eunsu Ju's original data master.dta.
First, however, I want to verify something,

        . use master
        . sort fid pid
        . by fid pid: assert _n==1
        . save master, replace

I'm sure it's true that (fid, pid) uniquely identify the observations,
but it never hurts to check. I also want master.dta sorted by fid and
pid. Now let's continue,

        . keep dads_fid dads_pid
        . drop if dads_fid>=. | dads_pid>=.
        . rename dads_fid fid
        . rename dads_pid pid
        . sort fid pid
        . by fid pid: keep if _n == 1
        . save step1, replace

At this point, I've got a dataset of the dads. My only contribution so
far is the next to the last line: I worried that a dad might be the dad
of more than one person.
We will only need one record per dad and, in fact, that will turn out to
be important, although I admit I didn't know that when I first wrote
this paragraph. I just followed the rule, "Don't carry duplicate
information; all that will happen is that it won't be a duplicate when
you think it is or it will otherwise bite you later."

Now let's get Dad's education merged into step1.dta:

        . merge fid pid using master
        . keep if _merge==1 | _merge==3
        . keep fid pid edu
        . rename fid dads_fid
        . rename pid dads_pid
        . rename edu dads_edu
        . sort dads_fid dads_pid
        . save step2, replace

At this point, we have a dataset of (dads_fid, dads_pid, dads_edu), and
we know that (dads_fid, dads_pid) uniquely identifies the observations.

The -keep if _merge==1 | _merge==3- above could be changed to
-keep if _merge==3-. That would be more computer efficient, because
then step2.dta would have fewer observations. I'm keeping all the dads,
for no good reason, and I'm wondering right now whether I'll go back and
have to edit this paragraph.

Okay, now we can fix master:

        . use master, clear
        . sort dads_fid dads_pid
        . merge dads_fid dads_pid using step2
        . keep if _merge==1 | _merge==3

We now have the desired result. Note that in master, the same dad might
appear more than once. The dads in step2.dta appear only once, however,
so that same dad will be spread across the observations in master.
Perfect.

Eunsu Ju also wrote,

> However, I think there might be easier way to do this.

Well, there is a different solution, but it's not easier. I want to
show you this because sometimes the resulting dataset is more convenient
to work with.

Let's start all over again. Do the following:

        . use master, clear
        . sort fid pid
        . gen fprecno = _n
        . save master, replace

Now perform solution 1 but this time, rather than adding dads_edu, add
dads_fprecno to master.dta.

Now let's get dads_edu:

        . use master, clear    // this is the one including dads_fprecno
        . sort fid pid
        . gen dads_edu = edu[dads_fprecno]

See how this works?
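One way to try the idea end to end is the sketch below, written as a
do-file on toy data. The values and the file names toy_master and
toy_lookup are made up for illustration. The lookup step is only one
possible way to add dads_fprecno, not necessarily the one intended
above. It uses the same old-style -merge- syntax as the rest of this
message.

```stata
* Toy data (hypothetical values): one family, a dad and two children.
clear
input fid pid edu dads_fid dads_pid
1001 10 3    .  .
1001 30 2 1001 10
1001 31 5 1001 10
end

* Record each person's observation number in (fid, pid) order.
sort fid pid
gen fprecno = _n
* toy_master is a hypothetical file name.
save toy_master, replace

* Build a lookup from (dads_fid, dads_pid) to dads_fprecno.
* It has one row per person, so the keys are unique.
keep fid pid fprecno
rename fid dads_fid
rename pid dads_pid
rename fprecno dads_fprecno
sort dads_fid dads_pid
* toy_lookup is a hypothetical file name.
save toy_lookup, replace

* Attach the dad's observation number to each child.
use toy_master, clear
sort dads_fid dads_pid
merge dads_fid dads_pid using toy_lookup
* Drop lookup rows for people who are nobody's dad (_merge==2).
keep if _merge==1 | _merge==3
drop _merge

* Restore (fid, pid) order so dads_fprecno points at the right rows.
sort fid pid
gen dads_edu = edu[dads_fprecno]
list fid pid edu dads_fprecno dads_edu
```

In this toy example both children should end up with dads_edu equal to
the dad's edu, and the dad's own dads_edu should be missing, because
subscripting with a missing value returns missing.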
With the data in (fid, pid) order, variable dads_fprecno records the
observation number of the dad, or it records missing. Thus,
edu[dads_fprecno] is the dad's education, or it is missing.

This is convenient because now it is so easy to pull anything from the
dad's record and attach it to the son's or daughter's record.

Be careful, however. For dads_fprecno to be valid, you must not add or
delete observations once you create the variable. The following would
produce incorrect results:

        . drop if age<20
        . gen dads_somethingelse = somethingelse[dads_fprecno]

because now dads_fprecno no longer corresponds to observation number.

-- Bill
wgould@stata.com

