Can infile read a hierarchical dataset?
|
Title
|
|
Reading a hierarchal dataset with infile
|
|
Author
|
William Gould, StataCorp
|
|
Date
|
January 1996; minor revisions July 2011
|
In hierarchical datasets, the records do not all have the same format, nor
do they contain the same type of information. In computer jargon, the file
is organized so that the number and types of fields depend on the record
types. For example, you may have a dataset that has records for families
and records for persons within family.
To read these types of datasets, you need to create a dictionary for each
type of record. Then read in all the data by using each of the
dictionaries and keep only the appropriate data. Once you have these two
Stata datasets, you can merge them. This is a
difficult data management problem, but it is manageable if you break the
problem into several steps. We cover this topic in an extended example in
lecture 1 of NetCourse 151. Below is an excerpt
from that lecture.
Begin quotation from NetCourse 151
Now let’s consider reading a hierarchical dataset with Stata. This is
tricky.
Suppose that you have data on families and persons within families. The
data have the format family record followed by one or more person records:
family record
person record
...
person record
family record
etc...
Let’s assume that
|
family record: |
person record: |
| col. 1–5 |
family id |
col. 1–5 |
person id |
| col. 7 |
"1" |
col. 7 |
"2" |
| col. 9 |
dwelling type code |
col. 8–9 |
age |
| |
|
col. 11 |
sex code |
Note: Sample data are at the end of this lecture, after the exercises. You
should use those data to test your ability to read this kind of dataset.
The data probably contain more information than this, but this is enough for
illustration. If column 7 contains a 1, it is a family record, and
if it contains a 2, it is a person record. This is called the
record-type indicator.
I want to create a Stata .dta set containing
- family id
- dwelling code
- person id
- age
- sex code
My dataset will contain one observation per person, and the family
information will be repeated for persons in the same family.
First, I will create separate dictionaries for reading the family and person
information:
|
DICTIONARY:
family.dct |
dictionary using hier.raw {
long famid %5f "family id"
_column(7) byte rectype %1f "record type"
_column(9) byte dwell %1f "dwelling code"
} |
|
DICTIONARY:
person.dct |
dictionary using hier.raw {
long perid %5f "person id"
_column(7) byte rectype %1f "record type"
_column(8) byte age %2f "age (years)"
_column(11) byte sex %1f "sex code"
} |
I will then test each one of these dictionaries, to make sure they work:
. clear
. infile using family if rectype==1 in 1/100
. list in 1/5
. type hier.raw <- I'll press Break to stop this
. clear
. infile using person if rectype==2 in 1/100
. list in 1/5
. type hier.raw <- I'll press Break to stop this
What I’m doing above is reading a few data, typing the original, and
comparing them.
Satisfied that I have good dictionaries, I then create a do-file to read the
entire dataset. The basic plan of my do-file is to
- read the family records and save them in a data file,
- read the person records, and
- merge the person and family records.
This problem would be easy if the person records contained the family id to
which they belonged—step 1 would be an infile ... if
rectype==1 followed by a sort and save, and step 2 would
be an infile ... if rectype==2 followed by a sort, and
step 3 would be a merge. My whole do-file would be
STEP 1
clear
infile using family if rectype==1
sort famid
save tmph, replace
STEP 2
clear
infile using person if rectype==2
sort famid
STEP 3
merge m:1 famid using tmph
In my example, however, famid does not appear on the person records
(just as it does not in data released by the U.S. Bureau of the Census and
the U.S. Bureau of Labor Statistics).
This adds significantly to the complication. I’m going to read the family
records as I did above, but in addition, I’m going to manufacture my own
family ID variable, labeling the first family 1, the second 2, and so on:
MODIFIED STEP 1
clear
infile using family if rectype==1
gen long id = _n
sort id
save tmph, replace
Next, when I read the person data, I am going to read the family records as
if they were person records, too. The result will be that I have a
placeholder observation for the family record:
| |
perid |
rectype |
age |
sex |
| 1. |
junk
|
1 |
junk
|
junk
|
| 2. |
1 |
2 |
32 |
0 |
| 3. |
2 |
2 |
30 |
1 |
| 4. |
junk |
1 |
junk |
junk |
| 5. |
1 |
2 |
40 |
1 |
| etc. |
|
|
|
|
I will then regenerate my temporary family ID variable and discard the
misread family records.
To regenerate the id variable, I will first gen id = 1 if rectype
== 1,
| |
perid |
rectype |
age |
sex |
id |
| 1. |
junk |
1 |
junk |
junk |
1 |
| 2. |
1 |
2 |
32 |
0 |
. |
| 3. |
2 |
2 |
30 |
1 |
. |
| 4. |
junk |
1 |
junk |
junk |
1 |
| 5. |
1 |
2 |
40 |
1 |
. |
| etc. |
|
|
|
|
|
and then replace id = sum(id),
| |
perid |
rectype |
age |
sex |
id |
| 1. |
junk |
1 |
junk |
junk |
1 |
| 2. |
1 |
2 |
32 |
0 |
1 |
| 3. |
2 |
2 |
30 |
1 |
1 |
| 4. |
junk
|
1 |
junk
|
junk
|
2 |
| 5. |
1 |
2 |
40 |
1 |
2 |
| etc. |
|
|
|
|
|
and finally, drop if rectype == 1
| |
perid |
rectype |
age |
sex |
id |
| 1. |
1 |
2 |
32 |
0 |
1 |
| 2. |
2 |
2 |
30 |
1 |
1 |
| 3. |
1 |
2 |
40 |
1 |
2 |
| |
|
|
|
|
|
I will then be able to merge my family data with my person data. So, the
outline of my do-file is
MODIFIED STEP 1
clear
infile using family if rectype==1
gen long id = _n
sort id
save tmph, replace
MODIFIED STEP 2
clear
infile using person /* no if! */
gen long id = 1 if rectype==1
replace id = sum(id)
sort id
MODIFIED STEP 3
merge m:1 id using tmph
drop id
Some final details:
My outline assumes that rectype takes on the values 1 and 2 as the
documentation claims and that everything merges. I need to include some
checks. Thus my final do-file is
|
DO-FILE:
crhier.do |
capture log close
log using crhier, replace
clear
infile using family if rectype==1
drop rectype
gen long id = _n /* make my own temporary id var */
sort id /* to set sort markers */
save tmph, replace
clear
infile using person /* no matter what the rectype */
assert rectype==1 | rectype==2 /* just to be safe -- see note */
gen long id = 1 if rectype==1
replace id = sum(id)
drop if rectype==1
drop rectype
sort id perid
merge m:1 id using tmph
assert _merge==3 /* they are supposed to match */
drop _merge id
sort famid perid
save hier, replace
erase tmph.dta
log close
exit |
Let me direct your attention to the line that reads
assert rectype==1 | rectype==2
This line is an important part of my do-file. Everything I’m doing
hinges on the documentation being correct; that is, rectype really
does take on the values 1 and 2, and only the values 1 and 2, and that I am
correctly reading rectype (I’m reading the right columns of the
data). In my do-file, if rectype ever takes on a value other than 1
or 2, things stop right there.
Similarly, after merging, I include the line
assert _merge==3
Theoretically, this line must be true, but in reality, I sometimes make
mistakes. Asserting things that must be true is a good way to catch bugs.
Early on, I should verify more about the data. For instance, the
documentation implies that there are no empty households, meaning two
household records in a row. I could prove this by including the line
assert rectype!=1 if rectype[_n-1]==1
The final thing you need to know is that Stata’s
infix
command and infile
with a data dictionary are really the same command—both read the data
in record mode. I have chosen to use infile in the example above,
but I could have used infix. My dictionaries would have had a
slightly different format, but the logic would be the same. All that would
change in my do-file would be the switch from infile to infix.
|