Home  /  Resources & support  /  FAQs  /  Reading a hierarchal dataset with infile

Can infile read a hierarchical dataset?

Title   Reading a hierarchal dataset with infile
Author William Gould, StataCorp

In hierarchical datasets, the records do not all have the same format, nor do they contain the same type of information. In computer jargon, the file is organized so that the number and types of fields depend on the record types. For example, you may have a dataset that has records for families and records for persons within family.

To read these types of datasets, you need to create a dictionary for each type of record. Then read in all the data by using each of the dictionaries and keep only the appropriate data. Once you have these two Stata datasets, you can merge them. This is a difficult data management problem, but it is manageable if you break the problem into several steps. We cover this topic in an extended example in lecture 1 of NetCourse 151. Below is an excerpt from that lecture.


Begin quotation from NetCourse 151

Now let’s consider reading a hierarchical dataset with Stata. This is tricky.

Suppose that you have data on families and persons within families. The data have the format family record followed by one or more person records:

family record
person record
...
person record
family record
etc...

Let’s assume that

family record: person record:
col. 1–5 family id col. 1–5 person id
col. 7 "1" col. 7 "2"
col. 9 dwelling type code col. 8–9 age
    col. 11 sex code

Note: Sample data are at the end of this lecture, after the exercises. You should use those data to test your ability to read this kind of dataset.

The data probably contain more information than this, but this is enough for illustration. If column 7 contains a 1, it is a family record, and if it contains a 2, it is a person record. This is called the record-type indicator.

I want to create a Stata .dta set containing

  1. family id
  2. dwelling code
  3. person id
  4. age
  5. sex code

My dataset will contain one observation per person, and the family information will be repeated for persons in the same family.

First, I will create separate dictionaries for reading the family and person information:

DICTIONARY: family.dct
      dictionary using hier.raw {
                      long    famid   %5f     "family id"
      _column(7)      byte    rectype %1f     "record type"
      _column(9)      byte    dwell   %1f     "dwelling code"
      }

DICTIONARY: person.dct
      dictionary using hier.raw {
                      long    perid   %5f     "person id"
      _column(7)      byte    rectype %1f     "record type"
      _column(8)      byte    age     %2f     "age (years)"
      _column(11)     byte    sex     %1f     "sex code"
      }

I will then test each one of these dictionaries, to make sure they work:

        . clear
        . infile using family if rectype==1 in 1/100
        . list in 1/5
        . type hier.raw                 <- I'll press Break to stop this
        . clear
        . infile using person if rectype==2 in 1/100
        . list in 1/5
        . type hier.raw                 <- I'll press Break to stop this

What I’m doing above is reading a few data, typing the original, and comparing them.

Satisfied that I have good dictionaries, I then create a do-file to read the entire dataset. The basic plan of my do-file is to

  1. read the family records and save them in a data file,
  2. read the person records, and
  3. merge the person and family records.

This problem would be easy if the person records contained the family id to which they belonged—step 1 would be an infile ... if rectype==1 followed by a sort and save, and step 2 would be an infile ... if rectype==2 followed by a sort, and step 3 would be a merge. My whole do-file would be

        STEP 1	
	        clear				
	        infile using family if rectype==1
	        sort famid
	        save tmph, replace

        STEP 2
	        clear		
	        infile using person if rectype==2
	        sort famid

        STEP 3
	        merge m:1 famid using tmph

In my example, however, famid does not appear on the person records (just as it does not in data released by the U.S. Bureau of the Census and the U.S. Bureau of Labor Statistics).

This adds significantly to the complication. I’m going to read the family records as I did above, but in addition, I’m going to manufacture my own family ID variable, labeling the first family 1, the second 2, and so on:

        MODIFIED STEP 1
	        clear
	        infile using family if rectype==1
	        gen long id = _n
	        sort id
	        save tmph, replace

Next, when I read the person data, I am going to read the family records as if they were person records, too. The result will be that I have a placeholder observation for the family record:

  perid rectype age sex
1. junk 1 junk junk
2. 1 2 32 0
3. 2 2 30 1
4. junk 1 junk junk
5. 1 2 40 1
etc.        

I will then regenerate my temporary family ID variable and discard the misread family records.

To regenerate the id variable, I will first gen id = 1 if rectype == 1,

  perid rectype age sex id
1. junk 1 junk junk 1
2. 1 2 32 0 .
3. 2 2 30 1 .
4. junk 1 junk junk 1
5. 1 2 40 1 .
etc.          

and then replace id = sum(id),

  perid rectype age sex id
1. junk 1 junk junk 1
2. 1 2 32 0 1
3. 2 2 30 1 1
4. junk 1 junk junk 2
5. 1 2 40 1 2
etc.          

and finally, drop if rectype == 1

  perid rectype age sex id
1. 1 2 32 0 1
2. 2 2 30 1 1
3. 1 2 40 1 2
           

I will then be able to merge my family data with my person data. So, the outline of my do-file is

        MODIFIED  STEP 1
       		clear                                
	        infile using family if rectype==1
	        gen long id = _n
	        sort id
	        save tmph, replace

        MODIFIED STEP 2
	        clear     
	        infile using person /* no if! */
	        gen long id = 1 if rectype==1
	        replace id = sum(id)
	        sort id

        MODIFIED STEP 3
	        merge m:1 id using tmph
	        drop id

Some final details:

My outline assumes that rectype takes on the values 1 and 2 as the documentation claims and that everything merges. I need to include some checks. Thus my final do-file is

DO-FILE: crhier.do
    capture log close
    log using crhier, replace
    clear
    infile using family if rectype==1
    drop rectype
    gen long id = _n                /* make my own temporary id var */
    sort id                         /* to set sort markers */
    save tmph, replace

    clear
    infile using person             /* no matter what the rectype */
    assert rectype==1 | rectype==2  /* just to be safe -- see note */
    gen long id = 1 if rectype==1
    replace id = sum(id)
    drop if rectype==1
    drop rectype
    sort id perid

    merge m:1 id using tmph
    assert _merge==3                /* they are supposed to match */
    drop _merge id

    sort famid perid
    save hier, replace
    erase tmph.dta
    log close
    exit

Let me direct your attention to the line that reads

        assert rectype==1 | rectype==2

This line is an important part of my do-file. Everything I’m doing hinges on the documentation being correct; that is, rectype really does take on the values 1 and 2, and only the values 1 and 2, and that I am correctly reading rectype (I’m reading the right columns of the data). In my do-file, if rectype ever takes on a value other than 1 or 2, things stop right there.

Similarly, after merging, I include the line

        assert _merge==3

Theoretically, this line must be true, but in reality, I sometimes make mistakes. Asserting things that must be true is a good way to catch bugs.

Early on, I should verify more about the data. For instance, the documentation implies that there are no empty households, meaning two household records in a row. I could prove this by including the line

        assert rectype!=1 if rectype[_n-1]==1

The final thing you need to know is that Stata’s infix command and infile with a data dictionary are really the same command—both read the data in record mode. I have chosen to use infile in the example above, but I could have used infix. My dictionaries would have had a slightly different format, but the logic would be the same. All that would change in my do-file would be the switch from infile to infix.