Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Reading data with multiple (and different number of) lines per


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Reading data with multiple (and different number of) lines per
Date   Mon, 13 Apr 2009 09:16:26 -0500

Carolina Casas-Cordero <ccasas@survey.umd.edu> writes, 

> I understand that Stata can read ASCII files with multiple lines per
> observations (using infix) as long as each observation has exactly the
> same number of lines (using lines #). 
> 
> In my dataset, the number of lines varies from observation to
> observation. Im providing an example of how my raw data looks like
> below. For each observation therere 3 pieces of information:
> 1.Household level data (84 characters long, in line 1)
> 2.Indicator of number of person-level records per household (4
> characters long, in line 2)
> 3.Person level data (25 characters long, in lines 3+)
> 
> How can I tell Stata to take into account the variability in the number
> of lines per observation?
> 
> Example of data structure:
> ---------------------------------------------------------------------------------
> 000010011800111 1140471504000000000 000001171993
> 4
> 0000005237208080100000000
> 0000005237249501000000000
> 0000005237257601100000000
> 0000005237309120100000000
> 000010011800111 1140471504000000000 000001171993
> 2
> 0000005237208080100000000
> 0000005237309120100000000
> 000010011800111 1140471504000000000 000001171993
> 3
> 0000005237208080100000000
> 0000005237257601100000000
> 0000005237309120100000000
> 000010011800111 1140471504000000000 000001171993
> 8
> 0000005237208080100000000
> 0000005237249501000000000
> 0000005237257601100000000
> 0000005237309120100000000
> 0000005237208080100000000
> 0000005237249501000000000
> 0000005237257601100000000
> 0000005237309120100000000
> ....etc

Roy Wada <roywada@hotmail.com> already gave an answer, which was, 
in outline, 

>   1.  infix acording to household.  Drop everything else.
>
>   2.  infix according to person level data.  After droping the 
>       household row (you can figure that out), ...
>
>   3.  User -merge- or -joinby- to put them together on id.

I omit the dtail Roy provided because I'm about to fill in those details
in my own way.


Layout of input data
--------------------

I have a suspicion that Carolina inserted line 2 in each record, recording 
the number of person records that followed -- as a way to make it easier 
for us to understand the problem.  In most longitudinal datasets I've seen, 
one has te following important variables:

     rectype     a variable that indicates whether this is a household
                 record or a family record.  If there is a rectype variable,
                 it appears in the same columns in hh records and 
                 person records.  I'm going to assume a rectype variable, 
                 but there's a Final Comment below about that.  I  will also 
                 assume, this time without loss of generality, that 
                 rectype is 1 for household record and 2 for a person 
                 record.

                 As an aside, the family record might have a field
                 that records the number of person records that follow.
                 We will not be using that count.

     hhid        Household (hh) identification number.  Sometimes hhid 
                 appears on the hh record only, and sometimes it appears
                 on both the hh and the person records.  The problem is more
                 difficult when hhid appears on the hh record only, so I'll
                 assume that's the only place it appears.

     pid         Person-within-household identification number.
                 Obviously, pid appears only on person records.


Step 1:  infix according to household
-------------------------------------

Roy's advice is exactly right.  We are going to read the data just as if every
record were a household record!  In particular, we will type 

        . infix ... rectype ... if rectype==1

where the details of -infix- are filled in for reading a household record.
Among the variables we read will be rectype.  We keep only the rectype==1 
records, which is to say, the household records.

When we do this we may get a lot of notes about problems reading the data 
because most records are in fact person records.  Nevertheless, we are  
discarding the person records, so we don't care.  Either -set more off-
or put a -quietly- in front of the infix records.

Okay, we have the household data.  Let's save it, 

       . save household


Step 2:  infix according to person
----------------------------------

Now we are going to read the data again, but this time, we will read the data
just as if every record were a person record.  I bet your expecting me to add
"and we'll keep just the peroson records."  No, this we time, we are going x
to keep rectype==1 and rectype==2!  We will do that because I am not 
assuming the hhid appears on person records.

        . infix ... rectype ...             // no -if rectype==#- this time

I kept the rectype==1 records because I want to identify the start of 
each household.  The next step is, 

        . gen long myhhid = 1 if rectype==1
        . replace  myhhid = sum(myhhid)

The above is a useful trick to remember when you want to create an id 
variable based on grouping records according to the occurance of some 
event.  Mark the start of each group with 1, fill in the other values 
with 0 or ., and then sum.  The result is a variable that contains 1
for a group of recrods, then 2, then 3, ...

We don't need the rectype==1 records any more.  We do need to save the 
data:

        . drop if rectype==1 
        . save person


Step 3:  Combine 
----------------

Now we do the following:

        . use household 
        . gen long myhhid = 1 if hhid != hhid[_n-1]
        . replace myhhid = sum(hhid)

With those steps, I have just created the same myhhid that goes 1, 2, ...
in household, just like I have in person.dta.  I was almost tempted to 
code, 

        . use household                              // WRONG
        . sort hhid                                  // WRONG
        . by hhid: gen long myhhid = 1 if _n==1      // WRONG
        . replace myhhid = sum(hhid)                 // WRONG

The problem with the above is that it assumes that hhid appears in 
increasing order in the original input file, and that might not be true.
So I coded the first, not the second.

I'm now ready to merge household.dta and person.dta on myhhid.  I
sort each dataset, and merge,


        . sort myhhid 
        . save household

        . use person 
        . sort myhhid pid
        . save person 

        . use household 
        . merge myhhid using person
        . assert _merge==3            /* never hurts to check */
        . drop _merge

We are done.


Final comment
-------------

It may be that Carolina's data does not have a rectype variable and we 
really are going to have to use the number-of-person-records variable 
she mentioned at the outset.  I'm betting not, however.  If I'm wrong, 
Carolina should ask again and then I'll write another posting on how 
to deal with that.


-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index