Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Conditional infile statements

From	Steven Samuels <[email protected]>
To	[email protected]
Subject	Re: st: Conditional infile statements
Date	Sun, 20 Nov 2011 13:01:09 -0500

Oops, I didn't show the revised input file with the ID variable.

There's even less typing in this version. Also, the different record types are merged, not appended. We often design our questionnaires in sections so that we can do this. Each section is entered on a different data line with a unique record type as the first variable and is saved to its own data file. Eventually the separate files are merged. I imagine Gordon's questionnaire is of this type.

In practice, we write two or three do-files to process each section. For short sections, the first do-file does edit checks, and the second makes necessary corrections, recodes variables, and assigns variable and value labels. These do-files are rerun until no further changes are needed. At the end, all sections with the same ID are merged into one record. Editing the sections separately makes for shorter do-files and also makes it easy to browse records.

Steve
**************CODE BEGINS*************
/* Read record lines with different variables, controlled by a record type:
This uses the functionality of SAS's "hold the line" @ */

/* myfile.txt is
1 100 John Smith
2 100 50
1 101 Bob Jones
2 101 45
*/

local rt1 "rtype 1-1 id 3-5 str10 name 7-16"
local rt2 "rtype 1-1 id 3-5 age 7-8"

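/* Create data sets for appending */
/* Start from an empty data set so that t1 and t2 hold no observations */
clear
gen dum = .
forvalues i = 1/2 {
    quietly save t`i', replace
}

/* Read one record type at a time and save it, sorted by id */
forvalues i = 1/2 {
    quietly infix `rt`i'' using myfile.txt if rtype == `i', clear
    append using t`i'
    drop dum
    sort id
    quietly save t`i', replace
}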

merge 1:1 id using t1
drop _merge rtype
list
**************CODE ENDS**************
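
The 1:1 merge above relies on each id appearing exactly once per record type. A minimal sketch of checks that could be run before dropping _merge follows. It assumes the files t1 and t2 saved by the code above; the choice of checks is an illustration, not part of the posted workflow.

```stata
* Confirm that id uniquely identifies observations in each section file
forvalues i = 1/2 {
    use t`i', clear
    isid id
}

* t2 is now in memory; keep _merge long enough to inspect the result
merge 1:1 id using t1
tabulate _merge

* Show any ids that are missing one of the two record types
list id if _merge != 3
```

If -isid- stops with an error, some id has more than one line of that record type, and the input file needs correcting before the merge.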

On Nov 20, 2011, at 9:28 AM, David Kantor wrote:

At 07:40 AM 11/20/2011, Gordon Hughes wrote:
I would like to read a *very* large dataset using conditional infile statements. With some oversimplification the structure of the data is as follows:

Line 1: type1 id 1 2 3 4 5
Line 2: type1 id 3 4 5 6 7
Line 3: type2 id ABC DEF FGH
Line 4: type1 id 5 6 7 8 9
Line 5: type3 id IJK 3 4 XYZ
...

The format of the data on each line is fixed, but the formatting varies according to the value of the first variable on the line. For practical purposes the data may be treated as having one line per observation but with different variables recorded for the different line types. There is no consistent pattern of the occurrence of lines of different types.

In high-level programming languages, SAS, and some other languages, it is possible to read such data using generic code like the following:

read str ltype @
if ltype=="type1" {read id str type var1-var5}
if ltype=="type2" {read id str type str char1 str char2 str char3}
if ltype=="type3" {read id str char4 var6 var7 str char5}

where the @ character holds the current line for re-reading. As far as I can work out this is not possible, at least directly, in Stata.

In fact the problem is even worse than this description implies because many of the variables have the form "123*" where 123 is a value and "*" may or may not be present and indicates a flag or note.

There is a way of doing this but to my mind it is clumsy:

infix str sline 1-30 using ...
gen ltype=substr(sline, 1, 5)
gen var1=real(substr(sline, 6, 2)) if ltype=="type1"
....
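
A minimal sketch of how this substring approach might be extended to the flagged "123*" values. The file name mydata.txt, the column positions, and the names v1raw, var1, and flag1 are assumptions for illustration, not the layout of the actual data.

```stata
* Hypothetical layout: line type in columns 1-5, one flagged field in columns 7-10
infix str sline 1-30 using mydata.txt, clear
gen str5 ltype = substr(sline, 1, 5)

* Pull the raw field, e.g. "123*" or "45", for type1 lines only
gen str4 v1raw = trim(substr(sline, 7, 4)) if ltype == "type1"

* A trailing asterisk marks a flag or note
gen byte flag1 = (substr(v1raw, -1, 1) == "*") if ltype == "type1"

* Remove the asterisk and convert what is left to a number
gen var1 = real(subinstr(v1raw, "*", "", .)) if ltype == "type1"

* The raw strings are no longer needed once the fields are parsed
drop sline v1raw
```

Keeping the flag in its own byte variable lets the value itself be stored as a number.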

The user-written routine -strparse- can also be deployed for free format data, but again it involves the use of sub-string manipulation. I cannot locate any other user-written routine which provides a better way of doing this, but my -net search- terms may not pick up the right keywords.

I would appreciate any suggestions as to a better way of doing this. Or should I just resign myself to writing the code required to parse each line? (Incidentally, one reason for my reluctance to do this is that it increases the maximum memory size required to hold the initial pass through the data.)

Gordon Hughes
[email protected]

Your "clumsy" method might not be bad.
My own approach would be to use -infile- with a dictionary, and to set up a dictionary to read all types of lines concurrently. This would include a variable that serves as the discriminant (ltype), plus all other possible variables. The fact that they would overlap in terms of position doesn't matter. The dictionary content might look like this:

_column(1) str5 ltype %5s
_column(10) byte var1 %1f
_column(12) byte var2 %1f
[etc.]
_column(10) str3 char1 %3s
_column(14) str3 char2 %3s
_column(18) str3 char3 %3s
[and so on for all the variables in all possible line types]
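
A minimal sketch of a complete dictionary file built from only the entries shown above. The data file name mydata.txt and all column positions are placeholders, and the numeric fields use %1f input formats here.

```stata
dictionary using mydata.txt {
    * mydata.txt and every column position below are placeholders
    _column(1)   str5  ltype  %5s
    * numeric fields, meaningful on type1 lines
    _column(10)  byte  var1   %1f
    _column(12)  byte  var2   %1f
    * string fields, meaningful on type2 lines; they overlap the columns above
    _column(10)  str3  char1  %3s
    _column(14)  str3  char2  %3s
    _column(18)  str3  char3  %3s
}
```

Saved as, say, mydata.dct (a hypothetical name), this would be read with -infile using mydata.dct, clear-. Where a numeric field overlaps text on another line type, Stata should report that the characters cannot be read as a number and store a missing value.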

Then, in later processing, you would do:
replace var1 = . if ltype ~= "type1"
replace var2 = . if ltype ~= "type1"
[etc.]
replace char1 = "" if ltype ~= "type2"
replace char2 = "" if ltype ~= "type2"
replace char3 = "" if ltype ~= "type2"
[etc.]

Of course, some of that code could be compacted into -forvalues- commands, but that's another matter.
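
A minimal sketch of that compaction, assuming the numbering in the original question (var1-var5 on type1 lines, char1-char3 on type2 lines) and that all of those variables were read by the dictionary:

```stata
* Numeric variables belong to type1 lines; set them to missing elsewhere
forvalues j = 1/5 {
    quietly replace var`j' = . if ltype != "type1"
}

* String variables belong to type2 lines; blank them elsewhere
forvalues j = 1/3 {
    quietly replace char`j' = "" if ltype != "type2"
}
```

Each further line type would get one more loop of the same form.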

HTH
--David
