st: Conditional infile statements

From: Gordon Hughes
Subject: st: Conditional infile statements
Date: Sun, 20 Nov 2011 12:40:48 +0000

I would like to read a *very* large dataset using conditional infile statements. With some oversimplification the structure of the data is as follows:


Line 1: type1 id 1 2 3 4 5
Line 2: type1 id 3 4 5 6 7
Line 3: type2 id ABC DEF FGH
Line 4: type1 id 5 6 7 8 9
Line 5: type3 id IJK 3 4 XYZ
...

The format of the data on each line is fixed, but the formatting varies according to the value of the first variable on the line. For practical purposes the data may be treated as having one line per observation, but with different variables recorded for the different line types. There is no consistent pattern in the occurrence of lines of different types.

In SAS and some other high-level programming languages it is possible to read such data using generic code along the following lines:


read str ltype @
if ltype=="type1" {read id str type var1-var5}
if ltype=="type2" {read id str type str char1 str char2 str char3}
if ltype=="type3" {read id str char4 var6 var7 str char5}

where the @ character holds the current line for re-reading. As far as I can work out this is not possible, at least directly, in Stata.

In fact the problem is even worse than this description implies, because many of the variables have the form "123*", where 123 is a value and "*" may or may not be present and indicates a flag or note.
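To make the flag problem concrete, here is a minimal sketch of how a token such as "123*" could be split once it has been read as a string. The variable names raw, flag and value and the three sample tokens are made up for illustration; they are not taken from the real data.

```stata
* Hypothetical tokens: a number that may carry a trailing "*" flag
clear
input str8 raw
"123*"
"45"
"7*"
end

* flag is 1 when a "*" is present in the token, 0 otherwise
gen byte flag = strpos(raw, "*") > 0

* strip every "*" and convert what is left to a number
gen value = real(subinstr(raw, "*", "", .))

list
```

This again relies on string functions, so it shares the drawback that each flagged field must first be held in memory as a string.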


There is a way of doing this but to my mind it is clumsy:

infix str sline 1-30 using ...
gen ltype=substr(sline, 1, 5)
gen var1=real(substr(sline, 6, 2)) if ltype=="type1"
....
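Spelled out for the simplified example, the approach looks something like the sketch below. It is only an illustration under stated assumptions: the five sample lines are typed in with -input- in place of the -infix- call, the fields are assumed to be separated by blanks as in the listing (so word() is used rather than substr() with column positions), and the variable names follow the generic code above.

```stata
* The five sample lines, entered directly; "id" is the placeholder from the listing
clear
input str30 sline
"type1 id 1 2 3 4 5"
"type1 id 3 4 5 6 7"
"type2 id ABC DEF FGH"
"type1 id 5 6 7 8 9"
"type3 id IJK 3 4 XYZ"
end

* The first two words are common to every line type
gen ltype = word(sline, 1)
gen id    = word(sline, 2)

* type1: five numeric fields in words 3 to 7
forvalues j = 1/5 {
    gen var`j' = real(word(sline, `j' + 2)) if ltype == "type1"
}

* type2: three string fields in words 3 to 5
forvalues j = 1/3 {
    gen char`j' = word(sline, `j' + 2) if ltype == "type2"
}

* type3: string, number, number, string
gen char4 = word(sline, 3)       if ltype == "type3"
gen var6  = real(word(sline, 4)) if ltype == "type3"
gen var7  = real(word(sline, 5)) if ltype == "type3"
gen char5 = word(sline, 6)       if ltype == "type3"

* the raw line is no longer needed once it has been parsed
drop sline
```

Every line is held in full as a string until the final -drop-, which is the source of the memory overhead mentioned below.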

The user-written routine -strparse- can also be deployed for free format data, but again it involves the use of sub-string manipulation. I cannot locate any other user-written routine which provides a better way of doing this, but my -net search- terms may not pick up the right keywords.

I would appreciate any suggestions as to a better way of doing this, or should I just resign myself to writing the code required to parse each line? (Incidentally, one reason for my reluctance to do this is that it increases the maximum memory size required to hold the initial pass through the data.)


Gordon Hughes