



Re: st: Conditional infile statements


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Conditional infile statements
Date   Sun, 20 Nov 2011 13:57:35 +0000

-strparse- (SSC) was superseded long ago by the official command
-split-. That alone, I think, may not help Gordon much, but reading
each line in as one string variable and then applying some mix of
-split- and -reshape- might.

I also would commend -file- here, or Unix utilities such as awk
(available in ports to Windows).
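
A minimal sketch of the awk route, assuming whitespace-separated fields with the line type in the first field: split the mixed file into one file per line type, so that each output file has a single layout and can be read with an ordinary -infile- and its own variable list. The file name mixed.txt and the id values are made up for illustration, and the "123*" flags are left untouched.

```shell
# Small sample in the layout described in the question (ids are hypothetical).
cat > mixed.txt <<'EOF'
type1 101 1 2 3 4 5
type1 102 3 4 5 6 7
type2 103 ABC DEF FGH
type1 104 5 6 7 8 9
type3 105 IJK 3 4 XYZ
EOF

# Route each line to a file named after its first field:
# type1.txt, type2.txt, type3.txt.
awk '{ out = $1 ".txt"; print > out }' mixed.txt
```

Because awk streams the file line by line, this pass does not need the whole dataset in memory, which may matter for a very large file.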

Nick

On Sun, Nov 20, 2011 at 12:40 PM, Gordon Hughes <G.A.Hughes@ed.ac.uk> wrote:
> I would like to read a *very* large dataset using conditional infile
> statements.  With some oversimplification the structure of the data is as
> follows:
>
> Line 1: type1 id 1 2 3 4 5
> Line 2: type1 id 3 4 5 6 7
> Line 3: type2 id ABC DEF FGH
> Line 4: type1 id 5 6 7 8 9
> Line 5: type3 id IJK 3 4 XYZ
> ...
>
> The format of the data on each line is fixed but the formatting varies
> according to the value of the first variable on the line.  For practical
> purposes the data may be treated as having one line per observation but with
> different variables recorded for the different line types.  There is no
> consistent pattern of the occurrence of lines of different types.
>
> In high-level programming languages, and in SAS and some other packages, it
> is possible to read such data using the following generic code:
>
> read str ltype @
> if ltype=="type1" {read id str type var1-var5}
> if ltype=="type2" {read id str type str char1 str char2 str char3}
> if ltype=="type3" {read id str char4 var6 var7 str char5}
>
> where the @ character holds the current line for re-reading.  As far as I
> can work out this is not possible, at least directly, in Stata.
>
> In fact the problem is even worse than this description implies because many
> of the variables have the form "123*" where 123 is a value and "*" may or
> may not be present and indicates a flag or note.
>
> There is a way of doing this but to my mind it is clumsy:
>
> infix str sline 1-30 using ...
> gen ltype=substr(sline, 1, 5)
> gen var1=real(substr(sline, 6, 2)) if ltype=="type1"
> ....
>
> The user-written routine -strparse- can also be deployed for free format
> data, but again it involves the use of sub-string manipulation.  I cannot
> locate any other user-written routine which provides a better way of doing
> this, but my -net search- terms may not pick up the right keywords.
>
> I would appreciate any suggestions as to a better way of doing this - or
> should I just resign myself to writing the code required to parse each line?
>  (Incidentally, one reason for my reluctance to do this is that it increases
> the maximum memory size required to hold the initial pass through the data.)
>
