Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Conditional infile statements


From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: Conditional infile statements
Date   Sun, 20 Nov 2011 09:52:00 -0500

Here's a solution that will allow Gordon to use one -infix- statement per record type.

Steve

**************CODE BEGINS*************
/* file my.txt contains:
1 Smith
2 3
2 18
1 Bill Jones
*/

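* Create empty placeholder datasets t1 and t2
* (dum exists only so that they contain a variable)
clear
gen dum = 1
forvalues i = 1/2 {
    save t`i', replace
}

* Record type 1: string field in columns 3-12
quietly infix type 1-1 str10 strvar 3-12 using my.txt if type==1, clear
if type==1 append using t1
save t1, replace

* Record type 2: numeric field in columns 3-12
quietly infix type 1-1 numvar 3-12 using my.txt if type==2, clear
if type==2 append using t2
save t2, replace

* Combine both record types into one dataset
append using t1
drop dum
list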
**************CODE ENDS**************
On Nov 20, 2011, at 8:57 AM, Nick Cox wrote:

-strparse- (SSC) was superseded long ago by the official command
-split-. I don't think that will help Gordon much on its own, but
reading each line in as one string variable and then using some mix
of -split- and -reshape- might help.
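
A minimal sketch of the read-then-split idea, assuming a blank-delimited file whose lines look like Gordon's example (a type, an id, then type-specific fields). The file name mydata.txt, the width str80, and the names sline, tok, var1, flag1 and char1 are illustrative assumptions, not taken from the actual data:

```stata
* Assumed layout: one record per line, e.g. "type1 id 1 2 3 4 5"
* Read every line as a single string (80 columns is an assumed maximum)
infix str80 sline 1-80 using mydata.txt, clear

* Break each line into tok1, tok2, ... on blanks
split sline, generate(tok)
drop sline

* tok1 holds the record type; convert fields only for the matching rows.
* A trailing "*" (the flag in Gordon's description) is stripped before
* conversion and recorded separately.
generate var1 = real(subinstr(tok3, "*", "", 1)) if tok1 == "type1"
generate byte flag1 = strpos(tok3, "*") > 0 if tok1 == "type1"
generate char1 = tok3 if tok1 == "type2"
```

This still holds every line in memory as strings on the first pass, so it does not address the memory concern for a very large file.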

I also would commend -file- here, or Unix utilities such as awk
(available in ports to Windows).
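
As one possible use of -file-, the sketch below copies each line of the raw file into a separate file per record type, so that each piece can then be read with its own -infix-. It assumes the my.txt layout from Steve's example (record type in column 1); the output names type1.txt and type2.txt and the handle names are made up for illustration:

```stata
* Split my.txt into one raw text file per record type
tempname fin f1 f2
file open `fin' using my.txt, read text
file open `f1' using type1.txt, write text replace
file open `f2' using type2.txt, write text replace

file read `fin' line
while r(eof) == 0 {
    * the record type is assumed to be the first character of the line
    local t = substr(`"`macval(line)'"', 1, 1)
    if "`t'" == "1" file write `f1' `"`macval(line)'"' _n
    if "`t'" == "2" file write `f2' `"`macval(line)'"' _n
    file read `fin' line
}
file close `fin'
file close `f1'
file close `f2'

* Each piece now has a single format and needs no -if- qualifier
infix type 1-1 str10 strvar 3-12 using type1.txt, clear
```

Only one line is held in a macro at a time, so the split itself needs very little memory, at the cost of an extra pass over the raw file.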

Nick

On Sun, Nov 20, 2011 at 12:40 PM, Gordon Hughes <[email protected]> wrote:
> I would like to read a *very* large dataset using conditional infile
> statements.  With some oversimplification the structure of the data is as
> follows:
> 
> Line 1: type1 id 1 2 3 4 5
> Line 2: type1 id 3 4 5 6 7
> Line 3: type2 id ABC DEF FGH
> Line 4: type1 id 5 6 7 8 9
> Line 5: type3 id IJK 3 4 XYZ
> ...
> 
> The format of the data on each line is fixed but the formatting varies
> according to the value of the first variable on the line.  For practical
> purposes the data may be treated as having one line per observation but with
> different variables recorded for the different line types.  There is no
> consistent pattern of the occurrence of lines of different types.
> 
> In high-level programming languages, SAS, and some other packages it is
> possible to read such data using the following generic code:
> 
> read str ltype @
> if ltype=="type1" {read id str type var1-var5}
> if ltype=="type2" {read id str type str char1 str char2 str char3}
> if ltype=="type3" {read id str char4 var6 var7 str char5}
> 
> where the @ character holds the current line for re-reading.  As far as I
> can work out this is not possible, at least directly, in Stata.
> 
> In fact the problem is even worse than this description implies because many
> of the variables have the form "123*" where 123 is a value and "*" may or
> may not be present and indicates a flag or note.
> 
> There is a way of doing this but to my mind it is clumsy:
> 
> infix str sline 1-30 using ...
> gen ltype=substr(sline, 1, 5)
> gen var1=real(substr(sline, 6, 2)) if ltype=="type1"
> ....
> 
> The user-written routine -strparse- can also be deployed for free format
> data, but again it involves the use of sub-string manipulation.  I cannot
> locate any other user-written routine which provides a better way of doing
> this, but my -net search- terms may not pick up the right keywords.
> 
> I would appreciate any suggestions as to a better way of doing this. Or
> should I just resign myself to writing the code required to parse each line?
> (Incidentally, one reason for my reluctance to do this is that it increases
> the maximum memory size required to hold the initial pass through the data.)
> 

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/