
Re: st: Reading very complex raw data files


From   Joseph Coveney <[email protected]>
To   Statalist <[email protected]>
Subject   Re: st: Reading very complex raw data files
Date   Tue, 06 Dec 2005 13:32:13 +0900

Austin Nichols wrote:

Michael Mitchell's assertion is demonstrably untrue: any SAS input
statement can be rewritten in Stata syntax (possibly using -file-),
and there are some data management tasks that are much harder in SAS
(e.g. calculating the income of the person with line number defined by
a variable S_LINENO for each person).  What he is no doubt alluding to
(though I have not read the surrounding text) is a hierarchical file
format such as that used by the CPS. SAS and Stata code for reading
these files is available at http://www.nber.org/data/cps_progs.html
(and associated pages).  You can see for yourself which is
conceptually easier--I myself find the Stata code more intuitive.

The main advantage, and disadvantage, of SAS is that it reads data
serially instead of keeping it all in memory.  Thus, reading an entire
SIPP panel (see
http://www.nber.org/data/survey-of-income-and-program-participation-sipp-data.html
for details) into one big file (about 8GB of data) might well be
impossible in Stata due to memory constraints.  This is not due to
complexity so much as size, though.

--------------------------------------------------------------------------------

Austin's points are well taken.  Let me first mention that my original post
wasn't intended to confront Michael or to troll:  I have never encountered
an ASCII or ANSI (or even UTF-7 or UTF-8) data file that I needed to read
into Stata but could not read into Stata, and am genuinely wondering what I
am not aware of.

The power of SAS's data manipulation capability seems to have become
legendary.  It seems, though, that people are carrying over to today its
well deserved reputation from years ago, when it, PROPHET, BMDP and SPSS
were essentially the only games in town.  SAS might still have advantages in
overall data management capabilities for other reasons, but I now suspect
that Austin is correct in that its core data manipulation power is no great
shakes by contemporary standards.  The DATA step today seems awkward and
limited, even quaint.  Introduction of PROC SQL, from today's vantage point,
seems like SAS Institute's acknowledgment of the DATA step's not aging
gracefully.  And PROC SQL is, well, SQL--if you'd prefer to trade Stata's
powerful, easy and natural data manipulation commands for SQL statements,
then you can always -odbc exec()- with datasets stored in another format.

Someone earlier responded to this post privately, relating his experience
with the very data source that Austin cites for hierarchical data files:
March ADS supplements to NBER's CPS surveys.  Some time ago, this person
switched from Stata to SAS temporarily in order to read in the March ADS
supplements, not realizing that Stata can readily read in these and other
hierarchical data files.

As to large datasets, I suspect that, if you really wanted to, you could
emulate in Stata *anything* that the DATA step is doing with its program
data vector, its RETAIN statement and so on.  In Stata, the analogous
technique would be by use of -file- as Austin mentions and an accumulator
scalar or matrix in Stata or Mata.  It would be no less awkward and
time-consuming than the DATA step, and I doubt that you'd often, if ever,
need to do this, even with massive datasets:  consider instead using -infix
if- or -infile- with a dictionary that limits the columns to those of
interest, rather than reading in the entire multigigabyte file only to then
subset it to the variables and observations actually needed.
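
As a rough illustration of the two approaches just described, here is a
minimal sketch.  The file name bigfile.dat, the record-type flag "P" in
column 1, and the column positions for hhid and income are all hypothetical;
substitute the layout from the actual codebook.

```stata
* Approach 1: read the file serially with -file-, keeping only a running
* total in a scalar (roughly what RETAIN does in a DATA step).
* ASSUMED layout: record type in column 1, income in columns 20-27.
tempname fh total x
scalar `total' = 0
local n = 0
file open `fh' using "bigfile.dat", read text
file read `fh' line
while r(eof) == 0 {
    * Process person records only (flag "P" is an assumption)
    if substr(`"`line'"', 1, 1) == "P" {
        scalar `x' = real(substr(`"`line'"', 20, 8))
        if !missing(`x') {
            scalar `total' = `total' + `x'
            local ++n
        }
    }
    file read `fh' line
}
file close `fh'
display "person records with income: `n'"
display "total income: " `total'

* Approach 2: let -infix- subset while reading, so that only the needed
* columns and observations are ever held in memory.
infix str rectype 1 long hhid 2-10 double income 20-27 ///
    using "bigfile.dat" if rectype == "P", clear
```

The first approach never holds more than one line of the raw file in memory;
the second is usually all that is needed, because the -if- qualifier is
applied as the records are read.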

Joseph Coveney
