


st: Missed opportunities for Stata I/O

From	Daniel Feenberg <[email protected]>
To	[email protected]
Subject	st: Missed opportunities for Stata I/O
Date	Sun, 8 Sep 2013 18:18:59 -0400 (EDT)

While Stata/MP can make estimation very fast, I feel there are some important missed opportunities in Stata I/O that may not be sexy, but which Amdahl's law makes increasingly important.

In our work, datasets are often tens of gigabytes, and sometimes hundreds of gigabytes when multiple years of Medicare data are combined.

First, the good news. The -use- statement takes variable lists and -if- qualifiers, which can dramatically speed up input if only a fraction of the data is needed. They also reduce core usage. The varlists and -if- qualifiers provide an order of magnitude improvement in speed in typical applications here.
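As a minimal illustration of that form, the line below loads two variables and only the matching rows from one annual file. The variable and file names are borrowed from the examples later in this message; note that the -if- qualifier comes before -using-.

  * Load two variables, AMI rows only, from a single annual file.
  use id diagnosis if diagnosis == "ami" using med2009, clear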

Now the disappointments. The -append- statement doesn't take the -if- qualifier, though it does take a varlist. The upshot of this is that what could be a simple:


  forvalues year = 2001/2010 {
    append using med`year' if diagnosis=="ami", keep(varlist)
  }

becomes instead

  forvalues year = 2001/2010 {
    clear
    use id diagnosis if diagnosis=="ami" using med`year'
    save ami`year', replace
  }
  use ami2001, clear
  forvalues year = 2002/2010 {
    append using ami`year'
  }

unless you have enough memory to hold the entire dataset in memory, and the patience to wait for it to load.
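A variant of the same workaround, offered only as a sketch, accumulates the yearly subsets in a single temporary file instead of leaving one ami file per year on disk. The macro name ami is an arbitrary choice, and the whole block has to run within one do-file so that the tempfile macro stays defined.

  * Accumulate the AMI rows of every year in one temporary file.
  tempfile ami
  forvalues year = 2001/2010 {
    * Read only the needed variables and rows for this year.
    use id diagnosis if diagnosis == "ami" using med`year', clear
    * From the second year on, add what has been collected so far.
    if `year' > 2001 {
      append using `ami'
    }
    save `ami', replace
  }

After the loop the combined subset is in memory. The accumulated file is rewritten on every pass, which is cheap only as long as the selected subset is small relative to the annual files.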

-merge- statements are quite slow compared to -use-. Our fairly ordinary Linux boxes can read 3.4 million rows per second of 10 floats. Merging that with a single variable in the workspace runs at only a tenth that speed. If only one variable is kept (varlist), or only a tiny percentage of the using rows are kept (-keep(match)-), the speed can be partially restored to about 1.2 million rows/second. It is possible that something about the way data is stored internally makes this impossible to improve, but it is unfortunate.
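For anyone who wants to reproduce this kind of comparison on their own files, a rough sketch using Stata's -timer- command follows. The file names med2009 and ami are placeholders taken from the other examples in this message, and the sketch assumes id is unique in ami so that a 1:m merge is valid.

  * Time a plain read of one annual file.
  timer clear
  timer on 1
  use med2009, clear
  timer off 1

  * Time a merge of the same file against a small list held in memory.
  use ami, clear
  timer on 2
  merge 1:m id using med2009, keep(match)
  timer off 2

  * Report elapsed seconds for timers 1 and 2.
  timer list

Dividing the number of rows in the using file by the reported seconds gives rows per second for each case.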

More than the time element, the limitations of the -merge- statement make for complicated programming. Suppose there is in core a list of patients with an AMI, and you wish to merge in the doctors' visits of those patients from the annual op (out-patient) files. You might hope to do this:


  forvalues year = 2002/2010 {
    merge 1:m id using op`year', keep(match)
  }

But that doesn't work because after the first merge, there are duplicate ids in core (for multiple doctors' visits in the first year). The best workaround I can come up with is:


  forvalues year = 2002/2010 {
    clear
    use ami
    merge 1:m id using op`year', keep(match)
    save ami`year', replace
  }
  use ami2002, clear
  forvalues year = 2003/2010 {
    append using ami`year'
  }

-append- allows multiple files to be concatenated, but as far as I can tell -merge- doesn't allow them to be joined.

The -save- command is much more restricted than other I/O commands: no varlist, -if- or -in- support. So dividing a file into subsets requires rereading the file for each subset. For example, instead of:


  forvalues i = 1/50 {
    save state`i' if state==`i'
  }

we have:

  forvalues i = 1/50 {
    clear
    use if state==`i' using file
    save state`i'
  }

which is needlessly slower and more complex.
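If the full file does fit in memory, one alternative sketch reads it from disk only once and carves out each subset with -preserve- and -restore-. This is not a real fix: it assumes the whole file can be held in core, and -preserve- makes its own copy of the data on every pass, so it may not be faster in practice.

  * Read the full file once, then write one subset per state.
  use file, clear
  forvalues i = 1/50 {
    preserve
    keep if state == `i'
    save state`i', replace
    restore
  }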

Surprisingly, the commands -infix-, -fdause- and -fdasave- allow -if-, -in- and a varlist, while -insheet-, -outsheet- and -save- don't allow any of those.

I should note that the -in- qualifier isn't as good as it could be. That is:


  use med2009 in 1/100

doesn't stop reading at record 100. Instead it seems to read all 143 million records, but then drops the records past 100.

I am aware that most users of Stata have only a few thousand observations, and will not notice the wall-clock time differences I cite. However, I believe they are worth addressing, both for the benefit of users with very large datasets, and for all users who are tripped up when they fail to remember which of the usually supported options are not supported by the I/O command they need.


Daniel Feenberg
NBER
