st: Missed opportunities for Stata I/O


From   Daniel Feenberg <[email protected]>
To   [email protected]
Subject   st: Missed opportunities for Stata I/O
Date   Sun, 8 Sep 2013 18:18:59 -0400 (EDT)

While Stata/MP can make estimation very fast, I feel there are some important missed opportunities in Stata I/O. They may not be sexy,
but Amdahl's law makes them increasingly important.
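
To make the Amdahl's law point concrete, here is a small back-of-the-envelope calculation. The numbers are assumptions chosen only for illustration (half of a job's run time is estimation that parallelizes perfectly, the other half is I/O that does not, and there are 8 cores); they are not measurements from our systems.

```stata
* Amdahl's law: overall speedup = 1 / ((1 - p) + p/n)
* p = fraction of run time that parallelizes (assumed value)
* n = number of cores (assumed value)
local p = 0.5
local n = 8
display "overall speedup: " 1 / ((1 - `p') + `p'/`n')
```

With these assumed values the job as a whole runs a little under 1.8 times faster, however fast the estimation step becomes, because the I/O half is untouched.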

In our work datasets are often tens of gigabytes, and sometimes hundreds of gigabytes when multiple years of Medicare data are combined.

First, the good news. The -use- statement takes a varlist and -if- qualifiers, which can dramatically speed up input when only a fraction of the data is needed. They also reduce core usage. Together they provide an order of magnitude improvement in speed in typical applications here.
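
A minimal sketch of that form of -use-, for readers who have not used it. The file name med2009 and the variables id and diagnosis follow the examples later in this message; substitute your own.

```stata
* Read only two variables, and only the AMI rows, from one year's file.
* The varlist and the -if- qualifier both go before -using-.
use id diagnosis if diagnosis == "ami" using med2009, clear
describe, short
```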

Now the disappointments. The -append- statement doesn't take the -if- qualifier, though it does take a varlist. The upshot of this is that what could be a simple:

  forvalues year=2001/2010 {
    append using med`year' if diagnosis=="ami", keep(varlist)
  }

becomes instead

  forvalues year=2001/2010 {
    use id diagnosis if diagnosis=="ami" using med`year', clear
    save ami`year'
  }
  use ami2001, clear
  forvalues year=2002/2010 {
    append using ami`year'
  }

unless you have enough memory to hold the entire dataset, and the patience to wait for it to load.
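
One possible variant of the workaround above, sketched under the assumption that the per-year subsets are small: accumulate them in a single temporary file instead of leaving ten ami`year' files on disk. The macro names combined and first are hypothetical, introduced only for this example.

```stata
* Accumulate the yearly AMI subsets in one temporary file.
tempfile combined
local first = 1
forvalues year = 2001/2010 {
    use id diagnosis if diagnosis == "ami" using med`year', clear
    * From the second year on, add what has been collected so far.
    if !`first' append using `combined'
    save `combined', replace
    local first = 0
}
```

This rewrites the accumulated subset once per year, so it only makes sense when the subset is a small fraction of each file. It is still more code than the one-line -append- with an -if- qualifier would be.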

-merge- statements are quite slow compared to -use-. Our fairly ordinary Linux boxes can read 3.4 million rows per second of 10 floats. Merging that with a single variable in the workspace runs at only a tenth of that speed. If only one variable is kept (varlist), or only a tiny percentage of the using rows are kept (-keep(match)-), the speed can be partially restored to about 1.2 million rows/second. It is possible that something about the way data is stored internally makes this impossible to improve, but it is unfortunate.
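
For anyone who wants to compare rates on their own hardware, a rough way to measure rows per second is sketched below using -timer-. The file name bigfile is a placeholder, and this is not necessarily how the figures above were obtained.

```stata
* Time a plain -use- and report rows read per second.
timer clear
timer on 1
use bigfile, clear
timer off 1
* -timer list- leaves the elapsed seconds for timer 1 in r(t1).
quietly timer list 1
display "rows per second: " %12.0fc _N / r(t1)
```

The same -timer on- / -timer off- pair can be wrapped around a -merge- to compare the two rates.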

More than the time element, the limitations of the -merge- statement make for complicated programming. Suppose there is in core a list of patients with an AMI, and you wish to merge in the doctor visits of those patients from the annual op (out-patient) files. You might hope to do this:

  forvalues year=2002/2010 {
    merge 1:m id using op`year', keep(match)
  }

But that doesn't work because after the first merge, there are duplicate ids in core (for multiple doctor visits in the first year). The best workaround I can come up with is:

  forvalues year=2002/2010 {
    use ami, clear
    merge 1:m id using op`year', keep(match)
    save ami`year', replace
  }
  use ami2002, clear
  forvalues year=2003/2010 {
    append using ami`year'
  }

-append- allows multiple files to be concatenated, but as far as I can tell -merge- doesn't allow them to be joined.
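
The duplicate-id problem with repeated 1:m merges can be reproduced on a toy dataset. Everything below is made up for illustration: two patients, three visits, and the temporary files patients and op1.

```stata
* Toy master file: one row per patient.
clear
input id
1
2
end
tempfile patients
save `patients'

* Toy using file: patient 1 has two visits.
clear
input id visit
1 1
1 2
2 1
end
tempfile op1
save `op1'

use `patients', clear
merge 1:m id using `op1', keep(match) nogenerate
* id is no longer unique in core, so a second 1:m merge is refused.
capture noisily merge 1:m id using `op1', keep(match) nogenerate
display "return code from second merge: " _rc
```

The second -merge- stops with an error and a nonzero return code, which is why each year has to be merged against a fresh copy of the patient list.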

The -save- command is much more restricted than other I/O commands - no varlist, -if- or -in- support. So dividing a file into subsets requires rereading the file for each subset. For example, instead of:

  forvalues i=1/50 {
    save state`i' if state==`i'
  }

we have:

  forvalues i=1/50 {
    use if state==`i' using file, clear
    save state`i'
  }

which is needlessly slower and more complex.
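
If the whole file does fit in memory, one alternative is to read it once and carve out each subset with -preserve- and -restore-. This is only a sketch, and it is not free: -preserve- makes its own copy of the data on every pass, so it may or may not beat rereading the file.

```stata
* Read the file once, then save each state's rows separately.
use file, clear
forvalues i = 1/50 {
    preserve
    keep if state == `i'
    save state`i', replace
    restore
}
```

A -save- that accepted -if- would avoid both the rereading and the preserve/restore copies.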

Surprisingly, the commands -infix-, -fdause- and -fdasave- allow -if-, -in- and a varlist, while -insheet-, -outsheet- and -save- don't allow any of those.

I should note that the -in- qualifier isn't as good as it could be. That is:

  use in 1/100 using med2009

doesn't stop reading at record 100. Instead it seems to read all 143 million records, but then drops the records past 100.

I am aware that most users of Stata have only a few thousand observations, and will not notice the wall-clock time differences I cite. However, I believe they are worth addressing, both for the benefit of users with very large datasets, and for all users who are tripped up when they fail to remember which of the usually supported options are not supported by the I/O command they need.

Daniel Feenberg
NBER