Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Missed opportunities for Stata I/O


From   László Sándor <[email protected]>
To   [email protected]
Subject   Re: st: Missed opportunities for Stata I/O
Date   Mon, 9 Sep 2013 13:32:59 -0400

I think Amdahl's law applies to the population of Stata users too.
However many undergrads "use" Stata with auto.dta, if those of us
in the right tail struggle, slow down research (even each other's), and
waste resources, I think that is an issue for StataCorp. If the
potential benefit would not be reflected in StataCorp revenues, maybe
we should reconsider the design of the licenses.

But I think they genuinely care.

So perhaps I can be more constructive with this link to show how much
"high performance computing" or "big data" the open-source, free, and
"barely-MP" R community found useful enough to code up:
http://cran.r-project.org/web/views/HighPerformanceComputing.html

Some of this came up before, but maybe there is also some hope for
progress on their "Large memory and out-of-memory data" section
(Hadoop, biglars, MonetDB.R) for Stata, or perhaps some GPU stuff,
esp. for very computationally intensive things like cross-validation
or bootstrap.

On Sun, Sep 8, 2013 at 6:18 PM, Daniel Feenberg <[email protected]> wrote:
> While Stata/MP can make estimation very fast, I feel there are some important
> missed opportunities in Stata I/O that may not be sexy,
> but which Amdahl's law makes increasingly important.
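>
> To spell out the reasoning: Amdahl's law bounds the overall speedup when
> only part of a job gets faster. As an illustration with made-up numbers
> (the 50% share and the factor of 8 are assumptions, not measurements),
> let f be the fraction of run time spent in estimation and s the speedup
> of that part:
>
> ```latex
> S = \frac{1}{(1 - f) + f/s}, \qquad
> f = 0.5,\ s = 8 \;\Rightarrow\; S = \frac{1}{0.5 + 0.0625} \approx 1.78
> ```
>
> As s grows, S approaches 1/(1 - f), so the unaccelerated I/O share sets
> the ceiling.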
>
> In our work datasets are often tens of gigabytes, and sometimes hundreds of
> gigabytes when multiple years of Medicare data are combined.
>
> First, the good news. For example, the -use- statement takes variable lists
> and -if- qualifiers, which can dramatically speed up input if only a fraction
> of the data is needed. They also reduce core usage. The varlists and -if-
> qualifiers provide an order of magnitude improvement in speed in typical
> applications here.
>
> Now the disappointments. The -append- statement doesn't take the -if-
> qualifier, though it does take a varlist. The upshot of this is that what
> could be a simple:
>
>   forvalues year = 2001/2010 {
>     append using med`year' if diagnosis=="ami", keep(varlist)
>   }
>
> becomes instead
>
>   forvalues year = 2001/2010 {
>     clear
>     use id diagnosis if diagnosis=="ami" using med`year'
>     save ami`year'
>   }
>   use ami2001, clear
>   forvalues year = 2002/2010 {
>     append using ami`year'
>   }
>
> unless you have enough memory to hold the entire dataset, and the
> patience to wait for it to load.
>
> -merge- statements are quite slow compared to -use-. Our fairly ordinary
> Linux boxes can read 3.4 million rows per second of 10 floats. Merging that
> with a single variable in the workspace runs at only a tenth that speed. If
> only one variable is kept (varlist), or only a tiny percentage of the using
> rows are kept (-keep(match)-) the speed can be partially restored to about
> 1.2 million rows/second. It is possible that something about the way data is
> stored internally makes this impossible to improve, but it is unfortunate.
>
> More than the time element, the limitations of the -merge- statement make
> for complicated programming. Suppose there is in core a list of patients
> with an AMI, and you wish to merge in the doctors' visits of those patients
> from the annual op (out-patient) files. You might hope to do this:
>
>   forvalues year = 2002/2010 {
>     merge 1:m id using op`year', keep(match)
>   }
>
> But that doesn't work because after the first merge, there are duplicate ids
> in core (for multiple doctors' visits in the first year). The best workaround
> I can come up with is:
>
>   forvalues year = 2002/2010 {
>     clear
>     use ami
>     merge 1:m id using op`year', keep(match)
>     save ami`year', replace
>   }
>   clear
>   forvalues year = 2002/2010 {
>     append using ami`year'
>   }
>
> -append- allows multiple files to be concatenated, but as far as I can tell
> -merge- doesn't allow them to be joined.
>
> The -save- command is much more restricted than other I/O commands - no
> varlist, -if- or -in- support. So dividing a file into subsets requires
> rereading the file for each subset. For example instead of:
>
>   forvalues i = 1/50 {
>     save state`i' if state==`i'
>   }
>
> we have:
>
>   forvalues i = 1/50 {
>     clear
>     use if state==`i' using file
>     save state`i'
>   }
>
> which is needlessly slower and more complex.
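>
> One possible workaround, sketched here as an assumption rather than a tested
> recommendation: if the full file does fit in memory, it can be read once and
> each subset written with -preserve-, -keep if- and -restore-. The names
> (file, state, state`i') are the ones from the example above. Note that
> -preserve- makes its own copy of the data, so whether this beats rereading
> the file will depend on the Stata version and the hardware.
>
> ```stata
> * read the full file once (assumes it fits in memory)
> use file, clear
> forvalues i = 1/50 {
>     preserve                  // set aside a copy of the full data
>     keep if state == `i'      // reduce to one state's observations
>     save state`i', replace    // write the subset
>     restore                   // bring the full data back
> }
> ```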
>
> Surprisingly, the commands -infix-, -fdause- and -fdasave- allow -if-, -in-
> and a varlist, while -insheet-, -outsheet- and -save- don't allow any of
> those.
>
> I should note that the -in- qualifier isn't as good as it could be. That is:
>
>   use in 1/100 using med2009
>
> doesn't stop reading at record 100. Instead it seems to read all 143 million
> records, but then drops the records past 100.
>
> I am aware that most users of Stata have only a few thousand observations,
> and will not notice the wall-clock time differences I cite. However, I
> believe they are worth addressing, both for the benefit of users with very
> large datasets, and for all users who are tripped up when they fail to
> remember which of the usually supported options are not supported by the I/O
> command they need.
>
> Daniel Feenberg
> NBER
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/