
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: question: how to collapse data fast for simplified, binned scatter plots


From   László Sándor <[email protected]>
To   [email protected]
Subject   Re: st: question: how to collapse data fast for simplified, binned scatter plots
Date   Tue, 27 Mar 2012 10:40:41 -0400

Thank you, Nick, this is excellent, as always.

That said, I still have slight hopes that, as the sun rises (has
risen) in Texas, someone from StataCorp who knows the "compiled
C code" of -tabulate- will comment on whether the fastest way to
produce what I meant really is to run -tabulate, summarize-, log the
output, and then read and parse the log file. This is our workaround
now, and according to our tests it is faster than -egen- or
-collapse-, though we have not tried anything in Mata.

I checked the source of -contract-, and it gives me hope that simpler
sorts and indexing in Stata (not Mata) can solve this.

-collapse- is a useful command; it even came in second in the recent
contest for the most useful Stata feature on Facebook. My use would be
an excellent case in point. Can it really be so slow that -tab, sum-
plus log parsing beats it? (OK, most likely Mata is the best. But
still?!)

Laszlo

On Mon, Mar 26, 2012 at 8:10 PM, Nick Cox <[email protected]> wrote:
>
> -tabulate- is a built-in command, namely compiled C code. If you want
> to look at the code, you will need to get a developer's job at
> StataCorp, but that in essence is why it is fast.
>
> -collapse- by contrast is lots of Stata code to interpret. You can
> look at it in any text editor, including -doedit-.
>
> I remember when writing the first version of what is now -contract-
> that -collapse- was not very fast, which was indeed one reason for
> writing -contract-. So, that is a first tip: look at the code for
> -contract- to see what it does.
>
> In essence, to get sums and counts you can use code like this:
>
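> * drop missings first, so that _N counts only nonmissing values
> drop if missing(foo)
> * running sum within group; the last observation holds the group total
> bysort groupvar : gen sum = sum(foo)
> by groupvar : gen count = _N
> * keep one observation per group (the last one, which carries the total)
> by groupvar : keep if _n == _N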
>
> Once you have sums and counts, you clearly can get means.
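
For concreteness, here is a minimal sketch of that last step, run on the
auto dataset shipped with Stata. The choice of -price- and -foreign-, and
the variable names total, count and mean, are only illustrative
assumptions, not anything from the thread.

```stata
* Sketch: group means from running sums and counts (illustrative names)
sysuse auto, clear
drop if missing(price)                        // so _N counts nonmissing values only
bysort foreign : gen double total = sum(price)  // running sum within group
by foreign : gen long count = _N              // group size
by foreign : keep if _n == _N                 // last obs holds the group total
gen double mean = total / count               // mean = sum / count
list foreign count mean
```

Storing the sum as a double is a precaution against rounding in large
groups; whether this beats -collapse- on a given dataset would need timing.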
>
> Another tip is to do your calculations in Mata.
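
A rough sketch of what a Mata version might look like, again on the auto
dataset. It assumes the data are already sorted by the group variable and
that neither variable has missing values; the names X, info and res are
made up for illustration, and no speed claim is implied.

```stata
* Sketch: group counts and means in Mata (illustrative names)
sysuse auto, clear
sort foreign

mata:
// Assumption: data sorted by the group variable, no missing values.
X    = st_data(., ("foreign", "price"))   // column 1: group, column 2: y
info = panelsetup(X, 1)                   // first and last row of each group
res  = J(rows(info), 3, .)                // group id, count, mean
for (i = 1; i <= rows(info); i++) {
    g = panelsubmatrix(X, i, info)        // rows belonging to group i
    res[i, .] = (g[1, 1], rows(g), mean(g[., 2]))
}
res
end
```

The loop visits each group once, so all the work after the initial sort is
a single pass over the data.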
>
> In short, much of the code for -collapse- is scaffolding to make it
> general enough for many problems. Your own code focusing on what you
> want should be faster.
>
> Nick
>
> 2012/3/26 László Sándor <[email protected]>:
>
> > I have a relatively simple goal, but I am not sure which is the most
> > efficient way to achieve it. Let me describe what it aims to be and
> > how I currently do it under Stata 10.1 for Windows, and then please
> > comment on whether it could be faster.
> >
> > Basically, I want to clarify scatter plots, as in vast datasets it is
> > more informative to plot means (or some quantiles) of y against "bins"
> > of x, where actually it is informative to use some quantiles to bin x
> > (i.e. have even frequencies in the bins instead of, say, even raw
> > distances between the bins). Basically, the graphs could look like the
> > second graph here:
> > http://obs.rc.fas.harvard.edu/chetty/value_added.html
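
One possible way to sketch such a binned scatter plot, using -xtile- for
quantile bins and running sums for the bin means. The dataset, the choice
of 10 bins, and the names bin, ybar and xbar are assumptions for
illustration only; the code also assumes no missing values in x or y,
since sum() treats missings as zero.

```stata
* Sketch: binned scatter plot with quantile bins (illustrative names)
sysuse auto, clear
xtile bin = weight, nquantiles(10)              // (roughly) equal-frequency bins of x
bysort bin : gen double ybar = sum(price) / _N  // running sum / bin size
by bin : gen double xbar = sum(weight) / _N
by bin : keep if _n == _N                       // last obs holds the bin means
twoway scatter ybar xbar
```

Plotting the mean of x within each bin, rather than the bin number, keeps
the horizontal axis in the original units of x.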
> >
> > Yes, it would be great if I could add a plot of linear fit later on,
> > or perhaps plot multiple y variables against the same x, or a single y
> > broken down by a categorical z, or two different quantiles of the same
> > y. Also, for some applications I would want to plot only a residual
> > after some linear fit (including an -areg- absorbing for some averages
> > in some categories).
> >
> > I am not aware of anything built in for this. But once one has the
> > bins of x, it is not that hard to collect the y against it. However,
> > -collapse- is surprisingly slow in this regard (at least with millions
> > or tens of millions of observations), and I had to use a workaround
> > with tabulate and more.
> >
> > I am puzzled that this could be faster than -collapse-, but so it
> > seems. Basically: if -collapse- is not the fastest tool for this (with
> > the fast option), then what is? What does -twoway bar- use underneath,
> > for example? What does -tabulate, summarize- use behind the scenes?
> >
> > Would you suggest an alternative route? Something more efficient?
> > Something built-in? Some polished user-written tool?
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

