Nick Cox <njcoxstata@gmail.com>

statalist@hsphsun2.harvard.edu

Re: st: question: how to collapse data fast for simplified, binned scatter plots

Tue, 27 Mar 2012 01:10:10 +0100

-tabulate- is a built-in command, namely compiled C code. If you want
to look at the code, you will need to get a developer's job at
StataCorp, but that in essence is why it is fast.

-collapse- by contrast is lots of Stata code to interpret. You can
look at it in any text editor, including -doedit-.

I remember when writing the first version of what is now -contract-
that -collapse- was not very fast, which was indeed one reason for
writing -contract-. So, that is a first tip: look at the code for
-contract- to see what it does.

In essence, to get sums and counts you can use code like this:

drop if missing(foo)
bysort groupvar : gen sum = sum(foo)
by groupvar : gen count = _N
by groupvar : keep if _n == _N

Once you have sums and counts, you clearly can get means.

Another tip is to do your calculations in Mata.

In short, much of the code for -collapse- is scaffolding to make it
general enough for many problems. Your own code focusing on what you
want should be faster.

Nick

2012/3/26 László Sándor <sandorl@gmail.com>:
> I have a relatively simple goal, but I am not sure which is the most
> efficient way to achieve it. Let me describe what it aims to be and
> how I currently do it under Stata 10.1 for Windows, and then please
> comment on whether it could be faster.
>
> Basically, I want to clarify scatter plots, as in vast datasets it is
> more informative to plot means (or some quantiles) of y against "bins"
> of x, where actually it is informative to use some quantiles to bin x
> (i.e. have even frequencies in the bins instead of, say, even raw
> distances between the bins). Basically, the graphs could look like the
> second graph here:
> http://obs.rc.fas.harvard.edu/chetty/value_added.html
>
> Yes, it would be great if I could add a plot of linear fit later on,
> or perhaps plot multiple y variables against the same x, or a single y
> broken down by a categorical z, or two different quantiles of the same
> y. Also, for some applications I would want to plot only a residual
> after some linear fit (including an -areg- absorbing for some averages
> in some categories).
>
> I am not aware of anything built in for this. But once one has the
> bins of x, it is not that hard to collect the y against it. However,
> -collapse- is surprisingly slow in this regard (at least with millions
> or tens of millions of observations), and I had to use a workaround
> with tabulate and more.
>
> I am puzzled that this could be faster than -collapse-, but so it
> seems. Basically: if -collapse- is not the fastest tool for this (with
> the fast option), then what is? What does -twoway bar- use underneath,
> for example? What does -tabulate, summarize- use behind the scenes?
>
> Would you suggest an alternative route? Something more efficient?
> Something built-in? Some polished user-written tool?
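
Putting the two messages together, here is a minimal sketch of how the sums-and-counts idea from the reply might be applied to the binned scatter plot described in the question. It is only an illustration under stated assumptions: the auto dataset shipped with Stata stands in for the real data, price plays the role of y and weight the role of x, the choice of 10 bins is arbitrary, and the variable names bin, sumy, sumx, n, meany and meanx are made up for the example. It makes no claim about how -collapse- or -contract- work internally.

```stata
* Assumption: the auto data stand in for the real dataset;
* price is y, weight is x.
sysuse auto, clear
keep price weight
drop if missing(price, weight)

* Equal-frequency bins of x; 10 quantile groups is an arbitrary choice.
xtile bin = weight, nquantiles(10)

* sum() is a running sum, so within each bin the last observation
* carries the bin total.
bysort bin : gen double sumy = sum(price)
by bin : gen double sumx = sum(weight)
by bin : gen long n = _N

* Keep one observation per bin: the last one, which holds the totals.
by bin : keep if _n == _N

* Means follow from sums and counts.
gen double meany = sumy / n
gen double meanx = sumx / n

* One point per bin: mean of y against mean of x.
scatter meany meanx
```

The design mirrors the reply: nothing here is more than -bysort-, running sums and _N, so there is little general-purpose scaffolding to interpret. Whether it is actually faster than -collapse, fast- on tens of millions of observations would need to be timed on the data at hand, and -xtile- itself has a cost on large datasets.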

