Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
László Sándor <sandorl@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: question: how to collapse data fast for simplified, binned scatter plots |

Date |
Tue, 27 Mar 2012 10:40:41 -0400 |

Thank you, Nick, this is excellent, as always. That said, I have still have slight hopes that as the run rises (has risen) in Texas, perhaps someone from StataCorp who know the "compiled C code" of -tabulate- would comment on whether it is possible that the fastest way to produce what I meant is to -tabulate, summarize-, log it, and read and parse the log file. This is our workaround now, and according to our tests, faster than -egen- or -collapse-, though we have not tried something in Mata. I checked the source of -contract-, and it gives me hope that simpler sorts and indexing in Stata (not Mata) can solve this. Collapse is a useful command, even came in as second in the recent contest for the most useful Stata feature on Facebook. My use would be an excellent case in point. Can it be really simply that slow that -tab, sum- and log-parsing beats it? (OK, most likely Mata is the best. But still?!) Laszlo On Mon, Mar 26, 2012 at 8:10 PM, Nick Cox <njcoxstata@gmail.com> wrote: > > -tabulate- is a built-in command, namely compiled C code. If you want > to look at the code, you will need to get a developer's job at > StataCorp, but that in essence is why it is fast. > > -collapse- by contrast is lots of Stata code to interpret. You can > look at it in any text editor. including -doedit-. > > I remember when writing the first version of what is now -contract- > that -collapse- was not very fast, which was indeed one reason for > writing -contract-. So, that is a first tip: look at the code for > -contract- to see what it does. > > In essence, to get sums and counts you can use code like this: > > drop if missing(foo) > bysort groupvar : gen sum = sum(foo) > by groupvar : gen count = _N > by groupvar : keep if _n == _N > > Once you have sums and counts, you clearly can get means. > > Another tip is to do your calculations in Mata. > > In short, much of the code for -collapse- is scaffolding to make it > general enough for many problems. Your own code focusing on what you > want should be faster. > > Nick > > 2012/3/26 László Sándor <sandorl@gmail.com>: > > > I have a relatively simple goal, but I am not sure which is the most > > efficient way to achieve it. Let me describe what it aims to be and > > how I currently do it under Stata 10.1 for Windows, and then please > > comment on whether it could be faster. > > > > Basically, I want to clarify scatter plots, as in vast datasets it is > > more informative to plot means (or some quantiles) of y against "bins" > > of x, where actually it is informative to use some quantiles to bin x > > (i.e. have even frequencies in the bins instead of, say, even raw > > distances between the bins). Basically, the graphs could like the > > second graph here: > > http://obs.rc.fas.harvard.edu/chetty/value_added.html > > > > Yes, it would be great if I could add a plot of linear fit later on, > > or perhaps plot multiple y variables against the same x, or a single y > > broken down by a categorical z, or two different quantiles of the same > > y. Also, for some applications I would want to plot only a residual > > after some linear fit (including an -areg- absorbing for some averages > > in some categories). > > > > I am not aware of anything built in for this. But once one has the > > bins of x, it is not that hard to collect the y against it. However, > > -collapse- is surprisingly slow in this regard (at least with millions > > or tens of millions of observations), and I had to use a workaround > > with tabulate and more. > > > > I am puzzled that this could be faster than -collapse-, but so it > > seems. Basically: if -collapse- is not the fastest tool for this (with > > the fast option), then what is? What does -twoway bar- use underneath, > > for example? What does -tabulate, summarize- use behind the scenes? > > > > Would you suggest an alternative route? Something more efficient? > > Something built-in? Some polished user-written tool? > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: question: how to collapse data fast for simplified, binned scatter plots***From:*László Sándor <sandorl@gmail.com>

**References**:**st: question: how to collapse data fast for simplified, binned scatter plots***From:*László Sándor <sandorl@gmail.com>

**Re: st: question: how to collapse data fast for simplified, binned scatter plots***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**st: GLLAMM question** - Next by Date:
**st: work with larger string length** - Previous by thread:
**Re: st: question: how to collapse data fast for simplified, binned scatter plots** - Next by thread:
**Re: st: question: how to collapse data fast for simplified, binned scatter plots** - Index(es):