



Re: st: question: how to collapse data fast for simplified, binned scatter plots


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: question: how to collapse data fast for simplified, binned scatter plots
Date   Tue, 27 Mar 2012 01:10:10 +0100

-tabulate- is a built-in command, namely compiled C code. If you want
to look at the code, you will need to get a developer's job at
StataCorp, but that in essence is why it is fast.

-collapse- by contrast is lots of Stata code to interpret. You can
look at it in any text editor, including -doedit-.

I remember when writing the first version of what is now -contract-
that -collapse- was not very fast, which was indeed one reason for
writing -contract-. So, that is a first tip: look at the code for
-contract- to see what it does.

In essence, to get sums and counts you can use code like this:

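* keep only observations with nonmissing foo
drop if missing(foo)
* running sum within group; the last observation holds the group total
bysort groupvar : gen sum = sum(foo)
* number of observations in each group
by groupvar : gen count = _N
* reduce to one observation per group: the last, which carries the total
by groupvar : keep if _n == _N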

Once you have sums and counts, you clearly can get means.
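
For the binned scatter plot in the question, the same idea might be sketched as follows. The variable names y, x, xbin, ysum, xsum, n, ymean and xmean are hypothetical, the choice of 20 bins is arbitrary, and using -xtile- to form equal-frequency bins is my assumption about how the bins are made. The code overwrites the data in memory, so wrap it in -preserve- and -restore- if the original data are still needed.

```stata
* hypothetical names: y is the outcome, x is the variable to be binned
drop if missing(x, y)

* 20 equal-frequency bins of x (the number 20 is arbitrary)
xtile xbin = x, nquantiles(20)

* running sums within bin; the last observation in each bin holds the totals
bysort xbin : gen double ysum = sum(y)
by xbin : gen double xsum = sum(x)
by xbin : gen long n = _N

* reduce to one observation per bin
by xbin : keep if _n == _N

* means are sums divided by counts
gen double ymean = ysum / n
gen double xmean = xsum / n

scatter ymean xmean
```

Storing the sums as -double- is a precaution against rounding when there are millions of observations per bin.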

Another tip is to do your calculations in Mata.
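
A rough sketch of what that could look like, assuming a numeric groupvar and a foo with no missing values (for example, after the -drop- above). This is one possible approach using -panelsetup()-, not necessarily the fastest one, and it should be timed against the alternatives on your own data.

```stata
* hypothetical names as above: groupvar (numeric), foo
* panelsetup() expects the data to be sorted by the group column
sort groupvar

mata:
    // copy the two variables into a matrix
    X = st_data(., ("groupvar", "foo"))

    // one row per group: first and last row number of that group in X
    info = panelsetup(X, 1)

    // results: group value in column 1, group mean of foo in column 2
    means = J(rows(info), 2, .)
    for (i = 1; i <= rows(info); i++) {
        P = panelsubmatrix(X, i, info)
        means[i, 1] = P[1, 1]
        means[i, 2] = mean(P[., 2])
    }

    // display the results
    means
end
```

With very large datasets, a view set up with -st_view()- instead of a copy from -st_data()- may save memory.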

In short, much of the code for -collapse- is scaffolding to make it
general enough for many problems. Your own code focusing on what you
want should be faster.

Nick

2012/3/26 László Sándor <[email protected]>:

> I have a relatively simple goal, but I am not sure which is the most
> efficient way to achieve it. Let me describe what it aims to be and
> how I currently do it under Stata 10.1 for Windows, and then please
> comment on whether it could be faster.
>
> Basically, I want to clarify scatter plots, as in vast datasets it is
> more informative to plot means (or some quantiles) of y against "bins"
> of x, where actually it is informative to use some quantiles to bin x
> (i.e. have even frequencies in the bins instead of, say, even raw
> distances between the bins). Basically, the graphs could look like the
> second graph here:
> http://obs.rc.fas.harvard.edu/chetty/value_added.html
>
> Yes, it would be great if I could add a plot of linear fit later on,
> or perhaps plot multiple y variables against the same x, or a single y
> broken down by a categorical z, or two different quantiles of the same
> y. Also, for some applications I would want to plot only a residual
> after some linear fit (including an -areg- absorbing for some averages
> in some categories).
>
> I am not aware of anything built in for this. But once one has the
> bins of x, it is not that hard to collect the y against it. However,
> -collapse- is surprisingly slow in this regard (at least with millions
> or tens of millions of observations), and I had to use a workaround
> with tabulate and more.
>
> I am puzzled that this could be faster than -collapse-, but so it
> seems. Basically: if -collapse- is not the fastest tool for this (with
> the fast option), then what is? What does -twoway bar- use underneath,
> for example? What does -tabulate, summarize- use behind the scenes?
>
> Would you suggest an alternative route? Something more efficient?
> Something built-in? Some polished user-written tool?
