



st: question: how to collapse data fast for simplified, binned scatter plots


From   László Sándor <sandorl@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: question: how to collapse data fast for simplified, binned scatter plots
Date   Mon, 26 Mar 2012 18:12:13 -0400

Hi all,

I have a relatively simple goal, but I am not sure which is the most
efficient way to achieve it. Let me describe what it aims to be and
how I currently do it under Stata 10.1 for Windows, and then please
comment on whether it could be faster.

Basically, I want to simplify scatter plots: in vast datasets it is
more informative to plot means (or some quantiles) of y against "bins"
of x, where the bins are best defined by quantiles of x (i.e. have
even frequencies in the bins instead of, say, even raw distances
between the bins). The graphs could look like the second graph here:
http://obs.rc.fas.harvard.edu/chetty/value_added.html
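
A minimal sketch of the direct approach described above. The variable
names y, x and xbin and the choice of 20 bins are hypothetical
placeholders for illustration, not taken from the actual data:

```stata
* Hypothetical names: y (outcome), x (regressor); 20 bins is arbitrary.
preserve
* Bin x into 20 quantile groups so each bin has roughly equal frequency
xtile xbin = x, nquantiles(20)
* Reduce to one observation per bin: within-bin means of y and x
collapse (mean) y x, by(xbin) fast
* Plot the bin means of y against the bin means of x
twoway scatter y x
restore
```

Swapping (mean) for, say, (median) or (p25) in the -collapse- call
would plot a quantile of y instead of its mean.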

Ideally, I could also add a plot of a linear fit later on, or perhaps
plot multiple y variables against the same x, or a single y broken
down by a categorical z, or two different quantiles of the same y.
Also, for some applications I would want to plot only the residual
from some linear fit (including an -areg- that absorbs the averages
within some categories).
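
The residual variant could be sketched along these lines. All names
here (y, x, w, id, yres, xbin) are hypothetical, and this is only one
possible reading of the goal, not a tested or tuned solution:

```stata
* Hypothetical names: y (outcome), x (regressor), w (control), id (absorbed category)
preserve
* Partial out w and the id-level means from y
areg y w, absorb(id)
predict double yres, residuals
* Bin x into 20 equal-frequency groups (20 is arbitrary)
xtile xbin = x, nquantiles(20)
collapse (mean) yres x, by(xbin) fast
* Note: lfit here is fitted to the bin means, not to the underlying observations
twoway (scatter yres x) (lfit yres x)
restore
```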

I am not aware of anything built in for this. Once one has the bins
of x, it is not that hard to summarize y within each bin. However,
-collapse- is surprisingly slow at this (at least with millions or
tens of millions of observations), and I had to use a workaround
with -tabulate- and more.

I am puzzled that the workaround could be faster than -collapse-, but
so it seems. Basically: if -collapse- is not the fastest tool for this
(even with the -fast- option), then what is? What does -twoway bar-
use underneath, for example? What does -tabulate, summarize- use
behind the scenes?

Would you suggest an alternative route? Something more efficient?
Something built-in? Some polished user-written tool?

Thank you very much,

Laszlo