Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: question: how to collapse data fast for simplified, binned scatter plots

 From László Sándor To statalist@hsphsun2.harvard.edu Subject Re: st: question: how to collapse data fast for simplified, binned scatter plots Date Thu, 29 Mar 2012 11:36:27 -0400

```Hi again,

I rolled up my sleeves and tried to solve this with Mata. I think
anything else would have used a preserve-restore cycle which is too
much of a cost with my data. But please bear with me and help me
figure out how this could work with Mata.

I found an excellent promising function mm_collapse in Ben Jann's
-moremata-, and I copied all the relevant function definitions to the
end of my ado file. (Long story short: I cannot install third-party
packages where I need to run this.) These functions are mm_collapse,
mm_panels, mm_calff (and their respective families).

Then I hoped to define my own function that would make new variables
with their first few values filled with the averages for the bins of
the original (to-be-plotted) y and x. (This is not elegant as the bins
have nothing to do with the first few observations otherwise, I know.)
It is a void function that takes only existing and new (temp)variable
names as inputs.

However, I get a compile time error and could not dig deeper however I
tried. Please review the error message and the function below.
Probably a quick glance would catch my error, and I would be very
grateful.

ERROR MESSAGE:
nothing found where subexp expected
(131 lines skipped)
r(3000);

MY FUNCTION:
mata:

void coll2(string scalar ym,
string scalar yr,
string scalar xm,
string scalar xr,
string scalar xq,
| string scalar weight) {
real colvector yrvar, xrvar, xqvar, ymvar, xmvar, w
real matrix YX
st_view(yrvar,.,yr)
st_view(xrvar,.,xr)
st_view(xqvar,.,xq)
st_view(YX, ., st_addvar("double", ym + " " + xm))
if (args()<6 | weight=="") w = J(rows(xqvar),1,1)
else st_view(w,.,weight)
YX[.,.] = mm_collapse((yrvar,xrvar),w,xqvar)[.,2..3]
}
end

Yes, I could image a conformability error in the last line, so I might
need an intermediate step to store the output of mm_collapse just to
be able to count its rows and know how many rows of YX to update,
then. But the error does not seem to come from there, is it?

Laszlo

2012/3/27 László Sándor <sandorl@gmail.com>
>
> Thank you, Nick, this is excellent, as always.
>
> That said, I have still have slight hopes that as the run rises (has
> risen) in Texas, perhaps someone from StataCorp who know the "compiled
> C code" of -tabulate- would comment on whether it is possible that the
> fastest way to produce what I meant is to -tabulate, summarize-, log
> it, and read and parse the log file. This is our workaround now, and
> according to our tests, faster than -egen- or -collapse-, though we
> have not tried something in Mata.
>
> I checked the source of -contract-, and it gives me hope that simpler
> sorts and indexing in Stata (not Mata) can solve this.
>
> Collapse is a useful command, even came in as second in the recent
> contest for the most useful Stata feature on Facebook. My use would be
> an excellent case in point. Can it be really simply that slow that
> -tab, sum- and log-parsing beats it? (OK, most likely Mata is the
> best. But still?!)
>
> Laszlo
>
> On Mon, Mar 26, 2012 at 8:10 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> >
> > -tabulate- is a built-in command, namely compiled C code. If you want
> > to look at the code, you will need to get a developer's job at
> > StataCorp, but that in essence is why it is fast.
> >
> > -collapse- by contrast is lots of Stata code to interpret. You can
> > look at it in any text editor. including -doedit-.
> >
> > I remember when writing the first version of what is now -contract-
> > that -collapse- was not very fast, which was indeed one reason for
> > writing -contract-. So, that is a first tip: look at the code for
> > -contract- to see what it does.
> >
> > In essence, to get sums and counts you can use code like this:
> >
> > drop if missing(foo)
> > bysort groupvar : gen sum = sum(foo)
> > by groupvar : gen count = _N
> > by groupvar : keep if _n == _N
> >
> > Once you have sums and counts, you clearly can get means.
> >
> > Another tip is to do your calculations in Mata.
> >
> > In short, much of the code for -collapse- is scaffolding to make it
> > general enough for many problems. Your own code focusing on what you
> > want should be faster.
> >
> > Nick
> >
> > 2012/3/26 László Sándor <sandorl@gmail.com>:
> >
> > > I have a relatively simple goal, but I am not sure which is the most
> > > efficient way to achieve it. Let me describe what it aims to be and
> > > how I currently do it under Stata 10.1 for Windows, and then please
> > > comment on whether it could be faster.
> > >
> > > Basically, I want to clarify scatter plots, as in vast datasets it is
> > > more informative to plot means (or some quantiles) of y against "bins"
> > > of x, where actually it is informative to use some quantiles to bin x
> > > (i.e. have even frequencies in the bins instead of, say, even raw
> > > distances between the bins). Basically, the graphs could like the
> > > second graph here:
> > >
> > > Yes, it would be great if I could add a plot of linear fit later on,
> > > or perhaps plot multiple y variables against the same x, or a single y
> > > broken down by a categorical z, or two different quantiles of the same
> > > y. Also, for some applications I would want to plot only a residual
> > > after some linear fit (including an -areg- absorbing for some averages
> > > in some categories).
> > >
> > > I am not aware of anything built in for this. But once one has the
> > > bins of x, it is not that hard to collect the y against it. However,
> > > -collapse- is surprisingly slow in this regard (at least with millions
> > > or tens of millions of observations), and I had to use a workaround
> > > with tabulate and more.
> > >
> > > I am puzzled that this could be faster than -collapse-, but so it
> > > seems. Basically: if -collapse- is not the fastest tool for this (with
> > > the fast option), then what is? What does -twoway bar- use underneath,
> > > for example? What does -tabulate, summarize- use behind the scenes?
> > >
> > > Would you suggest an alternative route? Something more efficient?
> > > Something built-in? Some polished user-written tool?
> >
> > *
> > *   For searches and help try:
> > *   http://www.stata.com/help.cgi?search
> > *   http://www.stata.com/support/statalist/faq
> > *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```