

From: László Sándor <sandorl@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)

Date: Mon, 2 Apr 2012 23:44:05 -0400

Thanks for this, Nick.

I found the (plentiful and embarrassing) mistakes in my code; below is a neater version that also actually does what it should, or so it seems.

That said, it is still rarely faster than logging -tab, sum()-, though with many millions of observations, running on many (>4) cores, it at least has a little advantage. (But both beat my bare-bones Mata attempts.)

I would still be a bit curious how secret the secret sauce of StataCorp is for this, as this "collapsing" is pretty commonplace for many descriptives (also bar graphs, line graphs, etc.), and while they are rightly proud if they could tweak -tabulate- to run this fast, they perhaps could let us (and themselves) work towards making other similar code run faster too. Though, of course, there must be a reason (general purpose etc.) why this is harder elsewhere.

Thanks again,

Laszlo

tempvar wsum tag
if ("`y2_var'"!="") local y2 y2
else local y2 ""
sort `x_q' `touse'
by `x_q' `touse': g byte `tag' = _n == _N
if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
else by `x_q' `touse': g `wsum' = _N
foreach v in x y `y2' {
    if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
    else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
    quietly replace ``v'_mean' = cond(`tag' & `touse',``v'_mean'/`wsum',.)
}

On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>
> I will look at it tomorrow.
>
> 2012/4/2 László Sándor <sandorl@gmail.com>:
> > Nick,
> >
> > thanks, I did follow up with your post. Sadly, I could not easily get
> > -by- working, or to be precise, to use the variables that it
> > generated. Below I have an attempt, if I can take liberty with your
> > time and expect you to parse it, I am grateful for comments to get it
> > working -- the indexing must be off. It tries to average two (x_r and
> > y_r) or three (y2_r extra) variables. It generates too large values
> > for some bins (i.e.
> > from U[0,1] variables some averages become larger
> > than 20.)
> >
> > I am happy if someone from StataCorp follows up too! :)
> >
> > Thanks,
> >
> > László
> >
> > tempvar wsum tag ones
> > g byte `ones' = 1
> >
> > if ("`y2_var'"!="") local y2 y2
> > else local y2 ""
> >
> > if ("`weight1'"!="") g `wsum' = sum(`weight1') if `touse'
> > else g `wsum' = sum(`ones') if `touse'
> >
> > sort `x_q'
> > by `x_q': g byte `tag' = _N if `touse'
> >
> > foreach v in x y `y2' {
> >     if "`weight1'"!=""{
> >         by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1') if `touse'
> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
> >     }
> >
> >     else {
> >         by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
> >     }
> > }
> >
> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> >>
> >> We are back to the questions you asked a week ago. Mostly this is for
> >> StataCorp. Otherwise please see again my answers at
> >>
> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
> >>
> >> I've had dramatic speed-ups with Mata -- my record is reducing
> >> execution time from 5 days to 2 minutes, but that was partly because
> >> my original code was so dumb -- but I've not tried anything like the
> >> stuff you were using.
> >>
> >> -tabulate, summarize- is compiled C code. I think the nearest you can
> >> get is by using -by:- as explained in the post just quoted.
> >>
> >> Nick
> >>
> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
> >> > Hi all,
> >> >
> >> > I had several questions recently on this list about compiling Mata
> >> > code. I still could not deal with generating the compile time locals
> >> > with loops, but I typed them out and compiled. Now I had my test runs
> >> > but they are surprising. Let me ask you why:
> >> >
> >> > My basic problem was to do a fast "collapse" to make binned scatter
> >> > plots.
> >> > Collapse was unacceptably slow, probably because of the
> >> > necessary preserve-restore cycles, or inefficient coding of collapse
> >> > (for its general purpose).
> >> >
> >> > I already had a version that parsed a log of -tabulate, summarize-.
> >> > Yes, it is as much of a hack as it sounds like. I was not expecting
> >> > this to be fast, at least because of the file I/O and the parsing.
> >> >
> >> > Now I built a Mata function that "collapses" into new variables with
> >> > leaving the data intact otherwise. For this I used Ben Jann's
> >> > -mf_mm_collapse-, and compiled all the necessary functions myself in
> >> > the ado file.
> >> >
> >> > And the test run with 100 million observations told me it was slower
> >> > than the hack. Before I give up and claim the hack unbeatable, I have
> >> > one suspicion. I had the test run on Stata 12 MP on a cluster, with
> >> > 12 cores. Perhaps -tabulate- used all of them, and my code did not.
> >> >
> >> > Are there guidelines how to speed up Mata in this situation (if it is
> >> > not MP-aware to begin with?).
> >> >
> >> > Or, tentatively, can I ask for some guidance about the magic of
> >> > -tabulate, summarize-? Is that magic accessible/reproducible without
> >> > just logging its output?
> >> >
> >> > Thanks,
> >> >
> >> > Laszlo
> >> > *
> >> > *   For searches and help try:
> >> > *   http://www.stata.com/help.cgi?search
> >> > *   http://www.stata.com/support/statalist/faq
> >> > *   http://www.ats.ucla.edu/stat/stata/
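
For readers who want to try the -by:- approach from this thread on something small, here is a minimal sketch on a toy dataset. It is not the poster's program: the variable names (bin, x, w, last, wsum, xsum, x_mean) and the six-observation data are hypothetical, chosen only so the weighted bin means can be checked by hand. It relies on sum() being a running sum, so the last row of each by-group holds the group total.

```stata
* Minimal sketch, assuming hypothetical variable names and toy data.
clear
set obs 6
generate byte   bin = cond(_n <= 3, 1, 2)           // two bins of three rows
generate double x   = _n                            // values 1..6
generate double w   = cond(mod(_n, 3) == 0, 2, 1)   // weights 1, 1, 2 per bin

sort bin
by bin: generate byte   last = (_n == _N)           // tag one row per bin
by bin: generate double wsum = sum(w)               // running sum of weights
by bin: generate double xsum = sum(x * w)           // running weighted sum
generate double x_mean = xsum / wsum if last        // bin mean on tagged row only

list bin x_mean if last

* Hand check: bin 1 is (1 + 2 + 2*3)/4, bin 2 is (4 + 5 + 2*6)/4.
assert x_mean == 2.25 if bin == 1 & last
assert x_mean == 5.25 if bin == 2 & last

* Cross-check against the built-in command discussed in the thread.
tabulate bin [aweight = w], summarize(x)
```

The data in memory stay intact, with the bin means stored on one tagged row per bin, which is the property the thread wanted from a "collapse" without preserve/restore. Whether this beats parsing a log of -tabulate, summarize- on large data is exactly the open question above, so it is worth timing both on your own data.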

**Follow-Ups**:

- **Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)** *From:* Nick Cox <njcoxstata@gmail.com>

**References**:

- **st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)** *From:* László Sándor <sandorl@gmail.com>
- **Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)** *From:* László Sándor <sandorl@gmail.com>
- **Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)** *From:* Nick Cox <njcoxstata@gmail.com>
