Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
Date   Tue, 3 Apr 2012 17:05:11 +0100

I don't know what that - 0.5 * width term is doing there. Some ancient
illogic, I guess.

On Tue, Apr 3, 2012 at 4:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> I had a hack at my -binsm- (STB, SJ) to get a slightly more modern
> flavour. Code follows examples.
>
> At present you can have any bins you like so long as they are defined
> by round(xvar, #) in the first instance. But if # is 0, the distinct
> values of the x variable are used, so you can use a previously-defined
> binning variable.
>
> . sysuse auto, clear
> (1978 Automobile Data)
>
> . binmean mpg weight
>
> . binmean mpg weight, width(100)
>
> . binmean mpg weight, width(100) by(foreign, compact)
>
> . binmean mpg weight, width(100) recast(bar) barw(100) base(0)
>
> . binmean mpg weight, width(100) recast(connected)
>
> . binmean turn trunk weight, width(100) recast(connected)
>
> *! 1.0.0 NJC 3 April 2012
> program binmean
>        version 8.2
>        syntax varlist(min=2 numeric) [if] [in] ///
>        [ , Width(numlist min=1) BY(str) PLOT(str asis) ///
>        ADDPLOT(str asis) * ]
>
>        quietly {
>                if "`width'" == "" local width 0
>                local text "bin width `width'"
>
>                if `"`by'"' != "" {
>                        gettoken byvar byrest : by, parse(,)
>                        gettoken comma byrest : byrest, parse(,)
>                        local byby `"by(`byvar', note(`text') `byrest')"'
>                }
>                else local byby "note(`text')"
>
>                marksample touse
>                if "`byvar'" != "" markout `touse' `byvar', strok
>                count if `touse'
>                if r(N) == 0 error 2000
>
>                preserve
>                keep if `touse'
>                keep `varlist' `byvar'
>                local nv : word count `varlist'
>                local x : word `nv' of `varlist'
>                local Y : list varlist - x
>
>                tempvar xbin work
>                clonevar `xbin' = `x'
>                replace `xbin' = round(`x', `width') - 0.5 * `width'
>
>                foreach y of local Y {
>                        tempvar ymean
>                        clonevar `ymean' = `y'
>                        bysort `xbin' `byvar' : replace `ymean' = sum(`y') / _N
>                        local yshow `yshow' `ymean'
>                }
>
>                bysort `xbin' `byvar': keep if _n == _N
>        }
>
>        scatter `yshow' `xbin', ///
>        `byby' `options' || ///
>        || `plot' ///
>        || `addplot'
> end
>
>
> 2012/4/3 László Sándor <sandorl@gmail.com>:
>> Thanks, Nick, this is very helpful.
>>
>> -binsm- does something different, but I'll have a look and see what I
>> could adapt from its source.
>>
>> -twoway__histogram_gen- is about frequencies still, but something like
>> this is a great idea. Actually, if I could find a routine like this
>> for bar or line graphs, it probably does what I need (and then I would
>> be really surprised if that would still be slower than -tab, sum()-).
>>
>> Sadly, there is no twoway__line_gen or twoway__bar_gen, and other
>> searches did not help.
>>
>> But this was very educational, thanks again!
>>
>> Laszlo
>>
>> On Tue, Apr 3, 2012 at 5:01 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>
>>> Overnight I remembered -binsm-
>>>
>>> SJ-6-1  gr26_1  . . . . . . . . . . . . . . . . . .  Software update for binsm
>>>        (help binsm if installed) . . . . . . . . . . . . . . . . .  N. J. Cox
>>>        Q1/06   SJ 6(1):151
>>>        rewritten to support modern Stata graphics
>>>
>>> STB-37  gr26  . . . . . . . . . . . Bin smoothing and summary on scatter plots
>>>        (help binsm if installed) . . . . . . . . . . . . . . . . .  N. J. Cox
>>>        5/97    pp.9--12; STB Reprints Vol 7, pp.59--63
>>>        alternative to graph, twoway bands(); produces a scatterplot
>>>        of yvar against xvar with one or more summaries of yvar for bins
>>>        of xvar
>>>
>>> and -twoway__histogram_gen-
>>>
>>> SJ-5-2  gr0014  . . . . . . . Stata tip 20: Generating histogram bin variables
>>>        . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. A. Harrison
>>>        Q2/05   SJ 5(2):280--281                                 (no commands)
>>>        tip illustrating the use of twoway__histogram_gen for
>>>        creation of complex histograms and other graphs or tables
>>>
>>> My strategic advice is this. You want a reduced dataset for graphing,
>>> so -drop- aggressively. Once you have identified observations "to
>>> use", go
>>>
>>> keep if `touse'
>>> drop `touse'
>>>
>>> Once the mean is in the last observation of every block of
>>> observations, -drop- all the others.
>>>
>>>
>>> 2012/4/3 László Sándor <sandorl@gmail.com>:
>>> > Thanks for this, Nick.
>>> >
>>> > I found my (many and embarrassing) mistakes in my code; below is a
>>> > neater version that also actually does what it should, or so it seems.
>>> >
>>> > That said, it is still rarely faster than logging -tab, sum()- though
>>> > with many millions of observations, running on many (>4) cores, it at
>>> > least has a little advantage. (But both beat my bare bones Mata
>>> > attempts.)
>>> >
>>> > I would still be a bit curious how secret the secret sauce of
>>> > StataCorp is for this, as this "collapsing" is pretty commonplace for
>>> > many descriptives (also bar graphs, line graphs etc), and while they
>>> > are rightly proud if they could tweak -tabulate- to run this fast,
>>> > they perhaps could let us (and themselves) work towards other
>>> > similar code also running faster. Though, of course, there must be a
>>> > reason (general purpose etc.) why this is harder elsewhere.
>>> >
>>> > Thanks again,
>>> >
>>> > Laszlo
>>> >
>>> > tempvar wsum tag
>>> >
>>> > if ("`y2_var'"!="") local y2 y2
>>> > else local y2 ""
>>> >
>>> > sort `x_q' `touse'
>>> > by `x_q' `touse': g byte `tag' = _n == _N
>>> > if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
>>> > else by `x_q' `touse': g `wsum' = _N
>>> >
>>> > foreach v in x y `y2' {
>>> >        if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
>>> >        else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
>>> >
>>> >        quietly replace ``v'_mean' = cond(`tag' & `touse',``v'_mean'/`wsum',.)
>>> > }
>>> >
>>> > On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> >>
>>> >> I will look at it tomorrow.
>>> >>
>>> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>>> >> > Nick,
>>> >> >
>>> >> > thanks, I did follow up with your post. Sadly, I could not easily get
>>> >> > -by- working, or to be precise, to use the variables that it
>>> >> > generated. Below is an attempt. If I may take the liberty of asking
>>> >> > you to parse it, I would be grateful for comments on getting it
>>> >> > working -- the indexing must be off. It tries to average two (x_r and
>>> >> > y_r) or three (y2_r extra) variables. It generates values that are too
>>> >> > large for some bins (i.e., from U[0,1] variables some averages come out
>>> >> > larger than 20).
>>> >> >
>>> >> > I am happy if someone from StataCorp follows up too! :)
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > László
>>> >> >
>>> >> > tempvar wsum tag ones
>>> >> > g byte `ones' = 1
>>> >> >
>>> >> >
>>> >> > if ("`y2_var'"!="") local y2 y2
>>> >> > else local y2 ""
>>> >> >
>>> >> >
>>> >> > if ("`weight1'"!="") g `wsum' = sum(`weight1')  if `touse'
>>> >> > else g `wsum' = sum(`ones')  if `touse'
>>> >> >
>>> >> >
>>> >> > sort `x_q'
>>> >> > by `x_q': g byte `tag' = _N if `touse'
>>> >> >
>>> >> > foreach v in x y `y2' {
>>> >> > if "`weight1'"!=""{
>>> >> > by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1')  if `touse'
>>> >> > by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>> >> > }
>>> >> >
>>> >> > else {
>>> >> > by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
>>> >> > by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>> >> > }
>>> >> > }
>>> >> >
>>> >> >
>>> >> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> >> >>
>>> >> >> We are back to the questions you asked a week ago. Mostly this is for
>>> >> >> StataCorp. Otherwise please see again my answers at
>>> >> >>
>>> >> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
>>> >> >>
>>> >> >> I've had dramatic speed-ups with Mata -- my record is reducing
>>> >> >> execution time from 5 days to 2 minutes, but that was partly because
>>> >> >> my original code was so dumb -- but I've not tried anything like the
>>> >> >> stuff you were using.
>>> >> >>
>>> >> >> -tabulate, summarize- is compiled C code. I think the nearest you can
>>> >> >> get is by using -by:- as explained in the post just quoted.
>>> >> >>
>>> >> >> Nick
>>> >> >>
>>> >> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > I had several questions recently on this list about compiling Mata
>>> >> >> > code. I still could not deal with generating the compile time locals
>>> >> >> > with loops, but I typed them out and compiled. Now I had my test runs
>>> >> >> > but they are surprising. Let me ask you why:
>>> >> >> >
>>> >> >> > My basic problem was to do a fast "collapse" to make binned scatter
>>> >> >> > plots. Collapse was unacceptably slow, probably because of the
>>> >> >> > necessary preserve-restore cycles, or inefficient coding of collapse
>>> >> >> > (for its general purpose).
>>> >> >> >
>>> >> >> > I already had a version that parsed a log of -tabulate, summarize-.
>>> >> >> > Yes, it is as much of a hack as it sounds like. I was not expecting
>>> >> >> > this to be fast, at least because of the file I/O and the parsing.
>>> >> >> >
>>> >> >> > Now I built a Mata function that "collapses" into new variables while
>>> >> >> > leaving the data otherwise intact. For this I used Ben Jann's
>>> >> >> > -mf_mm_collapse-, and compiled all the necessary functions myself in
>>> >> >> > the ado file.
>>> >> >> >
>>> >> >> > And the test run with 100 million observations told me it was slower
>>> >> >> > than the hack. Before I give up and claim the hack unbeatable, I have
>>> >> >> > one suspicion. I had the test run on Stata 12 MP on a cluster, with
>>> >> >> > 12 cores. Perhaps -tabulate- used all of them, and my code did not.
>>> >> >> >
>>> >> >> > Are there guidelines on how to speed up Mata in this situation (if it
>>> >> >> > is not MP-aware to begin with)?
>>> >> >> >
>>> >> >> > Or, tentatively, can I ask for some guidance about the magic of
>>> >> >> > -tabulate, summarize-? Is that magic accessible/reproducible without
>>> >> >> > just logging its output?
>>> >> >> >
>>>
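
The -by:- strategy discussed in this thread (a running sum within each bin, then keeping only the last observation of each bin) can be sketched roughly as follows. This is a minimal illustration, not code from any of the posters: the auto dataset, the bin width of 500, and the variable names xbin and ymean are assumptions chosen for the example, and no half-width offset is applied to the bin values.

```stata
* Sketch only: bin means via -by:-, reduced to one observation per bin.
* Assumed for illustration: auto data, width 500, names xbin and ymean.
sysuse auto, clear
local width 500

* Bin the x variable by rounding to the nearest multiple of the width
generate double xbin = round(weight, `width')

* Within each bin, sum() is a running sum and _N is the bin size, so the
* last observation of each bin ends up holding the bin mean.
* (sum() treats missing values as zero, so this assumes mpg has none.)
bysort xbin : generate double ymean = sum(mpg) / _N

* Drop aggressively: keep only the last observation of each bin
by xbin : keep if _n == _N

list xbin ymean, noobs
```

Because the dataset is reduced in place, a real program would wrap this in -preserve- (as -binmean- does above) or write the means to new variables instead, which is the trade-off the thread is concerned with.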
