



Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
Date   Tue, 3 Apr 2012 16:58:52 +0100

I had a hack at my -binsm- (STB, SJ) to get a slightly more modern
flavour. Code follows examples.

At present you can have any bins you like so long as they are defined
by round(xvar, #) in the first instance. But if # is 0, the distinct
values of the x variable are used, so you can use a previously-defined
binning variable.
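
For readers unfamiliar with the idiom, here is a minimal sketch of what
binning by round() does, using the auto data. The variable name wbin and
the width 500 are illustrative assumptions only and are not part of
-binmean-.

```stata
sysuse auto, clear

* round(x, #) returns the multiple of # nearest to x, so each
* distinct value of wbin identifies one bin of width 500
generate wbin = round(weight, 500)
tabulate wbin

* with # equal to 0, round() returns its argument unchanged,
* so the distinct values of the x variable define the bins
assert round(weight, 0) == weight
```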

. sysuse auto, clear
(1978 Automobile Data)

. binmean mpg weight

. binmean mpg weight, width(100)

. binmean mpg weight, width(100) by(foreign, compact)

. binmean mpg weight, width(100) recast(bar) barw(100) base(0)

. binmean mpg weight, width(100) recast(connected)

. binmean turn trunk weight, width(100) recast(connected)

*! 1.0.0 NJC 3 April 2012
program binmean
	version 8.2
	syntax varlist(min=2 numeric) [if] [in] ///
	[ , Width(numlist max=1) BY(str) PLOT(str asis) ///
	ADDPLOT(str asis) * ]
	
	quietly {
		if "`width'" == "" local width 0
		local text "bin width `width'"

		if `"`by'"' != "" {
			gettoken byvar byrest : by, parse(,)
			gettoken comma byrest : byrest, parse(,)
			local byby `"by(`byvar', note(`text') `byrest')"'
		}
		else local byby "note(`text')"

		marksample touse
		if "`byvar'" != "" markout `touse' `byvar', strok
		count if `touse'
		if r(N) == 0 error 2000

		preserve
		keep if `touse'
		keep `varlist' `byvar'
		local nv : word count `varlist'
		local x : word `nv' of `varlist'
		local Y : list varlist - x

		tempvar xbin work
		clonevar `xbin' = `x'
		replace `xbin' = round(`x', `width') - 0.5 * `width'
				
		foreach y of local Y {
			tempvar ymean
			clonevar `ymean' = `y'
			bysort `xbin' `byvar' : replace `ymean' = sum(`y') / _N
			local yshow `yshow' `ymean'
		}

		bysort `xbin' `byvar': keep if _n == _N
	}

	scatter `yshow' `xbin', ///
	`byby' `options' ///
	|| `plot' ///
	|| `addplot'
end
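
Stripped of option handling, the core step in the program (a running
sum under -by:-, with the mean ending up in the last observation of each
bin) can be sketched as below. This is only an illustration on the auto
data, not a replacement for the program: the names xbin, ymean and check
and the width 500 are assumptions made for the example, and the -egen-
call is there purely as a cross-check.

```stata
sysuse auto, clear
keep if !missing(mpg, weight)

* bins of width 500 on the x variable
generate xbin = round(weight, 500)

* reference bin means, computed first so that the sort order
* set by -bysort- below is not disturbed afterwards
egen double check = mean(mpg), by(xbin)

* sum() is a running sum within each bin; in the last observation
* of a bin it is the bin total, so dividing by _N gives the mean
bysort xbin: generate double ymean = sum(mpg) / _N
by xbin: assert reldif(ymean, check) < 1e-12 if _n == _N

* keep one observation per bin: the reduced dataset for graphing
by xbin: keep if _n == _N
list xbin ymean, noobs
```

Only the last observation of each bin holds the mean; earlier
observations hold partial sums divided by _N, which is why everything
else is dropped before graphing.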


2012/4/3 László Sándor <sandorl@gmail.com>:
> Thanks, Nick, this is very helpful.
>
> -binsm- does something different, but I'll have a look and see what I
> could adapt from its source.
>
> -twoway__histogram_gen- is about frequencies still, but something like
> this is a great idea. Actually, if I could find a routine like this
> for bar or line graphs, it probably does what I need (and then I would
> be really surprised if that would still be slower than -tab, sum()-).
>
> Sadly, there is no twoway__line_gen or twoway__bar_gen, and other
> searches did not help.
>
> But this was very educational, thanks again!
>
> Laszlo
>
> On Tue, Apr 3, 2012 at 5:01 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>
>> Overnight I remembered -binsm-
>>
>> SJ-6-1  gr26_1  . . . . . . . . . . . . . . . . . .  Software update for binsm
>>        (help binsm if installed) . . . . . . . . . . . . . . . . .  N. J. Cox
>>        Q1/06   SJ 6(1):151
>>        rewritten to support modern Stata graphics
>>
>> STB-37  gr26  . . . . . . . . . . . Bin smoothing and summary on scatter plots
>>        (help binsm if installed) . . . . . . . . . . . . . . . . .  N. J. Cox
>>        5/97    pp.9--12; STB Reprints Vol 7, pp.59--63
>>        alternative to graph, twoway bands(); produces a scatterplot
>>        of yvar against xvar with one or more summaries of yvar for bins
>>        of xvar
>>
>> and -twoway__histogram_gen-
>>
>> SJ-5-2  gr0014  . . . . . . . Stata tip 20: Generating histogram bin variables
>>        . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. A. Harrison
>>        Q2/05   SJ 5(2):280--281                                 (no commands)
>>        tip illustrating the use of twoway__histogram_gen for
>>        creation of complex histograms and other graphs or tables
>>
>> My strategic advice is this. You want a reduced dataset for graphing,
>> so -drop- aggressively. Once you have identified observations "to
>> use", go
>>
>> keep if `touse'
>> drop `touse'
>>
>> Once the mean is in the last observation of every block of
>> observations, -drop- all the others.
>>
>>
>> 2012/4/3 László Sándor <sandorl@gmail.com>:
>> > Thanks for this, Nick.
>> >
>> > I found my (plentiful and embarrassing) mistakes in my code, below is a
>> > neater version that also actually does what it should, or so it seems.
>> >
>> > That said, it is still rarely faster than logging -tab, sum()- though
>> > with many millions of observations, running on many (>4) cores, it at
>> > least has a little advantage. (But both beat my bare bones Mata
>> > attempts.)
>> >
>> > I would still be a bit curious how secret the secret sauce of
>> > StataCorp is for this, as this "collapsing" is pretty commonplace for
>> > many descriptives (also bar graphs, line graphs etc), and while they
>> > are rightly proud if they could tweak -tabulate- to run this fast,
>> > they perhaps could let us (and themselves) work towards getting other
>> > similar code running faster too. Though, of course, there must be a
>> > reason (general purpose etc.) why this is harder elsewhere.
>> >
>> > Thanks again,
>> >
>> > Laszlo
>> >
>> > tempvar wsum tag
>> >
>> > if ("`y2_var'"!="") local y2 y2
>> > else local y2 ""
>> >
>> > sort `x_q' `touse'
>> > by `x_q' `touse': g byte `tag' = _n == _N
>> > if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
>> > else by `x_q' `touse': g `wsum' = _N
>> >
>> > foreach v in x y `y2' {
>> >        if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
>> >        else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
>> >
>> >        quietly replace ``v'_mean' = cond(`tag' & `touse',``v'_mean'/`wsum',.)
>> > }
>> >
>> > On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> >>
>> >> I will look at it tomorrow.
>> >>
>> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>> >> > Nick,
>> >> >
>> >> > thanks, I did follow up with your post. Sadly, I could not easily get
>> >> > -by- working, or to be precise, to use the variables that it
>> >> > generated. Below I have an attempt, if I can take liberty with your
>> >> > time and expect you to parse it, I am grateful for comments to get it
>> >> > working -- the indexing must be off. It tries to average two (x_r and
>> >> > y_r) or three (y2_r extra) variables. It generates too large values
>> >> > for some bins (i.e. from U[0,1] variables some averages become larger
>> >> > than 20.)
>> >> >
>> >> > I am happy if someone from StataCorp follows up too! :)
>> >> >
>> >> > Thanks,
>> >> >
>> >> > László
>> >> >
>> >> > tempvar wsum tag ones
>> >> > g byte `ones' = 1
>> >> >
>> >> >
>> >> > if ("`y2_var'"!="") local y2 y2
>> >> > else local y2 ""
>> >> >
>> >> >
>> >> > if ("`weight1'"!="") g `wsum' = sum(`weight1')  if `touse'
>> >> > else g `wsum' = sum(`ones')  if `touse'
>> >> >
>> >> >
>> >> > sort `x_q'
>> >> > by `x_q': g byte `tag' = _N if `touse'
>> >> >
>> >> > foreach v in x y `y2' {
>> >> > if "`weight1'"!=""{
>> >> > by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1')  if `touse'
>> >> > by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>> >> > }
>> >> >
>> >> > else {
>> >> > by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
>> >> > by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>> >> > }
>> >> > }
>> >> >
>> >> >
>> >> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> >> >>
>> >> >> We are back to the questions you asked a week ago. Mostly this is for
>> >> >> StataCorp. Otherwise please see again my answers at
>> >> >>
>> >> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
>> >> >>
>> >> >> I've had dramatic speed-ups with Mata -- my record is reducing
>> >> >> execution time from 5 days to 2 minutes, but that was partly because
>> >> >> my original code was so dumb -- but I've not tried anything like the
>> >> >> stuff you were using.
>> >> >>
>> >> >> -tabulate, summarize- is compiled C code. I think the nearest you can
>> >> >> get is by using -by:- as explained in the post just quoted.
>> >> >>
>> >> >> Nick
>> >> >>
>> >> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>> >> >> > Hi all,
>> >> >> >
>> >> >> > I had several questions recently on this list about compiling Mata
>> >> >> > code. I still could not deal with generating the compile time locals
>> >> >> > with loops, but I typed them out and compiled. Now I had my test runs
>> >> >> > but they are surprising. Let me ask you why:
>> >> >> >
>> >> >> > My basic problem was to do a fast "collapse" to make binned scatter
>> >> >> > plots. Collapse was unacceptably slow, probably because of the
>> >> >> > necessary preserve-restore cycles, or inefficient coding of collapse
>> >> >> > (for its general purpose).
>> >> >> >
>> >> >> > I already had a version that parsed a log of -tabulate, summarize-.
>> >> >> > Yes, it is as much of a hack as it sounds like. I was not expecting
>> >> >> > this to be fast, at least because of the file I/O and the parsing.
>> >> >> >
>> >> >> > Now I built a Mata function that "collapses" into new variables while
>> >> >> > leaving the data otherwise intact. For this I used Ben Jann's
>> >> >> > -mf_mm_collapse-, and compiled all the necessary functions myself in
>> >> >> > the ado file.
>> >> >> >
>> >> >> > And the test run with 100 million observations told me it was slower
>> >> >> > than the hack. Before I give up and claim the hack unbeatable, I have
>> >> >> > one suspicion. I had the test run on Stata 12 MP on a cluster, with 12
>> >> >> > cores. Perhaps -tabulate- used all of them, and my code did not.
>> >> >> >
>> >> >> > Are there guidelines on how to speed up Mata in this situation (if it is
>> >> >> > not MP-aware to begin with)?
>> >> >> >
>> >> >> > Or, tentatively, can I ask for some guidance about the magic of
>> >> >> > -tabulate, summarize-? Is that magic accessible/reproducible without
>> >> >> > just logging its output?
>> >> >> >
>>
