Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


RE: st: efficient programming in mata - "meanby()" example


From   Andrew Maurer <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: efficient programming in mata - "meanby()" example
Date   Mon, 24 Feb 2014 18:31:16 +0000

Hi Sergiy,

Thanks for this explanation! Yes, I am using MP2 and when I -set processors 1- the times are much closer. This makes more sense now.

Andrew Maurer

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Sergiy Radyakin
Sent: Monday, February 24, 2014 12:12 PM
To: [email protected]
Subject: Re: st: efficient programming in mata - "meanby()" example

Andrew, are you using MP2?
Try -set processors 1- before running the benchmark to level the playing field. Stata does not let you write parallel code in Mata, although it uses parallelization in its own built-ins. After setting processors to 1, your code is not that bad (in 12.0):

. timer list 1
   1:      8.33 /        1 =       8.3290
. timer list 2
   2:      7.60 /        1 =       7.6000

There are probably ways to make it faster, but given the lack of parallelization in Mata there is little point in competing with true built-in C code.
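
As one illustration of a possible variant (not something proposed in the thread), the sketch below uses Mata's panelsetup() and panelsubmatrix() to find group boundaries and preallocates the result instead of appending rows in the loop. The name meanby_panel() is hypothetical, and the sketch assumes the data are already sorted by a single numeric id variable. Whether it is actually faster than the loop would need to be timed.

```stata
mata:
// hypothetical helper, not from the thread; assumes the data in memory
// are already sorted by a single numeric id variable
real matrix meanby_panel(string scalar idname, string scalar xname)
{
    real matrix id, x, info, out
    real scalar i

    // views onto the Stata data (no copy is made)
    st_view(id, ., idname)
    st_view(x, ., xname)

    // one row per group: first and last observation number
    info = panelsetup(id, 1)

    // preallocate the result: group id, group mean
    out = J(rows(info), 2, .)
    for (i=1; i<=rows(info); i++) {
        out[i,1] = id[info[i,1], 1]
        // mean() omits rows that contain missing values
        out[i,2] = mean(panelsubmatrix(x, i, info))
    }
    return(out)
}
end
```

After sorting by the id variable, it would be called as, for example, -mata: meanby_panel("panelid", "x")-.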

Note also that plugins will not be a solution: the plugin interface is not, as far as I know, thread-safe.

Hope this helps, Sergiy


On Mon, Feb 24, 2014 at 11:10 AM, Andrew Maurer <[email protected]> wrote:
> Hi Statalist,
>
> I'm trying to get an idea of how to program in Mata as efficiently as possible. I'm starting by trying to code a program in Mata that calculates means by group and runs at least as quickly as Stata's "collapse (mean)..., by()" command. My objective at the moment is to learn more about programming technique rather than to create something new.
>
> I've whittled down my code to what's shown in the meanby() function below. The idea is to:
>         1) Sort data by panel variable(s)
>         2) Loop through observations, keep a running sum of the "x" variable and of the count of nonmissing x values
>         3) Whenever a new panel is reached, write the previous panel's average to an "out" object and reset the running sums
>
> My version below takes 9 seconds, while Stata's collapse takes 5 seconds with the example data shown. Does anyone have any feedback on how I could improve my code or insight as to what's going on at a low level in Stata's "by... : gen..." syntax that makes it more efficient than the loop I've written?
>
> Thank you,
> Andrew Maurer
>
> ******* excerpt from Stata's "collapse" for comparison ************** 
> `by' gen `ty' `y' = sum(`w'*`x')/sum(cond(`x'<.,`w',0))
> `by' replace `y' = `y'[_N]
> sort `by'
> quietly by `by': keep if _n==_N
> ******* end Stata excerpt *******************************************
>
>
> ***** define meanby() function ********
> mata
>
> real matrix meanby(idname, xname)
> {
>         // sort data in stata
>         stata("sort " + idname)
>
>         // load data into mata
>         st_view(id=0, ., idname)
>         st_view(data=0, ., xname)
>
>         // initialize mata objects
>         real matrix out
>         real scalar idnum, val, previd, count, runsum
>         runsum = 0
>         count = 0
>         previd = id[1,.]
>         out = J(1, cols(id)+1, .)
>
>         // loop through observations
>         for (i=1; i<=rows(data); i++) {
>                 idnum = id[i,.]
>                 val = data[i,1]
>                 if (idnum != previd) {
>                         out = out \ (previd, runsum/count)
>                         count = 0
>                         runsum = 0
>                 }
>                 if (val != .) {
>                         count = count + 1
>                         runsum = runsum + val
>                 }
>                 previd = idnum
>         }
>
>         // final row
>         out = out \ (previd, runsum/count)
>
>         // output (exclude "filler" first row; keep all id columns and the mean)
>         return(out[|(2,1)\(rows(out),cols(out))|])
> }
>
> end
> ***** end meanby() definition *********
>
>
> ***** benchmark meanby vs stata collapse *****
> // create some panel data
> // 30 panels, 100 dates long
> clear all
> local n 10000000
> set obs `n'
> gen byte panelid = int( 30/`n' * (_n-1) )
> gen int date = mod(_n,100)
> gen x = runiform()
>
> sort panelid date
>
> // time meanby() using same data
> timer on 1
>         qui mata: meanby("panelid date","x")
> timer off 1
> timer list 1
>
> // time Stata's collapse
> timer on 2
> collapse (mean) x, by(panelid date)
> timer off 2
> timer list 2
> ***** end benchmark **************************
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

