
# st: efficient programming in mata - "meanby()" example

| From | Andrew Maurer |
| --- | --- |
| To | Statalist |
| Subject | st: efficient programming in mata - "meanby()" example |
| Date | Mon, 24 Feb 2014 16:10:47 +0000 |

```
Hi Statalist,

I'm trying to get a sense of how to program in Mata as efficiently as possible. As a starting point, I'm writing a Mata function that calculates means by group and runs at least as quickly as Stata's "collapse (mean) ..., by()" command. My objective at the moment is to learn programming technique rather than to create something new.

I've whittled down my code to what's shown in the meanby() function below. The idea is to:
1) Sort data by panel variable(s)
2) Loop through observations, keep a running sum of the "x" variable and of the count of nonmissing x values
3) Whenever a new panel is reached, write the previous panel's average to an "out" object and reset the running sums

My version below takes 9 seconds, while Stata's collapse takes 5 seconds with the example data shown. Does anyone have any feedback on how I could improve my code or insight as to what's going on at a low level in Stata's "by... : gen..." syntax that makes it more efficient than the loop I've written?

Thank you,
Andrew Maurer

******* excerpt from Stata's "collapse" for comparison **************
`by' gen `ty' `y' = sum(`w'*`x')/sum(cond(`x'<.,`w',0))
`by' replace `y' = `y'[_N]
sort `by'
quietly by `by': keep if _n==_N
******* end Stata excerpt *******************************************

***** define meanby() function ********
mata

real matrix meanby(idname, xname)
{
// sort data in stata
stata("sort " + idname)

st_view(id=0, ., idname)
st_view(data=0, ., xname)

// initialize mata objects
real matrix out
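// idnum and previd hold one row of id, which has several columns
// when idname lists more than one variable
real rowvector idnum, previd
real scalar val, count, runsum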
runsum = 0
count = 0
previd = id[1,.]
out = J(1, cols(id)+1, .)

// loop through observations
for (i=1; i<=rows(data); i++) {
idnum = id[i,.]
val = data[i,1]
if (idnum != previd) {
out = out \ (previd, runsum/count)
count = 0
runsum = 0
}
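// val < . also skips extended missing values (.a, .b, ...),
// matching the `x'<. test in the collapse excerpt
if (val < .) {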
count = count + 1
runsum = runsum + val
}
previd = idnum
}

// final row
out = out \ (previd, runsum/count)

// output (exclude "filler" first row)
// use cols(out) so the mean column is kept when there are several id variables
return(out[|(2,1)\(rows(out),cols(out))|])
}

end
***** end meanby() definition *********

***** benchmark meanby vs stata collapse *****
// create some panel data
// 30 panels, 100 dates long
clear all
local n 10000000
set obs `n'
gen byte panelid = int( 30/`n' * (_n-1) )
gen int date = mod(_n,100)
gen x = runiform()

sort panelid date

// time meanby() using same data
timer on 1
qui mata: meanby("panelid date","x")
timer off 1
timer list 1

// time Stata's collapse
timer on 2
collapse (mean) x, by(panelid date)
timer off 2
timer list 2
***** end benchmark **************************

```
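
As a point of comparison for the three steps described in the post, here is a minimal sketch of the same loop with the output matrix allocated once up front instead of grown with `\` at every new panel. The function name `meanby_prealloc()`, its variable names, and the four-row toy dataset are all hypothetical and not part of the original post. Whether this changes the timing on the benchmark data has not been measured here.

```stata
mata:
// Hypothetical variant of meanby(): same three steps as in the post, but
// "out" is allocated once (at most one panel per observation) and trimmed
// to the number of panels actually found.
real matrix meanby_prealloc(string scalar idname, string scalar xname)
{
    real matrix    id, out
    real colvector x
    real rowvector previd
    real scalar    i, g, n, k, count, runsum

    // step 1: sort by the panel variable(s), then set up views
    stata("sort " + idname)
    st_view(id = ., ., tokens(idname))
    st_view(x = ., ., xname)

    n = rows(x)
    k = cols(id)
    if (n == 0) return(J(0, k + 1, .))

    out    = J(n, k + 1, .)
    g      = 0
    count  = 0
    runsum = 0
    previd = id[1, .]

    for (i = 1; i <= n; i++) {
        // step 3: a new panel starts, so store the finished panel's mean
        if (id[i, .] != previd) {
            g = g + 1
            out[g, .] = (previd, runsum / count)
            count  = 0
            runsum = 0
            previd = id[i, .]
        }
        // step 2: running sum and count of nonmissing x
        if (x[i] < .) {
            count  = count + 1
            runsum = runsum + x[i]
        }
    }

    // last panel
    g = g + 1
    out[g, .] = (previd, runsum / count)

    return(out[|1, 1 \ g, k + 1|])
}
end

// toy data (assumed for illustration); note that this clears the data in memory
clear
input byte panelid byte date double x
1 1 2
1 1 4
1 2 .
2 1 10
end

mata: meanby_prealloc("panelid date", "x")
```

A panel whose x values are all missing ends up with a missing mean, because 0/0 evaluates to missing in Mata.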