Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Efficiently using associative arrays in mata


From   Andrew Maurer <[email protected]>
To   Statalist Statalist <[email protected]>
Subject   st: Efficiently using associative arrays in mata
Date   Wed, 5 Mar 2014 00:12:16 +0000

Hi Statalist,

I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete.

I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs).

I suspect that the issue is in the line:
asarray(mydict, thisid, asarray(mydict, thisid) + thisx)

The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function?



************** begin code **********************
clear all

mata
matrix array_runningsum(idname, xname) {

// load data into mata
st_view(id=0, ., idname)
st_view(data=0, ., xname)

// initialize array
mydict = asarray_create("real", cols(id))
asarray_notfound(mydict, 0)

// loop through observations and calculate running sum
for (i=1; i<=rows(data); i++) {
                thisid = id[i,.]
                thisx = data[i,1]
                asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
}

// convert array to matrix
N = asarray_elements(mydict)
A = J(N,(1+cols(id)),.)
i = 1

for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) {
                A[i++,.] = asarray_key(mydict, loc), asarray_contents(mydict, loc);
}

return(sort(A,1..cols(id)))
}
end

// working example
sysuse auto, clear
mata array_runningsum(("foreign", "mpg"), "price")

// compare to:
sysuse auto, clear
collapse (sum) price, by(foreign mpg)
list
************** end code ************************


Thank you,

Andrew Maurer 



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index