Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: Efficiently using associative arrays in mata


From   Andrew Maurer <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: Efficiently using associative arrays in mata
Date   Wed, 5 Mar 2014 17:53:00 +0000

Thanks very much for the response, Aljar. I tried your suggestion and found that the two-lookup method took about 35% longer in the below test (around 8.5 seconds vs 11.5 seconds), which isn't negligible.. 

I'd be interested in hearing your suggestion on alternative code if the number of groups is small. I use this for a few different definitions of the id variable, but on a set of 1billion observations, the final number of observations may be 2000 at the low end or 10million at the high end. 

As an aside - I've noticed that lines of pasted code sometimes merge together when I email to statalist. Does anyone know why this happens? (messages sent using "plain text" format in Outlook).

************ Create sample data *************
// create sample data
// variables: species, breed, animal_id, and date
clear all
local n 1000000
qui set obs `n'
gen byte species = int(4/`n' * (_n-1)) + 1
gen byte breed = int(5*runiform()) + 1
gen long animal_id = int((_n-1)/100)
gen int date = mod(_n,100) + 500
gen x = runiform()
sort species breed date
qui save temp, replace
************ End create sample data *********

*********** Begin timed benchmark ***********
clear all
mata

idname = ("species","breed","animal_id")
xname = "x"

// load data into mata
st_view(id=0, ., idname)
st_view(data=0, ., xname)

// initialize array
mydic = asarray_create("real", cols(id))
asarray_notfound(mydic, 0)

timer_clear()
for (j=1; j<=5; j++) {
timer_on(j)		
// loop through observations
for (i=1; i<=rows(data); i++) {
	thisid = id[i,.]
	thisx = data[i,1]
	asarray(mydic, thisid, 1)
}
timer_off(j)
}

for (j=6; j<=10; j++) {
timer_on(j)		
// loop through observations
for (i=1; i<=rows(data); i++) {
	thisid = id[i,.]
	thisx = data[i,1]
	asarray(mydic, thisid, asarray(mydic, thisid) + thisx)
}
timer_off(j)
}
timer()

end
exit
*********** End timed benchmark *************

*********** Output ***********
  1.       8.63 /        1 =     8.627
  2.       8.77 /        1 =     8.768
  3.       9.03 /        1 =     9.032
  4.        7.5 /        1 =     7.504
  5.        7.6 /        1 =     7.597
  6.       11.6 /        1 =     11.56
  7.       11.2 /        1 =    11.201
  8.       11.5 /        1 =    11.482
  9.         12 /        1 =    12.044
 10.         12 /        1 =    11.996
*********** End output *******

Thank you,

Andrew Maurer

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Aljar Meesters
Sent: Wednesday, March 05, 2014 10:22 AM
To: [email protected]
Subject: Re: st: Efficiently using associative arrays in mata

Dear Andrew,

I don't think that it possible to obtain the memory address of thisid in mydict with the standard implementation of asarray. Although it requires only a minor change in asarray to get this functionality, I don't think it is of great help for you. I have changed the line -asarray(mydict, thisid, asarray(mydict, thisid) + thisx)- into -asarray(mydict, thisid, 1)- to mimic the case you are looking for and although you get a speedup, it is not the speedup of a factor two you are looking for, but maybe on your system it does, so you may try this.
However, do you have some information about the number of observations after the collapse? If the number of groups is small I may have some code that is useful for you although it requires some changes in order to suit your usecase.
Best,

Aljar


2014-03-05 1:12 GMT+01:00 Andrew Maurer <[email protected]>:
> Hi Statalist,
>
> I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete.
>
> I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs).
>
> I suspect that the issue is in the line:
> asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
>
> The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function?
>
>
>
> ************** begin code ********************** clear all
>
> mata
> matrix array_runningsum(idname, xname) {
>
> // load data into mata
> st_view(id=0, ., idname)
> st_view(data=0, ., xname)
>
> // initialize array
> mydict = asarray_create("real", cols(id)) asarray_notfound(mydict, 0)
>
> // loop through observations and calculate running sum for (i=1; 
> i<=rows(data); i++) {
>                 thisid = id[i,.]
>                 thisx = data[i,1]
>                 asarray(mydict, thisid, asarray(mydict, thisid) + 
> thisx) }
>
> // convert array to matrix
> N = asarray_elements(mydict)
> A = J(N,(1+cols(id)),.)
> i = 1
>
> for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) {
>                 A[i++,.] = asarray_key(mydict, loc), 
> asarray_contents(mydict, loc); }
>
> return(sort(A,1..cols(id)))
> }
> end
>
> // working example
> sysuse auto, clear
> mata array_runningsum(("foreign", "mpg"), "price")
>
> // compare to:
> sysuse auto, clear
> collapse (sum) price, by(foreign mpg)
> list
> ************** end code ************************
>
>
> Thank you,
>
> Andrew Maurer
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index