Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Andrew Maurer <Andrew.Maurer@qrm.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | RE: st: Efficiently using associative arrays in mata |
Date | Thu, 6 Mar 2014 17:20:15 +0000 |
Hi Aljar, Thank you very much for this. I really appreciate it! I'll take a look at it and let you know how it performs on my data relative to collapse. I did try changing the efficiency parameters, as you suggested earlier, which made a relatively big improvement to performance (I changed asarray()'s minsize to 100,000). Computation time was still substantially slower than collapse, though. Best, Andrew Maurer -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Aljar Meesters Sent: Thursday, March 06, 2014 11:08 AM To: statalist@hsphsun2.harvard.edu Subject: Re: st: Efficiently using associative arrays in mata Dear Andrew, I had some spare moments this day and I tried to write a simple but fast implementation of a hash table myself, just for fun. The result is a very rudimentary hash table that out-competes the asarray implementation (but it also has way less functionality). For you this may be interesting though. I have implemented your array_runningsum, which you can find below. Best, Aljar mata *** Begin *** /* This function finds a unique position for the key in vector keys * and returns the position of the key if it is already added to keys */ real scalar findPos(keys, key){ N = rows(keys) pos = hash1(key, N) while(keys[pos, 1] :< . & keys[pos, .] != key) pos = mod(pos, N) + 1 return(pos) } matrix array_runningsum2(idname, xname){ // load data into mata st_view(id=., ., idname) st_view(data=., ., xname) /* tableSize is the size of the hash table * and determines the tradeoff between speed * and memory consumption. * If you set it large, nothing will hapen but * the function will consume a lot of memroy. * If you set it too small (smaller than the number * of groups, the function will hang in an infinite * loop * If you set it equal to the amount of groups, * you will get minimal memory usage, yet, the * performance is really low. * I think that you get good performance if you * set it 3 to 4 times the number of groups. * If you set it equal to the number of rows in * your dataset, what I did over here, ensures * that the function always returns (well, I hope). */ tableSize = ceil(rows(id)) // Create a Keys matrix and a vector for the outcomes keys = J(tableSize, 1, J(1, cols(id), .)) values = J(tableSize, 1, 0) // Initialized at zero for(idx = 1; idx <= rows(id); ++idx){ pos = findPos(keys, id[idx, .]) keys[pos, .] = id[idx, .] values[pos] = values[pos] + data[idx] } return(sort(select((keys, values), keys[.,1] :!= .), 1..cols(id))) } end *** End *** 2014-03-06 11:07 GMT+01:00 Aljar Meesters <aljar.meesters@gmail.com>: > The method that I have in mind uses the mata function > -minindex(v,k,i,w)-, where v is a real vector that holds your unique > group ids,k is in your case the number of groups that you have (but > you can set this to missing as well), i, and w, are vectors in which > the function puts information about group sizes and elements. I have > added a proof of concept below this e-mail. There are however two > caveats with this approach. Although the algorithm is, as far as I > know, not disclosed, I think that it is an O(n * k) algorithm (where k > the number of groups). If k is large relative to n, you get worse > performance than the O(nlogn) of the sorting algorithm. The other > caveat is that you have not one variable that identifies your groups > but multiple. So, you have to combine these variables into one holds > unique identifiers. If this is done incorrectly you will end up with > non-unique identifiers. Also, the probability that things go wrong > increases with the number of groups. > As a side note, if you have a reasonable idea about how many groups > you have, you may include this information when creating your hash > table (third argument in asarray_create). This ensures that you do not > get any rebuilds of the hash table while adding elements, which may > increase your performance especially if you have a lot of groups. > Best, > > Aljar > > > *** begin example *** > clear all > > set obs 100000 > > gen x = ceil(runiform() * 1000) > gen y = rnormal() > > timer on 1 > mata > st_view(x=., ., "x y", .) > minindex(x[., 1], ., i=., v=.) > N = rows(v) > v[.,2] = runningsum(v[., 2]) > t = J(N, 1, .) > for(idx = 1; idx <= N; ++idx){ > idx_x = i[|v[idx, 1], 1 \ v[idx, 2], 1 |] > t[idx] = mean(x[idx_x, 2]) > } > end > *** end example *** > > 2014-03-05 19:40 GMT+01:00 Joe Canner <jcanner1@jhmi.edu>: >> Andrew, >> >> By default, MS Outlook removes "extra" line breaks in plain-text messages. You can restore them for individual messages by clicking where it says "Extra lines breaks in this message were removed", above the "From:" line. >> >> If you want to change the default, go to File->Options->Mail->Message format. See http://support.microsoft.com/kb/287816 for more details. >> >> Regards, >> Joe Canner >> Johns Hopkins University School of Medicine >> >> -----Original Message----- >> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Andrew Maurer >> Sent: Wednesday, March 05, 2014 12:53 PM >> To: statalist@hsphsun2.harvard.edu >> Subject: RE: st: Efficiently using associative arrays in mata >> >> Thanks very much for the response, Aljar. I tried your suggestion and found that the two-lookup method took about 35% longer in the below test (around 8.5 seconds vs 11.5 seconds), which isn't negligible.. >> >> I'd be interested in hearing your suggestion on alternative code if the number of groups is small. I use this for a few different definitions of the id variable, but on a set of 1billion observations, the final number of observations may be 2000 at the low end or 10million at the high end. >> >> As an aside - I've noticed that lines of pasted code sometimes merge together when I email to statalist. Does anyone know why this happens? (messages sent using "plain text" format in Outlook). >> >> ************ Create sample data ************* // create sample data // variables: species, breed, animal_id, and date clear all local n 1000000 qui set obs `n' >> gen byte species = int(4/`n' * (_n-1)) + 1 gen byte breed = int(5*runiform()) + 1 gen long animal_id = int((_n-1)/100) gen int date = mod(_n,100) + 500 gen x = runiform() sort species breed date qui save temp, replace >> ************ End create sample data ********* >> >> *********** Begin timed benchmark *********** clear all mata >> >> idname = ("species","breed","animal_id") xname = "x" >> >> // load data into mata >> st_view(id=0, ., idname) >> st_view(data=0, ., xname) >> >> // initialize array >> mydic = asarray_create("real", cols(id)) asarray_notfound(mydic, 0) >> >> timer_clear() >> for (j=1; j<=5; j++) { >> timer_on(j) >> // loop through observations >> for (i=1; i<=rows(data); i++) { >> thisid = id[i,.] >> thisx = data[i,1] >> asarray(mydic, thisid, 1) >> } >> timer_off(j) >> } >> >> for (j=6; j<=10; j++) { >> timer_on(j) >> // loop through observations >> for (i=1; i<=rows(data); i++) { >> thisid = id[i,.] >> thisx = data[i,1] >> asarray(mydic, thisid, asarray(mydic, thisid) + thisx) } >> timer_off(j) >> } >> timer() >> >> end >> exit >> *********** End timed benchmark ************* >> >> *********** Output *********** >> 1. 8.63 / 1 = 8.627 >> 2. 8.77 / 1 = 8.768 >> 3. 9.03 / 1 = 9.032 >> 4. 7.5 / 1 = 7.504 >> 5. 7.6 / 1 = 7.597 >> 6. 11.6 / 1 = 11.56 >> 7. 11.2 / 1 = 11.201 >> 8. 11.5 / 1 = 11.482 >> 9. 12 / 1 = 12.044 >> 10. 12 / 1 = 11.996 >> *********** End output ******* >> >> Thank you, >> >> Andrew Maurer >> >> -----Original Message----- >> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Aljar Meesters >> Sent: Wednesday, March 05, 2014 10:22 AM >> To: statalist@hsphsun2.harvard.edu >> Subject: Re: st: Efficiently using associative arrays in mata >> >> Dear Andrew, >> >> I don't think that it possible to obtain the memory address of thisid in mydict with the standard implementation of asarray. Although it requires only a minor change in asarray to get this functionality, I don't think it is of great help for you. I have changed the line -asarray(mydict, thisid, asarray(mydict, thisid) + thisx)- into -asarray(mydict, thisid, 1)- to mimic the case you are looking for and although you get a speedup, it is not the speedup of a factor two you are looking for, but maybe on your system it does, so you may try this. >> However, do you have some information about the number of observations after the collapse? If the number of groups is small I may have some code that is useful for you although it requires some changes in order to suit your usecase. >> Best, >> >> Aljar >> >> >> 2014-03-05 1:12 GMT+01:00 Andrew Maurer <Andrew.Maurer@qrm.com>: >>> Hi Statalist, >>> >>> I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete. >>> >>> I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs). >>> >>> I suspect that the issue is in the line: >>> asarray(mydict, thisid, asarray(mydict, thisid) + thisx) >>> >>> The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function? >>> >>> >>> >>> ************** begin code ********************** clear all >>> >>> mata >>> matrix array_runningsum(idname, xname) { >>> >>> // load data into mata >>> st_view(id=0, ., idname) >>> st_view(data=0, ., xname) >>> >>> // initialize array >>> mydict = asarray_create("real", cols(id)) asarray_notfound(mydict, 0) >>> >>> // loop through observations and calculate running sum for (i=1; >>> i<=rows(data); i++) { >>> thisid = id[i,.] >>> thisx = data[i,1] >>> asarray(mydict, thisid, asarray(mydict, thisid) + >>> thisx) } >>> >>> // convert array to matrix >>> N = asarray_elements(mydict) >>> A = J(N,(1+cols(id)),.) >>> i = 1 >>> >>> for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) { >>> A[i++,.] = asarray_key(mydict, loc), >>> asarray_contents(mydict, loc); } >>> >>> return(sort(A,1..cols(id))) >>> } >>> end >>> >>> // working example >>> sysuse auto, clear >>> mata array_runningsum(("foreign", "mpg"), "price") >>> >>> // compare to: >>> sysuse auto, clear >>> collapse (sum) price, by(foreign mpg) >>> list >>> ************** end code ************************ >>> >>> >>> Thank you, >>> >>> Andrew Maurer >>> >>> >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> * http://www.ats.ucla.edu/stat/stata/ >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ >> >> >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/