Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: st: Efficiently using associative arrays in mata

From	Andrew Maurer <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: Efficiently using associative arrays in mata
Date	Thu, 6 Mar 2014 17:20:15 +0000
Hi Aljar,

Thank you very much for this. I really appreciate it! I'll take a look at it and let you know how it performs on my data relative to collapse.

I did try changing the efficiency parameters, as you suggested earlier, which made a relatively big improvement to performance (I changed asarray()'s minsize to 100,000). Computation time was still substantially slower than collapse, though.

Best,

Andrew Maurer 

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Aljar Meesters
Sent: Thursday, March 06, 2014 11:08 AM
To: [email protected]
Subject: Re: st: Efficiently using associative arrays in mata

Dear Andrew,

I had some spare moments this day and I tried to write a simple but
fast implementation of a hash table myself, just for fun. The result
is a very rudimentary hash table that out-competes the asarray
implementation (but it also has way less functionality). For you this
may be interesting though. I have implemented your array_runningsum,
which you can find below.
Best,

Aljar

mata

*** Begin ***
/* This function finds a unique position for the key in vector keys
 * and returns the position of the key if it is already added to keys
 */
real scalar findPos(keys, key){
    N = rows(keys)
    pos = hash1(key, N)
    while(keys[pos, 1] :< . & keys[pos, .] != key)
        pos = mod(pos, N) + 1

    return(pos)
}

matrix array_runningsum2(idname, xname){

    // load data into mata
    st_view(id=., ., idname)
    st_view(data=., ., xname)

    /* tableSize is the size of the hash table
     * and determines the tradeoff between speed
     * and memory consumption.
     * If you set it large, nothing will hapen but
     * the function will consume a lot of memroy.
     * If you set it too small (smaller than the number
     * of groups, the function will hang in an infinite
     * loop
     * If you set it equal to the amount of groups,
     * you will get minimal memory usage, yet, the
     * performance is really low.
     * I think that you get good performance if you
     * set it 3 to 4 times the number of groups.
     * If you set it equal to the number of rows in
     * your dataset, what I did over here, ensures
     * that the function always returns (well, I hope).
     */
    tableSize = ceil(rows(id))

    // Create a Keys matrix and a vector for the outcomes
    keys = J(tableSize, 1, J(1, cols(id), .))
    values = J(tableSize, 1, 0) // Initialized at zero

    for(idx = 1; idx <= rows(id); ++idx){
        pos = findPos(keys, id[idx, .])
        keys[pos, .] = id[idx, .]
        values[pos] = values[pos] + data[idx]
    }

    return(sort(select((keys, values), keys[.,1] :!= .), 1..cols(id)))
}

end

*** End ***

2014-03-06 11:07 GMT+01:00 Aljar Meesters <[email protected]>:
> The method that I have in mind uses the mata function
> -minindex(v,k,i,w)-, where v is a real vector that holds your unique
> group ids,k is in your case the number of groups that you have (but
> you can set this to missing as well), i, and w, are vectors in which
> the function puts information about group sizes and elements. I have
> added a proof of concept below this e-mail. There are however two
> caveats with this approach. Although the algorithm is, as far as I
> know, not disclosed, I think that it is an O(n * k) algorithm (where k
> the number of groups). If k is large relative to n, you get worse
> performance than the O(nlogn) of the sorting algorithm. The other
> caveat is that you have not one variable that identifies your groups
> but multiple. So, you have to combine these variables into one holds
> unique identifiers. If this is done incorrectly you will end up with
> non-unique identifiers. Also, the probability that things go wrong
> increases with the number of groups.
> As a side note, if you have a reasonable idea about how many groups
> you have, you may include this information when creating your hash
> table (third argument in asarray_create). This ensures that you do not
> get any rebuilds of the hash table while adding elements, which may
> increase your performance especially if you have a lot of groups.
> Best,
>
> Aljar
>
>
> *** begin example ***
> clear all
>
> set obs 100000
>
> gen x = ceil(runiform() *  1000)
> gen y = rnormal()
>
> timer on 1
> mata
> st_view(x=., ., "x y", .)
> minindex(x[., 1], ., i=., v=.)
> N = rows(v)
> v[.,2] = runningsum(v[., 2])
> t = J(N, 1, .)
> for(idx = 1; idx <= N; ++idx){
>     idx_x = i[|v[idx, 1], 1 \ v[idx, 2], 1 |]
>     t[idx] = mean(x[idx_x, 2])
> }
> end
> *** end example ***
>
> 2014-03-05 19:40 GMT+01:00 Joe Canner <[email protected]>:
>> Andrew,
>>
>> By default, MS Outlook removes "extra" line breaks in plain-text messages.  You can restore them for individual messages by clicking where it says "Extra lines breaks in this message were removed", above the "From:" line.
>>
>> If you want to change the default, go to File->Options->Mail->Message format.  See http://support.microsoft.com/kb/287816 for more details.
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Maurer
>> Sent: Wednesday, March 05, 2014 12:53 PM
>> To: [email protected]
>> Subject: RE: st: Efficiently using associative arrays in mata
>>
>> Thanks very much for the response, Aljar. I tried your suggestion and found that the two-lookup method took about 35% longer in the below test (around 8.5 seconds vs 11.5 seconds), which isn't negligible..
>>
>> I'd be interested in hearing your suggestion on alternative code if the number of groups is small. I use this for a few different definitions of the id variable, but on a set of 1billion observations, the final number of observations may be 2000 at the low end or 10million at the high end.
>>
>> As an aside - I've noticed that lines of pasted code sometimes merge together when I email to statalist. Does anyone know why this happens? (messages sent using "plain text" format in Outlook).
>>
>> ************ Create sample data ************* // create sample data // variables: species, breed, animal_id, and date clear all local n 1000000 qui set obs `n'
>> gen byte species = int(4/`n' * (_n-1)) + 1 gen byte breed = int(5*runiform()) + 1 gen long animal_id = int((_n-1)/100) gen int date = mod(_n,100) + 500 gen x = runiform() sort species breed date qui save temp, replace
>> ************ End create sample data *********
>>
>> *********** Begin timed benchmark *********** clear all mata
>>
>> idname = ("species","breed","animal_id") xname = "x"
>>
>> // load data into mata
>> st_view(id=0, ., idname)
>> st_view(data=0, ., xname)
>>
>> // initialize array
>> mydic = asarray_create("real", cols(id)) asarray_notfound(mydic, 0)
>>
>> timer_clear()
>> for (j=1; j<=5; j++) {
>> timer_on(j)
>> // loop through observations
>> for (i=1; i<=rows(data); i++) {
>>         thisid = id[i,.]
>>         thisx = data[i,1]
>>         asarray(mydic, thisid, 1)
>> }
>> timer_off(j)
>> }
>>
>> for (j=6; j<=10; j++) {
>> timer_on(j)
>> // loop through observations
>> for (i=1; i<=rows(data); i++) {
>>         thisid = id[i,.]
>>         thisx = data[i,1]
>>         asarray(mydic, thisid, asarray(mydic, thisid) + thisx) }
>> timer_off(j)
>> }
>> timer()
>>
>> end
>> exit
>> *********** End timed benchmark *************
>>
>> *********** Output ***********
>>   1.       8.63 /        1 =     8.627
>>   2.       8.77 /        1 =     8.768
>>   3.       9.03 /        1 =     9.032
>>   4.        7.5 /        1 =     7.504
>>   5.        7.6 /        1 =     7.597
>>   6.       11.6 /        1 =     11.56
>>   7.       11.2 /        1 =    11.201
>>   8.       11.5 /        1 =    11.482
>>   9.         12 /        1 =    12.044
>>  10.         12 /        1 =    11.996
>> *********** End output *******
>>
>> Thank you,
>>
>> Andrew Maurer
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Aljar Meesters
>> Sent: Wednesday, March 05, 2014 10:22 AM
>> To: [email protected]
>> Subject: Re: st: Efficiently using associative arrays in mata
>>
>> Dear Andrew,
>>
>> I don't think that it possible to obtain the memory address of thisid in mydict with the standard implementation of asarray. Although it requires only a minor change in asarray to get this functionality, I don't think it is of great help for you. I have changed the line -asarray(mydict, thisid, asarray(mydict, thisid) + thisx)- into -asarray(mydict, thisid, 1)- to mimic the case you are looking for and although you get a speedup, it is not the speedup of a factor two you are looking for, but maybe on your system it does, so you may try this.
>> However, do you have some information about the number of observations after the collapse? If the number of groups is small I may have some code that is useful for you although it requires some changes in order to suit your usecase.
>> Best,
>>
>> Aljar
>>
>>
>> 2014-03-05 1:12 GMT+01:00 Andrew Maurer <[email protected]>:
>>> Hi Statalist,
>>>
>>> I am trying to write a function in mata that returns sums by groups (like collapse), but without sorting. I am interested in calculating sums over groups for very large datasets (eg, 1billion observations). I have run into issues with an individual "sort" command taking around 60min to complete.
>>>
>>> I suspect that since collapse internally uses a sort, computation time is O(nlogn), or something else worse than linear. However, using associative arrays, the process could be done in linear time (though I'd imagine with higher memory usage). Ie - loop once through the whole dataset, storing the running sum of each groupid separately. I have a working example below, but the "loop through observations" section is slower than expected (using collapse as a benchmark, it's 2-3x slower for up to 10million obs).
>>>
>>> I suspect that the issue is in the line:
>>> asarray(mydict, thisid, asarray(mydict, thisid) + thisx)
>>>
>>> The problem is that asarray() is called twice, so the memory address of thisid in mydict has to be looked up twice. Does anyone have any suggestions on a way to complete this task, only looking up the address of thisid once, without having to rewrite the asarray() function?
>>>
>>>
>>>
>>> ************** begin code ********************** clear all
>>>
>>> mata
>>> matrix array_runningsum(idname, xname) {
>>>
>>> // load data into mata
>>> st_view(id=0, ., idname)
>>> st_view(data=0, ., xname)
>>>
>>> // initialize array
>>> mydict = asarray_create("real", cols(id)) asarray_notfound(mydict, 0)
>>>
>>> // loop through observations and calculate running sum for (i=1;
>>> i<=rows(data); i++) {
>>>                 thisid = id[i,.]
>>>                 thisx = data[i,1]
>>>                 asarray(mydict, thisid, asarray(mydict, thisid) +
>>> thisx) }
>>>
>>> // convert array to matrix
>>> N = asarray_elements(mydict)
>>> A = J(N,(1+cols(id)),.)
>>> i = 1
>>>
>>> for (loc=asarray_first(mydict); loc!=NULL; loc=asarray_next(mydict, loc)) {
>>>                 A[i++,.] = asarray_key(mydict, loc),
>>> asarray_contents(mydict, loc); }
>>>
>>> return(sort(A,1..cols(id)))
>>> }
>>> end
>>>
>>> // working example
>>> sysuse auto, clear
>>> mata array_runningsum(("foreign", "mpg"), "price")
>>>
>>> // compare to:
>>> sysuse auto, clear
>>> collapse (sum) price, by(foreign mpg)
>>> list
>>> ************** end code ************************
>>>
>>>
>>> Thank you,
>>>
>>> Andrew Maurer
>>>
>>>
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
References:
- st: Efficiently using associative arrays in mata
  - From: Andrew Maurer <[email protected]>
- Re: st: Efficiently using associative arrays in mata
  - From: Aljar Meesters <[email protected]>
- RE: st: Efficiently using associative arrays in mata
  - From: Andrew Maurer <[email protected]>
- RE: st: Efficiently using associative arrays in mata
  - From: Joe Canner <[email protected]>
- Re: st: Efficiently using associative arrays in mata
  - From: Aljar Meesters <[email protected]>
- Re: st: Efficiently using associative arrays in mata
  - From: Aljar Meesters <[email protected]>
Prev by Date: st: RE: Dropping parts of strings (i.e. Baldwin County becomes Baldwin)
Next by Date: Re: st: Robust Regression
Previous by thread: Re: st: Efficiently using associative arrays in mata
Next by thread: st: Graphics - normal density line
Index(es):
- Date
- Thread