Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   Austin Nichols <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Thu, 22 Aug 2013 17:58:00 -0400

László Sándor <[email protected]> :
I am guessing your -sort- takes 16 minutes, and each -by- calculation
takes 4 or 5 minutes. The first -bysort- sorts the data; subsequent
calls to -bysort- do not need to re-sort the data. Have you tried
using Mata?
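
To illustrate the "sort once, reuse the order" point above, here is a minimal sketch in Python (not Stata, and not a description of Stata's internals). The `Dataset` class, its `sorted_by_bin` flag and the `sort_count` counter are hypothetical names invented for this illustration: the first grouped calculation pays for the sort, and later ones find the data already ordered and skip it.

```python
class Dataset:
    """Toy stand-in for a dataset that remembers its sort key (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)      # each row is a (bin, value) pair
        self.sorted_by_bin = False  # plays the role of a "sorted by" marker
        self.sort_count = 0         # how many real sorts were performed

    def ensure_sorted(self):
        # Sort only when the data are not already known to be in bin order.
        if not self.sorted_by_bin:
            self.rows.sort(key=lambda r: r[0])
            self.sorted_by_bin = True
            self.sort_count += 1

    def group_means(self):
        """Mean of value within each bin, walking the sorted rows once."""
        self.ensure_sorted()
        means = {}
        start = 0
        n = len(self.rows)
        while start < n:
            key = self.rows[start][0]
            end = start
            total = 0.0
            # Rows of one bin are contiguous after sorting.
            while end < n and self.rows[end][0] == key:
                total += self.rows[end][1]
                end += 1
            means[key] = total / (end - start)
            start = end
        return means


data = Dataset([(2, 10.0), (1, 4.0), (2, 20.0), (1, 6.0), (3, 9.0)])
first = data.group_means()   # pays for the sort
second = data.group_means()  # reuses the existing order
```

If the guess above is right, the 20-minute first run and the 4.5-minute later runs would correspond to `first` and `second` here: same result, but only one sort.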

On Thu, Aug 22, 2013 at 6:43 AM, László Sándor <[email protected]> wrote:
> For those out there who care:
>
> I wonder why this thing is not more stable. I am confident that now I
> am using all 64 cores on a node of a cluster with Stata/MP 13, with
> plenty of RAM. I generate the 8 byte variables of 20 values with size
> maxlong(). I use the same random sorting before running any of these
> methods, and I try the sequence twice independently.
>
> Now -tab, sum()- got slower, it took roughly 36 minutes, all three
> times I tried.
> -collapse, fast- took less than 20-23 minutes.
> The if loops took around 90-100 minutes.
> The -bys- loop took only 20 minutes once, then LESS THAN 4.5 minutes
> twice (with unsorted data?!).
>
> Of course, now the question is, why doesn't -tab- use the same optimizations…
>
> In any case, perhaps this was useful.
>
> Laszlo
>
> On Tue, Aug 20, 2013 at 4:19 PM, László Sándor <[email protected]> wrote:
>> So, I reran the test on 8 cores, with Stata/MP 13, with 32 GB RAM.
>>
>> I made the following changes:
>> 1. I maxed out the number of observations. (see -h limits- and -h maxlong-)
>> 2. Made ten byte variables taking 20 integer values, this takes up 25
>> GB out of the 32, close to the StataCorp recommendations of leaving
>> 50% extra. But I did not check if virtual memory is touched, maybe I
>> can scale dataset down a bit.
>> 3. So I am taking 20 bins now, in case -tabulate, sum- and loops of
>> -sum if, meanonly- scale differently.
>> 4. I take only oneway tabs, as that's what I need, testing twoway was a mistake.
>> 5. I also try a -bys bins:- "looping".
>> +1. I mentioned I corrected Eric's code about not looping over all
>> values that were "tabbed over". Now the two are comparable.
>>
>> In this setup,
>> -- -tabulate, sum nof noobs nol nost- completes in only 1516.36
>> seconds, or ~25 minutes.
>> -- the simple frequency tab takes only 583.51 s, but again, this is
>> not in the run.
>> -- -collapse, fast- took 4025.64 seconds, much slower than -tab, sum-,
>> very strange. (I am pretty sure I have exclusive use of this compute
>> node, no other process is running or scheduling me).
>> -- the if-loops took 3967s, shockingly comparable to -collapse, fast-,
>> but still much slower than (now oneway) -tab, sum-.
>> -- -bys bins: sum, meanonly- took 3205 s.
>>
>> So -tab, sum- is unbeatable on big data for oneway tabs with a
>> moderate number of bins. Or others can run other tests.
>>
>> So I stick to parsing the log of -tab, sum-.
>>
>> Thanks for all your thoughts,
>>
>> Laszlo
>>
>> On Tue, Aug 20, 2013 at 5:08 AM, László Sándor <[email protected]> wrote:
>>> Thanks, Maarten.
>>>
>>> My understanding of byable commands was that they loop over -if-
>>> conditions anyway, though -in- conditions are supposed to be less
>>> wasteful and would explain why the prefix requires sorted data.
>>>
>>> Trust me, this code is heavily used on big data, if each run can save
>>> us minutes, it is still worth it. And my current tests with maxing out
>>> the code in this thread with -maxlong()- number of observations (the
>>> limit) and thus 20 GB of data gives a 20-minute lead to -collapse-
>>> over -tab, sum-. However, the key comparison is with the loops here,
>>> and I did not catch that the test was biased in their favor as they
>>> did not loop over all observations. I am rerunning those tests now.
>>>
>>> On Tue, Aug 20, 2013 at 4:21 AM, Maarten Buis <[email protected]> wrote:
>>>> On Mon, Aug 19, 2013 at 7:30 PM, László Sándor wrote:
>>>>> The other option seemed to be to try to keep track of the levels of
>>>>> "bins", and just forval loop over the values, if-ing in a bin at a
>>>>> time to quickly grab the means. This was surprisingly fast, and does
>>>>> not seem to be any slower without a sort beforehand. Again, any
>>>>> efficiency of -bys- looping over ifs does not seem to be worth
>>>>> the cost of the initial sorting.
>>>>
>>>> I think you are mixing up advice here: -by: <something>- is likely to
>>>> be faster than a -forvalues- loop combined with -if- conditions. I
>>>> don't think anyone suggested that you sort before that loop. The logic
>>>> is that an -if- condition will each time by necessity have to go
>>>> through all observations. The alternative would be a single sort with
>>>> -in- conditions, which I guess is what is at the core of the speed of
>>>> the -by- prefix. Depending on how many times you want to use -if-
>>>> conditions, there will be a point where the combination of a single
>>>> -sort- and many -in- conditions will be quicker than many -if-
>>>> conditions. But I don't expect that -sort-ing will help if you choose
>>>> the -forvalues- loop combined with -if- conditions.
>>>>
>>>> On a pragmatic level: how much time have you now spent trying to write
>>>> this code, and how much time do you expect to save with that? Are you
>>>> sure that you don't end up with a net loss of time?
>>>>
>>>> -- Maarten
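
The -if- versus sort-plus--in- argument in the oldest message can be sketched by counting how many observations each strategy examines. This is an illustration in Python under stated assumptions, not a model of how Stata implements -by-: the function names, the 10,000 observations and the 20 levels are all made up for the example, and the count for the sorted approach deliberately leaves out the cost of the sort itself.

```python
import random


def means_by_if(bins, values, levels):
    """One full pass over the data per level, like a loop of -if- conditions."""
    touched = 0
    means = {}
    for level in levels:
        total, count = 0.0, 0
        for b, v in zip(bins, values):
            touched += 1  # every observation is tested for every level
            if b == level:
                total += v
                count += 1
        if count:
            means[level] = total / count
    return means, touched


def means_by_sort_then_in(bins, values):
    """Sort once, then read each level as one contiguous range (like -in-)."""
    order = sorted(range(len(bins)), key=bins.__getitem__)  # the single sort
    touched = 0  # observations read after the sort; sort cost not counted
    means = {}
    start = 0
    while start < len(order):
        level = bins[order[start]]
        end = start
        total = 0.0
        while end < len(order) and bins[order[end]] == level:
            total += values[order[end]]
            touched += 1
            end += 1
        means[level] = total / (end - start)
        start = end
    return means, touched


rng = random.Random(0)
n_obs, n_levels = 10_000, 20  # hypothetical sizes, far smaller than the thread's
bins = [rng.randrange(n_levels) for _ in range(n_obs)]
values = [rng.random() for _ in range(n_obs)]

if_means, if_touched = means_by_if(bins, values, range(n_levels))
in_means, in_touched = means_by_sort_then_in(bins, values)
```

The -if- loop examines `n_obs * n_levels` observations, while the sorted version reads each observation once after paying for one sort, so the break-even point depends on how expensive that sort is relative to the number of levels.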


