Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Mon, 19 Aug 2013 18:06:45 +0100

I believe whatever you say about times to read in the data again, and so forth.

But if you -sort- first and then follow with lots of statements
including -if- there is little or no gain. Stata still has to loop
through all the observations and test every one using the -if-
condition. The point about -sort-ing here is that calculations -by:-
are then faster (and indeed possible at all). That's what the -egen-
code exploits.
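The trade-off Nick describes can be sketched with a toy model. This is a minimal Python sketch (the data and function names are made up for illustration) contrasting the two access patterns:

```python
# Two ways to get per-group means, modelling Stata's access patterns:
# (1) a loop of -if-style filters: every pass re-scans all observations;
# (2) -sort- once, then a single -by:-style pass over contiguous groups.

def means_by_filter(groups, values):
    """One full scan of the data per group, like looping over -if- conditions."""
    visits, result = 0, {}
    for g in sorted(set(groups)):
        total = count = 0
        for gi, v in zip(groups, values):
            visits += 1              # every row is touched on every pass
            if gi == g:
                total += v
                count += 1
        result[g] = total / count
    return result, visits

def means_by_sorted_pass(groups, values):
    """Sort once, then one pass over contiguous runs, like -by g: egen mean-."""
    rows = sorted(zip(groups, values))   # the upfront -sort- cost
    visits, result, i = 0, {}, 0
    while i < len(rows):
        g, total, count = rows[i][0], 0, 0
        while i < len(rows) and rows[i][0] == g:
            visits += 1              # each row is touched exactly once
            total += rows[i][1]
            count += 1
            i += 1
        result[g] = total / count
    return result, visits

groups = [2, 1, 2, 1, 3, 3, 1]
values = [4.0, 1.0, 6.0, 2.0, 9.0, 11.0, 3.0]
r1, v1 = means_by_filter(groups, values)
r2, v2 = means_by_sorted_pass(groups, values)
print(r1 == r2, v1, v2)   # True 21 7: same means, three full scans vs one
```

With only a handful of bins but tens of millions of rows, the filter loop does one full scan per bin while the sorted pass touches each row once; that single pass is the gain -by:- buys once the data are sorted.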

But I am very possibly confused about what you are calculating. This
thread started with reference to -tabulate, summarize- but now
(somehow) you are talking about deciles. Means and standard deviations
don't require prior sorting, but deciles do.
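That distinction can be made concrete. A small Python sketch (illustrative only) of a one-pass mean and SD versus a decile, which is an order statistic:

```python
import math

def mean_sd_one_pass(xs):
    """Mean and SD from running sums -- the order of the data never matters."""
    n = s = ss = 0
    for x in xs:                        # a single pass, in any order
        n += 1
        s += x
        ss += x * x
    mean = s / n
    sd = math.sqrt((ss - n * mean * mean) / (n - 1))
    return mean, sd

def decile(xs, d):
    """The d-th decile is an order statistic: there is no way around sorting
    (or an equivalent selection algorithm) to find it."""
    ys = sorted(xs)
    return ys[len(ys) * d // 10 - 1]    # simple type-1 quantile, illustration only

xs = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 10.0, 6.0]
print(mean_sd_one_pass(xs))   # (5.5, ~3.03) whatever order xs arrives in
print(decile(xs, 5))          # 5.0 -- the 5th decile (the median) needs the order
```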

I think I'd better bail out now, just in case I am adding more
confusion to a convoluted thread.

Nick
[email protected]


On 19 August 2013 17:56, László Sándor <[email protected]> wrote:
> Thanks, Nick.
>
> 1. I will experiment with looping over 10-20 if conditions, without
> the sorting. I cannot imagine sorted data speeding up anything that
> much to spend the upfront time cost on sorting.
>
> 2. -collapse- is a no-go. I need the original data, as I am producing
> many graphs, one after another, with different outcomes, different
> sample restrictions etc. So no, I cannot lose (or even preserve) it
> just to load the whole thing back in, which takes minutes if not more
> with our current drives.
>
> And by the way, though our tests have not been extensive, the if-loop
> was not much slower than -collapse, fast-, strangely enough.
>
> Thanks again,
>
> Laszlo
>
> On Mon, Aug 19, 2013 at 12:44 PM, Nick Cox <[email protected]> wrote:
>> Clearly the rest of us don't have your data and can't experiment.
>> (This is not a veiled request to send me those data, thanks!)
>>
>> Also, the interest of this for others I suggest lies solely in your
>> problem, generalised, and so anything that is quirky about what you
>> want is fine by me (us?) but not compelling for anybody else. I am
>> concerned with general strategy as I understand it.
>>
>> You make two points here.
>>
>> 1. You would rather not -sort- if you can avoid it. Well, I think we
>> all agree with that. But I've learned not to avoid it for many
>> problems.
>>
>> 2. All you want to plot are ten deciles. You probably mentioned that
>> several threads ago, or earlier in this thread. I agree that makes the
>> graphical problem easier. (It seems a pity that with so much data you
>> don't plot a lot more detail!) But if the main purpose of the
>> calculation is to get a reduced dataset for graphing, -collapse- seems
>> to re-enter the discussion. (Underneath the hood -graph- does an awful
>> lot of -collapse-s.)
>> Nick
>> [email protected]
>>
>>
>> On 19 August 2013 17:13, László Sándor <[email protected]> wrote:
>>> Nick,
>>>
>>> I hope this helps to keep (or make?) this a fruitful discussion:
>>>
>>> I have tens of millions of observations, or more. E.g. all taxpayers
>>> over many years. I would rather not sort them. Doesn't sorting scale
>>> worse than -if- checks? I always have only a few bins, so I never loop
>>> over more than a few dozen bin values… But I would always have to
>>> sort all observations. The sorting variable taking only a limited
>>> number of values does not matter that much here, does it?
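László's scaling question has a back-of-envelope answer. This Python sketch (a pure operation-count model; constants, memory effects, and Stata's actual sort algorithm are all ignored) compares d full -if- scans against one comparison sort plus a single -by:- pass:

```python
import math

def if_loop_cost(n, d):
    """d passes with -if-: every pass touches all n observations."""
    return d * n

def sort_then_by_cost(n):
    """A comparison sort is ~ n*log2(n), then one -by:- pass over n rows."""
    return n * math.log2(n) + n

n = 50_000_000            # "tens of millions of observations"
for d in (10, 20, 50):
    winner = "if-loop" if if_loop_cost(n, d) < sort_then_by_cost(n) else "sort"
    print(d, winner)      # 10: if-loop, 20: if-loop, 50: sort
```

On this crude model, a few dozen bins is roughly the break-even point at n in the tens of millions, which fits the observation later in the thread that the if-loop was competitive; real timings depend on the constants the model ignores.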
>>>
>>> And this also comes back to -scatteri-. I cannot plot (say) 10 points
>>> for the ten deciles without having variables for the deciles, which is
>>> scary. (No, I don't want to associate the second decile with my second
>>> observation, not even in a tempvar — and even generating tempvars and
>>> plotting them with tens of millions of missing values is very slow.)
>>> Why not plot the ten deciles directly?
>>>
>>> On Mon, Aug 19, 2013 at 12:01 PM, Nick Cox <[email protected]> wrote:
>>>> These reservations don't change my advice.
>>>>
>>>> 1. For graphics it is easiest by far to have variables to graph. I use
>>>> -scatteri- for special effects, but you'd have to factor in
>>>> programming time to get that right.
>>>>
>>>> 2. Trying to avoid a -sort- strikes me as false economy here. -by:-
>>>> goes hand in hand with a -sort- and the alternative of some kind of
>>>> loop and -if- can't compete well, in my experience.
>>>>
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 19 August 2013 16:39, László Sándor <[email protected]> wrote:
>>>>> Thanks, Nick.
>>>>>
>>>>> I would much prefer not to have a new dataset. I want to plot (and
>>>>> maybe log) the binned means, and to use the immediate -scatteri-, I
>>>>> need no new variables but probably a few macros with the means stored
>>>>> in them.
>>>>>
>>>>> I am not sure if Mata would be faster than the looping tests we ran
>>>>> for otherwise optimised -sum, meanonly-, they were surprisingly fast.
>>>>> No, I haven't tried either.
>>>>>
>>>>> -_gmean.ado- does start with a sort, which is prohibitive with big
>>>>> data as the primary use case.
>>>>>
>>>>> On Mon, Aug 19, 2013 at 11:32 AM, Nick Cox <[email protected]> wrote:
>>>>>> I got lost somewhere early on in this thread, but I don't see mention
>>>>>> towards the end of some other possibilities.
>>>>>>
>>>>>> If I understand it correctly, László wants a fast implementation
>>>>>> equivalent to -tabulate, summarize- in place, i.e. without replacing
>>>>>> the dataset with a reduced one, but with the results saved to new
>>>>>> variables in the original dataset.
>>>>>>
>>>>>> As earlier emphasised, he can't borrow or steal code from -tabulate,
>>>>>> summarize- because that is compiled C code invisible to anyone except
>>>>>> Stata developers (and they aren't about to show you (and you almost
>>>>>> certainly wouldn't benefit from the code any way as it probably
>>>>>> depends on lots of other C code (if it's typical of C code of this
>>>>>> kind (or at least that's my guess)))).
>>>>>>
>>>>>> To that end, my thoughts are
>>>>>>
>>>>>> 1. Never use reduction commands if you don't want data reduction,
>>>>>> unless you can be sure that the overhead of reading the data in again
>>>>>> can be ignored and you can't think of a better method.
>>>>>>
>>>>>> 2. The possibility of using Mata was not mentioned (but no, I don't
>>>>>> have code up my sleeve).
>>>>>>
>>>>>> 3. Although -egen- itself is slow, the code at the heart of _gmean.ado
>>>>>> and _gsd.ado is where I would start. That uses -by:- and avoids a loop
>>>>>> and although there are a few lines to be interpreted I would expect it
>>>>>> to be pretty fast.
>>>>>>
>>>>>> Nick
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>> On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
>>>>>>> Thanks, all.
>>>>>>>
>>>>>>> I am still confused how I could combine the speed of some of the
>>>>>>> methods like -collapse- without losing my data, which usually takes
>>>>>>> up dozens of GB.
>>>>>>>
>>>>>>> Otherwise I think we are only talking about -tabulate- versus -table-
>>>>>>> but both need log-parsing, or some -bys: summarize- and collecting
>>>>>>> locals, which I did not attempt.
>>>>>>>
>>>>>>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>>>>>>> of the many lines of -summarize, meanonly-, especially as it runs over
>>>>>>> the dataset many times with -if-s selecting different observations.
>>>>>>>
>>>>>>> On an 8-core StataMP 13 for Linux,
>>>>>>> full -tabulate, sum- itself took ~140s
>>>>>>> -tab, matcell- took <5s, but indeed generates frequencies only.
>>>>>>> a second -tabulate, sum- took the same time, even with -nof- and
>>>>>>> -nost-, and also with the -caplog- wrapper
>>>>>>> a -collapse, fast- took 36s, but of course this loses the data
>>>>>>> the -summarize- took 92s without the postfiles, 34.53s with — but I
>>>>>>> still cannot scatteri the results in the same Stata instance…
>>>>>>>
>>>>>>> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>>>>>>> full -tabulate, sum- itself took ~195s
>>>>>>> -tab, matcell-: 8s
>>>>>>> again same speed without frequencies and standard deviations, or with
>>>>>>> the wrapper, for -tab, sum-
>>>>>>> -collapse- took 60s
>>>>>>> the loops of -summarize- took 160s now without the postfiles, 47s with.
>>>>>>>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC