
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Mon, 19 Aug 2013 12:13:02 -0400

Nick,

I hope this helps to keep (or make?) this a fruitful discussion:

I have tens of millions of observations, or more: e.g. all taxpayers
over many years. I would rather not sort them. Doesn't sorting scale
worse than -if- checks? I always have only a few bins, so I never loop
over more than a few dozen bin values… But I would always have to
sort all observations. The fact that the sorting variable takes only a
limited number of values does not matter that much here, does it?
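
Concretely, the loop I have in mind is along these lines (x and bin stand in
for my variables, and about ten bin values are assumed; just a sketch):

    forvalues i = 1/10 {
        quietly summarize x if bin == `i', meanonly   // one pass per bin, no sort
        local mean`i' = r(mean)
        local n`i'    = r(N)
    }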

And this also comes back to -scatteri-. I cannot plot (say) 10 points
for the ten deciles without having variables for the deciles, which is
scary. (No, I don't want to associate the second decile with my second
observation, not even in a tempvar; even generating tempvars and
plotting them with many tens of millions of missing values is very slow.)
Why not plot the ten deciles directly?
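
With the means already in macros, the -scatteri- call could then be built up
roughly like this (again placeholder names, ten deciles assumed):

    local points
    forvalues i = 1/10 {
        local points `points' `mean`i'' `i'   // y (decile mean) then x (decile number)
    }
    twoway scatteri `points', xtitle("decile") ytitle("mean")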

On Mon, Aug 19, 2013 at 12:01 PM, Nick Cox <[email protected]> wrote:
> These reservations don't change my advice.
>
> 1. For graphics it is easiest by far to have variables to graph. I use
> -scatteri- for special effects, but you'd have to factor in
> programming time to get that right.
>
> 2. Trying to avoid a -sort- strikes me as false economy here. -by:-
> goes hand in hand with a -sort- and the alternative of some kind of
> loop and -if- can't compete well, in my experience.
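
A rough sketch of the variables-first route Nick describes, with x and bin as
placeholder names (the particular -egen- calls are one illustration, not his
code):

    bysort bin: egen double binmean = mean(x)   // -by:- with the sort it requires
    egen byte tag = tag(bin)                    // mark one observation per bin
    twoway scatter binmean bin if tag, xtitle("bin") ytitle("mean of x")
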
>
> Nick
> [email protected]
>
>
> On 19 August 2013 16:39, László Sándor <[email protected]> wrote:
>> Thanks, Nick.
>>
>> I would much prefer not to have a new dataset. I want to plot (and
>> maybe log) the binned means, and to use the immediate -scatteri- I
>> need no new variables, just a few macros with the means stored in them.
>>
>> I am not sure whether Mata would be faster than the looping tests we ran
>> with the otherwise optimised -sum, meanonly-; those were surprisingly fast.
>> No, I haven't tried either.
>>
>> -_gmean.ado- does start with a sort, which is prohibitive when big
>> data is the primary use case.
>>
>> On Mon, Aug 19, 2013 at 11:32 AM, Nick Cox <[email protected]> wrote:
>>> I got lost somewhere early on in this thread, but I don't see mention
>>> towards the end of some other possibilities.
>>>
>>> If I understand it correctly, László wants a fast implementation
>>> equivalent to -tabulate, summarize- in place, i.e. without replacing
>>> the dataset in memory with a reduced dataset, but with the results
>>> saved.
>>>
>>> As emphasised earlier, he can't borrow or steal code from -tabulate,
>>> summarize- because that is compiled C code, invisible to anyone except
>>> Stata developers (and they aren't about to show you; and you almost
>>> certainly wouldn't benefit from the code anyway, as it probably
>>> depends on lots of other C code, if it's typical of C code of this
>>> kind, or at least that's my guess).
>>>
>>> To that end, my thoughts are
>>>
>>> 1. Never use reduction commands if you don't want data reduction,
>>> unless you can be sure that the overhead of reading the data in again
>>> can be ignored and you can't think of a better method.
>>>
>>> 2. The possibility of using Mata was not mentioned (but no, I don't
>>> have code up my sleeve).
>>>
>>> 3. Although -egen- itself is slow, the code at the heart of _gmean.ado
>>> and _gsd.ado is where I would start. That uses -by:- and avoids a loop,
>>> and although there are a few lines to be interpreted, I would expect it
>>> to be pretty fast.
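
For reference, the running-sum technique at the heart of -_gmean.ado- is
roughly the following (a paraphrase from memory rather than the shipped code;
x and bin are placeholder names):

    sort bin
    by bin: generate double grpmean = sum(x) / sum(!missing(x))   // running mean within bin
    by bin: replace grpmean = grpmean[_N]                         // keep the full-group value
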
>>>
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
>>>> Thanks, all.
>>>>
>>>> I am still confused about how I could combine the speed of some of the
>>>> methods like -collapse- without losing my data, which usually takes up
>>>> dozens of GB.
>>>>
>>>> Otherwise I think we are only talking about -tabulate- versus -table-,
>>>> but both need log parsing, or else some -bys: summarize- and collecting
>>>> locals, which I did not attempt.
>>>>
>>>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>>>> of the many calls to -summarize, meanonly-, especially as it runs over
>>>> the dataset many times, just with -if- restricting to different observations.
>>>>
>>>> On an 8-core Stata/MP 13 for Linux:
>>>> the full -tabulate, sum- itself took ~140s;
>>>> -tab, matcell()- took <5s, but indeed generates frequencies only;
>>>> a second -tabulate, sum-, even with -nof- and -nost-, and also with the
>>>> -caplog- wrapper, took about the same;
>>>> a -collapse, fast- took 36s, but of course this loses the data;
>>>> the -summarize- loop took 92s without the postfiles, 34.53s with, but I
>>>> still cannot -scatteri- the results in the same Stata instance…
>>>>
>>>> On a 64-core Stata/MP 13 (in a cluster, with nodes of 8 cores plus MPI):
>>>> the full -tabulate, sum- itself took ~195s;
>>>> -tab, matcell()-: 8s;
>>>> again the same speed without frequencies and standard deviations, or with
>>>> the wrapper, for -tab, sum-;
>>>> -collapse- took 60s;
>>>> the -summarize- loops now took 160s without the postfiles, 47s with.
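
On the remark about not being able to -scatteri- the posted results in the same
Stata instance, one hedged workaround sketch (tempfile and variable names are
hypothetical, and about ten bins are assumed) is to collect the means with
-postfile-, then -preserve-, load the small results file, plot, and -restore-:

    tempname ph
    tempfile results
    postfile `ph' bin double mean using `results'
    forvalues i = 1/10 {
        quietly summarize x if bin == `i', meanonly
        post `ph' (`i') (r(mean))
    }
    postclose `ph'

    preserve
    use `results', clear          // only as many observations as there are bins
    twoway scatter mean bin
    restore
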
>>>>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

