
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Mon, 19 Aug 2013 17:44:41 +0100

Clearly the rest of us don't have your data and can't experiment.
(This is not a veiled request to send me those data, thanks!)

Also, the interest of this for others, I suggest, lies solely in your
problem, generalised, and so anything that is quirky about what you
want is fine by me (us?) but not compelling for anybody else. I am
concerned with general strategy as I understand it.

You make two points here.

1. You would rather not -sort- if you can avoid it. Well, I think we
all agree with that. But I've learned not to avoid it for many
problems.
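
For concreteness, the two strategies under discussion might be
sketched like this (a hedged illustration only; the variable names
y and bin are mine, not from the thread):

```stata
* (a) sort-based: one O(N log N) sort, then a single pass with -by:-
bysort bin : egen double binmean = mean(y)

* (b) sort-free: one full pass of -summarize, meanonly- per bin value
levelsof bin, local(bins)
foreach b of local bins {
    summarize y if bin == `b', meanonly
    display "bin `b': mean = " r(mean)
}
```

With k distinct bins, (b) scans all N observations k times, which is
why the -sort-/-by:- route tends to win as k grows.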

2. All you want to plot are ten deciles. You probably mentioned that
several threads ago, or earlier in this thread. I agree that makes the
graphical problem easier. (It seems a pity that with so much data you
don't plot a lot more detail!) But if the main purpose of the
calculation is to get a reduced dataset for graphing, -collapse- seems
to re-enter the discussion. (Underneath the hood -graph- does an awful
lot of -collapse-s.)
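
A minimal sketch of the -collapse--for-graphing route, with the full
dataset restored afterwards (y and bin are illustrative names, not
from the thread):

```stata
preserve
collapse (mean) y, by(bin) fast   // reduce to one row per bin
twoway scatter y bin              // graph the small dataset
restore                           // bring the full data back
```

Note that -preserve-/-restore- costs a write and re-read of the data,
which is exactly the overhead being weighed elsewhere in this thread.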
Nick
[email protected]


On 19 August 2013 17:13, László Sándor <[email protected]> wrote:
> Nick,
>
> I hope this helps to keep (or make?) this a fruitful discussion:
>
> I have tens of millions of observations, or more. E.g. all taxpayers
> over many years. I would rather not sort them. Doesn't sorting scale
> worse than -if- checks? I always have only a few bins, so I never loop
> over more than a few dozen bin values… But I would always have to
> sort all observations. The sorting variable taking only a limited
> number of values does not matter that much here, does it?
>
> And this also comes back to -scatteri-. I cannot plot (say) 10 points
> for the ten deciles without having variables for the deciles, which is
> scary. (No I don't want to associate the second decile with my second
> observation, not even in a tempvar — and even generating tempvars and
> plotting them with many tens of million missing values is very slow).
> Why not plot the ten deciles directly?
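
One way to do what is asked here can be sketched as follows, under the
assumption that the ten points wanted are decile-group means of y over
x (x and y are illustrative names): the cut points go into locals,
-summarize, meanonly- fills in each group mean, and -scatteri- plots
the ten pairs with no new variables at all. (Note that -_pctile-
presumably sorts internally, so this does not fully escape sorting.)

```stata
_pctile x, nquantiles(10)           // nine cut points in r(r1)..r(r9)
forvalues i = 1/9 {
    local c`i' = r(r`i')            // save before -summarize- clears r()
}
local c0  = -c(maxdouble)
local c10 =  c(maxdouble)
local pts
forvalues i = 1/10 {
    local j = `i' - 1
    summarize y if x > `c`j'' & x <= `c`i'', meanonly
    local pts `pts' `r(mean)' `i'   // y x pairs for -scatteri-
}
twoway scatteri `pts', xtitle(decile of x) ytitle(mean of y)
```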
>
> On Mon, Aug 19, 2013 at 12:01 PM, Nick Cox <[email protected]> wrote:
>> These reservations don't change my advice.
>>
>> 1. For graphics it is easiest by far to have variables to graph. I use
>> -scatteri- for special effects, but you'd have to factor in
>> programming time to get that right.
>>
>> 2. Trying to avoid a -sort- strikes me as false economy here. -by:-
>> goes hand in hand with a -sort- and the alternative of some kind of
>> loop and -if- can't compete well, in my experience.
>>
>> Nick
>> [email protected]
>>
>>
>> On 19 August 2013 16:39, László Sándor <[email protected]> wrote:
>>> Thanks, Nick.
>>>
>>> I would much prefer not to have a new dataset. I want to plot (and
>>> maybe log) the binned means, and to use the immediate -scatteri-, I
>>> need no new variables but probably a few macros with the means stored
>>> in them.
>>>
>>> I am not sure if Mata would be faster than the looping tests we ran
>>> for otherwise optimised -sum, meanonly-, they were surprisingly fast.
>>> No, I haven't tried either.
>>>
>>> -_gmean.ado- does start with a sort, which is prohibitive with big
>>> data as the primary use case.
>>>
>>> On Mon, Aug 19, 2013 at 11:32 AM, Nick Cox <[email protected]> wrote:
>>>> I got lost somewhere early on in this thread, but I don't see mention
>>>> towards the end of some other possibilities.
>>>>
>>>> If I understand it correctly, László wants a fast implementation
>>>> equivalent to -tabulate, summarize- in place, i.e. without replacing
>>>> the dataset in memory by a reduced dataset, but with the results
>>>> saved to a new dataset.
>>>>
>>>> As earlier emphasised, he can't borrow or steal code from -tabulate,
>>>> summarize- because that is compiled C code invisible to anyone except
>>>> Stata developers (and they aren't about to show you (and you almost
>>>> certainly wouldn't benefit from the code anyway as it probably
>>>> depends on lots of other C code (if it's typical of C code of this
>>>> kind (or at least that's my guess)))).
>>>>
>>>> To that end, my thoughts are
>>>>
>>>> 1. Never use reduction commands if you don't want data reduction,
>>>> unless you can be sure that the overhead of reading the data in again
>>>> can be ignored and you can't think of a better method.
>>>>
>>>> 2. The possibility of using Mata was not mentioned (but no, I don't
>>>> have code up my sleeve).
>>>>
>>>> 3. Although -egen- itself is slow, the code at the heart of _gmean.ado
>>>> and _gsd.ado is where I would start. That uses -by:- and avoids a loop,
>>>> and although there are a few lines to be interpreted, I would expect it
>>>> to be pretty fast.
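
The -by:- idiom at the heart of -_gmean.ado- reduces, roughly, to a
running sum read back from the last observation of each group (a
simplified sketch that ignores weights and options; g and y are
illustrative names):

```stata
sort g
by g : generate double num = sum(y)               // running sum of y
by g : generate long   den = sum(!missing(y))     // running count
by g : generate double gmean = num[_N] / den[_N]  // group total / count
drop num den
```

Under -by g:-, the subscript [_N] refers to the last observation
within each group, so every observation picks up its group mean
without any explicit loop.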
>>>>
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
>>>>> Thanks, all.
>>>>>
>>>>> I am still confused about how I could combine the speed of some of
>>>>> the methods like -collapse- without losing my data, which usually
>>>>> takes dozens of GBs.
>>>>>
>>>>> Otherwise I think we are only talking about -tabulate- versus -table-,
>>>>> but both need log-parsing, or some -bys: summarize- and collecting
>>>>> locals, which I did not attempt.
>>>>>
>>>>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>>>>> of the many lines of -summarize, meanonly-, especially as it runs over
>>>>> the dataset many times, just with -if- restrictions to different
>>>>> observations.
>>>>>
>>>>> On an 8-core Stata/MP 13 for Linux:
>>>>> full -tabulate, sum- itself took ~140s;
>>>>> -tab, matcell- took <5s, but indeed generates frequencies only;
>>>>> a second -tabulate, sum-, even with nof and nost, took the same,
>>>>> as did the run with the caplog wrapper;
>>>>> a -collapse, fast- took 36s, but of course this loses the data;
>>>>> the -summarize- loop took 92s without the postfiles, 34.53s with, but
>>>>> I still cannot -scatteri- the results in the same Stata instance…
>>>>>
>>>>> On a 64-core Stata/MP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>>>>> full -tabulate, sum- itself took ~195s;
>>>>> -tab, matcell-: 8s;
>>>>> again the same speed without frequencies and standard deviations, or
>>>>> with the wrapper, for -tab, sum-;
>>>>> -collapse- took 60s;
>>>>> the loops of -summarize- took 160s without the postfiles, 47s with.
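
The postfile variant timed above might be sketched like this (a hedged
reconstruction, not the code actually run; y and bin are illustrative
names):

```stata
tempname h
tempfile results
postfile `h' bin double mean using `results'
levelsof bin, local(bins)
foreach b of local bins {
    summarize y if bin == `b', meanonly
    post `h' (`b') (r(mean))        // one row per bin
}
postclose `h'
use `results', clear                // ten-odd rows, ready to graph
```

Loading the results does replace the data in memory, so in practice
one would -preserve- first, or graph from the small file in a second
step.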
>>>>>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC