Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   László Sándor <[email protected]>
To   [email protected]
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Mon, 19 Aug 2013 11:21:03 -0400

But of course, the fastest example above is cheating a bit, as it knows
the values of v1 and v2. A simple -bysort- to circumvent that would
immediately punish us heavily with sorting dozens of gigabytes.

But-but-but, my main use case uses the discrete values of a variable.
Is -levelsof- faster than -bys-? (If so, why isn't it used more often?)

Or as in most cases the discrete values come from a previous xtiling,
I know the values of this variable, or might even keep track of the
quantiles in a local somewhere.

Thanks for any thoughts on speeding up binned averaging.
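
A minimal sketch of the -levelsof- route, for concreteness. The variable names bin and y and the fake data are hypothetical, made up only for illustration; nothing here has been timed on a large dataset. It collects the distinct values once, runs one -summarize, meanonly- per value, and hands the accumulated "y x" pairs straight to -scatteri-, so no second dataset is needed:

```stata
* Fake data for illustration only; bin and y are hypothetical names
clear
set seed 12345
set obs 1000
generate bin = ceil(5*runiform())
generate y = bin + rnormal()

* Collect the distinct values of the binning variable
levelsof bin, local(levels)

* One -summarize, meanonly- pass per level; accumulate "y x" pairs
local coords
foreach l of local levels {
    quietly summarize y if bin == `l', meanonly
    local coords `coords' `r(mean)' `l'
}

* Plot the binned means without creating a second dataset
twoway scatteri `coords'
```

This avoids -preserve-/-restore- and an explicit -bysort-, but it still makes one pass over the data per level, so whether it wins presumably depends on the number of bins.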

On Mon, Aug 19, 2013 at 10:50 AM, László Sándor <[email protected]> wrote:
> Credit where credit is due: I meant Eric and Phil's tests, of course,
> I apologize, with Roger's thoughts also much appreciated.
>
> I am still surprised that loops of interpreted code beat the built-in
> C. So maybe -tabulate- was not heavily optimized in the end.
>
> Thanks for everything!
>
> Laszlo
>
> On Mon, Aug 19, 2013 at 10:45 AM, László Sándor <[email protected]> wrote:
>> Thanks, all.
>>
>> I am still confused how I could combine the speed of some of the
>> methods like -collapse- without losing my data, which usually takes
>> dozens of GBs.
>>
>> Otherwise I think we are only talking about -tabulate- versus -table-
>> but both need log-parsing, or some -bys: summarize- and collecting
>> locals, which I did not attempt.
>>
>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>> of the many lines of -summarize, meanonly-, esp. as it runs over the
>> dataset many times, just with -if- selecting different observations.
>>
>> On an 8-core StataMP 13 for Linux,
>> full -tabulate, sum- itself took ~140s
>> -tab, matcell- took <5s, but indeed generates frequencies only.
>> a second -tabulate, sum-, even with nof and nost, took the same
>> also with the caplog wrapper
>> a -collapse, fast- took 36s, but of course this loses the data
>> the -summarize- took 92s without the postfiles, 34.53s with — but I
>> still cannot scatteri the results in the same Stata instance…
>>
>> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>> full -tabulate, sum- itself took ~195s
>> -tab, matcell-: 8s
>> again same speed without frequencies and standard deviations, or with
>> the wrapper, for -tab, sum-
>> -collapse- took 60s
>> the loops of -summarize- took 160s now without the postfiles, 47s with.
>>
>> Thanks!
>>
>> Laszlo
>>
>> On Mon, Aug 19, 2013 at 8:59 AM, Phil Clayton
>> <[email protected]> wrote:
>>> There's no need to speculate - Eric and I provided example code, it's easy to test it and see for yourself. On my system (Stata/IC 13 for Mac) -tab, sum()- is definitely not the fastest method.
>>>
>>> Stata can only handle one dataset in memory, but it can store plenty of scalars, macros and matrices. Since all you want to do is plot the results using -scatteri- there is no need to have the results in a dataset anyway... (although for ease of programming a single -preserve- to access the results is often not too big a hit)
>>>
>>> Phil
>>>
>>> On 19/08/2013, at 10:16 PM, László Sándor <[email protected]> wrote:
>>>
>>>> Thanks for all this.
>>>>
>>>> Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
>>>> the fastest method by far.
>>>>
>>>> But indeed, having multiple datasets in memory is the bottleneck, so I
>>>> am not sure whether postfile or logout would solve much of the problem
>>>> — as for the results from the new files, I'd need to lose the current
>>>> data (or preserve and restore it).
>>>>
>>>> Currently, I am working on reading in the tabulated values into macros
>>>> to plug them into -scatteri-, but it is a hack.
>>>>
>>>> Thanks again,
>>>>
>>>> Laszlo
>>>>
>>>> On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
>>>> <[email protected]> wrote:
>>>>> The main problem with this solution is that you have to put in a lot more
>>>>> programming time, especially if you want to conserve the variable labels,
>>>>> value labels etc. of the by-variables. (That at least is my excuse for the
>>>>> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
>>>>> tend to use.)
>>>>>
>>>>> IMHO it is a major limitation of Stata that it cannot store any number of
>>>>> datasets (or dataframes) in the memory at a time. If it could, then we would
>>>>> not be forced to use -preserve- and -restore- so often and burn computer
>>>>> time in file I/O, just to conserve person-days.
>>>>>
>>>>> On the other hand, R (the main serious non-legacy competitor to Stata
>>>>> nowadays) has the even greater limitation that it doesn't have anything
>>>>> quite like Mata. Plus only a few of my colleagues seem to be confident using
>>>>> R!!!
>>>>>
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Roger
>>>>>
>>>>> Roger B Newson BSc MSc DPhil
>>>>> Lecturer in Medical Statistics
>>>>> Respiratory Epidemiology and Public Health Group
>>>>> National Heart and Lung Institute
>>>>> Imperial College London
>>>>> Royal Brompton Campus
>>>>> Room 33, Emmanuel Kaye Building
>>>>> 1B Manresa Road
>>>>> London SW3 6LR
>>>>> UNITED KINGDOM
>>>>> Tel: +44 (0)20 7352 8121 ext 3381
>>>>> Fax: +44 (0)20 7351 8322
>>>>> Email: [email protected]
>>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>>>> Departmental Web page:
>>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>>>
>>>>> Opinions expressed are those of the author, not of the institution.
>>>>>
>>>>> On 19/08/2013 01:06, Phil Clayton wrote:
>>>>>>
>>>>>> If you can avoid the -preserve- and -restore- you save loads of time (at
>>>>>> least on my modest system...)
>>>>>>
>>>>>> *--ex5.  using summarize and postfile**
>>>>>> tempname post
>>>>>> tempfile postfile
>>>>>> postfile `post' v1 v2 mean sd n using "`postfile'"
>>>>>> forval x = 4(-1)1 {
>>>>>>        forval y = 3(-1)1 {
>>>>>>                display "v1=`x', v2=`y'"
>>>>>>                qui sum v3 if v1==`x' & v2 == `y'
>>>>>>                post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
>>>>>>        } //end of y loop
>>>>>> } //end of x loop
>>>>>> postclose `post'
>>>>>> use "`postfile'", clear
>>>>>>
>>>>>> On 19/08/2013, at 8:31 AM, Eric A. Booth <[email protected]> wrote:
>>>>>>
>>>>>>> <>
>>>>>>> Hi Laszlo:   I agree that it would be nice if -tabulate,summarize()-
>>>>>>> stored values but it doesn't.  There are several options available to
>>>>>>> store those values and then use them elsewhere.  The issues seem to be
>>>>>>> (1) ease of parsing the values into a format that you can use for
>>>>>>> other analyses and (2) (and more important for you) the speed with
>>>>>>> which you can calculate, store, parse, and then use those values.
>>>>>>>
>>>>>>> Some alternatives include logging the -tabulate,
>>>>>>> summarize()- output and then parsing it, using -collapse- to get your
>>>>>>> values,  or using the compiled  -summarize- command to obtain the
>>>>>>> values of interest and store them for use elsewhere.  I'm sure there
>>>>>>> are other options, but below is a comparison of these methods against
>>>>>>> the speed of the desired -tabulate, summarize()- solution on a
>>>>>>> large-ish fake dataset.
>>>>>>>
>>>>>>> This is not a clean comparison and the values I store for later use
>>>>>>> are not exactly the same in every example, but it gives you an idea of
>>>>>>> the speed differences of the steps that might be involved for each
>>>>>>> approach (that is, preserving the data, summarizing or collapsing or
>>>>>>> XX, storing and parsing the output, and restoring the data).  The
>>>>>>> upshot is that, for this example on my computer, it seems that running
>>>>>>> -summarize- in a loop to grab the values you want and store them in a
>>>>>>> dataset was the quickest non-tab, summarize()- option I tried (example
>>>>>>> 4 below), but this would be slower on a lot of data points.  Plus,
>>>>>>> Examples 3 & 4 below are both faster than running -tabulate,
>>>>>>> summarize()-.
>>>>>>>
>>>>>>> Using -tabulate, summarize()-  to get values takes about 101 seconds
>>>>>>> to run in my example.
>>>>>>> Example 1 is a regular tabulate example with cells stored in a matrix --
>>>>>>> this took about 9 seconds, but doesn't require any calculation of means
>>>>>>> or whatnot.  Ex 2 uses -logout- to parse the output (you could do
>>>>>>> this manually too) and took the longest at about 109 seconds.  Ex 3
>>>>>>> uses -collapse- with preserve/restore and takes about 36 seconds.  Ex
>>>>>>> 4 uses a loop to grab means from summarize for certain values and
>>>>>>> takes about 27 seconds.
>>>>>>>
>>>>>>> *********************! Begin Example
>>>>>>> //intro stuff//
>>>>>>> clear all
>>>>>>> timer clear
>>>>>>> set rmsg on
>>>>>>> *--install  packages for the example
>>>>>>> cap which logout
>>>>>>> if _rc ssc install logout , replace
>>>>>>> *--make fake data
>>>>>>> sa master.dta, replace emptyok //for later
>>>>>>> set obs `=2^25' //run on a big dataset
>>>>>>> forval x = 1/10 {
>>>>>>>   g v`x' = round(runiform()*5)
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //examples//
>>>>>>>   **
>>>>>>>   tabulate v1 v2, summarize(v3)  //for ref. takes c.108 Seconds
>>>>>>>   **
>>>>>>>
>>>>>>> *--ex1. time working with -tab- stored values**
>>>>>>> **this doesn't get the values you need..
>>>>>>> **but allows us to compare speed of these approaches somewhat
>>>>>>> tab v1 v2,  matcell(A)
>>>>>>> mat list A
>>>>>>> preserve
>>>>>>>  clear
>>>>>>> svmat A, names(A)
>>>>>>> keep A1
>>>>>>> keep in 1/3 //parse
>>>>>>> l
>>>>>>> restore
>>>>>>>
>>>>>>>
>>>>>>> *--ex2.  parsing the tab, summarize() output**
>>>>>>> *logout*
>>>>>>> preserve
>>>>>>>     caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>>>>>>     logout, use(mystuff.txt) save(mytable) clear dta replace
>>>>>>> u mytable.dta, clear
>>>>>>> keep v1 v2
>>>>>>> keep in 4/6 //parse as needed
>>>>>>> restore
>>>>>>> *! or just log this and parse it yourself, probably faster to do so
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *--ex3. using collapse**
>>>>>>>  *this might be your best option if you have a lot of datapoints to calculate/store*!
>>>>>>> preserve
>>>>>>> collapse (mean) v3 , by(v1 v2)
>>>>>>> keep v2 v3
>>>>>>> keep in 2/5 //parse
>>>>>>> l
>>>>>>> restore
>>>>>>>
>>>>>>>
>>>>>>> *--ex4.  using summarize**
>>>>>>>  forval x = 4(-1)1 {
>>>>>>>    forval y = 3(-1)1 {
>>>>>>> qui sum v3 if v1==`x' & v2 == `y', meanonly
>>>>>>> loc val`x' `r(mean)'
>>>>>>> preserve
>>>>>>> clear
>>>>>>> set obs 1
>>>>>>> g name = "`x' and `y'"
>>>>>>> g v1 = `val`x'' in 1
>>>>>>> append using master.dta
>>>>>>> sa master.dta, replace  //values you need are in this dta file
>>>>>>> restore
>>>>>>>  } //end of y loop
>>>>>>> } //end of x loop
>>>>>>> *********************! End Example
>>>>>>> Note: -timer- kept resetting because the internal programming of
>>>>>>> -logout- cleared the timers each time, so I just added up the -rmsg-
>>>>>>> timings.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> Eric
>>>>>>> ___
>>>>>>> Eric A. Booth
>>>>>>> Research Scientist
>>>>>>> Gibson Consulting Group
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks again!
>>>>>>>>
>>>>>>>> I am not sure if those preserve-and-restore the data, but I should
>>>>>>>> check.
>>>>>>>>
>>>>>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>>>>>> read in numbers from the log file without opening a new dataset, and
>>>>>>>> plot "immediately" with -scatteri-.
>>>>>>>>
>>>>>>>> Laszlo
>>>>>>>>
>>>>>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>>>>>> -xcollapse- packages, which you can download from SSC. These are extended
>>>>>>>>> versions of -collapse- and -contract-, which can save the output datasets
>>>>>>>>> (or resultssets) to Stata .dta files on disk, with which the user can do
>>>>>>>>> all kinds of plotting and tabulating.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>>
>>>>>>>>> Roger
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks, Roger.
>>>>>>>>>>
>>>>>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>>>>>
>>>>>>>>>> I should have thought of -which-.
>>>>>>>>>>
>>>>>>>>>> I only wanted to capture some of the results/output without logging
>>>>>>>>>> and parsing the log.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Laszlo
>>>>>>>>>>
>>>>>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think you'll find that everything really is in the executable
>>>>>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is because
>>>>>>>>>>> Stata is not open-source, and was never supposed to be. StataCorp have to
>>>>>>>>>>> make a living, and would probably not be able to do so if it was
>>>>>>>>>>> open-source and users could make generic copies.
>>>>>>>>>>>
>>>>>>>>>>> A lot of the code for a lot of official Stata is open-source (ie in
>>>>>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>>>>>
>>>>>>>>>>> which tabulate
>>>>>>>>>>>
>>>>>>>>>>> then Stata will answer
>>>>>>>>>>>
>>>>>>>>>>> built-in command:  tabulate
>>>>>>>>>>>
>>>>>>>>>>> meaning that there is no file -tabulate.ado-.
>>>>>>>>>>>
>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>>
>>>>>>>>>>> Roger
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>>>>>> checked all folders, incl. hidden folders within /Applications/Stata
>>>>>>>>>>>> on my mac, and even checked the package contents of
>>>>>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>>>>>> other plugin/DLL whatsoever. Is everything wrapped into the Unix
>>>>>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>>>>>> Really?
>>>>>>>>>>>>
>>>>>>>>>>>> As I only need the results of -tab, sum()-, I hope to see some code
>>>>>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>>>>>> everything in the compiled binary instead?
>>>>>>>>>>>>
>>>>>>>>>>>> Well, something must add up those 33.9 MBs…
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for any thoughts,
>>>>>>>>>>>>
>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>
>>>>>>>>>>>> *
>>>>>>>>>>>> *   For searches and help try:
>>>>>>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>>>>>>>

