
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   László Sándor <[email protected]>
To   [email protected]
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Mon, 19 Aug 2013 10:45:25 -0400

Thanks, all.

I am still unsure how I could get the speed of some of the methods,
like -collapse-, without losing my data, which usually takes up
dozens of GBs.

Otherwise I think we are only talking about -tabulate- versus -table-
but both need log-parsing, or some -bys: summarize- and collecting
locals, which I did not attempt.
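
For what it is worth, the loop-and-collect idea could be sketched with a matrix instead of locals, so that nothing has to be parsed from a log and the data stay in memory. This is only a sketch on small fake data: the variable names v1, v2, v3 follow the fake data used elsewhere in this thread, and the matrix name R and the seed are assumptions made up for illustration.

```stata
* Sketch only: small fake data (names v1 v2 v3 and the seed are assumptions)
clear
set seed 12345
set obs 10000
generate v1 = ceil(runiform()*4)    // by-variable with values 1..4
generate v2 = ceil(runiform()*3)    // by-variable with values 1..3
generate v3 = rnormal()             // variable to summarize

* Find the cells, then preallocate one matrix row per (v1, v2) cell
quietly levelsof v1, local(xs)
quietly levelsof v2, local(ys)
local nx : word count `xs'
local ny : word count `ys'
matrix R = J(`nx'*`ny', 5, .)
matrix colnames R = v1 v2 mean sd N

* Fill the matrix from the r() results of -summarize-
local i = 0
foreach x of local xs {
    foreach y of local ys {
        local ++i
        quietly summarize v3 if v1 == `x' & v2 == `y'
        matrix R[`i', 1] = `x'
        matrix R[`i', 2] = `y'
        matrix R[`i', 3] = r(mean)
        matrix R[`i', 4] = r(sd)
        matrix R[`i', 5] = r(N)
    }
}
matrix list R
```

The dataset is never cleared or preserved here, so the only cost is the repeated passes of -summarize- over the data.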

FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
of the many lines of -summarize, meanonly-, especially as it runs over
the dataset many times, just with -if- selecting different observations.

On an 8-core StataMP 13 for Linux:
full -tabulate, sum- itself took ~140s
-tab, matcell- took <5s, but indeed generates frequencies only
a second -tabulate, sum-, even with nof and nost, took the same time,
also with the -caplog- wrapper
a -collapse, fast- took 36s, but of course this loses the data
the -summarize- loops took 92s without the postfiles, 34.53s with, but I
still cannot -scatteri- the results in the same Stata instance…

On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
full -tabulate, sum- itself took ~195s
-tab, matcell-: 8s
again same speed without frequencies and standard deviations, or with
the wrapper, for -tab, sum-
-collapse- took 60s
the loops of -summarize- took 160s now without the postfiles, 47s with.
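
As for plotting in the same instance, one possibility is to build the -scatteri- coordinate list in a local macro while looping, so no second dataset is needed. Below is a minimal sketch on fake data, not something I timed on the real files: the variable names, the seed, and the macro name coords are assumptions for illustration. -scatteri- takes its points as y x pairs.

```stata
* Sketch only: small fake data (names and seed are assumptions)
clear
set seed 2013
set obs 10000
generate v1 = ceil(runiform()*4)    // grouping variable with values 1..4
generate v3 = v1 + rnormal()        // variable whose means we plot

* Collect "y x" pairs for -scatteri- in a local, one pair per group
local coords
forvalues x = 1/4 {
    quietly summarize v3 if v1 == `x', meanonly
    if r(N) > 0 {
        local coords `coords' `r(mean)' `x'
    }
}

* Plot straight from the macro; the data in memory are untouched
twoway scatteri `coords', ytitle("Mean of v3") xtitle("v1")
```

With many cells the macro can get long, so a matrix or a postfile may be the safer container in that case.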

Thanks!

Laszlo

On Mon, Aug 19, 2013 at 8:59 AM, Phil Clayton
<[email protected]> wrote:
> There's no need to speculate - Eric and I provided example code, it's easy to test it and see for yourself. On my system (Stata/IC 13 for Mac) -tab, sum()- is definitely not the fastest method.
>
> Stata can only handle one dataset in memory, but it can store plenty of scalars, macros and matrices. Since all you want to do is plot the results using -scatteri- there is no need to have the results in a dataset anyway... (although for ease of programming a single -preserve- to access the results is often not too big a hit)
>
> Phil
>
> On 19/08/2013, at 10:16 PM, László Sándor <[email protected]> wrote:
>
>> Thanks for all this.
>>
>> Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
>> the fastest method by far.
>>
>> But indeed, having multiple datasets in memory is the bottleneck, so I
>> am not sure whether postfile or logout would solve much of the problem
>> — as for the results from the new files, I'd need to lose the current
>> data (or preserve and restore it).
>>
>> Currently, I am working on reading in the tabulated values into macros
>> to plug them into -scatteri-, but it is a hack.
>>
>> Thanks again,
>>
>> Laszlo
>>
>> On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
>> <[email protected]> wrote:
>>> The main problem with this solution is that you have to put in a lot more
>>> programming time, especially if you want to conserve the variable labels,
>>> value labels etc. of the by-variables. (That at least is my excuse for the
>>> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
>>> tend to use.)
>>>
>>> IMHO it is a major limitation of Stata that it cannot store any number of
>>> datasets (or dataframes) in the memory at a time. If it could, then we would
>>> not be forced to use -preserve- and -restore- so often and burn computer
>>> time in file I/O, just to conserve person-days.
>>>
>>> On the other hand, R (the main serious non-legacy competitor to Stata
>>> nowadays) has the even greater limitation that it doesn't have anything
>>> quite like Mata. Plus only a few of my colleagues seem to be confident using
>>> R!!!
>>>
>>>
>>> Best wishes
>>>
>>> Roger
>>>
>>> Roger B Newson BSc MSc DPhil
>>> Lecturer in Medical Statistics
>>> Respiratory Epidemiology and Public Health Group
>>> National Heart and Lung Institute
>>> Imperial College London
>>> Royal Brompton Campus
>>> Room 33, Emmanuel Kaye Building
>>> 1B Manresa Road
>>> London SW3 6LR
>>> UNITED KINGDOM
>>> Tel: +44 (0)20 7352 8121 ext 3381
>>> Fax: +44 (0)20 7351 8322
>>> Email: [email protected]
>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>> Departmental Web page:
>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>
>>> Opinions expressed are those of the author, not of the institution.
>>>
>>> On 19/08/2013 01:06, Phil Clayton wrote:
>>>>
>>>> If you can avoid the -preserve- and -restore- you save loads of time (at
>>>> least on my modest system...)
>>>>
>>>> *--ex5.  using summarize and postfile**
>>>> tempname post
>>>> tempfile postfile
>>>> postfile `post' v1 v2 mean sd n using "`postfile'"
>>>> forval x = 4(-1)1 {
>>>>        forval y = 3(-1)1 {
>>>>                display "v1=`x', v2=`y'"
>>>>                qui sum v3 if v1==`x' & v2 == `y'
>>>>                post `post' (`x') (`y') (r(mean)) (r(sd)) (r(N))
>>>>        } //end of y loop
>>>> } //end of x loop
>>>> postclose `post'
>>>> use "`postfile'", clear
>>>>
>>>> On 19/08/2013, at 8:31 AM, Eric A. Booth <[email protected]> wrote:
>>>>
>>>>> <>
>>>>> Hi Laszlo:   I agree that it would be nice if -tabulate, summarize()-
>>>>> stored values, but it doesn't.  There are several options available to
>>>>> store those values and then use them elsewhere.  The issues seem to be
>>>>> (1) ease of parsing the values into a format that you can use for
>>>>> other analyses and (2) (and more important for you) the speed with
>>>>> which you can calculate, store, parse, and then use those values.
>>>>>
>>>>> Some alternatives include logging the -tabulate,
>>>>> summarize()- output and then parsing it, using -collapse- to get your
>>>>> values, or using the compiled -summarize- command to obtain the
>>>>> values of interest and store them for use elsewhere.  I'm sure there
>>>>> are other options, but below is a comparison of these methods against
>>>>> the speed of the desired -tabulate, summarize()- solution on a
>>>>> large-ish fake dataset.
>>>>>
>>>>> This is not a clean comparison and the values I store for later use
>>>>> are not exactly the same in every example, but it gives you an idea of
>>>>> the speed differences of the steps that might be involved for each
>>>>> approach (that is, preserving the data, summarizing or collapsing or
>>>>> XX, storing and parsing the output, and restoring the data).  The
>>>>> upshot is that, for this example on my computer, it seems that running
>>>>> -summarize- in a loop to grab the values you want and store them in a
>>>>> dataset was the quickest alternative to -tabulate, summarize()- that I
>>>>> tried (Example 4 below), but this would be slower with a lot of data
>>>>> points.  Plus, Examples 3 & 4 below are both faster than running
>>>>> -tabulate, summarize()-.
>>>>>
>>>>> Using -tabulate, summarize()-  to get values takes about 101 seconds
>>>>> to run in my example.
>>>>> Example 1 is a regular tabulate example with cells stored in a matrix --
>>>>> this took about 9 seconds, but doesn't require any calculation of means
>>>>> or whatnot.  Ex 2 uses -logout- to parse the output (you could do
>>>>> this manually too) and took the longest at about 109 seconds.  Ex 3
>>>>> uses -collapse- with preserve/restore and takes about 36 seconds.  Ex
>>>>> 4 uses a loop to grab means from summarize for certain values and
>>>>> takes about 27 seconds.
>>>>>
>>>>> *********************! Begin Example
>>>>> //intro stuff//
>>>>> clear all
>>>>> timer clear
>>>>> set rmsg on
>>>>> *--install  packages for the example
>>>>> cap which logout
>>>>> if _rc ssc install logout , replace
>>>>> *--make fake data
>>>>> sa master.dta, replace emptyok //for later
>>>>> set obs `=2^25' //run on a big dataset
>>>>> forval x = 1/10 {
>>>>>   g v`x' = round(runiform()*5)
>>>>> }
>>>>>
>>>>>
>>>>> //examples//
>>>>>   **
>>>>>   tabulate v1 v2, summarize(v3)  //for ref. takes c.108 Seconds
>>>>>   **
>>>>>
>>>>> *--ex1. time working with -tab- stored values**
>>>>> **this doesn't get the values you need,
>>>>> **but allows us to compare speed of these approaches somewhat
>>>>> tab v1 v2,  matcell(A)
>>>>> mat list A
>>>>> preserve
>>>>>  clear
>>>>> svmat A, names(A)
>>>>> keep A1
>>>>> keep in 1/3 //parse
>>>>> l
>>>>> restore
>>>>>
>>>>>
>>>>> *--ex2.  parsing the tab, summarize() output**
>>>>> *logout*
>>>>> preserve
>>>>>     caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>>>>     logout, use(mystuff.txt) save(mytable) clear dta replace
>>>>> u mytable.dta, clear
>>>>> keep v1 v2
>>>>> keep in 4/6 //parse as needed
>>>>> restore
>>>>> *! or just log this and parse it yourself, probably faster to do so
>>>>>
>>>>>
>>>>>
>>>>> *--ex3. using collapse**
>>>>>  *this might be your best option if you have a lot of datapoints to calculate/store*!
>>>>> preserve
>>>>> collapse (mean) v3 , by(v1 v2)
>>>>> keep v2 v3
>>>>> keep in 2/5 //parse
>>>>> l
>>>>> restore
>>>>>
>>>>>
>>>>> *--ex4.  using summarize**
>>>>>  forval x = 4(-1)1 {
>>>>>    forval y = 3(-1)1 {
>>>>> qui sum v3 if v1==`x' & v2 == `y', meanonly
>>>>> loc val`x' `r(mean)'
>>>>> preserve
>>>>> clear
>>>>> set obs 1
>>>>> g name = "`x' and `y'"
>>>>> g v1 = `val`x'' in 1
>>>>> append using master.dta
>>>>> sa master.dta, replace  //values you need are in this dta file
>>>>> restore
>>>>>  } //end of y loop
>>>>> } //end of x loop
>>>>> *********************! End Example
>>>>> Note: -timer- kept being reset because the internal programming of
>>>>> -logout- clears the timers each time, so I just added up the -rmsg-
>>>>> timings.
>>>>>
>>>>>
>>>>>
>>>>> HTH,
>>>>>
>>>>> Eric
>>>>> ___
>>>>> Eric A. Booth
>>>>> Research Scientist
>>>>> Gibson Consulting Group
>>>>> [email protected]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Thanks again!
>>>>>>
>>>>>> I am not sure if those preserve-and-restore the data, but I should
>>>>>> check.
>>>>>>
>>>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>>>> read in numbers from the log file without opening a new dataset, and
>>>>>> plot "immediately" with -scatteri-.
>>>>>>
>>>>>> Laszlo
>>>>>>
>>>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>>>> -xcollapse- packages, which you can download from SSC. These are extended
>>>>>>> versions of -collapse- and -contract-, which can save the output datasets
>>>>>>> (or resultssets) to Stata .dta files on disk, with which the user can do
>>>>>>> all kinds of plotting and tabulating.
>>>>>>>
>>>>>>>
>>>>>>> Best wishes
>>>>>>>
>>>>>>> Roger
>>>>>>>
>>>>>>>
>>>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Roger.
>>>>>>>>
>>>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>>>
>>>>>>>> I should have thought of -which-.
>>>>>>>>
>>>>>>>> I only wanted to capture some of the results/output without logging
>>>>>>>> and parsing the log.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Laszlo
>>>>>>>>
>>>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think you'll find that everything really is in the executable
>>>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is because
>>>>>>>>> Stata is not open-source, and was never supposed to be. StataCorp have to
>>>>>>>>> make a living, and would probably not be able to do so if it was
>>>>>>>>> open-source and users could make generic copies.
>>>>>>>>>
>>>>>>>>> A lot of the code for a lot of official Stata is open-source (ie in
>>>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>>>
>>>>>>>>> which tabulate
>>>>>>>>>
>>>>>>>>> then Stata will answer
>>>>>>>>>
>>>>>>>>> built-in command:  tabulate
>>>>>>>>>
>>>>>>>>> meaning that there is no file -tabulate.ado-.
>>>>>>>>>
>>>>>>>>> I hope this helps.
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>>
>>>>>>>>> Roger
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>>>> checked all folders, incl. hidden folders within /Applications/Stata
>>>>>>>>>> on my mac, and even checked the package contents of
>>>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>>>> other plugin/DLL whatsoever. Is everything wrapped into the Unix
>>>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>>>> Really?
>>>>>>>>>>
>>>>>>>>>> As I only need the results of -tab, sum()-, I hope to see some code
>>>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>>>> everything in the compiled binary instead?
>>>>>>>>>>
>>>>>>>>>> Well, something must add up to those 33.9 MBs…
>>>>>>>>>>
>>>>>>>>>> Thanks for any thoughts,
>>>>>>>>>>
>>>>>>>>>> Laszlo
>>>>>>>>>>
>>>>>>>>>> *
>>>>>>>>>> *   For searches and help try:
>>>>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>>>>>


© Copyright 1996–2018 StataCorp LLC