Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: where is StataCorp C code located? all in a single executable as compiled binary?

From	László Sándor <[email protected]>
To	[email protected]
Subject	Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date	Mon, 19 Aug 2013 10:50:12 -0400

Credit where credit is due: I meant Eric's and Phil's tests, of course.
I apologize. Roger's thoughts are also much appreciated.

I am still surprised that loops of interpreted code beat the built-in
C code. So maybe -tabulate- was not heavily optimized after all.

Thanks for everything!

Laszlo

On Mon, Aug 19, 2013 at 10:45 AM, László Sándor <[email protected]> wrote:
> Thanks, all.
>
> I am still confused about how I could get the speed of some of the
> methods like -collapse- without losing my data, which usually take up
> dozens of GBs.
>
> Otherwise I think we are only talking about -tabulate- versus -table-,
> both of which need log parsing, or some -bys: summarize- while collecting
> locals, which I did not attempt.
>
> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
> of the many lines of -summarize, meanonly-, especially as they run over
> the dataset many times, just with -if- conditions selecting different
> observations.
>
> On an 8-core StataMP 13 for Linux:
> the full -tabulate, sum- itself took ~140s;
> -tab, matcell- took <5s, but indeed generates frequencies only;
> a second -tabulate, sum-, even with nof and nost, took the same time,
> also with the caplog wrapper;
> a -collapse, fast- took 36s, but of course this loses the data;
> the loops of -summarize- took 92s without the postfiles and 34.53s with
> them, but I still cannot -scatteri- the results in the same Stata instance…
>
> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
> the full -tabulate, sum- itself took ~195s;
> -tab, matcell- took 8s;
> -tab, sum- again ran at the same speed without frequencies and standard
> deviations, or with the wrapper;
> -collapse- took 60s;
> the loops of -summarize- now took 160s without the postfiles and 47s with them.
>
> Thanks!
>
> Laszlo
>
> On Mon, Aug 19, 2013 at 8:59 AM, Phil Clayton
> <[email protected]> wrote:
>> There's no need to speculate - Eric and I provided example code, it's easy to test it and see for yourself. On my system (Stata/IC 13 for Mac) -tab, sum()- is definitely not the fastest method.
>>
>> Stata can only handle one dataset in memory, but it can store plenty of scalars, macros and matrices. Since all you want to do is plot the results using -scatteri- there is no need to have the results in a dataset anyway... (although for ease of programming a single -preserve- to access the results is often not too big a hit)
>>
>> Phil
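
The idea of keeping the results in macros rather than in a dataset can be sketched as follows. This is only a minimal illustration on hypothetical toy data: the variable names v1, v2 and v3 follow Eric's example, while the seed, the sample size and the local name points are assumptions. Each cell mean from -summarize, meanonly- is appended to a local macro as a "y x" pair, and the macro is then handed to -twoway scatteri-, so the data in memory are never replaced.

```stata
* Hypothetical toy data; v1, v2, v3 mirror the names used in the thread.
clear
set seed 12345
set obs 1000
generate v1 = 1 + floor(runiform()*4)    // group values 1..4
generate v2 = 1 + floor(runiform()*3)    // group values 1..3
generate v3 = round(runiform()*5)

* Collect "y x" pairs for -scatteri- in a local macro (assumed name: points).
local points
forvalues x = 1/4 {
    quietly summarize v3 if v1 == `x' & v2 == 1, meanonly
    local points `points' `r(mean)' `x'
}

* Plot the four means of v3 (for v2 == 1) against v1; the data stay in memory.
twoway scatteri `points', xtitle("v1") ytitle("mean of v3 where v2 == 1")
```

The same loop could be repeated for other values of v2, building one macro per plotted series.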
>>
>> On 19/08/2013, at 10:16 PM, László Sándor <[email protected]> wrote:
>>
>>> Thanks for all this.
>>>
>>> Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
>>> the fastest method by far.
>>>
>>> But indeed, having multiple datasets in memory is the bottleneck, so I
>>> am not sure whether postfile or logout would solve much of the problem
>>> — as for the results from the new files, I'd need to lose the current
>>> data (or preserve and restore it).
>>>
>>> Currently, I am working on reading in the tabulated values into macros
>>> to plug them into -scatteri-, but it is a hack.
>>>
>>> Thanks again,
>>>
>>> Laszlo
>>>
>>> On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
>>> <[email protected]> wrote:
>>>> The main problem with this solution is that you have to put in a lot more
>>>> programming time, especially if you want to conserve the variable labels,
>>>> value labels etc. of the by-variables. (That at least is my excuse for the
>>>> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
>>>> tend to use.)
>>>>
>>>> IMHO it is a major limitation of Stata that it cannot store any number of
>>>> datasets (or dataframes) in the memory at a time. If it could, then we would
>>>> not be forced to use -preserve- and -restore- so often and burn computer
>>>> time in file I/O, just to conserve person-days.
>>>>
>>>> On the other hand, R (the main serious non-legacy competitor to Stata
>>>> nowadays) has the even greater limitation that it doesn't have anything
>>>> quite like Mata. Plus only a few of my colleagues seem to be confident using
>>>> R!!!
>>>>
>>>>
>>>> Best wishes
>>>>
>>>> Roger
>>>>
>>>> Roger B Newson BSc MSc DPhil
>>>> Lecturer in Medical Statistics
>>>> Respiratory Epidemiology and Public Health Group
>>>> National Heart and Lung Institute
>>>> Imperial College London
>>>> Royal Brompton Campus
>>>> Room 33, Emmanuel Kaye Building
>>>> 1B Manresa Road
>>>> London SW3 6LR
>>>> UNITED KINGDOM
>>>> Tel: +44 (0)20 7352 8121 ext 3381
>>>> Fax: +44 (0)20 7351 8322
>>>> Email: [email protected]
>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>>> Departmental Web page:
>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>>
>>>> Opinions expressed are those of the author, not of the institution.
>>>>
>>>> On 19/08/2013 01:06, Phil Clayton wrote:
>>>>>
>>>>> If you can avoid the -preserve- and -restore- you save loads of time (at
>>>>> least on my modest system...)
>>>>>
>>>>> *--ex5.  using summarize and postfile**
>>>>> tempname post
>>>>> tempfile postfile
>>>>> postfile `post' v1 v2 mean sd n using "`postfile'"
>>>>> forval x = 4(-1)1 {
>>>>>        forval y = 3(-1)1 {
>>>>>                display "v1=`x', v2=`y'"
>>>>>                qui sum v3 if v1==`x' & v2 == `y'
>>>>>                post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
>>>>>        } //end of y loop
>>>>> } //end of x loop
>>>>> postclose `post'
>>>>> use "`postfile'", clear
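
If the results should remain available without replacing the data in memory, one possible variation on the loop above is to fill a Stata matrix instead of a postfile. The sketch below is an assumption-laden illustration, not Phil's code: the toy data, the seed and the matrix name results are hypothetical, and only the variable names v1, v2 and v3 come from the thread.

```stata
* Hypothetical toy data with the variable names used in the thread.
clear
set seed 2013
set obs 1200
generate v1 = 1 + floor(runiform()*4)    // group values 1..4
generate v2 = 1 + floor(runiform()*3)    // group values 1..3
generate v3 = round(runiform()*5)

* One row per (v1, v2) cell; columns hold v1, v2, mean, sd and N.
matrix results = J(12, 5, .)
matrix colnames results = v1 v2 mean sd n
local row = 0
forvalues x = 1/4 {
    forvalues y = 1/3 {
        local ++row
        quietly summarize v3 if v1 == `x' & v2 == `y'
        matrix results[`row', 1] = `x'
        matrix results[`row', 2] = `y'
        matrix results[`row', 3] = r(mean)
        matrix results[`row', 4] = r(sd)
        matrix results[`row', 5] = r(N)
    }
}
matrix list results
```

Individual cells can then be read back with expressions such as results[1,3], for example when building an argument list for -scatteri-, and the original observations are still in memory.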
>>>>>
>>>>> On 19/08/2013, at 8:31 AM, Eric A. Booth <[email protected]> wrote:
>>>>>
>>>>>> Hi Laszlo: I agree that it would be nice if -tabulate, summarize()-
>>>>>> stored values, but it doesn't. There are several options available to
>>>>>> store those values and then use them elsewhere. The issues seem to be
>>>>>> (1) ease of parsing the values into a format that you can use for
>>>>>> other analyses and (2) (more important for you) the speed with
>>>>>> which you can calculate, store, parse, and then use those values.
>>>>>>
>>>>>> Some alternatives include logging the -tabulate, summarize()- output
>>>>>> and then parsing it, using -collapse- to get your values, or using
>>>>>> the compiled -summarize- command to obtain the values of interest and
>>>>>> store them for use elsewhere. I'm sure there are other options, but
>>>>>> below is a comparison of these methods against the speed of the
>>>>>> desired -tabulate, summarize()- solution on a large-ish fake dataset.
>>>>>>
>>>>>> This is not a clean comparison, and the values I store for later use
>>>>>> are not exactly the same in every example, but it gives you an idea of
>>>>>> the speed differences of the steps that might be involved in each
>>>>>> approach (that is, preserving the data, summarizing or collapsing or
>>>>>> XX, storing and parsing the output, and restoring the data). The
>>>>>> upshot is that, for this example on my computer, running -summarize-
>>>>>> in a loop to grab the values you want and store them in a dataset was
>>>>>> the quickest alternative to -tabulate, summarize()- that I tried
>>>>>> (Example 4 below), but this would be slower with a lot of data points.
>>>>>> Also, Examples 3 and 4 below are both faster than running -tabulate,
>>>>>> summarize()-.
>>>>>>
>>>>>> Using -tabulate, summarize()- to get the values takes about 101 seconds
>>>>>> to run in my example.
>>>>>> Example 1 is a regular -tabulate- with the cells stored in a matrix;
>>>>>> this took about 9 seconds, but doesn't require any calculation of means
>>>>>> or whatnot. Ex 2 uses -logout- to parse the output (you could do
>>>>>> this manually too) and took the longest at about 109 seconds. Ex 3
>>>>>> uses -collapse- with preserve/restore and takes about 36 seconds. Ex
>>>>>> 4 uses a loop to grab means from -summarize- for certain values and
>>>>>> takes about 27 seconds.
>>>>>>
>>>>>> *********************! Begin Example
>>>>>> //intro stuff//
>>>>>> clear all
>>>>>> timer clear
>>>>>> set rmsg on
>>>>>> *--install  packages for the example
>>>>>> cap which logout
>>>>>> if _rc ssc install logout , replace
>>>>>> *--make fake data
>>>>>> sa master.dta, replace emptyok //for later
>>>>>> set obs `=2^25' //run on a big dataset
>>>>>> forval x = 1/10 {
>>>>>>   g v`x' = round(runiform()*5)
>>>>>> }
>>>>>>
>>>>>>
>>>>>> //examples//
>>>>>>   **
>>>>>>   tabulate v1 v2, summarize(v3)  //for ref. takes c.108 Seconds
>>>>>>   **
>>>>>>
>>>>>> *--ex1. time working with -tab- stored values**
>>>>>> **this doesnt get the values you need..
>>>>>> **but allows us to compare speed of these approaches somewhat
>>>>>> tab v1 v2,  matcell(A)
>>>>>> mat list A
>>>>>> preserve
>>>>>>  clear
>>>>>> svmat A, names(A)
>>>>>> keep A1
>>>>>> keep in 1/3 //parse
>>>>>> l
>>>>>> restore
>>>>>>
>>>>>>
>>>>>> *--ex2.  parsing the tab, summarize() output**
>>>>>> *logout*
>>>>>> preserve
>>>>>>     caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>>>>>     logout, use(mystuff.txt) save(mytable) clear dta replace
>>>>>> u mytable.dta, clear
>>>>>> keep v1 v2
>>>>>> keep in 4/6 //parse as needed
>>>>>> restore
>>>>>> *! or just log this and parse it yourself, probably faster to do so
>>>>>>
>>>>>>
>>>>>>
>>>>>> *--ex3. using collapse**
>>>>>>  *this might be your best option if you have a lot of datapoints to calculate/store*!
>>>>>> preserve
>>>>>> collapse (mean) v3 , by(v1 v2)
>>>>>> keep v2 v3
>>>>>> keep in 2/5 //parse
>>>>>> l
>>>>>> restore
>>>>>>
>>>>>>
>>>>>> *--ex4.  using summarize**
>>>>>> forval x = 4(-1)1 {
>>>>>>     forval y = 3(-1)1 {
>>>>>>         qui sum v3 if v1==`x' & v2 == `y', meanonly
>>>>>>         loc val`x' `r(mean)'
>>>>>>         preserve
>>>>>>         clear
>>>>>>         set obs 1
>>>>>>         g name = "`x' and `y'"
>>>>>>         g v1 = `val`x'' in 1
>>>>>>         append using master.dta
>>>>>>         sa master.dta, replace  //values you need are in this dta file
>>>>>>         restore
>>>>>>     } //end of y loop
>>>>>> } //end of x loop
>>>>>> *********************! End Example
>>>>>> note: the timers kept resetting because the internal programming of
>>>>>> -logout- clears them each time, so I just added up the -rmsg-
>>>>>> timings.
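
For comparisons that do not involve a command that clears the timers, numbered timers can be wrapped around each approach instead of adding up -rmsg- output by hand. The following is a minimal sketch on hypothetical toy data, far smaller than the 2^25 observations used above; the seed and the sample size are assumptions, and the timings it prints will of course differ from those reported in this thread.

```stata
* Hypothetical toy data with the variable names used in the thread.
clear
set seed 42
set obs 100000
generate v1 = 1 + floor(runiform()*4)    // group values 1..4
generate v2 = 1 + floor(runiform()*3)    // group values 1..3
generate v3 = round(runiform()*5)

timer clear

* Timer 1: -collapse- wrapped in -preserve-/-restore-.
timer on 1
preserve
collapse (mean) v3, by(v1 v2)
restore
timer off 1

* Timer 2: a loop of -summarize, meanonly- over the same cells.
timer on 2
forvalues x = 1/4 {
    forvalues y = 3(-1)1 {
        quietly summarize v3 if v1 == `x' & v2 == `y', meanonly
    }
}
timer off 2

* Elapsed seconds are displayed and also returned in r(t1) and r(t2).
timer list
```

Because -restore- brings the original data back, both timed blocks run on the same 100,000 observations.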
>>>>>>
>>>>>>
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> Eric
>>>>>> ___
>>>>>> Eric A. Booth
>>>>>> Research Scientist
>>>>>> Gibson Consulting Group
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thanks again!
>>>>>>>
>>>>>>> I am not sure if those preserve-and-restore the data, but I should
>>>>>>> check.
>>>>>>>
>>>>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>>>>> read in numbers from the log file without opening a new dataset, and
>>>>>>> plot "immediately" with -scatteri-.
>>>>>>>
>>>>>>> Laszlo
>>>>>>>
>>>>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>>>>> -xcollapse- packages, which you can download from SSC. These are
>>>>>>>> extended versions of -collapse- and -contract-, which can save the
>>>>>>>> output datasets (or resultssets) to Stata .dta files on disk, with
>>>>>>>> which the user can do all kinds of plotting and tabulating.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>>
>>>>>>>> Roger
>>>>>>>>
>>>>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, Roger.
>>>>>>>>>
>>>>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>>>>
>>>>>>>>> I should have thought of -which-.
>>>>>>>>>
>>>>>>>>> I only wanted to capture some of the results/output without logging
>>>>>>>>> and parsing the log.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think you'll find that everything really is in the executable
>>>>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is
>>>>>>>>>> because Stata is not open-source, and was never supposed to be.
>>>>>>>>>> StataCorp have to make a living, and would probably not be able to
>>>>>>>>>> do so if it was open-source and users could make generic copies.
>>>>>>>>>>
>>>>>>>>>> A lot of the code for a lot of official Stata is open-source (i.e. in
>>>>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>>>>
>>>>>>>>>> which tabulate
>>>>>>>>>>
>>>>>>>>>> then Stata will answer
>>>>>>>>>>
>>>>>>>>>> built-in command:  tabulate
>>>>>>>>>>
>>>>>>>>>> meaning that there is no file -tabulate.ado-.
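
The distinction between built-in commands and ado-files can be checked for any command in the same way. The sketch below is a hedged illustration: it assumes that -collapse- is shipped as an ado-file (as it was in Stata 13) and uses -findfile-, which searches the ado-path and returns the full path in r(fn).

```stata
* -which- reports either "built-in command" or the location of the ado-file.
which tabulate        // built-in, so there is no source file to inspect
which collapse        // assumed to be an ado-file; its path is displayed

* -findfile- returns the full path of a file on the ado-path in r(fn).
findfile collapse.ado
display "`r(fn)'"

* With -capture-, a failed search sets a nonzero return code instead of
* stopping the do-file, which allows a programmatic check.
capture findfile tabulate.ado
display _rc
```

Once an ado-file has been located this way, its source can be read in any text editor.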
>>>>>>>>>>
>>>>>>>>>> I hope this helps.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>>
>>>>>>>>>> Roger
>>>>>>>>>>
>>>>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>>>>> checked all folders, incl. hidden folders, within /Applications/Stata
>>>>>>>>>>> on my Mac, and even checked the package contents of
>>>>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>>>>> other plugin/DLL whatsoever. Is everything wrapped into the Unix
>>>>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>>>>> Really?
>>>>>>>>>>>
>>>>>>>>>>> As I only need the results of -tab, sum()-, I hoped to see some code
>>>>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>>>>> everything in the compiled binary instead?
>>>>>>>>>>>
>>>>>>>>>>> Well, something must add up to those 33.9 MBs…
>>>>>>>>>>>
>>>>>>>>>>> Thanks for any thoughts,
>>>>>>>>>>>
>>>>>>>>>>> Laszlo
>>>>>>>>>>>
>>>>>>>>>>> *
>>>>>>>>>>> *   For searches and help try:
>>>>>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
