Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   Phil Clayton <[email protected]>
To   [email protected]
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Mon, 19 Aug 2013 22:59:11 +1000

There's no need to speculate: Eric and I provided example code, so it's easy to test and see for yourself. On my system (Stata/IC 13 for Mac), -tab, sum()- is definitely not the fastest method.

Stata can only handle one dataset in memory, but it can store plenty of scalars, macros and matrices. Since all you want to do is plot the results using -scatteri- there is no need to have the results in a dataset anyway... (although for ease of programming a single -preserve- to access the results is often not too big a hit)
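
A minimal sketch of that idea, collecting group means in a local macro and passing them straight to -scatteri-. The fake data, the seed, and the macro name coords are assumptions for illustration only, not code from this thread; swap in your own variables.

```stata
* Sketch only: small fake data standing in for the real dataset
clear
set seed 12345
set obs 1000
generate v1 = 1 + floor(runiform()*4)   // group variable taking values 1..4
generate v3 = runiform()                // variable to summarize

* Build "y x" pairs for -scatteri- in a local macro (the name coords is illustrative)
local coords
forvalues x = 1/4 {
    quietly summarize v3 if v1 == `x', meanonly
    local coords `coords' `r(mean)' `x'
}

* Plot the group means; the data in memory are left untouched
twoway scatteri `coords', ytitle("Mean of v3") xtitle("v1")
```

Because the results live in a macro rather than a dataset, no -preserve-/-restore- is needed here; the cost is that labels and more complex layouts have to be handled by hand.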

Phil

On 19/08/2013, at 10:16 PM, László Sándor <[email protected]> wrote:

> Thanks for all this.
> 
> Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
> the fastest method by far.
> 
> But indeed, having multiple datasets in memory is the bottleneck, so I
> am not sure whether postfile or logout would solve much of the problem
> — as for the results from the new files, I'd need to lose the current
> data (or preserve and restore it).
> 
> Currently, I am working on reading in the tabulated values into macros
> to plug them into -scatteri-, but it is a hack.
> 
> Thanks again,
> 
> Laszlo
> 
> On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
> <[email protected]> wrote:
>> The main problem with this solution is that you have to put in a lot more
>> programming time, especially if you want to conserve the variable labels,
>> value labels etc. of the by-variables. (That at least is my excuse for the
>> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
>> tend to use.)
>> 
>> IMHO it is a major limitation of Stata that it cannot store any number of
>> datasets (or dataframes) in the memory at a time. If it could, then we would
>> not be forced to use -preserve- and -restore- so often and burn computer
>> time in file I/O, just to conserve person-days.
>> 
>> On the other hand, R (the main serious non-legacy competitor to Stata
>> nowadays) has the even greater limitation that it doesn't have anything
>> quite like Mata. Plus only a few of my colleagues seem to be confident using
>> R!!!
>> 
>> 
>> Best wishes
>> 
>> Roger
>> 
>> Roger B Newson BSc MSc DPhil
>> Lecturer in Medical Statistics
>> Respiratory Epidemiology and Public Health Group
>> National Heart and Lung Institute
>> Imperial College London
>> Royal Brompton Campus
>> Room 33, Emmanuel Kaye Building
>> 1B Manresa Road
>> London SW3 6LR
>> UNITED KINGDOM
>> Tel: +44 (0)20 7352 8121 ext 3381
>> Fax: +44 (0)20 7351 8322
>> Email: [email protected]
>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>> Departmental Web page:
>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>> 
>> Opinions expressed are those of the author, not of the institution.
>> 
>> On 19/08/2013 01:06, Phil Clayton wrote:
>>> 
>>> If you can avoid the -preserve- and -restore- you save loads of time (at
>>> least on my modest system...)
>>> 
>>> *--ex5.  using summarize and postfile**
>>> tempname post
>>> tempfile postfile
>>> postfile `post' v1 v2 mean sd n using "`postfile'"
>>> forval x = 4(-1)1 {
>>>        forval y = 3(-1)1 {
>>>                display "v1=`x', v2=`y'"
>>>                qui sum v3 if v1==`x' & v2 == `y'
>>>                post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
>>>        } //end of y loop
>>> } //end of x loop
>>> postclose `post'
>>> use "`postfile'", clear
>>> 
>>> On 19/08/2013, at 8:31 AM, Eric A. Booth <[email protected]> wrote:
>>> 
>>>> <>
>>>> Hi Laszlo:   I agree that it would be nice if -tabulate,summarize()-
>>>> stored values but it doesn't.  There are several options available to
>>>> store those values and then use them elsewhere.  The issues seem to be
>>>> (1) ease of parsing the values into a format that you can use for
>>>> other analyses and (2) (and more important for you) the speed with
>>>> which you can calculate, store, parse, and then use those values.
>>>> 
>>>> Some alternatives include logging the -tabulate,
>>>> summarize()- output and then parsing it, using -collapse- to get your
>>>> values, or using the compiled -summarize- command to obtain the
>>>> values of interest and store them for use elsewhere.  I'm sure there
>>>> are other options, but below is a comparison of these methods against
>>>> the speed of the desired -tabulate, summarize()- solution on a
>>>> large-ish fake dataset.
>>>> 
>>>> This is not a clean comparison and the values I store for later use
>>>> are not exactly the same in every example, but it gives you an idea of
>>>> the speed differences of the steps that might be involved for each
>>>> approach (that is, preserving the data, summarizing or collapsing or
>>>> XX, storing and parsing the output, and restoring the data).  The
>>>> upshot is that, for this example on my computer, running
>>>> -summarize- in a loop to grab the values you want and store them in a
>>>> dataset was the quickest alternative to -tab, summarize()- that I tried
>>>> (example 4 below), but this would be slower with a lot of data points.
>>>> Plus, Examples 3 & 4 below are both faster than running -tabulate,
>>>> summarize()-.
>>>> 
>>>> Using -tabulate, summarize()-  to get values takes about 101 seconds
>>>> to run in my example.
>>>> Example 1 is a regular tabulate example with cells stored in a matrix --
>>>> this took about 9 seconds, but doesn't require any calculation of means
>>>> or whatnot.  Ex 2 uses -logout- to parse the output (you could do
>>>> this manually too) and took the longest at about 109 seconds.  Ex 3
>>>> uses -collapse- with preserve/restore and takes about 36 seconds.  Ex
>>>> 4 uses a loop to grab means from summarize for certain values and
>>>> takes about 27 seconds.
>>>> 
>>>> *********************! Begin Example
>>>> //intro stuff//
>>>> clear all
>>>> timer clear
>>>> set rmsg on
>>>> *--install  packages for the example
>>>> cap which logout
>>>> if _rc ssc install logout , replace
>>>> *--make fake data
>>>> sa master.dta, replace emptyok //for later
>>>> set obs `=2^25' //run on a big dataset
>>>> forval x = 1/10 {
>>>>   g v`x' = round(runiform()*5)
>>>> }
>>>> 
>>>> 
>>>> //examples//
>>>>   **
>>>>   tabulate v1 v2, summarize(v3)  //for ref. takes c.108 Seconds
>>>>   **
>>>> 
>>>> *--ex1. time working with -tab- stored values**
>>>> **this doesn't get the values you need..
>>>> **but allows us to compare speed of these approaches somewhat
>>>> tab v1 v2,  matcell(A)
>>>> mat list A
>>>> preserve
>>>>  clear
>>>> svmat A, names(A)
>>>> keep A1
>>>> keep in 1/3 //parse
>>>> l
>>>> restore
>>>> 
>>>> 
>>>> *--ex2.  parsing the tab, summarize() output**
>>>> *logout*
>>>> preserve
>>>>     caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>>>     logout, use(mystuff.txt) save(mytable) clear dta replace
>>>> u mytable.dta, clear
>>>> keep v1 v2
>>>> keep in 4/6 //parse as needed
>>>> restore
>>>> *! or just log this and parse it yourself, probably faster to do so
>>>> 
>>>> 
>>>> 
>>>> *--ex3. using collapse**
>>>>  *this might be your best option if you have a lot of datapoints to calculate/store*!
>>>> preserve
>>>> collapse (mean) v3 , by(v1 v2)
>>>> keep v2 v3
>>>> keep in 2/5 //parse
>>>> l
>>>> restore
>>>> 
>>>> 
>>>> *--ex4.  using summarize**
>>>>  forval x = 4(-1)1 {
>>>>    forval y = 3(-1)1 {
>>>> qui sum v3 if v1==`x' & v2 == `y', meanonly
>>>> loc val`x' `r(mean)'
>>>> preserve
>>>> clear
>>>> set obs 1
>>>> g name = "`x' and `y'"
>>>> g v1 = `val`x'' in 1
>>>> append using master.dta
>>>> sa master.dta, replace  //values you need are in this dta file
>>>> restore
>>>>  } //end of y loop
>>>> } //end of x loop
>>>> *********************! End Example
>>>> Note: -timer- kept resetting because the internal programming of -logout-
>>>> clears the timers each time it runs, so I just added up the -rmsg-
>>>> timings.
>>>> 
>>>> 
>>>> 
>>>> HTH,
>>>> 
>>>> Eric
>>>> ___
>>>> Eric A. Booth
>>>> Research Scientist
>>>> Gibson Consulting Group
>>>> [email protected]
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> Thanks again!
>>>>> 
>>>>> I am not sure if those preserve-and-restore the data, but I should
>>>>> check.
>>>>> 
>>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>>> read in numbers from the log file without opening a new dataset, and
>>>>> plot "immediately" with -scatteri-.
>>>>> 
>>>>> Laszlo
>>>>> 
>>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>>> -xcollapse- packages, which you can download from SSC. These are extended
>>>>>> versions of -collapse- and -contract-, which can save the output datasets
>>>>>> (or resultssets) to Stata .dta files on disk, with which the user can do
>>>>>> all kinds of plotting and tabulating.
>>>>>> 
>>>>>> 
>>>>>> Best wishes
>>>>>> 
>>>>>> Roger
>>>>>> 
>>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks, Roger.
>>>>>>> 
>>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>> 
>>>>>>> I should have thought of -which-.
>>>>>>> 
>>>>>>> I only wanted to capture some of the results/output without logging
>>>>>>> and parsing the log.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Laszlo
>>>>>>> 
>>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I think you'll find that everything really is in the executable
>>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is because
>>>>>>>> Stata is not open-source, and was never supposed to be. StataCorp have to
>>>>>>>> make a living, and would probably not be able to do so if it was open-source
>>>>>>>> and users could make generic copies.
>>>>>>>> 
>>>>>>>> A lot of the code for a lot of official Stata is open-source (ie in
>>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>> 
>>>>>>>> which tabulate
>>>>>>>> 
>>>>>>>> then Stata will answer
>>>>>>>> 
>>>>>>>> built-in command:  tabulate
>>>>>>>> 
>>>>>>>> meaning that there is no file -tabulate.ado-.
>>>>>>>> 
>>>>>>>> I hope this helps.
>>>>>>>> 
>>>>>>>> Best wishes
>>>>>>>> 
>>>>>>>> Roger
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>>> checked all folders, incl. hidden folders within /Applications/Stata
>>>>>>>>> on my mac, and even checked the package contents of
>>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>>> other plugin/DLL whatsoever. Is everything wrapped into the Unix
>>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>>> Really?
>>>>>>>>> 
>>>>>>>>> As I only need the results of -tab, sum()-, I hope to see some code
>>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>>> everything in the compiled binary instead?
>>>>>>>>> 
>>>>>>>>> Well, something must add up to those 33.9 MBs…
>>>>>>>>> 
>>>>>>>>> Thanks for any thoughts,
>>>>>>>>> 
>>>>>>>>> Laszlo
>>>>>>>>> 
>>>>>>>>> *
>>>>>>>>> *   For searches and help try:
>>>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>>>> 

