
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: where is StataCorp C code located? all in a single executable as compiled binary?


From   "Roger B. Newson" <[email protected]>
To   [email protected]
Subject   Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date   Mon, 19 Aug 2013 12:26:34 +0100

The main problem with this solution is that you have to put in a lot more programming time, especially if you want to conserve the variable labels, value labels etc. of the by-variables. (That at least is my excuse for the CPU-intensive, near-SAS-like and 20th-century-looking method that I still tend to use.)

IMHO it is a major limitation of Stata that it cannot store more than one dataset (or dataframe) in the memory at a time. If it could, then we would not be forced to use -preserve- and -restore- so often and burn computer time in file I/O, just to conserve person-days.
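
A sketch for readers on Stata 16 or later, where frames allow several datasets in memory at once (an assumption about your Stata version; frames did not exist when this thread was written). It uses the shipped auto dataset and a hypothetical frame name, means, to collapse a copy of the data while the original stays untouched, with no -preserve- or -restore-:

```stata
* Requires Stata 16 or later; "means" is a hypothetical frame name
sysuse auto, clear

* Copy the data in the default frame into a second frame
frame copy default means

* Collapse the copy only; the default frame is not changed
frame means: collapse (mean) price, by(foreign)
frame means: list

* The original data are still in memory in the default frame
describe, short
```

The copy still costs memory, so this trades file I/O for RAM rather than removing the cost altogether.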

On the other hand, R (the main serious non-legacy competitor to Stata nowadays) has the even greater limitation that it doesn't have anything quite like Mata. Plus only a few of my colleagues seem to be confident using R!!!

Best wishes

Roger

Roger B Newson BSc MSc DPhil
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton Campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: [email protected]
Web page: http://www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:
http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/

Opinions expressed are those of the author, not of the institution.

On 19/08/2013 01:06, Phil Clayton wrote:
If you can avoid the -preserve- and -restore- you save loads of time (at least on my modest system...)

*--ex5.  using summarize and postfile**
tempname post
tempfile postfile
postfile `post' v1 v2 mean sd n using "`postfile'"
forval x = 4(-1)1 {
	forval y = 3(-1)1 {
		display "v1=`x', v2=`y'"
		qui sum v3 if v1==`x' & v2 == `y'
		post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
	} //end of y loop
} //end of x loop
postclose `post'
use "`postfile'", clear

On 19/08/2013, at 8:31 AM, Eric A. Booth <[email protected]> wrote:

<>
Hi Laszlo:   I agree that it would be nice if -tabulate,summarize()-
stored values but it doesn't.  There are several options available to
store those values and then use them elsewhere.  The issues seem to be
(1) ease of parsing the values into a format that you can use for
other analyses and (2) (and more important for you) the speed with
which you can calculate, store, parse, and then use those values.

Some alternatives include logging the -tabulate,
summarize()- output and then parsing it, using -collapse- to get your
values,  or using the compiled  -summarize- command to obtain the
values of interest and store them for use elsewhere.  I'm sure there
are other options, but below is a comparison of these methods against
the speed of the desired -tabulate, summarize()- solution on a
large-ish fake dataset.

This is not a clean comparison and the values I store for later use
are not exactly the same in every example, but it gives you an idea of
the speed differences of the steps that might be involved for each
approach (that is, preserving the data, summarizing or collapsing or
XX, storing and parsing the output, and restoring the data).  The
upshot is that, for this example on my computer, running
-summarize- in a loop to grab the values you want and store them in a
dataset was the quickest alternative to -tabulate, summarize()- that I
tried (Example 4 below), but this would be slower with a lot of data
points to calculate.  Plus, Examples 3 & 4 below are both faster than
running -tabulate, summarize()-.

Using -tabulate, summarize()-  to get values takes about 101 seconds
to run in my example.
Example 1 is a regular -tabulate- with the cell counts stored in a
matrix -- this took about 9 seconds, but doesn't require any calculation
of means or whatnot.  Ex 2 uses -logout- to parse the output (you could
do this manually too) and took the longest at about 109 seconds.  Ex 3
uses -collapse- with preserve/restore and takes about 36 seconds.  Ex
4 uses a loop to grab means from -summarize- for certain values and
takes about 27 seconds.

*********************! Begin Example
//intro stuff//
clear all
timer clear
set rmsg on
*--install  packages for the example
cap which logout
if _rc ssc install logout , replace
*--make fake data
sa master.dta, replace emptyok //for later
set obs `=2^25' //run on a big dataset
forval x = 1/10 {
   g v`x' = round(runiform()*5)
}


//examples//
   **
   tabulate v1 v2, summarize(v3)  //for ref. takes c.108 Seconds
   **

*--ex1. time working with -tab- stored values**
**this doesn't get the values you need..
**but allows us to compare speed of these approaches somewhat
tab v1 v2,  matcell(A)
mat list A
preserve
  clear
svmat A, names(A)
keep A1
keep in 1/3 //parse
l
restore


*--ex2.  parsing the tab, summarize() output**
*logout*
preserve
     caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
     logout, use(mystuff.txt) save(mytable) clear dta replace
u mytable.dta, clear
keep v1 v2
keep in 4/6 //parse as needed
restore
*! or just log this and parse it yourself, probably faster to do so



*--ex3. using collapse**
*this might be your best option if you have a lot of datapoints to calculate/store*!
preserve
collapse (mean) v3 , by(v1 v2)
keep v2 v3
keep in 2/5 //parse
l
restore


*--ex4.  using summarize**
forval x = 4(-1)1 {
    forval y = 3(-1)1 {
        qui sum v3 if v1==`x' & v2 == `y', meanonly
        loc val`x' `r(mean)'
        preserve
        clear
        set obs 1
        g name = "`x' and `y'"
        g v1 = `val`x'' in 1
        append using master.dta
        sa master.dta, replace  //values you need are in this dta file
        restore
    } //end of y loop
} //end of x loop
*********************! End Example
Note: -timer- kept resetting because the internal programming of
-logout- cleared the timers each time, so I just added up the -rmsg-
timings.
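
For a single step that does not call anything that clears the timers, a numbered timer can be used instead of adding up -rmsg- output. This is only a sketch on small fake data; timer number 1 and the variable names are arbitrary choices for illustration:

```stata
* Small fake data, for illustration only
clear
set seed 2013
set obs 100000
generate v1 = round(runiform()*5)
generate v3 = round(runiform()*5)

* Time one step with timer number 1
timer clear 1
timer on 1
quietly summarize v3 if v1 == 2
timer off 1

* -timer list- leaves the elapsed seconds in r(t1)
quietly timer list 1
display "elapsed seconds: " r(t1)
```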



HTH,

Eric
___
Eric A. Booth
Research Scientist
Gibson Consulting Group
[email protected]




On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <[email protected]> wrote:

Thanks again!

I am not sure if those preserve-and-restore the data, but I should check.

I think what will happen is that I log the -tab, sum()-, and somehow
read in numbers from the log file without opening a new dataset, and
plot "immediately" with -scatteri-.
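
A minimal sketch of getting from group means to a plot without opening a new dataset, assuming the means are taken from -summarize- in a loop rather than read back from a log file (a swapped-in technique, not the log-parsing route described above). The fake data and the restriction to v2==1 are only for illustration:

```stata
* Hypothetical fake data; v1, v2, v3 mimic the names used in this thread
clear
set seed 12345
set obs 1000
generate v1 = ceil(runiform()*4)
generate v2 = ceil(runiform()*3)
generate v3 = rnormal()

* Collect "y x" pairs (mean of v3, value of v1) for v2==1 in a local macro
local coords
forvalues x = 1/4 {
    quietly summarize v3 if v1 == `x' & v2 == 1, meanonly
    local coords `coords' `r(mean)' `x'
}

* Plot the collected points; the data in memory are not replaced
twoway scatteri `coords', ytitle("Mean of v3") xtitle("v1")
```

Because -scatteri- takes its points as immediate arguments, nothing has to be preserved, restored or saved to disk.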

Laszlo

On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
<[email protected]> wrote:
One way of doing what you want is probably to use the -xcontract- and
-xcollapse- packages, which you can download from SSC. These are extended
versions of -collapse- and -contract-, which can save the output datasets
(or resultssets) to Stata .dta files on disk, with which the user can do all
kinds of plotting and tabulating.
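
As a rough sketch of that approach, assuming -xcollapse- is installed from SSC and that its saving() option behaves as I understand it from the package's help file (check -help xcollapse- before relying on this). The fake data and the tempfile name results are hypothetical:

```stata
* Install -xcollapse- from SSC if it is not already present
capture which xcollapse
if _rc ssc install xcollapse

* Hypothetical fake data, named after the variables used in this thread
clear
set seed 2013
set obs 1000
generate v1 = ceil(runiform()*4)
generate v2 = ceil(runiform()*3)
generate v3 = rnormal()

* Write the cell means to a .dta file on disk
tempfile results
xcollapse (mean) v3, by(v1 v2) saving("`results'", replace)

* The resultsset can be read back later for plotting or tabulating:
* use "`results'", clear
```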


Best wishes

Roger


On 18/08/2013 21:49, László Sándor wrote:

Thanks, Roger.

I never meant that StataCorp should give away their source. I was only
hoping to squeeze out some more interoperability. And so much of the
rest of the code is in smaller chunks. Not -tabulate-, I see.

I should have thought of -which-.

I only wanted to capture some of the results/output without logging
and parsing the log.

Thanks,

Laszlo

On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
<[email protected]> wrote:

I think you'll find that everything really is in the executable
"/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is because
Stata is not open-source, and was never supposed to be. StataCorp have to
make a living, and would probably not be able to do so if it was
open-source and users could make generic copies.

A lot of the code for official Stata is open-source (i.e. in
ado-files), but -tabulate- isn't. If you type, in Stata,

which tabulate

then Stata will answer

built-in command:  tabulate

meaning that there is no file -tabulate.ado-.

I hope this helps.

Best wishes

Roger



On 18/08/2013 21:21, László Sándor wrote:


Hi all,

I am trying to understand how -tabulate, summarize- works. I
understand that much of it is written in C code, but I would still
expect to find some black boxes of files that do the magic. I think I
checked all folders, incl. hidden folders within /Applications/Stata
on my mac, and even checked the package contents of
/Applications/Stata/StataMP. I found no trace of -tabulate-, or any
other plugin/DLL whatsoever. Is everything wrapped into the Unix
executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
Really?

As I only need the results of -tab, sum()-, I hope to see some code
calling -_tab.ado- or some other code to display the results. Is
everything in the compiled binary instead?

Well, something must add up those 33.9 MBs…

Thanks for any thoughts,

Laszlo

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC