Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | László Sándor <sandorl@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: where is StataCorp C code located? all in a single executable as compiled binary? |
Date | Mon, 19 Aug 2013 10:50:12 -0400 |
Credit where credit is due: I meant Eric and Phil's tests, of course. I apologize. Roger's thoughts are also much appreciated.

I am still surprised that loops of interpreted code beat the built-in C. So maybe -tabulate- was not heavily optimized in the end.

Thanks for everything!

Laszlo

On Mon, Aug 19, 2013 at 10:45 AM, László Sándor <sandorl@gmail.com> wrote:
> Thanks, all.
>
> I am still confused how I could combine the speed of some of the
> methods like -collapse- without losing my data, which usually takes
> dozens of GBs.
>
> Otherwise I think we are only talking about -tabulate- versus -table-
> but both need log-parsing, or some -bys: summarize- and collecting
> locals, which I did not attempt.
>
> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
> of the many lines of -summarize, meanonly-, esp. as it runs over the
> dataset many times, just with -if- picking different observations.
>
> On an 8-core StataMP 13 for Linux,
> full -tabulate, sum- itself took ~140s
> -tab, matcell- took <5s, but indeed generates frequencies only.
> a second -tabulate, sum-, even with nof and nost, took the same
> also with the caplog wrapper
> a -collapse, fast- took 36s, but of course this loses the data
> the -summarize- took 92s without the postfiles, 34.53s with, but I
> still cannot scatteri the results in the same Stata instance…
>
> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
> full -tabulate, sum- itself took ~195s
> -tab, matcell-: 8s
> again same speed without frequencies and standard deviations, or with
> the wrapper, for -tab, sum-
> -collapse- took 60s
> the loops of -summarize- took 160s now without the postfiles, 47s with.
>
> Thanks!
>
> Laszlo
>
> On Mon, Aug 19, 2013 at 8:59 AM, Phil Clayton
> <philclayton@internode.on.net> wrote:
>> There's no need to speculate - Eric and I provided example code, it's easy to test it and see for yourself.
>> On my system (Stata/IC 13 for Mac) -tab, sum()- is definitely not the fastest method.
>>
>> Stata can only handle one dataset in memory, but it can store plenty of scalars, macros and matrices. Since all you want to do is plot the results using -scatteri- there is no need to have the results in a dataset anyway... (although for ease of programming a single -preserve- to access the results is often not too big a hit)
>>
>> Phil
>>
>> On 19/08/2013, at 10:16 PM, László Sándor <sandorl@gmail.com> wrote:
>>
>>> Thanks for all this.
>>>
>>> Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
>>> the fastest method by far.
>>>
>>> But indeed, having multiple datasets in memory is the bottleneck, so I
>>> am not sure whether postfile or logout would solve much of the problem;
>>> as for the results from the new files, I'd need to lose the current
>>> data (or preserve and restore it).
>>>
>>> Currently, I am working on reading in the tabulated values into macros
>>> to plug them into -scatteri-, but it is a hack.
>>>
>>> Thanks again,
>>>
>>> Laszlo
>>>
>>> On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
>>> <r.newson@imperial.ac.uk> wrote:
>>>> The main problem with this solution is that you have to put in a lot more
>>>> programming time, especially if you want to conserve the variable labels,
>>>> value labels etc. of the by-variables. (That at least is my excuse for the
>>>> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
>>>> tend to use.)
>>>>
>>>> IMHO it is a major limitation of Stata that it cannot store any number of
>>>> datasets (or dataframes) in the memory at a time. If it could, then we would
>>>> not be forced to use -preserve- and -restore- so often and burn computer
>>>> time in file I/O, just to conserve person-days.
>>>>
>>>> On the other hand, R (the main serious non-legacy competitor to Stata
>>>> nowadays) has the even greater limitation that it doesn't have anything
>>>> quite like Mata.
>>>> Plus only a few of my colleagues seem to be confident using
>>>> R!!!
>>>>
>>>> Best wishes
>>>>
>>>> Roger
>>>>
>>>> Roger B Newson BSc MSc DPhil
>>>> Lecturer in Medical Statistics
>>>> Respiratory Epidemiology and Public Health Group
>>>> National Heart and Lung Institute
>>>> Imperial College London
>>>> Royal Brompton Campus
>>>> Room 33, Emmanuel Kaye Building
>>>> 1B Manresa Road
>>>> London SW3 6LR
>>>> UNITED KINGDOM
>>>> Tel: +44 (0)20 7352 8121 ext 3381
>>>> Fax: +44 (0)20 7351 8322
>>>> Email: r.newson@imperial.ac.uk
>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>>> Departmental Web page:
>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>>
>>>> Opinions expressed are those of the author, not of the institution.
>>>>
>>>> On 19/08/2013 01:06, Phil Clayton wrote:
>>>>>
>>>>> If you can avoid the -preserve- and -restore- you save loads of time (at
>>>>> least on my modest system...)
>>>>>
>>>>> *--ex5. using summarize and postfile**
>>>>> tempname post
>>>>> tempfile postfile
>>>>> postfile `post' v1 v2 mean sd n using "`postfile'"
>>>>> forval x = 4(-1)1 {
>>>>>     forval y = 3(-1)1 {
>>>>>         display "v1=`x', v2=`y'"
>>>>>         qui sum v3 if v1==`x' & v2 == `y'
>>>>>         post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
>>>>>     } //end of y loop
>>>>> } //end of x loop
>>>>> postclose `post'
>>>>> use "`postfile'", clear
>>>>>
>>>>> On 19/08/2013, at 8:31 AM, Eric A. Booth <eric.a.booth@gmail.com> wrote:
>>>>>
>>>>>> <>
>>>>>> Hi Laszlo: I agree that it would be nice if -tabulate, summarize()-
>>>>>> stored values but it doesn't. There are several options available to
>>>>>> store those values and then use them elsewhere. The issues seem to be
>>>>>> (1) ease of parsing the values into a format that you can use for
>>>>>> other analyses and (2) (and more important for you) the speed with
>>>>>> which you can calculate, store, parse, and then use those values.
>>>>>>
>>>>>> Some alternatives include logging the -tabulate,
>>>>>> summarize()- output and then parsing it, using -collapse- to get your
>>>>>> values, or using the compiled -summarize- command to obtain the
>>>>>> values of interest and store them for use elsewhere. I'm sure there
>>>>>> are other options, but below is a comparison of these methods against
>>>>>> the speed of the desired -tabulate, summarize()- solution on a
>>>>>> large-ish fake dataset.
>>>>>>
>>>>>> This is not a clean comparison and the values I store for later use
>>>>>> are not exactly the same in every example, but it gives you an idea of
>>>>>> the speed differences of the steps that might be involved for each
>>>>>> approach (that is, preserving the data, summarizing or collapsing or
>>>>>> XX, storing and parsing the output, and restoring the data). The
>>>>>> upshot is that, for this example on my computer, it seems that running
>>>>>> -summarize- in a loop to grab the values you want and store them in a
>>>>>> dataset was the quickest option I tried other than -tab, summarize()-
>>>>>> (example 4 below), but this would be slower on a lot of data points.
>>>>>> Plus, Examples 3 & 4 below are both faster than running -tabulate,
>>>>>> summarize()-.
>>>>>>
>>>>>> Using -tabulate, summarize()- to get values takes about 101 seconds
>>>>>> to run in my example.
>>>>>> Example 1 is regular tabulate example with cells stored in a matrix --
>>>>>> this took about 9 seconds, but doesn't require any calculation of means
>>>>>> or whatnot. Ex 2 is using -logout- to parse the syntax (you could do
>>>>>> this manually too) and took the longest at about 109 seconds. Ex 3
>>>>>> uses -collapse- with preserve/restore and takes about 36 seconds. Ex
>>>>>> 4 uses a loop to grab means from summarize for certain values and
>>>>>> takes about 27 seconds.
>>>>>>
>>>>>> *********************! Begin Example
>>>>>> //intro stuff//
>>>>>> clear all
>>>>>> timer clear
>>>>>> set rmsg on
>>>>>> *--install packages for the example
>>>>>> cap which logout
>>>>>> if _rc ssc install logout , replace
>>>>>> *--make fake data
>>>>>> sa master.dta, replace emptyok //for later
>>>>>> set obs `=2^25' //run on a big dataset
>>>>>> forval x = 1/10 {
>>>>>>     g v`x' = round(runiform()*5)
>>>>>> }
>>>>>>
>>>>>>
>>>>>> //examples//
>>>>>> **
>>>>>> tabulate v1 v2, summarize(v3) //for ref. takes c.108 Seconds
>>>>>> **
>>>>>>
>>>>>> *--ex1. time working with -tab- stored values**
>>>>>> **this doesn't get the values you need..
>>>>>> **but allows us to compare speed of these approaches somewhat
>>>>>> tab v1 v2, matcell(A)
>>>>>> mat list A
>>>>>> preserve
>>>>>> clear
>>>>>> svmat A, names(A)
>>>>>> keep A1
>>>>>> keep in 1/3 //parse
>>>>>> l
>>>>>> restore
>>>>>>
>>>>>>
>>>>>> *--ex2. parsing the tab, summarize() output**
>>>>>> *logout*
>>>>>> preserve
>>>>>> caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>>>>> logout, use(mystuff.txt) save(mytable) clear dta replace
>>>>>> u mytable.dta, clear
>>>>>> keep v1 v2
>>>>>> keep in 4/6 //parse as needed
>>>>>> restore
>>>>>> *! or just log this and parse it yourself, probably faster to do so
>>>>>>
>>>>>>
>>>>>> *--ex3. using collapse**
>>>>>> *this might be your best option if you have a lot of datapoints to calculate/store*!
>>>>>> preserve
>>>>>> collapse (mean) v3 , by(v1 v2)
>>>>>> keep v2 v3
>>>>>> keep in 2/5 //parse
>>>>>> l
>>>>>> restore
>>>>>>
>>>>>>
>>>>>> *--ex4. using summarize**
>>>>>> forval x = 4(-1)1 {
>>>>>>     forval y = 3(-1)1 {
>>>>>>         qui sum v3 if v1==`x' & v2 == `y', meanonly
>>>>>>         loc val`x' `r(mean)'
>>>>>>         preserve
>>>>>>         clear
>>>>>>         set obs 1
>>>>>>         g name = "`x' and `y'"
>>>>>>         g v1 = `val`x'' in 1
>>>>>>         append using master.dta
>>>>>>         sa master.dta, replace //values you need are in this dta file
>>>>>>         restore
>>>>>>     } //end of y loop
>>>>>> } //end of x loop
>>>>>> *********************! End Example
>>>>>> note: -timer- kept resetting because the internal programming of -logout-
>>>>>> was clearing the timer each time, so I just added up across the -rmsg-
>>>>>> timings.
>>>>>>
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> Eric
>>>>>> ___
>>>>>> Eric A. Booth
>>>>>> Research Scientist
>>>>>> Gibson Consulting Group
>>>>>> ebooth@gibsonconsult.com
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <sandorl@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks again!
>>>>>>>
>>>>>>> I am not sure if those preserve-and-restore the data, but I should
>>>>>>> check.
>>>>>>>
>>>>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>>>>> read in numbers from the log file without opening a new dataset, and
>>>>>>> plot "immediately" with -scatteri-.
>>>>>>>
>>>>>>> Laszlo
>>>>>>>
>>>>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>>>>> <r.newson@imperial.ac.uk> wrote:
>>>>>>>>
>>>>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>>>>> -xcollapse- packages, which you can download from SSC. These are
>>>>>>>> extended versions of -collapse- and -contract-, which can save the
>>>>>>>> output datasets (or resultssets) to Stata .dta files on disk, with
>>>>>>>> which the user can do all kinds of plotting and tabulating.
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>>
>>>>>>>> Roger
>>>>>>>>
>>>>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>>>>
>>>>>>>>> Thanks, Roger.
>>>>>>>>>
>>>>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>>>>
>>>>>>>>> I should have thought of -which-.
>>>>>>>>>
>>>>>>>>> I only wanted to capture some of the results/output without logging
>>>>>>>>> and parsing the log.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>>>>> <r.newson@imperial.ac.uk> wrote:
>>>>>>>>>>
>>>>>>>>>> I think you'll find that everything really is in the executable
>>>>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is
>>>>>>>>>> because Stata is not open-source, and was never supposed to be.
>>>>>>>>>> StataCorp have to make a living, and would probably not be able to
>>>>>>>>>> do so if it was open-source and users could make generic copies.
>>>>>>>>>>
>>>>>>>>>> A lot of the code for a lot of official Stata is open-source (i.e. in
>>>>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>>>>
>>>>>>>>>> which tabulate
>>>>>>>>>>
>>>>>>>>>> then Stata will answer
>>>>>>>>>>
>>>>>>>>>> built-in command: tabulate
>>>>>>>>>>
>>>>>>>>>> meaning that there is no file -tabulate.ado-.
>>>>>>>>>>
>>>>>>>>>> I hope this helps.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>>
>>>>>>>>>> Roger
>>>>>>>>>>
>>>>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>>>>> checked all folders, incl. hidden folders within /Applications/Stata
>>>>>>>>>>> on my mac, and even checked the package contents of
>>>>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>>>>> other plugin/DLL whatsoever.
>>>>>>>>>>> Is everything wrapped into the Unix
>>>>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>>>>> Really?
>>>>>>>>>>>
>>>>>>>>>>> As I only need the results of -tab, sum()-, I hope to see some code
>>>>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>>>>> everything in the compiled binary instead?
>>>>>>>>>>>
>>>>>>>>>>> Well, something must add up to those 33.9 MBs…
>>>>>>>>>>>
>>>>>>>>>>> Thanks for any thoughts,
>>>>>>>>>>>
>>>>>>>>>>> Laszlo
>>>>>>>>>>>
>>>>>>>>>>> *
>>>>>>>>>>> * For searches and help try:
>>>>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>> * http://www.ats.ucla.edu/stat/stata/
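
The idea raised in the thread, looping over -summarize, meanonly- and handing the results straight to -scatteri- so that the data never leave memory, might look roughly like the sketch below. It is not anyone's posted code: the fake data, the variable names v1 and v3, the seed, and the macro name coords are all assumptions made for illustration, and timings on real data will differ by system.

```stata
* Sketch only: group means plotted with -scatteri-, no -preserve-/-restore-
clear
set seed 12345
set obs 1000
generate v1 = ceil(runiform()*4)    // hypothetical by-variable, values 1 to 4
generate v3 = runiform()            // hypothetical outcome

local coords                        // will collect "y x y x ..." pairs
quietly levelsof v1, local(levels)  // distinct values of the by-variable
foreach l of local levels {
    quietly summarize v3 if v1 == `l', meanonly
    local coords `coords' `r(mean)' `l'
}

* -scatteri- takes immediate y x pairs, so no results dataset is needed
twoway scatteri `coords', ytitle("Mean of v3") xtitle("v1")
```

The original data are still in memory afterwards. With many cells the macro grows long, so posting to a -postfile- as in Phil's ex5 may scale better.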