
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: -collapsetofile-


From   Jeph Herrin <info@flyingbuttress.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: -collapsetofile-
Date   Fri, 28 Feb 2014 13:48:24 -0500

I think you can get all the info you need to write Stata files here:

 http://www.stata.com/help.cgi?dta_115


A utility for writing out, e.g., a matrix to a .dta file would be very useful, but I can't seem to find one anywhere. I usually write the matrix out as a CSV file and then -insheet- it, which works well enough.
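The CSV route can be sketched roughly as follows. This is only an illustration, not anyone's posted code: the matrix M, its column names a and b, and the file name matrix_out.csv are made up for the example. Writing with -file- leaves the data in memory untouched until the final -insheet-.

```stata
* Hypothetical example matrix; all names here are illustrative assumptions
matrix M = (1, 2 \ 3, 4)
matrix colnames M = a b

tempname fh
file open `fh' using "matrix_out.csv", write text replace

* header row built from the matrix column names
local cnames : colnames M
local header : subinstr local cnames " " ",", all
file write `fh' "`header'" _n

* one comma-separated line per matrix row
forvalues i = 1/`=rowsof(M)' {
    local line
    forvalues j = 1/`=colsof(M)' {
        local sep = cond(`j' == 1, "", ",")
        local line "`line'`sep'`=M[`i',`j']'"
    }
    file write `fh' "`line'" _n
}
file close `fh'

* reading the file back replaces whatever data are in memory
insheet using "matrix_out.csv", comma clear
```

Nothing is loaded or cleared before the last line, so the -insheet- step can be postponed until the original data are no longer needed.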

cheers,
Jeph





On 2/28/2014 1:19 PM, Nick Cox wrote:
-save- is part of the executable

. which save
built-in command:  save

and so its code is not accessible to users.

Nick
njcoxstata@gmail.com


On 28 February 2014 18:06, Andrew Maurer <Andrew.Maurer@qrm.com> wrote:
Hi Statalist,

I've written a pair of programs, -collapsetofile- and -recover-, to allow users to "collapse" data to a file without destroying the dataset in memory the way -collapse- does. I don't know if anyone else will have use for this, but it will save me a lot of computer time when dealing with large datasets. I would be very interested if anyone has any input or comments on how to improve coding efficiency / style (the code is still a bit rough).

ado file (collapsetofile.ado): http://codepad.org/DcwtvDEb
ado file (recover.ado) : http://codepad.org/csZhQvb0
sthlp file (collapsetofile.sthlp): http://codepad.org/AsKC79uK

The biggest improvement would come from being able to save directly to a .dta. I assume that this would require either:
1) looking at the format/header/footer of stata dtas in clear text and fwrite()'ing it from mata, and/or
2) looking at the source for a command like save and just copying that (is the source for -save- available?)
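For option 1, the Mata side of the file I/O might look something like the sketch below. It only shows how bytes can be written and read back with Mata's buffered I/O functions; it does not produce a valid .dta file, and the file name demo.bin and the values written are made up for illustration. A real writer would have to emit the header, variable descriptors, and data exactly as the .dta format specification lays them out.

```stata
mata:
// Illustrative only: writes two doubles in binary, not a .dta file
unlink("demo.bin")                    // fopen() with "w" fails if the file exists
C  = bufio()                          // buffered-I/O control vector (native byte order)
fh = fopen("demo.bin", "w")
fbufput(C, fh, "%8z", (1.5, 2.5))     // write a 1 x 2 row of 8-byte doubles
fclose(fh)

// read the same two values back
fh = fopen("demo.bin", "r")
x  = fbufget(C, fh, "%8z", 1, 2)
fclose(fh)
x
end
```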

Before writing this I found myself waiting for hours when graphing summary statistics of large datasets with sequences of:

use fulldata // this could be >10gb
preserve
collapse (sum) thisvar thatvar, by(byvar1 byvar2)
... some data manipulation
twoway line...
restore

preserve
collapse (sum) anothervar yetanothervar, by(byvar3)
... some data manipulation
twoway line...
restore

...

preserve
collapse (sum) more vars, by(byvar10)
... some data manipulation
twoway line...
restore

For a 20gb dataset with 10 graphs, that makes 10 preserve/restore cycles * 20gb = 200gb written to disk and then read back. -collapsetofile- writes just the collapsed data to be graphed to a file with no other disk reads/writes:

use fulldata
collapsetofile (sum) thisvar thatvar using dataforgraph1, by(byvar1 byvar2)
collapsetofile (sum) anothervar yetanothervar using dataforgraph2, by(byvar3)
...
collapsetofile (sum) more vars using dataforgraph10, by(byvar10)

recover dataforgraph1, clear
... some data manipulation
twoway line...
...
recover dataforgraph2, clear
... some data manipulation
twoway line...
...

Thanks to Nick Cox for mentioning the importance of saving characteristics/metadata with the dataset.
Thanks to Sergiy Radyakin for making me realize that I could never write a Mata program that would compute stats "by" variables as fast as Stata's -_mean- in -collapse-, since Stata's built-in C code can take advantage of parallelization, while Mata code cannot.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
.


