
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: -collapsetofile-


From   "Radwin, David" <[email protected]>
To   <[email protected]>
Subject   RE: st: -collapsetofile-
Date   Fri, 28 Feb 2014 13:39:41 -0500

Andrew, 

You might look at Roger Newson's -xcollapse- (SSC) to avoid reinventing the wheel.

David
--
David Radwin, Senior Research Associate
Education and Workforce Development
RTI International
2150 Shattuck Ave. Suite 800, Berkeley, CA 94704
Phone: 510-665-8274

www.rti.org/education
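
For readers following up on the -xcollapse- suggestion, here is a minimal sketch of how it might stand in for one preserve/collapse/restore cycle. It assumes -xcollapse- has been installed from SSC and that its saving() option behaves as described in its help file. The auto dataset and the file name dataforgraph1 are stand-ins chosen for illustration, not taken from anyone's actual code.

```stata
* One-time install from SSC (assumes internet access):
* ssc install xcollapse

sysuse auto, clear            // stand-in for the large dataset

* Write the collapsed sums to a file. By default the data in memory
* should be left as they were (check -help xcollapse- to confirm).
xcollapse (sum) price weight, by(foreign) saving(dataforgraph1, replace)

describe, short               // expected to still be the full auto data

* Later, load the small collapsed file for graphing.
use dataforgraph1, clear
list
```

-xcollapse- may itself preserve and restore the data internally, so whether it actually saves time on a very large dataset is worth checking with -timer- before relying on it.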


> -----Original Message-----
> From: [email protected] [mailto:owner-[email protected]] On Behalf Of Jorge Eduardo Pérez Pérez
> Sent: Friday, February 28, 2014 10:24 AM
> To: [email protected]
> Subject: Re: st: -collapsetofile-
> 
> Thanks Andrew, this looks useful.
> 
> Why not submit the code to SSC to make it easier for users to install
> this directly from Stata?
> 
> 
> --------------------------------------------
> Jorge Eduardo Pérez Pérez
> Graduate Student
> Department of Economics
> Brown University
> 
> 
> On Fri, Feb 28, 2014 at 1:19 PM, Nick Cox <[email protected]> wrote:
> > -save- is part of the executable
> >
> > . which save
> > built-in command:  save
> >
> > and so its code is not accessible to users.
> >
> > Nick
> > [email protected]
> >
> >
> > On 28 February 2014 18:06, Andrew Maurer <[email protected]> wrote:
> >> Hi Statalist,
> >>
> >> I've written a pair of programs, -collapsetofile- and -recover-, to allow users to "collapse" data to a file without destroying the dataset in memory the way -collapse- does. I don't know if anyone else will have use for this, but it will save me a lot of computer time when dealing with large datasets. I would be very interested in any input or comments on how to improve coding efficiency / style (the code is still a bit rough).
> >>
> >> ado file (collapsetofile.ado): http://codepad.org/DcwtvDEb
> >> ado file (recover.ado) : http://codepad.org/csZhQvb0
> >> sthlp file (collapsetofile.sthlp): http://codepad.org/AsKC79uK
> >>
> >> The biggest improvement would come from being able to save directly to a .dta. I assume that this would require either:
> >> 1) looking at the format/header/footer of Stata .dta files in clear text and fwrite()'ing it from Mata, and/or
> >> 2) looking at the source for a command like -save- and just copying that (is the source for -save- available?)
> >>
> >> Before writing this I found myself waiting for hours when graphing summary statistics of large datasets with sequences of:
> >>
> >> use fulldata // this could be >10gb
> >> preserve
> >> collapse (sum) thisvar thatvar, by(byvar1 byvar2)
> >> ... some data manipulation
> >> twoway line...
> >> restore
> >>
> >> preserve
> >> collapse (sum) anothervar yetanothervar, by(byvar3)
> >> ... some data manipulation
> >> twoway line...
> >> restore
> >>
> >> ...
> >>
> >> preserve
> >> collapse (sum) more vars, by(byvar10)
> >> ... some data manipulation
> >> twoway line...
> >> restore
> >>
> >> For a 20 GB dataset with 10 graphs, that makes 10 preserves/restores * 20 GB = 200 GB written/read to disk. -collapsetofile- writes just the collapsed data to be graphed to a file, with no other disk reads/writes:
> >>
> >> use fulldata
> >> collapsetofile (sum) thisvar thatvar using dataforgraph1, by(byvar1 byvar2)
> >> collapsetofile (sum) anothervar yetanothervar using dataforgraph2, by(byvar3)
> >> ...
> >> collapsetofile (sum) more vars, by(byvar10)
> >>
> >> recover dataforgraph1, clear
> >> ... some data manipulation
> >> twoway line...
> >> ...
> >> recover dataforgraph2, clear
> >> ... some data manipulation
> >> twoway line...
> >> ...
> >>
> >> Thanks to Nick Cox for mentioning the importance of saving characteristics/metadata with the dataset.
> >> Thanks to Sergiy Radyakin for making me realize that I could never write a Mata program that would compute stats "by" variables as fast as Stata's -_mean- in -collapse-, since Stata's built-in C code can take advantage of parallelization, while Mata code cannot.


