Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: -collapsetofile-


From   Nirina F <[email protected]>
To   [email protected]
Subject   Re: st: -collapsetofile-
Date   Fri, 28 Feb 2014 17:07:05 -0500

Great! Was looking for something like this one.


On Fri, Feb 28, 2014 at 2:00 PM, Nick Cox <[email protected]> wrote:
> To get stuff on SSC, you just need to email Kit Baum with the files.
> But as he announced very recently he is away from base right now.
>
> http://repec.org/bocode/s/sscsubmit.html gives full details.
>
> Nick
> [email protected]
>
>
> On 28 February 2014 18:51, Andrew Maurer <[email protected]> wrote:
>> Thanks for the reference, David.
>>
>> Looking at xcollapse.do, it internally does a preserve/save/restore. The whole idea of -collapsetofile- is to save the data without doing a preserve/restore. It looks like the intended purpose of -xcollapse- is to add features to -collapse-, while the purpose of -collapsetofile- is to save a file faster. (-collapsetofile-, at the moment, does far less than -collapse-. I still need to spend some time reading through the syntax-parsing portion of -collapse- to allow syntax like (sum) x1 = y x2 = z...)
>>
>> It looks like I need a RePEc account to post to SSC, if I'm understanding this. I'm looking into it now.
>>
>> Andrew Maurer
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
>> Sent: Friday, February 28, 2014 12:47 PM
>> To: [email protected]
>> Subject: Re: st: -collapsetofile-
>>
>> Using SSC as a medium for distributing user-written programs is naturally entirely optional for user-programmers.
>>
>> It is however I think germane that SSC requires provision of help files as part of a minimum standard for inclusion of packages.
>>
>> Similarly, providing help files would help people to understand exactly what these programs do and help Andrew get good feedback from anyone interested.
>>
>> (If I am missing the help files, please do flag where they are.)
>>
>> Nick
>> [email protected]
>>
>>
>> On 28 February 2014 18:24, Jorge Eduardo Pérez Pérez <[email protected]> wrote:
>>> Thanks Andrew, this looks useful.
>>>
>>> Why not submit the code to SSC to make it easier for users to install
>>> this directly from Stata?
>>>
>>>
>>> --------------------------------------------
>>> Jorge Eduardo Pérez Pérez
>>> Graduate Student
>>> Department of Economics
>>> Brown University
>>>
>>>
>>> On Fri, Feb 28, 2014 at 1:19 PM, Nick Cox <[email protected]> wrote:
>>>> -save- is part of the executable
>>>>
>>>> . which save
>>>> built-in command:  save
>>>>
>>>> and so its code is not accessible to users.
>>>>
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 28 February 2014 18:06, Andrew Maurer <[email protected]> wrote:
>>>>> Hi Statalist,
>>>>>
>>>>> I've written a pair of programs, -collapsetofile- and -recover-, to allow users to "collapse" data to a file without destroying the dataset in memory, as -collapse- does. I don't know if anyone else will have use for this, but it will save me a lot of computer time when dealing with large datasets. I would be very interested if anyone has any input or comments on how to improve coding efficiency / style (the code is still a bit rough).
>>>>>
>>>>> ado file (collapsetofile.ado): http://codepad.org/DcwtvDEb
>>>>> ado file (recover.ado): http://codepad.org/csZhQvb0
>>>>> sthlp file (collapsetofile.sthlp): http://codepad.org/AsKC79uK
>>>>>
>>>>> The biggest improvement would come from being able to save directly to a .dta. I assume that this would require either:
>>>>> 1) looking at the format/header/footer of stata dtas in clear text
>>>>> and fwrite()'ing it from mata, and/or
>>>>> 2) looking at the source for a command like save and just copying
>>>>> that (is the source for -save- available?)
>>>>>
>>>>> Before writing this I found myself waiting for hours when graphing summary statistics of large datasets with sequences of:
>>>>>
>>>>> use fulldata // this could be >10gb
>>>>> preserve
>>>>> collapse (sum) thisvar thatvar, by(byvar1 byvar2)
>>>>> ... some data manipulation
>>>>> twoway line...
>>>>> restore
>>>>>
>>>>> preserve
>>>>> collapse (sum) anothervar yetanothervar, by(byvar3)
>>>>> ... some data manipulation
>>>>> twoway line...
>>>>> restore
>>>>>
>>>>> ...
>>>>>
>>>>> preserve
>>>>> collapse (sum) more vars, by(byvar10)
>>>>> ... some data manipulation
>>>>> twoway line...
>>>>> restore
>>>>>
>>>>> For a 20gb dataset with 10 graphs, that makes 10 preserves/restores * 20gb = 200gb written/read to disk. -collapsetofile- writes just the collapsed data to be graphed to a file with no other disk reads/writes:
>>>>>
>>>>> use fulldata
>>>>> collapsetofile (sum) thisvar thatvar using dataforgraph1, by(byvar1 byvar2)
>>>>> collapsetofile (sum) anothervar yetanothervar using dataforgraph2, by(byvar3)
>>>>> ...
>>>>> collapsetofile (sum) more vars, by(byvar10)
>>>>>
>>>>> recover dataforgraph1, clear
>>>>> ... some data manipulation
>>>>> twoway line...
>>>>> ...
>>>>> recover dataforgraph2, clear
>>>>> ... some data manipulation
>>>>> twoway line...
>>>>> ...
>>>>>
>>>>> Thanks to Nick Cox for mentioning the importance of saving characteristics/metadata with the dataset.
>>>>> Thanks to Sergiy Radyakin for making me realize that I could never write a mata program that would compute stats "by" variables as fast as stata's -_mean- in -collapse-, since stata's built-in C code can take advantage of parallelization, while mata code cannot.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
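
For readers on Stata 16 or later, frames may offer another way to get a collapsed file on disk without a preserve/restore cycle on the full dataset. What follows is a minimal sketch under that assumption. It is not how -collapsetofile- works, and frames did not exist when this thread was written. It uses the auto dataset shipped with Stata as a stand-in for the large dataset, and the frame name tocollapse is a made-up placeholder.

```stata
* Sketch only: assumes Stata 16+ (frames). Not the -collapsetofile- method.
sysuse auto, clear                       // stand-in for the large dataset

* Copy only the variables needed for the summary into a second frame.
* The copy is made in memory; nothing is written to disk at this step.
frame put price mpg foreign, into(tocollapse)

* Collapse and save inside the second frame; the default frame is untouched.
frame tocollapse {
    collapse (sum) price mpg, by(foreign)
    save dataforgraph1, replace
}
frame drop tocollapse

* The full dataset is still in memory in the default frame.
describe, short
```

The only disk write here is the small collapsed file, which is the same goal as the -collapsetofile- workflow quoted above. The cost is the extra memory needed to hold a copy of the selected variables while the second frame exists.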

