Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: -collapsetofile-


From   Nick Cox <njcoxstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: -collapsetofile-
Date   Fri, 28 Feb 2014 18:47:38 +0000

Sorry, belay that. They were in plain sight all the time.
Nick
njcoxstata@gmail.com


On 28 February 2014 18:46, Nick Cox <njcoxstata@gmail.com> wrote:
> Using SSC as a medium for distributing user-written programs is
> naturally entirely optional for user-programmers.
>
> It is however I think germane that SSC requires provision of help
> files as part of a minimum standard for inclusion of packages.
>
> Similarly, providing help files would help people to understand
> exactly what these programs do and help Andrew get good feedback from
> anyone interested.
>
> (If I am missing the help files, please do flag where they are.)
>
> Nick
> njcoxstata@gmail.com
>
>
> On 28 February 2014 18:24, Jorge Eduardo Pérez Pérez
> <jorge_perez@brown.edu> wrote:
>> Thanks Andrew, this looks useful.
>>
>> Why not submit the code to SSC to make it easier for users to install
>> this directly from Stata?
>>
>>
>> --------------------------------------------
>> Jorge Eduardo Pérez Pérez
>> Graduate Student
>> Department of Economics
>> Brown University
>>
>>
>> On Fri, Feb 28, 2014 at 1:19 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> -save- is part of the executable
>>>
>>> . which save
>>> built-in command:  save
>>>
>>> and so its code is not accessible to users.
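For contrast, a brief sketch of how the same check behaves for a command that is not built in. -collapse- is implemented as an ado-file, so its source can be located and read. The output depends on the local installation, so none is shown.

```stata
* -collapse- is an ado-file, so -which- reports the file it is loaded from
which collapse

* -viewsource- searches the ado-path and displays that file in the Viewer
viewsource collapse.ado
```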
>>>
>>> Nick
>>> njcoxstata@gmail.com
>>>
>>>
>>> On 28 February 2014 18:06, Andrew Maurer <Andrew.Maurer@qrm.com> wrote:
>>>> Hi Statalist,
>>>>
>>>> I've written a pair of programs, -collapsetofile- and -recover-, to allow users to "collapse" data to a file without destroying the dataset like -collapse- does. I don't know if anyone else will have use for this, but it will save me a lot of computer time when dealing with large datasets. I would be very interested if anyone has any input or comments on how to improve coding efficiency / style (the code is still a bit rough).
>>>>
>>>> ado file (collapsetofile.ado): http://codepad.org/DcwtvDEb
>>>> ado file (recover.ado) : http://codepad.org/csZhQvb0
>>>> sthlp file (collapsetofile.sthlp): http://codepad.org/AsKC79uK
>>>>
>>>> The biggest improvement would come from being able to save directly to a .dta. I assume that this would require either:
>>>> 1) looking at the format/header/footer of stata dtas in clear text and fwrite()'ing it from mata, and/or
>>>> 2) looking at the source for a command like save and just copying that (is the source for -save- available?)
>>>>
>>>> Before writing this I found myself waiting for hours when graphing summary statistics of large datasets with sequences of:
>>>>
>>>> use fulldata // this could be >10gb
>>>> preserve
>>>> collapse (sum) thisvar thatvar, by(byvar1 byvar2)
>>>> ... some data manipulation
>>>> twoway line...
>>>> restore
>>>>
>>>> preserve
>>>> collapse (sum) anothervar yetanothervar, by(byvar3)
>>>> ... some data manipulation
>>>> twoway line...
>>>> restore
>>>>
>>>> ...
>>>>
>>>> preserve
>>>> collapse (sum) more vars, by(byvar10)
>>>> ... some data manipulation
>>>> twoway line...
>>>> restore
>>>>
>>>> For a 20gb dataset with 10 graphs, that makes 10 preserve/restore cycles * 20gb = 200gb written to disk and then read back. -collapsetofile- writes just the collapsed data to be graphed to a file with no other disk reads/writes:
>>>>
>>>> use fulldata
>>>> collapsetofile (sum) thisvar thatvar using dataforgraph1, by(byvar1 byvar2)
>>>> collapsetofile (sum) anothervar yetanothervar using dataforgraph2, by(byvar3)
>>>> ...
>>>> collapsetofile (sum) more vars, by(byvar10)
>>>>
>>>> recover dataforgraph1, clear
>>>> ... some data manipulation
>>>> twoway line...
>>>> ...
>>>> recover dataforgraph2, clear
>>>> ... some data manipulation
>>>> twoway line...
>>>> ...
>>>>
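The general idea of writing group summaries to a file while leaving the data in memory intact can be sketched with built-in commands only. This is an illustration under stated assumptions, not the code behind -collapsetofile-: it uses the auto dataset shipped with Stata, the names sum_price, lastobs and sums_by_group.csv are placeholders, and it writes delimited text rather than a .dta file.

```stata
* Illustrative sketch only: NOT the implementation of -collapsetofile-.
* Dataset, variable names and file name are placeholders.
sysuse auto, clear

* Add the group total as a new variable; no observations are dropped.
* Note that -bysort- changes the sort order of the data in memory.
bysort foreign rep78: egen double sum_price = total(price)

* Flag one observation per group (the last one in each group).
bysort foreign rep78: gen byte lastobs = (_n == _N)

* Write only the flagged rows, i.e. one row per group, to a text file.
export delimited foreign rep78 sum_price using sums_by_group.csv if lastobs, replace

* Drop the helper variables; the original variables are untouched.
drop sum_price lastobs
```

The only file written is the small summary file, but the approach costs memory for one extra variable per statistic and re-sorts the data, so it may not suit very large datasets.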
>>>> Thanks to Nick Cox for mentioning the importance of saving characteristics/metadata with the dataset.
>>>> Thanks to Sergiy Radyakin for making me realize that I could never write a mata program that would compute stats "by" variables as fast as stata's -_mean- in -collapse-, since stata's built-in C code can take advantage of parallelization, while mata code cannot.
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>


