
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: looping over files -- speed and Stata/MP


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: looping over files -- speed and Stata/MP
Date   Wed, 16 Mar 2011 11:46:30 -0400

I'm not sure "hundreds of thousands" of files is compatible with -fs-
and macro storage limits.

Dimitri Szerman <dimitrijoe@gmail.com>:
Short answer: no, there is no large-grain parallelization in MP.
Probably the biggest speed improvement would come from revising
the part you have hidden as "(a bunch of Stata commands)", but you can
get large-grain parallelization by writing out a set of do-files and
starting a new Stata instance to run each one in batch mode.  Something
like this untested example:

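! dir "mydir" /a-d /b > filelist.txt
file open LIST using "filelist.txt", read
loc i 1
file read LIST line
loc eof=r(eof)
while `eof'==0 {
 * j = number of the last file in the current batch of 1000 (1000, 2000, ...)
 loc j=1000-1+`i'-mod(`i'-1,1000)
 * start a new do-file at the first file of each batch
 if mod(`i'-1,1000)==0 file open p`j' using p`j'.do, write replace
 file write p`j' " (a bunch of Stata commands)"
 file write p`j' _n "save mydir2/`line', replace" _n
 file read LIST line
 loc eof=r(eof)
 * when the batch is full or the list is exhausted, finish the do-file and launch it
 if `i'==`j' | `eof'!=0 {
  file write p`j' "exit" _n
  file close p`j'
  * assumes the Stata executable can be started as "stata"; adjust the path as needed
  winexec stata -b do p`j'.do
 }
 loc ++i
}
file close LIST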
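
To see how the batch number in the loop is formed, here is a small check of the same formula with a batch size of 1000. The file numbers tried here are arbitrary illustrations, not taken from the actual file list:

```stata
* Each file number i maps to the number of the last file in its batch,
* so files 1-1000 go into p1000.do, files 1001-2000 into p2000.do, and so on.
foreach i in 1 999 1000 1001 2000 2001 {
    local j = 1000 - 1 + `i' - mod(`i' - 1, 1000)
    display "file `i' goes into p`j'.do"
}
```

Changing both occurrences of 1000 to another batch size changes how many files each Stata instance handles, and therefore how many instances get started.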

On Wed, Mar 16, 2011 at 11:13 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> -fs- from SSC automates the production of a list of files. It is just
> a wrapper for a standard Stata extended macro function but it would
> obviate the need for a structure based on holding a file open while
> doing lots of other things. At the same time, it is difficult to know
> how much difference to timings that would make beyond reducing your
> use of the OS.
>
> Nick
>
> On Wed, Mar 16, 2011 at 2:48 PM, Dimitri Szerman <dimitrijoe@gmail.com> wrote:
>
>> In constructing a data set, I have to loop over hundreds of thousands
>> of files. Simply put, this is what I do:
>>
>> ! dir "mydir" /a-d /b > filelist.txt         // list of files to be imported
>> file open LIST using "filelist.txt", read
>> file read LIST line
>> while r(eof)==0 {
>>
>>     (a bunch of Stata commands)
>>
>> save mydir2\\`line', replace
>> file read LIST line
>> }
>> file close LIST
>>
>>
>> In fact, I run a loop like this twice: first to import csv into dta,
>> then to work (clean) the dta files. As it stands now, my code
>> takes around 12 hours to run. My question is: will Stata/MP make it
>> run faster? (For those familiar with Matlab, I guess this boils down
>> to: does Stata/MP have something along the lines of "parfor", i.e., a
>> "parallel-for" command?) More broadly, can anyone think of a way of
>> speeding this up?
