Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: looping over files -- speed and Stata/MP
From: Austin Nichols <[email protected]>
To: [email protected]
Subject: Re: st: looping over files -- speed and Stata/MP
Date: Wed, 16 Mar 2011 11:46:30 -0400
I'm not sure "hundreds of thousands" of files is compatible with -fs-
and macro storage limits.
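To make the concern concrete: a file list built with an extended macro function (presumably the dir function, which is what -fs- appears to wrap) puts every file name into a single macro, so the macro grows with the number of files. A minimal sketch, assuming the files sit in mydir; the local names files, n and len are arbitrary:

```stata
* Collect every file name in mydir into one local macro
local files : dir "mydir" files "*"
* Count the names and measure how long the macro has become
local n : word count `files'
local len : length local files
display "`n' files, macro length `len' characters"
```

With hundreds of thousands of names, that length is what has to fit within the macro limits of your flavor of Stata.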
Dimitri Szerman <[email protected]>:
Short answer: no, Stata/MP offers no large-grain parallelization.
The biggest speed improvement would probably come from revising
the part you have hidden as "(a bunch of Stata commands)", but you can
get large-grain parallelization by writing out a set of do-files and
starting a new Stata instance to run each one in batch mode. Something
like this untested example:
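! dir "mydir" /a-d /b > filelist.txt
file open LIST using "filelist.txt", read
loc i 1
file read LIST line
while r(eof)==0 {
    * j is the number of the last file in the current block of 1000
    loc j=1000-1+`i'-mod(`i'-1,1000)
    * start a fresh do-file for the first file of a block, otherwise append
    loc how=cond(mod(`i'-1,1000)==0,"replace","append")
    file open p`j' using p`j'.do, write `how'
    * the commands written here should refer to the file named in the macro line
    file write p`j' " (a bunch of Stata commands)"
    file write p`j' _n `"save "mydir2/`line'", replace"' _n
    file close p`j'
    * "stata" stands for the full path to your Stata executable
    if `i'==`j' winexec stata /e do p`j'.do
    loc ++i
    file read LIST line
}
file close LIST
* launch the last block if it holds fewer than 1000 files
if mod(`i'-1,1000)!=0 winexec stata /e do p`j'.do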
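The expression for j in the example above maps each file number to the last file number of its block of 1000, so all files in a block share one do-file. A quick check of that arithmetic, with file numbers chosen only for illustration:

```stata
* Which do-file does file number i land in, with blocks of 1000?
foreach i of numlist 1 1000 1001 2500 {
    local j = 1000 - 1 + `i' - mod(`i' - 1, 1000)
    display "file `i' -> p`j'.do"
}
```

Files 1 and 1000 both map to p1000.do, file 1001 to p2000.do, and file 2500 to p3000.do, so each new Stata instance handles up to 1000 files.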
On Wed, Mar 16, 2011 at 11:13 AM, Nick Cox <[email protected]> wrote:
> -fs- from SSC automates the production of a list of files. It is just
> a wrapper for a standard Stata extended macro function but it would
> obviate the need for a structure based on holding a file open while
> doing lots of other things. At the same time, it is difficult to know
> how much difference to timings that would make beyond reducing your
> use of the OS.
>
> Nick
>
> On Wed, Mar 16, 2011 at 2:48 PM, Dimitri Szerman <[email protected]> wrote:
>
>> In constructing a data set, I have to loop over hundreds of thousands
>> of files. Simply put, this is what I do:
>>
>> ! dir "mydir" /a-d /b > filelist.txt // list of files to be imported
>> file open LIST using "filelist.txt", read
>> file read LIST line
>> while r(eof)==0 {
>>
>> (a bunch of Stata commands)
>>
>> save mydir2\\`line', replace
>> file read LIST line
>> }
>> file close LIST
>>
>>
>> In fact, I run a loop like this twice (first to import csv into dta;
>> another to work (clean) the dta files). As it stands now, my code
>> takes around 12 hours to run. My question is: will Stata/MP make it
>> run faster? (For those familiar with Matlab, I guess this boils down
>> to: does Stata/MP have something along the lines of "parfor", i.e., a
>> "parallel-for" command?) More broadly, can anyone think of a way of
>> speeding this up?