
Re: st: insheet multi threading


From   Mike Lacy <Michael.Lacy@colostate.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: insheet multi threading
Date   Tue, 03 May 2011 12:17:07 -0600


On Mon, 2 May 2011 09:30:48 -0400, Argyn Kuketayev <akuketayev@mail.primaticsfinancial.com> wrote:

>I'm not talking about some obscure command either. It's a very basic
>task, and I'm sure everyone does it daily: reading CSV files. It takes
>over an hour on an 8-core machine to read a 13GB file, because CPU load
>is 12% the whole time; only one core is working.

>It's a junior-programmer-level assignment to parallelize the parsing
>part, which is why I'm surprised Stata didn't do it. It's frustrating
>because sometimes I get CSVs during the day and have to wait a long
>time before I can load them into Stata. Once in .dta format, everything
>is fast, reading and writing, so it's clearly the parsing that is slow.

Here's an inelegant approach that might nevertheless parallelize your job. Whether it works depends on something that seems to be true on my dual-processor machine running Windows XP and a single-processor version of Stata: a second instance of Stata, running concurrently, appears to be allocated by the operating system to a different processor than the one running the first instance. I base this claim on having run the same long job simultaneously in two instances of Stata and seen it finish in much less than twice the time of a single run.

If this is true, and perhaps generalizes to other operating systems and machines with more processors, you could:

1) Use -chunky- (-findit chunky-) to break your CSV file into multiple CSV files, each carrying the original header line of variable names.
2) Take the list of file names that -chunky- returns and break it into (say) 4 lists.
3) In the current instance of Stata, start -insheet-ing the files on the first list and saving them as *.dta files.
4) For each of the 3 remaining lists, start a new instance of Stata to -insheet- that list.
5) Append all the *.dta files together.
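Step 1 is what -chunky- handles inside Stata; outside Stata, the same header-preserving split could be sketched in shell. This is only an illustration under made-up file names and chunk sizes, not what -chunky- literally does:

```shell
# Split big.csv into chunk files, repeating the header line in each chunk.
# A sketch of step 1 above; file names and the 2-row chunk size are made up.
set -e

# Build a small example file standing in for the 13GB CSV.
printf 'id,x\n' > big.csv
for i in 1 2 3 4 5 6 7 8; do
    printf '%s,%s\n' "$i" "$((i * 10))" >> big.csv
done

header=$(head -n 1 big.csv)

# Write the data rows (everything after the header) into 2-row pieces.
tail -n +2 big.csv | split -l 2 - chunk_

# Prepend the header to every piece and give it a .csv extension.
for f in chunk_*; do
    { printf '%s\n' "$header"; cat "$f"; } > "$f.csv"
    rm "$f"
done

ls chunk_*.csv
```

Each resulting chunk_*.csv can then be -insheet-ed on its own, which is what makes the per-list division in steps 2-4 possible.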

This could be automated, I presume, by starting the other instances of Stata as batch jobs from your machine's command line; you could presumably even launch those other instances from within Stata. I freely admit this approach is clumsy and involves a fair amount of extra I/O, but it might be quite a bit faster if you are right that parsing is the rate-determining step of the job.
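The automation might be sketched as follows: generate one do-file per list of chunks, then launch each as a separate batch-mode Stata instance. The chunk and job file names here are hypothetical, and the launch command is only printed rather than executed, since the Stata executable's name, path, and batch flag vary by machine (on Unix the usual form is `stata -b do file.do`; on Windows it is the executable with the `/e` flag):

```shell
# Generate one do-file per worker, each -insheet-ing its chunk and saving
# a .dta file. Hypothetical names throughout; in practice the chunk names
# would come from the list that -chunky- returns.
set -e

for n in 1 2 3 4; do
    {
        printf 'insheet using "chunk_%s.csv", comma clear\n' "$n"
        printf 'save "chunk_%s.dta", replace\n' "$n"
    } > "job_$n.do"
    # Only print the batch invocation; actually running it requires a
    # local Stata installation and the correct executable path.
    echo "stata -b do job_$n.do"
done
```

Once all instances finish, a final do-file would -use- the first .dta file and -append using- the rest, which is step 5 above.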

I think, but am not sure, that this does not violate the terms of Stata's single-user licensing.

Regards,
=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy, Assoc. Prof.
Soc. Dept., Colo. State. Univ.
Fort Collins CO 80523 USA
(970)-491-6721
