The Stata listserver

st: Re: Speed issues / which disk for tempfiles


From   "Michael Blasnik" <[email protected]>
To   <[email protected]>
Subject   st: Re: Speed issues / which disk for tempfiles
Date   Wed, 28 Jan 2004 08:55:32 -0500

I too found that reshape does a lot of file writing: it temporarily stores
the resulting dataset on disk and goes back and forth between the original
and result datasets.  I know of two approaches for making reshapes of large
datasets noticeably faster:

1) On my prior computer, I was running Win98 and found a small shareware
device driver called vramdir.  It was able to map the temp directory to RAM,
dynamically allocating space as needed.  I found that using this device
driver would speed up large reshapes dramatically as long as the dataset
was about 1/3 the size of installed RAM or less.  I haven't found the
equivalent for WinXP, but I think all major operating systems have RAM drive
capability, so you could set up a RAM drive and map your temp directory to
it for a substantial speed improvement.  On Windows systems I'm pretty sure
that Stata looks to the TEMP environment variable to find out where to put
temp files, so you can just change that.  It would be a nice feature if
Stata allowed you to select its own temp directory.
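
Before and after redirecting the temp directory, it helps to confirm where
Stata is actually writing.  A minimal sketch: the macro name probe is
arbitrary, and c(tmpdir) is assumed to be available only in more recent
Stata releases, so drop that line if your version rejects it.  StataCorp's
FAQ also describes a STATATMP environment variable that overrides TEMP on
Windows; treat that as something to verify for your version.

```stata
* Ask Stata for a temporary filename; its path shows the directory in use.
tempfile probe
display `"`probe'"'

* Recent releases also report the directory directly (assumption: not in
* older versions such as Stata 8).
display c(tmpdir)
```

If the path shown still points at the hard disk after you change the
environment variable, restart Stata so that it picks up the new setting.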

2)  I also found, as you did, that cutting up the dataset into chunks and
reshaping each was considerably faster.  I even wrote an ado-file, hardwired
for simple reshape long commands, that split the master file into chunks,
reshaped each chunk, and appended the results together.  I had it work with
5000-observation chunks (which often became 100,000+ obs in the long
dataset).  It seems that you're doing a reshape wide, so the code would need
some tweaking to make sure that it doesn't split the dataset in the middle
of a panel.  It probably wouldn't be too hard to write up.
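
The chunked approach in (2) can be sketched roughly as follows.  This is
not the original ado-file, only an illustration of the idea for a simple
reshape long; the program name chunkreshape and its chunk() option are
hypothetical, and it assumes i() uniquely identifies observations in the
wide data and that j() needs no suboptions.

```stata
* Hypothetical helper: reshape long in chunks of chunk() observations.
* Usage: chunkreshape stubnames, i(idvars) j(newvar) [chunk(#)]
capture program drop chunkreshape
program define chunkreshape
    syntax anything(name=stubs), i(varlist) j(name) [chunk(integer 5000)]
    if `chunk' < 1 {
        error 198
    }
    local N = _N
    tempfile master
    quietly save `"`master'"'

    * Reshape each block of wide observations and save it separately.
    local k 0
    forvalues start = 1(`chunk')`N' {
        local stop = min(`start' + `chunk' - 1, `N')
        local k = `k' + 1
        tempfile part`k'
        quietly use in `start'/`stop' using `"`master'"', clear
        quietly reshape long `stubs', i(`i') j(`j')
        quietly save `"`part`k''"'
    }

    * Stack the reshaped pieces back together.
    quietly use `"`part1'"', clear
    forvalues p = 2/`k' {
        quietly append using `"`part`p''"'
    }
    sort `i' `j'
end
```

For example, with wide variables x1 and x2 and identifier id, one would
type: chunkreshape x, i(id) j(t) chunk(5000).  For a reshape wide, the
blocks would instead have to be cut at panel boundaries (for instance by
sorting on the i() variables and splitting on ranges of id values), since
slicing by observation number could separate rows that belong to one panel.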

Michael Blasnik
[email protected]

----- Original Message ----- 
From: "Ernest Berkhout" <[email protected]>
To: <[email protected]>
Sent: Wednesday, January 28, 2004 8:00 AM
Subject: st: Speed issues / which disk for tempfiles


> Hi all,
>
> I'd like to share some thoughts, mainly on speeding up Stata, in
> particular the reshape command. Presently I run a do-file in which I ask
> Stata 8.2 to do some reshaping, including the reshape of some str244
> variables. The original file is 350.000 obs, the reshaped file somewhere
> about 6000 obs. Of course this takes a very long time (about 10 minutes)
> and I'm looking for tips to shorten this type of task.
>
> I already discovered that quite a gain can be made if I split the data set
> in half, then do a reshape on both subsets, and then merge them together.
>
> Also I noticed that during the reshape, the CPU usage on a Windows 2000
> system is only 30-50%, and the memory usage is only 25% of max. Of course
> 'set virtual' is turned off. At the same time, reading and writing to disk
> is quite heavy. From this I deduced that the best way to speed up the
> process is to speed up the reading/writing to disk, as this seems to be
> the bottleneck.
>
> This raises the question of how that can best be done, and why Stata does
> not use all of the available memory. Is it possible to specify on
> which disk or directory Stata writes its temporary files, for instance?
>
> Ernest Berkhout
> Stichting voor Economisch Onderzoek
> Universiteit van Amsterdam

