Statalist The Stata Listserver



Re: Wishlist st: Re: Large data sets


From   Alan Riley <ariley@stata.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: Wishlist st: Re: Large data sets
Date   Sat, 23 Jun 2007 12:28:08 -0500

SamL (saml@demog.berkeley.edu) asked about the possibility of Stata
providing its own virtual memory system to allow the analysis of
datasets which will not fit in the memory of the computers available
to him:

> This brings up a general problem that I wonder if stata can fix, or has
> fixed.  I routinely use large datasets--by large I mean 4-10 gb.  I use a
> unix system managed by a computer center.  Sometimes my data becomes so
> large that there is no machine large enough to invoke stata and hold all
> the data in memory.

He also indicated that speed is not an issue, so I cannot tell him
"no, you don't want to use virtual memory because it is too slow":

> I should also say that my models routinely take weeks or even months to
> run, so I am resigned to waiting a long time for results.  Speed is not my
> interest.


Ben Jann (ben.jann@gmail.com) suggested that Stata's -set virtual on-
setting might be what Sam needs:

> Type
> 
>  . set virtual on
> 
> (Or do I misunderstand your query?)

Before I respond to Sam's question, I want to address a common
misconception about -set virtual on-.  -set virtual on- does
NOT tell Stata to use virtual memory.  The use of virtual memory
is controlled entirely by the operating system.  If a user causes
Stata to ask for more memory than the operating system can provide
as real memory, the operating system may (and typically does) respond
to that request by allocating some virtual memory to Stata.

When we talk about an operating system providing virtual memory to an
application, we mean that the operating system is storing some of
the application's "memory" on disk, and when the application needs
to access that part of "memory", the operating system swaps the
information from disk into real memory so that the application can
use it.  This is why you will sometimes hear virtual memory referred
to as "swap space" or simply "swap".  The operating system does
these swaps in big chunks.  That is, even if the application asks
to read a single character out of memory, the operating system
will swap a large piece of data surrounding that character from
disk into memory.
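
To make the "big chunks" point concrete, here is a small Python sketch of
the arithmetic involved. The 4096-byte page size and the function name
page_span are assumptions chosen for illustration; actual page sizes
depend on the operating system and hardware.

```python
# Illustrative only: PAGE_SIZE is an assumed value, not a property of
# any particular operating system.
PAGE_SIZE = 4096  # bytes per page (assumption)


def page_span(address, page_size=PAGE_SIZE):
    """Return the (start, end) byte range of the page holding `address`.

    `end` is exclusive. Touching any single byte in this range makes the
    operating system bring the whole range into real memory.
    """
    start = (address // page_size) * page_size
    return start, start + page_size


# Asking for one byte still pulls in the entire surrounding page.
start, end = page_span(10_000)
print(f"reading byte 10000 brings in bytes {start} through {end - 1}")
```

Under these assumptions, a request for one character costs a full page of
disk traffic, which is why the placement of data in memory matters.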


So, what does it mean when a user types -set virtual on- in Stata?
-set virtual on- alerts Stata that so much memory has been allocated
to Stata that it is very likely that the operating system is providing
a portion of that memory through virtual memory.  With -set virtual on-,
Stata will then try to optimize the way it is storing data in memory
so that multiple requests to read data out of memory will be close to
each other.  This is intended to minimize the number of times the
operating system will have to swap data between real memory and the
hard disk.
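
The effect of keeping related reads close together can be sketched with a
toy simulation in Python. This is not how Stata actually lays out data;
the dataset dimensions, page size, number of resident pages, and the
least-recently-used replacement policy are all assumptions made for the
example. It counts how often the operating system would have to swap a
page in when the same variable is read twice for every observation under
two hypothetical layouts.

```python
from collections import OrderedDict


def count_page_faults(addresses, page_size, resident_pages):
    """Count swaps-in for a sequence of byte addresses.

    Assumes a simple least-recently-used (LRU) policy with room for
    `resident_pages` pages in real memory.
    """
    resident = OrderedDict()  # page number -> True, oldest first
    faults = 0
    for addr in addresses:
        page = addr // page_size
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            resident[page] = True
            if len(resident) > resident_pages:
                resident.popitem(last=False)  # evict least recently used
    return faults


# Hypothetical dataset: 1000 observations, 50 variables, 8 bytes each.
N_OBS, N_VARS, WIDTH = 1000, 50, 8
PAGE_SIZE = 4096   # assumed page size in bytes
RESIDENT = 4       # assumed number of pages that fit in real memory

# Layout A: values of one variable are N_VARS * WIDTH bytes apart.
scattered = [(obs * N_VARS) * WIDTH for obs in range(N_OBS)]
# Layout B: values of one variable sit next to each other.
clustered = [obs * WIDTH for obs in range(N_OBS)]

# Read the variable twice in each layout.
faults_scattered = count_page_faults(scattered * 2, PAGE_SIZE, RESIDENT)
faults_clustered = count_page_faults(clustered * 2, PAGE_SIZE, RESIDENT)
print("scattered layout:", faults_scattered, "page faults")
print("clustered layout:", faults_clustered, "page faults")
```

With these made-up numbers the scattered layout incurs 196 page faults
and the clustered layout only 2. The exact figures mean nothing for real
workloads, but the gap shows why grouping related reads reduces swapping.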


Now, back to Sam's question.  We do not have any immediate plans to
provide Stata's own virtual memory facility for very large datasets.
It certainly has been discussed as a possibility many different times,
but computers keep getting bigger, and indeed, we already know of
several sites using Stata with datasets much larger than Sam mentioned
(and with computers with well over 10 GB of real memory).

That is not to say that Stata will never have such a feature, but
even if it did, it would not arrive soon enough to help Sam with his
immediate problem.

Sam says he is using large Unix systems at his site.  On a Unix system,
it is usually not too hard for a system administrator to increase the amount
of swap space on the system, thus increasing the amount of virtual
memory that can be used by an application.

Even if there is not money in this year's budget to upgrade the memory
on those systems to accommodate the analyses he needs to run, Sam
should see if the system administrators at his site are willing to
increase the swap space available on those systems so that he could
allocate more (virtual) memory to Stata.  This can typically be done
fairly easily and may allow him to complete (albeit slowly) the
analyses he needs to perform.
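
As a back-of-the-envelope aid for that conversation with the system
administrators, the Python sketch below estimates how much swap would be
needed. The function name, the 8 GB of real memory, and the 1 GB
allowance for everything else running on the machine are hypothetical;
only the 10 GB dataset size comes from Sam's description. Real
requirements also depend on the working space the analysis itself needs.

```python
def swap_shortfall_gb(dataset_gb, real_memory_gb, other_use_gb=0.0):
    """Rough estimate of the swap (in GB) needed to hold a dataset.

    Returns 0.0 when the dataset plus other usage already fits in real
    memory. All inputs are in gigabytes.
    """
    needed = float(dataset_gb) + float(other_use_gb) - float(real_memory_gb)
    return max(0.0, needed)


# Hypothetical machine: 8 GB of real memory, 1 GB used by other things.
shortfall = swap_shortfall_gb(10, 8, other_use_gb=1)
print(f"approximate extra swap needed: {shortfall} GB")
```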

I will warn Sam that his system administrators may not want to do
this for other reasons.  The main reason not to do what I am suggesting
above is that a computer system which has to use virtual memory
to accommodate one user's application will be extremely slow and thus
virtually (no pun intended) unusable for all other users and their
applications.


--Alan
(ariley@stata.com)


