Statalist



Re: st: Error 909: obs must be between 0 and NNN


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Error 909: obs must be between 0 and NNN
Date   Mon, 28 Jan 2008 12:06:17 -0600

Sergiy Radyakin <serjradyakin@gmail.com> has some memory questions, 
all caused by the fact that Sergiy observed:

>       . set memory 1040m
>         <works, output omitted>
>       
>       . set obs 160000000
>       obs must be between 0 and 109051879
>       r(198);
>
>       . set obs 109051879
>       no room to add more observations
>       <output omitted>
>       r(901);

So Sergiy asked, 


> 1. how does Stata determine how many observations can fit into the currently
>    allocated memory when record size is zero? (which is exactly our case -
>    no variables are defined after set mem something). As Stata had
>    suggested, I tried to think "of Stata's data area as the area of a
>    rectangle", but still could not divide 1GB by zero.

-set obs- is a very unreliable way of determining whether something else will
fit into memory.  Stata has a minimum width that it uses in the calculation
when the current width is 0.  I don't remember the number, but it's around 20,
as I recall.  Do not hold me to that.
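
A back-of-envelope check of the numbers in Sergiy's session, sketched in
Python.  It assumes "1040m" means 1040 * 2**20 bytes, which is an assumption
and not something stated above.  Dividing the allocation by the suggested
maximum gives the per-observation cost implied by the message.  That ratio
lumps together whatever minimum width and extra constants the approximation
uses, so it does not reveal the actual constant.

```python
# Back-of-envelope check using only the numbers shown in the session above.
# Assumption: "set memory 1040m" allocates 1040 * 2**20 bytes.
allocated_bytes = 1040 * 2**20

# From the message "obs must be between 0 and 109051879".
suggested_max_obs = 109_051_879

# Per-observation cost implied by the suggested maximum.
implied_bytes_per_obs = allocated_bytes / suggested_max_obs

print(f"allocated bytes:       {allocated_bytes}")
print(f"implied bytes per obs: {implied_bytes_per_obs:.4f}")
```

Under that assumption the implied figure is close to 10 bytes per observation
rather than 20, which fits the caveat that the remembered number should not be
relied on.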

The purpose of -set obs- is to increase the number of observations of the
dataset currently in memory; it is not to project what the maximum would be in
some other case.

The best estimate available for some other case can be obtained by loading a
dataset like the one under consideration, typing -describe-, and then looking
at r(N_max) by typing -return list-.

In the case of an existing dataset that will fit into memory, you can 
type 

        . use ...
        . describe
        . return list 

In the case of an existing dataset that will not fit into memory, or 
might not, you can type

        . use ... in 1                  <- in one, meaning first obs.
        . describe 
        . return list


> 2. Why the number it determines is wrong?

"It" in the above question refers to the suggested value of 109051879
in the following output:

        . set obs 160000000
        obs must be between 0 and 109051879
        r(198);                           

and we know it is wrong because, when we try to use the suggested value, 
Stata refuses:

        . set obs 109051879
        no room to add more observations
        <output omitted>
        r(901);

What I am about to say also applies to r(N_max) returned by -describe-.

The memory Stata requires for a dataset is given by 

        memory_required = f(obs, a few other assumptions)            (1)

where f() is a very long, nonlinear expression.  In the case of the suggested
maximum obs, Stata solves equation (1) for obs with memory_required set
equal to currently allocated memory.  Stata "solves" the equation by using an
approximation formula.  The fact is, however, that Stata does not know whether
its suggested maximum will actually work until it tries it.

There are two issues that make the "solution" problematic:  (1) the quality of
the approximation and (2) some assumptions about future operating system
behavior.  

Concerning (1), the quality of our approximation, we think it is pretty good,
but I cannot promise that it is good in all cases.  It's embarrassing for the
authors of a statistical package to admit that we do not actually solve
equation (1) since obviously the Stata software itself has that capability.
Mainly, the problem is simply writing down and then coding the equation
correctly because it is so long and involved.  Then, even if we did that, just
keeping the equation correct as we make minor changes over time would be a
challenge.  So we use an approximation formula that contains the major
ingredients, along with some extra constants.

Concerning (2), the behavior of the operating system in the future, the
process of resetting maxobs (repartitioning memory) requires certain small areas
be freed and reallocated through the operating system.  The overall total
remains approximately unchanged, but the size distribution of the components
is changed as less is allocated for variables and more for observations, 
or vice versa.  We assume the operating system will prove agreeable, but
sometimes it does not.  There can be two reasons for that:  (a) I said the
total remains approximately unchanged, but in some cases, we might ask for
significantly more, although that additional amount is small relative to
the total (1040m in the example), and (b) even if we ask for exactly the same
amount, sometimes operating systems cannot find a place for the changed sizes
of the individual components.  This second problem becomes more likely as the
total allocated approaches the total on the computer, or the total the OS has
set aside for virtual memory.

Concerning (2), the behavior of the operating system in the future can be 
particularly unpredictable in windowed operating systems.  To make the 
memory allocation problem as easy as possible for the OS, and thus to give the
OS every chance of succeeding, Stata writes to disk all of its memory usage,
frees it, and lets the OS start with a clean slate, from which Stata then
rebuilds itself.  That helps.  The problem is that windowed operating systems
also allocate memory for dialog boxes and other window features; they
sometimes borrow application memory for that and sometimes put those
allocations in particularly inappropriate places, and Stata knows nothing
about that.  Microsoft's Windows before Vista was particularly notorious on
this score.


>  3. Is it wrong only in the case of the zero-length records, or will it also
>     fail to properly compute maximum number of observations in cases when
>     record size is not trivial?

The approximation formula is most accurate in the case of small changes 
from the current allocation.  

The formula should be accurate in the 0-variable case assuming you want to 
have approximately 0 variables, say 1 or 2.  That may not be the case in a
real problem, which is why I gave a different way of obtaining the projected
maximum above.

Nonetheless, in the problem Sergiy showed, we attempted to go from 0 variables
to 0 variables but with more observations, and that proved not to be possible,
so we know the suggestion was wrong by at least 9%.  So either the
approximation formula itself was the cause, or it was memory distributional
problems.  In Sergiy's case, I suspect it was memory distributional problems.
I suspect that Sergiy is asking for close to physical memory, and just the
distributional change in a few small areas is causing the operating system
difficulties.  Those difficulties are *NOT* a bug in the OS.  OS's face their
own problems and, for reasons of efficiency, OS's use allocation methods that
do not efficiently use every single byte of memory in all cases.


> 4. Given a file of size X on disk, how to compute the memory that I need to
>    set to:  
>        a) simply open the file and be able to see it's contents 
>        b) create an additional variable Y of type Z after the file is open.
>    Here of course we are talking about large files ~1GB, close to the
>    real-World limits of Stata on the 32bit Windows machines. And the answer
>    might be that the file may not be opened at all.

-describe using <filename>- reports the size.  -set memory- to a bit more than
that.


> 5. This is more a wish, then a puzzle, since I am 99% sure the answer is
>    negative: many Stata commands (both base and user-written) create
>    temporary variables during their work. But it is hard to tell how many of
>    those will be created. When working on the margin this becomes important.
>    Is there any reference table for this purpose? Is there any way to
>    automatically monitor the number of created variables, and collect the
>    largest value, say in profiler, or elsewhere?

No, there is no way to monitor that extra usage.  We have done experiments in
the past, however, and the maximum extra required tends to be roughly 5 8-byte
variables, which is to say, a width increment of 40 bytes.  Most programs are
well under that.
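
The answers to questions 4 and 5 can be combined into a rough sizing rule,
sketched here in Python.  This is not Stata's actual formula: it assumes the
data area is about observations times width with no other overhead, and the
function name, parameters, and example dataset are all hypothetical.  The only
figure taken from the discussion above is the 40-byte allowance for temporary
variables.

```python
# Rough sizing sketch. Assumptions (not Stata's actual formula):
#   - the data area is about n_obs * width bytes, with no other overhead
#   - temporary variables add at most ~40 bytes of width (5 variables * 8 bytes)
TEMP_WIDTH_ALLOWANCE = 5 * 8  # bytes, from the rule of thumb above


def estimate_data_bytes(n_obs, width, new_var_bytes=0,
                        temp_allowance=TEMP_WIDTH_ALLOWANCE):
    """Return an approximate byte count for n_obs records at the widened width."""
    return n_obs * (width + new_var_bytes + temp_allowance)


# Hypothetical dataset: 10 million observations, 100 bytes wide,
# plus one new 8-byte (double) variable.
needed = estimate_data_bytes(10_000_000, 100, new_var_bytes=8)
print(f"{needed / 2**20:.0f} MB")
```

Since the operating system may not cooperate when the request is close to
available memory, a figure like this is better treated as a lower bound than
as a guarantee.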

-- Bill
wgould@stata.com


