Re: st: Error 909: obs must be between 0 and NNN

From   "Sergiy Radyakin" <[email protected]>
To   [email protected]
Subject   Re: st: Error 909: obs must be between 0 and NNN
Date   Mon, 28 Jan 2008 17:24:13 -0500

Dear William,

Thank you for the detailed explanation; it is really informative. If I
understood correctly, to determine whether a file can be appended to the
current dataset, I should look not at the number of observations in that
file and the maximum number of observations reported by -set obs-, but at
the free memory left and the memory occupied by the file's data, as
reported by -describe-. In practical situations where the number of
variables is not trivial, though, both methods should be approximately
correct. (I was trying to simulate the append by simply telling Stata to
create N additional observations and checking whether it fails, so that I
don't have to actually read the data, just know how many observations
there are.)
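
A rough sketch of the memory-based check described above, written in
Python rather than Stata so the arithmetic is easy to follow. The
function names, the 5% safety margin, and the 4-byte per-observation
pointer overhead are all assumptions for illustration; Stata's real
accounting is more involved, so treat the result as an estimate only.

```python
def bytes_needed(n_obs, width, pointer_size=4):
    """Rough memory needed for n_obs observations of `width` data bytes each.

    pointer_size is an ASSUMED per-observation overhead (4 bytes on a
    32-bit build); it is not a documented Stata constant.
    """
    return n_obs * (width + pointer_size)


def can_append(free_bytes, n_obs, width, pointer_size=4, margin=0.05):
    """Return True if the file's data should fit in free memory with slack.

    margin is a hypothetical safety factor, since any estimate of this
    kind is only approximate.
    """
    needed = bytes_needed(n_obs, width, pointer_size)
    return needed * (1 + margin) <= free_bytes


# Hypothetical numbers: 200 MB free, a file with observations of width 64.
free = 200 * 1024 * 1024
print(can_append(free, 2_000_000, 64))  # about 143 MB needed
print(can_append(free, 4_000_000, 64))  # about 286 MB needed
```

The inputs would come from Stata itself: the observation count and width
of the file as reported by -describe using-, and the free memory as
reported for the dataset currently in memory.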

Another thing is the pointer size, which is needed to estimate the data
overhead correctly. As 32-bit machines are being phased out in favor of
modern 64-bit number crunchers, will the pointer size in Stata also
change from 4 bytes to 8 bytes? (This would effectively double the
dataset overhead.) Is the pointer size or machine type reported somewhere?
I looked at -creturn list-, but I don't immediately see anything like
SizeOf(Pointer). On the other hand, hopefully on 64-bit machines we will
not have to worry about memory limits any time soon.

Thank you,
   Sergiy Radyakin

On 1/28/08, William Gould, StataCorp LP <[email protected]> wrote:
> Sergiy Radyakin <[email protected]> has some memory questions,
> all caused by the fact that Sergiy observed:
> >       . set memory 1040m
> >         <works, output omitted>
> >
> >       . set obs 160000000
> >       obs must be between 0 and 109051879
> >       r(198);
> >
> >       . set obs 109051879
> >       no room to add more observations
> >       <output omitted>
> >       r(901);
> So Sergiy asked,
> > 1. how does Stata determine how many observations can fit into the currently
> >    allocated memory when record size is zero? (which is exactly our case -
> >    no variables are defined after set mem something). As Stata had
> >    suggested, I tried to think "of Stata's data area as the area of a
> >    rectangle", but still could not divide 1GB by zero.
> -set obs- is a very unreliable way of determining whether something else will
> fit into memory.  Stata has a minimum width that it uses in the calculation
> when the current width is 0.  I don't remember the number, but it's around 20,
> as I recall.  Do not hold me to that.
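
A back-of-the-envelope check on that minimum width, using only the
numbers in the transcript above. This is plain arithmetic in Python, not
Stata's approximation formula. It assumes that "1040m" means 1040 * 2**20
bytes and that essentially all of the allocation is counted against
observations. Whether the resulting figure is a pure data width or
includes per-observation overhead cannot be told from the output.

```python
# Assumption: -set memory 1040m- allocates 1040 * 2**20 bytes.
allocated = 1040 * 2**20

# Suggested maximum taken from the "obs must be between 0 and ..." message.
suggested_max_obs = 109_051_879

# Bytes per observation implied by the suggested maximum.
implied_bytes_per_obs = allocated / suggested_max_obs
print(round(implied_bytes_per_obs, 3))
```

Under these assumptions the ratio comes out very close to 10 bytes per
observation, so the effective figure in this 0-variable case looks nearer
10 than 20, which fits the caveat that the number was quoted from memory.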
> The purpose of -set obs- is to increase the number of observations of the
> dataset currently in memory; it is not to project what the maximum would be in
> some other case.
> The best estimate available for some other case can be obtained by loading a
> dataset like the one under consideration, typing -describe-, and then looking
> at r(N_max) by typing -return list-.
> In the case of an existing dataset that will fit into memory, you can
> type
>        . use ...
>        . describe
>        . return list
> In the case of an existing dataset that will not fit into memory, or
> might not, you can type
>        . use ... in 1                  <- in one, meaning first obs.
>        . describe
>        . return list
> > 2. Why is the number it determines wrong?
> "It" in the above question refers to the suggested value of 109051879
> in the following output:
>        . set obs 160000000
>        obs must be between 0 and 109051879
>        r(198);
> and we know it is wrong because, when we try to use the suggested value,
> Stata refuses:
>        . set obs 109051879
>        no room to add more observations
>        <output omitted>
>        r(901);
> What I am about to say also applies to r(N_max) returned by -describe-.
> The memory Stata requires for a dataset is given by
>        memory_required = f(obs, a few other assumptions)      (1)
> where f() is a very long, nonlinear expression.  In the case of the suggested
> maximum obs, Stata solves equation (1) for obs with memory_required set
> equal to currently allocated memory.  Stata "solves" the equation by using an
> approximation formula.  The fact is, however, that Stata does not know whether
> its suggested maximum will actually work until it tries it.
> There are two issues that make the "solution" problematic:  (1) the quality of
> the approximation and (2) some assumptions about future operating system
> behavior.
> Concerning (1), the quality of our approximation, we think it is pretty good,
> but I cannot promise that it is good in all cases.  It's embarrassing for the
> authors of a statistical package to admit that we do not actually solve
> equation (1) since obviously the Stata software itself has that capability.
> Mainly, the problem is simply writing down and then coding the equation
> correctly because it is so long and involved.  Then, even if we did that, just
> keeping the equation correct as we make minor changes over time would be a
> challenge.  So we use an approximation formula that contains the major
> ingredients, along with some extra constants.
> Concerning (2), the behavior of the operating system in the future, the
> process of resetting maxobs (repartitioning memory) requires that certain small
> areas be freed and reallocated through the operating system.  The overall total
> remains approximately unchanged, but the size distribution of the components
> is changed as less is allocated for variables and more for observations,
> or vice versa.  We assume the operating system will prove agreeable, but
> sometimes it does not.  There can be two reasons for that:  (a) I said the
> total remains approximately unchanged, but in some cases, we might ask for
> significantly more, although that additional amount is small relative to
> the total (1040m in the example), and (b) even if we ask for exactly the same
> amount, sometimes operating systems cannot find a place for the changed sizes
> of the individual components.  This second problem becomes more likely as the
> total allocated approaches the total on the computer, or the total the OS has
> set aside for virtual memory.
> Also concerning (2), the behavior of the operating system in the future can be
> particularly unpredictable in windowed operating systems.  To make the
> memory allocation problem as easy as possible for the OS, and thus to give the
> OS every chance of succeeding, Stata writes to disk all of its memory usage,
> frees it, and lets the OS start with a clean slate, from which Stata then
> rebuilds itself.  That helps.  The problem is that windowed operating systems
> also allocate memory for dialog boxes and other window features; they sometimes
> borrow application memory for that, they sometimes put those allocations in
> particularly inappropriate places, and Stata knows nothing about that.
> Microsoft's Windows before Vista was particularly notorious on this score.
> >  3. Is it wrong only in the case of the zero-length records, or will it also
> >     fail to properly compute maximum number of observations in cases when
> >     record size is not trivial?
> The approximation formula is most accurate in the case of small changes
> from the current allocation.
> The formula should be accurate in the 0-variable case assuming you want to
> have approximately 0 variables, say 1 or 2.  That may not be the case in a
> real problem, which is why I gave a different way of obtaining the projected
> maximum above.
> Nonetheless, in the problem Sergiy showed, we attempted to go from 0 variables
> to 0 variables but with more observations, and that proved not to be possible,
> so we know the suggestion was wrong by at least 9%.  So either the
> approximation formula itself was the cause, or it was memory distributional
> problems.  In Sergiy's case, I suspect it was memory distributional problems.
> I suspect that Sergiy is asking for close to physical memory, and just the
> distributional change in a few small areas is causing the operating system
> difficulties.  Those difficulties are *NOT* a bug in the OS.  OS's face their
> own problems and, for reasons of efficiency, OS's use allocation methods that
> do not efficiently use every single byte of memory in all cases.
> > 4. Given a file of size X on disk, how to compute the memory that I need to
> >    set to:
> >        a) simply open the file and be able to see its contents
> >        b) create an additional variable Y of type Z after the file is open.
> >    Here of course we are talking about large files ~1GB, close to the
> >    real-World limits of Stata on the 32bit Windows machines. And the answer
> >    might be that the file may not be opened at all.
> -describe using <filename>- reports the size.  -set memory- to a bit more than
> that.
> > 5. This is more a wish than a puzzle, since I am 99% sure the answer is
> >    negative: many Stata commands (both base and user-written) create
> >    temporary variables during their work. But it is hard to tell how many of
> >    those will be created. When working on the margin this becomes important.
> >    Is there any reference table for this purpose? Is there any way to
> >    automatically monitor the number of created variables, and collect the
> >    largest value, say in profiler, or elsewhere?
> No, there is no way to monitor that extra usage.  We have done experiments in
> the past, however, and the maximum extra required tends to be roughly 5 8-byte
> variables, which is to say, a width increment of 40 bytes.  Most programs are
> well under that.
> -- Bill
> [email protected]