From: wgould@stata.com (William Gould, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Error 909: obs must be between 0 and NNN
Date: Mon, 28 Jan 2008 12:06:17 -0600

Sergiy Radyakin <serjradyakin@gmail.com> has some memory questions, all
prompted by the following session:

> . set memory 1040m
> <works, output omitted>
>
> . set obs 160000000
> obs must be between 0 and 109051879
> r(198);
>
> . set obs 109051879
> no room to add more observations
> <output omitted>
> r(901);

So Sergiy asked,

> 1. how does Stata determine how many observations can fit into the currently
>    allocated memory when record size is zero? (which is exactly our case -
>    no variables are defined after set mem something). As Stata had
>    suggested, I tried to think "of Stata's data area as the area of a
>    rectangle", but still could not divide 1GB by zero.

-set obs- is a very unreliable way of determining whether something else
will fit into memory. Stata has a minimum width that it uses in the
calculation when the current width is 0. I don't remember the number, but
it's around 20, as I recall. Do not hold me to that.

The purpose of -set obs- is to increase the number of observations of the
dataset currently in memory; it is not to project what the maximum would be
in some other case.

The best estimate available for some other case can be obtained by loading a
dataset like the one under consideration, typing -describe-, and then looking
at r(N_max) by typing -return list-.

In the case of an existing dataset that will fit into memory, you can type

        . use ...
        . describe
        . return list

In the case of an existing dataset that will not fit into memory, or might
not, you can type

        . use ... in 1              <- in one, meaning first obs.
        . describe
        . return list

> 2. Why the number it determines is wrong?

"It" in the above question refers to the suggested value of 109051879 in the
following output:

        . set obs 160000000
        obs must be between 0 and 109051879
        r(198);

and we know it is wrong because, when we try to use the suggested value,
Stata refuses:

        . set obs 109051879
        no room to add more observations
        <output omitted>
        r(901);

What I am about to say also applies to r(N_max) returned by -describe-.

The memory Stata requires for a dataset is given by

        memory_required = f(obs, a few other assumptions)            (1)

where f() is a very long, nonlinear expression. In the case of the suggested
maximum obs, Stata solves equation (1) for obs with memory_required set equal
to the currently allocated memory. Stata "solves" the equation by using an
approximation formula. The fact is, however, that Stata does not know
whether its suggested maximum will actually work until it tries it.

There are two issues that make the "solution" problematic: (1) the quality
of the approximation and (2) some assumptions about future operating system
behavior.

Concerning (1), the quality of our approximation, we think it is pretty good,
but I cannot promise that it is good in all cases. It's embarrassing for the
authors of a statistical package to admit that we do not actually solve
equation (1), since obviously the Stata software itself has that capability.
Mainly, the problem is simply writing down and then coding the equation
correctly, because it is so long and involved. Then, even if we did that,
just keeping the equation correct as we make minor changes over time would
be a challenge. So we use an approximation formula that contains the major
ingredients, along with some extra constants.

Concerning (2), the behavior of the operating system in the future, the
process of resetting maxobs (repartitioning memory) requires that certain
small areas be freed and reallocated through the operating system. The
overall total remains approximately unchanged, but the size distribution of
the components changes as less is allocated for variables and more for
observations, or vice versa. We assume the operating system will prove
agreeable, but sometimes it does not.
There can be two reasons for that: (a) I said the total remains approximately
unchanged, but in some cases we might ask for significantly more, although
that additional amount is small relative to the total (1040m in the example),
and (b) even if we ask for exactly the same amount, sometimes operating
systems cannot find a place for the changed sizes of the individual
components. This second problem becomes more likely as the total allocated
approaches the total memory on the computer, or the total the OS has set
aside for virtual memory.

Concerning (2), the behavior of the operating system in the future can be
particularly unpredictable in windowed operating systems. To make the memory
allocation problem as easy as possible for the OS, and thus to give the OS
every chance of succeeding, Stata writes to disk all of its memory usage,
frees it, and lets the OS start with a clean slate, from which Stata then
rebuilds itself. That helps. The problem is that windowed operating systems
also allocate memory for dialog boxes and other window features; they
sometimes borrow application memory for that, and they sometimes put those
allocations in particularly inappropriate places. Stata knows nothing about
any of that. Microsoft's Windows before Vista was particularly notorious on
this score.

> 3. Is it wrong only in the case of the zero-length records, or will it also
>    fail to properly compute maximum number of observations in cases when
>    record size is not trivial?

The approximation formula is most accurate in the case of small changes from
the current allocation. The formula should be accurate in the 0-variable
case assuming you want to have approximately 0 variables, say 1 or 2. That
may not be the case in a real problem, which is why I gave a different way of
obtaining the projected maximum above.
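
As a rough check on the rectangle arithmetic for the 0-variable case, the
sketch below divides the allocation in Sergiy's session by the maximum that
Stata suggested. It assumes that the "m" suffix in -set memory 1040m- means
2**20 bytes, and the function name is made up for illustration. The result is
only the ratio of two numbers from the session, not Stata's internal formula.

```python
# Assumption: "1040m" in -set memory 1040m- means 1040 * 2**20 bytes.
BYTES_PER_M = 2**20


def implied_bytes_per_obs(memory_m: int, suggested_max_obs: int) -> float:
    """Bytes of allocated memory per observation implied by a suggested maximum."""
    return memory_m * BYTES_PER_M / suggested_max_obs


if __name__ == "__main__":
    # Figures taken from the session quoted at the top of this post.
    ratio = implied_bytes_per_obs(1040, 109051879)
    print(f"{ratio:.3f} bytes per observation")
```

For these two numbers the ratio comes out at about 10 bytes per observation.
That is only what the session implies under the stated assumption about "m";
it does not reveal how the approximation formula is built.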

Nonetheless, in the problem Sergiy showed, we attempted to go from 0
variables to 0 variables but with more observations, and that proved not to
be possible, so we know the suggestion was wrong by at least 9%. So either
the approximation formula itself was the cause, or it was memory
distributional problems.

In Sergiy's case, I suspect it was memory distributional problems. I suspect
that Sergiy is asking for close to physical memory, and just the
distributional change in a few small areas is causing the operating system
difficulties. Those difficulties are *NOT* a bug in the OS. OSs face their
own problems and, for reasons of efficiency, use allocation methods that do
not efficiently use every single byte of memory in all cases.

> 4. Given a file of size X on disk, how to compute the memory that I need to
>    set to:
>    a) simply open the file and be able to see it's contents
>    b) create an additional variable Y of type Z after the file is open.
>    Here of course we are talking about large files ~1GB, close to the
>    real-World limits of Stata on the 32bit Windows machines. And the answer
>    might be that the file may not be opened at all.

-describe using <filename>- reports the size. -set memory- to a bit more
than that.

> 5. This is more a wish, then a puzzle, since I am 99% sure the answer is
>    negative: many Stata commands (both base and user-written) create
>    temporary variables during their work. But it is hard to tell how many of
>    those will be created. When working on the margin this becomes important.
>    Is there any reference table for this purpose? Is there any way to
>    automatically monitor the number of created variables, and collect the
>    largest value, say in profiler, or elsewhere?

No, there is no way to monitor that extra usage. We have done experiments in
the past, however, and the maximum extra required tends to be roughly five
8-byte variables, which is to say, a width increment of 40 bytes. Most
programs are well under that.
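
The answers to questions 4 and 5 can be combined into a back-of-the-envelope
estimate: take the observation count and width of the dataset, add the width
of any new variable, add about 40 bytes of width for temporary variables, and
multiply. The sketch below does only that arithmetic. The function name, the
example figures, and the 5% margin are assumptions for illustration, and "1m"
is assumed to mean 2**20 bytes; the real requirement includes overhead that a
simple rectangle does not capture.

```python
import math

TEMP_WIDTH = 40  # bytes; roughly five 8-byte temporary variables (from the text)


def memory_to_set_m(n_obs: int, width: int, new_var_width: int = 0,
                    margin: float = 0.05) -> int:
    """Rough -set memory- value in megabytes (1m assumed to be 2**20 bytes).

    n_obs, width  : observation count and record width of the dataset
    new_var_width : bytes per observation for a variable to be added
    margin        : hypothetical safety factor, not a Stata rule
    """
    total_width = width + new_var_width + TEMP_WIDTH
    needed_bytes = n_obs * total_width * (1 + margin)
    return math.ceil(needed_bytes / 2**20)


if __name__ == "__main__":
    # Hypothetical dataset: 10 million observations, 60 bytes wide,
    # plus one new 8-byte (double) variable.
    print(memory_to_set_m(10_000_000, 60, new_var_width=8))
```

If the figure that comes out is close to what the machine or the operating
system can supply, the caveats above about repartitioning apply, and the
allocation may still be refused.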

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


