From
Babigumira Ronnie <rutaremwa_rb@yahoo.com>

To
statalist@hsphsun2.harvard.edu

Subject
st: eqany

Date
Mon, 8 Jul 2002 07:43:13 -0700 (PDT)

Hi Listers,

I am cleaning data (that's all I seem to be doing) and I am a little puzzled here. A while ago, I asked how to identify illegal entries when a variable takes on values in batches (e.g. 11 to 19, 21 to 25, etc.). Nick Cox pointed me to

. egen OK = eqany(cropcod2), values(110/120 220/227 330/334 440/446)
. list houscode cropcod2 if !OK

This has worked very well; however, today I tried

. egen OK = eqany(inpcode), values(500/505 599 601 1100/1111 /*
. */ 1200/1201 2100/2160/ 2200/2220 2299 2300/2302)
. list houscode inpcode if !OK

I get an error message:

. egen OK = eqany(inpcode), values(500/505 599 601 1100/1111 /*
> */ 1200/1201 2100/2160 2200/2220 2299 2300/2302)
varlist not allowed
r(101);

Is anyone familiar with this and a way around it?

Roni

--- William Gould <wgould@stata.com> wrote:
> Salah Mahmud <salah@eircom.net>, following up on a thread, asked,
>
> > Is the "observation pointer" the only overhead as far as data storage is
> > concerned?
>
> to my posting that,
>
> > The size reported by -describe- is obtained by
> >
> >     1,692,789 * ( 4 + 4 ) = 13,542,312
> >         /         |    \
> >    # of obs       |     \
> >                   |      \
> >          width of data    plus 4
> >          1 float = 4 bytes
>
> No, the 4 bytes is not all, but it is the important amount and the answer to
> Salah's question really depends on how you define overhead.
>
> First off, what I said about the number reported by -describe- is exactly
> accurate: that is what -describe- reports. There is, however, more to a
> dataset than the variables and observations, such as variable names, variable
> labels, value labels, display formats, characteristics, etc.
>
> When -describe- reports the "size" of the data, it ignores all of that, but
> obviously all those things appear in the .dta dataset, so that will tend to
> make the .dta dataset size larger than the number reported by -describe-,
> while the extra 4 bytes per observation, which only gets added when the data
> is copied to memory, makes the .dta dataset smaller.
>
> Then there is overhead as I tend to think of it: the memory cost of
> maintaining the memory image of the data and all of its features. The 4 bytes
> per observation is an example of this, and almost every feature of the data --
> each value label, each variable label (but not each variable name) -- also has
> the overhead of pointers that track each piece of information. This amounts
> to about 16 bytes per piece of information, and sometimes more.
>
> This overhead, however, does not usually add up to much because the number of
> pieces of information being tracked is on the order of the number of variables
> in the dataset, rather than the number of observations. It was, however,
> dealing with overhead like this that was the largest issue in producing
> Stata/SE, which could allow lots more variables.
>
> Anyway, the dataset label and each value label, variable label, and
> characteristic adds 16 bytes to the memory image in addition to the contents
> of the information piece itself. The date-and-time stamp adds 16 bytes
> (plus the date-and-time stamp).
>
> Really, the 4 bytes per observation is the important number.
>
> -- Bill
> wgould@stata.com
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
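
One possible way around the -values()- numlist in the question is to build the flag directly with the -inrange()- and -inlist()- functions. This is only a sketch under assumptions: the legal codes are taken to be exactly those in the command as echoed in the error log, the code is run from a do-file in a Stata release that accepts `///` line continuation (older releases would need `/* */` instead), and the variable name OK2 is hypothetical. It sidesteps -egen- and does not explain the r(101) error itself.

```stata
* Flag legal input codes without -egen-; OK2 is a hypothetical name.
* A missing inpcode fails every test below, so it is flagged as illegal.
generate byte OK2 = inrange(inpcode, 500, 505)      | ///
                    inlist(inpcode, 599, 601, 2299) | ///
                    inrange(inpcode, 1100, 1111)    | ///
                    inrange(inpcode, 1200, 1201)    | ///
                    inrange(inpcode, 2100, 2160)    | ///
                    inrange(inpcode, 2200, 2220)    | ///
                    inrange(inpcode, 2300, 2302)

* List the observations whose code falls outside every legal batch.
list houscode inpcode if !OK2
```

Each batch becomes one -inrange()- call and the single codes go into one -inlist()- call, so adding or removing a batch means editing one line.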

**References**: **Re: st: data size - how big** *From:* wgould@stata.com (William Gould)

