The Stata listserver

Re: st: data size - how big

From   [email protected] (William Gould)
To   [email protected]
Subject   Re: st: data size - how big
Date   Mon, 08 Jul 2002 08:14:22 -0500

Salah Mahmud <[email protected]>, following up on a thread, asked, 

> Is the "observation pointer" the only overhead as far as data storage is
> concerned?

in response to my posting that, 

> The size reported by -describe- is obtained by
>            1,692,789  * ( 4   +    4  )   =    13,542,312
>               /           |         \
>           # of obs        |          \
>                           |           \
>                     width of data      plus 4
>                    1 float = 4 bytes

No, the 4 bytes is not all, but it is the important amount and the answer to
Salah's question really depends on how you define overhead.

First off, what I said about the number reported by -describe- is exactly
accurate:  that is what -describe- reports.  There is, however, more to a
dataset than the variables and observations, such as variable names, variable
labels, value labels, display formats, characteristics, etc.

When -describe- reports the "size" of the data, it ignores all of that, but
obviously all those things appear in the .dta dataset, which tends to make
the .dta dataset larger than the number reported by -describe-.  The extra 4
bytes per observation, on the other hand, are added only when the data are
copied to memory, which tends to make the .dta dataset smaller than that
number.
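
As a quick check of the arithmetic in the quoted posting, here is a minimal
Python sketch.  The function name describe_size is made up for illustration;
only the figures (1,692,789 observations, one 4-byte float, 4 extra bytes per
observation) come from the posting.

```python
# Sketch of the size arithmetic from the quoted posting.
# describe_size is an illustrative name, not a Stata function.

def describe_size(n_obs, width_bytes, pointer_bytes=4):
    """Size as reported by -describe-: n_obs * (data width + 4 bytes)."""
    return n_obs * (width_bytes + pointer_bytes)

n_obs = 1_692_789
width = 4  # one float variable = 4 bytes

size = describe_size(n_obs, width)
pointer_overhead = n_obs * 4  # the in-memory-only part of that number

print(size)              # 13542312
print(pointer_overhead)  # 6771156
```

In this example half of the reported size is the 4 bytes per observation,
which is why that part matters when comparing the reported number with the
size of the .dta file.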

Then there is overhead as I tend to think of it:  the memory cost of
maintaining the memory image of the data and all of its features.  The 4 bytes
per observation is an example of this, and almost every feature of the data --
each value label, each variable label (but not each variable name) -- also has
the overhead of pointers that track each piece of information.  This amounts
to about 16 bytes per piece of information, and sometimes more.

This overhead, however, does not usually add up to much because the number of
pieces of information being tracked is on the order of the number of variables
in the dataset, rather than the number of observations.  It was, however,
dealing with overhead like this that was the largest issue in producing
Stata/SE, which could allow lots more variables.

Anyway, the dataset label and each value label, variable label, and 
characteristic adds 16 bytes to the memory image in addition to the contents 
of the information piece itself.  The date-and-time stamp adds 16 bytes 
(plus the date-and-time stamp).
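
To give a feel for the relative magnitudes, here is a rough Python sketch.
The counts of labels and characteristics below are hypothetical, chosen only
for illustration, and the 16 bytes per item is the approximate figure given
above (it is sometimes more); the contents of each item are not counted.

```python
# Rough comparison of per-item tracking overhead with per-observation overhead.
# All item counts are hypothetical; 16 bytes is the approximate figure
# stated in the post.

TRACKING_BYTES = 16  # approximate pointer overhead per piece of information

def tracking_overhead(n_items, bytes_per_item=TRACKING_BYTES):
    """Tracking overhead only, excluding the contents of each item."""
    return n_items * bytes_per_item

# Hypothetical dataset: 1 dataset label, 50 variable labels,
# 10 value labels, 5 characteristics.
n_items = 1 + 50 + 10 + 5

metadata_bytes = tracking_overhead(n_items)
per_obs_bytes = 1_692_789 * 4  # 4 bytes per observation, from the post

print(metadata_bytes)  # 1056
print(per_obs_bytes)   # 6771156
```

Under these assumed counts the tracking overhead is about a kilobyte, while
the 4 bytes per observation comes to several megabytes.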

Really, the 4 bytes per observation is the important number.

-- Bill
[email protected]