Title | Approximating the size of a dataset

Author | William Gould, StataCorp

Date | December 1999; minor revisions December 2006

A back-of-the-envelope calculation for the size of a dataset is

    number of megabytes = M = (N*V*W + 4*N) / 1024^{2}

where

    N = number of observations
    V = number of variables
    W = average width in bytes of a variable

In approximating W, remember

    +-------------------------------------------------------------+
    | Type of variable                                      Width |
    |-------------------------------------------------------------|
    | Integers, -127 <= x <= 100                                1 |
    |           -32,767 <= x <= 32,740                          2 |
    |           -2,147,483,647 <= x <= 2,147,483,620            4 |
    | Floats,                                                     |
    |   single precision (default)                              4 |
    |   double precision                                        8 |
    | strings                                      maximum length |
    +-------------------------------------------------------------+
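The integer rows of the table can be read as a lookup rule: pick the narrowest width whose range covers the values. A minimal Python sketch of that rule follows. The function name `integer_width` is hypothetical, and the fallback to 8 bytes for values outside the widest integer range is an assumption, not something the table states.

```python
# Integer ranges and widths copied from the table above:
# (lowest value, highest value, width in bytes).
INTEGER_RANGES = [
    (-127, 100, 1),
    (-32_767, 32_740, 2),
    (-2_147_483_647, 2_147_483_620, 4),
]


def integer_width(lo, hi):
    """Return the narrowest width in bytes that holds every integer in [lo, hi]."""
    for low, high, width in INTEGER_RANGES:
        if low <= lo and hi <= high:
            return width
    # Assumption: values outside the 4-byte range are stored as
    # double precision (8 bytes). The table does not cover this case.
    return 8


print(integer_width(0, 100))      # fits the 1-byte range
print(integer_width(0, 101))      # just past the 1-byte range, needs 2 bytes
print(integer_width(-40_000, 5))  # past the 2-byte range, needs 4 bytes
```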

Say that you have a 20,000-observation dataset. That dataset contains

     1 string identifier of length 20             20
    10 small integers (1 byte each)               10
     4 standard integers (2 bytes each)            8
     5 floating-point numbers (4 bytes each)      20
    -------------------------------------------------
    20 variables total                            58

Thus the average width of a variable is W = 58/20 = 2.9 bytes.
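The same tally can be done programmatically. This is a small Python sketch using the variable counts and widths from the example; the names `inventory` and `average_width` are illustrative only.

```python
# Each entry is (number of variables, width in bytes of each),
# taken from the 20,000-observation example above.
inventory = [
    (1, 20),  # string identifier of length 20
    (10, 1),  # small integers
    (4, 2),   # standard integers
    (5, 4),   # floating-point numbers
]


def average_width(inventory):
    """Return (V, total bytes per observation, W) for a list of (count, width)."""
    n_vars = sum(count for count, _ in inventory)
    row_bytes = sum(count * width for count, width in inventory)
    return n_vars, row_bytes, row_bytes / n_vars


n_vars, row_bytes, w = average_width(inventory)
print(n_vars, row_bytes, w)
```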

The size of your dataset is

    number of megabytes = M = (N*V*W + 4*N) / 1024^{2}
                            = (20000*20*2.9 + 4*20000) / 1024^{2}
                            = 1.18 megabytes
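For readers who want to try other values of N, V, and W, the formula is easy to wrap in a function. The Python sketch below is only an illustration of the arithmetic; the function name `dataset_megabytes` is hypothetical.

```python
def dataset_megabytes(n_obs, n_vars, avg_width):
    """Approximate dataset size in megabytes: (N*V*W + 4*N) / 1024^2."""
    data_bytes = n_obs * n_vars * avg_width  # N*V*W, the data themselves
    overhead_bytes = 4 * n_obs               # 4 bytes per observation
    return (data_bytes + overhead_bytes) / 1024**2


# The worked example: N = 20,000, V = 20, W = 2.9
size = dataset_megabytes(20_000, 20, 2.9)
print(f"{size:.2f} megabytes")
```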

This result slightly understates the size of the dataset because we have not
included any variable labels, value labels, or notes that you might add to
the data. That does not amount to much. For instance, imagine that you
added variable labels to all 20 variables and that the average length of the
text of the labels was 22 characters. That would amount to a total of
20*22=440 bytes or 440/1024^{2}=.00042 megabytes.


    number of megabytes = M = (N*V*W + 4*N) / 1024^{2}

N*V*W is, of course, the total size of the data. To that, we added 4*N because Stata secretly stores a 4-byte pointer with each observation.

The 1,024^{2} in the denominator rescales the result from bytes to
megabytes. Yes, the divisor is 1,024^{2} even though
1,000^{2} = 1,000,000 is a million.

Computer memory comes in binary increments. Although we think of k as
standing for kilo, in the computer business, k is really a
“binary” thousand, 2^{10} = 1,024.

A megabyte is a binary million—a binary k squared:

1 MB = 1024 KB = 1024*1024 = 1,048,576 bytes

With cheap memory, we sometimes talk about a gigabyte. Here is how a binary gig works:

1 GB = 1024 MB = 1024^{3} = 1,073,741,824 bytes
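The binary units can be checked with a few lines of Python. This is a sketch of the conversions only; the constant and function names are illustrative.

```python
KB = 1024        # binary thousand, 2^10 bytes
MB = 1024 ** 2   # binary million
GB = 1024 ** 3   # binary billion


def to_binary_units(n_bytes):
    """Express a byte count in binary kilobytes, megabytes, and gigabytes."""
    return {"KB": n_bytes / KB, "MB": n_bytes / MB, "GB": n_bytes / GB}


print(MB)  # bytes in one megabyte
print(GB)  # bytes in one gigabyte
print(to_binary_units(1_240_000))  # byte count from the 20,000-observation example
```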