Stata allows you to process datasets containing more than 2 billion observations if you have a big computer, and by big, we mean 512 GB or more of memory.

Stata stores your data in memory. That makes Stata fast. It also means that datasets you wish to process must fit in memory. That restriction is of little consequence these days. On a 16 GB computer, you can process datasets with many millions of observations. On a 256 GB computer, you can process a billion or two observations. Most users' datasets contain only thousands or, sometimes, millions of observations.

Stata/BE and Stata/SE can process up to 2.1 billion observations. On a 256 GB computer, 2.1 billion is roughly the limit of what you could fit into memory anyway. Today's computers with 512 GB, 1 TB, 2 TB, or more of memory, though, have room for more than 2.1 billion observations. Stata/MP, the multiprocessor edition of Stata, can handle up to 1 trillion observations in theory; given the size of the largest computers today, the practical limit is just over 24 billion observations. Stata/MP can also handle up to 120,000 variables, up from the maximum of 2,047 in Stata/BE and 32,767 in Stata/SE.

Why did we make Stata/MP capable of handling more? Because even though single-core Stata/BE and Stata/SE are fast, Stata/MP is faster, and when you are processing 2.1 billion observations and more, that matters more than you might imagine.

For instance, on a mere 16 GB off-the-shelf computer, Stata/SE can process many millions of observations. Let's fit a linear regression on 2 million observations with 6 covariates. That will take only 1.2 seconds. That's fast. Now think of how long it would take to fit that same regression on 2 billion observations. 1.2 seconds multiplied by 1,000 is 1,200 seconds, or 20 minutes! That's fast, too, but it doesn't feel fast if you're waiting for the results. Stata/MP with 2 cores reduces that to 10 minutes. Stata/MP with 4 cores reduces that to 5 minutes. Stata/MP with 32 cores reduces that to 37 seconds!
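The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes run time grows in proportion to the number of observations and shrinks in proportion to the number of cores, which matches the figures quoted in the text but is not a measured benchmark. The function name `estimated_seconds` is made up for illustration.

```python
# Back-of-the-envelope timing model (an assumption, not a benchmark):
# run time scales linearly with observations and inversely with cores.
BASE_SECONDS = 1.2        # regression on 2 million observations (from the text)
BASE_OBS = 2_000_000


def estimated_seconds(n_obs, cores=1):
    """Estimate run time in seconds for n_obs observations on `cores` cores."""
    return BASE_SECONDS * (n_obs / BASE_OBS) / cores


if __name__ == "__main__":
    for cores in (1, 2, 4, 32):
        secs = estimated_seconds(2_000_000_000, cores)
        print(f"{cores:>2} cores: {secs:7.1f} s ({secs / 60:.1f} min)")
```

Under these assumptions, 32 cores gives 1,200 / 32 = 37.5 seconds, which the text rounds to 37. Real speedups depend on the command being run and are not always perfectly linear.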

If you are going to process more than 2.1 billion observations, you need Stata/MP.

How many observations you can process depends on the amount of memory on your computer. The basic formula is simple: divide the memory on your computer by the memory required to store one observation (its width). You cannot use all of your computer's memory in the numerator, however, and a few other adjustments are needed, too.

Here is the adjusted formula and some calculations made with it:

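Billions of observations

| Computer's memory | Memory used | width = 43 | width = 64 | width = 96 |
|---|---|---|---|---|
| 128 GB | 112 GB | 1.8 | 1.4 | 1.0 |
| 256 GB | 240 GB | 3.8 | 2.9 | 2.1 |
| 512 GB | 496 GB | 7.9 | 6.1 | 4.4 |
| 1024 GB | 1008 GB | 16.2 | 12.3 | 9.0 |
| 1536 GB | 1520 GB | 24.4 | 18.5 | 13.6 |
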
$${\rm obs} = \frac{{\rm memory\_used}}{{\rm width} + 24} \times \frac{1024^3}{1000^3}$$

where memory_used = computer's_memory - 16 GB.

The table reports three scenarios:

  1. width = 43 bytes, the same as auto.dta
  2. width = 64 bytes
  3. width = 96 bytes

Scenario 1, equivalent to auto.dta, has only 12 variables. The other two scenarios are more reasonable. Even larger widths would be reasonable, too.
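For readers who want to try other memory sizes or widths, here is a minimal Python sketch of the adjusted formula. The constant and function names (`RESERVED_GB`, `WORKING_BYTES`, `max_obs_billions`) are invented for this illustration; only the numbers come from the formula itself.

```python
# Sketch of the adjusted formula; all names here are illustrative.
RESERVED_GB = 16                          # memory_used = computer's memory - 16 GB
WORKING_BYTES = 24                        # the "+ 24" added to the width
BINARY_TO_DECIMAL = 1024**3 / 1000**3     # binary GB -> billions of bytes


def max_obs_billions(total_gb, width):
    """Approximate observations (in billions) that fit in total_gb of memory."""
    memory_used = total_gb - RESERVED_GB
    return memory_used / (width + WORKING_BYTES) * BINARY_TO_DECIMAL


if __name__ == "__main__":
    # One row per memory size, one value per scenario width (43, 64, 96 bytes).
    for total_gb in (128, 256, 512, 1024, 1536):
        row = [round(max_obs_billions(total_gb, w), 1) for w in (43, 64, 96)]
        print(f"{total_gb:>5} GB: {row}")
```

For example, `max_obs_billions(256, 96)` evaluates to roughly 2.1, and `max_obs_billions(1536, 43)` to roughly 24.4.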

The formula we supplied is basically total memory divided by width of observation, but it incorporates three adjustments.

The first is to substitute memory used for total memory available. Memory used is 16 GB less than the total. We are assuming that Stata is the single major process running on your computer. Stata will consume some of that 16 GB, but enough will be left free for the other processes that usually run on your computer, and a bit more.

The second adjustment is to add 24 bytes to the width, which allocates room for three extra double-precision variables (8 bytes each). Stata commands often add extra working variables to your data, at least temporarily.

The third adjustment accounts for the differences between binary and decimal units. A thousand in decimal is 1,000, of course. A thousand in binary (a.k.a. kilo) is 1,024. To get to the billions, we have to cube these numbers.
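To see how much each adjustment matters, here is the calculation for one illustrative case, a 256 GB computer and a width of 96 bytes, applying the three adjustments one at a time. Results are in billions of observations.

$$\frac{256}{96} \approx 2.67 \quad \text{(total memory divided by width)}$$

$$\frac{256 - 16}{96} = 2.50 \quad \text{(memory used in place of total memory)}$$

$$\frac{240}{96 + 24} = 2.00 \quad \text{(room for three working variables)}$$

$$2.00 \times \frac{1024^3}{1000^3} \approx 2.15 \quad \text{(binary to decimal units)}$$

In this case, the first two adjustments lower the estimate and the third raises it by about 7 percent.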

With Stata/MP, you can process up to 24.4 billion observations on a computer with 1.5 TB of memory. Or more, if there are few enough variables. Or fewer, if there are more variables.