
Allowing more than 2 billion observations was introduced in Stata 14.


More than 2 billion observations

What's this about?

Stata now allows you to process datasets containing more than 2.1 billion observations if you have a big computer, and by big, we mean 512 GB or more of memory.

Stata stores your data in memory. That makes Stata fast. On the other hand, it also limits the size of the datasets you can process. That limitation is of little consequence these days. On a 16 GB computer, you can process datasets with many millions of observations. On a 256 GB computer, you can process a billion or two observations. Most users' datasets run only to the thousands or, sometimes, the millions.

Stata imposes a limit of 2.1 billion observations, and that limit was more theoretical than practical until recently. On a 256 GB computer, 2.1 billion is roughly the limit of what you could fit into memory anyway.

With today's 512 GB, 1 TB, and 1.5 TB computers, there is enough memory to store more than 2.1 billion observations in memory, and some users have requested the limit be relaxed.

That limit is now relaxed with Stata/MP. MP is the multi-processor version of Stata. The other flavors of Stata—Stata/IC and Stata/SE—continue with the previous 2.1 billion observation limit. Why? Because even though single-processor Stata/IC and Stata/SE are fast, Stata/MP is faster, and when you are processing 2.1 billion observations and more, that matters more than you might imagine.

For instance, on a mere 16 GB off-the-shelf computer, Stata/SE can process many millions of observations. Let's fit a linear regression on 2 million observations with 6 covariates. That will take only 1.2 seconds. That's fast. Now think of how long it would take to fit that same regression on 2 billion observations. 1.2 seconds multiplied by 1,000 is 1,200 seconds, or 20 minutes! That's fast, too, but it doesn't feel fast if you're waiting for the results. Stata/MP with 2 processors reduces that to 10 minutes. Stata/MP with 4 processors reduces that to 5 minutes. Stata/MP with 32 processors reduces that to 37 seconds!
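If you want a feel for those numbers on your own machine, here is a minimal sketch of such a benchmark. The simulated data, seed, and coefficients are our assumptions, not the benchmark behind the figures above, and timings will vary with hardware and Stata flavor:

    * Sketch of a timing benchmark: 2 million observations, 6 covariates
    clear all
    set seed 12345
    set obs 2000000
    forvalues i = 1/6 {
        generate double x`i' = rnormal()
    }
    generate double y = x1 + x2 + x3 + x4 + x5 + x6 + rnormal()

    timer clear 1
    timer on 1
    regress y x1-x6
    timer off 1
    timer list 1        // reports elapsed seconds for the regression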

If you are going to process more than 2.1 billion observations, you need Stata/MP.

How many observations you will be able to process depends on the amount of memory on your computer. The formula is simple: divide the memory on your computer by the memory required to store one observation (known as the width). Obviously, you cannot devote all your computer's memory to data, and a few other adjustments need to be made, too.

Here is the adjusted formula and some calculations made with it:

$$ {\rm obs} = \frac{{\rm memory\_used}}{{\rm width} + 24} \times \frac{1024^3}{1000^3} $$

                                 Billions of observations,
    Computer's    Memory               by scenario
      memory       used           (1)      (2)      (3)
    -----------------------------------------------------
      128 GB      112 GB          1.8      1.4      1.0
      256 GB      240 GB          3.8      2.9      2.1
      512 GB      496 GB          7.9      6.1      4.4
     1024 GB     1008 GB         16.2     12.3      9.8
     1536 GB     1520 GB         24.4     18.5     13.6

where memory_used = computer's_memory - 16 GB. Measure memory_used in GB and width in bytes, and the formula returns observations in billions.

The table reports three scenarios:

  1. width = 43 bytes, the same as auto.dta
  2. width = 64 bytes
  3. width = 96 bytes

Scenario 1, equivalent to auto.dta, has only 12 variables. The other two scenarios are more reasonable. Even larger widths would be reasonable, too.
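As a check, you can reproduce any entry in the table directly from the formula. For example, for a 512 GB computer under scenario 2 (width = 64 bytes):

    * Check a table entry: 512 GB machine, scenario (2), width = 64 bytes
    display (512 - 16)/(64 + 24) * 1024^3/1000^3
    * about 6.05 billion observations, which the table reports as 6.1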

The formula we supplied is basically total memory divided by width of observation, but it incorporates three adjustments.

The first is to substitute memory used for total memory available. Memory used is 16 GB less than the total. We are assuming that Stata is the single major process running on your computer. Stata itself will consume some of that 16 GB, and enough will be left free for the other processes that usually run on your computer, with a bit to spare.

The second adjustment was to add 24 bytes to the width, which allocates room for three extra double-precision variables. Stata commands often add extra working variables to your data, at least temporarily.
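As an illustration of such working variables, here is a minimal sketch using auto.dta; the command and variable name are arbitrary choices of ours:

    * A command adding a temporary working double to the data
    sysuse auto, clear
    tempvar resid
    quietly regress mpg weight
    predict double `resid', residuals   // one extra 8-byte variable per observation
    drop `resid'                        // working variable removed when done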

The third adjustment accounts for the difference between binary and decimal units. A thousand in decimal is, of course, 1,000. A "thousand" in binary (a kilo, in the computer sense) is 1,024. To get to the billions, we have to cube these numbers.
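You can verify the resulting conversion factor directly:

    * A binary GB holds 1024^3 bytes; a decimal billion is 1000^3
    display 1024^3/1000^3    // 1.073741824, about a 7% difference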

With Stata/MP, you can now process up to 24.4 billion observations. Or more, if you have few enough variables. Or fewer, if you have more.
