
Allowing more than 2 billion observations was introduced in Stata 14.


More than 2 billion observations

What's this about?

Stata now allows you to process datasets containing more than 2.1 billion observations if you have a big computer, and by big, we mean 512 GB or more of memory.

Stata stores your data in memory. That makes Stata fast. On the other hand, it also limits the size of the datasets you can process. That limitation is of little consequence these days. On a 16 GB computer, you can process datasets with many millions of observations. On a 256 GB computer, you can process a billion or two observations. Most users' datasets run only to the thousands or, sometimes, the millions.

Stata imposes a limit of 2.1 billion observations, and that limit was more theoretical than practical until recently. On a 256 GB computer, 2.1 billion is roughly the limit of what you could fit into memory anyway.

With today's 512 GB, 1 TB, and 1.5 TB computers, there is enough memory to store more than 2.1 billion observations in memory, and some users have requested the limit be relaxed.

That limit is now relaxed with Stata/MP. MP is the multi-processor version of Stata. The other flavors of Stata—Stata/IC and Stata/SE—continue with the previous 2.1 billion observation limit. Why? Because even though single-processor Stata/IC and Stata/SE are fast, Stata/MP is faster, and when you are processing 2.1 billion observations and more, that matters more than you might imagine.

For instance, on a mere 16 GB off-the-shelf computer, Stata/SE can process many millions of observations. Let's fit a linear regression on 2 million observations with 6 covariates. That will take only 1.2 seconds. That's fast. Now think of how long it would take to fit that same regression on 2 billion observations. 1.2 seconds multiplied by 1,000 is 1,200 seconds, or 20 minutes! That's fast, too, but it doesn't feel fast if you're waiting for the results. Stata/MP with 2 processors reduces that to 10 minutes. Stata/MP with 4 processors reduces that to 5 minutes. Stata/MP with 32 processors reduces that to 37 seconds!
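If you want a feel for those numbers on your own machine, here is a minimal sketch of such a benchmark. The simulated data, seed, and coefficients are our assumptions, not the benchmark behind the figures above, and timings will vary with hardware and Stata flavor:

    * Sketch of a timing benchmark: 2 million observations, 6 covariates
    clear all
    set seed 12345
    set obs 2000000
    forvalues i = 1/6 {
        generate double x`i' = rnormal()
    }
    generate double y = x1 + x2 + x3 + x4 + x5 + x6 + rnormal()

    timer clear 1
    timer on 1
    regress y x1-x6
    timer off 1
    timer list 1        // reports elapsed seconds for the regression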

If you are going to process more than 2.1 billion observations, you need Stata/MP.

How many observations you will be able to process depends on the amount of memory on your computer. The formula is simple: divide the memory on your computer by the memory required to store one observation (known as the width). Obviously, you cannot devote all your computer's memory to data, and a few other adjustments need to be made, too.

Here is the adjusted formula and some calculations made with it:

$$ {\rm obs} = \frac{{\rm memory\_used}}{{\rm width} + 24} \times \frac{1024^3}{1000^3} $$

                                 Billions of observations,
    Computer's    Memory               by scenario
      memory       used           (1)      (2)      (3)
    -----------------------------------------------------
      128 GB      112 GB          1.8      1.4      1.0
      256 GB      240 GB          3.8      2.9      2.1
      512 GB      496 GB          7.9      6.1      4.4
     1024 GB     1008 GB         16.2     12.3      9.8
     1536 GB     1520 GB         24.4     18.5     13.6

where memory_used = computer's_memory - 16 GB. Measure memory_used in GB and width in bytes, and the formula returns observations in billions.

The table reports three scenarios:

  1. width = 43 bytes, the same as auto.dta
  2. width = 64 bytes
  3. width = 96 bytes

Scenario 1, equivalent to auto.dta, has only 12 variables. The other two scenarios are more reasonable. Even larger widths would be reasonable, too.
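As a check, you can reproduce any entry in the table directly from the formula. For example, for a 512 GB computer under scenario 2 (width = 64 bytes):

    * Check a table entry: 512 GB machine, scenario (2), width = 64 bytes
    display (512 - 16)/(64 + 24) * 1024^3/1000^3
    * about 6.05 billion observations, which the table reports as 6.1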

The formula we supplied is basically total memory divided by width of observation, but it incorporates three adjustments.

The first is to substitute memory used for total memory available. Memory used is 16 GB less than the total. We are assuming that Stata is the single major process running on your computer. Stata itself will consume some of that 16 GB, and enough will be left free for the other processes that usually run on your computer, with a bit to spare.

The second adjustment was to add 24 bytes to the width, which allocates room for three extra double-precision variables. Stata commands often add extra working variables to your data, at least temporarily.
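As an illustration of such working variables, here is a minimal sketch using auto.dta; the command and variable name are arbitrary choices of ours:

    * A command adding a temporary working double to the data
    sysuse auto, clear
    tempvar resid
    quietly regress mpg weight
    predict double `resid', residuals   // one extra 8-byte variable per observation
    drop `resid'                        // working variable removed when done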

The third adjustment accounts for the difference between binary and decimal units. A thousand in decimal is, of course, 1,000. A "thousand" in binary (a kilo, in the computer sense) is 1,024. To get to the billions, we have to cube these numbers.
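You can verify the resulting conversion factor directly:

    * A binary GB holds 1024^3 bytes; a decimal billion is 1000^3
    display 1024^3/1000^3    // 1.073741824, about a 7% difference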

With Stata/MP, you can now process up to 24.4 billion observations. Or more, if you have few enough variables. Or fewer, if you have more.
