Huge datasets

Stata allows you to process datasets containing more than 2 billion observations if you have a big computer, and by big, we mean 512 GB or more of memory.

Stata stores your data in memory. That makes Stata fast. It also means that datasets you wish to process must fit in memory. That requirement is of little consequence these days. On a 16 GB computer, you can process datasets with many millions of observations. On a 256 GB computer, you can process a billion or two observations. Most users' datasets number only in the thousands or, sometimes, the millions.

Stata/BE and Stata/SE can process up to 2.1 billion observations. On a 256 GB computer, 2.1 billion is roughly the limit of what you could fit into memory anyway. Today's computers with 512 GB, 1 TB, 2 TB, or more of memory, however, have room for more than 2.1 billion observations. Stata/MP, the multiprocessor edition of Stata, can handle up to 1 trillion observations in theory; the practical limit, given the size of the largest computers today, is just over 24 billion observations. Stata/MP can also handle up to 120,000 variables, up from the maximum of 2,047 in Stata/BE and 32,767 in Stata/SE.

Why did we make Stata/MP capable of handling more? Because even though single-core Stata/BE and Stata/SE are fast, Stata/MP is faster, and when you are processing 2.1 billion observations and more, that matters more than you might imagine.

For instance, on a mere 16 GB off-the-shelf computer, Stata/SE can process many millions of observations. Let's fit a linear regression on 2 million observations with 6 covariates. That will take only 1.2 seconds. That's fast. Now think of how long it would take to fit that same regression on 2 billion observations, which is 1,000 times as many: 1.2 seconds multiplied by 1,000 is 1,200 seconds, or 20 minutes! That's fast, too, but it doesn't feel fast if you're waiting for the results. Stata/MP with 2 cores reduces that to 10 minutes. Stata/MP with 4 cores reduces that to 5 minutes. Stata/MP with 32 cores reduces that to 37 seconds!
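
The timings above will vary with your hardware, but the experiment is easy to rerun. Here is a minimal sketch that simulates 2 million observations with 6 covariates and times the regression; the simulated variables are just for illustration.

    clear all
    set obs 2000000                        // 2 million observations
    forvalues i = 1/6 {
        generate double x`i' = rnormal()   // 6 covariates
    }
    generate double y = x1 - x2 + rnormal()
    timer clear 1
    timer on 1
    regress y x1-x6
    timer off 1
    timer list 1                           // elapsed time, in seconds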

If you are going to process more than 2.1 billion observations, you need Stata/MP.

How many observations you will be able to process depends on the amount of memory on your computer. The formula is simple: divide the memory on your computer by the memory required to store an observation (known as the width). Obviously, you cannot put all the memory on your computer in the numerator, and a few other adjustments need to be made, too.
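
If you are unsure of the width of a dataset you already have, describe stores it in r(width), so you can check it directly. For example, with auto.dta:

    sysuse auto, clear
    quietly describe
    display "width = " r(width) " bytes per observation"    // 43 for auto.dta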

Here is the adjusted formula and some calculations made with it:

$${\rm obs\ (billions)} = \frac{{\rm memory\_used\ (GB)}}{{\rm width} + 24} \times \frac{1024^3}{1000^3}$$

where memory_used = computer's memory - 16 GB.

                               Billions of Observations
  Computer's     Memory                Scenario
  memory         used            (1)      (2)      (3)
  -----------------------------------------------------
  128 GB         112 GB          1.8      1.4      1.0
  256 GB         240 GB          3.8      2.9      2.1
  512 GB         496 GB          7.9      6.1      4.4
  1024 GB        1008 GB        16.2     12.3      9.0
  1536 GB        1520 GB        24.4     18.5     13.6

The table reports three scenarios:

  1. width = 43 bytes, the same as auto.dta
  2. width = 64 bytes
  3. width = 96 bytes

Scenario 1, equivalent to auto.dta, has only 12 variables. The other two scenarios are more typical, and even larger widths would be reasonable, too.
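
If you want to verify the table yourself, you can evaluate the formula directly in Stata. The sketch below reproduces the 1536 GB row for the three scenario widths; the local macro is only for illustration.

    local memory_used = 1536 - 16          // GB available for data
    foreach width in 43 64 96 {
        display "width `width': " %4.1f `memory_used'/(`width' + 24)*1024^3/1000^3 " billion observations"
    }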

The formula we supplied is basically total memory divided by the width of an observation, but it incorporates three adjustments.

The first is to substitute memory used for total memory available. Memory used is 16 GB less than the total. We are assuming that Stata is the single major process running on your computer; Stata will consume some of that 16 GB, but enough will be left free for the other processes that usually run on your computer, with a little to spare.

The second adjustment is to add 24 bytes to the width, which allocates room for three extra double-precision (8-byte) variables. Stata commands often add extra working variables to your data, at least temporarily.

The third adjustment accounts for the difference between binary and decimal units. A thousand in decimal is 1,000, of course, but memory is measured in binary units, where a "thousand" is 1,024. To get to billions, we cube these numbers, which is where the 1024^3/1000^3 factor in the formula comes from.
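
The factor works in your favor: a binary gigabyte holds more bytes than a decimal billion, so you fit a bit more than the naive calculation suggests. You can check the ratio directly in Stata:

    display 1024^3/1000^3        // 1.073741824, about 7.4% extra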

With Stata/MP, you can process up to 24.4 billion observations on a computer with 1.5 TB of memory. Or more, if there are few enough variables. Or fewer, if there are more variables.