Statalist The Stata Listserver



Re: st: Query dataset size


From   "Julian Reif" <[email protected]>
To   "Statalist (E-mail)" <[email protected]>
Subject   Re: st: Query dataset size
Date   Thu, 7 Dec 2006 16:14:17 -0500

Bill,

Thank you for the detailed explanation.  That was exactly what I needed.

Julian

--------------------------------------------------

Date: Wed, 06 Dec 2006 12:46:33 -0600
From: [email protected] (William Gould, Stata)
Subject: Re: st: Query dataset size

Julian Reif <[email protected]> writes, 

> Thanks for the information.  The "data + overhead" number matches what is
> returned from -describe-.  However, it doesn't look like -memory- saves this
> value in r() either.  What does the r(N_cur) value represent?

Julian wants to obtain "data + overhead", reported under that name by 
-memory- and reported as "size" by -describe-.  Call the number X.  X is 
defined as

        X = ( r(width) + r(size_ptr) ) * _N
              --------   -----------
                /                 \
               /                   \
        from -describe-           from -memory-

I.e., 
		quietly memory 
		local size_ptr = r(size_ptr)
		quietly describe
		local X = ( r(width) + `size_ptr' ) * _N

Actually, if you are using Stata/MP, there are two pointers per observation, 
so the formula is X = ( r(width) + 2*r(size_ptr) ) * _N, but we'll ignore 
that.
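For anyone who does not want to ignore it, here is a minimal sketch that folds the Stata/MP case into the same calculation.  It assumes -memory- returns r(size_ptr) as described above, and that c(MP) is 1 under Stata/MP and 0 otherwise; the local names nptr and X are just illustrative.

```stata
* r() from -memory- is overwritten by -describe-, so save size_ptr first
quietly memory
local size_ptr = r(size_ptr)

* assumption: Stata/MP keeps two pointers per observation, other flavors one
local nptr = cond(c(MP) == 1, 2, 1)

quietly describe
local X = ( r(width) + `nptr' * `size_ptr' ) * _N
display "data + overhead = `X' bytes"
```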

Julian also asked about r(N_cur) reported by -memory-.  This will take some
explaining.  You know Stata keeps the data in memory.  Let's talk about 
that.  

The data look like this

        a pointer 
         per obs.
             \
              \  | <------- w bytes ------> |
             +---+--------------------------+
             |   | var1 var2 ...            |    
             |   |                          |
             |   |                          |   <- each line is an obs.
             |   |                          |
             |   |                          |
             |   |                          |
             +---+--------------------------+


The width (w) of an observation is just the sum of the widths of the 
individual variables.  For auto.dta, that width is 43 (r(width) returned 
by -describe-).  Thus, the data themselves require w*_N bytes.  Associated 
with each observation is a "pointer" -- something technical Stata needs.
The width of that pointer varies across computers.  On 32-bit computers, 
the pointer is 4 bytes wide.  On 64-bit computers, the pointer is 
8 bytes wide.

The above is the basis of the calculation we just made.
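
As a worked check of that calculation, here is the arithmetic for auto.dta done by hand.  The width of 43 and the pointer sizes come from the text above; the 74 observations are the size of the auto.dta shipped with Stata.  Stata/MP, with its two pointers per observation, would give larger figures.

```stata
* (width + pointer) * _N on a 32-bit computer
display (43 + 4) * 74

* the same on a 64-bit computer
display (43 + 8) * 74
```

The first line displays 3478 and the second 3774.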

The data exist in a block of memory that is wider and longer than the 
data themselves.  This way, you can add extra variables or extra observations.
The picture looks like this:

        a pointer 
         per obs.
             \
              \  | <------- w bytes ------> |
             +---+--------------------------+----------------------+
   (obs 1)   |   | var1 var2 ...            |                      |
   (obs 2)   |   |                          |                      |
      .      |   |                          |                      |
      .      |   |                          |                      |
      .      |   |                          |                      |
 (obs _N)    |   |                          |                      |
             +---+--------------------------+                      |
 (obs _N+1)  |   |                                                 |
      .      |   |                                                 |
      .      |   |                                                 |
      .      |   |                                                 |
      .      |   |                                                 |
 (obs N_cur) |   |                                                 |
             +---+-------------------------------------------------+
                 | < --------------- w_cur bytes ----------------> |

The total number of bytes is N_cur*(size_ptr + w_cur).
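
To make the difference between the allocated block and the data in use concrete, here is a small sketch with made-up numbers.  The values of N_cur and w_cur below are hypothetical and chosen only for illustration; the "in use" line reuses the auto.dta figures (74 observations, width 43) with a 4-byte pointer.

```stata
* hypothetical partitioning, for illustration only
local size_ptr = 4
local N_cur    = 10000
local w_cur    = 100

* whole block: N_cur*(size_ptr + w_cur)
local allocated = `N_cur' * (`size_ptr' + `w_cur')

* part actually occupied: _N*(size_ptr + w)
local in_use = 74 * (`size_ptr' + 43)

display "allocated: `allocated' bytes, in use: `in_use' bytes"
```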

To answer Julian's question, N_cur is the maximum number of observations that
can be stored GIVEN THE CURRENT PARTITIONING.  That is not the same as the 
maximum number of observations because Stata silently changes the current
partitioning -- holding the area constant -- when necessary.  If you start
adding lots of variables, Stata will increase w_cur at the expense of N_cur.
If you instead add lots of observations, Stata will increase N_cur while
reducing w_cur.
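
The trade-off can be sketched with arithmetic.  This is only an illustration of "holding the area constant" using the byte formula above; the numbers are hypothetical, and the rule Stata actually uses to pick the new w_cur and N_cur is not described here.

```stata
* hypothetical starting partition
local size_ptr = 4
local N_cur    = 10000
local w_cur    = 100
local area     = `N_cur' * (`size_ptr' + `w_cur')

* suppose added variables force each row to be 204 bytes wide
local w_new = 204

* same area, wider rows, so fewer rows fit
local N_new = floor(`area' / (`size_ptr' + `w_new'))
display "N_cur falls from `N_cur' to `N_new'"
```

With these numbers the area is 1,040,000 bytes, so roughly doubling the row width halves N_cur, from 10000 to 5000.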

Changing the partitioning sounds easy, but it is not.

Who cares?

We at StataCorp care, because we have to verify that everything is working
before we ship.  So -memory- saves in r() a number of things that interest us,
and we have test scripts that put Stata through its paces and verify that
these internal values change in the way they should.  If they didn't, Stata
would run more slowly and, in the worst case, could actually corrupt your
data.  Anyway, recorded by -memory- are things like r(n_repart), the number of
repartitioning operations performed by Stata, r(n_shift), the number of shift
operations (which I haven't described), and the characteristics of the current
state.  With that information, we can design tests that move Stata to a 
different state, and then we can put the test in a do-file, and we can 
use -assert- to verify that the values before and after are just what they 
should be.  And we can check that Stata did not do too many (or too few)
repartitionings or shifts, all of which affect performance.
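
A do-file in that spirit might look like the sketch below.  It is not one of StataCorp's actual test scripts, and the exact values they assert are not given here; it only checks two properties that follow from the descriptions above, assuming r(n_repart) is a running count and that -memory- returns r(N_cur) as named.  The variable name extra is made up.

```stata
sysuse auto, clear

* record the state before the change
quietly memory
local repart0 = r(n_repart)

* move Stata to a different state by adding one small variable
generate byte extra = 0

* record the state after the change
quietly memory
local repart1 = r(n_repart)

* a count of operations performed should never go down
assert `repart1' >= `repart0'

* the current partition must have room for the data in memory
assert r(N_cur) >= _N
```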

We don't usually talk about this, but Stata's memory manager is an important
reason Stata is so fast.  

