Re: st: Identifying unique values with codebook

From	[email protected] (William Gould, StataCorp LP)
To	[email protected]
Subject	Re: st: Identifying unique values with codebook
Date	Wed, 16 Jun 2010 10:42:54 -0500

Walter Garcia-Fontes <[email protected]> asks about -codebook-.
Consider a 10-observation Stata dataset containing variable x, which 
has values 1, 2, ..., 10:  

        . drop _all

        . set obs 10 

        . gen x = _n 

Run -codebook-:

       . codebook
       <output omitted>

-codebook- reports that x takes on 10 unique values.
Now do this, 

       . gen y = 100000000000000000 + x

       . codebook

-codebook- reports y takes on 1 value!

Walter reports that he ran into this problem in real data.  "If values 
are large, they will be identified as the same value," he writes.
"Is this a feature?" he asks.

Answer:  Yes, and Walter needs to take the output seriously.  In the 
example above, -codebook- made no error; x may take on 10 different values,
but y really does take on one value.  One way I can prove that is by typing 

        . assert y==y[1]

-assert- produces no output, meaning the assertion is true.  All values of 
y are equal to the first value of y.  Type -assert x==x[1]- and 
-assert- will report "9 contradictions in 10 observations; assertion is
false".
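
One way to look at what is actually stored, sketched here without pasted
output, is to display y in a format wide enough to show every digit:

        . display %21.0f y[1]

        . display %21.0f y[10]

Both commands should display the same number.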

This is a precision issue.

Stata stores values as floating point numbers by default.  Think of a 
floating point number as being _.______*10^___.  For instance,
100000000000000000 is stored as 1.000000*10^17.  Actually, Stata 
uses binary, _._______________*2^____, and 100000000000000000 is in fact 
stored as 1.01100011010001010111100*2^56 (if I did my arithmetic correctly).

For purposes of understanding, we can pretend that Stata uses base 10.
So let's imagine my computer stores 
100000000000000000 as 1.000000*10^17.  It stores 1 as 
1.000000*10^0.  Let's add the two numbers:

                  1.000000*10^17
                + 1.000000*10^ 0
                ----------------
                  ?.??????*10^??

To perform the addition, I'll need to "normalize" the numbers -- 
to make the powers the same -- so that I can add the significands in 
the usual way.  What I need is to write the second number, 
1.000000*10^0 as ?.??????*10^17.  I know you can do this in your head, 
but let's do it together:

        1.000000*10^0  =  0.100000*10^1    
                       =  0.010000*10^2    
                       =  0.001000*10^3    
                       =  0.000100*10^4    
                       =  0.000010*10^5    
                       =  0.000001*10^6    
                       =  0.000000*10^7    <---

"Stop!" you say. "You made a mistake!  You meant to type 0.0000001*10^7."

No I didn't.  I'm pretending I'm a finite precision, base-10 computer, with 7
digits of precision.  Ergo, when performing normalization, 0.000001*10^6 =
0.000000*10^7.  I followed my usual normalization-rule:  roll the digits one
to the right, and then increase the power by 1.  It's too bad that 1 at the
end rolled off, but that's my rule.  Now, if you'll excuse me, I need to
finish the normalization:

                       =  0.000000*10^7  
                       =  0.000000*10^8  
                       .
                       .
                       =  0.000000*10^17 

Now I can add the two numbers:

                  1.000000*10^17
                + 0.000000*10^17
                ----------------
                  1.000000*10^17

Thus, I find that 1*10^17 plus 1 is precisely 1*10^17.  Being limited 
to 7 digits of precision, what else could I do?

Stata did the equivalent, but in binary.
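
A quick check, offered as a sketch rather than pasted from a session, uses
the float() function, which rounds its argument to -float- precision:

        . display float(100000000000000000 + 1) == float(100000000000000000)

This should display 1, meaning the two values are equal once they are 
stored as -float-.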

What this means for Walter, for me, and for everybody, is that large 
numbers are subject to rounding!  Or more correctly, large numbers 
are subject to rounding when stored in floating-point format.

Stata has another format, integer format.  -byte-, -int-, and -long- 
are Stata's integer formats.  -byte- has 8 binary digits, -int- has 
16 binary digits, and -long- has 32 binary digits.  The integer formats 
are not subject to rounding, but they are subject to underflow and overflow.
A -byte-, for instance, can store a number between -127 and 100.  Numbers 
smaller than -127 or larger than 100 can't be stored in a byte.  Faced 
with storing a number outside the range in a byte, Stata sometimes 
stores a missing value (.), and sometimes upgrades the storage format to the
next wider integer type, in this case, -int-.  Stata stores missing when it
takes the storage type you specify seriously, as in -infile-, and Stata rolls
up to the next integer type for other commands, such as -replace-,
-append-, and -merge-.

A -byte- stores -127 to 100, without rounding.

An -int- stores -32,767 to 32,740, without rounding.

A -long- stores -2,147,483,647 to 2,147,483,620, without rounding.
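
The two behaviors can be tried on a throwaway one-observation dataset.
This is a sketch; the variable names b and c are made up for the
illustration, and the output is not shown:

        . drop _all

        . set obs 1

        . gen byte b = 100

        . replace b = 101

        . describe b

        . gen byte c = 101

        . list

After -replace-, -describe- should report that b is now an -int-.  The 
-gen byte c- line states the storage type explicitly, so c should contain 
missing (.) rather than 101.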

-float- and -double- are Stata's floating-point types.  They too have
ranges, roughly +/-10^38 for -float- and +/-10^307 for -double-.  Those ranges
are so large, you can ignore them.  However, within the range, a fixed
number of digits is allocated to record the significand.  The two
formats are binary, with 24 digits for -float- and 53 for -double-.  Attempting 
to quantify those numbers in base 10 is easily subject to
misunderstanding.  When you think of a base-10 number, and I tell you
it's recorded to four digits, you assume the fifth digit could be any
of 0 through 9.  But these formats are binary.  If I tell you that a
binary number is recorded to four digits, you must assume the fifth
digit can only be 0 or 1.  So the first thing you will observe when a number
exceeds binary precision is a rounding to evenness!  Next you see a
rounding to multiples of 4, and so on.

Anyway, let's try with float to find the precision for ourselves.  Because 
of rounding to evenness, it's important we seek an odd number so we'll be 
able to see the rounding when it first occurs:

        . display %12.0g float(1000000+1)
             1000001

        . display %12.0g float(10000000+1)
             10000001

        . display %12.0g float(100000000+1)
            100000000

There it is, 9 digits.  Actually, 9 digits overstates the accuracy, 
because numbers are stored in binary.  What we really discovered is that 
the first number that has the rounding problem is somewhere between 
10,000,000 and 100,000,000.

More refined search yields that the problem first appears at 16,777,216:

        . display %12.0g float(16777214+1)
            16777215

        . display %12.0g float(16777216+1)
            16777216

Thus, the answer is that 16,777,215 is the largest number not subject to 
rounding.  We could have figured that out directly when I first mentioned
that -float- assigns 24 binary digits to the significand:  (2^24)-1 is 
the largest number that can be recorded when all 24 digits are 1, and 
(2^24)-1 = 16,777,215.  How many base-10 digits is that?  Most computer 
scientists would say 7.22 digits, 7.22 being the log base 10 of 
16,777,215.
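
Knowing where -float- runs out of digits, the earlier remark about
rounding to evenness and then to multiples of 4 can be checked in the same
way.  The output shown is what binary rounding should produce; it is not
pasted from a session.  Note that 2^25 = 33,554,432:

        . display %12.0g float(16777216+3)
            16777220

        . display %12.0g float(33554432+1)
            33554432

        . display %12.0g float(33554432+3)
            33554436

Between 2^24 and 2^25 only even numbers survive; above 2^25 only multiples 
of 4 do.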

For -double-, 53 binary digits are assigned to the significand, 
and thus the largest integer-valued double not subject to rounding 
is (2^53)-1 =  9,007,199,254,740,991, which means 15.95 digits.
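
The limit for -double- can also be searched for rather than computed.
The lines below are a sketch meant for a do-file, so they carry no dot
prompts; the scalar name pow2 is made up for the illustration.  Stata 
evaluates expressions in double precision, so the loop keeps doubling pow2 
until adding 1 no longer changes it:

        scalar pow2 = 1
        while pow2 + 1 != pow2 {
                scalar pow2 = 2*pow2
        }
        display %20.0f pow2

The value displayed is the first power of two at which adding 1 is lost;
the integer just below it is the largest one not subject to rounding.
Changing the condition to float(pow2 + 1) != float(pow2) turns this into 
the search for -float-, and should reproduce 16,777,216.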

-- Bill
[email protected]