Re: st: Identifying unique values with codebook

From	"Michael N. Mitchell" <[email protected]>
To	[email protected]
Subject	Re: st: Identifying unique values with codebook
Date	Thu, 17 Jun 2010 09:30:38 -0700

Dear All

Many thanks to Bill Gould for providing such a detailed description of issues of precision. It seems to me that many of the "gotchas" arise from the fact that the default data type is "float" instead of "double". Yes, it is true that "double" variables take twice as much space as "float" variables (8 bytes vs. 4 bytes), but this buys you precision that avoids problems like the one Walter encountered.


  If Walter (or others) would like to make "double" the default data type, you can type

. set type double

and variables will, by default, be created as double precision (for the duration of your Stata session). You can go one step further and type


. set type double, permanently

and this setting will be recorded and saved for future Stata sessions as well. I like this setting myself and find that it is useful for avoiding many precision issues. (Other related precision issues are described in this week's Stata Tidbit at http://www.michaelnormanmitchell.com/stow/floats-and-doubles.html ).
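
As a minimal sketch of what this setting changes (the variable names yf and yd and the value 100000000 are only illustrative; c(type) is used to restore whatever default was in effect beforehand):

```stata
clear
set obs 10
gen x = _n

* remember the current default so it can be restored afterwards
local oldtype = c(type)

set type float
gen yf = 100000000 + x      // created as float
set type double
gen yd = 100000000 + x      // created as double

set type `oldtype'

* the double copy holds every value exactly
assert yd == 100000000 + x

* the float copy does not: some values were rounded when stored
count if yf != 100000000 + x
```

Note that "double" only pushes the problem further out: it too has a finite significand, so sufficiently large integers are still rounded.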


Best regards,

Michael N. Mitchell
Data Management Using Stata      - http://www.stata.com/bookstore/dmus.html
A Visual Guide to Stata Graphics - http://www.stata.com/bookstore/vgsg.html
Stata tidbit of the week         - http://www.MichaelNormanMitchell.com



On 2010-06-16 8.42 AM, William Gould, StataCorp LP wrote:

Walter Garcia-Fontes <[email protected]> asks about -codebook-.
Consider a 10-observation Stata dataset containing variable x, which
has values 1, 2, ..., 10:

         . drop _all

         . set obs 10

         . gen x = _n

Run -codebook-:

        . codebook
        <output omitted>

-codebook- reports that x takes on 10 unique values.
Now do this,

        . gen y = 100000000000000000 + x

        . codebook

-codebook- reports y takes on 1 value!

Walter reports that he ran into this problem in real data.  "If values
are large, they will be identified as the same value," he writes.
"Is this a feature?" he asks.

Answer:  Yes, and Walter needs to take the output seriously.  In the
example above, -codebook- made no error; x may take on 10 different values,
but y really does take on one value.  One way I can prove that is by typing

         . assert y==y[1]

-assert- produces no output, meaning the assertion is true.  All values of
y are equal to the first value of y.  Type -assert x==x[1]- and
-assert- will report "9 contradictions in 10 observations; assertion is
false".
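
A quick cross-check that does not rely on -codebook-, offered as a sketch (the helper variable names firstx and firsty are arbitrary): flag the first observation within each distinct value and count the flags.

```stata
clear
set obs 10
gen x = _n
gen y = 100000000000000000 + x

* one flag per distinct stored value of each variable
bysort x: gen byte firstx = (_n == 1)
bysort y: gen byte firsty = (_n == 1)

count if firstx             // number of distinct values of x
count if firsty             // number of distinct values of y

* show what is actually stored in y
display %21.0f y[1]
```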

This is a precision issue.

Stata stores values as floating point numbers by default.  Think of a
floating point number as being _.______*10^___.  For instance,
100000000000000000 is stored as 1.000000*10^17.  Actually, Stata
uses binary, _._______________*2^____, and 100000000000000000 is in fact
stored as 1.01100011010001010111100*2^56 (if I did my arithmetic correctly).

For purposes of understanding, we can pretend that Stata uses base 10.
So let's imagine my computer stores
100000000000000000 as 1.000000*10^17.  It stores 1 as
1.000000*10^0.  Let's add the two numbers:

                   1.000000*10^17
                 + 1.000000*10^ 0
                 ----------------
                   ?.??????*10^??

To perform the addition, I'll need to "normalize" the numbers --
to make the powers the same -- so that I can add the significands in
the usual way.  What I need is to write the second number,
1.000000*10^0 as ?.??????*10^17.  I know you can do this in your head,
but let's do it together:

         1.000000*10^0  =  0.100000*10^1
                        =  0.010000*10^2
                        =  0.001000*10^3
                        =  0.000100*10^4
                        =  0.000010*10^5
                        =  0.000001*10^6
                        =  0.000000*10^7<---

"Stop!" you say. "You made a mistake!  You meant to type 0.0000001*10^7."

No I didn't.  I'm pretending I'm a finite precision, base-10 computer, with 7
digits of precision.  Ergo, when performing normalization, 0.000001*10^6 =
0.000000*10^7.  I followed my usual normalization-rule:  roll the digits one
to the right, and then increase the power by 1.  It's too bad that 1 at the
end rolled off, but that's my rule.  Now, if you'll excuse me, I need to
finish the normalization:

                        =  0.000000*10^7
                        =  0.000000*10^8
                        .
                        .
                        =  0.000000*10^17

Now I can add the two numbers:

                   1.000000*10^17
                 + 0.000000*10^17
                 ----------------
                   1.000000*10^17

Thus, I find that 1*10^17 plus 1 is precisely 1*10^17.  Being limited
to 7 digits of precision, what else could I do?

Stata did the equivalent, but in binary.
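
The binary version of the same effect can be seen with the float() function, which rounds its argument to float precision.  This is only a sketch, and the value 1000000000 is chosen for illustration: it is small enough that Stata's double-precision arithmetic still resolves the added 1, so any loss comes from the float rounding alone.

```stata
* in double precision, the sum differs from the original number
assert 1000000000 + 1 != 1000000000

* rounded to float, the added 1 rolls off the end of the significand
assert float(1000000000 + 1) == float(1000000000)

display %12.0f float(1000000000 + 1)
```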

What this means for Walter, for me, and for everybody, is that large
numbers are subject to rounding!  Or more correctly, large numbers
are subject to rounding when stored in floating-point format.

Stata has another format, integer format.  -byte-, -int-, and -long-
are Stata's integer formats.  -byte- has 8 binary digits, -int- has
16 binary digits, and -long- has 32 binary digits.  The integer formats
are not subject to rounding, but they are subject to underflow and overflow.
A -byte-, for instance, can store a number between -127 and 100.  Numbers
smaller than -127 or larger than 100 can't be stored in a byte.  Faced
with storing a number outside the range in a byte, Stata sometimes
stores a missing value (.), and sometimes upgrades the storage format to the
next wider integer type, in this case, -int-.  Stata stores missing when it
takes the storage type you specify seriously, as in -infile-, and Stata rolls
up to the next integer type for other commands, such as -replace-,
-append-, and -merge-.

A -byte- stores -127 to 100, without rounding.

An -int- stores -32,767 to 32,740, without rounding.

A -long- stores -2,147,483,647 to 2,147,483,620, without rounding.
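
The promotion behavior can be checked with a short sketch like the following (the variable name b is arbitrary; the extended macro function ": type" returns a variable's storage type):

```stata
clear
set obs 1
gen byte b = 100
local before : type b

* 101 is outside the byte range, so -replace- promotes the variable
replace b = 101
local after : type b

display "`before' -> `after'"
assert b == 101
```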

-float- and -double- are Stata's floating-point types.  They too have
ranges, roughly +/-10^38 for -float- and +/-10^308 for -double-.  Those ranges
are so large, you can ignore them.  However, within the range, a fixed
number of digits is allocated to record the significand.  The two
formats are binary, with 24 digits for -float- and 53 for -double-.
Attempting to quantify those numbers in base 10 is easily subject to
misunderstanding.  When you think of a base-10 number, and I tell you
it's recorded to four digits, you assume the fifth digit could be any
of 0 through 9.  But these formats are binary.  If I tell you that a
binary number is recorded to four digits, you must assume the fifth
digit can only be 0 or 1.  So the first thing you will observe when a number
exceeds binary precision is a rounding to evenness!  Next you see a
rounding to multiples of 4, and so on.
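
A small sketch of that pattern for -float-, using the float() function to round to float precision (the starting values 20,000,000 and 40,000,000 are chosen only for illustration; mod() returns the remainder):

```stata
* around 20 million, odd integers are rounded to even ones
display %12.0f float(20000000 + 1)
display %12.0f float(20000000 + 3)
assert mod(float(20000000 + 1), 2) == 0
assert mod(float(20000000 + 3), 2) == 0

* around 40 million, everything is rounded to a multiple of 4
display %12.0f float(40000000 + 1)
display %12.0f float(40000000 + 2)
assert mod(float(40000000 + 1), 4) == 0
assert mod(float(40000000 + 2), 4) == 0
assert mod(float(40000000 + 3), 4) == 0
```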

Anyway, let's try with float to find the precision for ourselves.  Because
of rounding to evenness, it's important we seek an odd number so we'll be
able to see the rounding when it first occurs:

         . display %12.0g float(1000000+1)
              1000001

         . display %12.0g float(10000000+1)
              10000001

         . display %12.0g float(100000000+1)
             100000000

There it is, 9 digits.  Actually, 9 digits overstates the accuracy,
because numbers are stored in binary.  What we really discovered is that
the first number that has the rounding problem is somewhere between
10,000,000 and 100,000,000.

More refined search yields that the problem first appears at 16,777,216:

         . display %12.0g float(16777214+1)
             16777215

         . display %12.0g float(16777216+1)
             16777216

Thus, the answer is 16,777,215 is the largest number not subject to
rounding.  We could have figured that out directly when I first mentioned
that -float- assigns 24 binary digits to the significand:  (2^24)-1 is
the largest number that can be recorded when all the digits are 1, and
(2^24)-1 = 16,777,215.  How many base-10 digits is that?  Most computer
scientists would say 7.22 digits, 7.22 being the log base 10 of
16,777,215.
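
The same boundary can be found mechanically.  As a sketch (the local macro n is just a loop counter, and only powers of 2 are tried), keep doubling n until adding 1 no longer changes the float value:

```stata
local n = 1
while float(`n' + 1) != float(`n') {
    local n = 2 * `n'
}

* first power of 2 at which the added 1 is lost
display %12.0fc `n'

* largest integer with all 24 binary digits set, and its base-10 length
display %12.0fc (2^24-1)
display %5.2f log10(2^24-1)
```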

For -double-, 53 binary digits are assigned to the significand,
and thus the largest integer-valued double not subject to rounding
is (2^53)-1 =  9,007,199,254,740,991, which means 15.95 digits.

-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

