Re: st: Identifying unique values with codebook
email@example.com (William Gould, StataCorp LP)
Wed, 16 Jun 2010 10:42:54 -0500
Walter Garcia-Fontes <firstname.lastname@example.org> asks about -codebook-.
Consider a 10-observation Stata dataset containing variable x, which
has values 1, 2, ..., 10:
. drop _all
. set obs 10
. gen x = _n
-codebook- reports that x takes on 10 unique values.
Now do this,
. gen y = 100000000000000000 + x
-codebook- reports y takes on 1 value!
Walter reports that he ran into this problem in real data. "If values
are large, they will be identified as the same value," he writes.
"Is this a feature?" he asks.
Answer: Yes, and Walter needs to take the output seriously. In the
example above, -codebook- made no error; x may take on 10 different values,
but y really does take on one value. One way I can prove that is by typing
. assert y==y[1]
-assert- produces no output, meaning the assertion is true. All values of
y are equal to the first value of y. Type -assert x==x[1]- and
-assert- will report "9 contradictions in 10 observations; assertion is
false".
This is a precision issue.
Stata stores values as floating point numbers by default. Think of a
floating point number as being _.______*10^___. For instance,
100000000000000000 is stored as 1.000000*10^17. Actually, Stata
uses binary, _._______________*2^____, and 100000000000000000 is in fact
stored as 1.01100011010001010111100*2^56 (if I did my arithmetic correctly).
For purposes of understanding, we can pretend that Stata uses base 10.
So let's imagine my computer stores
100000000000000000 as 1.000000*10^17. It stores 1 as
1.000000*10^0. Let's add the two numbers:
      1.000000*10^17
    + 1.000000*10^ 0
    ----------------
To perform the addition, I'll need to "normalize" the numbers --
to make the powers the same -- so that I can add the significand in
the usual way. What I need is to write the second number,
1.000000*10^0 as ?.??????*10^17. I know you can do this in your head,
but let's do it together:
    1.000000*10^0 = 0.100000*10^1
                  = 0.010000*10^2
                  = 0.001000*10^3
                  = 0.000100*10^4
                  = 0.000010*10^5
                  = 0.000001*10^6
                  = 0.000000*10^7    <---
"Stop!" you say. "You made a mistake! You meant to type 0.0000001*10^7."
No I didn't. I'm pretending I'm a finite precision, base-10 computer, with 7
digits of precision. Ergo, when performing normalization, 0.000001*10^6 =
0.000000*10^7. I followed my usual normalization-rule: roll the digits one
to the right, and then increase the power by 1. It's too bad that 1 at the
end rolled off, but that's my rule. Now, if you'll excuse me, I need to
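finish the normalization:
                  = 0.000000*10^8
                  = 0.000000*10^9
                  ...
                  = 0.000000*10^17
Now I can add the two numbers:
      1.000000*10^17
    + 0.000000*10^17
    ----------------
      1.000000*10^17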
Thus, I find that 1*10^17 plus 1 is precisely 1*10^17. Being limited
to 7 digits of precision, what else could I do?
Stata did the equivalent, but in binary.
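For anyone who wants to watch the binary version of that addition, here is a minimal sketch. It assumes nothing beyond the built-in float() function and the %21.0f display format; I have not pasted output, so run it and look at the digits yourself.

```stata
* float() rounds its argument to float precision, just as storing
* the value in a float variable would
display %21.0f float(100000000000000000)
display %21.0f float(100000000000000000 + 1)

* both expressions round to the same stored value, so -assert-
* is expected to pass silently
assert float(100000000000000000 + 1) == float(100000000000000000)
```

If the two displayed numbers are identical, the 1 "rolled off" in binary exactly as it did in the base-10 walk-through.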
What this means for Walter, for me, and for everybody, is that large
numbers are subject to rounding! Or more correctly, large numbers
are subject to rounding when stored in floating-point format.
Stata has another format, integer format. -byte-, -int-, and -long-
are Stata's integer formats. -byte- has 8 binary digits, -int- has
16 binary digits, and -long- has 32 binary digits. The integer formats
are not subject to rounding, but they are subject to underflow and overflow.
A -byte-, for instance, can store a number between -127 and 100. Numbers
smaller than -127 or larger than 100 can't be stored in a byte. Faced
with storing a number outside the range in a byte, Stata sometimes
stores a missing value (.), and sometimes upgrades the storage format to the
next wider integer type, in this case, -int-. Stata stores missing when it
takes the storage type you specify seriously, as in -infile-, and Stata rolls
up to the next integer type for other commands, such as -replace-,
-append-, and -merge-.
A -byte- stores -127 to 100, without rounding.
An -int- stores -32,767 to 32,740, without rounding.
A -long- stores -2,147,483,647 to 2,147,483,620, without rounding.
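The limits and the roll-up behavior can be checked with a short sketch. The variable name b is made up for illustration, and the comment about promotion is what I would expect from -replace-, not logged output.

```stata
clear
set obs 1

* the limits of each integer type are available as c-class values
display c(minbyte) " to " c(maxbyte)
display c(minint)  " to " c(maxint)
display c(minlong) " to " c(maxlong)

* 100 fits in a byte
generate byte b = 100
describe b

* 101 does not; -replace- is expected to promote b to -int-
replace b = 101
describe b
```

Comparing the two -describe- listings shows whether the storage type changed.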
-float- and -double- are Stata's floating-point types. They too have
ranges, roughly +/-10^38 for -float- and +/-10^307 for -double-. Those ranges
are so large, you can ignore them. However, within the range, a fixed
number of digits is allocated to record the significand. The two
formats are binary, with 24 digits for -float- and 53 for -double-. Attempting to
quantify those numbers in base 10 is easily subject to a
misunderstanding. When you think of a base-10 number, and I tell you
it's recorded to four digits, you assume the fifth digit could be any
of 0 through 9. But these formats are binary. If I tell you that a
binary number is recorded to four digits, you must assume the fifth
digit can only be 0 or 1. So the first thing you will observe when a number
exceeds binary precision is a rounding to evenness! Next you see a
rounding to multiples of 4, and so on.
Anyway, let's try with float to find the precision for ourselves. Because
of rounding to evenness, it's important we seek an odd number so we'll be
able to see the rounding when it first occurs:
. display %12.0g float(1000000+1)
     1000001
. display %12.0g float(10000000+1)
    10000001
. display %12.0g float(100000000+1)
   100000000
There it is, 9 digits. Actually, 9 digits overstates the accuracy,
because numbers are stored in binary. What we really discovered is that
the first number that has the rounding problem is somewhere between
10,000,000 and 100,000,000.
A more refined search shows that the problem first appears at 16,777,216:
. display %12.0g float(16777214+1)
    16777215
. display %12.0g float(16777216+1)
    16777216
Thus, the answer is that 16,777,215 is the largest number not subject to
rounding. We could have figured that out directly when I first mentioned
that -float- assigns 24 binary digits to the significand: (2^24)-1 is
the largest number that can be recorded when all the digits are 1, and
(2^24)-1 = 16,777,215. How many base-10 digits is that? Most computer
scientists would say 7.22 digits, 7.22 being the log base 10 of
16,777,215.
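The rounding to evenness, and then to multiples of 4, can be seen directly. This is a sketch using only float(); the comments state what a 24-digit binary significand predicts, so treat them as expectations to check and not as logged output.

```stata
* up to 2^24 = 16,777,216, every integer is stored exactly
display %12.0g float(16777215)

* between 2^24 and 2^25, only even integers can be stored
display %12.0g float(16777217)
display %12.0g float(16777219)

* between 2^25 and 2^26, only multiples of 4 can be stored
display %12.0g float(33554433)
display %12.0g float(33554435)

* the odd value just above 2^24 collapses onto 2^24
assert float(16777217) == 16777216
* above 2^25, the stored result is a multiple of 4
assert mod(float(33554435), 4) == 0
```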
For -double-, 53 binary digits are assigned to the significand,
and thus the largest integer-valued double not subject to rounding
is (2^53)-1 = 9,007,199,254,740,991, which means 15.95 digits.
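One common remedy for large integer-valued data, not spelled out above, is to store it as -double- and not as -float-. A minimal sketch follows, assuming hypothetical variables id_f and id_d with values near 10^15, which is small enough that -double- holds every integer exactly.

```stata
clear
set obs 10

* hypothetical large identifiers, stored both ways
generate float  id_f = 1000000000000000 + _n
generate double id_d = 1000000000000000 + _n

* -codebook- reports the number of unique values of each
codebook id_f id_d

* the double copy keeps every observation exactly as computed
assert id_d == 1000000000000000 + _n
```

The float copy is expected to collapse to a single value, just as y did, while the double copy should show 10 unique values. Values much beyond 16 digits would be rounded even in a -double-.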