

From: wgould@stata.com (William Gould, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Identifying unique values with codebook
Date: Wed, 16 Jun 2010 10:42:54 -0500

Walter Garcia-Fontes <walter.garcia@upf.edu> asks about -codebook-.

Consider a 10-observation Stata dataset containing variable x, which has values 1, 2, ..., 10:

        . drop _all
        . set obs 10
        . gen x = _n

Run -codebook-:

        . codebook
          <output omitted>

-codebook- reports that x takes on 10 unique values. Now do this,

        . gen y = 100000000000000000 + x
        . codebook

-codebook- reports y takes on 1 value!

Walter reports that he ran into this problem in real data. "If values are large, they will be identified as the same value," he writes. "Is this a feature?" he asks.

Answer: Yes, and Walter needs to take the output seriously.

In the example above, -codebook- made no error; x may take on 10 different values, but y really does take on one value. One way I can prove that is by typing

        . assert y==y[1]

-assert- produces no output, meaning the assertion is true. All values of y are equal to the first value of y. Type -assert x==x[1]- and -assert- will report "9 contradictions in 10 observations; assertion is false".

This is a precision issue. Stata stores values as floating point numbers by default. Think of a floating point number as being _.______*10^___. For instance, 100000000000000000 is stored as 1.000000*10^17. Actually, Stata uses binary, _._______________*2^____, and 100000000000000000 is in fact stored as 1.01100011010001010111100*2^56 (if I did my arithmetic correctly). For purposes of understanding, we can pretend that Stata uses base 10.

So let's imagine my computer stores 100000000000000000 as 1.000000*10^17. It stores 1 as 1.000000*10^0. Let's add the two numbers:

          1.000000*10^17
        + 1.000000*10^ 0
        ----------------
          ?.??????*10^??

To perform the addition, I'll need to "normalize" the numbers -- to make the powers the same -- so that I can add the significands in the usual way. What I need is to write the second number, 1.000000*10^0, as ?.??????*10^17.
I know you can do this in your head, but let's do it together:

        1.000000*10^0 = 0.100000*10^1
                      = 0.010000*10^2
                      = 0.001000*10^3
                      = 0.000100*10^4
                      = 0.000010*10^5
                      = 0.000001*10^6
                      = 0.000000*10^7    <---

"Stop!" you say. "You made a mistake! You meant to type 0.0000001*10^7."

No I didn't. I'm pretending I'm a finite precision, base-10 computer, with 7 digits of precision. Ergo, when performing normalization, 0.000001*10^6 = 0.000000*10^7. I followed my usual normalization rule: roll the digits one to the right, and then increase the power by 1. It's too bad that the 1 at the end rolled off, but that's my rule.

Now, if you'll excuse me, I need to finish the normalization:

                      = 0.000000*10^7
                      = 0.000000*10^8
                        .
                        .
                      = 0.000000*10^17

Now I can add the two numbers:

          1.000000*10^17
        + 0.000000*10^17
        ----------------
          1.000000*10^17

Thus, I find that 1*10^17 plus 1 is precisely 1*10^17. Being limited to 7 digits of precision, what else could I do? Stata did the equivalent, but in binary.

What this means for Walter, for me, and for everybody, is that large numbers are subject to rounding! Or more correctly, large numbers are subject to rounding when stored in floating-point format.

Stata has another format, integer format. -byte-, -int-, and -long- are Stata's integer formats. -byte- has 8 binary digits, -int- has 16 binary digits, and -long- has 32 binary digits. The integer formats are not subject to rounding, but they are subject to underflow and overflow. A -byte-, for instance, can store a number between -127 and 100. Numbers smaller than -127 or larger than 100 can't be stored in a byte.

Faced with storing a number outside the range in a byte, Stata sometimes stores a missing value (.), and sometimes upgrades the storage format to the next wider integer type, in this case, -int-. Stata stores missing when it takes the storage type you specify seriously, as in -infile-, and Stata rolls up to the next integer type for other commands, such as -replace-, -append-, and -merge-.
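The two overflow behaviours can be sketched outside Stata. The following is a rough Python model, not Stata's implementation: the ranges are the ones Stata documents for -byte-, -int-, and -long-, the function names are hypothetical, and None stands in for a missing value.

```python
# Ranges Stata documents for its integer storage types.
STATA_INT_RANGES = [
    ("byte", -127, 100),
    ("int", -32_767, 32_740),
    ("long", -2_147_483_647, 2_147_483_620),
]


def narrowest_integer_type(value):
    """Model of "rolling up": name of the narrowest type that holds value.

    Returns None when value lies outside all three integer ranges.
    """
    for name, low, high in STATA_INT_RANGES:
        if low <= value <= high:
            return name
    return None


def store_strict(value, storage_type):
    """Model of taking the storage type seriously.

    An out-of-range value becomes None, standing in for missing (.).
    """
    for name, low, high in STATA_INT_RANGES:
        if name == storage_type:
            return value if low <= value <= high else None
    raise ValueError("unknown storage type: " + storage_type)


if __name__ == "__main__":
    print(narrowest_integer_type(100))   # byte
    print(narrowest_integer_type(101))   # int
    print(store_strict(101, "byte"))     # None
```

The sketch only says which integer type is wide enough; it makes no claim about what Stata does once a value exceeds the -long- range.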
A -byte- stores -127 to 100, without rounding.

An -int- stores -32,767 to 32,740, without rounding.

A -long- stores -2,147,483,647 to 2,147,483,620, without rounding.

-float- and -double- are Stata's floating-point types. They too have ranges, roughly +/-10^38 for -float- and +/-10^308 for -double-. Those ranges are so large, you can ignore them. However, within the range, a fixed number of digits is allocated to record the significand. The two formats are binary, with 24 binary digits for -float- and 53 for -double-.

Attempting to quantify those numbers in base 10 is easily subject to a misunderstanding. When you think of a base-10 number, and I tell you it's recorded to four digits, you assume the fifth digit could be any of 0 through 9. But these formats are binary. If I tell you that a binary number is recorded to four digits, you must assume the fifth digit can only be 0 or 1. So the first thing you will observe when a number exceeds binary precision is a rounding to evenness! Next you see a rounding to multiples of 4, and so on.

Anyway, let's try with float to find the precision for ourselves. Because of rounding to evenness, it's important we seek an odd number so we'll be able to see the rounding when it first occurs:

        . display %12.0g float(1000000+1)
        1000001

        . display %12.0g float(10000000+1)
        10000001

        . display %12.0g float(100000000+1)
        100000000

There it is, 9 digits. Actually, 9 digits overstates the accuracy, because numbers are stored in binary. What we really discovered is that the first number that has the rounding problem is somewhere between 10,000,000 and 100,000,000. A more refined search shows that the problem first appears at 16,777,216:

        . display %12.0g float(16777214+1)
        16777215

        . display %12.0g float(16777216+1)
        16777216

Thus, the answer is that 16,777,215 is the largest number not subject to rounding.
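The same experiment can be repeated outside Stata. Here is a minimal Python sketch, assuming that rounding a value to an IEEE 754 single-precision number behaves like Stata's float(); to_float32 and first_rounded_candidate are hypothetical helpers written for this illustration, and the search tries only odd candidates of the form 2^k + 1.

```python
import struct


def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    # Packing as "f" rounds the double to 32 bits; unpacking reads it back.
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def first_rounded_candidate(max_power=40):
    """Return the first 2**k + 1 that does not survive storage as a float.

    Only odd candidates of the form 2**k + 1 are tried, so that rounding
    to evenness is visible as soon as it happens.
    """
    for k in range(1, max_power):
        candidate = 2**k + 1
        if to_float32(candidate) != candidate:
            return candidate
    return None


if __name__ == "__main__":
    for n in (1000001, 10000001, 100000001, 16777215, 16777217):
        print(n, "->", int(to_float32(n)))
    # 1000001 -> 1000001
    # 10000001 -> 10000001
    # 100000001 -> 100000000
    # 16777215 -> 16777215
    # 16777217 -> 16777216
    print(first_rounded_candidate())  # 16777217

    # The y variable from the start of the post: ten sums, one stored value.
    print(len({to_float32(1e17 + x) for x in range(1, 11)}))  # 1
```

The last line mirrors what -codebook- reported: all ten values of 100000000000000000 + x collapse to a single float.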
We could have figured that out directly when I first mentioned that -float- assigns 24 binary digits to the significand: (2^24)-1 is the largest number that can be recorded when all the digits are 1, and (2^24)-1 = 16,777,215.

How many base-10 digits is that? Most computer scientists would say 7.22 digits, 7.22 being the log base 10 of 16,777,215.

For -double-, 53 binary digits are assigned to the significand, and thus the largest integer-valued double not subject to rounding is (2^53)-1 = 9,007,199,254,740,991, which means 15.95 digits.

-- Bill
wgould@stata.com

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
