 Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Algebra problem

From: wgould@stata.com (William Gould, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Algebra problem
Date: Wed, 21 Apr 2010 11:17:28 -0500

Tunga Kantarci <tungakantarci@hotmail.com> asks,

> Why does Stata 11 give me different counts for the following ?
>
>       . count if DU1_65_R3C3_5_==1.1*DU_default_income
>       51
>
>       . count if DU1_65_R3C3_5_/DU_default_income==1.1
>       308

Stas Kolenikov <skolenik@gmail.com> replied,

> Wrap everything in float() and see if the numbers change; [...]

Good, if brief, answer.  To be explicit, Stas wants Tunga to type

. count if float(DU1_65_R3C3_5_) == float(float(1.1)*float(DU_default_income))

and

. count if float(float(DU1_65_R3C3_5_)/float(DU_default_income)) == float(1.1)

although if the two variables DU1_65_R3C3_5_ and DU_default_income are already
stored as -float-, typing float(DU1_65_R3C3_5_) and
float(DU_default_income) is not necessary because they already are float,
so the above reduces to

. count if DU1_65_R3C3_5_ == float(float(1.1)*DU_default_income)

. count if float(DU1_65_R3C3_5_/DU_default_income) == float(1.1)

There are lots of issues here, and Stas went after one or two of them.
The issues are:

1.  Tunga thinks in base 10; Stata thinks (and calculates) in
binary.  Decimal number 1.1 has no exact representation in
binary; it is 1.000110011001100110011001...

2.  Even in binary, Stata does not do infinite-precision arithmetic.
It rounds binary 1.000110011001100110011001... to
1.000110011001100110011001100110011001100110011001100.
Finite-precision rounding applies to the values stored
in variables themselves, too.

3.  Even if Stata did all calculations in decimal, and even if Stata
used infinite precision, Tunga never wanted
to count DU1_65_R3C3_5_==1.1*DU_default_income or to count
whether their ratio was 1.1 because Tunga already knew the answer.
If the two variables are reals, the chance of their ratio
being any single exact real is 0.
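
Issues 1 and 2 can be seen directly from the command line.  What follows is a
minimal sketch, not part of the original exchange.  It assumes only Stata's
%21x hexadecimal display format and the float() function, and it shows that
the literal 1.1 (held as a double) and float(1.1) are two different binary
numbers, neither of which is the decimal 1.1.

```stata
* Show the binary (hexadecimal) form of 1.1 as a double and as a float.
display %21x 1.1
display %21x float(1.1)

* Show the same two values in decimal with many digits.
display %20.18f 1.1
display %20.18f float(1.1)

* The two roundings are not equal: this displays 0.
display 1.1 == float(1.1)
```

The last line is the heart of the matter when a float variable is compared
with a double literal.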

"Wait!" Tunga says concerning (3).  "These are incomes and are recorded
in dollars and cents.  Cute point you're making about reals, but my numbers
have only two digits after the decimal point.  Moreover, the distribution of
the two numbers is not rectangular; it's humped, and that increases the
chances even more."

Okay; I just did one simulation.
I drew 100,000 incomes from N(25000, 2000).
Then I drew another 100,000 incomes from the same distribution.
The incomes are uncorrelated.
I rounded incomes to two digits to the right of the decimal point.
I worked in infinite precision.
I worked in base 10.
I counted the number of cases in which the ratio is exactly 1.1.
The count was 0.
I'll do it again if you would like.
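
One way such a simulation could be set up in Stata is sketched below.  This is
an assumption about the setup, not the code actually used: it reads
N(25000, 2000) as mean 25000 and standard deviation 2000, uses an arbitrary
seed, and sidesteps the infinite-precision requirement by holding incomes as
integer cents, so that the ratio is exactly 1.1 if and only if 10*a == 11*b.
The variable names a_cents and b_cents are hypothetical.

```stata
clear
set seed 12345
set obs 100000

* Incomes rounded to the cent, held as integer cents in doubles.
* Integers of this size are stored exactly, so no round-off is involved.
generate double a_cents = round(100*rnormal(25000, 2000))
generate double b_cents = round(100*rnormal(25000, 2000))

* a/b == 1.1 exactly  <=>  10*a == 11*b, an exact integer comparison.
count if 10*a_cents == 11*b_cents
```

With uncorrelated draws like these, the count should be 0 in most runs,
although the exact result depends on the seed.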

Before Tunga says "But my incomes are correlated!" which, I admit, increases
the chances that the exact ratio is 1.1, I will warn that still, counts
will be roughly 0.  It is true, if we pick the right parameters for the
distribution (make the ratio of means 1.1, for instance), we can drive
the chances of observing 1 or more ratios in 100,000 trials of 1.1 up,
but working in full, infinite precision, you will still be surprised
how rare a ratio of exactly 1.1 is.

The fact is that Tunga never wanted exactly 1.1.  Tunga wanted around 1.1.
Stas's solution to Tunga's problem was, "Perhaps, by around, you mean float
precision."  My response is that float precision is a narrow interval
indeed and that Tunga should think about what he means by around 1.1.

Let's assume that Tunga wants 1.1 +/- .0001.  Then Tunga should type

. gen ratio = DU1_65_R3C3_5_/DU_default_income

. count if ratio>=1.1-.0001 & ratio<=1.1+.0001

I chose to type that in two lines just to make the typing easier, and
to make what I'm doing more obvious to the reader.  Tunga can type
it however he wants.  I expressed the value as a ratio, but Tunga can
use whatever mathematically equivalent way of expressing it he desires.
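
As an illustration of the interval approach, here is a small self-contained
sketch on made-up data.  The variable names a and b and the four observations
are hypothetical, and inrange() is offered as one equivalent way of writing
the two-sided condition, not as the required one.

```stata
clear
input float(a b)
110     100
27500.5 25000
27503   25000
30000   25000
end

* Store the ratio as a double to keep as much precision as possible.
generate double ratio = a/b

* Exact equality picks up only the first observation (110/100).
count if ratio == 1.1

* The interval 1.1 +/- .0001 also picks up 27500.5/25000 = 1.10002.
count if inrange(ratio, 1.1 - .0001, 1.1 + .0001)
```

inrange(z, a, b) is true when a <= z <= b, and it is false when the ratio is
missing, so observations with missing incomes are not counted.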

But don't you still have to be careful, you might ask.  Aren't there still
decimal/binary issues?

Yes, there are binary issues, and yes, there are rounding issues, and yes,
there are even float vs. double issues.  Put them all together, and the
problem becomes very complicated.  But put them all together and we are still
talking round-off error.  We have enough precision so that the round-off error
will not matter.  More than enough precision.
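
To put rough numbers on "more than enough," the sketch below compares the
example tolerance of .0001 with the relative precision of the two storage
types.  It assumes the c-class values c(epsfloat) and c(epsdouble), which
report the machine epsilon for float and double.

```stata
* Relative precision of float and double storage.
display c(epsfloat)
display c(epsdouble)

* How many float-sized rounding steps fit inside the tolerance .0001.
display .0001/c(epsfloat)
```

The tolerance is hundreds of times larger than float round-off and many
orders of magnitude larger than double round-off, so an observation can be
misclassified only if its ratio sits almost exactly on an interval endpoint.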

We have had accuracy discussions on Statalist before.  This one, however, is
different because this time the finite-precision issues served merely to
uncover what was in fact a substantive issue.  Once Tunga defines
appropriately what he means by the income ratios being roughly 1.1, he will find
that the finite-precision issues will shrink to unimportance.  The precision
that we do have is more than adequate once the problem is properly defined.

Tunga's problem, count ratio==1.1, is very different from the usual precision
and binary/base-10 issues we have discussed on the list, such as counting
income == 24239.12.  For that problem, the right solution, and the required
solution when the variable is stored as float, is income == float(24239.12).
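
A minimal sketch of that usual case, on a one-observation made-up dataset
(the variable name income is taken from the sentence above; the dataset itself
is hypothetical):

```stata
clear
set obs 1
generate float income = 24239.12

* The literal 24239.12 is a double; the stored value was rounded to float.
* The two differ, so this counts 0 observations.
count if income == 24239.12

* Rounding the literal to float precision first makes the comparison fair.
* This counts 1 observation.
count if income == float(24239.12)
```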

In this case, I am asking Tunga not to type ratio == some_number.
I made the obvious point that ratio == some_number is a zero-probability
event even in the reals, and very near zero in most real cases, assuming we
make the calculations in the infinite-precision mathematical way.
Rather than type ratio == some_number, I want Tunga to think
in terms of some_number_1 <= ratio <= some_number_2, where the
numbers are chosen in some reasonable, population-meaningful way.
Once he does that, the precision issues shrink back into the background
where Tunga can ignore them.

-- Bill
wgould@stata.com