
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.



Re: st: Wrong results for Wilcoxon signed ranks test when data have decimal places (even using double)


From   Marta García-Granero <mgarciagranero@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Wrong results for Wilcoxon signed ranks test when data have decimal places (even using double)
Date   Thu, 14 Feb 2013 17:58:10 +0100

Since English is not my native tongue, I did not express myself very well. When I talked about rounding, I did not talk about rounding the original data, but applying the round() function to the absolute differences before ranking them.

Some time ago, while preparing some slides with Excel (just for classes, I NEVER use Excel for serious research), I found the same problem: some differences that should have been the same were in fact different (below the 15th decimal place) and got a wrong rank assigned. I discovered that ranking "round(absdiff,1e-15)" eliminated the problem, since the data were compared only up to the 15th decimal place and declared equal or different correctly. In another message, sent shortly before this one, I suggested that applying the same method in signrank.ado fixes the problem with wrong ranking (I tested it myself before posting).
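A minimal sketch of that rounding step in Stata, using four values from the copper data in the original post and the same hypothesised median of 0.6. The variable names ranks_raw, absdiff15 and ranks15 are illustrative assumptions and are not taken from signrank.ado; this only shows the idea of ranking rounded absolute differences, not how -signrank- is or should be implemented.

```stata
* Sketch only: -clear- drops the data in memory, so run this in a fresh session.
clear
input double copper
0.70
0.65
0.55
0.50
end

* Absolute differences from the hypothesised median, ranked as they are
generate double absdiff = abs(copper - 0.6)
egen double ranks_raw = rank(absdiff)

* The same differences rounded in units of 1e-15 before ranking
* (absdiff15 and ranks15 are illustrative names)
generate double absdiff15 = round(absdiff, 1e-15)
egen double ranks15 = rank(absdiff15)

* Show full precision so that any discrepancy between "equal" differences is visible
format absdiff absdiff15 %21.18f
list copper absdiff ranks_raw absdiff15 ranks15
```

If the rounding behaves as described, any pair of differences that disagree only beyond the 15th decimal place should receive the same (tied) rank in ranks15 even where ranks_raw separates them.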

Concerning SPSS, since their code is compiled and hidden, and more protected than Coke's formula, I can only guess from the Acrobat documentation and my hand calculations that they have somehow circumvented the problem of those nasty little differences below the 15th decimal place.

Maybe I was a bit too bold (having been a Stata user for only 2 months) in suggesting the modification of signrank.ado, but I am checking it with different datasets (from statistics books), and the results obtained with Stata, SPSS, and the ones shown in those books agree.

Regards,
MGG

On 14/02/2013 17:32, Nick Cox wrote:
Surprising though it may seem in the face of this carefully presented
evidence, I wouldn't call this a bug, at least not one that is
fixable.

It's an anomaly and it's awkward, but it's not a bug

First off, a look at the code for -signrank- suggests that Stata uses
-double- precision where possible, and that's as far as ado code goes.

It's an anomaly and it's awkward, but if it were a bug there would be
a solution and Marta's suggestion that there be "some rounding",
whatever that means precisely, does not sound like a good solution,
because how is StataCorp supposed to justify what rounding it does,
and how does that fit in with anybody else's idea of what the correct
procedure is, exactly and reproducibly? For example, which
authoritative accounts say you should apply some rounding first to get
reproducible results?

Also, Marta has a solid argument that when you have a rank procedure,
and data that come all presented to 2 decimal places, that you should
get exactly the same result when data are multiplied by 100 and become
integers. That's totally sound logic: the results of ranking are
invariant under multiplication of the originals by a positive
constant. But that's not the only consideration. The other
consideration is that people reasonably expect this test to be
applicable to non-integer data and so Stata's code has to work within
the constraints that implies.

The underlying fact, often rehearsed on this list, is that Stata does
not do, and does not claim to do, exact decimal arithmetic unless
there is an exact binary equivalent of that decimal calculation. So
the heart of the matter is that Stata will very occasionally give what
look like wrong answers to decimal problems, as in the case of

. di %21x  0.70 - 0.65
+1.9999999999990X-005

. di %21x  0.65 - 0.6
+1.99999999999a0X-005

Every smart child knows that the answers to these problems should be
the same, but they aren't when mapped to the nearest equivalent problems
in binary.
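The same point can be checked directly with -display-; this is a small sketch, not part of the original message. Given the hexadecimal output above, the equality test on the third line should display 0 (false). The last line applies the round(x, 1e-15) step that Marta describes and is expected to display 1 if that rounding removes the discrepancy as she reports.

```stata
* Both differences look like .05 in the default display format
display 0.70 - 0.65
display 0.65 - 0.6

* Compared as doubles, the two results are not equal
display (0.70 - 0.65) == (0.65 - 0.6)

* Compared after rounding in units of 1e-15
display round(0.70 - 0.65, 1e-15) == round(0.65 - 0.6, 1e-15)
```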

I can't comment on exactly what SPSS does; that's clearly pertinent too.

Nick

On Thu, Feb 14, 2013 at 4:02 PM, Marta García-Granero
<mgarciagranero@gmail.com> wrote:
Apologies for sending this twice, but yesterday I tried to piggyback into
another thread ("Rounding Errors Stata 12"), although closely related to
this question, and I think my question got lost. Besides, I'm going to
explain the problem a bit more (and better).

I'm converting some class notes (basic statistics) from SPSS to Stata, and I
have found that the way Stata ranks tied data in the Wilcoxon test can
sometimes be wrong when data have decimal places, even using -double-
everywhere.

The sample dataset comes from the on-line e-book Statistics at Square One
(exercise at the end of chapter 1). I am using Stata 12.1 64 bits (last
update installed) on W7, but I found the same problem with Stata 12.1 32
bits on Windows XP. The results I get using Stata don't match the ones I
got either with my hand calculations or with SPSS.

set type double
input copper
0.70
0.45
0.72
0.30
1.16
0.69
0.83
0.74
1.24
0.77
0.65
0.76
0.42
0.94
0.36
0.98
0.64
0.90
0.63
0.55
0.78
0.10
0.52
0.42
0.58
0.62
1.12
0.86
0.74
1.04
0.65
0.66
0.81
0.48
0.85
0.75
0.73
0.50
0.34
0.88
end

* One sample Wilcoxon's test (against population median = 0.6)

signrank copper = 0.6

* Multiply data by 100 to get rid of decimal places and run the test again
* (pop. median = 60)
* This time all the output (positive & negative sums of ranks, Z stat & p-value)
* is correct

generate copper100 = round(copper*100)
signrank copper100 = 60

* Generating the ranks for absolute differences between copper & pop median
* for both variables (copper & copper100)
* Ranks should have been the same in both cases, but they are not
* Notice the difference for cases 5/6/7, 18/19, 22/23/24, 29/30, 32/33
* "ranks2" is correct (recognizes all tied data), and leads to the right
* Wilcoxon's p-value

egen double ranks1 = rank(abs(copper-0.6))
egen double ranks2 = rank(abs(copper100-60))
generate absdiff = abs(copper-0.6)
sort absdiff
list absdiff ranks1 ranks2

I would label that as a Stata bug. Tied absolute differences are not
recognized as such because there is a difference at the 15th decimal place.
Maybe some rounding should be performed before assigning ranks.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index