Statalist The Stata Listserver


st: re: ksmirnov

From   David Airey <david.airey@Vanderbilt.Edu>
Subject   st: re: ksmirnov
Date   Mon, 14 May 2007 21:09:12 -0500

Note that this test assumes random sampling, that is, independent observations. I had someone attempting it with multiple measures within animals, which violates that assumption.

While I'm on this topic, I found notes on Statalist saying that the implementation of -ksmirnov- is faulty for discrete data. As far as I can tell, it has not been fixed, despite Stas and Ben asking:

consider the following small fragment:

set obs 20
g z = invnorm(uniform())
sort z
g cdf = _n/20
ksmirnov z = cdf
ksmirnov z = norm(z)
* those should give insignificant differences, as both are true
* distributions at this moment

expand 50
* now I made this a discrete distribution with 50 points at each of 20
* point masses
ksmirnov z = norm(z)
* this one is rejected: the distribution is not normal anymore
ksmirnov z = cdf
* but so is this one, with the difference between the empirical cdf
* and the theoretical one being evaluated as 0.05!

I think the issue is that in the code of -ksmirnov- (and it is not too
difficult to locate) there are lines that look essentially like

sort x
gen cdf = _n/_N

which is not quite appropriate for discrete data. Something like

bysort x (cdf): replace cdf = cdf[_N]

should be added, so that the cdfs do look like Prob[X <= x], and the
above problem would be solved.
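
To make the difference concrete, here is a minimal sketch on toy data with four point masses. It is not the actual -ksmirnov- source; the variable names (x, F0, ecdf_naive, ecdf_tied, d_naive, d_tied) are made up for illustration, and the hypothesized cdf F0 is simply assumed to be x/4.

```stata
* Sketch only: naive vs. tie-corrected empirical cdf on tied data.
* All variable names here are illustrative, not taken from -ksmirnov-.
clear
set obs 20
generate double x  = ceil(_n/5)      // four point masses: 1, 2, 3, 4 (5 obs each)
generate double F0 = x/4             // assumed hypothesized cdf: Prob[X <= x] = x/4

sort x
generate double ecdf_naive = _n/_N   // keeps rising inside each block of ties
generate double ecdf_tied  = ecdf_naive
* give every tied observation the value reached at the last tie
bysort x (ecdf_tied): replace ecdf_tied = ecdf_tied[_N]

generate double d_naive = abs(ecdf_naive - F0)
generate double d_tied  = abs(ecdf_tied  - F0)
summarize d_naive, meanonly
display "max |ecdf - F0|, naive:         " r(max)
summarize d_tied, meanonly
display "max |ecdf - F0|, tie-corrected: " r(max)
```

With the naive cdf the largest gap is about 0.2, even though the data match F0 exactly at every point mass; after the bysort step the gap is 0.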

In September 2006, Stas Kolenikov pointed out that the Stata
implementation of the Kolmogorov-Smirnov test (-ksmirnov-) produces
flawed results on discrete data and that this could be fixed easily.

As far as I can see, there has been no reaction to Stas' message on
statalist and, although the date of the current -ksmirnov- version is
19dec2006, the program still does not seem to produce correct results
nor does it issue a warning if ties are encountered.

I came across this -ksmirnov- issue today while trying to use the
program in some analyses. After being puzzled for a while, I realized
what the problem was and then did a Statalist search. Although I now
have my own adaptation of the program to serve my needs, I would really
like to see -ksmirnov- handle discrete data correctly or, at least,
issue a warning or error message. Chances are really high that one
does not notice the problem if there is only a moderate number of
ties in the data.
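
Until such a warning exists, a check along the following lines could be run by hand before calling -ksmirnov-. This is only a sketch: the variable name z and the flag variable tied are hypothetical, and the wording of the message is my own, not anything Stata prints.

```stata
* Sketch only: count tied values before calling -ksmirnov-.
* z and tied are hypothetical names; the toy data are heavily tied on purpose.
clear
set obs 20
generate z = ceil(_n/5)                   // four distinct values, 5 obs each
bysort z: generate byte tied = (_N > 1)   // 1 if the value occurs more than once
quietly count if tied
if r(N) > 0 {
    display as error "warning: " r(N) " observations share a value with another observation"
}
```

Here every observation is tied, so the message reports 20 observations; on continuous data the count would normally be 0 and nothing is printed.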


And, by the way, it would be a piece of cake to make the program work
with fweights.
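
As a rough illustration of why frequency weights fit in naturally, the empirical cdf can be built from a running sum of the weights. This is a sketch under my own assumptions, not a patch to -ksmirnov-: x holds the values, w holds the frequencies, and cumw and ecdf_fw are made-up names.

```stata
* Sketch only: empirical cdf from frequency weights.
* x = value, w = frequency of that value (hypothetical names and toy data).
clear
input x w
1 5
2 5
3 5
4 5
end
sort x
generate double cumw    = sum(w)            // running total of frequencies
generate double ecdf_fw = cumw/cumw[_N]     // Prob[X <= x] under the weights
* if a value of x appears on several rows, use the cdf reached at its last row
bysort x (cumw): replace ecdf_fw = ecdf_fw[_N]
list x w ecdf_fw
```

This gives the same cdf values (0.25, 0.5, 0.75, 1) as expanding the data to 20 observations and applying the tie correction.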

Can't we get this one fixed or what??
