Statalist The Stata Listserver



st: re: ksmirnov


From   David Airey <[email protected]>
To   [email protected]
Subject   st: re: ksmirnov
Date   Mon, 14 May 2007 21:09:12 -0500

I have found that this test assumes random sampling, that is, independent observations. I had someone attempting it with multiple measures within animals, which are not independent.


While I'm on this topic, I found a note in the Statalist archive saying that the implementation of -ksmirnov- is faulty for discrete data. As far as I can tell, it has not been fixed, despite both Stas and Ben asking:



consider the following small fragment:

set obs 20
g z = invnorm(uniform())
sort z
g cdf = _n/20
ksmirnov z = cdf
ksmirnov z = norm(z)
* those should give insignificant differences, as both are true
* distributions at this moment

expand 50
* now I made this a discrete distribution with 50 points at each of 20
* point masses
ksmirnov z = norm(z)
* this one is rejected: the distribution is not normal anymore
ksmirnov z = cdf
* but so is this one, with the difference between the empirical cdf
* and the theoretical one being evaluated as 0.05!

I think the issue is that in the code of -ksmirnov- (and it is not too
difficult to locate) there are lines that look essentially like

sort x
gen cdf = _n/_N

which is not quite appropriate for discrete data. Something like

bysort x (cdf): replace cdf = cdf[_N]

should be added, so that the cdfs do look like Prob[X <= x], and the
above problem would be solved.
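
To make the tie problem concrete, here is a minimal sketch of the correction described above. It is not the actual -ksmirnov- source: the variable names ecdf_naive, ecdf_ties and gap are made up for illustration, the seed is arbitrary, and invnormal()/runiform() are used as the current names for invnorm()/uniform().

```stata
* Sketch only: compare a rank-based ECDF with a tie-aware one
clear
set seed 12345                          // arbitrary seed, for reproducibility
set obs 20
generate z = invnormal(runiform())
sort z
generate double cdf = _n/20             // step CDF of the 20 point masses
expand 50                               // 50 copies of each point: heavy ties

* naive version: every observation gets its own rank, ties ignored
sort z
generate double ecdf_naive = _n/_N

* tie-aware version: all tied observations share the largest rank,
* so the value is Prob[X <= x]
bysort z (ecdf_naive): generate double ecdf_ties = ecdf_naive[_N]

* the naive gap peaks at 49/1000 = 0.049 (first copy within each group);
* the tie-aware ECDF coincides with cdf
generate double gap = abs(ecdf_naive - cdf)
summarize gap
```

With the tie-aware version the distance to cdf is zero, which is what one would expect when testing a discrete sample against its own distribution.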


In September 2006, Stas Kolenikov pointed out that the Stata
implementation of the Kolmogorov-Smirnov test (-ksmirnov-) produces
flawed results on discrete data and that this could be fixed easily
(http://www.stata.com/statalist/archive/2006-09/msg00483.html).

As far as I can see, there has been no reaction to Stas' message on
Statalist and, although the date of the current -ksmirnov- version is
19dec2006, the program still does not seem to produce correct results,
nor does it issue a warning if ties are encountered.

I came across this -ksmirnov- issue today while trying to use the
program in some analyses. After being puzzled for a while I realized
what the problem was and then did a Statalist search. Although I now
have my own adaptation of the program to serve my needs, I would really
like to see -ksmirnov- handle discrete data correctly or, at least,
issue a warning or error message. Chances are really high that one
does not notice the problem if there are only a moderate number of
ties in the data.

ben

And, by the way, it would be a piece of cake to make the program work
with fweights.
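
One way frequency weights could enter is sketched below, assuming each row holds a distinct value x together with an integer count f (both names, and the toy data, are hypothetical). This is only an illustration of the idea, not how -ksmirnov- is or would be implemented.

```stata
* Sketch only: ECDF from frequency-weighted data
clear
input x f
1 3
2 5
4 2
end

sort x
generate double cumf = sum(f)           // running total of the frequencies
generate double ecdf = cumf/cumf[_N]    // Prob[X <= x] under the weights
list x f ecdf
```

Because each distinct value appears once with its count, the ties are handled by construction and the result matches what -expand f- followed by the tie-aware calculation would give.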

Can't we get this one fixed or what??


-Dave