Consider the following small fragment:
set obs 20
g z = invnorm(uniform())
sort z
g cdf = _n/20
ksmirnov z = cdf
ksmirnov z = norm(z)
* those should give insignificant differences, as both are true
* distributions at this moment
expand 50
* now I made this a discrete distribution with 50 points at each of 20
* point masses
ksmirnov z = norm(z)
* this one is rejected: the distribution is not normal anymore
ksmirnov z = cdf
* but so is this one, with the difference between the empirical cdf
* and the theoretical one being evaluated as 0.05!
I think the issue is that in the code of -ksmirnov- (and it is not too
difficult to locate) there are lines that look essentially like
sort x
gen cdf = _n/_N
which is not quite appropriate for discrete data. Something like
bysort x (cdf): replace cdf = cdf[_N]
should be added, so that the cdfs do look like Prob[X <= x], and the
above problem would be solved.
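
To make the effect of that extra line concrete, here is a minimal sketch
that rebuilds the example above and compares the naive empirical cdf with
a tie-aware one. It is only an illustration and is not taken from the
-ksmirnov- source; the variable names ecdf_naive, ecdf_tied, d_naive and
d_tied, as well as the seed, are made up for this example.

```stata
* Sketch only: naive versus tie-aware empirical cdf on the expanded data.
* All variable names below are hypothetical, chosen for illustration.
clear
set obs 20
set seed 12345
generate z = invnorm(uniform())
sort z
* reference cdf, as in the fragment above
generate cdf = _n/20
* 50 copies of each of the 20 values
expand 50

* naive empirical cdf: tied observations receive different values
sort z
generate ecdf_naive = _n/_N

* tie-aware empirical cdf: each tied observation receives Prob[Z <= z],
* i.e. the largest naive value within its group of equal z
generate ecdf_tied = ecdf_naive
bysort z (ecdf_tied): replace ecdf_tied = ecdf_tied[_N]

* absolute gap between each empirical cdf and the reference cdf
generate d_naive = abs(ecdf_naive - cdf)
generate d_tied  = abs(ecdf_tied - cdf)
summarize d_naive d_tied
```

In the -summarize- output, the maximum of d_naive should come out just
under 0.05 (the spread of 50 tied rows out of 1000), while the maximum of
d_tied should be essentially zero, which is the behaviour one would want
from the test here.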
In September 2006, Stas Kolenikov pointed out that the Stata
implementation of the Kolmogorov-Smirnov test (-ksmirnov-) produces
flawed results on discrete data and that this could be fixed easily
(http://www.stata.com/statalist/archive/2006-09/msg00483.html).
As far as I can see, there has been no reaction to Stas' message on
statalist and, although the date of the current -ksmirnov- version is
19dec2006, the program still does not seem to produce correct results
nor does it issue a warning if ties are encountered.
I came across this -ksmirnov- issue today while trying to use the
program in some analyses. After being puzzled for a while I realized
what the problem was and then did a Statalist search. Although I now
have my own adaptation of the program to serve my needs, I would really
like to see -ksmirnov- handle discrete data correctly or, at least,
issue a warning or error message. Chances are really high that one
does not notice the problem if there is only a moderate number of
ties in the data.
ben
And, by the way, it would be a piece of cake to make the program work
with fweights.
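
As a rough sketch of what frequency-weight support could look like (an
assumption about one possible approach, not the actual -ksmirnov- code),
the empirical cdf can be built from a running sum of the weights and then
made tie-aware in the same way as above. The variables x, w and ecdf and
the toy data are hypothetical.

```stata
* Sketch only: empirical cdf with frequency weights.
* x is the variable of interest and w its integer frequency;
* both are made-up names with made-up values.
clear
input x w
1 3
2 5
2 2
4 10
end

sort x
* running sum of the frequencies, then scale by the total
generate double ecdf = sum(w)
replace ecdf = ecdf/ecdf[_N]
* give tied values of x the same value, Prob[X <= x]
bysort x (ecdf): replace ecdf = ecdf[_N]
list
```

With these toy data the result should match what one would get by first
running -expand w- and then computing the tie-aware cdf on the expanded
rows.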