Robert Ostling asked about using -ksmirnov- with discrete data when performing
a two-sample Kolmogorov-Smirnov test. Ben Jann <ben.jann@gmail.com> also
commented on performing the one-sample Kolmogorov-Smirnov test with discrete
data.
The methods used by -ksmirnov- for both the one- and two-sample tests
were derived for data from continuous distributions.
Ben referenced two articles that discuss a way to perform a one-sample
Kolmogorov-Smirnov test when you are interested in comparing data to a
discrete theoretical distribution. When making a comparison of this type, the
test statistic should be computed using the method Ben describes rather than
the method that -ksmirnov- uses. Currently, no command implements this test,
although this is something we are looking into adding.
There has also been some discussion regarding the use of the -ksmirnov-
command when ties exist in the data. Theoretically, no ties should exist when
data is sampled from a continuous distribution, but, in practice, this is not
necessarily true. The test statistic that is produced by -ksmirnov- is still
correct when ties exist in a dataset that we wish to compare to a continuous
theoretical distribution. However, if there are a large number of ties, the
approximate p-value that is reported may not be appropriate. In the latest
update, a note was added to -ksmirnov- to inform the user of the number of
ties that exist in the dataset.
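As a rough illustration of checking a sample for ties before relying on the
approximate p-value, here is a minimal Python sketch. The counting convention
(every observation whose value occurs more than once counts as tied) is an
assumption made for illustration and is not necessarily how -ksmirnov- counts
ties in its note. The sample is the one used in the example further down.

```python
from collections import Counter

# Sample used in the worked example in this post (n = 10).
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 5]

# Frequency of each distinct value.
counts = Counter(data)

# Values that occur more than once, with their frequencies.
tied_values = {value: freq for value, freq in counts.items() if freq > 1}

# Assumed convention: an observation is "tied" if its value is shared.
n_tied_obs = sum(tied_values.values())

print(tied_values)  # {4: 6}
print(n_tied_obs)   # 6
```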
Gibbons and Chakraborti (2003, 121) give the following formula for the test
statistic D for the one-sample Kolmogorov-Smirnov test:

    D = sup_x |S(x) - F(x)| = max_x max[ |S(x) - F(x)|, |S(x-e) - F(x)| ]

where S(x) is the empirical distribution function, F(x) is the hypothesized
distribution function, and e is a small positive number. They also mention
that the formula applies even when ties are present.
Using the example that Ben gave, this would be as follows:

    x    S(x)    F(x)    S(x)-F(x)    S(x-e)-F(x)
    1     .1      .2        -.1           -.2
    2     .2      .4        -.2           -.3
    3     .3      .6        -.3           -.4
    4     .9      .8         .1           -.5
    4     .9      .8         .1           -.5
    4     .9      .8         .1           -.5
    4     .9      .8         .1           -.5
    4     .9      .8         .1           -.5
    4     .9      .8         .1           -.5
    5    1       1          0             -.1

Therefore, D = .5. This is equivalent to the result that is reported by
-ksmirnov-. However, Ben's data was intended to be compared to a discrete
distribution, so a test for discrete data would be more suitable.
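
The calculation in the table above can be checked with a short script. This
is only a sketch in Python, not the code that -ksmirnov- runs. The function
name ks_one_sample_d is made up for illustration, and F(x) = x/5 is an
assumption chosen because it reproduces the F(x) column of the table. Exact
fractions are used so that the result is not affected by rounding.

```python
from fractions import Fraction


def ks_one_sample_d(data, cdf):
    """Return max over observed x of max(|S(x) - F(x)|, |S(x-e) - F(x)|)."""
    n = len(data)
    d = Fraction(0)
    for x in sorted(set(data)):
        # S(x): proportion of observations less than or equal to x.
        s_at = Fraction(sum(1 for v in data if v <= x), n)
        # S(x-e): proportion of observations strictly less than x.
        s_before = Fraction(sum(1 for v in data if v < x), n)
        f = cdf(x)
        d = max(d, abs(s_at - f), abs(s_before - f))
    return d


# Sample consistent with the S(x) column: n = 10, six observations equal to 4.
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 5]


# Assumption: F(x) = x/5, which matches the F(x) column in the table.
def assumed_cdf(x):
    return Fraction(x, 5)


d_stat = ks_one_sample_d(data, assumed_cdf)
print(float(d_stat))  # 0.5
```

The maximum is reached at x = 4, where S(x-e) = .3 and F(x) = .8, which
agrees with D = .5 above.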
Gibbons, J. D., and S. Chakraborti. 2003. Nonparametric Statistical
Inference. 4th ed. New York: Marcel Dekker.
--Kristin
kmacdonald@stata.com