# Re: st: Ksmirnov discrete data (again)

From: kmacdonald@stata.com (Kristin MacDonald, StataCorp)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Ksmirnov discrete data (again)
Date: Fri, 15 Jun 2007 17:19:50 -0500

```
Robert Ostling asked about using -ksmirnov- with discrete data when performing
a two-sample Kolmogorov-Smirnov test.  Ben Jann <ben.jann@gmail.com> also
commented on performing the one-sample Kolmogorov-Smirnov test with discrete
data.

The methods used by -ksmirnov- for both the one-sample and two-sample tests
were derived for data from continuous distributions.

Ben referenced two articles that discuss a way to perform a one-sample
Kolmogorov-Smirnov test when you are interested in comparing data to a
discrete theoretical distribution.  When making a comparison of this type, the
test statistic should be computed using the method Ben describes rather than
the method that -ksmirnov- uses.  Currently, no command implements this test,
although this is something we are looking into adding.

There has also been some discussion regarding the use of the -ksmirnov-
command when ties exist in the data.  In theory, no ties should occur when
data are sampled from a continuous distribution, but in practice they
sometimes do.  The test statistic produced by -ksmirnov- is still correct
when ties exist in a dataset that we wish to compare to a continuous
theoretical distribution.  However, if there are a large number of ties, the
approximate p-value that is reported may not be appropriate.  In the latest
update, a note was added to -ksmirnov- to report the number of ties in the
dataset.

Gibbons and Chakraborti (2003, 121) give the following formula for the test
statistic D for the one-sample Kolmogorov-Smirnov test

D = sup_x |S(x) - F(x)| = max_x [|S(x) - F(x)|, |S(x-e) - F(x)|]

where e is a small positive number.  They also mention that it applies even in
the case when ties are present.

Using the example that Ben gave, this would be as follows

x    S(x)   F(x)   S(x)-F(x)   S(x-e)-F(x)
1    .1     .2     -.1         -.2
2    .2     .4     -.2         -.3
3    .3     .6     -.3         -.4
4    .9     .8      .1         -.5
4    .9     .8      .1         -.5
4    .9     .8      .1         -.5
4    .9     .8      .1         -.5
4    .9     .8      .1         -.5
4    .9     .8      .1         -.5
5    1      1       0          -.1

Therefore, D = .5, the largest absolute value in the last two columns.  This
matches the result reported by -ksmirnov-.  However, Ben's data were intended
to be compared to a discrete distribution, so a test for discrete data would
be more suitable.

Gibbons, J. D., and S. Chakraborti.  2003.  Nonparametric Statistical
Inference.  4th ed.  New York: Marcel Dekker.

--Kristin
kmacdonald@stata.com
```
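As a rough check of the worked table in the post, here is a minimal Python sketch of the one-sample statistic D = max[|S(x) - F(x)|, |S(x-e) - F(x)|] evaluated at each distinct observed value. It is not Stata code and not the -ksmirnov- implementation. The function name `ks_one_sample_d` is hypothetical, and the reference CDF F(x) = x/5 is an assumption inferred from the F(x) column of the table.

```python
from collections import Counter


def ks_one_sample_d(data, cdf):
    """Return (D, rows) for the one-sample Kolmogorov-Smirnov statistic.

    D is the maximum over distinct observed x of
    max(|S(x) - F(x)|, |S(x-e) - F(x)|), where S is the empirical CDF
    and S(x-e) is its value just to the left of x.
    """
    n = len(data)
    counts = Counter(data)
    below = 0      # number of observations strictly less than x
    d = 0.0
    rows = []
    for x in sorted(counts):
        s_left = below / n        # S(x - e): empirical CDF just before x
        below += counts[x]        # ties at x all enter the step at once
        s_at = below / n          # S(x): empirical CDF at x
        f = cdf(x)
        rows.append((x, s_at, f, s_at - f, s_left - f))
        d = max(d, abs(s_at - f), abs(s_left - f))
    return d, rows


# Data from the example: the value 4 is tied six times.
data = [1, 2, 3, 4, 4, 4, 4, 4, 4, 5]

# Assumed reference CDF, inferred from the table (F = .2, .4, .6, .8, 1).
d, rows = ks_one_sample_d(data, lambda x: x / 5)

for x, s, f, diff, diff_left in rows:
    print(f"{x}  {s:.1f}  {f:.1f}  {diff:+.1f}  {diff_left:+.1f}")
print(f"D = {d:.1f}")
```

The sketch lists one row per distinct value rather than one per observation, so the six tied observations at x = 4 collapse into a single row. The maximum is reached there, from the left-hand limit S(4-e) - F(4) = .3 - .8.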