# Re: st: Goodness of fit tests for continuous data using Stata

From: Maarten Buis | To: statalist@hsphsun2.harvard.edu | Subject: Re: st: Goodness of fit tests for continuous data using Stata | Date: Tue, 5 Oct 2010 07:51:47 +0100 (BST)

```
--- On Mon, 4/10/10, Earley, Joseph wrote:
> Does Stata have a module which allows for testing whether
> or not a variable follows distributions such as the
> uniform, exponential, weibull etc.
>
> In particular,  I would like to test whether or not a
> variable follows a uniform probability distribution using
> Stata.

Tests exist, but they are not very powerful. So you are
unlikely to detect deviations from your theoretical
distribution when you should. This is a limitation of
statistics, not of Stata.

The preferred method is not to test but to graph. Two
graphs can be particularly useful here: First, the
hanging rootogram as implemented in -hangroot-, as it
allows you to include confidence intervals. That way
you can still have something resembling a test. Second,
the quantile plot as implemented in -quantile-. This
gives you a very direct view of the data. This
can for example be useful for spotting ties, which
are often the reason for deviation from a uniform
distribution.

-hangroot- is a user-written program, and can be downloaded
by typing in Stata -ssc install hangroot-. -quantile- is
part of official Stata. I like to use the -aspect(1)-
option for -quantile- as the logic of this graph is that
the observations should lie on the 45 degree line. By
forcing the aspect ratio of the graph to be 1, the 45
degree line is really a 45 degree line. Leaving this option
out is not wrong, but I think adding it leads to a visually
clearer picture.

As I said before, we can do a test, but this test is not
very powerful. In Stata we use the -ksmirnov- command for
that. For that we need the cumulative distribution function
(CDF) of our theoretical distribution. The CDF of a
uniformly distributed variable is (x - a)/(b - a) if it
ranges between a and b. In the example below we test
percentile rank scores of the variable, so a = 0 and b = 1,
and the CDF of x is x.

*------------------- begin example -------------------
sysuse auto, clear

// create percentile rank score of mpg
// this should be uniformly distributed
// unless the ties are too severe
egen n = count(mpg)
egen i = rank(mpg)
gen hazen = (i - 0.5) / n
drop n i

// a suspended rootogram with confidence intervals
hangroot hazen , dist(uniform) susp notheor ci ///
name(hangroot, replace)

// a quantile plot, no confidence interval but
// better for spotting ties
quantile hazen , aspect(1) name(quantile, replace)

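// test, but not very powerful
// the expression after "=" is the theoretical CDF; for a
// uniform on [0, 1] the CDF of hazen is hazen itself
ksmirnov hazen = hazen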
*------------------ end example -----------------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------

```
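
The post gives the CDF of a uniform variable on [a, b] as (x - a)/(b - a), but its example only covers the special case a = 0 and b = 1. The following is a minimal sketch of the general case, not part of the original post. The data are simulated, and the bounds a = 2 and b = 5, the seed, and the sample size are all assumptions chosen purely for illustration.

```stata
* Sketch: -ksmirnov- against a uniform distribution on a known
* interval [a, b]. Simulated data; a = 2 and b = 5 are assumed
* values for illustration only.
clear
set seed 12345
set obs 200

local a = 2
local b = 5

// draw x uniformly between a and b
gen x = `a' + (`b' - `a') * runiform()

// the expression after "=" is the theoretical CDF evaluated at x
ksmirnov x = (x - `a') / (`b' - `a')

// quantile plot against the uniform, with a true 45 degree line
quantile x, aspect(1)
```

Note that a and b should be fixed in advance. If they are estimated from the same data (for example as the observed minimum and maximum), the p-value reported by -ksmirnov- is no longer valid.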