st: distribution test


From   Maarten Buis <maartenlbuis@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: distribution test
Date   Tue, 30 Aug 2011 10:43:37 +0200

-- Lodewijk Smets wrote me privately:
> In Stata, I'd like to test the fit of my data with an exponential
> distribution. I'm thinking of using a Kolmogorov-Smirnov test
> (are there better alternatives?). Yet, the ks-test requires me
> to define lambda. I've noticed that you're the author of the
> -hangroot- command, where parameters are estimated (in
> order to compare an empirical distribution with a theoretical
> one). So I was wondering if there's a way to retrieve that
> estimation of the parameter, i.e. is it stored somewhere?

-hangroot- is for the rather special situation that you want to
compare the distribution of one variable with one univariate
distribution, i.e. there are no explanatory/x/right hand
side/independent variables. I am working on a generalization, but it
is not yet finished. There is however a deadline, as I will be
presenting it at the 2011 Nordic and Baltic Stata Users Group meeting
on Friday, November 11, 2011
(<http://www.stata.com/meeting/sweden11/>).  Moreover, the exponential
distribution often occurs with survival data, and -hangroot- is not
(and will not be) designed for survival data; in particular, it will
not handle right censoring.
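
To illustrate why right censoring matters, here is a minimal sketch. It
assumes a made-up setup: true lambda of 2, an arbitrary censoring time
of 0.5, and hypothetical variable names tstar, t, and d. With right
censoring the ML estimate of lambda is the number of observed failures
divided by the total observed time, not 1/mean. The sketch uses official
Stata's -stset- and -streg- rather than -hangroot-:

*------------- begin example ---------------
// exponential durations, censored at an arbitrary time 0.5
drop _all
set seed 12345
set obs 500
gen tstar = -1/2*ln(1-runiform())
gen byte d = tstar <= 0.5
gen t = min(tstar, 0.5)

// closed form: failures / total time at risk
sum d, meanonly
local fail = r(sum)
sum t, meanonly
di as txt "ML estimate of lambda: " as result `fail'/r(sum)

// naive 1/mean ignores the censoring
di as txt "1/mean of observed t:  " as result 1/r(mean)

// same ML estimate from -streg-
stset t, failure(d)
streg, distribution(exponential) nohr
di as txt "exp(_cons):            " as result exp(_b[_cons])
*--------------- end example ---------------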

Having stated those limitations, the estimate of lambda in the
univariate non-survival case is pretty easy as the maximum likelihood
estimate has a closed form solution: 1/mean. You could use -hangroot-
to recover this estimate: it is returned in r(lambda), but that is a
bit of overkill. It is probably easier to use -sum- to compute the mean
and transform that into the estimate of lambda. Below I have added
an example:

*------------- begin example ---------------
// create some exponential data
local lambda = 2
drop _all
set obs 500
gen y = -1/`lambda'*ln(1-runiform())

// estimate parameter
sum y, meanonly
local lambdahat = 1/r(mean)
di as txt "ML estimate of lambda is: " ///
   as result `lambdahat'

// ksmirnov test
ksmirnov y = 1-exp(-`lambdahat'*y)

// hanging rootogram
hangroot y, dist(exponential) ci
return list
*--------------- end example ---------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )
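
The closed form follows from the log likelihood of n independent
exponential observations, ln L(lambda) = n*ln(lambda) - lambda*sum(y).
Setting its derivative, n/lambda - sum(y), to zero gives
lambda = n/sum(y) = 1/mean. The sketch below checks this numerically on
simulated data. The seed and the 10% perturbations around the estimate
are arbitrary choices made for illustration only:

*------------- begin example ---------------
// simulate exponential data as in the example above
drop _all
set seed 12345
set obs 500
gen y = -1/2*ln(1-runiform())

// closed form ML estimate
sum y, meanonly
local n = r(N)
local sumy = r(sum)
local lambdahat = 1/r(mean)

// log likelihood at the estimate and at nearby values
foreach f in 0.9 1 1.1 {
    local l = `f'*`lambdahat'
    local ll = `n'*ln(`l') - `l'*`sumy'
    di as txt "lambda = " as result %6.4f `l' ///
       as txt "  log likelihood = " as result %9.3f `ll'
}
*--------------- end example ---------------

The log likelihood should be highest in the middle row, where lambda
equals 1/mean.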

If you have explanatory variables, the problem then is that there is
no longer one lambda but each observation has its own lambda. So the
marginal distribution of your explained/y/left hand side/dependent
variable no longer follows an exponential distribution but a mixture
of exponential distributions with different lambdas. To the best of my
knowledge no test has been implemented that will test the marginal
distribution of your explained variable against this mixture
distribution. As I said above I am working on a graphical comparison
of these two distributions, and I might add such a test for some
models. However, if I do so I will probably include a warning in the
helpfile not to use it for the following three reasons: 1) The
preferred outcome is that we cannot find a significant deviation from
the theoretical distribution. However, such non-significance only
indicates "absence of evidence", which should not be confused with
"evidence of absence", especially since such tests of distributions
tend to have little power, i.e. they are not very likely to detect
deviations when they should. 2) Even if we find significant deviations
from the theoretical distribution, that does not tell us what those
deviations are and what to do about them. 3) We are testing whether a
model is true, but a good model is a simplification of reality, i.e.
the model is not supposed to be true. A good model involves an
informed tradeoff between how well the model simplifies reality and
how large the deviations are between model and reality (approximated
by the observations). To make this more difficult, not all deviations
are equally relevant, and which ones are most relevant depends on the
purpose of the model. So no generic/automatic/computerized method can
exist to do this tradeoff for us, which is good, as that means that
our jobs won't be replaced by computers any time soon. This also means
that the tradeoff implicit in a statistical test is typically not the
right tradeoff for determining whether a model is appropriate or not.
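
A small simulation can illustrate the point about mixtures made at the
start of this paragraph. The sketch assumes one standard normal
explanatory variable x and a log link, lambda_i = exp(0.5 + x). These
choices, the seed, and the variable names are made up for illustration.
Conditional on x each observation is exponential, but the marginal
distribution of y is a mixture of exponentials, so a comparison of y
with a single exponential distribution with lambda = 1/mean no longer
tests the model of interest:

*------------- begin example ---------------
// hypothetical setup: lambda differs across observations
drop _all
set seed 12345
set obs 500
gen x = rnormal()
gen lambda_i = exp(0.5 + x)
gen y = -1/lambda_i*ln(1-runiform())

// a single "pooled" lambda, ignoring x
sum y, meanonly
local lambdahat = 1/r(mean)

// compare the marginal distribution of y with one
// exponential distribution
ksmirnov y = 1-exp(-`lambdahat'*y)
*--------------- end example ---------------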

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany


http://www.maartenbuis.nl
--------------------------