Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.


From: Maarten Buis <maartenlbuis@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: st: distribution test
Date: Tue, 30 Aug 2011 10:43:37 +0200

-- Lodewijk Smets wrote me privately:
> In Stata, I'd like to test the fit of my data with an exponential
> distribution. I'm thinking of using a Kolmogorov-Smirnov test
> (are there better alternatives?). Yet, the ks-test requires me
> to define lambda. I've noticed that you're the author of the
> -hangroot- command, where parameters are estimated (in
> order to compare an empirical distribution with a theoretical
> one). So I was wondering if there's a way to retrieve that
> estimation of the parameter, i.e. is it stored somewhere?

-hangroot- is for the rather special situation that you want to compare the distribution of one variable with one univariate distribution, i.e. there are no explanatory/x/right hand side/independent variables. I am working on a generalization, but it is not yet finished. There is however a deadline, as I will be presenting it at the 2011 Nordic and Baltic Stata Users Group meeting on Friday, November 11, 2011 (<http://www.stata.com/meeting/sweden11/>).

Moreover, the exponential distribution often occurs with survival data, and -hangroot- is not (and will not be) designed for survival data; in particular, it will not handle right censoring.

Having stated those limitations, the estimate of lambda in the univariate non-survival case is pretty easy, as the maximum likelihood estimate has a closed form solution: 1/mean. You could use -hangroot- to recover this estimate: it is returned in r(lambda), but that is a bit overkill. It is probably easier to use -sum- to compute the mean and transform that to the estimate of lambda.
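
The closed form 1/mean follows directly from the exponential log likelihood. As a sketch, assuming the rate parameterization with density $f(y;\lambda) = \lambda e^{-\lambda y}$ for $y \ge 0$ (so the CDF is $1 - e^{-\lambda y}$) and $n$ independent observations $y_1, \dots, y_n$ (this notation is introduced here for illustration and is not part of the original post):

$$
\begin{aligned}
\ell(\lambda) &= \sum_{i=1}^{n} \ln\!\left(\lambda e^{-\lambda y_i}\right) = n \ln \lambda - \lambda \sum_{i=1}^{n} y_i, \\
\frac{d\ell}{d\lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} y_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} y_i} = \frac{1}{\bar{y}}, \\
\frac{d^2\ell}{d\lambda^2} &= -\frac{n}{\lambda^2} < 0 .
\end{aligned}
$$

The negative second derivative confirms that $\hat{\lambda} = 1/\bar{y}$ is a maximum. This derivation only holds without censoring, in line with the limitation on survival data noted above.
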
Below I have added an example:

*------------- begin example ---------------
// create some exponential data
local lambda = 2
drop _all
set obs 500
gen y = -1/`lambda'*ln(1-runiform())

// estimate parameter
sum y, meanonly
local lambdahat = 1/r(mean)
di as txt "ML estimate of lambda is: " ///
   as result `lambdahat'

// ksmirnov test
ksmirnov y = 1-exp(-`lambdahat'*y)

// hanging rootogram
hangroot y, dist(exponential) ci
return list
*--------------- end example ---------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )

If you have explanatory variables, the problem then is that there is no longer one lambda: each observation has its own lambda. So the marginal distribution of your explained/y/left hand side/dependent variable no longer follows an exponential distribution but a mixture of exponential distributions with different lambdas. To the best of my knowledge no test has been implemented that will test the marginal distribution of your explained variable against this mixture distribution. As I said above, I am working on a graphical comparison of these two distributions, and I might add such a test for some models. However, if I do so I will probably include a warning in the helpfile not to use it, for the following three reasons:

1) The preferred outcome is that we cannot find a significant deviation from the theoretical distribution. However, such non-significance only indicates "absence of evidence", which should not be confused with "evidence of absence", especially since such tests of distributions tend to have little power, i.e. they are not very likely to detect deviations when they should.

2) Even if we find significant deviations from the theoretical distribution, that does not tell us what those deviations are and what to do about them.

3) We are testing whether a model is true, but a good model is a simplification of reality, i.e. the model is not supposed to be true.
A good model involves an informed tradeoff between how well the model simplifies reality and how large the deviations are between model and reality (approximated by the observations). To make this more difficult, not all deviations are equally relevant, and which ones are most relevant depends on the purpose of the model. So no generic/automatic/computerized method can exist to do this tradeoff for us, which is good, as that means that our jobs won't be replaced by computers any time soon. This also means that the tradeoff implicit in a statistical test is typically not the right tradeoff for determining whether a model is appropriate or not.

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:

- **Re: st: distribution test** *From:* Nick Cox <njcoxstata@gmail.com>
- **Re: st: distribution test** *From:* Maarten Buis <maartenlbuis@gmail.com>
