



RE: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point


From   Nick Cox <n.j.cox@durham.ac.uk>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date   Fri, 20 Jul 2012 15:29:57 +0100

Fair question for me at the end. I mean that g-and-h distributions are, despite their flexibility, rather awkward or elusive customers. It may be just psychology or convenience, but I like distributions with relatively simple closed-form definitions of density, distribution and quantile functions, so that I can write a few lines of code to fit them by maximum likelihood, etc. Correct me if I am wrong, but g-and-h distributions don't score well under that heading. As David implies, the practical problem is usually fitting a distribution given predictors, and fitting easily into the ML framework is to me highly desirable.
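
A minimal sketch of the closed-form point, in Python rather than Stata and not taken from the thread. It assumes the usual Tukey g-and-h definition, X = A + B * ((exp(g*Z) - 1)/g) * exp(h*Z^2/2) with Z standard normal. The quantile function is then explicit, but the distribution function has to be obtained by inverting the transform numerically (here by bisection). The function names and the parameter values in the usage note are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def gh_transform(z, g, h):
    """Tukey g-and-h transform of a standard normal value z."""
    skew_part = z if g == 0 else math.expm1(g * z) / g
    return skew_part * math.exp(h * z * z / 2.0)


def gh_quantile(p, a, b, g, h):
    """Quantile function: explicit once the normal quantile z_p is known."""
    return a + b * gh_transform(STD_NORMAL.inv_cdf(p), g, h)


def gh_cdf(x, a, b, g, h, tol=1e-10):
    """Distribution function: no closed form, so invert numerically.

    Assumes b > 0 and h >= 0, which makes the transform increasing in z.
    """
    target = (x - a) / b
    lo, hi = -10.0, 10.0  # search bracket in standard normal units
    if target <= gh_transform(lo, g, h):
        return 0.0  # below the bracket: probability is negligible
    if target >= gh_transform(hi, g, h):
        return 1.0  # above the bracket: probability is essentially 1
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gh_transform(mid, g, h) < target:
            lo = mid
        else:
            hi = mid
    return STD_NORMAL.cdf((lo + hi) / 2.0)
```

For example, gh_cdf(gh_quantile(0.9, 2000, 500, 0.3, 0.05), 2000, 500, 0.3, 0.05) recovers 0.9 up to the bisection tolerance. The density would need the same numerical inversion plus the derivative of the transform, which is the extra work a likelihood routine has to carry.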

Nick 
n.j.cox@durham.ac.uk 

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of David Hoaglin

It is interesting that your data do not show evidence of more than one mode.  You did not mention the unit of observation for the 11,280 values of daily per capita kilocalorie consumption, but "per capita" suggests that it is some sort of geopolitical entity. If your sample is actually heterogeneous, a mixture, I would not expect a single theoretical distribution to give a satisfactory fit.  In that situation, stratifying on relevant characteristics (or combinations of characteristics) of the observational units may produce subsamples that are reasonably homogeneous.  If you can do that, and you are lucky, a single family of distributions may work within stratum.  Then you can ask how the parameters of the fitted distributions vary among the strata.  (This approach is similar to fitting predictors. I think you said earlier that you are not doing that, but you did not explain why.  With so much data, you may have interpretable patterns.)

On the g-and-h distributions, I should develop some code to fit them, but I have been using Stata for less than a year, and I haven't had time to do that.  Once you have the appropriate quantiles of the data, the calculations are straightforward.
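
A minimal sketch of one such quantile-based calculation, in Python rather than Stata and not code from this thread. It assumes the standard skewness estimate g_p = -(1/z_p) * ln((x_{1-p} - median) / (median - x_p)) for 0 < p < 0.5, where z_p is the standard normal quantile. The helper names and the linear-interpolation quantile rule are hypothetical choices for illustration, and the step for h is not shown.

```python
import math
from statistics import NormalDist


def sample_quantile(sorted_x, p):
    """Sample quantile by linear interpolation at position (n - 1) * p."""
    pos = (len(sorted_x) - 1) * p
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    frac = pos - lo
    return sorted_x[lo] + frac * (sorted_x[hi] - sorted_x[lo])


def estimate_g(data, p):
    """Estimate g from the quantiles at p and 1 - p, for 0 < p < 0.5.

    Compares the upper and lower half-spreads around the median; the
    h factor cancels in their ratio because it is symmetric in z.
    """
    xs = sorted(data)
    med = sample_quantile(xs, 0.5)
    lower = med - sample_quantile(xs, p)        # lower half-spread
    upper = sample_quantile(xs, 1.0 - p) - med  # upper half-spread
    z_p = NormalDist().inv_cdf(p)               # negative for p < 0.5
    return -math.log(upper / lower) / z_p
```

Calling estimate_g(data, p) at several tail areas (for instance 1/4, 1/8, 1/16) and comparing the results gives a rough check on whether a single constant g is adequate; values that drift with p suggest it is not.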

A basic reference on the g-and-h distributions is Chapter 11 in the book Exploring Data Tables, Trends, and Shapes (Hoaglin, Mosteller, and Tukey, eds.), Wiley, 1985.  A search on "g-and-h distributions" may turn up a number of papers, but I advise caution: some of them have misleading statements.  I use g-and-h distributions for exploring and approximating shapes of distributions, but others have used them in substantial applications.  I leave it to Nick Cox to explain what he meant by "not without great cost."

David Hoaglin

On Thu, Jul 19, 2012 at 10:59 AM, Lucia R.Latino <Latino@economia.uniroma2.it> wrote:
> Nick and Maarten,
>
> I see your point. For sure I have a problem with the tail of my 
> distribution. However, I don't think that dropping what I consider 
> outliers is the problem given that the q-q plot shows the same pattern 
> before and after I drop observations. In the example Nick gave me,  I 
> see the difference. When I chop the data at 10,000 the q-q plot looks 
> exactly like mine, while I get a different picture if I use all the 
> data. It does not happen with my dataset.
>
> About the truncated distribution, would it still make sense if I can 
> reasonably think that the values over 10,000 are outliers rather than 
> extreme values of the distribution?
>
> And here, I am back to David.
> The data refer to daily per capita kilocalorie consumption. That is 
> the reason why I would find it really hard to believe that I can have 
> observations higher than 10,000 (which is already a high value). 
> Furthermore, the observations I am dropping are less than 1% of the 
> sample (11,280 observations) and there is no evidence of more than 
> one mode.
>
> It may be true that the distributions I am trying to fit are simply 
> not right for my data, but as David said: "the Q-Q plot should show 
> systematic lack of fit in other parts of the data, not just at 10,000 
> and above". This is why I wrote to statalist.
>
> I read about the g-and-h distributions, but I have never used them. 
> Could you tell me if there is any package for them in Stata, or do I 
> need to write all the code myself?


