


R: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point

From   "Lucia R.Latino" <[email protected]>
To   <[email protected]>
Subject   R: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date   Sat, 21 Jul 2012 11:47:37 +0200


Thanks a lot for your suggestions. I will definitely try to stratify the
sample on relevant characteristics and, hopefully, I will be able to fit a
single family of distributions.

About the g-and-h distributions: from what I understood, they do not have a
well-defined probability density function, which conflicts with my
original intention. They may still be a valuable alternative, but I need
to study them first.

Thank you,

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On behalf of David Hoaglin
Sent: Friday, 20 July 2012 15:43
To: [email protected]
Subject: Re: st: q-q plots, theoretical distribution with values higher than
the sample's cutoff point


It is interesting that your data do not show evidence of more than one mode.
You did not mention the unit of observation for the 11,280 values of daily
per capita kilocalorie consumption, but "per capita" suggests that it is
some sort of geopolitical entity. If your sample is actually heterogeneous
(a mixture), I would not expect a single theoretical distribution to give a
satisfactory fit.  In that situation, stratifying on relevant
characteristics (or combinations of characteristics) of the observational
units may produce subsamples that are reasonably homogeneous.  If you can do
that, and you are lucky, a single family of distributions may work within
each stratum.  Then you can ask how the
parameters of the fitted distributions vary among the strata.  (This
approach is similar to fitting predictors. I think you said earlier that you
are not doing that, but you did not explain why.  With so much data, you may
have interpretable patterns.)
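
The stratify-then-fit idea above can be sketched in a few lines. This is
only an illustration under loud assumptions: the stratum labels, the
values, and the choice of a lognormal family are all made up here (the
thread does not say which family to use), and the fit is the plain
maximum-likelihood estimate on the log scale.

```python
import math
from collections import defaultdict
from statistics import fmean, pstdev


def fit_lognormal_by_stratum(records):
    """Fit a lognormal within each stratum.

    records: iterable of (stratum, value) pairs with value > 0.
    Returns {stratum: (mu, sigma)}, the mean and SD of log(value),
    which are the lognormal maximum-likelihood estimates.
    """
    logs = defaultdict(list)
    for stratum, value in records:
        logs[stratum].append(math.log(value))
    return {s: (fmean(v), pstdev(v)) for s, v in logs.items()}


# Hypothetical strata and kilocalorie values, for illustration only.
records = [
    ("urban", 1900.0), ("urban", 2100.0), ("urban", 2400.0),
    ("rural", 2300.0), ("rural", 2600.0), ("rural", 3100.0),
]
params = fit_lognormal_by_stratum(records)
for stratum, (mu, sigma) in sorted(params.items()):
    print(f"{stratum}: mu={mu:.3f} sigma={sigma:.3f}")
```

Comparing the fitted (mu, sigma) pairs across strata is then the "how do
the parameters vary" question, in miniature.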

On the g-and-h distributions, I should develop some code to fit them, but I
have been using Stata for less than a year, and I haven't had time to do
that.  Once you have the appropriate quantiles of the data, the calculations
are straightforward.
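
To give a sense of why the calculations are straightforward once the
quantiles are in hand, here is a minimal sketch in Python (not Stata, and
not the full fitting procedure from the literature). It assumes the usual
form of the g-and-h quantile function,
Q(p) = A + B * ((exp(g*z) - 1) / g) * exp(h*z^2 / 2), with z the standard
normal quantile at p, and it estimates g from a single pair of symmetric
quantiles and the median. The parameter values and function names are
hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gh_quantile(p, a, b, g, h):
    """Quantile of a g-and-h distribution at probability p (0 < p < 1)."""
    z = STD_NORMAL.inv_cdf(p)
    # The g -> 0 limit of (exp(g*z) - 1) / g is z itself.
    skew_part = z if g == 0 else (math.exp(g * z) - 1.0) / g
    return a + b * skew_part * math.exp(h * z * z / 2.0)


def estimate_g(lower_q, median, upper_q, p):
    """Estimate g from the p and 1-p quantiles (p < 0.5) and the median.

    The exp(h*z^2/2) factor is the same at z and -z, so it cancels in the
    ratio of the upper to the lower half-spread, leaving exp(-g*z_p).
    """
    z_p = STD_NORMAL.inv_cdf(p)  # negative because p < 0.5
    return -math.log((upper_q - median) / (median - lower_q)) / z_p


# Hypothetical parameters: generate exact quantiles, then recover g.
a, b, g, h = 2200.0, 600.0, 0.4, 0.05
p = 0.1
g_hat = estimate_g(
    gh_quantile(p, a, b, g, h),
    gh_quantile(0.5, a, b, g, h),
    gh_quantile(1 - p, a, b, g, h),
    p,
)
print(g_hat)
```

With real data one would plug in sample quantiles at several values of p
and look at how the resulting estimates of g behave, rather than rely on a
single p.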

A basic reference on the g-and-h distributions is Chapter 11 in the book
Exploring Data Tables, Trends, and Shapes (Hoaglin, Mosteller, and Tukey,
eds.), Wiley, 1985.  A search on "g-and-h distributions"
may turn up a number of papers, but I advise caution: some of them have
misleading statements.  I use g-and-h distributions for exploring and
approximating shapes of distributions, but others have used them in
substantial applications.  I leave it to Nick Cox to explain what he meant
by "not without great cost."

David Hoaglin

On Thu, Jul 19, 2012 at 10:59 AM, Lucia R.Latino
<[email protected]> wrote:
> Nick and Maarten,
> I see your point. For sure I have a problem with the tail of my
> distribution. However, I don't think that dropping what I consider
> outliers is the problem, given that the q-q plot shows the same pattern
> before and after I drop observations. In the example Nick gave me, I
> see the difference. When I chop the data at 10,000 the q-q plot looks
> exactly like mine, while I get a different picture if I use all the
> data. That does not happen with my dataset.
> About the truncated distribution, would it still make sense if I can
> reasonably think that the values over 10,000 are outliers rather than
> extreme values of the distribution?
> And here, I am back to David.
> The data refer to daily per capita kilocalorie consumption. That is
> the reason why I would find it really hard to believe that I can have
> observations higher than 10,000 (which is already a high value).
> Furthermore, the observations I am dropping are less than 1% of the
> sample (11,280 observations) and there is no evidence of more than one
> mode.
> It may be true that the distributions I am trying to fit are simply
> not right for my data, but as David said: "the Q-Q plot should show
> systematic lack of fit in other parts of the data, not just at 10,000
> and above". This is why I wrote to Statalist.
> I read about the g-and-h distributions, but I have never used them.
> Could you tell me if there is any package in Stata or do I need to
> write all the code for them?
