
Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point

From: David Hoaglin
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date: Fri, 20 Jul 2012 09:43:21 -0400

```
Lucia,

It is interesting that your data do not show evidence of more than one
mode.  You did not mention the unit of observation for the 11,280
values of daily per capita kilocalorie consumption, but "per capita"
suggests that it is some sort of geopolitical entity. If your sample
is actually heterogeneous, a mixture, I would not expect a single
theoretical distribution to give a satisfactory fit.  In that
situation, stratifying on relevant characteristics (or combinations of
characteristics) of the observational units may produce subsamples
that are reasonably homogeneous.  If you can do that, and you are
lucky, a single family of distributions may work within stratum.  Then
you can ask how the parameters of the fitted distributions vary among
the strata.  (This approach is similar to fitting predictors. I think
you said earlier that you are not doing that, but you did not explain
why.  With so much data, you may have interpretable patterns.)

On the g-and-h distributions, I should develop some code to fit them,
but I have been using Stata for less than a year, and I haven't had
time to do that.  Once you have the appropriate quantiles of the data,
the calculations are straightforward.

A basic reference on the g-and-h distributions is Chapter 11 in the
book Exploring Data Tables, Trends, and Shapes (Hoaglin, Mosteller,
and Tukey, eds.), Wiley, 1985.  A search on "g-and-h distributions"
may turn up a number of papers, but I advise caution: some of them
have misleading statements.  I use g-and-h distributions for exploring
and approximating shapes of distributions, but others have used them
in substantial applications.  I leave it to Nick Cox to explain what
he meant by "not without great cost."

David Hoaglin

On Thu, Jul 19, 2012 at 10:59 AM, Lucia R.Latino
<Latino@economia.uniroma2.it> wrote:
> Nick and Maarten,
>
> I see your point. For sure I have a problem with the tail of my
> distribution. However, I don’t think that dropping what I consider outliers
> is the problem given that the q-q plot shows the same pattern before and
> after I drop observations. In the example Nick gave me,  I see the
> difference. When I chop the data at 10,000 the q-q plot looks exactly like
> mine, while I get a different picture if I use all the data. It does not
> happen with my dataset.
>
> About the truncated distribution, would it still make sense if I can
> reasonably think that the values over 10,000 are outliers rather than extreme
> values of the distribution?
>
> And here, I am back to David.
> The data refer to daily per capita kilocalorie consumption. That is the
> reason why I would find it really hard to believe that I can have observations
> higher than 10,000 (which is already a high value). Furthermore, the
> observations I am dropping are less than 1% of the sample (11,280
> observations) and there is no evidence of more than one mode.
>
> It may be true that the distributions I am trying to fit are simply not
> right for my data, but as David said: “the Q-Q plot should show systematic
> lack of fit in other parts of the data, not just at 10,000 and above”. This
> is why I wrote to statalist.
>
> I read about the g-and-h distributions, but I have never used them. Could you
> tell me if there is any package in Stata or do I need to write all the code
> for them?

```
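David's remark that the g-and-h calculations are straightforward once the quantiles are in hand can be illustrated with a minimal sketch. It is written in Python rather than Stata, and it is not code from this thread. It assumes the usual parameterization Y = A + B * ((exp(g*Z) - 1) / g) * exp(h*Z^2 / 2) with Z standard normal, takes g as the median of the quantile-wise estimates, and gets h and B from a least-squares line through the log-rescaled spreads. The function names, the tail probabilities, and the interpolation rule for sample quantiles are all illustrative assumptions.

```python
import math
from statistics import NormalDist, median

STD_NORMAL = NormalDist()


def gh_quantile(z, A, B, g, h):
    """Value of a g-and-h variate at standard normal quantile z."""
    skew = z if g == 0 else (math.exp(g * z) - 1.0) / g
    return A + B * skew * math.exp(h * z * z / 2.0)


def sample_quantile(sorted_x, p):
    """Empirical quantile by linear interpolation (one convention of several)."""
    n = len(sorted_x)
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, n - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def fit_g_and_h(data, tail_probs=(1/4, 1/8, 1/16, 1/32, 1/64, 1/128)):
    """Quantile-based estimates (A, B, g, h); assumes both half-spreads are > 0."""
    xs = sorted(data)
    A = sample_quantile(xs, 0.5)  # location: the median

    g_values, spreads = [], []
    for p in tail_probs:
        z = STD_NORMAL.inv_cdf(p)                 # negative, since p < 0.5
        lower = A - sample_quantile(xs, p)        # lower half-spread
        upper = sample_quantile(xs, 1.0 - p) - A  # upper half-spread
        # Under the model, upper / lower = exp(-g * z), so solve for g.
        g_values.append(-math.log(upper / lower) / z)
        spreads.append((z, lower + upper))
    g = median(g_values)  # assumes g is roughly constant across p

    # Full spread = B * (exp(-g z) - exp(g z)) / g * exp(h z^2 / 2), so the log
    # of the rescaled spread is linear in z^2 / 2: slope h, intercept log B.
    points = []
    for z, spread in spreads:
        scale = -2.0 * z if g == 0 else (math.exp(-g * z) - math.exp(g * z)) / g
        points.append((z * z / 2.0, math.log(spread / scale)))
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    h = sxy / sxx
    B = math.exp(mean_y - h * mean_x)
    return A, B, g, h
```

Calling `fit_g_and_h(values)` on a list of observations returns the four estimates. Plotting the individual quantile-wise g values against z^2 before taking their median is a quick way to check whether a constant g is a reasonable description of the data.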