Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# R: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point

From: "Lucia R.Latino"
To:
Subject: R: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date: Fri, 20 Jul 2012 11:51:35 +0200

```
Nick,

thanks a lot for your valuable help!

Lucia

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Nick Cox
Sent: Thursday, 19 July 2012 20:09
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: q-q plots, theoretical distribution with values higher than
the sample's cutoff point

We don't have your data. We can only make guesses based partly on
experiences elsewhere. These are mine.

1. The absence of a sound criterion for distinguishing erroneous values from
genuine extremes is typical of most applied work, so you are in good
company. The main exceptions are where something is physically impossible,
or outside well-documented records, or appears on scrutiny of source
material to be a mistake of some kind. My instinct, and it's only that, is
that you are better off leaving all outliers in. The side-effects of
truncation can only be analysed properly by writing programs that take the
truncation into account.
Also, choosing 10,000 as the cut-off leaves you open to all sorts of
awkward questions about why 10,000, etc.

2. Although it is always welcome if a standard (even
textbook-mentioned) distribution is a good fit to a dataset, it is often
true that no such distribution works well. In your case one might guess at a
mixture (e.g. "normal" people, athletes in training, people with eating
disorders, people fasting for religious reasons, etc.). Your data show the 4
smallest values as 11.1115, 38.77081, 112.4597,  116.3163. Why are these
plausible but not 10,001?

3. I am not aware of any Stata software for g-and-h distributions.
My own view is that the great flexibility of g-and-h distributions is
bought at a great price. But I've never tried them and David certainly has,
so you should listen to him.

4. Taking logarithms and checking the normality of those values remains a
defensible way of assessing lognormality. The great advantage of -lognfit-
is that it supports predictors, but that is not what you are doing here.

Nick

On Thu, Jul 19, 2012 at 3:59 PM, Lucia R.Latino
<Latino@economia.uniroma2.it> wrote:
> Nick and Maarten,
>
> I see your point. For sure I have a problem with the tail of my
> distribution. However, I don't think that dropping what I consider
> outliers is the problem given that the q-q plot shows the same pattern
> before and after I drop observations. In the example Nick gave me,  I
> see the difference. When I chop the data at 10,000 the q-q plot looks
> exactly like mine, while I get a different picture if I use all the
> data. It does not happen with my dataset.
>
> About the truncated distribution, would it still make sense if I can
> reasonably think that the values over 10,000 are outliers rather than
> extreme values of the distribution?
>
> And here, I am back to David.
> The data refer to daily per capita kilocalorie consumption. That is
> the reason why I would find it really hard to believe that I can have
> observations higher than 10,000 (which is already a high value).
> Furthermore, the observations I am dropping are less than 1% of the
> sample (11280 observations) and there is no evidence of more than one
> mode.
>
> It may be true that the distributions I am trying to fit are simply
> not right for my data, but as David said: "the Q-Q plot should show
> systematic lack of fit in other parts of the data, not just at 10,000
> and above". This is why I wrote to Statalist.
>
> I read about the g-and-h distributions, but I have never used them.
> Could you tell me if there is any package in Stata or do I need to
> write all the code for them myself?
>
> Thank you for your kind support.
> Lucia
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of David
> Hoaglin
> Sent: Thursday, 19 July 2012 13:37
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: q-q plots, theoretical distribution with values
> higher than the sample's cutoff point
>
> Dear Lucia,
>
> I am having trouble fitting the pieces of information together.
>
> If, in context, observations greater than 10,000 are likely to be
> outliers, I would not expect a distribution that fits your data well
> below 9,000 to have a heavier tail, with corresponding quantiles out
> to 20,000.  Why do you consider observations greater than 10,000 to be
> outliers?  (The largest 4 observations are between 9,900 and 10,000.
> Why do they seem to be clumping there, almost as if 10,000 were an
> upper bound?)
>
> Perhaps your data do not follow a lognormal distribution or one of the
> other theoretical distributions you mentioned, but then the Q-Q plot
> should show systematic lack of fit in other parts of the data, not
> just at 10,000 and above.
>
> I don't recall seeing a description of what the data are.  Is it
> possible that your sample is a mixture of some sort?  Is there
> evidence of more than one mode?
>
> Also, what is the sample size?  It is difficult to get much
> information on shape of distribution without having several hundred
> observations.
>
> One can often learn a lot about the distribution shape of the data by
> taking an exploratory approach, based on quantiles.  I like to use the
> g-and-h distributions, introduced by John W. Tukey, though that family
> is not as well known as it deserves to be.  The lognormal
> distributions are a subfamily of the g-and-h distributions, as are the
> normal distributions, and the approach has a lot of flexibility for data
> that have heavy tails.
>
> David Hoaglin
>
> On Thu, Jul 19, 2012 at 3:39 AM, Lucia R.Latino
> <Latino@economia.uniroma2.it> wrote:
>> Dear Nick,
>>
>> I dropped all the observations greater than 10,000 because I
>> considered them outliers. However, even without dropping the
>> observations, q-q plots show the same pattern. Also, the use of
>> weights does not make much difference, as you said.
>>
>> I know that the distribution is not lognormal (that is exactly what I
>> was trying to show); my concern was about the plots. As I mentioned
>> before, the points are close enough to the 45-degree line (in the case
>> of the GB2 and Singh-Maddala, the points on the q-q plot fall exactly
>> on the straight line) up to approximately the value 9,000. After that,
>> the points depart significantly from the 45-degree line and become a
>> line parallel to the x-axis; furthermore, while the sample distribution
>> reaches the value 10,000, the theoretical one reaches approximately
>> 20,000.
>>
>> I think that this is a "weird" behavior of the plots or I am simply
>> missing something important about the q-q plots.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
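The pattern discussed in the thread (sample quantiles stopping at the cutoff while the fitted lognormal's quantiles run well beyond it) can be reproduced on synthetic data. The following is a minimal Python sketch, not Stata and not the poster's data: the lognormal parameters (7.7 and 0.5), the seed, and the helper name `lognormal_qq` are all assumptions chosen for illustration, and only the sample size (11280) and the 10,000 cutoff come from the messages. It fits a lognormal by taking logs, as in Nick's point 4, and compares the largest empirical and theoretical quantiles after truncation.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

CUTOFF = 10_000  # cutoff discussed in the thread


def lognormal_qq(sample):
    """Return (theoretical, empirical) quantiles for a lognormal fit to sample."""
    empirical = sorted(sample)
    n = len(empirical)
    logs = [math.log(x) for x in empirical]
    # Fit a normal distribution to the logs (simple moment estimates).
    fit = NormalDist(fmean(logs), stdev(logs))
    # Plotting positions (i - 0.5) / n: one common convention among several.
    theoretical = [math.exp(fit.inv_cdf((i - 0.5) / n)) for i in range(1, n + 1)]
    return theoretical, empirical


# Synthetic data only: mu = 7.7 and sigma = 0.5 are made-up parameters.
rng = random.Random(2012)
full = [rng.lognormvariate(7.7, 0.5) for _ in range(11280)]
kept = [x for x in full if x <= CUTOFF]  # drop values above the cutoff

theo, emp = lognormal_qq(kept)
print(f"dropped {len(full) - len(kept)} of {len(full)} observations")
print(f"largest empirical quantile:   {emp[-1]:.0f}")
print(f"largest theoretical quantile: {theo[-1]:.0f}")
```

Plotting `emp` against `theo` gives the q-q plot itself. In this sketch the fitted distribution knows nothing about the cutoff, so its upper quantiles keep growing while the sample's cannot exceed 10,000. This is only one mechanism that can produce such a picture, and it does not by itself explain a dataset where the pattern appears before truncation as well.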