Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point


From   David Hoaglin <dchoaglin@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date   Thu, 19 Jul 2012 07:37:06 -0400

Dear Lucia,

I am having trouble fitting the pieces of information together.

If, in context, observations greater than 10,000 are likely to be
outliers, I would not expect a distribution that fits your data well
below 9,000 to have a heavier tail, with corresponding quantiles out
to 20,000.  Why do you consider observations greater than 10,000 to be
outliers?  (The largest 4 observations are between 9,900 and 10,000.
Why do they seem to be clumping there, almost as if 10,000 were an
upper bound?)

Perhaps your data do not follow a lognormal distribution or one of the
other theoretical distributions you mentioned, but then the Q-Q plot
should show systematic lack of fit in other parts of the data, not
just at 10,000 and above.
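
To make the point about the upper end concrete, here is a minimal sketch
of how Q-Q pairs behave when a sample cannot exceed a bound.  Everything
in it is an assumption for illustration, not your data: the parameters
MU and SIGMA, the cap of 10,000, the sample size of 500, and the function
names are all hypothetical, and the reference distribution uses the
generating parameters rather than a fit.

```python
import math
import random
from statistics import NormalDist

def lognormal_quantile(p, mu, sigma):
    """Quantile of a lognormal distribution with log-scale mean mu and sd sigma."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))

def qq_pairs(sample, mu, sigma):
    """Return (theoretical quantile, observed value) pairs for a Q-Q plot."""
    xs = sorted(sample)
    n = len(xs)
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1).
    return [(lognormal_quantile((i + 0.5) / n, mu, sigma), x)
            for i, x in enumerate(xs)]

random.seed(1)
MU, SIGMA, CAP = 8.0, 0.8, 10_000   # hypothetical values, not estimates
raw = [random.lognormvariate(MU, SIGMA) for _ in range(500)]
capped = [min(x, CAP) for x in raw]  # treat 10,000 as an upper bound
pairs = qq_pairs(capped, MU, SIGMA)

# Largest few pairs: the observed values cannot pass CAP, while the
# theoretical quantiles keep growing.
for theo, obs in pairs[-5:]:
    print(f"theoretical {theo:10.0f}   observed {obs:10.0f}")
```

In a sketch like this the observed values stop at the bound while the
theoretical quantiles continue well past it, so the upper end of the
plot flattens even though the rest follows the reference line.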

I don't recall seeing a description of what the data are.  Is it
possible that your sample is a mixture of some sort?  Is there
evidence of more than one mode?

Also, what is the sample size?  It is difficult to get much
information on the shape of a distribution without having several
hundred observations.

One can often learn a lot about the distribution shape of the data by
taking an exploratory approach, based on quantiles.  I like to use the
g-and-h distributions, introduced by John W. Tukey, though that family
is not as well known as it deserves to be.  The lognormal
distributions are a subfamily of the g-and-h distributions, as are the
normal distributions, and the approach has a lot of flexibility for
data that have heavy tails.
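
For readers who have not met the family, here is a minimal sketch of the
g-and-h quantile function and of a quantile-based estimate of g.  The
names gh_quantile and estimate_g, and the symbols A (location) and
B (scale), are my choices for illustration; g controls skewness and h
controls tail heaviness.  It is a sketch under those assumptions, not a
full fitting procedure.

```python
import math
from statistics import NormalDist

def gh_quantile(p, A=0.0, B=1.0, g=0.0, h=0.0):
    """Quantile of a g-and-h distribution at probability p.

    With z the standard normal quantile at p:
        Q(p) = A + B * ((exp(g*z) - 1) / g) * exp(h * z**2 / 2)
    and the skewness factor reduces to z when g == 0.
    """
    z = NormalDist().inv_cdf(p)
    skew = z if g == 0 else (math.exp(g * z) - 1.0) / g
    return A + B * skew * math.exp(h * z * z / 2.0)

def estimate_g(lower, median, upper, p):
    """Estimate g from the p and 1-p quantiles (p < 0.5) and the median.

    Uses the ratio of the upper half-spread to the lower half-spread;
    the h factor is the same on both sides and cancels.
    """
    z_p = NormalDist().inv_cdf(p)
    return -math.log((upper - median) / (median - lower)) / z_p

# Round trip on exact quantiles with hypothetical g = 0.5, h = 0.1.
p = 0.1
lower = gh_quantile(p, g=0.5, h=0.1)
median = gh_quantile(0.5, g=0.5, h=0.1)
upper = gh_quantile(1 - p, g=0.5, h=0.1)
g_hat = estimate_g(lower, median, upper, p)
print(g_hat)
```

Setting g = 0 and h = 0 gives the normal distributions, and h = 0 with
g different from 0 gives (shifted) lognormal distributions.  With real
data one would compute estimate_g at several values of p and see whether
the estimates are roughly constant.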

David Hoaglin

On Thu, Jul 19, 2012 at 3:39 AM, Lucia R.Latino
<Latino@economia.uniroma2.it> wrote:
> Dear Nick,
>
> I dropped all the observations greater than 10,000 because I considered them
> outliers. However, even without dropping the observations, q-q plots show
> the same pattern. Also the use of the weights does not make so much
> difference, as you said.
>
> I know that the distribution is not lognormal (that is exactly what I was
> trying to show); my concern was about the plots. As I mentioned before,
> the points are close enough to the 45-degree line (in the case of the GB2
> and Singh-Maddala, the points on the q-q plot fall exactly on the straight
> line) until approximately the value 9,000. After that, the points depart
> significantly from the 45-degree line and become a line parallel to the
> x-axis; furthermore, while the sample distribution reaches 10,000, the
> theoretical one reaches approximately 20,000.
>
> I think that this is a "weird" behavior of the plots or I am simply missing
> something important about the q-q plots.