



R: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point


From   "Lucia R.Latino" <Latino@economia.uniroma2.it>
To   <statalist@hsphsun2.harvard.edu>
Subject   R: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date   Thu, 19 Jul 2012 16:59:37 +0200

Nick and Maarten,

I see your point. For sure I have a problem with the tail of my
distribution. However, I don't think that dropping what I consider outliers
is the problem given that the q-q plot shows the same pattern before and
after I drop observations. In the example Nick gave me,  I see the
difference. When I chop the data at 10,000 the q-q plot looks exactly like
mine, while I get a different picture if I use all the data. It does not
happen with my dataset. 

About the truncated distribution, would it still make sense if I can
reasonably think that the values over 10,000 are outliers rather than extreme
values of the distribution?
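
To make the truncation question concrete, here is a minimal sketch in Python (not Stata) of how the quantiles of a fitted distribution change when it is truncated above at the cutoff. It uses a lognormal only as a stand-in. The parameters `MU` and `SIGMA` are invented for illustration and are not estimated from the data in this thread, and the plotting position (i - 0.5)/n is just one common choice. Only the cutoff of 10,000 and the sample size come from the messages.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal from the standard library


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal with log-scale mean mu and log-scale sd sigma."""
    return STD_NORMAL.cdf((math.log(x) - mu) / sigma)


def lognormal_quantile(p, mu, sigma):
    """Quantile function of the untruncated lognormal."""
    return math.exp(mu + sigma * STD_NORMAL.inv_cdf(p))


def truncated_lognormal_quantile(p, mu, sigma, upper):
    """Quantile of the lognormal truncated above at `upper`.

    With truncation above at c, F_trunc(x) = F(x) / F(c) for x <= c,
    so Q_trunc(p) = Q(p * F(c)).
    """
    return lognormal_quantile(p * lognormal_cdf(upper, mu, sigma), mu, sigma)


# Hypothetical parameters, NOT estimated from the data in this thread.
MU, SIGMA = math.log(2500.0), 0.55
UPPER = 10_000.0  # cutoff discussed in the thread
N = 11_280        # sample size quoted in the thread

if __name__ == "__main__":
    # Compare the theoretical quantiles at the three largest plotting positions.
    for i in (N - 2, N - 1, N):
        p = (i - 0.5) / N
        plain = lognormal_quantile(p, MU, SIGMA)
        trunc = truncated_lognormal_quantile(p, MU, SIGMA, UPPER)
        print(f"p={p:.6f}  untruncated={plain:,.0f}  truncated={trunc:,.0f}")
```

The untruncated quantiles at the largest plotting positions can run well past the cutoff, while the truncated ones cannot exceed it. Whether truncation is the right model still depends on whether values above 10,000 are impossible or merely rare.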

And here, I am back to David.
The data refer to daily per capita kilocalorie consumption. That is the
reason why I would find it really hard to believe that I can have observations
higher than 10,000 (which is already a high value). Furthermore, the
observations I am dropping are less than 1% of the sample (11,280
observations) and there is no evidence of more than one mode.

It may be true that the distributions I am trying to fit are simply not
right for my data, but as David said: "the Q-Q plot should show systematic
lack of fit in other parts of the data, not just at 10,000 and above". This
is why I wrote to statalist. 

I read about the g-and-h distributions, but I have never used them. Could you
tell me if there is a package for them in Stata, or do I need to write all the
code myself?

Thank you for your kind support.
Lucia

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of David Hoaglin
Sent: Thursday, 19 July 2012 13:37
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: q-q plots, theoretical distribution with values higher than
the sample's cutoff point

Dear Lucia,

I am having trouble fitting the pieces of information together.

If, in context, observations greater than 10,000 are likely to be outliers,
I would not expect a distribution that fits your data well below 9,000 to
have a heavier tail, with corresponding quantiles out to 20,000.  Why do you
consider observations greater than 10,000 to be outliers?  (The largest 4
observations are between 9,900 and 10,000.
Why do they seem to be clumping there, almost as if 10,000 were an upper
bound?)

Perhaps your data do not follow a lognormal distribution or one of the other
theoretical distributions you mentioned, but then the Q-Q plot should show
systematic lack of fit in other parts of the data, not just at 10,000 and
above.

I don't recall seeing a description of what the data are.  Is it possible
that your sample is a mixture of some sort?  Is there evidence of more than
one mode?

Also, what is the sample size?  It is difficult to get much information on
the shape of a distribution without having several hundred observations.

One can often learn a lot about the distribution shape of the data by taking
an exploratory approach, based on quantiles.  I like to use the g-and-h
distributions, introduced by John W. Tukey, though that family is not as
well known as it deserves to be.  The lognormal distributions are a
subfamily of the g-and-h distributions, as are the normal distributions, and
the approach has a lot of flexibility for data that have heavy tails.
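
For readers who have not met the family, a minimal sketch of the g-and-h quantile function follows, written in Python rather than Stata. It assumes the usual form Q(p) = A + B * ((exp(g z) - 1) / g) * exp(h z^2 / 2), with z the standard normal quantile at p. The function name and argument names are made up for this illustration and do not come from any Stata package.

```python
import math
from statistics import NormalDist


def g_and_h_quantile(p, a, b, g, h):
    """Quantile of a g-and-h distribution (illustrative helper, hypothetical name).

    a: location (the median), b: scale, g: skewness, h: tail heaviness.
    """
    z = NormalDist().inv_cdf(p)
    # The g -> 0 limit of (exp(g*z) - 1) / g is z itself.
    core = z if g == 0.0 else math.expm1(g * z) / g
    return a + b * core * math.exp(h * z * z / 2.0)


if __name__ == "__main__":
    # Hypothetical lognormal parameters, chosen only for illustration.
    mu, sigma = math.log(2500.0), 0.55
    p = 0.9
    z = NormalDist().inv_cdf(p)
    # h = 0 with a = exp(mu), b = sigma * exp(mu), g = sigma gives a lognormal.
    print(g_and_h_quantile(p, math.exp(mu), sigma * math.exp(mu), sigma, 0.0))
    print(math.exp(mu + sigma * z))
    # g = 0 and h = 0 give the normal; h > 0 stretches both tails.
    print(g_and_h_quantile(0.999, 0.0, 1.0, 0.0, 0.0))
    print(g_and_h_quantile(0.999, 0.0, 1.0, 0.0, 0.2))
```

Setting h = 0 recovers the lognormal case, setting g = 0 and h = 0 recovers the normal, and a positive h stretches the tails relative to either.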

David Hoaglin

On Thu, Jul 19, 2012 at 3:39 AM, Lucia R.Latino
<Latino@economia.uniroma2.it> wrote:
> Dear Nick,
>
> I dropped all the observations greater than 10,000 because I 
> considered them outliers. However, even without dropping the 
> observations, q-q plots show the same pattern. Also the use of the 
> weights does not make so much difference, as you said.
>
> I know that the distribution is not lognormal (that is exactly what I
> was trying to show); my concern was about the plots. As I mentioned
> before, the points are close enough to the 45-degree line (in the
> case of the GB2 and Singh-Maddala, the points on the q-q plot fall
> exactly on the straight line) up to approximately the value 9,000.
> After that, the points depart significantly from the 45-degree line
> and become a line parallel to the x-axis; furthermore, while the
> sample distribution reaches a value of 10,000, the theoretical one
> reaches approximately 20,000.
>
> I think that this is a "weird" behavior of the plots or I am simply 
> missing something important about the q-q plots.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



