Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point

| Field | Value |
|---|---|
| From | Nick Cox <[email protected]> |
| To | [email protected] |
| Subject | Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point |
| Date | Thu, 19 Jul 2012 09:03:51 +0100 |

```
Dropping outliers is unlikely to get you a reasonable fit to any
long-tailed distribution that is unbounded above.

At best you have changed the problem to fitting a truncated lognormal
(whatever) distribution, a problem that you would need to program
separately.

But you can experiment yourself. Simulate a lognormal, chop high
values and see whether -lognfit- and -qlogn- perform well.

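clear
set seed 2803
set obs 1000
* simulate a lognormal: log(y) is normal with mean 7 and SD 1
gen y = exp(rnormal(7, 1))
* look at the high values that are about to be chopped
list y if y > 10000
su y
* fit and plot using only the values below the cutoff
lognfit y if y < 10000
qlogn y if y < 10000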

Even if you only drop a very few values -lognfit- just can't do a good
job here. Don't blame the plots or the programs; you threw away
valuable information and are seeing the consequences.

-lognfit- and -qlogn- are from SSC.

Nick

On Thu, Jul 19, 2012 at 8:39 AM, Lucia R.Latino
<[email protected]> wrote:
> Dear Nick,
>
> I dropped all the observations greater than 10,000 because I considered them
> outliers. However, even without dropping the observations, the q-q plots show
> the same pattern. Also, using the weights does not make much difference,
> as you said.
>
> I know that the distribution is not lognormal (that is exactly what I was
> trying to show); my concern was about the plots. As I mentioned before,
> the points are close enough to the 45 degree line (in the case of the GB2
> and Singh-Maddala, the points on the q-q plot fall exactly on the straight
> line) until approximately the value 9,000. After that, the points depart
> significantly from the 45 degree line and become a line parallel to the
> x-axis; furthermore, while the sample distribution reaches the value 10,000,
> the theoretical one reaches approximately 20,000.
>
> I think that this is a "weird" behavior of the plots or I am simply missing
> something important about the q-q plots.

Nick Cox wrote:

> You have several values clumped up near 10,000. That alone does not seem
> appropriate for any distribution that in principle is unbounded above. How
> were these numbers calculated?
>
> In addition, some scrutiny of your quantiles and a few quick calculations
> suggest that your distribution is a fair way from lognormal. It is not
> skewed or long-tailed enough given its other parameters. I haven't tried the
> other distributions named, but I suspect a similar story. (I can't tell how
> much difference pweights make to this, but I guess not much.)

On Wed, Jul 18, 2012 at 6:23 PM, Lucia R.Latino wrote:

>> Thanks for your answer. Here are the details of my variable. I
>> hope they make it easier to give me some feedback.
>>
>>  -su dec_ae, d-
>> -------------------------------------------------------------
>>       Percentiles      Smallest
>>  1%     838.9864        11.1115
>>  5%     1402.251       38.77081
>> 10%     1733.309       112.4597       Obs               11183
>> 25%     2352.013       116.3163       Sum of Wgt.       11183
>>
>> 50%     3209.503                      Mean            3518.48
>>                         Largest       Std. Dev.      1648.996
>> 75%      4355.16       9948.422
>> 90%     5793.742       9952.207       Variance        2719189
>> 95%     6790.232         9981.6       Skewness       1.017932
>> 99%     8768.935       9992.487       Kurtosis       4.138746
>> -------------------------------------------------------------

Nick Cox wrote:

>> These programs are in package -qpfit- on SSC.
>>
>> The word "problem" here is ambiguous. My bias is to guess that your
>> data don't follow any of these distributions very well and the graphs
>> are telling you that. -su dec_ae, detail- would tell us a bit more.
>>
>> Nick
>>
>> On Wed, Jul 18, 2012 at 5:13 PM, Lucia Latino
>> <[email protected]>
>> wrote:
>>
>>> I am having some problems with the q-q plots for Dagum, gb2,
>>> lognormal and Singh-Maddala distributions using programs written by
>>> Nick Cox.
>>>
>>> After having fit the distribution (e.g. lognfit dec_ae, svy), I run
>>> the command for the q-q plot (e.g. qlogn dec_ae [pweight=iwght]).
>>>
>>> I repeat the same procedure for the other distributions (Dagum, gb2
>>> and Singh-Maddala). All the plots show a strange behavior: in all the
>>> q-q plots, the points follow a strongly nonlinear pattern. At the
>>> beginning they follow the 45 degree line, then they depart
>>> significantly from it and become flat around the value 10,000,
>>> which is the max value for dec_ae.
>>>
>>> What does it mean? Why does the theoretical distribution take values
>>> higher than 10,000?
>>>
>>> I hope I was clear enough. I wish I could show you the plots, but I
>>> understand that I cannot attach them.
```
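
The thread notes that fitting a truncated lognormal would have to be programmed separately. One possible shortcut, offered here only as a sketch and not as anything proposed in the thread: a lognormal truncated above at 10,000 is a normal truncated above at ln(10000) on the log scale, so official Stata's -truncreg- with no covariates may serve as a stand-in. The variable name `lny` and the local macro `cut` are made up for illustration; the simulated data repeat the example in the reply.

```stata
* Sketch only: truncated-lognormal fit via -truncreg- on the log scale.
* Assumption: -truncreg- (official Stata) is used in place of -lognfit-,
* which is not how the thread fits the model.
clear
set seed 2803
set obs 1000
generate double y = exp(rnormal(7, 1))     // true mu = 7, sigma = 1
generate double lny = ln(y)                // hypothetical variable name

* Naive estimates: mean and SD of logs after chopping values >= 10,000
summarize lny if y < 10000

* Constant-only truncated-normal fit, upper truncation point ln(10000)
local cut = ln(10000)
truncreg lny if y < 10000, ul(`cut')

* _cons estimates mu; e(sigma) holds the estimate of sigma
display "mu = " _b[_cons] "   sigma = " e(sigma)
```

The naive mean and SD of the logs should be pulled below the true values because the upper tail is missing, whereas the truncated fit allows for the cutoff. Quantile plots for the truncated fit would still need quantiles of the truncated distribution, not of the unbounded lognormal.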
