Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date: Thu, 19 Jul 2012 09:03:51 +0100

You are unlikely to get a reasonable fit to any long-tailed distribution that is unbounded above by dropping outliers. At best you have changed the problem to fitting a truncated lognormal (or whatever) distribution, a problem that you would need to program separately.

But you can experiment yourself. Simulate a lognormal, chop high values and see whether -lognfit- and -qlogn- perform well.

clear
set seed 2803
set obs 1000
gen y = exp(rnormal(7, 1))
l y if y > 10000
su y
lognfit y if y < 10000
qlogn y if y < 10000

Even if you only drop a very few values -lognfit- just can't do a good job here. Don't blame the plots or the programs; you threw away valuable information and are seeing the consequences.

-lognfit- and -qlogn- are from SSC.

Nick

On Thu, Jul 19, 2012 at 8:39 AM, Lucia R.Latino <Latino@economia.uniroma2.it> wrote:

> Dear Nick,
>
> I dropped all the observations greater than 10,000 because I considered them
> outliers. However, even without dropping the observations, q-q plots show
> the same pattern. Also the use of the weights does not make so much
> difference, as you said.
>
> I know that the distribution is not lognormal (it is exactly what I was
> trying to show); my concern was about the plots. As I mentioned before,
> the points are close enough to the 45 degree line (in the case of the GB2
> and Singh-Maddala, the points on the q-q plot fall exactly on the straight
> line) until approximately the value 9,000. After that, the points depart
> significantly from the 45 degree line and become a line parallel to the
> x-axis; furthermore, while the sample distribution reaches value 10,000, the
> theoretical one reaches approximately value 20,000.
>
> I think that this is a "weird" behavior of the plots or I am simply missing
> something important about the q-q plots.

Nick Cox

> You have several values clumped up near 10,000. That alone does not seem
> appropriate for any distribution that in principle is unbounded above. How
> were these numbers calculated?
>
> In addition, some scrutiny of your quantiles and a few quick calculations
> suggest that your distribution is a fair way from lognormal. It is not
> skewed or long-tailed enough given its other parameters. I haven't tried the
> other distributions named, but I suspect a similar story. (I can't tell how
> much difference pweights make to this, but I guess not much.)

On Wed, Jul 18, 2012 at 6:23 PM, Lucia R.Latino

>> Thanks for your answer. Here you have the details of my variable. I
>> hope it can be more useful to give me some feedback.
>>
>> -su dec_ae, d-
>> -------------------------------------------------------------
>>       Percentiles      Smallest
>>  1%     838.9864        11.1115
>>  5%     1402.251       38.77081
>> 10%     1733.309       112.4597       Obs               11183
>> 25%     2352.013       116.3163       Sum of Wgt.       11183
>>
>> 50%     3209.503                      Mean            3518.48
>>                         Largest       Std. Dev.      1648.996
>> 75%      4355.16       9948.422
>> 90%     5793.742       9952.207       Variance        2719189
>> 95%     6790.232         9981.6       Skewness       1.017932
>> 99%     8768.935       9992.487       Kurtosis       4.138746
>> -------------------------------------------------------------

Nick Cox

>> These programs are in package -qpfit- on SSC.
>>
>> The word "problem" here is ambiguous. My bias is to guess that your
>> data don't follow any of these distributions very well and the graphs
>> are telling you that. -su dec_ae, detail- would tell us a bit more.
>>
>> Nick
>>
>> On Wed, Jul 18, 2012 at 5:13 PM, Lucia Latino
>> <Latino@economia.uniroma2.it>
>> wrote:
>>
>>> I am having some problems with the q-q plots for Dagum, gb2,
>>> lognormal and Singh-Maddala distributions using programs written by
>>> Nick Cox.
>>>
>>> After having fit the distribution (e.g. lognfit dec_ae, svy), I run
>>> the command for the q-q plot (e.g. qlogn dec_ae [pweight=iwght]).
>>>
>>> I repeat the same procedure for the other distributions (Dagum, gb2
>>> and Singh-Maddala). All the plots show a strange behavior: in all the
>>> qq-plots, the points follow a strongly nonlinear pattern. At the
>>> beginning they follow the 45 degree line, then they depart
>>> significantly from the 45 degree line and become flat around the
>>> value 10,000, which is the max value for dec_ae.
>>>
>>> What does it mean? Why does the theoretical distribution take values
>>> higher than 10,000?
>>>
>>> I hope I was clear enough. I wish I could show you the plots, but I
>>> understood I cannot attach them.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
