

From: David Hoaglin <dchoaglin@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: q-q plots, theoretical distribution with values higher than the sample's cutoff point
Date: Fri, 20 Jul 2012 09:43:21 -0400

Lucia,

It is interesting that your data do not show evidence of more than one mode. You did not mention the unit of observation for the 11,280 values of daily per capita kilocalorie consumption, but "per capita" suggests that it is some sort of geopolitical entity.

If your sample is actually heterogeneous, a mixture, I would not expect a single theoretical distribution to give a satisfactory fit. In that situation, stratifying on relevant characteristics (or combinations of characteristics) of the observational units may produce subsamples that are reasonably homogeneous. If you can do that, and you are lucky, a single family of distributions may work within stratum. Then you can ask how the parameters of the fitted distributions vary among the strata. (This approach is similar to fitting predictors. I think you said earlier that you are not doing that, but you did not explain why. With so much data, you may have interpretable patterns.)

On the g-and-h distributions, I should develop some code to fit them, but I have been using Stata for less than a year, and I haven't had time to do that. Once you have the appropriate quantiles of the data, the calculations are straightforward. A basic reference on the g-and-h distributions is Chapter 11 in the book Exploring Data Tables, Trends, and Shapes (Hoaglin, Mosteller, and Tukey, eds.), Wiley, 1985. A search on "g-and-h distributions" may turn up a number of papers, but I advise caution: some of them have misleading statements. I use g-and-h distributions for exploring and approximating shapes of distributions, but others have used them in substantial applications.

I leave it to Nick Cox to explain what he meant by "not without great cost."

David Hoaglin

On Thu, Jul 19, 2012 at 10:59 AM, Lucia R.Latino <Latino@economia.uniroma2.it> wrote:
> Nick and Maarten,
>
> I see your point. For sure I have a problem with the tail of my
> distribution. However, I don’t think that dropping what I consider outliers
> is the problem given that the q-q plot shows the same pattern before and
> after I drop observations. In the example Nick gave me, I see the
> difference. When I chop the data at 10,000 the q-q plot looks exactly like
> mine, while I get a different picture if I use all the data. It does not
> happen with my dataset.
>
> About the truncated distribution, would it still make sense if I can
> reasonably think that the value over 10,000 are outliers rather than extreme
> value of the distribution?
>
> And here, I am back to David.
> The data refers to daily per capita kilocalorie consumption. That is the
> reason why I would find really hard to believe that I can have observations
> higher than 10000 (which is already an high value). Furthermore, the
> observations I am dropping are less than 1% of the sample (11280
> observations) and there are no evidence of more than one mode.
>
> It may be true that the distributions I am trying to fit are simply not
> right for my data, but as David said: “the Q-Q plot should show systematic
> lack of fit in other parts of the data, not just at 10,000 and above”. This
> is why I wrote to statalist.
>
> I read about the g-and-h distributions, but I have never used it. Could you
> tell me if there is any package in Stata or do I need to write all the codes
> for them?
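To illustrate the remark above that the g-and-h calculations are straightforward once the quantiles are in hand, here is a minimal sketch of one common quantile-based fit. It is written in Python rather than Stata, and it is not code from this thread or from any Stata package. It assumes the usual form Y = A + B * ((exp(g*z) - 1)/g) * exp(h*z^2/2) with z standard normal, a single constant g taken as the median of the per-quantile estimates, and tail areas p = 1/4, 1/8, ..., 1/256. The function names, the interpolation rule for quantiles, and the synthetic parameter values are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist, median


def quantile(sorted_x, p):
    """Quantile by linear interpolation at plotting positions (i - 0.5)/n."""
    n = len(sorted_x)
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)  # 0-based fractional index
    lo = int(math.floor(pos))
    hi = min(lo + 1, n - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def g_and_h_quantile(z, A, B, g, h):
    """Value of the assumed g-and-h transform at standard normal quantile z."""
    core = z if abs(g) < 1e-8 else (math.exp(g * z) - 1.0) / g
    return A + B * core * math.exp(h * z * z / 2.0)


def fit_g_and_h(x, depths=range(2, 9)):
    """Estimate A, B, g, h from symmetric pairs of quantiles (assumed method)."""
    xs = sorted(x)
    A = quantile(xs, 0.5)  # location: the median
    rows = []
    for k in depths:
        p = 2.0 ** (-k)                      # tail area 1/4, 1/8, ...
        z = NormalDist().inv_cdf(p)          # negative normal quantile
        lower, upper = quantile(xs, p), quantile(xs, 1.0 - p)
        # Ratio of upper to lower half-spread equals exp(-g * z).
        g_p = -math.log((upper - A) / (A - lower)) / z
        rows.append((z, lower, upper, g_p))
    g = median([r[3] for r in rows])         # constant-g assumption

    # With g fixed, log of the rescaled spread is linear in z^2/2:
    # intercept = log(B), slope = h.
    u, v = [], []
    for z, lower, upper, _ in rows:
        if abs(g) < 1e-8:
            denom = -2.0 * z                 # limit of the expression as g -> 0
        else:
            denom = (math.exp(-g * z) - math.exp(g * z)) / g
        u.append(z * z / 2.0)
        v.append(math.log((upper - lower) / denom))
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    sxy = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    h = sxy / sxx
    B = math.exp(v_bar - h * u_bar)
    return {"A": A, "B": B, "g": g, "h": h}


if __name__ == "__main__":
    # Synthetic check with made-up parameters (not the kilocalorie data).
    n = 10001
    zs = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    data = [g_and_h_quantile(z, 2000.0, 600.0, 0.5, 0.1) for z in zs]
    print({k: round(val, 3) for k, val in fit_g_and_h(data).items()})
```

In practice it is worth plotting the per-quantile g estimates against z before settling on a constant g; if they drift systematically, a single g is a poor summary and the h step above will inherit that misfit.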

