
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: RE: Histograms
Date: Mon, 2 Jun 2003 19:22:18 +0100

An initial question on histograms led to a (to me, surprisingly) vigorous thread. Here is a personal, partial, summary, with some new material.

The principle behind histograms is that the area of each bar represents the fraction of a frequency (probability) distribution within each interval (bin, class). This is standard. It is not part of the definition that all intervals have the same length. Yet in practice most histograms produced or published do have equal length bins. Official Stata users in particular have only been offered options to tune number of bins (Stata <= 7) and/or (constant) width of bins (Stata 8). In short, official Stata does not, apparently, allow bins representing unequal lengths.

Why? Various arguments for and against this may be identified.

1. Consistency argument. The choice of bin width is often a little arbitrary. In an important special case, the variable is discrete, in which case 1 is often the obvious and natural choice. Even then discrete variables may require some choice of interval. If the variable is number of lifetime sexual partners, then the tail (apparently) stretches into very large numbers and some grouping may be desired. But in the case of continuous variables especially, there is certainly arbitrariness. Many statistically-minded people are most reluctant to compound this by varying the length of the intervals. To do this complicates the interpretation of the histogram, it may be said, because of variations in the way the bars were produced. Or to put it another way, equal widths are relatively simple and any kind of complexity beyond them needs to be justified.

2. Structure of the data argument. On the other hand, sometimes the data come grouped into irregular intervals and the researcher has little or no choice. The raw data may be difficult or impossible to access. The Stata user then wants a histogram (correctly drawn, naturally). How can this be done?

3. Sampling variation argument. If we regard the histogram as a crude estimator of the density function, then it might make sense to vary bin width to match the structure of variation, in effect varying how we average probability density locally. (Of course, another answer may be that you should try another kind of graph or a transformation.)

4. Equal probability argument. There is at least one other way to build a histogram in a simple, systematic way: use quantiles equally spaced on a probability scale. That way, each bar represents the same area. Unless our data come from a uniform distribution, the bin lengths will inevitably be unequal.

Where do we stand in terms of what we can do in Stata? Working backwards,

4. Thanks to Kit Baum -- and to Vince Wiggins and to Marcello Pagano for comments of various kinds -- a -eqprhistogram- for Stata 8 is now downloadable from SSC. Please junk any and all versions you may have copied from Statalist, and

. ssc inst eqprhistogram

This is not a full-blown program offering all the handles which might be desired, but more a demonstration that the thing is possible.

In last discussing this, I alluded to a quirk in the implementation of the undocumented option -bartype(spanning)-. This quirk turned out to be a figment of my imagination. Vince Wiggins put me right on what I had overlooked.

3 and 2. If you can work out your class limits you can draw a histogram in Stata 6 or Stata 7 using -barplot- or -hist3- from SSC. -hist3- is more general in that it will count for you.

In lieu of a port to Stata 8, the following shows what can be done once someone has told you the right undocumented feature.

The data come from Snedecor and Cochran 1989 p.19 (reference in manual) and are frequencies of US cities with particular populations in 000 in 1970.

We enter the _lower_ class limits and the frequencies and _one_ final upper limit as data, or -- in other cases -- somehow get a reduction of the data to this form.

. list

     +---------------------+
     | popula~n   freque~y |
     |---------------------|
  1. |      100         38 |
  2. |      125         27 |
  3. |      150         15 |
  4. |      175         11 |
  5. |      200         16 |
     |---------------------|
  6. |      300         16 |
  7. |      400          7 |
  8. |      500          8 |
  9. |      600         10 |
 10. |      800          2 |
     |---------------------|
 11. |     1000          . |
     +---------------------+

There are 150 cities, so we calculate the densities

. gen density = freq / (150 * (population[_n+1] - population))
(1 missing value generated)

and we can then draw the graph directly:

. twoway bar density population, bartype(spanning)

In practice you might want to add (e.g.) bstyle(histogram) and you might need to add yscale(range(0)) -- the last was the detail I overlooked in my last posting on this.

Nick
n.j.cox@durham.ac.uk

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
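
As a check on the density calculation in the post, here is a minimal do-file sketch that is not part of the original recipe. It re-enters the listed data with -input-, recomputes the densities, and confirms that the bar areas (density times bin width) sum to 1, which is the defining property of a histogram stated at the top. The variable names width and area and the local macro total are introduced here for illustration only. The graph command simply combines the options mentioned in the post, and -bartype(spanning)- is undocumented, so its behaviour in releases other than Stata 8 is an assumption.

```stata
clear
input population frequency
 100 38
 125 27
 150 15
 175 11
 200 16
 300 16
 400  7
 500  8
 600 10
 800  2
1000  .
end

* total number of cities, computed from the data rather than typed in
quietly summarize frequency
local total = r(sum)

* bin width: next lower limit minus this one (missing in the last row,
* which only supplies the final upper limit)
gen double width = population[_n+1] - population
gen double density = frequency / (`total' * width)

* area of each bar = density * width; the areas should sum to 1
gen double area = density * width
quietly summarize area
display "total area = " r(sum)

* options as suggested in the post; bartype(spanning) is undocumented
twoway bar density population, bartype(spanning) bstyle(histogram) yscale(range(0))
```

If the displayed total differs from 1 by more than rounding error, the class limits or frequencies were probably mistyped, or a limit is missing from the middle of the data.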
