The Stata listserver

RE: st: RE: Histograms


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: RE: Histograms
Date   Mon, 2 Jun 2003 19:22:18 +0100

An initial question on histograms led to a (to me, surprisingly)
vigorous thread. Here is a personal, partial summary, with some 
new material. 

The principle behind histograms is that the area of each 
bar represents the fraction of a frequency (probability) 
distribution within each interval (bin, class). This is 
standard. It is not part of the definition that all 
intervals have the same length. Yet in practice most 
histograms produced or published do have equal length 
bins. Users of official Stata in particular have been offered 
only options to tune the number of bins (Stata <= 7) and/or the 
(constant) width of bins (Stata 8). In short, official Stata does 
not, apparently, allow bins of unequal length. 

Why? Various arguments for and against unequal bins may be identified. 

1. Consistency argument. The choice of bin width 
is often a little arbitrary. One important special 
case is a discrete variable, for which a width of 
1 is often the obvious and natural choice. Even then, 
discrete variables may require some choice of interval. 
If the variable is number of lifetime sexual partners, 
then the tail (apparently) stretches into very large 
numbers and some grouping may be desired. But in the 
case of continuous variables especially, there is 
certainly arbitrariness. Many statistically-minded 
people are most reluctant to compound this by varying 
the length of the intervals. Doing so complicates 
the interpretation of the histogram, it may be said, 
because the bars are no longer all produced in the same way. 
Or to put it another way, equal widths are relatively 
simple and any kind of complexity beyond them needs 
to be justified. 

2. Structure of the data argument. On the other hand, 
sometimes the data come grouped into irregular intervals 
and the researcher has little or no choice. The raw data
may be difficult or impossible to access. The Stata 
user then wants a histogram (correctly drawn, naturally). 
How can this be done?  

3. Sampling variation argument. If we regard the 
histogram as a crude estimator of the density function, 
then it might make sense to vary bin width to 
match the structure of variation, in effect varying 
how we average probability density locally. (Of 
course, another answer may be that you should try 
another kind of graph or a transformation.) 

4. Equal probability argument. There is at least one 
other way to build a histogram in a simple, systematic 
way: use quantiles equally spaced on a probability 
scale. That way, each bar represents the same area. 
Unless our data come from a uniform distribution, 
the bin lengths will inevitably be unequal. 
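As a rough illustration of argument 4, the bin limits and bar heights 
can be worked out by hand. This is a sketch only, not what any 
particular program does: the auto dataset, the variable price and the 
choice of 5 bins are all assumptions made for the example. Each bin 
holds about 1/5 of the data, so its density is (1/5) / bin length. 

```stata
* Sketch: equal-probability bins for one variable.
* Dataset, variable and number of bins are illustrative choices.
sysuse auto, clear
local k = 5
local km1 = `k' - 1

* outer limits of the first and last bins
summarize price, meanonly
local lower = r(min)
local top = r(max)

* interior limits: quantiles equally spaced on the probability scale
_pctile price, nquantiles(`k')
forvalues i = 1/`km1' {
    local q`i' = r(r`i')
}
local q`k' = `top'

* every bar has area 1/k, so height = (1/k) / bin length
forvalues i = 1/`k' {
    local dens = (1/`k') / (`q`i'' - `lower')
    display "bin `i': [`lower', `q`i'']  density = `dens'"
    local lower = `q`i''
}
```

With heavily tied data two quantiles can coincide, giving a bin of 
zero length; this sketch does not guard against that. 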

Where do we stand in terms of what we can do in 
Stata? Working backwards, 

4. Thanks to Kit Baum -- and to Vince Wiggins 
and to Marcello Pagano for comments of various 
kinds -- a -eqprhistogram- for Stata 8 is now 
downloadable from SSC. Please junk any and 
all versions you may have copied from Statalist, 
and install the SSC version: 

. ssc inst eqprhistogram 

This is not a full-blown program offering 
all the handles which might be desired, but 
more a demonstration that the thing is possible. 

When I last discussed this, I alluded to a quirk in 
the implementation of the undocumented option 
-bartype(spanning)-. This quirk turned out to be
a figment of my imagination. Vince Wiggins put me 
right on what I had overlooked. 

3 and 2. If you can work out your class limits you 
can draw a histogram in Stata 6 or Stata 7 
using -barplot- or -hist3- from SSC. -hist3- 
is more general in that it will count for you. 

In lieu of a port to Stata 8, the following 
shows what can be done once someone has told 
you the right undocumented feature. 

The data come from Snedecor and Cochran 1989 p.19
(reference in manual) and are frequencies of US cities 
in 1970 by population class, with population in thousands. 
We enter the _lower_ class limits and the frequencies and 
_one_ final upper limit as data, or -- in other cases -- 
somehow get a reduction of the data to this form. 

. list 

     +---------------------+
     | popula~n   freque~y |
     |---------------------|
  1. |      100         38 |
  2. |      125         27 |
  3. |      150         15 |
  4. |      175         11 |
  5. |      200         16 |
     |---------------------|
  6. |      300         16 |
  7. |      400          7 |
  8. |      500          8 |
  9. |      600         10 |
 10. |      800          2 |
     |---------------------|
 11. |     1000          . |
     +---------------------+

There are 150 cities in total, so the density for each bin is 
its frequency divided by (150 * bin width): 

. gen density = freq / (150 * (population[_n+1] - population)) 
(1 missing value generated)
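
A quick check on that calculation is that the bar areas should sum 
to 1. The following sketch re-enters the data above, takes the total 
from the data rather than typing 150, and adds up the areas. The 
variable name area and the local macro total are introduced here 
purely for illustration. 

```stata
* Sketch: re-enter the grouped data and check that bar areas sum to 1.
clear
input population frequency
 100 38
 125 27
 150 15
 175 11
 200 16
 300 16
 400  7
 500  8
 600 10
 800  2
1000  .
end

* total count taken from the data
quietly summarize frequency, meanonly
local total = r(sum)

* density = frequency / (total * bin width); last row is missing by design
generate double density = frequency / (`total' * (population[_n+1] - population))

* area of each bar = density * bin width
generate double area = density * (population[_n+1] - population)
quietly summarize area, meanonly
display "total frequency = `total'   total area = " r(sum)
```

The final observation carries only the last upper limit, so its 
density and area are missing and do not enter the sum. 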

and we can then draw the graph directly: 

. twoway bar density population, bartype(spanning) 

In practice you might want to add (e.g.) 

bstyle(histogram) 

and you might need to add 

yscale(range(0)) 

-- the last was the detail I overlooked in my last 
posting on this. 
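
Putting those pieces together, one call might look like the 
following. It is a sketch only: it assumes density and population 
are in memory as created above, and since -bartype(spanning)- is 
undocumented its behaviour could change between releases. 

```stata
* Assumes density and population exist as created earlier in this post.
confirm numeric variable density population

* bars span from each lower limit to the next; the y axis is forced to include 0
twoway bar density population, bartype(spanning) bstyle(histogram) yscale(range(0))
```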

Nick 
n.j.cox@durham.ac.uk 




