# st: -pdplot- available from SSC for Pareto dot plots

From: "Nick Cox"
Subject: st: -pdplot- available from SSC for Pareto dot plots
Date: Fri, 17 Nov 2006 11:16:14 -0000

Thanks to Kit Baum, a new module -pdplot- for Pareto dot plots is now
available from SSC. Stata 9 is required. If interested, install with
-ssc install pdplot-.

-pdplot- produces a Pareto dot plot as proposed by

Wilkinson, Leland. 2006. Revising the Pareto chart.
American Statistician 60(4): 332-334.

The frequencies of the categories of a categorical variable are shown in
order by a series of dots against a magnitude scale. As backdrop,
corresponding acceptance intervals are shown by bars.

The command is more flexible than this description of default behaviour
implies. The intervals can be suppressed and the dot plot can be
-recast()- to another kind of -twoway- plot.

Wilkinson (2006) briefly reviews Pareto charts, which commonly combine
two displays in one. Frequencies in various categories are shown by a
series of bars arranged in frequency order, from most common downwards.
A rising curve showing cumulative frequency is often superimposed on
those bars. Frequency and cumulative frequency may or may not have
consistent scales. Examples from quality management studies often show
categories of accidents, complaints, defects, failures, rejects,
returns, or other such unwelcome phenomena. Wilkinson gives several
cogent criticisms of this design and suggests an alternative: show
frequencies in order, but as a dot plot, and add as reference a series
of acceptance intervals. (Indeed, Wilkinson's paper is important
reading for anyone tempted to use Pareto charts.)

The acceptance intervals are calculated by simulation. Imagine as
benchmark a population in which k categories are equally probable, and
imagine taking samples of size n. Here k and n are the same as those in
the data under consideration. Just by chance the observed frequencies of
the k categories will typically differ. For each sample we can label the
frequencies f_(1) >= f_(2) >= ... >= f_(k-1) >= f_(k): thus f_(1) is the
frequency of the most abundant category, and so forth. Across many
simulated samples we can get the sampling distribution of each f_(j)
and use its quantiles to calculate intervals with desired coverage.
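
The simulation just described can be sketched roughly as follows. This is only an illustration in Python, not the Mata code that -pdplot- uses. The function name `acceptance_intervals`, the arguments `reps`, `coverage` and `seed`, their default values, and the crude quantile rule are all assumptions made for the example.

```python
import random


def acceptance_intervals(n, k, reps=1000, coverage=0.95, seed=12345):
    """Return (lower, upper) frequency bounds for ranks 1..k.

    Benchmark: k equally probable categories, samples of size n.
    reps, coverage and seed are illustrative choices, not pdplot defaults.
    """
    rng = random.Random(seed)
    by_rank = [[] for _ in range(k)]  # simulated f_(j) values, one list per rank

    for _ in range(reps):
        counts = [0] * k
        for _ in range(n):
            counts[rng.randrange(k)] += 1  # each category equally likely
        counts.sort(reverse=True)          # f_(1) >= f_(2) >= ... >= f_(k)
        for j, f in enumerate(counts):
            by_rank[j].append(f)

    # Crude empirical quantiles: drop the same number of simulated values
    # from each tail and keep the extremes of what remains.
    drop = int((1 - coverage) / 2 * reps)
    intervals = []
    for values in by_rank:
        values.sort()
        intervals.append((values[drop], values[reps - 1 - drop]))
    return intervals


if __name__ == "__main__":
    # Hypothetical data set: 200 observations in 6 categories.
    for rank, (lower, upper) in enumerate(acceptance_intervals(n=200, k=6), start=1):
        print(rank, lower, upper)
```

An observed frequency of rank j that falls outside the interval for rank j is then more extreme than would be expected if all k categories were equally probable.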

After seeing this graph in the latest issue of _The American
Statistician_, it seemed to me a nice project to implement
it in Stata. Graphically, it is a case of superimposing a -twoway
scatter- on a -twoway rbar-, with the opportunities that such a
choice allows for -recast()-ing to other -twoway- forms. The
underlying simulations are best done using Mata.

This program may interest those whose work brings them into
territory in which Pareto charts are used. For example, they
appear fairly common in some parts of the health sciences. However,
it is not intended as a general display for categorical frequencies.
-catplot- and -tabplot- from SSC have more pretensions
to that role. Nor does it apply if your data are proportions
or percents or measurements, rather than instances or counts of
categories.

Nick
n.j.cox@durham.ac.uk
