Thanks to Kit Baum, a package -dpplot- has been added to SSC.
This kind of plot may be as unfamiliar to you as it was to me
a short while ago. I remain agnostic on how useful it is,
but the program having been written, some others may wish to play.
A fairly long explanation is appended, all of which is in the
help file. As usual,
Stata 7 is required.
To install, type
ssc inst dpplot
in an up-to-date Stata. If that last sentence is obscure,
please consult the -findit- FAQ cited below my signature.
<start of longer explanation>
-dpplot- plots density probability plots for varname given a reference
distribution, by default normal.
To establish notation, and to fix ideas with a concrete example: consider an
observed variable Y,
whose distribution we wish to compare with a normally distributed variable
X. That variable has
density function f(X), distribution function P = F(X) and quantile function
X = Q(P). (The
distribution function and the quantile function are inverses of each other.)
Clearly, this notation
is fairly general and also covers other distributions, at least for
continuous variables.
The particular density function f(X | parameters) most pertinent to
comparison with data for Y can
be computed given values for its parameters, either estimates from data on
Y, or parameter values
chosen for some other good reason. In the case of a normal distribution,
these parameters would
usually be the mean and the standard deviation. Such density functions are
often superimposed on
histograms or other graphical displays. In Stata, -graph, histogram- has a
normal option which adds
the normal density curve corresponding to the mean and standard deviation of
the data shown.
The density function can also be computed indirectly via the quantile
function as f(Q(P)). For
example, if P were 0.5, then f(Q(0.5)) would be the density at the median.
In practice P is
calculated as so-called plotting positions p_i attached to values y_(i) of a
sample of Y of size n
which have rank i: that is, the y_(i) are the order statistics y_(1) <= ...
<= y_(n). One simple
rule uses p_i = (i - 0.5) / n. Most other rules follow one of a family (i -
a) / (n - 2a + 1)
indexed by a.
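The calculation just described can be sketched as follows, in Python rather than Stata and purely for illustration. The function names are hypothetical and are not part of -dpplot-; the standard normal reference (mean 0, SD 1) is an assumed default.

```python
from statistics import NormalDist

def plotting_positions(n, a=0.5):
    """Return p_i = (i - a) / (n - 2a + 1) for ranks i = 1, ..., n."""
    return [(i - a) / (n - 2 * a + 1) for i in range(1, n + 1)]

def density_at_quantiles(p, mean=0.0, sd=1.0):
    """Return f(Q(p_i)) for a normal reference distribution."""
    ref = NormalDist(mean, sd)
    # inv_cdf is the quantile function Q; pdf is the density function f
    return [ref.pdf(ref.inv_cdf(pi)) for pi in p]

p = plotting_positions(5)      # a = 0.5 reduces to (i - 0.5) / n
d = density_at_quantiles(p)    # density at each implied quantile
```

With a = 0.5 the family reduces to the simple rule (i - 0.5) / n, so the middle value of a sample of 5 has p = 0.5 and its density is the density at the median.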
Plotting both f(X | parameters) and f(Q(P = p_i)), calculated using plotting
positions, against observed Y gives two curves. In our example, the first is normal by
construction and the second
would be a good estimate of a normal density if Y were truly normal with the
same parameters. In
terms of Stata functions, the two curves are based on -normden((X - mean) /
SD)- and -normden(invnorm(p_i))-. The match or mismatch between the curves allows
graphical assessment of
goodness or badness of fit. What is more, we can use experience from
distributions, as shown on histograms, dot plots or other similar displays,
in comparing or
identifying location and scale differences, skewness, tail weight, tied
values, gaps, outliers and so on.
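A rough sketch of the two curves for a small sample, again in Python for illustration only. The data are made up, the function name is hypothetical, and the fitted density is evaluated only at the observed values rather than over a fine grid of X, which is a simplification and not necessarily what -dpplot- does.

```python
from statistics import NormalDist, mean, stdev

def density_probability_curves(y):
    """Return sorted y, the fitted normal density at y, and f(Q(p_i))."""
    ys = sorted(y)                      # order statistics y_(1) <= ... <= y_(n)
    n = len(ys)
    ref = NormalDist(mean(ys), stdev(ys))   # parameters estimated from the data
    p = [(i - 0.5) / n for i in range(1, n + 1)]
    # Curve 1: f(X | parameters), here evaluated at the observed values.
    fitted = [ref.pdf(v) for v in ys]
    # Curve 2: f(Q(p_i)), to be plotted against y_(i).
    from_quantiles = [ref.pdf(ref.inv_cdf(pi)) for pi in p]
    return ys, fitted, from_quantiles

sample = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 4.2, 2.2, 3.1, 2.7]  # made-up data
ys, fitted, from_quantiles = density_probability_curves(sample)
for v, f1, f2 in zip(ys, fitted, from_quantiles):
    print(f"{v:5.2f}  {f1:7.4f}  {f2:7.4f}")
```

Both columns of densities are on the scale of Y, that is, they include the factor 1 / SD. Where the two columns diverge for a given y_(i), the data depart from the fitted normal.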
Such density probability plots were suggested by Jones and Daly (1995).
They are best seen as
special-purpose plots, like normal quantile plots and their kin, rather than
like histograms or dot plots.
Extending the discussion in Jones and Daly (1995), the advantages (+) and
limitations (-) of these plots include the following.
+1. No choices of binning or origin (cf. histograms, dot plots, etc.) or of
kernel or of degree
of smoothing (cf. density estimation) are required.
+2. Some people find them easier to interpret than quantile-quantile plots.
+3. They work well for a wide range of sample sizes. At the same time, as
with any other
method, a sample of at least moderate size is preferable (one rule of thumb
is >= 25).
+4. If X has bounded support in one or both directions, then this should be
clear on the plot.
-1. Results may be difficult to decipher if observed and reference
distributions differ in
modality. For example, if the reference distribution is unimodal but the
observed data hint at
bimodality, nevertheless f(Q(P)) must be unimodal even though f(Y) may not
be. Similarly, when
the reference distribution is exponential, then f(Q(P)) must be monotone
whatever the shape of f(Y).
-2. It may be difficult to discern subtle differences in one or both tails
of the observed and reference distributions.
-3. Comparison is of a curve with a curve: some people argue that graphical
references should where possible be linear (and ideally horizontal). (A linear reference is a
clear advantage of quantile-quantile plots.)
-4. There is no simple extension to comparison of two samples with each
other.
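Limitation -1 can be checked numerically. For an exponential reference with rate r, Q(p) = -ln(1 - p) / r, so f(Q(p)) = r (1 - p), which decreases with rank whatever the data look like. The sketch below is in Python for illustration only; the data are made up, the function name is hypothetical, and estimating the rate as 1 / mean is an assumption.

```python
import math
from statistics import mean

def exponential_density_at_quantiles(y):
    """Return f(Q(p_i)) for an exponential reference with rate 1 / mean(y)."""
    n = len(y)
    rate = 1 / mean(y)                            # assumed estimator of the rate
    p = [(i - 0.5) / n for i in range(1, n + 1)]
    q = [-math.log(1 - pi) / rate for pi in p]    # quantile function Q(p)
    # Density at each quantile; algebraically this equals rate * (1 - p).
    return [rate * math.exp(-rate * qi) for qi in q]

bimodal = [0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]  # made-up data, two clumps
dens = exponential_density_at_quantiles(bimodal)
print(all(a > b for a, b in zip(dens, dens[1:])))   # is the curve strictly decreasing?
```

The data enter only through their ranks and the estimated rate, which is why the two clumps leave no trace in the shape of f(Q(P)).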
Programmers may wish to inspect the code and add code for other
distributions. If parameters are
not estimated, then naturally their values must be supplied: the order of
parameters should seem
natural or at least conventional.
Jones, M.C. and F. Daly. 1995. Density probability plots. Communications in
Statistics, Simulation and Computation 24: 911-927.