
st: -dpplot- now on SSC


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: -dpplot- now on SSC
Date   Thu, 4 Jul 2002 18:42:15 +0100

Thanks to Kit Baum, a package -dpplot- has been added to SSC.
This kind of plot may be as unfamiliar to you as it was to me
a short while ago. I remain agnostic on how useful it is,
but the program having been written, some others may wish to play.

A fairly long explanation is appended, which is all in the
help file; but as usual,

Stata 7 is required.

To install, type

ssc inst dpplot

in an up-to-date Stata. If that last sentence is obscure,
please consult the -findit- FAQ cited below my signature.

<start of longer explanation>

-dpplot- plots density probability plots for varname given a reference
distribution, by default normal (Gaussian).

To establish notation, and to fix ideas with a concrete example:
consider an observed variable Y, whose distribution we wish to compare
with a normally distributed variable X. That variable has density
function f(X), distribution function P = F(X) and quantile function
X = Q(P). (The distribution function and the quantile function are
inverses of each other.) Clearly, this notation is fairly general and
also covers other distributions, at least for continuous variables.
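
The notation above can be illustrated with a minimal sketch in Python
(not Stata, and not the -dpplot- code). It assumes a standard normal
reference distribution, chosen only for illustration, and uses the
standard library's statistics.NormalDist; the names f, F and Q mirror
the notation in the text.

```python
from statistics import NormalDist

# Reference distribution X: standard normal (illustrative assumption).
X = NormalDist(mu=0.0, sigma=1.0)

f = X.pdf       # density function f(X)
F = X.cdf       # distribution function P = F(X)
Q = X.inv_cdf   # quantile function X = Q(P)

# The distribution and quantile functions are inverses of each other,
# so F(Q(p)) recovers p up to floating-point error.
round_trip_errors = [abs(F(Q(p)) - p) for p in (0.1, 0.25, 0.5, 0.9)]
```

Each entry of round_trip_errors should be negligibly small.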

The particular density function f(X | parameters) most pertinent to
comparison with data for Y can be computed given values for its
parameters, either estimates from data on Y, or parameter values chosen
for some other good reason. In the case of a normal distribution, these
parameters would usually be the mean and the standard deviation. Such
density functions are often superimposed on histograms or other
graphical displays. In Stata, -graph, histogram- has a normal option
which adds the normal density curve corresponding to the mean and
standard deviation of the data shown.

The density function can also be computed indirectly via the quantile
function as f(Q(P)). For example, if P were 0.5, then f(Q(0.5)) would be
the density at the median. In practice P is calculated as so-called
plotting positions p_i attached to values y_(i) of a sample of Y of size
n which have rank i: that is, the y_(i) are the order statistics
y_(1) <= ... <= y_(n). One simple rule uses p_i = (i - 0.5) / n. Most
other rules follow one of a family (i - a) / (n - 2a + 1) indexed by a.
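
The plotting-position rules just described can be sketched in a few
lines of Python (again an illustration, not the -dpplot- code). The
function name plotting_positions and the sample values are made up for
this example. Note that a = 0.5 in the family gives back the simple rule
(i - 0.5) / n.

```python
def plotting_positions(n, a=0.5):
    """Return p_i = (i - a) / (n - 2a + 1) for ranks i = 1, ..., n."""
    return [(i - a) / (n - 2 * a + 1) for i in range(1, n + 1)]

y = [2.3, 0.7, 1.9, 3.1, 1.2]    # made-up sample of Y
y_sorted = sorted(y)             # order statistics y_(1) <= ... <= y_(n)
p = plotting_positions(len(y))   # a = 0.5: same as (i - 0.5) / n

# Pair each order statistic with its plotting position.
pairs = list(zip(y_sorted, p))
```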

Plotting both f(X | parameters) and f(Q(P = p_i)), calculated using
plotting positions, versus observed Y gives two curves. In our example,
the first is normal by construction and the second would be a good
estimate of a normal density if Y were truly normal with the same
parameters. In terms of Stata functions, the two curves are based on
-normden((X - mean) / SD)- and -normden(invnorm(p_i))-. The match or
mismatch between the curves allows graphical assessment of goodness or
badness of fit. What is more, we can use experience from comparing
frequency distributions, as shown on histograms, dot plots or other
similar displays, in comparing or identifying location and scale
differences, skewness, tail weight, tied values, gaps, outliers and so
forth.
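
As a rough sketch of how the two curves might be computed, here is a
Python illustration (not the -dpplot- code, which is written in Stata).
The data are made up, the parameters are assumed to be estimated from
the sample, and the rule p_i = (i - 0.5) / n is assumed. Both curves
here are full densities, so each carries the same 1/SD factor relative
to the standardized expressions in the text.

```python
from statistics import NormalDist, mean, stdev

# Made-up sample, sorted so that y[i - 1] is the order statistic y_(i).
y = sorted([4.1, 5.6, 3.8, 6.9, 5.0, 4.7, 5.3, 6.1, 4.4, 5.8])
n = len(y)

# Reference normal with mean and SD estimated from the data.
ref = NormalDist(mean(y), stdev(y))

# Plotting positions p_i = (i - 0.5) / n.
p = [(i - 0.5) / n for i in range(1, n + 1)]

# Curve 1: reference density evaluated at the observed values.
curve_ref = [ref.pdf(v) for v in y]

# Curve 2: density evaluated at the reference quantiles, f(Q(p_i)).
curve_dp = [ref.pdf(ref.inv_cdf(pi)) for pi in p]
```

Plotting curve_ref and curve_dp against y with any graphics tool then
gives the comparison described above.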

Such density probability plots were suggested by Jones and Daly (1995).
They are best seen as special-purpose plots, like normal quantile plots
and their kin, rather than general-purpose plots, like histograms or dot
plots.

Extending the discussion in Jones and Daly (1995), the advantages (+)
and limitations (-) of these plots include

+1. No choices of binning or origin (cf. histograms, dot plots, etc.) or
of kernel or of degree of smoothing (cf. density estimation) are
required.

+2. Some people find them easier to interpret than quantile-quantile plots.

+3. They work well for a wide range of sample sizes. At the same time,
as with any other method, a sample of at least moderate size is
preferable (one rule of thumb is >= 25).

+4. If X has bounded support in one or both directions, then this should
be clear on the plot.

-1. Results may be difficult to decipher if observed and reference
distributions differ in modality. For example, if the reference
distribution is unimodal, then f(Q(P)) must be unimodal even when the
observed data hint at bimodality, so that f(Y) may not be. Similarly,
when the reference distribution is exponential, then f(Q(P)) must be
monotone decreasing whatever the shape of f(Y).

-2. It may be difficult to discern subtle differences in one or both
tails of the observed and reference distributions.

-3. Comparison is of a curve with a curve: some people argue that
graphical references should where possible be linear (and ideally
horizontal). (A linear reference is a clear advantage of quantile
plots.)

-4. There is no simple extension to comparison of two samples with each
other.
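
Limitation -1 can be checked numerically for the exponential case. The
Python sketch below (not the -dpplot- code; the rate and the sample size
are arbitrary assumptions, as are the helper names) uses the exponential
density f(x) = rate * exp(-rate * x) and quantile function
Q(p) = -ln(1 - p) / rate, so that f(Q(p)) = rate * (1 - p), which falls
as p rises regardless of the data.

```python
import math

def exp_pdf(x, rate):
    """Exponential density f(x) = rate * exp(-rate * x), for x >= 0."""
    return rate * math.exp(-rate * x)

def exp_quantile(p, rate):
    """Exponential quantile function Q(p) = -ln(1 - p) / rate."""
    return -math.log(1.0 - p) / rate

n = 20        # arbitrary sample size
rate = 1.5    # arbitrary rate parameter

# Plotting positions p_i = (i - 0.5) / n; no data values are needed,
# which is the point: f(Q(p_i)) depends only on the ranks.
p = [(i - 0.5) / n for i in range(1, n + 1)]
dens = [exp_pdf(exp_quantile(pi, rate), rate) for pi in p]

# Check that the curve is strictly decreasing in rank.
is_decreasing = all(d1 > d2 for d1, d2 in zip(dens, dens[1:]))
```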

Programmers may wish to inspect the code and add code for other
distributions. If parameters are not estimated, then naturally their
values must be supplied: the order of parameters should seem natural or
at least conventional.

Jones, M.C. and F. Daly. 1995. Density probability plots.
Communications in Statistics, Simulation and Computation 24: 911-927.


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


