
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: -dpplot- now on SSC
Date: Thu, 4 Jul 2002 18:42:15 +0100

Thanks to Kit Baum, a package -dpplot- has been added to SSC. This kind of plot may be as unfamiliar to you as it was to me a short while ago. I remain agnostic on how useful it is, but the program having been written, some others may wish to play. A fairly long explanation is appended, all of which is also in the help file. As usual, Stata 7 is required.

To install, type

    ssc inst dpplot

in an up-to-date Stata. If that last sentence is obscure, please consult the -findit- FAQ cited below my signature.

<start of longer explanation>

-dpplot- plots density probability plots for varname given a reference distribution, by default normal (Gaussian).

To establish notation, and to fix ideas with a concrete example: consider an observed variable Y, whose distribution we wish to compare with that of a normally distributed variable X. That variable has density function f(X), distribution function P = F(X) and quantile function X = Q(P). (The distribution function and the quantile function are inverses of each other.) Clearly, this notation is fairly general and also covers other distributions, at least for continuous variables.

The particular density function f(X | parameters) most pertinent to comparison with data for Y can be computed given values for its parameters, either estimates from data on Y, or parameter values chosen for some other good reason. In the case of a normal distribution, these parameters would usually be the mean and the standard deviation. Such density functions are often superimposed on histograms or other graphical displays. In Stata, -graph, histogram- has a normal option which adds the normal density curve corresponding to the mean and standard deviation of the data shown.

The density function can also be computed indirectly via the quantile function as f(Q(P)). For example, if P were 0.5, then f(Q(0.5)) would be the density at the median.
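
As a minimal sketch of the two routes to a density value, the following snippet estimates the mean and standard deviation from a sample, evaluates f(X | parameters) directly, and evaluates f(Q(P)) at P = 0.5. It is written in Python rather than Stata, purely for illustration: the sample values and variable names are invented, and statistics.NormalDist from the Python standard library stands in for the normal reference distribution, with its pdf and inv_cdf methods playing the roles of f and Q.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of Y; the values are invented for illustration.
y = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.3, 7.7]

# Parameters estimated from the data: mean and sample standard deviation.
m, s = mean(y), stdev(y)
ref = NormalDist(mu=m, sigma=s)  # reference normal distribution

# Direct route: f(X | parameters) evaluated at a chosen value of X (the mean).
f_at_mean = ref.pdf(m)

# Indirect route: f(Q(P)) with P = 0.5, the density at the median.
f_at_median = ref.pdf(ref.inv_cdf(0.5))

# For a normal distribution the median equals the mean, so the two agree.
print(f_at_mean, f_at_median)
```

The agreement of the two printed values is specific to this choice of X and P; in general the direct route is evaluated at observed values and the indirect route at probabilities.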
In practice P is calculated as so-called plotting positions p_i attached to values y_(i) of a sample of Y of size n which have rank i: that is, the y_(i) are the order statistics y_(1) <= ... <= y_(n). One simple rule uses p_i = (i - 0.5) / n. Most other rules belong to the family (i - a) / (n - 2a + 1) indexed by a, of which the simple rule is the case a = 0.5.

Plotting both f(X | parameters) and f(Q(P = p_i)), calculated using plotting positions, versus observed Y gives two curves. In our example, the first is normal by construction and the second would be a good estimate of a normal density if Y were truly normal with the same parameters. In terms of Stata functions, the two curves are based on -normden((X - mean) / SD)- and -normden(invnorm(p_i))-.

The match or mismatch between the curves allows graphical assessment of goodness or badness of fit. What is more, we can use experience from comparing frequency distributions, as shown on histograms, dot plots or other similar displays, in comparing or identifying location and scale differences, skewness, tail weight, tied values, gaps, outliers and so forth.

Such density probability plots were suggested by Jones and Daly (1995). They are best seen as special-purpose plots, like normal quantile plots and their kin, rather than general-purpose plots, like histograms or dot plots.

Extending the discussion in Jones and Daly (1995), the advantages (+) and limitations (-) of these plots include

+1. No choices of binning or origin (cf. histograms, dot plots, etc.) or of kernel or of degree of smoothing (cf. density estimation) are required.

+2. Some people find them easier to interpret than quantile-quantile plots.

+3. They work well for a wide range of sample sizes. At the same time, as with any other method, a sample of at least moderate size is preferable (one rule of thumb is >= 25).

+4. If X has bounded support in one or both directions, then this should be clear on the plot.

-1.
Results may be difficult to decipher if observed and reference distributions differ in modality. For example, if the reference distribution is unimodal, then f(Q(P)) must be unimodal even when the observed data hint at bimodality, so that f(Y) may not be unimodal. Similarly, when the reference distribution is exponential, then f(Q(P)) must be monotone decreasing whatever the shape of f(Y).

-2. It may be difficult to discern subtle differences in one or both tails of the observed and reference distributions.

-3. Comparison is of a curve with a curve: some people argue that graphical references should where possible be linear (and ideally horizontal). (A linear reference is a clear advantage of quantile plots.)

-4. There is no simple extension to comparison of two samples with each other.

Programmers may wish to inspect the code and add code for other distributions.

If parameters are not estimated, then naturally their values must be supplied: the order of parameters should seem natural or at least conventional.

Jones, M.C. and F. Daly. 1995. Density probability plots. Communications in Statistics, Simulation and Computation 24: 911-927.

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
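
The construction described in the longer explanation (plotting positions, then the two density curves for a normal reference) can be sketched roughly as follows. This is an illustrative Python sketch and not the -dpplot- code: the function names plotting_positions and dpplot_curves and the sample data are invented here, and the reference parameters are assumed to be the mean and standard deviation estimated from the sample.

```python
from statistics import NormalDist, mean, stdev

def plotting_positions(n, a=0.5):
    """Return p_i = (i - a) / (n - 2a + 1) for ranks i = 1, ..., n."""
    return [(i - a) / (n - 2 * a + 1) for i in range(1, n + 1)]

def dpplot_curves(y, a=0.5):
    """Return the order statistics and the two density curves (hypothetical helper)."""
    ys = sorted(y)  # order statistics y_(1) <= ... <= y_(n)
    ref = NormalDist(mean(ys), stdev(ys))  # normal reference, estimated parameters
    p = plotting_positions(len(ys), a)
    direct = [ref.pdf(v) for v in ys]               # f(Y | parameters)
    via_q = [ref.pdf(ref.inv_cdf(pi)) for pi in p]  # f(Q(p_i))
    return ys, direct, via_q

# Hypothetical sample; the values are invented for illustration.
y = [5.2, 2.1, 7.7, 3.9, 4.4, 6.3, 3.4, 5.0]
ys, direct, via_q = dpplot_curves(y)
for row in zip(ys, direct, via_q):
    print("%6.2f  %8.5f  %8.5f" % row)
```

Plotting direct and via_q against ys would give the two curves to be compared. Because the normal density is unimodal, via_q rises and then falls whatever the data look like, which is the behaviour noted under limitation -1.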

