Thanks to Kit Baum, a new package -fractileplot- is now available from
SSC.
Stata 8 is required.
Anyone who finds any of the detail in this posting of interest or of
possible use is reassured that, apart from the chat, all the serious
stuff is embedded in the help file.
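For anyone who has not installed from SSC before, the usual route is
sketched here, assuming a net-aware Stata that is up-to-date enough to
include -ssc-:

```stata
* install the package from SSC, then read its help file
ssc install fractileplot
help fractileplot
```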
Two questions for members arise towards the end of the posting.
The immediate stimulus to writing this was a question from Min
(tang5@purdue.edu) last week. I had two attempts at the question and
posted some code. Min's question prompted me to look again at some work
by Mahalanobis (1960, 1961) which I understand to be the original source
of the technique referred to. My longstanding impression was that,
whatever its merits, Mahalanobis' method had never caught on, and it
didn't seem attractive enough to try to understand thoroughly and
implement in a program. (Or vice versa: if you ever want to understand
something more thoroughly, writing a program to implement it is a good
way.)
That impression was wrong. Some rummaging around showed a recent upsurge of interest
in the main idea re-interpreted in a more modern way. Independently of
that, it is quite an interesting graphical technique.
At its simplest, no-one needs a new program, as a few lines of official
Stata suffice. A basic recipe is:

    gen touse = (y < .) & (x < .)
    egen abscissa = rank(x) if touse
    count if touse
    replace abscissa = (abscissa - 0.5) / r(N)
    lowess y abscissa, xti("fraction of data")

so that you just need to plug in a y and an x (both numeric). The
-touse- stuff ensures that we work only with non-missing values on both
y and x.
The plotting position (rank - 0.5) / n is not the only
possibility. A more leisurely discussion of this rule and alternatives
is accessible at http://www.stata.com/support/faqs/stat/pcrank.html.
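As one illustration of the alternatives, a widely used family of
plotting positions is (rank - a) / (n - 2a + 1) for some constant a.
The sketch below is only that: the auto data, the variable weight and
the names rk and pp are assumptions made for the example, not anything
that -fractileplot- requires.

```stata
* Sketch: plotting positions (rank - a) / (n - 2a + 1) for a chosen a.
* a = 0.5 reproduces (rank - 0.5) / n; a = 0 gives rank / (n + 1).
sysuse auto, clear
local a = 0.5
egen rk = rank(weight)
count if weight < .
gen pp = (rk - `a') / (r(N) - 2 * `a' + 1)
* every plotting position should lie strictly between 0 and 1
summarize pp
```

Changing the local macro a and rerunning shows how little the choice
matters except in very small samples.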
More importantly, the use of -lowess- is also just one choice, and any
reasonable nonparametric regression method will do. In -fractileplot-
smoothing may optionally be done with -locpoly- (Gutierrez, Linhart and
Pitblado 2003; -search locpoly- for information), so long as that has
been installed.
The "fractile plot", in this interpretation, is a smoothing of some
response with respect to the distribution function of a predictor. Some
may feel happier regarding this as equivalent to a smoothing with
respect to the ranks of that predictor. You can read off, for example,
the smooth of y at the median of x, the quartiles of x, etc.
Smoothing with respect to distribution functions F has various
elementary attractions. An F scale provides a common scale for variables
with different level and spread and even different units. Subject to
the occurrence of ties, values are equally spaced on the F scale and so
in good condition for smoothing. This can be especially useful when
predictors are highly skewed. F is invariant under strictly increasing
transformations, so that for example F(log x) is identical to F(x) so
long as x > 0. This can be useful when it is not clear whether
predictors should be transformed.
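The invariance is easy to check directly. In this sketch the auto data
and the variable price are illustrative assumptions; any strictly
positive numeric variable would do.

```stata
* Sketch: F(log x) is identical to F(x) when x > 0
sysuse auto, clear
gen double logprice = log(price)
egen Fprice = rank(price)
egen Flogprice = rank(logprice)
count if price < .
replace Fprice = (Fprice - 0.5) / r(N)
replace Flogprice = (Flogprice - 0.5) / r(N)
* -assert- is silent if the two sets of plotting positions agree
assert Fprice == Flogprice
```

The same check would fail for a transformation that is not strictly
increasing over the range of the data, such as squaring a variable that
takes both negative and positive values.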
Sen (2005) gives a useful recent account of kernel smoothing of
responses with respect to distribution functions of predictors. The
canonical reference is Mahalanobis (1960), which introduced the term
"fractile graphical analysis". Mahalanobis plotted means of one variable
for bins defined by selected fractiles of the other variable. Binning
and averaging now appear arbitrary and awkward, just as histograms
are inferior to density function plots for continuous responses, and
some kind of kernel-based smoothing is more appealing.
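To make the contrast concrete, a binned version in the spirit of
Mahalanobis' plots might be sketched as follows. The auto data, the
choice of 5 groups and the variable names fgroup, meanmpg and first are
all assumptions made for illustration, and this is not an attempt to
reproduce his procedure in detail.

```stata
* Sketch: means of a response within fractile groups of a predictor
sysuse auto, clear
* 5 groups defined by the quintiles of weight (an arbitrary choice)
xtile fgroup = weight, nq(5)
egen meanmpg = mean(mpg), by(fgroup)
* tag one observation per group so that each mean is plotted once
egen first = tag(fgroup)
scatter meanmpg fgroup if first, connect(l) sort xti("fractile group of weight")
```

Varying the number of groups in nq() shows how much the picture depends
on the binning, which is the arbitrariness that smoothing avoids.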
What -fractileplot- does is generalise this formulation mildly by
extending it to the smoothing of a response w.r.t. a set of predictors,
considered simultaneously and all on the scale of their distribution
functions. The approach in -fractileplot- is based on methodology for
generalised additive models (Hastie and Tibshirani 1990). I was able to
build here on the work of Patrick Royston, as reported by Royston and Cox (2005).
(More general routines treating predictors as given will follow in
due course.)
The style is exploratory, or if you prefer heuristic.
Those who want standard errors with everything should look elsewhere.
I pose two questions for those interested in this area.
First, terminology is problematic here. Terms such as "fractile graph"
(Sen 2005) and "fractile plot" (Nordhaus 2006) persist in recent
literature for modern versions of Mahalanobis' plots, even though
neither ordinate nor abscissa in the resulting graphs is a fractile.
The term "fractile" was introduced to the English literature by Hald
(1952) with the sense of "quantile", but it has never supplanted
"quantile" and is often misunderstood to mean fraction or cumulative
probability or plotting position (e.g. Nordhaus 2006). Hald used
"fractile diagram" for normal probability plots - in Stata terms, his
examples are equivalent to -qnorm- with axes reversed - and this usage
also continues in recent literature (e.g. Blæsild and Granfeldt 2003).
In the absence of an obvious alternative, customary terminology is used
within -fractileplot- under protest.
So, is anyone aware of a better term?
Second, can anyone add interesting references to those below?
Many readers will be able to access Mahalanobis (1960) via www.jstor.org.
The Nordhaus reference is also available in various forms online.
I know of Bodhisattva Sen's website at the University of Michigan
http://www.stat.lsa.umich.edu/~bodhi/
in which he refers to work in progress with Probal Chaudhuri on
fractile graphical analysis with multiple covariates.
References
Blæsild, P. and Granfeldt, J. 2003. Statistics with applications in
geology and biology. Boca Raton, FL: Chapman & Hall/CRC.
Gutierrez, R.G., Linhart, J.M. and Pitblado, J.S. 2003. From the help
desk: Local polynomial regression and Stata plugins. Stata Journal
3(4): 412-419. Software Updates: 2005a. 5(1): 139 and 2005b. 5(2): 285.
Hald, A. 1952. Statistical theory with engineering applications. New
York: John Wiley.
Hastie, T. and Tibshirani, R. 1990. Generalized additive models.
London: Chapman and Hall.
Mahalanobis, P.C. 1960. A method of fractile graphical analysis.
Econometrica 28: 325-351. Reprinted 1961. Sankhya Series A 23: 41-64.
Nordhaus, W.D. 2006. Geography and macroeconomics: new data and new
findings. Proceedings, National Academy of Sciences 103(10): 3510-3517.
Royston, P. and Cox, N.J. 2005. A multivariable scatterplot smoother.
Stata Journal 5(3): 405-412.
Sen, B. 2005. Estimation and comparison of fractile graphs using kernel
smoothing techniques. Sankhya 67: 305-334.
http://sankhya.isical.ac.in/search/67_2/2005014.pdf
Note: Sankhya should carry a bar accent on its final "a".
Nick
n.j.cox@durham.ac.uk