Stata The Stata listserver

st: -pairplot- on SSC


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: -pairplot- on SSC
Date   Thu, 30 Jan 2003 21:53:25 -0000

As signalled on Tuesday in response to 
a question from Don Spady, I have rewritten
-vplplot-, which has been on SSC for some years, 
as -pairplot-, which uses the new graphics
in Stata 8. -vplplot- remains in place for any 
remaining users of Stata 6 or Stata 7 who might 
want to use it. 

Thanks to Kit Baum, -pairplot- is now available 
from SSC. A discussion of its rationale follows 
my signature. 

Nick 
[email protected] 

-pairplot- is a simple utility for comparing paired observations
graphically, especially when the interest lies in assessing
agreement or disagreement between measurements on the same scale. 

-pairplot- is a reworking for Stata 8's new graphics of a program
called -vplplot-, which has been in existence for some years. (In
its present and indeed last form on SSC it requires Stata 6.0.) 

The main stimulus for writing what is now -pairplot- was the
excellent paper by Don McNeil in American Statistician in 1992. 
 
In explaining the idea, consider graphs not just as statements 
	The data are ... . 
but as answers to questions 
	How far are the data ... ? 

Given two responses, say, y1 and y2, the scatter plot 
	y1 vs y2 
preserves the information on pairing (in contrast to, say, qqplots or
side-by-side dotplots, box plots, histograms, etc., which lose the
information on pairing). As is well known, the scatter plot can be
used to answer many questions. One that receives great emphasis is
	y = a + bx             ? 
or more generally 
	y = f(x)               ? 
and I will focus on this more widely discussed case -- which I will call
the regression question -- before returning to the agreement
question, which at its simplest is 
	y1 = y2                ? 
or sometimes 
	y1 = y2 + c            ? 
or sometimes 
	y1 = k * y2            ? 

A point often emphasised is that for the regression question a
scatter plot is in some ways inefficient. If we rephrase the
question as (e.g.) 
	y - (a + bx) = residual = 0 ? 
then a plot of (e.g.) 
	residual vs (a + bx) 
is in many ways more direct as an answer to the question. Three
points are of particular interest about this residual vs fitted plot
-- which remarkably seems to go no further back than the early
1960s.

1. Generally, the residual vs fitted plot does quite well in
serving two broad goals -- allowing both general patterns and
particular details to be evident, and working well at a range of
sample sizes. 

2. The quantities of most relevance for answering the question are
the residuals, which are shown directly on the vertical axis. 
 
3. There is a horizontal reference line for comparison. The eye and
brain are good at detecting departures from reference lines, and
especially good at detecting departures from the horizontal.  (The
tilted regression line has this limitation: even statistical people
who understand the theory sometimes forget when interpreting a
scatter plot that departures from a regression line must be assessed
vertically, not horizontally or orthogonally. I will not digress
here to discuss other summary lines.) 
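
As a minimal sketch of the residual vs fitted plot described above,
the commands below use the auto data shipped with Stata. The choice
of mpg and weight is an assumption made only for illustration and is
not an example from this post; the variable names fitted and residual
are likewise invented here.

```stata
* Sketch only: mpg and weight are an arbitrary illustrative pair.
sysuse auto, clear
regress mpg weight

* Built-in residual vs fitted plot, with a horizontal reference
* line at zero for comparison.
rvfplot, yline(0)

* The same plot built by hand, to show what is on each axis.
predict fitted, xb
predict residual, residuals
scatter residual fitted, yline(0)
```

The hand-built version makes the point explicit: the residuals sit on
the vertical axis and departures are judged against a horizontal line.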
 
Returning now to the scatter plot as an answer to questions like 
	y1 = y2                ?  
	y1 = y2 + c            ?  
	y1 = k * y2            ? 
my assertion is that it is an indirect answer to these questions. We
could try training ourselves to decode the vertical distances 
	y1 - y2 
	y1 - (y2 + c) 
	y1 - (k * y2) 
	log y1 - (log k + log y2) (given log scales) 
but I suggest that it would be hard work. The issue, when looking
at a scatter plot, is not just reading any individual data point
but also seeing the whole pattern of these distances, which are the
quantities of most relevance for answering the particular agreement 
question. This points up the value of showing these distances
explicitly on a plot. -pairplot- supports plots with 
	y1 and y2, linked vertically, on the y axis 
	or 
	(y1 - y2), shown vertically, on the y axis 
	or 
	(y1 / y2), shown vertically, on the y axis 
and 
	order of observations (_n) on the x axis 
	or
	a specified variable on the x axis 
	or 
	sort order on some varlist (ascending/descending) on the
	x axis 
	or 
	mean (y1 + y2) / 2 on the x axis
	or 
	geometric mean sqrt(y1 * y2) on the x axis, provided that 
	y1, y2 > 0 

Some of these graphs are well known, at least in various branches of
the literature. The plot of difference vs mean has often been
recommended (especially by Bland and Altman) in medical statistics.
The idea goes back at least as far as John Tukey ~ 1965. 
 
These plots arguably all satisfy point 1 and point 2 above and many 
satisfy point 3 above. 
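
To make the quantities concrete, here is a hedged sketch that builds
a difference vs mean plot and a ratio vs geometric mean plot by hand
with official commands. It is not the code of -pairplot- itself. The
data are simulated, and the names y1, y2, difference, pairmean, ratio
and gmean are all assumptions for illustration. rnormal() needs a
more recent Stata than Stata 8.

```stata
* Simulated paired measurements on the same scale (illustrative only).
clear
set obs 50
set seed 2003
generate y1 = rnormal(10, 2)
generate y2 = y1 + rnormal(0, 0.5)

* Difference vs mean: the reference line for y1 = y2 is at zero.
generate difference = y1 - y2
generate pairmean = (y1 + y2) / 2
scatter difference pairmean, yline(0)

* Ratio vs geometric mean: the reference line for y1 = y2 is at one.
* Meaningful only when y1, y2 > 0; otherwise gmean is missing.
generate ratio = y1 / y2
generate gmean = sqrt(y1 * y2)
scatter ratio gmean, yline(1)
```

In both plots the quantity that answers the agreement question is on
the vertical axis and the reference line is horizontal.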

One example which may be of interest comes from the auto data. I
looked at the relationship between length and turn, first putting 
them in the same units: 

	. sysuse auto 
	. replace length = length / 12 
	. label var length "Length (ft.)" 
	
	. pairplot turn length, ratio 

shows that most values of the ratio are near 2.5, with one car very
much lower. This is made more obvious by 

	. pairplot turn length, ratio base(2.5) 

Adding an extra option 

	. pairplot turn length, ratio base(2.5) mlabel(make) 

makes it clear that the Chevrolet Malibu has a very low ratio of
turn to length; experts may be able to comment.  
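
As a rough numerical cross-check on what the plot shows, the ratio
can also be computed and listed directly. This is a sketch using only
official commands on a fresh copy of the auto data; the variable name
ratio is an assumption introduced here.

```stata
* Fresh copy of the auto data: length is in inches, turn in feet.
sysuse auto, clear

* Ratio of turn circle to length, with length converted to feet.
generate ratio = turn / (length / 12)

* List the five cars with the lowest ratios.
sort ratio
list make turn length ratio in 1/5
```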
	


