st: Types of variables and smarter graphics

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: Types of variables and smarter graphics
Date	Fri, 5 Jul 2002 10:08:42 +0100

(I am taking the liberty of renaming this thread,
before the number of "Re:"s gets totally ridiculous.)

Lee Sieswerda wrote:

> For example, suppose you have two binary variables and you want to graph
> their relationship. You would probably never want a graph with just four
> points in the corners, which is what you'd get in Stata as the default
> (graph var1 var2). Rather, you would almost always want a bar graph with
> the mean of the one variable (i.e., the proportion or percentage of ones)
> stratified by the two categories of the other. Not that Stata is incapable
> of creating such a graph, or that it's a lot more typing in this simple
> case, but one could imagine more complicated cases where the savings would
> be greater. If you could specify the variables as nominal/discrete/factor,
> then that information could be used by Stata programmers to select
> intelligent defaults.

Agreed, mostly. I have certainly seen users trying -graph catvar1 catvar2-
only to be puzzled by the rectangular pattern of points that ensued. And
I have done this myself many times -- and just occasionally on purpose.
The scatter plot does serve a purpose as a way of arguing through
(a) why a Pearson or Spearman correlation a student calculated is the value
it is and (b) why it is not a useful thing to compute.
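
As a rough illustration of point (a): with two 0/1 variables, the Pearson
correlation depends only on the four cell counts of the 2 X 2 table (it is
the phi coefficient), and Spearman's correlation takes the same value, because
ranking a two-valued variable only rescales it linearly. The sketch below is
in Python rather than Stata, purely to show the arithmetic; the function names
and the counts are made up for illustration.

```python
import math

def phi_from_counts(n00, n01, n10, n11):
    """Pearson correlation of two 0/1 variables, from the 2 x 2 cell counts.

    n_ij is the number of observations with var1 == i and var2 == j.
    """
    row0, row1 = n00 + n01, n10 + n11
    col0, col1 = n00 + n10, n01 + n11
    denom = math.sqrt(row0 * row1 * col0 * col1)
    if denom == 0:
        raise ValueError("a variable is constant; correlation is undefined")
    return (n11 * n00 - n10 * n01) / denom

def pearson(x, y):
    """Plain Pearson correlation, computed from the raw observations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative counts only: 30, 10, 10, 30 in the four cells.
var1 = [0] * 40 + [1] * 40
var2 = [0] * 30 + [1] * 10 + [0] * 10 + [1] * 30

r_table = phi_from_counts(30, 10, 10, 30)   # 0.5
r_raw = pearson(var1, var2)                 # same value, from the raw data
```

The scatter plot shows only four distinct points however the 80 observations
are spread over them, which is one way of seeing why the correlation is a poor
summary here: everything it uses is already in the table.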

But in general, it is difficult to resist thinking that Stata should be
smart enough to know you didn't want that!

Arguably, the original intention of -graph- was to provide if not this
degree of smartness, then something in the same direction.

Recall that -graph- by default produces a histogram when one variable is
specified and a scatter (twoway) plot when two or more variables are
specified, and that other types of -graph- must be requested explicitly.
That is, there is a default guess at what you want, and options to
override that guess. The main difference is that Stata's designers
have not shown much interest in graphics specifically for
categorical data analysis, the presumption being, I think, that
analysts are _much_ more likely to want tables and that most graphs
of categorical data are either too trivial to bother with
or too complicated to be worthwhile, with not much middle ground. And it
does seem, at least from the literature I have seen, that many
practitioners show the same preferences.

More to Lee's point, when I last worked on this -- in the
form of -fbar- on SSC, which you can use for 2 X 2 tables if desired,
and which accepts string variables -- it seemed to me that
the natural default was to show _counts_, not percents; percents
require an explicit option.

Clearly, individuals could try designing or writing -mygraph- to
codify their preferences: -mygraph- is the dream command smart enough
to look inside the variables you specify and make a very good guess
at what you want. (This needn't wait for variable types: it is easy
enough to detect the number of distinct values in a variable.)
And it needn't be a massive program, as it would be a
wrapper that decides which other graph program to call. The difficulty,
it seems to me, is deciding on the obvious, natural or most likely graph
form for all possible combinations. (My guess is that someone else's
-mygraph- which didn't match your preferences most of the time would be
a total failure for you. Trying to remember what its defaults were and when
to override them would make it too complicated to use compared
with existing commands.)
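
To make the wrapper idea concrete, here is a minimal sketch of the dispatch
logic, in Python rather than Stata and only as an illustration. Everything in
it is an assumption: the names, the cutoff of 10 distinct values, and above
all the table of defaults, which is exactly the part people would disagree
about. It returns a description of a graph rather than drawing one.

```python
def n_distinct(values):
    """Number of distinct non-missing values (None is treated as missing)."""
    return len({v for v in values if v is not None})

def classify(values, max_categories=10):
    """Crude guess at a variable's kind from its number of distinct values.

    The cutoff of 10 is arbitrary and purely illustrative.
    """
    k = n_distinct(values)
    if k <= 1:
        return "constant"
    if k == 2:
        return "binary"
    if k <= max_categories:
        return "categorical"
    return "continuous"

# Hypothetical defaults: one person's preferences, not a recommendation.
DEFAULTS = {
    ("binary", "binary"): "bar chart of the four counts",
    ("binary", "continuous"): "two quantile plots",
    ("continuous", "binary"): "two quantile plots",
    ("continuous", "continuous"): "scatter plot",
}

def choose_graph(var1, var2):
    """Pick a default graph form for two variables.

    Falls back to a scatter plot, mirroring what -graph- does
    by default with two variables.
    """
    kinds = (classify(var1), classify(var2))
    return DEFAULTS.get(kinds, "scatter plot")
```

For example, `choose_graph([0, 1, 1, 0], [1, 0, 0, 1])` selects the bar chart
of counts. The detection step is short; the `DEFAULTS` table is where the real
design argument lies.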

So, two questions:

1. (Lee's example) Given two binary variables defining a 2 X 2 table,
what is the most likely graph that you would want?

a. (Lee) a bar graph with the
mean of the one variable (i.e., the proportion or percentage of ones)
stratified by the two categories of the other.

b. (me) a bar graph showing the four counts?

c. something else? if so, what?

2. Given a binary variable and a continuous variable, what is the
most likely graph that you would want?

a. a scatter plot, values being taken literally? (Is there an underlying
sigmoid curve?)

b. two dot plots side by side, or above and below, or superimposed?

c. two histograms side by side, or above and below, or superimposed?

d. two (empirical) quantile functions?

e. two (empirical) cumulative distribution functions?

f. two (empirical) density functions?

g. a quantile-quantile plot?

h. something else? if so, what?
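
On question 1, options (a) and (b) summarise the same 2 X 2 table in
different ways, and a small sketch may make the contrast concrete. It is in
Python rather than Stata, and the function names and the ten observations are
invented for illustration only.

```python
from collections import Counter

def cell_counts(y, x):
    """Option (b): the four counts, keyed by (x value, y value)."""
    return Counter(zip(x, y))

def proportion_of_ones(y, x):
    """Option (a): mean of the 0/1 variable y within each category of x."""
    totals = Counter(x)
    ones = Counter(xi for xi, yi in zip(x, y) if yi == 1)
    return {level: ones[level] / totals[level] for level in sorted(totals)}

# Made-up data: four observations with x == 0, six with x == 1.
x = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

counts = cell_counts(y, x)        # four bars: 3, 1, 1 and 5 observations
props = proportion_of_ones(y, x)  # two bars: 0.25 and 5/6
```

The counts keep the group sizes visible but leave the comparison of
proportions to the eye; the proportions make that comparison directly but
hide how many observations lie behind each bar.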

Nick
[email protected]
