
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: Types of variables and smarter graphics
Date: Fri, 5 Jul 2002 10:08:42 +0100

(I am taking the liberty of renaming this thread, before the number of "Re:"s gets totally ridiculous.)

Lee Sieswerda

> For example, suppose you have two binary variables and you want to
> graph their relationship. You would probably never want a graph with
> just four points in the corners, which is what you'd get in Stata as
> the default (graph var1 var2). Rather, you would almost always want a
> bar graph with the mean of the one variable (i.e., the proportion or
> percentage of ones) stratified by the two categories of the other.
> Not that Stata is incapable of creating such a graph, or that it's a
> lot more typing in this simple case, but one could imagine more
> complicated cases where the savings would be greater. If you could
> specify the variables as nominal/discrete/factor, then that
> information could be used by Stata programmers to select intelligent
> defaults.

Agreed, mostly. I have certainly seen users trying -graph catvar1 catvar2- only to be puzzled by the rectangular point pattern which ensued. And I have done this myself many times -- and just occasionally on purpose. It does serve a purpose when this scatter plot is a way of arguing through (a) why a Pearson or Spearman correlation a student calculated is the value it is and (b) why it is not a useful thing to compute. But in general, it is difficult to resist thinking that Stata should be smart enough to know you didn't want that!

Arguably, the original intention of -graph- was to provide if not this degree of smartness, then something in the same direction. Recall that -graph- by default is a histogram with one variable called, and a scatter (twoway) plot with two or more variables called, and that other types of -graph- must be specified explicitly. That is, there is a default guess at what you want, and options to override that guess.
The main difference is that Stata's designers have not shown much interest in graphics specifically for categorical data analysis, the presumption being, I think, that analysts are _much_ more likely to want tables and that most graphs of categorical data are either too trivial to bother with or too complicated to be worthwhile, with not much middle ground. And it does seem, at least from the literature I have seen, that many practitioners show the same preferences.

More to Lee's point, when I last worked at this -- in the form of -fbar- on SSC, which you can use for 2 X 2 tables if desired and the variables can be string -- it seemed to me that the natural default was to show _counts_, not percents, the second of which requires an explicit option.

Clearly, individuals could try designing or writing -mygraph- to codify their preferences: -mygraph- is the dream command smart enough to look inside the variables you specify and make a very good guess at what you want. (This needn't wait for variable types: it is easy enough to detect the number of distinct values in a variable.) And that needn't necessarily be a massive program, as it would be a wrapper which decided which other graph program to call.

The difficulty, it seems to me, is deciding on the obvious, natural or most likely graph form for all possible combinations. (My guess is that someone else's -mygraph- which didn't match your preferences most of the time would be a total failure for you. Trying to remember what its defaults were and when to override them would make it too complicated to use compared with existing commands.)

So, two questions:

1. (Lee's example) Given two binary variables defining a 2 X 2 table, what is the most likely graph that you would want?

   a. (Lee) a bar graph with the mean of the one variable (i.e., the proportion or percentage of ones) stratified by the two categories of the other.
   b. (me) a bar graph showing the four counts?
   c. something else? if so, what?

2. Given a binary variable and a continuous variable, what is the most likely graph that you would want?

   a. a scatter plot, values being taken literally? (Is there an underlying sigmoid curve?)
   b. two dot plots side by side, or above and below, or superimposed?
   c. two histograms side by side, or above and below, or superimposed?
   d. two (empirical) quantile functions?
   e. two (empirical) cumulative distribution functions?
   f. two (empirical) density functions?
   g. a quantile-quantile plot?
   h. something else? if so, what?

Nick
n.j.cox@durham.ac.uk
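
The wrapper idea in the message (count the distinct values in each variable, then decide which graph program to call) can be sketched in a few lines. What follows is only an illustration in Python, not Stata code and not anything proposed in the thread: the names `n_distinct`, `kind`, `guess_graph` and the threshold `MAX_CATEGORIES` are hypothetical, and the graph chosen for each combination is an arbitrary placeholder, since which form is "most likely" is exactly what the two questions leave open.

```python
from typing import Hashable, Optional, Sequence

# Hypothetical cutoff: a variable with at most this many distinct values is
# treated as categorical. The message does not suggest any particular number.
MAX_CATEGORIES = 10


def n_distinct(values: Sequence[Hashable]) -> int:
    """Count distinct non-missing values (None stands in for missing)."""
    return len({v for v in values if v is not None})


def kind(values: Sequence[Hashable]) -> str:
    """Classify a variable from its number of distinct values alone."""
    k = n_distinct(values)
    if k == 2:
        return "binary"
    if k <= MAX_CATEGORIES:
        return "categorical"
    return "continuous"


def guess_graph(x: Sequence[Hashable], y: Sequence[Hashable],
                override: Optional[str] = None) -> str:
    """Return the name of a graph form: a default guess, or the override."""
    if override is not None:
        return override  # the user always has the last word
    kinds = sorted([kind(x), kind(y)])
    if kinds == ["binary", "binary"]:
        return "bar chart of counts"      # placeholder: option 1b
    if kinds == ["binary", "continuous"]:
        return "paired quantile plots"    # placeholder: option 2d
    if "continuous" not in kinds:
        return "bar chart of counts"      # placeholder for other categorical pairs
    return "scatter plot"                 # fallback, as with two-variable -graph-


if __name__ == "__main__":
    smoker = ["yes", "no", "no", "yes", "no"]   # string variables work too
    case = [1, 0, 0, 1, 1]
    print(guess_graph(smoker, case))
```

A real wrapper would call an actual graph command where this sketch returns a name. The lookup itself is short; the hard part, as the message argues, is agreeing on what each branch should return.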

**References**:
- **st: RE: RE: RE: RE: RE: RE: Re: RE: may not use time-series operators on string variables** *From:* Lee Sieswerda <Lee.Sieswerda@tbdhu.com>

