
From |
n j cox <n.j.cox@durham.ac.uk> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: RE: st: wrapping title with by option |

Date |
Fri, 20 Jul 2007 14:09:47 +0100 |

My reading of this is to agree that you can't do what Maarten wants using the -by()- option. This is, from one point of view, a limitation of -by()-. However, from most other points of view, sacrificing large amounts of precious space to separate graph headings is poor, or at least unfortunate, design, and you shouldn't want to do it.

Looking at the graph in question, produced using R, I wouldn't try to replicate it, as I think it needs a new design. I haven't read the paper, so I am just focusing on this graph in isolation.

What Maarten is trying to mimic are two titles:

Percentage of Graphs        Percentage of
and Tables Combined,        Graphs Within
by Category                 Each Category

I can pick various nits:

1. Too much use of upper case. Upper case is needed for proper names,
but there are none here. It takes up more space than lower case with
nice fonts. In any case, Too Many Capitals amount to Shouting.

2. Repetition of "Percentage".

3. "percent" [sic] would be fine to indicate units, but it should be
moved to the bottom of the graph, where there is space to spare, and
used just once. Then the "%" arbitrarily added to two axis labels
(Stata terminology) can be removed.

4. "by Category" and "Within Each Category" look superfluous. It's a
fine goal that graphs be self-explanatory, but these words don't add
anything to my understanding of the graph.

That leaves two titles, "graphs and tables" and "graphs", which should
be unproblematic in Stata.
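
A minimal sketch of how short per-panel titles can be obtained with -by()-, which uses the value labels of the by-variable as panel titles. It uses the auto data rather than the data behind the graph under discussion, and the label name -shortorigin- and its text are hypothetical.

*--------- begin example --------
sysuse auto, clear

* hypothetical short, lower-case labels (assumption: not from the paper)
label define shortorigin 0 "domestic" 1 "foreign"
label values foreign shortorigin

* each panel is titled by its value label; note("") drops
* the default "Graphs by ..." note
scatter price mpg, by(foreign, note(""))
*------- end example ------------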

I've started, so I'll finish:

5. In my view, ticks imply position on a numerical scale. The vertical
axis is categorical and the ticks just make the graph busier. Cut!

6. As with #1, the mix of upper and lower case on the vertical axis is
distracting and unnecessary.

7. The label "Summary Stats" is slangy and inappropriate in any
(international) professional journal. To my ears, it is divisive. Slang
that some people use is not preferable to proper professional language.
"summary statistics" would be better. (There's enough space given other
labels.)

8. One value, 100%, is shown as a point symbol that lies on the vertical
axis. I prefer the convention of a small offset so that all data points
lie within the plot region.

9. Most importantly, I am not clear that juxtaposed panels with different scales are the best way of allowing comparisons here. Presumably, the authors want us to compare two sets of numbers, but their format does not make that easy or effective.
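
Points 5 and 8 can be sketched in Stata as follows. This is only an illustration on the auto data, treating -rep78- as a stand-in for a categorical vertical axis; the particular margin size is an arbitrary choice.

*--------- begin example --------
sysuse auto, clear

* noticks suppresses the ticks on the categorical axis (point 5);
* plotregion(margin()) keeps extreme values off the axes (point 8)
scatter rep78 mpg, /*
*/ ylabel(1/5, noticks angle(horizontal)) /*
*/ plotregion(margin(medium))
*------- end example ------------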

R is a wonderful language, and it can produce superb graphics, including
many things that Stata cannot (yet) do easily. (The reverse will also be
true, but I don't know enough about R to be able to say what one can do
easily in Stata but one couldn't do (easily) in R.) So, I don't want to
knock R. That said, the amount of R code used for this example by the
authors is dismaying. But if I can decode one comment on the Gelman blog
correctly, theirs is not a very good example of R use.

I just glanced at the next figure, which is a mosaic plot. Mosaic plots
are a very ingenious idea, but the key issue is, as always, Do they
work? When they are easy to decode there is an even easier alternative
form and when they are difficult to decode they are not much use,
except that you are regarded as awkward or negative if you point that
out. The root idea is encoding categorical frequencies by _areas_, but
decoding areas is inefficient, as Bill Cleveland showed clearly twenty
and more years ago. Mosaic plot users seem to realise this, as they
typically colour-code different kinds of areas to try to draw attention
to what you should be noticing. Colour encoding can be even less
efficient than area encoding for showing _quantitative_ contrasts unless
handled very carefully. It may well be that I have yet to see the point,
but I find most complicated mosaic plots no more transparent than the
original tables.

In the authors' Figure 2, mosaic plots are used for showing two 2 x 2
tables. No-one knowing anything about my work could accuse me of being
against graphics, but I do suggest that such tables usually don't need
much graphical back-up. Nevertheless, simple plots such as those
produced by my -tabplot- and -tableplot- (downloadable from SSC) are an
easier alternative to mosaic plots here. In each the idea is that of a
tabular array of bars, so that categorical frequencies are encoded by
bar _heights_.
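
A minimal sketch of a -tabplot- call, assuming the package installs from SSC as stated and accepts a row variable followed by a column variable; the auto data stand in for the authors' 2 x 2 tables, and -help tabplot- should be checked for the syntax of the installed version.

*--------- begin example --------
ssc install tabplot
sysuse auto, clear

* one bar per cell of the two-way table, with bar height
* encoding the cell frequency
tabplot foreign rep78
*------- end example ------------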

Admittedly, graphics for categorical data remains a problematic area. As
with multivariate graphics, there are lots of ideas, each with
enthusiastic proponents convinced that theirs is the true path to
follow, but each failing to convince many others. Some of the ideas in
Cleveland's books remain under-used.

Cleveland, W.S. 1994. The elements of graphing data. [read first]
Cleveland, W.S. 1993. Visualizing data.

both from Hobart Press, Summit, NJ (which, like Edward Tufte's operation,
appears to exist only to publish the author's books).

Maarten Buis wrote:

--- Austin Nichols wrote:
> One option to add a second line is to use -subtitle("extra line",
> suffix)- but this is clearly not a general solution, since it adds the
> same second line to each graph. It seems that the -by()- option
> inevitably does not give one sufficient flexibility--but that option
> just automates the construction of multiple graphs that could also be
> produced separately and combined, so one general solution is to just
> do it manually. Note that -levelsof- and -foreach- are overkill here,
> but easier to extend to cases where there are more than two by-groups.

Austin, thanks for your reply. The reason I am trying to avoid
-graph combine- is that it almost never looks nice whenever the axis
labels/titles aren't equally wide. In this case I am trying to
reproduce this graph:
http://tables2graphs.com/doku.php?id=03_descriptive_statistics#figure_1 ,
so no y-labels in the second graph. You can tweak it by using the
-fxsize()- option, but that is quite fragile (you'll have to re-tweak
the graph whenever you change the y-labels or whenever you use a
different font). This is undesirable since this is intended as a
code example that others might be able to use on their own data. The
-by()- option automatically takes care of this problem, as can be
seen in the example below.

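*--------- begin example --------
sysuse auto, clear

* two separate graphs; the second drops its y-axis labels and title
scatter pri mpg if for==0, /*
*/ name(dom, replace)
scatter pri mpg if for==1, /*
*/ name(for, replace) /*
*/ ylab(none) ytitle("")

* combined by hand: only the first panel carries y-axis labels
graph combine dom for, /*
*/ ycommon xcommon /*
*/ name("combined", replace)

* for comparison, the same two panels produced by -by()-
scatter pri mpg, by(for)
*------- end example ------------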
