Re: st: wrapping title with by option

From   Steven Samuels
Subject   Re: st: wrapping title with by option
Date   Fri, 20 Jul 2007 10:08:24 -0400

I have a more serious problem with the graph in question: I find it incomprehensible. The Figure Note and graph titles don't say what the numerators and denominators are, nor what is being categorized ("articles"?). I cannot tell whether the categories are mutually exclusive, although the "Other" category suggests that they are.

Perhaps the category "Summary Stats", for example, means "Articles that contain summary statistics", and the percentages in the left-hand graph are "Per cent of articles in the category that contain graphs or tables or both". Then I would guess that the right-hand numbers are "Per cent of articles in the category with graphs and tables that contain graphs". But who knows?


On Jul 20, 2007, at 9:09 AM, n j cox wrote:

My reading of this is to agree that you can't do what Maarten wants using the -by()- option. This is, from one point of view, a limitation of -by()-. However, from most other points of view, sacrificing large
amounts of precious space to separate graph headings is poor, or at least unfortunate, design, and you shouldn't want to do it.

Looking at the graph in question, produced using R, I wouldn't try to replicate it, as I think it needs a new design. I haven't read the paper, so I am just focusing on this graph in isolation.

What Maarten is trying to mimic are two titles:

Left panel:  "Percentage of Graphs and Tables Combined, by Category"
Right panel: "Percentage of Graphs Within Each Category"

I can pick various nits:

1. Too much use of upper case. Upper case is needed for proper names,
but there are none here. It takes up more space than lower case with
nice fonts. In any case, Too Many Capitals amount to Shouting.

2. Repetition of "Percentage".

3. "percent" [sic] would be fine to indicate units, but it should be
moved to the bottom of the graph, where there is space to spare, and
used just once. Then the "%" arbitrarily added to two axis labels
(Stata terminology) can be removed.

4. "by Category" and "Within Each Category" look superfluous. It's a
fine goal that graphs be self-explanatory, but these words don't add
anything to my understanding of the graph.

That leaves two titles, "graphs and tables" and "graphs", which should
be unproblematic in Stata.

I've started, so I'll finish:

5. In my view, ticks imply position on a numerical scale. The vertical
axis is categorical and the ticks just make the graph busier. Cut!

6. As with #1, the mix of upper and lower case on the vertical axis is
distracting and unnecessary.

7. The label "Summary Stats" is slangy and inappropriate in any
(international) professional journal. To my ears, it is divisive. Slang
that some people use is not preferable to proper professional language.
"summary statistics" would be better. (There's enough space given other

8. One value, 100%, is shown as a point symbol that lies on the vertical
axis. I prefer the convention of a small offset so that all data points
lie within the plot region.

9. Most importantly, I am not clear that juxtaposed panels with different scales are the best way of allowing comparisons here. Presumably, the authors want us to compare two sets of numbers, but their format does not make that easy or effective.

R is a wonderful language, and it can produce superb graphics, including
many things that Stata cannot (yet) do easily. (The reverse will also be
true, but I don't know enough about R to be able to say what one can do
easily in Stata but one couldn't do (easily) in R.) So, I don't want to
knock R. That said, the amount of R code used for this example by the
authors is dismaying. But if I can decode one comment on the Gelman blog correctly, theirs is not a very good example of R use.

I just glanced at the next figure, which is a mosaic plot. Mosaic plots
are a very ingenious idea, but the key issue is, as always, Do they work? When they are easy to decode there is an even easier alternative form and when they are difficult to decode they are not much use,
except that you are regarded as awkward or negative if you point that
out. The root idea is encoding categorical frequencies by _areas_, but
decoding areas is inefficient, as Bill Cleveland showed clearly twenty
and more years ago. Mosaic plot users seem to realise this, as they
typically colour-code different kinds of areas to try to draw attention
to what you should be noticing. Colour encoding can be even less
efficient than area encoding for showing _quantitative_ contrasts unless
handled very carefully. It may well be that I have yet to see the point,
but I find most complicated mosaic plots no more transparent than the
original tables.

In the authors' Figure 2, mosaic plots are used for showing two 2 x 2
tables. No-one knowing anything about my work could accuse me of being
against graphics, but I do suggest that such tables usually don't need
much graphical back-up. Nevertheless, simple plots such as those produced by my -tabplot- and -tableplot- (downloadable from SSC) are an easier alternative to mosaic plots here. In each the idea is that of a tabular array of bars, so that categorical frequencies are encoded by bar _heights_.
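
A minimal sketch of the -tabplot- approach, assuming the package is installed from SSC. The auto data shipped with Stata stand in for the authors' tables, which are not reproduced in this thread, so the choice of variables is purely illustrative.

```stata
* Illustration only: auto data stand in for the authors' 2 x 2 tables
ssc install tabplot          // one-time install from SSC
sysuse auto, clear

* Rows are foreign, columns are rep78; each cell of the table is drawn
* as a bar whose height encodes the cell frequency
tabplot foreign rep78
```

Because every cell is a bar on a common baseline within its row, the comparison is of heights rather than areas.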

Admittedly, graphics for categorical data remains a problematic area. As
with multivariate graphics, there are lots of ideas, each with
enthusiastic proponents convinced that theirs is the true path to
follow, but each failing to convince many others. Some of the ideas in
Cleveland's books remain under-used.

Cleveland, W.S. 1994. The elements of graphing data. [read first]

Cleveland, W.S. 1993. Visualizing data.

both from Hobart Press, Summit, NJ (which, like Edward Tufte's operation,
appears to exist only to publish the author's books).

Maarten Buis

--- Austin Nichols wrote:
> One option to add a second line is to use -subtitle("extra line",
> suffix)- but this is clearly not a general solution, since it adds the
> same second line to each graph. It seems that the -by()- option
> inevitably does not give one sufficient flexibility--but that option
> just automates the construction of multiple graphs that could also be
> produced separately and combined, so one general solution is to just
> do it manually. Note that -levelsof- and -foreach- are overkill here,
> but easier to extend to cases where there are more than two by- groups.
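
The two routes Austin describes can be sketched roughly as follows, using the auto data. This is not Austin's original code, which is not quoted here; the title text and the graph names g0, g1 and bygroups are placeholders chosen for this illustration.

```stata
sysuse auto, clear

* Route 1: -by()- with a suffixed subtitle; the same extra line
* is appended to the heading of every panel
scatter price mpg, by(foreign) subtitle("extra line", suffix)

* Route 2: build each panel separately, then combine them by hand
levelsof foreign, local(levels)
local graphs
foreach l of local levels {
    * text of the value label attached to this level of foreign
    local lbl : label (foreign) `l'
    scatter price mpg if foreign == `l', ///
        title("`lbl'" "second line for `lbl'") ///
        name(g`l', replace) nodraw
    local graphs `graphs' g`l'
}
graph combine `graphs', ycommon xcommon name(bygroups, replace)
```

Route 2 allows a different second line per panel, at the cost of taking over the layout work that -by()- otherwise does.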

Austin, thanks for your reply. The reason I am trying to avoid
-graph combine- is that it almost never looks nice whenever the axis
labels/titles aren't equally wide. In this case I am trying to
reproduce this graph: id=03_descriptive_statistics#figure_1 ,
so no y-labels in the second graph. You can tweak it by using the
-fxsize()- option, but that is quite fragile (you'll have to re-tweak the
graph whenever you change the y-labels or whenever you use a
different font). This is undesirable since this is intended as a
code example that others might be able to use on their own data. The
-by()- option automatically takes care of this problem, as can be
seen in the example below.

*--------- begin example --------
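sysuse auto, clear

* domestic cars, keeping the y-axis labels and title
scatter price mpg if foreign==0, /*
*/ name(dom, replace)
* foreign cars, with y-axis labels and title suppressed
scatter price mpg if foreign==1, /*
*/ name(for, replace) /*
*/ ylab(none) ytitle("")

* combined by hand: the panels do not line up nicely
graph combine dom for, /*
*/ ycommon xcommon /*
*/ name("combined", replace)

* -by()- takes care of the panel layout automatically
scatter price mpg, by(foreign)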

*------- end example ------------
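
The -fxsize()- tweak mentioned above might look something like the sketch below. The value 80 is an arbitrary guess, not a recommended setting, and the graph names dom_fx, for_fx and combined_fx are made up for this illustration. As noted, the value would have to be re-tuned whenever the y-labels or the font change.

```stata
sysuse auto, clear

* left panel keeps its y-axis labels and title
scatter price mpg if foreign==0, name(dom_fx, replace)

* right panel has no y-axis labels, so force it to be narrower;
* 80 (per cent of the normal width) is a guess found by trial and error
scatter price mpg if foreign==1, name(for_fx, replace) ///
    ylabel(none) ytitle("") fxsize(80)

graph combine dom_fx for_fx, ycommon xcommon name(combined_fx, replace)
```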

Steven JH Samuels
18 Cantine's Island
Saugerties, NY 12477
EFax: 208-498-7441
