Note: The following material is based on questions and answers that
appeared on Statalist.
How can I show scale breaks on graphs?
|
Title
|
|
Showing scale breaks on graphs
|
|
Authors
|
Nicholas J. Cox, Durham University, UK
Scott Merryman, Risk Management Agency/USDA
|
|
Date
|
November 2001;
revised July 2006
|
Stata’s graphics commands do not include facilities for a scale break in
which either the y axis or the x axis of a graph is
interrupted. The presumption is that when faced with, for example,
outliers in a dataset you will be better advised to consider a log scale by
using a yscale(log) or xscale(log) option. Alternatively,
perhaps your data would benefit from some other nonlinear transformation
before graphing. Either way, many writers on graphics discourage the use of
scale breaks as being at best awkward and at worst difficult to interpret
correctly.
Without moralizing too much on what you should or should not be
doing, we must point out another issue. Stata’s graphics, particularly
twoway graphs, are
designed to allow you to superimpose or combine graphs that are compatible.
To allow both this and scale breaks is well nigh impossible or, at least, was
judged unworthy of the effort.
Nevertheless, there are cases when a log scale is not advisable or when you
decide that a scale break is preferable anyway. Scale breaks can indeed be
simulated in Stata to some extent with various little tricks. Let’s
look at two examples.
Population change: A “break” on the x axis
Consider these population estimates from McEvedy and Jones (1978,
342–351). The variables are year (negative values denote BCE) and
estimated world population in millions.
. list year population
+-------------------+
| year popula˜n |
|-------------------|
1. | -10000 4 |
2. | -5000 5 |
3. | -4000 7 |
4. | -3000 14 |
5. | -2000 27 |
|-------------------|
6. | -1000 50 |
7. | -500 100 |
8. | -200 150 |
9. | 1 170 |
10. | 200 190 |
|-------------------|
11. | 400 190 |
12. | 500 190 |
13. | 600 200 |
14. | 700 210 |
15. | 800 220 |
|-------------------|
16. | 900 240 |
17. | 1000 265 |
18. | 1100 320 |
19. | 1200 360 |
20. | 1300 360 |
|-------------------|
21. | 1400 350 |
22. | 1500 425 |
23. | 1600 545 |
24. | 1650 545 |
25. | 1700 610 |
|-------------------|
26. | 1750 720 |
27. | 1800 900 |
28. | 1850 1200 |
29. | 1900 1625 |
30. | 1950 2500 |
+-------------------+
Let’s look at a basic graph:
. label var pop "world population, millions"
. scatter pop year, xlabel(-10000(2000)2000) ylabel(0(500)2500, angle(h)) ms(oh)
The sparsity of data for the earlier part of the record and the rapid rate
of increase in the last few centuries combine to produce a crowded
right-hand portion of the graph. Yet a log scale for year would
certainly not help here, as it would exacerbate the problem, even if we
could decide on an appropriate origin for log(year − origin).
(A log scale for population would be sensible, but that is a separate
question.)
The gap between the first two values of year of 5,000 years is almost
5/12 the range of that variable. We will show how to move the first value
closer to the rest of the values and thus simulate a scale break.
We will copy year and in the copy move the first value closer to the
rest, except that the value label will not lie. Then the graph can be drawn
with a vertical line to mark the break:
. gen Year = cond(year == -10000, -7000, year)
. label def Year -7000 "-10000"
. label val Year Year
. label var Year year
. scatter pop Year, xlabel(-7000 -5000(1000)2000, valuelabel)
> ylabel(0(500)2500, angle(oh)) xline(-6000) ms(oh)
However, value labels can be attached only to integers; see help
label. A more
general trick is that we can type something like xlabel(-7000 "-10000"
-5000(1000)2000), indicating that −7000 is really −10000. We
can do that with nonintegers also. The numerical values in the variables
have been fudged for this purpose.
Another way to simulate a scale break is to plot the values separately and
then combine them into one graph. This approach creates a visible break in
the axis, but it requires more complicated graph statements.
The first graph will be the left panel. We do not want the two panels to be
the same size, so we need to specify the fxsize() option. As we are
only plotting one point, we need to specify two labels on the x axis
but specify that one of them is an explicit blank, " ", and we need
to remove the two tick marks but add one tick mark at −10000.
. twoway scatter pop year if year < -5000, name(gr1,replace)
> xlabel(-10000 -9999 " ", labgap(*3) notick)
> xtick(-10000) fxsize(18) xtitle("") yla(, ang(h)) ms(oh)
The second graph will be the right panel. Here we need to remove the
y-axis and x-axis title.
. twoway scatter pop year if year >= -5000, name(gr2,replace)
> yscale(off) xtitle("") xlabel(-5000(1000)2000) ms(oh)
Finally, we combine the two panels into one graph and impose a common
x-axis title.
. graph combine gr1 gr2, cols(2) imargin(vsmall) ycommon
> b2title(year, size(small))
Spiky time series: Leaving gaps but showing outlier details too
To illustrate another approach, we make ourselves a sandbox to play in by
generating some spiky time series as the reciprocals of uniformly
distributed random numbers. We expect a minimum of 1 and a median of 2 but
will sometimes get some much larger numbers.
. set obs 100
. set seed 2803
. forval i = 1/4 {
2. gen y`i' = 1/uniform()
3. }
. gen x = _n
After a peek at summary statistics, we choose to chop values at 100 but show
higher values by text on the graph positioned just above that. In practice,
we may want to loop over responses, so we initialize what we show and where
we show it:
. gen high = ""
. gen High = 105
In a loop, we use
clonevar to keep the originals safe. We then replace the large
values with missing values but put their values into the string variable
high that we just initialized. In the graph command, note the
option cmissing(n) and the marker label options. Horizontal text
labels for the outliers are preferable whenever outliers are not too close
to inhibit that, but we leave them vertical here. In real data, outliers are
much more likely to be supported by values on both sides, so vertical may be
the best option here.
. quietly forval i = 1/4 {
2. clonevar temp = y`i'
3. replace temp = . if y`i' > 100
4. replace high = cond(y`i' > 100, string(y`i',"%4.1f"), "")
5. line temp x, sort cmissing(n) || scatter High x, ms(none) ///
mlabel(high) mlabpos(0) mlabangle(90) mlabsize(small) ///
legend(off) ytitle(y`i') yscale(r(. 110)) ylabel(, angle(h))
6. graph save y`i', replace
7. drop temp
8. }
Finally, we put it all together in a portfolio:
. graph combine "y1" "y2" "y3" "y4", imargin(zero)
This example is indicative, not definitive. The main point is that you can
use the basic graphics commands to simulate features that you may want, with
no low-level programming.
Reference
- McEvedy, C., and R. Jones. 1978.
- Atlas of World Population History. New York: Facts on File.
|