Note: The following material is based on questions and answers that appeared on Statalist.

How can I show scale breaks on graphs?

Title    Showing scale breaks on graphs
Authors  Nicholas J. Cox, Durham University, UK
         Scott Merryman, Risk Management Agency/USDA
Date     November 2001; revised July 2006; minor revisions March 2014

Stata’s graphics commands do not include facilities for a scale break in which either the y axis or the x axis of a graph is interrupted. The presumption is that when faced with, for example, outliers in a dataset you will be better advised to consider a log scale by using a yscale(log) or xscale(log) option. Alternatively, perhaps your data would benefit from some other nonlinear transformation before graphing. Either way, many writers on graphics discourage the use of scale breaks as being at best awkward and at worst difficult to interpret correctly.

Without moralizing too much on what you should or should not be doing, we must point out another issue. Stata’s graphics, particularly twoway graphs, are designed to allow you to superimpose or combine graphs that are compatible. To allow both this and scale breaks is well nigh impossible or, at least, was judged unworthy of the effort.

Nevertheless, there are cases when a log scale is not advisable or when you decide that a scale break is preferable anyway. Scale breaks can indeed be simulated in Stata to some extent with various little tricks. Let’s look at two examples.

Population change: A “break” on the x axis

Consider these population estimates from McEvedy and Jones (1978, 342–351). The variables are year (negative values denote BCE) and estimated world population in millions.

  . list year population

       +-------------------+
       |   year   popula~n |
       |-------------------|
    1. | -10000          4 |
    2. |  -5000          5 |
    3. |  -4000          7 |
    4. |  -3000         14 |
    5. |  -2000         27 |
       |-------------------|
    6. |  -1000         50 |
    7. |   -500        100 |
    8. |   -200        150 |
    9. |      1        170 |
   10. |    200        190 |
       |-------------------|
   11. |    400        190 |
   12. |    500        190 |
   13. |    600        200 |
   14. |    700        210 |
   15. |    800        220 |
       |-------------------|
   16. |    900        240 |
   17. |   1000        265 |
   18. |   1100        320 |
   19. |   1200        360 |
   20. |   1300        360 |
       |-------------------|
   21. |   1400        350 |
   22. |   1500        425 |
   23. |   1600        545 |
   24. |   1650        545 |
   25. |   1700        610 |
       |-------------------|
   26. |   1750        720 |
   27. |   1800        900 |
   28. |   1850       1200 |
   29. |   1900       1625 |
   30. |   1950       2500 |
       +-------------------+

Let’s look at a basic graph:

  . label var pop "world population, millions"

  . scatter pop year, xlabel(-10000(2000)2000) ylabel(0(500)2500, angle(h)) ms(oh)
[Graph: world population against year]

The sparsity of data for the earlier part of the record and the rapid rate of increase in the last few centuries combine to produce a crowded right-hand portion of the graph. Yet a log scale for year would certainly not help here, as it would exacerbate the problem, even if we could decide on an appropriate origin for log(year − origin). (A log scale for population would be sensible, but that is a separate question.)
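As a brief aside on the parenthetical remark, a log scale for population could be sketched as follows. This assumes the variables year and population listed above; the particular label values are our own choice for illustration, not part of the original example.

```stata
* assumes the dataset listed above (variables year and population)
* the ylabel() values are illustrative choices, not taken from the FAQ
scatter population year, yscale(log) ylabel(5 50 500 2500, angle(h)) ms(oh)
```

This changes only the y axis; the crowding along the x axis, which is the concern here, remains.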

The 5,000-year gap between the first two values of year is about 5/12 of the range of that variable. We will show how to move the first value closer to the rest of the values and thus simulate a scale break.
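The 5/12 figure can be checked directly from the data. The following is a minimal sketch, assuming only the variable year as listed above; it uses the results that summarize leaves in r(min) and r(max).

```stata
* assumes the variable year as listed above
quietly summarize year
display "range of year:            " r(max) - r(min)
display "share taken by first gap: " %5.3f 5000/(r(max) - r(min))
display "5/12:                     " %5.3f 5/12
```

With the values listed, the range is 11,950 years, so the share is 5000/11950, or about 0.418, against 0.417 for 5/12.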

We will copy year and, in the copy, move the first value closer to the rest. A value label on the moved value ensures that the axis still shows the true year. Then the graph can be drawn with a vertical line to mark the break:

   . gen Year = cond(year == -10000, -7000, year) 
   . label def Year -7000 "-10000" 
   . label val Year Year 
   . label var Year year
   . scatter pop Year, xlabel(-7000 -5000(1000)2000, valuelabel) 
   > ylabel(0(500)2500, angle(h)) xline(-6000) ms(oh)
[Graph: world population against year, with the first value moved and a vertical line marking the break]

However, value labels can be attached only to integers; see [D] label. A more general trick is that we can type something like xlabel(-7000 "-10000" -5000(1000)2000), indicating that −7000 is really −10000. We can do that with nonintegers also. The numerical values in the variables have been fudged for this purpose.
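The more general trick might look like this in full. It is a sketch that assumes pop and Year exist as created above; the remaining options are simply carried over from the previous graph command.

```stata
* assumes population (abbreviated pop) and Year as created above
* the text "-10000" is supplied directly in xlabel(), so no value label is needed
scatter pop Year, xlabel(-7000 "-10000" -5000(1000)2000)  ///
    ylabel(0(500)2500, angle(h)) xline(-6000) ms(oh)
```

Because the label text is given in the option itself, the same idea carries over to axis positions that are not integers.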

Another way to simulate a scale break is to plot the values separately and then combine them into one graph. This approach creates a visible break in the axis, but it requires more complicated graph statements.

The first graph will be the left panel. We do not want the two panels to be the same size, so we need to specify the fxsize() option. Because we are plotting only one point, we specify two labels on the x axis, one of which is an explicit blank, " ". We also suppress the tick marks for those two labels and add a single tick mark at −10000.

   . twoway scatter pop year if year < -5000, name(gr1,replace) 
   > xlabel(-10000 -9999 " ", labgap(*3)  notick) 
   > xtick(-10000) fxsize(18) xtitle("") yla(, ang(h))  ms(oh)

The second graph will be the right panel. Here we need to remove the y axis and the x-axis title.

   . twoway scatter pop year if year >= -5000, name(gr2,replace) 
   > yscale(off) xtitle("") xlabel(-5000(1000)2000) ms(oh)

Finally, we combine the two panels into one graph and impose a common x-axis title.

   . graph combine gr1 gr2, cols(2) imargin(vsmall) ycommon
   > b2title(year, size(small)) 

Spiky time series: Leaving gaps but showing outlier details too

To illustrate another approach, we make ourselves a sandbox to play in by generating some spiky time series as the reciprocals of uniformly distributed random numbers. We expect a minimum of 1 and a median of 2 but will sometimes get some much larger numbers.

  . set obs 100
  . set seed 2803
  . forval i = 1/4 { 
    2.      gen y`i' = 1/uniform() 
    3. }
  . gen x = _n 
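Before choosing where to chop, it helps to look at how the generated series are distributed. One possible way to do that, assuming y1 to y4 as generated above, is tabstat; the particular statistics requested are our own choice for illustration.

```stata
* assumes y1-y4 as generated above
* min and median can be compared with the expected 1 and 2
tabstat y1-y4, statistics(min p50 p95 max) columns(statistics)
```

The maxima show how far the largest spikes stand above the bulk of each series.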

After a peek at summary statistics, we choose to chop values at 100 but show higher values by text on the graph positioned just above that. In practice, we may want to loop over responses, so we initialize what we show and where we show it:

  . gen high = "" 
  . gen High = 105 

In a loop, we use clonevar to keep the originals safe. We then replace the large values with missing values but put their values into the string variable high that we just initialized. In the graph command, note the option cmissing(n) and the marker label options. Horizontal text labels for the outliers are preferable whenever the outliers are far enough apart to allow them, but we make the labels vertical here. In real data, outliers are much more likely to be supported by values on both sides, so vertical may be the best option here.

  . quietly forval i = 1/4 {
    2.     clonevar temp = y`i'
    3.     replace temp = . if y`i' > 100
    4.     replace high = cond(y`i' > 100, string(y`i',"%4.1f"), "")
    5.     line temp x, sort cmissing(n) || scatter High x, ms(none) ///
                mlabel(high) mlabpos(0) mlabangle(90) mlabsize(small) ///
                legend(off) ytitle(y`i') yscale(r(. 110)) ylabel(, angle(h))
    6.     graph save y`i', replace
    7.     drop temp
    8. }

Finally, we put it all together in a portfolio:

  . graph combine "y1" "y2" "y3" "y4", imargin(zero) 

[Graph: combined panels for y1 to y4]
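If you would rather not write .gph files to disk, a variation is to keep each panel in memory with the name() option, as in the population example, and combine the named graphs. This is a sketch of that alternative and not the approach used above; it assumes x, y1 to y4, high, and High exist as created earlier.

```stata
* assumes x, y1-y4, high, and High as created above
forvalues i = 1/4 {
    quietly {
        clonevar temp = y`i'
        replace temp = . if y`i' > 100
        replace high = cond(y`i' > 100, string(y`i', "%4.1f"), "")
    }
    * name() keeps the graph in memory instead of saving y`i'.gph
    line temp x, sort cmissing(n) || scatter High x, ms(none)  ///
        mlabel(high) mlabpos(0) mlabangle(90) mlabsize(small)  ///
        legend(off) ytitle(y`i') yscale(r(. 110))              ///
        ylabel(, angle(h)) name(y`i', replace)
    drop temp
}
graph combine y1 y2 y3 y4, imargin(zero)
```

Graph names are separate from variable names, so reusing y1 to y4 as graph names does not clash with the variables. Saving to disk, as in the original loop, has the advantage that the panels survive the end of the session.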

This example is indicative, not definitive. The main point is that you can use the basic graphics commands to simulate features that you may want, with no low-level programming.

For further discussion, see Cox (2012).

References

Cox, N. J. 2012. Transforming the time axis. Stata Journal 12: 332–341.
McEvedy, C., and R. Jones. 1978. Atlas of World Population History. New York: Facts on File.